Wednesday, 13 July 2011
A few times a day, I get a tweet directed at me from someone I don't know, offering something that has nothing to do with anything I have recently tweeted.
The first few times that happened, a few months ago, I clicked the URL and, after being redirected via more than a few other sites (hello, pageview counters), ended up somewhere with nothing I was interested in.
Since then, I recognise the pattern: the tweet is short and unrelated, the profile picture is attractive one way or another, and the tweet always contains a URL. When I visit that tweep's timeline, several of the last 20 tweets are repeats.
Pattern established - so why can't Twitter do that automatically and block these spammers before they harass me?
Well, they can - in theory. There are a few caveats though.
First of all, the number of Twitter users and the volume of their tweets is enormous: 200 million tweets are sent daily. As a side-effect of specialising in Integration, I know by heart that there are 86,400 seconds in a day, so those 200 million work out to 2,300-and-then-some tweets per second.
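For the pedants (myself included), the arithmetic checks out - a one-liner suffices:

```python
# Back-of-the-envelope check of the average tweet rate quoted above.
tweets_per_day = 200_000_000
seconds_per_day = 24 * 60 * 60  # 86,400

average_rate = tweets_per_day / seconds_per_day
print(f"{average_rate:.0f} tweets per second on average")  # 2315
```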
Every single one of those tweets has to be handled.
First, it has to be distributed to everyone who "has a right to read". That means an either-or decision depending on the tweeter's status: public or private?
Second, ... oh wait! There is no second. Well, maybe there's a 1.5: client applications can attach extra data, like geo information, which must not be accepted if the user's profile settings forbid it. So either the next application (the one retrieving that very tweet) is checked to see what it may retrieve, or the information is only stored insofar as it is allowed.
Given the fact that these settings can be changed after the tweet is sent, I have to assume Twitter opts for the second - and that means a lot of deciding to do when tweets are retrieved.
The other tweet scenario is an @reply: a direct reply from one user to another.
Similar decisions: may, or may not? Directly related to public or private status, although the point of no return here is reached upon posting, not retrieval.
The third filter is easy, and applies to both situations: has the user sending the tweet been blacklisted by Twitter? This has to be covered on both the sending and the receiving side, as the user may have been blacklisted only moments ago and still be logged in, or his or her (well, let's say its) account may have been disabled.
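To make those three filters concrete, here is a minimal sketch of how such a delivery decision might look. Every name in it (`User`, `Tweet`, `may_deliver`, and so on) is my own invention - Twitter's actual implementation is, of course, not public:

```python
from dataclasses import dataclass, field

@dataclass
class User:
    name: str
    is_private: bool = False       # filter 1: public or private account?
    is_blacklisted: bool = False   # filter 3: banned by Twitter?
    shares_geo: bool = False       # filter 1.5: allow geo data in tweets?
    followers: set = field(default_factory=set)  # approved follower names

@dataclass
class Tweet:
    sender: User
    text: str
    geo: tuple = None  # optional client-supplied location

def may_deliver(tweet: Tweet, reader: User) -> bool:
    """Decide whether `reader` gets to see `tweet` at all."""
    # Filter 3: tweets from blacklisted senders are dropped outright.
    if tweet.sender.is_blacklisted:
        return False
    # Filter 1: a private account only reaches its approved followers.
    if tweet.sender.is_private and reader.name not in tweet.sender.followers:
        return False
    return True

def visible_geo(tweet: Tweet):
    """Filter 1.5: strip geo data the sender's settings don't allow."""
    return tweet.geo if tweet.sender.shares_geo else None
```

Note that checking at retrieval time, as sketched here, is exactly what makes late setting changes safe - at the price of running these checks 2,300+ times a second, multiplied by the number of readers.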
All other processing is unnecessary: every additional step just adds overhead.
I've lived and breathed in a million-user P2P environment, and built many anti-spam and anti-flood tools, among others, both server-side and client-side. The thing is, every extra evaluation of a text message means extra processing and, consequently, message delay. At 2,300+ tweets a second that is theoretically doable, but it would cost dearly - the variables to check differ from day to day, if not from minute to minute. The hordes of bad bots and their creators far outnumber the people you can deploy against them, let alone afford.
Oh, and by the way, those 2,300+ are an average, with regular peaks at 5,000 and offshoots up to 7,000.
So, I can happily live with the fact that Twitter effectively uses crowd-sourced content curation. Say what? Yes, that's what is preventing spammers from invading Twitter. Not from entering, but from invading - and keeping them at a great distance from taking over.
Tweeps have the ability to block, and/or report as spam, other tweeps. That's content curation in its purest form. Sure, it's strictly limited to down-voting people, but it works.
So, dear tweeps: block and report as spam every single one you think is a spammer - and Twitter will take care of the rest. How? That's the subject of an algorithm that will remain as obfuscated as their daily trends - but hey.