A detailed explanation of the scoring mechanism used by TwitBlock.
Some people have complained that they get a high spam score and point out that they are not spammers. There are a number of important things to note about this.
- This software is in alpha – these indicators and the scoring mechanisms attached to them will change.
- As the system gathers data it will rely less on heuristics and more on cross-referencing (e.g. how many people have blocked an account).
- Some of these tests are only indicators of automation, not specifically of malicious behaviour.
- The spam rating has no limit – scoring 40 may be high for a “legitimate” account, but in a list with real spammers scoring 300+ you’ll be way down at the bottom.
- If you display characteristics of a spammer then perhaps this amounts to the same thing as being a spammer. Most normal users score zero.
Roughly in order of accuracy, here are the 8 tests currently performed in the standard TwitBlock scan.
1. Ignore factor
This could also be called “inverse popularity”. If you follow 200 people and only 50 follow you back, your ignore factor is 75%. Whether or not these 50 are the same people you follow is not analysed. The cut-off for scoring is 50%. Every 1% above 50 currently yields one point.
This simple and easily calculable factor is quite accurate because it reflects real human behaviour that can be observed. An account that is clearly spam, such as an “adult” account, will have many times fewer followers than friends.
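The arithmetic can be sketched in a few lines of Python. This is only an illustration of the calculation described above, not TwitBlock's actual code: the function names are made up, and truncating to whole percentage points is an assumption.

```python
def ignore_factor(friends: int, followers: int) -> float:
    """Percentage of the people you follow who, by count alone, do not follow back."""
    if friends <= 0:
        return 0.0
    # Only the totals are compared; who actually follows whom is not checked.
    return max(0.0, (friends - followers) / friends * 100)


def ignore_points(friends: int, followers: int, cutoff: float = 50.0) -> int:
    """One point for every whole 1% above the cut-off (truncation is assumed)."""
    return max(0, int(ignore_factor(friends, followers) - cutoff))


# The example from the text: following 200, followed by 50.
example_factor = ignore_factor(200, 50)
example_points = ignore_points(200, 50)
```

With the figures from the text the factor is 75%, which is 25 points above the cut-off.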
Naturally some spammers have found ways to beat this indicator. In some cases spam accounts follow each other to build up numbers, but a more cunning technique is the “sleeper” approach. Sleeper accounts pose as real people using stolen tweets pulled from the public timeline. TwitBlock may eventually crawl Twitter looking for these accounts, so expect more about this in future posts.
2. Follow Rate
The average number of people you follow per day forms your follow rate. This is calculated as the number of people you follow divided by the number of days you’ve been on Twitter. Although it’s a crude average, it is very telling and probably the second most reliable heuristic indicator. Even if you occasionally add a hundred people in a day it’s unlikely you can keep this up, so your average will drop. Averages are generally low even for power users, so a higher value is a strong indication of automation. The current cut-off (considered normal) is 10 per day. A point is added for every follow per day above 10.
[UPDATE – Aug 12]
Many popular accounts have high follow rates due to a “following back” policy, whether automated or not. The rate at which an account is followed is now subtracted from this value. This may result in lowering spam scores of real spammers, but it also reduces the number of false positives. So now, the rate at which an account follows without reciprocation is known as the “Stalking rate”.
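As a rough sketch of the updated calculation in Python: the post does not say how the rate of being followed is measured, so this assumes it is followers divided by days on Twitter, mirroring the follow rate. The function names and the truncation to whole points are also assumptions.

```python
def follow_rate(friends: int, days_on_twitter: int) -> float:
    """Average number of accounts followed per day."""
    return friends / max(days_on_twitter, 1)


def stalking_rate(friends: int, followers: int, days_on_twitter: int) -> float:
    """Follow rate minus the rate of being followed (assumed: followers / days)."""
    days = max(days_on_twitter, 1)
    return max(0.0, friends / days - followers / days)


def stalking_points(friends: int, followers: int, days_on_twitter: int,
                    cutoff: float = 10.0) -> int:
    """One point per whole follow per day above the cut-off of 10."""
    return max(0, int(stalking_rate(friends, followers, days_on_twitter) - cutoff))


# An account 100 days old that follows 3000 and has 500 followers.
example = stalking_points(3000, 500, 100)
```

In this example the account follows 30 per day and is followed by 5 per day, a stalking rate of 25, which scores 15 points. A popular account that follows back everyone who follows it would score zero.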
3. Blocked by others
When you log into TwitBlock the system has access to your blocks and currently refreshes this list once per day until you revoke your authorization of the app. This is a key indicator that will become much more interesting as TwitBlock gathers data. Currently 5 points (reduced from 10) are applied for each block on an account.
[UPDATE – Aug 22]
Whitelisting is now used to counteract blocks.
[UPDATE – Aug 24]
Blocks are now diluted by follower count.
4. Identical profile pics
Spammers commonly reuse the same image on multiple accounts. This is particularly common with the “adult” accounts. TwitBlock crawls all the blocked accounts it knows about and stores an MD5 checksum of the profile image file. This way any account’s profile image can be cross-referenced with this database. 10 points are applied for each account known to use the same image.
This test could be easily foiled by spammers. Even using the same photo, it would be trivial to alter the checksum. So far, however, they appear not to be doing so.
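The checksum lookup might look something like the following Python sketch. The ImageIndex class and its in-memory dictionary are hypothetical stand-ins for whatever database TwitBlock really uses; only the MD5 checksum and the 10 points per matching account come from the description above.

```python
import hashlib
from collections import defaultdict


class ImageIndex:
    """Hypothetical in-memory index of profile image checksums."""

    def __init__(self) -> None:
        self._accounts_by_digest: dict[str, set[str]] = defaultdict(set)

    @staticmethod
    def digest(image_bytes: bytes) -> str:
        # MD5 is used only as a fingerprint of the file, not for security.
        return hashlib.md5(image_bytes).hexdigest()

    def add(self, account: str, image_bytes: bytes) -> None:
        """Record the image of a known blocked account."""
        self._accounts_by_digest[self.digest(image_bytes)].add(account)

    def points(self, account: str, image_bytes: bytes, per_match: int = 10) -> int:
        """10 points for each other known account with a byte-identical image."""
        known = self._accounts_by_digest.get(self.digest(image_bytes), set())
        return per_match * len(known - {account})


index = ImageIndex()
index.add("spammer_a", b"same image bytes")
index.add("spammer_b", b"same image bytes")
score = index.points("new_account", b"same image bytes")
```

The weakness mentioned above is easy to see here: changing a single byte of the file produces a different checksum, so the lookup finds nothing.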
5. Tweets via API
Status updates that are submitted without using a registered application (e.g. TweetDeck) will appear as having come “from API” (See Twitter FAQ). This is very useful, because spammers don’t want their activity to be tied to a registered application. If they start to do so then a list of known spammer applications will have to be compiled. 10 points are applied for API updates, although only the most recent tweet is analysed for performance reasons.
The points applied are deliberately low because people often give their password to applications that tweet on their behalf, for example “I just signed up to this awesome app and got 1,000 new followers”. This practice seriously needs to die out, but that’s another blog post for another day. Additionally, many spam tweets appear as “from Web”, which suggests they were posted through the public web interface.
6. Missing profile info
This is not a very reliable indicator and may be dropped. There are 4 profile fields that can be left empty: Bio, Location, URL and profile image. Most legitimate users fill in at least two of these. Currently 2 points are added if you leave all 4 empty, a drop in the ocean compared with other indicators. As I write this I realise this is due a review.
7. Username looks dodgy
For a human this is a strong indicator, but incredibly hard to implement programmatically. Currently this test performs some very crude tests on the username, such as being all numbers, having no vowels, and checking for a common format used by spammers where two words are followed by a number. Further research is required in this area, but it’s unlikely to form a reliable indicator going forward because it’s so easy to fool.
5 points (reduced from 10) are applied for a username that looks randomly generated.
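To give a feel for how crude these checks are, here is a guess at them in Python. The regular expressions are assumptions, not TwitBlock's real patterns. In particular, “two words followed by a number” is approximated as two capitalised words followed by digits, since recognising real words would need a dictionary.

```python
import re

ALL_DIGITS = re.compile(r"^\d+$")
HAS_VOWEL = re.compile(r"[aeiou]", re.IGNORECASE)
# Assumed stand-in for "two words followed by a number", e.g. "JaneSmith1234".
TWO_WORDS_NUMBER = re.compile(r"^[A-Z][a-z]+[A-Z][a-z]+\d+$")


def looks_dodgy(username: str) -> bool:
    """True if any of the three crude username checks fires."""
    if ALL_DIGITS.match(username):
        return True
    if not HAS_VOWEL.search(username):
        return True
    return bool(TWO_WORDS_NUMBER.match(username))


def username_points(username: str, points: int = 5) -> int:
    """A flat score for a suspicious-looking username (5-point default)."""
    return points if looks_dodgy(username) else 0


checked = {name: username_points(name)
           for name in ("12345678", "xqzkrtp", "JaneSmith1234", "janesmith")}
```

Of the four sample names only the last one passes. It is just as easy to see how a real user called “xkcd” would be flagged, which is why this cannot be a reliable indicator.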
8. Spammy words in bio and status
This more traditional test merely checks the bio against a list of bad words. The word list needs development and is currently not big enough to be useful. I intend to use the known blocked accounts to build a list of most common words found in spam accounts. An arbitrary score per word found is currently applied. For example “Naughty videos” yields 10 points.
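A minimal Python sketch of such a word-list check follows. The only entry in the list is the example from the text, and treating “Naughty videos” as a single phrase worth 10 points is an assumption, because the post does not say how the score is split between the two words. The names are made up for illustration.

```python
# Hypothetical term list; only this one entry is taken from the post.
SPAM_TERMS = {"naughty videos": 10}


def bio_points(bio: str, terms: dict[str, int] = SPAM_TERMS) -> int:
    """Sum the score of every listed term found in the bio."""
    # Lower-case and collapse whitespace so spacing and capitals don't matter.
    text = " ".join(bio.lower().split())
    return sum(score for term, score in terms.items() if term in text)


spam_bio_score = bio_points("Naughty   Videos every day!")
normal_bio_score = bio_points("Web developer. I like cats.")
```

Plain substring matching like this is also why the list needs careful development: a short term would match inside longer, innocent words.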
Stay tuned for updates, as all these indicators are likely to change.