A detailed explanation of the scoring mechanism used by TwitBlock.
Some people have complained that they get a high spam score and point out that they are not spammers. There are a number of important things to note about this.
- This software is in alpha – these indicators and the scoring mechanisms attached to them will change.
- As the system gathers data it will rely less on heuristics and more on cross-referencing (e.g. how many people have blocked an account)
- Some of these tests are only indicators of automation, not specifically of malicious behaviour.
- The spam rating has no limit – scoring 40 may be high for a “legitimate” account, but in a list with real spammers scoring 300+ you’ll be way down at the bottom.
- If you display characteristics of a spammer then perhaps this amounts to the same thing as being a spammer. Most normal users score zero.
Roughly in order of accuracy, here are the 8 tests currently performed in the standard TwitBlock scan.
1. Ignore factor.
This could also be called “inverse popularity”. If you follow 200 people and only 50 follow you back your ignore factor is 75%. Whether or not these 50 are the same people you follow is not analysed. The cut-off for scoring is 50%. Every 1% above 50 currently yields one point.
This simple and easily calculable factor is quite accurate because it reflects real human behaviour that can be observed. An account that is clearly spam, such as an “adult” account, will have many times fewer followers than friends.
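As a rough illustration, the ignore-factor arithmetic can be sketched in a few lines of Python. The function names are made up for this example, and truncating fractional percentages to whole points is an assumption; only the 50% cut-off and the one-point-per-1% rule come from the description above.

```python
def ignore_factor(friends: int, followers: int) -> float:
    """Percentage of the accounts you follow that is not matched by followers."""
    if friends <= 0:
        return 0.0  # following nobody: nothing to be ignored by
    # Clamp at zero so accounts with more followers than friends never go negative.
    return max(0.0, (friends - followers) / friends * 100)


def ignore_points(friends: int, followers: int, cutoff: float = 50.0) -> int:
    """One point for every whole 1% above the cut-off (truncation is assumed)."""
    return max(0, int(ignore_factor(friends, followers) - cutoff))


# The example from the text: follow 200 people, 50 follow you.
print(ignore_factor(200, 50))  # 75.0
print(ignore_points(200, 50))  # 25
```

Note that, as in the description, only the two totals are compared; whether the followers are the same people you follow is not checked.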
Naturally some spammers have found ways to beat this indicator. In some cases spam accounts follow each other to build up numbers, but a more cunning technique is the “sleeper” approach. Sleeper accounts pose as real people using stolen tweets pulled from the public timeline. TwitBlock may eventually crawl Twitter looking for these accounts, so expect more about this in future posts.
2. Follow Rate
The average number of people you follow per day forms your follow rate. This is calculated as the number of people you follow divided by the number of days you’ve been on Twitter. Although it’s a crude average, it is very telling and probably the second most reliable heuristic indicator. Even if you occasionally add a hundred people in a day it’s unlikely you can keep this up, so your average will drop. Averages are generally low even for power users, so a higher value is a strong indication of automation. The current cut-off (considered normal) is 10 per day. A point is added for every follow per day above 10.
[UPDATE – Aug 12]
Many popular accounts have high follow rates due to a “following back” policy, whether automated or not. The rate at which an account is followed is now subtracted from this value. This may result in lowering spam scores of real spammers, but it also reduces the number of false positives. So now, the rate at which an account follows without reciprocation is known as the “Stalking rate”.
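A minimal sketch of the follow rate and the adjusted “stalking rate” might look like this. The function names, the clamping at zero, the one-day minimum account age and the truncation to whole points are all assumptions made for the example; the source only gives the averages and the cut-off of 10 per day.

```python
def follow_rate(friends: int, days_on_twitter: int) -> float:
    """Average number of accounts followed per day."""
    days = max(days_on_twitter, 1)  # assumed guard for brand-new accounts
    return friends / days


def stalking_rate(friends: int, followers: int, days_on_twitter: int) -> float:
    """Follow rate minus the rate at which the account is itself followed."""
    days = max(days_on_twitter, 1)
    return max(0.0, friends / days - followers / days)


def stalking_points(friends: int, followers: int, days_on_twitter: int,
                    cutoff: float = 10.0) -> int:
    """One point per follow per day above the cut-off (truncation is assumed)."""
    return max(0, int(stalking_rate(friends, followers, days_on_twitter) - cutoff))


# 3,000 friends and 500 followers after 100 days: 30/day followed, 5/day gained.
print(stalking_points(3000, 500, 100))  # 15
```

An account that follows back as fast as it is followed ends up with a stalking rate of zero, which is the false-positive case the update is meant to address.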
3. Blocked by others
When you log into TwitBlock the system has access to your blocks and currently refreshes this list once per day until you revoke your authorization of the app. This is a key indicator that will become much more interesting as TwitBlock gathers data. Currently 5 points (reduced from 10) are applied for each block on an account.
[ UPDATE – Aug 22 ]
Whitelisting now used to counteract blocks
[ UPDATE – Aug 24]
Blocks are now diluted by follower count
4. Identical profile pics
Spammers commonly reuse the same image on multiple accounts. This is particularly common with the “adult” accounts. TwitBlock crawls all the blocked accounts it knows about and stores an MD5 checksum of the profile image file. This way any account’s profile image can be cross-referenced with this database. 10 points are applied for each account known to use the same image.
This test could be easily foiled by spammers. Even using the same photo, it would be trivial to alter the checksum. So far, however, they appear not to be doing so.
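The checksum lookup could be sketched as follows, using Python's standard hashlib module. The ImageIndex class and its in-memory dictionary are hypothetical stand-ins for whatever database TwitBlock actually uses; only the MD5 checksum and the 10 points per matching account come from the text.

```python
import hashlib
from collections import defaultdict


class ImageIndex:
    """Hypothetical in-memory index: MD5 digest -> accounts using that image."""

    def __init__(self) -> None:
        self._accounts_by_digest = defaultdict(set)

    def add(self, account: str, image_bytes: bytes) -> str:
        """Record the image of a known blocked account and return its digest."""
        digest = hashlib.md5(image_bytes).hexdigest()
        self._accounts_by_digest[digest].add(account)
        return digest

    def score(self, account: str, image_bytes: bytes, points_per_match: int = 10) -> int:
        """Points for every *other* known account with a byte-identical image."""
        digest = hashlib.md5(image_bytes).hexdigest()
        others = self._accounts_by_digest.get(digest, set()) - {account}
        return points_per_match * len(others)


index = ImageIndex()
index.add("blocked_one", b"same image bytes")
index.add("blocked_two", b"same image bytes")
print(index.score("newcomer", b"same image bytes"))   # 20
print(index.score("newcomer", b"same image bytes!"))  # 0
```

The second call shows why the test is easy to foil: changing a single byte of the file produces a different checksum, so the match is lost even though the picture looks the same.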
5. Tweets via API
Status updates that are submitted without using a registered application (e.g. TweetDeck) will appear as having come “from API” (See Twitter FAQ). This is very useful, because spammers don’t want their activity to be tied to a registered application. If they start to do so then a list of known spammer applications will have to be compiled. 10 points are applied for API updates, although only the most recent tweet is analysed for performance reasons.
The points applied are deliberately low because people often give their password to applications that tweet on their behalf, e.g. “I just signed up to this awesome app and got 1,000 new followers”. This practice seriously needs to die out, but that’s another blog post for another day. Additionally, many spam tweets appear as “from Web”, which suggests they are using the public web interface.
6. Missing profile info
This is not a very reliable indicator and may be dropped. There are 4 profile fields that can be left empty: bio, location, URL and profile image. Most legitimate users fill in at least two of these. Currently 2 points are added if you leave all 4 empty, which is a drop in the ocean compared with other indicators. As I write this I realise this is due a review.
7. Username looks dodgy
For a human this is a strong indicator, but incredibly hard to implement programmatically. Currently this check runs some very crude tests on the username, such as being all numbers, having no vowels, and matching a common format used by spammers where two words are followed by a number. Further research is required in this area, but it’s unlikely to form a reliable indicator going forward because it’s so easy to fool. 5 points (reduced from 10) are applied for a username that looks randomly generated.
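The three crude username checks might be approximated like this. The regular expressions are guesses: the source does not say how “two words followed by a number” is detected, so the sketch assumes CamelCase or underscore-separated words. The default of 5 points also assumes that 5 is the current value.

```python
import re

# Assumed shapes for "two words followed by a number": JaneSmith42 or jane_smith_42.
CAMEL_TWO_WORDS_NUMBER = re.compile(r"^(?:[A-Z][a-z]+){2}\d+$")
SNAKE_TWO_WORDS_NUMBER = re.compile(r"^[A-Za-z]+_[A-Za-z]+_?\d+$")


def looks_generated(username: str) -> bool:
    """Crude guess at whether a username looks randomly generated."""
    name = username.lstrip("@")
    if name.isdigit():
        return True  # all numbers
    letters = [c for c in name.lower() if c.isalpha()]
    if letters and not any(c in "aeiou" for c in letters):
        return True  # no vowels at all
    return bool(CAMEL_TWO_WORDS_NUMBER.match(name)
                or SNAKE_TWO_WORDS_NUMBER.match(name))


def username_points(username: str, points: int = 5) -> int:
    return points if looks_generated(username) else 0


for candidate in ("12345678", "JaneSmith42", "prgrmr", "GuyKawasaki"):
    print(candidate, username_points(candidate))
```

A deliberately vowel-free contraction such as “prgrmr” trips the no-vowels rule, which illustrates how easily this kind of test produces false positives.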
8. Spammy words in bio and status
This more traditional test merely checks the bio against a list of bad words. The word list needs development and is currently not big enough to be useful. I intend to use the known blocked accounts to build a list of most common words found in spam accounts. An arbitrary score per word found is currently applied. For example “Naughty videos” yields 10 points.
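A word-list check of this kind can be sketched as below. The list contains only the single example given above, since the real list and its scores are not published; the normalisation step (lower-casing and stripping punctuation) is an assumption.

```python
import re

# Only the example from the text; the real word list is not known.
SPAM_PHRASES = {"naughty videos": 10}


def bio_points(bio: str, phrases: dict = SPAM_PHRASES) -> int:
    """Sum the score of every listed phrase found as whole words in the bio."""
    words = re.findall(r"[a-z0-9']+", bio.lower())
    padded = " " + " ".join(words) + " "  # pad so phrases match on word boundaries
    return sum(points for phrase, points in phrases.items()
               if f" {phrase} " in padded)


print(bio_points("Check out my NAUGHTY videos!"))  # 10
print(bio_points("I write software"))              # 0
```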
Stay tuned for updates, as all these indicators are likely to change.
I’ve found myself with a high rating (22) on a new account. I am in the process of closing my old account and opening a new (protected) one for those people who I followed and who followed me that I really want to stay in contact with. As this is less than a day old, it looks like I am not followed back much (yet), as I have added oodles of people who have not had time to follow me back.
Very interesting tool you’ve come up with and something Twitter themselves should really be tackling better.
I reckon the bio info (point number 6) is more important than you think. Also that the user name thing should have less importance – there will be so many lovely ladies who won’t be able to use their real names soon! And as you say is programmatically very hard.
[Insert shameless plug here] I have a score of zero by the way peeps – @isemann
R!
I agree, the “follow back” algorithm will have to be fixed somehow.
for example, I follow GuyKawasaki, Scobleizer, TheOnion, CNNbrkn, CNN_top, etc.etc.
LOTS of “important” people who I want to know what they are saying, but obviously I’m wise enough to know they couldn’t give two dimes what I have to say (thus will never follow me back), which is cool.
I don’t use Twitter as a way to “keep up with friends”, I use it as a way to keep up with “The World” I care about (Technology, among other things).
Seems as though this algorithm penalizes those who use Twitter in its MOST PURE FORM, i.e. a random tweet to friends, but mainly a silent observer listening to those with opinions who have value (but will NEVER “follow back”).
Just my $0.02
your random name calculator needs some work. “prgrmr” is a contraction of programmer. removing the vowels and doubled consonants is a standard way of contracting in unix & linux programming. there’s an algorithm for “uncontracting” words back into dictionary words that you should look into.
Tried TwitBlock on my 2 Twitter accounts, with interesting but weird results:
– I have a spamminess rating of 14 on one of my accounts, I guess because I follow 72 people but only have 22 followers. Like Leighton_Poar above, I mainly use Twitter to keep up with professional & news organizations who rarely follow people back + I block spammers and marketeers every day. Does keeping a close eye on one’s feed/followers, and not having the time/use to enter into a follow me/follow you Twitter race, increase the chances of being listed as a spammer? What’s the rationale here?
– On my other account, a lady who has registered to Twitter to follow me but hasn’t posted anything yet and is not following anyone else is registered as a spammer: quid?
Just some user feedback. Feel free to email if you have questions/comments.
I got a score of 27, based only on people not following me back. I mainly use Twitter to keep up with celebrities and news organisations who rarely follow people back. Maybe it would be an idea to take that calculation into account somehow, perhaps by ignoring people I’m following who have Verified Accounts. :-)
1. Ignore factor: worthless for people who have protected updates. I deny tons of people who try to follow me. If I wasn’t running protected updates I’d have far more followers than I do now, granted 3/4 would be spam-bots.
I find I have to view each new follower and look at who they follow to get a good idea if others are spammers or bots. I am seeing the same pictures, but with different names, as ones I blocked due to offensive content. This is very time consuming but cuts down a lot of new followers that are really spam from other account followers.
Bear in mind that spam accounts often ‘steal’ tweets to appear like real people. So analysing these can be very misleading.
Hey, I am still working on an algorithm which identifies spammers. Check out the prototype http://wayati.com I am using other rules and techniques to identify spammers. Mostly I analyze the structure of the last 20 tweets. Currently I am working on a new version of this algorithm which is more object oriented and makes many more checks. Maybe we can work together to get the best possible tool to fight spammers.
Congratulations on the whitelisting!