<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>timwhitlock.info &#187; junk</title>
	<atom:link href="http://timwhitlock.info/blog/tag/junk/feed/" rel="self" type="application/rss+xml" />
	<link>http://timwhitlock.info</link>
	<description>Tim Whitlock&#039;s personal site and blog</description>
	<lastBuildDate>Thu, 15 Dec 2011 13:51:45 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
		<item>
		<title>TwitBlock spam ratings explained</title>
		<link>http://timwhitlock.info/blog/2009/08/03/twitblock-spam-ratings-explained/</link>
		<comments>http://timwhitlock.info/blog/2009/08/03/twitblock-spam-ratings-explained/#comments</comments>
		<pubDate>Mon, 03 Aug 2009 22:12:13 +0000</pubDate>
		<dc:creator>tim</dc:creator>
				<category><![CDATA[General]]></category>
		<category><![CDATA[junk]]></category>
		<category><![CDATA[spam]]></category>
		<category><![CDATA[twitblock]]></category>
		<category><![CDATA[twitter]]></category>

		<guid isPermaLink="false">http://web.2point1.com/2009/08/03/twitblock-spam-ratings-explained/</guid>
		<description><![CDATA[A detailed explanation of the scoring mechanism used by TwitBlock. Some people have complained that they get a high spam score and point out that they are not spammers. There are a number of important things to note about this. This software is in alpha &#8211; these indicators and the scoring mechanisms attached to them [...]]]></description>
			<content:encoded><![CDATA[
<h3>A detailed explanation of the scoring mechanism used by <a href="http://twitblock.org/?wp" target="_blank">TwitBlock</a>.</h3>
<p>Some people have complained that they get a high spam score and point out that they are not spammers. There are a number of important things to note about this.</p>
<ul>
<li><a href="http://twitblock.org/?wp" target="_blank">This software</a> is in alpha &#8211; these indicators and the scoring mechanisms attached to them <strong>will</strong> change.</li>
<li>As the system gathers data it will rely less on <a href="http://en.wikipedia.org/wiki/Heuristic" target="_blank">heuristics</a> and more on cross-referencing (e.g. how many people have blocked an account).</li>
<li>Some of these tests are only indicators of <strong>automation</strong>, not specifically of malicious behaviour.</li>
<li>The spam rating has <strong>no limit</strong> &#8211; scoring 40 may be high for a &#8220;<em>legitimate</em>&#8221; account, but in a list with real spammers scoring 300+ you&#8217;ll be way down at the bottom.</li>
<li>If you display characteristics of a spammer then perhaps this amounts to the same thing as being a spammer. Most normal users score <em>zero.</em></li>
</ul>
<p>Roughly in order of accuracy, here are the 8 tests currently performed in the standard <a href="http://twitblock.org/scan_followers.php?wp" target="_blank">TwitBlock scan</a>.</p>
<h4><span id="more-128"></span>1. Ignore factor.</h4>
<p>This could also be called &#8220;inverse popularity&#8221;. If you follow 200 people and only 50 follow you back your ignore factor is 75%. Whether or not these 50 are the same people you follow is not analysed. The cut-off for scoring is 50%. <strong>Every 1% above 50 currently yields one point</strong>.</p>
<p>This simple and easily calculable factor is quite accurate because it reflects real human behaviour that can be observed. An account that is clearly spam, such as an &#8220;adult&#8221; account, will have many times fewer followers than friends.</p>
<p>Naturally some spammers have found ways to beat this indicator. In some cases spam accounts follow each other to build up numbers, but a more cunning technique is the &#8220;sleeper&#8221; approach. Sleeper accounts pose as real people using stolen tweets pulled from the public timeline. TwitBlock may eventually crawl Twitter looking for these accounts, so expect more about this in future posts.</p>
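<p>As a rough illustration of the arithmetic described above, here is a minimal Python sketch. The function names are hypothetical, and truncating to whole percentage points is an assumption rather than TwitBlock&#8217;s actual code.</p>
<pre>
```python
def ignore_factor(friends: int, followers: int) -> float:
    """Percentage of the people you follow who do not follow you back.

    Only the raw counts are compared; whether the followers are the same
    people you follow is not analysed.
    """
    if friends == 0:
        return 0.0
    # Clamp at zero so accounts with more followers than friends score nothing.
    return max(0.0, (friends - followers) / friends * 100.0)


def ignore_points(friends: int, followers: int, cutoff: float = 50.0) -> int:
    """One point for every whole percent above the cut-off (assumed rounding)."""
    return max(0, int(ignore_factor(friends, followers) - cutoff))


print(ignore_factor(200, 50))  # 75.0
print(ignore_points(200, 50))  # 25
```
</pre>
<p>With the example from the text, following 200 people with 50 followers gives an ignore factor of 75% and therefore 25 points.</p>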
<h4>2. Follow Rate</h4>
<p>The average number of people you follow per day forms your follow rate. This is calculated as the number of people you follow divided by the number of days you&#8217;ve been on Twitter. Although it&#8217;s a crude average, it is very telling and probably the second most reliable heuristic indicator. Even if you occasionally add a hundred people in a day it&#8217;s unlikely you can keep this up, so your average will drop. Averages are generally low even for power users, so a higher value is a strong indication of automation. The current cut-off (considered normal) is 10 per day. <strong>A point is added for every follow per day above 10.</strong></p>
<p><strong>[UPDATE - Aug 12]</strong><br />
Many popular accounts have high follow rates due to a &#8220;following back&#8221; policy, whether automated or not. The rate at which an account is followed is now subtracted from this value. This may result in lowering spam scores of real spammers, but it also reduces the number of false positives. So now, the rate at which an account follows without reciprocation is known as the &#8220;Stalking rate&#8221;.</p>
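<p>The follow rate and the adjusted &#8220;stalking rate&#8221; could be sketched as follows. This assumes both rates are simple lifetime averages and that the result is clamped at zero; the names and the rounding are illustrative, not taken from the TwitBlock source.</p>
<pre>
```python
def follow_rate(friends: int, days_on_twitter: int) -> float:
    """Average number of accounts followed per day."""
    return friends / max(days_on_twitter, 1)


def stalking_rate(friends: int, followers: int, days_on_twitter: int) -> float:
    """Follow rate minus the rate at which the account is itself followed."""
    days = max(days_on_twitter, 1)
    # Clamping at zero is an assumption: popular accounts simply score nothing.
    return max(0.0, friends / days - followers / days)


def stalking_points(friends: int, followers: int, days_on_twitter: int,
                    cutoff: float = 10.0) -> int:
    """One point for every whole follow per day above the cut-off."""
    rate = stalking_rate(friends, followers, days_on_twitter)
    return max(0, int(rate - cutoff))


# 3000 friends and 500 followers after 100 days: 30 - 5 = 25 per day.
print(stalking_points(3000, 500, 100))  # 15
```
</pre>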
<h4>3. Blocked by others</h4>
<p>When you log into TwitBlock the system has access to your blocks and currently refreshes this list once per day until you revoke your authorization of the app. This is a key indicator that will become much more interesting as TwitBlock gathers data. <strong>Currently <strike>10</strike> 5 points are applied for each block on an account</strong>.</p>
<p><strong>[UPDATE - Aug 22]</strong><br />
<a href="http://web.2point1.com/2009/08/21/twitblock-trialing-whitelist-feature/">Whitelisting now used to counteract blocks</a></p>
<p><strong>[UPDATE - Aug 24]</strong><br />
<a href="http://web.2point1.com/2009/08/24/diluting-block-counts/">Blocks are now diluted by follower count</a></p>
<h4>4. Identical profile pics</h4>
<p>Spammers commonly reuse the same image on multiple accounts. This is particularly common with the &#8220;adult&#8221; accounts. TwitBlock crawls all the blocked accounts it knows about and stores an <a href="http://en.wikipedia.org/wiki/MD5" target="_blank">MD5 checksum</a> of the profile image file. This way any account&#8217;s profile image can be cross-referenced against this database. <strong>10 points are applied for each account known to use the same image</strong>.</p>
<p>This test could be easily foiled by spammers. Even using the same photo, it would be trivial to alter the checksum. So far, however, they appear not to be doing so.</p>
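<p>To make the checksum idea concrete, here is a small sketch using Python&#8217;s standard <code>hashlib</code> module. The <code>ProfileImageIndex</code> class and its in-memory counter are hypothetical stand-ins for whatever database TwitBlock actually uses. It also shows why the test is easy to foil: changing a single byte of the file produces a different checksum.</p>
<pre>
```python
import hashlib
from collections import Counter


def image_checksum(image_bytes: bytes) -> str:
    """MD5 checksum of the raw profile image file, as a hex string."""
    return hashlib.md5(image_bytes).hexdigest()


class ProfileImageIndex:
    """Counts how many known blocked accounts use each image checksum."""

    def __init__(self) -> None:
        self._counts = Counter()

    def add_blocked_account(self, image_bytes: bytes) -> None:
        self._counts[image_checksum(image_bytes)] += 1

    def points(self, image_bytes: bytes, points_per_match: int = 10) -> int:
        # A Counter returns 0 for checksums it has never seen.
        return self._counts[image_checksum(image_bytes)] * points_per_match


index = ProfileImageIndex()
photo = b"pretend these are the bytes of an image file"
index.add_blocked_account(photo)
index.add_blocked_account(photo)

print(index.points(photo))            # 20
print(index.points(photo + b"\x00"))  # 0, one extra byte changes the checksum
```
</pre>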
<h4>5. Tweets via API</h4>
<p>Status updates that are submitted without using a registered application (e.g. TweetDeck) will appear as having come &#8220;from API&#8221; (<a href="http://apiwiki.twitter.com/FAQ#HowdoIget%E2%80%9CfromMyApp%E2%80%9DappendedtoupdatessentfrommyAPIapplication" target="_blank">See Twitter FAQ</a>). This is very useful, because spammers don&#8217;t want their activity to be tied to a registered application. If they start to do so then a list of known spammer applications will have to be compiled. <strong>10 points are applied for API updates</strong>, although only the most recent tweet is analysed for performance reasons.</p>
<p>The points applied are deliberately low because people often give their password to applications that tweet on their behalf, e.g. &#8220;I just signed up to this awesome app and got 1,000 new followers&#8221;. This practice seriously needs to die out, but that&#8217;s another blog post for another day. Additionally, many spam tweets appear as &#8220;from Web&#8221;, which suggests they are using the public web interface.</p>
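<p>A check like this might look something like the sketch below. It assumes each tweet is a dictionary with a <code>source</code> field whose value is <code>"API"</code> for unregistered clients; that field name and value are assumptions for illustration, as is the function name.</p>
<pre>
```python
def api_points(recent_tweets: list, points: int = 10) -> int:
    """Score the account if its most recent tweet came "from API".

    recent_tweets is expected newest first; only the first entry is inspected,
    mirroring the performance shortcut described above.
    """
    if not recent_tweets:
        return 0
    source = recent_tweets[0].get("source", "")
    return points if source.strip().lower() == "api" else 0


print(api_points([{"text": "Buy now!", "source": "API"}]))  # 10
print(api_points([{"text": "Morning all", "source": "web"},
                  {"text": "Buy now!", "source": "API"}]))  # 0
```
</pre>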
<h4>6. Missing profile info</h4>
<p>This is not a very reliable indicator and may be dropped. There are 4 profile fields that can be left empty: bio, location, URL and profile image. Most legitimate users fill in at least two of these. <strong>Currently 2 points are added if you leave all 4 empty</strong>, a drop in the ocean compared with other indicators. As I write this I realise this is due a review.</p>
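<p>The rule is simple enough to sketch in a few lines. The dictionary keys below are hypothetical, and the sketch assumes the caller has already mapped a default avatar to an empty <code>profile_image</code> value.</p>
<pre>
```python
PROFILE_FIELDS = ("bio", "location", "url", "profile_image")


def missing_profile_points(profile: dict, points: int = 2) -> int:
    """Add points only when every one of the four profile fields is empty."""
    empty = [name for name in PROFILE_FIELDS
             if not (profile.get(name) or "").strip()]
    return points if len(empty) == len(PROFILE_FIELDS) else 0


print(missing_profile_points({}))                    # 2
print(missing_profile_points({"location": "Leeds"}))  # 0
```
</pre>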
<h4>7. Username looks dodgy</h4>
<p>For a human this is a strong indicator, but <a href="http://stackoverflow.com/questions/1164186/how-to-check-if-a-string-looks-randomized-or-human-generated-and-pronouncable" target="_blank">incredibly hard to implement programmatically</a>. Currently this check performs some very crude tests on the username, such as whether it is all numbers, whether it has no vowels, and whether it matches a common format used by spammers where two words are followed by a number. Further research is required in this area, but it&#8217;s unlikely to form a reliable indicator going forward because it&#8217;s so easy to fool. <strong><strike>10</strike> 5 points are applied for a username that looks randomly generated</strong>.</p>
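<p>The three crude checks could be approximated as below. This is only a sketch: the regular expression assumes the &#8220;two words followed by a number&#8221; format is written as two capitalised words, which is a guess, and the vowel check ignores &#8220;y&#8221;, so a name like &#8220;rhythm&#8221; would be a false positive.</p>
<pre>
```python
import re

# Assumed shape of the "two words then a number" format, e.g. HotDeals2009.
TWO_WORDS_NUMBER = re.compile(r"^[A-Z][a-z]+[A-Z][a-z]+\d+$")


def looks_generated(username: str) -> bool:
    """Very crude guess at whether a username was machine generated."""
    if username.isdigit():
        return True  # all numbers
    letters = [c for c in username.lower() if c.isalpha()]
    if letters and not any(c in "aeiou" for c in letters):
        return True  # no vowels at all
    return bool(TWO_WORDS_NUMBER.match(username))


def username_points(username: str, points: int = 5) -> int:
    return points if looks_generated(username) else 0


for name in ("12345678", "xkcdqwrt", "HotDeals2009", "timwhitlock"):
    print(name, username_points(name))
```
</pre>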
<h4>8. Spammy words in bio and status</h4>
<p>This more traditional test merely checks the bio against a list of bad words. The word list needs development and is currently not big enough to be useful. I intend to use the known blocked accounts to build a list of most common words found in spam accounts. <strong>An arbitrary score per word found is currently applied</strong>. For example &#8220;Naughty videos&#8221; yields 10 points.</p>
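<p>A word-list check of this kind might be sketched as follows. The list here contains only the one example given above; the real list and its per-word scores are not published in this post, so everything else about the structure is an assumption.</p>
<pre>
```python
# Hypothetical word list: only the example from the post is included.
SPAM_PHRASES = {"naughty videos": 10}


def spam_word_points(bio: str, phrases: dict = SPAM_PHRASES) -> int:
    """Sum the score of every listed phrase found in the bio."""
    # Lower-case and collapse whitespace so simple variations still match.
    text = " ".join(bio.lower().split())
    return sum(score for phrase, score in phrases.items() if phrase in text)


print(spam_word_points("Check out my Naughty   Videos!"))  # 10
print(spam_word_points("Web developer from the UK"))       # 0
```
</pre>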
<p>Stay tuned for updates, as all these indicators are likely to change.</p>
]]></content:encoded>
			<wfw:commentRss>http://timwhitlock.info/blog/2009/08/03/twitblock-spam-ratings-explained/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
		<item>
		<title>TwitBlock is born</title>
		<link>http://timwhitlock.info/blog/2009/07/27/twitblock-is-born/</link>
		<comments>http://timwhitlock.info/blog/2009/07/27/twitblock-is-born/#comments</comments>
		<pubDate>Mon, 27 Jul 2009 22:36:04 +0000</pubDate>
		<dc:creator>tim</dc:creator>
				<category><![CDATA[General]]></category>
		<category><![CDATA[data mining]]></category>
		<category><![CDATA[junk]]></category>
		<category><![CDATA[spam]]></category>
		<category><![CDATA[twitblock]]></category>
		<category><![CDATA[twitter]]></category>

		<guid isPermaLink="false">http://web.2point1.com/2009/07/27/twitblock-is-born/</guid>
		<description><![CDATA[A bulk blocking and spam filter tool for Twitter www.twitblock.org I&#8217;ve finally got round to building the Twitter app I&#8217;ve been thinking about for months. While everyone else is preoccupied with making fun, or cool apps, I&#8217;ve been thinking about the increasing problem of spam and junk followers on Twitter. I won&#8217;t go into why [...]]]></description>
			<content:encoded><![CDATA[<h3>A bulk blocking and spam filter tool for Twitter</h3>
<p><strong><a href="http://twitblock.org/">www.twitblock.org</a></strong></p>
<p>I&#8217;ve finally got round to building the Twitter app I&#8217;ve been thinking about for months. While everyone else is preoccupied with making fun or cool apps, I&#8217;ve been thinking about the increasing problem of spam and junk followers on Twitter. I won&#8217;t go into why I think this is such a problem right now; there is plenty of time for that later.</p>
<p>This is just a quick announcement to say that I&#8217;ve released an early <em>alpha</em> version of a tool that I hope to develop into something genuinely useful. Currently it&#8217;s a <a href="http://twitblock.org/scan_followers.php">simple scanner</a> that analyses your followers for signs of &#8220;spammy&#8221; behaviour. I&#8217;ll post more details about these <em>indicators</em> soon, and I&#8217;ll also share some of the interesting discoveries I&#8217;ve been making about Twitter spam as I go on my mission.</p>
<p>UPDATE: I have posted <a href="http://web.2point1.com/2009/08/03/twitblock-spam-ratings-explained/">about these indicators</a></p>
<p><span id="more-125"></span></p>
<h3>Data mining for good, not evil</h3>
<p>One of the principal aims of <a href="http://twitblock.org/">TwitBlock</a> is to gather data in order to improve the service &#8211; i.e. to make it accurate enough that it could [in theory] be used to <em>automatically</em> filter out spam, much as an email junk filter endeavours to.</p>
<p>By logging into TwitBlock (<a href="http://blog.twitter.com/2009/04/whats-deal-with-oauth.html" target="_blank">via Twitter OAuth of course</a>) you are sharing the list of people that you block. As long as the app is authorized I can update this list and the app can learn from it.</p>
<p>Additionally I will be writing various bots (crawlers) that analyse Twitter activity in terms of suspicious behaviour and mine more data. More about these bots later too <img src='http://timwhitlock.info/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
]]></content:encoded>
			<wfw:commentRss>http://timwhitlock.info/blog/2009/07/27/twitblock-is-born/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
	</channel>
</rss>

