How to eliminate spam bots from AWStats for good

The two most common approaches in Web analytics are:

  1. Web server logfile analysis
  2. Page tagging

Page tagging is the method of choice from a commercial standpoint. However, it has its characteristic drawbacks:

  • changes to the web application are needed
  • vendor lock-in of some sort takes place (regardless of whether you use a subscription-based solution or purchase a hosted one).

On the other hand, there is one nice web stats tool operating in the good old logfile analysis realm: AWStats. Until recently it was a reliable workhorse for many webmasters, delivering quite useful reports on origin breakdown, sessions (visit duration), and lists of landing (“entry”) and exit pages – categories commonly associated with the more complex page tagging statistics systems.

What happened to it?

It’s spam bots and referrer spammers that now spoil the reports produced by AWStats.

You know, when you look at e.g. the Top Hosts report in AWStats and see that (almost) all of the Top 10 are non-humans, it’s kind of frustrating – they may not make up much of the totals, but they push the real people down past the Top 10 report’s cut-off, and you simply lose that whole part of your stats, which doesn’t really play in favor of the stats system in use.

And it was exactly this section of the AWStats page – Top Hosts – that got me thinking about ways to cure the problem.

As I mentioned above, there are two distinct types of spoilers in the stats. They are somewhat similar in that both are robots, yet they have major differences:

  1. Referrer spammers.
    These specifically target the logs being analyzed, so they have been combated for some time already.
  2. Comment spammers.
    These do not target logs deliberately; messed-up logs are merely a by-product of their activity. Perhaps because of this, they seem to have received inadequate attention to date.

So here’s an idea on how to weed comment spam bot activity out of web server statistics – one that should presumably work for referrer spam bots as well.

When you look at the Top Hosts report for a heavily spammed web site, you will almost surely notice remarkably similar numbers for each host in two columns: Pages and Hits.

That equality – a Pages/Hits ratio of 1 – points to one very special characteristic common to human users accessing web pages:
Real people’s browsers request some non-page elements besides the pages themselves.

Technically this could be detected as:

  • requesting files with an extension from this list (taken directly from the AWStats config file):
    NotPageList="css js class gif jpg jpeg png bmp"
  • having some of the requests return HTTP status codes such as 304 or 303.

Anybody NOT requesting at least one of the above is very likely a spamming robot. Robots aren’t fond of style, are they?
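The check above can be sketched as a standalone log pre-filter. The script this article actually ships (on page 2) is written in Perl; what follows is merely an illustrative Python sketch of the same test, with all names my own. It makes two passes over an Apache combined-format log: first it collects the hosts that fetched at least one non-page element or received a 304/303, then it keeps only those hosts’ lines.

```python
import re

# Extensions AWStats counts as non-page hits (the NotPageList above).
ASSET_EXTS = ("css", "js", "class", "gif", "jpg", "jpeg", "png", "bmp")
ASSET_RE = re.compile(r"\.(%s)([?\s]|$)" % "|".join(ASSET_EXTS), re.I)

# Minimal combined-log parser: host, request line, status code.
LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "([^"]*)" (\d{3})')

def human_hosts(lines):
    """Hosts that fetched at least one asset or got a 304/303 back."""
    hosts = set()
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            host, request, status = m.groups()
            if status in ("304", "303") or ASSET_RE.search(request):
                hosts.add(host)
    return hosts

def filter_log(lines):
    """Keep only the lines of hosts that behaved like real browsers."""
    lines = list(lines)
    keep = human_hosts(lines)
    out = []
    for line in lines:
        m = LINE_RE.match(line)
        if m and m.group(1) in keep:
            out.append(line)
    return out
```

Feed the filtered output to AWStats instead of the raw log, and the Pages/Hits-ratio-1 hosts disappear from the reports.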

One little problem I see with this method is losing mobile users and slow-connection users – those who try to cut traffic in every possible way.

But hey, they aren’t likely to spend money with you anyway (I mean, they are respectable users, but nobody really loses anything by skipping them in the web site stats).

Yet another class of potential blunder might be an old site featuring text-only pages. Then again, if you’re planning to use those pages as advertising media, you’ll have to find a way to include additional elements in them – and that will give the filter what it needs to distinguish humans from machines here, too.

Therefore, the easily implemented solution (see page 2 below), based on a one-line idea, seems to deliver a huge effect with little effort.

And I’m telling you, I practically rediscovered some of the AWStats reports after installing the new log file filter.

For example, I found out that about 30% of the visitors in the old stats were comment spammers. On some less popular sites this figure accounted for more than half of the page traffic!

Or take bookmarking: before the filter I saw only a modest 15–20% of visitors bookmarking my website (and I thought that was nice); afterwards I was amazed to see around 30% of visitors adding my blog to their favourites.
I have to say that is really encouraging.

The recent trend of Search Keywords spamming appears to be eliminated too.

In short – just try it for yourself, and who knows, maybe you’ll be able to put off purchasing a commercial analysis package until the next “big” spammers’ attack.


15 thoughts on “How to eliminate spam bots from AWStats for good”

  1. Pingback: Results of taking care for the site •STEREO-blog

  2. Pingback: Penultimate Reality » Blog Archive » Spambots Hurting Statistics

  3. If you have this error:

    Use of uninitialized value in pattern match (m//) at ./ line 19.

    Change the lines (at line 19 and line 45 of the script) that look like:

    while () {


    while () {

  4. Thanks for contributing, Ruben.

    Actually, I see the script is in need of revision as my stats once again started to seem poisoned.

    Looks like .js access is the single most reliable indicator of humanness – robots are rarely interested in JavaScript. The only problem is that not all sites readily include this type of file on every page, so this test isn’t totally universal out of the box.
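    For what it’s worth, that narrower test is trivial to script on its own. Here is an illustrative Python sketch (the function name and regex are mine, not taken from the filter script on page 2):

```python
import re

# A request line that fetches any .js file (query strings allowed).
JS_RE = re.compile(r'"[A-Z]+ \S*\.js(\?\S*)? HTTP', re.I)

def js_fetching_hosts(log_lines):
    """Hosts that requested at least one .js file; the rest are suspect bots."""
    hosts = set()
    for line in log_lines:
        if JS_RE.search(line):
            hosts.add(line.split(" ", 1)[0])
    return hosts
```

    Any host absent from that set never asked for JavaScript and can be dropped from the log before AWStats sees it.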

  5. You’ll need to re-run your logs through the new filter, obviously.
    The old reports simply lack the key information needed to distinguish the trash within them.

    BTW, I believe it’s good policy to keep all your web access logs. Quite a few of my websites take only 300 MB of uncompressed log file space per month (just over 1 million hits). Disks are cheap nowadays, and you may get new ideas for what to do with the logs in the future.

  6. Can someone please give a short intro as how to implement this Perl script on my Apache server? i.e.: where to save the file, how to make it read the log file, etc.
    Much appreciated.

  7. Hi Alex,

    Real people’s browsers request some non-page elements besides pages themselves.

    This is very true – but Awstats doesn’t show that, and we are seeing a 1:1 ratio (pages:hits) across the board even for non-spam entries.

    I don’t know why this happens; it could be an Awstats problem. But what I do know is that this problem should not be resolved at the Awstats level, but rather at the .htaccess level.

  8. Our sites have been hit hard by this issue in the last few months. I’d really appreciate any help anyone could give me in implementing this. I’m not really well versed in AWStats, but I have people breathing down my neck about the stats jumping so much.
