How to eliminate spam bots from AWStats for good

Here’s an example Perl implementation of the Apache access.log filter for telling human and computer site visitors apart.
For the description, please see the main page.

#!/usr/bin/perl -w
#
# Extract human-like entries from httpd server log.
# Note:
# Some legitimate users may be filtered out;
# however, they are probably not economically interesting anyway,
# so they are not really needed in the analysis.

my $logfile = defined $ARGV[0] ? $ARGV[0] : "";

# Pass 1: get the list of (spam) bots.
open(L, $logfile) or die "Logfile $logfile inaccessible (pass 1).\n";
my %bots = ();
my %humans = ();
my $bad_lines = 0;
my $MAX_BAD = 3;
my $verb = 0;
while (<L>) {
        /^(\S+) .+? (\S*?) HTTP\/\S*? (\d\d\d) / or do {
                ++$bad_lines;
                $verb and $bad_lines <= $MAX_BAD and
                        warn "Bad line: $_";
                next;
        };
        $host           = $1;
        $request        = $2;
        $status         = $3;
        if (like_human()) {
                $humans{$host} = 1;
                delete $bots{$host};
        }
        else {
                exists $humans{$host} or $bots{$host} = 1;
        }
}
close L;
$verb && printf STDERR " bots: %d, humans: %d, bad lines: %d\n",
        scalar keys %bots,
        scalar keys %humans,
        $bad_lines;

# Pass 2: extract human-like entries.
open(L, $logfile) or die "Logfile $logfile inaccessible (pass 2).\n";
while (<L>) {
        /^(\S+) / or next;
        $host           = $1;
        print unless exists $bots{$host};
}
close L;

sub like_human
{
        return 1 if exists $humans{$host};
        # A real browser revalidates cached pages (304 responses) and
        # fetches static assets; most spam bots request pages only.
        return
                ($request =~ /^\// && $status eq "304") ||
                $request =~ /\.(?:js|css|jpe?g|png|gif|ico|pdf|mp3|avi)$/i
                # Is "htm" ok?..
                ? 1 : 0;
}
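
One way to wire the filter in (the file names and paths below are only examples, so adjust them to your setup): save the script as, say, remove_spambots.pl, make it executable, and run it over the raw access log to produce a cleaned copy for AWStats to read instead of the original:

    ./remove_spambots.pl /var/log/httpd/access_log > access_log.humans

Alternatively, since AWStats can read its log from a command pipe, the filtering can be done on the fly in awstats.conf:

    LogFile="/usr/local/bin/remove_spambots.pl /var/log/httpd/access_log |"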



15 thoughts on “How to eliminate spam bots from AWStats for good”

  1. Pingback: Results of taking care for the site •STEREO-blog

  2. Pingback: Penultimate Reality » Blog Archive » Spambots Hurting Statistics

  3. If you have this error:

    Use of uninitialized value in pattern match (m//) at ./remove_spambots.pl line 19.

    Change the lines at line 19 and line 45 that look like:

    while () {

    to:

    while (<L>) {

  4. Thanks for contributing, Ruben.

    Actually, I see the script is in need of revision, as my stats have once again started to look poisoned.

    It looks like .js access is the single most reliable indication of humanness; robots are rarely interested in JavaScript. The only problem is that not all sites include this type of file on every page, so the test isn’t totally universal out of the box.
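
    For sites that do pull in at least one .js file from every page, a drop-in replacement for like_human() along these lines might work (an untested sketch, using the same globals as the script above):

    sub like_human
    {
            return 1 if exists $humans{$host};
            # Treat a JavaScript fetch as the primary sign of a real browser,
            # keeping the conditional-GET (304) test as a fallback.
            return 1 if $request =~ /\.js$/i;
            return 1 if $request =~ /^\// && $status eq "304";
            return 0;
    }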

  5. You’ll need to re-run the logs through the new filter, obviously.
    The old reports simply lack the key information needed to pick out the trash.

    BTW, I believe it’s a good policy to keep all your web access logs. Quite a few of my websites take only 300 MB of uncompressed log space per month (just over 1 million hits). Disks are cheap nowadays, and you may get new ideas for what to do with the logs in the future.

  6. Can someone please give a short intro on how to implement this Perl script on my Apache server? I.e., where to save the file, how to make it read the log file, etc.
    Much appreciated.

  7. Hi Alex,

    Real people’s browsers request some non-page elements besides pages themselves.

    This is very true, but Awstats doesn’t show that, and we are seeing a 1:1 ratio (pages:hits) across the board even for non-spam entries.

    I don’t know why this happens; it could be an Awstats problem. But what I do know is that this problem should not be resolved at the Awstats level, but rather at the .htaccess level.

  8. Our sites have been hit hard by this issue in the last few months. I’d really appreciate any help anyone could give me in implementing this. I’m not really well versed in AWStats, but I have people breathing down my neck about the stats jumping so much.
