
Sifting Out Rude Robots And StealthBots

By Gerry Patterson

The PGTS website has been gathering data on agent strings from the Apache log files since 2002. One of the first problems encountered was the detection of Rude Robots and StealthBots. Lately there have been a lot of strange agent strings appearing in my log files. I needed a way to identify these and decide whether a particular agent string was "reasonable".


Last month (Jan 2004), there were almost a quarter of a million hits logged at the PGTS website. This translates into roughly 8000 hits per day. Visitors who look at the statistics that are posted on the site (and there are a lot of statistics) would probably be surprised at the extraordinarily low hit counts that this site reports.

As it turns out, a lot of those hits are not from human beings.

To give some examples, typical agent strings from legitimate browsers and well-behaved robots look something like these:
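
	Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
	Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4) Gecko/20030624
	Googlebot/2.1 (+http://www.googlebot.com/bot.html)

A StealthBot, on the other hand, may send an empty agent string, random gibberish, or a clumsy imitation of one of the strings above.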

The first thing that was required was a list of words that are usually found in agent strings. This was best obtained from the existing database.

Agent string data at the PGTS website is kept in a few postgres tables. The main table is called webagents, which has the following structure:

			 Table "webagents"
	  Attribute   |           Type           | Modifier
	--------------+--------------------------+----------
	 agent_id     | integer                  | not null
	 agent_string | character varying(256)   | not null
	 name         | character varying(40)    | not null
	 version      | character varying(15)    |
	 os           | character varying(40)    |
	 comments     | character varying(512)   |
	 owner        | character varying(256)   |
	 robot_ind    | integer                  |
	 text_only    | boolean                  |
	 last_visit   | timestamp with time zone | not null
	 create_date  | timestamp with time zone | not null
	 update_date  | timestamp with time zone | not null
	Indices: webagents_agent_id_key,
		 webagents_agent_string_key

After cleaning up the data, a list of the reasonable agent_strings could be obtained with the following query:

	\o agent_strings.txt
	select agent_string from webagents where name != 'unknown';
	\o
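
As an aside, the raw output from psql includes a column header and a row count footer; most of the manual cleanup can be avoided by toggling tuples-only output before running the query:

	\t
	\o agent_strings.txt
	select agent_string from webagents where name != 'unknown';
	\o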

The file was cleaned up with vi. After that, a new table was created in postgres to hold common words that are deemed to be "ok":

	create table ok_words (
		word		varchar(25) not null unique,
		word_count	int);

Note: This creates an implicit index 'ok_words_word_key' on the word column.
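
When the table is created, postgres announces the implicit index with a notice along these lines (the exact wording varies between versions):

	NOTICE:  CREATE TABLE / UNIQUE will create implicit index 'ok_words_word_key' for table 'ok_words'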

Next, a Perl script was written to extract the words from the agent_strings. It started as a simple script that split each agent string into words with the "split" function and then counted them. It gradually became more complex. The final code was as follows:

	#!/usr/bin/perl
	# Seed the hash of words with the years 1990 .. 2004,
	# the Mozilla revision strings and "www".
	foreach $x (1990 .. 2004){$W{$x} = 0};
	$W{"rv:2"} = $W{"rv:1"} = $W{"rv:0"} = $W{www} = 0;
	while(<>){
		chomp;
		# Replace punctuation with spaces. Colons are kept if
		# the string contains a Mozilla revision number.
		s/[\(\)\/{};,_@'-]/ /g;
		s/:/ /g unless (/rv:[0-9]/);
		foreach $x (split(' ')) {
			# Only consider words of 4 to 25 characters.
			next if (length($x) > 25 || length($x) < 4);
			# Discard pure numbers outside 1990 .. 2004.
			next if (($x > 0 && int($x) == $x) &&
				($x < 1990 || $x > 2004));
			# Merge words that contain, or are contained in,
			# an existing word; the shorter word is retained.
			foreach $y (keys %W){
				if (index($y,$x) > -1 && $x ne $y){
					$W{$x} += $W{$y};
					delete $W{$y};
					last;
				}
				if (index($x,$y) > -1 && $x ne $y){
					$W{$y} += $W{$x};
					delete $W{$x};
					$x = $y;
					last;
				}
			}
			$W{$x}++;
		}
	}
	# Emit a COPY statement that psql can read from stdin.
	print "COPY ok_words FROM stdin;\n";
	foreach $x (sort keys %W) {
		# Keep words seen at least 5 times (10 times for numbers).
		next if ($W{$x} < 5 || ($W{$x} < 10 && $x > 0));
		print "$x\t$W{$x}\n";
	}
	print "\\.\n";

The script looks daunting; however, the transformations are simple to follow. The hash %W is used to store the words contained in the agent_strings. It is seeded with the years 1990 to 2004 and the strings "rv:2", "rv:1", "rv:0" and "www".

The first transformation (s/[\(\)\/{};,_@'-]/ /g) substitutes any of the characters "()/{};,_@'-" with a space. The next transformation substitutes ":" with a space, provided the agent string does not match the regexp /rv:[0-9]/ (this revision number string is found in many versions of Mozilla). The agent string is then split into words (using the space character as the split delimiter). Only words between 4 and 25 characters in length are considered.

Words consisting only of digits (except the numbers 1990 - 2004) are discarded. Finally, before the count is incremented, the hash %W is checked to see whether a portion of the current word is contained in the existing word list, or whether any of the existing words are contained within the current word. If so, it is translated accordingly. For example: the word "Amiga" is contained in the word "AmigaOS", so only the word "Amiga" is retained and counted in the hash.
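
To see all the transformations at work, consider the typical Mozilla agent string given earlier:

	Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4) Gecko/20030624

This is reduced to the words "Mozilla", "Linux", "i686", "rv:1.4" and "Gecko". The fragments "5.0", "X11", "U", "en" and "US" are too short, the number 20030624 is discarded because it falls outside the range 1990 - 2004, and "rv:1.4" is merged into the seeded word "rv:1" before being counted.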

Once the hash has been constructed, it is printed. First the SQL COPY command is generated. This will cause psql to copy data from stdin to the table ok_words. Only the alphanumeric words with a count of at least 5, or the numbers with a count of at least 10, are printed along with the word count. The final table can be viewed here. The SQL COPY command is terminated with the \. string on a single line.
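
Because the script emits a complete COPY statement, its output can be piped straight into psql. Assuming the script was saved as count_words.pl and the database is called pgts (both names are hypothetical), the table can be loaded in one step:

	./count_words.pl agent_strings.txt | psql pgts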

It also seemed that certain three letter words like "www", "w3m", "Bot" and "Web" should be included, even though the length filter in the script excludes them (apart from the seeded "www").
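
Such words would have to be added by hand, with something like the following (the word_count column can be left null; the unique constraint on word will reject any word the script has already loaded):

	insert into ok_words (word) values ('w3m');
	insert into ok_words (word) values ('Bot');
	insert into ok_words (word) values ('Web');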

Now the acid test. The following query tests whether this is satisfactory:

	select agent_string from webagents
	where not exists (select 'x' from ok_words
		where strpos(webagents.agent_string,ok_words.word) > 0 );

This searched through all 12000 entries in the PGTS database in a matter of seconds and displayed the suspicious agent_strings. The production version was much faster, and returned a result in the blink of an eye:

	select agent_id from webagents
	where create_date > now() - interval '2 days'
	and name != 'unknown'
	and not exists (select 'x' from ok_words
		where strpos(webagents.agent_string,ok_words.word) > 0 );

Readers who don't regularly use SQL may find it hard to believe that a query as cumbersome in appearance as the previous example could be so efficient. The "not exists" clause can be a powerful and efficient device for "sorting the wheat from the chaff". What the above query achieves is a listing of any new agent_strings that do not contain any of the words in the table ok_words. Note that the "not exists" clause usually performs well when the columns have been indexed. Obviously full table scans will not perform as well (depending on the size of the tables).
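
A check like this lends itself to automation. The following is a minimal sketch of a nightly job, assuming the database is called pgts and the DBI and DBD::Pg Perl modules are installed (the script and the database name are illustrative, not the production code):

	#!/usr/bin/perl
	# Report recent agent strings that contain none of the ok_words.
	use DBI;

	my $dbh = DBI->connect("dbi:Pg:dbname=pgts", "", "",
		{RaiseError => 1});
	my $sth = $dbh->prepare(q{
		select agent_id, agent_string from webagents
		where create_date > now() - interval '2 days'
		and name != 'unknown'
		and not exists (select 'x' from ok_words
			where strpos(webagents.agent_string, ok_words.word) > 0)
	});
	$sth->execute;
	while (my ($id, $agent) = $sth->fetchrow_array) {
		# Candidates for manual classification.
		print "$id\t$agent\n";
	}
	$dbh->disconnect;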