Round Up The Usual Suspects

By Gerry Patterson

In a previous article I presented a method of parsing logfiles to extract information on visitors, robots and worms. As the traffic at my site increased, I realised that this first script had a few shortcomings. This article presents an updated procedure for parsing the logfiles.


Just who is crawling my site?

Major Strasser has been shot ...
Round up the usual suspects!

- Captain Louis Renault (Claude Rains), "Casablanca".

The simple script that I wrote for parsing the logfiles and gathering visitor statistics relied on the simplistic assumption that any agent who fetched /robots.txt was a robot. And this is a reasonable assumption. For unruly robots, I included a hard-coded list of agent_strings. As it turned out, there was a flaw in this logic. Each month the logfiles are rolled and a new logfile is started. Some quite well-behaved robots might work their way slowly through a list of files, checking robots.txt every fifty files or so. This could mean that the robot checked robots.txt at the end of one month and continued crawling into the next month, when my simple script would count it as a browser.

Eventually, I just added each such robot to the hard-coded list. This was a simple approach that gave a more or less accurate picture of the official crawlers that visit my site. That still left a body of unknown crawlers that did not use an easily identifiable agent string. Some of them would use an agent string which, although unique, was sufficiently similar to other agent strings that it could be mistaken for a variant of those agents. This camouflage may have been adopted for practical reasons, such as getting around the practice of serving up specific pages for specific agents. Or they might have more sinister reasons for adopting camouflage.

The most obvious camouflage to adopt would be MSIE 6.0, the most popular browser (currently accounting for 36.83 percent of hits to this site). Some camouflaged crawlers would still make the agent string unique, so it would be possible to use the agent string to identify them; however, others adopted a string that was indistinguishable from genuine browsers. These can only be identified by their behaviour. Interestingly enough, there was a camouflaged crawler which pretended to be the GoogleBot (see bibliography).

Needless to say, this meant that the process of identifying crawlers and gathering statistics was becoming complex. It has become very difficult to gather statistics and identify crawlers in a single pass of the logfile, because the decision about the nature of an agent can only be made after statistics have been gathered on its pattern of behaviour. In choosing an observation interval, I had to balance the fact that some robots crawl very slowly, sometimes fetching from different IP addresses (the classic Robot Gang behaviour described in the earlier article), against the fact that other, less polite crawlers just slurp up huge volumes in an interval of a few seconds. Some come from IP addresses that have been permanently allocated and others come from dial-in connections. I have chosen eight hours as a (nominal) session time cutoff, which represents a good compromise between these considerations. Also, it has become necessary to store the information in a database. I have chosen postgres for the RDBMS, but any database will do. Even text files would do. The reason I chose a database was to facilitate reporting.
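
By way of illustration only, here is a minimal sketch of how such a cutoff might be applied, assuming that a new session starts whenever more than eight hours elapse between consecutive hits from the same IP address (the variable names are hypothetical and not taken from the suspects script):

	use strict;
	use warnings;

	my $CUTOFF = 8 * 60 * 60;	# nominal session cutoff in seconds

	# %hit_times maps an IP address to a list of hit times (epoch
	# seconds) assumed to have been collected from the logfile.
	my %hit_times = ( '10.0.0.1' => [ 1000, 1200, 50000, 50060 ] );

	for my $ip (keys %hit_times) {
		my @sessions;
		for my $t (sort { $a <=> $b } @{ $hit_times{$ip} }) {
			if (!@sessions or $t - $sessions[-1][-1] > $CUTOFF) {
				push @sessions, [ $t ];		# gap too long: new session
			} else {
				push @{ $sessions[-1] }, $t;	# same session continues
			}
		}
		printf "%s: %d session(s)\n", $ip, scalar @sessions;
	}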

Furthermore, the procedure for parsing logfiles has now been broken into two phases. The phase presented here runs every 24 hours, examines agent behaviour and flags each agent as Browser, Robot or Suspect. Robots are identified by their agent strings. Robot suspects are identified by their behaviour, and anything else is considered a browser. Any agent which has a unique agent string that is not like a known browser, and is only ever involved in suspect behaviour, is given a permanent entry in the robots table at the end of the month. Thereafter, agents with that string will be considered robots.

So what constitutes robot (or crawler) behaviour? Like so many problems in pattern recognition, this is an easy task for a human to perform and quite a difficult one to put into an algorithm. After sifting through the logfiles and comparing what I thought were obviously robotic behaviour patterns with what seemed obviously human, I came up with the following criteria (a rough sketch of how they might be combined appears after the list):

  1. Crawlers most often have an empty referer string.
  2. Crawlers usually fetch files of one type. Either all text or all graphics (more about these later).
  3. Aggressive crawlers will fetch large numbers of pages in a very short time. Any agent that fetches more than a human would be capable of reading is usually a robot.
  4. Agents that use the HEAD command are usually robots.
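
To make these criteria concrete, here is a rough, hypothetical sketch of how they might be combined into a test for a single session. It is not the code from the suspects script; the counter names and the thresholds (such as 120 pages per hour) are assumptions chosen purely for illustration:

	use strict;
	use warnings;

	# Decide whether one session of hits looks robotic, based on the
	# four criteria above. Counter names and thresholds are assumed.
	sub is_suspect {
		my ($s) = @_;
		my $score = 0;
		$score++ if $s->{hits} and $s->{empty_referers} == $s->{hits};	# 1. empty referer
		$score++ if $s->{text_hits} == 0 or $s->{graphic_hits} == 0;	# 2. one file type only
		my $hours = ($s->{duration} || 1) / 3600;
		$score++ if $s->{page_hits} / $hours > 120;	# 3. faster than anyone could read
		$score++ if $s->{head_requests};	# 4. uses the HEAD command
		return $score >= 2;	# two or more criteria looks robotic
	}

	my %session = (hits => 40, empty_referers => 40, text_hits => 40, graphic_hits => 0,
		page_hits => 40, duration => 90, head_requests => 0);
	print is_suspect(\%session) ? "suspect\n" : "browser\n";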

At first, I thought that agents which fetch only graphics files must be robots. However, I discovered some agents which seemed to be working in concert with a browser. These seem to be caching agents. The purpose of such an agent would be to gather graphics and cache them for the browser.

For example, several months ago, I identified an agent named BorderManager as a crawler. It certainly seemed to fulfill the necessary criteria. In fact, some of you may have noticed that BorderManager version 3 was previously listed here as a crawler that showed a preference for images and icons. The CIDR is owned by a security and technology company, which seems legit. Still, I wondered why they were crawling my site looking for images. In fact they don't (or not deliberately). A close inspection of the logfiles reveals that the guys from this company also visit my site with browsers. The browsers are regular GUI browsers (like MSIE) and they behave normally, except that they only fetch HTML and text pages! It looks as though the agent called BorderManager is fetching the graphics and then serving them up to the browsers. I suppose this is done via some sort of cache (perhaps a proxy server within their network). It seems the BorderManager agent will go out and refresh the cache at odd intervals. And I am guessing here ... that the programmer who wrote this wanted to give the impression that the agent is well behaved, and so it hits "/robots.txt". Either that, or one of the users decided to take a peek at "/robots.txt". This was why the first version of the script flagged that agent as a crawler. I probably glanced at the pattern of behaviour at the time and decided that it was a crawler, and so BorderManager went into my robots file. It seems that BorderManager comes back to fetch graphics at odd stages. Possibly this is just a mechanism for keeping the cache fresh until the users stop requesting files that contain a particular graphic.

A similar agent is called "Dual Proxy". This seems to come along with a regular browser and pick up all the graphics. This one seems quite popular, and since it was visiting from so many different source IPs, I did not flag it as a crawler, although at one stage it was a crawler suspect.

I also found examples of agents which perform a similar function, but arrive from an IP address that is different from the process (with the same agent ID) which is fetching HTML and text. Some of the organisations that employ this technique are big names; in fact, some of the biggest on the Internet. The purpose seems to be, once again, caching. This caused a problem for my script. In order to recognise this behaviour, I have resorted to a kludge. If the script finds a hit to a regular page and then, within 30 seconds, another hit arrives with the same agent string from the same 16-bit subnet, it aliases the IP (i.e. makes it identical to the original IP). This is done purely for the purposes of the statistics analysis algorithm. Ok, it's ugly. I make no further apologies. I just wanted to get it done.
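
For what it is worth, here is a minimal sketch of that kludge, assuming IPv4 dotted-quad addresses; the variable and subroutine names are hypothetical rather than those used in the real script:

	use strict;
	use warnings;

	my $WINDOW = 30;	# seconds
	my %last_page_hit;	# "subnet agent" => [ time, original IP ]

	# Called for every hit while scanning the logfile. A graphics hit
	# that shares the agent string and 16-bit subnet of a page hit seen
	# within the last 30 seconds is treated as coming from that IP.
	sub alias_ip {
		my ($time, $ip, $agent, $is_page) = @_;
		my ($o1, $o2) = split /\./, $ip;
		my $key = "$o1.$o2 $agent";
		if ($is_page) {
			$last_page_hit{$key} = [ $time, $ip ];
		} elsif (my $seen = $last_page_hit{$key}) {
			return $seen->[1] if $time - $seen->[0] <= $WINDOW;
		}
		return $ip;
	}

	print alias_ip(1000, '203.10.1.5', 'MSIE 6.0', 1), "\n";	# page hit
	print alias_ip(1010, '203.10.2.9', 'MSIE 6.0', 0), "\n";	# aliased to 203.10.1.5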

I am so behind schedule on this documentation that I am going to publish the script without extensive comments. It is very terse and hastily written. It also makes extensive use of referencing and de-referencing. If you don't understand arrays, hashes, referencing and de-referencing in Perl, then this code will look more like the semi-random scratchings of a flock of chooks in a pen than a sequence of coherent and logical commands. If enough people show interest in the script, I will add some more comprehensive comments. The reason for such an obscure approach (i.e. referencing and de-referencing) was efficiency. Rather than use hashes to record values for certain agent strings, I have assigned each agent a unique number and used this number as an array index.

Perhaps an example will illustrate. If you wanted to record the number of hits that a certain agent has been responsible for, you could use the following statement in perl:

	$Hits{$agent}++;
Where $agent is the variable which contains the agent string and %Hits is the hash that (ultimately) will deliver the total hits that each agent string has been responsible for. On the other hand, if we assign each agent string a unique number, we would use:
	$Hits[$agent_id]++;
Where $agent_id is the variable which contains that unique number, and @Hits is an array (rather than a hash) containing the total number of hits.

After spending a substantial portion of my life tinkering with databases, I have come to realise that most problems in data manipulation come down to assigning unique numbers, in the virtual world, to things that are usually identified by strings in the real world. I know that perl is incredibly efficient at handling hashes and I remain mightily impressed with this powerful feature. But the efficiency gains of reducing this to a simple number are hard to match, even though perl effectively treats all variables (string and numeric) as equivalent.
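
As a simple illustration of the idea (not the exact code from the script), the unique numbers can be handed out on the fly the first time each agent string is seen:

	my %AgentID;	# agent string => unique number
	my @Hits;	# hit counts indexed by that number
	my $next_id = 0;

	# Hand out a number the first time each agent string is seen.
	sub agent_id {
		my ($agent) = @_;
		$AgentID{$agent} = ++$next_id unless exists $AgentID{$agent};
		return $AgentID{$agent};
	}

	my $agent_id = agent_id('ExampleBot/1.0');	# hypothetical agent string
	$Hits[$agent_id]++;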

I decided to solve the problem of storing a variable amount of information for each IP address with referencing and de-referencing. In this script I store the time, agent_id, page_type and referer type for each hit. These are all integers. The time has been reduced to the system time (an integer). Agent_id is, as described previously, a unique integer, and page_type and ref_type are two variables which, if you check the code, I have also contrived to be integers. So it is possible to create an array of four items which are, in fact, four integers, and to store a reference to this array in another array, whose reference is in turn stored in a hash. The code to do this is as follows:

	my @v = ($htime,$agent_id,$page_type,$ref_type);
	push (@{$IPdata{$IP}},\@v);
I know this is starting to get a bit obscure. What this achieves is the creation of a hash entry whose value is a reference to an array of references, where each of those references points to one of the four-cell arrays. The my declaration makes certain that perl restricts the variable @v to the enclosing block, but the memory allocated to the array hangs around as long as there is a live reference to it (i.e. the reference stored in the hash of arrays). So the value of $IPdata{$IP} is actually a reference to an array. That array contains references to more arrays, and each one of these arrays has four items -- ($htime, $agent_id, $page_type, $ref_type). Is that clear? As clear as mud, no doubt. But if you understand that \@v represents a reference to the array @v, and you understand the concepts of arrays, hashes and scope, you should be able to figure it out. If you didn't understand any of the preceding sentences, then the only reason you might have for checking the source code (see bibliography) is to marvel at how anyone can express such a torrent of inane computer techno-babble and still convey a vague sense of coherence.
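
For anyone who would like to see the structure unwound again, here is a small sketch, using a couple of hypothetical hits, of how the data recorded against an IP address can be walked and de-referenced:

	# Populate the structure with a couple of hypothetical hits,
	# in the same way as the fragment above ...
	my %IPdata;
	for my $hit ( [1044000000, 3, 1, 0], [1044000015, 3, 2, 0] ) {
		my @v = @$hit;
		push (@{$IPdata{'10.0.0.1'}}, \@v);
	}

	# ... then walk the array of references stored against each IP and
	# unpack every four-cell array back into its components.
	for my $IP (keys %IPdata) {
		for my $ref (@{ $IPdata{$IP} }) {
			my ($htime, $agent_id, $page_type, $ref_type) = @$ref;
			print "$IP $htime $agent_id $page_type $ref_type\n";
		}
	}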

Initially, I created the agent_id on the fly. However, once I decided to put an RDBMS on the backend, something like agent_id became an obvious candidate for a unique key. I have chosen postgres as the backend. The code to create a table with unique keys might look something like this:

	create sequence agent_id_s;

	create table agents (
		agent_id	int not null default nextval('agent_id_s'),
		agent_string	varchar(256) not null unique,
		name		varchar(40) not null,
		version		varchar(15),
		os		varchar(40),
		ip_addr		varchar(2048),
		comments 	varchar(512),
		owner		varchar(256),
		hits		int,
		last_visit	timestamp not null,
		create_date	timestamp not null,
		update_date	timestamp not null );

	create unique index webagent_ndx on agents(agent_id);
This means that, as long as insertion of new values into the agents table is under the control of scripts like the perl script (see the bibliography), the agent_id will automatically be assigned the next value from the sequence. New agent_ids are handled inside the suspects perl script with this assumption. Obviously there is quite a lot of backend code required for the RDBMS. I have not included any of this, since the choice and configuration of the backend is site specific.
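
To give a flavour of what that looks like from Perl, here is a minimal sketch of inserting a new agent and reading back the generated agent_id, using the standard DBI module with the postgres driver. The connection details and the subset of columns shown are hypothetical:

	use strict;
	use warnings;
	use DBI;

	# Hypothetical connection string; adjust dbname/user/password to suit.
	my $dbh = DBI->connect('dbi:Pg:dbname=pgts', 'www', '', { RaiseError => 1 });

	# Leave agent_id out of the column list so that the default
	# nextval('agent_id_s') supplies the next number in the sequence.
	my $sth = $dbh->prepare(q{
		insert into agents (agent_string, name, hits, last_visit, create_date, update_date)
		values (?, ?, ?, now(), now(), now())
	});
	$sth->execute('ExampleBot/1.0 (+http://example.com/bot)', 'ExampleBot', 1);

	# Read back the number the sequence just handed out.
	my ($agent_id) = $dbh->selectrow_array(q{select currval('agent_id_s')});
	print "new agent_id: $agent_id\n";
	$dbh->disconnect;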

The backend script agent_data reads the details from the agents table. This table contains the agent_string, agent_id and other details about each agent, including a flag to indicate whether the agent is a robot.
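
Again purely as a sketch (the column holding the robot flag is assumed here to be called robot, which may not match the real schema), loading that table into the sort of in-memory structures used by the parsing script might look like this:

	use strict;
	use warnings;
	use DBI;

	my $dbh = DBI->connect('dbi:Pg:dbname=pgts', 'www', '', { RaiseError => 1 });

	my (%AgentID, %IsRobot);	# agent string => id, agent string => robot flag
	my $sth = $dbh->prepare(q{select agent_string, agent_id, robot from agents});
	$sth->execute;
	while (my ($string, $id, $robot) = $sth->fetchrow_array) {
		$AgentID{$string} = $id;
		$IsRobot{$string} = $robot;
	}
	$dbh->disconnect;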


BIBLIOGRAPHY:

PGTS Robot Info: When is the GoogleBot not the GoogleBot? Here is a robot that (almost) pretends to be the GoogleBot, but does not come from Google. They have made the agent string unique, so it is not employing deep cover; still, it would be easy to mistake it for the GoogleBot, and that might be why they have used this technique. Very few sites say no to the GoogleBot. In any case, it appears to have been a one-off experiment.

Source code: Suspects -- Perl script. Here is the rather raw source code for the suspects script.