Examining Apache logfiles with perl

Parsing the website logfile

By Gerry Patterson

If you are a webmaster or a content provider, you may have many questions about the traffic coming to your website. You might like to know where it it is coming from, or when traffic is heaviest. You might want to check on the effectiveness of promotional campaigns, or you might wish to see which pages are most popular with a certain range of users. If you run an Apache server on Linux, the answers to questions like these can be found in the logfiles. There are many programs which promise to extract the information for you. However if you have a unique requirement, you will need to go back to the source of the data. And the raw data is in the logfiles. Even if you are not the sysadmin, you can still do this, provided you understand perl, and you can persuade your sysadmin to give you a copy of the log file. This article discusses how to process the information from apache logfiles with perl, and also contains some observations about robot behaviour.

The Answers are in the Logfiles

The information in the log file is self-evident. I was able to deduce the meaning of the fields I required without reading about the log file format. However some of the questions I later had about user-agents, did require some background reading. Here is what I found out about the log file format:

The format of the log file varies according to your installation. Synonyms for log file format are usually defined in the Apache configuration file. These are defined with the LogFormat directive. The so called common logfile format is defined in my configuration file as follows:

LogFormat "%h %l %u %t \"%r\" %>s %b" common

where:

%h is the remote host as an IP address or as hostname (depending on whether DNS Lookup is on)
%l is the remote logname of the user.
%u is the authenticated username.
%t is the timestamp of the request enclosed in []. Default format is local time plus TZ.
"%r" is the request command as it came from the client.
%>s is the HTTP status code returned to the client.
%b is the size, in bytes, of the file that was transferred as a result of the request.

A full description of the logfile format and how to customise it can be found at www.apacheweek.com/features/logfiles.

I find the pre-defined combined format to be well-suited to a small site. It gives all of the above information plus two additional fields:

Referer The original URL from which the request was transmitted.
Agent A string that identifies the application. For a regular visitor this will be sent from the browser. There is a long, sad history behind this string. Originally it was intended that it would identify a client application (browser) and system. However, the agent string sustained some collateral damage during the "Browser Wars". As a result, many browsers today report themselves as "Mozilla" (Netscape), and reveal their true selves in the system string (in brackets). Some applications disguise themselves completely by reporting a fake string. I have seen this behaviour referred to as "cloaking". However, there seems to be some disagreement about the actual definition of "cloaking". Some people use it to refer to the process of feeding spam to a search agent. And others use it to refer to the process of stealing spider addresses and using them.

Parsing the logfile.

Perl is the obvious choice for parsing log files. I created a script called "hits". In its' initial form this script would just count the number of hits and ignore Microsoft worms. By Microsoft worms, I mean programs that are scanning for a susceptible Microsoft IIS. I must say I was surprised to discover how many of these there are. When I first brought my website online on the 1st of March, 2002, I was amazed to discover that someone had visited my site after only one hour. The reason for my amazement was that nobody knew about my site, apart from a few friends. And yet there were hits already! My joy soon turned to disappointment when I saw entries like the following in the log file:

203.255.28.165 - - [01/Mar/2002:01:57:34 +1100] "GET /scripts/root.exe?/c+dir HTTP/1.0" 404 309 "-" "-"

Oh Dear! I thought, some misguided little Microhead is kicking my tyres. I tried a reverse lookup on the address to see if I could find something about who it was. Since the attacks looked about as threatening as assault with an over-ripe banana, I didn't lose too much sleep worrying about them. But the hits just kept on coming! Always from different addresses. It soon became apparent that these attacks were the work of mischevious programs, mostly variants of Code Red and Nimda. I have never payed much attention to malicious programs. Since I have chosen to use Unix and have avoided MUAs (Mail User Agents) like Outlook, which help transmit them. And so the news about the latest virus or worm usually hasn't held more than conversational interest for me. In some ways such news is a form of post-modern social glue. Colleagues can gather in coffee shops and pubs and exchange war-stories about how they were infected with such and such a computer virus.

So I modified the hits script to count the number of attacks from Microsoft worms. When these programs first appeared they caused degradation of network performance, not because they were capable of attacking Unix systems, but because one of the consequences of so many Microsoft systems aggressively scanning for other Microsoft machines was reduced bandwidth. In fact if it were not for a bug in the initial version of Code Red, the Internet could easily have been brought down! (scary isn't it?). I find it disturbing that not much attention has been given to this fact in the mainstream media.

Later when I managed to entice some robots to visit. I thought it might be good to track how many had visited and how many pages they indexed so I included the ability to count robots. Getting robots to visit was another saga or more correctly an exercise in self-promotion which I have documented in the article Getting Listed in The Search Engines.

And yet later, when I finally started to get visitors, I wanted to count which pages were getting hit. And I thought it might be nice to know which browsers were doing the hitting. So some further additions were made. By this stage I was rolling the logfiles periodically (on the first day of the month). This was done with cron. Since my monthly figures were so tiny, I thought it would be good to read in the previous month's figures to give a running total. So I added the feature of reading, from a file, the accumulated head count (except that I used the term bum count, in keeping with the show-biz terminology of "bums on seats"). I also read in the accumulated hit count statistics. So for example, if you store your statistics in a file called /home/mycron/site.stats, you can add these figures to the current total by using a command like:

hits -is /home/mycron/site.stats /mypath/current_log

Where /mypath/current_log is the file spec for the current log file. Of course this is not much good unless there is also an option to write the accumulated figures to a file. So I added the -o option. The statistics file can be created with the command:

hits -o /home/mycron/site.stats /mypath/previous_log

Where /mypath/previous_log is the file spec for the previous log file. This command would be run each month immediately after the log files are rolled. If the file /home/mycron/site.stats exists, it reads in the running totals. When it has finished the calculations the new figures are output to the file /home/mycron/site.stats. This job is an obvious candidate for being cron'ed.

Putting the stats on a web page

The last enhancement was to write a script that updates my home page with site statistics. A job can then be cron'ed to update the home page (or whatever) with the these statistics. The format of the cron'ed job would be something like:

#!/bin/bash
hits -i /home/mycron/site.stats -s /mypath/current_log | upd_home > /tmp/index.tmp
if [ $? -ne 0 ] ; then
	echo error encountered processing hits
	exit 1
fi
grep -q 'Some Phrase in MY HOME PAGE' /tmp/hits.tmp
if [ $? -ne 0 ] ; then
        echo $0: Failed Sanity Check!
	exit 1
fi
touch -r/MyHTML/index.html /tmp/index.tmp
mv /tmp/index.tmp /MyHTML/index.html

Where /MyHTML/index.html is the site home page. The script "upd_home" can be found in the PGTS Download area. If you look at the shell script above you can see that there is a sanity check that looks for a phrase unique to your home page (add your own phrase). This is to minimise the risk of clobbering your home page. You may prefer not to preserve the date of the file. I have a reason for doing it (using touch). And that is because of the method I use for moving files from test into production.

Robot Gangs?

By this stage the little script had grown to over 200 lines (with comments). If you inspect the code in the "hits" script, in the PGTS download area, you can see that it relies on the following empirical assumptions about the data:

Malicious programs usually set the user-agent to "-".
Well behaved robots start their scan with "/robots.txt"
Well behaved non-Netscape applications often have the word "compatible;" in the system string to indicate that they are not really Netscape. For example Internet Explorer version 6.0 on NT 4.0 should report the following agent string:
```
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 4.0)
```

As with most empirical assumptions there is no guarantee that they will always hold. There was also a problem with the original script. If you look at the logic that is used to count individual visits, you will see that the IP address is used to uniquely identify each process. This seemed reasonable to me when I designed the script. And for human visitors it is reasonable. However, some robots seem to visit in small gangs. Each gang seems to be on the same subnet. I will illustrate with an example. Here are some statistics on the behaviour of the Googlebot:

IP Address      Date        Time      Target
-------------   ----------- --------  -------
216.239.46.79   22-May-2002 10:22:11  /robots.txt
216.239.46.79   22-May-2002 10:22:12  /
216.239.46.58   22-May-2002 10:33:31  /page01.html
216.239.46.79   22-May-2002 10:33:32  /page02.html
216.239.46.79   22-May-2002 10:44:43  /pgtsj/pgtsj0205b.html
216.239.46.79   22-May-2002 11:04:36  /page03.html
216.239.46.200  22-May-2002 11:04:49  /download/
216.239.46.8    22-May-2002 11:13:30  /download/misc/
216.239.46.200  22-May-2002 11:13:39  /download/txt/
216.239.46.8    22-May-2002 11:13:46  /download/humour/

There are more entries but always from these four IP addresses. Which would appear to be the work of a gang of four robots. The alternative explanation of a solitary robot that changes its' IP address four times seems less likely. Having said that I must confess that without insider knowledge, this is pure speculation. The behaviour could just be an artifact due to the Google engine running on some sort of parallel processing cluster or it could indeed be a single robot that changes its' IP address. I am already indulging in the popular web pastime of passing unsubstantiated comments on Search Engine behaviour, so I must be a bona fide web citizen now. In any case, this Googlebot session/visit appears, at my end, to be the work of a gang of four robots on the same 24-bit subnet. The lead member of the gang on 216.239.46.79, fetches "/robots.txt" at 10:22:11, and then one second later fetches the root index. Eleven minutes later the next member of the gang on 216.239.46.58 fetches "/page01.html". Furthermore the logic of the original "hits" script would fail to identify subsequent members of the gang as robots. This is because the IP addresses are different. Initially I just used a workaround of putting in a hard-coded agent list for robots that seemed to work in gangs. This is messy, Because additional robots will need to be added to the list if they also use this gang technique. Also some robots may adopt this technology when their Search Engine is upgraded. And even though they are working as a team, each robot waits at least ten minutes after its' colleague has fetched some data, before commencing its' work. This would cause my original script to count a hit from a robot as a new visit, because sometimes a certain member of a gang may wait more than thirty minutes before hitting another page. Strictly speaking the whole gang should be counted as one visit (rather than four) for the duration of the indexing session. The statistics from my original script over-represented robot visits. It seemed that the best way to tackle this problem, was to use the 24-bit subnet and the agent string as the unique identifier for robots. This is the meaning of the variable "$GangID" in the script. Two things that will break the new algorithm that results from this assumption are:

Simultaneous visits from different robots with the same agent string on the same subnet.
A gang of robots spread across more than one subnet.

These occurences would seem unlikely. There is another thing that will break the algorithm. And that is badly behaved robots (see below). In which case, the hard-coded agent list seems like a good back-up to use as a kludge for unexpected robot behaviour.

Finally some hits started to come in. Now I wanted to find out who had referred a visitor to my site. So I added a modification for the reporting of referer stats. There were two types of reports. One of them featured the fully-qualified referer (with the query portion of the HTML request removed) and the other showed just the domains (which I find more useful). The query portion of the HTML request can be seen as the part that follows the question mark on the full HTML request. When more hits come in from the search engines, I may also add some code to analyse hits with respect to key phrases, and to see which key phrases visitors are entering. I also added an option to show stats for each IP address and each subnet. In its' present form, the hits script is cron'ed to run every 30 minutes and update the hits statistics in my home page. I also use it to report on activity at various times. The following is a list of options that can be added to the command line:

-a - print a list of agents.
-A - print a list of agent short names with hit statistics.
-d - print a list of referer domains with hit statistics.
-f - print a list of referers with hit statistics.
-i input_file - read the initial values from input_file.
-n - print a list of hosts (IP address) and visits.
-N - print a list of subnets and visits.
-o output_file - write final value of variables to output_file.
-p - print a list of pages with hit statistics.
-r - print a list of robot visits.
-s - print a summary of statistics (can be used to update HTML).
-V - print a list of visitor hits.
-v - print a list of visitor visits.

This script is a little cumbersome to use from the command line. I wrote it with the intention of using it in other scripts. Which is why I leave it up to the calling routine to specify the log file. So I wrote a small shell script which can be called from the command line to give instant feedback on the current state of the site. This saves me typing the full file spec of the log file every time I wish to view statistics. The shell script is called hit and contains the most common hits commands I use. The script is as follows:

#!/bin/bash
# Put your paths here ...
logfile=/mypath/current_log
statfile=/home/mycron/site.stats
case $1 in
	-t)
		hits -si $statfile $logfile
		;;
	-p)
		hits -p $logfile | less
		;;
	-d)
		hits -ds $logfile
		;;
	-f)
		hits -f $logfile | less
		;;
	-v)
		hits -v $logfile | tail -20
		;;
	-r)
		hits -r $logfile | tail -20
		;;
	*)
		hits $* $logfile | less
esac

I saved this script as "hit". So If I wanted to total hits for the site I can type hit -t.

Customising the Scripts for Your Site.

If you think you may want to use these scripts there are some modifications you may need to make. There are quite a few of them so I will put them in a list

If your site does not use the Apache combined LogFormat you will have to alter the parsing of the various fields. I have used split() to parse the input lines.
Modify the Sanity Check in the perl script. There are probably quite a few changes that need to made to the sanity check. For example you will have to modify the part of the sanity check that validates the timezone (unless you live in South Eastern Australia). As it stands the sanity check looks like:
```
die "Error insane at line $.:$_" unless
	( $w[0] =~ /[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+/ &&
	  $w[1] =~ /-/ && $w[2] =~ /-/ &&
	  $w[4] =~ /^\+1[01]00\]$/);
```
The fourth line is checking the timezone, which for South Eastern Australia is GMT +1000 or GMT +1100. If you live in New York then the last line would be:
```
	  $w[4] =~ /^\-0[45]00\]$/);
```
Alternatively you could just leave the timezone check out altogether. I prefer to validate wherever possible. And if you know that a certain field in your site's data will always have a particular value, it is a good candidate for a sanity check.
If your site uses DNS to convert IP addresses in the logfile then you will have to remove the section of the sanity check that looks for an IP address in the first field. Also the code that constructs the GangID will be broken. You will have to fix it. I have chosen not to have hostnames, because this can lead to a mixture of IP addresses and hostnames. And this is less amenable to manipulation with scripts. However if you are not the owner and sysadmin of your site, you will have to live with it.
Add (or remove) values for your subnets. Most people would want to exclude the local host (127.0.0.1) from hits stats. Other subnets can be excluded by using a little imagination and regular expressions.
Change the upd_home script to suit your own home page.
Depending on your directory structure, you will have to modify the hit exclusion lists for robots and visitors. In this version anything beginning with "/icons" or "/images" is excluded. However if you don't store images and icons in a separate directory, you will have to change the regular expression to check file types. Unless of course you want to count hits on image files. Which is not very honest if you are using the figures to boast about hits.
You may wish to add to the list of submission aids. These are agents like the W3C Validator that you may have invited to your site. And strictly speaking should not be part of your hits stats. Once again it would be dishonest to count them as hits.
Send me an email with any comments. There is no obligation, but I would like to hear from you.

Unruly Robots

This script will detect most well-behaved robots. Unruly robots can be added to the hard-coded list of robots. The following is a list of strange and/or unruly robots:

larbin_unspecifiedBot. This script, even in its' original form managed to identify this rather mysterious robot. This is not its' real name, however I have used this nickname to refer to the robot which always has "larbin@unspecified.mail" or a variant in the system portion of the agent string while constantly changing the rest of the agent string (such behaviour is sometimes referred to as "cloaking", see comments earlier). Furthermore, this robot has an in-addr.arpa entry of gw.ocg-corp.com, which is not a registered domain. All of which suggests that our little mate larbin@unspecified.mail might be dallying with the Dark Side or at the very least being naughty. I was able to google a genuine robot called Larbin created by Sebastien Aillereti who has made it available along with the source code under a GPL. If you look at Sebastien's source code, you will see that larbin@unspecified.mail is the default e-mail in the larbin.conf configuration file. The instructions in the README request that you replace this with your real e-mail address. It seems likely that someone has copied Sebastien's code and used it to create a robot that prospects for and mines email addresses buried in web-pages for subsequent use in a spam operation. Because of this robot's suspicious behaviour I added another hard-coded list to force mutating cloaked agent strings to be a single consistent string. I only did this because it was annoying to see several agent strings for the one robot. I could have just banned its' IP address, but it is not worth the effort since it doesn't seem to visit often and if I ban it and it has more than one IP address, I may have to chase all of them down.
Mozilla/3.01 (compatible;). This robot is reasonably well-behaved. It is mysterious because it fetches only graphic files. I can find no mention of an official robot that does this. It seems to hit hard for a brief interval and then go away. Also, it seems to come from all over the web. It may be some sort of tool used by people interested in graphic design. Or it might be a malicious program. Or maybe I'm getting paranoid?
ExactSeek Crawler/0.1. This robot would, at face value, have been sent by the ExactSeek Search Engine, which is a high profile engine. It bypasses /robots.txt, hits the home page and goes away. That's ok. However it needs to be added to the hard-coded list.
Robozilla/1.0. Ditto. This robot also bypasses /robots.txt.
Mystery Robots. There are some mysterious agents that display robot-like behaviour, and cannot be identified by the script in its' present form. For example on 04-Jun-2002, an agent identified by the string "Mozilla/4.0 (compatible; MSIE 4.01; Windows NT)", visited my site and behaved in an impolite robot-like manner. It did not fetch /robots.txt and proceeded to agressively scan. The scan pattern seemed to follow page popularity. The activity was as follows:
```
Date        Time      Page
----------- --------  ----------------------
04-Jun-2002 11:24:53  /pgtsj/pgtsj0204a.html
04-Jun-2002 11:24:54  /page02.html
04-Jun-2002 11:24:55  /pgtsj/pgtsj0204c.html
04-Jun-2002 11:24:57  /page01.html
04-Jun-2002 11:24:59  /page03.html
04-Jun-2002 11:25:01  /download/
04-Jun-2002 11:25:12  /pgtsj/pgtsj0206a.html
04-Jun-2002 11:25:14  /download/scripts/upd_home
04-Jun-2002 11:25:14  /download/scripts/hits
04-Jun-2002 11:25:37  /bs_cv.html
04-Jun-2002 11:25:45  /pgtsj/pgtsj0206c.html
04-Jun-2002 11:25:50  /download/scripts/mlink
04-Jun-2002 11:25:55  /gp_cv.html
04-Jun-2002 11:26:18  /pgtsj/pgtsj0206b.html
```
The mysterious aspects of this behaviour are; The agent string which claims to Internet Explorer 4.01 on Windows NT; And the address it came from, which was IP 63.148.99.233. I could not ping this address a few hours after the visit. This behaviour seems similar to MSIE Crawler, which is also an impolite robot. However, even MSIE Crawler identifies itself. Now, unless there is some sort of plug-in for IE 4.01 or an early version of MSIE Crawler (and there could be -- I am not very familiar with IE 4.01) this behaviour would border on sinister. If anyone knows about mysterious cloaked crawlers that do not fetch /robots.txt and agressivly scan the more popular pages and local outbound links, and then nick off, I'd like to hear from you.

Conclusion

This suite of perl and shell scripts is very handy for giving me instant feedback whilst at the command line. The hits script can be used to automatically update a hit counter at specified intervals.

You are welcome to take a copy of these scripts. There are no restrictions on their use for any legitimate purpose.