Extracting Keywords From The Apache Logfile With Perl

By Gerry Patterson

This article describes a technique for extracting search engine keywords from an apache logfile with perl. It is illustrated with actual logfile entries that use the apache "combined" format. These examples have been modified: the IP addresses have been altered in order to protect the identity of the users.

The Source Code is available for download.


The Keys To The Kingdom.

This article describes a perl routine which parses an apache logfile and extracts keywords. The source code can be viewed here.

The topic of "keywords" seems to be quite a common one amongst webmasters. Keywords can give a good idea of what users are looking for when they use the search engines. Looking at them can be illuminating and amusing. However, in my opinion, it would be inadvisable to try and tailor documents to meet common keyword phrases, since blatant tailoring of a page might be deemed to be an attempt to "spam" the engines. Nevertheless, investigating keywords might give web-authors an inkling of the "hottest" topics and possibly shape the choice of future topics, in the same way that a working muso eventually relents and learns to play "Piano Man" and "American Pie", if only to give the punters what they want.

In the past, whenever I wanted to investigate key search phrases, I would just peruse the logfile and guess the search terms from the general appearance of the log entries. For example, the following line can easily be deciphered by eye:

192.168.1.1 - - [01/Jun/2003:04:16:04 +1000] "GET /pgtsj/pgtsj0204c.html HTTP/1.1" 200 16076 "http://www.google.com/search?hl=en&lr=&ie=UTF-8&oe=UTF-8&q=build+cyrus+IMAP+cygwin" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)"

For someone familiar with logfiles, it should be easy to see that this represents a visit from 192.168.1.1 on 01-Jun-2003 at 04:16 (GMT + 10hrs). The visitor was using MSIE 6.0 with .NET extensions and was referred to /pgtsj/pgtsj0204c.html by Google, after searching for the keywords "build+cyrus+IMAP+cygwin". Logfiles, however, can be difficult to read. Some of the query strings may be quite obscure when they contain many hex codes. A simple example like the following is still easy to decipher:

192.168.1.1 - - [01/Jun/2003:08:34:45 +1000] "GET /download/misc/ HTTP/1.1" 200 2130 "http://www.google.com/search?q=%22perl+5.6%22+msi&hl=en&lr=&ie=UTF-8&oe=UTF-8&start=10&sa=N" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)"

The character string "%22" represents the '"' character, which is 0x22 (hexadecimal 22) in ASCII. We can surmise that this user was searching for an exact match on "perl 5.6". However a large number of hex characters is going to tax even the most hairy-chested, bare-knuckled and grizzled old hacker. Still even if you are veritable hex wizard able to decipher hex representation of ASCII at a glance, the information can become obscured by the clutter of everything else in the logfile. Logfiles contain so much detail that it is difficult to separate the wheat from the chaff.

And you know what I am going to say next ... Obviously this calls for a program that extracts search phrases from the logfile. At first glance, this appears to require a simple script that breaks the query string up into individual parameters. Of course nothing is as simple as it seems at first glance.

The first thing that is required is a subroutine that parses the logfile (i.e. breaks it up into individual components). The logfile at the PGTS site uses the apache combined format which consists of the following fields:

	remotehost login authuser [date] "request" status bytes "Referer" "Agent"
	where:	remotehost = IP address
		login      = remote login as per RFC931 (always -)
		authuser   = authenticated username (always -)
		[date]     = timestamp and tz (always +1000 or +1100 for VIC)
		"request"  = request cmd sent from the remote agent (enclosed in quotes)
		status     = numeric status returned by apache
		bytes      = number of bytes transmitted
		"Referer"  = URL of the Referer (enclosed in quotes)
		"Agent"    = name of the remote user agent (enclosed in quotes)

The PGTS webhost is configured not to try and resolve host addresses. This is futile, since many of them cannot be resolved, and it is better to have consistent output. Also, if you do not have the referer string in your logfiles then you will not be able to extract search strings.

The following subroutine will perform the parsing function:

# NB: timelocal() comes from the Time::Local module, and %mth is assumed to
# map the month abbreviations 'Jan' .. 'Dec' to the values 0 .. 11.
sub parse_log{
	my @w = split ( ' ', $_[0]);
	# remove the '[' from the date and convert to timestamp with timelocal()
	$w[3] =~ s/^\[//;
	my @t = split(/:/,$w[3]);
	my @d = split( /\//,shift( @t) );
	my @Htime = (reverse(@t),$d[0],$mth{$d[1]},$d[2] - 1900);
	my $ltime = timelocal(@Htime);
	# Original HTML cmd, referer and agent are all enclosed in '"'
	@t = split( '"',$_[0]);
	# extract the status and size
	$t[2] =~ s/^\s+//;
	my @t1 = split(' ',$t[2]);
	# Make allowance for '"' embedded in query strings
	# (split on ' "' and remove trailing '"')
	my @t2 = split( ' "',$_[0]);
	chop $t2[2];
	chomp $t2[3];
	chop $t2[3];
	return($w[0],$ltime,$t[1],$t1[0],$t1[1],$t2[2],$t2[3]);
}

No doubt there are more elegant (and efficient) methods of parsing the logfile. If you believe you have a better way, please feel free to share it by sending an e-mail. The subroutine above turns the time into an internal timestamp, which can be manipulated arithmetically.
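
For what it is worth, here is a minimal sketch of how the subroutine might be driven. The logfile name, the %mth hash and the explicit use of Time::Local are illustrative assumptions rather than an extract from the downloadable script:

use Time::Local;
%mth = ( Jan=>0, Feb=>1, Mar=>2, Apr=>3, May=>4,  Jun=>5,
         Jul=>6, Aug=>7, Sep=>8, Oct=>9, Nov=>10, Dec=>11 );

open(LOG, "access.log") or die "Cannot open logfile: $!\n";
while (<LOG>) {
	my ($ip, $ltime, $request, $status, $bytes, $referer, $agent) = parse_log($_);
	# the internal timestamp can now be formatted, compared or bucketed
	print scalar(localtime($ltime)), " $ip \"$request\" $referer\n";
}
close(LOG);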

The next thing we need is a subroutine that splits the query up and extracts the query string. After looking at some entries in the logfiles, I came to the conclusion that the parameter was called {q} (for query). So I knocked the following subroutine together:

sub split_query {
	my @w = split(/\&/,$_[0]);
	$w[0] = substr($w[0],index($w[0],'?')+1);
	foreach my $x(@w) {
		next unless ($x =~ /^([A-Za-z_]+)=/);
		my $k = $1;
		# will need another subroutine to pretty it up
		my $v = make_readable($');
		next unless $v;
		# if we find a parameter 'q', look no further ...
		return $v if($k eq "q");
		$srch_parm{$k} = $v;
	}
	return ("");
}

It seemed that most search engines other than Google either handed the query straight on to Google or also used a parameter called {q}. So it seemed reasonable to postulate that all that was required was a subroutine that makes the query string more readable. The following subroutine will transform the strings coded with the '%' character in HTML, and substitute space for '+':

# make a query string more readable -- decode hex chars and turn '+' into space
sub make_readable {
	my $s = $_[0];
	$s =~ s/\+/ /g;
	$s =~ s/\%([0-9A-Fa-f]{2})/sprintf("%s",pack('H*',$1))/eg;
	$s =~ s/\s+$//;
	$s =~ s/^\s+//;
	return($s);
}

The above subroutine converts all the '+' chars into spaces, and then converts the hex characters (%NN) into their ASCII representation. Finally it cleans up leading and trailing blanks. The result is a space-delimited string which could easily be displayed in a cell in a spreadsheet or table. The substitution of the hex characters (%NN) is accomplished with the 's' operator (s///) and the /e option, which makes perl evaluate the right-hand side as an expression. The right-hand side contains a call to the pack() function, which is fed with a backreference ($1); the 'H*' template converts the hex representation to an ASCII character. The /g option means do it globally. This still means multiple backreferences, which is not the most efficient way of transforming the string. But hey! I've got a Pentium IV with 2GB of RAM, so who cares about efficiency? Omigod! The rot is setting in ...

As an aside, this could be used as a simple command that transforms HTML hexadecimal codes into ASCII. For example the following one-liner would transform a string as follows:

echo 'string' | perl -npe 's/\%([0-9A-Fa-f]{2})/sprintf("%s",pack("H*",$1))/eg'
The string should be enclosed in quotes to protect against shell meta-characters. Windows 2000 users could also use this one-liner; however, they would need to reverse the quotes:
echo "string" | perl -npe "s/\%([0-9A-Fa-f]{2})/sprintf('%s',pack('H*',$1))/eg"
And, if you are a Windows 95/98 user ... you should consider upgrading to Linux!

These subroutines can form the basis of a perl program to extract keyword search phrases from the logfile. It all looks easy. Of course it wasn't that easy ...


The Best Laid Plans ...

Life wasn't meant to be easy ...

-- Old Australian Politicians' Proverb

I initially reached the conclusion that other search engines had fortuitously used the {q} parameter. However it turned out that quite a few of them use other parameters. Ok, not a problem ... we can construct a hash for them. Something like:

%engine_specific = qw(  Altavista aqa
			AOL	  userQuery
			Cometway  keywords
			Lycos	  wfq
			Netscape  Keywords
			Rediff    MT
			Virgilio  qs
			Yahoo	  p );

By altering the subroutine to check the $engine_specific{$engine} hash entry as well as {q}, these can be picked up (a sketch of this lookup appears after the list below). Also, by saving query parameters in a hash named %srch_parm, there can be a general check for the common and/or obvious parameters, and these can be handled in bulk rather than including them in the engine_specific hash. These checks would look like the following:

	return $srch_parm{key}		if($srch_parm{key});
	return $srch_parm{query}	if($srch_parm{query});
	return $srch_parm{ask}		if($srch_parm{ask});
	return $srch_parm{searchfor}	if($srch_parm{searchfor});
	return $srch_parm{qry}		if($srch_parm{qry});
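
And here is a rough sketch of the engine-specific lookup itself. It is only an illustration of the idea, not an extract from the downloadable script; in particular, the $engine variable, holding the name of the referring engine derived from the referer host, is an assumption:

	# engine-specific parameter, e.g. 'p' for Yahoo or 'wfq' for Lycos
	if ($engine && $engine_specific{$engine}) {
		my $p = $engine_specific{$engine};
		return $srch_parm{$p} if ($srch_parm{$p});
	}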

But an additional complication was the other Google queries, like advanced search and translation services. Here are the query parameters that I was able to decipher:

        q             query
        as_q          All Words
        as_epq        Exact Phrase
        as_oq         Any Words
        as_eq         Exclude these Words
        as_ft         Include (i) or exclude (e) filetype
        as_qdr        Restrict date on returns
                      (all, m3 = 3 months, m6 = 6 months, y = 12 months)
        as_occt       Specifies where the search term must occur
                      (any, title, body, url, links)
        start         Offset of the first result returned
        num           Number of results per page
        safe          Safe Search (exclude adult content)
        as_sitesearch Include/Exclude site
        hl            Host language
        sl            Source language

If I have guessed any of these incorrectly, please send me an e-mail and I will humbly apologise.

Now the algorithm grows in complexity. If the {q} parameter is empty, the other parameters need to be searched in order to extract the keywords.
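
For example, a fallback along these lines, run after the parameter loop in split_query(), could stitch the Google advanced-search parameters back into a single phrase. This is purely a sketch of the idea, not an extract from the downloadable script:

	# no {q} parameter found -- fall back on the advanced search parameters
	my @terms;
	push @terms, $srch_parm{as_q}               if ($srch_parm{as_q});    # All Words
	push @terms, '"' . $srch_parm{as_epq} . '"' if ($srch_parm{as_epq});  # Exact Phrase
	push @terms, $srch_parm{as_oq}              if ($srch_parm{as_oq});   # Any Words
	return join(' ', @terms) if (@terms);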

The other complication for Google is cache. Many users fetch from Google's cache, because it is often orders of magnitude faster and more reliable than the target host. If the user employs a GUI browser that is fetching graphics, then it will collect them from the target host when the page is displayed. As it does so, Google kindly puts the word "cache:" in front of the search terms in the referer URL. There is also an internal identifier included after the word cache, followed by the URL with the "http://" portion missing. This makes it possible to work out which page was fetched from cache. So, choosing another example from the PGTS logfile, if the browser was fetching a graphic for http://www.pgts.com.au/page02.html, the referer URL might contain a {q} parameter that looks like the following:

	q=cache:q5YkWAsP_ugJ:www.pgts.com.au/page02.html+perl%2Blinux
In this case, if we wish to include the search terms in the list, we need to substitute the image target with the page. In the above example the command string would have been something like:
	GET /images/pgtsj_head.gif HTTP/1.1
So rather than a target of pgtsj_head.gif, our search output should associate the search terms with /page02.html. Of course, if the user employed a text-only browser like w3m or lynx, or one of the GUI browsers that can switch off the graphics fetch, all bets are off, because such a user would silently (and rapidly) fetch pages from Google's cache without leaving a single word in the logfiles.
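
Something along these lines could pull the page and the search terms back out of such a cache reference. It is only a sketch; the subroutine name and the $base argument are illustrative rather than lifted from the downloadable script, and it assumes the {q} value has already been through make_readable(), so the '+' separators have become spaces:

sub split_cache_ref {
	# $q    e.g. "cache:q5YkWAsP_ugJ:www.pgts.com.au/page02.html perl+linux"
	# $base e.g. "www.pgts.com.au"
	my ($q, $base) = @_;
	return () unless ($q =~ /^cache:/);
	my ($page)  = ($q =~ /:\Q$base\E(\S*)/);	# path of the page on our own site
	my ($terms) = ($q =~ /\s+(.*)$/);		# whatever followed the URL
	return ($page, $terms);
}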

There is another complication for pages fetched by Google's translation engine. For example, if the user chooses to have the page mechanically translated (e.g. from English to French), the translation engine fetches the page from the website and Google thoughtfully puts the original search phrase into the parameter {prev}. However, all the non-standard characters are encoded as hex representations, which prevents them from being parsed on the first pass (which is quite logical). This means that in order to decipher the string, the previously mentioned subroutine split_query() needs to be called recursively.

There is a similar situation for users who look at the image section on Google's toolbar. In this case Google once again includes the search string, encoded, in the {prev} parameter. As in the previous case, the split_query() subroutine needs to be called recursively. Also, as with images fetched to go with search targets in cache, the ultimate target should be the page contained in the {imgrefurl} parameter, rather than the graphic that was fetched from your website to accompany the page displayed on the user's console. One caveat is that Google seems to be very slow in updating this index, so it is possible to find some rather old references in it.
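
The recursive call might be slotted into the parameter loop of split_query() roughly as follows. Again, this is a sketch of the idea rather than the actual code in the downloadable script. By the time this point is reached, make_readable() has already stripped one layer of encoding, so the {prev} value is itself a query string that can be fed straight back in:

	# Google translation and image searches hide the original query in {prev}
	if ($k eq "prev") {
		my $inner = split_query($v);
		return $inner if ($inner);
	}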

Some engines like webferret.search.com use an unusual method of representing the query string. I hesitate to say non-standard, since no standards exist. However, webferret stores the components separated by commas. It would have been possible to modify the script to handle the webferret format, but since webferret only contributes a small number of hits this has not been done. Instead the split_query() subroutine hits the webferret string with a Procrustean kludge. It transforms the first portion, which is usually ?wf, into ?wf=. The next ',' is transformed into "&tail=". The string then looks as if it contains a {wf} parameter; it is laid not so gently on its Procrustean couch and deciphered (Ve haff vays and means of making you decipher! ...)
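
Rendered literally as code, that kludge might look something like this (the $referer variable and the guard pattern are illustrative; the downloadable script may arrange it differently):

	# coerce a webferret referer into something split_query() understands
	if ($referer =~ /webferret/) {
		$referer =~ s/\?wf/?wf=/;	# the first portion "?wf" becomes "?wf="
		$referer =~ s/,/&tail=/;	# the next ',' becomes "&tail="
	}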

Engines like Lycos and Altavista have comprehensive schemes with a lot of information in the referer string. However, this script does not adequately address the complexities of these engines. In the case of Lycos the approach has been to simplistically opt for the contents of the {wfq} parameter if it exists, otherwise returning the standard {query} parameter (if it exists). There appear to be many Lycos search parameters, for example a value in {wb_lp}, which may indicate a translated search string, in the form of two concatenated language codes.

Is that all clear? Yeah, I'll bet! ... as clear as mud!

Perhaps some examples will illustrate this. First, an example of a graphic file being fetched while the search target page is fetched from cache:

192.168.1.1 - - [04/Jun/2003:18:26:38 +1000] "GET /icons/pgts_smlogo.gif HTTP/1.0" 200 1552 "http://www.google.co.uk/search?q=cache:LcXlwobmYMQJ:www.pgts.com.au/cgi-bin/pgtsj?file=pgtsj0211b+using+SVRMGRL+in+windows&hl=en&ie=UTF-8" "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)"
This user has searched for "using SVRMGRL in windows" and then fetched the article http://www.pgts.com.au/cgi-bin/pgtsj?file=pgtsj0211b from Google's cache. Along the way the browser (MSIE 5.5) has fetched the icon pgts_smlogo.gif from the PGTS site, and Google has left a copy of the original search terms in the referer URL.

The following appears to be a user that has clicked on the image section of Google's toolbar:

192.168.1.1 - - [06/Jun/2003:22:22:57 +1000] "GET /images/suse80.gif HTTP/1.0" 200 7970 "http://images.google.com/imgres?imgurl=www.pgts.com.au/images/suse80.gif&imgrefurl=http://www.pgts.com.au/cgi-bin/pgtsj?file=pgtsj0210&h=222&w=296&prev=/images%3Fq%3Dperl%2Blinux%26svnum%3D10%26hl%3Des%26lr%3D%26ie%3DUTF-8%26oe%3DUTF-8&frame=small" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
The search string is (encoded) in {prev}. This is a little bit more difficult to read. However, once the hex characters are deciphered the {q} string can be seen inside the {prev} string. The search phrase was "perl linux". This example would require a recursive call to the subroutine split_query().

And an example of a file which is being translated from English to French:

192.168.1.1 - - [08/Jun/2003:22:09:43 +1000] "GET /download/humour/ HTTP/1.1" 200 30250 "http://translate.google.com/translate_n?hl=fr&sl=en&u=http://www.pgts.com.au/download/humour/&prev=/search%3Fq%3Dvideo%252Bhumour%252Bdownload%26start%3D30%26hl%3Dfr%26lr%3D%26ie%3DUTF-8%26sa%3DN" "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)"
Once again, the search string is (encoded) in {prev}. This user chose a search string of "video humour download", in which case I fear he would have been disappointed with the result.

There are some referer strings that this script fails to decipher. In many cases the reason for failure is quite obvious. For example, it may fail to decipher a hit from a search engine because there was no query string, as is the case for someone using Google Directory Services.

Some engines like aolsearch.aol.com encode the string. An example of the aolsearch encoding is as follows:

    encquery=6D41C83931B2FE2236C6D4135D0D855CE58309B2157455C44960B4F4296065245087A58F59A2B274&invocationType=keyword
If anyone understands this, please send an e-mail.

Putting it together ...

A draft version of this script is now available for download. The base URL must also be supplied to the script. This is required by the section of code which attempts to trace image fetches back to the pages which contain them.

The source code can be viewed here. The program is invoked as follows:

	qstr http://mydomain.com /var/log/my_apache.log
The two arguments, domain and logfile, must be supplied. Also, the "http://" component must be included with the domain.

This script recognises two command line options, which were included mainly for debugging purposes. These are:

Obviously, if your logfile is not in the "combined" format, then you will have to re-engineer the parsing subroutine. Whichever format the web server uses, however, it must include the referer string, because without it there will be no search keywords.

The examples given above have been taken from the PGTS logfile. However the IP addresses have been altered to 192.168.1.1. This is a non-routable address.

You are welcome to copy this code and use it for your own purposes, provided they are not illegal or connected with spamming.