PGTS Browser and Robot Taxonomy

By Gerry Patterson

For the past ten months I have been tracking the agent strings of clients that visit this site. Many webmasters have shown interest in this project. At the same time I have been gathering usage information to get an estimate of the types of client systems being used on the Internet. However, there are several caveats that need to be heeded before interpreting this data.

Browsers and HTML.

I first started examining agent strings out of curiosity, just to get an idea of who was visiting this site. It soon became apparent that the question of agent strings was a big ugly tin of worms. My early attempts at deciphering agent strings worked, to a point. But as the task grew in complexity, I decided to employ a database to cope with the workload. I chose PostgreSQL as the RDBMS.

This has allowed me to compile a usage report for browsers and Operating Systems. There are still some problems with the data. However, before looking at those issues I would like to put in a strong plug for W3C compliant HTML. If you are using these lists for Browser Sniffing, I urge you to reconsider this ill-advised strategy. Most of the problems people think they are trying to solve by employing Browser Sniffing simply disappear if the decision is made to implement W3C standard HTML.

If you are a website owner struggling to improve your site's visibility and appeal, you should consider standard HTML and CGI scripts rather than complex client-side operations. Complex client-side implementations start to break down because pages become dependent on the type of hardware and software that your clients use. This makes your site vulnerable to changes in the computer market, which, despite the universal downturn, is still evolving quite rapidly. By supporting simple standards that are at the rock-bottom of the Internet, you can fortify your site against such changes, as well as broadening your potential audience. This frees you to concentrate your efforts on content rather than form. And content is what the users want.

On the other hand, if you are a user, surfing the web, and you encounter a site that only supports a restricted number of browsers, you should e-mail that site and let them know that they may be depriving themselves of business. You might also point out that they can reach a larger audience and decrease their maintenance costs by employing standard HTML.

There used to be a site at www.anybrowser.org that had example letters which could be used as templates for e-mails to be sent to offending websites. These were polite and to the point. Of course, these days there is only one market leader, so the complaints would mostly be directed at sites that tailor their pages exclusively for MSIE. The site at www.anybrowser.org is no longer available. However, the information is still on the Internet (see bibliography).

Agent String Caveats.

There are several problems with the data being gathered. The following is a list of these problems and the proposed remedies:

Bias: Early figures (from Jul 2002 to Sep 2002) show a bias towards open source browsers and operating systems. This was due to links from numerous sites which advocate or promote open source. After I realised that forging reciprocal links with like-minded sites was the most effective way to establish a web presence, I spent some time engaging in Open Source Advocacy. Since many of the in-bound links were from sites that were also concerned with this issue, there was a strong bias towards open source browsers and operating systems. Towards the end of this period, hits started to come straight from Google. These showed a distribution more like that seen from Oct 2002 onwards.
Fix: The best remedy for this is to do nothing. I believe that the statistics gathered from October 2002 onwards are a representative sample. This means that the bias should work its way out of the system as the data ages.
Complexity: The agent string is very convoluted. The ad-hoc nature of browser evolution and the (lack of) standards for agent string nomenclature means that it is very difficult to interpret a string that has not previously been encountered. Also there is a certain amount of inertia in the status quo. Rather than rock the boat, new agent strings on the market emulate the market-leader. This means that many programs that try to identify such new agents count them as the market-leader, which often results in inflation of the statistics for the market-leader.
Fix: The remedy for this was construction of a database of agent strings, which is what prompted the gathering of agent strings in the first place. Most browsers do include subtle differences in the agent string, in order to establish a unique identity for themselves. The browser identities are derived from analysis of the large database of agent strings.
Mozilla: Apart from the fact that most agent strings begin with the word Mozilla, there are a lot of Mozilla variants, among them: Chimera, K-Meleon, MultiZilla and Phoenix (now known as Firebird). Such a large number of separate browser species makes the browser zoo very crowded. Especially when the actual numbers don't seem to justify a separate species.
Fix: These minor browsers have been renamed as Mozilla Chimera 6.0, Mozilla Phoenix 0.5 etc. If the owners of any of these browsers object strongly, please send an (articulate) protest.
Crawlers: Quite a few crawlers attempt to disguise themselves as browsers. Some of these account for a considerable amount of traffic and hence can skew the statistics considerably. The most popular agent string on the Internet at present seems to be:
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
This is the plain vanilla MSIE 6.0 string for Windows XP. Hence a robot that used this agent string would be hard to detect if the only criterion for detection was the reported agent string. Furthermore, if statistics are being gathered according to number of hits, an aggressive robot could run up an impressive tab.
Fix: The routine which gathers statistics is being re-worked. After this it will be more sophisticated and should detect many of these crawlers, and hits will not be credited to the agents that these Stealth Crawlers pretend to represent. The statistics gathering process will also be altered to record numbers of visits rather than number of hits, for those Stealth Crawlers that fall through the cracks. This means that if a crawler hits 40 pages in one visit, it still only counts as one visit. These changes will be rolled out in an Agent Switchboard Upgrade which should be ready towards the end of May or early June 2003. As part of this upgrade, statistics for previous months will be recalculated using the new algorithms.
Google Cache: Many users fetch pages from Google cache or other caches. The current routine that gathers hits only counts hits to pages, so these are missed. GUI browsers that fetch from Google cache usually come to the original site to fetch graphics. At the PGTS site, however, the hit counter will not be updated, since the browser only fetches a few small graphics.
Fix: Fortunately Google puts a referer string in these fetches, so in the case of traffic from the search engine giant it is possible to backtrack and work out which pages have been hit. Unfortunately, it will add to the complexity of the statistics gathering. These changes will be included in the Agent Switchboard Upgrade scheduled for late next month.
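
Two of the fixes above (telling emulating agents apart from the market leader, and counting visits rather than hits) can be illustrated with a short sketch. This is not the PGTS routine. It is a minimal Python example under stated assumptions: the token table, the 30-minute visit cut-off, the (ip, agent, time) record layout and the sample addresses are all hypothetical, chosen only for illustration.

```python
from collections import Counter

# Hypothetical token table, checked in order. Agents that emulate MSIE
# (Opera, for example) must be tested before the generic "MSIE" token,
# otherwise they would be credited to the market leader.
BROWSER_TOKENS = [
    ("Opera", "Opera"),
    ("Konqueror", "Konqueror"),
    ("Phoenix", "Mozilla Phoenix"),
    ("MSIE", "MSIE"),
]

VISIT_TIMEOUT = 30 * 60  # seconds; an assumed cut-off between visits


def classify(agent):
    """Return a browser label for an agent string, or 'Unknown'."""
    for token, label in BROWSER_TOKENS:
        if token in agent:
            return label
    return "Unknown"


def count_visits(hits):
    """Count visits per browser from (ip, agent, unix_time) tuples.

    Hits from the same ip and agent that are no more than
    VISIT_TIMEOUT seconds apart are treated as a single visit.
    """
    visits = Counter()
    last_seen = {}
    for ip, agent, when in sorted(hits, key=lambda h: h[2]):
        key = (ip, agent)
        previous = last_seen.get(key)
        if previous is None or when - previous > VISIT_TIMEOUT:
            visits[classify(agent)] += 1
        last_seen[key] = when
    return visits


# Sample data (made up): one client hitting 40 pages in quick succession,
# and one Opera client, identifying as MSIE, that returns two hours later.
MSIE6 = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
OPERA = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1) Opera 7.0 [en]"
sample = [("10.0.0.1", MSIE6, 1000 + 10 * n) for n in range(40)]
sample.append(("10.0.0.2", OPERA, 1500))
sample.append(("10.0.0.2", OPERA, 1500 + 2 * 3600))
print(count_visits(sample))
```

With this sample the 40 rapid hits count as one MSIE visit, and the Opera client is credited with two visits to Opera rather than to MSIE. Note that a Stealth Crawler sending the plain MSIE 6.0 string would still be classified as MSIE here. The sketch only limits the damage by counting it once per visit. Detecting it would need evidence other than the agent string.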

Thanks For The Data.

Last of all, I would like to thank those people who updated the details for various browsers. When I first started drawing up this list, I guessed what the meaning of the various agent strings was and then used those guesses to construct a Perl script that automated the guessing process. There were many gaps in the data. The feedback from people who visit this site has helped fill these gaps.

It seems that a few people have taken concerns about privacy to the extremes of excessive paranoia and have blocked their browsers from supplying any information. Fortunately the majority of users do not do this. Some of the people who block the agent string do have laudable motives, such as trying to discourage websites from the misguided practice of Browser Sniffing. However, obscuring the agent string is not a very sensible way to do this. By using the standard agent string that shipped with your browser, you ensure that your opinion, or your vote for a particular browser, is being heard. If you wish to express your disapproval of websites that employ browser sniffing, it is more effective to send them an e-mail expressing your disapproval. This will be more persuasive (and articulate) than setting your agent string to a string that tells them to f*** off.

BIBLIOGRAPHY:

The page of example letters which used to be at www.anybrowser.org.


AnyBrowser	Example Letters
PGTS	Agent Strings in Popular Browsers. The first essay which I wrote on this topic, which could also have been titled A Short Summary of the Long Sad History of Agent Strings.
PGTS	Agent String Switchboard. This page gives access to the various pages relating to agent strings and data gathered for Browsers and Crawlers.
PGTS	Check Your Own Browser. This page runs a test on your browser. If any of the details are incorrect please take the time to update them. This is how much of this data has been corrected.