Stop the Rot

By Gerry Patterson

So-called link rot is a perennial problem which afflicts all websites. Like rising damp, it creeps up from the foundations of your external links and over time seeps through your carefully crafted HTML documents and turns them into a load of old wombat poo. So what is a good maintenance strategy to avoid this?

This article discusses remedies for insidious link rot.


That Insidious Link Rot.

We've all seen it. A website that has a good summary of the topic you are searching for, and interesting-looking links which appear to be relevant. You click on the links and all you get are "server not found" or "document not found" error messages. Did you report the broken link to the webmaster?

And the same thing is probably happening on your own site. When was the last time someone reported a broken link? ... Has anyone ever reported a broken link?

There are some solutions, such as the LinkWalker robot, which will give you a report of the broken links on your site. LinkWalker also offers a report that lists the sites linking to yours. The W3C organisation also offers an online link checker.

I would prefer a script which runs on my own site, because this gives me better control over when it runs and how I should be notified. It occurred to me that it would be easy to write a script which checks my own site using LWP (Library for WWW access in Perl). However, before attempting it, I thought it would be a good idea to see what is already available in the public domain.
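
As a rough illustration of the kind of per-link check such a script performs, here is a minimal sketch using curl as a stand-in for the LWP approach (the URL is only a placeholder):

#!/bin/bash
# Sketch only: test a single link with curl rather than LWP.
# curl -f returns a non-zero exit status for HTTP errors (4xx/5xx),
# -s suppresses progress output, and -I sends a HEAD request.
URL="http://www.example.com/some/page.html"	# placeholder link
if ! curl -fsI "$URL" >/dev/null; then
	echo "BROKEN: $URL"
fi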

The checkbot script seemed like a good choice. The latest stable version is 1.67. The version that I downloaded and tested seemed to work ok. The one thing that I noticed was an intermittent problem with some links. I assumed that these were just remote host problems.

Checkbot can be run from cron, and its output is stored as an HTML file.


Putting It Together.

The checkbot script can be run from cron at a regular interval. It would not need to be run more often than once a week. I decided to render the HTML output report with lynx (using the -dump option) and push it off to myself as an e-mail. The script accepts a single parameter, which is the URL to check. The URL should not include the http:// prefix (e.g. www.mydomain.com). Here is a copy of the script:

#!/bin/bash
# chklink - Call checkbot to check the links at the specified URL
# Do not include the http:// at the start of the URL
# G. Patterson, Sept 2002

FROM_ADDR=nobody@mydomain.com
TO_ADDR=user@mydomain.com
MAIL_FILE=/tmp/checkbot.mail
if [ $# -ne 1 ] ; then
	echo "Usage: $0 URL"
	exit 1
fi
DS=`date -I`
echo -e "From: $FROM_ADDR\nTo: $TO_ADDR\nSubject: Checkbot $DS\n" > $MAIL_FILE
cd /tmp
checkbot http://$1
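# Render the report with lynx and keep only the error section.
# If the first field on line 6 of the rendered report is an HTTP status
# code (a number greater than 100), broken links were reported.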
lynx -dump /tmp/checkbot-$1.html | awk '
	NR==6{	if ($1 > 100) error++;}
	NR>5 {	if ( $0 ~ /____________/) exit(error);
		if (error) print; }' >> $MAIL_FILE
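# awk exits with a non-zero status only when errors were found,
# so the report is mailed only in that case.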
if [ $? -ne 0 ] ; then
	echo -e "\n.\n" >> $MAIL_FILE
	sendmail $TO_ADDR <$MAIL_FILE
fi
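
Assuming the script has been saved as chklink and made executable, checking a site would look something like this (the domain is only an example):

./chklink www.mydomain.com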

The idea behind this script is to take the report file and mail it to user@mydomain.com. The sender's e-mail address is nobody@mydomain.com (change these where appropriate). If your system does not have a recent GNU version of the date command, then "date -I" (which supplies an ISO-8601 date string) will not work; just use a format string (e.g. date +%Y-%m-%d) to specify your own format. Invoking lynx with the -dump option sends the rendered output to stdout. The script uses awk to search for the underline characters that are output by this particular version of checkbot (if your version is different then this may have to be changed). In this version, if the first field on line 6 is a number greater than 100, it represents an HTTP status return, which means the report contains broken links.
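
To run the check weekly, as suggested above, the script can be scheduled from cron. The installation path and the Sunday 3:00 a.m. schedule below are only examples:

# m h dom mon dow  command
0 3 * * 0  /usr/local/bin/chklink www.mydomain.com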


BIBLIOGRAPHY:

CheckBot Home Page for the download.