Wednesday, February 27, 2013

W3C - LinkChecker - post processing with Awk

The W3C link checker program is very useful for checking large numbers of HTML files.
I have used the Perl CPAN implementation.
One way to check a lot of pages from a command prompt is a script like this:

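(
cd /home/webserver/pages || exit 1
baseurl='http://mywebserver.com/pages/'
for page in *.htm *.html; do
    [ -e "$page" ] || continue    # skip a pattern that matched no files
    # append each page's report to one file for later filtering
    /usr/bin/linkcheck -s "${baseurl}${page}" >> outputfile
done
)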

This may take a while to run: the checking process is very thorough and the reports are quite verbose.
A common requirement is to check only for 404 (bad link) errors. To report just these, I filtered the output file through an awk script:


# from FTP::webx-johnr\/home/johnr/librarycheck|linkfilter.awk
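BEGIN { url = "" }
# Start of the report for a new page: remember its URL and reset the counter.
/^Processing/ { url = $2; errorCount = 0 }
# A line starting with http: names the link the following details refer to.
/^http:/ { link = $1 }
# Source line numbers for that link (only the first two are kept).
/^ Lines: / { lines = $2 $3 }
# Report 404s only, printing the page heading once per page.
/^  Code: 404 Not Found/ {
    if (!errorCount) printf "\n\nCompany page: %s\n", url
    errorCount++
    printf "link: %s lines %s; %s\n", link, lines, $0
}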

This lists only pages where 404 errors have occurred; OK pages and other 'errors', such as redirections or ignored links, are not listed.
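A quick way to try the filter without running the checker is to feed it a few hand-made lines. The sketch below embeds the same awk rules in a shell function (filter_404 is a name made up here) and runs them over a sample. The sample is not real checker output: it only imitates the line prefixes the patterns look for, and the URLs and line numbers in it are invented for illustration.

```shell
#!/bin/sh
# Sketch: exercise the 404 filter on a small hand-made sample.
# filter_404 is a hypothetical wrapper name; the awk rules are the ones
# from the post, embedded inline so the example is self-contained.

filter_404() {
    awk '
    /^Processing/            { url = $2; errorCount = 0 }
    /^http:/                 { link = $1 }
    /^ Lines: /              { lines = $2 $3 }
    /^  Code: 404 Not Found/ {
        if (!errorCount) printf "\n\nCompany page: %s\n", url
        errorCount++
        printf "link: %s lines %s; %s\n", link, lines, $0
    }' "$@"
}

sample=$(mktemp)

# Made-up input shaped to match the patterns above (assumption, not a
# captured report). Page a.html has one 404 and one redirect; page
# b.html has only a redirect.
printf '%s\n' \
    'Processing http://mywebserver.com/pages/a.html' \
    'http://example.com/missing' \
    ' Lines: 12, 40' \
    '  Code: 404 Not Found' \
    'http://example.com/moved' \
    ' Lines: 7' \
    '  Code: 301 Moved Permanently' \
    'Processing http://mywebserver.com/pages/b.html' \
    'http://example.com/elsewhere' \
    ' Lines: 3' \
    '  Code: 301 Moved Permanently' \
    > "$sample"

report=$(filter_404 "$sample")
printf '%s\n' "$report"
rm -f "$sample"
```

With this sample the report should contain a single "Company page:" heading for a.html followed by one "link:" line for the 404 entry; b.html, which has only a redirect, does not appear. On real data the same function can be pointed at the checker's output file, for example filter_404 outputfile.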



