Sunday, November 11, 2007

A Simple Blog Backup Tool

I wanted to back up this blog. A brief look around the web did not turn up many good options: the instructions on the Blogger site are complex and very manual, and there do not appear to be many open source alternatives.

So, I wrote a quick and dirty Bourne shell script. Right now it is pretty rough, but it works. It relies upon a number of standard Unix tools: wget, xsltproc, and tidy. These may not be installed on an out-of-the-box system, but they can be downloaded as binaries for either Windows (from Cygwin) or a Mac (from Fink), so this script should work from any suitably configured computer. I am running my backups on a Mac. The script relies on the Blogger "Blog Archive" widget being present in the blog's template, because it reads the list of posts from that widget. This may be a problem for older Blogger blogs.
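Since the tools may be missing on a fresh system, it can help to check for them before running a backup. The following is only a minimal sketch; the function name missing_tools is made up for illustration and is not part of the script.

```shell
#!/bin/sh
# Hypothetical helper (not part of the backup script): print the names of
# any tools given as arguments that cannot be found on the PATH.
missing_tools() {
    missing=""
    for tool in "$@"; do
        # "command -v" is the POSIX way to ask whether a command exists.
        if ! command -v "$tool" >/dev/null 2>&1; then
            missing="$missing $tool"
        fi
    done
    # Drop the leading space; prints an empty line when nothing is missing.
    echo "${missing# }"
}

# Check the four programs the backup script lists in its header comment.
missing=$(missing_tools wget tidy xsltproc xargs)
if [ -n "$missing" ]; then
    echo "missing tools: $missing" >&2
fi
```

If this prints nothing, all four programs were found and the backup script should be able to run.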

Below is the script. Copy it to a file, make the file executable, cd to the root of your backup directory, run the script with your blog's URL as its only argument, and you should have a backup faster than you can say "screen scraping".

- J

#!/bin/sh
# Backs up a Blogger blog.
# Should work in any suitably equipped unix-like environment. Requires the
# following programs:
#
# 1. wget
# 2. tidy
# 3. xsltproc
# 4. xargs
PROGNAME=`basename "$0"`
DIRNAME=`dirname "$0"`

# print a usage message and exit.
usage() {
echo "usage: $PROGNAME blogURL"
exit 1
}

TMPFILE=/tmp/$PROGNAME-tmp-$$.html
XSLFILE=/tmp/$PROGNAME-xslt-$$.html
if [ "$1" = "" ] ; then usage; fi
URL=$1

# create xslfile from here-document
cat > $XSLFILE <<- "END-HERE"
<?xml version='1.0'?>
<xsl:stylesheet xmlns:html="http://www.w3.org/1999/xhtml" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="text" version="1.0" encoding="UTF-8" />
<xsl:template match="/" >
<xsl:apply-templates select="//html:ul[@class='posts']/html:li/html:a"/>
</xsl:template>
<xsl:template match="html:a">
<xsl:value-of select="@href"/><xsl:text> </xsl:text>
</xsl:template>
</xsl:stylesheet>
END-HERE

# get the index file of the blog
# clean it up with tidy in case it has bad xml syntax. Blogger uses xhtml.
# write to a temporary file.
wget -O - $URL | tidy 2>/dev/null > $TMPFILE

# read from temporary file.
# extract from a list of individual posts.
# get each post and all the files needed to render them locally.
xsltproc $XSLFILE $TMPFILE |
xargs wget -E -H -k -K -p

# remove the temporary files.
rm -f $TMPFILE
rm -f $XSLFILE
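
The single-letter wget options in the last pipeline stage are terse, so here is a sketch of the same stage spelled out with long options. The function name fetch_posts and the WGET variable are made up for illustration (WGET only exists so the command can be dry-run with echo). Note that --adjust-extension is the name used by wget 1.12 and later; older releases call the same -E option --html-extension.

```shell
#!/bin/sh
# Hypothetical override hook: set WGET=echo to print the command instead of
# downloading anything.
WGET=${WGET:-wget}

# Read post URLs from standard input and fetch each one for local viewing.
fetch_posts() {
    # --adjust-extension  (-E) save HTML pages with an .html suffix
    # --span-hosts        (-H) allow requisites hosted on other servers
    # --convert-links     (-k) rewrite links so pages work offline
    # --backup-converted  (-K) keep the unconverted original as a .orig file
    # --page-requisites   (-p) also get images, stylesheets, and so on
    xargs "$WGET" \
        --adjust-extension \
        --span-hosts \
        --convert-links \
        --backup-converted \
        --page-requisites
}

# Usage, mirroring the script above:
#   xsltproc $XSLFILE $TMPFILE | fetch_posts
```

Span-hosts matters here because Blogger typically serves images and stylesheets from hosts other than the blog's own.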
