Friday, 15 March 2013

Tar-Based Back-ups

A few months ago, I realized I had to change the back-up strategy on my personal laptop. Until then I had used Areca, which in itself worked fine, but I was looking for something that could be scripted and run from the command line, and that was easy to install. As is often the case in the Linux world, it turned out you can easily script a solution of your own using some basic building blocks. For this particular task, the building blocks are Bash, tar, rm and split.

What was my problem with Areca? First of all, from time to time, Areca had to be updated. This is usually a good thing, but not if the new version is incompatible with the old archives. This can also cause problems when restoring archives, e.g. from one computer to another, or after a complete reinstallation of the operating system. Furthermore, since Areca uses a graphical user interface, scripting and running the back-up process from the command line (or crontab) wasn't possible.

My tar-based back-up script starts with a shebang interpreter directive to the Bash shell. Then it sets up four shell variables: a base directory in BASEDIR, the back-up directory where all archives will be stored in BACKUPDIR, the number of the current month (two digits) in MONTH, and the first argument passed to the script in LEVEL. The LEVEL variable represents the back-up level, i.e. 1 if only the most important directories should be archived, 2 if some less important directories should be archived too, etc…


#!/bin/bash
#
# Creates a local back-up.
# The resulting files can be dumped to a media device.

BASEDIR=/home/filip
BACKUPDIR=${BASEDIR}/backup

MONTH=$(date +%m)

LEVEL=$1


Next we define a two-parameter function that backs up a particular directory to a file. First it echoes to the console what it's going to back up, then it uses tar to do the actual archiving, and finally it creates a SHA-256 digest of the result. Notice that the output of tar is redirected to a log file. That way we keep the console output tidy, and at the same time we can browse through the log file if something went wrong. That's also why we included v (verbosely list files processed) in the option list for tar.


# Backs up directory $1 (relative to BASEDIR) to archive $2 in BACKUPDIR.
function back_up_to_file {
   echo "Backing up $1 to $2."
   tar -cvpzf "${BACKUPDIR}/$2.tar.gz" "${BASEDIR}/$1" &> "${BACKUPDIR}/$2.log"
   sha256sum -b "${BACKUPDIR}/$2.tar.gz" > "${BACKUPDIR}/$2.sha256"
}
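

The SHA-256 digest comes in handy when you want to verify later on that an archive is still intact, e.g. after it has been copied to another disk. Since sha256sum records the path of the archive in the digest file, verification is a one-liner; for the bin archive from the examples below, it looks like this:


sha256sum -c /home/filip/backup/bin.sha256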


Here are some examples of how the function can be used.


back_up_to_file bin bin
back_up_to_file dev dev
back_up_to_file Documents Documents-${MONTH}
back_up_to_file .thunderbird/12345678.default Thunderbird-${MONTH}


Notice the use of the variable MONTH in the example above to create rolling archives. The directories bin and dev will always be backed up to the same archive file, but for the documents directory and Thunderbird, a new one will be created every month. Of course, if the script is run a second time during the same month, the archive files for the documents directory and Thunderbird will be overwritten. The same will happen when the script is run a year later: the one-year-old archive file will then be overwritten with a fresh back-up of the documents directory and Thunderbird. Tailor this to your needs in your own back-up script!

LEVEL can be used in the following manner to differentiate between important, often-changing directories on the one hand, and, on the other, more stable directories you do not want to archive every time you run the script.


# Backup of directories subject to changes
if [ ${LEVEL} -ge 1 ]; then
   back_up_to_file bin bin-${MONTH}
   back_up_to_file Documents Documents-${MONTH}
   back_up_to_file .thunderbird/12345678.default Thunderbird-${MONTH}
   back_up_to_file dev dev-${MONTH}
   …
fi

# Backup of relatively stable directories
if [ ${LEVEL} -ge 2 ]; then
   back_up_to_file Drawings Drawings
   back_up_to_file Photos/2010 Photos-2010
   back_up_to_file Movies/2013 Movies-2013
   back_up_to_file .fonts fonts
   …
fi

# Backup of stable directories
if [ ${LEVEL} -ge 3 ]; then
   back_up_to_file Music Music
   …
fi


Next, I'd like to split large files into chunks that are easier to handle. This makes it easier to move archives between computers or to external media. The following function splits a large file into chunks of 3900 MB, which keeps each chunk safely under the 4 GB file size limit of FAT-formatted media. Before it does that, it removes the chunks from the previous run, and when it's done, it also removes the original file.


# Split large files
function split_large_file {
   echo "Going to split $1.tar.gz."
   # -f: don't complain if there are no chunks left over from a previous run
   rm -f "${BACKUPDIR}/$1.tar.gz."[0-9]*
   split -d -b 3900m "${BACKUPDIR}/$1.tar.gz" "${BACKUPDIR}/$1.tar.gz."
   rm "${BACKUPDIR}/$1.tar.gz"
}
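

Since split just cuts the archive into pieces, restoring is straightforward: concatenate the chunks back together with cat and extract the result. Because tar strips the leading slash from member names when archiving, extracting with -C / puts everything back in its original location. For the Photos-2010 archive, for example:


cat Photos-2010.tar.gz.* > Photos-2010.tar.gz
tar -xvpzf Photos-2010.tar.gz -C /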


The example below shows how the function can be used.


if [ ${LEVEL} -ge 2 ]; then
   split_large_file Photos-2010
fi


Notice that it uses the LEVEL variable to control when the function is run on the various archive files. If there's a mismatch, the script would try to split non-existent files. That wouldn't hurt, but we also want to avoid the unnecessary error messages that would pop up on the console. A better solution would probably be to automatically detect whether there are any large files in the back-up directory and only split those, but I haven't had time to implement that yet.
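
A minimal sketch of what that could look like, reusing the split_large_file function and assuming all archives live directly in BACKUPDIR, is to let find pick out the archives that are actually larger than the chunk size:


# Hypothetical: split only the archives larger than 3900 MB
for FILE in $(find ${BACKUPDIR} -maxdepth 1 -name "*.tar.gz" -size +3900M); do
   split_large_file $(basename ${FILE} .tar.gz)
done
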

Finally, at the end of the script, we write to the console that we're done. I like to do that to indicate explicitly that everything went well, especially since this script can take a while.


echo "Done."


For the moment, I copy the resulting archive files manually from a local directory on the hard disk to an external disk, based on the timestamps. A better solution would be to create another script that can check the SHA-256 sums on the external disk against the local sums, and copy only those archives that are different. We'll save that one for another time.
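
In the meantime, here's a rough sketch of what such a script could look like, assuming the external disk is mounted at the hypothetical location /mnt/external/backup. It compares the local digest of each archive with the copy of the digest on the external disk, and copies the archive (and its digest) only when the two differ:


#!/bin/bash
# Hypothetical sync script: copy only the archives whose SHA-256 digests
# differ from (or are missing on) the external disk.

BACKUPDIR=/home/filip/backup
EXTERNALDIR=/mnt/external/backup

for SUM in "${BACKUPDIR}"/*.sha256; do
   NAME=$(basename "${SUM}" .sha256)
   # cmp -s is silent; a missing external digest also triggers the copy
   if ! cmp -s "${SUM}" "${EXTERNALDIR}/${NAME}.sha256" 2> /dev/null; then
      echo "Copying ${NAME}."
      # The glob also picks up the chunks of archives that have been split
      cp "${BACKUPDIR}/${NAME}".tar.gz* "${EXTERNALDIR}/"
      cp "${SUM}" "${EXTERNALDIR}/"
   fi
done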