Wednesday, July 24, 2013

Converting Youtube Channels To Podcasts

I spend a lot of time on the bus commuting to and from work, and I like to listen to podcasts while I do that. I never got into the habit of watching channels on YouTube, so when CNET and StratFor moved from podcasting to a channel on YouTube, I pretty much lost track of both of them.

Sure, there's extra value in a video when you want to show a new product, screenshots, maps or some footage, but most of the time the audio track is all you really need to get the essence. Listening is definitely better than skipping the video altogether, and if it turns out I'd like to see the images after all, I can always watch the video later. The only problem is: how do I convert a YouTube channel to MP3 files, just as if it were a regular podcast feed?

In Ubuntu, it turns out to be surprisingly easy to create a little Bash script that does just that. The basic ingredients are curl, youtube-dl and one text file per channel to keep track of the videos that have already been downloaded. Let's work our way through the script.

#!/bin/bash
#
# “Catches” a YouTube feed, i.e. reads a YouTube feed, downloads videos that
# haven't been downloaded before, and converts them to MP3.
#

# Check if we are the only local instance
if [[ "$(pidof -x "$(basename "$0")" -o %PPID)" ]]; then
  echo "This script is already running with PID $(pidof -x "$(basename "$0")" -o %PPID) -- exiting."
  exit
fi

cd <your directory here>

First of all, we want to make sure we're the only instance of the script running. The reason for this is that if we have a slow connection, and we've added the script to crontab to be run once every hour, we don't want to end up having two or more instances of the script trying to download the same video. Once we've established we're the only running instance of the script, we move to the directory where we want the files to be downloaded, and we're ready to do the actual work.
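
By the way, running the script once every hour from crontab can be done with an entry along the lines of the following sketch (the script name and location are just examples, adjust them to wherever you saved the script):

# m h dom mon dow command
0 * * * * /home/filip/bin/catch_youtube_feeds.sh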

That work is done in a function called catch_feed, which takes two arguments: the name of the feed, which we'll use as a key to label the resulting MP3 files and the history file, and the URL of the feed. We then check whether there is already a history file for the feed, and if there isn't, we create an empty one by touch'ing it.

function catch_feed {
  FEEDNAME=$1
  FEEDURL=$2
  HISTORY=${FEEDNAME}.hist

  # Create an empty history file for this feed if there isn't one yet
  if [ ! -f "${HISTORY}" ]
  then
    touch "${HISTORY}"
  fi
  echo "Downloading the feed for ${FEEDNAME}..."
  curl -s "${FEEDURL}" -o "${FEEDNAME}.html"
  # Extract the YouTube video IDs from the feed page in three stages
  FILES=$(grep -o "href=\"/watch?v=[^\"]*\"" "${FEEDNAME}.html" | grep -o "=[^\"&]*" | grep -o "[^=]*")
  for FILE in $FILES
  do
    # Skip videos that are already listed in the history file
    DOWNLOADED=$(grep -Fx -e "${FILE}" "${HISTORY}")
    if [[ ! $DOWNLOADED ]]
    then
      FILENAME="${FEEDNAME}-${FILE}"
      echo "Downloading ${FILENAME}..."
      youtube-dl --extract-audio --audio-format=mp3 --output="${FILENAME}.%(ext)s" --quiet "http://www.youtube.com/watch?v=${FILE}"
      # Only record the video ID once the MP3 file has actually been created
      if [ -f "${FILENAME}.mp3" ]
      then
        echo "${FILE}" >> "${HISTORY}"
      fi
    fi
  done
  rm "${FEEDNAME}.html"
}

Using curl, we download the page with the YouTube channel's feed. We save the page to a file, and then use grep to find all the links to videos. We do this in three stages: first we find all links to YouTube videos, then we cut away everything before the video's ID (keeping the "=" just in front of it), and finally we strip that leading "=" as well. The result is a list of YouTube video IDs, which we can then match against the history file and download if we want to.
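
To make those three stages a bit more concrete, here is what they do to a single, made-up link taken from such a page:

echo 'href="/watch?v=AbCd-123xyz"' \
 | grep -o "href=\"/watch?v=[^\"]*\"" \
 | grep -o "=[^\"&]*" \
 | grep -o "[^=]*"
# The first grep keeps the whole href attribute, the second reduces it to "="
# followed by the video ID, and the third drops the "=" again, so the only
# thing printed is the ID itself: AbCd-123xyz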

First we match the ID against the history file. Notice that YouTube video IDs can contain dashes (“-”), which is why the ID is passed with -e (so it isn't mistaken for an option), and why we use the -F option to match fixed strings instead of regular expressions. We also use the -x option to match complete lines only.

If there's no match for the YouTube video ID in the history file, we download the video and convert it to an MP3 file. This is done using youtube-dl, which has a special option --extract-audio to extract the audio track from the video file once it has been downloaded. We also use the --quiet option so that we keep our own log messages clean. Once we've downloaded the video file and converted it to MP3, we append the YouTube video ID to the history file, so it isn't downloaded a second time.

Notice that we check whether the MP3 file really exists before we add its ID to the history file. Otherwise, if the internet connection goes down during the download, or another error occurs that stops youtube-dl without stopping the whole script, we would add a YouTube video ID to the history file without having really downloaded it.

Finally, when we're done processing a YouTube channel, we remove the downloaded page with the channel's feed.

The rest of the script just calls the function from above, as shown below. The three calls catch the feeds for CNET News, CNET Update and StratFor respectively.

catch_feed cnetnews http://www.youtube.com/show/cnetnews/feed
catch_feed cnetupdate http://www.youtube.com/show/cnetupdate/feed
catch_feed stratfor http://www.youtube.com/user/STRATFORvideo/feed

echo "Done."

Just a final note: I have no idea whether using a script like this to listen to the audio tracks only, instead of watching the actual videos, is permitted by YouTube or by the organizations producing the videos. But I'm assuming that if this were an infringement of YouTube's end user license agreement, it's not one of their top priorities. Otherwise, YouTube surely would have broken youtube-dl a long time ago.

Friday, March 15, 2013

Tar-Based Back-ups

A few months ago, I found out that I had to change the back-up strategy on my personal laptop. Until then I had used Areca, which in itself worked fine, but I was looking for something that could be scripted and used from the command line, and that was easy to install. As is often the case in the Linux world, it turned out you can easily script a solution on your own using some basic building blocks. For this particular task, the building blocks are Bash, tar, rm and split.

What was my problem with Areca? First of all, from time to time, Areca had to be updated. This is usually a good thing, but not if the new version is incompatible with the old archives. This can also cause problems when restoring archives, e.g. from one computer to another, or after a complete reinstallation of the operating system. Furthermore, since Areca uses a graphical user interface, scripting and running the back-up process from the command line (or crontab) wasn't possible.

My tar-based back-up script starts with a shebang interpreter directive to the Bash shell. Then it sets up four environment variables: a base directory in BASEDIR, the back-up directory where all archives will be stored in BACKUPDIR, the number of the current month (two digits) in MONTH, and the first argument passed to the script in LEVEL. The LEVEL variable represents the back-up level, i.e. 1 if only the most important directories should be archived, 2 if some less important directories should be archived too, etc…


#!/bin/bash
#
# Creates a local back-up.
# The resulting files can be dumped to a media device.

BASEDIR=/home/filip
BACKUPDIR=${BASEDIR}/backup

MONTH=`date +%m`

LEVEL=$1


Next we define a two-parameter function that backs up a particular directory to a file. First it echoes to the console what it's going to back up, then it uses tar to do the actual archiving, and finally it creates a SHA-256 digest of the result. Notice that the output of tar is redirected to a log file. That way we keep the console output tidy, and at the same time we can browse through the log file if something went wrong. That's also why we included the v option (verbosely list files processed) in the option list for tar.


function back_up_to_file {
   echo "Backing up $1 to $2."
   tar -cvpzf ${BACKUPDIR}/$2.tar.gz ${BASEDIR}/$1 &> ${BACKUPDIR}/$2.log
   sha256sum -b ${BACKUPDIR}/$2.tar.gz > ${BACKUPDIR}/$2.sha256
}
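
The digest file records the archive under its full path, so the integrity of an archive can later be verified with sha256sum -c. For example, for a hypothetical Documents archive made in July:

sha256sum -c /home/filip/backup/Documents-07.sha256
# Prints "/home/filip/backup/Documents-07.tar.gz: OK" if the archive is intact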


Here are some examples of how the function can be used.


back_up_to_file bin bin
back_up_to_file dev dev
back_up_to_file Documents Documents-${MONTH}
back_up_to_file .thunderbird/12345678.default Thunderbird-${MONTH}


Notice the use of the variable MONTH in the example above to create rolling archives. The directories bin and dev will always be backed up to the same archive file, but for the documents directory and Thunderbird, a new one will be created every month. Of course, if the script is run a second time during the same month, the archive files for the documents directory and Thunderbird will be overwritten. The same will happen when the script is run a year later: the one-year-old archive file will then be overwritten with a fresh back-up of the documents directory and Thunderbird. Tailor this to your needs in your own back-up script!
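
To give an idea of what that looks like in practice: after a run in July, the back-up directory contains files along these lines for the examples above:

bin.tar.gz             bin.log             bin.sha256
dev.tar.gz             dev.log             dev.sha256
Documents-07.tar.gz    Documents-07.log    Documents-07.sha256
Thunderbird-07.tar.gz  Thunderbird-07.log  Thunderbird-07.sha256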

LEVEL can be used in the following manner to differentiate between important and often-changing directories on the one hand, and more stable directories you do not want to archive every time you run the script on the other hand.


# Backup of directories subject to changes
if [ ${LEVEL} -ge 1 ]; then
   back_up_to_file bin bin-${MONTH}
   back_up_to_file Documents Documents-${MONTH}
   back_up_to_file .thunderbird/12345678.default Thunderbird-${MONTH}
   back_up_to_file dev dev-${MONTH}
   …
fi

# Backup of relatively stable directories
if [ ${LEVEL} -ge 2 ]; then
   back_up_to_file Drawings Drawings
   back_up_to_file Photos/2010 Photos-2010
   back_up_to_file Movies/2013 Movies-2013
   back_up_to_file .fonts fonts
   …
fi

# Backup of stable directories
if [ ${LEVEL} -ge 3 ]; then
   back_up_to_file Music Music
   …
fi
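
Since the script takes the level as its first argument, the different levels can easily be run on different schedules from crontab. A possible set-up, with a hypothetical script name and schedule:

# Weekday evenings: only the directories that change often (level 1)
0 22 * * 1-5 /home/filip/bin/backup.sh 1
# Saturday afternoons: also the relatively stable directories (level 2)
0 14 * * 6 /home/filip/bin/backup.sh 2
# First day of the month: everything (level 3)
0 14 1 * * /home/filip/bin/backup.sh 3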


Next, I'd like to split large files into chunks that are easier to handle, so that it becomes easier to move archives between computers or to external media. The following function splits a large file into chunks of 3900 MB, i.e. just under 4 GB. Before it does that, it removes the chunks from the previous run, and when it's done, it also removes the original file.


# Split large files
function split_large_file {
   echo "Going to split $1.tar.gz."
   # Remove the chunks from a previous run, if any
   rm -f ${BACKUPDIR}/$1.tar.gz.0*
   split -d -b 3900m ${BACKUPDIR}/$1.tar.gz ${BACKUPDIR}/$1.tar.gz.
   # Remove the original file once it has been split
   rm ${BACKUPDIR}/$1.tar.gz
}


The example below shows how the function can be used.


if [ ${LEVEL} -ge 2 ]; then
   split_large_file Photos-2010
fi


Notice that it uses the LEVEL variable to control when the function is run on the various archive files. If there's a mismatch, the script would try to split non-existing files. That wouldn't hurt, but we also want to avoid the unnecessary error messages that would pop up on the console. A better solution would probably be to automatically detect whether there are any large files in the back-up directory and only split them, but I haven't had time to implement that yet.
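
For completeness: when it's time to restore a split archive, the chunks simply have to be glued back together with cat before tar can unpack them. A rough sketch, using the Photos-2010 archive from the example above:

# Reassemble the chunks (the shell expands the wildcard in the right order)
cat /home/filip/backup/Photos-2010.tar.gz.* > /home/filip/backup/Photos-2010.tar.gz
# Check the result against the digest that was taken before splitting
sha256sum -c /home/filip/backup/Photos-2010.sha256
# Unpack; tar stored the paths without the leading slash, so this recreates
# home/filip/Photos/2010 under the current directory
tar -xzpf /home/filip/backup/Photos-2010.tar.gz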

Finally, at the end of the script, we write to the console that we're done. I like to do that to indicate explicitly that everything went well, especially since this script can take a while.


echo "Done."


For the moment, I copy the resulting archive files manually from a local directory on the hard disk to an external disk, based on the timestamps. A better solution would be to create another script that can check the SHA-256 sums on the external disk against the local sums, and copy only those archives that are different. We'll save that one for another time.

Wednesday, January 9, 2013

SHA-1 Cracking Improvements and Cryptanalysis

In December of last year, researcher Jens Steube presented a big improvement in the efficiency of cracking SHA-1 password hashes at the Passwords^12 conference in Oslo. In short, by focusing on the word expansion phase of SHA-1, he was able to reduce the number of operations by 21%. The reason this is possible is that under certain conditions, a number of XOR operations always have a fixed result or cancel each other out. The result is that if you arrange your work in a smart way, password cracking can be sped up by roughly 25%.
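
For reference, the word expansion (or message schedule) of SHA-1 takes the sixteen 32-bit words W_0 up to W_15 of a message block and expands them into eighty words, using nothing but XORs and a one-bit left rotation:

W_t = \mathrm{ROTL}^{1}\left( W_{t-3} \oplus W_{t-8} \oplus W_{t-14} \oplus W_{t-16} \right), \qquad 16 \le t \le 79

When many of the words going into these XORs are the same for every candidate password, some of the intermediate results become constants that only need to be computed once instead of for every guess, which is the kind of redundancy being exploited here.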

His results seem amazing, especially because they look so basic, while so many researchers have been trying to break SHA-1 for so many years. Indeed, as Joachim Strömbergson noted, SHA-1 was published in 1995, almost twenty years ago. One would expect that finding such simplifications would be the first thing a researcher would try. There are a few factors that should be considered though:

First of all, Jens Steube was able to make these reductions in the context of brute-force password cracking. There the attacker only has the hash, but is completely free to choose the candidate plaintexts, and can therefore arrange them in a way that exploits the optimisations Jens Steube discovered. Cryptanalysis, however, usually concentrates on trying to find a collision in the hash function, and except for brute-force birthday attacks, this means that in most cases you can't choose the plaintext any way you like.

Second, even though Jens Steube was able to find some shortcuts in the SHA-1 algorithm under a certain set of conditions, it doesn't seem that he was able to reduce the fundamental complexity of the algorithm. I'm no expert on SHA-1 and therefore in no position to really assess how good the attack is, but the number of conditions may just as well outweigh the progress. I'll come back to that shortly.

Third, and finally, as impressive as a 21% reduction in operations may sound, it doesn't represent that much progress in the world of cryptology. In that world, progress isn't measured in percentages, but on an exponential scale. The base for that scale is usually 2, so that the results can be related to the number of bits in the search space. For SHA-1, the digest size is 160 bits, which means that the search space is 2^160. A brute-force birthday attack would then roughly require about 2^80 cipher-texts to be calculated in order to find a collision. A reduction of 21% would be equivalent to reducing this number to 2^79.66. This should be compared to the best known attack on SHA-1, by Marc Stevens, which requires 2^60 SHA-1 operations. Or, if you prefer to work with percentages, Marc Stevens' attack is equivalent to reducing the calculation time of the brute-force birthday attack by 99.9999%.
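
For those who want to check the arithmetic: a reduction of 21% means the work drops to 79% of the original, and Marc Stevens' attack needs only a fraction 2^-20 of the brute-force work:

0.79 \cdot 2^{80} = 2^{80 + \log_2 0.79} \approx 2^{79.66}
\qquad\text{and}\qquad
2^{60} / 2^{80} = 2^{-20} \approx 10^{-6}

In other words, the collision attack does roughly a millionth of the brute-force work, hence the 99.9999%.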

Having said all this, I hope the reader doesn't get the impression that I don't think Jens Steube's attack is impressive. Because it really is. Cryptanalysis is a one-sided arms race, where the attackers invent new weapons against old algorithms all the time. Jens Steube's attack is such a new weapon against SHA-1. Often, new weapons can be combined with old weapons to build even better ones. This means that in the worst case, Jens Steube's attack brings no progress except for password cracking. But we can hope that his findings can be combined with somebody else's attack, such as the one from Marc Stevens, or provide some other inspiration to improve it. If some sort of combination of attacks is possible, one can probably expect the number of operations to be reduced from 2^60 to 2^59.66.

Theoretically possible, but very unlikely because of the preconditions needed to apply the attack, would be a 21% reduction of the complexity of attacking SHA-1 itself, i.e. of the exponent, not just of the work per SHA-1 calculation. That would reduce the complexity from 2^60 to 2^(0.79 × 60) = 2^47.4. However, the conditions needed to apply Jens Steube's attack may represent some added complexity or extra calculations to actually find a collision, and therefore push that number up again. But maybe, just maybe, Jens Steube's attack contains a clue to reduce it even further, to make breaking SHA-1 trivial. I don't think that's very likely, but you never know.