Wednesday, July 24, 2013

Converting Youtube Channels To Podcasts

I spend a lot of time on the bus commuting to and from work, and I like to listen to podcasts while I do that. I never got into the habit of watching channels on YouTube, so when CNET and StratFor moved their podcasts to YouTube channels, I pretty much lost track of both of them.

Sure, there's extra value in a video when you want to show a new product, screenshots, maps or some video footage, but most of the time the audio track is all you really need to get the essence. It's definitely better than not watching or listening to the video at all, and if it turns out I'd like to see the video too, I can always do that later. The only problem is: how do I convert a YouTube channel to MP3 files, just as if it were a regular podcast feed?

In Ubuntu, it turns out to be surprisingly easy to create a little Bash script that does just that. The basic ingredients are curl, youtube-dl, and one text file per channel to keep track of already downloaded videos. Let's work our way through the script.

#!/bin/bash
#
# “Catches” a YouTube feed, i.e. reads a YouTube feed, downloads videos that
# haven't been downloaded before, and converts them to MP3.
#

# Check if we are the only local instance
if [[ "$(pidof -x "$(basename "$0")" -o %PPID)" ]]; then
 echo "This script is already running with PID $(pidof -x "$(basename "$0")" -o %PPID) -- exiting."
 exit
fi

cd <your directory here>

First of all, we want to make sure we're the only instance of the script running. The reason for this is that if we have a slow connection, and we've added the script to crontab to be run once every hour, we don't want to end up having two or more instances of the script trying to download the same video. Once we've established we're the only running instance of the script, we move to the directory where we want the files to be downloaded, and we're ready to do the actual work.
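If you want to schedule the script with cron as described, the crontab entry might look something like this (the paths and the script name are just placeholders for wherever you saved the script and want the log to go):

```shell
# Hypothetical crontab entry: run the catcher at the top of every hour,
# appending its output to a log file.
0 * * * * /home/user/bin/catch_feeds.sh >> /home/user/catch_feeds.log 2>&1
```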

That work is done in a function called catch_feed, which takes two arguments: the name of the feed, which we'll use as a key to label the resulting MP3 files and the history file, and the URL of the feed. We then check whether there is already a history file for the feed, and if not, we create an empty one by touch'ing it.

function catch_feed {
 FEEDNAME=$1
 FEEDURL=$2
 HISTORY=${FEEDNAME}.hist

 if [ ! -f ${HISTORY} ]
 then
  touch ${HISTORY}
 fi
 echo "Downloading the feed for ${FEEDNAME}..."
 curl -s ${FEEDURL} -o ${FEEDNAME}.html
 FILES=$(grep -o "href=\"/watch?v=[^\"]*\"" ${FEEDNAME}.html | grep -o "=[^\"&]*" | grep -o "[^=]*")
 for FILE in $FILES
 do
  DOWNLOADED=$(grep -Fx -e "${FILE}" ${HISTORY})
  if [[ ! $DOWNLOADED ]]
  then
   FILENAME="${FEEDNAME}-${FILE}"
   echo "Downloading ${FILENAME}..."
   youtube-dl --extract-audio --audio-format=mp3 --output="${FILENAME}.%(ext)s" --quiet "http://www.youtube.com/watch?v=${FILE}"
   if [ -f "${FILENAME}.mp3" ]
   then 
    echo "${FILE}" >> ${HISTORY}
   fi
  fi
 done
 rm ${FEEDNAME}.html
}

Using curl, we download the page with the YouTube channel's feed. We save the page to a file, and then use grep to find all the links to videos. We do this in three stages: first we find all links to YouTube videos, then we strip the leading part of each link up to the video's ID, and finally we cut off anything after it. The result is a list of YouTube video IDs, which we can then match against the history file, and download if we want to.
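To see what the three grep stages actually produce, here they are run on a made-up snippet of a channel page (the video IDs are invented for the example):

```shell
#!/bin/bash
# A fabricated snippet of a channel page, with two video links.
HTML='<a href="/watch?v=dQw4w9WgXcQ">One</a> <a href="/watch?v=abc-123_XYZ">Two</a>'

# Stage 1: isolate the href attributes that point at videos.
# Stage 2: keep the part from the "=" up to the next quote or ampersand.
# Stage 3: drop the leading "=" itself.
IDS=$(echo "${HTML}" \
 | grep -o "href=\"/watch?v=[^\"]*\"" \
 | grep -o "=[^\"&]*" \
 | grep -o "[^=]*")
echo "${IDS}"
```

The non-empty lines of the result are exactly the two video IDs, which is what the for loop in the script iterates over.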

First we match the ID against the history file. Notice that YouTube video IDs can contain dashes (“-”), so we pass the ID with the -e option to keep an ID that starts with a dash from being interpreted as an option, and use the -F option to match fixed strings, not regular expressions. We also use the -x option to match complete lines.
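A quick demonstration of why those options matter, using a fabricated history file (the IDs are made up):

```shell
#!/bin/bash
# Fabricated history file; the second ID starts with a dash.
printf '%s\n' "dQw4w9WgXcQ" "-abc-123_XYZ" > demo.hist

# -e keeps the leading dash from being parsed as an option,
# -F matches the ID as a fixed string, and -x requires the whole
# line to match, not just a substring.
if grep -Fxq -e "-abc-123_XYZ" demo.hist; then
 echo "already downloaded"
fi

# Thanks to -x, a partial match is not enough:
if ! grep -Fxq -e "abc-123" demo.hist; then
 echo "not in history"
fi

rm demo.hist
```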

If there's no match for the YouTube video ID in the history file, we download the video and convert it to an MP3 file. This is done using youtube-dl, which has a special option, --extract-audio, to extract the audio track from the video file once it's downloaded (under the hood, youtube-dl relies on ffmpeg or avconv for the conversion). We also use the --quiet option so that we keep our own log messages clean. Once we've downloaded the video file and converted it to MP3, we append the YouTube video ID to the history file, so it isn't downloaded a second time.

Notice that we check whether the MP3 file really exists before we add it to the history file. Otherwise, if the internet connection goes down during the download, or another error occurs that stops youtube-dl without stopping the whole script, we would add a YouTube video ID to the history file without really having downloaded the file.

Finally, when we're done processing a YouTube channel, we remove the downloaded page with the channel's feed.

The rest of the script simply calls the function defined above, as shown below. The three calls catch the feeds for CNET News, CNET Update and StratFor respectively.

catch_feed cnetnews http://www.youtube.com/show/cnetnews/feed
catch_feed cnetupdate http://www.youtube.com/show/cnetupdate/feed
catch_feed stratfor http://www.youtube.com/user/STRATFORvideo/feed

echo "Done."

One final note: I have no idea whether using a script like this to listen only to the audio tracks, instead of watching the actual videos, is permitted by YouTube or the organizations producing the videos. But I assume that if this infringes YouTube's end user license agreement, it isn't one of their top priorities. Otherwise, YouTube would surely have broken youtube-dl a long time ago.