On 9 May, I gave a twenty-minute presentation at the OSDC.no conference here in Oslo about “How Free Data Can Drive Some of the Monkey Business Out of Political Journalism and Science” (see also the slides). But what is that “monkey business” about? What is it that often goes wrong when journalists report on the results of a new opinion poll, and why?
In my experience, margins of error are the most common problem. Journalists often forget that opinion polls come with a margin of error on the results, and that these margins of error depend not only on the total sample size of the opinion poll, but also on the size of each particular result. That means that within the same opinion poll, the result for a party polling at forty percent has a different margin of error than the result for a party polling at less than one percent. If only one margin of error is given, it usually refers to half of the width of the 95% confidence interval for a result around fifty percent. But a margin of error of, say, ±2% doesn't make much sense for a party polling at 1%: that would put its support somewhere between +3% and… −1%. Obviously, the latter is impossible, no matter how unpopular a political party has become.
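To make that concrete, here is a minimal sketch (not from the original talk, and assuming a typical sample size of 1,000 respondents) of the standard 95% margin of error for a proportion, 1.96·√(p(1−p)/n):
awk 'BEGIN {
    n = 1000
    for (i = 1; i < ARGC; i++) {
        p = ARGV[i]
        # 95% margin of error for a share p with sample size n
        printf "p = %4.1f%%  ->  margin of error = +/-%.1f%%\n", 100 * p, 100 * 1.96 * sqrt(p * (1 - p) / n)
    }
}' 0.50 0.40 0.01
With those numbers, a result around 50% or 40% carries a margin of error of roughly ±3 percentage points, while a result around 1% carries only about ±0.6, which is exactly why quoting a single margin of error for the whole poll is misleading.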
Small differences are also often blown out of proportion. This happens both when two parties polling close to each other are compared and when the results for the same party across multiple polls over time are discussed. My impression is that in the former case, American and British journalists are better at calling the results of an opinion poll a “tie” or “too close to call” than journalists in other countries. That of course has a lot to do with the electoral systems and the two-party political landscape in the US and the UK. Journalists in countries with a proportional electoral system and many parties are more often tempted to call a party polling 0.1% higher than a competitor “larger”, even though that doesn't make sense statistically speaking.
Closely related to the previous problem are polling results around thresholds. In some countries, parties reaching a particular threshold are rewarded with extra seats in parliament, or, the other way around, barred from parliament if they don't reach the threshold. As a consequence, opinion poll results close to the threshold get extra attention. However, a party polling at 5.1% isn't necessarily above a 5% threshold, even though journalists often conclude that it is. In fact, its odds of being above or below the threshold are roughly 50/50. The same goes for a party polling at 4.9%: it isn't simply below the 5% threshold either, its odds too are close to 50/50. Of course, the closer a party is polling to a particular threshold, the more interested people are in knowing whether it's above or below that threshold. The irony is that the closer a party is polling to a threshold, the less you can say about it with certainty.
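As an illustration (again a sketch, assuming a sample of 1,000 respondents), the distance between a 5.1% result and a 5% threshold can be expressed in standard errors:
awk 'BEGIN {
    n = 1000; p = 0.051; t = 0.05
    se = sqrt(p * (1 - p) / n)   # standard error of the estimated share
    z  = (p - t) / se            # distance to the threshold in standard errors
    printf "standard error = %.4f, z = %.2f\n", se, z
}'
A z-value of about 0.14 corresponds, via a standard normal table, to roughly a 56% probability of actually being above the threshold: closer to a coin toss than to a certainty.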
In most electoral systems, projecting the results of an opinion poll onto a seat distribution in parliament can quickly become a complicated issue. First-past-the-post systems like in the UK or two-round systems like in France are especially cumbersome, but even in electoral systems based on proportional distribution, the exercise can be challenging. Luckily, errors in one constituency will often be compensated by opposite errors in another constituency, such that the overall result is usually more or less correct. There's one catch though: as I already said, opinion poll results aren't exact results, they come with margins of error. And I can't remember ever having seen margins of error mentioned on a seat distribution, certainly not in the press. As a consequence, conclusions based on such seat distributions without margins of error are almost always wrong, especially if the electoral system involves any sort of threshold.
Finally, journalists often forget that you don't need fifty percent plus one vote to get a majority in parliament. As obvious as it may sound: a majority in the number of seats really is enough. Again, and for obvious reasons, journalists in the US and the UK are better at remembering this than journalists in countries with proportional electoral systems. In proportional systems, 47% of the popular vote is often enough for a majority in parliament, and sometimes even less, because votes cast for parties that don't make it past the electoral threshold are effectively discarded when the seats are distributed: if, say, 7% of the votes go to such parties, 47% of the total is already more than half of the votes that actually count.
After my presentation at OSDC.no, somebody in the audience asked me what we can do about these problems. The answer is of course: education, both for journalists and for their audience. However, understanding statistics is difficult, even if you're interested in it and have spent some time studying it. But we should at least try to inform journalists better, and create tools that can give them a better understanding of what the polls really say. Collecting data and providing it as free or open data is a first step in the right direction.
Monday 11 May 2015
Agile Projects Fail Too
Since the article «Vi kan få langt mer igjen for IT-investeringene» (“We can get far more out of our IT investments”) was published a couple of weeks ago, it has received quite a bit of attention in the circles I move in. At least, several people have already sent me recommendations to read it. The article's conclusion is that the public sector should structure its ICT projects in a much more agile way, just like the private sector does.
It's not hard to agree with the article's analysis of the problem. Lately we've seen quite a few public-sector ICT projects that have spent a terrible amount of money without producing any results, or at best only small ones. Some ICT projects have even spent a lot of money without ever getting past the start-up phase. But I'm not so sure the solution the article proposes really solves the whole problem.
First of all, the chain of accountability in the public sector is put together in a completely different way than in the private sector. When it comes down to it, a private company doesn't answer to anyone but its shareholders. If a conflict arises, e.g. because of disagreement over a strategic bet, the shareholders can choose to throw out the management and get the course corrected. And the shareholders who end up in the minority can choose to sell their shares and invest instead in a company that does things in a way they like better.
In the public sector, this exercise is much harder. In principle you can of course always move to another country and even change citizenship, but very few people do that because of a couple of failed public-sector ICT projects.
The consequence is that the public sector is, and should be, much more careful about spending taxpayers' money. It's therefore not wrong to run a small preliminary project to investigate the maturity of a new technology or a new product, so that you get to think things through properly before starting a larger project. In the private sector, on the other hand, it's much easier –and not wrong either– to get started with a new technology or a new product, as long as you manage to convince the right people in the company.
Note, however, that I write a small preliminary project, not a huge preliminary project that sometimes ends up almost as big as the main project itself. That quickly turns into too much consultancy without any substance being delivered. And by all means, the preliminary project may well be run as a proof of concept instead of just producing a report with lots of text.
But another problem I have with the article is that it also falls into the trap called survivorship bias. When you point to successful private companies like Amazon, Google, Facebook, Ebay, Netflix, Spotify and FINN.no (and they could have added LinkedIn), you have to remember that there are also tens, if not hundreds, of other private companies that tried to do exactly the same thing, but failed. Often they were just as agile in their development methodology as the examples mentioned above, and yet they failed because they bet on the wrong technology or the wrong product, came to the market too early, came to the market too late, were located in the wrong country, or simply had a bit of bad luck.
For public-sector projects, being too early or too late to the market is usually not a problem, because the state has a monopoly on a number of services and therefore defines the market itself. But when it comes to choosing technologies and products, the public sector is just as exposed to bad choices as any single private company, while at the same time it can't benefit from the market economy's selection process, in which many private companies together single out the winner. And that means that if a public project bets on the wrong technology or the wrong product to implement a service, it will be stuck with that choice for potentially many years to come.
I have several times witnessed public projects that took a chance on a new technology or a new product, and missed badly. Sometimes it was because they were too early, sometimes simply because they had bad luck. But once you've chosen the wrong technology or the wrong product, it's hard to get out of it without incurring large extra costs. Instead, people will rather try to see it through, even if that means it becomes ever harder to find people willing to work on the project, and ever harder to develop it further or maintain it.
In many of these cases, agile development wouldn't have changed the outcome of the project. Some of these cases actually were agile projects. But because the choice of technology or product is a binary choice –you either use it, or you don't– it doesn't help to feel your way forward at the beginning and take small steps. Besides, the fact that you've chosen the wrong technology or product is often something you don't find out until a couple of years later, and by then it's too late.
So yes, it's frustrating to see a lot of taxpayers' money being spent on large preliminary projects without anything useful coming out of them (if you narrow the concept of useful down to “running code”). It's also frustrating to see that some IT systems block important and necessary changes to the law. And it's probably true that agile development in many cases could give more value for money than is the case today. But no, not even agile project development can guarantee success for every public IT investment. That's an important nuance to keep in mind, so that agile project development doesn't itself get the blame the next time problems arise in a public ICT project.
Time to Say Goodbye to the Traditional Watch
On Friday 24 April, Apple will start selling the Apple Watch (you can already pre-order one now). I think this will be the start of yet another disruptive change triggered by an Apple product. Why? Not so much because of the Apple Watch product per se, but because it will make smart watches go mainstream. And once that happens, the era of the traditional watch will be over.
The reason for this is the following: once you've started using a smart watch, there's no way you'll go back to a traditional watch. The extra functionality that a smart watch gives you offsets the hassle of having to charge it at least once a day. I've been wearing a smart watch for almost four months now, and these are the three main reasons why I like it so much:
- When you're in a meeting, or another situation where you can't take an incoming call, it's much easier and quicker to reject the call from your smart watch than picking up your phone and rejecting it from there. (Actually, even when I'm going to pick up the phone, I find myself looking at my watch to see who it is. It's just quicker.) Also, it's much easier to notice a vibrating watch than a vibrating phone floating around in one of your pockets, and definitely less noisy than a phone vibrating on a table.
- I read most of my notifications on my watch, and leave my phone in my pocket. Only if more action is needed, like sending a reply or reading the full news article, do I pick up my phone.
- Synchronization with my Google calendar, and in particular the display of when my next meeting is. During the last hour before a meeting, my watch even tells me how many minutes are left. That's not a big issue for meetings at the top or the bottom of the hour, but it's really handy when you have to catch a train or a bus.
Like I already mentioned, yes, you have to charge the watch every single day. In fact, you'll have to charge your phone more often too if you keep the two synchronized, but that's also a new habit you'll pick up really quickly. Another downside is that the watch is a bit big and clunky, but I expect smart watches will become smaller and smaller with better battery life, just like mobile phones did.
Contrary to a lot of other people, I'm not sure there will be a lot of app development directed at the watch. To me, it seems that the basic functionality is all you really need. Maybe it's just me lacking imagination, but I don't see why people would want to spend a lot of time on the tiny display. There may be room for more “smart tracking” in the area of health or location, but I think that will be it.
Wednesday 24 July 2013
Converting Youtube Channels To Podcasts
I spend a lot of time on the bus commuting to and from work, and I like to listen to podcasts while I do that. I never got into the habit of watching channels on YouTube, so when CNET and StratFor moved from podcasting to a channel on YouTube, I pretty much lost track of both of them.
Sure, there's extra value in a video when you want to show a new product, screen shots, maps or some video footage, but most of the time the audio track is all you really need to get the essence. It's definitely better than not watching or listening to the video at all, and if I find out I would like to see the video too, I can always do that later. The only problem is: how do I convert a YouTube channel to MP3 files, just as if it were a regular podcast feed?
In Ubuntu, it turns out to be surprisingly easy to create a little Bash script that does just that. Basic ingredients of the script are curl, youtube-dl and one text file per channel to keep track of already downloaded videos. Let's work our way through the script.
#!/bin/bash
#
# “Catches” a YouTube feed, i.e. reads a YouTube feed, downloads videos that
# haven't been downloaded before, and converts them to MP3.
#
# Check if we are the only local instance
if [[ "`pidof -x $(basename $0) -o %PPID`" ]]; then
    echo "This script is already running with PID `pidof -x $(basename $0) -o %PPID` -- exiting."
    exit
fi
cd <your directory here>
First of all, we want to make sure we're the only instance of the script running. The reason for this is that if we have a slow connection, and we've added the script to crontab to be run once every hour, we don't want to end up having two or more instances of the script trying to download the same video. Once we've established we're the only running instance of the script, we move to the directory where we want the files to be downloaded, and we're ready to do the actual work.
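As a side note, the hourly crontab entry could look something like this (the paths are hypothetical, adjust them to wherever you keep the script and want the log):
# m h dom mon dow  command
0 * * * * /home/filip/bin/catch_youtube_feeds.sh >> /home/filip/catch_youtube_feeds.log 2>&1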
That work is done in a function called catch_feed, which takes two arguments: the name of the feed, which we'll use as a key to label the resulting MP3 files and the history file, and the URL of the feed. We then check whether there is already a history file for the feed, and if not, we create an empty one by touch'ing it.
function catch_feed {
    FEEDNAME=$1
    FEEDURL=$2
    HISTORY=${FEEDNAME}.hist
    if [ ! -f ${HISTORY} ]
    then
        touch ${HISTORY}
    fi
    echo "Downloading the feed for ${FEEDNAME}..."
    curl -s ${FEEDURL} -o ${FEEDNAME}.html
    FILES=`cat ${FEEDNAME}.html | grep -o "href=\"/watch?v=[^\"]*\"" | grep -o "=[^\"&]*" | grep -o "[^=]*"`
    for FILE in $FILES
    do
        DOWNLOADED=`grep -Fx -e "${FILE}" ${HISTORY}`
        if [[ ! $DOWNLOADED ]]
        then
            FILENAME="${FEEDNAME}-${FILE}"
            echo "Downloading ${FILENAME}..."
            youtube-dl --extract-audio --audio-format=mp3 --output="${FILENAME}.%(ext)s" --quiet "http://www.youtube.com/watch?v=${FILE}"
            if [ -f "${FILENAME}.mp3" ]
            then
                echo "${FILE}" >> ${HISTORY}
            fi
        fi
    done
    rm ${FEEDNAME}.html
}
Using curl, we download the page with the YouTube channel's feed. We save the page to a file, and then use grep to find all the links to videos. We do this in three stages: first we try to find all links to YouTube videos, then we remove the starting part of the links up to the video's ID, and then we get rid of the ending part of it. The result is a list of YouTube video IDs, which we then can match against the history file, and download if we want to.
First we match the ID against the history file. Notice that YouTube video IDs can contain dashes (“-”), so we have to use the -F option to match using fixed strings, not regular expressions. We also use the -x option to match complete lines.
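A toy example with a made-up video ID shows why that combination matters: the ID may start with a dash, and only the complete, literal ID should count as a match.
printf '%s\n' "-abc_DEF123" "-abc_DEF1234" | grep -Fx -e "-abc_DEF123"
# prints only the first line: -F matches the ID literally, -x requires the whole
# line to be equal, and -e keeps grep from treating the leading dash as an option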
If there's no match for the YouTube video ID in the history file, we download the video and convert it to an MP3 file. This is done using youtube-dl, which has a special option, --extract-audio, to extract the audio track from the video file once it's downloaded. We also use the --quiet option so that we keep our own log messages clean. Once we've downloaded the video file and converted it to MP3, we append the YouTube video ID to the history file, so it isn't downloaded a second time.
Notice that we check whether the MP3 file really exists before we add it to the history file. Otherwise, if the internet connection goes down during the download, or another error occurs that stops youtube-dl without stopping the whole script, we would add a YouTube video ID to the history file without having really downloaded it.
Finally, when we're done processing a YouTube channel, we remove the downloaded page with the channel's feed.
The rest of the script just calls the function from above, as shown below. The three calls catch the feeds for CNET News, CNET Update and StratFor respectively.
catch_feed cnetnews http://www.youtube.com/show/cnetnews/feed
catch_feed cnetupdate http://www.youtube.com/show/cnetupdate/feed
catch_feed stratfor http://www.youtube.com/user/STRATFORvideo/feed
echo "Done."
Just a final note: I have no idea whether using a script like this to listen only to the audio tracks instead of watching the actual videos is permitted by YouTube or the organizations producing the videos. But I'm assuming that if this is an infringement of YouTube's end-user license agreement, it's not one of their top priorities. Otherwise, YouTube would surely have broken youtube-dl a long time ago.
Friday 15 March 2013
Tar-Based Back-ups
A few months ago, I found out that I had to change the back-up strategy on my personal laptop. Until then I had used Areca, which in itself worked fine, but I was looking for something that could be scripted and used from the command line, and that was easy to install. As is often the case in the Linux world, it turned out you can easily script a solution of your own using some basic building blocks. For this particular task, the building blocks are Bash, tar, rm and split.
What was my problem with Areca? First of all, from time to time, Areca had to be updated. This is usually a good thing, but not if the new version is incompatible with the old archives. This can also cause problems when restoring archives, e.g. from one computer to another, or after a complete reinstallation of the operating system. Furthermore, since Areca uses a graphical user interface, scripting and running the back-up process from the command line (or crontab) wasn't possible.
My tar-based back-up script starts with a shebang interpreter directive to the Bash shell. Then it sets up four environment variables: a base directory in BASEDIR, the back-up directory where all archives will be stored in BACKUPDIR, the number of the current month (two digits) in MONTH, and the first argument passed to the script in LEVEL. The LEVEL variable represents the back-up level, i.e. 1 if only the most important directories should be archived, 2 if some less important directories should be archived too, etc…
#!/bin/bash
#
# Creates a local back-up.
# The resulting files can be dumped to a media device.
BASEDIR=/home/filip
BACKUPDIR=${BASEDIR}/backup
MONTH=`date +%m`
LEVEL=$1
Next we define a two-parameter function that backs up a particular directory to a file. First it echoes to the console what it's going to back up, then uses tar to do the actual archiving, and finally creates a SHA-256 digest of the result. Notice that the output of tar is redirected to a log file. That way we keep the console output tidy, and at the same time we can browse through the log file if something went wrong. That's also why we included v (verbosely list files processed) in the option list for tar.
function back_up_to_file {
echo "Backing up $1 to $2."
tar -cvpzf ${BACKUPDIR}/$2.tar.gz ${BASEDIR}/$1 &> ${BACKUPDIR}/$2.log
sha256sum -b ${BACKUPDIR}/$2.tar.gz > ${BACKUPDIR}/$2.sha256
}
Here are some examples of how the function can be used.
back_up_to_file bin bin
back_up_to_file dev dev
back_up_to_file Documents Documents-${MONTH}
back_up_to_file .thunderbird/12345678.default Thunderbird-${MONTH}
Notice the use of the variable MONTH in the example above to create rolling archives. The directories bin and dev will always be backed up to the same archive file, but for the documents directory and Thunderbird, a new one will be created every month. Of course, if the script is run a second time during the same month, the archive file for the documents directory and Thunderbird will be overwritten. The same will happen when the script is run a year later: the one-year-old archive file will then be overwritten with a fresh back-up of the documents directory and Thunderbird. Tailor this to your needs in your own back-up script!
LEVEL can be used in the following manner to differentiate between important and often-changing directories on the one hand, and more stable directories you do not want to archive every time you run the script on the other hand.
# Backup of directories subject to changes
if [ ${LEVEL} -ge 1 ]; then
back_up_to_file bin bin-${MONTH}
back_up_to_file Documents Documents-${MONTH}
back_up_to_file .thunderbird/12345678.default Thunderbird-${MONTH}
back_up_to_file dev dev-${MONTH}
…
fi
# Backup of relatively stable directories
if [ ${LEVEL} -ge 2 ]; then
back_up_to_file Drawings Drawings
back_up_to_file Photos/2010 Photos-2010
back_up_to_file Movies/2013 Movies-2013
back_up_to_file .fonts fonts
…
fi
# Backup of stable directories
if [ ${LEVEL} -ge 3 ]; then
back_up_to_file Music Music
…
fi
Next, I'd like to split large files into chunks that are easier to handle. This makes it easier to move archives between computers or to external media. The following function splits a large file into chunks of just under 4 GB. Before it does that, it removes the chunks from the previous run, and when it's done, it also removes the original file.
# Split large files
function split_large_file {
echo "Going to split $1.tar.gz."
rm ${BACKUPDIR}/$1.tar.gz.0*
split -d -b 3900m ${BACKUPDIR}/$1.tar.gz ${BACKUPDIR}/$1.tar.gz.
rm ${BACKUPDIR}/$1.tar.gz
}
The example below shows how the function can be used.
if [ ${LEVEL} -ge 2 ]; then
split_large_file Photos-2010
fi
Notice that it uses the LEVEL variable to control when the function is run on the various archive files. If there's a mismatch, the script would try to split non-existing files. That wouldn't hurt, but we also want to avoid the unnecessary error messages that would pop up on the console. A better solution would probably be to automatically detect whether there are any large files in the back-up directory and only split them, but I haven't had time to implement that yet.
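For what it's worth, a rough sketch of that automatic detection could look like the snippet below: it simply asks find for any archive in the back-up directory larger than 4 GB and feeds it to the split function. I haven't run this as part of the script, so treat it as an illustration only.
# Split every archive in the back-up directory that is larger than 4 GB
for LARGE in `find ${BACKUPDIR} -maxdepth 1 -name "*.tar.gz" -size +4G`
do
    split_large_file `basename ${LARGE} .tar.gz`
done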
Finally, at the end of the script, we write to the console that we're done. I like to do that to indicate explicitly that everything went well, especially since this script can take a while.
echo "Done."
For the moment, I copy the resulting archive files manually from a local directory on the hard disk to an external disk, based on the timestamps. A better solution would be to create another script that can check the SHA-256 sums on the external disk against the local sums, and copy only those archives that are different. We'll save that one for another time.
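To give an idea of what that could look like, here is a sketch that compares the local SHA-256 files with the ones on the external disk and copies only the archives whose digests differ. The mount point is an assumption, and the snippet is untested.
# Assumption: the external disk is mounted at /media/backup with the same layout as ${BACKUPDIR}
EXTERNALDIR=/media/backup
for SUMFILE in ${BACKUPDIR}/*.sha256
do
    NAME=`basename ${SUMFILE} .sha256`
    # copy the archive (and its chunks, if it was split) when the digests differ or are missing
    if ! cmp -s ${SUMFILE} ${EXTERNALDIR}/${NAME}.sha256 2> /dev/null
    then
        echo "Copying ${NAME}..."
        cp ${BACKUPDIR}/${NAME}.tar.gz* ${SUMFILE} ${EXTERNALDIR}/
    fi
done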
Wednesday 9 January 2013
SHA-1 Cracking Improvements and Cryptanalysis
In December of last year, researcher Jens Steube presented a big improvement in the efficiency of cracking passwords hashed with SHA-1 at the Passwords^12 conference in Oslo. In short, by focusing on the word expansion phase of SHA-1, he was able to reduce the number of operations by 21%. The reason why this is possible is that under certain conditions, a number of XOR operations always have a fixed result or cancel each other out. The result is that if you arrange your work in a smart way, password cracking can be sped up by roughly 25%.
His results seem amazing, especially because they seem so basic, given that many researchers have been trying to break SHA-1 for so many years. Indeed, as Joachim Strömbergson noted, SHA-1 was published in 1995, almost twenty years ago. One would expect that finding such simplifications would be the first thing a researcher would try to do. There are a few factors that should be considered though:
First of all, Jens Steube was able to make these reductions in the context of brute-force password cracking. Brute-force password cracking is basically an attack where you only have the hash and choose the candidate plaintexts yourself, and then it's indeed possible to arrange those chosen plaintexts such that you can exploit the optimisations that Jens Steube discovered. Cryptanalysis, however, usually concentrates on trying to find a collision in the hash function, and except for brute-force birthday attacks, this means in most cases you can't choose the plaintext any way you like.
Second, even though Jens Steube was able to find some shortcuts in the SHA-1 algorithm under a certain set of conditions, it doesn't seem that he was able to reduce the fundamental complexity of the SHA-1 algorithm. I'm no expert on SHA-1 and therefore in no position to really judge how good the attack is, but the number of conditions may just as well outweigh the progress. But I'll come back to that shortly.
Third, and finally, as impressive as a reduction of operations by 21% may sound, it doesn't represent such big progress in the world of cryptology. In that world, progress isn't measured in percentages, but on an exponential scale. The base for that scale is usually 2, so that the results can be related to the number of bits in the search space. For SHA-1, the digest size is 160 bits, which means that the search space is 2^160. A brute-force birthday attack would then roughly require about 2^80 cipher-texts to be calculated in order to find a collision. A reduction of 21% would then be equivalent to reducing this number to 2^79.66. This number should be compared to the best known attack on SHA-1, by Marc Stevens, which requires 2^60 SHA-1 operations. Or, if you prefer to work with percentages, Marc Stevens' attack is equivalent to reducing the calculation time of the brute-force birthday attack by 99.9999%.
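For the curious, the two conversions above are easy to check (a quick sketch using awk):
awk 'BEGIN {
    # 21% fewer operations expressed on the exponential scale
    printf "2^80 * 0.79 = 2^%.2f\n", 80 + log(0.79) / log(2)
    # Marc Stevens attack expressed as a percentage reduction
    printf "2^60 / 2^80 = 1/%d, a %.4f%% reduction\n", 2^20, 100 * (1 - 2^-20)
}'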
Having said all this, I hope the reader doesn't have the impression that I don't think Jens Steube's attack is impressive. Because it really is. Cryptanalysis is a one-sided arms race, where the attackers invent new weapons against old algorithms all the time. Jens Steube's attack is such a new weapon against SHA-1. Often, new weapons can be combined with old weapons to build even better weapons. This means that in the worst case, Jens Steube's attack brings no progress except for password cracking. But we can hope that his findings can be combined with somebody else's attack, like the one from Marc Stevens, or give some other inspiration to improve it. If some sort of combination of attacks is possible, one can probably expect that the number of operations could be reduced from 2^60 to 2^59.66.
Theoretically possible, but very unlikely because of the preconditions needed to apply the attack, would be a reduction of 21% of the complexity to attack SHA-1, not just the calculation of SHA-1. That would reduce the complexity from 2^60 to 2^47.4. However, the conditions to apply Jens Steube's attack may represent some added complexity or extra calculations needed to actually find a collision, and therefore increase the number again. But maybe, just maybe, Jens Steube's attack contains a clue to reduce it even further, to make breaking SHA-1 trivial. I don't think that's very likely, but you never know.
Friday 7 September 2012
Making the Programming Pain Stop
Johannes Brodwall is organizing a panel debate at the upcoming JavaZone 2012 conference under the title “Making the programming pain stop”. I'm not on the panel, but here are my ideas about what's causing programming pain and how we can stop it.
Let's start by defining programming pain. I think this is what we feel when things are not what they're supposed to be. I'm thinking mainly of frameworks or tools that make us write extra code or unnecessary lines in configuration files. Or having to support code we're not able to understand –it doesn't matter whether we're talking about fixing bugs or adding new features–, because the method and variable names don't make sense, there's virtually no documentation, or much too much, and no unit tests are present or they're testing the wrong things. I think we've all been there. If not, just pick one of your own projects you worked on two or three years ago and see for yourself.
Now how can we make the programming pain stop? Johannes Brodwall has two suggestions: either firing all architects and project managers, or asking the developers to “grow the **** up”. I think the former is unrealistic, and wouldn't make a big difference anyway. (I'm sorry to break the news to you, architects and project managers.) But to all developers smiling right now: if it had made a difference, it would probably have been one for the worse, because I really think the big problem is that developers indeed need to grow up and get their act together.
For one thing, I'm still amazed that there are still so many developers out there thinking that writing automated unit tests is a waste of time. I can accept that “pure” TDD maybe doesn't work for you, and that you prefer to write your automated unit tests after you write your source code, but I'm going to be very suspicious about the quality of both your source code and your unit tests. But it's still better than no automated unit tests at all.
I know many project managers have to take part of the blame on this one too. Some of them still think automated unit testing is just gold plating. In my experience, it's very hard to write unit tests for anything close to gold plating. How would you do that, write a unit test for functionality you're going to add “just in case” or because it's nice to have? The unit test will either reveal that the functionality you want to add is useless, or that it isn't gold plating at all. I think that one of the main reasons why TDD speeds up development time –it does– is that it will keep developers from gold plating their code.
But there is more that developers should start doing. Unit testing and TDD in itself is not enough. Aim for high test coverage—not just system-wide, but in every single class you write. And think hard about why you really can't unit test the parts that aren't covered by automated unit tests yet. Use static code analysis tools with a sensible set of rules, and be strict about it. And if you're ready for it, have a look at mutation testing.
There are other things too that developers are often sloppy about. Is it really that hard to pick good names for all your methods and variables? You can afford to spend ten seconds on every name, and if you can't come up with a good name after ten seconds, ask yourself whether you need the method or the variable at all. The fact that you can't come up with a good name may be an indication that you don't know what you're doing. Spend some time putting in documentation, but don't spend time putting in incorrect, incomplete and/or unnecessary documentation. And this includes putting in a sensible message when you commit your code to your version control system.
While we're talking about version control systems: merge your code to the right branch(es) straight away. Don't even consider doing it later (like when you'll have more time—I mean, really?) or when you can do all the merges in one go. It's not going to work, you'll have lots of conflicts, and since you'll be out of context, you'll probably have to spend more time fixing things than if you had done it right away. And sending a merge job to one of your colleagues is like asking to be fired on the spot.
Update your issue tracking system as soon as you start working on a new task, and every time it changes status. Add comments that will help testers to test the task, and chances are they will understand much faster why your task really is done, instead of sending it back to you because they couldn't figure out what changed and how it should be tested.
Finally, if you're one of the hot shots in your company developing a framework or some services that will be used by other developers, how about some sensible defaults? Do I really have to specify that all my numbers are decimal? (Oh, by the way, this text uses the Latin alphabet.) And if you're a developer working on a project where all the numbers are octal, write a convenience method instead of spreading the number eight all over your code.
All the things listed above cause a lot of programming pain, not just for your colleagues, but for yourself too. Except for the TDD part, no architect and no project manager is involved in any of this, and no architect or project manager will ever stop you from doing these things correctly. So why don't you? It's amazing how much time you can save on boring stuff if you put in a few extra seconds to do it immediately and correctly.