Saturday, February 20, 2010

Improved Conference Captions from Amazon Mechanical Turk (2)


After my initial experiments last month, I applied to the FreeBSD Foundation for funds to pay for additional human editing of YouTube's machine-generated transcripts. The screenshot on the left shows an example HIT (Human Intelligence Task) available on Amazon Mechanical Turk.

The task description on the left is based on a template I created with three variables: $VIDEO_URL, $VIDEO_TITLE, and $CAPTIONS_URL. New HITs are then created by uploading a CSV file with one column for each of those variables and one row per video, e.g.

VIDEO_URL,VIDEO_TITLE,CAPTIONS_URL
http://www.youtube.com/watch?v=mMmbjJI5su0,"BSD v. GPL, Jason Dixon, NYCBSDCon 2008",http://people.FreeBSD.org/~murray/improved-captions-bsdvsgpl.sbv
http://www.youtube.com/watch?v=Pe8LdJpBGJ4,"Isolating Cluster Jobs for Performance and Predictability, Brooks Davis (DCBSDCon 2009",http://people.FreeBSD.org/~murray/improved-captions-isolatingcluster.sbv
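
A file like this can also be generated rather than typed by hand. The following is only a minimal sketch in Python, not the method actually used for these HITs: the function name build_hit_csv and the rows list are made up for illustration, and only the three column names and the sample row come from the example above. The csv module quotes the title automatically because it contains commas.

```python
import csv
import io

# Column names match the three template variables described above.
FIELDS = ["VIDEO_URL", "VIDEO_TITLE", "CAPTIONS_URL"]


def build_hit_csv(rows):
    """Return CSV text with a header row and one line per HIT.

    `rows` is a list of dicts keyed by the names in FIELDS
    (a hypothetical input shape chosen for this sketch).
    """
    buf = io.StringIO()
    # Default quoting (QUOTE_MINIMAL) wraps only fields that need it,
    # such as titles containing commas.
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


rows = [
    {
        "VIDEO_URL": "http://www.youtube.com/watch?v=mMmbjJI5su0",
        "VIDEO_TITLE": "BSD v. GPL, Jason Dixon, NYCBSDCon 2008",
        "CAPTIONS_URL": "http://people.FreeBSD.org/~murray/improved-captions-bsdvsgpl.sbv",
    },
]

hit_csv = build_hit_csv(rows)
print(hit_csv, end="")
```

Writing the result to a file and uploading it should then work the same way as with a hand-edited CSV.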


Using this method, I created 12 HITs for the first pass of editing, offering between $9 and $14 per video. A slightly modified template with the same three variables was then used to pay ~$7 per video for a second pass that further improved the first-pass transcripts.

The template has gotten more detailed over the past month in response to all of the minor ways that workers submitted less-than-perfect transcripts. The actual SBV file format used by YouTube captions is not formally specified anywhere as far as I can tell, but the 60-character maximum line width and the simple format can be verified in submitted transcripts with a few Emacs macros.
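
For anyone who would rather script that check than use Emacs, here is a rough sketch in Python. It is not the macros I used, and since the format is unspecified it rests on assumptions: that each caption block starts with a timing line shaped like 0:00:01.000,0:00:04.000, that the following lines are caption text, and that blocks are separated by blank lines. The names check_sbv, TIMING_RE, and MAX_WIDTH are invented for this example.

```python
import re

MAX_WIDTH = 60  # maximum caption line width mentioned above

# Assumed timing-line shape: H:MM:SS.mmm,H:MM:SS.mmm
TIMING_RE = re.compile(r"^\d+:\d{2}:\d{2}\.\d{3},\d+:\d{2}:\d{2}\.\d{3}$")


def check_sbv(text, max_width=MAX_WIDTH):
    """Return a list of (line_number, message) problems found in SBV text."""
    problems = []
    expect_timing = True  # the first line of each block should be a timing line
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            expect_timing = True  # a blank line ends the current caption block
            continue
        if expect_timing:
            if not TIMING_RE.match(line):
                problems.append((number, "expected a timing line"))
            expect_timing = False
            continue
        # Any other non-blank line is treated as caption text.
        if len(line) > max_width:
            problems.append(
                (number, f"caption line is {len(line)} characters wide")
            )
    return problems
```

Calling check_sbv on the contents of a submitted file returns an empty list when nothing looks wrong; otherwise each entry gives the line number to inspect before accepting the HIT.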

The transcript files have been checked into the FreeBSD Doc CVS Repository. The full list of videos with human-edited English-language transcripts is: