After my initial experiments last month, I applied to the FreeBSD Foundation for funds to pay for additional human editing of the machine-generated YouTube transcripts. The screenshot on the left shows an example HIT (Human Intelligence Task) available on Amazon Mechanical Turk.
The task description on the left is based on a template I created with three variables: $VIDEO_URL, $VIDEO_TITLE, and $CAPTIONS_URL. New HITs are then created by uploading a CSV file with three columns, one for each of those variables, e.g.:
VIDEO_URL,VIDEO_TITLE,CAPTIONS_URL
http://www.youtube.com/watch?v=mMmbjJI5su0,"BSD v. GPL, Jason Dixon, NYCBSDCon 2008",http://people.FreeBSD.org/~murray/improved-captions-bsdvsgpl.sbv
http://www.youtube.com/watch?v=Pe8LdJpBGJ4,"Isolating Cluster Jobs for Performance and Predictability, Brooks Davis (DCBSDCon 2009",http://people.FreeBSD.org/~murray/improved-captions-isolatingcluster.sbv
Using this method I created 12 HITs for the first pass of editing, for which I offered between $9 and $14 per video. A slightly modified template with the same three variables was used to pay ~$7 per video for a second pass to further refine the first-pass transcripts.
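For anyone preparing a larger batch, the upload file could also be generated by a script instead of by hand. The following is only a minimal sketch using Python's standard csv module; the build_hit_csv function and the videos list are hypothetical names made up for illustration, and the only thing taken from the real setup is the three column names. The csv module quotes fields that contain commas, which matters here because most of the video titles do.

```python
import csv
import io

# Column names must match the variables used in the HIT template.
FIELDS = ["VIDEO_URL", "VIDEO_TITLE", "CAPTIONS_URL"]


def build_hit_csv(videos):
    """Return CSV text with a header row and one row per HIT.

    `videos` is a list of dicts keyed by the names in FIELDS.
    (Hypothetical helper, not part of any Mechanical Turk tooling.)
    """
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(videos)
    return buf.getvalue()


videos = [
    {
        "VIDEO_URL": "http://www.youtube.com/watch?v=mMmbjJI5su0",
        "VIDEO_TITLE": "BSD v. GPL, Jason Dixon, NYCBSDCon 2008",
        "CAPTIONS_URL": "http://people.FreeBSD.org/~murray/improved-captions-bsdvsgpl.sbv",
    },
]

hit_csv = build_hit_csv(videos)
print(hit_csv, end="")
```

The printed text can be saved to a file and uploaded in the same way as a hand-written CSV.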
The template has gotten more detailed over the past month in response to all of the minor ways that workers submitted less-than-perfect transcripts. The actual SBV file format used by YouTube captions is not formally specified anywhere as far as I can tell, but the 60-character maximum width and simple format can be verified in submitted transcripts with a few Emacs macros.
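The same checks could be scripted rather than done with editor macros. The sketch below is in Python and rests on assumptions, since the format has no formal specification: it assumes each caption starts with a timestamp line of the form H:MM:SS.mmm,H:MM:SS.mmm, is followed by one or more text lines, and is separated from the next caption by a blank line. The check_sbv function name and the sample text are made up for illustration.

```python
import re

MAX_WIDTH = 60  # maximum caption line width mentioned above

# Assumed timestamp line layout: start,end as H:MM:SS.mmm
TIMESTAMP = re.compile(r"^\d+:\d{2}:\d{2}\.\d{3},\d+:\d{2}:\d{2}\.\d{3}$")


def check_sbv(text):
    """Return a list of (line_number, message) problems found in SBV text."""
    problems = []
    expect_timestamp = True  # the first non-blank line of a caption
    for lineno, line in enumerate(text.splitlines(), start=1):
        if line.strip() == "":
            # A blank line ends the current caption.
            expect_timestamp = True
            continue
        if expect_timestamp:
            if not TIMESTAMP.match(line):
                problems.append((lineno, "expected a timestamp line"))
            expect_timestamp = False
        elif len(line) > MAX_WIDTH:
            problems.append(
                (lineno, f"caption line is {len(line)} characters wide")
            )
    return problems


# Made-up sample: the second caption has a 61-character text line.
sample = (
    "0:00:01.000,0:00:04.000\n"
    "Welcome to the talk.\n"
    "\n"
    "0:00:04.000,0:00:09.500\n"
    + "x" * 61 + "\n"
)

problems = check_sbv(sample)
for lineno, message in problems:
    print(f"line {lineno}: {message}")
```

A check like this only catches layout problems; it says nothing about whether the transcript text itself is accurate.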
The transcript files have been checked into the FreeBSD Doc CVS Repository. The full list of videos with human-edited English language transcripts is:
- "M. Warner Losh, An Overview of FreeBSD/mips, AsiaBSDCon2009" (captions)
- "AsiaBSDCon 2009: Internet Mail — Past, Present, and (a bit of) the Future" (captions)
- "A. Rao: The Locking Infrastructure in the FreeBSD kernel #1" (captions)
- "A. Rao: The Locking Infrastructure in the FreeBSD kernel #2" (captions)
- "PC-BSD, Matt Olander, AsiaBSDCon 2008" (captions)
- "FreeBSD, Protecting Privacy with Tor" (captions)
- "Isolating Cluster Jobs for Performance and Predictability, Brooks Davis (DCBSDCon 2009" (captions)
- "Richard Bejtlich, Network Security Monitoring Using FreeBSD" (captions)
- "Jason Dixon Closing Remarks of DCBSDCon - BSD is Still Dying" (captions)
- "A Narrative History of BSD, Dr. Kirk McKusick" (captions)
- "BSD is Dying, Jason Dixon, NYCBSDCon 2007" (captions)
- "BSD v. GPL, Jason Dixon, NYCBSDCon 2008" (captions)
- "FreeBSD Kernel Internals, Dr. Marshall Kirk McKusick" (captions)