Saturday, February 20, 2010

Improved Conference Captions from Amazon Mechanical Turk (2)


After my initial experiments last month, I applied to the FreeBSD Foundation for funds to pay for additional human editing of the machine-generated YouTube transcripts. The screenshot on the left shows an example HIT (Human Intelligence Task) available on Amazon Mechanical Turk.

The task description on the left is based on a template I created with three variables: $VIDEO_URL, $VIDEO_TITLE, and $CAPTIONS_URL. New HITs are then created by uploading a CSV file with one column for each of those variables, e.g.

VIDEO_URL,VIDEO_TITLE,CAPTIONS_URL
http://www.youtube.com/watch?v=mMmbjJI5su0,"BSD v. GPL, Jason Dixon, NYCBSDCon 2008",http://people.FreeBSD.org/~murray/improved-captions-bsdvsgpl.sbv
http://www.youtube.com/watch?v=Pe8LdJpBGJ4,"Isolating Cluster Jobs for Performance and Predictability, Brooks Davis (DCBSDCon 2009)",http://people.FreeBSD.org/~murray/improved-captions-isolatingcluster.sbv
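
A file like the one above could also be generated by a script rather than typed by hand, which avoids quoting mistakes in titles that contain commas. The following is only a minimal sketch: the function name build_hit_csv is hypothetical, and it assumes that Mechanical Turk accepts standard CSV quoting as produced by Python's csv module.

```python
import csv
import io

# Column names match the three template variables described in the post.
FIELDS = ["VIDEO_URL", "VIDEO_TITLE", "CAPTIONS_URL"]


def build_hit_csv(rows):
    """Return CSV text with one header row and one row per HIT.

    build_hit_csv is a hypothetical helper, not part of any Amazon tool.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    # The default quoting mode only quotes fields that need it,
    # such as titles containing commas.
    writer.writerows(rows)
    return buffer.getvalue()


rows = [
    {
        "VIDEO_URL": "http://www.youtube.com/watch?v=mMmbjJI5su0",
        "VIDEO_TITLE": "BSD v. GPL, Jason Dixon, NYCBSDCon 2008",
        "CAPTIONS_URL": "http://people.FreeBSD.org/~murray/improved-captions-bsdvsgpl.sbv",
    },
]

print(build_hit_csv(rows), end="")
```

Each dictionary in rows becomes one HIT, so adding a video means adding one more entry to the list.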


Using this method I created 12 HITs for the first pass of editing, for which I offered between $9 and $14 per video. A slightly modified template with the same three variables was used to pay ~$7 per video for a second pass that further polished the transcripts from the first pass.

The template has gotten more detailed over the past month in response to all of the minor ways that workers submitted less-than-perfect transcripts. The actual SBV file format used by YouTube captions is not formally specified anywhere as far as I can tell, but the 60-character maximum width and simple format can be verified in submitted transcripts with a few Emacs macros.
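
The same checks could be scripted instead of done with editor macros. Since the format is not formally specified, the sketch below rests on an assumed layout: each caption block is a timing line of the form H:MM:SS.mmm,H:MM:SS.mmm, followed by one or more text lines, with a blank line between blocks. The function name check_sbv and the timing pattern are illustrative assumptions, not an official definition of SBV.

```python
import re

MAX_WIDTH = 60  # maximum caption line width mentioned in the post

# Assumed timing-line layout: H:MM:SS.mmm,H:MM:SS.mmm
TIMING_RE = re.compile(r"^\d+:\d{2}:\d{2}\.\d{3},\d+:\d{2}:\d{2}\.\d{3}$")


def check_sbv(text, max_width=MAX_WIDTH):
    """Return a list of (line_number, message) problems found in SBV text."""
    problems = []
    expect_timing = True  # the first line of every block should be a timing line
    has_text = False
    number = 0
    for number, line in enumerate(text.splitlines(), start=1):
        if expect_timing:
            if not line.strip():
                continue  # tolerate extra blank lines between blocks
            if not TIMING_RE.match(line):
                problems.append((number, "expected a timing line"))
            expect_timing = False
            has_text = False
        elif not line.strip():
            # A blank line closes the current caption block.
            if not has_text:
                problems.append((number, "caption block has no text"))
            expect_timing = True
        else:
            has_text = True
            if len(line) > max_width:
                problems.append((number, f"line is {len(line)} characters wide"))
    # A file that ends right after a timing line has an empty final block.
    if not expect_timing and not has_text:
        problems.append((number, "caption block has no text"))
    return problems


sample = (
    "0:00:01.000,0:00:04.000\n"
    "Welcome to the talk.\n"
    "\n"
    "0:00:04.000,0:00:06.500\n"
    "Let's get started.\n"
)
for line_number, message in check_sbv(sample):
    print(f"line {line_number}: {message}")
```

Running the checker on each submitted transcript before approving a HIT would catch overly long lines and missing timing lines without opening the file in an editor.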

The transcript files have been checked into the FreeBSD Doc CVS Repository. The full list of videos with human-edited English language transcripts is:

Sunday, January 10, 2010

Improved Conference Captions from Amazon Mechanical Turk

Just wanted to send a quick note that three of the popular videos from the BSD Conferences YouTube channel have been updated with human-edited English language caption files. These offer a significant improvement over the machine-generated captions I wrote about last month.

The following videos have been updated:


I've also posted three simple caption text files which provide the times and text in a very simple ASCII format in case anyone wants to provide a diff to fix any remaining mistakes in the captions.

The transcriptions were done with the help of the industrious workers behind Amazon Mechanical Turk. The three transcripts above represent at least 6 person-hours of work, and quite possibly twice that. They were completed for less than $50 by leveraging the timing information from the free machine-generated captions and using Mechanical Turk for the editing. This is less than 1/10th of the cost of a commercial transcription service.

What is the quality of these captions in other languages when automatically translated with YouTube? Are there any other videos for which captions would be particularly useful?

AsiaBSDCon is coming up in March, and I hope to have things streamlined by then such that videos with both Japanese and English captions can be added to the channel shortly after the conference.

Tuesday, December 22, 2009

Machine generated captions for BSD conference videos

One of the most frequent requests I've received since launching the BSD Conferences YouTube channel last year has been for captions in Spanish, Russian, Chinese, and other languages. I was excited last month when Google announced automatic captions for YouTube videos using speech recognition. This feature is still highly experimental but I am happy to report that it has been enabled for the BSD Conferences channel. In combination with the much more mature automatic translation feature, this means that captions are now available in over 50 languages from Afrikaans to Vietnamese for most of the 73 videos in the BSD Conferences channel.

The automatic captions are still highly experimental, and accurately transcribing highly technical content spoken by a diverse set of international speakers is a significant challenge. If you are interested in helping to correct any of the English transcripts I would be happy to provide you with a simple text file of the transcription, with each entry giving the start and end time for the caption to be displayed, and the caption text. One advantage of the machine transcription is that the most time-consuming part of manually creating captions, synchronizing the timing of the text with the speech, has been done automatically. Even when the technical words are mangled, the timing information in the automatic caption files can be leveraged to make the process of manually improving the captioning much easier.

The experimental automatic captions are only available directly from the video watch pages, and not from channel pages or other views. For example, visit www.youtube.com/watch?v=nwbqBdghh6E to see one of our most popular videos, Kirk McKusick speaking on FreeBSD Kernel Internals. Hover over the triangle at the bottom right of the video, then over the CC submenu and select "Transcribe Audio". You can then choose to "Translate Captions" into a different language as well.

Sunday, July 26, 2009

The Slashdot Effect

After 8 months, 66 videos uploaded, and 141,676 views, the BSD Conferences YouTube Channel was slashdotted for the first time last week. Specifically, Theo's OpenBSD Release Engineering talk was linked from this Slashdot post. Views of the video spiked to nearly 8,000 a day after the Slashdot post, which dwarfs the previous highs of around 1,500 views a day after I posted about Kirk McKusick's FreeBSD Kernel Internals lecture.

I think this is an excellent reminder of the power that forums like Slashdot still have in directing traffic among those seeking technical content online. I would encourage anyone interested in seeing more BSD-related content online to install browser bookmarklets, toolbars, or other shortcuts to more easily share and promote FreeBSD content on Digg, Del.icio.us, Slashdot, etc.

Saturday, July 18, 2009

FreeBSD Code Metrics Now on Ohloh.net

I've written previously about Ohloh.net and how I'd like to see more of the dynamic code metrics calculated there available on the FreeBSD web site. I am happy to report that, after several years of attempts, the Ohloh repository import servers have finally managed to get through the entire FreeBSD source repository, as I noticed today. Their software setup previously had difficulty dealing with a project with as long a history as FreeBSD's.

You can now view the top level code metrics about FreeBSD from the FreeBSD Project Page on Ohloh.net. This page indicates that there are over 10 million lines of code and that more files are licensed under GPLv2 than under any other license.

The committer totals do not quite match up with Peter's Commit Counters. Even after accounting for the fact that Peter's system could potentially double count a commit that touches both sys and non-sys parts of the source tree, the numbers from Ohloh are still lower for some committers. Unlike the numbers on cia.vc, the FreeBSD project on ohloh.net only contains the source repository. We currently lack the anonymous CVS access to our doc repository that is necessary to add the doc project to ohloh.net.

How do the numbers reported on ohloh.net compare to code metrics others have reported for FreeBSD? Does this match expectations or are there any major problems with this data? How can we use this information on our website? Would a badge on the front page showing "Last improvement made X minutes ago" be useful? A list of most active committers in the past week on one of the developer pages? Other ideas for utilizing the work the Ohloh project is doing?

Sunday, July 12, 2009

Open Source in Recessions

In general, recessions can be really good for open source. Large businesses look to cut back on IT budgets and this often involves re-evaluating whether proprietary software and maintenance contracts are necessary, given high-quality open source alternatives. Companies may dedicate more internal resources to open source projects, and the surplus of underemployed engineering talent in the market may be available for more open source development work.

Unfortunately, there are also some significant downsides, and one that I would like to highlight is the plight of the small open source shops around the world. We've seen pleas earlier this month to Save BSD Magazine, and in recent years some other smaller open source companies, such as Daemonnews and the Japanese publications FreeBSD Press and BSD Magazine, have exited the business. In the current environment I would like to take the unusual step of plugging a company that has been selling, marketing, legally defending, and supporting FreeBSD from the very beginning.

FreeBSD Mall has been selling FreeBSD CDs since 1.0 in 1993 and is still selling and supporting CDs, DVDs, books, branded apparel, and more. The PC-BSD live DVDs make an excellent introduction to FreeBSD for new users and the complete FreeBSD DVD distributions are quite handy to have. Consider spending a few dollars at the FreeBSD Mall site, buying a BSD Magazine subscription, or otherwise spending some money to encourage the small commercial FreeBSD ecosystem. Doing so may also help make more funds available to exploit the many disruptive opportunities (netbooks, cloud computing, etc.) that could be very good for open source and FreeBSD during this recession.

Sunday, May 17, 2009

Remaining AsiaBSDCon 2009 Videos Posted

The remaining 9 videos from AsiaBSDCon 2009 have been posted. The new videos include talks by Theo de Raadt, Eric Allman, Kris Moore, Mohamad Fauzie, Brooks Davis, Attilio Rao, A. Kantee, and the Works In Progress Sessions.

Thanks again to Hiroki Sato for posting the videos and organizing 3 consecutive years of successful AsiaBSDCons. Sato-san has also created two separate YouTube playlists for the AsiaBSDCon 2008 and AsiaBSDCon 2009 videos. These playlists make it easier to find the newest videos from among the 66 videos now in the BSDConferences YouTube channel.