ReCAPTCHA All That Data

2008 February 13

[Image: recaptcha.png]

I was exploring Lawrence Lessig’s blog earlier and noticed that his commenting system uses reCAPTCHA, a CAPTCHA system run out of Carnegie Mellon.

What’s interesting about reCAPTCHA is that one of the two words shown is unknown, even to the system; the other is a known word used to check the user. A regular CAPTCHA displays only known words as distorted text, ideally in such a way that only humans (and not spam robots) can read them. The human user types in the characters shown, and the system checks that the entry matches the known answer.

reCAPTCHA uses words scanned from old books, and matches the user’s input to the corresponding spot in the original work. Like Amazon’s Mechanical Turk, the Carnegie Mellon system hands out micro-tasks and draws on distributed human labor to complete giant projects through tiny contributions from many people. The idea, of course, is to transcribe books that OCR software cannot reliably digitize on its own (due to blurry letters, old typefaces, etc.).
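
To make the mechanism concrete, here is a rough sketch of how such a system might work. It is my own illustration, not reCAPTCHA's actual code. It assumes each challenge pairs a known "control" word with an unknown scanned word, and that an unknown word is accepted once enough users who passed the control agree on it. The names (UnknownWord, check_response) and the three-vote threshold are made up for the example.

```python
from collections import Counter


def normalize(text):
    """Lowercase and trim so trivial differences do not split votes."""
    return text.strip().lower()


class UnknownWord:
    """A scanned word whose correct reading is not yet known (hypothetical model)."""

    def __init__(self, word_id, votes_needed=3):
        self.word_id = word_id
        self.votes_needed = votes_needed  # assumed threshold, not a real reCAPTCHA value
        self.votes = Counter()

    def add_vote(self, guess):
        self.votes[normalize(guess)] += 1

    def accepted_reading(self):
        """Return the leading guess once it has enough votes, else None."""
        if not self.votes:
            return None
        guess, count = self.votes.most_common(1)[0]
        return guess if count >= self.votes_needed else None


def check_response(control_answer, typed_control, unknown, typed_unknown):
    """Pass the user if the known word matches; only then count their guess."""
    if normalize(typed_control) != normalize(control_answer):
        return False  # likely a bot or a typo: ignore the guess entirely
    unknown.add_vote(typed_unknown)
    return True


if __name__ == "__main__":
    word = UnknownWord("scan-0001")
    check_response("morning", "morning", word, "upon")
    check_response("morning", "Morning ", word, "upon")
    check_response("morning", "mourning", word, "spam")  # rejected, vote ignored
    check_response("morning", "morning", word, "upon")
    print(word.accepted_reading())  # upon
```

The point of the control word is that the system never needs to know the answer to the scanned word in advance: it trusts a guess only from someone who has just proven they can read the word it does know.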

3 comments

  1. Damn. These CMU people are really smart. What a neat idea!

    Comment by michael — 2008 February 18 @ 2:24 am

  2. How cool is this? I didn’t even know what the original use was let alone the newer program.

    Comment by Pat — 2008 February 21 @ 11:29 pm

  3. We’ve come full circle in print. The digitized is being outsourced to monks/scribes in medieval monasteries so knowledge can be saved for the future.

    Comment by Karen — 2008 February 22 @ 2:22 pm

Site content and design © copyright 2006–2008 Scott Murray.