ReCAPTCHA All That Data
2008 February 13

I was exploring Lawrence Lessig’s blog earlier and noticed that his commenting system uses reCAPTCHA, a CAPTCHA system run out of Carnegie Mellon.
What’s interesting about reCAPTCHA is that at least one of the words shown is unknown, even to the system. A regular CAPTCHA displays known words as distorted text, ideally distorted in such a way that only humans (and not spam robots) can read them. The human user types in the characters shown, and the system checks the entry against the answer it already has.
reCAPTCHA uses words scanned from old books and matches the user’s input to a portion of the original work. Like Amazon’s Mechanical Turk, the Carnegie Mellon system breaks a giant project into micro-tasks and completes it through small contributions from many people. The idea, of course, is to transcribe books that OCR software cannot reliably digitize on its own (due to blurry letters, old typefaces, etc.).
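How can a system validate a word it cannot read? As I understand it, reCAPTCHA pairs the unknown word with a control word whose answer is already known, and trusts a user's reading of the unknown word only when the control word is typed correctly. Here is a minimal Python sketch of that idea. The class name, the method names, and the `votes_needed` threshold are all my own inventions for illustration, not anything taken from the real service.

```python
from collections import Counter


def normalize(word):
    """Ignore case and surrounding whitespace when comparing entries."""
    return word.strip().lower()


class Challenge:
    """One known control word paired with one unknown scanned word (hypothetical model)."""

    def __init__(self, control_answer, votes_needed=3):
        self.control_answer = normalize(control_answer)
        self.votes_needed = votes_needed  # assumed threshold, chosen arbitrarily
        self.votes = Counter()            # readings of the unknown word so far

    def submit(self, control_guess, unknown_guess):
        """Return True if the user passes; count their reading only if they do."""
        if normalize(control_guess) != self.control_answer:
            return False  # failed the known word, so discard the other reading
        self.votes[normalize(unknown_guess)] += 1
        return True

    def transcription(self):
        """Return the agreed reading of the unknown word, or None if undecided."""
        if not self.votes:
            return None
        word, count = self.votes.most_common(1)[0]
        return word if count >= self.votes_needed else None


challenge = Challenge("morning", votes_needed=2)
challenge.submit("morning", "upon")   # passes, one vote for "upon"
challenge.submit("evening", "spam")   # fails the control word, vote ignored
challenge.submit("Morning", "Upon")   # passes, second vote for "upon"
print(challenge.transcription())
```

A spam robot that cannot read the control word never gets a vote, so its guesses do not pollute the transcription, while each human who passes nudges the unknown word a little closer to being digitized.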
