ReCAPTCHA All That Data

2008 February 13


I was exploring Lawrence Lessig’s blog earlier and noticed that his commenting system uses reCAPTCHA, a CAPTCHA system run out of Carnegie Mellon.

What’s interesting about reCAPTCHA is that some of the words shown are unknown, even to the system. A regular CAPTCHA displays known words in distorted text, hopefully in such a way that only humans (and not spam robots) can read them. The human user types in the characters shown, and the system validates that the entry is correct.

reCAPTCHA uses words scanned from old books, and matches the user’s input to the corresponding portion of the original work. Like Amazon’s Mechanical Turk, the Carnegie Mellon system deploys micro-tasks and takes advantage of distributed human labor to complete giant projects via the minimal contributions of many. The idea, of course, is to transcribe books that are not practical to digitize using OCR software alone (due to blurry letters, old typefaces, etc.).
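
To make the idea concrete, here is a rough Python sketch of how such a scheme could work. It is not reCAPTCHA's actual code. It assumes each challenge pairs a control word the system already knows with a scanned word it could not read, and that an unknown word counts as transcribed once enough users type the same thing. The names (Challenge, Transcriber, VOTES_NEEDED) and the threshold of three are made up for illustration.

```python
from collections import Counter

# Hypothetical threshold: how many matching human answers we want
# before accepting a transcription for an unknown word.
VOTES_NEEDED = 3


class Challenge:
    """One two-word challenge: a control word with a known answer and a
    scanned word that the software could not read (both assumed here)."""

    def __init__(self, control_answer, unknown_word_id):
        self.control_answer = control_answer
        self.unknown_word_id = unknown_word_id


class Transcriber:
    def __init__(self):
        # Maps unknown_word_id -> Counter of answers typed by users.
        self.votes = {}

    def submit(self, challenge, control_guess, unknown_guess):
        """Return True if the user passes. Only the control word is checked;
        the guess for the unknown word is recorded as a vote."""
        if control_guess.strip().lower() != challenge.control_answer.lower():
            return False
        counter = self.votes.setdefault(challenge.unknown_word_id, Counter())
        counter[unknown_guess.strip().lower()] += 1
        return True

    def transcription(self, unknown_word_id):
        """Return the agreed word, or None if too few answers match so far."""
        counter = self.votes.get(unknown_word_id)
        if not counter:
            return None
        word, count = counter.most_common(1)[0]
        return word if count >= VOTES_NEEDED else None
```

The point of the design is that the control word does the spam filtering, while the unknown word quietly collects votes. A user who fails the control word contributes nothing, so robots guessing at random should not pollute the transcription.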

3 comments

  1. Damn. These CMU people are really smart. What a neat idea!

    Comment by michael — 2008 February 18 @ 2:24 am

  2. How cool is this? I didn’t even know what the original use was let alone the newer program.

    Comment by Pat — 2008 February 21 @ 11:29 pm

  3. We’ve come full circle in print. The digitized is being outsourced to monks/scribes in medieval monasteries so knowledge can be saved for the future.

    Comment by Karen — 2008 February 22 @ 2:22 pm


Site content and design © copyright 2006–2008 Scott Murray.