Tuesday, August 26, 2008

CAPTCHA and reCAPTCHA

CAPTCHA is a test that many readers have seen without knowing what it is. CAPTCHA stands for "Completely Automated Public Turing test to tell Computers and Humans Apart." A CAPTCHA consists of a small image of letters or words that you must type in to convince a system that you are in fact a human and not a spambot. The letters are generally somewhat distorted or have stray lines added. Such exercises take advantage of the fact that humans can do difficult symbol recognition that computers so far cannot.

One more recent innovation is reCAPTCHA. reCAPTCHA helps to solve an interesting problem. There are long-term efforts to engage in mass digitization of old books and records. Simply saving images of the pages would take up a lot of space and would leave the pages unsearchable. Thus, books are being scanned and computer programs are being used to figure out what the words are. However, when words are scratched, poorly written, water-damaged or otherwise degraded, the computers can be unable to recognize the original wording. This is the same issue that allows CAPTCHA to work. It is impractical to have humans comb through these many words to identify them all. Enter reCAPTCHA.

reCAPTCHA is just like CAPTCHA, but the source words to be deciphered are words from old books that computers are having trouble recognizing. Each word needing to be recaptured is presented to multiple different users. If the users agree, then the digitizers can be pretty sure that the humans successfully recognized the word. They can then digitize that word and also use it as a known challenge word in later tests.

More specifically, each reCAPTCHA challenge consists of two words: one that has a known answer and one that does not. The person being challenged does not know which word is in which category and thus must answer both.
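
For readers who like to see this sort of thing spelled out, here is a minimal Python sketch of the two-word scheme described above. It is a toy model, not reCAPTCHA's actual code: the names WordPool, check and AGREEMENT_THRESHOLD are made up for illustration, and the threshold of three matching answers is an assumption, since I do not know how many agreeing users the real system requires.

```python
from collections import Counter, defaultdict

# Hypothetical threshold: how many matching answers are needed before an
# unknown word is treated as solved. The real number is an assumption here.
AGREEMENT_THRESHOLD = 3


class WordPool:
    """Toy model of the two-word challenge (hypothetical names throughout)."""

    def __init__(self, known, unknown_ids):
        self.known = dict(known)           # word id -> verified text
        self.unknown = set(unknown_ids)    # word ids the computer could not read
        self.votes = defaultdict(Counter)  # word id -> tally of human guesses

    def check(self, control_id, control_answer, unknown_id, unknown_answer):
        """Return True if the user passes; record their guess for the unknown word."""
        expected = self.known[control_id].lower()
        if control_answer.strip().lower() != expected:
            return False  # failed the known word, so the other guess is ignored
        if unknown_id in self.unknown:
            guess = unknown_answer.strip().lower()
            self.votes[unknown_id][guess] += 1
            if self.votes[unknown_id][guess] >= AGREEMENT_THRESHOLD:
                # Enough humans agree: promote the word to the known pool,
                # where it can serve as a control word in later challenges.
                self.known[unknown_id] = guess
                self.unknown.discard(unknown_id)
        return True
```

Only the known word decides whether the user passes. The answer for the unknown word is simply counted, and it is trusted only because the same user got the known word right and other users gave the same answer.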

This procedure is a brilliant way of harvesting otherwise lost processing power.

Now, this is all well and good, but what am I expected to do when the reCAPTCHA challenge is:

7 comments:

treehouses said...

I once read about a method that spambots would use to get around CAPTCHAs. The bot would send the CAPTCHA image to a special site. Visitors to this special site would then type in the CAPTCHA message. If what the visitor types in allows the bot to beat the CAPTCHA, the site rewards the visitor with porn.

Perhaps reCAPTCHA should take note.

Unfortunately, I don't have any sources to share.

Joshua said...

Yes, this is discussed, for example, here:

http://boingboing.net/2004/01/27/solving-and-creating.html

There's not much one can do about this attack. However, even then the CAPTCHAs still get solved so reCAPTCHA is happy. Are you suggesting that reCAPTCHA should run a separate website offering porn for reCAPTCHA work?

sniffnoy said...

That reminds me! Have you seen this video? Or any of the games it talks about?

I wonder which came first, reCAPTCHA or this? (Well actually I guess I could look that up...)

sniffnoy said...

Also, I think it says "Ruheleben" and "you".

Joshua said...

Harry, that's a very interesting video. Thanks very much. I had known about this sort of thing before but had not realized how far it had progressed. (I strongly urge any interested reader to go see the video linked to by Harry.)

Harry, I agree with the first word and strongly suspect that that word is the known word. I'm not convinced the second one is correct. Indeed, I'm not convinced the second one is in the Roman alphabet.

Also, one last remark. One person sent me an email asking what "reCAPTCHA" stood for. Sorry if this was not clear. reCAPTCHA is a pun on "CAPTCHA" and "recapture" since the lost words are being "recaptured."

Gabby Ehrlich said...

Hey Josh. I believe I mentioned that Josh (my brother) spent the summer working for a company called Laserfiche. One of his friends at the company was working on a program to transform scanned documents into text documents by processing the image, so I suspect once they get that working it could be used to digitize old books as well. I don't know too much about it though, as the conversation was a while ago and my memory is bad.

sniffnoy said...

Oh, something I noticed that I don't think is mentioned in the video: The "asymmetric verification game" is actually very similar to a popular formula for party games. So I guess it's not surprising that people find it fun.