An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/CE An Open Source C# web crawler with Lucene.NET search using MongoDB/RavenDB/Hadoop

Captcha GUI feature

megetron posted on Thu, Feb 25 2010 2:20 AM

Hi Mike,

Some websites require a captcha. Robots cannot deal with a captcha, and sometimes you need the data behind it urgently.

Is there a way to detect the reCAPTCHA and prompt the user through the command line to enter the code?

The following scenario is required:

  1. The crawler crawls.
  2. On page "somepage.html" it detects that there is a captcha object.
  3. It downloads the captcha image and displays it to the user running the crawl.
  4. It waits until the human operator enters the code.
  5. It saves the code and sends it to reCAPTCHA.
  6. reCAPTCHA verifies whether the code is valid.
  7. If the code is invalid, it fetches the image again and prompts the human operator again.
  8. If the code is valid, the page is refreshed with the new, previously hidden data.

This scenario helps when you only need to crawl a couple of thousand pages and you really need the data. It seems like an impossible feature to me, though.




All Replies


This is possible, actually. To start, have the Crawler create a new WebBrowser control when it starts.

When a crawl thread detects a captcha, it will need to send the AbsoluteUri to the browser control to be rendered. Then, the crawl thread should block until it receives the HTML back.
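
For the detection step, a minimal sketch might look like the following. The class name, method name, and the marker string are all assumptions for illustration and are not part of the crawler; a plain substring check will also flag pages that merely mention captchas in their text, so treat it as a first-pass heuristic only.

```csharp
using System;

// Hypothetical helper, not part of the crawler's own code.
public static class CaptchaDetector
{
    // Assumed marker; "captcha" also matches "recaptcha". Adjust per target site.
    private static readonly string[] Markers = { "captcha" };

    // Returns true when the downloaded HTML contains any of the markers,
    // compared case-insensitively.
    public static bool LooksLikeCaptcha(string html)
    {
        if (string.IsNullOrEmpty(html))
        {
            return false;
        }

        foreach (string marker in Markers)
        {
            if (html.IndexOf(marker, StringComparison.OrdinalIgnoreCase) >= 0)
            {
                return true;
            }
        }

        return false;
    }
}
```

A crawl thread could call CaptchaDetector.LooksLikeCaptcha on the decoded HTML and only hand the AbsoluteUri to the browser control when it returns true.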

You will have to modify the core to get this to happen.

In DiscoveryManager.cs...

if (crawlRequest.ProcessData)
                        crawlRequest.Data = crawlRequest.WebClient.DownloadData(crawlRequest.Discovery.Uri.AbsoluteUri);
                    crawlRequest.Data = crawlRequest.WebClient.DownloadData(crawlRequest.Discovery.Uri.AbsoluteUri);

                    crawlRequest.ProcessData = true;

Here, you will have to block and query, say, a Dictionary<string, string> (AbsoluteUri, DecodedHtml) until you enter the captcha code and the Crawler's dictionary is populated with the DecodedHtml, assuming that you are really interested in what's behind the captcha.  (Because the submit button is a form submit, and not a hyperlink, we have to do a bit of hacking...)
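
As a rough sketch of that blocking lookup, something like the class below could sit between the crawl threads and the browser control. Everything here (CaptchaHandoff, WaitForDecodedHtml, Publish) is a hypothetical name, not existing crawler code, and it swaps the plain Dictionary for a ConcurrentDictionary of TaskCompletionSource entries so the crawl thread can wait without polling.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Hypothetical helper, not part of the crawler's own code.
public sealed class CaptchaHandoff
{
    // AbsoluteUri -> pending result that will hold the DecodedHtml.
    private readonly ConcurrentDictionary<string, TaskCompletionSource<string>> _pending =
        new ConcurrentDictionary<string, TaskCompletionSource<string>>();

    // Called by the crawl thread. Blocks until the HTML for this AbsoluteUri
    // is published or the timeout expires; returns null on timeout.
    public string WaitForDecodedHtml(string absoluteUri, TimeSpan timeout)
    {
        TaskCompletionSource<string> entry =
            _pending.GetOrAdd(absoluteUri, _ => new TaskCompletionSource<string>());
        try
        {
            return entry.Task.Wait(timeout) ? entry.Task.Result : null;
        }
        finally
        {
            // Drop the entry so the dictionary does not grow without bound.
            TaskCompletionSource<string> removed;
            _pending.TryRemove(absoluteUri, out removed);
        }
    }

    // Called from the browser side once the captcha has been solved
    // and the resulting page HTML is available.
    public void Publish(string absoluteUri, string decodedHtml)
    {
        _pending.GetOrAdd(absoluteUri, _ => new TaskCompletionSource<string>())
                .TrySetResult(decodedHtml);
    }
}
```

In DiscoveryManager.cs, the crawl thread would call WaitForDecodedHtml with the AbsoluteUri instead of downloading the page directly, then convert the returned string to bytes before assigning it to crawlRequest.Data. One known gap in this sketch: if Publish arrives after the waiter has already timed out, a stale entry is left behind for that AbsoluteUri.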

Let me know when you get this far, OK?  And, then we can assess what you have found and move from there...

