arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2005/2008/CE
An Open Source C# web crawler with Lucene.NET search using MongoDB/RavenDB/Hadoop

Captcha GUI feature


megetron posted on Thu, Feb 25 2010 2:20 AM

Hi Mike,

Some websites require a captcha. Robots cannot deal with a captcha, and sometimes you need the data urgently.

Is there a way for arachnode.net to detect the reCAPTCHA and prompt on the command line for the code?

The following scenario is required:

  1. arachnode.net crawls.
  2. On page "somepage.html", arachnode.net detects a captcha object.
  3. arachnode.net downloads the captcha image and displays it to the user who is running the crawl.
  4. arachnode.net waits until the user enters the code.
  5. arachnode.net saves the code and sends it to reCAPTCHA.
  6. reCAPTCHA verifies whether the code is valid.
  7. If the code is invalid, arachnode.net fetches the image again and prompts the user again.
  8. If the code is valid, the page is refreshed with the previously hidden data.

This scenario can help when you need to crawl only a couple of thousand pages and you really need the data. It seems like an impossible feature to me.

All Replies


This is possible, actually - to start, when the Crawler starts, have it create a new WebBrowser control.

When a crawl thread detects a captcha, it will need to send the AbsoluteUri to the browser control to be rendered. Then, the crawl thread should block until it receives the HTML back.

You will have to modify the core to get this to happen.

In DiscoveryManager.cs...

    // ... (the enclosing "if" is not shown in this excerpt)
        if (crawlRequest.ProcessData)
        {
            crawlRequest.Data = crawlRequest.WebClient.DownloadData(crawlRequest.Discovery.Uri.AbsoluteUri);
        }
    }
    else
    {
        crawlRequest.Data = crawlRequest.WebClient.DownloadData(crawlRequest.Discovery.Uri.AbsoluteUri);

        crawlRequest.ProcessData = true;
    }

Here, you will have to block and query, say, a Dictionary<string, string> (AbsoluteUri, DecodedHtml) until you have entered the captcha code and the Crawler's dictionary has been populated with the DecodedHtml, assuming that you are really interested in what's behind the captcha. (Because the submit button is a form submit, and not a hyperlink, we have to do a bit of hacking...)
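
As a rough illustration of the block-and-query step described above, here is a minimal sketch. CaptchaHtmlStore, Publish and TryWaitForHtml are hypothetical names and are not part of arachnode.net; the sketch also assumes .NET 4 (for ConcurrentDictionary) and a simple polling loop with a timeout so that a crawl thread cannot hang forever.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Hypothetical helper (not part of arachnode.net): holds the HTML that the
// browser side produces after a human has solved the captcha.
public sealed class CaptchaHtmlStore
{
    // Key: AbsoluteUri, value: DecodedHtml.
    private readonly ConcurrentDictionary<string, string> _decodedHtml =
        new ConcurrentDictionary<string, string>();

    // Called from the UI side once the page behind the captcha has loaded,
    // for example from a WebBrowser DocumentCompleted handler.
    public void Publish(string absoluteUri, string decodedHtml)
    {
        _decodedHtml[absoluteUri] = decodedHtml;
    }

    // Called from the crawl thread. Polls the dictionary until the HTML for
    // this AbsoluteUri arrives or the timeout elapses. The entry is removed
    // once it has been handed to the caller.
    public bool TryWaitForHtml(string absoluteUri, TimeSpan timeout, out string decodedHtml)
    {
        DateTime deadline = DateTime.UtcNow + timeout;

        while (!_decodedHtml.TryRemove(absoluteUri, out decodedHtml))
        {
            if (DateTime.UtcNow >= deadline)
            {
                decodedHtml = null;
                return false;
            }

            Thread.Sleep(50); // assumed polling interval
        }

        return true;
    }
}
```

In the DiscoveryManager.cs excerpt, a crawl thread that has detected a captcha could call TryWaitForHtml in place of DownloadData and, on success, convert the returned string to bytes before assigning it to crawlRequest.Data. Which encoding to use for that conversion is an assumption you would have to settle for the site being crawled.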

Let me know when you get this far, OK? Then we can assess what you have found and move on from there...

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet


copyright 2004-2014, arachnode.net LLC