Hi Mike,
Some websites require a captcha. Robots cannot deal with a captcha, and sometimes you need the data urgently.
Is there a way for arachnode.net to detect the reCAPTCHA and prompt on the command line for the code to be entered?
The following scenario is required:
This scenario would help when you only need to crawl a couple of thousand pages and you really need the data. It seems like an impossible feature to me.
This is possible, actually. To start, have the Crawler create a new WebBrowser control when it starts.
When a crawl thread detects a captcha, it will need to send the AbsoluteUri to the browser control to be rendered. Then, the crawl thread should block until it receives the HTML back.
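As a rough illustration of the "detects a captcha" step, a crawl thread could scan the downloaded HTML for well-known reCAPTCHA markers before deciding to hand the AbsoluteUri to the browser control. This is only a sketch: the CaptchaDetector class, the LooksLikeCaptcha method, and the marker list are hypothetical names chosen for this example and are not part of arachnode.net. The markers would need tuning for the sites you actually crawl.

```csharp
using System;

// Hypothetical helper, not part of arachnode.net.
public static class CaptchaDetector
{
    // Assumed markers commonly seen in reCAPTCHA pages; extend as needed.
    private static readonly string[] Markers =
    {
        "g-recaptcha",               // CSS class of the reCAPTCHA widget container
        "recaptcha/api",             // path fragment of the reCAPTCHA script URL
        "recaptcha_challenge_field"  // form field used by older reCAPTCHA pages
    };

    // Returns true when the decoded HTML contains any of the markers above.
    public static bool LooksLikeCaptcha(string decodedHtml)
    {
        if (string.IsNullOrEmpty(decodedHtml))
        {
            return false;
        }

        foreach (string marker in Markers)
        {
            if (decodedHtml.IndexOf(marker, StringComparison.OrdinalIgnoreCase) >= 0)
            {
                return true;
            }
        }

        return false;
    }
}
```

A string scan like this is cheap but can produce false positives (for example, a page that merely discusses reCAPTCHA), so treat a match as a hint that the page should go to the browser, not as proof.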
You will have to modify the core to get this to happen.
In DiscoveryManager.cs...
    // (the enclosing 'if' that the 'else' below belongs to is not shown in this excerpt)
    if (crawlRequest.ProcessData)
    {
        crawlRequest.Data = crawlRequest.WebClient.DownloadData(crawlRequest.Discovery.Uri.AbsoluteUri);
    }
}
else
{
    crawlRequest.Data = crawlRequest.WebClient.DownloadData(crawlRequest.Discovery.Uri.AbsoluteUri);
    crawlRequest.ProcessData = true;
}
Here, you will have to block and query, say, a Dictionary<string, string> (AbsoluteUri, DecodedHtml) until you enter the captcha code and the Crawler's dictionary is populated with the DecodedHtml, assuming that you are really interested in what's behind the captcha. (Because the submit button is a form submit, and not a hyperlink, we have to do a bit of hacking...)
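One way the block-and-query step could look is sketched below. It wraps a Dictionary<string, string> keyed by AbsoluteUri in a lock and uses Monitor.Wait / Monitor.PulseAll so the crawl thread sleeps instead of spinning. The CaptchaHandoff class and its TryWaitForHtml and Publish methods are hypothetical names for this example, not arachnode.net APIs, and the timeout is an added assumption so that a crawl thread is not blocked forever if nobody solves the captcha.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical helper, not part of arachnode.net.
public sealed class CaptchaHandoff
{
    private readonly object _lock = new object();

    // AbsoluteUri -> DecodedHtml, filled in once the captcha has been solved.
    private readonly Dictionary<string, string> _decodedHtmlByUri =
        new Dictionary<string, string>();

    // Called by the crawl thread. Blocks until HTML for absoluteUri has been
    // published or the timeout elapses. Returns false on timeout.
    public bool TryWaitForHtml(string absoluteUri, TimeSpan timeout, out string decodedHtml)
    {
        DateTime deadline = DateTime.UtcNow + timeout;

        lock (_lock)
        {
            while (!_decodedHtmlByUri.TryGetValue(absoluteUri, out decodedHtml))
            {
                TimeSpan remaining = deadline - DateTime.UtcNow;
                if (remaining <= TimeSpan.Zero)
                {
                    return false;
                }

                // Releases the lock while waiting; reacquires it before returning.
                Monitor.Wait(_lock, remaining);
            }

            // Each entry is consumed once by the waiting crawl thread.
            _decodedHtmlByUri.Remove(absoluteUri);
            return true;
        }
    }

    // Called by the browser side after the captcha was entered and the page rendered.
    public void Publish(string absoluteUri, string decodedHtml)
    {
        lock (_lock)
        {
            _decodedHtmlByUri[absoluteUri] = decodedHtml;
            Monitor.PulseAll(_lock); // wake every waiting crawl thread to re-check
        }
    }
}
```

In use, the crawl thread would call TryWaitForHtml with the AbsoluteUri of the captcha page, while the code that owns the WebBrowser control calls Publish with the rendered HTML once you have submitted the form. On a false return, the crawl thread can skip the request or retry it later.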
Let me know when you get this far, OK? And, then we can assess what you have found and move from there...
For best service when you require assistance:
Skype: arachnodedotnet