arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE An Open Source C# web crawler with Lucene.NET search using MongoDB/RavenDB/Hadoop

Stuck on how to start


Manoj posted on Thu, Dec 27 2012 1:46 AM

Dear Sir,

I have downloaded arachnode.net 2.6 and set the database up.

I have a few simple questions before I purchase the license:

1. I want to do a content search of an absolute URL.

2. Will whatever appears on that webpage be available in XML or a similar format?

For example, I want to crawl http://spotcrime.com/hi/hilo for the state of Hawaii and capture the crime information from this site, and then do the same for other states.

I commented out all of the other crawl requests and added one like this:

wasTheCrawlRequestAddedForCrawling = _crawler.Crawl(new CrawlRequest(new Discovery("http://spotcrime.com/hi/hilo"), int.MaxValue, UriClassificationType.None, UriClassificationType.None, 1, RenderType.None, RenderType.Dynamic));

But after the crawl finishes, I don't get anything in the database or anywhere else.

The following error occurs, which I always bypass.

 

WE ARE SERIOUS ABOUT BUYING THE LICENSE.

Verified Answer

Verified by arachnode.net

The path in the exception looks suspect.  Does that directory actually exist?  Check the DB table cfg.Configuration.  You probably want to remove 'Visual Studio 2010' from the path.  Or did you upgrade from the VS2008 solution?  There should be a VS2010 solution provided...  Could you take a screenshot of the directory structure with arachnode.net.sln at the root, please?

Are there any errors in the Exceptions table?

What about DisallowedAbsoluteUris?

wasTheCrawlRequestAddedForCrawling = _crawler.Crawl(new CrawlRequest(new Discovery("http://spotcrime.com/hi/hilo"), int.MaxValue, UriClassificationType.None, UriClassificationType.None, 1, RenderType.None, RenderType.Dynamic));

This should be:

wasTheCrawlRequestAddedForCrawling = _crawler.Crawl(new CrawlRequest(new Discovery("http://spotcrime.com/hi/hilo"), int.MaxValue, UriClassificationType.Domain, UriClassificationType.None, 1, RenderType.None, RenderType.None));

You want to restrict the crawl to this particular domain, while still allowing discoveries from other domains in (images, .xml files, etc.).  And since you aren't rendering the parent, you probably don't want to render the children either.
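
To make the "restrict the crawl, but still allow discoveries" idea concrete, here is a minimal, self-contained sketch of a host-based scope check. It is only an illustration and not arachnode.net's implementation: the class and method names (CrawlScopeSketch, IsWithinSeedHost) are hypothetical, example.com is a placeholder, and it compares host names, which may differ from how UriClassificationType.Domain classifies URIs.

```csharp
using System;

// Hypothetical stand-in for illustration only; NOT arachnode.net code.
public static class CrawlScopeSketch
{
    // Returns true when 'candidate' is on the same host as 'seed'
    // or on a subdomain of that host.
    public static bool IsWithinSeedHost(Uri seed, Uri candidate)
    {
        string seedHost = seed.Host;
        string candidateHost = candidate.Host;

        return candidateHost.Equals(seedHost, StringComparison.OrdinalIgnoreCase)
            || candidateHost.EndsWith("." + seedHost, StringComparison.OrdinalIgnoreCase);
    }

    public static void Main()
    {
        var seed = new Uri("http://spotcrime.com/hi/hilo");

        // Same host: a crawler restricted to the seed's host would follow this.
        Console.WriteLine(IsWithinSeedHost(seed, new Uri("http://spotcrime.com/")));

        // Different host (placeholder URL): it could still be recorded as a
        // discovery, such as an image, without being crawled further.
        Console.WriteLine(IsWithinSeedHost(seed, new Uri("http://example.com/image.png")));
    }
}
```

Run as a console program, this prints True for the same-host URL and False for the off-host one. In arachnode.net itself the equivalent decision is made for you by the UriClassificationType arguments passed to CrawlRequest.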

I opened a fresh copy of the demo, used your exact CrawlRequest and everything looks fine to me.

You can add me on Skype and we'll sort this out, perhaps with some sort of screen sharing.  I'm in Hawaii, so keep the time zone in mind.

Mike

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet


copyright 2004-2017, arachnode.net LLC