arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2005/2008/CE

Site is not crawling completely

Answered (Verified) This post has 1 verified answer | 36 Replies | 2 Followers

Top 10 Contributor
82 Posts
InvestisDev posted on Mon, Dec 10 2012 10:53 PM

Hello Mike, 

I was crawling the site "http://www.bodycote.com" with the depth level set to 6.

If you search for the keyword "china", some results show up, but they do not include "http://www.bodycote.com/en/contact-directory/asia/china.aspx", whose depth level is < 6.

I have also checked the DB: the pages at china.aspx's level and below are not in the Hyperlinks or WebPages tables.

I am also downloading the pages, and the "DownloadWebPages" folder contains only one page, world.aspx.

Please suggest a way to solve this.

Thanks,


All Replies

Top 10 Contributor
82 Posts

FYI

I also changed the crawl depth value from 6 to 10, but the maximum crawl depth in the WebPages table is still 6; I think this is what's causing the issue (see the screenshot below).

[screenshot not included]

Can you please tell me where the crawl depth can be specified?

Top 10 Contributor
1,751 Posts

Depth isn't directory depth... it is hops away from the initial starting point.  Does this help?
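To illustrate, a hypothetical sketch (the URLs are made up, and I'm assuming the seed counts as depth 1, which matches the WebPages numbers above):

    // depth 1: http://www.bodycote.com/                  (the CrawlRequest seed)
    // depth 2: every page the seed links to, e.g. /en/contact-directory/
    // depth 3: every page those pages link to, and so on.
    // A deeply nested URL such as /en/a/b/c/page.aspx is still depth 2
    // if the home page links to it directly; directory depth doesn't matter.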

Any entries in the Exceptions or DisallowedAbsoluteUris table?

I will crawl the site now.

Mike

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

Top 10 Contributor
1,751 Posts

This is probably because the site doesn't have static links to the 'china' search results.

Try crawling: http://www.bodycote.com/en/site-services/search-results.aspx?ResultPage=1&Domain=MergeAll&query=china
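If it helps, a minimal sketch of seeding that URL directly, reusing the Crawl(...) pattern from this thread (the depth, restriction, and priority values here are illustrative, not prescriptive):

    // Illustrative: seed the dynamic search-results page directly, since
    // the crawler can't reach it through static links.
    string searchUrl = "http://www.bodycote.com/en/site-services/search-results.aspx?ResultPage=1&Domain=MergeAll&query=china";

    _crawler.Crawl(new CrawlRequest(
        new Discovery(searchUrl),
        6,                           // depth: hops from this seed
        UriClassificationType.Host,  // restrictCrawlTo: stay on www.bodycote.com
        UriClassificationType.Host,  // restrictDiscoveriesTo
        1,                           // same value as the other calls in this thread
        RenderType.None,
        RenderType.None));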


Top 10 Contributor
82 Posts

Hello Mike

Sorry for the delayed reply. I was using the code below while crawling a site:

wasTheCrawlRequestAddedForCrawling = _crawler.Crawl(new CrawlRequest(new Discovery(url), 6, UriClassificationType.Host, UriClassificationType.Host | UriClassificationType.OriginalDirectory | UriClassificationType.OriginalDirectoryLevel | UriClassificationType.OriginalDirectoryLevelDown, 1, RenderType.None, RenderType.None));

where restrictDiscoveriesTo had the extra parameters. I changed the line above to the following:

wasTheCrawlRequestAddedForCrawling = _crawler.Crawl(new CrawlRequest(new Discovery(url), 7, UriClassificationType.Host | UriClassificationType.OriginalDirectoryLevel | UriClassificationType.OriginalDirectory | UriClassificationType.OriginalDirectoryLevelDown, UriClassificationType.Host, 1, RenderType.None, RenderType.None));

It is working for me as expected.

Can you please give me some more detail on the parameters, i.e. "UriClassificationType", "RestrictCrawlTo", "RestrictDiscoveriesTo", "renderType", and "renderTypeForChildren"?

Also, please advise: is this change the proper one for my case, and is it fine to use?

Let me know if you need any further detail.

Thanks,

Top 10 Contributor
1,751 Posts

No worries...

The | is logical OR and is used as a bitmask.  Every option that is added makes the CrawlRequest more restrictive.
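For instance (an illustrative sketch; the variable names are mine):

    // One restriction: any Uri on the host qualifies.
    UriClassificationType broad = UriClassificationType.Host;

    // Two restrictions OR'd together: a Uri must now match the host AND the
    // original directory, so fewer Uris qualify.
    UriClassificationType narrow = UriClassificationType.Host | UriClassificationType.OriginalDirectory;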

RestrictCrawlTo: Where can the Crawl go?

  • http://arachnode.net/forums/p/323/10294.aspx#10294
  • http://arachnode.net/forums/t/739.aspx

RestrictDiscoveriesTo: When a Crawl finds a Discovery(HyperLink, Image, File, EmailAddress), are those Discoveries eligible to be collected, crawled and/or stored?

  • http://arachnode.net/forums/p/323/10294.aspx#10294
  • http://arachnode.net/forums/t/739.aspx

RenderType: Are you Rendering JavaScript using the Renderers project?  Render the CrawlRequest, or not?

RenderTypeForChildren: If a CrawlRequest creates other CrawlRequests, should those CrawlRequests be Rendered as well?

As it looks like you want to crawl the site from the root, .Host is appropriate for both classification types. This will filter out any offsite links, unless you want to gather those; from your existing parameters, it doesn't look like you do.
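Putting that together, a sketch of a full call with each parameter annotated (the constructor shape and the '1' are carried over from the calls you posted; the comments summarize the definitions above):

    _crawler.Crawl(new CrawlRequest(
        new Discovery(url),          // the AbsoluteUri to start from
        6,                           // depth: hops from the seed, not directory depth
        UriClassificationType.Host,  // RestrictCrawlTo: where the Crawl can go
        UriClassificationType.Host,  // RestrictDiscoveriesTo: which Discoveries (HyperLinks, Images,
                                     //   Files, EmailAddresses) may be collected, crawled and/or stored
        1,                           // carried over from your calls above
        RenderType.None,             // RenderType: render this CrawlRequest's JavaScript, or not
        RenderType.None));           // RenderTypeForChildren: render the CrawlRequests it creates, or not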


Top 10 Contributor
82 Posts

Merry Christmas and a Happy New Year!

Sorry, but I am facing an issue: while crawling a site, the crawler is not including .pdf files.

The DB has no PDF entries in the HyperLinks table, even though the page itself contains more than 5 links to PDF files.

I used the options below:

wasTheCrawlRequestAddedForCrawling = _crawler.Crawl(new CrawlRequest(new Discovery(url), 6, UriClassificationType.Host | UriClassificationType.Domain | UriClassificationType.OriginalDirectory | UriClassificationType.OriginalDirectoryLevel | UriClassificationType.OriginalDirectoryLevelDown, UriClassificationType.Host, 1, RenderType.None, RenderType.None));

Please suggest whether any setting above needs to change. The page's view-source shows PDF links like "/~/media/Files/2012/reports-2012.pdf", but neither the HyperLinks nor the Files table shows any such entries.

While debugging, the value of "fileOrImageDiscovery.DiscoveryState" in "CrawlRequestManager.cs" stays at "DiscoveryState.Undiscovered" instead of the link being handled as "DiscoveryType.File".

I have the URL "http://www.xyz.com/de-DE/default.aspx" for the site in German and "http://www.xyz.com/en/default.aspx" for English.

Please suggest a way to solve this.

Thanks :)

Top 10 Contributor
1,751 Posts

What is the exact AbsoluteUri you are trying to crawl?


Top 10 Contributor
82 Posts

Here are the actual URLs:

http://www.aamal.com.qa/ar-DZ/default.aspx 

http://www.aamal.com.qa/en/default.aspx

Top 10 Contributor
1,751 Posts

Verified Answer

Change this: UriClassificationType.Host | UriClassificationType.Domain | UriClassificationType.OriginalDirectory | UriClassificationType.OriginalDirectoryLevel | UriClassificationType.OriginalDirectoryLevelDown

To this: UriClassificationType.Host

The '|' is a logical OR, not "do this, or this, or this": adding an | makes the crawl more restrictive.  The '~' in "/~/media/..." isn't matching the directory settings.  And because a .pdf is an href to content that isn't necessary to render the page, retrieving it takes starting depth + 1, and the RestrictCrawlTo setting is preventing it from being crawled.

As you just want to crawl this site, to a depth of 6, starting at the root, there is no need to specify 'down' or original directory level.  AN will crawl breadth-first by default, to a depth of 6.
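Applied to the call you posted, a sketch of the corrected request (only the restrictCrawlTo flags change; everything else is carried over from your code):

    wasTheCrawlRequestAddedForCrawling = _crawler.Crawl(new CrawlRequest(
        new Discovery(url),
        6,
        UriClassificationType.Host,  // was: Host | Domain | OriginalDirectory |
                                     //      OriginalDirectoryLevel | OriginalDirectoryLevelDown
        UriClassificationType.Host,
        1,
        RenderType.None,
        RenderType.None));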

Thanks,
Mike


Top 10 Contributor
82 Posts

Hello Mike,

I think this will solve the issue. Sorry, but I am on leave at the moment; I will apply this solution soon and update you once the testing finishes.

Thanks for the quick reply.

Thanks :)


copyright 2004-2014, arachnode.net LLC