I was crawling the site "http://www.bodycote.com" with the crawl depth set to 6.
If you search for the keyword "china", it shows some results, but they do not include "http://www.bodycote.com/en/contact-directory/asia/china.aspx", whose depth is less than 6.
I also checked the database: the pages at the level of the china page and below are not present in the Hyperlinks or WebPages tables.
I am also downloading the pages, and the "DownloadWebPages" folder contains only one page, "world.aspx".
Please suggest a way to solve this.
Change this: UriClassificationType.Host | UriClassificationType.Domain | UriClassificationType.OriginalDirectory | UriClassificationType.OriginalDirectoryLevel | UriClassificationType.OriginalDirectoryLevelDown
To this: UriClassificationType.Host
The '|' is a logical OR used as a bitmask, not "do this or this or this". Each flag you add makes the crawl more restrictive. The '~' isn't matching the directory settings. Because a .pdf is an href to content that isn't necessary to render the page, it is retrieved at the starting depth + 1, and the RestrictCrawlTo setting is preventing it from being crawled.
As you just want to crawl this site, to a depth of 6, starting at the root, there isn't a need to specify 'down' or original directory level. AN will crawl breadth-first by default, to a depth of 6.
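To see why each added flag narrows the crawl, here is a minimal self-contained sketch. The enum and the Passes check below are hypothetical illustrations of bitmask-restriction semantics, not arachnode.net's actual types: a candidate URI must match every classification bit that is set, so ORing in more flags excludes more URIs.

```csharp
using System;

// Hypothetical flags enum, modeled loosely on UriClassificationType.
[Flags]
enum UriClassification
{
    None = 0,
    Host = 1,
    Domain = 2,
    OriginalDirectory = 4
}

class FlagsDemo
{
    // Hypothetical check: the URI passes only if it matches ALL required flags.
    public static bool Passes(UriClassification required, UriClassification actual)
    {
        return (actual & required) == required;
    }

    static void Main()
    {
        var sameHostOnly = UriClassification.Host;
        var sameHostAndDir = UriClassification.Host | UriClassification.OriginalDirectory;

        // A URI on the same host but outside the original directory:
        var candidate = UriClassification.Host | UriClassification.Domain;

        Console.WriteLine(Passes(sameHostOnly, candidate));   // True
        Console.WriteLine(Passes(sameHostAndDir, candidate)); // False: the extra flag excludes it
    }
}
```

With .Host alone, any same-host URI passes; adding OriginalDirectory to the mask rejects the same URI, which is why the shorter mask crawls more of the site.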
I even changed the crawl depth value from 6 to 10, but the maximum CrawlDepth in the WebPages table is still 6. I think this is causing the issue (see the screenshot below).
Can you please tell me where the crawl depth can be specified?
Depth isn't directory depth; it is the number of hops away from the initial starting point. Does this help?
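Depth-as-hops can be sketched with a small breadth-first walk. The link graph, method names, and paths below are hypothetical, not arachnode.net code; the point is that a page deep in the directory tree can still have a small crawl depth if it is linked close to the start.

```csharp
using System;
using System.Collections.Generic;

class DepthDemo
{
    // Breadth-first walk assigning each page its hop count from the start.
    public static Dictionary<string, int> CrawlDepths(
        Dictionary<string, string[]> links, string start, int maxDepth)
    {
        var depths = new Dictionary<string, int> { [start] = 0 };
        var queue = new Queue<string>();
        queue.Enqueue(start);

        while (queue.Count > 0)
        {
            var page = queue.Dequeue();
            if (depths[page] >= maxDepth) continue; // do not expand past maxDepth
            foreach (var child in links.TryGetValue(page, out var c) ? c : Array.Empty<string>())
            {
                if (!depths.ContainsKey(child))
                {
                    depths[child] = depths[page] + 1; // hops from the start, not directory depth
                    queue.Enqueue(child);
                }
            }
        }
        return depths;
    }

    static void Main()
    {
        // "/a/b/c/deep.aspx" sits three directories down but is only one
        // hop from the root, so its crawl depth is 1.
        var links = new Dictionary<string, string[]>
        {
            ["/"] = new[] { "/a/b/c/deep.aspx", "/world.aspx" },
            ["/world.aspx"] = new[] { "/china.aspx" }
        };
        foreach (var kv in CrawlDepths(links, "/", 6))
            Console.WriteLine($"{kv.Key} -> depth {kv.Value}");
    }
}
```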
Any entries in the Exceptions or DisallowedAbsoluteUris tables?
I will crawl the site now.
This is probably because the site doesn't have static links to the 'china' search results.
Try crawling: http://www.bodycote.com/en/site-services/search-results.aspx?ResultPage=1&Domain=MergeAll&query=china
Sorry for the delay in replying. I was using the code below while crawling a site:
wasTheCrawlRequestAddedForCrawling = _crawler.Crawl(new CrawlRequest(new Discovery(url), 6, UriClassificationType.Host, UriClassificationType.Host | UriClassificationType.OriginalDirectory | UriClassificationType.OriginalDirectoryLevel | UriClassificationType.OriginalDirectoryLevelDown, 1, RenderType.None, RenderType.None));
Here restrictDiscoveriesTo carried the extra classification flags. I changed that line to the following:
wasTheCrawlRequestAddedForCrawling = _crawler.Crawl(new CrawlRequest(new Discovery(url), 7, UriClassificationType.Host | UriClassificationType.OriginalDirectoryLevel | UriClassificationType.OriginalDirectory | UriClassificationType.OriginalDirectoryLevelDown, UriClassificationType.Host, 1, RenderType.None, RenderType.None));
It is now working as expected.
Can you please give me some more detail on the parameters "UriClassificationType", "RestrictDiscoveryTo", "RestrictCrawlTo", "renderType", and "renderTypeForChildren"?
Also, please advise whether this change is proper for my case and fine to use.
Let me know if you need further detail.
The | is a logical OR and is used as a bitmask. Every option that is added makes the CrawlRequest more restrictive.
RestrictCrawlTo: where can the crawl go?
RestrictDiscoveriesTo: when a crawl finds a Discovery (HyperLink, Image, File, EmailAddress), are those Discoveries eligible to be collected, crawled and/or stored?
RenderTypeForChildren: if a CrawlRequest creates other CrawlRequests, should those CrawlRequests be rendered as well?
As it looks like you want to crawl the site from the root, .Host is appropriate for both classification types. This will filter out any off-site links, unless you want to gather those; from your existing parameters it doesn't look like you do.
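The split between the two restriction points can be illustrated with a small self-contained sketch. Everything here is hypothetical (plain .NET, not arachnode.net internals): one filter decides which URIs are fetched at all (RestrictCrawlTo), the other decides which discovered links are kept and stored (RestrictDiscoveriesTo); with .Host for both, off-site links are neither fetched nor stored.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class RestrictionDemo
{
    // Hypothetical .Host-style check: does the URI live on the given host?
    public static bool SameHost(string uri, string host)
    {
        return new Uri(uri).Host == host;
    }

    static void Main()
    {
        string host = "www.bodycote.com";
        var discovered = new[]
        {
            "http://www.bodycote.com/en/contact-directory/asia/china.aspx",
            "http://www.example.org/offsite.html"
        };

        // RestrictCrawlTo = Host: only same-host URIs are eligible to be fetched.
        var toCrawl = discovered.Where(u => SameHost(u, host)).ToList();

        // RestrictDiscoveriesTo = Host: only same-host links are kept/stored.
        var toStore = discovered.Where(u => SameHost(u, host)).ToList();

        Console.WriteLine($"Crawl: {toCrawl.Count}, Store: {toStore.Count}"); // each keeps 1
    }
}
```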
Happy Christmas & Happy New Year,
Sorry, but I am facing one more issue: while crawling a site, it is not including .pdf files.
The database has no entries for PDF files in the Hyperlinks table, but the page itself has more than 5 links to PDF files.
I had used the options below:
wasTheCrawlRequestAddedForCrawling = _crawler.Crawl(new CrawlRequest(new Discovery(url), 6, UriClassificationType.Host | UriClassificationType.Domain | UriClassificationType.OriginalDirectory | UriClassificationType.OriginalDirectoryLevel | UriClassificationType.OriginalDirectoryLevelDown, UriClassificationType.Host, 1, RenderType.None, RenderType.None));
While debugging, the value of "fileOrImageDiscovery.DiscoveryState" in "CrawlRequestManager.cs" is "DiscoveryState.Undiscovered", even though the discovery is of "DiscoveryType.File".
I have the URL "http://www.xyz.com/de-DE/default.aspx" for the German version of the site
and "http://www.xyz.com/en/default.aspx" for the English version.
Please suggest a way to solve this.
What is the exact AbsoluteUri you are trying to crawl?
Here is the actual URL
I think this will solve the issue. Sorry, but I will apply this solution soon, as I am on leave; I will update you once the testing finishes.
Thanks for the quick reply.