Hi to all,
Can I please get a short explanation of this parameter?
The explanation given in the code:
//Setting the Depth to int.Max means to crawl the first page, and then int.MaxValue - 1 hops away from the initial CrawlRequest AbsoluteUri - so, the entire site.
//The higher the value for 'Priority', the higher the Priority.
This leaves me somewhat confused about "depth" and its values.
Also, I don't understand this OR here:
//You can logically OR the UriClassificationTypes to set what a CrawlRequest crawls!!!
Please advise so I can continue with my crawler project :)
Best regards,
Aleksandar.
A Depth of 1 means that page and all content on that page. A Depth of 2 means that page plus every page linked from it.
The higher the priority for a CrawlRequest, the sooner it will be crawled.
(OR) If you submitted http://arachnode.net/Home.aspx and set the RestrictCrawlTo parameter to UriClassificationType.Host | UriClassificationType.FileExtension you would only crawl .aspx pages from arachnode.net.
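To illustrate the OR above, here is a minimal sketch of how combined bit flags can restrict a crawl. It is written in Python only to stay self-contained; `UriClassification` and `is_allowed` are hypothetical stand-ins, not the arachnode.net types, and the matching rules (same host, same file extension as the starting Uri) are an assumption based on the example in this reply.

```python
from enum import Flag
from urllib.parse import urlsplit
import posixpath


class UriClassification(Flag):
    # Hypothetical stand-in for a flags enum; each member owns one bit.
    HOST = 1
    FILE_EXTENSION = 2


def is_allowed(start_url, candidate_url, restrict_to):
    """Return True if candidate_url satisfies every restriction that is set."""
    start, cand = urlsplit(start_url), urlsplit(candidate_url)
    if UriClassification.HOST in restrict_to and cand.hostname != start.hostname:
        return False
    if UriClassification.FILE_EXTENSION in restrict_to:
        start_ext = posixpath.splitext(start.path)[1].lower()
        cand_ext = posixpath.splitext(cand.path)[1].lower()
        if start_ext != cand_ext:
            return False
    return True


# OR-ing the flags sets both bits, so both restrictions apply at once.
restrict = UriClassification.HOST | UriClassification.FILE_EXTENSION
```

With `restrict` as above and a starting Uri of http://arachnode.net/Home.aspx, an .aspx page on arachnode.net passes, while an image on the same host or an .aspx page on another host does not.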
Which version are you using?
Mike
An open source .NET web crawler written in C# using SQL 2005/2008.
Twitter: http://twitter.com/arachnode_net
arachnode.net provides custom crawling and contracting resources. Please ask.
Hi Mike,
thanks a lot for your fast response,
I'm using version 1.4.
The goal of my project is to make a service that crawls ALL HTML from only one site (only downloading HTML, not pictures, files...). Which depth should I use to crawl the whole site?
Also, I don't understand how this crawler works if it should crawl continuously (let's say, every 20 minutes). After it crawls for the first time, what will it crawl the second time it is started? Only new and updated pages, or will it overwrite everything from the beginning? If it can handle this, what are the configuration parameters? Are there any performance issues?
Best Regards,
No problem!
To crawl an entire site, and ensure that you crawl the entire site, use a depth of int.MaxValue.
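As a rough illustration of what the depth value controls, the sketch below walks a small in-memory link graph breadth-first. It is not arachnode.net code: `crawl`, `SITE`, and the use of `sys.maxsize` in place of int.MaxValue are assumptions made for the example. The starting page counts as depth 1, matching the explanation earlier in this thread.

```python
import sys
from collections import deque


def crawl(links, start, max_depth):
    """Breadth-first walk of a link graph; the start page is depth 1."""
    seen = {start}
    queue = deque([(start, 1)])
    order = []
    while queue:
        page, depth = queue.popleft()
        order.append(page)
        if depth >= max_depth:
            continue  # pages at the depth limit are visited, but their links are not followed
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return order


# Hypothetical site: each key links to the pages in its list.
SITE = {"/": ["/a", "/b"], "/a": ["/a/1"], "/b": ["/"], "/a/1": ["/a/1/x"]}

whole_site = crawl(SITE, "/", sys.maxsize)
```

A depth of 1 returns only the starting page, a depth of 2 adds the pages it links to, and a very large depth reaches every page that can be reached by following links.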
If you don't want to download Images and Files, turn off 'AssignFileAndImageDiscoveries' in cfg.Configuration.
What is your desired method of crawling? Do you want to start from the beginning each time, or stop, perform analysis and then continue to crawl... AN can crawl however you'd like... just have to flip a switch here and there.
Going out for a bike ride... BB in about 6 hours.
Hi Mike,
Did you have a nice ride? :)
My goal is to set the crawler to work like this:
- When I first start it, I want to download all the HTML and put it in the database (I suppose in the table dbo.webpagemetadata, converted to XML),
- Then, every next time the crawler starts, it should download just the new and the updated HTML (not everything again from the beginning).
Mike, I'm very grateful for your help. Thanks a lot again.
Br,
I did. Was a little shorter than expected but all in all a nice venture. (20 miles)
So, if there are 1000 pages in a site, and one of those pages changes at, say, Depth 15, you only want to download that one page and skip the rest?
Which site are you looking to crawl?
Thanks! Always glad to help.
::Mike
Good morning Mike (it is morning here in Macedonia :) ),
Yes, I would like to crawl exactly that way. I'll put that crawler in a service that will be started every 20 minutes.
I would like to crawl from a site with advertisements, www.pazar3.com.mk.
Thanks,
BR,
Unfortunately, I don't know how you would do this - or if it's even possible...
Explanation: Out of say 1,000 pages in a site, page 667 is new and is found at depth 8 from your initial crawling point... how would you know to go to that page without downloading/processing other pages in the site? This is precisely the reason why Google implemented their sitemaps program.
Is it possible to download all the pages the first time I crawl, and then, every next time the crawler is started, just PROCESS all the pages and download ONLY the updated ones, not all of them again from the beginning? That is my idea for the updated pages, and I don't have any idea about the deleted ones :(.
This depends on what your definitions of 'PROCESS' and 'DOWNLOAD' are.
AN, by default, will not DOWNLOAD the page if it hasn't changed, but will allow you to PROCESS the content, if you desire. But, if your 'NEW' page is somewhere in the pile of webpages in a site, you have to ask the existing pages to get to that new page. You should consider looking at the site's RSS feed, if it has one.
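One generic way to separate new, updated, and deleted pages between two runs is to compare content hashes. The sketch below is an assumption for illustration only; the thread does not say how AN detects changes, and `diff_crawl` is a hypothetical helper. Note that this approach still fetches every reachable page on each run and only decides afterwards which ones need to be stored again.

```python
import hashlib


def diff_crawl(previous, fetched):
    """Compare this run against the last one.

    previous: dict of url -> SHA-256 hex digest saved by the last run
    fetched:  dict of url -> page body (bytes) from this run
    Returns (current_hashes, new_urls, updated_urls, deleted_urls).
    """
    current = {url: hashlib.sha256(body).hexdigest() for url, body in fetched.items()}
    new = sorted(u for u in current if u not in previous)
    updated = sorted(u for u in current if u in previous and current[u] != previous[u])
    deleted = sorted(u for u in previous if u not in current)
    return current, new, updated, deleted
```

Pages that were stored by the previous run but were not reached in the current one show up as deleted, which is one possible answer to the question about removed pages.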
If you can find the Google sitemap for the website, then this will tell you what pages have been updated. But, this really could be anywhere and isn't really public information.
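If a sitemap is available, its optional `lastmod` entries can be used to pick out pages changed since the last crawl. Here is a minimal sketch, assuming the standard sitemaps.org XML format; `changed_since` and the sample URLs are made up for the example, and entries without a `lastmod` are treated as possibly changed.

```python
import xml.etree.ElementTree as ET
from datetime import date

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def changed_since(sitemap_xml, last_crawl):
    """Return the URLs whose lastmod date is on or after last_crawl."""
    root = ET.fromstring(sitemap_xml)
    urls = []
    for entry in root.findall("sm:url", NS):
        loc = entry.findtext("sm:loc", namespaces=NS)
        lastmod = entry.findtext("sm:lastmod", namespaces=NS)
        if loc is None:
            continue
        # Only the date part (YYYY-MM-DD) is compared; any time of day is ignored.
        if lastmod is None or date.fromisoformat(lastmod.strip()[:10]) >= last_crawl:
            urls.append(loc.strip())
    return urls


SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/</loc><lastmod>2010-05-01</lastmod></url>
  <url><loc>http://example.com/ad/667</loc><lastmod>2010-06-15T09:30:00+00:00</lastmod></url>
  <url><loc>http://example.com/about</loc></url>
</urlset>"""

to_fetch = changed_since(SAMPLE, date(2010, 6, 1))
```

The resulting list could then be submitted to the crawler instead of walking the whole site, provided the site keeps its sitemap accurate.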
I'll try that and I'll test it. I hope to get what I want :).
Thanks for your support, I will inform you.
Great! Let me know!