I can't remember (and don't have access to the code at the moment)... when a site is re-crawled, are we storing the date/time of the last page update, and do we skip re-crawling a page if it appears to have the same date/time as at the last crawl?
The WebPages table stores InitiallyDiscovered, LastDiscovered and LastModified.
If no CrawlRequests exist in the CrawlRequests table and all HyperLinks exist in the WebPages table, then the crawler will move on to crawling the WebPage(s) that were LastDiscovered the longest ago. arachnode.net doesn't keep track of which content belongs to which crawl. About a year and a half ago arachnode.net did keep track of which Crawl a Discovery belonged to, but after it was coded I couldn't think of a good reason why we needed to keep track of that information explicitly, since you could derive the same information from the dates stored with the Discoveries. Does this answer your question?
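To illustrate the "LastDiscovered the longest ago" selection, here is a minimal sketch in Python rather than arachnode.net's own code. The `WebPage` tuple, its field names, and `next_to_recrawl` are hypothetical stand-ins for rows in the WebPages table, not the actual schema or API.

```python
from datetime import datetime
from typing import Iterable, List, NamedTuple


class WebPage(NamedTuple):
    # Hypothetical stand-in for a row in the WebPages table.
    absolute_uri: str
    last_discovered: datetime


def next_to_recrawl(web_pages: Iterable[WebPage], batch_size: int = 1) -> List[WebPage]:
    """Return the pages whose last_discovered value is oldest, oldest first."""
    return sorted(web_pages, key=lambda page: page.last_discovered)[:batch_size]


pages = [
    WebPage("http://example.com/a", datetime(2010, 3, 1)),
    WebPage("http://example.com/b", datetime(2009, 7, 15)),
    WebPage("http://example.com/c", datetime(2010, 1, 20)),
]
oldest = next_to_recrawl(pages, batch_size=2)
```

In a real database this would more likely be an ORDER BY on LastDiscovered than an in-memory sort.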
So LastModified stores the date the record was last modified in arachnode, not the date the page itself was last modified, correct? I was thinking: if we stored the last-modified date that the page reports (via HTTP headers) for later comparison, would that allow us to skip walking any pages that tell us they haven't been modified since our last visit?
Granted, lots of sites may make the page always look new to prevent caching and such. I was just wondering what we are actually doing.
I see. Yes - that's a good suggestion.
I could pass the LastDiscovered field along with the CrawlRequest and could compare it against the Headers.
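A rough sketch of that comparison, in Python for brevity rather than arachnode.net's own code. The function name `needs_recrawl` and its parameters are hypothetical. It assumes the server sends a standard `Last-Modified` header and treats a missing or unparseable header as "re-crawl to be safe".

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def _as_utc(value: datetime) -> datetime:
    # Treat naive datetimes as UTC so both sides can be compared.
    return value if value.tzinfo else value.replace(tzinfo=timezone.utc)


def needs_recrawl(last_modified_header: Optional[str], last_discovered: datetime) -> bool:
    """Return True when the page should be fetched again."""
    if not last_modified_header:
        return True  # no header: we cannot tell, so re-crawl
    try:
        modified = parsedate_to_datetime(last_modified_header)
    except (TypeError, ValueError):
        return True  # unparseable header: re-crawl
    # Re-crawl only if the page claims a change after our last visit.
    return _as_utc(modified) > _as_utc(last_discovered)
```

HTTP also defines conditional requests: sending `If-Modified-Since` lets the server answer `304 Not Modified` without a body, which would save the download as well as the processing.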
Seems like it would be worth it to run a test for a day or two and see what we could see WRT updated headers vs. actually updated pages?
Agreed. Could make crawling MUCH more efficient in appearance anyway. Maybe an option in config whether or not to force re-crawl, or respect header info coming back from page?
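Something like the following could express that option. This is only a sketch in Python. `RecrawlSettings`, `force_recrawl`, and `should_fetch` are made-up names for illustration, not existing arachnode.net configuration settings.

```python
from dataclasses import dataclass


@dataclass
class RecrawlSettings:
    # Hypothetical config flag: when True, ignore what the headers say.
    force_recrawl: bool = False


def should_fetch(settings: RecrawlSettings, header_says_unchanged: bool) -> bool:
    """Decide whether to fetch a page, given the config and the header check."""
    if settings.force_recrawl:
        return True  # configured to always re-crawl
    return not header_says_unchanged  # otherwise respect the header info
```

Defaulting the flag to off means header info is respected unless someone opts in to a forced re-crawl. The opposite default would be the safer choice if many sites turn out to report misleading dates.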
I wonder how this ended up?
I do believe that the WebClient already takes this into account, FWIW - need to check for sure.