arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL 2005/2008/CE

Prevent Re-Crawl If No Date Change?


Top 10 Contributor
Male
101 Posts
Kevin posted on Mon, Feb 16 2009 11:17 AM

I can't remember (and don't have access to the code at the moment)... when a site is re-crawled, are we storing the date/time of the last page update, and do we skip re-crawling the page if it appears to have the same date/time as at the last crawl?

Thx

 

All Replies

Top 10 Contributor
1,692 Posts

The WebPages table stores InitiallyDiscovered, LastDiscovered and LastModified.

If no CrawlRequests exist in the CrawlRequests table and all HyperLinks exist in the WebPages table, then the crawler will move on to crawling the WebPage(s) that were LastDiscovered the longest ago.  arachnode.net doesn't keep track of which content belongs to which crawl.  About a year and a half ago arachnode.net did keep track of which Crawl a Discovery belonged to, but after it was coded I couldn't think of a good reason why we needed to keep track of that information explicitly, since you could derive the same information from the dates stored with the Discoveries.  Does this answer your question?
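
The fallback described above (re-crawl whatever was LastDiscovered the longest ago) can be sketched roughly as follows. This is a minimal illustration in Python, not arachnode.net's actual code: the `WebPage` class, its field names, and `next_to_recrawl` are hypothetical stand-ins for rows of the WebPages table.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class WebPage:
    # Hypothetical stand-in for a row of the WebPages table.
    absolute_uri: str
    initially_discovered: datetime
    last_discovered: datetime


def next_to_recrawl(web_pages, batch_size):
    """Return up to batch_size pages, oldest last_discovered first."""
    return sorted(web_pages, key=lambda page: page.last_discovered)[:batch_size]
```

Under this ordering a page is revisited purely by age; nothing here checks whether the page actually changed.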

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

Top 10 Contributor
Male
101 Posts
Kevin replied on Mon, Feb 16 2009 8:28 PM

So LastModified stores the date the record was last modified in arachnode, not the date the page itself was last modified, correct?  I was thinking: if we stored the last-modification date that the page reports (via HTTP headers) and kept it for later comparison, would that allow us to skip any pages that are telling us they haven't been modified since our last visit?

Granted, lots of sites may make the page always look new to prevent caching and such.  I was just wondering what we are actually doing.
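
The comparison being proposed might look something like the sketch below. It assumes the server sends a standard `Last-Modified` header and that the stored timestamp is timezone-aware; `is_unchanged` and its parameters are hypothetical names, not part of arachnode.net.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime


def is_unchanged(last_modified_header, stored_last_modified):
    """True if the page reports no modification since the stored timestamp.

    stored_last_modified must be a timezone-aware datetime.
    A missing or unparseable header returns False, so the page is re-crawled.
    """
    if not last_modified_header:
        return False
    try:
        reported = parsedate_to_datetime(last_modified_header)
    except (TypeError, ValueError):
        return False
    if reported.tzinfo is None:
        # Assumption: treat a header without zone information as UTC.
        reported = reported.replace(tzinfo=timezone.utc)
    return reported <= stored_last_modified
```

An alternative that avoids downloading the body at all is to send the stored date in an `If-Modified-Since` request header and let the server answer 304 Not Modified. Either way, the caveat above still applies: sites that always report a fresh date would be re-crawled every time.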

Top 10 Contributor
1,692 Posts

I see.  Yes - that's a good suggestion.

I could pass the LastDiscovered field along with the CrawlRequest and could compare it against the Headers.

Seems like it would be worth it to run a test for a day or two and see what we could see WRT updated headers vs. actually updated pages?
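
One way such a test could be tallied is sketched below: for each revisited page, record whether the `Last-Modified` header changed and whether a hash of the body changed, then count the combinations. All names here are hypothetical, and the use of SHA-256 over the raw body is an assumption for illustration only.

```python
import hashlib
from collections import Counter


def content_hash(body: bytes) -> str:
    """Hash of the raw response body, used to detect real content changes."""
    return hashlib.sha256(body).hexdigest()


def classify(prev_header, prev_hash, new_header, new_body):
    """Label one revisit by what changed since the previous visit."""
    header_changed = new_header != prev_header
    body_changed = content_hash(new_body) != prev_hash
    if header_changed and body_changed:
        return "both changed"
    if header_changed:
        return "header only"
    if body_changed:
        return "body only"
    return "neither"


def tally(observations):
    """Count labels over (prev_header, prev_hash, new_header, new_body) tuples."""
    return Counter(classify(*obs) for obs in observations)
```

A large "header only" count would suggest many sites refresh the date without changing content, and a non-trivial "body only" count would mean trusting the header misses real updates.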


Top 10 Contributor
Male
101 Posts
Kevin replied on Mon, Feb 16 2009 8:34 PM

Agreed.  Could make crawling MUCH more efficient in appearance anyway.  Maybe an option in config whether or not to force re-crawl, or respect header info coming back from page?

Top 10 Contributor
229 Posts

I wonder how this ended up?

Top 10 Contributor
1,692 Posts

I do believe that the WebClient already takes this into account, FWIW - need to check for sure.



copyright 2004-2013, arachnode.net LLC