arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL 2005/2008/CE

What is the effect of recrawl on WebPages_MetaData & WebPage


Top 25 Contributor
14 Posts
dbs2000 posted on Fri, Aug 14 2009 8:22 AM

Hi,

I have a quick question. Does the WebPages_MetaData text/XML content change (or get updated) when I recrawl the page the next day? Or does a recrawl create a new WebPageID (in WebPage) and insert a new record into the WebPages_MetaData table? Or does it check for modifications to the page (or its contents) and, if the page was modified, create a new record or update the existing WebPages_MetaData record?

Also, does the recrawl take into account what is in the disallowed table and skip crawling if the URL is present there?

Please note that the recrawl that I am talking about will be in separate runs on different dates.

Thanks

Debasish

All Replies

Top 25 Contributor
14 Posts

Sorry, I missed one more question.

What exactly do the lastDiscovered and lastModified dates (in WebPage) tell us? I have noticed that the lastModified field is mostly null. When does it get populated?

Top 10 Contributor
1,714 Posts

Yes - each WebPage row and its WebPages_MetaData row are tied to the page that was crawled, and an AbsoluteUri is an AbsoluteUri is an AbsoluteUri...

The crawl process modifies the existing row, if there is one.

If an AbsoluteUri is in the DisallowedAbsoluteUris table it won't be crawled.

-Mike
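
To illustrate the behaviour described in this reply, here is a minimal sketch in Python with an in-memory SQLite database. It is not arachnode.net's code: the two-table schema, the column names other than AbsoluteUri, and the `crawl` function are simplified assumptions made up for the example. It only shows the idea that a recrawl of the same AbsoluteUri modifies the existing row, and that a URI listed in DisallowedAbsoluteUris is skipped.

```python
import sqlite3

# Hypothetical, heavily simplified schema; the real tables have more columns.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE WebPage (
        ID INTEGER PRIMARY KEY AUTOINCREMENT,
        AbsoluteUri TEXT NOT NULL UNIQUE,
        Source TEXT
    );
    CREATE TABLE DisallowedAbsoluteUris (
        AbsoluteUri TEXT NOT NULL UNIQUE
    );
""")


def crawl(conn, absolute_uri, source):
    """Store one crawl result; return the row ID, or None if the URI is disallowed."""
    disallowed = conn.execute(
        "SELECT 1 FROM DisallowedAbsoluteUris WHERE AbsoluteUri = ?",
        (absolute_uri,),
    ).fetchone()
    if disallowed:
        return None  # a disallowed AbsoluteUri is not crawled

    row = conn.execute(
        "SELECT ID FROM WebPage WHERE AbsoluteUri = ?", (absolute_uri,)
    ).fetchone()
    if row:
        # Recrawl: modify the existing row rather than inserting a new one.
        conn.execute(
            "UPDATE WebPage SET Source = ? WHERE ID = ?", (source, row[0])
        )
        return row[0]

    cur = conn.execute(
        "INSERT INTO WebPage (AbsoluteUri, Source) VALUES (?, ?)",
        (absolute_uri, source),
    )
    return cur.lastrowid
```

Calling `crawl` twice with the same AbsoluteUri on different days returns the same ID both times, so the table still holds a single row for that page.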

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

Top 10 Contributor
1,714 Posts

Those dates are database row "timestamps".

lastModified gets populated/updated when the WebPage source changes.

-Mike
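
One way to picture why lastModified is mostly null is the sketch below, again in Python with SQLite and again not arachnode.net's implementation. It assumes that lastDiscovered is refreshed on every crawl and that a change in the source is detected by comparing a hash; the `SourceHash` column and the `record_crawl` function are hypothetical names introduced only for this example.

```python
import hashlib
import sqlite3
from datetime import datetime, timezone

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE WebPage (
        AbsoluteUri TEXT PRIMARY KEY,
        SourceHash TEXT NOT NULL,      -- hypothetical: used only to detect changes
        lastDiscovered TEXT NOT NULL,  -- assumed to be refreshed on every crawl
        lastModified TEXT              -- stays NULL until the source changes
    )
""")


def record_crawl(db, absolute_uri, source, now=None):
    """Update the row timestamps for one crawl of absolute_uri."""
    now = now or datetime.now(timezone.utc).isoformat()
    new_hash = hashlib.sha256(source.encode("utf-8")).hexdigest()
    row = db.execute(
        "SELECT SourceHash FROM WebPage WHERE AbsoluteUri = ?", (absolute_uri,)
    ).fetchone()
    if row is None:
        # First crawl: lastModified is left NULL.
        db.execute(
            "INSERT INTO WebPage (AbsoluteUri, SourceHash, lastDiscovered) "
            "VALUES (?, ?, ?)",
            (absolute_uri, new_hash, now),
        )
    elif row[0] == new_hash:
        # Unchanged source: only lastDiscovered moves forward.
        db.execute(
            "UPDATE WebPage SET lastDiscovered = ? WHERE AbsoluteUri = ?",
            (now, absolute_uri),
        )
    else:
        # Changed source: lastModified is populated/updated as well.
        db.execute(
            "UPDATE WebPage SET SourceHash = ?, lastDiscovered = ?, "
            "lastModified = ? WHERE AbsoluteUri = ?",
            (new_hash, now, now, absolute_uri),
        )
```

Under these assumptions, a page whose source never changes between runs keeps a null lastModified no matter how often it is recrawled.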


Top 10 Contributor
1,714 Posts

IM me when you are ready to move forward, if you haven't already.

-Mike



copyright 2004-2013, arachnode.net LLC