arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL 2005/2008/CE

Merge index files and crawl updated pages only


Top 10 Contributor
58 Posts
InvestisDev posted on Thu, Apr 12 2012 3:25 AM

Hello,

How can I update/merge the index files of a site? Is there a rule or flag that we need to enable to achieve this?

Also, how can I crawl only the pages of a site that have been updated?

Please suggest a way to achieve this.

Thanks,

All Replies

Top 10 Contributor
1,692 Posts

They should merge/update automatically.

AN supports continuous crawling, so you can crawl and re-crawl existing data and new sites while querying at the same time. An update to a page simply replaces the existing version in the index.

To crawl ONLY the pages that have been updated you'd need a list of them, like a Google SiteMap.
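As an illustration of the sitemap idea (this is not arachnode.net code, just a minimal Python sketch), the function below reads a sitemap in the sitemaps.org format and returns the URLs whose lastmod value is later than the previous crawl. The name updated_urls and the choice to treat entries without lastmod as "possibly changed" are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def updated_urls(sitemap_xml, last_crawl):
    """Return the URLs whose <lastmod> is later than last_crawl.

    last_crawl must be a timezone-aware datetime. Entries without
    <lastmod> are returned as well (assumption: when we cannot tell
    whether a page changed, we crawl it).
    """
    root = ET.fromstring(sitemap_xml)
    result = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        if loc is None:
            continue
        lastmod = url.findtext("sm:lastmod", namespaces=NS)
        if lastmod is None:
            result.append(loc.strip())
            continue
        # Older Python versions do not accept a trailing "Z" here.
        modified = datetime.fromisoformat(lastmod.strip().replace("Z", "+00:00"))
        if modified.tzinfo is None:
            # Date-only values carry no zone; assume UTC.
            modified = modified.replace(tzinfo=timezone.utc)
        if modified > last_crawl:
            result.append(loc.strip())
    return result
```

The returned list could then be fed to the crawler as the only URLs to request.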

By default, AN reads the Last-Modified header. If the page needs to be downloaded it will be; otherwise the WebPage Source will be retrieved from the DB or from disk, wherever you have elected to store it.
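The decision described above can be sketched roughly as follows. This is a hypothetical Python illustration, not AN's implementation: needs_download is an invented name, and the rule that a missing or unparseable date on either side forces a download is an assumption.

```python
from email.utils import parsedate_to_datetime


def needs_download(stored_last_modified, header_last_modified):
    """Decide whether a page should be fetched again.

    Both arguments are HTTP-date strings (e.g. the value stored from the
    previous crawl and the value of the current Last-Modified header),
    or None when no value is available.
    """
    # Assumption: without both dates we cannot prove the page is unchanged.
    if stored_last_modified is None or header_last_modified is None:
        return True
    try:
        stored = parsedate_to_datetime(stored_last_modified)
        current = parsedate_to_datetime(header_last_modified)
        return current > stored
    except (TypeError, ValueError):
        # Unparseable or non-comparable dates: download to be safe.
        return True
```

Under these assumptions, a stored value that is always NULL would mean every page is downloaded on every crawl, since there is nothing to compare against.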

Thanks,
Mike 

 

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

Top 10 Contributor
58 Posts

Hello Mike,

I think I still don't understand your point. Right now, when a crawl of any site starts, there is code at the beginning that deletes all the folders used while crawling a site, such as "ConsoleOutputLogs, DownloadedFiles, DownloadedImages, DownloadedWebPages, LuceneDotNetIndex".

Also, at the beginning of a crawl, a stored procedure is executed with "arachnodeDAO.ExecuteSql("EXEC [dbo].[arachnode_usp_arachnode.net_RESET_DATABASE]");", which deletes data from most of the tables, such as "Images, WebPages, WebPages_MetaData_TermExtraction, Discoveries, Files_Discoveries, HyperLinks, etc.".

So when the same site is crawled again, everything is listed again as if the site were being crawled for the first time.

So the data is removed from both the DB and the disk.

Please give some more detail on how to resolve this.

Thanks,

 

Top 10 Contributor
1,692 Posts

That code should be optional, and executed based on user input from the DEBUG build configuration.

Diff Console\Program.cs - did this file get changed somehow?

Are you using the DEMO build configuration?

Those steps to clear the data are executed by answering 'y' to the prompts.

Thanks,
Mike 


Top 10 Contributor
58 Posts

Thanks for the reply, Mike,

Yes, you are correct. I had changed the code in Program.cs so that it cleared all the data without asking the user. I have reverted that change and checked it by crawling one site, which showed properly updated results.

I then made some changes to the same site and crawled it again to check the index merge logic. This time I answered 'n' to the prompts so that the data was not cleared from the DB, and I did not remove the folders either.

While crawling the site, it took data from the DB, but the files were downloaded again. Also, the "LastModified" field in the "Webpage" table is not updating; its value is always NULL.

From this it seems that AN crawls the complete site again and then merges the indexes. Is that true?

I would like to know whether it is possible to crawl only the pages that have been updated. Re-crawling takes the same time as the first crawl, and it also visits every URL to crawl it and update the index file.

Let me know if there is a way to do so (the existing feature is really enough for me; this is just out of curiosity).

Thanks,

 

Top 10 Contributor
1,692 Posts

Great!  I am glad it is working for you.

If you re-crawl the same data and the pages change, you will see the LastModified field update.

Yes, once a crawl completes AN will merge the index files.

You will only know which pages have been updated by consulting the site's Google SiteMap, or some other master resource that tells you explicitly which pages have been updated. As AN uses the 'Last-Modified' HTTP header, if a page hasn't been updated at the WebServer it will be recalled from disk or the DB, wherever you have elected to store it. AN then proceeds as normal and continues crawling. In theory, a site can be re-crawled from HEAD requests and the DB/disk only, without issuing GET requests to the WebServer.
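To make the HEAD-then-cache flow concrete, here is a small Python sketch. It is a simplified model under stated assumptions, not AN's code: recrawl, cache, head and get are hypothetical names, the network calls are passed in as functions so the example runs without a server, and the Last-Modified values are compared as plain strings.

```python
def recrawl(urls, cache, head, get):
    """Re-crawl urls, issuing a GET only when Last-Modified has changed.

    cache: dict mapping url -> (last_modified, source) from the last crawl
           (stands in for the DB or disk store).
    head:  callable(url) -> current Last-Modified value, or None if absent.
    get:   callable(url) -> (last_modified, source) from a full download.
    Returns the list of URLs that had to be downloaded again.
    """
    downloaded = []
    for url in urls:
        cached = cache.get(url)
        current = head(url)
        unchanged = (
            cached is not None
            and current is not None
            and cached[0] == current
        )
        if not unchanged:
            # New page, changed page, or no Last-Modified header: fetch it.
            cache[url] = get(url)
            downloaded.append(url)
        # Either way, cache[url][1] now holds the source to parse and index.
    return downloaded
```

Every URL still costs one HEAD request in this model, so the saving is in bandwidth and download time rather than in the number of URLs visited.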

Cheers,
Mike



copyright 2004-2013, arachnode.net LLC