arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE, MongoDB, RavenDB, or Hadoop

Completely Open Source @ GitHub


How does arachnode.net identify duplicate page content?


Justin posted on Fri, Mar 27 2015 5:36 AM
Hi, I am playing with arachnode.net and have a few questions.

1. How does arachnode.net deal with duplicate content? Specifically, suppose link1's content is fetched on the first crawl; on the next crawl, how is link1's previous content compared with its current content? Can arachnode.net tell whether link1's content has been modified, and if so, how does it do it? I ask because other crawlers, Nutch for example, do have this feature, but Nutch creates a signature of the entire web page: if only the header changes and the actual content does not, the page is still treated as changed.

2. The official website says arachnode.net supports grabbing dynamic content generated by AJAX. Is this fully supported? For example, does it work with websites built on WordPress?

3. For configuration, if I have to change some settings, are the database tables the only place I need to go? Are there any other settings in the program or in a configuration file, for example URL filtering, seed URLs, etc.?
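To illustrate the whole-page-signature concern from question 1, here is a minimal C# sketch contrasting a signature of the raw HTML with a signature of a crudely isolated content region; the regex-based content extraction is a naive placeholder for illustration, not arachnode.net or Nutch code.

    // Illustration only: whole-page signature vs. content-only signature.
    using System;
    using System.Security.Cryptography;
    using System.Text;
    using System.Text.RegularExpressions;

    static class PageSignature
    {
        // Hash of the raw HTML: flips whenever anything on the page changes,
        // including headers, navigation, ads or timestamps (the Nutch-style caveat).
        public static string WholePageHash(string html)
        {
            return Sha256(html);
        }

        // Hash of a crudely isolated content region: ignores everything outside
        // <body>, strips tags and collapses whitespace, so boilerplate edits are
        // less likely to change the signature.
        public static string ContentOnlyHash(string html)
        {
            Match body = Regex.Match(html, @"<body[^>]*>(.*?)</body>",
                                     RegexOptions.Singleline | RegexOptions.IgnoreCase);
            string text = Regex.Replace(body.Success ? body.Groups[1].Value : html, "<[^>]+>", " ");
            text = Regex.Replace(text, @"\s+", " ").Trim();
            return Sha256(text);
        }

        private static string Sha256(string value)
        {
            using (SHA256 sha = SHA256.Create())
            {
                return BitConverter.ToString(sha.ComputeHash(Encoding.UTF8.GetBytes(value))).Replace("-", "");
            }
        }
    }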

All Replies

Mike replied:

1.) AN uses the Last-Modified header and a few other tricks (hashing/caching at multiple levels, and respecting any intermediary proxies that may be in the download chain), akin to what Nutch does. If the website supports HTTP HEAD requests, that aids the process as well. (A rough sketch of this approach follows below, after point 3.)

2.) Yes, it can deal with AJAX content / anything dynamically generated.

3.) You can also change everything in code via ApplicationSettings.* - look at the AssignApplicationSettingsFor* methods in Console\Program.cs. (A sketch of code-based configuration follows below.)
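
A rough, hedged sketch of the Last-Modified / HEAD idea from point 1, using plain .NET HttpClient rather than the arachnode.net downloader; the method and parameters are illustrative only:

    // Sketch of change detection: cheap HEAD first, then a conditional GET,
    // then a content-hash comparison as the fallback.
    using System;
    using System.Net;
    using System.Net.Http;
    using System.Security.Cryptography;
    using System.Threading.Tasks;

    static class ChangeDetection
    {
        private static readonly HttpClient Client = new HttpClient();

        public static async Task<bool> HasChangedAsync(Uri uri, DateTimeOffset lastCrawl, string lastHash)
        {
            // 1. Try a cheap HEAD request, if the server supports it.
            var head = new HttpRequestMessage(HttpMethod.Head, uri);
            using (HttpResponseMessage headResponse = await Client.SendAsync(head))
            {
                DateTimeOffset? lastModified = headResponse.Content.Headers.LastModified;
                if (lastModified.HasValue && lastModified.Value <= lastCrawl)
                {
                    return false; // Server says nothing changed since the last crawl.
                }
            }

            // 2. Otherwise issue a conditional GET; a 304 means "not modified".
            var get = new HttpRequestMessage(HttpMethod.Get, uri);
            get.Headers.IfModifiedSince = lastCrawl;
            using (HttpResponseMessage response = await Client.SendAsync(get))
            {
                if (response.StatusCode == HttpStatusCode.NotModified)
                {
                    return false;
                }

                // 3. Fall back to hashing the downloaded bytes and comparing signatures.
                byte[] bytes = await response.Content.ReadAsByteArrayAsync();
                using (SHA256 sha = SHA256.Create())
                {
                    string hash = BitConverter.ToString(sha.ComputeHash(bytes)).Replace("-", "");
                    return !hash.Equals(lastHash ?? string.Empty, StringComparison.OrdinalIgnoreCase);
                }
            }
        }
    }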
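
And a self-contained sketch of the idea behind point 3, configuring the crawl in code rather than through the database tables; the CrawlSettings type and its properties below are stand-ins for illustration, not the actual ApplicationSettings members (for those, see the AssignApplicationSettingsFor* methods in Console\Program.cs):

    // Stand-in settings type, NOT the AN ApplicationSettings class: it only shows
    // the kind of knobs (seed URLs, URL filtering, thread count) the poster asks about.
    using System;
    using System.Collections.Generic;

    internal sealed class CrawlSettings
    {
        public string UserAgent { get; set; }
        public int MaximumNumberOfCrawlThreads { get; set; }
        public List<string> DisallowedExtensions { get; } = new List<string>();
        public List<Uri> SeedUris { get; } = new List<Uri>();
    }

    internal static class ConfigurationSketch
    {
        private static void Main()
        {
            // Values that would otherwise live in the configuration tables.
            var settings = new CrawlSettings
            {
                UserAgent = "MyCrawler/1.0",
                MaximumNumberOfCrawlThreads = 10
            };

            settings.DisallowedExtensions.Add(".exe");
            settings.SeedUris.Add(new Uri("http://example.com/"));

            Console.WriteLine("Crawling {0} seed(s) with {1} thread(s).",
                              settings.SeedUris.Count, settings.MaximumNumberOfCrawlThreads);
        }
    }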

Thanks,
Mike 

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

Justin replied on Mon, Mar 30 2015 12:41 AM (verified answer):
Thanks, Mike. Currently I am playing with the demo version; can I use the demo version to test whether AN supports crawling web page content generated by AJAX? If so, could you guide me on how to do that? Many thanks!
Mike replied:

http://arachnode.net/blogs/arachnode_net/archive/2015/04/01/ajax-dynamic-content.aspx - gives a nice overview of the Dynamic/AJAX capabilities of AN.



copyright 2004-2017, arachnode.net LLC