arachnode.net v2.0
An open source .NET web crawler written in C# using SQL 2005/2008

Kevin's FAQ

Page Details

First published by:
Kevin
on 06-17-2009
Last revision by:
Kevin
on 06-26-2009

OK, so this is my page of frequently asked questions and their answers, based on my personal usage of arachnode.net products.  Hope it helps!

If you have more questions, ask them in the forums and if appropriate I'll add them here.

- - - - -

Q: I heard the "sweet spot" for the number of crawl threads in most cases seems to be 10.  Is this correct?

A: This really depends on what your machine can handle.  I have one machine that can use 200 threads before there isn't a measurable performance improvement.  The best way to find your sweet spot is to use the performance counters.
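
As a rough sketch of how to read such measurements (this is not arachnode.net code): run the crawler at several thread counts, note the throughput the performance counters report for each, and pick the last thread count before the gains flatten out. The function name, the 5% threshold and the sample numbers below are all made up for illustration.

```python
def sweet_spot(samples, min_gain=0.05):
    """Pick the thread count after which adding threads stops paying off.

    samples: list of (thread_count, pages_per_second), sorted by thread_count.
    min_gain: smallest relative improvement still counted as "measurable"
              (0.05 = 5%; an arbitrary threshold chosen for this sketch).
    """
    for (threads, rate), (_, next_rate) in zip(samples, samples[1:]):
        # Stop at the first step up that fails to beat the threshold.
        if next_rate < rate * (1 + min_gain):
            return threads
    # Every step still helped: the largest tested count is the best known.
    return samples[-1][0]


# Invented measurements, only to show the shape of the input.
measurements = [(10, 40.0), (50, 150.0), (100, 240.0), (200, 300.0), (400, 305.0)]
print(sweet_spot(measurements))  # → 200
```

If the function returns the largest thread count you tested, throughput was still climbing, so test higher counts before settling on a number.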

- - - - -

Q: When the crawler is run and there is nothing pending in the CrawlRequests table, is there logic in place that looks for old stuff to re-crawl?  If so, where is that and what timeframe does it use to decide if re-crawling is needed?

A: This depends on the 'CreateCrawlRequestsFromDatabase*' settings.  If all of these settings were set to 'true', then the order would be CrawlRequests, HyperLinks, WebPages, Files and then Images, ordered by Priority DESC, DateCreated ASC (or LastDiscovered ASC).  The actual logic in the procedure '[arachnode_omsp_CrawlRequests_SELECT]' is a touch more specific for DiscoveryTypes that have actual Discoveries (like HyperLinks, Files and Images), but this suffices to answer your question.
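
To make that ordering concrete, here is a small sketch of one reading of it: candidates are grouped by source in the order listed, and within a source sorted by Priority DESC, then date ASC. It is not the stored procedure's logic. The dictionary keys are hypothetical, and a single "date" field stands in for DateCreated or LastDiscovered.

```python
from datetime import datetime

# Source order as listed in the answer above.
SOURCE_ORDER = ["CrawlRequests", "HyperLinks", "WebPages", "Files", "Images"]


def order_candidates(rows):
    """Sort by source order, then Priority DESC, then date ASC."""
    return sorted(
        rows,
        key=lambda r: (SOURCE_ORDER.index(r["source"]), -r["priority"], r["date"]),
    )


# Invented rows, only to show the shape of the input.
rows = [
    {"source": "WebPages", "priority": 5, "date": datetime(2009, 6, 1)},
    {"source": "CrawlRequests", "priority": 1, "date": datetime(2009, 6, 3)},
    {"source": "CrawlRequests", "priority": 9, "date": datetime(2009, 6, 2)},
]
for row in order_candidates(rows):
    print(row["source"], row["priority"], row["date"].date())
```

Under this reading, an old WebPage is only re-crawled once every pending CrawlRequest and HyperLink has been picked up, whatever its priority.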

- - - - -

Q: More and more links go through shortener services like tinyurl.com/... or bit.ly/... .  Are there any features built in to Arachnode.Net that allow walking the links they lead to, without ending up walking all of tinyurl.com and bit.ly?

A: What happens when you crawl a tinyurl.com URL?  I do have a //TODO: in the code to check into redirection and how we should handle it.  Check Line 155 in WebClient.cs.

- - - - -

Q: When Arachnode.Net is crawling webpages that have already been crawled, is it smart enough to only re-crawl content that has changed since the last visit?

A: Good one.  No.  This is fairly easy though.  Currently, the DiscoveryTypeID is passed into the Engine, so we do have a way to detect whether the Discovery is a File, Image or WebPage, and the HTTP header 'If-Modified-Since' could be custom tailored to the DiscoveryType.  There is a //TODO: to make a switch for this though.
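
For anyone unfamiliar with conditional requests, the idea can be sketched like this. It is not arachnode.net code: the per-type switch (CONDITIONAL_TYPES) and both function names are hypothetical. The crawler sends the time of its last visit in 'If-Modified-Since', and a 304 Not Modified response means the stored copy is still current.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# Hypothetical switch: which discovery types get a conditional request.
CONDITIONAL_TYPES = {"File", "Image", "WebPage"}


def conditional_headers(discovery_type, last_crawled):
    """Build request headers, adding If-Modified-Since when the type opts in.

    last_crawled: timezone-aware datetime of the previous crawl, or None
                  if the item has never been crawled.
    """
    if discovery_type not in CONDITIONAL_TYPES or last_crawled is None:
        return {}
    # HTTP dates are always expressed in GMT.
    stamp = format_datetime(last_crawled.astimezone(timezone.utc), usegmt=True)
    return {"If-Modified-Since": stamp}


def needs_recrawl(status_code):
    """304 Not Modified means the content has not changed since last_crawled."""
    return status_code != 304


print(conditional_headers("WebPage", datetime(2009, 6, 17, tzinfo=timezone.utc)))
# → {'If-Modified-Since': 'Wed, 17 Jun 2009 00:00:00 GMT'}
```

Not every server honors the header; one that ignores it simply returns the full content with a 200, so the worst case is an ordinary re-crawl.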

- - - - -

Q: What settings do I need to use to ensure I am collecting webpage source data for use in full-text indexing queries?

A: In the Configuration table, make sure InsertWebPages AND InsertWebPagesSource are set to true.  Double-check that your FTI catalogs are all set up correctly (sometimes the db restore when you initially create the arachnode.net db needs tweaking).  That should be it!  Do some crawls then run a test FTI query like the one below to make sure you get results:

 select source, CAST(source as varchar(max)), * from webpages where contains(source, 'arachnode');

Also, you might decide whether you still want to create the DotNetLucene indexes, since you are probably doing FTI.  If you do not want the Lucene indexes, edit the CrawlActions table and disable the Lucene plug-in.

- - - - -

Q: How exactly does the MaximumNumberOfCrawlRequestsToCreatePerBatch setting work?  I see that if I set it to 1000, the query that returns crawl records will limit itself to that number.  But what happens when those crawls are completed?  Does AN continue to grab MaximumNumberOfCrawlRequestsToCreatePerBatch records to crawl, until there is nothing else to crawl?  If so, what is the benefit of grabbing MaximumNumberOfCrawlRequestsToCreatePerBatch records at a time?

A: It is a SELECT TOP(x) setting.  When the DB gets large the query gets expensive, and it is nearly as expensive for a small number of rows as for a large one (there is very, very little difference between sorting thousands of rows and sorting tens of rows).  Selecting more records from the DB at a time therefore avoids spending unnecessary time on the query that returns the CrawlRequests.  AN will grab the requested number again and again until there are no more CrawlRequests to be crawled.
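
The trade-off can be seen in a toy model of the loop (an illustration only, not arachnode.net code; the function and its arguments are invented). A list slice stands in for the SELECT TOP(x) query, and the function counts how many times that query runs.

```python
def crawl_in_batches(pending, batch_size, crawl):
    """Take up to batch_size requests at a time until none remain.

    Returns how many batch queries were issued, including the final
    one that comes back empty and ends the loop.
    """
    queries = 0
    while True:
        batch = pending[:batch_size]  # stands in for SELECT TOP(batch_size)
        queries += 1
        if not batch:
            break
        del pending[:batch_size]  # these requests are now taken
        for request in batch:
            crawl(request)
    return queries


crawled = []
print(crawl_in_batches(list(range(2500)), 1000, crawled.append))  # → 4
print(crawl_in_batches(list(range(2500)), 10, crawled.append))    # → 251
```

With 2500 pending requests, a batch size of 1000 runs the expensive query 4 times, while a batch size of 10 runs it 251 times for the same amount of crawling.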

 

Recent Comments

By: arachnode.net Posted on 06-18-2009 3:46 PM

NICE!  KEEP 'EM COMING!  :)

By: arachnode.net Posted on 08-20-2009 6:05 PM

I really need to write a book on arachnode.net - not for vanity's sake - but so that users will know how to make the most of it!

copyright 2009, arachnode.net LLC