Frequently Asked Questions v.99 beta

How does arachnode.net work?
Arachnode.net may be installed on any number of computers, each of which acts as a crawler. The crawler engine regulates crawling behavior and is responsible for spawning crawl processes, ensuring distinct crawls, and updating statistics. Each crawl thread downloads, processes, and stores content in a relational database according to a set of configurable rules and actions. The database component may reside on a crawler or on any server satisfying the minimum requirements for SQL Server 2005. Federation and distribution are achieved through replication within SQL Server 2005 or through the implementation of rules and actions.
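
To illustrate what "ensuring distinct crawls" across several crawl threads can look like, here is a minimal sketch in Python (not the C# used by arachnode.net). The names CrawlFrontier and run_crawl_threads are hypothetical, and the in-memory set stands in for whatever bookkeeping the real engine keeps in SQL Server; treat it as one possible approach under those assumptions, not a description of the actual implementation.

```python
import threading
from collections import deque
from typing import Callable, Optional


class CrawlFrontier:
    """Hypothetical queue of addresses that hands out each address at most once."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._seen = set()
        self._pending = deque()

    def add(self, url: str) -> bool:
        """Queue url unless it was already queued; return True if it was new."""
        with self._lock:
            if url in self._seen:
                return False
            self._seen.add(url)
            self._pending.append(url)
            return True

    def next(self) -> Optional[str]:
        """Return the next pending address, or None when nothing is pending."""
        with self._lock:
            return self._pending.popleft() if self._pending else None


def run_crawl_threads(frontier: CrawlFrontier, thread_count: int,
                      handle: Callable[[str], None]) -> None:
    """Start thread_count workers that drain the frontier, then wait for them."""

    def worker() -> None:
        while True:
            url = frontier.next()
            if url is None:
                return
            handle(url)

    threads = [threading.Thread(target=worker) for _ in range(thread_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Because the seen-set and the queue share one lock, two threads can never be handed the same address, regardless of how many threads are started.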

What is a discovery?
A discovery is any piece of content that arachnode.net downloads. The first discovery of a piece of content is recorded and sourced. Subsequent discoveries of identical content are also sourced, and the original piece of content is updated if necessary.
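
The record-once, source-every-time behavior can be sketched as follows, in Python for illustration only. The FAQ does not say how identical content is detected; comparing SHA-256 hashes is an assumption made here, and DiscoveryStore and its methods are hypothetical names rather than part of arachnode.net.

```python
import hashlib
from typing import Dict, List


class DiscoveryStore:
    """Hypothetical in-memory store: content is kept once, every source is kept."""

    def __init__(self) -> None:
        self._content: Dict[str, bytes] = {}
        self._sources: Dict[str, List[str]] = {}

    @staticmethod
    def _key(content: bytes) -> str:
        # Assumption: identical content is recognized by its SHA-256 digest.
        return hashlib.sha256(content).hexdigest()

    def record(self, content: bytes, source_url: str) -> bool:
        """Record a discovery; return True only for the first discovery."""
        key = self._key(content)
        first = key not in self._content
        if first:
            self._content[key] = content
            self._sources[key] = []
        if source_url not in self._sources[key]:
            self._sources[key].append(source_url)
        return first

    def sources_for(self, content: bytes) -> List[str]:
        """Return every address at which this content was discovered."""
        return list(self._sources.get(self._key(content), []))
```

With this sketch, recording the same bytes from two addresses stores the content once and lists both addresses as its sources.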

Which content types does arachnode.net collect?
Web pages, files, images, hyperlinks, and e-mail addresses.

What is configurable in arachnode.net?
Various crawling behaviors are configurable, including the number of simultaneous crawls, aggressiveness, restriction of crawls to a domain or host, verbose console output, statistics generation, and metadata extraction. Additionally, a pre- and post-request rule and action engine can tailor crawling behavior even further. Pre-request rules include address filtering, robots.txt, frequency, and depth. Post-request rules include content filtering. Post-request actions include CommunityServer 2007.1 integration.
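
The split between pre-request and post-request rules can be pictured with a small Python sketch. This is not arachnode.net's rule API: CrawlRequest, restrict_to_host, max_depth, reject_words, and crawl are hypothetical names, and the download function is passed in so the example runs without network access.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional
from urllib.parse import urlparse


@dataclass
class CrawlRequest:
    url: str
    depth: int


# A pre-request rule sees only the request; a post-request rule also sees the body.
PreRule = Callable[[CrawlRequest], bool]
PostRule = Callable[[CrawlRequest, str], bool]


def restrict_to_host(host: str) -> PreRule:
    """Address filtering: accept only requests for the given host."""
    return lambda request: urlparse(request.url).hostname == host


def max_depth(limit: int) -> PreRule:
    """Depth rule: accept only requests at or above the depth limit."""
    return lambda request: request.depth <= limit


def reject_words(words: Iterable[str]) -> PostRule:
    """Content filtering: reject bodies containing any listed word."""
    lowered = [w.lower() for w in words]
    return lambda request, body: not any(w in body.lower() for w in lowered)


def crawl(request: CrawlRequest, pre_rules: Iterable[PreRule],
          post_rules: Iterable[PostRule],
          download: Callable[[str], str]) -> Optional[str]:
    """Return the body if every rule accepts the request, else None."""
    if not all(rule(request) for rule in pre_rules):
        return None  # rejected before any network traffic
    body = download(request.url)
    if not all(rule(request, body) for rule in post_rules):
        return None  # downloaded, but rejected by a content rule
    return body
```

The point of the two stages is cost: a pre-request rule can reject an address before anything is downloaded, while a post-request rule needs the content in hand.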

I've installed arachnode.net and it seems slow.  What can I do to improve crawl performance?
Open App.Config and examine 'maximumNumberOfCrawlThreads'.  By default this is set to 1.  Increasing the number of crawl threads will increase the rate at which arachnode.net crawls.  Also, open CrawlRules.config and examine the rule 'Arachnode.SiteCrawler.Rules.Frequency'.  Check the 'threadSleepTimeInMillisecondsBetweenCrawlRequests' setting.  By default, it is set so that a crawl thread sleeps 1 second between CrawlRequests; lowering it increases the crawl rate.
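
As a rough guide to how these two settings interact, the Python sketch below estimates an upper bound on request rate. It assumes the sleep applies to each crawl thread independently (suggested by the setting name, but not confirmed here), and both the function name and the average-request-time parameter are hypothetical.

```python
def max_requests_per_second(crawl_threads: int, sleep_ms: float,
                            avg_request_ms: float = 0.0) -> float:
    """Upper bound on crawl rate if every thread sleeps sleep_ms between requests.

    Assumption: each request costs a thread sleep_ms plus avg_request_ms.
    """
    seconds_per_request = (sleep_ms + avg_request_ms) / 1000.0
    if seconds_per_request <= 0:
        return float("inf")
    return crawl_threads / seconds_per_request


# Defaults from the FAQ: 1 thread, 1 second of sleep between CrawlRequests.
default_rate = max_requests_per_second(crawl_threads=1, sleep_ms=1000)

# Four threads, same sleep, and a hypothetical 1000 ms average download time.
tuned_rate = max_requests_per_second(crawl_threads=4, sleep_ms=1000,
                                     avg_request_ms=1000)
```

Under these assumptions the default configuration cannot exceed one request per second no matter how fast the network is, which is why both settings are worth checking.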

I've discovered inappropriate or offensive content on this site.  What should I do?
Arachnode.net accepts or rejects addresses and content according to a set of configurable rules and is currently configured to exclude commonly objectionable content.  To suggest improvements to our filters, please send a private message to 'arachnode.net'.

I'd like to remove my content from your index.
Arachnode.net is committed to being a good web citizen. That means that we honor all common conventions for robots.txt. If you feel that Arachnode.net has contacted your site in error, or if you would prefer not to be listed in arachnode.net’s indexing system for any other reason, please send a private message to 'arachnode.net'.

Does arachnode.net interfere with search results and rankings of original content owners?
The content displayed to an unregistered user (that is, what a site crawler would see) is used with explicit permission under Creative Commons licenses.  All Discoveries are shown, and each piece of Creative Commons content links back to its original source as required by the respective Creative Commons license.

Is arachnode.net safe?
Yes!  Absolutely!  Positively!  Source code is provided in its entirety and no part of arachnode.net is pre-compiled.

What is http://arachnode.net powered by?
http://arachnode.net is powered by IBM e326m and x3550 servers with dual- and quad-core Xeon and Opteron processors running Windows Server 2003 R2 in high-availability and compute cluster configurations.



arachnode.net - a .NET web crawler written in C# using SQL 2005