What is Arachnode.net?
Arachnode.net is an open source promiscuous Web crawler for downloading, indexing and storing Internet content including e-mail addresses, files, hyperlinks, images, and Web pages. Arachnode.net is written in C# using SQL Server 2005.
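To make the kinds of content listed above concrete, here is a minimal sketch of pulling hyperlinks, image references and e-mail addresses out of a single downloaded page. It is written in Python using only the standard library and is not arachnode.net's C# code; the names DiscoveryParser, extract_discoveries and EMAIL_PATTERN are hypothetical, and the e-mail pattern is a deliberately loose assumption rather than a full address grammar.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Loose, illustrative pattern (an assumption, not a complete address grammar).
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class DiscoveryParser(HTMLParser):
    """Hypothetical helper that collects links, images and e-mail addresses."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.hyperlinks = []
        self.images = []
        self.email_addresses = set()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and attributes.get("href"):
            href = attributes["href"]
            if href.lower().startswith("mailto:"):
                # Drop the scheme and any ?subject=... query part.
                self.email_addresses.add(href[len("mailto:"):].split("?")[0])
            else:
                # Resolve relative links against the page's own address.
                self.hyperlinks.append(urljoin(self.base_url, href))
        elif tag == "img" and attributes.get("src"):
            self.images.append(urljoin(self.base_url, attributes["src"]))

    def handle_data(self, data):
        # Plain-text addresses that are not wrapped in a mailto: link.
        self.email_addresses.update(EMAIL_PATTERN.findall(data))


def extract_discoveries(base_url, html):
    """Parse one page and return the parser holding everything it found."""
    parser = DiscoveryParser(base_url)
    parser.feed(html)
    parser.close()
    return parser
```

A crawler would typically queue the returned hyperlinks for later requests and store the images and addresses alongside the page that referenced them.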
What can I do with it?
Research and Analysis:
Arachnode.net extracts, collects, sorts and parses downloaded content into multiple forms, including XML. SSIS packages extract terms and phrases from text content, and over 120 stored procedures and views are provided to jump-start Analysis Services or other text mining applications.
Education:
Arachnode.net is an excellent tool for learning introductory to advanced crawling techniques, and showcases many of the features of the .NET Framework and SQL Server 2005, including full-text indexing, multi-threading, caching, reflection, interfaces, object-oriented concepts, SQL common language runtime functions and regular expressions.
Content Aggregation:
Arachnode.net is appropriate for personal content aggregation, crawling intranets of any size or crawling the Internet as a whole. This site surfaces a small subset of the data collected by Arachnode.net, as integrated into a lightly modified Community Server installation. Arachnode.net integrates with Community Server 2007.1, automatically creating users for discovered hosts, posting discovered images to galleries and creating blog mirrors of discovered RSS, Atom and XML content.
How can I get it?
Source code and a database backup are available on this site and at SourceForge.net. A Virtual Machine (.vmc) and Virtual Hard Disk (.vhd) pre-configured with Windows Server 2003 R2 Enterprise Edition, Visual Studio 2008, SQL Server 2005 Enterprise Edition, the Community Server 2007.1 SDK and TortoiseSVN are available here. Arachnode.net is released under the GNU General Public License.
Anything else I should know?
Yes. Please sign the Guestbook here. Anonymous posting is enabled.
Links: Technical Specifications, Frequently Asked Questions, Class Diagrams, Content
Recent News
- 07.04.2008 - Revisions 21 and 22 checked in.
- 04.16.2008 - We'd like to thank everyone who has downloaded Arachnode.net. Arachnode.net is currently ranked in the 99.22nd activity percentile on Sourceforge.net.
- 04.14.2008 - Alexa site thumbnails added. Where possible, user avatars will show a thumbnail preview of the referenced site.
- 04.12.2008 - We'd like to thank http://visualsvn.com for generously providing a copy of their essential version control application, VisualSVN.


Change Log
- Revision 1: Added \database
- Revision 2: Added \source
- Revision 3: Added arachnode.net.zip to \database
- Revision 4: Added files from \source to \source
- Revision 5: Corrected \source\Configuration\App.config paths for easier integration into first Virtual Hard Disk posting. Previous paths did not include \source as part of the path.
- Revision 6: Corrected \source\Test\App.config paths for easier integration into first Virtual Hard Disk posting. Previous paths did not include \source as part of the path.
- Revision 7: Removed obsolete reference to Lucene.NET.
- Revision 8: Updated the CommunityServerGalleryService password to match the hashed version in the database stored procedures.
- Revision 9: Added documentation. 65% complete. Improved CommunityServer 2007.1 integration components.
- Revision 10: Improved [arachnode_cssp_CreateUser]. No schema changes made.
- Revision 11: Small changes for a more correct default installation.
- Revision 12: Corrected CommunityServer integration WebService function name.
- Revision 13: Updated CreateBlogMirrors. This should be the final commit for the Virtual Hard Disk installation. (The database commit will trail this one.)
- Revision 14: Updated arachnode_cssp_CreateBlogMirrors. Added arachnode.net.zip to \database.
- Revision 15: Added instructional text to Program.cs.
- Revision 16: RuleManager was missing a call to ICrawlRule.PerformAdditionalActions. This call is needed to enable the crawl delay in Frequency.cs.
- Revision 17: Added AppDomain level error handling.
- Revision 18: Improved cache performance and resolved a condition where crawling would loop as a result of Discoveries prematurely timing out in the cache. Upgraded Thread.Resume() and Thread.Suspend() in Engine.cs to .NET 2.0 standards.
- Revision 19: Improved null object and GalleryPost category handling in PushImagesToCommunityServer.cs.
- Revision 20: Fixed Frequency.cs. Added 'PreAndPost' to RuleType.cs. Frequency.cs now properly manages the maximum number of WebPage requests per day.
- Revision 21: Added 'negateDisallowed' to the Address.cs and Content.cs CrawlRules, allowing arachnode.net to crawl only specific content genres. The default configuration for arachnode.net is to disallow all adult-themed content. Enabling 'negateDisallowed' will cause arachnode.net to crawl adult-themed content only. Refactored ArachnodeDAO.cs to pass a System.Exception to the exception reporting methods instead of individual parameters. Added CLR methods to generate typographical errors from input strings. Improved the robustness of Crawl.cs and added thread status tracking capabilities. Improved the error handling of ActionManager.cs and RuleManager.cs. Improved duplicate page detection in Cache.cs.
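The 'negateDisallowed' behaviour described in Revision 21 can be pictured as inverting the result of a disallow check. The sketch below is a Python illustration of that idea only, not the code in Address.cs or Content.cs; the ContentRule class, its fields and the whole-word term matching are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ContentRule:
    """Hypothetical rule: a set of lower-case disallowed terms plus a negate flag."""

    disallowed_terms: set = field(default_factory=set)
    negate_disallowed: bool = False

    def matches_disallowed(self, text):
        # Assumption: a match means any whole word of the text is a disallowed term.
        words = set(text.lower().split())
        return not words.isdisjoint(self.disallowed_terms)

    def is_allowed(self, text):
        matched = self.matches_disallowed(text)
        # Default: reject whatever matches the disallowed list.
        # Negated: accept only what matches the disallowed list.
        return matched if self.negate_disallowed else not matched
```

With the flag off, the rule acts as a blocklist; with the flag on, the same term list acts as an allowlist, which is how one list can restrict a crawl to a single content genre.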
Artistic Overview
Statistically Unique Phrases
The term 'extraoperability' was mentioned once by one of the fun people behind this site. As there is only a small smattering of Google search results for the term, we'd like to add http://arachnode.net to the list too.
- Community crawling
- Social crawling
- Social network crawling
- Friends crawling
- Web spider, website crawler, URL extractor, link collector, web robot, web data extraction software.
- i.bsteele
- We SEO so you don't have to.
What's New