arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE
An Open Source C# web crawler with Lucene.NET search using MongoDB/RavenDB/Hadoop

Completely Open Source @ GitHub

Upgrade

Why upgrade to Version 1.4?

  • Use arachnode.net in commercial/proprietary applications.
  • Reduce RAM consumption by up to 85%.
  • Index and search non-English character sets.
  • Administer and browse discovered data through a web-based interface.
  • Gain access to future feature implementations.
  • Receive priority consideration for feature requests.
  • Index and search .doc/.pdf/.ppt/.xls files.

Fixes implemented in Version 1.4:

  • Memory consumption is greatly reduced.
  • Addition of an administrative application and dynamic database browser.
  • Proper Unicode support. SearchResults.aspx and the Lucene.NET indexes now accurately handle non-English characters.
  • The exact state of the Crawler/Engine is now accurately persisted in crawling environments consuming large amounts of RAM.
  • CrawlRequests are no longer dropped in crawling environments consuming large amounts of RAM.
  • The console gracefully exits when the close box is clicked or when a key is pressed during crawling.
  • All references used by the PriorityQueue are properly reclaimed.
  • The saved state of the Crawler/Engine respects the order in which CrawlRequests were originally discovered/submitted.
  • CodePage is now available and stored in the WebPages table.
  • File handles are now properly closed, reducing RAM consumption.
  • ServicePoint allocation is now integrated with the crawling configuration.
  • The content download rate is improved.
  • Improvements to the regular expressions used for parsing.

 

copyright 2004-2017, arachnode.net LLC