arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE
An Open Source C# web crawler with Lucene.NET search using MongoDB/RavenDB/Hadoop

Completely Open Source @ GitHub

Templater and http://code.google.com/p/boilerpipe/

A cool feature of arachnode.net (AN) that is still in beta is Templater.cs. Another user pointed me to http://code.google.com/p/boilerpipe/, and after viewing their .pdf demo I can already see that their approach, while similar, is faster.

Summary

The boilerpipe library provides algorithms to detect and remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.

The library already provides specific strategies for common tasks (for example: news article extraction) and may also be easily extended for individual problem settings.

Extracting content is very fast (milliseconds), needs only the input document (no global or site-level information required), and is usually quite accurate.

Boilerpipe is a Java library written by Christian Kohlschütter. It is released under the Apache License 2.0.

The algorithms used by the library are based on (and extend) some concepts of the paper "Boilerplate Detection using Shallow Text Features" by Christian Kohlschütter et al., presented at WSDM 2010 (The Third ACM International Conference on Web Search and Data Mining, New York City, NY, USA). The paper and the presentation slides are linked from the boilerpipe project page.
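To make the "shallow text features" idea a bit more concrete, here is a minimal sketch in Python. It is not boilerpipe's code and not its actual classifier. It only illustrates the general approach of splitting a page into text blocks and judging each block by cheap statistics, here word count and link density. The class name `BlockSplitter`, the function `extract_main_text`, the tag sets, and both thresholds are assumptions made up for this example.

```python
from html.parser import HTMLParser

# Assumption: a small, illustrative set of tags that start or end a text block.
BLOCK_TAGS = {"p", "div", "li", "ul", "ol", "td", "tr", "table",
              "h1", "h2", "h3", "br", "body"}
# Text inside these tags is never page content.
IGNORED_TAGS = {"script", "style"}


class BlockSplitter(HTMLParser):
    """Splits HTML into text blocks and counts the words inside links."""

    def __init__(self):
        super().__init__()
        self.blocks = []         # finished blocks: (text, word_count, linked_words)
        self._words = []         # words of the block currently being built
        self._linked = 0         # how many of those words sit inside <a>
        self._anchor_depth = 0
        self._ignore_depth = 0

    def _flush(self):
        # Close the current block, if it has any text, and start a new one.
        if self._words:
            self.blocks.append(
                (" ".join(self._words), len(self._words), self._linked))
        self._words, self._linked = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in IGNORED_TAGS:
            self._ignore_depth += 1
        elif tag == "a":
            self._anchor_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in IGNORED_TAGS:
            self._ignore_depth = max(0, self._ignore_depth - 1)
        elif tag == "a":
            self._anchor_depth = max(0, self._anchor_depth - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._ignore_depth:
            return
        words = data.split()
        self._words.extend(words)
        if self._anchor_depth:
            self._linked += len(words)

    def close(self):
        super().close()
        self._flush()


def extract_main_text(html, min_words=10, max_link_density=0.33):
    """Keep blocks that are long enough and not dominated by link text.

    Both thresholds are illustrative guesses, not values taken from boilerpipe.
    """
    splitter = BlockSplitter()
    splitter.feed(html)
    splitter.close()
    kept = [text for text, count, linked in splitter.blocks
            if count >= min_words and linked / count <= max_link_density]
    return "\n".join(kept)
```

Calling `extract_main_text(page_html)` on a page with a short navigation bar, one long paragraph, and a short footer would keep only the paragraph, because the other blocks fall below the word threshold. Like the library described above, the sketch needs only the single input document and no site-level information, which is a large part of why this style of extraction is cheap.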

News

  • (2010-05-06) boilerpipe 1.0.4
    Now supports TagSoup and other HTML balancers through a SAX ContentHandler. Boilerpipe now available in its own Maven repository as well as in the java.net Maven 2 repository.
  • (2010-01-30) boilerpipe 1.0.3
    Two bug fixes (XML parsing issues). Issues #1 and #2. (Thanks to Tom Taylor, Kaspar Fischer and nedunk for reporting the problems)
  • (2009-12-10) boilerpipe 1.0.2
    This release hot-fixes a NekoHTML bug which caused low-quality results in a rare situation. (Thanks to Kris Jirapinyo for reporting the problem)
  • (2009-12-04) boilerpipe 1.0.1
    Added the dependency libs (xerces and nekohtml) and the javadocs to the binary tarball. (Thanks to Mike Matthews for reporting the problem)
  • (2009-12-03) boilerpipe 1.0.0
    The code is now online. Have fun!

Getting Started

To get started, see the documentation in the Wiki and the binary and source tarballs. Please also read the FAQ; it contains important information.

About the Author

Christian Kohlschütter is currently working at the L3S Research Center. He is a PhD student of Professor Dr. Wolfgang Nejdl. His main research interests are in the area of Web Information Retrieval and Quantitative Linguistics.

 


Posted Mon, May 10 2010 7:51 AM by arachnode.net

copyright 2004-2017, arachnode.net LLC