arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE
An Open Source C# web crawler with Lucene.NET search using MongoDB/RavenDB/Hadoop

Browse Forum Posts by Tags

Showing related tags and posts for the General Questions forum.
  • Re: What is the best method to parse html tags!

    Milan, if you don't have it already, here's a link to the Html Agility Pack docs: http://htmlagilitypack.codeplex.com/Release/ProjectReleases.aspx?ReleaseId=33903
    Posted to General Questions (Forum) by Kevin on Fri, Oct 23 2009
  • Re: Plugin help

    Templater.cs is ALPHA code. 1.) HtmlAgilityPack is a memory hog, and should only be used when you absolutely need it. It is used in the templater code because I need XPath support. 2.) ExtractText does a much, much better job of stripping out tags than the HtmlAgilityPack does, and it's faster as...
    Posted to General Questions (Forum) by arachnode.net on Tue, Aug 11 2009
  • Re: Crawl pages created or modified 30 days ago

    So, for our crawling setup, we need to turn on 'ExtractWebPageMetaData' and 'InsertWebPageMetaData' in the database table 'cfg.Configuration'. This will strip out all tags from our HTML and insert the text into the database table 'WebPages_MetaData'. Since we're going...
    Posted to General Questions (Forum) by arachnode.net on Mon, Aug 10 2009
  • Re: Plugin help

    arachnode.net already contains support for the HtmlAgilityPack - however, the HtmlAgilityPack is a HUGE memory hog and has an extremely negative impact on crawling rate. If you can avoid it, don't use it. If you have to use it, change the configuration setting for 'ExtractWebPageMetaData'...
    Posted to General Questions (Forum) by arachnode.net on Fri, Aug 7 2009
  • Re: Partial crawling

    Yes. You will need to apply a specific XPath to specific hosts and directory paths. 1.) I'm not seeing an explicit OBJECT tag. Do you mean one that is rendered by JavaScript? 2.) /html/body/div[@id='baseDiv']/div[@id='watch-vid-title']/h1 Check out 'xpather' for Firefox: https...
    Posted to General Questions (Forum) by arachnode.net on Thu, May 7 2009
  • Re: Partial crawling

    Kevin is right on #1 - check out this file: http://arachnodenet.svn.sourceforge.net/viewvc/arachnodenet/trunk/SiteCrawler/Managers/WebPageManager.cs?revision=167&view=markup On line 80, if you have 'ExtractWebPageMetaData' set to true in the Configuration table in the database, then AN will...
    Posted to General Questions (Forum) by arachnode.net on Mon, May 4 2009
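
The XPath quoted in the "Partial crawling" reply above (/html/body/div[@id='baseDiv']/div[@id='watch-vid-title']/h1) can be tried out on its own before wiring it into a crawler. The following is only a rough sketch, not arachnode.net or HtmlAgilityPack code: it uses Python's standard-library ElementTree in place of HtmlAgilityPack, and the sample page and the extract_title helper are made up for illustration. ElementTree needs well-formed markup and resolves paths relative to the root element, so the leading /html step is dropped here.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (an assumption for illustration).
# Real-world HTML is often malformed and needs a more tolerant parser.
SAMPLE_PAGE = """<html><body>
<div id="baseDiv">
  <div id="watch-vid-title"><h1>Example title</h1></div>
  <div id="other"><h1>Ignored</h1></div>
</div>
</body></html>"""


def extract_title(markup):
    """Return the text of the h1 selected by the path, or None if absent."""
    root = ET.fromstring(markup)  # root is the <html> element
    # Same steps as the XPath in the post, minus the leading /html,
    # because ElementTree searches relative to the root element.
    node = root.find("./body/div[@id='baseDiv']/div[@id='watch-vid-title']/h1")
    return node.text if node is not None else None


print(extract_title(SAMPLE_PAGE))
```

Only the h1 under the div with id 'watch-vid-title' is selected; the h1 under the other div is skipped because the attribute predicate does not match.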

copyright 2004-2017, arachnode.net LLC