arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE or MongoDB/RavenDB/Hadoop

Completely Open Source @ GitHub


Search

  • Re: Crawl Requests Repeating

    I'm just thinking to myself here, but if a discovery by hyperlink was processed fine, and it then hit a discovery by exception with an identical link, and then the same link by exception again, shouldn't that be saved somewhere so it can simply be marked as processed? (A minimal sketch of that idea appears after this list.) I must have messed up some settings. The number of remaining items to process is different
    Posted to General Questions (Forum) by polfilm on Thu, Jul 22 2010
  • Re: Crawl Requests Repeating

    Trying to find a way to prevent discovery by exception.
    Posted to General Questions (Forum) by polfilm on Thu, Jul 22 2010
  • Re: Crawl Requests Repeating

    CrawlRequests - empty; WebPages - 9; Exceptions - 16; Exception Discoveries - 16 (all have Host Discovery = 18). HostDiscovery 18 has a Type of 3, which is Discovery By Exception.
    Posted to General Questions (Forum) by polfilm on Thu, Jul 22 2010
  • Crawl Requests Repeating

    I'm in a loop again, or so I think. Engine: PopulateCrawlRequests: 22 CrawlRequests Returned. Engine: AssignCrawlRequestsToCrawls: Total.UncrawledCrawlRequests: 22. So they process, and the cycle repeats, seemingly forever. It's probably doing what I told it to, but can you shed any light on what might be happening? I am doing a test on my site; there are about 11
    Posted to General Questions (Forum) by polfilm on Thu, Jul 22 2010
  • Frequency - Politeness

    It took me a while to figure out where you keep MaximumNumberOfWebRequestsPerHostPerDay=500 and ThreadSleepTimeInMillisecondsBetweenWebRequests=1000. They are simply parameters to the Rules.Frequency rule stored in cfg.CrawlRules, so now I know it all works. (A sketch of such a throttle appears after this list.) In Program.cs you have the loop that turns off all the database rules //CrawlActions, CrawlRules and
    Posted to General Questions (Forum) by polfilm on Thu, Jul 22 2010
  • Storable.cs

    PROBLEM: I want to be able to specify about 10 keywords and/or terms (two words each). I still want to process the pages for discovery, but only save to the DB the ones that include one or more of the specified keywords. It would be nice to know how many have been found on a page, for importance sorting. (A sketch of such a filter appears after this list.) SOLUTION: Well, for now I can just eat everything and do queries
    Posted to General Questions (Forum) by polfilm on Thu, Jul 22 2010
  • Spider Trap

    I got caught in a spider trap: the web site keeps appending folders to a link, the links still resolve, and the process repeats, so the links get very, very long. (A guard against this is sketched after this list.) Now, you mentioned looking in Program.cs where the AbsoluteUri rule is disabled, but I still don't know how to fix that. Could you elaborate? Peter
    Posted to General Questions (Forum) by polfilm on Thu, Jul 22 2010
  • Re: v1.0 first comments

    The server is an old dual-core 3GHz with 3GB of RAM (2003 SP2), but it's running tons of my dev stuff like IIS, SharePoint (brrr), and SQL2K5, which takes most of the RAM when it gets going... and I was running your app from within Visual Studio... I think the actual console was eating about 150-200MB
    Posted to Bug Reports (Forum) by polfilm on Mon, Jan 5 2009
  • Re: v1.0 first comments

    Well, what can I say... INCREDIBLE. This is a nice piece of work. 30 threads are putting my server at 95% CPU. I'm running SQL on it as well (it makes a small humming noise :)), and it's taking over 1.6GB of RAM. The entire solution has been running from within Visual Studio for about 2 hours now. It's miles ahead of yesterday's all-day session with 0.9...
    Posted to Bug Reports (Forum) by polfilm on Mon, Jan 5 2009
  • Re: v1.0 first comments

    1. Download HtmlAgilityPack from the website above and compile it with VS2005.
    2. In VS, open the Arachnode.net/SiteCrawler project, go to References, and remove the missing HtmlAgilityPack library.
    3. Right-click on References, then Browse, and point to the newly compiled DLL.
    The project now compiles. Ready to do some tests... with the below missing. PS: there is still one library
    Posted to Bug Reports (Forum) by polfilm on Mon, Jan 5 2009
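
The "Crawl Requests Repeating" thread above comes down to remembering which AbsoluteUris have already been processed, whether they arrived via hyperlink discovery or resurfaced via an exception, and skipping the repeats. A minimal sketch of that idea, assuming a single-threaded crawl loop; DiscoveryTracker and TryMarkProcessed are hypothetical names, not arachnode.net API:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch: remember every AbsoluteUri that has been processed,
// regardless of whether it was discovered by hyperlink or by exception.
// This is NOT arachnode.net's implementation.
public class DiscoveryTracker
{
    // Case-insensitive, since URI schemes and hosts compare case-insensitively.
    private readonly HashSet<string> _processed =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase);

    // Returns true the first time an AbsoluteUri is seen; false on repeats.
    public bool TryMarkProcessed(string absoluteUri)
    {
        return _processed.Add(absoluteUri);
    }
}
```

Each CrawlRequest would call TryMarkProcessed before being queued, so a link rediscovered by exception is dropped instead of cycling forever.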
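The "Frequency - Politeness" post names two settings, MaximumNumberOfWebRequestsPerHostPerDay=500 and ThreadSleepTimeInMillisecondsBetweenWebRequests=1000, as parameters to the Rules.Frequency rule in cfg.CrawlRules. A sketch of how such a throttle could behave; the setting names come from the post, but the PolitenessThrottle class is a hypothetical stand-in, not arachnode.net's Rules.Frequency:

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical politeness throttle: per-host daily budget plus a fixed
// sleep between requests. Single-threaded for clarity; a crawler running
// 30 threads would need locking around the counters.
public class PolitenessThrottle
{
    private const int MaximumNumberOfWebRequestsPerHostPerDay = 500;
    private const int ThreadSleepTimeInMillisecondsBetweenWebRequests = 1000;

    private readonly Dictionary<string, int> _requestsTodayByHost =
        new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
    private DateTime _day = DateTime.UtcNow.Date;

    // Sleeps for the configured delay and returns false once a host has
    // used up its daily budget.
    public bool TryAcquire(string host)
    {
        if (DateTime.UtcNow.Date != _day)   // new day: reset the counters
        {
            _day = DateTime.UtcNow.Date;
            _requestsTodayByHost.Clear();
        }

        _requestsTodayByHost.TryGetValue(host, out int count);
        if (count >= MaximumNumberOfWebRequestsPerHostPerDay)
        {
            return false;                   // budget exhausted for today
        }

        _requestsTodayByHost[host] = count + 1;
        Thread.Sleep(ThreadSleepTimeInMillisecondsBetweenWebRequests);
        return true;
    }
}
```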
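The "Storable.cs" post asks to crawl every page for discovery but persist only pages matching one of roughly 10 keywords, keeping a match count for importance sorting. A sketch of such a filter, assuming the page text has already been extracted; the keyword list and class names are illustrative, not arachnode.net API:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical keyword filter: count occurrences of each configured term
// (single words or two-word phrases) and persist only pages with hits.
public static class KeywordFilter
{
    // Example terms, including a two-word phrase; replace with your own ~10.
    private static readonly string[] Keywords =
        { "crawler", "lucene", "search engine" };

    // Total number of case-insensitive, whole-phrase matches on the page.
    public static int CountMatches(string pageText)
    {
        int total = 0;
        foreach (string keyword in Keywords)
        {
            total += Regex.Matches(pageText, Regex.Escape(keyword),
                                   RegexOptions.IgnoreCase).Count;
        }
        return total;
    }

    // True if the page should be saved; matchCount doubles as an
    // importance score for sorting.
    public static bool ShouldStore(string pageText, out int matchCount)
    {
        matchCount = CountMatches(pageText);
        return matchCount > 0;
    }
}
```

The crawl would still follow every page's links for discovery; only the database write is gated on ShouldStore.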
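Finally, the "Spider Trap" post describes URLs that grow without bound because the site keeps appending path segments. A generic guard against that pattern; this is not the AbsoluteUri rule mentioned for Program.cs, and the limits are assumptions to tune:

```csharp
using System;
using System.Linq;

// Hypothetical spider-trap heuristic: reject URIs that are too long,
// too deep, or that repeat the same path segment many times.
public static class SpiderTrapGuard
{
    private const int MaxUriLength = 512;          // assumed limit
    private const int MaxPathDepth = 16;           // assumed limit
    private const int MaxRepeatsOfOneSegment = 3;  // assumed limit

    public static bool LooksLikeTrap(Uri uri)
    {
        if (uri.AbsoluteUri.Length > MaxUriLength)
            return true;

        string[] segments = uri.AbsolutePath
            .Split(new[] { '/' }, StringSplitOptions.RemoveEmptyEntries);

        if (segments.Length > MaxPathDepth)
            return true;

        // The same segment repeating (e.g. /a/b/a/b/a/b/...) is a strong signal.
        return segments.GroupBy(s => s, StringComparer.OrdinalIgnoreCase)
                       .Any(g => g.Count() > MaxRepeatsOfOneSegment);
    }
}
```

Dropping any discovered link for which LooksLikeTrap returns true keeps the crawl from chasing the ever-lengthening URLs described in the post.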
Page 1 of 2 (14 items)

copyright 2004-2017, arachnode.net LLC