arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE
An Open Source C# web crawler with Lucene.NET search using MongoDB/RavenDB/Hadoop

Completely Open Source @ GitHub


Search

  • Re: Lost Crawl Requests?

    JCrawl - Yep, made sure that was turned off. =D I'm leaning more towards thinking they were just duplicates of stuff that was already discovered and not necessarily "valid" new urls to crawl.
    Posted to Bug Reports (Forum) by egecko on Wed, Jun 10 2015
  • Re: Lost Crawl Requests?

    I didn't keep track of how small of a memory footprint it could keep, but setting the desired max memory to 1 MB didn't have any noticeable or significant negative effect in terms of its own operation. It did basically offload the discovery verification to the database as you mentioned. Part of the reason for doing this was to essentially get
    Posted to Bug Reports (Forum) by egecko on Wed, Jun 10 2015
  • Re: Lost Crawl Requests?

    Why 1MB? There is solid benefit to Discovery caching. :) It's not so much that we do not want the benefit of discovery caching but rather a preference to persist the crawl requests to the database instead of keeping them in RAM until a graceful shutdown. I haven't had a chance to tinker with the code that handles this yet but it's on the
    Posted to Bug Reports (Forum) by egecko on Tue, Jun 9 2015
  • Re: Lost Crawl Requests?

    By the way, the short period of time AN was running was less than 2 minutes so it's not really plausible that it ripped through that many requests while it was running.
    Posted to Bug Reports (Forum) by egecko on Tue, Jun 9 2015
  • Lost Crawl Requests?

    So, here's the scenario.. We had AN crawling overnight and when I returned in the morning there were 9M crawl requests queued in the table as expected. We have AN's memory bounds tightened down to a desired maximum memory set to 1 megabyte so this was not an unreasonable amount of requests to have queued in the table. I stopped AN and ran the
    Posted to Bug Reports (Forum) by egecko on Tue, Jun 9 2015
  • Re: Detecting Site Completion

    Just what I was hoping for! Thanks for all your help! :)
    Posted to General Questions (Forum) by egecko on Mon, Jun 8 2015
  • Re: Infrastructure, AN, & Big Data

    Excellent, exactly what I was hoping for! Thanks again! eGecko
    Posted to General Questions (Forum) by egecko on Thu, Jun 4 2015
  • Infrastructure, AN, & Big Data

    We were discussing my findings and work with AN and our experiences so far in terms of data being accumulated and our expectations of the volume we are anticipating needing to crawl. I made reference to your post about having a billion pages crawled and was asked how your back-end was structured to support gathering the billion pages? In particular
    Posted to General Questions (Forum) by egecko on Thu, Jun 4 2015
  • Re: Engine & Crawl Actions

    I'm not sure you can do an exact match unless you know exactly how many lambdas appear in the actual code being instantiated. :( There were some discrepancies in some of the flags on the type being checked that could be helpful. In particular, as I was tracing through I noted the following differences that could be leveraged to make sure the proper
    Posted to Bug Reports (Forum) by egecko on Thu, Jun 4 2015
  • Re: Engine & Crawl Actions

    So I found out where the "DisplayClasses" come from and why this actually happened. Display classes are compiler generated classes that are used to hold lambda expressions. In an earlier thread I made reference to spinning off a new Task inside our plug-in to perform its work in the background while AN went off and did more important stuff
    Posted to Bug Reports (Forum) by egecko on Thu, Jun 4 2015
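
The "display class" behavior described in the last two posts can be reproduced in a few lines. This is a minimal sketch, not code from AN or the plug-in in question: when a lambda passed to a new Task captures a local variable, the C# compiler emits a private nested type (named like `<>c__DisplayClass0_0`; exact names vary by compiler version) to hold the captured state, and reflection can reveal it:

```csharp
using System;
using System.Linq;
using System.Reflection;
using System.Threading.Tasks;

class DisplayClassDemo
{
    static void Main()
    {
        // Capturing a local variable in a lambda forces the compiler to
        // generate a "display class" to carry the captured state.
        int captured = 42;
        Task t = Task.Run(() => Console.WriteLine(captured));
        t.Wait();

        // The generated type shows up as a private nested type of the
        // class containing the lambda.
        var generated = typeof(DisplayClassDemo)
            .GetNestedTypes(BindingFlags.NonPublic)
            .Where(nt => nt.Name.Contains("DisplayClass"));

        foreach (var nt in generated)
            Console.WriteLine(nt.Name);
    }
}
```

This is why an exact type-name match is fragile, as noted above: the generated names depend on how many lambdas the compiler encounters, so code that inspects running tasks by type should match on stable flags or a name fragment rather than a full generated name.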

copyright 2004-2017, arachnode.net LLC