
passing state into AN


Top 25 Contributor
19 Posts
offbored posted on Wed, Feb 10 2010 6:57 AM

Good morning,

I need to pass some data into AN for processing by a plugin. I could obviously write that data into the DB and pick it up in the CrawlAction later, but I'm already retrieving that data before calling into AN as a matter of necessity and I'd rather not be more redundant than I have to be.

I've thought of extending and overloading the CrawlRequest constructor and just passing in my stateful object to be carried through to PerformAction, but I'd have to extend too many places in AN to carry it through. I could do so by modding CrawlRequest directly (and I will, barring a better answer), but I'm worried about breaking/maintaining future versions of AN. Maybe I could have my stateful object inherit from ACrawlAction, but then I have to figure out the best way to have AN call that specific instance (which seems to just beg the question of passing state).

From your perspective, what's the best way to do this? Am I missing something obvious?

 

Thanks,

- offbored

All Replies

Top 25 Contributor
19 Posts

Head-smacking moment. I just realized I answered my own question. I should really avoid posting stuff until I've had the entire first pot of coffee as a matter of policy. Schnikeys.

 

- offbored

Top 25 Contributor
19 Posts

Head-smacking moment, part deux. I was wrong about being wrong, or right--either way, I still need an answer.

I'm now sleepy AND concussed. I'd also be embarrassed if I hadn't worn out that part of my brain years ago.

- offbored

Top 10 Contributor
1,905 Posts


What sort of state do you wish to keep? Tell me more about the process flow.

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

Top 25 Contributor
19 Posts
offbored replied on Wed, Feb 10 2010 10:37 AM

 

I have keywords and XPaths that I want to use to filter by both content and structure. For the moment I just overloaded the CR constructor (directly in AN source), but I'm not sure that's optimal, for the reasons I stated.
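Roughly, the shape of what I did is this (names are illustrative only - the real CrawlRequest constructor takes AN-specific parameters that I'm not reproducing here, so the AN-side wiring is just sketched in comments):

```csharp
using System.Collections.Generic;

// The stateful object I want carried through to PerformAction.
public class FilterState
{
    public List<string> Keywords { get; set; }
    public List<string> XPaths { get; set; }
}

// Added to CrawlRequest directly in the AN source, more or less:
//
//   public FilterState FilterState { get; set; }
//
//   public CrawlRequest(/* existing parameters */, FilterState filterState)
//       : this(/* existing parameters */)
//   {
//       FilterState = filterState;
//   }
//
// ...so that a CrawlAction can read crawlRequest.FilterState in PerformAction.
```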

 

Top 10 Contributor
1,905 Posts

That's probably OK - even if you do have to merge, it will be a good learning exercise. :)

Sending mail on SVN access shortly.


Top 10 Contributor
1,905 Posts

One other thing to keep in mind is that when AN runs out of RAM, it will cache CRs to the database. When it does this, your state may be lost. Does this change your approach?


Top 25 Contributor
19 Posts

Hmmm...might have to. Thanks for the heads up--that would've been fun to try and track down when it happened. I guess I could account for it by overloading InsertCrawlRequest to accept/serialize/store that object, modding those calls in Cache.cs (and elsewhere?), inflating that object in PopulateCrawlCrawlRequests, and overloading the Engine's CR constructor. This solution doesn't immediately thrill me, but I'd like to know your opinion (and whether I've missed anything critical).
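For what it's worth, the serialize/inflate half of that plan could be as small as something like this - a minimal sketch assuming the state is just keywords and XPaths. The class and method names are mine, not AN's, and the hypothetical call sites are only noted in comments:

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml.Serialization;

public class CrawlState
{
    public List<string> Keywords { get; set; }
    public List<string> XPaths { get; set; }
}

public static class CrawlStateSerializer
{
    private static readonly XmlSerializer _serializer = new XmlSerializer(typeof(CrawlState));

    // Would be called from the hypothetical InsertCrawlRequest overload,
    // before the CR is spilled to the database.
    public static string Serialize(CrawlState state)
    {
        using (var writer = new StringWriter())
        {
            _serializer.Serialize(writer, state);
            return writer.ToString();
        }
    }

    // Would be called when the CR is read back out of the database,
    // to re-inflate the state object.
    public static CrawlState Deserialize(string xml)
    {
        using (var reader = new StringReader(xml))
        {
            return (CrawlState)_serializer.Deserialize(reader);
        }
    }
}
```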

Maybe the best thing I could do is "pre-throttle" (is that a word?) to keep an even in-memory flow. Wouldn't I get better throughput that way anyway, by avoiding the DB reads/writes, etc. (even assuming a separate server in the future)? I'm guessing I'd need a fairly generous margin in the tolerances I set, though. At any rate, I could play with those settings to find the sweet spot, and trap for and deal with the (hopefully outlier) errors when a spill happened.
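A naive sketch of what I mean by "pre-throttle" - only hand the engine new work while managed memory stays under a soft ceiling, so the in-memory cache (hopefully) never has to spill. The ceiling value and the submit callback are placeholders, not actual AN members:

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

public static class PreThrottle
{
    // Placeholder soft ceiling (~1 GB); would need tuning against real crawls.
    private const long SoftCeilingBytes = 1L * 1024 * 1024 * 1024;

    public static void Feed(IEnumerable<string> absoluteUris, Action<string> submit)
    {
        foreach (var uri in absoluteUris)
        {
            // Back off while we're over the ceiling; the engine drains its backlog meanwhile.
            while (GC.GetTotalMemory(false) > SoftCeilingBytes)
            {
                Thread.Sleep(500);
            }

            // e.g. wrap the URI in a CrawlRequest and hand it to the Crawler here.
            submit(uri);
        }
    }
}
```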

And maybe I'm going about all of this a little backwards. I could break up the processing of my config objects into pieces: create and store the CRs for later processing by AN (storing the ID with the config), then get that config again in a plugin (or maybe in CR construction) for the pieces I need to process. I'm guessing I shouldn't try to use the AbsoluteURI as a UID, so I probably have to expose the ID in CrawlRequestsRow.

I think I'll sleep on it and hope for inspiration. I'd be grateful for any input on which of these approaches (or an entirely different one) seems most promising to you. Guess I could always just buy more RAM, right? :)

- offbored

BTW, I know it's likely to be pretty individual, but can you give me a general idea of some metrics as far as the average cost in RAM per crawl?

Top 10 Contributor
1,905 Posts

It's late for me now, so I will 'warm up' with the RAM question. RAM consumption is all over the board. You could have 1,000 pages that only link to one other page, in a linked list... or each WebPage could have 1,000 links - so there's a lot of variety in what you will consume. Overall, it doesn't take too long for AN to consume a gig of RAM and start hitting the disk. The good news is that with millions and millions of discoveries, SQL serves them up quite handily. When watching the console you will likely never detect that AN is using RAM and disk for caching.

The thing with crawling, is that every crawl that I have set up or helped to set up is a unique case.

Having at least 4GB of RAM and something like this is necessary if you are serious about crawling: http://www.newegg.com/Product/Product.aspx?Item=N82E16820233087&cm_re=corsair_ssd-_-20-233-087-_-Product

I run 8 drives for my DB array on my test machine, and have more SCSI, SATA, and SAS drives than I can count - so, disk is important! (Obviously - and I'm rambling...) :)

Yes, avoiding the disk is desirable. I racked my brain about two years ago on how to avoid going to disk, and resigned myself to the notion that I just couldn't afford the RAM I would need to NOT go to disk. Funny, though: when I implemented the disk caching, I barely noticed a difference. The BUFFERMANAGER cache hit ratio can be quite high - even as high as 99%. So, it's very efficient.

But, going back to your question... implementing the overloads and the saving isn't exactly a fun task - or perhaps that's just me, since I have to obsess over getting it right and not messing up the change scripts.

So, to help me understand it better - you pass filtering information from an outside process into the CRs... but you don't want to have a CrawlRule/CrawlAction pull this information? What about having the AN projects reference your solution, so you can call the filtering information directly from a CrawlRule/CrawlAction? That way, you wouldn't have to make any changes to the CRs.
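Something like this is what I have in mind - the lookup interface lives in your solution, and AN just calls it. (The plugin wiring is only approximated in comments, since I'm not reproducing the actual ACrawlAction signature here.)

```csharp
using System.Collections.Generic;

public class FilterConfig
{
    public List<string> Keywords { get; set; }
    public List<string> XPaths { get; set; }
}

// Implemented in your project; the AN projects reference it and ask for
// filtering information on demand instead of carrying it on the CR.
public interface IFilterProvider
{
    // Returns null when there is nothing to filter for this AbsoluteUri.
    FilterConfig GetFilterFor(string absoluteUri);
}

// Plugin side, approximately:
//
// public class FilteringCrawlAction : ACrawlAction
// {
//     private readonly IFilterProvider _filters = new YourFilterProvider();
//
//     public override void PerformAction(CrawlRequest crawlRequest)
//     {
//         var config = _filters.GetFilterFor(/* the request's AbsoluteUri */);
//         if (config == null) return;
//         // ...apply config.Keywords / config.XPaths to the crawled content...
//     }
// }
```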

Would posting some code help me better understand what you are trying to do?

I agree with you that avoiding serialization (and de-) is a good notion.

Mike


Top 25 Contributor
19 Posts

Thanks for the input. Your observation about having AN pull that info from a CrawlRule/CrawlAction lines up with my notion that I went down the wrong road with this. I didn't have a taboo or anything about doing that, but I was trying to avoid doubling up on the DB load for crawls if I could, and to keep dependencies loosely coupled. On the other hand, if I split things up and let AN call/create the info on its side, then even though I double up the load in places, I should gain some benefits in exchange. First, I avoid the caching/lost-state issue altogether, I think. It should also be easier to spread that (now slightly increased) load across time, and easier to physically abstract the pieces onto different servers. I like the inherent optimization possibilities of that at scale.

So here's what the new flow looks like:

- queue up crawls
  - create info objects
  - process into CrawlRequests
- process crawls
  - create info objects in CrawlActions/Rules

Later, if I needed to, I could conceivably even queue CRs up out of plugins for processing elsewhere/when, and limit the AN machine to crawl duty only. Duty. :P Sorry, it's early, and I'm pretty juvenile in developer mode. Kidding aside, I'm not sure I see any downsides to doing it this way, other than that it's maybe too simple and I won't have as much opportunity for these fun little side adventures/mistakes. Two ways to do it, and I picked the wrong one. Quelle surprise...
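In sketch form, the split ends up something like this - an in-memory dictionary stands in for whatever table ends up holding the config, and the AN-specific calls (creating the CrawlRequest, the CrawlAction hook) are only noted in comments:

```csharp
using System.Collections.Generic;

public class CrawlConfig
{
    public List<string> Keywords { get; set; }
    public List<string> XPaths { get; set; }
}

public class CrawlConfigStore
{
    // Stand-in for the config table, keyed by AbsoluteUri here
    // (or by an exposed CrawlRequest ID, per the earlier note).
    private readonly Dictionary<string, CrawlConfig> _store =
        new Dictionary<string, CrawlConfig>();

    // Phase 1: queue up crawls -- save the config, then create the CrawlRequest.
    public void Queue(string absoluteUri, CrawlConfig config)
    {
        _store[absoluteUri] = config;
        // ...create/store the CrawlRequest for absoluteUri and hand it to AN...
    }

    // Phase 2: process crawls -- a CrawlAction/Rule looks its config back up.
    public CrawlConfig GetConfig(string absoluteUri)
    {
        CrawlConfig config;
        return _store.TryGetValue(absoluteUri, out config) ? config : null;
    }
}
```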

- offbored
