arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL 2005/2008/CE

adding new crawls/discoveries in-process

Answered (Verified) This post has 1 verified answer | 4 Replies | 2 Followers

Top 25 Contributor
19 Posts
offbored posted on Thu, Apr 8 2010 10:38 AM

Hey Mike,

I'm going to start by describing my issue, rather than the (sort of) solutions I've noodled because I'd rather have your unvarnished opinion.

Stemming from the last issue I had, where I actually crawl RSS feeds (and other web data) using local services I've built as an interface, I've managed to create a problem in the way AN usually crawls. For instance, in one service a feed entry is parsed, cleaned, categorized, and manipulated in a variety of ways, and an XML doc is generated and returned containing both the metadata created by the service and the (modified) HTML source. That source is stored encoded within one of the XML elements ("source"). AN has no issue treating this as a file, and crawling it as such. So far, so good.

The issue is that I would also like to crawl the contents of "source", e.g. all img tags, etc. Unfortunately AN isn't seeing this as crawlable because of the encoding, e.g. "&lt;img/&gt;". I'd prefer not to have to break standards in order to store those files with wrongly unencoded data. One possible fix (though maybe a kludgy one) might be to crawl the "native" feed (or whatever) in addition to crawling it via my interface, then create a relationship between them via AbsoluteURI. That means more, and redundant, crawling for AN, so it's not the most elegant solution. Another way might be to use a plugin to grab those crawls, decode the HTML, and use HAP to grab the img/src attributes to build new Discovery/CR objects and add them to the crawl. Of course, that breaks (without more work) when trying to add to that collection.

I'd be grateful for your thoughts on this.

- offbored

p.s. I have another unrelated issue I put up separately so that it can be searched and found on its own, but this is the important one for me right now.
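
The decode-then-extract step described in the question can be sketched roughly as follows. This is only an illustration under stated assumptions: the class and method names (EncodedSourceImages, Extract) are hypothetical and not part of arachnode.net, a regular expression stands in for HAP to keep the example dependency-free, and turning the resulting URIs into Discovery/CrawlRequest objects is left out because it depends on the AN version in use.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of arachnode.net.
public static class EncodedSourceImages
{
    // Captures the src value of each <img> tag. A regex is used here only to
    // avoid a dependency; an HTML parser such as HAP is the sturdier choice.
    private static readonly Regex ImgSrc = new Regex(
        "<img\\b[^>]*?\\bsrc\\s*=\\s*[\"']([^\"']+)[\"']",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // encodedSource: the entity-encoded text of the "source" element.
    // baseUri: the URI of the feed entry, used to resolve relative src values.
    public static List<Uri> Extract(string encodedSource, Uri baseUri)
    {
        // "&lt;img src=...&gt;" becomes "<img src=...>".
        string html = WebUtility.HtmlDecode(encodedSource);

        var seen = new HashSet<string>(StringComparer.Ordinal);
        var results = new List<Uri>();

        foreach (Match match in ImgSrc.Matches(html))
        {
            Uri absolute;
            // Skip values that cannot be resolved, and skip duplicates.
            if (Uri.TryCreate(baseUri, match.Groups[1].Value, out absolute)
                && seen.Add(absolute.AbsoluteUri))
            {
                results.Add(absolute);
            }
        }

        return results;
    }
}
```

A plugin could call Extract with the contents of the "source" element and the AbsoluteUri of the crawled file, then hand each returned URI to whatever mechanism it uses to queue new crawls.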

Answered (Verified) Verified Answer

Top 10 Contributor
1,694 Posts
Verified by offbored

Sounds like a plugin will work just fine for this.

In your plugin you can get back to the Crawler instance using:

crawlRequest.Crawl.Crawler.Crawl();

...this will place it into the Crawler, just like you do from Program.cs.

To be completely correct in sourcing the CrawlRequests from your RSS feeds, set the overload:

internal CrawlRequest(CrawlRequest parent, Discovery discovery, int currentDepth, int maximumDepth, byte restrictCrawlTo, byte restrictDiscoveriesTo, double priority)

to public so you can supply the correct 'Parent'.

Also, you could do this as well:

modify this (in CrawlRequestManager.cs):

private static void ProcessFile(CrawlRequest crawlRequest, FileManager fileManager, ArachnodeDAO arachnodeDAO)

and decode your HTML, or 'switch the DecodedHTML', before calling:

private static void ProcessHyperLinks(CrawlRequest crawlRequest, ArachnodeDAO arachnodeDAO)

and then put your DecodedHTML back to what it should be.

One mod makes use of a plugin and an access modifier change, and the other changes the core.  (obviously... :))

Let me know which one you decide to do.

Will this work for you?

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet
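
The second option above (switch the DecodedHTML, process the hyperlinks, then put it back) is a swap-and-restore pattern. A minimal sketch of that pattern follows, assuming a stand-in type: FakeCrawlRequest, its DecodedHtml property, and RunWithDecodedSource are all hypothetical names for illustration, and the real CrawlRequest members and the private ProcessHyperLinks method in CrawlRequestManager.cs may look different.

```csharp
using System;
using System.Net;

// Hypothetical stand-in for the part of a crawl request this sketch touches.
public sealed class FakeCrawlRequest
{
    public string DecodedHtml { get; set; }
}

public static class DecodedHtmlSwap
{
    // Temporarily replaces the request's HTML with the decoded "source"
    // content, runs the link-processing step, then restores the original.
    public static void RunWithDecodedSource(
        FakeCrawlRequest request,
        string encodedSource,
        Action<FakeCrawlRequest> processHyperLinks)
    {
        string original = request.DecodedHtml;
        try
        {
            request.DecodedHtml = WebUtility.HtmlDecode(encodedSource);
            processHyperLinks(request);
        }
        finally
        {
            // Restore even if processing throws, so later steps see the
            // document exactly as it was crawled.
            request.DecodedHtml = original;
        }
    }
}
```

The try/finally is the design point: whatever runs after link processing still sees the original, encoded document, so the stored file is unchanged.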

All Replies

Top 10 Contributor
1,694 Posts

Additionally, the way the Discoveries are sourced in the DB will make it look as if you posted a raw file request from the Crawler: there will be an entry in, say, Images, but no entry in Images_Discoveries sourcing it back to a row in WebPages.  Let me know if this is a problem for you.

Top 25 Contributor
19 Posts

Thanks, I could see either of these working well. On one hand, I like the notion of putting something into the core that could catch and correct some of the wackier stuff I'm doing from the outside "magically", like a more useful, code-based Harry Potter. On the other hand, I just don't have a lot of confidence that I'd be able to create something specific enough to be useful without constant change, yet generic enough to handle even some-to-most one-offs without compromising the way AN works right now (mostly flawlessly, btw). Add reconciling the changes you make to the core and it could become a maintenance issue.

So I went with the plugin idea. It has its downsides, but I've tried to work within the plugin architecture thus far, and I'll try to stick to that script unless there's a more compelling reason not to.

BTW, I actually have been using the %_discoveries and other tables (via views) in my other DBs and to hydrate some objects I use in code, and while this worked to get the images and what-not, it broke some of that. Here's a (really kludged) query I threw together to try and find my way back to the data. I write some ugly stuff, but this is fugly even for me, so if you have a better way I'm all ears:

 

select
    img.InitiallyDiscovered
    , img.LastDiscovered
    , ID = images.ID
    , WebPageId = wp.ID
    , AbsoluteUri = images.AbsoluteUri
    , ResponseHeaders = images.ResponseHeaders
    , Source = images.Source
    , FullTextIndexType = images.FullTextIndexType
    , EXIFData = im.EXIFData
    , Flags = im.Flags
    , Height = im.Height
    , Width = im.Width
    , HorizontalResolution = im.HorizontalResolution
    , VerticalResolution = im.VerticalResolution
    , FilePath = [arachnode.net].[dbo].[ExtractDirectory](
        (
            SELECT CONVERT(nvarchar(4000), Value) FROM [arachnode.net].cfg.Configuration
            WHERE [Key] = 'DownloadedImagesDirectory'
        )
        , images.AbsoluteUri)
    , COALESCE(img.NumberOfTimesDiscovered, d.NumberOfTimesDiscovered)
    , IsTrackingPixel = CASE WHEN im.Height = 1 AND im.Width = 1 THEN dbo.IsTrackingPixel(images.AbsoluteUri, images.Source) ELSE 0 END
--select *
--select img.*
FROM
    [arachnode.net].dbo.Images AS images WITH (nolock) LEFT JOIN
    [arachnode.net].dbo.Images_MetaData AS im WITH (nolock) ON im.ImageID = images.ID LEFT JOIN
    (
        select id.ImageID, WebPageID = MAX(id.WebPageID), NumberOfTimesDiscovered = COUNT(*),
            InitiallyDiscovered = MIN(id.InitiallyDiscovered), LastDiscovered = MAX(id.LastDiscovered)
        from [arachnode.net].dbo.Images_Discoveries AS id WITH (nolock) INNER JOIN
            [arachnode.net].dbo.Images AS images WITH (nolock) ON images.ID = id.ImageID
        group by id.ImageID
    ) img ON img.ImageID = images.ID LEFT JOIN
    [arachnode.net].dbo.WebPages AS wp WITH (nolock) on wp.AbsoluteUri = img.WebPageID LEFT JOIN
    [arachnode.net].dbo.Discoveries AS d WITH (nolock) on d.AbsoluteUri = images.AbsoluteUri

 

Top 10 Contributor
1,694 Posts

Looks OK, more or less.

Check out the reporting SP's - particularly the 'Popularity' ones - lots of good SQL in there.

Mike


copyright 2004-2013, arachnode.net LLC