Hey Mike,
I'm going to start by describing my issue rather than the (sort of) solutions I've been noodling on, because I'd rather have your unvarnished opinion.
Stemming from the last issue I had, where I actually crawl RSS feeds (and other web data) using local services I've built as an interface, I've managed to create a problem in the way AN usually crawls. For instance, in one service a feed entry is parsed, cleaned, categorized, and manipulated in a variety of ways, and an xml doc is generated and returned containing both the metadata created by the service and the (modified) HTML source. That source is stored encoded within one of the xml elements ("source"). AN has no issue treating this as a file, and crawling it as such. So far, so good.
The issue is that I would also like to crawl the contents of "source", e.g. all img tags, etc. Unfortunately AN isn't seeing this as crawlable because of the encoding, e.g. "&lt;img/&gt;". I'd prefer not to break standards by storing those files with wrongly unencoded data. One possible fix (though maybe a kludgy one) might be to crawl the "native" feed (or whatever) in addition to crawling it via my interface, then create a relationship between them via AbsoluteURI. That means more, and redundant, crawling for AN, so it's not the most elegant solution. Another way might be to use a plugin to grab those crawls, decode the HTML, and use HAP to grab the img/src attributes to build new Discovery/CR objects and add them to the crawl. Of course, that breaks (without more work) when trying to add to that collection.
I'd be grateful for your thoughts on this.
- offbored
p.s. I have another, unrelated issue that I put up separately so that it can be searched and found on its own, but this is the important one for me right now.
Sounds like a plugin will work just fine for this.
In your plugin you can get back to the Crawler instance using:
crawlRequest.Crawl.Crawler.Crawl();
...this will place it into the Crawler, just like you do from Program.cs.
To be completely correct in sourcing the CrawlRequests from your RSS feeds, change this overload from internal to public so you can supply the correct 'Parent':
internal CrawlRequest(CrawlRequest parent, Discovery discovery, int currentDepth, int maximumDepth, byte restrictCrawlTo, byte restrictDiscoveriesTo, double priority)
Alternatively, modify this (in CrawlRequestManager.cs):
private static void ProcessFile(CrawlRequest crawlRequest, FileManager fileManager, ArachnodeDAO arachnodeDAO)
and decode your HTML, or 'switch the DecodedHTML', before calling:
private static void ProcessHyperLinks(CrawlRequest crawlRequest, ArachnodeDAO arachnodeDAO)
and then put your DecodedHTML back to what it should be.
One mod makes use of a plugin and an access modifier change, and the other changes the core. (obviously... :)) Let me know which one you decide to do.
Will this work for you?
Additionally, the Discoveries in the DB will look as though you had posted a raw file request from the Crawler: there will be an entry in, say, Images, but no entry in Images_Discoveries sourcing it back to a row in WebPages. Let me know if this is a problem for you.
Thanks, I could see either of these working well. On one hand, I like the notion of putting something into the core that could catch and correct some of the wackier stuff I'm doing from the outside "magically", like a more useful, code-based Harry Potter. On the other hand, I just don't have a lot of confidence that I'd be able to create something specific enough to be useful without constant change, yet generic enough to handle even some-to-most one-offs without compromising the way AN works right now (mostly flawlessly, btw). Add reconciling the changes you make to the core and it could become a maintenance issue.
So I went with the plugin idea. It has its downsides, but I've tried to work within the plugin architecture thus far, and I'll try to stick to that script unless there's a more compelling reason not to.
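For anyone following along, the decode-and-extract step at the heart of such a plugin could look something like the sketch below. It is only an illustration, not the actual plugin: EncodedSourceImages and ExtractImageUris are made-up names, it assumes the encoded HTML sits in a "source" element directly under the root with no XML namespace, and it uses a plain regex instead of HAP so that it stays self-contained. Turning the resulting URIs into Discovery/CrawlRequest objects is left out, since that depends on AN's own API.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;
using System.Xml.Linq;

// Hypothetical helper for illustration only; not part of arachnode.net.
public static class EncodedSourceImages
{
    // Returns the absolute URIs of all <img src="..."> found in the encoded
    // HTML stored in the <source> element of the service's XML document.
    public static List<string> ExtractImageUris(string serviceXml, Uri baseUri)
    {
        // Assumption: <source> is a direct, un-namespaced child of the root.
        XElement source = XDocument.Parse(serviceXml).Root.Element("source");

        // XElement.Value returns the text with XML entities already decoded,
        // so "&lt;img src=...&gt;" comes back as "<img src=...>".
        string html = source == null ? string.Empty : source.Value;

        var uris = new List<string>();

        // Simplistic match on quoted src attributes; a real HTML parser
        // (e.g. HAP) copes far better with malformed markup.
        const string pattern = "<img\\b[^>]*?\\bsrc\\s*=\\s*[\"']([^\"']+)[\"']";
        foreach (Match match in Regex.Matches(html, pattern, RegexOptions.IgnoreCase))
        {
            // Attribute values may still carry HTML entities such as &amp;.
            string src = WebUtility.HtmlDecode(match.Groups[1].Value);

            // Resolve relative paths against the page or feed URI.
            Uri absolute;
            if (Uri.TryCreate(baseUri, src, out absolute))
            {
                uris.Add(absolute.AbsoluteUri);
            }
        }
        return uris;
    }
}
```

Called with the XML returned by the service and the feed entry's URI as baseUri, this yields a list of strings that a plugin could then wrap in whatever objects the crawl expects.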
BTW, I actually have been using the %_discoveries and other tables (via views) in my other DBs and to hydrate some objects I use in code, and while this worked to get the images and what-not, it broke some of that. Here's a (really kludged) query I threw together to try and find my way back to the data. I write some ugly stuff, but this is fugly even for me, so if you have a better way I'm all ears:
select
img.InitiallyDiscovered
, img.LastDiscovered
, ID = images.ID
, WebPageId = wp.ID
, AbsoluteUri = images.AbsoluteUri
, ResponseHeaders = images.ResponseHeaders
, Source = images.Source
, FullTextIndexType = images.FullTextIndexType
, EXIFData = im.EXIFData
, Flags = im.Flags
, Height = im.Height
, Width = im.Width
, HorizontalResolution = im.HorizontalResolution
, VerticalResolution = im.VerticalResolution
, FilePath = [arachnode.net].[dbo].[ExtractDirectory](
(
SELECT CONVERT(nvarchar(4000), Value) FROM [arachnode.net].cfg.Configuration
WHERE [Key] = 'DownloadedImagesDirectory'
)
, images.AbsoluteUri)
, NumberOfTimesDiscovered = COALESCE(img.NumberOfTimesDiscovered, d.NumberOfTimesDiscovered)
, IsTrackingPixel = CASE WHEN im.Height = 1 AND im.Width = 1 THEN dbo.IsTrackingPixel(images.AbsoluteUri, images.Source) ELSE 0 END
--select *
--select img.*
FROM
[arachnode.net].dbo.Images AS images WITH (nolock) LEFT JOIN
[arachnode.net].dbo.Images_MetaData AS im WITH (nolock) ON im.ImageID = images.ID LEFT JOIN
(
select id.ImageID, WebPageID = MAX(id.WebPageID), NumberOfTimesDiscovered = COUNT(*),
InitiallyDiscovered = MIN(id.InitiallyDiscovered), LastDiscovered = MAX(id.LastDiscovered)
from [arachnode.net].dbo.Images_Discoveries AS id WITH (nolock) INNER JOIN
[arachnode.net].dbo.Images AS images WITH (nolock) ON images.ID = id.ImageID
group by id.ImageID
) img ON img.ImageID = images.ID LEFT JOIN
[arachnode.net].dbo.WebPages AS wp WITH (nolock) on wp.AbsoluteUri = img.WebPageID LEFT JOIN
[arachnode.net].dbo.Discoveries AS d WITH (nolock) on d.AbsoluteUri = images.AbsoluteUri
Looks OK, more or less.
Check out the reporting SP's - particularly the 'Popularity' ones - lots of good SQL in there.
Mike