arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE, MongoDB, RavenDB, or Hadoop

How to crawl URIs with query parameters?

posted on Thu, Dec 17 2009 6:16 AM

Hi,

I know that crawl "filters" can be applied via the UriClassificationType setting (Host, FileExtension, Domain, etc.).

I have two use cases:

1) An absolute URI (a base URI such as www.domain.com) whose pages contain links with query parameters (www.domain.com/index.asp?param1=value, etc.).

 In this case I need to crawl all the links on the host, but only the pages (asp, aspx, html, etc.), excluding images and other files. At the moment I'm using only UriClassificationType.Host, but it crawls all files (images, PDFs, etc.).

2) A Google results page, like http://www.google.it/search?q=myquery, where the initial AbsoluteUris to crawl are the Google results (so I cannot use UriClassificationType.Host, because the host would be google.it rather than the result sites).

Could you give me any suggestions about this "configuration"?

Thanks
Verified Answer

1.) Check out cfg.AllowedContentTypes.  Or, look at cfg.Configuration - you can turn off images and files via AssignFileAndImageDiscoveries.  If you need more granular control over exactly which content types are crawled, cfg.AllowedContentTypes is where to do it.
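For illustration, the effect of limiting the crawl to page types looks roughly like this - a plain C# sketch with placeholder names (PageExtensionFilter is not an arachnode.net class), not the actual plugin API.  Note that Uri.AbsolutePath drops the query string, so the query parameters from your first case don't interfere with the extension check:

using System;
using System.Collections.Generic;
using System.IO;

public static class PageExtensionFilter
{
    // Extensions treated as crawlable pages for use case 1; adjust to taste.
    // The empty entry lets extensionless URIs (e.g. "/") through as well.
    private static readonly HashSet<string> _allowedExtensions =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "", ".asp", ".aspx", ".htm", ".html"
        };

    public static bool ShouldCrawl(Uri absoluteUri)
    {
        // AbsolutePath excludes the query string, so
        // "/index.asp?param1=value" still yields ".asp".
        string extension = Path.GetExtension(absoluteUri.AbsolutePath);

        return _allowedExtensions.Contains(extension);
    }
}

ShouldCrawl returns true for http://www.domain.com/index.asp?param1=value and false for a .jpg or .pdf link.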

2.) Run the crawl in two batches/phases.  The first pass crawls your Google AbsoluteUri at a depth of 1 to collect the result links.  Then, stop the crawler and restart with the links that you want to crawl.  I would keep track of them in a plugin, and have a variable that can be set from, say, Program.cs to signal whether or not to add links to your Dictionary.  At the end of the first crawl phase, read all of your Google links and create CrawlRequests from them.  Then, clear your Dictionary to save RAM and set your variable to indicate that you don't want to store HyperLinks that come through the system.  Take a look at AbsoluteUri.cs to see where Discoveries (not CrawlRequests) are processed.
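The bookkeeping for the two phases can be as simple as the following - again just a sketch, LinkCollector is a placeholder and not part of arachnode.net; in practice this logic lives in your plugin where the Discoveries come through:

using System;
using System.Collections.Generic;

public class LinkCollector
{
    // Set from, say, Program.cs: true while collecting the Google result
    // links (phase 1), false once the second crawl is running (phase 2).
    public bool StoreHyperLinks = true;

    // HyperLinks discovered on the Google results page during the depth-1 crawl.
    private readonly Dictionary<string, Uri> _collectedLinks =
        new Dictionary<string, Uri>(StringComparer.OrdinalIgnoreCase);

    // Call this wherever your plugin sees a discovered HyperLink.
    public void OnHyperLinkDiscovered(Uri absoluteUri)
    {
        if (StoreHyperLinks && !_collectedLinks.ContainsKey(absoluteUri.AbsoluteUri))
        {
            _collectedLinks.Add(absoluteUri.AbsoluteUri, absoluteUri);
        }
    }

    // End of phase 1: hand the collected links back so they can be turned
    // into CrawlRequests, then clear the Dictionary to save RAM and stop
    // storing anything further.
    public Uri[] BeginSecondPhase()
    {
        Uri[] seeds = new Uri[_collectedLinks.Count];
        _collectedLinks.Values.CopyTo(seeds, 0);

        _collectedLinks.Clear();
        StoreHyperLinks = false;

        return seeds;
    }
}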

Did I answer your question?

(Also, crawling Google is disabled by default.  Look at the cfg.DisallowedDomains table...)
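If you do need the first phase to fetch the Google results page, removing the entry could look like this - the connection string, database name and [Domain] column are guesses on my part, so check them against your actual cfg.DisallowedDomains table before running anything:

using System.Data.SqlClient;

class EnableGoogleCrawling
{
    static void Main()
    {
        // Assumed database name and column name - verify against your schema.
        using (SqlConnection connection = new SqlConnection(
            "Server=.;Database=arachnode_net;Integrated Security=true"))
        using (SqlCommand command = new SqlCommand(
            "DELETE FROM [cfg].[DisallowedDomains] WHERE [Domain] LIKE '%google%'",
            connection))
        {
            connection.Open();
            command.ExecuteNonQuery();
        }
    }
}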

Mike

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet
