arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE
An Open Source C# web crawler with Lucene.NET search using MongoDB/RavenDB/Hadoop

crawling includes unwanted URLs too

vchauhan.me posted on Thu, Apr 5 2012 9:45 PM

Hello Mike,

I am crawling one site with DepthLevel=6.

While checking the "webpages" table, I see that the "Absoluteuri" field contains URLs that include the pagination number links too.

For example, a page uses pagination to show a list of items, and the pagination number links have URLs like "http://www.xyz.com/aboutus.aspx?page=1&lang=en" and "http://www.xyz.com/aboutus.aspx?page=2&lang=en". I don't want to include such URLs. I could use "Disallowed by query string" to exclude them, but in many other cases I need URLs that have query string parameters too, like "http://www.xyz.com/aboutus.aspx?lang=en". So I cannot use "Disallowed by query string" here: "lang" is just one example, and there are many other parameters that I want to allow.

Please suggest a way to achieve this.

Thanks,
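
To make the requirement above concrete, here is a minimal sketch of the filtering rule being asked for: reject a URL only when its query string carries a pagination parameter, and let every other parameter (such as "lang") through. This is not arachnode.net code or configuration. It is written in Python purely for illustration, and the names `BLOCKED_PARAMS` and `is_pagination_url` are hypothetical.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical set of query parameter names that mark a pagination link.
BLOCKED_PARAMS = {"page"}


def is_pagination_url(url: str) -> bool:
    """Return True when the URL's query string contains a blocked parameter."""
    query = urlsplit(url).query
    # keep_blank_values=True so that a bare "?page=" is still recognized.
    params = parse_qs(query, keep_blank_values=True)
    return any(name.lower() in BLOCKED_PARAMS for name in params)


# The three example URLs from the post above.
urls = [
    "http://www.xyz.com/aboutus.aspx?page=1&lang=en",
    "http://www.xyz.com/aboutus.aspx?page=2&lang=en",
    "http://www.xyz.com/aboutus.aspx?lang=en",
]

# Keep only the URLs that do not carry a pagination parameter.
allowed = [u for u in urls if not is_pagination_url(u)]
```

If the crawler's rule mechanism accepts regular expressions, the same idea could probably be expressed as a pattern such as `[?&]page=\d+`, which matches the "page" parameter wherever it appears in the query string and leaves "?lang=en" alone.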


copyright 2004-2017, arachnode.net LLC