arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL 2005/2008/CE

Are relative links crawled?

Top 25 Contributor
14 Posts
sanwal223 posted on Mon, Jan 3 2011 12:08 PM

Hi Mike,

Good day!

I got past the DB size issues and exceptions as per your suggestions in the other post.

Actually I could crawl some 10K pages of a website, but since that website is heavily linked and I started from the home page, I expected all of it to be covered pretty quickly.

Instead, I find the crawl getting into a loop and not crawling new discoveries. There are about 100K discovered links in CrawlRequests from the previous crawl, but when I started the crawl today, this looping seemed to happen.

I am using BFS with 10 threads.

1. Are relative links within a page interpreted properly and crawled?

2. Sometimes the crawl seems to get stuck: the console log doesn't show what is being done with each URI.

Only the Engine:tb, cr, etc. state keeps refreshing for 15-20 minutes, then it ultimately proceeds.

3. Is there a possibility of going into a loop? Once a page is crawled, there is no reason for it to be crawled again, right? I mean, the next time it is discovered, it should be checked against the already crawled hyperlinks!

So as per my understanding, a page will either lead to an exception, be disallowed (due to some rule), or be crawled successfully, and it will never be crawled again, right? But I doubt this is holding true in my case :(

Thanks!

All Replies

Top 25 Contributor
14 Posts

http://www.mydomainname.com/abcd-257.html

http://www.mydomainname.com/dirname

What are the reasons for the above 2 types of links being disallowed with the reason "Disallowed by absolute URI"?

There is no entry in the Exceptions table. I see that the majority of discovered links fall prey to Disallow. Any ideas where to make config changes to avoid this?

As for the 2nd one, you might say it is missing the trailing '/', but that is how the link was found in the parent page.

Thanks.

Top 10 Contributor
1,692 Posts

You need to get the latest from SVN, like I recommended before. :D

Yes, relative links are properly crawled.

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

Top 10 Contributor
1,692 Posts

They were disallowed by some other rule, so they were already present in the DisallowedAbsoluteUris table the second time you crawled.

Again, check Program.cs to see which rules are enabled.  Those that are ENABLED are in GREEN, those that are DISABLED are in RED.
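
To see what was recorded for a given crawl, the two tables mentioned in this thread can be sampled directly. This is a minimal sketch, not the project's own tooling: it uses only the table names that appear in this thread and selects all columns, because the column names are not shown here.

```sql
-- Sample rows recorded for disallowed URIs.
-- No ORDER BY is given, so which 20 rows come back is arbitrary.
SELECT TOP (20) *
  FROM [arachnode.net].[dbo].[DisallowedAbsoluteUris];

-- Sample the Exceptions table the same way.
SELECT TOP (20) *
  FROM [arachnode.net].[dbo].[Exceptions];
```

If a link shows up in neither table and was not crawled, the enabled rules in Program.cs are the next place to look, as noted above.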

The trailing slash only applies to http://arachnode.net/, but not to http://arachnode.net/something.  You will never have to worry about the trailing slash if you do not insert rows in the CrawlRequests table.

Top 25 Contributor
14 Posts

OK, I sifted through some code and got an idea of these tables.

A related general question I have is about the overall flow of a crawl.

Take a CrawlRequest from the DB and parse out the discoveries. For now let's talk only about hyperlinks.

So what are the next steps?

When the crawl ends, what should the hyperlink-related tables look like? For example, count-wise: HyperLinks_Discoveries = WebPages + Exceptions + DisallowedAbsoluteUris?

I keep checking counts like the ones below but cannot make complete sense of them. Help needed. :)

 

SELECT count(*)
  FROM [arachnode.net].[dbo].[Discoveries]

SELECT count(*)
  FROM [arachnode.net].[dbo].[HyperLinks_Discoveries]

SELECT count(*)
  FROM [arachnode.net].[dbo].[HyperLinks]

SELECT count(*)
  FROM [arachnode.net].[dbo].[WebPages]

SELECT count(*)
  FROM [arachnode.net].[dbo].[Exceptions]

SELECT count(*)
  FROM [arachnode.net].[dbo].[DisallowedAbsoluteUris]

SELECT count(*)
  FROM [arachnode.net].[dbo].[CrawlRequests]
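
The separate counts above can also be gathered into a single labeled result set, which makes them easier to compare between crawls. This is a sketch that assumes only the table names used in the queries above. The column aliases TableName and NumberOfRows are made up for illustration, and the query does not imply that the counts must add up in any particular way.

```sql
-- One row per table: the table's name and its current row count.
SELECT 'Discoveries' AS TableName, COUNT(*) AS NumberOfRows
  FROM [arachnode.net].[dbo].[Discoveries]
UNION ALL
SELECT 'HyperLinks_Discoveries', COUNT(*)
  FROM [arachnode.net].[dbo].[HyperLinks_Discoveries]
UNION ALL
SELECT 'HyperLinks', COUNT(*)
  FROM [arachnode.net].[dbo].[HyperLinks]
UNION ALL
SELECT 'WebPages', COUNT(*)
  FROM [arachnode.net].[dbo].[WebPages]
UNION ALL
SELECT 'Exceptions', COUNT(*)
  FROM [arachnode.net].[dbo].[Exceptions]
UNION ALL
SELECT 'DisallowedAbsoluteUris', COUNT(*)
  FROM [arachnode.net].[dbo].[DisallowedAbsoluteUris]
UNION ALL
SELECT 'CrawlRequests', COUNT(*)
  FROM [arachnode.net].[dbo].[CrawlRequests];
```

Running it before and after a crawl shows which tables grew and by how much.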

 

Top 10 Contributor
1,692 Posts

The Crawler.Engine is fed from code, or from the database via the method ArachnodeDAO.GetCrawlRequests, which selects from the stored procedure dbo.arachnode_omsp_CrawlRequests_SELECT.

The counts depend on what you crawl, what your rules are, and how many HTTP exceptions you encounter.


copyright 2004-2013, arachnode.net LLC