Hi Mike,
Good day!
I got past the DB size issues and exceptions by following your suggestions in the other post.
I was able to crawl about 10K pages of a website. Since the site is heavily linked and I started from the home page, I expected all of it to be covered fairly quickly.
Instead, I find the crawl getting into a loop and not crawling new discoveries. There are around 100K discovered links in CrawlRequests from the previous crawl, but when I started the crawl today, this looping seems to happen.
I am using BFS with 10 threads.
1. Are relative links within a page interpreted properly and crawled?
2. Sometimes the crawl seems to get stuck: the console log doesn't show what is being done with each URI.
Only the Engine:tb, cr, etc. state keeps refreshing for 15-20 minutes, and then it eventually proceeds.
3. Is there a possibility of going into a loop? Once a page is crawled, there is no reason for it to be crawled again, right? I mean, the next time it is discovered, it should be checked against the already crawled hyperlinks!
So, as I understand it, a page will either lead to an exception, be disallowed (due to some rule), or be crawled successfully, and it will never be crawled again, right? But I doubt this is what is happening in my case :(
Thanks!
http://www.mydomainname.com/abcd-257.html
http://www.mydomainname.com/dirname
What are the reasons for the above two types of links being disallowed? The reason given is "Disallowed By absolute URI".
There is no entry in the Exceptions table. I see that the majority of discovered links fall prey to Disallow. Any ideas on where to make config changes to avoid this?
For the second one, you might say it is missing the trailing '/', but that is how the link was found in the parent page.
Thanks.
You need to get the latest from SVN, like I recommended before.
Yes, relative links are properly crawled.
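For anyone wondering what "properly crawled" means for relative links in general, here is a minimal Python sketch of standard relative-URI resolution. It is not arachnode.net code; the `resolve` helper and the parent page URI are hypothetical, and the domain is just the placeholder used in this thread.

```python
from urllib.parse import urljoin

# Hypothetical parent page (placeholder domain from this thread).
BASE = "http://www.mydomainname.com/dirname/index.html"

def resolve(base_uri, href):
    """Resolve an href found in a page against the URI of that page."""
    return urljoin(base_uri, href)

# Relative to the parent page's directory.
print(resolve(BASE, "abcd-257.html"))   # http://www.mydomainname.com/dirname/abcd-257.html
# Relative to the site root.
print(resolve(BASE, "/abcd-257.html"))  # http://www.mydomainname.com/abcd-257.html
# One directory up.
print(resolve(BASE, "../other"))        # http://www.mydomainname.com/other
```

Comparing the resolved absolute URIs, rather than the raw hrefs, is what lets a crawler recognize that two differently written links point to the same page.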
They were disallowed by some other rule the first time, so they were already present in the DisallowedAbsoluteUris table the second time you crawled.
Again, check Program.cs to see which rules are enabled. Those that are ENABLED are in GREEN, those that are DISABLED are in RED.
The trailing slash only applies to http://arachnode.net/, but not to http://arachnode.net/something. You will never have to worry about the trailing slash if you do not insert rows in the CrawlRequests table.
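One way to read the trailing-slash point, sketched in Python: a URI with an empty path (host only) is treated the same as the one ending in '/', while a non-root path is left exactly as it was found. This is an assumption about the intended behavior, not arachnode.net's actual code, and `normalize_root` is a hypothetical helper.

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_root(uri):
    """Add a trailing slash only when the URI has no path at all."""
    parts = urlsplit(uri)
    if parts.path == "":
        parts = parts._replace(path="/")
    return urlunsplit(parts)

# Host-only URI: the slash is added.
print(normalize_root("http://arachnode.net"))            # http://arachnode.net/
# Non-root paths are left untouched.
print(normalize_root("http://arachnode.net/something"))  # http://arachnode.net/something
```

Under this reading, http://www.mydomainname.com/dirname would not be rejected for lacking a trailing slash.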
OK, I sifted through some of the code and got an idea of how these tables are used.
Related to this, I have a general question about the summary flow of a crawl.
Take CrawlRequests from the DB, parse out the discoveries. For now, let's talk only about hyperlinks.
So what are the next steps?
When the crawl ends, how should the hyperlink-related tables look? For example, count-wise, should hyper_discoveries = webpages + exceptions + disalloweduris?
I keep checking counts like the ones below but cannot make complete sense of them. Help needed. :)
SELECT count(*) FROM [arachnode.net].[dbo].[Discoveries]
SELECT count(*) FROM [arachnode.net].[dbo].[HyperLinks_Discoveries]
SELECT count(*) FROM [arachnode.net].[dbo].[HyperLinks]
SELECT count(*) FROM [arachnode.net].[dbo].[WebPages]
SELECT count(*) FROM [arachnode.net].[dbo].[Exceptions]
SELECT count(*) FROM [arachnode.net].[dbo].[DisallowedAbsoluteUris]
SELECT count(*) FROM [arachnode.net].[dbo].[CrawlRequests]
The Crawler.Engine is fed from code or from the database. The database path goes through the method ArachnodeDAO.GetCrawlRequests, which selects from the stored procedure dbo.arachnode_omsp_CrawlRequests_SELECT.
The counts depend on what you crawl, what your rules are, and how many HTTP exceptions you encounter.
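To make the general crawl flow and the counting question concrete, here is a toy breadth-first crawl in Python over an in-memory link graph. It is only a sketch of the generic technique, not how arachnode.net is implemented: the `crawl` function, the `links` graph, and the `disallowed` and `failing` sets are all hypothetical stand-ins for real pages, rules, and HTTP errors.

```python
from collections import deque

def crawl(start, links, disallowed=frozenset(), failing=frozenset()):
    """Breadth-first crawl of an in-memory link graph (no real HTTP)."""
    queue = deque([start])
    seen = {start}  # every URI ever queued, so nothing is queued twice
    counts = {"webpages": 0, "exceptions": 0, "disallowed": 0}
    while queue:
        uri = queue.popleft()
        if uri in disallowed:   # stand-in for a crawl rule rejecting the URI
            counts["disallowed"] += 1
            continue
        if uri in failing:      # stand-in for an HTTP exception
            counts["exceptions"] += 1
            continue
        counts["webpages"] += 1
        for target in links.get(uri, []):
            if target not in seen:  # already-discovered links are skipped
                seen.add(target)
                queue.append(target)
    return seen, counts

# Small graph with cycles: "/" <-> "/a" <-> "/b".
links = {
    "/": ["/a", "/b", "/private"],
    "/a": ["/", "/b", "/broken"],
    "/b": ["/a"],
}
seen, counts = crawl("/", links, disallowed={"/private"}, failing={"/broken"})
print(counts)     # {'webpages': 3, 'exceptions': 1, 'disallowed': 1}
print(len(seen))  # 5
```

In this simplified model every discovered URI ends up in exactly one bucket, so discoveries = webpages + exceptions + disallowed, and the cycles do not cause a loop because of the `seen` set. In a real crawl the tables will not necessarily add up this neatly, since the totals depend on the enabled rules, on exceptions, and on requests still waiting in CrawlRequests.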