Hello Guys,
I am stuck and hoping someone has an idea about this.
I want to crawl only the links that belong to my domain, e.g. http://www.mydomain.com. I want to crawl all the pages under http://www.mydomain.com, like http://www.mydomain.com/1.aspx, http://www.mydomain.com/2.aspx, etc. One of the pages contains a link to http://www.yahoo.com, but I do not want to crawl http://www.yahoo.com. I know there is configuration through Application.config in the Configuration project, but if I set createCrawlRequestsFromDatabaseHyperLinks to false then it crawls only one link, http://www.mydomain.com, and I want data from all my sub pages.
Can this be done?
I hope I am clear enough.
Thanks,
JD
The easiest way to see how to crawl a single site is to make sure you're crawling only CrawlRequests, as set in Application.config. (Don't create CrawlRequests from Database HyperLinks or Database WebPages.)
Submit a CrawlRequest with a depth of 4 (any deeper and you'll need to check out CrawlRules.config for the Depth CrawlRule, I believe) and be sure to set RestrictToUriHost to true.
This configuration will crawl until all content found at depth 4 is complete. If you started a crawl like this at MSN you'd likely pick up 250,000 WebPages.
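To make the idea concrete, here is a minimal sketch of what a depth-limited, host-restricted crawl does. It is plain Python, not arachnode.net code: the function `crawl_single_host`, the `get_links` callback, and the in-memory `LINKS` table are all made up for illustration, and treating the start page as depth 1 is an assumption rather than a statement about how arachnode.net counts depth.

```python
from collections import deque
from urllib.parse import urlparse

def crawl_single_host(start_url, get_links, max_depth=4):
    """Breadth-first crawl that only follows links on start_url's host."""
    host = urlparse(start_url).netloc.lower()
    seen = {start_url}
    queue = deque([(start_url, 1)])  # assumption: the start page is depth 1
    crawled = []
    while queue:
        url, depth = queue.popleft()
        crawled.append(url)
        if depth >= max_depth:
            continue  # pages at the deepest level are crawled, their links are not followed
        for link in get_links(url):
            if urlparse(link).netloc.lower() != host:
                continue  # different host (e.g. www.yahoo.com): skip it
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return crawled

# Hypothetical in-memory "site" standing in for real HTTP requests.
LINKS = {
    "http://www.mydomain.com": [
        "http://www.mydomain.com/1.aspx",
        "http://www.mydomain.com/2.aspx",
    ],
    "http://www.mydomain.com/1.aspx": [
        "http://www.yahoo.com",
        "http://www.mydomain.com/2.aspx",
    ],
    "http://www.mydomain.com/2.aspx": [],
}

def get_links(url):
    return LINKS.get(url, [])

pages = crawl_single_host("http://www.mydomain.com", get_links)
print(pages)
```

In this sketch the link to http://www.yahoo.com is discovered on 1.aspx but never queued, which is the behaviour you would expect from a host restriction such as RestrictToUriHost.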
Of course, there is likely to be more content to be crawled. So, check out this forum post on how to restrict the entire system to a domain: http://arachnode.net/forums/p/103/367.aspx#367
You'll want to do this if you don't want to manually feed the CrawlRequests table with requests from your intended Domain, as the database tables will otherwise contain HyperLinks and Files from other Domains.
If you think crawling a single domain is too complicated, would you make a forum post with a suggestion? My eventual aim for arachnode.net is to make it as accessible as possible.
Thanks!
Mike
An open source .NET web crawler written in C# using SQL 2005/2008.
Hello Mike,
Wow, great, thanks for your reply. It certainly works.
Actually, I will have 5 websites to crawl and I want them to be crawled every day, so perhaps I will use a Windows service for that.
I have two issues.
1) After crawling one site I have 5 more sites to crawl, so how can I continue once the first has finished? Also, since I want to repeat this every day, do I have to make an entry in the CrawlRequests table every day (which obviously does not make sense!), given that the entry we made is cleared once the crawl for it starts?
2) If in the future http://www.mydomain.com has a few modified or newly added pages and I crawl that website again, what will happen to my old crawled data?
And yeah, I will wait for the improvements to the lucene search which you are working on :)
Thanks again,
JD
I updated the demo for the upcoming lucene.net functionality improvements: http://arachnode.net/Content/LiveDemonstration.aspx
Since I know you're waiting I'll make finishing a build top priority this weekend. I can't guarantee that I'll finish as the reporting views and stored procedures have to be modified for Release 1.1 and they always take longer than I think they will. Also, I've made changes to the DB structure and to the lucene.net index format/fields. Will you need a DB conversion script and a lucene.net index conversion utility?
Hello Mate,
Here is my point: I put two crawl requests in the CrawlRequests table and both were crawled, but now they are no longer in the table. Suppose I want to crawl them again; what should be done? I have 500 domains to crawl, which I will put in the CrawlRequests table, so my question is: is there anything that can be done so that I do not have to enter the 500 names again and again whenever I want to crawl them?
Let me know if I am not clear!
Thanks for all your help,
JD
OK, how about this: If you want to crawl 500 domains you would configure arachnode.net to restrict Crawls to those 500 domains only, as the posts above describe. Then, make sure your settings in Application.config are set as shown.
The Crawl process works like this if you have the settings set as shown above: (this is simplified, removing how Files work)
So, if you had 500 Domains let's say that those 500 CrawlRequests generated 500,000 HyperLinks. Of those 500,000 HyperLinks, 100,000 of them actually belong to your 500 Domains. Then, those 100,000 HyperLinks would be crawled and would generate 100,000 WebPages. When all CrawlRequests are done crawling and all HyperLinks belonging to your 500 Domains have been crawled (are found in the WebPages table), then all WebPages are crawled. When all WebPages are crawled, this is essentially the same as resubmitting the original 500 CrawlRequests, except that arachnode.net won't have to refilter the 500,000 HyperLinks that it finds.
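The filtering step in the middle of that cycle can be sketched in a few lines. Again, this is illustrative Python and not arachnode.net code: `filter_to_allowed_hosts` and the sample URLs are hypothetical, and the exact host comparison is an assumption about how one might restrict a crawl, not a description of the built-in rules.

```python
from urllib.parse import urlparse

def filter_to_allowed_hosts(hyperlinks, allowed_hosts):
    """Keep only the HyperLinks whose host is one of the allowed Domains."""
    allowed = {host.lower() for host in allowed_hosts}
    # urlparse(...).hostname is already lower-cased, or None for malformed links
    return [link for link in hyperlinks
            if (urlparse(link).hostname or "") in allowed]

# Hypothetical HyperLinks discovered while crawling the original CrawlRequests.
discovered = [
    "http://www.mydomain.com/1.aspx",
    "http://www.yahoo.com/",
    "http://sketchup.google.com/3dwarehouse",
    "http://www.mydomain.com/2.aspx",
]

in_domain = filter_to_allowed_hosts(discovered, ["www.mydomain.com"])
print(in_domain)  # only these would go on to become WebPages
```

Note that this sketch compares hosts exactly, so `mydomain.com` and `www.mydomain.com` would count as different hosts; if both should be crawled, both need to be in the allowed list.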
Does this help?
I am not sure.
I have done everything you suggested, and still for some reason the WebPages table contains lots of records from http://sketchup.google.com
What am I missing? I changed the configuration according to lots of other posts regarding single domain/several domains and it still doesn't work. Please help me understand what I am doing wrong.
And another thing, just a suggestion: maybe you can make a MODE feature. Modes could be Single Site/Several Sites/No Limits/By Country, plus other popular settings. Maybe lots of code changes are needed for this, but if you could use one settings file to hold all settings (and not a database), then you could offer settings files for download from this site, so everyone could download exactly the template he needs. Just an idea; as a new user it would be nice. I am sure I won't need it once I understand the code, whenever that will be :)
Thank you. Waiting for an answer on my issue.
You may be encountering a known bug that saves CrawlRequests from other domains when the Crawl process is shut down. I'm working on a new build that should be available soon.
Another user suggested the template idea - I'll keep that in mind.
If you are running from an official release, try downloading the code from the trunk.