
crawling specific web sites for tag words


Top 25 Contributor
14 Posts
dbs2000 posted on Fri, Jul 31 2009 2:54 PM

Thanks a lot for this open source venture. I am trying to come up with a system that crawls specific sites (maybe 4 or 5) for specific tag words. As per the requirement, I would have to keep the crawl restricted within each web site. Is it possible to achieve this with arachnode.net? I have downloaded it and played with it, but I somehow cannot keep the crawl restricted to a specific web site, even after setting restrictToUriHost to true.

Secondly, this crawler would need to run every night and revisit the same web sites to pick up new or changed content. The aim is to regularly monitor and analyse the content of particular web sites on the basis of tag words. Can you please help me figure out how to do all this with arachnode.net?

 

All Replies

Top 10 Contributor
1,905 Posts

Of course.

Step 1: Do you have the latest code from SVN?  I wrote a nice fat bug in Version 1.1 that prevented 'RestrictToUriHost' from functioning properly.  http://arachnode.net/media/p/57.aspx

Mike

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

Top 25 Contributor
14 Posts

Thanks Mike. I was not able to try that option out immediately. I got the latest code through SVN and tried to play with it. However, I do not have SQL Server 2008 on my box, so I could not restore the .bak file. I tried the old backup file from the version 1.1 download that I had been playing with, but it seems there have been some changes in the database structure, so I got errors while running the Console project and could not proceed. Bottom line: I am stuck until I have SQL Server 2008 on my box, or until you provide a new SQL Server 2005 backup. I have seen that other people have asked you for it as well. It would be really nice if you could provide a SQL Server 2005 backup. Thanks a lot for your prompt help and for coming up with this awesome open source venture.

Top 10 Contributor
1,905 Posts

I will make a SQL2005 backup.  Give me a few hours.  :)

OK - I added a 2005 DB backup.  Let me know if you have problems - I need to get out and enjoy what's left of the sun.

-Mike

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

Top 25 Contributor
14 Posts

Thank you once again, Mike, and apologies for the late response. I cannot find any 2005 backup. I got the latest again from the SVN trunk (and also checked out the tags), but I cannot find a second backup file. I tried the one that is available, but I get the same error when I try to restore it: "SQL Server cannot process this media family. RESTORE HEADERONLY is terminating abnormally. (Microsoft SQL Server, Error: 3241)". Please note that I was able to restore the backup that came with the downloaded version earlier.

Top 10 Contributor
1,905 Posts

Really?  I have to confess I didn't test yesterday - I really wanted to run out and get some of the sun that was left in the day.  I'll attempt a restore now.

Did you find the file arachnode.net.bak_2005.zip?  Let me check home again... make sure I got it checked into SVN.

Mike

(always glad to help)

I messed up and somehow managed to NOT check the file in.  I'm checking it in right now.  OK - all checked into the trunk.

Now, on to your original question... you should be able to restrict your crawls now. When all looks good as far as restricting crawls goes, let me know and we can chat about how to crawl only specific sites.


Top 25 Contributor
14 Posts

Thanks a lot, Mike. The 2005 backup is available now. I got the latest and it was working. :)

Now, I tried to do some crawling. I added a new crawl request in Program.cs:

_crawler.Crawl(new CrawlRequest(new Discovery("http://praxis-softek.com/"), int.MaxValue, UriClassificationType.Domain, UriClassificationType.Domain, 1));

It crawled well, but I could not terminate the console. I got this message:

Press any key to terminate arachnode.net

Console.exe will call _crawl.Engine.Start() in 30 seconds.

But nothing happened when I pressed a key; it continued to run after some time. When I tried to close the window by hitting the cross button, I got the Windows "End Program" dialog box and had to hit "End Now" to terminate the program. So how does the program get terminated gracefully?

Secondly, crawling more than one web site: I added another crawl request below the first one:

_crawler.Crawl(new CrawlRequest(new Discovery("http://eforceglobal.com/"), int.MaxValue, UriClassificationType.Domain, UriClassificationType.Domain, 1));

It was never crawled; I checked the WebPages table and there were no additions. I tried adding the two crawl requests to the CrawlRequests table too, but the second one was never taken. So how do I specify a list of 4 or 5 web sites that need to be crawled each day?

Thanks once again for your help.

Top 10 Contributor
229 Posts

I am not sure, but why are you trying to change the code for regular crawling? I am trying to do exactly what you are doing: running the crawl on several websites each day, and updating the database if any changes are detected. Mike helped me a few times; you can view the history of the threads I have made.

To crawl several websites, open the Configuration table and change the CreateCrawlRequest parameters to false, except for CreateCrawlRequestsFromDatabaseFiles.

Then go to the CrawlRequests table and add the websites to be crawled. You can also delete all data from the DisallowedWords table; that table is designed to prevent crawling on sec sites or whatever.

I hope it helps.

Top 10 Contributor
1,905 Posts

You are welcome.

I'll check into the key not being registered. The end program dialog is there to allow the Engine to save its state and should terminate the console when complete. Again, I'll check into why it isn't being registered. The handler for the console termination calls 'Stop' on the Engine.

The ideal way for you to crawl several sites a day is to create a Crawler in code, crawl, and then create another Crawler in code and crawl again, just like you are doing in your code above. The Console is a nice start...

So: create a crawler, feed it your crawl requests, crawl for some time, then stop the engine, wait for it to stop, then repeat the process.
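Something like this, as a rough sketch only (the Crawler is assumed to be constructed the same way as in the Console project's Program.cs, and the helper name, the site list and the fixed crawl window are illustrative rather than part of arachnode.net's API):

// A sketch of the "submit crawl requests, crawl, stop, repeat" pattern described
// above, assuming the same using directives and Crawler construction as the
// Console project's Program.cs. CrawlSitesOnce is an illustrative helper name.
private static void CrawlSitesOnce(Crawler crawler, string[] absoluteUris)
{
    foreach (string absoluteUri in absoluteUris)
    {
        // The same CrawlRequest shape used earlier in this thread: unlimited
        // depth, crawl and discoveries restricted to each seed's Domain.
        crawler.Crawl(new CrawlRequest(new Discovery(absoluteUri), int.MaxValue,
            UriClassificationType.Domain, UriClassificationType.Domain, 1));
    }

    crawler.Engine.Start();

    // Give the crawl a window to run; a nightly scheduled task would call this
    // method once per night. Polling for completion would be better than a fixed sleep.
    System.Threading.Thread.Sleep(TimeSpan.FromHours(1));

    // Stop lets the Engine save its state before the process exits.
    crawler.Engine.Stop();
}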

http://eforceglobal.com/robots.txt :

User-agent: *
Disallow:
It is being disallowed by robots.txt.
Check the table DisallowedAbsoluteUris.
Check the table CrawlRules. (Where the robots.txt rule is configured...)

Mike


Top 10 Contributor
1,905 Posts

It is easiest to use the C# code to insert crawl requests into the DB, unless you are comfortable with bitmasks.
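For example, purely as a sketch (the restriction columns presumably store these values as bitmask integers; 'Host' is assumed here only as a second example flag, since only 'Domain' appears in this thread):

// Letting C# compute the bitmask beats working out the integer for a
// hand-written CrawlRequests row by hand.
UriClassificationType restrictCrawlTo =
    UriClassificationType.Domain | UriClassificationType.Host;   // 'Host' is an assumed example flag

int valueForManualRow = (int)restrictCrawlTo;   // the number a manual INSERT would need
Console.WriteLine(valueForManualRow);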

The parameter is 'CreateCrawlRequestsFromDatabaseCrawlRequests'.

Megetron means 'sex' sites.


Top 25 Contributor
14 Posts

Thanks Mike.

You are correct: http://eforceglobal.com is disallowed. I replaced it with http://arachnode.net/Default.aspx and it then crawled both sites. Now an obvious question comes to mind: if I need to crawl a site that is disallowed, what do I do? Does it mean that the specific site cannot be crawled?

I could successfully crawl two sites by adding two crawl requests one after the other, like this:

 _crawler.Crawl(new CrawlRequest(new Discovery("http://praxis-softek.com/"), int.MaxValue, UriClassificationType.Domain, UriClassificationType.Domain, 1));

_crawler.Crawl(new CrawlRequest(new Discovery("http://arachnode.net/Default.aspx"), int.MaxValue, UriClassificationType.Domain, UriClassificationType.Domain, 1));

My question to you is: is this all I have to do to crawl multiple sites (within the domain)? Apart from these code changes, I have just done what the Console asked me to do, that is, provided the various folder paths that I had to add to the Configuration and CrawlActions tables. Do I need to be bothered about anything else? I plan to write a separate web app with code very similar to the existing search code in the Web project, but more customised to my needs. That is how I plan to proceed.

Also, about the console exe termination: I would like to run this console app exe with the help of a scheduler. Will the termination problem that I mentioned earlier be an issue?

Thanks for all your help.

Debasish

Top 25 Contributor
14 Posts

Thanks for your response.

I did try the option that you pointed out. The problem was the one Mike identified, that is, the second web site was disallowed.

I really appreciate your help.

Debasish

Top 25 Contributor
14 Posts

Mike, I had sent you a reply on this, but this is the first time one of my posts went for moderation (my other posts did not, and I don't know why), so I am sending the reply once again.

You were correct in pointing out that the second site (eforceglobal) is disallowed. Now, if I need to crawl such a web site, what do I do? Does that mean I would not be able to crawl it?

The crawl worked. I replaced eforceglobal with arachnode.net and then I could crawl both sites. I added two crawl requests one after the other, like this:

 _crawler.Crawl(new CrawlRequest(new Discovery("http://praxis-softek.com/"), int.MaxValue, UriClassificationType.Domain, UriClassificationType.Domain, 1));

_crawler.Crawl(new CrawlRequest(new Discovery("http://arachnode.net/Default.aspx"), int.MaxValue, UriClassificationType.Domain, UriClassificationType.Domain, 1));

Apart from these code changes, I had to provide the folder paths I was prompted for when I first ran the Console, in the Configuration and CrawlActions database tables. Is there anything else I have to do to crawl multiple sites repeatedly every night, or do I need to take care of other things as well?

I am planning to write a separate web project that would contain code very similar to the existing Web project, tailored to my search requirements.

I am thinking of running the console app exe via a scheduler. Will the termination work properly in that case, or do I need to make any changes (currently it starts again after a 30-second wait once the crawl has finished)?

I am not sure whether I have missed any other points that I wanted to ask you about. :)

And yes, THANKS a lot for all your help.

Debasish

Top 10 Contributor
229 Posts

Just curious.

Why does the design behave that way and disallow the second website in the CrawlRequests table? It forces you to manipulate the code, when you want it to be flexible enough to drive from the configuration data, so that people with less knowledge of the code can run it without any changes. Since the requirement of crawling a single site or several sites is common, maybe the design should be different?

I am new to the code, so I am just wondering; the question is not based on any previous knowledge, so please enlighten me.

Top 10 Contributor
1,905 Posts

Hey - I'm back from my mini-vacation to the Washington coast.

The site was disallowed as, by default, arachnode.net follows robots.txt rules.

If you want to turn off the robots.txt behavior, check the 'CrawlRules' table, find the robots.txt rule and turn it off.
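If you would rather script that step, here is a hypothetical sketch (the CrawlRules column names, the LIKE filter and the connection string are all assumptions; inspect the actual table before running anything like this):

// Hypothetical sketch only: the CrawlRules schema is not shown in this thread,
// so TypeName, IsEnabled and the LIKE filter are assumed names and values.
// Requires: using System.Data.SqlClient;
using (SqlConnection connection = new SqlConnection(
    "Server=.;Database=arachnode.net;Integrated Security=true"))
using (SqlCommand command = new SqlCommand(
    "UPDATE dbo.CrawlRules SET IsEnabled = 0 WHERE TypeName LIKE '%Robots%'",
    connection))
{
    connection.Open();
    command.ExecuteNonQuery();   // turns off the robots.txt rule (assumed schema)
}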

No worries on being new to the code - I am always glad to answer and help.

-Mike

