arachnode.net

Getting started: crawling multiple sites

mopiola posted on Wed, Apr 28 2010 7:03 AM

I am using arachnode.net with a SQL 2005 database. Am I correct that running arachnode.net for the first time will not do anything until the code in Console -> Program.cs is uncommented?

How do I set up a crawl request to begin crawling a site? For example, what if I wanted to crawl http://arachnode.net/forums/ for .aspx files, and only within the /forums/ directory? Do I specify this in code or in the database? I see a table called CrawlRequests; do I add the site information to that table?

Ultimately, here is what I am trying to accomplish: I want to create a document repository of resources gathered from many different websites. I need to crawl multiple websites and, for each website, restrict the crawl to certain directories, file types, depth, etc. Is there a way I can keep a table of sites to crawl and then add them as crawl requests?

I hope someone can point me in the right direction. Thank you.

Mark

All Replies

Looks like you have been registered for quite some time, mopiola.

Which version of AN are you using?

I wouldn't add CrawlRequests directly to the CrawlRequests table; add them in code as shown in Program.cs. That table is used internally by the Cache/Engine.
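Roughly, the pattern in Program.cs looks something like the sketch below: create a Crawler and submit one CrawlRequest per seed URI. The class names, constructor parameters, and UriClassificationType values here are assumptions modeled on the demo code, so check the Program.cs in your release for the exact signatures.

```csharp
// Rough sketch only - the Crawler/CrawlRequest/Discovery constructors and
// UriClassificationType values are assumptions; check the demo Program.cs
// for the exact API in your release.
using System.Collections.Generic;
using Arachnode.SiteCrawler;              // assumed namespaces
using Arachnode.SiteCrawler.Value;
using Arachnode.SiteCrawler.Value.Enums;

internal static class Program
{
    private static void Main(string[] args)
    {
        Crawler crawler = new Crawler(CrawlMode.BreadthFirstByPriority, false);

        // One seed per site - this list could just as easily be read from your
        // own "sites to crawl" table instead of being hard-coded here.
        List<string> seeds = new List<string>
        {
            "http://arachnode.net/forums/",
            "http://example.com/docs/"
        };

        foreach (string seed in seeds)
        {
            // Depth and the UriClassificationType restrictions control how far
            // the crawl is allowed to spread from each seed.
            crawler.Crawl(new CrawlRequest(
                new Discovery(seed),
                2,                              // maximum crawl depth
                UriClassificationType.Host,     // restrict crawling to the seed's host
                UriClassificationType.Host,     // restrict discoveries to the seed's host
                1));                            // priority
        }
    }
}
```

The "table of sites to crawl" you describe maps naturally onto the seeds list above: read your table, build one CrawlRequest per row, and submit them all before starting the crawl.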

Read this post about restricting a crawl: http://arachnode.net/forums/t/739.aspx

To restrict the crawl to a specific directory you would need to write a plugin. There is a file, AbsoluteUri.cs, which you may or may not have (it's not in the demo, as the SiteCrawler project is compiled/obfuscated/encrypted), which shows how to accomplish this kind of filtering. There is also a new property on Discoveries and CrawlRequests, IsStorable, which lets you crawl through a site and store only what you want, rather than being blocked by IsDisallowed. Very handy, but it is only in the newest release version. If you purchase a license I will write the plugin for you.
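For illustration only, a directory/extension rule might look roughly like the sketch below. The base class name (ACrawlRule), the IsDisallowed override, and the Discovery property names are assumptions modeled on what AbsoluteUri.cs demonstrates; the actual plugin base classes and signatures depend on your version.

```csharp
// Hypothetical sketch of a crawl rule that limits a crawl to .aspx pages under
// /forums/. ACrawlRule and the IsDisallowed override are assumptions modeled on
// the AbsoluteUri.cs example; check the plugin base classes in your build.
using System;
using Arachnode.SiteCrawler.Value;   // assumed namespace for Discovery
using Arachnode.Plugins.CrawlRules;  // assumed namespace for the rule base class

public class RestrictToForumsAspx : ACrawlRule
{
    // Return true to disallow a Discovery that is outside /forums/ or is not an .aspx page.
    public override bool IsDisallowed(Discovery discovery)
    {
        string absoluteUri = discovery.Uri.AbsoluteUri;

        bool inForums = absoluteUri.StartsWith("http://arachnode.net/forums/",
            StringComparison.OrdinalIgnoreCase);
        bool isAspx = absoluteUri.EndsWith(".aspx", StringComparison.OrdinalIgnoreCase);

        return !(inForums && isAspx);
    }
}
```

If you want links on non-matching pages to still be followed (so the crawl can reach the /forums/ pages through them), the IsStorable approach mentioned above would be the alternative: leave the pages crawlable and instead mark the ones you don't want kept as not storable.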

Let me know how you would like to proceed. I would love to help you!

- Mike

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet
