arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2005/2008/CE, MongoDB, RavenDB, and Hadoop

How to Begin my web crawler and build the Index?

Answered (Verified) This post has 1 verified answer | 5 Replies | 2 Followers

wumengge posted on Tue, Apr 14 2009 1:07 AM

I have installed all the parts of arachnode.net 1.1 and compiled the whole project without any errors.

However, I am puzzled about how to use arachnode.net. There are so many files in the folder, but where is the entry point of the built project? Are there any windows in the project, or is this only a console program? (Such a big project without any windows really puzzles me.)

I am a college student in China. My partner and I have a great interest in this project, and we need your help to learn more about it.


All Replies


The first part, about not receiving any compilation errors, is great news!  ;)

arachnode.net is currently meant to be used as a service (always on) or as a class library.  A Console project is provided to help with stepping through the code.  A Web Admin interface is in development, but we do not have a release date at this time.

Tell me, what are you trying to accomplish?  I can best help you by knowing more about your specific intentions for the code.

(In the meantime, set the Console project as the startup project and press F5.  A crawl will start at arachnode.net.  Check the bottom of the stored procedure '[dbo].[arachnode_usp_arachnode.net_RESET_DATABASE]'.)
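Since the reset mentioned above is an ordinary stored procedure, it can also be run directly from SQL Server Management Studio between debugging sessions. A minimal sketch, assuming the database carries the default install name of arachnode.net:

```sql
-- Re-initialize the crawl state before a fresh debugging session.
-- WARNING: this clears previously collected crawl data.
-- The database name is an assumption from the default install.
USE [arachnode.net];
EXEC [dbo].[arachnode_usp_arachnode.net_RESET_DATABASE];
```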

Always glad to help,
Mike

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

Firstly, I should tell you that these days I am working on the graduation project for my bachelor's degree. My project is to crawl pages from websites and build an index using Lucene.Net; in the end I have to deliver a program to my instructor. I noticed that arachnode.net is an excellent open-source project, so I thought I might find some ideas in it. After several days of research I am deeply attracted to arachnode, but I find it really hard to understand all of its code, so I would appreciate some suggestions on how to use arachnode.net.

A new problem has come up: "Test.csproj" cannot be loaded by VS2005, even though I have installed Team Suite SP1 for VS2005.

Thanks sincerely,
WuMengge
arachnode.net replied on Wed, Apr 15 2009 6:44 AM

It means a great deal to me that you have chosen arachnode.net as something to learn from, especially as something to influence your graduation project.

 

The project Test.csproj can be removed from the solution if you can't load it.

 

One of my main intentions for arachnode.net was to keep the code simple and accessible for everyone.  While this was and is a great intention, crawling the internet, and crawling it properly, is not a simple process.  Undoubtedly you have evaluated other crawlers and noticed that almost all of the ones written in C# crawl hyperlinks only, do not download content in the form of files or images, and do not store or index the content.  Why?  The process of crawling seems simple.  I thought it would be relatively easy to craft a crawler until I discovered the thousands of conditions that must be met to crawl properly.  If you have evaluated Nutch (a complete crawler, like arachnode.net), then you have noticed that Nutch is quite complex and takes a good amount of time to learn and to debug what is going on under the covers.

 

The basic usage of arachnode.net is this: CrawlRequests are placed into the CrawlRequests table in the database.  The CrawlRequests are fed into the system and crawled, and each CrawlRequest is run against a configurable set of CrawlActions and CrawlRules.  The Lucene.Net functionality is implemented as a CrawlAction.  The Web project attaches to the indexes so you can search.
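As a rough illustration of that flow, seeding a crawl amounts to inserting a row into the CrawlRequests table. The column names below are assumptions for illustration, not the exact schema; check the table definition in your installation:

```sql
-- Hypothetical sketch: seed the crawler with a starting URI.
-- Column names (AbsoluteUri, Depth, Priority) are assumptions;
-- verify them against the actual CrawlRequests table definition.
INSERT INTO [dbo].[CrawlRequests] ([AbsoluteUri], [Depth], [Priority])
VALUES ('http://arachnode.net/', 1, 1);
```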

 

The best way to understand how arachnode.net works is to 1.) Run the application from the default configuration and 2.) Step through the code using the debugger.

 

From the default installation, start Visual Studio and get the crawler crawling.  Take a look at the database tables and familiarize yourself with what is being collected.  Now, reset the database per the installation instructions.  Modify the ‘MaximumNumberOfCrawlThreads’ setting in the Configuration table in the database and set this value to 1.  This will instruct the crawler to use only one crawl thread, which makes debugging much, much easier.  Next, brew yourself a fresh cup of coffee or tea and step into the code.  The best way to learn what is occurring is to read the code, line by line.  Yes, there is a good deal of code, but it will be worth it, I promise.
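If the Configuration table stores name/value pairs (an assumption; check the schema in your installation), the single-thread change described above might look like this:

```sql
-- Hypothetical sketch: restrict crawling to a single thread so the
-- debugger only ever stops on one crawl at a time.
-- Table layout (Name/Value columns) is an assumption; verify it first.
UPDATE [dbo].[Configuration]
SET [Value] = '1'
WHERE [Name] = 'MaximumNumberOfCrawlThreads';
```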

 

In the past several days, what have you discovered in your research?


wumengge replied on Thu, Apr 16 2009 12:01 AM

First, I am truly moved that you are so warm-hearted in helping me. Really, thanks.

The good news is that I have built the whole project successfully after setting the Console project as the startup item. I tried running the console program against the live internet and changed the target URI to "www.sohu.com", a portal site in China. To my great surprise, the crawling process is faster than any other crawler I have used. However, three more questions have come to mind:

 

  • (1) The downloaded files have no extensions like ".htm" or ".jpg" in the DownloadedWebPages folder (or the others), so how can I view the downloaded pages in the proper way?
  • (2) I can build the whole project, but when I deploy all of it an error occurs in "Analysis" (as in the following picture): "UriClassificationDataSource" cannot be connected. Maybe it has no influence on crawling, but it does not look right, does it?
  • (3) I noticed that you told other people to use the Lucene index by setting the Web project as the startup item, so what is the next step to use the Web application? Should I deploy the Web project to a folder and load it in the IIS configuration (as in the following picture)?

 

I will keep reading your code line by line. Thanks for all your answers above.

WuMengge

Verified by wumengge

You are very welcome!

1.) Check the FullTextIndexType column in the Files, Images and WebPages tables.  This column contains the file extensions.
2.) I haven't seen nearly that many errors from Analysis Services.  The project isn't essential to crawling and can be excluded.
3.) You can deploy to IIS or you can use Visual Studio's web host.  Either one is fine.
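For point 1, a quick way to pair downloaded rows with their extensions is to query the column directly. The column list below is a sketch; adjust it to the actual WebPages schema in your installation:

```sql
-- Hypothetical sketch: list recently crawled pages alongside the
-- extension recorded for each in FullTextIndexType.
-- ID and AbsoluteUri column names are assumptions; verify them.
SELECT TOP 10 [ID], [AbsoluteUri], [FullTextIndexType]
FROM [dbo].[WebPages]
ORDER BY [ID] DESC;
```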

Mike

 



copyright 2004-2014, arachnode.net LLC