arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL 2005/2008/CE

Initial Recommendations


Top 100 Contributor
3 Posts
hsaritas posted on Wed, Jul 20 2011 12:35 AM

Hi Mike,

We purchased AN a month ago for our needs and started to examine the code. 

Before going further, I'd like to describe our requirements and ask for your recommendations.

 

Here are our specs for the project:

1. We want to crawl a number of sites (300+) on a regular basis,

2. There will be a set of specified keywords for each site,

3. For each page crawled, the main page content will be extracted (by BoilerPipe or another way) so that advertisements and unrelated information will be discarded,

4. The site-specific keywords will be searched for in the page,

5. If the page contains the keyword, the extracted page content will be transferred to our intranet,

6. We don't want to crawl outside a site, but we do want to crawl the site's subdomains (e.g., abc.com and news.abc.com),

7. We want to exclude script, CSS, and image files and include HTML, PDF, and Office files.
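
Items 4 and 5 amount to a keyword check on the extracted text. A minimal sketch of that step in plain C#, assuming the main content has already been extracted to a string; `KeywordFilter` and `FindMatches` are hypothetical names for illustration, not part of arachnode.net:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not an arachnode.net type.
public static class KeywordFilter
{
    // Returns the keywords that occur in the extracted text
    // (case-insensitive substring match; blank keywords are ignored).
    public static List<string> FindMatches(string extractedText, IEnumerable<string> keywords)
    {
        return keywords
            .Where(k => !string.IsNullOrWhiteSpace(k)
                && extractedText.IndexOf(k, StringComparison.OrdinalIgnoreCase) >= 0)
            .ToList();
    }
}
```

A page would be forwarded to the intranet only when the returned list is non-empty. Substring matching is the simplest choice here; whole-word or language-aware matching would need a different comparison.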

 

In this scope,

1. We want to use AN as a service and run it continuously,

2. We don't want to store any data, except any information needed by AN for the next crawls,

3. We want to use an HTTP or SOCKS proxy for specific sites while crawling.

 

So,

1. Which do you recommend we use: AN or AN.Next?

2. What steps should we follow in using Arachnode to achieve our goals?

 

Thanks for any help,

 

Halis

 

Verified Answer

Top 10 Contributor
1,692 Posts
Verified by hsaritas

OK, back.

SPECS:

  1. OK.  No problem.
  2. See these posts: http://arachnode.net/forums/p/1395/12831.aspx#12831 , http://arachnode.net/forums/p/1462/13133.aspx#13133
  3. Great.  That's a great project.
  4. OK.
  5. See #2.  Also, be sure you are familiar with this: http://arachnode.net/forums/t/1463.aspx and this: http://arachnode.net/forums/p/1026/12029.aspx#12029 and this: http://arachnode.net/Content/CreatingPlugins.aspx
  6. Cool.  Set RestrictCrawlTo and RestrictDiscoveriesTo on your CrawlRequests to Host.  http://arachnode.net/forums/p/323/10294.aspx#10294
  7. OK.  http://arachnode.net/forums/p/758/11596.aspx#11596 and http://arachnode.net/forums/p/1664/15500.aspx#15500 and http://arachnode.net/forums/p/565/10763.aspx#10763
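
To make the intent of items 6 and 7 concrete, here is a minimal sketch in plain .NET of the two checks: "same site, subdomains included" and "wanted file type". It does not use or describe arachnode.net's own restriction settings; `CrawlScope`, its methods, and the extension list are assumptions for illustration only.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not an arachnode.net type.
public static class CrawlScope
{
    // Assumed allow-list: HTML, PDF, and Office documents.
    // The empty string covers extensionless URLs such as "/news/".
    private static readonly HashSet<string> AllowedExtensions =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "", ".htm", ".html", ".aspx", ".pdf",
            ".doc", ".docx", ".xls", ".xlsx", ".ppt", ".pptx"
        };

    // True when the host is the root domain itself or one of its subdomains,
    // e.g. abc.com and news.abc.com for rootDomain "abc.com".
    public static bool IsInSite(Uri uri, string rootDomain)
    {
        string host = uri.Host;
        return host.Equals(rootDomain, StringComparison.OrdinalIgnoreCase)
            || host.EndsWith("." + rootDomain, StringComparison.OrdinalIgnoreCase);
    }

    // True when the extension of the last path segment is on the allow-list.
    public static bool IsWantedFileType(Uri uri)
    {
        string path = uri.AbsolutePath;
        int slash = path.LastIndexOf('/');
        int dot = path.LastIndexOf('.');
        string extension = dot > slash ? path.Substring(dot) : "";
        return AllowedExtensions.Contains(extension);
    }
}
```

An extension check is only a first pass. The Content-Type header of the response is the more reliable signal for what was actually served.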

SCOPE:

  1. Great!  Definitely use the Service project.
  2. Perfect.  Look at the switches for what you can store in ApplicationSettings, and in the Service specifically, at OverrideDatabaseSettings.
  3. OK.  AN uses the standard HttpWebRequest and HttpWebResponse, so you can easily add the proxy settings in WebClient.cs.
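
For the proxy in item 3, a minimal sketch of setting a per-site HTTP proxy on a standard HttpWebRequest. Where this would hook into WebClient.cs is not shown in this thread, so `ProxyExample`, `CreateRequest`, and the host-to-proxy map are hypothetical, and the proxy address in the usage note is a placeholder.

```csharp
using System;
using System.Collections.Generic;
using System.Net;

// Hypothetical helper, not an arachnode.net type.
public static class ProxyExample
{
    // Creates a request for the URI and, if its host has an entry in
    // proxyByHost, routes the request through that HTTP proxy.
    // Hosts without an entry keep the default proxy behaviour.
    public static HttpWebRequest CreateRequest(Uri uri, IDictionary<string, string> proxyByHost)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(uri);

        string proxyAddress;
        if (proxyByHost.TryGetValue(uri.Host, out proxyAddress))
        {
            request.Proxy = new WebProxy(proxyAddress);
        }

        return request;
    }
}
```

Usage might look like `ProxyExample.CreateRequest(new Uri("http://abc.com/"), map)` with `map` containing `"abc.com" -> "http://proxy.example.local:8080"`. This covers HTTP proxies only; to my knowledge HttpWebRequest has no built-in SOCKS support, so a SOCKS proxy would likely need a separate library or a local HTTP-to-SOCKS bridge.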

RECOMMENDATIONS:

  1. Use AN.  AN.Next does not have a service implementation.
  2. Read every post linked above and the FAQ, and everything should be clear. :)  http://arachnode.net/Content/FrequentlyAskedQuestions.aspx

Thanks!
Mike

 

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

All Replies

Top 10 Contributor
1,692 Posts

Hi there!

Just finished with the day's work and I need a break.  I can likely get to this tomorrow...

Thanks for your patience,
Mike



copyright 2004-2013, arachnode.net LLC