arachnode.net
www prefix?


posted on Thu, Dec 3 2009 6:53 AM

Hi all,

 

Sometimes there are sites that don't work without the www prefix.

It seems that arachnode.net removes the www prefix and tries to navigate using only the bare host name...

Could you tell me how to use arachnode.net with these URIs (sites that work only with the www prefix)?

 

thanks,

  Revenge

Verified Answer

Top 10 Contributor
1,905 Posts
Verified by arachnode.net

This issue is fixed in Version 1.4.

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

All Replies

Top 10 Contributor
1,905 Posts

1.) Wait for Version 1.4.

2.) Find the constructors in Discovery.cs and look for the Replace("://www.", "") code.  Modify the two check constraints on table dbo.CrawlRequests.  See the sketch below.
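
A minimal sketch of the kind of change meant here, not the actual code shipped in Discovery.cs: the class shape and constructor signature are assumptions, and the CacheKey member name is borrowed from the Discovery.cs line quoted later in this thread.

using System;

public class Discovery
{
    public Uri CacheKey { get; private set; }

    public Discovery(string absoluteUri)
    {
        // Original behavior (what to look for in the real constructors):
        // the www prefix is stripped before the Uri is built, which breaks
        // sites that only respond on the www host.
        //CacheKey = new Uri(absoluteUri.Replace("://www.", "://"));

        // To crawl with the www prefix intact, build the Uri as supplied:
        CacheKey = new Uri(absoluteUri);
    }
}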


replied on Thu, Dec 3 2009 9:02 AM

The Discovery.cs change is OK.

But for the constraints, there seems to be only one that touches the www prefix:

CK_CrawlRequests:

(NOT [AbsoluteUri] like '%://www.%' AND [AbsoluteUri] like '%//%' AND ([AbsoluteUri] like '%/' OR NOT (len([AbsoluteUri])-len(replace([AbsoluteUri],'/','')))<(3)))

 

thanks
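
For reference, a minimal sketch of how that constraint could be relaxed so AbsoluteUris keep their www prefix. The CHECK body is the expression quoted above with the '%://www.%' clause removed; the table and constraint names come from this thread, while the connection string is a placeholder you would need to adjust for your own database.

// Hedged sketch: re-creates CK_CrawlRequests without the
// "NOT [AbsoluteUri] like '%://www.%'" clause quoted above.
using System.Data.SqlClient;

class RelaxWwwConstraint
{
    static void Main()
    {
        const string sql = @"
ALTER TABLE dbo.CrawlRequests DROP CONSTRAINT CK_CrawlRequests;
ALTER TABLE dbo.CrawlRequests ADD CONSTRAINT CK_CrawlRequests CHECK
(
    [AbsoluteUri] like '%//%'
    AND ([AbsoluteUri] like '%/'
         OR NOT (len([AbsoluteUri]) - len(replace([AbsoluteUri], '/', ''))) < (3))
);";

        // Placeholder connection string; point this at your arachnode.net database.
        using (var connection = new SqlConnection(@"Server=.;Database=arachnode.net;Integrated Security=true"))
        using (var command = new SqlCommand(sql, connection))
        {
            connection.Open();
            command.ExecuteNonQuery();
        }
    }
}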

Top 10 Contributor
1,905 Posts

Which version are you using?  This constraint is an older one, as I recall...


replied on Fri, Dec 4 2009 2:29 AM

I'm trying the 1.2 free version

replied on Fri, Dec 4 2009 3:36 AM

Now I'm downloading the latest version from the SVN repository...

It seems to be version 1.3...

 

replied on Fri, Dec 4 2009 6:21 AM

Yes, this is the constraint in version 1.3.

replied on Fri, Dec 4 2009 6:26 AM

Hello,

Does anybody know when we can expect version 1.4? Which version is in the SVN repository now?

Best regards,
k1

Top 10 Contributor
1,905 Posts

I will likely finish 1.4 this weekend.  No distinct version is in SVN now...



Top 150 Contributor
2 Posts

Hi,

I finally bought the commercial license and now I'm using version 1.4.

In Discovery.cs there are still lines such as:

CacheKey = new Uri(discoveriesRow.AbsoluteUri.Replace("://www.", "://"));

Maybe this fix is not present in the 1.4 zip package,

and the same goes for the database (one reference in the CK_CrawlRequests check constraint).

I can remove the www references myself...

but this reply is to report the missing fix in the 1.4 package.

Bye

EDIT: I had forgotten the update script... the database is OK now.
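
For anyone patching the package by hand until an updated build is posted, the www-preserving form of the line quoted above would presumably just drop the Replace call. This is an assumption about the intended fix, not code taken from the 1.4 release:

CacheKey = new Uri(discoveriesRow.AbsoluteUri); // assumed fix: keep the www prefix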

 

 

Top 10 Contributor
1,905 Posts

Is everything OK for you?  How can I help?  Are you saying that after you ran the DB update script everything is OK now?


Top 150 Contributor
2 Posts

It seems to be OK.

After about 20 hours the crawler is still running without any of the OutOfMemory errors I saw in 1.2.

Maybe in the next few days I'll ask you something in the forum (right now I'm testing version 1.4 with the plugins I developed while using 1.2).

Thanks

 

Top 10 Contributor
1,905 Posts

Great!  Please do.  I am always glad to help.

-Mike

