arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL 2005/2008/CE

Bug in DiscoveryManager


Top 10 Contributor
30 Posts
pp.ps posted on Fri, Apr 30 2010 4:17 AM

There is an error in the hyperlink regex. The regex doesn't match href='http://yahoo.com

The correct version of the regex is:

new Regex("href=[\"\'](?<HyperLink>.*?)[\"\']", RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Singleline);
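
To make the proposal concrete, here is a minimal sketch that runs the pattern above against both quoting styles. The class name HyperLinkRegexDemo, the Extract helper, and the sample HTML (including the closing quote on the yahoo.com link) are assumptions made for illustration; this is not code from arachnode.net's DiscoveryManager.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical demo class; not part of arachnode.net.
public static class HyperLinkRegexDemo
{
    // The pattern proposed in the post: either quote character may open or close the value.
    private static readonly Regex HyperLinkRegex = new Regex(
        "href=[\"\'](?<HyperLink>.*?)[\"\']",
        RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns the value of the HyperLink group for every match, in document order.
    public static List<string> Extract(string html)
    {
        var links = new List<string>();
        foreach (Match match in HyperLinkRegex.Matches(html))
        {
            links.Add(match.Groups["HyperLink"].Value);
        }
        return links;
    }

    public static void Main()
    {
        // Sample markup (an assumption): one double-quoted and one single-quoted href.
        string html = "<a href=\"http://example.com/a\">a</a> <A HREF='http://yahoo.com'>b</A>";
        foreach (string link in Extract(html))
        {
            Console.WriteLine(link);
        }
    }
}
```

With this sample input, both the double-quoted and the single-quoted link are captured, and IgnoreCase lets the upper-case HREF match as well.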

Verified Answer

Top 10 Contributor
1,696 Posts
Verified by pp.ps

A while back I changed the RegEx for Megetron to accommodate pages like this:

http://www.movin.co.il/movie-861-Rest_Stop:_Don'T_Look_Back.html

Some sites don't encode their URLs.

I think I agree with you and am changing it back. :)

(A small trade-off for the speed you get with a regex vs. the slowness you get with DOM parsing...)
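
The trade-off can be seen with the URL above. The sketch below is an illustration only: it shows the either-quote pattern from the original post cutting the link short at the unencoded apostrophe, and compares it with one possible alternative that uses a .NET named backreference so the closing quote has to be the same character as the opening one. The alternative pattern, the class name QuoteHandlingDemo, and both helper methods are assumptions, not the change that was made in arachnode.net.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; not part of arachnode.net.
public static class QuoteHandlingDemo
{
    private const RegexOptions Options = RegexOptions.IgnoreCase | RegexOptions.Singleline;

    // Pattern from the original post: any quote character ends the value.
    private static readonly Regex EitherQuote =
        new Regex("href=[\"'](?<HyperLink>.*?)[\"']", Options);

    // Assumed alternative: \k<Quote> requires the closing quote to equal the opening one.
    private static readonly Regex MatchingQuote =
        new Regex(@"href=(?<Quote>[""'])(?<HyperLink>.*?)\k<Quote>", Options);

    // Returns the first captured link, or an empty string when nothing matches.
    private static string FirstLink(Regex regex, string html)
    {
        Match match = regex.Match(html);
        return match.Success ? match.Groups["HyperLink"].Value : string.Empty;
    }

    public static string ExtractEitherQuote(string html)
    {
        return FirstLink(EitherQuote, html);
    }

    public static string ExtractMatchingQuote(string html)
    {
        return FirstLink(MatchingQuote, html);
    }

    public static void Main()
    {
        string html =
            "<a href=\"http://www.movin.co.il/movie-861-Rest_Stop:_Don'T_Look_Back.html\">Rest Stop</a>";

        // Stops at the apostrophe, so the captured link is truncated after "Don".
        Console.WriteLine(ExtractEitherQuote(html));

        // Runs to the closing double quote, so the whole URL is captured.
        Console.WriteLine(ExtractMatchingQuote(html));
    }
}
```

The backreference version still cannot cope with a single-quoted href whose URL contains an apostrophe, so it narrows the problem rather than removing it.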

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

All Replies

Top 10 Contributor
1,696 Posts

There was a reason why I didn't include this: I believe it was the extra 'noise' and non-HyperLink matches that it creates.

Check your exceptions table to ensure that the RegEx isn't generating invalid AbsoluteUris.

If it works for you, then great! :)
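
One way to gauge the 'noise' before it reaches the Exceptions table is to pre-check each captured value with System.Uri. The following is a rough sketch under stated assumptions: the class name CandidateUriCheck, the method IsCrawlableAbsoluteUri, and the rule of accepting only http and https schemes are all hypothetical and do not describe how arachnode.net validates AbsoluteUris.

```csharp
using System;

// Hypothetical helper; not part of arachnode.net.
public static class CandidateUriCheck
{
    // True only when the candidate parses as an absolute URI with an http or https scheme.
    // The scheme check also rejects values such as "javascript:void(0)" and, on Unix,
    // rooted paths that .NET would otherwise accept as absolute file URIs.
    public static bool IsCrawlableAbsoluteUri(string candidate)
    {
        return Uri.TryCreate(candidate, UriKind.Absolute, out Uri uri)
            && (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps);
    }

    public static void Main()
    {
        // Sample regex captures (assumed): one real link and some typical noise.
        string[] candidates = { "http://yahoo.com", "#top", "javascript:void(0)", "/relative/page.html" };
        foreach (string candidate in candidates)
        {
            Console.WriteLine(candidate + " -> " + IsCrawlableAbsoluteUri(candidate));
        }
    }
}
```

Relative values such as "/relative/page.html" are rejected here only because the sketch does not resolve them against a base address; a crawler would normally combine them with the page's URL first.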

Top 10 Contributor
1,696 Posts

I agree with you that your RegEx will match invalid HTML such as your example.

This change will have to be tested by both of us to ensure that there aren't other cases that need to be handled, such as invalid matches that generate additional errors for AbsoluteUris that can't be parsed.

Let me know what you find in your exceptions table, and if mine looks good I will make the official change.

Sincerely,
Mike


copyright 2004-2013, arachnode.net LLC