There is an error in the hyperlink regex: it doesn't match href='http://yahoo.com'.
The correct version of the regex is:
Regex("href=[\"\'](?<HyperLink>.*?)[\"\']", RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Singleline);
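To make the difference concrete, here is a minimal sketch that wraps the pattern above in a small helper and runs it over markup containing both quote styles. The class and method names (`HyperLinkRegexDemo`, `ExtractHyperLinks`) are made up for illustration and are not part of Arachnode; only the pattern and the options come from the post.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical demo class; not part of the crawler's code base.
public static class HyperLinkRegexDemo
{
    // The pattern from the post above: either quote character may surround the href value.
    private static readonly Regex HyperLinkRegex = new Regex(
        "href=[\"\'](?<HyperLink>.*?)[\"\']",
        RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns every captured href value, in document order.
    public static List<string> ExtractHyperLinks(string html)
    {
        var hyperLinks = new List<string>();
        foreach (Match match in HyperLinkRegex.Matches(html))
        {
            hyperLinks.Add(match.Groups["HyperLink"].Value);
        }
        return hyperLinks;
    }
}
```

Calling `HyperLinkRegexDemo.ExtractHyperLinks` on `<a href='http://yahoo.com'>Yahoo</a>` should return the single entry `http://yahoo.com`, and because of `RegexOptions.IgnoreCase` an upper-case `HREF="..."` attribute is picked up as well.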
A while back I changed the RegEx for Megetron to accommodate pages like this:
http://www.movin.co.il/movie-861-Rest_Stop:_Don'T_Look_Back.html
Some sites don't encode their URLs.
I think I agree with you and am changing it back. :)
(a small trade-off for the speed you get with a regex, vs the slowness you get with DOM parsing...)
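The trade-off can be seen with the URL above. The sketch below assumes the page wraps that URL in double quotes (the thread does not show the actual markup) and compares the either-quote pattern with one possible alternative that uses a named backreference so the closing quote must equal the opening one. The alternative pattern and the names `QuoteTradeOffDemo`, `FirstWithEitherQuote` and `FirstWithSameQuote` are assumptions for illustration, not what Arachnode ships.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; not part of the crawler's code base.
public static class QuoteTradeOffDemo
{
    private const RegexOptions Options =
        RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Singleline;

    // Either quote opens and either quote closes (the pattern proposed in this thread).
    private static readonly Regex EitherQuote = new Regex(
        "href=[\"\'](?<HyperLink>.*?)[\"\']", Options);

    // Assumed alternative: \k<Quote> requires the closing quote to equal the opening one.
    private static readonly Regex SameQuote = new Regex(
        "href=(?<Quote>[\"\'])(?<HyperLink>.*?)\\k<Quote>", Options);

    // First href value found by the either-quote pattern ("" if none).
    public static string FirstWithEitherQuote(string html)
    {
        return EitherQuote.Match(html).Groups["HyperLink"].Value;
    }

    // First href value found by the same-quote pattern ("" if none).
    public static string FirstWithSameQuote(string html)
    {
        return SameQuote.Match(html).Groups["HyperLink"].Value;
    }
}
```

For a double-quoted href holding the URL above, the either-quote pattern stops at the apostrophe and yields `http://www.movin.co.il/movie-861-Rest_Stop:_Don`, while the same-quote pattern yields the whole URL. If a page wraps such a URL in single quotes, neither pattern can recover it, which is the small trade-off mentioned above.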
There is/was a reason why I didn't include this. I believe it was due to the extra 'noise' and non-HyperLink matches that it creates/created.
Check your exceptions table to ensure that the RegEx isn't generating invalid AbsoluteUris.
If it works for you, then great!
I agree with you that your RegEx will match invalid HTML such as your example.
This change will have to be tested by both of us to ensure that there aren't other cases that need to be handled, such as invalid matches generating additional errors for AbsoluteUris that can't be parsed.
Let me know what you find in your exceptions table, and if mine looks good I will make the official change.
Sincerely, Mike
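For anyone who wants a quick local check before looking at the exceptions table, the following is a rough sketch that runs the pattern over a page and lists the matches that do not parse as absolute URIs. The class and method names (`AbsoluteUriCheckDemo`, `FindUnparseableMatches`) are hypothetical, and the sketch assumes nothing about how Arachnode itself resolves relative links or records exceptions; it only uses `Uri.TryCreate` from the .NET base class library.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical demo class; not part of the crawler's code base.
public static class AbsoluteUriCheckDemo
{
    // The either-quote pattern discussed in this thread.
    private static readonly Regex HyperLinkRegex = new Regex(
        "href=[\"\'](?<HyperLink>.*?)[\"\']",
        RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns the captured values that do NOT parse as absolute URIs.
    public static List<string> FindUnparseableMatches(string html)
    {
        var unparseable = new List<string>();
        foreach (Match match in HyperLinkRegex.Matches(html))
        {
            string candidate = match.Groups["HyperLink"].Value;
            Uri parsed;
            // TryCreate returns false instead of throwing when the string is not a valid absolute URI.
            if (!Uri.TryCreate(candidate, UriKind.Absolute, out parsed))
            {
                unparseable.Add(candidate);
            }
        }
        return unparseable;
    }
}
```

Two caveats apply. A relative link such as `page.html` is reported here even though a crawler would normally resolve it against the page's own address first. A truncated URL can still parse as a valid absolute URI, so an empty result does not prove that every match is correct.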