arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE
An Open Source C# web crawler with Lucene.NET search using MongoDB/RavenDB/Hadoop


How does AN deal with forms?


Top 10 Contributor
229 Posts
megetron posted on Fri, Sep 4 2009 10:47 PM

Hello,

There are pages on the web that link to other pages like this:

<a href="javascript:__doPostBack('ctl00$ctl00$B$B$ctl00$Pager','0')"><b>1</b></a>


<script type="text/javascript">
//<![CDATA[
var theForm = document.forms['aspnetForm'];
if (!theForm) {
    theForm = document.aspnetForm;
}
function __doPostBack(eventTarget, eventArgument) {
    if (!theForm.onsubmit || (theForm.onsubmit() != false)) {
        theForm.__EVENTTARGET.value = eventTarget;
        theForm.__EVENTARGUMENT.value = eventArgument;
        theForm.submit();
    }
}
//]]>
</script>

 

Can AN crawl links like this by submitting the form?


All Replies

Top 10 Contributor
Male
101 Posts
Kevin replied on Sat, Sep 5 2009 8:41 AM

Good question.  I don't think AN will see this as a valid hyperlink.

Even if it does see it and store it, AN will not be able to "follow" the link anywhere.

Take a look at the regular expressions in (I think) DiscoveryManager.cs, and maybe set a breakpoint to walk through the parsing.

 

Top 10 Contributor
1,905 Posts

The link will get parsed, but it will fail once AN tries to convert it into an AbsoluteUri, so the HyperLink will be rejected.
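
To picture the rejection described above, here is a rough sketch in JavaScript. It is not AN's code, and AN's actual rejection path in C# may differ. The helper name `toCrawlableUrl` and the example.com URLs are made up for illustration. The idea is only that an href is kept when it resolves to an http or https address, and a `javascript:` href never does.

```javascript
// Hypothetical helper, not part of arachnode.net.
// Resolves an href against the page it was found on and returns an
// absolute http(s) URL string, or null when the link cannot be fetched.
function toCrawlableUrl(href, baseUrl) {
  let url;
  try {
    url = new URL(href, baseUrl); // throws on input that cannot be parsed
  } catch (err) {
    return null;
  }
  // "javascript:" parses as a URL, but it names a script to run, not a page.
  if (url.protocol !== "http:" && url.protocol !== "https:") {
    return null;
  }
  return url.href;
}

const pagerHref = "javascript:__doPostBack('ctl00$ctl00$B$B$ctl00$Pager','0')";
console.log(toCrawlableUrl(pagerHref, "http://example.com/dir/list.aspx"));
console.log(toCrawlableUrl("page2.aspx", "http://example.com/dir/list.aspx"));
```

The first call yields null for the pager link from the question, and the second resolves the relative link against its page.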

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

Top 10 Contributor
229 Posts

OK then. Maybe this is a new feature, maybe not.

What I suggest is to find code that will call the JavaScript and get the string it returns. Whatever that string is, if it matches an AbsoluteUri pattern, add it to the CR instead of rejecting it.

Can this be done? Is it a problematic feature or a legitimate one?

This behaviour would stop AN from skipping such links and add more flexibility for complex links.

Please let me know.

Thank you.
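
One thing worth noting about the suggestion above: the `__doPostBack` function in the original post does not return a string. It fills in two hidden fields and submits the form. So a different technique from evaluating the script is to imitate that submit. Below is a minimal sketch, assuming the crawler has already scraped the form's other hidden inputs. The function names are hypothetical, nothing here comes from AN, and `__VIEWSTATE` appears only as an example of such a hidden field.

```javascript
// Hypothetical helpers for illustration; none of these names come from arachnode.net.

// Pull the two string arguments out of an href such as
// javascript:__doPostBack('ctl00$ctl00$B$B$ctl00$Pager','0')
function parsePostBackHref(href) {
  const pattern = /^javascript:\s*__doPostBack\(\s*'([^']*)'\s*,\s*'([^']*)'\s*\)/i;
  const match = pattern.exec(href);
  if (!match) {
    return null; // not a postback link
  }
  return { eventTarget: match[1], eventArgument: match[2] };
}

// Build the form-encoded body that submitting the form would send.
// hiddenFields: name/value pairs of the form's other hidden inputs,
// assumed to have been scraped from the page beforehand.
function buildPostBackBody(postBack, hiddenFields) {
  const params = new URLSearchParams();
  for (const [name, value] of Object.entries(hiddenFields)) {
    params.set(name, value);
  }
  // Same two assignments __doPostBack makes before calling submit().
  params.set("__EVENTTARGET", postBack.eventTarget);
  params.set("__EVENTARGUMENT", postBack.eventArgument);
  return params.toString();
}

const postBack = parsePostBackHref(
  "javascript:__doPostBack('ctl00$ctl00$B$B$ctl00$Pager','0')"
);
const body = buildPostBackBody(postBack, { __VIEWSTATE: "placeholder" });
console.log(body);
```

The body would then be sent as a POST to the form's action URL. The resulting page has no address of its own, so a crawler taking this route would probably need to identify it by URL plus request body, which is part of why this is more work than following an ordinary link.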

Top 10 Contributor
1,905 Posts

You are more than welcome to devise/propose a solution.  Let me know what you find!  :)

Here is an opinion on what Google does: http://www.webmasterworld.com/google/3378000.htm

My opinion: I don't plan to do it.  It's 2009, and I would venture to guess that every webmaster worth their salt knows that placing links in JavaScript like this means there's a good chance those pages won't be indexed.  Perhaps the behavior is intentional?

To crawl links as shown, AN would need to instantiate the AxSHDocVw (WebBrowser) control, which is a heavy control to say the least.  See my response to another question about pulling AdWords information: http://arachnode.net/forums/t/112.aspx

EDIT: JavaScript IS now supported, through the use of the Renderers.



copyright 2004-2017, arachnode.net LLC