arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL 2005/2008/CE

Can I start the crawl action through a web site?

Answered (Verified) This post has 1 verified answer | 11 Replies | 2 Followers

Top 25 Contributor
12 Posts
itayeng posted on Wed, Jun 16 2010 4:45 AM

Can I post to the crawler through a web page to start crawling and, once the request has been accepted, return a different page to the client (with my results, after I have fetched them)?

I ask because I tried to create an application that is almost the same as the Console one, just without the 'console' parts (I also retrieved all the references), and when I send the request it seems to reach engine.Stop() instantly without crawling anything,

whereas when I test it in the Console app it crawls a lot of pages before it actually gets there.

 

All Replies

Top 10 Contributor
1,692 Posts

I will answer this question after you explain how you modified SiteCrawler.dll.

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

Top 25 Contributor
12 Posts

Thank you for the quick reply.

In SiteCrawler, the only change I made was the one I described previously:
in the DiscoveryManager I changed the hyperlink regex,
and in the hyperlink matches function I added a part that cuts out the pieces you used (my regex captures the whole "<a href ... *** > </a>", while yours captured only the "href=***", so I cut those values out and pasted them in instead).

Except for that I didn't touch SiteCrawler, but I created a new ClassLibrary and copied the Program file and App.config from the Console application. In the Program file I commented out the console parts.

In the copied Program I also commented out _crawler.Engine.OnCrawlRequestCompleted += Engine_OnCrawlRequestCompleted; and the Engine_OnCrawlRequestCompleted function (do I need to keep this?).

Apart from that, I call the function from my ASP.NET page and let it run.
If you think the change is the cause, I can paste the changed parts if you wish, or just roll back to the original version, although the change in DiscoveryManager is quite important and I really don't think it is the cause.

Also, as far as I have checked,
the original Console project runs smoothly.

Thank you for your time and patience.

Itay

Top 10 Contributor
1,692 Posts

How do you have the code for DiscoveryManager.cs?

I will help you, but please tell me how you have the code inside the SiteCrawler project.

Top 10 Contributor
35 Posts
flash replied on Wed, Jun 16 2010 7:20 AM

This is my client account.
itayeng = flash, the same user.
itayeng is the developer who works for the site that needs arachnode.net.
flash is the client who bought the application.

Top 10 Contributor
1,692 Posts

Thank you very much.  This clears it up!

I will answer your questions in :45.

Top 10 Contributor
1,692 Posts

OK, would you mind sharing the RegEx you created?

You don't need to explicitly wire the Engine events.

I need to read your other posts, trying to figure out what you are trying to do...

Top 10 Contributor
1,692 Posts

I am completely confused as to what you are trying to do.  Could you write out the steps (1. 2. 3.) for what you need to accomplish?

Thanks!

Mike

Top 10 Contributor
35 Posts
flash replied on Wed, Jun 16 2010 8:31 AM

What I am trying to do:

1.
I am trying to create an ASP.NET page
that would be requested by some application at a regular interval (probably once a day, not sure yet though).
2.
The ASP.NET page would launch AN on a few sites
(the AN code would be in a different class library, and all the ASP.NET page would contain is SomeClassLibrary.Run();).
3.
The class library would check a list of sites and crawl them.
4.
When AN finishes its job, I would run my own actions on the data from the AN databases and insert it, in a different form that better fits my Domain Model, into a different database.

About the DiscoveryManager change:
There was also an issue where AN didn't give me all link results. After a look in DiscoveryManager I saw that it doesn't retrieve hyperlinks that contain "onclick" and some other attributes,
so I changed the regex to the following:
@"<a[\s]+[^>]*?href[\s]?=[\s\""\']+(.*?)[\""\']+.*?>([^<]+|.*?)?<\/a>"

which now gives me the FULL <a href ...>...</a> instead of only the href part,
so in the AssignHyperLinkDiscoveries function, after
"if (!match.Value.ToLower().StartsWith("<script"))
{ "...
I added:
string value = FixMatchValueBug(match.Value); // this gives you the "href=LINK" part (with its opening quote), like you had for match.Value
string groupValue = value.Replace("href=", "").Replace("'", "").Replace("\"", "").Trim(); // this gives you only the LINK

I also added the following function:

// The next part is a bit messy: it is the function that returns the "href=LINK" part.
private static string FixMatchValueBug(string value)
{
            int nHrefStartIndex = value.IndexOf("href="); // start index of href (assumes a lowercase "href=" with no spaces around '=')
            char quote = value[nHrefStartIndex + "href=".Length]; // the char after href=, to see whether it is ' or "
            int nHrefEndIndex = value.IndexOf(quote, nHrefStartIndex + "href=".Length + 1); // index of the closing quote of href=LINK
            return value.Substring(nHrefStartIndex, nHrefEndIndex - nHrefStartIndex); // the string from href up to, but not including, the closing quote
}

And lastly, I changed the following 2 lines in the AssignHyperLinkDiscoveries function
from:
if (Uri.TryCreate(match.Groups["HyperLink"].Value.TrimEnd('/'), UriKind.RelativeOrAbsolute, out hyperLinkDiscovery))
crawlRequest.Tag = match.Value;
to:
if (Uri.TryCreate(groupValue.TrimEnd('/'), UriKind.RelativeOrAbsolute, out hyperLinkDiscovery))
crawlRequest.Tag = value;
Top 10 Contributor
1,692 Posts
Verified by itayeng

Is there any particular reason why you need to run AN from a WebPage?  It can be done, but IMO AN runs best from a console or from the included Service.  I believe it would be best to run AN from the included Service, which could check your webpage for a 'start' value, and then start crawling.  The Service implementation has been much more thoroughly tested than running AN from an ASP.NET page.  Thoughts?

Take a look at Application (may not be in the solution but will be on disk...)  This shows you how to run AN from a WebPage.  It can be a bit tricky, based on your timeouts, but if you run AN in a background thread all should work fine.
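
As a rough illustration of the background-thread idea (this is not the code from the Application project, and it does not use any arachnode.net API), the page could hand the crawl to a helper like the one below and return right away. BackgroundCrawlLauncher and its members are hypothetical names, and the Action passed in stands for whatever starts your crawl, e.g. SomeClassLibrary.Run.

```csharp
using System;
using System.Threading;

// Hypothetical helper (not part of arachnode.net): runs one long job on its
// own thread so the web request that triggered it can return immediately.
public static class BackgroundCrawlLauncher
{
    private static int _running; // 0 = idle, 1 = a crawl is in progress

    // Last exception thrown by the job, if any, so the page can report it later.
    public static volatile Exception LastError;

    public static bool IsRunning
    {
        get { return Interlocked.CompareExchange(ref _running, 0, 0) == 1; }
    }

    // Returns false if a crawl is already running; otherwise starts 'crawl'
    // on a new thread and returns true without waiting for it to finish.
    public static bool TryStart(Action crawl)
    {
        if (crawl == null) throw new ArgumentNullException("crawl");

        // Atomically flip idle -> running; only one caller can win.
        if (Interlocked.CompareExchange(ref _running, 1, 0) != 0)
        {
            return false;
        }

        var thread = new Thread(() =>
        {
            try { crawl(); }
            catch (Exception ex) { LastError = ex; }
            finally { Interlocked.Exchange(ref _running, 0); }
        });
        thread.IsBackground = true; // do not keep the host process alive
        thread.Start();
        return true;
    }
}
```

A page handler would call BackgroundCrawlLauncher.TryStart(SomeClassLibrary.Run) and render a "started" page, and a second page could poll IsRunning before reading results. Keep in mind that the web server can recycle the application while the thread is still working, which is one reason the Service is the safer host.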

Thanks for sharing your modification.

Have you tried running any perf tests on your code?  It may be simpler/faster to create an additional RegEx that matches "onclick", or simply "href=" (after filtering) than running index checks.
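
To make the "additional RegEx" suggestion concrete, here is a minimal sketch, assuming all you need is the quoted href value of each anchor regardless of which other attributes (onclick, class, ...) come before it. HrefExtractor is a hypothetical name and this pattern is not the one shipped in DiscoveryManager; the group name "HyperLink" only mirrors the name used in the AssignHyperLinkDiscoveries lines quoted in this thread.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not arachnode.net code.
public static class HrefExtractor
{
    // Matches the opening <a ...> tag and captures the href value, in either
    // double or single quotes, in the named group "HyperLink".
    // Limitation: an attribute value containing '>' before href is not handled.
    private static readonly Regex AnchorHref = new Regex(
        @"<a\b[^>]*?\bhref\s*=\s*(?:""(?<HyperLink>[^""]*)""|'(?<HyperLink>[^']*)')",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Returns every captured href value, in document order.
    public static List<string> Extract(string html)
    {
        var links = new List<string>();
        foreach (Match match in AnchorHref.Matches(html))
        {
            links.Add(match.Groups["HyperLink"].Value.Trim());
        }
        return links;
    }
}
```

Because the capture group already holds just the link, no IndexOf/Substring post-processing is needed. Whether this is faster than the index checks on your pages is something only a perf test on real content will show.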

Top 10 Contributor
35 Posts
flash replied on Thu, Jun 17 2010 12:24 AM

It is just that we don't want to have anything open on the computer that launches AN, and we want it to happen automatically at some set time each day.

I will take a look at the things you suggested, and if I have any further problems I will contact you.

About the code change:
I didn't try to create an additional regex or anything like that. The reason is that I totally suck at regex, so I just copied a regex from the ASP.NET forums and changed the values to fit ;)

Top 10 Contributor
1,692 Posts

OK, got it.  As an alternative, so you don't have to worry about oddities calling AN from a WebPage... would it be feasible for you to call the Console via Process.Start(...);?
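
A minimal sketch of the Process.Start route, assuming the Console project has been built to an .exe somewhere on the server. ConsoleCrawlerLauncher is a hypothetical name and the executable path is supplied by the caller; nothing here is arachnode.net-specific.

```csharp
using System;
using System.Diagnostics;
using System.IO;

// Hypothetical helper, not arachnode.net code.
public static class ConsoleCrawlerLauncher
{
    // Describes how to start the console executable without showing a window.
    public static ProcessStartInfo BuildStartInfo(string exePath)
    {
        return new ProcessStartInfo
        {
            FileName = exePath,
            // Run from the exe's own folder so relative paths resolve there.
            WorkingDirectory = Path.GetDirectoryName(exePath),
            UseShellExecute = false, // start the exe directly, not via the shell
            CreateNoWindow = true    // nothing pops up on the server's desktop
        };
    }

    // Starts the process and returns immediately; the caller may keep the
    // Process object to check HasExited or to wait for the crawl to end.
    public static Process Launch(string exePath)
    {
        return Process.Start(BuildStartInfo(exePath));
    }
}
```

The process runs under the identity of the caller (for an ASP.NET page, the application pool account), so that account needs permission to execute the file and to reach the database.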

I will take a look at your code and see if I can make a RegEx that covers 'onclick' events.

-Mike


copyright 2004-2013, arachnode.net LLC