Can I post to the crawler through a web page to start crawling and, once it has started, return a different page to the client (with my results after I have fetched them)? I ask because I tried to create an application which is almost the same as the console one, just without the 'console' stuff (I also retrieved all references), and when I send the request it seems to reach Engine.Stop() instantly without crawling anything, while if I test it in the console app it crawls a lot of stuff before it actually gets there.
Is there any particular reason why you need to run AN from a WebPage? It can be done, but IMO AN runs best from a console or from the included Service. I believe it would be best to run AN from the included Service, which could check your webpage for a 'start' value, and then start crawling. The Service implementation has been much more thoroughly tested than running AN from an ASP.NET page. Thoughts?
Take a look at Application (may not be in the solution but will be on disk...) This shows you how to run AN from a WebPage. It can be a bit tricky, based on your timeouts, but if you run AN in a background thread all should work fine.
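A minimal sketch of the background-thread idea, not taken from the AN sources. BackgroundCrawl, TryStart and the crawl delegate are hypothetical names; the delegate stands in for whatever code starts AN and blocks until the crawl has finished. The page would call TryStart and return at once, and a later request could check IsRunning before reading results.

```csharp
using System;
using System.Threading;

// Hypothetical helper (not part of AN): runs a long crawl off the request thread.
public static class BackgroundCrawl
{
    private static int _running; // 0 = idle, 1 = crawl in progress

    public static bool IsRunning
    {
        get { return Volatile.Read(ref _running) == 1; }
    }

    // 'crawl' is an assumed delegate that starts the crawl and blocks until it is done.
    public static bool TryStart(Action crawl)
    {
        // Refuse to start a second crawl while one is already in progress.
        if (Interlocked.CompareExchange(ref _running, 1, 0) != 0)
        {
            return false;
        }

        var thread = new Thread(() =>
        {
            try
            {
                crawl();
            }
            finally
            {
                // Mark the crawl as finished even if it threw.
                Volatile.Write(ref _running, 0);
            }
        });
        thread.IsBackground = true; // do not keep the process alive for the crawl
        thread.Start();
        return true;
    }
}
```

Note that a worker thread started from a web page can be cut short if the host recycles the application, which is one reason the Service or console route may be more robust.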
Thanks for sharing your modification.
Have you tried running any perf tests on your code? It may be simpler/faster to create an additional RegEx that matches "onclick", or simply "href=" (after filtering) than running index checks.
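As a rough illustration of the "simply href=" idea, here is a sketch of a single RegEx that captures the link directly, so no index arithmetic is needed afterwards. HrefExtractor and Extract are hypothetical names, and the pattern is an assumption rather than the one AN ships with; the group name "HyperLink" is chosen only to mirror the group used in AssignHyperLinkDiscoveries.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of AN).
public static class HrefExtractor
{
    // Matches href="..." or href='...' inside an <a ...> tag, regardless of
    // other attributes (such as onclick) that come before or after it.
    // The opening quote is captured in 'q' and must be closed by the same quote.
    private static readonly Regex AnchorHref = new Regex(
        @"<a\s[^>]*?\bhref\s*=\s*(?<q>[""'])(?<HyperLink>.*?)\k<q>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline | RegexOptions.Compiled);

    // Returns the raw href values in document order.
    public static List<string> Extract(string html)
    {
        var links = new List<string>();
        foreach (Match match in AnchorHref.Matches(html))
        {
            links.Add(match.Groups["HyperLink"].Value);
        }
        return links;
    }
}
```

This is not a full HTML parser: an attribute value containing '>' before the href, or an unquoted href, would be missed, so it would need testing against real pages before replacing the existing expression.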
I will answer this question after you explain how you modified SiteCrawler.dll.
Thank you for the quick reply.
In SiteCrawler the only change I made was the previous one: in the DiscoveryManager I changed the regex of the hyperlink, and in the hyperlink matches function I added a part which cuts out the parts you used (because my regex got the whole "<a href ... *** > </a>" and yours copied only the "href=***", so I cut the values and pasted them in instead).
Other than that I didn't touch SiteCrawler, but I opened a new ClassLibrary and copied the Program file and App.config from the console application; in the Program file I commented out the console parts.
In the Program (the copied one) I also commented out _crawler.Engine.OnCrawlRequestCompleted += Engine_OnCrawlRequestCompleted; and the Engine_OnCrawlRequestCompleted function (do I need to keep this?).
Other than that, I call the function from my ASP.NET page and let it run.
If you think that is the cause, then I can paste you the changed parts if you wish, or just roll back to the original version, although the change in DiscoveryManager is quite important and I really don't think it is the cause.
Also, as far as I checked, the original console project runs smoothly.
Thank you for your time and patience.
Itay
How do you have the code for DiscoveryManager.cs?
I will help you, but please tell me how you have the code inside the SiteCrawler project.
This is my client account: itayeng = Flash, same user. itayeng is the developer who works for the site that needs Arachnode; Flash is the client who bought the application.
Thank you very much. This clears it up!
I will answer your questions in :45.
OK, would you mind sharing the RegEx you created?
You don't need to explicitly wire the Engine events.
I need to read your other posts, trying to figure out what you are trying to do...
I am completely confused as to what you are trying to do. Could you write out the steps (1. 2. 3.) as to what you need to accomplish?
Thanks!
Mike
What I am trying to do:
1. I am trying to create an ASP.NET page which would be called by some application once every so often (probably once a day, not sure yet though).
2. The ASP.NET page would launch AN on a few sites (the AN code would be in a different class library, and all that would be in the ASP.NET page would be SomeClassLibrary.Run();).
3. The class library would check a list of sites and crawl them.
4. When AN finishes its job, I would perform my own actions with the data from the AN databases and insert it, in a different form more fitting to my domain model, into a different database.
About the DiscoveryManager change:
There was also an issue where AN didn't give me all link results. After a look in DiscoveryManager I saw that it doesn't retrieve hyperlinks which contain "onclick" and some other attributes, so I changed the regex to the following:
@"<a[\s]+[^>]*?href[\s]?=[\s\""\']+(.*?)[\""\']+.*?>([^<]+|.*?)?<\/a>"
which now gives me the FULL <a href /> instead of only the href part.
So in the AssignHyperLinkDiscoveries function, after "if (!match.Value.ToLower().StartsWith("<script")){ "... I added:
string value = FixMatchValueBug(match.Value); // this gives you "href=LINK" like you had for match.Value
string groupValue = value.Replace("href=", "").Replace("'","").Replace('"',' ').Trim(); // this gives you only the LINK like you had for
I also added the following function:
// The next part is a bit messy; it is the function that returns the "href=LINK"
private static string FixMatchValueBug(string value)
{
    int nHrefStartIndex = value.IndexOf("href="); // gets the starting index of href
    char perfix = value.ToCharArray()[nHrefStartIndex + "href=".Length]; // gets the char after href=, to see if it is ' or "
    int nHrefEndIndex = value.IndexOf(perfix, nHrefStartIndex + "href=".Length + 1); // gets the ending index of href=LINK
    return value.Substring(nHrefStartIndex, nHrefEndIndex - nHrefStartIndex); // returns the string from href up to its end
}
And lastly, I changed the following 2 lines in the AssignHyperLinkDiscoveries function from:
if (Uri.TryCreate(match.Groups["HyperLink"].Value.TrimEnd('/'), UriKind.RelativeOrAbsolute, out hyperLinkDiscovery))
crawlRequest.Tag = match.Value;
to:
if (Uri.TryCreate(groupValue.TrimEnd('/'), UriKind.RelativeOrAbsolute, out hyperLinkDiscovery))
crawlRequest.Tag = value;
It is just that we don't want to have anything open on the computer which launches AN, and we want it to happen automatically every XXX amount of time.
I will take a look at the things you suggested, and if I have any further problems I will contact you.
About the code change: I didn't try to create an additional regex or anything like that. The reason is that I totally suck at regex, so I just copied a regex from the ASP.NET forums and changed the values to fit ;)
OK, got it. As an alternative, so you don't have to worry about oddities calling AN from a WebPage... would it be feasible for you to call the Console via Process.Start(...);?
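A minimal sketch of launching the console through Process.Start, under a few assumptions: ConsoleCrawlLauncher and consoleExePath are hypothetical names, the path must point at wherever the AN console executable actually lives, and the console is assumed to exit on its own when the crawl completes (if it waits for a key press, WaitForExit would block).

```csharp
using System;
using System.Diagnostics;
using System.IO;

// Hypothetical helper (not part of AN).
public static class ConsoleCrawlLauncher
{
    // 'consoleExePath' is an assumed full path to the AN console executable.
    public static ProcessStartInfo BuildStartInfo(string consoleExePath)
    {
        return new ProcessStartInfo
        {
            FileName = consoleExePath,
            // Run from the executable's own folder so that any relative paths
            // the console uses resolve as if it had been launched from there.
            WorkingDirectory = Path.GetDirectoryName(consoleExePath),
            UseShellExecute = false, // start the process directly, not via the shell
            CreateNoWindow = true    // nothing visible on the machine that launches it
        };
    }

    // Starts the console, blocks until it exits, and returns its exit code.
    public static int RunAndWait(string consoleExePath)
    {
        using (Process process = Process.Start(BuildStartInfo(consoleExePath)))
        {
            process.WaitForExit();
            return process.ExitCode;
        }
    }
}
```

Called from a web page, RunAndWait would hold the request open for the whole crawl, so starting the process without waiting and checking for results later may fit better.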
I will take a look at your code and see if I can make a RegEx that covers 'onclick' events.
-Mike