arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE
An Open Source C# web crawler with Lucene.NET search using MongoDB/RavenDB/Hadoop

How to avoid creating discoveries?

Answered (Verified) This post has 1 verified answer | 15 Replies | 2 Followers

Dinesh posted on Tue, May 21 2013 6:48 AM

Hi All,

I wrote a small plugin (a CrawlAction) that extracts specific content from a website and saves it to the database. Here is a piece of code from my plugin. I take the crawl requests (websites) from CrawlRequests.txt, and this .txt file contains only one website name, such as acquia.com. So I want to execute the PerformAction method below only once, but it executes more than 100 times. I think acquia.com has 100+ discoveries, which is why PerformAction executes 100+ times. How can I avoid creating discoveries, or how can I avoid the repeated executions?

 

public override void PerformAction(CrawlRequest crawlRequest, ArachnodeDAO arachnodeDAO)
{
    // Get the URL specified
    _crawlRequest = crawlRequest.Parent.Uri.AbsoluteUri;

    var webGet = new HtmlWeb();
    var document = webGet.Load(_crawlRequest);

    // Collect the hyperlinks on the page
    var companyLnks = document.DocumentNode.SelectNodes("//a");
    if (companyLnks != null)
    {
        foreach (var lnk in companyLnks)
        {
            var href = lnk.Attributes["href"];

            // Keep the last Facebook link found on the page
            if (href != null && href.Value.Contains("https://www.facebook.com/"))
            {
                _fbLink = href.Value;
            }
        }
    }

    _activityDateTime = DateTime.Now.ToString();
    arachnodeDAO.InsertWebLinks(_crawlRequest, _fbLink, _activityDateTime);
}

 

Verified Answer

Verified by arachnode.net

Set ApplicationSettings.InsertHyperLinks = true, use IsStorable in the plugin, and let CrawlRequestManager.cs insert the HyperLinks.
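Separately from the settings above, a plugin could also guard its own work so that it runs only once per site, however many discoveries the crawler produces. What follows is a minimal sketch in plain .NET, not anything taken from arachnode.net itself: the class name OncePerHostGuard and the method TryEnter are hypothetical, and it assumes the plugin is handed a System.Uri for each page.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical helper (not part of arachnode.net): remembers which hosts
// have already been handled so per-site work runs only once.
public sealed class OncePerHostGuard
{
    // Thread-safe set of hosts seen so far; the bool value is unused.
    private readonly ConcurrentDictionary<string, bool> _seenHosts =
        new ConcurrentDictionary<string, bool>(StringComparer.OrdinalIgnoreCase);

    // Returns true only for the first call made with a given host.
    public bool TryEnter(Uri uri)
    {
        if (uri == null)
        {
            throw new ArgumentNullException("uri");
        }

        // TryAdd is atomic, so only one caller per host gets true.
        return _seenHosts.TryAdd(uri.Host, true);
    }
}
```

Assuming the plugin keeps a static OncePerHostGuard field, PerformAction could start with a check such as "if (!guard.TryEnter(crawlRequest.Parent.Uri)) return;" so that later calls for the same site exit early. Note that www.acquia.com and acquia.com count as different hosts in this sketch.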

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

All Replies


Not sure what to tell you.  ???  This is code patterned after this:

Did you create the TableAdapter with something simple like "Select * From Companies"?

There are so many ways to get data in .NET - if you have run into some crazy .NET bug, just pick another one...  https://www.google.com/webhp?sourceid=chrome-instant&ion=1&ie=UTF-8#sclient=psy-ab&q=get%20data%20from%20a%20database%20C%23&oq=&gs_l=&pbx=1&fp=701c806c0df3bf7b&ion=1&bav=on.2,or.r_cp.r_qf.&bvm=bv.47534661,d.cGE&biw=1468&bih=901

return (ArachnodeDataSet.CompDataTable)_CompDataTable.Rows?;

Compare against the WebPagesTableAdapter; perhaps you have some setting that is different. BTW, this is standard .NET stuff, nothing arachnode.net-specific.

Thanks.


copyright 2004-2017, arachnode.net LLC