arachnode.net

How to avoid creating discoveries?


Dinesh posted on Tue, May 21 2013 6:48 AM

Hi All,

I wrote a small plugin (a CrawlAction) which extracts specific content from a website and saves it into the database. Here is a piece of code from my plugin. I am taking the crawl requests (websites) from CrawlRequests.txt, and this .txt file has only one website name, acquia.com. I want the PerformAction method below to execute only once, but it is executing more than 100 times. I think acquia.com has 100+ discoveries, and that is why PerformAction executes 100+ times. How can I avoid creating discoveries, or avoid the multiple executions?

 

public override void PerformAction(CrawlRequest crawlRequest, ArachnodeDAO arachnodeDAO)
{
    _crawlRequest = crawlRequest.Parent.Uri.AbsoluteUri.ToString();

    // Get the URL specified
    var webGet = new HtmlWeb();
    var document = webGet.Load(_crawlRequest);

    // hyperlink weblinks
    var companyLnks = document.DocumentNode.SelectNodes("//a");

    if (companyLnks != null)
    {
        foreach (var lnk in companyLnks)
        {
            if (lnk.Attributes["href"] != null)
            {
                if (lnk.Attributes["href"].Value.Contains("https://www.facebook.com/"))
                {
                    _fbLink = lnk.Attributes["href"].Value.ToString();
                }
            }
        }
    }

    _activityDateTime = DateTime.Now.ToString();

    arachnodeDAO.InsertWebLinks(_crawlRequest, _fbLink, _activityDateTime);
}

 

Verified Answer

arachnode.net (Mike) replied:

Set ApplicationSettings.InsertHyperLinks = true; and use IsStorable in the plugin and let the CrawlRequestManager.cs insert the HyperLinks.

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

All Replies


Dinesh replied on Tue, May 21 2013 11:54 PM

Hi

I am new to the AN crawler. Can you explain where I need to set ApplicationSettings.InsertHyperLinks = true, and how I use IsStorable in the plugin?

arachnode.net (Mike) replied:

Look in Console\Program.cs - this is where the settings are set.

Set IsStorable = true to store the HyperLink (crawlRequest.IsStorable and/or the discovery.IsStorable) - in your case, use the 'Discovery' method, and not the 'CrawlRequest' method.
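
For reference, a minimal sketch of those two pieces (only the property and member names quoted in this thread are used; the surrounding code in Console\Program.cs is omitted):

    // In Console\Program.cs, where the ApplicationSettings values are assigned
    // before the crawl starts:
    ApplicationSettings.InsertHyperLinks = true;   // let CrawlRequestManager.cs insert the HyperLinks

    // In your plugin, flag the Discovery itself (the 'Discovery' method) rather
    // than the CrawlRequest:
    discovery.IsStorable = true;                   // or false, for anything you don't want stored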


Dinesh replied on Thu, May 23 2013 8:01 AM

Hi

If you don't mind, I am not clear on your reply (i.e. the 'Discovery' method versus the 'CrawlRequest' method).

As per my requirement, we need to crawl specific text data from a number of websites. I am not interested in having discoveries, saving web pages, creating indexes, creating folders, etc. I just want to provide the website domains (e.g. www.google.com) from the database to the crawler, crawl the specific text data from those websites, and save it to the database. To achieve this I did the following:

1. Commented out all unnecessary code in Program.cs.

2. Created the plugin below and changed the plugin name in Program.cs as well as in the database CrawlActions table:

public override void AssignSettings(Dictionary<string, string> settings)
{
}

/// <summary>
/// Stops this instance.
/// </summary>
public override void Stop()
{
}

/// <summary>
/// Performs the action.
/// </summary>
/// <param name="crawlRequest">The crawl request.</param>
/// <param name="arachnodeDAO">The arachnode DAO.</param>
public override void PerformAction(CrawlRequest crawlRequest, ArachnodeDAO arachnodeDAO)
{
    _crawlRequest = crawlRequest.Parent.Uri.AbsoluteUri.ToString();

    // Get the URL specified
    var webGet = new HtmlWeb();
    var document = webGet.Load(_crawlRequest);

    // hyperlink weblinks
    var companyLnks = document.DocumentNode.SelectNodes("//a");

    if (companyLnks != null)
    {
        foreach (var lnk in companyLnks)
        {
            if (lnk.Attributes["href"] != null)
            {
                // Facebook
                if (lnk.Attributes["href"].Value.Contains("https://www.facebook.com/") ||
                    lnk.Attributes["href"].Value.Contains("http://www.facebook.com/") ||
                    lnk.Attributes["href"].Value.Contains("https://facebook.com/") ||
                    lnk.Attributes["href"].Value.Contains("http://facebook.com/"))
                {
                    _fbLink = lnk.Attributes["href"].Value.ToString();
                }
            }
        }
    }

    _activityDateTime = DateTime.Now.ToString();

    arachnodeDAO.InsertWebLinks(_tempCrawlRequest, _crawlRequest, _fbLink, _activityDateTime);
}

3. Since I don't need any discoveries, I just commented out the section below in Engine.cs:

//create the DiscoveryProcessor...
//DiscoveryProcessor discoveryProcessor = new DiscoveryProcessor(_crawler);
//Thread thread2 = new Thread(discoveryProcessor.BeginDiscoveryProcessor);
//thread2.Name = "DiscoveryProcessorThread:" + (i + 1);
//discoveryProcessor.Thread = thread2;
//DiscoveryProcessors.Add(i + 1, discoveryProcessor);

4. After doing the above steps, I think we are facing thread-safety issues. Data is not saving into the database properly.

My requirements are very simple and straightforward.

Are we doing this the right way, or can you please help us achieve these requirements?

arachnode.net (Mike) replied:

1. Commented out all unnecessary code in Program.cs - What does this mean?  What did you comment out?

2. Created the below plugin and changed plugin name in program.cs as well as in database crawlactions table - OK, good.

3. Since I don't need any discoveries, I just commented out the section below in Engine.cs - You don't need to comment out anything in AN - everything can be turned off with switches from ApplicationSettings.cs.  Please revert your changes to all files except your plugin.  Smile

4. After doing the above steps, I think we are facing thread-safety issues. Data is not saving into the database properly - The plugins are 100% thread safe.  You added a method, 'InsertWebLinks' - is there an error there?  Are there any exceptions in the Exceptions database table?

So... in the plugin, you are making an additional web request from the HtmlAgilityPack - this is unnecessary.

Look at crawlRequest.DecodedHtml - AN has already downloaded the HTML for you - assuming you have set the plugin as a PostRequest plugin.

Use HtmlDocument.LoadHtml(...);

I see that _crawlRequest is named like a private member variable - if this is declared at the class level multiple threads will be trying to change this, so yes, you could be experiencing threading issues.

So...  

Use HtmlDocument htmlDocument = new HtmlDocument();

htmlDocument.LoadHtml(crawlRequest.DecodedHtml);

then... do whatever you need to do with the code.
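
For example, a minimal sketch of that approach (assuming HtmlAgilityPack is referenced and the plugin runs as a PostRequest CrawlAction, so crawlRequest.DecodedHtml is already populated; the local variables are the important part for thread safety):

    using HtmlAgilityPack;

    public override void PerformAction(CrawlRequest crawlRequest, ArachnodeDAO arachnodeDAO)
    {
        // AN has already downloaded the page - no extra web request is needed.
        HtmlDocument htmlDocument = new HtmlDocument();
        htmlDocument.LoadHtml(crawlRequest.DecodedHtml);

        HtmlNodeCollection anchors = htmlDocument.DocumentNode.SelectNodes("//a[@href]");

        if (anchors != null)
        {
            foreach (HtmlNode anchor in anchors)
            {
                string href = anchor.Attributes["href"].Value;

                if (href.Contains("facebook.com/"))
                {
                    // local variable, not a member field shared between crawl threads
                    string facebookLink = href;

                    // ... store facebookLink however you need to (e.g. through arachnodeDAO)
                }
            }
        }
    }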

The easier way to accomplish this is to create a CrawlRule.  Set it to PreRequest, and in the method that processes Discoveries (an AbsoluteUri which will eventually be classified as a File, Image or WebPage), check whether discovery.Uri.Host.ToLowerInvariant().Contains("facebook.com")...  if it does, set discovery.IsStorable = true;, else set discovery.IsStorable = false;.  Then, if you have set ApplicationSettings.InsertHyperLinks = true;, the CrawlRequestManager.cs (where you commented out the DiscoveryProcessor code) will insert the HyperLinks for you.

So, in summary: revert all changes and get a fresh copy of the source.  Create a CrawlRule - you can use _crawler.AddCrawlRule in Console\Program.cs - and set it to PreRequest.  In the IsDisallowed(Discovery discovery, ArachnodeDAO arachnodeDAO) method, check the discovery.Uri.Host property for "facebook.com" - if you find it, set IsStorable = true;, else set IsStorable = false;.  Set ApplicationSettings.InsertHyperLinks = true and ApplicationSettings.MaximumNumberOfCrawlRequests = 1, submit ONE AbsoluteUri in CrawlRequests.txt, set a breakpoint in your CrawlRule, and look at the HyperLinks database table.
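
A sketch of that CrawlRule, using only the method signatures and property names quoted in this thread (the real base class may require additional overrides):

    public override bool IsDisallowed(CrawlRequest crawlRequest, ArachnodeDAO arachnodeDAO)
    {
        // nothing is disallowed at the CrawlRequest level
        return false;
    }

    public override bool IsDisallowed(Discovery discovery, ArachnodeDAO arachnodeDAO)
    {
        // store only Facebook HyperLinks; everything else is crawled but not stored
        discovery.IsStorable = discovery.Uri.Host.ToLowerInvariant().Contains("facebook.com");

        // returning false means the Discovery is NOT disallowed, so no Reason is required
        return false;
    }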

Thanks,
Mike

 


Dinesh replied on Mon, May 27 2013 8:54 AM

Hi Mike,

I hope you understand my requirements after our many exchanges. As you suggested in your previous reply, I did the following:

  1. Took a fresh copy of the source code and customized it as per my requirements.

  2. I want to crawl web links from absolute URIs. So, I created the CrawlAction below, which takes the absolute URIs from CrawlRequests.txt, captures the links and finally saves them into the database:

public override void PerformAction(CrawlRequest crawlRequest, ArachnodeDAO arachnodeDAO)
{
    HtmlDocument htmlDocument = new HtmlDocument();
    htmlDocument.LoadHtml(crawlRequest.DecodedHtml);

    // hyperlink weblinks
    var companyLnks = htmlDocument.DocumentNode.SelectNodes("//a");

    if (companyLnks != null)
    {
        foreach (var lnk in companyLnks)
        {
            if (lnk.Attributes["href"] != null)
            {
                // Facebook
                if (lnk.Attributes["href"].Value.Contains("https://www.facebook.com/") ||
                    lnk.Attributes["href"].Value.Contains("http://www.facebook.com/") ||
                    lnk.Attributes["href"].Value.Contains("https://facebook.com/") ||
                    lnk.Attributes["href"].Value.Contains("http://facebook.com/"))
                {
                    _fbLink = lnk.Attributes["href"].Value.ToString();
                }
            }
        }
    }

    _activityDateTime = DateTime.Now.ToString();

    arachnodeDAO.InsertWebLinks(crawlRequest.Parent.Uri.AbsoluteUri.ToString(), _fbLink, _twrLink, _activityDateTime);
}

  3. We don't want to crawl the discoveries of the absolute URIs. So, I created the CrawlRule below and configured the IsDisallowed methods:

     

public override bool IsDisallowed(CrawlRequest crawlRequest, ArachnodeDAO arachnodeDAO)
{
    return false;
}

public override bool IsDisallowed(Discovery discovery, ArachnodeDAO arachnodeDAO)
{
    discovery.IsStorable = true;
    return discovery.IsStorable;
}

4. Changed the CrawlAction name and CrawlRule name in Program.cs as well as in the database CrawlActions table. As per our requirement, we are now able to crawl only the absolute URIs and not the discoveries, but we noticed an exception like "Cannot insert the value NULL into column 'Reason', table 'arachnode.net.dbo.DisallowedAbsoluteUris'; column does not allow nulls. INSERT fails. The statement has been terminated." in the Exceptions table. How do I avoid this exception?

I also have a few questions:

  1. After we wrote the CrawlAction and CrawlRule, we crawled 300 websites using AN; it is taking on average 6 minutes, which is a long time. Any suggestions to reduce the time?

  2. As mentioned earlier, the absolute URIs are taken from CrawlRequests.txt, but we need to take them from a database table instead. Can you suggest how to approach this task?

  3. Is it possible to have multiple absolute URIs in a single crawl request, so that we can access multiple absolute URIs in the PerformAction method?

  4. Can we have the complete server configuration to run AN faster? You already gave us partial information regarding the server, but we are looking for the complete server configuration, proxies, etc. We are using SQL Server 2008 R2 Enterprise Edition. Is this version enough, given that we crawl a lot of data?

  5. The AN crawler deletes and re-creates folders like DownloadedFiles, DownloadedImages, DownloadedWebPages, the Lucene indexes, etc. every time. We don't want all this deletion and re-creation; can't we comment out the code that does it? Also, the AN crawler saves all the web pages to disk, and we don't want to save any web pages. What do you suggest?

  6. Is it necessary to execute the 'arachnode_usp_arachnode.net_RESET_DATABASE' stored procedure every time?

Can you please help us?

arachnode.net (Mike) replied:

OK, once again...  Smile

To be clear, again...  A Discovery is anything found on the web - it becomes a File/Image/WebPage once the specific type is determined.

Delete this: arachnodeDAO.InsertWebLinks(crawlRequest.Parent.Uri.AbsoluteUri.ToString(), _fbLink, _twrLink, _activityDateTime);

Actually, delete the whole plugin too.  AN already parses out the HyperLinks -> crawlRequest.Discoveries.HyperLinkDiscoveries.

Let the CrawlRequestManager.cs insert the HyperLinks.  ApplicationSettings.InsertHyperLinks = true;

If you only want to store FB links do this: 

 

public override bool IsDisallowed(CrawlRequest crawlRequest, ArachnodeDAO arachnodeDAO)
{
    return false;
}

public override bool IsDisallowed(Discovery discovery, ArachnodeDAO arachnodeDAO)
{
    // check to see if the Discovery is a FB link, and if it isn't, set discovery.IsStorable = false;
    // do NOT try to parse the page out again - it has already been done.
    // Try looking at the information that is present in the crawlRequest and discovery objects.
    discovery.IsStorable = true;

    return false;
}

Changed the CrawlAction name and CrawlRule name in Program.cs as well as in the database CrawlActions table. As per our requirement, we are now able to crawl only the absolute URIs and not the discoveries, but we noticed an exception like "Cannot insert the value NULL into column 'Reason', table 'arachnode.net.dbo.DisallowedAbsoluteUris'; column does not allow nulls. INSERT fails. The statement has been terminated." in the Exceptions table. How do I avoid this exception?

Try looking at any of the other plugins.  Your code has an error: it was disallowing the Discovery but wasn't providing a reason why it was disallowed.  You AREN'T disallowing anything, just changing whether a Discovery can be stored.

 

  1. After we wrote the CrawlAction and CrawlRule, we crawled 300 websites using AN; it is taking on average 6 minutes, which is a long time. Any suggestions to reduce the time?

    What are you crawling?  What hardware?  What network connection?  How many threads?  300 WebSites, or do you mean WebPages?

  2. As mentioned earlier, the absolute URIs are taken from CrawlRequests.txt, but we need to take them from a database table instead. Can you suggest how to approach this task?

    Sure.  There are many ways to get data from SQL into .NET; choose one.  :)  Or, look at arachnodeDAO.InsertCrawlRequest(...);

    Where AN gets the CrawlRequests from CrawlRequests.txt, just grab them from a database instead (see the sketch after this list).

    https://www.google.com/search?rlz=1C1CHFX_enUS531US531&output=search&sclient=psy-ab&q=how+to+debug+in+visual+studio&oq=how+to+debug+in+visual&gs_l=hp.3.0.0l9.7194.7655.1.8357.6.6.0.0.0.0.200.950.0j5j1.6.0.cpsugrccggmnoe..0.0...1.1.14.hp.EisexMyURd8&pbx=1&biw=1463&bih=977&cad=cbv&sei=B5CjUaqnNeOZiQKoxYDADg#rlz=1C1CHFX_enUS531US531&sclient=psy-ab&q=how+to+get+data+from+sql+.net&oq=how+to+get+data+from+sql+.net&gs_l=serp.3..33i29i30l4.120126.125197.0.125283.36.31.3.0.0.0.360.6457.0j13j13j4.30.0.cpsugrccggmnoe..0.0...1.1.14.psy-ab.McuipO5z9hQ&pbx=1&bav=on.2,or.r_cp.r_qf.&bvm=bv.47008514,d.cGE&fp=6c7f8a5fed4db490&biw=1463&bih=977

  3. Is it possible to have multiple absolute URIs in a single crawl request, so that we can access multiple absolute URIs in the PerformAction method?

    You can't - you can't ask a web server to serve you more than one page at a time through the HTTP protocol.

  4. Can we have the complete server configuration to run AN faster? You already gave us partial information regarding the server, but we are looking for the complete server configuration, proxies, etc. We are using SQL Server 2008 R2 Enterprise Edition. Is this version enough, given that we crawl a lot of data?

    Let's just focus on getting AN to work and downloading/storing what you want before throwing even more information at you.  :)  Please tell me the EXACT specifications of your hardware.  What are the EXACT specifications of your network? 

  5. The AN crawler deletes and re-creates folders like DownloadedFiles, DownloadedImages, DownloadedWebPages, the Lucene indexes, etc. every time. We don't want all this deletion and re-creation; can't we comment out the code that does it? Also, the AN crawler saves all the web pages to disk, and we don't want to save any web pages. What do you suggest?

    Read Console\Program.cs - the whole thing, end to end.  Really, read the WHOLE THING.  The switches that reset the database are at the top.  Look for ApplicationSettings.SaveDiscoveredWebPagesToDisk.  Set to 'false'.  Helpful links on how to debug with Visual Studio: https://www.google.com/search?rlz=1C1CHFX_enUS531US531&output=search&sclient=psy-ab&q=how+to+debug+in+visual+studio&oq=how+to+debug+in+visual&gs_l=hp.3.0.0l9.7194.7655.1.8357.6.6.0.0.0.0.200.950.0j5j1.6.0.cpsugrccggmnoe..0.0...1.1.14.hp.EisexMyURd8&pbx=1&biw=1463&bih=977&cad=cbv&sei=B5CjUaqnNeOZiQKoxYDADg

  6. Is it necessary to execute the 'arachnode_usp_arachnode.net_RESET_DATABASE' stored procedure every time?

    Only if you want to reset the database.  See the answer to question 5.
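
Regarding answer 2 above (the sketch referenced there): one plain ADO.NET way to pull the AbsoluteUris out of SQL Server. The table and column names (dbo.CompanyLinks, AbsoluteUri) are hypothetical, and the exact CrawlRequest constructor arguments are left out because they are not shown in this thread:

    // Hypothetical sketch - dbo.CompanyLinks / AbsoluteUri are placeholder names.
    using System.Data.SqlClient;

    string connectionString = "...";  // your arachnode.net connection string

    using (SqlConnection connection = new SqlConnection(connectionString))
    using (SqlCommand command = new SqlCommand("SELECT AbsoluteUri FROM dbo.CompanyLinks", connection))
    {
        connection.Open();

        using (SqlDataReader reader = command.ExecuteReader())
        {
            while (reader.Read())
            {
                string absoluteUri = reader.GetString(0);

                // Submit each AbsoluteUri exactly as Console\Program.cs submits the
                // entries it reads from CrawlRequests.txt, e.g.:
                // _crawler.Crawl(new CrawlRequest(new Discovery(absoluteUri), ...));
            }
        }
    }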

 

 



Dinesh replied:

Hi Mike, many thanks for helping us.

We are stuck on the 3rd point which I mentioned earlier:

3. Is it possible to have multiple absolute URIs in a single crawl request, so that we can access multiple absolute URIs in the PerformAction method?

Actually, as per my requirement, we want to save the number of FB likes and the number of Twitter followers for a company at the same time, in a single row. For instance:

1. I need to save the number of FB likes and the number of followers for Amazon. So, I have a table with columns for the company name (i.e. Amazon), the FB link for Amazon (i.e. https://www.facebook.com/Amazon) and the Twitter link (https://www.twitter.com/Amazon).

2. In the PerformAction method, I want to access both the FB link and the Twitter link as below and save the data in a single row. In the code below I have hard-coded the FB link and Twitter link, but these have to come from the database dynamically. I hope you understand my requirement. Please help.

public override void PerformAction(CrawlRequest crawlRequest, ArachnodeDAO arachnodeDAO)
{
    // this link has to come from the database
    crawlRewuestURL = "https://www.facebook.com/" + crawlRequest.Parent.Uri.Authority;
    noofLikes = CrawlFaceBookWebPage(crawlRewuestURL);

    // this link has to come from the database
    crawlRewuestURL = "https://twitter.com/" + crawlRequest.Parent.Uri.Authority;
    nooffollowers = CrawlDataTWR(crawlRewuestURL);

    arachnodeDAO.HistoryInformation(crawlRequest, noofLikes, nooffollowers);
}

private static string CrawlFaceBookWebPage(string crawlRequest)
{
    var document = webGet.Load(_crawlRequest);

    // hyperlink weblinks
    var companyLnks = document.DocumentNode.SelectNodes("//a");

    if (companyLnks != null)
    {
        foreach (var lnk in companyLnks)
        {
            if (lnk.Attributes["href"] != null)
            {
                if (lnk.Attributes["href"].Value.Contains("https://www.facebook.com/"))
                {
                    _fbLink = lnk.Attributes["href"].Value.ToString();
                }
            }
        }
    }

    return noofLikes;
}

private static string CrawlDataTWR(string crawlRequest)
{
    var document = webGet.Load(_crawlRequest);

    // hyperlink weblinks
    var companyLnks = document.DocumentNode.SelectNodes("//a");

    if (companyLnks != null)
    {
        foreach (var lnk in companyLnks)
        {
            if (lnk.Attributes["href"] != null)
            {
                if (lnk.Attributes["href"].Value.Contains("https://www.twitter.com/"))
                {
                    twitter = lnk.Attributes["href"].Value.ToString();
                }
            }
        }
    }

    return twitter;
}

 

Thanks

arachnode.net (Mike) replied:

1.) The purpose of a CrawlAction is to fire once for each CrawlRequest.  You should have yours set to PostRequest.

2.) It still looks like you are using member variables.  _fbLink = lnk.Attributes["href"].Value.ToString();  (not threadsafe)  Please learn about thread safety:  https://www.google.com/webhp?sourceid=chrome-instant&rlz=1C1CHFX_enUS531US531&ion=1&ie=UTF-8#rlz=1C1CHFX_enUS531US531&sclient=psy-ab&q=member%20variables%20thread%20safe%20c%23&oq=&gs_l=&pbx=1&fp=6a75cf603c6d3d5f&ion=1&bav=on.2,or.r_cp.r_qf.&bvm=bv.47244034,d.cGE&biw=1920&bih=1085

3.) crawlRewuestURL = "https://www.facebook.com/" + crawlRequest.Parent.Uri.Authority; - Looks like you are crawling, say, 'Amazon.com', and then asking Facebook and Twitter for their numbers.  Resubmit back to the Crawler through crawlRequest.Crawler.Crawl(...);

4.) var document = webGet.Load(_crawlRequest); - Please don't do this.  Again, not thread safe.  You are circumventing AN's download logic and bypassing all of the 'goodness' that AN provides by doing this.  Smile  The CA's are shared by all threads, so any member variables you use must be synchronized - or, just use local variables.  Notice how the ArachnodeDAO doesn't belong to the CA?  This is to keep the CA's thread safe.  The download functionality of the HtmlAgilityPack isn't thread safe by default.

5.) You don't need to submit the counts for both sites at once.  The way you have it set up now, a web exception (from the HAP) will cause the entire row not to be present.  (Error handling is a good thing...)  Modify your stored procedure to detect whether you are referencing Facebook or Twitter.  One CR will complete for Twitter and insert a new row; the other will complete for Facebook, updating the count for Facebook.  (And so on...)
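
A plain C# illustration of point 2 (nothing arachnode.net specific): a member field on an object shared by several threads is overwritten by whichever thread wrote last, while a local variable is private to each call.

    public class SharedAction
    {
        private string _fbLink;            // shared by every crawl thread - NOT thread safe

        public void PerformAction(string href)
        {
            _fbLink = href;                // thread B can overwrite thread A's value at any time

            string fbLink = href;          // local variable - each call (and thread) gets its own copy
            // ... do all of the work with fbLink instead of _fbLink
        }
    }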

Thanks,
Mike 


Dinesh replied on Tue, Jun 4 2013 12:26 PM

Thanks Mike, I will try your approach. I didn't understand what CA and CR mean in your reply.

Also, I am trying to get the absolute URIs from the database instead of CrawlRequests.txt. So, I have written the code below in Program.cs:

ArachnodeDAO _arachnodeDAO = new ArachnodeDAO();

foreach (ArachnodeDataSet.CompanyRow row in _arachnodeDAO.GetLinks())
{
    // some code like
    wasTheCrawlRequestAddedForCrawling = _crawler.Crawl(new CrawlRequest(new Discovery("http://" + absoluteUri2)........
}

but I am getting the error message below. How do I resolve it?

foreach statement cannot operate on variables of type 'Arachnode.DataSource.ArachnodeDataSet.CompanyDataTable' because 'Arachnode.DataSource.ArachnodeDataSet.CompanyDataTable' does not contain a public definition for 'GetEnumerator'

arachnode.net (Mike) replied:

CA = CrawlAction

CR = CrawlRequest

Try .Rows

https://www.google.com/webhp?sourceid=chrome-instant&ion=1&ie=UTF-8#sclient=psy-ab&q=does%20not%20contain%20a%20public%20definition%20for%20'GetEnumerator'&oq=&gs_l=&pbx=1&fp=c3099a6ee85cf32c&ion=1&bav=on.2,or.r_cp.r_qf.&bvm=bv.47380653,d.cGE&biw=1237&bih=901
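
For instance, a minimal sketch of that (GetLinks() and the CompanyRow/CompanyDataTable names are taken from the posts above; null checks omitted):

    foreach (System.Data.DataRow row in _arachnodeDAO.GetLinks().Rows)
    {
        ArachnodeDataSet.CompanyRow companyRow = (ArachnodeDataSet.CompanyRow)row;

        // e.g. build a CrawlRequest from companyRow's AbsoluteUri here
    }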


Dinesh replied on Wed, Jun 5 2013 12:38 PM

Hi Mike,

Thanks a lot for the quick replies, but we have not succeeded in getting the absolute URI from the database. In your previous reply you mentioned "Try .Rows" - can you expand on that? Do you want me to write something like "ArachnodeDataSet.Rows"? Can you please help us with this?

Also, another question from your previous reply: what do you mean by "Resubmit back to the Crawler through crawlRequest.Crawler.Crawl(...);"?

arachnode.net (Mike) replied:

GetLinks() returns a DataTable, but you want to get to the datatable's .Rows.

GetLinks().Rows

Please check the Google links.

Resubmit - try typing crawlRequest.Crawler.Crawl in a CrawlAction - what happens?  (Hint: It's the same thing that happens when you submit from Console\Program.cs)

You should probably read this entire thread again.  Smile


Dinesh replied on Thu, Jun 6 2013 11:48 AM

Hi Mike, 

1. As per our discussion, I wrote the following code in ArachnodeDAO.cs:

public ArachnodeDataSet.CompDataTable GetLinks()
{
    try
    {
        _CompDataTable.Clear();
        _CompDataTableAdapter.Fill(_CompDataTable);

        if (_CompDataTable.Count != 0)
        {
            return (ArachnodeDataSet.CompDataTable)_CompDataTable;
        }

        return null;
    }
    catch (Exception exception)
    {
        InsertException("", null, exception, false);
    }

    return null;
}

2. I wrote the following code in Program.cs:

foreach (ArachnodeDataSet.CompDataRow row in _arachnodeDAO.GetLinks())
{
    //
}

I am trying to access GetLinks().Rows in the code above, but I don't see .Rows when I type a dot after GetLinks().

Any suggestion on this?
