arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL 2005/2008/CE

webPagesRow coming back null...

Answered (Verified) This post has 1 verified answer | 23 Replies | 2 Followers

Top 10 Contributor
Male
101 Posts
Kevin posted on Mon, Feb 16 2009 3:19 PM

Probably the way I pulled this down but getting an error.  I am walking through it but wanted to throw a note up.

Pulled down latest from svn, but tried to keep my connectionstring info there.  THANKS for making the changes so that there are no longer any hard coded connectionstrings anywhere that I can see.

Left db as is.

Recompiled and am just trying to run the web search page directly.

SearchResults.ascx.cs has webPagesRow coming back null:
                            ArachnodeDataSet.WebPagesRow webPagesRow = arachnodeDAO.GetWebPage(Path.GetFileNameWithoutExtension(discoveryPath));

the discoveryPath looks to be:
"C:\\AppWorkspace\\arachnode.net\\source\\Console\\DownloadedWebPages\\http\\dev\\communityserver\\com\\forums\\979.aspx"

Not sure where it is grabbing this - must be the first search result that comes back with a hit on my search of 'test'.  So it's in the section where the file does NOT exist.

Anyway webPagesRow comes back null which makes the following bomb:
                            ManagedWebPage managedWebPage = webPageManager.ManageWebPage(webPagesRow.ID, webPagesRow.AbsoluteUri, webPagesRow.Source, webPagesRow.FullTextIndexType, false, false, true);

I know I took a shortcut in pulling down latest and trying to recompile-run, but I assume this could be a bona fide error we need to catch?

Thx

 

Answered (Verified) Verified Answer

Top 10 Contributor
1,694 Posts
Verified by arachnode.net

I believe I have a better strategy for finishing the reporting procedures.  So, Friday, Saturday and Sunday are dedicated to finishing 1.1.

To solve the 'One Place One ConnectionString' problem I think something like this will need to be implemented: http://geekswithblogs.net/akraus1/articles/75391.aspx

Tonight, I added a configuration parameter to set the Request timeout.

Also, I have a goal to get the current dataset and lucene.net indexes to 1,000,000 indexed pages.  We're at 74,888 right now.  We should be at 100,000 by morning.

 

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

All Replies

Top 10 Contributor
1,694 Posts

There is still a hardcoded connection string in IsDisallowed.cs, sadly.

I added code in the Search functionality to try and retrieve the page from the database if it isn't found on disk.  So, if the WebPage is returning NULL then 979 isn't in the WebPages table.  Either that, or there is an error in retrieving WebPage 979.

Yes.  We should be trapping this error and logging it in the database.  I just added some code that updates the page at the bottom of the search page when a page/result isn't found and logs the error as well.
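
The lookup-then-trap flow described above can be sketched roughly as follows. This is only an illustration under stated assumptions: WebPageRow, SearchResultResolver and the lookup/materialize/logError delegates are hypothetical stand-ins for ArachnodeDataSet.WebPagesRow, arachnodeDAO.GetWebPage, webPageManager.ManageWebPage and arachnodeDAO.InsertException, not the code that was actually checked in.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for ArachnodeDataSet.WebPagesRow.
public sealed class WebPageRow
{
    public long ID;
    public string AbsoluteUri;
}

public static class SearchResultResolver
{
    // lookup:      stand-in for the database lookup; returns null when no row exists.
    // materialize: stand-in for rebuilding the page on disk; returns the discovery path.
    // logError:    stand-in for writing to the Exceptions table (uri may be null).
    // Returns the discovery path, or null when the result should be skipped.
    public static string Resolve(
        string webPageId,
        Func<string, WebPageRow> lookup,
        Func<WebPageRow, string> materialize,
        Action<string, Exception> logError)
    {
        WebPageRow row = null;
        try
        {
            row = lookup(webPageId);
            if (row == null)
            {
                // No row for this ID (for example after a database reset): log and skip.
                logError(null, new InvalidOperationException(
                    "WebPage " + webPageId + " was not found in the WebPages table."));
                return null;
            }
            return materialize(row);
        }
        catch (Exception exception)
        {
            // row can still be null here, so check it before reading AbsoluteUri.
            logError(row != null ? row.AbsoluteUri : null, exception);
            return null;
        }
    }
}
```

A caller in the results loop would decrement its hit count and continue to the next result whenever Resolve returns null.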

Thanks for catching this!!!

I'm working hard on getting the Reporting right - I had to change tack on the Reporting Views - managing 80+ views and stored procedures is pretty draining.

Top 10 Contributor
Male
101 Posts
Kevin replied on Mon, Feb 16 2009 7:18 PM

Yeah I plugged in a working connectionstring, and tweaked the Settings.Settings file with my connection info, and can do crawls.

But the search definitely bombs because of the error that needs to be trapped.  I'll see if you make a tweak to the code in the next day or so, rather than tweak it myself.  Maybe you'll have some comments on my reply regarding ideas on the priority stuff and end up doing something there too.

Keep up the good work!

Thx

 

Top 10 Contributor
1,694 Posts

Just checked in the error handling code.  :)

All good stuff on the priority post.  I'll work on nailing down 1.1 and will tackle whatever strategy we come up with for 1.2.

-Mike

Top 10 Contributor
Male
101 Posts
Kevin replied on Mon, Feb 16 2009 8:23 PM

I'm wondering if my data is maybe whacked?  In your new code below, the following line is still giving an error:

if (webPagesRow.AbsoluteUri != null)

It's webPagesRow that is null, so the above line will itself throw a null reference exception.  I'll walk thru the code and see why webPagesRow is returning null.

try
{
    webPagesRow = ArachnodeDAO.GetWebPage(Path.GetFileNameWithoutExtension(discoveryPath));

    WebPageManager webPageManager = new WebPageManager(ArachnodeDAO);

    ManagedWebPage managedWebPage = webPageManager.ManageWebPage(webPagesRow.ID, webPagesRow.AbsoluteUri, webPagesRow.Source, webPagesRow.FullTextIndexType, false, false, true);
    managedWebPage.FileStream.Close();
    discoveryPath = managedWebPage.DiscoveryPath;
}
catch (Exception exception)
{
    if (webPagesRow.AbsoluteUri != null)
    {
        _arachnodeDAO.InsertException(webPagesRow.AbsoluteUri, null, exception);
    }
    else
    {
        _arachnodeDAO.InsertException(null, null, exception);
    }
    TotalNumberOfHits--;

    continue;
}

Top 10 Contributor
Male
101 Posts
Kevin replied on Mon, Feb 16 2009 8:24 PM

Sorry, that didn't paste very well did it :(

Top 10 Contributor
1,694 Posts

Just checked in again.  My bad.

Is your working directory \AppWorkspace too?

Does WebPage 979 (or whatever it was) exist in the DB?

Top 10 Contributor
Male
101 Posts
Kevin replied on Mon, Feb 16 2009 8:32 PM

Don't remember which test I just did (which search term I used).  But, the discovery path is:

discoveryPath = "C:\\AppWorkspace\\arachnode.net\\source\\Console\\DownloadedWebPages\\http\\en\\wikipedia\\org\\wiki\\3461.htm"

...and no that file does not exist.

I had done a db reset, then started a new crawl against the default uri you provide in program.cs.  Actually I think you removed this from the current revision but I grabbed from previous :)

 

Top 10 Contributor
1,694 Posts

Did the file exist at one point?  Resetting the database also resets the IDENTITY SEEDS.

I bet it would be good once the product matures a bit more to have the service install itself and protect the lucene.net index files like SQL does with its databases...

Top 10 Contributor
Male
101 Posts
Kevin replied on Mon, Feb 16 2009 8:43 PM

Yeah.  And maybe I'll just change:

if (webPagesRow.AbsoluteUri != null)

to be:

if ((webPagesRow != null) && (webPagesRow.AbsoluteUri != null))

UPDATE

I made that change above, and now down in SearchResults.ascx.cs I'm getting a null error at:

uxLblStrength.Text = Document.GetField("strength").StringValue();

...no strength field in the Document.  Should I manually delete all lucene index files and do a reset just to make sure all is in synch?!

Thx
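
A missing index field can be guarded in the same spirit as the webPagesRow check. The sketch below assumes the Lucene.NET 2.x API that the line above is written against, where Document.GetField returns null when the document has no field of that name and StringValue() is a method. DocumentFieldReader and GetStringOrDefault are hypothetical helper names, not part of arachnode.net or Lucene.NET.

```csharp
using System;
using Lucene.Net.Documents;

public static class DocumentFieldReader
{
    // Returns the stored string value of fieldName, or fallback when the
    // document was indexed without that field (e.g. an index built by an
    // older schema that never wrote "strength").
    public static string GetStringOrDefault(Document document, string fieldName, string fallback)
    {
        Field field = document.GetField(fieldName);
        return field != null ? field.StringValue() : fallback;
    }
}
```

With a helper like this, the label assignment could read uxLblStrength.Text = DocumentFieldReader.GetStringOrDefault(Document, "strength", "n/a"); where "n/a" is just an example placeholder. This only hides the symptom, though: documents written under a different field layout would still need re-indexing to get a real value.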

Top 10 Contributor
1,694 Posts

Sure.  That works too.

I had to change the lucene.net indexes.  I removed a few fields and added one or two, and I have to write a converter for 1.0 to 1.1 indexes.  Sadly, I'm not 100% happy with myself for this... it's essentially a breaking change. I have to write a converter on the chance that someone has invested a lot of serious crawling time in the 1.0 indexes.

So, yeah - I would start fresh.  Keep the 1.0 indexes and I will write a converter/merger.

Top 10 Contributor
Male
101 Posts
Kevin replied on Mon, Feb 16 2009 8:58 PM

I zapped all the stored web pages, and all the lucene index files, and ran a db reset.  Then I did a walk that grabbed about 1000 domains and only about 100 webpages.

I killed the crawl, and sadly my lucene files look very small.  The search does not bomb now, but the search isn't returning anything.

I get 4 files in the currentcrawls folder, and 2 in the parent folder, but they are all 0k or 1k.

Any ideas?

Unfortunately I can't just do a revert in SVN.  I pull things down, convert the project/solution to vs2008, plug in my connectionstrings, etc.  Maybe I'm missing a config file change since I am not replacing them?

Sorry to waste time on this.

 

Top 10 Contributor
1,694 Posts

Lucene.net stores results in memory and they aren't immediately flushed.

Find the Stop() method in ManageLuceneDotNetIndexes.  Here's where the indexes will write to disk if the crawl is stopped.  If you click close on the console window you should get a dialog like the one for an application that is hanging.  When this happens the console is actually writing its state back to the DB.

Is your CrawlRequests table empty?

What is in the Exceptions table?

Top 10 Contributor
Male
101 Posts
Kevin replied on Mon, Feb 16 2009 9:32 PM

LOL yeah I have the single crawl request in the crawl table.  But I definitely missed the exceptions!  I'm getting a ton of:

Procedure or function arachnode_omsp_CrawlRequests_SELECT has too many arguments specified.

I totally forgot to even look here.  sp change? Or did I miss a revision in the data layer?

 EDIT:

Yeah, arachnode_omsp_CrawlRequests_SELECT in my db has 3 parms but my ArachnodeDataset.xsd has 4 defined.  Looks like a CreateCrawlRequestsFromDatabaseFiles parm was added.  Do I need to pull down some sp updates?
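
One way to confirm this kind of mismatch without opening the procedure by hand is to count the parameters SQL Server has on record and compare that with what the data layer sends. A minimal sketch, assuming SQL Server and its standard INFORMATION_SCHEMA.PARAMETERS view; StoredProcedureInspector and CountParameters are hypothetical names, and the connection string is whatever your own setup uses.

```csharp
using System;
using System.Data.SqlClient;

public static class StoredProcedureInspector
{
    // Returns how many parameters the named stored procedure declares
    // in the database the connection string points at.
    public static int CountParameters(string connectionString, string procedureName)
    {
        const string sql =
            "SELECT COUNT(*) FROM INFORMATION_SCHEMA.PARAMETERS WHERE SPECIFIC_NAME = @name";

        using (SqlConnection connection = new SqlConnection(connectionString))
        using (SqlCommand command = new SqlCommand(sql, connection))
        {
            command.Parameters.AddWithValue("@name", procedureName);
            connection.Open();
            return Convert.ToInt32(command.ExecuteScalar());
        }
    }
}
```

Calling CountParameters(connectionString, "arachnode_omsp_CrawlRequests_SELECT") and getting 3 while the dataset defines 4 would point to the database scripts being behind the code.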

Top 10 Contributor
1,694 Posts

Yeah - I did a database checkin too.  :)  I should have clarified this.  Burning the candle at both ends here.  Every time I have to touch the reporting views they always kick my ass.


copyright 2004-2013, arachnode.net LLC