So I ran this after I'd done some crawling and had data to hit. It ran for about 50 minutes!
I thought running this might generate the data needed for my web/search.aspx tests, which aren't returning anything. No such luck.
Per other post, any recommendations on tweaks to the lucenedotnetindex config settings to get that working? I was able to make a tweak by copying in some manual settings to the web.config in the web folder but I'm sure it's not right. It runs, but returns no data no matter what I do.
Looking forward to getting this working, and to understanding more about how what arachnode offers interacts with what Lucene offers.
Lots of good code too! Performance counters, a web service, and plugins (not sure what those are yet). And lots of nice stored procedure code taking advantage of SQL Server 2005.
Oh yes, here are the warnings I'm getting back after running the SP:
Warning: Null value is eliminated by an aggregate or other SET operation.
Msg 515, Level 16, State 2, Procedure arachnode_rsp_Files_MOST_POPULAR_EXTENSIONS_BY_ABSOLUTEURIS, Line 10
Cannot insert the value NULL into column 'InitiallyDiscovered', table 'arachnode.net.rpt.Files_MOST_POPULAR_EXTENSIONS_BY_ABSOLUTEURIS'; column does not allow nulls. INSERT fails.
The statement has been terminated.
Msg 515, Level 16, State 2, Procedure arachnode_rsp_Files_MOST_POPULAR_EXTENSIONS_BY_DOMAINS, Line 10
Cannot insert the value NULL into column 'InitiallyDiscovered', table 'arachnode.net.rpt.Files_MOST_POPULAR_EXTENSIONS_BY_DOMAINS'; column does not allow nulls. INSERT fails.
Msg 515, Level 16, State 2, Procedure arachnode_rsp_Files_MOST_POPULAR_EXTENSIONS_BY_HOSTS, Line 10
Cannot insert the value NULL into column 'InitiallyDiscovered', table 'arachnode.net.rpt.Files_MOST_POPULAR_EXTENSIONS_BY_HOSTS'; column does not allow nulls. INSERT fails.
Msg 515, Level 16, State 2, Procedure arachnode_rsp_HyperLinks_MOST_POPULAR_EXTENSIONS_BY_ABSOLUTEURIS, Line 13
Cannot insert the value NULL into column 'InitiallyDiscovered', table 'arachnode.net.rpt.HyperLinks_MOST_POPULAR_EXTENSIONS_BY_ABSOLUTEURIS'; column does not allow nulls. INSERT fails.
Msg 515, Level 16, State 2, Procedure arachnode_rsp_HyperLinks_MOST_POPULAR_EXTENSIONS_BY_DOMAINS, Line 13
Cannot insert the value NULL into column 'InitiallyDiscovered', table 'arachnode.net.rpt.HyperLinks_MOST_POPULAR_EXTENSIONS_BY_DOMAINS'; column does not allow nulls. INSERT fails.
Msg 515, Level 16, State 2, Procedure arachnode_rsp_HyperLinks_MOST_POPULAR_EXTENSIONS_BY_HOSTS, Line 13
Cannot insert the value NULL into column 'InitiallyDiscovered', table 'arachnode.net.rpt.HyperLinks_MOST_POPULAR_EXTENSIONS_BY_HOSTS'; column does not allow nulls. INSERT fails.
(1 row(s) affected)
OK. Bad news: You've found a bug. Good news: It's easy to fix.
The error occurs because the stored procedure is trying to insert one of the first 10 rows from either the Domains_Discoveries, Extensions_Discoveries, Hosts_Discoveries or Schemes_Discoveries tables into a reporting table. But, the reporting tables don't allow NULL values in the InitiallyDiscovered, DaysSinceInitiallyDiscovered and Strength columns.
When arachnode.net isn't able to parse an AbsoluteUri into a Domain, Extension, Host or Scheme the AbsoluteUri is assigned either an 'UNKNOWN' value or an empty string, depending on what the value of the AbsoluteUri is. An example of an AbsoluteUri that would cause an association of 'UNKNOWN' or an empty string could be something along the lines of: http://bogus. bogus.
Here's a look at the Hosts table to better illustrate what this means:
And a look at the Hosts_Discoveries table, to further the illustration:
If the AbsoluteUri was a WebPage, http://bogus. bogus would be assigned to either row 5 or row 10 in rpt.WebPages_Hosts_Discoveries. (See dbo.DiscoveryTypes for an explanation of DiscoveryTypeID)
In arachnode.net, Domains, Extensions, Hosts and Schemes are at the very top of the food chain, and in the foreign key chain too. If you delete a WebPage, then all EmailAddresses, Files and HyperLinks are removed from the database. Yet, the Domain, Extension, Host and Scheme information remains. Why? We can delete a WebPage from our database but deleting a WebPage doesn't necessarily mean that the Domain, etc. is gone from the internet and the world.
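One common way to get this "children go, parents stay" behavior is a cascading foreign key. The sketch below uses Python's built-in sqlite3 module as a stand-in for SQL Server, with a heavily simplified, hypothetical schema (the columns and keys are assumptions, not arachnode.net's real tables). Whether arachnode.net itself uses cascades, triggers or stored procedures for this isn't stated in this thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when enabled

# Hypothetical, simplified schema: a WebPage points up at a Host,
# and a HyperLink points up at the WebPage it was found on.
conn.executescript("""
CREATE TABLE Hosts (ID INTEGER PRIMARY KEY, Host TEXT);
CREATE TABLE WebPages (
    ID INTEGER PRIMARY KEY,
    AbsoluteUri TEXT,
    HostID INTEGER REFERENCES Hosts(ID)
);
CREATE TABLE HyperLinks (
    ID INTEGER PRIMARY KEY,
    AbsoluteUri TEXT,
    WebPageID INTEGER REFERENCES WebPages(ID) ON DELETE CASCADE
);
""")

conn.execute("INSERT INTO Hosts VALUES (1, 'example.com')")
conn.execute("INSERT INTO WebPages VALUES (1, 'http://example.com/', 1)")
conn.execute("INSERT INTO HyperLinks VALUES (1, 'http://example.com/a', 1)")

# Deleting the WebPage removes its HyperLinks, but the Host row is untouched.
conn.execute("DELETE FROM WebPages WHERE ID = 1")

def count(table):
    """Return the number of rows currently in the given table."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

counts = {t: count(t) for t in ("Hosts", "WebPages", "HyperLinks")}
print(counts)
```

The Host survives because nothing cascades upward: the WebPage references the Host, not the other way around.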
So, when we are unable to parse an AbsoluteUri we assign it a value of 'UNKNOWN' or an empty string. But, these two values aren't assigned an 'InitiallyDiscovered' as this column value is used for reporting, which is in turn used to influence Analysis Services or the Lucene.net indexes. As more than one real or bogus domain could be assigned to the default 'we can't parse you' placeholders it would be incorrect to grant undue influence to AbsoluteUris which arachnode.net couldn't properly parse.
The fix: Allow NULLs for InitiallyDiscovered, DaysSinceInitiallyDiscovered and Strength.
(I'll answer your lucene.net questions in your other post.)
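To see the failure and the fix in isolation, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for SQL Server. The two tables (`report_strict`, `report_nullable`) and the `Host` column are hypothetical; only the `InitiallyDiscovered` column name and the 'UNKNOWN' placeholder come from this thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical reporting tables: one forbids NULL in InitiallyDiscovered, one allows it.
conn.execute("CREATE TABLE report_strict (Host TEXT, InitiallyDiscovered TEXT NOT NULL)")
conn.execute("CREATE TABLE report_nullable (Host TEXT, InitiallyDiscovered TEXT)")

def try_insert(table, host, initially_discovered):
    """Insert one row; return True on success, False if a constraint rejects it."""
    try:
        conn.execute(
            f"INSERT INTO {table} (Host, InitiallyDiscovered) VALUES (?, ?)",
            (host, initially_discovered),
        )
        return True
    except sqlite3.IntegrityError:
        return False

# The 'UNKNOWN' placeholder row has no InitiallyDiscovered value.
strict_ok = try_insert("report_strict", "UNKNOWN", None)
nullable_ok = try_insert("report_nullable", "UNKNOWN", None)
print(strict_ok, nullable_ok)
```

The insert into the NOT NULL table is rejected, much like the Msg 515 errors above, while the nullable table accepts the placeholder row. On SQL Server the equivalent change would be made with ALTER TABLE ... ALTER COLUMN on each affected reporting table, using each column's existing data type.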
For best service when you require assistance:
Skype: arachnodedotnet
Hey -
Headed to band practice and then to a social engagement for the rest of the day. I'll answer all of your posts tonight or tomorrow! Thanks for the great posts! :)
-Mike
Super. Will check in later. Thanks :)
Great reply, thanks for the explanation. I like your db design so far. Good fundamentals, good use of stored procedures, and you're doing some assembly stuff I haven't done yet so I'm intrigued!
I'll make these tweaks. And I look forward to the replies on the other posts.
Kevin
Made those nullable tweaks to a number of tables. Also ran the reset SP to clear all the db data, then ran a fresh crawl to test the search stuff.
Interesting, and I'm not sure what the cause is: after doing the db tweaks and the reset, it almost looks like no matter what hard-coded crawl I put into program.cs, it's just crawling stuff off of arachnode.net. Is it possible the reset SP did something I'm not expecting?
I'll start walking through the SPs and code but wanted to mention this.
Thx, Kevin
I'll answer this one further tonight, when I have better access to the screenshot tool.
When you reset the database, a CrawlRequest is created, starting at http://arachnode.net.
I figured as much, but it just seemed like my manually entered crawl request wasn't getting done. Maybe each run is grabbing another level of crawls for the newly discovered domains from the last arachnode.net crawl?
:)
Try inserting the CrawlRequest into the CrawlRequests database table.
It's possible that the Engine and Crawls are off and running before your CrawlRequest is processed.