I have somehow gotten past the initial hurdles, and the first 5-minute crawl went fine.
Then I restarted the crawl, this time with AssignCrawlRequestsFromHyperlinks = true, so I assume it should continue from the previous discoveries. But I started getting these weird exceptions. Please have a look.
Cannot insert the value NULL into column 'Scheme_DiscoveryID1', table 'arachnode.net.dbo.Exceptions_Schemes_Discoveries'; column does not allow nulls. INSERT fails.
An error occurred in the Microsoft .NET Framework while trying to load assembly id 66354. The server may be running out of resources, or the assembly may not be trusted with PERMISSION_SET = EXTERNAL_ACCESS or UNSAFE. Run the query again, or check documentation to see how to solve the assembly trust issues. For more information about this error:
System.IO.FileLoadException: Could not load file or assembly 'arachnode.functions, Version=0.0.0.0, Culture=neutral, PublicKeyToken=null' or one of its dependencies. An error relating to security occurred. (Exception from HRESULT: 0x8013150A)
   at System.Reflection.Assembly._nLoad(AssemblyName fileName, String codeBase, Evidence assemblySecurity, Assembly locationHint, StackCrawlMark& stackMark, Boolean throwOnFileNotFound, Boolean forIntrospection)
   at System.Reflection.Assembly.nLoad(AssemblyName fileName, String codeBase, Evidence assemblySecurity, Assembly locationHint, StackCrawlMark& stackMark, Boolean throwOnFileNotFound, Boolean forIntrospection)
   at System.Reflection.Assembly.InternalLoad(AssemblyName assemblyRef, Evidence assemblySecurity, StackCrawlMark& stackMark,
   at System.Reflection.Assembly.InternalLoad(String assemblyString, Evidence assemblySecurity, StackCrawlMark& stackMark, Bo
   at System.Reflection.Assembly.Load(String assemblyString)
You likely turned off AutoGrowth for all FILEGROUPS.
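You can verify this per file rather than guessing; a diagnostic sketch against the standard SQL Server catalog views (`growth = 0` means autogrowth is off for that file):

```sql
-- List each file in arachnode.net with its filegroup, size, and growth setting.
USE [arachnode.net];

SELECT fg.name                AS filegroup_name,
       df.name                AS logical_file_name,
       df.size * 8 / 1024     AS size_mb,
       CASE WHEN df.growth = 0 THEN 'AUTOGROWTH OFF'
            WHEN df.is_percent_growth = 1
                 THEN CAST(df.growth AS varchar(10)) + '%'
            ELSE CAST(df.growth * 8 / 1024 AS varchar(10)) + ' MB'
       END                    AS growth_setting
FROM sys.database_files AS df
LEFT JOIN sys.filegroups AS fg
       ON df.data_space_id = fg.data_space_id;
```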
If a setting exists in VS or in SQL, I have already looked at it and set it to the correct value for AN. So, if these settings are changed, a new set of problems may arise.
Everything is set as it should be for the DB, according to the install instructions. If you deviate from this, e.g. by copying the DB to another server and trying to attach it, you will cause errors like the one from your other post; the same goes for turning off autogrowth, etc.
I suggest that you download a fresh copy, start from the installation instructions, and try to crawl ONE SITE by replacing the 'tmz.com' CrawlRequest in Program.cs. Change the boolean values for the ApplicationSettings and look at the commented code for the CrawlActions and CrawlRules. Just start from here and see what happens on disk and in the database.
It is much simpler to learn what does what in AN by changing these settings and looking in the database; also, search the site for the keywords. I have been religious about calling things the same name since day 1: a crawl action will never (shouldn't) be called 'crawl actions', but always 'CrawlActions'. So, taking a guess here, you can search for 'ExtractImageMetaData' and get more results than for 'extract image metadata'.
Let's see if I am right:
You can safely ignore the NULL error message. It is the result of a race condition: I don't synchronize inserts on the tables that correspond to Uri classification, because doing so incurs a performance hit. So, this is fine. You won't see it every time.
The other error? http://support.microsoft.com/kb/918040 This should solve it.
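That KB article boils down to restoring the database's trust after a move or restore, since SQLCLR assemblies marked EXTERNAL_ACCESS/UNSAFE won't load from an untrusted database. A hedged sketch of the usual fix (run as sysadmin; 'sa' as the owner is an assumption, use whatever owner your install expects):

```sql
-- After copying/attaching the DB it loses its trust settings, so the
-- arachnode.functions SQLCLR assembly fails to load (HRESULT 0x8013150A).
ALTER AUTHORIZATION ON DATABASE::[arachnode.net] TO [sa]; -- owner: assumption
ALTER DATABASE [arachnode.net] SET TRUSTWORTHY ON;
```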
Funny thing is that AN touches so many areas of .NET/CLR, and quite a bit of SQL, that for a few SQL errors this site has the top search results.
OK, so I will ignore the NULL message.
Related questions I want to ask:
1. At the end of a crawl, will I find all the raw webpages in the dbo.WebPages table? If I just want the text, is that the only table of interest?
I did not see any rows in that table after a ~5-minute crawl, even though I see raw files in the DownloadedWebPages directory and have InsertWebPages* set to true. Am I missing something?
For now I will just collect the raw data without adding my own plugin, and do the processing offline.
2. For maximum crawl speed, can I set ThreadSleepTimeInMillisecondsBetweenWebRequests to a very low value? I will crawl the site during off-peak hours, so it shouldn't be a problem. Is there any other relevant setting apart from the number of threads?
3. If the crawl has to be restarted for some reason (like this assembly error), will it resume perfectly, or will some links be lost? I guess I will need to reset the Exceptions* tables, right?
For the 2nd error I found a workaround for now: run the exe as admin :) And the link you sent is useful as well.
Now I am getting a DB insert failure (no space) exception on the REPORTING filegroup. It is already set to unrestricted growth, and I have plenty of empty space on the disk. I modified its initial size to 20 MB; right now it is about 4 MB and still complaining.
Hopefully these are all initial hiccups and, with your help, I will get it running smoothly soon :)
EDIT: Increased the REPORTING DB file's initial size to 500 MB. Now another such filegroup is complaining...
InsertException: Could not allocate space for object 'dbo.DisallowedAbsoluteUris_Discoveries'.'IX_DisallowedAbsoluteUris_Discoveries_2' in database 'arachnode.net' because the 'DISALLOWEDABSOLUTEURIS' filegroup is full. Create disk space by deleting unneeded files, dropping objects in the filegroup, adding additional files to the filegroup, or setting autogrowth on for existing files in the filegroup.
System.Data.SqlClient.SqlException: Could not allocate space for object 'dbo.Exceptions'.'PK_Exceptions' in database 'arachnode.net' because the 'EXCEPTIONS' filegroup is full. Create disk space by deleting unneeded files, dropping objects in the filegroup, adding additional files to the filegroup, or setting autogrowth on for existing files in the filegroup.
Engine: AssignCrawlRequestsToCrawls: Total.UncrawledCrawlRequests: 139.
InsertException: Could not allocate space for object 'dbo.HyperLinks_Hosts_Discoveries'.'PK_HyperLinks_Hosts_Discoveries' in database 'arachnode.net' because the 'REPORTING' filegroup is full. Create disk space by deleting unneeded files, dropping objects in the filegroup, adding additional files to the filegroup, or setting autogrowth on for existing files in the filegroup.
System.Data.SqlClient.SqlException: Could not allocate space for object 'dbo.Exceptions_Schemes_Discoveries'.'IX_Exceptions_Schemes_Discoveries' in database 'arachnode.net' because the 'REPORTING' filegroup is full. Create disk space by deleting unneeded files, dropping objects in the filegroup, adding additional files to the filegroup, or setting autogrowth on for existing files in the filegroup.
   at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection)
   at System.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection)
   at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj)
   at System.Data.SqlClient.TdsParser.Run(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj)
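The filegroup-full errors above all have the same shape: a file in the named filegroup has hit its current size with autogrowth disabled. Note that "unrestricted growth" (MAXSIZE) is a separate setting from the autogrowth increment (FILEGROWTH); a file can have an unlimited maximum size and still never grow if FILEGROWTH is 0. Rather than bumping initial sizes one filegroup at a time, a sketch that re-enables autogrowth (the logical file names here are assumptions, take the real ones from sys.database_files):

```sql
-- Re-enable autogrowth (64 MB increments, no cap) on the files backing the
-- complaining filegroups. Replace the logical names with your actual ones.
USE [master];

ALTER DATABASE [arachnode.net]
    MODIFY FILE (NAME = N'REPORTING', FILEGROWTH = 64MB, MAXSIZE = UNLIMITED);
ALTER DATABASE [arachnode.net]
    MODIFY FILE (NAME = N'EXCEPTIONS', FILEGROWTH = 64MB, MAXSIZE = UNLIMITED);
ALTER DATABASE [arachnode.net]
    MODIFY FILE (NAME = N'DISALLOWEDABSOLUTEURIS', FILEGROWTH = 64MB, MAXSIZE = UNLIMITED);
```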
1.) Turn on ExtractWebPageMetaData and InsertWebPageMetaData, and look in the WebPages_MetaData table. I don't know; you likely changed 'something'. As always, check my signature for how to troubleshoot.
2.) You can, but in the demo config this rule is turned off. General rule: the faster you crawl, the more likely it is that you will be blocked.
3.) If you Ctrl-C the crawl, the state is saved. If there is some problem with your actual VS/SQL setup, then you should consider the whole crawl invalid.