Ok. Almost completely up and running with VS2008 and SQL2008. Here are a few questions that I can't seem to track down myself.
I'm getting closer. Any help is appreciated.
J
Hey J -
It's late for me - all good questions... let me get back to you over the weekend!
But, good news - Version 1.2 is close... I'm hoping that within a week we'll have it ready!!!
Mike
An open source .NET web crawler written in C# using SQL 2005/2008.
Twitter: http://twitter.com/arachnode_net
arachnode.net provides custom crawling and contracting resources. Please ask.
1.) Check out the stored procedure and remove the line shown below: (I have removed this from the SP that will be part of Version 1.2.)
dbo.arachnode_usp_arachnode.net_RESET_DATABASE - EXEC dbo.arachnode_omsp_CrawlRequests_INSERT @Datetime, 'http://arachnode.net/', 4, 0, 1, null
2.) That would work. An even easier way would be to submit CrawlRequests with a high depth, say '100', and restrict crawling to CrawlRequests and WebPages.
3.) I think this actually may be a bug that I fixed for the upcoming Version 1.2 release. :( (Sorry...) If you stop a crawl it doesn't check when saving the requests back to the DB.
4.) Check out the 'Integration' project in the solution. This is where the terms get split. The switch you found in the DB is for extracting Text and XML from the WebPages (inserted into WebPages_MetaData) and for enabling creation of an HtmlAgilityPack HtmlDocument. (The HtmlDocument functionality is very memory intensive.)
5.) Yes - disallowQueryStrings.
<?xml version="1.0" encoding="utf-8" ?>
<crawlRules>
<rule assemblyName="Arachnode.SiteCrawler" typeName="Arachnode.SiteCrawler.Rules.AbsoluteUri" isEnabled="true" order="1" ruleType="3" outputIsDisallowedReason="true" disallowNamedAnchors="true" disallowQueryStrings="false" ...
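To make the two flags on that rule a bit more concrete, here is a rough sketch of the kind of check a rule like this might perform. This is not arachnode.net's code (the real rule lives in Arachnode.SiteCrawler.Rules.AbsoluteUri and is written in C#). The function name `is_disallowed` and its return values are made up for illustration, and I am assuming that setting disallowQueryStrings to "true" means any URI with a query string gets skipped.

```python
from urllib.parse import urlsplit


def is_disallowed(uri, disallow_named_anchors=True, disallow_query_strings=False):
    """Return a reason string if the URI would be skipped, else None.

    Hypothetical helper: the defaults mirror the attribute values shown
    in the config above, but the logic is only an assumption about what
    such a rule checks.
    """
    parts = urlsplit(uri)
    # A named anchor is the part after '#', e.g. /page#section
    if disallow_named_anchors and parts.fragment:
        return "named anchor"
    # A query string is the part after '?', e.g. /page?id=1
    if disallow_query_strings and parts.query:
        return "query string"
    return None


# With the defaults above, query strings are still allowed:
print(is_disallowed("http://arachnode.net/page?id=1"))
# Flipping the flag rejects the same URI:
print(is_disallowed("http://arachnode.net/page?id=1", disallow_query_strings=True))
```

Returning a reason instead of a plain boolean is loosely modeled on the outputIsDisallowedReason attribute, but that mapping is a guess as well.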