<?xml version="1.0" encoding="UTF-8" ?>
<?xml-stylesheet type="text/xsl" href="http://arachnode.net/utility/FeedStylesheets/rss.xsl" media="screen"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:wfw="http://wellformedweb.org/CommentAPI/"><channel><title>General Questions</title><link>http://arachnode.net/forums/7.aspx</link><description /><dc:language>en</dc:language><generator>CommunityServer 2008.5 SP2 (Build: 40407.4157)</generator><item><title>Re: How to restrict crawl to single domain?</title><link>http://arachnode.net/forums/thread/10112.aspx</link><pubDate>Sun, 24 May 2009 20:30:12 GMT</pubDate><guid isPermaLink="false">a2478770-777f-41ab-83b8-a21ff47ebb1f:10112</guid><dc:creator>arachnode.net</dc:creator><slash:comments>0</slash:comments><comments>http://arachnode.net/forums/thread/10112.aspx</comments><wfw:commentRss>http://arachnode.net/forums/commentrss.aspx?SectionID=7&amp;PostID=10112</wfw:commentRss><description>&lt;p&gt;You may be encountering a known bug that saves CrawlRequests from other domains when the Crawl process is shut down.&amp;nbsp; I&amp;#39;m working on a new build that should be available soon.&lt;/p&gt;
&lt;p&gt;Another user suggested the template idea - I&amp;#39;ll keep that in mind.&lt;/p&gt;
&lt;p&gt;If you are running from an official release - try downloading the code from the trunk.&lt;/p&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;</description></item><item><title>Re: How to restrict crawl to single domain?</title><link>http://arachnode.net/forums/thread/10105.aspx</link><pubDate>Fri, 22 May 2009 17:34:15 GMT</pubDate><guid isPermaLink="false">a2478770-777f-41ab-83b8-a21ff47ebb1f:10105</guid><dc:creator>megetron</dc:creator><slash:comments>0</slash:comments><comments>http://arachnode.net/forums/thread/10105.aspx</comments><wfw:commentRss>http://arachnode.net/forums/commentrss.aspx?SectionID=7&amp;PostID=10105</wfw:commentRss><description>&lt;p&gt;I am not sure.&lt;/p&gt;
&lt;p&gt;I have done everything you suggested, and for some reason the &lt;strong&gt;WebPages&lt;/strong&gt; table still contains lots of records from &lt;a href="http://sketchup.google.com"&gt;&lt;strong&gt;http://sketchup.google.com&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;What am I missing? I changed the configuration according to lots of other posts about crawling a single domain or several domains, and I still can&amp;#39;t get it to work. Please help me understand what I am doing wrong.&lt;/p&gt;
&lt;p&gt;Another thing, just a suggestion: maybe you could add a MODE feature. Modes could be Single Site, Several Sites, No Limits, By Country, and other popular settings.&lt;br /&gt;This may need lots of code changes, but if you kept all settings in one settings file (rather than in a database), you could offer settings files for download from this site, so everyone could download exactly the template they need. Just an idea; as a new user I would find it nice, though I am sure I won&amp;#39;t need it once I understand the code, whenever that will be :)&lt;/p&gt;
&lt;p&gt;Thank you. Waiting for an answer on my issue.&lt;/p&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;</description></item><item><title>Re: How to restrict crawl to single domain?</title><link>http://arachnode.net/forums/thread/409.aspx</link><pubDate>Fri, 13 Feb 2009 19:23:16 GMT</pubDate><guid isPermaLink="false">a2478770-777f-41ab-83b8-a21ff47ebb1f:409</guid><dc:creator>arachnode.net</dc:creator><slash:comments>0</slash:comments><comments>http://arachnode.net/forums/thread/409.aspx</comments><wfw:commentRss>http://arachnode.net/forums/commentrss.aspx?SectionID=7&amp;PostID=409</wfw:commentRss><description>&lt;p&gt;&lt;strong&gt;OK, how about this: &lt;/strong&gt;If you want to crawl 500 domains, configure arachnode.net to restrict Crawls to those 500 domains only, as the earlier posts in this thread describe.&amp;nbsp; Then, make sure your settings in &lt;strong&gt;Application.config&lt;/strong&gt; are set as shown.&lt;/p&gt;
&lt;p&gt;&lt;a href="http://arachnode.net/cfs-file.ashx/__key/CommunityServer.Discussions.Components.Files/7/6622.SuperConfig.bmp"&gt;&lt;img border="0" src="http://arachnode.net/resized-image.ashx/__size/550x0/__key/CommunityServer.Discussions.Components.Files/7/6622.SuperConfig.bmp" alt="" /&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The Crawl process works like this if you have the settings set as shown above: (this is simplified, removing how Files work)&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Crawl all database CrawlRequests.&amp;nbsp; CrawlRequests generate HyperLinks and a WebPage.&lt;/li&gt;
&lt;li&gt;Crawl all database HyperLinks.&amp;nbsp; HyperLinks generate WebPages.&lt;/li&gt;
&lt;li&gt;Crawl all WebPages (again).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So, if you had 500 Domains let&amp;#39;s say that those 500 CrawlRequests generated 500,000 HyperLinks.&amp;nbsp; Of those 500,000 HyperLinks, 100,000 of them actually belong to your 500 Domains.&amp;nbsp; Then, those 100,000 HyperLinks would be crawled and would generate 100,000 WebPages.&amp;nbsp; When all CrawlRequests are done crawling and all HyperLinks belonging to your 500 Domains have been crawled (are found in the WebPages table), then all WebPages are crawled.&amp;nbsp; When all WebPages are crawled, this is essentially the same as resubmitting the original 500 CrawlRequests, except that arachnode.net won&amp;#39;t have to refilter the 500,000 HyperLinks that it finds.&lt;/p&gt;
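&lt;p&gt;The filtering step described above can be illustrated with a minimal Python sketch. This is not arachnode.net code: the allow-list, the function names and the sample URIs are all hypothetical, and the real system works against its database tables rather than in-memory lists.&lt;/p&gt;
&lt;pre&gt;
```python
from urllib.parse import urlsplit

# Hypothetical allow-list standing in for the configured Domains.
ALLOWED_HOSTS = {"www.mydomain.com", "example.org"}


def belongs_to_allowed_host(absolute_uri):
    """Return True when the host of the URI is on the allow-list."""
    host = urlsplit(absolute_uri).hostname
    return host is not None and host in ALLOWED_HOSTS


def select_hyperlinks_to_crawl(discovered_hyperlinks, crawled_web_pages):
    """Keep HyperLinks on allowed hosts that are not yet in the WebPages set."""
    return [
        uri
        for uri in discovered_hyperlinks
        if belongs_to_allowed_host(uri) and uri not in crawled_web_pages
    ]


# Sample data (assumed): four discovered HyperLinks, one already crawled.
discovered = [
    "http://www.mydomain.com/1.aspx",
    "http://www.mydomain.com/2.aspx",
    "http://www.yahoo.com/",
    "http://sketchup.google.com/",
]
already_crawled = {"http://www.mydomain.com/1.aspx"}

to_crawl = select_hyperlinks_to_crawl(discovered, already_crawled)
print(to_crawl)
```
&lt;/pre&gt;
&lt;p&gt;Only the HyperLink on an allowed host that is not already in the crawled set survives; links to other hosts are discarded instead of being crawled.&lt;/p&gt;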
&lt;p&gt;Does this help?&lt;/p&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;</description></item><item><title>Re: How to restrict crawl to single domain?</title><link>http://arachnode.net/forums/thread/407.aspx</link><pubDate>Fri, 13 Feb 2009 13:40:34 GMT</pubDate><guid isPermaLink="false">a2478770-777f-41ab-83b8-a21ff47ebb1f:407</guid><dc:creator>jaydeep</dc:creator><slash:comments>0</slash:comments><comments>http://arachnode.net/forums/thread/407.aspx</comments><wfw:commentRss>http://arachnode.net/forums/commentrss.aspx?SectionID=7&amp;PostID=407</wfw:commentRss><description>&lt;p&gt;Hello Mate,&lt;/p&gt;
&lt;p&gt;My point is this: I put two crawl requests in the &lt;span style="font-size:x-small;"&gt;CrawlRequests table and both were crawled, but now they are no longer in the table. Suppose I want to crawl them again; what should be done? I have 500 domains to crawl, which I will put in the &lt;span style="font-size:x-small;"&gt;CrawlRequests table, so my question is: is there anything that&amp;nbsp;can be done so that I do not have to enter 500 names again and again whenever I want to crawl them?&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="font-size:x-small;"&gt;&lt;span style="font-size:x-small;"&gt;Let me know if I am not clear.&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="font-size:x-small;"&gt;Thanks for all your help&lt;br /&gt;JD&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="font-size:x-small;"&gt;&lt;/span&gt;&lt;/p&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;</description></item><item><title>Re: How to restrict crawl to single domain?</title><link>http://arachnode.net/forums/thread/401.aspx</link><pubDate>Wed, 11 Feb 2009 17:41:02 GMT</pubDate><guid isPermaLink="false">a2478770-777f-41ab-83b8-a21ff47ebb1f:401</guid><dc:creator>arachnode.net</dc:creator><slash:comments>0</slash:comments><comments>http://arachnode.net/forums/thread/401.aspx</comments><wfw:commentRss>http://arachnode.net/forums/commentrss.aspx?SectionID=7&amp;PostID=401</wfw:commentRss><description>&lt;ol&gt;
&lt;li&gt;After all of the CrawlRequests are crawled, if you have &lt;strong&gt;CreateCrawlRequestsFromDatabaseHyperLinks&lt;/strong&gt; set to &lt;strong&gt;true &lt;/strong&gt;or &lt;strong&gt;CreateCrawlRequestsFromDatabaseWebPages &lt;/strong&gt;set to &lt;strong&gt;true&lt;/strong&gt;, arachnode.net will create CrawlRequests from those AbsoluteUris.&amp;nbsp; CrawlRequests are processed in a fixed order: a.) CrawlRequests, b.) HyperLinks (not already in the WebPages table) and c.) WebPages.&amp;nbsp; (The next release adds the ability to re-crawl Files as well.)&amp;nbsp; If you look at the stored procedure &lt;strong&gt;arachnode_omsp_CrawlRequests_SELECT&lt;/strong&gt;, you&amp;#39;ll see that it selects the&amp;nbsp;TOP 3000 rows from a UNION of the TOP 3000 CrawlRequests, HyperLinks and WebPages.&amp;nbsp; You could adjust this balance so that, for example, 2900 HyperLinks not in the WebPages table and 100 WebPages are submitted for crawling.&amp;nbsp; So, after all CrawlRequests are crawled, and all HyperLinks (not already in the WebPages table) are crawled, WebPages will be submitted for recrawling.&amp;nbsp; (And you&amp;#39;re right... having to maintain the CrawlRequests table would be a chore.&amp;nbsp; Just for fun, submit a deep crawl, set &lt;strong&gt;desiredMaximumMemoryUsageInMegabytes &lt;/strong&gt;to a low number and keep an eye on the CrawlRequests table.)&lt;/li&gt;
&lt;li&gt;No data is deleted from arachnode.net.&amp;nbsp; So, if your WebPages are re-crawled, if their Source has changed, the date values in the WebPages table will update.&amp;nbsp; The screenshot below shows WebPages that were re-crawled.&lt;/li&gt;
&lt;/ol&gt;
&lt;p style="padding-left:30px;"&gt;&lt;a href="http://arachnode.net/cfs-file.ashx/__key/CommunityServer.Discussions.Components.Files/7/8688.TimesTheyAreAChangin.bmp"&gt;&lt;img border="0" src="http://arachnode.net/resized-image.ashx/__size/550x0/__key/CommunityServer.Discussions.Components.Files/7/8688.TimesTheyAreAChangin.bmp" alt="" /&gt;&lt;/a&gt;&lt;/p&gt;
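&lt;p&gt;The processing order from point 1 (CrawlRequests, then HyperLinks not already in the WebPages table, then WebPages) can be sketched in Python as follows. This is only an illustration under stated assumptions: it does not reproduce the stored procedure, and the function name, the limits parameter and the sample data are hypothetical.&lt;/p&gt;
&lt;pre&gt;
```python
from itertools import islice

BATCH_SIZE = 3000  # mirrors the TOP 3000 mentioned in the post


def next_batch(crawl_requests, hyperlinks, web_pages,
               batch_size=BATCH_SIZE, limits=None):
    """Fill one batch in priority order.

    Order: CrawlRequests, HyperLinks not yet in WebPages, then WebPages.
    The optional limits tuple caps each source (a hypothetical knob for
    rebalancing, e.g. 2900 HyperLinks and 100 WebPages).
    """
    known_pages = set(web_pages)
    new_hyperlinks = [uri for uri in hyperlinks if uri not in known_pages]
    sources = (crawl_requests, new_hyperlinks, web_pages)
    if limits is None:
        limits = (batch_size, batch_size, batch_size)
    batch = []
    for source, limit in zip(sources, limits):
        remaining = batch_size - len(batch)
        batch.extend(islice(source, min(limit, remaining)))
    return batch


# Sample data (assumed): no pending CrawlRequests, five HyperLinks,
# three WebPages, one HyperLink already present in WebPages.
hyperlinks = ["h0", "h1", "h2", "h3", "p0"]
web_pages = ["p0", "p1", "p2"]

default_batch = next_batch([], hyperlinks, web_pages, batch_size=4)
balanced_batch = next_batch([], hyperlinks, web_pages, batch_size=4,
                            limits=(4, 3, 1))
print(default_batch)
print(balanced_batch)
```
&lt;/pre&gt;
&lt;p&gt;With the default limits the new HyperLinks fill the whole batch, so WebPages are only re-crawled once the HyperLinks run out; capping the HyperLinks leaves room for some WebPages in every batch.&lt;/p&gt;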
&lt;p&gt;I updated the demo for the upcoming lucene.net functionality improvements: &lt;a href="http://arachnode.net/Content/LiveDemonstration.aspx"&gt;http://arachnode.net/Content/LiveDemonstration.aspx&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Since I know you&amp;#39;re waiting I&amp;#39;ll make finishing a build top priority this weekend.&amp;nbsp; I can&amp;#39;t guarantee that I&amp;#39;ll finish as the reporting views and stored procedures have to be modified for Release 1.1 and they always take longer than I think they will.&amp;nbsp; Also, I&amp;#39;ve made changes to the DB structure and to the lucene.net index format/fields.&amp;nbsp; Will you need a DB conversion script and a lucene.net index conversion utility?&lt;/p&gt;
&lt;p&gt;&lt;span style="font-size:x-small;"&gt;&lt;/span&gt;&lt;/p&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;</description></item><item><title>Re: How to restrict crawl to single domain?</title><link>http://arachnode.net/forums/thread/400.aspx</link><pubDate>Wed, 11 Feb 2009 10:33:45 GMT</pubDate><guid isPermaLink="false">a2478770-777f-41ab-83b8-a21ff47ebb1f:400</guid><dc:creator>jaydeep</dc:creator><slash:comments>0</slash:comments><comments>http://arachnode.net/forums/thread/400.aspx</comments><wfw:commentRss>http://arachnode.net/forums/commentrss.aspx?SectionID=7&amp;PostID=400</wfw:commentRss><description>&lt;p&gt;Hello Mike,&lt;/p&gt;
&lt;p&gt;Wow, great, thanks for your reply. It certainly works.&lt;/p&gt;
&lt;p&gt;Actually, I will have 5 websites to crawl and I want them to be crawled every day, so perhaps I will use a Windows&amp;nbsp;service for&amp;nbsp;that.&lt;/p&gt;
&lt;p&gt;I have two issues.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;1)&amp;nbsp; &lt;/strong&gt;After crawling one site I have 5 more sites to crawl, so how can I continue after one has finished? Also, since I want this to repeat every day, do I have to make an entry in the &lt;strong&gt;CrawlRequests&lt;/strong&gt; table every day (which obviously does not make sense), given that the entry we make is cleared once the crawl for it starts?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;2)&lt;/strong&gt;&amp;nbsp; If in future&amp;nbsp;&lt;a href="http://www.mydomain.com"&gt;http://www.mydomain.com&lt;/a&gt; has a few modified or newly added pages and I crawl that website again, what will happen to my old crawled data?&lt;/p&gt;
&lt;p&gt;I hope I am clear enough.&lt;/p&gt;
&lt;p&gt;And yes, I will wait for the improvement in Lucene search which you are working on :)&lt;/p&gt;
&lt;p&gt;Thanks again&lt;br /&gt;JD&lt;/p&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;</description></item><item><title>Re: How to restrict crawl to single domain?</title><link>http://arachnode.net/forums/thread/399.aspx</link><pubDate>Tue, 10 Feb 2009 22:33:45 GMT</pubDate><guid isPermaLink="false">a2478770-777f-41ab-83b8-a21ff47ebb1f:399</guid><dc:creator>arachnode.net</dc:creator><slash:comments>0</slash:comments><comments>http://arachnode.net/forums/thread/399.aspx</comments><wfw:commentRss>http://arachnode.net/forums/commentrss.aspx?SectionID=7&amp;PostID=399</wfw:commentRss><description>&lt;div&gt;
&lt;p&gt;The easiest way to see how to crawl a single site is to make sure you&amp;#39;re crawling only CrawlRequests, as set in Application.config.&amp;nbsp; (Don&amp;#39;t create CrawlRequests from Database HyperLinks or Database WebPages.)&lt;/p&gt;
&lt;p&gt;Submit a CrawlRequest with a depth of 4 (any deeper and you&amp;#39;ll need to check out CrawlRules.config for the Depth CrawlRule, I believe) and be sure to set RestrictToUriHost to true.&lt;/p&gt;
&lt;p&gt;This configuration will crawl until all content found at depth 4 is complete.&amp;nbsp; If you started a crawl like this at MSN you&amp;#39;d likely pick up 250,000 WebPages.&amp;nbsp; &lt;/p&gt;
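&lt;p&gt;As a rough illustration of a depth-limited crawl that stays on the starting host, here is a minimal Python sketch. It is not arachnode.net code: the in-memory link graph replaces real HTTP requests, the start page is counted as depth 1 (an assumption), and the restrict_to_uri_host parameter only mimics the idea behind RestrictToUriHost.&lt;/p&gt;
&lt;pre&gt;
```python
from collections import deque
from urllib.parse import urlsplit

# Hypothetical link graph standing in for pages fetched over HTTP.
LINKS = {
    "http://www.mydomain.com/": [
        "http://www.mydomain.com/1.aspx",
        "http://www.yahoo.com/",
    ],
    "http://www.mydomain.com/1.aspx": ["http://www.mydomain.com/2.aspx"],
    "http://www.mydomain.com/2.aspx": ["http://www.mydomain.com/3.aspx"],
    "http://www.yahoo.com/": ["http://www.yahoo.com/news"],
}


def crawl(start_uri, max_depth, restrict_to_uri_host=True):
    """Breadth-first crawl; the start page counts as depth 1 (assumed)."""
    start_host = urlsplit(start_uri).hostname
    visited = [start_uri]
    seen = {start_uri}
    queue = deque([(start_uri, 1)])
    while queue:
        uri, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow links beyond the requested depth
        for link in LINKS.get(uri, []):
            if link in seen:
                continue
            if restrict_to_uri_host and urlsplit(link).hostname != start_host:
                continue  # skip links that leave the starting host
            seen.add(link)
            visited.append(link)
            queue.append((link, depth + 1))
    return visited


restricted = crawl("http://www.mydomain.com/", max_depth=3)
unrestricted = crawl("http://www.mydomain.com/", max_depth=2,
                     restrict_to_uri_host=False)
print(restricted)
print(unrestricted)
```
&lt;/pre&gt;
&lt;p&gt;The restricted crawl never visits the yahoo.com link, and the page at 3.aspx is not reached because it lies beyond the requested depth.&lt;/p&gt;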
&lt;p&gt;Of course, there is likely to be more content to be crawled.&amp;nbsp; So, check out this forum post on how to restrict the entire system to a domain: &lt;a href="http://arachnode.net/forums/p/103/367.aspx#367"&gt;&lt;span style="color:#003399;"&gt;http://arachnode.net/forums/p/103/367.aspx#367&lt;/span&gt;&lt;/a&gt;&amp;nbsp; You&amp;#39;ll want to do this if you don&amp;#39;t want to manually feed the CrawlRequests table with requests from your intended Domain as the database tables will contain HyperLinks and Files&amp;nbsp;from other Domains.&lt;/p&gt;
&lt;p&gt;If you think crawling a single domain is too complicated, please make a forum post with a suggestion.&amp;nbsp; My eventual aim for arachnode.net is to make it as accessible as possible.&lt;/p&gt;
&lt;p&gt;Thanks!&lt;br /&gt;Mike&lt;/p&gt;
&lt;/div&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;</description></item><item><title>How to restrict crawl to single domain?</title><link>http://arachnode.net/forums/thread/395.aspx</link><pubDate>Thu, 05 Feb 2009 12:57:26 GMT</pubDate><guid isPermaLink="false">a2478770-777f-41ab-83b8-a21ff47ebb1f:395</guid><dc:creator>jaydeep</dc:creator><slash:comments>0</slash:comments><comments>http://arachnode.net/forums/thread/395.aspx</comments><wfw:commentRss>http://arachnode.net/forums/commentrss.aspx?SectionID=7&amp;PostID=395</wfw:commentRss><description>&lt;p&gt;Hello Guys,&lt;/p&gt;
&lt;p&gt;I am stuck; I hope someone has an idea about this.&lt;/p&gt;
&lt;p&gt;I want to crawl only the links that belong to my domain, e.g. &lt;a href="http://www.mydomain.com"&gt;http://www.mydomain.com&lt;/a&gt;. I want to crawl all the pages under &lt;a href="http://www.mydomain.com"&gt;http://www.mydomain.com&lt;/a&gt;,&amp;nbsp;such as &lt;a href="http://www.mydomain.com/1.aspx"&gt;http://www.mydomain.com/1.aspx&lt;/a&gt;, &lt;a href="http://www.mydomain.com/2.aspx"&gt;http://www.mydomain.com/2.aspx&lt;/a&gt;,&amp;nbsp;etc. One of the pages contains a link to &lt;a href="http://www.yahoo.com"&gt;http://www.yahoo.com&lt;/a&gt;, but I do not want to crawl &lt;a href="http://www.yahoo.com/"&gt;http://www.yahoo.com&lt;/a&gt;. I know there is configuration through Application.config in the Configuration project, but if&amp;nbsp;I set&amp;nbsp;createCrawlRequestsFromDatabaseHyperLinks to false, then it crawls only one link, which is &lt;a href="http://www.mydomain.com"&gt;http://www.mydomain.com&lt;/a&gt;, whereas I want data from all my sub-pages.&lt;/p&gt;
&lt;p&gt;Can this be done?&lt;/p&gt;
&lt;p&gt;I hope I am clear enough.&lt;/p&gt;
&lt;p&gt;Thanks &lt;br /&gt;JD&lt;/p&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;</description></item></channel></rss>