arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE or MongoDB/RavenDB/Hadoop


Site with session ids / anchors


Top 100 Contributor
3 Posts
redevries posted on Tue, Apr 6 2010 5:48 AM

Hi Mike,

This is an example of a log from crawling http://forum.www.something.nl with UriClassificationType.Host. I understand this restricts me to crawling just forum.www.something.nl. So far so good.

Now, I think this site normally sets cookies, and if it can't, it builds up a session id (sid) in the URL instead. This causes some problems:

  • AN thinks there is an anchor in the URL (and so it seems, but it is part of the session id)
  • I get URLs that don't work anymore after the session ends...

dt:6-4-2010 14:38:53| ot:IsDisallowedReason| tn:9| crd:2147483647| ecr:40| AbsoluteUri:http://forum.www.something.nl/viewtopic.php?p=966739&sid=59244785fd5fcb3bb29772783c21b7c0#p966739| IsDisallowedReason:Disallowed by named anchor.

So, my questions are:

  • Can I make AN accept cookies, to avoid the problem above?
  • Can I make an exception allowing anchors for just this site (rather than as a global setting)?
  • How do I set the crawler to ONLY follow links that have /viewtopic.php in them? I.e. the start URL will still be http://forum.www.something.nl, but I only want to index the links that have http://forum.www.something.nl/viewtopic.php in them.

Thanks in advance, and apologies for the questions while I make my way through the endless config options ;)

René

All Replies

Top 10 Contributor
1,905 Posts

Hmmm... I will double-check the cookie handling and get back to you.  Question though: are you using the latest and greatest from SVN?

To make exceptions for named anchors, check AbsoluteUri.cs.

You can use one of the existing rules (or create a new one which is executed before AbsoluteUri.cs) and modify the Discoveries Dictionary.  Just remove the Discoveries that you don't want to crawl for HyperLinks.
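
Roughly, that pruning step looks like the following. This is only a minimal sketch, assuming the Discoveries Dictionary is keyed by AbsoluteUri string; the value type and the actual rule base class in the codebase may differ.

    // Hypothetical sketch - not the actual arachnode.net rule API.
    // Assumes the Discoveries Dictionary is keyed by AbsoluteUri.
    using System;
    using System.Collections.Generic;
    using System.Linq;

    public static class ViewTopicFilter
    {
        // Remove every Discovery whose AbsoluteUri does not contain "/viewtopic.php",
        // so only those links are followed for HyperLinks.
        public static void PruneDiscoveries<TDiscovery>(IDictionary<string, TDiscovery> discoveries)
        {
            List<string> disallowedKeys = discoveries.Keys
                .Where(absoluteUri => absoluteUri.IndexOf("/viewtopic.php", StringComparison.OrdinalIgnoreCase) < 0)
                .ToList();

            foreach (string key in disallowedKeys)
            {
                discoveries.Remove(key);
            }
        }
    }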

I actually misread your question (it was early), and it got me thinking about another popular (as of now) request: allowing a crawl to pass through a site but only inserting/indexing the pages that conform to a set of rules.  IsDisallowed is a parameter that says, "Hey!  This is a dead end for you, AN... do not continue to crawl Discoveries here..."   It seems as though it would be beneficial to add a new parameter... something like Insert, Store, or Process, to CrawlRequest.cs.

However, if you only want to filter but still allow the crawl to pass through a site, then IsDisallowed won't work.  Does this make sense?
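
To make the idea concrete, here is a purely hypothetical sketch of what such a flag might look like. The Store property and the ShouldStore rule below do not exist in CrawlRequest.cs today; the names are illustrative only.

    using System;

    // Hypothetical illustration only - this property is not in CrawlRequest.cs yet.
    // IsDisallowed stops the crawl at a Discovery; a 'Store' flag would let the crawl
    // pass through a page while skipping the insert/index step for non-matching pages.
    public class CrawlRequest
    {
        // Existing behavior (simplified): a disallowed request is a dead end for the crawler.
        public bool IsDisallowed { get; set; }

        // Proposed: continue crawling through this page, but only persist/index it when true.
        public bool Store { get; set; }
    }

    public static class StoreRules
    {
        // Example policy from this thread: only store pages whose AbsoluteUri
        // contains "/viewtopic.php" (or "/news", "/status", and so on).
        public static bool ShouldStore(string absoluteUri)
        {
            return absoluteUri.IndexOf("/viewtopic.php", StringComparison.OrdinalIgnoreCase) >= 0;
        }
    }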

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

Top 100 Contributor
3 Posts

> Hmmm... I will doublecheck the cookie handing and get back to you.  Question though:  Are you using the latest and greatest from SVN?

I am using the download link you sent me for the licensed version. I'll wait for your investigation into the cookie handling before I start messing with the code.

BTW, I don't want to publicly announce the URL here that I want to index, but check the mail we exchanged earlier; it is the first one from that list. Feel free to try it, and you will see that the URLs retrieved by AN are entirely different compared to what you get when you just use a browser.

As for the crawling rules -- I'm thinking more in terms of a 'Store' rule: for example, whenever you encounter a URL which has '/news' in it, then store/index it. For Twitter, the example would be /status. So just crawl these, and ignore all others. Something like that would be a HUGE saver, since it avoids storing data I don't want anyway. In many of my use cases I can easily determine which URLs are useful to follow.

Looking forward to it.

Top 10 Contributor
1,905 Posts

I updated WebClient here: (SVN Link Removed) to accept and return cookies.

I am working on retrieving the cookies from the current Windows user context - so that you could, say, log into your Facebook account and crawl your profile (which will be enabled as soon as I finish the AJAX/rendering functionality).

Is this your desired situation: Read the current user's cookies (this should also work for service accounts), and send those first to the website, per domain, and then record the cookies which are returned, and then continue to collect all cookies from subsequent HttpWebResponses from the host and return, return, return... ?
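
For reference, the cookie round-trip itself looks roughly like this with the standard System.Net types. This is only a minimal sketch; the actual WebClient wiring in AN differs, and reading the Windows user's cookies is a separate step.

    using System;
    using System.IO;
    using System.Net;

    public class CookieAwareFetcher
    {
        // One container for the whole crawl: it stores cookies per domain and
        // sends the matching ones with every subsequent request automatically.
        private readonly CookieContainer _cookies = new CookieContainer();

        public string Fetch(string absoluteUri)
        {
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(absoluteUri);
            request.CookieContainer = _cookies; // send stored cookies for this domain

            using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
            using (StreamReader reader = new StreamReader(response.GetResponseStream()))
            {
                // Any Set-Cookie headers in the response (e.g. the forum's sid) are
                // recorded in the container, so the next request carries the session.
                return reader.ReadToEnd();
            }
        }
    }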

Implementing the 'Store' functionality will come after the cookie/AJAX updates.

Mike

 

 


Top 100 Contributor
3 Posts

arachnode.net:

Is this your desired situation: Read the current user's cookies (this should also work for service accounts), and send those first to the website, per domain, and then record the cookies which are returned, and then continue to collect all cookies from subsequent HttpWebResponses from the host and return, return, return... ?

I think that might work. If a cookie exists, it should be passed. The only issue I see is that when the cookie expires, it may need an actual 'person' to go and log in to get a new cookie. It would be good to store that as an exception in the table, if only to be able to notify that someone needs to intervene.
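
Something like the following check could cover that case. This is a hypothetical sketch only; how AN actually records rows in the Exceptions table is up to Mike, and the notify callback here is just a placeholder.

    using System;
    using System.Linq;
    using System.Net;

    public static class SessionCookieCheck
    {
        // Before crawling a host, verify that at least one of its cookies is still live;
        // if not, flag it so a person can log in again (e.g. write a row to the Exceptions table).
        public static bool HasLiveSessionCookie(CookieContainer cookies, Uri host, Action<string> notify)
        {
            bool alive = cookies.GetCookies(host).Cast<Cookie>().Any(c => !c.Expired);

            if (!alive)
            {
                notify(string.Format("Session cookie for {0} has expired - manual login required.", host.Host));
            }

            return alive;
        }
    }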

Looking forward to this!

 

Top 10 Contributor
1,905 Posts

I checked in a beta version of this functionality last night.  I tried it with this site (which apparently generates 'problematic' cookies - a CommunityServer problem/bug), with AdWords (which I didn't really expect to work), and with Twitter (which uses distinct session cookies)...  I need to find a site with a simple login to test.  Suggestions?  Should I try one of your AbsoluteUris?

Mike


Top 10 Contributor
1,905 Posts

Does AN work with cookies?  Yes.

http://search.arachnode.net/Search.aspx?query=netflix&discoveryType=WebPage&pageNumber=1&pageSize=10&shouldDocumentsBeClustered=1

My name is Mike Anderson.

 

 



copyright 2004-2017, arachnode.net LLC