arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL 2005/2008/CE

Help with Setup


posted on Wed, Jan 27 2010 2:26 PM

Hi there,

I would like to use AN for searching our own website.  The website is a .NET web application.  The goal is to be able to search all entered information.

Because it is an application there is no static html (for the purposes of crawling anyway).  Users would enter in information and I would like that information to be searchable.

A few considerations:

1. There is a login page - so the crawler would have to be able to log into the application since it is protected by username / password.  How would I configure this?

2. I would like to crawl / index / search pages (which display data stored in a database), files (PDF, Word, etc) and images.  Would I use Lucene for this?

3. What do I need to do to set up Lucene (besides creating the directory, and setting that in the Config and CrawlActions tables)?

4. I need to have a hit list that shows the context in which a hit was found.  Example:

If I search for the term "red" in a House application and there is a hierarchy in the data model that looks like:

House -> Room -> Appliance

Let's say there was data that existed like this:

House1 (where color is red)
Room1 (Kitchen where walls are red)
Appliance1 (Stove where color is red)

My hit list would have to look like this:
House1 (link)
House1 -> Room1 (the hit is Room1 but needs to show the context of House1)
House1 -> Room1 -> Appliance1 (the hit is Appliance1 but needs to show context of House1->Room1)

This is likely a poor example but suffice it to say context is important and would naturally include duplicates (imagine Appliance1 belonging to a kitchen in many houses).

 

5. Finally (I think), what would I have to do in order to enable the above requirements for a search embedded in my existing web application?

I am really sorry for the abundant questions but am under a tight deadline with this "context" requirement that could make your product my only solution at this point.

Thank you!!!

Richard

All Replies

replied on Wed, Jan 27 2010 2:27 PM

By the way, I only submitted as anonymous because I am waiting for my account to be verified - not trying to hide :)

Top 10 Contributor
1,696 Posts

0.) Account approved!  :)

1.) Use the CredentialCache in Crawler.cs and supply a cookie or other piece of information that allows access (depending on what type of login the page provides)... or, you may need to provide a page known only to the crawler to gain access.

2.) AN indexes .doc and .pdf by default.

3.) Nothing other than what you have described.

4.) You should be able to accomplish this as AN records each HyperLink and where it was found, and thus you can construct discovery chains, if this sounds similar to what you are asking... ???  Perhaps you can explain this in terms of HyperLinks and WebPages and content contained therein?

5.) Sounds like most of it is already in place.  Just need a bit more info on #4.
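As a rough illustration of point 1, the sketch below shows the general idea of keeping credentials keyed by URL prefix and attaching either an Authorization header or a session cookie to each crawl request. It is written in Python purely for brevity and is not arachnode.net code: `find_credentials`, `build_headers`, the URLs, the user name, and the cookie value are all hypothetical. Which of the two mechanisms applies depends on what type of login the page provides.

```python
import base64
from typing import Dict, Optional, Tuple

# Hypothetical store: URL prefix -> (username, password). Not arachnode.net's API.
Credentials = Dict[str, Tuple[str, str]]


def find_credentials(url: str, store: Credentials) -> Optional[Tuple[str, str]]:
    """Return the credentials registered under the longest prefix that matches url."""
    matches = [prefix for prefix in store if url.startswith(prefix)]
    if not matches:
        return None
    return store[max(matches, key=len)]


def build_headers(url: str, store: Credentials,
                  session_cookie: Optional[str] = None) -> Dict[str, str]:
    """Build the extra request headers a crawler could send for url."""
    headers: Dict[str, str] = {}
    creds = find_credentials(url, store)
    if creds is not None:
        user, password = creds
        # HTTP Basic authentication: base64 of "user:password"
        token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
        headers["Authorization"] = f"Basic {token}"
    if session_cookie:
        # A login page normally hands back a cookie; replaying it keeps the session.
        headers["Cookie"] = session_cookie
    return headers


# Example values are made up for illustration.
store = {"http://intranet.example/app/": ("crawler", "secret")}
headers = build_headers("http://intranet.example/app/houses/1", store, "SessionId=abc123")
print(headers)
```

URLs outside every registered prefix get no credentials at all, which keeps the login details from leaking to other hosts.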

You are welcome!

Mike

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

Top 75 Contributor
4 Posts

Hi Mike.

Thanks so much for your quick reply!!

A couple of things:

Given my time crunch I may need to obtain your services for a custom setup for my specific situation.  Could you provide me with details on that (e.g. cost)?

Is there documentation beyond the startup guide (which is very good, by the way) that helps with figuring out how to do certain things (e.g. create discovery chains)?

I noticed that when I crawl my test site (just a couple of pages) it doesn't get all my .doc files.  I created two in the same directory but only one is picked up - the one that is referenced in my index.html page.

Also, it indexed a .gif file that is not referenced in any of my html pages.  Really I would ideally like to index only items that are referenced (linked to) from the pages (html, aspx, etc).  Is this possible?

Again, maybe I need your help given the time crunch.......thanks!!

 

Richard

Top 10 Contributor
1,696 Posts

You are very welcome!

Most likely you won't require a ton of code.  My guess, without knowing more about your setup for the 'discovery chains', is that you would leverage the HyperLinks referenced before, and add the 'Parent' references from 'CrawlRequest.cs' to the index, so you can trace the chain back as far as you'd like.
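To make the 'discovery chain' idea concrete, here is a minimal sketch, in Python for brevity and not taken from arachnode.net. It assumes each discovered link is recorded as a (parent, child) pair and walks the parent references back to a root. The names `record_link` and `discovery_chains` and the page names are hypothetical. Because a page can be linked from several parents, the function returns every chain, which matches the duplicate-context case from the original question.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# child -> pages on which a link to the child was found (hypothetical structure)
parents: Dict[str, List[str]] = defaultdict(list)


def record_link(parent: str, child: str) -> None:
    """Remember that a link to child was discovered on parent."""
    if parent not in parents[child]:
        parents[child].append(parent)


def discovery_chains(url: str, _seen: Tuple[str, ...] = ()) -> List[List[str]]:
    """Return every chain from a root page down to url."""
    if url in _seen:  # guard against link cycles
        return []
    found = parents.get(url, [])
    if not found:  # no recorded parent: this page is a root
        return [[url]]
    chains: List[List[str]] = []
    for parent in found:
        for chain in discovery_chains(parent, _seen + (url,)):
            chains.append(chain + [url])
    return chains


# The same appliance page is reachable through two different houses.
record_link("House1", "Room1")
record_link("Room1", "Appliance1")
record_link("House2", "Room2")
record_link("Room2", "Appliance1")

for chain in discovery_chains("Appliance1"):
    print(" -> ".join(chain))
```

Each printed line is one hit with its full context, so a single matching page can legitimately appear more than once in the hit list.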

Are you saying that you have an index.html file that references one file in a directory with two files, and only that one file is being picked up?  This sounds correct to me.  :D

It isn't - if the .gif is listed in your pages it will get picked up.  :D

If you purchase a license I am willing to spend some time with you to get AN set up on your machine.  What I have found works best is to have customers/clients purchase, follow the install instructions, then compile a list of questions.  After this, we can schedule a time to go over your questions in a TeamViewer session.  The questions are then posted to the forums and I answer them in text for better retention.

Setup is no cost, and then I bill in blocks of 20 hours of custom coding support at $50 an hour.  And, of course, general AN help, if asked on the forums, is free.

Given that you only want to crawl one site, and the site is your own, there is a great chance of rapid success.  As long as you don't have anything too crazy like spider traps, AJAX, redirect schemes, or dynamic link generation intended to confuse crawlers, and aren't trying to break crawlers with malformed XML, the process should be relatively painless.

How much time are we working with?

Mike

Top 75 Contributor
4 Posts

Hi Mike,

What is happening is the following:

index.html references Word1.doc  -> this gets picked up.
index.html references page2.html -> no web pages get indexed (I was expecting the pages themselves to be indexed)
page2.html references Word2.doc -> Word2.doc does NOT get picked up.
image.gif is not referenced by any html pages and this does get picked up.

I am confused as to why one word document gets picked up and not another.  And why no html pages get picked up.

Questions:

1. Is a license required to use this product for commercial use?

2. Is there documentation that comes with a license purchase that is not otherwise available?

3. Time is really tight - a few days.  I am a developer but would benefit greatly from just cutting to the chase to get set up and then examining the code to see what's going on (I know it sounds bad, but deadlines still exist).  Is it possible to do this in a short turnaround?

Again, your help and time are greatly appreciated.

P.S. please feel free to email me if it is easier.  Thanks!

Cheers,

Richard

 

Top 10 Contributor
1,696 Posts

"index.html references Word1.doc  -> this gets picked up."  OK.
"index.html references page2.html -> no web pages get indexed (I was expecting the pages themselves to be indexed)"  What depth are you crawling at?  It should be 2.
"page2.html references Word2.doc -> Word2.doc does NOT get picked up."  Could be depth related.
"image.gif is not referenced by any html pages and this does get picked up."  I don't know how this could be.  Image.gif has to be in one of your pages.  (Curious...)
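The depth question can be illustrated with a small breadth-first crawl over the link structure described above. This is a Python sketch, not arachnode.net code. The `links` dictionary and the `crawl` function are hypothetical, and the sketch assumes the start page counts as depth 0; AN may number depths differently.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical link graph mirroring the test site in this thread.
links: Dict[str, List[str]] = {
    "index.html": ["Word1.doc", "page2.html"],
    "page2.html": ["Word2.doc"],
}


def crawl(start: str, max_depth: int) -> Set[str]:
    """Breadth-first crawl from start; the start page is depth 0."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth == max_depth:
            continue  # links on pages at the limit are not followed
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return seen


print(sorted(crawl("index.html", 1)))  # Word2.doc is out of reach
print(sorted(crawl("index.html", 2)))  # Word2.doc is now discovered
```

Under these assumptions Word2.doc only appears once links on page2.html are followed, and a file that no page links to (image.gif here) is never discovered at any depth.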

1.) Yes.  Which version are you using?

2.) No.

3.) OK.  I can't really say without diving into the specifics.  You know... plus, as the old adage goes, "A well spec'd project takes twice as long... without specs takes three times as long..."  ;)

Mike

Top 10 Contributor
1,696 Posts

Also, always check the DisallowedAbsoluteUris table and the Exceptions table when troubleshooting why something isn't working the way you think it should.  90% of all mysteries can be solved with these two tables.

Top 75 Contributor
4 Posts

I thought of the depth so I set it to 1000!

I just downloaded the code today so whatever the latest version is in SVN.

Also, the solution we are deploying will likely eventually be a centralized web application, but for security reasons it is being deployed (for now) on separate laptops.  Does that mean I would have to purchase a license per laptop until the server option is set up?

In the DisallowedAbsoluteUris table there are two records:

AbsoluteUri: http://richardb.amita.com/robots.txt
Reason: The remote server returned an error: (404) Not Found.

AbsoluteUri: http://richardb.amita.com/test/files/WhatsNext.doc
Reason: A CrawlRequest did not download Data as the 'LastModified' HttpResponse Header indicating that the Data was not stale, but could not find the Data (Source) in the Files database table or at ApplicationSettings.DownloadedFilesDirectory.
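One way to read the second message is as the outcome of a staleness check: the server's Last-Modified date says the file has not changed since the last crawl, so it is not downloaded again, but the copy that should already be stored cannot be found. The Python sketch below is only an interpretation of that message, not arachnode.net's logic. The function `decide` and its return strings are hypothetical.

```python
from datetime import datetime
from typing import Optional


def decide(last_modified: Optional[datetime],
           last_crawled: Optional[datetime],
           have_stored_copy: bool) -> str:
    """Decide what to do with a URL based on freshness and local storage."""
    # Never crawled, no Last-Modified header, or changed since the last crawl.
    if last_crawled is None or last_modified is None or last_modified > last_crawled:
        return "download"
    # Not stale: the previously stored copy should be reused.
    if have_stored_copy:
        return "reuse stored copy"
    # Not stale, but nothing is stored to reuse: the case in the message above.
    return "disallow: not stale, but stored copy missing"


crawled = datetime(2010, 1, 27, 12, 0)
modified = datetime(2010, 1, 20, 9, 0)
print(decide(modified, crawled, have_stored_copy=False))
```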

And in the Exceptions table there were quite a few records - most of them were of the following nature:

Procedure or function arachnode_omsp_CrawlRequests_DELETE has too many arguments specified.
Procedure or function arachnode_omsp_WebPages_INSERT has too many arguments specified.

 

Next steps?

Thanks!

Richard

Top 10 Contributor
1,696 Posts

OK, this explains it.  SVN access is supposed to be private, but apparently it isn't.  You are running version b of the code with version a of the database.

You are free to install AN to as many machines as you would like.

Download the DEMO here: http://arachnode.net/media/p/41.aspx
.doc and .pdf indexing isn't present in this DEMO, but should be on Saturday.  Version 1.4 (current) does contain this functionality.

Mike

Top 75 Contributor
4 Posts

Ah...sorry...I was just following the instructions in the Getting Started video.

I will download the demo.  Here is a question: could I use the demo license for free on these laptops until we move to a central server, at which point I would purchase the commercial license?  Or how does that work?

When you say version 1.4 (current), is that the version that comes with a purchased license?

Thanks again Mike - your support in these few short hours is exceptional!

Richard

Top 10 Contributor
1,696 Posts

Sorry about that.  I will get that corrected.

You can use the demo license however you'd like to.

Version 1.4 is current.  Version 1.5 will be released this week; it updates Lucene.NET, as Lucene.NET has left Apache incubator status.

You are welcome!  I check the forums several times a day - always glad to help!

Mike


copyright 2004-2013, arachnode.net LLC