arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE, MongoDB, RavenDB, or Hadoop

Completely Open Source @ GitHub

Does arachnode.net scale? | Download the latest release

Quick Questions / Thoughts

5 Replies | 2 Followers

Top 10 Contributor
Male
101 Posts
Kevin posted on Fri, Feb 13 2009 11:46 AM

I've been swamped at work but still playing with arachnode stuff and had some thoughts/questions:

1. I seem to remember thinking that my Lucene data was being erased/recreated each time I ran a crawl test.  Is this the case?  If so, it would obviously be great if the Lucene data was updated as crawls occurred.  Is this not possible?  The thought is that new sites would be added continuously and added to the Lucene search results.  If this isn't possible (at least not yet), I could consider using the FTS data and doing it that way :(

2. I forgot and am at work so can't look at the moment... is there a project already included that is a web service or a system service that watches the crawl table for something to do and does it?  The thought being that I want to start playing with web pages that ask for and submit new crawl requests and those requests get picked up and processed.

3. I know there is no web interface yet, but here's an idea I might play with: the ability to submit crawl requests that are "queued up for review."  An admin screen would allow review, maybe with a thumbnail of the queued site, and upon acceptance a crawl request would be created.  A scenario might be a site building a search portal specific to certain types of sites (by interest or topic, for example), so sites would be queued for review.  THIS ALSO MEANS that additional domains found while walking an approved site would have to be queued for review as well.  The list could get large, but would still have to be reviewed if a search portal wanted to make sure only particular domains are part of the site.

4. Any thoughts or work already begun on the concept of adjusting search result rankings based on vote?  For example on a portal search site focused on some particular topic, as searches are performed and results shown, if the user had an opportunity to rank a result as "very pertinent", "kind of pertinent", "not applicable", etc., it may prove useful in adjusting search results.  Yes there could be issues with users faking activity to force rankings on particular results, and maybe with being able to affect lucene's internal search ranking stuff from the outside in, but wanted to at least ask the question.

5. A question you might be able to answer since you have code doing it already: loading a type dynamically via entries in a config file.  Great idea!  Nice way to be able to change out a provider and other functionality!  So when doing this, technically isn't the entire assembly loaded?  I mean, in a situation where I may have 100 reports in an assembly, all typed differently, if trying to dynamically load one type from the assembly, won't the entire assembly actually be loaded?  Or is it in fact more optimized than I think, and only the type would load?  That would be a great way to dynamically load reports on demand rather than loading 100 reports just to see 1.

Keep up the good work!

 

All Replies

Top 10 Contributor
1,905 Posts

Hey!  Glad to see you back!  I still have to finish answering your post on how crawl priority influences search result relevancy.  :)  And another user, jaydeep, is really interested in the product too, so I've been working hard at getting a release together.  (And we're trying to buy a second home...!!!)  So...

1.) The lucene.net indexes are maintained and are added to with each crawl.  The indexes aren't like nutch, where you have to merge them and merge the linkdb's, etc.  I have made a change slated for Version 1.1 which uses a fresh lucene.net index and optimizes/merges with the main search index when a new crawl is started.  Indexing is much faster when done this way, and merging indexes is a fast process, faster than indexing into one main index at crawl time.  I just about have the code done to update existing documents.
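As a rough illustration of the merge-at-crawl-start approach described above, a sketch along these lines could work, assuming a Lucene.Net 2.x-era API (the paths, analyzer choice, and class name here are illustrative, not arachnode.net's actual code):

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;
using Lucene.Net.Store;

public static class IndexMerger
{
    // Merge a small, freshly built per-crawl index into the main search
    // index, then optimize.  Merging segments this way is typically much
    // faster than writing every document into the main index at crawl time.
    public static void MergeCrawlIndex(string mainIndexPath, string crawlIndexPath)
    {
        Directory mainIndex = FSDirectory.GetDirectory(mainIndexPath);
        Directory crawlIndex = FSDirectory.GetDirectory(crawlIndexPath);

        // false = open the existing main index rather than creating a new one.
        IndexWriter writer = new IndexWriter(mainIndex, new StandardAnalyzer(), false);
        try
        {
            // Fold the crawl index's segments into the main index.
            writer.AddIndexesNoOptimize(new Directory[] { crawlIndex });
            writer.Optimize();
        }
        finally
        {
            writer.Close();
        }
    }
}
```

The appeal of this design is that the crawl only ever writes to a small, hot index, and the expensive merge into the large main index happens once, at a controlled point.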

2.) There is a Windows service that does what the Console app test harness does.  I would really like to add functionality to the Web project that would allow you to submit CrawlRequests.
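A watcher of the kind Kevin describes could be sketched as a polling loop like the following (the table name, columns, and statuses here are hypothetical, not the actual arachnode.net schema):

```csharp
using System;
using System.Data.SqlClient;
using System.Threading;

public class CrawlRequestWatcher
{
    private readonly string _connectionString;
    private volatile bool _stop;

    public CrawlRequestWatcher(string connectionString)
    {
        _connectionString = connectionString;
    }

    // Poll a (hypothetical) CrawlRequests table for pending rows and hand
    // them to the crawler.  A Windows service's OnStart would run this on
    // a background thread; OnStop would set _stop = true.
    public void Run()
    {
        while (!_stop)
        {
            using (SqlConnection connection = new SqlConnection(_connectionString))
            {
                connection.Open();
                SqlCommand command = new SqlCommand(
                    "SELECT TOP 10 Id, AbsoluteUri FROM CrawlRequests WHERE Status = 'Pending'",
                    connection);

                using (SqlDataReader reader = command.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        long id = reader.GetInt64(0);
                        string absoluteUri = reader.GetString(1);
                        // Submit to the crawler and mark the row processed...
                        Console.WriteLine("Crawling: " + absoluteUri);
                    }
                }
            }

            Thread.Sleep(TimeSpan.FromSeconds(5));
        }
    }
}
```

A web page that inserts rows into the same table would then be all that's needed for the "submit a crawl request" front end.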

3.) This would be awesome!!!  A friend of mine is working on a data visualization engine in Silverlight that could show what is currently being crawled.  I can imagine crawling being like a video game of sorts.

4.) Yes, but thoughts only, for the most part.  Were you thinking of something along the lines of what Google has done recently with their search results?  I have to finish some code in ManageLuceneDotNetIndexes.cs for updating existing Documents, and most of this code would be used for updating Document Boosts.
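For reference, the update-plus-boost idea mentioned above might look something like this, again assuming a Lucene.Net 2.x-era API; the "absoluteuri" key field and the helper's name are assumptions for illustration:

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;

public static class DocumentBooster
{
    // Re-index a single document with a new boost derived from user votes.
    // UpdateDocument deletes any document matching the term, then adds the
    // replacement atomically, so the boost change takes effect in place.
    public static void SetBoost(Directory index, Document document, string absoluteUri, float boost)
    {
        document.SetBoost(boost);

        IndexWriter writer = new IndexWriter(index, new StandardAnalyzer(), false);
        try
        {
            writer.UpdateDocument(new Term("absoluteuri", absoluteUri), document);
        }
        finally
        {
            writer.Close();
        }
    }
}
```

Because a document's boost multiplies into its score at search time, this is the natural hook for a "very pertinent" / "not applicable" voting scheme.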

5.) Thanks!  I believe I mentioned this elsewhere on the site, but my thinking was that this functionality would be used for commercial plugins... where the source wouldn't be distributed.  I believe that only the Type from the Assembly that you specify will be loaded, and not the entire Assembly.  I'm using this as a reference: http://msdn.microsoft.com/en-us/library/1fce0hc8.aspx  If the entire Assembly were loaded each time, then each CrawlAction or CrawlRule that was loaded would mean loading the entire Arachnode.SiteCrawler Assembly.

Each dynamically loaded type supplies two parameters.  1.) The assemblyName and 2.) the typeName.  So, I think we're in the clear for only loading the classes/types that you want to load.
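The pattern being discussed is roughly the following.  One nuance worth noting: the CLR does load the containing assembly's metadata into the AppDomain as a unit, but individual methods are only JIT-compiled as they are called, so instantiating one type out of 100 does not pay the execution cost of the other 99 reports.  The helper below is a minimal sketch, not arachnode.net's actual loader:

```csharp
using System;

public static class TypeLoader
{
    // Instantiate a type by name, as a config-driven plugin loader would:
    // the config supplies 1.) the assemblyName and 2.) the typeName.
    // The assembly is loaded on first use; unused types in it are never
    // JIT-compiled, so they cost little beyond loaded metadata.
    public static object Create(string typeName, string assemblyName)
    {
        Type type = Type.GetType(typeName + ", " + assemblyName, true);
        return Activator.CreateInstance(type);
    }
}
```

For example, `TypeLoader.Create("System.Text.StringBuilder", "mscorlib")` yields a `StringBuilder` without the caller referencing the type at compile time.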

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

Top 10 Contributor
Male
101 Posts
Kevin replied on Fri, Feb 13 2009 2:22 PM

Yeah, I was thinking of something like what Google is doing, or even the Digg approach.  Some way for relevant (registered, caring) users to tweak rankings of particular content based on particular searches.

When are you thinking the next build will be out?  Can't wait!

I will play with setting up the windows service and do crawl submissions.  I will also play with a web interface for submissions and such.

I'm heavy into Telerik UI, .NET 3.5, SubSonic DAL generation, and Enterprise Library 4.1 so what I "play" with may end up using some of this stuff.

Good luck on the second house!

Thx

 

Top 10 Contributor
1,905 Posts

Awesome!!!  Something like Digg labs would be cool too!  I was thinking about ranking caps - limiting the number of WebPages that you could influence.  A sort of 'choose your votes wisely' motif.

I'm going to do my best this weekend to get a build out.  It may not happen, and if it doesn't I will make an experimental tag.

This is very, very cool.  So, are you thinking that there's a strong nutch competitor here?  ;)

Big fan of the Telerik Controls.  I purchased a set two years ago.  LINQ == CRACK.

Thanks!  We've been living in a tiny condo (560 square feet) for 2.5 years and we're going a little crazy.  We actually have a bunk bed to make a little more room.

Mike

 


Top 10 Contributor
1,905 Posts

I like what this page does for search result promotion: http://zataka.com/tgs/comask%20tic+ask.html

A user can't promote a result past the agreement of other users.


Top 10 Contributor
Male
101 Posts
Kevin replied on Wed, Feb 18 2009 10:09 AM

Yeah, that's what I was thinking.  Or some way to require x people to agree on a vote before it is "applied."  Something simple, but still effective, that gives users the ability to affect results.
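That "x people must agree" rule can be sketched in a few lines; the names and the vote labels below are illustrative:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class VoteGate
{
    // A ranking adjustment is only "applied" once at least requiredAgreement
    // distinct users have cast the same vote on the same result.
    // votesByUser pairs a user id (Key) with that user's vote (Value).
    // Returns the agreed vote, or null if no vote has enough agreement.
    public static string GetAppliedVote(
        IEnumerable<KeyValuePair<string, string>> votesByUser, int requiredAgreement)
    {
        return votesByUser
            .GroupBy(v => v.Value)                                         // group by vote value
            .Where(g => g.Select(v => v.Key).Distinct().Count() >= requiredAgreement)
            .OrderByDescending(g => g.Count())                             // prefer the most popular
            .Select(g => g.Key)
            .FirstOrDefault();
    }
}
```

Requiring distinct users per vote value is also a cheap first defense against one user faking activity to force a ranking.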

 


copyright 2004-2017, arachnode.net LLC