arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE, MongoDB, RavenDB, or Hadoop


Web Admin Console


Kevin posted on Sun, Feb 22 2009 8:25 AM

Hey, I wanted to capture some thoughts for discussion as I'm thinking through a web console for arachnode. I'll collect them here until we build out anything else (a road map and such).

Some features might be specific to certain ideas I have, and some are more general, so we'd need to think through everything in terms of being able to enable/disable it based on specific implementation needs.

Let's add to this as needed!

So questions/thoughts:

Most/all config params now stored in database?
This would allow crawl settings and other actions to be changed on the fly, rather than having to change config params by hand. I know there are ways to change config files programmatically, but is it just easier to move them to the db? The handful of params left in config would of course be the connection string info.
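A minimal sketch of what this could look like; the cfg.Configuration table, its columns, and the "arachnode" connection string name are placeholders, not the actual arachnode.net schema. Only the connection string stays in the config file:

using System;
using System.Collections.Generic;
using System.Configuration;
using System.Data.SqlClient;

public static class DbConfig
{
    public static Dictionary<string, string> Load()
    {
        var settings = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);

        // The connection string is the one thing that still lives in app.config/web.config.
        string connectionString = ConfigurationManager.ConnectionStrings["arachnode"].ConnectionString;

        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand("SELECT SettingKey, SettingValue FROM cfg.Configuration", connection))
        {
            connection.Open();

            using (var reader = command.ExecuteReader())
            {
                while (reader.Read())
                {
                    settings[reader.GetString(0)] = reader.GetString(1);
                }
            }
        }

        return settings;
    }
}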

Crawler API on thread activity?
If it's not there already, it would be nice to have each of the crawl threads "report" on what it is doing in a way that could be accessible from a web console. Nothing fancy, just a report that says whether or not a thread is actively doing something, what domain it is crawling, etc.
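A minimal sketch of the kind of reporting I mean (the class and method names here are hypothetical, not arachnode.net's API): each crawl thread updates a shared status entry that a console page can poll.

using System;
using System.Collections.Generic;

public class CrawlThreadStatus
{
    public int ThreadNumber { get; set; }
    public bool IsActive { get; set; }
    public string CurrentDomain { get; set; }
    public string CurrentAbsoluteUri { get; set; }
    public DateTime LastUpdated { get; set; }
}

public static class CrawlStatusBoard
{
    private static readonly object _lock = new object();
    private static readonly Dictionary<int, CrawlThreadStatus> _statuses = new Dictionary<int, CrawlThreadStatus>();

    // Called by a crawl thread whenever it starts (or stops) work on a URI.
    public static void Report(int threadNumber, bool isActive, string domain, string absoluteUri)
    {
        lock (_lock)
        {
            _statuses[threadNumber] = new CrawlThreadStatus
            {
                ThreadNumber = threadNumber,
                IsActive = isActive,
                CurrentDomain = domain,
                CurrentAbsoluteUri = absoluteUri,
                LastUpdated = DateTime.UtcNow
            };
        }
    }

    // Called by the web console to render a status table.
    public static List<CrawlThreadStatus> Snapshot()
    {
        lock (_lock)
        {
            return new List<CrawlThreadStatus>(_statuses.Values);
        }
    }
}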

Localization?
I know it's a pain, but it's easier to do right up front than to add later. Do we want to go ahead and localize all string info in the web console to make it easier to adapt to other languages? It can be done with either resource files or a db-driven approach - I have done it both ways. Again, I know it's a pain, but it's less of a pain if started early than if added later.
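If we go the resource-file route, a minimal sketch looks like this (the "WebConsole.Resources.AdminConsole" base name and the resource keys are placeholders):

using System.Globalization;
using System.Resources;
using System.Threading;

public static class ConsoleText
{
    private static readonly ResourceManager _resources =
        new ResourceManager("WebConsole.Resources.AdminConsole", typeof(ConsoleText).Assembly);

    public static string Get(string key)
    {
        // Falls back to the neutral-culture resources if no localized string exists.
        return _resources.GetString(key, Thread.CurrentThread.CurrentUICulture);
    }
}

// Usage: Thread.CurrentThread.CurrentUICulture = new CultureInfo("de-DE");
//        string label = ConsoleText.Get("CrawlStatusHeader");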

Features to queue domains for review before being crawled
This would be an option that could be enabled/disabled. For the ideas I have in mind I'd use this for sure. It would also require being able to completely remove all crawl info for a given domain, if one sneaks in.
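A minimal sketch of both halves of that (the DomainReviewQueue, WebPages, and HyperLinks table names are placeholders, not the actual arachnode.net schema, and the methods assume an already-open connection):

using System.Data.SqlClient;

public static class DomainReview
{
    // Only crawl domains that have been approved in the review queue.
    public static bool IsApproved(SqlConnection connection, string domain)
    {
        using (var command = new SqlCommand(
            "SELECT COUNT(*) FROM dbo.DomainReviewQueue WHERE Domain = @Domain AND IsApproved = 1", connection))
        {
            command.Parameters.AddWithValue("@Domain", domain);
            return (int)command.ExecuteScalar() > 0;
        }
    }

    // Completely remove crawl info for a domain that sneaks in.
    public static void PurgeDomain(SqlConnection connection, string domain)
    {
        using (var command = new SqlCommand(
            "DELETE FROM dbo.WebPages WHERE Domain = @Domain; " +
            "DELETE FROM dbo.HyperLinks WHERE Domain = @Domain;", connection))
        {
            command.Parameters.AddWithValue("@Domain", domain);
            command.ExecuteNonQuery();
        }
    }
}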

Vote feature...
We are throwing this idea around already: the ability for registered users to vote a link up or down. The final implementation would allow someone (or, even better, multiple people) to tweak search results, so that results they find more or less applicable move up or down the list.
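One minimal way the votes could influence ordering (the types here are hypothetical, not the arachnode.net API): re-rank the hits by combining the engine's relevance score with the net votes, keeping relevance dominant.

using System.Collections.Generic;
using System.Linq;

public class SearchHit
{
    public string AbsoluteUri { get; set; }
    public float RelevanceScore { get; set; }   // e.g. the Lucene.NET score
    public int NetVotes { get; set; }           // up-votes minus down-votes
}

public static class VoteRanker
{
    public static List<SearchHit> Rank(IEnumerable<SearchHit> hits, float votesWeight)
    {
        // A small, tunable nudge per net vote moves results up or down the list.
        return hits
            .OrderByDescending(hit => hit.RelevanceScore + votesWeight * hit.NetVotes)
            .ToList();
    }
}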

Membership
I'd likely want/need a membership feature. This would support building the community, allowing for tweaks to search results, and the ability to do other membership-related things later. Nothing fancy.
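One low-effort option is the built-in ASP.NET membership provider (raised again in the reply below); a minimal sketch, nothing arachnode.net-specific:

using System.Web.Security;

public static class ConsoleMembership
{
    public static MembershipCreateStatus Register(string userName, string password, string email)
    {
        MembershipCreateStatus status;
        Membership.CreateUser(userName, password, email, null, null, true, null, out status);
        return status;
    }

    public static bool SignIn(string userName, string password)
    {
        if (Membership.ValidateUser(userName, password))
        {
            FormsAuthentication.SetAuthCookie(userName, false);
            return true;
        }

        return false;
    }
}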

Search term statistics API
An easy way to store/access search term info: which terms were searched, how often they were used, etc. This will be handy for general information display, for ajax-driven auto-complete that shows the most common similar searches, and later for any kind of advertising based on search terms.
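A minimal sketch of the storage and the auto-complete query (the dbo.SearchTerms table and columns are placeholders; the methods assume an open connection):

using System.Collections.Generic;
using System.Data.SqlClient;

public static class SearchTermStatistics
{
    // Log every search as it happens.
    public static void Record(SqlConnection connection, string searchTerm)
    {
        using (var command = new SqlCommand(
            "INSERT INTO dbo.SearchTerms (SearchTerm, SearchedOn) VALUES (@SearchTerm, GETUTCDATE())", connection))
        {
            command.Parameters.AddWithValue("@SearchTerm", searchTerm);
            command.ExecuteNonQuery();
        }
    }

    // Pull the most common terms matching a prefix, for an ajax auto-complete box.
    public static List<string> MostCommon(SqlConnection connection, string prefix, int count)
    {
        var terms = new List<string>();

        using (var command = new SqlCommand(
            "SELECT TOP (@Count) SearchTerm FROM dbo.SearchTerms " +
            "WHERE SearchTerm LIKE @Prefix + '%' " +
            "GROUP BY SearchTerm ORDER BY COUNT(*) DESC", connection))
        {
            command.Parameters.AddWithValue("@Count", count);
            command.Parameters.AddWithValue("@Prefix", prefix);

            using (var reader = command.ExecuteReader())
            {
                while (reader.Read())
                {
                    terms.Add(reader.GetString(0));
                }
            }
        }

        return terms;
    }
}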

 


All Replies

arachnode.net replied:

This is all great!!!

Working on getting release 1.1 ready, and on the OOBE (out-of-box experience).

Responses to your post above:

  • Most/all config params now stored in database? I have been evaluating the way arachnode.net stores configuration information. Parameters that are booleans, integers, etc. are stored in config files. Parameters that are set/list based are stored in the database. I'm not totally happy with having two locations to configure settings, but I'm not totally unhappy with it either.
  • Crawler API on thread activity? Information already exists that describes each thread and its number; the ConsoleManager could be modified to add which domain the crawl is currently crawling.
  • Localization? Yes, please. We should do what Nutch does and provide all languages out of the box. Also, it isn't difficult to implement language detection: 1,000 pages per language from Wikipedia and NClassifier make a nice language detector (see the sketch after this list).
  • Features to queue domains for review before being crawled This one should probably have some logic or concrete decision paths behind it. Are we going to suggest which domains should be crawled? It's possible to manipulate the filters for the Disallowed(s)... is this what you are thinking?
  • Vote feature... Pretty easy to do, and would likely tie into the membership feature.
  • Membership Are you thinking the ASP.Net membership provider?
  • Search term statistics API I have a good deal of work done on this, from two years ago.  Would be good to dig this up.
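A minimal sketch of the language-detection idea in plain C# (this is not NClassifier's API; it just illustrates the approach of profiling roughly 1,000 Wikipedia pages per language and scoring a page against those profiles):

using System;
using System.Collections.Generic;
using System.Linq;

public class LanguageDetector
{
    private readonly Dictionary<string, Dictionary<string, int>> _profiles =
        new Dictionary<string, Dictionary<string, int>>();

    // Call once per language with the concatenated training text.
    public void Train(string language, string trainingText)
    {
        _profiles[language] = Trigrams(trainingText);
    }

    // Returns the language whose trigram profile overlaps most with the input text.
    public string Detect(string text)
    {
        Dictionary<string, int> input = Trigrams(text);

        return _profiles
            .OrderByDescending(profile => input.Keys.Count(trigram => profile.Value.ContainsKey(trigram)))
            .Select(profile => profile.Key)
            .FirstOrDefault();
    }

    // Counts overlapping character trigrams in lower-cased text.
    private static Dictionary<string, int> Trigrams(string text)
    {
        var counts = new Dictionary<string, int>();
        string normalized = text.ToLowerInvariant();

        for (int i = 0; i + 3 <= normalized.Length; i++)
        {
            string trigram = normalized.Substring(i, 3);
            int count;
            counts.TryGetValue(trigram, out count);
            counts[trigram] = count + 1;
        }

        return counts;
    }
}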

We also need testing.  Do you have anyone that would be willing to write tests?


arachnode.net replied (verified answer):

A project named 'Administration' has been checked into the trunk.  Give it a whirl!

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet


copyright 2004-2017, arachnode.net LLC