arachnode.net
An open source .NET web crawler written in C# using SQL 2005/2008

What new feature should I finish first?

Not Answered This post has 0 verified answers | 16 Replies | 4 Followers

Top 10 Contributor
1,202 Posts
arachnode.net posted on 7 Jan 2009 11:19 AM

I'm on my self-imposed one-week sprint break from development, and I'm considering which features I should add or improve next.  Help me decide!

What new feature should I finish first?

  • Distributed caching. Enable multiple instances of arachnode.net to communicate. (25%)
  • Automating WebPage templating. I have 50% of the work done to automatically parse WebPages into Content Body and Comments. (41.7%)
  • RSS Parsing. Enable arachnode.net to parse discovered RSS feeds. (8.3%)
  • CommunityServer 2008.5 integration. Enable arachnode.net to communicate with the new CommunityServer 2008.5 REST API. (0%)
  • Work on the TextAnalytics module. NLP and Sentiment extraction. (25%)
  • Total Votes: 12

An open source .NET web crawler written in C# using SQL 2005/2008.

Join the arachnode.net group on Facebook: http://www.facebook.com/groups.php?ref=sb#/group.php?gid=166721755872

Twitter: http://twitter.com/arachnode_net

arachnode.net provides custom crawling and contracting resources.  Please ask.

http://bit.ly/TOFX4

C# crawler, C# web crawler, C# site crawler

All Replies

Top 10 Contributor
1,202 Posts

I think automatic web page templating is next.  :)

Top 10 Contributor
Male
101 Posts

Yeah, if it's mostly done, finish 'er up!  But then the NLP and sentiment stuff would be a close second!!!

Top 25 Contributor
Male
11 Posts

What's the concept behind this?  Practically what problem does it solve?  I "think" it sounds useful, but what are the specifics?

Without knowing what this is about, I'm personally more interested in NLP, Semantic Analysis and Sentiment Analysis (NLPSS :) )

Top 10 Contributor
1,202 Posts

If you want to analyze a page for sentiment, or for important words, or for whatever, really, you most likely don't want to grab the sidebar, the links, and the footer for your analysis.

When a page is submitted to the code, it is run through an algorithm that parses out the meat of the page and then creates an XPath template that can be reused for pages from the same host.

This code is really necessary for NLP because it greatly reduces noise.  What do you think?
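
To make the idea concrete, here is a minimal sketch of per-host XPath templating. It is not the Templater plugin's code: the class and method names are hypothetical, the "most direct text wins" heuristic is a stand-in for whatever the real algorithm does, and it assumes the page is already well-formed XML without namespaces (real-world HTML would need an HTML parser first).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

// Hypothetical illustration; not the arachnode.net Templater plugin.
public static class TemplateSketch
{
    // One XPath template per host, reused for later pages from that host.
    private static readonly Dictionary<string, string> TemplatesByHost =
        new Dictionary<string, string>();

    // Builds an absolute, index-based XPath from the root down to the element.
    public static string BuildXPath(XElement element)
    {
        var steps = element.AncestorsAndSelf().Reverse().Select(e =>
        {
            // XPath positions are 1-based and count same-named siblings only.
            int index = e.ElementsBeforeSelf(e.Name).Count() + 1;
            return "/" + e.Name.LocalName + "[" + index + "]";
        });
        return string.Concat(steps);
    }

    // Stand-in heuristic (an assumption): the element holding the most
    // direct text is treated as the "meat" of the page.
    public static XElement FindMainContent(XDocument page)
    {
        return page.Descendants()
            .OrderByDescending(e => e.Nodes().OfType<XText>().Sum(t => t.Value.Trim().Length))
            .First();
    }

    // Learns a template the first time a host is seen, then reuses it.
    public static string ExtractContent(Uri pageUri, XDocument page)
    {
        string xpath;
        if (!TemplatesByHost.TryGetValue(pageUri.Host, out xpath))
        {
            xpath = BuildXPath(FindMainContent(page));
            TemplatesByHost[pageUri.Host] = xpath;
        }
        XElement match = page.XPathSelectElement(xpath);
        return match == null ? string.Empty : match.Value.Trim();
    }
}
```

Calling ExtractContent for two pages on the same host learns the template from the first page and applies it unchanged to the second, so the second page's sidebar is skipped even if it happens to contain more text than the post.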

Top 50 Contributor
Male
8 Posts

I voted for the templating; I find it really interesting. I actually wanted to develop something like it for my home-brew mini crawler but wasn't sure how to start doing it in VB... The arachnode project has given me a lot of material on crawling and on C# in general, so thanks :)

Top 10 Contributor
1,202 Posts

Awesome!  I'll put some time into templating this weekend.

We recently added a Bayesian classifier plugin (beta), so it's probably OK to add the templater in beta format too.

Mike

Top 50 Contributor
Male
8 Posts

Yeah, I also hope that someday I'll learn enough C# to contribute to this great project too... :)

Top 10 Contributor
1,202 Posts

Awesome!  The more the merrier!

What would you want to contribute?  What about writing a Plugin that serves a specific purpose?

Mike

Top 50 Contributor
Male
8 Posts

Well, I don't feel I have the skills yet :P I need to dig into the source code first anyway... but for sure I'm gonna try to do something useful, I promise :)

Top 10 Contributor
1,202 Posts

I added the Templater to the source repository.  It's EXTREMELY rough, and is likely ALPHA code.  FWIW...

At least it's there and I can spend a few cycles in the next week or so finishing it up.

Mike

Top 10 Contributor
202 Posts

Good choice. The templates are great and very helpful for getting the meat.

I am looking at it right now and trying to study it. I hope to contribute a template of my own once I am ready and know the code; I have a lot to study in the code behind this project.

Can anyone explain the differences between the plugins?

Top 10 Contributor
1,202 Posts

The plugins are vastly different, because they perform different functions.

ManageLuceneDotNetIndexes is used for creating full-text indexes over crawled content.

Anonymizer is used to translate the CrawlRequest AbsoluteUri into an anonymized CrawlRequest.

BayesianClassifier is used to classify content into 'Class A' or 'Class B'.  (BETA)

Templater is used to extract the 'meat' of a page, or to extract the main post or posts from a WebPage.  (ALPHA)

-Mike

And, please do contribute!!!

Top 25 Contributor
Female
9 Posts

Hi Mike,

Any news on the Anonymizer plug-in? Is it implemented in a newer version, and if so, how do I use it?

Thanks!

Top 10 Contributor
1,202 Posts

Hey there!

Find Anonymizer.cs.  Enable the CrawlAction in cfg.CrawlActions.

The plugin performs a string replacement on the AbsoluteUri submitted for crawling, so that the HttpWebRequest is fulfilled through an intermediary.

crawlRequest.Discovery.Uri =
    new Uri(_anonymizerAbsoluteUri + crawlRequest.Discovery.Uri.AbsoluteUri);
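
For anyone who wants to try the idea outside the crawler, here is a small standalone sketch of the same prefixing step. The class name, method name, and proxy address are made up for illustration, and the call to Uri.EscapeDataString is an addition of this sketch (the line above concatenates the AbsoluteUri as-is); whether escaping is needed depends on how the intermediary expects the target address.

```csharp
using System;

// Hypothetical illustration; not the actual Anonymizer.cs plugin.
public static class AnonymizerSketch
{
    // Assumed intermediary endpoint that takes the target address after "?url=".
    private const string AnonymizerAbsoluteUri = "http://proxy.example.com/fetch?url=";

    // Rewrites a target Uri so the request goes to the intermediary instead.
    public static Uri Anonymize(Uri original)
    {
        // Escape the target so its own query string travels as one parameter value.
        return new Uri(AnonymizerAbsoluteUri + Uri.EscapeDataString(original.AbsoluteUri));
    }
}
```

With this sketch, AnonymizerSketch.Anonymize(new Uri("http://example.com/page?id=1")) returns a Uri whose Host is proxy.example.com, so the HttpWebRequest never contacts example.com directly.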

 


copyright 2004-2010, arachnode.net LLC
