I'm on my self-imposed one-week sprint break in development, and I'm considering which features I should add or improve next. Help me decide!
An open source .NET web crawler written in C# using SQL 2005/2008.
Join the arachnode.net group on Facebook: http://www.facebook.com/groups.php?ref=sb#/group.php?gid=166721755872
Twitter: http://twitter.com/arachnode_net
arachnode.net provides custom crawling and contracting resources. Please ask.
http://bit.ly/TOFX4
I think automatic web page templating is next. :)
Yeah if it's mostly done finish 'er up! But then the NLP and sentiment stuff would be a close second!!!
What's the concept behind this? Practically what problem does it solve? I "think" it sounds useful, but what are the specifics?
Without knowing what this is about, I'm personally more interested in NLP, Semantic Analysis and Sentiment Analysis (NLPSS :) )
If you want to analyze a page for sentiment, for important words, or for anything else, really, you most likely don't want to grab the sidebar, the links, and the footer for your analysis.
When a page is submitted to the code, it is run through an algorithm that parses out the meat of the page and then creates an XPath template that can be reused for other pages from the same HOST.
This code is really necessary for NLP because it greatly reduces noise. What do you think?
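To make the idea concrete, here is a rough sketch of the "one XPath template per HOST" approach. It is not the arachnode.net Templater: every name in it (TemplateSketch, ExtractMeat, BuildTemplate) is made up for illustration, the "meat" heuristic is a stand-in (the element with the most text of its own), and it assumes the page is well-formed XML without namespaces. Real-world HTML would need an HTML parser first.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

// Hypothetical sketch; none of these names come from arachnode.net.
public static class TemplateSketch
{
    // One XPath template per host, learned from the first page seen for that host.
    private static readonly Dictionary<string, string> TemplatesByHost =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);

    public static string ExtractMeat(Uri pageUri, string xhtml)
    {
        // Assumption: the input parses as XML (no namespaces, no HTML-only entities).
        XDocument document = XDocument.Parse(xhtml);

        string xpath;
        if (!TemplatesByHost.TryGetValue(pageUri.Host, out xpath))
        {
            xpath = BuildTemplate(document);
            TemplatesByHost[pageUri.Host] = xpath;
        }

        // Later pages from the same host reuse the stored template.
        XElement meat = document.XPathSelectElement(xpath);
        return meat == null ? string.Empty : meat.Value.Trim();
    }

    // Stand-in heuristic: pick the element whose own text nodes are longest.
    private static string BuildTemplate(XDocument document)
    {
        XElement best = document.Descendants()
            .OrderByDescending(e => e.Nodes().OfType<XText>().Sum(t => t.Value.Trim().Length))
            .First();
        return BuildXPath(best);
    }

    // Builds an absolute, position-indexed XPath such as /html[1]/body[1]/div[2].
    private static string BuildXPath(XElement element)
    {
        var steps = new Stack<string>();
        for (XElement e = element; e != null; e = e.Parent)
        {
            int index = e.ElementsBeforeSelf(e.Name).Count() + 1;
            steps.Push(e.Name.LocalName + "[" + index + "]");
        }
        return "/" + string.Join("/", steps);
    }
}
```

The first page from a host pays for the analysis; later pages from the same host only evaluate the stored XPath. That also means a template learned from an atypical first page will be wrong for the rest of the host, so a real implementation would probably need some way to validate or relearn templates.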
I voted for templating; I find it really interesting. I actually wanted to develop something like this for my home-brew mini crawler but wasn't sure how to start doing it in VB... The arachnode project has provided me with a lot of material on crawling and on C# in general, so thanks :)
Awesome! I'll put some time into templating this weekend.
We recently added a Bayesian classifier plugin (beta), so it's probably OK to add the templater in beta format too.
Mike
Yeah, I also hope that someday I'll learn C# well enough to contribute to this great project too...
Awesome! The more the merrier!
What would you want to contribute? What about writing a Plugin that serves a specific purpose?
Well, I don't feel I have the skills yet :P I need to dig into the source code first anyway... but for sure I'm gonna try to do something useful, I promise :)
I added the Templater to the source repository. It's EXTREMELY rough, and is likely ALPHA code. FWIW...
At least it's there and I can spend a few cycles in the next week or so finishing it up.
Good choice. The templates are great and very helpful for getting the meat.
I am looking at it right now and trying to study it. I hope to contribute a template of my own once I'm ready and know the code; I have a lot to study in the code behind this project.
Can anyone explain the differences between the plugins?
The plugins are vastly different, because they perform different functions.
ManageLuceneDotNetIndexes is used for creating full-text indexes over crawled content.
Anonymizer is used to translate the CrawlRequest AbsoluteUri into an anonymized CrawlRequest.
BayesianClassifier is used to classify content into 'Class A' or 'Class B'. (BETA)
Templater is used to extract the 'meat' of a page, or to extract the main post or posts from a WebPage. (ALPHA)
-Mike
And, please do contribute!!!
Hi Mike,
Any news on the Anonymizer plug-in? Is it implemented in a newer version, and if so, how do I use it?
Thanks!
Hey there!
Find Anonymizer.cs. Enable the CrawlAction in cfg.CrawlActions.
The plugin performs a string replacement on the AbsoluteUri submitted for crawling, so that the HttpWebRequest is fulfilled through an intermediary.
crawlRequest.Discovery.Uri =
    new Uri(_anonymizerAbsoluteUri + crawlRequest.Discovery.Uri.AbsoluteUri);
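As a standalone illustration of that string replacement, here is a minimal sketch outside the crawler. The prefix "http://proxy.example/fetch?url=" is a made-up placeholder, not a real service, and AnonymizerSketch is a hypothetical class, not part of arachnode.net. The escaped variant is an assumption about what some intermediaries expect; check what yours requires.

```csharp
using System;

// Hypothetical sketch; the class name and prefix are placeholders for illustration.
public static class AnonymizerSketch
{
    // Assumed intermediary that takes the target address as a query parameter.
    private const string AnonymizerAbsoluteUri = "http://proxy.example/fetch?url=";

    // Plain concatenation, mirroring the snippet above.
    public static Uri Anonymize(Uri original)
    {
        return new Uri(AnonymizerAbsoluteUri + original.AbsoluteUri);
    }

    // Variant that escapes the target first, so the target's own query string
    // is not mixed up with the intermediary's parameters.
    public static Uri AnonymizeEscaped(Uri original)
    {
        return new Uri(AnonymizerAbsoluteUri + Uri.EscapeDataString(original.AbsoluteUri));
    }
}
```

Either way, the HttpWebRequest then goes to the intermediary's host rather than to the original site, which is the whole point of the plugin.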