
RE: Crawl Web Pages, RSS, Facebook & Twitter and store cleaned content to database

Tred.S posted on Fri, Jun 15 2012 4:25 AM

Hi,

        My colleague and I are developing a web application that requires frequently updated web content from the sites we want to monitor, as well as from Facebook and Twitter. The type of content we are interested in is news and trending topics. We have a licensed version of arachnode.net.

        For that purpose, we want the arachnode.net crawler to crawl our list of URLs (blogs, forums, RSS feeds, Facebook pages, Twitter users) and

- extract only the useful content of each site (that is, the main content, posts, and comments, without advertisements and other irrelevant material)

[For that, I have tried the Templater plugin, but the content I gathered is not satisfactory: some extracted content contains only the h1, h2, etc. of a post and leaves out most of the body, some still has HTML tags inside, and so on. I also noticed that some relevant crawl requests never seemed to reach the Templater's PerformAction to be processed.]
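To illustrate, here is a simplified sketch of the kind of extraction we are attempting (we parse with HtmlAgilityPack; the XPath selectors and class names are per-site placeholders, not anything arachnode provides):

    using HtmlAgilityPack;

    // Sketch only: extract the "main" content from a page, skipping
    // navigation and advertisement markup. The selectors below are
    // per-site placeholders.
    public static string ExtractMainContent(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        // Try a site-specific container first (placeholder class name),
        // then fall back to the whole body.
        HtmlNode container =
            document.DocumentNode.SelectSingleNode("//div[@class='post-content']")
            ?? document.DocumentNode.SelectSingleNode("//body");

        if (container == null)
        {
            return string.Empty;
        }

        // Drop scripts, styles, and obvious ad blocks before reading text.
        var noise = container.SelectNodes(
            ".//script|.//style|.//div[contains(@class,'ad')]");
        if (noise != null)
        {
            foreach (HtmlNode node in noise)
            {
                node.Remove();
            }
        }

        return HtmlEntity.DeEntitize(container.InnerText).Trim();
    }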

What would be the best way to deal with different sites having varying structures? We would also like to be able to detect when a particular site changes its HTML/XML structure and adapt the crawler on the spot, without having to stop it.

- store the results in our own database with these columns: the URL where the content was found, the content itself, the last-updated time, the author, and a post/comment ID, where a comment's ID links back to the post ID of its parent post

[I have tried creating our own table alongside the default tables, adding another table adapter in ArachnodeDataSet.xsd, and calling the DAO from the Templater's PerformAction to insert into the database. Is that the most efficient way to do this? And how do I do an "update if exists, else insert" from the typed-dataset environment?]
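For reference, instead of the dataset route we are also considering plain ADO.NET with a MERGE keyed on the URL (SQL Server 2008 and later support MERGE). The table and column names below are from our own schema, not from the arachnode defaults:

    using System;
    using System.Data.SqlClient;

    // Sketch of an "update if exists, else insert" keyed on Url.
    // dbo.CrawledContent and its columns are our own schema.
    public static void UpsertContent(string connectionString, string url,
        string content, DateTime lastUpdated, string author,
        int postId, int? parentPostId)
    {
        const string upsertSql = @"
            MERGE dbo.CrawledContent AS target
            USING (SELECT @Url AS Url) AS source
                ON target.Url = source.Url
            WHEN MATCHED THEN
                UPDATE SET Content = @Content, LastUpdated = @LastUpdated,
                           Author = @Author
            WHEN NOT MATCHED THEN
                INSERT (Url, Content, LastUpdated, Author, PostId, ParentPostId)
                VALUES (@Url, @Content, @LastUpdated, @Author, @PostId, @ParentPostId);";

        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(upsertSql, connection))
        {
            command.Parameters.AddWithValue("@Url", url);
            command.Parameters.AddWithValue("@Content", content);
            command.Parameters.AddWithValue("@LastUpdated", lastUpdated);
            command.Parameters.AddWithValue("@Author", author);
            command.Parameters.AddWithValue("@PostId", postId);
            command.Parameters.AddWithValue("@ParentPostId",
                (object)parentPostId ?? DBNull.Value);

            connection.Open();
            command.ExecuteNonQuery();
        }
    }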

- crawl roughly every 4 hours, or more frequently, and skip re-crawling when the content has not changed (as far as I know, that is the default behavior?)
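[In case it clarifies what we mean by the schedule and the skip-if-unchanged check, this is the kind of wrapper we have in mind; StartCrawl() is a placeholder for however a crawl is actually kicked off, not an arachnode API call:]

    using System;
    using System.Collections.Generic;
    using System.Security.Cryptography;
    using System.Text;
    using System.Threading;

    // Sketch only: a 4-hour trigger plus a hash check for skipping
    // unchanged content.
    static class CrawlScheduler
    {
        private static readonly Dictionary<string, string> _lastHashes =
            new Dictionary<string, string>();

        public static void Main()
        {
            // Fire immediately, then every 4 hours.
            var timer = new Timer(_ => StartCrawl(), null,
                TimeSpan.Zero, TimeSpan.FromHours(4));
            Console.ReadLine(); // keep the process alive
        }

        private static void StartCrawl()
        {
            // Placeholder: enqueue our list of URLs here.
        }

        // Returns false when the content hash matches the previous crawl,
        // so the caller can skip the database write.
        public static bool HasChanged(string url, string content)
        {
            using (var sha1 = SHA1.Create())
            {
                string hash = Convert.ToBase64String(
                    sha1.ComputeHash(Encoding.UTF8.GetBytes(content)));

                string previous;
                if (_lastHashes.TryGetValue(url, out previous) && previous == hash)
                {
                    return false;
                }

                _lastHashes[url] = hash;
                return true;
            }
        }
    }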

- currently, our first priority is to get all of the useful RSS content from the list of sites and store it neatly in our database.

[For that, I have deleted every entry in AllowedDataTypes except XML. But we are seeing two issues with that setting. First, after we crawl, there are far fewer results (XML feeds) than we expected. Second, when we put a breakpoint on the first line of the Templater's PerformAction method, we were not able to catch every instance where an XML file was found; for example, the breakpoint was hit only twice, while dozens of XML files showed up in Console/Debug/bin/DownloadedFiles. After this step, we want to extract the relevant content, posts, and comments and store them in the database.]
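[Once the XML feeds do come through, this is roughly how we plan to pull the items out of each downloaded file, using .NET's built-in SyndicationFeed; the only assumption is that the file on disk is a well-formed RSS/Atom feed:]

    using System;
    using System.ServiceModel.Syndication; // System.ServiceModel.dll
                                           // (System.ServiceModel.Web.dll on .NET 3.5)
    using System.Xml;

    // Sketch: read the items out of a downloaded RSS/Atom file.
    public static void DumpFeed(string path)
    {
        using (XmlReader reader = XmlReader.Create(path))
        {
            SyndicationFeed feed = SyndicationFeed.Load(reader);

            foreach (SyndicationItem item in feed.Items)
            {
                Console.WriteLine(item.Title != null ? item.Title.Text : "(no title)");
                Console.WriteLine(item.PublishDate);

                if (item.Links.Count > 0)
                {
                    Console.WriteLine(item.Links[0].Uri);
                }

                if (item.Summary != null)
                {
                    Console.WriteLine(item.Summary.Text);
                }
            }
        }
    }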


We really hope you can reply soon, since our development time frame is extremely short. We realize we should have contacted you earlier instead of trying to work the tool out on our own.

Thanks in advance.
