My colleague and I are developing a web application that requires frequently updated content from the sites we want to monitor, including Facebook and Twitter. The content we are interested in is news and trending topics. We have a licensed version of Arachnode.
For that purpose, we want the Arachnode crawler to crawl our list of URLs (blogs, forums, RSS feeds, Facebook pages, Twitter users) and:
- extract only the useful content of each site (the main content, posts, and comments, without advertisements and other irrelevant material).
[For that, I have tried the Templater plugin, but the content I gathered is not satisfactory: some captures contain only the h1, h2, etc. of a post and leave out most of the body, some contain raw HTML tags, and so on. I also noticed that some relevant crawl requests never reached the Templater's perform-action method for processing.]
What would be the best way to deal with different sites having varying structures? We would also like to detect when a particular site changes its HTML/XML structure and adapt the crawler on the spot, without having to stop it.
- store the results in our own database with these columns: URL of the content that was found, the content itself, last-updated time, author, and post/comment ID, where a comment ID links back to the post ID of its parent post.
[I have tried creating our own table alongside the default tables, adding another table adapter in ArachnodeDataSet.xsd, and calling the DAO from the Templater's perform function to insert into the database. Is that the most efficient approach for this purpose? And how do I do an "update if exists, else insert" from the DataSet environment?]
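The "update if exists, else insert" pattern asked about above is independent of the DataSet machinery. Here is a minimal, language-agnostic sketch using Python's sqlite3, with table and column names mirroring the schema proposed above (in Arachnode itself this logic would live in a TableAdapter method or a stored procedure, as discussed in the reply below):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Posts (
        Url          TEXT PRIMARY KEY,  -- URL of the content that was found
        Content      TEXT,
        LastUpdated  TEXT,
        Author       TEXT,
        PostId       INTEGER,
        ParentPostId INTEGER            -- NULL for posts; links a comment to its parent post
    )""")

def upsert(url, content, last_updated, author, post_id, parent_post_id=None):
    """Update the row if the URL already exists; otherwise insert it."""
    cur = conn.execute(
        "UPDATE Posts SET Content=?, LastUpdated=?, Author=?, PostId=?, ParentPostId=? "
        "WHERE Url=?",
        (content, last_updated, author, post_id, parent_post_id, url))
    if cur.rowcount == 0:  # no existing row was updated, so insert instead
        conn.execute("INSERT INTO Posts VALUES (?, ?, ?, ?, ?, ?)",
                     (url, content, last_updated, author, post_id, parent_post_id))
    conn.commit()

# Crawling the same URL twice overwrites the old row rather than duplicating it.
upsert("http://example.com/post/1", "first version", "2011-06-01 10:00", "alice", 1)
upsert("http://example.com/post/1", "edited version", "2011-06-01 14:00", "alice", 1)
```

The key design choice is making the URL the primary key, so "does this record exist?" is a single keyed lookup rather than a scan.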
- crawl roughly every 4 hours, or more frequently, and skip re-crawling when the content is unchanged (as far as we know, that is the default behavior?).
- Our current first priority is to get all useful RSS content from the list of sites and store it neatly in our database.
[For that, I deleted all entries in AllowedDataTypes except xml. But we are having two issues. With that setting, a crawl produces far fewer results (XML feeds) than we expected. On top of that, when we put a breakpoint on the first line of the Templater's perform-action method, we did not catch every instance where an XML file was found; for example, the breakpoint was hit only twice, while dozens of XML files appeared under console/debug/bin/DownloadedFiles. Once this step works, we would like to extract the relevant content, posts, and comments and store them in the DB.]
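One likely contributor to the shortfall described above: real-world feeds are served under several different MIME types, so whitelisting a single "xml" content type can silently reject the rest. A small sketch of the kind of check involved (the MIME list below is illustrative, not Arachnode's actual AllowedDataTypes contents):

```python
# Content types that RSS/Atom feeds are commonly served with.
FEED_CONTENT_TYPES = {
    "application/rss+xml",
    "application/atom+xml",
    "application/xml",
    "text/xml",
}

def looks_like_feed(content_type_header):
    """True if an HTTP Content-Type header matches a known feed type."""
    # Strip parameters such as '; charset=utf-8' and normalize case.
    mime = content_type_header.split(";")[0].strip().lower()
    return mime in FEED_CONTENT_TYPES

print(looks_like_feed("text/xml; charset=UTF-8"))  # True
print(looks_like_feed("text/html"))                # False
```

If a site serves its feed as, say, application/rss+xml while only text/xml is allowed, the request is rejected before any plugin sees it.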
We really hope you can reply soon, since our development time frame is extremely short. We realize we should have contacted you earlier instead of trying to figure the tool out on our own.
Thanks in advance.
Use Boilerpipe instead of the Templater: http://code.google.com/p/boilerpipe/
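Boilerpipe (a Java library) separates main content from boilerplate using shallow text features such as word count and link density, rather than per-site templates, which is why it copes with varying site structures. A toy illustration of that idea only, not Boilerpipe's actual algorithm:

```python
def link_density(text, linked_text_len):
    """Fraction of a block's characters that sit inside anchor tags."""
    return linked_text_len / max(len(text), 1)

def is_content_block(text, linked_text_len, min_words=10, max_link_density=0.3):
    # Heuristic: article text tends to be long and mostly unlinked,
    # while navigation, ads, and footers are short and link-heavy.
    return (len(text.split()) >= min_words
            and link_density(text, linked_text_len) <= max_link_density)

blocks = [
    # (block text, number of characters inside links)
    ("Home | News | Sports | Contact", 28),  # nav bar: short, almost all links
    ("The city council voted on Tuesday to approve the new "
     "transit plan after months of public debate and hearings.", 0),
]
print([is_content_block(t, l) for t, l in blocks])  # [False, True]
```

Because the decision depends only on measurable text statistics, nothing needs to be reconfigured when a site changes its HTML layout, which addresses the adapt-on-the-spot question above.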
Yes, the TableAdapters will be fine. Just ensure your table has a primary key; you can then use the TableAdapters to check for key existence. The Discovery carries the primary key for each of my Discoveries tables. You can follow the stored-procedure route I took, or, if you generate DB-direct/table-direct methods from the wizards, you can check whether a record exists before electing to update it.
Sure. You could write your own app that restarts the crawler, or look at the Service code, which is designed to be restarted.
Look at the DisallowedAbsoluteUris table and search for 'Disallowed by unknown ContentType' in the 'Reason' column. This will point you to the ContentTypes that are being rejected.
Would you provide me an example of a page that isn't returning the XML documents you expect? Try resetting the DB and crawling a single page you believe to be problematic.
Did you stop a crawl and then crawl again without resetting the crawl state? AN can resume where you left off. Each plugin is called (when enabled) for every Discovery that is found. It may be that you had already found those Discoveries, hadn't reset the DB, and those Discoveries hadn't changed according to the 'Last-Modified' HTTP header.
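The 'Last-Modified' behavior described above comes down to a conditional re-fetch: skip a Discovery whose server-reported timestamp hasn't advanced since the last crawl. A minimal standalone sketch of that decision (Arachnode handles this internally; the helper name is ours):

```python
from email.utils import parsedate_to_datetime

def should_recrawl(stored_last_modified, response_last_modified):
    """Re-crawl only if the server reports a newer Last-Modified timestamp."""
    if stored_last_modified is None or response_last_modified is None:
        return True  # no basis for comparison, so crawl to be safe
    return (parsedate_to_datetime(response_last_modified)
            > parsedate_to_datetime(stored_last_modified))

stored = "Wed, 01 Jun 2011 08:00:00 GMT"
print(should_recrawl(stored, "Wed, 01 Jun 2011 08:00:00 GMT"))  # False: unchanged
print(should_recrawl(stored, "Wed, 01 Jun 2011 12:30:00 GMT"))  # True: page updated
```

This also explains the breakpoint observation: if the stored state says a page is unchanged, the plugin is never invoked for it, so resetting the DB before a test crawl makes every Discovery fire again.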
For best service when you require assistance:
I will be around over the weekend - ask away...