Thanks for the quick reply by mail.
As I said, I already have the environment up and running - it's crawling my sites, and search works.
My main goal is to let the user enter and save a list of sites (which I assume can be stored in the CrawlRequests table, with CreateCrawlRequestsFromDatabaseFiles enabled?) and also a list of words to search for, likewise entered and saved by the user. So I created a new table to hold the word list.
Finally, only the pages from those sites that contain any of the words in my list should be downloaded; then I need to process them and extract the relevant text. For that, I thought of using the Templater, which does some of that text extraction.
As I understand it, Arachnode currently works by first crawling, then downloading, indexing, and finally searching. I want to search while crawling and download only the relevant pages; once I have them, I'll extract the text.
Hope I was clear so far.
Any thoughts on how I should proceed from here? Is the Templater relevant for my needs, and if so, how do I use it? I also saw in one of last month's posts that there were some bugs in that class; is there a newer version of it?
The Templater class is a bit of AI used for programmatically extracting the "meat" of a page... e.g., how can a web spider tell which part of a page is the main blog post, or which parts are the comments?
I don't think the Templater is relevant for what you are trying to do.
Once a CrawlRequest is processed, it is removed from the CrawlRequests table, so I would create another table to hold the CR's you want to save.
If you have a list of words that you want to filter sites by, create a custom CrawlRule and set the 'IsDisallowed' property in the rule. Look at the DisallowedAbsoluteUris table... :)
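To make the idea concrete, here is the check such a keyword rule would perform. This is an illustrative sketch only: Arachnode rules are C# classes, and the names here (`KEYWORDS`, `is_disallowed`) are my own, not Arachnode's API. The point is just the logic - disallow any page whose text contains none of the user's saved words.

```python
# Illustrative sketch, NOT Arachnode's API: the check a keyword-based
# CrawlRule would perform when deciding whether a page is disallowed.

KEYWORDS = {"afghanistan", "kabul"}  # e.g. loaded from your custom words table


def is_disallowed(page_text: str) -> bool:
    """Return True when the page should be skipped (no keyword present)."""
    text = page_text.lower()
    return not any(word in text for word in KEYWORDS)


# A disallowed page would be logged to the DisallowedAbsoluteUris table
# rather than downloaded.
print(is_disallowed("NATO expands presence in Afghanistan"))  # False -> keep
print(is_disallowed("Sports roundup: weekend scores"))        # True  -> skip
```

In a real rule you would load the word list from your custom table once, not on every page.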
You are correct in your assessment of how AN works. :)
CreateCrawlRequestsFromDatabaseFiles will create CR's from the Files database table.
Thanks for your last answer and advice.
I'm currently trying to understand how creating a CrawlRule can help me filter web pages while crawling. I've searched the forum and read the posts I found about rules, but I still haven't completely figured out how to define one.
Can you please give an example of the basic steps needed to define a new rule that would filter my results (including which parameters go in which tables)? For example, if I'm crawling the CNN world news site (http://edition.cnn.com/WORLD/), I would like to get back only the pages that contain stories related to Afghanistan; so, for my example, only the pages in this list, http://edition.cnn.com/search/?query=afganistan&primaryType=mixed&sortBy=date&intl=true, would appear in my folder when the crawler finishes.
We have a basic set of help documents coming very soon.
Additionally, consider purchasing a license and/or a support contract. This is now the best way to receive directed help on your issues.
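In the meantime, the overall flow you described - a user-entered site list plus a user-entered word list, with only matching pages kept - can be sketched as follows. This is a hypothetical, self-contained illustration, not Arachnode code: the crawled pages are stubbed with in-memory samples, and all names are my own.

```python
# Hypothetical sketch (NOT Arachnode code): combining the two user-entered
# lists so that only pages matching a search word are kept. Real crawling
# and HTML parsing are stubbed out with in-memory sample pages.

seed_sites = ["http://edition.cnn.com/WORLD/"]  # user-entered, saved to the DB
search_words = ["afghanistan"]                  # user-entered, saved to the DB

# Stand-ins for pages the crawler discovered under the seed sites.
crawled_pages = {
    "http://edition.cnn.com/WORLD/story-1": "Afghanistan election results ...",
    "http://edition.cnn.com/WORLD/story-2": "European markets rally ...",
}


def matches(text: str, words) -> bool:
    """True when the page text contains any of the saved search words."""
    lowered = text.lower()
    return any(w.lower() in lowered for w in words)


kept = [url for url, text in crawled_pages.items() if matches(text, search_words)]
print(kept)  # only story-1 survives the keyword filter
```

In Arachnode terms, the `matches` check is what your custom CrawlRule would do, and the pages it rejects would end up in DisallowedAbsoluteUris instead of your download folder.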