We purchased AN a month ago for our needs and have started examining the code.
Before going further, I want to describe our requirements and then hear your recommendations.
Here are our requirements for the project:
1. We want to crawl a number of sites (300+) on a regular basis,
2. There will be a number of specified keywords for each site,
3. For each page crawled, the main page content will be extracted (by BoilerPipe or another method) so that advertisements and unrelated information are discarded,
4. Site-specific keywords will be searched for in the page,
5. If the page contains a keyword, the extracted page content will be transferred to our intranet,
6. We don't want to crawl outside a site, but we do want to crawl the site's subdomains (e.g., abc.com and news.abc.com),
7. We want to exclude script, CSS, and image files and include HTML, PDF, and Office files.
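To make our intent concrete, here is a rough Python sketch of the filtering rules above (subdomain scope, file-type filter, keyword match). This is only an illustration of the logic we want, not AN's API; the extension lists and function names are our own assumptions.

```python
from urllib.parse import urlparse

# Illustrative extension lists for spec items 6-7 (not AN settings).
EXCLUDED_EXTENSIONS = {".js", ".css", ".png", ".jpg", ".jpeg", ".gif"}
INCLUDED_EXTENSIONS = {".html", ".htm", ".pdf", ".doc", ".docx",
                       ".xls", ".xlsx", ".ppt", ".pptx", ""}

def in_scope(url: str, site: str) -> bool:
    """True if url is on the site or one of its subdomains (e.g. news.abc.com for abc.com)."""
    host = urlparse(url).hostname or ""
    return host == site or host.endswith("." + site)

def wanted_type(url: str) -> bool:
    """Keep HTML/PDF/Office URLs (and extensionless ones); drop script/CSS/image files."""
    path = urlparse(url).path
    last_segment = path.rsplit("/", 1)[-1]
    ext = last_segment[last_segment.rfind("."):].lower() if "." in last_segment else ""
    return ext not in EXCLUDED_EXTENSIONS and ext in INCLUDED_EXTENSIONS

def matches_keywords(extracted_text: str, keywords: list[str]) -> bool:
    """True if the extracted main content contains any site-specific keyword."""
    text = extracted_text.lower()
    return any(kw.lower() in text for kw in keywords)
```

Only pages passing all three checks would have their extracted content transferred to our intranet.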
In this scope,
1. We want to use AN as a service and run it continuously,
2. We don't want to store any data, except whatever AN needs for subsequent crawls,
3. We want to use an HTTP or SOCKS proxy for specific sites while crawling,
4. What do you recommend we use: AN, or AN.NEXT?
5. What should the steps be in using Arachnode to achieve our goals?
Thanks for any help,
Just finished with the day's work and I need a break. I can likely get to this tomorrow...
Thanks for your patience,
Mike