arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE, MongoDB, RavenDB, or Hadoop

Dynamic list of websites to be crawled

Sergii posted on Mon, Apr 30 2012 7:23 PM

Hello,

I am considering using your product to crawl up to a million web sites. My goal is to find the top websites that are rich in PDF documents. I am not really interested in indexing, downloading content, or storing it to disk. I just need to discover PDFs within the sites I specify, but with unlimited crawl depth.

The list of websites will be dynamic. Is it possible to add new crawl requests dynamically, without changing source code as is done in the demo console application?

And the last question: can I run the crawler on a schedule?

P.S:

http://arachnode.net/media/g/releases/tags/AN.Next+_2800_DEMO_2900_/default.aspx - not found

Thank you,

Sergey

All Replies

Great!

Yes, AN can help you.  You can easily filter for .pdf documents and nothing else: simply crawl pages, validate the content type of suspected .pdf's, and only download and process those documents.
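
For illustration only, a minimal sketch of that content-type check using plain .NET (HttpWebRequest) rather than AN's own classes; the URI is a placeholder:

    // Sketch, not arachnode.net's actual API: ask for headers only and
    // download the body only when the server reports a PDF.
    using System;
    using System.Net;

    class PdfFilter
    {
        static bool IsProbablyPdf(Uri uri)
        {
            var request = (HttpWebRequest)WebRequest.Create(uri);
            request.Method = "HEAD"; // headers only; no body is transferred

            using (var response = (HttpWebResponse)request.GetResponse())
            {
                // Servers sometimes omit or misreport Content-Type, so a real
                // crawler would also check the ".pdf" extension and the "%PDF"
                // magic bytes after download.
                return response.ContentType != null &&
                       response.ContentType.StartsWith("application/pdf",
                           StringComparison.OrdinalIgnoreCase);
            }
        }

        static void Main()
        {
            var uri = new Uri("http://example.com/whitepaper.pdf"); // placeholder
            Console.WriteLine(IsProbablyPdf(uri) ? "Download it." : "Skip it.");
        }
    }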

It is possible to add them dynamically.  You could add them to the CrawlRequests database table while crawling, or have your process read from CrawlRequests.txt, as the Service (Service project) does.
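
As a rough sketch of the database route (the CrawlRequests table name is from this thread, but the column names AbsoluteUri and Depth are assumptions; check your AN schema before using this):

    using System.Data.SqlClient;

    class CrawlRequestWriter
    {
        // Placeholder connection string; point it at your AN database.
        const string ConnectionString =
            "Server=.;Database=arachnode.net;Integrated Security=true";

        public static void Enqueue(string absoluteUri, int depth)
        {
            using (var connection = new SqlConnection(ConnectionString))
            using (var command = new SqlCommand(
                "INSERT INTO CrawlRequests (AbsoluteUri, Depth) " +
                "VALUES (@uri, @depth)", connection))
            {
                // Parameterized to avoid SQL injection from crawled URIs.
                command.Parameters.AddWithValue("@uri", absoluteUri);
                command.Parameters.AddWithValue("@depth", depth);

                connection.Open();
                command.ExecuteNonQuery();
            }
        }
    }

For the file-based route, appending a line to CrawlRequests.txt with File.AppendAllText works the same way.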

Yes, it's easy to run AN on a schedule.
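
Typically you would point Windows Task Scheduler at the AN console application (or run the Windows Service). As an in-process alternative, here is a sketch using a timer; the executable path is a placeholder:

    using System;
    using System.Diagnostics;
    using System.Threading;

    class ScheduledCrawl
    {
        static void Main()
        {
            // Fire immediately, then every 24 hours.
            using (var timer = new Timer(_ => RunCrawl(), null,
                TimeSpan.Zero, TimeSpan.FromHours(24)))
            {
                Console.WriteLine("Press Enter to stop.");
                Console.ReadLine();
            }
        }

        static void RunCrawl()
        {
            // Placeholder path to the AN console application.
            using (var process = Process.Start(@"C:\arachnode.net\Console.exe"))
            {
                process.WaitForExit();
            }
        }
    }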

Thanks for the broken link.  Looks like CommunityServer needs to re-index the Media galleries.

Thanks,
Mike 

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet
