Filtering URLs Comune Milano

Answered (Verified) This post has 1 verified answer | 8 Replies | 1 Follower

Top 10 Contributor
Male
58 Posts
Massimo Ghidoni posted on Tue, Jan 19 2010 10:15 AM

Good morning Mike,

We have a problem with AN 1.4 (we purchased the license last month through my company, Software Technologies S.r.l.).

What we need to do is download files and web pages from a large number (about 10,000) of specific URIs (domains and hosts, with or without a query string), WITHOUT downloading from hyperlinks that are "brothers" or "fathers" of the page in the site tree.

 

Let me give an example to explain better:

http://www.comune.milano.it/dseserver/webcity/garecontratti.nsf/Wapertitipo?openview&RestricttoCategory=ELENCO

We would like to download the two web pages ("Avviso per l'inserimento..." and "Convenzione per l'affidamento...") and the files reachable from those two pages, WITHOUT downloading the other web pages (or files) that are linked from the left-side menu, or that sit in the same directory as the files reached from those two pages but are only reachable through different web pages. (The two pages mentioned above live at http://www.comune.milano.it/dseserver/webcity/garecontratti.nsf/WEBAll/74431BB70A3C27A4C12573230046F60A?opendocument together with many other pages and files that we want to refuse.)

 

We have many URLs to crawl with the same configuration.

Could you please give us an idea of how to resolve the problem?


All Replies

Top 10 Contributor
1,905 Posts

OK, so the pages you want to crawl have to be pointed to by other pages but can't be pointed to by navigation links?

Is there a pattern that can be applied to all 10,000 pages that will allow this?

(not fully understanding what you want to do...)

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

Top 10 Contributor
1,905 Posts

Another thing I am thinking is that you could add a static variable (an instance variable may work as well...) to a CrawlRequest, and have this variable keep track of whether or not you were crawling in state 'A' or state 'B'...

In state 'A' you would gather your list of links. 

Then, in state 'B', you would allow pages into the DB.
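
A minimal sketch of that two-state idea, in C# (the class and member names below are hypothetical illustrations, not part of the AN 1.4 API):

    using System.Collections.Generic;

    // Hypothetical sketch of the two-state approach described above.
    // In state 'A' we only record the AbsoluteUris discovered from the seed pages;
    // in state 'B' we only admit pages whose AbsoluteUri was recorded in state 'A'.
    public enum CrawlState { GatherLinks, StorePages }

    public static class CrawlStateTracker
    {
        // A static flag shared by all CrawlRequests, as suggested above.
        public static CrawlState State = CrawlState.GatherLinks;

        // AbsoluteUris gathered during state 'A'.
        public static readonly HashSet<string> AllowedAbsoluteUris = new HashSet<string>();

        // Call this from wherever you decide whether a page may go into the DB.
        public static bool ShouldStore(string absoluteUri)
        {
            if (State == CrawlState.GatherLinks)
            {
                AllowedAbsoluteUris.Add(absoluteUri);
                return false; // state 'A': collect only, store nothing yet
            }

            return AllowedAbsoluteUris.Contains(absoluteUri); // state 'B'
        }
    }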

If you can tell me a little bit more about what you are trying to do I'm sure we can figure it out.

Thanks!

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

Top 10 Contributor
1,905 Posts

Also, what are 'brother' and 'father'?

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

Top 10 Contributor
Male
58 Posts

The problem is that all web pages and/or documents (.pdf, .doc, .xls, .jpg, .xml files, etc.) are in one directory (http://www.comune.milano.it/dseserver/webcity/garecontratti.nsf/WEBAll/), regardless of which menu item is selected.

 

Take the URL from the example above:

http://www.comune.milano.it/dseserver/webcity/garecontratti.nsf/Wapertitipo?openview&RestricttoCategory=ELENCO

 

The two pages that we need to crawl, reached by following the two hyperlinks ‘Avviso per l'inserimento..’ and ‘Convenzione per l'affidamento..’, are in the same directory (http://www.comune.milano.it/dseserver/webcity/garecontratti.nsf/WEBAll/) as the web pages reached from the items in the left-side menu (‘Bandi CHIUSI’, ‘Esiti’, ‘PROSSIMI bandi’, ‘NEWSLETTER’).

We need to crawl only those two web pages (‘Avviso per l'inserimento..’ and ‘Convenzione per l'affidamento..’) together with their documents, without crawling the other web pages and documents (under ‘Bandi CHIUSI’, ‘Esiti’, etc.).

 

This kind of situation occurs very often across our URLs, and can be treated as a common pattern for our whole URL list.

Thanks again.

Top 10 Contributor
Male
58 Posts

Hi Mike,

We need to obtain the web page and the files (.doc, .pdf, etc.) that are linked directly from this URL:

http://www.ausl5.la-spezia.it/template1.asp?itemID=32&livello=2&label=gare&CodMenu=2

 

We crawl with current depth = 1, so our expected result was the download of:

  - this web page,
  - all the documents (.doc, .pdf, etc.) that can be downloaded directly from that web page,
  - all the web pages reached by following the hyperlinks on that web page,

and nothing else.

 

In actual fact, crawling that URL we crawl the WHOLE web site, with all its documents and all its web pages.

Why? Is there something wrong in our configuration?

Have we misunderstood the Depth concept?

 

Thanks again

Top 10 Contributor
1,905 Posts

Setting | Scope | Value | Description
CreateCrawlRequestsFromDatabaseCrawlRequests | Application | true | Should CrawlRequests stored in the database be converted to CrawlRequests for crawling?
CreateCrawlRequestsFromDatabaseFiles | Application | true | Should Files stored in the database be converted to CrawlRequests for crawling?
CreateCrawlRequestsFromDatabaseHyperLinks | Application | true | Should HyperLinks stored in the database be converted to CrawlRequests for crawling?
CreateCrawlRequestsFromDatabaseImages | Application | false | Should Images stored in the database be converted to CrawlRequests for crawling?
CreateCrawlRequestsFromDatabaseWebPages | Application | true | Should WebPages stored in the database be converted to CrawlRequests for crawling?

 

Your first problem is that you have CreateCrawlRequests* set to true, which means "When there isn't anything left to crawl in the Engine, go to the database and make CrawlRequests according to those categories"... so, turn off all but CreateCrawlRequestsFromDatabaseCrawlRequests.
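
A rough sketch of that change, assuming the settings are exposed as static properties on an ApplicationSettings class (if your build exposes them only through the configuration table in SQL Server, update the equivalent rows there instead):

    // Sketch only: assumes these settings are exposed as static properties on
    // an ApplicationSettings class; verify against your AN 1.4 assembly, or
    // update the equivalent rows in the configuration table instead.
    ApplicationSettings.CreateCrawlRequestsFromDatabaseCrawlRequests = true;
    ApplicationSettings.CreateCrawlRequestsFromDatabaseFiles = false;
    ApplicationSettings.CreateCrawlRequestsFromDatabaseHyperLinks = false;
    ApplicationSettings.CreateCrawlRequestsFromDatabaseImages = false;
    ApplicationSettings.CreateCrawlRequestsFromDatabaseWebPages = false;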

Depth of 1 = This page.

Depth of 2 = This page and all pages found from that page and nothing else.
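
As a rough illustration of the Depth value (the CrawlRequest constructor below is a placeholder; the exact signature differs between AN versions, so check your assembly):

    // Placeholder signature - check the actual CrawlRequest constructor in your
    // AN 1.4 assembly. With a depth of 2 the crawl takes the seed page plus every
    // page and file discovered on it, and nothing deeper.
    CrawlRequest crawlRequest = new CrawlRequest(
        new Discovery("http://www.ausl5.la-spezia.it/template1.asp?itemID=32&livello=2&label=gare&CodMenu=2"),
        2 /* depth */);

    crawler.Crawl(crawlRequest); // 'crawler' is your configured Crawler instance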

Does this solve your problem?  Big Smile

-Mike

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

Top 10 Contributor
Male
58 Posts

Hi Mike,

Yes!! This solves our problem!!

Big Smile

MANY MANY THANKS

Massimo

Top 10 Contributor
1,905 Posts

WONDERFUL!

Always glad to help!  Smile

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet
