arachnode.net
An open source .NET web crawler written in C# using SQL 2005/2008

Filtering URLs Comune Milano

Answered (Verified) This post has 1 verified answer | 8 Replies | 1 Follower

Top 10 Contributor
Male
43 Posts
Massimo Ghidoni posted on 19 Jan 2010 10:15 AM

Good morning Mike,

We have a problem with AN 1.4 (we purchased the license last month from Software Technologies S.r.l., my company).

What we need to do is download files and web pages from a large number (about 10,000) of specific URIs (domains and hosts, with or without query strings) WITHOUT downloading from hyperlinks that are "brother" or "father" in the tree of the page.

Here is an example to explain it better:

http://www.comune.milano.it/dseserver/webcity/garecontratti.nsf/Wapertitipo?openview&RestricttoCategory=ELENCO

We would like to download the 2 web pages ("Avviso per l'inserimento.." and "Convenzione per l'affidamento..") and the files that you obtain from those 2 pages, WITHOUT downloading other web pages (or other files) that are behind the links in the left-side menu, or that are in the same directory as the files reached from the 2 pages but are reachable from different web pages. (This matters because the 2 pages mentioned above are in http://www.comune.milano.it/dseserver/webcity/garecontratti.nsf/WEBAll/74431BB70A3C27A4C12573230046F60A?opendocument together with many other pages and files that we would like to exclude.)

We have many URLs to crawl with the same configuration.

Could you please give us an idea of how to resolve the problem?

Answered (Verified) Verified Answer

Top 10 Contributor
1,202 Posts

Setting | Scope | Value | Description
CreateCrawlRequestsFromDatabaseCrawlRequests | Application | true | Should CrawlRequests stored in the database be converted to CrawlRequests for crawling?
CreateCrawlRequestsFromDatabaseFiles | Application | true | Should Files stored in the database be converted to CrawlRequests for crawling?
CreateCrawlRequestsFromDatabaseHyperLinks | Application | true | Should HyperLinks stored in the database be converted to CrawlRequests for crawling?
CreateCrawlRequestsFromDatabaseImages | Application | false | Should Images stored in the database be converted to CrawlRequests for crawling?
CreateCrawlRequestsFromDatabaseWebPages | Application | true | Should WebPages stored in the database be converted to CrawlRequests for crawling?

 

Your first problem is that you have several of the CreateCrawlRequests* settings set to true, which means "When there isn't anything left to crawl in the Engine, go to the database and make CrawlRequests according to those categories". So, turn off all but CreateCrawlRequestsFromDatabaseCrawlRequests.
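To make the change concrete, here is a minimal sketch in Python (not arachnode.net code). The dictionary only mirrors the five settings quoted above, and the helper name keep_only_crawl_requests is made up for illustration; how you actually edit the values depends on your installation.

```python
# Illustrative model of the five settings quoted above (not arachnode.net code).
settings = {
    "CreateCrawlRequestsFromDatabaseCrawlRequests": True,
    "CreateCrawlRequestsFromDatabaseFiles": True,
    "CreateCrawlRequestsFromDatabaseHyperLinks": True,
    "CreateCrawlRequestsFromDatabaseImages": False,
    "CreateCrawlRequestsFromDatabaseWebPages": True,
}

PREFIX = "CreateCrawlRequestsFromDatabase"
KEEP = PREFIX + "CrawlRequests"


def keep_only_crawl_requests(current):
    """Return a copy in which every CreateCrawlRequestsFromDatabase* setting
    is False, except the CrawlRequests one, which is left unchanged."""
    updated = dict(current)
    for key in updated:
        if key.startswith(PREFIX) and key != KEEP:
            updated[key] = False
    return updated


new_settings = keep_only_crawl_requests(settings)
for key, value in sorted(new_settings.items()):
    print(f"{key} = {value}")
```

After this, only CreateCrawlRequestsFromDatabaseCrawlRequests is still true, so an idle Engine would no longer turn stored Files, HyperLinks, or WebPages back into new CrawlRequests.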

Depth of 1 = This page.

Depth of 2 = This page and all pages found from that page and nothing else.
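The two depth rules above can be sketched as a depth-limited breadth-first crawl. This is only an illustration in Python under stated assumptions: the link graph and the page names in LINKS are invented, and the function is not how the arachnode.net Engine is implemented.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "seed": ["doc-a", "doc-b", "menu"],
    "doc-a": ["file-1.pdf"],
    "doc-b": ["file-2.pdf"],
    "menu": ["closed-tenders", "results"],
}


def crawl(seed, max_depth, links=LINKS):
    """Breadth-first crawl where the seed counts as depth 1,
    the pages it links to as depth 2, and so on."""
    visited = [seed]
    seen = {seed}
    queue = deque([(seed, 1)])
    while queue:
        page, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not follow links out of the deepest allowed level
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                visited.append(target)
                queue.append((target, depth + 1))
    return visited


print(crawl("seed", 1))  # ['seed']
print(crawl("seed", 2))  # ['seed', 'doc-a', 'doc-b', 'menu']
```

Note that in this sketch a depth of 2 still picks up the "menu" page, because depth alone does not distinguish navigation links from content links.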

Does this solve your problem? :)

-Mike

An open source .NET web crawler written in C# using SQL 2005/2008.

Join the arachnode.net group on Facebook: http://www.facebook.com/groups.php?ref=sb#/group.php?gid=166721755872

Twitter: http://twitter.com/arachnode_net

arachnode.net provides custom crawling and contracting resources.  Please ask.

http://bit.ly/TOFX4

C# crawler, C# web crawler, C# site crawler

All Replies

Top 10 Contributor
1,202 Posts

OK, so the pages you want to crawl have to be pointed to by other pages but can't be pointed to by navigation links?

Is there a pattern that can be applied to all 10,000 pages that will allow this?

(not fully understanding what you want to do...)


Top 10 Contributor
1,202 Posts

Another thing I am thinking is that you could add a static variable (an instance variable may work as well...) to a CrawlRequest, and have this variable keep track of whether you are crawling in state 'A' or state 'B'...

In state 'A' you would gather your list of links. 

Then, in state 'B', you would allow pages into the DB.
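A rough sketch of that two-state idea, in Python rather than C# and with everything invented for illustration: the TwoStateCrawl class, its attribute names, and the page names are assumptions, and the state lives on a small helper object instead of on a CrawlRequest. It also assumes that state 'B' stores only the pages gathered in state 'A'.

```python
class TwoStateCrawl:
    """Toy model: state 'A' only gathers links, state 'B' stores pages."""

    def __init__(self, links):
        self.links = links    # page -> list of pages it links to
        self.state = "A"
        self.gathered = []    # links collected while in state 'A'
        self.stored = []      # stands in for "pages allowed into the DB"

    def process(self, page):
        if self.state == "A":
            # Gather links only; nothing is stored yet.
            for target in self.links.get(page, []):
                if target not in self.gathered:
                    self.gathered.append(target)
        else:
            # State 'B': store the page, but only if state 'A' gathered it.
            if page in self.gathered and page not in self.stored:
                self.stored.append(page)

    def run(self, seeds):
        self.state = "A"
        for seed in seeds:
            self.process(seed)
        self.state = "B"
        for page in list(self.gathered):
            self.process(page)
        return self.stored


crawl = TwoStateCrawl({
    "list-page": ["notice-1", "notice-2"],
    "notice-1": ["menu-closed"],
})
print(crawl.run(["list-page"]))  # ['notice-1', 'notice-2']
```

Deciding which links state 'A' should gather (for example, skipping navigation menus) is left out of the sketch.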

If you can tell me a little bit more about what you are trying to do I'm sure we can figure it out.

Thanks!


Top 10 Contributor
1,202 Posts

Also, what are 'brother' and 'father'?


Top 10 Contributor
Male
43 Posts

The problem is that all web pages and/or documents (.pdf, .doc, .xls, .jpg, .xml files, etc.) are in one directory (http://www.comune.milano.it/dseserver/webcity/garecontratti.nsf/WEBAll/), regardless of which menu item is selected.

Take the URL from the example above:

http://www.comune.milano.it/dseserver/webcity/garecontratti.nsf/Wapertitipo?openview&RestricttoCategory=ELENCO

The 2 pages that we need to crawl, which you reach by following the 2 hyperlinks ‘Avviso per l'inserimento..’ and ‘Convenzione per l'affidamento..’, are in the same directory (http://www.comune.milano.it/dseserver/webcity/garecontratti.nsf/WEBAll/) as the web pages that you reach through the items of the left-side menu (‘Bandi CHIUSI’, ‘Esiti’, ‘PROSSIMI bandi’, ‘NEWSLETTER’).

We need to crawl only those 2 web pages (‘Avviso per l'inserimento..’ and ‘Convenzione per l'affidamento..’) with their documents, without crawling the other web pages and documents (under ‘Bandi CHIUSI’ or ‘Esiti’ or..).

This kind of situation occurs very often across our URLs and can be considered a general pattern for our URL list.

Thanks again

Top 10 Contributor
Male
43 Posts

Hi Mike,

We need to obtain the web page and the files (.doc, .pdf, etc.) that are linked directly from this URL:

http://www.ausl5.la-spezia.it/template1.asp?itemID=32&livello=2&label=gare&CodMenu=2

We crawl with the current depth set to 1, so we expected to download:

· this web page,

· all the documents (.doc, .pdf, etc.) that you can download directly from that web page,

· all the web pages that you reach by following the hyperlinks on that web page

and nothing else.

In practice, when crawling that URL we crawl the WHOLE web site, with all documents and all web pages.

Why? Is there something wrong in our configuration?

Have we misunderstood the Depth concept?

Thanks again


Top 10 Contributor
Male
43 Posts

Hi Mike,

Yes!! This solves our problem!! :D

MANY MANY THANKS

Massimo

Top 10 Contributor
1,202 Posts

WONDERFUL!

Always glad to help! :)



copyright 2004-2010, arachnode.net LLC
