Good morning Mike,
We have a problem with AN 1.4 (we purchased the license last month from Software Technologies S.r.l., my company).
What we have to do is download files and web pages from a large number (about 10,000) of specific URIs (domains and hosts, with or without query strings) WITHOUT downloading from hyperlinks that are "brother" or "father" in the page tree.
Here is an example to explain better:
http://www.comune.milano.it/dseserver/webcity/garecontratti.nsf/Wapertitipo?openview&RestricttoCategory=ELENCO
We would like to download the 2 web pages ("Avviso per l'inserimento.." and "Convenzione per l'affidamento..") and the files linked from those 2 pages, WITHOUT downloading other web pages (or other files) that are behind links in the left-side menu, or that are in the same directory as the files reached from the 2 pages but are reachable from different web pages. (This is because the 2 pages mentioned above are in http://www.comune.milano.it/dseserver/webcity/garecontratti.nsf/WEBAll/74431BB70A3C27A4C12573230046F60A?opendocument together with many other pages and files that we would like to exclude.)
We have many URLs to crawl with the same configuration.
Could you please give us an idea of how to resolve the problem?
Your first problem is that you have CreateCrawlRequests* set to true, which means "When there isn't anything left to crawl in the Engine, go to the database and make CrawlRequests according to those categories"... so, turn off all but CreateCrawlRequestsFromDatabaseCrawlRequests.
Depth of 1 = This page.
Depth of 2 = This page and all pages found from that page and nothing else.
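To make the depth rule concrete, here is a minimal sketch of a depth-limited breadth-first crawl. It is not arachnode.net code: the `crawl` function, the `LINKS` dictionary, and the page names are all hypothetical, and the "site" is just an in-memory map from each page to the links found on it.

```python
from collections import deque


def crawl(start, links, max_depth):
    """Breadth-first crawl where depth 1 means the start page only."""
    seen = {start}
    order = [start]
    queue = deque([(start, 1)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not follow links out of pages at the depth limit
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                order.append(target)
                queue.append((target, depth + 1))
    return order


# Hypothetical site: a listing page links to two notices, each with one file.
LINKS = {
    "listing": ["notice-1", "notice-2"],
    "notice-1": ["file-1.pdf"],
    "notice-2": ["file-2.doc"],
}

print(crawl("listing", LINKS, 1))  # ['listing']
print(crawl("listing", LINKS, 2))  # ['listing', 'notice-1', 'notice-2']
```

With a depth of 3 the same sketch would also reach the two files, because they are linked from the pages found at depth 2.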
Does this solve your problem?
-Mike
An open source .NET web crawler written in C# using SQL 2005/2008.
OK, so the pages you want to crawl have to be pointed to by other pages but can't be pointed to by navigation links?
Is there a pattern that can be applied to all 10,000 pages that will allow this?
(not fully understanding what you want to do...)
Another thing I am thinking is that you could add a static variable (an instance variable may work as well...) to a CrawlRequest, and have this variable keep track of whether you are crawling in state 'A' or state 'B'...
In state 'A' you would gather your list of links.
Then, in state 'B', you would allow pages into the DB.
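As a rough illustration of the two-state idea (a sketch only, not the arachnode.net CrawlRequest API), the class below keeps a `state` attribute: in state 'A' it only gathers the allowed links from the seed page, and in state 'B' it stores a page only if it is on that list. `TwoStateCrawler`, `is_wanted`, and the sample links are hypothetical names invented for the example.

```python
class TwoStateCrawler:
    def __init__(self, links, is_wanted):
        self.links = links          # page -> list of links found on it
        self.is_wanted = is_wanted  # predicate deciding which links to keep
        self.state = "A"
        self.allowed = set()
        self.stored = []            # stands in for "pages allowed into the DB"

    def visit(self, url):
        if self.state == "A":
            # State 'A': gather the list of links, store nothing.
            self.allowed.update(
                t for t in self.links.get(url, []) if self.is_wanted(t)
            )
        elif url in self.allowed:
            # State 'B': store the page and the documents it links to directly.
            self.stored.append(url)
            self.stored.extend(self.links.get(url, []))

    def run(self, seed):
        self.visit(seed)
        self.state = "B"
        for url in self.links.get(seed, []):
            self.visit(url)
        return self.stored


# Hypothetical listing page with two wanted notices and two menu links.
LINKS = {
    "listing": ["notice-1", "menu/closed", "notice-2", "menu/results"],
    "notice-1": ["file-1.pdf"],
    "notice-2": ["file-2.doc"],
    "menu/closed": ["old-file.pdf"],
}

crawler = TwoStateCrawler(LINKS, lambda url: not url.startswith("menu/"))
print(crawler.run("listing"))
# ['notice-1', 'file-1.pdf', 'notice-2', 'file-2.doc']
```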
If you can tell me a little bit more about what you are trying to do I'm sure we can figure it out.
Thanks!
Also, what are 'brother' and 'father'?
The problem is that all web pages and/or documents (files .pdf, .doc, .xls, .jpg, .xml, etc.) are in one directory (http://www.comune.milano.it/dseserver/webcity/garecontratti.nsf/WEBAll/) regardless of which menu item is selected.
Take the URL of the sample above.
The 2 pages (that we need to crawl) that you reach by following the 2 hyperlinks ‘Avviso per l'inserimento..’ and ‘Convenzione per l'affidamento..’ are in the same directory (http://www.comune.milano.it/dseserver/webcity/garecontratti.nsf/WEBAll/) as the web pages that you reach through the items of the left-side menu (‘Bandi CHIUSI’, ‘Esiti’, ‘PROSSIMI bandi’, ‘NEWSLETTER’).
We need to crawl only those 2 web pages (‘Avviso per l'inserimento..’ and ‘Convenzione per l'affidamento..’) with their documents, without crawling other web pages and documents (under ‘Bandi CHIUSI’ or ‘Esiti’ or..).
This situation occurs very often across our URLs and can be treated as a common pattern for our URL list.
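Because the wanted pages and the menu pages share one directory, the URL alone cannot separate them; what differs is where on the page the link appears. The sketch below shows one way this could be expressed with Python's standard `html.parser`: it collects links but skips any that sit inside a menu container. The markup is invented for illustration. In particular, the assumption that the menu is a `div` with `id="menu"` is hypothetical and says nothing about how the real site is built.

```python
from html.parser import HTMLParser


class ContentLinkParser(HTMLParser):
    """Collect hrefs, ignoring links inside the <div> whose id is menu_id."""

    def __init__(self, menu_id):
        super().__init__()
        self.menu_id = menu_id
        self.menu_depth = 0  # greater than 0 while inside the menu container
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.menu_depth:
            if tag == "div":
                self.menu_depth += 1  # track nested divs inside the menu
            return
        if tag == "div" and attrs.get("id") == self.menu_id:
            self.menu_depth = 1
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if self.menu_depth and tag == "div":
            self.menu_depth -= 1


# Hypothetical page: menu links and content links point into the same directory.
PAGE = """
<div id="menu"><div><a href="/WEBAll/closed">Bandi CHIUSI</a></div>
<a href="/WEBAll/results">Esiti</a></div>
<div id="content"><a href="/WEBAll/notice-1">Avviso</a>
<a href="/WEBAll/notice-2">Convenzione</a></div>
"""

parser = ContentLinkParser("menu")
parser.feed(PAGE)
print(parser.links)  # ['/WEBAll/notice-1', '/WEBAll/notice-2']
```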
Thanks again
Hi Mike,
We need to obtain the web page and the files (.doc, .pdf, etc.) that are linked directly from this URL:
http://www.ausl5.la-spezia.it/template1.asp?itemID=32&livello=2&label=gare&CodMenu=2
We crawl with current depth = 1, so we expected to download:
· this web page,
· all the documents (.doc, .pdf, etc) that you can download directly from that web page,
· all the web pages that you reach navigating the hyperlink on that web page
and nothing else.
Actually, when we crawl that URL we crawl the ENTIRE web site, with all documents and all web pages.
Why? Is there something wrong in our configuration?
Have we misunderstood the Depth concept?
Thanks again
Hi Mike,
Yes!! This solves our problem!!
MANY MANY THANKS
Massimo
WONDERFUL!
Always glad to help!