arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE or MongoDB/RavenDB/Hadoop

Problem downloading from php sites

Massimo Ghidoni posted on Fri, Mar 26 2010 8:51 AM

Good evening,

We have a problem with AN 1.4.

We need to download only files (not web pages, images, etc.) up to a specified depth, so we have set the following in the configuration:

CreateCrawlRequestsFromDatabaseCrawlRequests = true
CreateCrawlRequestsFromDatabaseFiles = false
CreateCrawlRequestsFromDatabaseHyperLinks = false
CreateCrawlRequestsFromDatabaseImages = false
CreateCrawlRequestsFromDatabaseWebPages = false

We have problems downloading files from PHP sites.

In the AllowedDataTypes table we have this record:

ID   ContentTypeID   DiscoveryTypeID   FullTextIndexType   Overrides
28   898             7                 .htm                .asp,.aspx,.php,.dyn,.html,.jsp

Furthermore, if I try to crawl only this site:

http://www.napoli2nord.it/aziende.php#bandi?bandi.php

AN doesn't create the directory path and doesn't download anything.

Is there a specific setting in the configuration?

Thank you in advance

Massimo

Verified Answer

Mike replied (verified by arachnode.net):

It doesn't look like your page 'http://www.napoli2nord.it/aziende.php#bandi?bandi.php' is experiencing any errors that aren't caused by your configuration.

In the DisallowedAbsoluteUris table, 'http://www.napoli2nord.it/aziende.php#bandi?bandi.php' isn't listed, unless you had 'InsertDisallowedAbsoluteUris' set to 'false'.

In the Exceptions table AbsoluteUri2 would equal 'http://www.napoli2nord.it/aziende.php#bandi?bandi.php' if this page had a problem downloading.

Do you see 'http://www.napoli2nord.it/aziende.php#bandi?bandi.php' in WebPages?

Also, CreateCrawlRequestsFromDatabase* pertains to fetching CrawlRequests from disk, not to whether files/images will be crawled when discovered inside the Crawler/Engine.  If you don't want files and images at all, set 'AssignFileAndImageDiscoveries' in cfg.Configuration to 'false'.

If you aren't saving Source to Disk or to Database, then when you encounter a 'Last-Modified' HttpHeader you may see the 'STALE' error: AN is saying, 'Hey - I'd like to use the cached version, but you didn't save one for me...'

ID;ContentTypeID;DiscoveryTypeID;AbsoluteUri;Reason
1;898;7;http://www.asianapoli.it/sqes4asia/stile.css;Disallowed by FileExtension.
2;898;7;http://porto.napoli.it//flash.gif;Disallowed by FileExtension.
3;898;7;http://porto.napoli.it/img/bannerNews.gif;Disallowed by FileExtension.
4;1;0;http://porto.napoli.it/js/RAPID.js;Disallowed by unassigned DataType.
5;1;0;http://porto.napoli.it/img/iconBack.gif;Disallowed by FileExtension.
6;1;0;http://porto.napoli.it/img/iconPrint.gif;Disallowed by FileExtension.
7;1;0;http://porto.napoli.it/img/iconReserved.gif;Disallowed by FileExtension.
8;1;0;http://porto.napoli.it/js/FOOTER.js;Disallowed by unassigned DataType.

The entries look like perfectly valid results according to your configuration and do not pertain to .php sites.
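If it helps, here is a rough sketch of how an export like the one above could be summarized, assuming the DisallowedAbsoluteUris rows were saved as semicolon-delimited text with the header shown. The helper names `tally_reasons` and `php_uris` are made up for this example and are not part of AN; the script uses only Python's standard library.

```python
import csv
import io
from collections import Counter
from urllib.parse import urlparse

# Rows copied from the DisallowedAbsoluteUris export quoted in this thread.
EXPORT = """ID;ContentTypeID;DiscoveryTypeID;AbsoluteUri;Reason
1;898;7;http://www.asianapoli.it/sqes4asia/stile.css;Disallowed by FileExtension.
2;898;7;http://porto.napoli.it//flash.gif;Disallowed by FileExtension.
3;898;7;http://porto.napoli.it/img/bannerNews.gif;Disallowed by FileExtension.
4;1;0;http://porto.napoli.it/js/RAPID.js;Disallowed by unassigned DataType.
5;1;0;http://porto.napoli.it/img/iconBack.gif;Disallowed by FileExtension.
6;1;0;http://porto.napoli.it/img/iconPrint.gif;Disallowed by FileExtension.
7;1;0;http://porto.napoli.it/img/iconReserved.gif;Disallowed by FileExtension.
8;1;0;http://porto.napoli.it/js/FOOTER.js;Disallowed by unassigned DataType.
"""


def read_rows(export_text):
    """Parse the semicolon-delimited export into a list of dicts."""
    return list(csv.DictReader(io.StringIO(export_text), delimiter=";"))


def tally_reasons(export_text):
    """Count how often each Reason value appears (hypothetical helper)."""
    return Counter(row["Reason"] for row in read_rows(export_text))


def php_uris(export_text):
    """Return the disallowed AbsoluteUris whose path ends in .php."""
    return [
        row["AbsoluteUri"]
        for row in read_rows(export_text)
        if urlparse(row["AbsoluteUri"]).path.lower().endswith(".php")
    ]


reasons = tally_reasons(EXPORT)
php_hits = php_uris(EXPORT)
print(reasons)
print(php_hits)
```

For the eight rows quoted here, this counts six "Disallowed by FileExtension." rows and two "Disallowed by unassigned DataType." rows, and finds no .php entries among them.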

Created;ID;AbsoluteUri1;AbsoluteUri2;HelpLink;Message;Source;StackTrace
2010-03-27 12:01:32.123;1;http://www.napoli2nord.it/aziende.php#bandi?bandi.php;http://www.napoli2nord.it/file/magazine_foto_68_pagina;NULL;Errore del server remoto: (404) Non trovato.;System;   in System.Net.HttpWebRequest.GetResponse()
   in Arachnode.SiteCrawler.Components.WebClient.GetWebResponse(String absoluteUri, String method)

Finally, a 404 error means 'not found'.

Does any of this help in any way?

Let me know!

-Mike

 

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

All Replies

Mike replied:

Is "http://www.napoli2nord.it/aziende.php#bandi?bandi.php" found in the Exceptions and/or DisallowedAbsoluteUris tables?

If so, what are the contents of the row(s)?

Examining these two tables is the first step in troubleshooting.

Your settings in cfg.AllowedDataTypes look correct, but the Overrides column exists to allow the FullTextIndexType column in WebPages to be more specific than .htm.

I bet the DisallowedAbsoluteUris table will tell you/us what is wrong.

Massimo Ghidoni replied:

The DisallowedAbsoluteURIs table was empty during our test session.

The Exceptions table contained only one row, which said that the page was STALE based on a discovery date comparison. I'm out of the office at the moment and can't recall the full description of the exception more precisely.

This URL was migrated recently, but the files can be downloaded with a browser.

Thank you in advance.

  

Massimo Ghidoni replied:

Attachments: 4478.Exceptions_NapoliNord_02.txt, 2084.DisallowedAbsoluteURI_PortoNapoli.txt

Attached you will find the exports of the DisallowedAbsoluteURIs table and the Exceptions table. We have a list of PHP sites with similar results.

You will see a "Disallowed by unassigned DataType" reason, but we don't know whether this information is useful for troubleshooting.

We have also included JavaScript in the allowed data types, but we obtained the same result.

Massimo

Massimo Ghidoni replied:

Attachment: 7380.Exceptions_NapoliNord_03.txt

Here is the "STALE" exception error.

 

Massimo Ghidoni replied:

Hi Mike, we have run a new test session. These are the results.

Attachments: 5611.DisallowedAbsoluteURI_NapoliNord.txt, 0827.exceptionsNapoliNord.txt, 8360.webPagesNapoliNord.txt, 8838.FilesNapoliNord.txt

Look at the page. I want to crawl the PDF files in the "Bandi di gara" section (the green anchor tab).

I have seen that AN doesn't store the PDFs located in www.napoli2nord.it/file.

Only the magazine files are crawled, but they are reached from the section named "archivio" and not from "bandi di gara".

I suspected a spider trap and turned off the "Just My Code" option, but got the same result.

Now I can recrawl with SaveDiscoveredWebPage to disk set to true.

We suspect the AJAX in the page (the URLs of the PDFs above are created dynamically with AJAX).

Many of the sites that we have to crawl have the same technical characteristics.

I am counting on your excellent skills. Thanks in advance.

Massimo

 

 

Mike replied:

Hey -

AN doesn't render the page; it only processes the text that is returned from the web server.

You might consider submitting the AbsoluteUris manually, according to a scheme such as bandi_file_[0-100]_[0-100].pdf, if you need the files. For example:

http://www.napoli2nord.it/file/bandi_file_95_50.pdf
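As a minimal sketch of that idea, the candidate AbsoluteUris could be generated like this. It assumes both numbers run from 0 to 100 inclusive, as in the scheme above; the function name `candidate_uris` is hypothetical, and how the resulting list is then handed to AN as CrawlRequests is deliberately left out. Many of the generated URIs will presumably not exist on the server and will return a 404.

```python
from itertools import product

# Base path taken from the example URI in this thread.
BASE = "http://www.napoli2nord.it/file/"


def candidate_uris(first_max=100, second_max=100):
    """Yield every URI matching bandi_file_[0-first_max]_[0-second_max].pdf.

    Both ranges are inclusive; this is an assumption about the scheme.
    """
    for a, b in product(range(first_max + 1), range(second_max + 1)):
        yield f"{BASE}bandi_file_{a}_{b}.pdf"


uris = list(candidate_uris())
print(len(uris))   # 101 * 101 combinations
print(uris[0])
```

With the default ranges this produces 10,201 candidates, so it is worth throttling the requests or narrowing the ranges if you already know roughly which numbers are in use.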

It is possible to render the page, but that isn't something AN supports natively.

See here: http://arachnode.net/forums/t/112.aspx for an idea on how to use the WebBrowser control to render the page.

Also, does the site provide an RSS feed, or is there a Google Alerts feed for the site?

Thanks!
Mike

Massimo Ghidoni replied:

No, I don't think the site has any kind of RSS feed or Google Alert.

Why can't we crawl this site with depth = 2 (same configuration)?

http://www.asl.milano.it/user/Default.aspx?MOD=ASLBGA&SEZ=10&PAG=232

Is it the same rendering problem?

Thanks a lot

 

Mike replied:

An easy way to figure out whether your issue is due to rendering is to view the page source.  If you don't see all of the page in the source, then the page has some sort of dynamic client-side rendering in place.
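A rough sketch of that check, outside of AN: collect the links that are present in the raw HTML text (which is what a non-rendering crawler sees) and look for the files you expect. The `SAMPLE` markup and the names `LinkCollector` and `links_in_source` are invented for this example and are not taken from the real page or from AN. In practice you would pass in the text saved from "view source".

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags found in the raw, unrendered HTML."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def links_in_source(html_text):
    """Return all anchor hrefs that exist in the page source as served."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.hrefs


# Hypothetical page: one static link, one section filled in by script.
SAMPLE = """<html><body>
<a href="/file/magazine_1.pdf">archivio</a>
<div id="bandi"></div>
<script>loadSection('bandi.php');</script>
</body></html>"""

hrefs = links_in_source(SAMPLE)
static_pdfs = [h for h in hrefs if h.lower().endswith(".pdf")]
has_bandi = any("bandi_file" in h for h in hrefs)
print(static_pdfs, has_bandi)
```

In this made-up sample the magazine PDF is found because its link is in the source, while nothing matching "bandi_file" shows up because that section would only be filled in by the script after the page loads.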

This page looks OK to me.  Did you check the Exceptions and DisallowedAbsoluteUris tables?

If you absolutely need AJAX pages crawled I can be contracted to write the functionality for you.

-Mike

copyright 2004-2017, arachnode.net LLC