arachnode.net
An open source .NET web crawler written in C# using SQL 2005/2008

Web Shopping Pricing Bots solution...

This post has 1 verified answer | 28 Replies | 5 Followers

Top 50 Contributor
5 Posts
demol posted on 28 Jan 2009 9:57 AM

Hello guys!

Congratulations, your project is incredible...

I'd like to know if it is possible to build a web shopping price-comparison site using arachnode, like http://www.pricegrabber.com/...  Is it recommended?

Thanks!

Verified Answer

Top 10 Contributor
1,202 Posts
arachnode.net replied on 28 Jan 2009 5:41 PM
Verified by arachnode.net

The answer is yes.

If you want to crawl a specific list of domains here's what you need to do:

1.) Insert your intended Domains into the DisallowedDomains table and set the column value for 'IsDisallowed' to True.

2.) Delete all rows from the DisallowedWords table.  The words in this table are for filtering adults-only content.  Since you know which sites you want to crawl, the filter isn't needed.  And, since we'll need to negate the Address CrawlRule, these rows have to go, or else we'll only get content from PriceGrabber.com that is adults-only content, which will likely be 2 pages.  (Yes, it's possible to crawl only adults-only content...)

3.) Set the value for negateIsDisallowed in the Address CrawlRule in CrawlActions.config to True.

4.) Insert your starting domains into the CrawlRequests table.

5.) Start crawling.

Then, slice and dice the incoming data however you please.  Do you need additional information?
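
To make step 3 a little more concrete, here is a minimal Python sketch of the general idea: negating an "is disallowed" check turns a list of disallowed domains into a list of the only domains allowed. This is not arachnode.net code. The names `host_matches` and `is_disallowed` are hypothetical, and matching subdomains of a listed domain is an assumption.

```python
from urllib.parse import urlsplit


def host_matches(host, domain):
    """True if host is the domain itself or one of its subdomains (assumed rule)."""
    host = host.lower().rstrip(".")
    domain = domain.lower()
    return host == domain or host.endswith("." + domain)


def is_disallowed(url, listed_domains, negate=False):
    """Hypothetical model of a domain rule with a negate switch."""
    host = urlsplit(url).hostname or ""
    listed = any(host_matches(host, d) for d in listed_domains)
    # Without negation, a listed domain is blocked.
    # With negation, everything except the listed domains is blocked.
    return (not listed) if negate else listed


listed = ["pricegrabber.com"]
print(is_disallowed("http://www.pricegrabber.com/", listed, negate=True))  # False
print(is_disallowed("http://www.example.com/", listed, negate=True))       # True
```

With the switch on, only the listed domains get through, which is why the list has to contain exactly the sites you intend to crawl.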

An open source .NET web crawler written in C# using SQL 2005/2008.

Join the arachnode.net group on Facebook: http://www.facebook.com/groups.php?ref=sb#/group.php?gid=166721755872

Twitter: http://twitter.com/arachnode_net

arachnode.net provides custom crawling and contracting resources.  Please ask.

http://bit.ly/TOFX4

C# crawler, C# web crawler, C# site crawler

All Replies

Top 10 Contributor
1,202 Posts

Does the App.config in the Console project point to Application.config in the Configuration project?

Top 25 Contributor
11 Posts

I think this post is not valid for version 1.1 because Arachnode.SiteCrawler.Rules.Address does not exist anymore.

So what is the current way to restrict a crawl to a specific domain?

For now I will try setting negateIsDisallowed=true on Arachnode.SiteCrawler.Rules.AbsoluteUri and post whether it works.

Top 10 Contributor
1,202 Posts

Address has been merged with AbsoluteUri.

The AbsoluteUri CrawlRule references the following tables:

If you want to crawl msn.com, delete all rows in the tables above, insert msn.com into DisallowedDomains with IsDisallowed=true, and set the following in CrawlRules.config:

negateIsDisallowedForAbsoluteUri="true"

Let me know!

Top 25 Contributor
11 Posts

Hi,

I am trying to crawl a big domain completely; it has approx. 1M pages (or maybe more). I started the crawl with a maximum depth of "int.MaxValue" and restrictToUriHost. Then I configured everything for negateIsDisallowedForAbsoluteUri="true".

But when the crawl completes I have only 3000 pages and 33000 hyperlinks in the database. There must be a problem or something missing. The hyperlinks are also in the same domain and should be crawled and saved as web pages, and so on until the complete domain is stored.

Please respond asap, because I am in a hurry.

Thanks.

Top 10 Contributor
1,202 Posts

Which exceptions are present in the exceptions table?

Have you checked out the latest build from the trunk?

Top 25 Contributor
11 Posts

Thank you for your reply.

1. I have just some 404s, mainly for pictures, and a couple of XML parsing errors. I am currently crawling and there are 100 exceptions, 7000 hyperlinks (I assume some of them represent the same URI, like .../download and .../download/) and just 400 web pages. The crawl is continuing with no problems. But the problem is that when it finishes, there are some URIs in the hyperlinks table that actually exist (are present in the domain) but are not in the webpages table.

2. No, I could not run the latest build or the 1.2 branch. When I try to deploy, I always get that assembly error about the Functions project. I don't know if it is necessary to deploy the whole solution to run the crawl. But I need to run the term extraction and lookup on the web page metadata.
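
One way to check the gap described in point 1 is to normalize both lists before comparing them, so that variants like /download and /download/ count once. The following is a rough Python sketch, not anything from arachnode.net: `normalize` and `uncrawled` are hypothetical helpers, the example.com URLs are made up, and treating a trailing slash as insignificant is an assumption that not every server honors.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Lowercase scheme and host, drop the fragment, and drop a trailing slash.

    Treating /download and /download/ as the same page is an assumption;
    a server is free to serve different content for each.
    """
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def uncrawled(hyperlinks, web_pages):
    """Return normalized hyperlinks that have no matching normalized web page."""
    crawled = {normalize(u) for u in web_pages}
    return sorted({normalize(u) for u in hyperlinks} - crawled)


hyperlinks = [
    "http://example.com/download",
    "http://example.com/download/",
    "http://example.com/about#team",
]
web_pages = ["http://example.com/download/"]
print(uncrawled(hyperlinks, web_pages))  # ['http://example.com/about']
```

Whatever remains after this comparison is a link that was discovered but never stored as a page, which narrows down where to look.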

Top 25 Contributor
11 Posts

By the way, I am currently running version 1.1.3444.26904.

Top 25 Contributor
11 Posts

Maybe I have some fault in the depth configuration.  I did not touch the "Arachnode.SiteCrawler.Rules.Depth" rule. How should I configure this?

Top 10 Contributor
1,202 Posts

1.) Do you have CreateCrawlRequestsFromDatabaseHyperLinks set to true?

2.) What error do you get?  Let's work on getting this running for you as the process of restricting to a domain has been greatly simplified.

Top 10 Contributor
1,202 Posts

Set the depth to a high integer value.
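
To show why the depth value matters, here is a small Python sketch of a depth-limited, breadth-first crawl over an in-memory link graph. It is only an illustration: `crawl` and the tiny `links` graph are made up, and the exact meaning of depth (here the start page is depth 0) is an assumption that may not match how arachnode.net counts it.

```python
from collections import deque


def crawl(start, links, max_depth):
    """Breadth-first walk over an in-memory link graph, stopping at max_depth."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow links from pages at the depth limit
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return seen


# A chain of four pages: / -> /a -> /b -> /c
links = {"/": ["/a"], "/a": ["/b"], "/b": ["/c"], "/c": []}
print(len(crawl("/", links, max_depth=1)))   # 2
print(len(crawl("/", links, max_depth=10)))  # 4
```

A low limit stops the walk before deeper pages are ever requested, so links to them can be recorded without the pages themselves being fetched.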

Top 25 Contributor
11 Posts

OK, I've got a fresh 1.2 and built it. Now I get this:

Error 1 UNSAFE ASSEMBLY permission was denied on object 'server', database 'master'. Functions

Top 25 Contributor
11 Posts

OK, I think I've got past the configuration and deployed the project, but I don't know how to configure 1.2.

I want to create crawl requests from all resources (images, hyperlinks, files, the database, etc.) and I want to crawl the domain entirely, as you know. Also, I don't want to allow named anchors or query strings in the crawl.

I have no other configuration, such as file extensions.

Top 10 Contributor
1,202 Posts

Are you an admin on the box?  Where does this error come from?

Top 10 Contributor
1,202 Posts

Create a CrawlRequest from code that restricts the crawl to UriClassificationType.Domain and restricts discoveries to UriClassificationType.Domain.

Set the config settings for CreateCrawlRequestsFromDatabase* except WebPages.

Find the CrawlRules table and set IsDisallowedForNamedAnchors and IsDisallowedForQueryStrings = false

Mike
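
For reference, this is roughly what excluding named anchors and query strings means at the URL level, as a short Python sketch. It is not arachnode.net code: `has_named_anchor`, `has_query_string` and `keep` are hypothetical names, and the example.com URLs are made up.

```python
from urllib.parse import urlsplit


def has_named_anchor(url):
    """True if the URL carries a fragment such as #section."""
    return bool(urlsplit(url).fragment)


def has_query_string(url):
    """True if the URL carries a non-empty query such as ?id=5."""
    return bool(urlsplit(url).query)


def keep(url):
    """Keep only URLs with neither a named anchor nor a query string."""
    return not (has_named_anchor(url) or has_query_string(url))


urls = [
    "http://example.com/products",
    "http://example.com/products#reviews",
    "http://example.com/products?id=5",
]
print([u for u in urls if keep(u)])  # ['http://example.com/products']
```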


copyright 2004-2010, arachnode.net LLC
