arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE, MongoDB, RavenDB, or Hadoop

Completely Open Source @ GitHub


What am I doing wrong?

Answered (Verified) This post has 1 verified answer | 6 Replies | 1 Follower

Top 25 Contributor
Male
20 Posts
vishal posted on Thu, Jan 21 2010 12:24 PM

Hi,

I am using AN v1.4 and struggling to make arachnode.net work for me; I could really use some help.

 

Scenario:

I have two sites that I want to crawl

(i) www.monster.com

(ii) www.jobster.com

 

Both are job search engines: on their home pages you enter a search string, and they return results based on your input. I used the following URLs to supply my search string:

 

http://jobsearch.monster.com/Search.aspx?brd=1&q=c%23&cy=us&where=San%20Diego,%20CA&rad=20&rad_units=miles&qlt=1227153&qln=628427&lid=354&re=130

 

http://www.jobster.com/find/US/jobs/in/Carlsbad%2c+CA/for/c%23+asp.net

 

and added these to the crawler in code, then started the crawl.

Once the crawl finished, I ran the following queries to inspect the results:

 

select top 10 * from [Arachnode.net].dbo.WebPages

select * from [arachnode.net].dbo.WebPages_MetaData

 

 

select * from [arachnode.net].dbo.DisallowedAbsoluteUris

select * from [arachnode.net].cfg.CrawlRules

 

 

Below is my configuration table

select [ConfigurationTypeID],[Key],[Value] from [arachnode.net].cfg.Configuration

ConfigurationTypeID | Key | Value
1 | AssignCrawlRequestPrioritiesForFiles | true
1 | AssignCrawlRequestPrioritiesForHyperLinks | true
1 | AssignCrawlRequestPrioritiesForImages | true
1 | AssignCrawlRequestPrioritiesForWebPages | true
1 | AssignEmailAddressDiscoveries | true
1 | AssignFileAndImageDiscoveries | true
1 | AssignHyperLinkDiscoveries | true
1 | ClassifyAbsoluteUris | true
1 | ConsoleOutputLogsDirectory | D:\Arachnode_net_Crawer\ConsoleOutputLogsDirectory
1 | CrawlRequestTimeoutInMinutes | 1
1 | CreateCrawlRequestsFromDatabaseCrawlRequests | true
1 | CreateCrawlRequestsFromDatabaseFiles | false
1 | CreateCrawlRequestsFromDatabaseHyperLinks | false
1 | CreateCrawlRequestsFromDatabaseImages | false
1 | CreateCrawlRequestsFromDatabaseWebPages | false
1 | DesiredMaximumMemoryUsageInMegabytes | 1024
1 | DownloadedFilesDirectory | D:\Arachnode_net_Crawer\DownloadedFilesDirectory
1 | DownloadedImagesDirectory | D:\Arachnode_net_Crawer\DownloadedImagesDirectory
1 | DownloadedWebPagesDirectory | D:\Arachnode_net_Crawer\DownloadedWebPagesDirectory
1 | EnableConsoleOutput | true
1 | ExtractFileMetaData | false
1 | ExtractImageMetaData | true
1 | ExtractWebPageMetaData | false
1 | InsertDisallowedAbsoluteUriDiscoveries | true
1 | InsertDisallowedAbsoluteUris | true
1 | InsertEmailAddressDiscoveries | true
1 | InsertEmailAddresses | true
1 | InsertExceptions | true
1 | InsertFileDiscoveries | true
1 | InsertFileMetaData | false
1 | InsertFiles | true
1 | InsertFileSource | false
1 | InsertHyperLinkDiscoveries | true
1 | InsertHyperLinks | true
1 | InsertImageDiscoveries | true
1 | InsertImageMetaData | true
1 | InsertImages | true
1 | InsertImageSource | false
1 | InsertWebPageMetaData | false
1 | InsertWebPages | true
1 | InsertWebPageSource | false
1 | MaximumNumberOfCrawlRequestsToCreatePerBatch | 1000
1 | MaximumNumberOfCrawlThreads | 1
1 | MaximumNumberOfHostsAndPrioritiesToSelect | 10000
1 | OutputConsoleToLogs | false
1 | OutputStatistics | false
1 | SaveDiscoveredFilesToDisk | true
1 | SaveDiscoveredImagesToDisk | true
1 | SaveDiscoveredWebPagesToDisk | true
1 | SqlCommandTimeoutInMinutes | 60
1 | UserAgent | Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/532.0 (KHTML, like Gecko) Chrome/3.0.195.38 Safari/532.0
1 | VerboseOutput | true
2 | CacheTimeoutInMinutes | 15
2 | CreateCrawlRequestsForMissingFilesAndImages | false
2 | DownloadedFilesVirtualDirectory | \DownloadedFiles
2 | DownloadedImagesVirtualDirectory | \DownloadedImages
2 | LuceneDotNetIndexDirectory | D:\Arachnode_net_Crawer\Index
2 | MaximumNumberOfDocumentsToReturnPerSearch | 200
2 | MaximumPageTitleLength | 64
2 | PageSize | 10

Problem:

I am not sure where the results of the crawl are stored. Where should I look for the pages that were crawled? I know AN converts them to XHTML, but I cannot find them.

Please help; I can set up a TeamViewer or Microsoft Live Meeting session if needed.

-------------------------------

I found the following: I had to run the updates below to make AN store the web pages. I still have questions about DisallowedAbsoluteUris.

update [Arachnode.net].cfg.Configuration set Value='true' where [key] = 'ExtractWebPageMetaData'
update [Arachnode.net].cfg.Configuration set Value='true' where [key] = 'InsertWebPageMetaData'

 

Verified Answer

Top 10 Contributor
1,905 Posts
Verified by arachnode.net

Easy fix!

 

ExtractWebPageMetaData = true

InsertWebPageMetaData = true


For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

All Replies


Top 25 Contributor
Male
20 Posts
vishal replied on Thu, Jan 21 2010 1:29 PM

Thanks Mike, I fixed that one. I have started writing my first plugin. I want it to be invoked either before or after a row is inserted into the WebPages_MetaData table, and it should have access to the following information:

the WebPages row and all of its related WebPages_MetaData records, as parent-child data.

What would be the best way to get this information?
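If the plugin only needs each WebPages row together with its related WebPages_MetaData rows, a join along these lines might be enough as a starting point. Note that the join column name used here (WebPageID) is an assumption based on the table names in this thread; it has not been verified against the actual AN v1.4 schema and should be checked first.

```sql
-- Hypothetical sketch: the join column [WebPageID] is an assumption,
-- not verified against the arachnode.net v1.4 schema.
SELECT wp.*, md.*
FROM [arachnode.net].dbo.WebPages AS wp
LEFT JOIN [arachnode.net].dbo.WebPages_MetaData AS md
    ON md.WebPageID = wp.WebPageID
ORDER BY wp.WebPageID
```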

 

Regards

Vishal

Top 25 Contributor
Male
20 Posts
vishal replied on Thu, Jan 21 2010 4:45 PM

I started creating this plugin, and I am using ManageLuceneDotNetIndexes.cs as the template for it.

Top 10 Contributor
1,905 Posts

Each CrawlRequest has a Parent reference, but not a Grandparent reference, to keep memory consumption down.

Does the Parent reference work for you?
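As a sketch, reading the Parent reference inside a custom plugin's action method might look like the following. The base class name, override signature, and member names here are assumptions pieced together from this thread and from using ManageLuceneDotNetIndexes.cs as a template; they are not verified API and should be checked against the AN_v1.4 source.

```csharp
// Hypothetical sketch only: the base class, override signature, and member
// names are assumptions; check ManageLuceneDotNetIndexes.cs in AN_v1.4
// for the real plugin contract.
public class WebPageMetaDataAction : ACrawlAction
{
    public override void PerformAction(CrawlRequest crawlRequest)
    {
        // Each CrawlRequest keeps a Parent reference (the request that
        // discovered it) but no Grandparent, to limit memory use.
        if (crawlRequest.Parent != null)
        {
            // Walk exactly one level up to relate the current page
            // to the page that discovered it.
            var parent = crawlRequest.Parent;
            // ... load or insert the related WebPages_MetaData rows here ...
        }
    }
}
```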


Top 25 Contributor
Male
20 Posts
vishal replied on Fri, Jan 22 2010 10:15 AM

I think it might. I have started writing the plugin and will soon reach the point in the code where I can start experimenting with the CrawlRequest.

Top 10 Contributor
1,905 Posts

Great!  Let me know how it turns out.



copyright 2004-2017, arachnode.net LLC