Hi,
I am using AN_v1.4 and struggling to make AN work for me. I really need help.
Scenario:
I have two sites that I want to crawl:
(i) www.monster.com
(ii) www.jobster.com
Both are job search engines: the home page lets you enter a search string and returns results based on that input.
I used the following URLs to supply my search string:
http://jobsearch.monster.com/Search.aspx?brd=1&q=c%23&cy=us&where=San%20Diego,%20CA&rad=20&rad_units=miles&qlt=1227153&qln=628427&lid=354&re=130
http://www.jobster.com/find/US/jobs/in/Carlsbad%2c+CA/for/c%23+asp.net
I added these to the crawler using the code below and started the crawl.
Once the crawl finished, I got the following results:
select top 10 * from [Arachnode.net].dbo.WebPages
select * from [arachnode.net].dbo.WebPages_MetaData
select * from [arachnode.net].dbo.DisallowedAbsoluteUris
select * from [arachnode.net].cfg.CrawlRules
Below is my configuration table:
select [ConfigurationTypeID],[Key],[Value] from [arachnode.net].cfg.Configuration
ConfigurationTypeID 1:
AssignCrawlRequestPrioritiesForFiles = true
AssignCrawlRequestPrioritiesForHyperLinks
AssignCrawlRequestPrioritiesForImages
AssignCrawlRequestPrioritiesForWebPages
AssignEmailAddressDiscoveries
AssignFileAndImageDiscoveries
AssignHyperLinkDiscoveries
ClassifyAbsoluteUris
ConsoleOutputLogsDirectory = D:\Arachnode_net_Crawer\ConsoleOutputLogsDirectory
CrawlRequestTimeoutInMinutes
CreateCrawlRequestsFromDatabaseCrawlRequests
CreateCrawlRequestsFromDatabaseFiles = false
CreateCrawlRequestsFromDatabaseHyperLinks
CreateCrawlRequestsFromDatabaseImages
CreateCrawlRequestsFromDatabaseWebPages
DesiredMaximumMemoryUsageInMegabytes = 1024
DownloadedFilesDirectory = D:\Arachnode_net_Crawer\DownloadedFilesDirectory
DownloadedImagesDirectory = D:\Arachnode_net_Crawer\DownloadedImagesDirectory
DownloadedWebPagesDirectory = D:\Arachnode_net_Crawer\DownloadedWebPagesDirectory
EnableConsoleOutput
ExtractFileMetaData
ExtractImageMetaData
ExtractWebPageMetaData
InsertDisallowedAbsoluteUriDiscoveries
InsertDisallowedAbsoluteUris
InsertEmailAddressDiscoveries
InsertEmailAddresses
InsertExceptions
InsertFileDiscoveries
InsertFileMetaData
InsertFiles
InsertFileSource
InsertHyperLinkDiscoveries
InsertHyperLinks
InsertImageDiscoveries
InsertImageMetaData
InsertImages
InsertImageSource
InsertWebPageMetaData
InsertWebPages
InsertWebPageSource
MaximumNumberOfCrawlRequestsToCreatePerBatch = 1000
MaximumNumberOfCrawlThreads
MaximumNumberOfHostsAndPrioritiesToSelect = 10000
OutputConsoleToLogs
OutputStatistics
SaveDiscoveredFilesToDisk
SaveDiscoveredImagesToDisk
SaveDiscoveredWebPagesToDisk
SqlCommandTimeoutInMinutes = 60
UserAgent = Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/532.0 (KHTML, like Gecko) Chrome/3.0.195.38 Safari/532.0
VerboseOutput
ConfigurationTypeID 2:
CacheTimeoutInMinutes = 15
CreateCrawlRequestsForMissingFilesAndImages
DownloadedFilesVirtualDirectory = \DownloadedFiles
DownloadedImagesVirtualDirectory = \DownloadedImages
LuceneDotNetIndexDirectory = D:\Arachnode_net_Crawer\Index
MaximumNumberOfDocumentsToReturnPerSearch = 200
MaximumPageTitleLength = 64
PageSize = 10
Problem:
I am not sure where the results of the crawl are stored.
Where should I look for the pages that were crawled? I know AN converts them to XHTML, but I am not sure where they end up.
Please help. I can set up a TeamViewer or Microsoft Live Meeting session if needed.
-------------------------------
I found the following:
I had to run the following statements to make AN store the web pages. I still have questions about DisallowedAbsoluteUris.
update [Arachnode.net].cfg.Configuration set Value='true' where [key] = 'ExtractWebPageMetaData'
update [Arachnode.net].cfg.Configuration set Value='true' where [key] = 'InsertWebPageMetaData'
Easy fix!
ExtractWebPageMetaData = true
InsertWebPageMetaData = true
For best service when you require assistance:
Skype: arachnodedotnet
Thanks Mike, I fixed this one. I have started writing my first plugin. I want it to be invoked either before or after a row is inserted into the WebPages_MetaData table, and it should have access to the following information:
The WebPages row and all related WebPages_MetaData records, in the form of parent-child data.
What would be the best way to get this info?
Regards
Vishal
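To make the parent-child shape asked about above concrete, here is a minimal sketch of joining WebPages to WebPages_MetaData and grouping the rows per page. It is only an illustration: it uses Python's built-in sqlite3 as a stand-in for the SQL Server database, and the column names (ID, AbsoluteUri, WebPageID, Content) are assumptions made for the example, not the confirmed arachnode.net schema.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory stand-in database with an ASSUMED two-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE WebPages (
            ID INTEGER PRIMARY KEY,
            AbsoluteUri TEXT NOT NULL
        );
        CREATE TABLE WebPages_MetaData (
            ID INTEGER PRIMARY KEY,
            WebPageID INTEGER NOT NULL REFERENCES WebPages(ID),
            Content TEXT
        );
        INSERT INTO WebPages (ID, AbsoluteUri) VALUES
            (1, 'http://example.com/a'),
            (2, 'http://example.com/b');
        INSERT INTO WebPages_MetaData (ID, WebPageID, Content) VALUES
            (10, 1, 'first metadata row'),
            (11, 1, 'second metadata row');
        """
    )
    return conn


def load_pages_with_metadata(conn):
    """Return {page_id: {"uri": ..., "metadata": [...]}} for every page.

    The LEFT JOIN keeps pages that have no metadata rows yet.
    """
    rows = conn.execute(
        """
        SELECT wp.ID, wp.AbsoluteUri, md.ID, md.Content
        FROM WebPages AS wp
        LEFT JOIN WebPages_MetaData AS md ON md.WebPageID = wp.ID
        ORDER BY wp.ID, md.ID
        """
    )
    pages = {}
    for page_id, uri, metadata_id, content in rows:
        page = pages.setdefault(page_id, {"uri": uri, "metadata": []})
        if metadata_id is not None:  # NULL means the page has no child rows
            page["metadata"].append(content)
    return pages


if __name__ == "__main__":
    for page_id, page in load_pages_with_metadata(build_demo_db()).items():
        print(page_id, page["uri"], page["metadata"])
```

The same join can be written directly in T-SQL against the real tables once the actual key columns are known; the grouping step is what turns the flat result set into parent-child data.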
I started creating this plugin, and I am using ManageLuceneDotNetIndexes.cs as the template for it.
Each CrawlRequest has a Parent reference, but not a Grandparent reference, to keep memory consumption down.
Does the Parent reference work for you?
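As a rough illustration of why keeping only a Parent reference saves memory, the sketch below models a request that links to its parent but not to its grandparent. The CrawlRequest class and create_child helper here are hypothetical stand-ins written for this example, not the arachnode.net types, and the trimming strategy is an assumption about how such a one-level link could be kept.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CrawlRequest:
    """Hypothetical stand-in for a crawl request (not the arachnode.net class)."""
    uri: str
    parent: Optional["CrawlRequest"] = None


def create_child(parent: CrawlRequest, uri: str) -> CrawlRequest:
    """Create a child request that remembers its parent but not its grandparent.

    The child stores a trimmed copy of the parent whose own parent link is
    cleared, so a long discovery chain does not keep every ancestor alive.
    """
    trimmed_parent = CrawlRequest(uri=parent.uri, parent=None)
    return CrawlRequest(uri=uri, parent=trimmed_parent)


if __name__ == "__main__":
    root = CrawlRequest("http://example.com/")
    child = create_child(root, "http://example.com/jobs")
    grandchild = create_child(child, "http://example.com/jobs/1")
    print(grandchild.parent.uri)     # the page that linked to this one
    print(grandchild.parent.parent)  # grandparent link is not retained
```

With this shape, a plugin can see which page led to the current request, but anything further up the chain has to be looked up elsewhere, for example in the database.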
I think it might. I have started writing this plugin and will soon reach the point in the code where I can start working with CrawlRequest.
Great! Let me know how it turns out.