arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE An Open Source C# web crawler with Lucene.NET search using MongoDB/RavenDB/Hadoop

Crawl pages created or modified in the last 30 days

Top 25 Contributor
14 Posts
dbs2000 posted on Thu, Aug 6 2009 8:34 AM

Is it possible to crawl only pages that were created or modified within the last 30 days? I am not interested in crawling pages older than that.

All Replies

Top 10 Contributor
1,905 Posts

Yes/possibly.  Do you want to filter when crawling, or when submitting WebPages to be crawled?

(from the recent enthusiasm for crawling via date and not recrawling via date, I may need to put some work into this this weekend...)

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

Top 25 Contributor
14 Posts

Good to see you, Mike. :)

It's great to know that you will be doing something about "crawling via date and not recrawling via date" this weekend.
As far as your question goes, either will be fine, as long as my target is achieved.

Let me tell you in a little more detail what my requirements are. I have a big list (20,000 - 50,000) of sites, under different categories, that I have to search for specific tag words. I need to come up with a count of these tag words on each of these web sites on a per-day basis. Say an article published on 1st Aug contains some tag words. The crawl done on the 1st should catch this. However, the crawl that I do on 2nd Aug should not take this article (published on 1st Aug) into consideration.

With the count figures that I would collect on a per day basis I would have to do some analysis.

A question in my mind here is: can such a huge list of web sites be crawled on a per-day basis? I am also worried about the disk space that the Lucene index & downloaded files would take. Will it be manageable?

Do you think I will be able to achieve all this with what is in arachnode.net currently, or maybe after the additions that you are planning to make during the weekend?

I will be needing your guidance & help :)

Thanks again
Debasish

Top 10 Contributor
1,905 Posts

As far as the number of WebPages that can be crawled in a day goes: I have been able to get through 1,000,000 WebPages in a day using this hardware: http://arachnode.net/media/p/9935.aspx  (text only...)

The biggest factor in how fast you can crawl is disk speed.

If you have 500GB of spare disk space you can handle your load - of course, this is a rough estimate since I don't know what you are crawling.
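To make the rough estimate above a bit more concrete, here is a small back-of-the-envelope sketch. The 1,000,000 pages/day, 20,000 - 50,000 sites and 500GB figures come from this thread; the average stored size per page (50,000 bytes) is purely an assumed placeholder, and CrawlBudget is a hypothetical helper, not part of arachnode.net. Plug in numbers measured from your own crawl.

```csharp
using System;

// Hypothetical helper for rough capacity planning; not an arachnode.net class.
public static class CrawlBudget
{
    // Pages each site can receive per day if the daily capacity is shared evenly.
    public static double PagesPerSitePerDay(long pagesPerDay, int siteCount)
        => (double)pagesPerDay / siteCount;

    // Sustained rate needed to finish pagesPerDay within 24 hours (86,400 seconds).
    public static double PagesPerSecond(long pagesPerDay)
        => pagesPerDay / 86400.0;

    // Days of crawled pages that fit on the disk.
    // avgBytesPerPage is an ASSUMPTION; measure it on a sample crawl.
    public static double DaysOfStorage(long diskBytes, long pagesPerDay, long avgBytesPerPage)
        => (double)diskBytes / (pagesPerDay * avgBytesPerPage);

    public static void Main()
    {
        const long pagesPerDay = 1_000_000;                // figure quoted in this thread
        const long diskBytes = 500L * 1000 * 1000 * 1000;  // 500GB
        const long assumedBytesPerPage = 50_000;           // hypothetical placeholder

        Console.WriteLine(PagesPerSitePerDay(pagesPerDay, 50_000)); // 20
        Console.WriteLine(PagesPerSitePerDay(pagesPerDay, 20_000)); // 50
        Console.WriteLine(PagesPerSecond(pagesPerDay));             // roughly 11.6
        Console.WriteLine(DaysOfStorage(diskBytes, pagesPerDay, assumedBytesPerPage)); // 10
    }
}
```

With an even split, each site gets 20 to 50 pages a day at that throughput, so large sites may not be fully covered in a single daily pass.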

How about this - can we arrange a simulation?  Send me your list of sites and we can document what we need to do to achieve your goal.

There is an SSIS package in the solution that will extract the tag words, but it is expensive (in terms of time needed) to run.  Perhaps we can find a better solution for large-scale tag analysis - it shouldn't be too difficult.

Mike

Top 10 Contributor
1,905 Posts

OK.  I crawled using the list you sent me.

Create a crawl using the crawl code I sent you.  (Not listing the sites you sent me per your NDA...)

The next step is to examine the DisallowedAbsoluteUris table and the Exceptions table as some sites are disallowed by the default rules.  Let me know once you're there and we'll take the next step.

Top 10 Contributor
1,905 Posts

Here is the output of the extracted words:

SELECT * FROM WebPages_MetaData_TermExtraction

ability 14.5560907917588
abuse event 18.4443972705697
access 24.6625598035165
account 26.3395928612779
acquisition 25.8221561787976
action 12.9513358272291
Activision 14.7555178164557
ad 20.7944154167984
addition 13.2798398942012
addthis_close 22.1332767246836
administration 17.9743936413239
adoption 10.3610686617833
advantage 12.9513358272291
advertiser 31.0832059853499
Advertising 10.3610686617833
AI 14.97866136777
AIM 14.97866136777
Air Force 25.8221561787976
Allahpundit 29.5110356329115
Amazon 29.1121815835177
Amazon MP3 18.4443972705697
America 12.9513358272291
American 15.541602992675
American Idol 9.21034037197618
analysis 19.1726623556449
analyst 10.3610686617833
Android 20.9701259148779
Angola 18.4443972705697
Anil Dash 11.982929094216
announcement 10.3972077083992
answer 15.6867237455276
Apple 149.895360235042
application 59.9146454710798
approval 9.21034037197618
approval rating 14.7555178164557
apps 16.1180956509583
April 16.6355323334387
archives 14.5560907917588
article 10.3610686617833
Asia 9.21034037197618
attack 38.9445195562019
attempt 10.3610686617833
Aug 85.2571032088733
Aug. 64.7566791361457
August 66.3599070838739
Authority 25.9026716544583
Authorized Book 25.8221561787976
award 11.982929094216
baby 15.1769598790871
baby daddy 12.9513358272291
badge 29.5110356329115
bank 12.9513358272291
Banking 11.982929094216
bear 18.4443972705697
bear shaving 14.7555178164557
beginning 9.21034037197618
behavior 18.7149738751185
Ben Kuchera 14.7555178164557
benefit 17.7038170367751
Better Job 14.97866136777
bid 18.4206807439524
Big Poppa 20.9701259148779
bill 10.3610686617833
Bill Clinton 15.541602992675
Bing 13.8155105579643
bit 27.887508880938
Bite 22.1332767246836
Blair 18.4443972705697
Blog 37.0170719859843
blog post 11.5129254649702
blog_item 44.2665534493672
blogs 22.7654398186306
Blonde 14.5560907917588
Blu-ray 28.4929388199041
Board 18.4206807439524
book 19.4081210556785
bookmark 22.1332767246836
bottom 10.3972077083992
box 15.1769598790871
brand 27.6310211159286
break 10.3610686617833
breakfast 15.541602992675
Brian Ashcraft Comment 18.4443972705697
Bristol Palin 18.1318701581208
browser 16.1180956509583
Bryce Dallas Howard 14.7555178164557
btn_fb_share 22.1332767246836
btn_share_this 22.1332767246836
building 10.3610686617833
Business 33.2710646668774
Cajun Boy Comment 25.8221561787976
call 11.5129254649702
camera 12.9513358272291
campaign 10.3610686617833
Canada 14.5560907917588
Canadian leader 14.7555178164557
Car 23.9658581884319
care 10.3972077083992
case 31.3247524123321
cash 13.2798398942012
category 55.0164795616906
celebs 11.982929094216
Cellphones 11.982929094216
Cent 10.3610686617833
challenge 20.7944154167984
chance 11.5129254649702
change 20.883168274888
charge 10.3610686617833
chart 12.476649250079
check 15.541602992675
Chicago 9.21034037197618
child 16.6355323334387
choice 10.3610686617833
Chuck 14.7555178164557
circumstance 14.97866136777
CIRI 22.1332767246836
claim 12.476649250079
class 20.7221373235666
client 10.3610686617833
Clinton 14.7555178164557
cloud 11.982929094216
clue 11.982929094216
CNN 113.971755279616
CNN Video 22.1332767246836
collaboration 9.21034037197618
color 14.97866136777
column 12.9513358272291
comment 159.434587346103
Comment Week 14.7555178164557
commentary 9.21034037197618
Comments 9.21034037197618
communication 10.3972077083992
community 16.1180956509583
company 47.6471180574561
competition 11.5129254649702
computer 12.476649250079
concern 10.3972077083992
Cond 11.982929094216
Congress 22.8738569584782
connection 10.3610686617833
Conrad 18.1318701581208
consumer 12.9513358272291
Consumer Abuse 25.8221561787976
contact 16.6355323334387
Contact Us 15.1769598790871
content 20.9156316607035
contest 11.5129254649702
contestant 25.8221561787976
Continue Reading 31.0832059853499
contract 12.476649250079
Convenience 22.1332767246836
conversion 14.7555178164557
Conversion Optimizer 33.1999150870254
copy 12.9513358272291
copyright 15.1769598790871
corner 11.982929094216
country 28.9698824238138
couple 20.7944154167984
course 18.0737785384179
CourseSmart 14.7555178164557
court 18.1318701581208
courtesy 12.9513358272291
coverage 11.5129254649702
crime 11.982929094216
Crippen 22.1332767246836
Crowd 13.8155105579643
customer 35.4076340735502
dad 10.3610686617833
Daily Show 18.4443972705697
damage 18.1318701581208
danger 11.982929094216
darkhorse 18.4443972705697
darkhorse_loga 44.2665534493672
data 33.2710646668774
date 12.9513358272291
Daughter 11.5129254649702
Dave Bullock August 14.7555178164557
David Kravets August 22.1332767246836
day 42.3209078995419
DDoS 11.982929094216
deal 22.3748231516658
Dean 14.7555178164557
Death 11.3827199093153
death panel 11.982929094216
debate 11.5129254649702
decade 10.3972077083992
December 35.3505062085572
decision 14.5560907917588
DefCon 70.0887096281648
Dell 14.7555178164557
demise 11.982929094216
democrat 15.541602992675
Democrats 14.7555178164557
Dems 14.7555178164557
DentBetty 14.7555178164557
deposit 11.982929094216
design 10.3610686617833
Desktop 9.21034037197618
detail 12.476649250079
developer 38.8540074816874
device 41.831263321407
difference 9.21034037197618
Digg 15.541602992675
Direct Hire 14.7555178164557
director 25.9026716544583
Directors 14.7555178164557
dirt 15.541602992675
document 41.7663365497761
Dodd 14.7555178164557
dollar 12.2007851354104
door 10.3610686617833
Doug Reinhardt 11.982929094216
dozen 18.4206807439524
Dr. Arnold Klein 14.7555178164557
economy 18.9711998488588
Ed Morrissey 62.7109507199369
Edit 18.4443972705697
Edition 11.982929094216
editor 20.7221373235666
education 12.476649250079
effect 12.476649250079
effort 16.094379124341
election 11.982929094216
Eliot Van Buskirk 33.1999150870254
Email 47.0601712365828
E-mail 51.9860385419959
email address 18.4443972705697
employee 27.0327400418379
Employee Ownership 22.1332767246836
end 22.478601933048
energy 15.541602992675
entry 38.9445195562019
environment 11.5129254649702
epicenter 23.9658581884319
Erick Schonfeld 14.7555178164557
error 11.5129254649702
ESOPs 18.4443972705697
EST 22.1332767246836
ET 33.6734731507957
ETFs 11.982929094216
eureka moment 14.7555178164557
Evan Hansen 14.7555178164557
event 46.4754305273604
example 35.3505062085572
existence 12.9513358272291
experience 13.2798398942012
expert 14.97866136777
eye 10.3972077083992
Facebook 33.2710646668774
Facebook Share Link 18.4443972705697
fact 26.3395928612779
factor 11.982929094216
failure 15.541602992675
fame 12.9513358272291
family 24.953298500158
fan 12.476649250079
faq 12.9513358272291
Fat 25.8221561787976
favor 13.8155105579643
fb_share_link 22.1332767246836
FCC 11.982929094216
feature 24.953298500158
February 16.6355323334387
Fed 11.982929094216
Feedback 10.3972077083992
Fidelity 20.9701259148779
file 16.1180956509583
film 13.8155105579643
fire 10.3610686617833
firm 9.21034037197618
First 11.982929094216
fishy 14.7555178164557
folk 14.5560907917588
follower 10.3610686617833
food 9.21034037197618
form 16.1180956509583
Foster Kamer Comment 25.8221561787976
Free 10.3610686617833
Friday 30.5793203362479
friend 21.6715104778669
full story 11.982929094216
fun 15.6867237455276
Fund 13.8155105579643
future 19.3915133981103
G.I. Joe 14.7555178164557
gadget 10.3610686617833
gain 15.541602992675
game 41.5888308335967
Gaming 20.9701259148779
Gawker 18.4443972705697
GB iPhone 29.5110356329115
Getty 17.9743936413239
Gibbs 25.8221561787976
Gift 9.21034037197618
GIGAZINE 25.8221561787976
glance 10.3610686617833
goal 22.8738569584782
godfather 11.982929094216
good news 14.97866136777
Google 125.22546558761
Google Apps 17.9743936413239
Google Apps Ad Campaign 14.7555178164557
Google Chrome 17.9743936413239
Google Earth 14.97866136777
Google Voice 23.9658581884319
Googlers 14.7555178164557
GOP 14.7555178164557
Gossip Girl 11.982929094216
government 16.094379124341
group 17.0740798639729
Guadalajara 17.9743936413239
guide 11.5129254649702
guy 13.2798398942012
Gym 25.8221561787976
hacker 17.9743936413239
Hamilton Nolan Comment 22.1332767246836
hand 18.1318701581208
hardware 10.3610686617833
HDD 29.5110356329115
head 13.8155105579643
health care 34.1481597279459
health care bill 11.982929094216
health care reform 20.7221373235666
Help 14.4849412119069
Hero 11.982929094216
Hillary Clinton 14.7555178164557
Hollywood 11.3827199093153
home 24.0794560865187
Hoover 29.5110356329115
hope 14.5560907917588
hostile network 14.7555178164557
Hotline 22.1332767246836
hour 26.1445395758793
house 15.6867237455276
hundred 9.21034037197618
icio 36.8887945411394
icon 14.7555178164557
idea 27.0327400418379
im 29.9573227355399
image 36.6023554062311
impact 12.476649250079
improvement 9.21034037197618
Inc. 40.2359478108525
industry 12.9513358272291
info 10.3972077083992
information 45.1844463460448
innovation 12.9513358272291
in-person meeting 14.7555178164557
Intel 12.9513358272291
intention 10.3972077083992
interaction 17.9743936413239
interest 12.9513358272291
Internet 26.1445395758793
investment 11.982929094216
Investools 25.8221561787976
investor 18.4206807439524
iPhone 50.7162658104424
iPhone OS 14.7555178164557
iPod 11.5129254649702
iPod touch 33.1999150870254
Iran 10.3610686617833
ISPs 11.982929094216
issue 26.2455531124669
item 10.3972077083992
iTunes 59.576144805254
January 12.476649250079
Jason Fitzpatrick Comment 36.8887945411394
job 17.0740798639729
Joe 11.5129254649702
John King 18.4443972705697
Jon 30.3539197581741
Jon Gosselin 11.982929094216
Jonas Brothers 11.982929094216
journal 11.5129254649702
judge 20.7944154167984
Julie 14.7555178164557
July 85.2049559668273
jump 11.982929094216
June 16.4082036445549
Kate 13.8155105579643
Kate Gosselin 27.0327400418379
Kathy Griffin 22.8738569584782
Kevin Purdy Comment 14.7555178164557
kid 29.1121815835177
Kim 9.21034037197618
Kim Kardashian 13.8155105579643
kind 20.8683198337447
King 9.21034037197618
LAS VEGAS 11.982929094216
last time 11.5129254649702
last week 24.1415686865115
last year 17.0740798639729
Latest response 18.4443972705697
launch 18.4206807439524
Laura Moorhead 14.7555178164557
law 22.5321307740774
lawmaker 11.982929094216
lawyer 14.97866136777
leader 16.1180956509583
Lester 11.982929094216
letter 20.7221373235666
Level 10.3610686617833
Levi Johnston 20.8683198337447
Lewis Wallace 25.8221561787976
life 26.5724312243505
Lifehacker 14.7555178164557
Line 20.9156316607035
link 19.4081210556785
list 23.8664780284435
Listening Post blog 14.7555178164557
Live Nation 22.1332767246836
LiveJournal 11.982929094216
location 10.3972077083992
Log 9.21034037197618
Login 13.8155105579643
long time 11.982929094216
Los Angeles 10.3972077083992
loser 14.7555178164557
lot 33.2710646668774
Love 11.3827199093153
Luke Plunkett Comment 29.5110356329115
ly 17.9743936413239
MA 18.4443972705697
Mac 34.5387763949107
Mac OS X 17.9743936413239
MacBook 14.7555178164557
MacRumors 18.4443972705697
man 24.953298500158
map 11.982929094216
march 12.476649250079
MARKET 25.7510065989456
Market Consensus 14.7555178164557
Marketers 33.1999150870254
Mashable 36.8887945411394
material 9.21034037197618
MB 26.9615904619859
Me 10.3610686617833
media 9.21034037197618
meeting 20.7944154167984
Memba 33.1999150870254
member 27.3604445113797
Memory 10.3610686617833
Mess 11.982929094216
message 16.1180956509583
Mexican 14.7555178164557
Mexican President Felipe Calderon 18.4443972705697
Mexico 38.8540074816874
Michael Arrington 18.4443972705697
Michael Jackson 41.7366396674894
Microsoft 47.427999622147
Mike Fahey Comment 18.4443972705697
Miley Cyrus 11.982929094216
million 9.21034037197618
minute 22.3748231516658
MJ 10.3610686617833
MMORPG 14.7555178164557
Mo 11.982929094216
model 11.982929094216
mom 10.3610686617833
moment 16.4082036445549
Mon 38.9445195562019
Monday 52.2890791517587
Money 33.2710646668774
month 26.5724312243505
morning 12.476649250079
move 11.5129254649702
movie 22.7654398186306
movie_reviews 22.1332767246836
MSN 14.7555178164557
music 32.25103974306
name 19.3132549492092
NASDAQ 20.7221373235666
NASDAQ Answers 14.7555178164557
Nast Digital 11.982929094216
Nation 20.7944154167984
NationalJournal 14.7555178164557
NCEO 33.1999150870254
need 13.2798398942012
negative 36.8887945411394
netbooks 10.3610686617833
network 18.7149738751185
new feature 18.1318701581208
New Footage 14.7555178164557
new product 10.3610686617833
New York 20.8683198337447
News 29.9483138520202
newsletter 14.5560907917588
NewsVine StumbleUpon Mixx Comments 36.8887945411394
Newton 14.7555178164557
Next-Generation iMac 14.7555178164557
Nick 11.982929094216
Ninjawords Dictionary 14.7555178164557
Nintendo 25.8221561787976
North American summit 11.982929094216
note 12.2007851354104
November 12.9513358272291
number 27.6913744994965
NYSE 11.982929094216
Obama 45.7477139169564
Obama administration 23.3124044890124
ObamaCare 14.7555178164557
October 14.5560907917588
Office 11.5129254649702
official 13.8155105579643
On2 49.2150761434707
On2 Technologies 20.7221373235666
onclick 81.1553479905066
online 16.1180956509583
onmouseout 22.1332767246836
operation 11.982929094216
opportunity 16.6355323334387
option 16.6355323334387
order 18.4206807439524
organization 18.1318701581208
OS 20.9701259148779
Owen Good Comment 18.4443972705697
page 28.3451973614643
pageTracker 59.022071265823
Pakistani Taliban leader 14.7555178164557
Palin 22.1332767246836
panel 20.9701259148779
parent 11.5129254649702
Paris 16.1180956509583
Paris Hilton 11.982929094216
part 27.4887219562247
participant 14.7555178164557
party 13.943754440469
PASSWORD 28.9698824238138
patent 33.6734731507957
patient 11.982929094216
Patry 14.7555178164557
Paula 11.982929094216
Paula Abdul 18.1318701581208
payment 17.9743936413239
PC 12.476649250079
pdf 16.1180956509583
people 72.3869678180583
percent 48.283137373023
performance 12.476649250079
Permalink 85.2571032088733
person 25.9026716544583
Pete Cashmore Comments 14.7555178164557
Philipp Lenssen 51.6443123575951
phone 15.1769598790871
photo 19.1726623556449
picture 18.7149738751185
Piece 12.476649250079
place 23.2377152636802
plan 16.094379124341
platform 9.21034037197618
player 10.3972077083992
pm 142.008259944712
PM Company Last Sale 14.7555178164557
PM EST 14.7555178164557
PM ET 20.7221373235666
pm Experts 14.7555178164557
point 17.8998585213326
policy 18.9711998488588
politicalticker 18.4443972705697
politics 18.9711998488588
POLL 11.3827199093153
Popular 14.5560907917588
portion 10.3610686617833
position 9.21034037197618
positive 36.8887945411394
post 38.729525439467
Post Comment 35.9487872826479
post Labels 22.1332767246836
power 13.943754440469
prediction 11.982929094216
president 26.9615904619859
President Barack Obama 11.982929094216
President Obama 40.5776739952533
press 9.21034037197618
price 18.9711998488588
Print Email Share 22.1332767246836
Print Share Close Linkedin Digg Facebook Mixx 55.333191811709
privacy 15.1769598790871
Privacy Policy 16.7827943571024
problem 32.8164072891098
producer 11.982929094216
product 17.8998585213326
profile 33.6734731507957
profit 13.8155105579643
program 23.0258509299405
project 12.9513358272291
Prospectus 22.1332767246836
Proxy Statement 22.1332767246836
PS3 32.9530550090939
PS3 Slim 14.7555178164557
Public 9.21034037197618
Qtrax 18.4443972705697
quality 15.541602992675
quarter 20.9701259148779
quest 9.21034037197618
question 20.6557469010491
quote 15.6867237455276
rain 14.7555178164557
rate 31.0832059853499
rating 38.9445195562019
reaction 10.3610686617833
READ 16.1180956509583
Reader 11.5129254649702
readmore 14.7555178164557
ReadWriteWeb 33.1999150870254
reality check 14.7555178164557
Reason 29.6304781859966
Recession 18.4443972705697
Red Carpet 15.541602992675
regime 11.982929094216
Register 10.3610686617833
Registration Statement 14.7555178164557
rel 22.1332767246836
relationship 14.5560907917588
release 16.6355323334387
report 25.3581329052212
reporter 18.1318701581208
Republicans 11.982929094216
research 9.21034037197618
resource 14.97866136777
response 140.386878881555
responsibility 11.982929094216
rest 39.5093892919169
result 31.8847703057575
retailer 14.97866136777
return 20.7944154167984
return addthis_sendto 22.1332767246836
return pageTracker 22.1332767246836
review 17.4296930505862
RIAA 14.97866136777
rice 18.4443972705697
right 24.7398497606022
rightcol 22.1332767246836
Rights Reserved 10.3972077083992
rise 10.3610686617833
risk 14.5560907917588
road 10.3610686617833
Rogers 22.1332767246836
room 12.476649250079
round 23.9658581884319
RPG 36.8887945411394
RSS 34.3080621658875
RTM 18.4443972705697
rule 14.97866136777
rumor 12.9513358272291
sale 14.5560907917588
San Francisco 10.3610686617833
Saturday 10.3972077083992
Schmidt 25.8221561787976
school 20.7944154167984
Scott Thill 25.8221561787976
screenshots 9.21034037197618
search 17.8998585213326
Search Engine 14.97866136777
SEC 23.9658581884319
section 12.9513358272291
security 13.8155105579643
Security Update 18.4443972705697
Semitopless Serena 14.7555178164557
Senate 20.9701259148779
sense 9.21034037197618
separation 10.3610686617833
September 20.8683198337447
series 10.3610686617833
service 43.8332737694436
service member 18.4443972705697
Seth 18.4443972705697
Seth Godin 95.9108658069623
share 38.7830267962206
Share Link 18.4443972705697
SharePoint 14.7555178164557
show 13.8155105579643
side 18.4206807439524
Sign 15.6867237455276
site 35.9379766224243
situation 14.4849412119069
SMS 11.982929094216
social media 25.9026716544583
social network 14.5560907917588
society 12.476649250079
software 20.7232658369464
solution 20.8683198337447
son 12.9513358272291
song 18.4206807439524
songbird 18.4443972705697
Sony 22.1332767246836
sort 12.476649250079
Sotomayor 23.9658581884319
source 27.7258872223978
space 13.8155105579643
Space Permalink August 51.6443123575951
Specter 11.982929094216
sphere 47.9554329034812
sport 9.21034037197618
Sprint 14.7555178164557
SSA 14.7555178164557
SSD 14.97866136777
star 16.6355323334387
Stars 13.8155105579643
state 39.1439465808988
statement 19.1726623556449
STAT-USA 22.1332767246836
step 15.6867237455276
Steve Jobs 11.982929094216
stock 18.7149738751185
stock option 11.982929094216
store 9.21034037197618
storm 14.7555178164557
story 24.1415686865115
strategy 26.9615904619859
student 13.8155105579643
studio 10.3610686617833
study 10.3610686617833
stuff 10.3610686617833
style 11.982929094216
success 11.5129254649702
summer 13.2798398942012
summit 27.6310211159286
Sunday 30.5793203362479
Sunday afternoon 12.9513358272291
support 17.4296930505862
supporter 17.9743936413239
Susan 11.982929094216
switch 12.9513358272291
symbol 14.97866136777
system 22.7654398186306
tactic 20.7221373235666
Tag 10.3610686617833
talk 16.6355323334387
Tarantino 14.7555178164557
target 20.7232658369464
team 18.7149738751185
Tech 12.9513358272291
TechCrunch 14.97866136777
Technology 38.7830267962206
Teen Choice 17.9743936413239
Teen Choice Awards 49.3251196070329
term 24.0794560865187
Terms 9.21034037197618
theme 20.7221373235666
thing 30.4448416104617
thinkorswim 25.8221561787976
thought 12.476649250079
thousand 12.2007851354104
thread 68.6215708754346
threat 11.982929094216
Threat Level 14.7555178164557
Thursday 45.5308796372612
Thursday morning 14.7555178164557
Tickets 11.982929094216
time 56.7945150717839
tinyurl 18.4206807439524
Tip 20.883168274888
tips box 14.7555178164557
title 11.5129254649702
TMZ 29.5110356329115
TMZ Staff 33.1999150870254
today 45.0545667363964
toe 36.8887945411394
tomorrow 9.21034037197618
tonight 11.5129254649702
tool 19.3915133981103
toolbar data 14.7555178164557
top 17.8998585213326
top story 17.9743936413239
topic 15.1769598790871
touch 10.3610686617833
town hall chaos 14.7555178164557
Tr 38.9445195562019
track 14.5560907917588
TrackBack 92.2219863528484
Trackbacks 40.5776739952533
trackPageview 81.1553479905066
trade 12.9513358272291
traffic 12.476649250079
training 9.21034037197618
transaction 11.982929094216
Trip 9.21034037197618
TripAdvisor 14.7555178164557
Tuesday 27.0327400418379
tune 15.541602992675
TV 18.1318701581208
tweet 20.7221373235666
Tweets 18.4443972705697
twilight 25.9026716544583
Twist 14.97866136777
Twitter 79.4622050855118
Twitter Wit 22.1332767246836
type 11.3827199093153
U.S. 17.4296930505862
U.S. Senate 14.7555178164557
uber 81.1553479905066
uberblog 14.7555178164557
Underwire 14.7555178164557
Union 33.6734731507957
United States 18.4206807439524
university 13.8155105579643
UPDATE 38.2136232861816
Updated 12.9513358272291
URL shorteners 14.7555178164557
USB 14.7555178164557
Use 18.3258146374831
user 31.8847703057575
Username 17.0740798639729
value 11.5129254649702
variety 11.5129254649702
Venture Capital 11.982929094216
version 11.5129254649702
VIDEO 61.4026130206227
view 15.6867237455276
vote 16.1180956509583
Wake 10.3610686617833
Wal-Mart 10.3610686617833
war 10.3610686617833
WASHINGTON 29.9573227355399
Washington Wire 14.7555178164557
wave 16.1180956509583
way 44.957203866096
Web 30.4984759446376
Web site 18.9711998488588
website 13.2798398942012
wedding ring 10.3972077083992
Wednesday 30.5793203362479
week 34.7952788850877
weekend 14.5560907917588
Weekly Social Media 14.7555178164557
WENN 20.9701259148779
White House 26.9615904619859
wife 14.97866136777
Wii 18.1318701581208
Windows 29.1121815835177
wizardry 14.7555178164557
woman 13.8155105579643
word 24.6625598035165
WordPress 9.21034037197618
work 32.2746045328891
World 33.9027570793437
WSJ 22.1332767246836
Xbox 12.9513358272291
XML 17.9743936413239
Yahoo 45.5308796372612
year 79.4471694939498
yesterday 14.4849412119069
YouTube 10.3610686617833

Top 10 Contributor
1,905 Posts

So, for our crawling setup, we need to turn on 'ExtractWebPageMetaData' and 'InsertWebPageMetaData' in the database table 'cfg.Configuration'.  This will strip out all tags from our HTML and insert the text into the database table 'WebPages_MetaData'.

Since we're going to be doing a lot of high-volume crawling, and we need the text extracted, we need to modify the source code just a bit to turn off creating HtmlAgilityPack HtmlDocuments.  Find the line in WebPageManager:

managedWebPage.HtmlDocument = HtmlManager.CreateHtmlDocument(source2, Encoding.Unicode);

...and comment it out.  The HtmlAgilityPack consumes a lot of RAM and we won't need it.

We also need to modify this line to avoid a NullReferenceException:

_arachnodeDAO.InsertWebPageMetaData(webPageID, absoluteUri,
    Encoding.UTF8.GetBytes(managedWebPage.Text),
    null/*managedWebPage.HtmlDocument.DocumentNode.OuterHtml*/);

Top 25 Contributor
14 Posts

Hi Mike,

This is Debasish once again.
Thanks a lot for walking me through the various steps of how to configure AN to make it work according to my needs.

However, I am confused when I compare your post reply (http://arachnode.net/forums/p/453/10655.aspx#10655) with what you told me in chat yesterday. In the chat you mentioned the following configuration changes.

cfg.Configuration:

AssignEmailAddressDiscoveries, AssignFileAndImageDiscoveries = false
CreateCrawlRequestsFromDatabaseWebPages = true
InsertWebPageMetaData = false
ExtractWebPageMetaData = false
AssignCrawlRequestPrioritiesForWebPages = false

cfg.CrawlActions:

ManageLuceneDotNetIndexes - IsEnabled = false

cfg.CrawlRules:

Turn off robots.txt

But in the post you have written:

"... So, for our crawling setup, we need to turn on 'ExtractWebPageMetaData' and 'InsertWebPageMetaData' in the database table 'cfg.Configuration'.  This will strip out all tags from our HTML and insert the text into the database table 'WebPages_MetaData'. ..."

Now what do I do about this? I hope you understand my confusion.

Also, you mentioned this in the chat: "Or, perhaps it would be best to write a plug-in and scan the source (DecodedHtml) for your tag words and then store the result as you saw fit?"
So do I need to write a plug-in?

Also, do I need to populate the TermExtraction table manually first, with the application then populating the counts into the TermLookUp table? If that is the case then I have two issues. First, my terms (tag words) are categorised into groups that also contain the links, so both my links and my tag words are grouped. In other words, I am not interested in looking for the same tag words in all pages; each group's tag words need to be looked up only in that group's web pages. Second, the count has to be on a per-day basis, which means I would need a datetime column in the TermLookUp table. What is your opinion about all this?

And, yes, distributed caching may be needed.

Lastly, I believe that you will be putting in the 'date-specific crawling' feature. I am eagerly waiting for that.

My final aim here is to do 'Sentiment Analysis' (as you correctly pointed out yesterday). So I would have to crawl each day, filter out what has been said on that specific day across the different categories of sources, and present that data. You have also mentioned that I would not need Lucene. So that would mean that I would not need the search functionality.

So, what do I need to do to achieve my target in terms of configuration changes & code changes (if required)?

Top 10 Contributor
1,905 Posts

Let's go with what I communicated over IM.  There are several ways to achieve what (I think and hope) I understand your needs to be.

The switches and modifications from the post were to support a batch-style analysis - but we actually need to implement a continuous crawling mechanism, which is what arachnode.net was designed to do best.

A plug-in will be our best option and will manage your links and your tag words.  Simplistically, the plug-in will contain a list of tags and their associated links.  Every CrawlRequest passes through the Crawl architecture and through each plug-in, and we'll perform your business-specific requirements there.
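A minimal sketch of the bookkeeping such a plug-in could do, assuming the page URI and its tag-stripped text are handed to it once per crawled page. TagGroup, TagCounter and Process are hypothetical names made up for illustration; they are not arachnode.net types, and the real plug-in base class and its hook method are not shown here. It needs C# 7 or later for the tuple key.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical: one category of sites together with the tag words to look for on them.
public sealed class TagGroup
{
    public string Name { get; }
    public HashSet<string> Hosts { get; }
    public List<string> TagWords { get; }

    public TagGroup(string name, IEnumerable<string> hosts, IEnumerable<string> tagWords)
    {
        Name = name;
        Hosts = new HashSet<string>(hosts, StringComparer.OrdinalIgnoreCase);
        TagWords = new List<string>(tagWords);
    }
}

// Hypothetical: accumulates per-day, per-host, per-tag counts in memory.
public sealed class TagCounter
{
    private readonly List<TagGroup> _groups;

    // Key: (crawl day, host, tag word); value: occurrences seen on that day.
    public Dictionary<(DateTime, string, string), int> Counts { get; }
        = new Dictionary<(DateTime, string, string), int>();

    public TagCounter(IEnumerable<TagGroup> groups)
    {
        _groups = new List<TagGroup>(groups);
    }

    // Call once per crawled page. Only the tag words of groups that list
    // the page's host are counted, so each group stays scoped to its own sites.
    public void Process(Uri absoluteUri, string text, DateTime crawledUtc)
    {
        foreach (TagGroup group in _groups)
        {
            if (!group.Hosts.Contains(absoluteUri.Host)) continue;

            foreach (string tag in group.TagWords)
            {
                int hits = CountWholeWord(text, tag);
                if (hits == 0) continue;

                var key = (crawledUtc.Date, absoluteUri.Host, tag);
                Counts.TryGetValue(key, out int current);
                Counts[key] = current + hits;
            }
        }
    }

    // Case-insensitive whole-word match; suits tags that start and end
    // with a letter or digit.
    public static int CountWholeWord(string text, string tag)
    {
        string pattern = @"\b" + Regex.Escape(tag) + @"\b";
        return Regex.Matches(text, pattern, RegexOptions.IgnoreCase).Count;
    }
}

public static class TagCounterDemo
{
    public static void Main()
    {
        var groups = new[] { new TagGroup("tech", new[] { "example.com" }, new[] { "apple", "iphone" }) };
        var counter = new TagCounter(groups);
        counter.Process(new Uri("http://example.com/a"),
            "Apple ships. Apple wins. Pineapple loses.",
            new DateTime(2009, 8, 6, 8, 34, 0, DateTimeKind.Utc));

        foreach (var entry in counter.Counts)
            Console.WriteLine(entry.Key + " = " + entry.Value);
    }
}
```

Keying the counts by day gives the per-day figures without a separate pass; in practice the dictionary would be flushed to a table with a date column rather than kept in memory.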

If you don't need to search, then we can turn off Lucene.net.  It is possible to create indexes from the WebPages at a later date, however, as long as we have stored the WebPage source in either the database or written the pages to disk.

The configuration settings should be correct (enough) for now - at some point we'll likely have you pass me your DB backup to double-check your config.

No CORE code changes are required - but we will need to craft a plug-in.

The next step is to run your subset list of sites through a Crawl and verify that the AbsoluteUris present in DisallowedAbsoluteUris are acceptable.

-Mike

Date-specific crawling is coming... at least the part that inspects the 'LastModified' HttpRequestHeader.

I think as a good exercise for Date-Specific crawling, and as a precursor, let's write a plug-in for this - specifically a CrawlRule.  Let me know when you're ready to tackle this.  (I have code that checks the LastModified header, which is needed, IMO, as a core feature... I'm not 100% sure that Date-specific crawling is a core Rule, but I may change my mind - in any case, let's draft up the plug-in and see what comes of it!)
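As a starting point for that discussion, here is a rough sketch of the date test such a rule could apply to the Last-Modified value a server sends back with a page. LastModifiedFilter and IsRecentEnough are hypothetical names, not the arachnode.net CrawlRule API, and letting pages through when the header is missing or unparseable is an assumption you may want to reverse.

```csharp
using System;
using System.Globalization;

// Hypothetical helper; not part of arachnode.net.
public static class LastModifiedFilter
{
    // Returns true when the page should be crawled:
    //  - the header is missing or unparseable (age unknown, so it is let through), or
    //  - the page was modified no more than maxAge before nowUtc.
    public static bool IsRecentEnough(string lastModifiedHeader, DateTimeOffset nowUtc, TimeSpan maxAge)
    {
        if (string.IsNullOrWhiteSpace(lastModifiedHeader))
            return true;

        // HTTP dates normally use the RFC 1123 form, e.g. "Thu, 06 Aug 2009 08:34:00 GMT".
        bool parsed = DateTimeOffset.TryParseExact(
            lastModifiedHeader.Trim(),
            "r",
            CultureInfo.InvariantCulture,
            DateTimeStyles.AssumeUniversal,
            out DateTimeOffset lastModified);

        if (!parsed)
            return true;

        return nowUtc - lastModified <= maxAge;
    }

    public static void Main()
    {
        var now = new DateTimeOffset(2009, 8, 20, 0, 0, 0, TimeSpan.Zero);
        var window = TimeSpan.FromDays(30);

        Console.WriteLine(IsRecentEnough("Thu, 06 Aug 2009 08:34:00 GMT", now, window)); // True
        Console.WriteLine(IsRecentEnough("Mon, 01 Jun 2009 00:00:00 GMT", now, window)); // False
        Console.WriteLine(IsRecentEnough(null, now, window));                            // True
    }
}
```

With a plain HttpWebResponse the header string is available as response.Headers["Last-Modified"]. Many dynamic pages omit the header or report the time of the request, so this check narrows the crawl but cannot guarantee that only fresh articles get through.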

copyright 2004-2017, arachnode.net LLC