arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE or MongoDB/RavenDB/Hadoop


Crawl specific domains, take screenshot, run javascript, ID ads, write to DB ... any ideas?

Answered (Verified) This post has 1 verified answer | 7 Replies | 4 Followers

Top 150 Contributor
1 Posts
jpntol posted on Sat, Feb 14 2009 6:53 AM

We want to capture info about AD SPOTS:

- on specific list of approx 10,000 domains

- capture a screenshot png / jpg

- run javascript on each page (browser specific?)

- read js and identify ad spot SIZES

- identify PLACE on page that the adspot is located

- relate this location back to the screenshot (draw it on the screenshot)

- write all to DB

- repeat above for next level down (max 1)

- repeat all weekly, compare and highlight differences

Are we looking at the right toolkit with arachnode? Does anyone out there want to have a crack at coding this?

Verified Answer

Top 10 Contributor
1,905 Posts
Verified by arachnode.net

Hi there!

Yes, arachnode is the toolkit you want.  :)  (not that I'm biased...)

- on specific list of approx 10,000 domains

  • No problem.  10,000 domains will equal millions of WebPages.  Ensure that you have appropriate disk capacity and speed.

- capture a screenshot png / jpg

  • There are libraries that can do this from code.  I worked on a personal project that needed to capture an entire page as a .jpg.  The control we used was ~$159.
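
    If you want to try it with what ships with .NET first, something along these lines works as a rough, untested sketch (System.Windows.Forms.WebBrowser must be used from an STA thread, and a commercial control copes better with edge cases like Flash and very long pages):

    // Minimal sketch: load a page in the WinForms WebBrowser control, wait for it to
    // finish, then render the full document to a .jpg.
    using System.Drawing;
    using System.Drawing.Imaging;
    using System.Windows.Forms;

    public static class PageCapture
    {
        // Must be called from an STA thread (a WebBrowser requirement).
        public static void Capture(string url, string outputPath)
        {
            using (WebBrowser browser = new WebBrowser())
            {
                browser.ScrollBarsEnabled = false;
                browser.ScriptErrorsSuppressed = true;
                browser.Navigate(url);

                // Pump messages until the page (and its javascript) has finished loading.
                while (browser.ReadyState != WebBrowserReadyState.Complete)
                {
                    Application.DoEvents();
                }

                // Size the control to the full document, then render it to a bitmap.
                browser.Width = browser.Document.Body.ScrollRectangle.Width;
                browser.Height = browser.Document.Body.ScrollRectangle.Height;

                using (Bitmap bitmap = new Bitmap(browser.Width, browser.Height))
                {
                    browser.DrawToBitmap(bitmap, new Rectangle(0, 0, browser.Width, browser.Height));
                    bitmap.Save(outputPath, ImageFormat.Jpeg);
                }
            }
        }
    }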

- run javascript on each page (browser specific?)

  • I don't know of an easy way to run the javascript in browsers other than IE, but it is straightforward using System.Windows.Forms.WebBrowser.
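
    For example, once the page has loaded in the WebBrowser control from the sketch above, the page's own scripts have already executed, and you can also call into them directly.  The function name here is purely hypothetical:

    // Invoke a function defined by the page's own javascript; "getAdSpotIds" is a
    // made-up name used only for illustration.
    object result = browser.Document.InvokeScript("getAdSpotIds");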

- read js and identify ad spot SIZES

  • If you know the ad spot's element id then you can retrieve its size.  The Google ad in the example I inspected is 140 x 600.
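
    A rough sketch, assuming the page is already loaded in the WebBrowser control from the capture example above.  The element id is hypothetical - the real ids come from parsing the page's markup/javascript:

    // Look up an ad slot by its element id and read its rendered size.
    // (Uses System.Drawing and System.Windows.Forms.)
    HtmlElement adSpot = browser.Document.GetElementById("google_ads_frame1");

    if (adSpot != null)
    {
        Rectangle spotRectangle = adSpot.OffsetRectangle;
        Console.WriteLine("Ad spot size: {0} x {1}", spotRectangle.Width, spotRectangle.Height);
    }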

- identify PLACE on page that the adspot is located

  • See above.

- relate this location back to the screenshot (draw it on the screenshot)

  • As simple as drawing a rectangle on the previously captured image, using the rectangle location from the step above.  The offset from the top left of the page is obtained by navigating up the element's Parent chain (easy to see in a QuickWatch on the element).
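
    Roughly, continuing from the element found above (the file paths are made up, and the offsets accumulate through OffsetParent, the same chain you would inspect in a QuickWatch):

    // Walk OffsetParent up to the document root to get the ad spot's position on the
    // page, then draw that rectangle onto the screenshot captured earlier.
    // (Uses System.Drawing and System.Drawing.Imaging.)
    Rectangle bounds = adSpot.OffsetRectangle;

    for (HtmlElement parent = adSpot.OffsetParent; parent != null; parent = parent.OffsetParent)
    {
        // Each OffsetRectangle is relative to the element's OffsetParent.
        bounds.Offset(parent.OffsetRectangle.Location);
    }

    using (Bitmap screenshot = new Bitmap(@"C:\Screenshots\example.jpg"))
    using (Graphics graphics = Graphics.FromImage(screenshot))
    using (Pen highlight = new Pen(Color.Red, 3))
    {
        graphics.DrawRectangle(highlight, bounds);
        screenshot.Save(@"C:\Screenshots\example_adspots.jpg", ImageFormat.Jpeg);
    }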

- write all to DB

  • Yes.  You would extend the WebPages_MetaData table.
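
    A sketch of the insert once the table is extended.  All of the column names are placeholders for whatever you actually add, and connectionString, webPageId and bounds come from your crawler's context and the earlier steps:

    // Persist one row of ad spot data per page.  (Uses System.Data.SqlClient.)
    using (SqlConnection connection = new SqlConnection(connectionString))
    using (SqlCommand command = new SqlCommand(
        "INSERT INTO dbo.WebPages_MetaData (WebPageID, AdSpotId, Width, Height, X, Y, ScreenshotPath) " +
        "VALUES (@WebPageID, @AdSpotId, @Width, @Height, @X, @Y, @ScreenshotPath)", connection))
    {
        command.Parameters.AddWithValue("@WebPageID", webPageId);
        command.Parameters.AddWithValue("@AdSpotId", "google_ads_frame1");
        command.Parameters.AddWithValue("@Width", bounds.Width);
        command.Parameters.AddWithValue("@Height", bounds.Height);
        command.Parameters.AddWithValue("@X", bounds.X);
        command.Parameters.AddWithValue("@Y", bounds.Y);
        command.Parameters.AddWithValue("@ScreenshotPath", @"C:\Screenshots\example_adspots.jpg");

        connection.Open();
        command.ExecuteNonQuery();
    }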

- repeat above for next level down (max 1)

  • This is set as a configuration parameter.

- repeat all weekly, compare and highlight differences

  • arachnode.net can run as a scheduled service.  A trigger would need to be added to the WebPages_MetaData table to archive existing rows whenever new results are submitted, preserving the change history.

So, how much of the functionality described above is currently present in arachnode.net?

Well, obviously the crawling functionality... most everything else would need to be added.  Making the modifications wouldn't be difficult at all.  The first step would be to familiarize yourself with how plugins work.  Take a look at ManageLuceneDotNetIndexes.cs to get started.

This would be fun to code, but I'm working on completing the incremental update changes to the reporting views and finishing the enhancement to Lucene.NET that moves it past simple TF/IDF into elementary PageRank territory.

Send me a PM telling me more about what you want to do and what the purpose of this project is, if you wouldn't mind?

Thanks!
Mike

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

All Replies


Top 10 Contributor
229 Posts

Mike, thanks!

Top 10 Contributor
1,905 Posts

No problem...

I am actually working on the Rendering engine right now.  :)


Top 200 Contributor
1 Posts
amir_b replied on Wed, Apr 11 2012 6:28 AM

Thanks Mike,

I'm looking for a way to automatically locate ads (without any previous knowledge) on web pages, and figure out their size, location and type (video/text/image, etc.).

I have installed arachnode, and it seems like a great tool.

Is there a way to automatically ID the ads with arachnode?

Thanks!

Amir.

Top 10 Contributor
1,905 Posts

There is, and I got a great start on it with the Templater.cs plugin, but there is a MUCH BETTER tool for the job: Boilerpipe.  https://www.google.com/webhp?sourceid=chrome-instant&ix=tea&ie=UTF-8#hl=en&sugexp=frgbld&gs_nf=1&tok=WcMatWsfPE-J3qIswCfdJA&pq=boilerplate&cp=8&gs_id=u&xhr=t&q=boilerpipe&pf=p&sclient=psy-ab&oq=boilerpi&aq=0&aqi=g4&aql=&gs_l=&pbx=1&bav=on.2,or.r_gc.r_pw.r_cp.r_qf.,cf.osb&fp=87c6230bd8e448f1&ix=tea&biw=1417&bih=907

AN can handle the pre-processing of the page by rendering the JavaScript and handing Boilerpipe the text to process.


Top 500 Contributor
1 Posts

Dear Amir:

Did you find a tool to detect ads?

Top 10 Contributor
1,905 Posts

Old thread; look at some of the more popular browser tools - I'm sure they have compiled a list of advertising providers.



copyright 2004-2017, arachnode.net LLC