We want to capture information about ad spots:
- on a specific list of approx. 10,000 domains
- capture a screenshot (PNG / JPG)
- read the JS and identify ad spot sizes
- identify the place on the page where each ad spot is located
- relate this location back to the screenshot (draw it on the screenshot)
- write everything to a DB
- repeat the above for the next level down (max 1)
- repeat everything weekly, compare, and highlight differences
Are we looking at the right toolkit with arachnode? Does anyone out there want to have a crack at coding this?
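For anyone picking this up, the "write to DB" and "repeat weekly, compare" steps in the list above could be sketched roughly as follows. This is only an illustration in Python with SQLite, not arachnode.net code: the `AdSpot` fields, the `ad_spots` table, and the week labels are all hypothetical names made up for the example.

```python
import sqlite3
from typing import NamedTuple


class AdSpot(NamedTuple):
    """One detected ad spot (hypothetical record layout)."""
    domain: str
    page_url: str
    x: int        # left edge on the screenshot, in pixels
    y: int        # top edge on the screenshot, in pixels
    width: int
    height: int


def open_db(path=":memory:"):
    """Open the database and create the (hypothetical) ad_spots table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS ad_spots (
               crawl_week TEXT NOT NULL,
               domain TEXT NOT NULL,
               page_url TEXT NOT NULL,
               x INTEGER, y INTEGER, width INTEGER, height INTEGER
           )"""
    )
    return conn


def save_spots(conn, crawl_week, spots):
    """Store every ad spot found during one weekly crawl."""
    conn.executemany(
        "INSERT INTO ad_spots VALUES (?, ?, ?, ?, ?, ?, ?)",
        [(crawl_week, *spot) for spot in spots],
    )
    conn.commit()


def load_spots(conn, crawl_week):
    """Return the set of ad spots recorded for one crawl."""
    rows = conn.execute(
        "SELECT domain, page_url, x, y, width, height "
        "FROM ad_spots WHERE crawl_week = ?",
        (crawl_week,),
    )
    return {AdSpot(*row) for row in rows}


def diff_weeks(conn, old_week, new_week):
    """Compare two crawls: which spots appeared and which disappeared."""
    old = load_spots(conn, old_week)
    new = load_spots(conn, new_week)
    return {"added": sorted(new - old), "removed": sorted(old - new)}
```

Because spots are compared as exact tuples, an ad that moves by a few pixels shows up as one removal plus one addition. A real comparison would probably need some tolerance on position, which this sketch leaves out.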
Yes, arachnode is the toolkit you want. :) (not that I'm biased...)
The size of the Google ad is 140 x 600.
So, how much of the functionality described above is currently present in arachnode.net?
Well, obviously the crawling functionality... most everything else would need to be added. Making the modifications wouldn't be difficult at all. The first step would be to familiarize yourself with how plugins work. Take a look at ManageLuceneDotNetIndexes.cs to get started.
This would be fun to code, but, I'm working on completing the incremental update changes to the reporting views and finishing the enhancement to Lucene.NET that moves it past simple TF/IDF into elementary PageRank territory.
If you wouldn't mind, send me a PM telling me more about what you want to do and what the purpose of this project is.
I am actually working on the Rendering engine right now. :)
I'm looking for a way to automatically locate ads (without any previous knowledge) on web pages, and figure out their size, location and type (video/text/image, etc.).
I have installed arachnode; it seems like a great tool.
Is there a way to automatically identify the ads with arachnode?
There is, and I made a great start on it with the Templater.cs plugin, but there is a MUCH BETTER tool for the job, boilerpipe: https://www.google.com/webhp?sourceid=chrome-instant&ix=tea&ie=UTF-8#hl=en&sugexp=frgbld&gs_nf=1&tok=WcMatWsfPE-J3qIswCfdJA&pq=boilerplate&cp=8&gs_id=u&xhr=t&q=boilerpipe&pf=p&sclient=psy-ab&oq=boilerpi&aq=0&aqi=g4&aql=&gs_l=&pbx=1&bav=on.2,or.r_gc.r_pw.r_cp.r_qf.,cf.osb&fp=87c6230bd8e448f1&ix=tea&biw=1417&bih=907
Did you find a tool for detecting ads?
Old thread, but look at some of the more popular browser tools. I'm sure they have compiled lists of advertising providers.
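As a rough illustration of the provider-list idea, here is a minimal Python sketch that flags `script`, `iframe`, and `img` tags whose `src` points at a known ad host and records any declared width and height. The two hosts in `AD_HOSTS` are only a tiny sample for the example, and the function and class names are made up here; none of this is part of arachnode.net.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Illustrative sample only (assumption): a real crawl would load a
# maintained list of advertising providers instead of hard-coding two hosts.
AD_HOSTS = {"doubleclick.net", "googlesyndication.com"}


def is_ad_host(url, ad_hosts=AD_HOSTS):
    """True if the URL's host is a listed ad host or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == h or host.endswith("." + h) for h in ad_hosts)


class AdCandidateParser(HTMLParser):
    """Collects tags that load content from a listed ad host."""

    def __init__(self):
        super().__init__()
        self.candidates = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("script", "iframe", "img"):
            return
        attributes = dict(attrs)
        src = attributes.get("src")
        if src and is_ad_host(src):
            self.candidates.append({
                "tag": tag,
                "src": src,
                # Declared size, if any; may be None when set via CSS or JS.
                "width": attributes.get("width"),
                "height": attributes.get("height"),
            })


def find_ad_candidates(html):
    """Return the ad candidates found in a page's static HTML."""
    parser = AdCandidateParser()
    parser.feed(html)
    return parser.candidates
```

This only sees what is in the static HTML. Ads injected by JavaScript after the page loads would need the rendered DOM, and the on-page position still has to come from a rendering engine.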