What would be the best method of parsing HTML? I want to get all the tags into an array, using as little CPU as possible.
When parsing, you need to specify which pieces of information you want to extract.
To avoid parsing problems and extracting the wrong information from a web page, it would be good to find a known piece of text in the page, get its index, and step forward from that index until you reach the right info.
For example, for a jobs site, the index would be:
0-[Home] 1-[Job Title] 2-[asp.net developer] 3-[City] 4-[New York City]
Here we have parsed the page into an indexed array. To extract the job title, we find the text "Job Title", get its index, add 1 to it, and read the entry at that position. This helps avoid extracting the wrong info when the page is updated.
So to achieve this I need to parse the page and extract all tags into an array. What would be the best and easiest way to do that?
mshtml, HtmlAgilityPack... or something else?
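The label-plus-one lookup described in the question can be sketched roughly as follows. This is only an illustration in Python, using the standard library's html.parser rather than mshtml or HtmlAgilityPack; the SAMPLE markup and the TextCollector and value_after names are made up for the example.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects every non-empty text node, in document order."""

    def __init__(self):
        super().__init__()
        self.items = []

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only runs between tags
            self.items.append(text)


def value_after(items, label):
    """Return the entry right after `label`, or None if the label is missing or last."""
    try:
        i = items.index(label)
    except ValueError:
        return None
    return items[i + 1] if i + 1 < len(items) else None


# Hypothetical page fragment shaped like the jobs-site example (an assumption).
SAMPLE = """
<ul><li><a href="/">Home</a></li></ul>
<table>
  <tr><th>Job Title</th><td>asp.net developer</td></tr>
  <tr><th>City</th><td>New York City</td></tr>
</table>
"""

collector = TextCollector()
collector.feed(SAMPLE)
collector.close()

items = collector.items
job_title = value_after(items, "Job Title")
city = value_after(items, "City")
```

Because the lookup is keyed on the label text rather than a fixed position, adding a new entry before "Job Title" would shift the indexes but not break the extraction. It would still break if the value stopped being the very next text node after its label.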
Use HtmlAgilityPack and query with XPath. Hands down.
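To give a feel for what querying with XPath looks like, here is a rough sketch. It is not HtmlAgilityPack code: it uses Python's xml.etree.ElementTree, which supports only a small XPath subset and, unlike HtmlAgilityPack, needs well-formed markup. The SAMPLE page layout and the field helper are assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment (an assumption for illustration).
SAMPLE = """<html><body>
<table>
  <tr><th>Job Title</th><td>asp.net developer</td></tr>
  <tr><th>City</th><td>New York City</td></tr>
</table>
</body></html>"""

root = ET.fromstring(SAMPLE)


def field(root, label):
    """Return the <td> text of the row whose <th> text equals `label`, else None.

    Assumes `label` contains no single quote, since it is pasted into the path.
    """
    cell = root.find(f".//tr[th='{label}']/td")
    return cell.text if cell is not None else None


job_title = field(root, "Job Title")
city = field(root, "City")
```

The query selects by the label's text instead of by position, so it keeps working when rows are added or reordered, as long as each label stays in the same row as its value.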
Milan, if you don't have it already, here's a link to the Html Agility Pack docs:
http://htmlagilitypack.codeplex.com/Release/ProjectReleases.aspx?ReleaseId=33903
Milan Solanki:
> What would be the best method of parsing html
I do this daily. I use biterscripting for parsing our own web pages and extracting all kinds of info from them in all kinds of formats. You can start with the sample script posted at http://www.biterscripting.com/SS_WebPageToText.html, as I did. To try this, simply enter the following command in biterscripting:
script SS_WebPageToText.txt page("http://arachnode.net/forums/t/710.aspx")
It will show you this very page by extracting the plain text from it. If you don't have biterscripting, you can download it free from any download site.
Jenni