<HTML>
<BODY>
<TAG1>
<TAG1>
</TAG1>
</TAG1>
</BODY>
</HTML>
I should get two strings returned. The first would consist of the outer tag and its contents, and look like:
<TAG1>
<TAG1>
</TAG1>
</TAG1>
and the second would consist of the inner tag and its contents, and look like:
<TAG1>
</TAG1>
I knew I had used some code for traversing the nodes of an HTML page in C# before, but although the basic traversal code was helpful, I didn't have anything that would help me pull out the contents. I had a look online, but couldn't find anything that really matched what I wanted to do. So, I wrote my own. Maybe someone else out there will find it useful too, or can recommend another approach. I'm always open to suggestions!
First, we need to get the HTML document into a form that can be parsed easily. I used IHTMLDocument2, part of the mshtml COM library (the Microsoft HTML Object Library), from C#. My document was already in the form of a string ("stringOfHTML"), so it was easy to load that into an IHTMLDocument2. Here's how it's done:
IHTMLDocument2 doc = new HTMLDocumentClass();
doc.write(new object[] { stringOfHTML });
doc.close();
Once that is done, you need a way to access each node and traverse through them. I store the body node in an IHTMLElement as follows:
IHTMLElement bodyElement = doc.body;
Now I want to iterate through the child nodes, so I cast the body's children to an IHTMLElementCollection, where each item in the collection is a child node of the body tag:
IHTMLElementCollection childTags = (IHTMLElementCollection)bodyElement.children;
Using some recursion, we can extract the tags we want from within an HTML document. Here is the code:
// Requires a COM reference to the Microsoft HTML Object Library (mshtml),
// plus "using System;" and "using mshtml;" at the top of the file.
// DesiredTagName is assumed to be a string field on the containing class.
public void extractTagOfType(String stringOfHTML)
{
    //guard against null or empty input, e.g. when a child has no inner markup
    if (String.IsNullOrEmpty(stringOfHTML))
    {
        return;
    }
    IHTMLDocument2 doc = new HTMLDocumentClass();
    doc.write(new object[] { stringOfHTML });
    doc.close();
    IHTMLElement bodyElement = doc.body;
    IHTMLElementCollection childTags = (IHTMLElementCollection) bodyElement.children;
    foreach (IHTMLElement child in childTags)
    {
        //compare case-insensitively so "tag1" and "TAG1" both match
        if (child.tagName.Equals(DesiredTagName, StringComparison.OrdinalIgnoreCase))
        {
            //do something with the contents of the tag (child.innerHTML)
        }
        //recurse whether or not this node matched: one of the tags we want
        //might be nested inside the current node
        extractTagOfType(child.innerHTML);
    }
}
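The function above leaves the "do something" step open. As a minimal sketch of one way to get the two strings described at the start, the variant below collects each matching tag into a list. The names TagExtractor, ExtractTags and Collect are hypothetical and not part of mshtml. It also makes two assumptions that differ from the code above: it parses the string only once and recurses over the elements themselves, and it stores outerHTML rather than innerHTML so that each returned string includes the tag as well as its contents. I have not checked how mshtml serializes every kind of markup, so treat the exact strings it returns as unverified.

```csharp
using System;
using System.Collections.Generic;
using mshtml; // COM reference: Microsoft HTML Object Library

// Hypothetical helper class, for illustration only.
public static class TagExtractor
{
    // Parses the HTML once and returns the outerHTML of every element
    // whose tag name matches desiredTagName, outermost match first.
    public static List<string> ExtractTags(string stringOfHTML, string desiredTagName)
    {
        // Note: HTMLDocumentClass only compiles when "Embed Interop Types"
        // is turned off for the mshtml reference.
        IHTMLDocument2 doc = new HTMLDocumentClass();
        doc.write(new object[] { stringOfHTML });
        doc.close();

        var matches = new List<string>();
        if (doc.body != null)
        {
            Collect(doc.body, desiredTagName, matches);
        }
        return matches;
    }

    // Pre-order walk: record a matching element, then look inside it.
    private static void Collect(IHTMLElement parent, string desiredTagName, List<string> matches)
    {
        var children = (IHTMLElementCollection)parent.children;
        foreach (IHTMLElement child in children)
        {
            if (string.Equals(child.tagName, desiredTagName, StringComparison.OrdinalIgnoreCase))
            {
                matches.Add(child.outerHTML);
            }
            // Always descend: a match may be nested inside any node.
            Collect(child, desiredTagName, matches);
        }
    }
}
```

A call would then look like List<string> tags = TagExtractor.ExtractTags(stringOfHTML, "TAG1"); with the outer tag expected before the inner one, because parents are visited before their children.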
2 comments:
6 years after your post, it was a Godsend..
Thanks, been working on all sorts of solutions..
This stuff isn't well documented anywhere.
same!! Very nice!
// Requires "using mshtml;" and, for Trace, "using System.Diagnostics;"
public void parseTable(string tableHTML)
{
    IHTMLDocument2 doc = new HTMLDocumentClass();
    doc.write(new object[] { tableHTML });
    doc.close();
    // getElementsByTagName lives on IHTMLDocument3, hence the cast
    IHTMLElementCollection rows = ((IHTMLDocument3)doc).getElementsByTagName("tr");
    foreach (IHTMLElement row in rows)
    {
        Trace.WriteLine("row.innerText=" + row.innerText);
        // the children of a row are its cells
        IHTMLElementCollection cells = (IHTMLElementCollection) row.children;
        foreach (IHTMLElement cell in cells)
        {
            Trace.WriteLine("cell.innerText=" + cell.innerText);
        }
    }
}