Tuesday, January 27, 2009

Traversing the DOM in C#

Today I had a problem where I needed to find all tags of a given type in an HTML document and extract whatever was within them, including nested tags. That is, given:
<HTML>
<BODY>
<TAG1>
<TAG1>
</TAG1>
</TAG1>
</BODY>
</HTML>
I should get two strings returned. The first would consist of the outer tag and its contents, and look like:
<TAG1>
<TAG1>
</TAG1>
</TAG1>
and the second would consist of the inner tag and its contents, and look like:
<TAG1>
</TAG1>
I knew I had used some code for traversing the nodes of an HTML page in C# before, but although the basic traversal code was helpful, I didn't have anything that would help me pull out the contents. I had a look online, but couldn't find anything that really matched what I wanted to do. So, I wrote my own. Maybe someone else out there will find it useful too, or can recommend another approach. I'm always open to suggestions!

First, we need to get the HTML document into a form that can be parsed easily. I used IHTMLDocument2, an interface from the mshtml COM library, which C# can use through the Microsoft.mshtml interop assembly. My document was already in the form of a string ("stringOfHTML"), so it was easy to load it into an IHTMLDocument2. Here's how it's done:
IHTMLDocument2 doc = new HTMLDocumentClass();
doc.write(new object[] { stringOfHTML });
doc.close();
Once that is done, you need a way to access each node and traverse through them. I store the body node in an IHTMLElement as follows:
IHTMLElement bodyElement = doc.body;
Now I want to iterate through the child nodes, so I use IHTMLElementCollection to create a collection of IHTMLElements, where each item in the collection is a child node of the body tag:
IHTMLElementCollection childTags = (IHTMLElementCollection)bodyElement.children;
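To see what that collection holds before any recursion is added, a minimal sketch along these lines lists the tag name of each direct child of the body. The names ChildLister and ListChildTagNames are made up for illustration, and the sketch assumes a project reference to the Microsoft.mshtml interop assembly (with interop type embedding turned off, since HTMLDocumentClass is a class rather than an interface).

```csharp
using System.Collections.Generic;
using mshtml; // assumes a reference to the Microsoft.mshtml interop assembly

public static class ChildLister
{
    // Hypothetical helper: returns the tag names of the body's direct children.
    public static List<string> ListChildTagNames(string html)
    {
        // Load the string into a document, exactly as in the snippet above
        IHTMLDocument2 doc = new HTMLDocumentClass();
        doc.write(new object[] { html });
        doc.close();

        var names = new List<string>();
        IHTMLElementCollection childTags = (IHTMLElementCollection)doc.body.children;
        foreach (IHTMLElement child in childTags)
        {
            // Only direct children appear here; grandchildren need another level of iteration
            names.Add(child.tagName);
        }
        return names;
    }
}
```

Only the immediate children show up in the list, which is why the nested case needs the recursive step described next.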
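With some recursion, we can extract the tags we want from anywhere in the document. Each call re-parses the inner HTML of a node as a new document, so nested tags are found no matter how deep they sit. Here is the full code: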
// Requires a reference to the Microsoft.mshtml interop assembly and "using mshtml;".
// DesiredTagName is assumed to be a string field defined elsewhere in the class, e.g. "TAG1".
public void extractTagOfType(String stringOfHTML)
{
    IHTMLDocument2 doc = new HTMLDocumentClass();
    doc.write(new object[] { stringOfHTML });
    doc.close();
    IHTMLElement bodyElement = doc.body;
    IHTMLElementCollection childTags = (IHTMLElementCollection)bodyElement.children;
    foreach (IHTMLElement child in childTags)
    {
        // tagName may come back in upper case, so compare without regard to case
        if (child.tagName.Equals(DesiredTagName, StringComparison.OrdinalIgnoreCase))
        {
            // do something with the tag here: child.outerHTML is the tag plus its contents,
            // child.innerHTML is the contents only
        }
        // recurse in either case: a tag we want may be nested inside the current node,
        // whether or not the current node is itself a match
        if (!String.IsNullOrEmpty(child.innerHTML))
        {
            extractTagOfType(child.innerHTML);
        }
    }
}
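
The function above parses a fresh document for every node it visits. As a possible variation (a sketch, not the code I actually used), the same idea can walk the tree that was parsed once and hand back the matches as a list of strings. TagExtractor, CollectOuterHtml and Walk are names invented for this example, and it again assumes a reference to the Microsoft.mshtml interop assembly. It uses outerHTML so that each string includes the matching tag itself as well as its contents.

```csharp
using System;
using System.Collections.Generic;
using mshtml; // assumes a reference to the Microsoft.mshtml interop assembly

public static class TagExtractor
{
    // Hypothetical helper: returns the outerHTML of every element named tagName,
    // with an outer element listed before the matches nested inside it.
    public static List<string> CollectOuterHtml(string html, string tagName)
    {
        IHTMLDocument2 doc = new HTMLDocumentClass();
        doc.write(new object[] { html });
        doc.close();

        var results = new List<string>();
        if (doc.body != null)
        {
            Walk(doc.body, tagName, results);
        }
        return results;
    }

    // Depth-first walk over the parsed tree; nothing is re-parsed along the way.
    private static void Walk(IHTMLElement element, string tagName, List<string> results)
    {
        foreach (IHTMLElement child in (IHTMLElementCollection)element.children)
        {
            if (string.Equals(child.tagName, tagName, StringComparison.OrdinalIgnoreCase))
            {
                results.Add(child.outerHTML);
            }
            // Descend whether or not this node matched, so nested matches are found too
            Walk(child, tagName, results);
        }
    }
}
```

Calling TagExtractor.CollectOuterHtml(stringOfHTML, "div") on a page with one div nested inside another should give two entries, the outer one first. How the parser treats tag names it does not recognise (such as the TAG1 placeholder) can depend on the installed browser version, so it is worth trying this on real markup.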

2 comments:

Mark said...

6 years after you post it was a Godsend..
Thanks, been working on all sorts of solutions..
This stuff isn't well documented anywhere.

Anonymous said...

same!! Very nice!

public void parseTable(string tableHTML)
{
    IHTMLDocument2 doc = new HTMLDocumentClass();
    doc.write(new object[] { tableHTML });
    doc.close();

    IHTMLElementCollection rows = ((IHTMLDocument3)doc).getElementsByTagName("tr");
    foreach (IHTMLElement row in rows)
    {
        Trace.WriteLine("row.innerText=" + row.innerText);
        IHTMLElementCollection cells = (IHTMLElementCollection)row.children;
        foreach (IHTMLElement cell in cells)
        {
            Trace.WriteLine("cell.innerText=" + cell.innerText);
        }
    }
}