Taming HTML Parsing with libxml (1)

For the NSAttributedString+HTML open source project I chose to implement HTML parsing with a set of NSScanner category methods. The resulting code is relatively easy to understand but has a couple of annoying drawbacks. You have to convert the NSData into an NSString, which duplicates the document and effectively doubles the amount of memory needed. Then, while parsing, I build an ad hoc tree of DTHTMLElement instances, adding yet another copy of the document in RAM.

When parsing HTML, and by extension XML, you have two modes of operation available. With the Simple API for XML (SAX), walking through the document triggers events for its individual pieces. The second method is to build a tree of nodes, a Document Object Model (DOM). NSScanner lends itself to the SAX style, but in this case that is less than ideal, because CSS inheritance needs some sort of hierarchy to walk up.
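
To make the inheritance point concrete, here is a minimal sketch in plain C of how an inherited property such as color could be resolved by walking up the parent links. The StyledNode struct and the resolve_color function are hypothetical names made up for this illustration; they are not part of NSAttributedString+HTML or libxml.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an element in a DOM tree (not a libxml type). */
typedef struct StyledNode {
    const char *name;           /* tag name, e.g. "p" */
    const char *color;          /* color set on this element, NULL if unset */
    struct StyledNode *parent;  /* child->parent link, NULL for the root */
} StyledNode;

/* Walk up the parent chain until an ancestor sets a concrete color.
   Returns fallback if nothing between node and the root sets one. */
const char *resolve_color(const StyledNode *node, const char *fallback)
{
    assert(fallback != NULL);

    for (const StyledNode *n = node; n != NULL; n = n->parent) {
        /* an unset color or the keyword "inherit" defers to the parent */
        if (n->color != NULL && strcmp(n->color, "inherit") != 0) {
            return n->color;
        }
    }
    return fallback;
}
```

With a pure event stream there is no parent pointer to follow, so you would have to maintain a stack of open elements yourself to answer the same question.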

In this post we will begin to explore the industry-standard libxml library and see how we can thinly wrap it in Objective-C so that it plays nicely with our code.

Getting libxml into your Xcode project is straightforward. Fortunately for us, libxml is so old and established that you can find it already installed on Unix, Mac and iOS platforms. There are two kinds of libraries in C: static and dynamic. libxml is the latter, which you can recognize by the .dylib extension.

Adding the Library

First we need to add the library that provides all the XML and HTML structures and functions. The file libxml2.dylib is a symbolic link to libxml2.2.dylib, so we are actually linking against libxml2; the trailing .2 is the version of the dynamic library, not a libxml 2.2 release.

Next, because libxml is not a framework that would package the necessary headers with it, we also need to tell Xcode where the headers can be found. Since libxml also comes with OS X, its headers, just like those of all other OS X system libraries, can be found under /usr/include. Add /usr/include/libxml2 to the Header Search Paths and we’re set.

Now all we need to do to access libxml’s parsing functions and data structures is to add the appropriate import. Most of the internal structures are shared between the XML and HTML parsers, so we just need the HTMLparser header.

#import <libxml/HTMLparser.h>

Document Structure

Before we get into parsing, let me show you how libxml represents HTML documents. Everything in libxml is a node. Because C does not have a concept of objects, the classical method of representing a tree is to have C structs whose members point to other structs. A child is just a pointer to the child struct/node. If there can be more than one item, i.e. a list, this is represented by a linked list where the first node points to the next and so on, until the very last node has a NULL pointer.
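
To illustrate the pointer-based representation, here is a minimal sketch in plain C. DemoNode, demo_append_child and demo_child_count are hypothetical names for this example only; the field names merely mirror the ones libxml uses, and this is not libxml code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical miniature tree node; not a libxml type. */
typedef struct DemoNode {
    const char *name;
    struct DemoNode *parent;    /* child->parent link */
    struct DemoNode *children;  /* first child, NULL if there is none */
    struct DemoNode *last;      /* last child, NULL if there is none */
    struct DemoNode *next;      /* next sibling, NULL at the end of the list */
    struct DemoNode *prev;      /* previous sibling, NULL at the start */
} DemoNode;

/* Append child to the end of parent's linked list of children. */
void demo_append_child(DemoNode *parent, DemoNode *child)
{
    assert(parent != NULL && child != NULL);

    child->parent = parent;
    child->next = NULL;
    child->prev = parent->last;

    if (parent->last != NULL) {
        parent->last->next = child;   /* hook onto the current tail */
    } else {
        parent->children = child;     /* first child of this parent */
    }
    parent->last = child;
}

/* Count the children by following the next links until NULL. */
size_t demo_child_count(const DemoNode *parent)
{
    size_t count = 0;
    for (const DemoNode *c = parent->children; c != NULL; c = c->next) {
        count++;
    }
    return count;
}
```

Keeping a pointer to the last child makes appending cheap, because the list does not have to be walked to find its tail.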

The smallest unit in libxml is the xmlNode structure, which is defined as follows:

/**
 * xmlNode:
 *
 * A node in an XML tree.
 */
typedef struct _xmlNode xmlNode;
typedef xmlNode *xmlNodePtr;
struct _xmlNode {
    void           *_private;	/* application data */
    xmlElementType   type;	/* type number, must be second ! */
    const xmlChar   *name;      /* the name of the node, or the entity */
    struct _xmlNode *children;	/* parent->childs link */
    struct _xmlNode *last;	/* last child link */
    struct _xmlNode *parent;	/* child->parent link */
    struct _xmlNode *next;	/* next sibling link  */
    struct _xmlNode *prev;	/* previous sibling link  */
    struct _xmlDoc  *doc;	/* the containing document */
 
    /* End of common part */
    xmlNs           *ns;        /* pointer to the associated namespace */
    xmlChar         *content;   /* the content */
    struct _xmlAttr *properties;/* properties list */
    xmlNs           *nsDef;     /* namespace definitions on this node */
    void            *psvi;	/* for type/PSVI informations */
    unsigned short   line;	/* line number */
    unsigned short   extra;	/* extra data for XPath/XSLT */
};

The useful links, depicted in the above chart, are children, last, parent, next, prev and doc. The type value tells you what role the node plays: a tag is an XML_ELEMENT_NODE, the text inside a tag is represented by an XML_TEXT_NODE, and attributes are XML_ATTRIBUTE_NODE. Note that even if the original HTML does not contain a DTD, html or body tag, the parser will imply them.

Let’s Parse Already

I sense that you are growing impatient with me. OK, OK, we’re getting right to it now that you understand how libxml represents DOMs. Assume we have some HTML data downloaded from the web, and the NSURL it came from is in baseURL.

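// NSData data contains the document data
// encoding is the NSStringEncoding of the data
// baseURL is the document's base URL, i.e. its location
 
// libxml wants the IANA character set name as a C string
CFStringEncoding cfenc = CFStringConvertNSStringEncodingToEncoding(encoding);
CFStringRef cfencstr = CFStringConvertEncodingToIANACharSetName(cfenc);
char encBuffer[64];
const char *enc = NULL; // NULL lets libxml detect the encoding itself
 
if (cfencstr && CFStringGetCString(cfencstr, encBuffer, sizeof(encBuffer), kCFStringEncodingASCII))
{
    enc = encBuffer;
}
 
// htmlReadMemory takes an explicit length, NSData bytes are not zero-terminated
htmlDocPtr _htmlDocument = htmlReadMemory([data bytes], (int)[data length],
      [[baseURL absoluteString] UTF8String],
      enc,
      HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);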

Since we don’t need any warnings or errors, we can suppress them by passing the corresponding options. The base URL might be necessary to resolve relative URLs contained in the document. And most importantly, we cannot assume that the bytes are encoded as UTF-8, so we look up the appropriate character set name and pass it to the parser.

Remember, this is pure C, so once we don’t need this DOM any more we need to trigger a routine that walks through these linked structures and frees up the reserved memory.

if (_htmlDocument)
{
   xmlFreeDoc(_htmlDocument);
}

If _htmlDocument is not NULL, we have successfully parsed the document. There are multiple ways to use it now, but for the final example in this post let me show you a loop that just dumps the individual elements to the log. This demonstrates how to follow the links and also how to access the contents of text nodes.

xmlNodePtr currentNode = (xmlNodePtr)_htmlDocument;
 
BOOL beginOfNode = YES;
 
while (currentNode) 
{
    // output node if it is an element
    if (beginOfNode)
    {
        if (currentNode->type == XML_ELEMENT_NODE)
        {
            NSMutableArray *attrArray = [NSMutableArray array];
 
            for (xmlAttrPtr attrNode = currentNode->properties; 
                 attrNode; attrNode = attrNode->next)
            {
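                xmlNodePtr contents = attrNode->children;
 
                // attributes without a value (e.g. disabled) have no child text node
                const xmlChar *value = (contents && contents->content) ? contents->content : (const xmlChar *)"";
 
                [attrArray addObject:[NSString stringWithFormat:@"%s='%s'", 
                                      attrNode->name, value]];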
            }
 
            NSString *attrString = [attrArray componentsJoinedByString:@" "]; 
 
            if ([attrString length])
            {
                attrString = [@" " stringByAppendingString:attrString];
            }
 
            NSLog(@"<%s%@>", currentNode->name, attrString);
        }
        else if (currentNode->type == XML_TEXT_NODE)
        {
            NSLog(@"%s", currentNode->content);
        }
        else if (currentNode->type == XML_COMMENT_NODE)
        {
            // the text of a comment is in content, name is always "comment"
            NSLog(@"/* %s */", currentNode->content);
        }
    }
 
    if (beginOfNode && currentNode->children)
    {
        // descend to the first child, which starts a new node
        currentNode = currentNode->children;
    }
    else
    {
        // this node is done (no children, or all of them visited): close it
        if (currentNode->type == XML_ELEMENT_NODE)
        {
            NSLog(@"</%s>", currentNode->name);
        }
 
        if (currentNode->next)
        {
            // continue with the next sibling
            currentNode = currentNode->next;
            beginOfNode = YES;
        }
        else
        {
            // no more siblings, go back up to the parent
            currentNode = currentNode->parent;
            beginOfNode = NO; // avoid descending into its children again
        }
    }
}

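Note how I use %s so that I can log the zero-terminated C strings without having to convert them to NSStrings. Keep in mind that %s interprets the bytes in the system default encoding while libxml stores its strings as UTF-8, so this shortcut is only reliable for ASCII content.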

Obviously there are other ways to iterate through the document, for example by means of recursion. But this is meant to show how you can walk through nodes and their children and how you can also get the attributes.
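
For comparison, here is a rough sketch of the recursive variant in plain C. To keep it self-contained it uses a hypothetical WalkNode struct with only the fields the walk needs, and it collects the output in a string buffer instead of logging. WalkNode, DemoNodeType, demo_append and dump_nodes are made-up names for this illustration, not libxml API. With libxml you would follow the same children and next links on an xmlNodePtr.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for the node types and node struct (not libxml). */
typedef enum { DEMO_ELEMENT_NODE = 1, DEMO_TEXT_NODE = 3 } DemoNodeType;

typedef struct WalkNode {
    DemoNodeType type;
    const char *name;           /* tag name for elements */
    const char *content;        /* text for text nodes */
    struct WalkNode *children;  /* first child */
    struct WalkNode *next;      /* next sibling */
} WalkNode;

/* Append src to the zero-terminated string in dst, truncating if it is full. */
static void demo_append(char *dst, size_t dst_size, const char *src)
{
    size_t used = strlen(dst);
    if (used < dst_size - 1) {
        snprintf(dst + used, dst_size - used, "%s", src);
    }
}

/* Depth-first dump: the recursion handles children, the loop handles siblings.
   out must already contain a zero-terminated string, e.g. "". */
void dump_nodes(const WalkNode *node, char *out, size_t out_size)
{
    assert(out != NULL && out_size > 0);

    for (; node != NULL; node = node->next) {
        if (node->type == DEMO_ELEMENT_NODE) {
            demo_append(out, out_size, "<");
            demo_append(out, out_size, node->name);
            demo_append(out, out_size, ">");

            dump_nodes(node->children, out, out_size);

            demo_append(out, out_size, "</");
            demo_append(out, out_size, node->name);
            demo_append(out, out_size, ">");
        } else if (node->type == DEMO_TEXT_NODE) {
            demo_append(out, out_size, node->content);
        }
    }
}
```

The recursive version is shorter because the call stack remembers where to continue after the children. The price is one stack frame per nesting level, which is why a loop can be the safer choice for very deeply nested documents.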

Next time we will look at how we can wrap this pure C code so that we can more easily find and access parts of the document. We cannot simply wrap xmlNode in an Objective-C class, because then we might end up freeing the structure while a node instance is still around, thus creating a whole lot of dangling pointers and introducing crash potential.

This is the case with the Objective-C HTML Parser project on GitHub by Ben Reeves. But even though I don’t share Ben’s philosophy, his prior work served as the starting point for this article.


Categories: Recipes

23 Comments »

  1. Really loving the diagrams, I’ve been using zootreeves’ parser for a while and it’s lightning fast but there’s some clunkiness to it. Excited to see libxml used here without a wrapper, might have to start doing it this way.

  2. Why don’t we collaborate on a more elegant Objective-C interface to it?

  3. I never thought about using libxml directly, because it seemed hard. But I really like your introduction, one less FUD to worry about in my head.

  4. Hi,

    What application do you use for the diagrams?

  5. Omnigraffle on Mac with a freely available stencil set.

  6. May I ask which one? It is for my personal use only.

  7. It’s a nice paper, I love it very much, especially the diagrams. Thanks very much.

  8. Nice article, I would say that if everything was XML, it would be much easier to parse, but that is not generally the case.

  9. That’s why I made DTHTMLParser based on libxml which can also deal with HTML.