Taming HTML Parsing with libxml (2)

Jan 20, 2012

The HTML parsing in my DTCoreText open source project is done entirely through sophisticated use of NSScanner. But it has long been on my list to rewrite all of this parsing with the industry-standard libxml2, which comes preinstalled on all iOS devices. Not only is this potentially much faster in dealing with large chunks of HTML, it is probably also more intelligent in correcting structural errors in the HTML code you throw at it.

In part 1 of this series I showed you how to add the libxml2 library to your project and explained the basic concepts of using libxml2 for parsing any odd HTML snippet into a DOM. In this part 2 we will create an Objective-C wrapper around libxml2 so that we can use it just like NSXMLParser, only for HTML.

There are basically two approaches to handling XML/HTML: DOM or SAX. DOM (short for Document Object Model) refers to building a tree of nodes and leaves which you can then query for information based on their path (with XPath). SAX (short for Simple API for XML) walks through the document and sends you events as it discovers things, for example the beginning of an element, the end of it, or a comment.
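To make the event idea concrete, here is a toy sketch in plain C. It is not libxml2 and handles only bare <tag> and </tag> markers; ToyHandler, toy_sax_parse and the logging callbacks are names I made up for illustration. The point is that the input is walked once and callbacks fire as things are discovered, without a tree ever being built.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical event callbacks, loosely modeled on SAX; not libxml2 API. */
typedef struct {
    void (*start_element)(void *ctx, const char *name, size_t len);
    void (*end_element)(void *ctx, const char *name, size_t len);
    void (*characters)(void *ctx, const char *text, size_t len);
} ToyHandler;

/* Walks the input once and reports what it sees; no tree is built.
   Only understands plain <tag> and </tag>, no attributes or comments. */
void toy_sax_parse(const char *html, const ToyHandler *h, void *ctx)
{
    const char *p = html;
    while (*p) {
        if (*p == '<') {
            const char *close = strchr(p, '>');
            if (!close) break;                      /* unterminated tag: stop */
            int is_end = (p[1] == '/');
            const char *name = p + (is_end ? 2 : 1);
            size_t len = (size_t)(close - name);
            if (is_end) {
                if (h->end_element) h->end_element(ctx, name, len);
            } else {
                if (h->start_element) h->start_element(ctx, name, len);
            }
            p = close + 1;
        } else {
            const char *next = strchr(p, '<');
            size_t len = next ? (size_t)(next - p) : strlen(p);
            if (h->characters) h->characters(ctx, p, len);
            p += len;
        }
    }
}

/* A sample listener that appends every event to a text log. */
typedef struct { char log[128]; } ToyLog;

static void log_event(ToyLog *l, const char *kind, const char *s, size_t len)
{
    size_t used = strlen(l->log);
    snprintf(l->log + used, sizeof l->log - used, "%s(%.*s) ", kind, (int)len, s);
}

static void on_start(void *ctx, const char *n, size_t len) { log_event(ctx, "start", n, len); }
static void on_end(void *ctx, const char *n, size_t len)   { log_event(ctx, "end", n, len); }
static void on_text(void *ctx, const char *t, size_t len)  { log_event(ctx, "text", t, len); }
```

Feeding it a nested snippet yields the events in document order, which is all a SAX consumer ever gets to see.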

NSXMLParser is a SAX parser for well-formed XML. That means you can use it only on XML; it cannot deal with the nature of HTML, where some elements do not require a closing tag. But if you look at the delegate protocol and compare it to the SAX interface of libxml2 you find an eerie resemblance. It might well be that NSXMLParser is just a wrapper around the XML SAX interface of libxml2.

As we learned previously, libxml2 also possesses an HTML parser which reuses most of the same internal data structures as its XML cousin. So we shall endeavor to create DTHTMLParser as a SAX parser for HTML.

Don’t worry about piecing together all the source code from this article; I will put that online in my DTFoundation project. Instead I want you to follow my lead and understand the pieces.

The libxml2 SAX Interface

The SAX interface works such that you register a handler function – in C – for each SAX event that you are interested in. This is done via a C-structure of type htmlSAXHandler which is nothing more than an aggregate of multiple C-function pointers.

In the libxml headers you often see some “html…” data structure typedef’d into one with “xml” as prefix. The reason for this is that it reuses whatever it can from the XML parser. But this also adds confusion – as I found through my experimenting – because some of the events that make sense for parsing XML don’t fire for HTML, like for example the one that reports CDATA elements. HTML does not have those, but the function pointer is still present in the htmlSAXHandler struct.

By CMD+Clicking on a type we can swim upstream to find the original definition of htmlSAXHandler:

htmlSAXHandler = xmlSAXHandler = _xmlSAXHandler =

struct _xmlSAXHandler {
    internalSubsetSAXFunc internalSubset;
    isStandaloneSAXFunc isStandalone;
    hasInternalSubsetSAXFunc hasInternalSubset;
    hasExternalSubsetSAXFunc hasExternalSubset;
    resolveEntitySAXFunc resolveEntity;
    getEntitySAXFunc getEntity;
    entityDeclSAXFunc entityDecl;
    notationDeclSAXFunc notationDecl;
    attributeDeclSAXFunc attributeDecl;
    elementDeclSAXFunc elementDecl;
    unparsedEntityDeclSAXFunc unparsedEntityDecl;
    setDocumentLocatorSAXFunc setDocumentLocator;
    startDocumentSAXFunc startDocument;
    endDocumentSAXFunc endDocument;
    startElementSAXFunc startElement;
    endElementSAXFunc endElement;
    referenceSAXFunc reference;
    charactersSAXFunc characters;
    ignorableWhitespaceSAXFunc ignorableWhitespace;
    processingInstructionSAXFunc processingInstruction;
    commentSAXFunc comment;
    warningSAXFunc warning;
    errorSAXFunc error;
    fatalErrorSAXFunc fatalError; /* unused error() get all the errors */
    getParameterEntitySAXFunc getParameterEntity;
    cdataBlockSAXFunc cdataBlock;
    externalSubsetSAXFunc externalSubset;
    unsigned int initialized;
    /* The following fields are extensions available only on version 2 */
    void *_private;
    startElementNsSAX2Func startElementNs;
    endElementNsSAX2Func endElementNs;
    xmlStructuredErrorFunc serror;
};

All these function pointer types are aptly named SomethingFunc, and we’ll get to what these need to look like in a minute. The problem for us to solve is how to bridge from our Objective-C paradigm (using self, properties, IVARs) into the C-function paradigm (no access to our parser object).
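If function pointer typedefs are new to you, the following plain C sketch shows the shape of such a SomethingFunc type and of a struct that aggregates them. The demo-prefixed names are stand-ins I invented so the snippet compiles on its own; the real definitions live in the libxml2 headers. It also shows the convention that a NULL slot means nobody wants that event.

```c
#include <stddef.h>

typedef unsigned char demoChar;   /* stand-in for xmlChar */

/* A "SomethingFunc" type is a typedef for a pointer to a function
   with one particular signature. */
typedef void (*demoStartDocumentFunc)(void *ctx);
typedef void (*demoCommentFunc)(void *ctx, const demoChar *value);

/* The handler is nothing more than a bundle of such pointers. */
typedef struct {
    demoStartDocumentFunc startDocument;
    demoCommentFunc comment;
} DemoHandler;

/* Any function with a matching signature can be stored in a slot. */
static void count_comment(void *ctx, const demoChar *value)
{
    (void)value;              /* the text is not needed for counting */
    (*(int *)ctx)++;
}

/* Parser side: a NULL slot means nobody is interested in the event.
   Returns 1 if a function was called, 0 if the event was skipped. */
int demo_emit_comment(const DemoHandler *h, void *ctx, const demoChar *value)
{
    if (h->comment == NULL) return 0;
    h->comment(ctx, value);
    return 1;
}
```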

This problem is compounded by the fact that we are building this with ARC so that we don’t have to deal with memory management any more. Our resulting class will run on iOS 4 and above. Somehow we need to pass enough information to each event-function so that we can communicate with the parser object.

There are two reasonable possibilities: either we pass a reference to the parser instance or we pass a pointer to a struct that contains multiple values that we might need. I decided for the first because it seems cleaner to me.

libxml2 provides the concept of a user_data pointer that is passed to the event-functions together with other relevant information. We can use (abuse?) this for passing a reference to self. ARC requires this to be cast to (__bridge void *), which in plain English means: “please treat this as just some typeless memory pointer and also don’t worry about retaining or releasing. I’ll take care of that, don’t panic.”
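Stripped of ARC and Objective-C, the user_data round trip looks roughly like this in plain C. DemoParser and demo_run are hypothetical stand-ins for the parser object and for the library that stores the pointer; the only thing the sketch is meant to show is that the pointer goes in untyped and is cast back inside the callback.

```c
/* Hypothetical stand-in for the object that owns the callbacks. */
typedef struct {
    int documents_started;
} DemoParser;

/* The callback only receives an untyped pointer ... */
static void demo_start_document(void *context)
{
    DemoParser *myself = (DemoParser *)context;   /* ... and casts it back */
    myself->documents_started++;
}

/* Stand-in for the library: it keeps the pointer opaque and hands it
   back unchanged whenever it fires an event. */
void demo_run(void (*callback)(void *), void *user_data)
{
    callback(user_data);
}
```

The library never looks inside the pointer, which is why it is safe to smuggle any object through it as long as that object outlives the parsing run.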

Like any self-respecting class we’ll start out with a good block of IVAR definitions and an init method.

@implementation DTHTMLParser
{
    htmlSAXHandler _handler;
 
    NSStringEncoding _encoding;
    NSData *_data;
 
    __unsafe_unretained id <DTHTMLParserDelegate> _delegate;
    htmlParserCtxtPtr _parserContext;
}

Notice that I’m defining the IVARs in the implementation, not the interface (in the header). This is possible with the new Apple LLVM compiler and provides a nice way to hide the IVARs that we don’t want to be visible.

- (id)initWithData:(NSData *)data encoding:(NSStringEncoding)encoding
{
	self = [super init];
	if (self)
	{
		_data = data;
		_encoding = encoding;
 
		xmlSAX2InitHtmlDefaultSAXHandler(&_handler);
	}
 
	return self;
}

We save the data and the encoding into our IVARs, _data being a strong reference. That means the parser retains the data during its lifetime and only releases it when the parser goes away.

We also initialize the _handler structure via the xmlSAX2InitHtmlDefaultSAXHandler function. This sets up some initial internals and also chooses the SAX2 method for our work. There’s also an older SAX1 method, but we obviously prefer the newer one. This setup has to happen inside the init method because right after an instance of DTHTMLParser is created we want to be able to set the delegate, which should take care of wiring up the needed functions.

Each event-function should have a corresponding delegate method in our DTHTMLParserDelegate protocol, but to avoid unnecessary function-calling we only add those functions to the handler for which the delegate actually implements the corresponding method.

#pragma mark Properties
 
- (void)setDelegate:(__unsafe_unretained id<DTHTMLParserDelegate>)delegate;
{
	if (delegate != _delegate)
	{
		_delegate = delegate;
 
		if ([_delegate respondsToSelector:@selector(parserDidStartDocument:)])
		{
			_handler.startDocument = _startDocument;
		}
		else
		{
			_handler.startDocument = NULL;
		}
 
		if ([_delegate respondsToSelector:@selector(parserDidEndDocument:)])
		{
			_handler.endDocument = _endDocument;
		}
		else
		{
			_handler.endDocument = NULL;
		}
// ...

This way, when you set the delegate, it is checked one method at a time whether it responds to the methods of the protocol. Whenever it does, the corresponding function pointer is set in the _handler IVAR; otherwise the pointer is set to NULL.

At the top of the file we need to define these functions. If we do just that, the compiler will warn us for each of these implementations that there was “no previous prototype”. In C, a function’s prototype is its declaration line, stating return type, name and parameters, without a body. So we just add an extra block for these.

#pragma mark Event function prototypes
 
void _startDocument(void *context);
void _endDocument(void *context);
void _startElement(void *context, const xmlChar *name,const xmlChar **atts);
void _endElement(void *context, const xmlChar *name);
void _characters(void *context, const xmlChar *ch, int len);
void _comment(void *context, const xmlChar *value);
void _error(void *context, const char *msg, ...);

Wherever you see a context parameter, this will be the user_data pointer we provide. The _startElement function, for example, gets a name and an array of attributes as well. The xmlChar type is just an unsigned char. Strings are always provided UTF8-encoded, regardless of how the data is encoded.
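One detail in these prototypes deserves attention: _characters receives a length next to the pointer, and the buffer should not be assumed to be NUL-terminated. A minimal sketch of handling that in plain C follows; demoChar and demo_copy_characters are names invented for this example.

```c
#include <stdlib.h>
#include <string.h>

typedef unsigned char demoChar;   /* stand-in for xmlChar */

/* Copies exactly len bytes and appends the terminator that the incoming
   buffer is not guaranteed to have. The caller frees the result. */
char *demo_copy_characters(const demoChar *ch, int len)
{
    if (ch == NULL || len < 0) return NULL;

    char *copy = malloc((size_t)len + 1);
    if (copy == NULL) return NULL;

    memcpy(copy, ch, (size_t)len);
    copy[len] = '\0';
    return copy;
}
```

In the Objective-C wrapper the same idea would be expressed with -[NSString initWithBytes:length:encoding:] and NSUTF8StringEncoding instead of a manual copy.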

Let’s look first at an easy event-function and then at a more complex one.

void _startDocument(void *context)
{
    DTHTMLParser *myself = (__bridge DTHTMLParser *)context;
 
    [myself.delegate parserDidStartDocument:myself];
}

You see how I cast the void pointer back into a pointer of type DTHTMLParser. I name the variable something other than “self” because a plain C function has no implicit self, and reusing the name would only invite confusion. Then I can simply call the delegate method that corresponds to this event function.

By the way, these are the delegate methods we want to implement:

@class DTHTMLParser;
 
@protocol DTHTMLParserDelegate <NSObject>
 
@optional
- (void)parserDidStartDocument:(DTHTMLParser *)parser;
- (void)parserDidEndDocument:(DTHTMLParser *)parser;
- (void)parser:(DTHTMLParser *)parser didStartElement:(NSString *)elementName attributes:(NSDictionary *)attributeDict;
- (void)parser:(DTHTMLParser *)parser didEndElement:(NSString *)elementName;
- (void)parser:(DTHTMLParser *)parser foundCharacters:(NSString *)string;
- (void)parser:(DTHTMLParser *)parser foundComment:(NSString *)comment;
- (void)parser:(DTHTMLParser *)parser parseErrorOccurred:(NSError *)parseError;
 
@end

The @class is necessary before the protocol definition because a good delegate protocol also passes a reference to the caller. This also enables the casual reader of your code to know at a glance that this method belongs to a protocol connected to this class.
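The error event is the odd one out among the event functions because its C signature is variadic: it receives a printf-style format string followed by arguments. Here is a minimal sketch of how such a function might collapse that into one message, in plain C. DemoErrorSink is a hypothetical stand-in for the parser object, and stripping a trailing newline is my assumption about how messages may arrive, not something the source guarantees.

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for whatever object collects the message. */
typedef struct { char last_error[256]; } DemoErrorSink;

/* Same shape as the _error prototype: context, format, arguments. */
void demo_error(void *context, const char *msg, ...)
{
    DemoErrorSink *sink = (DemoErrorSink *)context;

    va_list args;
    va_start(args, msg);
    vsnprintf(sink->last_error, sizeof sink->last_error, msg, args);
    va_end(args);

    /* Assumption: drop one trailing newline, if there is one. */
    size_t n = strlen(sink->last_error);
    if (n > 0 && sink->last_error[n - 1] == '\n') {
        sink->last_error[n - 1] = '\0';
    }
}
```

In the wrapper, the finished string would then be packed into an NSError and handed to parser:parseErrorOccurred:.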

Now for something more complex, let’s look at _startElement, the function behind didStartElement. The complexity results from the atts parameter being an array of alternating key, value, key, value etc.

void _startElement(void *context, const xmlChar *name,const xmlChar **atts)
{
	DTHTMLParser *myself = (__bridge DTHTMLParser *)context;
 
	NSString *nameStr = [NSString stringWithUTF8String:(char *)name];
 
	NSMutableDictionary *attributes = nil;
 
	if (atts)
	{
		NSString *key = nil;
		NSString *value = nil;
 
		attributes = [[NSMutableDictionary alloc] init];
 
		int i=0;
		while (1)
		{
			char *att = (char *)atts[i++];
 
			if (!key)
			{
				if (!att)
				{
					// we're done
					break;
				}
 
				key = [NSString stringWithUTF8String:att];
			}
			else
			{
				if (att)
				{
					value = [NSString stringWithUTF8String:att];
				}
				else
				{
					// solo attribute
					value = key;
				}
 
				[attributes setObject:value forKey:key];
 
				value = nil;
				key = nil;
			}
		}
	}
 
	[myself.delegate parser:myself didStartElement:nameStr attributes:attributes];
}

The while loop walks through the attributes and, when a NULL is encountered where a key would be, breaks out of the loop. It is also necessary to deal with the situation where there is only an attribute name but no value, because HTML allows that. Theoretically an attribute could also occur multiple times, but we ignore this. For later convenience we want the key/values for the attributes in an NSDictionary.
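The same walk can be written more compactly by stepping through the array two slots at a time. The sketch below does that in plain C for a single lookup. It assumes the layout described above, namely that every name is followed by a value slot which may be NULL and that a NULL name ends the array; demoChar and demo_attribute_value are invented for the example.

```c
#include <stddef.h>
#include <string.h>

typedef unsigned char demoChar;   /* stand-in for xmlChar */

/* Looks up one attribute in a NULL-terminated array of alternating
   name/value pointers. A NULL value marks a valueless attribute, for
   which the name itself is returned, mirroring the "solo attribute"
   choice in the loop above. Returns NULL if the attribute is absent. */
const char *demo_attribute_value(const demoChar **atts, const char *wanted)
{
    if (atts == NULL) return NULL;

    for (size_t i = 0; atts[i] != NULL; i += 2) {
        const char *name = (const char *)atts[i];
        const char *value = (const char *)atts[i + 1];

        if (strcmp(name, wanted) == 0) {
            return value ? value : name;
        }
    }
    return NULL;
}
```

Unlike the dictionary, this returns the first match when an attribute occurs more than once, whereas setObject:forKey: keeps the last one.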

The final piece is how the parsing actually is started. For this we have a parse method, just like NSXMLParser.

- (BOOL)parse
{
	const void *dataBytes = [_data bytes]; // keep the const that -bytes returns
	int dataSize = (int)[_data length]; // libxml2 takes an int, so narrow explicitly
 
	// detect encoding if necessary
	xmlCharEncoding charEnc = 0;
 
	if (!_encoding)
	{
		charEnc = xmlDetectCharEncoding(dataBytes, dataSize);
	}
	else
	{
		// convert the encoding
		// TODO: proper mapping from _encoding to xmlCharEncoding
		CFStringEncoding cfenc = CFStringConvertNSStringEncodingToEncoding(_encoding);
		CFStringRef cfencstr = CFStringConvertEncodingToIANACharSetName(cfenc);
		const char *enc = CFStringGetCStringPtr(cfencstr, 0);
 
		charEnc = xmlParseCharEncoding(enc);
	}
 
	// create a parse context
	_parserContext = htmlCreatePushParserCtxt(&_handler, (__bridge void *)self, 
		dataBytes, dataSize, NULL, charEnc);
 
	// set some options
	htmlCtxtUseOptions(_parserContext, HTML_PARSE_RECOVER | HTML_PARSE_NONET |
		HTML_PARSE_COMPACT);
 
	// parse!
	int result = htmlParseDocument(_parserContext);
 
	return (result==0);
}

There is an xmlDetectCharEncoding function in libxml2 that tries to detect the encoding of the data, which we invoke if the passed encoding is 0. The way the NSStringEncoding is converted into the xmlCharEncoding is only makeshift and needs some improving, possibly by simply having a big switch statement. Maybe later.
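For what it is worth, such a switch could take roughly the following shape. This is only a sketch: the two enums are stand-ins with arbitrary values so that it compiles without Foundation or libxml2. In the real wrapper the case labels would be constants like NSUTF8StringEncoding and the results values like XML_CHAR_ENCODING_UTF8, and the list of cases would be longer.

```c
/* Stand-ins so the sketch compiles on its own; values are arbitrary. */
typedef enum {
    DEMO_NS_ASCII,
    DEMO_NS_UTF8,
    DEMO_NS_LATIN1,
    DEMO_NS_OTHER
} DemoStringEncoding;

typedef enum {
    DEMO_XML_NONE,
    DEMO_XML_UTF8,
    DEMO_XML_8859_1,
    DEMO_XML_ASCII
} DemoCharEncoding;

/* Maps the encodings we know about; everything else falls back to
   "none" so that detection can take over. */
DemoCharEncoding demo_map_encoding(DemoStringEncoding encoding)
{
    switch (encoding) {
        case DEMO_NS_UTF8:   return DEMO_XML_UTF8;
        case DEMO_NS_LATIN1: return DEMO_XML_8859_1;
        case DEMO_NS_ASCII:  return DEMO_XML_ASCII;
        default:             return DEMO_XML_NONE;
    }
}
```

Compared to the round trip through the IANA charset name, a switch has no string lookups that can fail at runtime, at the price of having to list every encoding by hand.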

There are multiple ways to parse a document; some take individual parameters, some take an HTML parsing context. I settled on the push interface because there the htmlParseDocument function does not return a pointer to an HTML document tree (DOM), but instead a result that is 0 for success or -1 for error.

It is my understanding that this variant does not actually build the entire tree but keeps just enough state information to track the current hierarchy level inside the HTML tree. The push parser variant also has the ability to parse chunks of HTML data as they become available, e.g. while downloading. But exploring that is outside our scope for now.
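To illustrate what “just enough state” can mean for chunked input, here is a toy model in plain C. It has nothing to do with how libxml2 works internally; ToyPushState and toy_push_chunk are invented names. It merely counts completed tags and remembers between calls whether it is currently inside one, so a tag that is split across two chunks is still counted exactly once.

```c
#include <stddef.h>

/* The entire memory of the toy parser between two chunks. */
typedef struct {
    int in_tag;      /* 1 while between '<' and the matching '>' */
    int tags_seen;   /* number of completed tags so far */
} ToyPushState;

/* Consumes one chunk; may be called again with the next chunk as soon
   as more data has arrived. */
void toy_push_chunk(ToyPushState *s, const char *chunk, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (!s->in_tag && chunk[i] == '<') {
            s->in_tag = 1;
        } else if (s->in_tag && chunk[i] == '>') {
            s->in_tag = 0;
            s->tags_seen++;
        }
    }
}
```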

Conclusion

Once your eyes have gotten used to the coding style of libxml2 you can start to appreciate the plethora of functions that it provides. You can also peruse the official reference manual online.

Finally I invite your collaboration on using and helping to polish DTHTMLParser if you have any need for parsing HTML text from within your iOS apps. Look for it on the GitHub DTFoundation project as soon as I can add a bit of AppleDoc-style documentation to it.
