
HTML Entities

Similar to the previous article on decoding HTML colors, we also need to decode HTML entities like &quot;. I found an authoritative list on the web, courtesy of Wikipedia, and in this article I will demonstrate how to use a quickly hacked-up command line tool to convert it into Objective-C.

This shall serve as an example of how quickly you can leverage your Objective-C knowledge to build a useful tool for such a one-off operation. If you know how to reuse your iOS development skills for command line tools, you can always whip up a small tool to do work that you would otherwise have had to do manually in a text editor.

We start by creating a new project using the “Command Line Tool” template. For Type we use “Foundation” because this gives us access to most of the functions we are already using on a daily basis.

I’ve already saved the source code of the entities page to my disk, so let’s have a look at it.

<tr>
<td>quot</td>
<td>"</td>
<td>U+0022 (34)</td>
<td>XML 1.0</td>
<td><i>(double)</i> <a href="/wiki/Quotation_mark" title="Quotation mark">quotation mark</a></td>
</tr>

So for every row (TR) we have columns (TD) spelling out the entity name first, then the character itself, and in the third column the Unicode code point that we’re interested in. We want to find all the &quot; instances and replace them with \u0022.

So in our command line tool we make a read loop that reads the file, iterates over all lines and converts the first and third columns into a lookup dictionary. But first let’s add a bit of boilerplate code to get the command line argument and display usage instructions if no parameter was passed.

int main (int argc, const char * argv[]) {
    NSAutoreleasePool * pool = [[NSAutoreleasePool alloc] init];
 
	// argv[0] is name of executable itself
	NSString *appPath = [NSString stringWithCString:argv[0] encoding:NSUTF8StringEncoding];
 
	// cut away the path
	NSString *appName = [appPath lastPathComponent];
	NSString *workingPath = [appPath stringByDeletingLastPathComponent];
 
	if (argc!=2)
	{
		printf("Usage: %s HTML_FILE\n", [appName cStringUsingEncoding:NSUTF8StringEncoding]);
		exit(1);
	}
 
	// argv[1] is HTML file name
	NSString *htmlFile = [NSString stringWithCString:argv[1] encoding:NSUTF8StringEncoding];
 
	// add working path
	NSString *htmlPath = [workingPath stringByAppendingPathComponent:htmlFile];
 
	NSLog(@"%@", htmlPath);
 
	[pool drain];
    return 0;
}
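
The code above looks for the HTML file next to the executable. As a possible variation, here is a minimal sketch that resolves the argument against the current working directory instead, which is what most command line tools do. The helper name ResolvedInputPath is made up for this illustration and is not part of the original tool.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper, not part of the original tool: absolute paths are
// returned unchanged, relative ones are resolved against the directory
// the tool was started from.
static NSString *ResolvedInputPath(NSString *argument)
{
    if ([argument isAbsolutePath])
    {
        return argument;
    }

    NSString *currentDirectory = [[NSFileManager defaultManager] currentDirectoryPath];
    return [currentDirectory stringByAppendingPathComponent:argument];
}
```

With this, htmlPath could be set from ResolvedInputPath(htmlFile), and the tool would also accept a full path to the file.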

Since we want to be able to debug our command line tool from Xcode, we need to set a command line argument there so that we get past the check for exactly 2 parameters. The first parameter is always the path to the program itself; the second one should be the name of the HTML file.

If we now Build&Debug then we won’t see the usage instructions, instead we’ll see the path where the tool expects to find the HTML file: /Users/Oliver/Desktop/htmlentities/build/Debug/entities.html

OK, let’s put it there and then we can start with the actual work of reading and parsing the file.

	// load file
   NSError *error = nil;
	NSString *fileContents = [NSString stringWithContentsOfFile:htmlPath 
                                                      encoding:NSUTF8StringEncoding
                                                         error:&error];
 
   // fail if nothing was read
   if (!fileContents)
   {
      NSLog(@"Error reading file: %@", [error localizedDescription]);
      exit(1);
   }
 
 
   // split into lines
   NSArray *lines = [fileContents componentsSeparatedByString:@"\n"];
 
   NSMutableArray *currentRow = nil;
   NSMutableArray *outputRows = [NSMutableArray array];
 
   // loop
   for (NSString *oneLine in lines)
   {
       if ([oneLine hasPrefix:@"<tr>"])
      {
        currentRow = [NSMutableArray array];
      }
      else if ([oneLine hasPrefix:@"<td>"])
      {
         NSScanner *scanner = [NSScanner scannerWithString:oneLine];
 
         // skip opening tag
         [scanner scanString:@"<td>" intoString:NULL];
 
         // scan until closing tag
         NSString *tagContents = nil;
         if ([scanner scanUpToString:@"</td>" intoString:&tagContents])
         {
            // text in between is added to the current row array
            [currentRow addObject:tagContents];
         }
      }
      else if ([oneLine hasPrefix:@"</tr>"])
      {
         if ([currentRow count]==7)
         {
            [outputRows addObject:currentRow];
         }
         currentRow = nil;
      }
   }
 
   NSLog(@"%@", outputRows);

Interestingly – since we preserved UTF-8 throughout – the second column of the rows already seems to be the appropriate unicode character. The third one is the hex and dec code. The first column is the entity name.

Now some META thinking is required. What should the code that this tool outputs look like? I’d say, just like with the colors, we want one big dictionary because lookups there go via a hash. Now you’re getting the benefit of ME doing the hard work. I wasted an hour on finding out how to properly encode single byte characters. Turns out that you cannot encode these like the rest with \u0022, but instead have to use the hex number encoding \x22. Or else Xcode will complain that this is not a valid Unicode character. The reason is that C99 does not allow the \u form for most code points below U+00A0, which is why the code below switches to \x for those.

  // output code to copy/paste
   NSMutableString *outputString = [NSMutableString string];
 
   [outputString appendString:@"entityLookup = [[NSDictionary alloc] initWithObjectsAndKeys:"];
 
   for (NSArray *oneRow in outputRows)
   {
      // object is the unicode character
      NSString *byte1 = [[[oneRow objectAtIndex:2] substringWithRange:NSMakeRange(2, 2)]
            lowercaseString];
      NSString *byte2 = [[[oneRow objectAtIndex:2] substringWithRange:NSMakeRange(4, 2)] 
            lowercaseString];
      NSString *sequ = nil;
 
      if ([byte1 isEqualToString:@"00"] && [byte2 compare:@"a0"] == NSOrderedAscending)
      {
         // one byte unicode we encode as \x12
         sequ = [NSString stringWithFormat:@"@\"\\x%@\"", byte2];
      }
      else
      {
         // two byte unicode we spell as \u1234
         sequ = [NSString stringWithFormat:@"@\"\\u%@%@\"", byte1, byte2];
      }
 
 
      [outputString appendString:sequ];
 
 
      // key is the entity name
      [outputString appendFormat:@", @\"%@\",\n", [oneRow objectAtIndex:0]];
   }
 
   [outputString appendString:@"nil];\n"];
 
 
   // output the entire blob
   NSLog(@"%@", outputString);
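
The slicing above relies on the hex digits sitting at fixed character positions in the third column. As an alternative sketch, the code point could be parsed into a number first and the literal built from that. Both function names are invented for this example; it assumes the column always looks like "U+0022 (34)".

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: reads the code point from a column such as
// "U+0022 (34)". Returns -1 if the text does not start with "U+"
// followed by hex digits.
static long CodePointFromColumn(NSString *column)
{
    NSScanner *scanner = [NSScanner scannerWithString:column];
    unsigned int value = 0;

    if ([scanner scanString:@"U+" intoString:NULL] && [scanner scanHexInt:&value])
    {
        return (long)value;
    }
    return -1;
}

// Hypothetical helper: builds the source text of an NSString literal for
// one code point, for example @"\x22" or @"\u00a0".
static NSString *LiteralForCodePoint(unsigned int codePoint)
{
    if (codePoint < 0xA0)
    {
        // below U+00A0 the \u form is rejected, so use a hex escape
        return [NSString stringWithFormat:@"@\"\\x%02x\"", codePoint];
    }
    if (codePoint <= 0xFFFF)
    {
        // four hex digits are enough inside the Basic Multilingual Plane
        return [NSString stringWithFormat:@"@\"\\u%04x\"", codePoint];
    }
    // beyond that the eight-digit \U form is needed
    return [NSString stringWithFormat:@"@\"\\U%08x\"", codePoint];
}
```

One thing to keep in mind with \x: a hex escape swallows every hex digit that follows it, so it is only safe here because each generated literal contains nothing but the one escaped character.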

Now with this dictionary we can build some scanning code to walk through a string and take care of the replacements. There’s one special case: some characters might also be encoded numerically, with &# and a decimal code as in &#8364;, instead of by name as in &euro;.

static NSDictionary *entityLookup = nil;
 
- (NSString *)stringByReplacingHTMLEntities
{
   if (!entityLookup)
   {
      entityLookup = [[NSDictionary alloc] initWithObjectsAndKeys:@"\x22", @"quot",
                      @"\x26", @"amp",
                      @"\x27", @"apos",
                      @"\x3c", @"lt",
                      @"\x3e", @"gt",
                      @"\u00a0", @"nbsp",
                      @"\u00a1", @"iexcl",
                      @"\u00a2", @"cent",
                      @"\u00a3", @"pound",
                      @"\u00a4", @"curren",
                      @"\u00a5", @"yen",
// [...]
                      @"\u2660", @"spades",
                      @"\u2663", @"clubs",
                      @"\u2665", @"hearts",
                      @"\u2666", @"diams",
                      nil];
   }
 
   NSScanner *scanner = [NSScanner scannerWithString:self];
   [scanner setCharactersToBeSkipped:nil];
 
   NSMutableString *output = [NSMutableString string];
 
   while (![scanner isAtEnd])
   {
      NSString *scanned = nil;
 
      if ([scanner scanUpToString:@"&" intoString:&scanned])
      {
         [output appendString:scanned];
      }
 
      if ([scanner scanString:@"&" intoString:NULL])
      {
         NSString *afterAmpersand = nil;
         if ([scanner scanUpToString:@";" intoString:&afterAmpersand])
         {
            if ([scanner scanString:@";" intoString:NULL])
            {
               if ([afterAmpersand hasPrefix:@"#"] && [afterAmpersand length]<6)
               {
                  // numeric entity like &#8364;
                  NSInteger i = [[afterAmpersand substringFromIndex:1] integerValue];
                  [output appendFormat:@"%C", (unichar)i];
               }
               else 
               {
                  NSString *converted = [entityLookup objectForKey:afterAmpersand];
 
                  if (converted)
                  {
                     [output appendString:converted];
                  }
                  else 
                  {
                     // not a valid sequence
                     [output appendString:@"&"];
                     [output appendString:afterAmpersand];
                     [output appendString:@";"];
                  }
               }
            }
            else 
            {
               // no semicolon 
               [output appendString:@"&"];
               [output appendString:afterAmpersand];
            }
         }
         else
         {
            // nothing scanned after the ampersand, keep it as it is
            [output appendString:@"&"];
         }
      }
   }
   return [NSString stringWithString:output];
}
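
The method above only accepts decimal references with up to four digits. HTML also allows a hexadecimal form such as &#x20AC;. A minimal sketch of a helper that handles both forms could look like this. The function name is hypothetical, and it deliberately gives up on anything that does not fit into a single unichar.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: takes the text between "&" and ";", for example
// "#8364" or "#x20AC", and returns the character as a string. Returns nil
// if the text is not a numeric reference that fits into one unichar.
static NSString *StringForNumericEntity(NSString *afterAmpersand)
{
    if (![afterAmpersand hasPrefix:@"#"] || [afterAmpersand length] < 2)
    {
        return nil;
    }

    NSString *digits = [afterAmpersand substringFromIndex:1];
    BOOL isHex = [digits hasPrefix:@"x"] || [digits hasPrefix:@"X"];
    if (isHex)
    {
        digits = [digits substringFromIndex:1];
    }

    NSScanner *scanner = [NSScanner scannerWithString:digits];
    [scanner setCharactersToBeSkipped:nil];

    unsigned int value = 0;
    BOOL ok = NO;

    if (isHex)
    {
        ok = [scanner scanHexInt:&value];
    }
    else
    {
        int decimal = 0;
        ok = [scanner scanInt:&decimal] && decimal > 0;
        value = (unsigned int)decimal;
    }

    // reject trailing text, zero, surrogates and anything above U+FFFF
    if (!ok || ![scanner isAtEnd] || value == 0 || value > 0xFFFF
        || (value >= 0xD800 && value <= 0xDFFF))
    {
        return nil;
    }

    unichar character = (unichar)value;
    return [NSString stringWithCharacters:&character length:1];
}
```

In stringByReplacingHTMLEntities this could be tried first for anything starting with #, falling back to the unchanged text when it returns nil.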

I omitted part of the dictionary for brevity; you can find the full code in my GitHub repo, where I’m also developing my NSAttributedStrings HTML Additions.

To test it, we just do:

   NSLog(@"%@", [@"A &spades; & to demonstrate; an &invalid; sequence; &Some text; &#8364;"
                           stringByReplacingHTMLEntities]);

Categories: Recipes

7 Comments »

  1. [appName cStringUsingEncoding:NSUTF8StringEncoding] should be written as [appName UTF8String]

  2. Will this work with strong and em instead of their semantically ambiguous counterparts b and i? If so, I would greatly encourage their use to curb the proliferation of b and i as they are bad for the visually impaired, who otherwise enjoy a fantastic user experience on iOS. Although if a developer intends to just add stylistic separation with no semantic implications, b and i should be ok. I hope Apple eventually supports VoiceOver for these strings.

  3. hoo, boy, the em and strong tags completely made that comment fall apart. Sorry, I was not aware that would happen…

  4. I removed the angle brackets for you. In NSAttributedString+HTML I implemented b = strong and i = em.

  5. Good to know. Appreciate the comment fix.

  6. It generates an error when building in Xcode:

    [BEROR]error: There is no SDK with the name or path ‘iphoneos’

  7. Then set it to a valid SDK.