Decompressing Files into Memory

As a hobby project I am working on uncovering hidden treasures that exist on all your iOS devices. Hidden, because there is no Objective-C API for them; existing, because Apple includes a great deal of open source libraries in iOS, compiled as dynamic libraries.

You can see which ones exist by checking which dylibs are listed under “Link Binary with Libraries”. Most entries beginning with lib and ending with dylib can be used. Some people have reported getting rejected for adding the static variants of libraries like libxslt or libarchive, but that’s probably because Apple sees these symbols as duplicates of the ones contained in the dynamic libraries.

We previously looked at libxml2 for parsing HTML (and part 2); today we’ll familiarize ourselves with zlib for decompressing .gz and .zip files.

The first time I came in contact with decompressing files was in my MyAppSales open source project. There I was scraping iTunes Connect and the downloaded files were compressed in ZIP format. The new unofficial iTunes Connect API compresses the daily and weekly reports in GZIP format instead, so the ZIP-only code no longer works. This is why I set out to find a solution that works for both.

There are several compression schemes out there, but the two most prevalent are PKZIP (as popularized by WinZIP) and GZIP (as in GNU Zip). Both typically use the same DEFLATE compression and differ mainly in how they wrap the result. GZIP only supports a single file, whereas PKZIP adds headers and an index of the included files with pointers to the locations of the corresponding deflated chunks.

In Memory Decompression: GZIP

zlib by itself deals with zlib-compressed (“deflated”) streams; GZIP in turn adds a minimal header to this deflated content. Decompressing streams with pure zlib compression or with a GZIP header is relatively straightforward. Here’s the method I gleaned (and cleaned up) from CocoaDev’s NSData category.

NSUInteger dataLength = [_data length];
NSUInteger halfLength = dataLength / 2;
 
NSMutableData *decompressed = [NSMutableData dataWithLength: dataLength + halfLength];
BOOL done = NO;
int status;
 
z_stream strm;
strm.next_in = (Bytef *)[_data bytes];
strm.avail_in = (uInt)dataLength;
strm.total_out = 0;
strm.zalloc = Z_NULL;
strm.zfree = Z_NULL;
strm.opaque = Z_NULL; // must also be initialized before inflateInit2
 
// inflateInit2 knows how to deal with gzip format
if (inflateInit2(&strm, (15+32)) != Z_OK)
{
	return;
}
 
while (!done)
{
	// extend decompressed if too short
	if (strm.total_out >= [decompressed length])
	{
		[decompressed increaseLengthBy: halfLength];
	}
 
	strm.next_out = (Bytef *)[decompressed mutableBytes] + strm.total_out;
	strm.avail_out = (uInt)[decompressed length] - (uInt)strm.total_out;
 
	// Inflate another chunk.
	status = inflate (&strm, Z_SYNC_FLUSH);
 
	if (status == Z_STREAM_END)
	{
		done = YES;
	}
	else if (status != Z_OK)
	{
		break;	
	}
}
 
if (inflateEnd (&strm) != Z_OK || !done)
{
	return;
}
 
// set actual length
[decompressed setLength:strm.total_out];

This method works by setting up a z_stream struct with a pointer to, and the length of, the compressed bytes. Then it initializes the decompressor with inflateInit2. The trailing 2 is important: unlike plain inflateInit, this variant takes a windowBits argument, and passing 15+32 tells zlib to detect automatically whether the stream has a zlib or a GZIP header. The decompression occurs by calling inflate repeatedly until it returns Z_STREAM_END. Finally the decompressor is freed by calling inflateEnd.

I liked this approach because it grows the output mutable data object by half of the compressed data size whenever it runs out of room. This is far more efficient than appending each decompressed chunk individually, which would force the data object to keep reallocating larger blocks of memory and copying the contents. At the end it uses setLength to trim the buffer to the actual length of the data.

In Memory Decompression: PKZIP

You can tell GZIP and PKZIP files apart by inspecting the first two bytes. If these are ‘PK’ then you have a ZIP file; a GZIP file starts with the bytes 0x1f 0x8b.
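
As a rough illustration of that check, here is a minimal C sketch. The archive_kind enum and the detect_archive_kind function are names I made up for this example; they are not part of zlib, Minizip or DTZipArchive. The sketch only looks at the first two bytes, so treat the result as a hint rather than a validation of the file.

```c
#include <stddef.h>

/* Hypothetical names, for illustration only. */
typedef enum {
    ARCHIVE_KIND_UNKNOWN = 0,
    ARCHIVE_KIND_GZIP,
    ARCHIVE_KIND_PKZIP
} archive_kind;

/* Guess the container format from the first two bytes of the data. */
archive_kind detect_archive_kind(const unsigned char *bytes, size_t length)
{
    if (bytes == NULL || length < 2)
    {
        return ARCHIVE_KIND_UNKNOWN;
    }

    /* PKZIP records start with the ASCII letters 'P' and 'K' */
    if (bytes[0] == 'P' && bytes[1] == 'K')
    {
        return ARCHIVE_KIND_PKZIP;
    }

    /* GZIP members start with the magic bytes 0x1f 0x8b */
    if (bytes[0] == 0x1f && bytes[1] == 0x8b)
    {
        return ARCHIVE_KIND_GZIP;
    }

    return ARCHIVE_KIND_UNKNOWN;
}
```

From Objective-C you could call this with the bytes and length of an NSData object and then pick the GZIP or the PKZIP code path accordingly.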

PKZIP adds a special header so that multiple deflated files can peacefully coexist in a single .ZIP file. Dealing with this header is quite tedious, so people are happy to use Minizip, another C library that wraps up this complexity. If you take it as it is there are some compiler warnings, so (lazy me) I used the cleaned-up version by Sam Soffes.

The Objective-C versions of decompressing PKZIP all seem to be more or less based on the ZipArchive project by “Aish”. You can tell that this is the case because they generally contain the same bug dealing with the file date of the zipped files. If you find a reference to Jan 1st 1980 in there, you know what I mean.

The – simplified – structure of dealing with a PKZIP in memory is this:

unsigned char buffer[BUFFER_SIZE] = {0};
 
// open the file for unzipping
unzFile _unzFile = unzOpen((const char *)[_path UTF8String]);
 
// return if failed
if (!_unzFile)
{
	return;
}
 
// get file info
unz_global_info  globalInfo = {0};
 
if (unzGetGlobalInfo(_unzFile, &globalInfo) != UNZ_OK)
{
	// there's a problem
	unzClose(_unzFile);
	return;
}
 
if (unzGoToFirstFile(_unzFile) != UNZ_OK)
{
	// unable to go to first file
	unzClose(_unzFile);
	return;
}
 
// enum block can stop loop
BOOL shouldStop = NO;
 
// iterate through all files
do 
{
	unz_file_info zipInfo ={0};
 
	if (unzOpenCurrentFile(_unzFile) != UNZ_OK)
	{
		// error uncompressing this file
		unzClose(_unzFile);
		return;
	}
 
	// first call for file info so that we know length of file name
	if (unzGetCurrentFileInfo(_unzFile, &zipInfo, NULL, 0, NULL, 0, NULL, 0) != UNZ_OK)
	{
		// cannot get file info
		unzCloseCurrentFile(_unzFile);
		unzClose(_unzFile);
		return;
	}
 
	// reserve space for file name
	char *fileNameC = (char *)malloc(zipInfo.size_filename+1);
 
	// second call to get actual file name	
	unzGetCurrentFileInfo(_unzFile, &zipInfo, fileNameC, zipInfo.size_filename + 1, NULL, 0, NULL, 0);
	fileNameC[zipInfo.size_filename] = '\0';
	NSString *fileName = [NSString stringWithUTF8String:fileNameC];
	free(fileNameC);
 
	NSMutableData *tmpData = [[NSMutableData alloc] init];
 
	int readBytes;
	while((readBytes = unzReadCurrentFile(_unzFile, buffer, BUFFER_SIZE)) > 0)
	{
		[tmpData appendBytes:buffer length:readBytes];
	}
 
	// decompressed file now in tmpData, name in fileName
 
	// close the current file
	unzCloseCurrentFile(_unzFile);
}
while (!shouldStop && unzGoToNextFile(_unzFile) == UNZ_OK);
 
// close the archive itself once all files are processed
unzClose(_unzFile);

Now about these file dates … For unfortunate historical reasons Microsoft didn’t think to include time zone support in DOS file dates. We will never know why. There are two ways a file date can be represented in PKZIPped files. The zipInfo header struct both contains a dosdate value as well as a tmu_date struct. The spec states that if the dosdate is 0 then the tmu_date is to be used, which has individual fields for hour, min, sec, day, month and year. But no time zone either.

Now the bug I alluded to above is to assume that, like on Unix, the dosdate is a time stamp, a number of seconds since a reference date. The implementations that get this wrong just assume that the dosdate is a number of seconds since the beginning of 1980. The problem is that there might still be some files out there that ONLY use the dosdate, so we cannot just ignore that and go with the tmu_date.

I found the spec for the DOS date hidden deep in Microsoft’s web. It simply packs all the date and time parts into 4 bytes (7+4+5+5+6+5 = 32 bits), using only as many bits as necessary for each value. And to save one bit on the seconds, these are divided by 2.

long l = 1078768689; // a dosdate, decodes to 2012-02-12 22:33:34
 
int year = ((l>>25)&127) + 1980;  // 7 bits
int month = (l>>21)&15;  // 4 bits
int day = (l>>16)&31; // 5 bits
int hour = (l>>11)&31; // 5 bits
int minute = (l>>5)&63;	// 6 bits	
int second = (l&31) * 2;  // 5 bits

Crazy, isn’t it? I can see how somebody might have assumed that this is a number of seconds since 1980 as the year value is at the highest order bits.
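
To make the bit layout easier to experiment with, here is a minimal, self-contained C sketch that unpacks a dosdate into its parts and packs the parts back again. The dos_datetime struct and both function names are my own inventions for illustration; they are not part of Minizip or DTZipArchive.

```c
#include <stdint.h>

/* Hypothetical helper type: a broken-down DOS date, always in local time. */
typedef struct {
    int year;    /* full year, 1980..2107 */
    int month;   /* 1..12 */
    int day;     /* 1..31 */
    int hour;    /* 0..23 */
    int minute;  /* 0..59 */
    int second;  /* even values only, stored halved */
} dos_datetime;

/* Unpack the 32-bit DOS date/time value into its parts. */
dos_datetime dos_datetime_unpack(uint32_t dosdate)
{
    dos_datetime d;
    d.year   = (int)((dosdate >> 25) & 127) + 1980; /* 7 bits */
    d.month  = (int)((dosdate >> 21) & 15);         /* 4 bits */
    d.day    = (int)((dosdate >> 16) & 31);         /* 5 bits */
    d.hour   = (int)((dosdate >> 11) & 31);         /* 5 bits */
    d.minute = (int)((dosdate >> 5) & 63);          /* 6 bits */
    d.second = (int)(dosdate & 31) * 2;             /* 5 bits */
    return d;
}

/* Pack the parts back together; an odd second loses its lowest bit. */
uint32_t dos_datetime_pack(dos_datetime d)
{
    return ((uint32_t)(d.year - 1980) << 25)
         | ((uint32_t)d.month << 21)
         | ((uint32_t)d.day << 16)
         | ((uint32_t)d.hour << 11)
         | ((uint32_t)d.minute << 5)
         | ((uint32_t)d.second / 2);
}
```

Feeding it the sample value 1078768689 yields 12 February 2012, 22:33:34, and packing those parts again reproduces the same number.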

If the dosdate is 0 then you can trust the tmu_date to contain the local time where the ZIP file was created. But I’ll leave this exercise to you. Just one hint: January is month 0.
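
If you want a starting point for that exercise, here is a minimal sketch in plain C. It assumes a struct with the same six fields that I believe Minizip's tm_unz carries (0-based month, full four-digit year); the zip_tm_fields type and the function name are hypothetical stand-ins so that the example compiles without Minizip.

```c
#include <string.h>
#include <time.h>

/* Hypothetical stand-in for the tmu_date fields (assumed layout). */
typedef struct {
    unsigned int tm_sec;
    unsigned int tm_min;
    unsigned int tm_hour;
    unsigned int tm_mday;
    unsigned int tm_mon;   /* months since January, 0..11 */
    unsigned int tm_year;  /* full year, e.g. 2012 */
} zip_tm_fields;

/* Convert the ZIP date fields into a standard struct tm (local time). */
struct tm zip_tm_to_struct_tm(zip_tm_fields z)
{
    struct tm t;
    memset(&t, 0, sizeof(t));

    t.tm_sec  = (int)z.tm_sec;
    t.tm_min  = (int)z.tm_min;
    t.tm_hour = (int)z.tm_hour;
    t.tm_mday = (int)z.tm_mday;
    t.tm_mon  = (int)z.tm_mon;          /* already 0-based, no adjustment */
    t.tm_year = (int)z.tm_year - 1900;  /* struct tm counts years from 1900 */
    t.tm_isdst = -1;                    /* let mktime() work out daylight saving */

    return t;
}
```

Passing the result to mktime() interprets it in the current local time zone of the device, which is exactly the time zone problem described above: that is only correct if the archive was created in the same zone.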

We usually don’t care about getting the exact file stamps correct. If you do, you need to either save the time zone inside the ZIP file as well, possibly as a plain text file, or, if you know that the files came from a certain server, assume that server’s time zone.

PKZIP also supports encrypting files with a password. If you never deal with such encrypted files you can disable crypt support by defining NOCRYPT and NOUNCRYPT. This omits the crypt code from the compiled binary. The header minizip/crypt.h claims:

The encryption/decryption parts of this source code (as opposed to the non-echoing password parts) were originally written in Europe. The whole source package can be freely distributed, including from the USA. (Prior to January 2000, re-export from the US was a violation of US law.)

But then again, the original encryption is relatively weak; several cracking tools exist which can brute-force it. There are two stronger encryption schemes introduced by WinZip and PKWARE PKZIP which are not even supported by minizip. So you are probably better off just omitting the encryption parts if you don’t want to jump through hoops presented by Apple or the US Government, or face the grief of unsupported encryption schemes.

Conclusion

The source code featured in this article is available as part of DTFoundation; check out the DTZipArchive class there.

There isn’t really much to it once you have gotten used to calling C functions. Unfortunately documentation or tutorials on the subject matter are pretty hard to come by. But thankfully, almost always, somebody has blazed the trail and provided something that we can pattern our approach after.

Especially in Unix circles there is a third decompression scheme that I neglected to mention: tar.gz. This works around the single-file limitation of GZIP by concatenating the files first and then deflating them. The methods to deal with these files are available in libarchive.dylib, available in the dylibs I mentioned above. I am looking for somebody who can confirm that this is indeed app-store-legal before I add support for tar.gz to DTZipArchive.

I would also be interested to hear from somebody using minizip in app store apps and whether the crypt code was omitted and/or the encryption exporting process was required.


6 Comments »

  1. “The zipInfo header struct both contains a dosdate value as well as a tmu_date struct”

    As the author of yet another zip framework (http://bitbucket.org/tristero/zipaccessframework/overview), I am curious. Can you point me at the documentation that describes the tmu_date field as part of the zip file format?

    As far as I know, that’s an artifact of the minizip library – that struct is *not* present in regular zip files, and you can’t “rely on it if the dosdate field is zero”, because the dosdate field is the source of all the tmu_ fields. (Look at lines 928-931 in unzip.c)

    As it stands, I don’t bother with that field at all – the ‘UT’ extension is the one I prefer to use, and all the archiving tools I’ve used populate it correctly.

    I want to keep my library as correct as possible, so any information you can share is much appreciated.

    Jeff.

  2. There is no such documentation that I could find, the only reference I found is on Apple’s Open Source site: http://opensource.apple.com/source/gcc/gcc-5367/zlib/contrib/minizip/miniunz.c?txt

    If your approach gets the correct dates then fine, with what I described you can also use the original dosdate value. I wonder, how do YOU deal with the absence of time zone information?

  3. I agree, I would have exactly the same problem with dosdate – that’s why I’ve ignored it.

    The UT date values seem fine to feed to [NSDate dateWithTimeIntervalSince1970:] which claims to interpret relative to a GMT epoch value (i.e., it *is* timezone specific).

    https://developer.apple.com/library/mac/#documentation/Cocoa/Reference/Foundation/Classes/NSDate_Class/Reference/Reference.html

    “dateWithTimeIntervalSince1970:
    Creates and returns an NSDate object set to the given number of seconds from the first instant of 1 January 1970, GMT.

    + (id)dateWithTimeIntervalSince1970:(NSTimeInterval)seconds
    Parameters
    seconds
    The number of seconds from the reference date, 1 January 1970, GMT, for the new date. Use a negative argument to specify a date before this date.
    Return Value
    An NSDate object set to seconds seconds from the reference date.”

    Now, it may be that I have been lucky in using zippers that correctly populate that field (the UT extension). That’s why I’m interested in any information to the contrary. In fact, the UT entry is supposed to hold one, two or three datestamps (mod/create/access) but I’ve never encountered a file with more than one (mod). I think it’s an optional thing that no-one bothers to implement.

  4. where is this “UT date” that you keep talking about?

  5. Zip File Format is officially documented in a file called APPNOTE.TXT

    Here’s a copy at PKWARE, which should be considered canon.

    http://www.pkware.com/documents/APPNOTE/APPNOTE-4.5.0.txt

    There is a section called “extra field” which starts:

    > This is for expansion. If additional information
    > needs to be stored for special needs or for specific
    > platforms, it should be stored here. Earlier versions
    > of the software can then safely skip this file, and
    > find the next file or header. This field will be 0
    > length in version 1.0.

    Zip file parsers can easily skip over the extra fields if they want to be lazy, but since that’s where Unix systems (and other architectures) hide extra information (like owner/group/datecreated/dateaccessed/etc), you shouldn’t.

    Each extension block begins with a 16-bit tag. For the Unix time information, the magic value 0x5455 corresponds to ‘UT’ in (little-endian) ASCII.

    If you look at my ZipFileAccess.m around line 367, you can see how I parse the various extension blocks that I know about. I don’t do them all, and it’s questionable as to how much value there would be since not many people use Acorns, or OS2, or BeOS.
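
    The lookup described in this comment can be sketched in a few lines of C. This is not the code from ZipFileAccess.m; it is a minimal illustration that assumes the commonly documented layout of the extra field (16-bit tag, 16-bit data size, then the data) and of the ‘UT’ block (one flags byte whose lowest bit signals that a 32-bit modification time follows). The function and helper names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Read little-endian values byte by byte, independent of host endianness. */
static uint16_t read_le16(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t read_le32(const unsigned char *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Walk the extra field blocks and look for the 'UT' tag (0x5455).
   Returns 1 and stores the modification time (seconds since 1970, UTC)
   if found, otherwise returns 0. */
int find_ut_mod_time(const unsigned char *extra, size_t length, uint32_t *mod_time)
{
    size_t pos = 0;

    while (length - pos >= 4)
    {
        uint16_t tag  = read_le16(extra + pos);
        uint16_t size = read_le16(extra + pos + 2);
        pos += 4;

        if (size > length - pos)
        {
            return 0; /* block claims more data than is available */
        }

        if (tag == 0x5455)
        {
            /* flags byte first; bit 0 set means a mod time is present */
            if (size >= 5 && (extra[pos] & 0x01))
            {
                *mod_time = read_le32(extra + pos + 1);
                return 1;
            }
            return 0;
        }

        pos += size; /* skip blocks we do not know about */
    }

    return 0;
}
```

    The resulting value is what you would hand to [NSDate dateWithTimeIntervalSince1970:] as discussed earlier in this thread.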