Re: Digest for cocoa-unbound@googlegroups.com - 3 Messages in 1 Topic

Carlton Gibson

Apr 25, 2012, 2:45:10 AM

to cocoa-...@googlegroups.com

Hi Hamish, Mike,

Thank you both — exactly what I need.

Regards,

Carlton

On 25 Apr 2012, at 08:30, cocoa-...@googlegroups.com wrote:

Today's Topic Summary
Group: http://groups.google.com/group/cocoa-unbound/topics

NSString -initWithData: … Guess Encoding [3 Updates]

NSString -initWithData: … Guess Encoding

Carlton Gibson <carlton...@gmail.com> Apr 24 11:26AM +0200

Hi All,

I face the standard situation of needing to decode an NSData instance into an NSString without knowing the string encoding in advance.

So I try each encoding in turn until I get a hit...

NSString *theString;
theString = [[NSString alloc] initWithData:someData encoding:NSUTF8StringEncoding];
if (theString == nil) {
    theString = [[NSString alloc] initWithData:someData encoding:NSISOLatin1StringEncoding];
}
... and so on...

This has been done a million times already, and no doubt better.

Can anyone point me to some code that wraps this up neatly and catches the edge cases?
Thank you!

Regards,

Carlton

Hamish Allan <ham...@gmail.com> Apr 24 10:51AM +0100

Hi Carlton,

Here's the category on NSFileManager I have for this, using libicucore:

// NSFileManager+OTAdditions.m
// Created by Hamish Allan
// Copyright 2012 Olive Toast.
// http://creativecommons.org/licenses/by/3.0/
// Attribution requirement limited to comments in source code.

#import "ucsdet.h"
#define UOnFailReturnNil(errorCode) if (U_FAILURE(errorCode)) { \
    NSLog(@"%s (%d): %s", __PRETTY_FUNCTION__, __LINE__, u_errorName(errorCode)); \
    if (charsetDetector) ucsdet_close(charsetDetector); \
    return nil; }

@implementation NSFileManager (OTAdditions)

- (NSString *)otCharsetForTextFileAtPath:(NSString *)path
{
    UErrorCode errorCode = U_ZERO_ERROR;

    // Open an ICU charset detector; UOnFailReturnNil closes it on failure.
    UCharsetDetector *charsetDetector = ucsdet_open(&errorCode);
    UOnFailReturnNil(errorCode);

    NSData *characterData = [NSData dataWithContentsOfMappedFile:path];

    // Hand the raw file bytes to the detector.
    ucsdet_setText(charsetDetector, [characterData bytes], [characterData length], &errorCode);
    UOnFailReturnNil(errorCode);

    // Ask ICU for its single best guess.
    const UCharsetMatch *bestMatch = ucsdet_detect(charsetDetector, &errorCode);
    UOnFailReturnNil(errorCode);

    const char *encodingName = ucsdet_getName(bestMatch, &errorCode);
    UOnFailReturnNil(errorCode);

    // Copy the name into an NSString before closing the detector, which owns the match.
    NSString *encodingNameString = [NSString stringWithUTF8String:encodingName];
    ucsdet_close(charsetDetector);

    return encodingNameString;
}

@end

Hope this helps,
Hamish
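
A charset name like the one returned above still has to be mapped to an NSStringEncoding before NSString can decode with it. The following is a minimal sketch of one way to do that, assuming ARC and Core Foundation's IANA name lookup. The function name OTStringFromDataWithCharsetName is hypothetical and not part of Hamish's category, and some names a detector reports may not be recognized by Core Foundation, in which case the sketch returns nil.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not from the thread): decode `data` using an IANA
// charset name such as @"UTF-8" or @"ISO-8859-1".
// Returns nil if the name is unknown or the bytes do not decode.
static NSString *OTStringFromDataWithCharsetName(NSData *data, NSString *charsetName)
{
    if (data == nil || charsetName == nil) {
        return nil;
    }

    // Look up the Core Foundation encoding for the IANA name.
    CFStringEncoding cfEncoding =
        CFStringConvertIANACharSetNameToEncoding((__bridge CFStringRef)charsetName);
    if (cfEncoding == kCFStringEncodingInvalidId) {
        return nil; // Core Foundation does not know this charset name.
    }

    // Convert to the constant space that NSString expects, then decode.
    NSStringEncoding encoding = CFStringConvertEncodingToNSStringEncoding(cfEncoding);
    return [[NSString alloc] initWithData:data encoding:encoding];
}
```

A caller would pass the file's NSData together with the detected name, and fall back to another strategy when the result is nil.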

Michael Ash <micha...@gmail.com> Apr 24 11:27AM -0400

On Apr 24, 2012, at 5:26 AM, Carlton Gibson wrote:

> Can anyone point me to some code that wraps this up neatly, and catches all the edge cases etc?

There actually isn't all that much to catch. I have a bit of sample code in this article under "Fallbacks":

http://mikeash.com/pyblog/friday-qa-2010-02-19-character-encodings.html

Note that once you get to MacOSRoman, you can stop checking any others, because MacOSRoman will successfully (if not necessarily correctly) decode any sequence of bytes you throw at it.

That approach is best if you have data that you really expect to be UTF-8, need some vaguely useful results if it's not, but don't really care about seriously detecting and correctly presenting the variety of weird encodings out there. If you really need a good chance of handling weird encodings, Hamish's approach is probably what you want to go for.

Mike
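
The fallback chain Mike describes can be sketched roughly as below. This is an illustration under stated assumptions, not the code from his article: the function name StringByGuessingEncoding and the out-parameter are hypothetical, ARC is assumed, and only two candidates are tried. UTF-8 goes first because it is the expected encoding, and MacOSRoman goes last because, as noted above, it accepts any byte sequence and so ends the search.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not from the thread or the linked article).
// Tries each candidate encoding in order and returns the first successful
// decode. If usedEncoding is non-NULL, it receives the encoding that worked.
static NSString *StringByGuessingEncoding(NSData *data, NSStringEncoding *usedEncoding)
{
    if (data == nil) {
        return nil;
    }

    // Expected encoding first; the catch-all must come last, because
    // nothing listed after it would ever be tried.
    const NSStringEncoding candidates[] = {
        NSUTF8StringEncoding,
        NSMacOSRomanStringEncoding,
    };
    const size_t count = sizeof(candidates) / sizeof(candidates[0]);

    for (size_t i = 0; i < count; i++) {
        NSString *string = [[NSString alloc] initWithData:data encoding:candidates[i]];
        if (string != nil) {
            if (usedEncoding != NULL) {
                *usedEncoding = candidates[i];
            }
            return string;
        }
    }
    return nil;
}
```

Reporting the encoding that succeeded lets the caller tell an expected UTF-8 decode from a fallback result that may be readable but wrong.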
