Group: http://groups.google.com/group/cocoa-unbound/topics
- NSString -initWithData: … Guess Encoding [3 Updates]
Carlton Gibson <carlton...@gmail.com> Apr 24 11:26AM +0200
Hi All,
I face the standard situation of needing to decode an NSData instance into an NSString without knowing the string encoding in advance.
So I try each encoding in turn until I get a hit...
NSString *theString;
theString = [[NSString alloc] initWithData:someData encoding:NSUTF8StringEncoding];
if (theString == nil) {
    theString = [[NSString alloc] initWithData:someData encoding:NSISOLatin1StringEncoding];
}
... and so on...
This has been done a million times already, and no doubt better.
Can anyone point me to some code that wraps this up neatly, and catches all the edge cases etc?
Thank you!
Regards,
Carlton

Hamish Allan <ham...@gmail.com> Apr 24 10:51AM +0100
Hi Carlton,
Here's the category on NSFileManager I have for this, using libicucore:
// NSFileManager+OTAdditions.m
// Created by Hamish Allan
// Copyright 2012 Olive Toast.
// http://creativecommons.org/licenses/by/3.0/
// Attribution requirement limited to comments in source code.

#import <Foundation/Foundation.h>
#import "ucsdet.h"

// On ICU failure: log the error, close the detector if it was opened, return nil.
// Expects a local variable named charsetDetector to be in scope.
#define UOnFailReturnNil(errorCode) \
    if (U_FAILURE(errorCode)) { \
        NSLog(@"%s (%d): %s", __PRETTY_FUNCTION__, __LINE__, u_errorName(errorCode)); \
        if (charsetDetector) ucsdet_close(charsetDetector); \
        return nil; \
    }

@implementation NSFileManager (OTAdditions)

// Returns ICU's name for the best-guess charset of the file at path, or nil on failure.
- (NSString *)otCharsetForTextFileAtPath:(NSString *)path
{
    UErrorCode errorCode = U_ZERO_ERROR;
    UCharsetDetector *charsetDetector = ucsdet_open(&errorCode);
    UOnFailReturnNil(errorCode);

    // Hand the raw file bytes to the detector (ICU takes an int32_t length).
    NSData *characterData = [NSData dataWithContentsOfMappedFile:path];
    ucsdet_setText(charsetDetector, [characterData bytes],
                   (int32_t)[characterData length], &errorCode);
    UOnFailReturnNil(errorCode);

    // bestMatch is owned by the detector, so read the name before closing it.
    const UCharsetMatch *bestMatch = ucsdet_detect(charsetDetector, &errorCode);
    UOnFailReturnNil(errorCode);
    const char *encodingName = ucsdet_getName(bestMatch, &errorCode);
    UOnFailReturnNil(errorCode);

    NSString *encodingNameString = [NSString stringWithUTF8String:encodingName];
    ucsdet_close(charsetDetector);
    return encodingNameString;
}

@end
Hope this helps,
Hamish
Michael Ash <micha...@gmail.com> Apr 24 11:27AM -0400
On Apr 24, 2012, at 5:26 AM, Carlton Gibson wrote:
> Can anyone point me to some code that wraps this up neatly, and catches all the edge cases etc?
There actually isn't all that much to catch. I have a bit of sample code in this article under "Fallbacks":
http://mikeash.com/pyblog/friday-qa-2010-02-19-character-encodings.html
Note that once you get to MacOSRoman, you can stop checking any others, because MacOSRoman will successfully (if not necessarily correctly) decode any sequence of bytes you throw at it.
That approach is best if you have data that you really expect to be UTF-8, need some vaguely useful results if it's not, but don't really care about seriously detecting and correctly presenting the variety of weird encodings out there. If you really need a good chance of handling weird encodings, Hamish's approach is probably what you want to go for.
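To make the fallback idea concrete, here is a minimal sketch in plain C (which also compiles as Objective-C). It is not the code from the article: the names encoding_guess, is_valid_utf8 and guess_encoding are hypothetical, and the strict UTF-8 check is only an assumption about what a failed UTF-8 decode looks like. In real Cocoa code, a nil result from -initWithData:encoding: already tells you the decode failed. The point is the shape of the chain: try UTF-8 first, and end on an encoding that cannot fail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names for this sketch; not part of Foundation or ICU. */
typedef enum {
    GUESS_UTF8,      /* bytes are well-formed UTF-8 */
    GUESS_MAC_ROMAN  /* last resort: decodes any byte sequence, so stop here */
} encoding_guess;

/* Returns true if bytes[0..len) is well-formed UTF-8: no overlong forms,
   no surrogates (U+D800..U+DFFF), nothing above U+10FFFF. */
static bool is_valid_utf8(const unsigned char *bytes, size_t len)
{
    assert(bytes != NULL || len == 0);
    size_t i = 0;
    while (i < len) {
        unsigned char b = bytes[i];
        size_t extra;       /* continuation bytes expected after the lead byte */
        unsigned long min;  /* smallest code point allowed for this length */
        unsigned long cp;   /* code point being assembled */

        if (b < 0x80)                { i++; continue; }
        else if ((b & 0xE0) == 0xC0) { extra = 1; min = 0x80;    cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { extra = 2; min = 0x800;   cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { extra = 3; min = 0x10000; cp = b & 0x07; }
        else return false;  /* stray continuation byte or invalid lead byte */

        if (len - i <= extra) return false;  /* sequence truncated at end of data */
        for (size_t k = 1; k <= extra; k++) {
            unsigned char c = bytes[i + k];
            if ((c & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;
        i += extra + 1;
    }
    return true;
}

/* The fallback chain: UTF-8 if it validates, otherwise the catch-all. */
static encoding_guess guess_encoding(const unsigned char *bytes, size_t len)
{
    return is_valid_utf8(bytes, len) ? GUESS_UTF8 : GUESS_MAC_ROMAN;
}
```

Any further candidates (Latin-1, CP1252 and so on) would have to go between the two steps, since nothing placed after the catch-all is ever reached.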
Mike