Re: Digest for cocoa-unbound@googlegroups.com - 3 Messages in 1 Topic

Carlton Gibson

Apr 25, 2012, 2:45:10 AM
to cocoa-...@googlegroups.com
Hi Hamish, Mike, 

Thank you both — exactly what I need. 

Regards,
Carlton

On 25 Apr 2012, at 08:30, cocoa-...@googlegroups.com wrote:

Group: http://groups.google.com/group/cocoa-unbound/topics

    Carlton Gibson <carlton...@gmail.com> Apr 24 11:26AM +0200  

    Hi All,
     
    I face the standard situation of needing to decode an NSData instance into an NSString without knowing the string encoding in advance.
     
    So I try each encoding in turn until I get a hit...
     
    NSString *theString;
    theString = [[NSString alloc] initWithData:someData encoding:NSUTF8StringEncoding];
    if (theString == nil) {
        theString = [[NSString alloc] initWithData:someData encoding:NSISOLatin1StringEncoding];
    }
    ... and so on...
     
    This has been done a million times already, and no doubt better.
     
    Can anyone point me to some code that wraps this up neatly and catches all the edge cases?
    Thank you!
     
    Regards,
     
    Carlton
     
    Hamish Allan <ham...@gmail.com> Apr 24 10:51AM +0100  

    Hi Carlton,
     
    Here's the category on NSFileManager I have for this, using libicucore:
     
    // NSFileManager+OTAdditions.m
    // Created by Hamish Allan
    // Copyright 2012 Olive Toast.
    // http://creativecommons.org/licenses/by/3.0/
    // Attribution requirement limited to comments in source code.
     
    #import "ucsdet.h"

    #define UOnFailReturnNil(errorCode) \
        if (U_FAILURE(errorCode)) { \
            NSLog(@"%s (%d): %s", __PRETTY_FUNCTION__, __LINE__, u_errorName(errorCode)); \
            if (charsetDetector) ucsdet_close(charsetDetector); \
            return nil; \
        }
     
    @implementation NSFileManager (OTAdditions)
     
    - (NSString *)otCharsetForTextFileAtPath:(NSString *)path
    {
        UErrorCode errorCode = U_ZERO_ERROR;

        UCharsetDetector *charsetDetector = ucsdet_open(&errorCode);
        UOnFailReturnNil(errorCode);

        NSData *characterData = [NSData dataWithContentsOfMappedFile:path];

        ucsdet_setText(charsetDetector, [characterData bytes], [characterData length], &errorCode);
        UOnFailReturnNil(errorCode);

        const UCharsetMatch *bestMatch = ucsdet_detect(charsetDetector, &errorCode);
        UOnFailReturnNil(errorCode);

        const char *encodingName = ucsdet_getName(bestMatch, &errorCode);
        UOnFailReturnNil(errorCode);

        NSString *encodingNameString = [NSString stringWithUTF8String:encodingName];
        ucsdet_close(charsetDetector);

        return encodingNameString;
    }
     
    @end
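
The method above returns a charset name rather than an NSStringEncoding, so one possible way to bridge the two is sketched below. EncodingForIANACharsetName is a hypothetical helper, not part of the category above. It assumes ARC, and it assumes the name ICU reports is one that Core Foundation recognises as an IANA charset name, which may not hold for every charset.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not from the thread): map a charset name such as
// "UTF-8" or "ISO-8859-1" to an NSStringEncoding.
// Returns 0 when the name is nil or not recognised by Core Foundation.
static NSStringEncoding EncodingForIANACharsetName(NSString *charsetName)
{
    if (charsetName == nil) {
        return 0;
    }

    CFStringEncoding cfEncoding =
        CFStringConvertIANACharSetNameToEncoding((__bridge CFStringRef)charsetName);
    if (cfEncoding == kCFStringEncodingInvalidId) {
        return 0;
    }

    return CFStringConvertEncodingToNSStringEncoding(cfEncoding);
}
```

With a non-zero result, the data could then be decoded with -[NSString initWithData:encoding:] as in the original question.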
     
    Hope this helps,
    Hamish
     
     
     
    Michael Ash <micha...@gmail.com> Apr 24 11:27AM -0400  

    On Apr 24, 2012, at 5:26 AM, Carlton Gibson wrote:
     
    > Can anyone point me to some code that wraps this up neatly, and catches all the edge cases etc?
     
    There actually isn't all that much to catch. I have a bit of sample code in this article under "Fallbacks":
     
    http://mikeash.com/pyblog/friday-qa-2010-02-19-character-encodings.html
     
    Note that once you get to MacOSRoman, you can stop checking any others, because MacOSRoman will successfully (if not necessarily correctly) decode any sequence of bytes you throw at it.
     
    That approach is best if you have data that you really expect to be UTF-8, need some vaguely useful results if it's not, but don't really care about seriously detecting and correctly presenting the variety of weird encodings out there. If you really need a good chance of handling weird encodings, Hamish's approach is probably what you want to go for.
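
A rough sketch of the fallback approach described above, not the code from the linked article. DecodeDataWithFallbacks is a hypothetical name, the choice of Windows-1252 as the middle candidate is an assumption, and the code assumes ARC. MacOSRoman comes last because, as noted above, nothing after it would ever be tried.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: try each candidate encoding in order and return the
// first string that decodes. If outEncoding is non-NULL, it receives the
// encoding that succeeded.
static NSString *DecodeDataWithFallbacks(NSData *data, NSStringEncoding *outEncoding)
{
    // Assumed order: the expected encoding first, the catch-all last.
    const NSStringEncoding candidates[] = {
        NSUTF8StringEncoding,
        NSWindowsCP1252StringEncoding,
        NSMacOSRomanStringEncoding,
    };
    const size_t count = sizeof(candidates) / sizeof(candidates[0]);

    for (size_t i = 0; i < count; i++) {
        NSString *string = [[NSString alloc] initWithData:data
                                                 encoding:candidates[i]];
        if (string != nil) {
            if (outEncoding != NULL) {
                *outEncoding = candidates[i];
            }
            return string;
        }
    }
    return nil;
}
```

A result from one of the later candidates means only that the bytes decoded, not that the text is correct, so callers may want to check which encoding was used.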
     
    Mike
     
