Why are UTF surrogate pairs modified by V8?


Roman Budnyjj

Jun 2, 2016, 2:44:39 PM
to v8-users
Hi guys,
I'm trying to pass some string data to JS functions managed by V8.
These strings are UTF-8 encoded and contain surrogate pairs (emoji).
I've found that newer versions of the library (5.1.281.56) modify these strings, so
the contents after converting back to std types are not the same as they were initially:

#include <include/v8.h>

#include <include/libplatform/libplatform.h>

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <string>


class SampleArrayBufferAllocator : public v8::ArrayBuffer::Allocator {
 public:
  virtual void* Allocate(size_t length) {
    void* data = AllocateUninitialized(length);
    return data == NULL ? data : memset(data, 0, length);
  }
  virtual void* AllocateUninitialized(size_t length) { return malloc(length); }
  virtual void Free(void* data, size_t) { free(data); }
};

int main(int argc, char* argv[]) {
    v8::V8::InitializeICU();
    v8::V8::InitializeExternalStartupData(argv[0]);
    v8::Platform* platform = v8::platform::CreateDefaultPlatform();
    v8::V8::InitializePlatform(platform);
    v8::V8::Initialize();
    v8::V8::SetFlagsFromCommandLine(&argc, argv, true);
    SampleArrayBufferAllocator array_buffer_allocator;
    v8::Isolate::CreateParams create_params;
    create_params.array_buffer_allocator = &array_buffer_allocator;
    v8::Isolate* isolate = v8::Isolate::New(create_params);
    {
        v8::Isolate::Scope isolate_scope(isolate);
        v8::HandleScope handle_scope(isolate);
        // v8::Local<v8::Context> context = CreateShellContext(isolate);

        // \uD83C\uDC32, each surrogate encoded as its own three-byte sequence
        std::string src("\355\240\274\355\260\262");
        std::string dst =
            *v8::String::Utf8Value(
                v8::String::NewFromUtf8(
                    isolate, src.c_str(), v8::NewStringType::kNormal).ToLocalChecked());
        if (src != dst) {
            printf("!\n");
        }
    }

    return 0;
}

Both on my x64 machine and on android-19 (ARM) it prints "!".
I should also mention that on older versions of V8 (3.27.34) this string stays unmodified.
Could you please explain the reason for this behavior?

Jochen Eisinger

Jun 3, 2016, 2:42:23 AM
to v8-users

Your input string is encoded as CESU-8, not UTF-8. Older versions of V8 silently accepted that encoding. However, this led to broken behavior when interacting with libraries that actually expect UTF-8, so we changed our implementation to require UTF-8. Invalid characters are now replaced when the string is converted to our internal UTF-16 representation.
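
For illustration, here is a minimal sketch of one way to turn a CESU-8 string into real UTF-8 before handing it to `v8::String::NewFromUtf8`. `Cesu8ToUtf8` and its two helpers are hypothetical names, not part of the V8 API. The sketch assumes the input is otherwise well-formed: it only rewrites a high surrogate immediately followed by a low surrogate, and copies everything else (including lone surrogates) through unchanged.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// True if a well-formed-looking 3-byte sequence (1110xxxx 10xxxxxx 10xxxxxx)
// starts at s[i] and fits inside the string.
static bool IsThreeByteSeq(const std::string& s, size_t i) {
  if (i + 2 >= s.size()) return false;
  unsigned char b0 = s[i], b1 = s[i + 1], b2 = s[i + 2];
  return (b0 & 0xF0) == 0xE0 && (b1 & 0xC0) == 0x80 && (b2 & 0xC0) == 0x80;
}

// Decodes the 3-byte sequence at s[i] into a 16-bit value.
// The caller must have checked IsThreeByteSeq(s, i) first.
static uint32_t DecodeThreeBytes(const std::string& s, size_t i) {
  return ((static_cast<unsigned char>(s[i]) & 0x0Fu) << 12) |
         ((static_cast<unsigned char>(s[i + 1]) & 0x3Fu) << 6) |
         (static_cast<unsigned char>(s[i + 2]) & 0x3Fu);
}

// Hypothetical helper (not a V8 function): rewrites each CESU-8 surrogate
// pair (two 3-byte sequences) as a single 4-byte UTF-8 sequence.
std::string Cesu8ToUtf8(const std::string& in) {
  std::string out;
  out.reserve(in.size());
  size_t i = 0;
  while (i < in.size()) {
    if (IsThreeByteSeq(in, i) && IsThreeByteSeq(in, i + 3)) {
      uint32_t hi = DecodeThreeBytes(in, i);
      uint32_t lo = DecodeThreeBytes(in, i + 3);
      if (hi >= 0xD800 && hi <= 0xDBFF && lo >= 0xDC00 && lo <= 0xDFFF) {
        // Combine the surrogate pair into one supplementary code point.
        uint32_t cp = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        i += 6;
        continue;
      }
    }
    // Anything else is copied byte for byte.
    out.push_back(in[i]);
    ++i;
  }
  return out;
}
```

With the string from the original post, `Cesu8ToUtf8("\355\240\274\355\260\262")` yields the four bytes `"\360\237\200\262"`, which is the UTF-8 encoding of U+1F032, the code point that the surrogate pair \uD83C\uDC32 represents.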


--
--
v8-users mailing list
v8-u...@googlegroups.com
http://groups.google.com/group/v8-users
---
You received this message because you are subscribed to the Google Groups "v8-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to v8-users+u...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Roman Budnyjj

Jun 3, 2016, 12:29:34 PM
to v8-u...@googlegroups.com
Thank you for the explanation!
