Debugging invalid UTF-8 data

Franz Allan Valencia See

Mar 9, 2010, 1:19:14 AM
to Protocol Buffers
Good day,

I am working on a Java application which uses a third-party framework called CMeCab-Java. CMeCab-Java has two parts: the Java side and the C++ side. One of the ways CMeCab-Java provides to bridge the two is protobuf (an advantage of this approach over the other bridging approaches is that it is faster and easier to work with, since what you get are objects and not streams).

Specifically, CMeCab-Java accepts an input String (CharSequence, to be exact), tokenizes it using MeCab (http://mecab.sourceforge.net/), and gives the result back as a Java object.

While playing with it, I found that there were too many round trips between the Java side and the C++ side, so I am trying to minimize them (and hopefully improve performance). Specifically, instead of passing the n texts separately, I assembled them into a single long text delimited by 0x00 (i.e. {'the', 'quick', 'brown', 'fox'} becomes 'the' + 0x00 + 'quick' + 0x00 + 'brown' + 0x00 + 'fox') and passed that to the C++ side via protobuf.
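
For readers following along, the batching described above can be sketched roughly as follows. This is only an illustration of the NUL-delimited join and the matching split; the class and method names are hypothetical and are not part of CMeCab-Java or protobuf.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of CMeCab-Java: joins texts with a NUL delimiter.
final class NulDelimitedBatch {
    // The delimiter used in the post: a single NUL character (0x00).
    private static final char DELIMITER = '\u0000';

    // Joins the texts into one string, with a NUL between consecutive texts.
    public static String join(List<String> texts) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < texts.size(); i++) {
            if (i > 0) {
                sb.append(DELIMITER);
            }
            sb.append(texts.get(i));
        }
        return sb.toString();
    }

    // Splits a NUL-delimited string back into its parts.
    // The limit of -1 keeps trailing empty strings.
    public static List<String> split(String joined) {
        return Arrays.asList(joined.split(String.valueOf(DELIMITER), -1));
    }

    public static void main(String[] args) {
        List<String> texts = Arrays.asList("the", "quick", "brown", "fox");
        String joined = join(texts);
        System.out.println(split(joined)); // prints [the, quick, brown, fox]
    }
}
```

Note that this scheme only round-trips if none of the input texts contains a NUL character itself.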

This works OK in a single-threaded application. However, once I multithread the requests, I get the following error from protobuf (with gdb), which crashes the JVM:
libprotobuf ERROR google/protobuf/wire_format.cc:1059] Encountered string containing invalid UTF-8 data while serializing protocol buffer. Strings must contain only UTF-8; use the 'bytes' type for raw bytes.

But I am not sure why that is. Is there any flag I can turn on to see what this invalid UTF-8 data is and which string it was processing when I got that? (...Or is there an easier/better way for me to achieve the performance gains that I am looking for? :-) )
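
As a rough aid for the question above: the check that fails runs inside the native protobuf library, but if the bytes in question can be captured on the Java side, a strict decoder can report where the first malformed sequence starts. The sketch below uses only the standard `java.nio.charset` API; the class name `Utf8Check` is hypothetical. The sample input uses the two-byte sequence 0xC0 0x80, which is how Java's "modified UTF-8" (used by JNI's `GetStringUTFChars` and by `DataOutputStream.writeUTF`) encodes NUL and which strict UTF-8 decoders reject. Whether that is what happens inside CMeCab-Java is an assumption; the thread does not confirm it.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: locates the first byte sequence that is not valid UTF-8.
final class Utf8Check {
    // Returns the byte offset of the first malformed sequence, or -1 if the
    // whole array decodes cleanly as UTF-8.
    public static int firstInvalidOffset(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(data);
        // UTF-8 never yields more chars than input bytes, so this cannot overflow.
        CharBuffer out = CharBuffer.allocate(data.length);
        CoderResult result = decoder.decode(in, out, true);
        // On an error the input buffer is left positioned at the bad sequence.
        return result.isError() ? in.position() : -1;
    }

    public static void main(String[] args) {
        // 'a', 'b', then 0xC0 0x80 (modified UTF-8 encoding of NUL).
        byte[] suspect = { 'a', 'b', (byte) 0xC0, (byte) 0x80 };
        System.out.println("first invalid offset: " + firstInvalidOffset(suspect));
    }
}
```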

Thanks,
--
Franz Allan Valencia See | Java Software Engineer
fran...@gmail.com
LinkedIn: http://www.linkedin.com/in/franzsee
Twitter: http://www.twitter.com/franz_see

Kenton Varda

Mar 9, 2010, 2:13:54 AM
to Franz Allan Valencia See, Protocol Buffers
Protocol Buffers are binary data, not text.  You can't store them in String (or CharSequence) objects because those are meant only for Unicode text.  If CMeCab tries to transfer protobuf messages as Strings then it is, unfortunately, broken.

If you want to figure out how you are hitting that log message, you can run in a debugger and insert a breakpoint at the file and line number shown in the message.

--
You received this message because you are subscribed to the Google Groups "Protocol Buffers" group.
To post to this group, send email to prot...@googlegroups.com.
To unsubscribe from this group, send email to protobuf+u...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/protobuf?hl=en.

Franz Allan Valencia See

Mar 9, 2010, 3:27:24 AM
to Kenton Varda, Protocol Buffers
Actually, that String/CharSequence is being placed in a Java class generated by protobuf.

Which debugger would you suggest? Pardon, I'm a noob on native libraries.


Thanks,


Franz Allan Valencia See

Mar 9, 2010, 9:39:44 AM
to Protocol Buffers
Oh, and in my hs_err_pid log I have this:

bash-3.00# cat hs_err_pid574.log
#
# An unexpected error has been detected by HotSpot Virtual Machine:
#
#  SIGSEGV (0xb) at pc=0x700dce44, pid=574, tid=40
#
# Java VM: Java HotSpot(TM) Server VM (1.5.0_17-b04 mixed mode)
# Problematic frame:
# C  [libCMeCab.so+0xce44]  Java_net_moraleboost_mecab_impl_LocalProtobufTagger__1parse+0x1dc
#

---------------  T H R E A D  ---------------

Current thread (0x00f1f318):  JavaThread "pool-1-thread-1" [_thread_in_native, id=40]

siginfo:si_signo=11, si_errno=0, si_code=1, si_addr=0x00000004

Registers:
 O0=0x00000000 O1=0x01a3153c O2=0x0000032b O3=0x00000001
 O4=0xd28f7670 O5=0x0166d970 O6=0x68cfea78 O7=0x700dce30
 G1=0x00000000 G2=0x00001cc4 G3=0x00001ffc G4=0x003c7cc6
 G5=0xff032f18 G6=0x00000000 G7=0x703d3200 Y=0x00000000
 PC=0x700dce44 nPC=0x700dce48


Top of Stack: (sp=0x68cfea78)
0x68cfea78:   d28e1744 76242b10 76242b30 68cfebf4
0x68cfea88:   68cfeb78 000002b0 76286b68 700f7e00
0x68cfea98:   00f1f3d4 68cfec88 00000000 01463f90
0x68cfeaa8:   68cfecfc 00000000 68cfeba0 f8c0c280
0x68cfeab8:   d28f7670 01463f90 01a3153c 0000032b
0x68cfeac8:   00000000 00000063 01a2ef88 785e1d80
0x68cfead8:   d28e1728 00000000 745b6a28 785e1d80
0x68cfeae8:   01a3153c 00000000 0000000a d28e1a78

Instructions: (pc=0x700dce44)
0x700dce34:   01 00 00 00 82 10 00 08 c2 27 bf 60 c2 07 bf 60
0x700dce44:   c2 00 60 04 c2 27 bf 60 c2 07 bf 60 80 a0 60 00

Stack: [0x68c00000,0x68d00000),  sp=0x68cfea78,  free space=1018k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [libCMeCab.so+0xce44]  Java_net_moraleboost_mecab_impl_LocalProtobufTagger__1parse+0x1dc
j  net.moraleboost.mecab.impl.LocalProtobufTagger._parse(J[B)[B+700489
j  net.moraleboost.mecab.impl.LocalProtobufTagger._parse(J[B)[B+0
j  net.moraleboost.mecab.impl.LocalProtobufTagger.parse(Lnet/moraleboost/mecab/impl/Messages$ParsingRequest;)Lnet/moraleboost/mecab/impl/Messages$ParsingResponse;+8
j  net.moraleboost.mecab.impl.ProtobufTagger.parse(Ljava/lang/CharSequence;)Lnet/moraleboost/mecab/impl/ProtobufNode;+44
j  net.moraleboost.mecab.impl.ProtobufTagger.parse(Ljava/lang/CharSequence;)Lnet/moraleboost/mecab/Node;+2
j  net.moraleboost.mecab.impl.ThreadBasedLoadLifecycleAwareTagger.getNodes(Ljava/lang/CharSequence;)Ljava/util/List;+5
j  net.moraleboost.mecab.impl.ThreadBasedLoadLifecycleAwareTagger.preLoad(Ljava/util/List;)Z+21
j  net.moraleboost.mecab.LoadLifecycleUtil$1.execute(Lnet/moraleboost/mecab/LoadLifecycleAware;Ljava/util/List;)Z+2
j  net.moraleboost.mecab.LoadLifecycleUtil.loadTemplate(Lnet/moraleboost/mecab/LoadLifecycleUtil$LoadCallback;Ljava/lang/Object;Ljava/util/List;)Z+16
j  net.moraleboost.mecab.LoadLifecycleUtil.preLoad(Ljava/lang/Object;Ljava/util/List;)Z+5
j  net.moraleboost.lucene.analysis.ja.GenericMeCabTokenizer.preLoad(Ljava/util/List;)Z+5
j  net.moraleboost.mecab.LoadLifecycleUtil$1.execute(Lnet/moraleboost/mecab/LoadLifecycleAware;Ljava/util/List;)Z+2
j  net.moraleboost.mecab.LoadLifecycleUtil.loadTemplate(Lnet/moraleboost/mecab/LoadLifecycleUtil$LoadCallback;Ljava/lang/Object;Ljava/util/List;)Z+16
j  net.moraleboost.mecab.LoadLifecycleUtil.preLoad(Ljava/lang/Object;Ljava/util/List;)Z+5
j  net.moraleboost.lucene.analysis.ja.GenericMeCabAnalyzer.preLoad(Ljava/util/List;)Z+5
j  net.moraleboost.mecab.LoadLifecycleUtil$1.execute(Lnet/moraleboost/mecab/LoadLifecycleAware;Ljava/util/List;)Z+2
j  net.moraleboost.mecab.LoadLifecycleUtil.loadTemplate(Lnet/moraleboost/mecab/LoadLifecycleUtil$LoadCallback;Ljava/lang/Object;Ljava/util/List;)Z+16
j  net.moraleboost.mecab.LoadLifecycleUtil.preLoad(Ljava/lang/Object;Ljava/util/List;)Z+5
...
j  net.moraleboost.mecab.LoadLifecycleUtil$1.execute(Lnet/moraleboost/mecab/LoadLifecycleAware;Ljava/util/List;)Z+2
j  net.moraleboost.mecab.LoadLifecycleUtil.loadTemplate(Lnet/moraleboost/mecab/LoadLifecycleUtil$LoadCallback;Ljava/lang/Object;Ljava/util/List;)Z+16
j  net.moraleboost.mecab.LoadLifecycleUtil.preLoad(Ljava/lang/Object;Ljava/util/List;)Z+5
...
j  java.util.concurrent.Executors$RunnableAdapter.call()Ljava/lang/Object;+4
j  java.util.concurrent.FutureTask$Sync.innerRun()V+22
j  java.util.concurrent.FutureTask.run()V+4
j  java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Ljava/lang/Runnable;)V+43
j  java.util.concurrent.ThreadPoolExecutor$Worker.run()V+28
j  java.lang.Thread.run()V+11
v  ~StubRoutines::call_stub
V  [libjvm.so+0x19b360]
V  [libjvm.so+0x2c05ac]
V  [libjvm.so+0x2dfb4c]
V  [libjvm.so+0x2db6e8]
V  [libjvm.so+0x67db8c]

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j  net.moraleboost.mecab.impl.LocalProtobufTagger._parse(J[B)[B+0
j  net.moraleboost.mecab.impl.LocalProtobufTagger.parse(Lnet/moraleboost/mecab/impl/Messages$ParsingRequest;)Lnet/moraleboost/mecab/impl/Messages$ParsingResponse;+8
j  net.moraleboost.mecab.impl.ProtobufTagger.parse(Ljava/lang/CharSequence;)Lnet/moraleboost/mecab/impl/ProtobufNode;+44
j  net.moraleboost.mecab.impl.ProtobufTagger.parse(Ljava/lang/CharSequence;)Lnet/moraleboost/mecab/Node;+2
j  net.moraleboost.mecab.impl.ThreadBasedLoadLifecycleAwareTagger.getNodes(Ljava/lang/CharSequence;)Ljava/util/List;+5
j  net.moraleboost.mecab.impl.ThreadBasedLoadLifecycleAwareTagger.preLoad(Ljava/util/List;)Z+21
j  net.moraleboost.mecab.LoadLifecycleUtil$1.execute(Lnet/moraleboost/mecab/LoadLifecycleAware;Ljava/util/List;)Z+2
j  net.moraleboost.mecab.LoadLifecycleUtil.loadTemplate(Lnet/moraleboost/mecab/LoadLifecycleUtil$LoadCallback;Ljava/lang/Object;Ljava/util/List;)Z+16
j  net.moraleboost.mecab.LoadLifecycleUtil.preLoad(Ljava/lang/Object;Ljava/util/List;)Z+5
j  net.moraleboost.lucene.analysis.ja.GenericMeCabTokenizer.preLoad(Ljava/util/List;)Z+5
j  net.moraleboost.mecab.LoadLifecycleUtil$1.execute(Lnet/moraleboost/mecab/LoadLifecycleAware;Ljava/util/List;)Z+2
j  net.moraleboost.mecab.LoadLifecycleUtil.loadTemplate(Lnet/moraleboost/mecab/LoadLifecycleUtil$LoadCallback;Ljava/lang/Object;Ljava/util/List;)Z+16
j  net.moraleboost.mecab.LoadLifecycleUtil.preLoad(Ljava/lang/Object;Ljava/util/List;)Z+5
j  net.moraleboost.lucene.analysis.ja.GenericMeCabAnalyzer.preLoad(Ljava/util/List;)Z+5
j  net.moraleboost.mecab.LoadLifecycleUtil$1.execute(Lnet/moraleboost/mecab/LoadLifecycleAware;Ljava/util/List;)Z+2
j  net.moraleboost.mecab.LoadLifecycleUtil.loadTemplate(Lnet/moraleboost/mecab/LoadLifecycleUtil$LoadCallback;Ljava/lang/Object;Ljava/util/List;)Z+16
j  net.moraleboost.mecab.LoadLifecycleUtil.preLoad(Ljava/lang/Object;Ljava/util/List;)Z+5
...
j  net.moraleboost.mecab.LoadLifecycleUtil$1.execute(Lnet/moraleboost/mecab/LoadLifecycleAware;Ljava/util/List;)Z+2
j  net.moraleboost.mecab.LoadLifecycleUtil.loadTemplate(Lnet/moraleboost/mecab/LoadLifecycleUtil$LoadCallback;Ljava/lang/Object;Ljava/util/List;)Z+16
j  net.moraleboost.mecab.LoadLifecycleUtil.preLoad(Ljava/lang/Object;Ljava/util/List;)Z+5
...
j  java.util.concurrent.Executors$RunnableAdapter.call()Ljava/lang/Object;+4
j  java.util.concurrent.FutureTask$Sync.innerRun()V+22
j  java.util.concurrent.FutureTask.run()V+4
j  java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Ljava/lang/Runnable;)V+43
j  java.util.concurrent.ThreadPoolExecutor$Worker.run()V+28
j  java.lang.Thread.run()V+11
v  ~StubRoutines::call_stub

---------------  P R O C E S S  ---------------

Java Threads: ( => current thread )
  0x012a4100 JavaThread "pool-1-thread-5" [_thread_blocked, id=44]
  0x011eed68 JavaThread "pool-1-thread-4" [_thread_blocked, id=43]
  0x0104eb18 JavaThread "pool-1-thread-3" [_thread_blocked, id=42]
  0x01dc1fc8 JavaThread "pool-1-thread-2" [_thread_in_native, id=41]
=>0x00f1f318 JavaThread "pool-1-thread-1" [_thread_in_native, id=40]
  0x011e1e08 JavaThread "Timer-0" [_thread_blocked, id=39]
  0x021b3948 JavaThread "Store portal.org.hibernate.cache.StandardQueryCache Spool Thread" daemon [_thread_blocked, id=38]
  0x00a27638 JavaThread "Store portal.org.hibernate.cache.UpdateTimestampsCache Spool Thread" daemon [_thread_blocked, id=37]
...
  0x0113caf8 JavaThread "resin-12" daemon [_thread_blocked, id=31]
  0x00697728 JavaThread "BlueDragon MailSender" daemon [_thread_blocked, id=30]
  0x00adb418 JavaThread "BlueDragon CFQUERY Backgrounder" daemon [_thread_blocked, id=29]
  0x007d8348 JavaThread "BlueDragon AlarmManager" daemon [_thread_blocked, id=28]
  0x005ba1b8 JavaThread "resin-11" daemon [_thread_blocked, id=27]
  0x005b9df0 JavaThread "resin-10" daemon [_thread_blocked, id=26]
  0x005b9a28 JavaThread "resin-9" daemon [_thread_blocked, id=25]
  0x005b9660 JavaThread "resin-8" daemon [_thread_blocked, id=24]
  0x005b9438 JavaThread "resin-7" daemon [_thread_blocked, id=23]
  0x005a7b70 JavaThread "resin-6" daemon [_thread_blocked, id=22]
  0x00344cc8 JavaThread "resin-5" daemon [_thread_blocked, id=21]
  0x00344130 JavaThread "resin-4" daemon [_thread_blocked, id=20]
  0x002bb698 JavaThread "resin-3" daemon [_thread_blocked, id=19]
  0x002bc408 JavaThread "resin-2" daemon [_thread_blocked, id=18]
  0x002bbc40 JavaThread "resin-1" daemon [_thread_blocked, id=17]
  0x0032b548 JavaThread "resin-0" daemon [_thread_blocked, id=16]
  0x002bba78 JavaThread "resin-thread-scheduler" daemon [_thread_blocked, id=15]
  0x0032ae40 JavaThread "resin-thread-launcher" daemon [_thread_blocked, id=14]
  0x002982b8 JavaThread "resin-alarm" daemon [_thread_blocked, id=13]
  0x00154468 JavaThread "Low Memory Detector" daemon [_thread_blocked, id=11]
  0x001521f8 JavaThread "CompilerThread1" daemon [_thread_blocked, id=10]
  0x001513d8 JavaThread "CompilerThread0" daemon [_thread_blocked, id=9]
  0x001505b8 JavaThread "AdapterThread" daemon [_thread_blocked, id=8]
  0x0014f828 JavaThread "Signal Dispatcher" daemon [_thread_blocked, id=7]
  0x0013d620 JavaThread "Finalizer" daemon [_thread_blocked, id=6]
  0x0013d140 JavaThread "Reference Handler" daemon [_thread_blocked, id=5]
  0x00038268 JavaThread "main" [_thread_blocked, id=1]

Other Threads:
  0x0013b060 VMThread [id=4]
  0x000b1478 WatcherThread [id=12]

VM state:not at safepoint (normal execution)

VM Mutex/Monitor currently owned by a thread: None

Heap
 PSYoungGen      total 151424K, used 80214K [0xcdc00000, 0xddc00000, 0xf8800000)
  eden space 87552K, 91% used [0xcdc00000,0xd2a55a40,0xd3180000)
  from space 63872K, 0% used [0xd3180000,0xd3180000,0xd6fe0000)
  to   space 77632K, 0% used [0xd9030000,0xd9030000,0xddc00000)
 PSOldGen        total 241664K, used 151551K [0x78400000, 0x87000000, 0xcdc00000)
  object space 241664K, 62% used [0x78400000,0x817ffff0,0x87000000)
 PSPermGen       total 49152K, used 32921K [0x74400000, 0x77400000, 0x78400000)
  object space 49152K, 66% used [0x74400000,0x764265a0,0x77400000)

Dynamic libraries:
0x00010000      /usr/jdk/instances/jdk1.5.0/bin/java
0xff3a0000      /lib/libthread.so.1
0xff370000      /lib/libdl.so.1
0xff200000      /lib/libc.so.1
0xff390000      /platform/SUNW,Sun-Fire-V240/lib/libc_psr.so.1
0xfe800000      /usr/jdk/instances/jdk1.5.0/jre/lib/sparc/server/libjvm.so
0xff1e0000      /lib/libsocket.so.1
0xff350000      /usr/lib/libsched.so.1
0xff1b0000      /usr/lib/libCrun.so.1
0xff190000      /lib/libm.so.1
0xff080000      /lib/libnsl.so.1
0xfe700000      /lib/libm.so.2
0xff160000      /lib/libscf.so.1
0xff140000      /lib/libdoor.so.1
0xff060000      /lib/libuutil.so.1
0xfe7e0000      /lib/libgen.so.1
0xfe6d0000      /lib/libmd.so.1
0xfe6b0000      /platform/SUNW,Sun-Fire-V240/lib/libmd_psr.so.1
0xfe690000      /lib/libmp.so.2
0xfe670000      /usr/jdk/instances/jdk1.5.0/jre/lib/sparc/native_threads/libhpi.so
0xfe610000      /lib/nss_files.so.1
0xfe5e0000      /usr/jdk/instances/jdk1.5.0/jre/lib/sparc/libverify.so
0xfe5a0000      /usr/jdk/instances/jdk1.5.0/jre/lib/sparc/libjava.so
0xfe580000      /usr/jdk/instances/jdk1.5.0/jre/lib/sparc/libzip.so
0xfafe0000      /usr/jdk/instances/jdk1.5.0/jre/lib/sparc/libj2pkcs11.so
0xfafb0000      /usr/lib/libpkcs11.so
0xfaf90000      /usr/lib/libcryptoutil.so.1
0xfaea0000      /usr/lib/security/pkcs11_softtoken.so
0xfadd0000      /usr/jdk/instances/jdk1.5.0/jre/lib/sparc/libnet.so
0xfadb0000      /lib/nss_nis.so.1
0xf8a60000      /usr/jdk/instances/jdk1.5.0/jre/lib/sparc/libmanagement.so
0x6e780000      /usr/jdk/instances/jdk1.5.0/jre/lib/sparc/libawt.so
0x6e600000      /usr/jdk/instances/jdk1.5.0/jre/lib/sparc/libmlib_image.so
0x74040000      /usr/jdk/instances/jdk1.5.0/jre/lib/sparc/headless/libmawt.so
0x703a0000      /usr/jdk/instances/jdk1.5.0/jre/lib/sparc/libnio.so
0x70260000      /lib/librt.so.1
0x70240000      /lib/libaio.so.1
0x70220000      /usr/lib/libsendfile.so.1
...
0x68e80000      /usr/local/lib/libstdc++.so.5
0x700b0000      /usr/local/lib/libgcc_s.so.1
...
0x703c0000      /lib/libpthread.so.1
0x6ff50000      /usr/local/lib/libz.so

...

Signal Handlers:
SIGSEGV: [libjvm.so+0x70d9b0], sa_mask[0]=0xffbffeff, sa_flags=0x00000004
SIGBUS: [libjvm.so+0x70d9b0], sa_mask[0]=0xffbffeff, sa_flags=0x00000004
SIGFPE: [libjvm.so+0x2731e4], sa_mask[0]=0xffbffeff, sa_flags=0x0000000c
SIGPIPE: [libjvm.so+0x2731e4], sa_mask[0]=0xffbffeff, sa_flags=0x0000000c
SIGILL: [libjvm.so+0x2731e4], sa_mask[0]=0xffbffeff, sa_flags=0x0000000c
SIGUSR1: SIG_DFL, sa_mask[0]=0x00000000, sa_flags=0x00000000
SIGUSR2: SIG_DFL, sa_mask[0]=0x00000000, sa_flags=0x00000000
SIGHUP: [libjvm.so+0x67ef14], sa_mask[0]=0xffbffeff, sa_flags=0x00000004
SIGINT: [libjvm.so+0x67ef14], sa_mask[0]=0xffbffeff, sa_flags=0x00000004
SIGQUIT: [libjvm.so+0x67ef14], sa_mask[0]=0xffbffeff, sa_flags=0x00000004
SIGTERM: [libjvm.so+0x67ef14], sa_mask[0]=0xffbffeff, sa_flags=0x00000004


---------------  S Y S T E M  ---------------

OS:                         Solaris 10 3/05 s10_74L2a SPARC
           Copyright 2005 Sun Microsystems, Inc.  All Rights Reserved.
                        Use is subject to license terms.
                            Assembled 22 January 2005
                       Preinstall part number 259-4425-01

uname:SunOS 5.10 Generic_138888-03 sun4u  (T2 libthread)
rlimit: STACK 8192k, CORE infinity, NOFILE 65536, AS infinity
load average:0.85 0.75 0.57

CPU:total 2 has_v8, has_v9, has_vis1, has_vis2, is_ultra3

Memory: 8k page, physical 8388608k(2633448k free)

vm_info: Java HotSpot(TM) Server VM (1.5.0_17-b04) for solaris-sparc, built on Nov 10 2008 02:34:32 by unknown with unknown Workshop:0x550

Kenton Varda

Mar 9, 2010, 2:27:56 PM
to Franz Allan Valencia See, Protocol Buffers
On Tue, Mar 9, 2010 at 12:27 AM, Franz Allan Valencia See <fran...@gmail.com> wrote:
> Actually, that String/CharSequence is being placed in a Java class generated by protobuf.

Ah.  Then I think the problem is simply that NUL characters are not allowed in UTF-8 text.  So you need to find some other way to delimit your messages.
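
One possible alternative to a delimiter character, sketched here only as an illustration: prefix each text with its byte length and carry the packed result as raw bytes. This assumes the payload would travel in a protobuf `bytes` field (which the error message itself suggests for raw bytes) and that the C++ side is changed to unpack it; neither is something CMeCab-Java is known to support as-is. The class `LengthPrefixedBatch` is hypothetical. If the .proto file can be edited, a `repeated string` field would avoid hand-rolled framing altogether.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: packs texts as [4-byte big-endian length][UTF-8 bytes]...
final class LengthPrefixedBatch {
    // Encodes each text as UTF-8 and writes it after its length.
    public static byte[] pack(List<String> texts) {
        List<byte[]> encoded = new ArrayList<byte[]>();
        int total = 0;
        for (String text : texts) {
            byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
            encoded.add(bytes);
            total += 4 + bytes.length;
        }
        ByteBuffer buf = ByteBuffer.allocate(total);
        for (byte[] bytes : encoded) {
            buf.putInt(bytes.length);
            buf.put(bytes);
        }
        return buf.array();
    }

    // Reads length-prefixed UTF-8 chunks until the buffer is exhausted.
    public static List<String> unpack(byte[] data) {
        List<String> texts = new ArrayList<String>();
        ByteBuffer buf = ByteBuffer.wrap(data);
        while (buf.hasRemaining()) {
            byte[] bytes = new byte[buf.getInt()];
            buf.get(bytes);
            texts.add(new String(bytes, StandardCharsets.UTF_8));
        }
        return texts;
    }

    public static void main(String[] args) {
        List<String> texts = Arrays.asList("the", "quick", "brown", "fox");
        System.out.println(unpack(pack(texts))); // prints [the, quick, brown, fox]
    }
}
```

Unlike a delimiter, a length prefix places no restriction on which characters the individual texts may contain.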
 
> Which debugger would you suggest? Pardon, I'm a noob on native libraries.

Depends on the platform and compiler.