Can we conclude anything...

George Petasis

unread,

Jun 24, 2013, 12:16:01 PM6/24/13

to

from this?

Program received signal SIGSEGV, Segmentation fault.
__memmove_ssse3_back () at
../sysdeps/x86_64/multiarch/memcpy-ssse3-back.S:97
97 movdqu (%rsi), %xmm0
Missing separate debuginfos, use: debuginfo-install
mysql-libs-5.5.31-1.fc18.x86_64 pcre-8.31-5.fc18.x86_64
(gdb) where
#0 __memmove_ssse3_back ()
at ../sysdeps/x86_64/multiarch/memcpy-ssse3-back.S:97
#1 0x00007ffff7e60690 in Tcl_DuplicateObj ()
from /opt/ActiveTcl-8.6/lib/libtcl8.6.so
#2 0x00007ffff7e2b412 in TEBCresume ()
from /opt/ActiveTcl-8.6/lib/libtcl8.6.so
#3 0x00007ffff7dbbd57 in TclNRRunCallbacks ()
from /opt/ActiveTcl-8.6/lib/libtcl8.6.so
#4 0x00007ffff7dc0a55 in TclEvalEx () from
/opt/ActiveTcl-8.6/lib/libtcl8.6.so
#5 0x00007ffff7e54e07 in Tcl_FSEvalFileEx ()
from /opt/ActiveTcl-8.6/lib/libtcl8.6.so
#6 0x00007ffff7e5a42d in Tcl_MainEx ()
from /opt/ActiveTcl-8.6/lib/libtcl8.6.so
#7 0x0000000000400a36 in main ()

I got this after a lengthy calculation, in my application, after
processing 50 million documents. The application at that time uses about
15Gb of memory (the machine has 32GB).

ActiveState 8.6, Linux 64 (Fedora 18). I really miss an ActiveTcl build
with debug symbols...

Also, the second time I observe the same crash (I cannot remember if the
document count is exactly the same though).

George

Alexandre Ferrieux

unread,

Jun 24, 2013, 3:19:57 PM6/24/13

to

On Monday, June 24, 2013 6:16:01 PM UTC+2, George Petasis wrote:
>
> ActiveState 8.6, Linux 64 (Fedora 18). I really miss an ActiveTcl build
> with debug symbols...

Any hope to reproduce with a tclsh compiled by yourself with debug symbols ?

-Alex

Georgios Petasis

unread,

Jun 25, 2013, 6:40:56 AM6/25/13

to

I will try (as always the problem is the packages I am using - too many).

I think that the problem may be caused by specific input - I am trying
to locate it and test. I got a crash even when spitting the 90 million
documents into chunks of 30 million documents, in the 2nd chunk.

But the script needs many hours to get there...
(As I merge dictionaries in tcl loops :D)

George

tomás zerolo

unread,

Jun 25, 2013, 12:57:28 PM6/25/13

to

George Petasis <peta...@yahoo.gr> writes:

> from this?
>
> Program received signal SIGSEGV, Segmentation fault.
> __memmove_ssse3_back () at
> ../sysdeps/x86_64/multiarch/memcpy-ssse3-back.S:97
> 97 movdqu (%rsi), %xmm0
> Missing separate debuginfos, use: debuginfo-install
> mysql-libs-5.5.31-1.fc18.x86_64 pcre-8.31-5.fc18.x86_64
> (gdb) where
> #0 __memmove_ssse3_back ()
> at ../sysdeps/x86_64/multiarch/memcpy-ssse3-back.S:97

This is a shot in the dark. Very dark. That said...

Once upon a time, libc's implementation of memcpy decided that it could
use the freedom given by the docs: "The memory areas must not overlap"
and started copying mem downward on some archs (notably x86_64 AFAIR).

Some applications, which made tactly use of memcpy "copying upwards"
started to break. Hilarity ensued:

https://lwn.net/Articles/414467/
http://sourceware.org/ml/libc-alpha/2011-04/msg00016.html

Might be that?

Regards
-- tomás

Georgios Petasis

unread,

Jun 28, 2013, 6:40:36 AM6/28/13

to

Here is a stack trace from the fossil head of 8.6 (as of last evening):

Program received signal SIGSEGV, Segmentation fault.

__memcpy_ssse3_back () at
../sysdeps/x86_64/multiarch/memcpy-ssse3-back.S:135
135 movdqu -16(%rsi), %xmm0
(gdb) where
#0 __memcpy_ssse3_back ()
at ../sysdeps/x86_64/multiarch/memcpy-ssse3-back.S:135
#1 0x00007ffff7d429b2 in Tcl_DuplicateObj (objPtr=0xdeae20)
at /home/tcl/sources/tcl/generic/tclObj.c:1587
#2 0x00007ffff7cf412e in TEBCresume (data=0x196d386e8, interp=0x613680,
result=0) at /home/tcl/sources/tcl/generic/tclExecute.c:3683
#3 0x00007ffff7c3129b in TclNRRunCallbacks (interp=0x613680, result=0,
rootPtr=0x0) at /home/tcl/sources/tcl/generic/tclBasic.c:4359
#4 0x00007ffff7c30dee in Tcl_EvalObjv (interp=0x613680, objc=3,
objv=0x620f80, flags=2097152)
at /home/tcl/sources/tcl/generic/tclBasic.c:4159
#5 0x00007ffff7c3322e in TclEvalEx (interp=0x613680,
script=0x670b30 "#!/bin/bash\n# The next line is executed by
/bin/sh, but not tcl \\\nexport ScriptsDir=\"$( cd \"$( dirname
\"${BASH_SOURCE[0]}\" )\" && pwd )\"\n# The next line is executed by
/bin/sh, but not tcl \\\ncd $Script"...,
numBytes=1563, flags=0, line=36, clNextOuter=0x0,
outerScript=0x670b30 "#!/bin/bash\n# The next line is executed by
/bin/sh, but not tcl \\\nexport ScriptsDir=\"$( cd \"$( dirname
\"${BASH_SOURCE[0]}\" )\" && pwd )\"\n# The next line is executed by
/bin/sh, but not tcl \\\ncd $Script"...) at
/home/tcl/sources/tcl/generic/tclBasic.c:5265
#6 0x00007ffff7d3009a in Tcl_FSEvalFileEx (interp=0x613680,
pathPtr=0x665d10,
encodingName=0x0) at /home/tcl/sources/tcl/generic/tclIOUtil.c:1809
#7 0x00007ffff7d3a0f6 in Tcl_MainEx (argc=-1, argv=0x7fffffffdfe0,
---Type <return> to continue, or q <return> to quit---
appInitProc=0x400992 <Tcl_AppInit>, interp=0x613680)
at /home/tcl/sources/tcl/generic/tclMain.c:417
#8 0x000000000040098b in main (argc=7, argv=0x7fffffffdfa8)
at /home/tcl/sources/tcl/unix/tclAppInit.c:84

I still have to check whether this is caused by a particular input
(although it seems to happen at different document ids from the database
- I now order them despite being time consuming, but to eliminate the
randomness of the select sql query...)

George

Georgios Petasis

unread,

Jun 28, 2013, 6:56:26 AM6/28/13

to

If I print the object that is duplicated, I get:

(gdb) p *(Tcl_Obj *) 0xdeae20
$1 = {refCount = 192,
bytes = 0xc1000000c1 <Address 0xc1000000c1 out of bounds>, length = 193,
typePtr = 0xc1000000c1, internalRep = {longValue = 828928688321,
doubleValue = 4.0954518775165223e-312, otherValuePtr = 0xc1000000c1,
wideValue = 828928688321, twoPtrValue = {ptr1 = 0xc1000000c1,
ptr2 = 0xc1000000c1}, ptrAndLongRep = {ptr = 0xc1000000c1,
value = 828928688321}}}
(gdb)

Is there a way to see what type the object is?

George

miguel.sofer

unread,

Jun 28, 2013, 11:21:15 AM6/28/13

to

OK ... that's mem corruption, not a refcounting problem or duplicate
object. The sign for that is that the bytes AND typeptr are out of
bounds (and equal, between themselves and to the internal rep too!).

Of course, the mem corruption MAY have its origin in a refcounting
problem, but it could well be something else: buffer overrun, use of
uninitialized or freed (and realloc'ed?) mem, whatever.

Those things are very hard to debug. Best bet to "catch them" is by
running under valgrind ... which in your case is probably too time
consuming?

Georgios Petasis

unread,

Jul 11, 2013, 3:31:26 PM7/11/13

to miguel.sofer

After running for many days under valgrind, it seems that the first
problem happens at exactly the same point the previous run crashed.

But it is a problem in one of my modules (a flex scanner), not tcl:

64494616
64494374
==12352== Invalid write of size 4
==12352== at 0xD950240: HTokenizerlex() (in
/opt/Ellogon/share/modules/HTokenizer/Linux/x86_64/libHTokenizer.so)
==12352== by 0xD952251: creole_HTokenizer(void*, Tcl_Interp*, int,
Tcl_Obj**) (in
/opt/Ellogon/share/modules/HTokenizer/Linux/x86_64/libHTokenizer.so)
==12352== by 0x4C5620D: TclNREvalObjv (tclBasic.c:4325)
==12352== by 0x4D1708A: TEBCresume (tclExecute.c:2900)
==12352== by 0x4C5629A: TclNRRunCallbacks (tclBasic.c:4359)
==12352== by 0x4C58E48: TclEvalObjEx (tclBasic.c:5939)
==12352== by 0x4C58DE1: Tcl_EvalObjEx (tclBasic.c:5920)
==12352== by 0x4C7C51C: Tcl_TimeObjCmd (tclCmdMZ.c:4095)
==12352== by 0x4C5620D: TclNREvalObjv (tclBasic.c:4325)
==12352== by 0x4D1708A: TEBCresume (tclExecute.c:2900)
==12352== by 0x4C5629A: TclNRRunCallbacks (tclBasic.c:4359)
==12352== by 0x4C55DED: Tcl_EvalObjv (tclBasic.c:4159)
==12352== Address 0xfe26dd8 is 0 bytes after a block of size 65,544 alloc'd
==12352== at 0x4A0887C: malloc (vg_replace_malloc.c:270)
==12352== by 0xD95189B: HTokenizerlex() (in
/opt/Ellogon/share/modules/HTokenizer/Linux/x86_64/libHTokenizer.so)
==12352== by 0xD952251: creole_HTokenizer(void*, Tcl_Interp*, int,
Tcl_Obj**) (in
/opt/Ellogon/share/modules/HTokenizer/Linux/x86_64/libHTokenizer.so)
==12352== by 0x4C5620D: TclNREvalObjv (tclBasic.c:4325)
==12352== by 0x4D1708A: TEBCresume (tclExecute.c:2900)
==12352== by 0x4C5629A: TclNRRunCallbacks (tclBasic.c:4359)
==12352== by 0x4C58E48: TclEvalObjEx (tclBasic.c:5939)
==12352== by 0x4C58DE1: Tcl_EvalObjEx (tclBasic.c:5920)
==12352== by 0x4C7C51C: Tcl_TimeObjCmd (tclCmdMZ.c:4095)
==12352== by 0x4C5620D: TclNREvalObjv (tclBasic.c:4325)
==12352== by 0x4D1708A: TEBCresume (tclExecute.c:2900)
==12352== by 0x4C5629A: TclNRRunCallbacks (tclBasic.c:4359)
==12352==
64494239

Not much info though...

And after about 15 documents, it stops under valgrind:

==12352== Invalid read of size 1
==12352== at 0x4A0A300: memcpy@@GLIBC_2.14 (mc_replace_strmem.c:882)
==12352== by 0x4D679B1: Tcl_DuplicateObj (tclObj.c:1587)
==12352== by 0x4CFE529: DictIncrCmd (tclDictObj.c:2166)
==12352== by 0x4C5620D: TclNREvalObjv (tclBasic.c:4325)
==12352== by 0x4D1708A: TEBCresume (tclExecute.c:2900)
==12352== by 0x4C5629A: TclNRRunCallbacks (tclBasic.c:4359)
==12352== by 0x4C55DED: Tcl_EvalObjv (tclBasic.c:4159)
==12352== by 0x4C5822D: TclEvalEx (tclBasic.c:5265)
==12352== by 0x4D55099: Tcl_FSEvalFileEx (tclIOUtil.c:1809)
==12352== by 0x4D5F0F5: Tcl_MainEx (tclMain.c:417)
==12352== by 0x40098A: main (tclAppInit.c:84)
==12352== Address 0xc1000000c1 is not stack'd, malloc'd or (recently)
free'd
==12352==
==12352==
==12352== Process terminating with default action of signal 11 (SIGSEGV)
==12352== Access not within mapped region at address 0xC1000000C1
==12352== at 0x4A0A300: memcpy@@GLIBC_2.14 (mc_replace_strmem.c:882)
==12352== by 0x4D679B1: Tcl_DuplicateObj (tclObj.c:1587)
==12352== by 0x4CFE529: DictIncrCmd (tclDictObj.c:2166)
==12352== by 0x4C5620D: TclNREvalObjv (tclBasic.c:4325)
==12352== by 0x4D1708A: TEBCresume (tclExecute.c:2900)
==12352== by 0x4C5629A: TclNRRunCallbacks (tclBasic.c:4359)
==12352== by 0x4C55DED: Tcl_EvalObjv (tclBasic.c:4159)
==12352== by 0x4C5822D: TclEvalEx (tclBasic.c:5265)
==12352== by 0x4D55099: Tcl_FSEvalFileEx (tclIOUtil.c:1809)
==12352== by 0x4D5F0F5: Tcl_MainEx (tclMain.c:417)
==12352== by 0x40098A: main (tclAppInit.c:84)
==12352== If you believe this happened as a result of a stack
==12352== overflow in your program's main thread (unlikely but
==12352== possible), you can try to increase the size of the
==12352== main thread stack using the --main-stacksize= flag.
==12352== The main thread stack size used in this run was 8388608.

George

miguel.sofer

unread,

Jul 11, 2013, 3:57:39 PM7/11/13

to

OK, the first one seems to be a buffer overrun in HTokenizerlex -
writing one past the allocated memory. The key info is in

==12352== Invalid write of size 4

...

==12352== Address 0xfe26dd8 is 0 bytes after a block of size 65,544
alloc'd

It sometimes crashes and sometimes doesn't: it depends on who owns the
bytes you are illegally trying to overwrite. If it belongs to this same
process, the OS doesn't notice and lets you go on ... until mem
corruption kills you later in a completely unrelated spot. OTOH, if you
do not own that mem, the OS will kill you - it's a segfault.

Fix this mem corruption first, and then rerun. The rest could (but
doesn't have to) be a consequence of this one.

Georgios Petasis

unread,

Aug 8, 2013, 2:29:12 PM8/8/13

to

Hi all,

The problem finally was in the input. One of the documents had a "word"
that exceeded the 16kB input buffer of gnu flex, causing the memory
corruption which leads to a crash after a while.

George

Andreas Leitgeb

unread,

Aug 18, 2013, 12:52:41 PM8/18/13

to

Not knowing much about gnu flex, I wonder, if this isn't a
bug in gnu flex, if it deals so badly with too long "words"...

(Unless this is some documented performance-versus-safety tradeoff)