Aggressive Golang Garbage Collection Issues When Using cgo?


Steven Estes

May 23, 2019, 12:34:30 AM
to golang-nuts

YottaDB is a key-value NoSQL database with a C API. To make it more friendly to Golang users, we have created a Go wrapper (see https://godoc.org/lang.yottadb.com/go/yottadb). A Rust wrapper is in development, with other languages to follow.

With the Golang wrapper, we have run into a variety of panics and crashes that take several forms. Those forms in order of frequency are as follows:

  • Panics due to “sweep increased allocation count”.

  • Panics due to invalid address (bad pointer) in Go heap.

  • SIG-11 failures in various places, usually related to garbage collection.

  • Panics due to “unknown return address” when trying to return from functions.

  • Mutex deadlocks (everyone waiting and nobody holding) and HUGE memory usage.

Research into these issues suggests that the cause is “usually” either bad cgo code or “race” conditions. The problems get MUCH worse when GOGC is set to 1. Running with either of the options “-msan” or “-race” has not flagged any issues. I am using Golang 1.12.5 on Ubuntu 18.04.

A bit of background:

The C API that we are wrapping has parameters that are pointers to structures that themselves contain pointers to information. Consequently, we decided to use Golang structures mostly as anchor points for C-allocated memory holding the actual structures, since cgo does not allow Golang structures passed to C to contain pointers to Golang memory. Since Golang is not a language where a coder needs to track and release memory, we make use of Finalizers: when the C storage is allocated, we set a Finalizer so that when the base Golang structure is garbage collected, a function runs that releases the allocated C memory attached to it. So far so good.

Here is the basic C structure that this code uses (C.ydb_buffer_t):

typedef struct
{
	unsigned int	len_alloc;
	unsigned int	len_used;
	char		*buf_addr;
} ydb_buffer_t;


Here are the related Golang structures:

type BufferT struct {
	cbuft    *internalBufferT // Allows BufferT to be copied via assignment
	ownsBuff bool             // If true, we should clean the cbuft when Free'd
}

type internalBufferT struct {
	cbuft *C.ydb_buffer_t     // C flavor of the ydb_buffer_t struct
}

type BufferTArray struct {
	cbuftary *internalBufferTArray
}

type internalBufferTArray struct {
	elemUsed  uint32            // Number of elements used
	elemAlloc uint32            // Number of elements in array
	cbuftary  *[]C.ydb_buffer_t // C flavor of ydb_buffer_t array
}


The internal variant of each of BufferT and BufferTArray exists so that a BufferT/BufferTArray can be copied via an assignment statement without double-free issues, use of storage after free, etc. It is a technique we found suggested for dealing with this issue, as was use of the Finalizer to cause the internal structure to release its C storage – which involves not only releasing the C control block but also the buffer it points to, and in the case of the array, the same for each element in the array.
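
To make the pattern concrete, here is a minimal sketch of roughly what the allocation side might look like (the constructor name and header path are illustrative assumptions, not the actual YDBGo code):

package yottadb

/*
#include <stdlib.h>
#include "libyottadb.h" // assumed header declaring ydb_buffer_t
*/
import "C"

import (
	"runtime"
	"unsafe"
)

// newBufferT is a hypothetical constructor sketching the scheme described
// above: the Go structs anchor C-allocated memory only, and a finalizer on
// the internal struct releases the data buffer and then the control block
// once the Go side becomes unreachable.
func newBufferT(size uint32) *BufferT {
	cbuft := (*C.ydb_buffer_t)(C.malloc(C.size_t(C.sizeof_ydb_buffer_t)))
	cbuft.buf_addr = (*C.char)(C.malloc(C.size_t(size)))
	cbuft.len_alloc = C.uint(size)
	cbuft.len_used = 0
	internal := &internalBufferT{cbuft: cbuft}
	runtime.SetFinalizer(internal, func(ibt *internalBufferT) {
		C.free(unsafe.Pointer(ibt.cbuft.buf_addr)) // free the data buffer
		C.free(unsafe.Pointer(ibt.cbuft))          // then the control block
	})
	return &BufferT{cbuft: internal, ownsBuff: true}
}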

Our API necessarily does a lot with cgo and, due to the array of C structures (and cgo not supporting arrays or variadic plists – both of which we require), has to do its own pointer arithmetic to access the element we want when dealing with individual array elements anchored by internalBufferTArray. We have a similar issue with one of the C API calls (ydb_lock_st()) because it takes a variadic plist and, again, cgo doesn’t support variadic plists. So we’ve cooked up a C function that takes an array of uintptr values and turns it into a variadic plist call. So again, we’re loading up elements of this parameter array using pointer arithmetic and converting those pointers to uintptr values – which seems to make the Go structures eligible for collection if they are not subsequently used.
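
To illustrate, the element access amounts to something like the following hypothetical helper (legal only because the array lives in C memory, which Go’s collector neither scans nor moves; the cgo preamble from the sketch above is assumed):

// bufferElem indexes into a C-allocated array of ydb_buffer_t, doing the
// arithmetic cgo will not do for us. Hypothetical helper, bounds checks
// omitted; the uintptr round-trip stays within a single expression, as the
// unsafe.Pointer rules require.
func bufferElem(ary *C.ydb_buffer_t, i uint32) *C.ydb_buffer_t {
	return (*C.ydb_buffer_t)(unsafe.Pointer(
		uintptr(unsafe.Pointer(ary)) + uintptr(i)*uintptr(C.sizeof_ydb_buffer_t)))
}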

One issue we found and solved was with this latter usage. The scenario was this:

  1. Function call created a key that contained a variable name and two subscripts that were passed in to the LockST() wrapper call. Each of these was ultimately housed in a ydb_buffer_t structure though the subscripts are in the array version.

  2. The LockST() code looked at the structures and pulled out the C pointers to the ydb_buffer_t structures, then turned those addresses into uintptr values to store in the parameter array.

  3. At this point, it seems, since there were no more references to the passed-in Golang structures and the C pointers we had pulled out of them had been morphed into uintptr values, the Golang structures became eligible for garbage collection. But because there was a Finalizer set on these structures, the C storage was released while we were not finished with it – in fact, we had not even made the call to the C routine yet.

  4. We got errors in the C code and saw that though the structures were well-formed in the Go routine, the stuff that ended up being passed in was complete garbage.

  5. We “fixed” this issue by putting a meaningless reference to the golang variadic list that held all these values just prior to returning from the routine; this kept them from being garbage collected, and the call then ran successfully (see the sketch below).
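
In rough code terms, the scenario and the “fix” look like this sketch (the shim name and parameter handling are hypothetical, not the actual LockST() code):

// lockST sketches steps 1-5: C pointers pulled out of the Go structures
// are stored as uintptr values, which hides them from the GC; the
// meaningless reference at the end is the "fix" that keeps key (and
// therefore its finalizer) from being collected before the C call returns.
func lockST(key *BufferTArray, parms []uintptr) C.int {
	// ... fill parms[] with addresses taken from key's C storage ...
	rc := C.ydb_lock_shim(C.int(len(parms)), unsafe.Pointer(&parms[0])) // hypothetical shim
	_ = key // meaningless reference: key cannot be collected before this line
	return rc
}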

That fixed one of around 6 different failure types we were having. The test that mostly generates the failures (though others do it too) is where we are launching 10 goroutines that each do random operation types (set, get, lock, delete, next, previous, etc) for a set period of time, then exit. It is only when launching goroutines simultaneously that we see these errors. If we launch workers as separate processes, we do not see these issues. We also do not see these issues when the driver program is C instead of Golang.

The main C code base can be found at https://gitlab.com/YottaDB/DB/YDB (master branch) while the go wrapper can be found at https://gitlab.com/YottaDB/Lang/YDBGo (develop branch) – note LockST() is in simple_api.go and you can see the added reference “fix” at the end of it. But quite simply, nearly every call in this API has failed in one way or another with one of the symptoms mentioned above.

Is there some better way to do this? Should we be doing something different to escape the race in which the Finalizer/GC frees storage while we’re still using pointers to it? So far, we’ve been dealing in best guesses, as we do not know Golang’s internals. Although I’ve read some articles on Golang’s garbage collection, the best guess I’ve been able to come up with is that we have a race between our usage of C and Golang’s eagerness to GC the main structures, which then pulls the C memory out from under us while we are using it.

Note this gets slightly murkier when one understands that one of the API calls (TpST()) initiates a transaction and then calls back into a given Go routine to actually execute the transaction. And indeed, some of the cores we’ve seen show that we are either in or just coming out of a transaction, so the transaction itself garbage collected things we are still using – or perhaps its usage just reused some released storage from a previously executed transaction. <head spins> But even cutting the test down so that only one request type runs still produces both the “bad pointer in Go heap” and the “sweep increased allocation count” errors when run in 10 goroutines.

Here are the tops of some actual failures. I’ve not included the entire failure output, as it references a lot of different threads and so is fairly long, but let me know if you want to see them:

runtime: nelems=21 nalloc=5 previous allocCount=4 nfreed=65535
fatal error: sweep increased allocation count

goroutine 38 [running]:
runtime.throw(0x5288be, 0x20)
	/snap/go/3739/src/runtime/panic.go:617 +0x72 fp=0xc000157ab8 sp=0xc000157a88 pc=0x4395c2
runtime.(*mspan).sweep(0x7fae4f5d7258, 0x4c0600, 0x118e800)
	/snap/go/3739/src/runtime/mgcsweep.go:329 +0x8ef fp=0xc000157ba0 sp=0xc000157ab8 pc=0x43015f
runtime.sweepone(0x0)
	/snap/go/3739/src/runtime/mgcsweep.go:136 +0x282 fp=0xc000157c08 sp=0xc000157ba0 pc=0x42f612
runtime.deductSweepCredit(0x2000, 0x0)
	/snap/go/3739/src/runtime/mgcsweep.go:438 +0x80 fp=0xc000157c30 sp=0xc000157c08 pc=0x4302c0
runtime.(*mcentral).cacheSpan(0x7f5460, 0x4143f5)
	/snap/go/3739/src/runtime/mcentral.go:43 +0x61 fp=0xc000157c90 sp=0xc000157c30 pc=0x424fa1
runtime.(*mcache).refill(0x7fae4f5cc6d0, 0x4)
	/snap/go/3739/src/runtime/mcache.go:135 +0x86 fp=0xc000157cb0 sp=0xc000157c90 pc=0x424cd6
runtime.(*mcache).nextFree(0x7fae4f5cc6d0, 0x118ea04, 0x448441, 0x4d8420, 0x2582c18)
	/snap/go/3739/src/runtime/malloc.go:786 +0x88 fp=0xc000157ce8 sp=0xc000157cb0 pc=0x41b148
runtime.mallocgc(0x10, 0x502040, 0x4c6e01, 0x44db90)
	/snap/go/3739/src/runtime/malloc.go:939 +0x7a8 fp=0xc000157d98 sp=0xc000157ce8 pc=0x41ba98
runtime.growslice(0x502040, 0x0, 0x0, 0x0, 0x1, 0xc000157e98, 0x0, 0x7fae240015a0)
	/snap/go/3739/src/runtime/slice.go:181 +0x228 fp=0xc000157e00 sp=0xc000157d98 pc=0x44ddb8
main.genName(0x4c76a8, 0x0, 0x448441)
	/home/estess/go/src/rwalk4/randomWalk.go:66 +0x2d3 fp=0xc000157eb8 sp=0xc000157e00 pc=0x4c6e13
main.runProc(0x0, 0x0, 0xa, 0x3fc999999999999a, 0x0, 0x0)
	/home/estess/go/src/rwalk4/randomWalk.go:82 +0x45 fp=0xc000157f78 sp=0xc000157eb8 pc=0x4c7145
main.main.func2(0xc0000b8020, 0xc0000b8028, 0x0, 0xc0000b8010)
	/home/estess/go/src/rwalk4/randomWalk.go:141 +0x68 fp=0xc000157fc0 sp=0xc000157f78 pc=0x4c76a8
runtime.goexit()
	/snap/go/3739/src/runtime/asm_amd64.s:1337 +0x1 fp=0xc000157fc8 sp=0xc000157fc0 pc=0x464671
created by main.main
	/home/estess/go/src/rwalk4/randomWalk.go:133 +0x178



The other type of failure:

runtime: pointer 0xc000202648 to unallocated span span.base()=0xc000200000 span.limit=0xc000204000 span.state=0
runtime: found in object at *(0xc000137840+0x10)
object=0xc000137840 s.base()=0xc000136000 s.limit=0xc000138000 s.spanclass=10 s.elemsize=64 s.state=mSpanInUse
 *(object+0) = 0x524d82
 *(object+8) = 0xa
 *(object+16) = 0xc000202648 <==
 *(object+24) = 0x4
 *(object+32) = 0xc000202658
 *(object+40) = 0x4
 *(object+48) = 0xc000202668
 *(object+56) = 0x4
fatal error: found bad pointer in Go heap (incorrect use of unsafe or cgo?)

runtime stack:
runtime.throw(0x52b6cc, 0x3e)
	/snap/go/3739/src/runtime/panic.go:617 +0x72 fp=0x7f593affcd00 sp=0x7f593affccd0 pc=0x4395c2
runtime.findObject(0xc000202648, 0xc000137840, 0x10, 0x0, 0x0, 0x0)
	/snap/go/3739/src/runtime/mbitmap.go:397 +0x3bd fp=0x7f593affcd50 sp=0x7f593affcd00 pc=0x42237d
runtime.scanobject(0xc000137840, 0xc00003a670)
	/snap/go/3739/src/runtime/mgcmark.go:1174 +0x240 fp=0x7f593affcde0 sp=0x7f593affcd50 pc=0x42dc20
runtime.gcDrain(0xc00003a670, 0x7)
	/snap/go/3739/src/runtime/mgcmark.go:932 +0x209 fp=0x7f593affce38 sp=0x7f593affcde0 pc=0x42d3f9
runtime.gcBgMarkWorker.func2()
	/snap/go/3739/src/runtime/mgc.go:1926 +0x16c fp=0x7f593affce78 sp=0x7f593affce38 pc=0x46018c
runtime.systemstack(0x0)
	/snap/go/3739/src/runtime/asm_amd64.s:351 +0x66 fp=0x7f593affce80 sp=0x7f593affce78 pc=0x462616
runtime.mstart()
	/snap/go/3739/src/runtime/proc.go:1153 fp=0x7f593affce88 sp=0x7f593affce80 pc=0x43db80

goroutine 82 [GC worker (idle)]:
runtime.systemstack_switch()
	/snap/go/3739/src/runtime/asm_amd64.s:311 fp=0xc00004df60 sp=0xc00004df58 pc=0x4625a0
runtime.gcBgMarkWorker(0xc000039400)
	/snap/go/3739/src/runtime/mgc.go:1890 +0x1be fp=0xc00004dfd8 sp=0xc00004df60 pc=0x429afe
runtime.goexit()
	/snap/go/3739/src/runtime/asm_amd64.s:1337 +0x1 fp=0xc00004dfe0 sp=0xc00004dfd8 pc=0x464671
created by runtime.gcBgMarkStartWorkers
	/snap/go/3739/src/runtime/mgc.go:1784 +0x77

Running these tests is a bit complicated. We are working on putting together a docker container that has the runtime all set up, so that all that is needed is to fire up the test. If desired, this will be made available.

Anyway, any insights as to how to make progress with this issue would be most appreciated. I’ve tried to keep this as brief as possible. We don’t understand what the Golang errors are telling us, so if there are further explanations somewhere that would help, please do let us know.

Steve

Ian Lance Taylor

May 23, 2019, 1:29:07 AM
to Steven Estes, golang-nuts
On Wed, May 22, 2019 at 8:34 PM Steven Estes <est...@yottadb.com> wrote:
>
> At this point, it seems, since there were no more references to the passed-in Golang structures and the C pointers we had pulled out of them had been morphed to uintptr values, that the Golang structures became eligible for garbage collection. But because there was a Finalizer set for these structures, the C storage was released when we were not finished with it yet – in fact had not even made the call yet to the C routine.
>
> We got errors in the C code and saw that though the structures were well-formed in the Go routine, the stuff that ended up being passed in was complete garbage.
>
> We “fixed” this issue by putting a meaningless reference to the golang variadic list that held all these values just prior to returning from the routine which kept them from being garbage collected such that it ran successfully.

You should be using runtime.KeepAlive. That's what it's for. When
you put a finalizer on a Go pointer that points to a C pointer, and
the finalizer frees the C pointer, and you pass the C pointer to C
code, then you must be careful to use runtime.KeepAlive on the Go
pointer after every single call where it passes the C pointer into C.
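
A sketch of the shape this takes, using the hypothetical names from earlier in the thread:

// The KeepAlive goes after every call that hands the C pointer to C:
// buft (and therefore its finalizer) cannot be collected before the
// KeepAlive executes, so the C memory stays valid for the whole call.
func (buft *BufferT) lenUsed() uint32 {
	cbuft := buft.cbuft.cbuft      // the C pointer anchored by the Go structs
	used := uint32(cbuft.len_used) // reads C memory the finalizer would free
	runtime.KeepAlive(buft)        // finalizer cannot run before this point
	return used
}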

Ian

zod...@gmail.com

May 23, 2019, 1:11:53 PM
to golang-nuts
Thank you very much Ian. Being new to the language, I was unaware of KeepAlive() but had been wondering how you kept the GC's aggressive tendencies under control.

I know what I'll be doing today…

Steve


zod...@gmail.com

May 24, 2019, 1:06:05 AM
to golang-nuts
Unfortunately, after fitting the code with KeepAlive() calls (some probably superfluous), the issues and their frequency have not notably changed. It did allow me to remove the hack I had put in LockST() described in my original note, though. I've even gone as far as commenting out the finalizers, which, because of the flavor of the interface being used, probably won't leave much if any C memory stranded – not the way we would want it in production, but fine for a test. Even with the finalizers out of the way, the issues remain.

Is there something else I can try to isolate this (msan and race have not been fruitful)? Note our C code has some memory-sanity options in it too – we have a storage manager that, on request, will over-allocate every request, placing the requested storage in the middle and back-filling the rest with 0xdeadbeef values, with the backfills of the entire storage chain verified on every storage call. That too is running clean. Also, the first thing our C code does when it gets control is take out a lock, because our C code is not (yet) multi-threaded, so that part is single-threaded.

Steve

Jan Mercl

May 24, 2019, 5:12:43 AM
to zod...@gmail.com, golang-nuts
On Fri, May 24, 2019 at 3:05 AM <zod...@gmail.com> wrote:
>
> Unfortunately, after fitting the code with KeepAlive() calls (some probably superfluous), the issues and their frequency have not notably changed. It did allow me to remove the hack I had put in LockST() described in my original note though. I've even gone as far as commenting out the finalizers, which because of the flavor of the interface being used, probably won't leave much if any C memory stranded but is not the way we would want it in production. However it is fine for a test. Even with the finalizers out of the way, the issues remain.
>
> Is there something else I can try to isolate this (msan and race have not been fruitful)? Note our C code has some memory sanity options in it too - we have a storage manager in it that on request will over-allocate all requests placing the requested storage in the middle, then back-filling all but the requested storage with 0xdeadbeef values with the entire storage chain having their backfills verified on every storage invocation. That too is running clean. Also, the first thing our C code does when it gets control is take out a lock because our C code is not (yet) multi-threaded so that part is single-threaded.

I took a look at the code. Some things that stand out immediately:

- Dozens of comments in Go code using "Golang" as the name of the language.

- Yoda with happy is

func (buft *BufferT) Free() {
	if nil == buft {
		return
	}
	if true == buft.ownsBuff {
		buft.cbuft.Free()
	}
	buft.cbuft = nil
}

(https://gitlab.com/YottaDB/Lang/YDBGo/blob/4a8c7ab467677c664e1a52ac2b87a9a537cbb605/buffer_t.go#L138)
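
(For comparison, "Yoda conditions" put the constant first; the conventional Go spelling of the same function, behavior unchanged, would be:)

func (buft *BufferT) Free() {
	if buft == nil {
		return
	}
	if buft.ownsBuff {
		buft.cbuft.Free()
	}
	buft.cbuft = nil
}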

- It seems that in more than one place the code violates the rules
about Go pointers crossing the CGo barrier, for example at
https://gitlab.com/YottaDB/Lang/YDBGo/blob/4a8c7ab467677c664e1a52ac2b87a9a537cbb605/buffer_t.go#L39

- Exported API that has unsafe.Pointer typed parameters:
https://gitlab.com/YottaDB/Lang/YDBGo/blob/4a8c7ab467677c664e1a52ac2b87a9a537cbb605/buffer_t.go#L39
while not even mentioning what's safe to pass as the unsafe pointer
argument. Must it be only pointer to C allocated memory or can it be a
pointer to anything managed by the Go runtime?

I looked at the code only for a few minutes, but the impression I got
is that this is not at all about "Aggressive Golang Garbage Collection
Issues When Using cgo". I think it is instead about not following, or
ignoring, the documented rules for Go/CGo interaction, as well as the
documented rules about safe/supported uses of unsafe.Pointer.

Ian Lance Taylor

May 24, 2019, 12:21:29 PM
to zod...@gmail.com, golang-nuts
On Thu, May 23, 2019 at 9:05 PM <zod...@gmail.com> wrote:
>
> Is there something else I can try to isolate this (msan and race have not been fruitful)? Note our C code has some memory sanity options in it too - we have a storage manager in it that on request will over-allocate all requests placing the requested storage in the middle, then back-filling all but the requested storage with 0xdeadbeef values with the entire storage chain having their backfills verified on every storage invocation. That too is running clean. Also, the first thing our C code does when it gets control is take out a lock because our C code is not (yet) multi-threaded so that part is single-threaded.

I haven't looked at your code so I don't know if it will help, but you
can try setting the environment variable GODEBUG=cgocheck=2 when you
run the program. That will turn on a more intensive set of tests for
whether you are violating the cgo pointer passing rules.

In general, read those rules, at https://golang.org/cmd/cgo, and make
sure that you are following them.

Ian

zod...@gmail.com

May 24, 2019, 3:42:45 PM
to golang-nuts
Thank you for your reply – my responses are embedded below.


On Friday, May 24, 2019 at 1:12:43 AM UTC-4, Jan Mercl wrote:
On Fri, May 24, 2019 at 3:05 AM <zod...@gmail.com> wrote:
>
> Unfortunately, after fitting the code with KeepAlive() calls (some probably superfluous), the issues and their frequency have not notably changed. It did allow me to remove the hack I had put in LockST() described in my original note though. I've even gone as far as commenting out the finalizers, which because of the flavor of the interface being used, probably won't leave much if any C memory stranded but is not the way we would want it in production. However it is fine for a test. Even with the finalizers out of the way, the issues remain.
>
> Is there something else I can try to isolate this (msan and race have not been fruitful)? Note our C code has some memory sanity options in it too - we have a storage manager in it that on request will over-allocate all requests placing the requested storage in the middle, then back-filling all but the requested storage with 0xdeadbeef values with the entire storage chain having their backfills verified on every storage invocation. That too is running clean. Also, the first thing our C code does when it gets control is take out a lock because our C code is not (yet) multi-threaded so that part is single-threaded.

I took a look at the code. Some things that stand out immediately:

- Dozen of comments in Go code using "Golang" as the name of the language.

[SEstes] I had been told by a Go consultant to refer to it as Golang to reduce confusion (much like how this forum is named). If that's in error, I apologize and will change the references. But for the record, there are 13 references to "golang" in the main directory, not dozens.

- Yoda with happy is

        func (buft *BufferT) Free() {
                if nil == buft {
                        return
                }
                if true == buft.ownsBuff {
                        buft.cbuft.Free()
                }
                buft.cbuft = nil
        }

(https://gitlab.com/YottaDB/Lang/YDBGo/blob/4a8c7ab467677c664e1a52ac2b87a9a537cbb605/buffer_t.go#L138)

- It seems that in more than one place the code violates the rules
about Go pointers crossing the CGo barrier, for example at
https://gitlab.com/YottaDB/Lang/YDBGo/blob/4a8c7ab467677c664e1a52ac2b87a9a537cbb605/buffer_t.go#L39

[SEstes] Since line 39 is a comment line in the header of a section, I'm not sure what code you are referring to. But the *only* place that we knowingly pass a pointer to golang storage into C is in the key_t.go routines NodeNextST() and NodePrevST(), where we pass a pointer to a Go uint32 into C (but do so as a C.int because that is what C expects). All other places where we pass structures or buffers into C, we are passing the address of C.malloc()'d memory which is *anchored* in Go structures. This is permissible. What is not permissible is passing the address of a structure (Go or C) that contains pointers to Go storage (at least as of my last reading – not sure if that has recently changed). We have tried to do what we needed to do without violating any cgo rules – which has been, at times, difficult, but I think we succeeded.


- Exported API that has unsafe.Pointer typed parameters:
https://gitlab.com/YottaDB/Lang/YDBGo/blob/4a8c7ab467677c664e1a52ac2b87a9a537cbb605/buffer_t.go#L39
while not even mentioning what's safe to pass as the unsafe pointer
argument. Must it be only pointer to C allocated memory or can it be a
pointer to anything managed by the Go runtime?

[SEstes] Admittedly, this BufferTFromPtr routine could be documented better. While the existing comments and the name do mostly hint at its usage, I will update this to be more descriptive. The routine takes an unsafe pointer because it is driven by the transaction callback routine, so the overall topology is Go -> C -> Go, but the C side of this interface has a generic parameter, so it is passed as a void * in C – which of course becomes an unsafe.Pointer in Go. From discussing with the group, the fact that this routine is exported is an artifact of an earlier version, so it won't be exported in the next update.

I looked at the code only for a few minutes, but the impression I got
is that this is not at all about "Aggressive Golang Garbage Collection
Issues When Using cgo". I think it instead about not
following/ignoring the documented rules for Go/CGo interaction as well
as not following/ignoring the documented rules about safe/supported
uses of unsafe.Pointers.

[SEstes] The original problem which was described as being in LockST() was indeed due to aggressive GC and was cured by Ian's suggestion to use KeepAlive(). I thought the problems had a single origin but was evidently wrong. Something else is going on which has yet to be understood.

[SEstes] As for cgo rules, I assure you I have read and put into practice every documented rule for cgo/Go interaction, as well as the rules for uses of unsafe pointers, as best I am able. There are no "known" violations. There are constraints on what I'm able to do when wrapping existing (released and largely immutable) interfaces. Obviously, if something were just flat not possible, we'd create a new call to do what we want, but I'm of the opinion that this code follows the rules – though since it interfaces with C, and cgo doesn't support some of the things C does (array references and variadic plists, for example), we have to get a little off the beaten path for workarounds. If there is some way the code does NOT follow the rules, I really would like to know exactly where, and how it could work instead, if possible. We do not intend for production code to violate rules. That's just a recipe for trouble.

[SEstes] Just to make clear that one doesn't need to look at the entire wrapper, I have a test that executes a single API call (DeleteE() from easy_api.go) over and over against an empty database (so it always gets a not-found error). In 20 minutes of runtime (done as 20 consecutive 1-minute runs), around half of those runs will die with either the Go heap issue or the sweep-increased-allocation issue. And they generally fail in the first or last 5 seconds of the run – but of course not always. Here is what the test routine looks like (released under AGPL):

//////////////////////////////////////////////////////////////////
//                                                              //
// Copyright (c) 2018-2019 YottaDB LLC and/or its subsidiaries. //
// All rights reserved.                                         //
//                                                              //
//      This source code contains the intellectual property    //
//      of its copyright holder(s), and is made available      //
//      under a license.  If you do not know the terms of      //
//      the license, please stop and do not read further.      //
//                                                              //
//////////////////////////////////////////////////////////////////
package main

/* Run for a specified duration calling random features of the YottaDB Go Easy API.

We fork off N threads, where each has a chance of:

- Getting a global/local
- Setting a global/local
- Killing a global/local
- Data'ing a global/local
- Walking through a global/local
- Walking through a subscript
- Incrementing a global/local
- Settings locks on a global/local
- Starting a TP transaction
- Ending a TP transaction
- Creating routines in a TP transaction

The goal is to run for some time with no panics
*/

import (
    "fmt"
    "lang.yottadb.com/go/yottadb"
    "math/rand"
    "sync"
    "time"
)

const DRIVER_THREADS = 10
const MAX_DEPTH = 10
const NEST_RATE = 0.20
const TEST_TIMEOUT = 60 // test timeout in seconds for the driver threads to stop

func Ok() {
    // Noop that means everything is OK
}

type testSettings struct {
    maxDepth     int
    nestedTpRate float64
}

/* Generate a variable to be used in runProc()
 */
func genName() []string {
    var ret []string
    num_subs := rand.Int() % 5
    global_id := rand.Int() % 4
    switch global_id {
    case 0:
        ret = append(ret, "^MyGlobal1")
    case 1:
        ret = append(ret, "^MyGlobal2")
    case 2:
        ret = append(ret, "MyLocal1xx")
    case 3:
        ret = append(ret, "MyLocal2xx")
    }
    for i := 0; i < num_subs; i++ {
        ret = append(ret, fmt.Sprintf("sub%d", i))
    }
    return ret
}

/* randomly attempts various EasyAPI functions with a generated variable
 * see header comment for details
 */
func runProc(tptoken uint64, errstr *yottadb.BufferT, settings testSettings, curDepth int) int32 {
    t := genName()
    varname := t[0]
    subs := t[1:]
    delType := rand.Float64()
    var err error
    if delType < 0.5 {
        err = yottadb.DeleteE(tptoken, errstr, yottadb.YDB_DEL_TREE, varname, subs)
    } else {
        err = yottadb.DeleteE(tptoken, errstr, yottadb.YDB_DEL_NODE, varname, subs)
    }
    // There are some error codes we accept; anything other than that, raise an error
    if err != nil {
        panic(err)
    }
    return 0
}

func main() {
    var wg sync.WaitGroup
    var doneMutex sync.Mutex

    defer yottadb.Exit()
    tptoken := yottadb.NOTTP
    done := false


    go func() { //wait for test timeout then set done to true
        time.Sleep(TEST_TIMEOUT * time.Second)
        doneMutex.Lock()
        done = true
        doneMutex.Unlock()
    }()

    for i := 0; i < DRIVER_THREADS; i++ {
        wg.Add(1)
        go func() {
            for { //loop until test timeout then break
                doneMutex.Lock()
                d := done
                doneMutex.Unlock()
                if d {
                    break
                }
                runProc(tptoken, nil, testSettings{
                    MAX_DEPTH,
                    NEST_RATE,
                }, 0)
            }
            wg.Done()
        }()
    }

    wg.Wait() //wait for existing routines to end
}

[SEstes] A lot of this can be ignored, as it is cut down from a test that did a lot of random API usages. Pretty much any one of the API interfaces will do, but just this one causes all of these errors in spades. DeleteE() is in easy_api.go and invokes a few methods to build up the key, then drives the C routine – over and over – so it uses only 5-10% of this wrapper. No variadic plist is involved, no callbacks, just a simple call. And this spectacularly fails.

Steven Estes

May 24, 2019, 3:52:11 PM
to golang-nuts
Thank you, Ian,

I did try a few runs with GODEBUG set to "cgocheck=2" and just the "normal" errors (bad Go heap ptr, and sweep increased allocation) occurred. I'm assuming I would have seen a different kind of panic if it had detected something.

Like I mentioned in the last post, we are very careful not to pass Go memory to C, with one exception: we pass a pointer to a Go uint32 that we expect to be filled in by the call. My understanding is that this is permitted (cgo does not complain, though it is very quick to complain if I try to pass the address of a string, byte array, or structure).

Michael Jones

unread,
May 24, 2019, 3:58:04 PM5/24/19
to Steven Estes, golang-nuts
and when you change that "pass a pointer" to "have a shim c-side function return a uint32 that you just assign on the Go side," what happens?



--
Michael T. Jones
michae...@gmail.com

Steven Estes

May 24, 2019, 4:07:35 PM
to golang-nuts
Since those C functions already return a status code through the return value, and C has only one return value, that's not really possible. I could allocate 4 bytes of C memory, pass its address, make the call, copy out the value, and free the memory – and I will do that if we get to the point where those two functions are the last two not working, but we aren't there yet. As I noted, just running DeleteE() is enough to get failures, and there is NO passing of Go storage addresses to C in that path.
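
For the record, that fallback would look roughly like this sketch (ydb_node_next_shim is a hypothetical name, and the cgo preamble and imports are assumed):

// Sketch of the described alternative: use 4 bytes of C memory as the
// out-parameter instead of a pointer to a Go uint32, then copy the value
// out and free it.
func nodeNextCount(cbuft *C.ydb_buffer_t) (uint32, C.int) {
	cnt := (*C.uint)(C.malloc(C.size_t(unsafe.Sizeof(C.uint(0)))))
	defer C.free(unsafe.Pointer(cnt))
	rc := C.ydb_node_next_shim(cbuft, cnt) // hypothetical wrapper; C fills in *cnt
	return uint32(*cnt), rc
}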

SteveE

On Friday, May 24, 2019 at 11:58:04 AM UTC-4, Michael Jones wrote:
and when you change that "pass a pointer" to "have a shim c-side function return a uint32 that you just assign on the Go side," what happens?

<snip>--
Michael T. Jones
michae...@gmail.com

Steven Estes

May 28, 2019, 5:41:04 PM
to golang-nuts
Ian,

Did you (or anybody else reading this) have any other thoughts on what I can try since GODEBUG="cgocheck=2" did not show anything? Nor (in summary) has -msan, -race or the memory sanity debugging built into our C code. We're kinda stuck at this point.

Steve

Wojciech S. Czarnecki

May 28, 2019, 8:55:41 PM
to golan...@googlegroups.com
On Tue, 28 May 2019 10:41:03 -0700 (PDT)
Steven Estes <est...@yottadb.com> wrote:

> ... anybody else reading this) have any other thoughts on what I can try

> since GODEBUG="cgocheck=2" did not show anything? Nor (in summary)
> has -msan, -race or the memory sanity debugging built into our C code.
> We're kinda stuck at this point.

On Wed, 22 May 2019 15:16:11 -0700 (PDT)
Steven Estes <est...@yottadb.com> wrote:

> With the Golang wrapper, we have run into a variety of panics and crashes
> that take several forms. Those forms in order of frequency are as follows:

> - Panics due to “sweep increased allocation count”.
> - Panics due to invalid address (bad pointer) in Go heap.
> - SIG-11 failures in various places usually dealing with garbage collection.
> - Panics due to “unknown return address” when trying to return from functions.
> - Mutex deadlocks (everyone waiting and nobody holding)

It's a telltale sign of corrupt memory (I haven't yet seen Go corrupting the C side).
Likely the C code happily writes to an address that was freed on the Go side.

> Running with either of the options “-msan” or “-race” has not

Neither can help with a C process gone wild. It is the Go code that is
instrumented, and msan just informs the common (C/LLVM) sanitizer about Go's
writes and reads (both of which Go code usually does within its own boundaries).


> A bit of background:
>
[...]
> we make use of Finalizers when the C storage is allocated so when the base
> Golang structure is garbage collected, a function is run that releases the
> allocated C memory attached to it. So far so good.

No. It is not good. Please read carefully:

https://golang.org/pkg/runtime/#SetFinalizer

paying high attention to phrases starting with "It is not guaranteed..."

> The C API that we are wrapping has parameters that are pointers to
> structures that themselves contain pointers to information...

[...] hundreds of lines of description from OP cut [...]

> Anyway, any insights as to how to make progress with this issue would be
> most appreciated. I’ve tried to keep this as brief as possible.

From this brief I understood it all was supposed to be "super fast", but after
reading https://docs.yottadb.com/MultiLangProgGuide/cprogram.html I am not
so sure. So I cannot offer anything beyond Michael Jones' advice:
**separate memory concerns with shims**:

>MJ Michael Jones <michae...@gmail.com> wrote:
>MJ and when you change that "pass a pointer" to "have a shim c-side function
>MJ return a uint32 that you just assign on the Go side


> We don’t understand what the Golang errors are telling us.

They tell a tale of corrupt memory. The C side is probably writing to
locations Go has already moved, freed, or even assigned to some other
data structure. [Less probable but still possible is corruption due to
"unsafe" fiddling done wrong.]

You apparently have assumed that if something is called a Finalizer, it will run.
More often it will not (and not only in Go [1]). Because of that, alas,
I'd suggest rethinking the architecture and then starting anew. (Shredding
all code built on wrong assumptions and starting anew is way cheaper,
man-hour-wise, than refurbishing and then debugging for leftovers.)

[1] https://hownot2code.com/2016/10/07/why-using-finalizers-is-a-bad-idea/

> Steve

Hope this helps.

--
Wojciech S. Czarnecki
<< ^oo^ >> OHIR-RIPE

kddavi...@gmail.com

May 29, 2019, 6:11:55 PM
to golang-nuts
Greetings,

I noticed something mostly unrelated to the problem, but wanted to give a heads-up. I could be wrong here, but I'm pretty sure the genRand() function from your embedded code isn't actually random. IIRC, math/rand.Int is deterministic and needs to be seeded (usually with time.Now().UnixNano()) before use. Otherwise it will generate the same sequence of numbers every time the program is run.
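
For reference, the usual fix for Go 1.x-era math/rand is a one-time seed at startup:

package main

import (
    "math/rand"
    "time"
)

func init() {
    // Without a seed, Go 1.x's math/rand produces the same sequence on
    // every run; seeding once at startup makes runs differ.
    rand.Seed(time.Now().UnixNano())
}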

-K

kddavi...@gmail.com

May 30, 2019, 6:13:42 PM
to golang-nuts
Sorry, I meant genName().

Steven Estes

May 30, 2019, 7:22:05 PM
to golang-nuts
Thank you, kddavidson772. Although it's not critical for this particular test, I will add that to this test and to another very similar one that operates slightly differently. I have put the same seed code you suggest in other routines, but other folks wrote the initial versions of this particular test, and that piece seems to have been missed even in the original from which my test was cut down.