building pipelines in golang


clement...@gmail.com

Nov 11, 2018, 1:23:15 PM
to golang-nuts
Hi,

I recently ran into the problem of building data stream pipelines in Go for some ETL programs.

I found the standard Go idioms not so simple to use:
they require real care and understanding to produce correct code,
and I also felt they were needlessly verbose and repetitive.
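
To give an idea of what I mean, here is a minimal sketch in the spirit of the gen/sq example from the pipelines blog post: every stage owns an output channel, ranges over its input, and closes its output when done, and that boilerplate repeats for every stage.

package main

import "fmt"

// gen emits the given numbers on a channel it owns and closes.
func gen(nums ...int) <-chan int {
    out := make(chan int)
    go func() {
        defer close(out)
        for _, n := range nums {
            out <- n
        }
    }()
    return out
}

// sq squares every value it receives; same channel/goroutine boilerplate again.
func sq(in <-chan int) <-chan int {
    out := make(chan int)
    go func() {
        defer close(out)
        for n := range in {
            out <- n * n
        }
    }()
    return out
}

func main() {
    for v := range sq(gen(2, 3, 4)) {
        fmt.Println(v)
    }
}

Each new stage repeats the same channel/goroutine/close dance, and adding cancellation or fan-out multiplies it.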

I searched for existing libraries produced by the community
but failed to find one suitable for my needs.
Something like ratchet (https://github.com/dailyburn/ratchet) comes close,
but I am unlikely to use such a complex API.

So I ended up writing my own for this purpose.

To achieve it I made extensive use of the reflection API, which is really helpful.
Unfortunately, that means interface{} almost everywhere, and that would be difficult to change.
In exchange, this API provides a lot of flexibility.

For comparison and introduction purposes I rewrote the file-walking pipeline example
from the blog at https://blog.golang.org/pipelines

The version I provide is available at

This version implements a slightly more complex flow mechanism, as it buffers paths;
however, the code is much simpler, shorter (<150 LOC), and I believe clearer than the original version.
That last point is a matter of personal opinion, but I encourage you to consider it.

Overall, while this implementation is slower because of the many additional indirections,
I think it is still worth considering: the overhead can be compensated
with easier and more appropriate buffering and parallelisation wherever possible,
to achieve better performance.
I intended this library to help make that happen as easily and smoothly as possible.
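
To illustrate what I mean by parallelisation, here is a plain-channels sketch (not this library's API): buffer the output channel and fan the work out over a bounded number of workers, merging them back with a WaitGroup.

package pipeline

import "sync"

// fanOut runs `workers` goroutines over the input channel and merges their
// results into a single buffered output channel, closed once all workers exit.
func fanOut(in <-chan string, workers int, work func(string) string) <-chan string {
    out := make(chan string, workers) // small buffer to smooth bursts
    var wg sync.WaitGroup
    wg.Add(workers)
    for i := 0; i < workers; i++ {
        go func() {
            defer wg.Done()
            for v := range in {
                out <- work(v)
            }
        }()
    }
    go func() {
        wg.Wait()
        close(out)
    }()
    return out
}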

For your information, you can also find a bulk version
of the original bounded version provided in the blog.
That program is more suitable for a full comparison with the version I provide.

I hope this leads to some interesting conversations;
thanks for reading.

clement...@gmail.com

Feb 13, 2019, 12:50:48 PM
to golang-nuts
hi,

In recent days I have been working on fixing some issues with the code presented above (https://github.com/clementauger/st/blob/master/examples/walkfiles/main.go#L38).
Besides the fact that the internal code relies heavily on reflection to implement the API, which puts it de facto on the wrong side of the performance line,
there was a bunch of memory management on the user-land side that could be improved.

For example, the stage at https://github.com/clementauger/st/blob/master/examples/walkfiles/main.go#L68 creates as many slices as there are input batches to process.
See also lines 70, 71 and 72.

For a quick look at the solution, open https://github.com/clementauger/st/blob/master/examples/walkfiles/main5.go#L51

You will find sync.Pool used together with two new kinds of stages,
PoolBuffered (L53) and PoolRecycler (L62).
They take care of the allocation/recycling performed when passing values in and out of a goroutine context (L58), by accumulating or consuming container values such as slices.
This solves lines 68 and 70 of the original demonstration code.
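
The general technique behind that (a generic sync.Pool sketch, not the actual PoolBuffered/PoolRecycler code) looks like this:

package pipeline

import "sync"

// batchPool hands out reusable slices so each batch does not allocate a new one.
// Pointers to slices are pooled to avoid re-allocating the slice header on Put.
var batchPool = sync.Pool{
    New: func() interface{} {
        b := make([]string, 0, 64)
        return &b
    },
}

// getBatch borrows an empty batch from the pool.
func getBatch() *[]string {
    b := batchPool.Get().(*[]string)
    *b = (*b)[:0]
    return b
}

// putBatch gives a drained batch back so the next stage reuses its backing array.
func putBatch(b *[]string) { batchPool.Put(b) }

The producing stage borrows a batch, fills it and sends it downstream; the consumer returns it once drained.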

You will also find the introduction of self-contained functions within the Concurrently call (L58).
Those functions return the iterable function instead of being it. They can also return instances of st.Mapper directly, as demonstrated.
This helps to solve lines 71 and 72 of the original demonstration code.
In an earlier version, this problem was handled with a sync.Pool (https://github.com/clementauger/st/blob/master/examples/walkfiles/main3.go#L69),
but that is not the best available solution:
because those self-contained functions are executed once per worker, each can allocate in its own goroutine at startup without needing any synchronization.
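
To illustrate that pattern outside the library (the newHasher name is hypothetical, in the spirit of the MD5 walk example): the outer function runs once per worker and returns the stage function, so each worker allocates its scratch buffer up front, with no pool and no locking.

package pipeline

import (
    "crypto/md5"
    "encoding/hex"
    "io"
    "os"
)

// newHasher is called once per worker; the returned closure reuses the
// per-worker buffer on every call, so no synchronization is needed.
func newHasher() func(path string) (string, error) {
    buf := make([]byte, 32*1024) // scratch buffer owned by a single worker
    return func(path string) (string, error) {
        f, err := os.Open(path)
        if err != nil {
            return "", err
        }
        defer f.Close()
        h := md5.New()
        if _, err := io.CopyBuffer(h, f, buf); err != nil {
            return "", err
        }
        return hex.EncodeToString(h.Sum(nil)), nil
    }
}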

FTR, I have kept the several versions I edited since the original one posted in the previous email.

Finally, note that the last commit is a major change, as it breaks the Mapper interface.
It is not yet tagged; that will happen later.

I am searching for ideas on how I could get rid of the reflect package.
Static analysis cannot work here.
Dynamic Go code generation did not work so far; it might work, but seems difficult.
Any idea is welcome.

thanks for reading.

clement...@gmail.com

Feb 14, 2019, 6:12:25 PM
to golang-nuts
hi,

Sorry to come back so quickly; after those last updates about memory management optimizations,
I want to share the latest CPU optimizations and some thoughts...

The results are good on this configuration, compared to bounded.go.

The solution consists of a new type of stream that detects the input functions and returns instances
of compiled functions stored in a cache.
The example provided gets rid of roughly 95% of the reflect package calls while keeping the original API.


For this demonstration the cache was created manually,
but it is reasonable to say that static analysis cannot fully help here.
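
The rough shape of the idea, sketched here with a hand-written type switch rather than the library's actual cache code: try a few known concrete signatures first, and only fall back to reflect for the rest.

package pipeline

import "reflect"

// makeCaller returns a reflection-free adapter when the function has one of
// the known concrete signatures, and a reflect-based fallback otherwise.
func makeCaller(fn interface{}) func(interface{}) interface{} {
    switch f := fn.(type) {
    case func(string) string: // hot path: direct call, no reflection
        return func(v interface{}) interface{} { return f(v.(string)) }
    case func(int) int:
        return func(v interface{}) interface{} { return f(v.(int)) }
    default: // cold path: generic but slow
        rv := reflect.ValueOf(fn)
        return func(v interface{}) interface{} {
            return rv.Call([]reflect.Value{reflect.ValueOf(v)})[0].Interface()
        }
    }
}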

Overall, there is one thing I want to emphasize;
it is not central, but it matters:
context management is leaky.
What I notice the most is the additional, mostly useless work
it creates to check for cancellation everywhere,
because in theory and in practice it should be tested
within each independent component.
While this API clearly highlights the matter, I believe this phenomenon
also hits regular writers as soon as they try to write larger
pieces of code.
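
This is the boilerplate I am referring to (a minimal plain-channels sketch): every stage has to select on ctx.Done() around every send, otherwise a cancelled pipeline leaks goroutines blocked on full channels.

package pipeline

import "context"

// stage forwards values until the input closes or the context is cancelled.
// The same select has to be repeated around every send, in every stage.
func stage(ctx context.Context, in <-chan string) <-chan string {
    out := make(chan string)
    go func() {
        defer close(out)
        for v := range in {
            select {
            case out <- v:
            case <-ctx.Done():
                return
            }
        }
    }()
    return out
}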

I need to add more on the fact that static analysis is not a good
option for generating the kind of hot path demonstrated.
There are problems inherent both to the proposed API
and to the type system: the call graph cannot be built entirely statically up front,
which leads to extraneous call stacks and type conversions.
At best, static analysis can help discover some hot paths,
but inevitably some will not be resolved until the very last moment,
all of which leads to a mixture of stubbed functions
from the cache and from just-in-time reflection, which in both cases is bad for performance,
not to mention maintenance.

My feeling is that while regular Go code, written with all the optimizations and care,
is definitely better, it is also harder to write, highly redundant, and harder to manage as the quantity grows.

The language does not provide an elegant solution to that.
The attempts to move toward generics and new error handling seem
to try to answer those questions among others, I guess; I might be wrong,
but for now this is still an open question.

I have tried to provide a solution around that,
implementing in a somewhat generic way the patterns the language
has promoted.
The result is that the cost/benefit is not the same across all aspects.
For performance it is a win against naive implementations,
but still a loss against carefully tuned solutions (untested, but I strongly believe so).
For usability, I would say this implementation is a win.

I hope the language finds a way to improve expressiveness,
so it becomes easier to consume resources with the best care and performance.