bulding pipelines in golang

clement...@gmail.com

unread,

Nov 11, 2018, 1:23:15 PM11/11/18

to golang-nuts

Hi,

I recently confronted the problem of building data stream pipeline in golang to build some etl programs.

I found it was not so simple to use the golang idioms.

Using them i noticed i needed deep care and understanding to produce correct code.

I also felt they was uselessly verbose and repetitive.

I have searched for existing libraries produced by the community

but I failed to find a suitable one for my need.

Something like ratchet (https://github.com/dailyburn/ratchet) was close to it,

but its unlikely i use such complex api.

I came to write my own for this purpose,

you can find it here https://github.com/clementauger/st

To achieve it i have been extensively using the reflection API, which is really helpful.

Unfortunately, I have been using interface{} almost everywhere, and that would be difficult to change.

In exchange this api provides lots of flexibility.

For comparison and introduction purposes i have rewrote the pipeline walk file example

available in the blog at https://blog.golang.org/pipelines

The version i provide is available at

https://github.com/clementauger/st/blob/master/examples/walkfiles/main.go

This version implements a slightly more complex flow mecanism as it bufferizes path,

however, it is much easier, shorter (<150 LOC), and i believe clearer code than the original version.

Where the last property is subject to personal opinion, yet, I encourage you to consider it.

Overall, while this implementation is slower because there are tons of additionnal indirections,

I think it is still interesting to consider to compensate that

with easier and appropriate bufferization and paralellisation whenever possible

to achieve better performance.

I intended that this library helps to make that happens as easily and smoothly as possible.

For your information you can also find a bulk version

of the original bounded version provided in the blog

at https://github.com/clementauger/x/blob/master/01_pipeline/bounded_bulk.go.

This last program is more suitable for full comparison with the version i provide.

I hope to engage in interesting conversations around that,

thanks for reading.

clement...@gmail.com

unread,

Feb 13, 2019, 12:50:48 PM2/13/19

to golang-nuts

hi,

in recent days i have been working on fixing some issues with above presented code (https://github.com/clementauger/st/blob/master/examples/walkfiles/main.go#L38)

besides the fact the internal code uses a lot of reflection mechanisms to implement the api, putting in de facto on the wrong side of the performance line.

there was a bunch of memory management on the user land side that could be improved.

for example, that stage https://github.com/clementauger/st/blob/master/examples/walkfiles/main.go#L68 creates as many slices as input batch to process.

clement...@gmail.com

unread,

Feb 14, 2019, 6:12:25 PM2/14/19

to golang-nuts

hi,

sorry to come back so quickly, after those last updates about memory management optimizations,

i want to share those latest optimizations about the cpu and some thoughts...

the results are good, on this configuration, compared to the bounded.go.

the solution consists of a new type of stream that detects the input functions and return instances

of compiled function stored into a cache.

the example provided is made to get ride of 95% of the reflect package calls using the original api.

the new api https://github.com/clementauger/st/blob/master/examples/walkfiles/main6.go#L890

the bunch of cached signatures https://github.com/clementauger/st/blob/master/examples/walkfiles/main6.go#L151

for this demonstration the cache was manually created.

but it is reasonable to say that static analysis cannot fully help.

overall, there is one thing i want to emphasizes on,

its not central, but it matters.

the context management is leaky.

what i notice the most is the additional useless work

it creates to check for cancellation everywhere

because in theory and in practice it should be tested

within each independent component.

while this api clearly highlights the matter i believe its phenomena that

also occurs for regular writers as soon as they try to write larger

piece of code.

i need to add more on the fact that static analysis is not a good

option to generate some kind of hot path like demonstrated.

there is some problems inherent both to the api proposed

and the type system where the call stack graph can t be built entirely statically upfront.

leading to extraneous callstacks and type conversions.

at best a static analysis can help discover some hot path,

but inevitably some wont be solved until the very last moment,

this all thing leading to a mixture of stubbed functions

from cache and from jit reflection. which in both cases are bad for performance.

not to mention maintenance.

my thoughts is that while regular go code when written with all optimizations and care,

is definitely better, it is also harder to write, over redundant and harder to manage over-the-quantity.

the language does not provide elegant solution about that.

the attempt to move toward generic and error handles (?) seems

to try to answer those questions among others, i guess, I might be wrong.

but for now this is still questionable.

i have tried to provide a solution around that,

implementing in a somewhat-generic way those patterns that the language

has promoted.

the result is that the cost/benefit is not the same for all aspects.

for performance it s a win against simple implementations,

but still a loss against well developed solutions, untested but i strongly believe so.

for usability (?), i say this is a win for this implementation.

i hope for the language to find a way to improve expressiveness

so it is easier to consumer resources with best care and performance.

Reply all

Reply to author

Forward