At Bitcasa we built our garbage collection in Go on Hadoop (though we got burned in so many places using Hadoop that any rewrite would just spin up EC2 clusters directly). The process copy-verify-deletes more than 2 PB, plus frequent re-evaluation of about 10 TB of data. This was for Amazon-stored data, so we used their hosted Hadoop. Since Amazon's directory listings are sorted and our filenames were easily sharded, we batched the keys into 4096 subgroups, then processed each subgroup as a stream in Go. Top-level operations have set-notation semantics, but each piece has input and output channels (and every channel must be fully consumed for the system to work).
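Because the inputs arrive pre-sorted, the set operations reduce to a streaming merge over channels. As a rough illustration (a hypothetical sketch, not the actual library; `subtract` and `fromSlice` are names I'm inventing here), a set-difference over two sorted string channels looks like:

```go
package main

import "fmt"

// subtract streams the set difference a \ b, assuming both inputs
// yield deduplicated strings in ascending order. Hypothetical
// sketch, not the actual Bitcasa library.
func subtract(a, b <-chan string) <-chan string {
	out := make(chan string)
	go func() {
		defer close(out)
		bv, bok := <-b
		for av := range a {
			// advance b until it catches up with a
			for bok && bv < av {
				bv, bok = <-b
			}
			if !bok || av < bv {
				out <- av // present in a only
			}
			// if av == bv, drop it (present in both)
		}
		// fully consume b so its producer isn't left blocked
		for bok {
			bv, bok = <-b
		}
	}()
	return out
}

// fromSlice adapts a slice into a sorted-stream channel for testing.
func fromSlice(xs []string) <-chan string {
	ch := make(chan string)
	go func() {
		for _, x := range xs {
			ch <- x
		}
		close(ch)
	}()
	return ch
}

func main() {
	a := fromSlice([]string{"a", "b", "c", "d"})
	b := fromSlice([]string{"b", "d"})
	for s := range subtract(a, b) {
		fmt.Println(s)
	}
}
```

Note the drain loop at the end: it's one concrete instance of the "every channel must be digested" rule above, since an unconsumed input channel leaves its upstream producer blocked forever.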
The GoLang streams worked great; the sharding was mostly driven by the per-instance network cap on EC2 and by wanting to iterate quickly.
One gotcha for anyone still considering Hadoop: we had to use much larger instances than we wanted because Hadoop's bucket/group step (the shuffle between map and reduce) appears to require huge amounts of memory once the data gets really big; otherwise we hit OOMs inside Hadoop's own code. It should have used on-disk append operations instead.
Relatedly, I will be open-sourcing the sorted-chan-of-strings-as-a-set library I built for this: Venn, Subtract, Merge, CountConsume, Save, Tee. Tee was the hardest to keep deadlock-free, since one consumer can block on something that is itself blocked on the other consumer.
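The Tee deadlock, and one way out of it, can be sketched like this (hypothetical code, not the actual library; `tee` and `buffered` are invented names). A naive tee that writes both outputs from one goroutine hangs as soon as consumer A blocks on something downstream of consumer B; decoupling the two outputs with unbounded per-output buffers avoids that, at the cost of memory proportional to the skew between consumers:

```go
package main

import "fmt"

type pipe struct {
	in  chan<- string
	out <-chan string
}

// buffered returns a channel pair with an unbounded queue between
// them, so a send into the pair never blocks on a slow reader.
func buffered() pipe {
	in := make(chan string)
	out := make(chan string)
	go func() {
		defer close(out)
		var q []string
		src := in
		for src != nil || len(q) > 0 {
			var send chan string
			var head string
			if len(q) > 0 {
				send = out // enable the send case only when we have data
				head = q[0]
			}
			select {
			case v, ok := <-src:
				if !ok {
					src = nil // input closed; drain the queue, then exit
					continue
				}
				q = append(q, v)
			case send <- head:
				q = q[1:]
			}
		}
	}()
	return pipe{in: in, out: out}
}

// tee duplicates one stream to two consumers without deadlocking,
// even if one consumer is drained completely before the other.
func tee(in <-chan string) (<-chan string, <-chan string) {
	o1, o2 := buffered(), buffered()
	go func() {
		for v := range in {
			o1.in <- v
			o2.in <- v
		}
		close(o1.in)
		close(o2.in)
	}()
	return o1.out, o2.out
}

func main() {
	src := make(chan string)
	go func() {
		for _, s := range []string{"x", "y", "z"} {
			src <- s
		}
		close(src)
	}()
	a, b := tee(src)
	// Drain b completely before touching a: a naive tee would
	// deadlock here, but the buffers absorb the skew.
	for s := range b {
		fmt.Println("b:", s)
	}
	for s := range a {
		fmt.Println("a:", s)
	}
}
```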