Is Clojure a good choice for Big Data? Which Clojure/Hadoop work should I use?


orazio

Jul 2, 2019, 12:07:49 PM
to Clojure
Hi All,

I'm a newbie to Clojure and Big Data, and I'm starting with Hadoop.
I have installed Hortonworks HDP 3.1.
I have to design a Big Data layer that ingests large IoT and social media datasets, processes the data with MapReduce jobs, and stores the resulting aggregations in HBase tables.

For now, my focus is on the data processing issue. My question is: is Clojure a good choice for distributed data processing on Hadoop?
I found Cascalog, a fully-featured data processing and querying library for Clojure and Java. But does this library still have active maintainers?
Do you know of other excellent Clojure/Hadoop work in the community for data processing?

I would appreciate some help.

Orazio

atdixon

Jul 2, 2019, 8:55:11 PM
to Clojure
I've found Clojure to be an excellent fit for big data processing for a few reasons:

- the nature of big data is that it is often unstructured or semi-structured, and Clojure's immutable ad hoc map-based orientation is well suited to this
- much of the big data ecosystem is Java or JVM-based (and continues to be!), and Clojure's Java interop lets you use all of that tooling and those platforms from Clojure
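
To illustrate the first point, a semi-structured record is just an ad hoc immutable map; no schema ceremony is needed (the fields here are invented for illustration):

    ;; One event; fields can vary record to record without a schema.
    (def event
      {:user   "orazio"
       :source :twitter
       :text   "Clojure for big data?"
       :geo    nil})   ; optional field, present on only some records

    ;; Destructure what you need; missing keys are simply nil.
    (let [{:keys [user text geo]} event]
      (str user " said: " text (when geo (str " @ " geo))))
    ;; => "orazio said: Clojure for big data?"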

That said, some Clojure libs in the space (like Cascalog, which you mentioned) have seemed quiet for the past few years. I personally would favor the more active Java/JVM projects and simply interop with them from Clojure.

Here are a couple of issues that I've run into with Clojure -> Java interop on some of these big data platforms, along with their solutions:

1) Some big data Java frameworks want you to extend their base classes and provide generic parameters as you do. Clojure's class generation tools (gen-class, proxy, etc.) do not support providing generic parameters when extending Java types. The Java compiler, on the other hand, keeps generic parameter values in the compiled target class as class metadata (which is how some of these big data systems -- Apache Beam, for one -- use them at runtime). The solution here is to write Java classes that delegate back to Clojure functions through Vars.

2) These same frameworks often want to serialize the functions you provide, in order to distribute the code throughout the cluster. Clojure disables serialization for the classes it generates, so make the same Java classes you create for the generic parameter concretizations implement Serializable, and instantiate them from Clojure by passing in a Var bound to a function. Vars in Clojure are serializable, so doing things this way allows (references to) Clojure functions to be distributed across the cluster. A sketch of both tricks follows.
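
To make that concrete, here is a minimal sketch against Apache Beam's DoFn<InputT, OutputT>. The shim class, package, and function names are all hypothetical; the Java side is shown in comments, with the Clojure side below it:

    ;; Hypothetical Java shim (Java source, shown here as a comment):
    ;;
    ;;   public class VarDoFn extends DoFn<String, String> {  // concrete generic params
    ;;       private final clojure.lang.Var fn;               // delegate to a Clojure Var
    ;;       public VarDoFn(clojure.lang.Var fn) { this.fn = fn; }
    ;;       @ProcessElement
    ;;       public void process(ProcessContext c) {
    ;;           c.output((String) fn.invoke(c.element()));
    ;;       }
    ;;   }
    ;;
    ;; (Beam's DoFn already implements Serializable; for other frameworks
    ;; you may need to add `implements Serializable` to the shim yourself.)

    (ns example.beam
      (:import [example VarDoFn]))

    (defn shout [s] (.toUpperCase ^String s))

    ;; Pass the Var (#'shout), not the function value: Clojure's generated
    ;; function classes are not serializable, but the Var reference is, and
    ;; workers resolve it by name (so AOT-compile / ship the namespace).
    (def upper-fn (VarDoFn. #'shout))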

The key thing is that all of this is very simple to arrange in code once you get the basics down, but I've seen a few people stumble on these points without knowing the tricks. I realize my short descriptions here may leave some people wanting; I may try a blog post on these when time permits.

Gerard Klijs

Jul 2, 2019, 11:33:39 PM
to Clojure
My biased first reaction to Hadoop is: do you really need it? It has a separate runtime and some overhead. It seems to me it's much easier to use Kafka: probably Kafka Connect to get data in and out, and Kafka Streams/KSQL to process the data. Because of the Java interop and the nice generic Kafka API it's really easy to use Clojure with Kafka directly, though there are also several libraries you could use.
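
For example, driving the stock Java producer client from Clojure is pure interop (a minimal sketch; the broker address, topic name, and payload are placeholders):

    (ns example.kafka
      (:import [org.apache.kafka.clients.producer KafkaProducer ProducerRecord]
               [java.util Properties]))

    ;; Standard Java client configuration; values are placeholders.
    (def props
      (doto (Properties.)
        (.put "bootstrap.servers" "localhost:9092")
        (.put "key.serializer" "org.apache.kafka.common.serialization.StringSerializer")
        (.put "value.serializer" "org.apache.kafka.common.serialization.StringSerializer")))

    ;; KafkaProducer is Closeable, so with-open tears it down cleanly.
    (with-open [producer (KafkaProducer. props)]
      (.send producer (ProducerRecord. "iot-events" "sensor-1" "{\"temp\": 21.5}"))
      (.flush producer))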

Thad Guidry

Jul 3, 2019, 8:56:09 AM
to clo...@googlegroups.com


orazio

Jul 4, 2019, 4:22:44 AM
to Clojure
Hi @atdixon and Thad, thanks for your help.

Here are more details about my project.
My big data layer is inspired by the Lambda architecture. The pipeline includes the following layers, with the tool chosen to address each one:
- NiFi for data ingestion, publishing data/messages to a Kafka topic.
- Kafka as the message broker; with Kafka Connect it lets me store data in MongoDB (MongoDB sink, 1-day retention period) and HDFS (HDFS sink, 1-year retention period).
- Real-time processing with MongoDB, using its built-in query engine, which provides extensive querying, filtering, and searching abilities.
- Batch processing of the data stored on HDFS, which performs data aggregation and stores the result in an HBase table. The question is: which tool do you suggest for processing the data stored on HDFS?
- Serving layer with HBase/Phoenix to store the batch views and give access to them.

Now I'm asking for your help in choosing the most appropriate tool to execute the batch (MapReduce) jobs that will aggregate the data.
Nathan Marz suggests Clojure/Cascalog. Do you know of other excellent Clojure/Hadoop work in the community for data processing?
If you know of particularly appropriate tools, I could also consider work/libraries from outside the Clojure community.

Thanks




Thad Guidry

Jul 4, 2019, 10:43:05 AM
to clo...@googlegroups.com
"Batch" - doing things in chunks
"Processing" - THE WORLD :-)  because it means so many different things to so many folks (including your boss)

Without a doubt, you will love Apache Spark for your batch processing and for writing Spark programs to conquer any World you are building.
Spend some time installing a standalone Spark deployment, and then use its powerful Spark shell (it has the feel of a Clojure REPL!).
If you just want to jump onto a public cluster and try Spark, then I would suggest Databricks.
Spend time reading about the features under the Libraries drop-down menu on the Apache Spark website.

You might even be encouraged enough to write an official API in Clojure for Apache Spark within a year!  (win-win)

One note of caution: if you are building something for the long term, you will eventually need data versioning, ACID transactions, and schema evolution. For this I use Delta Lake (not Datomic), since it's fully compatible with Spark.

Best of luck!




orazio

Jul 4, 2019, 1:06:57 PM
to Clojure

As Thad says, the farsighted choice of tool for batch processing is probably Apache Spark. But I'm worried about its learning curve and the time it takes; I don't have much time to develop my MapReduce algorithms, and I would like to use a consolidated tool that is fairly widely used in production. Recently I also came across Scalding (https://github.com/twitter/scalding). Scalding is written in Scala and built on top of Cascading, a Java library that abstracts away low-level Hadoop details. It is adopted in production by many companies such as eBay, Sky, Twitter, LinkedIn, and Spotify (https://github.com/twitter/scalding/wiki/Powered-By).
Scalding seems better maintained and supported than Cascalog, and with the many examples around on GitHub it seems to have a smoother learning curve than Apache Spark. Do you know Scalding, and what do you think about it? Any suggestions are welcome.



Chris Nuernberger

Jul 4, 2019, 1:37:25 PM
to clo...@googlegroups.com
Thad,

Your approach seems very promising to me for a lot of jobs. Spark runs on top of many things.

As for a Clojure layer on top, what do you think about Sparkling?


Thad Guidry

Jul 4, 2019, 2:09:52 PM
to clo...@googlegroups.com
Christian writes really good tools. Sparkling is no exception.
I have yet to use it in production myself, however, since I haven't had the need to use Clojure directly to solve any "data aggregation" problems. Spark and other tools do that well enough, naturally.
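
For anyone curious, a minimal word count in Sparkling might look roughly like this (an untested sketch based on Sparkling's documented API; the master setting and input path are placeholders):

    (ns example.wordcount
      (:require [clojure.string :as str]
                [sparkling.conf :as conf]
                [sparkling.core :as spark]))

    ;; Local Spark config; master and app name are illustrative.
    (def c (-> (conf/spark-conf)
               (conf/master "local[*]")
               (conf/app-name "wordcount")))

    ;; The namespace usually needs AOT compilation so the generated
    ;; function classes can be serialized out to the executors.
    (spark/with-context sc c
      (->> (spark/text-file sc "hdfs:///data/tweets.txt") ; placeholder path
           (spark/flat-map #(str/split % #"\s+"))
           (spark/map-to-pair #(spark/tuple % 1))
           (spark/reduce-by-key +)
           spark/collect))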

As far as using a tool or programming language to solve "data integration" problems in large enterprise environments goes, I will ALWAYS use open source tools for that purpose, and Clojure is no exception; I do tend to choose open source hammers to drive nails. Sometimes Clojure is missing the handle on its hammer, as we have all experienced, but that's on us, since WE have the power to make Clojure better. Often, though, TIME is what we lack to build better APIs, libraries, and tools that expand Clojure.

The Apache ecosystem offers many tools and libraries for "big data" and "data integration", which I often turn to first because I lack TIME for building (the long tail) but have enough TIME for learning new things (a shorter tail that helps the long tail).



ri...@chartbeat.com

Jul 5, 2019, 1:43:16 PM
to clo...@googlegroups.com
As much as I would love to convert a new data engineer to the ways of Clojure, in my opinion choosing a language to solve a problem is rarely a wise move. Do you have a team of engineers ready and willing to learn Clojure, or are you doing this yourself? We do a lot of work with all of the tools you mention (in Clojure), but we built a lot of the frameworks ourselves or wrote wrappers around Java tools. This is not for the newbie: if your goal is to build this pipeline for your boss and you have any sort of deadline, do yourself a favor and pick an existing, well-documented, well-googleable framework in a language your team is familiar with. There are a ton of hurdles in everything you mentioned before you even get to Clojure. You're jumping into the deep end of the pool with no life jacket, and you don't know how to swim.

That said, if you ignore my advice you will learn a lot and we will be here to help, just be warned 😎

orazio

Jul 8, 2019, 1:49:35 AM
to Clojure
Many thanks for your clarifications.
I don't have a team of engineers, just myself, which, in all modesty, I'd say is not nothing.
I'm not familiar with Clojure; I know the Java programming language.
The Lambda architecture pipeline I want to build will not be made entirely with Clojure. As described above, I will use existing tools that I don't need to develop (NiFi, Kafka, MongoDB, Hadoop, HBase).
Let's focus only on the batch layer of the Lambda architecture.
My doubt is that I have not found an optimal tool, recognized by the Big Data community as the best, for distributed (MapReduce) processing of historical data on HDFS.
The MapReduce algorithms I have to implement concern word counts over social media messages (Twitter, Facebook, Telegram) and IoT data analysis and aggregation, such as average values every 30 minutes, every hour, and every day (a sketch of the kind of aggregation I mean follows).
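
For concreteness, the windowed averages amount to bucketing timestamps into fixed windows and averaging per bucket; a minimal plain-Clojure sketch (field names and sample values are invented):

    ;; Hypothetical IoT readings: timestamp in epoch millis plus a value.
    (def readings
      [{:ts 1562256000000 :value 20.5}
       {:ts 1562256900000 :value 21.0}
       {:ts 1562258100000 :value 22.0}])

    ;; Average the values in fixed windows (30 min = 1800000 ms).
    (defn windowed-averages [window-ms readings]
      (->> readings
           (group-by #(quot (:ts %) window-ms))
           (map (fn [[bucket rs]]
                  {:window-start (* bucket window-ms)
                   :avg (/ (reduce + (map :value rs)) (double (count rs)))}))
           (sort-by :window-start)))

    (windowed-averages (* 30 60 1000) readings)
    ;; => ({:window-start 1562256000000, :avg 20.75}
    ;;     {:window-start 1562257800000, :avg 22.0})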
Reading Nathan Marz's big data book, Principles and Best Practices of Scalable Realtime Data Systems, he suggests Clojure/Cascalog for distributed data processing on HDFS/Hadoop.
I'm asking whether Clojure/Cascalog would be a good choice for (MapReduce) dataset processing and for storing the resulting aggregations in HBase, or whether you suggest other work.
Otherwise, if you know of an existing, well-documented, well-googleable framework in Java for distributed data processing that can store the resulting aggregations in HBase, your advice about it would be appreciated.

Thanks again.
Orazio

