Best practices for data visualization and analytics with Kylo


Valentin Polezhaev

Jul 27, 2017, 1:52:36 PM
to Kylo Community
Hello!

It looks like Kylo is a perfect tool for self-service data integration and transformation. It supports metadata management and the creation of custom but governed data workflows, and it provides security mechanisms. This is all good, but I can't find any good documentation on the best way to integrate Kylo with BI tools and do analytics on prepared datasets.

Data typically goes through these stages:
1. Data ingest. Kylo supports it.
2. Data preparation and integration for analysis; this can include complex computations and is mostly done in batch mode. Kylo supports it.
3. The resulting dataset can be discovered by a business user or analyst. Kylo supports it.
4. The user can build custom reports, charts, dashboards, etc. on top of the prepared dataset. It seems Kylo doesn't support this.

I don't think Kylo has to include rich data viz or analytic functions; first and foremost, Kylo is a data management platform. But I think it's at least worth documenting somewhere how to integrate Kylo with BI tools like Tableau or Power BI, or with some well-known tools from the Hadoop ecosystem for querying and visualization.

To use BI tools on prepared datasets, I can use a SQL-on-Hadoop solution (e.g. Apache Drill), which provides an ODBC/JDBC interface for BI tools. With Drill there is no need for custom setup, but another tool may require metadata in an "understandable" form (obviously the Hive metastore), and the latter can be a problem with Kylo.
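Besides ODBC/JDBC, Drill also exposes an HTTP REST endpoint for queries. As a minimal sketch (the host, port, and the `dfs.tmp.`sample`` table path are my assumptions, not anything Kylo-specific), submitting a query from Python could look like this:

```python
"""Sketch: submitting a SQL query to Apache Drill's REST API.

Assumptions: a Drill instance running locally on the default REST
port 8047; the queried table path is hypothetical.
"""
import json
import urllib.request


def build_drill_request(sql, base_url="http://localhost:8047"):
    """Build the HTTP POST request Drill's /query.json endpoint expects."""
    payload = {"queryType": "SQL", "query": sql}
    return urllib.request.Request(
        base_url + "/query.json",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    # Hypothetical query against a dataset prepared with Kylo.
    req = build_drill_request("SELECT * FROM dfs.tmp.`sample` LIMIT 10")
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp)["rows"])
```

A BI tool would normally go through the ODBC/JDBC driver instead; the REST call is just the quickest way to verify that Drill can see the prepared data.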

I'm interested in any thoughts on how Kylo's creators and the community plan to enable data viz and analytics on prepared datasets.

Thanks a lot.


Matt Hutton

Jul 27, 2017, 2:09:35 PM
to Kylo Community
Hi Valentin-

Thank you for the kind post. Kylo was inspired to fill major gaps in the Hadoop ecosystem, particularly around data lake use cases. While we agree data visualization and BI are key features, we feel there are a lot of great open-source and commercial visualization tools out there already. We don't want to compete with these, but we are interested in hearing your ideas about how we could integrate and add value.

We currently allow users to browse Hive tables and interactively query them from Kylo. There is no plotting functionality yet, but in 0.8.3 (late August) we should have integrated notebook support, so you will be able to build custom notebooks with plots, etc. See Zeppelin and Jupyter. We are hoping to combine notebooks with Kylo's entity-based access so you can share or maintain private notebooks, and of course leverage Kylo's metadata catalog.

Regards,
Matt

Valentin Polezhaev

Jul 28, 2017, 5:35:40 AM
to Kylo Community
Hello, Matt.

Thanks for your response.

I agree that Kylo fits very well for data lake implementations, and as I emphasized previously, data viz and analytics on prepared datasets are not features that have to be implemented directly in Kylo. But for completeness of vision, it's worth mentioning somewhere (in the docs or in tutorial videos?) how to do querying, reporting, and visualization on datasets prepared with Kylo via third-party tools. In short: how to consume and use data prepared with Kylo without significant IT involvement. Without such a description or guide, the data lake concept is not complete.

Zeppelin may require some training for business users to work with it, but it seems like a good solution.
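For illustration, a minimal Zeppelin paragraph against a Hive table could look like the sketch below; the interpreter name and the table are assumptions on my part, not anything documented for Kylo:

```sql
%jdbc(hive)
-- Hypothetical table prepared by a Kylo feed; Zeppelin renders the
-- result set as a table with built-in bar/line/pie chart options.
SELECT category, SUM(amount) AS total
FROM kylo_prepared.sales
GROUP BY category
```

That covers the analyst who is comfortable writing SQL; the visualization itself needs no extra code.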

Business users feel comfortable using BI tools. Although many of them support Hadoop integration, a direct connection to a Hadoop data source may be complex for users, and some use cases are not possible (e.g. working with a dataset that cannot fit in the user's PC memory).

In my opinion, a good option is to use a SQL-on-Hadoop tool that can provide an API to BI tools, thereby encapsulating Hadoop at least to some degree. We are currently experimenting with Apache Drill: it supports ANSI SQL, provides ODBC/JDBC connectivity for third parties, and allows interactive analytics on Hadoop data using distributed query execution. Also, the ability to use ANSI SQL instead of HiveQL seems very good from a self-service perspective.

An interesting question arises here: how to set up Kylo, Drill (or another SQL-on-Hadoop tool that provides an API), and a BI tool so that a user gets seamless analytics on Kylo-prepared data through the BI tool. Ideally, the user shouldn't need to know anything about Drill, and especially not how to configure it. IT involvement should also be minimized. We will experiment in this direction and share interesting findings as soon as we have them.
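One concrete piece of that wiring: Drill can read the same Hive metastore that Kylo populates by enabling Drill's `hive` storage plugin. A minimal configuration sketch (the metastore and namenode hosts are assumptions) might look like:

```json
{
  "type": "hive",
  "enabled": true,
  "configProps": {
    "hive.metastore.uris": "thrift://metastore-host:9083",
    "hive.metastore.sasl.enabled": "false",
    "fs.default.name": "hdfs://namenode:8020/"
  }
}
```

With this in place, tables Kylo creates in Hive appear under Drill's `hive` schema and can be queried by a BI tool over ODBC/JDBC, without the end user ever touching Drill itself.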


Matt Hutton

Jul 28, 2017, 3:12:13 PM
to Kylo Community
Sounds good. Looking forward to hearing your findings.