Parquet file format


Scott Kinney

Jun 8, 2016, 4:41:39 PM
to Druid User
We are evaluating Drill and Druid as a solution for querying gzipped JSON data files in S3. To make it more fun, each line in a file can be one of roughly 10 different JSON structures, and the structures can be very different. We want to be able to query this data in S3 so it doesn't have to live on EBS volumes ($$). Parquet seems to be more compact, so it will at least save time transferring the files, and supposedly it can allow for faster querying. I read http://druid.io/docs/latest/comparisons/druid-vs-sql-on-hadoop.html and am a bit confused. It sounds like Parquet can make Druid queries faster. Can Druid ingest Parquet files, and is it efficient, or at least as efficient as JSON?

Fangjin

Jun 8, 2016, 4:52:50 PM
to Druid User
Scott, I think you may be misunderstanding how Druid works.

Druid is not a SQL-on-Hadoop solution. All Druid segments must be downloaded locally before they can be queried, unlike a system like Drill, which can query Parquet files in S3 directly. To use Parquet with Druid, you would have to read the data from Parquet and convert it into Druid's segment format. There is an existing extension to do this.
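For reference, a Hadoop-based ingestion spec using the Parquet extension looks roughly like the sketch below. The input-format class name and parseSpec layout follow the druid-parquet-extensions documentation of that era, but treat this as an unverified sketch; the bucket path, datasource name, dimensions, and intervals are hypothetical placeholders.

```json
{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "io.druid.data.input.parquet.DruidParquetInputFormat",
        "paths": "s3://example-bucket/events/2016/06/"
      }
    },
    "dataSchema": {
      "dataSource": "events",
      "parser": {
        "type": "parquet",
        "parseSpec": {
          "format": "timeAndDims",
          "timestampSpec": { "column": "timestamp", "format": "auto" },
          "dimensionsSpec": { "dimensions": ["page", "user"] }
        }
      },
      "metricsSpec": [{ "type": "count", "name": "count" }],
      "granularitySpec": {
        "segmentGranularity": "DAY",
        "queryGranularity": "HOUR",
        "intervals": ["2016-06-01/2016-06-09"]
      }
    }
  }
}
```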

The tradeoff, of course, is query latency. If you can live with queries that take minutes or hours, you can pull data from S3 into Drill and have Drill do the computation. Depending on the frequency of queries, this may end up being more expensive in network transfer costs than having Druid download all segments locally and query that data instead.


Scott Kinney

Jun 8, 2016, 5:02:34 PM
to Druid User
Ah, ok. 
We are trying to move a bunch of data out of MySQL to S3. If we want to query it with Druid, we will end up pulling it down from S3 and onto the Druid EBS volumes, defeating the whole purpose.
I was starting to think Druid might work for us but now I don't think so.
Thanks Fangjin.

Fangjin

Jun 8, 2016, 5:05:13 PM
to Druid User
Scott, what are you trying to do?

Fangjin

Jun 8, 2016, 5:05:32 PM
to Druid User
In terms of general product requirements and data volumes?

Gian Merlino

Jun 8, 2016, 5:22:39 PM
to druid...@googlegroups.com
Hey Scott,

Druid is really different from Drill. Drill is a query engine: it queries data where it lies. Druid, by contrast, is a data store. It indexes your data into its own format and distributes that data across all the Druid nodes *before* queries are issued. The query path involves local reads of data already present on the Druid nodes. This design generally offers better performance, but you do need to store the indexed dataset on the Druid nodes. That dataset can be substantially smaller than the raw data, thanks to compression and rollup.
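The rollup idea above can be sketched in a few lines: raw events that share a truncated timestamp and the same dimension values collapse into one row with summed metrics. This is a hypothetical illustration of the concept, not Druid's actual implementation; the field names (`ts`, `page`, `clicks`) are made up for the example.

```python
from collections import defaultdict

def rollup(events, dims, metrics, granularity=3600):
    """Group events by (truncated timestamp, dimension values),
    summing each metric -- the same idea Druid applies at ingestion."""
    buckets = defaultdict(lambda: defaultdict(int))
    for e in events:
        key = (e["ts"] - e["ts"] % granularity,) + tuple(e[d] for d in dims)
        for m in metrics:
            buckets[key][m] += e[m]
    return [dict(zip(("ts", *dims), key), **vals)
            for key, vals in buckets.items()]

raw = [
    {"ts": 100, "page": "a", "clicks": 1},
    {"ts": 200, "page": "a", "clicks": 3},
    {"ts": 300, "page": "b", "clicks": 2},
]
rolled = rollup(raw, dims=["page"], metrics=["clicks"])
# three raw rows collapse to two rolled-up rows (one per page per hour)
```

The more repetitive the dimension values within a time bucket, the greater the reduction, which is why rolled-up Druid segments can be much smaller than the raw data.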

Hope this helps.

Gian


Scott Kinney

Jun 8, 2016, 5:41:50 PM
to Druid User
We are trying to move a big chunk of data out of MySQL and into S3. Keeping the data on EBS volumes is expensive, but we still want to query it.

Scott Kinney

Jun 8, 2016, 5:43:36 PM
to Druid User
Ok, I'll do some tests to see if the compression and rollup will be enough.

Gian Merlino

Jun 8, 2016, 5:45:02 PM
to druid...@googlegroups.com
Depending on how much data you have, and how much compute you want to dedicate to querying, you may be able to use the local disks that instances already come with. r3 and i2 instances are pretty popular for Druid query nodes, and the i2s especially have a lot of local storage. Many users in AWS find that they don't need EBS.

Gian

Scott Kinney

Jun 8, 2016, 5:52:26 PM
to Druid User
I was just thinking that we could use the instance store volumes, since the data will still persist in S3. Maybe; there's a lot of data, though.

Fangjin

Jun 8, 2016, 6:03:05 PM
to Druid User
Hi Scott, just out of curiosity, what issues are you facing with MySQL?

Scott Kinney

Jun 8, 2016, 6:05:02 PM
to Druid User
This is more about cost savings. All this EBS storage is expensive.