Number of segments Pinot can handle?

Denis Dudinski

Jan 31, 2019, 12:24:33 PM
to Pinot Users
Hi team,

I have a couple of questions regarding data segmentation in Pinot:

1. What is the maximum number of segments Pinot can handle?
2. Does Pinot tolerate skewed segment sizes? For example, one segment is 10 GB and another is 100 MB.
3. How does Pinot merge the "same" records from different segments during query processing? As I understand it, different segments can contain data for the same record. Is there some kind of "record ID" that I need to declare in the schema?

Thank you!

kishore g

Jan 31, 2019, 1:33:15 PM
to Denis Dudinski, us...@pinot.apache.org, Pinot Users
Adding the Apache Pinot User Group.
1. The most segments we have seen so far is 1 million, across 2k+ tables.
2. While Pinot can tolerate skew, it's not advisable. The preferred segment size is 500 MB to 1 GB; you can go up to 2 GB in some cases. Note that there is no parallelism within a segment during query execution.
3. Can you explain what you mean by "different segments can contain data for the same record"? A concrete example would be helpful.
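
To make the sizing advice in point 2 concrete, here is a minimal sketch of how one might split a batch of data into evenly sized segments instead of pushing skewed ones. It is not part of Pinot: the function names `plan_segments` and `in_preferred_range` are hypothetical, and the sketch assumes 1 MB = 1024 * 1024 bytes and uses the 500 MB to 1 GB range quoted above as its bounds.

```python
# Illustrative sizing helper; names and the 1024-based units are assumptions,
# and the bounds are the 500 MB to 1 GB range quoted in the reply above.
MB = 1024 ** 2
GB = 1024 ** 3
PREFERRED_MIN_BYTES = 500 * MB
PREFERRED_MAX_BYTES = 1 * GB


def plan_segments(total_bytes, target_bytes=PREFERRED_MAX_BYTES):
    """Return (segment_count, bytes_per_segment) for an even split.

    The count is the smallest number of segments that keeps every
    segment at or below target_bytes.
    """
    if total_bytes <= 0 or target_bytes <= 0:
        raise ValueError("sizes must be positive")
    count = -(-total_bytes // target_bytes)  # ceiling division on integers
    per_segment = -(-total_bytes // count)
    return count, per_segment


def in_preferred_range(segment_bytes):
    """True if a segment size falls inside the preferred range."""
    return PREFERRED_MIN_BYTES <= segment_bytes <= PREFERRED_MAX_BYTES


# The skewed case from the question: one 10 GB segment plus one 100 MB segment.
total = 10 * GB + 100 * MB
count, per_segment = plan_segments(total)
```

For the skewed example from the question, the same data comes out as 11 segments of 940 MB each, all inside the preferred range. Because there is no parallelism within a segment, an even split like this also avoids one oversized segment dominating query latency.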

Denis Dudinski

Jan 31, 2019, 2:38:55 PM
to kishore g, us...@pinot.apache.org, Pinot Users
Hi Kishore,

Wow, awesomely quick answer, thanks!

Example for point 3:
1. A sensor produces data: <sensorId1, observationTime, metric1, metric2, ...>
2. The sensor’s data, along with data from other sensors, flows into Pinot via an HDFS pipeline (sensors-as-Avro -> segment preparation job -> push).
3. The sensor’s data is available for queries.
4. The sensor produces more data: <sensorId1, observationTime2, metric12, metric22, ...>
5. The segment preparation and import flow is run once again.
6. Now we have two segments, each containing data for the same sensorId1.
7. A user runs the query SELECT * FROM Sensor and sees duplicated sensorId1 data...

This case seems really obvious so I guess I’m missing something...

Thank you!

kishore g

Jan 31, 2019, 4:11:47 PM
to Denis Dudinski, us...@pinot.apache.org, Pinot Users
That's expected for time series analytics. The two events represent data captured from the sensor at two distinct points in time, so they are separate rows rather than duplicates. Ideally, you would run an aggregation, something like: select count(*), sum(metric) from Sensor where sensorId=X
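
The idea can be sketched with a small in-memory model. This is an illustration only, not Pinot's actual execution path: the row values, the single `metric` column, and the function names `aggregate_segment` and `aggregate` are assumptions made up for the example. Each segment holds rows for distinct observation times, and the aggregation combines per-segment partial results.

```python
# Two hypothetical segments produced by two runs of the import flow.
# Rows for sensorId1 differ by observationTime, so neither is a duplicate.
segment_1 = [
    {"sensorId": "sensorId1", "observationTime": 1, "metric": 10.0},
    {"sensorId": "sensorId2", "observationTime": 1, "metric": 7.0},
]
segment_2 = [
    {"sensorId": "sensorId1", "observationTime": 2, "metric": 12.5},
]


def aggregate_segment(rows, sensor_id):
    """Partial count(*) and sum(metric) for one segment."""
    values = [row["metric"] for row in rows if row["sensorId"] == sensor_id]
    return len(values), sum(values)


def aggregate(segments, sensor_id):
    """Combine the per-segment partial results into one answer."""
    total_count, total_sum = 0, 0.0
    for rows in segments:
        partial_count, partial_sum = aggregate_segment(rows, sensor_id)
        total_count += partial_count
        total_sum += partial_sum
    return total_count, total_sum


# Mirrors: select count(*), sum(metric) from Sensor where sensorId='sensorId1'
result = aggregate([segment_1, segment_2], "sensorId1")
```

Here `result` is `(2, 22.5)`: both observations for sensorId1 contribute to the aggregate, which is the intended behavior for time series data and needs no record ID in the schema.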
Feel free to ping us on apache-pinot.slack.com.
Thanks,
Kishore G

Denis Dudinski

Feb 1, 2019, 3:24:27 AM
to kishore g, Pinot Users
Got it. Thanks a lot!