Efficiently Loading a Large Time Series Dataset into KairosDB


Abdel K

Sep 5, 2021, 2:49:28 PM
to KairosDB

I am trying to load 100 billion multi-dimensional time series datapoints into Graphite from a CSV file with the following format:

  • timestamp value_1 value_2 ... value_n

I tried to find a fast loading method in the official documentation, and here's how I am currently doing the insertion (my codebase is in Python):

[screenshot of the insertion code]
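
Roughly, the script does the following (a minimal sketch of the approach; the KairosDB host, file name, metric names, and tag are placeholders):

    import csv

    import requests

    KAIROS_URL = "http://localhost:8080/api/v1/datapoints"  # placeholder host
    BATCH_SIZE = 65000

    def flush(batch):
        # POST one batch of data points to the KairosDB REST API
        requests.post(KAIROS_URL, json=batch).raise_for_status()

    batch = []
    with open("dataset.csv") as f:  # placeholder file name
        for row in csv.reader(f, delimiter=" "):
            ts_ms = int(row[0])  # KairosDB expects millisecond timestamps
            for i, value in enumerate(row[1:], start=1):
                batch.append({
                    "name": "value_%d" % i,            # one metric per column
                    "timestamp": ts_ms,
                    "value": float(value),
                    "tags": {"source": "csv_import"},  # at least one tag is required
                })
                if len(batch) >= BATCH_SIZE:
                    flush(batch)
                    batch = []
    if batch:
        flush(batch)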

As the code above shows, my script reads the dataset CSV file, prepares batches of 65,000 data points, and sends them using requests.post.

However, this method is not very efficient. I am trying to load 100 billion data points, and it is taking far longer than expected: loading just 3 million rows of 100 columns each has already been running for 29 hours, with an estimated 991 hours left to finish!


I am certain there is a better way to load the dataset into KairosDB. Any suggestions for a faster loading method?

Abdel K

Sep 5, 2021, 2:51:17 PM
to KairosDB
A small correction: I wrote Graphite in the beginning; it's of course KairosDB that I am trying to load the data into.

Best,
Abdel

Brian Hawkins

Sep 15, 2021, 9:38:20 AM
to KairosDB
I did a presentation a while ago on what it would take to do 1 million inserts per second into Kairos: https://www.youtube.com/watch?v=O5BVRUMsBp0

This largely depends on your setup: how many Kairos nodes do you have? How big is your Cassandra cluster? How many instances are inserting data?

I would test your script against a null endpoint (comment out the send or use a mock service) to see how fast it can read and process the data. You may need to divide the data into multiple files and run multiple instances of the script to send it in parallel. At 1 million points per second you should be able to load the data in just over a day; the larger the cluster, the faster you can go if you divide up your import script.
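
For example, something like this (just a sketch — it assumes you have already split the CSV into chunk files, and the host, paths, and worker count are placeholders; setting DRY_RUN gives you the null-endpoint test):

    import csv
    import glob
    from multiprocessing import Pool

    import requests

    KAIROS_URL = "http://localhost:8080/api/v1/datapoints"  # placeholder host
    BATCH_SIZE = 65000
    DRY_RUN = False  # True = skip the POST to benchmark pure parsing speed

    def flush(batch):
        if DRY_RUN:
            return  # null endpoint: measures how fast the script alone can go
        requests.post(KAIROS_URL, json=batch).raise_for_status()

    def load_file(path):
        # Same parse-and-batch loop as the original script, one file per worker
        batch = []
        with open(path) as f:
            for row in csv.reader(f, delimiter=" "):
                ts_ms = int(row[0])
                for i, value in enumerate(row[1:], start=1):
                    batch.append({"name": "value_%d" % i, "timestamp": ts_ms,
                                  "value": float(value),
                                  "tags": {"source": "csv_import"}})
                    if len(batch) >= BATCH_SIZE:
                        flush(batch)
                        batch = []
        if batch:
            flush(batch)

    if __name__ == "__main__":
        files = sorted(glob.glob("chunks/part_*.csv"))  # pre-split input files
        with Pool(processes=8) as pool:  # one sender per chunk stream
            pool.map(load_file, files)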

Brian


Abdel K

Sep 15, 2021, 9:53:32 AM
to KairosDB
Hello, thanks for your response.

I am using both H2 and Cassandra as storage. Cassandra, for the most part, seems to be significantly faster.

I am using only one node, no cluster, and only one instance inserting data. I think this is the bottleneck of my method.

Any ideas for ways around using multiple instances?

Thanks! 
Abdel

Brian Hawkins

Sep 15, 2021, 3:21:22 PM
to KairosDB
H2 was never meant for that kind of load; it is mostly for testing. A single Cassandra node should handle the load, in some ways better than a cluster because it doesn't have to replicate any data. You will likely need multiple Kairos nodes, though, because of the Cassandra client limit on sending data to the server.
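
Also double-check that Kairos is actually pointed at the Cassandra datastore rather than the default H2 store, in kairosdb.properties (these module names are the stock ones):

    # kairosdb.properties: select the Cassandra datastore (H2 is the default)
    #kairosdb.service.datastore=org.kairosdb.datastore.h2.H2Module
    kairosdb.service.datastore=org.kairosdb.datastore.cassandra.CassandraModule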

Brian
