issues running the "import" command in a pyspark environment

Jason You

Apr 23, 2018, 12:00:57 PM
to OpenTSDB
On Linux, I use a single CLI "import" command to bulk load many files into OpenTSDB without problems. However, when I created multiple instances of the "import" command from a pyspark environment to bulk load files in parallel, every instance returned errors and only part of the data from these files was loaded into OpenTSDB. My questions are:
1. Is it possible to run multiple instances of "./tsd import file1", "./tsd import file2", ... to bulk load data into OpenTSDB in parallel from a PySpark environment on Linux? (A sketch of what I'm doing follows below.)
2. If the answer to question 1 is yes, what are the steps to avoid the errors I ran into?
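For context, this is roughly the kind of launcher I have in mind. It's a minimal sketch: the file paths, app name, and partition count are placeholders, and it assumes the "./tsd" script is available at the same path on every executor node.

# Minimal sketch: one "./tsd import <file>" per input file, fanned out
# across Spark tasks. File paths and partition count are placeholders,
# and the ./tsd script must exist at the same path on every executor.
import subprocess
from pyspark import SparkContext

sc = SparkContext(appName="tsdb-bulk-import")

files = ["/data/file1", "/data/file2", "/data/file3"]  # placeholders

def run_import(path):
    proc = subprocess.run(
        ["./tsd", "import", path],
        stdout=subprocess.PIPE, stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    # Return the exit code and the tail of stderr so failures surface.
    return (path, proc.returncode, proc.stderr[-500:])

# One file per partition => one import process per Spark task.
results = sc.parallelize(files, len(files)).map(run_import).collect()
for path, code, err in results:
    print(path, code, err)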

Thanks,
Jason

ManOLamancha

May 22, 2018, 2:26:43 PM
to OpenTSDB
You should be able to, but you're likely hitting HBase limits at some point. What kind of errors are you seeing in the output?
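If it does turn out to be write pressure, one thing worth trying while you debug is capping how many imports run at once instead of launching them all together. A rough sketch, not OpenTSDB-specific; the pool size and file paths are placeholders to tune for your cluster:

# Rough sketch: cap concurrent imports so HBase isn't flooded.
# Pool size and file paths are placeholders to tune for your cluster.
import subprocess
from multiprocessing import Pool

files = ["/data/file1", "/data/file2", "/data/file3"]  # placeholders

def run_import(path):
    return path, subprocess.call(["./tsd", "import", path])

if __name__ == "__main__":
    with Pool(processes=2) as pool:  # at most two imports at a time
        for path, code in pool.imap_unordered(run_import, files):
            print(path, "ok" if code == 0 else "exit %d" % code)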

Alpha Kernel

Jul 17, 2018, 9:01:31 AM
to OpenTSDB
I had a lot of data to import and ran into memory issues when trying to read in large files. What worked for me was to split the data into smaller chunks, then import each file in a bash for loop inside a screen session. I ended up running it on three different VMs and got HBase write speeds of up to 450k/sec.
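For what it's worth, here is the same split-then-loop idea sketched in Python. The chunk size and input path are placeholders; I actually used the coreutils split command plus a bash for loop, which amounts to the same thing.

# Sketch of the split-then-import approach; chunk size and the input
# path are placeholders. (The bash equivalent is `split -l 5000000
# big.tsv chunk_` followed by a for loop over chunk_*.)
import subprocess

def split_file(src, lines_per_chunk=5000000):
    chunks, out = [], None
    with open(src) as f:
        for i, line in enumerate(f):
            if i % lines_per_chunk == 0:
                if out:
                    out.close()
                name = "%s.part%03d" % (src, len(chunks))
                chunks.append(name)
                out = open(name, "w")
            out.write(line)
    if out:
        out.close()
    return chunks

# Import one chunk at a time so memory stays bounded.
for chunk in split_file("/data/big_metrics.tsv"):
    subprocess.check_call(["./tsd", "import", chunk])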