Issues running the "import" command in a PySpark environment

Jason You

23 Apr 2018, 12:00:57
to OpenTSDB
On Linux, I can use a single CLI "import" command to bulk load a lot of files into OpenTSDB without problems. However, when I created multiple instances of the "import" command from a PySpark environment to bulk load files in parallel, all of the "import" instances returned errors and only part of the data from these files was loaded into OpenTSDB. My questions are:
1. Is it possible to run multiple instances of "./tsd import file1", "./tsd import file2", ... in parallel from a PySpark environment on Linux to bulk load data into OpenTSDB? (A sketch of the kind of parallel launch I mean follows below.)
2. If the answer to question 1 is yes, what are the steps to avoid the errors I ran into?
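
To make question 1 concrete, here is a minimal sketch of the kind of parallel launch I mean (the file paths, the tsdb install location, and the parallelism level are illustrative, not my exact setup):

    import subprocess
    from pyspark import SparkContext

    sc = SparkContext(appName="tsdb-bulk-import")

    # Illustrative list of files to bulk load.
    files = ["/data/tsdb/file1", "/data/tsdb/file2",
             "/data/tsdb/file3", "/data/tsdb/file4"]

    def run_import(path):
        # Each task shells out to the OpenTSDB CLI importer for one file.
        # /opt/opentsdb/build/tsdb is an assumed install location.
        result = subprocess.run(
            ["/opt/opentsdb/build/tsdb", "import", path],
            capture_output=True, text=True)
        # Return the exit code and the tail of stderr so failures
        # surface at the driver instead of disappearing on executors.
        return (path, result.returncode, result.stderr[-500:])

    # One partition per file, so each import runs as its own task.
    results = sc.parallelize(files, len(files)).map(run_import).collect()
    for path, code, err in results:
        print(path, "OK" if code == 0 else "FAILED: " + err)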

Thanks,
Jason

ManOLamancha

22 May 2018, 14:26:43
to OpenTSDB
You should be able to, but you're likely hitting HBase limits at some point. What kind of errors are you seeing in the output?

Alpha Kernel

17 Jul 2018, 09:01:31
to OpenTSDB
I had a lot of data to import and ran into memory issues when trying to read in large files. What worked for me was to split the data into smaller chunks and then import each file in a bash for loop inside a screen session (roughly the idea sketched below). I ended up running it on three different VMs and got HBase write speeds of up to 450k/sec.
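
For reference, here is the same chunk-then-import idea rendered as a single Python script rather than split plus a bash for loop; the chunk size, file paths, and tsdb install location are hypothetical:

    import subprocess

    CHUNK_LINES = 1_000_000             # hypothetical chunk size; tune for memory
    SRC = "/data/tsdb/huge_export.txt"  # hypothetical source file

    def split_file(path, lines_per_chunk):
        # Split a large text file into numbered chunks; return the chunk names.
        chunks, out = [], None
        with open(path) as f:
            for i, line in enumerate(f):
                if i % lines_per_chunk == 0:
                    if out:
                        out.close()
                    name = "%s.part%04d" % (path, i // lines_per_chunk)
                    chunks.append(name)
                    out = open(name, "w")
                out.write(line)
        if out:
            out.close()
        return chunks

    # Import chunks one at a time so the importer never has to hold a huge
    # file in memory; check=True stops the loop on the first failed import.
    for chunk in split_file(SRC, CHUNK_LINES):
        subprocess.run(["/opt/opentsdb/build/tsdb", "import", chunk], check=True)

Running one such loop per VM is what spreads the write load across machines.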