How can I import CSV data with 50,000 columns and 100,000 rows?


chen jin

Jun 13, 2024, 8:54:22 PM
to Druid User
I tried to import a csv file with 50,000 columns and 100,000 rows. It took 80 minutes to parse the data, but an error was reported when writing the segment.
The MiddleManager runtime configuration on the data node is:

```
# Task launch parameters
druid.indexer.runner.javaOptsArray=["-server","-Xms10g","-Xmx10g","-XX:MaxDirectMemorySize=20g","-Duser.timezone=UTC","-Dfile.encoding=UTF-8","-XX:+ExitOnOutOfMemoryError","-Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager"]

```
I submit the ingestion task with the following spec:

```
{
    "type": "index_parallel",
    "spec": {
        "ioConfig": {
            "type": "index_parallel",
            "inputSource": {
                "type": "local",
                "baseDir": "/opt/large_data5c10w.txt",
                "filter": "*"
            },
            "inputFormat": {
                "type": "csv",
                "findColumnsFromHeader": false,
                "columns": ["dataTime", "column001", "column002", ..., "column0049998", "column0049999"]
            },
            "appendToExisting": true
        },
        "tuningConfig": {
            "type": "index_parallel",
            "partitionsSpec": {
                "type": "dynamic",
                "maxTotalRows": 2000000
            },
            "maxNumConcurrentSubTasks": 1
        },
        "dataSchema": {
            "dataSource": "large_data5c10w",
            "timestampSpec": {
                "column": "dataTime",
                "format": "auto"
            },
            "transformSpec": {},
            "dimensionsSpec": {
                "dimensions": [{"type": "int", "name": "column001"}, ..., "column0049999"]
            },
            "granularitySpec": {
                "queryGranularity": "none",
                "rollup": false
            }
        }
    }
}
```

Exception message:

```
2024-06-13T14:46:47,149 INFO [[index_parallel_large_data5c10w_lgmfjcii_2024-06-13T13:29:05.085Z]-appenderator-persist] org.apache.druid.segment.realtime.appenderator.AppenderatorImpl - Persisted stats: processed rows: [17536], persisted rows[148], sinks: [1], total fireHydrants (across sinks): [29], persisted fireHydrants (across sinks): [2]
2024-06-13T14:46:47,150 INFO [[index_parallel_large_data5c10w_lgmfjcii_2024-06-13T13:29:05.085Z]-appenderator-merge] org.apache.druid.segment.realtime.appenderator.AppenderatorImpl - Preparing to push (stats): processed rows: [17536], sinks: [1], fireHydrants (across sinks): [29]
2024-06-13T14:50:06,066 ERROR [task-runner-0-priority-0] org.apache.druid.indexing.common.task.IndexTask - Encountered exception in BUILD_SEGMENTS.
java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Direct buffer memory
at org.apache.druid.indexing.common.task.IndexTask.generateAndPublishSegments(IndexTask.java:1055) ~[druid-indexing-service-29.0.1.jar:29.0.1]
at org.apache.druid.indexing.common.task.IndexTask.runTask(IndexTask.java:548) ~[druid-indexing-service-29.0.1.jar:29.0.1]
at org.apache.druid.indexing.common.task.AbstractTask.run(AbstractTask.java:179) ~[druid-indexing-service-29.0.1.jar:29.0.1]
at org.apache.druid.indexing.common.task.batch.parallel.ParallelIndexSupervisorTask.runSequential(ParallelIndexSupervisorTask.java:1214) ~[druid-indexing-service-29.0.1.jar:29.0.1]
at org.apache.druid.indexing.common.task.batch.parallel.ParallelIndexSupervisorTask.runTask(ParallelIndexSupervisorTask.java:551) ~[druid-indexing-service-29.0.1.jar:29.0.1]
at org.apache.druid.indexing.common.task.AbstractTask.run(AbstractTask.java:179) ~[druid-indexing-service-29.0.1.jar:29.0.1]

```
How can I optimize?

Peter Marshall

Jun 18, 2024, 3:49:46 AM
to Druid User
Wow, 50,000 columns?! I would suggest looking at reducing the column count. Druid is a columnar database, for sure, but do your end users really need to be able to group by, filter, and do calculations on 50,000 columns in the same UI?

The error you received relates not to the MiddleManager process itself, but to the ingestion tasks it spins up to ingest the data.

java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Direct buffer memory

This is an out-of-memory error in the Java Virtual Machine running those tasks, raised when the process tried to parse the data and generate the segment files.

You can tune the JVM memory settings yourself via javaOptsArray in the runtime.properties files.

If you are using the start-druid script, the settings are calculated for you. Take a look at https://druid.apache.org/docs/latest/operations/single-server for information about start-druid, including how to see what values it would use on a particular machine. I mention this because you may want to try running the ingestion on a machine with more memory; start-druid may then automatically allocate more memory to the tasks by itself.
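As a sketch only (the right numbers depend on the machine), raising the heap and direct-memory ceilings in the MiddleManager runtime.properties you posted might look like this:

```
# MiddleManager runtime.properties -- illustrative values, not a recommendation.
# The failing task ran with -Xmx10g / -XX:MaxDirectMemorySize=20g and hit
# "OutOfMemoryError: Direct buffer memory" while building/pushing segments,
# so MaxDirectMemorySize is the main ceiling to raise here.
druid.indexer.runner.javaOptsArray=["-server","-Xms16g","-Xmx16g","-XX:MaxDirectMemorySize=40g","-Duser.timezone=UTC","-Dfile.encoding=UTF-8","-XX:+ExitOnOutOfMemoryError","-Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager"]
```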

Hope this helps!

chen jin

Jun 18, 2024, 8:32:13 AM
to Druid User
Thank you, Peter. Ingesting data with this many columns is slow, so we will reduce the number of columns.

John Kowtko

Jun 18, 2024, 9:21:00 AM
to Druid User
Hi Chen,  

In addition to Peter's suggestions, there are maybe a couple other things you could try as well:

 * reduce the values for maxRowsInMemory and maxRowsPerSegment ... if your rows are large then reducing these may help keep the in-memory size under control
 * set maxColumnsToMerge to something moderate, like 50 or 100 (see the sketch after this list) ...
 * If the majority of the columns are sparse properties, mostly tenant-specific attributes in a multi-tenant table, you might consider lumping them under a nested JSON column
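As a rough sketch (the values are placeholders to experiment with, not recommendations), those settings would slot into the tuningConfig of the spec above like so:

```
"tuningConfig": {
    "type": "index_parallel",
    "maxRowsInMemory": 5000,
    "maxColumnsToMerge": 100,
    "partitionsSpec": {
        "type": "dynamic",
        "maxRowsPerSegment": 100000,
        "maxTotalRows": 2000000
    },
    "maxNumConcurrentSubTasks": 1
}
```

And if the sparse columns can be pre-combined into a single JSON field in the source data, a nested column can be declared in the dimensionsSpec like this (the field name "attributes" is hypothetical):

```
"dimensionsSpec": {
    "dimensions": [
        {"type": "json", "name": "attributes"}
    ]
}
```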

Doc refs: 

Thanks.  John
