Regarding data ingestion of parquet files


Ajay Kumar Gupta

Aug 16, 2023, 6:23:14 AM
to ArangoDB
Hi everyone,
I am experimenting with data ingestion into ArangoDB. I have a parquet dataset of 50 million rows and 55 columns; consider it vertex data. I want to ingest it into ArangoDB.
Q1. I used the ArangoDB Datasource for Apache Spark, but the ingestion takes 3 hours and PySpark's temp space grows to more than 100 GB. I am using the following configuration:

def write_to_arangodb(df, database, collection, user, password, hosts, endpoints, table, batch_size):

    df.write \
        .format("com.arangodb.spark") \
        .mode("append") \
        .option("database", database) \
        .option("collection", collection) \
        .option("user", user) \
        .option("password", password) \
        .option("arangodb.hosts", hosts) \
        .option("endpoints", endpoints) \
        .option("table", table) \
        .option("async", True) \
        .option("batchSize", batch_size) \
        .option("overwriteMode", "ignore")\
        .option("timeout", 30000000) \
        .save()

write_to_arangodb(df, "Vertex", "Vertex", "root", "", "0.0.0.0:8529", "0.0.0.0:8535", "Vertex", 50000)

Can someone suggest whether I am doing this the right way, or am I missing something?
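For example, would repartitioning the DataFrame before the write help spread the load across more tasks? A rough sketch of what I mean (the partition count of 128 is just a placeholder, not something I have benchmarked):

# Hypothetical tweak: increase write parallelism by repartitioning first.
# 128 is an arbitrary placeholder partition count, not a recommendation.
df_repartitioned = df.repartition(128)
write_to_arangodb(df_repartitioned, "Vertex", "Vertex", "root", "",
                  "0.0.0.0:8529", "0.0.0.0:8535", "Vertex", 50000)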

Also, is there any way to ingest a parquet file using the command line? The df above is created from a parquet file, roughly as sketched below.
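For context, the df is built roughly like this (the path is a placeholder for my actual parquet location):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-to-arangodb").getOrCreate()

# Placeholder path; the real dataset is ~50 million rows x 55 columns.
df = spark.read.parquet("/data/vertex_data.parquet")

For the command-line route, the only idea I have so far (untested; assumes pyarrow is available, and paths and batch size are placeholders) is to dump the parquet to JSONL and feed it to arangoimport:

import json
import pyarrow.parquet as pq

# Stream the parquet file to JSONL in batches to avoid loading all rows at once.
pf = pq.ParquetFile("/data/vertex_data.parquet")
with open("/data/vertex_data.jsonl", "w") as out:
    for batch in pf.iter_batches(batch_size=100_000):
        for row in batch.to_pylist():
            out.write(json.dumps(row, default=str) + "\n")

# Then, from a shell, something like:
#   arangoimport --file /data/vertex_data.jsonl --type jsonl \
#       --collection Vertex --server.database Vertex \
#       --server.endpoint tcp://0.0.0.0:8529 --server.username root

Is there a more direct command-line route than this? Thanks!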