I am experimenting with data ingestion into ArangoDB. I have a Parquet file with 50 million rows and 55 columns; consider it vertex data. I want to ingest it into ArangoDB.
Q1. I used the ArangoDB Datasource for Apache Spark, but ingestion takes about 3 hours and PySpark's temp space usage grows beyond 100 GB. I am using the following configuration:
def write_to_arangodb(df, database, collection, user, password, hosts, endpoints, table, batch_size):
    df.write \
        .format("com.arangodb.spark") \
        .mode("append") \
        .option("database", database) \
        .option("collection", collection) \
        .option("user", user) \
        .option("password", password) \
        .option("arangodb.hosts", hosts) \
        .option("endpoints", endpoints) \
        .option("table", table) \
        .option("async", True) \
        .option("batchSize", batch_size) \
        .option("overwriteMode", "ignore") \
        .option("timeout", 30000000) \
        .save()

write_to_arangodb(df, "Vertex", "Vertex", "root", "", "0.0.0.0:8529",
                  "0.0.0.0:8535", "Vertex", 50000)
Can someone suggest whether I am doing this the right way, or am I missing something?
Also, is there any way to ingest a Parquet file using the command line? Thanks
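To make the second question concrete: the only command-line route I can think of is converting the Parquet file to JSONL first and then bulk-loading it with arangoimport. A rough sketch of what I mean (file paths, endpoint, and credentials are placeholders, and I have not verified this scales well to 50 million rows):

# Sketch only: convert Parquet to JSONL, then load it with the arangoimport CLI.
# All paths, the endpoint, and credentials below are placeholders.
import json
import subprocess

import pyarrow.parquet as pq

parquet_path = "/path/to/vertex_data.parquet"  # placeholder
jsonl_path = "/path/to/vertex_data.jsonl"      # placeholder

# Stream the Parquet file in batches so all 50M rows never sit in memory at once.
with open(jsonl_path, "w") as out:
    pf = pq.ParquetFile(parquet_path)
    for batch in pf.iter_batches(batch_size=100_000):
        for row in batch.to_pylist():
            out.write(json.dumps(row, default=str) + "\n")

# Bulk-load the JSONL file into the Vertex collection with arangoimport.
subprocess.run(
    [
        "arangoimport",
        "--file", jsonl_path,
        "--type", "jsonl",
        "--collection", "Vertex",
        "--create-collection", "true",
        "--server.endpoint", "tcp://0.0.0.0:8529",
        "--server.database", "Vertex",
        "--server.username", "root",
        "--server.password", "",
    ],
    check=True,
)

If there is a more direct option that avoids the intermediate JSONL conversion, that is what I am hoping to learn.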