I am experimenting with data ingestion into ArangoDB. I have a Parquet file with 50 million rows and 55 columns; consider it vertex data. I want to ingest it into ArangoDB.
Q1. I used the ArangoDB Datasource for Apache Spark, but ingestion takes about 3 hours and PySpark's temp space usage grows beyond 100 GB. I am using the following configuration:
def write_to_arangodb(df, database, collection, user, password, hosts, endpoints, table, batch_size):
    df.write \
        .format("com.arangodb.spark") \
        .mode("append") \
        .option("database", database) \
        .option("collection", collection) \
        .option("user", user) \
        .option("password", password) \
        .option("arangodb.hosts", hosts) \
        .option("endpoints", endpoints) \
        .option("table", table) \
        .option("async", True) \
        .option("batchSize", batch_size) \
        .option("overwriteMode", "ignore") \
        .option("timeout", 30000000) \
        .save()

write_to_arangodb(df, "Vertex", "Vertex", "root", "", "0.0.0.0:8529",
                  "0.0.0.0:8535", "Vertex", 50000)
Can someone suggest whether I am doing this the right way, or am I missing something?
Also, is there any way to ingest a Parquet file using the command line? Thanks
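To make the second question concrete: the only command-line route I can think of is converting the Parquet file to JSONL first and then bulk-loading it with arangoimport. A rough sketch of what I mean (file paths, endpoint, and credentials are placeholders, and I have not verified this scales well to 50 million rows):

# Sketch only: convert Parquet to JSONL, then load it with the arangoimport CLI.
# All paths, the endpoint, and credentials below are placeholders.
import json
import subprocess

import pyarrow.parquet as pq

parquet_path = "/path/to/vertex_data.parquet"  # placeholder
jsonl_path = "/path/to/vertex_data.jsonl"      # placeholder

# Stream the Parquet file in batches so all 50M rows never sit in memory at once.
with open(jsonl_path, "w") as out:
    pf = pq.ParquetFile(parquet_path)
    for batch in pf.iter_batches(batch_size=100_000):
        for row in batch.to_pylist():
            out.write(json.dumps(row, default=str) + "\n")

# Bulk-load the JSONL file into the Vertex collection with arangoimport.
subprocess.run(
    [
        "arangoimport",
        "--file", jsonl_path,
        "--type", "jsonl",
        "--collection", "Vertex",
        "--create-collection", "true",
        "--server.endpoint", "tcp://0.0.0.0:8529",
        "--server.database", "Vertex",
        "--server.username", "root",
        "--server.password", "",
    ],
    check=True,
)

If there is a more direct option that avoids the intermediate JSONL conversion, that is what I am hoping to learn.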