+-----+--------------------+--------------------+--------------------+--------------------+--------------------+
|index| input| tf.identity| tf.identity_1| tf.identity_2| tf.identity_3|
+-----+--------------------+--------------------+--------------------+--------------------+--------------------+
| 0|[[[0.524009594978...|[[[0.7539101, 1.1...|[[[-0.8792659, 0....|[[[[3.1407502, 0....|[[[0.97345, 0.002...|
| 1|[[[0.410070298834...|[[[0.7540849, 1.1...|[[[-0.87942857, 0...|[[[[3.149644, 0.2...|[[[0.97356623, 0....|
| 2|[[[0.855045850696...|[[[0.7491907, 1.1...|[[[-0.8787027, 0....|[[[[3.1266716, 0....|[[[0.97360265, 0....|
| 3|[[[0.005269990329...|[[[0.74902785, 1....|[[[-0.87985, 0.14...|[[[[3.1379035, 0....|[[[0.9735233, 0.0...|
| 4|[[[0.524009594978...|[[[0.7539101, 1.1...|[[[-0.8792659, 0....|[[[[3.1407502, 0....|[[[0.97345, 0.002...|
| 5|[[[0.410070298834...|[[[0.7540849, 1.1...|[[[-0.87942857, 0...|[[[[3.149644, 0.2...|[[[0.97356623, 0....|
| 6|[[[0.855045850696...|[[[0.7491907, 1.1...|[[[-0.8787027, 0....|[[[[3.1266716, 0....|[[[0.97360265, 0....|
| 7|[[[0.005269990329...|[[[0.74902785, 1....|[[[-0.87985, 0.14...|[[[[3.1379035, 0....|[[[0.9735233, 0.0...|
+-----+--------------------+--------------------+--------------------+--------------------+--------------------+
However, the dataframe implementation takes more time and consumes more memory than the xShards implementation, because a Spark dataframe can only hold nested Array[Array[...]] columns and the ndarray.tolist() conversion this requires performs poorly. We are working on optimizing this.
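To illustrate the tolist() cost mentioned above, here is a minimal local timing sketch (the array size is illustrative, not from any benchmark in this thread):

```python
# Quick sketch of why ndarray.tolist() is costly: converting even a
# modest float32 batch into nested Python lists allocates a Python
# object per element, which is far slower than keeping the ndarray.
import timeit
import numpy as np

a = np.random.rand(8, 224, 224, 3).astype(np.float32)  # ~4.8 MB batch
elapsed = timeit.timeit(a.tolist, number=3) / 3
print(f"tolist() on a {a.nbytes / 1e6:.1f} MB array: {elapsed:.3f}s per call")
```

Every row written to the Spark dataframe pays this conversion, which is why the dataframe path is slower and heavier than xShards.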
I recommend using xShards for OpenVINO inference. You can follow these steps:
1. Create the SparkXShards (https://bigdl.readthedocs.io/en/latest/doc/Orca/Overview/data-parallel-processing.html#xshards-distributed-data-parallel-python-processing)
You can read the image files with a Spark dataframe and convert it to an xShards of pandas dataframes using orca.utils.to_pandas. Alternatively, you can create a Spark RDD with image file paths in each partition, wrap it into an xShards with SparkXShards(rdd), and then use shards.transform_shard(your_func) for file reading, preprocessing, etc.
2. Prepare the xShards for estimator.predict
Transform the xShards with shards.transform_shard(your_func) so that each partition is a dictionary of {'x': batched input ndarray}, then feed it into the estimator's predict function.
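The two steps above can be sketched end to end as follows. This is a minimal sketch, not code from this thread: the helper names, the OpenCV-based image decoding, and the exact init_orca_context / Estimator.from_openvino arguments are my assumptions about a typical BigDL Orca OpenVINO setup.

```python
# Sketch of steps 1 and 2 (hypothetical helper names; assumes BigDL
# Orca with the OpenVINO Estimator and OpenCV are installed).
import numpy as np

def make_feed_dict(images):
    """Step 2 helper: batch a list of HWC float32 image arrays into
    the {'x': ndarray} dict expected by estimator.predict."""
    return {"x": np.stack(images).astype(np.float32)}

def run_inference(model_path, image_paths, num_partitions=2):
    """Steps 1 and 2; run on a machine with Spark and BigDL Orca."""
    import cv2  # assumption: OpenCV for image decoding
    from bigdl.orca import init_orca_context
    from bigdl.orca.data import SparkXShards
    from bigdl.orca.learn.openvino import Estimator

    sc = init_orca_context(cores=4, memory="5g")

    # Step 1: an RDD with a list of image paths per partition,
    # wrapped into an xShards via SparkXShards(rdd).
    rdd = sc.parallelize(image_paths, num_partitions).glom()
    shards = SparkXShards(rdd)

    # File reading + preprocessing inside each shard.
    def load_images(paths):
        return [cv2.imread(p).astype(np.float32) for p in paths]
    shards = shards.transform_shard(load_images)

    # Step 2: each partition becomes {'x': batched ndarray}, then predict.
    shards = shards.transform_shard(make_feed_dict)
    est = Estimator.from_openvino(model_path=model_path)
    return est.predict(shards)

# Local check of the batching helper on dummy data (no cluster needed):
feed = make_feed_dict([np.zeros((224, 224, 3)), np.ones((224, 224, 3))])
print(feed["x"].shape)  # (2, 224, 224, 3)
```

run_inference is only defined here, not called, since it needs a running Spark context; adjust the preprocessing inside load_images to match your model's expected input layout.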
Regarding the memory issue you mentioned before ("usage memory is only 342 MB after that value my spark session being crashed"):
The storage memory shown in the Spark UI holds all cached data, and broadcast variables are stored there as well; for any persist option that includes MEMORY, Spark places the data in this segment. The estimator only broadcasts the OpenVINO model, and the 342 MB is the memory used to store that broadcast model, which is why the value is still 342 MB when your Spark session crashes.
I tested locally with 5g driver memory: 48 images when the input is an ndarray, 34 images with xShards, and 54 images with a Spark dataframe (partition number: 2, batch size: 4). On yarn cluster mode, 190 images with xShards (num_nodes: 2, executor_memory: 10g, driver_memory: 10g).
I have observed the excessive memory usage issue and am working to locate and resolve it.
Best Regards,