BigDL Orca training with a text dataset


AN-TRUONG Tran Phan

Nov 11, 2023, 8:53:37 PM
to User Group for BigDL
Dear all,

I have a project to detect whether an email is spam or ham. I finished training with TensorFlow 2 and Keras in a single-machine environment. When I changed the environment to BigDL Orca on Spark, my code raised the error below. I need help with a tf.data.Dataset whose content is text.

Thank you so much

  File "/home/ubuntu/anaconda3/envs/py37/lib/python3.7/site-packages/ray/serialization.py", line 361, in serialize
    return self._serialize_to_msgpack(value)
  File "/home/ubuntu/anaconda3/envs/py37/lib/python3.7/site-packages/ray/serialization.py", line 341, in _serialize_to_msgpack
    self._serialize_to_pickle5(metadata, python_objects)
  File "/home/ubuntu/anaconda3/envs/py37/lib/python3.7/site-packages/ray/serialization.py", line 301, in _serialize_to_pickle5
    raise e
  File "/home/ubuntu/anaconda3/envs/py37/lib/python3.7/site-packages/ray/serialization.py", line 298, in _serialize_to_pickle5
    value, protocol=5, buffer_callback=writer.buffer_callback)
  File "/home/ubuntu/anaconda3/envs/py37/lib/python3.7/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 73, in dumps
    cp.dump(obj)
  File "/home/ubuntu/anaconda3/envs/py37/lib/python3.7/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 620, in dump
    return Pickler.dump(self, obj)
  File "/home/ubuntu/anaconda3/envs/py37/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 1074, in __reduce__
    return convert_to_tensor, (self._numpy(),)
  File "/home/ubuntu/anaconda3/envs/py37/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 1117, in _numpy
    raise core._status_to_exception(e) from None  # pylint: disable=protected-access
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot convert a Tensor of dtype variant to a NumPy array.
Stopping orca context
Attachments: log, mail_data.csv, code.py

huangka...@gmail.com

Nov 12, 2023, 9:16:17 PM
to User Group for BigDL
Hi,

I think you need to put the dataset in a creator function as well, e.g. https://github.com/intel-analytics/BigDL/blob/main/python/orca/tutorial/tf/transfer_learning.py#L65
Can you try that and see if it resolves the error?
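
Roughly like this (an untested sketch; the CSV column names, the spam/ham label mapping, and model_creator below are assumptions, not taken from your code.py):

import pandas as pd
import tensorflow as tf
from bigdl.orca.learn.tf2 import Estimator

def train_data_creator(config, batch_size):
    # Build the tf.data.Dataset inside the creator so every Ray worker
    # constructs it locally, instead of Ray trying to pickle dataset tensors
    # (which is what raises the "dtype variant" error).
    df = pd.read_csv("mail_data.csv")
    texts = df["Message"].astype(str).values                     # assumed text column
    labels = (df["Category"] == "spam").astype("int32").values   # assumed label column
    ds = tf.data.Dataset.from_tensor_slices((texts, labels))
    return ds.shuffle(len(df)).batch(batch_size)

# model_creator is your existing function that returns a compiled Keras model.
est = Estimator.from_keras(model_creator=model_creator)
est.fit(data=train_data_creator, epochs=5, batch_size=64)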

Thanks,
Kai

AN-TRUONG Tran Phan

Nov 17, 2023, 4:31:27 AM
to huangka...@gmail.com, User Group for BigDL
Dear Kai and all,

My data is text from a CSV file, and it is not in the correct format for BigDL Orca on Spark.

I am trying to convert the data from the CSV file into a dataset format that BigDL accepts, but without success. Does anyone have another approach that works for data like mine?

Best regards,
Truong




--
Best regards,

An Trường.

huangka...@gmail.com

Nov 21, 2023, 9:30:42 PM
to User Group for BigDL
Hi Truong,

Sorry for the late reply.

Actually, the step here: https://bigdl.readthedocs.io/en/latest/doc/Orca/Howto/tf2keras-quickstart.html#step-3-define-the-dataset is not BigDL-specific; it is just a function that creates a dataset that TensorFlow accepts. Basically, you can reuse the same TensorFlow code you already use to handle the CSV file here as well.
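
For example, the same tf.data code you would write on a single machine can go inside the creator, something like this (untested sketch; the column names and the spam/ham mapping are assumptions):

import tensorflow as tf

def train_data_creator(config, batch_size):
    # Let TensorFlow parse the CSV directly; "Message" and "Category" are
    # assumed column names in mail_data.csv.
    ds = tf.data.experimental.make_csv_dataset(
        "mail_data.csv",
        batch_size=batch_size,
        select_columns=["Message", "Category"],
        label_name="Category",
        num_epochs=1,
        shuffle=True)
    # Keep the raw text as the feature and turn "spam"/"ham" into 1/0.
    return ds.map(lambda features, label:
                  (features["Message"], tf.cast(label == "spam", tf.int32)))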

Alternatively, you can use the SparkXShards API in BigDL to read CSV files. Example here: https://github.com/intel-analytics/BigDL/blob/main/python/orca/tutorial/NCF/process_xshards.py#L74
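
Very roughly (untested; the column names and label mapping are assumptions, and your Keras model would need to accept raw strings, e.g. via a TextVectorization layer):

from bigdl.orca.data.pandas import read_csv

# Read the CSV into a SparkXShards of pandas DataFrames.
shards = read_csv("mail_data.csv")

# Convert each partition into the {"x": features, "y": labels} dict format
# that Estimator.fit accepts; "Message"/"Category" are assumed column names.
def to_xy(df):
    return {"x": df["Message"].astype(str).values,
            "y": (df["Category"] == "spam").astype("int32").values}

shards = shards.transform_shard(to_xy)

# est is the Orca Estimator created as in the earlier sketch.
est.fit(data=shards, epochs=5, batch_size=64)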

Thanks,
Kai
