Hello MLflow team,
There is an issue on the pyarrow side that leads to the following MLflow misbehavior when artifacts are stored on HDFS:
1. When the size of an artifact is less than 6144 MB, mlflow.pyfunc.log_model uploads a corrupted artifact to HDFS whose size is not greater than 2 GB.
2. When the size of an artifact is greater than or equal to 6144 MB, an exception is raised.
Stacktrace:
"""
site-packages/mlflow/store/artifact/hdfs_artifact_repo.py in log_artifacts(self, local_dir, artifact_path)
66 destination = posixpath.join(hdfs_subdir_path, each_file)
67 with hdfs.open(destination, 'wb') as output_stream:
---> 68 output_stream.write(open(source, "rb").read())
69
70 def list_artifacts(self, path=None):
site-packages/pyarrow/io.pxi in pyarrow.lib.NativeFile.write()
site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
OSError: HDFS Write failed, errno: 22 (Invalid argument)
"""
A Python script that reproduces the issue using only the pyarrow library:
"""
import os

import pyarrow as pa

# ARROW_LIBHDFS_DIR must point to the directory containing libhdfs.so
os.environ["JAVA_HOME"] = "<java_home>"
os.environ["ARROW_LIBHDFS_DIR"] = "<directory containing libhdfs.so>"

connected = pa.hdfs.connect(host="<host>", port=8020)
destination = "hdfs://<host>:8020/user/tmp/6144m.txt"
source = "/tmp/6144m.txt"

# Reading the whole 6144 MB file into memory and writing it to HDFS in a
# single call reproduces the failure.
with connected.open(destination, "wb") as output_stream:
    output_stream.write(open(source, "rb").read())
connected.close()
"""
The issue was also reported to the pyarrow team, and their answer is:
"""
Cheers,
Sergey