Hello MLflow team,
There is an issue on the pyarrow side that leads to the following MLflow misbehavior when artifacts are stored on HDFS:
1. When the size of an artifact is less than 6144 MB, mlflow.pyfunc.log_model uploads a corrupted artifact to HDFS whose size is not greater than 2 GB.
2. When the size of an artifact is greater than or equal to 6144 MB, an exception is raised.
Stacktrace:
"""
site-packages/mlflow/store/artifact/hdfs_artifact_repo.py in log_artifacts(self, local_dir, artifact_path)
66 destination = posixpath.join(hdfs_subdir_path, each_file)
67 with hdfs.open(destination, 'wb') as output_stream:
---> 68 output_stream.write(open(source, "rb").read())
69
70 def list_artifacts(self, path=None):
site-packages/pyarrow/io.pxi in pyarrow.lib.NativeFile.write()
site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
OSError: HDFS Write failed, errno: 22 (Invalid argument)
"""
A Python script that reproduces the issue using only the pyarrow library:
"""
import os

import pyarrow as pa

# ARROW_LIBHDFS_DIR must point to the directory containing libhdfs.so
os.environ["JAVA_HOME"] = "<java_home>"
os.environ["ARROW_LIBHDFS_DIR"] = "<directory containing libhdfs.so>"

connected = pa.hdfs.connect(host="<host>", port=8020)
destination = "hdfs://<host>:8020/user/tmp/6144m.txt"
source = "/tmp/6144m.txt"

# Reading the whole 6144 MB file into memory and writing it to HDFS in a
# single call reproduces the failure.
with connected.open(destination, "wb") as output_stream:
    output_stream.write(open(source, "rb").read())
connected.close()
"""
The issue was also reported to the pyarrow team, and their answer is:
"""
Cheers,
Sergey