I'm trying to build the MAESTRO dataset in the Tensor2Tensor format, as a preliminary step before building my own dataset. I'm using the t2t-datagen command.
I found fixes for various problems along the way:
- I had to register the problem (score2perf_maestro_language_uncropped_aug) in the t2t_datagen.py file, otherwise it wasn't recognized.
- I had to add a region parameter to the pipeline options of the script provided here.
- I had to add 'sound': ['libsndfile1-dev'] to the EXTRAS_REQUIRE of the setup (to fix a bug when librosa is imported).
- I had to pin tensorflow and tensorflow-estimator to the same version (==2.6.0) in the setup.py file to work around a bug that caused this error: "AttributeError: module 'tensorflow.tools.docs.doc_controls' has no attribute 'inheritable_header'", which in turn led to another: "tensorflow.python.framework.errors_impl.AlreadyExistsError: Another metric with the same name already exists."
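Put together, the setup.py changes from the last two points looked roughly like this (a sketch; EXTRAS_REQUIRE and REQUIRED_PACKAGES are the field names as they appear in my setup.py, and the pins are simply what worked for me, not a general recommendation):

```python
# Sketch of the setup.py tweaks described above.
EXTRAS_REQUIRE = {
    'sound': ['libsndfile1-dev'],  # lets librosa import cleanly on the workers
}

REQUIRED_PACKAGES = [
    'tensorflow==2.6.0',
    # Pinning the estimator to the exact same version avoids the
    # doc_controls AttributeError and the AlreadyExistsError it triggers.
    'tensorflow-estimator==2.6.0',
]
```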
I'm now struggling with another problem.
The code runs fine at first: it starts the workers and goes through the steps until it reaches:
JOB_MESSAGE_BASIC: Executing operation input_transform_train/ReadAllFromTFRecord/ReadAllFiles/Reshard/ReshufflePerKey/GroupByKey/Read+input_transform_train etc
At this point, I run into this error:
INFO:apache_beam.runners.dataflow.dataflow_runner:2021-12-08T18:12:25.346Z: JOB_MESSAGE_ERROR: Traceback (most recent call last):
File "apache_beam/runners/common.py", line 1233, in apache_beam.runners.common.DoFnRunner.process
File "apache_beam/runners/common.py", line 571, in apache_beam.runners.common.SimpleInvoker.invoke_process
File "apache_beam/runners/common.py", line 1369, in apache_beam.runners.common._OutputProcessor.process_outputs
File "/usr/local/lib/python3.8/site-packages/apache_beam/io/filebasedsource.py", line 386, in process
for record in source.read(range.new_tracker()):
File "/usr/local/lib/python3.8/site-packages/apache_beam/io/tfrecordio.py", line 184, in read_records
with self.open_file(file_name) as file_handle:
File "/usr/local/lib/python3.8/site-packages/apache_beam/io/filebasedsource.py", line 173, in open_file
File "/usr/local/lib/python3.8/site-packages/apache_beam/io/filesystems.py", line 244, in open
return filesystem.open(path, mime_type, compression_type)
File "/usr/local/lib/python3.8/site-packages/apache_beam/io/gcp/gcsfilesystem.py", line 177, in open
return self._path_open(path, 'rb', mime_type, compression_type)
File "/usr/local/lib/python3.8/site-packages/apache_beam/io/gcp/gcsfilesystem.py", line 138, in _path_open
raw_file = gcsio.GcsIO().open(path, mode, mime_type=mime_type)
File "/usr/local/lib/python3.8/site-packages/apache_beam/io/gcp/gcsio.py", line 223, in open
downloader = GcsDownloader(
File "/usr/local/lib/python3.8/site-packages/apache_beam/io/gcp/gcsio.py", line 585, in __init__
project_number = self._get_project_number(self._bucket)
File "/usr/local/lib/python3.8/site-packages/apache_beam/io/gcp/gcsio.py", line 166, in get_project_number
self.bucket_to_project_number[bucket] = bucket_metadata.projectNumber
AttributeError: 'NoneType' object has no attribute 'projectNumber'
It seems it cannot reach my bucket metadata somehow. My data_dir and temp_location folders do exist, and the run creates some files in the temp_location folder during the first steps. The bucket is attached to the project I define in the script.
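As far as I can tell from the traceback, the failing code in gcsio boils down to this pattern: the bucket-metadata lookup apparently returns None (wrong permissions on the bucket? wrong credentials on the workers?), and the subsequent attribute access blows up. Here is a minimal reproduction of just that pattern (my own sketch, not Beam's actual code):

```python
# Minimal reproduction of the failure pattern from the traceback:
# gcsio looks up the bucket metadata, gets None back, and then
# dereferences .projectNumber on it.
def get_project_number(bucket_metadata):
    return bucket_metadata.projectNumber

try:
    get_project_number(None)  # what happens when the metadata lookup fails
except AttributeError as err:
    print(err)  # 'NoneType' object has no attribute 'projectNumber'
```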
Any idea of what might be going on? Any idea on how to debug this? The bug appears in the apache-beam code, so I can't add prints or use a debugger there.
Would appreciate any hint!