I have a simple Apache Beam project using Python 3 that transforms some data and writes it to BigQuery. It uses a package called textstat. Everything works when I run it locally, but when I run it on Dataflow I get the following error:
NameError: name 'textstat' is not defined [while running 'generatedPtransform-441']
This is my current setup.py file:
import setuptools
REQUIRED_PACKAGES = ['textstat==0.5.6']
PACKAGE_NAME = 'my_package'
PACKAGE_VERSION = '0.0.1'
setuptools.setup(
    name=PACKAGE_NAME,
    version=PACKAGE_VERSION,
    description='Example project',
    install_requires=REQUIRED_PACKAGES,
    packages=setuptools.find_packages(),
)
And these are my pipeline args:
pipeline_args = [
    '--project={}'.format('etl-example'),
    '--runner={}'.format('Dataflow'),
    '--temp_location=gs://dataflowtemporal/',
    '--setup_file=./setup.py',
]
And I run it like this:
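import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

pipeline_options = PipelineOptions(pipeline_args)
pipeline_options.view_as(StandardOptions).streaming = True
pipeline = beam.Pipeline(options=pipeline_options)
# The actual pipeline steps go here
pipeline.run()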
I also tried running this in the terminal before launching the job:
python setup.py sdist --formats=gztar
but I get the same result: textstat is not found. Another thing I tried was dropping setup.py and passing only this argument:
--requirements_file=./requirements.txt
But again, textstat is not found.
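One detail that may matter: the error is a NameError, not an ImportError, so the package may be installed on the workers while the name itself is missing where the function runs. Below is a minimal, stdlib-only sketch of how that can happen. It is an assumption about the mechanism, not a confirmed diagnosis of this job: `json` stands in for textstat, and `run_without_main_globals`, `score_global`, `score_local` and `try_global` are hypothetical names made up for the illustration.

```python
import json  # stands in for textstat in this sketch
import types


def run_without_main_globals(func, *args):
    """Hypothetical helper: call func with an empty global namespace,
    loosely imitating a worker that never ran the main module's imports."""
    bare = types.FunctionType(
        func.__code__, {"__builtins__": __builtins__}, func.__name__
    )
    return bare(*args)


def score_global(text):
    # Relies on the module-level import above.
    return json.dumps(text)


def score_local(text):
    import json  # imported inside the function, so the name is always bound
    return json.dumps(text)


def try_global(text):
    # Returns the result, or the NameError message if the global is missing.
    try:
        return run_without_main_globals(score_global, text)
    except NameError as exc:
        return str(exc)
```

In this sketch `try_global("hi")` reports that the name `json` is not defined, while `run_without_main_globals(score_local, "hi")` still works. If that were the cause here, importing textstat inside the function that uses it, or passing Beam's `--save_main_session` option, would be the things to test. Neither has been confirmed for this job.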
At this point I don't know what else to try.
PS: Sorry, I deleted my previous post because the code formatting was unreadable. Apologies for the spam.