I've recently started learning Terraform and have been looking for a way to work with it and Python together. Specifically, I was thinking about using Boto3 and Terraform side by side. Do you think your module would help solve my current tasks? I've placed an * next to items where I think I should/could use Python.
Use case:
Creating a data pipeline that will deposit and process a few hundred thousand web articles.
Each publisher will have control over their crawler. They want to set the crawl frequency & schedule.
Each publication could have custom text extraction for each template.
Each crawler has to have a custom Browser ID.
Crawler Stage:
Scrapy crawler will run in ECS containers.
The container hosts will have a NFS share mounted.
The NFS share will have a folder for each publication's dedicated crawler, this will provide persistence for configurations & crawl history.
* Use Python to loop through a dict of publications, checking whether each one has a folder on the NFS share and creating it if it doesn't.
* Loop through the publication dict and create a container mapped to each publication's subdirectory (nfs://nfs_srvc/crawler/nytimes/).
- Set the container environment variables so the crawler knows where to deposit HTML & images in S3.
Each publication now has a dedicated crawler.
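For the crawler stage above, a minimal Python sketch could look like the following. The publication names, container image, volume name, and environment-variable keys are all illustrative assumptions, not a fixed API:

```python
from pathlib import Path

def ensure_crawler_dir(nfs_root, pub):
    """Create the publication's dedicated folder on the NFS mount if missing."""
    pub_dir = Path(nfs_root) / "crawler" / pub
    pub_dir.mkdir(parents=True, exist_ok=True)
    return pub_dir

def container_definition(pub, html_bucket, image_bucket):
    """Build an ECS container definition that maps the crawler to its NFS
    subdirectory and tells it where to deposit HTML & images in S3.
    The image name and env-var keys below are placeholders."""
    return {
        "name": f"crawler-{pub}",
        "image": "my-registry/scrapy-crawler:latest",  # hypothetical image
        "mountPoints": [{
            "sourceVolume": "nfs-crawler",
            "containerPath": f"/mnt/crawler/{pub}",
        }],
        "environment": [
            {"name": "PUBLICATION", "value": pub},
            {"name": "HTML_BUCKET", "value": html_bucket},
            {"name": "IMAGE_BUCKET", "value": image_bucket},
        ],
    }

# Loop over the publication dict and build one definition per crawler:
publications = {"nytimes": {}, "guardian": {}}  # stand-in for your dict
definitions = [container_definition(p, "html-bucket", "image-bucket")
               for p in publications]
```

The definitions could then be fed into Terraform (e.g. as variables) or registered directly with Boto3's ECS client, whichever side you decide should own the containers.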
The other issue I'm trying to solve is how best to create tens of thousands of folders in S3 buckets. As files move through the data pipeline, they will be deposited into different buckets. In some cases the files will be deposited into one folder and deleted as they are processed; in other cases they will remain in an archival bucket, which will have a folder for each publication.
Infrastructure build:
Create S3 buckets
* When a new publication crawler is created using Python, the S3 bucket folders are created as well.
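One thing worth noting here: S3 has no real folders, only key prefixes, so "creating a folder" just means writing a zero-byte object whose key ends in "/" (that is what makes the prefix show up as a folder in the console). A hedged sketch, with the stage names invented for illustration and the client injected so the logic stays testable without AWS credentials:

```python
def folder_keys(publications, stages=("incoming", "archive")):
    """Build the per-publication 'folder' keys for each pipeline stage."""
    return [f"{stage}/{pub}/" for pub in publications for stage in stages]

def create_s3_folders(s3_client, bucket, keys):
    """Write a zero-byte marker object for each key. Pass a boto3 S3
    client here, e.g. boto3.client('s3')."""
    for key in keys:
        s3_client.put_object(Bucket=bucket, Key=key)

# Usage (assumes boto3 is installed and credentials are configured):
# import boto3
# create_s3_folders(boto3.client("s3"), "my-archive-bucket",
#                   folder_keys(["nytimes", "guardian"]))
```

Since prefixes only exist once an object is under them, you may find you don't need to pre-create the "folders" at all if the crawlers always write full keys like `archive/nytimes/article.html`.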
My thought is to have a Python app that would do the following:
Run an initial setup where I use Terraform & Python to create a new copy of the platform & infrastructure (for enterprise customers).
Add a new publication:
- * new publication YAML is loaded in Python
- * create new Terraform files
- * create folders on S3 & NFS
- * generate and copy crawler configuration files to NFS
- run Terraform to add the new containers and set them to run on schedule (per publication).
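The "create new Terraform files" step above could come down to rendering a small per-publication variable file that a shared Terraform configuration consumes; Terraform auto-loads any `*.auto.tfvars.json` file on the next plan/apply. A minimal sketch, assuming the file layout, variable names, and defaults shown here (and one Terraform workspace per publication, so the files don't clash):

```python
import json
from pathlib import Path

def write_publication_tfvars(pub_config, out_dir):
    """Render one publication's settings (already loaded from its YAML)
    to an .auto.tfvars.json file that Terraform picks up automatically.
    Variable names and defaults are assumptions for this sketch."""
    out = Path(out_dir) / f"{pub_config['name']}.auto.tfvars.json"
    out.write_text(json.dumps({
        "publication": pub_config["name"],
        "crawl_schedule": pub_config.get("schedule", "rate(1 hour)"),
        "browser_id": pub_config.get("browser_id", ""),
    }, indent=2))
    return out
```

From there the app can shell out to `terraform apply` (or use a wrapper like python-terraform) so adding a publication is just: load YAML, write tfvars, create folders, copy configs, apply.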