Responsible for the design, deployment, configuration, and operation of a multi-node big data cluster and of model deployment patterns for ML teams. This includes working with open source and/or commercial stacks to support the full SDLC. The resource will deploy, manage, and maintain development, test, and production environments for the big data platform.
Build and maintain large-scale ML infrastructure and ML pipelines. Contribute to the machine learning platform and tools that enable both prediction and optimization of models using the AWS stack (SageMaker, Jupyter Notebooks, Airflow).
Extend the existing ML platform and frameworks to scale model training and deployment.
AWS experience is a must
SageMaker experience is preferred but not required if the candidate has a good knowledge of the fundamentals and can learn quickly on the job
Partner closely with various remote business and engineering teams to drive the adoption and integration of model outputs.
Data science understanding from a platform perspective is important to this role: model management, supervised/unsupervised models, and general evolving best practices in the ML space
Develop scripts to automate and streamline operations and configurations in the infrastructure
Use knowledge of enterprise security solutions like LDAP and Kerberos to maintain a secure environment
Research performance issues and optimize the platform for performance
Troubleshoot and resolve issues in all operational environments
Stay forward-thinking by continuously adopting new ideas and technologies to solve business problems
3-5 years of experience.
Prior experience supporting data science/ML teams is preferred
Experience working with ML frameworks in a platform capacity (PyTorch, TensorFlow, scikit-learn, etc.) is preferred
Good understanding of model deployment best practices and ability to demonstrate that understanding via prior work or projects
Experience using SageMaker or similar ML tools is strongly preferred
Experience with AWS EMR is preferred
Experience with Hadoop SQL tools - Hive and Presto
Any scripting knowledge - Shell, Perl, Python
Strong experience with Linux
Experience with AWS services and networking - instances, security groups, bootstrap actions, S3 buckets, VPC/subnets
Experience providing on-call support.