We are seeking an experienced Site Reliability Engineer (SRE) – ML Platform / MLOps Engineer to ensure the reliability, scalability, and performance of our machine learning platforms. This role focuses on building and operating production-grade ML systems using Kubernetes, Python, cloud infrastructure, and modern MLOps practices.
The ideal candidate has strong experience in MLOps, cloud-native architecture, containerization, and CI/CD, along with a solid understanding of both traditional ML models and Large Language Models (LLMs). You will work closely with data scientists, ML engineers, and software teams to design, deploy, and maintain robust ML pipelines and services.
Design, deploy, and maintain scalable ML platforms using Kubernetes, Docker, and cloud services (primarily AWS)
Build and operate end-to-end MLOps pipelines, including model training, validation, deployment, and monitoring
Ensure high availability, reliability, and performance of ML production systems
Develop automation tools and services using Python
Implement and manage CI/CD pipelines for ML and microservices workloads
Support ML workloads involving LLMs and traditional ML models
Collaborate with data scientists to productionize models and optimize workflows
Administer Linux systems and troubleshoot infrastructure issues
Design cloud-native microservices and APIs for ML applications
Manage and integrate data stores such as MongoDB and search platforms like Apache Solr
Implement monitoring, alerting, logging, and benchmarking for ML systems
Translate business requirements into technical solutions
Contribute to best practices around testing, security, and operational excellence
6+ years of hands-on experience in MLOps / SRE / Platform Engineering
Strong proficiency in Python
Extensive experience with Kubernetes and containerized environments
Solid knowledge of AWS (or Azure/GCP) cloud platforms
Experience with MongoDB
Strong Linux administration skills
Experience with microservices architectures
Hands-on experience with CI/CD pipelines
Working knowledge of ML models and Large Language Models (LLMs)
Experience productionizing ML systems built with open-source tools
Python
Kubernetes & Docker
AWS (or Azure/GCP)
MongoDB
Microservices Architecture
Apache Solr
MLOps frameworks (Kubeflow, MLflow, Airflow, DataRobot, Argo, etc.)
CI/CD pipelines
Linux system administration
REST APIs and cloud integrations
Experience with workflow orchestration tools such as Kubeflow, Airflow, or Argo
Experience building custom integrations between cloud-based systems using APIs
Strong understanding of software testing, benchmarking, and continuous integration
Exposure to ML methodology and best practices
Ability to design and implement cloud-based ML solutions
Excellent communication skills and ability to collaborate across teams