Title: Site Reliability Engineer (SRE) – ML Platform
Location: Austin, TX or Sunnyvale, CA
Duration: Long Term
Note: Focus is 60% SRE and 40% MLOps.
Job Description:
- Continuous deployment using GitHub Actions, Flux, and Kustomize
- Design and implement cloud solutions; build MLOps on AWS
- Data science model containerization and deployment using Docker, vLLM, and Kubernetes
- Communicate with a team of data scientists, data engineers, and architects; document the processes
- Develop and deploy scalable tools and services for our clients to handle machine learning training and inference
- Knowledge of ML models and LLMs
Qualifications:
- 6+ years of experience in MLOps with strong knowledge of Kubernetes, Python, MongoDB, and AWS
- Good understanding of Apache Solr
| Skill Area | Includes | Weight (%) |
| --- | --- | --- |
| Platform Reliability & Containerization | Kubernetes, Docker, Microservices, Linux | 30% |
| MLOps & AWS Cloud | Model deployment, versioning, monitoring, AWS (SageMaker, S3, Lambda, EKS) | 25% |
| CI/CD & GitOps | GitHub Actions, Flux | 15% |
| Monitoring & Observability | Splunk, Grafana, Prometheus, performance tracking | 15% |
| Integration & Collaboration | Python scripting, API integrations, Apache Solr, LLM awareness, teamwork with data scientists & engineers | 15% |
- Proficient with Linux administration
- Knowledge of ML models and LLMs
- Ability to understand tools used by data scientists, and experience with software development and test automation
- Ability to design and implement cloud solutions and to build MLOps pipelines on cloud solutions (AWS)
- Experience working with cloud computing and database systems
- Experience building custom integrations between cloud-based systems using APIs
- Experience developing and maintaining ML systems built with open-source tools
- Experience with MLOps frameworks such as Kubeflow, MLflow, DataRobot, Airflow, etc.; experience with Docker and Kubernetes
- Experience developing containers and Kubernetes in cloud computing environments
- Familiarity with one or more data-oriented workflow orchestration frameworks (Kubeflow, Airflow, Argo, etc.)
- Ability to translate business needs into technical requirements
- Strong understanding of software testing, benchmarking, and continuous integration
- Exposure to machine learning methodology and best practices
- Good communication skills and ability to work in a team
Share resumes and the details below to my official email id sek...@transreach.com only:
- Legal Name (First/Last):
- Phone (Primary and Secondary):
- Candidate Email:
- Current Location (City, State):
- Work Authorization / Visa Status:
- Interview Availability:
- LinkedIn URL:
- Education Details (Bachelors/Masters, University Name, Location, Year of Graduation):
- Availability Once Confirmed:
- Total Years of Work Experience in USA:
- Overall Years of Work Experience:
- Open to Relocate (Yes/No):
- Expected Hourly Bill Rate on C2C: