Please share resume to vineet...@hire-in.com
Max Rate: $50/hr. C2C
SUMMARY:
The client is looking for a senior Databricks SRE/Support Engineer with hands-on experience supporting AI Dojo (an AI/ML upskilling platform) or similar large-scale AI/ML training environments.
They need someone who can support thousands of users, ensure the Databricks platform runs smoothly, automate infrastructure using Terraform + GitHub Actions, and handle troubleshooting, security, monitoring, and performance optimization.
Strong Databricks knowledge and DevOps/IaC skills are absolutely mandatory.
Position: AI Dojo Databricks SRE/Support Engineer
Location: Remote
As a Databricks SRE and Support Engineer, you will work on operations related to AI Dojo (an AI/ML upskilling program developed by Optum/UHG) on Databricks.
This individual contributor (IC) role requires experience working on large-scale AI/ML platforms and ensuring their stability, reliability, scalability, and performance.
Experience with modern infrastructure and DevOps tools and paradigms, as well as proven hands-on knowledge of Databricks, is a must.
PRIMARY RESPONSIBILITIES:
• Continuous support: Provide ongoing SRE support to thousands of geographically distributed users on the AI Dojo Databricks platform, including responding to tickets, triaging support requests, and liaising with customers.
• Automation & DevOps: Improve existing Infrastructure as Code (IaC) according to DevOps best practices.
• Systems Monitoring: Develop and maintain monitoring frameworks to respond promptly to outages and other service interruptions.
• Security & Compliance: Collaborate with internal cybersecurity teams to ensure all systems and operations comply with industry standards and are secure against evolving threats.
• Capacity Planning & Cost Optimization: Forecast and manage capacity requirements for the AI/ML training environment, while identifying opportunities to reduce costs without compromising performance.
REQUIRED QUALIFICATIONS:
• Bachelor’s degree in computer science, information technology, or a related field.
• 6+ years of infrastructure experience: Proven experience working on large-scale, cloud-based, enterprise-level software platforms and a deep understanding of the Databricks environment. In particular:
• 3+ years of practical experience with Infrastructure-as-Code and CI/CD tools such as Terraform, GitHub Actions, and the like.
• 3+ years of experience working in geographically distributed support teams.