Immediate Opportunity--AI Infrastructure Platform Engineer

Sradha Priyadarsini

3:55 PM

Hello All,

Role: AI Infrastructure Platform Engineer

Client: Wells Fargo & Company
Location: Charlotte, NC

In This Role, You Will
- Lead complex infrastructure initiatives supporting Generative AI and Predictive AI platforms from design to production operations.
- Serve as a technical lead for platforms supporting AI/ML model training, inference, and batch workloads.
- Design, build, deploy, and operate OpenShift-based container platforms optimized for high-performance GPU workloads.
- Build, support, and operate scalable GPU SuperPod architectures with large multi-node GPU clusters.
- Own monitoring, alerting, and observability using Grafana, Splunk, and enterprise telemetry tools.
- Define SLIs/SLOs and build actionable alerts to proactively detect performance, capacity, and resiliency risks.
- Build AI- and agent-based automation tools for self-healing, scaling, diagnostics, and incident remediation.
- Apply AIOps techniques to reduce alert fatigue and improve platform reliability.
- Lead production incident analysis and ensure operational rigor and root-cause prevention.
- Mentor engineers and influence stakeholders across a geographically distributed organization.

Required Qualifications
- 5+ years of infrastructure engineering experience.
- 5+ years troubleshooting complex end-to-end architectures (including CI/CD pipelines).
- 5+ years Linux systems experience.
- 4+ years supporting AI/ML platforms.
- 4+ years of Kubernetes / container platform experience including production support.

Desired Qualifications
- Experience with Generative AI and Predictive AI platforms.
- Hands-on GPU platform operations including scheduling, quota, and performance tuning.
- Experience with OpenShift in GPU-enabled, multi-tenant environments.
- Experience designing or operating GPU SuperPods.
- Deep experience with observability using Grafana, Splunk, and custom telemetry pipelines.
- Experience building AI- or agent-driven automation tooling (AIOps).
- Hands-on experience supporting AI/ML workloads on GCP and Azure, including GPU-backed services and managed AI infrastructure.
- Experience operating hybrid or multi-cloud AI platforms, with an understanding of cloud-native services, networking, identity, and cost optimization for Generative and Predictive AI.
- Strong experience monitoring AI signals such as inference latency and GPU utilization.
- Experience with BCP/DR, resiliency, and highly available architectures.

Job Expectations
- Participation in a 24x7 on-call rotation.
- Ownership for production stability, platform health, and customer outcomes.
- Operate in regulated enterprise environments with strong risk and control focus.

Thanks and Regards,
Shraddha. P
US IT Recruiter
kasmoglobal.com
+1 409 655 2620
srad...@kasmoglobal.com