Please share your resume.
Position- Senior Site Reliability Engineer (GCP)
Location- Atlanta GA (Hybrid Onsite)
Job Type- Contract
Interview process- 1 Internal discussion + 1 client discussion
Job Description:
We are seeking a highly skilled and proactive Senior Specialist, Site Reliability Engineering (SRE) to help drive the reliability, scalability, and performance of our critical platforms. This role is ideal for a senior-level engineer who combines deep technical expertise in Google Cloud Platform with a passion for automation, observability, and operational excellence, and who thrives in distributed systems and modern cloud practices.
As a Senior Specialist, you’ll work on complex reliability challenges, lead technical initiatives, and collaborate across engineering, product, and infrastructure teams to ensure our systems are resilient and efficient.
Required Skills
· 7 or more years of experience in SRE, DevOps, cloud engineering, or infrastructure engineering.
· Strong experience with GCP architecture, networking, identity, and managed services.
· Expertise with Kubernetes and container platforms.
· Hands-on experience implementing infrastructure as code using Terraform.
· Strong proficiency with modern observability stacks.
· Experience in Python, PowerShell, or similar languages.
· Experience with orchestration platforms such as Harness.
· Proven ability to diagnose and solve complex reliability problems in distributed systems.
· Experience leveraging AI tools to enhance workflow automation, experimentation, and problem solving.
· Excellent communication skills and the ability to influence outcomes across teams.
· Reliability Engineering
Architect and implement solutions that improve system reliability, scalability, and performance across GCP-based services.
Define and manage SLIs, SLOs, and error budgets for critical systems.
Automate operational tasks, reduce toil, and improve the reliability posture of our environments.
Influence system and application architecture to ensure reliability is designed from the beginning.
· Incident Management and Root Cause Analysis
Serve as the technical lead during major incidents and drive restoration efforts.
Conduct detailed root cause analysis and deliver long-term corrective actions.
Champion and facilitate blameless postmortems and continuous improvement practices.
· Cloud Architecture and Operations (GCP Focused)
Design, architect, and improve GCP infrastructure, including VPC design, Cloud DNS, load balancing, Cloud Armor or equivalent tools for WAF and filtering, cloud storage patterns, managed compute platforms such as GKE and Cloud Run, and data warehouse platforms such as BigQuery.
Collaborate with teams to implement resilient multi-zone and multi-region cloud architectures.
Lead the design and implementation of disaster recovery strategies and automated failover patterns within GCP.
Manage and optimize core GCP services such as IAM, service accounts, logging, and network controls.
Apply governance guardrails for secure multi-project environments using tools such as GCP Organization Policies, Cloud Identity, and related controls.
· Automation and Infrastructure as Code
Build infrastructure using Terraform and maintain consistent, scalable IaC patterns.
Create automation using Python, Bash, PowerShell, or similar languages.
Participate in CI/CD pipeline improvements and ensure high-quality deployments into GCP environments.
· Monitoring & Tooling
Enhance observability through metrics, logs, and tracing using tools such as Prometheus, Grafana, Google Cloud Operations Suite, or similar solutions.
Build dashboards, alerts, and automated remediation systems that support reliability and performance goals.
Analyze cloud-level logs such as VPC Flow Logs and Cloud Audit Logs to strengthen security and performance.
· Technical Leadership
Collaborate with security and software engineering teams to drive reliability and cloud excellence.
Influence system design and architecture to embed reliability from the ground up.
Stay current with GCP capabilities and recommend improvements to enhance performance, security, and efficiency.
Preferred Skills
· Experience in regulated or high-availability environments (e.g., financial services, healthcare).
· Familiarity with chaos engineering, performance optimization, and capacity planning.
· Software development background using languages such as Python or Go.
· Experience designing multi-region, fault-tolerant architectures in GCP.
Sincerely,
Srikanth