Job Title: SRE Architect
Location: Atlanta, GA
Work Mode: Hybrid (employees follow a rotating schedule: 5 consecutive working days in the office (Thu–Fri + Mon–Wed), followed by 5 working days from home (Thu through the following Wed), then repeat)
Local candidates only
Rate: Competitive rates (Keep as low as possible)
Job Title: Site Reliability Engineering (SRE) Architect
Location: Atlanta, Georgia
Client: Delta Airlines
Role Summary:
As an SRE Architect, you will be a pivotal technical leader responsible for
designing, building, and evolving the foundational systems and practices that
ensure the reliability, scalability, performance, and efficiency of our
critical services. Moving beyond day-to-day operations, you will focus on the
strategic architectural direction of the SRE function, defining standards,
blueprints, and frameworks that enable development teams and the SRE
operations team to build and operate highly resilient systems. You will
leverage deep expertise in software engineering, distributed systems, cloud
infrastructure, and SRE principles to influence technology choices, establish
best practices, and foster a proactive culture of reliability across the
organization, extending well beyond the observability pillar.
Key Responsibilities:
- Reliability Strategy & Design:
  - Architect and design highly available, scalable, secure, and cost-effective infrastructure and application patterns on AWS
  - Define and evangelize SRE best practices, standards, and blueprints for service design, deployment, monitoring, and operational readiness across the engineering organization
  - Review the current observability implementation to identify gaps and define the steps needed to reach the next level of observability maturity, providing deep insight into system health and behaviour
  - As overall maturity grows, lead the definition and implementation strategy for Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Error Budgets for critical services
- Platform Architecture & Automation:
  - Design solutions to systematically reduce operational toil through automation and improved system design
  - Evaluate current SRE tools and automation frameworks (e.g., CI/CD pipelines, Infrastructure as Code modules, automated incident remediation, chaos engineering platforms) and recommend enhancements that improve overall capability
  - Evaluate, prototype, and recommend new technologies, tools, and methodologies to enhance system reliability, developer productivity, and operational efficiency
- Technical Leadership & Consultation:
  - Act as a senior technical advisor and subject matter expert on reliability, scalability, and performance for development and platform teams
  - Provide architectural guidance during the design phase of new services and features to ensure reliability principles are embedded early (shift-left)
  - Mentor and coach other SREs and engineers, fostering technical excellence and adherence to SRE principles
  - Lead architectural reviews and production readiness assessments for critical systems
- Resilience:
  - Lead blameless postmortems for significant incidents, ensuring root causes are identified and systemic architectural improvements are prioritized and implemented
  - Architect and advocate for resilience patterns (e.g., circuit breaking, rate limiting, graceful degradation, chaos engineering) within applications and infrastructure
Required Qualifications:
- Proven experience in an architectural role, designing solutions for reliability, scalability, and performance
- Deep understanding and practical application of SRE principles (SLIs/SLOs, error budgets, toil reduction, automation, incident management, postmortems)
- Expertise in cloud computing platforms (e.g., AWS), including infrastructure, networking, and security services
- Strong experience with containerization and orchestration technologies (Kubernetes, Docker, serverless computing)
- Solid experience designing and implementing observability solutions (e.g., Dynatrace, Prometheus, Grafana, ELK/EFK Stack, Jaeger, OpenTelemetry)
- Strong programming/scripting skills (e.g., Python, Go, Bash) for automation and tool development
- Excellent analytical, problem-solving, and strategic thinking skills
- Strong communication, collaboration, and leadership skills with the ability to influence technical direction across teams
Preferred Qualifications:
- Experience designing and implementing chaos engineering practices and platforms
Thanks and Regards
Solomon
sol...@gtssminds.com