Please share your profiles to Sult...@nextgen-is.com
Only share candidates local to Austin, TX
Position: Systems Analyst 3 (DevOps/Site Reliability Engineer)
Location: Austin, TX (Hybrid)
Duration: 5 Months
Client: Texas Health and Human Services Commission - 529601671
Job Summary
We are seeking an experienced Systems Analyst 3 with a strong background in Site
Reliability Engineering (SRE) and DevOps practices. The ideal candidate will be
responsible for ensuring the reliability, scalability, and performance of
production systems by applying software engineering principles to
infrastructure and operations.
This role requires close collaboration with development teams to build resilient,
observable, and automated systems that meet defined service level objectives.
Key Responsibilities
- Ensure the availability, reliability, and performance
of production systems
- Design, implement, and maintain scalable and highly
available distributed systems
- Monitor system health using logging, monitoring, and
alerting tools
- Define and manage SLIs, SLOs, and error budgets
- Perform incident management, root cause analysis (RCA),
and postmortems
- Collaborate with development teams to improve system
architecture and performance
- Automate infrastructure and operational processes using
scripting and DevOps tools
- Implement containerization and orchestration solutions
using Docker and Kubernetes
- Integrate security and compliance requirements into
system operations
- Develop and maintain documentation, including runbooks
and operational procedures
Required Qualifications
- Minimum 8 years of experience in Systems Engineering,
DevOps, or Site Reliability Engineering
- Strong expertise in Linux/Unix systems and system
internals
- Proficiency in one or more programming or scripting
languages such as Python, Go, Java, or Bash
- Experience with cloud platforms such as AWS or GCP
- Hands-on experience with containerization and
orchestration tools (Docker, Kubernetes)
- Strong understanding of monitoring, logging, and
alerting concepts
- Experience working with highly available, distributed
systems
- Experience with incident management and root cause
analysis
- Knowledge of integrating security and compliance into
operational workflows
Preferred Qualifications
- Experience with observability tools such as Prometheus,
Grafana, Datadog, Splunk, or Application Insights
- Experience supporting 24x7 production environments and
on-call rotations
- Familiarity with chaos engineering and resiliency
testing
- Experience with feature flags, canary deployments, and
progressive delivery
- Strong documentation and communication skills
Work Environment
- Hybrid work model with onsite presence required in
Austin, TX (2 days per week)
- Standard business hours with potential need for
after-hours or weekend support
- Candidates must be local to Texas