Job Title: Site Reliability Engineer (SRE)
Location: Seattle, WA
Duration: 3 Months (CTH)
Required Skills: Vulnerability Management, Observability & Server Patching
Send resumes to Venu.e...@infosharesystems.com
Job Description:
This role is responsible for ensuring the security, reliability, and
operational excellence of server infrastructure through proactive
vulnerability management, effective server patching, and robust observability
practices. The SRE will leverage platforms such as Brinqa for vulnerability
aggregation and prioritization, and Datadog for monitoring, alerting, and
service observability.
The ideal candidate will work closely with engineering, security, and
application teams to identify and remediate risks, execute patching
strategies, and continuously improve system visibility, reliability, and
compliance.
Key Responsibilities:
Vulnerability Management
- Manage and continuously improve the enterprise vulnerability
management program using Brinqa for aggregation, prioritization, and
reporting.
- Identify, analyze, and assess vulnerabilities across server
infrastructure, including operating systems, applications, and supporting
components.
- Partner with security, infrastructure, and application teams to
prioritize remediation efforts based on risk and business impact.
- Ensure adherence to corporate security policies, regulatory
requirements, and industry best practices.
Server Patching & Remediation
- Plan, schedule, and execute server patching activities for
operating systems and third-party software.
- Track patch compliance and remediation metrics, including mean time
to patch (MTTP).
- Develop and maintain automation scripts and tooling to streamline
patching workflows and improve efficiency.
- Reduce operational risk by standardizing patching processes and
minimizing service disruption.
Observability & Reliability
- Maintain and enhance observability of supported services using
Datadog.
- Design and implement effective monitoring, alerting, and dashboards
to improve service reliability and operational awareness.
- Define and measure service-level indicators (SLIs), service-level
objectives (SLOs), and success metrics.
- Analyze incidents and trends to drive continuous improvement in
system reliability and performance.
Collaboration & Operations
- Collaborate with application owners, platform teams, and other
stakeholders to support core SRE and operational objectives.
- Provide guidance and best practices related to reliability,
security, and operational resilience.
- Support incident response, root cause analysis, and post-incident
reviews where applicable.
Skills & Qualifications:
- Strong hands-on experience with server operating systems (Windows
Server, Linux) and patching methodologies.
- Solid understanding of vulnerability management frameworks,
risk-based prioritization, and remediation practices.
- Experience with vulnerability management tools such as Brinqa, Qualys,
or similar platforms.
- Proven experience implementing observability solutions using Datadog.
- Experience working in on-premises and Microsoft Azure
environments.
- Hands-on experience with containerized applications using Docker
and Kubernetes (K8s).
- Experience with CI/CD pipelines, including GitOps-based
deployments using ArgoCD.
- Proficiency in automation and scripting (e.g., Python, PowerShell,
Bash).
- Experience supporting on-call rotations, incident response, and
production issue resolution.
- Good knowledge of networking concepts, including TCP/IP, DNS, load
balancing, firewall rules, and troubleshooting connectivity issues.
- Familiarity with ITIL concepts and operational best practices.
- Strong communication and cross-team collaboration skills.
- Ability to work independently, manage multiple priorities, and
operate effectively in a fast-paced environment.