In-person interview || Senior Site Reliability Engineer (SRE) with Linux at Alpharetta, GA || Onsite

Varshendra Chaturvedi

1:11 PM
to US IT Technical Recruitement Group
Hi,


I hope this email finds you well.

If you're interested in pursuing this opportunity, kindly share your updated resume with us at your earliest convenience.

Please feel free to reach out if you have any questions or need further details about the role.

If any of your friends or colleagues are interested in the position below, please let me know. References are highly appreciated!

If this information is not relevant to you, I apologize for sending this email.

 

Role: Senior Site Reliability Engineer (SRE) with Linux

Location: Alpharetta, GA || Onsite

Position Type: Long Term Contract


Experience (Years): 10+ years


Need H1B

Local candidates are required for in-person interviews.


Job Description:


Skill Set - Expertise in UNIX + Linux Administration + AWS/Azure Cloud monitoring + Terraform/Ansible + Prometheus/Grafana observability experience.

Work Location - Alpharetta

Experience required for role - 6+ years

•       Production experience in SRE / infrastructure / ops for large-scale systems

•       Strong programming/scripting skills (Python, Go, Java, or equivalent)

•       Deep experience with containerization (Docker), orchestration (Kubernetes, etc.)

•       Infrastructure-as-code (Terraform, Helm, CloudFormation, Ansible, etc.)

•       Familiarity with GPU / AI compute clusters, high-performance data storage, and distributed architectures

•       Experience with monitoring / observability / logging / alerting tools (Prometheus, Grafana, ELK / EFK, Datadog, etc.)

•       Networking & systems engineering knowledge (TCP/IP, DNS, routing, load balancing, distributed storage)

•       Solid experience in capacity planning, performance tuning, scaling, and incident response

•       Demonstrated ability to lead RCAs, deploy fixes, and drive reliability improvements

•       Experience in regulated environments (financial services, compliance, audit, security) is a strong plus

•       Excellent communication, documentation, and cross-team collaboration skills

•       Proven track record of reducing operational toil via automation

Experience: 6+ years of experience as a Site Reliability Engineer or in a similar role, with hands-on experience in supporting IaaS platforms with networking and system engineering knowledge. 

•       Operate, monitor, and maintain the infrastructure supporting GenAI applications (training, inference, feature store, data ingestion, model serving)

•       Design and build automation for core platform capabilities, reducing manual toil

•       Develop and maintain infrastructure-as-code (IaC) for provisioning and managing compute, storage, network, GPU clusters, Kubernetes / container orchestration, etc.

•       Establish, monitor, and enforce SLOs/SLIs/SLAs, error budgets, alerting, and dashboards

•       Lead incident response, root cause analysis (RCA), postmortems, and systemic remediation

•       Perform capacity planning, scaling strategies, workload scheduling, and resource forecasting

•       Optimize cost vs. performance tradeoffs in large-scale compute environments

•       Harden systems for security, compliance, auditability, and data governance

•       Collaborate across teams (cloud engineers, data engineers, infrastructure, security) to ensure safe deployment, rollout, rollback, and integration of new systems

•       Define disaster recovery (DR) strategies, backup/restore practices, and fault tolerance mechanisms

•       Maintain runbooks, operational playbooks, documentation, and training materials

•       Participate in on-call rotations and respond to production incidents 24/7 as needed

•       Continuously evaluate and integrate new tools, frameworks, or technologies to enhance platform reliability



Skills: Python, Docker, Kubernetes, Site Reliability Engineering (SRE)

Experience Required: 6-8 years



--


Thanks and Regards,

                 

Varshendra Chaturvedi

---------------------------------------------------

Next Level Business Services INC.

Talent Solutions | Digital Transformation | Data Analytics

E-mail: varshendra...@recruiter.nlbtech.com | Web: www.nlbservices.com

An ISO 27001 and 20000-1 Certified & Minority Business Enterprise (CMBE) 
