SRE WITH SKAN AI - CONCORD, CA (Local Candidates Only)
In-person interview required for the final round
Overview:
You will manage the technical deployment and upkeep of our on-prem infrastructure, including
the Virtual Assistant, Skan Portal, and Gateway components. In this critical role, you'll lead
customer-facing implementations, validate infrastructure environments, perform
connectivity assurance, and support continuous reliability and scalability.
Responsibilities:
• Deployment Planning & Coordination: Lead end-to-end customer Gateway
deployments, from strategic planning and infrastructure readiness validation through
installation and activation.
• Infrastructure Validation: Confirm that customer environments meet requirements for
networking (TCP/IP, firewall, DNS), storage, and compute before deployment.
• Connectivity Testing & Assurance: Conduct thorough end-to-end connectivity testing
across Skan stack components (VA, Portal, Gateway).
• Technical Transition Management: Facilitate smooth transitions from development to
production, liaising with stakeholders to set and meet enterprise technical standards.
• Documentation & Standards: Produce and maintain comprehensive documentation,
including architecture diagrams, network configurations, troubleshooting guides, and
deployment workflows.
• System Monitoring & Health Maintenance: Set up and use monitoring and logging
tools to proactively identify and resolve performance or stability issues.
Qualifications:
• 5–7 years in Systems or Infrastructure Engineering with a strong track record in large-scale
deployment projects.
• 3+ years of hands-on experience with OpenShift/Kubernetes, preferably in on-prem
environments, plus experience with containerization (Docker, Kubernetes).
• Demonstrated understanding of OpenShift Container Platform (OCP) architecture in
large enterprise environments.
• Working knowledge of OCP resources and operations, such as:
  • Pods, Deployments, StatefulSets, Services, Routes, Namespaces / Projects
  • ConfigMaps, Secrets
  • Persistent Volumes / Persistent Volume Claims
  • Resource requests/limits
  • Node, cluster, and container troubleshooting
• Strong experience performing root cause analysis (RCA) for:
  • Pod crashes
  • Restart loops
  • Containerized database instances such as PostgreSQL and StarRocks
• Networking & Security: Solid knowledge of networking (TCP/IP, DNS, DHCP), firewall
configurations, and security best practices.
• Monitoring Tools: Familiarity with system monitoring, logging, and incident remediation
frameworks.
• Core Competencies: Strong project coordination, proactive problem-solving, and
cross-functional communication skills are essential.
• 3+ years of experience working with PostgreSQL
• Strong scripting/automation abilities in Python or a similar language.
Preferred:
• Familiarity with Flink and StarRocks for data-layer integration.
• Knowledge of Citrix-based architecture.
Tech Stack Alignment:
• OpenShift, Kubernetes, Docker, GCP, Grafana, Splunk logs, Redis, PostgreSQL,
StarRocks, S3, Auth0, Superset.
Day in the Life of the Role (Key Responsibilities):
• Demonstrated track record of working independently.
• Monitor application behavior consistently across all three environments.
• Troubleshoot incidents end-to-end and perform RCA with minimal supervision.
• Work directly with infrastructure, platform, DBA and application teams as needed
• Play a key role in capacity usage monitoring, management, and expansion as
required.
• Drive reliability improvements, alert tuning, and operational maturity