Role: Senior Java Developer / ForgeRock Developer
Location: Alpharetta, GA (Onsite)
Duration: Long Term
1. Project Overview
The client is seeking a supplier to provide engineering, maintenance, and enhancement services for its Google Cloud Platform ("GCP") Supercomputer Solutions. The supplier will be responsible for supporting and enhancing two key product areas: Cluster Toolkit and HyperCompute Cluster Service (HCS). This work involves a combination of ongoing operational tasks, testing, documentation, and specific development deliverables.
2. Scope of Work & Deliverables
The supplier will be responsible for the services and deliverables detailed below.
2.1. Ongoing Maintenance
- The contractor must provide ongoing maintenance and enhancements for all six projects covered under the original Statement of Work.
2.2. Cluster Toolkit
Cluster Toolkit is an open-source software solution that simplifies the deployment of high-performance computing (HPC), artificial intelligence (AI), and machine learning (ML) workloads on Google Cloud.
Ongoing Responsibilities:
- Stability Testing: Test the stability of new products, beginning with A3U. This includes:
  - Building NVIDIA Collective Communications Library (NCCL) tests on a Slurm cluster.
  - Setting up and running pairwise tests to identify and report bad nodes.
- Integration Test Triage: Perform rotational duties to manage and triage integration test failures. This includes:
  - Monitoring daily failure chats and flake tools.
  - Reporting on failures and performing advanced handling, such as creating new bug reports and categorizations.
- Documentation: Improve, organize, and maintain the Cluster Toolkit documentation. This process involves:
  - Gathering existing documents and identifying information gaps.
  - Creating new documentation and updating existing materials.
  - Organizing the information in g3docs, consolidating it in a team Google Drive, and establishing a review process.
- Project Cleanup: Once a week, clean up the 'hpc-toolkit-dev' project by identifying and deleting unused resources.
- Security: Triage and address security alerts by checking for them, creating pull requests (PRs) to resolve them, and applying the necessary updates.
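
For candidates gauging the pairwise testing task, the idea can be sketched roughly as below. This is only an illustration under stated assumptions, not the team's tooling: the node names, the `run_pair_test` callback (which in practice would launch an NCCL test on two nodes through Slurm), and the rule of flagging a node when every pair it took part in failed are all hypothetical.

```python
from collections import defaultdict
from typing import Callable, List, Optional, Tuple


def round_robin_rounds(nodes: List[str]) -> List[List[Tuple[str, str]]]:
    """Circle-method schedule: each node is paired with every other node once."""
    ring: List[Optional[str]] = list(nodes)
    if len(ring) % 2:
        ring.append(None)  # odd node count: one node sits out each round
    n = len(ring)
    rounds = []
    for _ in range(n - 1):
        pairs = [(ring[i], ring[n - 1 - i]) for i in range(n // 2)]
        rounds.append([(a, b) for a, b in pairs if a is not None and b is not None])
        # Keep the first slot fixed and rotate the remaining slots by one.
        ring = [ring[0]] + [ring[-1]] + ring[1:-1]
    return rounds


def find_suspect_nodes(
    nodes: List[str],
    run_pair_test: Callable[[str, str], bool],  # hypothetical: True if the pair passes
    max_rounds: int = 3,
) -> List[str]:
    """Flag nodes that failed every pairwise test they took part in."""
    passed = defaultdict(int)
    failed = defaultdict(int)
    for pairs in round_robin_rounds(nodes)[:max_rounds]:
        for a, b in pairs:
            ok = run_pair_test(a, b)
            for node in (a, b):
                (passed if ok else failed)[node] += 1
    return sorted(n for n in nodes if failed[n] > 0 and passed[n] == 0)


# Simulated usage: "n3" is assumed bad, so any pair containing it fails.
simulated_bad = {"n3"}
cluster_nodes = [f"n{i}" for i in range(6)]
suspects = find_suspect_nodes(
    cluster_nodes, lambda a, b: a not in simulated_bad and b not in simulated_bad
)
print(suspects)
```

Running several rounds with different partners matters because a single failed pair cannot tell which of the two nodes is at fault; a healthy node is cleared as soon as it passes with any partner.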
Key Deliverables:
- HPC VM Image Releases: Deliver 4-6 High-Performance Computing Virtual Machine (HPC VM) image releases during 2025.
- Software Widget Releases: Release new software widgets every two weeks during 2025, including managing any necessary hotfixes.
2.3. HyperCompute Cluster Service (HCS)
HCS is a service that enables the deployment and management of resilient, high-performance AI and HPC systems at scale.
Key Deliverables:
- API Integration Testing: Add comprehensive integration tests for all HCS Application Programming Interface (API) surfaces. Coverage must include:
  - HypercomputeClusters: Create, Delete, Update, Get, and List requests and responses.
  - Network: NetworkInitialize params.
  - Storage: StorageInitialize, FileStoreInitialize, Filestore tier, ParallelstoreInitialize, and GcsInitialize params.
  - Compute: Resource request, Guest accelerator, Disk, Provisioning model, Reservation affinity and type, Orchestrator, Slurm, Node test, Storage configuration, and Slurm partition.
- Critical User Journey (CUJ) Validation: Add integration tests to validate the following critical user journeys:
  - Creating a cluster that consumes a reservation.
  - Creating a cluster with a new network and new storage.
  - Creating a cluster using a pre-existing network and storage created both outside of HCS and by a previous HCS deployment.
  - Destroying all components of an HCS-created cluster.
  - Destroying a cluster while leaving the network and storage intact.
  - Updating a Slurm cluster to add a new reservation to both new and existing partitions.
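
To give a feel for the shape such tests might take, here is a minimal sketch of a Create/Get/List/Update/Delete lifecycle test and two of the destroy journeys. Everything in it is an assumption made for illustration: `FakeHcsClient` is a hypothetical in-memory stand-in, and its method names and parameters are not the real HCS API. Real integration tests would call the actual service and provision real resources.

```python
import unittest


class FakeHcsClient:
    """Hypothetical in-memory stand-in for an HCS client (not the real API)."""

    def __init__(self):
        self.clusters = {}
        self.networks = set()
        self.storage = set()

    def create_cluster(self, name, network, storage):
        if name in self.clusters:
            raise ValueError(f"cluster {name} already exists")
        self.networks.add(network)
        self.storage.add(storage)
        self.clusters[name] = {"name": name, "network": network, "storage": storage}
        return dict(self.clusters[name])

    def get_cluster(self, name):
        return dict(self.clusters[name])

    def list_clusters(self):
        return sorted(self.clusters)

    def update_cluster(self, name, **changes):
        self.clusters[name].update(changes)
        return dict(self.clusters[name])

    def delete_cluster(self, name, keep_network_and_storage=False):
        cluster = self.clusters.pop(name)
        if not keep_network_and_storage:
            self.networks.discard(cluster["network"])
            self.storage.discard(cluster["storage"])


class HcsLifecycleTest(unittest.TestCase):
    def setUp(self):
        self.client = FakeHcsClient()

    def test_crud_lifecycle(self):
        # Create, Get, List, Update, and Delete exercised in one pass.
        self.client.create_cluster("c1", network="net-a", storage="fs-a")
        self.assertEqual(self.client.get_cluster("c1")["network"], "net-a")
        self.assertEqual(self.client.list_clusters(), ["c1"])
        updated = self.client.update_cluster("c1", reservation="res-1")
        self.assertEqual(updated["reservation"], "res-1")
        self.client.delete_cluster("c1")
        self.assertEqual(self.client.list_clusters(), [])

    def test_destroy_all_components(self):
        self.client.create_cluster("c1", network="net-a", storage="fs-a")
        self.client.delete_cluster("c1")
        self.assertNotIn("net-a", self.client.networks)
        self.assertNotIn("fs-a", self.client.storage)

    def test_destroy_cluster_keeps_network_and_storage(self):
        self.client.create_cluster("c1", network="net-a", storage="fs-a")
        self.client.delete_cluster("c1", keep_network_and_storage=True)
        self.assertEqual(self.client.list_clusters(), [])
        self.assertIn("net-a", self.client.networks)
        self.assertIn("fs-a", self.client.storage)
```

Saved as a module, the tests can be run with `python -m unittest`. Each critical user journey maps to one test method, which keeps failures easy to attribute during triage.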
Mandatory details (to be filled in by the candidate):
Required Details | Details to be filled by candidate
Candidate Name |
Position | Senior Java Developer / ForgeRock Developer
Present location (city and state) |
Relocation (Yes/No) |
Work Authorization (H-1B, EAD, GC, USC) |
Telephone No. (no Google / TextNow or VoIP number) |
E-mail ID |
Currently Working (Yes/No) |
Type of Hire (Contract / C2H) |
Onsite availability (post-selection) |
Total onsite experience, working in US |
Overall relevant experience of candidate |
Availability for Interview (Preferred Time) |
Rate / Salary |
Bachelor's / Master's: University / Stream / Graduation year / Location |
LinkedIn ID |
Current Employer |
Current Client / Project |
Candidate ID Submitted (Driver's License / Passport) & Work Authorization (if H-1B/EAD) |