AI Infrastructure Runtime Engineer

Arjun Tomar

May 28, 2026, 12:47:50 PM

to Arjun Tomar

Hi,

I hope this message finds you well.

I am Arjun from Vyze Inc., currently working on an urgent requirement with one of our esteemed clients. Based on your profile, I believe this opportunity would be a good fit for you. If you are available and interested in this contract opportunity, kindly share your updated resume along with your availability for a discussion at your earliest convenience.

Due to the urgent nature of this requirement, we would appreciate a prompt response.

Job Description -

We are urgently looking to onboard a top-tier On-Premises LLM Inference & GPU Systems Engineer for an exciting project with one of our premium clients. We are specifically seeking high-caliber professionals with deep, hands-on experience in this area.

Please confirm the candidate's current location and their availability for an in-person interview upon submission.

Kindly review the detailed JD below before submitting profiles.

Key Requirements:

Experience: 10+ years of total experience is mandatory.
Location: Local to Charlotte, NC only. There are no relocation or remote options for this role.
Interview Process: Candidates must be available for a Face-to-Face interview at the client’s office. Please only submit candidates who are 100% comfortable with an in-person interview.
Onsite Policy: Must be comfortable working on-site as per client requirements.

An older LinkedIn profile with a photo, created before 2020, is required.

Copies of the driver's license (DL) and visa are required.

The visa must be genuine.

Client: NTT Data

Important Note: Please avoid submitting junior or unrelated profiles. We are looking for strong, hands-on professionals who can lead the technical direction of AI products.

Job Description:

We are seeking an AI Infrastructure Runtime Engineer to build and maintain large-scale on-prem LLM infrastructure. This is an enterprise private GenAI environment running on NVIDIA H200 GPU clusters and an OpenShift AI deployment ecosystem. You will manage production inference internally, including self-hosting open-source LLMs like Llama. We are focused exclusively on inferencing; this role involves no model training infrastructure or fine-tuning pipelines.

Key Responsibilities:

NVIDIA GPU Runtime Optimization: Drive extreme runtime efficiency and optimization for the token generation pipeline. Specifically manage prefill/decode optimization and KV cache management.
Inference Serving: Deploy and manage inference engines including vLLM and TensorRT-LLM.
Hardware Utilization: Tune GPU throughput, batching strategies, and latency. Manage workload orchestration using RunAI and Kubernetes GPU orchestration.
Model Lifecycle Management: Oversee the complete Hugging Face model lifecycle, including model onboarding, deployment, and retirement.
Platform Operations: Operate and maintain the OpenShift AI ecosystem as the primary container platform for GenAI workloads.

Required Qualifications:

5+ years of experience as an LLM Systems Engineer or AI Infrastructure Runtime Engineer.
5+ years of hands-on experience with NVIDIA H200 clusters and runtime optimization techniques (KV cache, prefill/decode).
3+ years of experience with OpenShift AI and GPU orchestration tools like RunAI.
Strong experience with modern inference frameworks, specifically vLLM and TensorRT-LLM.
Proven track record managing the Hugging Face deployment lifecycle.

Best Regards

Arjun Tomar

Recruiter | Vyze Inc.

Call: +1 571-456-3086 | E-Mail: ato...@vyzeinc.com

Hangout: atoma...@gmail.com

24718 Tribe Square #306, Dulles, VA 20166

We believe great business comes from honest relationships.