[Jobs] Research Engineer for VLMs


Heni Ben Amor

Jun 4, 2026, 9:53:03 PM
to ml-...@googlegroups.com

At PerceptAI, we are building the Foundation Model for Physical AI and Spatial Intelligence in the real world. Our mission is to develop a breakthrough, multi-modal understanding of the physical environment, empowering both humans and machines to intelligently perceive, reason about, and interact with the space around them. By turning complex real-world data into high-fidelity, actionable digital twins, we are unlocking the next generation of spatial awareness for critical industries, including urban operations, robotics, emergency response, and beyond.

As a Research Engineer for VLMs, you will be at the forefront of core architecture development, bridging the gap between language, vision, and 3D geometry. You will design, train, and scale large vision-language models optimized for spatial reasoning, semantic grounding, and situational awareness. Leveraging cutting-edge multi-modal AI, you will develop the foundational models that allow PerceptAI not just to see the world, but to deeply comprehend its spatial layouts, 3D scenegraphs, and agentic possibilities at an unprecedented scale.

Responsibilities

  • Scale Large VLM Training: Design and optimize high-throughput multi-modal data pipelines to train large-scale vision-language models, ensuring efficient distributed training and alignment across large GPU clusters.
  • Advance Spatial Reasoning & 3D Scenegraphs: Pioneer architectures that integrate 3D modeling and structured 3D scenegraphs into VLMs, enabling deep semantic grounding and open-ended queries about complex physical environments.
  • Build Agentic Frameworks & Benchmarks: Develop robust agentic frameworks for spatial decision-making and establish rigorous benchmarking suites to evaluate VLM performance on spatial intelligence and situational awareness.
  • Optimize Efficient Inference: Implement cutting-edge optimization techniques to ensure low-latency, high-efficiency deployment of massive VLMs for real-time applications.
  • Collaborative Innovation: Work closely with engineering and product leads to translate cutting-edge multi-modal research into high-performance, production-grade APIs and software toolkits.

Required Qualifications

  • Education: Master’s or PhD in Computer Science, Electrical Engineering, Robotics, or a related field with a heavy emphasis on Multi-Modal Deep Learning and Computer Vision (or equivalent industry experience).
  • Deep Learning & VLM Mastery: Expert-level fluency in standard deep learning frameworks (PyTorch, JAX) and a proven track record of training, fine-tuning, and troubleshooting large-scale vision-language models (VLMs).
  • Spatial & 3D Grounding Expertise: Strong foundational knowledge of spatial reasoning, semantic grounding, and bridging 2D/3D visual data with textual representations.
  • Large-Scale Engineering: Hands-on experience working with massive, unstructured multi-modal datasets and managing large-scale distributed training infrastructure (e.g., Megatron-LM, DeepSpeed).
  • Implementation Excellence: Ability to write clean, maintainable, production-quality research and deployment code (Python/C++), with a strong focus on algorithmic efficiency and optimized inference.

Pluses (Nice to Have)

  • Direct experience integrating 3D representations (e.g., NeRFs, Gaussian Splatting, or 3D meshes) into vision-language architectures.
  • Experience working with 3D scenegraphs, hierarchical spatial data, or embodied AI frameworks.
  • Familiarity with agentic tool-use, multi-step planning, or complex situational awareness workflows using LLMs/VLMs.
  • Familiarity with open-set recognition, zero-shot learning, or VLMs applied to 3D spaces.
  • A strong publication record at machine learning, NLP, and computer vision venues (CVPR, ICCV, ECCV, NeurIPS, ICLR, ACL, EMNLP).

What We Offer

At PerceptAI, we offer a highly competitive base salary and substantial equity/stock options, allowing you to share in the upside of the nascent field of Physical AI and Foundation Models. We also provide comprehensive health, dental, and vision benefits and the option to work remotely (within the USA). As part of our team, you will have the opportunity to work on highly impactful, real-world problems that reshape emergency response, robotics, and urban infrastructure.

To Apply: Join us as we bridge the gap between advanced deep learning and 3D geometry to build the ultimate spatial foundation model. Please send your resume and an optional 2-page portfolio document with your GitHub/Google Scholar profile to: jo...@usepercept.ai
