Hello,
We are actively looking for a Data Engineer – AI Systems. If you or your consultant are open to a new role, please share your profile.
Role: Data Engineer – AI Systems
Duration: 6+ Months
Location: St. Louis, Missouri (Onsite)
Data Engineer – AI Systems (Databricks)
We’re building intelligent, Databricks-powered AI systems that structure and activate information from diverse enterprise sources (Confluence, OneDrive, PDFs, and more). As a Data Engineer, you’ll design and optimize the data pipelines that transform raw and unstructured content into clean, AI-ready datasets for machine learning and generative AI agents.
You’ll collaborate with a cross-functional team of Machine Learning Engineers, Software Developers, and domain experts to create high-quality data foundations that power Databricks-native AI agents and retrieval systems.
Key Responsibilities
- Develop Scalable Pipelines: Design, build, and maintain high-performance ETL and ELT workflows using Databricks, PySpark, and Delta Lake.
- Data Integration: Build APIs and connectors to ingest data from collaboration platforms such as Confluence, OneDrive, and other enterprise systems.
- Unstructured Data Handling: Implement extraction and transformation pipelines for text, PDFs, and scanned documents using Databricks OCR and related tools.
- Data Modeling: Design Delta Lake and Unity Catalog data models for both structured and vectorized (embedding-based) data stores.
- Data Quality & Observability: Apply validation, version control, and quality checks to ensure pipeline reliability and data accuracy.
- Collaboration: Work closely with ML Engineers to prepare datasets for LLM fine-tuning and vector database creation, and with Software Engineers to deliver end-to-end data services.
- Performance & Automation: Optimize workflows for scale and automation, leveraging Databricks Jobs, Workflows, and CI/CD best practices.
What You Bring
- Experience with data engineering, ETL development, or data pipeline automation.
- Proficiency in Python, SQL, and PySpark.
- Hands-on experience with Databricks, Spark, and Delta Lake.
- Familiarity with data APIs, JSON, and unstructured data processing (OCR, text extraction).
- Understanding of data versioning, schema evolution, and data lineage concepts.
- Interest in AI/ML data pipelines, vector databases, and intelligent data systems.
Bonus Skills
- Experience with vector databases (e.g., Pinecone, Chroma, FAISS) or Databricks Vector Search.
- Exposure to LLM-based architectures, LangChain, or Databricks Mosaic AI.
- Knowledge of data governance frameworks, Unity Catalog, or access control best practices.
- Familiarity with REST API development or data synchronization services (e.g., Airbyte, Fivetran, custom connectors).
Thank you,
Satti Reddy