Full Eiffel Training Data Inventory
Available Eiffel Source Code
| Source | Files | Est. Lines | Est. Tokens | Quality | Access |
|-----------------------------------------|----------------|------------|-------------|--------------------------|--------------|
| Simple Eiffel | 796 | 90,000 | ~1.2M | Excellent (DBC, tested) | ✅ Local |
| ISE EiffelStudio library | 3,957 | ~800,000 | ~10M | Excellent (production) | ✅ Local |
| ISE EiffelStudio contrib | 6,274 | ~940,000 | ~12M | Good (mixed) | ✅ Local |
| https://github.com/gobo-eiffel/gobo | ~1,500 | ~300,000 | ~4M | Excellent (Eric Bezault) | ✅ GitHub |
| https://github.com/finnianr/eiffel-loop | 4,100+ classes | ~615,000 | ~8M | Good (Finnian Reilly) | ✅ GitHub |
| Reference docs | - | 116,000 | ~1.5M | High (curated) | ✅ Local |
| OOSC2 examples | - | ~20,000 | ~0.3M | Canonical | ⚠️ Scattered |
| ECMA-367 spec | 1 | ~15,000 | ~0.2M | Authoritative | ✅ Local |
Total Available Training Corpus
| Category | Lines | Tokens |
|--------------------|-------------|-------------|
| Pure Eiffel source | ~2,750,000 | ~35M |
| Documentation | ~150,000 | ~2M |
| Grand Total | ~2.9M lines | ~37M tokens |
---
Training Feasibility Assessment
For Different Approaches
| Approach | Min Tokens Needed | We Have | Feasibility |
|------------------------------|-------------------|---------|------------------|
| Train from scratch | 1-10 trillion | 37M | ❌ Not feasible |
| Full fine-tune (all weights) | 100M+ | 37M | ⚠️ Marginal |
| LoRA fine-tune | 1-10M | 37M | ✅ Excellent |
| QLoRA fine-tune | 1-10M | 37M | ✅ Excellent |
| RAG + prompting | 100K+ | 37M | ✅ Already doing |
Key insight: We have ~37M tokens of high-quality Eiffel code, roughly 4-37x the 1-10M tokens that LoRA/QLoRA fine-tuning typically needs. The corpus is excellent in both quality and quantity.
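The ~37M figure is an estimate; it can be sanity-checked against the tokenizer of the model you actually plan to fine-tune. A minimal sketch, assuming the corpus is gathered under a local `eiffel_corpus/` directory (hypothetical path) and `transformers` is installed:

```python
# Rough token count of the Eiffel corpus using the Qwen2.5-Coder tokenizer.
from pathlib import Path
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-7B")

total_tokens = 0
for path in Path("eiffel_corpus").rglob("*.e"):
    text = path.read_text(encoding="utf-8", errors="ignore")
    # add_special_tokens=False: count only content tokens, no BOS/EOS
    total_tokens += len(tokenizer.encode(text, add_special_tokens=False))

print(f"Corpus size: ~{total_tokens / 1e6:.1f}M tokens")
```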
---
Your Hardware Path (5070 Ti 16GB)
Given your setup (Windows 11 + WSL2 + 5070 Ti 16GB), here's the practical path:
What Will Work
| Model Size | Method | Seq Length | Status |
|-----------------------------|-------------|------------|---------------------------|
| 7B (Qwen2.5-Coder, Mistral) | QLoRA 4-bit | 2048 | ✅ Comfortable |
| 7B | LoRA 16-bit | 1024 | ⚠️ Tight but works |
| 13B | QLoRA 4-bit | 1024 | ⚠️ Possible, conservative |
| 30-34B | Any | Any | ❌ Rent cloud |
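A rough back-of-the-envelope for why 7B QLoRA sits comfortably inside 16 GB while 13B gets tight; every constant below is an order-of-magnitude assumption, not a measurement:

```python
# Order-of-magnitude VRAM estimate for 7B QLoRA; all constants are rough assumptions.
params = 7e9                              # base model parameters
base_weights_gb = params * 0.5 / 1e9      # 4-bit weights: ~0.5 bytes/param -> ~3.5 GB
adapter_params = 40e6                     # assumed LoRA size at r=16 over all projections
adapter_gb = adapter_params * 2 / 1e9     # fp16 adapter weights -> ~0.08 GB
optimizer_gb = adapter_params * 8 / 1e9   # two fp32 Adam moments, adapter only -> ~0.3 GB
overhead_gb = 4.0                         # activations, gradients, CUDA context at 2048 ctx (assumed)

print(f"~{base_weights_gb + adapter_gb + optimizer_gb + overhead_gb:.1f} GB of 16 GB")
# -> roughly 8 GB, leaving headroom on a 16 GB card; a 13B base roughly doubles the weight term
```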
Recommended Stack
Base Model: Qwen2.5-Coder-7B (among the strongest open code models in the 7B class)
Method: QLoRA 4-bit
Seq Length: 2048 tokens (covers most Eiffel classes)
LoRA Rank: 16 (r=16, alpha=32)
Trainer: LLaMA-Factory (low-setup web UI and CLI)
Hardware: Your 5070 Ti 16GB via WSL2
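In code, this stack corresponds roughly to the following quantization and adapter settings, shown here with Hugging Face transformers/peft/bitsandbytes directly; LLaMA-Factory exposes the same knobs through its own config, and the target-module list is an assumption to check against the model's layer names:

```python
# QLoRA setup matching the recommended stack: 4-bit NF4 base, LoRA r=16 / alpha=32.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-7B",
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    # Assumed attention/MLP projection names for Qwen2-style models; verify.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```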
---
The Practical Plan
Phase 1: Dataset Creation (1-2 weeks)
Convert your Eiffel corpus to instruction pairs (a conversion sketch follows the source list below):
{"instruction": "Write a void-safe Eiffel feature that parses JSON",
"input": "",
"output": "<actual simple_json code>"}
{"instruction": "Add Design by Contract to this feature",
"input": "set_name (n: STRING) do name := n end",
"output": "set_name (n: STRING)\n require\n n_not_void: n /= Void\n do\n name := n\n ensure\n name_set: name = n\n end"}
{"instruction": "Fix VJAR void safety error",
"input": "x := detachable_value\nx.do_something",
"output": "if attached detachable_value as l_x then\n l_x.do_something\nend"}
Target: 10,000-50,000 instruction pairs from:
- Simple Eiffel (yours - highest quality)
- ISE stdlib (patterns)
- Gobo (portable patterns)
- EiffelLoop (real-world usage)
- Your reference docs (gotchas, patterns)
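Much of this conversion can be scripted. A minimal sketch that emits one Alpaca-style pair per class file, assuming the source trees are gathered under a local `corpus/` directory (hypothetical layout); a fuller pipeline would also extract individual features, contracts, and compiler-error fixes:

```python
# Convert Eiffel class files into Alpaca-style instruction pairs (JSONL).
import json
import re
from pathlib import Path

def class_name(source: str) -> str:
    """Pull the class name from Eiffel source, or a placeholder if not found."""
    match = re.search(r"^\s*(?:deferred\s+|expanded\s+|frozen\s+)*class\s+(\w+)",
                      source, re.MULTILINE)
    return match.group(1) if match else "UNKNOWN_CLASS"

with open("eiffel_instructions.jsonl", "w", encoding="utf-8") as out:
    for path in Path("corpus").rglob("*.e"):
        source = path.read_text(encoding="utf-8", errors="ignore")
        pair = {
            "instruction": f"Write the Eiffel class {class_name(source)} "
                           "with full Design by Contract assertions.",
            "input": "",
            "output": source.strip(),
        }
        out.write(json.dumps(pair, ensure_ascii=False) + "\n")
```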
Phase 2: First Training Run (1 day)
On your 5070 Ti:
```bash
# WSL2 Ubuntu
conda activate qlora
cd LLaMA-Factory

# Run training (4-12 hours depending on dataset size).
# Flag names follow LLaMA-Factory's LoRA SFT examples; verify them against the
# version you install, and register the dataset in data/dataset_info.json first.
llamafactory-cli train \
  --stage sft \
  --do_train \
  --model_name_or_path Qwen/Qwen2.5-Coder-7B \
  --finetuning_type lora \
  --lora_rank 16 \
  --lora_alpha 32 \
  --quantization_bit 4 \
  --template qwen \
  --dataset eiffel_instructions \
  --cutoff_len 2048 \
  --per_device_train_batch_size 1 \
  --gradient_accumulation_steps 16 \
  --num_train_epochs 3 \
  --output_dir saves/qwen2.5-coder-7b-eiffel-lora
```
Cost: $0 (your hardware)
Time: 4-12 hours
Phase 3: Deployment
Export the LoRA adapter → merge into the base model and convert to GGUF → load in Ollama → use locally:
```bash
# After training
ollama create eiffel-expert -f Modelfile
# Usage
ollama run eiffel-expert "Write a SCOOP-compatible cache class in Eiffel"
```
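Before Ollama can load the result, the LoRA adapter is usually merged back into the base model and the merged weights converted to GGUF with llama.cpp's conversion tooling. A merge sketch using peft (paths are placeholders):

```python
# Merge the trained LoRA adapter into the base model ahead of GGUF conversion.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-7B", torch_dtype=torch.bfloat16
)
model = PeftModel.from_pretrained(base, "saves/qwen2.5-coder-7b-eiffel-lora")

merged = model.merge_and_unload()              # bake LoRA weights into the base model
merged.save_pretrained("eiffel-expert-merged")
AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-7B").save_pretrained("eiffel-expert-merged")
# Next: convert eiffel-expert-merged/ to GGUF with llama.cpp, then point the
# Ollama Modelfile's FROM line at the resulting .gguf file.
```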
Phase 4: Dogfood Loop
```
You + Claude + Eiffel-Expert-LoRA
        ↓
Generate Eiffel code
        ↓
You verify/correct
        ↓
Corrections → New training pairs
        ↓
Retrain monthly (on your GPU)
        ↓
Model improves each cycle
        ↓
(repeat)
```
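The only tooling the loop really needs is a way to turn each verified correction into a new training pair that gets merged into the dataset before the monthly retrain. A minimal sketch (file name and function are illustrative):

```python
# Append a human-corrected generation as a fresh training pair for the next retrain.
import json

def log_correction(prompt: str, corrected_output: str,
                   path: str = "corrections.jsonl") -> None:
    """Write one Alpaca-style instruction pair built from a verified correction."""
    record = {"instruction": prompt, "input": "", "output": corrected_output}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Example: after fixing the model's attempt by hand during review
log_correction(
    "Write a SCOOP-compatible cache class in Eiffel",
    corrected_output="-- your verified, compiling version of the class --",
)
```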
---
Cost Comparison
| Approach | One-Time Cost | Monthly Cost | Quality |
|------------------------|---------------|--------------|-------------|
| Claude API only | $0 | $20-100 | Good |
| Claude + RAG/Oracle | $0 | $20-100 | Better |
| Your GPU + QLoRA | $0 | $0 | Specialized |
| Cloud fine-tune (A100) | $50-200 | $0 | Specialized |
---
Bottom Line
| Question | Answer |
|----------------------------------------|-------------------------------------------|
| Do we have enough data? | Yes - 37M tokens is 4-37x what LoRA needs |
| Can you train locally? | Yes - 5070 Ti 16GB handles 7B QLoRA fine |
| Time to first model? | 2-3 weeks (dataset prep + training) |
| Cost? | $0 (your hardware) |
| Will it be better than generic Claude? | Yes for Eiffel-specific tasks |
The Eiffel ecosystem has enough code. Your GPU is capable. The only work is converting existing code to instruction format - and that can be semi-automated.
My recommendation: Start with Simple Eiffel only (highest quality, you control it). 10K instruction pairs from your 71 libraries. Train on your GPU. If it works, expand to include ISE/Gobo/EiffelLoop.