Introducing ProofScaffold: A Modular, AI-Friendly Toolchain for Metamath

Mingli Yuan

Feb 11, 2026, 2:05:05 AM
to meta...@googlegroups.com
Dear Metamath Community,

I am writing to share an open-source project we have been working on called ProofScaffold (https://github.com/epistemic-frontier/proof-scaffold).

Our team has deep respect for set.mm and the rigorous foundation this community has built. However, as we explore the intersection of Metamath and Large Language Models (LLMs), we’ve encountered a persistent challenge: feeding a 48MB monolithic file to an AI often leads to context dilution, hallucinated imports, and noisy proof searches.

To solve this, we built ProofScaffold. It acts as a package manager and linker (written in Python) for Metamath. It allows us to split massive databases into composable, compilable, and independently verifiable packages with explicit dependencies and exports—much like Cargo or NPM, but for formal math.

Crucially, the Trusted Computing Base (TCB) does not change. Python acts strictly as an untrusted builder/linker. The final output is a transient, flattened, standard .mm monolith that is verified by metamath-exe or metamath-knife.
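For concreteness, here is a minimal sketch of that trust story, assuming a hypothetical `link` helper (not ProofScaffold's actual API): Python merely flattens the packages into one .mm file, and the only trusted step is the stock verifier run on the result.

```python
import subprocess
from pathlib import Path

def link(package_files: list[str], out: str = "target/out.mm") -> str:
    """Untrusted linker (hypothetical): flatten packages in dependency order."""
    Path(out).parent.mkdir(parents=True, exist_ok=True)
    Path(out).write_text(
        "\n".join(Path(p).read_text(encoding="utf-8") for p in package_files),
        encoding="utf-8",
    )
    return out

db = link(["prelude.mm", "logic.mm"])
# Trusted step: the standard verifier checks the flattened monolith.
subprocess.run(["metamath", f"read {db}", "verify proof *", "exit"], check=True)
```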

We believe this modularity is the key to unlocking AI's true potential in formal mathematics:

- Targeted Retrieval: By scoping the context to a specific package (e.g., just fol or zf), we drastically reduce noise for the LLM.

- Controlled Semantic Boundaries: Explicit exports provide the AI with a strict subset of permissible symbols and axioms. This prevents hallucinated reasoning, such as an AI accidentally employing the Axiom of Choice in a strict ZF-only context.

- Verifiable Incremental Loops: An AI can generate a proof, verify it locally within the package, and map any verifier errors directly back to the specific package contract (e.g., a missing $f or a label conflict). This makes AI self-correction much more reliable (see the sketch just after this list).

- Curriculum Alignment: Modular packages naturally form a curriculum (axioms → definitions → lemmas → theorems), providing high-quality gradient data for training models, rather than overwhelming them with the entire set.mm at once.
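
To make the "Verifiable Incremental Loops" point concrete, here is a hypothetical sketch of such a loop. Every name in it (`llm.propose`, `package.verify`, `package.exports`, `llm.repair`) is illustrative, not a real ProofScaffold API:

```python
def repair_loop(llm, package, goal, max_rounds: int = 5):
    """Generate a proof, verify it inside the package, and feed
    structured verifier errors back to the model (hypothetical API)."""
    context = package.exports()                # strict symbol/axiom subset
    attempt = llm.propose(goal, context)
    for _ in range(max_rounds):
        ok, errors = package.verify(attempt)   # errors map to the package contract
        if ok:
            return attempt
        attempt = llm.repair(attempt, errors)  # e.g. a missing $f, a label clash
    raise RuntimeError("no verified proof within the round budget")
```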

We have successfully ported a prelude and are currently porting the logic package (aligned with the |- conventions of set.mm). Our next step is to further subdivide logic into prop, fol, eq, and setvar packages, and to generate machine-readable interface manifests to help AI planners.
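
Such an interface manifest might look roughly like this. The field shape is illustrative (no schema is committed yet), though the labels are real set.mm labels:

```python
# Hypothetical machine-readable manifest an AI planner could consume
# to learn which symbols and axioms are in scope before proof search.
manifest = {
    "name": "metamath-logic",
    "depends": ["metamath-prelude"],
    "exports": {
        "axioms": ["ax-mp", "ax-1", "ax-2", "ax-3"],   # propositional core
        "theorems": ["mp2", "mp2b", "a1i", "id"],      # sample exports
    },
    "forbidden": ["ax-ac"],   # e.g. keep Choice out of a ZF-only scope
}
```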

A Question for the Community regarding PyPI Naming:

To make this ecosystem easily accessible for AI researchers and engineers, we plan to publish these modularized databases as Python packages on PyPI.

I would like to ask if the community is comfortable with us using the metamath- prefix for these packages (e.g., metamath-logic, metamath-zfc). I want to be entirely respectful of the Metamath trademark/legacy and ensure this doesn't cause confusion with official tools. If the community prefers we use a different namespace (e.g., proof-scaffold-logic), we will gladly do so.

I would love to hear your thoughts, feedback, or critiques on this approach.

Best regards,

Mingli Yuan

Glauco

Feb 11, 2026, 8:07:43 AM
to Metamath
Hi Mingli Yuan,

I've given Gemini your README.md and asked it to prove mp2 with your framework. Below is the Python code that Gemini returned.
Is it completely hallucinated, or is it close?

from logic.propositional.hilbert import System
from logic.propositional.hilbert.lemmas import Proof, ProofBuilder

def prove_mp2(sys: System) -> Proof:
    """
    Double Modus Ponens: φ, ψ, (φ → (ψ → χ)) ⊢ χ
    """
    # 1. Initialize the builder with the system and theorem name
    lb = ProofBuilder(sys, "mp2")

    # 2. Define Hypotheses (mp2.1, mp2.2, mp2.3)
    h_phi = lb.hyp("mp2.1", "φ")
    h_psi = lb.hyp("mp2.2", "ψ")
    h_chi_nested = lb.hyp("mp2.3", "φ → (ψ → χ)")

    # 3. Apply the first Modus Ponens (Step 4 in .mmp)
    # This uses h_phi and h_chi_nested to get (ψ → χ)
    step4 = lb.mp("s1", h_phi, h_chi_nested, note="Step 4: MP h_phi, h_chi_nested")

    # 4. Apply the second Modus Ponens (Step 5 in .mmp)
    # This uses h_psi and step4 to get χ
    step5 = lb.mp("s2", h_psi, step4, note="Step 5: MP h_psi, s1")

    # 5. Return the build targeting the final result
    return lb.build(step5)

BR
Glauco

Mingli Yuan

Feb 11, 2026, 9:33:24 AM
to meta...@googlegroups.com

Hi Glauco,

Thanks for checking.

Logically, the structure is correct: from φ and (φ → (ψ → χ)) we get (ψ → χ) by MP, then from ψ we get χ by MP again. So the proof skeleton is right.

However, my porting work is not finished yet; the logic.propositional.hilbert and related packages have not been released, so this code cannot be verified by the community right now. There are around 20k lines of proofs in the logic part of Metamath, so please give me some time to finish it.

Or, I may release the prelude and logic packages earlier, before they are fully ported, if anyone is interested.

Best,

Mingli



Mingli Yuan

Feb 11, 2026, 3:32:18 PM
to meta...@googlegroups.com

Hi Glauco,

I saw your email and felt very inspired, so I worked late into the night to get the code ready.

Now I have published metamath-prelude and metamath-logic. They are still "alpha" versions and not perfect, but I think they are usable now.

1. Proof Lab

I made a new repo called Proof Lab here: https://github.com/epistemic-frontier/proof-lab

You can clone it and just follow the README. It can verify the Python proofs (like your mp2 example) using the real Metamath verifier.

2. Original Authorship

Also, I want to say one thing: even though I use a Python interface, I try my best to keep the original authors' comments in the source code. For example, please check this file: https://github.com/epistemic-frontier/metamath-logic/blob/main/src/logic/propositional/hilbert/lemmas.py#L222

You can see that the note from NM (Norman Megill), dated 30-Sep-1992, is kept there. I think attribution is very important.

Please try the lab. Let me know if you have any problems.

Best,

Mingli

Glauco

Feb 11, 2026, 4:46:42 PM
to Metamath
Hi Mingli,

I've tried it in Play-With-Docker; here is the script that worked for me (from a clean new PWD instance):

curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env
git clone https://github.com/epistemic-frontier/metamath-logic.git
git clone https://github.com/epistemic-frontier/metamath-prelude.git
git clone https://github.com/epistemic-frontier/proof-lab.git
cd proof-lab
uv run skfd verify prove_modus_tollens.py


Here's the output:

[node1] (local) ro...@192.168.0.29 ~/proof-lab
$ uv run --frozen skfd verify prove_modus_tollens.py
Verifying script: prove_modus_tollens.py ...
Found 1 proof(s): prove_modus_tollens
Running prove_modus_tollens... OK (modus_tollens)
Emitting 1 proofs to Metamath... OK

Is the proof stored somewhere?

Glauco

Mingli Yuan

Feb 11, 2026, 7:43:55 PM
to meta...@googlegroups.com

Hi Glauco,

Please try again with new clones; I have updated some scripts.

Verifying a single proof generates a *.mm file in the target folder.

Verifying the whole project also generates the *.mm file plus source-map files in the target folder:

```
uv run --frozen skfd verify proof-lab
```

Best,

Mingli


Mingli Yuan

Feb 11, 2026, 7:47:25 PM
to meta...@googlegroups.com
By the way, it uses mmverify.py by default and in CI; you can add other verifiers:

```
skfd verifier list
skfd verifier add knife KNIFE_PATH
```

Mingli

moz...@gmail.com

12:12 PM
to meta...@googlegroups.com, mingl...@gmail.com
Hi Yuan,

The ProofScaffold you are describing is essentially an instance of “MathOps”!

TL;DR: I have developed the `arrowgram` commutative-diagram editor
(https://hotdocx.github.io/arrowgram). I then set up GPT-5.3-Codex inside
my AI workspace to read the JSON text of those diagrams and generate
the `emdash` proof assistant for lax ω-categories
(https://github.com/hotdocx/emdash), and to generate its report
(https://hotdocx.github.io/r/26043CPAL64001) as a book/slides with
embedded `arrowgram` diagrams. You can try (and clone) my shared
AI workspace and fund it at GetPaidX.com, a.k.a.
https://LastRevision.pro/r/26044DLGJ77000. And probably you should
disregard the rest of this email.

Bye.

---

APPENDIX:

What is MathOps? Indeed, as soon as you can set up some « MathOps »
(i.e. math DevOps engineering) for a long-running “LLM ↔
proof-checker” feedback loop, then autoformalization (e.g. “≈130k lines
in two weeks”) is expected (strangely, Terminal-Bench 2.0 has no such
benchmarks!).

For category theory, the “proof-checker” (i.e. computational logic)
question has been solved since Kosta Dosen's “Cut-Elimination in
Categories” (1999), but the tools to easily specify/implement it were
not available until Frédéric Blanqui's Lambdapi logical framework
(rewrite/unification rules) and:

emdash 2 — Functorial programming for strict/lax ω-categories in Lambdapi
https://github.com/1337777/cartier/blob/master/cartierSolution19.lp
(99% generated by AI, lol)
https://hotdocx.github.io/r/26043CPAL64001

And with good “mathops/devops engineering”, a general LLM such as
GPT-5.3-Codex with Codex CLI has succeeded in coding this whole
≈4k-line file `cartierSolution19.lp` from just text prompts containing
`arrowgram` commutative diagrams which hinted at this “dependent
hom/arrow/comma category” construction:

## 1) Dependent arrow/comma/hom for a dependent category and
“stacking” pasting diagrams

We use displayed/dependent categories over a base category `B` (i.e. a
fibration `E : B → Cat`, or a more general isofibration `E → B`
without transport/arrow-action).

For `b₀ ∈ B` and `e₀ ∈ E(b₀)`, the dependent hom/comma/arrow-category
is the functor:
- `Homd_E(e₀, (–,–)) : E ×_B (Hom_B(b₀, –))ᵒᵖ → Cat`
- notations: `Homd_E : Π b₀, E[b₀]ᵒᵖ → (Σ b₁, E[b₁] × Hom_B(b₀, b₁)ᵒᵖ) → Cat`

On a fibre object `e₁ ∈ E(b₁)` and a base arrow `b₀₁ : b₀ → b₁`, it
returns the fibre hom-category, where `(b₀₁)! e₀` is the
`E`-action/transport of `e₀` along `b₀₁`:
- `Homd_E(e₀, (e₁ , b₀₁)) ≔ Hom_{E(b₁)}( (b₀₁)! e₀ , e₁ )`

That is, this syntactic triangle/simplicial classifier `homd_`
reduces to `globular` semantics, but is necessary for iterating
simplicially (triangles → tetrahedra → …) and for expressing
non-cartesian lax ω-functors and lax ω-transfors.

Moreover the functoriality of `Homd_E(e₀, (–,–))`, especially in its
second argument `(Hom_B(b₀, –))ᵒᵖ`, expresses precisely the
“stacking” of 2-cells along a 1-cell (i.e. the generalization of
horizontal composition of 2-cells meeting at a 0-cell).

https://hotdocx.github.io/r/26043CPAL64002

This dependent comma/arrow/hom category `homd_` is analogous to the
(directed) “bridge type” construction in the technique of logical
relations/parametricity, i.e. Ambrus Kaposi's “Internal parametricity,
without an interval”; and it is really an *internalized*
type-theory/computational-logic, as contrasted with Hugo Herbelin's “A
parametricity-based formalization of semi-simplicial and semi-cubical
sets”, which is merely a (formalized) combinatorial analysis.

## 2) Internalized computational-logic and “syntax elaboration”

The `emdash` Lambdapi specification looks like the categorical
semantics (“categories-with-families”) of a programming language; but
because it is carefully formulated to be “computational” and its
dependent-types/logic is “internalized”, it can actually be
translated as a traditional programming-language/proof-assistant
surface syntax ( https://github.com/hotdocx/emdash ) whose HOAS
bidirectional type-checking and elaboration/engineering in TypeScript
is easily guaranteed.

For example, the fibred/displayed functor `FF : Functord_(Z) E D`
between isofibrations `E D : Catd Z` over `Z` is read as `z :^f Z, e
:^f E[z] ⊢ FF[e] : D[z]` (where binders `:^f` are functorial and `:^n`
are natural, and `:^o` is binding over objects-only). Thus `Functord`
is the traditional Π-type (`Functord_(Z) E D ≡ Π_(Z) (Functor_catd E
D)`), but with a construction `fapp1_funcd` to express *lax*
functoriality in the `e : E[z]` variable (where `→_` is the dependent
hom `homd_`, used to express that this expression may be non-cartesian
in general, i.e. a *lax* functor):
- `z :^f Z, e :^n E[z], z' :^f Z, f :^n z → z', e' :^n E[z'], σ :^f e
→_f e' ⊢ FF₁(σ) : FF[e] →_f FF[e']`

Similarly from the usual “diagonal” components `z :^f Z, e :^n E[z] ⊢
ϵ[e] : FF[e] → GG[e]` for a displayed transfor/transformation `ϵ :
Transfd_(Z) FF GG`, there is a construction `tapp1_fapp0` for
“off-diagonal” components (i.e. the index/subscript is an arrow
instead of an object), to express *lax* naturality/functoriality in
the `e : E[z]` variable:
- `ϵ :^f Transfd FF GG, z :^f Z, e :^n E[z], z' :^f Z, f :^n z → z',
e' :^n E[z'], σ :^f e →_f e' ⊢ ϵ_(σ) : FF[e] →_f GG[e']`

These constructions are expressed *internally* (as stable head symbols
`fdapp1_int_transfd` and `tdapp1_int_func_transfd`, etc... the
discharge/abstraction/lambda of the full (or partial) list of
variables in the context-patterns above), therefore those newly
introduced variables vary functorially/naturally. Note that when `E :
Functor Z Cat` is a module/profunctor, then the exchange of the order
of binder variables (i.e. “module-transport” vs “functor-action”) must
be an explicit operation:
- `z :^n Z, e :^f E[z], z' :^n Z, f :^f z → z' ⊢ (f)_!(e) ≔ E₁(f)(e) : E[z']`

And because there is available a “context-extension” / total-category
/ Sigma-category construction `Total_cat E : Cat` (a.k.a. `Σ_(B) E :
Cat`) for any fibred category `E : Catd B`, all this surface syntax
can actually happen within any ambient context `Γ, ⋯ ⊢` (i.e. the base
`Z` is itself `Total_cat Z0` for `Z0 : Catd Γ`). And the usual
categorical-logic fact that the Sigma/total-category `Σ b, –` is
left-adjoint to weakening/lift `–[b]` means that the dependent
hom/arrow-category has an alternative formulation (`homd_curry` in the
lambdapi spec file `cartierSolution19.lp`):
- `Homd_E : Π b₀, E[b₀]ᵒᵖ → Π b₁, Hom_B(b₀, b₁)ᵒᵖ → E[b₁] → Cat[b₁]`

In reality, the *internal* computational-logic for lax ω-categories is
*easier* to express than for only strict 1-categories: the
hom/comma of a category `Hom_D(y, F –)` (i.e. comma, rather than hom,
is necessary) is recursively a (fibred) category, and the arrow-action
of a *lax* functor `F₁ : Hom_C(x, –) → Hom_D(F₀ x, F –)` is recursively
a fibred functor which is non-cartesian (because of laxness), i.e. a
commuting-triangle/identity 2-arrow is mapped to a non-identity
2-arrow by `((F₁)₁)₀ : Homd_(Hom_C(x, –))(f, (g ∘ f, g)) →
Homd_(_)((F₁)₀ f, ((F₁)₀ (g ∘ f), (F₁)₀ g))`.

But for ω-iteration of the natural n-transfors construction (i.e. a
natural family rather than an individual n-arrow), there needs to be a
way to re-interpret a natural transformation again as a functor: this
is the question solved by the next section.

## 3) Naturality as “cut accumulation” and the exchange law

A (fibred) transformation: `F ⇒ G`, is also a transformation from the
unit/hom profunctor (i.e. “off-diagonal”): `hom(~, —) ⇒ hom(F ~, G
—)`, and is back again as a fibred functor between isofibrations: `∫
hom(~, —) → ∫ hom(F ~, G —)`; thus we can continue the ω-iteration of
this construction.

The key point is that *naturality can be oriented as an “accumulation”
rewrite* on these transfors' “off-diagonal” components `ϵ_(–)`
(cut-elimination style, where `⋅` is the vertical composition/cut):
- `(G b) ⋅ ϵ_(a) ↪ ϵ_(b⋅a)`
- `ϵ_(b) ⋅ (F a) ↪ ϵ_(b⋅a)`

An instance of this accumulation rule is the *exchange law* between
horizontal and vertical compositions: an unambiguous “pasting diagram”
with two vertical 2-cells `α ≔ a` then `β ≔ b`, and one horizontal
2-cell `ϵ ≔ (e ∘ —)` for `e : f → g` (where `G ≔ (g ∘ —)` and `F ≔ (f
∘ —)` are horizontal post-composition/whiskering), will normalize to a
unique form (where `∘` is horizontal composition):
- `(g ∘ β) ⋅ (e ∘ α) ↪ e ∘ (β⋅α)`

https://hotdocx.github.io/r/26043CPAL65001
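
For intuition only, here is a toy Python sketch of that accumulation
rewrite (it has nothing to do with the actual Lambdapi implementation;
the term constructors are made up for this example):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Eps:                 # off-diagonal component eps_(a)
    index: object

@dataclass(frozen=True)
class App:                 # functor image, e.g. F(a) or G(b)
    functor: str
    cell: object

@dataclass(frozen=True)
class VComp:               # vertical composition/cut: top . bottom
    top: object
    bottom: object

def accumulate(t):
    """Normalize by the two accumulation rules, bottom-up."""
    if isinstance(t, VComp):
        top, bot = accumulate(t.top), accumulate(t.bottom)
        if isinstance(top, App) and isinstance(bot, Eps):
            return Eps(VComp(top.cell, bot.index))   # (G b) . eps_(a) ~> eps_(b.a)
        if isinstance(top, Eps) and isinstance(bot, App):
            return Eps(VComp(top.index, bot.cell))   # eps_(b) . (F a) ~> eps_(b.a)
        return VComp(top, bot)
    return t

print(accumulate(VComp(App("G", "b"), Eps("a"))))
# -> Eps(index=VComp(top='b', bottom='a'))
```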

Moreover such formulations of naturality are useful to express
computationally the triangular identities of *adjoint functors*:
- `ϵ_(f) ∘ LAdj(η_(g)) ↪ f ∘ LAdj(g)`

and for sheaves and schemes, polynomial functors, higher inductive
types, etc... (ref: `cartierSolution16.lp`)

## 4) GPT-5.3-Codex diagrammatic prompting via `arrowgram` and “MathOps”

What is MathOps? Example: how to enable the GPT-5.3-Codex + Codex CLI
coding agent to (natively) understand and generate commutative
diagrams? A MathOps solution is `arrowgram`
(https://github.com/hotdocx/arrowgram/), an open-source strict JSON
text specification for commutative diagrams, with export to
JSON/SVG/PNG/TikZ and an AI editor that generates and renders
`markdown` papers and slide presentations with embedded diagram
specs, from any uploaded source references.

While the core `arrowgram` open-source app can be downloaded and used
offline, there is an online academic-publishing overlay SaaS app,
https://hotdocx.github.io, to enable rapid/iterative MathOps from idea,
to sharing, to community peer review!

MathOps is what happens after you get “desk-rejected” by POPL 2026,
lol... A MathOps solution is the ability to share workspace sessions
(docker), with pre-installed running AI coding agents +
proof-checkers, for “vibe coding” live with co-workers, in a
marketplace where large-scale community-workspaces can get paid and
funded by fans and local businesses and governments; bypassing
well-known falsifications and intoxications.

This MathOps idea is implemented by
https://GetPaidX.com — Publish and fund AI workspaces

backed by AI communities of 20,000+ developers in Dubai, Mumbai,
Shanghai (https://meetup.com/dubai-ai, https://meetup.com/mumbai-ai,
https://meetup.com/shanghai-ai, ...). Try and experiment with the AI
workspace for `emdash` here (e.g. ask Codex “where is the dependent
hom defined? is it correct?”):
https://GetPaidX.com/r/26044DLGJ77000


---

References:
[1] Kosta Dosen, Zoran Petric (1999). “Cut-Elimination in Categories”
[2] This summer visit to Ambrus Kaposi in Budapest, and `emdash`
discussions at RHPL@FSTTCS 2025 in India
[3] GPT-5.3-Codex