Idea: Release ".txt" proofs for set.mm databases ("mpetxt") for AI/ML tools, esp Generative Pre-trained Transformers (GPTs)


David A. Wheeler

Feb 13, 2023, 1:20:03 PM
to Metamath Mailing List
All: I'm thinking it might be a good idea to create a new "database area",
"mpetxt" (to match "mpeuni"/"mpegif"), with many simple text files - 1 per proof.
The goal would be to make it easier for AI/ML tools to learn from them
so they perhaps could generate potential proofs.

Current AI (specifically machine learning) tools have made a lot of progress.
In particular, Generative Pre-trained Transformers (GPTs) like GPT-3 have managed
to generate remarkably good, or at least useful, answers in many cases.
However, they aren't magic. Our compressed database files are hard for them to learn
from, and the HTML tables we show are human-readable but
challenging for a computer to learn from. On the other hand, once proofs are presented
as ordinary text, their internal structure is similar enough to text and code that it's
plausible the same techniques could also work on them.

It appears that Microsoft is going to invest LOTS of money in scraping the web to
learn from, using that to train & generate better results. Google appears to be rushing
to do the same, as a countermeasure. They can already generate passable code
from prompts in many situations; even when they get it wrong, people report the
results are often helpful as ideas & information for helping to solve the real problem.

So I propose generating text files (1/proof) that would make it easier for these
tools to potentially learn how to generate Metamath proofs. Even when they get it wrong,
they *might* provide useful tips when trying to create a proof. It's not clear to me that
the ~40K samples from MPE are enough for them to learn... but they *might* be, and the
only way to know is to let them try. The idea is that you'd ask one of these tools
"Please complete the following: Metamath proof of NAME in database set.mm NAME $p ...."
and it would complete it with an answer.

Below is an example of what those files might look like. I'd basically exploit
Metamath's "show statement" and "show proof" and try to strip out distractors
that could interfere with machine learning.
It might be wise to make the hypotheses & goal a single long sentence, since it's not clear if
users can enter embedded newlines in their prompts, but I haven't done that here.
I think it's important that each proof be in its own file, so the ML algorithm knows
exactly where they begin and end.
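
For concreteness, here is a rough (untested) sketch of how such files could be
generated, driving metamath-exe with the same "show statement"/"show proof"
commands mentioned above; the "mpetxt" output directory and the distractor
filtering are assumptions:

import pathlib
import subprocess

def proof_text(label):
    # Capture the statement plus a Lemmon-style proof for one theorem.
    commands = [
        'set scroll continuous', 'set width 9999', 'read set.mm',
        'show statement ' + label,
        'show proof ' + label + ' /lemmon /renumber',
        'quit',
    ]
    result = subprocess.run(['./metamath'] + commands,
                            capture_output=True, text=True, check=True)
    return result.stdout  # a real script would strip prompts & other distractors

# One file per proof, so the ML algorithm knows exactly where each begins & ends.
outdir = pathlib.Path('mpetxt')
outdir.mkdir(exist_ok=True)
label = 'pm5.74'
text = 'Metamath proof of %s in database set.mm\n%s' % (label, proof_text(label))
(outdir / (label + '.txt')).write_text(text)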

There are currently 40627 $p statements in set.mm. I'm guessing that most will fit within
a single 4KiB block (storage is allocated in blocks), so this would require ~159 MiB of space.
The us.metamath.org website isn't brimming with space, but we can accommodate that.

--- David A. Wheeler

========================
Metamath proof of pm5.74 in database set.mm
pm5.74 $p |- ( ( ph -> ( ps <-> ch ) ) <-> ( ( ph -> ps ) <-> ( ph -> ch ) ) ) $= ... $.

PROOF:
1 biimp $p |- ( ( ps <-> ch ) -> ( ps -> ch ) )
2 1 imim3i $p |- ( ( ph -> ( ps <-> ch ) ) -> ( ( ph -> ps ) -> ( ph -> ch ) ) )
3 biimpr $p |- ( ( ps <-> ch ) -> ( ch -> ps ) )
4 3 imim3i $p |- ( ( ph -> ( ps <-> ch ) ) -> ( ( ph -> ch ) -> ( ph -> ps ) ) )
5 2,4 impbid $p |- ( ( ph -> ( ps <-> ch ) ) -> ( ( ph -> ps ) <-> ( ph -> ch ) ) )
6 biimp $p |- ( ( ( ph -> ps ) <-> ( ph -> ch ) ) -> ( ( ph -> ps ) -> ( ph -> ch ) ) )
7 6 pm2.86d $p |- ( ( ( ph -> ps ) <-> ( ph -> ch ) ) -> ( ph -> ( ps -> ch ) ) )
8 biimpr $p |- ( ( ( ph -> ps ) <-> ( ph -> ch ) ) -> ( ( ph -> ch ) -> ( ph -> ps ) ) )
9 8 pm2.86d $p |- ( ( ( ph -> ps ) <-> ( ph -> ch ) ) -> ( ph -> ( ch -> ps ) ) )
10 7,9 impbidd $p |- ( ( ( ph -> ps ) <-> ( ph -> ch ) ) -> ( ph -> ( ps <-> ch ) ) )
11 5,10 impbii $p |- ( ( ph -> ( ps <-> ch ) ) <-> ( ( ph -> ps ) <-> ( ph -> ch ) ) )

========================
Metamath proof of axtgcgrrflx in database set.mm
axtrkg.p $e |- P = ( Base ` G ) $.
axtrkg.d $e |- .- = ( dist ` G ) $.
axtrkg.i $e |- I = ( Itv ` G ) $.
axtrkg.g $e |- ( ph -> G e. TarskiG ) $.
axtgcgrrflx.1 $e |- ( ph -> X e. P ) $.
axtgcgrrflx.2 $e |- ( ph -> Y e. P ) $.
axtgcgrrflx $p |- ( ph -> ( X .- Y ) = ( Y .- X ) ) $= ... $.

PROOF:
1 df-trkg $a |- TarskiG = ( ( TarskiGC i^i TarskiGB ) i^i ( TarskiGCB i^i { f | [. ( Base ` f ) / p ]. [. ( Itv ` f ) / i ]. ( LineG ` f ) = ( x e. p , y e. ( p \ { x } ) |-> { z e. p | ( z e. ( x i y ) \/ x e. ( z i y ) \/ y e. ( x i z ) ) } ) } ) )
2 inss1 $p |- ( ( TarskiGC i^i TarskiGB ) i^i ( TarskiGCB i^i { f | [. ( Base ` f ) / p ]. [. ( Itv ` f ) / i ]. ( LineG ` f ) = ( x e. p , y e. ( p \ { x } ) |-> { z e. p | ( z e. ( x i y ) \/ x e. ( z i y ) \/ y e. ( x i z ) ) } ) } ) ) C_ ( TarskiGC i^i TarskiGB )
3 inss1 $p |- ( TarskiGC i^i TarskiGB ) C_ TarskiGC
4 2,3 sstri $p |- ( ( TarskiGC i^i TarskiGB ) i^i ( TarskiGCB i^i { f | [. ( Base ` f ) / p ]. [. ( Itv ` f ) / i ]. ( LineG ` f ) = ( x e. p , y e. ( p \ { x } ) |-> { z e. p | ( z e. ( x i y ) \/ x e. ( z i y ) \/ y e. ( x i z ) ) } ) } ) ) C_ TarskiGC
5 1,4 eqsstri $p |- TarskiG C_ TarskiGC
6 axtrkg.g $e |- ( ph -> G e. TarskiG )
7 5,6 sseldi $p |- ( ph -> G e. TarskiGC )
8 axtrkg.p $e |- P = ( Base ` G )
9 axtrkg.d $e |- .- = ( dist ` G )
10 axtrkg.i $e |- I = ( Itv ` G )
11 8,9,10 istrkgc $p |- ( G e. TarskiGC <-> ( G e. _V /\ ( A. x e. P A. y e. P ( x .- y ) = ( y .- x ) /\ A. x e. P A. y e. P A. z e. P ( ( x .- y ) = ( z .- z ) -> x = y ) ) ) )
12 11 simprbi $p |- ( G e. TarskiGC -> ( A. x e. P A. y e. P ( x .- y ) = ( y .- x ) /\ A. x e. P A. y e. P A. z e. P ( ( x .- y ) = ( z .- z ) -> x = y ) ) )
13 12 simpld $p |- ( G e. TarskiGC -> A. x e. P A. y e. P ( x .- y ) = ( y .- x ) )
14 7,13 syl $p |- ( ph -> A. x e. P A. y e. P ( x .- y ) = ( y .- x ) )
15 axtgcgrrflx.1 $e |- ( ph -> X e. P )
16 axtgcgrrflx.2 $e |- ( ph -> Y e. P )
17 oveq1 $p |- ( x = X -> ( x .- y ) = ( X .- y ) )
18 oveq2 $p |- ( x = X -> ( y .- x ) = ( y .- X ) )
19 17,18 eqeq12d $p |- ( x = X -> ( ( x .- y ) = ( y .- x ) <-> ( X .- y ) = ( y .- X ) ) )
20 oveq2 $p |- ( y = Y -> ( X .- y ) = ( X .- Y ) )
21 oveq1 $p |- ( y = Y -> ( y .- X ) = ( Y .- X ) )
22 20,21 eqeq12d $p |- ( y = Y -> ( ( X .- y ) = ( y .- X ) <-> ( X .- Y ) = ( Y .- X ) ) )
23 19,22 rspc2v $p |- ( ( X e. P /\ Y e. P ) -> ( A. x e. P A. y e. P ( x .- y ) = ( y .- x ) -> ( X .- Y ) = ( Y .- X ) ) )
24 15,16,23 syl2anc $p |- ( ph -> ( A. x e. P A. y e. P ( x .- y ) = ( y .- x ) -> ( X .- Y ) = ( Y .- X ) ) )
25 14,24 mpd $p |- ( ph -> ( X .- Y ) = ( Y .- X ) )


Jon P

Feb 14, 2023, 1:46:56 PM
to Metamath
I think that's a great idea David! Seems like very low downside and it would be interesting to see if language models could train on the datasets. 

There's already an option to Customise GPT3 with additional datasets so once this is ready it could be an interesting experiment to try.

https://openai.com/blog/customized-gpt-3/

Cris Perdue

Feb 14, 2023, 9:32:33 PM
to meta...@googlegroups.com
David,

I take this idea as an approach to making metamath proofs more digestible by machines, and I love it for that.

It might be worth mentioning that it is easier than it might appear to extract the text from a web page as in the Metamath site. In fact a tiny experiment on my part came out nicer than I hoped.  Opening up the Chrome debugging console on the page for pm5.74, I did:

> console.log(document.body)

and got:

Metamath Proof Explorer < Previous   Next >
Nearby theorems
Mirrors  >  Home  >  MPE Home  >  Th. List  >  pm5.74 Structured version   Visualization version   GIF version
Theorem pm5.74 262
Description: Distribution of implication over biconditional. Theorem *5.74 of [WhiteheadRussell] p. 126. (Contributed by NM, 1-Aug-1994.) (Proof shortened by Wolf Lammen, 11-Apr-2013.)
Assertion
Ref Expression
pm5.74 ⊢ ((𝜑 → (𝜓 ↔ 𝜒)) ↔ ((𝜑 → 𝜓) ↔ (𝜑 → 𝜒)))
Proof of Theorem pm5.74
Step Hyp Ref Expression
1   biimp 207 . . . 4 ⊢ ((𝜓 ↔ 𝜒) → (𝜓 → 𝜒))
2 1 imim3i 64 . . 3 ⊢ ((𝜑 → (𝜓 ↔ 𝜒)) → ((𝜑 → 𝜓) → (𝜑 → 𝜒)))
3   biimpr 212 . . . 4 ⊢ ((𝜓 ↔ 𝜒) → (𝜒 → 𝜓))
4 3 imim3i 64 . . 3 ⊢ ((𝜑 → (𝜓 ↔ 𝜒)) → ((𝜑 → 𝜒) → (𝜑 → 𝜓)))
5 2, 4 impbid 204 . 2 ⊢ ((𝜑 → (𝜓 ↔ 𝜒)) → ((𝜑 → 𝜓) ↔ (𝜑 → 𝜒)))
6   biimp 207 . . . 4 ⊢ (((𝜑 → 𝜓) ↔ (𝜑 → 𝜒)) → ((𝜑 → 𝜓) → (𝜑 → 𝜒)))
7 6 pm2.86d 108 . . 3 ⊢ (((𝜑 → 𝜓) ↔ (𝜑 → 𝜒)) → (𝜑 → (𝜓 → 𝜒)))
8   biimpr 212 . . . 4 ⊢ (((𝜑 → 𝜓) ↔ (𝜑 → 𝜒)) → ((𝜑 → 𝜒) → (𝜑 → 𝜓)))
9 8 pm2.86d 108 . . 3 ⊢ (((𝜑 → 𝜓) ↔ (𝜑 → 𝜒)) → (𝜑 → (𝜒 → 𝜓)))
10 7, 9 impbidd 202 . 2 ⊢ (((𝜑 → 𝜓) ↔ (𝜑 → 𝜒)) → (𝜑 → (𝜓 ↔ 𝜒)))
11 5, 10 impbii 201 1 ⊢ ((𝜑 → (𝜓 ↔ 𝜒)) ↔ ((𝜑 → 𝜓) ↔ (𝜑 → 𝜒)))
Colors of variables: wff setvar class
Syntax hints:   → wi 4   ↔ wb 198
This theorem was proved from axioms:  ax-mp 5  ax-1 6  ax-2 7  ax-3 8
This theorem depends on definitions:  df-bi 199
This theorem is referenced by:  pm5.74i  263  pm5.74ri  264  pm5.74d  265  pm5.74rd  266  bibi2d  334  pm5.32  570  orbidi  976
  Copyright terms: Public domain W3C validator

So perhaps the "plain text" may be accessible enough to systems with well-developed HTML and DOM processing. Arguably it might be more desirable to have the ASCII text rather than Unicode in the formulas, as parsing of the ASCII text is presumably better defined.

There is something quite similar that I think would be even cooler: JSON-format versions of theorem pages. JSON is a well-defined textual format, extremely popular and in widespread use on the Web and elsewhere. I'm thinking of a JSON-based format something like:

{"theorem": "m5.74",
 "hyps": [],
 "wff": "( ( ph -> ( ps <-> ch ) ) <-> ( ( ph -> ps ) <-> ( ph -> ch ) ) )",
 "proof": [
 {"i": 1, "ref": "biimp" "f": "( ( ps <-> ch ) -> ( ps -> ch ) )"},
 {"i": 2, "deps": [1], "ref": "imim3i", "f":( ( ph -> ( ps <-> ch ) ) -> ( ( ph -> ps ) -> ( ph -> ch ) ) )"},
 {"i": 3 "ref": "biimpr", "f": "( ( ps <-> ch ) -> ( ch -> ps ) )"},
 {"i": 4 "deps": [3], "ref": "imim3i", "f": "( ( ph -> ( ps <-> ch ) ) -> ( ( ph -> ch ) -> ( ph -> ps ) ) )"},
 {"i": 5, "deps": [2,4], "ref": "impbid", "f":  "( ( ph -> ( ps <-> ch ) ) -> ( ( ph -> ps ) <-> ( ph -> ch ) ) )"},
 (and so on)
 ]
}

(The details of course could vary, while still making use of the JSON format.)

Presumably pages in such a format would include enough information, such as distinct variable groups,
to enable reproducing actual proofs. It could be used as data for experimental proof displayers. It would also
be very straightforward to build web apps that present proofs in custom formats within web browsers, as
browsers support JSON format very well.
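
As a tiny illustration, a consumer of such a file would need only a few lines; here is a sketch in Python, assuming the field names from the example above (with "deps" absent on leaf steps):

import json

with open('pm5.74.json') as f:
    thm = json.load(f)

print('Proof of %s: %s' % (thm['theorem'], thm['wff']))
for step in thm['proof']:
    # Steps with no "deps" entry cite no earlier steps.
    deps = ','.join(str(d) for d in step.get('deps', []))
    print('%3d %-8s %-10s %s' % (step['i'], deps, step['ref'], step['f']))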

I see no reason this kind of format should not be at least as accessible to AI as plain text.

-Cris



Mario Carneiro

Feb 14, 2023, 9:58:00 PM
to meta...@googlegroups.com
I think a json version of the web pages would make a lot more sense than a plain text version, if the intent is to support machine consumption by tools that can't parse metamath already. There are even uses for that in the web pages themselves, since an interactive version of the web pages would like to have the underlying data in a more easily processed form so that it can do things with it. In fact, we could even eliminate the mpeuni/mpegif pages entirely and replace them with an on-the-spot client-side renderer which uses the json file as input along with some global formatting data (i.e. a json version of the $t comment), which would significantly decrease the space cost of the web site without changing the user experience at all.

Igor Ieskov

Feb 15, 2023, 2:09:42 AM
to Metamath
I used a tool I recently developed to generate plain text for a proof. Currently it generates all steps, including floating-hypothesis steps, but it could be changed to generate only essential ones. The plain-text format could also be changed, and saving in JSON could be implemented too. One big question is whether this tool is reliable enough to generate correct output. So, just letting you know about this alternative approach.

The source code:

Example output:
--- TBL pm5.74 ---------------------------------------------------------------------------
1: hyp: wph | wff ph
2: hyp: wps | wff ps
3: hyp: wch | wff ch
4: 2, 3 wb | wff ( ps <-> ch )
5: 1, 4 wi | wff ( ph -> ( ps <-> ch ) )
6: 1, 2 wi | wff ( ph -> ps )
7: 1, 3 wi | wff ( ph -> ch )
8: 6, 7 wb | wff ( ( ph -> ps ) <-> ( ph -> ch ) )
9: 2, 3 biimp | |- ( ( ps <-> ch ) -> ( ps -> ch ) )
10: 4, 2, 3, 1, 9 imim3i | |- ( ( ph -> ( ps <-> ch ) ) -> ( ( ph -> ps ) -> ( ph -> ch ) ) )
11: 2, 3 biimpr | |- ( ( ps <-> ch ) -> ( ch -> ps ) )
12: 4, 3, 2, 1, 11 imim3i | |- ( ( ph -> ( ps <-> ch ) ) -> ( ( ph -> ch ) -> ( ph -> ps ) ) )
13: 5, 6, 7, 10, 12 impbid | |- ( ( ph -> ( ps <-> ch ) ) -> ( ( ph -> ps ) <-> ( ph -> ch ) ) )
14: 6, 7 biimp | |- ( ( ( ph -> ps ) <-> ( ph -> ch ) ) -> ( ( ph -> ps ) -> ( ph -> ch ) ) )
15: 8, 1, 2, 3, 14 pm2.86d | |- ( ( ( ph -> ps ) <-> ( ph -> ch ) ) -> ( ph -> ( ps -> ch ) ) )
16: 6, 7 biimpr | |- ( ( ( ph -> ps ) <-> ( ph -> ch ) ) -> ( ( ph -> ch ) -> ( ph -> ps ) ) )
17: 8, 1, 3, 2, 16 pm2.86d | |- ( ( ( ph -> ps ) <-> ( ph -> ch ) ) -> ( ph -> ( ch -> ps ) ) )
18: 8, 1, 2, 3, 15, 17 impbidd | |- ( ( ( ph -> ps ) <-> ( ph -> ch ) ) -> ( ph -> ( ps <-> ch ) ) )
19: 5, 8, 13, 18 impbii | |- ( ( ph -> ( ps <-> ch ) ) <-> ( ( ph -> ps ) <-> ( ph -> ch ) ) )
-----------------------------------------------------------------------------------

Igor

David Wheeler

Feb 15, 2023, 1:04:49 PM
to metamath

2023-02-14 21:57 GMT-05:00 "Mario Carneiro" <di....@gmail.com>:
> I think a json version of the web pages would make a lot more sense than a
> plain text version, if the intent is to support machine consumption by
> tools that can't parse metamath already. There are even uses for that in
> the web pages themselves, since an interactive version of the web pages
> would like to have the underlying data in a more easily processed form so
> that it can do things with it. In fact, we could even eliminate the
> mpeuni/mpegif pages entirely and replace them with a on-the-spot client
> side renderer which uses the json file as input along with some global
> formatting data (i.e. a json version of the $t comment), which would
> significantly decrease the space cost of the web site without changing the
> user experience at all.

I can see value in generating JSON formats, and it wouldn't be that hard.

I do see big downsides for the use cases listed here though:
1. For training a GPT, I fear that the system would focus on "it's a JSON file!"
and any tool-specific training would be swamped by thinking it's the same as other JSON files.
I think a GPT system that trains on the "Internet as a whole" is more likely to
generate some useful information if we minimize distractors.
In short, make it REALLY OBVIOUS that this text is regular in some way, yet
clearly different from anything else, and expose structure where we reasonably can.

2. For rendering on the client side, there are big advantages to showing static data
as straight HTML. For one, people who disable JavaScript can still see it fine.
Sure, sometimes it's useful to run arbitrary code sent by strangers, but there are good
reasons to minimize that as a requirement :-). It also means that search engines see
"what we actually have".

That said, I can see JSON being quite useful, but that still leaves generating a GPT-friendly
text file as useful.

--- David A. Wheeler

Igor Ieskov

Feb 15, 2023, 2:44:02 PM
to Metamath
Unfortunately, I found bugs in the tool I mentioned in my previous post. I tested it by generating a table proof, then generating a compressed proof from that table proof, and then generating a second table proof from the compressed proof. Ideally those two table proofs should be identical, but there are discrepancies for some assertions. I will try to fix the bugs and let you know once the tool is ready (if this topic is still relevant at that time :) ). Sorry for posting before validating.
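
(For reference, the round-trip check can be expressed in a few lines of Python; to_table_proof and to_compressed_proof below are hypothetical stand-ins for the tool's actual converters:

def round_trip_ok(compressed, to_table_proof, to_compressed_proof):
    # Table -> compressed -> table should reproduce the same table proof.
    first = to_table_proof(compressed)
    second = to_table_proof(to_compressed_proof(first))
    return first == second
)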

-
Igor

David A. Wheeler

Feb 15, 2023, 7:59:19 PM
to Metamath Mailing List


> On Feb 14, 2023, at 1:46 PM, Jon P <drjonp...@gmail.com> wrote:
>
> I think that's a great idea David! Seems like very low downside and it would be interesting to see if language models could train on the datasets.
>
> There's already an option to Customise GPT3 with additional datasets so once this is ready it could be an interesting experiment to try.
>
> https://openai.com/blog/customized-gpt-3/

Thanks so much for the link! That *is* interesting. That links to some tips:
https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset
Among their tips:
• "Aim for at least ~500 examples".
We certainly have that!!
• "Ensure that the prompt + completion doesn't exceed 2048 tokens, including the separator".
That will exclude some of our proofs. However, if I use axtgcgrrflx as an example,
its goal (with 6 hypotheses) and 25-step proof only needs 942 words if I filter
it the way I proposed (and I could shave off a few more words).
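
For the record, that guide expects a JSONL file of prompt/completion pairs. Here's a rough, untested sketch of building one from per-proof text files like those proposed above (the file layout, "###" stop sequence, and directory name are my assumptions):

import json
import pathlib

with open('set_mm_finetune.jsonl', 'w') as out:
    for path in sorted(pathlib.Path('mpetxt').glob('*.txt')):
        header, sep, proof = path.read_text().partition('PROOF:')
        if not sep:
            continue  # skip files with no proof section
        record = {
            'prompt': header + 'PROOF:\n',                # statement + hypotheses
            'completion': ' ' + proof.strip() + '\n###',  # "###" as a stop sequence
        }
        # A real script would also drop examples over the 2048-token limit.
        out.write(json.dumps(record) + '\n')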

So I guess we don't need to wait for someone to scan the web.
We will still need to figure out how to rig the examples to maximize the training.
That's way more black art than science. A few links:
https://www.leewayhertz.com/build-a-gpt-model/
https://research.aimultiple.com/large-language-model-training/
https://www.datacamp.com/blog/a-beginners-guide-to-gpt-3

If you're not familiar with large language models (LLMs), a helpful explanation is
"Talking About Large Language Models" <https://arxiv.org/abs/2212.03551>.
Fundamentally, these systems look at previous words & statistically
predict likely continuations; with large datasets & parameters, sometimes they can do that well.

OpenAI makes it easy to start with their huge GPT-3 model, though it costs money
(I can't get any free credits because I already used some for work).
Kaggle.com lets you run smaller models for free, but smaller models tend to be much less capable.

If anyone has thoughts about applying ML generative text models to Metamath proofs
I'd love to hear them.

--- David A. Wheeler

Johnathan Mercer

Feb 15, 2023, 8:56:36 PM
to Metamath
"If anyone has thoughts about applying ML generative text models to Metamath proofs
I'd love to hear them."

Hi David et al, 
I'm completing a CS PhD and my dissertation is on neural language models and automated/interactive theorem proving. I thought the mm community had already worked closely with OpenAI (see papers below) on GPT-f?

I've copied and filtered my ATP/ITP Paper DB for public viewing if you want to see a comprehensive list/metadata of papers across various formalization efforts (scroll down past datasets in the gallery). Here is a link!


Please let me know if you have any questions or want to chat. I'm happy to finally be able to contribute to the conversation.

Best,
John
 

David A. Wheeler

Feb 15, 2023, 9:13:50 PM
to Metamath
On Wednesday, February 15, 2023 at 8:56:36 PM UTC-5 Johnathan Mercer wrote:
> "If anyone has thoughts about applying ML generative text models to Metamath proofs
> I'd love to hear them."
>
> Hi David et al,
> I'm completing a CS PhD and my dissertation is on neural language models and automated/interactive theorem proving. I thought the mm community had already worked closely with OpenAI (see papers below) on GPT-f?

We worked with them, but the work was their own, and there's no reason we can't do more :-).

What I have in mind is a little different - using a large GPT directly. Also, I'm not worried if some of the answers are wrong; they wanted to see if it could generate *entire* proofs entirely automatically. I think it'd be perfectly reasonable to have a system that *tried* to create a proof, and even if it fails, provide some insight that would help a human create the proof.

--- David A. Wheeler

David A. Wheeler

Feb 16, 2023, 10:05:01 AM
to Metamath Mailing List

> On Feb 14, 2023, at 1:46 PM, Jon P <drjonp...@gmail.com> wrote:
>
> I think that's a great idea David! Seems like very low downside and it would be interesting to see if language models could train on the datasets.
>
> There's already an option to Customise GPT3 with additional datasets so once this is ready it could be an interesting experiment to try.
>
> https://openai.com/blog/customized-gpt-3/

Thanks so much for that link. So I looked into creating a specially-trained model
to create metamath proofs. That page points to this pricing model for GPT-3:
https://openai.com/api/pricing/#faq-token

I did a quick estimate of training costs for the top GPT-3 model on
all of set.mm. It comes out to about $2600.
That's less than I expected, and that's easily affordable in a grant (you could train multiple times),
but it's not an amount I have pocket change for :-(.
Inference isn't expensive; trying to generate axtgcgrrflx is $0.15 (15 cents).
See below for how I created the quick estimate.
We could do a subset of set.mm (e.g., skip mathboxes, etc.), though fewer training examples
would typically produce worse results.

I can't possibly afford that myself. That said, there are alternatives. There are many other
Generative Pre-trained Transformer (GPT) models that are less capable but
also less expensive. My current personal computer is useless
for ML training, but Kaggle.com provides a nice Jupyter notebook interface for AI/ML
work that's free to use. Here are some pages that discuss using GPT-2 and Kaggle:
https://www.kaggle.com/code/changyeop/how-to-fine-tune-gpt-2-for-beginners
https://www.kaggle.com/code/nulldata/fine-tuning-gpt-2-to-generate-netlfix-descriptions/notebook
https://www.kaggle.com/general/116387

I had also been looking at adding definition checks to metamath-knife, and maybe trying to
help this new JavaScript metamath proof assistant. Sigh, not enough time in the day :-).

--- David A. Wheeler



=== Cost estimate ===

Here's a cost estimate using https://openai.com/api/pricing/

OpenAI's most powerful GPT-3 model is Davinci. Their charges are:
* Fine-tuned training: $0.0300 / 1K tokens,
* Usage: $0.1200 / 1K tokens.

I did a quick estimate of tokens in the entirety of set.mm using:
metamath 'set scroll continuous' 'set width 9999' 'read set.mm' 'show statement *' 'show proof * /lemmon /renumber' 'quit' | grep -v -- '-------' | wc
That yielded 64218080 words. A token is about 4 characters or 0.75 words, so for a quick estimate, tokens = 64218080/0.75 = 85624107 tokens.
Fine-tuned training of this entire set = (85624106/1000)*$0.03 = $2569.

Theorem axtgcgrrflx : 942 words, tokens = 942/0.75 = 1256, cost = (1256/1000)*$0.12 = $0.15
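
The same estimate as a few lines of Python, for anyone who wants to rerun it with different prices or token counts:

words = 64218080                      # word count from the metamath command above
tokens = words / 0.75                 # ~4 chars or ~0.75 words per token
print(round(tokens))                  # -> 85624107 tokens
print(round(tokens / 1000 * 0.03))    # fine-tune cost at $0.03/1K tokens -> 2569
axtg_tokens = 942 / 0.75              # axtgcgrrflx -> 1256 tokens
print(round(axtg_tokens / 1000 * 0.12, 2))  # usage at $0.12/1K tokens -> 0.15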

Jon P

Feb 16, 2023, 6:36:47 PM
to Metamath
Mmm yeah, interesting. I agree $2600, and then more for inference, is quite a lot. I think your original plan might be quite a good one: if multiple groups are going to scrape the web for training data, then we might be able to get into a bunch of different models.

This is an interesting conversation with ChatGPT. It seems it already knows quite a lot.

Me: 
Tell me what you think this is


Metamath proof of pm5.74 in database set.mm
pm5.74 $p |- ( ( ph -> ( ps <-> ch ) ) <-> ( ( ph -> ps ) <-> ( ph -> ch ) ) ) $= ... $.

PROOF:
1 biimp $p |- ( ( ps <-> ch ) -> ( ps -> ch ) )
[I pasted in the whole proof but omit it here for brevity]

ChatGPT:
This is a formal proof in the Metamath language for the statement pm5.74, which asserts that the formula

((ph -> (ps <-> ch)) <-> ((ph -> ps) <-> (ph -> ch)))

is a theorem, where ph, ps, and ch are propositional variables.

The proof appears to be a sequence of logical deductions using various rules of inference, including bi-implication introduction (biimp), implication introduction (imim3i), bidirectional implication elimination (impbid), and the modus ponens rule (pm). The proof involves manipulating the logical connectives and truth tables of the involved propositions until the desired statement is proven.

The proof is based on the set.mm database, which is a large collection of mathematical theorems and their formal proofs in the Metamath language.

Me: 
is this database part of your training data?

ChatGPT:
As an AI language model, I do not have access to information about the specific training data used to train me. However, I am aware that the set.mm database is a public database of formal mathematical proofs in the Metamath language, and it is often used as a benchmark for automated theorem proving systems. Some researchers have used the set.mm database as part of their training data for developing automated theorem provers or natural language understanding systems, but I cannot say for certain whether it was used as part of my own training data.

Me: 
Please take the following partial proof and continue it for another line


Metamath proof of axtgcgrrflx in database set.mm
axtrkg.p $e |- P = ( Base ` G ) $.
axtrkg.d $e |- .- = ( dist ` G ) $.
axtrkg.i $e |- I = ( Itv ` G ) $.
axtrkg.g $e |- ( ph -> G e. TarskiG ) $.
axtgcgrrflx.1 $e |- ( ph -> X e. P ) $.
axtgcgrrflx.2 $e |- ( ph -> Y e. P ) $.
axtgcgrrflx $p |- ( ph -> ( X .- Y ) = ( Y .- X ) ) $= ... $.

PROOF:
1 df-trkg $a |- TarskiG = ( ( TarskiGC i^i TarskiGB ) i^i ( TarskiGCB i^i { f | [. ( Base ` f ) / p ]. [. ( Itv ` f ) / i ]. ( LineG ` f ) = ( x e. p , y e. ( p \ { x } ) |-> { z e. p | ( z e. ( x i y ) \/ x e. ( z i y ) \/ y e. ( x i z ) ) } ) } ) )
2 inss1 $p |- ( ( TarskiGC i^i TarskiGB ) i^i ( TarskiGCB i^i { f | [. ( Base ` f ) / p ]. [. ( Itv ` f ) / i ]. ( LineG ` f ) = ( x e. p , y e. ( p \ { x } ) |-> { z e. p | ( z e. ( x i y ) \/ x e. ( z i y ) \/ y e. ( x i z ) ) } ) } ) ) C_ ( TarskiGC i^i TarskiGB )
3 inss1 $p |- ( TarskiGC i^i TarskiGB ) C_ TarskiGC
4 2,3 sstri $p |- ( ( TarskiGC i^i TarskiGB ) i^i ( TarskiGCB i^i { f | [. ( Base ` f ) / p ]. [. ( Itv ` f ) / i ]. ( LineG ` f ) = ( x e. p , y e. ( p \ { x } ) |-> { z e. p | ( z e. ( x i y ) \/ x e. ( z i y ) \/ y e. ( x i z ) ) } ) } ) ) C_ TarskiGC
5 1,4 eqsstri $p |- TarskiG C_ TarskiGC

ChatGPT:
6 1,5 eqtrans $p |- TarskiG = TarskiGC $
7 subst $e |- ( ( ph -> G e. TarskiGC ) -> ( ph -> G e. TarskiG ) ) $
8 hyp $e, ph |- ph $
9, 10 mp $e |- ( ph -> G e. TarskiG ) $
11 7, 9 mp $e |- ( ph -> G e. TarskiGC ) $
12 eqtrans $p |- ( ph -> G e. TarskiGC ) = ( ph -> G e. ( ( TarskiGC i^i TarskiGB ) i^i ( TarskiGCB i^i { f | [. ( Base f ) / p ]. [. ( Itv f ) / i ]. ( LineG ` f ) = ( x e. p , y e. ( p \ { x } ) |-> { z e. p | ( z e. ( x i y ) / x e. ( z i y ) / y e. ( x i z ) ) } ) } ) ) ) $
13 6, 11, 12 impbii $p |- ( ph -> ( X .- Y ) = ( Y .- X ) ) $

Me:
how many theorems are proven in the metamath database

ChatGPT:
As of the latest update to the Metamath database on January 24, 2022, there are a total of 272,443 theorems that have been proven. However, it's worth noting that this number may change as new theorems are added to the database over time.

----------------

So yeah, I haven't tried it on any partial proofs and then tried to run the output in mmj2 or something, but it might be interesting to see if it can already do any more steps. It already has the structure pretty well organised.

It also gives a clue of when it scanned the database, though that number of theorems is wildly wrong.

David A. Wheeler

Feb 16, 2023, 7:25:24 PM
to Metamath Mailing List

> On Feb 16, 2023, at 6:36 PM, Jon P <drjonp...@gmail.com> wrote:
>
> Mmm yeah interesting, I agree $2600 and then more for inference is quite a lot. I think your original plan might be quite a good one as if multiple groups are going to scrape the web for training data then we might be able to get into a bunch of different models.
>
> This is an interesting conversation with chatGPT. Seems it already knows quite a lot.

That is *utterly* fascinating. Especially since we certainly didn't present data in a format that's easy to train on.
Many of its general-purpose text answers are right, too.

> ChatGPT:
> 6 1,5 eqtrans $p |- TarskiG = TarskiGC $
> 7 subst $e |- ( ( ph -> G e. TarskiGC ) -> ( ph -> G e. TarskiG ) ) $
> 8 hyp $e, ph |- ph $
> 9, 10 mp $e |- ( ph -> G e. TarskiG ) $
> 11 7, 9 mp $e |- ( ph -> G e. TarskiGC ) $
> 12 eqtrans $p |- ( ph -> G e. TarskiGC ) = ( ph -> G e. ( ( TarskiGC i^i TarskiGB ) i^i ( TarskiGCB i^i { f | [. ( Base f ) / p ]. [. ( Itv f ) / i ]. ( LineG ` f ) = ( x e. p , y e. ( p \ { x } ) |-> { z e. p | ( z e. ( x i y ) / x e. ( z i y ) / y e. ( x i z ) ) } ) } ) ) ) $
> 13 6, 11, 12 impbii $p |- ( ph -> ( X .- Y ) = ( Y .- X ) ) $

I compared that to the actual continuation of axtgcgrrflx (below). ChatGPT does manage to look similar in terms of format. It doesn't manage to figure out much in the way of the actual steps (neither which statement to use nor the resulting statement), which is too bad, because that makes it not useful even as a proof-hinting assistant.

That said, it's not particularly trained on it, and the public-available data certainly doesn't make it easy to train on. It's possible a tool specifically trained on the dataset would do much better.


> Me:
> how many theorems are proven in the metamath database
>
> ChatGPT:
> As of the latest update to the Metamath database on January 24, 2022, there are a total of 272,443 theorems that have been proven. However, it's worth noting that this number may change as new theorems are added to the database over time.

Looking at just set.mm, it has 203588 statements; 2716 are $a and 40627 are $p.
Once you add the other databases (especially iset.mm), maybe there are around 270,000 statements
across all metamath databases. So perhaps it's confusing "theorem" with "statement".

--- David A. Wheeler

========

Here's the actual current proof (ChatGPT was given 1-5):
1 df-trkg $a |- TarskiG = ( ( TarskiGC i^i TarskiGB ) i^i ( TarskiGCB i^i { f | [. ( Base ` f ) / p ]. [. ( Itv ` f ) / i ]. ( LineG ` f ) = ( x e. p , y e. ( p \ { x } ) |-> { z e. p | ( z e. ( x i y ) \/ x e. ( z i y ) \/ y e. ( x i z ) ) } ) } ) )
2 inss1 $p |- ( ( TarskiGC i^i TarskiGB ) i^i ( TarskiGCB i^i { f | [. ( Base ` f ) / p ]. [. ( Itv ` f ) / i ]. ( LineG ` f ) = ( x e. p , y e. ( p \ { x } ) |-> { z e. p | ( z e. ( x i y ) \/ x e. ( z i y ) \/ y e. ( x i z ) ) } ) } ) ) C_ ( TarskiGC i^i TarskiGB )
3 inss1 $p |- ( TarskiGC i^i TarskiGB ) C_ TarskiGC
4 2,3 sstri $p |- ( ( TarskiGC i^i TarskiGB ) i^i ( TarskiGCB i^i { f | [. ( Base ` f ) / p ]. [. ( Itv ` f ) / i ]. ( LineG ` f ) = ( x e. p , y e. ( p \ { x } ) |-> { z e. p | ( z e. ( x i y ) \/ x e. ( z i y ) \/ y e. ( x i z ) ) } ) } ) ) C_ TarskiGC
5 1,4 eqsstri $p |- TarskiG C_ TarskiGC
6 axtrkg.g $e |- ( ph -> G e. TarskiG )
7 5,6 sseldi $p |- ( ph -> G e. TarskiGC )
8 axtrkg.p $e |- P = ( Base ` G )
9 axtrkg.d $e |- .- = ( dist ` G )
10 axtrkg.i $e |- I = ( Itv ` G )
11 8,9,10 istrkgc $p |- ( G e. TarskiGC <-> ( G e. _V /\ ( A. x e. P A. y e. P ( x .- y ) = ( y .- x ) /\ A. x e. P A. y e. P A. z e. P ( ( x .- y ) = ( z .- z ) -> x = y ) ) ) )
12 11 simprbi $p |- ( G e. TarskiGC -> ( A. x e. P A. y e. P ( x .- y ) = ( y .- x ) /\ A. x e. P A. y e. P A. z e. P ( ( x .- y ) = ( z .- z ) -> x = y ) ) )
13 12 simpld $p |- ( G e. TarskiGC -> A. x e. P A. y e. P ( x .- y ) = ( y .- x ) )
14 7,13 syl $p |- ( ph -> A. x e. P A. y e. P ( x .- y ) = ( y .- x ) )
15 axtgcgrrflx.1 $e |- ( ph -> X e. P )
16 axtgcgrrflx.2 $e |- ( ph -> Y e. P )
17 oveq1 $p |- ( x = X -> ( x .- y ) = ( X .- y ) )
18 oveq2 $p |- ( x = X -> ( y .- x ) = ( y .- X ) )
19 17,18 eqeq12d $p |- ( x = X -> ( ( x .- y ) = ( y .- x ) <-> ( X .- y ) = ( y .- X ) ) )
20 oveq2 $p |- ( y = Y -> ( X .- y ) = ( X .- Y ) )
21 oveq1 $p |- ( y = Y -> ( y .- X ) = ( Y .- X ) )
22 20,21 eqeq12d $p |- ( y = Y -> ( ( X .- y ) = ( y .- X ) <-> ( X .- Y ) = ( Y .- X ) ) )
23 19,22 rspc2v $p |- ( ( X e. P /\ Y e. P ) -> ( A. x e. P A. y e. P ( x .- y ) = ( y .- x ) -> ( X .- Y ) = ( Y .- X ) ) )
24 15,16,23 syl2anc $p |- ( ph -> ( A. x e. P A. y e. P ( x .- y ) = ( y .- x ) -> ( X .- Y ) = ( Y .- X ) ) )
25 14,24 mpd $p |- ( ph -> ( X .- Y ) = ( Y .- X ) )


Auguste Poiroux

Feb 17, 2023, 4:27:24 AM
to Metamath
Hi everyone!

I am very interested in this discussion topic too. There is an organization currently working on something highly related that is starting to release open-source datasets/models on HuggingFace: https://huggingface.co/hoskinson-center. They provide a math dataset and a first model with 6.7B parameters trained on it. Their math dataset includes several proof databases from different formal systems, including Metamath. They used this code: https://github.com/zhangir-azerbayev/mm-extract to extract a "human readable" form of the proofs. This seems to be close to the format you propose, David.
I think the model they provide can be used to make proof suggestions. I don't know what it's worth though; I just stumbled across it. It may not be the best, but I think it can be a really good start.
The work looks very recent; maybe they will publish some evaluation results of their model soon.

Auguste

Mario Carneiro

Feb 17, 2023, 12:49:16 PM
to meta...@googlegroups.com
> I do see big downsides for the use cases listed here though:
> 1. For training a GPT, I fear that the system would focus on "it's a JSON file!"
> and any tool-specific training would be swamped by thinking it's the same as other JSON files.
> I think a GPT system that trains on the "Internet as a whole" is more likely to
> generate some useful information if we minimize distractors.
> In short, make it REALLY OBVIOUS that this text is regular in some way, yet
> clearly different from anything else, and expose structure where we reasonably can.

I think this is looking at it the wrong way. Stuff trained on the entire internet is going to have a clear upper bound on the precision of the answers it can give. I'm not really interested in supplying a poor quality data source just for the purpose of a speculative application like this. For projects that are actually interested in doing ML specifically on metamath, it is very normal to have some data wrangling by the researchers beforehand to prepare things for the model. This kind of glue code is usually some ad-hoc python script so it helps if we are providing a data source that is easily manipulable. (I'm working on a paper with Josef Urban regarding ATP for metamath and it benefits significantly from metamath-knife as a way of getting information from mm files, but it's not always possible to use mm-knife, and if we are going for something more easily digestible from many languages without a dedicated library, JSON is a strong choice.)


> 2. For rendering on the client side, there are big advantages to showing static data
> as straight HTML. For one, people who disable JavaScript can still see it fine.
> Sure, sometimes it's useful to run arbitrary code sent by strangers, but there are good
> reasons to minimize that as a requirement :-). It also means that search engines see
> "what we actually have".
 
While I have some sympathy for this argument, I also think it is holding us back a lot. Catering to the least common denominator of no-JS means that our web pages are stuck permanently in web 1.0, with no search functionality, intra-page links, subproof expansion, mathml rendering, go-to-definition for symbols, or a laundry list of other features. Furthermore, doing the rendering client-side means that the cost of transmission is lower which means shorter load times and no need for a two-hour processing phase. (Even without making the web pages dynamic I've been considering generating the web pages on-demand and using nginx only as a cache. I already have a prototype that can serve HTML directly from the .mm file, no preprocessing necessary, and that means we don't have to worry about slow updates and stuff going down due to issues that occurred during the massive install script.)



Cris Perdue

Feb 17, 2023, 3:52:35 PM
to meta...@googlegroups.com
This discussion is bringing up some interesting topics, including other efforts to apply ML to theorem proving. All very interesting to see.

About serving JSON with Javascript rendering vs. serving straight HTML --

First I want to say that Mario's comments are persuasive as well as interesting from my point of view.

Next, this discussion seems to be rapidly morphing into "JSON versus HTML" rather than "JSON versus plain text", especially given that others are already generating plain text from set.mm for machine learning purposes. Yes?

With that in mind --

Is there much reason not to create JSON in addition to HTML? If disk space is a concern, consider that it can be a win-win to store precompressed pages (i.e. with gzip). You see, the lion's share of requests are for compressed data anyway, and if it is precompressed on the server, there is actually less processing to be done when serving the page that way. This article shows how it can be accomplished with an nginx server like ours in a handful of lines of shell scripting and server configuration: https://blog.llandsmeer.com/tech/2019/08/29/precompression.html.
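
For example, a small script could pregenerate the compressed copies, and nginx's standard gzip_static directive would then serve them; here is a sketch in Python, with the directory name assumed:

import gzip
import pathlib
import shutil

for page in pathlib.Path('mpeuni').glob('*.html'):
    # Write page.html.gz next to page.html; nginx (with "gzip_static on;")
    # then serves the precompressed copy to clients that accept gzip.
    with open(page, 'rb') as src:
        with gzip.open(str(page) + '.gz', 'wb', compresslevel=9) as dst:
            shutil.copyfileobj(src, dst)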

And for what it's worth, about users turning off JavaScript --

It is a tiny percentage of users who turn off JavaScript. Numbers around 1% have been thrown around, but for more careful analysis see for example this blog posting from gov.uk -- https://gds.blog.gov.uk/2013/10/21/how-many-people-are-missing-out-on-javascript-enhancement/. They found that out of around 500,000 visits to the gov.uk home page, about 1,000 loaded a "noscript" image, indicating that JavaScript was turned off or not supported. Around 4,000 visits loaded neither that one nor one served when JavaScript was turned on. The posting and comments discuss various reasons why this might have occurred, including a suggestion that numerous visitors "have disabled image-loading in their mobile browsers", which would also be an interesting scenario for the metamath site if true.

Besides, for those who care enough to visit a site like us.metamath.org, why not just turn JavaScript back on, anyway?

This next may not wind up being a big concern, but about search engines indexing sites in which a lot of the HTML is generated by JavaScript -- I know that for some years Google Search has been pre-rendering pages during crawl, apparently not to the pixel level, but to the document structure (DOM tree) level, which would be the important one for concerns about use of JSON. And Google apparently is not alone in this. Here is a fairly detailed article on the subject, somewhat focused on Google - https://www.seoquake.com/blog/render-pages/.

Finally, I find it hard to overstate the potential that can be unlocked by providing suitable (JSON-format) data describing proofs to Web browsers. Browsers these days are really user interface engines, not just HTML displayers.

-Cris

Benoit

Feb 18, 2023, 8:40:23 AM
to Metamath
On Monday, February 13, 2023 at 7:20:03 PM UTC+1 David A. Wheeler wrote:
> All: I'm thinking it might be a good idea to create a new "database area",
> "mpetxt" (to match "mpeuni"/"mpegif"), with many simple text files - 1 per proof.

So your idea is to replace the term "mpeascii" from https://groups.google.com/g/metamath/c/_c3viH80_w8/m/i1PQSRJJBAAJ with "mpetxt"?  I'm fine with that.

Benoît
 

David A. Wheeler

Feb 18, 2023, 1:09:50 PM
to Metamath Mailing List

David A. Wheeler:
> I do see big downsides for the use cases listed here though:
> 1. For training a GPT, I fear that the system would focus on "it's a JSON file!" ....

> On Feb 17, 2023, at 12:49 PM, Mario Carneiro <di....@gmail.com> wrote:
> I think this is looking at it the wrong way. Stuff trained on the entire internet is going to have a clear upper bound on the precision of the answers it can give. I'm not really interested in supplying a poor quality data source just for the purpose of a speculative application like this. For projects that are actually interested in doing ML specifically on metamath, it is very normal to have some data wrangling by the researchers beforehand to prepare things for the model. This kind of glue code is usually some ad-hoc python script so it helps if we are providing a data source that is easily manipulable.

Sure, ML training normally does a lot of wrangling to make clean data. I've done some ML work myself (not on Metamath, sadly). But since my wallet isn't deep, I was speculating that by providing easier-to-train-on text we might get some more value out of the investments others are making. I think it'd be relatively easy to do, and then we could see if it has a benefit.

> (I'm working on a paper with Josef Urban regarding ATP for metamath and it benefits significantly from metamath-knife as a way of getting information from mm files, but it's not always possible to use mm-knife, and if we are going for something more easily digestible from many languages without a dedicated library, JSON is a strong choice.)

David A. Wheeler:
> 2. For rendering on the client side, there are big advantages to showing static data
> as straight HTML. For one, people who disable JavaScript can still see it fine.
> Sure, sometimes it's useful to run arbitrary code sent by strangers, but there are good
> reasons to minimize that as a requirement :-). It also means that search engines see
> "what we actually have".
>

> While I have some sympathy for this argument, I also think it is holding us back a lot. Catering to the least common denominator of no-JS means that our web pages are stuck permanently in web 1.0, with no search functionality, intra-page links, subproof expansion, mathml rendering, go-to-definition for symbols, or a laundry list of other features. Furthermore, doing the rendering client-side means that the cost of transmission is lower which means shorter load times and no need for a two-hour processing phase.

What I want is *a* way for people to access the basic information without *requiring* client-side JavaScript. I see no reason that it be the *only* way we provide the information; we already provide it in multiple forms. I just want to retain that capability. So you don't need to feel held back :-).

Indeed, you can implement rendering with client-side JavaScript today, just download the .mm files. The current CORS setting means that any client-side JavaScript program can do that, it doesn't even need to be hosted on us.metamath.org. So technically people today can view the generated HTML on the website, *or* use any client-side JavaScript/WebAssembly. We "just" need someone to write the client-side renderer.

Also, almost everything you listed doesn't require client-side JavaScript. Intra-page links work (they're just hyperlinks), subproof expansion works (use HTML <details>), MathML ships in Chromium & Chrome & Firefox (Firefox's support is only partial but it's probably more than enough for us) per <https://developer.mozilla.org/en-US/docs/Web/MathML>, go-to-definition is just another hyperlink, and so on. Basic searches can be supported via a form that queries a search engine (e.g., Google).

As far as speed goes, JavaScript is often slower than HTML where both can be used. In practice JavaScript files are often quite large, and JavaScript has to be parsed, examined, executed, and then its execution must cause document tree changes which finally get rendered. Web browsers are *really* good at rapidly & progressively rendering HTML. On systems I've worked with, HTML (even with server roundtrips) is often *MUCH* faster when measuring paint-to-screen times. JavaScript *has* gotten faster over the years, thankfully, but the gap is still large. Of course, when you need functionalities only provided by JavaScript then you need it. Like all technologies there are tradeoffs.

> On Feb 17, 2023, at 3:52 PM, Cris Perdue <cr...@perdues.com> wrote:

> It is a tiny percentage of users who turn off JavaScript.


I'm one of those people who sometimes turns off JavaScript, and many sites work fine. The more you learn about security, the more worrying it gets to "run all code sent to your web browser from attackers", especially since all browsers' JavaScript engines are large (& thus hard to review & have endless vulnerabilities), and websites often have no idea what JavaScript they're sending to you (much comes from ad brokers from other ad brokers from anyone who wants to make someone run code they created). Let's try to support users who are trying to protect themselves.

> Is there much reason not to create JSON in addition to HTML? If disk space is a concern, consider that it can be a win-win to store precompressed pages (i.e. with gzip).


No *big* reason. Indeed, if we want to make it *easy* for people to generate their own renderers or analyze Metamath results without specialized parsing, the obvious way to do that would be to generate JSON. I think that'd make a lot of sense. That still involves a daily pre-run, but I don't think there's any serious problem in doing 1/day regenerations.

We'd have to figure out specifics, which means we need to figure out how they'd be used. If we generate a separate JSON file for each theorem, then it's really easy for a renderer to get just what it needs. However, that's a lot of files, and compression might not help our storage needs. The filesystem block size is 4KiB, so anything under 4KiB is 4KiB storage. We could generate a single massive JSON file of all theorems; that might be useful for ML training if they plan to process the JSON further, but the file would be huge so I think a client-side program would be better off loading the .mm database directly instead. A massive JSON file might be useful for other purposes, and that *would* be best stored compressed.

We currently don't have *lots* of extra storage space. If we add a major new set of generated data it's likely we'll need more space. We could retire mpegif, add more storage ($ that currently I pay, but maybe others could chip in), or look for another provider ($?).

What started this thread was trying to see if we could get "free" training on GPT-3 by presenting something easier for them to consume, *without* controlling them. I don't think generating JSON will do that. The JSON certainly could make it easier for organizations who want to specially train ML systems on the existing databases, and are willing to do "a little cleaning" by processing the JSON. If we pregenerate JSON from the .mm databases, the most likely use cases would be to make it easier for rendering tools & ML labs that want to do a little (but not a lot) of work.

--- David A. Wheeler

Mario Carneiro

Feb 18, 2023, 4:08:35 PM
to meta...@googlegroups.com
>> While I have some sympathy for this argument, I also think it is holding us back a lot. Catering to the least common denominator of no-JS means that our web pages are stuck permanently in web 1.0, with no search functionality, intra-page links, subproof expansion, mathml rendering, go-to-definition for symbols, or a laundry list of other features. Furthermore, doing the rendering client-side means that the cost of transmission is lower which means shorter load times and no need for a two-hour processing phase.
>
> What I want is *a* way for people to access the basic information without *requiring* client-side JavaScript. I see no reason that it be the *only* way we provide the information; we already provide it in multiple forms. I just want to retain that capability. So you don't need to feel held back :-).

In case it wasn't clear (and it probably wasn't), in a future world in which the default mechanism is some super fancy client side renderer it is still possible to offer a fallback "static site" of HTML pages which is nevertheless not generated in advance and is being served dynamically server-side. This has a relatively small generation cost even if we don't stick a CDN in front of it, but if we do then I think that's a viable approach to having all the HTML pages with basically zero preprocessing, just a smarter server. I would really love it if the only thing the site build needs to do is download and load the latest set.mm .
 
> Indeed, you can implement rendering with client-side JavaScript today, just download the .mm files.
 
I don't think this is a very competitive option though, since you would have to load all of set.mm. The advantage of pre-processed json files is that when you have a single theorem you want to look at you don't need to request everything and start up a whole client-side MM verifier first.

> The current CORS setting means that any client-side JavaScript program can do that, it doesn't even need to be hosted on us.metamath.org. So technically people today can view the generated HTML on the website, *or* use any client-side JavaScript/WebAssembly. We "just" need someone to write the client-side renderer.

An in-browser .mm browser and proof assistant is a very cool project, but somewhat distinct from what I'm talking about, which is a replacement for the existing proof docs. That's not to say that we shouldn't have such a thing available as well on metamath.org as a kind of "try it online" mode. But the heavy startup cost will definitely make it feel like a different experience.
 
> Also, almost everything you listed doesn't require client-side JavaScript. Intra-page links work (they're just hyperlinks), subproof expansion works (use HTML <details>), MathML ships in Chromium & Chrome & Firefox (Firefox's support is only partial but it's probably more than enough for us) per <https://developer.mozilla.org/en-US/docs/Web/MathML>, go-to-definition is just another hyperlink, and so on.

While it is true that you can do almost all of that with static HTML, it would make the pages far heavier than they currently are. Load times for large proof pages are already pretty bad and this would make it a lot worse. Not to mention that more text means more pre-processing time during site build.

> Basic searches can be supported via a form that queries a search engine (e.g., Google).

Oof. I'm sure you know that this is not a very good option for a host of reasons. It's basically equivalent to not having a search mechanism at all, since google already indexes the site.
 
> As far as speed goes, JavaScript is often slower than HTML where both can be used. In practice JavaScript files are often quite large, and JavaScript has to be parsed, examined, executed, and then its execution must cause document tree changes which finally get rendered. Web browsers are *really* good at rapidly & progressively rendering HTML. On systems I've worked with, HTML (even with server roundtrips) is often *MUCH* faster when measuring paint-to-screen times. JavaScript *has* gotten faster over the years, thankfully, but the gap is still large. Of course, when you need functionalities only provided by JavaScript then you need it. Like all technologies there are tradeoffs.

I am imagining the main purpose of the JS here is to pump up a basic version of the page into the fully interactive / hyperlinked version of the page, without having to put loads of redundant text in all the tags. There is no need for a heavy JS library here; I did essentially this for the MM0 docs (https://digama0.github.io/mm0/peano/thms/expr.html) and the JS support library for it (https://digama0.github.io/mm0/peano/proof.js) is very light and does not need to come with every page. (There is still more that can be done to make the MM0 docs more interactive but it helps a lot if there is an actual server on the other end rather than just a static site.)
 
>> Is there much reason not to create JSON in addition to HTML? If disk space is a concern, consider that it can be a win-win to store precompressed pages (i.e. with gzip).
>
> No *big* reason. Indeed, if we want to make it *easy* for people to generate their own renderers or analyze Metamath results without specialized parsing, the obvious way to do that would be to generate JSON. I think that'd make a lot of sense. That still involves a daily pre-run, but I don't think there's any serious problem in doing 1/day regenerations.
>
> We'd have to figure out specifics, which means we need to figure out how they'd be used. If we generate a separate JSON file for each theorem, then it's really easy for a renderer to get just what it needs. However, that's a lot of files, and compression might not help our storage needs. The filesystem block size is 4KiB, so anything under 4KiB is 4KiB storage. We could generate a single massive JSON file of all theorems; that might be useful for ML training if they plan to process the JSON further, but the file would be huge so I think a client-side program would be better off loading the .mm database directly instead. A massive JSON file might be useful for other purposes, and that *would* be best stored compressed.
>
> We currently don't have *lots* of extra storage space. If we add a major new set of generated data it's likely we'll need more space. We could retire mpegif, add more storage ($ that currently I pay, but maybe others could chip in), or look for another provider ($?).

Just to say once more: our current setup allows us to run a program on the server, and this is important: it means that we can generate all files on demand and use nginx only for caching so that the cost of generation (which should be a few milliseconds if everything is in memory already) is paid only once. That applies both to the HTML pages as well as any json API endpoints we add.
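
As a toy illustration of that on-demand model, sketched in Python (a real implementation would parse the .mm file into memory at startup; the handler below fakes that part with a stub table):

import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory database: label -> theorem data, parsed once at startup.
THEOREMS = {'pm5.74': {'theorem': 'pm5.74', 'hyps': [], 'proof': []}}

class TheoremHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        data = THEOREMS.get(self.path.strip('/'))  # e.g. GET /pm5.74
        if data is None:
            self.send_error(404)
            return
        body = json.dumps(data).encode()
        self.send_response(200)
        self.send_header('Content-Type', 'application/json')
        self.send_header('Cache-Control', 'public, max-age=86400')  # let nginx cache it
        self.end_headers()
        self.wfile.write(body)

HTTPServer(('127.0.0.1', 8000), TheoremHandler).serve_forever()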
 
> What started this thread was trying to see if we could get "free" training on GPT-3 by presenting something easier for them to consume, *without* controlling them. I don't think generating JSON will do that. The JSON certainly could make it easier for organizations who want to specially train ML systems on the existing databases, and are willing to do "a little cleaning" by processing the JSON. If we pregenerate JSON from the .mm databases, the most likely use cases would be to make it easier for rendering tools & ML labs that want to do a little (but not a lot) of work.

Has anyone asked for this? Will we be able to see the results of the work and know that it was useful? It is very strange to me to spend such a significant fraction of our time, processing power, and space on something that is effectively just intended for web crawlers and with loose constraints on the actual formats involved.

If we really want to use plain text for ML, why not feed the robots .mmp proofs instead? At least that way the stuff they regurgitate might potentially be useful as a poor version of an ATP. (I'm pretty skeptical even of that. I think it is most likely only to be good for a laugh. An ML trained for actual logical correctness needs a much better data source and architecture than just a transformer trained from text on the internet. That's not to say that you can't build such ML models, GPT-f is one of them, but these models work in a much more controlled environment like "guess the next theorem to apply" rather than "write something that looks like a proof".)

David A. Wheeler

Feb 18, 2023, 5:16:51 PM
to Metamath Mailing List


> On Feb 18, 2023, at 4:08 PM, Mario Carneiro <di....@gmail.com> wrote:
> In case it wasn't clear (and it probably wasn't), in a future world in which the default mechanism is some super fancy client side renderer it is still possible to offer a fallback "static site" of HTML pages which is nevertheless not generated in advance and is being served dynamically server-side. This has a relatively small generation cost even if we don't stick a CDN in front of it, but if we do then I think that's a viable approach to having all the HTML pages with basically zero preprocessing, just a smarter server. I would really love it if the only thing the site build needs to do is download and load the latest set.mm .

I understand, that's just a typical dynamically-generated data file and/or HTML with possibly front-end JavaScript. Bog standard stuff. Of course, there are pluses & minuses with that approach. In the 2010s almost everyone switched to that model; today there's a big push back to statically-generated sites where practical (with Jekyll, etc.) because attackers routinely pop systems that do dynamic response.

There are big concerns if you hook up a program written in C to input sent by an attacker. Rigorous input filtering would definitely help, but using something like metamath-knife as a basis instead of metamath-exe might be better. So doing this would involve a nontrivial amount of software, *along* with analysis to ensure that it was unlikely to lead to a vulnerability. Statically generating files is a much simpler path from where we are. That's not to say we must go that way, but it'd take more effort to do full dynamic generation *with* confidence in its security.


> I don't think this is a very competitive option though, since you would have to load all of set.mm.

It's more competitive than you might think. Igor's prover loads the whole database in a few seconds. That said, I agree with:

> The advantage of pre-processed json files is that when you have a single theorem you want to look at you don't need to request everything and start up a whole client-side MM verifier first.


> What started this thread was trying to see if we could get "free" training on GPT-3 by presenting something easier for them to consume, *without* controlling them. I don't think generating JSON will do that. The JSON certainly could make it easier for organizations who want to specially train ML systems on the existing databases, and are willing to do "a little cleaning" by processing the JSON. If we pregenerate JSON from the .mm databases, the most likely use cases would be to make it easier for rendering tools & ML labs that want to do a little (but not a lot) of work.
>
> Has anyone asked for this?

Absolutely not!! I was noticing that Microsoft & Google are really doubling down on scraping the web into GPTs, and I was trying to see if we could take advantage of that.

--- David A. Wheeler

Mario Carneiro

unread,
Feb 18, 2023, 6:04:47 PM2/18/23
to meta...@googlegroups.com
On Sat, Feb 18, 2023 at 5:16 PM David A. Wheeler <dwhe...@dwheeler.com> wrote:


> On Feb 18, 2023, at 4:08 PM, Mario Carneiro <di....@gmail.com> wrote:
> In case it wasn't clear (and it probably wasn't), in a future world in which the default mechanism is some super fancy client side renderer it is still possible to offer a fallback "static site" of HTML pages which is nevertheless not generated in advance and is being served dynamically server-side. This has a relatively small generation cost even if we don't stick a CDN in front of it, but if we do then I think that's a viable approach to having all the HTML pages with basically zero preprocessing, just a smarter server. I would really love it if the only thing the site build needs to do is download and load the latest set.mm .

I understand, that's just a typical dynamically-generated data file and/or HTML with possibly front-end JavaScript. Bog standard stuff. Of course, there are pluses & minuses with that approach. In the 2010s almost everyone switched to that model; today there's a big push back to statically-generated sites where practical (with Jekyll, etc.) because attackers routinely pop systems that do dynamic response.

There are big concerns if you hook up a program written in C to input sent by an attacker. Rigorous input filtering would definitely help, but using something like metamath-knife as a basis instead of metamath-exe might be better. So doing this would involve a nontrivial amount of software, *along* with analysis to ensure that it was unlikely to lead to a vulnerability. Statically generating files is a much simpler path from where we are. That's not to say we must go that way, but it'd take more effort to do full dynamic generation *with* confidence in its security.

While yes I agree that this would be much better done with metamath-knife than metamath-exe, I don't think there is a major concern here, or at least we can mostly mitigate the issues since this flow is extremely light on input from the attacker. Basically the only thing you can specify is the theorem you want to see. But for sure I'm not interested in doing this by building on metamath-exe; there are some great frameworks for writing these kinds of servers in Rust that should already handle most of the issues. The main thing to be concerned about is DoS attacks, and having a 'real' server like nginx in front helps with that.
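
To illustrate how little code such a server needs, here is a minimal sketch assuming the axum crate (0.7-style routing). The handler is a stub for illustration, not mm-web-rs's actual rendering code:

use axum::{extract::Path, response::Html, routing::get, Router};

// Serve one theorem page per request; the only attacker-controlled
// input is the theorem label in the URL.
async fn theorem_page(Path(label): Path<String>) -> Html<String> {
    // A real server would look the label up in a pre-parsed database
    // held in shared state and render the full proof page.
    Html(format!("<h1>Theorem {label}</h1>"))
}

#[tokio::main]
async fn main() {
    let app = Router::new().route("/mpeuni/:label", get(theorem_page));
    let listener = tokio::net::TcpListener::bind("127.0.0.1:8080").await.unwrap();
    axum::serve(listener, app).await.unwrap();
}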

But our threat model is pretty much nonexistent to begin with; almost all the server data and code is public domain in the first place, and even if the server is taken down it's not a critical service. I think a much more critical problem for the project is giving the perception of being old, unfriendly software used by few.

> I don't think this is a very competitive option though, since you would have to load all of set.mm.

It's more competitive than you might think. Igor's prover loads the whole database in a few seconds.

I was talking about the data cost. Even if you have a highly optimized mm verifier, you can't get around the fact that you need to send some 30 MB over the wire before you can do anything. That's big even by modern bloated web page standards. If the user doesn't have a great connection then that might dwarf the cost of processing.

Antony Bartlett

unread,
Feb 19, 2023, 5:43:29 AM2/19/23
to meta...@googlegroups.com
On Sat, Feb 18, 2023 at 11:04 PM Mario Carneiro <di....@gmail.com> wrote:
Even if you have a highly optimized mm verifier, you can't get around the fact that you need to send some 30 MB over the wire before you can do anything.

Just to recap a few things from other threads, back in November, Samuel Goto demonstrated a dynamic (client-side) proof explorer here


It does indeed take some time to load while a set.mm is sent over the wire. Hence the subsequent discussion of "streaming parsers", which would allow for the page to render as soon as the bit of set.mm that we're interested in has loaded rather than waiting for the whole thing. (Or a verifier could be running on whatever part of the file is available, successively as its chunks download.)

(Igor Ieskov's web-based proof assistant loads locally saved files, so doesn't have the same load time problem).

I expect to have a proposal for a JSON format - just a parse tree of a .mm file, really - early next month if not sooner.

and I plan on writing a JSON schema too, but haven't started on that yet.

And I'm not proposing anything until I have a working parser! - after which getting the streaming working may take a little longer.

Whether this proves popular, or whether I turn out to be its only adopter, is impossible to say, but I wouldn't be at all sorry to see the streaming parser superseded by some sort of server-side JSON if that's what you're mulling over.

    Best regards,

        Antony

Mario Carneiro

unread,
Feb 19, 2023, 1:05:43 PM2/19/23
to meta...@googlegroups.com
I don't see those as competing goals at all. They play different roles: for browsing you would ideally have some server supplying only the required information to render a specific proof, and for a proof assistant (and to some extent a search tool as well) you want to have the whole database available, or at least all the signatures of all the theorems and the body for the current theorem. Especially when you consider how these things scale asymptotically, it is clear that the dependencies for one proof will be a lot less than the entirety of the file, and a streaming parser can only save you a factor of 2 if we assume that theorems of interest are sampled uniformly. For example, I don't think Wikipedia is sending a compressed archive of the whole project on every page load - it only sends the data for the current article, as well as server-side json blurbs for link hovers (https://en.wikipedia.org/wiki/Wikipedia:Tools/Navigation_popups). But when someone clicks the "start proof mode" button they will have an expectation that there will be a startup cost, so it is easy to justify loading set.mm and a streaming parser still helps for making that experience better.


Mario Carneiro

unread,
Feb 19, 2023, 1:08:19 PM2/19/23
to meta...@googlegroups.com
Oops, that was the wrong link, I meant to point to https://www.mediawiki.org/wiki/Page_Previews . (I investigated how this feature is implemented a while ago, with an eye for doing something similar for theorem references.)

David A. Wheeler

unread,
Feb 19, 2023, 2:19:15 PM2/19/23
to Metamath Mailing List
FYI: Our discussion topic seems to have broadened to discussing potential future directions for the website, so I've changed the subject line to "Future website directions". Its subject line was previously: "Re: [Metamath] Idea: Release ".txt" proofs for set.mm databases ("mpetxt") for AI/ML tools, esp Generative Pre-trained Transformers (GPTs)".

> On Feb 19, 2023, at 1:05 PM, Mario Carneiro <di....@gmail.com> wrote:
>
> I don't see those as competing goals at all. They play different roles: for browsing you would ideally have some server supplying only the required information to render a specific proof, and for a proof assistant (and to some extent a search tool as well) you want to have the whole database available, or at least all the signatures of all the theorems and the body for the current theorem.
> .... when someone clicks the "start proof mode" button they will have an expectation that there will be a startup cost, so it is easy to justify loading set.mm and a streaming parser still helps for making that experience better.

100% agree on that. "See one proof" is one use case, "get 'whole' database" (e.g., to create a proof) is another.

> While yes I agree that this would be much better done with metamath-knife than metamath-exe, I don't think there is a major concern here, or at least we can mostly mitigate the issues since this flow is extremely light on input from the attacker. Basically the only thing you can specify is the theorem you want to see. But for sure I'm not interested in doing this by building on metamath-exe; there are some great frameworks for writing these kinds of servers in Rust that should already handle most of the issues. The main thing to be concerned about is DoS attacks, and having a 'real' server like nginx in front helps with that.
>
> But our threat model is pretty much nonexistent to begin with; almost all the server data and code is public domain in the first place, and even if the server is taken down it's not a critical service. I think a much more critical problem for the project is giving the perception of being old, unfriendly software used by few.

It's absolutely not a *critical* service, but there are a lot of attackers who just attack everything using a variety of tools, then see what they can get out of it. So I'd rather make it hard to successfully attack the site. I agree that strong filtering would eliminate many problems.

My first choice when building a website, where it's easily achieved, is static rendering (statically-generated pages) where the most important data is visible without client-side JavaScript. Those are easily understood, incredibly hard to attack, trivial to configure, trivial to mirror, support security/privacy-conscious users, etc. Some sites *cannot* be developed this way, of course!! But using that as the "starting point" helps clarify "what else do you need?" - that is, it forces a discussion of "what are the unspoken requirements?". It's also where we are in the metamath site.

A disadvantage of static rendering is that you have to prerender all pages, which takes time. I see that as a non-problem for our case; if someone can't wait a few hours, they can use tools to locally generate the page themselves.

We can obviously use a program to dynamically generate many of the pages instead, but what would it need to do?


> I expect to have a proposal for a JSON format - just a parse tree of a .mm file, really - early next month if not sooner.

I think the more interesting case is a JSON format for a *single* proof.

For a specific database, perhaps the website needs to provide:
* The .mm database.
* Introductory HTML pages (explaining what's going on).
* At least one HTML display for each statement (theorem/axiom).
* A JSON format for each statement, to make it easy for renderers to grab that data & generate fancy displays. That implies the JSON should probably include all the data in the currently generated HTML page, including data about related theorems & an expansion of every proof step (see the sketch after this list).
* Auto-generated table-of-contents (HTML & JSON?)
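
To make that JSON bullet concrete, here is a rough sketch of what a per-statement file might look like. Every field name is hypothetical, the proof shown is the classic axioms-only derivation of "id" rather than set.mm's actual stored proof, and the referencedBy list is abridged; this is just to illustrate the kind of data involved:

{
  "label": "id",
  "type": "$p",
  "statement": "|- ( ph -> ph )",
  "hypotheses": [],
  "proof": [
    { "step": 1, "ref": "ax-1", "expr": "|- ( ph -> ( ph -> ph ) )" },
    { "step": 2, "ref": "ax-1", "expr": "|- ( ph -> ( ( ph -> ph ) -> ph ) )" },
    { "step": 3, "ref": "ax-2", "expr": "|- ( ( ph -> ( ( ph -> ph ) -> ph ) ) -> ( ( ph -> ( ph -> ph ) ) -> ( ph -> ph ) ) )" },
    { "step": 4, "hyps": [2, 3], "ref": "ax-mp", "expr": "|- ( ( ph -> ( ph -> ph ) ) -> ( ph -> ph ) )" },
    { "step": 5, "hyps": [1, 4], "ref": "ax-mp", "expr": "|- ( ph -> ph )" }
  ],
  "axiomsUsed": ["ax-mp", "ax-1", "ax-2"],
  "referencedBy": ["idd"]
}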

I think it's important that at least 1 HTML format show "the basics" if JavaScript isn't enabled. This is easily done by having progressive rendering *or* an alternate HTML page. An alternate "fancy" HTML page could download the JSON format and do many fancy things, without having to parse the .mm file.

I don't know if we need a JSON format for the .mm file as a whole. One approach for doing that would be to generate a JSON file that in turn contains the contents of every per-theorem JSON file. That would be a really big file, even precompressed - does anyone *want* that? What do they plan to do with it? We already compress the .mm files because they've gotten big. One possible use would be training AI systems, but they'd need to do post-processing of that to make it useful for them anyway. My earlier proposal involved trying to create a format pre-processed to make it easier for them to read, because if they're specially training an ML system, some preprocessing usually gets better results.

--- David A. Wheeler

Mario Carneiro

unread,
Feb 19, 2023, 3:14:10 PM2/19/23
to meta...@googlegroups.com
On Sun, Feb 19, 2023 at 2:19 PM David A. Wheeler <dwhe...@dwheeler.com> wrote:
> While yes I agree that this would be much better done with metamath-knife than metamath-exe, I don't think there is a major concern here, or at least we can mostly mitigate the issues since this flow is extremely light on input from the attacker. Basically the only thing you can specify is the theorem you want to see. But for sure I'm not interested in doing this by building on metamath-exe, there are some great frameworks for writing these kind of servers in rust that should already handle most of the issues. The main thing to be concerned about is DOS attacks, and having a 'real' server like nginx in front helps with that.
>
> But our threat model is pretty much nonexistent to begin with; almost all the server data and code is public domain in the first place and even if the server is taken down it's not a critical service. I think a much more critical problem for the project is giving a perception of being an old unfriendly software used by few.

It's absolutely not a *critical* service, but there are a lot of attackers who just attack everything using a variety of tools, then see what they can get out of it. So I'd rather make it hard to successfully attack the site. I agree that strong filtering would eliminate many problems.

Obviously that wasn't to downplay the necessity for not putting buggy servers on the internet. We should certainly take all reasonable precautions. But it is important to keep in mind what the threat model is when talking about security concerns, and ours is about as low as one can be for something that is on the internet.
 
My first choice when building a website, where it's easily achieved, is static rendering (statically-generated pages) where the most important data is visible without client-side JavaScript. Those are easily understood, incredibly hard to attack, trivial to configure, trivial to mirror, support security/privacy-conscious users, etc. Some sites *cannot* be developed this way, of course!! But using that as the "starting point" helps clarify "what else do you need?" - that is, it forces a discussion of "what are the unspoken requirements?". It's also where we are in the metamath site.

With pre-rendering as a replacement for dynamic server side stuff, we can still do most of the things. For example we could prerender all the json queries and use client side JS to request them when doing cross linking stuff. This is only made difficult if the set of all queries is infinite, for example if we can do a search query.

The main advantages of dynamic rendering are:
(1) quicker turnaround time, e.g. if we want to hastily put up a mm100 theorem on the website;
(2) faster iteration - it is possible to make changes to the server and see the results ~immediately;
(3) lower storage cost, since we won't have to generate all the HTML for the web site in advance and just cache important stuff (I assume that most of the work going into the website build is wasted, since no one looks at that page before it is next regenerated);
(4) the ability to have many more rendering styles (unicode, gif, text, latex, mathml, json have all been considered and more are possible) and more databases, which is a combinatorial problem for the website build limited by storage space and generation time.
 
A disadvantage of static rendering is that you have to prerender all pages, which takes time. I see that as a non-problem for our case; if someone can't wait a few hours, they can use tools to locally generate the page themselves.

For most web site readers, there is a HUGE gap between the effort required to click on a link and the effort required to download the tool, figure out how to make it work, and generate the page they want. I have a working setup and even I would not bother to go to the trouble in most cases. (Even when I do, the results often have broken links or missing GIF symbols because I didn't do the full build or don't have the rest of the web site directory hierarchy set up.)

A tool which can act as a *local* server would be a measurable improvement over piecemeal HTML generation, since you can give the impression of a fully built website without having to wait two hours first.
 
We can obviously use a program to dynamically generate many of the pages instead, but what would it need to do?

For the first version, I would aim simply to replicate the behavior / look and feel of the current metamath web pages. That's what I implemented in https://github.com/digama0/mm-web-rs . Once we have something that can potentially replace the functionality of metamath-exe it will be easier to consider incremental improvements, in an architecture where we can do fancy things if we want.
 
> I expect to have a proposal for a JSON format - just a parse tree of a .mm file, really - early next month if not sooner.

I think the more interesting case is a JSON format for a *single* proof.

For a specific database, perhaps the website needs to provide:
* The .mm database.
* Introductory HTML pages (explaining what's going on).
* At least one HTML display for each statement (theorem/axiom).
* A JSON format for each statement, to make it easy for renderers to grab that data & generate fancy displays. That implies the JSON should probably include all the data in the currently generated HTML page, including data about related theorems & an expansion of every proof step.
* Auto-generated table-of-contents (HTML & JSON?)

I think it's important that at least 1 HTML format show "the basics" if JavaScript isn't enabled. This is easily done by having progressive rendering *or* an alternate HTML page. An alternate "fancy" HTML page could download the JSON format and do many fancy things, without having to parse the .mm file.

One technique that I use in the MM0 web pages is to use the HTML as the data source: put all the proofs in a table like they would be normally, but perhaps not as marked up as they should be. Then postprocess it with JS to add all the markup and make additional server requests if needed. That solves the "make the JS-disabled experience palatable" problem, since you just get the basic version of the page if no JS is running.

The sort of thing that might require additional data is parsing support; for example with MM0 pages the formulas are printed in ascii but ideally they would reflow using the layout algorithm, using correct indentation. Metamath web pages are horrible at this; they get good per-token rendering but reflowing is garbage so hopefully you just don't have too-long lines. mmj2 has a nice indenting formatter (actually there are several algorithms available); wouldn't it be nice if that could be done in-browser so that it adapts to the viewport width? If the lines are sent in HTML as token strings, then that means that the JS has to do the parsing itself so it needs to know the signatures of declarations like 'wi'.

Also expensive queries can be hidden behind an expandable drop-down and a server request. We probably want to keep axioms-used and used-by lists enabled by default even though they are somewhat expensive (although if the server has some time to warm up and precompute some data structures then it can cheaply answer these requests), but there are some other requests along the same lines, like "how many steps would this proof be if everything was expanded to axioms" (which is fun if you like to look at numbers in the range of 10^50), which are accessible to metamath-exe but not shown on the web page.
 
I don't know if we need a JSON format for the .mm file as a whole. One approach for doing that would be to generate a JSON file that in turn contains the contents of every per-theorem JSON file. That would be a really big file, even precompressed - does anyone *want* that? What do they plan to do with it? We already compress the .mm files because they've gotten big. One possible use would be training AI systems, but they'd need to do post-processing of that to make it useful for them anyway. My earlier proposal involved trying to create a format pre-processed to make it easier for them to read, because if they're specially training an ML system, some preprocessing usually gets better results.

I agree with this. The main use case for JSON is at the per-theorem level; a JSON for the mm file as a whole would be huge and you would be better off parsing MM directly at that point, since it would be more efficient. Some other potential files are table-of-contents like things, which cover the whole mm file but only represent one 'aspect' of the theorems. So an actual table of contents (the section name hierarchy), the theorem pages (i.e. give the statements of theorems 1000-1100), the definition list, the grammar (syntax axioms). The bare list of all theorem names, in order, might also be a useful query - I would just give this in plain text, newline separated. Maybe include the $f/$e/$a/$p classifications too.
 

Antony Bartlett

unread,
Feb 19, 2023, 3:18:16 PM2/19/23
to meta...@googlegroups.com
On Sun, Feb 19, 2023 at 1:05 PM Mario Carneiro <di....@gmail.com> wrote:
I don't see those as competing goals at all. They play different roles: for browsing you would ideally have some server supplying only the required information to render a specific proof, and for a proof assistant (and to some extent a search tool as well) you want to have the whole database available,
 
Cool. The need for a streaming parser stems entirely from the use of monolithic .mm files, whose architectural soundness I have doubted, but if you're pointing out that they serve proof assistant users particularly well but proof explorer users particularly badly, then yes, there's definitely room for both server and client side JSON long term. Especially if we can agree on a particular format, and I'm flexible about that because I'm not very far along.
 
a streaming parser can only save you a factor of 2 if we assume that theorems of interest are sampled uniformly.

I'm particularly concerned about not trying the patience of newcomers, and as they tend to start by looking at how we've formalised propositional logic, I was anticipating a substantially greater value proposition than might be implied by a speedup of 2x on average ;-)
 
On Sun, Feb 19, 2023 at 7:19 PM David A. Wheeler <dwhe...@dwheeler.com> wrote:
> I expect to have a proposal for a JSON format - just a parse tree of a .mm file, really - early next month if not sooner.

I think the more interesting case is a JSON format for a *single* proof.

So how about if I were to offer you the parse tree of a .mmp file, instead of the parse tree of a .mm file?

That's very much on my roadmap, but I had to start with .mm because I don't believe the code for generating .mmp files exists on the JavaScript platform yet.  (Glauco's Yamma doesn't have it yet, and Igor's proof assistant doesn't use them or save any intermediate format I'm aware of).

    Best regards,

        Antony

Glauco

unread,
Feb 19, 2023, 3:59:57 PM2/19/23
to Metamath
 
That's very much on my roadmap, but I had to start with .mm because I don't believe the code for generating .mmp files exists on the JavaScript platform yet.  (Glauco's Yamma doesn't have it yet, and Igor's proof assistant doesn't use them or save any intermediate format I'm aware of).

The ctrl+g of mmj2 is not in Yamma yet (I guess that's the only feature I use in mmj2 that's not going to be in the first "stable" release)

But Thierry's metamath-vspa has a command "Show Proof" that creates the .mmp file (if you want to have a look at his code)

Glauco

David A. Wheeler

unread,
Feb 19, 2023, 5:39:55 PM2/19/23
to Metamath Mailing List

> On Sun, Feb 19, 2023 at 2:19 PM David A. Wheeler <dwhe...@dwheeler.com> wrote:
> My first choice when building a website, where it's easily achieved, is static rendering (statically-generated pages) where the most important data is visible without client-side JavaScript. Those are easily understood, incredibly hard to attack, trivial to configure, trivial to mirror, support security/privacy-conscious users, etc. Some sites *cannot* be developed this way, of course!! But using that as the "starting point" helps clarify "what else do you need?" - that is, it forces a discussion of "what are the unspoken requirements?". It's also where we are in the metamath site.
>
> With pre-rendering as a replacement for dynamic server side stuff, we can still do most of the things. For example we could prerender all the json queries and use client side JS to request them when doing cross linking stuff. This is only made difficult if the set of all queries is infinite, for example if we can do a search query.


> On Feb 19, 2023, at 3:13 PM, Mario Carneiro <di....@gmail.com> wrote:
> The main advantages of dynamic rendering are:
> (1) quicker turnaround time, e.g. if we want to hastily put up a mm100 theorem on the website;
> (2) faster iteration - it is possible to make changes to the server and see the results ~immediately;
> (3) lower storage cost, since we won't have to generate all the HTML for the web site in advance and just cache important stuff (I assume that most of the work going into the website build is wasted, since no one looks at that page before it is next regenerated);
> (4) the ability to have many more rendering styles (unicode, gif, text, latex, mathml, json have all been considered and more are possible) and more databases, which is a combinatorial problem for the website build limited by storage space and generation time.

Sure. As always, it's all about trade-offs :-).

I don't know if quicker turnaround time & such is all that important.
A "local" server (run by, say, a tool that calls metamath-knife) would have the same
tooling problems: you'd need to set up additional tools to run it locally.

Of that list, the ability to have more rendering styles is possibly the most compelling.
But exactly what rendering styles?


> > We can obviously use a program to dynamically generate many of the pages instead, but what would it need to do?
>
> For the first version, I would aim simply to replicate the behavior / look and feel of the current metamath web pages. That's what I implemented in https://github.com/digama0/mm-web-rs . Once we have something that can potentially replace the functionality of metamath-exe it will be easier to consider incremental improvements, in an architecture where we can do fancy things if we want.

Okay, but what is desirable that is *different* long term? I think there are plausible good answers,
but it'd be good to have some idea about them.

> One technique that I use in the MM0 web pages is to use the HTML as the data source: put all the proofs in a table like they would be normally, but perhaps not as marked up as they should be. Then postprocess it with JS to add all the markup and make additional server requests if needed. That solves the "make the JS-disabled experience palatable" problem, since you just get the basic version of the page if no JS is running.

Yup! Sounds like a great approach.

> The sort of thing that might require additional data is parsing support; for example with MM0 pages the formulas are printed in ascii but ideally they would reflow using the layout algorithm, using correct indentation. Metamath web pages are horrible at this; they get good per-token rendering but reflowing is garbage so hopefully you just don't have too-long lines. mmj2 has a nice indenting formatter (actually there are several algorithms available); wouldn't it be nice if that could be done in-browser so that it adapts to the viewport width? If the lines are sent in HTML as token strings, then that means that the JS has to do the parsing itself so it needs to know the signatures of declarations like 'wi'.

An alternative is to provide HTML in a table (or some other grid), along with something that's simpler to parse but not shown to someone just looking at the HTML. It could be in an attribute, something with "display: none", whatever. If there's a client-side renderer being used, I would want it to be *easy* to write, not make it parse complex HTML. The individual theorems aren't big, so it'd be pretty easy to include both in one file.
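
For instance (illustrative markup only; the attribute name "data-ascii" is invented here):

<tr>
  <td class="step">1</td>
  <!-- Visible rendering for plain-HTML readers; the data-ascii attribute
       carries a machine-parseable form for any client-side renderer. -->
  <td class="expr" data-ascii="|- ( ph -> ph )">⊢ ( 𝜑 → 𝜑 )</td>
</tr>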

> Also expensive queries can be hidden behind an expandable drop-down and a server request. We probably want to keep axioms-used and used-by lists enabled by default even though they are somewhat expensive (although if the server has some time to warm up and precompute some data structures then it can cheaply answer these requests), but there are some other requests along the same lines, like "how many steps would this proof be if everything was expanded to axioms" (which is fun if you like to look at numbers in the range of 10^50), which are accessible to metamath-exe but not shown on the web page.

Agreed. You can do the same with pregenerated static files, BTW, where searches bring you to a *dynamic* program that does the search & provides the results.

An alternative to a lengthy warm-up each time would be to precompute some data structures & answers, then store them in a database (probably using SQLite, but MySQL or PostgreSQL would work, though perhaps even a simple key/value store would be enough). Or do the lengthy warm-up only when that .mm has changed, then store that in a database. Then server restarts can be fast when the .mm file hasn't changed.
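
As a sketch of that caching idea, assuming the rusqlite crate (the table layout and sample values here are invented for illustration):

use rusqlite::{params, Connection};

fn main() -> rusqlite::Result<()> {
    let db = Connection::open("warmup-cache.db")?;
    // One row per theorem, so the cache can be reused across server
    // restarts as long as the .mm file is unchanged.
    db.execute(
        "CREATE TABLE IF NOT EXISTS axioms_used (
             theorem TEXT PRIMARY KEY,
             axioms  TEXT NOT NULL
         )",
        [],
    )?;
    // In a real warm-up pass this would come from walking the database.
    db.execute(
        "INSERT OR REPLACE INTO axioms_used (theorem, axioms) VALUES (?1, ?2)",
        params!["id", "ax-mp ax-1 ax-2"],
    )?;
    let axioms: String = db.query_row(
        "SELECT axioms FROM axioms_used WHERE theorem = ?1",
        params!["id"],
        |row| row.get(0),
    )?;
    println!("id depends on: {axioms}");
    Ok(())
}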

> > I don't know if we need a JSON format for the .mm file as a whole. ...
>
> I agree with this. The main use case for JSON is at the per-theorem level; a JSON for the mm file as a whole would be huge and you would be better off parsing MM directly at that point, since it would be more efficient. Some other potential files are table-of-contents like things, which cover the whole mm file but only represent one 'aspect' of the theorems. So an actual table of contents (the section name hierarchy), the theorem pages (i.e. give the statements of theorems 1000-1100), the definition list, the grammar (syntax axioms). The bare list of all theorem names, in order, might also be a useful query - I would just give this in plain text, newline separated. Maybe include the $f/$e/$a/$p classifications too.

--- David A. Wheeler

Jon P

unread,
Feb 25, 2023, 5:27:25 PM2/25/23
to Metamath
Re the subject of LLMs being trained on Metamath: this new model called LLaMA from Facebook is interesting. It seems to be smaller than other systems, and it looks like they're already training on Metamath.


It links to this page and this paper, which talk about the mathematical aspects



It's interesting that they talk about the "Metamath benchmark". Maybe one approach is to take that and have a page which presents the Metamath database for training (kind of like the .txt files or JSON or whatever is preferred) and then has some notes about the papers that have already been published about it (GPT-f, LLaMA) and their results. Might be interesting.

Mario Carneiro

unread,
Feb 25, 2023, 5:33:14 PM2/25/23
to meta...@googlegroups.com
A page with short descriptions of papers and projects using Metamath for AI would be great to put on the website. I think it would be good publicity for both metamath and the AI systems, and maybe we can solicit the paper authors to write the descriptions and provide related links as well.


David A. Wheeler

unread,
Feb 26, 2023, 11:52:03 AM2/26/23
to meta...@googlegroups.com, Mario Carneiro
I completely agree: a page on AI/ML would be great. The pages could do with a refresh anyway; many places still suggest contacting Norm.

Norm used an HTML 1.0 convention of using uppercase tags. I suggest that new text use lowercase tags as a style choice.
It doesn't matter technically, but it is the current convention, and it would make it easier to see which parts are unchanged.

Mario Carneiro

unread,
Feb 26, 2023, 11:04:35 PM2/26/23
to David A. Wheeler, meta...@googlegroups.com
FYI I did a fairly comprehensive HTML modernization in mm-web-rs [https://github.com/digama0/mm-web-rs/blob/master/src/render.rs] (which is basically a clone of the metamath-exe theorem page generator), which among other things lowercased all the tags and made it pass the w3c validator again (there is a link at the bottom of most pages but they don't pass more modern HTML standards and still use things like <FONT>).

David A. Wheeler

unread,
Feb 27, 2023, 9:51:49 AM2/27/23
to Mario Carneiro, Metamath Mailing List


> On Feb 26, 2023, at 11:04 PM, Mario Carneiro <di....@gmail.com> wrote:
>
> FYI I did a fairly comprehensive HTML modernization in mm-web-rs [https://github.com/digama0/mm-web-rs/blob/master/src/render.rs] (which is basically a clone of the metamath-exe theorem page generator), which among other things lowercased all the tags and made it pass the w3c validator again (there is a link at the bottom of most pages but they don't pass more modern HTML standards and still use things like <FONT>).

Awesome!

When I spoke I was thinking more about the static HTML files in:
https://github.com/metamath/set.mm
https://github.com/metamath/metamath-website-seed

... but yes, we want to modernize those as well.

--- David A. Wheeler

Thierry Arnoux

unread,
Feb 27, 2023, 5:27:29 PM2/27/23
to meta...@googlegroups.com, David A. Wheeler, Mario Carneiro

And since we're there, I'd also like to recall this other live site:

http://metamath.tirix.org:3030/mpests/toc

All pages are rendered dynamically; there are 3 modes:

- ASCII - http://metamath.tirix.org:3030/mpeascii/areacirc
- Unicode - http://metamath.tirix.org:3030/mpeuni/areacirc
- MathML - http://metamath.tirix.org:3030/mpests/areacirc

Source code here: https://github.com/tirix/metamath-web

_
Thierry

David A. Wheeler

unread,
Mar 2, 2023, 12:53:23 PM3/2/23
to Thierry Arnoux, Metamath Mailing List, Mario Carneiro


> On Feb 27, 2023, at 5:27 PM, Thierry Arnoux <thierry...@gmx.net> wrote:
>
> And since we're there, I'd also like to recall this other live site:
>
> http://metamath.tirix.org:3030/mpests/toc
>
> All pages are rendered dynamically, there are 3 modes:
>
> - ASCII - http://metamath.tirix.org:3030/mpeascii/areacirc
> - Unicode - http://metamath.tirix.org:3030/mpeuni/areacirc
> - MathML - http://metamath.tirix.org:3030/mpests/areacirc
>
> Source code here. https://github.com/tirix/metamath-web

That *is* nice. The MathML in particular is nice.

Is there a list of Metamath database view generation tools? I can think of:
- https://github.com/digama0/mm-web-rs
- https://github.com/tirix/metamath-web
I know there are others. At least, there was one that let you expand/contract proof steps. Which one was that?

So clearly we should talk about options to completely replace the current
process. Obviously we *could* replace pre-generation with dynamic generation;
alternatively, we could run these other tools as pre-generators.
I really *do* want it to be possible to view the key results without client-side JavaScript,
but this code & Mario's can clearly meet that need. In the case of tirix's code,
just view ASCII and/or Unicode and you're fine.

I do have a few specific comments on tirix's tool. They're all fixable of course:

1. I can't read most of the text because it has low contrast. That includes description, step numbers, hyp refs, ref blue boxes, text expressions in mpascii, & navigators. Can you please change its CSS to have strong text contrast? It should be readable by color-blind people on a phone in sunlight. Creating strong text contrast is *important* but also *easy* - it's just some CSS tweaks.
2. The text just runs off to the right, instead of going over multiple lines, making long ones unreadable. That's important but easy for the non-MathML case (just a little CSS), though I don't know how hard that is for MathML/MathJax.
3. I think it's important we support existing URLs. That's easy with pregeneration, just rename to taste. If this is dynamically generated, that would require some changes to support the .html endings *and* support returning static pages.
4. This one omits many of the capabilities of the metamath-exe generator, e.g., compare with <https://us.metamath.org/mpeuni/areacirc.html>. That includes:
  Distinct variable group:   𝑥,𝑦,𝑅
  Allowed substitution hints:   𝑆(𝑥,𝑦)
  Colors of variables (off, setvar, class) ... this one isn't as important to me & would be a lot of work, so I'm not fussy there.
  Syntax hints (with links to them)
  List of axioms used (with links)
  List of definitions used (with links)
  List of theorems that reference it (with links)


--- David A. Wheeler

Thierry Arnoux

unread,
Mar 2, 2023, 1:44:15 PM3/2/23
to meta...@googlegroups.com, David A. Wheeler, Mario Carneiro
Hi David,

> At least, there was one that let you expand/contract proof steps.
> Which one was that?
>
The viewer you're talking about is probably this one:

http://metamath.tirix.org/geoisum1.html

There are two things in there:

First, this is also MathML, but this time generated not using the Rust-based
metamath-knife, but the good old C-based metamath-exe.
The code for that is still living in a PR:
https://github.com/metamath/metamath-exe/pull/9
(Actually the code was available way before that, but Norm did not feel
he was ready to open a GitHub repository at the time I wrote it)

Unfortunately this was not merged at the time, and Wolf and Mario have since
started major cleanup and refactoring. I see Mario has done some work, but in
the meantime there are new conflicts to solve.
Also, I tried to stick to the coding style I saw, which has now changed,
so it probably does not live up to the rest of the code anymore.


Second, the JS library which does the expand/contract is actually
independent and could be used on the "classic" Metamath pages, IIRC.

It is available here: https://github.com/tirix/mm-web-ui


I'll answer the other comments about the tool in a separate post!

BR,
_
Thierry

Mario Carneiro

unread,
Mar 2, 2023, 4:33:30 PM3/2/23
to David A. Wheeler, Thierry Arnoux, Metamath Mailing List
On Thu, Mar 2, 2023 at 12:53 PM David A. Wheeler <dwhe...@dwheeler.com> wrote:
2. The text just runs off to the right, instead of going over multiple lines, making long ones unreadable. That's important but easy for the non-MathML case (just a little CSS), though I don't know how hard that is for MathML/MathJax.

This is a pretty fundamental problem with MathJax style displays, and I don't have a good solution. LaTeX generally isn't great about automatic line breaking inside mathematics, but metamath routinely involves very large formulas, so although this looks good for the mid-sized examples the large ones really suffer. A related issue is that mathjax isn't really able to cope with having thousands of large and complex formulas on a page and the CPU time / draw time really spikes on some of the examples. For MM0 I'm planning to mostly just sidestep this issue by using ASCII-first displays, perhaps using unicode or programming fonts with ligatures but not much more. That way line breaking is limited to just pretty printing which can feasibly be written with a handcrafted JS library.
 
3. I think it's important we support existing URLs. That's easy with pregeneration, just rename to taste. If this is dynamically generated, that would require some changes to support the .html endings *and* support returning static pages.

I don't think this will be difficult no matter what we do, since we can set up all the necessary routing in nginx. I am not above potentially making mpegif redirect to mpeuni though (or a future URL scheme), or serving only dynamic pages for deprecated formats. I think we should try to make https://us.metamath.org/mpeuni/areacirc.html link to a theorem about the area of a circle in perpetuity but not necessarily serving exactly the same HTML as it does today.
 
4. This one omits many of the capabilities of the metamath-exe generator, e.g., compare with <https://us.metamath.org/mpeuni/areacirc.html>. That includes:
  Distinct variable group:   𝑥,𝑦,𝑅
  Allowed substitution hints:   𝑆(𝑥,𝑦)
  Colors of variables (off, setvar, class) ... this one isn't as important to me & would be a lot of work, so I'm not fussy there.
  Syntax hints (with links to them)
  List of axioms used (with links)
  List of definitions used (with links)
  List of theorems that reference it (with links)

By the way (and I realize this wasn't directed at both tools), mm-web-rs was designed explicitly as a carbon-copy replacement for metamath-exe generation, and supports all of these things. (Thierry's generator as I understand it is more of an exploration of what we could have if we break with the traditional format.) IIRC if you open them in a browser side by side the result is identical except for distinct variable groups, which use a different (arguably better) heuristic for collecting $d pairs into cliques. (Exact clique cover is an NP-hard problem, though, so some amount of approximation is required.) I think the variable coloring heuristic uses some hard-coded typecode names so it is not suitable for general databases; there are actually quite a few aspects of the web site generator that are tailored for set.mm which should probably use $t commands instead.

I would link an example here, but I'm not sure where to host it. On that note, we might consider a place on the main website for experimental stuff now that us2 is down. Do you know if there is a spot which already works for this? I'm thinking of some place where it is safe to just drop files and have them persisted across website builds, with a URL that implies that they may go offline at any time.

David A. Wheeler

unread,
Mar 2, 2023, 5:37:29 PM3/2/23
to Metamath Mailing List


> On Thu, Mar 2, 2023 at 12:53 PM David A. Wheeler <dwhe...@dwheeler.com> wrote:
> 2. The text just runs off to the right, instead of going over multiple lines, making long ones unreadable. That's important but easy for the non-MathML case (just a little CSS), though I don't know how hard that is for MathML/MathJax.
>
> On Mar 2, 2023, at 4:33 PM, Mario Carneiro <di....@gmail.com> wrote:
> This is a pretty fundamental problem with MathJax style displays, and I don't have a good solution. LaTeX generally isn't great about automatic line breaking inside mathematics, but metamath routinely involves very large formulas, so although this looks good for the mid-sized examples the large ones really suffer. A related issue is that mathjax isn't really able to cope with having thousands of large and complex formulas on a page and the CPU time / draw time really spikes on some of the examples. For MM0 I'm planning to mostly just sidestep this issue by using ASCII-first displays, perhaps using unicode or programming fonts with ligatures but not much more. That way line breaking is limited to just pretty printing which can feasibly be written with a handcrafted JS library.

That makes sense. I don't *object* to beautiful presentations, but if that's unsolved, we could generally prefer pointing to the Unicode versions (where long formulas are easily handled).

MathJax is well aware of the need for automatic line breaking. They hope to have a version 3 that resolves it:
https://docs.mathjax.org/en/latest/output/linebreaks.html
https://stackoverflow.com/questions/40628064/mathjax-automatic-line-breaking-when-resizing
... I'm fine with letting them solve the problem & using their solution. I don't know when they'll actually do that (if ever).

>
> 3. I think it's important we support existing URLs. That's easy with pregeneration, just rename to taste. If this is dynamically generated, that would require some changes to support the .html endings *and* support returning static pages.
>
> I don't think this will be difficult no matter what we do, since we can set up all the necessary routing in nginx.

Not *difficult*, but some are harder than others if we *dynamically* generate.
In nginx you would typically specify a location pattern as a parameter, and if it matches,
call a program (e.g., using fastcgi). Here's an example:
https://www.nginx.com/resources/wiki/start/topics/examples/fastcgiexample/

You'd *like* to have the web server directly serve static files (they're good at that!),
but currently directories like "mpeuni" have a lot of files, some of which might be dynamically generated
while others are probably statically generated. Typical websites put them in separate directories,
to make it easy to tell them apart.

If they're all dynamically generated via the same program, or all statically generated, then that's moot.

> I am not above potentially making mpegif redirect to mpeuni though (or a future URL scheme), or serving only dynamic pages for deprecated formats. I think we should try to make https://us.metamath.org/mpeuni/areacirc.html link to a theorem about the area of a circle in perpetuity but not necessarily serving exactly the same HTML as it does today.

Agreed. Heck, even if we stay with metamath-exe I would think we'd be open to improvements.
But since many people link to our pages, I want them to keep working to get "something interesting".


> 4. This one omits many of the capabilities of the metamath-exe generator, e.g., compare with <https://us.metamath.org/mpeuni/areacirc.html>.....
> By the way (and I realize this wasn't directed at both tools), mm-web-rs was designed explicitly as a carbon copy replacement for metamath-exe generation, and supports all of these things. (Thierry's generator as I understand it is more of an exploration of what we could have if we break with the traditional format.)

> IIRC if you open them in a browser side by side the result is identical except for distinct variable groups, which use a different (arguably better) heuristic for collecting $d pairs into cliques. (Exact clique cover is an NP hard problem, though, so some amount of approximation is required.)

Better is better!

> I think the variable coloring heuristic uses some hard-coded typecode names so it is not suitable for general databases; there are actually quite a few aspects of the web site generator that are tailored for set.mm which should probably use $t commands instead.

We could start with hard-coding, then move towards supporting a general form.
I agree that in the long term we want to be general.

I suspect typecode coloring is primarily useful to new users... but since we *do* want new users/readers, that's still relevant. We could even add it later, but it also doesn't seem *that* hard to implement (at least hard-coded).


> I would link an example here, but I'm not sure where to host it. On that note, we might consider a place on the main website for experimental stuff now that us2 is down. Do you know if there is a spot which already works for this? I'm thinking of some place where it is safe to just drop files and have them persisted across website builds, with a URL that implies that they may go offline at any time.

I can trivially create the name "experimental.metamath.org" (or whatever you like), and point it somewhere. The real question is, where should it point?
* It *could* point to the metamath.org site. It doesn't have a lot of space for many more copies, and if you're not ready for it to get attacked that'd be a bad thing. That requires no new money.
* You could set up a site literally anywhere else. A linode node costs $6/month. Heroku has a low-cost plan (not free though).

--- David A. Wheeler

Mario Carneiro

unread,
Mar 2, 2023, 7:33:47 PM3/2/23
to meta...@googlegroups.com
On Thu, Mar 2, 2023 at 5:37 PM David A. Wheeler <dwhe...@dwheeler.com> wrote:


> On Thu, Mar 2, 2023 at 12:53 PM David A. Wheeler <dwhe...@dwheeler.com> wrote:
> 2. The text just runs off to the right, instead of going over multiple lines, making long ones unreadable. That's important but easy for the non-MathML case (just a little CSS), though I don't know how hard that is for MathML/MathJax.
>
> On Mar 2, 2023, at 4:33 PM, Mario Carneiro <di....@gmail.com> wrote:
> This is a pretty fundamental problem with MathJax style displays, and I don't have a good solution. LaTeX generally isn't great about automatic line breaking inside mathematics, but metamath routinely involves very large formulas, so although this looks good for the mid-sized examples the large ones really suffer. A related issue is that mathjax isn't really able to cope with having thousands of large and complex formulas on a page and the CPU time / draw time really spikes on some of the examples. For MM0 I'm planning to mostly just sidestep this issue by using ASCII-first displays, perhaps using unicode or programming fonts with ligatures but not much more. That way line breaking is limited to just pretty printing which can feasibly be written with a handcrafted JS library.

That makes sense. I don't *object* to beautiful presentations, but if that's unsolved, we could generally prefer pointing to the Unicode versions (where long formulas are easily handled).

MathJax is well aware of the need for automatic line breaking. They hope to have a version 3 that resolves it:
https://docs.mathjax.org/en/latest/output/linebreaks.html
https://stackoverflow.com/questions/40628064/mathjax-automatic-line-breaking-when-resizing
... I'm fine with letting them solve the problem & using their solution. I don't know when they'll actually do that (if ever).


> 3. I think it's important we support existing URLs. That's easy with pregeneration, just rename to taste. If this is dynamically generated, that would require some changes to support the .html endings *and* support returning static pages.
>
> I don't think this will be difficult no matter what we do, since we can set up all the necessary routing in nginx.

Not *difficult*, but some are harder than others if we *dynamically* generate.
In nginx you would typically specify a location pattern as a parameter, and if it matches,
call a program (e.g., using fastcgi). Here's an example:
https://www.nginx.com/resources/wiki/start/topics/examples/fastcgiexample/

You'd *like* to have the web server directly serve static files (they're good at that!),
but currently directories like "mpeuni" have a lot of files, some of which might be dynamically generated
while others are probably statically generated. Typical websites put them in separate directories,
to make it easy to tell them apart.

I'm fairly sure there is a way to set up an nginx config to do essentially the following:

* On a request to /mpeuni/X.html:
  * If /var/www/mpeuni/X.html exists, serve that
  * Otherwise, call 'web-page-generator /mpeuni/X.html', save the result as /var/www/mpeuni/X.html (assuming it is not a 404) and serve it
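
Something like the following nginx fragment approximates that flow (a sketch with hypothetical names throughout; note that it caches responses in nginx's own store rather than writing files back into /var/www, which has much the same effect):

# In the http block: where cached generator output lives.
proxy_cache_path /var/cache/nginx/mm levels=1:2 keys_zone=mmcache:10m;

location /mpeuni/ {
    root /var/www;
    # Serve a pre-generated static file if one exists,
    # otherwise fall through to the generator backend.
    try_files $uri @generator;
}

location @generator {
    proxy_pass http://127.0.0.1:8080;  # the web-page-generator service
    proxy_cache mmcache;
    proxy_cache_valid 200 7d;          # pay the generation cost once
}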

If they're all dynamically generated via the same program, or all statically generated, then that's moot.

In this case 'web-page-generator' doesn't exactly have to be a single program, it can be a shell script or shim which calls out to multiple actual tools. It's basically another server behind the server, with nginx acting as a cache for it. It could handle a fairly wide range of different endpoints, including a mix of static and dynamic stuff. (Probably nginx can handle the static stuff though, assuming the relevant URLs can be easily described with globs.)
 
> I think the variable coloring heuristic uses some hard-coded typecode names so it is not suitable for general databases; there are actually quite a few aspects of the web site generator that are tailored for set.mm which should probably use $t commands instead.

We could start with hard-coding, then move towards supporting a general form.
I agree that in the long term we want to be general.

I suspect type code coloring primarily useful to new users... but since we *do* want to new users/readers, that's still relevant. We could even add it later, but it also doesn't seem *that* hard to implement (at least hard-coded).

Random coloring is another option to get an OK overall look (where the seed is something tied to the database itself or the typecode names, so that it is stable enough for people to be able to use it as a consistent visual), or something based on the same rainbow spectrum used for the label numbers. It's definitely important for more than just new users; it is the only bit of color we use on metamath formulas, which is an important aspect of visually breaking them up. It's the nearest thing we have to syntax highlighting.

A $t command seems the most appropriate; there is already one that contains almost exactly this information: 'htmlvarcolor' shows the color *key* but does not actually contain the color mapping (so in theory they could diverge).

htmlvarcolor '<SPAN CLASS=wff STYLE="color:blue;font-style:normal">wff</SPAN> '
    + '<SPAN CLASS=setvar STYLE="color:red;font-style:normal">setvar</SPAN> '
    + '<SPAN CLASS=class STYLE="color:#C3C;font-style:normal">class</SPAN>';

If this was instead

htmlvarcolor "wff" as "blue";
htmlvarcolor "setvar" as "red";
htmlvarcolor "class" as "#C3C";

then we would be able to derive both the coloring on symbols and the color key from the same information (in addition to being able to validate that these are actually typecodes). Currently the coloring on symbols is baked into the 'althtmldef' for variable tokens.

> I would link an example here, but I'm not sure where to host it. On that note, we might consider a place on the main website for experimental stuff now that us2 is down. Do you know if there is a spot which already works for this? I'm thinking of some place where it is safe to just drop files and have them persisted across website builds, with a URL that implies that they may go offline at any time.

I can trivially create the name "experimental.metamath.org" (or whatever you like), and point it somewhere. The real question is, where should it point?
* It *could* point to the metamath.org site. It doesn't have a lot of space for many more copies, and if you're not ready for it to get attacked that'd be a bad thing. That requires no new money.
* You could set up a site literally anywhere else. A linode node costs $6/month. Heroku has a low-cost plan (not free though).

I was thinking something more along the lines of metamath.org/experimental/foo.html. Wouldn't a new domain cost more?

For experiments I don't think we need anything that has a huge space footprint. The main things would be pre-generated web pages and dynamic server tests (uncached). There is some potential for an experimental server to cause trouble or take down the site, but that's always going to be the case for any dynamic server option and anything we write for this purpose needs to be prepared for that from the start.

The other reason to prefer doing this on metamath.org is to get practice for actually making these kinds of modifications. Most of the discussed changes would be rather similar to this.

If you can create dev.metamath.org for free and point it to the same linode server, then I guess that's roughly isomorphic to metamath.org/experimental/ . Perhaps that's better for keeping things separate, although it still has the same "take down the server" risks.