the anatomy of a high-performance theorem-prover - thoughts and insights


Mark Tarver

Sep 22, 2016, 5:58:09 AM
to Shen
The new high-performance ATP will go into SP very soon.  Coding these things is a real challenge because the architecture is about as dense as the Shen kernel.  The main engine E is based on Stickel's Prolog Technology Theorem Prover of the 80s, but adapted for sequents, and I've added a preprocessor P that chops up problems before feeding them into E.  This preprocessor is actually a little theorem prover itself and can solve problems independently of E.  I've called P a planner, but really that is IMO too grand a title for this device.

The top-level function is crunchit! of type ((list fol.wff) * fol.wff) --> (list fol.wff) --> boolean.  The last argument is optional and defaults to [] if no argument is supplied.  The function takes

1.  A sequent: a pair of a list of fol.wffs L and a fol.wff W, which is the conclusion.
2.  A background theory T (a list of wffs), which is compiled to code and which may be empty.

The ATP attempts to prove

L |-T  W    i.e. L entails W under T.
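
For example, a minimal call looks like this (the predicate p and the constant a are purely illustrative, and the background theory is left empty); it should return true:

(crunchit! (@p [[p a]] [exists x [p x]]) [])    \\ [p a] |- [exists x [p x]]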

The logic is first-order logic without equality.

1.  The logical constants are v, ~, &, =>, <=>, exists, all and =df.
2.  A recognised symbol is a symbol other than a logical constant.
3.  A recognised symbol is a term.
4.  A list composed of a recognised symbol followed by terms is a term.
5.  A predicate is a recognised symbol.
6.  An atom is a list composed of a predicate followed by terms, or a recognised symbol (the propositional case).
7.  An atom is a fol.wff.
8.  If P and Q are fol.wffs, so are [~ P], [P v Q], [P & Q], [P => Q], [P <=> Q], [P =df Q], [all V P] and [exists V P], where V is a recognised symbol.

This is standard except for =df, which means 'is defined as' and is used in T.  If T contains [P =df Q] then all instances of the pattern P are demodulated (normalised) to Q, and crunchit! does that before attempting to solve.
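
For concreteness, here is a sketch of a recogniser for this grammar.  These helper functions are hypothetical - my own illustration, not part of the ATP - and untyped for brevity:

(define logical-constant?
  X -> (element? X [v ~ & => <=> exists all =df]))

(define recognised-symbol?
  X -> (and (symbol? X) (not (logical-constant? X))))

(define term?
  \\ rules 3 and 4: a recognised symbol, or one applied to terms
  [F | Ts] -> (and (recognised-symbol? F) (terms? Ts))
  X -> (recognised-symbol? X))

(define terms?
  [] -> true
  [T | Ts] -> (and (term? T) (terms? Ts)))

(define wff?
  \\ rule 8: the connectives and quantifiers
  [~ P] -> (wff? P)
  [P v Q] -> (and (wff? P) (wff? Q))
  [P & Q] -> (and (wff? P) (wff? Q))
  [P => Q] -> (and (wff? P) (wff? Q))
  [P <=> Q] -> (and (wff? P) (wff? Q))
  [P =df Q] -> (and (wff? P) (wff? Q))
  [all V P] -> (and (recognised-symbol? V) (wff? P))
  [exists V P] -> (and (recognised-symbol? V) (wff? P))
  \\ rules 6 and 7: predicate atoms and propositional atoms
  [P | Ts] -> (and (recognised-symbol? P) (terms? Ts))
  P -> (recognised-symbol? P))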

Performance is about 750,000 inferences per second per GHz.  This system will probably end up as part of something bigger.  More anon.

Mark

Mark Tarver

Sep 23, 2016, 10:10:07 AM
to Shen
To continue: the architecture is that of a preprover P that feeds into the main engine E.  Suitably programmed, the preprover can actually solve certain problems faster than E.  E performs incrementally bounded depth-first search up to a bound provided by the user.  So the pattern is:

1.  The preprover applies demodulation rules and tries to solve the problem by basic reasoning; if it succeeds, then true is returned.
2.  Otherwise it hands the problem over to the main engine for solution (a sketch follows).
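
Here is a sketch of that control flow.  The names demodulate-problem, preprover-solves? and engine-solve are hypothetical, for illustration only - they are not FTP's actual internals:

(define solve
  \\ try the preprover first; failing that, deepen the bound
  \\ incrementally up to the user-supplied maximum
  Problem Max -> (let Demod (demodulate-problem Problem)
                   (if (preprover-solves? Demod)
                       true
                       (deepen Demod 1 Max))))

(define deepen
  Problem Depth Max -> false where (> Depth Max)
  Problem Depth Max -> (or (engine-solve Problem Depth)
                           (deepen Problem (+ Depth 1) Max)))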

The 'basic reasoning' uses a modified tableau approach, although indirect proof is not used; i.e. the search proceeds both by trying to derive a contradiction in the assumptions and by trying to show that the conclusion follows directly from the assumptions given.

Demodulation, or rewriting to a normal form, gives significant advantages in dealing with equivalences that are simply definitions.  Let's see this in action.

Here is a function giving some axioms of naive set theory; m means 'is a member of', sub 'is a subset of', etc.  You can probably work out the rest.
 
(define st
  {--> (list fol.wff)}
  -> [[all x [~ [m x e]]]
      [all x [all y [[[sub x y] & [sub y x]] <=> [eq x y]]]]
      [all x [all y [[sub x y] <=> [all z [[m z x] => [m z y]]]]]]
      [all x [all y [[pow x y] <=> [all z [[m z x] <=> [sub z y]]]]]]
      [all x [all y [[com x y] <=> [all z [[m z x] <=> [~ [m z y]]]]]]]
      [all x [all y [all z [[prod x y z] <=> [all w [[m w x] <=> [[pair w] & [[m [fst w] y] & [m [snd w] z]]]]]]]]]
      [all x [all y [all z [[int x y z] <=> [all w [[m w x] <=> [[m w y] & [m w z]]]]]]]]
      [all x [all y [all z [[un x y z] <=> [all w [[m w x] <=> [[m w y] v [m w z]]]]]]]]])

Let's compile this down; the timings throughout are from my 1.3GHz laptop.

(14-) (compile-fol (st))
107 Horn clauses generated
true

(15-) (crunchit! (@p [] [all x [~ [pow e x]]]) [(@p verbose 1)])
   \\ prove the empty set is not a powerset of any set and print off stats.

run time: 0.7800025939941406 secs
815594 inferences
true

(16-) (crunchit! (@p [] [un a a a]) [(@p verbose 1)])  \\ A U A = A

run time: 0.3899993896484375 secs
352518 inferences
true

(19-) (crunchit! (@p []
               [all x [all y [all z [[un x y z] <=> [un x z y]]]]])
               [(@p verbose 1)])

run time: 0.7490081787109375 secs
811613 inferences
true

(20-) (crunchit! (@p [] [all x [all y [[pow x y] => [m y x]]]]) [(@p verbose 1)])

run time: 5.989990234375 secs
6592766 inferences
true

(21-) (crunchit! (@p [] [all x [all y [[prod x y e] => [eq x e]]]]) [(@p verbose 1)])

run time: 0.0 secs
2832 inferences
true

(22-) (crunchit! (@p [] [all x [all y [[sub x y] => [un y x y]]]]) [(@p verbose 1)])

run time: 0.25 secs
279306 inferences
true

(23-) (crunchit! (@p []
               [all x [all y [all z [[sub z y] => [[un x y z] => [un x y y]]]]]])
               [(@p verbose 1)])

run time: 0.0159912109375 secs
12992 inferences
true

(24-) (crunchit! (@p [] [all x [~ [pow e x]]]) [(@p verbose 1)])

run time: 0.764007568359375 secs
815594 inferences
true

(25-) (crunchit! (@p [] [int a a a]) [(@p verbose 1)]) \\ A intersection A = A

run time: 208.3079948425293 secs
203763756 inferences
true

That's really huge - over 200 million inferences - and a tribute to the inferencing speed of E.  However, we can choose to set up these problems differently.  All the axioms in st apart from the first are equivalences that are really implicit definitions.  Therefore we can demodulate all of these problems and use the first axiom as the explicit theory.  When we do that, the preprover actually solves the problems itself at great speed; treating equivalences clausally is inefficient.  So here I enter the equivalences as definitions; I generate a fresh variable V to avoid variable capture.

(define fol.demodulate
  {fol.wff --> fol.wff}
   [eq X Y] -> [[sub X Y] & [sub Y X]] 
   [sub X Y] -> (let V (gensym v) [all V [[m V X] => [m V Y]]])
   [pow X Y] -> (let V (gensym v) [all V [[m V X] <=> [sub V Y]]])
   [com X Y] -> (let V (gensym v) [all V [[m V X] <=> [~ [m V Y]]]])
   [prod X Y Z] -> (let V (gensym v) 
                        [all V [[m V X] 
                                 <=> [[pair V] & [[m [fst V] Y] & [m [snd V] Z]]]]]) 
   [int X Y Z] -> (let V (gensym v) [all V [[m V X] <=> [[m V Y] & [m V Z]]]])
   [un X Y Z] -> (let V (gensym v) [all V [[m V X] <=> [[m V Y] v [m V Z]]]])
   P -> P)
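
For example, applying this at the top level gives something like the following (the gensym'd variable name will vary):

(fol.demodulate [un a a a])
[all v17 [[m v17 a] <=> [[m v17 a] v [m v17 a]]]]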
  
(define e-axiom
  {--> fol.wff}
  -> [all x [~ [m x e]]])

Off we go.
    
(54+) (crunchit! (@p [(e-axiom)] [un a a a]) [(@p verbose 1)])

run time: 0.0159912109375 secs
410 inferences
true : boolean

(55+) (crunchit! (@p [(e-axiom)] [int a a a]) [(@p verbose 1)])

run time: 0.0 secs
407 inferences
true : boolean

(56+) (crunchit! (@p [(e-axiom)]
               [all x [all y [all z [[un x y z] <=> [un x z y]]]]])
               [(@p verbose 1)])

run time: 0.0 secs
4332 inferences
true : boolean

(57+) (crunchit! (@p [(e-axiom)] [all x [all y [[pow x y] => [m y x]]]]) [(@p verbose 1)])

run time: 0.0 secs
1188 inferences
true : boolean

(58+) (crunchit! (@p [(e-axiom)] [all x [all y [[prod x y e] => [eq x e]]]]) [(@p verbose 1)])

run time: 0.014984130859375 secs
6665 inferences
true : boolean

(59+) (crunchit! (@p [(e-axiom)] [all x [all y [[sub x y] => [un y x y]]]]) [(@p verbose 1)])

run time: 0.0 secs
848 inferences
true : boolean

(60+) (crunchit! (@p [(e-axiom)]
               [all x [all y [all z [[sub z y] => [[un x y z] => [un x y y]]]]]])
               [(@p verbose 1)])

run time: 0.0159912109375 secs
4114 inferences
true : boolean

(61+) (crunchit! (@p [(e-axiom)] [all x [~ [pow e x]]]) [(@p verbose 1)])

run time: 0.0 secs
551 inferences
true : boolean

Mark

Mark Tarver

Sep 23, 2016, 3:06:31 PM
to Shen

In detail, this is how it works:

1.  The formulae in the assumptions and the conclusion are demodulated and put in prenex form.
2.  The existentially quantified variables in the assumptions and the universally quantified ones in the conclusion are skolemised away.  The latter requires recognising that universally quantified variables in a conclusion behave much as existentially quantified variables in the assumptions do (see the sketch after this list).
3.  Remaining bound variables are replaced by Prolog variables.
4.  The preprover P performs some basic inferencing in which the sequent is reduced to a series S of sequents whose components are literals.
5.  Anything in S that cannot be solved by P is sent to the engine E to be crunched.
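
Step 2 deserves a concrete sketch.  Here is a hypothetical version for the conclusion, assuming the formula is already in prenex form and simplifying by using Skolem constants rather than full Skolem functions:

(define skolemise-conclusion
  \\ a universal in the conclusion is replaced by a fresh Skolem
  \\ constant, just as an existential in the assumptions would be;
  \\ existentials in the conclusion are left for step 3
  [all V P] -> (skolemise-conclusion (subst (gensym sk) V P))
  [exists V P] -> [exists V (skolemise-conclusion P)]
  P -> P)

And to illustrate step 4: proving [[p & q]] |- [q & p] reduces to the literal sequents [p q] |- q and [p q] |- p, both of which P can dispose of directly.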

Mark

Mark Tarver

Sep 26, 2016, 8:26:39 AM
to Shen
I've finally got the system working reasonably well and I'll send it out this week.  The program will be part of something called Logic Lab, which will be a bunch of logic tools for SP people to experiment with.

Debugging the program is a challenge: the alien nature of the resolution process, the complexity of the code and the fact that an error may show up only after thousands of inferences all mean that you have to be painstaking.  I've worked through it carefully.  The system is sound but logically incomplete.  The code is about 460 lines.

As regards the name, FTP - fast theorem prover - will do.  Thanks to all for your suggestions, which are more imaginative than mine!

Having eliminated the glitches, I now have a clearer picture of the performance profile.  The properly working system is generally much faster.  These problems are solved at depth 5.  You can see that (int a a a) is now solved much more quickly - about 10^4 times faster!

(21+) (ftp [] [un a a a] -verbose 1)

run time: 0.0149993896484375 secs
17654 inferences
true : boolean

(22+) (ftp [] [int a a a] -verbose 1)

run time: 0.016002655029296875 secs
18807 inferences
true : boolean

(23+) (ftp [] [all x [all y [[pow x y] => [m y x]]]] -verbose 1)

run time: 0.03099822998046875 secs
42741 inferences
true : boolean

(24+) (ftp [] [all x [all y [[prod x y e] => [eq x e]]]] -verbose 1)

run time: 0.0 secs
1574 inferences
true : boolean

(25+) (ftp [] [all x [all y [[sub x y] => [un y x y]]]] -verbose 1)

run time: 0.046001434326171875 secs
41113 inferences
true : boolean

(26+) (ftp [] [all x [~ [pow e x]]] -verbose 1)

run time: 0.0 secs
1935 inferences
true : boolean

Two problems from the original set were not solved at depth 5; by experiment I found that these call for a higher depth:

(ftp [] [all x [all y [all z [[un x y z] <=> [un x z y]]]]] -verbose 1 -depth 8) 

(ftp [] [all x [all y [all z [[sub z y] => [[un x y z] => [un x y y]]]]]] -verbose 1 -depth 7) 

However, sadly, SBCL died on these two.  Only by cutting the axiom set down to just the axioms needed for each problem was FTP able to solve both.

These figures were all obtained without demodulation.  With demodulation, all the problems are solved instantaneously.

Thoughts and reflections?  Speed is no substitute for brains.  FTP would probably be quite useful as a knowledge engine attached to a DB, giving first-order querying capability.

Mark  

Mark Tarver

Sep 26, 2016, 9:36:28 AM
to Shen
(39+) (ftp [] [all x [all y [all z [[un x y z] <=> [un x z y]]]]] -verbose 1 -depth 5)

run time: 0.0630035400390625 secs
73484 inferences
false : boolean

(40+) (ftp [] [all x [all y [all z [[un x y z] <=> [un x z y]]]]] -verbose 1 -depth 6)

run time: 0.858001708984375 secs
887737 inferences
false : boolean

(41+) (ftp [] [all x [all y [all z [[un x y z] <=> [un x z y]]]]] -verbose 1 -depth 7)

run time: 80.8709945678711 secs
72921041 inferences
false : boolean

The per-level growth has jumped from one order of magnitude to two in a single bound increase, indicating that the search tree here is very bushy.  Increasing the bound to 8 would cause a memory failure - the solution is out in the billions, it seems.  Decreasing the size of our axiom set would enable that bound to be reached.  But demodulation is the best answer.

(42+) (define fol.demodulate
  {fol.wff --> fol.wff}
   [eq X Y] -> [[sub X Y] & [sub Y X]]
   [sub X Y] -> (let V (gensym v) [all V [[m V X] => [m V Y]]])
   [pow X Y] -> (let V (gensym v) [all V [[m V X] <=> [sub V Y]]])
   [com X Y] -> (let V (gensym v) [all V [[m V X] <=> [~ [m V Y]]]])
   [prod X Y Z] -> (let V (gensym v)
                        [all V [[m V X]
                                 <=> [[pair V] & [[m [fst V] Y] & [m [snd V] Z]]]]])
   [int X Y Z] -> (let V (gensym v) [all V [[m V X] <=> [[m V Y] & [m V Z]]]])
   [un X Y Z] -> (let V (gensym v) [all V [[m V X] <=> [[m V Y] v [m V Z]]]])
   P -> P)
fol.demodulate : (fol.wff --> fol.wff)

(43+) (ftp [] [all x [all y [all z [[un x y z] <=> [un x z y]]]]] -verbose 1 -depth 5)

run time: 0.0 secs
4734 inferences
true : boolean

Mark

Antti Ylikoski

Sep 27, 2016, 8:35:02 AM
to Shen

I have attached to this group entry the Compleat Guide for the MRS.

The Stanford MRS, the Metalevel Reasoning System, is a general-purpose
reasoning tool from 1985, written by Professor Michael Genesereth's
group at Stanford University.

The file is rather large for a 'net group attachment - over 130 pages
and 3.3MB - but I feel that it makes sense to send it to the group
for a comparison of Mark's FTP work with the Stanford MRS.

My own comments:

The MRS is "handy" for building rather simpleminded reasoning systems,
for instance using the MRS as an expert system shell.

But if one delves into the Guide for the MRS and compares it with
Mark's work, it is easy to see that from the mathematical, theoretical
standpoint, Mark's work is significantly stronger.

However, one should note that the two systems were originally
intended, to a certain extent, for different applications.

yours, AJY
CS-TR-85-1080.pdf

Mark Tarver

Sep 27, 2016, 10:16:49 AM
to Shen
Well, yes; they are different, and the guys who built the MRS probably spent more time on it.  I wouldn't compare the FTP and the MRS.

But for those Shenturians interested there is a help page for the FTP which explains how it works - http://www.shenlanguage.org/professional/ftp.htm.  This system will be available over the cloud for SP users early next month.

Mark

Mark Tarver

Sep 27, 2016, 10:32:24 AM
to Shen
As you can see, I've added identity to the FTP.  I've just started on the first Pelletier problem with identity.

(18-) (ftp [[[a = b] v [c = d]] [[a = c] v [b = d]]] [[a = d] v [b = c]] -verbose 1 -id 1 -amplify 2)

run time: 0.0 secs
918 inferences
true

Mark

Mark Tarver

Sep 27, 2016, 2:10:21 PM
to Shen
To put things in perspective: Vampire is, AFAIK, the top ATP - 250,000 lines of code.  FTP is under 500, though in almost any other language apart from Prolog or Shen it would probably be much larger.

FTP is not logically complete, nor designed to be; it's a 'fire-and-forget' ATP, designed to terminate fairly quickly.

Mark