
1 Introduction

Machine learning (ML) has recently proven its worth in a number of fields, ranging from computer vision [17], to speech recognition [15], to playing games [28, 40] with reinforcement learning (RL) [45]. It is also increasingly applied in automated and interactive theorem proving. Learned predictors have been used for premise selection [1] in hammers [6], to improve clause selection in saturation-based theorem provers [9], to synthesize functions in higher-order logic [12], and to guide connection-tableau provers [21] and interactive theorem provers [2, 5, 14].

The growing body of mathematical knowledge and the increasing complexity of mathematical proofs will increase the need for proof checking, and for better computer support and automation of it. Simultaneously, the growing complexity of software will increase the need for formal verification to prevent failure modes [10]. Both automated theorem proving and mathematics will benefit from more advanced ML integration. One of the mathematical subfields that makes substantial use of automated theorem provers is the field of quasigroup and loop theory [32].

1.1 Contributions

In this paper, we propose to use a neural network to suggest lemmas to the Prover9 [25] ATP system by rewriting parts of the conjecture (Sect. 2). We test our method on a dataset of theorems collected in the work on the Abelian Inner Mapping (AIM) Conjecture [24] in loop theory. For this, we use the AIMLEAP proof system [7] as a reinforcement learning environment. This setup is described in Sect. 3. For development we used a simpler Robinson arithmetic rewriting task (Sect. 4). With the insights derived from this and a comparison with other methods, we describe our own 3SIL method in Sect. 5. We use a neural network to process the state of the proving attempt, for which the architecture is described in Sect. 6. The results on the Robinson arithmetic task are described in Sect. 7.1. We show our results on the AIMLEAP proving task, both using our predictor as a stand-alone prover and by suggesting lemmas to Prover9, in Sect. 7.2. Our contributions are:

  1. We propose a training method for reinforcement learning in theorem proving settings: stratified shortest solution imitation learning (3SIL). This method is suited to the structure of theorem proving tasks; the method and the reasoning behind it are explained in Sect. 5.

  2. We show that 3SIL outperforms other baseline RL methods on a simpler Robinson arithmetic rewriting task whose reward structure is similar to that of theorem proving (Sect. 7.1).

  3. We show that a standalone neurally guided prover trained by the 3SIL method outperforms the hand-engineered Waldmeister prover on the AIMLEAP benchmark (Sect. 7.2).

  4. We show that using a neural rewriting step that suggests rephrased versions of the conjecture to be added as lemmas improves the ATP performance on equational problems (Sects. 2 and 7.2).

Fig. 1.

Schematic representation of the proposed guidance method. In the first phase, we run a reinforcement learning loop in which a predictor proposes actions that rewrite the conjecture; the predictor is trained using the AIMLEAP proof environment. We collect the rewrites of the LHS and RHS of the conjecture. In the second phase, we add these rewrites to the ATP search input, to act as guidance. In this specific example, the conjecture is rewritten for only one step, but in reality the added guidance lemmas are the product of many steps in the RL loop.

2 ATP and Suggestion of Lemmas by Neural Rewriting

Saturation-based ATPs make use of the given clause [30] algorithm, which we briefly explain as background. A problem is expressed as a conjunction of many initial clauses (i.e., the clausified axioms and the negated goal, which is always an equation in the AIM dataset). The algorithm starts with all the initial clauses in the unprocessed set. We then pick a clause from this set to be the given clause, move it to the processed set, and perform all inferences between the given clause and the clauses in the processed set. The newly inferred clauses are added to the unprocessed set. This concludes one iteration of the algorithm, after which we pick a new given clause and repeat [23]. Typically, this approach is designed to be refutationally complete, i.e., the algorithm is guaranteed to eventually find a contradiction if the original goal follows from the axioms.
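To make the loop concrete, the following Python sketch outlines a schematic given-clause procedure. The clause representation and the infer, is_contradiction, and select_given helpers are hypothetical placeholders for illustration, not Prover9 internals.

```python
# A minimal sketch of the given-clause saturation loop described above.
# All helpers are passed in as callables and are purely illustrative.

def saturate(initial_clauses, infer, is_contradiction, select_given, max_iterations=10000):
    """Run a schematic given-clause loop; return True if a contradiction is found."""
    unprocessed = list(initial_clauses)    # all initial clauses start out unprocessed
    processed = []
    for _ in range(max_iterations):
        if not unprocessed:
            return False                   # saturated without finding a contradiction
        given = select_given(unprocessed)  # heuristic choice of the given clause
        unprocessed.remove(given)
        processed.append(given)
        for new_clause in infer(given, processed):  # inferences with the processed set
            if is_contradiction(new_clause):
                return True                # empty clause derived: goal is refuted
            unprocessed.append(new_clause)
    return False
```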

This process can produce many new clauses, and the search space can become very large. In this work, we modify the standard loop by adding useful lemmas to the initial clause set. These lemmas are proposed by a neural network that was trained from zero knowledge to rewrite the left- and right-hand sides of the initial goal to make them equal, using the axioms as the available rewrite actions. Even though the neural rewriting might not fully succeed, the rewrites produced by this process are likely to be useful as additional lemmas when added to the problem. This idea is schematically represented in Fig. 1.
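The sketch below illustrates how such suggested rewrites could be appended to the initial clause set before the ATP is invoked. The build_problem helper, the example formulas, and the exact input layout are illustrative assumptions loosely modeled on Prover9-style formula lists, not the actual interface used in our pipeline.

```python
# A sketch of adding rewritten versions of the conjecture's LHS and RHS as
# extra lemmas to an equational problem before calling the ATP.

def build_problem(axioms, goal, suggested_lemmas):
    """Return a problem string with the neural rewrites added as assumptions."""
    lines = ["formulas(assumptions)."]
    lines += [f"  {ax}." for ax in axioms]
    # Rewrites produced by the trained policy are injected here as ordinary lemmas.
    lines += [f"  {lemma}." for lemma in suggested_lemmas]
    lines += ["end_of_list.", "", "formulas(goals).", f"  {goal}.", "end_of_list."]
    return "\n".join(lines)

# Purely illustrative terms, not taken from the AIM dataset.
print(build_problem(
    axioms=["x * 1 = x", "1 * x = x"],
    goal="T(x, 1) = x",
    suggested_lemmas=["T(x, 1) = x \\ (x * 1)"],   # an example rewrite of the LHS
))
```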

3 AIM Conjecture and the AIMLEAP RL Environment

Automated theorem proving has been applied in the theory surrounding the Abelian Inner Mapping Conjecture, known as the AIM Conjecture. This is one of the top open conjectures in quasigroup theory. Work on the conjecture has been going on for more than a decade. Automated theorem provers use hundreds of thousands of inference steps when run on problems from this theory.

As a testbed for our machine learning and prover guidance methods, we use a previously published dataset of problems generated in the work on the AIM conjecture [7]. The dataset comes with a simple prover called AIMLEAP that can take machine learning advice (Footnote 1). We use this system as an RL environment. AIMLEAP keeps the state and carries out the cursor movements (the cursor determines the location of the rewrite) and rewrites that a neural predictor chooses.

The AIM conjecture concerns specific structures in loop theory [24]. A loop is a quasigroup with an identity element. A quasigroup is a generalization of a group in which the operation need not be associative. This manifests itself in the presence of two different ‘division’ operators, a left division (\(\backslash \)) and a right division (/). We briefly explain the conjecture to show the nature of the data.

For loops, three inner mapping functions (left-translation L, right-translation R, and the mapping T) are:

$$\begin{aligned} L(u, x, y) &:= (y * x) \backslash (y * (x * u)) \qquad \qquad T(u, x) := x \backslash (u * x) \\ R(u, x, y) &:= ((u * x) * y) / (x * y) \end{aligned}$$

These mappings can be seen as measures of the deviation from commutativity and associativity. The conjecture concerns the consequences of these three inner mapping functions forming an Abelian (commutative) group. There are two more notions: the associator function a and the commutator function K:

$$\begin{aligned} a(x,y,z) := (x * (y * z)) \backslash ((x * y) * z) \qquad \qquad K(x,y) := (y * x) / (x * y) \end{aligned}$$

From these definitions, the conjecture can be stated. There are two parts to the conjecture. For both parts, the following equalities need to hold for all u, v, x, y, and z:

$$\begin{aligned} a(a(x,y,z),u,v) = 1 \qquad a(x,a(y,z,u),v) = 1 \qquad a(x,y,a(z,u,v)) = 1 \end{aligned}$$

where 1 is the identity element. These conditions are necessary, but not sufficient, for the two main parts of the conjecture. The first part of the conjecture asks whether a loop modulo its center is a group. In this context, the center is the set of all elements that commute with all other elements. This is the case if

$$\begin{aligned} K(a(x,y,z),u)&= 1. \end{aligned}$$

The second part of the conjecture asks whether a loop modulo its nucleus is an Abelian group. The nucleus is the set of elements that associate with all other elements. This is the case if

$$\begin{aligned} a(K(x,y),z,u) = 1 \qquad a(x,K(y,z),u) = 1 \qquad a(x,y,K(z,u)) = 1 \end{aligned}$$
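As a concrete sanity check of these definitions, the following self-contained Python sketch implements \(*\), \(\backslash \), /, the inner mappings, and the functions a and K for the symmetric group \(S_3\) (a group is in particular a loop). Because groups are associative, the associator is always the identity there, so the displayed conditions hold trivially; the sketch only illustrates the definitions, not the conjecture itself.

```python
# Illustration of the loop-theoretic definitions above on S3, a group viewed
# as a loop with x \ y = inv(x)*y and x / y = x*inv(y).
from itertools import permutations

ELEMENTS = list(permutations(range(3)))            # the six permutations of {0,1,2}
IDENTITY = (0, 1, 2)

def mul(p, q):        # composition: (p*q)(i) = p(q(i))
    return tuple(p[q[i]] for i in range(3))

def inv(p):           # inverse permutation
    result = [0, 0, 0]
    for i in range(3):
        result[p[i]] = i
    return tuple(result)

def ldiv(x, y):       # x \ y
    return mul(inv(x), y)

def rdiv(x, y):       # x / y
    return mul(x, inv(y))

# Inner mappings and derived functions, transcribed from the definitions above.
def L(u, x, y): return ldiv(mul(y, x), mul(y, mul(x, u)))
def R(u, x, y): return rdiv(mul(mul(u, x), y), mul(x, y))
def T(u, x):    return ldiv(x, mul(u, x))
def a(x, y, z): return ldiv(mul(x, mul(y, z)), mul(mul(x, y), z))
def K(x, y):    return rdiv(mul(y, x), mul(x, y))

# In a group the associator is always the identity, so the conditions hold trivially.
assert all(a(x, y, z) == IDENTITY for x in ELEMENTS for y in ELEMENTS for z in ELEMENTS)
assert all(K(a(x, y, z), u) == IDENTITY
           for x in ELEMENTS for y in ELEMENTS for z in ELEMENTS for u in ELEMENTS)
print("all associators are trivial in S3, as expected for a group")
```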

3.1 The AIMLEAP RL Environment

Currently, work in this area is done using automated theorem provers such as Prover9 [24, 25]. This has led to some promising results, but the search space is enormous. The main strategy for proving the AIM conjecture thus far has been to prove weaker versions of the conjecture (using additional assumptions) and then import crucial proof steps into the proof of the stronger version. The Prover9 theorem prover is especially suited to this approach because of its well-established hints mechanism [48]. The AIMLEAP dataset is derived from this Prover9 approach and contains 3468 theorems that can be proven with the supplied definitions and lemmas [7].

There are 177 possible actions in the AIMLEAP environment [7]. We handle the proof state as a tree, with the root node being an equality node. Three actions are cursor movements, where the cursor can be moved to an argument of the current position. The other actions all rewrite the current term at the cursor position with various axioms, definitions and lemmas that hold in the AIM context. As an example, this is one of the theorems in the dataset (\(\backslash \) and \(=\) are part of the language):

$$\begin{aligned} T(T(T(x,T(x,y) \backslash 1),T(x,y) \backslash 1),y)&= T((T(x,y) \backslash 1) \backslash 1,T(x,y) \backslash 1) \, . \end{aligned}$$

The task of the machine learning predictor is to process the proof state and recognize which actions are most likely to lead to a proof, meaning that the two sides of the starting equation are equal according to the AIMLEAP system. The only feedback that the environment gives is whether a proof has been found or not: there is no intermediate reward (i.e. rewards are sparse). The ramifications of this are further discussed in Sect. 5.1.
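The interaction with the environment can be pictured as the following episode-collection sketch. The Gym-style reset/step interface and the policy callable are assumptions made for illustration; AIMLEAP's actual interface may differ.

```python
# A schematic proof-attempt rollout with a sparse terminal reward.

def collect_episode(env, policy, max_steps=30):
    """Roll out one proof attempt; reward is non-zero only if a proof is found."""
    state = env.reset()                      # initial proof state (a tree with a cursor)
    transitions = []
    for _ in range(max_steps):
        action = policy(state)               # one of the 177 cursor/rewrite actions
        next_state, reward, done = env.step(action)
        transitions.append((state, action))
        state = next_state
        if done:
            return transitions, reward       # reward > 0 only when both sides are equal
    return transitions, 0.0                  # step budget exhausted: no reward (sparse)
```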

4 Rewriting in Robinson Arithmetic as an RL Task

To develop a machine learning method that can help solve equational theorem proving problems, we considered a simpler arithmetic task, which also has a tree-structured input and a sparse reward structure: the normalization of Robinson arithmetic expressions. The task is to normalize a mathematical expression to one specific form. This task has been implemented as a Python RL environment, which we make available (Footnote 2). The learning environment incorporates an existing dataset, constructed by Gauthier for RL experiments in the interactive theorem prover HOL4 [

$$\begin{aligned} \mathcal {L}&= \mathcal {L}_\text {policy}^\text {A2C} +\mathcal {L}_\text {value}^\text {A2C} \, . \end{aligned}$$

In addition to the policy function \(\pi \), the agent has access to a value function \(\mathcal {V}: \mathcal {S} \rightarrow \mathbb {R}\), which predicts the sum of future rewards obtained from a given state. In practice, both the policy and the value function are computed by a neural network predictor. The parameters of the predictor are set by stochastic gradient descent to minimize \(\mathcal {L}\). The set of parameters of the predictor that defines the policy function \(\pi \) is named \(\theta \), while the parameters that define the value function are named \(\mu \). The first part of the loss is the policy loss, which for one time step has the form

$$\begin{aligned} \mathcal {L}_\text {policy}^\text {A2C}&= - \log \pi _{\theta } (a_t|s_t) A(s_t,a_t) \, , \end{aligned}$$

where \(A(s_t, a_t)\) is the advantage function. The advantage function can be formulated in multiple ways, but the simplest is as \(R_t - \mathcal {V}_{\mu }(s_t)\). That is to say: the advantage of an action in a certain state is the difference between the discounted return \(R_t\) obtained after taking that action and the value estimate of the current state.

Minimizing \(\mathcal {L}_\text {policy}^\text {A2C}\) amounts to maximizing the log probability of predicting actions that are judged by the advantage function to lead to high reward.

The value estimates \(\mathcal {V}_{\mu }(s)\) for computing the advantage function are supplied by the value predictor \(\mathcal {V}_{\mu }\) with parameters \(\mu \), which is trained using the loss:

$$\begin{aligned} \mathcal {L}_\text {value}^\text {A2C}&= \frac{1}{2} \left( R_t - \mathcal {V}_{\mu }(s_t) \right) ^2 \, , \end{aligned}$$

which minimizes the squared difference between the discounted return \(R_t\) and the value estimate. The logic of this is that the value estimate at timestep t, \(\mathcal {V}_{\mu }(s_t)\), will learn to incorporate the later rewards \(R_t\), ensuring that when the same state is seen later, the possible future reward will be taken into account. Note that the sets of parameters \(\theta \) and \(\mu \) are not necessarily disjoint (see Sect. 6).
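For concreteness, the following numpy sketch computes both A2C losses for a single finished episode, using the simple advantage \(R_t - \mathcal {V}_{\mu }(s_t)\). The probabilities and value estimates are placeholder arrays standing in for the neural predictor's outputs.

```python
# A minimal sketch of the A2C policy and value losses for one episode.
import numpy as np

def a2c_losses(rewards, action_probs, values, gamma=0.99):
    """Return (policy_loss, value_loss) averaged over the episode."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):          # discounted return R_t
        running = rewards[t] + gamma * running
        returns[t] = running
    advantages = returns - values                    # A(s_t, a_t) = R_t - V(s_t)
    policy_loss = float(np.mean(-np.log(action_probs) * advantages))
    value_loss = float(np.mean(0.5 * (returns - values) ** 2))
    return policy_loss, value_loss

# Sparse-reward episode: only the final step is rewarded.
rewards = [0.0, 0.0, 0.0, 1.0]
action_probs = np.array([0.5, 0.4, 0.6, 0.7])   # pi(a_t | s_t) for the taken actions
values = np.array([0.1, 0.2, 0.4, 0.8])         # V(s_t) estimates
print(a2c_losses(rewards, action_probs, values))
```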

Consider how the above equations are affected if no non-zero reward \(r_t\) is obtained at any timestep. In that case, the value function \(\mathcal {V}_{\mu }(s_t)\) will estimate (correctly) that any state will get 0 reward, which means that the advantage function \(A(s_t, a_t)\) will also be 0 everywhere. This means that \(\mathcal {L}_\text {policy}^\text {A2C}\) will be 0 in most cases, which will lead to little or no change in the parameters of the predictor: learning will be very slow. This is the difficult aspect of the structure of theorem proving: there is only reward at the end of a successful proof, and nowhere else. This suggests a possible strategy: imitate successful episodes directly, without a value function. In this case, we would only need to train a policy function, and no approximate value function. This is an aspect we explore in the design of our own method, 3SIL, which we explain shortly.

Compared to two-player games, such as chess and Go, for which many approaches have been tailored and successfully used [41], theorem proving has the property that it is hard to collect useful examples to learn from, as only successful proofs are likely to contain useful knowledge. In chess or Go, however, one player almost always wins and the other loses, which means that we can at least learn from the difference between the two strategies used by those players. As an example, we executed 2 million random proof attempts in the AIMLEAP environment, which led to 300 proofs to learn from, whereas in a two-player setting like chess, we would get 2 million games in which one player would likely win.

ACER. The second RL baseline method we tested in our experiments is ACER, Actor-Critic with Experience Replay [49]. This approach can make use of data from older episodes to train the current predictor: ACER applies corrections to the value estimates so that data from old episodes may be used to train the current policy. It also uses trust region policy optimization [35] to limit the size of the policy updates. This method is included as a baseline to check whether using a larger replay buffer to update the parameters is advantageous.

PPO. Our third RL baseline is the widely used proximal policy optimization (PPO) algorithm [36]. It restricts the size of the parameter update to avoid causing a large difference between the original predictor’s behavior and the updated version’s behavior. The method is related to the trust region policy optimization method mentioned above. In this way, PPO addresses the training instability of many reinforcement learning approaches. It has been used in various settings, for example complex video games [4], and its versatility makes it a natural baseline for our setting. We use the PPO algorithm with the clipped objective, as in [36].
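A minimal sketch of the clipped objective, with placeholder inputs, is shown below; it only illustrates the loss computation, not our full training setup.

```python
# PPO's clipped policy objective: the probability ratio between the updated
# and the old policy is clipped so a single update cannot move the policy too far.
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    ratio = np.exp(logp_new - logp_old)                       # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return float(np.mean(-np.minimum(unclipped, clipped)))    # minimize the negated objective

logp_old = np.log(np.array([0.5, 0.4, 0.6]))
logp_new = np.log(np.array([0.7, 0.3, 0.9]))
advantages = np.array([1.0, -0.5, 2.0])
print(ppo_clip_loss(logp_new, logp_old, advantages))
```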

SIL-PAAC. Our final RL baseline trains, for a portion of the training procedure, only on the transitions with positive advantage, so as to learn more from good episodes. This was proposed as self-imitation learning (SIL) [29]. To avoid confusion with the method that we are proposing, we extend the acronym to SIL-PAAC, for positive advantage actor-critic. This algorithm outperformed A2C on the sparse-reward task Montezuma’s Revenge (a puzzle game). As theorem proving has a sparse reward structure, we included SIL-PAAC as a baseline. More information about the implementations of the baselines can be found in the Implementation Details section at the end of this work.
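The following sketch shows the positive-advantage filtering that gives SIL-PAAC its name: transitions whose return does not exceed the value estimate contribute nothing to the self-imitation loss. The arrays are placeholders.

```python
# Self-imitation policy loss with positive-advantage filtering.
import numpy as np

def sil_policy_loss(log_probs, returns, values):
    advantages = np.maximum(returns - values, 0.0)   # (R - V)+ : keep positive advantages only
    return float(np.mean(-log_probs * advantages))

log_probs = np.log(np.array([0.6, 0.2, 0.8]))
returns = np.array([1.0, 0.0, 1.0])
values = np.array([0.4, 0.3, 1.2])                   # the last transition is filtered out
print(sil_policy_loss(log_probs, returns, values))
```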

5.2 Stratified Shortest Solution Imitation Learning

We introduce stratified shortest solution imitation learning (3SIL) to tackle the equational theorem proving domain. It learns to explicitly imitate the actions taken during the shortest solutions found for each problem in the dataset. We do this by minimizing the cross-entropy \(-{\log p(a_{\textit{solution}} | s_t)}\) between the predictor output and the actions taken in the shortest solution. This is in contrast to the baseline methods, where value functions are used to judge the utility of decisions.

Our procedure uses no value function. Instead, our data selection builds on the assumption that, in the context of theorem proving and expression normalization, shorter proofs are better. In a sense, we value decisions from shorter proofs more and explicitly imitate those transitions. We keep a history H for each problem, where we store the current shortest solution (states seen and actions taken) found for that problem in the training dataset. We can also store multiple shortest solutions for each problem if there are multiple strategies for a proof (the number of solutions kept is governed by the parameter k).
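A minimal sketch of this per-problem history, for \(k=1\), is given below; the problem identifier and the state/action encodings are hypothetical.

```python
# Keep, for every training problem, only the shortest solution found so far.
# A solution is stored as the list of (state, action) pairs of the episode.

def update_history(history, problem_id, solution):
    """Store the new solution only if it is shorter than the stored one."""
    best = history.get(problem_id)
    if best is None or len(solution) < len(best):
        history[problem_id] = solution
    return history

history = {}
update_history(history, "thm_42", [("s0", 3), ("s1", 17), ("s2", 5)])
update_history(history, "thm_42", [("s0", 3), ("s1", 5)])   # a shorter proof replaces the old one
print(len(history["thm_42"]))   # -> 2
```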

Algorithm 1 and Algorithm 2 (pseudocode figures)

During training, in the case \(k=1\), we sample state-action pairs from each problem’s current shortest solution with equal probability (if a solution was found). To be precise, we first randomly pick a theorem for which we have a solution, and then randomly sample one transition from the shortest encountered solution. This directly counters one of the phenomena that we observed: the training examples for the baseline methods tend to be dominated by very long episodes (as they contribute more states and actions). This stratified sampling method ensures that problems with short proofs are represented equally in the training process.
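The sampling step itself can be sketched as follows; the history layout matches the sketch above and is again illustrative.

```python
# Stratified sampling: first pick a solved problem uniformly at random, then
# pick one transition uniformly from its shortest stored solution, so short
# proofs are not drowned out by long ones.
import random

def sample_batch(history, batch_size):
    solved = [sol for sol in history.values() if sol]   # problems with a stored solution
    batch = []
    for _ in range(batch_size):
        solution = random.choice(solved)       # uniform over problems, not over transitions
        batch.append(random.choice(solution))  # uniform over that problem's transitions
    return batch                               # (state, action) pairs for cross-entropy training

history = {"thm_1": [("s0", 2)], "thm_2": [("s0", 7), ("s1", 4), ("s2", 1)]}
print(sample_batch(history, 4))
```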

The 3SIL algorithm is described in more detail in Algorithm 2. Sampling from a noisy version of policy \(\pi _{\theta }\) means that actions are sampled from the predictor-defined distribution and in 5% of cases a random valid action is selected. This is also known as the \(\epsilon \)-greedy policy (with \(\epsilon \) at 0.05).
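A sketch of this noisy action selection is given below; the valid_actions list and the policy output format are assumptions for illustration.

```python
# Noisy policy for episode collection: with probability epsilon take a random
# valid action, otherwise sample from the predictor-defined distribution.
import random

def epsilon_greedy_action(action_probs, valid_actions, epsilon=0.05):
    if random.random() < epsilon:
        return random.choice(valid_actions)                        # exploration step
    return random.choices(valid_actions, weights=action_probs)[0]  # sample from the policy

print(epsilon_greedy_action([0.1, 0.7, 0.2], valid_actions=[0, 1, 2]))
```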

Related Methods. Our approach is similar to the imitation learning algorithm DAGGER (Dataset Aggregation), which was used for several games [34] and modified for branch-and-bound algorithms in [16]. The behavioral cloning (BC) technique used in robotics [47] also shares some elements. 3SIL significantly differs from DAGGER and BC because it does not use an outside expert to obtain useful data, because of the stratified sampling procedure, and because of the selection of the shortest solutions for each problem in the training dataset. We include as an additional baseline an implementation of behavioral cloning (BC), where we regard proofs already encountered as coming from an expert. We minimize cross-entropy between the actions in proofs we have found and the predictions to train the predictor. For BC, there is no stratified sampling or shortest solution selection, only the minimization of cross-entropy between actions taken from recent successful solutions and the predictor’s output.

Extensions. For the AIM tasks, we introduce two other techniques, biased sampling and episode pruning. In biased sampling, problems without a solution in the history are sampled 5 times more often during episode collection than solved problems, to accelerate progress. This ratio was determined by testing 1, 2, 5 and 10 as sampling proportions. For episode pruning, when the agent encounters the same state twice, we prune the episode to exclude the loop before storing the episode. This helps the predictor learn to avoid such loops.
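The pruning step can be sketched as follows, assuming states can be compared for equality (here, that they are hashable).

```python
# Episode pruning: if a state reappears within an episode, cut out the
# intermediate loop before the episode is stored, so the predictor is not
# trained to walk in circles.

def prune_loops(transitions):
    """transitions: list of (state, action); remove cycles that revisit a state."""
    pruned, seen = [], {}
    for state, action in transitions:
        if state in seen:
            pruned = pruned[:seen[state]]          # drop everything since the first visit
            seen = {s: i for i, (s, _) in enumerate(pruned)}
        seen[state] = len(pruned)
        pruned.append((state, action))
    return pruned

episode = [("A", 1), ("B", 2), ("C", 3), ("B", 4), ("D", 5)]
print(prune_loops(episode))   # -> [("A", 1), ("B", 4), ("D", 5)]
```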