- This paper contains a proof of the correctness of a simple compiling algorithm
- for compiling arithmetic expressions into machine language.
- The definition of correctness, the formalism used to express the description
- of source language, object language and compiler, and the methods of proof are
- all intended to serve as prototypes for the more complicated task of proving the
- correctness of usable compilers. The ultimate goal, as outlined in references
- [1], [2], [3] and [4] is to make it possible to use a computer to check proofs that
- compilers are correct.
- The concepts of abstract syntax, state vector, the use of an interpreter
- for defining the semantics of a programming language, and the definition of
- correctness of a compiler are all the same as in [3]. The present paper, however,
- is the first in which the correctness of a compiler is proved.
- The expressions dealt with in this paper are formed from constants and
- variables. The only operation allowed is a binary + although no change in
- method would be required to include any other binary operations. An example
- of an expression that can be compiled is
(x + 3) + (x + (y + 2))
- ∗This is a reprint with minor changes of ”Correctness of a Compiler for Arithmetic Ex-
- pressions” by John McCarthy and James Painter which was published in MATHEMATICAL
- ASPECTS OF COMPUTER SCIENCE 1, which was Volume 19 of Proceedings of Symposia
- in Applied Mathematics and published by the American Mathematical Society in 1967
- although, because we use abstract syntax, no commitment to a particular
- notation is made.
- The computer language into which these expressions are compiled is a
- single address computer with an accumulator, called ac, and four instructions:
- li (load immediate), load, sto (store) and add. Note that there are no jump
- instructions. Needless to say, this is a severe restriction on the generality of
- our results which we shall overcome in future work.
- The compiler produces code that computes the value of the expression
- being compiled and leaves this value in the accumulator. The above expression
- is compiled into code which in assembly language might look as follows:
load
sto
li
add
sto
load
sto
load
sto
li
add
add
add
x
t
t
t
x
t + 1
y
t + 2
t + 2
t + 1
t
- Again because we are using abstract syntax there is no commitment to a
- precise form for the object code.
- 2 The source language
- The abstract analytic syntax of the source expressions is given by the table:
associated functions
predicate
isconst(e)
isvar(e)
issum(e)
s1(e) s2(e)
- which asserts that the expressions comprise constants, variables and binary
- sums, that the predicates isconst, isvar, and issum enable one to classify
- each expression and that each sum e has summands
s1(e) and s2(e).
- The semantics is given by the formula
- (2.1)
value(e, ξ) = if isconst(e) then val(e) else if isvar(e) then c(e, ξ)
else if issum(e) then value(s1(e), ξ) + value(s2(e), ξ)
- where
val(e) gives the numerical value of an expression e representing a con-
- stant,
c(e, ξ) gives the value of the variable e in the state vector ξ and + is some
- binary operation. (It is natural to regard + as an operation that resembles
- addition of real numbers, but our results do not depend on this).
- For our present purposes we do not have to give a synthetic syntax for
- the source language expressions since both the interpreter and the compiler
- use only the analytic syntax. However, we shall need the following induction
- principle for expressions:
- Suppose Φ is a predicate applicable to expressions, and suppose that for
- all expressions e we have
isconst(e) ⊃ Φ(e)
isvar(e) ⊃ Φ(e)
issum(e) ⊃ Φ(s1(e)) ∧ Φ(s2(e)) ⊃ Φ(e).
and
and
- Then we may conclude that Φ(e) is true for all expressions e.
- 3 The object language.
- We must give both the analytic and synthetic syntaxes for the object language
- because the interpreter defining its semantics uses the analytic syntax and the
- compiler uses the synthetic syntax. We may write the analytic and synthetic
- syntaxes for instructions in the following table.
- operation
- li α
- load x
- sto x
- add x
predicate
isli(s)
isload(s)
issto(s)
isadd(s)
analytic operation
synthetic operation
mkli(α)
mkload(x)
mksto(x)
mkadd(x)
- A program is a list of instructions and
null(p) asserts that p is the null list.
- If the program p is not null then
first(p) gives the first instruction and rest(p)
arg(s)
adr(s)
adr(s)
adr(s)
- gives the list of remaining instructions. We shall use the operation p1 ∗ p2 to
- denote the program obtained by appending p2 onto the end of p1. Since we
- have only one level of list we can identify a single instruction with a program
- that has just one instruction.
- The synthetic and analytic syntaxes of instructions are related by the fol-
- lowing.
isli(mkli(α))
α = arg(mkli(α))
isli(s) ⊃ s = mkli(arg(s))
null(rest(mkli(α)))
isli(s) ⊃ f irst(s) = s
isload(mkload(x))
x = adr(mkload(x))
isload(x) ⊃ x = mkload(adr(x))
null(rest(mkload(x)))
isload(s) ⊃ f irst(s) = s
issto(mksto(x))
x = adr(mksto(x))
issto(x) ⊃ x = mksto(adr(x))
null(rest(mksto(x)))
issto(s) ⊃ f irst(s) = s
isadd(mkadd(x))
x = adr(mkadd(x))
isadd(x) ⊃ x = mkadd(adr(x))
null(rest(mkadd(x)))
isadd(x) ⊃ f irst(s) = s
(3.1)
(3.2)
(3.3)
(3.4)
(3.5)
(3.6)
(3.7)
¬ null(p) ⊃ p = f irst(p) ∗ rest(p),
¬null(p1) ∧ null(rest(p1)) ⊃ p1 = f irst(p1 ∗ p2)
null(p1 ∗ p2) ≡ null(p1) ∧ null(p2).
- The ∗ operation is associative.
(The somewhat awkward form of these
- relations comes from having a general concatenation operation rather than
- just an operation that prefixes a single instruction onto a program.)
- A state vector for a machine gives, for each register in the machine, its
- contents. We include the accumulator denoted by ac as a register. There are
- two functions of state vectors as introduced in [3], namely
- 1.
c(x, η) denotes the value of the contents of register x in machine state
- η.
- 2.
a(x, α, η) denotes the state vector that is obtained from the state vec-
- tor η by changing the contents of register x to α leaving the other registers
- unaffected.
- These functions satisfy the following relations:
- (3.8)
c(x, a(y, α, η)) = if x = y then α else c(x, η),
- (3.9)
a(x, α, a(y, β, η)) = if x = y then a(x, α, η) else a(y, β, a(x, α, η)),
- (3.10)
a(x, c(x, η), η) = η.
- Now we can define the semantics of the object language by
step(s, η) = if isli(s)then a(ac, arg(s), η)
(3.11)
else if isload(s) then a(ac, c(adr(s), η), η)
else if issto(s) thena(adr(s), c(ac, η), η)
elseif isadd(s) then a(ac, c(adr(s), η) + c(ac, η), η)
- which gives the state vector that results from executing an instruction and
- (3.12)
outcome(p, η)
= if null(p) then η else outcome(rest(p), step(first(p), η))
- which gives the state vector that results from executing the program p with
- state vector η.
- The following lemma is easily proved.
- (3.13)
outcome(p1 ∗ p2, η) =
outcome(p2, outcome(p1, η))
- 4 The compiler
- We shall assume that there is a map giving for each variable in the expression
- a location in the main memory of the machine
loc(ν, map) gives this location
- and we shall assume
- (4.1)
c(loc(ν, map), η) = c(ν, ξ)
- as a relation between the state vector η before the compiled program starts to
- act and the state vector ξ of the source program.
- Now we can write the compiler. It is
compile(e, t) = if isconst(e) then mkli(val(e))
- (4.2)
else if
isvar(e) then mkload(loc(e,map))
else if issum(e) then compile(s1(e), t) ∗ mksto(t) ∗ compile(s2, t + 1) ∗ mkadd(t)- Here t is the number of a register such that all variables are stored in
- registers numbered less than t, so that registers t and above are available for
- temporary storage.
- Before we can state our definition of correctness of the compiler, we need
- a notion of partial equality for state vectors
ζ1 =A ζ2,
- where ζ1 and ζ2 are state vectors and A is a set of variables means that cor-
- responding components of ζ1 and ζ2 are equal except possibly for values, of
- variables in A. Symbolically, x /∈ A ⊃
c(x, ζ1) = c(x, ζ2). Partial equality
- satisfies the following relations:
- (4.3)
ζ1 = ζ2 is equivalent to ζ1 ={} ζ2, where {} denotes the empty set ,
- (4.4)
if A ⊂ B and ζ1 =A ζ2 then ζ1 =B ζ2.
- (4.5)
if ζ1 =A ζ2 then
a(x, α, ζ1) =A−{x} a(x, α, ζ2).
- (4.6)
if x ∈ A then
a(x, α, ζ) =A ζ,
- (4.7)
if ζ1 =A ζ2 and ζ2 =B ζ3 then ζ1 =A∪B ζ3.
- In our case we need a specialization of this notation and will use
ζ1 =t ζ2 to denote ζ1 ={x|x≥t} ζ2
ζ1 =ac ζ2 to denote ζ1 ={ac} ζ2
- and
- and
ζ1 =t,ac ζ2 to denote ζ1 ={x|x=ac∨x≥t} ζ2.
- The correctness of the compiler is stated in
- THEOREM 1. If η and ξ are machine and source language state vectors
- respectively such that
- (4.8)
c(
loc(v, η) = c(v, ξ),
then
outcome(compile(e, t), η) =t a(ac,value(e, ξ), η).
- It states that the result of running the compiled program is to put the
- value of the expression compiled into the accumulator. No registers except the
- accumulator and those with addresses ≥ t are affected.
- 5 Proof of Theorem 1.
- The proof is accomplished by an induction on the expression e being compiled.
- We prove it first for constants, then for variables, and then for sums on the
- induction hypothesis that it is true for the summands. Thus there are three
- cases.
- I.
isconst(e). We have
outcome(compile(e, t), η) = outcome(mkli(val(e)), η)
= step(mkli(val(e)), η)
= a(ac, arg(mkli(val(e))), η)
= a(ac, val(e), η)
= a(ac, value(e, ξ), η)
=t a(ac, value(e, ξ), η).
Justification
4.2
3.12, 3.1
3.1, 3.11
3.1
2.1
4.3, 4.4
- II.
isvar(e). We have
outcome(compile(e, t), η)
= outcome(mkload(loc(e, map)), η)
= a(ac, c(adr(mkload(loc(e))), η), η)
= a(ac, c(loc(e, map), η, η)
= a(ac, c(e, ξ), η)
= a(ac, value(e, ξ), η)
=t a(ac, value(e, ξ), η).
4.2
3.12, 3.2, 3.113.2
4.1
2.1
4.3, 4.4
- III.
issum(e). In this case, we first write
outcome(compile(e, t), η)
= outcome(compile(s1(e), t) ∗ mksto(t)
∗compile(s2(e), t + 1) ∗ mkadd(t), η)
by 4.2
= outcome(mkadd(t), outcome(compile(s2(e), t + 1),
outcome(mksto(t), outcome(compile(s1(e), t), η))))
by 3.13
- using the relation between concatenating programs and composing the func-
- tions they represent. Now we introduce some notation. Let
ν =
value(e, ξ),
ν1 = value(s1(e), ξ),
ν2 = value(s2(e), ξ),
- so that ν = ν1 + ν2. Further let
ζ1 =
outcome(compile(s1(e), t), η),
ζ2 = outcome(mksto(t), ζ1),
ζ3 = outcome(compile(s2(e), t + 1), ζ2),
ζ4 = outcome(mkadd(t), ζ3)
- so that ζ4 = outcome(
compile(e, t), η, and we want to prove that
ζ4 =t a(ac, ν, η).
- We have
ζ1 =
outcome(compile(s1(e), t), η)
=t a(ac, ν1, η)
Induction Hypothesis
- and
- Now
- and
- Next
c(ac, ζ1) = ν1.
3.8
ζ2 = outcome(mksto(t), ζ1)
= a(t, c(ac, ζ1), ζ1)
= a(t, ν1), ζ1)
=t+1 a(t, ν1, a(ac, ν1, η))
=t+1,ac a(t, ν1, η)
3.12, 3.3, 3.11
Substitution
4.5
4.5, 3.9
c(t, ζ2) = ν1
3.8
ζ3 = outcome(compile(s2(e), t + 1), ζ2)
= t+1a(ac, ν2, ζ2).
- Here we again use the induction hypothesis that
s2(e) is compiled correctly.
- In order to apply it, we need
c(loc(ν,map),ζ2) = c(ν, ξ) for each variable ν
- which is proved as follows:
c(loc(ν, map), ζ2) = c(loc(ν), map)a(t, ν1, η)) since loc(ν, map) < t
= c(loc(ν, map), η) for the same reason
= c(ν, ξ) by the hypothesis of the theorem.
ζ3 =t+1 a(ac, ν2, a(t, ν1, η))
by 3.9
- Now we can continue with
- Finally,
ζ4 = outcome(mkadd (t), ζ3)
= a(ac, c(t, ζ3) + c(ac, ζ3), ζ3)
= a(ac, ν, ζ3)
=t+1 a(ac, ν, a(ac, ν2, a(t, ν1, η)))
=t+1 a(ac, ν, a(t, ν1, η))
=t a(ac, ν, η).
- This concludes the proof.
Definition of ν, substitution
3.12, 3.4, 3.114.53.93.9, 4.6, 4.7- 6 Remarks
- The problem of the relations between source language and object language
- arithmetic is dealt with here by assuming that the + signs in formulas (2.1) and
- (3.11) which define the semantics of the source and object languages represent
- the same operation. Theorem 1 does not depend on any properties of this
- operation, not even commutativity or associativity.
- The proof is entirely straightforward once the necessary machinery has
- been created. Additional operations such as subtraction, multiplication and
- division could be added without essential change in the proof.
- For example, to put multiplication into the system the following changes
- would be required. 1. Add
isprod(e), and p1(e), and p2(e) to the abstract
- syntax of the source language.
- 2. Add a term
- to Equation (2.1).
- 3. Add
if
isprod(e) then value(p1(e), ζ)× value(p2(e), ζ)
isprod(e) ∧ Φ(p1(e)) ∧ Φ(p2(e)) ⊃ Φ(e)
- to the hypotheses of the source language induction principle.
- 4. Add an instruction mul x and the three syntactical functions
ismul(s)
adr(r), mkmul(x) to the abstract syntax of the object language together with
- the necessary relations among them.
- 5. Add to the definition (3.11) of step a term
else if
ismul(s) then a(ac, c(adr(s), η) × x(ac, η), η).
- 6. Add to the compiler a term
- if
isprod(e)thencompile(p1(e), t) ∗ mksto(t) ∗ compile(p2(e), t + 1) ∗ mkmul(t).
- 7. Add to the proof a case
isprod(e) which parallels the case issum(e)
- exactly.
- The following other extensions are contemplated. 1. Variable length sums.
- 2. Sequences of assignment statements.
- 3. Conditional expressions.
- 4. go to statements in the source language.
- In order to make these extensions, a complete revision of the formalism will
- be required.
- 7 References
- 1. J. McCarthy, Computer programs for checking mathematical proofs, Proc.
- Sympos. Pure Math. Vol. 5, Amer. Math. Soc., Providence, R. I., 1962, pp.
- 219-227.
- 2. ———–, ”A basis for a mathematical theory of computation” in Com-
- puter programming and formal systems, edited by P. Braffort and D. Hersh-
- berg, North-Holland, Amsterdam, 1963.
- 3. ———–, Towards a mathematical theory of computation, Proc. Internat.
- Congr. on Information Processing, 1962.
- 4. ———–, A formal description of a subset of Algol, Proc. Conf. on
- Formal Language Description Languages, Vienna, 1964.