An explanation of higher order unification

module blog.higher-order-unification where

I’m interested in programming languages with dependent types (such as Agda) and in how they work. In the process of implementing type checking (and inference) for a toy dependent type theory, I have had to become familiar with higher order unification. In this post, I will explain the concepts relevant to understand the underlying problem, then I will explain the algorithm commonly used for this purpose: Huet’s algorithm.

1. The lambda cube

The λ-cube is a notion invented by Henk Barendregt to illustrate the three ways in which a simple type system1¹ By “simple” here I mean the simply typed lambda calculus. can be extended with new axes of dependence. Let’s first consider the base case of simple types through this Haskell type definition:

type ConcreteFunc = Int -> Int

If you inspect the type Int -> Int you will see that it doesn’t depend on anything. All the types referenced in there are concrete and known. There are no type variables. Type systems that only support this are at the “base” point of the λ cube. Basically your garden variety type systems. Let’s look at another type definition:

type PointedFunc a = Maybe a -> a

This time the body of the definition Maybe a -> a depends on a type variable bound in the type PointedFunc a part. This definition exhibits “type-on-type” dependence,2² It is helpful to look at the picture of the λ-cube. which is arguably the simplest way in which we can extend the simple type system. Next let’s look at a term:

wrap :: a -> Maybe a
wrap x = Just x

This time, we are looking at the definition body of wrap a. If you had a way to inspect the type of the variable x from inside the definition,3³ In Haskell, you can actually do this by replacing the whole Just x part with a single underscore _ – a hole. When you compile, this will cause an error and all the variables in the hole’s context will be listed together with their types. you would only see the type variable a. This is because, in a way, the definition depends on a type. If a turns out to be Int, then the function will return a Maybe Int. If it is String, it returns a Maybe String. We have introduced “value-on-type” dependence.

These last two extensions (type-on-type and value-on-type) are usually what functional programming languages of the Hindley-Milner family have. This class of type systems has what we call “full type inference”. This means that in principle, you don’t need to write down any type annotations anywhere because everything can be inferred by the type system.4⁴ In practice, writing down types is a good practice, because it serves as documentation and because writing down the types is a good way of ensuring your model of what is happening matches the compiler’s.

The last axis of extension, the one which motivates this post, is type-on-value dependence. This one is probably less familiar (to most functional programmers) than the first two, so let me give more details.

1.1. Dependent types

The first confusing thing that happens when we add type-on-value dependence to a type system is that the type-value separation collapses. What do I mean by type-value separation? In Haskell, a definition will look something like this:

fact :: Int -> Int
fact 0 = 1
fact n = n * fact (n - 1)

In this definition, there are two parts: the type Int -> Int and the body. These parts are written using two distinct grammars and use two different namespaces that can’t collide,5⁵ This is what makes possible the common practice of naming the constructor of a type the same as the type itself. Think data Person = Person { name :: String, age :: Int}. again, because these two syntactic categories are distinct.

What do things look like when we lose this separation?

open import Terra.Data.Nat
open import Terra.Data.Nat.Functions
open import Terra.Data.Unit renaming (⊤ to Unit)
open import Terra.Data.Counit renaming (⊥ to Empty)

postulate
  ⋯ : {A : Set} -> A

IsEven : Nat -> Set
IsEven zero = Unit
IsEven (suc zero) = Empty
IsEven (suc (suc n)) = IsEven n

func : (x : Nat) -> IsEven (x * 2)
func = ⋯

Definitions get pretty confusing. We have term variables in our type definitions, we have term variables in our type declarations, we use the same syntax to declare parametric types as we do to declare parametric terms.6⁶ Also known as functions hihi.

This is where we get to the motivation of this post. Apart from breaking the nice type and value distinction, having dependent types also makes type inference a lot harder. To see why, let’s look at how type inference works in other systems.

2. Type inference and type checking

The main property of what we call a logical system is that proofs in that logical system can be checked in a decidable way. Either the proof is correct and the theorem is true, or the proof is incorrect and we can’t say anything.7⁷ This idea gives rise to a very intuitive (but weaker) version of Gödel’s first incompleteness theorem. Not all truths in a system that can talk about Turing machines are provable, otherwise you could take an arbitrary Turing machine, then enumerate all possible proofs of it either halting or not halting (checking them along the way) and you would be sure to, at some point, find one proof that checks and tells you whether or not the machine halts. You would have a decision procedure for the halting problem. Because we know the halting problem is undecidable, you can’t have a logical system expressive enough to talk about Turing machines while also being complete.

In an analogous way,8⁸ Actually, through the Curry-Howard correspondence and an exotic-enough meaning for “literally the same”, those are literally the same. the main property of a type system is to have a nice type checking decision procedure. Having a program type check (in a well-made type system without escape hatches) usually means that the program will execute without runtime crashes. Type checking (a piece of code against a given type) is usually pretty easy and mechanical to implement, so we will focus on the other side of the process: type inference.

In a simple type system, type inference is also, well, simple. It usually consists of a big pattern match on each constructor of your syntax. Here is an example:

import Control.Monad

data Syntax = Var String
            | ConstInt Int
            | ConstBool Bool
            | Lambda String Type Syntax
            | App Syntax Syntax
            | Plus Syntax Syntax
            | If Syntax Syntax Syntax
            deriving (Show, Eq)

data Type = Func Type Type
          | BoolType
          | IntType
          deriving (Show, Eq)


-- Let's just see what type it infers
main :: IO ()
main = print $ infer [] (If (ConstBool True)
                         (ConstInt 1)
                         (ConstInt 0))


infer :: [(String, Type)] -> Syntax -> Maybe Type
infer ctx (Var name) = lookup name ctx
infer ctx (ConstInt _) = Just IntType
infer ctx (ConstBool _) = Just BoolType
infer ctx (Lambda var t body) = do
  bodyType <- infer ((var, t):ctx) body
  return (Func t bodyType)
infer ctx (Plus a b) = do
  aType <- infer ctx a
  guard $ aType == IntType

  bType <- infer ctx b
  guard $ bType == IntType

  return IntType
infer ctx (If c y n) = do
  cType <- infer ctx c
  guard $ cType == BoolType

  yType <- infer ctx y
  nType <- infer ctx n
  guard $ yType == nType

  return yType
infer ctx (App f v) = do
  (Func a b) <- infer ctx f
  a' <- infer ctx v
  guard $ a == a'

  return b

Just IntType

I direct your attention to the last couple of clauses. While type inference, well, infers the type of the terms, it also enforces type checking conditions through equality assertions like guard $ a == a'. If we pass infer an If expression where the condition is not a boolean, it shouldn’t return a value, because the term doesn’t type check. Additionally, to check whether or not an application makes sense, we only need to verify that the type in the domain of the function is equal to the type of the argument.

Let’s move on to the Hindley-Milner family of type systems. This time, let us use haskell not as a meta language (like for our previous example), but as the object language. Let’s first consider that we are at a point in the type inference process where we know the type of the unwrapDefault definition.

unwrapDefault :: Maybe a -> a -> a
unwrapDefault (Just x) _ = x
unwrapDefault Nothing def = def

This time, let’s see what happens when we try to infer the type of value.

value = unwrapDefault (Just 1234) 0

This is equivalent (through currying), to:

value = (unwrapDefault (Just 1234)) 0

Now let’s focus on the unwrapDefault (Just 1234) subterm. We know that unwrapDefault has type Maybe a -> (a -> a) and that Just 1234 has type Maybe Int. We want to determine if that application is valid. In the simple type system case, we only had to assert that the domain of the function type was equal to the type of the argument, but here, we have a type variable in the domain of our function type. The correct notion of compatibility here is unification. We ask: “is there some possible value of a that makes Maybe a = Maybe Int?”

The answer, obviously, is “Yes: a = Int.” Once we have this answer, we then instantiate the variable a to be Int. The instantiated type of unwrapDefault is then Maybe Int -> (Int -> Int) and we can carry on with inference.

The procedure that takes in an equality with variables and gives you back a value for that variable is called a unification procedure and it is implemented, apart from some subtleties,9⁹ Specifically, we must handle unification of variables with types containing other variables with care. You must also mind the proper threading of substitutions throughout type checking/inference. pretty much how you would expect. This type of unification is referred to as first-order, because the kind of structure-with-variables we want to unify over has a first-order notion of equality: two types are equal if and only if they are equal as syntax when you write them down.

Now let’s do as in the previous section and see what happens when we introduce types-on-terms dependence. The first thing to notice is that we now encounter questions such as “is the type IsEven 4 the same as IsEven (2 + 2)?” These types are not equal syntactically, but we know they mean the same thing. We therefore have to introduce into our type checking the notion of “normalization”. Two types are equal if their normalized forms are equal too. This creates weird scenarios where unification problems can only be solved by substituting type-level functions into the metavariables.

For instance, let’s say we are trying to infer the type of a function g after having encountered two call sites telling us that

g 2 : IsEven 2
g 4 : IsEven 4

Let’s call the metavariable we have to solve for $T$ . We know that the type of g is equal to $(x : \mathsf{Nat}) \to T(x)$ . Now we just need to find a value of $T$ that satisfies the two equations:

$\begin{align*} T(2) &= \mathsf{IsEven}(2) \\ T(4) &= \mathsf{IsEven}(4) \end{align*}$

A higher-order unification algorithm will be able to solve for $T$ and give us $T = \lambda(n : \mathsf{Nat}). \mathsf{IsEven}(n)$ . This is quite a simple case, but the algorithm will be able to solve much harder problems.

Armed with such an algorithm, one can derive a type checker (and inference algorithm) by recursion on the tree of terms and by switching between normalization, unification, checking.10¹⁰ In [1] they present an algorithm that refactors the checking and inference as a non mutually recursive process. I find that approach much more manageable, because there are fewer moving pieces to keep in one’s head, but I heard it is harder to provide nice error messages when implementing checking through that approach.

3. Huet’s algorithm

The algorithm that is usually used for higher order unification is Huet’s algorithm and it operates on sets of constraints with metavariables. A constraint is a pair of two terms that we want to render equal through the right choice of metavariable value. A single constraint could be something like $T(2) = \mathsf{IsEven}(2)$ from earlier. We have a set of them and the algorithm progresses by transforming (nondeterministically) these constraints through three procedures that are applied in a loop.

Simplification is used on a rigid-rigid constraint. It eliminates trivial constraints (for instance, between a term and itself) and decomposes complex constraints into smaller ones.
For each flex-rigid pair, two kinds of substitutions are attempted: imitation and projection. Those are the ones that actually attempt to give values to the metavariables.

3.1. Normalization

The first thing to note about the algorithm is that it requires working with constraints where the two terms are normalized. That means that we must not have a trivial toplevel introduction + elimination. Maintaining this invariant can be done through smart constructors, or some other technique as long as every operation operates on constraints with normalized terms.

Once the two sides of a constraint have been normalized, we can classify the terms in one of two classes:

Flexible
Rigid

Flexible terms are those of the form $M(a_0, \dots)$ . They are applications where the function part is a metavariable. The rigid terms are the rest. Constraints can be split into three classes: flex-flex, flex-rigid and rigid-rigid. Classifying these constraints is important, because the operations I have not yet explained only work on their specific class.

3.2. Simplification

Simplification works on rigid-rigid constraints and its purpose is to:

Eliminate trivially true constraints.
Detect the presence of unsatisfiable constraints.
Split complex constraints into smaller constraints.

Here’s how it works. Let’s say our language has pairs and we have the following constraint:

$(A, B) = (C, D)$

We know this constraint is solvable if and only if those two constraints are both solvable:

$\begin{align*} A &= C \\ B &= D \end{align*}$

So we split the initial constraint into these two new constraints.

If, on the other hand, we see:

$\mathsf{Left}(A) = \mathsf{Right}(B)$

Then we know no value of $A$ or $B$ will make this constraint satisfiable. We signal an error. Other rigid-rigid pairs that can be reduced include:

function applications where the function is not a metavariable:

$f(A) = g(B) \leadsto \left\{ f = g, A = B \right\}$
lambda abstractions:

$\lambda_{x : t_1}.b(x) = \lambda_{y : t_2}.c(y) \leadsto \left\{ t_1 = t_2, b(z) = c(z) \right\}$

Where $z$ is a fresh variable. (if one used DeBruijn indices, this kind of α-conversion is unnecessary).
Pi types (same as lambda abstractions)
References: if two references don’t refer to the same variable, they are not the same.
Other primitives such as numbers, strings (if your language has those), and more.

3.3. Substitutions

When the simplification phase has halted, we are left with only flex-rigid and flex-flex pairs. The two remaining operations don’t act on flex-flex pairs, as those are underdetermined and have infinite solutions.

Let’s recall the general shape of a flex-rigid constraint:

$M(a_1,\dots,a_k) = f(b_1, \dots, b_l)$

On such a constraint, we can (nondeterministically) attempt two operation types.

3.3.1. Imitation

$M(a_1,\dots,a_k) = f(b_1, \dots, b_l)$

Here, we “mimic” the right hand side of the equation by copying the $f$ . We replace the metavariable $M$ by a function that takes the appropriate number of arguments and calls $f$ on (at this point undetermined) functions of those arguments.

$M \Rightarrow \lambda_{a_1 : A_1} \cdots \lambda_{a_k : A_k}. f(X_1(a_1, \dots, a_k), \dots, X_l(a_1, \dots, a_k))$

We reuse every argument that was given to the $M$ function, but leave it to a future iteration to find how. This is why we generate fresh metavariables $X_i$ for each argument we see $f$ is supposed to take. Once we apply this substitution, we get the following constraint:

$(\lambda_{a_1 : A_1} \cdots \lambda_{a_k : A_k}. f(X_1(a_1, \dots, a_k), \dots, X_l(a_1, \dots, a_k)))(a_1, \dots, a_k) = f(b_1, \dots, b_l)$

Which reduces to:

$f(X_1(a_1, \dots, a_k), \dots, X_l(a_1, \dots, a_k)) = f(b_1, \dots, b_l)$

Which the next iteration’s simplification phase will further simplify (this is a rigid-rigid pair) to the following set of constraints:

$\left\{ X_1(a_1, \dots, a_k) = b_1, \dots, X_l(a_1, \dots, a_k) = b_l \right\}$

For any flex-rigid pair, only one such imitation substitution can be generated, so when we decide (nondeterministically) to do an imitation operation, we only have one choice per constraint.

3.3.2. Projection

Let’s look again at our flex-rigid pair:

$M(a_1,\dots,a_k) = f(b_1, \dots, b_l)$

Another substitution we could try is:

$M \Rightarrow \lambda_{a_1 : A_1} \cdots \lambda_{a_k : A_k}. a_1$

or even

$M \Rightarrow \lambda_{a_1 : A_1} \cdots \lambda_{a_k : A_k}. a_2$

In fact, we have one per argument in the flex application. What each of the substitutions in this family does is project out one argument, so we call this family “projection”.

If the “flex” part of the flex rigid constraint has $k$ arguments, then we can generate $k$ different projection substitutions.

3.4. Complete algorithm

What does an actual implementation of this algorithm look like? Here is some pseudocode:

hou :: [Substitutions] -> [(Term, Term)] -> [Substitution]
hou subs cs = do
  -- This part could fail
  simplified <- simplifyConstraints (subInConstraints subs cs)
  let flexRigids, flexFlexes = separate flexrigidp simplified
  if length flexRigids == 0
    then return subs
    else
    let fr = flexRigids !! 0
    oneSub <- generateImitation fr : generateProjections fr
    hou (oneSub:subs) cs

We start with a set of constraints we inherit from the calling context, then apply those to the constraints and simplify. We either fail (in the pseudocode this is handled by returning an empty list and thus making the list monad backtrack), or end up with a big set of flex-rigid constraints.
Then, we separate the flex-rigid from the flex-flex constraints. There is nothing we can do about the flex-flex constraints, so we will just ignore them and hope that whatever substitution we end up choosing turns them into flex-rigid pairs.
If we have no flex rigid pairs, then there is nothing we can do and we just return our current progress.
Otherwise, we take any flex rigid pair (here we take the first), and choose (nondeterministically) either the imitation substitution or one of the projections.
We add it to our list of substitutions and recurse.

Then, it would suffice to call subInTerm subs term to get back a term with all the holes filled-in!

4. Tying it all together

In this post, we saw how each extension to the simply typed lambda calculus (as seen in the λ cube) makes the process of type inference more difficult. Then, we saw why inference for Hindley-Milner type systems (which have both type-on-type and value-on-type dependency) requires doing first-order unification in the type fragment. We have also seen how extending type inference from HM to dependent types forces us to extend what is initially first-order unification in the type fragment to higher order unification in the whole language. Finally, we looked at the structure of Huet’s algorithm for higher order unification and looked at some pseudocode implementation.

5. References

[1]

F. Mazzoli and A. Abel, “Type checking through unification,” Preprint, Sep. 2016, doi: 10.48550/arXiv.1609.09709.

Footnotes:

By “simple” here I mean the simply typed lambda calculus.

It is helpful to look at the picture of the λ-cube.

In Haskell, you can actually do this by replacing the whole Just x part with a single underscore _ – a hole. When you compile, this will cause an error and all the variables in the hole’s context will be listed together with their types.

⁴

In practice, writing down types is a good practice, because it serves as documentation and because writing down the types is a good way of ensuring your model of what is happening matches the compiler’s.

⁵

This is what makes possible the common practice of naming the constructor of a type the same as the type itself. Think data Person = Person { name :: String, age :: Int}.

⁶

Also known as functions hihi.

⁷

This idea gives rise to a very intuitive (but weaker) version of Gödel’s first incompleteness theorem. Not all truths in a system that can talk about Turing machines are provable, otherwise you could take an arbitrary Turing machine, then enumerate all possible proofs of it either halting or not halting (checking them along the way) and you would be sure to, at some point, find one proof that checks and tells you whether or not the machine halts. You would have a decision procedure for the halting problem. Because we know the halting problem is undecidable, you can’t have a logical system expressive enough to talk about Turing machines while also being complete.

⁸

Actually, through the Curry-Howard correspondence and an exotic-enough meaning for “literally the same”, those are literally the same.

⁹

Specifically, we must handle unification of variables with types containing other variables with care. You must also mind the proper threading of substitutions throughout type checking/inference.

¹⁰

In [1] they present an algorithm that refactors the checking and inference as a non mutually recursive process. I find that approach much more manageable, because there are fewer moving pieces to keep in one’s head, but I heard it is harder to provide nice error messages when implementing checking through that approach.