Algebraic dynamic programming

Note: This was partially copied and Julia-translated from Mattéo’s Midiro presentation.

1. Motivation

The class of algorithms known as “dynamic programming” covers a large class of use cases and the different problems classified as “dynamic programming” sometimes seem too different to be related. Algebraic dynamic programming is a unifying framework through which view dynamic programming problems that allows, through its compositional nature,11 Compositional and algebraic in the same vein as algebraic data types. to understand complex algorithms as the sum of its parts.

The main concept we are going to use throughout this post is the idea of representing dynamic programming problems as shortest path problems in some weird abstract graph. By changing the specific semi ring we use to express the weight of edges in some problem’s graph, we can decompose many analyses and algorithms on a problem into small understandable parts. For instance, we are going to see how “compute the length of the shortest path in this graph” and “give me all paths with this lengths” and “how many different paths are there?” can be derived from a single generic function on a graph.

2. Background: semi rings

The generic interface we are going to use to derive algorithms is going to be based on semi rings. It turns out that semi rings possess exactly the kind of structure we need from a notion of “edge graph”. However, before we see how they are used in dynamic programming, we have to build an intuition of what they are.

In the more usual type of mathematics one might do, a semi ring is an algebraic structure that incarnates the essence of “sum of products”. A couple of examples of semi rings would be the natural numbers, square matrices (in two different ways), real numbers, polynomials, and more. More formally, a semi ring \( S \) is a set with the following properties/data:22 There are other properties, but they aren’t as useful for intuition.

  1. Has an addition operation \( + : S^2 \to S \)
  2. Has a multiplication operation \( \times : S^2 \to S \)
  3. Has additive identity \( 0 \in S \) which means that \( 0 + x = x = x + 0 \) for any \( x \).
  4. Has a multiplicative identity \( 1 \) which means that \( 1 \times x = x = x \times 1
   \) for any \( x \).
  5. Multiplication distributes over addition: \( a \times (b + c) = a \times b + a \times c \).

Through this post, we are going to work with Julia, which allows use to overload addition (Base.:+), multiplication (Base.:*), zero (Base.zero) and one (Base.one) for any data type. We are going to use this feature to make the code look similar to the math, but this is not strictly necessary for implementation, just a nice bonus. Once we define our first semi ring, the overloading behavior will be explained in more details.

abstract type Semiring end

3. Dynamic programming problems as graph traversal

As I said, dynamic programming problems can be framed as finding the shortest bath in some appropriate abstract graph. Let’s look, as an example, at the editing distance between two string. As you would expect, the editing distance between \( a \) and \( b \) is basically the smallest number of operations we need to transform string \( a \) into string \( b \). By “operation”, we mean:

  1. Inserting a character.
  2. Removing a character.
  3. Keeping a character.

The “keep a character” operation, as we will see, doesn’t contribute to the edit distance, but is necessary to make our graph have the necessary edges.

As an example, let’s take the strings “carp” and “carve”. We know intuitively that the fastest way to turn one into the other is to keep “car” and replace the “p” by “ve”. To express this transformation as a sequence of operations, we would say “keep(c), keep(a), keep(r), delete(p), insert(v), insert(e)”. Now I must explain (like I promised I would) how finding the shortest (with respect to some notion of distance) sequence of operation can be framed as a graph shortest path. When expressed like that, it doesn’t sound as mystical.

Imagine a graph where:

  1. Each vertex is a pair of strings \( (a, b) \) representing an intermediate result: “I know how to transform \( a \) into \( b \)”.
  2. An edge between \( (a, b) \) and \( (b, c) \) is an operation that, when added to the end of an “\( a \) to \( b \)” transformation, gives us a “\( c
   \) to \( d \)” transformation.

More visually (blue path is the optimal transformation):33 Pardon me the ugly overflow.

carp-carve.svg

It is probably easier to see now how this becomes a shortest path problem. However, while it might be tempting to just stick normal weights on this graph, then apply Dijkstra’s algorithm, we would lose out on some level of genericity. Having our interface be “function that extracts a (weighted) graph” would fulfil some part of the promise: one solver works on all dynamic programming problems. However, it would still be hard to trick the solver into doing weird ancillary computations for us.

For that reason, we are going to weaken our assumptions about the graph weighs in exchange for a more general algorithm.

Let’s say we have some weighted graph where the weight of each edge is part of some semi ring \( S \). We will denote the weight of an edge using the \(
\text{weight}_S(\cdot) \) function. Let’s also define the weight of a path (continuous sequence of edges) as the product of the weights of all edges in the path. Finally, let’s define the weight of a subgraph bounded by two nodes \( a
\) and \( b \) as the sum of the weights of all paths from \( a \) to \( b
\).44 This whole thing looks a lot like what the wikipedia page on weighted automata calls the weight of a word.

\[ \text{subgraphweight}_S(a,b) = \sum_{p \in \text{Paths}(a, b)} \prod_{e \in p}
\text{weight}_S(e)\]

Through this very general framework that looks a lot like simply computing the “weighted sum” of a subgraph, we can implement diverse algorithms.

4. Implementation of edit distance.

Let’s write a concrete implementation of this “algebraic dynamic programming” idea in julia. Again, we will be using the string editing distance as a toy problem. Here are the definitions of the string operations:

abstract type Op end
struct Ins <: Op
    c::Char
end
struct Del <: Op
    c::Char
end
struct Keep <: Op
    c::Char
end

In many cases, it is complicated (and expensive!) to enumerate all possible paths from source to destination, but for edit distance, the graph is easy to write down.

Here, we compute \( \text{subgraphweight}_S(a,b) \) recursively by using the weight of sub strings.

# Note: for this function to be efficient, you would ideally use
# memoisation. Here I don't to keep the presentation clean.
function subgraphweight(T::Type{<:Semiring}, s1, s2)
    # We will be computing a sum, so it makes sense to initialize this
    # with zero. Notice we pass the semi ring type to the zero
    # function. This is not your garden variety zero, but the one from
    # your chosen semi ring!
    o = zero(T)

    # From a modification that transforms s1[1:end-1] into s2, we can
    # create a transformation that transforms s1 into s2 by appending
    # Del(s1[end]) at the start of the transformations sequence.
    if length(s1) != 0
        o += subgraphweight(T, s1[1:end-1], s2) * weight(T, Del(s1[end]))
    end

    # Same thing, but for insertion.
    if length(s2) != 0
        o += subgraphweight(T, s1, s2[1:end-1]) * weight(T, Ins(s2[end]))
    end

    # When the last character of both strings are the same, we can
    # compute the subgraph weight of the prefix, then multiply it by
    # the weight of Keep(c) at the end.
    if length(s1) != 0 && length(s2) != 0 && s1[end] == s2[end]
        o += subgraphweight(T, s1[1:end-1], s2[1:end-1]) * weight(T, Keep(s1[end]))
    end

    # Finally, if both strings are empty, then the weight of the
    # subgraph is the product of the weights of an empty path. We
    # define an empty product to be equal to one, so that gives us
    # one.

    if length(s1) == 0 && length(s2) == 0
        o += one(T)
    end
    o
end

Now that the function that computes the weight of our subgraph \(
\text{subgraphweight}_S(a,b) \) is done, we can poke it by evaluating it on different choices of \( S \) and \( \text{weight}_S \).

4.1. Tropical semi ring (minimum cost)

The first semi ring we are going to look at is the tropical semi ring. While it might seem very exotic at first, you will see that it lets us recover the “normal” behavior – find the lowest possible cost – that we are used to.

  1. The carrier set of the semi ring is the set of real numbers plus infinity, which isn’t usually part of the real numbers.
  2. The semiring addition \( a \oplus b \) on two elements is \( \min(a,b) \). It interacts as you would expect with infinity.
  3. Multiplication of semi ring elements is addition of reals.
  4. The neutral element under addition is plus infinity: \( \min(\infty, b) = b = \min(b, \infty) \).
  5. The neutral element under multiplication is zero: \( 0 + a = a = a + 0 \).
  6. Multiplication distributes over addition: \( a + \min(b, c) = \min(a + b, a + c)  \).

Let’s define this semi ring structure in code:

struct MinCost <: Semiring
    cost
end

Base.zero(::Type{MinCost}) = MinCost(Inf)
Base.one(::Type{MinCost}) = MinCost(0)

Base.:*(a::MinCost, b::MinCost) = MinCost(a.cost + b.cost)
Base.:+(a::MinCost, b::MinCost) = MinCost(min(a.cost, b.cost))

Now that we have the semi ring set up, we need to define our weighting function in terms of it. As you would expect, adding or deleting a character costs 1 unit, whereas keeping a character costs nothing.

weight(::Type{MinCost}, ::Ins) = MinCost(1)
weight(::Type{MinCost}, ::Del) = MinCost(1)
weight(::Type{MinCost}, ::Keep) = MinCost(0)

It is easy to see that if we combine sequentially two operations (through multiplication), then their cost will be summed. On the other hand, if we take the sum (combine horizontally two operations), then we will keep the minimum of the two costs. This reflects our intuition that the weight of a subgraph between \( a \) and \( b \) is the smallest weight of all possible paths.

Now if we evaluate the graph weight function on “carp” and “carve”, we should get something that is at least as cheap as the three operations we used earlier.

subgraphweight(MinCost, "carp", "carve")
MinCost(3.0)

It works! However, it is not very useful to only see the cost of the best solution, we might also want the solution itself!

4.2. The free semi ring

We want the optimal weight to be paired with its witness: the actual path in the graph that has this optimal weight. However, while the subgraphweight algorithm implicitely computes this shortest path, we want it to give us an explicit representation. We could manually construct a semi ring that combines the “minimum path length” with “remember concrete path” behavior, but you will see that this semiring is in fact a composition of the tropical semi ring and something called the free semi ring.

In algebra, the “free \( X \) generated by the alphabet \( A \)” is a very standard construction where we take the \( A \) set, then add just enough elements, then glue together just enough point such that \( A' \) becomes an X. This construction usually look a lot like syntax for an abstract \( X \) with variables in \( A \).

In the case of semi rings, this means we must make the elements multiply, add, contain a neutral element for those two operations et cetera. It turns out that once we have done that, what we end up with is a semi ring where:

  1. Each element of \( \text{Free}(A) \) is a set of sequences of \( A \)’s.
  2. Addition is union of sets of sequences.
  3. Multiplication is concatenation of every left sequence with every right sequence.
  4. Additive identity is the empty set.
  5. Multiplicative identity is the set that only contains the empty sequence.

Implemented in Julia, those operations look like this:

struct Free{A} <: Semiring
    choices::Set{Vector{A}}
end

Free(s::Set{Vector{A}}) where A = Free{A}(s)

Base.zero(::Type{Free{A}}) where A = Free{A}(Set([]))

Base.one(::Type{Free{A}}) where A = Free{A}(Set([[]]))

Base.:+(a::Free{A}, b::Free{A}) where A = Free{A}(a.choices ∪ b.choices)
Base.:*(a::Free{A}, b::Free{A}) where A = Free{A}(
    Set([vcat(parta, partb)
         for parta in a.choices
         for partb in b.choices]))

Something to note is that for each element of our alphabet set, there is a corresponding element in the free semi ring. This suggests that the appropriate weight function for some operation should just be the associated element in the free semi ring.

weight(::Type{Free{Op}}, o::Op) = Free{Op}(Set([[o]]))

Each time the subgraphweight function combines the weights of two subgraphs into a single big subgraph weight, it will actually be combining all left paths with all right paths. When two possible paths from \( a \) to \( b \) are added, the two sets of possible paths are unioned. At the end of the function’s execution, it will return a set of all possible operation sequences that start at \( a \) and end at \( b \).

Let’s see what those look like in the case of “carp” and “carve”:

# (Il y en a vraiment beaucoup)
for seq in Iterators.take(subgraphweight(Free{Op}, "carp", "carve").choices, 10)
    println(seq)
end
Op[Keep('c'), Keep('a'), Keep('r'), Del('p'), Ins('v'), Ins('e')]
Op[Keep('c'), Ins('a'), Ins('r'), Ins('v'), Del('a'), Del('r'), Del('p'), Ins('e')]
Op[Keep('c'), Ins('a'), Del('a'), Ins('r'), Ins('v'), Del('r'), Ins('e'), Del('p')]
Op[Keep('c'), Ins('a'), Del('a'), Del('r'), Ins('r'), Ins('v'), Del('p'), Ins('e')]
Op[Keep('c'), Ins('a'), Del('a'), Ins('r'), Del('r'), Ins('v'), Ins('e'), Del('p')]
Op[Keep('c'), Ins('a'), Ins('r'), Del('a'), Ins('v'), Del('r'), Ins('e'), Del('p')]
Op[Del('c'), Ins('c'), Ins('a'), Ins('r'), Ins('v'), Ins('e'), Del('a'), Del('r'), Del('p')]
Op[Ins('c'), Ins('a'), Del('c'), Ins('r'), Del('a'), Del('r'), Del('p'), Ins('v'), Ins('e')]
Op[Ins('c'), Del('c'), Ins('a'), Ins('r'), Del('a'), Ins('v'), Del('r'), Ins('e'), Del('p')]
Op[Del('c'), Del('a'), Del('r'), Ins('c'), Ins('a'), Ins('r'), Ins('v'), Ins('e'), Del('p')]

Even though the two words are quite similar, there are a lot of ways to turn one into the other and enumerating all of them (especially the longer ones) is not really the point. Where the free semi ring shines is when we use it in combination with other semi rings as we will be doing soon.

Before we go there, however, let me go through one last building block. How would we go about counting the number of distinct paths, if we did not want to pay the overhead of even constructing them?

4.3. Count semi ring

struct Count <: Semiring
    n::Int
end

Base.zero(::Type{Count}) = Count(0)

Base.one(::Type{Count}) = Count(1)

Base.:+(a::Count, b::Count) = Count(a.n + b.n)
Base.:*(a::Count, b::Count) = Count(a.n * b.n)

weight(::Type{Count}, ::Op) = Count(1)

The “counting semi ring” is the good old natural numbers. Let’s use it on the Weight function, for a laugh

println(subgraphweight(Count,"carp", "carve"))
Count(224)

let’s compare to what the free semiring gives us

println(length(subgraphweight(Free{Op}, "carp", "carve").choices))
224

Let’s see how big of a difference this special semi ring makes on the runtime of the two implementations of “count numbers”:

@time subgraphweight(Count,"carp", "carve")
@time length(subgraphweight(Free{Op}, "carp", "carve").choices)
0.000021 seconds (258 allocations: 8.062 KiB)
0.000898 seconds (30.41 k allocations: 1.887 MiB)

This is not really a good benchmark, but one is two order of magnitues faster than the other.

4.4. Selector semi ring

As promised, we are going to define the combinator that lets us construct the “give me the best transformation” semi ring out of the MinCost and the Free semi rings.

We will construct a semi ring of pairs where the first component is the cost and the second, the payload. However, we are going to ask from our first semi ring that it also has the structure of an order. This is the case of the MinCost semi ring.

Base.isless(a::MinCost, b::MinCost) = a.cost < b.cost

Once we have an ordered semi ring and a payload semi ring, here is how it goes:

  1. Addition of \( (c_1, p_1) \) and \( (c_2, p_2) \) compares \( c_1 \) and \(
   c_2 \). If they are equal, return \( (c_1, p_1 + p_2) \). If they are not, return the pair with the lowest cost.
  2. Multiplication is component-wise multiplication.
  3. Additive identity is the pair of additive identities.
  4. Multiplicative identities is the pair of multiplicative identities.

Here is how the implementation looks:

struct Sel{A,B} <: Semiring
    a::A
    b::B
end

Base.zero(::Type{Sel{A,B}}) where {A,B} = Sel{A,B}(zero(A), zero(B))
Base.one(::Type{Sel{A,B}}) where {A,B} = Sel{A,B}(one(A), one(B))

Base.:*(l::Sel{A,B},r::Sel{A,B}) where {A,B} = Sel{A,B}(l.a * r.a, l.b * r.b)
Base.:+(l::Sel{A,B},r::Sel{A,B}) where {A,B} =
    if l.a == r.a
        Sel{A,B}(l.a, l.b + r.b)
    elseif l.a < r.a
        l
    elseif l.a > r.a
        r
    end

Something interesting to notice is that because this semi ring essentially inherits all of its structure from two other semi rings, it can also inherit its weight function from the two semi rings.

weight(::Type{Sel{A,B}}, x) where {A,B} = Sel(weight(A, x), weight(B, x))

It is this construction that lets us compute the best candidate while remembering which is the best candidate.

r = subgraphweight(Sel{MinCost, Free{Op}}, "carp", "carve")
println("cost: $(r.a)")
for elem in r.b.choices
    println("sequence: $(elem)")
end
cost: MinCost(3)
sequence: Op[Keep('c'), Keep('a'), Keep('r'), Ins('v'), Del('p'), Ins('e')]
sequence: Op[Keep('c'), Keep('a'), Keep('r'), Del('p'), Ins('v'), Ins('e')]
sequence: Op[Keep('c'), Keep('a'), Keep('r'), Ins('v'), Ins('e'), Del('p')]

It looks like there are three optimal ways to edit “carp” into “carve”. However, if all we care about is the number of optimal solutions, there is a simpler way to do that.

r = subgraphweight(Sel{MinCost, Count}, "carp", "carve")
println("cost: $(r.a)")
println("count: $(r.b)")
cost: MinCost(3)
count: Count(3)

Notice that by expressing everything in terms of an algebraic structure, we can easily and quickly combine various components to derive some new correct-by-construction55 Showing the “correct-by-construction” part would require some more work, but it can be done. Moreover, it is likely that the proof structure will also follow the shape of the combinations we make. This is one other reason why algebraic dynamic programming is so nice: it is a framework in which the correctness of algorithms can be studied in a much simpler way. algorithms. This is the essence of compositionality: we define a rich language of building blocks through which we can express any (in-class) algorithm we want. We pay a higher upfront cost for implementing the primitives and the combinators, but constructing some new thing becomes much easier.66 This abstract principle can be seen everywhere: in Parser combinators and Game theory, for instance.

5. Generalized Floyd-Warshall algorithm

Since the beginning, we have framed as graph problems a problem that isn’t intrinsically a graph problem. What about graph-native stuff, like computing the shortest path? In this case, we are going to do a different translation. Instead of taking some dynamic programming problem and shoehorning it into graphs, we are going to take a graph problem and shoehorn it into generalized (over semi rings) matrix multiplications.

5.1. The matrix product

Let there be two matrices that we can multiply

\[ M : \mathbb{R}^{a \times b} \]

\[ N : \mathbb{R}^{b \times c} \]

The formula to evaluate their product is the following:

\[ (MN)_{ik} = \sum_{j=1}^{b} M_{ij}N_{jk} \]

The main thing to notice here is that if we forget for a moment the meaning of these symbols, we recognize that the only operations we really need(defined on the components) to compute matrix multiplication are those carried by a semi ring. We have sums, we have products and that’s pretty much it.

5.2. Adjacency matrices

Let there be a directed graph with weights on the edges:

some-graph.svg

A representation of graphs that is often useful77 The first use I can think of (apart from the one we are going to see), is when we consider the graph as a Markov chain. However, there are a lot more. is its adjacency matrix. This matrix is constructed by encoding the relationship between the \( i \)’th node and the \( j \)’th node at a matrix’s \( ij \)’th component. Usually, that means putting a one where there is an edge and a zero otherwise.

A = [0  1  0  0  1
     0  0  1  1  0
     0  0  0  1  0
     0  0  0  0  0
     0  0  1  0  0]

One neat computation we can do using this matrix is count the number of length \( n \) paths from \( a \) to \( b \). We do that by taking the \( n \)’th power of the adjacency matrix.

println(repr("text/plain", A^2))
5×5 Matrix{Int64}:
 0  0  2  1  0
 0  0  0  1  0
 0  0  0  0  0
 0  0  0  0  0
 0  0  0  1  0

We can see that there are two paths of length 2 from node 1 to node 3. These are \( (1, 2, 3) \) and \( (1, 5, 3) \). Is the shape of this computation familiar? The way we use this adjacency matrix is very similar to how we used the Count semi ring earlier. This is because just like we generalized edge weights (and subgraph weight) to any semi ring, we can do the same for a graph’s adjacency matrix.

5.3. Generalized adjacency matrix

If we consider a graph where the weights live in some semi ring, the appropriate adjacency matrix is one where the components are those weighs; this is simply a change of representation. One obvious question to ask ourselves is: what does matrix multiplication mean?

Remember the definition of matrix multiplication:

\[ (MN)_{ik} = \sum_{j=1}^{b} M_{ij}N_{jk} \] If we consider the multiplication of a matrix with itself, this equation becomes

\[ (MM)_{ik} = \sum_{j=1}^{b} M_{ij}M_{jk}\] the \( ij \)’th component of the resulting matrix gives us the weight of the two-edge subgraph between \( i \) and \( k \).

In fact, taking this matrix’s powers will compute the same kind of things the Floyd-Warshall algorithm would let us compute. Before we understand why, let’s do some more experiments on this construction:

Let’s take a graph where the weight of each edge is the name of the edge. To encode arbitrary data in the matrix, we will be using the free semi ring.

struct Edge
    name::String
end

The adjacency matrix of this graph will look like this:

S = Free{Edge}
O = zero(S)
i(n) = Free{Edge}(Set([[Edge(n)]]))

A = [O  i("1>2")  O         O         i("1>5")
     O  O         i("2>3")  i("2>4")  O
     O  O         O         i("3>4")  O
     O  O         O         O         O
     O  O         i("5>3")  O         O]

Notice here that O means “the zero in the appropriate semiring” and that i injects the name into it. The powers of this matrix will generate sets of paths of fixed length.

using LinearAlgebra

# Necessary for Julia's matrix product to properly use the operations.
Base.zero(::T) where T <: Semiring = zero(T)

A2 = A^2

for path in A2[2, 4].choices
    println(path)
end
Edge[Edge("2>3"), Edge("3>4")]

If we also want to get paths that are length \( \le n \), we can add ones into the diagonal of the matrix like this:

I = one(S)
A′ = [I  i("1>2")  O         O         i("1>5")
      O  I         i("2>3")  i("2>4")  O
      O  O         I         i("3>4")  O
      O  O         O         I         O
      O  O         i("5>3")  O         I]

for path in (A′^3)[2,4].choices
    println(path)
end
Edge[Edge("2>3"), Edge("3>4")]
Edge[Edge("2>4")]

A new path with length 2 that wasn’t visible before has appeared. Again, enumerating the set of paths is not very useful. What if we reapply our earlier trick (using Sel) to do more useful stuff?

S = Sel{MinCost, Free{Edge}}
O = zero(S)
I = one(S)
i(c, n) = S(MinCost(c), Free{Edge}(Set([[Edge(n)]])))

A = [I  i(1, "1>2")  O           O             i(1, "1>5")
     O  I            i(1, "2>3")  i(1, "2>4")  O
     O  O            I            i(1, "3>4")  O
     O  O            O            I            O
     O  O            i(1, "5>3")  O            I]

let
    @time result = (A^5)[1,4]
    println("cost: $(result.a)")
    for path in result.b.choices
        println("path: $(path)")
    end
end
  0.378522 seconds (1.49 M allocations: 71.838 MiB, 3.26% gc time, 99.85% compilation time)
cost: MinCost(2)
path: Edge[Edge("1>2"), Edge("2>4")]

We found the shortest path!

An astute reader might say: “Hey, sure, you have found the shortest path, but it cost you \( O(n^4) \) complexity!” Yes. Here, each matrix-matrix product costs \( n^3 \) multiplications and we do \( n \) of those, so we get very very bad complexity. We could organize the way we do exponentiation more efficiently (by leveraging \( M^{2n + k} = (M^{n})^2M^k \)) making the computation \( O(n^3
\log n) \), but that is still very bad. Something else we can do, is notice that in many cases, we don’t actually care about most of the optimal paths. Usually, we want a single optimal path from a source to a destination and that’s it. This need is easier to satisfy and we can leverage that by computing our shortest path as an iterated matrix vector product (costs \( O(n^2) \) instead of \(O(n^3\)).88 This optimization looks a lot like the idea of computing gradients instead of jacobians in automatic differentiation. We leverage a specific more efficient order of computation for complexity gains.

\begin{equation*}
\begin{aligned}
(M^n)_{ij}
&amp;= e_i^\top M^n e_j \\
&amp;= ((\dots((e_i^\top M) \dots ) M) e_j
\end{aligned}
\end{equation*}
V = [I O O O O]

let
    @time result = (V * A * A * A * A * A)[4]
    println("cost: $(result.a)")
    for path in result.b.choices
        println("path: $(path)")
    end
end
  0.085063 seconds (236.82 k allocations: 11.351 MiB, 99.85% compilation time)
cost: MinCost(2)
path: Edge[Edge("1>2"), Edge("2>4")]

We finally achieved99 Be careful! in Julia, the multiplication operator is left associative (\(
a*b*c = (a*b)*c \)). If that weren’t the case, the same text would parse as repeated matrix multiplication, followed by vector matrix multiplication. \( O(n^3) \), but if we compare it to the classical algorithm for that class of problem, the Floyd-Warshall algorithm, we barely achieve its complexity while doing \( n \) times less work (Floyd-Warshall compute all-to-all shortest paths). This is still bad. How could we translate the FW algorithm to our weird semiring matrix use-case?

5.4. Generalized Floyd-Warshall

The Floyd-Warshall algorithm works recursively. If we know all the shortest paths among all \( 1, \dots, k-1 \) nodes, then we can quickly extend them to get shortest paths among all \( 1, \dots, k \) nodes.

Each iteration up, we:

  1. Consider the matrix \( M_{k-1} \) of optimal paths for all nodes \( &lt; k \).
  2. Consider as \( A \) the adjacency matrix of the first \( k \) nodes (it is the \( k \) wide upper left square of the complete adjacency matrix).
  3. We try to improve the \( &lt; k \) paths by asking, for each pair \( i, j
   &lt; k \) of nodes: “is my current opimal \( (i,j) \)-path as good as going \(
   i,k,j \)?”
  4. We compute the optimal paths from \( k \) to everywhere (that is \( &lt;k \)) by gluing together the paths from \( k \) to everywhere (\( &lt;k \)) with the paths from everywhere (\( &lt;k \)) to everywhere (\( &lt;k \)).
  5. We compute the optimal paths from everywhere (again \( &lt;k \)) to \( k \) by gluing together the paths from everywhere to everywhere with the paths from everywhere to \(
   k \).
  6. The best path from \( k \) to \( k \) is the empty path, the multiplicative identity.

When we try to formulate these steps as matrix operations, we get that the new matrix of optimal paths is the sum of two matrices:

  1. The matrix computed with the exterior product of the \( k \)’t column with the \( k \)’th row of the adjacency matrix. This is the “\( i \) to \(k\) back to \( j \)” part.
  2. The block matrix which contains \( M_{k-1} \), \( M_{k-1} A_{1:k-1,k} \), \(
   A_{k, 1:k-1} M_{k-1} \) and \( 1 \).
\begin{equation*}
\begin{aligned}
M_{k} = A_{1:k,k} A_{k,1:k} +
\left(\begin{matrix}
M_{k-1} &amp; M_{k-1}A_{1:k-1,k} \\
A_{k, 1:k-1} M_{k-1} &amp; 1 \\
\end{matrix}\right)
\end{aligned}
\end{equation*}

Our implementation is this:

function generalized_floyd_warshall(A::Matrix{S}) where S
    k,_ = size(A)
    if k == 0
        return A
    end

    Mₖ₋₁ = generalized_floyd_warshall(A[1:end-1,1:end-1])


    A[:,end:end] .* A[end:end,:] +
        # I love julia's block matrix notation.
    [Mₖ₋₁                         Mₖ₋₁ * A[1:end-1,end:end]
     A[end:end,1:end-1] * Mₖ₋₁    one(S)]
end


let
    result = generalized_floyd_warshall(A)[1,4]
    println("cost: $(result.a)")
    for path in result.b.choices
        println("path: $(path)")
    end
end
cost: MinCost(2)
path: Edge[Edge("1>2"), Edge("2>4")]

This algorithm runs \( n \) iterations where each iteration contains 3 operations of complexity \( O(n^2) \). We have achieved Floyd-Warshall’s complexity bound! (because that is what we, albeit in an obfuscated way, have implemented).

6. Generalized dijkstra algorithm

We have implemented the Floyd-Warshall algorithm, but it is probably not the shortest path algorithm you are familiar with, which I suspect is Dijkstra’s algorithm. This is because in the most general case of “shortest path through graph where edge have weights”, Dijkstra’s algorithm doesn’t actually work. In fact, it relies on an additional assumption we haven’t made until now: positive weights.

6.1. Monotone vs non-monotone semi rings

Dijkstra’s algorithm is able to beat Floyd-Warshall by leveraging the assumption of positive weights. What this lets Dijkstra do is early stopping: once the algorithm visits the target node, it can be sure no new paths are going to be better.

If we translate this “positive weights” in semi ring-speak, we get:

  1. The existence of an order operation on the elements. This is easy, since Sel already uses this for its addition.
  2. For all weights \( w \in S \) of some edge, multiplying by \( w \) is a monotone operation. This means that \( \forall s,t \in S. \quad s \le t \Rightarrow sw \le tw \)

Let’s implement the order structure for Sel.

Base.isless(x::Sel{A,B}, y::Sel{A,B}) where {A,B} = x.a < y.a

With this property, Dijkstra’s algorithm can be used almost verbatim. However, because julia doesn’t have a builtin priority heap, let me quickly write a slow implementation of the necessary interface. In principle, it should be trivial to swap that out for a real implementation.

mutable struct MinHeap
    keys
    values
end

function MinHeap()
    MinHeap(
        [],
        [],
    )
end

function Base.pop!(h::MinHeap)
    k = pop!(h.keys)
    v = pop!(h.values)
    k, v
end

function insert!(h::MinHeap, k, v)
    push!(h.keys, k)
    push!(h.values, v)

    perms = sortperm(h.keys, rev=true)
    h.keys = h.keys[perms]
    h.values = h.values[perms]
end

Dijkstra’s algorithm is usually on top of a “give me outgoing edges and their weights” graph representation. Let us first adapt the adjacency matrix to the appropriate interface.

# Here a graph is represented by a function that gives you the
# outgoing edges to a given node.
function the_graph(from_node)
    if from_node == 1
        [(1, 2), (1,5)]
    elseif from_node == 2
        return [(2, 3), (2, 4)]
    elseif from_node == 3
        return [(3, 4)]
    elseif from_node == 4
        return []
    elseif from_node == 5
        return [(5,3)]
    else
        error("no such node:", from_node)
    end
end

# The weight of an edge is a tuple of a constant 1 and the edge name.
function cost(n)
    edges = [(1,2), (1,5), (2,3), (2,4), (3,4), (5,3)]
    if n ∈ edges
        a,b = n
        Sel(MinCost(1), Free(Set([[Edge("$(a)>$(b)")]])))
    else
        error("no such edge", n)
    end
end

target((a,b)) = b

Now we are ready for the final algorithm. Behold, Dijkstra!!

function generalized_dijkstra(S::Type{<:Semiring}, graph, cost, s₀, s₁)
    queue = MinHeap()
    state_costs = Dict()
    visited = Set()

    state_costs[s₀] = one(S)
    insert!(queue, state_costs[s₀], s₀)

    while true
        source_cost, source_state = pop!(queue)
        if source_state == s₁
            return state_costs[source_state]
        end

        if source_state ∈ visited
            continue
        else
            push!(visited, source_state)
        end

        edges = graph(source_state)
        for edge in edges

            target_state = target(edge)
            if target_state ∉ keys(state_costs)
                state_costs[target_state] = zero(S)
            end

            state_costs[target_state] += source_cost * cost(edge)
            insert!(queue, state_costs[target_state], target_state)
        end
    end
end

The interesting thing to note here is that this algorithm looks a lot like the textbook “compute the length of the shortest path”, and doesn’t seem to think about paths at all, only costs. However, as has been the case for this whole post, the generic interface lets us trick this single basic cost-only algorithm into also computing the path(s)

println(generalized_dijkstra(Sel{MinCost, Free{Edge}}, the_graph, cost, 1, 4))
Sel{MinCost, Free{Edge}}(MinCost(2), Free{Edge}(Set(Vector{Edge}[[Edge("1>2"), Edge("2>4")]])))

Footnotes:

1

Compositional and algebraic in the same vein as algebraic data types.

2

There are other properties, but they aren’t as useful for intuition.

3

Pardon me the ugly overflow.

4

This whole thing looks a lot like what the wikipedia page on weighted automata calls the weight of a word.

5

Showing the “correct-by-construction” part would require some more work, but it can be done. Moreover, it is likely that the proof structure will also follow the shape of the combinations we make. This is one other reason why algebraic dynamic programming is so nice: it is a framework in which the correctness of algorithms can be studied in a much simpler way.

6

This abstract principle can be seen everywhere: in Parser combinators and Game theory, for instance.

7

The first use I can think of (apart from the one we are going to see), is when we consider the graph as a Markov chain. However, there are a lot more.

8

This optimization looks a lot like the idea of computing gradients instead of jacobians in automatic differentiation. We leverage a specific more efficient order of computation for complexity gains.

9

Be careful! in Julia, the multiplication operator is left associative (\(
a*b*c = (a*b)*c \)). If that weren’t the case, the same text would parse as repeated matrix multiplication, followed by vector matrix multiplication.

Date: 2025-09-25 Thu 00:00

Author: Justin Veilleux

Validate