Algebraic dynamic programming
Note: This was partially copied and Julia-translated from Mattéo’s Midiro presentation.
1. Motivation
The class of algorithms known as “dynamic programming” covers a large class of use cases and the different problems classified as “dynamic programming” sometimes seem too different to be related. Algebraic dynamic programming is a unifying framework through which view dynamic programming problems that allows, through its compositional nature,11 Compositional and algebraic in the same vein as algebraic data types. to understand complex algorithms as the sum of its parts.
The main concept we are going to use throughout this post is the idea of representing dynamic programming problems as shortest path problems in some weird abstract graph. By changing the specific semi ring we use to express the weight of edges in some problem’s graph, we can decompose many analyses and algorithms on a problem into small understandable parts. For instance, we are going to see how “compute the length of the shortest path in this graph” and “give me all paths with this lengths” and “how many different paths are there?” can be derived from a single generic function on a graph.
2. Background: semi rings
The generic interface we are going to use to derive algorithms is going to be based on semi rings. It turns out that semi rings possess exactly the kind of structure we need from a notion of “edge graph”. However, before we see how they are used in dynamic programming, we have to build an intuition of what they are.
In the more usual type of mathematics one might do, a semi ring is an algebraic
structure that incarnates the essence of “sum of products”. A couple of examples
of semi rings would be the natural numbers, square matrices (in two different
ways), real numbers, polynomials, and more. More formally, a semi ring
is a set with the following properties/data:22
There are other properties, but they aren’t as useful for intuition.
- Has an addition operation
- Has a multiplication operation
- Has additive identity
which means that
for any
.
- Has a multiplicative identity
which means that
for any
.
- Multiplication distributes over addition:
.
Through this post, we are going to work with Julia, which allows use to overload
addition (Base.:+), multiplication (Base.:*), zero (Base.zero) and one
(Base.one) for any data type. We are going to use this feature to make the code
look similar to the math, but this is not strictly necessary for implementation,
just a nice bonus. Once we define our first semi ring, the overloading behavior
will be explained in more details.
abstract type Semiring end
3. Dynamic programming problems as graph traversal
As I said, dynamic programming problems can be framed as finding the shortest
bath in some appropriate abstract graph. Let’s look, as an example, at the
editing distance between two string. As you would expect, the editing distance
between and
is basically the smallest number of operations we
need to transform string
into string
. By “operation”, we mean:
- Inserting a character.
- Removing a character.
- Keeping a character.
The “keep a character” operation, as we will see, doesn’t contribute to the edit distance, but is necessary to make our graph have the necessary edges.
As an example, let’s take the strings “carp” and “carve”. We know intuitively that the fastest way to turn one into the other is to keep “car” and replace the “p” by “ve”. To express this transformation as a sequence of operations, we would say “keep(c), keep(a), keep(r), delete(p), insert(v), insert(e)”. Now I must explain (like I promised I would) how finding the shortest (with respect to some notion of distance) sequence of operation can be framed as a graph shortest path. When expressed like that, it doesn’t sound as mystical.
Imagine a graph where:
- Each vertex is a pair of strings
representing an intermediate result: “I know how to transform
into
”.
- An edge between
and
is an operation that, when added to the end of an “
to
” transformation, gives us a “
to
” transformation.
More visually (blue path is the optimal transformation):33 Pardon me the ugly overflow.
It is probably easier to see now how this becomes a shortest path problem. However, while it might be tempting to just stick normal weights on this graph, then apply Dijkstra’s algorithm, we would lose out on some level of genericity. Having our interface be “function that extracts a (weighted) graph” would fulfil some part of the promise: one solver works on all dynamic programming problems. However, it would still be hard to trick the solver into doing weird ancillary computations for us.
For that reason, we are going to weaken our assumptions about the graph weighs in exchange for a more general algorithm.
Let’s say we have some weighted graph where the weight of each edge is part of
some semi ring . We will denote the weight of an edge using the
function. Let’s also define the weight of a path
(continuous sequence of edges) as the product of the weights of all edges in the
path. Finally, let’s define the weight of a subgraph bounded by two nodes
and
as the sum of the weights of all paths from
to
.44
This whole thing looks a lot like what the wikipedia page on weighted
automata calls the weight of a word.
Through this very general framework that looks a lot like simply computing the “weighted sum” of a subgraph, we can implement diverse algorithms.
4. Implementation of edit distance.
Let’s write a concrete implementation of this “algebraic dynamic programming” idea in julia. Again, we will be using the string editing distance as a toy problem. Here are the definitions of the string operations:
abstract type Op end
struct Ins <: Op
c::Char
end
struct Del <: Op
c::Char
end
struct Keep <: Op
c::Char
end
In many cases, it is complicated (and expensive!) to enumerate all possible paths from source to destination, but for edit distance, the graph is easy to write down.
Here, we compute recursively by using the weight of
sub strings.
# Note: for this function to be efficient, you would ideally use
# memoisation. Here I don't to keep the presentation clean.
function subgraphweight(T::Type{<:Semiring}, s1, s2)
# We will be computing a sum, so it makes sense to initialize this
# with zero. Notice we pass the semi ring type to the zero
# function. This is not your garden variety zero, but the one from
# your chosen semi ring!
o = zero(T)
# From a modification that transforms s1[1:end-1] into s2, we can
# create a transformation that transforms s1 into s2 by appending
# Del(s1[end]) at the start of the transformations sequence.
if length(s1) != 0
o += subgraphweight(T, s1[1:end-1], s2) * weight(T, Del(s1[end]))
end
# Same thing, but for insertion.
if length(s2) != 0
o += subgraphweight(T, s1, s2[1:end-1]) * weight(T, Ins(s2[end]))
end
# When the last character of both strings are the same, we can
# compute the subgraph weight of the prefix, then multiply it by
# the weight of Keep(c) at the end.
if length(s1) != 0 && length(s2) != 0 && s1[end] == s2[end]
o += subgraphweight(T, s1[1:end-1], s2[1:end-1]) * weight(T, Keep(s1[end]))
end
# Finally, if both strings are empty, then the weight of the
# subgraph is the product of the weights of an empty path. We
# define an empty product to be equal to one, so that gives us
# one.
if length(s1) == 0 && length(s2) == 0
o += one(T)
end
o
end
Now that the function that computes the weight of our subgraph is done, we can poke it by evaluating it on different choices of
and
.
4.1. Tropical semi ring (minimum cost)
The first semi ring we are going to look at is the tropical semi ring. While it might seem very exotic at first, you will see that it lets us recover the “normal” behavior – find the lowest possible cost – that we are used to.
- The carrier set of the semi ring is the set of real numbers plus infinity, which isn’t usually part of the real numbers.
- The semiring addition
on two elements is
. It interacts as you would expect with infinity.
- Multiplication of semi ring elements is addition of reals.
- The neutral element under addition is plus infinity:
.
- The neutral element under multiplication is zero:
.
- Multiplication distributes over addition:
.
Let’s define this semi ring structure in code:
struct MinCost <: Semiring
cost
end
Base.zero(::Type{MinCost}) = MinCost(Inf)
Base.one(::Type{MinCost}) = MinCost(0)
Base.:*(a::MinCost, b::MinCost) = MinCost(a.cost + b.cost)
Base.:+(a::MinCost, b::MinCost) = MinCost(min(a.cost, b.cost))
Now that we have the semi ring set up, we need to define our weighting function in terms of it. As you would expect, adding or deleting a character costs 1 unit, whereas keeping a character costs nothing.
weight(::Type{MinCost}, ::Ins) = MinCost(1)
weight(::Type{MinCost}, ::Del) = MinCost(1)
weight(::Type{MinCost}, ::Keep) = MinCost(0)
It is easy to see that if we combine sequentially two operations (through
multiplication), then their cost will be summed. On the other hand, if we take
the sum (combine horizontally two operations), then we will keep the minimum of
the two costs. This reflects our intuition that the weight of a subgraph between
and
is the smallest weight of all possible paths.
Now if we evaluate the graph weight function on “carp” and “carve”, we should get something that is at least as cheap as the three operations we used earlier.
subgraphweight(MinCost, "carp", "carve")
MinCost(3.0)
It works! However, it is not very useful to only see the cost of the best solution, we might also want the solution itself!
4.2. The free semi ring
We want the optimal weight to be paired with its witness: the actual path in the
graph that has this optimal weight. However, while the subgraphweight algorithm
implicitely computes this shortest path, we want it to give us an explicit
representation. We could manually construct a semi ring that combines the
“minimum path length” with “remember concrete path” behavior, but you will see
that this semiring is in fact a composition of the tropical semi ring and
something called the free semi ring.
In algebra, the “free generated by the alphabet
” is a very
standard construction where we take the
set, then add just enough
elements, then glue together just enough point such that
becomes an X.
This construction usually look a lot like syntax for an abstract
with
variables in
.
In the case of semi rings, this means we must make the elements multiply, add, contain a neutral element for those two operations et cetera. It turns out that once we have done that, what we end up with is a semi ring where:
- Each element of
is a set of sequences of
’s.
- Addition is union of sets of sequences.
- Multiplication is concatenation of every left sequence with every right sequence.
- Additive identity is the empty set.
- Multiplicative identity is the set that only contains the empty sequence.
Implemented in Julia, those operations look like this:
struct Free{A} <: Semiring
choices::Set{Vector{A}}
end
Free(s::Set{Vector{A}}) where A = Free{A}(s)
Base.zero(::Type{Free{A}}) where A = Free{A}(Set([]))
Base.one(::Type{Free{A}}) where A = Free{A}(Set([[]]))
Base.:+(a::Free{A}, b::Free{A}) where A = Free{A}(a.choices ∪ b.choices)
Base.:*(a::Free{A}, b::Free{A}) where A = Free{A}(
Set([vcat(parta, partb)
for parta in a.choices
for partb in b.choices]))
Something to note is that for each element of our alphabet set, there is a corresponding element in the free semi ring. This suggests that the appropriate weight function for some operation should just be the associated element in the free semi ring.
weight(::Type{Free{Op}}, o::Op) = Free{Op}(Set([[o]]))
Each time the subgraphweight function combines the weights of two subgraphs into
a single big subgraph weight, it will actually be combining all left paths with
all right paths. When two possible paths from to
are added, the
two sets of possible paths are unioned. At the end of the function’s execution,
it will return a set of all possible operation sequences that start at
and end at
.
Let’s see what those look like in the case of “carp” and “carve”:
# (Il y en a vraiment beaucoup)
for seq in Iterators.take(subgraphweight(Free{Op}, "carp", "carve").choices, 10)
println(seq)
end
Op[Keep('c'), Keep('a'), Keep('r'), Del('p'), Ins('v'), Ins('e')]
Op[Keep('c'), Ins('a'), Ins('r'), Ins('v'), Del('a'), Del('r'), Del('p'), Ins('e')]
Op[Keep('c'), Ins('a'), Del('a'), Ins('r'), Ins('v'), Del('r'), Ins('e'), Del('p')]
Op[Keep('c'), Ins('a'), Del('a'), Del('r'), Ins('r'), Ins('v'), Del('p'), Ins('e')]
Op[Keep('c'), Ins('a'), Del('a'), Ins('r'), Del('r'), Ins('v'), Ins('e'), Del('p')]
Op[Keep('c'), Ins('a'), Ins('r'), Del('a'), Ins('v'), Del('r'), Ins('e'), Del('p')]
Op[Del('c'), Ins('c'), Ins('a'), Ins('r'), Ins('v'), Ins('e'), Del('a'), Del('r'), Del('p')]
Op[Ins('c'), Ins('a'), Del('c'), Ins('r'), Del('a'), Del('r'), Del('p'), Ins('v'), Ins('e')]
Op[Ins('c'), Del('c'), Ins('a'), Ins('r'), Del('a'), Ins('v'), Del('r'), Ins('e'), Del('p')]
Op[Del('c'), Del('a'), Del('r'), Ins('c'), Ins('a'), Ins('r'), Ins('v'), Ins('e'), Del('p')]
Even though the two words are quite similar, there are a lot of ways to turn one into the other and enumerating all of them (especially the longer ones) is not really the point. Where the free semi ring shines is when we use it in combination with other semi rings as we will be doing soon.
Before we go there, however, let me go through one last building block. How would we go about counting the number of distinct paths, if we did not want to pay the overhead of even constructing them?
4.3. Count semi ring
struct Count <: Semiring
n::Int
end
Base.zero(::Type{Count}) = Count(0)
Base.one(::Type{Count}) = Count(1)
Base.:+(a::Count, b::Count) = Count(a.n + b.n)
Base.:*(a::Count, b::Count) = Count(a.n * b.n)
weight(::Type{Count}, ::Op) = Count(1)
The “counting semi ring” is the good old natural numbers. Let’s use it on the Weight function, for a laugh
println(subgraphweight(Count,"carp", "carve"))
Count(224)
let’s compare to what the free semiring gives us
println(length(subgraphweight(Free{Op}, "carp", "carve").choices))
224
Let’s see how big of a difference this special semi ring makes on the runtime of the two implementations of “count numbers”:
@time subgraphweight(Count,"carp", "carve")
@time length(subgraphweight(Free{Op}, "carp", "carve").choices)
0.000021 seconds (258 allocations: 8.062 KiB) 0.000898 seconds (30.41 k allocations: 1.887 MiB)
This is not really a good benchmark, but one is two order of magnitues faster than the other.
4.4. Selector semi ring
As promised, we are going to define the combinator that lets us construct the
“give me the best transformation” semi ring out of the MinCost and the Free semi
rings.
We will construct a semi ring of pairs where the first component is the cost and
the second, the payload. However, we are going to ask from our first semi ring
that it also has the structure of an order. This is the case of the MinCost semi
ring.
Base.isless(a::MinCost, b::MinCost) = a.cost < b.cost
Once we have an ordered semi ring and a payload semi ring, here is how it goes:
- Addition of
and
compares
and
. If they are equal, return
. If they are not, return the pair with the lowest cost.
- Multiplication is component-wise multiplication.
- Additive identity is the pair of additive identities.
- Multiplicative identities is the pair of multiplicative identities.
Here is how the implementation looks:
struct Sel{A,B} <: Semiring
a::A
b::B
end
Base.zero(::Type{Sel{A,B}}) where {A,B} = Sel{A,B}(zero(A), zero(B))
Base.one(::Type{Sel{A,B}}) where {A,B} = Sel{A,B}(one(A), one(B))
Base.:*(l::Sel{A,B},r::Sel{A,B}) where {A,B} = Sel{A,B}(l.a * r.a, l.b * r.b)
Base.:+(l::Sel{A,B},r::Sel{A,B}) where {A,B} =
if l.a == r.a
Sel{A,B}(l.a, l.b + r.b)
elseif l.a < r.a
l
elseif l.a > r.a
r
end
Something interesting to notice is that because this semi ring essentially inherits all of its structure from two other semi rings, it can also inherit its weight function from the two semi rings.
weight(::Type{Sel{A,B}}, x) where {A,B} = Sel(weight(A, x), weight(B, x))
It is this construction that lets us compute the best candidate while remembering which is the best candidate.
r = subgraphweight(Sel{MinCost, Free{Op}}, "carp", "carve")
println("cost: $(r.a)")
for elem in r.b.choices
println("sequence: $(elem)")
end
cost: MinCost(3)
sequence: Op[Keep('c'), Keep('a'), Keep('r'), Ins('v'), Del('p'), Ins('e')]
sequence: Op[Keep('c'), Keep('a'), Keep('r'), Del('p'), Ins('v'), Ins('e')]
sequence: Op[Keep('c'), Keep('a'), Keep('r'), Ins('v'), Ins('e'), Del('p')]
It looks like there are three optimal ways to edit “carp” into “carve”. However, if all we care about is the number of optimal solutions, there is a simpler way to do that.
r = subgraphweight(Sel{MinCost, Count}, "carp", "carve")
println("cost: $(r.a)")
println("count: $(r.b)")
cost: MinCost(3) count: Count(3)
Notice that by expressing everything in terms of an algebraic structure, we can easily and quickly combine various components to derive some new correct-by-construction55 Showing the “correct-by-construction” part would require some more work, but it can be done. Moreover, it is likely that the proof structure will also follow the shape of the combinations we make. This is one other reason why algebraic dynamic programming is so nice: it is a framework in which the correctness of algorithms can be studied in a much simpler way. algorithms. This is the essence of compositionality: we define a rich language of building blocks through which we can express any (in-class) algorithm we want. We pay a higher upfront cost for implementing the primitives and the combinators, but constructing some new thing becomes much easier.66 This abstract principle can be seen everywhere: in Parser combinators and Game theory, for instance.
5. Generalized Floyd-Warshall algorithm
Since the beginning, we have framed as graph problems a problem that isn’t intrinsically a graph problem. What about graph-native stuff, like computing the shortest path? In this case, we are going to do a different translation. Instead of taking some dynamic programming problem and shoehorning it into graphs, we are going to take a graph problem and shoehorn it into generalized (over semi rings) matrix multiplications.
5.1. The matrix product
Let there be two matrices that we can multiply
The formula to evaluate their product is the following:
The main thing to notice here is that if we forget for a moment the meaning of these symbols, we recognize that the only operations we really need(defined on the components) to compute matrix multiplication are those carried by a semi ring. We have sums, we have products and that’s pretty much it.
5.2. Adjacency matrices
Let there be a directed graph with weights on the edges:
A representation of graphs that is often useful77
The first use I can think of (apart from the one we are going to see), is
when we consider the graph as a Markov chain. However, there are a lot more.
is its adjacency matrix.
This matrix is constructed by encoding the relationship between the ’th
node and the
’th node at a matrix’s
’th component. Usually, that
means putting a one where there is an edge and a zero otherwise.
A = [0 1 0 0 1
0 0 1 1 0
0 0 0 1 0
0 0 0 0 0
0 0 1 0 0]
One neat computation we can do using this matrix is count the number of length
paths from
to
. We do that by taking the
’th power
of the adjacency matrix.
println(repr("text/plain", A^2))
5×5 Matrix{Int64}:
0 0 2 1 0
0 0 0 1 0
0 0 0 0 0
0 0 0 0 0
0 0 0 1 0
We can see that there are two paths of length 2 from node 1 to node 3. These are
and
. Is the shape of this computation familiar?
The way we use this adjacency matrix is very similar to how we used the
Count
semi ring earlier. This is because just like we generalized edge weights (and
subgraph weight) to any semi ring, we can do the same for a graph’s adjacency
matrix.
5.3. Generalized adjacency matrix
If we consider a graph where the weights live in some semi ring, the appropriate adjacency matrix is one where the components are those weighs; this is simply a change of representation. One obvious question to ask ourselves is: what does matrix multiplication mean?
Remember the definition of matrix multiplication:
If we consider the multiplication of a matrix with itself, this equation becomes
the
’th component of the resulting matrix gives us the weight of the
two-edge subgraph between
and
.
In fact, taking this matrix’s powers will compute the same kind of things the Floyd-Warshall algorithm would let us compute. Before we understand why, let’s do some more experiments on this construction:
Let’s take a graph where the weight of each edge is the name of the edge. To encode arbitrary data in the matrix, we will be using the free semi ring.
struct Edge
name::String
end
The adjacency matrix of this graph will look like this:
S = Free{Edge}
O = zero(S)
i(n) = Free{Edge}(Set([[Edge(n)]]))
A = [O i("1>2") O O i("1>5")
O O i("2>3") i("2>4") O
O O O i("3>4") O
O O O O O
O O i("5>3") O O]
Notice here that O means “the zero in the appropriate semiring” and that i
injects the name into it. The powers of this matrix will generate sets of paths of fixed length.
using LinearAlgebra
# Necessary for Julia's matrix product to properly use the operations.
Base.zero(::T) where T <: Semiring = zero(T)
A2 = A^2
for path in A2[2, 4].choices
println(path)
end
Edge[Edge("2>3"), Edge("3>4")]
If we also want to get paths that are length , we can add ones into the
diagonal of the matrix like this:
I = one(S)
A′ = [I i("1>2") O O i("1>5")
O I i("2>3") i("2>4") O
O O I i("3>4") O
O O O I O
O O i("5>3") O I]
for path in (A′^3)[2,4].choices
println(path)
end
Edge[Edge("2>3"), Edge("3>4")]
Edge[Edge("2>4")]
A new path with length 2 that wasn’t visible before has appeared. Again,
enumerating the set of paths is not very useful. What if we reapply our earlier
trick (using Sel) to do more useful stuff?
S = Sel{MinCost, Free{Edge}}
O = zero(S)
I = one(S)
i(c, n) = S(MinCost(c), Free{Edge}(Set([[Edge(n)]])))
A = [I i(1, "1>2") O O i(1, "1>5")
O I i(1, "2>3") i(1, "2>4") O
O O I i(1, "3>4") O
O O O I O
O O i(1, "5>3") O I]
let
@time result = (A^5)[1,4]
println("cost: $(result.a)")
for path in result.b.choices
println("path: $(path)")
end
end
0.378522 seconds (1.49 M allocations: 71.838 MiB, 3.26% gc time, 99.85% compilation time)
cost: MinCost(2)
path: Edge[Edge("1>2"), Edge("2>4")]
We found the shortest path!
An astute reader might say: “Hey, sure, you have found the shortest path, but it
cost you complexity!” Yes. Here, each matrix-matrix product costs
multiplications and we do
of those, so we get very very bad
complexity. We could organize the way we do exponentiation more efficiently (by
leveraging
) making the computation
, but that is still very bad. Something else we can do, is notice that
in many cases, we don’t actually care about most of the optimal paths. Usually,
we want a single optimal path from a source to a destination and that’s it. This
need is easier to satisfy and we can leverage that by computing our shortest
path as an iterated matrix vector product (costs
instead of
).88
This optimization looks a lot like the idea of computing gradients instead
of jacobians in automatic differentiation. We leverage a specific more efficient
order of computation for complexity gains.
V = [I O O O O]
let
@time result = (V * A * A * A * A * A)[4]
println("cost: $(result.a)")
for path in result.b.choices
println("path: $(path)")
end
end
0.085063 seconds (236.82 k allocations: 11.351 MiB, 99.85% compilation time)
cost: MinCost(2)
path: Edge[Edge("1>2"), Edge("2>4")]
We finally achieved99
Be careful! in Julia, the multiplication operator is left associative (). If that weren’t the case, the same text would parse as
repeated matrix multiplication, followed by vector matrix multiplication.
, but if we compare it to the classical
algorithm for that class of problem, the Floyd-Warshall algorithm, we barely
achieve its complexity while doing
times less work (Floyd-Warshall
compute all-to-all shortest paths). This is still bad. How could we translate
the FW algorithm to our weird semiring matrix use-case?
5.4. Generalized Floyd-Warshall
The Floyd-Warshall algorithm works recursively. If we know all the shortest
paths among all nodes, then we can quickly extend them to get
shortest paths among all
nodes.
Each iteration up, we:
- Consider the matrix
of optimal paths for all nodes
.
- Consider as
the adjacency matrix of the first
nodes (it is the
wide upper left square of the complete adjacency matrix).
- We try to improve the
paths by asking, for each pair
of nodes: “is my current opimal
-path as good as going
?”
- We compute the optimal paths from
to everywhere (that is
) by gluing together the paths from
to everywhere (
) with the paths from everywhere (
) to everywhere (
).
- We compute the optimal paths from everywhere (again
) to
by gluing together the paths from everywhere to everywhere with the paths from everywhere to
.
- The best path from
to
is the empty path, the multiplicative identity.
When we try to formulate these steps as matrix operations, we get that the new matrix of optimal paths is the sum of two matrices:
- The matrix computed with the exterior product of the
’t column with the
’th row of the adjacency matrix. This is the “
to
back to
” part.
- The block matrix which contains
,
,
and
.
Our implementation is this:
function generalized_floyd_warshall(A::Matrix{S}) where S
k,_ = size(A)
if k == 0
return A
end
Mₖ₋₁ = generalized_floyd_warshall(A[1:end-1,1:end-1])
A[:,end:end] .* A[end:end,:] +
# I love julia's block matrix notation.
[Mₖ₋₁ Mₖ₋₁ * A[1:end-1,end:end]
A[end:end,1:end-1] * Mₖ₋₁ one(S)]
end
let
result = generalized_floyd_warshall(A)[1,4]
println("cost: $(result.a)")
for path in result.b.choices
println("path: $(path)")
end
end
cost: MinCost(2)
path: Edge[Edge("1>2"), Edge("2>4")]
This algorithm runs iterations where each iteration contains 3
operations of complexity
. We have achieved Floyd-Warshall’s
complexity bound! (because that is what we, albeit in an obfuscated way, have
implemented).
6. Generalized dijkstra algorithm
We have implemented the Floyd-Warshall algorithm, but it is probably not the shortest path algorithm you are familiar with, which I suspect is Dijkstra’s algorithm. This is because in the most general case of “shortest path through graph where edge have weights”, Dijkstra’s algorithm doesn’t actually work. In fact, it relies on an additional assumption we haven’t made until now: positive weights.
6.1. Monotone vs non-monotone semi rings
Dijkstra’s algorithm is able to beat Floyd-Warshall by leveraging the assumption of positive weights. What this lets Dijkstra do is early stopping: once the algorithm visits the target node, it can be sure no new paths are going to be better.
If we translate this “positive weights” in semi ring-speak, we get:
- The existence of an order operation on the elements. This is easy, since
Selalready uses this for its addition. - For all weights
of some edge, multiplying by
is a monotone operation. This means that
Let’s implement the order structure for Sel.
Base.isless(x::Sel{A,B}, y::Sel{A,B}) where {A,B} = x.a < y.a
With this property, Dijkstra’s algorithm can be used almost verbatim. However, because julia doesn’t have a builtin priority heap, let me quickly write a slow implementation of the necessary interface. In principle, it should be trivial to swap that out for a real implementation.
mutable struct MinHeap
keys
values
end
function MinHeap()
MinHeap(
[],
[],
)
end
function Base.pop!(h::MinHeap)
k = pop!(h.keys)
v = pop!(h.values)
k, v
end
function insert!(h::MinHeap, k, v)
push!(h.keys, k)
push!(h.values, v)
perms = sortperm(h.keys, rev=true)
h.keys = h.keys[perms]
h.values = h.values[perms]
end
Dijkstra’s algorithm is usually on top of a “give me outgoing edges and their weights” graph representation. Let us first adapt the adjacency matrix to the appropriate interface.
# Here a graph is represented by a function that gives you the
# outgoing edges to a given node.
function the_graph(from_node)
if from_node == 1
[(1, 2), (1,5)]
elseif from_node == 2
return [(2, 3), (2, 4)]
elseif from_node == 3
return [(3, 4)]
elseif from_node == 4
return []
elseif from_node == 5
return [(5,3)]
else
error("no such node:", from_node)
end
end
# The weight of an edge is a tuple of a constant 1 and the edge name.
function cost(n)
edges = [(1,2), (1,5), (2,3), (2,4), (3,4), (5,3)]
if n ∈ edges
a,b = n
Sel(MinCost(1), Free(Set([[Edge("$(a)>$(b)")]])))
else
error("no such edge", n)
end
end
target((a,b)) = b
Now we are ready for the final algorithm. Behold, Dijkstra!!
function generalized_dijkstra(S::Type{<:Semiring}, graph, cost, s₀, s₁)
queue = MinHeap()
state_costs = Dict()
visited = Set()
state_costs[s₀] = one(S)
insert!(queue, state_costs[s₀], s₀)
while true
source_cost, source_state = pop!(queue)
if source_state == s₁
return state_costs[source_state]
end
if source_state ∈ visited
continue
else
push!(visited, source_state)
end
edges = graph(source_state)
for edge in edges
target_state = target(edge)
if target_state ∉ keys(state_costs)
state_costs[target_state] = zero(S)
end
state_costs[target_state] += source_cost * cost(edge)
insert!(queue, state_costs[target_state], target_state)
end
end
end
The interesting thing to note here is that this algorithm looks a lot like the textbook “compute the length of the shortest path”, and doesn’t seem to think about paths at all, only costs. However, as has been the case for this whole post, the generic interface lets us trick this single basic cost-only algorithm into also computing the path(s)
println(generalized_dijkstra(Sel{MinCost, Free{Edge}}, the_graph, cost, 1, 4))
Sel{MinCost, Free{Edge}}(MinCost(2), Free{Edge}(Set(Vector{Edge}[[Edge("1>2"), Edge("2>4")]])))
Footnotes:
Compositional and algebraic in the same vein as algebraic data types.
There are other properties, but they aren’t as useful for intuition.
Pardon me the ugly overflow.
This whole thing looks a lot like what the wikipedia page on weighted automata calls the weight of a word.
Showing the “correct-by-construction” part would require some more work, but it can be done. Moreover, it is likely that the proof structure will also follow the shape of the combinations we make. This is one other reason why algebraic dynamic programming is so nice: it is a framework in which the correctness of algorithms can be studied in a much simpler way.
This abstract principle can be seen everywhere: in Parser combinators and Game theory, for instance.
The first use I can think of (apart from the one we are going to see), is when we consider the graph as a Markov chain. However, there are a lot more.
This optimization looks a lot like the idea of computing gradients instead of jacobians in automatic differentiation. We leverage a specific more efficient order of computation for complexity gains.
Be careful! in Julia, the multiplication operator is left associative (). If that weren’t the case, the same text would parse as
repeated matrix multiplication, followed by vector matrix multiplication.