Algebraic dynamic programming
Note: This was partially copied and Julia-translated from Mattéo’s Midiro presentation.
1. Background: semi rings
A semiring is an algebraic structure that incarnates the essence of “sum of products”. A couple of examples of semi rings would be the natural numbers, square matrices (in two different ways), real numbers, polynomials, and more.
abstract type Semiring end
A semi ring is a set with the following properties1:
- Has an addition operation. (
)
- Has a multiplication operation. (
)
- Has additive identity. (
)
- Has a multiplicative identity. (
)
- Multiplication distributes over addition. (
)
In Julia, the functions that give use the zero and the one associated to a type
are respectively zero
and one
. The addition and multiplication operations are +
and *
like you would expect.
2. Dynamic programming problems as graph traversal
Very often, dynamic programming problems can be framed as nearest neighbor problems in some appropriate abstract graph. Let’s look, as an example, at the editing distance.
What we mean by edit distance is the following. Given two strings and
, their edit distance is the smallest weight of editing operations such that
applying them on
gives you
. Available operations are the
following:
- Inserting a character.
- Removing a character.
- Keeping a character.
For instance, the sequence of operations “keep(c), keep(a), keep(r), delete(p), insert(v), insert(e)” edits the string “carp” into the string “carve”.
For this specific problem, the abstract graph would have, as nodes, pairs of strings. The edges of this graph would be those editing operations.
the state (“car”, “car) would be connected to the state (”carp“, ”car“) through the ”delete(p)" edge, for instance.
If we use as a starting node the state (“”, “”) (two empty strings), every path the finished at the state (“carp”, “carve”) corresponds to a specific way to modify the first string to turn it into the second.
Now that the problem is properly set up, let us move on to the way we will compare different paths through the graph: weights.
Then, we will choose some semi ring to represent the weights, then a
weight function
that will give such a weight to each
transition in the graph.
Finally, we will enumerate all the paths between the states of interest and compute the weight of these paths through summing (across all paths) the products (across transitions in a single path) of the weights of the transitions.2
Through this very general framework that looks a lot like simply computing the “weighted sum” of a subgraph, we can implement diverse algorithms.
3. Implementation of edit distance.
Let’s write a concrete implementation of this “algebraic dynamic programming” idea in julia. Again, we will be using the string editing distance as a toy problem. Here are the definitions of the string operations:
abstract type Op end
struct Ins <: Op
c::Char
end
struct Del <: Op
c::Char
end
struct Keep <: Op
c::Char
end
In many cases, it is complicated (and expensive!) to enumerate all possible paths from source to destination, but in this specific case, the graph generated by the edit distance problem is easy to write down.
Here, we compute recursively by using the weight of
sub strings.
# Note: for this function to be efficient, you would ideally use
# memoisation. I don't here, because there is no builting way to do
# this.
function Weight(T::Type{<:Semiring}, s1, s2)
# We will be computing a sum, so it makes sense to initialize this
# with zero. Notice we pass the semiring type to the zero
# function. This is not your garden variety zero, but the one from
# your chosen semiring!
o = zero(T)
# From a modification that transforms s1[2:end] into s2, we can
# create a transformation that transforms s1 into s2 by appending
# Del(s1[1]) at the start of the transformations sequence.
if length(s1) != 0
o += weight(T, Del(s1[1])) * Weight(T, s1[2:end], s2)
end
# Same thing, but for insertion.
if length(s2) != 0
o += weight(T, Ins(s2[1])) * Weight(T, s1, s2[2:end])
end
# When the first character of both strings are the same, we can
# keep it by adding Keep(c) at the front of the operations
# sequnce.
if length(s1) != 0 && length(s2) != 0 && s1[1] == s2[1]
o += weight(T, Keep(s1[1])) * Weight(T, s1[2:end], s2[2:end])
end
# Finally, if both strings are empty, we denote triviality by
# adding the multiplicative identity.
if length(s1) == 0 && length(s2) == 0
o += one(T)
end
o
end
Now that the function that computes the weight of our subgraph is done, we can test it on different choices of
.
3.1. Tropical semiring (minimum cost)
One very exotic semiring is the tropical semiring. At first glance, it looks a lot like usual real number, but it really is not.
- The carrier set of the semiring is the set of real numbers plus infinity (which isn’t part of the reals).
- The semiring addition
on two elements is the
operation on two reals.
- Multiplication of semiring elements is addition of reals.
- The neutral element under addition is plus infinity.
- The neutral element under multiplication is zero.
- Verify that the other properties are respected.
struct MinCost <: Semiring
cost
end
Base.zero(::Type{MinCost}) = MinCost(Inf)
Base.one(::Type{MinCost}) = MinCost(0)
Base.:*(a::MinCost, b::MinCost) = MinCost(a.cost + b.cost)
Base.:+(a::MinCost, b::MinCost) = MinCost(min(a.cost, b.cost))
Here, our weights are defined in the following manner: inserting or deleting a character costs 1 unit, whereas keeping an already existing character is free.
weight(::Type{MinCost}, ::Ins) = MinCost(1)
weight(::Type{MinCost}, ::Del) = MinCost(1)
weight(::Type{MinCost}, ::Keep) = MinCost(0)
It is easy to see that if we combine sequentially two operations (through multiplication), then their cost will be summed whereas if we take the sum (combine horizontally two operations), then we will keep the minimum of the two costs.
If we evaluate the graph weight function on “carp” and “carve”, we should get something that is at least as cheap as the three operations we used earlier.
Weight(MinCost, "carp", "carve")
MinCost(3.0)
It works! However, it is not very useful to only see the cost of the best solution, we also want the solution itself!
3.2. The free semi ring
We would really like to access to the optimal sequence implicitly computed by
the algorithm of which the MinCost
is the cost. Before we do that, however, we
will need to be able to extract from the Weight
procedure the entire set of
possible paths. This is what the free semiring is for.
In algebra, the “free X generated by the alphabe ” is a very standard
construction where we take the
set, then add just enough elements and
equalities such that
becomes an X. In the case of semi rings, this
means we must make the elements multiply, add, contain a neutral element for
those two operations et cetera. It turns out that once we have done that, what
we end up with is a semi ring where:
- Each element of
is a set of sequences of
’s.
- Addition is union of sets of sequences.
- Multiplication is concatenation of every left sequence with every right sequence.
- Additive identity is the empty set.
- Multiplicative identity is the set that only contains the empty sequence.
Implemented in Julia, those operations look like this:
struct Free{A} <: Semiring
choices::Set{Vector{A}}
end
Free(s::Set{Vector{A}}) where A = Free{A}(s)
Base.zero(::Type{Free{A}}) where A = Free{A}(Set([]))
Base.one(::Type{Free{A}}) where A = Free{A}(Set([[]]))
Base.:+(a::Free{A}, b::Free{A}) where A = Free{A}(a.choices ∪ b.choices)
Base.:*(a::Free{A}, b::Free{A}) where A = Free{A}(
Set([vcat(parta, partb)
for parta in a.choices
for partb in b.choices]))
What is interesting here, is that we can generate a free semi ring from every single possible alphabet set. What this suggests with regards to the operation weight function is that we should just embed each operation in the free semi ring generated by the operations.
We do that by mapping every transition to the singleton set that contains a sequence containing only that thing.
weight(::Type{Free{Op}}, o::Op) = Free{Op}(Set([[o]]))
What happens next will might surprise you. Each time the Weight
function
combines the weights of two half paths into a single big path weight, it will
actually be combining all left possible paths with all right possible paths.
When two possible paths from to
are added, the two sets of
possible paths are unioned. At the end of the function’s execution, it returns
us a set of all possible operation sequences that start at
and end at
.
Let’s see what those look like in the case of “carp” and “carve”:
# (Il y en a vraiment beaucoup)
for seq in Iterators.take(Weight(Free{Op}, "carp", "carve").choices, 10)
println(seq)
end
Op[Ins('c'), Del('c'), Ins('a'), Del('a'), Keep('r'), Ins('v'), Del('p'), Ins('e')] Op[Del('c'), Del('a'), Ins('c'), Ins('a'), Ins('r'), Del('r'), Ins('v'), Del('p'), Ins('e')] Op[Ins('c'), Ins('a'), Del('c'), Ins('r'), Del('a'), Ins('v'), Ins('e'), Del('r'), Del('p')] Op[Ins('c'), Ins('a'), Del('c'), Del('a'), Del('r'), Ins('r'), Ins('v'), Del('p'), Ins('e')] Op[Ins('c'), Ins('a'), Ins('r'), Del('c'), Del('a'), Ins('v'), Ins('e'), Del('r'), Del('p')] Op[Keep('c'), Ins('a'), Del('a'), Keep('r'), Del('p'), Ins('v'), Ins('e')] Op[Del('c'), Ins('c'), Ins('a'), Ins('r'), Del('a'), Del('r'), Del('p'), Ins('v'), Ins('e')] Op[Keep('c'), Ins('a'), Del('a'), Del('r'), Del('p'), Ins('r'), Ins('v'), Ins('e')] Op[Keep('c'), Ins('a'), Ins('r'), Del('a'), Ins('v'), Del('r'), Del('p'), Ins('e')] Op[Ins('c'), Del('c'), Del('a'), Ins('a'), Del('r'), Ins('r'), Ins('v'), Del('p'), Ins('e')]
Even though the two words are quite similar, there are a lot of ways to turn one into the other and enumerating all of them (especially the longer ones) is really not the point of this exercise. Where the free semi ring shines is when we use it in combination with other semi rings as we will be doing soon.
Before we go there, however, how would we go about counting the number of distinct paths, if we did not want to construct them in the process?
3.3. Count semi ring
struct Count <: Semiring
n::Int
end
Base.zero(::Type{Count}) = Count(0)
Base.one(::Type{Count}) = Count(1)
Base.:+(a::Count, b::Count) = Count(a.n + b.n)
Base.:*(a::Count, b::Count) = Count(a.n * b.n)
weight(::Type{Count}, ::Op) = Count(1)
What the “counting semi ring”, really is, is the good old natural numbers. Let’s use it on the Weight function, for a laugh
println(Weight(Count,"carp", "carve"))
let’s compare to what the free semiring gives us
println(length(Weight(Free{Op}, "carp", "carve").choices))
3.4. Selector semi ring
If we have two semi rings, a very standard algebraic construction is to consider the product of those two semi rings. However, to make something useful, we will also be asking for the first semi ring to have the structure of an order.
We can construct a semi ring of pairs where the first component is kind of like a “cost” and the second, a “payload”. Here is how it goes:
- Addition of
and
compares
and
. If they are equal, return
. If they are not, return the pair with the lowest cost.
- Multiplication is component-wise multiplication.
- Additive identity is the pair of additive identities.
- Multiplicative identities is the pair of multiplicative identities.
Here is how the implementation looks:
struct Sel{A,B} <: Semiring
a::A
b::B
end
Base.zero(::Type{Sel{A,B}}) where {A,B} = Sel{A,B}(zero(A), zero(B))
Base.one(::Type{Sel{A,B}}) where {A,B} = Sel{A,B}(one(A), one(B))
Base.:*(l::Sel{A,B},r::Sel{A,B}) where {A,B} = Sel{A,B}(l.a * r.a, l.b * r.b)
Base.:+(l::Sel{A,B},r::Sel{A,B}) where {A,B} =
if l.a == r.a
Sel{A,B}(l.a, l.b + r.b)
elseif l.a < r.a
l
elseif l.a > r.a
r
end
What is interesting is that because this semi ring essentially inherits its structure from two other semi rings, we can also inherit our weight functions from the two semi rings.
weight(::Type{Sel{A,B}}, x) where {A,B} = Sel(weight(A, x), weight(B, x))
It is this construction that lets us compute the best candidate while remembering which is the best candidate.
# Il faut au préalable définir une relation d'ordre pour la composante
# gauche.
Base.isless(a::MinCost, b::MinCost) = a.cost < b.cost
r = Weight(Sel{MinCost, Free{Op}}, "carp", "carve")
println("coût: $(r.a)")
for elem in r.b.choices
println("séquence: $(elem)")
end
coût: MinCost(3) séquence: Op[Keep('c'), Keep('a'), Keep('r'), Ins('v'), Del('p'), Ins('e')] séquence: Op[Keep('c'), Keep('a'), Keep('r'), Del('p'), Ins('v'), Ins('e')] séquence: Op[Keep('c'), Keep('a'), Keep('r'), Ins('v'), Ins('e'), Del('p')]
It looks like there are three optimal ways to edit “carp” into “carve”.
4. Generalized Floyd-Warshall algorithm
Since the beginning, we have framed in graph terms problems that aren’t intrinsically graph problems. What about graph-native stuff, like shortest path problems?
4.1. Matrix product, deconstructed
Let there be two matrices that we can multiply
The formula to evaluate their product is the following:
Do you notice anything interesting? What properties should the component of matrices of something be so that they can be multiplied?
4.2. Adjacency matrices, revisited
Let there be a directed graph with weights on the edges (you can ignore the weights, for now):
You might already know this, but it is very useful to study such a graph through
its adjacency matrix. This matrix is constructed by writing the relationship
between the node and the
nod at a matrix’s
’th
component. Usually, that means putting a one where there is an edge and a zero
otherwise.
A = [0 1 0 0 1
0 0 1 1 0
0 0 0 1 0
0 0 0 0 0
0 0 1 0 0]
On neat computation we can do using this matrix is counting the number -length paths from
to
by taking the
’th power of the
adjacency matrix.
println(repr("text/plain", A^2))
5×5 Matrix{Int64}: 0 0 2 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
We can see that there are two paths of length 2 from node 1 to node 3. But
doesn’t this operation seem familiar? It should. The way we use this adjacency
matrix is very similar to how we used the Count
semi ring earlier.
4.3. Generalized adjacency matrix
The two previous remarks’ purpose were to bring us to the following question:
If we consider a graph where the weights live in some semi ring, and we consider its adjacency matrix where the components are those weighs, and we observe the successive powers of this matrix, what do they mean?
In fact, this gives a way to compute the same kind of things the Floyd-Warshall algorithm would let us compute. Before we understand why, let’s do some more experiments on this construction:
Let’s take a graph where the weight of each edge is the name of the edge. We will be using the free semi ring.
struct Edge
name::String
end
The adjacency matrix of this graph will look like this:
S = Free{Edge}
O = zero(S)
i(n) = Free{Edge}(Set([[Edge(n)]]))
A = [O i("1>2") O O i("1>5")
O O i("2>3") i("2>4") O
O O O i("3>4") O
O O O O O
O O i("5>3") O O]
Notice here that O
means “the zero in the appropriate semiring” and that i
injects the name into it. The powers of this matrix will generate sets of paths of fixed length.
using LinearAlgebra
# Necessary for Julia's matrix product to properly use the operations.
Base.zero(::T) where T <: Semiring = zero(T)
for path in (A^2)[2, 4].choices
println(path)
end
Edge[Edge("2>3"), Edge("3>4")]
If we also want to get paths that are length , we can add ones into the
diagonal of the matrix like this:
let
A′ = A + diagm(ones(Free{Edge}, 5))
for path in (A′^3)[2,4].choices
println(path)
end
end
Edge[Edge("2>4")] Edge[Edge("2>3"), Edge("3>4")]
A new path with length 2 that wasn’t visible before has appeared.
Again, enumerating the set of paths is not very useful. What if we reapply our
earlier trick (using Sel
) to do more useful stuff?
Again, we fill up the diagonal with ones to consider paths that aren’t exactly
edges long.
S = Sel{MinCost, Free{Edge}}
O = zero(S)
I = one(S)
i(c, n) = S(MinCost(c), Free{Edge}(Set([[Edge(n)]])))
A = [I i(1, "1>2") O O i(1, "1>5")
O I i(1, "2>3") i(1, "2>4") O
O O I i(1, "3>4") O
O O O I O
O O i(1, "5>3") O I]
let
result = (A^5)[1,4]
println("cost: $(result.a)")
for path in result.b.choices
println("path: $(path)")
end
end
cost: MinCost(2) path: Edge[Edge("1>2"), Edge("2>4")]
We found the shortest path!
An astute reader might remark, however: “Hey, sure, you have found the shortest
path, but it cost you complexity!” Yes. Here, each matrix-matrix
product costs
multiplications and we do
of those, so we get
very very bad complexity.
We could organize the way we do exponentiation more efficiently (by leveraging
) making the computation
, but
that is still very bad.
Another thing we can do, is notice that in many cases, we don’t actually care about most of the optimal paths. Usually, we want a single optimal path from a source to a destination and that’s it.
We can leverage this weaker need by computing our shortest path as an iterated
matrix vector product (costs instead of
).3
V = [I O O O O]
let
result = (V * A * A * A * A * A)[4]
println("cost: $(result.a)")
for path in result.b.choices
println("path: $(path)")
end
end
cost: MinCost(2) path: Edge[Edge("1>2"), Edge("2>4")]
We finally achieved4 , but if we compare it to the classical
algorithm for that class of problem, the Floyd-Warshall algorithm, we barely
achieve its complexity while doing
times less work (FW compute all
source and all destinations shortest paths). This is still bad. How could we
translate the FW algorithm to our weird semiring matrix usecase?
4.4. Generalized Floyd-Warshall
The Floyd-Warshall algorithm works recursively. If we know all the shortest
paths among all nodes, then we can quickly extend them to get
shortest paths among all
nodes.
Each iteration up, we:
- Consider the matrix
of optimal paths for all nodes
.
- Consider as
the adjacency matrix of the first
nodes (it is the
wide upper left square of the complete adjacency matrix).
- We try to improve the
paths by asking, for each pair
of nodes: “is my current opimal
-path as good as going
?”
- We compute the optimal paths from
to everywhere (that is
) by gluing together the paths from
to everywhere (
) with the paths from everywhere (
) to everywhere (
).
- We compute the optimal paths from everywhere (again
) to
by gluing together the paths from everywhere to everywhere with the paths from everywhere to
.
- The best path from
to
is the empty path, the multiplicative identity.
When we try to formulate these steps as matrix operations, we get that the new matrix of optimal paths is the sum of two matrices:
- The matrix computed with the exterior product of the
’t column with the
’th row of the adjacency matrix. This is the
part.
- The block matrix which contains
,
,
and
.
Our implementation is this:
function generalized_floyd_warshall(A::Matrix{S}) where S
k,_ = size(A)
if k == 0
return A
end
Mₖ₋₁ = generalized_floyd_warshall(A[1:end-1,1:end-1])
A[:,end:end] .* A[end:end,:] +
# I love julia's block matrix notation.
[Mₖ₋₁ Mₖ₋₁ * A[1:end-1,end:end]
A[end:end,1:end-1] * Mₖ₋₁ one(S)]
end
let
result = generalized_floyd_warshall(A)[1,4]
println("cost: $(result.a)")
for path in result.b.choices
println("path: $(path)")
end
end
cost: MinCost(2) path: Edge[Edge("1>2"), Edge("2>4")]
This algorithm runs iterations where each iteration contains 3
operations of complexity
. We have achieved Floyd-Warshall’s
complexity bound! (because that is what we, albeit in an obfuscated way, have
implemented).
5. Generalized dijkstra algorithm
The “shortest path calculating algorithm” we have implemented, however, is not the one familiar to most of you dear readers, however. This is because in the most general case of “shortest path through graph where edge have weights”, Dijkstra’s algorithm doesn’t actually work. In fact, it relies on an additional assumption we haven’t made until now: positive weights.
5.1. Monotone vs non-monotone semi rings
Dijkstra’s algorithm is able to beat Floyd-Warshall by leveraging the assumption of positive weights. What this lets Dijkstra do is early stopping: once the algorithm visits the target node, it can be sure no new paths are going to be better.
If we translate this “positive weights” in semi ring-speak, we get:
- The existence of an order operation on the elements. This is easy, since
Sel
already uses this for its addition. - For all weights
of some edge, multiplying by
is a monotone operation. This means that
Let’s implement the order structure for Sel
.
Base.isless(x::Sel{A,B}, y::Sel{A,B}) where {A,B} = x.a < y.a
With this property, Dijkstra’s algorithm can be used almost verbatim. We only need to tweak a few types to make it work.
Because julia doesn’t have a builtin priority heap, let me write a slow implementation of one first.
mutable struct MinHeap
keys
values
end
function MinHeap()
MinHeap(
[],
[],
)
end
function Base.pop!(h::MinHeap)
k = pop!(h.keys)
v = pop!(h.values)
k, v
end
function insert!(h::MinHeap, k, v)
push!(h.keys, k)
push!(h.values, v)
perms = sortperm(h.keys, rev=true)
h.keys = h.keys[perms]
h.values = h.values[perms]
end
Since dijkstra isn’t using the same graph representation as FW, we need to massage the graph into a different representation.
# Here a graph is represented by a function that gives you the
# outgoing edges to a given node.
function the_graph(from_node)
if from_node == 1
[(1, 2), (1,5)]
elseif from_node == 2
return [(2, 3), (2, 4)]
elseif from_node == 3
return [(3, 4)]
elseif from_node == 4
return []
elseif from_node == 5
return [(5,3)]
else
error("no such node:", from_node)
end
end
# The weight of an edge is a tuple of a constant 1 and the edge name.
function cost(n)
edges = [(1,2), (1,5), (2,3), (2,4), (3,4), (5,3)]
if n ∈ edges
a,b = n
Sel(MinCost(1), Free(Set([[Edge("$(a)>$(b)")]])))
else
error("no such edge", n)
end
end
target((a,b)) = b
Now we are ready for the final algorithm. Behold, Dijkstra!!
function generalized_dijkstra(S::Type{<:Semiring}, graph, cost, s₀, s₁)
queue = MinHeap()
state_costs = Dict()
visited = Set()
state_costs[s₀] = one(S)
insert!(queue, state_costs[s₀], s₀)
while true
source_cost, source_state = pop!(queue)
if source_state == s₁
return state_costs[source_state]
end
if source_state ∈ visited
continue
else
push!(visited, source_state)
end
edges = graph(source_state)
for edge in edges
target_state = target(edge)
if target_state ∉ keys(state_costs)
state_costs[target_state] = zero(S)
end
state_costs[target_state] += source_cost * cost(edge)
insert!(queue, state_costs[target_state], target_state)
end
end
end
println(generalized_dijkstra(Sel{MinCost, Free{Edge}}, the_graph, cost, 1, 4))
Sel{MinCost, Free{Edge}}(MinCost(2), Free{Edge}(Set(Vector{Edge}[[Edge("1>2"), Edge("2>4")]])))
Footnotes:
They are other properties, but they aren’t as useful for intuition.
Note: This looks a lot like what the wikipedia page on weighted automata calls the weight of a word.
This optimization looks a lot like the idea of computing gradients instead of jacobians in automatic differentiation. We leverage a specific more efficient order of computation for complexity gains.
Be careful! in Julia, the multiplication operator is left associative (). If that weren’t the case, the same text would parse as
repeated matrix multiplication, followed by vector matrix multiplication.