The label sequence is y; the target is to maximize p(y|x).
x → emission_graph → y
For a single y, there are many corresponding paths in emission_graph.
All of these paths pass through some nodes.
Some nodes belong to only a few paths; others belong to more.
The more paths that cross a node, the stronger the relationship between that node and y.
So the occupation probability measures how tightly a node is tied to y.
Optimizing for y means optimizing all valid paths that can generate it.
In the graph, the total score for y is the sum of the probabilities of all its valid paths.
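To make "many paths, one y" concrete, here is a toy sketch: three frames of per-frame label probabilities, and a simplified CTC-style collapse rule (merge repeats, no blank symbol). The numbers and the two-label alphabet are made up for illustration; the total score for y is the sum of the probabilities of every frame-level path that collapses to y.

```python
from itertools import product

# Per-frame probabilities over a toy alphabet {"a", "b"} (made-up numbers).
frame_probs = [
    {"a": 0.9, "b": 0.1},
    {"a": 0.5, "b": 0.5},
    {"a": 0.2, "b": 0.8},
]

def collapse(path):
    """Merge consecutive repeats: ("a","a","b") -> "ab"."""
    out = []
    for s in path:
        if not out or out[-1] != s:
            out.append(s)
    return "".join(out)

y = "ab"
total = 0.0
for path in product("ab", repeat=3):
    if collapse(path) == y:          # this path is one way to generate y
        p = 1.0
        for frame, s in zip(frame_probs, path):
            p *= frame[s]
        total += p                   # sum over all valid paths

print(total)  # ("a","a","b") and ("a","b","b") both collapse to "ab": 0.36 + 0.36 = 0.72
```

Two distinct paths generate the same y here, and the score of y is their sum, exactly as stated above.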
y = a + b
the gradient of a is 1, and the gradient of b is 1.
The more paths that cross a node, the more of these 1's the node accumulates (its occupation probability).
Let's first recall how PyTorch works. With PyTorch, the programmer only writes the forward pass, during which all tensors are created, manipulated, and combined. Usually a scalar loss is produced at the end of the forward pass. Then you just call:
loss.backward()  # PyTorch runs automatic differentiation
                 # over all tensors in its dynamic computation graph
Replace "tensor" with "FSA/FST" in the process above, and you have the idea of what GTN does.
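The analogy can be sketched in plain Python (this is a toy, not the real GTN API; every name here is made up). The "forward pass" computes the total path score of a small DAG by dynamic programming, and the "backward pass" gives the gradient of that score with respect to each arc weight, which is exactly the mass of all paths crossing that arc:

```python
def forward_backward(num_nodes, arcs, start, accept):
    """arcs: list of (src, dst, weight), assumed in topological order.
    Returns (total path score, gradient per arc)."""
    # alpha[n]: total weight of all paths from start to n (forward pass).
    alpha = [0.0] * num_nodes
    alpha[start] = 1.0
    for src, dst, w in arcs:
        alpha[dst] += alpha[src] * w

    # beta[n]: total weight of all paths from n to accept (backward pass).
    beta = [0.0] * num_nodes
    beta[accept] = 1.0
    for src, dst, w in reversed(arcs):
        beta[src] += w * beta[dst]

    total = alpha[accept]
    # d(total)/d(w) for arc src->dst is alpha[src] * beta[dst]:
    # the combined weight of every path that crosses this arc.
    grads = [alpha[src] * beta[dst] for src, dst, w in arcs]
    return total, grads

# A diamond-shaped graph: two paths from node 0 to node 3.
arcs = [(0, 1, 0.6), (0, 2, 0.4), (1, 3, 0.5), (2, 3, 0.5)]
total, grads = forward_backward(4, arcs, start=0, accept=3)
print(total)  # 0.6*0.5 + 0.4*0.5 = 0.5
print(grads)  # [0.5, 0.5, 0.6, 0.4]
```

The real GTN library automates this: you build weighted FSAs/FSTs in the forward pass, reduce them to a scalar loss, and call backward, just like `loss.backward()` in PyTorch.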