The solution has two parts: the first part is to generate the transitions at a
finer grain and using hash consing for new states. The second part is to
optimize the resulting diagram by collapsing identical states, which are an
unfortunate side effect of the first part.
The end result is a much smaller diagram, in which there is never more than one
speculative branch introduced for any input code. I haven't measured the
performance, but this can't help but be faster just on the basis of the amount
of data being manipulated. Now that we have a limit on the speculative
branches, it should be possible to optimize the NFA further by supporting at
most two branches rather than a list of them.