Some documents contain no space between a name and one of the following characters: []{},.<>. It makes sense to add an option that treats these characters as token separators.
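A rough sketch of what such an option could look like, in Python for illustration only. The names `tokenize`, `split_on_punctuation`, and `EXTRA_SEPARATORS` are made up here and are not part of the project's API; the separator set is just the list of characters above.

```python
import re

# Hypothetical separator set: the characters listed in this issue.
EXTRA_SEPARATORS = "[]{},.<>"


def tokenize(text, split_on_punctuation=False):
    """Split text into tokens on whitespace.

    When split_on_punctuation is True, the characters in
    EXTRA_SEPARATORS also act as token boundaries and are dropped.
    """
    if not split_on_punctuation:
        return text.split()
    # Character class made of the escaped separators plus whitespace.
    pattern = "[" + re.escape(EXTRA_SEPARATORS) + r"\s]+"
    return [tok for tok in re.split(pattern, text) if tok]


print(tokenize("Aus bus[1],Cus dus"))
print(tokenize("Aus bus[1],Cus dus", split_on_punctuation=True))
```

Since "." is in the list, turning the option on would also split abbreviations such as "Linn." from their trailing period, which is one reason to keep it optional.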
Another thing that happens quite often is names like <i>Aus bus</i> Linn. It would be good to ignore <i> and </i>, or even use them as indicators of the canonical form of a scientific name.
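Both ideas could be prototyped roughly as follows. This is only a sketch in Python with hypothetical helper names (`italic_candidates`, `strip_italic_tags`), and it assumes plain `<i>`/`</i>` tags without attributes or nesting.

```python
import re

# Assumption: italic tags are plain <i>...</i>, no attributes, no nesting.
ITALIC_SPAN = re.compile(r"<i>(.*?)</i>", re.IGNORECASE | re.DOTALL)
ITALIC_TAG = re.compile(r"</?i>", re.IGNORECASE)


def italic_candidates(text):
    """Return the text inside each <i>...</i> span as a candidate canonical form."""
    return [m.group(1).strip() for m in ITALIC_SPAN.finditer(text)]


def strip_italic_tags(text):
    """Remove <i> and </i> so the remaining text can be tokenized as usual."""
    return ITALIC_TAG.sub("", text)


sample = "<i>Aus bus</i> Linn."
print(italic_candidates(sample))
print(strip_italic_tags(sample))
```

The italic span would only be a hint: authors also italicize things that are not names, so a candidate would still need to pass the normal name checks.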
See also
#150
#53