LaTeX Hyphenation Mist-akes*

One of the great things about TeX is that it will automatically hyphenate words when doing so leads to better overall line breaks in a paragraph. This is a somewhat difficult task because, when hyphenating words, it is not acceptable to insert the hyphen between just any pair of letters. Some hyphenations, such as “new-spaper”, can lead the reader “down a garden path.” That is, when reading the end of the line (“new-”), the reader guesses incorrectly that “new” is a complete stem within a compound word and is then completely confused when confronted with the unlikely terminating word “spaper”. A similar problem occurs when the hyphenation causes the reader to pronounce the head portion incorrectly, causing them to read nonsensical words (I’ll give an example in a second).

To address this problem, some automatic text layout systems rely on dictionaries in which acceptable hyphenation points have been marked by a human. While generally correct, such dictionary-based approaches require a very large data file to store the dictionary and fail when given new words. An alternative approach, taken in TeX, is to summarize hyphenation points into a small set of patterns which can then be applied to any word. The TeX method was described in Franklin Liang’s thesis Word Hy-phen-a-tion by Com-put-er. Liang’s thesis claims that his pattern-based method finds about 90% of the human-marked hyphenation points and finds essentially no incorrect ones.

Unfortunately, essentially no incorrect ones is not the same as no incorrect ones. In the last two papers I’ve written, I’ve come across words that TeX’s method failed rather dramatically on. Fortunately, LaTeX provides an easy way to override the automatic hyphen selection through the \hyphenation{} command.

\hyphenation{white-space} fixes TeX’s “whites-pace”, a somewhat racist rendering. Note that this incorrect hyphenation is the opposite of the “new-spaper” example, which came from Liang’s thesis, highlighting the problems facing a purely pattern-based approach.

\hyphenation{analy-sis} fixes TeX’s “anal-ysis” which leads to a rather infelicitous mispronunciation of the first part of the word.
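
Both exceptions can go in the preamble together. Here is a minimal sketch, assuming the standard article class and the default English patterns; the body sentences are purely illustrative and not from either paper:

```latex
\documentclass{article}

% Hyphenation exceptions: a space-separated list of words, with each
% permitted break point marked by a hyphen. For the listed words, TeX
% will only break at the marked positions.
\hyphenation{white-space analy-sis}

\begin{document}
Trailing whitespace rarely affects the analysis.

% For a one-off fix, \- marks a permitted break in a single occurrence
% of a word instead of declaring a document-wide exception.
Trailing white\-space rarely affects the analy\-sis.
\end{document}
```

The \hyphenation list suits words that recur throughout a document, while \- is handy when only one occurrence breaks badly.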

Despite doing a good job most of the time, TeX’s automatic hyphenation can and does go awry. Keep your eyes open when proofreading!
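
One way to check a suspicious word before it ends up at a line break is \showhyphens, which asks TeX to report the break points it would allow. A minimal sketch; the word list here is only illustrative:

```latex
\documentclass{article}
\begin{document}
% Reports the hyphenation points TeX would allow for each word.
% The result is written to the terminal and the .log file rather
% than to the typeset page.
\showhyphens{whitespace analysis newspaper}
\end{document}
```

The report appears in the log as part of an underfull \hbox message, with a hyphen at every position TeX considers acceptable. Running it again after adding a \hyphenation exception should show whether the override took effect.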

*I’m pretty sure TeX gets the hyphenation of “mistakes” correct.

3 thoughts on “LaTeX Hyphenation Mist-akes*”

  1. Justin Talbot Post author

    That’s probably how LaTeX does it by default, but I find the hyphenation anal-ysis to read poorly. First, there’s the obvious problem of having “anal” hanging at the end of a line. But this is exacerbated by the fact that the pronunciation of “anal” by itself is not the same as the first part of “analysis” (at least in my Western US English dialect), leading to a garden path problem.

  2. Matthias Büchse

    Looks like your pronunciation algorithm is very “greedy”. I would only pronounce something once I know what it means. This is beneficial in other cases as well, cf. “to tear” and “a tear”. Can one hyphenate bear-d?
