My colleagues Aly Nielson and Ante Grgić
recently worked on a part of a web application that required trimming
strings. Their interesting and anomalous findings prompted me to write
Quick definitions: a “string” is an array of zero or more characters. A
“character” represents one symbol that has some semantic value,
represented in some encoding scheme. “A”, “9”, “ج”, “א”, “∜”, and “漢” are
all characters. Common encoding schemes are UTF-8, UTF-16 and UTF-32.
In many languages, spaces between characters also carry semantic meaning. Most writers of languages such as English, French, Spanish, Hindi, Arabic, and Farsi are familiar with the spaces that separate words, sentences, and paragraphs. Because spaces also have semantic meaning, different characters are used to denote spaces of different lengths. Unicode, for example, defines no fewer than 25 space characters with a “whitespace” property, and another 6 “separators” – characters that serve to separate adjoining characters without necessarily showing up as whitespace.
All this can be rather bewildering – especially if you’ve been accustomed to thinking of “whitespace” as comprising only the space, tab, and line-feed/carriage-return characters. Space is more vast than you might have imagined!
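To get a feel for how far this goes beyond space and tab, here is a minimal sketch in JavaScript that checks a handful of code points against the built-in `String.prototype.trim()`. The `candidates` list and the `isTrimmedByBuiltin` helper are names made up for this illustration, and the list is only a small sample, not the full Unicode set.

```javascript
// A small, hand-picked sample of "space-like" code points (not exhaustive).
const candidates = [
  ["SPACE", "\u0020"],
  ["CHARACTER TABULATION", "\u0009"],
  ["NO-BREAK SPACE", "\u00A0"],
  ["NARROW NO-BREAK SPACE", "\u202F"],
  ["IDEOGRAPHIC SPACE", "\u3000"],
  ["ZERO WIDTH SPACE", "\u200B"],
  ["NEXT LINE", "\u0085"],
];

// Hypothetical helper: true when the built-in trim() strips `ch`
// from both ends of a string.
function isTrimmedByBuiltin(ch) {
  return (ch + "x" + ch).trim() === "x";
}

// Print one line per candidate, e.g. "U+0020 SPACE: trimmed".
for (const [name, ch] of candidates) {
  const hex = ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0");
  const verdict = isTrimmedByBuiltin(ch) ? "trimmed" : "kept";
  console.log(`U+${hex} ${name}: ${verdict}`);
}
```

In a current JavaScript engine, the no-break space and the ideographic space are trimmed, while the zero-width space and NEXT LINE are kept, even though NEXT LINE carries the Unicode whitespace property. Other languages draw the line in different places, so it is worth running the equivalent check in whichever one you use.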
Trimming a string means stripping all the whitespace characters from
either end of that string. With so many different characters, character
encodings, whitespaces, and even zero-width spaces, one may expect
“trimming a string” to be anything but straightforward. And that is
I hadn’t realized that this whole issue of whitespace was a subject of much activity in the Unicode organization. In particular, the “narrow no-break space” received much attention in 2019 in various Unicode committees.
Given all this variation across languages (both computer and human), I recommend the following:
1. Thoroughly unit test any code that trims strings. I mean tests with literally this level of detail (using Jest-style syntax):
it “does not trim non-breaking spaces from either end of a string”
it “trims space, tab, and line-feed/carriage-return characters from either end of a string”
it “does not trim embedded space, tab
d. … and others
3. Be particularly wary of strings that will be used by the system (as opposed to being read by humans). If you trim the content of a news article badly, you may get laughed at by a few readers. However, if you trim a RESTful URI, a configuration value, or an encoded password incorrectly, you may be in for a world of grief while debugging.
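The level of test detail suggested in the first recommendation can be sketched as follows. This is only an illustration under stated assumptions: `trimBasicWhitespace` is a hypothetical helper that strips just space, tab, line feed, and carriage return, and `expectEqual` plus the fallback `it` are small stand-ins so the file also runs with plain Node when Jest is not present.

```javascript
// Hypothetical helper (an assumption for this example): trims ONLY
// space, tab, line feed, and carriage return from both ends.
function trimBasicWhitespace(input) {
  return input.replace(/^[ \t\n\r]+|[ \t\n\r]+$/g, "");
}

// Use the test runner's global `it` if there is one; otherwise run the
// test body immediately and report its name.
const it = globalThis.it ?? ((name, fn) => {
  fn();
  console.log(`ok - ${name}`);
});

// Tiny assertion stand-in: throws when the two values differ.
function expectEqual(actual, expected) {
  if (actual !== expected) {
    throw new Error(
      `expected ${JSON.stringify(expected)}, got ${JSON.stringify(actual)}`
    );
  }
}

it("trims space, tab, and line-feed/carriage-return characters from either end of a string", () => {
  expectEqual(trimBasicWhitespace(" \t\r\nhello\r\n\t "), "hello");
});

it("does not trim non-breaking spaces from either end of a string", () => {
  expectEqual(trimBasicWhitespace("\u00A0hello\u00A0"), "\u00A0hello\u00A0");
});

it("leaves whitespace inside the string untouched", () => {
  expectEqual(trimBasicWhitespace("  a \t b  "), "a \t b");
});

it("differs from the built-in trim(), which does remove non-breaking spaces", () => {
  expectEqual("\u00A0hello\u00A0".trim(), "hello");
});
```

The last test is there to pin down the gap between the custom helper and the built-in method. For a configuration value or an encoded password, that gap is exactly the kind of detail worth capturing in a test.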
Being aware of the details of how strings and whitespaces are treated in your language can be the difference between a smoothly working system and subtle bugs that prove notoriously difficult to find.
- C# Char.IsWhiteSpace() method
- Java Character.isWhitespace() method
- Ruby String#strip method
- Unicode line breaking algorithm
- “Property change for narrow no-break space”
- “Proposal to clarify the purpose of narrow no-break space”
- “Summary of Mongolian updates in the Unicode Standard”