Give me some space

My colleagues Aly Nielson and Ante Grgić recently worked on a part of a web application that required trimming strings. Their interesting and anomalous findings prompted me to write this.

Quick definitions: a “string” is a sequence of zero or more characters. A “character” is one symbol that has some semantic value, represented in some encoding scheme. “A”, “9”, “ج”, “א”, “∜”, and “漢” are all characters. Common encoding schemes are UTF-8, UTF-16, and UTF-32.

In many languages, spaces between characters also carry semantic meaning. Writers of many languages (English, French, Spanish, Hindi, Arabic, and Farsi, for example) are familiar with the spaces that separate words, sentences, and paragraphs. Because spaces also have semantic meaning, different characters are used to denote spaces of different lengths. Unicode, for example, defines no fewer than 25 characters with the “White_Space” property, and another 6 “separators”: characters that serve to separate adjoining characters without necessarily showing up as whitespace.
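If you want to check that count yourself, here is a minimal sketch. It assumes a JavaScript engine that supports Unicode property escapes (ES2018 or later); the function name is mine, and the result depends on the Unicode data your engine ships with. On a recent engine it should report 25.

```javascript
// Matches exactly one code point that has the Unicode White_Space property.
const whiteSpacePattern = /^\p{White_Space}$/u;

// Hypothetical helper: walks every Unicode code point and collects the
// ones the engine considers White_Space, formatted as U+XXXX labels.
function listWhiteSpaceCodePoints() {
  const found = [];
  for (let cp = 0; cp <= 0x10ffff; cp++) {
    if (whiteSpacePattern.test(String.fromCodePoint(cp))) {
      found.push("U+" + cp.toString(16).toUpperCase().padStart(4, "0"));
    }
  }
  return found;
}

const whiteSpaceCodePoints = listWhiteSpaceCodePoints();
console.log(whiteSpaceCodePoints.length);
console.log(whiteSpaceCodePoints.join(" "));
```

Note that zero-width space (U+200B) is not in the list: it separates characters, but Unicode does not classify it as whitespace.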

All this can be rather bewildering – especially if you’ve been accustomed to thinking of “whitespace” as comprising only the space, tab, and line-feed/carriage-return characters. Space is more vast than you might have imagined!

Trimming a string means stripping all the whitespace characters from either end of that string. With so many different characters, character encodings, whitespace characters, and even zero-width spaces, one may expect “trimming a string” to be anything but straightforward. And that is indeed the case.

What Aly and Ante found was that different languages trim strings differently. JavaScript and C#, for instance, treat the “non-breaking space” (U+00A0) as whitespace and trim it. Java and Ruby, on the other hand, do not treat this same character as whitespace, and do not trim it.
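To see the difference side by side, here is a small sketch in JavaScript. The `javaStyleTrim` helper is hypothetical: it emulates the documented behaviour of Java's `String.trim()`, which strips leading and trailing characters with a code of U+0020 or lower. It is an illustration, not a port of the Java implementation.

```javascript
const NBSP = "\u00A0"; // non-breaking space

// Hypothetical emulation of Java's String.trim(): removes leading and
// trailing characters whose code is U+0020 or lower, and nothing else.
function javaStyleTrim(s) {
  let start = 0;
  let end = s.length;
  while (start < end && s.charCodeAt(start) <= 0x20) start++;
  while (end > start && s.charCodeAt(end - 1) <= 0x20) end--;
  return s.slice(start, end);
}

// NBSP, space, "hello", space, NBSP: nine characters in total.
const padded = NBSP + " hello " + NBSP;

console.log(padded.trim().length);         // 5: the built-in trim removes NBSP too
console.log(javaStyleTrim(padded).length); // 9: the outer NBSP stops the trimming
```

The same input survives untouched on one side and comes back as a bare `hello` on the other, which is exactly the kind of mismatch that surfaces when two systems exchange "trimmed" values.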

I didn’t realize that this whole issue of whitespace was a subject of much activity in the Unicode organization. In particular, the “narrow no-break space” received much attention in 2019 in various Unicode committees [6][7][8].

Given all this variation across languages (both computer and human), I recommend the following:

1. Thoroughly unit test any code that trims strings. I mean tests with literally this level of detail (using Jest-style syntax):

a. it “does not trim non-breaking spaces from either end of a string”
b. it “trims space, tab, and line-feed/carriage-return characters from either end of a string”
c. it “does not trim embedded space, tab and line-feed/carriage-return characters”
d. … and others
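As a rough sketch of what those cases look like as data, here they are in plain JavaScript so the snippet runs without a test framework. The function under test, `trimAsciiWhitespace`, is hypothetical: I am assuming a trim that deliberately handles only space, tab, line feed, and carriage return.

```javascript
// Hypothetical function under test: trims only space, tab, LF, and CR.
function trimAsciiWhitespace(s) {
  return s.replace(/^[ \t\n\r]+|[ \t\n\r]+$/g, "");
}

// One entry per named test case.
const trimCases = [
  {
    name: "does not trim non-breaking spaces from either end of a string",
    input: "\u00A0abc\u00A0",
    expected: "\u00A0abc\u00A0",
  },
  {
    name: "trims space, tab, and line-feed/carriage-return characters from either end of a string",
    input: " \t\r\nabc\r\n\t ",
    expected: "abc",
  },
  {
    name: "does not trim embedded space, tab and line-feed/carriage-return characters",
    input: "a b\tc\r\nd",
    expected: "a b\tc\r\nd",
  },
];

// Returns the names of the cases that the given trim function fails.
function failingCases(trimFn) {
  return trimCases
    .filter((c) => trimFn(c.input) !== c.expected)
    .map((c) => c.name);
}

console.log(failingCases(trimAsciiWhitespace).length); // 0
console.log(failingCases((s) => s.trim()).length);     // 1: the built-in trim strips NBSP
```

In Jest, each entry would become its own `it(name, ...)` with an `expect(trimFn(input)).toBe(expected)` inside, so that a failure names the exact behaviour that changed.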

2. When you have one system module sharing data with another system module, and they are written in different computer languages, consider writing a contract test (or two) for the boundary between the two modules. An example would be a Java-based backend that sends data to a JavaScript-based single-page web-app.
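One way to picture such a contract test, as a sketch only: keep a shared fixture of inputs and the trimmed values the backend produces, and have the frontend test suite check its own trim against it. The fixture contents and the names `trimContract` and `contractViolations` are assumptions for illustration; the third entry assumes a backend that leaves non-breaking spaces alone.

```javascript
// Hypothetical shared fixture: each input is paired with the trimmed value
// the backend is expected to produce. In practice this might live in a
// file that both test suites read.
const trimContract = [
  { input: "  title  ", backendResult: "title" },
  { input: "\ttitle\r\n", backendResult: "title" },
  { input: "\u00A0title\u00A0", backendResult: "\u00A0title\u00A0" },
];

// Returns the fixture entries on which the frontend trim disagrees with
// the backend's recorded result.
function contractViolations(frontendTrim) {
  return trimContract.filter(
    (entry) => frontendTrim(entry.input) !== entry.backendResult
  );
}

const violations = contractViolations((s) => s.trim());
console.log(violations.length); // 1: the non-breaking space entry
```

A non-empty result does not say which side is wrong. It says the two modules disagree, which is the thing worth knowing before the data reaches production.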

3. Be particularly wary of strings that will be used by the system (as opposed to being read by humans). If you trim the content of a news article badly, you may get laughed at by a few readers. However, if you trim a RESTful URI, a configuration value, or an encoded password incorrectly, you may be in for a world of grief while debugging.
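When you do end up debugging one of these, it helps to make the invisible visible. The following is a minimal sketch of a diagnostic helper; every name in it is hypothetical, and treating zero-width space (U+200B) as space-like is my own assumption, since JavaScript's `\s` does not cover it.

```javascript
// Characters that most trim functions agree on.
const PLAIN_WHITESPACE = new Set([" ", "\t", "\n", "\r"]);

// Anything JavaScript treats as whitespace, plus zero-width space (U+200B).
// Including U+200B is an assumption made for this sketch.
const SPACE_LIKE = /^[\s\u200B]$/u;

function toCodePointLabel(ch) {
  return "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0");
}

// Hypothetical helper: lists the unusual space-like characters found at
// either end of a value. Leading ones come first, then trailing ones
// (reported from the end of the string inward).
function unusualEdgeWhitespace(value) {
  const chars = Array.from(value);
  const found = [];
  let start = 0;
  while (start < chars.length && SPACE_LIKE.test(chars[start])) {
    if (!PLAIN_WHITESPACE.has(chars[start])) found.push(toCodePointLabel(chars[start]));
    start++;
  }
  let end = chars.length - 1;
  while (end >= start && SPACE_LIKE.test(chars[end])) {
    if (!PLAIN_WHITESPACE.has(chars[end])) found.push(toCodePointLabel(chars[end]));
    end--;
  }
  return found;
}

console.log(unusualEdgeWhitespace("https://example.com/api\u00A0")); // reports U+00A0
console.log(unusualEdgeWhitespace("  timeout=30\n"));                // reports nothing
```

Logging the code points, rather than the string itself, is the point: a trailing non-breaking space looks identical to an ordinary space in almost every log viewer.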

Being aware of the details of how strings and whitespace are treated in your language can be the difference between a smoothly working system and subtle bugs that prove notoriously difficult to find.

References

  1. JavaScript String.trim() method
  2. C# Char.IsWhiteSpace() method
  3. Java Character.isWhitespace() method
  4. Ruby String#strip method
  5. Unicode line breaking algorithm
  6. “Property change for narrow no-break space”
  7. “Proposal to clarify the purpose of narrow no-break space”
  8. “Summary of Mongolian updates in the Unicode Standard”