Normalizing Text to NFC in Modern CMS

Because all CMS should always use pre-composed characters

Project Description

There are 24 official languages within the European Union, and many (if not all) of them use special characters. German, for example, has the umlauts (“üöä”) and the “ß”, and other languages have more. Many of these characters exist both as a single pre-composed character and as a combination of two characters: a base letter followed by a combining mark. The two-character version is the so-called NFD normalization form, which Apple uses, for example, in its HFS+ file system and all internal processes. Text in the two-character form can lead to a broken search (in the CMS and in the browser itself), broken sorting, broken spell check, broken transliteration for the slug, and broken images.
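To make the difference concrete, here is a minimal sketch of how the two forms behave in JavaScript. The name “Müller” is only an illustrative sample; the escapes spell out the code points so the example does not depend on how this page itself is encoded.

```javascript
// Two spellings of "Müller" that render identically on screen.
const composed = "M\u00FCller";    // U+00FC: pre-composed "ü" (NFC)
const decomposed = "Mu\u0308ller"; // "u" + U+0308 COMBINING DIAERESIS (NFD)

// Without normalization the strings differ, so comparison and search fail.
const equalRaw = composed === decomposed;       // false
const foundRaw = decomposed.includes("\u00FC"); // false: a search for "ü" misses it

// The default sort compares code units, so the two forms land in different places:
// the decomposed form sorts before "Mz", the composed form after it.
const sortedRaw = [composed, "Mz", decomposed].sort();

// After normalizing both sides to NFC they are the same string.
const equalNfc = composed.normalize("NFC") === decomposed.normalize("NFC"); // true

console.log({ equalRaw, foundRaw, equalNfc });
console.log(composed.length, decomposed.length); // 6 7
```

The same mismatch is what trips up search, sorting, and slug generation when one side of a comparison comes from an NFD source such as a file name.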

This is why every modern CMS should normalize all non-NFC text to NFC: un-normalized text breaks the web.

Hackathon Goals

We aim to create tickets with pull requests or patches for the most-used CMS platforms. The topic is a bit niche and may require further explanation for some readers, so we may also write a blog post that summarizes the problem in greater detail and explains why we chose to tackle it in this particular manner.

Since PHP 5.3, an internationalization module (“intl”) has been available, but it is optional. This module contains a normalize function, Normalizer::normalize(), that converts decomposed characters to their pre-composed form. If hosters installed it by default, we could use it in the CMS and would not need to rely on the browser's JS equivalent to normalize text to NFC (which only modern browsers support, because the function is part of ES6).

If neither this module nor modern JS is available, we need fallbacks. Complex RegEx replacements would be our last line of defense; hopefully, with the help of the hosters, we would not need them very often.
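The client-side half of this idea could be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual patch: the names toNFC, regexFallbackToNFC, FALLBACK_PAIRS, and FALLBACK_PATTERN are hypothetical, and the fallback table covers only the six German umlauts, whereas a real fallback would need a far larger mapping.

```javascript
// Hypothetical fallback table: decomposed pair -> pre-composed character.
// Only the German umlauts are covered here, purely for illustration.
const FALLBACK_PAIRS = {
  "a\u0308": "\u00E4", // ä
  "o\u0308": "\u00F6", // ö
  "u\u0308": "\u00FC", // ü
  "A\u0308": "\u00C4", // Ä
  "O\u0308": "\u00D6", // Ö
  "U\u0308": "\u00DC", // Ü
};

// Matches one of the base letters above followed by U+0308 COMBINING DIAERESIS.
const FALLBACK_PATTERN = /[aouAOU]\u0308/g;

// Last line of defense: replace each known decomposed pair via the table.
function regexFallbackToNFC(text) {
  return text.replace(FALLBACK_PATTERN, (pair) => FALLBACK_PAIRS[pair]);
}

// Prefer the native ES6 method; use the RegEx fallback only when it is missing.
function toNFC(text) {
  if (typeof String.prototype.normalize === "function") {
    return text.normalize("NFC");
  }
  return regexFallbackToNFC(text);
}

console.log(toNFC("Mu\u0308ller") === "M\u00FCller"); // true
```

Keeping the fallback in its own function makes it testable even in environments where the native method exists, which is where most development happens.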

Target audience

Every CMS that handles text and file names should be interested in normalizing text. If you have experience with a modern CMS (WordPress, Drupal, Joomla, TYPO3, etc.) and you are interested in or experienced with Unicode or RegEx, then please join this project and apply to the Hackathon.

If you are a browser developer (Firefox/Gecko, Chrome/Blink, or other) and you think this should be solved in the browser, please apply and help us to get this solved on your end.

Project Lead

Torsten Landsiedel is a WordPress developer and freelancer based in Hamburg, Germany, with a strong history of support and translation work in the WordPress community. He’s a member of the “Pluginkollektiv”. At the moment, his priorities are taking care of his daughter and working on medium-sized customer projects.
