DEV Community

John Au-Yeung

Comparing Non-English Strings with JavaScript Collators

Subscribe to my email list now at http://jauyeung.net/subscribe/

Follow me on Twitter at https://twitter.com/AuMayeung

By combining the double-equals or triple-equals operator with string methods such as toLowerCase, we can easily compare strings in a case-sensitive or case-insensitive way. However, this doesn’t account for the characters found in non-English text such as French or Italian. These languages use accented letters, which ordinary comparisons treat as unrelated characters ordered by code unit value rather than by the language’s alphabet.

To handle this, we can use Intl.Collator to compare strings with accents or strings in different locales. Intl.Collator is a constructor for collators, which are objects that let us compare strings in a language-sensitive way. With a collator, we can compare the order of characters according to the language that is specified.
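To make the difference concrete, here is a minimal sketch that sorts the same three letters twice, once with the default sort and once with a French collator. The variable names are only for illustration.

```javascript
// The default sort compares UTF-16 code units, so 'é' (U+00E9) lands after 'z' (U+007A).
const letters = ['z', 'é', 'e'];
const byCodeUnit = [...letters].sort();

// A French collator places 'é' right after 'e'.
const frCollator = new Intl.Collator('fr');
const byLocale = [...letters].sort(frCollator.compare);

console.log(byCodeUnit); // [ 'e', 'z', 'é' ]
console.log(byLocale);   // [ 'e', 'é', 'z' ]
```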

Basic Collator Usage for String Equality Comparison

To use a collator, we construct a Collator object and then call its compare method. The compare method compares the order of two entire strings according to the locale. For example, if we want to compare two strings in German using its alphabet’s order, we can write the following code:

const collator = new Intl.Collator('de');  
const order = collator.compare('Ü', 'ß');  
console.log(order);

We created the Collator object by writing new Intl.Collator('de') to specify that we are comparing strings using German rules. Then we call its compare method, which takes the two strings that we want to compare as its parameters.

The compare function returns a number. It is positive (1 in most engines) if the first string comes after the second in the locale’s sort order, 0 if the two strings are considered equal, and negative (-1 in most engines) if the first string comes before the second.

So if we flip the order of the strings like in the code below:

const collator = new Intl.Collator('de');  
const order = collator.compare('ß', 'Ü');  
console.log(order);

Then the console.log outputs -1.

If they’re the same, like in the following code:

const collator = new Intl.Collator('de');  
const order = collator.compare('ß', 'ß');  
console.log(order);

Then we get 0 returned for order.

To summarize: if the strings are considered equal, compare returns 0. Otherwise it returns a positive or negative number (1 or -1 in most engines), and the sign indicates the order of the strings.
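Since only the sign of the result carries meaning, it can be safer to test for less than, greater than, or equal to zero than to check for exactly 1 or -1. The sketch below wraps that in a helper; describeOrder is a hypothetical name used only for this illustration.

```javascript
const deCollator = new Intl.Collator('de');

// Hypothetical helper: turns the numeric result of compare into a readable label.
// Only the sign is inspected, never the exact magnitude.
function describeOrder(a, b) {
  const result = deCollator.compare(a, b);
  if (result < 0) return 'before';
  if (result > 0) return 'after';
  return 'equal';
}

console.log(describeOrder('ß', 'Ü')); // before
console.log(describeOrder('Ü', 'ß')); // after
console.log(describeOrder('ß', 'ß')); // equal
```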

Advanced Usage

The Collator is useful because we can pass its compare method to Array.sort as the comparison callback to sort multiple strings in an array. For example, if we have multiple German strings in an array, like in the code below:

const collator = new Intl.Collator('de');  
const sortedLetters = ['Z', 'Ä', 'Ö', 'Ü', 'ß'].sort(collator.compare);  
console.log(sortedLetters);

Then we get ["Ä", "Ö", "ß", "Ü", "Z"].

The constructor takes a number of options that account for the features of different languages’ alphabets. As we can see above, the first parameter of the constructor is the locale, which is a BCP 47 language tag or an array of such tags. This parameter is optional; if it is omitted, the runtime’s default locale is used. An abridged list of BCP 47 language tags includes:
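The same approach works for whole words. As a small sketch with made-up sample words, the code below sorts a copy of the array with the German collator and, for comparison, with the default sort:

```javascript
const germanCollator = new Intl.Collator('de');
const words = ['Zebra', 'Äpfel', 'Ozean', 'Apfel', 'Österreich'];

// Copy before sorting so the original array is left untouched.
const sortedWords = [...words].sort(germanCollator.compare);
const defaultSorted = [...words].sort();

console.log(sortedWords);   // [ 'Apfel', 'Äpfel', 'Österreich', 'Ozean', 'Zebra' ]
console.log(defaultSorted); // [ 'Apfel', 'Ozean', 'Zebra', 'Äpfel', 'Österreich' ]
```

The default sort pushes every word starting with an umlaut to the end, while the collator keeps them next to their base letters.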

  • ar — Arabic
  • bg — Bulgarian
  • ca — Catalan
  • zh-Hans — Chinese, Han (Simplified variant)
  • cs — Czech
  • da — Danish
  • de — German
  • el — Modern Greek (1453 and later)
  • en — English
  • es — Spanish
  • fi — Finnish
  • fr — French
  • he — Hebrew
  • hu — Hungarian
  • is — Icelandic
  • it — Italian
  • ja — Japanese
  • ko — Korean
  • nl — Dutch
  • no — Norwegian
  • pl — Polish
  • pt — Portuguese
  • rm — Romansh
  • ro — Romanian
  • ru — Russian
  • hr — Croatian
  • sk — Slovak
  • sq — Albanian
  • sv — Swedish
  • th — Thai
  • tr — Turkish
  • ur — Urdu
  • id — Indonesian
  • uk — Ukrainian
  • be — Belarusian
  • sl — Slovenian
  • et — Estonian
  • lv — Latvian
  • lt — Lithuanian
  • tg — Tajik
  • fa — Persian
  • vi — Vietnamese
  • hy — Armenian
  • az — Azerbaijani
  • eu — Basque
  • hsb — Upper Sorbian
  • mk — Macedonian
  • tn — Tswana
  • xh — Xhosa
  • zu — Zulu
  • af — Afrikaans
  • ka — Georgian
  • fo — Faroese
  • hi — Hindi
  • mt — Maltese
  • se — Northern Sami
  • ga — Irish
  • ms — Malay (macrolanguage)
  • kk — Kazakh
  • ky — Kirghiz
  • sw — Swahili (macrolanguage)
  • tk — Turkmen
  • uz — Uzbek
  • tt — Tatar
  • bn — Bengali
  • pa — Panjabi
  • gu — Gujarati
  • or — Oriya
  • ta — Tamil
  • te — Telugu
  • kn — Kannada
  • ml — Malayalam
  • as — Assamese
  • mr — Marathi
  • sa — Sanskrit
  • mn — Mongolian
  • bo — Tibetan
  • cy — Welsh
  • km — Central Khmer
  • lo — Lao
  • gl — Galician
  • kok — Konkani (macrolanguage)
  • syr — Syriac
  • si — Sinhala
  • iu — Inuktitut
  • am — Amharic
  • tzm — Central Atlas Tamazight
  • ne — Nepali
  • fy — Western Frisian
  • ps — Pushto
  • fil — Filipino
  • dv — Dhivehi
  • ha — Hausa
  • yo — Yoruba
  • quz — Cusco Quechua
  • nso — Pedi
  • ba — Bashkir
  • lb — Luxembourgish
  • kl — Kalaallisut
  • ig — Igbo
  • ii — Sichuan Yi
  • arn — Mapudungun
  • moh — Mohawk
  • br — Breton
  • ug — Uighur
  • mi — Maori
  • oc — Occitan (post 1500)
  • co — Corsican
  • gsw — Swiss German
  • sah — Yakut
  • qut — K'iche' (Guatemala)
  • rw — Kinyarwanda
  • wo — Wolof
  • prs — Dari
  • gd — Scottish Gaelic

For example, de is the tag for German and fr-CA is the tag for Canadian French. Tags are case-insensitive, so fr-ca works as well. So, we can sort Canadian French strings by running the following code:

const collator = new Intl.Collator('fr-ca');  
const sortedLetters = ['ç', 'à', 'c'].sort(collator.compare);  
console.log(sortedLetters);

The Collator constructor can also take an array of locale strings: new Intl.Collator([/* locale strings */]). The array is a priority list rather than a way to combine alphabets: the collator uses the first locale in the list that the runtime supports. Characters from other languages are still sorted, using that locale’s rules for them. For example, we can pass both the Canadian French and German locales:

const collator = new Intl.Collator(['fr-ca', 'de']);  
const sortedLetters = [  
  'Ü', 'ß', 'ç', 'à', 'c'  
].sort(collator.compare);
console.log(sortedLetters);

Then we get ["à", "c", "ç", "ß", "Ü"] from the console.log statement.
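A collator created from an array still resolves to a single locale. As a hedged sketch, resolvedOptions and Intl.Collator.supportedLocalesOf can be used to inspect which one was picked. The exact strings returned depend on the locale data the runtime ships with, so the values in the comments are only what is typically seen.

```javascript
// The array is treated as a priority list of requested locales.
const multiCollator = new Intl.Collator(['fr-ca', 'de']);

// resolvedOptions() reports the single locale that was actually selected.
const resolvedLocale = multiCollator.resolvedOptions().locale;
console.log(resolvedLocale); // typically "fr-CA" when French collation data is available

// supportedLocalesOf() filters a list down to the locales the runtime supports.
const supported = Intl.Collator.supportedLocalesOf(['fr-ca', 'de']);
console.log(supported);
```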

Additional Options

The locale string can also carry Unicode extension keys. The co key selects a collation type, with values such as "big5han", "dict", "direct", "ducet", "gb2312", "phonebk", "phonetic", "pinyin", "reformed", "searchjl", "stroke", "trad", and "unihan", for example de-u-co-phonebk. These specify the collation that we want to compare strings with. However, when an option in the second argument overlaps with a Unicode extension key in the first argument, the option in the second argument takes precedence.

Numeric collation can be enabled by adding the kn key to the locale string in the first argument. For example, if we want to compare numeric strings, then we can write:
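As an illustration of the "phonebk" collation, the sketch below compares standard German ordering with phone book ordering, where Ä is treated like Ae. The sample names are made up, and the result assumes the runtime ships the German phone book collation data.

```javascript
const names = ['Affe', 'Ärger', 'Adam'];

// Standard German collation treats 'Ä' as a variant of 'A'.
const standardCollator = new Intl.Collator('de');
// Phone book collation (co-phonebk) treats 'Ä' like 'Ae'.
const phonebookCollator = new Intl.Collator('de-u-co-phonebk');

const standardOrder = [...names].sort(standardCollator.compare);
const phonebookOrder = [...names].sort(phonebookCollator.compare);

console.log(standardOrder);  // [ 'Adam', 'Affe', 'Ärger' ]
console.log(phonebookOrder); // [ 'Adam', 'Ärger', 'Affe' ]
```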

const collator = new Intl.Collator(['en-u-kn-true']);  
const sortedNums = ['10', '2'].sort(collator.compare);  
console.log(sortedNums);

Then we get ["2", "10"] because kn in the locale string makes the collator compare runs of digits by their numeric value. Without it, "10" would sort before "2".

We can also specify whether uppercase or lowercase letters should sort first with the kf extension key. The possible values are upper, lower, and false, where false means that the locale’s default is used. The same setting is available as the caseFirst property of the options object, and if both are provided, the options property takes precedence. For example, to make uppercase letters sort before lowercase letters, we can write:
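Numeric collation is also handy for strings that mix text and digits. A small sketch with hypothetical file names:

```javascript
const files = ['file10.txt', 'file2.txt', 'file1.txt'];

const numericCollator = new Intl.Collator('en-u-kn-true');
const plainCollator = new Intl.Collator('en');

// With kn, the digit runs 1, 2 and 10 are compared by value.
const numericOrder = [...files].sort(numericCollator.compare);
// Without kn, digits are compared one character at a time.
const plainOrder = [...files].sort(plainCollator.compare);

console.log(numericOrder); // [ 'file1.txt', 'file2.txt', 'file10.txt' ]
console.log(plainOrder);   // [ 'file1.txt', 'file10.txt', 'file2.txt' ]
```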

const collator = new Intl.Collator('en-ca-u-kf-upper');  
const sorted = ['Able', 'able'].sort(collator.compare);  
console.log(sorted);

This sorts the uppercase form of the same word first. When we run console.log, we get ["Able", "able"] because 'Able' starts with an uppercase 'A' and 'able' starts with a lowercase 'a'. On the other hand, if we instead pass in en-ca-u-kf-lower in the constructor like in the code below:

const collator = new Intl.Collator('en-ca-u-kf-lower');  
const sorted = ['Able', 'able'].sort(collator.compare);  
console.log(sorted);

Then after console.log we get ["able", "Able"] because kf-lower means that we sort the same word with lowercase letters before the ones with uppercase letters.

The second argument of the constructor is an options object. The properties it accepts include localeMatcher, usage, sensitivity, ignorePunctuation, numeric, and caseFirst. numeric corresponds to the kn Unicode extension key in the locale string, and caseFirst corresponds to the kf key.

The localeMatcher option specifies the locale matching algorithm to use. The possible values are lookup and best fit, and the default is best fit. lookup follows the BCP 47 Lookup algorithm, which progressively truncates the requested tag (for example, from de-CH to de) until it finds a locale the runtime supports. best fit lets the runtime pick a locale that is at least as suitable as, and possibly more suitable than, the result of lookup.
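To check that numeric and caseFirst really mirror the kn and kf extension keys, we can inspect resolvedOptions on two collators, one configured through the locale string and one through the options object. This is a minimal sketch; the variable names are only illustrative.

```javascript
// Configured through Unicode extension keys in the locale string.
const viaExtension = new Intl.Collator('en-u-kf-upper-kn-true');
// Configured through the options object.
const viaOptions = new Intl.Collator('en', { numeric: true, caseFirst: 'upper' });

const extensionResolved = viaExtension.resolvedOptions();
const optionsResolved = viaOptions.resolvedOptions();

console.log(extensionResolved.numeric, extensionResolved.caseFirst); // true upper
console.log(optionsResolved.numeric, optionsResolved.caseFirst);     // true upper
```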

The usage option specifies whether the Collator is used for sorting or searching for strings. The default option is sort.

The sensitivity option specifies the way that the strings are compared. The possible options are base, accent, case, and variant.

base compares only the base letters, ignoring accents and case. For example, a is not the same as b, but a is the same as á, A, and Ä.

accent means that strings are unequal only if their base letters or accents differ; case is ignored. So a isn’t the same as b or á, but a is the same as A.

The case option specifies that strings that are different in their base letters or case are considered unequal, so a wouldn’t be the same as A and a wouldn’t be the same as c, but a is the same as á.

variant means that strings that are different in the base letter, accent, other marks, or case are considered unequal. For example a wouldn’t be the same as A and a wouldn’t be the same as c. But also a wouldn't be the same as á.
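The four sensitivity levels can be compared side by side. The sketch below uses a hypothetical helper, sameUnder, to record whether a few pairs of letters are treated as equal at each level in an English collator.

```javascript
// Hypothetical helper: true when a and b compare as equal under the given sensitivity.
function sameUnder(sensitivity, a, b) {
  const collator = new Intl.Collator('en', { sensitivity });
  return collator.compare(a, b) === 0;
}

const sensitivityTable = {};
for (const level of ['base', 'accent', 'case', 'variant']) {
  sensitivityTable[level] = {
    accentPair: sameUnder(level, 'a', 'á'), // differs only by accent
    casePair: sameUnder(level, 'a', 'A'),   // differs only by case
    basePair: sameUnder(level, 'a', 'b'),   // differs by base letter
  };
}

console.log(sensitivityTable);
// base:    accentPair true,  casePair true,  basePair false
// accent:  accentPair false, casePair true,  basePair false
// case:    accentPair true,  casePair false, basePair false
// variant: accentPair false, casePair false, basePair false
```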

The ignorePunctuation option specifies whether punctuation should be ignored when comparing strings. It’s a boolean property and the default value is false.
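As a rough sketch of what ignorePunctuation changes, the code below compares a hyphenated and an unhyphenated spelling of the same word with the option turned off and on. The sample strings are only for illustration.

```javascript
const strictCollator = new Intl.Collator('en', { ignorePunctuation: false });
const lenientCollator = new Intl.Collator('en', { ignorePunctuation: true });

// With punctuation considered, the hyphen makes the strings different.
const strictResult = strictCollator.compare('co-op', 'coop');
// With punctuation ignored, the two spellings compare as equal.
const lenientResult = lenientCollator.compare('co-op', 'coop');

console.log(strictResult !== 0);  // true
console.log(lenientResult === 0); // true
```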

We can use the Collator constructor with the second argument in the following way:

const collator = new Intl.Collator('en-ca', {  
  ignorePunctuation: false,  
  sensitivity: "base",  
  usage: 'sort'  
});  
console.log(collator.compare('Able', 'able'));

In the code above, ignorePunctuation: false means that punctuation is taken into account, and sensitivity: "base" means that two strings are different only if their base letters differ. Because 'Able' and 'able' differ only in case, compare returns 0, so console.log prints 0.

We can search for strings as follows:

const arr = ["ä", "ad", "af", "a"];  
const stringToSearchFor = "af";
const collator = new Intl.Collator("fr", {  
  usage: "search"  
});  
const matches = arr.filter((str) => collator.compare(str, stringToSearchFor) === 0);  
console.log(matches);

We set the usage option to search so that the collator is tuned for finding matching strings. When the compare method returns 0, the two strings are considered equal, so we get ["af"] logged when we run console.log(matches).

We can adjust the options for comparing letters, so if we have:

const arr = ["ä", "ad", "ef", "éf", "a"];  
const stringToSearchFor = "ef";
const collator = new Intl.Collator("fr", {  
  sensitivity: 'base',  
  usage: "search"  
});
const matches = arr.filter((str) => collator.compare(str, stringToSearchFor) === 0);
console.log(matches);

Then we get ["ef", "éf"] in our console.log because we set sensitivity to base, which means that letters with the same base letter are considered equal regardless of accent or case.
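The same idea can be wrapped in a small reusable function. The helper below, findLoose, is a hypothetical name for this sketch; it returns every entry that matches the query when accents and case are ignored.

```javascript
// Hypothetical helper: entries of `list` equal to `query` when accents and case are ignored.
function findLoose(list, query, locale = 'fr') {
  const collator = new Intl.Collator(locale, {
    usage: 'search',
    sensitivity: 'base',
  });
  return list.filter((item) => collator.compare(item, query) === 0);
}

const looseMatches = findLoose(['Résumé', 'resume', 'RESUME', 'result'], 'resume');
console.log(looseMatches); // [ 'Résumé', 'resume', 'RESUME' ]
```

Note that compare always checks whole strings, so this finds equal entries rather than substrings.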

Also, we can specify the numeric option to sort numbers. For example, if we have:

const collator = new Intl.Collator(['en-u-kn-false'], {  
  numeric: true  
});  
const sortedNums = ['10', '2'].sort(collator.compare);  
console.log(sortedNums);

Then we get ["2", "10"] because the numeric property in the object in the second argument overrides the kn-false in the first argument.

Conclusion

JavaScript offers a lot of string comparison options for comparing strings that aren’t in English. The Collator constructor in Intl provides many options to let us search for or sort strings in ways that can’t be done with normal comparison operators like double or triple equals. It lets us order numbers, and consider cases, accents, punctuation, or the combination of those features in each character to compare strings. Also, it accepts locale strings with key extensions for comparison.

All of these options together make JavaScript’s Collator constructor a great choice for comparing international strings.
