Character Encoding Menu in 2014

This post is about a UI feature that I wish no one had to use. Happily, it is indeed almost unused. Still, I made it more usable for the cases when it is used. (The change was driven more by code removal than by usability, though.) Anne asked me to document the situation, so here goes.

The Feature and Its Use Cases

For historical reasons, HTML can be delivered over the network using various character encodings. The browser decodes the incoming HTML data to Unicode and needs to know which encoding to use for decoding. The encoding can be declared using a byte order mark (BOM) inside the HTML file, using a <meta> tag inside the HTML file or using the Content-Type HTTP header outside the HTML file. Or the browser can encounter content whose encoding is undeclared, in which case the browser needs to guess. Traditionally, the guessing is based on the browser localization, but Firefox now first tries to guess based on the top-level domain of the URL. For some locales (Japanese, Russian and Ukrainian in Firefox), the guessing is based on the content of the file rather than on the locale alone.
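To make the declaration mechanisms above concrete, here is a minimal Python sketch. It is not how Firefox implements this. The function names are made up for illustration, the <meta> prescan is left out, and the `fallback` parameter merely stands in for whatever the browser would guess from the top-level domain, the locale or the content. The precedence used (BOM first, then the HTTP header) is my understanding of the HTML Standard, not something stated above.

```python
from email.message import Message
from typing import Optional

# Byte order marks and the encodings they signal.
BOMS = [
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xfe\xff", "utf-16be"),
    (b"\xff\xfe", "utf-16le"),
]


def sniff_bom(data: bytes) -> Optional[str]:
    """Return the encoding signalled by a leading BOM, if there is one."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


def charset_from_content_type(header: Optional[str]) -> Optional[str]:
    """Extract the charset parameter from a Content-Type header value."""
    if not header:
        return None
    msg = Message()
    msg["Content-Type"] = header
    # get_content_charset() returns the lowercased charset or None.
    return msg.get_content_charset()


def pick_encoding(
    data: bytes,
    content_type: Optional[str] = None,
    fallback: str = "windows-1252",  # assumption: stand-in for the browser's guess
) -> str:
    """Pick an encoding: BOM, then HTTP header, then the fallback guess."""
    return sniff_bom(data) or charset_from_content_type(content_type) or fallback
```

For example, `pick_encoding(b"<p>", "text/html; charset=Shift_JIS")` takes the label from the header, whereas the same call with a UTF-8 BOM at the start of the data ignores the header. The Character Encoding menu exists for the case where all of this produces the wrong answer.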

The Character Encoding menu allows the user to reload the document with a different encoding to be used for decoding. Specifically, the use cases are:

Sadly, telemetry shows that the second use case is now more common than the first one. On the bright side, telemetry also shows that the menu is almost entirely unused. It is unused in more than 99.9% of Firefox sessions in the locale where it is used the most (Traditional Chinese) and it is unused in more than 99.99% of Firefox sessions in most locales. In a way, it is sad to even have to improve a feature like this instead of just removing it. I hope we can at least avoid adding it to Firefox OS.

The Old Menu

The old menu implementation was very old. It was created on September 21, 1999. Back then, RDF was still a thing at Netscape, so the menu’s data came from an RDF data source. It seems that not everyone liked RDF even back then. By November 23, 1999, the documentation for the class implementing the data source said “God, our GUI programming disgusts me.”

The general attitude back then was to support a lot of encodings even without a strongly demonstrated Web compatibility need. Besides the browser, both the menu and the general encoding back end were used by the Composer HTML editor and the Mail/News client included in the Mozilla application suite. The code base gained a lot of encodings. Some were actually needed for the Web. Some were needed for email. Some were needed for converting Unicode to non-Unicode font encodings for rendering with pre-Unicode text rendering APIs. Some (in particular EUC-TW, it seems) were added to deal with file paths and the clipboard on some Unix flavors. Some encodings seem to have been added without a strong use case just because a standard existed. It also happened that the same encoding was added multiple times (e.g. TIS-620 and ISO-8859-11) or with slight variations under multiple names.

The large number of encodings led to attempts to manage it: first by organizing the encodings into submenus by region, and then by alleviating the problems the submenus created by showing the most recent choices at the top level and even providing editability (complete with a dedicated dialog!) for pinning some items to the top level.

Over time, some encodings were removed as completely useless and some encodings were removed or hidden as security problems, but overall, in the beginning of 2014, the menu was pretty much the way it was in 1999. By early 2014, Georgian GEOSTD8 had already been removed as not relevant to the Web and UTF-7, UTF-16, MacHebrew and MacArabic had been removed as cross-site scripting (XSS) hazards. Here’s the structure of the old menu from the beginning of 2014:

The old menu has so many problems I’m not even sure where to begin. Here are some problems. The list is not necessarily exhaustive.

The New Menu

At the end of 2014, the menu looks like this:

Clearly, the menu looks much better now. Particular things that are nice about the new menu include:

I took the following steps to come up with the new menu:

I’m not quite happy with the menu. In particular, I suspect that some “(ISO)” entries might be pretty useless, specifically the ones for Arabic, Baltic, Cyrillic and Greek. The Greek one is actually the fallback encoding used by the Greek Firefox localization and also the Greek Chrome localization, but it’s possible that this is a legacy arising from anti-Microsoft sentiment that doesn’t actually have much to do with the legacy content out there. The differences between Windows and ISO Greek are so small that chances are that guessing the ISO encoding works well enough with Windows-encoded legacy content, but guessing the Windows encoding and hiding the ISO encoding would be even more successful. In the case of the Arabic and Baltic ISO encodings, I doubt that they are used often enough to make it worthwhile to have them in the menu, considering that readers of Arabic, Cyrillic or Baltic text might waste time choosing the wrong option. Research into these matters would be appreciated.
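To get a feel for how small the difference between Windows and ISO Greek is, one can compare the two byte by byte. The following is a rough sketch using Python's built-in codecs `iso8859_7` and `cp1253`, which I am assuming are close enough to what browsers call ISO-8859-7 and windows-1253. The helper name is made up, and only the range 0xA0 to 0xFF is compared, since that is where the Greek letters live.

```python
def differing_bytes(enc_a, enc_b, start=0xA0, stop=0x100):
    """List the byte values in [start, stop) that decode differently."""
    diff = []
    for value in range(start, stop):
        raw = bytes([value])
        # Bytes that are unassigned in a codec become U+FFFD instead of raising.
        a = raw.decode(enc_a, errors="replace")
        b = raw.decode(enc_b, errors="replace")
        if a != b:
            diff.append(value)
    return diff


diff = differing_bytes("iso8859_7", "cp1253")
print([hex(value) for value in diff])
```

The list that comes out is short and consists of punctuation, symbols and a single accented capital. The unaccented Greek letters sit at the same byte values in both encodings, which is why guessing the wrong one of the two still yields mostly readable text.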

Also, I am uncomfortable with having ISO-2022-JP in the menu. It has a structure that looks like an XSS hazard on its face. However, it has leaked from email to the Web, so it has some usage, and I have neither been able to develop nor seen anyone else develop a proof-of-concept attack using it. If you want to get it out of the menu, the best bet is to show a proof-of-concept attack.
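To illustrate what I mean by the structure, here is a small Python sketch using the built-in `iso2022_jp` codec. It is not an attack and does not demonstrate one. It only shows that in ISO-2022-JP an escape sequence switches the decoder into a mode where bytes from the ASCII range are read in pairs as Japanese characters. The variable names are just for illustration.

```python
ESC_TO_JIS = b"\x1b$B"    # escape sequence: switch to JIS X 0208
ESC_TO_ASCII = b"\x1b(B"  # escape sequence: switch back to ASCII

# The payload between the escapes is the two ASCII-range bytes "0!".
payload = b"0!"
raw = ESC_TO_JIS + payload + ESC_TO_ASCII

# Read as ASCII, the payload is just a digit and an exclamation mark.
as_ascii = payload.decode("ascii")

# Read as ISO-2022-JP, the same two bytes form a single kanji.
decoded = raw.decode("iso2022_jp")

print(repr(as_ascii), repr(decoded))
```

In other words, what a given run of ASCII-range bytes means depends on the decoder state left behind by earlier escape sequences, which is the property that makes me uneasy.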

Update 2021-01-18

See also a sequel.