Dart UTF-8, UTF-16, UTF-32 Encoding & Decoding

This tutorial shows you how to encode and decode characters with UTF-8, UTF-16, and UTF-32 in Dart.

Character encoding defines how a character is represented as bytes. There are several encoding formats, of which UTF-8 is the most commonly used. Other UTF encodings include UTF-16 and UTF-32. The main difference is how many bytes are needed to represent a character: UTF-8 uses one to four bytes, UTF-16 uses two or four bytes, and UTF-32 always uses four bytes. This tutorial gives you examples of how to perform encoding and decoding with those formats in Dart, including how to set the endianness and the BOM (Byte Order Mark).
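
To make the size difference concrete, the sketch below counts how many bytes a few characters need in each encoding. It only uses Dart's core libraries. The helper name byteLengths is made up for this illustration, and the UTF-16 and UTF-32 sizes are derived from the string's code units and code points instead of from a real encoder, so treat it as an approximation of the idea, not as a reference implementation.

```dart
import 'dart:convert';

/// Hypothetical helper: byte lengths of [s] in UTF-8, UTF-16 and UTF-32,
/// all without a BOM.
Map<String, int> byteLengths(String s) {
  return {
    'utf8': utf8.encode(s).length, // 1 to 4 bytes per code point
    'utf16': s.codeUnits.length * 2, // Dart strings are UTF-16 code units
    'utf32': s.runes.length * 4, // always 4 bytes per code point
  };
}

void main() {
  for (final s in ['a', '€', '😀']) {
    print('$s: ${byteLengths(s)}');
  }
}
```

For 'a' the three sizes are 1, 2 and 4 bytes, for '€' they are 3, 2 and 4, and for '😀' all three encodings need 4 bytes.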

Dependencies

Among the UTF encodings, Dart's built-in convert package (dart:convert) only supports UTF-8. For UTF-16 and UTF-32, you can use the utf package. Add the following to the dependencies section of your pubspec.yaml file, then run 'Get dependencies'.

  dependencies:
    utf: 0.9.0+5

Using convert Package

To use Dart's convert package, import the library first by adding the following:

  import 'dart:convert';

To perform encoding, use:

  List<int> utf8.encode(String input)

You only need to pass the string to be encoded.

To decode the bytes into a String, use:

  utf8.decode(List<int> bytes, { bool allowMalformed = false })

If allowMalformed is set to true, invalid or unterminated octet sequences are replaced with the Unicode Replacement Character `U+FFFD` (�). If it's set to false and an invalid sequence exists, a FormatException is thrown.
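
The difference between the two modes is easiest to see with a byte that can never occur in valid UTF-8. The following is a minimal sketch; tryDecodeStrict is a hypothetical helper name introduced only for this example.

```dart
import 'dart:convert';

/// Hypothetical helper: decodes [bytes] strictly and returns null
/// instead of throwing when the input is malformed.
String? tryDecodeStrict(List<int> bytes) {
  try {
    return utf8.decode(bytes);
  } on FormatException {
    return null;
  }
}

void main() {
  // 0xFF is not allowed anywhere in a UTF-8 byte sequence.
  final bytes = [0x61, 0xFF, 0x62];

  // Lenient mode: the bad byte becomes U+FFFD.
  print(utf8.decode(bytes, allowMalformed: true)); // a�b

  // Strict mode (the default): a FormatException is thrown.
  print(tryDecodeStrict(bytes)); // null
}
```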

Here is the usage example:

  List<int> bytes = utf8.encode('www.woolha.com');
  String value = utf8.decode(bytes);
  print('bytes: $bytes');
  print('value: $value');

Output:

  bytes: [119, 119, 119, 46, 119, 111, 111, 108, 104, 97, 46, 99, 111, 109]
  value: www.woolha.com

Using utf Package

Besides UTF-8, the utf package also supports UTF-16 and UTF-32. Below is the list of encoding and decoding functions provided by the utf package.

  List encodeUtf8(String str)
  
  String decodeUtf8(List<int> bytes,
      [int offset = 0,
      int length,
      int replacementCodepoint = UNICODE_REPLACEMENT_CHARACTER_CODEPOINT])
  
  List encodeUtf16(String str)
  List encodeUtf16be(String str, [bool writeBOM = false])
  List encodeUtf16le(String str, [bool writeBOM = false])
  
  String decodeUtf16(
      List<int> bytes,
      [int offset = 0,
      int length,
      int replacementCodepoint = UNICODE_REPLACEMENT_CHARACTER_CODEPOINT]
  )
  
  String decodeUtf16be(
      List<int> bytes,
      [int offset = 0,
      int length,
      bool stripBom = true,
      int replacementCodepoint = UNICODE_REPLACEMENT_CHARACTER_CODEPOINT]
  )
  
  String decodeUtf16le(
      List<int> bytes,
      [int offset = 0,
      int length,
      bool stripBom = true,
      int replacementCodepoint = UNICODE_REPLACEMENT_CHARACTER_CODEPOINT]
  )
  
  List encodeUtf32(String str)
  List encodeUtf32be(String str, [bool writeBOM = false])
  List encodeUtf32le(String str, [bool writeBOM = false])
  
  String decodeUtf32(
      List<int> bytes,
      [int offset = 0,
      int length,
      int replacementCodepoint = UNICODE_REPLACEMENT_CHARACTER_CODEPOINT]
  )
  
  String decodeUtf32be(
      List<int> bytes,
      [int offset = 0,
      int length,
      bool stripBom = true,
      int replacementCodepoint = UNICODE_REPLACEMENT_CHARACTER_CODEPOINT]
  )
  
  String decodeUtf32le(
      List<int> bytes,
      [int offset = 0,
      int length,
      bool stripBom = true,
      int replacementCodepoint = UNICODE_REPLACEMENT_CHARACTER_CODEPOINT]
  )

The functions for the different encodings are similar. For encoding, the only required parameter is the value (String). For UTF-16 and UTF-32, you can choose between BE (Big Endian) and LE (Little Endian) by using the functions with the be and le suffixes respectively. The functions without a suffix use Big Endian and, as the leading BOM bytes in the output below show, write the BOM. UTF-8 has no suffixed variants because it is read byte by byte regardless of the CPU architecture, so byte order does not apply. The functions with the be and le suffixes also take an optional parameter:

  • writeBOM: determines whether the BOM (Byte Order Mark) should be written.
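
To show what endianness and the BOM mean at the byte level, here is a hand-rolled sketch that does not use the utf package at all. The function encodeUtf16Sketch and its named parameters are hypothetical and exist only for this illustration; it relies on ByteData from dart:typed_data to write each 16-bit code unit in the requested byte order.

```dart
import 'dart:typed_data';

/// Hypothetical helper, not part of the utf package: encodes [s] as UTF-16
/// bytes in the given byte order, optionally prefixed with the BOM (U+FEFF).
List<int> encodeUtf16Sketch(String s,
    {Endian endian = Endian.big, bool writeBom = false}) {
  // Dart strings are already sequences of UTF-16 code units.
  final units = [if (writeBom) 0xFEFF, ...s.codeUnits];
  final data = ByteData(units.length * 2);
  for (var i = 0; i < units.length; i++) {
    // The endian argument decides which of the two bytes comes first.
    data.setUint16(i * 2, units[i], endian);
  }
  return data.buffer.asUint8List();
}

void main() {
  // Big Endian with BOM
  print(encodeUtf16Sketch('wo', writeBom: true));
  // → [254, 255, 0, 119, 0, 111]

  // Little Endian without BOM
  print(encodeUtf16Sketch('wo', endian: Endian.little));
  // → [119, 0, 111, 0]

  // Little Endian with BOM
  print(encodeUtf16Sketch('wo', endian: Endian.little, writeBom: true));
  // → [255, 254, 119, 0, 111, 0]
}
```

The BOM is simply the code point U+FEFF encoded like any other code unit, which is why it appears as 254, 255 in Big Endian and 255, 254 in Little Endian.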

For decoding, UTF-16 and UTF-32 also have the BE and LE variants. You need to pass the bytes (List<int>) as the first argument. The optional parameters are:

  • offset: an offset into a list of bytes.
  • length: limit the length of the values to be decoded.
  • stripBom: whether to strip the leading BOM.
  • replacementCodepoint: the code point used to replace invalid sequences. Defaults to 0xfffd (U+FFFD).
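
The roles of offset, length and stripBom can be illustrated with a small decoder written from scratch. This is only a sketch under stated assumptions: decodeUtf16Sketch is a hypothetical function, it requires a null-safe Dart SDK, it counts offset and length in bytes (which may differ from how the utf package interprets them), and it does not implement replacementCodepoint.

```dart
import 'dart:typed_data';

/// Hypothetical helper, not part of the utf package: decodes UTF-16 bytes.
/// [offset] and [length] are counted in bytes (an assumption of this sketch).
String decodeUtf16Sketch(List<int> bytes,
    {int offset = 0,
    int? length,
    Endian endian = Endian.big,
    bool stripBom = true}) {
  final end = length == null ? bytes.length : offset + length;
  final data = ByteData.sublistView(Uint8List.fromList(bytes), offset, end);

  // Read two bytes at a time; a trailing odd byte is ignored.
  final units = <int>[
    for (var i = 0; i + 1 < data.lengthInBytes; i += 2)
      data.getUint16(i, endian),
  ];

  // Drop a leading BOM (U+FEFF) when asked to.
  if (stripBom && units.isNotEmpty && units.first == 0xFEFF) {
    units.removeAt(0);
  }
  return String.fromCharCodes(units);
}

void main() {
  // Little Endian BOM followed by "woo"
  final bytes = [255, 254, 119, 0, 111, 0, 111, 0];

  print(decodeUtf16Sketch(bytes, endian: Endian.little)); // woo
  print(decodeUtf16Sketch(bytes,
      endian: Endian.little, offset: 4, length: 4)); // oo
}
```

With stripBom set to false, the first call would return a four-character string that starts with the invisible character U+FEFF.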

The example below applies the encoding functions above to the same string and then decodes each result. The output follows the code.

  import 'package:utf/utf.dart';

  void main() {
    var text = 'woolha.com';
    var _8 = encodeUtf8(text);
    print('8: $_8');

    var _16 = encodeUtf16(text);
    var _16Le = encodeUtf16le(text);
    var _16LeBom = encodeUtf16le(text, true);
    print('16: $_16'); // Big Endian, with BOM
    print('16LE: $_16Le'); // Little Endian
    print('16LE - BOM: $_16LeBom'); // Little Endian, writeBOM = true

    var _32 = encodeUtf32(text);
    var _32Le = encodeUtf32le(text);
    var _32LeBom = encodeUtf32le(text, true);
    print('32: $_32'); // Big Endian, with BOM
    print('32LE: $_32Le'); // Little Endian
    print('32LE - BOM: $_32LeBom'); // Little Endian, writeBOM = true

    print('8 - value: ${decodeUtf8(_8)}');

    print('16 - value: ${decodeUtf16(_16)}');
    print('16LE - value: ${decodeUtf16le(_16Le)}');
    print('16LE - BOM - value: ${decodeUtf16le(_16LeBom)}');

    print('32 - value: ${decodeUtf32(_32)}');
    print('32LE - value: ${decodeUtf32le(_32Le)}');
    print('32LE - BOM - value: ${decodeUtf32le(_32LeBom)}');
  }

Output:

  8: [119, 111, 111, 108, 104, 97, 46, 99, 111, 109]
  16: [254, 255, 0, 119, 0, 111, 0, 111, 0, 108, 0, 104, 0, 97, 0, 46, 0, 99, 0, 111, 0, 109]
  16LE: [119, 0, 111, 0, 111, 0, 108, 0, 104, 0, 97, 0, 46, 0, 99, 0, 111, 0, 109, 0]
  16LE - BOM: [255, 254, 119, 0, 111, 0, 111, 0, 108, 0, 104, 0, 97, 0, 46, 0, 99, 0, 111, 0, 109, 0]
  32: [0, 0, 254, 255, 0, 0, 0, 119, 0, 0, 0, 111, 0, 0, 0, 111, 0, 0, 0, 108, 0, 0, 0, 104, 0, 0, 0, 97, 0, 0, 0, 46, 0, 0, 0, 99, 0, 0, 0, 111, 0, 0, 0, 109]
  32LE: [119, 0, 0, 0, 111, 0, 0, 0, 111, 0, 0, 0, 108, 0, 0, 0, 104, 0, 0, 0, 97, 0, 0, 0, 46, 0, 0, 0, 99, 0, 0, 0, 111, 0, 0, 0, 109, 0, 0, 0]
  32LE - BOM: [255, 254, 0, 0, 119, 0, 0, 0, 111, 0, 0, 0, 111, 0, 0, 0, 108, 0, 0, 0, 104, 0, 0, 0, 97, 0, 0, 0, 46, 0, 0, 0, 99, 0, 0, 0, 111, 0, 0, 0, 109, 0, 0, 0]
  8 - value: woolha.com
  16 - value: woolha.com
  16LE - value: woolha.com
  16LE - BOM - value: woolha.com
  32 - value: woolha.com
  32LE - value: woolha.com
  32LE - BOM - value: woolha.com

You can see that different encodings and endiannesses produce different output bytes. For decoding, it is equally important to use the function that matches the encoding and endianness of the input in order to get the correct value.
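
What goes wrong with a mismatched byte order, and how a BOM helps to avoid it, can be sketched without the utf package. Both helpers below, sniffUtf16Endian and firstCodeUnit, are hypothetical names used only for this illustration, and the sniffing covers UTF-16 only (the UTF-32 Little Endian BOM starts with the same two bytes, so a real detector would need to check more).

```dart
import 'dart:typed_data';

/// Hypothetical helper: guesses the UTF-16 byte order from a leading BOM,
/// returning null when no BOM is present.
Endian? sniffUtf16Endian(List<int> bytes) {
  if (bytes.length < 2) return null;
  if (bytes[0] == 0xFE && bytes[1] == 0xFF) return Endian.big;
  if (bytes[0] == 0xFF && bytes[1] == 0xFE) return Endian.little;
  return null;
}

/// Hypothetical helper: reads the first two bytes as one UTF-16 code unit
/// in the given byte order.
int firstCodeUnit(List<int> bytes, Endian endian) =>
    ByteData.sublistView(Uint8List.fromList(bytes)).getUint16(0, endian);

void main() {
  final le = [119, 0]; // 'w' encoded as UTF-16 Little Endian

  // Correct byte order: 0x77, which is 'w'.
  print(firstCodeUnit(le, Endian.little).toRadixString(16)); // 77

  // Wrong byte order: 0x7700, an unrelated character.
  print(firstCodeUnit(le, Endian.big).toRadixString(16)); // 7700

  // With a BOM, the byte order can be detected before decoding.
  print(sniffUtf16Endian([255, 254, 119, 0]) == Endian.little); // true
  print(sniffUtf16Endian(le)); // null
}
```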