Mystery encoding

· jollygood's blog

The other day I bought some food and got a receipt with some interesting characters:

[Image: receipt with funky encoding]

The receipt is written in Finnish, but the Finnish alphabet does not use the 'Σ' letter. The receipt should use 'ä' instead of 'Σ': Yhteensä means Total and Pääte means Terminal in Finnish. I also found that the receipt uses a dash-like symbol '─' in place of the capital letter 'Ä'.

It looks like the receipt was written using one character encoding and read using another. Since the characters map one-to-one, we can assume that both encodings use the same number of bytes for these characters.[1]

Let's find out which combination of character encodings would produce this receipt. We will use the ä-to-Σ conversion to find candidate encodings, and then verify them using the Ä-to-'─' conversion.[2]

# First try: UTF-8

UTF-8 is one of the most common character encodings today, so let's start with that. It encodes both 'ä' and 'Σ' with two bytes. We'll compare UTF-8 with two other encodings that use two bytes for these characters: UTF-16 and the obsolete UCS-2.

| Encoding | 'ä' | 'Σ' |
|----------|-------|-------|
| UTF-8 | C3 A4 | CE A3 |
| UTF-16 | 00 E4 | 03 A3 |
| UCS-2 | 00 E4 | 03 A3 |

No combination of these encodings produces a match: none of the byte sequences for 'ä' is the same as any of the byte sequences for 'Σ'.
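The comparison above can be reproduced with a few lines of Python. This is only a sketch: the helper name `hex_table` is made up for illustration, and Python has no separate UCS-2 codec, so only UTF-8 and UTF-16 are checked (for these two characters UCS-2 gives the same bytes as UTF-16). The big-endian variant `utf-16-be` is used so that no byte order mark is added.

```python
# Encode both characters in each encoding and show the bytes as hex.
CHARS = ["ä", "Σ"]
ENCODINGS = ["utf-8", "utf-16-be"]  # "utf-16-be" avoids a byte order mark


def hex_table(chars, encodings):
    """Return {encoding: {character: hex string of its encoded bytes}}."""
    return {
        enc: {ch: ch.encode(enc).hex().upper() for ch in chars}
        for enc in encodings
    }


table = hex_table(CHARS, ENCODINGS)
for enc, row in table.items():
    print(enc, row)

# A mix-up would require the bytes written for 'ä' in one encoding
# to equal the bytes that another encoding reads as 'Σ'.
matches = [
    (written, read)
    for written in ENCODINGS
    for read in ENCODINGS
    if table[written]["ä"] == table[read]["Σ"]
]
print("matching pairs:", matches)
```

An empty list of matching pairs means that none of these encodings can explain the receipt.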

# Second try: single-byte encodings

There are a great many single-byte encodings, so I have limited the following table to the ones that are relevant to the receipt above. A dash means that the encoding does not contain the character.

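| Encoding | 'ä' | 'Σ' |
|-------------------|-----|-----|
| ISO 8859-1 | E4 | - |
| ISO 8859-2 | E4 | - |
| ISO 8859-3 | E4 | - |
| ISO 8859-4 | E4 | - |
| ISO 8859-9 | E4 | - |
| ISO 8859-10 | E4 | - |
| ISO 8859-13 | E4 | - |
| ISO 8859-14 | E4 | - |
| ISO 8859-15 | E4 | - |
| ISO 8859-16 | E4 | - |
| Windows-1250 | E4 | - |
| Windows-1252 | E4 | - |
| Windows-1254 | E4 | - |
| Windows-1257 | E4 | - |
| Windows-1258 | E4 | - |
| IBM Code Page 437 | 84 | E4 |
| IBM Code Page 860 | - | E4 |
| IBM Code Page 862 | - | E4 |
| IBM Code Page 863 | - | E4 |
| IBM Code Page 865 | 84 | E4 |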

The ISO 8859-x and Windows-125x encodings listed above use E4 for 'ä', and the IBM Code Pages listed above use E4 for 'Σ'. It also turns out that these same ISO 8859-x and Windows-125x encodings use C4 for 'Ä', while the IBM Code Pages listed above use C4 for '─'.[3] This second match confirms that these are probably the encodings we are looking for.
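The same search can be sketched in Python by writing each test character with one encoding and reading it back with another. The codec names below are Python's names for the encodings in the table, and the helpers `misread` and `explains_receipt` are hypothetical names chosen for this example.

```python
# Encodings the receipt may have been written with (Python codec names).
WRITERS = [
    "latin_1", "iso8859_2", "iso8859_3", "iso8859_4", "iso8859_9",
    "iso8859_10", "iso8859_13", "iso8859_14", "iso8859_15", "iso8859_16",
    "cp1250", "cp1252", "cp1254", "cp1257", "cp1258",
]
# Encodings the receipt may have been read with.
READERS = ["cp437", "cp860", "cp862", "cp863", "cp865"]


def misread(text, written_as, read_as):
    """Encode text with one encoding and decode the bytes with another."""
    return text.encode(written_as).decode(read_as)


def explains_receipt(written_as, read_as):
    """True if the pair turns 'ä' into 'Σ' and 'Ä' into '─' (U+2500)."""
    try:
        return (
            misread("ä", written_as, read_as) == "Σ"
            and misread("Ä", written_as, read_as) == "\u2500"
        )
    except UnicodeError:
        # The character is missing from the writer, or the bytes
        # are not valid in the reader.
        return False


candidates = [
    (written, read)
    for written in WRITERS
    for read in READERS
    if explains_receipt(written, read)
]
for written, read in candidates:
    print(f"written as {written}, read as {read}")
```

Checking both characters in `explains_receipt` mirrors the two-step approach in the text: the ä-to-Σ conversion finds candidates and the Ä-to-'─' conversion verifies them.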

# Conclusion

The receipt was likely written using one of the ISO 8859-x or Windows-125x encodings listed above, and read using one of the IBM Code Pages listed above.

Based on the fact that this happened in Finland, my best guess is that the receipt was written using Windows-1252 "Western Europe" and read using IBM Code Page 865 "Nordic" or 437 "OEM United States" (the original IBM PC character encoding).
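To illustrate this guess, the following sketch garbles the two words from the receipt and then repairs them again. It assumes the Windows-1252 and Code Page 865 pairing (`cp1252` and `cp865` in Python), which is only my best guess; `misread` and `repair` are hypothetical helper names.

```python
def misread(text, written_as="cp1252", read_as="cp865"):
    """Simulate the receipt: write with one encoding, read with another."""
    return text.encode(written_as).decode(read_as)


def repair(text, written_as="cp1252", read_as="cp865"):
    """Undo the mix-up: recover the bytes, then decode them as intended."""
    return text.encode(read_as).decode(written_as)


for word in ["Yhteensä", "Pääte", "Ä"]:
    garbled = misread(word)
    print(word, "->", garbled, "->", repair(garbled))
```

Passing `read_as="cp437"` gives the same result for these words, which is why the receipt alone cannot tell the two code pages apart.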

Mystery solved.


  1. If the number of bytes differs between the encodings, we typically get a one-to-many mapping. For example, UTF-8 encodes 'ä' as two bytes (C3 A4). If we try to decode this with the single-byte encoding Windows-1252, we get two characters: 'Ã¤' ↩︎

  2. There are many dash-like characters, so we can only use this as an indicator, not as evidence. ↩︎

  3. Unicode code point U+2500, "Box Drawings Light Horizontal" ↩︎