The other day I bought some food and got a receipt with some interesting characters:
The receipt is written in Finnish, but the Finnish alphabet does not use the 'Σ' letter. The receipt should use 'ä' instead of 'Σ': Yhteensä means Total and Pääte means Terminal in Finnish. I also found that the receipt uses a dash-like symbol '─' in place of the capital letter 'Ä'.
It looks like the receipt was written using one character encoding and read using another. Since the characters map one-to-one, we can assume that both encodings use the same number of bytes for these characters. [1]
Let's find out which combination of character encodings would produce this receipt. We will use the ä-to-Σ conversion to find candidate encodings, and then verify them using the Ä-to-'─' conversion. [2]
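To make the byte-count argument concrete, here is a minimal Python sketch. The codec pairs in it are my own illustrative choices (Latin-1 read as Windows-1251, and UTF-8 read as Windows-1252), not the encodings behind the receipt. It only shows that two single-byte encodings swap characters one-to-one, while a width mismatch changes the length of the text.

```python
# Illustrative codec pairs only; these are not the encodings behind the receipt.
word = "Yhteensä"

# Single-byte written, single-byte read: every character maps to exactly one character.
same_width = word.encode("latin-1").decode("cp1251")

# Two bytes written for 'ä' (UTF-8), read one byte at a time (Windows-1252).
different_width = word.encode("utf-8").decode("cp1252")

print(len(word), repr(same_width), len(same_width))
print(len(word), repr(different_width), len(different_width))
```

In the first case the word keeps its length and only the last letter changes. In the second case the single 'ä' turns into two characters, which is not what the receipt shows.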
First try: UTF-8
UTF-8 is one of the most common character encodings today, so let's start with that. It encodes both 'ä' and 'Σ' with two bytes. We'll compare UTF-8 with two other encodings that use two bytes for these characters: UTF-16 and the obsolete UCS-2.
Encoding | 'ä' | 'Σ' |
---|---|---|
UTF-8 | C3A4 | CEA3 |
UTF-16 | 00E4 | 03A3 |
UCS-2 | 00E4 | 03A3 |
No combination of these encodings will produce a match between 'ä' and 'Σ'.
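The table can be reproduced with a few lines of Python. This is just a sketch: the helper name `hex_bytes` is made up for illustration, and I use the `utf-16-be` codec so that no byte order mark is prepended. Python has no separate UCS-2 codec, but for these two characters UCS-2 gives the same bytes as UTF-16.

```python
CHARS = ("ä", "Σ")
ENCODINGS = ("utf-8", "utf-16-be")


def hex_bytes(ch, encoding):
    """Return the encoded bytes of ch as an uppercase hex string."""
    return ch.encode(encoding).hex().upper()


# Build the same rows as the table above.
table = {enc: {ch: hex_bytes(ch, enc) for ch in CHARS} for enc in ENCODINGS}
for enc, row in table.items():
    print(enc, row)

# A match would need some byte sequence for 'ä' to equal some byte sequence for 'Σ'.
a_umlaut_bytes = {table[enc]["ä"] for enc in ENCODINGS}
sigma_bytes = {table[enc]["Σ"] for enc in ENCODINGS}
common = a_umlaut_bytes & sigma_bytes
print("shared byte sequences:", common)
```

The intersection is empty, which is the "no match" result stated above.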
Second try: single-byte encodings
There are a great many single-byte encodings, so I have limited the following table to the ones that are interesting with regard to the receipt above:
Encoding | 'ä' | 'Σ' |
---|---|---|
ISO 8859-1 | E4 | - |
ISO 8859-2 | E4 | - |
ISO 8859-3 | E4 | - |
ISO 8859-4 | E4 | - |
ISO 8859-9 | E4 | - |
ISO 8859-10 | E4 | - |
ISO 8859-13 | E4 | - |
ISO 8859-14 | E4 | - |
ISO 8859-15 | E4 | - |
ISO 8859-16 | E4 | - |
Windows-1250 | E4 | - |
Windows-1252 | E4 | - |
Windows-1254 | E4 | - |
Windows-1257 | E4 | - |
Windows-1258 | E4 | - |
IBM Code Page 437 | 84 | E4 |
IBM Code Page 860 | - | E4 |
IBM Code Page 862 | - | E4 |
IBM Code Page 863 | - | E4 |
IBM Code Page 865 | 84 | E4 |
The Windows-125x and ISO 8859-x encodings listed above use E4 for 'ä' and the IBM Code Pages listed above use E4 for 'Σ'.
It also turns out that these same ISO 8859-x and Windows-125x encodings use C4 for 'Ä', while the IBM Code Pages listed above use C4 for '─' [3]. This match confirms that these are probably the encodings that we are looking for.
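The same search can be sketched in Python by encoding with one codec and decoding with another. The lists `WRITERS` and `READERS` and the helper `misread` are names I made up for this illustration. The codec names are Python's standard library spellings of the encodings in the table.

```python
from itertools import product

# Python codec names for the encodings in the table above.
WRITERS = [
    "iso8859_1", "iso8859_2", "iso8859_3", "iso8859_4", "iso8859_9",
    "iso8859_10", "iso8859_13", "iso8859_14", "iso8859_15", "iso8859_16",
    "cp1250", "cp1252", "cp1254", "cp1257", "cp1258",
]
READERS = ["cp437", "cp860", "cp862", "cp863", "cp865"]


def misread(text, written_as, read_as):
    """Encode text with one codec and decode the bytes with another.

    Returns None if either codec cannot represent the data.
    """
    try:
        return text.encode(written_as).decode(read_as)
    except (UnicodeEncodeError, UnicodeDecodeError):
        return None


# Keep the pairs that turn 'ä' into 'Σ' and 'Ä' into U+2500.
matches = [
    (w, r)
    for w, r in product(WRITERS, READERS)
    if misread("ä", w, r) == "Σ" and misread("Ä", w, r) == "\u2500"
]

print(len(matches), "of", len(WRITERS) * len(READERS), "pairs match")
```

Checking both conversions in one filter mirrors the approach in the text: ä-to-Σ finds the candidates and Ä-to-'─' confirms them.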
Conclusion
The receipt was likely written using one of the ISO 8859-x or Windows-125x encodings listed above, and read using one of the IBM Code Pages listed above.
Based on the fact that this happened in Finland, my best guess is that the receipt was written using Windows-1252 "Western Europe" and read using IBM Code Page 865 "Nordic" or 437 "OEM United States" (the original IBM PC character encoding).
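As a quick sanity check of this guess, the two words from the receipt can be pushed through the suspected pair in Python. This is only a sketch of my best guess, not a statement about what the cash register software actually does. The function names `misprint` and `repair` are hypothetical.

```python
def misprint(text, written_as="cp1252", read_as="cp865"):
    """Simulate the receipt: bytes written in one encoding, read in another."""
    return text.encode(written_as).decode(read_as)


def repair(garbled, written_as="cp1252", read_as="cp865"):
    """Undo the mix-up by reversing the two steps."""
    return garbled.encode(read_as).decode(written_as)


for word in ("Yhteensä", "Pääte"):
    garbled = misprint(word)
    print(word, "->", garbled, "->", repair(garbled))
```

Swapping `read_as` to `"cp437"` gives the same result for these words, which is why the guess cannot be narrowed down further from the receipt alone.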
Mystery solved.
1. If the number of bytes differs between encodings, we typically get a one-to-many mapping. For example, UTF-8 encodes 'ä' as two bytes (C3 A4). If we try to decode this with the single-byte encoding Windows-1252, we get two characters: 'Ã¤'.
2. There are many dash-like characters, so we can only use this as an indicator, not as evidence.
3. Unicode code point 0x2500 "Box Drawings Light Horizontal".