Character Encodings Demystified
If you have ever tried moving data between database systems, you have probably run into a character encoding issue. Encoding errors usually appear during data loading and tend to be vague, such as 'invalid byte sequence for encoding'. These errors are hard to understand because the data looks correct in your old system and looks correct on disk, yet it just won't load. Frustration often leads to sacrificing data integrity in order to get most of the data into the system. Fortunately, with some planning and a basic understanding of character encodings, it is possible to achieve a perfect data load every time.
This article includes the following sections:
- What is an Encoding?
- ASCII
- ASCII Extensions
- Unicode and UCS
- UTF-8
- Loading Your Data
- Finding Your Data Encoding
- Converting Your Data
- Dealing With Corrupt Data
- Converting With Web External Tables
- External Table Encoding
- Handling NULL Characters
- Data Delimiters
- Database Encoding
What is an encoding?
An encoding is a specification for transforming the textual characters you see on your screen into bytes of data that can be stored electronically, such as on a computer disk. Character encodings have been important since the time of the telegraph, when it was first necessary to encode text so that it could be transmitted over a wire and accurately decoded on the other end. Early encoding systems were very simple since only the English alphabet and basic punctuation were required.
As computing advanced and spread around the world, many different encodings were invented to store characters from different languages and regions. Many of these encodings were created independently, without taking existing encodings into account. This led to a lot of overlap and conflict between encodings; two encoding systems might store the same character differently. It also led to confusion when users tried to read data generated elsewhere. If they didn't know the encoding the data was originally stored in, it might be hard or even impossible for them to read it.
ASCII
ASCII, the American Standard Code for Information Interchange, is a character encoding standard first published in 1963. It was developed from telegraphic codes and specified byte encodings for 128 characters: the English alphabet, digits, basic punctuation, and control codes. The encoding was widely adopted, but ultimately suffered from only being able to encode those 128 characters. It could never be used for non-English data, nor even for English data that required special symbols. By contrast, Unicode has assigned over 100,000 unique characters to date and can eventually support more than a million.
Conveniently but confusingly, the ASCII encoding overlaps with newer encodings such as UTF-8. These newer encodings use the same byte values for the 128 ASCII characters and add new sequences for the additional characters they support. For example, the byte 0x42 represents the character 'B' in both ASCII and UTF-8. This often causes confusion because people will look at their data and assume it is ASCII when it is really stored in another encoding that happens to overlap ASCII.
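You can see this overlap for yourself. The following quick check is only an illustration, using the same style of Python 2 one-liners and the iconv utility covered later in this article; the byte 0x42 decodes to the same character whether it is treated as ASCII or as UTF-8:
$ # decode the same single byte first as ASCII, then as UTF-8
$ python -c "print '\x42'" | iconv --from-code ASCII --to-code UTF8
B
$ python -c "print '\x42'" | iconv --from-code UTF8 --to-code UTF8
B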
Since ASCII was the standard in the English-speaking world for so long, it is also common to hear people refer to any binary encoded text as 'ASCII data'. For this reason it is always important to clarify and double check the actual encoding before loading what may appear to simply be 'ASCII data'.
ASCII Extensions
At the same time ASCII was being standardized for encoding English data, developers in other countries were creating their own standards for handling their native characters. Encodings were independently developed for languages such as Chinese, Japanese, Korean, Arabic, and many others. Most of these encodings were simply language-specific extensions of ASCII.
One thing most of these ASCII extension encodings have in common is that they are single byte encodings, meaning they store every character as a single byte. This is a severe limitation because it caps each encoding at 256 distinct characters (only 128 beyond the original ASCII set). That cap is the reason each language needed a separate encoding; the characters from all languages simply could not fit into one 256 character encoding. Examples of these single byte ASCII extension encodings include Latin1 through Latin10, KOI8, Windows-874, Windows-1250, and many more.
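To see why the source encoding matters, here is a quick illustration using the same tools that appear later in this article (encoding names can vary slightly between iconv implementations). The single byte 0xC4 decodes to a completely different character depending on which single-byte encoding you assume:
$ # the same byte interpreted under two different single-byte encodings
$ python -c "print '\xc4'" | iconv --from-code LATIN1 --to-code UTF8
Ä
$ python -c "print '\xc4'" | iconv --from-code WINDOWS-1253 --to-code UTF8
Δ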
EBCDIC is another single byte encoding, but it is not an extension of ASCII. It was invented by IBM in the 1960s and contains most of the characters included in ASCII, but the byte encodings for the characters are different.
Legacy databases and data storage systems almost always use one of these single byte encoding schemes. Encoding errors when loading data into PostgreSQL or Greenplum are almost always caused by a mismatch between the database encoding and the encoding of the data on disk.
Unicode and UCS
In the 1980s two different international organizations recognized that the only way to solve the problems associated with having many overlapping encodings was to create a new encoding standard that would be a superset of all the encodings created before it. These two organizations were the International Organization for Standardization (ISO) and the Unicode Consortium. In the early 1990s the two groups decided to merge their respective Universal Character Set (UCS) and Unicode projects to create a single standard that could be used by the international community. To this day both the Unicode and UCS standards exist and are published independently, but they are functionally equivalent.
At the present time, Unicode and UCS have assigned unique names and numbers, called code points, to over 100,000 characters. Unicode has consequently become the de facto standard for encoding data around the world, and virtually all new data systems built today support it as the default. Since Unicode is a superset of the character sets of all the older encodings, it should always be your first choice for accurately storing your data.
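As a quick illustration of these names and numbers, we can ask Python's unicodedata module (using the same Python 2 interpreter style as the examples later in this article) for the name and code point of the Greek Delta character used throughout this article:
$ python
>>> # names and numeric code points come straight from the Unicode character database
>>> import unicodedata
>>> print unicodedata.name(u'\u0394'), hex(ord(u'\u0394'))
GREEK CAPITAL LETTER DELTA 0x394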
UTF-8
Unicode assigns a unique 'code point' to each character it supports. For example, the code point for the character 'B' is U+0042 and the code point for the Euro symbol '€' is U+20AC. The Unicode Consortium publishes code charts with a complete listing of Unicode characters and their code points.
Using the Unicode standard we can easily map any character to a numeric code point, but before we can actually work with data we must also have a method for encoding strings of these code points into sequences of bytes that we can store on disk. The most widely used standard for transforming code points into bytes is UTF-8. Unlike the ASCII extension encodings, which represent each character as a single byte, UTF-8 is capable of representing many more characters because it uses between 1 and 4 bytes to encode each character. The distinction between Unicode and UTF-8 is very important: Unicode is a standard for mapping characters to numeric code points, and UTF-8 is a standard for writing out strings of Unicode code points as bytes.
Let us examine this distinction in more detail by looking at the characters 'B', '€', and 'Δ'. The following table shows these characters, their Unicode code points, and their UTF-8 byte encodings.
| Character | Unicode Code Point | UTF-8 Byte Sequence | Name |
|---|---|---|---|
| B | U+0042 | 42 | English Capital B |
| Δ | U+0394 | ce 94 | Greek Delta |
| € | U+20AC | e2 82 ac | Euro Symbol |
First, we can use Perl to print these characters to the screen using their Unicode code points:
$ perl -C -e 'print "\x{0042}\x{0394}\x{20ac}"'
BΔ€
By printing them to the screen we only see the translation from code point to character; to see the underlying byte encodings we can use the Unix 'od' tool. od takes a stream of bytes as input and prints what it reads in a variety of representations. We will use it to see the byte sequence for each character:
$ perl -C -e 'print "\x{0042}\x{0394}\x{20ac}"' | od -t x1 -c
0000000 42 ce 94 e2 82 ac
B Δ ** € ** **
0000006
Notice that this output corresponds correctly to our table above and that ‘B’ is represented by one byte, ‘Δ’ by two bytes, and ‘€’ by three bytes. Instead of using od to examine the output of our Perl script we could just as easily use it to examine the contents of a data file before loading it into a database.
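For example, assuming a hypothetical data file named data.txt that we are about to load, its first bytes can be inspected in exactly the same way:
$ # data.txt is a hypothetical file we are preparing to load
$ head -c 32 data.txt | od -t x1 -c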
As a final example we will also use Python to print the same character string using Unicode code points or their corresponding byte sequences:
$ python
>>> print u'\u0042 \u0394 \u20ac'
B Δ €
>>> print '\x42 \xce\x94 \xe2\x82\xac'
B Δ €
Loading Your Data
By now it should be clear that UTF-8 should be the preferred encoding for your data. Using any other encoding for your database will prevent you from ever storing characters outside that encoding. UTF-8 will store all the characters your data set contains today and give you the maximum future flexibility as you add new data sources to your environment.
With a firm understanding of what character encodings are and how they evolved you should be ready to start tackling real data problems. Character encoding issues almost always show up when trying to move data from one system to another. When migrating data it may have to be decoded and encoded three times or more: when unloading from your source system, when transferring to a landing zone or staging server, and when being loaded into your target system. At each point in the process it is possible to introduce encoding problems.
Finding your data encoding
As stated in the introduction, the number one encoding error encountered when loading data is 'invalid byte sequence for encoding'. As we know from the sections above, this error means the data on disk contains a byte sequence that is not a valid UTF-8 encoding of any Unicode code point. In turn, this means the data is either not stored as UTF-8 or it is corrupt.
When people run into this error I always ask "How is your data encoded?" and the most typical response is "How do I find out?". It turns out that finding the data encoding for arbitrary data is a very hard problem. Joel Spolsky said it best in his blog post on Unicode:

> It does not make sense to have a string without knowing what encoding it uses. … There Ain't No Such Thing As Plain Text. If you have a string, in memory, in a file, or in an email message, you have to know what encoding it is in or you cannot interpret it or display it to users correctly.
This difficult problem is made harder by the encoding overlap issue described earlier. You may print out a small sample of your data, or even open it in a text editor, and see that most of it displays correctly. This is because the first 128 characters of Unicode use the same byte sequences as the corresponding characters in ASCII, which are also shared by most single-byte ASCII extensions. If your data is mostly ASCII then most of it will display correctly in a UTF-8 environment, regardless of the actual encoding. However, if you are seeing errors when loading the data, you can rest assured that somewhere your data contains characters that are not encoded as UTF-8.
If you aren’t sure what encoding your data uses, it is often helpful to check the system encoding of the machine the data came from:
$ cat /etc/sysconfig/i18n
LANG="en_US.UTF-8"
You can also use the od tool to examine the byte sequences in your data file so you can manually check for suspicious characters. In text editors these characters are usually displayed as a blank space or a question mark, or simply not displayed at all. When we run our earlier 'od' example with a character that cannot be displayed, we see the octal byte values '327 212' in the character column instead of a printable character.
$ perl -C -e 'print "\x{05ca}"' | od -c -t x1
0000000 327 212
d7 8a
0000002
If you use od to examine your data file and see output like the above, it is a sign that your data contains bytes outside the ASCII range and that the encoding deserves a closer look. You can also use the 'file' utility to guess the encoding, but it is less reliable than manually inspecting the byte sequences:
$ file yourfilename
yourfilename: UTF-8 Unicode English text
To summarize, the most reliable way to find your data encoding is to check the encoding of the source system it came from. If you aren’t able to do that then you can also manually examine the byte sequences or use the ‘file’ utility to check.
Encoding conversion
The Unicode Consortium provides mapping tables for translating characters from any other encoding into Unicode. As long as you know the current encoding of your data you can use these tables to convert it without any ambiguity or data loss.
The standard tool for converting data between different character encodings is iconv, which is powered by the GNU libiconv library. You specify the input and output encodings you want and it will use the Unicode mapping tables to convert the data for you.
For example, Windows-1253 is a Windows code page for Greek characters. We can see from the code page layout that in this encoding a Greek Capital Delta is represented by the byte 0xC4. From our earlier examples we know the Unicode code point for this character is U+0394 and it is represented by the UTF-8 byte sequence ‘ce 94’. A look at the Unicode mapping table for this encoding confirms that is the correct translation. The following snippet shows the conversion working as expected; we print out the Windows-1253 byte sequence for a Greek Delta (0xC4) and see that it is converted to the UTF-8 byte sequence ‘ce 94’.
$ python -c "print '\xc4'" | iconv --from-code WINDOWS-1253 --to-code UTF8 | od -c -t x1
0000000 Δ ** \n
ce 94 0a
0000003
Instead of passing the input over STDIN we could just as easily give iconv the name of a data file to convert.
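For example, assuming the Windows-1253 data sits in a hypothetical file named source_1253.txt, the converted copy can be written out with a simple shell redirect:
$ # source_1253.txt and source_utf8.txt are hypothetical file names
$ iconv --from-code WINDOWS-1253 --to-code UTF8 source_1253.txt > source_utf8.txt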
Dealing with corruption
It is also possible that the data from our source system is corrupt; in other words, it contains byte sequences that are invalid in the source encoding and cannot be converted to Unicode. For example, if we look again at the Windows-1253 mapping table we can see that the byte 0xAA is undefined. If this byte appears in a source file that we believe to be encoded in Windows-1253, then we would say the source file is corrupt. See what happens if we try to use our conversion example above on data containing this byte:
$ python -c "print '\xc4\xaa'" | iconv --from-code WINDOWS-1253 --to-code UTF8 | od -c -t x1
iconv: (stdin):1:1: cannot convert
0000000 Δ **
ce 94
0000002
Iconv provides us with a couple of options for dealing with this problem. First, we can use the ‘-c’ option to silently discard the invalid bytes:
$ python -c "print '\xc4\xaa'" | iconv -c --from-code WINDOWS-1253 --to-code UTF8 | od -c -t x1
0000000 Δ ** \n
ce 94 0a
0000003
Since that byte was undefined in the source encoding this may be the preferred solution, but it does introduce some data loss into our system. If keeping track of these bytes is important to us, we can instead tell iconv how to represent bytes that cannot be converted with the '--byte-subst' option. The invalid byte can be formatted in any way we like; in the following example we replace it with the string "<0xaa>" to record the existence of that byte in the source data:
$ python -c "print '\xc4\xaa'" | iconv --from-code WINDOWS-1253 --to-code UTF8 --byte-subst="<0x%x>" | od -c -t x1
0000000 Δ ** < 0 x a a > \n
ce 94 3c 30 78 61 61 3e 0a
0000011
$ python -c "print '\xc4\xaa'" | iconv --from-code WINDOWS-1253 --to-code UTF8 --byte-subst="<0x%x>"
Δ<0xaa>
It is also possible that you have corrupt UTF-8 data. If you get invalid byte sequence errors when loading data you believe is already encoded in UTF-8, you can use iconv to fix that data too. For example, the byte 0x80 on its own is invalid in UTF-8. We can specify both our source and target encodings as UTF8 and, as above, iconv will either strip the invalid bytes with -c or replace them with --byte-subst:
$ python -c "print '\xce\x94\x80'" | iconv --from-code UTF8 --to-code UTF8 --byte-subst="U+%04X"
ΔU+0080
In the above example we denote the invalid byte in the output data, but we could also simply strip it with the '-c' option if that was desired instead.
Converting data with web external tables
It is a common practice in Greenplum to load data through an External Table. This is the fastest way to get data into Greenplum since it takes advantage of Greenplum’s scatter/gather streaming technology to parallelize the loading process. This works great if your data is already in the correct format, but if your data needs to be converted you can use a Web External Table to convert it on the fly, during the load.
For this example we will be using a simple table with just two columns, an integer and a text field:
CREATE TABLE my_simple_table(
id int,
str text
);
First we will create a Unicode data file that contains only one row, with the columns separated by a '|' delimiter. Our row contains a Greek Delta as an example of a character outside of the ASCII range.
$ perl -C -e 'print "1|Greek Delta: \x{0394}\n"' > unicode.txt
$ cat unicode.txt
1|Greek Delta: Δ
Next we will convert our Unicode data file to Windows-1253 so we can practice loading it. Note that the Greek Delta now displays as a question mark, since it has been converted to a non-UTF8 byte that a UTF-8 terminal cannot render:
$ iconv --from-code UTF8 --to-code WINDOWS-1253 unicode.txt > windows1253.txt
$ cat windows1253.txt
1|Greek Delta: ?
We can create a regular external table to try loading this data, but we will see an encoding error:
CREATE EXTERNAL TABLE my_ext_table(
id int,
str text
) LOCATION(
'gpfdist://localhost:8080/windows1253.txt'
) FORMAT 'text' (
DELIMITER '|'
);
demo=# select * from my_ext_table;
ERROR: invalid byte sequence for encoding "UTF8": 0xc40a (seg0 slice1 gptext:40000 pid=4668)
DETAIL: External table my_ext_table, line 1 of file gpfdist://localhost:8080/windows1253.txt
To fix this we simply change to a Web External Table that uses iconv to convert the data as described earlier:
CREATE EXTERNAL WEB TABLE my_web_table(
id int,
str text
) EXECUTE
'iconv --from-code WINDOWS-1253 --to-code UTF8 windows1253.txt'
ON MASTER
FORMAT 'TEXT' (
DELIMITER '|'
);
demo=# select * from my_web_table;
id | str
----+----------------
1 | Greek Delta: Δ
(1 row)
In this way our data is converted on the fly and loaded into the database in UTF8 format without an intermediary file ever having been written to disk.
External table encoding parameter
You could also accomplish the above using the encoding parameter when you create your external table:
CREATE EXTERNAL TABLE my_w1253_table(
id int,
str text
) LOCATION(
'gpfdist://localhost:8080/windows1253.txt'
) FORMAT 'text' (
DELIMITER '|'
) ENCODING 'WINDOWS-1253';
demo=# select * from my_w1253_table;
id | str
----+----------------
1 | Greek Delta: Δ
(1 row)
This works for our example, but it would not work if your data file contained any corruption, or byte sequences that could not be converted directly. To deal with corruption you need to use the Web External Table example with the correct iconv parameters to either drop or reformat the bad byte sequences.
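For instance, a minimal sketch that reuses the earlier web external table definition but silently drops any unconvertible bytes with the -c flag could look like this (my_web_table_clean is a hypothetical table name):
-- my_web_table_clean is a hypothetical name; -c makes iconv drop bytes it cannot convert
CREATE EXTERNAL WEB TABLE my_web_table_clean(
id int,
str text
) EXECUTE
'iconv -c --from-code WINDOWS-1253 --to-code UTF8 windows1253.txt'
ON MASTER
FORMAT 'TEXT' (
DELIMITER '|'
);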
Handling null characters in Greenplum
The NULL character 0x00 cannot be stored in text fields in either PostgreSQL or Greenplum. If it appears in your data you will see an error like the following:
demo=# select E'\0';
ERROR: invalid byte sequence for encoding "UTF8": 0x00
This is a major problem because U+0000 is a valid Unicode control character. Programs like iconv will not detect a problem when they encounter it so the above methods shown for dealing with corrupt data will not correct this error. The NULL character is also valid in many other database systems, so it is quite common for it to appear in data that you are trying to migrate to Greenplum from another system.
To deal with this issue you must either strip or translate these characters before loading using the ‘tr’ utility or something similar. As an example we can use the technique we learned earlier to examine a string that contains the null character:
$ perl -C -e 'print "\x{0042}\x{0000}\x{0394}"' | od -c -t x1
0000000 B \0 Δ **
42 00 ce 94
0000004
Now we can use tr with the ‘-d \\0’ option to strip the null character from the string.
$ perl -C -e 'print "\x{0042}\x{0000}\x{0394}"' | tr -d \\0 | od -c -t x1
0000000 B Δ **
42 ce 94
0000003
We could also combine this with our web external table example above to convert the encoding of our data, fix any corruption, and strip all null characters at the same time!
CREATE EXTERNAL WEB TABLE my_web_table(
id int,
str text
) EXECUTE
E'iconv --from-code WINDOWS-1253 --to-code UTF8 --byte-subst="<0x%x>" windows1253.txt | tr -d \\\\0'
ON MASTER
FORMAT 'TEXT' (
DELIMITER '|'
);
demo=# select * from my_web_table;
id | str
----+----------------
1 | Greek Delta: Δ
(1 row)
Data Delimiters
If you are importing data into Greenplum then it is almost certainly delimited, meaning the column values are separated by a character known as the 'delimiter'. Common examples are comma delimited or pipe delimited data files:
1,col a,col b
2|col a|col b
A Greenplum limitation that sometimes catches people off guard is that your delimiter must be a single byte, regardless of the encoding. Even if you have Unicode data, your delimiter must be a Unicode character that has a single byte UTF-8 encoding. This means the Greek Delta from our examples above, which is encoded by the byte sequence 'ce 94', is not a valid delimiter in a Greenplum data file. If you try to use a multi-byte delimiter you will see the following error:
CREATE EXTERNAL TABLE my_delta_ext_table(
id int,
str text
) LOCATION(
'gpfdist://localhost:8080/windows1253.txt'
) FORMAT 'text' (
DELIMITER 'Δ'
);
ERROR: delimiter must be a single ASCII character, or 'off'
This can be a severe limitation because your delimiter must not appear in your data. Especially if you are loading textual data it can be tricky to find a single-byte character that doesn’t appear in your data. One option is to always use the ‘CSV’ format type and quote your column values:
"1"|"col a | contains the delimiter"|"col b"
However, depending on the system you are exporting from, quoting column values may not be an option. It should also be noted that loading data in 'CSV' format is somewhat slower than loading data in the 'TEXT' format (which does not allow quoting column values). If it is not possible to quote your column values, or if load speed is important, then you must carefully pick your delimiter based on your data. Control characters in the byte range 0x01 - 0x1f are usually good choices since they rarely appear in valid textual data, as in the sketch below.
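As a minimal sketch, the following uses the 0x01 control character as the delimiter. The file ctrl_delim.txt and table my_ctrl_table are hypothetical names, and the example assumes your Greenplum version accepts a PostgreSQL escape string literal (E'\x01') in the DELIMITER clause:
$ # ctrl_delim.txt is a hypothetical test file delimited by the 0x01 control character
$ perl -C -e 'print "1\x{01}Greek Delta: \x{0394}\n"' > ctrl_delim.txt
-- my_ctrl_table is a hypothetical table; E'\x01' denotes the 0x01 control character
CREATE EXTERNAL TABLE my_ctrl_table(
id int,
str text
) LOCATION(
'gpfdist://localhost:8080/ctrl_delim.txt'
) FORMAT 'text' (
DELIMITER E'\x01'
);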
Changing your database encoding
When people have trouble loading data, their first reaction is often to change the database encoding to match the encoding of the data they are trying to load. This may let you load your data quickly, but it will forever prevent you from loading characters outside of that encoding. In the modern big data era you should be encouraging rapid incorporation of new data sources whenever possible, regardless of the encoding of that data. For this reason I always recommend setting your database encoding to UTF8 and using the techniques described above to convert your data to match it.
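As a minimal sketch (my_utf8_db is a hypothetical database name), a UTF8 database can be created and its encoding verified like this:
-- my_utf8_db is a hypothetical database name
CREATE DATABASE my_utf8_db WITH ENCODING 'UTF8' TEMPLATE template0;
-- after connecting to the new database, confirm its encoding
SHOW server_encoding;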
Conclusion
Unicode and UTF8 are modern standards for data encoding that are supersets of all encodings that have preceded them. When bringing new data into your environment always take the time to understand the encoding of the system the data is coming from. Once you know the encoding of your source data, converting it to UTF8 is a relatively simple and straightforward process. Using the same techniques you can also identify and fix corruption of your source data before or even during the load process.