A Weird Python 3 Unicode Failure

Mar 30, 2015

The following code can fail on my system:

    from os import listdir
    for f in listdir('.'):
        print(f)

Why?

UnicodeEncodeError: 'utf-8' codec can't encode character '\udce9' in position 13: surrogates not allowed

What?

I have a file with the name b'Latin1 file: \xe9'. This is a filename with an "é" encoded using Latin-1 (which is byte value \xe9). Python attempts to decode it using the current locale's encoding, which is UTF-8. Unfortunately, a lone \xe9 byte is not valid UTF-8, so Python works around this by substituting a surrogate character, '\udce9' (this is the surrogateescape error handler from PEP 383). So, I get a variable f which can still be used to open the file.
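The decoding step can be reproduced without touching the filesystem. This is a minimal sketch, assuming the filesystem encoding is UTF-8 with the surrogateescape handler (which is what os.fsdecode uses on Unix-like systems); the variable names are only for illustration.

```python
raw = b'Latin1 file: \xe9'

# Strict UTF-8 decoding rejects the lone 0xE9 byte.
try:
    raw.decode('utf-8')
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# The surrogateescape handler maps the undecodable byte 0xE9 to U+DCE9.
name = raw.decode('utf-8', 'surrogateescape')
print(ascii(name))  # 'Latin1 file: \udce9'

# Encoding with the same handler restores the original bytes,
# which is why the name can still be passed to open().
roundtrip = name.encode('utf-8', 'surrogateescape')
print(roundtrip == raw)  # True
```

The round trip is the point of the design: the string is not meant to be displayed, only to carry the original bytes back to the operating system.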

However, I cannot print the value of this f, because when Python attempts to encode it back to UTF-8 for printing, an error is triggered.
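The encoding failure, and one possible way around it, can be sketched as follows. The helper name printable is hypothetical, and replacing the bad character with '?' is just one choice; the 'backslashreplace' handler would be another.

```python
name = 'Latin1 file: \udce9'  # the string listdir() returns for the Latin-1 name

# Strict UTF-8 encoding refuses lone surrogates; this is the error print() hits.
try:
    name.encode('utf-8')
    error = None
except UnicodeEncodeError as exc:
    error = exc

def printable(s):
    """Hypothetical helper: replace unencodable characters with '?'."""
    return s.encode('utf-8', 'replace').decode('utf-8')

print(printable(name))  # Latin1 file: ?
```

The result is safe to print but is no longer a valid filename, so the original f should be kept for opening the file.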

I can understand what is happening, but it's just a mess. [1]

Here is a complete example:

    f = open('Latin1 file: é'.encode('latin1'), 'w')
    f.write("My file")
    f.close()

    from os import listdir
    for f in listdir('.'):
        print(f)

On a modern Unix system (i.e., one that uses UTF-8 as its system encoding), this will fail.

§

A good essay on the failure of the Python 3 transition is out there to be written.

Update 10 Aug 2019: Still failing on Python 3.6.5

[1] ls on the same directory generates a placeholder character, which is a decent fallback.
