auto-detect file Unicode encoding using BOM
April 2nd, 2006

As anyone who has worked with Unicode probably knows, Unicode files can be written using an array of different encodings. A special byte sequence known as the BOM (Byte Order Mark) is (usually) put at the beginning of the file to declare its encoding explicitly.
The other day I needed a Python routine that could read files in all these encodings properly. Simply using codecs.open doesn't work as expected: it does not always strip the BOM, as someone already found out. After reading the documentation for Mark Pilgrim's chardet module and a related ASPN recipe, I came up with the following code:
import codecs

def detect_unicode_encoding(fd):
    '''Peeks inside the file stream to guess the correct variant of Unicode
    encoding via the Byte Order Mark (BOM) tag.

    Requires a binary-mode file stream that supports backward and forward
    positioning via .seek(). Leaves the read position just past the BOM,
    or resets it to 0 (start of the file) if no BOM is found.'''
    # The UTF-32 BOMs must be tested before the UTF-16 ones: the UTF-32LE
    # BOM starts with the same two bytes as the UTF-16LE BOM.
    encodings_map = [
        (3, codecs.BOM_UTF8, 'UTF-8'),
        (4, codecs.BOM_UTF32_LE, 'UTF-32LE'),
        (4, codecs.BOM_UTF32_BE, 'UTF-32BE'),
        (2, codecs.BOM_UTF16_LE, 'UTF-16LE'),
        (2, codecs.BOM_UTF16_BE, 'UTF-16BE'),
    ]
    buf = fd.read(4)
    for (offset, bom, name) in encodings_map:
        if buf[:offset] == bom:
            fd.seek(offset)  # skip the byte order mark
            return name
    fd.seek(0)  # return to the beginning - no BOM found
    return None
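As a quick sanity check, you can exercise the routine against in-memory byte streams (io.BytesIO) instead of real files; any seekable binary stream will do. The function is repeated here only so the snippet runs on its own:

```python
import codecs
import io

def detect_unicode_encoding(fd):
    '''Same detection routine as above, repeated so this snippet is self-contained.'''
    encodings_map = [
        (3, codecs.BOM_UTF8, 'UTF-8'),
        (4, codecs.BOM_UTF32_LE, 'UTF-32LE'),
        (4, codecs.BOM_UTF32_BE, 'UTF-32BE'),
        (2, codecs.BOM_UTF16_LE, 'UTF-16LE'),
        (2, codecs.BOM_UTF16_BE, 'UTF-16BE'),
    ]
    buf = fd.read(4)
    for (offset, bom, name) in encodings_map:
        if buf[:offset] == bom:
            fd.seek(offset)  # skip the byte order mark
            return name
    fd.seek(0)  # return to the beginning - no BOM found
    return None

# A UTF-8 stream with a BOM: the encoding is reported and the BOM is
# skipped, so a subsequent read starts at the payload.
sample = io.BytesIO(codecs.BOM_UTF8 + b'hello')
print(detect_unicode_encoding(sample))  # UTF-8
print(sample.read())                    # b'hello'

# A stream with no BOM: None is returned and the position is reset to 0.
plain = io.BytesIO(b'no bom here')
print(detect_unicode_encoding(plain))   # None
print(plain.read())                     # b'no bom here'
```

Note that this is only a sketch of how I test it; in real use you would pass a file opened with open(filename, 'rb') and hand the returned encoding name to a decoder.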