Character encoding substitution

I have a file loader that I use for reading proprietary format files generated on Windows and on OS/2. It works, but occasionally I run into a character encoding problem.

Text fields in the file header can contain user-entered text. A problem arises if a user enters the Greek letter mu in one of these fields: the byte used for mu in the OS/2 encoding is illegal in UTF-8, and unless I strip it out, SplitString chokes on the illegal byte as I parse the text further.
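For illustration, here is a Python sketch (not the loader's actual code) showing why that byte trips up a strict UTF-8 parser — a lone 0xE6, the mu byte in several legacy single-byte code pages, is not a valid UTF-8 sequence:

```python
# A lone 0xE6 byte (mu in several legacy single-byte code pages) starts
# a 3-byte UTF-8 sequence, so strict decoding fails when an ordinary
# ASCII byte follows it.
data = b"Bandwidth (\xe6m): 25"   # hypothetical header field

try:
    data.decode("utf-8")          # strict by default
    print("decoded cleanly")
except UnicodeDecodeError as err:
    print("illegal byte:", hex(data[err.start]))
```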

As a quick fix for this particular issue, I can simply strip out any suspicious-looking byte (checking for char2num(byte) < 0 seems to do the trick), which at least prevents user-entered text from affecting the reading of the rest of the file. But I suspect there must be a better way to deal with this.
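In Igor, char2num returns a signed byte, so the < 0 test catches everything above 0x7F. A Python sketch of the same quick fix (the function name is mine, for illustration):

```python
# Drop every byte with the high bit set -- the equivalent of the
# char2num(byte) < 0 test on a signed byte in Igor.
def strip_high_bytes(raw):
    return bytes(b for b in raw if b < 0x80)

print(strip_high_bytes(b"Width (\xe6m)"))  # b'Width (m)'
```

This is lossy, of course: the mu disappears entirely rather than being converted.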

So, does anyone know which encoding an OS/2 application is most likely to use? It looks like a single-byte character set. Is there an existing tool for converting between character sets? SplitString can handle valid Unicode, right?

The loader is for Opus files created by Bruker infrared spectrometers.

This page says

"The OS/2 operating system supports an encoding by the name of Code page 1004 (CCSID 1004) or "Windows Extended". This mostly matches code page 1252, with the exception of certain C0 control characters being replaced by diacritic characters."

So it is probably Windows-1252.

You can use the ConvertTextEncoding function with sourceTextEncoding=3 (Windows-1252) and destTextEncoding=1 (UTF-8).
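Outside Igor, the same conversion is a two-step reinterpretation. A Python sketch, assuming the header bytes really are Windows-1252:

```python
# Reinterpret the raw bytes as Windows-1252, then re-encode as UTF-8.
# 0xB5 is the micro sign in Windows-1252.
raw = b"25 \xb5m"
utf8 = raw.decode("cp1252").encode("utf-8")
print(utf8)   # b'25 \xc2\xb5m'
```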



Thanks, that is helpful.

On further inspection, it seems that the text uses an extended ASCII character set (something like code page 437), where the mu/micro symbol is character 0xE6.
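A quick Python check confirms the mapping: in code page 437, byte 0xE6 decodes to the micro sign, U+00B5 (which is a different byte value from Windows-1252, where the micro sign is 0xB5 — hence ConvertTextEncoding's cp1252 mode doesn't fit these files):

```python
# In code page 437, byte 0xE6 is the micro sign (U+00B5).
ch = bytes([0xE6]).decode("cp437")
print(ch, hex(ord(ch)))   # µ 0xb5
```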

I don't know how I missed ConvertTextEncoding!

Unfortunately, it doesn't look like I can use ConvertTextEncoding to map extended-ASCII variants to UTF-8. However, I can use it to clean up any illegal bytes by passing 1 for both the source and destination encodings. That's probably the best option, because I can't tell files with contemporary character encodings apart from these files created by a legacy system.
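A Python sketch of that clean-up idea (the function name is mine): a UTF-8 to UTF-8 round trip that substitutes a replacement character for any illegal byte instead of failing on it, so valid UTF-8 passes through untouched:

```python
# UTF-8 -> UTF-8 round trip: valid text is preserved, while any
# illegal byte is replaced with U+FFFD instead of raising an error.
def clean_utf8(raw):
    return raw.decode("utf-8", errors="replace").encode("utf-8")

print(clean_utf8(b"Width (\xe6m)"))   # illegal 0xE6 becomes U+FFFD
```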


Thanks, Thomas, that's good to know.

For my application, this was sufficient to deal with the unlikely occurrence of such characters in legacy files:

// Preserves UTF-8 and cleans up single-byte extended character sets
str = ConvertTextEncoding(str, 1, 1, 2, 1)