Testing for possible Unicode – ANSI code page compatibility

While working on a recent ExifTool remoting task, the question arose whether a given Unicode file name could be safely represented in the system ANSI code page. The file name can be passed to the application directly only if it is fully convertible.

If the file name cannot be converted to the current code page, an application that does not use the CreateFileW() API will not be able to open the file under that name. If the file system supports old-style DOS 8.3 file names, the application should fall back to the short name instead.

// Requires <windows.h> and <wchar.h>.
BOOL IsConvertibleText( PCWSTR sFile )
{
    BOOL bRet = FALSE;
    if ( sFile )
    {
        // Size in bytes of the ANSI string, including the terminating null.
        int cbAnsi = WideCharToMultiByte( CP_ACP, 0, sFile, -1, NULL, 0, NULL, NULL );
        if ( cbAnsi != 0 )
        {
            PSTR a = (PSTR)HeapAlloc( GetProcessHeap(), 0, cbAnsi );
            if ( a )
            {
                if ( WideCharToMultiByte( CP_ACP, 0, sFile, -1, a, cbAnsi, NULL, NULL ) )
                {
                    // Size in characters (not bytes) of the round-tripped string.
                    int cchWide = MultiByteToWideChar( CP_ACP, 0, a, -1, NULL, 0 );
                    if ( cchWide != 0 )
                    {
                        PWSTR w = (PWSTR)HeapAlloc( GetProcessHeap(), 0, cchWide * sizeof(WCHAR) );
                        if ( w )
                        {
                            // The last argument is a character count, not a byte count.
                            if ( MultiByteToWideChar( CP_ACP, 0, a, -1, w, cchWide ) )
                                // Ordinal comparison: the round trip must reproduce the name exactly.
                                if ( wcscmp( sFile, w ) == 0 )
                                    bRet = TRUE;
                            HeapFree( GetProcessHeap(), 0, w );
                        }
                    }
                }
                HeapFree( GetProcessHeap(), 0, a );
            }
        }
    }
    return bRet;
}

For those using C#:

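bool IsConvertibleText( string sFile )
{
    // Encoding.Default is the system ANSI code page on .NET Framework;
    // on .NET Core and later it is UTF-8, so it no longer tests ANSI convertibility there.
    byte[] b = Encoding.Default.GetBytes( sFile );
    string s = Encoding.Default.GetString( b );
    // Ordinal comparison: the round trip must reproduce the name exactly.
    return sFile.Equals( s, StringComparison.Ordinal );
}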
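
When IsConvertibleText() reports that a name is not convertible, the fallback to the 8.3 short name mentioned above could look roughly like the following. This is only a sketch: the helper name GetShortPath is made up for illustration, and it assumes the file already exists, since GetShortPathNameW() fails otherwise. The non-Windows branch is there only so the snippet compiles elsewhere.

```cpp
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: returns the short (8.3) form of an existing path,
// or an empty string if it cannot be obtained.
std::wstring GetShortPath( const std::wstring& longPath )
{
#ifdef _WIN32
    // First call: ask for the required buffer size, including the terminating null.
    DWORD needed = GetShortPathNameW( longPath.c_str(), NULL, 0 );
    if ( needed == 0 )
        return std::wstring();          // e.g. the file does not exist

    std::wstring shortPath( needed, L'\0' );
    // Second call: on success the return value excludes the terminating null.
    DWORD written = GetShortPathNameW( longPath.c_str(), &shortPath[0], needed );
    if ( written == 0 || written >= needed )
        return std::wstring();

    shortPath.resize( written );
    return shortPath;
#else
    // Short names are a Windows file system feature; nothing to return elsewhere.
    (void)longPath;
    return std::wstring();
#endif
}
```

If short name generation is disabled on the volume, the returned path may still contain the long name, so it seems prudent to run the result through IsConvertibleText() again before handing it to the application.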

See also: Post in CPAN::Forum
Win32API::File Unicode support bug

  1. Jean
    May 20th, 2010 at 08:33 | #1

    Hello
    I use ExifTool under Windows (XP, 7).
    I create a .bat to be able to get the metadata in a .txt with the following syntax:
    (the .bat is created with WCHAR characters)

    exiftool.exe -EXIF:All myfile.jpg >myfile.txt

    It works perfectly with ‘western’ file names, but as soon as I use, for example, Cyrillic file names, nothing happens.

  2. May 20th, 2010 at 10:12 | #2

    Jean,
    the behaviour you are describing is due to a limitation of the Perl interpreter, which can only handle file names that can be represented in the current ANSI code page of your Windows system. Most likely you are using code page 1252, which does not contain Cyrillic letters. Basically you have the following options:

    1. Change your system code page to one that supports Cyrillic. This will have many side effects and might impact your ability to open files with western accented characters.
    2. Use the Windows command line to rename the file, including the folder name, to one that can be safely represented in your code page. After you are done with ExifTool, rename it back and move it to the original location. This is a complicated but safe approach if you are planning to modify the file.
    3. Do not provide a file name to ExifTool and tell it to read data from stdin instead. Then use the type command to read the file. E.g. type Кирилица.jpg | exiftool -. This is guaranteed to work in all cases, but it is difficult to use the same technique for writing.
  3. Jean
    May 20th, 2010 at 17:15 | #3

    Christian
    Thank you for your answer, I did not know that ‘type’ command, it works fine.
    Now I rename/copy the files to temp and ExifTool can read them.
    But some information is written in Cyrillic (see for example Keywords and By-line) and is not read correctly by ExifTool. Is this the normal behaviour due to Perl?

    Current IPTC Digest : 9a42ed6cec8fd5344ec946c1cb15501e
    Application Record Version : 2
    Keywords : Àíäðåé, Èâàí, Òàðàñ
    By-line : Ïåòðîâ Ïåòð
    IPTC Digest : 9a42ed6cec8fd5344ec946c1cb15501e
    Image Width : 1704
    Image Height : 2272
    Encoding Process : Baseline DCT, Huffman coding

  4. May 20th, 2010 at 18:21 | #4

    Again, did you output this data to a text file and open it as UTF-8 in a suitable text editor?
    Is there a chance the data has been written incorrectly?
    Many third-party programs do not handle UTF-8 IPTC correctly. They might just use the system default code page. Please verify by writing Cyrillic text using the -@ argfile of ExifTool.

    - Christian

  5. Jean
    May 20th, 2010 at 19:08 | #5

    Yes, the output is written to a text file, and I read it with Notepad (Notepad reads those characters perfectly).
    The metadata should be correctly written, since I can read it with ImageMagick.

    (I have attached the file in the ExifTool forum)

  6. May 20th, 2010 at 21:47 | #6

    Jean,
    Text is in UTF-8 format, it reads out fine. Try:

    exiftool -location IMG_0650_web.jpg>location.txt

    then open the text file with a UTF-8 aware editor. It should show “St. Isaac’s Cathedral (Исаакиевский Собор)”. This should solve the problem.

    Christian

  7. Jean
    May 21st, 2010 at 07:45 | #7

    Yes, the text file is correct :-)
    But I need to read it. I tried fread, fgets and fgetws,
    and I still get incorrect characters.

  8. May 21st, 2010 at 12:13 | #8

    I don’t know whether your text-reading API supports UTF-8 directly. If it doesn’t, you should be able to read the file as binary data and then convert it to UTF-16 on the Windows platform. This can be done with the MultiByteToWideChar() API, using the CP_UTF8 code page.
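
The read-as-binary-then-convert approach could be sketched as follows. To keep the snippet buildable outside Windows as well, it swaps in a small hand-written UTF-8 decoder instead of calling MultiByteToWideChar(); on Windows the API call is the more natural choice. The names ReadAllBytes and Utf8ToUtf16 are made up for illustration, and replacing malformed input with U+FFFD is an assumption of this sketch.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

// Read a whole file as raw bytes, without any text translation.
std::string ReadAllBytes( const char* path )
{
    std::ifstream in( path, std::ios::binary );
    return std::string( ( std::istreambuf_iterator<char>( in ) ),
                        std::istreambuf_iterator<char>() );
}

// Decode UTF-8 bytes into UTF-16 code units. A leading byte order mark is
// skipped; malformed bytes are replaced with U+FFFD.
std::u16string Utf8ToUtf16( const std::string& bytes )
{
    std::u16string out;
    std::size_t i = 0;
    if ( bytes.size() >= 3 && bytes.compare( 0, 3, "\xEF\xBB\xBF" ) == 0 )
        i = 3;

    while ( i < bytes.size() )
    {
        unsigned char b0 = static_cast<unsigned char>( bytes[i] );
        char32_t cp = 0;        // decoded code point
        char32_t minCp = 0;     // smallest value allowed for this sequence length
        std::size_t extra = 0;  // number of continuation bytes expected

        if ( b0 < 0x80 )                  { cp = b0; }
        else if ( ( b0 & 0xE0 ) == 0xC0 ) { cp = b0 & 0x1F; extra = 1; minCp = 0x80; }
        else if ( ( b0 & 0xF0 ) == 0xE0 ) { cp = b0 & 0x0F; extra = 2; minCp = 0x800; }
        else if ( ( b0 & 0xF8 ) == 0xF0 ) { cp = b0 & 0x07; extra = 3; minCp = 0x10000; }
        else { out.push_back( 0xFFFD ); ++i; continue; }

        bool ok = ( i + extra < bytes.size() );
        for ( std::size_t k = 1; ok && k <= extra; ++k )
        {
            unsigned char b = static_cast<unsigned char>( bytes[i + k] );
            if ( ( b & 0xC0 ) != 0x80 ) ok = false;
            else cp = ( cp << 6 ) | ( b & 0x3F );
        }
        // Reject truncated, overlong, out-of-range and surrogate encodings.
        if ( !ok || cp < minCp || cp > 0x10FFFF || ( cp >= 0xD800 && cp <= 0xDFFF ) )
        {
            out.push_back( 0xFFFD );
            ++i;
            continue;
        }
        i += extra + 1;

        if ( cp < 0x10000 )
        {
            out.push_back( static_cast<char16_t>( cp ) );
        }
        else
        {
            // Code points above the BMP become a surrogate pair.
            cp -= 0x10000;
            out.push_back( static_cast<char16_t>( 0xD800 + ( cp >> 10 ) ) );
            out.push_back( static_cast<char16_t>( 0xDC00 + ( cp & 0x3FF ) ) );
        }
    }
    return out;
}
```

A call such as Utf8ToUtf16( ReadAllBytes( "location.txt" ) ) would then yield the text as UTF-16 code units.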

    - Christian

  9. Anonymous
    May 21st, 2010 at 13:15 | #9

    Thank you for the tip, Christian.
    I convert the buffer I read to UTF-16 and now I can see the correct characters :-)
