Testing for possible Unicode – ANSI code page compatibility
While working on a recent ExifTool remoting task, the question came up whether a given Unicode file name could be safely represented in the system ANSI code page. Only if the file name was fully convertible could it be passed to the application directly.
If the file name cannot be converted to the current code page, an application that does not use the CreateFileW() API will not be able to open the file by this name. If the file system supports old-style DOS 8.3 file names, the application should fall back to those instead.
// Returns TRUE if sFile survives a round trip through the ANSI code page.
BOOL IsConvertibleText( PCWSTR sFile )
{
    BOOL bRet = FALSE;
    if ( sFile )
    {
        // Size in bytes of the ANSI string, including the terminator
        int cbAnsi = WideCharToMultiByte( CP_ACP, 0, sFile, -1, NULL, 0, NULL, NULL );
        if ( cbAnsi != 0 )
        {
            PSTR a = (PSTR)HeapAlloc( GetProcessHeap(), 0, cbAnsi );
            if ( a )
            {
                if ( WideCharToMultiByte( CP_ACP, 0, sFile, -1, a, cbAnsi, NULL, NULL ) )
                {
                    // Length in WCHARs (not bytes) of the string converted back
                    int cchWide = MultiByteToWideChar( CP_ACP, 0, a, -1, NULL, 0 );
                    if ( cchWide != 0 )
                    {
                        PWSTR w = (PWSTR)HeapAlloc( GetProcessHeap(), 0, cchWide * sizeof(WCHAR) );
                        if ( w )
                        {
                            // The last argument is a character count, not a byte count
                            if ( MultiByteToWideChar( CP_ACP, 0, a, -1, w, cchWide ) )
                                if ( CompareStringW( LOCALE_SYSTEM_DEFAULT, 0, sFile, -1, w, -1 ) == CSTR_EQUAL )
                                    bRet = TRUE;
                            HeapFree( GetProcessHeap(), 0, w );
                        }
                    }
                }
                HeapFree( GetProcessHeap(), 0, a );
            }
        }
    }
    return bRet;
}
For those using C#:
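// Note: Encoding.Default is the system ANSI code page on .NET Framework.
// On .NET Core / .NET 5+ it is UTF-8, so the round trip no longer tests the ANSI code page.
bool IsConvertibleText( string sFile )
{
    byte[] b = Encoding.Default.GetBytes( sFile );
    string s = Encoding.Default.GetString( b );
    return sFile.Equals( s, StringComparison.InvariantCulture );
}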
See also: Post in CPAN::Forum, "Win32API::File Unicode support bug"
Hello
I use ExifTool under Windows (XP, 7).
I create a .bat to be able to get the metadata in a .txt with the following syntax:
(the .bat is created with WCHAR characters)
exiftool.exe -EXIF:All myfile.jpg >myfile.txt
It works perfectly with ‘western’ file names, but as soon as I use, for example, Cyrillic file names, nothing happens.
Jean,
the behaviour you are describing is due to a limitation of the Perl interpreter, which can only handle file names that can be represented in the current ANSI code page of your Windows system. Most likely you are using code page 1252, which does not contain Cyrillic letters. Basically you have the following options:
Christian
Thank you for your answer. I did not know that ‘type’ command; it works fine.
Now I rename/copy the files to temp and ExifTool can read them.
But some information is written in Cyrillic (see for example Keywords and By-line) and is not read correctly by ExifTool. Is this the normal behaviour due to Perl?
Current IPTC Digest : 9a42ed6cec8fd5344ec946c1cb15501e
Application Record Version : 2
Keywords : Àíäðåé, Èâàí, Òàðàñ
By-line : Ïåòðîâ Ïåòð
IPTC Digest : 9a42ed6cec8fd5344ec946c1cb15501e
Image Width : 1704
Image Height : 2272
Encoding Process : Baseline DCT, Huffman coding
Again, did you output this data to a text file and open it as UTF-8 in a suitable text editor?
Is there a chance the data has been written incorrectly?
Many third-party programs do not handle UTF-8 IPTC correctly. They might just use the system default code page. Please verify by writing Cyrillic text using the -@ argfile of ExifTool.
- Christian
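As an illustration of how text stored in a system code page can turn into the garbled output above: the Keywords value looks like what one would get if Cyrillic text had been stored in code page 1251 and then displayed as code page 1252. This is an assumption, not something confirmed in this thread. The sketch below decodes a single byte both ways. The function names decode_as_cp1252 and decode_as_cp1251 are made up for this example, and both cover only ASCII and the upper byte range.

```c
#include <stdint.h>

/* Hypothetical helper: interpret one byte as code page 1252.
 * For ASCII and for 0xA0..0xFF, code page 1252 matches Latin-1,
 * so the byte value is the Unicode code point.                  */
uint16_t decode_as_cp1252(unsigned char b)
{
    if (b < 0x80 || b >= 0xA0)
        return b;
    return 0xFFFD;              /* 0x80..0x9F is not covered by this sketch */
}

/* Hypothetical helper: interpret one byte as code page 1251.
 * Bytes 0xC0..0xFF hold the basic Cyrillic letters U+0410..U+044F. */
uint16_t decode_as_cp1251(unsigned char b)
{
    if (b < 0x80)
        return b;
    if (b >= 0xC0)
        return (uint16_t)(0x0410 + (b - 0xC0));
    return 0xFFFD;              /* 0x80..0xBF is not covered by this sketch */
}
```

Under that assumption, the byte 0xC0 is U+00C0 (À) when read as code page 1252 but U+0410 (А) when read as code page 1251, so the six bytes displayed as "Àíäðåé" would spell "Андрей".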
Yes, the output goes to a text file and I read it with Notepad (Notepad reads those characters perfectly).
The metadata should be correctly written; I can read it with ImageMagick
(I have attached the file in the ExifTool forum)
Jean,
Text is in UTF-8 format, it reads out fine. Try:
exiftool -location IMG_0650_web.jpg>location.txt
then open the text file with a UTF-8 aware editor. It should show “St. Isaac’s Cathedral (Исаакиевский Собор)”. This should solve the problem.
Christian
Yes, the text file is correct.
But I need to read it; I tried fread, fgets, fgetws.
And I still get incorrect characters.
I don’t know whether your text reading API supports UTF-8 directly. If it doesn’t, you should be able to read the file as binary data and then convert it to UTF-16 on the Windows platform. This can be done by using the MultiByteToWideChar() API with the CP_UTF8 code page.
- Christian
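To illustrate what such a conversion involves, here is a minimal portable sketch of a UTF-8 to UTF-16 decoder in C. On Windows, MultiByteToWideChar() with CP_UTF8 does this job for you. The function name utf8_to_utf16 and its signature are made up for this example, and the sketch does not reject overlong encodings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode a NUL-terminated UTF-8 buffer into UTF-16.
 * Returns the number of UTF-16 code units written (terminator excluded),
 * or -1 on malformed input or if `out` is too small.                    */
long utf8_to_utf16(const unsigned char *in, uint16_t *out, size_t out_cap)
{
    size_t n = 0;
    assert(in != NULL && out != NULL);

    while (*in) {
        uint32_t cp;
        int extra;                          /* continuation bytes expected */

        if (*in < 0x80)                { cp = *in;        extra = 0; }
        else if ((*in & 0xE0) == 0xC0) { cp = *in & 0x1F; extra = 1; }
        else if ((*in & 0xF0) == 0xE0) { cp = *in & 0x0F; extra = 2; }
        else if ((*in & 0xF8) == 0xF0) { cp = *in & 0x07; extra = 3; }
        else return -1;                     /* invalid lead byte */
        in++;

        while (extra-- > 0) {
            if ((*in & 0xC0) != 0x80)
                return -1;                  /* truncated or broken sequence */
            cp = (cp << 6) | (uint32_t)(*in++ & 0x3F);
        }
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return -1;                      /* not a valid code point */

        if (cp < 0x10000) {
            if (n + 1 >= out_cap) return -1;    /* keep room for terminator */
            out[n++] = (uint16_t)cp;
        } else {
            if (n + 2 >= out_cap) return -1;
            cp -= 0x10000;                      /* encode as surrogate pair */
            out[n++] = (uint16_t)(0xD800 | (cp >> 10));
            out[n++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        }
    }
    if (n >= out_cap) return -1;
    out[n] = 0;
    return (long)n;
}
```

The input is expected to be a NUL-terminated buffer that was read in binary mode. A leading byte order mark, if present, is decoded like any other character.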
Thank you for the tip, Christian.
I convert the buffer I read to UTF-16 and now I can see the correct characters.