Posts Tagged ‘UTF-8’

Optimized Reading of Meta-Data using ExifTool (Unicode-Proof!)

April 22nd, 2010

Today we are going to look at how to work around the lack of Unicode support in ExifTool.

In my last post I described a safe way of handling Unicode file and path names, which was unfortunately rather slow. In this post I would like to show how to combine it with a fast reading approach using .NET.

I have chosen to give the examples in this series in C#, since it lets me demonstrate my ideas very compactly. However, the general approach carries over to many programming languages and is therefore not a .NET-only solution.

Basically, we combine a batch read using ExifTool with single-file read operations for incompatible file names. Under optimal circumstances, i.e. when all file names are convertible, this method is as fast as a plain ExifTool batch run. In the worst case every file is read one by one, which carries a bigger performance penalty.

Prior to processing any files, we have to divide all file names into compatible and incompatible ones. After splitting them up, we start the actual reading.

public ExifFileJson[] GetOriginalDateExifToolUnicode( string[] files )
{
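    // First, single out all files with incompatible file names, since they cannot be handled in a batch.
    // Non-ASCII characters do not survive the round trip through Encoding.ASCII (they become '?'),
    // so a name is batch-compatible only if it comes back unchanged.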
    var tmp = ( from x in files select new { OriginalName = x, ConvertedName = Encoding.ASCII.GetString( Encoding.UTF8.GetBytes( x ) ) } ).ToArray();
    string[] batch = tmp.Where( x => x.OriginalName.Equals( x.ConvertedName ) ).Select( x => x.OriginalName ).ToArray();
    string[] nobatch = tmp.Where( x => !x.OriginalName.Equals( x.ConvertedName ) ).Select( x => x.OriginalName ).ToArray();
 
    List<ExifFileJson> exiffiles = new List<ExifFileJson>();
    exiffiles.AddRange( GetOriginalDateExifToolBatch( batch ) );
    foreach ( string s in nobatch )
        exiffiles.Add( GetExifImageExifTool( s ) );
    if ( files.Length != exiffiles.Count() )
        throw new Exception( "Could not open all files. Missing: " + String.Join( ", ", files.Except( exiffiles.Select( x => x.SourceFile ) ).ToArray() ) );
    return exiffiles.ToArray();
}
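
Since the approach is not tied to .NET, the same split can be sketched in portable C++. This is only an illustration, assuming the names are held as UTF-8 encoded std::string values; IsAsciiOnly and SplitByAscii are hypothetical helper names and not part of ExifTool or of the code above.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// True if every byte of the UTF-8 encoded name is in the 7-bit ASCII range.
bool IsAsciiOnly(const std::string& utf8Name)
{
    return std::all_of(utf8Name.begin(), utf8Name.end(),
                       [](unsigned char c) { return c < 0x80; });
}

// Splits names into (batch, nobatch), preserving the input order within each group.
// "batch" names are safe to pass to ExifTool in one call, "nobatch" names are not.
std::pair<std::vector<std::string>, std::vector<std::string>>
SplitByAscii(const std::vector<std::string>& files)
{
    std::pair<std::vector<std::string>, std::vector<std::string>> result;
    for (const std::string& f : files)
        (IsAsciiOnly(f) ? result.first : result.second).push_back(f);
    return result;
}
```

A name such as IMG_0001.jpg ends up in the first (batch) list, while a name such as Größe.jpg ends up in the second list and has to be read individually.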

The next method runs ExifTool and parses its output, which is requested in JSON format.

private static ExifFileJson[] GetOriginalDateExifToolBatch( string[] files )
{
    Process oP = new Process();
    oP.EnableRaisingEvents = false;
    oP.StartInfo.CreateNoWindow = true;
    oP.StartInfo.LoadUserProfile = false;
    oP.StartInfo.RedirectStandardError = false;
    oP.StartInfo.RedirectStandardOutput = true;
    oP.StartInfo.RedirectStandardInput = true;
    oP.StartInfo.StandardErrorEncoding = null;
    oP.StartInfo.StandardOutputEncoding = Encoding.UTF8;
    oP.StartInfo.UseShellExecute = false;
    oP.StartInfo.WindowStyle = ProcessWindowStyle.Hidden;
    oP.StartInfo.FileName = @"exiftool.exe";
    oP.StartInfo.Arguments = "-EXIF:ModifyDate -EXIF:DateTimeOriginal -EXIF:CreateDate -j -d \"%Y-%m-%d %H:%M:%S\" -@ -";
    oP.Start();
 
    // Pass all file names as an arg file that is piped to the process via stdin (no temporary file)
    byte[] data = Encoding.UTF8.GetBytes( String.Join( "\r\n", files ) );
    oP.StandardInput.BaseStream.Write( data, 0, data.Length );
    oP.StandardInput.BaseStream.Close();
 
    DataContractJsonSerializer deserializer = new DataContractJsonSerializer( typeof( ExifFileJson[] ) );
    ExifFileJson[] exif = deserializer.ReadObject( oP.StandardOutput.BaseStream ) as ExifFileJson[];
 
    oP.WaitForExit();
    return exif;
}

The following Unicode-safe method does not rely on the Perl file API; instead it pipes the image to stdin. To avoid out-of-memory conditions, it might be advisable to read the image file in small chunks using a stream. Do not forget to set the file name in the ExifFileJson object before returning it (ExifTool does not know the file name).

private static ExifFileJson GetExifImageExifTool( string sFile )
{
    Process oP = new Process();
    oP.EnableRaisingEvents = false;
    oP.StartInfo.CreateNoWindow = true;
    oP.StartInfo.LoadUserProfile = false;
    oP.StartInfo.RedirectStandardError = false;
    oP.StartInfo.RedirectStandardOutput = true;
    oP.StartInfo.RedirectStandardInput = true;
    oP.StartInfo.StandardErrorEncoding = null;
    oP.StartInfo.StandardOutputEncoding = Encoding.UTF8;
    oP.StartInfo.UseShellExecute = false;
    oP.StartInfo.WindowStyle = ProcessWindowStyle.Hidden;
    oP.StartInfo.FileName = @"exiftool.exe";
    oP.StartInfo.Arguments = "-j -EXIF:ModifyDate -EXIF:DateTimeOriginal -EXIF:CreateDate -d \"%Y-%m-%d %H:%M:%S\" -";
    oP.Start();
 
    byte[] image = File.ReadAllBytes( sFile );
    oP.StandardInput.BaseStream.Write( image, 0, image.Length );
    oP.StandardInput.BaseStream.Close();
 
    DataContractJsonSerializer deserializer = new DataContractJsonSerializer( typeof( ExifFileJson[] ) );
    ExifFileJson[] exif = deserializer.ReadObject( oP.StandardOutput.BaseStream ) as ExifFileJson[];
 
    oP.WaitForExit();
    if ( exif.Length > 0 )
        exif[ 0 ].SourceFile = sFile;
    return exif.FirstOrDefault();
}

In case you wonder about the JSON class we use for deserializing the output:

[DataContract]
public class ExifFileJson
{
    [DataMember( IsRequired = true, Name = "SourceFile" )]
    public string SourceFile;
    [OnDeserializedAttribute()]
    internal void ReplaceBackSlashes( StreamingContext context ) { this.SourceFile = this.SourceFile.Replace( '/', '\\' ); }
 
    [DataMember( IsRequired = false )]
    public string DateTimeOriginal;
    [DataMember( IsRequired = false )]
    public string CreateDate;
    [DataMember( IsRequired = false )]
    public string ModifyDate;
}

Basically, we declare required and optional attributes, plus a name mapping where necessary. Remember to replace forward slashes with backslashes in the file names, since these are returned in Unix style. It is probably not a good idea to deserialize the dates as nullable DateTime? values, since some images contain unparsable dates, which would result in a parsing exception. If you still want to do it, remember to make ExifTool emit the dates in JSON date format: -d "/Date(%Y-%m-%d %H:%M:%S)/".
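
If the dates are kept as strings, they can be validated after deserialization without risking an exception. Here is a minimal sketch in portable C++, assuming the -d format used above (%Y-%m-%d %H:%M:%S). ExifDateTime and TryParseExifDate are hypothetical names, and the range checks are deliberately coarse (no days-per-month or leap-year logic).

```cpp
#include <cstdio>
#include <string>

struct ExifDateTime { int year, month, day, hour, minute, second; };

// Parses "YYYY-MM-DD hh:mm:ss". Returns false instead of throwing when the text
// does not match or a field is out of range; `out` is only written on success.
bool TryParseExifDate(const std::string& text, ExifDateTime& out)
{
    ExifDateTime d{};
    char extra = 0;
    // The trailing %c only succeeds if unexpected characters follow the seconds.
    const int n = std::sscanf(text.c_str(), "%4d-%2d-%2d %2d:%2d:%2d%c",
                              &d.year, &d.month, &d.day,
                              &d.hour, &d.minute, &d.second, &extra);
    if (n != 6)
        return false;
    if (d.year < 1 || d.month < 1 || d.month > 12 || d.day < 1 || d.day > 31)
        return false;
    if (d.hour < 0 || d.hour > 23 || d.minute < 0 || d.minute > 59 ||
        d.second < 0 || d.second > 59)
        return false;
    out = d;
    return true;
}
```

An all-zero date such as 0000-00-00 00:00:00 is rejected here, so the caller can simply treat that image as having no usable date.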

Other Options

Certainly we could achieve better performance and less coding overhead if we had a way of batch processing files independent of their name and path. If you have a Perl environment on your machine with the Win32::API module, you could rewrite the above code as a Perl script and thereby get much better performance, even when reading files with Unicode names.

There is also another option: it is possible to add Unicode file name support to the Perl interpreter for Windows. I recently did a proof-of-concept which shows that ExifTool (or any UTF-8 supporting Perl app) could use Unicode file names on Windows without changing a line of code, as long as it is executed with a Unicode-supporting interpreter. The source code of Perl is pretty big, however, and I am afraid I won’t be able to invest enough time to do a bullet-proof implementation.

Editing Metadata using Exiftool and Unicode

June 2nd, 2009

When it comes to editing image metadata, no program or API gets close to ExifTool in terms of robustness, feature count and support. ExifTool is written in Perl and is provided within a .exe wrapper for the Windows platform. This executable can be controlled through command line parameters and a parameter (arg) file.

Editing metadata by passing command line parameters and attribute values generally works well, as long as certain restrictions are not violated. All parameters must be correctly escaped and may only contain ANSI characters that are supported by the current user’s codepage. The maximum length of the command line is limited, and therefore entering long text might be difficult in some cases.

Unfortunately, it is impossible to enter any characters that are stored in Unicode encodings such as UTF-8, UTF-16 or UCS-2. This is due to a restriction in the Perl interpreter binary, which only accepts and forwards 8-bit character sets. Technically, UTF-8 could be passed as an 8-bit character set; however, due to the command line string handling of Windows, only text encoded with the current system ANSI codepage can be passed using the CreateProcess() API.

Fortunately there are a few workarounds for passing Unicode data to ExifTool:

  • Use the -E option to write special characters as HTML character entities. Examples: &auml; – ä Umlaut, &#10; – line break, &#13;&#10; – line break (Windows), &#31435; – Chinese character 立, etc. In case the text already contains some HTML entities, you would have to escape them first. With this approach you will be restricted to the maximum length of the command line, which is between 2047 and 8191 characters (MSDN). When using the Win32 API CreateProcess(), the maximum command line length is 32,767 characters (MSDN).
  • Write the data directly to stdin and specify on the command line which attribute is supposed to store the data. This is a very fast approach if you need to write only a single attribute in any encoding, including binary data. No escaping is necessary in this case. Multi-value attributes and line breaks are supported.
  • Write the text or binary data for each attribute into a separate temporary file before calling ExifTool. Make sure you remove every file after ExifTool finishes processing. Similar to the above, you can write any kind of data without escaping, and as an additional benefit you can write parameters in different encodings at the same time. To pass a file as a parameter, use the “attribute<=filename” syntax (needs to be escaped with double quotes on the command line).
  • Write all attributes of the same encoding into an arg file, then call ExifTool using the -@ parameter. You can also add processing instructions to the file, as well as the names of the files to be processed. Another benefit is that the size restrictions of the command line do not apply, so an arbitrary number of files and attributes can be processed. All you have to make sure is that you use only a single encoding (probably UTF-8) for all attributes in the file. Since every assignment has to fit on a single line, multi-value attributes and text with line breaks have to be escaped. In case you have to write any text that contains line breaks, you have to escape them with a $/ and change the = operator to <=. Also note that you have to escape all $ characters by $$.
  • Use the argfile approach from above, but do not use a temporary file. Instead, write the contents of that file directly to stdin. You can use the -@ - syntax for this. The main advantage compared to using an arg file is that you do not need to worry about cleaning up.
  • Combine any of the above ways of passing data.
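
The escaping needed for the -E workaround can be sketched in portable C++ as follows. This is an illustration under stated assumptions, not ExifTool code: ToHtmlEntities is a hypothetical helper, the input is assumed to be UTF-8, and malformed byte sequences are simply replaced by a question mark. Escaping & first takes care of text that already contains entities.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Turns UTF-8 text into pure ASCII: '&' becomes "&amp;", control characters
// (such as line breaks) and all non-ASCII code points become decimal numeric
// entities, e.g. "&#10;" or "&#31435;".
std::string ToHtmlEntities(const std::string& utf8)
{
    std::string out;
    for (std::size_t i = 0; i < utf8.size();)
    {
        const unsigned char c = static_cast<unsigned char>(utf8[i]);
        if (c < 0x80)
        {
            if (c == '&')      out += "&amp;";
            else if (c < 0x20) out += "&#" + std::to_string(c) + ";";
            else               out += static_cast<char>(c);
            ++i;
            continue;
        }
        // Lead byte: determine the sequence length and its payload bits.
        std::size_t len = 0;
        std::uint32_t cp = 0;
        if ((c & 0xE0) == 0xC0)      { len = 2; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; }
        bool ok = len != 0 && i + len <= utf8.size();
        for (std::size_t k = 1; ok && k < len; ++k)
        {
            const unsigned char cc = static_cast<unsigned char>(utf8[i + k]);
            ok = (cc & 0xC0) == 0x80;       // continuation bytes look like 10xxxxxx
            cp = (cp << 6) | (cc & 0x3F);
        }
        if (!ok) { out += '?'; ++i; continue; }
        out += "&#" + std::to_string(cp) + ";";
        i += len;
    }
    return out;
}
```

The result contains only ASCII characters, so it can be passed on the command line regardless of the current ANSI codepage. It still has to be quoted for the command line and remains subject to the length limits mentioned above.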

Background

Image metadata is frequently stored within an EXIF header. Unfortunately, EXIF has a limited concept of the character sets used to encode information. According to the specification, only ASCII encoded text can be stored in text attributes. In practice, many applications also read and write text in other 8-bit encodings, with Latin1 being a very common denominator. For any text that is written in the current ANSI encoding of the user’s system, the user will most likely be able to retrieve the saved text without any loss of information. This, however, is only the case as long as the metadata is viewed on a system with the same code page. Depending on the characters used to encode the attribute, a system with a different default ANSI code page is likely to show wrong characters or even complete nonsense. Except for a few attributes such as UserComment, EXIF does not foresee the usage of Unicode.

IPTC fortunately does offer storage of Unicode text. It is possible to flag all IPTC text as UTF-8. Therefore all Unicode characters can be safely stored within IPTC, as long as the reading and writing applications adhere to the specification. Storing UTF-8 encoded text is supported by ExifTool.

XMP supports UTF-8 by default, so whenever writing XMP data, one of the above techniques should be used.

Link to ExifTool: http://www.sno.phy.queensu.ca/~phil/exiftool/

ExifTool forum at CPAN: http://www.cpanforum.com/dist/Image-ExifTool

ATL CString Extension for UTF-8, UTF-7, ASCII, OEM, Latin1 Character Sets

April 29th, 2009

Sometimes we come across text that has been encoded in a particular locale or Unicode encoding. The ATL CString classes do not provide conversions for this in most cases; that is where these two extension classes come in handy:

CStringWExt – Convert 8-bit Character Sets to UTF-16

class CStringWExt : public CStringW
    {
    public:
        BOOL Latin12Wide  ( PSTR s ) { return CP2Wide( 28591    , s ); } // Latin1 encoding or ISO/IEC 8859-1, similar to Windows-1252 
        BOOL OEM2Wide     ( PSTR s ) { return CP2Wide( CP_OEMCP , s ); } // Use for console related text
        BOOL ASCII2Wide   ( PSTR s ) { return CP2Wide( 20127    , s ); }
        BOOL UTF72Wide    ( PSTR s ) { return CP2Wide( CP_UTF7  , s ); }
        BOOL UTF82Wide    ( PSTR s ) { return CP2Wide( CP_UTF8  , s ); }
        BOOL ANSI2Wide    ( PSTR s ) { return CP2Wide( CP_ACP   , s ); }
        BOOL UserCP2Wide  ( PSTR s ) { return CP2Wide( GetUserCodePage()  , s ); }
        BOOL SystemCP2Wide( PSTR s ) { return CP2Wide( GetSystemCodePage(), s ); } // System code page is the locale set for non Unicode programs
        UINT GetUserCodePage() { return GetCodePage( LOCALE_USER_DEFAULT ); }
        UINT GetSystemCodePage() { return GetCodePage( LOCALE_SYSTEM_DEFAULT ); }
        UINT GetCodePage( LCID locale )
        {
            UINT langCP;
            if ( GetLocaleInfo( locale, LOCALE_IDEFAULTANSICODEPAGE | LOCALE_RETURN_NUMBER, (LPTSTR)&langCP, sizeof(langCP) / sizeof(TCHAR) ) )
                return langCP;
            return 0;
        }
        BOOL CP2Wide( UINT cp, PCSTR s )
        {
            if ( s == NULL )
                return FALSE;
            int iBuffer = MultiByteToWideChar( cp, 0, s, -1, NULL, 0 );
            if ( iBuffer == 0 )
                return FALSE;
            Preallocate( iBuffer );
            if ( !MultiByteToWideChar( cp, 0, s, -1, GetBuffer() , GetAllocLength() ) )
                return FALSE;
            ReleaseBuffer();
            return TRUE;
        }
};

CStringAExt – Convert UTF-16 to 8-bit Character Set

Conversion to a target of OEM, ASCII or an ANSI code page is potentially lossy, depending on the text that has to be converted. To check whether any loss has occurred, convert the result back using an instance of CStringWExt above and compare it with the original.

class CStringAExt : public CStringA
    {
    public:
        BOOL Wide2Latin1  ( PWSTR s ) { return Wide2CP( 28591    , s ); } // Latin1 encoding or ISO/IEC 8859-1, similar to Windows-1252 
        BOOL Wide2OEM     ( PWSTR s ) { return Wide2CP( CP_OEMCP , s ); } // Use for console related text
        BOOL Wide2ASCII   ( PWSTR s ) { return Wide2CP( 20127    , s ); }
        BOOL Wide2UTF7    ( PWSTR s ) { return Wide2CP( CP_UTF7  , s ); }
        BOOL Wide2UTF8    ( PWSTR s ) { return Wide2CP( CP_UTF8  , s ); }
        BOOL Wide2ANSI    ( PWSTR s ) { return Wide2CP( CP_ACP   , s ); }
        BOOL Wide2UserCP  ( PWSTR s ) { return Wide2CP( GetUserCodePage()  , s ); }
        BOOL Wide2SystemCP( PWSTR s ) { return Wide2CP( GetSystemCodePage(), s ); } // System code page is the locale set for non Unicode programs
        UINT GetUserCodePage() { return GetCodePage( LOCALE_USER_DEFAULT ); }
        UINT GetSystemCodePage() { return GetCodePage( LOCALE_SYSTEM_DEFAULT ); }
        UINT GetCodePage( LCID locale )
        {
            UINT langCP;
            if ( GetLocaleInfo( locale, LOCALE_IDEFAULTANSICODEPAGE | LOCALE_RETURN_NUMBER, (LPTSTR)&langCP, sizeof(langCP) / sizeof(TCHAR) ) )
                return langCP;
            return 0;
        }
        BOOL Wide2CP( UINT cp, PCWSTR s )
        {
            if ( s == NULL )
                return FALSE;
            int iBuffer = WideCharToMultiByte( cp, 0, s, -1, NULL, 0, NULL, NULL );
            if ( iBuffer == 0 )
                return FALSE;
            Preallocate( iBuffer );
            if ( !WideCharToMultiByte( cp, 0, s, -1, GetBuffer() , GetAllocLength(), NULL, NULL ) )
                return FALSE;
            ReleaseBuffer();
            return TRUE;
        }
};
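
The round-trip check for lossy conversions can be illustrated without ATL. The following portable sketch stands in for the two classes above and is limited to Latin1 as the target. WideToLatin1, Latin1ToWide and IsLosslessInLatin1 are hypothetical names, and the question mark as replacement character is an assumption made for the example.

```cpp
#include <string>

// Narrows UTF-16 to Latin1 (ISO/IEC 8859-1); code units above 0xFF cannot be
// represented and are replaced by '?'.
std::string WideToLatin1(const std::u16string& wide)
{
    std::string narrow;
    narrow.reserve(wide.size());
    for (char16_t u : wide)
        narrow += (u <= 0xFF) ? static_cast<char>(static_cast<unsigned char>(u)) : '?';
    return narrow;
}

// Widens Latin1 to UTF-16; every Latin1 byte maps to the code point of the same value.
std::u16string Latin1ToWide(const std::string& narrow)
{
    std::u16string wide;
    wide.reserve(narrow.size());
    for (unsigned char c : narrow)
        wide += static_cast<char16_t>(c);
    return wide;
}

// The conversion is lossless exactly when the round trip reproduces the input.
bool IsLosslessInLatin1(const std::u16string& wide)
{
    return Latin1ToWide(WideToLatin1(wide)) == wide;
}
```

With the ATL classes, the corresponding check would convert with CStringAExt, convert the result back with CStringWExt using the same code page, and compare the outcome with the original string.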