Archive

Posts Tagged ‘Linq’

Optimized Reading of Meta-Data using ExifTool (Unicode-Proof!)

April 22nd, 2010 Christian Etter

Today we are going to look at how to work around the lack of Unicode support in ExifTool.

In my last post, I described a safe way of handling Unicode file and path names, which was unfortunately rather slow. In this post I would like to show how to combine it with a fast batch-reading approach using .NET.

I have chosen to give examples using C# code in this series, since it allows me to demonstrate my ideas in a very compact way. However, the general approach works in many programming languages and is therefore not a .NET-only solution.

Basically, we combine a batch read using ExifTool with single-file reads for incompatible file names. In the best case, i.e. when all file names are convertible, this method is as fast as a plain ExifTool batch run. In the worst case every file has to be read individually, which carries a much bigger performance penalty.

Prior to processing any files, we have to divide all file names into compatible and incompatible ones. After splitting them up, we start the actual reading.

public ExifFileJson[] GetOriginalDateExifToolUnicode( string[] files )
{
    // first, single out all files with incompatible file names, since they cannot be handled in a batch
    var tmp = ( from x in files select new { OriginalName = x, ConvertedName = Encoding.ASCII.GetString( Encoding.UTF8.GetBytes( x ) ) } ).ToArray();
    string[] batch = tmp.Where( x => x.OriginalName.Equals( x.ConvertedName ) ).Select( x => x.OriginalName ).ToArray();
    string[] nobatch = tmp.Where( x => !x.OriginalName.Equals( x.ConvertedName ) ).Select( x => x.OriginalName ).ToArray();
 
    List<ExifFileJson> exiffiles = new List<ExifFileJson>();
    exiffiles.AddRange( GetOriginalDateExifToolBatch( batch ) );
    foreach ( string s in nobatch )
        exiffiles.Add( GetExifImageExifTool( s ) );
    if ( files.Length != exiffiles.Count() )
        throw new Exception( "Could not open all files. Missing: " + String.Join( ", ", files.Except( exiffiles.Select( x => x.SourceFile ) ).ToArray() ) );
    return exiffiles.ToArray();
}
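The split itself can also be written as a single pass over the file names. The following is only a sketch: `FileNamePartitioner`, `IsAsciiOnly` and `Partition` are hypothetical names that are not part of the code above, and the check assumes that "compatible" means "7-bit ASCII only", which should give the same split as the encoding round trip.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FileNamePartitioner
{
    // Hypothetical helper: true when every character is in the 7-bit ASCII range.
    public static bool IsAsciiOnly( string name )
    {
        return name.All( c => c <= 0x7F );
    }

    // Splits the names in one pass: key true = safe for the batch, key false = read one by one.
    public static ILookup<bool, string> Partition( IEnumerable<string> files )
    {
        return files.ToLookup( f => IsAsciiOnly( f ) );
    }
}
```

With this, `lookup[ true ]` would take the place of `batch` and `lookup[ false ]` the place of `nobatch`. A lookup returns an empty sequence for a missing key, so neither side needs a special case.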

The next method basically runs ExifTool and parses the output in Json format.

private static ExifFileJson[] GetOriginalDateExifToolBatch( string[] files )
{
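    // Nothing to do for an empty batch: without any file names ExifTool should not produce a Json array.
    if ( files.Length == 0 )
        return new ExifFileJson[ 0 ];

    Process oP = new Process();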
    oP.EnableRaisingEvents = false;
    oP.StartInfo.CreateNoWindow = true;
    oP.StartInfo.LoadUserProfile = false;
    oP.StartInfo.RedirectStandardError = false;
    oP.StartInfo.RedirectStandardOutput = true;
    oP.StartInfo.RedirectStandardInput = true;
    oP.StartInfo.StandardErrorEncoding = null;
    oP.StartInfo.StandardOutputEncoding = Encoding.UTF8;
    oP.StartInfo.UseShellExecute = false;
    oP.StartInfo.WindowStyle = ProcessWindowStyle.Hidden;
    oP.StartInfo.FileName = @"exiftool.exe";
    oP.StartInfo.Arguments = "-EXIF:ModifyDate -EXIF:DateTimeOriginal -EXIF:CreateDate -j -d \"%Y-%m-%d %H:%M:%S\" -@ -";
    oP.Start();
 
    // Pass all file names in an arg file which is piped to the process (no temporary file)
    byte[] data = Encoding.UTF8.GetBytes( String.Join( "\r\n", files ) );
    oP.StandardInput.BaseStream.Write( data, 0, data.Length );
    oP.StandardInput.BaseStream.Close();
 
    DataContractJsonSerializer deserializer = new DataContractJsonSerializer( typeof( ExifFileJson[] ) );
    ExifFileJson[] exif = deserializer.ReadObject( oP.StandardOutput.BaseStream ) as ExifFileJson[];
 
    oP.WaitForExit();
    return exif;
}

The following Unicode-safe method does not rely on the Perl file API, but instead pipes the image to stdin. To avoid out-of-memory conditions, it might be advisable to read the image file in small chunks using a stream. Do not forget to set the file name in the ExifFileJson object before returning it (ExifTool does not know the file name).

private static ExifFileJson GetExifImageExifTool( string sFile )
{
    Process oP = new Process();
    oP.EnableRaisingEvents = false;
    oP.StartInfo.CreateNoWindow = true;
    oP.StartInfo.LoadUserProfile = false;
    oP.StartInfo.RedirectStandardError = false;
    oP.StartInfo.RedirectStandardOutput = true;
    oP.StartInfo.RedirectStandardInput = true;
    oP.StartInfo.StandardErrorEncoding = null;
    oP.StartInfo.StandardOutputEncoding = Encoding.UTF8;
    oP.StartInfo.UseShellExecute = false;
    oP.StartInfo.WindowStyle = ProcessWindowStyle.Hidden;
    oP.StartInfo.FileName = @"exiftool.exe";
    oP.StartInfo.Arguments = "-j -EXIF:ModifyDate -EXIF:DateTimeOriginal -EXIF:CreateDate -d \"%Y-%m-%d %H:%M:%S\" -";
    oP.Start();
 
    byte[] image = File.ReadAllBytes( sFile );
    oP.StandardInput.BaseStream.Write( image, 0, image.Length );
    oP.StandardInput.BaseStream.Close();
 
    DataContractJsonSerializer deserializer = new DataContractJsonSerializer( typeof( ExifFileJson[] ) );
    ExifFileJson[] exif = deserializer.ReadObject( oP.StandardOutput.BaseStream ) as ExifFileJson[];
 
    oP.WaitForExit();
    if ( exif.Length > 0 )
        exif[ 0 ].SourceFile = sFile;
    return exif.FirstOrDefault();
}
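The chunked reading mentioned above could look roughly like the following sketch. `StreamPiping`, `CopyInChunks` and `PipeFile` are hypothetical names, and the 64 KB buffer size is an arbitrary assumption, not a measured optimum. On .NET 4 and later, `Stream.CopyTo` does the same job.

```csharp
using System.IO;

public static class StreamPiping
{
    // Copies source to destination through a fixed-size buffer, so the whole
    // file is never held in memory at once. Returns the number of bytes copied.
    public static long CopyInChunks( Stream source, Stream destination, int bufferSize )
    {
        byte[] buffer = new byte[ bufferSize ];
        long total = 0;
        int read;
        while ( ( read = source.Read( buffer, 0, buffer.Length ) ) > 0 )
        {
            destination.Write( buffer, 0, read );
            total += read;
        }
        return total;
    }

    // Opens the file read-only and streams it to the given destination.
    public static long PipeFile( string path, Stream destination )
    {
        using ( FileStream fs = new FileStream( path, FileMode.Open, FileAccess.Read ) )
            return CopyInChunks( fs, destination, 64 * 1024 );
    }
}
```

In the method above, the `File.ReadAllBytes` and `Write` calls would then be replaced by `StreamPiping.PipeFile( sFile, oP.StandardInput.BaseStream );`, followed by the same `Close()` call.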

In case you wonder about the Json class we use for deserializing output:

[DataContract]
public class ExifFileJson
{
    [DataMember( IsRequired = true, Name = "SourceFile" )]
    public string SourceFile;
    [OnDeserializedAttribute()]
    internal void ReplaceBackSlashes( StreamingContext context ) { this.SourceFile = this.SourceFile.Replace( '/', '\\' ); }
 
    [DataMember( IsRequired = false )]
    public string DateTimeOriginal;
    [DataMember( IsRequired = false )]
    public string CreateDate;
    [DataMember( IsRequired = false )]
    public string ModifyDate;
}

Basically, we declare required and optional attributes and a name mapping where necessary. Remember to replace forward slashes with backslashes in the file names, since they are returned in Unix style. It is probably not a good idea to deserialize dates as nullable DateTime? values, since some images contain unparsable dates, which would result in a parsing exception. If you still want to do it, remember to have ExifTool decorate the dates in Json format: -d "/Date(%Y-%m-%d %H:%M:%S)/".
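If the dates stay plain strings in the Json class, they can be converted afterwards without risking an exception. The following is a minimal sketch, assuming the `-d "%Y-%m-%d %H:%M:%S"` output format used in this post; `ExifDateParser` and `TryParseExifDate` are hypothetical names.

```csharp
using System;
using System.Globalization;

public static class ExifDateParser
{
    // Returns the parsed date, or null when the string is missing or malformed.
    public static DateTime? TryParseExifDate( string value )
    {
        if ( String.IsNullOrEmpty( value ) )
            return null;
        DateTime parsed;
        bool ok = DateTime.TryParseExact( value, "yyyy-MM-dd HH:mm:ss", CultureInfo.InvariantCulture,
            DateTimeStyles.AllowWhiteSpaces | DateTimeStyles.AssumeLocal, out parsed );
        return ok ? parsed : (DateTime?) null;
    }
}
```

A call such as `ExifDateParser.TryParseExifDate( exif.DateTimeOriginal )` then yields null for a broken value instead of aborting the deserialization of the whole batch.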

Other Options

Certainly we could achieve better performance with less coding overhead if we had a way of batch processing files independent of their name and path. If you have a Perl environment with the Win32::API module on your machine, you could rewrite the above code as a Perl script and get much better performance even when reading Unicode files.

There is also another option: it is possible to add Unicode file name support to the Perl interpreter for Windows. I recently did a proof of concept which shows that ExifTool (or any UTF-8 supporting Perl app) could use Unicode file names on Windows without changing a line of code, as long as it is executed with a Unicode-supporting interpreter. The source code of Perl is pretty big, however, and I am afraid I won’t be able to invest enough time to do a bullet-proof implementation.

Retrieving Image Meta-Data using GDI+ and ExifTool

April 14th, 2010 Christian Etter

How to read image meta data in .NET? Here we illustrate two techniques:

First, for the sake of speed and simplicity, we use the built-in GDI+ capabilities of the Image class:

/// <summary>Much faster than using Exiftool. In case GDI+ cannot decode the date string we use Exiftool.</summary>
private DateTime? GetOriginalDate( string sFileName )
{
    using ( FileStream stream = new FileStream( sFileName, FileMode.Open, FileAccess.Read ) )
    {
        using ( Image img = Image.FromStream( stream, false, false ) )
        {
            int[] date_tags = new int[] { 36867, 36868, 306 }; // tag numbers with dates
            string[] s1 = ( from x in date_tags where img.PropertyIdList.Contains( x ) select Encoding.ASCII.GetString( img.GetPropertyItem( x ).Value ).Replace( "\0", "" )).ToArray(); // get date as string without trailing \0
            DateTime d;
            DateTime?[] dd = ( from x in s1 where x.Trim().Length > 0
                select DateTime.TryParseExact( x, new string[] { "yyyy:MM:dd HH:mm:ss", "yyyy-MM-dd HH:mm:ss", "MM/dd/yyyy HH:mm:ss", "yyyy-MM-dd'T'HH:mm:sszzz" }, CultureInfo.InvariantCulture, DateTimeStyles.AllowWhiteSpaces | DateTimeStyles.AssumeLocal, out d ) ? d as DateTime? : 
                null ).ToArray(); // we see if we can parse all the date attributes found
            if ( dd.Where( x => !x.HasValue ).Count() > 0 )
                return GetOldestExifDateExifTool( sFileName ); // if there is something in the date attribute we cannot parse we ask exiftool.
            else
                return ( from x in dd where x.Value > new DateTime( 1990, 01, 01 ) && x.Value < DateTime.UtcNow select x ).Min(); // make sure we use a valid date range
        }
    }
}

Since the EXIF standard defines date values to be stored as text, we sometimes find non-standard date formats: dates stored with added milliseconds, with different separator characters, or with an additional UTC offset. ExifTool does a pretty decent job of interpreting all those values as dates, and it may be able to read certain off-standard or broken meta-data that GDI+ cannot.

Here is the second approach:

/// <summary>Extracts the oldest possible EXIF date. Very slow: processes about 3 files per second, so 90,000 files take roughly 8 hours.</summary>
private static DateTime? GetOldestExifDateExifTool( string sFile )
{
    Process oP = new Process();
    oP.EnableRaisingEvents = false;
    oP.StartInfo.CreateNoWindow = true;
    oP.StartInfo.LoadUserProfile = false;
    oP.StartInfo.RedirectStandardError = false;
    oP.StartInfo.RedirectStandardOutput = true;
    oP.StartInfo.RedirectStandardInput = true;
    oP.StartInfo.StandardErrorEncoding = null;
    oP.StartInfo.StandardOutputEncoding = Encoding.UTF8;
    oP.StartInfo.UseShellExecute = false;
    oP.StartInfo.WindowStyle = ProcessWindowStyle.Hidden;
    oP.StartInfo.FileName = @"exiftool.exe";
    oP.StartInfo.Arguments = "-s -s -EXIF:ModifyDate -EXIF:DateTimeOriginal -EXIF:CreateDate -d \"%Y-%m-%d %H:%M:%S\" -";
    oP.Start();
 
    byte[] image = File.ReadAllBytes( sFile );
    oP.StandardInput.BaseStream.Write( image, 0, image.Length );
    oP.StandardInput.BaseStream.Flush();
    oP.StandardInput.BaseStream.Close();
    string sStdOut = oP.StandardOutput.ReadToEnd();
    oP.WaitForExit();
 
    string[] datetags = new string[] { "DateTimeOriginal", "CreateDate", "ModifyDate" };
    string[] res1 = sStdOut.Split( new string[] { "\r\n" }, StringSplitOptions.RemoveEmptyEntries ); // split lines
    string[][] res2 = ( from x in res1 select x.Split( new char[] { ':' }, 2 ) ).ToArray(); // split after colon to separate attributes and values
    string[] res3 = ( from x in res2 where x.Length == 2 && datetags.Contains( x[ 0 ], StringComparer.InvariantCultureIgnoreCase ) select x[ 1 ] ).ToArray(); // only choose lines of date attributes
    DateTime d;
    DateTime?[] dd = ( from x in res3 select DateTime.TryParseExact( x, "yyyy-MM-dd HH:mm:ss", CultureInfo.InvariantCulture, DateTimeStyles.AllowWhiteSpaces | DateTimeStyles.AssumeLocal, out d ) ? d as DateTime? : null ).ToArray();
    DateTime? oDate = ( from x in dd where x.HasValue && x > new DateTime( 1990, 01, 01 ) && x < DateTime.UtcNow select x ).Min();
    return oDate;
}

Basically, ExifTool has three shortcomings when used within another program:

  1. It is not available as a library. Therefore we need to make use of the Process API.
  2. It has a long startup time, because it is written in Perl and packed into a single self-expanding exe wrapper. As a result, we can only process about 3 files per second on a fast computer. GDI+ might be a hundred times faster. We might be able to work around this by processing several files in a batch, which would require a bigger change in our program logic.
  3. It does not support the Unicode filesystem API, so file names which are not compatible with the current ANSI encoding cannot be opened. To work around this limitation, we read the file into memory first and then pipe it into ExifTool.

Note: when using the Process API, you have the option to redirect both stdout and stderr at the same time, which allows for more detailed error handling and messages. However, you *must* then read stdout and stderr in different threads (or read at least one of them asynchronously) to avoid a deadlock. For the sake of simplicity, I have omitted error handling in this case.
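One way to redirect both streams without a deadlock is to read stderr through the asynchronous `ErrorDataReceived` event while stdout is read synchronously. This is only a sketch of that pattern: `ProcessRunner`, `Run` and `Result` are hypothetical names, and stdin is not redirected here.

```csharp
using System;
using System.Diagnostics;
using System.Text;

public static class ProcessRunner
{
    public sealed class Result
    {
        public int ExitCode;
        public string StdOut;
        public string StdErr;
    }

    public static Result Run( string fileName, string arguments )
    {
        using ( Process p = new Process() )
        {
            p.StartInfo.FileName = fileName;
            p.StartInfo.Arguments = arguments;
            p.StartInfo.UseShellExecute = false;
            p.StartInfo.CreateNoWindow = true;
            p.StartInfo.RedirectStandardOutput = true;
            p.StartInfo.RedirectStandardError = true;

            // stderr lines arrive on a background thread and are collected here
            StringBuilder stderr = new StringBuilder();
            p.ErrorDataReceived += ( sender, e ) =>
            {
                if ( e.Data != null )
                    lock ( stderr ) stderr.AppendLine( e.Data );
            };

            p.Start();
            p.BeginErrorReadLine();                        // asynchronous read of stderr
            string stdout = p.StandardOutput.ReadToEnd();  // synchronous read of stdout
            p.WaitForExit();                               // also waits for pending stderr events

            return new Result { ExitCode = p.ExitCode, StdOut = stdout, StdErr = stderr.ToString() };
        }
    }
}
```

A caller could then check `ExitCode` and include `StdErr` in the exception message when ExifTool reports a problem.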

Using IEqualityComparer on Custom Types with Except()

April 12th, 2010 Christian Etter

Recently I wrote about an alternative way of using Linq Distinct() on custom types which does not involve writing a custom IEqualityComparer implementation.

Today there was a similar requirement, using Linq Except() for determining all elements of an IEnumerable which do not intersect with the elements of another IEnumerable. Again, there is a one line solution to it, which is slow on large input data:

byte[][] hashes_old = /* an array of byte arrays containing a hash value */;
byte[][] hashes_new = /* another array of byte arrays containing a hash value */;
byte[][] hashes_obsolete = ( from x in hashes_old where hashes_new.Any( y => y.SequenceEqual( x ) ) == false select x ).ToArray();

We are using two arrays of 16-byte hashes and determine which elements of the first do not occur in the second. It is a more or less elegant one-liner that does not require any other comparison code. Yet with larger amounts of data it runs slowly, since SequenceEqual() may have to be called for every pair of elements:

hashes_old: 18830 hashes_new: 8210 hashes_obsolete: 12228 time: 19564 ms

Since the Except() method makes extensive use of GetHashCode(), a lot of time can be saved by properly implementing a hash function within an IEqualityComparer implementation.

byte[][] hashes_old = /* an array of byte arrays containing a hash value */;
byte[][] hashes_new = /* another array of byte arrays containing a hash value */;
byte[][] hashes_obsolete = hashes_old.Except( hashes_new, new ByteArrayComparer() ).ToArray();
 
/* .... */
public class ByteArrayComparer : IEqualityComparer<byte[]>
{
    public bool Equals( byte[] a, byte[] b )
    {
        if ( a == null || b == null )
            return a == b;
        return a.SequenceEqual( b );
    }
    public int GetHashCode( byte[] x )
    {
        if ( x == null )
            throw new ArgumentNullException();
        int iHash = 0;
        for ( int i = 0; i < x.Length; ++i )
            iHash ^= ( x[ i ] << ( ( 0x03 & i ) << 3 ) );
        return iHash;
    }
}

hashes_old: 18830 hashes_new: 8210 hashes_obsolete: 12228 time: 14 ms

Same result, but instead of 19 seconds we only need 14 milliseconds, which is about 1400 times faster!

What happens is that Except() first compares the results of GetHashCode(). Only when two arrays have the same hash code does Linq call the Equals function to make sure they are really equal (i.e. that there has been no hash collision). So the main speedup comes from a hash function with few collisions.
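To make the mechanism more tangible, here is a rough sketch of how a hash-based Except() can be built on top of a HashSet. It illustrates the idea only and is not the actual framework source; `HashedExcept` and `ExceptSketch` are hypothetical names.

```csharp
using System.Collections.Generic;

public static class HashedExcept
{
    public static IEnumerable<T> ExceptSketch<T>( IEnumerable<T> first, IEnumerable<T> second, IEqualityComparer<T> comparer )
    {
        // All elements to exclude go into a hash set; the comparer supplies GetHashCode() and Equals().
        HashSet<T> seen = new HashSet<T>( second, comparer );
        foreach ( T item in first )
        {
            // Add() looks up the bucket by hash code and calls Equals() only for entries with a matching hash.
            // It returns false when an equal element is already present, so duplicates are dropped as well.
            if ( seen.Add( item ) )
                yield return item;
        }
    }
}
```

Each element of the first sequence costs roughly one hash computation plus a handful of Equals() calls, instead of one SequenceEqual() call per element of the second sequence.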

Using Linq Distinct() without IEqualityComparer

January 18th, 2010 Christian Etter

Given an IEnumerable, such as a generic list or an array, the Distinct() method only works out of the box for simple data types. As soon as we operate on a list of objects, we are forced to write our own class implementing IEqualityComparer, which is a bit bothersome in many cases. At first glance, it seems that Microsoft simply forgot to implement lambda expressions for Distinct() and similar functions. Another reason might be that these functions benefit immensely from a hash-based comparison algorithm, which is basically what IEqualityComparer is all about. See my other blog post about this subject.

For those who are just looking for a simple solution, the following one-liner might be useful:

SomeObject[] array_1 = new SomeObject[] { ... };
SomeObject[] array_2 = array_1.GroupBy( x => x.SomePropertyOrMethod ).Select( x => x.First() ).ToArray();
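If the same pattern is needed in several places, the key selector can be wrapped in a small helper that tracks the keys it has already seen in a HashSet. This is a sketch under assumptions: `DistinctHelper` and `DistinctByKey` are hypothetical names, and the key type is assumed to have a usable default equality (as strings and numbers do).

```csharp
using System;
using System.Collections.Generic;

public static class DistinctHelper
{
    // Yields the first element for every distinct key, in the original order.
    public static IEnumerable<T> DistinctByKey<T, TKey>( IEnumerable<T> source, Func<T, TKey> keySelector )
    {
        HashSet<TKey> seenKeys = new HashSet<TKey>();
        foreach ( T item in source )
        {
            // Add() returns false when the key has been seen before
            if ( seenKeys.Add( keySelector( item ) ) )
                yield return item;
        }
    }
}
```

The one-liner would then read `DistinctHelper.DistinctByKey( array_1, x => x.SomePropertyOrMethod ).ToArray()`. Unlike GroupBy(), this does not build a group for every key before returning the first result.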

A standard implementation using IEqualityComparer could look like this:

byte[][] hash_distinct = hash_duplicate.Distinct( new ByteArrayComparer() ).ToArray();
/* .... */
public class ByteArrayComparer : IEqualityComparer<byte[]>
{
    public bool Equals( byte[] a, byte[] b )
    {
        if ( a == null || b == null )
            return a == b;
        return a.SequenceEqual( b );
    }
    public int GetHashCode( byte[] x )
    {
        if ( x == null )
            throw new ArgumentNullException();
        int iHash = 0;
        for ( int i = 0; i < x.Length; ++i )
            iHash ^= ( x[ i ] << ( ( 0x03 & i ) << 3 ) );
        return iHash;
    }
}

In my tests the performance gain by using an IEqualityComparer implementation instead of the above solution is about 100% when working on an array of 18000 elements.

Byte Array to Hex String in C# and Linq

February 18th, 2009 Christian Etter

Converting a byte array into a hexadecimal string is easy to do using a loop and a few lines of code. With Linq, it can be done more elegantly in a single line:

string s = String.Join( " ", byte_array.Select( x => x.ToString( "X2" ) ).ToArray() );
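For comparison, the framework's own BitConverter.ToString() produces upper-case hex pairs separated by hyphens, so replacing the hyphens gives the same space-separated output. A small sketch, with `HexFormatting` and `ToHexString` as hypothetical names:

```csharp
using System;

public static class HexFormatting
{
    // BitConverter.ToString yields e.g. "00-0A-FF"; the hyphens are swapped for spaces.
    public static string ToHexString( byte[] bytes )
    {
        return BitConverter.ToString( bytes ).Replace( "-", " " );
    }
}
```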
Categories: C#, Software Development