Archive

Posts Tagged ‘.NET’

ExifTool Performance Benchmark

April 15th, 2010 Christian Etter 3 comments

I had some time to look into the performance of reading image meta data using ExifTool and the GDI+ API. If you are planning to use Exiftool in your own software project, you might find the following information useful:

Test Environment
Tests have been conducted on a Core i7 720QM system with 4 GB RAM and a 7200 rpm HD, running Windows 7 64 bit. The testing application was written in C# using .NET Framework 3.5 and Linq.

Preparation

We are benchmarking different approaches reading a total of 1000 jpg files and measuring the total time taken to read 3 EXIF dates from each file (if present). In order to minimize misleading results when reading files which have previously been cached, we are reading all files once before performing the test.

Reading files without processing

We read every file into memory after it has been cached in the file system cache, so this is more or less an in-memory operation. This gives us a clue regarding the file access overhead.

Duration for 1000 files: 226 ms.

Reading meta data with GDI+

Using the .NET builtin Image class, reading and parsing of 3 attributes is rather fast, taking less than 2 seconds.

Duration for 1000 files: 1632 ms.

Reading meta data using ExifTool

We call ExifTool for every file and parse the return values as DateTime. In order to support all filenames (also names which can only be represented in Unicode), we read the file into memory first and then pipe it to stdin of ExifTool. This is the slowest of all reading modes, taking about 200 times as long as GDI+. As we will see soon, the actual preformance hit ist the startup time of the Perl interpreter, and not ExifTool itself.

Duration for 1000 files: 337000 ms.

Reading meta data using ExifTool (-fast)

Same as above, but using the -fast option in ExifTool, which will prematurely cancel reading from stdin, once a sufficient amount of data has been found. This should increase performance especially when reading files directly over a slow network, but as we can see, it does not make a big difference in our case. Throughput increases by less than 10 percent.

Duration for 1000 files: 317000 ms.
Reading meta data using ExifTool (-fast2)
In addition to the -fast option, using -fast2 should allow for faster processing by ommitting maker notes. The effect is small in our case, which might also be because not all images in the test set contain maker notes. As we will see later, the actual work performed by ExifTool is only very small compared to the Perl startup time. Using the evidence gathered in this test, Perl consumes as much as 98.4% of the total execution time, whereas reading and parsing of the jpeg file only takes 1.6%.

Duration for 1000 files: 311000 ms.

Reading meta data using ExifTool (without extraction)

Since exiftool.exe unpacks a payload of 967 files in 60 folders (8.67 MB), one approach was to skip the extraction process and call ExifTool directly in the temporary folder with all the PAR environment variables set. Unfortunately this could not deliver any improvement worth the effort. Therefore we can say that the amount of time used for extraction is hardly relevant.

Duration for 1000 files: 314000 ms.

Reading meta data using ExifTool (multiple threads)

Since the entire operation is clearly CPU bound, the obvious optimization on a multi core / hyper threading system would be to execute ExifTool in parallel. For our test case, we are executing 8 instances of ExifTool at the same time, using 8 threads (the test system has 8 virtual cores). This results in a signifficant speed up, more than 3 times as fast as a serialized execution:

Duration for 1000 files: 99000 ms.

Reading meta data using ExifTool (batch mode)

The end of the road? With the multi threaded approach we have reached the maximum optimization possible when calling ExifTool seperately for each file. This technique gave us the advantage of being able to handle file names with arbitrary character sets and detailed progress and error reporting.

The only way of getting more (much more) throughput is to use ExifTool in batch mode, thus minimizing the actual impact of the Perl startup time.

As before, ExifTool has been set to read three Date tags from each file and format the text output. This time we are using Json as the output format, since it allows us to easily parse the result of each file. In order to avoid getting a command line which is too long, we are writing all files into an ‘argfile’, which is then written to stdin of Exiftool using -@ -.

This approach is F A S T! We are suddenly doing the same job within 1.6% of the previous time!

Duration for 1000 files: 5669 ms.

Reading meta data using ExifTool (batch mode, -fast2)

No big improvements here, we are ending up with almost the same amount of time. This might be due to the fact that only part of the tested files contain maker notes, so results may vary.

Duration for 1000 files: 5422 ms.

Conclusion

In the above test we have learned that iterating file by file is extremely slow when processing large amounts of files using ExifTool. The performance impact is caused by the overhead of calling the Perl interpreter, and is therefore difficult to minimize when calling ExifTool once for each file. Depending on the kind of data needed, it is therefore worthwhile to look into other solutions such as batch processing or using a different API.

When using ExifTool in batch mode, it is strongly recommended to check whether or not a file name and path can be represented using the current system’s code page. Otherwise such files will not be read. It is not safe to call ExifTool with MS-DOS ’8dot3′ filenames, since these are not converted to ANSI in case of eastern asian characters. Also, these names are not guaranteed to be present for each file on the NTFS file system.

For an ExifTool-only solution reading meta data, the following approach is recommended when reading larger amounts of files:

  1. Create a list of image files to be processed.
  2. Convert each file name into a UTF-8 byte sequence and decode using the current system ANSI code page. Then compare the resulting string to the original path of the file. If both are not equal, you must not use this file in batch processing (ExifTool cannot open it) and instead read it in a separate call. Why should you convert to UTF-8 and then to ANSI? This might be confusing at first, but it mimics exactly what happens behind the scenes. You need to pass the file names as UTF-8 to ExifTool, since they appear again in the output (which should be UTF-8).
  3. Write a list of UTF-8 encoded file names to stdin of ExifTool, do not pass them on the command line, since you might hit the maximum length of the command line and truncate it unknowingly.
  4. Call ExifTool, preferably using -J option for generating output in Json format. Json allows for more efficient parsing (compared to XML) of the result.
  5. Parse the result and ensure you are getting a result for each file. There are many libraries to handle Json parsing, in .NET it is done in a few lines of code. If you had entered file names in ANSI encoding in the previous step, you would run into errors here because your output would contain a mix of encodings.
  6. All files which did not pass the UTF-8 to ANSI encoding roundtrip need to be processed one by one.

Source Code

The source code of the test app is available for download here.

Extending SQLite functionality using .NET (UDF)

April 13th, 2010 Christian Etter No comments

Although SQLite does not contain builtin support for directly writing user defined functions (UDF) in PL/SQL, it features a very fast and convenient way for extended functionality using your programming language of choice.

Let me illustrate this by example: In this case we are looking at a string transformation algorithm that allows for sorting a Url string according to the top level domain, domain name and host name. E.g. http://www.wikipedia.de/favicon.ico should be sorted according to http://de.wikipedia.www/favicon.ico.

Using PL/SQL with its limited string processing functions, this can result in a lot of coding. Using .NET, it’s easy to write and surprisingly fast.

[SQLiteFunction( Name = "ReverseDomain", Arguments = 1, FuncType = FunctionType.Scalar )]
public class ReverseDomain: SQLiteFunction
{
    public override object Invoke( object[] args )
    {
        if ( args == null || args.Length < 1 || args[ 0 ] == null )
            return null;
        string[] ss = args[ 0 ].ToString().Split( new string[] { "://", "/" }, 3 );
        if ( ss.Length > 1 )
        {
            string[] sss = ss[ 1 ].Split( new char[] { '.' } );
            Array.Reverse( sss );
            return ss[ 0 ] + "://" + String.Join( ".", sss ) + "/" + ( ss.Length == 3 ? ss[ 2 ] : String.Empty );
        }
        return args[ 0 ];
    }
}

Alternatively, we could use the UriBuilder class:

[SQLiteFunction( Name = "ReverseDomain", Arguments = 1, FuncType = FunctionType.Scalar )]
public class ReverseDomain: SQLiteFunction
{
    public override object Invoke( object[] args )
    {
        if ( args == null || args.Length < 1 || args[ 0 ] == null )
            return null;
        UriBuilder u = new UriBuilder( args[ 0 ] as string );
        string[] ss = u.Host.Split( new char[] { '.' } );
        Array.Reverse( ss );
        u.Host = String.Join( ".", ss );
        return u.ToString();
    }
}

Where is the catch with UriBuilder? Compared to the custom string splitting solution it is about 5 times slower. It can transform only about 65000 urls per second on my machine, while the prior method can convert 350000 urls in the same time.

How is this new UDF called from SQL?

SELECT * FROM UrlTable ORDER BY ReverseDomain( Url )

Using IEqualityComparer on Custom Types with Except()

April 12th, 2010 Christian Etter No comments

Recently I was writing about an alternative way of using Linq Distinct() on custom types which does not involve writing a custom IEqualityComparer derivate.

Today there was a similar requirement, using Linq Except() for determining all elements of an IEnumerable which do not intersect with the elements of another IEnumerable. Again, there is a one line solution to it, which is slow on large input data:

byte[][] hashes_old = /* an array of byte arrays containing a hash value */;
byte[][] hashes_new = /* another array of byte arrays containing a hash value */;
byte[][] hashes_obsolete = ( from x in hashes_old where hashes_new.Any( y => y.SequenceEqual( x ) ) == false select x ).ToArray();

We are using two arrays of 16 byte long hashes and determine which elements do not intersect. It is a more or less elegant one liner that does not require any other comparison code. Yet when we test this with larger amounts of data, it runs slow since SequenceEqual() has to be called for every single comparison:

hashes_old: 18830 hashes_new: 8210 hashes_obsolete: 12228 time: 19564 ms

Since the Except() method is extensively using the GetHashCode() override, a lot of time can be saved by properly implementing a hash function within an IEqualityComparer derivate.

byte[][] hashes_old = /* an array of byte arrays containing a hash value */;
byte[][] hashes_new = /* another array of byte arrays containing a hash value */;
byte[][] hashes_obsolete = hashes_old.Except( hashes_new, new ByteArrayComparer() ).ToArray();
 
/* .... */
public class ByteArrayComparer : IEqualityComparer<byte[]>
{
    public bool Equals( byte[] a, byte[] b )
    {
        if ( a == null || b == null )
            return a == b;
        return a.SequenceEqual( b );
    }
    public int GetHashCode( byte[] x )
    {
        if ( x == null )
            throw new ArgumentNullException();
        int iHash = 0;
        for ( int i = 0; i < x.Length; ++i )
            iHash ^= ( x[ i ] << ( ( 0x03 & i ) << 3 ) );
        return iHash;
    }
}

hashes_old: 18830 hashes_new: 8210 hashes_obsolete: 12228 time: 14 ms

Same result, but instead of 19 seconds we only need 16 milliseconds, that is 1400 times faster!

What happens is that the result of GetHashCode() is used for each comparison. When two arrays have the same hash code, Linq calls the Equals function to ensure both are really equal (there has been no hash collision). So the main speedup is realized by writing a low-collision hashing function.

Using Linq Distinct() without IEqualityComparer

January 18th, 2010 Christian Etter No comments

Given an IEnumerable class, such as a generic list or array, it is only possible to use the Distinct() method when working with simple data types. As soon as we are operating on a list of objects though, we are forced to write your own class implementing IEqualityComparer, which is a bit bothersome in many cases. At tehe first glance, it seems that Microsoft has simply forgotten to implement Lambda Expressions for Distinct() and similar functions. Another reason might be that these functions immensely benefit from the use of a hash based comparison algorithm, and basically that is what the IEqualityComparer is all about. See my other blog post about this subject.

For those who are just looking for a simple solution, the following one-liner might be useful:

SomeObject[] array_1 = new SomeObject[] { ... }
SomeObject[] array_2 = array_1.GroupBy( x => x.SomePropertyOrMethod ).Select( x => x.First() ).ToArray();

A standard implementation using IEqualityComparer could look like this:

byte[][] hash_distinct = hash_duplicate.Distinct( new ByteArrayComparer() ).ToArray();
/* .... */
public class ByteArrayComparer : IEqualityComparer<byte[]>
{
    public bool Equals( byte[] a, byte[] b )
    {
        if ( a == null || b == null )
            return a == b;
        return a.SequenceEqual( b );
    }
    public int GetHashCode( byte[] x )
    {
        if ( x == null )
            throw new ArgumentNullException();
        int iHash = 0;
        for ( int i = 0; i < x.Length; ++i )
            iHash ^= ( x[ i ] << ( ( 0x03 & i ) << 3 ) );
        return iHash;
    }
}

In my tests the performance gain by using an IEqualityComparer implementation instead of the above solution is about 100% when working on an array of 18000 elements.

.NET: Limiting a Program to a Single Instance

March 12th, 2009 Christian Etter No comments

Windows Mobile comes with an internal instance counter that will switch to an already existing window if a program is started more than just once.

Windows CE is lacking such a feature, but it is relatively painless to implement using P/Invoke and events.

public class SingleInstance : IDisposable
{
    [DllImport( "Coredll.dll", SetLastError = true )]
    static extern IntPtr CreateEvent( IntPtr ptrAlwaysZero, bool bManualReset, bool bInitialState, string lpName );
 
    [DllImport( "Coredll.dll", SetLastError = true )]
    static extern int CloseHandle( IntPtr handle );
 
    private IntPtr Handle = IntPtr.Zero;
 
    public bool IsSingleInstance { get { return this.Handle != IntPtr.Zero; } }
 
    /// <summary>A running program is identified by an already existing event that is named the same as the executable path.</summary>
    public SingleInstance() : this( Assembly.GetExecutingAssembly().GetModules()[ 0 ].FullyQualifiedName ) { }
 
    /// <summary>Accepts a single string as a parameter.</summary>
    public SingleInstance( string sEventName )
    {
        this.Handle = CreateEvent( IntPtr.Zero, true, false, sEventName );
        if ( Marshal.GetLastWin32Error() != 0 )
            Dispose();
    }
 
    public void Dispose()
    {
        if ( this.Handle != IntPtr.Zero )
            CloseHandle( this.Handle );
        this.Handle = IntPtr.Zero;
    }
}

You should make use of this class as early in the program as possible, e.g.

static class Program
{
    [MTAThread]
    static void Main()
    {
        using ( SingleInstance si = new SingleInstance() )
            if ( si.IsSingleInstance )
                Application.Run();
    }
}

Use of the above code is not restricted to Windows Mobile though. You could also use this snippet for regular Windows programs, just remember to replace the dllimport for Coredll.dll to kernel32.dll.