I had some time to look into the performance of reading image meta data using ExifTool and the GDI+ API. If you are planning to use Exiftool in your own software project, you might find the following information useful:
Tests have been conducted on a Core i7 720QM system with 4 GB RAM and a 7200 rpm HD, running Windows 7 64 bit. The testing application was written in C# using .NET Framework 3.5 and Linq.
We are benchmarking different approaches reading a total of 1000 jpg files and measuring the total time taken to read 3 EXIF dates from each file (if present). In order to minimize misleading results when reading files which have previously been cached, we are reading all files once before performing the test.
Reading files without processing
We read every file into memory after it has been cached in the file system cache, so this is more or less an in-memory operation. This gives us a clue regarding the file access overhead.
Duration for 1000 files: 226 ms.
Reading meta data with GDI+
Using the .NET builtin Image class, reading and parsing of 3 attributes is rather fast, taking less than 2 seconds.
Duration for 1000 files: 1632 ms.
Reading meta data using ExifTool
We call ExifTool for every file and parse the return values as DateTime. In order to support all filenames (also names which can only be represented in Unicode), we read the file into memory first and then pipe it to stdin of ExifTool. This is the slowest of all reading modes, taking about 200 times as long as GDI+. As we will see soon, the actual preformance hit ist the startup time of the Perl interpreter, and not ExifTool itself.
Duration for 1000 files: 337000 ms.
Reading meta data using ExifTool (-fast)
Same as above, but using the -fast option in ExifTool, which will prematurely cancel reading from stdin, once a sufficient amount of data has been found. This should increase performance especially when reading files directly over a slow network, but as we can see, it does not make a big difference in our case. Throughput increases by less than 10 percent.
Duration for 1000 files: 317000 ms.
Reading meta data using ExifTool (-fast2)
In addition to the -fast option, using -fast2 should allow for faster processing by ommitting maker notes. The effect is small in our case, which might also be because not all images in the test set contain maker notes. As we will see later, the actual work performed by ExifTool is only very small compared to the Perl startup time. Using the evidence gathered in this test, Perl consumes as much as 98.4% of the total execution time, whereas reading and parsing of the jpeg file only takes 1.6%.
Duration for 1000 files: 311000 ms.
Reading meta data using ExifTool (without extraction)
Since exiftool.exe unpacks a payload of 967 files in 60 folders (8.67 MB), one approach was to skip the extraction process and call ExifTool directly in the temporary folder with all the PAR environment variables set. Unfortunately this could not deliver any improvement worth the effort. Therefore we can say that the amount of time used for extraction is hardly relevant.
Duration for 1000 files: 314000 ms.
Reading meta data using ExifTool (multiple threads)
Since the entire operation is clearly CPU bound, the obvious optimization on a multi core / hyper threading system would be to execute ExifTool in parallel. For our test case, we are executing 8 instances of ExifTool at the same time, using 8 threads (the test system has 8 virtual cores). This results in a signifficant speed up, more than 3 times as fast as a serialized execution:
Duration for 1000 files: 99000 ms.
Reading meta data using ExifTool (batch mode)
The end of the road? With the multi threaded approach we have reached the maximum optimization possible when calling ExifTool seperately for each file. This technique gave us the advantage of being able to handle file names with arbitrary character sets and detailed progress and error reporting.
The only way of getting more (much more) throughput is to use ExifTool in batch mode, thus minimizing the actual impact of the Perl startup time.
As before, ExifTool has been set to read three Date tags from each file and format the text output. This time we are using Json as the output format, since it allows us to easily parse the result of each file. In order to avoid getting a command line which is too long, we are writing all files into an ‘argfile’, which is then written to stdin of Exiftool using -@ -.
This approach is F A S T! We are suddenly doing the same job within 1.6% of the previous time!
Duration for 1000 files: 5669 ms.
Reading meta data using ExifTool (batch mode, -fast2)
No big improvements here, we are ending up with almost the same amount of time. This might be due to the fact that only part of the tested files contain maker notes, so results may vary.
Duration for 1000 files: 5422 ms.
In the above test we have learned that iterating file by file is extremely slow when processing large amounts of files using ExifTool. The performance impact is caused by the overhead of calling the Perl interpreter, and is therefore difficult to minimize when calling ExifTool once for each file. Depending on the kind of data needed, it is therefore worthwhile to look into other solutions such as batch processing or using a different API.
When using ExifTool in batch mode, it is strongly recommended to check whether or not a file name and path can be represented using the current system’s code page. Otherwise such files will not be read. It is not safe to call ExifTool with MS-DOS ’8dot3′ filenames, since these are not converted to ANSI in case of eastern asian characters. Also, these names are not guaranteed to be present for each file on the NTFS file system.
For an ExifTool-only solution reading meta data, the following approach is recommended when reading larger amounts of files:
- Create a list of image files to be processed.
- Convert each file name into a UTF-8 byte sequence and decode using the current system ANSI code page. Then compare the resulting string to the original path of the file. If both are not equal, you must not use this file in batch processing (ExifTool cannot open it) and instead read it in a separate call. Why should you convert to UTF-8 and then to ANSI? This might be confusing at first, but it mimics exactly what happens behind the scenes. You need to pass the file names as UTF-8 to ExifTool, since they appear again in the output (which should be UTF-8).
- Write a list of UTF-8 encoded file names to stdin of ExifTool, do not pass them on the command line, since you might hit the maximum length of the command line and truncate it unknowingly.
- Call ExifTool, preferably using -J option for generating output in Json format. Json allows for more efficient parsing (compared to XML) of the result.
- Parse the result and ensure you are getting a result for each file. There are many libraries to handle Json parsing, in .NET it is done in a few lines of code. If you had entered file names in ANSI encoding in the previous step, you would run into errors here because your output would contain a mix of encodings.
- All files which did not pass the UTF-8 to ANSI encoding roundtrip need to be processed one by one.
The source code of the test app is available for download here.