Archive

Posts Tagged ‘Performance’

ExifTool Performance Benchmark

April 15th, 2010 7 comments

I had some time to look into the performance of reading image meta data using ExifTool and the GDI+ API. If you are planning to use Exiftool in your own software project, you might find the following information useful:

Test Environment
Tests have been conducted on a Core i7 720QM system with 4 GB RAM and a 7200 rpm HD, running Windows 7 64 bit. The testing application was written in C# using .NET Framework 3.5 and Linq.

Preparation

We are benchmarking different approaches reading a total of 1000 jpg files and measuring the total time taken to read 3 EXIF dates from each file (if present). In order to minimize misleading results when reading files which have previously been cached, we are reading all files once before performing the test.

Reading files without processing

We read every file into memory after it has been cached in the file system cache, so this is more or less an in-memory operation. This gives us a clue regarding the file access overhead.

Duration for 1000 files: 226 ms.

Reading meta data with GDI+

Using the .NET builtin Image class, reading and parsing of 3 attributes is rather fast, taking less than 2 seconds.

Duration for 1000 files: 1632 ms.

Reading meta data using ExifTool

We call ExifTool for every file and parse the return values as DateTime. In order to support all filenames (also names which can only be represented in Unicode), we read the file into memory first and then pipe it to stdin of ExifTool. This is the slowest of all reading modes, taking about 200 times as long as GDI+. As we will see soon, the actual preformance hit ist the startup time of the Perl interpreter, and not ExifTool itself.

Duration for 1000 files: 337000 ms.

Reading meta data using ExifTool (-fast)

Same as above, but using the -fast option in ExifTool, which will prematurely cancel reading from stdin, once a sufficient amount of data has been found. This should increase performance especially when reading files directly over a slow network, but as we can see, it does not make a big difference in our case. Throughput increases by less than 10 percent.

Duration for 1000 files: 317000 ms.
Reading meta data using ExifTool (-fast2)
In addition to the -fast option, using -fast2 should allow for faster processing by ommitting maker notes. The effect is small in our case, which might also be because not all images in the test set contain maker notes. As we will see later, the actual work performed by ExifTool is only very small compared to the Perl startup time. Using the evidence gathered in this test, Perl consumes as much as 98.4% of the total execution time, whereas reading and parsing of the jpeg file only takes 1.6%.

Duration for 1000 files: 311000 ms.

Reading meta data using ExifTool (without extraction)

Since exiftool.exe unpacks a payload of 967 files in 60 folders (8.67 MB), one approach was to skip the extraction process and call ExifTool directly in the temporary folder with all the PAR environment variables set. Unfortunately this could not deliver any improvement worth the effort. Therefore we can say that the amount of time used for extraction is hardly relevant.

Duration for 1000 files: 314000 ms.

Reading meta data using ExifTool (multiple threads)

Since the entire operation is clearly CPU bound, the obvious optimization on a multi core / hyper threading system would be to execute ExifTool in parallel. For our test case, we are executing 8 instances of ExifTool at the same time, using 8 threads (the test system has 8 virtual cores). This results in a signifficant speed up, more than 3 times as fast as a serialized execution:

Duration for 1000 files: 99000 ms.

Reading meta data using ExifTool (batch mode)

The end of the road? With the multi threaded approach we have reached the maximum optimization possible when calling ExifTool seperately for each file. This technique gave us the advantage of being able to handle file names with arbitrary character sets and detailed progress and error reporting.

The only way of getting more (much more) throughput is to use ExifTool in batch mode, thus minimizing the actual impact of the Perl startup time.

As before, ExifTool has been set to read three Date tags from each file and format the text output. This time we are using Json as the output format, since it allows us to easily parse the result of each file. In order to avoid getting a command line which is too long, we are writing all files into an ‘argfile’, which is then written to stdin of Exiftool using -@ -.

This approach is F A S T! We are suddenly doing the same job within 1.6% of the previous time!

Duration for 1000 files: 5669 ms.

Reading meta data using ExifTool (batch mode, -fast2)

No big improvements here, we are ending up with almost the same amount of time. This might be due to the fact that only part of the tested files contain maker notes, so results may vary.

Duration for 1000 files: 5422 ms.

Conclusion

In the above test we have learned that iterating file by file is extremely slow when processing large amounts of files using ExifTool. The performance impact is caused by the overhead of calling the Perl interpreter, and is therefore difficult to minimize when calling ExifTool once for each file. Depending on the kind of data needed, it is therefore worthwhile to look into other solutions such as batch processing or using a different API.

When using ExifTool in batch mode, it is strongly recommended to check whether or not a file name and path can be represented using the current system’s code page. Otherwise such files will not be read. It is not safe to call ExifTool with MS-DOS ’8dot3′ filenames, since these are not converted to ANSI in case of eastern asian characters. Also, these names are not guaranteed to be present for each file on the NTFS file system.

For an ExifTool-only solution reading meta data, the following approach is recommended when reading larger amounts of files:

  1. Create a list of image files to be processed.
  2. Convert each file name into a UTF-8 byte sequence and decode using the current system ANSI code page. Then compare the resulting string to the original path of the file. If both are not equal, you must not use this file in batch processing (ExifTool cannot open it) and instead read it in a separate call. Why should you convert to UTF-8 and then to ANSI? This might be confusing at first, but it mimics exactly what happens behind the scenes. You need to pass the file names as UTF-8 to ExifTool, since they appear again in the output (which should be UTF-8).
  3. Write a list of UTF-8 encoded file names to stdin of ExifTool, do not pass them on the command line, since you might hit the maximum length of the command line and truncate it unknowingly.
  4. Call ExifTool, preferably using -J option for generating output in Json format. Json allows for more efficient parsing (compared to XML) of the result.
  5. Parse the result and ensure you are getting a result for each file. There are many libraries to handle Json parsing, in .NET it is done in a few lines of code. If you had entered file names in ANSI encoding in the previous step, you would run into errors here because your output would contain a mix of encodings.
  6. All files which did not pass the UTF-8 to ANSI encoding roundtrip need to be processed one by one.

Source Code

The source code of the test app is available for download here.

How to optimize reading of large amounts of files

February 21st, 2010 3 comments

Conventional hard drives keep getting faster. While 10 years ago an average transfer speed of 40 MiB/sec was considered ‘fast’, today there are mainstream hard drives which deliver average transfer speeds beyond 100 MiB/sec.
So far, so good. What hasn’t increased much is the average access time. Especially when reading large numbers of relatively small files, a lot of time is spent just waiting for the hard disk to position its arm over the relevant data. No big change compared to a decade ago.

Note: In case you are developing software that exclusively operates on Solid State Disks (SSD) you may stop reading now…

The following idea has been developed for an image indexing engine, which is supposed to scan a large amount of images and extract information within the least possible amount of time. The entire indexing operation was disk-bound. For the main test case, a hard disk with roughly 100,000 image files was chosen.

In order to minimize the time spent waiting for each file being read, it would be worthwhile to read all files in the same physical order in which they are residing on the hard disk. So the key information we need is the physical location of each file on the hard disk.

For this purpose, the FSCTL_GET_RETRIEVAL_POINTERS operation can be used:

struct MyFile { PTSTR sFilename; DWORD64 Lcn; DWORD dwFragments; };
 
void GetStartCluster( PCTSTR sFilename, MyFile& f )
{
    f.sFilename = sFilename;
    f.dwFragments = f.Lcn = 0;
 
    HANDLE hFile = CreateFile( sFilename, GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL );
    if ( INVALID_HANDLE_VALUE != hFile )
    {
        const DWORD MAX_CLUSTERS = 100; // actually we only care about the first cluster number
        const DWORD RETRIEVAL_POINTERS_BUFFER_SIZE = sizeof( RETRIEVAL_POINTERS_BUFFER ) + ( 2 * ( MAX_CLUSTERS - 1 ) * sizeof( LARGE_INTEGER ) );
        BYTE output[ RETRIEVAL_POINTERS_BUFFER_SIZE ];
 
        STARTING_VCN_INPUT_BUFFER input;
        input.StartingVcn.QuadPart = 0;
 
        DWORD dwBytesReturned;
        if ( DeviceIoControl( hFile, FSCTL_GET_RETRIEVAL_POINTERS, &input, sizeof( input ), &output, sizeof( output ), &dwBytesReturned, NULL ) )
        {
            RETRIEVAL_POINTERS_BUFFER* p = (RETRIEVAL_POINTERS_BUFFER*)&output;
            if ( p->ExtentCount > 0 && p->StartingVcn.QuadPart == 0 )
            {
                f.dwFragments = p->ExtentCount;
                f.Lcn = p->Extents[ 0 ].Lcn.QuadPart;
            }
            else
                _tprintf( _T("Error: ExtentCount = %d StartingVcn = %I64d\r\n"), p->ExtentCount, p->StartingVcn.QuadPart );
        }
        else
            _tprintf( _T("Error: DeviceIoControl = 0x%08X\r\n"), GetLastError() );
        CloseHandle( hFile ); 
    }
}

Initially, the cluster number and the number of extents (fragments) is retrieved for every file. In a second step, all files are sorted according to the number of fragments and their cluster number. Then they are processed in exactly this new order which leads to a minimized amount of disk ‘thrashing’.
The actual retrieval of file locations using DeviceIoControl is relatively fast, since it only queries information from the file system and not from the file’s data location. Typically this operation is CPU bound and consumes less than 5% of all indexing time.

Taken into account the time needed to retrieve position data and sort files, this approach can save between 20 and 50% of read time, depending on average file size, fragmentation and spacial data distribution. There is no benefit from this method if the majority of files is fragmented. However, fragmentation is more likey to happen with larger files, and the average file size in this case was only 149 KiB. Generally speaking, the performance gain this method can provide increases with decreasing file sizes.

Benchmarking

  • Hard disk: 1 TB 7200 rpm SATA II, avg. access time: 17 ms
  • Windows XP Professional 32 bit booted in protected mode
  • Number of files scanned: 87,418
  • Total file size: 12.46 GiB
  • Average file size: 149 KiB
  • Time for reading files in filesystem order: 506 seconds
  • Average throughput reading files in filesystem order: 25.24 MiB/s
  • Time for reading file position data: 12 seconds
  • Time for sorting files according to their position: 54 ms
  • Time for reading files in sorted order: 347 seconds
  • Total time using new approach: 359 seconds
  • Total throughput using new approach: 35.58 MiB/s
  • Performance increase: 147 seconds or 29 %
  • Reading files in filesystem order takes 1.42 times as long as a pre-sorted read

Measuring Code Performance in C

August 11th, 2009 No comments

When optimizing code for speed, it is useful to have a very precise measurement of the time a certain routine or loop needs to complete. When measuring time only, the QueryPerformanceFrequency() and QueryPerformanceCounter() APIs can be employed. If we need to be even more accurate and measure the number of CPU clock cycles, the x86 asm instruction rdtsc, or the Visual C++ compiler intrinsic __rdtsc() function can be used. The return value of this function is a unsigned 64 bit value that wraps around every 194 years or so on a 3 Ghz CPU. Originally, the RDTSC instruction loads the current count of the time-stamp counter into the EDX:EAX registers for further processing. The instrinsic function is merely a wrapper that returns a 64 bit value of the combined registers.

The following two macros come in handy for wrapping a section to be examined and outputting the result to stdout or a function of your choice.

#define TIMER_START( description ) { \
	PCTSTR szDescription = description; \
	_tprintf( _T("\n-->Starting %s...\n"), description ); \
	LARGE_INTEGER start; \
	QueryPerformanceCounter( &start ); \
	DWORD64 ticks = __rdtsc();
 
#define TIMER_END( iterations ) \
	ticks = __rdtsc() - ticks; \
	LARGE_INTEGER frequency, end; \
	QueryPerformanceCounter( &end ); \
	QueryPerformanceFrequency( &frequency ); \
	DWORD64 it = iterations; \
	DWORD64 diff = end.QuadPart - start.QuadPart; \
	_tprintf( _T("-->Finished %s in %I64d ms %I64d ticks\n"), szDescription, diff / ( frequency.QuadPart / 1000 ), ticks ); \
	_tprintf( _T("-->%I64d iterations - %I64d iterations/sec - %I64d ticks/iteration\n\n"), it, ( it * frequency.QuadPart ) / diff, ticks / it ); }

Technically you would have to subtract the duration of two __rdtsc() calls from the final result, but most likely the time taken is not measurable by QueryPerformanceCounter().
You could also use the above as a separate function:

typedef struct TIMER_DATA{ PCTSTR szDescription; DWORD64 iterations; LARGE_INTEGER start_time; DWORD64 start_ticks; } *PTIMER_DATA;
 
inline void timer_start( PTIMER_DATA data )
{
	_tprintf( _T("\n-->Starting %s...\n"), ( data->szDescription == NULL ) ? _T("") : data->szDescription );
	QueryPerformanceCounter( &(data->start_time) ); \
	data->start_ticks = __rdtsc();
}
 
inline void timer_end( PTIMER_DATA data )
{
	DWORD64 ticks = __rdtsc() - data->start_ticks;
	LARGE_INTEGER frequency, end;
	QueryPerformanceCounter( &end );
	QueryPerformanceFrequency( &frequency );
	DWORD64 diff = end.QuadPart - data->start_time.QuadPart;
	_tprintf( _T("-->Finished %s in %I64d ms %I64d ticks\n"), ( data->szDescription == NULL ) ? _T("") : data->szDescription, diff / ( frequency.QuadPart / 1000 ), ticks );
	_tprintf( _T("-->%I64d iterations - %I64d iterations/sec - %I64d ticks/iteration\n\n"), data->iterations, ( data->iterations * frequency.QuadPart ) / diff, ticks / data->iterations );
}

Usage:

TIMER_START( _T("bla") )
 
/* Your code here */
 
TIMER_END( 200000 )
 
TIMER_DATA data;
data.szDescription = _T("bla");
data.iterations = 200000;
timer_start( &data );
 
/* Your code here */
 
timer_end( &data );

Especially when optimizing tight loops it is recommended to measure the execution time of the complete loop and then divide the result by the number of iterations instead of measuring each iteration individually.