Archive

Posts Tagged ‘Assembly’

Measuring Code Performance in C

August 11th, 2009 No comments

When optimizing code for speed, it is useful to have a very precise measurement of the time a certain routine or loop needs to complete. When measuring time only, the QueryPerformanceFrequency() and QueryPerformanceCounter() APIs can be employed. If we need to be even more accurate and measure the number of CPU clock cycles, the x86 asm instruction rdtsc, or the Visual C++ compiler intrinsic __rdtsc() function can be used. The return value of this function is a unsigned 64 bit value that wraps around every 194 years or so on a 3 Ghz CPU. Originally, the RDTSC instruction loads the current count of the time-stamp counter into the EDX:EAX registers for further processing. The instrinsic function is merely a wrapper that returns a 64 bit value of the combined registers.

The following two macros come in handy for wrapping a section to be examined and outputting the result to stdout or a function of your choice.

#define TIMER_START( description ) { \
	PCTSTR szDescription = description; \
	_tprintf( _T("\n-->Starting %s...\n"), description ); \
	LARGE_INTEGER start; \
	QueryPerformanceCounter( &start ); \
	DWORD64 ticks = __rdtsc();
 
#define TIMER_END( iterations ) \
	ticks = __rdtsc() - ticks; \
	LARGE_INTEGER frequency, end; \
	QueryPerformanceCounter( &end ); \
	QueryPerformanceFrequency( &frequency ); \
	DWORD64 it = iterations; \
	DWORD64 diff = end.QuadPart - start.QuadPart; \
	_tprintf( _T("-->Finished %s in %I64d ms %I64d ticks\n"), szDescription, diff / ( frequency.QuadPart / 1000 ), ticks ); \
	_tprintf( _T("-->%I64d iterations - %I64d iterations/sec - %I64d ticks/iteration\n\n"), it, ( it * frequency.QuadPart ) / diff, ticks / it ); }

Technically you would have to subtract the duration of two __rdtsc() calls from the final result, but most likely the time taken is not measurable by QueryPerformanceCounter().
You could also use the above as a separate function:

typedef struct TIMER_DATA{ PCTSTR szDescription; DWORD64 iterations; LARGE_INTEGER start_time; DWORD64 start_ticks; } *PTIMER_DATA;
 
inline void timer_start( PTIMER_DATA data )
{
	_tprintf( _T("\n-->Starting %s...\n"), ( data->szDescription == NULL ) ? _T("") : data->szDescription );
	QueryPerformanceCounter( &(data->start_time) ); \
	data->start_ticks = __rdtsc();
}
 
inline void timer_end( PTIMER_DATA data )
{
	DWORD64 ticks = __rdtsc() - data->start_ticks;
	LARGE_INTEGER frequency, end;
	QueryPerformanceCounter( &end );
	QueryPerformanceFrequency( &frequency );
	DWORD64 diff = end.QuadPart - data->start_time.QuadPart;
	_tprintf( _T("-->Finished %s in %I64d ms %I64d ticks\n"), ( data->szDescription == NULL ) ? _T("") : data->szDescription, diff / ( frequency.QuadPart / 1000 ), ticks );
	_tprintf( _T("-->%I64d iterations - %I64d iterations/sec - %I64d ticks/iteration\n\n"), data->iterations, ( data->iterations * frequency.QuadPart ) / diff, ticks / data->iterations );
}

Usage:

TIMER_START( _T("bla") )
 
/* Your code here */
 
TIMER_END( 200000 )
 
TIMER_DATA data;
data.szDescription = _T("bla");
data.iterations = 200000;
timer_start( &data );
 
/* Your code here */
 
timer_end( &data );

Especially when optimizing tight loops it is recommended to measure the execution time of the complete loop and then divide the result by the number of iterations instead of measuring each iteration individually.

Optimized Image Matching with SSE2

August 8th, 2009 No comments

When it comes to matching a large numer of images according to certain visual criteria, the number of iterations that need to be performed are likely to increase exponentially. So developing a fast matching algorithm is essential.

Depending on your choice of algorithm, making use of your processor’s SIMD extensions can be worthwhile.

In this scenario I was benchmarking a modified Euclidean matching algorithm against a set of 85.730 images of various sizes. Due to the exponential number of iterations, there are a total of 3,674,773,585 comparisons to be performed, as every image is matched against all remaining images. No need to explain why a fast matching algorithm is essential.

A standard C implementation would take about 20,000 clock cycles per iteration on my machine.

Using SSE intrinsics, this could be optimized to take less than 800 clock cycles per iteration on a smaller number of images.  Once I tried to process more than a few thousand pictures at once, the same code would suddenly perform several times slower than before, taking around 1,000 cycles per iteration. Where was the catch? After changing allocation APIs to ensure that all data was aligned on 32 byte boundaries there was still no signifficant speed increase. Using SSE prefetching would not work either.  The reason for the slowdown was in fact that the RGB data was scattered in memory, combined with other image specific data. This would result in a very inefficient use of the CPU’s internal caching, making it necessary to pull information from slow main memory more often than necessary.

Solution: Copy all the data used in the inner loop in a contiguous memory location. Then use the same algorithm to run against the contiguous data. Performance increased by 580%. After unrolling the loop for a 4x4x3 matrix, we were down to an average of 58 ticks per iteration, comparing 85K images within just 66 seconds, that is roughly 5.4 GiB/second!

inline static DOUBLE CalcDifferenceSSE2Unrolled4x4x3( DATA* p1, DATA* p2 )
{
    __m128i* pp1 = (__m128i*)p1->rgb_small;
    __m128i* pp2 = (__m128i*)p2->rgb_small;
    const DWORD RGB_BYTES = sizeof( p1->rgb_small ) / sizeof( *p1->rgb_small );
    ATLASSERT( ( RGB_BYTES & 15 ) == 0 );   // number of elements must be divisible by 16
    ATLASSERT( ( (DWORD)pp1 & 15 ) == 0 ); // pointer must point to address multiple of 16 (cache line)
    ATLASSERT( ( (DWORD)pp2 & 15 ) == 0 ); // pointer must point to address multiple of 16
 
    __m128i a, b, c, d;
    b = _mm_sad_epu8( pp1[ 0 ], pp2[ 0 ] );
    c = _mm_sad_epu8( pp1[ 1 ], pp2[ 1 ] );
    d = _mm_sad_epu8( pp1[ 2 ], pp2[ 2 ] );
 
    a = _mm_add_epi64( b, c );
    a = _mm_add_epi64( a, d ); 
 
    b = _mm_srli_si128 ( a, 8 );
    DWORD dw = _mm_cvtsi128_si32( a ) + _mm_cvtsi128_si32( b );
    return (DOUBLE)dw / ( RGB_BYTES * 0xFF );
}

When searching for duplicate images, alternate approaches should be used to eliminate potential duplicates before entering a one by one comparison like the above. Potential preselection criteria could be aspect ratio, dimensions, or colorspace.

Update March 18, 2012: Thanks to a thoughtful reader the following may be replaced

__m128i a = _mm_setzero_si128();
a = _mm_add_epi64( a, b );
a = _mm_add_epi64( a, c );
a = _mm_add_epi64( a, d );

by a shorter and slightly faster solution:

__m128i a = _mm_add_epi64( b, c );
a = _mm_add_epi64( a, d );