When it comes to matching a large numer of images according to certain visual criteria, the number of iterations that need to be performed are likely to increase exponentially. So developing a fast matching algorithm is essential.
Depending on your choice of algorithm, making use of your processor’s SIMD extensions can be worthwhile.
In this scenario I was benchmarking a modified Euclidean matching algorithm against a set of 85.730 images of various sizes. Due to the exponential number of iterations, there are a total of 3,674,773,585 comparisons to be performed, as every image is matched against all remaining images. No need to explain why a fast matching algorithm is essential.
A standard C implementation would take about 20,000 clock cycles per iteration on my machine.
Using SSE intrinsics, this could be optimized to take less than 800 clock cycles per iteration on a smaller number of images. Once I tried to process more than a few thousand pictures at once, the same code would suddenly perform several times slower than before, taking around 1,000 cycles per iteration. Where was the catch? After changing allocation APIs to ensure that all data was aligned on 32 byte boundaries there was still no signifficant speed increase. Using SSE prefetching would not work either. The reason for the slowdown was in fact that the RGB data was scattered in memory, combined with other image specific data. This would result in a very inefficient use of the CPU’s internal caching, making it necessary to pull information from slow main memory more often than necessary.
Solution: Copy all the data used in the inner loop in a contiguous memory location. Then use the same algorithm to run against the contiguous data. Performance increased by 580%. After unrolling the loop for a 4x4x3 matrix, we were down to an average of 58 ticks per iteration, comparing 85K images within just 66 seconds, that is roughly 5.4 GiB/second!
inline static DOUBLE CalcDifferenceSSE2Unrolled4x4x3( DATA* p1, DATA* p2 )
{
__m128i* pp1 = (__m128i*)p1->rgb_small;
__m128i* pp2 = (__m128i*)p2->rgb_small;
const DWORD RGB_BYTES = sizeof( p1->rgb_small ) / sizeof( *p1->rgb_small );
ATLASSERT( ( RGB_BYTES &; 15 ) == 0 ); // number of elements must be divisible by 16
ATLASSERT( ( (DWORD)pp1 & 15 ) == 0 ); // pointer must point to address multiple of 16 (cache line)
ATLASSERT( ( (DWORD)pp2 & 15 ) == 0 ); // pointer must point to address multiple of 16
__m128i b, c, d;
b = _mm_sad_epu8( pp1[ 0 ], pp2[ 0 ] );
c = _mm_sad_epu8( pp1[ 1 ], pp2[ 1 ] );
d = _mm_sad_epu8( pp1[ 2 ], pp2[ 2 ] );
__m128i a = _mm_setzero_si128();
a = _mm_add_epi64( a, b );
a = _mm_add_epi64( a, c );
a = _mm_add_epi64( a, d );
b = _mm_srli_si128 ( a, 8 );
DWORD dw = _mm_cvtsi128_si32( a ) + _mm_cvtsi128_si32( b );
return (DOUBLE)dw / ( RGB_BYTES * 0xFF );
}
When searching for duplicate images, alternate approaches should be used to eliminate potential duplicates before entering a one by one comparison like the above. Potential preselection criteria could be aspect ratio, dimensions, or colorspace.