Archive

Archive for the ‘C++’ Category

OutputDebugString with Variable Arguments

May 11th, 2010 Christian Etter No comments

Just working with the new Visual Studio 2010, integrating Win32 console based code into a window app. The old code used fprintf(…) for debug output.
Since stdout/stderr is not available in a windows app, the fprintf based debug output had to be changed to print to the visual studio output window.

This can be accomplished via the universal OutputDebugString() API. The only shortcoming is, that it does not support variable arguments in a sprintf(…) form.

Here is a workaround:

#include <strsafe.h>
// ...
void MyOutputDebugString( LPCTSTR sFormat, ... )
{
    va_list argptr;      
    va_start( argptr, sFormat ); 
    TCHAR buffer[ 2000 ];
    HRESULT hr = StringCbVPrintf( buffer, sizeof( buffer ), sFormat, argptr );
    if ( STRSAFE_E_INSUFFICIENT_BUFFER == hr || S_OK == hr )
        OutputDebugString( buffer );
    else
        OutputDebugString( _T("StringCbVPrintf error.") );
}

I am using strsafe.h in this case to offer protection against buffer overruns. In case the internal buffer is not big enough to handle the output string, it will be safely truncated with an ending \0. In case you cannot make use of strsafe.h, here is an old style solution:

void MyOutputDebugString( LPCTSTR sFormat, ... )
{
    va_list argptr;      
    va_start( argptr, sFormat ); 
    TCHAR buffer[ 2000 ];
    wvsprintf( buffer, sizeof( buffer ), sFormat, argptr );
    buffer[ ( sizeof( buffer ) / sizeof( *buffer ) ) - 1 ] = '\0';
    OutputDebugString( buffer );
}

How to optimize reading of large amounts of files

February 21st, 2010 Christian Etter No comments

Conventional hard drives keep getting faster. While 10 years ago an average transfer speed of 40 MiB/sec was considered ‘fast’, today there are mainstream hard drives which deliver average transfer speeds beyond 100 MiB/sec.
So far, so good. What hasn’t increased much is the average access time. Especially when reading large numbers of relatively small files, a lot of time is spent just waiting for the hard disk to position its arm over the relevant data. No big change compared to a decade ago.

Note: In case you are developing software that exclusively operates on Solid State Disks (SSD) you may stop reading now…

The following idea has been developed for an image indexing engine, which is supposed to scan a large amount of images and extract information within the least possible amount of time. The entire indexing operation was disk-bound. For the main test case, a hard disk with roughly 100,000 image files was chosen.

In order to minimize the time spent waiting for each file being read, it would be worthwhile to read all files in the same physical order in which they are residing on the hard disk. So the key information we need is the physical location of each file on the hard disk.

For this purpose, the FSCTL_GET_RETRIEVAL_POINTERS operation can be used:

struct MyFile { PTSTR sFilename; DWORD64 Lcn; DWORD dwFragments; };
 
void GetStartCluster( PCTSTR sFilename, MyFile& f )
{
    f.sFilename = sFilename;
    f.dwFragments = f.Lcn = 0;
 
    HANDLE hFile = CreateFile( sFilename, GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL );
    if ( INVALID_HANDLE_VALUE != hFile )
    {
        const DWORD MAX_CLUSTERS = 100; // actually we only care about the first cluster number
        const DWORD RETRIEVAL_POINTERS_BUFFER_SIZE = sizeof( RETRIEVAL_POINTERS_BUFFER ) + ( 2 * ( MAX_CLUSTERS - 1 ) * sizeof( LARGE_INTEGER ) );
        BYTE output[ RETRIEVAL_POINTERS_BUFFER_SIZE ];
 
        STARTING_VCN_INPUT_BUFFER input;
        input.StartingVcn.QuadPart = 0;
 
        DWORD dwBytesReturned;
        if ( DeviceIoControl( hFile, FSCTL_GET_RETRIEVAL_POINTERS, &input, sizeof( input ), &output, sizeof( output ), &dwBytesReturned, NULL ) )
        {
            RETRIEVAL_POINTERS_BUFFER* p = (RETRIEVAL_POINTERS_BUFFER*)&output;
            if ( p->ExtentCount > 0 && p->StartingVcn.QuadPart == 0 )
            {
                f.dwFragments = p->ExtentCount;
                f.Lcn = p->Extents[ 0 ].Lcn.QuadPart;
            }
            else
                _tprintf( _T("Error: ExtentCount = %d StartingVcn = %I64d\r\n"), p->ExtentCount, p->StartingVcn.QuadPart );
        }
        else
            _tprintf( _T("Error: DeviceIoControl = 0x%08X\r\n"), GetLastError() );
        CloseHandle( hFile ); 
    }
}

Initially, the cluster number and the number of extents (fragments) is retrieved for every file. In a second step, all files are sorted according to the number of fragments and their cluster number. Then they are processed in exactly this new order which leads to a minimized amount of disk ‘thrashing’.
The actual retrieval of file locations using DeviceIoControl is relatively fast, since it only queries information from the file system and not from the file’s data location. Typically this operation is CPU bound and consumes less than 5% of all indexing time.

Taken into account the time needed to retrieve position data and sort files, this approach can save between 20 and 50% of read time, depending on average file size, fragmentation and spacial data distribution. There is no benefit from this method if the majority of files is fragmented. However, fragmentation is more likey to happen with larger files, and the average file size in this case was only 149 KiB. Generally speaking, the performance gain this method can provide increases with decreasing file sizes.

Benchmarking

  • Hard disk: 1 TB 7200 rpm SATA II, avg. access time: 17 ms
  • Windows XP Professional 32 bit booted in protected mode
  • Number of files scanned: 87,418
  • Total file size: 12.46 GiB
  • Average file size: 149 KiB
  • Time for reading files in filesystem order: 506 seconds
  • Average throughput reading files in filesystem order: 25.24 MiB/s
  • Time for reading file position data: 12 seconds
  • Time for sorting files according to their position: 54 ms
  • Time for reading files in sorted order: 347 seconds
  • Total time using new approach: 359 seconds
  • Total throughput using new approach: 35.58 MiB/s
  • Performance increase: 147 seconds or 29 %
  • Reading files in filesystem order takes 1.42 times as long as a pre-sorted read

Do You have Code Pages? And if not, how many?

November 13th, 2009 Christian Etter No comments

The code page system is a means of representing extended character sets in binary form without leaving the 8 (or 7) Bit based storage and API conventions. There are more than a hundred different code pages, and depending on your system, more or less are supported by the Win32 API.

If you are planning to handle different encodings within your application, it might be useful to know which code pages are available on your system, so you can encode/decode text based on them. Not every System supports every code page though. Just recently I found that an English Windows Mobile 5 would support CP 1252, but not the (very common) Latin-1 character set.
Especially when converting from or to some less common code pages it can be worthwhile to check whether or not there is built in support available.

To enumerate all the installed Code Pages, you can use the EnumSystemCodePages API. It accepts a callback function as a parameter and a flag that specifies whether to enumerate all installed or all supported code pages. Although the docs are not clear about the latter, it seems that the installed code pages are the pages which are ready for conversion, while the supported code pages seem to contain also pages which fail to convert any characters.

Strangely, the callback function gets called with the code page number in string representation, so you might want to do  a _ttoi() in order to get the code page number in use with other APIs.

CAtlArray<CPINFOEX> asCodePagesInstalled;
 
BOOL CALLBACK EnumCodePagesInstalledProc( LPTSTR szCodePageString )
{
    CPINFOEX cpInfoEx;
     if ( 0 != GetCPInfoEx( _ttoi( szCodePageString ), 0, &cpInfoEx ) )
         asCodePagesInstalled.Add( cpInfoEx );
    return TRUE;
}
//...
EnumSystemCodePages( EnumCodePagesInstalledProc, CP_INSTALLED );
//...

It is difficult to determine the lowest common denominator of all code pages supported on every system, and it gets even more complicated when Win 9x or mobile versions need to be supported. So in case your application will demand more than just the bare minimum conversions, you might want to resort to an external library, such as ICU or libiconv.

C++ Callbacks

October 11th, 2009 Christian Etter No comments

C – style callbacks are straight forward to declare and implement. C++ callbacks are less trivial due to the nature of the thiscall convention. Calls adhering to this convention transparently pass a pointer of the current instance to the function/method.
How the instance pointer is passed is compiler specific. GCC pushes it on the stack as a last parameter, MSVC will pass it in ECX except when calling vararg functions.
It is still possible to execute callbacks if they are stored in the same class, but when stored in an external location it is not possible to invoke a thiscall callback directly. Here is a small workaround:

typedef struct SomeStorage
{
    // ...
    VOID (MyClass::*MyMethodPtr)(); 
};
class Myclass
{
public:
    void SomeMethod()
    {
        SomeStorage ss;
        ss.MyMethodPtr = &MyClass::MyMethod;
        this->MyMethodPtr= ss.MyMethodPtr;
        (this->*MyMethodPtr)( void );
    }
private:
    void (MyClass::*MyMethodPtr)( void ); // dummy
    void MyMethod()
    {
        // ...
    }
};

The trick is to temporarily store the retrieved pointer in a member variable and dereference it from there using the ->* operator.

See:

How a ListView Can Waste Your Time

September 28th, 2009 Christian Etter No comments

When debugging a C++ app that made trivial use of a ListView, the first click inside the ListView produced a noticeable delay before giving a visual feedback such as selecting an item or popping a menu. Debug output produced the following:

Loaded 'C:\WINDOWS\system32\winmm.dll'
Loaded 'C:\WINDOWS\system32\wdmaud.drv'
Loaded 'C:\WINDOWS\system32\setupapi.dll'
Loaded 'C:\WINDOWS\system32\wintrust.dll'
Loaded 'C:\WINDOWS\system32\crypt32.dll'
Loaded 'C:\WINDOWS\system32\msasn1.dll'
Loaded 'C:\WINDOWS\system32\imagehlp.dll'
Unloaded 'C:\WINDOWS\system32\setupapi.dll'
Unloaded 'C:\WINDOWS\system32\wdmaud.drv'
Loaded 'C:\WINDOWS\system32\wdmaud.drv'
Loaded 'C:\WINDOWS\system32\setupapi.dll'
Unloaded 'C:\WINDOWS\system32\setupapi.dll'
Loaded 'C:\WINDOWS\system32\msacm32.drv'
Loaded 'C:\WINDOWS\system32\msacm32.dll'
Loaded 'C:\WINDOWS\system32\midimap.dll'

To clarify, this happened with a standard Win32 ListCtrl, and the app did not make use of any sounds or multimedia APIs at all. Additionally, all system sounds were turned off in the Windows XP control panel. What was going on?
It turned out that a ListView is supposed to emit a sound upon item selection, and due to a bug that still shows up in Windows Vista, it even loads the necessary libraries when sounds are completely turned off at a system level.

The problem is easy to solve on an individual basis, but its not trivial to prevent it from occurring on other systems the app is run. All that needs to be done is delete this (empty) key in the current user’s registry hive:

HKEY_CURRENT_USER\AppEvents\Schemes\Apps\.Default\CCSelect\.Current

It strikes me that such a bug, which is probably affecting a huge number of users, is not fixed. Whenever a ListView is clicked for the first time within an app, the DLL delayloading feature will steal a few milliseconds of the user’s time and induce a noticeable lag in UI reaction.

The following KB post describes this bug as Vista specific, although it can be reproduced on XP as well:
http://support.microsoft.com/kb/944150

Hardware Accelerated Image Resizing with GDI

August 31st, 2009 Christian Etter 4 comments

I recently had a situation where we needed a very fast RGB image resizing algorithm that would also come with a low foot print. After looking into Intel® Integrated Performance Primitives (which is not free any more) there was the idea to use Win32 GDI functions directly.

Basically what we do here is create a memory device contexts based on the current display DC, then associate it with the destination memory bitmap. This enables us to use the fast StretchDIBits() API for doing arbitrary image scaling. Then we copy the RGB array from the destination bitmap structure into the destination buffer and delete the destination bitmap.

void Resize( PBYTE pSrc, DWORD dwSrcWidth, DWORD dwSrcHeight, PBYTE pDst, DWORD dwDstWidth, DWORD dwDstHeight )
{
    HDC hDC = GetDC( NULL );
    HDC hDstDC = CreateCompatibleDC( hDC );
 
    BITMAPINFO bi_dst;
    ZeroMemory( &bi_dst, sizeof( bi_dst ) );
    bi_dst.bmiHeader.biSize = sizeof( bi_dst.bmiHeader );
    bi_dst.bmiHeader.biWidth = dwDstWidth;
    bi_dst.bmiHeader.biHeight = dwDstHeight;
    bi_dst.bmiHeader.biPlanes = 1;
    bi_dst.bmiHeader.biBitCount = 24;
    bi_dst.bmiHeader.biCompression = BI_RGB;
 
    PVOID bits;
    HBITMAP hDstBm = CreateDIBSection( NULL, &bi_dst, DIB_RGB_COLORS, &bits, NULL, 0 );
    SelectObject( hDstDC, hDstBm );
 
    BITMAPINFO bi_src;
    ZeroMemory( &bi_src, sizeof( bi_src ) );
    bi_src.bmiHeader.biSize = sizeof( bi_src.bmiHeader );
    bi_src.bmiHeader.biWidth = dwSrcWidth;
    bi_src.bmiHeader.biHeight = dwSrcHeight;
    bi_src.bmiHeader.biPlanes = 1;
    bi_src.bmiHeader.biBitCount = 24;
    bi_src.bmiHeader.biCompression = BI_RGB;    
 
    SetStretchBltMode( hDstDC, HALFTONE );
    StretchDIBits( hDstDC, 0, 0, dwDstWidth, dwDstHeight, 0, 0, dwSrcWidth, dwSrcHeight, pSrc, &bi_src, DIB_RGB_COLORS, SRCCOPY );
 
    CopyMemory( pDst, bits, dwDstWidth * dwDstHeight * 3 );
 
    DeleteObject( hDstBm );
    DeleteDC( hDstDC );
    ReleaseDC( NULL, hDC );
}

For creating a device independent bitmap, we use the CreateDIBSection() API. This will allocate an RGB (or better BGR) buffer which will be automatically freed when calling DeleteObject().

This code cannot replace any image processing library. Yet it is a working solution that simply resizes data in an RGB buffer. Depending on your graphics driver, there might be some restrictions regarding maximum image size to resize from/to and if your display DC supports less than 24 bit colors. On my integrated graphics chipset (Radeon Express) I am getting good results however. Images of all sizes can be enlarged and reduced, the only apparent limit being the amount of RAM needed to hold the data.

If the graphics driver supports hardware acceleration, there is a remarkable speedup when resizing. Evidence suggests that 2D hardware acceleration is only available up to Windows XP and most likely Windows 7, which would be just another reason for not getting Windows Vista ;-) .

There are several free and open source libraries available for resizing images. Although using them in your own code might be quite simple, many of them will come with a bigger memory and code footprint than necessary, especially when you only want to do some resizing or simple drawing. When developing commercial software, one would also have to think about licensing issues.

See also: UI rendering in Windows 7

Measuring Code Performance in C

August 11th, 2009 Christian Etter No comments

When optimizing code for speed, it is useful to have a very precise measurement of the time a certain routine or loop needs to complete. When measuring time only, the QueryPerformanceFrequency() and QueryPerformanceCounter() APIs can be employed. If we need to be even more accurate and measure the number of CPU clock cycles, the x86 asm instruction rdtsc, or the Visual C++ compiler intrinsic __rdtsc() function can be used. The return value of this function is a unsigned 64 bit value that wraps around every 194 years or so on a 3 Ghz CPU. Originally, the RDTSC instruction loads the current count of the time-stamp counter into the EDX:EAX registers for further processing. The instrinsic function is merely a wrapper that returns a 64 bit value of the combined registers.

The following two macros come in handy for wrapping a section to be examined and outputting the result to stdout or a function of your choice.

#define TIMER_START( description ) { \
	PCTSTR szDescription = description; \
	_tprintf( _T("\n-->Starting %s...\n"), description ); \
	LARGE_INTEGER start; \
	QueryPerformanceCounter( &start ); \
	DWORD64 ticks = __rdtsc();
 
#define TIMER_END( iterations ) \
	ticks = __rdtsc() - ticks; \
	LARGE_INTEGER frequency, end; \
	QueryPerformanceCounter( &end ); \
	QueryPerformanceFrequency( &frequency ); \
	DWORD64 it = iterations; \
	DWORD64 diff = end.QuadPart - start.QuadPart; \
	_tprintf( _T("-->Finished %s in %I64d ms %I64d ticks\n"), szDescription, diff / ( frequency.QuadPart / 1000 ), ticks ); \
	_tprintf( _T("-->%I64d iterations - %I64d iterations/sec - %I64d ticks/iteration\n\n"), it, ( it * frequency.QuadPart ) / diff, ticks / it ); }

Technically you would have to subtract the duration of two __rdtsc() calls from the final result, but most likely the time taken is not measurable by QueryPerformanceCounter().
You could also use the above as a separate function:

typedef struct TIMER_DATA{ PCTSTR szDescription; DWORD64 iterations; LARGE_INTEGER start_time; DWORD64 start_ticks; } *PTIMER_DATA;
 
inline void timer_start( PTIMER_DATA data )
{
	_tprintf( _T("\n-->Starting %s...\n"), ( data->szDescription == NULL ) ? _T("") : data->szDescription );
	QueryPerformanceCounter( &(data->start_time) ); \
	data->start_ticks = __rdtsc();
}
 
inline void timer_end( PTIMER_DATA data )
{
	DWORD64 ticks = __rdtsc() - data->start_ticks;
	LARGE_INTEGER frequency, end;
	QueryPerformanceCounter( &end );
	QueryPerformanceFrequency( &frequency );
	DWORD64 diff = end.QuadPart - data->start_time.QuadPart;
	_tprintf( _T("-->Finished %s in %I64d ms %I64d ticks\n"), ( data->szDescription == NULL ) ? _T("") : data->szDescription, diff / ( frequency.QuadPart / 1000 ), ticks );
	_tprintf( _T("-->%I64d iterations - %I64d iterations/sec - %I64d ticks/iteration\n\n"), data->iterations, ( data->iterations * frequency.QuadPart ) / diff, ticks / data->iterations );
}

Usage:

TIMER_START( _T("bla") )
 
/* Your code here */
 
TIMER_END( 200000 )
 
TIMER_DATA data;
data.szDescription = _T("bla");
data.iterations = 200000;
timer_start( &data );
 
/* Your code here */
 
timer_end( &data );

Especially when optimizing tight loops it is recommended to measure the execution time of the complete loop and then divide the result by the number of iterations instead of measuring each iteration individually.

Optimized Image Matching with SSE2

August 8th, 2009 Christian Etter No comments

When it comes to matching a large numer of images according to certain visual criteria, the number of iterations that need to be performed are likely to increase exponentially. So developing a fast matching algorithm is essential.

Depending on your choice of algorithm, making use of your processor’s SIMD extensions can be worthwhile.

In this scenario I was benchmarking a modified Euclidean matching algorithm against a set of 85.730 images of various sizes. Due to the exponential number of iterations, there are a total of 3,674,773,585 comparisons to be performed, as every image is matched against all remaining images. No need to explain why a fast matching algorithm is essential.

A standard C implementation would take about 20,000 clock cycles per iteration on my machine.

Using SSE intrinsics, this could be optimized to take less than 800 clock cycles per iteration on a smaller number of images.  Once I tried to process more than a few thousand pictures at once, the same code would suddenly perform several times slower than before, taking around 1,000 cycles per iteration. Where was the catch? After changing allocation APIs to ensure that all data was aligned on 32 byte boundaries there was still no signifficant speed increase. Using SSE prefetching would not work either.  The reason for the slowdown was in fact that the RGB data was scattered in memory, combined with other image specific data. This would result in a very inefficient use of the CPU’s internal caching, making it necessary to pull information from slow main memory more often than necessary.

Solution: Copy all the data used in the inner loop in a contiguous memory location. Then use the same algorithm to run against the contiguous data. Performance increased by 580%. After unrolling the loop for a 4x4x3 matrix, we were down to an average of 58 ticks per iteration, comparing 85K images within just 66 seconds, that is roughly 5.4 GiB/second!

inline static DOUBLE CalcDifferenceSSE2Unrolled4x4x3( DATA* p1, DATA* p2 )
{
    __m128i* pp1 = (__m128i*)p1->rgb_small;
    __m128i* pp2 = (__m128i*)p2->rgb_small;
    const DWORD RGB_BYTES = sizeof( p1->rgb_small ) / sizeof( *p1->rgb_small );
    ATLASSERT( ( RGB_BYTES &; 15 ) == 0 );   // number of elements must be divisible by 16
    ATLASSERT( ( (DWORD)pp1 & 15 ) == 0 ); // pointer must point to address multiple of 16 (cache line)
    ATLASSERT( ( (DWORD)pp2 & 15 ) == 0 ); // pointer must point to address multiple of 16
 
    __m128i b, c, d;
    b = _mm_sad_epu8( pp1[ 0 ], pp2[ 0 ] );
    c = _mm_sad_epu8( pp1[ 1 ], pp2[ 1 ] );
    d = _mm_sad_epu8( pp1[ 2 ], pp2[ 2 ] );
 
    __m128i a = _mm_setzero_si128();
    a = _mm_add_epi64( a, b );
    a = _mm_add_epi64( a, c );
    a = _mm_add_epi64( a, d );
 
    b = _mm_srli_si128 ( a, 8 );
    DWORD dw = _mm_cvtsi128_si32( a ) + _mm_cvtsi128_si32( b );
    return (DOUBLE)dw / ( RGB_BYTES * 0xFF );
}

When searching for duplicate images, alternate approaches should be used to eliminate potential duplicates before entering a one by one comparison like the above. Potential preselection criteria could be aspect ratio, dimensions, or colorspace.