How to optimize reading of large amounts of files

February 1st, 2010 Christian Etter No comments

Conventional hard drives keep getting faster. While 10 years ago an average transfer speed of 40 MiB/sec was considered ‘fast’, today there are mainstream hard drives which deliver average transfer speeds beyond 100 MiB/sec.
So far, so good. What hasn’t increased much is the average access time. Especially when reading large numbers of relatively small files, a lot of time is spent just waiting for the hard disk to position its arm over the relevant data. No big change compared to a decade ago.

Note: In case you are developing software that exclusively operates on Solid State Disks (SSD) you may stop reading now…

The following idea has been developed for an image indexing engine, which is supposed to scan a large amount of images and extract information within the least possible amount of time. The entire indexing operation was disk-bound. For the main test case, a hard disk with roughly 100,000 image files was chosen.

In order to minimize the time spent waiting for each file being read, it would be worthwhile to read all files in the same physical order in which they are residing on the hard disk. So the key information we need is the physical location of each file on the hard disk.

For this purpose, the FSCTL_GET_RETRIEVAL_POINTERS operation can be used:

struct MyFile { PTSTR sFilename; DWORD64 Lcn; DWORD dwFragments; };
 
void GetStartCluster( PCTSTR sFilename, MyFile& f )
{
    f.sFilename = sFilename;
    f.dwFragments = f.Lcn = 0;
 
    HANDLE hFile = CreateFile( sFilename, GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL );
    if ( INVALID_HANDLE_VALUE != hFile )
    {
        const DWORD MAX_CLUSTERS = 100; // actually we only care about the first cluster number
        const DWORD RETRIEVAL_POINTERS_BUFFER_SIZE = sizeof( RETRIEVAL_POINTERS_BUFFER ) + ( 2 * ( MAX_CLUSTERS - 1 ) * sizeof( LARGE_INTEGER ) );
        BYTE output[ RETRIEVAL_POINTERS_BUFFER_SIZE ];
 
        STARTING_VCN_INPUT_BUFFER input;
        input.StartingVcn.QuadPart = 0;
 
        DWORD dwBytesReturned;
        if ( DeviceIoControl( hFile, FSCTL_GET_RETRIEVAL_POINTERS, &input, sizeof( input ), &output, sizeof( output ), &dwBytesReturned, NULL ) )
        {
            RETRIEVAL_POINTERS_BUFFER* p = (RETRIEVAL_POINTERS_BUFFER*)&output;
            if ( p->ExtentCount > 0 && p->StartingVcn.QuadPart == 0 )
            {
                f.dwFragments = p->ExtentCount;
                f.Lcn = p->Extents[ 0 ].Lcn.QuadPart;
            }
            else
                _tprintf( _T("Error: ExtentCount = %d StartingVcn = %I64d\r\n"), p->ExtentCount, p->StartingVcn.QuadPart );
        }
        else
            _tprintf( _T("Error: DeviceIoControl = 0x%08X\r\n"), GetLastError() );
        CloseHandle( hFile ); 
    }
}

Initially, the cluster number and the number of extents (fragments) is retrieved for every file. In a second step, all files are sorted according to the number of fragments and their cluster number. Then they are processed in exactly this new order which leads to a minimized amount of disk ‘thrashing’.
The actual retrieval of file locations using DeviceIoControl is relatively fast, since it only queries information from the file system and not from the file’s data location. Typically this operation is CPU bound and consumes less than 5% of all indexing time.

Taken into account the time needed to retrieve position data and sort files, this approach can save between 20 and 50% of read time, depending on average file size, fragmentation and spacial data distribution. There is no benefit from this method if the majority of files is fragmented. However, fragmentation is more likey to happen with larger files, and the average file size in this case was only 149 KiB. Generally speaking, the performance gain this method can provide increases with decreasing file sizes.

Benchmarking

  • Hard disk: 1 TB 7200 rpm SATA II, avg. access time: 17 ms
  • Windows XP Professional 32 bit booted in protected mode
  • Number of files scanned: 87,418
  • Total file size: 12.46 GiB
  • Average file size: 149 KiB
  • Time for reading files in filesystem order: 506 seconds
  • Average throughput reading files in filesystem order: 25.24 MiB/s
  • Time for reading file position data: 12 seconds
  • Time for sorting files according to their position: 54 ms
  • Time for reading files in sorted order: 347 seconds
  • Total time using new approach: 359 seconds
  • Total throughput using new approach: 35.58 MiB/s
  • Performance increase: 147 seconds or 29 %
  • Reading files in filesystem order takes 1.42 times as long as a pre-sorted read

Efficient import of large data sources into SQL Server 2008

January 20th, 2010 Christian Etter No comments

When importing large amounts of data into SQL Server 2008 there are several choices. Most people working with SQL Server are aware of the BULK INSERT statement and the bcp utility. In the most simple case, we can specify a text file that contains information in CSV format and import it with a single SQL statement.
In many cases this might work well, yet there are also a lot of circumstances under which this approach is bound to fail. Especially when you are dealing with data that is delimited according to a different convention, such as escaped strings, single quotes instead of double quotes, binary data and the like.

Already simple tasks like skipping a certain column in output or parsing text in a different encoding on one column can be challenging. There is a simple way of doing the latter by using XML format files that can specify more complex import rules.
Here is an example:

<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
 <RECORD>
	 <FIELD ID="1" xsi:type="CharTerm" TERMINATOR=";" COLLATION="Latin1_General_CI_AS"/>
	 <FIELD ID="2" xsi:type="CharTerm" TERMINATOR=";" COLLATION="Latin1_General_CI_AS"/>
	 <FIELD ID="3" xsi:type="CharTerm" TERMINATOR=";" COLLATION="Latin1_General_CI_AS"/>
	 <FIELD ID="4" xsi:type="CharTerm" TERMINATOR=";" COLLATION="Latin1_General_CI_AS"/>
	 <FIELD ID="5" xsi:type="CharTerm" TERMINATOR=";" COLLATION="Latin1_General_CI_AS"/>
	 <FIELD ID="6" xsi:type="CharTerm" TERMINATOR=";" COLLATION="Latin1_General_CI_AS"/>
	 <FIELD ID="7" xsi:type="CharTerm" TERMINATOR=";" COLLATION="Latin1_General_CI_AS"/>
	 <FIELD ID="8" xsi:type="CharTerm" TERMINATOR=";" COLLATION="Latin1_General_CI_AS"/>
	 <FIELD ID="9" xsi:type="CharTerm" TERMINATOR=";" COLLATION="Latin1_General_CI_AS"/>
	 <FIELD ID="10" xsi:type="CharTerm" TERMINATOR=";" COLLATION="Latin1_General_CI_AS"/>
	 <FIELD ID="11" xsi:type="CharTerm" TERMINATOR=";" COLLATION="Latin1_General_CI_AS"/>
	 <FIELD ID="12" xsi:type="CharTerm" TERMINATOR=";" COLLATION="Latin1_General_CI_AS"/>
	 <FIELD ID="13" xsi:type="CharTerm" TERMINATOR=";" COLLATION="Latin1_General_CI_AS"/>
	 <FIELD ID="14" xsi:type="CharTerm" TERMINATOR=";" COLLATION="Latin1_General_CI_AS"/>
	 <FIELD ID="15" xsi:type="CharTerm" TERMINATOR=";" COLLATION="Latin1_General_CI_AS"/>
	 <FIELD ID="16" xsi:type="CharTerm" TERMINATOR="\n" MAX_LENGTH="2000" COLLATION="Latin1_General_CI_AS"/>
 </RECORD>
 <ROW>
  <COLUMN SOURCE="1" NAME="WarenwirtschaftsID" xsi:type="SQLINT"/>
  <COLUMN SOURCE="2" NAME="Bezeichnung" xsi:type="SQLNVARCHAR"/>
  <COLUMN SOURCE="3" NAME="Beschreibung" xsi:type="SQLNVARCHAR"/>
  <COLUMN SOURCE="4" NAME="Vorname" xsi:type="SQLNVARCHAR"/>
  <COLUMN SOURCE="5" NAME="Ort" xsi:type="SQLNVARCHAR"/>
  <COLUMN SOURCE="6" NAME="PLZ" xsi:type="SQLNVARCHAR"/>
  <COLUMN SOURCE="7" NAME="Land" xsi:type="SQLNVARCHAR"/>
  <COLUMN SOURCE="8" NAME="Strasse" xsi:type="SQLNVARCHAR"/>
  <COLUMN SOURCE="11" NAME="Telefon" xsi:type="SQLNVARCHAR"/>
  <COLUMN SOURCE="14" NAME="Fax" xsi:type="SQLNVARCHAR"/>
  <COLUMN SOURCE="15" NAME="Email" xsi:type="SQLVARYCHAR"/>
 </ROW>
</BCPFORMAT>

The above serves as a format description for the following ETL script:

INSERT INTO Kunden ( Bezeichnung, Strasse, PLZ, Ort, Land, Telefon, Fax, Email, WarenwirtschaftsID, 
    Beschreibung, HotlineVertragVorhanden, ZeiterfassungsID, GeloeschtAm, Ansprechpartner )
    SELECT Bezeichnung, Strasse, PLZ, Ort, Land, Telefon, Fax, Email, WarenwirtschaftsID, 
        Beschreibung, 0 AS HotlineVertragVorhanden, NULL AS ZeiterfassungsID, NULL AS GeloeschtAm, 
        CASE WHEN Vorname IS NOT NULL THEN Vorname + ' ' + Bezeichnung ELSE NULL END AS Ansprechpartner
        FROM  OPENROWSET(BULK  '.....\KUNDEN.csv' , 
        FORMATFILE = '.....\Kunden.xml',
        FIRSTROW = 2, CODEPAGE = '1252'
    ) AS T ;

The neat thing about the BULK INSERT command is that you can do it in pure SQL (left aside the XML file) and therefore it can be easily automated as a task/job that runs on a regular basis.

If you are looking for more transformations, you have to resort to a programming language that allows for greater flexibility, such as C#.

In traditional transactional processing you would read each row of data from your data source, transform it and then insert it into the SQL Server data base. Due to the nature of transactional processing, this can be a very inefficient and slow process.

For large amounts of data, it makes sense to use a bulk load mechanism encapsulated in the SqlBulkCopy class. The following method uses a compressed file as an input stream, decompresses it and uses SqlBulkCopy to mass insert all data.

private static void ProcessLocations( ZipInputStream oZip )
{
    using ( SqlConnection connection = new SqlConnection( SQL_SERVER_CONNECTION ) )
    {
        connection.Open();
        SqlCommand cmd = connection.CreateCommand();
 
        cmd.CommandText =
            "IF EXISTS ( SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[T_Location]') AND type in (N'U')) " +
            "TRUNCATE TABLE [dbo].[T_Location];" +
            "ELSE " +
            "CREATE TABLE T_Location ( LocationID INT PRIMARY KEY CLUSTERED, Country CHAR(2) NOT NULL, Region CHAR(2), City NVARCHAR(50), PostalCoce VARCHAR(50), Latitude REAL, Longitude REAL )";
        cmd.CommandTimeout = 60;
        cmd.ExecuteNonQuery();
 
        /// open stream with ISO-8859-1 (Latin-1) encoding, do not close streamreader
        StreamReader re = new StreamReader( oZip, Encoding.GetEncoding( 28591 ) );
 
        using ( SqlBulkCopy oBulk = new SqlBulkCopy( connection ) )
        {
            oBulk.BatchSize =
            oBulk.NotifyAfter = 50000;
            oBulk.DestinationTableName = "T_Location";
            oBulk.SqlRowsCopied += new SqlRowsCopiedEventHandler( delegate( object sender, SqlRowsCopiedEventArgs e ) { Output( String.Format( "{0:0,0} Rows imported.", e.RowsCopied ) ); } );
            /// need to add a column mapping because we are only storing 7 out of 9 columns.
            for ( int i = 0; i < 7; ++i )
                oBulk.ColumnMappings.Add( i, i );
 
            LocationCSVReader oBlocks = new LocationCSVReader( re, Output );
            oBulk.WriteToServer( oBlocks );
        }
    }
}

How are files parsed and columns extracted? As you may have guessed, the LocationCSVReader:

 
private class LocationCSVReader : IDataReader
{
    private StreamReader oReader = null;
    private object[] oCurrentValues = new object[ 9 ];
    private int iCurrentRow = 0;
    private OutputEventHandler Output = null;
 
    public delegate void OutputEventHandler( string s );
 
    public LocationCSVReader( StreamReader oReader ) : this( oReader, null ) { }
 
    public LocationCSVReader( StreamReader oReader, OutputEventHandler o )
    {
        this.oReader = oReader;
        this.Output = o;
        Output( "Skipping header\r\n" + this.oReader.ReadLine() + "\r\n" + this.oReader.ReadLine() );
        this.iCurrentRow = 2;
    }
 
    #region IDataReader Members
 
    public bool Read()
    {
        string input;
        while ( ( input = this.oReader.ReadLine() ) != null )
        {
            ++this.iCurrentRow;
            input = input.Replace( "\"", "" );
            string[] ss = input.Split( new char[] { ',' } );
            if ( ss.Length != this.oCurrentValues.Length )
            {
                if ( this.Output != null )
                    this.Output( String.Format( "Invalid number of columns on line {0:0,0} - {1}", this.iCurrentRow, String.Join( " - ", ss ) ) );
                continue;
            }
            for ( int i = 0; i < this.oCurrentValues.Length; ++i )
                this.oCurrentValues[ i ] = ss[ i ];
            this.oCurrentValues[ 0 ] = (Int32)( Int64.Parse( ss[ 0 ] ) - Int32.MaxValue );
            this.oCurrentValues[ 5 ] = Double.Parse( ss[ 5 ], System.Globalization.NumberStyles.Float, CultureInfo.InvariantCulture );
            this.oCurrentValues[ 6 ] = Double.Parse( ss[ 6 ], System.Globalization.NumberStyles.Float, CultureInfo.InvariantCulture );
            return true;
        }
        return false;
    }
 
    // ...
    #endregion
 
    #region IDisposable Members
    // ...
    #endregion
 
    #region IDataRecord Members
    // ...
    #endregion
}

One important thing to note: Consider that your code will be executed on a system with a different locale than yours. That’s frequently the point when things start to fall apart: E.g. when testing on a German server, things seem alright. Although when published on an English production server, “1,002″ suddenly becomes equivalent to “1.000″ and vice versa. Same problem exists with dates and times. Therefore whenever possible specify a specific culture or CultureInfo.InvariantCulture when parsing text values!

Using Linq Distinct() without IEqualityComparer

January 18th, 2010 Christian Etter No comments

Given an IEnumerable class, such as a generic list or array it is only possible to use the Distinct() method when working with simple data types. As soon as you are operating on a list of objects though, you are forced to write your own class implementing IEqualityComparer, which is a bit bothersome in many cases. It seems that Microsoft has simply forgotten to implement Lambda Expressions for Distinct()!
For those who are just looking for a simple solution, the following one-liner might be useful:

SomeObject[] array_1 = new SomeObject[] { ... }
SomeObject[] array_2 = array_1.GroupBy( x => x.SomePropertyOrMethod ).Select( x => x.First() ).ToArray();

Convenient MemoryStream Wrapper

December 18th, 2009 Christian Etter No comments

The System.IO.MemoryStream class provides a way of storing binary data of arbitrary length in memory. Frequently, the data source is another System.IO.Stream or a byte array. Unfortunately, the MemoryStream class does not contain a convenient method for reading an entire Stream object in a single call.
The following wrapper does just that. It reads the content of any Stream class passed to it’s contructor and stores it internally:

public class MyMemoryStream : MemoryStream
{
    public MyMemoryStream( Stream s ) : this() { byte[] b = new byte[ 512 ]; int i; while ( ( i = s.Read( b, 0, b.Length ) ) > 0 ) Write( b, 0, i ); }
    public void Write( byte[] b ) { Write( b, 0, b.Length ); }
 }

How to make use of this class? The following example returns a UTF-8 encoded web resource as a String:

public static string GetUri( Uri uri )
{
    HttpWebRequest oRequest = HttpWebRequest.Create( uri ) as HttpWebRequest;
    using ( HttpWebResponse oResponse = oRequest.GetResponse() as HttpWebResponse )
        using ( Stream oStream = oResponse.GetResponseStream() )
            using ( MyMemoryStream oMS = new MyMemoryStream( oStream ) )
                return Encoding.UTF8.GetString( oMS.ToArray() );
}

Do You have Code Pages? And if not, how many?

November 13th, 2009 Christian Etter No comments

The code page system is a means of representing extended character sets in binary form without leaving the 8 (or 7) Bit based storage and API conventions. There are more than a hundred different code pages, and depending on your system, more or less are supported by the Win32 API.

If you are planning to handle different encodings within your application, it might be useful to know which code pages are available on your system, so you can encode/decode text based on them. Not every System supports every code page though. Just recently I found that an English Windows Mobile 5 would support CP 1252, but not the (very common) Latin-1 character set.
Especially when converting from or to some less common code pages it can be worthwhile to check whether or not there is built in support available.

To enumerate all the installed Code Pages, you can use the EnumSystemCodePages API. It accepts a callback function as a parameter and a flag that specifies whether to enumerate all installed or all supported code pages. Although the docs are not clear about the latter, it seems that the installed code pages are the pages which are ready for conversion, while the supported code pages seem to contain also pages which fail to convert any characters.

Strangely, the callback function gets called with the code page number in string representation, so you might want to do  a _ttoi() in order to get the code page number in use with other APIs.

CAtlArray<CPINFOEX> asCodePagesInstalled;
 
BOOL CALLBACK EnumCodePagesInstalledProc( LPTSTR szCodePageString )
{
    CPINFOEX cpInfoEx;
     if ( 0 != GetCPInfoEx( _ttoi( szCodePageString ), 0, &cpInfoEx ) )
         asCodePagesInstalled.Add( cpInfoEx );
    return TRUE;
}
//...
EnumSystemCodePages( EnumCodePagesInstalledProc, CP_INSTALLED );
//...

It is difficult to determine the lowest common denominator of all code pages supported on every system, and it gets even more complicated when Win 9x or mobile versions need to be supported. So in case your application will demand more than just the bare minimum conversions, you might want to resort to an external library, such as ICU or libiconv.

Improving Backup Reliability on Optical Storage

October 24th, 2009 Christian Etter No comments

Many end-users are using CD-R or DVD-R media as a convenient storage for backup data. However the quality of writable media degrades over time; which inevitably leads to a loss of all or part of the stored data. In some cases, data loss can already occur within the first year after a media has been written. As a consequence, the data needs to be copied in regular intervals, before data loss happens.
Now the important point is how to determine the right time when a disc has to be copied. If copied too early, time and media are wasted. If copied too late, data loss has already occurred which might not be recoverable. Unlike a piece of printed paper, where the text on it is fading over time, the degradation of data on optical media is usually not linear and there are no clear signs that would signal if the media is about to fail or not.

There is an open source project called dvddisaster which has tackled this problem using an interesting approach: Before an ISO image is written on a media, it can be stuffed with extra error correction data that will be transparrently embedded along with the actual data. This gives gives extra redundancy to an extent that can be controlled by the user and depending on the free space remaining on the media.

While the idea of adding redundant error correction codes to important data is not new, the novelty lies in the concept of adding this information invisibly on a level below the visible file system. If any disk with error correction information is starting to deteriorate and some sectors become unreadable, we can still recover the real data that was stored in these sectors – provided that the redundant data is enough for recovery.
This brings the advantage that slight data loss is not critical, but can now be used as an indicator of aging media. With this approach one can actually wait with copying until some of the media’s sectors become illegible – wthout having to fear any loss of data.

dvddisaster is open source software, which makes it less likely that you will find a discontinued software if you need to recover a CD 10 years from now. As an added benefit, it is cross platform, so you should be able to use it from Windows and Linux.

C++ Callbacks

October 11th, 2009 Christian Etter No comments

C – style callbacks are straight forward to declare and implement. C++ callbacks are less trivial due to the nature of the thiscall convention. Calls adhering to this convention transparently pass a pointer of the current instance to the function/method.
How the instance pointer is passed is compiler specific. GCC pushes it on the stack as a last parameter, MSVC will pass it in ECX except when calling vararg functions.
It is still possible to execute callbacks if they are stored in the same class, but when stored in an external location it is not possible to invoke a thiscall callback directly. Here is a small workaround:

typedef struct SomeStorage
{
    // ...
    VOID (MyClass::*MyMethodPtr)(); 
};
class Myclass
{
public:
    void SomeMethod()
    {
        SomeStorage ss;
        ss.MyMethodPtr = &MyClass::MyMethod;
        this->MyMethodPtr= ss.MyMethodPtr;
        (this->*MyMethodPtr)( void );
    }
private:
    void (MyClass::*MyMethodPtr)( void ); // dummy
    void MyMethod()
    {
        // ...
    }
};

The trick is to temporarily store the retrieved pointer in a member variable and dereference it from there using the ->* operator.

See:

How a ListView Can Waste Your Time

September 28th, 2009 Christian Etter No comments

When debugging a C++ app that made trivial use of a ListView, the first click inside the ListView produced a noticeable delay before giving a visual feedback such as selecting an item or popping a menu. Debug output produced the following:

Loaded 'C:\WINDOWS\system32\winmm.dll'
Loaded 'C:\WINDOWS\system32\wdmaud.drv'
Loaded 'C:\WINDOWS\system32\setupapi.dll'
Loaded 'C:\WINDOWS\system32\wintrust.dll'
Loaded 'C:\WINDOWS\system32\crypt32.dll'
Loaded 'C:\WINDOWS\system32\msasn1.dll'
Loaded 'C:\WINDOWS\system32\imagehlp.dll'
Unloaded 'C:\WINDOWS\system32\setupapi.dll'
Unloaded 'C:\WINDOWS\system32\wdmaud.drv'
Loaded 'C:\WINDOWS\system32\wdmaud.drv'
Loaded 'C:\WINDOWS\system32\setupapi.dll'
Unloaded 'C:\WINDOWS\system32\setupapi.dll'
Loaded 'C:\WINDOWS\system32\msacm32.drv'
Loaded 'C:\WINDOWS\system32\msacm32.dll'
Loaded 'C:\WINDOWS\system32\midimap.dll'

To clarify, this happened with a standard Win32 ListCtrl, and the app did not make use of any sounds or multimedia APIs at all. Additionally, all system sounds were turned off in the Windows XP control panel. What was going on?
It turned out that a ListView is supposed to emit a sound upon item selection, and due to a bug that still shows up in Windows Vista, it even loads the necessary libraries when sounds are completely turned off at a system level.

The problem is easy to solve on an individual basis, but its not trivial to prevent it from occurring on other systems the app is run. All that needs to be done is delete this (empty) key in the current user’s registry hive:

HKEY_CURRENT_USER\AppEvents\Schemes\Apps\.Default\CCSelect\.Current

It strikes me that such a bug, which is probably affecting a huge number of users, is not fixed. Whenever a ListView is clicked for the first time within an app, the DLL delayloading feature will steal a few milliseconds of the user’s time and induce a noticeable lag in UI reaction.

The following KB post describes this bug as Vista specific, although it can be reproduced on XP as well:
http://support.microsoft.com/kb/944150