Archive

Archive for the ‘SQL’ Category

SQL Server 2005/2008 CLR support by example: crypt() UDF in C#

.NET programmability support has been one of the big novelties of SQL Server 2005. Sometimes I had the impression that CLR support was also one of the most misunderstood features of the new platform. Some people considered moving their existing pre-2005 T-SQL code to .NET in an attempt to gain more efficiency by using a ‘compiled’ programming language.

This was a bit of a common misbelief, since stored T-SQL procedures and functions are also optimized/compiled prior to execution. Performance gains by using .NET code instead of T-SQL are usually limited to highly iterative/procedural code, and not for optimizing anything which can be easily done within native SQL.

Let us have a look at a simple user defined function written in C#. This function takes a clear text string and a 2 character salt parameter and calculates a Unix crypt hash based on both.

public partial class UserDefinedFunctions
{
    [Microsoft.SqlServer.Server.SqlFunction( IsDeterministic=true )]
    public static SqlString Crypt( SqlString sPlainText, SqlChars cSalt )
    {
        if ( sPlainText.IsNull || cSalt.IsNull || cSalt.Length < 2 )
            return SqlString.Null;
        string sSalt = new string( cSalt.Value );
        sSalt = sSalt.Substring( 0, 2 );
        return new SqlString( UnixCrypt.Crypt( sSalt, sPlainText.Value ) );
    }
}

We just declare a class with a static function qualified as a SqlFunction that accepts and returns classes/structs of the System.Data.SqlTypes namespace. The UnixCrypt.Crypt method contains the actual code which I am excluding here. If you are interested in building this example, you can find a working implementation of the crypt class in C# at The Code Project.

After compiling the above code into a dll, you may load it into SQL Server and run it with the following commands:

-- CLR support is disabled by default
EXEC sp_configure 'clr enabled' , '1'
GO
reconfigure;
GO
-- load dll and create assembly
CREATE ASSEMBLY asmYourAssembly FROM 'D:\Your\Path\to\CryptUDF.dll' WITH PERMISSION_SET = SAFE
GO
-- add a function that references the CLR UDF
CREATE FUNCTION dbo.Crypt( @plain AS NVARCHAR(200), @salt AS NCHAR(2) )
    RETURNS NVARCHAR(200) AS EXTERNAL NAME asmYourAssembly.UserDefinedFunctions.Crypt
GO
SELECT dbo.Crypt( 'password', 'xy' );

Note PERMISSION_SET = SAFE, which reflects the external dependencies on framework libraries. Since we are doing all our calculations within the library, there is nothing additional we need to reference, which permits us to be running within the safest permission set. You can learn more about the CLR integration permission sets on MSDN.

When doing authentication, you could use the UDF as follows:

SELECT @UserID=[UserID] FROM dbo.tblAccounts 
    WHERE [UserName]=@Username COLLATE Latin1_General_CI_AS
    AND dbo.Crypt( @Password, [Password] )=[Password] COLLATE Latin1_General_CS_AS

Note that we are doing a case insensitive compare on the UserName column and a case sensitive compare on the crypt hash of the crypt hash column.

Why does it make sense to do authentication like this inside the database instead of calculating the password hash outside of the database and then just pass it to a WHERE Username=@Username AND Password=@Password query?

The reason lies in the nature of the crypt function: the hash it creates is based on a two character prefix or salt, which has to be used for hashing the candidate password prior to comparing it with the original password hash. Since the above approach can be completely encapsulated within a stored procedure, no hashes or salts need to be returned to the calling application for verifying the correctness of the password.

Where is this used today? While Unix and Linux user authentication is the classic usage of the crypt function, also some internet services are relying on it. E.g. the PayPal IPN feature for selling subscription accounts is transmitting passwords as crypt hashes only.

Efficient import of large data sources into SQL Server 2008

January 20th, 2010 Christian Etter No comments

When importing large amounts of data into SQL Server 2008 there are several choices. Most people working with SQL Server are aware of the BULK INSERT statement and the bcp utility. In the most simple case, we can specify a text file that contains information in CSV format and import it with a single SQL statement.
In many cases this might work well, yet there are also a lot of circumstances under which this approach is bound to fail. Especially when you are dealing with data that is delimited according to a different convention, such as escaped strings, single quotes instead of double quotes, binary data and the like.

Already simple tasks like skipping a certain column in output or parsing text in a different encoding on one column can be challenging. There is a simple way of doing the latter by using XML format files that can specify more complex import rules.
Here is an example:

<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
 <RECORD>
	 <FIELD ID="1" xsi:type="CharTerm" TERMINATOR=";" COLLATION="Latin1_General_CI_AS"/>
	 <FIELD ID="2" xsi:type="CharTerm" TERMINATOR=";" COLLATION="Latin1_General_CI_AS"/>
	 <FIELD ID="3" xsi:type="CharTerm" TERMINATOR=";" COLLATION="Latin1_General_CI_AS"/>
	 <FIELD ID="4" xsi:type="CharTerm" TERMINATOR=";" COLLATION="Latin1_General_CI_AS"/>
	 <FIELD ID="5" xsi:type="CharTerm" TERMINATOR=";" COLLATION="Latin1_General_CI_AS"/>
	 <FIELD ID="6" xsi:type="CharTerm" TERMINATOR=";" COLLATION="Latin1_General_CI_AS"/>
	 <FIELD ID="7" xsi:type="CharTerm" TERMINATOR=";" COLLATION="Latin1_General_CI_AS"/>
	 <FIELD ID="8" xsi:type="CharTerm" TERMINATOR=";" COLLATION="Latin1_General_CI_AS"/>
	 <FIELD ID="9" xsi:type="CharTerm" TERMINATOR=";" COLLATION="Latin1_General_CI_AS"/>
	 <FIELD ID="10" xsi:type="CharTerm" TERMINATOR=";" COLLATION="Latin1_General_CI_AS"/>
	 <FIELD ID="11" xsi:type="CharTerm" TERMINATOR=";" COLLATION="Latin1_General_CI_AS"/>
	 <FIELD ID="12" xsi:type="CharTerm" TERMINATOR=";" COLLATION="Latin1_General_CI_AS"/>
	 <FIELD ID="13" xsi:type="CharTerm" TERMINATOR=";" COLLATION="Latin1_General_CI_AS"/>
	 <FIELD ID="14" xsi:type="CharTerm" TERMINATOR=";" COLLATION="Latin1_General_CI_AS"/>
	 <FIELD ID="15" xsi:type="CharTerm" TERMINATOR=";" COLLATION="Latin1_General_CI_AS"/>
	 <FIELD ID="16" xsi:type="CharTerm" TERMINATOR="\n" MAX_LENGTH="2000" COLLATION="Latin1_General_CI_AS"/>
 </RECORD>
 <ROW>
  <COLUMN SOURCE="1" NAME="WarenwirtschaftsID" xsi:type="SQLINT"/>
  <COLUMN SOURCE="2" NAME="Bezeichnung" xsi:type="SQLNVARCHAR"/>
  <COLUMN SOURCE="3" NAME="Beschreibung" xsi:type="SQLNVARCHAR"/>
  <COLUMN SOURCE="4" NAME="Vorname" xsi:type="SQLNVARCHAR"/>
  <COLUMN SOURCE="5" NAME="Ort" xsi:type="SQLNVARCHAR"/>
  <COLUMN SOURCE="6" NAME="PLZ" xsi:type="SQLNVARCHAR"/>
  <COLUMN SOURCE="7" NAME="Land" xsi:type="SQLNVARCHAR"/>
  <COLUMN SOURCE="8" NAME="Strasse" xsi:type="SQLNVARCHAR"/>
  <COLUMN SOURCE="11" NAME="Telefon" xsi:type="SQLNVARCHAR"/>
  <COLUMN SOURCE="14" NAME="Fax" xsi:type="SQLNVARCHAR"/>
  <COLUMN SOURCE="15" NAME="Email" xsi:type="SQLVARYCHAR"/>
 </ROW>
</BCPFORMAT>

The above serves as a format description for the following ETL script:

INSERT INTO Kunden ( Bezeichnung, Strasse, PLZ, Ort, Land, Telefon, Fax, Email, WarenwirtschaftsID, 
    Beschreibung, HotlineVertragVorhanden, ZeiterfassungsID, GeloeschtAm, Ansprechpartner )
    SELECT Bezeichnung, Strasse, PLZ, Ort, Land, Telefon, Fax, Email, WarenwirtschaftsID, 
        Beschreibung, 0 AS HotlineVertragVorhanden, NULL AS ZeiterfassungsID, NULL AS GeloeschtAm, 
        CASE WHEN Vorname IS NOT NULL THEN Vorname + ' ' + Bezeichnung ELSE NULL END AS Ansprechpartner
        FROM  OPENROWSET(BULK  '.....\KUNDEN.csv' , 
        FORMATFILE = '.....\Kunden.xml',
        FIRSTROW = 2, CODEPAGE = '1252'
    ) AS T ;

The neat thing about the BULK INSERT command is that you can do it in pure SQL (left aside the XML file) and therefore it can be easily automated as a task/job that runs on a regular basis.

If you are looking for more transformations, you have to resort to a programming language that allows for greater flexibility, such as C#.

In traditional transactional processing you would read each row of data from your data source, transform it and then insert it into the SQL Server data base. Due to the nature of transactional processing, this can be a very inefficient and slow process.

For large amounts of data, it makes sense to use a bulk load mechanism encapsulated in the SqlBulkCopy class. The following method uses a compressed file as an input stream, decompresses it and uses SqlBulkCopy to mass insert all data.

private static void ProcessLocations( ZipInputStream oZip )
{
    using ( SqlConnection connection = new SqlConnection( SQL_SERVER_CONNECTION ) )
    {
        connection.Open();
        SqlCommand cmd = connection.CreateCommand();
 
        cmd.CommandText =
            "IF EXISTS ( SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[T_Location]') AND type in (N'U')) " +
            "TRUNCATE TABLE [dbo].[T_Location];" +
            "ELSE " +
            "CREATE TABLE T_Location ( LocationID INT PRIMARY KEY CLUSTERED, Country CHAR(2) NOT NULL, Region CHAR(2), City NVARCHAR(50), PostalCoce VARCHAR(50), Latitude REAL, Longitude REAL )";
        cmd.CommandTimeout = 60;
        cmd.ExecuteNonQuery();
 
        /// open stream with ISO-8859-1 (Latin-1) encoding, do not close streamreader
        StreamReader re = new StreamReader( oZip, Encoding.GetEncoding( 28591 ) );
 
        using ( SqlBulkCopy oBulk = new SqlBulkCopy( connection ) )
        {
            oBulk.BatchSize =
            oBulk.NotifyAfter = 50000;
            oBulk.DestinationTableName = "T_Location";
            oBulk.SqlRowsCopied += new SqlRowsCopiedEventHandler( delegate( object sender, SqlRowsCopiedEventArgs e ) { Output( String.Format( "{0:0,0} Rows imported.", e.RowsCopied ) ); } );
            /// need to add a column mapping because we are only storing 7 out of 9 columns.
            for ( int i = 0; i < 7; ++i )
                oBulk.ColumnMappings.Add( i, i );
 
            LocationCSVReader oBlocks = new LocationCSVReader( re, Output );
            oBulk.WriteToServer( oBlocks );
        }
    }
}

How are files parsed and columns extracted? As you may have guessed, the LocationCSVReader:

 
private class LocationCSVReader : IDataReader
{
    private StreamReader oReader = null;
    private object[] oCurrentValues = new object[ 9 ];
    private int iCurrentRow = 0;
    private OutputEventHandler Output = null;
 
    public delegate void OutputEventHandler( string s );
 
    public LocationCSVReader( StreamReader oReader ) : this( oReader, null ) { }
 
    public LocationCSVReader( StreamReader oReader, OutputEventHandler o )
    {
        this.oReader = oReader;
        this.Output = o;
        Output( "Skipping header\r\n" + this.oReader.ReadLine() + "\r\n" + this.oReader.ReadLine() );
        this.iCurrentRow = 2;
    }
 
    #region IDataReader Members
 
    public bool Read()
    {
        string input;
        while ( ( input = this.oReader.ReadLine() ) != null )
        {
            ++this.iCurrentRow;
            input = input.Replace( "\"", "" );
            string[] ss = input.Split( new char[] { ',' } );
            if ( ss.Length != this.oCurrentValues.Length )
            {
                if ( this.Output != null )
                    this.Output( String.Format( "Invalid number of columns on line {0:0,0} - {1}", this.iCurrentRow, String.Join( " - ", ss ) ) );
                continue;
            }
            for ( int i = 0; i < this.oCurrentValues.Length; ++i )
                this.oCurrentValues[ i ] = ss[ i ];
            this.oCurrentValues[ 0 ] = (Int32)( Int64.Parse( ss[ 0 ] ) - Int32.MaxValue );
            this.oCurrentValues[ 5 ] = Double.Parse( ss[ 5 ], System.Globalization.NumberStyles.Float, CultureInfo.InvariantCulture );
            this.oCurrentValues[ 6 ] = Double.Parse( ss[ 6 ], System.Globalization.NumberStyles.Float, CultureInfo.InvariantCulture );
            return true;
        }
        return false;
    }
 
    // ...
    #endregion
 
    #region IDisposable Members
    // ...
    #endregion
 
    #region IDataRecord Members
    // ...
    #endregion
}

One important thing to note: Consider that your code will be executed on a system with a different locale than yours. That’s frequently the point when things start to fall apart: E.g. when testing on a German server, things seem alright. Although when published on an English production server, “1,002″ suddenly becomes equivalent to “1.000″ and vice versa. Same problem exists with dates and times. Therefore whenever possible specify a specific culture or CultureInfo.InvariantCulture when parsing text values!

Storing IP Addresses in SQL Server

August 30th, 2009 Christian Etter No comments

Efficient storage of IP addresses can save a lot of disk and memory space while at the same time increasing query performance. Also it enables you to further process data based on subnets and network ranges.

Any IPv4 address can be represented by 32 bits. The “dotted-decimal notation” – which is the most widely known format – is just one form of writing the four bytes separated by dots. In this string representation, an IP address will consist of 7 (x.x.x.x) to 15 (xxx.xxx.xxx.xxx) characters. When encoded with an 8-bit character set, the storage space required is therefore at least 15 bytes, in contrast to 4 bytes for the binary representation. In SQL Server 2008, storing a table with a single INT column requires about 13 bytes of storage per row, whereas storing an IP address in VARCHAR(15) format requires about 26 bytes per row. Storing the address in four TINYINT columns or in a single 4 byte BINARY column also results in the same storage size as INT.

Besides saving space, storing in binary format will also allow for more efficient and smaller indexing, and hence result in faster lookup times.

As an added benefit, the binary representation enables us to perform searches based on subnets and arbitrary network ranges which is very difficult and slow when using dotted-decimal notation. You can find a real-life example in this post: Geolocation of IP Addresses.

Storing IP addresses in binary (integer) format usually requires conversion from/to the dotted-decimal notation. So the key question is how to transparently handle conversion for external applications. For us, using user defined scalar functions (UDF) has proven to be the most efficient way. Since there is no unsigned 32-bit integer format in SQL Server, we are converting all addresses into a signed integer format which only requires 4 bytes of storage and allows for efficient indexing.

CREATE FUNCTION [dbo].[GetLongIP] ( @StringIP AS VARCHAR(15) ) RETURNS  INT
BEGIN 
 
DECLARE @LongIP AS INT
DECLARE @StartPos AS SMALLINT
 
SET @LongIP = 0
SET @StartPos = 1
 
IF LEN( @StringIP ) > 15 OR LEN( @StringIP ) < 7     RETURN 0 ELSE BEGIN     IF CHARINDEX( '.', @StringIP, @StartPos ) > 1
    IF ISNUMERIC( SUBSTRING( @StringIP, @StartPos, CHARINDEX( '.', @StringIP, @StartPos ) - @StartPos ) ) = 1
    IF CAST( SUBSTRING( @StringIP, @StartPos, CHARINDEX( '.', @StringIP, @StartPos ) - @StartPos ) AS INT) <= 0xFF      IF CAST( SUBSTRING( @StringIP, @StartPos, CHARINDEX( '.', @StringIP, @StartPos ) - @StartPos ) AS INT ) >= 0x00
    BEGIN
        SET @LongIP = (CAST( SUBSTRING( @StringIP, @StartPos, CHARINDEX( '.', @StringIP, @StartPos ) - @StartPos ) AS INT ) - 0x80 ) * 0x01000000
        SET @StartPos = CHARINDEX( '.', @StringIP, @StartPos ) + 1
    END
 
    IF CHARINDEX( '.', @StringIP, @StartPos ) > 1
    IF ISNUMERIC( SUBSTRING( @StringIP, @StartPos, CHARINDEX( '.', @StringIP, @StartPos ) - @StartPos ) ) = 1
    IF CAST( SUBSTRING( @StringIP, @StartPos, CHARINDEX( '.', @StringIP, @StartPos ) - @StartPos ) AS INT) <= 0xFF      IF CAST( SUBSTRING( @StringIP, @StartPos, CHARINDEX( '.', @StringIP, @StartPos ) - @StartPos ) AS INT ) >= 0x00
    BEGIN
        SET @LongIP = @LongIP + CAST( SUBSTRING( @StringIP, @StartPos, CHARINDEX( '.', @StringIP, @StartPos ) - @StartPos ) AS INT ) * 0x00010000
        SET @StartPos = CHARINDEX( '.', @StringIP, @StartPos ) + 1
    END
 
    IF CHARINDEX( '.', @StringIP, @StartPos ) > 1
    IF ISNUMERIC( SUBSTRING( @StringIP, @StartPos, CHARINDEX( '.', @StringIP, @StartPos ) - @StartPos ) ) = 1
    IF CAST( SUBSTRING( @StringIP, @StartPos, CHARINDEX( '.', @StringIP, @StartPos ) - @StartPos ) AS INT) <= 0xFF      IF CAST( SUBSTRING( @StringIP, @StartPos, CHARINDEX( '.', @StringIP, @StartPos ) - @StartPos ) AS INT ) >= 0x00
    BEGIN
        SET @LongIP = @LongIP + CAST( SUBSTRING( @StringIP, @StartPos, CHARINDEX( '.', @StringIP, @StartPos ) - @StartPos ) AS INT ) * 0x00000100
        SET @StartPos = CHARINDEX( '.', @StringIP, @StartPos ) + 1
    END
    IF ISNUMERIC( SUBSTRING( @StringIP, @StartPos, LEN( @StringIP ) - @StartPos + 1 ) ) = 1
    IF CAST( SUBSTRING( @StringIP, @StartPos, LEN( @StringIP ) - @StartPos + 1 ) AS INT) <= 0xFF      IF CAST( SUBSTRING( @StringIP, @StartPos, LEN( @StringIP ) - @StartPos + 1 ) AS INT ) >= 0x00
        SET @LongIP = @LongIP + CAST( SUBSTRING( @StringIP, @StartPos, LEN( @StringIP ) - @StartPos + 1 ) AS INT )
END
    RETURN @LongIP
END

The above looks a bit funky, but it does the job. There is also a hack using the PARSENAME system function, which should be mentioned as well. As an added benefit, the following UDF is more than 3 times faster than the above.

CREATE FUNCTION [dbo].[GetLongIP] ( @StringIP AS VARCHAR(15) ) RETURNS  INT
BEGIN
    RETURN
    (
    ( CAST( PARSENAME( @StringIP, 4 ) AS INT ) - 0x80 )    * 0x01000000 +
      CAST( PARSENAME( @StringIP, 3 ) AS INT )            * 0x00010000 +
      CAST( PARSENAME( @StringIP, 2 ) AS INT )            * 0x00000100 +
      CAST( PARSENAME( @StringIP, 1 ) AS INT )
    )
END

Converting IP addresses from signed integer format to dotted-decimal notation is straightforward:

CREATE FUNCTION [dbo].[GetStringIP] ( @IP AS INT ) RETURNS VARCHAR(15)
BEGIN
    RETURN
    (
    CAST( (@IP & 0xff000000)/0x01000000 + 0x80    AS VARCHAR ) + '.' +
    CAST( (@IP & 0x00ff0000)/0x00010000        AS VARCHAR ) + '.' +
    CAST( (@IP & 0x0000ff00)/0x00000100        AS VARCHAR ) + '.' +
    CAST( (@IP & 0x000000ff)                AS VARCHAR )
    )
END

Both UDFs are compatible with SQL Server 2000 and newer, and have been used in a production environment for almost a decade.

When using SQL Server 2000, you may also consider to use an extended stored procedure (XP) for this purpose. An XP is a dynamic link library written in native C. This dll could make use of the itoa() and atoi() APIs or use some custom parsing code. But unless you need to pull the last bit of performance out of your system, using an XP for this purpose is discouraged due to potential security implications.

With SQL Server 2005/2008 extending functionality has become much easier and less of a security concern due to the introduction of managed CLR functions that can be invoked in the same way as a UDF.  In case you are interested, here is some C# code to do the job:

private static int GetLongIP( string s )
{
    string[] ss = s.Split( new char[] { '.' }, 4 );
    return ( ( Int32.Parse( ss[ 0 ] ) - 0x80 ) << 24 ) | ( Int32.Parse( ss[ 1 ] ) << 16 ) | ( Int32.Parse( ss[ 2 ] ) << 8 ) | Int32.Parse( ss[ 3 ] );
}
 
private static string GetStringIP( int i )
{
    return String.Format( "{0}.{1}.{2}.{3}", ( i >> 24 ) + 0x80, ( i >> 16 ) & 0xFF, ( i >> 8 ) & 0xFF, i & 0xFF );
}

This will run within a SAFE (default) permission set, since it does not reference any additional name spaces or libraries. You might be tempted to save even more work by calling member methods of the IPAddress class in the System.Net namespace, but that would require your code to run with EXTERNAL_ACCESS or UNSAFE permission set.

Fast Geolocation of IP-Adresses using SQLite and SQL Server

August 24th, 2009 Christian Etter No comments

Quite frequently, network enabled applications will store information related to IP adresses in a SQL data base. In some cases it makes sense to analyze this data based on the geographic location of each IP address, such as finding out country, state and city of origin.

The purpose of this article is to show a very fast and efficient way of querying IP address related geoinformation by using standard T-SQL statements.

IP adresses are usually reserved in Blocks and therefore storing location and owner based information is best done by specifying a range with a starting and ending IP address.

In this case we are going to use the free GeoLite City IP database, which can be found at www.maxmind.com/app/ip-location.  In this database, starting and ending IP adresses are stored in their binary representation (as integers) instead of the common 3 dot  notation. This allows for more compact data storage and very fast indexed searches.

Using SQLite – Simple Queries

SQLite is a very compact and fast open source DBS. Storing all the 3.4 Million rows requires a total of 68,708,352 bytes. Since every row consists of 12 bytes of data, in theory the minimum file size would be 41,839,740 bytes, and the overhead using SQLite amounts to approximately 64.2%, which is a decent value.

To store the data, we create the  following table:

CREATE TABLE T_IP_Blocks ( start INTEGER PRIMARY KEY, end INTEGER NOT NULL, locid INTEGER NOT NULL )

In order to query for a location identifier, we have to find the row in the database that has a starting IP address less than or equal to the IP address we are looking for, and an ending address larger or equal the address in question. If both prerequisites are satisfied, we have found the right IP block and can use its location identifier for a lookup in the city data base.

Finding an IP block is straightforward:

SELECT locid FROM T_IP_Blocks WHERE @ip BETWEEN start AND end

However this is very slow, even when start is a primary key and there is a secondary key on the end column. On my machine, SQLite would perform with less than one lookup per second. Why is it so slow? In the above case, the SQL statement does not contain enough information about the structure of our data. The database assumes that for any row where the start column is larger than or equal ip needs to be checked whether or not the end column is less than or equal ip.

However, since our list of IP blocks does not contain any overlapping blocks we can be sure that only the row with the lowest start value being less than or equal ip is a potential match. All othe rows can be safely skipped. Lets see how we can speed this up:

SELECT locid FROM ( SELECT locid FROM T_IP_Blocks WHERE start <= @ip ORDER BY start DESC LIMIT 1 ) WHERE end > @ip

In this case we are operating on the primary key of the table (which is sorted in descending order) and return the first row that matches the predicate. Keep in mind that this will only work because we sort the result of the subquery before returning the first resulting row. Then we retrieve the row data and check if the address is also within the bounds of the end IP address. This simple measure will result in a lookup performance more than 12000 lookups per second.

SELECT locid, end FROM T_IP_Blocks WHERE start <= @ip ORDER BY start DESC LIMIT 1

You might be able to speed this up by another 10% by dropping the subquery and verifying the end variable in your application code.

Why are we not using a combined primary key, or an index on two columns? Due to some SQLite intrinsics, creating a primary key with more than one column will result in the file growing by an additional 8 byte per row, and in our case we have a row count of 3,486,645. Besides, speed would not increase compared to the approach described before.

Using SQLite – Table Joins

The above shows how to look up records in a very basic way. Probably your actual requirements when working with geolocation data will be more complex. One common scenario is the joining of IP address related tables with geolocation data. In the following section we are going to have a look at how this can be achieved in an efficient way.

First, we are adding another table and fill it with 1 Million random IP adresses, ranging from 0.0.0.0 to 255.255.255.255. For compatibility reasons with other databases, we save them as signed integers (to save space).

CREATE TABLE T_IP_Adresses ( ip INTEGER PRIMARY KEY )

For simplicity’s sake, the table contains only one column. Doing a join between both tables is trivial:

SELECT ip, locid FROM T_IP_Adresses LEFT JOIN T_IP_Blocks ON T_IP_Adresses.ip BETWEEN T_IP_Blocks.start AND T_IP_Blocks.end

The performance of such a query is very poor. We should rather change to to look like this:

SELECT ip, ( SELECT locid FROM ( SELECT locid, end FROM T_IP_Blocks WHERE start <= ip ORDER BY start DESC LIMIT 1 ) WHERE end > ip ) AS locid FROM T_IP_Addresses

The above statement is quite fast, despite the two subqueries. On my machine it joins about 43,000 rows per second, so the 1 Million row table will be processen within 23 seconds. Off course your results may vary, I just mention some numbers here to show that by writing proper SQL we can improve query performance by several orders of magnitude.

Using SQL Server 2008 Express

In the following section we are using a more complex RDBMS and see how we can optimize the lookup performance on this platform. Both, query and ddl syntax are quite similar to SQLite. However we have more possibilities to adjust data storage and indexing to fit our needs. First of all, the imported data occupies 74,194,944 bytes, that is a total overhead of 77% compared to the actual data size. It is also about 8% more storage space than the SQLite version. If we create a second index on the end column, another 39,182,336 bytes of storage space are needed. However this index can be discarded entirely, as we will see…

DDL:

CREATE TABLE T_IP_Blocks ( [start] INT NOT NULL, [end] INT NOT NULL, locid INT NOT NULL, CONSTRAINT [PK_T_IP_Blocks] PRIMARY KEY CLUSTERED ( [start] DESC, [end] ASC ) )
CREATE TABLE T_IP_Adresses ( ip INT PRIMARY KEY )

First off, looking up single IP adresses is a lot slower than with SQLite, due to the network and inter-process communication overhead which is the result of using an out-of-process DBMS. Although the comparison is not entirely fair in this case, the following optimized query takes about 5 times longer than the ones that run against SQLite.

SELECT [end], [locid] FROM ( SELECT TOP 1 [locid], [end] FROM T_IP_Blocks WHERE [start] <= @ip ORDER BY [start] DESC ) AS X WHERE [end] > @ip

I managed to get about 2400 lookups per second. When querying larger sets of data, the real power of  a more sophisticated DBMS will show. Again, we are joining the geolocation table against a table with one million randomly generated IP addresses.

SELECT [ip], ( SELECT [locid] FROM
    ( SELECT TOP 1 [locid], [end] FROM T_IP_Blocks WHERE [start] <= [ip] ORDER BY [start] DESC ) AS X WHERE [end] > [ip] )
    [locid] FROM T_IP_Addresses

This statement will join up to 90,000 rows per second (without result caching and the like), so 1 Million rows process in roughly 11 seconds.

Do More: Looking up Location Details

Now that we know how to efficiently look up IP addresses using the above technique, we can look into the details of making use of such lookup information. The free MaxMind database comes with a second data set that contains detailed information for each location id. The following columns are provided: locId, country, region, city, postalCode, latitude, longitude, metroCode, areaCode. Some of them are optional.

Joining all this data is straightforward in SQL:

SELECT [ip], ( SELECT [locid] FROM
    ( SELECT TOP 1 [locid], [end] FROM T_IP_Blocks WHERE [start] <= [ip] ORDER BY [start] DESC ) AS X WHERE [end] > [ip] )
    [locid], country, region, city, latitude, longitude
    FROM T_IP_Addresses LEFT JOIN T_City ON [locid] = T_City.[locid]

To be continued…

Links:

Use SET NOCOUNT ON!

March 2nd, 2007 Christian Etter No comments

When writing stored procedures in SQL Server, every execution will be automatically followed by a (mostly useless) execution summary. In most cases, this should be switched off in order to decrease server round trips and network traffic. So it is a good idea to add

SET NOCOUNT ON

at the beginning of any sp.

Why it is not a good idea to name your stored procedures sp_…

February 2nd, 2006 Christian Etter No comments

Recently I was reviewing an internal database developent project which was expected to be released soon. While the overall code quality was good, one thing caught my eye: developers had named all stored procedures according to the sp_… scheme. What’s wrong with it? Even Microsoft SQL Server Devs are using this convention. Well, that is actually the catch.
While SPs with this prefix are executing correctly, they are causing an increased lookup time when called. This is due to the fact that all procedures starting with sp_ will be resolved in the master DB first, and only if not found there in the current database.