Archive

Posts Tagged ‘IP address’

Storing IP Addresses in SQL Server

August 30th, 2009 Christian Etter No comments

Efficient storage of IP addresses can save a lot of disk and memory space while at the same time increasing query performance. Also it enables you to further process data based on subnets and network ranges.

Any IPv4 address can be represented by 32 bits. The “dotted-decimal notation” – which is the most widely known format – is just one form of writing the four bytes separated by dots. In this string representation, an IP address will consist of 7 (x.x.x.x) to 15 (xxx.xxx.xxx.xxx) characters. When encoded with an 8-bit character set, the storage space required is therefore at least 15 bytes, in contrast to 4 bytes for the binary representation. In SQL Server 2008, storing a table with a single INT column requires about 13 bytes of storage per row, whereas storing an IP address in VARCHAR(15) format requires about 26 bytes per row. Storing the address in four TINYINT columns or in a single 4 byte BINARY column also results in the same storage size as INT.

Besides saving space, storing in binary format will also allow for more efficient and smaller indexing, and hence result in faster lookup times.

As an added benefit, the binary representation enables us to perform searches based on subnets and arbitrary network ranges which is very difficult and slow when using dotted-decimal notation. You can find a real-life example in this post: Geolocation of IP Addresses.

Storing IP addresses in binary (integer) format usually requires conversion from/to the dotted-decimal notation. So the key question is how to transparently handle conversion for external applications. For us, using user defined scalar functions (UDF) has proven to be the most efficient way. Since there is no unsigned 32-bit integer format in SQL Server, we are converting all addresses into a signed integer format which only requires 4 bytes of storage and allows for efficient indexing.

CREATE FUNCTION [dbo].[GetLongIP] ( @StringIP AS VARCHAR(15) ) RETURNS  INT
BEGIN 
 
DECLARE @LongIP AS INT
DECLARE @StartPos AS SMALLINT
 
SET @LongIP = 0
SET @StartPos = 1
 
IF LEN( @StringIP ) > 15 OR LEN( @StringIP ) < 7     RETURN 0 ELSE BEGIN     IF CHARINDEX( '.', @StringIP, @StartPos ) > 1
    IF ISNUMERIC( SUBSTRING( @StringIP, @StartPos, CHARINDEX( '.', @StringIP, @StartPos ) - @StartPos ) ) = 1
    IF CAST( SUBSTRING( @StringIP, @StartPos, CHARINDEX( '.', @StringIP, @StartPos ) - @StartPos ) AS INT) <= 0xFF      IF CAST( SUBSTRING( @StringIP, @StartPos, CHARINDEX( '.', @StringIP, @StartPos ) - @StartPos ) AS INT ) >= 0x00
    BEGIN
        SET @LongIP = (CAST( SUBSTRING( @StringIP, @StartPos, CHARINDEX( '.', @StringIP, @StartPos ) - @StartPos ) AS INT ) - 0x80 ) * 0x01000000
        SET @StartPos = CHARINDEX( '.', @StringIP, @StartPos ) + 1
    END
 
    IF CHARINDEX( '.', @StringIP, @StartPos ) > 1
    IF ISNUMERIC( SUBSTRING( @StringIP, @StartPos, CHARINDEX( '.', @StringIP, @StartPos ) - @StartPos ) ) = 1
    IF CAST( SUBSTRING( @StringIP, @StartPos, CHARINDEX( '.', @StringIP, @StartPos ) - @StartPos ) AS INT) <= 0xFF      IF CAST( SUBSTRING( @StringIP, @StartPos, CHARINDEX( '.', @StringIP, @StartPos ) - @StartPos ) AS INT ) >= 0x00
    BEGIN
        SET @LongIP = @LongIP + CAST( SUBSTRING( @StringIP, @StartPos, CHARINDEX( '.', @StringIP, @StartPos ) - @StartPos ) AS INT ) * 0x00010000
        SET @StartPos = CHARINDEX( '.', @StringIP, @StartPos ) + 1
    END
 
    IF CHARINDEX( '.', @StringIP, @StartPos ) > 1
    IF ISNUMERIC( SUBSTRING( @StringIP, @StartPos, CHARINDEX( '.', @StringIP, @StartPos ) - @StartPos ) ) = 1
    IF CAST( SUBSTRING( @StringIP, @StartPos, CHARINDEX( '.', @StringIP, @StartPos ) - @StartPos ) AS INT) <= 0xFF      IF CAST( SUBSTRING( @StringIP, @StartPos, CHARINDEX( '.', @StringIP, @StartPos ) - @StartPos ) AS INT ) >= 0x00
    BEGIN
        SET @LongIP = @LongIP + CAST( SUBSTRING( @StringIP, @StartPos, CHARINDEX( '.', @StringIP, @StartPos ) - @StartPos ) AS INT ) * 0x00000100
        SET @StartPos = CHARINDEX( '.', @StringIP, @StartPos ) + 1
    END
    IF ISNUMERIC( SUBSTRING( @StringIP, @StartPos, LEN( @StringIP ) - @StartPos + 1 ) ) = 1
    IF CAST( SUBSTRING( @StringIP, @StartPos, LEN( @StringIP ) - @StartPos + 1 ) AS INT) <= 0xFF      IF CAST( SUBSTRING( @StringIP, @StartPos, LEN( @StringIP ) - @StartPos + 1 ) AS INT ) >= 0x00
        SET @LongIP = @LongIP + CAST( SUBSTRING( @StringIP, @StartPos, LEN( @StringIP ) - @StartPos + 1 ) AS INT )
END
    RETURN @LongIP
END

The above looks a bit funky, but it does the job. There is also a hack using the PARSENAME system function, which should be mentioned as well. As an added benefit, the following UDF is more than 3 times faster than the above.

CREATE FUNCTION [dbo].[GetLongIP] ( @StringIP AS VARCHAR(15) ) RETURNS  INT
BEGIN
    RETURN
    (
    ( CAST( PARSENAME( @StringIP, 4 ) AS INT ) - 0x80 )    * 0x01000000 +
      CAST( PARSENAME( @StringIP, 3 ) AS INT )            * 0x00010000 +
      CAST( PARSENAME( @StringIP, 2 ) AS INT )            * 0x00000100 +
      CAST( PARSENAME( @StringIP, 1 ) AS INT )
    )
END

Converting IP addresses from signed integer format to dotted-decimal notation is straightforward:

CREATE FUNCTION [dbo].[GetStringIP] ( @IP AS INT ) RETURNS VARCHAR(15)
BEGIN
    RETURN
    (
    CAST( (@IP & 0xff000000)/0x01000000 + 0x80    AS VARCHAR ) + '.' +
    CAST( (@IP & 0x00ff0000)/0x00010000        AS VARCHAR ) + '.' +
    CAST( (@IP & 0x0000ff00)/0x00000100        AS VARCHAR ) + '.' +
    CAST( (@IP & 0x000000ff)                AS VARCHAR )
    )
END

Both UDFs are compatible with SQL Server 2000 and newer, and have been used in a production environment for almost a decade.

When using SQL Server 2000, you may also consider to use an extended stored procedure (XP) for this purpose. An XP is a dynamic link library written in native C. This dll could make use of the itoa() and atoi() APIs or use some custom parsing code. But unless you need to pull the last bit of performance out of your system, using an XP for this purpose is discouraged due to potential security implications.

With SQL Server 2005/2008 extending functionality has become much easier and less of a security concern due to the introduction of managed CLR functions that can be invoked in the same way as a UDF.  In case you are interested, here is some C# code to do the job:

private static int GetLongIP( string s )
{
    string[] ss = s.Split( new char[] { '.' }, 4 );
    return ( ( Int32.Parse( ss[ 0 ] ) - 0x80 ) << 24 ) | ( Int32.Parse( ss[ 1 ] ) << 16 ) | ( Int32.Parse( ss[ 2 ] ) << 8 ) | Int32.Parse( ss[ 3 ] );
}
 
private static string GetStringIP( int i )
{
    return String.Format( "{0}.{1}.{2}.{3}", ( i >> 24 ) + 0x80, ( i >> 16 ) & 0xFF, ( i >> 8 ) & 0xFF, i & 0xFF );
}

This will run within a SAFE (default) permission set, since it does not reference any additional name spaces or libraries. You might be tempted to save even more work by calling member methods of the IPAddress class in the System.Net namespace, but that would require your code to run with EXTERNAL_ACCESS or UNSAFE permission set.

Fast Geolocation of IP-Adresses using SQLite and SQL Server

August 24th, 2009 Christian Etter No comments

Quite frequently, network enabled applications will store information related to IP adresses in a SQL data base. In some cases it makes sense to analyze this data based on the geographic location of each IP address, such as finding out country, state and city of origin.

The purpose of this article is to show a very fast and efficient way of querying IP address related geoinformation by using standard T-SQL statements.

IP adresses are usually reserved in Blocks and therefore storing location and owner based information is best done by specifying a range with a starting and ending IP address.

In this case we are going to use the free GeoLite City IP database, which can be found at www.maxmind.com/app/ip-location.  In this database, starting and ending IP adresses are stored in their binary representation (as integers) instead of the common 3 dot  notation. This allows for more compact data storage and very fast indexed searches.

Using SQLite – Simple Queries

SQLite is a very compact and fast open source DBS. Storing all the 3.4 Million rows requires a total of 68,708,352 bytes. Since every row consists of 12 bytes of data, in theory the minimum file size would be 41,839,740 bytes, and the overhead using SQLite amounts to approximately 64.2%, which is a decent value.

To store the data, we create the  following table:

CREATE TABLE T_IP_Blocks ( start INTEGER PRIMARY KEY, end INTEGER NOT NULL, locid INTEGER NOT NULL )

In order to query for a location identifier, we have to find the row in the database that has a starting IP address less than or equal to the IP address we are looking for, and an ending address larger or equal the address in question. If both prerequisites are satisfied, we have found the right IP block and can use its location identifier for a lookup in the city data base.

Finding an IP block is straightforward:

SELECT locid FROM T_IP_Blocks WHERE @ip BETWEEN start AND end

However this is very slow, even when start is a primary key and there is a secondary key on the end column. On my machine, SQLite would perform with less than one lookup per second. Why is it so slow? In the above case, the SQL statement does not contain enough information about the structure of our data. The database assumes that for any row where the start column is larger than or equal ip needs to be checked whether or not the end column is less than or equal ip.

However, since our list of IP blocks does not contain any overlapping blocks we can be sure that only the row with the lowest start value being less than or equal ip is a potential match. All othe rows can be safely skipped. Lets see how we can speed this up:

SELECT locid FROM ( SELECT locid FROM T_IP_Blocks WHERE start <= @ip ORDER BY start DESC LIMIT 1 ) WHERE end > @ip

In this case we are operating on the primary key of the table (which is sorted in descending order) and return the first row that matches the predicate. Keep in mind that this will only work because we sort the result of the subquery before returning the first resulting row. Then we retrieve the row data and check if the address is also within the bounds of the end IP address. This simple measure will result in a lookup performance more than 12000 lookups per second.

SELECT locid, end FROM T_IP_Blocks WHERE start <= @ip ORDER BY start DESC LIMIT 1

You might be able to speed this up by another 10% by dropping the subquery and verifying the end variable in your application code.

Why are we not using a combined primary key, or an index on two columns? Due to some SQLite intrinsics, creating a primary key with more than one column will result in the file growing by an additional 8 byte per row, and in our case we have a row count of 3,486,645. Besides, speed would not increase compared to the approach described before.

Using SQLite – Table Joins

The above shows how to look up records in a very basic way. Probably your actual requirements when working with geolocation data will be more complex. One common scenario is the joining of IP address related tables with geolocation data. In the following section we are going to have a look at how this can be achieved in an efficient way.

First, we are adding another table and fill it with 1 Million random IP adresses, ranging from 0.0.0.0 to 255.255.255.255. For compatibility reasons with other databases, we save them as signed integers (to save space).

CREATE TABLE T_IP_Adresses ( ip INTEGER PRIMARY KEY )

For simplicity’s sake, the table contains only one column. Doing a join between both tables is trivial:

SELECT ip, locid FROM T_IP_Adresses LEFT JOIN T_IP_Blocks ON T_IP_Adresses.ip BETWEEN T_IP_Blocks.start AND T_IP_Blocks.end

The performance of such a query is very poor. We should rather change to to look like this:

SELECT ip, ( SELECT locid FROM ( SELECT locid, end FROM T_IP_Blocks WHERE start <= ip ORDER BY start DESC LIMIT 1 ) WHERE end > ip ) AS locid FROM T_IP_Addresses

The above statement is quite fast, despite the two subqueries. On my machine it joins about 43,000 rows per second, so the 1 Million row table will be processen within 23 seconds. Off course your results may vary, I just mention some numbers here to show that by writing proper SQL we can improve query performance by several orders of magnitude.

Using SQL Server 2008 Express

In the following section we are using a more complex RDBMS and see how we can optimize the lookup performance on this platform. Both, query and ddl syntax are quite similar to SQLite. However we have more possibilities to adjust data storage and indexing to fit our needs. First of all, the imported data occupies 74,194,944 bytes, that is a total overhead of 77% compared to the actual data size. It is also about 8% more storage space than the SQLite version. If we create a second index on the end column, another 39,182,336 bytes of storage space are needed. However this index can be discarded entirely, as we will see…

DDL:

CREATE TABLE T_IP_Blocks ( [start] INT NOT NULL, [end] INT NOT NULL, locid INT NOT NULL, CONSTRAINT [PK_T_IP_Blocks] PRIMARY KEY CLUSTERED ( [start] DESC, [end] ASC ) )
CREATE TABLE T_IP_Adresses ( ip INT PRIMARY KEY )

First off, looking up single IP adresses is a lot slower than with SQLite, due to the network and inter-process communication overhead which is the result of using an out-of-process DBMS. Although the comparison is not entirely fair in this case, the following optimized query takes about 5 times longer than the ones that run against SQLite.

SELECT [end], [locid] FROM ( SELECT TOP 1 [locid], [end] FROM T_IP_Blocks WHERE [start] <= @ip ORDER BY [start] DESC ) AS X WHERE [end] > @ip

I managed to get about 2400 lookups per second. When querying larger sets of data, the real power of  a more sophisticated DBMS will show. Again, we are joining the geolocation table against a table with one million randomly generated IP addresses.

SELECT [ip], ( SELECT [locid] FROM
    ( SELECT TOP 1 [locid], [end] FROM T_IP_Blocks WHERE [start] <= [ip] ORDER BY [start] DESC ) AS X WHERE [end] > [ip] )
    [locid] FROM T_IP_Addresses

This statement will join up to 90,000 rows per second (without result caching and the like), so 1 Million rows process in roughly 11 seconds.

Do More: Looking up Location Details

Now that we know how to efficiently look up IP addresses using the above technique, we can look into the details of making use of such lookup information. The free MaxMind database comes with a second data set that contains detailed information for each location id. The following columns are provided: locId, country, region, city, postalCode, latitude, longitude, metroCode, areaCode. Some of them are optional.

Joining all this data is straightforward in SQL:

SELECT [ip], ( SELECT [locid] FROM
    ( SELECT TOP 1 [locid], [end] FROM T_IP_Blocks WHERE [start] <= [ip] ORDER BY [start] DESC ) AS X WHERE [end] > [ip] )
    [locid], country, region, city, latitude, longitude
    FROM T_IP_Addresses LEFT JOIN T_City ON [locid] = T_City.[locid]

To be continued…

Links: