Fast Geolocation of IP-Addresses using SQLite and SQL Server
Quite frequently, network enabled applications will store information related to IP addresses in a SQL data base. In some cases it makes sense to analyze this data based on the geographic location of each IP address, such as finding out country, state and city of origin.
The purpose of this article is to show a very fast and efficient way of querying IP address related geoinformation by using standard SQL/T-SQL statements.
IP addresses are usually reserved in Blocks and therefore storing location and owner based information is best done by specifying a range with a starting and ending IP address.
In this case we are going to use the free GeoLite City IP database, which can be found at www.maxmind.com/app/ip-location. In this database, starting and ending IP addresses are stored in their binary representation (as integers) instead of the common 3 dot notation. This allows for more compact data storage and very fast indexed searches.
Using SQLite – Simple Queries
SQLite is a very compact and fast open source DBS. Storing all the 3.4 Million rows requires a total of 68,708,352 bytes. Since every row consists of 12 bytes of data, in theory the minimum file size would be 41,839,740 bytes, and the overhead using SQLite amounts to approximately 64.2%, which is a decent value.
To store the data, we create the following table:
CREATE TABLE tblIP_Blocks ( START INTEGER PRIMARY KEY, END INTEGER NOT NULL, locid INTEGER NOT NULL ) |
In order to query for a location identifier, we have to find the row in the database that has a starting IP address less than or equal to the IP address we are looking for, and an ending address larger or equal the address in question. If both prerequisites are satisfied, we have found the right IP block and can use its location identifier for a lookup in the city data base.
Finding an IP block is straightforward:
SELECT locid FROM tblIP_Blocks WHERE @ip BETWEEN START AND END |
However this is very slow, even when start is a primary key and there is a secondary key on the end column. On my machine, SQLite would perform with less than one lookup per second. Why is it so slow? In the above case, the SQL statement does not contain enough information about the structure of our data. The database assumes that for any row where the start column is larger than or equal ip needs to be checked whether or not the end column is less than or equal ip.
However, since our list of IP blocks does not contain any overlapping blocks we can be sure that only the row with the lowest start value being less than or equal ip is a potential match. All othe rows can be safely skipped. Lets see how we can speed this up:
SELECT locid FROM ( SELECT locid FROM tblIP_Blocks WHERE START <= @ip ORDER BY START DESC LIMIT 1 ) WHERE END > @ip |
In this case we are operating on the primary key of the table (which is sorted in descending order) and return the first row that matches the predicate. Keep in mind that this will only work because we sort the result of the subquery before returning the first resulting row. Then we retrieve the row data and check if the address is also within the bounds of the end IP address. This simple measure will result in a lookup performance more than 12000 lookups per second.
SELECT locid, END FROM tblIP_Blocks WHERE START <= @ip ORDER BY START DESC LIMIT 1 |
You might be able to speed this up by another 10% by dropping the subquery and verifying the end variable in your application code.
Why are we not using a combined primary key, or an index on two columns? Due to some SQLite intrinsics, creating a primary key with more than one column will result in the file growing by an additional 8 byte per row, and in our case we have a row count of 3,486,645. Besides, speed would not increase compared to the approach described before.
Using SQLite – Table Joins
The above shows how to look up records in a very basic way. Probably your actual requirements when working with geolocation data will be more complex. One common scenario is the joining of IP address related tables with geolocation data. In the following section we are going to have a look at how this can be achieved in an efficient way.
First, we are adding another table and fill it with 1 Million random IP addresses, ranging from 0.0.0.0 to 255.255.255.255. For compatibility reasons with other databases, we save them as signed integers (to save space).
CREATE TABLE tblIP_Addresses ( ip INTEGER PRIMARY KEY ) |
For simplicity’s sake, the table contains only one column. Doing a join between both tables is trivial:
SELECT ip, locid FROM tblIP_Addresses LEFT JOIN tblIP_Blocks ON tblIP_Addresses.ip BETWEEN tblIP_Blocks.START AND tblIP_Blocks.END |
The performance of such a query is very poor. We should rather change to to look like this:
SELECT ip, ( SELECT locid FROM ( SELECT locid, END FROM tblIP_Blocks WHERE START <= ip ORDER BY START DESC LIMIT 1 ) WHERE END > ip ) AS locid FROM tblIP_Addresses |
The above statement is quite fast, despite the two subqueries. On my machine it joins about 43,000 rows per second, so the 1 Million row table will be processen within 23 seconds. Off course your results may vary, I just mention some numbers here to show that by writing proper SQL we can improve query performance by several orders of magnitude.
Using SQL Server 2008 Express
In the following section we are using a more complex RDBMS and see how we can optimize the lookup performance on this platform. Both, query and ddl syntax are quite similar to SQLite. However we have more possibilities to adjust data storage and indexing to fit our needs. First of all, the imported data occupies 74,194,944 bytes, that is a total overhead of 77% compared to the actual data size. It is also about 8% more storage space than the SQLite version. If we create a second index on the end column, another 39,182,336 bytes of storage space are needed. However this index can be discarded entirely, as we will see…
DDL:
CREATE TABLE tblIP_Blocks ( [START] INT NOT NULL, [END] INT NOT NULL, locid INT NOT NULL, CONSTRAINT [PK_tblIP_Blocks] PRIMARY KEY CLUSTERED ( [START] DESC, [END] ASC ) ) CREATE TABLE tblIP_Addresses ( ip INT PRIMARY KEY ) |
First off, looking up single IP addresses is a lot slower than with SQLite, due to the network and inter-process communication overhead which is the result of using an out-of-process DBMS. Although the comparison is not entirely fair in this case, the following optimized query takes about 5 times longer than the ones that run against SQLite.
SELECT [END], [locid] FROM ( SELECT TOP 1 [locid], [END] FROM tblIP_Blocks WHERE [START] <= @ip ORDER BY [START] DESC ) AS X WHERE [END] > @ip |
The above statement is also compatible with SQL Server 2000. Starting with 2005, we can also use a common table expression:
WITH CTE AS ( SELECT TOP 1 [locid], [END] FROM tblIP_Blocks WHERE [START] <= @ip ORDER BY [START] DESC ) SELECT * FROM CTE WHERE [END] > @ip |
As a matter of fact, the optimizer will prepare the following statement just as efficient as teh above:
SELECT TOP 1 [locid], [END] FROM tblIP_Blocks WHERE [START] <= @ip AND [END] > @ip |
I managed to get about 2400 lookups per second. When querying larger sets of data, the advantages of a more advanced DBMS will show. Again, we are joining the geolocation table against a table with one million randomly generated IP addresses.
SELECT [ip], ( SELECT [locid] FROM ( SELECT TOP 1 [locid], [END] FROM tblIP_Blocks WHERE [START] <= [ip] ORDER BY [START] DESC ) AS X WHERE [END] > [ip] ) [locid] FROM tblIP_Addresses |
This statement will join up to 90,000 rows per second (without result caching and the like), so 1 Million rows process in roughly 11 seconds.
Do More: Looking up Location Details
Now that we know how to efficiently look up IP addresses using the above technique, we can look into the details of making use of such lookup information. The free MaxMind database comes with a second data set that contains detailed information for each location id. The following columns are provided: locId, country, region, city, postalCode, latitude, longitude, metroCode, areaCode. Some of them are optional.
Joining all this data is straightforward in SQL:
SELECT [ip], ( SELECT [locid] FROM ( SELECT TOP 1 [locid], [END] FROM tblIP_Blocks WHERE [START] <= [ip] ORDER BY [START] DESC ) AS X WHERE [END] > [ip] ) [locid], country, region, city, latitude, longitude FROM tblIP_Addresses LEFT JOIN tblCity ON [locid] = tblCity.[locid] |
Links:
- SQLite language reference – www.sqlite.org/lang_expr.html
- T-SQL language reference – msdn.microsoft.com/en-us/library/ms189826%28SQL.90%29.aspx
- MaxMind GeoIP – www.maxmind.com/app/ip-location
- GeoNames Databases – www.geonames.org/export/ws-overview.html