13 May 2011

Bulk upsert to SQL Server from .NET

or, “How inserting multiple records using an ORM should probably work”

Anyone familiar with .NET ORMs will know that one area where they’re lacking is when it comes to updating or inserting multiple objects at the same time. You end up with many individual UPDATE and INSERT statements being executed against the database, which can be very inefficient and often results in developers having to extend the ORM, or break out of it completely, in order to perform particular operations. An added complication is that, where identities are being used in tables, each INSERT command the ORM performs must immediately be followed by a SELECT SCOPE_IDENTITY() call to retrieve the identity value for the newly inserted row so that the CLR object can be updated.
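
To make that concrete, the per-row pattern being described looks something like the sketch below, borrowing the TestUpsert table and TestDto class from the example later in this post (the SQL itself is an illustration rather than the output of any particular ORM, and connection and items are assumed to be an open SqlConnection and a collection of TestDto instances):

const string insertSql = @"
    INSERT INTO TestUpsert (key_part_1, key_part_2, nullable_text, nullable_number, nullable_datetimeoffset)
    VALUES (@key_part_1, @key_part_2, @nullable_text, @nullable_number, @nullable_datetimeoffset);
    SELECT CAST(SCOPE_IDENTITY() AS int);";

using (var command = new SqlCommand(insertSql, connection))
{
    foreach (var item in items)
    {
        // One round trip per row: the INSERT plus the SCOPE_IDENTITY() read-back
        command.Parameters.Clear();
        command.Parameters.AddWithValue("@key_part_1", item.KeyPart1);
        command.Parameters.AddWithValue("@key_part_2", item.KeyPart2);
        command.Parameters.AddWithValue("@nullable_text", (object)item.Text ?? DBNull.Value);
        command.Parameters.AddWithValue("@nullable_number", item.Number);
        command.Parameters.AddWithValue("@nullable_datetimeoffset", item.Date);

        item.Ident = (int)command.ExecuteScalar();
    }
}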

It’s possible to drastically improve on this by making use of a few features already supported in the .NET Framework and SQL Server, and I’m hoping that a similar solution will feature in future releases of the major ORMs:

  • The .NET Framework’s SqlBulkCopy class, which allows you to take advantage of the bulk load operations supported by SQL Server.
  • SQL Server temporary tables.
  • SQL Server 2008’s MERGE command, which allows upsert operations to be performed on a table, and in particular its ability, via the OUTPUT clause, to return identities for inserted rows.

The process

The main steps of the process are as follows:

  1. Using ADO.NET, create a temporary table in SQL Server whose schema mirrors your source data and whose column types match those in the target table.
  2. Using SqlBulkCopy, populate the temporary table with the source data.
  3. Execute a MERGE command via ADO.NET which upserts data from the temporary table into the target table, outputting identities.
  4. Read the row set of inserted identities.
  5. Drop the temporary table.

So, instead of n INSERT statements to insert n records, that’s four SQL commands in total to insert or update n records.
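
A rough ADO.NET sketch of steps 1, 2 and 5 might look like this (the temp table schema simply mirrors the TestUpsert example later in this post and is an assumption on my part; connection is an open SqlConnection and sourceData a DataTable matching the temp table’s columns; the MERGE for steps 3 and 4 is sketched a little further down):

// 1. Create a temp table whose columns match the types in the target table
const string createTempTable = @"
    CREATE TABLE #TestUpsertSource
    (
        ident int NULL,
        key_part_1 nvarchar(50) NOT NULL,
        key_part_2 smallint NOT NULL,
        nullable_text nvarchar(50) NULL,
        nullable_number int NULL,
        nullable_datetimeoffset datetimeoffset NULL
    )";

using (var create = new SqlCommand(createTempTable, connection))
{
    create.ExecuteNonQuery();
}

// 2. Bulk copy the source rows into the temp table over the same connection
using (var bulkCopy = new SqlBulkCopy(connection))
{
    bulkCopy.DestinationTableName = "#TestUpsertSource";
    foreach (DataColumn column in sourceData.Columns)
    {
        bulkCopy.ColumnMappings.Add(column.ColumnName, column.ColumnName);
    }
    bulkCopy.WriteToServer(sourceData);
}

// 3 & 4. MERGE from #TestUpsertSource into the target table and read the OUTPUT rows
//        (see the MERGE sketch below)

// 5. Drop the temp table (it would in any case be cleaned up when the session ends)
using (var drop = new SqlCommand("DROP TABLE #TestUpsertSource", connection))
{
    drop.ExecuteNonQuery();
}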

There’s already a blog post by Kelias on this technique that goes into more detail, which you can read here. The only part missing from Kelias’ post is the piece utilising the OUTPUT clause to retrieve the inserted identities from the MERGE command. This is simply an additional line in the MERGE statement, e.g.

OUTPUT $action, INSERTED.$IDENTITY

and the small matter of reading those returned identities out of a SqlDataReader.

This is the crucial piece, however, as it’s what allows us to tie the inserted row back to the original CLR “entity” that formed part of our source data. Updating our CLR object with this identity allows us to save subsequent changes as an UPDATE to the now-existing database row.
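
For the TestUpsert table used later in this post, the MERGE and the identity read-back might look something like the sketch below. This isn’t the SQL the library generates; in particular, including the key columns in the OUTPUT so that each identity can be matched back to its source object is an assumption on my part, and the identity column is referenced by name rather than via $IDENTITY:

const string mergeSql = @"
    MERGE INTO TestUpsert AS target
    USING #TestUpsertSource AS source
       ON target.key_part_1 = source.key_part_1
      AND target.key_part_2 = source.key_part_2
    WHEN MATCHED THEN
        UPDATE SET nullable_text = source.nullable_text,
                   nullable_number = source.nullable_number,
                   nullable_datetimeoffset = source.nullable_datetimeoffset
    WHEN NOT MATCHED THEN
        INSERT (key_part_1, key_part_2, nullable_text, nullable_number, nullable_datetimeoffset)
        VALUES (source.key_part_1, source.key_part_2, source.nullable_text, source.nullable_number, source.nullable_datetimeoffset)
    OUTPUT $action, INSERTED.ident, INSERTED.key_part_1, INSERTED.key_part_2;";

// Map of natural key -> identity, used to update the Ident property on the source objects
var identities = new Dictionary<Tuple<string, short>, int>();

using (var merge = new SqlCommand(mergeSql, connection))
using (var reader = merge.ExecuteReader())
{
    while (reader.Read())
    {
        var action = reader.GetString(0); // "INSERT" or "UPDATE"
        var ident = reader.GetInt32(1);   // identity of the affected row
        var key = Tuple.Create(reader.GetString(2), reader.GetInt16(3));
        identities[key] = ident;
    }
}

Keying the results on the natural key, rather than relying on row order, avoids depending on the order in which MERGE emits its OUTPUT rows.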

Performance

I did some brief testing to get rough timings of this technique versus individual INSERT calls using a parameterised ADO.NET command. With row counts from 100 to 10,000 and row sizes from roughly 1k to 10k, the upsert technique nearly always executed in less than half the time of the individual INSERT statements. For example, 1,000 rows of about 1k each took the individual INSERTs an average of just over 500ms versus the bulk upsert’s 150ms on my quite old desktop with not very much RAM.

That’s pretty cool considering the upsert could be performing either an INSERT or an UPDATE in the same number of calls, whereas factoring that into the individual-statements method would mean a lot of extra commands to try an UPDATE, check whether any rows were affected, and so on.

Github project

I decided to have a go at wrapping the upsert technique up in a library which would automatically generate the SQL necessary for creating the temporary table and running the MERGE. I pushed an initial version of this SqlBulkUpsert project to GitHub, which can be found here:
https://github.com/dezfowler/SqlBulkUpsert

Usage would be something like this:

using (var connection = DatabaseHelper.CreateAndOpenConnection())
{
    var targetSchema = SqlTableSchema.LoadFromDatabase(connection, "TestUpsert", "ident");

    var columnMappings = new Dictionary<string, Func<TestDto, object>>
    {
        {"ident", d => d.Ident},
        {"key_part_1", d => d.KeyPart1},
        {"key_part_2", d => d.KeyPart2},
        {"nullable_text", d => d.Text},
        {"nullable_number", d => d.Number},
        {"nullable_datetimeoffset", d => d.Date},
    };

    Action<TestDto, int> identUpdater = (d, i) => d.Ident = i;

    var upserter = new TypedUpserter<TestDto>(targetSchema, columnMappings, identUpdater);

    var items = new List<TestDto>();

    // Populate items with TestDto instances

    upserter.Upsert(connection, items);

    // Ident property of TestDto instances updated
}

with TestDto just being a simple class like this:

public class TestDto
{
    public int? Ident { get; set; }
    public string KeyPart1 { get; set; }
    public short KeyPart2 { get; set; }
    public string Text { get; set; }
    public int Number { get; set; }
    public DateTimeOffset Date { get; set; }
}

In this TypedUpserter example we:

  1. define the schema of the target table either in code or by loading it from the database (shown in the example)
  2. define mappings from column names of the target to a lambda retrieving the appropriate property value from the TestDto class
  3. define an action to be called to allow the new identity to be set on a property of the DTO
  4. instantiate the Upserter and call Upsert() with a list of items and a database connection
  5. find that the identity properties of the TestDto instances have been updated using the defined action, so the CLR objects are now consistent with the database rows.

Next step

The object model could probably do with some refinement and it needs a lot more tests, but it’s in pretty good shape, so next I’m going to look at integrating it into Mark Rendle’s Simple.Data project, which should make it, to my knowledge, the only .NET ORM doing proper bulk loading of multiple records.