13 May 2011

Bulk upsert to SQL Server from .NET

or, “How inserting multiple records using an ORM should probably work”

Anyone familiar with .NET ORMs will know that one area where they fall short is updating or inserting multiple objects at the same time. You end up with many individual UPDATE and INSERT statements being executed against the database, which can be very inefficient and often forces developers to extend the ORM, or break out of it completely, to perform certain operations. An added complication is that, where tables use identity columns, each INSERT the ORM performs must immediately be followed by a SELECT SCOPE_IDENTITY() call to retrieve the identity value for the newly inserted row so that the corresponding CLR object can be updated.
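
To make the cost of that pattern concrete, here's a rough sketch of what it looks like in plain ADO.NET. It borrows the made-up TestUpsert table and TestDto shape used later in this post purely for illustration, and assumes connection is an open SqlConnection and items is a list of DTOs:

// One round trip per object: an INSERT immediately followed by
// SELECT SCOPE_IDENTITY() so the CLR object can be given its new identity.
// Table and column names are the made-up ones used later in this post.
using (var command = connection.CreateCommand())
{
    command.CommandText =
        "INSERT INTO TestUpsert (key_part_1, key_part_2, nullable_text) " +
        "VALUES (@key1, @key2, @text); " +
        "SELECT SCOPE_IDENTITY();";

    foreach (var item in items)
    {
        command.Parameters.Clear();
        command.Parameters.AddWithValue("@key1", item.KeyPart1);
        command.Parameters.AddWithValue("@key2", item.KeyPart2);
        command.Parameters.AddWithValue("@text", (object)item.Text ?? DBNull.Value);

        // SCOPE_IDENTITY() comes back as a decimal, hence the conversion.
        item.Ident = Convert.ToInt32(command.ExecuteScalar());
    }
}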

It’s possible to improve on this drastically by combining a few features already available in the .NET Framework and SQL Server, and I’m hoping a similar solution will appear in future releases of the major ORMs.

  • The .NET Framework’s SqlBulkCopy class, which lets you take advantage of the bulk load operations supported by SQL Server.
  • SQL Server temporary tables.
  • SQL Server 2008’s MERGE statement, which allows upsert operations to be performed on a table, and in particular its ability, via the OUTPUT clause, to return the identities of inserted rows.

The process

The main steps of the process are as follows:

  1. Using ADO.NET, create a temporary table in SQL Server whose schema mirrors your source data and whose column types match the types in the target table.
  2. Using SqlBulkCopy populate the temporary table with the source data.
  3. Execute a MERGE statement via ADO.NET which upserts the data from the temporary table into the target table, outputting the identities of inserted rows.
  4. Read the row set of inserted identities.
  5. Drop the temporary table.

So instead of n INSERT statements to insert n records, that’s four database commands in total to insert or update n records.
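
For anyone who wants to see the plumbing, here's a minimal sketch of steps 1, 2 and 5 in plain ADO.NET. The temp table name and column definitions are invented to match the made-up TestUpsert table used later in the post, connection is assumed to be an open SqlConnection (SqlBulkCopy lives in System.Data.SqlClient), and the MERGE for steps 3 and 4 is sketched a little further down:

// Step 1: create a temp table whose columns match the target table's types.
// Because #temp tables are scoped to the connection, the same connection
// must be used for the create, the bulk copy, the MERGE and the drop.
using (var create = connection.CreateCommand())
{
    create.CommandText =
        "CREATE TABLE #TestUpsert_temp (" +
        "ident int NULL, " +
        "key_part_1 nvarchar(50) NOT NULL, " +
        "key_part_2 smallint NOT NULL, " +
        "nullable_text nvarchar(50) NULL)";
    create.ExecuteNonQuery();
}

// Step 2: push all the source rows into the temp table in one bulk operation.
using (var bulkCopy = new SqlBulkCopy(connection))
{
    bulkCopy.DestinationTableName = "#TestUpsert_temp";
    bulkCopy.WriteToServer(sourceReader); // sourceReader: an IDataReader (or DataTable) over the source data
}

// Steps 3 and 4: run the MERGE with its OUTPUT clause and read back the
// inserted identities (see the sketch in the next section).

// Step 5: drop the temp table.
using (var drop = connection.CreateCommand())
{
    drop.CommandText = "DROP TABLE #TestUpsert_temp";
    drop.ExecuteNonQuery();
}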

There’s already a blog post by Kelias that covers this technique in more detail, which you can read here. The only part missing from Kelias’ post is the piece using the OUTPUT clause to retrieve the inserted identities from the MERGE statement. This is simply an additional line in the MERGE statement, e.g.

OUTPUT $action, INSERTED.$IDENTITY

and the small matter of reading those returned identities out of a SqlDataReader.

This is the crucial piece, however, as it is what allows us to tie each inserted row back to the original CLR “entity” that formed part of our source data. Updating the CLR object with this identity means subsequent changes can be saved as an UPDATE to the now-existing database row.
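
As a rough illustration of steps 3 and 4, here's a sketch of what the MERGE and the identity read-back might look like. The table, columns and key are the made-up ones from the rest of the post, and the key columns are included in the OUTPUT here so each new identity can be matched back to its source object; the SqlBulkUpsert library itself may correlate the rows differently:

// Step 3: upsert from the temp table into the target table, outputting the
// action taken and the identity of each affected row.
using (var merge = connection.CreateCommand())
{
    merge.CommandText = @"
        MERGE INTO TestUpsert AS target
        USING #TestUpsert_temp AS source
           ON target.key_part_1 = source.key_part_1
          AND target.key_part_2 = source.key_part_2
        WHEN MATCHED THEN
            UPDATE SET nullable_text = source.nullable_text
        WHEN NOT MATCHED THEN
            INSERT (key_part_1, key_part_2, nullable_text)
            VALUES (source.key_part_1, source.key_part_2, source.nullable_text)
        OUTPUT $action, INSERTED.$IDENTITY, INSERTED.key_part_1, INSERTED.key_part_2;";

    // Step 4: read the OUTPUT rows and push each new identity back onto the
    // matching CLR object (matched here by the key columns).
    using (var reader = merge.ExecuteReader())
    {
        while (reader.Read())
        {
            if ((string)reader[0] != "INSERT") continue; // only inserts produce a new identity

            var identity = Convert.ToInt32(reader[1]);
            var key1 = (string)reader[2];
            var key2 = (short)reader[3];
            // look up the source object by (key1, key2) and set its Ident property
        }
    }
}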

Performance

I did some brief testing to get rough timings of this technique versus individual INSERT calls using a parameterised ADO.NET command. With row counts from 100 to 10,000 and row sizes from roughly 1k to 10k, the upsert technique nearly always executed in less than half the time of the individual INSERT statements. For example, 1,000 rows of about 1k each took the individual INSERTs an average of just over 500ms versus the bulk upsert’s 150ms on my quite old desktop with not very much RAM.

That’s pretty cool considering the upsert could be performing either an INSERT or an UPDATE in the same number of calls, whereas factoring that into the individual-statement method would mean a lot of extra commands: trying an UPDATE, checking whether any rows were affected, falling back to an INSERT if not, and so on.

GitHub project

I decided to have a go at wrapping the upsert technique up in a library which automatically generates the SQL necessary for creating the temporary table and running the MERGE. I pushed an initial version of this SqlBulkUpsert project to GitHub, which can be found here:
https://github.com/dezfowler/SqlBulkUpsert

Usage would be something like this:

using (var connection = DatabaseHelper.CreateAndOpenConnection())
{
    var targetSchema = SqlTableSchema.LoadFromDatabase(connection, "TestUpsert", "ident");

    var columnMappings = new Dictionary<string, Func<TestDto, object>>
    {
        {"ident", d => d.Ident},
        {"key_part_1", d => d.KeyPart1},
        {"key_part_2", d => d.KeyPart2},
        {"nullable_text", d => d.Text},
        {"nullable_number", d => d.Number},
        {"nullable_datetimeoffset", d => d.Date},
    };

    Action<TestDto, int> identUpdater = (d, i) => d.Ident = i;

    var upserter = new TypedUpserter<TestDto>(targetSchema, columnMappings, identUpdater);

    var items = new List<TestDto>();

    // Populate items with TestDto instances

    upserter.Upsert(connection, items);

    // Ident property of TestDto instances updated
}

with TestDto just being a simple class like this:

public class TestDto
{
    public int? Ident { get; set; }
    public string KeyPart1 { get; set; }
    public short KeyPart2 { get; set; }
    public string Text { get; set; }
    public int Number { get; set; }
    public DateTimeOffset Date { get; set; }
}

In this TypedUpserter example we:

  1. define the schema of the target table, either in code or by loading it from the database (shown in the example)
  2. define mappings from the column names of the target table to lambdas that retrieve the appropriate property values from the TestDto class
  3. define an action to be called so that the new identity can be set on a property of the DTO
  4. instantiate the upserter and call Upsert() with a list of items and a database connection
  5. the identity properties of the TestDto instances will then have been updated using the defined action, so the CLR objects are now consistent with the database rows.

Next step

The object model could probably do with some refinement and it needs a lot more tests added, but it’s in pretty good shape, so next I’m going to look at integrating it into Mark Rendle’s Simple.Data project, which should make it, to my knowledge, the only .NET ORM doing proper bulk loading of multiple records.

6 comments:

Mark said...

Nice one. I'll be adding that into Simple.Data soon then.

Anonymous said...

Have you added bulk upsert functionality to your excellent Simple.Data project?

outofcoolnames said...

Hi Derek,
Did you get around to integrating this into Simple.Data? I've had a quick look at some of the code within Simple.Data and don't see your stuff but I guess it may have changed...
Regards,
Thomas

Mickey Perstein said...

Not to be a spoilsport, but why did you reinvent the dataset object instead of relying on it?

I think it would make more sense for me to tell you what table on the dataset you are to upsert, and you can reuse the datatable adapter and commands which have already been programmed for update, insert and delete, no?

Am I missing something ?

Derek Fowler said...

@outofcoolnames
Not put this into Simple.Data, no. Last I checked Mark had implemented an alternative but it may be worth revisiting.


@Mickey
First off the upserter doesn't do deletes, it's update and insert only.

The upserter uses IDataReader instead of something like DataSet as this makes it usable in a wider set of data access scenarios - not everyone uses DataSets but most probably have, or could easily create, an implementation of IDataReader.

I'm using a bulk copy to transport the data to the database rather than individual insert and update commands you would get from the likes of a DataAdapter as it is far more efficient.

Also, in high volume scenarios, by using an IDataReader it's possible to stream a large amount of data from a source up to your database without having that entire set of data resident in memory.

Anonymous said...

Hello!

I found your library's bulk update function quite useful. However, I detected a problem while using it.

One of our tables didn't have an identity column set, so your MERGE statement was acting up. The problem was in the OUTPUT part; the exception thrown was "Invalid column name '$IDENTITY'". As I didn't need the updated identities, I was able to simply remove the OUTPUT section.

I hope this information helps you improve your library. Thanks for sharing.