28 August 2010

Aggregate full outer join in LINQ

I’ve recently been working on adding a feature to Rob Ashton’s AutoPoco project, a framework which enables dynamic creation of Plain Old CLR Object test data sets using realistic ranges of values. Rather than explicitly defining sets of objects in code, loading them from a database or deserializing them from a file the framework allows you to pre-define the make-up of the data set and then automatically generates the objects to meet your criteria.

I had a requirement that, from some sets of possible values for particular properties of a type, I  needed to create an instance for every variation of those values. Defining all the variations manually would take along time, be difficult to maintain and error prone. Dynamic generation seemed the way to go and after checking with Rob whether this was already a feature of AutoPoco and finding out it wasn’t I proceeded to have a go at implementing a GetAllVariations method.

The principal problem here is that we need to perform an operation analogous to a SQL full outer join on n sets of values. For example, give the following type:

public class Blah
{
	public int Integer { get; set; }
	public string StringA { get; set; }
	public string StringB { get; set; }
}

and the possible values:

Integer: [ 1, 2, 3 ]
StringA: [ "hello", "world" ]
StringB: [ "foo", "bar" ]

the output should be 12 objects with the following property values:

# Integer StringA StringB
1 1 hello foo
2 1 hello bar
3 1 world foo
4 1 world bar
5 2 hello foo
6 2 hello bar
7 2 world foo
8 2 world bar
9 3 hello foo
10 3 hello bar
11 3 world foo
12 3 world bar

Achieving this using LINQ

A full outer join can be performed in LINQ as follows:

var A = new List<object>
	{
		1, 
		2,
		3,
	};

var B = new List<object>
	{
		"hello",
		"world",
	};

A.Join(B, r => 0, r => 0, (a, b) => new List<object>{ a, b }).Dump();

Note: I’m using the LINQPad Dump() extension method here.

Fairly straight forward, we just set the join values to zero which forces a set to be produced where every value in A is joined to every other value in B. Ordinarily the join result selector would create a new anonymous type but I’m creating a new List here for reasons that will become obvious in a second.

We don’t know in advance how many sets of values we’re going to have, the user may want to set values for two or twenty properties. We need to be able to perform this same join for n sets, we’ll be working with a collection of these value sets. We can achieve this by combining the join with an aggregate operation e.g.

List<List<object>> sources = new List<List<object>>
{
	new List<object>
	{
		1, 
		2,
		3,
	},
	new List<object>
	{
		"hello",
		"world",
	},
	new List<object>
	{
		"foo",
		"bar",
	},
};

sources.Aggregate(
 	Enumerable.Repeat(new List<object>(), 1),
	(a, d) => a.Join(d, r => 0, r => 0, (f, g) => new List<object>(f) { g })
).Dump();

Here sources could contain any number of List objects and those List objects, containing the raw property values, can also contain any number of items. The output of the operation will be an enumeration over every variation of the values in sources, each represented as a List (in this case containing three items, one for each of the sources). We seed the Aggregate with what we expect to get out i.e. an IEnumerable of List objects. Our aggregating function is our join operation with a slight modification, our result selector creates a new List containing the result of the previous join (f) and the uses the collection initializer syntax to add one additional item (g), from the current set of values being joined on.

A relatively complex operation reduced to, effectively, a one-liner using LINQ. Snazzy.

2 comments:

Anonymous said...

This is one of the few places where I find it clearer to use the from...select LINQ style:

Try:

var ints = new[] { 1, 2, 3 };
var stra = new[] { "hello", "world" };
var strb = new[] { "foo", "bar" };

var sets =
from i in ints
from a in stra
from b in strb
select new Blah
{
Integer = i,
StringA = a,
StringB = b
};

Derek Fowler said...

The difference here is that I don't have a fixed number of sources, a fixed output type or fixed properties to which the values will be assigned. Also, I don't think it's possible to represent an aggregate in query expression syntax.