
Saturday, March 24, 2012

Why Object IDs & Primary Keys Are Implementation Details

Recently I wrote a post about a project I've been working on: an abstracted data layer that can work in the context of either a relational or a document data store. In retrospect, I brushed too quickly over why I think object identifiers (and primary keys) are implementation details that should be hidden when possible. To explain what I mean, I'll use a surreal-world story.

The Situation

You are the chief software engineer at a software company. One day your product manager comes to you with an idea for a new product where users can post definitions to slang words, like a dictionary. He says people are going to love this new app because everyone has a different idea of what words mean. After talking with him to establish a ubiquitous language and identify the nouns and verbs, you crank up some coding music and hack out some model classes.
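
Something like this, say (the int IDs are the important part; the property names are my guesses):

    using System.Collections.Generic;

    public class Word
    {
        public int Id { get; set; }
        public string Text { get; set; }
        public IList<Definition> Definitions { get; set; }
    }

    public class Definition
    {
        public int Id { get; set; }
        public int WordId { get; set; }
        public string Text { get; set; }
    }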


A weekend later you finish coding the app using Int32s (int) as the identity data type for most of your models, because it's usually big enough and works well as a primary key. Honestly, you didn't really think about it because it's what you always do.

After the launch your app quickly gains popularity with the user base doubling every day. Not only that, but as more definitions get posted, more people are attracted to the site and post their own word definitions. While reviewing the exponential data growth figures, your DBA decides that Definition.Id should be changed to an Int64 (long) to accommodate the rapidly multiplying postings.

Let's stop for a minute and review what the business needs were. Your product manager wants an app where people can post words and definitions. Each word has many definitions. There's no talk in the business domain of tables and primary keys. But you included those concepts in the model anyway, because that's how you think about your data.

The DBA chose to make the ID a larger number to accommodate a larger amount of data. So now, to help optimize the database, you're forced to update all of your business logic to work nicely with the data logic.

Data Logic Was Meant to Live in the Database

The trouble with tying data logic closely to business logic is that the database isn't part of your business plan. As your application grows you'll have to tweak your database to squeeze out performance - or even swap it out for Cassandra. Databases are good at data logic because they are declarative. You can usually tune performance without affecting how the data is worked with. When you place an index, it doesn't affect how you write a SELECT or UPDATE statement, just how fast it runs.

At the same time, databases are also very procedural things. When you put business logic in stored procedures you lose the benefits of object oriented programming. It also makes unit tests complicated, slow, and fragile (which is why most people don't unit test the database). In the end, it's best to let your database optimize how data is stored and retrieved and keep your domain models clean and focused on the business needs.

The Type of the Object ID Is an Implementation Detail

Let's say you hire a new COO who lives in Silicon Valley and thinks the latest, coolest technology is always the gateway to success. With the new growth, he decides that you should rewrite the dictionary application to use MongoDB, because it's the only way your application can scale to meet the needs of the business. While evaluating Mongo, you draw out what an example word and its definitions might look like when stored as BSON:
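
Perhaps something along these lines (the word and field names are just illustrative):

    {
        "_id" : ObjectId("..."),   // illustrative; a real one is 24 hex characters
        "word" : "noob",
        "definitions" : [
            { "text" : "someone who is new at something" },
            { "text" : "a mild insult among gamers" }
        ]
    }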


In Mongo, you usually would store the Definitions inline with the Word. Now there is no need for a Definition.Id or Definition.WordId because all of this is implicit. Not only that, but Word.Id is now an ObjectId - a very different 12-byte number that includes time and sequence components. In order to update your application to work with Mongo, you'll have to update all ID references to use these ObjectIds.

The ID is an implementation concern. In a centralized SQL database, sequential integers make sense. In a distributed environment like Mongo, ObjectIds offer more advantages. Either way, the type of your ID is an implementation detail.

Encapsulation Requires That You Hide Implementation Details

Most OO programmers understand that encapsulation means that an object has or contains another object. However, some forget that a large part of encapsulation is that you should keep the implementation details of an object hidden from other objects. When the details of an object leak into other objects, the contract is broken and you lose the benefits of the OO abstraction.

Any ORM tool should give you the ability to persist protected (if not private) members of the object. If it doesn't, it's not worth using, because it forces too great a compromise in design. This is how we should have been allowed to write our objects from the start - something like this sketch:
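
    public class Word
    {
        // A sketch: the ID is typed as plain object and kept out of the
        // public contract entirely
        protected object Id { get; set; }

        public string Text { get; set; }
    }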


But Dynamic Languages Defuse The Problem

If you're in a dynamic language like Ruby or Node.js, this is less of an issue. Most of my argument hinges on the idea that your API will latch onto the object's ID type and insist that every method that uses it matches. That's really just a constraint of strict, statically typed languages. Even implicit typing will mitigate the issue some.

You may notice above that I got around the constraint by using object as the ID type. This is really what you want. It tells the compiler and the API that you really shouldn't care what the type is - it's an implementation detail. You shouldn't run into many problems as long as you keep the ID properly encapsulated within the object.


Monday, March 19, 2012

Abstract Data Layer Part 1: Object ID Types And Conventions

In February I went to the MongoDB conference in Boulder. That day was my first real taste of any sort of document-oriented database. Since then I've played around with Mongo in C#, in Node.js, and natively in the Mongo shell. I also can't help feeling overwhelmingly happy when thinking about how I can use Mongo for a project.

At Alteryx we're entering a project with some specific business needs. We require an extremely fast and scalable database, hence Mongo. But we also need to package our product for on-premise installations, which I hear requires that we also support certain SQL databases.

...I don't actually understand why enterprises insist on using SQL. I'm told that enterprise DBAs want control over everything, and they don't want to learn new products like MongoDB. To me, it seems that 3rd-party products that are bought would be exempt from DBA optimizations & other meddling. But I guess I wouldn't know what it takes to be an enterprise DBA, so I'll shut up about this now. Just my thoughts...

Since relational databases are a lot different from document-oriented databases, I decided to use NHibernate as an ORM, since its authors have already figured out a lot of the hard problems. I chose NHibernate over Entity Framework mainly because I already know NHibernate and I know it has good support across many databases. Nothing against EF in particular.

I've been working on this for a week or so. I've gotten pretty deep into the details so I thought a blog post would be a good way to step out and think about what I've done and where I'm going. The design is mostly mine (of course, I stand on the backs of giants) and really just ties together robust frameworks.

Convention Based Object Model

In order to remain agnostic toward relational/document structure, I decided that there would have to be some basic assumptions or maxims. I like the idea of convention-based frameworks, and I really think it's the best way to go about building this kind of infrastructure. Also, conventions are a great way to enforce assumptions and keep things simple.

IDs Are Platform Dependent

It's not something I had really thought about before this project. In relational databases we often use an integer as the object ID. Integers are nice because they're small, simple, and sequential. However, Mongo assumes that you want to be extremely distributed. Dense sequential IDs (like int identity) run into all kinds of race conditions and collisions in distributed environments (unless you designate a master ID-assigner, which kind of ruins the point of being distributed).

MongoDB instead uses a very long (12-byte) semi-sequential number. It's semi-sequential in that every new ID is a bigger number than the IDs generated before it, but not necessarily just +1. Regardless, it's impractical to use regular integers in Mongo, and also a little impractical to use long semi-sequential numbers in SQL.
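
To get a feel for them, the Mongo C# driver can mint these IDs client-side. A quick sketch using MongoDB.Bson:

    using System;
    using MongoDB.Bson;

    class ObjectIdDemo
    {
        static void Main()
        {
            ObjectId id = ObjectId.GenerateNewId();
            Console.WriteLine(id);              // 24 hex characters; varies per machine and moment
            Console.WriteLine(id.CreationTime); // the timestamp embedded in the leading bytes
        }
    }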

As a result, I chose to use System.Object as the ID type for all identifiers. NHibernate can be configured, after some tweaking, to treat these object IDs as integers with native auto-increment. The Mongo C# driver also supports object IDs with client-side assignment.

Ideally, I would like to write some sort of IdType struct that contains an enumeration and an object value (I'm thinking along the lines of a discriminated union here). This would make IDs more distinctive and easier to attach extension methods or additional APIs to. I'd also like to make IDs protected by default (instead of public).
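
A rough sketch of the kind of thing I mean (entirely hypothetical; the names are placeholders):

    // A discriminated-union-flavored ID: a tag plus an opaque value
    public enum IdKind
    {
        Unassigned,
        Int64,     // relational identity columns
        ObjectId   // Mongo-style distributed IDs
    }

    public struct IdType
    {
        private readonly IdKind kind;
        private readonly object value;

        private IdType(IdKind kind, object value)
        {
            this.kind = kind;
            this.value = value;
        }

        public IdKind Kind { get { return kind; } }
        public object Value { get { return value; } }

        public static IdType FromInt64(long id)
        {
            return new IdType(IdKind.Int64, id);
        }

        public static IdType FromObjectId(object id)
        {
            return new IdType(IdKind.ObjectId, id);
        }
    }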

The Domain Object

I also created a root object for all persistent objects to derive from. This is a fairly common pattern, especially in frameworks where there is a lot of generic or meta-programming.
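
Roughly like this (the exact shape here is my sketch, but the idea is a protected ID behind a common interface):

    public interface IDomainObject
    {
        object Id { get; }
    }

    public abstract class DomainObject : IDomainObject
    {
        // Protected by default: the ID is a persistence implementation detail
        protected virtual object Id { get; set; }

        // Meta-programming code can still read it through the interface
        object IDomainObject.Id
        {
            get { return Id; }
        }
    }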


I had DomainObject implement an IDomainObject interface so that in all my meta-programming I can refer to IDomainObject. That way there shouldn't ever be a corner case where we can't or shouldn't descend from DomainObject but have to anyway (separate implementation from interface).
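
Something like this, say (the specific properties are my guesses):

    public class User : DomainObject
    {
        public virtual Name Name { get; set; }
        public virtual string Email { get; set; }
    }

    public class Name
    {
        public virtual string First { get; set; }
        public virtual string Last { get; set; }
    }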


The User and Name objects are simple, as you'd expect any NHibernate object model to look. The idea is to keep them simple and keep business and data logic elsewhere.

Are You Interested?

From what I can tell, I think we're breaking ground on this project. It doesn't seem like too many people have tried to make a framework that supports both relational and document data stores. Initially I was hesitant to support both, but I think there are some excellent side effects that I will outline in upcoming posts.

The content I've written about so far is only a small fraction of what it took to get this on its feet. Someone once said that you should open source (almost) everything. So, if you (or anyone you know) would like to see the full uncensored code for this, let me know so I can start corporate conversations in that direction.

Saturday, March 10, 2012

Discriminated Unions in C# Mono Compiler

Recently I've been using F# a bit. F# is .NET's functional language (the syntax of F# 1.0 was backward compatible with OCaml, but 2.0 has diverged enough to make it more distinct). Learning F# was a huge mind-shift from the C family of languages. Of all of F#'s features - implicit typing, tail recursion, monads - many people list discriminated unions as their favorite.

Discriminated unions feel like C# enums on the surface. For instance, a union that can represent states of a light switch:
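
In F#, roughly:

    type LightSwitch =
        | Off
        | On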



This example is really no different from C# enums. Discriminated unions, however, can hold data. For instance, consider when our light switch needs to also be a dimmer:
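
Something like:

    type LightSwitch =
        | Off
        | On
        | Dimmed of int   // assuming an int percent of full brightness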



In C# we would have had to rewrite this whole program to handle the new dimmer requirement. Instead, we can just tack on a new state that holds data.

When you're deep in the F# mindset, this structure makes perfect sense. But try implementing a discriminated union in C#. There's the enum-like part, but there's also the part that holds different sizes of data. There's a great Stack Overflow answer that explains how the F# compiler handles discriminated unions internally: it requires 1 enum, 1 abstract class, and n concrete implementations of the abstract class. It's quite over-complicated to use in everyday C#.
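
To make that concrete, a hand-rolled C# encoding of the dimmer union might look roughly like this (a sketch of the pattern, not the compiler's literal output):

    public enum LightSwitchTag { Off, On, Dimmed }

    public abstract class LightSwitch
    {
        public abstract LightSwitchTag Tag { get; }

        public sealed class Off : LightSwitch
        {
            public override LightSwitchTag Tag { get { return LightSwitchTag.Off; } }
        }

        public sealed class On : LightSwitch
        {
            public override LightSwitchTag Tag { get { return LightSwitchTag.On; } }
        }

        public sealed class Dimmed : LightSwitch
        {
            private readonly int brightness;

            public Dimmed(int brightness) { this.brightness = brightness; }

            public override LightSwitchTag Tag { get { return LightSwitchTag.Dimmed; } }
            public int Brightness { get { return brightness; } }
        }
    }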

Nevertheless, I really want to use discriminated unions in my C# code because of how easy they make state machines & workflows, and I've been brainstorming how to do it. There are several implementations as C# 3.5 libraries, but they're cumbersome to use. I've been looking at the source code for the Mono C# compiler, and I think I want to go the route of forking the compiler for a proof-of-concept.

I'm debating what the syntax should be. I figure that the change would be easier if I re-used existing constructs and just tweaked them to work with the new concepts.
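
One possibility is to reuse the enum construct and let cases carry data (hypothetical syntax, not valid C# today):

    public enum LightSwitch
    {
        Off,
        On,
        Dimmed(int brightness)
    }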



I've been debating whether the Dimmed case should retain the regular case syntax or get a lambda-like syntax:
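
Roughly, the two options (both hypothetical; SetLevel is just a stand-in method):

    // Regular case syntax:
    switch (light)
    {
        case LightSwitch.Dimmed(int brightness):
            SetLevel(brightness);
            break;
    }

    // Lambda-like syntax:
    switch (light)
    {
        case LightSwitch.Dimmed => (brightness) { SetLevel(brightness); }
    }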



I'm leaning toward the lambda syntax because of how C# usually handles variable scope. I've only just cloned the Mono repository and started reading the design documents to orient myself with the compiler. This could be a huge project, so I'm not sure how far I'll actually get. But it's a very interesting idea that I want to try hashing out.