Tim Kellogg

Sunday, April 22, 2012

Blog is moving

Thanks to everyone who reads my blog. Recently I made the move and switched to using Github pages for my blog. I also bought a domain name that sounds kind of cool. Now I have a place where I can add other content besides blog posts.

I made the switch because now I can write blog posts in markdown on Vim, and offline so I can blog while riding the bus or on an airplane. I also found that writing code in Blogger was a huge pain, but jekyll provides inline code highlighting for a large number of syntaxes. After considering it for a while, I realized the benefits of switching far outweighed the features blogger offers.

I've already written some pretty interesting posts. I have a lot more planned, so please keep with me. If you follow this blog via the atom feed, I also have a new atom feed, so please update your links!

Saturday, March 24, 2012

Why Object IDs & Primary Keys Are Implementation Details

Recently I wrote a post about a project that I was working on with an abstracted data layer concept that can work in the context of either relational or document data store. In retrospect I think I brushed too quickly over the details of why I think object identifiers (and primary keys) are a part of the implementation that should be hidden, when possible. To explain what I mean I'll use a surreal-world story.

The Situation

You are the chief software engineer at a software company. One day your product manager comes to you with a list of ideas for a new product where users can post definitions to slang words, like a dictionary. He says people are going to love this new app because everyone has a different idea of what words mean. After talking with him to establish ubiquitous language and identify nouns and verbs, you crank up some coding music and hack out some model classes.

A weekend later you finish coding the app using Int32s (int) as the identity data type for most of your models because it's usually big enough and works well as a primary key. Honestly, you didn't really think about it because its what you always do.

After the launch your app quickly gains popularity with the user base doubling every day. Not only that, but as more definitions get posted, more people are attracted to the site and post their own word definitions. While reviewing the exponential data growth figures, your DBA decides that Definition.Id should be changed to an Int64 (long) to accommodate the rapidly multiplying postings.

Let's stop for a minute and review what the business needs were. Your product manager wants an app where people can post words and definitions. Each word has many definitions. There's no talk in the business domain of tables and primary keys. But you included those concepts in the model anyway, because that's how you think about your data.

The DBA chose to make the ID into a larger number to accommodate a larger amount of data. So now to help optimize the database, you are forced to update all your business logic to work nicely with the data logic.

Data Logic Was Meant to Live in the Database

The trouble with tying data logic closely to business logic is that the database isn't part of your business plan. As your application grows you'll have to tweak your database to squeeze out performance - or even swap it out for Cassandra. Databases are good at data logic because they are declarative. You can usually tune performance without affecting how the data is worked with. When you place an index, it doesn't affect how you write a SELECT or UPDATE statement, just how fast it runs.

At the same time, databases are also very procedural things. When you put business logic in stored procedures you lose the benefits of object oriented programming. It also makes unit tests complicated, slow, and fragile (which is why most people don't unit test the database). In the end, it's best to let your database optimize how data is stored and retrieved and keep your domain models clean and focused on the business needs.

The Type of the Object ID Is an Implementation Detail

Lets say you hire a new COO that lives in Silicon Valley and thinks the latest coolest technology is always the gateway to success. With the new growth he decides that you should rewrite the dictionary application to use MongoDB because it's the only way your application can scale to meet the needs of the business. While evaluating Mongo you draw out what an example word and definitions might look like when stored as BSON:

In Mongo, you usually would store the Definitions inline with the Word. Now there is no need for a Definition.Id or Definition.WordId because all of this is implicit. Not only that, but Word.Id is now an ObjectId - a very different 12 byte number that includes time and sequence components. In order to update your application to work with Mongo, you'll have to update all references IDs to use these ObjectIds.

The ID is an implementation concern. In a centralized SQL database, sequential integers make sense. In a distributed environment like Mongo, ObjectIDs offer more advantages. Either way, the type of your ID is an implementation detail.

Encapsulation Requires That You Hide Implementation Details

Most OO programmers understand that encapsulation means that an object has or contains another object. However, some forget that a large part of encapsulation is that you should keep the implementation details of an object hidden from other objects. When the details of an object leak into other objects, the contract is broken and you lose the benefits of the OO abstraction.

Any ORM tool should give you the ability to select protected (if not private) members of the object to be persisted. If it doesn't, it's not using because it'll cause too great of a compromise in design. This is how we should have been allowed to write our objects from the start:

But Dynamic Languages Diffuse The Problem

If you're in a dynamic language like Ruby or Node.js this is less of an issue. Most of my argument hinges on the idea that your API will latch onto the object's ID and insist that all methods that use it will match. This is really just a constraint of strict statically typed languages. Even implicit typing will mitigate the issue some.

You can notice above that I got around the constraint by using object as the ID type. This is really what you want. It's telling the compiler and API that you really, shouldn't care what the type is - it's an implementation detail. You shouldn't run into many problems as long as you are keeping the ID properly encapsulated within the object.

Monday, March 19, 2012

Abstract Data Layer Part 1: Object ID Types And Conventions

In February I went to the MongoDB conference in Boulder. That day was my first real taste of any sort of document oriented database. Since then I've played around with Mongo in C#, Node.JS and natively in the Mongo shell. Since then, I also can't help feeling overwhelmingly happy when thinking about how I can use Mongo for a project.

At Alteryx we're entering a project where we require some specific business needs. We require an extremely fast and scalable database, hence Mongo. But we also need to package our product for on-premise installations, which I hear requires that we also support certain SQL databases.

...I don't actually understand why enterprises insist on using SQL. I'm told that enterprise DBA's want control over everything, and they don't want to learn new products like MongoDB. To me, it seems that 3rd products that are bought would be exempt from DBA optimizations & other meddling. But I guess I wouldn't know what it takes to be an enterprise DBA, so I'll shut up about this now. Just my thoughts...

Since relational databases are a lot different than document oriented databases I decided to use NHibernate as an ORM since they've already figured out a lot of the hard problems. I chose NHibernate over Entity Framework mainly because I already know NHibernate, and I know that it has good support across many databases. Nothing against EF in particular.

I've been working on this for a week or so. I've gotten pretty deep into the details so I thought a blog post would be a good way to step out and think about what I've done and where I'm going. The design is mostly mine (of course, I stand on the backs of giants) and really just ties together robust frameworks.

Convention Based Object Model

In order to remain agnostic toward relational/document structure, I decided that there would have to be some basic assumptions or maxims. I like the idea of convention-based frameworks and I really think its the best way to go about building this kind of infrastructure. Also, conventions are a great way to enforce assumptions and keep things simple.

IDs Are Platform Dependent

It's not something I really thought about before this. In relational databases we'll often use an integer as the object ID. They're nice because they're small, simple, and sequential. However, Mongo assumes that you want to be extremely distributed. Dense sequential IDs (like int identity) run into all kinds of race conditions and collisions in distributed environments (unless you choose a master ID-assigner, which kind of ruins the point of being distributed).

MongoDB uses a very long (12 byte) semi-sequential number. It's semi-sequential in that every new ID is a bigger number than the IDs generated before it, but not necessarily just +1. Regardless, it's impractical to use regular integers in Mongo and also a little impractical to use long semi-sequential numbers in SQL.

As a result, I chose to use System.Object as the ID type for all identifiers. NHibernate can be configured to use objects as integers with native auto-increment after some tweaking. The Mongo C# driver also supports object IDs with client-side assignment.

Ideally, I would like to write some sort of IdType struct that contains an enumeration and object value (I'm thinking along the lines of a discriminated union here). This would help make IDs be more distinctive and easier to attach extension methods or additional APIs. I'd also like to make IDs protected by default (instead of public).

The Domain Object

I also created a root object for all persistent objects to derive from. This is a fairly common pattern, especially in frameworks where there is a lot of generic or meta-programming.

I had DomainObject implement an IDomainObject interface so that in all my meta-programming I can refer to IDomainObject. That way there shouldn't ever be a corner case where we can't or shouldn't descend from DomainObject but have to anyway (separate implementation from interface).

The User and Name objects are simple, as you can expect any NHibernate object model to look like. The idea is to keep them simple and keep business and data logic elsewhere.

Are You Interested?

From what I can tell, I think we're breaking ground on this project. It doesn't seem like too many people have tried to make a framework to support both relational and document data stores. Initially I was hesitant to support both relational and document stores. But I think there are some excellent side effects that I will outline in upcoming posts.

The content I've written about so far is only a small fraction of what it took to get this on it's feet. Someone once said that you should open source (almost) everything. So, if you (or anyone you know) would like to see the full uncensored code for this, let me know so I can start corporate conversations in that direction.

Saturday, March 10, 2012

Discriminated Unions in C# Mono Compiler

Recently I've been using F# a bit. F# is .NET's functional language (the syntax of F# 1.0 was backward compatible with OCaml, but 2.0 has diverged enough to make it more distinct). Learning F# was a huge mind-shift from the C-family of languages. Of all the features of F#, like implicit typing, tail recursion, and monads, many people list discriminated unions as their favorite.

Discriminated unions feel like C# enums on the surface. For instance, a union that can represent states of a light switch:

This example is really no different from C# enums. Discriminated unions, however, can hold data. For instance, consider when our light switch needs to also be a dimmer:

In C# we would have had to rewrite this whole program to handle the new dimmer requirement. Instead, we can just tack on a new state that holds data.

When you're deep in the F# mindset, this structure makes perfect sense. But try implementing a discriminated union in C#. There's the enum-like part, but there's also the part that holds different sizes of data. There's a great stackoverflow answer that explains how the F# compiler handles discriminated unions internally. It requires 1 enum, 1 abstract class and n concrete implementations of the abstract class. It's quite over-complicated to use in every-day C#.

Nevertheless, I really want to use discriminated unions in my C# code because of how easy they make state machines & workflows. I've been brainstorming how to do this. There are several implementations as C# 3.5 libraries, but they're cumbersome to use. I've been looking at the source code for the mono C# compiler, and I think I want to go the route of forking the compiler for a proof-of-concept.

I'm debating what the syntax should be. I figure that the change would be easier if I re-used existing constructs and just tweaked them to work with the new concepts.

I've been debating if the Dimmed case should retain the regular case syntax or get a lambda-like syntax:

I'm leaning toward the lambda syntax due to how C# usually handles variable scope. I've barely just cloned the mono repository and started reading the design documents to orient myself with the compiler. This could be a huge project, so I'm not sure how far I'll actually get. But this is a very interesting idea that I want to try hashing out.

Wednesday, February 29, 2012

One Thing I Learned From F# (Nulls Are Bad)

Recently I started contributing to VsVim, a Visual Studio plugin that emulates Vim. When he was starting the project, Jared Parsons decided to write the bulk of it in F#. He did this mostly as a chance to learn a new language but also because it's a solid first class alternative to C#. For instance, F#'s features like pattern matching and discriminated unions are a natural fit for state machines like Vim.

This is my first experience with a truly functional language. For those who aren't familiar with F#, it's essentially OCaml.NET (the F# book uses OCaml for it's markup syntax), but also draws roots from Haskell. It's a big mind shift from imperative and pure object oriented languages, but one I'd definitely recommend to any developer who wants to be better.

Since I've been working on VsVim, I've been using F# in my spare time but C# in my regular day job. The longer I use F# the more I want C# to do what F# does. The biggest example is how F# handles nulls.

In C# (and Ruby, Python, and any imperative language) most values can be null, and null is a natural state for a variable to be in. In fact (partly due to SQL), null is used whenever a value is empty or doesn't exist yet. In C# and Java, null is the default value for any member reference, you don't even need to explicitly initialize it. As a result, you often end up with a lot of null pointer exceptions due to sloppy programming. After all, it's kind of hard to remember to check for null every time you use a variable.

In F#, nothing is null (that's not entirely true, but in it's natural state it's true enough). Typically you'll use options instead of null. For instance, if you have a function that fails to find or calculate something you might return null in imperative languages (and the actual value if successful). However, in F# you use an option type and return None on failure and Some value on success.

Here, every time you call find(kittens) you get back an option type. This type isn't a string, so you can't just start using string methods and get a null pointer exception. Instead, you have to extract the string value from the option type before it can be used.

At this point you might be thinking, "why would I want to do that? It looks like a lot of extra code". However, I challenge you to find a crashing bug in VsVim. Every time we have an instance of an invalid state we are forced to deal with it on the spot. Every invalid state is dealt with in a way that makes sense.

If we wrote it in C# it would be incredibly easy to get lazy while working late at night and forget to check for null and cause the plugin to crash. Instead, the only bugs we have are behavior quirks. If we ever have a crashing bug, the chances are the null value originated in C# code from Visual Studio or the .NET Framework and we forgot to check.

Discussion on HN

Friday, February 10, 2012

C# Reflection Performance And Ruby

I've always known that reflection method invocations C# are slower than regular invocations, but I've never never known to what extent. So I set out to make an experiment to demonstrate the performance of several ways to invoke a method. Frameworks like NHibernate or the mongoDB driver are known to serialize and deserialize objects. In order to do either of these activities they have to scan the properties of an object and dynamically invoke them to get or set the values. Normally this is done via reflection. However, I want to know if the possibility of memoizing a method call as an expression tree or delegate could offer significant performance benefits. On the side, I also want to see how C# reflection compares to Ruby method invocations.

I posted the full source to a public github repo. To quickly summarize, I wrote code that sets a property on an object 100 million times in a loop. Any setup (like finding a PropertyInfo or MethodInfo) is not included in the timings. I also checked the generated IL to make sure the compiler wasn't optimizing the loops. Please browse the code there if you need the gritty details.

Before I get into the implementation details, here are the results:

You can see that a reflection invoke is on the order of a hundred times slower than a normal property (set) invocation.

Here's the same chart but without the reflection invocation. It does a better job of showing the scale between the other tests.

Obviously, the lesson here is to directly invoke methods and properties when possible. However, there are times when you don't know what a type looks like at compile time. Again, object serialization/deserialization would be one of those use cases.

Here's an explanation of each of the tests:

Reflection Invoke (link)

This is essentially methodInfo.Invoke(obj, new[]{ value } on the setter method of the property. It is by far the slowest approach to the problem. It's also the most common way to solve the problem of insufficient pre-compile time knowledge.

Direct Invoke (link)

This is nothing other than obj.Property = value. Its as fast as it gets, but impractical for use cases where you don't have pre-compile time knowledge of the type.

Closure (link)

This isn't much more flexible than a direct invoke, but I thought it would be interesting to see how the performance degraded. This is where you create a function/closure ( (x,y) => x.Property = y) prior to the loop and just invoke the function inside the loop (action(obj, value)). At first sight it appears to be half as fast as a direct invoke, but there are actually two method calls involved here, so it's actually not any slower than a direct invoke.

Dynamic Dispatch (link)

This uses the C# 4.0 dynamic feature directly. To do this, I declared the variable as dynamic and assigned it using the same syntax as a direct invoke. Interestingly, this performs only 6x slower than direct invoke and about 20x faster than reflection invoke. Take note, if you need reflection, use dynamic as often as possible since it can really speed up method invocation.

Expression Tree (link)

The shortcoming of most of the previous approaches is that they require pre-compile time knowledge of the type. This time I tried building an expression tree (a C# 3.0 feature) and compiled a delegate that invokes the setter. This makes it flexible enough that you can call any property of an object without compile-time knowledge of the name, as long as you know the return type. In this example, like the closure, we're indirectly setting the property, so two method calls. With this in mind, it took almost 2.5 times as long as the closure example, even though they should be functionally equivalent operations. It must be that expression trees compiled to delegates aren't actually as simple as they appear.

Expression Tree with Dynamic Dispatch (link)

Since the expression tree approach requires compile-time knowledge of the return type, it isn't as flexible. Ideally you could use C# 4.0's covariance feature and cast it to Action which compiles, but fails at runtime. So for this one, I just assigned the closure to a variable typed as dynamic to get around the compile/runtime casting issues.

As expected, it's the slowest approach. However, its still 16 times faster than direct reflection. Perhaps, memoizing method calls, like property sets and gets, like this would actually yield a significant performance improvement.

Compared To Ruby

I thought I'd compare these results to Ruby where all method calls are dynamic. In Ruby, a method call looks first in the object's immediate class and then climbs the ladder of parent classes until it finds a suitable method to invoke. Because of this behavior I thought I would be interesting to also try a worst-case scenario with a deep level of inheritance.

To do this fairly, I initially wrote a while loop in Ruby that counted to 100 million. I rewrote the while loop in n.each syntax and saw the execution time get cut in half. Since I'm really just trying to measure method invocation time, I stuck with the n.each syntax.

I honestly thought C# Reflection would be significantly faster than the Ruby with 5 layers of in inheritance. While C# already holds a reference to the method (MethodInfo), Ruby has to search up the ladder for the method each time. I suppose Ruby's performance could be due to the fact that it's written in C and specializes in dynamic method invocation.

Also, it interests me why C# dynamic is so much faster than Ruby or reflection. I took a look at the IL code where the dynamic invoke was happening and was surprised to find a callvirt instruction. I guess I was expecting some sort of specialized calldynamic instruction (Java 7 has one). The answer is actually a little more complicated. There seems to be several calls - most are call instructions to set the stage (CSharpArgumentInfo.Create) and one callvirt instruction to actually invoke the method.

Conclusion

Since the trend of C# is going towards using more Linq, I find it interesting how much of a performance hit developers are willing to exchange for more readable and compact code. In the grand scheme of things, the performance of even a slow reflection invoke is probably insignificant compared to other bottlenecks like database, HTTP, filesystem, etc.

It seems that I've proved the point that I set out to prove. There is quite a bit of performance to be gained by memoizing method calls into expression trees. The application would obviously be best in JSON serialization, ORM, or anywhere when you have to get/set lots of properties on an object with no compile-time knowledge of the type. Very few people, if any, are doing this - probably because of the added complexity. The next step will be to (hopefully) build a working prototype.

Friday, February 3, 2012

Thoughts on the C# driver for MongoDB

I recently started a new job with a software company in Boulder. Our project this year is rewriting the existing product (not a clean rewrite, more like rewrite & evolve). One of the changes we're making is using MongoDB instead of T-SQL. Since we're going to be investing pretty heavily in Mongo we all attended the mongo conference in Boulder on Wednesday. The information was great and now I'm ready to dig into my first app. Today I played around with some test code and made some notes about features/shortcomings of the C# driver.

First of all, the so-called "driver" is much full featured than a typical SQL driver. It includes features to map documents directly to CLR objects (from here on I'll just say document if I mean Mongo BSON document and object for CLR object). There's plans to support Linq directly from the driver. So right off I'm impressed with the richness of the driver. However, I noticed some shortcomings.

For instance, all properties in the document must be present (and of the right type) in the object. I perceived this as a shortcoming because this is unlike regular JSON serialization where missing properties are ignored. After thinking a little further, this is probably what most C# developers would want since the behavior caters toward strongly typed languages that prefer fail-fast behavior. If you know a particular document might have extraneous properties that aren't in the object, you can use the BsonIgnoreExtraElements attribute.

Thinking about this behavior, refactor renaming properties could be less trivial. You would have to run a data migration script to rename the property (mongo does have an operation for renaming fields). It would be great if the driver had a [BsonAlias("OldValue")] attribute to avoid migration scripts (maybe I'll make a pull request).

Something I liked was that I could use object for the type of the _id property instead of BsonObjectId. This will keep the models less coupled to the Mongo driver API. Also, the driver already has a bi-directional alias for _id as Id. I don't know any C# developers who wouldn't squirm at creating a public property named _id.

This brings me to my biggest issue with the C# mongo driver. All properties must be public. This breaks the encapsulation and SRP principles. For instance, most of the time I have no reason to expose my Id (or _id) property as public. NHibernate solves this by hydrating protected fields. I would like this to be solved very soon (but there are some issues with this since there isn't any mappings).

Last, it has poor support for C# 4.0 types. Tuple doesn't fail, but it's serialized as an empty object ({ }). There is also zero support AFAIK for dynamic.

In conclusion, there's some room for improvement with Mongo's integration with .NET but overall I have to say I'm impressed. Supposedly Linq support is due out very soon, which will make it unstoppable (imo). Also, we haven't started using this in a full production environment yet, so there will most likely be more posts coming on this topic.