This blog has been moved to http://info.timkellogg.me/blog/

Wednesday, February 29, 2012

One Thing I Learned From F# (Nulls Are Bad)

Recently I started contributing to VsVim, a Visual Studio plugin that emulates Vim. When he was starting the project, Jared Parsons decided to write the bulk of it in F#. He did this mostly as a chance to learn a new language, but also because it's a solid, first-class alternative to C#. For instance, F# features like pattern matching and discriminated unions are a natural fit for state machines like Vim.

This is my first experience with a truly functional language. For those who aren't familiar with F#, it's essentially OCaml.NET (the F# book uses OCaml for its markup syntax), but it also draws on Haskell. It's a big mind shift from imperative and purely object-oriented languages, but one I'd definitely recommend to any developer who wants to get better.

Since I've been working on VsVim, I've been using F# in my spare time but C# in my regular day job. The longer I use F# the more I want C# to do what F# does. The biggest example is how F# handles nulls.

In C# (and Ruby, Python, and most imperative languages) most values can be null, and null is a natural state for a variable to be in. In fact (partly due to SQL), null is used whenever a value is empty or doesn't exist yet. In C# and Java, null is the default value for any reference-typed member; you don't even need to initialize it explicitly. As a result, you often end up with a lot of null pointer exceptions due to sloppy programming. After all, it's hard to remember to check for null every time you use a variable.

In F#, nothing is null (that's not entirely true, but in its natural state it's true enough). Typically you'll use options instead of null. For instance, in an imperative language a function that fails to find or calculate something might return null (and the actual value if it succeeds). In F#, however, you use an option type and return None on failure and Some value on success.


Here, every time you call find(kittens) you get back an option type. This type isn't a string, so you can't just start using string methods and get a null pointer exception. Instead, you have to extract the string value from the option type before it can be used.
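For readers who think in C#, the same idea can be sketched roughly as follows. This is only an illustration: the `Option<T>` struct and the `Kittens.Find` helper are hypothetical names made up for this sketch (F# has its option type built in, and none of this is VsVim code). The point is that the caller cannot reach the string without also saying what happens when nothing was found.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical Option type for illustration only; F# ships its own option type.
public readonly struct Option<T>
{
    private readonly T _value;
    public bool IsSome { get; }

    private Option(T value)
    {
        _value = value;
        IsSome = true;
    }

    public static Option<T> Some(T value) => new Option<T>(value);

    // The default struct value has IsSome == false, so it doubles as None.
    public static Option<T> None => default;

    // The only way to get at the value: supply a handler for both cases.
    public TResult Match<TResult>(Func<T, TResult> some, Func<TResult> none)
        => IsSome ? some(_value) : none();
}

public static class Kittens
{
    // Hypothetical lookup: Some(name) when the kitten is found, None otherwise.
    public static Option<string> Find(IEnumerable<string> kittens, string name)
    {
        foreach (var kitten in kittens)
        {
            if (kitten == name) return Option<string>.Some(kitten);
        }
        return Option<string>.None;
    }
}
```

A call such as `Kittens.Find(names, "Tom").Match(s => s.Length, () => 0)` will not compile unless both branches are supplied, which is the property the post is praising.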

At this point you might be thinking, "why would I want to do that? It looks like a lot of extra code". However, I challenge you to find a crashing bug in VsVim. Every time we have an instance of an invalid state we are forced to deal with it on the spot. Every invalid state is dealt with in a way that makes sense.

If we had written it in C#, it would be incredibly easy to get lazy while working late at night, forget to check for null, and cause the plugin to crash. Instead, the only bugs we have are behavior quirks. If we ever do have a crashing bug, chances are the null value originated in C# code from Visual Studio or the .NET Framework and we forgot to check.

Discussion on HN

Friday, February 10, 2012

C# Reflection Performance And Ruby

I've always known that reflection method invocations in C# are slower than regular invocations, but I've never known to what extent. So I set up an experiment to compare the performance of several ways to invoke a method. Frameworks like NHibernate or the MongoDB driver serialize and deserialize objects. To do either, they have to scan the properties of an object and dynamically invoke them to get or set the values. Normally this is done via reflection. However, I want to know whether memoizing a method call as an expression tree or delegate could offer significant performance benefits. On the side, I also want to see how C# reflection compares to Ruby method invocations.

I posted the full source to a public GitHub repo. To quickly summarize, I wrote code that sets a property on an object 100 million times in a loop. Any setup (like finding a PropertyInfo or MethodInfo) is not included in the timings. I also checked the generated IL to make sure the compiler wasn't optimizing the loops away. Please browse the code there if you need the gritty details.
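As a rough sketch of that measurement shape (not the code from the repo), a timing loop for the direct case might look like the following. The `Widget` class and the `Bench.TimeDirectSet` method are hypothetical names invented here; the only things taken from the description above are that the loop body is a property set and that the stopwatch covers nothing but the loop.

```csharp
using System;
using System.Diagnostics;

// Hypothetical target type for the sketch.
public class Widget
{
    public string Name { get; set; } = "";
}

public static class Bench
{
    // Times only the loop itself; any setup is assumed to happen before the call.
    public static TimeSpan TimeDirectSet(Widget target, int iterations)
    {
        var watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            target.Name = "value";
        }
        watch.Stop();
        return watch.Elapsed;
    }
}
```

The other variants would swap out the single line inside the loop while keeping the stopwatch placement the same.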

Before I get into the implementation details, here are the results:



You can see that a reflection invoke is on the order of a hundred times slower than a normal property (set) invocation.

Here's the same chart but without the reflection invocation. It does a better job of showing the scale between the other tests.



Obviously, the lesson here is to directly invoke methods and properties when possible. However, there are times when you don't know what a type looks like at compile time. Again, object serialization/deserialization would be one of those use cases.

Here's an explanation of each of the tests:

Reflection Invoke (link)

This is essentially methodInfo.Invoke(obj, new[]{ value }) on the setter method of the property. It is by far the slowest approach to the problem. It's also the most common way to solve the problem of insufficient compile-time knowledge.
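A minimal sketch of that call, assuming a hypothetical `Person` class and helper names (`ReflectionSetter.FindSetter` and `ReflectionSetter.Set` are made up for illustration). The lookup is kept separate from the invoke because the benchmark described above excludes setup from its timings.

```csharp
using System;
using System.Reflection;

// Hypothetical target type for the sketch.
public class Person
{
    public string Name { get; set; } = "";
}

public static class ReflectionSetter
{
    // Setup step: find the property's public setter once, outside any loop.
    public static MethodInfo FindSetter(Type type, string propertyName)
    {
        PropertyInfo property = type.GetProperty(propertyName)
            ?? throw new ArgumentException("No such property: " + propertyName);
        return property.GetSetMethod()
            ?? throw new ArgumentException("No public setter on: " + propertyName);
    }

    // Measured step: one reflection invoke per property set.
    public static void Set(MethodInfo setter, object target, object value)
    {
        setter.Invoke(target, new[] { value });
    }
}
```

Each `Invoke` allocates an argument array and checks the arguments at run time, which is part of why this path is expensive.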

Direct Invoke (link)

This is nothing other than obj.Property = value. It's as fast as it gets, but impractical for use cases where you don't have compile-time knowledge of the type.

Closure (link)

This isn't much more flexible than a direct invoke, but I thought it would be interesting to see how the performance degraded. This is where you create a function/closure ((x, y) => x.Property = y) prior to the loop and just invoke the function inside the loop (action(obj, value)). At first sight it appears to be half as fast as a direct invoke, but there are two method calls involved here (the delegate call and the setter), so per call it's not really any slower than a direct invoke.
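In sketch form, with a hypothetical `Account` class and `ClosureSetter` holder invented for illustration, the delegate is created once and then called from inside the loop:

```csharp
using System;

// Hypothetical target type for the sketch.
public class Account
{
    public string Owner { get; set; } = "";
}

public static class ClosureSetter
{
    // Created once, before the loop. The property is still named at compile time.
    public static readonly Action<Account, string> SetOwner = (x, y) => x.Owner = y;

    public static void Run(Account account, string value, int iterations)
    {
        for (int i = 0; i < iterations; i++)
        {
            // Two calls per iteration: the delegate, then the property setter.
            SetOwner(account, value);
        }
    }
}
```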

Dynamic Dispatch (link)

This uses the C# 4.0 dynamic feature directly. To do this, I declared the variable as dynamic and assigned it using the same syntax as a direct invoke. Interestingly, this performs only 6x slower than direct invoke and about 20x faster than reflection invoke. Take note: if you need reflection, use dynamic wherever you can, since it can really speed up method invocation.
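A small sketch of what that looks like, using a hypothetical `Customer` class and `DynamicSetter` helper. One caveat worth stating: the member name still has to appear in the source, so dynamic only removes the need to know the object's type, not the property's name.

```csharp
// Hypothetical target type for the sketch.
public class Customer
{
    public string Name { get; set; } = "";
}

public static class DynamicSetter
{
    // The static type of 'target' is just object; the Name setter is bound at run time.
    public static void SetName(object target, string value)
    {
        dynamic d = target;
        d.Name = value;
    }
}
```

If the object passed in has no settable `Name` member, the assignment throws a `RuntimeBinderException` at run time instead of failing to compile.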

Expression Tree (link)

The shortcoming of most of the previous approaches is that they require compile-time knowledge of the type. This time I tried building an expression tree (a C# 3.0 feature) and compiled a delegate that invokes the setter. This makes it flexible enough that you can set any property of an object without compile-time knowledge of its name, as long as you know the property's type. In this example, like the closure, we're indirectly setting the property, so there are two method calls. With this in mind, it took almost 2.5 times as long as the closure example, even though they should be functionally equivalent operations. It must be that expression trees compiled to delegates aren't actually as simple as they appear.
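One way this could be put together is sketched below. The `Order` class and `ExpressionSetter.Build` are hypothetical names for illustration, and the repo's version may well be structured differently; the sketch only shows the general technique of wrapping a call to the setter in a lambda expression and compiling it once.

```csharp
using System;
using System.Linq.Expressions;
using System.Reflection;

// Hypothetical target type for the sketch.
public class Order
{
    public string Status { get; set; } = "";
}

public static class ExpressionSetter
{
    // Builds (target, value) => target.set_<propertyName>(value) and compiles it.
    // The property name is a run-time string; the property type is still a type argument.
    public static Action<TTarget, TValue> Build<TTarget, TValue>(string propertyName)
    {
        PropertyInfo property = typeof(TTarget).GetProperty(propertyName)
            ?? throw new ArgumentException("No such property: " + propertyName);
        MethodInfo setter = property.GetSetMethod()
            ?? throw new ArgumentException("No public setter on: " + propertyName);

        ParameterExpression target = Expression.Parameter(typeof(TTarget), "target");
        ParameterExpression value = Expression.Parameter(typeof(TValue), "value");
        MethodCallExpression call = Expression.Call(target, setter, value);

        return Expression.Lambda<Action<TTarget, TValue>>(call, target, value).Compile();
    }
}
```

Calling `ExpressionSetter.Build<Order, string>("Status")` once and reusing the returned delegate is the memoization the post has in mind: the expensive compile happens up front, and each later call is an ordinary delegate invocation.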

Expression Tree with Dynamic Dispatch (link)

Since the expression tree approach requires compile-time knowledge of the property's type, it isn't as flexible. Ideally you could use C# 4.0's covariance feature and cast it to Action which compiles, but fails at runtime. So for this one, I just assigned the compiled delegate to a variable typed as dynamic to get around the compile/runtime casting issues.

As expected, it's the slowest of the alternatives to reflection. However, it's still 16 times faster than direct reflection. Perhaps memoizing method calls this way, such as property gets and sets, would actually yield a significant performance improvement.

Compared To Ruby

I thought I'd compare these results to Ruby, where all method calls are dynamic. In Ruby, a method call looks first in the object's immediate class and then climbs the ladder of parent classes until it finds a suitable method to invoke. Because of this behavior I thought it would be interesting to also try a worst-case scenario with a deep level of inheritance.

To do this fairly, I initially wrote a while loop in Ruby that counted to 100 million. I rewrote the while loop in n.each syntax and saw the execution time get cut in half. Since I'm really just trying to measure method invocation time, I stuck with the n.each syntax.



I honestly thought C# reflection would be significantly faster than Ruby with 5 layers of inheritance. While C# already holds a reference to the method (MethodInfo), Ruby has to search up the ladder for the method each time. I suppose Ruby's performance could be due to the fact that it's written in C and specializes in dynamic method invocation.

I'm also curious why C# dynamic is so much faster than Ruby or reflection. I took a look at the IL code where the dynamic invoke was happening and was surprised to find a callvirt instruction. I guess I was expecting some sort of specialized calldynamic instruction (Java 7 has one, invokedynamic). The answer is actually a little more complicated. There seem to be several calls: most are call instructions to set the stage (CSharpArgumentInfo.Create), plus one callvirt instruction to actually invoke the method.

Conclusion

Since the trend in C# is toward using more Linq, I find it interesting how much of a performance hit developers are willing to accept in exchange for more readable and compact code. In the grand scheme of things, the performance of even a slow reflection invoke is probably insignificant compared to other bottlenecks like database, HTTP, filesystem, etc.

It seems that I've proved the point I set out to prove. There is quite a bit of performance to be gained by memoizing method calls into expression trees. The best applications would obviously be JSON serialization, ORMs, or anywhere you have to get/set lots of properties on an object with no compile-time knowledge of the type. Very few people, if any, are doing this, probably because of the added complexity. The next step will be to (hopefully) build a working prototype.


Friday, February 3, 2012

Thoughts on the C# driver for MongoDB

I recently started a new job with a software company in Boulder. Our project this year is rewriting the existing product (not a clean rewrite, more like rewrite & evolve). One of the changes we're making is using MongoDB instead of T-SQL. Since we're going to be investing pretty heavily in Mongo we all attended the mongo conference in Boulder on Wednesday. The information was great and now I'm ready to dig into my first app. Today I played around with some test code and made some notes about features/shortcomings of the C# driver.

First of all, the so-called "driver" is much more full-featured than a typical SQL driver. It includes features to map documents directly to CLR objects (from here on I'll just say document if I mean Mongo BSON document and object for CLR object). There are plans to support Linq directly from the driver. So right off I'm impressed with the richness of the driver. However, I noticed some shortcomings.

For instance, all properties in the document must be present (and of the right type) in the object. I perceived this as a shortcoming because it is unlike regular JSON serialization, where missing properties are ignored. After thinking a little further, this is probably what most C# developers would want, since the behavior caters to strongly typed languages that prefer fail-fast behavior. If you know a particular document might have extraneous properties that aren't in the object, you can use the BsonIgnoreExtraElements attribute.

Thinking about this behavior, renaming properties during a refactor could be less trivial. You would have to run a data migration script to rename the property (mongo does have an operation for renaming fields). It would be great if the driver had a [BsonAlias("OldValue")] attribute to avoid migration scripts (maybe I'll make a pull request).

Something I liked was that I could use object for the type of the _id property instead of BsonObjectId. This will keep the models less coupled to the Mongo driver API. Also, the driver already has a bi-directional alias for _id as Id. I don't know any C# developers who wouldn't squirm at creating a public property named _id.

This brings me to my biggest issue with the C# mongo driver. All properties must be public. This breaks encapsulation and the SRP. For instance, most of the time I have no reason to expose my Id (or _id) property as public. NHibernate solves this by hydrating protected fields. I would like this to be solved very soon (but there are some issues with this since there aren't any mappings).

Last, it has poor support for C# 4.0 types. Tuple doesn't fail, but it's serialized as an empty object ({ }). There is also zero support AFAIK for dynamic.

In conclusion, there's some room for improvement with Mongo's integration with .NET but overall I have to say I'm impressed. Supposedly Linq support is due out very soon, which will make it unstoppable (imo). Also, we haven't started using this in a full production environment yet, so there will most likely be more posts coming on this topic.