11/23/2013

Why Scala is important

If you are a Java programmer, then unless you have been living under a rock for the last couple of years, you must have noticed that there have been some interesting things going on in the Java world. The funny thing is, almost none of it is attributable to Java. Just to clarify, when one speaks about Java, one could mean
  • Java, the programming language,
  • the Java Virtual Machine (JVM),
  • the Java ecosystem surrounding the language and the JVM.
Some journalists (for example in Dr. Dobb's) argue about the popularity and healthiness of "Java", but they tend to mix up the three meanings of the word, waving flags with "Java Forever" written on them. Yes, the JVM is a mature, performant virtual machine with a first-class garbage collector, which - given the right tuning - can get close to native code in performance. The ecosystem is thriving and has a multitude of options for any possible task you could imagine, even tools to work around design or implementation shortcomings (JRebel comes to mind as the first such example).

The language itself, however, is a different story. Back in the days when it was new, its standout features - garbage collection and bytecode execution - made a real difference and brought relief to weary C++ programmers. The dawn of the internet and Software as a Service also favoured Java, so it had a distinct advantage over pretty much anything else. Times change, however, and even though some attempts (such as Java 5 generics) were made to improve the language, development slowed down due to politics and a "spastic" insistence on backward compatibility, and pulling the plug on Sun didn't help either. A generation had grown up and realised that they had so much more to express than they could in Java - they had to resort to runtime hacking tactics like AspectJ and Spring, or more recently, Lombok.

Some said, to hell with the chains of static type safety, long live dynamic typing - and this spurred the first wave of languages on the JVM, be they ports of existing languages (JRuby, Jython, Rhino) or new ones (such as Groovy, and later Clojure).

Around this time came another blow from the world of reality: Moore's law stopped working for CPU speeds, so processors stopped getting much faster than their predecessors and instead gained more cores to facilitate the parallel execution of many threads. That means you have two options to scale: either keep running on one machine but utilize as many cores as possible, or - if you have to serve millions of requests per second - scale your program out to multiple computers. Both approaches lend themselves quite naturally to the functional programming style, where one has (mostly) immutable values and side-effect-free functions that can be passed around as the system requires. This meant that some people started to realize the benefits of previously mostly ignored languages (Haskell and Erlang, for example), and some had the chance to try something new (Clojure and Scala are the most prominent examples).

But the sneakiest blow came from those who said: hey, I like Java, if only it were a bit less verbose and pedantic and a bit more flexible and extensible. Can we do better where Sun and Oracle have refused to? Why yes, we can - said the developers of Xtend, Kotlin, Ceylon, Fantom, and probably others we haven't heard about just yet. They all try to cater to the seasoned Java programmer by sticking to the basics of Java while providing some new and exciting features - type inference, lambdas, closures, smart null-pointer checking, and so on.

Meanwhile Oracle is moving at the speed of a snail climbing uphill, promising and then failing to deliver some functional features in Java 8, and - at the time of this writing - further delaying even more important and fundamental changes, like Jigsaw, to version 9.

You have probably guessed one of the points I'm trying to make here: Java (the language) is dead. It has been for some time, and I don't see how it could be revived in the future. By "dead" I mean dead in the sense that Fortran and Cobol are dead. They will probably never become entirely extinct; there will always be some systems written in them that need to be maintained and perhaps even developed. But the world has moved on, and I see no particular reason why someone would want to start a new project in Cobol, for example. Sure, Java has a long, long way to go before this "final" state, but I clearly see the path, and frankly, I don't really want to be a part of it. There is a glimmer of hope in the form of native Android development, but I'm not optimistic.

What is the future mainstream language of the JVM going to look like? I think these are the two options:
  • It is going to be JavaScript. JavaScript is the de facto language of browsers, and it has shown enormous potential on the server side as well. It has some essential characteristics which facilitate functional programming, but its weak type system is a real problem for some.
  • It is going to be a statically typed language with type inference, first-class functions, pattern matching, and a collection library with immutable collections. It will have a syntax pretty similar to Java's.
I doubt that the next mainstream language will be one
  • with LISP-style syntax (sorry Clojure, you lost it here)
  • dynamically typed (static types have too much value for large systems and organizations).
The most important thing, in my opinion, is to understand that if you are a Java programmer now, then no matter how hard you try to avoid it, you WILL be programming The Next Language in 5 or 10 years. You can fight, you can scream, but that is what is going to happen. And the more you prepare for it, the less painful it is going to be.

Here are some things you will need to be familiar with by the time we get there:
  • lambda calculus and expression evaluation
  • lazy evaluation
  • functions as first class values 
  • types as first class values (maybe this is a bit far off, but I can imagine it coming into the mainstream)
  • constructing your program in a functional way: immutability, pattern matching, type classes, currying
  • mixing functional and object oriented styles
  • working with persistent data structures
  • reasoning about functional constructs in terms of memory consumption and execution order
  • approaching concurrency in an entirely new way, in message passing style and/or transactions
There are many others, I'm sure, but as a start, you can't go wrong with the list above. Either this, or JavaScript, which contains many of the ideas above anyway.
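
To give a taste of what several items on that list look like in practice, here is a minimal Scala sketch - Scala only because it is the language this post is heading towards, and all the names are invented purely for illustration:

    // Immutability and pattern matching: a small algebraic data type.
    sealed trait Shape
    case class Circle(radius: Double) extends Shape
    case class Rectangle(width: Double, height: Double) extends Shape

    object FunctionalTaste {

      // Pattern matching instead of instanceof/cast chains.
      def area(s: Shape): Double = s match {
        case Circle(r)       => math.Pi * r * r
        case Rectangle(w, h) => w * h
      }

      // Currying and functions as first-class values.
      def scale(factor: Double)(s: Shape): Shape = s match {
        case Circle(r)       => Circle(r * factor)
        case Rectangle(w, h) => Rectangle(w * factor, h * factor)
      }
      val double: Shape => Shape = scale(2.0) // partially applied

      // Lazy evaluation: the right-hand side runs only when first used.
      lazy val unitCircle: Shape = Circle(1.0)

      def main(args: Array[String]): Unit = {
        // Persistent (immutable) collections: map and filter return new lists.
        val shapes = List(unitCircle, Rectangle(2.0, 3.0))
        val bigAreas = shapes.map(double).map(area).filter(_ > 20.0)
        println(bigAreas) // List(24.0)
      }
    }

Nothing here is rocket science, but every line touches one of the bullet points above.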

I am now wondering if there is a language out there that unifies the ideas above and allows the programmer to go slow, introducing concepts one by one, or to go crazy and grow a beard.

Scala has lots of promise from this point of view. It has some serious issues which stop me from declaring it the Next JVM Language - compilation speed, compatibility, language design questions and so on. Maybe it is not Scala, even though more and more companies are using it, building really large systems and churning out production code. Maybe it is just a fad that will collapse under its own weight.

But tell me honestly, for YOU, the programmer, does it really matter? Does it matter if you learn some syntax that you may not be using after 5 years, if the concepts behind it are fundamental and will help you become a better programmer for the rest of your life? Some people much smarter than me say it is worth it, so if you don't believe me or agree with me, believe them.

Fortunately, we have entered the age of Freedom of Learning, so if you are convinced, you have several options to start learning online without even buying a book. That is where I would recommend you start.


1/13/2013

A flirt and a love affair, part 1

The year 2012 was as intense as any year will ever be for me; looking back on it feels like looking back on a decade. So many things happened, and not all of them good, but as the saying goes, "what doesn't kill you makes you stronger", so getting older means I'm also getting stronger in the process.

Earlier last year I had a couple of tasks that were not classical software development assignments, but more like errands for the masters. "Tell me how many X-es we have that are Y, contain at least 10 Z's, and were active as of T." I hear you scream: SQL! Oh yes, but it's not so easy. The data containing X, Y and Z live in different databases, and perhaps not even in the same type of database: we have two general-purpose systems installed, Sybase and DB2.

So what does one do when one receives a request like the above? The first time, I did the queries manually, bridging the gap between the databases by saving data from one into a file and uploading it to the other into a temporary table, then joining on the temp table. Yes, it kind of works, but as a software developer I take pride in my laziness, and by the time the second or third such request arrived I knew I did not want to do this manually again. So I went looking for a better solution...

... which meant: rapid prototyping, even quicker development, and maximal flexibility. Java is not a very good option in this case, even if - as in my case - all the supporting libraries and frameworks are at my disposal and all I have to do is build the logic around them. No, I wanted something better, and looking around I found that two scripting languages were available in the firm: Perl and Python. Now, I don't want to say anything bad about Perl, but Python was really the only alternative here, so I jumped in and refreshed my memory of it.

I used Python previously for programming a testing framework (The Grinder, by the way), and even though it felt a bit weird at first, I quickly got up to speed and started enjoying the ride. It was no different this time, perhaps even better, since I was building stuff from scratch instead of extending something that already existed.

What really struck me was the sheer enjoyment of expressing the problem in the language, using not fancy object-oriented constructs and patterns but simple language data structures like tuples. I cannot praise enough those languages in which, without much fanfare, you can write something like

    t = (0, 1, 'Hello')
    a, b, c = t

which doesn't look like much, but when you start thinking in data structures like "I need a set of lists whose first element is a tuple", you really start to appreciate how the language does not get in your way but supports your thought process as much as possible. The same is true of first-class functions and closures: once you have them in your toolbox, everything just seems so much simpler...

I have a few problems with Python, of course, and they are more fundamental than the enforced indentation (you just get used to it) or the required "self" parameter of class methods (you just get used to it). For simpler systems - and "simpler" here refers not necessarily to the size of the code but to the number of developers writing it - I see Python as an excellent choice. But once you get past a certain size you need to be very careful, because the language is dynamic, and you can assign a value to a variable in the middle of your code (say, inside a loop) that never existed before, and you will sweat blood two weeks later trying to figure out whether that variable is used as intended and what the intention was in introducing it. Having N developers on a project amplifies this problem exponentially, so code_size ^ N will be the number of bugs you are likely to introduce and not discover until much later. Also, I find the packaging system awful, which really hurts modularity and code reuse. I don't know, though, whether that is just a personal impression or I simply did not spend enough time learning it.

In the end, I was very happy with my choice to use Python for automating my repetitive tasks, but I knew it was going to be a flirt, and I was after a serious love affair. I wanted something with the expressive abilities of Python, but something that would give me the security of a statically typed language (there, I said it). Also, it would have to be JVM-based, because our part of the world is a Java shop, and all our libraries and investment lie in the world of Java. Oh, and it had to be approved as production-ready by management. Does such a language exist? We will find out in part 2...

1/21/2012

The Telephone Log

Back in 2004, when I arrived at a large multinational company (let's call it by the letter E, just because it is short and simple), one of the greatest cultural shocks to absorb was the development process, and the continuous documentation. When you develop software in a small team, and maybe interface with one or two other systems, you might get away with minimal documentation, or even no documentation at all. Requirements can live in your mailbox, or in the minds of the people around you. Especially if staff turnover is low, and the original developer of a function is still there and might even remember why he inserted a certain IF branch into a certain printing routine five years ago ("Oh, yeah, I remember there was a bug in a certain version of the Macintosh print driver of this printer, so we had to program around it..."), it is not a big deal.

On a larger scale though, things are very different. People come and go, products are developed on the other side of the world, and even if you document everything there is a good chance that someone trying to interface to your code doesn't even know where the documentation for your stuff is available, if at all. They might not even be aware of the existence of your product (let alone yourself), and start reimplementing it from scratch.

Now, at the E company, product development in general and the documentation process specifically were highly organized. There was a rigorous process which required a certain document at each step before you could proceed, and on a larger scale things worked even better: there were (and I assume they still very much exist) internal documentation libraries for the products, with all the documentation available, including detailed product descriptions and interface specifications. Looking back, this stands out as one of the best pieces of infrastructure there, something that I really miss now, but I digress...

Within our small team, however, things were not going so well. Yes, we followed the process, but documentation was way too formalized (and probably a little unnecessarily rigorous at this scale), and people in general didn't really understand why they should invest so much time in, for example, an Implementation Proposal, which was meant to capture the thoughts of the developer pre-implementation and which would be rendered obsolete as soon as the code was written - partly because the implementation changed compared to the plans, and partly because there would be a post-implementation document anyway.

I felt that this was overengineered and that there was a better solution. Clearly, the problematic part was the communication between the Architect and the Developer. The Architect in this process is the person responsible for understanding the business logic of the product and translating it for the developer who implements it. An example would be a feature of the product we were developing at the time, which modeled telecom nodes and the links between them; we had to calculate the throughput of the network traffic flowing through the links according to a specific traffic model (I don't want to go into details here, perhaps in another post).

So there was this traffic model with its mathematical equations and precise definitions. One would think that doing an implementation based on a high-quality specification would be easy and straightforward. But it was anything but. It turned out to be hard: there were so many things that the developer would interpret one way, or architectural decisions he would base on assumptions that did not reflect the intention of the architect at all. I remember the developer being quite frustrated for a few months, and the architect too. So clearly something was missing.

I'm not sure where the idea came from exactly, but I did read The Mythical Man-Month at around that time, where I found this:

From The Mythical Man Month, by Frederick P. Brooks Jr (first edition 1975 (!)):
The Telephone Log
As implementation proceeds, countless questions of architectural interpretation arise, no matter how precise the specification. Obviously many such questions require amplifications and clarifications in the text. Others merely reflect misunderstandings.
It is essential, however, to encourage the puzzled implementer to telephone the responsible architect and ask his questions, rather than to guess and proceed. It is just as vital to recognize that the answers to such questions are ex cathedra architectural pronouncements that must be told to everyone.
One useful mechanism is a telephone log kept by the architect. In it he records every question and every answer. Each week the logs of the several architects are concatenated, reproduced, and distributed to the users and implementers. While this mechanism is quite informal, it is both quick and comprehensive.

This gave me a spark. OK, we did not use the telephone for communication, since we sat next to each other, so that part was unnecessary. However, we made the following mistakes:
  1. We did not encourage the puzzled developer to ask whenever in doubt
  2. We did not record the result of the conversation between the architect and the developer when the frustrated developer did ask the architect in the end.
Being a technical person, I concentrated on the second point. I did not want to enforce a formal documentation process, because that would lose the essence of the record: capturing those minuscule details that do not fit inside a formal document but are important nevertheless.
So I looked around and found a fairly new technology called the wiki, which seemed perfect for our purposes, because it was easy to use, the information it contained was shared automatically, and the resources required for maintenance were minimal.

Back in 2004 this was really something new (check the Mediawiki history if you don't believe me), something very unusual within the E organization, so it was to my great surprise that it was a relatively easy push to convince my boss to let me take a computer lying around, install Mediawiki on it, and see what happened.

A few years later the concept of the intraweb wiki was widely adopted inside the organization and became supported by our infrastructure provider. Of course, I don't want to say that this happened because of my little experiment, but anyway, the wiki experiment is one of the things I feel proud about when looking back, and I think rightly so.

Fast forward to present day
The whole point of this small essay was of course not to praise myself, but to think about my current situation. I have a similar feeling about the project I'm working on now: the communication is not working, and we have clear issues - not with interpreting the requirements, but with having requirements in the first place. Just yesterday we were discussing a small implementation detail in our database model when one of us asked whether the whole feature was required at all, and why we were spending so much time on something that - even if required - would only be needed once (we are transitioning to a new model and were thinking about converting old data to the new schema).

The problems, as I see them, are the following (compare this to the E case):
  1. We do not have an Architect.
  2. Even if we did have an Architect, he would be sitting in London, and not available locally.
  3. We do have people in the business unit talking to us, but they prefer the telephone. That's just how it is. They also use email to formalize discussions held over the phone.
  4. We do have company-wide wiki infrastructure available, but business people don't use it.
  5. We are running around like headless chickens, to be honest. This is not strictly our fault, as the industry as a whole is like a headless chicken in general, and external requirements change frequently and quickly.

What can we do about this situation? That's what I was thinking about when Brooks and his telephone log popped into my mind. "Been there, done that, got the T-shirt..." So why not step up and try what Mr. Brooks recommended almost 40 years ago: start the telephone log, however funny it might sound in 2012. There are some things that do not change - the suit, for example. No matter what the actual fashion trend is, the businessman wears a suit; that's just how it is, and how it is going to be.

The current action points could be:
  1. Step up as an Architect, if there isn't one. To be fair, management has already communicated that we are expected to learn the business as well as we can. The learning process would probably be much quicker and easier if we had a dedicated architect, but for now that's not going to happen.
  2. We do need to initiate telephone calls, not just answer them. Some people are very busy and hard to catch, but eventually they will answer. To be fair, this has also been communicated to us by management. Finding the person who can answer our questions is the real challenge though.
  3. We should keep using the wiki, but also use it to record the phone conversations. This is new, and might take some time to get used to. We can try treating each call as a conference call, and the log as the minutes of the meeting afterwards.
The headless chicken situation is something we need to get used to. I don't really know what to do about it, other than trying to adopt a more agile development process and trying to improve code quality, which would hopefully save us time in the long run. This is really hard in a hit'n'run environment, but we need to shore up our weaknesses to improve...

9/17/2011

The Temporal World

Last week I had my biggest enlightenment experience since I set foot in the investment banking world (about six months ago). I had some smaller ones of course, but this latest one was really significant. Why did it take me so long to figure this out? Well, one of the first things you need to figure out is that you must figure out everything about how the earth moves beneath your feet by yourself. Do they not have trainings and inductions and such? Of course they do. Do they tell you anything really important? Absolutely not.
I can recommend one book, whether you work for an investment bank now or not (you never know what the future might bring): A Practical Guide to Wall Street: Equities and Derivatives. One of the first things this book will tell you is that practically the only way you learn on the trading floor is by watching and listening to other, more senior people, and that's how it's been for the past 100 years. I don't work on the trading floor, not even near it, but the principle still applies. My direct manager is local, but my manager's manager is in London, so unfortunately we don't meet too often. Unfortunately, because any time we spend together I learn something. Even the book I mentioned previously was recommended to me by him.
Reading that book will not enlighten you unless you are ready for it. You definitely need some experience to attach the enlightening parts to, otherwise the words will fall out of your head as soon as you read them. The book will help once you have experienced enough to feel that something is wrong, even if you don't know exactly what it is. And I hope the same will be true for some people reading this little post.

Time Is Linear

In our "normal" everyday world, that statement is true most of the time. We assume that whatever happened in the past happened already, and we cannot change it. We live in the present and the future is uncertain.
In the investment banking world the notion of time is quite different. Let me try to explain.
When a trader sells a product to a client, he will usually want to capture that fact immediately. One reason is that every trader has a limit on the intra-day risk he is allowed to take, and at the end of the day he must sign off his positions - effectively approving the trading he did during the day. The open positions are accounted for in a special intra-day risk system, which is updated continuously so that the trader is always up to date regarding his total risk. Of course this system can only work from data existing in the trading systems (sometimes called trade capture systems), so even if the trader is in a hurry and wants to get on with selling, he must enter the data required for the risk systems to work - such as the product he sold, the counterparty he sold it to, the quantity or some kind of notional amount, etc. But even some of that data need not be exact at this point.
For example, he must indicate that he sold 1000 options on IBM shares at a $170 strike price to Deutsche Bank. This is sufficient information for risk calculation, but definitely not sufficient for actually finalising the trade - for example, which legal entity of Deutsche Bank were these options sold to? It is not important at the inception of the trade, but this information must be added later - the trade has to be amended. This amendment could happen later in the day, or maybe the next day.
This can already cause some disturbance, because for the day-closing procedures this trade is in a temporary state, so it will most certainly not flow down to some back-end systems.
To continue this example, let's assume that even though the trader made the deal on Monday, the trade was only amended the next day, Tuesday. Everyone is happy and life goes on, right? Now, if someone asks WHEN the trade happened, one could say it happened (or more precisely "was created") on Monday, since that's the date of "inception". However, if you ask some trading systems, AS OF Monday the trade didn't exist, while AS OF Tuesday the trade had existed since Monday.
To make this kind of thing possible, you HAVE TO record every activity in your systems. The most important reason is that there is a very common question that anyone could ask at any time: tell me the state of the world AS OF some date in the past. So when we record a trade, we need the exact time of any action or amendment that happens. This means that we rarely update data in our databases; in most cases we insert new records and link them to the previous ones. In this case we'll have two rows for this trade:
  1. The first action that happened was the trade creation, with a counterparty name of Deutsche Bank.
  2. Next we amended the trade with the new information (the exact legal entity the deal was made with), so we inserted a new record with this information added.
It works pretty much the same way with trading activity and positions. Buys and sells are recorded with a timestamp attached to them, and if you are interested in the current position you have to replay the activities and sum them up as you move along. This is quite logical when you are aware that an event could be of interest, but not so trivial when the event is a change of an attribute. Is it important? Most probably yes, to someone, for some reason, so better record it - then when someone asks when and where the value of that attribute came from, you can tell them exactly.

In the trading systems I have seen so far, the record as a whole is simply copied from the earlier state, rather than capturing only the information that changed. So we will have two complete records of the trade, one with the legal entity missing and one filled out properly. Does this mean that the records are denormalized to some extent? Absolutely. But sometimes this is necessary to keep the data model from wreaking havoc. It is quite complicated even with these simplifications, believe me.
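
To make this concrete, here is a minimal Scala sketch of an append-only trade store with an AS OF lookup and a position computed by replaying activity. The field names and the modelling are mine, purely for illustration; real trade capture systems are obviously far richer than this:

    import java.time.Instant

    // Every amendment is a complete copy of the trade, stamped with the time it was recorded.
    case class TradeVersion(
      tradeId: String,
      recordedAt: Instant,          // when this version entered the system
      counterparty: String,         // "Deutsche Bank" at inception
      legalEntity: Option[String],  // the exact legal entity, filled in by a later amendment
      quantity: Long
    )

    case class Activity(tradeId: String, recordedAt: Instant, quantityDelta: Long)

    object TradeStore {

      // We never update in place; we only prepend new versions (history is kept newest-first).
      def amend(history: List[TradeVersion], newVersion: TradeVersion): List[TradeVersion] =
        newVersion :: history

      // The AS OF question: the latest version that had been recorded at or before time t.
      def asOf(history: List[TradeVersion], t: Instant): Option[TradeVersion] =
        history.find(v => !v.recordedAt.isAfter(t))

      // Positions are not stored; they are the sum of all activity replayed up to a point in time.
      def positionAsOf(activities: List[Activity], t: Instant): Long =
        activities.filter(a => !a.recordedAt.isAfter(t)).map(_.quantityDelta).sum
    }

With data stored this way, the Monday/Tuesday question above simply becomes two different arguments to asOf: on Monday the legal entity is still missing, on Tuesday it is filled in, and neither version is ever overwritten.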

The Past Is Uncertain

Sometimes - for some reason I'm not aware of - trades are made in the past. I mean that the two parties sit down together and decide that they will make a deal today but act as if it happened a month ago.
This is a perfectly valid scenario, and it does happen (as I learned the hard way), so what do we need to handle this kind of trading? We need two parallel dimensions of time, and this is called a bitemporal data model.
We need records with two dates attached to them:
  • the creation date, which is a timestamp recording when the action actually happened in the system,
  • the trade date, which says on what date the trade action was supposed to have happened.
This way we can express the situation above in our data model. These kinds of trades can cause some trouble if we ask the AS OF question again. Let's say that the trade was created on the 2nd of September but was backdated to the 30th of August. What happens if we need to make reports to regulatory organizations about our trading activity at the end of each month? The trade will not exist AS OF 31 August, so it will not be included in the August report. But in the September report it will appear with a trade date of 30 August... Luckily the regulatory organizations are aware of this scenario, so they don't get as confused as I did when I first saw these trades happening.
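
As a rough Scala sketch of how the two dimensions might look (again, the names and the report rule are invented here just for illustration):

    import java.time.{Instant, LocalDate}

    // Two parallel dimensions of time on every record.
    case class BitemporalTrade(
      tradeId: String,
      creationDate: Instant,  // when the record actually entered the system
      tradeDate: LocalDate    // when the trade is deemed to have happened
    )

    object MonthEndReport {
      // A trade makes it into a month-end report only if, as of the report cut-off,
      // the system already knew about it AND its trade date is on or before month end.
      def include(t: BitemporalTrade, monthEnd: LocalDate, reportCutOff: Instant): Boolean =
        !t.creationDate.isAfter(reportCutOff) && !t.tradeDate.isAfter(monthEnd)
    }

    // The backdated trade from the text: created on 2 September, trade date 30 August.
    // The August run does not see it (the system did not know about it yet), while the
    // September run includes it, carrying an August trade date.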



We do some reconciliation on our reports to make sure that the reported data is correct and consistent. Data comes from a gazillion loosely coupled systems and might be intentionally or unintentionally different for the same trade (remember the trade amendment example?), so reconciliation is necessary to find out which is the most up-to-date information. We must be prepared for scenarios like backdated trades.
One way to do reconciliation is to match this month's data to last month's data and try to find the same trades in both months. If a trade is found in both August and September then we are fine - if not, we must give an explanation as to what happened. For example, if we find a trade this month that has a trade date in this month, then it is a new trade, and we have an explanation. However, if the trade is new this month but its trade date is in last month, we must check whether the creation date is in this month - if it is, then it is a backdated trade, end of story. If not, we have a data quality issue and someone must go and investigate. And this is only one of the scenarios we have to deal with.
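
That matching logic is essentially a small decision tree; a simplified Scala sketch (names invented, and only this one scenario covered) might look like this:

    import java.time.LocalDate

    sealed trait Explanation
    case object MatchedBothMonths extends Explanation
    case object NewTrade          extends Explanation
    case object BackdatedTrade    extends Explanation
    case object DataQualityIssue  extends Explanation

    object Reconciliation {

      private def inMonth(d: LocalDate, someDayInMonth: LocalDate): Boolean =
        d.getYear == someDayInMonth.getYear && d.getMonth == someDayInMonth.getMonth

      // Explaining a trade that appears in this month's report.
      def explain(foundLastMonth: Boolean,
                  tradeDate: LocalDate,
                  creationDate: LocalDate,
                  thisMonth: LocalDate): Explanation =
        if (foundLastMonth) MatchedBothMonths
        else if (inMonth(tradeDate, thisMonth)) NewTrade           // trade dated this month: genuinely new
        else if (inMonth(creationDate, thisMonth)) BackdatedTrade  // dated earlier, but created this month
        else DataQualityIssue                                      // someone must go and investigate
    }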

The Future Is Uncertain

So we can change the past, alright. Can we change the future? Our bitemporal data model allows it, and sometimes in unexpected ways - even some seemingly trivial cases might not be that simple...
To stick with our regulatory reporting example, let's say we have a trade with an expiration date in the far future - say December 2012. We are in the year 2011 and are making a report of live trades AS OF August 2011 - this trade is live alright, and it will be for more than a year.
Then in the September reporting period the trade disappears. What happened to it? You guessed it already: the expiration date was changed during September, and the trade expired on 21 September. The rule says that if the trade is not live at the end of the month then we don't report it. So to a human observer looking at this month's data this has a simple explanation: the trade data is present this month, we simply decide not to report it to the regulatory organization.
But the way our reconciliation is written (err... the way I wrote it... back in those days when I was a greenhorn...) is like this. We try to find the trade in this month and in last month. If the trade is found in both, then we display this month's data and forget about it. However, if it is found in this month or last month only, then we cannot be certain whether the other month's data is available - after all, if it is a new trade this month, it will obviously be missing from last month. So I did the then-obvious thing and, for last-month-only trades, assumed that the explanations were to be applied to last month's data. Can you see my mistake here?
The example trade was fine in August, and its expiration date was far away. In September, however, it all changed, making it a last-month-only trade. But the explanation is based on information that was not available until late September, so looking at last month's data will tell us nothing. We need to use the latest information available, even if the AS OF date is in the past.
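
A sketch of the corrected rule, in the same invented Scala terms as above: when a trade shows up only in last month's report, explain it from the latest version we hold, not from last month's snapshot:

    import java.time.LocalDate

    case class TradeSnapshot(tradeId: String, expiryDate: LocalDate)

    object LastMonthOnly {
      // Wrong: explain from last month's snapshot - the September expiry amendment is invisible there.
      // Right: take the latest snapshot available and check what happened during this month.
      def explain(latest: TradeSnapshot, thisMonthEnd: LocalDate): String =
        if (!latest.expiryDate.isAfter(thisMonthEnd))
          "expired during the month - correctly dropped from the report"
        else
          "still live but missing from the report - data quality issue, investigate"
    }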

All this is just the beginning. How do you test an end-of-day batch in a bitemporal world? How do you find the most up-to-date data? Do you always need the latest up-to-date data? What happens if we insert an activity between two interconnected transactions? And so on.
The moral of the story - for me at least - is that even the most trivial things might need some consideration in the bitemporal world, and everyday logic may not work as you expect. Now that I'm aware of it, that's okay. But until you figure it out for yourself, you will be lost and will inevitably make some bad decisions.

Finally, a book recommendation for myself: Temporal Data and the Relational Model.

7/19/2011

My First Job Interview

After so many years of hard labour and grief on one side of the table, I have finally had the chance to sit on the other side for a change.
At this company there is a series of interviews that each and every candidate must go through, starting with the basic skills test in the language of his choice (e.g. Java or C#); then come the core competencies interview, the technical breadth interview, the placement interview...
Yesterday was the core competencies part. We always do it so that there are two interviewers who both ask questions, then after the interview they sit down together to fill out an evaluation form. One of the most important questions on the form is this: hire or no hire, and if hire, what experience level: junior / intermediate / senior? One can also say that the candidate is OK, but not for his team, so he/she is "recommended elsewhere".
Here are a few observations about this interview - on my part.
  • I did some preparation and thought of a problem to solve; unfortunately, I was not aware that this was the core competencies part, so my excellent database exercise was dismissed in favour of general Java questions. Luckily I have some of those up my sleeve at any time :-)
  • I was at least as excited as the candidate, if not more so. Asking good questions is not easy; judging the answers and asking follow-up questions is even harder.
  • One hour is a very short time, therefore the main exercise of the interview should be relatively simple so that the candidate has a chance to solve it in time. The main problem solving part fit inside the hour this time, but just.
  • The candidate did solve the problem in the end (with a few hiccups, but that is alright, given the stress factor), but I didn't like his solution. Why? Because I had solved this problem before, and my solution was different, so obviously I was expecting something along the lines of my solution. Well, the moral of the story is: be prepared, or narrow the question down so that the only acceptable solution is mine...
There are some lessons to be learned for the candidate as well:
  • It's almost shocking that I have to say this, but guys: if you apply for a job at a multinational (US-based) company, then your English communication skills should be up to the task! It was startling that when we switched to English during the interview, the candidate started to stutter and mumble, then said some words that did not make up a valid sentence - as a matter of fact, I thought about asking him to repeat that sentence. We (the interviewers) were not native speakers and shared the candidate's mother tongue, so trivial pronunciation and syntactical problems didn't even count. You MUST learn to communicate your thoughts at least on a basic level if you expect a career in this business. No exceptions!
  • This candidate applied for a senior associate position. Based on his education and experience, I expected him to sail through the boring elementary Java questions - like knowing what type erasure is, or - given his EJB background - why it is not appropriate to start a new thread in an EJB container. Well, he wasn't sure why he couldn't say T.class, and had only a gut feeling that starting a thread in an EJB container was not a good idea - but again, he couldn't tell why. This level of knowledge is acceptable for a junior - perhaps even an intermediate - position, but if you apply for a senior position you had better know these things! I can wholeheartedly recommend the famous SCJP study book, which covers all the basics and is a surprisingly good read.
  • The candidate solved the problem that involved writing a recursive function - even if we had to push him a little towards this solution. It was my bad that I didn't specify the conditions properly so he ingeniously used a mutable list instead of sticking to a classical immutable one. I will have to fine-tune this exercise for the future.
  • The candidate confessed that he doesn't use JUnit or write any other tests because "it is not in the company culture". Well, I'm sorry, but that is a bad excuse - for a senior guy, definitely. 
The verdict: we put the candidate somewhere between the junior and intermediate level, based on the lack of depth in his knowledge. A little preparation could easily have put him on firm intermediate ground, though... and the English communication problem was a very strong warning sign, so in the end we chose "recommended elsewhere", as our team has daily phone communication with the guys in London (who know nothing about programming, by the way).
My advice: take an English course for the remainder of the year, and read the SCJP book, then return for an intermediate position. Do not let your daily job pull you down and make you believe that just because you've been doing it for years there is nothing new to be learned...
Speaking of which, I will also refresh my Java knowledge, now that version 7 is almost out. Also Scala is gaining some significant momentum in the firm... can't wait to sit down to learn the JVM language of the future...

6/24/2011

DB2 Performance Tuning 101

The DB2 Performance Tuning class I attended this week was simply mind-blowing - so much information in so little time... going deep into details while still covering a lot: DB2's native data structures, the query optimizer, indexes, etc. It will take months if not years to fully digest everything. And I got all this for free; as a matter of fact, I was paid to go. Life is good sometimes.

I was thinking about some kind of homework - just to practice my newly acquired skills - and for some reason an article popped into my mind that I'd read a long time ago. Unfortunately it is in Hungarian, so there is really no point linking it here, but let me try to summarize its content briefly.

The author reasons about why it is sometimes better and more efficient to use direct ISAM data access and cursors (think: DBase and Clipper) instead of using SQL and any of the "fashionable" database products (Oracle, PostgreSQL, ...). He claims that if we take a query like

SELECT ACCNUM, BALANCE FROM ACC WHERE ACCNUM < '2%' ORDER BY ACCNUM DESC

and give it to Oracle & co., then these will always look up the accounts according to the criteria, put them in a temp table, sort the temp table, then return values from the temp table. They will not use indexes even if indexes exist, and there is no way to tell them otherwise - whereas in Clipper you use the index and get the first result record almost instantly, without waiting for a sort operation to complete. Conclusion: SQL is bad; ISAM, cursors and the procedural style are good.

Before raising your eyebrows, let me tell you the publishing date of this article: 2001. Ten years ago. I don't know which version Oracle was at in 2001, although I doubt the above was true even back then... anyway, how about DB2? Is it really true that it is worthless and that I should ditch my SQL knowledge and go learn something new? Let's see.

First, I wanted to verify that the claims above are true, i.e. that without any "hints" or optimization DB2 will behave "incorrectly". Let's say it out loud beforehand, though, that "incorrectly" is a relative term. One might optimize for fast retrieval of the whole result set (and this is often the case), in which case the "temp table + sort" behaviour can actually be the best strategy. The other choice is to optimize for fast retrieval of the first result, which - I acknowledge - is a valid use case sometimes. The point is that the user should be able to choose which is more important for the application, so there must be a way of telling the DB server which is our preference.

I used the following table for the demonstration:

CREATE TABLE "MYSCHEMA"."TITLES"  (
                  "TITLE_ID" VARCHAR(6) NOT NULL ,
                  "TITLE" VARCHAR(80) NOT NULL ,
                  "PUBDATE" TIMESTAMP NOT NULL ,
                  ...)

and also created an index on the title column:

CREATE INDEX "MYSCHEMA"."IDX_TITLES_TITLES" ON "MYSCHEMA"."TITLES"
                ("TITLE" DESC)

The TITLES table was filled up with realistic test data, 5000 rows - not too much, but sufficient for the demonstration. I used the DB2EXPLN tool to get the access plan the optimizer chose for each query.

First attempt:

Statement:
 
  select title
  from titles
  where titles.TITLE < 'D%'
  order by title desc

Estimated Cost = 95.942017
Estimated Cardinality = 973.332703

(    4) Access Table Name = MYSCHEMA.TITLES  ID = 6,10001
        |  #Columns = 1
        |  Avoid Locking Committed Data
        |  May participate in Scan Sharing structures
        |  Scan may start anywhere and wrap, for completion
        |  Fast scan, for purposes of scan sharing management
        |  Scan can be throttled in scan sharing management
        |  Relation Scan
        |  |  Prefetch: Eligible
        |  Lock Intents
        |  |  Table: Intent Share
        |  |  Row  : Next Key Share
        |  Sargable Predicate(s)
        |  |  #Predicates = 1
(    3) |  |  Insert Into Sorted Temp Table  ID = t1
        |  |  |  #Columns = 1
        |  |  |  #Sort Key Columns = 1
        |  |  |  |  Key 1: TITLE (Descending)
        |  |  |  Sortheap Allocation Parameters:
        |  |  |  |  #Rows     = 974.000000
        |  |  |  |  Row Width = 48
        |  |  |  Piped
(    3) Sorted Temp Table Completion  ID = t1
(    2) Access Temp Table  ID = t1
        |  #Columns = 1
        |  Relation Scan
        |  |  Prefetch: Eligible
        |  Sargable Predicate(s)
(    1) |  |  Return Data to Application
        |  |  |  #Columns = 1
(    1) Return Data Completion

End of section


Optimizer Plan:

   Rows  
 Operator
   (ID)  
   Cost  
        
 973.333
   n/a  
 RETURN 
  ( 1)  
 95.942 
   |    
 973.333
   n/a  
 TBSCAN 
  ( 2)  
 95.942 
   |    
 973.333
   n/a  
  SORT  
  ( 3)  
 95.9418
   |    
 973.333
   n/a  
 TBSCAN 
  ( 4)  
 95.7429
   |    
  5000  
   n/a  
 Table: 
 MYSCHEMA
 TITLES 

Wow. The article is still correct today - you can see the TABLE SCAN for our table, SORT, then TABLE SCAN again for the temp table. Remember, table scan means accessing the data pages sequentially - the "worst type" of scan from an optimization point of view. No sign of index usage. What is going on here?

After a little study we find out that the cardinality - the estimated number of records returned - is around 973 out of the 5000, which is a lot. Since the data is not stored sorted by title, it is very likely that almost every page containing the data would have been touched anyway - even when using the index - so the optimizer decided it was not worth using the index and simply read all the pages sequentially.
To verify that this was happening let's see a smaller result set:

Statement:
 
  select title
  from titles
  where titles.TITLE < 'C%'
  order by title desc

Estimated Cost = 78.646866
Estimated Cardinality = 545.130737

(    2) Access Table Name = MYSCHEMA.TITLES  ID = 6,10001
        |  Index Scan:  Name = MYSCHEMA.IDX_TITLES_TITLES  ID = 1
        |  |  Regular Index (Not Clustered)
        |  |  Index Columns:
        |  |  |  1: TITLE (Descending)
        |  #Columns = 1
        |  Avoid Locking Committed Data
        |  #Key Columns = 1
        |  |  Start Key: Exclusive Value
        |  |  |  |  1: 'C%'
        |  |  Stop Key: End of Index
        |  Index-Only Access
        |  Index Prefetch: None
        |  Lock Intents
        |  |  Table: Intent Share
        |  |  Row  : Next Key Share
        |  Sargable Index Predicate(s)
(    1) |  |  Return Data to Application
        |  |  |  #Columns = 1
(    1) Return Data Completion

End of section


Optimizer Plan:

        Rows  
      Operator
        (ID)  
        Cost  
             
      545.131
        n/a  
      RETURN 
       ( 1)  
      78.6469
        |    
      545.131
        n/a  
      IXSCAN 
       ( 2)  
      78.6469
        |         
         0        
 Index:           
 MYSCHEMA          
 IDX_TITLES_TITLES

Here you go! TITLE < 'C%' instead of TITLE < 'D%', a cardinality of 546 and the optimizer gave in, and started to use the index with an index scan. There is no data page access this time since the column we are asking for is in the index.

So what can we do if we need the first result immediately for TITLE < 'D%', and cannot wait for the sort? There is actually a way to give this hint to the DB server, and in DB2 the syntax is

SELECT TITLE FROM TITLES WHERE TITLES.TITLE < 'D%' ORDER BY TITLE DESC OPTIMIZE FOR N ROWS

This is DB2 syntax. Of course this functionality exists in other DB servers as well, for example:
  • in Oracle, you say FIRST_ROWS(n),
  • in MS SQL Server, you say FASTFIRSTROW,
and so on. For MySQL I don't know the syntax; perhaps you could try index hints.
Does it work?

Statement:
 
  select title
  from titles
  where titles.TITLE < 'D%'
  order by title desc
  optimize
  for 100 rows


Estimated Cost = 134.477875
Estimated Cardinality = 973.332703

(    2) Access Table Name = MYSCHEMA.TITLES  ID = 6,10001
        |  Index Scan:  Name = MYSCHEMA.IDX_TITLES_TITLES  ID = 1
        |  |  Regular Index (Not Clustered)
        |  |  Index Columns:
        |  |  |  1: TITLE (Descending)
        |  #Columns = 1
        |  Avoid Locking Committed Data
        |  #Key Columns = 1
        |  |  Start Key: Exclusive Value
        |  |  |  |  1: 'D%'
        |  |  Stop Key: End of Index
        |  Index-Only Access
        |  Index Prefetch: None
        |  Lock Intents
        |  |  Table: Intent Share
        |  |  Row  : Next Key Share
        |  Sargable Index Predicate(s)
(    1) |  |  Return Data to Application
        |  |  |  #Columns = 1
(    1) Return Data Completion

End of section


Optimizer Plan:

        Rows  
      Operator
        (ID)  
        Cost  
             
      973.333
        n/a  
      RETURN 
       ( 1)  
      134.478
        |    
      973.333
        n/a  
      IXSCAN 
       ( 2)  
      134.478
        |         
         0        
 Index:           
 MYSCHEMA          
 IDX_TITLES_TITLES

You bet! We gave the hint to the query optimizer, and now it works as we expect. Note, though, that the cost using the index is about 134 timerons, whereas with the full table scan it was only about 96. The moral of the story is that the optimizer is probably cleverer than you, so do not trick it unless you know what you are doing...


You might say that I cheated with the query because I only asked for a column that was in the index. And you would be right, so let's modify it:


Statement:
 
  select title, pubdate
  from titles
  where titles.TITLE < 'C%'
  order by title desc
  optimize
  for 100 rows

Estimated Cost = 95.847122
Estimated Cardinality = 545.130737

(    4) Access Table Name = MYSCHEMA.TITLES  ID = 6,10001
        |  #Columns = 2
        |  Avoid Locking Committed Data
        |  May participate in Scan Sharing structures
        |  Scan may start anywhere and wrap, for completion
        |  Fast scan, for purposes of scan sharing management
        |  Scan can be throttled in scan sharing management
        |  Relation Scan
        |  |  Prefetch: 166 Pages
        |  Lock Intents
        |  |  Table: Intent Share
        |  |  Row  : Next Key Share
        |  Sargable Predicate(s)
        |  |  #Predicates = 1
(    3) |  |  Insert Into Sorted Temp Table  ID = t1
        |  |  |  #Columns = 2
        |  |  |  #Sort Key Columns = 1
        |  |  |  |  Key 1: TITLE (Descending)
        |  |  |  Sortheap Allocation Parameters:
        |  |  |  |  #Rows     = 546.000000
        |  |  |  |  Row Width = 56
        |  |  |  Piped
(    3) Sorted Temp Table Completion  ID = t1
(    2) Access Temp Table  ID = t1
        |  #Columns = 2
        |  Relation Scan
        |  |  Prefetch: Eligible
        |  Sargable Predicate(s)
(    1) |  |  Return Data to Application
        |  |  |  #Columns = 2
(    1) Return Data Completion

End of section


Optimizer Plan:

   Rows  
 Operator
   (ID)  
   Cost  
        
 545.131
   n/a  
 RETURN 
  ( 1)  
 95.8471
   |    
 545.131
   n/a  
 TBSCAN 
  ( 2)  
 95.8471
   |    
 545.131
   n/a  
  SORT  
  ( 3)  
 95.8469
   |    
 545.131
   n/a  
 TBSCAN 
  ( 4)  
 95.7429
   |    
  5000  
   n/a  
 Table: 
 MYSCHEMA
 TITLES 


Notice how the query that was working with the index earlier reverted back to the table scan even with the optimizer hint thrown in. It seems that we need to lower the bar.


Statement:
 
  select title, pubdate
  from titles
  where titles.TITLE < 'C%'
  order by title desc
  optimize
  for 10 rows

Estimated Cost = 746.309631
Estimated Cardinality = 545.130737

(    2) Access Table Name = MYSCHEMA.TITLES  ID = 6,10001
        |  Index Scan:  Name = MYSCHEMA.IDX_TITLES_TITLES  ID = 1
        |  |  Regular Index (Not Clustered)
        |  |  Index Columns:
        |  |  |  1: TITLE (Descending)
        |  #Columns = 2
        |  Avoid Locking Committed Data
        |  Evaluate Predicates Before Locking for Key
        |  #Key Columns = 1
        |  |  Start Key: Exclusive Value
        |  |  |  |  1: 'C%'
        |  |  Stop Key: End of Index
        |  Data Prefetch: None
        |  Index Prefetch: None
        |  Lock Intents
        |  |  Table: Intent Share
        |  |  Row  : Next Key Share
        |  Sargable Predicate(s)
(    1) |  |  Return Data to Application
        |  |  |  #Columns = 2
(    1) Return Data Completion

End of section


Optimizer Plan:

               Rows  
             Operator
               (ID)  
               Cost  
                    
             545.131
               n/a  
             RETURN 
              ( 1)  
             746.31 
               |    
             545.131
               n/a  
              FETCH 
              ( 2)  
             746.31 
            /       \
      545.131        5000  
        n/a           n/a  
      IXSCAN        Table: 
       ( 3)         MYSCHEMA
      78.6469       TITLES 
        |         
         0        
 Index:           
 MYSCHEMA          
 IDX_TITLES_TITLES
 

Next to the index scan we now also have data fetched from the data pages - but the index is used, never mind the cost (about 746 instead of the original 96). You had better need that first result for something really useful, because otherwise you have just de-optimized your query to run almost 10 times slower. Was it worth it? It depends.

To sum up, I think that before deciding whether to use something or to reinvent the wheel, it is worth considering the time and effort spent on creating that thing. DB2 and Oracle have been around since the late 70s, so there are 30 years of experience, sweat and tears that went into developing these systems. Are you brave enough to say that you can do better than thousands of developers did? Do you have 30 years to let your product mature? Are you man enough to be the black sheep? If not... well, you get my point.

2/08/2011

A quick tip on improving productivity in the "Bermuda Triangle"

The Bermuda Triangle is of course: Netbeans, Glassfish and Maven. These three basic development tools and/or environments can boost your productivity and flatten the learning curve significantly. However, I have not had a very good experience with hot deployment in this triangle: Netbeans can hot deploy to Glassfish, but only using its own Ant-based internal projects. Maven projects work - sort of - but during development they can be a real pain in the *ss. I have not been able to consistently explain to Netbeans the structure and inter-dependencies of my Maven projects; it could have been my fault, but "clean and build" would only sometimes work as expected (re-building projects from source instead of simply re-packaging them from the local repo), and telling Maven to re-build and deploy a *single* class in the EAR... no, that's way out of my league.

But thanks go to my colleague, who found the ultimate solution to this problem, and it only takes a few extra minutes (or hours?) to get things going as smoothly as possible. It is really trivial: first create the Maven projects as you normally would. Then do NOT import these projects into Netbeans, but re-create the whole project structure from scratch as Netbeans projects... that's all you need to enjoy the convenience of both Maven and hot deployment. Well, yes, you'll need to keep the two systems synchronized manually, but it will pay off very quickly, probably on the first day.

And then you discover Apache Ivy... and scratch your head and think "oh my... not again..." :-)