Medallia Blog

November 08, 2006

Aspirin for headache

A while ago a big pharmaceutical company wanted to expand its business to the Middle East, since its market research showed that there was a large untapped market there for headache medication. The only problem was that many people there could not read, so the company's normal advertising would not be effective. Then someone came up with the idea of using this visual ad:

Aspirin reversed

They thought this was a brilliant idea and went ahead, putting it up on numerous big billboards. After a few weeks, however, sales were still very slow and they could not figure out why. Only after asking a few people why they were not interested did it dawn on them.

Arabic is read from right to left:

Aspirin normal

I was recently given the task of writing a component for sending out email. Now, JavaMail works pretty well, but you still need to have a mail queue, delivery threads, error detection and retrying. Thinking that this surely must be a solved problem, I went in search of a library with an implementation I could use and came across one called Aspirin, an embeddable Java SMTP server. The author wrote that it was so named because JavaMail is such a headache to use and Aspirin eases the pain by making it easy. Great, I thought; my headache is solved!

Oh boy was I in for a surprise.

After integrating Aspirin into my email component I wrote a test case to make sure everything worked nicely, and mail added to Aspirin’s mail queue was indeed delivered. After doing some stress testing with multiple threads, however, I suddenly got this:

java.util.ConcurrentModificationException

This was of course bad news; a mail queue had better be thread safe! I dove into the Aspirin source code to see what was going on, and it did not take long to find the smoking gun. The mail queue was implemented as a Vector; I can only guess it was chosen because it is synchronized. However, email was retrieved from the queue in a (synchronized) method which first sorted the vector in place (so the queue was sorted on every retrieval, taking O(n² lg n) time to empty it) and then iterated over it using an Iterator. It is explicitly stated in the Javadoc for Vector that this is not thread safe without external synchronization, which I guess the author hoped making the method synchronized would achieve. Clearly the author has no clue as to how synchronization in Java works. Did I mention that after the first time you add mail to the queue you must wait a bit, otherwise you get an IllegalThreadStateException?
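To make the locking issue concrete, here is a minimal sketch. It is not Aspirin's actual code; the class and method names are made up for illustration, and the exact way the library's threads reached the Vector is an assumption. The point it shows is general: a synchronized method locks the object that owns the method, while Vector's own methods lock the Vector, so an iteration is only protected if it holds the same lock that every writer takes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.ConcurrentModificationException;
import java.util.List;
import java.util.Vector;

// Hypothetical stand-in for a Vector-backed mail queue; not Aspirin's actual code.
class VectorQueueSketch {
    private final Vector<String> queue = new Vector<>();

    // Vector.add locks the Vector object itself.
    void add(String mail) {
        queue.add(mail);
    }

    // Holds the Vector's own monitor for the whole sort-and-iterate pass, so no
    // other thread can call add() on the Vector while the iterator is in use.
    List<String> sortedSnapshot() {
        synchronized (queue) {
            Collections.sort(queue);
            List<String> copy = new ArrayList<>();
            for (String mail : queue) {
                copy.add(mail);
            }
            return copy;
        }
    }

    // Single-threaded, deterministic version of the failure: a structural change
    // between two iterator steps makes the fail-fast iterator throw.
    static boolean failsWhenModifiedDuringIteration() {
        Vector<String> v = new Vector<>(Arrays.asList("a", "b", "c"));
        try {
            for (String s : v) {
                v.add(s + "-copy"); // modifies the Vector mid-iteration
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }
}
```

In a multi-threaded run the same exception appears only intermittently, which is why it tends to show up under stress testing rather than in a simple test case.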

After reading a bit more of the source code it soon became clear that this library should not be used for anything other than testing or demo purposes; there are many more smoking guns and gotchas hidden in there, and I will not even bother listing the ones I found.

I instead ended up writing my own mail queue wrapper around JavaMail, which is completely thread safe, has O(lg n) add and retrieval time and gives a lot more control over what happens to emails after they are added to the queue (I want to keep track of the status of each mail, so that in case the JVM is suddenly stopped I do not lose any mail).
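As a rough illustration of the idea (not the actual wrapper, whose code is not shown here), a thread-safe queue with O(lg n) add and retrieval can be sketched on top of java.util.concurrent.PriorityBlockingQueue, which is backed by a binary heap. Every name below is hypothetical, ordering mail by earliest delivery time is an assumption, and the status map lives only in memory; surviving a JVM stop would additionally require writing each status change to durable storage, which is left out.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of a thread-safe mail queue with per-mail status tracking.
class MailQueueSketch {
    enum Status { QUEUED, SENDING, SENT, FAILED }

    static final class Entry implements Comparable<Entry> {
        final long id;
        final long notBefore; // earliest delivery time, in milliseconds
        final String message;

        Entry(long id, long notBefore, String message) {
            this.id = id;
            this.notBefore = notBefore;
            this.message = message;
        }

        // Earliest delivery time first; ties are broken by insertion order.
        @Override
        public int compareTo(Entry other) {
            int byTime = Long.compare(notBefore, other.notBefore);
            return byTime != 0 ? byTime : Long.compare(id, other.id);
        }
    }

    private final PriorityBlockingQueue<Entry> heap = new PriorityBlockingQueue<>();
    private final Map<Long, Status> status = new ConcurrentHashMap<>();
    private final AtomicLong nextId = new AtomicLong();

    // O(lg n): records the status, then inserts into the heap.
    long add(String message, long notBefore) {
        long id = nextId.incrementAndGet();
        status.put(id, Status.QUEUED);
        heap.add(new Entry(id, notBefore, message));
        return id;
    }

    // O(lg n): removes the entry with the earliest delivery time, or returns
    // null if the queue is empty. No sorting is needed on retrieval.
    Entry poll() {
        Entry next = heap.poll();
        if (next != null) {
            status.put(next.id, Status.SENDING);
        }
        return next;
    }

    // Called by a delivery thread once the send attempt has finished.
    void markDone(long id, boolean delivered) {
        status.put(id, delivered ? Status.SENT : Status.FAILED);
    }

    Status statusOf(long id) {
        return status.get(id);
    }
}
```

Because the heap keeps itself ordered on every insert, emptying a queue of n mails costs O(n lg n) in total instead of re-sorting before each retrieval.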

The moral of the story is that when you use external libraries you should spend a few minutes examining the source code, and if it looks suspect do not use it. You will save yourself a big headache.

November 02, 2006

Lies, Damn Lies, and Statistics

Medallia has, as part of its product portfolio, an advanced data analysis tool which is the main interface we provide our customers for looking at the data we collect for them. This tool is a web-based application and can answer questions such as “show me, for each question asked, the percentage of females who gave me a top 10% score grouped by income level for the past quarter, and show the percent change since the same quarter of last year” in less than a second, even when the number of surveys is in the millions. “Yeah, yeah, and I am sure it makes coffee as well,” you are probably thinking, but this is actually important for what I am about to explain.

Since we provide a web application to our customers that we host ourselves, we are naturally interested in how it is being used. A while ago one of our account managers asked me to compile a report for her on how many times people from a specific client had logged in over the last week, and I quickly realized that this would not be the last such request. As much as I enjoy grepping through logfiles (or even writing a small Perl script to do it for me), I would much rather spend my time developing new features, so then I got an idea. Since we already have an advanced reporting application that churns through millions of records in less than a second while making coffee, why not try to use it to do our usage statistics as well? How much work would it take? Not much, as it turns out!

First I had to decide what kind of statistics I wanted to collect, and came up with this list:

  • Web browser
  • Operating system
  • Screen resolution
  • Color depth
  • SSL used
  • Cookie support
  • JavaScript support
  • Flash version
  • IP address logged in from
  • Number of views for each page
  • Average and max time looked at each page
  • Average and max time to generate each page

So how much work did it actually take? First I had to write a User-Agent parser since I could not find one, which required about 85 lines plus 50 lines for the unit test (which is basically just a list of lots of User-Agent strings taken from Wikipedia). The JavaScript for determining screen resolution, color depth and Flash version came out to about 15 lines, while the rest of the information was already available. But how much code did it take to integrate all this into the reporting application? About 200 lines – the whole thing took about half a day for the initial version (I later added asynchronous committing to the database since the user should not have to wait for this, which added about 100 lines).
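The parser itself is not reproduced here, but a minimal sketch of the approach might look like the following. The class name, the patterns and the small set of recognized browsers and operating systems are all assumptions chosen for illustration; a real parser needs many more cases.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical, deliberately small User-Agent parser sketch.
class UserAgentSketch {
    private static final String VERSION = "(\\d+(?:\\.\\d+)*)";
    private static final Pattern OPERA = Pattern.compile("Opera[/ ]" + VERSION);
    private static final Pattern FIREFOX = Pattern.compile("Firefox/" + VERSION);
    private static final Pattern MSIE = Pattern.compile("MSIE " + VERSION);

    // Returns a browser name and version, or "Unknown".
    static String browser(String userAgent) {
        // Opera is checked first because it can also announce itself as MSIE.
        Matcher m = OPERA.matcher(userAgent);
        if (m.find()) {
            return "Opera " + m.group(1);
        }
        m = FIREFOX.matcher(userAgent);
        if (m.find()) {
            return "Firefox " + m.group(1);
        }
        m = MSIE.matcher(userAgent);
        if (m.find()) {
            return "Internet Explorer " + m.group(1);
        }
        return "Unknown";
    }

    // Returns a coarse operating system family, or "Unknown".
    static String operatingSystem(String userAgent) {
        if (userAgent.contains("Windows")) {
            return "Windows";
        }
        if (userAgent.contains("Mac OS X") || userAgent.contains("Macintosh")) {
            return "Mac OS";
        }
        if (userAgent.contains("Linux")) {
            return "Linux";
        }
        return "Unknown";
    }
}
```

A unit test for such a parser is then little more than a table of User-Agent strings and the browser and operating system expected for each.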

So now I can produce reports such as:

  • The number of logins using Firefox 1.5 or higher grouped by the percentage using each operating system over the last 6 months, and the change since the 6 months before that
  • A graphical view of the browser distribution grouped by month over the last 12 months
  • Who spent at least one minute looking at the Profiler report last week
  • How much the different parts of our application are being used, and which types of user (e.g. corporate, region manager, individual property manager) are using the different reports

These reports can be quite useful in making business decisions. For example, does enough of our user base have Flash enabled that we could use it to play back voice files, or is a browser used enough that we need to have it be one of the browsers we QA our site with?

I can also produce nice graphs like this one, which shows the total number of logins (y-axis scale has been omitted to protect the innocent):

medallia-logins-graph.png

At the start of this post I mentioned that the system is designed to handle millions of records while still doing real-time calculations, and this quickly becomes important since we can expect thousands or even tens of thousands of logins as we scale the system up. Fortunately, even 50,000 logins per month over the next 5 years (60 months) is only a total of 3 million logins, so scaling should not be an issue.

So how come a reporting application designed for people taking surveys could be made to do all kinds of statistics on login sessions in half a day with only a few hundred lines of code? Good abstractions and reusable code. We already had a unit test which created some questions, created the possible answers, created a record with some made-up answers, injected it into the OLAP engine, did some calculations and finally checked the result. Most of the time was thus spent on actually collecting the required information, and it was straightforward to adapt the code from the unit test to create the data structures on the fly and inject the data. The reporting application itself is completely configurable (since we need to customize it for each industry vertical as well as for different clients), and now the account managers can simply look at the usage statistics for themselves. Actually, even our clients can use it, thus offloading our account managers. And I had much more fun than I would have had tweaking regular expressions in Perl (fun as that is).