Medallia Blog

August 18, 2006

Some subtle and not-so-subtle Java bugs

For you Java programmers out there, there are some buggy code snippets below the fold -- can you spot them all? (spoilers here)

One

Why is this wrong?
    public static String escapeSql(String str) {
        if (str == null) {
            return null;
        }
        return StringUtils.replace(str, "'", "''");
    }
(source: Apache Commons "lang" package)

Two

There are at least two four things wrong here.
package com.initech.web.fluff;
public class StringUtils {
    private static final SimpleDateFormat UTC_FORMAT = new SimpleDateFormat("yyyyMMdd'T'HH:mm:ss");

    public String formatUtcDate(Date d) {
        return UTC_FORMAT.format(date);
    }
}
(source: lots)

Three

This code is supposed to protect the 'override' flag from being set if 'obj' is a constructor for the 'Class' class. Can you set it anyway, without using reflection? "Three" is a hint, by the way.
class AccessibleObject {
    private boolean override;

    /* Check that you aren't exposing java.lang.Class.<init>. */
    private static void setAccessible0(AccessibleObject obj, boolean flag)
	throws SecurityException
    {
	if (obj instanceof Constructor && flag == true) {
	    Constructor c = (Constructor)obj;
	    if (c.getDeclaringClass() == Class.class) {
		throw new SecurityException("Can not make a java.lang.Class" +
					    " constructor accessible");
	    }
	}
	obj.override = flag;
    }
(source: JDK 1.5)

August 14, 2006

Summer intern in Silicon Valley

Even though Medallia recently opened an office in Oslo, just where I live, I opted to do my summer internship in Silicon Valley. There aren't many places as exciting as California. When you get a chance to go there, how can you say no?

    During this summer, I have:
  • Biked the Golden Gate Bridge and beyond
  • Toured Stanford Campus
  • Eaten delicious crab at Pier 39, Fisherman's Wharf as well as enjoyed Indian, Japanese, Korean, Mexican, Pakistani, Greek, Iranian, Italian and Thai food.
  • Rock climbed outdoor on Fresno Dome (7,500 feet)
  • Jumped out of an aircraft at 14,000 feet
  • Gone hiking in Yosemite National Park
  • Cruised around in Kristian Eide's awesome Corvette
  • Designed and implemented a system for catching people who cheat on Medallia's surveys, while interacting with the world-class team of engineers at Medallia

The work was as challenging as the spare time was fun. Among the algorithms that eventually made it into the system are: Shortest Path, Matrix Inversion and Great Circle Distance (!). While these algorithms were fun to implement, the real challenge was the everyday software engineering needed to make the system robust, general and easy to use.

The most interesting part of the project, as I see it, is the way that we abstracted the process of information retrieval. To illustrate this, consider the task of finding the geographical location of a customer. It can be done in several ways. For instance, we could write a findLocation method which does this:

SELECT latitude, longitude FROM customer_table WHERE customer_id='foo'

The problem is that, more often than not, those fields will be empty. Another approach is this:

SELECT phone_area_code FROM customer_table WHERE customer_id='foo' SELECT latitude, longitude FROM area_codes WHERE area_code=phone_area_code

Now if we are lucky, this approach will give a location close to what we want. It is also possible to obtain the information in a way that is less likely to give a wrong answer:

SELECT zip_code FROM customer_table WHERE customer_id='foo' SELECT latitude, longitude FROM zip_codes WHERE zip_code=zip_code

These may also fail. For instance, the wrong zip code or phone number could have been entered, or they may not be entered at all, or just some dummy value is there. A fourth option is to do this:

SELECT ip_address FROM customer_table WHERE customer_id='foo' SELECT latitude, longitude FROM ip2location WHERE ip_address=ip_address

Now at this point the code is getting rather big, with lots of ifs. It is not very general or easy to maintain either, because the findLocation method has to deal with several different databases. If we come up with another way to derive the wanted information, we need to hardcode that one as well. If one of the databases change, we have to change references from all the findWhatever methods.

There is also the non-trivial task of determining in which order to execute the four approaches.

The solution that I implemented consists of the following:

A set of information sources. Each information source can provide a certain set of information. The only input of the information sources is the customer_id.
A set of information converters. Each information converter has an input information type and and output information type as well as a cost.
An algorithm that glue the sources and converters together into an information network that can be queried for some type of information. This will typically be a shortest path algorithm.
To complete the task described, we need only do the following:

informationNetwork.get('foo', Type.LATITUDE_LONGITUDE);

The path that this request takes through the network is now abstracted away. If I somehow find a service that will take a street address and turn it into a super-accurate location (latitude, longitude), I need only write a new converter and the network will use it if necessary. If I remove one information converter, I need not change any code. A change in the structure of one of the databases will likely only affect a specific source or converter, so I don't need to change code all over the place.

Another interesting part of the project is the classification of survey responses into cheaters and non-cheaters. It is based on a set of rules, each assigning a value to the survey response. In the end, the output from each rule is multiplied by a weight, and the resulting sum is compared to a threshold.

The problem is how to assign the optimal weight to each rule, in order to minimize the number of false positives and false negatives. The chosen approach uses a set of surveys that we have already by hand classified into cheaters and non-cheaters. Based on these 'learning points', the weights and the threshold are automatically computed. This allows us to train the system by adding new data.

Another nice side effect of this approach is that, given a large enough set of 'learning points', it is easy to assess new rules. Rules which are assigned very low weights are usually badly formulated, or could be plagued by implementation bugs. Rules with high weights are doing their job. The number of mis-classifications in the sample data can also be monitored over time and can be used as an indicator of how well the system performs.

All in all, it's been an awesome summer internship! I'd like to thank Juan Pablo for his guidance and enthusiasm during this project. I'd also like to thank Kristian Eide for very rewarding discussions and insights into software development, and for the idea of the information network. And for taking me to adventurous places in California. Also, thanks to Masoud Talebian for his insights in statistics and for the learning-algorithm, which made this project so much cooler! And thank you to all the other people at Medallia who have made the internship such a rewarding experience with their insights, enthusiasm, humour and friendliness!