Saturday, March 21, 2015

Digging Deeper Into Java's HashMap

This post is about Java's HashMap. Even though it is filed under the Beginning Java tag, it won't be an introductory post on how to begin using HashMap: its API or the common tasks you can accomplish with it. It is rather a peek into how HashMap works and how it is implemented, with the hope that knowing this will provide more insight into what goes on whenever a HashMap is being used.

I would wager that HashMap is probably one of the most frequently used data structures when programming in Java. The functionality it provides, the ability to store stuff in pairs, is a universal data structure need, not unique to Java. This facility goes by other names elsewhere: associative array, symbol table, dictionary or map. And that is what HashMap is: a map implementation based on the concept of hashing. But what do we mean by hashing?

Hashing is basically a transformation process: it takes a value and transforms it into another form which is more desirable (shorter, easier to remember, etc.) but can still be used to represent the original value.



One such transformation, applied to a list of names, could be to take the sum of the numeric positions of the letters that make up the name. With A being 1, B being 2... and Z being 26, applying such a transformation we could have:

ADE-> 1 + 4 + 5 = 10
OLA -> 15 + 12 + 1 = 28
BODE -> 2 + 15 + 4 + 5 = 26

This is an example of hashing. And this simple, trivial, most likely not-applicable-in-real-life hashing function may be desirable if we want a numeric representation of the names. But without looking that hard, we can easily find ourselves in a situation where this transformation produces the same output for different inputs. For example, it won't take long to run into our regular JOE, which translates to 30 (10 + 15 + 5), exactly the same as EDU (5 + 4 + 21).

This situation is called a hash collision, and it can occur even with sophisticated hashing algorithms...

...but how does this relate to HashMap? Continue reading...

What we will do now is quickly go over some of the basic operations performed with a HashMap, touching on the details of HashMap's internals in the process:

How HashMap's Insertion Works.

HashMap is backed by an array, meaning the information is actually stored internally in an array. And since an array in Java is basically a block of memory that holds values of a single type, with its contents retrievable by their numeric positions (indexes), the question then is: how is it possible for the contents of a HashMap, which consist of pairs, to be stuffed into an array, which can only hold values of a single type?

Simple. This is where the hashing comes in.

HashMap's hash function is applied to the key, producing a numeric value which is used as the insertion point (the array index) in the backing array, and the value stored at that position represents both the key and the value in the form of a Map.Entry object. The fact that both the key and the value are stored in the backing array as a Map.Entry object (and not just the value) helps explain why iterating through a HashMap works the way it does, and also helps in understanding how the hash collisions mentioned above are resolved within HashMap. But before we touch on those topics, let us take a much closer look at the insertion process:

Map<String, String>  books = new HashMap<>();
books.put("id#1", "Ake");

From the viewpoint of the programmer, the above code stores a pair of information: an id and a book name (the id being the key and the name of the book being the value) into a HashMap. Internally, this is what happens:

The hash code of the key is retrieved. In this case the hash code would be the value returned by Java's String implementation of the hashCode method. If the object being used as the key does not provide its own hashCode implementation, the implementation found in Java's Object class is used. Basically, Java's hashCode method is a mechanism which allows an object to return a numeric value that serves as part of the object's identity; ideally different objects produce different hash codes, although that is not guaranteed.

So the key's hash code is retrieved, hashed into a numeric value by applying HashMap's own hashing function, and that value then determines the exact position in the backing array at which the value (together with the key) is stored. This is how the insertion process works.
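To make this concrete, here is a minimal sketch of how a bucket index can be derived from a key's hash code. It is modelled on what OpenJDK 8's HashMap does (spreading the high bits of the hash code and masking with capacity - 1), but BucketIndexDemo and bucketIndex are made-up names used purely for illustration:

public class BucketIndexDemo {

    // Spread the high bits of the hash code into the low bits, then mask with
    // (capacity - 1). This assumes capacity is a power of two, which HashMap
    // maintains internally.
    static int bucketIndex(Object key, int capacity) {
        int h = key.hashCode();
        int spread = h ^ (h >>> 16);
        return spread & (capacity - 1);
    }

    public static void main(String[] args) {
        // With the default capacity of 16, the key "id#1" lands in one of slots 0..15
        System.out.println(bucketIndex("id#1", 16));
    }
}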

Map.Entry and Collision Resolution.

What is Map.Entry, and how does it help in dealing with a situation where two keys (even though they are different) produce the same hash? The HashMap implementation must have a way to resolve this situation; if not, we would have a case where data gets overwritten internally.

Collision resolution becomes clear once we see a Map.Entry as a node in a linked list. We have established that Map.Entry values are stored internally in an array; the next step is to see each Map.Entry value for what it is: a node in a linked list.

Internally the key and value being stuffed into a HashMap are converted into a Map.Entry object. Map.Entry is actually an interface which is implemented within HashMap as a Node<K,V> class.
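For reference, here is an abridged view of that Node<K,V> class as it appears in the OpenJDK 8 HashMap source (constructor and methods omitted):

static class Node<K,V> implements Map.Entry<K,V> {
    final int hash;     // the (spread) hash of the key
    final K key;
    V value;
    Node<K,V> next;     // reference to the next node in the same bucket
    // constructor, getKey(), getValue(), setValue(), equals(), hashCode() omitted
}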



Taking a look at the code snippet, it can be seen that Node<K,V> is a kind of linked-list node: each Node holds a reference to the next Node, as can be seen in the field Node<K,V> next.

So basically, what happens is that the key and value being stuffed into a HashMap are converted into a Map.Entry object (a linked-list node) and stored at a location in the backing array. But because each Map.Entry is a linked-list node, if two keys happen to hash to the same index, the Map.Entry of the second key does not overwrite the first; instead it becomes the value stored in the next variable of the first entry.

This is how the collision is resolved.

So when the get method is used with a key whose hash points to a location in the backing array holding more than one Map.Entry, the correct Map.Entry can still be retrieved by comparing the supplied key (using equals) with the key of each Map.Entry in the chain.
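A minimal sketch of that lookup logic could look like the following. It uses a simplified Entry node and a simplified index calculation, so it is illustrative rather than a copy of the JDK implementation:

import java.util.Objects;

class ChainedLookupDemo {

    // Simplified bucket node: key, value and a link to the next node in the chain
    static class Entry<K, V> {
        final K key;
        V value;
        Entry<K, V> next;

        Entry(K key, V value, Entry<K, V> next) {
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    static <K, V> V lookup(Entry<K, V>[] table, K key) {
        int index = (table.length - 1) & key.hashCode(); // simplified index calculation
        for (Entry<K, V> e = table[index]; e != null; e = e.next) {
            // The hash only locates the bucket; equals() picks out the exact
            // entry when several keys ended up in the same bucket.
            if (Objects.equals(key, e.key)) {
                return e.value;
            }
        }
        return null; // no mapping for this key
    }
}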

How Iterating Through a HashMap Works

There are a number of ways of iterating through the contents of a HashMap:

Getting the keys as a set and iterating through it:

Map<String, Integer> map = new HashMap<>();

//Get the keys as a set and use a for loop to iterate
for (String key : map.keySet()) {
    System.out.println("Key = " + key);
}

Getting the values as a collection and iterating through it:
Map<String, Integer> map = new HashMap<>();

//Get the values as a collection and use a for loop to iterate
for (String value : map.values()) {
    System.out.println("Value = " + value);
}

Using the set of keys to iterate through values:
Map<String, Integer> map = new HashMap<>();
for (String key : map.keySet()) {
    Integer value = map.get(key);
    System.out.println("Key = " + key + ", Value = " + value);
}

And more recently, with Java 8, by just using the forEach method, e.g.:
Map<String, Integer> map = new HashMap<>();

map.forEach((key, value) -> {
    System.out.println("Key = " + key + ", Value = " + value);
});

But none of these methods felt as strange at first encounter as using the entrySet() method for iteration:

Map<String, Integer> map = new HashMap<>();
// Using the for loop
for (Map.Entry<String, Integer> entry : map.entrySet()) {
    System.out.println("Key = " + entry.getKey() + ", Value = " + entry.getValue());
}

// Using the iterator
Iterator<Map.Entry<String, Integer>> entries = map.entrySet().iterator();
while (entries.hasNext()) {
    Map.Entry<String, Integer> entry = entries.next();
    System.out.println("Key = " + entry.getKey() + ", Value = " + entry.getValue());
}

But with a little more understanding of the internal workings of a HashMap, it can be seen that there is nothing strange about this method. What is going on is simply this: the entrySet() method returns all the Map.Entry objects in the backing array as a set, and you can iterate through that set either with the for loop or with an iterator.


How HashMap's Growth Policy Affects Usage In Multithreading.

HashMaps are backed by arrays. In Java, when an array is created, it is created with a definite size. How come, then, that you can seemingly keep adding to a HashMap without it ever getting "full"?

This is possible because there is a mechanism in place for increasing the size of the backing array as the need arises. In some hash table implementations, resizing grows the backing array to 1.5 times its original size; in others it doubles. HashMap doubles. This resizing mechanism is sometimes referred to as the growth policy. It kicks in when the number of entries in the HashMap crosses a particular threshold, which is determined by the load factor (threshold = capacity × load factor).
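As a quick illustration with HashMap's default settings (the variable names below are just for illustration, not the JDK's internal fields):

int capacity = 16;                              // default initial capacity
float loadFactor = 0.75f;                       // default load factor
int threshold = (int) (capacity * loadFactor);  // 12: a resize is triggered by the 13th entry
int newCapacity = capacity * 2;                 // 32: HashMap doubles on resize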

So what does this have to do with using HashMap in multithreaded environments?

The answer lies in the way resizing is implemented in HashMap. It is done in a way that leaves room for the possibility of an infinite loop when the map is used from multiple threads.

Succinctly explained, this is what happens: when the resizing threshold is crossed, the resizing algorithm kicks in. The algorithm involves creating a new, bigger array and copying the contents (the Map.Entry objects) into it. Where a location contains a chain of more than one Map.Entry, the entries are still copied into the new, bigger array, but (in HashMap implementations prior to Java 8) not necessarily in the same order. So, for example:

old location = Map.Entry1->Map.Entry2->Map.Entry3->null

after re-sizing could be

new location = Map.Entry2->Map.Entry3->Map.Entry1->null

The problem arises in a multithreaded environment, where multiple threads interacting with the HashMap during the resizing process can leave the new backing array in the following state:

new location = Map.Entry1->Map.Entry2->Map.Entry3->Map.Entry1

i.e. a cyclic link between the entries is established, effectively creating an infinite loop. The article A Beautiful Race Condition does a good job of explaining the nitty-gritty of how this can occur.

ConcurrentHashMap is thus recommended whenever the need to use a map in a multithreaded environment arises.
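For example, switching the books map from earlier to a thread-safe variant is a one-line change (ConcurrentHashMap lives in java.util.concurrent):

Map<String, String> books = new ConcurrentHashMap<>();
books.put("id#1", "Ake");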

Setting Initial Capacity and Load Factor

After understanding these implementation details of HashMap, some of the operations that can be performed on a HashMap become more obvious: like setting the initial capacity of a HashMap, or setting the threshold at which the resizing algorithm kicks in.

Taking a peek into the source of the HashMap class, it can be seen that it offers a couple of constructors. One of them allows you to specify how big the backing array should be at creation and what the load factor should be, i.e.:
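The signature of that constructor, as found in the JDK's HashMap source, is:

public HashMap(int initialCapacity, float loadFactor)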


So I can easily do this:

Map<String, String> myMap = new HashMap<>(32, 0.5f);

This creates myMap with a backing array of size 32 (instead of the default 16), and it will resize when half of the array is used (instead of when 3/4 is used, since the default load factor is 0.75f).

This can come in handy in resource-limited environments where you know beforehand that the HashMap will hold more than 16 entries and you want to limit the work spent on resizing as much as possible.

Some Good Practices with HashMap

Based on the understanding of how HashMap works thus far, some general good practices can be recommended when using HashMap:
  1. Do not use HashMap in a multithreaded environment; use ConcurrentHashMap instead.
  2. Use immutable objects as keys.
    Since the hash code of the key is used to compute the index that determines where in the backing array the value is stored, it makes sense to use immutable objects as keys. Otherwise there is the risk that an object used as a key changes state after insertion (and with it its hash code), so that the value it was used to store becomes unreachable within the HashMap. The sketch after this list illustrates the pitfall.
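Here is a small, hypothetical demonstration of that pitfall, using a made-up MutableKey class whose hashCode and equals depend on a mutable field:

import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

public class MutableKeyPitfall {

    // A deliberately bad key type: hashCode and equals depend on a mutable field
    static class MutableKey {
        String id;

        MutableKey(String id) {
            this.id = id;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof MutableKey && Objects.equals(id, ((MutableKey) o).id);
        }

        @Override
        public int hashCode() {
            return Objects.hashCode(id);
        }
    }

    public static void main(String[] args) {
        Map<MutableKey, String> books = new HashMap<>();
        MutableKey key = new MutableKey("id#1");
        books.put(key, "Ake");

        key.id = "id#2"; // mutate the key after insertion: its hash code changes

        // The lookup now hashes to a different bucket, so the stored value is unreachable
        System.out.println(books.get(key));                    // prints null
        System.out.println(books.get(new MutableKey("id#1"))); // also prints null
    }
}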

Related Reading
Defining hashCode() and equals() effectively and correctly
