Hash Table

The hash table is a data structure that has the property of storing, retrieving and removing data in a fixed amount of time (computational complexity), this can be quite good because it means that we can continue to store or get those elements without having our program running slower as the data size increases. This is not always the case, let's consider for example an array: storing can be time constant provided that our array is available in memory and we store always at its end but if we need to find an element we will need to iterate through it or if we need to remove an element then we might need to shift one position to the left all the element at the right of the removed element and these operations depend on the size of the array and therefore are not time constant. Similarly if working with a liked list, inserting and removing can be time constant but searching will not.

So how can the hash table do this?

The hash table is an array so what we need is a function that receives the element and returns the array index where that element should be stored or can be found. Ideally the time this hash function will take to execute will be short and independent of the amount of elements stored in the hash table.
At this point you might be thinking that the problem has been delegated to the hash function and it will be difficult to make this function time constant. But if this function does not depend on nothing but the given element itself then this can be possible.
What we need to do is to take something from the element (key) and process it in order to get a number and make sure that number is a valid index, this is (for 0 indexed arrays): [0, array.length). A good trick to make sure the number will be in the valid range is simply to apply to module operator:
int hash(Element element){	
	index = do someting with the key to get a number;	
	return index % hashTable.array.length;
}


Let's say we have a class whose instances have a name which is nothing but a string. If we want a hash table of this class of objects we can use its name property as the key to hash. From an object we get its name and we apply to it the hash function to get the index of the hash table where to store the object. Note that we are using the key to get the index but we are actually storing the object, not just the name.
This figure shows the underlying idea:
hash table example


Collisions

You might be thinking what will happen if two different objects have the same name or what if two objects with different names have the hash function returning the same value.
Initially there is nothing that guarantee that this scenario will not happen and note that if the amount of elements to hash is greater than the hash table size then it is inevitable that some of them will have the hash function returning the same index value.
When the index where to store a given element is already used by another element then we have a collision and we will have to define how to deal with them.

Separate Chaining

One approach could be to store multiple elements under the same index by having another data structure related to that index. So what we typically do is to have a list or a tree for each of the hash table positions and store the hashed elements there. This is known as Separate Chaining.
This figure shows the Separate Chaining:
hash table with separate chaining example


Open Addressing

Here the idea is store the elements directly into the hash table:
If the position given by the hash function applied to an element is free, then the element is stored there else another position is searched. These steps are repeated until a free position is found or it is decided that there is no me space available for that element.
Note that the way the free positions are searched must be determinist so that later when retrieving an element, it can be found.
There are several ways of looking for another position, moving to the next position or using another hash function for example.

When using this technique we have to be careful with elements deletion! If an element 'x' is deleted this could affect the search for another element 'y' that was inserted in the table after 'x' was inserted since the search might need to consider the position of 'x' as used in order to reach 'y'.
So one solution is to mark the 'x' element as logically deleted and only "physically" delete it if the next position (in case of collision) is free. A position logically deleted can be used to store a new element later.
Also in order for Opening Addressing to be efficient we need to guarantee at least some percentage of the table to have free slots, if this percentage is not satisfied the table need to be expanded and all its elements rehashed.

In both approaches, unfortunately when we have collisions and we still want to keep all the hashed elements then our hash table will stop being time constant.

As usual, nothing stops us thinking out of the box and having some other approach like a hybrid between Separate Chaining and Open Addressing.

Sometimes collisions can be useful

We need to count how many chars (of each character) a text has:
Since we know in advance all the different chars there are for a given alphabet we can easily numbered them and use an array where the index will match the char number and the value at that index will denote the amount of times this char was found.
Here is the source code:
public static void countCharactersDemo() {
    
    long[] ocurrences = countCharacters(TXT_TO_ANALIZE);
    long n;

    for (int i = 0; i < ocurrences.length; i++) {
        n = ocurrences[i];
        if (n > 0) {
            // TODO for the sake of simplicity we are not using a StringBuilder (or StringBuffer) here but 
            // you might consider doing so.
            System.out.println("letter '" + (char) (i + Character.MIN_CODE_POINT) + 
			"' found " + n + (n == 1 ? " time" : " times") + ".");
        }
    }

    System.out.println("------- ------- ------- ------- ");

}    


/**
 *
 * @param txt
 * @return an array where each value represents the amount of times a char with 
 * (code point == the array index + Character.MIN_CODE_POINT) was found.
 *
 * This is just for demo, consider using a smaller array if it is not necessary to
 * cover all unicode characters
 *
 */
public static long[] countCharacters(String txt) {

    long[] counts = new long[Character.MAX_CODE_POINT - Character.MIN_CODE_POINT];
    
    for (char ch : txt.toCharArray()) {
        counts[hash(ch)]++;
    }
    
    return counts;

}


private static int hash(char ch) {
    // quite a simple hash function: the integer value of the ch minus Character.MIN_CODE_POINT
    // note that due to how the hash table (array) is defined then this value will always be
    // a valid array index        
    return ch - Character.MIN_CODE_POINT;
}


And here the complete version of it, ready to try: textStatistics.tar