------------------------------------------------ "Understanding Hash Tables" ------------------------------------------------ C/O :: arp of DynamicHell Development Team ------------------------------------------------ http://dynamichell.org | irc.dynamichell.org ------------------------------------------------ This tutorial is aimed at users with a basic understanding of C, pointers, and more importantly linked lists. This technique, like linked lists, is portable and can be used on any operating system with a standard C library and compiler. There are many data structures available to the programmer. The most common being the linked-list. However, linked-list traversal can become quite slow with large lists. A hash tables provides a means of storing and accessing data much more efficiently. Essentially, a hash table is an array of linked-lists, with each list indexed by the hash of a common variable stored in a node. For large amounts of data, in which hashing is possible, hash tables become faster to traverse than a standard linked-list. Rather than traversing one large list, the program tests the data to be added, generates a hash index and inserts it into the array of linked-lists at that index. Meaning that the linked-list to be traversed is much shorter; saving time. The following diagram shows a visual representation of a hash table: Node : === Pointer : --- Index [0]---===---===---===---===---===---===---NULL [1]---NULL [2]---===---===---===---NULL [3]---===---===---===---===---===---NULL [4]---===---NULL ... As you can see once the index is found, the number of nodes to traverse is much smaller than a traditional long linked-list. Implementing a Hash Table (C) ============================= The traditional structure for a node in a hash table is identical to that of a linked list: struct node { char *data; struct node *next; }; Now our table must be defined, remembering to define a size which is accessible by all parts of our program (this will become clearer later.) #define TABLE_SIZE 100 struct node *hashtable[TABLE_SIZE]; We must now define a method of creating a new node. It is identical to the function typically used in a linked-list: struct node *hashtable_alloc(void) { struct node *tmp = calloc(1, sizeof(struct hash)); if(tmp == NULL) { fprintf(stderr, "Error: calloc()\n"); exit(EXIT_FAILURE); } tmp->next = NULL; return tmp; } The next step is important--easy to forget--and involves initialising the array of linked lists. There are various ways to do this, though for simplicity we will allocate one root node for each list. void hashtable_init(void) { int x; for(x = 0; x next equals NULL, so we know that the current node exists, and thus, using its pointer will result in a successful link. void hashtable_add(char *data) { unsigned int x = hash_gen(data); struct node *tmp; char *strdup(const char *s); /* Our first loop checks to see the data doesn't already exist */ for(tmp = hashtable[x]; tmp != NULL; tmp = tmp->next) { if(tmp->data != 0) { /* for our root node */ if(!strcmp(data, tmp->data)) return; /* Exists already */ } } for(tmp = hashtable[x]; tmp->next != NULL; tmp = tmp->next); if(tmp->next == NULL) { tmp->next = hashtable_alloc(); tmp = tmp->next; tmp->data = strdup(data); tmp->next = NULL; } else exit(EXIT_FAILURE); } With our functions for creating nodes, generating a hash and adding data to the hash table, we are now ready to write a function to read the data from the list. This entails looping through the table and traversing each list as you would a linked-list. 
With our functions for creating nodes, generating a hash and adding
data to the hash table, we are now ready to write a function to read
the data from the list. This entails looping through the table and
traversing each list as you would a linked list.

void hashtable_list(void)
{
        int x = 0;
        struct node *tmp;

        for(x = 0; x < TABLE_SIZE; x++)
        {
                for(tmp = hashtable[x]; tmp != NULL; tmp = tmp->next)
                {
                        if(tmp->data != 0)
                                printf("Index is %d, Data is %s\n", x, tmp->data);
                }
        }
}

We are almost finished. The only thing that remains is to clean up
once we are done with the hash table. We use two temporary pointers
so that we can keep hold of the rest of a list after a node has been
freed.

void hashtable_free(void)
{
        struct node *tmp;
        struct node *fwd;
        int x;

        for(x = 0; x < TABLE_SIZE; x++)
        {
                tmp = hashtable[x];

                while(tmp != NULL)
                {
                        fwd = tmp->next;
                        free(tmp->data);
                        free(tmp);
                        tmp = fwd;
                }
        }
}

Again, as with linked lists, we never modify the root pointer of a
list while traversing it, as doing so would break the list.

Final Code
==========

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 100

struct node {
        char *data;
        struct node *next;
};

struct node *hashtable[TABLE_SIZE]; /* Our hash table */

struct node *hashtable_alloc(void);
void hashtable_init(void);
unsigned int hash_gen(char *string);
void hashtable_add(char *data);
void hashtable_list(void);
void hashtable_free(void);

int main(void)
{
        hashtable_init();

        hashtable_add("David");
        hashtable_add("Goliath");
        hashtable_add("Alan");

        hashtable_list();
        hashtable_free();

        return EXIT_SUCCESS;
}

void hashtable_list(void)
{
        int x = 0;
        struct node *tmp;

        for(x = 0; x < TABLE_SIZE; x++)
        {
                for(tmp = hashtable[x]; tmp != NULL; tmp = tmp->next)
                {
                        if(tmp->data != 0)
                                printf("Index is %d, Data is %s\n", x, tmp->data);
                }
        }
}

void hashtable_add(char *data)
{
        unsigned int x = hash_gen(data);
        struct node *tmp;
        char *strdup(const char *s);

        /* Our first loop checks to see the data doesn't already exist */
        for(tmp = hashtable[x]; tmp != NULL; tmp = tmp->next)
        {
                if(tmp->data != 0) /* for our root node */
                {
                        if(!strcmp(data, tmp->data))
                                return;
                }
        }

        for(tmp = hashtable[x]; tmp->next != NULL; tmp = tmp->next);

        if(tmp->next == NULL)
        {
                tmp->next = hashtable_alloc();
                tmp = tmp->next;
                tmp->data = strdup(data);
                tmp->next = NULL;
        }
        else
                exit(EXIT_FAILURE);
}

unsigned int hash_gen(char *string)
{
        unsigned int num = 0;

        while(*string != '\0')
                num += *string++;

        return num % TABLE_SIZE;
}

void hashtable_init(void)
{
        int x;

        for(x = 0; x < TABLE_SIZE; x++)
                hashtable[x] = hashtable_alloc();
}

struct node *hashtable_alloc(void)
{
        struct node *tmp = calloc(1, sizeof(struct node));

        if(tmp == NULL)
        {
                fprintf(stderr, "Error: calloc()\n");
                exit(EXIT_FAILURE);
        }

        tmp->next = NULL;

        return tmp;
}

void hashtable_free(void)
{
        struct node *tmp;
        struct node *fwd;
        int x;

        for(x = 0; x < TABLE_SIZE; x++)
        {
                tmp = hashtable[x];

                while(tmp != NULL)
                {
                        fwd = tmp->next;
                        free(tmp->data);
                        free(tmp);
                        tmp = fwd;
                }
        }
}

Improvements
============

There are various improvements which could be made to this design.
Firstly, the hash table could be made larger to avoid unnecessary
collisions. Secondly, instead of wasting memory by allocating one
root node for every list up front, we could allocate nodes on demand,
initialising the table by setting each list head to NULL instead.
Another suggestion is to use a bitmap marking used and unused lists:
the program could quickly check the bits and decide whether to
allocate or to traverse.

Summary
=======

Hash tables are faster to search than large linked lists. They, like
linked lists, are used throughout the software industry. The root
node of a list is never modified directly, only temporary pointers to
it, so as not to damage the list's structure. There are various ways
to improve the efficiency of a hash table.

Copyright (c) 2006 Alastair Poole.

Verbatim copying and distribution of this entire article are
permitted worldwide, without royalty, in any medium, provided this
notice, and the copyright notice, are preserved.