------------------------------------------------ "Understanding Hash Tables" ------------------------------------------------ C/O :: arp of DynamicHell Development Team ------------------------------------------------ http://dynamichell.org | irc.dynamichell.org ------------------------------------------------ This tutorial is aimed at users with a basic understanding of C, pointers, and more importantly linked lists. This technique, like linked lists, is portable and can be used on any operating system with a standard C library and compiler. There are many data structures available to the programmer. The most common being the linked-list. However, linked-list traversal can become quite slow with large lists. A hash tables provides a means of storing and accessing data much more efficiently. Essentially, a hash table is an array of linked-lists, with each list indexed by the hash of a common variable stored in a node. For large amounts of data, in which hashing is possible, hash tables become faster to traverse than a standard linked-list. Rather than traversing one large list, the program tests the data to be added, generates a hash index and inserts it into the array of linked-lists at that index. Meaning that the linked-list to be traversed is much shorter; saving time. The following diagram shows a visual representation of a hash table: Node : === Pointer : --- Index [0]---===---===---===---===---===---===---NULL [1]---NULL [2]---===---===---===---NULL [3]---===---===---===---===---===---NULL [4]---===---NULL ... As you can see once the index is found, the number of nodes to traverse is much smaller than a traditional long linked-list. Implementing a Hash Table (C) ============================= The traditional structure for a node in a hash table is identical to that of a linked list: struct node { char *data; struct node *next; }; Now our table must be defined, remembering to define a size which is accessible by all parts of our program (this will become clearer later.) #define TABLE_SIZE 100 struct node *hashtable[TABLE_SIZE]; We must now define a method of creating a new node. It is identical to the function typically used in a linked-list: struct node *hashtable_alloc(void) { struct node *tmp = calloc(1, sizeof(struct hash)); if(tmp == NULL) { fprintf(stderr, "Error: calloc()\n"); exit(EXIT_FAILURE); } tmp->next = NULL; return tmp; } The next step is important--easy to forget--and involves initialising the array of linked lists. There are various ways to do this, though for simplicity we will allocate one root node for each list. void hashtable_init(void) { int x; for(x = 0; x next equals NULL, so we know that the current node exists, and thus, using its pointer will result in a successful link. void hashtable_add(char *data) { unsigned int x = hash_gen(data); struct node *tmp; char *strdup(const char *s); /* Our first loop checks to see the data doesn't already exist */ for(tmp = hashtable[x]; tmp != NULL; tmp = tmp->next) { if(tmp->data != 0) { /* for our root node */ if(!strcmp(data, tmp->data)) return; /* Exists already */ } } for(tmp = hashtable[x]; tmp->next != NULL; tmp = tmp->next); if(tmp->next == NULL) { tmp->next = hashtable_alloc(); tmp = tmp->next; tmp->data = strdup(data); tmp->next = NULL; } else exit(EXIT_FAILURE); } With our functions for creating nodes, generating a hash and adding data to the hash table, we are now ready to write a function to read the data from the list. This entails looping through the table and traversing each list as you would a linked-list. 
With our functions for creating nodes, generating a hash and adding
data to the hash table, we are now ready to write a function to read
the data from the list. This entails looping through the table and
traversing each list as you would a linked list.

void hashtable_list(void)
{
        int x = 0;
        struct node *tmp;

        for(x = 0; x < TABLE_SIZE; x++)
        {
                for(tmp = hashtable[x]; tmp != NULL; tmp = tmp->next)
                {
                        if(tmp->data != 0)
                                printf("Index is %d, Data is %s\n", x, tmp->data);
                }
        }
}

We are almost finished. The only thing that remains is to clean up
once we are done with the hash table. We use two temporary pointers
so that we can keep hold of the rest of a list after a node has been
freed.

void hashtable_free(void)
{
        struct node *tmp;
        struct node *fwd;
        int x;

        for(x = 0; x < TABLE_SIZE; x++)
        {
                tmp = hashtable[x];

                while(tmp != NULL)
                {
                        fwd = tmp->next;
                        free(tmp->data);
                        free(tmp);
                        tmp = fwd;
                }
        }
}

Again, as with linked lists, we never modify the root pointer of a
list while traversing it, as doing so would break the list.

Final Code
==========

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 100

struct node {
        char *data;
        struct node *next;
};

struct node *hashtable[TABLE_SIZE]; /* Our hash table */

struct node *hashtable_alloc(void);
void hashtable_init(void);
unsigned int hash_gen(char *string);
void hashtable_add(char *data);
void hashtable_list(void);
void hashtable_free(void);

int main(void)
{
        hashtable_init();

        hashtable_add("David");
        hashtable_add("Goliath");
        hashtable_add("Alan");

        hashtable_list();
        hashtable_free();

        return EXIT_SUCCESS;
}

void hashtable_list(void)
{
        int x = 0;
        struct node *tmp;

        for(x = 0; x < TABLE_SIZE; x++)
        {
                for(tmp = hashtable[x]; tmp != NULL; tmp = tmp->next)
                {
                        if(tmp->data != 0)
                                printf("Index is %d, Data is %s\n", x, tmp->data);
                }
        }
}

void hashtable_add(char *data)
{
        unsigned int x = hash_gen(data);
        struct node *tmp;
        char *strdup(const char *s);

        /* Our first loop checks to see the data doesn't already exist */
        for(tmp = hashtable[x]; tmp != NULL; tmp = tmp->next)
        {
                if(tmp->data != 0) /* for our root node */
                {
                        if(!strcmp(data, tmp->data))
                                return;
                }
        }

        for(tmp = hashtable[x]; tmp->next != NULL; tmp = tmp->next);

        if(tmp->next == NULL)
        {
                tmp->next = hashtable_alloc();
                tmp = tmp->next;
                tmp->data = strdup(data);
                tmp->next = NULL;
        }
        else
                exit(EXIT_FAILURE);
}

unsigned int hash_gen(char *string)
{
        unsigned int num = 0;

        while(*string != '\0')
                num += *string++;

        return num % TABLE_SIZE;
}

void hashtable_init(void)
{
        int x;

        for(x = 0; x < TABLE_SIZE; x++)
                hashtable[x] = hashtable_alloc();
}

struct node *hashtable_alloc(void)
{
        struct node *tmp = calloc(1, sizeof(struct node));

        if(tmp == NULL)
        {
                fprintf(stderr, "Error: calloc()\n");
                exit(EXIT_FAILURE);
        }

        tmp->next = NULL;

        return tmp;
}

void hashtable_free(void)
{
        struct node *tmp;
        struct node *fwd;
        int x;

        for(x = 0; x < TABLE_SIZE; x++)
        {
                tmp = hashtable[x];

                while(tmp != NULL)
                {
                        fwd = tmp->next;
                        free(tmp->data);
                        free(tmp);
                        tmp = fwd;
                }
        }
}

Improvements
============

There are various improvements which could be made to this design.
Firstly, the hash table could be made larger to avoid unnecessary
collisions. Secondly, instead of wasting memory by allocating one
root node for every list up front, we could allocate nodes on demand,
initialising the table by setting each list head to NULL instead.
Another suggestion is to use a bitmap marking used and unused lists:
the program could quickly check the bits and decide whether to
allocate or to traverse.

Summary
=======

Hash tables are faster to search than large linked lists. They, like
linked lists, are used throughout the software industry. The root
node of a list is never modified directly, only temporary pointers to
it, so as not to damage the list's structure. There are various ways
to improve the efficiency of a hash table.

Copyright (c) 2006 Alastair Poole.

Verbatim copying and distribution of this entire article are
permitted worldwide, without royalty, in any medium, provided this
notice, and the copyright notice, are preserved.