How do distributed hash tables work


If an application needs quick access to stored data, hash tables offer considerable advantages over trees or lists. If there are no collisions, a single key comparison is sufficient, while the search time for trees grows logarithmically at best and linearly for lists with the number of entries. The more data records there are to manage, the more it pays off hashing.

However, as the amount of data grows, you inevitably get to the point where the data no longer fit in the main memory and you have to move it to mass storage, which slows down access again - considerably. An extreme example is data deduplication, in which terabytes or even petabytes are broken down into billions of small pieces ("chunks") and stored under their hash value. Even if you just keep the keys and references to the data in RAM, around 20 to 30 bytes per chunk are left over.

Calculate distributed, save distributed

In distributed systems, an additional complication is that one computer has to carry out all access to the hash table. Distributed hash tables (DHT for short) provide relief in both cases. The advantage of this is that each of the computers involved only saves part of the data records (see iX-Link [a]). This can be achieved relatively easily by partitioning the key space, for example by assigning the possible keys to the individual computers according to a certain scheme. The caching server, among other things, works according to this principle memcached [b].

However, it offers neither persistent storage - which would be pointless with a cache anyway - nor protection against the failure of storage nodes. Both gaps are filled by the “NoSQL” database Couchbase (formerly Membase) [c], which uses the same network protocol and therefore communicates with all of them memcached-Client understands - not to be confused with Apache's document-oriented database CouchDB [1]. Couchbase writes changes to the database in the background on the disk. To protect against the failure of individual servers, the data can be saved multiple times in the cluster. Couchbase can do the work of memcached with take over.

With the increasing spread of P2P networks, academic interest in DHTs grew, especially in the underlying network structures - so-called overlay networks [d] -, routing protocols (key-based routing, KBR for short) and mechanisms for increasing availability. Four proposals emerged within a short period of time: the Content Addressable Network (CAN) [e], Chord [f], Pastry [g] and Tapestry or its successor Chimera [h]. Source code is available for Chord, Pastry and Chimera, but apparently no further development has been carried out for several years. The frequently referenced Bamboo [i] also had its last official release in 2006.

P2P networks as the engine of development

However, there are more recent implementations in various languages ​​for Kademlia [j, k]. Each computer in a Kademlia cluster is responsible for the data sets whose keys are "closest" to its own ID. A simple XOR function is used to calculate the “distance” between two keys, which considerably simplifies routing in the tree-shaped overlay network. Replication of the key-value pairs and an internal balancing mechanism prevent data from being lost if a node is lost, which is a constant occurrence in P2P networks.

The developers of Voldemort [l] are also still active. In contrast to Kademlia, the key value store, written in Java, is less aimed at P2P applications than at general data storage, where the probability of a computer failure is lower. The use of a so-called consistent hash function [m] ensures that if a node fails, the cluster replicates its data with a minimum of data transport across the remaining nodes. If a new computer is added to the cluster, the process is reversed so that the data is always evenly distributed over the nodes. Voldemort uses the Berkeley DB Java Edition as the database backend, alternatively the user can use MySQL, the project's “Read-Only Storage Engine” or a self-made one. The underlying storage layer in RAM can be emulated for testing applications.

However, the fact that DHTs, like all distributed storage systems, are subject to the CAP theorem should not be concealed. It states that only a maximum of two of the three requirements consistency, availability and partition tolerance can be met [2]. (mr)


[1] Rudolf Jansen; NoSQL; Upstart; NoSQL - alternative to relational databases; iX Developer 1/2011, p. 132

[2] Martin Scholl; Distributed storage; Far removed; What distinguishes cloud storage from classic storage; iX 3/2012, p. 46

All links: