Dht Kademlia

1 view

Skip to first unread message

Nadia Summerhill

unread,

Aug 3, 2024, 5:12:42 PM8/3/24

to lenmazddahand

Okay, I've been reading articles and the paper about Kademlia recently to implement a simple p2p program that uses kademlia dht algorithm. And those papers are saying, those 160-bit key in a Kademlia Node is used to identify both nodes (Node ID) and the data (which are stored in a form of tuple).

Additionally, bootstrapping method of Kademlia too confuses me.When the node gets created (user executes the program), It seems like it uses bootstrapping node to fill up the buckets. But then what's bootstrapping node? Is it another process that's always running? What if the bootstrapping node gets turned off?

Each node has a routing table by which it organizes the neighbors it knows about and another table in which it organizes the data it is asked to store by others. Nodes have a quasi-random ID that determines their position in the routing keyspace. The hashes of keys for stored data don't precisely match any particular node ID, so the data is stored on the nodes whose ID is closest to the hash, as determined by the distance metric. That's how node IDs and key hashes are used for both.

When you perform a lookup for data (i.e. find_value) you ask the remote nodes for the k-closest neighbor set they have in their routing table, which will allow you to home in on the k-closest set for a particular target key. The same query also asks the remote node to return any data they have matching that target ID.

When you perform a find_node on the other hand you're only asking them for the closest neighbors but not for data. This is primarily used for routing table maintenance where you're not looking for any data.

Those are the abstract operations, if needed an actual implementation could separate the lookup from the data retrieval, i.e. first perform a find_node and then use the result set to perform one or more separate get operations that don't involve additional neighbor lookups (similar to the store operation).

Since kademlia is UDP-based you can't really serve arbitrary files because those could easily exceed reasonable UDP packet sizes. So in practice kademlia usually just serves as a hash table for small binary values (e.g. contact information, public keys and such). Bulk operations are either performed by other protocols bootstrapped off those values or by additional operations beyond those mentioned in the kademlia paper.

What the paper describes is only the basic functionality for a routing algorithm and most basic key value storage. It is a spherical cow in a vacuum. Actual implementations usually need additional features or work around security and reliability problems faced on the public internet.

It seems like it replaces the refinement of the bucket splitting for unbalanced trees with a a list of the closest known nodes relative to the local node ID. Unlike the bucket splitting approach it uses a different parameter instead of the bucket size K.

The details don't seem to be spelled out but it seems logical that one simply computes whether a node would be inserted in that list based on the currently furthest node in that list (assuming the maximum size based on the new parameter has been reached) and otherwise spilling it into the main routing table which is still based on buckets.

Pretty much the same way that kademlia does with the refined splitting approach (which many implementations fail to consider!), but in a way that's easier to reason about and can be parametrized separately.

I want to confirm my understanding of buckets in Kademlia DHT.Kademlia has m k-buckets where m is the size of the network in bits and k is the number of key-value pairs stored per bucket.for example, let's say m=4 then we can have 2^4 nodes, namely from 0 to 15.

Each node has the routing table of the 0-bit match, 1st-bit match and 2nd-bit match and so on, this is m buckets. Furthermore, for each bucket, it will store k representatives instead of a single NodeId.So, if we say k=2, the routing Table for node 0101 would be something like:

k is the number of entries in a bucket. Their node IDs are expected to be randomly distributed within the ID-range the bucket covers, which means doubling the number of entries per bucket would only increase its resolution by one bit on average, i.e. it does not scale well.That's why we have a structured routing table with multiple buckets with different scope each. It increases local resolution around the node's own node ID.

Implementing the full kademlia algorithm requires a dynamic routing table layout. Therefore m is not fixed. The fixed size layout was only used in the simplified pre-print version of the paper as part of a theoretical proof.

Every node in Kademlia store a list of other nodes. This list is based on bit-differences and is loosely termed as buckets. Now, 'k' is a system wide replication parameter. This means that for every list, there are k entries inside for that 'bucket'. This is my understanding. This is the link to the paper. Hope this helps..:)