6.824 2018 Lecture 19: P2P, DHTs, and Chord
Reminders:
course evaluations
4B due friday
project reports due friday
project demos next week
final exam on May 24th
Today's topic: Decentralized systems, peer-to-peer (P2P), DHTs
potential to harness massive [free] user compute power, network b/w
potential to build reliable systems out of many unreliable computers
potential to shift control/power from organizations to users
appealing, but has been hard in practice to make the ideas work well
Peer-to-peer
[user computers, files, direct xfers]
users computers talk directly to each other to implement service
in contrast to user computers talking to central servers
could be closed or open
examples:
bittorrent file sharing, skype, bitcoin
Why might P2P be a win?
spreads network/caching costs over users
absence of central server may mean:
easier/cheaper to deploy
less chance of overload
single failure won't wreck the whole system
harder to attack
Why don't all Internet services use P2P?
can be hard to find data items over millions of users
user computers not as reliable as managed servers
if open, can be attacked via evil participants
The result is that P2P has been limited to a few niches:
[Illegal] file sharing
Popular data but owning organization has no money
Chat/Skype
User to user anyway; privacy and control
Bitcoin
No natural single owner or controller
Example: classic BitTorrent
a cooperative, popular download system
user clicks on download link for e.g. latest Linux kernel distribution
gets torrent file w/ content hash and IP address of tracker
user's BT app talks to tracker
tracker tells it list of other users w/ downloaded file
user's BT app talks to one or more users w/ the file
user's BT app tells tracker it has a copy now too
user's BT app serves the file to others for a while
the point:
provides huge download b/w w/o expensive server/link
But: the tracker is a weak part of the design
makes it hard for ordinary people to distribute files (need a tracker)
tracker may not be reliable, especially if ordinary user's PC
single point of attack by copyright owner, people offended by content
BitTorrent can use a DHT instead of a tracker
this is the topic of today's readings
BT apps cooperatively implement a "DHT"
a decentralized key/value store, DHT = distributed hash table
the key is the torrent file content hash ("infohash")
the value is the IP address of a BT app willing to serve the file
Kademlia can store multiple values for a key
app does get(infohash) to find other apps willing to serve
and put(infohash, self) to register itself as willing to serve
so DHT contains lots of entries with a given key:
lots of peers willing to serve the file
app also joins the DHT to help implement it
Why might the DHT be a win for BitTorrent?
more reliable than single classic tracker
keys/value spread/cached over many DHT nodes
while classic tracker may be just an ordinary PC
less fragmented than multiple trackers per torrent
so apps more likely to find each other
maybe more robust against legal and DoS attacks
How do DHTs work?
Scalable DHT lookup:
Key/value store spread over millions of nodes
Typical DHT interface:
put(key, value)
get(key) -> value
weak consistency; likely that get(k) sees put(k), but no guarantee
weak guarantees about keeping data alive
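A minimal sketch of this interface in Go (the course language); the 20-byte keys
and multi-value Get are assumptions chosen to match the 160-bit SHA-1 ID space
and Kademlia's multiple-values-per-key behavior mentioned above:

  package dht

  // DHT is a sketch of the put/get interface described above.
  // There is no durability or consistency guarantee: a Get may miss a
  // recent Put, and stored values may be dropped at any time.
  type DHT interface {
      Put(key [20]byte, value []byte)
      Get(key [20]byte) [][]byte // values recently stored under key, best effort
  }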
Why is it hard?
Millions of participating nodes
Could broadcast/flood each request to all nodes
Guaranteed to find a key/value pair
But too many messages
Every node could know about every other node
Then could hash key to find node that stores the value
Just one message per get()
But keeping a million-node table up to date is too hard
We want modest state, and modest number of messages/lookup
Basic idea
Impose a data structure (e.g. tree) over the nodes
Each node has references to only a few other nodes
Lookups traverse the data structure -- "routing"
I.e. hop from node to node
DHT should route get() to same node as previous put()
Example: The "Chord" peer-to-peer lookup system
Kademlia, the DHT used by BitTorrent, is inspired by Chord
Chord's ID-space topology
Ring: All IDs are 160-bit numbers, viewed in a ring.
Each node has an ID, randomly chosen, or hash(IP address)
Each key has an ID, hash(key)
Assignment of key IDs to node IDs
A key is stored at the key ID's "successor"
Successor = first node whose ID is >= key ID.
Closeness is defined as the "clockwise distance"
If node and key IDs are uniform, we get reasonable load balance.
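A sketch of the ring arithmetic in Go, using 64-bit IDs instead of 160-bit ones
to keep the example short; between and successorOf are hypothetical helper names:

  package chord

  // between reports whether id lies in the clockwise interval (a, b] on
  // the ring, handling the wrap past zero.
  func between(a, b, id uint64) bool {
      if a < b {
          return a < id && id <= b
      }
      return id > a || id <= b // interval wraps around the ring
  }

  // successorOf returns the node responsible for key: the first node ID
  // that is >= the key ID, wrapping to the smallest node ID if necessary.
  // sortedNodes must be sorted and non-empty.
  func successorOf(sortedNodes []uint64, key uint64) uint64 {
      for _, n := range sortedNodes {
          if n >= key {
              return n
          }
      }
      return sortedNodes[0] // wrap around
  }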
Basic routing -- correct but slow
Query (get(key) or put(key, value)) is at some node.
Node needs to forward the query to a node "closer" to key.
If we keep moving query closer, eventually we'll hit key's successor.
Each node knows its successor on the ring.
n.lookup(k):
if n < k <= n.successor
return n.successor
else
forward to n.successor
I.e. forward query in a clockwise direction until done
n.successor must be correct!
otherwise we may skip over the responsible node
and get(k) won't see data inserted by put(k)
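A sketch of this successor-only lookup in Go, reusing the between helper above;
the Node type and pointer-chasing stand in for what would really be RPCs
between machines:

  // Node is a minimal stand-in for a Chord node.
  type Node struct {
      ID        uint64
      Successor *Node
  }

  // Lookup walks clockwise, one successor at a time, until it reaches the
  // key's successor. With only successor pointers this is O(n) hops.
  func (n *Node) Lookup(key uint64) *Node {
      if between(n.ID, n.Successor.ID, key) {
          return n.Successor
      }
      return n.Successor.Lookup(key)
  }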
Forwarding through successor is slow
Data structure is a linked list: O(n)
Can we make it more like a binary search?
Need to be able to halve the distance at each step.
log(n) "finger table" routing:
Keep track of nodes exponentially further away:
New state: f[i] contains successor of n + 2^i
n.lookup(k):
if n < k <= n.successor:
return successor
else:
n' = closest_preceding_node(k) -- in f[]
forward to n'
for a six-bit system, maybe node 8's finger table looks like this:
0: 14
1: 14
2: 14
3: 21
4: 32
5: 42
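A sketch of finger-table routing in Go, again with 64-bit IDs and the between
helper from above; the FingerNode type and field names are made up for
illustration (the Predecessor field is used by the stabilization sketch
further down):

  // FingerNode extends the basic node with a finger table: Finger[i]
  // points at the successor of ID + 2^i, so 64 fingers cover a 64-bit
  // ID space.
  type FingerNode struct {
      ID          uint64
      Successor   *FingerNode
      Predecessor *FingerNode
      Finger      [64]*FingerNode
  }

  // closestPrecedingNode returns the finger that most closely precedes
  // key without overshooting it, i.e. the farthest known node in (n, key).
  func (n *FingerNode) closestPrecedingNode(key uint64) *FingerNode {
      for i := len(n.Finger) - 1; i >= 0; i-- {
          f := n.Finger[i]
          if f != nil && f.ID != key && between(n.ID, key, f.ID) {
              return f
          }
      }
      return n.Successor
  }

  // Lookup now roughly halves the remaining ID-space distance per hop,
  // giving O(log n) hops.
  func (n *FingerNode) Lookup(key uint64) *FingerNode {
      if between(n.ID, n.Successor.ID, key) {
          return n.Successor
      }
      return n.closestPrecedingNode(key).Lookup(key)
  }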
Why do lookups now take log(n) hops?
One of the fingers must take you roughly half-way to target
Is log(n) fast or slow?
For a million nodes it's 20 hops.
If each hop takes 50 ms, lookups take a second.
If each hop has 10% chance of failure, it's a couple of timeouts.
So: good but not great.
Though with added complexity, you can get better latency and reliability.
Since lookups are log(n), why not use a binary tree?
A binary tree would have a hot-spot at the root
And its failure would be a big problem
The finger table requires more maintenance, but distributes the load
How does a new node acquire correct tables?
General approach:
Assume system starts out w/ correct routing tables.
Add new node in a way that maintains correctness.
Use DHT lookups to populate new node's finger table.
New node m:
Sends a lookup for its own key, to any existing node.
This yields m.successor
m asks its successor for its entire finger table.
At this point the new node can forward queries correctly
Tweaks its own finger table in background
By looking up each m + 2^i
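A sketch of these join steps in Go, using the FingerNode type from the earlier
sketch; Join is a hypothetical name, and RPC plumbing and error handling are
omitted:

  // Join lets new node m enter the ring via any existing node: look up
  // m's own ID to find m's successor, copy that successor's finger table
  // as a starting point, then refine each entry by looking up m.ID + 2^i.
  func (m *FingerNode) Join(existing *FingerNode) {
      m.Successor = existing.Lookup(m.ID)
      m.Finger = m.Successor.Finger // roughly right, since m is near its successor
      for i := range m.Finger {
          m.Finger[i] = existing.Lookup(m.ID + 1<<uint(i))
      }
  }

Note that Join does not yet make m's predecessor point at m; that is exactly
the problem discussed next, and stabilization fixes it.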
Does routing *to* new node m now work?
If m doesn't do anything,
lookup will go to where it would have gone before m joined.
I.e. to m's predecessor.
Which will return its n.successor -- which is not m.
We need to link the new node into the successor linked list.
Why is adding a new node tricky?
Concurrent joins!
Example:
Initially: ... 10 20 ...
Nodes 12 and 15 join at the same time.
They can both tell 10+20 to add them,
but they didn't tell each other!
We need to ensure that 12's successor will be 15, even if concurrent.
Stabilization:
Each node keeps track of its current predecessor.
When m joins:
m sets its successor via lookup.
m tells its successor that m might be its new predecessor.
Every node m1 periodically asks successor m2 who m2's predecessor m3 is:
If m1 < m3 < m2, m1 switches successor to m3.
m1 tells m3 "I'm your new predecessor"; m3 accepts if closer
than m3's existing predecessor.
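A sketch of stabilization in Go for the FingerNode type above; Stabilize and
Notify are made-up names and would really be periodic RPCs:

  // Stabilize runs periodically: ask the successor for its predecessor,
  // and adopt that node as successor if it sits between us and the
  // successor on the ring.
  func (n *FingerNode) Stabilize() {
      p := n.Successor.Predecessor
      if p != nil && p != n.Successor && between(n.ID, n.Successor.ID, p.ID) {
          n.Successor = p
      }
      n.Successor.Notify(n) // "I might be your predecessor"
  }

  // Notify accepts candidate as predecessor if we have none, or if
  // candidate is closer than the current predecessor.
  func (n *FingerNode) Notify(candidate *FingerNode) {
      if n.Predecessor == nil || between(n.Predecessor.ID, n.ID, candidate.ID) {
          n.Predecessor = candidate
      }
  }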
Simple stabilization example:
initially: ... 10 <==> 20 ...
15 wants to join
1) 15 tells 20 "I'm your new predecessor".
2) 20 accepts 15 as predecessor, since 15 > 10.
3) 10 asks 20 who 20's predecessor is, 20 answers "15".
4) 10 sets its successor pointer to 15.
5) 10 tells 15 "10 is your predecessor"
6) 15 accepts 10 as predecessor (since nil predecessor before that).
now: 10 <==> 15 <==> 20
Concurrent join:
initially: ... 10 <==> 20 ...
12 and 15 join at the same time.
* both 12 and 15 tell 20 they are 20's predecessor; when
the dust settles, 20 accepts 15 as predecessor.
* now 10, 12, and 15 all have 20 as successor.
* after one stabilization round, 10 and 12 both view 15 as successor.
and 15 will have 12 as predecessor.
* after two rounds, correct successors: 10 12 15 20
To maintain log(n) lookups as nodes join,
Every node periodically looks up each finger (each n + 2^i)
What about node failures?
Nodes fail w/o warning.
Two issues:
Other nodes' routing tables refer to dead node.
Dead node's predecessor has no successor.
Recover from dead next hop by using next-closest finger-table entry.
Now, lookups for the dead node's keys will end up at its predecessor.
For dead successor
We need to know what dead node's n.successor was
Since that's now the node responsible for the dead node's keys.
Maintain a _list_ of r successors.
Lookup answer is first live successor >= key
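A sketch of using the successor list to survive a dead immediate successor;
isAlive stands in for a ping/timeout check:

  // firstLiveSuccessor returns the first reachable node in the successor
  // list; when the immediate successor is dead, this is the node now
  // responsible for its keys.
  func firstLiveSuccessor(successors []*FingerNode, isAlive func(*FingerNode) bool) *FingerNode {
      for _, s := range successors {
          if s != nil && isAlive(s) {
              return s
          }
      }
      return nil // all r successors unreachable
  }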
Dealing with unreachable nodes during routing is important
"Churn" is high in open p2p networks
People close their laptops, move WiFi APs, &c pretty often
Fast timeouts?
Explore multiple paths through DHT in parallel?
Perhaps keep multiple nodes in each finger table entry?
Send final messages to several of the r successors?
Kademlia does this, though it increases network traffic.
Geographical/network locality -- reducing lookup time
Lookup takes log(n) messages.
But messages are to random nodes on the Internet!
Will often be very far away.
Can we route through nodes close to us on underlying network?
This boils down to whether we have choices:
If multiple correct next hops, we can try to choose closest.
Idea: proximity routing
to fill a finger table entry, collect multiple nodes near n+2^i on ring
perhaps by asking successor to n+2^i for its r successors
use lowest-ping one as i'th finger table entry
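A sketch of choosing the lowest-latency candidate for a finger slot; ping is a
placeholder for an RTT measurement, and the candidate set comes from the
successor-list trick just described (needs "time" from the standard library):

  // pickClosestFinger keeps the candidate with the lowest measured RTT.
  func pickClosestFinger(candidates []*FingerNode, ping func(*FingerNode) time.Duration) *FingerNode {
      var best *FingerNode
      var bestRTT time.Duration
      for _, c := range candidates {
          if c == nil {
              continue
          }
          if rtt := ping(c); best == nil || rtt < bestRTT {
              best, bestRTT = c, rtt
          }
      }
      return best
  }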
What's the effect?
Individual hops are lower latency.
But less and less choice as you get close in ID space.
So last few hops are likely to be long.
Though if you are reading, and any replica will do,
you still have choice even at the end.
Any down side to locality routing?
Harder to prove independent failure.
Maybe no big deal, since there's no locality in how successor lists are chosen.
Easier to trick me into using malicious nodes in my tables.
What about security?
Can someone forge data? I.e. return the wrong value?
Defense: key = SHA1(value)
Defense: key = owner's public key, value signed
Defense: some external way to verify results (BitTorrent does this)
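A runnable sketch of the key = SHA1(value) defense: the client recomputes the
hash of whatever the DHT returned and rejects forged values.

  package main

  import (
      "bytes"
      "crypto/sha1"
      "fmt"
  )

  // verify checks that a returned value really hashes to the key it was
  // fetched under.
  func verify(key [20]byte, value []byte) bool {
      sum := sha1.Sum(value)
      return bytes.Equal(sum[:], key[:])
  }

  func main() {
      value := []byte("some content")
      key := sha1.Sum(value)
      fmt.Println(verify(key, value))                    // true
      fmt.Println(verify(key, []byte("forged content"))) // false
  }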
Can a DHT node claim that data doesn't exist?
Yes, though perhaps you can check other replicas
Can a host join w/ IDs chosen to sit under every replica of a given key?
Could deny that data exists, or serve old versions.
Defense: require (and check) that node ID = SHA1(IP address)
Can a host pretend to join millions of times?
Could break routing with non-existent hosts, or control routing.
Defense: node ID = SHA1(IP address), so only one node per IP addr.
Defense: require node to respond at claimed IP address.
this is what trackerless BitTorrent's token is about
What if the attacker controls lots of IP addresses?
No easy defense.
But:
Dynamo gets security by being closed (only Amazon's computers).
Bitcoin gets security by proving a node exists via proof-of-work.
How to manage data?
Here is the most popular plan.
[diagram: Chord layer and DHT layer]
Data management is in the DHT layer, above Chord.
DHT doesn't guarantee durable storage
So whoever inserted must re-insert periodically
May want to automatically expire if data goes stale (bittorrent)
DHT replicates each key/value item
On the nodes with IDs closest to the key, where lookups will find them
Replication can help spread lookup load as well as tolerate faults
When a node joins:
successor moves some keys to it
When a node fails:
successor probably already has a replica
but r'th successor now needs a copy
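A sketch of the periodic re-insert rule, reusing the DHT interface sketched
earlier; the refresh interval is whatever comfortably beats the DHT's expiry
time (needs "time" from the standard library):

  // republish keeps a key/value pair alive: the DHT promises no
  // durability, so the original publisher re-inserts it on a timer.
  func republish(d DHT, key [20]byte, value []byte, every time.Duration) {
      for {
          d.Put(key, value)
          time.Sleep(every)
      }
  }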
Summary
DHTs attractive for finding data in large p2p systems
Decentralization seems good for high load, fault tolerance
But: log(n) lookup time is not very fast
But: the security problems are difficult
But: churn is a problem, leads to incorrect routing tables, timeouts
Next paper: Amazon Dynamo, adapts these ideas to a closed system.
References
Kademlia: www.scs.stanford.edu/~dm/home/papers/kpos.pdf
Accordion: www.news.cs.nyu.edu/~jinyang/pub/nsdi05-accordion.pdf
Proximity routing: https://pdos.csail.mit.edu/papers/dhash:nsdi/paper.pdf
Evolution analysis: http://nms.csail.mit.edu/papers/podc2002.pdf
Sybil attack: http://research.microsoft.com/pubs/74220/IPTPS2002.pdf