For more information (and source code), retrieve:
ftp://emf.net/users/lord/src/bdb-1.0.tar.gz
This post is:
Copyright (C) 1997 Thomas Lord
Verbatim redistribution is permitted.
---------------------------------------------
PERSISTENT HASH TRIES
Thomas Lord, Berkeley CA.
1. Introduction
A "persistent hash trie" (PHT) is a low level database in which higher
level databases may be stored. It is another layer or an alternative
to a traditional file system.
The basic operations on a PHT have at least three important
properties. They provide arbitrarily complex transactions; guarantee
instant recovery from system failures; and permit concurrent, isolated
access by multiple processes. Hereafter, these properties will be
referred to as the "TRC" properties ("T"ransactions, "R"ecoverability,
"C"oncurrency).
When a higher-level database is stored in a PHT, the operations on the
higher-level database automatically acquire the TRC properties.
2. Background: The PHT Abstract Data Type
Described as an abstract data type, a PHT provides 7 fundamental
operations:
value = pht.REF (key)
pht.SET (key, value)
A PHT provides a mutable mapping for short
values (typically 64-bit keys and values).
location = pht.ALLOC (n_pages)
pht.FREE (location, n_pages)
A PHT provides a page-oriented memory manager
for the underlying storage device in which
it stores the REF/SET mapping. (A typical page
size is 4k and a typical disk address is a 31 bit
quantity.)
We presume that the operating system or
particular PHT implementation provides a way
to read and write data in allocated regions
of a PHT database.
pht.COMMIT ()
pht.ABORT ()
pht.CATCH_UP ()
Mutations to a PHT are hidden from processes other
than the writer until the execution of a COMMIT.
Mutations can be discarded by ABORT and are
automatically discarded by writer crash or system
failure.
A read-only transaction obtains a fixed view of the
state of the database at the time of the first read.
CATCH_UP updates that view to take into account any
subsequently committed transactions.
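For concreteness, the seven operations might be declared in C roughly
as follows. This is only a sketch: the type names, the "struct pht"
handle, and the function names are illustrative, not those of the
demonstration implementation.

typedef unsigned long long key_type;    /* 64-bit keys */
typedef unsigned long long val_type;    /* 64-bit values */
typedef long disk_addr;                 /* page-aligned file offset */

struct pht;                             /* an open PHT database */

val_type  pht_ref      (struct pht * db, key_type key);
void      pht_set      (struct pht * db, key_type key, val_type value);

disk_addr pht_alloc    (struct pht * db, int n_pages);
void      pht_free     (struct pht * db, disk_addr location, int n_pages);

int       pht_commit   (struct pht * db);   /* publish this transaction */
void      pht_abort    (struct pht * db);   /* discard this transaction */
void      pht_catch_up (struct pht * db);   /* advance a reader's view */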
The demonstration implementation chooses 64-bit addresses and values
for REF and SET, 4096-byte pages, and a 31-bit range for the file
offsets returned by ALLOC and accepted by FREE. These numbers work
out nicely in the bit-level details of the implementation and there
does not seem to be much reason to vary from them, except perhaps to
change the page size by some small power of two.
2.1 Database Mutations and PHT Transactions
There are exactly four kinds of mutation possible for a PHT. Three of
them are among the basic operations just introduced:
pht.SET (key, value)
pht.ALLOC (n_pages)
pht.FREE (location, n_pages)
The fourth kind of mutation is to write directly to pages allocated
using "pht.ALLOC()". The demonstration implementation includes a
simple buffered I/O package for that purpose.
Database transactions are divided into two types: read-only and rd/wr
(read/write).
A rd/wr transaction makes mutations. Transaction boundaries are
marked by calls to COMMIT or ABORT.
A read-only transaction avoids the four mutation-invoking operations.
A read-only transaction starts when a database file is first opened,
and continues until the reading process exits or calls CATCH_UP. In
the latter case, a new read-only transaction begins.
Transactions are partially ordered in the following way.
Every write transaction is assigned a sequence number. These sequence
numbers increase with every successive COMMIT (transaction N is
followed by N+1). COMMIT is an atomic operation. Thus, write
transactions are totally ordered by sequence numbers. Note that when
write transaction N+1 begins, the database state observed by that
writer is the state of the database at the conclusion of transaction
N. That is also the state returned to by a call to ABORT.
Every read transaction obtains a view of the database at the time of
the most recent commit prior to the beginning of the read transaction.
More than one reader may obtain the same view. Therefore, read
transactions (and so, the set of all transactions) are _partially_
ordered by sequence numbers.
Programs which use PHTs must maintain three easy but extremely
important invariants:
"The TRC Invariants"
(extremely important)
=====================
+-------------------------------------------+
| 1) Only one write transaction at a time.  |
|                                           |
| 2) Writes are permitted only on pages     |
| allocated during the current transaction. |
|                                           |
| 3) Any transaction with sequence number N |
| may only examine pages allocated by write |
| transaction M where:                      |
|     M <= N                                |
| and the page has not been passed to FREE  |
| in any write transaction X where:         |
|     M <= X <= N                           |
+-------------------------------------------+
If those invariants are maintained, then each of the four kinds of
mutation will have the TRC properties.
Only a system failure or I/O error can cause a PHT COMMIT to fail. If
it fails, the effect is the same as if ABORT had been called instead;
the database is left in a consistent state.
2.2 Database Garbage Collection
The isolation of read and write transactions means that pages freed by
a write transaction can not be immediately recycled or otherwise
overwritten -- they may still be needed by a reader who has not caught
up with the freeing transaction. Therefore, a garbage collector is
employed which picks up storage passed to FREE, and makes it available
for actual re-use only after proving that it is no longer accessible
to readers.
Three PHT operations are involved in garbage collection: SLATE,
COMMIT, and RECLAIM. First, a writing process SLATEs storage for
recycling and executes a commit. Then, that process waits until all
readers have caught up with that commit. Finally, the writer RECLAIMs
the storage that was earlier SLATEd, making it available for
recycling. Postponing the RECLAIM until readers have caught up with
the SLATE assures the writer that no reader is still using the
recycled pages.
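Sketched as code, a writer's garbage collection pass might look like
the following. The names pht_slate, pht_reclaim, pht_sequence, and
wait_for_readers_to_catch_up are illustrative stand-ins (the last for
whatever application-specific mechanism detects that readers have
caught up); they are not the demonstration implementation's interface.

/* Illustrative only: recycle storage previously passed to FREE. */
void
collect_garbage (struct pht * db)
{
  int seq;

  pht_slate (db);                           /* mark FREEd storage for recycling */
  pht_commit (db);                          /* the "slating" commit */
  seq = pht_sequence (db);                  /* its sequence number */

  wait_for_readers_to_catch_up (db, seq);   /* application specific */

  pht_reclaim (db);                         /* slated pages may now really be reused */
}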
2.3 Locking
Only two kinds of locking are required when using a PHT. (1) Readers
must exclude a garbage collector with which they have not caught
up. (2) Writers must exclude one another so that only one write
transaction runs at a time. All other synchronization between PHT
processes is implicit and mediated through the storage device on which
PHT data is kept.
What other database systems do with locking and logging, BDB does
primarily with ordered writes, as explained below. [Seltzer]
3. The Physical Implementation Of PHTs
3.1 Assumptions About The Underlying Storage Device
We assume that reads and writes are most efficient if they are large
enough (2^12 bytes works out nicely, for example) and page aligned at
the optimal read/write size. If the amount of storage available can
be dynamically increased, we will take advantage of that, but it is
not necessary.
We assume that page aligned writes, if they are the size of one page
or smaller, occur in order from the first byte to the last, so that a
system failure part-way through such a write will leave the disk
[DISK] in a state that is consistent with having written some initial
sequence of the bytes to be written. If two writes to the same block
are issued roughly concurrently, they must still take place in some
definite order. In general, after a write A followed by B to the same
block, either of which may have succeeded or failed, the block
contains:
[..bytes from B..][..bytes from A..][..bytes from before A..]
Any of those regions may be empty or may fill some or all of the
block. If write A succeeded, the section "..bytes from before A.."
will be empty. If write B succeeded, the section "..bytes from B.."
will fill the entire block. No other region of the disk may be
affected by either write.
These constraints concerning the ordering of writes could be altered
to accommodate some esoteric storage device.
All that is essential is that a PHT implementation be able to
atomically and with full recoverability write a single structure,
smaller than the size of a page, to the storage device. From the
write ordering constraints outlined above, which simply describe the
typical behavior of ordinary storage devices, the particular PHT
implementation described by this paper synthesizes the capability of a
suitably atomic and recoverable write.
3.2 Static Structure of the REF/SET Hash Trie
We begin by explaining read-only access to the data structure used to
implement the REF/SET operations. This data structure, a "hash trie",
is central to the implementation of a PHT. [ARRAYS]
The hash trie has two kinds of nodes, internal and leaf, both
occupying exactly one page. There is always at least one internal
node (the root), and at least one leaf.
To operate on a PHT, one must have a "root pointer". The root pointer
is the physical disk address of the root node of the REF/SET hash
trie. This pointer is a parameter to the REF and SET functions.
The format of an internal node is simple: it is an array of physical
disk locations, each a pointer to a child node. Any of these pointers
may be 0, indicating that there is no child in that position.
As in an ordinary trie, a child pointer may point to either an
internal node or a leaf. We use tag bits in the pointer to
distinguish these two cases. We use low-order tag bits because child
pointers are page aligned.
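For example, with 4096-byte pages every child pointer is a multiple of
4096, so a low-order bit is free to carry the tag. A sketch with
illustrative names (the predicate tag_bits_indicate_leaf reappears in
the lookup code below):

#define LEAF_TAG  0x1                 /* low-order tag bit: child is a leaf */

#define tag_bits_indicate_leaf(ptr)   ((ptr) & LEAF_TAG)
#define make_leaf_ptr(page_addr)      ((page_addr) | LEAF_TAG)
#define child_page_addr(ptr)          ((ptr) & ~(disk_addr) LEAF_TAG)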
The format of a leaf node is different, and is explained more fully
below.
We call this tree a "trie" because at each level we consume some
number of bits from the key in order to choose the correct child
pointer to reach the next level. We call this tree a "hash trie"
because the manner in which a child pointer is looked up borrows a
technique from implementation strategies for extensible hashing, in
contrast to more ordinary trie algorithms which perform a child
pointer lookup by "radix sort" technique (simply use the bits from the
key to index an array of child pointers). Here is some code which
explains.
/* An ordinary trie, not a hash trie, chooses
 * a child pointer like this:
 */
#define mask_width 10                        /* key bits consumed per trie level
                                                (the demonstration value) */
#define mask ((1 << mask_width) - 1)

#define choose_some_bits(depth, key) \
  (mask & ((key) >> ((depth) * mask_width)))

disk_addr
classic_get_child_ptr (disk_addr * internal_node, int depth, int key)
{
  int index;
  index = choose_some_bits (depth, key);
  return internal_node[index];
}
/* But a hash trie chooses a child pointer
* like this:
*/
disk_addr
hash_trie_get_child_ptr (disk_addr * internal_node, int depth, int key)
{
int index;
int attempts;
index = choose_some_bits (depth, key);
attempts = 0;
do
{
if (internal_node[index])
return internal_node[index];
index = (index >> 1);
++attempts;
}
while (attempts <= mask_width);
return 0;
}
Many variations are possible, especially in how "choose_some_bits" is
implemented. Whatever implementation is chosen will interact with the
distribution of addresses passed to REF and SET and have a critical
impact on performance. (The demonstration implementation currently
shaves off address bits from high-to-low order, rather than
low-to-high as illustrated here.)
This technique of searching each internal node, discarding bits from
the key, is useful (in combination with the insertion algorithm
explained below) for reducing the number of children at each internal
node without increasing the depth of the tree, thereby saving lots of
storage space. This will become clearer when the insertion algorithm
is explained. The number of hash table probes necessary to find the
correct child pointer can be reduced (perhaps with significant
performance implications) from what is shown above by techniques such
as a binary search through the number of significant index bits.
Each leaf page is a linearly probed hash table searched using a hash
value computed from the key. The format of a bucket entry is:
struct hash_trie_bucket
{
key_type key;
val_type value;
};
3.3 Lookup
Putting the pieces together, REF works this way:
val_type
ref (disk_addr root, key_type key)
{
disk_addr bucket;
struct hash_trie_bucket * p;
struct hash_trie_bucket * entry;
bucket = find_bucket (root, 0, key);
p = (struct hash_trie_bucket *)cache_page (bucket);
entry = search_leaf (p, key);
if (entry)
return entry->value;
else
return 0;
}
"search_leaf" is a simple hash table lookup. Here's "find_bucket":
disk_addr
find_bucket (disk_addr internal_node, int depth, key_type key)
{
disk_addr child;
disk_addr * p;
p = cache_page (internal_node);
child = get_child_ptr (p, depth, key);
if (!child || tag_bits_indicate_leaf (child))
return child;
else
return find_bucket (child, depth + 1, key);
}
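"search_leaf" itself is not shown above. A minimal linear-probing
sketch might look like the following; BUCKETS_PER_LEAF, the hash
function, and the all-zero test for an empty bucket are assumptions
made for illustration, not details of the demonstration
implementation.

#define BUCKETS_PER_LEAF  256   /* e.g. 4096-byte page / 16-byte bucket */

static int
hash_key (key_type key)
{
  /* illustrative hash; the demonstration implementation's choice may differ */
  return (int) (key ^ (key >> 32)) & 0x7fffffff;
}

struct hash_trie_bucket *
search_leaf (struct hash_trie_bucket * leaf, key_type key)
{
  int start;
  int i;
  int probe;

  start = hash_key (key) % BUCKETS_PER_LEAF;
  for (i = 0; i < BUCKETS_PER_LEAF; ++i)
    {
      probe = (start + i) % BUCKETS_PER_LEAF;
      if (leaf[probe].key == key)
        return &leaf[probe];
      if (!leaf[probe].key && !leaf[probe].value)   /* assumed "empty" convention */
        return 0;                                   /* not present */
    }
  return 0;
}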
3.4 Insertion
The insertion algorithm follows almost directly from the lookup
algorithm, but with two intricacies: the need to handle leaf-node
overflow, and the need to honor the TRC Invariants, especially #2:
Writes are permitted only on pages
allocated during the current transaction
After a commit, a given node of the trie can not be modified by a
subsequent SET.
We use a copy-on-write algorithm. The first time a trie node is
modified within a given transaction, it is copied to a newly allocated
page. Modifications are made to the copy, not the original. This
satisfies the TRC Invariants.
One immediate consequence of the copy-on-write algorithm is that a
modification to a trie node causes a modification to that node's
parent because the parent's child pointer must be updated to point to
the newly allocated page.
That in turn implies:
Every write transaction modifies the root node.
So by copy-on-write, the root node changes to a newly allocated
address with every write transaction. This important fact will turn
up again in the discussion of transactions.
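A sketch of the copy-on-write step for a single trie node follows.
The helpers allocated_this_transaction and copy_page are assumed
names; whether the original page is passed to FREE at this point or at
some later point is an implementation choice.

/* Illustrative: return an address of `node` that is safe to modify in
 * the current write transaction, copying the page if necessary.
 */
disk_addr
modifiable_copy (struct pht * db, disk_addr node)
{
  disk_addr copy;

  if (allocated_this_transaction (db, node))
    return node;                /* already writable under TRC invariant 2 */

  copy = pht_alloc (db, 1);     /* fresh page: safe to write this transaction */
  copy_page (db, copy, node);   /* duplicate the old contents */
  pht_free (db, node, 1);       /* the old page becomes garbage */

  return copy;                  /* the caller must update the parent's
                                   child pointer to point at the copy */
}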
When a leaf-node overflows, it is split into two leaf nodes and an
attempt is made to insert these in the parent node. One will remain
in the same child pointer position. The other will be inserted at an
index formed by prepending a "1" bit to the original index (if the
index so formed does not overflow the size of the node).
Recall from the "get_child_ptr" algorithm that a child pointer to a
leaf node is usually addressed by fewer than the full complement of
key bits available at the trie node containing the child pointer.
Usually, therefore, the leaves resulting from a split can be
reinserted in their parent by adding one significant bit of key to
their address.
The following example illustrates:
initial condition:
------------------
Initially there is just one internal node (the root node) and one leaf
node in the trie. In the demonstration implementation, each internal
trie node has room for 1024 child pointers and so consumes 10
bits of key address. But extensible hashing is used to translate
those 10 bits into an actual child pointer -- so in the initial
condition, _0_ bits of key are used to find the leaf node for any key:
root[0] == offset of leaf 0
root[1..1023] == 0
All keys map to root[0].
No bits of the key are significant.
In the demonstration implementation, up to 256 insertions can occur
before the first leaf node overflows.
after one split:
----------------
When that first leaf is split, the trie will contain one internal node
and two leaves. It will consume 10 bits of key address at the root
node, of which it will actually use _1_ bit to choose the appropriate
leaf. This example is based on 64-bit keys and on taking the 10 bits
for the root node from the high-order bits of the key:
root[0] == offset of leaf 0
root[1] == offset of leaf 1
root[2..1023] == 0
Keys with 0 in bit 54 map to root[0].
Keys with 1 in bit 54 map to root[1].
One bit of each key is significant.
after two splits:
-----------------
Note that if one of those leaves then splits but the other doesn't,
then some addresses will use 1 bit in the root while others use 2
bits.
root[0] == offset of leaf 0
root[1] == offset of leaf 1
root[2] == 0
root[3] == offset of leaf 2
root[4..1023] == 0
Keys with 0 in bit 54 map to root[0].
Keys with 01 in bits 54-55 map to root[1].
Keys with 11 in bits 54-55 map to root[3].
Either 1 or 2 bits of each key is significant,
depending on whether bit 54 is 0 or 1.
internal node overflow
----------------------
Before any splits, root[0] points to a leaf node which is addressed by
0 bits. After one split, root[0] is addressed by 1 bit. What
happens when the leaf at root[0] is addressed by all the bits
available (10 in the demonstration), but needs to split?
root[0] == offset of leaf 0
Keys with 0 in bits 54-63 map to root[0].
There isn't room in the root for two leaf pointers with 11-bit
addresses. There are only 10 bits worth of child pointers and key
available at this level of the trie.
In such cases, the leaf is replaced by a new internal node with just
one child -- the leaf itself.
root[0] == internal node 0.
internal_node_0[0] = leaf 0
keys with 0 in bits 54-63 map to root[0]
all keys under root[0] map to internal_node_0[0]
The insertion is retried, causing an immediate split and resulting in
this situation:
root[0] == internal node 0.
internal_node_0[0] = leaf 0
internal_node_0[1] = leaf 1
Keys with 0 in bits 54-63 map to root[0].
Of keys under root[0]:
Keys with 0 in bit 44 map to internal_node_0[0];
Keys with 1 in bit 44 map to internal_node_0[1].
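In code, the sibling-index computation used throughout the example
above might be written as follows (a sketch; node_size is the number
of child pointers per internal node, 1024 in the demonstration
implementation). With it, sibling_index (0, 0, 1024) yields 1,
sibling_index (1, 1, 1024) yields 3, and any call with 10 significant
bits fails, matching the cases illustrated.

/* Illustrative: when a leaf addressed by `significant_bits` bits of
 * key, at child-pointer slot `index`, splits, its new sibling goes at
 * the index formed by prepending a "1" bit.  Returns -1 when that
 * index does not fit in the node, in which case the leaf must first
 * be replaced by a new internal node as described above.
 */
int
sibling_index (int index, int significant_bits, int node_size)
{
  int new_index;

  new_index = index | (1 << significant_bits);
  if (new_index >= node_size)
    return -1;
  return new_index;
}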
3.5 Transactions, Concurrency
The implementation of transactions for PHTs is very simple. It is
based on the observation that if a writer uses a root node shared by
no other process and, furthermore, obeys the TRC Invariants, then the
writer is safely isolated from concurrent readers. All that remains
to implement transactions is to reliably communicate successive root
pointers from writers to readers (the COMMIT operation).
In this section, the essentials of transactions and concurrent access
are explained, but details pertaining to ALLOC and FREE are omitted.
They are explained in the next section.
Every write transaction creates a new root node. In each of the first
two pages of the database, we store a root pointer, and a sequence number:
struct commit_record
{
disk_addr root_location;
int sequence;
};
Call these two records A and B. At any given time, we preserve the
invariant that:
if (A->sequence > B->sequence)
A->root_location is the most recently committed root
and should be used by new transactions
else
B->root_location is the most recently committed root
and should be used by new transactions
"transaction initiation algorithm"
Every reader and writer begins by finding the most recently committed
root. Writers proceed by making a copy of that root in preparation
for making modifications to the copy.
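As code, the transaction initiation algorithm might look like this
(a sketch; read_commit_record is an assumed helper that reads record A
or B from the first two pages of the database):

/* Illustrative: pick the commit record naming the most recently
 * committed root.
 */
struct commit_record
most_recent_root (struct pht * db)
{
  struct commit_record a;
  struct commit_record b;

  read_commit_record (db, 0, &a);    /* record A: first page */
  read_commit_record (db, 1, &b);    /* record B: second page */

  /* the record with the larger sequence number names the most
     recently committed root */
  return (a.sequence > b.sequence) ? a : b;
}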
The COMMIT algorithm is quite simple:
/* write out every page dirtied by this transaction ... */
flush_modified_pages ();
/* ... and wait for those writes to reach stable storage before
   either commit record is touched */
sync ();
if (A->sequence > B->sequence)
  {
    /* A is current; overwrite the out-of-date record B */
    B->sequence = A->sequence + 1;
    B->root_location = current_root;
    write_commit_record_to_disk (B);
  }
else
  {
    /* B is current; overwrite the out-of-date record A */
    A->sequence = B->sequence + 1;
    A->root_location = current_root;
    write_commit_record_to_disk (A);
  }
ABORT makes no changes to the storage device. In the demonstration
implementation, it throws away modified pages cached in memory and
reconstructs the state of the root node at the start of the
transaction.
A system failure prior to a commit simply aborts the current
transaction by leaving the two commit_records unmodified. It doesn't
matter whether the writing process that failed has written on any
pages it allocated during the aborted transaction -- those pages
are implicitly returned to the free pool by ABORT.
If a system failure occurs _during_ a commit, our assumptions about
storage devices combine with the structure layout of a "commit_record"
to ensure that one of two conditions holds:
1. If the system failure occurs before any bits of the
"sequence" field of the commit record have been
written, then the commit record will be considered
out of date and ignored by the "transaction initiation
algorithm". The outcome is the same as an aborted
transaction.
2. If the system failure occurs after any bits of
the "sequence" field have been written, then the
commit record may or may not be chosen by subsequent
transaction initiations. It doesn't matter. In either
case, a valid root pointer will already have been
stored in the commit record. If the commit record
becomes active, that root pointer will be used and it
is as if the transaction had completed normally.
Otherwise, the outcome is again the same as an aborted
transaction.
3.6 Memory Allocation
Two of the fundamental PHT operations are ALLOC and FREE for
allocating and releasing storage space in page-sized increments.
One of the TRC Invariants is:
+-------------------------------------------+
| 2) Writes are permitted only on pages |
| allocated during the current transaction. |
+-------------------------------------------+
Naturally, the memory manager must violate that invariant. If a page
is passed to FREE, eventually that page must be written on again.
In fact the memory manager need only make sure that a page passed to
FREE is not recycled until all readers are "done" with that page. A
reader is certain to be "done" with that page if it has caught up to a
transaction which is later than the transaction in which the page was
freed. How this can be gauged is application specific. The
demonstration implementation uses a simple locking protocol. Some
applications could get by with even less than that.
Another PHT invariant is:
+-------------------------------------------+
| 3) Any transaction (read-only or rd/wr)   |
| may only examine pages allocated in or    |
| before the transaction it observes, and   |
| not subsequently passed to FREE.          |
+-------------------------------------------+
That invariant is supported by the memory manager just described. The
invariant permits a page examined by a reader to have been freed in a
transaction later than the one with which the reader is caught up
-- but not in any earlier transaction. Once all readers catch up with
a transaction subsequent to the FREE, the memory manager is free to
re-use the page.
3.7 Caching
In the demonstration implementation, data from PHT files is cached in
memory by a simple page manager. This is critical for acceptable
performance.
Cache consistency is currently maintained by brute force: CATCH_UP not
only finds a new root pointer, it clears the entire cache for the
database file. This implementation of cache consistency favors
long-lived read transactions. Shorter read transactions would have to
flush with greater precision or lose the benefit of caching entirely. This
is discussed further in the section on extensions.
Caching in write transactions presents interesting opportunities which
the demonstration implementation does not yet exploit. For example,
in-core pointer sorting is fast, which suggests that buffered writes
can be ordered optimally for the underlying storage device before
being made. [Hellerstein]
4. Theoretical Properties
In this analysis, we cheerfully presume a uniform distribution of
return values from choose_some_bits. This is most easily accomplished
by using addresses that look like random numbers, and a
choose_some_bits function that simply extracts a sequence of
contiguous bits from an address.
In practice, when using a PHT, we sometimes deliberately choose
addresses which are _not_ uniformly distributed in order to exploit
locality. For example, the demonstration algorithm is such that
addresses which differ only in low-order bits (namely, the bits
returned by "choose_some_bits" for the greatest possible trie depth)
are likely to be physically situated on the same page. But wherever
locality is not specifically desired, an even distribution is the
goal, and this section makes some predictions about the performance
when addresses are evenly distributed.
The REF and SET algorithms use as few bits from keys as possible to
select leaf nodes. By our assumption of even distribution, leaves
(except for the initial leaf) are typically created in a half-full
state. There is always at least one leaf.
N = number of values stored in the database
B = number of buckets per leaf node
L = the number of leaves
= max (1, 2 * N/B)
For all but the very smallest databases, L is simply "2 * N/B" (or
less).
Let's consider the generation of interior nodes just prior to the
leaves. Members of this generation are created having just one child
pointer (which then immediately splits into two).
Our distribution-of-keys presumption assures us that the tree will
tend to be of uniform height and will fill evenly, so we can expect
members of this second-to-last generation of nodes to be 1/2 full, on
average. There is always at least one node in this generation (the
root node, if the trie is only of depth 2).
C = the number of child pointers an interior node can hold
I0 = the number of nodes in the second-to-last generation
= max (1, 2 * L/C)
We also want to consider the worst case for this generation, when
nodes are mostly empty because of a recent round of splits:
I0_worst = max (1, L/2)
For large databases, I0 is simply "2 * L/C" and I0_worst is "L/2".
Other interior nodes, those not in the second-to-last generation, tend
to be full. The trie doesn't become deeper until they are.
I1 = the number of nodes not in the second-to-last generation
= I0/C
The total number of pages used by the trie is therefore:
T = L + I0 + I1
T for small databases
= 1 + L
= 1 + 2 * N/B
T for large databases
= 1 + L + I0 + I1
= 1 + 2 N/B + 4 N/(B C) + 4 N/(B C^2)
Small databases, with fewer than (B * C/4) entries, are likely to be
only two layers deep (a root node pointing directly to bucket nodes).
Databases can hold up to about (B * C^2 / 4) entries and still have an
expected depth of just three layers.
The number of internal nodes is small compared to the total size of a
database. In most situations, it should be possible to cache all
internal nodes in memory.
The storage efficiency of a hash trie can be estimated by comparing
the amount of space needed to store the keys and values without any
indexing structure, to the amount of space needed to store them in a
PHT:
P = the number of key/value pairs that fit in a page
IDEAL = the minimum number of pages possible (assuming random keys
and values)
= N/P
E = the storage space inefficiency factor for a trie
= T/IDEAL
= P * T/N
For a large database:
= P * ((1/N) + 2/B + 4/(B C) + 4/(B C^2))
And since N is large, the inefficiency doesn't vary much with the
number of elements:
approx. = P * (2/B) * (1 + 2/C + 2/C^2)
When the second-to-last generation is in a bad case (using
I0_worst instead of I0):
I1_bad = the number of nodes not in the second-to-last generation
= I0_worst/C
= L/(2 C)
T_bad for large databases
= 1 + L + I0_worst + I1_bad
= 1 + L + L/2 + L/(2 C)
E_bad = the storage space inefficiency factor for a trie
= T_bad/IDEAL
= P * T_bad/N
= P * (1/N + L/N + L/(2 N) + L/(2 C N))
= P * (1/N + 2/B + 1/B + 1/(B C))
approx. = P * 1/B * (2 + 1 + 1/C)
approx. = P * 3/B
The demonstration implementation uses 64-bit keys and values, 32-bit
file offsets, and 4096-byte pages. This implementation has an average
inefficiency of about 2, meaning that indexing multiplies the storage
requirements for the key/value pairs by a factor of 2. When the
all-important last-internal-node generation is in its worst state, the
inefficiency rises to a factor of 3. This database should hold up to
about 65536 entries before trie depth exceeds 2. It can hold up to
around 67108864 entries (2^26) with a trie only 3 layers deep. (The
depth of the trie is also the number of pages which must be examined
or modified to read or write a value.)
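These figures follow directly from the formulas above. The following
small, illustrative program plugs in the demonstration parameters
(B = 4096/16 = 256 buckets per leaf, C = 4096/4 = 1024 child pointers
per internal node, P = 256 pairs per page) and reproduces the
inefficiency factors of roughly 2 and 3 and the 2^16 and 2^26
capacity estimates:

#include <stdio.h>

int
main (void)
{
  double B = 256.0;     /* buckets per leaf */
  double C = 1024.0;    /* child pointers per internal node */
  double P = 256.0;     /* key/value pairs per page */

  double E      = P * (2.0 / B) * (1.0 + 2.0 / C + 2.0 / (C * C));  /* ~2 */
  double E_bad  = P * 3.0 / B;                                      /* ~3 */

  double depth2 = B * C / 4.0;        /* ~65536 entries at depth 2 */
  double depth3 = B * C * C / 4.0;    /* ~2^26 entries at depth 3 */

  printf ("E ~ %.2f, E_bad ~ %.2f\n", E, E_bad);
  printf ("depth 2 up to ~%.0f entries, depth 3 up to ~%.0f entries\n",
          depth2, depth3);
  return 0;
}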
Note that we have counted only the inefficiency of the REF/SET data
structure. If many large records (page size or more) are stored in
the database, the overall space inefficiency of the database will be
much _lower_.
5. Uses for PHTs
With a little stretching of the imagination, more or less direct uses
for PHTs can be found, but they seem quite specialized. One example
would be a simple database mapping social security numbers to small
integers or to page-sized chunks of data.
We can also build higher-level data structures on top of PHTs. For
example:
5.1 Nested PHTs (multiple address spaces)
There is no reason to stop at just one root pointer. We can keep more
than one trie in the same storage arena. There is no need to
complicate the commit protocol either: Instead, one trie is
designated "primary". It is the root ptr for the primary trie that is
used for the commit protocol. Pointers to all of the other tries are
stored at "well known addresses" in the primary trie.
Nested PHTs are useful for achieving REF/SET name-space separation.
Aside from having to look up and perhaps SET a root pointer in the
primary trie, there is no additional overhead when using nested PHTs.
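In terms of the basic operations, fetching a secondary trie's root
might look like this (CONFIG_TRIE_KEY is a made-up well known key, not
one defined by the demonstration implementation):

/* Illustrative: each secondary trie's root pointer is kept at a
 * fixed, well known key of the primary trie.
 */
#define CONFIG_TRIE_KEY  ((key_type) 1)

disk_addr
secondary_root (disk_addr primary_root)
{
  return (disk_addr) ref (primary_root, CONFIG_TRIE_KEY);
}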
5.2 Inodes
The unix operating system presents the abstraction of an "inode". An
inode is a file, identified by a device number and an inode number
which is unique to that device.
A unix file has some very convenient properties:
1. A record is kept, for each inode, of the
size of the file, the times of its most recent
modification and access, and some operating-system
specific information such as the user id of an
"owner" and a set of access permissions.
2. A unix file may be any length and its length may be
increased or decreased dynamically.
3. Portions of a file to which data has never been
explicitly written are initialized to all 0 bytes.
Large, sparsely filled files are represented
in a space-efficient manner.
4. Random access is efficient for large reads and writes.
Every file-system advises an ideal "block size".
5. The contents of an inode may be modified but its number
never changes.
A unix inode, or anything like it, is a sensible abstraction for many
kinds of data one might want to store in a database.
It is straightforward to implement inodes using the basic PHT
operations. The demonstration implementation includes an inode layer
that works roughly as follows:
1. To each inode X, we assign a range of addresses:
(X << 32) ... (((X + 1) << 32) - 1)
We permit inode numbers to have no more than 32 significant
bits.
2. We use a nested PHT in which to store inodes.
3. We store a pointer (disk address) for block N of inode
X at this address of the nested PHT:
((X << 32) | (N + 1))
4. We store inode statistics information for inode X at address:
(X << 32)
Statistics information for one inode takes up an entire database page
(4K). If the file is small enough, it is stored with the statistics
information. Otherwise, the file is divided into blocks and stored
using the addressing scheme outlined above.
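As code, the address computations of steps 1, 3, and 4 might be
written as follows (illustrative helpers; key_type is the 64-bit PHT
address type used in the earlier sketches):

/* key of the statistics page for inode X */
key_type
inode_stats_key (unsigned long x)
{
  return ((key_type) x) << 32;
}

/* key of block N of inode X */
key_type
inode_block_key (unsigned long x, unsigned long n)
{
  /* the "+ 1" keeps block keys distinct from the statistics key */
  return (((key_type) x) << 32) | ((key_type) (n + 1));
}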
This representation automatically provides for "holes in files". No
storage is taken up representing regions of the file that have never
been written. If a block is missing, the inode layer assumes that
block is filled with 0 bytes.
If the file is large, the statistics information includes a bitmap
that summarizes which regions of the file contain non-0 data. That
summary is useful in the event the inode is deleted: it enables the
inode layer to find and FREE all blocks associated with the file.
Within one transaction, when an inode block is first written, it is
copied. Subsequent writes to the same block during the same
transaction avoid copying the block.
Not only does this implementation provide the 5 convenient properties
of unix files outlined above, it also adds transaction semantics to
concurrent access to inodes and makes
stronger recovery guarantees than typical unix file system
implementations.
5.3 Embedded Indexes
File-based indexing packages that operate on unix files are easily
ported to store their data in PHT inodes instead. When they do so,
without any further modification, their operations honor PHT
transactions and make the same strong recovery and reader isolation
guarantees as the underlying PHT. A PHT is like a "magic file system"
that instantly turns database libraries with weak guarantees into
database libraries with strong TRC guarantees.
What effect does using a PHT have on the performance of these indexing
libraries? We have optimistic expectations.
Consider a system involving around 400,000 blocks of indexing data
(1.5G) stored in inodes embedded in a PHT. The hash trie which
indexes those blocks of file data will contain 400,000 entries, data
which ideally occupies 6.4 M. A perfectly even distribution of block
index keys, which the implementation sketched above does not provide
but which it could be modified to provide, would lead to a hash trie
between 12.8 M and 19.2 M in size: a space overhead of less than 1%.
Such a small amount of block indexing data could be held entirely in
RAM for applications that required very high performance.
5.4 Inversion File System
Unix and unix-like operating systems use a process abstraction which
includes the notions of a root and current directory for every
process, and a tree-structured name-space of files (and file-like
objects). If that file system is stored in one or more PHTs, we can
extend the abstraction with transactions, implicit indexing, and
probably many other features besides. The added functionality of a
PHT costs little. It is similar in size and performance to a more
traditional file-system implementation. [Leffler] [Linux-ext2]
6. Extensions
6.1 Extension: Multiple Concurrent Writers
There are a variety of ways in which the problem of permitting
multiple writers to operate on a PHT concurrently may be approached.
To name a few:
1. If transactions are suitably localized, many
small databases can be used instead of one
large one. This is not a general solution but
where it applies it gives the best performance.
2. By using shared memory and serializing calls to
PHT operations, multiple processes can cooperate
in a single write transaction.
3. Writers can avoid writing any data at all except when
the entire database is locked, and then try to
lock the database for as short a time as possible.
This can be done by buffering changes during a transaction,
then, just before a COMMIT, locking the database,
reading its current state to make sure the changes
of this transaction don't interfere with what other
concurrent writers have done, writing the changes and
finishing the COMMIT.
This approach creates a new condition (interference) under
which a COMMIT can fail. If the cost of interference
checks can be kept low, and if the implementation and
application can make interference unlikely, this is a
very efficient approach to multiple writers.
4. Each writer can create a new database file which contains
only the pages modified by that writer. Unmodified pages
continue to be read from older database files. Which
database file is current changes with each successful
COMMIT. (This represents an extension to the COMMIT
protocol.)
Two writers can operate independently, completely without
interference or locking, but when writer A issues a COMMIT,
writer B will be operating in an "alternative time-line" --
that is, writer B's view of the database will be out of
date because it does not take into consideration the effects
of writer A's transaction.
Such cases can either cause writer B's transaction to fail,
or call for a "time-line merge" algorithm in which B
attempts to incorporate the modifications made in
transaction A. A merge algorithm is similar to the
interference check algorithm in (3), but does not require
the writer to lock the database while running the merge.
If merging is inexpensive, this technique can handle many
concurrent writers.
-------------------------------------------------------
NOTES
[ARRAYS]
By "array", we mean an abstract data type which is a randomly
accessible mapping from keys to values. In many cases we will want
the keys to be of a small, fixed size, but we also mean to include
cases where keys are arbitrarily long, requiring only that they be
completely ordered.
It is straightforward to make a constructive theory (and perhaps from
this, a recombinant implementation) of implementation techniques for
arrays.
The simplest technique is to store an array in contiguous memory
(presuming that keys are of a small, fixed size). At some cost in
access performance, other familiar techniques (such as hash tables or
tries) facilitate arrays which handle arbitrary keys or which are
relatively space efficient either for their size or their sparseness.
We can conceive of a variety of hybrids such as those formed by
dividing an array's address space into separate regions, and handling
each region by its own implementation technique; or by creating a
primary array and secondary cache using two different implementation
techniques.
Some implementation techniques for arrays (hash tables, tries,
hybrids) decompose the problem of representing one large array into
the problems of representing one or more smaller arrays. Consequently
these techniques can be stacked arbitrarily high. PHTs turn out to be
just one of many possible combinations; they are the trie
implementation technique, composed with two kinds of hash-table
techniques, which are in turn composed with the contiguous memory
technique.
Another real world data structure that emerges from this space is the
Berkeley Fast File System. Inodes in that system (files, considered
apart from the structure of directories) are represented as a
(possibly sparse) array of blocks. This array is implemented by the
region splitting implementation technique composed with the trie
implementation technique composed with the contiguous memory
technique: the array of blocks is divided into three regions,
represented by tries of depth 1, 2, and 3, each exponentially larger
than the last.
Yet another real world data structure that emerges from this space is
the original unix DB hash-table library by Ken Thompson. That library
is an extensible hashing hash-table algorithm composed with a unix
file system implementation such as the Berkeley Fast File System (DB
relies on unix's ability to represent sparsely filled files with
space-efficiency.) That particular composition, Thompson's DB with
the Berkeley FFS, bears a striking resemblance to PHTs once the
resulting system is viewed in its entirety (considering the structure
of the file system and a Thompson-DB file as one continuous data
structure).
Should it turn out that many of these combinations are useful (which
is not proven) then a thorough theory of them, backed by an
implementation consistent with the theory, would be of value. A
"thorough theory" would provide simple apparatus for predicting the
performance of a particular combination under a wide variety of easily
characterized, realistic conditions and a consistent implementation
would allow engineers to easily assemble any particular combination
and obtain a data structure that actually performed according to the
theory.
[DISK]
Throughout the paper we sometimes say simply "disk" when a cumbersome
phrase like "underlying storage device" might be more accurate. We
use these phrases to name an important abstraction, but one for which
there is no widely accepted formal language.
To be explicit, we use "disk" or "underlying storage device" to mean
any device presenting an interface to a byte-addressable memory which
is not (ordinarily) directly read or written by the CPU but which is
read or written in (possibly fixed size) blocks. We expect there to
be a uniformly small upper bound on the delay-to-completion of an I/O
operation, but note that the lower bound varies as a function of the
preceding I/O operations and perhaps also as a function of time and
that sometimes these variations are interesting to consider for the
purposes of engineering real-world, high-performance systems.
Some literature has begun to use phrases such as "secondary storage
device" to name what we call a "disk". Both names are problematic
because taken literally, they refer to incidental and historically
contingent properties of the abstract category of storage devices that
is really under consideration: "Secondary" refers to a particular
position in a "memory hierarchy", the detailed structure of which is
subject to multiple interpretations and change. "Disk" refers to the
physical geometry of a particular subset of the abstract category of
storage devices under consideration. Of these two problematic terms,
we choose "disk" because it is already informally used in our abstract
sense.
By analogy, we might choose the term "tape" rather than "tertiary
storage" to refer to any storage device which is disk-like except that
the upper bound on the delay-to-completion of an I/O operation varies
significantly as a function of the preceding history of I/O
operations.
-------------------------------------------------------
Citations
[Hellerstein] Conversation.
[Leffler] "The Design and Implementation of the 4.3 BSD UNIX Operating
System"; Samuel J. Leffler, Marshall Kirk McKusick, Michael J. Karels,
John S. Quarterman; Addison-Wesley Publishing Company, 1989
[Linux-ext2] The EXT-2 file-system for Linux.; Remy Card, Linus
Torvalds, Stephen Tweedie; Available on the Internet.
[Seltzer] "LIBTP: Portable Modular Transactions For UNIX"; Margo
Seltzer, Michael Olson; Available on the Internet as part of the
distribution "db-2.3.10".