RethinkDB internals: Patching for fun and throughput

A central challenge in developing a high performance database is to maximize the efficiency at which hardware resources are used. By implementing a new optimization in how RethinkDB handles disk writes – the patch log – we were able to achieve an additional 50% increase in query throughput. The following graph compares the sustainable performance of RethinkDB for a mixed read/write workload (roughly 10% write operations) with and without the patch log. In both cases we configured the system to be bound by the disks' write bandwidth.

1

Like most databases, RethinkDB uses an in-memory page cache to allow for efficient data processing. Modifying a data block involves the following steps:

  • check if the block is in the page cache

  • if the block is not in the cache: acquire the block from disk

  • perform the actual modification on the in-memory page

  • mark the page “dirty”

Eventually the page cache will fill up, so no more data blocks can be acquired unless space is freed. This requires evicting pages from the page cache. However, some of the pages might have been modified. To preserve these modifications, pages marked as “dirty” must be written back to disk. There are different approaches to when such a write-back should happen. In RethinkDB, we write data back periodically, and trigger an immediate write-back whenever the number of dirty pages in the page cache exceeds some limit.

2
Now, there is one problem with this approach: Consider five updates which set five different rows/keys to new values. RethinkDB uses a B-tree data structure to organize data internally; most databases use similar tree data structures to index and/or store data. As a consequence, updates to five random keys have a good chance of affecting five different parts of the tree. This has consequences for the next write-back. Even minor modifications to five different keys involve writing five complete blocks to disk. Depending on the workload, this can easily lead to a saturation of disk write throughput, which in turn limits the overall database performance as data from the page cache cannot be evicted fast enough.

RethinkDB tackles this problem by using a system of serializable page patches. A patch is a data structure representing a certain modification to a block. Patches can operate on different layers of the database. On lower levels we use patches that replace a chunk of data in a page or move data around. On higher levels we have patches which insert a key/value pair into a leaf node of the B-tree or remove one.

The basic idea is to reserve some space for a “patch log” in the database. To come back to our previous example: using patches, the changes made by our five update operations can be stored much more efficiently. For smaller updates, all required patches may well fit into a single block. Instead of writing all five affected data pages back to disk, we simply write five small patches to the patch log. At this point, we have reduced the required disk write throughput by a factor of five.

3

Of course things are not as simple as that. Eventually our reserved log space will fill up. To free up space, we now have to apply some of the patches to the affected pages and rewrite the actual data blocks. It seems like we have just delayed the work of writing back data by using the patch log. Even worse, we spent additional time for writing the patches.

Happily another effect kicks in to save us: the patch log has aggregated patches over a long period of time. Chances are, additional updates have been performed in the same data blocks as the first five updates. This allows us to batch multiple updates for the block into a single write-back. The overall efficiency of the log depends on its size and the size of the subset of data in the database that is frequently modified. Additionally, a great boost in efficiency can be achieved by carefully tuning the decision of whether to write a patch or to rewrite the affected block instead. We have implemented automated tuning logic to guarantee low overhead and great write throughput across a wide range of workloads without any need for manual configuration.

There is one last part which we have not covered so far: When acquiring a block from disk, it must be guaranteed that the in-memory page for that block always reflects the latest changes. Specifically, this means that patches with changes that have not made it into the actual on-disk block yet must be replayed when acquiring the block. To make this fast, we load all on-disk patches into an efficient in-memory data structure when starting up the database. As the size of the patch log is limited, this only consumes a small amount of main memory. Therefore we just have to check the in-memory log data structure to decide whether patches have to be applied. To replay only the subset of patches which have not yet been applied to the on-disk data block, a sequential “transaction id” is stored in the block whenever a block gets written back to disk. Each patch on the other hand contains the transaction id to which it applies. If the block's transaction id is higher than the one of some patch, the patch is already reflected in the block's data and must not be replayed again.

Overall, the patch log turns out to be a powerful technique for optimizing the translation of in-memory data modifications to disk writes. According to our benchmarks, it allows for significant performance increases both when writing to SSDs and rotational drives.

You should make it a portal

In a recent interview with Entrepreneur magazine, Paul Graham said something that reminded me of an interesting story:

When Google started, there were eight to 10 successful established search engines already, and search was so uncool that they were trying to get people to call them "portals".

I landed my first real programming job at a long defunct dot-com startup when I was a senior in high school in 1999. We were building an EBay for home improvement. Homeowners would post home improvement projects, and contractors would bid on them. We were going to take a cut out of each job.

One day the company's founders called a meeting. They met with a top-tier investor and wanted to relay the feedback to the team. "He loved it", they said. "The only piece of feedback he gave is that we should make it be more of a destination site. We should be building a portal." So we did. We delayed the product to integrate a to-do list, a calendar, events management, and various collaboration features so that users would spend more time on the site.

The company ran out of money before we had a chance to launch. We rationalized the failure by telling ourselves and others that the dot-com bust prevented us from raising a second round of funding, but the real reason was that we blew through a million dollars with no real product to show for it. In hindsight, "the portal" wasn't the only reason for failure, but it was a large contributing factor (at the time we were seriously considering licensing "Come Together" from The Beatles for a TV commercial, the idea being that home owners and contractors come together on the site).

What's most interesting about this story is that both the investor and the founders were very smart people. The technical cofounder was from CMU, the business cofounder was from Yale. Both have gone on to do very interesting, intellectually challenging things. It's very easy to credit our own intelligence for success and others' ineptitude for failure, but stories like these suggest that our behavior often has much more to do with institutional (or social) knowledge than it does with our own mind.

What is going to look ridiculous to us ten years from now about the current startup mentality?

On TRIM, NCQ, and Write Amplification

Alex Popescu wrote a blog post asking some questions about RethinkDB and SSD performance. There is a related Twitter conversation happening here. There are two fundamental questions:

  • In which cases does SSD performance begin to degrade over time?
  • How does the TRIM command affect performance degradation?

The questions are very deep and I cannot do them justice in a single blog post, but I decided to post a quick write-up as a start.

Flash basics

Flash memory has physical properties that prevent overwriting a data block directly. Before new version of the data can be written, a special ERASE command must be called on the block. Erasing a block is significantly slower than writing to a block that has already been erased, so when software tries to overwrite a block of data, SSD drives automatically write the new block to a pre-erased area, and erase obsolete blocks in the background. This technology (usually referred to as Flash Translation Layer) allows for fast write performance on modern SSDs.

There are two complications associated with erasing blocks. The first is that flash memory only allows erasing relatively large block sizes (128KB and above). This means that if there is a significant fragmentation on the drive, before a block can be erased, live data must be copied to a different location on the drive. The rate at which this happened is called write amplification factor. Write amplification negatively affects write performance because out of a fixed number of write operations that can be done per second, some number of operations must be spent on copying data within the drive. The higher the internal fragmentation (and therefore write amplification), the lower the perceived performance of the drive.

The second complication is that SSD drives often lack sufficient information to determine whether a particular block is no longer useful. Consider a file system where many 4KB files are created and soon deleted. If the file system reuses space (i.e. places new 4KB files on the same block device locations where the deleted files used to be), SSD drives will write the file to a different physical location to avoid a slow ERASE command, but will note that the old space no longer contains useful data. However, if the file system does not reuse space and places every new file onto a new block device location, there is no way for the drive to know that the old file has been erased. If the workload persists for a considerable amount of time, from the point of view of the SSD drive, most flash space will soon be filled, and the drive will no longer be able to efficiently remap writes and run erase in the background - it will be forced to do a slow erase command for every write, killing performance.

The TRIM command

In order to get around the second issue, SSD drives and most modern operating systems support a TRIM command. Sending the TRIM command to the drive lets it know that the given interval of the block device is no longer in use. This lets the drive garbage collect the data and continue running its translation algorithms efficiently.

An additional problem is that the TRIM command comes with a penalty. Modern SSD drives are parallelized devices that allow performing multiple operations in parallel. In order to get peak efficiency, multiple concurrent requests have to be sent to the drive. This happens via a Native Command Queuing protocol (or NCQ). Modern drives can run up to 31 commands concurrently and often achieve peak performance when there are close to 31 commands in the pipeline. The problem with TRIM is that on many interfaces it requires stalling the NCQ pipeline. That is, when a TRIM command is issued, all 31 commands must be finished while no other commands can go in. Then the TRIM command goes into the drive, and then the pipeline of traditional read and write commands accumulates again.

This is not a big issue on desktop systems where TRIM can be sent during idle time, but it is a huge problem on server systems that do not encounter much idle time. It's possible to get around this issue by grouping blocks that can be trimmed and sending fewer TRIM commands for larger blocks, but this makes disk layout significantly more complicated. For this reason most systems don't use the explicit TRIM command in server-grade environments. Instead, it's much more efficient to lay out the data in a way where the drive can figure out which data is no longer useful implicitly.

Performance

So, the performance issues are not related to TRIM, and have much more to do with internal fragmentation of the drive over time. If special care is not taken to layout the data in a certain way, performance degradation will always happen for write-heavy workloads with little idle time (which is a very common scenario in server environments). This is not database specific and will happen for file systems in exactly the same way.

RethinkDB internals

RethinkDB gets around these issues in the following way. We identified over a dozen parameters that affect the performance of any given drive (for example, block size, stride, timing, etc.) We have a benchmarking engine that treats the underlying storage system as a black box and brute forces through many hundreds of permutations of these parameters to find an ideal workload for the underlying drive.

We've also built a transformation engine that converts the user workload on the database into the ideal workload for the underlying drive. For example, if you modify the same row (and therefore, the same B-Tree block) twice, our transformation engine may decide to write the second block into the same location as the first one, a location right next to it, a completely different location, etc. For every drive we test, we take the ideal parameters for that drive and feed them back into the production system. This lets us get up to a factor of ten performance improvement in the number of write operations per second we can push through a drive. This happens by a combination of "implicit TRIM", and a number of emergent properties of the drive that our benchmarking engine discovers. We can benchmark a given drive very quickly by treating it as a black box, instead of spending months or years reverse engineering the behavior of every drive on the market.

Here is a graph for an IO bound workload run on a commonly used SSD drive. The workload is heavily biased towards reads (the ratio is roughly 4:1), but still does enough writes to gain significant performance improvements from the disk.

Media_httpwwwrethinkd_jzmwe

RethinkDB internals: the caching architecture

A few days ago Salvatore Sanfilippo wrote about his work on a new Redis diskstore - a reworking of the old VM layer based on a different set of assumptions and tradeoffs. I was very interested in the new approach because it is very similar to the RethinkDB caching architecture in principle. When we designed our caching layer we had a number of constraints we needed to satisfy:

  • Performance - the system needed to work at memory speeds when the data fits into RAM, as well as when the data far exceeds the amount of available RAM, but the active working dataset mostly fits into RAM. Once the active dataset no longer fits into RAM, the system still needs to work at the speed of the underlying storage layer.
  • Durability - some clients require every transaction to be durable, others are willing to lose a few seconds worth of data, yet more clients are willing to put up with more significant data loss to sustain high write performance. We needed a system that allows the users fine control of latency, performance, and durability, and still performs well even under the most strict durability requirements. In order to satisfy this requirement we need fine tuned control of when a given page is flushed to disk (to maximize the possibility to combine multiple changes into a single flush), and when we receive the acknowledgement (so that we can tell the client immediately, or after the data has been persisted, depending on their needs).
  • Robustness - there are various edge cases around the size of available memory cache, the distributions of key and value sizes, the number of concurrent clients, network pipelining considerations, etc. We needed a system that works well across all possibilities.
  • Modularity - the caching subsystem needed to be independent of the data structure layers above it, and the storage layer below it. Currently RethinkDB indexing is implemented using a B-Tree, but we may support other data structures (such as a Hash) in the future. On the other side, the storage subsystem is designed to adapt to the underlying storage device, often generates very different workloads to get good I/O performance, and must be entirely decoupled from the caching layer.
  • Systems considerations - the underlying operating and storage systems have a number of peculiarities. They support different asynchronous IO APIs, require different sizes of blocks and various location distributions for optimal performance, blocks need to be allocated in RAM on various offsets to work with the DMA controllers, etc. We need to be able to satisfy these considerations in the caching and storage layers.

The first potential solution to all of the contraints above is to use the page (buffer) cache from the underlying operating system. The data store file can be memory-mapped into the address space, and the kernel can manage all of the underlying details. However, after prototyping this solution we dismissed it because of the following reasons:

  • The page cache does not allow the application fine tuned control over the page reclamation policy. In case of a database, pages can be arranged in a dependency graph where a group of pages cannot be used by the system if one of the pages is flushed (for example, if the root of a B-Tree is flushed out of the page cache, none of the remaining pages can be used for new queries). This can significantly improve the performance of the page reclamation algorithm by empowering it with extra knowledge and by significantly simplifying the necessary bookkeeping.
  • The page cache does not allow the application fine tuned control over the flushing policy. Furthermore, it is not possible to implement a storage layer that intelligently writes to certain locations on the disk independently of the memory usage (say, a log-structured system) without implementing a custom file system - a significant development and deployment complication.
  • On some kernels (for example, on Linux) the asynchronous disk I/O API (io_submit and friends) works reliably only with direct I/O turned on, unless you're using certain specific file systems (currently, XFS). In our testing asynchronous disk APIs performed better than user-space threadpools, and we wanted the system to use it reliably across different file systems and block devices.

In principle, the high level caching layer architecture is similar to the kernel's buffer cache and is illustrated by the following diagram:

Media_httpwwwrethinkd_cyuxu
The high level data structures know nothing about the storage layer and only deal with the cache. This gives us the flexibility to implement anything from an unbacked (volatile) cache for testing, to a fallthrough cache that uses no memory and always goes to the storage layer, to a cache that implements different reclamation and flushing policies, backed by different storage layers that handle the specifics of disk performance.

The current (and future) data structures interact with the cache by asking for specific page IDs and access requirements. If the cache contains a given page in memory, it returns it to the layer above. On the other hand, if the page is not in RAM, the page cache asks the underlying storage system to asynchronously schedule an I/O request, and tells the above layer that the page isn't present. This allows the system to move on to handling a different client request without the application blocking, and when the I/O request is satisfied by the storage system, the cache calls the right layer to tell it to proceed with the operation.

Interestingly, even though the cache knows nothing about the data structure the pages are arranged in, the data structure is maintained implicitly in the cache, and the higher level components are merely functions that transform the data structure in the cache from one state to the next. For example, we may have a B-Tree split operation that splits a page into two pages by adding an extra page to the cache and shuffling around child IDs in other pages, but there is no additional "B-Tree structure" - everything about the B-Tree is automatically in the cache, and therefore on disk. This takes care of persistence automatically, as long as the data in the pages only relies on IDs of other pages and not volatile memory pointers.

In addition to satisfying all of our initial goals, this system gives us a number of additional benefits:

  • Flexibility - any number of reclamation and flushing algorithms can be implemented, tested, and measured independently of the rest of the code. Furthermore, we can develop and deploy any number of storage algorithms from standard, to log structured, to networked, in user-space and have full control of how we perform I/O.
  • Portability - our code will work the same way on every operating system and will not depend on different page cache implementations.
  • Concurrency control - when the cache contains the dependency graph of the pages, concurrency policies can be managed automatically by the caching layer without interaction with other components. For example, point-in-time copy-on-write for a given B-Tree can be trivially implemented entirely by the cache.
  • Data streaming - when accessed properly from the higher level structures, very large values (even gigabytes in size) can easily be supported because they can be managed (and therefore, loaded and unloaded by the cache) a few pages at a time.
  • Architectural benefits - a hierarchy of caches are a good starting point for implementing transactions. For example, when a transaction starts it's trivial to create a new (possibly disk-backed) cache for it that points to the original cache as the parent. At the end of the operation, if the transaction is committed, the cache is merged with the parent, while if the transaction is rolled back, the cache is discarded.

The caching layer has been one of the most interesting pieces of our infrastructure from both algorithmic, systems, and performance standpoints. The architecture seems simple but there are many details that require careful thinking - I'll try to write more about them after I cover other components.

Interested in working at RethinkDB? We're hiring - please see our jobs page for more details.

Handling stack overflow on custom stacks

On my computer, the callstack of a new process is around 10MB*. Modern operating system automatically reserve some amount of virtual memory and install protections on the page below the stack to create a segmentation fault on stack overflow. This ensures that a stack overflow won't go corrupting random parts of memory.

We want to have a lot of coroutines, so they should have smaller stacks, maybe 16 or 64KB. This makes stack overflow an even greater possibility, but at the same time, coroutines implemented in user space don't get this checking for free—we have to build it ourselves. In the process, we can even do better: we can give some information about the coroutine which crashed. Here's the idea:

  • Memory-protect the page on the bottom of the stack (remember, x86 callstacks grow down) to disallow reads and writes.
  • Install a signal handler for SIGSEGV.
  • In the handler, examine which address caused the page fault:
    • If the address is within the protected page of a coroutine, report a callstack overflow for the coroutine, with a stack backtrace and some information about the coroutine that crashed.
    • Otherwise, report a generic memory fault at the address, with a stack backtrace.

I'm treating stack overflow as a fatal error here, but a more nuanced approach is possible: rather than killing the whole database, it could just kill that coroutine and return to the scheduler. But this particular kind of error tolerance would require broad modifications to RethinkDB which we're not ready to do. I could also make stack overflow resize the stack to be larger, but this is difficult in C++ because there might be pointers into the stack.  

Nothing here is complicated to implement, but it involves the interaction of a few different system calls, which I'll explain in this article.

Manipulating memory protection

We want to allocate the stack and make the bottom page unreadable and unwritable. The mprotect system call manipulates memory protection, and getpagesize tells us how big a page is (it might not be 4KB). valloc makes a page-aligned memory allocation.

 

void *stack = valloc(stack_size);
mprotect(stack, getpagesize(), PROT_NONE);

 

When deallocating the stack, be sure to reset the protection to what it was before.

mprotect(stack, getpagesize(), PROT_READ|PROT_WRITE);
free(stack);

Installing a signal handler

In order to catch the segfault, we have to install a signal handler. The signal system call won't cut it—it just doesn't give us enough information about what happened. Instead, we have to use sigaction, which takes a whole struct of parameters, not just a function pointer, for how to handle the signal. One struct member is sa_flags. We have to turn on the SA_ONSTACK flag in order to use a user-provided stack (see below) and the SA_SIGINFO flag, in order to call a function with more information. If SA_SIGINFO is set, then we can set the sa_sigaction member to a function which takes a siginfo_t struct as an argument. The si_addr member of that struct contains the address of the location which caused the fault. All together, the code for establishing the page handler is as follows:

struct sigaction action;
bzero(&action, sizeof(action));
action.sa_flags = SA_SIGINFO|SA_STACK;
action.sa_sigaction = &sigsegv_handler;
sigaction(SIGSEGV, &action, NULL);

The signal handler itself will print out the CPU where the coroutine was initialized, but it would be easy to extend to support printing other metadata contained in the coroutine. int coro_t::in_coro_from_cpu(int) reports which CPU a coroutine was initialized on, or -1 if the address was not from the protected page of a coroutine stack. crash will cause the program to terminate with the given error message, together with a stack trace.

void sigsegv_handler(int signum, siginfo_t *info, void *data) {
    void *addr = info->si_addr;
    int info = coro_t::in_coro_from_cpu(addr);
    if (cpu == -1) {
        crash("Segmentation fault from reading the address %p.", addr);
    } else {
        crash("Callstack overflow from a coroutine initialized on CPU %d at address %p.", cpu, addr);
    }
}

Installing a special stack for the signal handler

By default, when a signal is delivered, its handler is called on the same stack where the program was running. But if the signal is due to stack overflow, then attempting to execute the handler will cause a second segfault. Linux is smart enough not to send this segfault back to the same signal handler, which would prevent an infinite cascade of segfaults. Instead, in effect, the signal handler does not work.

To make it work, we have to provide an alternate stack to execute the signal handler on. The system call to install this stack is called sigaltstack. As a parameter, it takes a stack_t, which consists of a pointer to the base of the stack, the size of the stack, and some flags that aren't relevant for our purposes.

stack_t segv_stack;
segv_stack.ss_sp = valloc(SEGV_STACK_SIZE);
segv_stack.ss_flags = 0;
segv_stack.ss_size = SEGV_STACK_SIZE;
sigaltstack(&segv_stack, NULL);

SEGV_STACK_SIZE doesn't have to be so big, but it has to be big enough to call printf from. The MINSIGSTKSZ constant indicates how big a stack has to be to execute any signal handler at all. To be on the safe side, used that constant plus 4096 for SEGV_STACK_SIZE. sigaltstack should be called before the associated call to sigaction which is intended to register a signal handler with that stack.

Operating in a multithreaded environment

If a process calls sigaction and then spawns pthreads within it, then those pthreads will inherit the signal handlers that were already installed. Apparently, this is not the case for sigaltstack: If a signal handler is installed with sigaction using a sigaltstack, and a thread spawned from that process is killed with the right signal, then the installed stack will not be found! The signal handler must instead be installed on each pthread individually. I'm not sure whether this is a bug in Linux or just a quirk of POSIX; in any case, I couldn't find it documented anywhere.

Pulling it all together

A few simple system calls can allow stack overflows in user-space coroutines to be handled nicely, providing detailed error messages without runtime overhead in cases where the stack does not overflow. Indeed, the benchmarks which I previously reported are unaffected by this change. Other programming language runtimes, like that of Io, checks the height of the callstack on every function call in order to catch overflow. This technique is more efficient.

Io supports callstacks that will resize on overflow, to a point. Such a feature is more difficult to implement in C++ because there might be pointers into the callstack, and weak typing makes it impossible to trace the stacks and heap to update these even if we wanted to. However, virtual memory may be usable to implement this resizing. First, it may just work to allocate very large stacks and not touch them, hoping that the virtual memory system will ensure that no physical memory or swap space is used for the stacks. But this might put stress on the kernel's data structures, and it may not work well if overcommitting is turned off, as it is in some server environments. Alternatively, mremap may be used to expand the stack from the page fault handler. But I'm not sure how I could reserve a chunk of virtual memory to expand into, without depending on overcommitting in some cases. None of these techniques would work out well on 32-bit systems because there isn't enough address space. This is still something to look into in the future, though.
I implemented stack overflow checking after getting stuck on a particular bug in the database, and when returning to that bug, I found that this was in fact the cause! Stack overflow isn't an obscure, uncommon thing, especially when stacks are small and there might be recursion. Working together with the operating system allows us to broaden the applicability of these software engineering benefits to performance-critical code.

 


* You can find the value on your computer by running the following program (without optimizations!) and examining the difference between the first and last number, when it terminates with a segfault.

#include <stdio.h>
int main() {
        char stuff[2048];
        printf("I'm at %p\n", &stuff);
        main();
}

If this is all as exciting to you as it is to me, you should consider working at RethinkDB! We're hiring—please see our jobs page for more details.

Making coroutines fast

Previously, I wrote about using coroutines in RethinkDB. Coroutines are a nice alternative to callbacks because they are easier to program in, and they are a nice alternative to threads because of their greater performance. But how fast are they?

Just using an off-the-shelf library like libcoroutine isn't as fast as you might think. The graph below shows the huge performance degradation of a naive implementation of coroutines (the short red bar) compared to the old callback-based code (the blue bar on the left). But with the right set of optimizations, we can recover a level of performance about equal to the non-coroutine version. With these optimizations, throughput is recorded as the pink bar on the right, which is within the margin of error of the original version.

Media_httpwwwrethinkd_ensxa

The improvement comes from  two optimizations: reuse of stacks and a lightweight swapcontext implementation. The two optimizations are completely internal to the coroutines library and require no changes to code which uses them. As you can see from the graph, both optimizations are essential for acceptable performance.

Reusing coroutines

Spawning a new coroutine with libcoroutine on Unix issues the following actions:

  • A stack is allocated
  • A ucontext object is allocated and initialized using makecontext and getcontext
  • A libcoroutine Coro object is created around the stack and context
  • The context is swapped to the new coroutine, jumping to a trampoline, and from there, executing user code with supplied arguments.

When the user code terminates, all of these structures have to be deallocated. Unless our allocator is perfect (which it isn't) and the ucontext routines are really fast (which they aren't), this constitutes a lot of waste since each select request issues multiple coroutines. Why not reuse the same coroutine object for multiple actions?

The logic is simple. Each OS thread has a list of free coroutines. To do something in a new coroutine, the user pops an item off the free list and sends it a message containing the action. Each coroutine has a 'run loop' which does the following:

  • Wait for a message containing an action
  • Execute that action
  • Push self back on to the free list

If a coroutine is requested and the free list is empty, a new coroutine executing that run loop is spawned and pushed onto the list.

A lightweight swapcontext implementation

Once old coroutines are reused, the only frequent call into libcoroutine that gets made is switching contexts. Ultimately, this calls the ucontext routine swapcontext. What does that do exactly? Well, here's the glibc 2.12.2 implementation for x86-64:

swapcontext:
        /* Save the preserved registers, the registers used for passing args,
           and the return address.  */
        movq        %rbx, oRBX(%rdi)
        movq        %rbp, oRBP(%rdi)
        movq        %r12, oR12(%rdi)
        movq        %r13, oR13(%rdi)
        movq        %r14, oR14(%rdi)
        movq        %r15, oR15(%rdi)

        movq        %rdi, oRDI(%rdi)
        movq        %rsi, oRSI(%rdi)
        movq        %rdx, oRDX(%rdi)
        movq        %rcx, oRCX(%rdi)
        movq        %r8, oR8(%rdi)
        movq        %r9, oR9(%rdi)

        movq        (%rsp), %rcx
        movq        %rcx, oRIP(%rdi)
        leaq        8(%rsp), %rcx                /* Exclude the return address.  */
        movq        %rcx, oRSP(%rdi)

        /* We have separate floating-point register content memory on the
           stack.  We use the __fpregs_mem block in the context.  Set the
           links up correctly.  */
        leaq        oFPREGSMEM(%rdi), %rcx
        movq        %rcx, oFPREGS(%rdi)
        /* Save the floating-point environment.  */
        fnstenv        (%rcx)
        stmxcsr oMXCSR(%rdi)

        /* The syscall destroys some registers, save them.  */
        movq        %rsi, %r12

        /* Save the current signal mask and install the new one with
           rt_sigprocmask (SIG_BLOCK, newset, oldset,_NSIG/8).  */
        leaq        oSIGMASK(%rdi), %rdx
        leaq        oSIGMASK(%rsi), %rsi
        movl        $SIG_SETMASK, %edi
        movl        $_NSIG8,%r10d
        movl        $__NR_rt_sigprocmask, %eax
        syscall
        cmpq        $-4095, %rax                /* Check %rax for error.  */
        jae        SYSCALL_ERROR_LABEL        /* Jump to error handler if error.  */

        /* Restore destroyed registers.  */
        movq        %r12, %rsi

        /* Restore the floating-point context.  Not the registers, only the
           rest.  */
        movq        oFPREGS(%rsi), %rcx
        fldenv        (%rcx)
        ldmxcsr oMXCSR(%rsi)

        /* Load the new stack pointer and the preserved registers.  */
        movq        oRSP(%rsi), %rsp
        movq        oRBX(%rsi), %rbx
        movq        oRBP(%rsi), %rbp
        movq        oR12(%rsi), %r12
        movq        oR13(%rsi), %r13
        movq        oR14(%rsi), %r14
        movq        oR15(%rsi), %r15

        /* The following ret should return to the address set with
        getcontext.  Therefore push the address on the stack.  */
        movq        oRIP(%rsi), %rcx
        pushq        %rcx

        /* Setup registers used for passing args.  */
        movq        oRDI(%rsi), %rdi
        movq        oRDX(%rsi), %rdx
        movq        oRCX(%rsi), %rcx
        movq        oR8(%rsi), %r8
        movq        oR9(%rsi), %r9

        /* Setup finally  %rsi.  */
        movq        oRSI(%rsi), %rsi

        /* Clear rax to indicate success.  */
        xorl        %eax, %eax
        ret

(The o* symbols are #define'd in another file to be offsets into the ucontext_t struct.)

You might spot a few unnecessary things here:

  • Argument registers (RDI, RDX, RCX, R8, R9 and RSI) are saved and restored even though the calling convention doesn't guantee that they're saved and restored
  • x87 and SSE state is saved and restored, even though we're not doing anything with floats or SIMD where register contents need to be preserved across function calls
  • The signal mask is saved and restored with the system call sigprocmask, even though the signal handlers in RethinkDB are the same globally

The final sin—making an unnecessary syscall—is the greatest, and eliminating this provides a big speed boost. On top of this, it helps to eliminate the memory traffic from saving and restoring the unnecessary registers.

Things that weren't important

I made one minor optimization across all of these: stacks were malloc'ed rather than calloc'ed, since there's no reason that they have to be zeroed beforehand. It turns out that this isn't crucial to performance once pooling is applied; it just reduces the warmup time. Reducing the size of the callstacks is similar—with pooling, it just reduces warmup time. Without pooling, both of these optimizations yield significant improvements in throughput. It still might be a good idea to have small callstacks so that they don't take up too much memory, but this is not a problem right now. In practice, warmup time will be dominated by disk I/O in filling up the in-memory cache, and the database will be restarted only rarely.

Conclusion

Coroutines can be made to work fast, though it's not as immediate as one might hope. The POSIX primitives aren't optimal for all applications. I'd like to thank Andrew Hunter for giving me advice on how to proceed in this optimization process.

If this is all as exciting to you as it is to me, you should consider working at RethinkDB! We're hiring—please see our jobs page for more details.

Improving a large C++ project with coroutines

At the core of RethinkDB is a highly parallel B-tree implementation. Due to our performance requirements, it is too expensive to create a native thread for each request. Instead, we create one thread per CPU on the server (logical CPU in the case of hyperthreading) and use cooperative concurrency within a thread.

A single thread will have multiple logically concurrent units of control, taking turns when a unit needs to block. Blocking needs to take place ultimately for either I/O--waiting for information from the network or disk, or waiting to be notified that sending information there has completed--or for coordination with other threads. On top of this, we implemented higher-level abstractions which also block.

The previous approach

Blocking is implemented using callbacks. A blocking operation is expected to return immediately, and it will call a callback when it's done. In Javascript, callbacks are generally done by passing a function argument. C++ doesn't have lexically scoped closures or automatic memory management, so we instead pass a pointer to an object. This object has a virtual method on it for the callback. Different virtual methods are used for different situations demanding a callback, allowing the same object to be passed as the callback to several different blocking functions. Reusing an object among callbacks allows state to be shared easily among multiple blocking operations.

This all sounds like a very sensible model, but there are several reasons why it becomes annoying to program in:

  • A single logical procedure must be implemented as a class, with state in instance variables rather than locals because it involves blocking
  • The flow of control is broken up over several different virtual method bodies
  • A single virtual method might be invoked as the callback for multiple situations, requiring complex logic to maintain the state of the object and dispatch to the appropriate implementation

Each of these problems creates more boilerplate code and more chances to introduce bugs.

The coroutines interface

For the past couple weeks, I've been working on a solution: use coroutines to implement blocking rather than callbacks, so a single function can have blocking locations and still look like a straight line of code. It turns out to be fairly straightforward to do this in C or C++. I used the open-source library libcoroutine for this task. Libcoroutine is a thin, cross-platform wrapper around native APIs for coroutines like fibers on Windows and ucontext.h on POSIX systems, falling back to a setjmp/longjmp-based implementation on other platforms. Libcoroutine provides functions for launching new coroutines, switching among coroutines and initializing the coroutine system.

On top of this, I built a small library in C++ to make them easier to use, integrating coroutines with our existing system for scheduling the issuing of callbacks. The API for coroutines is very simple, summarized by this fragment of a header file:

struct coro_t {
    static void wait();
    static coro_t *self();
    void notify();
    static void move_to_thread(int thread);

    template
    static void spawn(callable_t fun);

    template
    struct cond_var_t {
        cond_var_t();
        var_t join();
        void fill(var_t value);
    };

    template
    static cond_var_t *task(callable_t fun);

    struct multi_wait_t {
        multi_wait_t(unsigned int notifications);
        void notify();
    };
};

For convenience, I included versions of coro_t::task and coro_t::spawn which take up to five parameters, using boost::bind. The above is all the functionality we need for concurrency in RethinkDB. The first five commands are really all that's needed, and tasks and condition variables are implemented in terms of these.

The coro_t::self function returns the currently executing coroutine. This coroutine has a single public method, notify, which schedules the coroutine to be resumed if it is blocked. (It is an error to notify a coroutine which is not blocked.) A coroutine can cause itself to block by calling the function coro_t::wait. Hopefully, the coroutine will have passed itself to another location so that it can be notified. coro_t::spawn causes a new coroutine to be launched, given an object with operator() defined. It doesn't bother returning the coroutine that it spawned, because that would be redundant with self.

It turns out to be natural to have a a coroutine's flow of execution be the only thing that can get a reference to itself. The general pattern is that a coroutine gets itself, sends a pointer of itself to another coroutine, and then waits until that coroutine notifies it. For example, say we have a legacy procedure void write_async(file_t*, data_t*, write_callback_t*) which reads a file and calls the virtual method void write_callback_t::on_write_complete(). In terms of the above primitives, a coroutine-blocking version of write_async is constructed as follows:

struct co_write_callback_t : public write_callback_t {
    coro_t *waiter;
    co_write_callback_t() : waiter(coro_t::self()) { }
    void on_write_complete() { waiter->notify() }
};
void write(file_t *file, data_t *data) {
    co_write_callback_t cb;
    write_async(file, data, &cb);
    coro_t::wait();
}

Coroutines and multithreading

coro_t::move_to_thread is a special function which causes a coroutine to be blocked, transported to another thread and resumed there. RethinkDB is usually configured to use one thread per CPU. We attempt to minimize sharing between threads, but sometimes communication is necessary. It is often natural to express this communication as a single line of control, executing on one thread for some time and later on another thread. For example, requests are dispatched on all threads, and the B-tree is divided into slices that are owned by each thread. If a request handled by a particular thread needs to look at a B-tree portion maintained by a different thread, then coro_t::move_to_cpu can be used to transfer control from one thread to the other and back. In the old system, a callback would be registered which would be called on the other thread, and all state would have be stored in an object shared between the threads, with ownership passed from one to the other. With coroutines, the stack forms the shared object, and programming is much more natural.

For code that moves between threads in a stack-based manner, the RAII idiom can make it easier to use move_to_thread. The coroutine library provides a special class whose constructor moves to a thread given as a parameter, and whose destructor moves back to the thread observed to be in use when the object was created. If this object is stack-allocated, it will have the effect of switching threads for a particular block scope.

(I'm not sure whether move_to_thread is guaranteed to work in POSIX in general with ucontext.h or Linux in particular, but it seems to function correctly as I'm using it. Windows fibers explicitly support being passed among threads. On some platforms, libcoroutine does not behave properly when passing coroutines between threads—coroutine pausing and resuming may take some thread-local state with it, corrupting the thread where the coroutine is resumed.)

Higher-level abstractions on coroutines

A few utilities are built on top of these primitives. A condition variable, of type coro_t::cond_var_t<var_t> is a variable that is initially unfilled, but can be filled with a particular value. Calling join on a condition variable causes the current coroutine to block until the condition variable is filled, returning that value. task is a convenient way to spawn a coroutine to fill a condition variable, given a function returning the value for the condition variable. If RethinkDB receives a large request for several B-tree lookups, each of these can be issued concurrently as separate tasks. The condition variables returned are joined in sequence, as the values are returned to the client.

The class coro_t::multi_wait_t is used for a situation where a coroutine needs to wait until it is notified a number of times. This is used in the implementation of reporting performance statistics. When gathering statistics, a request is given to each thread to put their portion of the information in a known location. When each thread completes, it reports its completion by notifying a multi_wait_t object about its completion. The main coroutine waits until the multi_wait_t causes it to resume execution. It can then gather the results from all threads and report the aggregated statistics.

Memory management and ownership

You might be wondering: this is C++, where is all the manual memory management? You never have to call delete on coroutines or condition variables. Memory is handled completely by the coroutine library, and it was simple to implement. When the function object that spawn calls returns, the coroutine object corresponding to it is immediately deallocated. It makes no sense to notify a coroutine which has returned, and there is no useful state within a coroutine. Similarly, condition variables deallocate themselves after they are joined.

To program in this model, it's useful to have an idea of ownership. Each coroutine has exactly one owner, a unique piece of code that's supposed to notify it when it's in a particular waiting state. A condition variable's owner is the only thing allowed to join it. It could be useful to have a type system that enforces these invariants (uniqueness/linear typing anyone?), but no such system exists in C++, so the constraints go unchecked except at runtime if asserts are turned on.

Thinking ahead

This change isn't completed yet. There's a lot of code that can be simplified using coroutines, and it'll take a long time to change everything and be sure that no bugs were introduced. There will also be some work to ensure that coroutines don't cause performance regressions. My first attempt hurt throughput significantly, but I believe that I can get the slowdown down to a few percent with some work. While that isn't perfect, the software engineering benefits of coroutines are just too good to ignore. Shorter, simpler code using coroutines will enable us to implement better algorithms and achieve higher performance while maintaining correctness.

If this is all as exciting to you as it is to me, you should consider working at RethinkDB! We're hiring—please see our jobs page for more details.

Distributed Software Testing

About me

A word about me first: My name is Daniel Mewes and I just came over to California to work at RethinkDB as an intern for the oncoming months. After having been an undergraduate student of computer science at Saarland University, Germany for the last two years, I am exited to work on an influential real-world project at RethinkDB now. Why RethinkDB? Not only does RethinkDB develop an exciting and novel piece of database technology, RethinkDB also provides the great “startup kind” of work experience.

Software testing

In complex software systems like database management systems, different components have to work together. These components can interact in complex ways, yielding a virtually infinite number of possible states that the overall system can reach. This has consequences for software testing. As bugs in the code might only show up in a small fraction of the possible states, comprehensive testing of the system is essential. Encapsulation of code and data into objects can reduce the number of states that must be considered for any single piece of code. However an extremely large number of states can still remain, especially when considering parallel systems. Reliability requirements for database management systems on the other hand are stringent. Losing or corrupting data due to bugs in the program cannot be tolerated here.

Among other measures, we at RethinkDB ensure the reliability of our software by running extensive tests on a daily basis. The problem with these tests is that they take a lot of time to complete. We recently reached time requirements of more than 24 hours on a decent machine for a single test run. So clearly a single machine is not enough anymore to run the tests. For our daily test runs, we want to get results quickly. Buying more machines is pricey, especially as those machines would be idle during the times at which no tests are run. It also is not very flexible.

Tapping into the endless resources of the cloud to ensure software quality and reliability

Cloud computing provides a more flexible and less pricey way to circumvent the limitations of limited local hardware resources. We decided to use Amazon's Elastic Compute Cloud (Amazon EC2). If you need the computing power of ten systems, you can get that from EC2 in a matter of minutes. If you need the power of a hundred machines, you can get that in a matter of minutes, too. Basically, Amazon's EC2 provides you with as much computing power as you need, at just the time that you need it. EC2 allows to dynamically allocate and deallocate virtual compute nodes, which are billed on an hourly basis. Each node can be used like a normal computer. The nodes run Linux (Windows nodes are also available) and are accessible through SSH. So EC2 looked like a promising platform to make our tests finish faster.

Media_httpwwwrethinkd_vgfdk

 

EC2 console showing a few nodes

Our existing test suite already split up the work into independent test scripts. What was missing for utilizing EC2 was an automated mechanism to start and setup a number of EC2 nodes and dispatch the individual tests to these nodes to run in parallel. Setting up a node especially involves the step of installing a current build of RethinkDB together with a number of dependencies on the node's file system. I wrote a Python script to fulfill exactly these tasks. Our main concern was to improve the overall performance of the testing process as much as possible.

In more detail, our new distributed testing tool works in the following steps:

  • Allocate a number of nodes in Amazon's EC2.
  • Once all nodes are up and booted, install the current build of RethinkDB on each of them. As the bandwidth of the Internet connection in our office is much lower than what is available to the EC2 nodes, we use SFTP to install RethinkDB on only one of the nodes and then let that node distribute it to all remaining ones.
  • We can now start running tests on the nodes:
    • Pick a test from the list of all individual tests to be run.
    • Find a node which is not currently busy running another test. If no node is available, wait until a node becomes free.
    • Initiate the test on the free node. To do this, we use a wrapper script which we invoke and immediately background on the remote node. The wrapper script takes care of running the actual test and redirecting its output and result into specific files, which we can later retrieve asynchronously.
  • After repeating step 3 for all tests in the list, wait for all nodes to finish their current work.
  • Collect the results of all tests from the different nodes. This works by reading from the files in which our wrapper script has stored the tests' results.
  • Finally, terminate the allocated nodes in EC2.

To communicate with the compute nodes, I opted for the use of Paramiko, an implementation of SSH2 for Python. Having direct access to the SSH2 protocol from a Python script makes running commands remotely as well as fetching and installing files from/into the remote systems very convenient. For allocating and terminating EC2 nodes, we use Boto, which provides an interface for accessing Amazon's AWS API from within Python programs.

The results are convincing: Instead of 26 hours on a (fast) local machine, running all of our tests takes only 4 hours when distributed across ten nodes in EC2. By using still more nodes, the time for testing can be lowered even further. This is very useful. Say we just made an important change to our code and want to verify that everything works as it is supposed to. With local test runs, this would mean waiting at least a day, even longer if our testing machine is occupied with an earlier test run. If one of the test detects a problem with the change and we fix it, it takes another day at least until we can see if the fix even worked and had no other side effects. Thanks to cloud computing and our distributed testing system, we can now initiate an arbitrary number of test runs on demand, each of which finishes in a matter of mere hours.

Multi Slicing

The basic data structure that powers databases is called a B-tree. This is where you actually store the user's data. B-trees are great because you can put huge amounts of data in them and access remains fast. In fact, the naive strategy of putting a whole data set into one B-Tree doesn't break down because of access time. It does, however, break down when you try to support a multiaccess paradigm.

Six years ago, multiaccess was nice. Now that processors have multiple cores, it's crucial. Four cores fighting over one B-tree means a lot of wasted processor time. In a multiaccess scheme, different cores can concurrently access data. This gets tricky. You can go looking for a piece of data only to find that someone has moved it since you started; that's trouble: for all you know it was deleted. You could start the search over, but without guarantees—maybe you’ll get unlucky and it will be plucked out from under you again. How do you know when to give up? Your database is now blazingly fast, but also broken. We handle this with a locking scheme.

With a locking scheme, you gain the ability to say: "Okay, I'm going to go look for this piece of data, no one else is allowed to touch it." If someone else wants to get at that piece of data before you're done—they have to wait. The database is fixed, but it's not quite as blazingly fast as it used to be. The problem is that sometimes cores have to wait for data, and while they wait they're just twiddling their thumbs. There isn't a good way to prevent these waits, but we can put that time toward something productive.

The problem isn't the waiting, it's the lack of things to do during that time. The solution is to slice up our B-Tree into several smaller B-Trees. Every possible key has one and only one slice that it can ever be stored in. We do this with hashing.

Now even if a core locks up an entire B-Tree, you can still do work in the intervening time. Now that we support multiple slices, the question is: how many slices do we actually want? There are downsides to both extremes: too many slices and the efficiency of the B-Tree structure is underused, too few and CPUs are underused. At RethinkDB we've found experimentally that 4 is a sufficiently high value. The quick back-of-the-envelope calculation is that an unsliced B-Tree will defy computation about $latex 2 \%$ of the time. That translates to an hour of wasted CPU time every 2 days. This waste falls off very sharply with our innovation: 2 slices are busy: $latex .02^2 = 0.0004$ (just $latex .04 \% $ of the time) and 4 slices are busy $latex .02^4 = 0.00000016$ or $latex 0.000016 \% $. At which point it will take you 700 years to waste an hour of CPU time. That's a number we can live with.

Make Debugging Easier With Custom Pretty-Printers

What's Good About Pretty-Printers

One of the best features in Gdb 7.0+ is the ability to write pretty-printers in Python. Instead of printing a vector and seeing this:

$1 = {
  <std::_Vector_base<int,std::allocator<int> >> = {
    _M_impl = {
      <std::allocator<int>> = {
        <__gnu_cxx::new_allocator<int>> = {<No data fields>}, <No data fields>}, 
      members of std::_Vector_base<int,std::allocator<int> >::_Vector_impl: 
      _M_start = 0x0, 
      _M_finish = 0x0, 
      _M_end_of_storage = 0x0
    }
  }, <No data fields>}

I can now see this:

$1 = std::vector of length 0, capacity 0

Much better!

Background On Pretty-Printers

Gdb has never been good at pretty-printing complex data structures, but that's finally changing. Project Archer is a Gdb development branch primarily dedicated to improving the C++ debugging experience. For me their most exciting project is PythonGdb, which aims to integrate Python scripting into Gdb. You can do cool things like define your own Gdb commands, script gdb in Python and write pretty-printers in Python. We do our debugging in Gdb and we've come to rely on our pretty-printers.

What Has Been Done So Far

If all you're looking for is pretty-printers for STL containers, you're in luck. ProjectArcher has written pretty-printers for all the containers in the STL. If you have Gdb 7.0+, you can have decent printers in a couple of minutes. If you don't have GDB 7.0+, it's a short build.

What Is Still To Be Done

The problem is, printing out classes and structs is still an issue. If you've defined some complex struct, a pretty-printer can save you a lot of time squinting at your screen (or writing print functions). ProjectArcher hasn't written pretty-printers for structs, maybe because each struct is different. But it should be possible to write a generic function for printing structs: most of the time, you just want to print out the values of the members in the struct. All you need is a decent generic pretty-printer. With that in mind, I decided to write a pretty-printer that could be used for any struct.

Before we look at the code, I suggest reading Tromey's blog post on pretty printing in Gdb if you don't know the basic idea. To start off, here's a basic pretty printer that prints the names and types of each member in a class:

class GenericPrinter:
    def __init__(self, val):
        self.val = int(val)
 
    def to_string(self):
        return "Generic object with the following members:"
 
    def children(self):
        for field in self.val.type.fields():
            yield field.name, str(field.type)

That's pretty easy, we just iterate over the fields. If we want to print out the values, things get a little trickier. We need to check the type of the field and print it accordingly. For example, here's how we could check for built-in types:

for field in self.val.type.fields():
    key = field.name
    val = self.val[key]
    if val.type.code == Gdb.TYPE_CODE_INT:
        yield key, int(val)
    elif val.type.code == Gdb.TYPE_CODE_FLT or val.type.code == Gdb.TYPE_CODE_DECFLOAT:
        yield key, float(val)
    elif val.type.code == Gdb.TYPE_CODE_STRING or val.type.code == Gdb.TYPE_CODE_ARRAY:
        yield key, str(val)
    else: yield key, val

For each member, we need to check it's type and yield accordingly.

Pointers are also tricky. Gdb usually just prints the address of the pointer, but usually, I want to know more about the object that it's pointing to. So we can dereference the pointer:

if val.type.code == gdb.TYPE_CODE_PTR or val.type.code == gdb.TYPE_CODE_MEMBERPTR:
    yield key, val.dereference()

But that's not enough. You could have a null pointer:

if val.type.code == gdb.TYPE_CODE_PTR or val.type.code == gdb.TYPE_CODE_MEMBERPTR:
    if not val: yield key, "NULL"
    else: yield key, val.dereference()

The real problem is pointers that point to garbage data. There's no clean way to check for those, so we hack it by trying to convert the deference to a string. If we get a RuntimeError, we know there's no object there:

if val.type.code == gdb.TYPE_CODE_PTR or val.type.code == gdb.TYPE_CODE_MEMBERPTR:
    if not val: yield key, "NULL"
    else:
        try:
            str(val.dereference())
            yield key, val.dereference()
        except RuntimeError: 
            yield key, "Cannot access memory at address " + str(val.address)

We can also print out the members of all the base classes of a given class by writing a function that calls itself recursively:

def process_kids(state, PF):
    for field in PF.type.fields():
        key = field.name
        val = PF[key]
        if field.is_base_class and len(field.type.fields()) != 0:
            for k, v in process_kids(state, field):
                yield key + " :: " + k, v
        else:
            yield key, val

The full code looks something like this:

import re
import gdb
 
def lookup_function (val):
    "Look-up and return a pretty-printer that can print val."
    # Get the type.
    type = val.type
 
    # If it points to a reference, get the reference.
    if type.code == gdb.TYPE_CODE_REF:
        type = type.target ()
 
    # Get the unqualified type, stripped of typedefs.
    type = type.unqualified ().strip_typedefs ()
 
    # Get the type name.    
    typename = type.tag
 
    if typename == None:
        return None
 
    # Iterate over local dictionary of types to determine
    # if a printer is registered for that type.  Return an
    # instantiation of the printer if found.
    for function in sorted(pretty_printers_dict):
       if function.match (typename):
           return pretty_printers_dict[function] (val)
 
    # Cannot find a pretty printer.  Return None.
    return None
 
 
class GenericPrinter:
    def __init__(self, val):
        self.val = val
 
    def to_string(self):
        return "Generic object with the following members:"
 
    def children(self):
        for k, v in process_kids(self.val, self.val):
            for k2, v2 in printer(k, v): yield k2, v2
 
 
def process_kids(state, PF):
    for field in PF.type.fields():
        if field.artificial or field.type == gdb.TYPE_CODE_FUNC or \
        field.type == gdb.TYPE_CODE_VOID or field.type == gdb.TYPE_CODE_METHOD or \
        field.type == gdb.TYPE_CODE_METHODPTR or field.type == None: continue
        key = field.name
        if key is None: continue
        try: state[key]
        except RuntimeError: continue
        val = PF[key]
        if field.is_base_class and len(field.type.fields()) != 0:
            for k, v in process_kids(state, field):
                yield key + " :: " + k, v
        else:
            yield key, val
 
 
def printer(key, val):
    if val.type.code == gdb.TYPE_CODE_PTR or val.type.code == gdb.TYPE_CODE_MEMBERPTR:
        if not val: yield key, "NULL"
        else:
            try:
                str(val.dereference())
                yield key, val.dereference()
            except RuntimeError: 
                yield key, "Cannot access memory at address " + str(val.address)
    elif val.type.code == gdb.TYPE_CODE_INT:
        yield key, int(val)
    elif val.type.code == gdb.TYPE_CODE_FLT or val.type.code == gdb.TYPE_CODE_DECFLOAT:
        yield key, float(val)
    elif val.type.code == gdb.TYPE_CODE_STRING or val.type.code == gdb.TYPE_CODE_ARRAY:
        yield key, str(val)
    else: yield key, val
 
# register the pretty-printer
pretty_printers_dict = {}
pretty_printers_dict[re.compile ('.*Generic.*')] = GenericPrinter
gdb.pretty_printers.append(lookup_function)

That's it. We now have generic pretty-printers that we can use for any struct. Suppose we have this struct:

struct Generic {
    int a;
    double b;
    vector<int> c;
    int *ptr;
}

And we initialize it with a = 0, b = 10, and ptr = pointer to an int with value 50.

Without pretty printers:

$4 = {a = 0, b = 10, c = {<std::_Vector_base<int, std::allocator<int> >> = {_M_impl = {<std::allocator<int>> = {<__gnu_cxx::new_allocator<int>> = {<No data fields>}, <No data fields>}, _M_start = 0x0, _M_finish = 0x0, _M_end_of_storage = 0x0}}, <No data fields>}, 
  ptr = 0x7fff5fbffabc}

With pretty-printers:

$4 = Generic object with the following members: = {
  a = 0,
  b = 10,
  c = std::vector of length 0, capacity 0,
  ptr = 50
}

Don't forget, even if you have pretty-printers, you can see the output gdb would have produced by default using

print /r *generic