BonsaiDb performance update: A deep-dive on file synchronization
Written by Jonathan Johnson. Published 2022-05-22. Last updated 2022-05-23.
What is BonsaiDb?
BonsaiDb is a new database aiming to be the most developer-friendly Rust database. BonsaiDb has a unique feature set geared at solving many common data problems. We have a page dedicated to answering the question: What is BonsaiDb? All source code is dual-licensed under the MIT and Apache License 2.0 licenses.
tl;dr: BonsaiDb is slower than previously reported
The day after the last post, @justinj reported that they traced one of the Nebari examples and did not see any fsync syscalls being executed. This was an honest mistake of misunderstanding the term "true sink" in std::io::Write. It turns out Write::flush()’s implementation for std::fs::File is a no-op, as the "true sink" is the kernel, not the disk.
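To make the distinction concrete, here is a minimal sketch of a write that is followed by an explicit sync. The helper name write_durably and the file name are hypothetical, chosen only for illustration; this is not Nebari's code.

```rust
use std::fs::File;
use std::io::{self, Write};
use std::path::Path;

/// Writes `data` to `path` and returns only after the OS reports the file
/// as synchronized. (`write_durably` is a hypothetical helper name.)
fn write_durably(path: &Path, data: &[u8]) -> io::Result<()> {
    let mut file = File::create(path)?;
    file.write_all(data)?;
    // For `File`, this forces nothing to disk: `write_all` already handed
    // the bytes to the kernel, which is the "true sink" here.
    file.flush()?;
    // This is the call that asks the OS to persist data and metadata.
    // `File::sync_data` is the variant that may skip some metadata.
    file.sync_all()
}

fn main() -> io::Result<()> {
    // Note: if the temporary directory is a RAM-backed filesystem, the sync
    // has nothing physical to wait for.
    let path = std::env::temp_dir().join("flush-vs-sync-example.bin");
    write_durably(&path, b"hello")?;
    println!("wrote {} bytes", std::fs::read(&path)?.len());
    std::fs::remove_file(&path)
}
```

Tracing a program like this should show a sync syscall only for the sync_all call, not for flush.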
I released Nebari v0.5.3 the same day. I ran the Nebari benchmark suite and... nothing changed. I ran the suite on GitHub Actions... no change. I ran the suite on my dedicated VPS I use for a more stable benchmarking environment than GitHub Actions... no change. I ran the suite on my Mac... huge slowdown. I’ll cover why further in this post, but my initial impression was that I dodged a bullet somehow.
A few days later, I noticed the BonsaiDb Commerce Benchmark was running slowly. I quickly realized it was due to the synchronization changes and began an entire rewrite of the view indexer and document storage. A few days ago, I reached a stage where I could run the suite with my changes. Excited to see if it was enough to catch back up to PostgreSQL, I ran it and... it was a little faster but was still very slow.
The rest of this post explores everything I’ve learned since then. Since this is a summary, let me end with a tl;dr: Reading data from BonsaiDb is still very efficient, but due to mistakes in benchmarking, writes are quite slow for workflows that insert or update a lot of data in a single collection. I am still excited and motivated to build BonsaiDb, but I am currently uncertain whether I will still write my own low-level database layer. All assumptions about BonsaiDb’s performance must be reset.
Why didn't my refactor help?
The rest of that day and the next two days were spent profiling and trying to understand the results. I would then try to test my assumptions in an isolated test, and I wasn’t able to make significant progress.
The following day as I was sipping coffee, I ran df to check my disk’s free space. I realized that each time I ran that command, /tmp was always listed as a separate mountpoint. I use Manjaro Linux (based on Arch), and while I can generally solve any problems I have with my computer, I never considered the implications of /tmp being listed.
Given that it’s a separate mountpoint from my main filesystem, the next logical question is: what filesystem does it use? The answer: tmpfs, a filesystem that acts like a RAM-disk and never persists your files except through memory paging. fsync is essentially a no-op on such a filesystem. Many of my benchmarks and tests used the excellent tempfile crate.
Despite having worked with Linux off and on since the early 2000s, I never noticed this detail. The temporary directory is not a different filesystem on the Mac, which is one factor in why Nebari’s benchmarks exhibited a major change on the Mac while no change on Linux.
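A benchmark can guard against this by checking which filesystem backs its working directory before trusting any sync timings. The sketch below is one way to do that on Linux by parsing /proc/mounts. The function name fs_type_for is hypothetical, and the parsing ignores escaped characters in mount points.

```rust
use std::path::Path;

/// Returns the filesystem type of the mount containing `path`, given the
/// contents of `/proc/mounts` (fields: device, mount point, type, ...).
/// (`fs_type_for` is a hypothetical helper name.)
fn fs_type_for<'a>(mounts: &'a str, path: &Path) -> Option<&'a str> {
    let mut best: Option<(usize, &'a str)> = None;
    for line in mounts.lines() {
        let mut fields = line.split_whitespace();
        // `nth(1)` skips the device and yields the mount point.
        if let (Some(mount_point), Some(fs_type)) = (fields.nth(1), fields.next()) {
            // `Path::starts_with` compares whole components, so "/tmp"
            // does not match "/tmpfoo".
            if path.starts_with(mount_point) {
                let depth = Path::new(mount_point).components().count();
                // Keep the deepest mount point; on ties the later line wins.
                if best.map_or(true, |(d, _)| depth >= d) {
                    best = Some((depth, fs_type));
                }
            }
        }
    }
    best.map(|(_, fs_type)| fs_type)
}

fn main() {
    let tmp = std::env::temp_dir();
    match std::fs::read_to_string("/proc/mounts") {
        Ok(mounts) => println!("{} is on {:?}", tmp.display(), fs_type_for(&mounts, &tmp)),
        Err(err) => println!("could not read /proc/mounts: {err}"),
    }
}
```

If this reports tmpfs for the directory a benchmark writes to, its fsync numbers say nothing about a real disk.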
Should I continue Nebari?
The realization that all of my testing of other databases’ performance was severely flawed meant that I needed to recheck everything. There is a real question: should I just ditch Nebari and use another database to back BonsaiDb? Thinking about my motivations, I’ve never wanted to create "the fastest database."
My pitch for BonsaiDb has always been about developer experience and being "good enough" for "most people." I believe there are a lot of developers who waste a lot of development cycles building apps that are more concerned about high-availability than being able to scale to be the next Google. At that scale, a one-size-fits-all solution is almost never the solution. If I can simplify scaling from a test project to a highly-available solution and do it with "reasonable" performance, I’ve met my goals for BonsaiDb.
The benefits of Nebari come down to it being tailor-fit to BonsaiDb’s needs. With my new approach to document and view storage that leverages Nebari’s ability to embed custom stats within its B+Tree, it meant I could build and query map-reduce views in a very efficient manner. I have yet to see another low-level database written in Rust that enables embedding custom information inside of the B+Tree structure to enable these capabilities.
Because a tailor-fit solution is so attractive, I wanted to explore what it might take to make Nebari faster.
Why is Nebari slow?
To answer that question, I needed to fix my usage of tempfile in the Nebari benchmark. After doing that, Nebari still competed with SQLite on many benchmarks, but Sled was reporting astonishing numbers. This led me to question whether Sled’s transactions are actually ACID-compliant when explicitly asking for them to be flushed.
See, in my testing, I was able to determine that fdatasync() was unable to return in less than 1 millisecond on my machine. Here’s the output of the transactional insert benchmark measuring the time it takes to do an ACID-compliant insert of 1KB of data to a new key:
As you can see, Nebari sits in the middle with 2.6ms, SQLite is the slowest with 4.11ms, and Sled reports an incredibly short 38.9μs. After a quick skim of Sled’s source, Sled isn’t calling fdatasync() most of the time when you ask it to flush its buffers. It’s using a different API: sync_file_range() .
From the output above, you might be tempted to infer that Sled never calls fdatasync, since the max single iteration time for Sled is 39.4μs, and I claim that fdatasync never completes in under 1ms on my machine. However, Criterion uses sampling-based statistics, which means that it doesn’t look at individual iteration times but rather iteration times for a set number of iterations.
By logging out each individual iteration time, I can see that there are individual iterations that do take as long as an fdatasync call. To simplify, I created three tests to benchmark:
- append: Write to the end of the file and call fdatasync.
- preallocate: When a write needs more space in the file, extend the file using ftruncate before writing. Call fdatasync after each write.
- syncrange: When a write needs more space, extend the file using ftruncate and call fdatasync after the write. When a write does not need more space, call sync_file_range to persist the newly written data.
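The first two strategies can be sketched with the standard library alone. This is a simplified illustration rather than the benchmark's actual code: the helper names and the 1 MiB growth chunk are assumptions. File::sync_data stands in for fdatasync, and File::set_len is the standard library call that uses ftruncate on Unix.

```rust
use std::fs::{File, OpenOptions};
use std::io::{self, Seek, SeekFrom, Write};
use std::path::Path;

/// Opens a fresh read/write file (hypothetical helper).
fn open_rw(path: &Path) -> io::Result<File> {
    OpenOptions::new().read(true).write(true).create(true).truncate(true).open(path)
}

/// "append" strategy: write at the end of the file, then sync the data.
fn append(file: &mut File, data: &[u8]) -> io::Result<()> {
    file.seek(SeekFrom::End(0))?;
    file.write_all(data)?;
    file.sync_data()
}

/// Preallocation strategy: grow the file in `chunk`-sized steps before
/// writing, so most writes land inside the existing length.
fn preallocated_write(file: &mut File, offset: u64, data: &[u8], chunk: u64) -> io::Result<()> {
    let needed = offset + data.len() as u64;
    if needed > file.metadata()?.len() {
        // Round the new length up to the next multiple of `chunk`.
        let new_len = (needed + chunk - 1) / chunk * chunk;
        file.set_len(new_len)?;
    }
    file.seek(SeekFrom::Start(offset))?;
    file.write_all(data)?;
    file.sync_data()
}

fn main() -> io::Result<()> {
    let path = std::env::temp_dir().join("append-vs-preallocate-example.bin");
    let mut file = open_rw(&path)?;
    append(&mut file, b"abcd")?;
    // The 1 MiB chunk size is an arbitrary choice for this sketch.
    preallocated_write(&mut file, 4, b"efgh", 1024 * 1024)?;
    println!("file length: {}", file.metadata()?.len());
    drop(file);
    std::fs::remove_file(&path)
}
```

The third strategy differs only in which sync call follows a write that did not grow the file.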
The output of this benchmark is:
Our syncrange benchmark appears to not ever take longer than 195.8μs. But, we know that it calls fdatasync , so what’s happening? Let’s open up Criterion’s raw.csv report for syncrange :
| group | function | value | throughput_num | throughput_type | sample_measured_value | unit | iteration_count |
|---|---|---|---|---|---|---|---|
| writes | syncrange | | | | 2,929,333.0 | ns | 6 |
| writes | syncrange | | | | 2,932,484.0 | ns | 12 |
| writes | syncrange | | | | 5,701,286.0 | ns | 18 |
| writes | syncrange | | | | 5,788,786.0 | ns | 24 |
Criterion isn’t keeping track of every iteration. It keeps track of batches. Because of this, the final statistics tallied aren’t able to see the true maximum iteration time. Let’s run the same benchmark in my own benchmarking harness:
| Label | avg | min | max | stddev | out% |
|---|---|---|---|---|---|
| append | 2.628ms | 2.095ms | 5.702ms | 147.0us | 0.004% |
| preallocate | 1.243ms | 612.3us | 4.042ms | 859.3us | 0.004% |
| syncrange | 189.1us | 12.83us | 2.847ms | 653.6us | 0.063% |
Using this harness, I can now see results that make sense. The averages match what we see from Criterion, but now our min and max show a wider range. We can now see that even for the syncrange benchmark, some writes will take 2.8ms.
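The effect of batching is easy to reproduce in isolation. The sketch below uses made-up numbers, not measurements, to compare the true maximum of a set of iteration times against the largest per-iteration mean you can recover when only batch totals are recorded. Both function names are hypothetical.

```rust
use std::time::Duration;

/// Largest single iteration time.
fn true_max(samples: &[Duration]) -> Option<Duration> {
    samples.iter().copied().max()
}

/// Largest per-iteration mean when only each batch's total is known.
/// `batch_size` must be greater than zero.
fn max_batch_mean(samples: &[Duration], batch_size: usize) -> Option<Duration> {
    samples
        .chunks(batch_size)
        .map(|batch| batch.iter().sum::<Duration>() / batch.len() as u32)
        .max()
}

fn main() {
    // Illustrative numbers only: five fast writes and one slow one.
    let mut samples = vec![Duration::from_micros(10); 5];
    samples.push(Duration::from_micros(2_950));
    println!("true max:       {:?}", true_max(&samples));
    println!("max batch mean: {:?}", max_batch_mean(&samples, 6));
}
```

With one slow write hidden among five fast ones, the batch mean lands far below the slow write's real duration, which is the same shape as the Criterion output above.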
The takeaway is very important: using sync_file_range , Sled is able to make the average write’s time take 38.9μs, even though occasionally there will be write operations that are longer due to the need for an fdatasync when the file’s size changes.
Given how much faster sync_file_range() is, is it safe to use to achieve durable writes?
What does sync_file_range do?
sync_file_range() has many modes of operation. At its core, it offers the ability to ask the kernel to commit any dirty cached data to the filesystem. For the purposes of this post, we are most interested in the mode of operation when SYNC_FILE_RANGE_WAIT_BEFORE , SYNC_FILE_RANGE_WRITE , and SYNC_FILE_RANGE_WAIT_AFTER are passed.
With these flags, sync_file_range() is documented to wait for all dirty pages within the range provided to be flushed to disk. However, the documentation for this function advises that it is "extremely dangerous."
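For reference, here is a hedged sketch of what such a call can look like from Rust without any external crates. It declares the glibc wrapper directly. The flag values are copied from the Linux headers as I understand them, and the helper name sync_range is hypothetical. The libc crate exposes the same function and constants if you would rather not declare them yourself.

```rust
use std::fs::File;
use std::io;

/// Asks the kernel to write out `len` bytes starting at `offset` and waits
/// for completion. (`sync_range` is a hypothetical helper name.)
#[cfg(target_os = "linux")]
fn sync_range(file: &File, offset: u64, len: u64) -> io::Result<()> {
    use std::os::raw::{c_int, c_uint};
    use std::os::unix::io::AsRawFd;

    // Flag values as found in the Linux headers (assumed; verify locally).
    const SYNC_FILE_RANGE_WAIT_BEFORE: c_uint = 1;
    const SYNC_FILE_RANGE_WRITE: c_uint = 2;
    const SYNC_FILE_RANGE_WAIT_AFTER: c_uint = 4;

    // `unsafe extern` needs a recent toolchain (Rust 1.82 or newer); older
    // toolchains spell this as a plain `extern "C"` block.
    unsafe extern "C" {
        // int sync_file_range(int fd, off64_t offset, off64_t nbytes, unsigned int flags);
        fn sync_file_range(fd: c_int, offset: i64, nbytes: i64, flags: c_uint) -> c_int;
    }

    let flags = SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER;
    // SAFETY: the descriptor stays open for the whole call and the function
    // only reads its scalar arguments.
    let rc = unsafe { sync_file_range(file.as_raw_fd(), offset as i64, len as i64, flags) };
    if rc == 0 { Ok(()) } else { Err(io::Error::last_os_error()) }
}

/// Other platforms have no equivalent, so this sketch syncs the whole file.
#[cfg(not(target_os = "linux"))]
fn sync_range(file: &File, _offset: u64, _len: u64) -> io::Result<()> {
    file.sync_data()
}

fn main() -> io::Result<()> {
    use std::io::Write;
    let path = std::env::temp_dir().join("sync-range-example.bin");
    let mut file = File::create(&path)?;
    file.write_all(&[0u8; 4096])?;
    println!("sync_range result: {:?}", sync_range(&file, 0, 4096));
    drop(file);
    std::fs::remove_file(&path)
}
```

Nothing in this call touches file metadata, which is the crux of the warning discussed below.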
What are 'durable writes'?
When writing data to a file, the operating system does not immediately write the bits to the physical disk. This would be incredibly slow, even with modern SSDs. Instead, operating systems will typically manage a cache of the underlying storage and occasionally flush the updated information to the storage. This allows writes to be fast and enables the operating system to attempt to schedule and reorder operations for efficiency.
The problem with this approach occurs when the power suddenly is cut to the machine. Imagine a user hits "Save" in their program, the program confirmed it was saved, and suddenly the building’s power dies. The program claimed it saved the file, but upon rebooting, the file is missing or corrupt. How does that happen? The file may have only been saved to the kernel’s cache and never written to the physical disk.
The solution is called flushing or syncing. Each operating system exposes one or more functions to ensure all writes to a file have successfully been persisted to the physical media:
- On Linux, it’s fsync() , fdatasync() , and sync_file_range() .
- On Windows, it’s FlushFileBuffers .
- On Mac/iOS, fsync() is available but does not provide the same guarantees as Linux. Instead, a call to fcntl with the F_FULLFSYNC option must be used to trigger a write to physical media.
Rust uses the correct APIs for each platform when calling File::sync_all or File::sync_data to provide durable writes. The standard library does not provide APIs to invoke the underlying APIs mentioned above. Thankfully, the libc crate makes it easy to call the APIs we are interested in for this post.
Linux: Is sync_file_range viable for durable writes?
The man page for sync_file_range() includes this warning (emphasis mine):
This system call is extremely dangerous and should not be used in portable programs. None of these operations writes out the file’s metadata. Therefore, unless the application is strictly performing overwrites of already-instantiated disk blocks, there are no guarantees that the data will be available after a crash. There is no user interface to know if a write is purely an overwrite. On filesystems using copy-on-write semantics (e.g., btrfs) an overwrite of existing allocated blocks is impossible. When writing into preallocated space, many filesystems also require calls into the block allocator, which this system call does not sync out to disk. This system call does not flush disk write caches and thus does not provide any data integrity on systems with volatile disk write caches.
Sled is using it to achieve its incredible speed, and the author is aware of this warning. One of the commenters on the linked page points out that RocksDB has special code to disable using the API on zfs. Pebble, which is a Go port/spinoff of RocksDB, takes the approach of opting in ext4. Both RocksDB and Pebble seem to still use fsync / fdatasync at various locations to ensure durability.
I decided to look at PostgreSQL’s source as well. They use sync_file_range()’s asynchronous mode to hint to the OS that the writes need to be flushed, but they still issue fsync or fdatasync as needed.
I also looked to SQLite’s source: no references. I could not find any relevant discussion threads either.
I’m not an expert on any of these databases, so my skim of their codebases should be taken with a grain of salt.
I tried finding any information about the reliability of sync_file_range for durable overwrites on various filesystems, and I couldn’t find anything except these little bits already linked.
Lacking any definitive answer regarding whether it’s able to provide durability on any filesystems, I set out to test this myself.
Testing sync_file_range's durability
I set up an Ubuntu 20.04 Server virtual machine running kernel 5.4.0-110-generic. While pondering how to best shut the machine down after the call to sync_file_range , @justinj came to the rescue again by pointing out that /proc/sysrq-trigger exists. He also shared his blog post where he performed similar tests against fsync while exploring how to build a durable database log.
It turns out if you write o to /proc/sysrq-trigger on a Linux machine (requires permissions), it will immediately power off. This greatly simplified my testing setup.
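A sketch of that power-off step in Rust follows. Because running it for real cuts power to the machine, the example refuses to do anything unless an explicit argument is passed. The flag name --power-off-now and the helper name are hypothetical, and this is not the code from the test repository.

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::path::Path;

/// Writes the SysRq command byte `o` (power off) to `trigger`.
/// (`write_sysrq_poweroff` is a hypothetical helper name.)
fn write_sysrq_poweroff(trigger: &Path) -> io::Result<()> {
    let mut file = OpenOptions::new().write(true).open(trigger)?;
    file.write_all(b"o")
}

fn main() -> io::Result<()> {
    // Guard against accidents: nothing happens without this argument.
    if std::env::args().nth(1).as_deref() != Some("--power-off-now") {
        eprintln!("refusing to power off; pass --power-off-now to proceed");
        return Ok(());
    }
    // Only run this inside a disposable virtual machine.
    write_sysrq_poweroff(Path::new("/proc/sysrq-trigger"))
}
```

In a durability test, this call would come immediately after the sync call whose guarantees are being checked.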
I executed a VM using this command:
In another terminal, I executed the various examples from the repository over ssh . After each example executed, the virtual machine would automatically reboot. By executing examples in a loop, I was able to run these commands for extended periods of time. My results are:
| Filesystem | is sync_file_range durable? |
|---|---|
| btrfs | No |
| ext4 | Yes |
| xfs | Yes |
| zfs | No |
Safely extending a file's length while using sync_file_range
My original testing of sync_file_range showed some failures, but after some additional testing, I noticed it was only happening with either the first or second test, but never on subsequent tests for filesystems I’ve labeled durable above.
There are two examples that test sync_file_range :
- sync_file_range.rs: When initializing the data file, zeroes are manually written to the file.
- sync_file_range_set_len.rs: When initializing the data file, File::set_len() is called to extend the file, which is documented currently as:
If it is greater than the current file’s size, then the file will be extended to size and have all of the intermediate data filled in with 0s.
Both examples use File::sync_all() , and both examples call File::sync_all on the containing directories to sync the file length change.
On ext4 and xfs, my testing showed that I could reliably reproduce data loss on the initial run in the sync_file_range_set_len example but not the sync_file_range example. Subsequent runs were durable. Why is that?
Despite what the Rust documentation states, under the hood, File::set_len uses ftruncate , which is documented as:
If the file size is increased, the extended area shall appear as if it were zero-filled.
The distinction between "shall appear as if it were zero-filled" and "will be extended to size and have all of the intermediate data filled in with 0s" is subtle, but very important when considering the safety of sync_file_range. In my earlier quote of sync_file_range’s warning, the second emphasis also seems to relate to these findings.
From my testing, using ftruncate to fill pages with 0 will conflict with sync_file_range on the first operation, but will likely succeed on future tests on ext4 and xfs.
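The initialization that behaved durably in my tests can be sketched as follows: write real zeroes instead of calling File::set_len, then sync both the file and its containing directory. The helper name and buffer size are assumptions for this sketch, and opening a directory with File::open like this is a Unix-specific trick.

```rust
use std::fs::{File, OpenOptions};
use std::io::{self, Write};
use std::path::Path;

/// Creates `path` containing `len` explicitly written zero bytes, then syncs
/// the file and its parent directory. (Hypothetical helper name.)
fn create_zero_filled(path: &Path, len: u64) -> io::Result<File> {
    let mut file = OpenOptions::new()
        .read(true)
        .write(true)
        .create(true)
        .truncate(true)
        .open(path)?;
    let zeroes = [0u8; 64 * 1024];
    let mut remaining = len;
    while remaining > 0 {
        let n = remaining.min(zeroes.len() as u64) as usize;
        file.write_all(&zeroes[..n])?;
        remaining -= n as u64;
    }
    // Persist the data and the new length.
    file.sync_all()?;
    // Persist the directory entry as well (skipped for bare relative names).
    if let Some(parent) = path.parent().filter(|p| !p.as_os_str().is_empty()) {
        File::open(parent)?.sync_all()?;
    }
    Ok(file)
}

fn main() -> io::Result<()> {
    let path = std::env::temp_dir().join("zero-filled-example.bin");
    let file = create_zero_filled(&path, 1024 * 1024)?;
    println!("initialized {} bytes", file.metadata()?.len());
    drop(file);
    std::fs::remove_file(&path)
}
```

This costs more up front than ftruncate, since every block is actually written once before any overwrite happens.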
Volatile Write Caches
Update 2022-05-23: A comment on Reddit correctly pointed out I skipped discussing this portion of the sync_file_range warning:
This system call does not flush disk write caches and thus does not provide any data integrity on systems with volatile disk write caches.
Even with all of the aforementioned preconditions being true, we can’t guarantee that sync_file_range is durable if write caching is enabled. This is because the device itself may have a write cache that is volatile. Unless write caching is explicitly disabled, the only way for sync_file_range to be safe on ext4 and xfs is for the user to verify that the devices being used do not have volatile write caches.
For my NVMe boot drive, I’m able to see that it has a volatile write cache:
Let’s turn it off and run our benchmark again:
| Label | avg | min | max | stddev | out% |
|---|---|---|---|---|---|
| append | 5.492ms | 5.123ms | 12.75ms | 564.1us | 0.016% |
| preallocate | 2.763ms | 1.665ms | 11.75ms | 1.685ms | 0.006% |
| syncrange | 2.025ms | 1.592ms | 5.723ms | 910.1us | 0.064% |
Our sub-millisecond times have vanished. The only reason sync_file_range was faster was because it was only writing to the volatile write cache. By disabling the volatile write cache, the benefits of sync_file_range compared to the preallocation strategy diminish.
Conclusions about sync_file_range
- sync_file_range is only safe to use on specific filesystems. Of the four I tested, xfs and ext4 appear to be completely reliable in their implementations, and zfs and btrfs both are completely unreliable in their implementations.
- sync_file_range is only safe to use on fully initialized pages.
- ftruncate to extend a file does not fully initialize newly allocated pages with zeroes and may take shortcuts instead. This makes using sync_file_range on space allocated with ftruncate or similar operations unsafe to use.
- Even with all of these conditions being met, volatile write caches on the disk must be disabled to ensure full durability.
Mac OS/iOS: Does F_BARRIERFSYNC provide durable writes?
Some apps require a write barrier to ensure data persistence before subsequent operations can proceed. Most apps can use the fcntl(_:_:) F_BARRIERFSYNC for this.
Only use F_FULLFSYNC when your app requires a strong expectation of data persistence.
Great, so fcntl with F_FULLFSYNC is what to use instead of fsync. Let’s keep reading.
Note that F_FULLFSYNC represents a best-effort guarantee that iOS writes data to the disk, but data can still be lost in the case of sudden power loss.
Apple really dropped the ball here. According to all available documentation I can find: there is no way to run a truly ACID-compliant database on Mac OS. You can get close, but a power loss could still result in a write being reported as successfully persisted being gone after a power outage. A post on Michael Tsai’s blog covers the investigation into this in more detail.
One interesting note is that SQLite uses F_BARRIERFSYNC by default for all of its file synchronization on Mac/iOS. Optionally, you can use the fullfsync pragma to enable usage of F_FULLFSYNC. Given the relative overhead of the two APIs in my limited testing, I can understand their decision, but I’m not sure it’s the best default.
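For anyone wanting to compare the two APIs themselves, here is a rough sketch of invoking them from Rust. The command values are what I believe the macOS headers define, so treat them as assumptions and check them against your SDK. The SyncLevel enum and sync_with function are hypothetical names. On other platforms the sketch simply falls back to File::sync_all.

```rust
use std::fs::File;
use std::io;

/// Which macOS synchronization command to issue (hypothetical type).
#[derive(Clone, Copy, Debug)]
enum SyncLevel {
    Barrier,
    Full,
}

#[cfg(target_os = "macos")]
fn sync_with(file: &File, level: SyncLevel) -> io::Result<()> {
    use std::os::raw::c_int;
    use std::os::unix::io::AsRawFd;

    // Assumed values from the macOS <sys/fcntl.h> header; verify locally.
    const F_FULLFSYNC: c_int = 51;
    const F_BARRIERFSYNC: c_int = 85;

    // `unsafe extern` needs a recent toolchain (Rust 1.82 or newer).
    unsafe extern "C" {
        fn fcntl(fd: c_int, cmd: c_int, ...) -> c_int;
    }

    let cmd = match level {
        SyncLevel::Barrier => F_BARRIERFSYNC,
        SyncLevel::Full => F_FULLFSYNC,
    };
    // SAFETY: the descriptor is valid for the duration of the call and
    // neither command takes a pointer argument.
    let rc = unsafe { fcntl(file.as_raw_fd(), cmd) };
    if rc == -1 { Err(io::Error::last_os_error()) } else { Ok(()) }
}

/// Fallback for other platforms: let the standard library pick the API.
#[cfg(not(target_os = "macos"))]
fn sync_with(file: &File, _level: SyncLevel) -> io::Result<()> {
    file.sync_all()
}

fn main() -> io::Result<()> {
    use std::io::Write;
    let path = std::env::temp_dir().join("barrier-vs-full-example.bin");
    let mut file = File::create(&path)?;
    file.write_all(b"payload")?;
    for level in [SyncLevel::Barrier, SyncLevel::Full] {
        println!("{level:?}: {:?}", sync_with(&file, level));
    }
    drop(file);
    std::fs::remove_file(&path)
}
```

Timing each level in a loop is enough to see the relative overhead mentioned above.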
Windows: Are there any APIs for partially syncing files?
No. Unless you are utilizing memory mapped files, the only API available on Windows is FlushFileBuffers.
What does all of this mean for BonsaiDb?
Astute readers may have noticed that the Nebari benchmarks claimed to be similar in performance to SQLite even post-sync changes. This is true on many metrics, but it’s not an apples-to-apples comparison, and the difference is the primary reason BonsaiDb slowed down.
SQLite has many approaches to persistence, but let’s look at the journaled version as it’s fairly straightforward. When using a journal, SQLite creates a file that contains information needed to undo the changes it’s about to make to the database file. To ensure a consistent state that can be recovered after a power outage at any point in this process, it must make at least two fsync calls.
Nebari gets away with one fsync operation due to its append-only nature. However, the moment you use the Roots type, there’s one more fsync operation: the transaction log. Thus, Nebari isn’t actually faster than SQLite when a transaction log is used, which is a requirement for multi-tree transactions.
This is further exacerbated by Nebari’s Multi-Reader, Single-Writer model of transactions. If two threads are trying to write to the same tree, one will begin its transaction and finish it while the other has to wait patiently for the lock on the tree to be released.
This two-step sync method combined with contention over a few collections is what caused the Commerce Benchmark to grind to a halt after fsync was actually doing real work. Individual worker threads would back up waiting for their turn to modify a collection.
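A back-of-the-envelope model shows why this hurts so much. Assuming, as a simplification, that every commit to a tree performs its syncs one after another while holding the tree's write lock, the sync latency alone caps throughput no matter how many worker threads are waiting. The function below is a hypothetical illustration of that bound, not a measurement.

```rust
/// Upper bound on commits per second for one tree when each commit performs
/// `syncs_per_commit` sequential syncs while holding an exclusive lock.
/// (A simplified model; real overlap between the syncs would raise the bound.)
fn max_serialized_commits_per_sec(sync_latency_ms: f64, syncs_per_commit: u32) -> f64 {
    1000.0 / (sync_latency_ms * f64::from(syncs_per_commit))
}

fn main() {
    // 2.628ms is the `append` average from the harness table earlier.
    let latency_ms = 2.628;
    for syncs in [1, 2] {
        println!(
            "{syncs} sync(s) per commit: at most {:.0} commits/sec",
            max_serialized_commits_per_sec(latency_ms, syncs)
        );
    }
}
```

Under this model, adding threads does not help: they only lengthen the queue for the same lock.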
Nebari’s architecture was designed in October, and I spent countless hours profiling and testing its performance. Due to the aforementioned issues with my methodology, so many of my performance assumptions were flat out wrong.
What's next?
It’s clear from these results that whatever solution is picked for BonsaiDb, it needs to support a way to allow multiple transactions to proceed at the same time. This is a much tougher problem to solve, and I’m uncertain I want to tackle this problem myself.
For the longest time, I developed BonsaiDb with minimal "advertising." Imposter syndrome prevented me from sharing it for most of 2021. Over the alpha period, I finally started feeling confident in its reliability. Now, I’m back to questioning whether I should attempt a new version of Nebari.
On one hand, seeing that Nebari is still pretty fast after fixing this bug should prove to me that I can write a fast database. On the other hand, I’m so embarrassed I didn’t notice these issues earlier, and it’s demoralizing to think of all the time spent building upon mistaken assumptions. Nebari will also need to transition to a more complex architecture, which makes it lose some of the appeal I had for it.
The only thing I can say with confidence right now is that I still firmly believe in my vision of BonsaiDb, regardless of what storage layer powers it. I will figure out my plans soon so that existing users aren’t left in the lurch for too long.
Lastly, I just want to say thank you to everyone who has supported me through this journey. Despite the recent stress, BonsaiDb and Nebari have been fun and rewarding projects to build.
BonsaiDb by Khonsu Labs. BonsaiDb’s source code is dual-licensed with MIT and Apache License 2.0. This website’s content is licensed CC BY NC SA 4.0.
Hardware Recommendations
Ceph was designed to run on commodity hardware, which makes building and maintaining petabyte-scale data clusters economically feasible. When planning out your cluster hardware, you will need to balance a number of considerations, including failure domains and potential performance issues. Hardware planning should include distributing Ceph daemons and other processes that use Ceph across many hosts. Generally, we recommend running Ceph daemons of a specific type on a host configured for that type of daemon. We recommend using other hosts for processes that utilize your data cluster (e.g., OpenStack, CloudStack, etc).
Check out the Ceph blog too.
CephFS metadata servers (MDS) are CPU-intensive and should therefore have quad-core (or better) CPUs and high clock rates (GHz). OSD nodes need enough processing power to run the RADOS service, to calculate data placement with CRUSH, to replicate data, and to maintain their own copies of the cluster map.
The requirements of one Ceph cluster are not the same as the requirements of another, but here are some general guidelines.
In earlier versions of Ceph, we would make hardware recommendations based on the number of cores per OSD, but this cores-per-OSD metric is no longer as useful a metric as the number of cycles per IOP and the number of IOPs per OSD. For example, for NVMe drives, Ceph can easily utilize five or six cores on real clusters and up to about fourteen cores on single OSDs in isolation. So cores per OSD are no longer as pressing a concern as they were. When selecting hardware, select for IOPs per core.
Monitor nodes and manager nodes have no heavy CPU demands and require only modest processors. If your host machines will run CPU-intensive processes in addition to Ceph daemons, make sure that you have enough processing power to run both the CPU-intensive processes and the Ceph daemons. (OpenStack Nova is one such example of a CPU-intensive process.) We recommend that you run non-Ceph CPU-intensive processes on separate hosts (that is, on hosts that are not your monitor and manager nodes) in order to avoid resource contention.
Generally, more RAM is better. Monitor / manager nodes for a modest cluster might do fine with 64GB; for a larger cluster with hundreds of OSDs 128GB is a reasonable target. There is a memory target for BlueStore OSDs that defaults to 4GB. Factor in a prudent margin for the operating system and administrative tasks (like monitoring and metrics) as well as increased consumption during recovery: provisioning 8GB per BlueStore OSD is advised.
Monitors and managers (ceph-mon and ceph-mgr)
Monitor and manager daemon memory usage generally scales with the size of the cluster. Note that at boot-time and during topology changes and recovery these daemons will need more RAM than they do during steady-state operation, so plan for peak usage. For very small clusters, 32 GB suffices. For clusters of up to, say, 300 OSDs go with 64GB. For clusters built with (or which will grow to) even more OSDs you should provision 128GB. You may also want to consider tuning the following settings:
Metadata servers (ceph-mds)
The metadata daemon memory utilization depends on how much memory its cache is configured to consume. We recommend 1 GB as a minimum for most systems. See mds_cache_memory_limit .
Memory
Bluestore uses its own memory to cache data rather than relying on the operating system’s page cache. In Bluestore you can adjust the amount of memory that the OSD attempts to consume by changing the osd_memory_target configuration option.
Setting the osd_memory_target below 2GB is typically not recommended (Ceph may fail to keep the memory consumption under 2GB and this may cause extremely slow performance).
Setting the memory target between 2GB and 4GB typically works but may result in degraded performance: metadata may be read from disk during IO unless the active data set is relatively small.
4GB is the current default osd_memory_target size. This default was chosen for typical use cases, and is intended to balance memory requirements and OSD performance.
Setting the osd_memory_target higher than 4GB can improve performance when there are many (small) objects or when large (256GB/OSD or more) data sets are processed.
OSD memory autotuning is “best effort”. Although the OSD may unmap memory to allow the kernel to reclaim it, there is no guarantee that the kernel will actually reclaim freed memory within a specific time frame. This applies especially in older versions of Ceph, where transparent huge pages can prevent the kernel from reclaiming memory that was freed from fragmented huge pages. Modern versions of Ceph disable transparent huge pages at the application level to avoid this, but that does not guarantee that the kernel will immediately reclaim unmapped memory. The OSD may still at times exceed its memory target. We recommend budgeting approximately 20% extra memory on your system to prevent OSDs from going OOM (Out Of Memory) during temporary spikes or due to delay in the kernel reclaiming freed pages. That 20% value might be more or less than needed, depending on the exact configuration of the system.
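As a rough illustration of that budgeting rule, the following sketch computes the memory to reserve for the OSDs on one host. The function name and the example host with 12 OSDs are assumptions, and the result excludes memory for the operating system and other daemons.

```rust
/// Memory to reserve for all OSDs on a host: the per-OSD memory target
/// plus a fractional headroom for temporary spikes.
/// (`osd_ram_budget_gb` is a hypothetical helper name.)
fn osd_ram_budget_gb(osd_count: u32, osd_memory_target_gb: f64, headroom: f64) -> f64 {
    f64::from(osd_count) * osd_memory_target_gb * (1.0 + headroom)
}

fn main() {
    // Example host (assumed): 12 OSDs, the default 4GB target, 20% headroom.
    let budget = osd_ram_budget_gb(12, 4.0, 0.20);
    println!("reserve about {budget:.1} GB for OSDs, before OS overhead");
}
```

Treat the 20% headroom as a starting point to adjust after observing the cluster under recovery load.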
When using the legacy FileStore back end, the page cache is used for caching data, so no tuning is normally needed. In that case, OSD memory consumption is related to the number of PGs per daemon in the system.
Data Storage
Plan your data storage configuration carefully. There are significant cost and performance tradeoffs to consider when planning for data storage. Simultaneous OS operations and simultaneous requests from multiple daemons for read and write operations against a single drive can slow performance.
Hard Disk Drives
OSDs should have plenty of storage drive space for object data. We recommend a minimum disk drive size of 1 terabyte. Consider the cost-per-gigabyte advantage of larger disks. We recommend dividing the price of the disk drive by the number of gigabytes to arrive at a cost per gigabyte, because larger drives may have a significant impact on the cost-per-gigabyte. For example, a 1 terabyte hard disk priced at $75.00 has a cost of $0.07 per gigabyte (i.e., $75 / 1024 = 0.0732). By contrast, a 3 terabyte disk priced at $150.00 has a cost of $0.05 per gigabyte (i.e., $150 / 3072 = 0.0488). In the foregoing example, using the 1 terabyte disks would generally increase the cost per gigabyte by 40% based on the rounded figures (50% using the unrounded values), rendering your cluster substantially less cost efficient.
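The same comparison can be scripted for any set of candidate drives. The sketch below reproduces the arithmetic with the example prices from the text. The function name is hypothetical. With unrounded figures the 1 terabyte drive works out to 50% more per gigabyte, while the rounded $0.07 and $0.05 give 40%.

```rust
/// Cost per gigabyte, using 1024 gigabytes per terabyte as in the text.
/// (`cost_per_gb` is a hypothetical helper name.)
fn cost_per_gb(price_usd: f64, capacity_tb: f64) -> f64 {
    price_usd / (capacity_tb * 1024.0)
}

fn main() {
    let small = cost_per_gb(75.0, 1.0);
    let large = cost_per_gb(150.0, 3.0);
    println!("1 TB: ${small:.4}/GB, 3 TB: ${large:.4}/GB");
    println!(
        "the 1 TB drive costs {:.0}% more per GB",
        (small / large - 1.0) * 100.0
    );
}
```

Prices change quickly, so rerun the calculation with current quotes rather than relying on these figures.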
Running multiple OSDs on a single SAS / SATA drive is NOT a good idea. NVMe drives, however, can achieve improved performance by being split into two or more OSDs.
Running an OSD and a monitor or a metadata server on a single drive is also NOT a good idea.
With spinning disks, the SATA and SAS interface increasingly becomes a bottleneck at larger capacities. See also the Storage Networking Industry Association’s Total Cost of Ownership calculator.
Storage drives are subject to limitations on seek time, access time, read and write times, as well as total throughput. These physical limitations affect overall system performance—especially during recovery. We recommend using a dedicated (ideally mirrored) drive for the operating system and software, and one drive for each Ceph OSD Daemon you run on the host (modulo NVMe above). Many “slow OSD” issues (when they are not attributable to hardware failure) arise from running an operating system and multiple OSDs on the same drive.
It is technically possible to run multiple Ceph OSD Daemons per SAS / SATA drive, but this will lead to resource contention and diminish overall throughput.
To get the best performance out of Ceph, run the following on separate drives: (1) operating systems, (2) OSD data, and (3) BlueStore db. For more information on how to effectively use a mix of fast drives and slow drives in your Ceph cluster, see the block and block.db section of the Bluestore Configuration Reference.
Solid State Drives
Ceph performance can be improved by using solid-state drives (SSDs). This reduces random access time and reduces latency while accelerating throughput.
SSDs cost more per gigabyte than do hard disk drives, but SSDs often offer access times that are, at a minimum, 100 times faster than hard disk drives. SSDs avoid hotspot issues and bottleneck issues within busy clusters, and they may offer better economics when TCO is evaluated holistically.
SSDs do not have moving mechanical parts, so they are not necessarily subject to the same types of limitations as hard disk drives. SSDs do have significant limitations though. When evaluating SSDs, it is important to consider the performance of sequential reads and writes.
We recommend exploring the use of SSDs to improve performance. However, before making a significant investment in SSDs, we strongly recommend reviewing the performance metrics of an SSD and testing the SSD in a test configuration in order to gauge performance.
Relatively inexpensive SSDs may appeal to your sense of economy. Use caution. Acceptable IOPS are not the only factor to consider when selecting an SSD for use with Ceph.
SSDs have historically been cost prohibitive for object storage, but emerging QLC drives are closing the gap, offering greater density with lower power consumption and less power spent on cooling. HDD OSDs may see a significant performance improvement by offloading WAL+DB onto an SSD.
To get a better sense of the factors that determine the cost of storage, you might use the Storage Networking Industry Association’s Total Cost of Ownership calculator.
Partition Alignment
When using SSDs with Ceph, make sure that your partitions are properly aligned. Improperly aligned partitions suffer slower data transfer speeds than do properly aligned partitions. For more information about proper partition alignment and example commands that show how to align partitions properly, see Werner Fischer’s blog post on partition alignment.
CephFS Metadata Segregation
One way that Ceph accelerates CephFS file system performance is by segregating the storage of CephFS metadata from the storage of the CephFS file contents. Ceph provides a default metadata pool for CephFS metadata. You will never have to create a pool for CephFS metadata, but you can create a CRUSH map hierarchy for your CephFS metadata pool that points only to SSD storage media. See CRUSH Device Class for details.
Controllers
Disk controllers (HBAs) can have a significant impact on write throughput. Carefully consider your selection of HBAs to ensure that they do not create a performance bottleneck. Notably, RAID-mode (IR) HBAs may exhibit higher latency than simpler “JBOD” (IT) mode HBAs. The RAID SoC, write cache, and battery backup can substantially increase hardware and maintenance costs. Some RAID HBAs can be configured with an IT-mode “personality”.
The Ceph blog is often an excellent source of information on Ceph performance issues. See Ceph Write Throughput 1 and Ceph Write Throughput 2 for additional details.
Benchmarking
BlueStore opens block devices in O_DIRECT and uses fsync frequently to ensure that data is safely persisted to media. You can evaluate a drive’s low-level write performance using fio . For example, 4kB random write performance is measured as follows:
Write Caches
Enterprise SSDs and HDDs normally include power loss protection features which use multi-level caches to speed up direct or synchronous writes. These devices can be toggled between two caching modes — a volatile cache flushed to persistent media with fsync, or a non-volatile cache written synchronously.
These two modes are selected by either “enabling” or “disabling” the write (volatile) cache. When the volatile cache is enabled, Linux uses a device in “write back” mode, and when disabled, it uses “write through”.
The default configuration (normally caching enabled) may not be optimal. Disabling the write cache may dramatically improve OSD performance, yielding higher IOPS and lower commit_latency.
Users are therefore encouraged to benchmark their devices with fio as described earlier and persist the optimal cache configuration for their devices.
The cache configuration can be queried with hdparm , sdparm , smartctl or by reading the values in /sys/class/scsi_disk/*/cache_type , for example:
The write cache can be disabled with those same tools:
Normally, disabling the cache using hdparm , sdparm , or smartctl results in the cache_type changing automatically to “write through”. If this is not the case, you can try setting it directly as follows. (Users should note that setting cache_type also correctly persists the caching mode of the device until the next reboot):
This udev rule (tested on CentOS 8) will set all SATA/SAS device cache_types to “write through”:
This udev rule (tested on CentOS 7) will set all SATA/SAS device cache_types to “write through”:
The sdparm utility can be used to view/change the volatile write cache on several devices at once:
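Separately from the command-line tools above, the cache_type attribute under /sys/class/scsi_disk can also be read programmatically. The following is a minimal Python sketch under stated assumptions: the function names read_cache_types and uses_volatile_cache are hypothetical, and the sketch only reads the attribute and never changes it. It relies on the relationship described earlier, where "write back" means the volatile cache is enabled.

```python
import glob
import os


def read_cache_types(sysfs_root="/sys/class/scsi_disk"):
    """Return {device_id: cache_type string} for each SCSI disk under sysfs_root."""
    result = {}
    pattern = os.path.join(sysfs_root, "*", "cache_type")
    for path in sorted(glob.glob(pattern)):
        device_id = os.path.basename(os.path.dirname(path))
        try:
            with open(path) as f:
                result[device_id] = f.read().strip()
        except OSError:
            # Attribute unreadable (permissions, device removed): skip it.
            continue
    return result


def uses_volatile_cache(cache_type):
    """True when the reported mode is a "write back" variant."""
    return cache_type.startswith("write back")
```

Calling read_cache_types() with no arguments scans the real sysfs tree; passing another directory makes the function easy to exercise against fixture files.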
Additional Considerations
You typically will run multiple OSDs per host, but you should ensure that the aggregate throughput of your OSD drives doesn’t exceed the network bandwidth required to service a client’s need to read or write data. You should also consider what percentage of the overall data the cluster stores on each host. If the percentage on a particular host is large and the host fails, it can lead to problems such as exceeding the full ratio , which causes Ceph to halt operations as a safety precaution that prevents data loss.
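The throughput comparison described above is simple arithmetic. As a rough illustration, the Python sketch below sums per-drive throughput and compares it with the NIC line rate. The function name and the drive figures in the usage note are hypothetical, and the calculation ignores protocol overhead and replication traffic.

```python
def osd_throughput_vs_network(drive_mb_per_s, nic_gbit_per_s):
    """Compare summed drive throughput (MB/s) with NIC line rate (Gb/s).

    Returns (exceeds, aggregate_mb_per_s, nic_mb_per_s).
    """
    aggregate_mb_per_s = sum(drive_mb_per_s)
    # Decimal units: 1 Gb/s = 1000 Mb/s = 125 MB/s.
    nic_mb_per_s = nic_gbit_per_s * 1000 / 8
    return aggregate_mb_per_s > nic_mb_per_s, aggregate_mb_per_s, nic_mb_per_s
```

For example, twelve drives assumed to sustain 150 MB/s each add up to 1800 MB/s, which is more than the 1250 MB/s a single 10 Gb/s link can carry.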
When you run multiple OSDs per host, you also need to ensure that the kernel is up to date. See OS Recommendations for notes on glibc and syncfs(2) to ensure that your hardware performs as expected when running multiple OSDs per host.
Networks
Provision at least 10 Gb/s networking in your racks.
Speed
It takes roughly three hours to replicate 1 TB of data across a 1 Gb/s network and roughly thirty hours to replicate 10 TB across a 1 Gb/s network. By contrast, it takes only about twenty minutes to replicate 1 TB across a 10 Gb/s network, and about three hours to replicate 10 TB across a 10 Gb/s network.
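To see where figures of this magnitude come from, the ideal transfer time can be computed directly from data size and link speed. The Python sketch below gives only a lower bound: it assumes decimal units (1 TB = 10^12 bytes), a fully utilized link, and no protocol overhead or contention, so real replication takes longer. The function name is hypothetical.

```python
def ideal_replication_seconds(terabytes, gbit_per_s):
    """Lower bound on transfer time at full line rate, ignoring all overhead."""
    bits = terabytes * 1e12 * 8
    return bits / (gbit_per_s * 1e9)


for tb, gbps in [(1, 1), (10, 1), (1, 10), (10, 10)]:
    hours = ideal_replication_seconds(tb, gbps) / 3600
    print(f"{tb:>2} TB over {gbps:>2} Gb/s: at least {hours:.1f} h")
```

At line rate, 1 TB over 1 Gb/s needs 8000 seconds (about 2.2 hours), and each tenfold increase in link speed divides that bound by ten.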
The larger the Ceph cluster, the more common OSD failures will be. The faster that a placement group (PG) can recover from a degraded state to an active + clean state, the better. Notably, fast recovery minimizes the likelihood of multiple, overlapping failures that can cause data to become temporarily unavailable or even lost. Of course, when provisioning your network, you will have to balance price against performance.
Some deployment tools employ VLANs to make hardware and network cabling more manageable. VLANs that use the 802.1q protocol require VLAN-capable NICs and switches. The added expense of this hardware may be offset by the operational cost savings on network setup and maintenance. When using VLANs to handle VM traffic between the cluster and compute stacks (e.g., OpenStack, CloudStack, etc.), there is additional value in using 10 Gb/s Ethernet or better; 40 Gb/s or 25/50/100 Gb/s networking as of 2022 is common for production clusters.
Top-of-rack (TOR) switches also need fast and redundant uplinks to spine switches / routers, often at least 40 Gb/s.
Baseboard Management Controller (BMC)
Your server chassis should have a Baseboard Management Controller (BMC). Well-known examples are iDRAC (Dell), CIMC (Cisco UCS), and iLO (HPE). Administration and deployment tools may also use BMCs extensively, especially via IPMI or Redfish, so consider the cost/benefit tradeoff of an out-of-band network for security and administration. Hypervisor SSH access, VM image uploads, OS image installs, management sockets, etc. can impose significant loads on a network. Running three networks may seem like overkill, but each traffic path represents a potential capacity, throughput and/or performance bottleneck that you should carefully consider before deploying a large scale data cluster.
Failure Domains
A failure domain is any failure that prevents access to one or more OSDs. That could be a stopped daemon on a host, a disk failure, an OS crash, a malfunctioning NIC, a failed power supply, a network outage, a power outage, and so forth. When planning out your hardware needs, you must balance the temptation to reduce costs by placing too many responsibilities into too few failure domains against the added costs of isolating every potential failure domain.
Minimum Hardware Recommendations
Ceph can run on inexpensive commodity hardware. Small production clusters and development clusters can run successfully with modest hardware.
Explicit volatile write back cache control
Many storage devices, especially in the consumer market, come with volatile write back caches. That means the devices signal I/O completion to the operating system before data actually has hit the non-volatile storage. This behavior obviously speeds up various workloads, but it means the operating system needs to force data out to the non-volatile storage when it performs a data integrity operation like fsync, sync or an unmount.
The Linux block layer provides two simple mechanisms that let filesystems control the caching behavior of the storage device. These mechanisms are a forced cache flush, and the Force Unit Access (FUA) flag for requests.
Explicit cache flushes
The REQ_PREFLUSH flag can be ORed into the r/w flags of a bio submitted from the filesystem and will make sure the volatile cache of the storage device has been flushed before the actual I/O operation is started. This explicitly guarantees that previously completed write requests are on non-volatile storage before the flagged bio starts. In addition the REQ_PREFLUSH flag can be set on an otherwise empty bio structure, which causes only an explicit cache flush without any dependent I/O. It is recommended to use the blkdev_issue_flush() helper for a pure cache flush.
Forced Unit Access
The REQ_FUA flag can be ORed into the r/w flags of a bio submitted from the filesystem and will make sure that I/O completion for this request is only signaled after the data has been committed to non-volatile storage.
Implementation details for filesystems
Filesystems can simply set the REQ_PREFLUSH and REQ_FUA bits and do not have to worry if the underlying devices need any explicit cache flushing and how the Forced Unit Access is implemented. The REQ_PREFLUSH and REQ_FUA flags may both be set on a single bio.
Implementation details for bio based block drivers
These drivers will always see the REQ_PREFLUSH and REQ_FUA bits as they sit directly below the submit_bio interface. For remapping drivers the REQ_FUA bits need to be propagated to underlying devices, and a global flush needs to be implemented for bios with the REQ_PREFLUSH bit set. For real device drivers that do not have a volatile cache the REQ_PREFLUSH and REQ_FUA bits on non-empty bios can simply be ignored, and REQ_PREFLUSH requests without data can be completed successfully without doing any work. Drivers for devices with volatile caches need to implement the support for these flags themselves without any help from the block layer.
Implementation details for request_fn based block drivers
For devices that do not support volatile write caches there is no driver support required; the block layer completes empty REQ_PREFLUSH requests before entering the driver and strips off the REQ_PREFLUSH and REQ_FUA bits from requests that have a payload. For devices with volatile write caches the driver needs to tell the block layer that it supports flushing caches by doing:
and handle empty REQ_OP_FLUSH requests in its prep_fn/request_fn. Note that REQ_PREFLUSH requests with a payload are automatically turned into a sequence of an empty REQ_OP_FLUSH request followed by the actual write by the block layer. For devices that also support the FUA bit the block layer needs to be told to pass through the REQ_FUA bit using:
and the driver must handle write requests that have the REQ_FUA bit set in prep_fn/request_fn. If the FUA bit is not natively supported the block layer turns it into an empty REQ_OP_FLUSH request after the actual write.
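The rewriting rules described above for request_fn based drivers can be modeled as a small pure function. This is a toy Python model, not kernel code: the function name lower_request, its parameters, and the string labels for operations are all invented for illustration. It captures only the sequencing stated in the text.

```python
def lower_request(has_payload, preflush, fua, dev_has_cache, dev_supports_fua):
    """Return the ordered list of operations a driver would see in this model."""
    if not dev_has_cache:
        # No volatile cache: both flags are stripped, and an empty flush
        # is completed by the block layer without reaching the driver.
        return ["WRITE"] if has_payload else []

    ops = []
    if preflush:
        # A flagged write becomes an empty flush followed by the write;
        # an empty flagged request is just the flush.
        ops.append("FLUSH")
    if has_payload:
        if fua and dev_supports_fua:
            ops.append("WRITE+FUA")  # FUA bit passed through to the device
        else:
            ops.append("WRITE")
            if fua:
                # FUA not natively supported: emulate with a flush afterwards.
                ops.append("FLUSH")
    return ops
```

In this model a write carrying both flags on a device with a volatile cache but no native FUA support turns into three operations: flush, write, flush.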
Volatile write cache: what is it
Durability: NVMe disks
[ 2020-September-27 10:26 ]
Durability is the guarantee that data can be accessed after a failure. It seems like this should be very simple: either your system provides durable data storage, or it does not. However, durability is not a binary yes/no property, and instead should be defined as the kinds of failures you want your data to survive. Since there is usually some performance penalty for durability, many systems provide a way for only "important" writes to be durable, while "normal" writes will eventually be durable, with no specific guarantee about when. Finally, durability is rarely tested, since really testing it involves cutting the power to computer systems, which is disruptive and hard to automate. Production environments are designed to avoid these failures, so bugs are rarely triggered and hard to reproduce.
I’ve recently been investigating the durability guarantees in cloud platforms. I decided to start at the beginning: what guarantees are provided by the disks we connect to computers? To find out, I read the relevant sections of the Non-Volatile Memory Express (NVMe) specification (version 2.0), since it is the newest standard for high-performance SSDs. It also has an easy to find, freely available specification, unlike the older SATA or SCSI standards that were originally designed for magnetic disks. In the rest of this article, I will attempt to summarize the durability NVMe devices provide. I believe that most of this should also apply to SATA and SCSI. NVMe was designed as a higher performance replacement for those protocols, so the semantics can’t be too different.
[Updated 2021-10-28]: Russ Cox asked if disk sector overwrites are atomic on Twitter. It turns out that the NVMe specification requires that at a minimum, writes of a single logical block must be atomic. I’ve updated this article, and also updated the references to the NVMe 2.0 specification. For more details, see this mailing list post from Matthew Wilcox about the NVMe specification, and this excellent StackOverflow answer.
Ordering and atomicity
Before we can discuss durability, we should discuss some basic semantics of NVMe writes. Commands are submitted to devices using a set of queues. At some time later, the device acknowledges that the commands have completed. There is no ordering guaranteed between commands. From the Command Set Specification Section 2.1.2 "Command Ordering Requirements": "each command is processed as an independent entity without reference to other commands [...]. If there are ordering requirements between these commands, host software or the associated application is required to enforce that ordering". This means if the order matters, the software needs to wait for commands to complete before issuing the next commands. However, read commands are guaranteed to return the most recently completed write (Command Set Section 2.1.4.2.2 "Non-volatile requirements"), although they may also return data from uncompleted writes that have been queued.
A related issue with concurrent updates is atomicity. If there are concurrent writes to overlapping ranges, what are the permitted results? Typically, there are no guarantees. Specifically, "After execution of command A and command B, there may be an arbitrary mix of data from command A and command B in the LBA [logical block address] range specified" (Command Set Section 2.1.4.1.1 AWUN/NAWUN Example). This seems to permit literally any result in the case of concurrent writes, such as alternating bytes from command A and command B.
NVMe includes optional support for atomic writes, with different values for "normal operation" and after power failure. These are defined by the Atomic Write Unit Normal (AWUN) and Atomic Write Unit Power Fail (AWUPF) settings for the device. The couple of NVMe devices I looked at have these values set to zero (according to the nvme id-ctrl command). Somewhat confusingly, this means writes of a single logical block are atomic. The specification defines these values as "0’s based" (Base 1.4.2 Numerical Descriptions "A 0’s based value is a numbering scheme in which the number 0h represents a value of 1h [...]"; Command Set 4.1.5.2 I/O Command Set specific fields: "This field is specified in logical blocks and is a 0’s based value"). The device exposes the size of atomic writes so software can configure itself to use it. For example, see the MariaDB documentation about atomic writes. This can replace MySQL’s "doublewrite buffer," which is a mechanism that provides atomic writes on devices that don’t natively support them (nearly all disks).
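Because the "0’s based" encoding is easy to misread, here is a minimal Python sketch of the conversion. The function name and the block sizes used in the usage note are assumptions for illustration. The raw value would be whatever a tool such as nvme id-ctrl reports for AWUN or AWUPF.

```python
def atomic_write_bytes(raw_field, logical_block_size):
    """Convert a 0's based AWUN/AWUPF field into an atomic write size in bytes.

    A raw value of 0 means 1 logical block, 1 means 2 blocks, and so on.
    """
    if raw_field < 0:
        raise ValueError("raw field must be non-negative")
    blocks = raw_field + 1
    return blocks * logical_block_size
```

With a raw value of 0 and an assumed 4096-byte logical block, the atomic write size is 4096 bytes; a raw value of 7 with 512-byte blocks would also give 4096 bytes.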
Basically, NVMe provides "weak" ordering semantics similar to shared memory in multi-threaded programs. There are no guarantees if there are concurrent operations. This means if the order of writes matters, the software needs to submit the commands and wait for them to complete, and never have concurrent writes to overlapping ranges. The specification requires single logical block writes to be atomic. However, I would be nervous to rely on this. It requires very careful reading of the specification to determine that this is required. Older devices did not provide this guarantee. I suspect many devices may have bugs, particularly when power fails.
The Flush command
Without special commands, NVMe provides no guarantees about what data will survive a power failure (Command Set 2.1.4.2 "AWUPF/NAWUPF"). My reading of this means devices are permitted to return an error for all ranges where writes were "in flight" at the time of failure. If you want to be completely safe, you should avoid overwriting critical data by using write-ahead logging. This matches the semantics I found during power fail testing of SATA magnetic hard drives and SSDs in 2010.
The first NVMe mechanism that can be used to ensure data is durably written is the Flush command (Base Specification 7.1 "Flush command"). It writes everything in the write cache to non-volatile memory. More specifically, "The flush applies to all commands [...] completed by the controller prior to the submission of the Flush command". This means if you want a durable write, you need to submit the write, wait for it to complete, submit the flush, and wait for that to complete. If you submit writes after submitting the flush, but before it completes, they might also be flushed ("The controller may also flush additional data and/or metadata"). Most importantly, if you issue a flush, and it fails in the middle, there is no guarantee about what writes might exist on disk. The disk could have any of the writes, with no relation to the order they were submitted or completed. It could also choose to return an error for all the ranges.
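Applications rarely issue NVMe commands directly, but the same "write, wait, flush, wait" sequence appears at the operating system level as a write followed by fsync. The Python sketch below shows that pattern under stated assumptions: the function name durable_append is hypothetical, and whether fsync results in a device Flush command depends on the kernel, filesystem, and device, which this code neither controls nor verifies.

```python
import os


def durable_append(path, data):
    """Append data to path and return only after fsync has completed."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write may write fewer bytes than requested; loop until done.
            written = os.write(fd, view)
            view = view[written:]
        # The write calls have returned; now ask the OS to persist the data.
        os.fsync(fd)
    finally:
        os.close(fd)
```

The ordering matters: the flush request is made only after every write call has returned. A newly created file may also need its parent directory synced for the file itself to survive a crash, which this sketch omits.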
Force Unit Access (FUA)
The second mechanism to ensure durability is to set the Force Unit Access option on Write commands. This means that "the controller shall write that data and metadata, if any, to non-volatile media before indicating command completion" (Command Specification 3.2.6 "Write Command" Figure 63). In other words, data written with a FUA write should survive power failures, and the write will not complete until that is true. There is no ordering with other FUA writes, so you should avoid issuing writes for overlapping ranges. Interestingly, you can also specify FUA on a Read command, which is a bit surprising. It forces the referenced data to be flushed to non-volatile media, before reading it (Command Specification 3.2.4 "Read command" Figure 48). This means you can do a set of normal writes, then selectively flush a small portion of it by executing a FUA read of the data you want committed.
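The closest application-level analogue to a per-write durability request is opening a file with O_DSYNC, so that each write call returns only once the data is reported as persisted. The Python sketch below illustrates that idea and is an assumption-laden analogy: the function name is hypothetical, and how the operating system maps such writes onto device commands (FUA writes, flushes, or something else) is outside the sketch's control. Where O_DSYNC is unavailable, it falls back to an explicit fsync.

```python
import os

# O_DSYNC is not defined on every platform; fall back to fsync when missing.
_DSYNC = getattr(os, "O_DSYNC", 0)


def write_synchronously(path, offset, data):
    """Write data at offset; return only after the OS reports it persisted."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | _DSYNC, 0o644)
    try:
        view = memoryview(data)
        pos = offset
        while view:
            # pwrite may be short; continue from where it stopped.
            written = os.pwrite(fd, view, pos)
            view = view[written:]
            pos += written
        if not _DSYNC:
            os.fsync(fd)
    finally:
        os.close(fd)
```

As with FUA writes, nothing here orders one synchronous write relative to another, so callers should still avoid concurrent writes to overlapping ranges.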
Disabling write caching
The last mechanism that may ensure durability is to explicitly disable the write cache. If an NVMe device has a volatile write cache, it must be controllable. This means you can disable it (Base Specification 5.27.1.4 "Volatile Write Cache"). It appears to me that if the cache is disabled, then every write must not complete until it is written to non-volatile media, which should be equivalent to setting the FUA bit on every write. However, this is not clearly described in the specification, and I suspect this is rarely used.
Devices with power loss protection
Finally, it is worth pointing out that some disks provide "power loss protection." This means the device has been designed to complete any in-flight writes when power is lost. This can be implemented by providing backup power with a supercapacitor or battery that is used to flush the cache. In theory, these devices should show that they do not have a volatile write cache, so software could detect that and just use normal writes. However, these devices should ideally also treat FUA writes the same as non-FUA writes, and ignore cache flushes. As a result, I think it is best to design software for disks that have caches, since it can then work with any storage device. If you are using a device with power loss protection, you should still get better performance and some additional protection from failures.