Add Optional Swap For NixOS

The main appeal of SSDs in general is speed, but cheap SSDs can be unreliable while still performing quite well.

While using NixOS, if some hardware is not available at boot you can't simply boot into rescue mode and perform a quick edit on /etc/fstab; more advanced recovery processes need to take place. The main idea of this article is to explore the usage of a cheap NVMe drive, in my case a cheap M.2 card, while keeping the system able to boot in case of a hardware failure.

Within this document I intend to explore the idea of adding such a device as a swap drive, its performance implications, and the overall results.

Benchmark

This document's recommendations are grounded in experimental results and benchmarks. It is advisable to carry out these same benchmarks on other hardware, because the criteria for selecting the optimal choice may differ depending on the outcome produced by a specific option. This ensures that decisions are not biased towards one particular hardware arrangement and are instead based on objective measurements and observations. Keep in mind that the ideal swap setup may vary based on the hardware employed, thus running benchmarks on the system before implementation can help achieve the best overall outcome.

A quote about NVMe from Wikipedia:

By its design, NVM Express allows host hardware and software to fully exploit the levels of parallelism possible in modern SSDs. As a result, NVM Express reduces I/O overhead and brings various performance improvements relative to previous logical-device interfaces, including multiple long command queues, and reduced latency. - Wikipedia on NVMe

Based on the statement above and the benchmarks, these devices are a good choice for caching and swap partitions. Below is the benchmark for the NVMe drive that will be used:

sudo nix-shell -p hdparm --command "hdparm -tT /dev/nvme1n1"

/dev/nvme1n1:
 Timing cached reads:   22858 MB in  2.00 seconds = 11449.78 MB/sec
 Timing buffered disk reads: 2878 MB in  3.00 seconds = 959.07 MB/sec

The main focus of the above result should be the Timing buffered disk reads figure. Working with this as a baseline, we can better understand the impact of each choice made later on.

LUKS

Although LUKS may have a slight performance impact, my threat model requires encryption of all data at rest. Additionally, sensitive personal data, access tokens, and other credentials are stored in RAM and can end up in swap, so they must be protected accordingly. To assess the performance of LUKS on your hardware, use the following command.

cryptsetup benchmark

Benchmark table for LUKS

| Algorithm   | Key  | Encryption   | Decryption   |
|-------------|------|--------------|--------------|
| aes-cbc     | 128b | 1167.7 MiB/s | 3614.7 MiB/s |
| serpent-cbc | 128b | 110.2 MiB/s  | 402.6 MiB/s  |
| twofish-cbc | 128b | 227.6 MiB/s  | 407.1 MiB/s  |
| aes-cbc     | 256b | 898.8 MiB/s  | 3065.7 MiB/s |
| serpent-cbc | 256b | 111.8 MiB/s  | 402.7 MiB/s  |
| twofish-cbc | 256b | 230.7 MiB/s  | 407.9 MiB/s  |
| aes-xts     | 256b | 2946.5 MiB/s | 2956.0 MiB/s |
| serpent-xts | 256b | 369.3 MiB/s  | 370.7 MiB/s  |
| twofish-xts | 256b | 376.1 MiB/s  | 376.5 MiB/s  |
| aes-xts     | 512b | 2520.8 MiB/s | 2522.5 MiB/s |
| serpent-xts | 512b | 374.0 MiB/s  | 370.7 MiB/s  |
| twofish-xts | 512b | 378.5 MiB/s  | 377.0 MiB/s  |

The results indicate that the aes-xts algorithm offers balanced and stable encryption and decryption throughput, making it a well-rounded choice, even though it is roughly 20% slower than aes-cbc at decryption. Swap on a running machine requires consistent write performance; if a workload primarily consists of read operations, other options should be considered. Additionally, security can be enhanced by using a 512-bit key, although this costs approximately 15% of throughput in both directions and is not required in this specific scenario.
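
The cipher and key size can be made explicit when formatting the device. Below is a minimal sketch, assuming the device path used later in this article; note that for XTS the --key-size value passed to cryptsetup is the combined key, so the aes-xts 256b row above corresponds to --key-size 256, while --key-size 512 selects the slower 512-bit variant.

# Illustrative only: format with AES-XTS and a 256-bit combined key
cryptsetup -v luksFormat --cipher aes-xts-plain64 --key-size 256 /dev/nvme1n1p1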

Notes on AES XEX-based tweaked-codebook mode with ciphertext stealing (XTS)

Analyzing the XTS implementation [cite:@mcgrew_extended_2004] more deeply against our threat model, consider the statement below:

XTS mode is susceptible to data manipulation and tampering, and applications must employ measures to detect modifications of data if manipulation and tampering is a concern: "…since there are no authentication tags then any ciphertext (original or modified by attacker) will be decrypted as some plaintext and there is no built-in mechanism to detect alterations. The best that can be done is to ensure that any alteration of the ciphertext will completely randomize the plaintext, and rely on the application that uses this transform to include sufficient redundancy in its plaintext to detect and discard such random plaintexts." This would require maintaining checksums for all data and metadata on disk, as done in ZFS or Btrfs. However, in commonly used file systems such as ext4 and NTFS only metadata is protected against tampering, while the detection of data tampering is non-existent. - Wikipedia

We can assume this is not a concern here: the swap is handled and cleaned up by the kernel, and any modification an attacker makes to the on-disk structure at rest will not decrypt into any deterministic structure. The only remaining attack vector is destroying the actual data with random noise, which invalidates the whole device and is beyond the threat model of this implementation.

Kernel references for cleaning up the swap

To elaborate further on the risk raised above, let's explore the kernel implementation. The current kernel implementation uses Frontswap as the frontend for the swap interfaces. The following is the initialization code, taken from frontswap.c:

/*
 * Called when a swap device is swapon'd.
 */
void frontswap_init(unsigned type, unsigned long *map)

The initialization delegates the process to a field called init stored inside the frontswap_ops structure, defined below:

/*
 * frontswap_ops are added by frontswap_register_ops, and provide the
 * frontswap "backend" implementation functions.  Multiple implementations
 * may be registered, but implementations can never deregister.  This
 * is a simple singly-linked list of all registered implementations.
 */
static const struct frontswap_ops *frontswap_ops __read_mostly;

This structure is populated using the frontswap_register_ops function.

/*
 * Register operations for frontswap
 */
int frontswap_register_ops(const struct frontswap_ops *ops)
{
  if (frontswap_ops)
    return -EINVAL;

  frontswap_ops = ops;
  static_branch_inc(&frontswap_enabled_key);
  return 0;
}

For our use case, zswap handles this registration in zswap.c:

ret = frontswap_register_ops(&zswap_frontswap_ops);

Which is defined by the following struct:

static const struct frontswap_ops zswap_frontswap_ops = {
  .store = zswap_frontswap_store,
  .load = zswap_frontswap_load,
  .invalidate_page = zswap_frontswap_invalidate_page,
  .invalidate_area = zswap_frontswap_invalidate_area,
  .init = zswap_frontswap_init
};

The function zswap_frontswap_init is defined as follows:

static void zswap_frontswap_init(unsigned type)
{
  struct zswap_tree *tree;

  tree = kzalloc(sizeof(*tree), GFP_KERNEL);
  if (!tree) {
    pr_err("alloc failed, zswap disabled for swap type %d\n", type);
    return;
  }

  tree->rbroot = RB_ROOT;
  spin_lock_init(&tree->lock);
  zswap_trees[type] = tree;
}

So we finally reach the end of the execution tree, and we can see that the tree structure is allocated and zero-initialized thanks to the usage of kzalloc, as stated in the kzalloc documentation.

Name

kzalloc — allocate memory. The memory is set to zero.
Synopsis
void * kzalloc (size_t size,
                gfp_t flags);

Arguments

size_t size

    how many bytes of memory are required.
gfp_t flags

    the type of memory to allocate (see kmalloc).

Partitioning

The following disk will be split in a 60/40 ratio into two partitions:

lsblk /dev/nvme1n1
NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
nvme1n1     259:3    0 476.9G  0 disk
├─nvme1n1p1 259:4    0 286.2G  0 part
└─nvme1n1p2 259:5    0 190.8G  0 part

Partition the new device

export DEVICE="/dev/nvme1n1"
parted "${DEVICE}" -- mklabel gpt
parted "${DEVICE}" -- mkpart swap 0% 60%
parted "${DEVICE}" -- mkpart swap 60% 100%

LUKS

LUKS can be set up with the following:

export DEVICE="/dev/nvme1n1"
cryptsetup -v luksFormat "${DEVICE}p1"
cryptsetup -v luksFormat "${DEVICE}p2"
cryptsetup open "${DEVICE}p1" "swap"
cryptsetup open "${DEVICE}p2" "cache"

Keys

NixOS needs the keys to be available at boot, either on the root filesystem or on a partition mounted at boot; I will use my /root directory for this.

sudo dd count=4096 bs=1 if=/dev/urandom of=/root/.swap.key
sudo dd count=4096 bs=1 if=/dev/urandom of=/root/.cache.key
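
Since the key files live on the root filesystem, it is worth making sure only root can read them; for example:

sudo chown root:root /root/.swap.key /root/.cache.key
sudo chmod 0400 /root/.swap.key /root/.cache.key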

The last step is to add them to LUKS:

cryptsetup luksAddKey "${DEVICE}p1" /root/.swap.key
cryptsetup luksAddKey "${DEVICE}p2" /root/.cache.key
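
To confirm that the key files actually unlock the containers, cryptsetup can test them without creating a new mapping:

cryptsetup open --test-passphrase --key-file /root/.swap.key "${DEVICE}p1"
cryptsetup open --test-passphrase --key-file /root/.cache.key "${DEVICE}p2"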

Notes on making the device optional

Two mount options are required to make the device optional while still mounting it at boot:

  • auto
  • nofail

This allows the device to be optional, given that it is a cheap piece of hardware that can die at any moment. From the mount(8) manual page:

       nofail
           Do not report errors for this device if it does not exist.

The Nix code representing this configuration:

swapDevices = [{
    device = "...";
    options = [ "defaults" "nofail" ];
}];

Swap

Create the swap area using mkswap on the mapped device:

sudo mkswap -L swap-nvme /dev/mapper/swap
Setting up swapspace version 1, size = 286.1 GiB (307248492544 bytes)
LABEL=swap-nvme, UUID=ac965b4f-f857-4cd3-8c87-91e0ca3a2271

A lazy way to get the proper configuration for the new swap partition is to activate it and run nixos-generate-config --root /tmp. It will generate the NixOS configuration in /tmp/etc/nixos/, and you can retrieve the hardware configuration directly from that directory.

sudo swapon /dev/mapper/swap
sudo nixos-generate-config --root /tmp

Another approach is to adapt the code below to your needs. Note that the block device backing the swap should be referenced by its partition UUID; optionally it can be referenced using partition labels.

  swapDevices = [{
    device = "/dev/disk/by-uuid/ac965b4f-f857-4cd3-8c87-91e0ca3a2271";
    options = [ "defaults" "nofail" ];
    discardPolicy = "once";
    encrypted = {
      label = "swap";
      blkDev = "/dev/disk/by-partuuid/faeffa11-a44f-47df-9520-4bdeb479a4e2";
      enable = true;
      keyFile = "/mnt-root/root/.swap.key";
    };
  }];
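
The UUIDs referenced above can be looked up with blkid or lsblk: the swap UUID comes from the mapped device after mkswap, while the partition UUID comes from the raw partition.

sudo blkid /dev/mapper/swap "${DEVICE}p1"
lsblk -o NAME,UUID,PARTUUID "${DEVICE}"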

After enabling this configuration the system will have available swap memory:

swapon --show
NAME      TYPE        SIZE USED PRIO
/dev/dm-2 partition 286.1G   1G   -2

ZSwap

ZSwap is a feature available in the Linux kernel that acts as a virtual memory compression tool, creating a compressed write-back cache for swapped pages. Rather than sending memory pages to a swap device when they are to be swapped out, the kernel creates a dynamic memory pool in system RAM and compresses the pages. This reduces the I/O required for swapping in Linux systems and allows for deferred or even avoided writeback to the actual swap device. However, it should be noted that utilizing this feature will require additional CPU cycles to perform the necessary compression.

ZSwap compresses memory pages using the Frontswap API. This provides a compressed pool which ZSwap can use to evict pages on a least recently used (LRU) basis. In case the pool is full, it writes the compressed pages back to the swap device it was sourced from.

Each allocation within the zpool is not directly accessible but requires a handle to be mapped before being accessed. The compressed memory pool is dynamically adjusted based on demand and is not preallocated. The default zpool type is zbud, but it can be changed at boot time or at runtime through the zpool attribute in sysfs.

echo zbud > /sys/module/zswap/parameters/zpool

Zbud type utilizes 1 page to store 2 compressed pages, yielding a compression ratio of 2:1 or potentially worse due to the use of half-full zbud pages. On the other hand, the zsmalloc type applies a more intricate compressed page storage mechanism that allows for higher storage densities. However, zsmalloc does not allow for compressed page eviction. In other words, once zswap reaches its capacity in zsmalloc, it can no longer remove the oldest compressed page, and it can only reject new pages.

When transitioning a swap page from frontswap to zswap, zswap establishes and preserves a correspondence between the swap entry, consisting of the swap type and swap offset, and the zpool handle that denotes the compressed swap page. This correspondence is accomplished by utilizing a red-black tree for each swap type, wherein the swap offset serves as the key for searching and accessing the tree nodes. During a page fault event that involves a Page Table Entry (PTE) which is associated with a swap entry, the frontswap module invokes the zswap load function. This function is responsible for decompressing the page and assigning it to the page that was previously allocated by the page fault handler.

Upon detection of a zero count in the PTE pointing to a swap page in zswap, the swap mechanism triggers the zswap invalidate function through frontswap to release the compressed entry.

ZSwap parameters can be changed at runtime by using the sysfs interface as follows:

echo lzo > /sys/module/zswap/parameters/compressor

Modifying the zpool or compressor parameter while the system is running does not affect already compressed pages, which remain in their original zpool. If a page is requested from an old zpool, it is decompressed using its original compressor. Once all pages are removed from an old zpool, the zpool and its compressor are freed.

Some of the pages in zswap are same-value filled pages (i.e. the contents of the page have the same value or a repetitive pattern). These pages, which include zero-filled pages, are handled differently. During a store operation, a page is checked for being same-value filled before it is compressed. If it is, the compressed length of the page is set to zero and the pattern or same-filled value is stored instead.

This is defined at zswap.c:

static int zswap_is_page_same_filled(void *ptr, unsigned long *value)
{
  unsigned long *page;
  unsigned long val;
  unsigned int pos, last_pos = PAGE_SIZE / sizeof(*page) - 1;

  page = (unsigned long *)ptr;
  val = page[0];

  if (val != page[last_pos])
    return 0;

  for (pos = 1; pos < last_pos; pos++) {
    if (val != page[pos])
      return 0;
  }

  *value = val;

  return 1;
}

The same-value filled pages feature is enabled by default, as defined in zswap.c:

/*
 * Enable/disable handling same-value filled pages (enabled by default).
 * If disabled every page is considered non-same-value filled.
 */
static bool zswap_same_filled_pages_enabled = true;
module_param_named(same_filled_pages_enabled, zswap_same_filled_pages_enabled, bool, 0644);

And can be disabled with:

echo 0 > /sys/module/zswap/parameters/same_filled_pages_enabled

Compression algorithm

The compression algorithm will be chosen assuming low-entropy input. While this does not reflect every possible use case, it does reflect a significant number of use cases in virtualization and machine learning workloads, where the entropy is low. For the benchmark, lzbench will be used.

git clone --depth=1 git@github.com:torvalds/linux.git
tar cf benchmark-linux linux/
lzbench benchmark-linux
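
lzbench itself is packaged in nixpkgs, so it can be run from an ad-hoc shell in the same way hdparm was used earlier (this assumes the lzbench attribute is available in your channel):

nix-shell -p lzbench --command "lzbench benchmark-linux"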

Below is the normalized table, with the output sorted by decompression speed.

| Compressor name      | Compress.  | Decompress. | Compr. size | Ratio  |
|----------------------|------------|-------------|-------------|--------|
| memcpy               | 14056 MB/s | 14754 MB/s  | 1632276480  | 100.00 |
| pithy 2011-12-24 -0  | 13817 MB/s | 13463 MB/s  | 1632245638  | 100.00 |
| shrinker 0.1         | 10285 MB/s | 13367 MB/s  | 1616198100  | 99.01  |
| pithy 2011-12-24 -6  | 15377 MB/s | 12930 MB/s  | 1632244500  | 100.00 |
| pithy 2011-12-24 -9  | 14700 MB/s | 12148 MB/s  | 1632244506  | 100.00 |
| pithy 2011-12-24 -3  | 15092 MB/s | 11888 MB/s  | 1632244920  | 100.00 |
| lz4fast 1.9.2 -17    | 1238 MB/s  | 4194 MB/s   | 815460247   | 49.96  |
| lz4fast 1.9.2 -3     | 932 MB/s   | 4135 MB/s   | 650891909   | 39.88  |
| lz4 1.9.2            | 887 MB/s   | 4086 MB/s   | 621863629   | 38.10  |
| lizard 1.0 -14       | 105 MB/s   | 3650 MB/s   | 530856258   | 32.52  |
| lizard 1.0 -13       | 115 MB/s   | 3598 MB/s   | 538995628   | 33.02  |
| lizard 1.0 -12       | 169 MB/s   | 3518 MB/s   | 554852288   | 33.99  |
| lizard 1.0 -10       | 703 MB/s   | 3421 MB/s   | 630084911   | 38.60  |
| lizard 1.0 -11       | 604 MB/s   | 3327 MB/s   | 610824735   | 37.42  |
| density 0.14.2 -1    | 1478 MB/s  | 2146 MB/s   | 1038311442  | 63.61  |
| snappy 2019-09-30    | 675 MB/s   | 2073 MB/s   | 628223243   | 38.49  |
| zstd 1.4.5 -1        | 653 MB/s   | 2054 MB/s   | 478706032   | 29.33  |
| zstd 1.4.5 -4        | 449 MB/s   | 2022 MB/s   | 451605004   | 27.67  |
| zstd 1.4.5 -3        | 478 MB/s   | 2019 MB/s   | 452407912   | 27.72  |
| zstd 1.4.5 -5        | 228 MB/s   | 2000 MB/s   | 438812038   | 26.88  |
| zstd 1.4.5 -2        | 587 MB/s   | 1990 MB/s   | 466928101   | 28.61  |
| density 0.14.2 -2    | 870 MB/s   | 1497 MB/s   | 707573496   | 43.35  |
| lzvn 2017-03-08      | 79 MB/s    | 1377 MB/s   | 531756070   | 32.58  |
| lzf 3.6 -1           | 402 MB/s   | 973 MB/s    | 640607930   | 39.25  |
| lzo1c 2.10 -1        | 277 MB/s   | 961 MB/s    | 628902387   | 38.53  |
| lzfse 2017-03-08     | 103 MB/s   | 952 MB/s    | 467004940   | 28.61  |
| lzo1x 2.10 -1        | 810 MB/s   | 950 MB/s    | 634398382   | 38.87  |
| lzo1b 2.10 -1        | 295 MB/s   | 939 MB/s    | 610647471   | 37.41  |
| lzf 3.6 -0           | 423 MB/s   | 934 MB/s    | 661446913   | 40.52  |
| fastlz 0.1 -2        | 412 MB/s   | 918 MB/s    | 624463805   | 38.26  |
| lzo1y 2.10 -1        | 810 MB/s   | 904 MB/s    | 631981327   | 38.72  |
| lzo1f 2.10 -1        | 267 MB/s   | 895 MB/s    | 632987938   | 38.78  |
| fastlz 0.1 -1        | 348 MB/s   | 893 MB/s    | 647180421   | 39.65  |
| lzrw 15-Jul-1991 -3  | 373 MB/s   | 743 MB/s    | 702146953   | 43.02  |
| lzrw 15-Jul-1991 -1  | 309 MB/s   | 691 MB/s    | 762638110   | 46.72  |
| lzrw 15-Jul-1991 -5  | 167 MB/s   | 586 MB/s    | 629737911   | 38.58  |
| quicklz 1.5.0 -1     | 568 MB/s   | 566 MB/s    | 614024659   | 37.62  |
| tornado 0.6a -1      | 412 MB/s   | 555 MB/s    | 676369612   | 41.44  |
| lzrw 15-Jul-1991 -4  | 409 MB/s   | 554 MB/s    | 678729307   | 41.58  |
| tornado 0.6a -2      | 367 MB/s   | 535 MB/s    | 591666214   | 36.25  |
| lzjb 2010            | 387 MB/s   | 530 MB/s    | 777076808   | 47.61  |
| quicklz 1.5.0 -2     | 287 MB/s   | 463 MB/s    | 568841016   | 34.85  |
| density 0.14.2 -3    | 487 MB/s   | 423 MB/s    | 612773674   | 37.54  |
| tornado 0.6a -3      | 251 MB/s   | 324 MB/s    | 493115543   | 30.21  |

This makes lz4fast 1.9.2 -3 a balanced option: its compression speed of 932 MB/s sits slightly below the NVMe throughput measured earlier, but most operations are reads, and the 4135 MB/s decompression throughput together with the 39.88% compression ratio is good enough.

The Linux kernel defines the default acceleration in lz4.h:

#define LZ4_ACCELERATION_DEFAULT 1

From the reference documentation:

Same as LZ4_compress_default(), but allows selection of "acceleration" factor. The larger the acceleration value, the faster the algorithm, but also the lesser the compression. It's a trade-off. It can be fine tuned, with each successive value providing roughly +~3% to speed. An acceleration value of "1" is the same as regular LZ4_compress_default() Values <= 0 will be replaced by LZ4_ACCELERATION_DEFAULT (currently = 1, see lz4.c). Values > LZ4_ACCELERATION_MAX will be replaced by LZ4_ACCELERATION_MAX (currently = 65537, see lz4.c).

So results similar to, but not exactly the same as, those shown above should be expected.

Setting the NixOS configuration for ZSwap with lz4fast

NixOS has built-in support for zswap; it just needs to be enabled. First, as a good configuration management practice, confirm that the setting is not already enabled, and then, after the change is applied, confirm that it is up and running. Whether ZSwap is enabled at boot time depends on whether the CONFIG_ZSWAP_DEFAULT_ON Kconfig option is set. This setting can be overridden on the kernel command line with the zswap.enabled option, for example zswap.enabled=0. ZSwap can also be enabled and disabled at runtime using the sysfs interface.

cat /proc/cmdline
initrd=\efi\nixos\dnap9dk2mgx1gdjgd61bdircvd08pbn7-initrd-linux-6.1-initrd.efi init=/nix/store/dgyxblfcrdgy6f1xiwfzvyaipzsh78vg-nixos-system-markarth-23.05.20230305.dirty/init loglevel=4
sudo cat /sys/module/zswap/parameters/enabled
N
Enabling using sysfs

An alternative is to enable it through the sysfs interface. This is useful when you want to test it but prefer not to change the configuration just yet.

echo 1 | sudo tee /sys/module/zswap/parameters/enabled

Then the following can be used to assert that it is running:

sudo grep -r . /sys/kernel/debug/zswap
/sys/kernel/debug/zswap/same_filled_pages:0
/sys/kernel/debug/zswap/stored_pages:0
/sys/kernel/debug/zswap/pool_total_size:0
/sys/kernel/debug/zswap/duplicate_entry:0
/sys/kernel/debug/zswap/written_back_pages:0
/sys/kernel/debug/zswap/reject_compress_poor:0
/sys/kernel/debug/zswap/reject_kmemcache_fail:0
/sys/kernel/debug/zswap/reject_alloc_fail:0
/sys/kernel/debug/zswap/reject_reclaim_fail:0
/sys/kernel/debug/zswap/pool_limit_hit:0
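
These counters can also be observed continuously while a workload is running, which makes it easier to see pages being stored, rejected, and written back:

sudo watch -n1 grep -r . /sys/kernel/debug/zswap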
Nix code

As explained in a previous section of this document, the kernel's default lz4 implementation uses LZ4_ACCELERATION_DEFAULT=1, so the only requirement is to set the corresponding kernel parameters. On a traditional distribution this would look like:

GRUB_CMDLINE_LINUX_DEFAULT="zswap.enabled=1 zswap.compressor=lz4"

Below is the complete code for enabling ZSwap on NixOS, along with other parameters.

boot.initrd = {
  availableKernelModules = [ "lz4" "lz4_compress" "z3fold" ];
  kernelModules = [ "lz4" "lz4_compress" "z3fold" ];
  preDeviceCommands = ''
    printf lz4 > /sys/module/zswap/parameters/compressor
    printf z3fold > /sys/module/zswap/parameters/zpool
  '';
};

boot.kernelParams = [ "zswap.enabled=1" "zswap.compressor=lz4" ];
boot.kernelPackages = pkgs.linuxPackages.extend (lib.const (super: {
  kernel = super.kernel.overrideDerivation (drv: {
    nativeBuildInputs = (drv.nativeBuildInputs or [ ]) ++ [ pkgs.lz4 ];
  });
}));
Validate the changes

Then, after a reboot, confirm the configuration changes:

cat /proc/cmdline
initrd=\efi\nixos\pax13psm300w02m0cfcd9rhif6v75694-initrd-linux-6.1-initrd.efi init=/nix/store/18785fqmc3vv9dm67gpzld64zni5vrxn-nixos-system-markarth-23.05.20230305.dirty/init zswap.enabled=1 zswap.compressor=lz4 loglevel=4
sudo cat /sys/module/zswap/parameters/enabled
Y

Validate the compression algorithm:

sudo cat /sys/module/zswap/parameters/compressor
lz4
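
The zpool set in preDeviceCommands can be verified the same way; with the configuration above it should report z3fold:

sudo cat /sys/module/zswap/parameters/zpool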

Notes on swap

Below is a non-exhaustive list of parameters that can be tweaked for better performance.

compact_memory

Available only when CONFIG_COMPACTION is set. When 1 is written to the file, all zones are compacted such that free memory is available in contiguous blocks where possible. This can be important for example in the allocation of huge pages although processes will also directly compact memory as required.
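
For example, a one-off compaction can be requested through procfs before allocating a large buffer:

echo 1 | sudo tee /proc/sys/vm/compact_memory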

compaction_proactiveness

This tunable takes a value in the range [0, 100] with a default value of 20. This tunable determines how aggressively compaction is done in the background. Write of a non zero value to this tunable will immediately trigger the proactive compaction. Setting it to 0 disables proactive compaction.

Note that compaction has a non-trivial system-wide impact as pages belonging to different processes are moved around, which could also lead to latency spikes in unsuspecting applications. The kernel employs various heuristics to avoid wasting CPU cycles if it detects that proactive compaction is not being effective.

Be careful when setting it to extreme values like 100, as that may cause excessive background compaction activity.

swappiness

This control is used to define the rough relative IO cost of swapping and filesystem paging, as a value between 0 and 200. At 100, the VM assumes equal IO cost and will thus apply memory pressure to the page cache and swap-backed pages equally; lower values signify more expensive swap IO, higher values indicate cheaper.

Keep in mind that filesystem IO patterns under memory pressure tend to be more efficient than swap’s random IO. An optimal value will require experimentation and will also be workload-dependent.

The default value is 60.

For in-memory swap, like zram or zswap, as well as hybrid setups that have swap on faster devices than the filesystem, values beyond 100 can be considered. For example, if the random IO against the swap device is on average 2x faster than IO from the filesystem, swappiness should be 133 (x + 2x = 200, 2x = 133.33).

At 0, the kernel will not initiate swap until the amount of free and file-backed pages is less than the high watermark in a zone.
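
These values can be tested at runtime with sysctl before committing them to the configuration; on NixOS the equivalent declarative knob is boot.kernel.sysctl. A sketch for the 2x-faster-swap example above:

sysctl vm.swappiness
sudo sysctl vm.swappiness=133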

Final benchmark

The benchmarking methodology for modern systems is a topic of debate. Often, the measurement methods used do not accurately represent real-world usage or the performance expected while actually using the system.

To eliminate potential biases in our evaluation, we will employ two approaches. First, a straightforward C program will be used to measure sequential and random access to memory regions, one byte at a time. This access will be performed by a single thread, and we will conduct assessments using two sets of data: low and high entropy.

As a second approach, sysbench will be used to check the read and write speed of memory. The main reason for using two approaches is that sysbench exercises a synthetic use of memory, with data that is not as close to the expected usage pattern.

Sysbench uses low entropy data for reads: the memory is initialized with zeros, giving a higher compression rate than normal usage data, which can affect the tests and skew the results towards better performance. This exploits the same-filled page feature of zswap and should be taken into consideration while interpreting the results. The code below was edited to remove irrelevant lines; unless sysbench is running on a system with huge pages enabled, the buffer is always filled with zeros.

int memory_init(void)
{
  unsigned int i;
  char         *s;
  size_t       *buffer;

  // ...
  // Code omitted for brevity...
  if (memory_scope == SB_MEM_SCOPE_GLOBAL)
  {
    // ...
    memset(buffer, 0, memory_block_size);
  }

  // ...
  // Code omitted for brevity...
  for (i = 0; i < sb_globals.threads; i++)
  {
    if (memory_scope == SB_MEM_SCOPE_GLOBAL)
      buffers[i] = buffer;
    else
    {
      // ...
      memset(buffers[i], 0, memory_block_size);
      // ...
    }
  }
  // ...
  return 0;
}

While reproducing these results, it is also interesting to experiment with hogging 95% of the memory so that more swap is used. Below is the command to accomplish this:

stress-ng \
  --vm-bytes \
  $(awk '/MemAvailable/{printf "%d\n", $2 * 0.95;}' < /proc/meminfo)k \
  --vm-keep -m 1

Full data in RAM

# time ./bench 1000
[+] Allocating 1000 MB
[+] Initializing memory with random data
[+] Memory initialized
[+] Sequential Access High Entropy:   483.09 mb/s
[+] Random Access High Entropy:       22.89 mb/s
[+] Allocating 1000 MB
[+] Initializing memory with low entropy data
[+] Memory initialized
[+] Sequential Access Low Entropy:    480.85 mb/s
[+] Random Access Low Entropy:        23.02 mb/s

real  1m50.561s
user  1m49.807s
sys   0m0.745s

Traditional Swap

# time ./bench 100000
[+] Allocating 100000 MB
[+] Initializing memory with random data
[+] Memory initialized
[+] Sequential Access High Entropy: 149.36 mb/s
[+] Random Access High Entropy:     19.36 mb/s
[+] Allocating 100000 MB
[+] Initializing memory with low entropy data
[+] Memory initialized
[+] Sequential Access Low Entropy: 150.17 mb/s
[+] Random Access Low Entropy:      19.60 mb/s

real 249m1.489s
user 207m38.827s
sys  4m25.676s

With ZSwap

lz4 + z3fold

# time ./bench 100000
[+] Allocating 100000 MB
[+] Initializing memory with random data
[+] Memory initialized
[+] Sequential Access High Entropy: 151.28 mb/s
[+] Random Access High Entropy:     19.49 mb/s
[+] Allocating 100000 MB
[+] Initializing memory with low entropy data
[+] Memory initialized
[+] Sequential Access Low Entropy: 381.58 mb/s
[+] Random Access Low Entropy:     19.72 mb/s

real 236m21.983s
user 207m16.326s
sys  4m3.063s

lz4 + zbud

# time ./bench 100000
[+] Allocating 100000 MB
[+] Initializing memory with random data
[+] Memory initialized
[+] Sequential Access High Entropy: 166.09 mb/s
[+] Random Access High Entropy:     19.68 mb/s
[+] Allocating 100000 MB
[+] Initializing memory with low entropy data
[+] Memory initialized
[+] Sequential Access Low Entropy:  381.18 mb/s
[+] Random Access Low Entropy:      19.68 mb/s

real 225m49.379s
user 206m33.178s
sys  3m59.969s

lzo + zbud

# time ./bench 100000
[+] Allocating 100000 MB
[+] Initializing memory with random data
[+] Memory initialized
[+] Sequential Access High Entropy: 169.18 mb/s
[+] Random Access High Entropy:     19.53 mb/s
[+] Allocating 100000 MB
[+] Initializing memory with low entropy data
[+] Memory initialized
[+] Sequential Access Low Entropy:  381.07 mb/s
[+] Random Access Low Entropy:      19.39 mb/s

real 225m59.620s
user 208m29.208s
sys  3m58.475s

Sysbench Read

# sysbench memory --memory-block-size=4G --memory-total-size=20G --memory-oper=read run
sysbench 1.0.20 (using system LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 1
Initializing random number generator from current time


Running memory speed test with the following options:
  block size: 4194304KiB
  total size: 20480MiB
  operation: read
  scope: global

Initializing worker threads...

Threads started!

Total operations: 5 (    5.02 per second)

20480.00 MiB transferred (20571.87 MiB/sec)


General statistics:
    total time:                          0.9942s
    total number of events:              5

Latency (ms):
         min:                                  197.70
         avg:                                  198.82
         max:                                  201.40
         95th percentile:                      200.47
         sum:                                  994.12

Threads fairness:
    events (avg/stddev):           5.0000/0.00
    execution time (avg/stddev):   0.9941/0.00

Sysbench Write

# sysbench memory --memory-block-size=4G --memory-total-size=20G --memory-oper=write run
sysbench 1.0.20 (using system LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 1
Initializing random number generator from current time


Running memory speed test with the following options:
  block size: 4194304KiB
  total size: 20480MiB
  operation: write
  scope: global

Initializing worker threads...

Threads started!
Total operations: 5 (    2.17 per second)

20480.00 MiB transferred (8869.54 MiB/sec)


General statistics:
    total time:                          2.3077s
    total number of events:              5

Latency (ms):
         min:                                  452.65
         avg:                                  461.51
         max:                                  477.72
         95th percentile:                      475.79
         sum:                                 2307.57

Threads fairness:
    events (avg/stddev):           5.0000/0.00
    execution time (avg/stddev):   2.3076/0.00

Sysbench Read with Swap

sysbench memory --memory-block-size=64G --memory-total-size=1500G --memory-oper=read run
sysbench 1.0.20 (using system LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 1
Initializing random number generator from current time


Running memory speed test with the following options:
  block size: 67108864KiB
  total size: 1536000MiB
  operation: read
  scope: global

Initializing worker threads...

Threads started!

Total operations: 1 (    0.00 per second)

65536.00 MiB transferred (159.72 MiB/sec)


General statistics:
    total time:                          410.3168s
    total number of events:              1

Latency (ms):
         min:                               410313.56
         avg:                               410313.56
         max:                               410313.56
         95th percentile:                   100000.00
         sum:                               410313.56

Threads fairness:
    events (avg/stddev):           1.0000/0.00
    execution time (avg/stddev):   410.3136/0.00

Sysbench Write with Swap

sysbench 1.0.20 (using system LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 1
Initializing random number generator from current time


Running memory speed test with the following options:
  block size: 67108864KiB
  total size: 1536000MiB
  operation: write
  scope: global

Initializing worker threads...

Threads started!

Total operations: 1 (    0.00 per second)

65536.00 MiB transferred (85.83 MiB/sec)


General statistics:
    total time:                          763.5311s
    total number of events:              1

Latency (ms):
         min:                               763527.78
         avg:                               763527.78
         max:                               763527.78
         95th percentile:                   100000.00
         sum:                               763527.78

Threads fairness:
    events (avg/stddev):           1.0000/0.00
    execution time (avg/stddev):   763.5278/0.00

Sysbench Read with ZSwap

# sysbench memory --memory-block-size=64G --memory-total-size=1500G --memory-oper=read run
sysbench 1.0.20 (using system LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 1
Initializing random number generator from current time


Running memory speed test with the following options:
  block size: 67108864KiB
  total size: 1536000MiB
  operation: read
  scope: global

Initializing worker threads...

Threads started!

Total operations: 1 (    0.02 per second)

65536.00 MiB transferred (1299.51 MiB/sec)


General statistics:
    total time:                          50.4301s
    total number of events:              1

Latency (ms):
         min:                                50428.18
         avg:                                50428.18
         max:                                50428.18
         95th percentile:                    50446.94
         sum:                                50428.18

Threads fairness:
    events (avg/stddev):           1.0000/0.00
    execution time (avg/stddev):   50.4282/0.00

Sysbench Write with ZSwap

# sysbench memory --memory-block-size=64G --memory-total-size=1500G --memory-oper=write run
sysbench 1.0.20 (using system LuaJIT 2.1.0-beta3)
Running the test with following options:
Number of threads: 1
Initializing random number generator from current time


Running memory speed test with the following options:
  block size: 67108864KiB
  total size: 1536000MiB
  operation: write
  scope: global

Initializing worker threads...

Threads started!

Total operations: 1 (    0.00 per second)

65536.00 MiB transferred (109.19 MiB/sec)


General statistics:
    total time:                          600.1754s
    total number of events:              1

Latency (ms):
         min:                               600078.36
         avg:                               600078.36
         max:                               600078.36
         95th percentile:                   100000.00
         sum:                               600078.36

Threads fairness:
    events (avg/stddev):           1.0000/0.00
    execution time (avg/stddev):   600.0784/0.00

ZSwap compression results

To determine how much space was gained, we can compare the expected storage consumption with the actual consumption.

Each page is stored in memory aligned in blocks, usually 4k in size, as given by PAGESIZE. ZSwap info can be obtained from the /sys/kernel/debug/zswap directory. The main calculation is the number of stored pages multiplied by the in-memory page size, divided by the total storage in use by ZSwap. The script below facilitates determining the real gains:

P=$(sudo cat /sys/kernel/debug/zswap/stored_pages)
S=$(sudo cat /sys/kernel/debug/zswap/pool_total_size)
PZ=$(getconf PAGESIZE)
SWZ=$(free -m | grep Swap | awk '{print $2}')
RATIO=$(( P*PZ * 100 / S ))
TOTAL=$(( SWZ * RATIO / 100 ))
echo "ZSwap compression gain of ${RATIO}%, actual swap of ${SWZ}mb can hold an estimated ${TOTAL}mb."

ZSwap compression gain of 237%, actual swap of 293014mb can hold an estimated 694443mb.

It should be noted that the presented numbers are based on estimations derived from a statistical approach. It is important to acknowledge that the actual results may differ slightly from those presented. Furthermore, it is worth mentioning that the workload used in this test was focused on training a convolutional neural network, with a relatively lower level of entropy compared to other tasks, such as video encoding.

The impact of using a single thread for test execution has been considered. Future work could investigate the performance of the benchmark in a multi-threaded environment and compare it with the performance of executing Python code under the GIL (Global Interpreter Lock).

Overall, the results are positive. By enabling lz4 and z3fold it was possible to obtain 251% of the read speed for swapped pages in the best scenario, while keeping the same baseline in high entropy scenarios. Along with that, the storage capabilities of the device were expanded to 237% on average during the tests, while maintaining 80% of the actual RAM speed in the best scenario, and 30% on high entropy data sets.

Appendix

Benchmark software source code

Compile with:

gcc bench.c -o bench

It accepts a single argument: the amount of memory, in megabytes, to be allocated for the benchmark.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define DEFAULT_MEM_SIZE 1024 // default benchmark uses 1gb

/*
 * Calculate a score which represents the throughput of data
 * in megabytes per second. Start and end are given in milliseconds.
 */
double score(unsigned long start, unsigned long end, size_t size){
  double mbs = size / 1024 / 1024;
  double score = end - start;
  score = mbs / score * 1000;
  return score;
}

/*
 * Initialize the memory with low entropy values to exploit the compression
 * capabilities and check the actual performance with low entropy data.
 */
void init_sequential(char* mem, size_t size) {
    for (size_t i = 0; i < size; i++) {
        mem[i] = i % 8;
    }
}

/*
 * Initialize the memory with random values so there are no optimisations nor
 * any hack that can be done during the benchmark to avoid the real access.
 */
void init_random(char* mem, size_t size) {
    for (size_t i = 0; i < size; i++) {
        mem[i] = rand() % 256;
    }
}
#pragma GCC push_options
#pragma GCC optimize ("O0")
/*
 * Test the access in sequential order, exploit the speculative execution engine
 * on the processor.
 */
double test_sequential_access(char* mem, size_t size) {
    long sum = 0;

    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    long start = ts.tv_sec * 1000 + ts.tv_nsec / 1000000;

    for (size_t i = 0; i < size; i++) {
        sum += mem[i];
    }

    timespec_get(&ts, TIME_UTC);
    long end = ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
    return score(start, end, size);
}

/*
 * Test the access in random order, to avoid exploiting the speculative
 * execution engine on the processor.
 */
double test_random_access(char* mem, size_t size) {
    long sum = 0;

    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    long start = ts.tv_sec * 1000 + ts.tv_nsec / 1000000;

    for (size_t i = 0; i < size; i++) {
        sum += mem[rand() % size];
    }

    timespec_get(&ts, TIME_UTC);
    long end = ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
    return score(start, end, size);
}
#pragma GCC pop_options


int main(int argc, char **argv){

    // initialize random seed
    srand(time(NULL));

    // parse command line arguments
    size_t mem_size_mb = DEFAULT_MEM_SIZE;
    if (argc > 1) {
        mem_size_mb = atoi(argv[1]);
    }
    size_t mem_size = mem_size_mb  * 1024 * 1024;

    // First round, not exploiting zram
    printf("[+] Allocating %zu MB\n", mem_size_mb);
    // allocate memory
    char* mem = (char*) malloc(mem_size);
    if (mem == NULL) {
        fprintf(stderr, "[-] Failed to allocate memory\n");
        exit(EXIT_FAILURE);
    }

    printf("[+] Initializing memory with random data\n");
    init_random(mem, mem_size);
    printf("[+] Memory initialized\n");
    printf("[+] Sequential Access High Entropy:\t %0.2lf mb/s \n", test_sequential_access(mem, mem_size));
    printf("[+] Random Access High Entropy:    \t %0.2lf mb/s \n", test_random_access(mem, mem_size));

    // free memory
    free(mem);
    mem = NULL;

    // Second round, exploiting zram/zswap
    printf("[+] Allocating %zu MB\n", mem_size_mb);

    // allocate memory
    mem = (char*) malloc(mem_size);
    if (mem == NULL) {
        fprintf(stderr, "[-] Failed to allocate memory\n");
        exit(EXIT_FAILURE);
    }
    printf("[+] Initializing memory with low entropy data\n");
    init_sequential(mem, mem_size);
    printf("[+] Memory initialized\n");
    printf("[+] Sequential Access Low Entropy: \t %0.2lf mb/s \n", test_sequential_access(mem, mem_size));
    printf("[+] Random Access Low Entropy:     \t %0.2lf mb/s \n", test_random_access(mem, mem_size));

    return EXIT_SUCCESS;
}