PyNVMe3 Design

Last Modified: September 5, 2025

Copyright © 2020-2025 GENG YUN Technology Pte. Ltd.
All Rights Reserved.


1. Architecture

PyNVMe3 builds on the SPDK PCIe/NVMe user-space driver to deliver high throughput and tight, deterministic timing. We expose SPDK’s capabilities through a clean Python API so test authors can script at the protocol level without touching C or kernel paths. The result is a fast, reproducible NVMe test platform that maps directly to NVMe concepts—controllers, namespaces, and queue pairs—while adding the testability features production drivers lack.

Figure 1. PyNVMe3 System Architecture

2. Performance

By leveraging SPDK’s polling-based I/O engine and exposing it through a Python API, PyNVMe3 removes the overhead imposed by the Linux kernel NVMe driver. This yields higher IOPS, lower latency, and more efficient CPU utilization than traditional tools such as fio.

The figure below compares Random Read IOPS for fio (green) and PyNVMe3 (orange) as the number of CPU cores increases from 1 to 5.

Figure 2. Random Read IOPS: fio vs. PyNVMe3

  1. Single-core performance
    • With a single CPU core, PyNVMe3 achieves 1.1M IOPS—nearly double fio’s 600K under the same conditions.
    • This advantage comes from the user-space design: PyNVMe3 avoids system calls, interrupts, and context switches, relying instead on direct memory access and polling-based queue management.
    • This higher single-core throughput enables testers to stress SSDs more effectively and uncover potential issues earlier.
  2. Scalability across multiple cores
    • As core counts increase, both fio and PyNVMe3 scale in IOPS. However, PyNVMe3 consistently reaches peak SSD throughput with fewer cores.
    • By the time 3–4 cores are active, PyNVMe3 has already saturated device performance, while fio requires additional cores to match it.
    • This demonstrates PyNVMe3’s better CPU efficiency, as it extracts maximum I/O capability without wasting compute cycles.

PyNVMe3 reveals SSD performance characteristics and potential issues even with limited hardware resources. For SSD validation, soak testing, and performance benchmarking, it provides not only speed and scalability, but also observability and integrity checks that fio lacks.

3. Data Verification

Ensuring data consistency is the most fundamental requirement of any storage validation system. While PyNVMe3 provides high throughput and rich testing features, its data verification mechanism is carefully designed to offer transparent, reliable, and high-performance consistency checks without compromising flexibility.

  1. Transparent, Script-Free Verification
    PyNVMe3 supports automatic data verification through a built-in CRC (Cyclic Redundancy Check) mechanism. Test authors do not need to explicitly compare read and write buffers—instead, enabling the verify fixture in a test case activates automatic integrity checks at the driver level.
  2. High Performance via In-Memory CRC
    Each LBA’s checksum is stored in huge-page memory, with a 1-byte CRC per 512-byte LBA. For a 4TB drive, this requires approximately 8GB of DRAM. PyNVMe3 allocates this memory during Namespace initialization. If insufficient memory is available, a warning is issued and verification is disabled—tests will still run, but without integrity checks. Users may restrict the range of verification via the nlba_verify parameter, which limits the number of LBAs tracked for CRC. This enables partial verification for large-capacity drives in resource-constrained environments.
  3. Enhanced Data Patterns for Collision Detection
    CRC alone is insufficient in some edge cases—such as repeated writes of identical data across all LBAs. To detect such anomalies, PyNVMe3 injects semantic patterns into each I/O payload. Scripts can disable this injection using Namespace.inject_write_buffer_disable(), in which case only CRC is used for validation.

    • First 8 bytes of each LBA: Encodes (NSID, LBA) as an identifier, similar to Protect Information (PI), enabling detection of address mix-ups.
    • Last 8 bytes of each LBA: Injects a global incrementing token per LBA, simulating a data version number to detect stale reads or out-of-order writes.
  4. Concurrency-Safe via LBA Locking
    NVMe operates asynchronously and command completions are not ordered. To ensure meaningful verification, PyNVMe3 implements an in-process LBA lock mechanism. This prevents concurrent I/O from corrupting or racing over shared LBA ranges, ensuring that every verified I/O is isolated and deterministic.

    • Before any I/O is issued, all targeted LBAs are checked for lock status.
    • An I/O is only submitted when all its LBAs are unlocked.
    • Upon command completion, the LBAs are unlocked.

    ⚠️ Note: LBA locks are valid within a single process. If multiple ioworker instances in different processes target overlapping LBAs, false verification failures may occur. Scripts should avoid LBA collisions or disable verification for stress tests.
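The semantic pattern from item 3 can be sketched in pure Python. This is an illustrative model only, not the driver's actual implementation; the struct layout, byte order, and field widths are assumptions:

```python
import struct

LBA_SIZE = 512

def build_pattern(nsid: int, lba: int, token: int, fill: int = 0) -> bytes:
    """Model of the pattern injected into each LBA of a write payload.

    First 8 bytes: an (NSID, LBA) tag, similar to Protection Information,
    so misdirected writes are detected on read-back.
    Last 8 bytes: an incrementing token acting as a data version number,
    so stale reads and out-of-order writes are detectable.
    """
    head = struct.pack("<II", nsid, lba)   # assumed field widths/byte order
    tail = struct.pack("<Q", token)
    body = bytes([fill] * (LBA_SIZE - len(head) - len(tail)))
    return head + body + tail

def check_tag(buf: bytes, nsid: int, lba: int) -> bool:
    """Detect address mix-ups: does the stored tag match where we read?"""
    return struct.unpack("<II", buf[:8]) == (nsid, lba)

# Two LBAs with identical fill bytes still produce distinct payloads,
# exactly the collision case that a CRC alone cannot distinguish.
b0 = build_pattern(nsid=1, lba=0, token=100)
b1 = build_pattern(nsid=1, lba=1, token=101)
assert b0 != b1
assert check_tag(b0, 1, 0) and not check_tag(b1, 1, 0)
```

In the real driver this injection happens at the buffer level, so the uniqueness guarantee holds even when a script writes identical payloads to every LBA.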

PyNVMe3’s data verification mechanism is thoughtfully engineered to ensure storage correctness without sacrificing performance or automation simplicity. Through an in-memory CRC system, injected semantic data patterns, and concurrency-safe LBA locking, PyNVMe3 offers transparent, high-speed, and robust validation for every I/O path.

Unlike traditional tools, PyNVMe3 makes data integrity checks the default—integrated directly into the driver layer, requiring no extra scripting effort. Its design supports persistence, stress handling, and protocol edge-case coverage, making it ideal for both functional validation and long-duration reliability testing.

In short, PyNVMe3 treats data consistency not as an optional feature, but as a first-class requirement—aligning with the core purpose of storage testing.

4. Extended Features

SPDK’s NVMe driver targets production environments and deliberately omits test-oriented instrumentation. PyNVMe3 adds three foundational modules to close that gap:

  1. cmdlog
    Records, per queue, the SQEs you submit and the CQEs you receive—each with precise timestamps. By preserving the full, ordered command history at the queue level, cmdlog accelerates root cause analysis for both script authors and firmware engineers (e.g., correlating a media error with preceding feature changes or format events).
  2. ioworker
    A high-performance, low-latency I/O generator—conceptually similar to fio, but scriptable via Python and tightly coupled to SPDK. ioworker lets you describe complex, dynamic workloads (rates, mixes, distribution of LBAs, queue depths, affinity) while extracting maximal performance from the user-space engine. It’s the preferred tool for sustained load, soak, and tail-latency studies.
  3. checksum
    Transparent data-integrity verification for tests. For each LBA, PyNVMe3 maintains a CRC in DRAM. On successful writes, the CRC is updated; on successful reads, a fresh CRC is computed and compared against DRAM. This catches integrity regressions immediately without cluttering test logic, enabling broad and continuous verification during long runs.
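The checksum flow above can be modeled in a few lines of Python. This is a toy sketch of the bookkeeping, not the real huge-page implementation, and crc8 below is a stand-in for whatever checksum the driver actually computes:

```python
LBA_SIZE = 512

def crc8(data: bytes) -> int:
    """Toy CRC-8 (poly 0x07) standing in for the driver's real checksum."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = ((crc << 1) ^ 0x07) & 0xFF if crc & 0x80 else (crc << 1) & 0xFF
    return crc

class CrcTable:
    """One CRC byte per 512-byte LBA, as in section 3.

    4TB / 512B = 8G LBAs, so a full table needs about 8GB of DRAM;
    nlba here plays the role of the nlba_verify limit.
    """
    def __init__(self, nlba: int):
        self.table = bytearray(nlba)

    def on_write(self, lba: int, data: bytes):
        self.table[lba] = crc8(data)          # successful write: refresh CRC

    def on_read(self, lba: int, data: bytes) -> bool:
        return self.table[lba] == crc8(data)  # recompute and compare

t = CrcTable(nlba=1024)
payload = bytes(range(256)) * 2               # one 512-byte LBA of data
t.on_write(0, payload)
assert t.on_read(0, payload)                  # clean read-back passes
bad = bytearray(payload)
bad[0] ^= 0xFF                                # single corrupted byte
assert not t.on_read(0, bytes(bad))           # mismatch is caught
```

Because a CRC is linear, any single-byte corruption is guaranteed to change the checksum, which is why the per-LBA table catches such regressions immediately.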

5. API Abstraction

PyNVMe3 packages SPDK’s capabilities into a small set of objects that mirror NVMe’s model:

Figure 3. PyNVMe3 Classes

  • Pcie: Represents the SSD as a PCIe device, providing interfaces for reading and writing to its configuration and memory space. This class is especially useful for early-stage SSD prototypes that do not yet support full NVMe functionality, enabling low-level inspection and execution of various PCIe-level reset operations. Users can create multiple Pcie objects for multiple DUTs.
  • Controller: Models the NVMe controller, capable of issuing and receiving all administrative commands. It provides the main interface for initializing the NVMe device and supports controller-level reset operations. It typically serves as the primary point of interaction for user test scripts. Users can create multiple Controller objects for multiple DUTs.
  • Namespace: Represents an NVMe namespace, enabling the submission and completion of I/O commands such as read, write, flush, etc. It encapsulates the logical storage space exposed to the host. Users can create multiple Namespace objects for multiple namespaces in the DUT.
  • Qpair: Encapsulates a one-to-one relationship between a Submission Queue and a Completion Queue. It is used in conjunction with a Namespace to submit and process I/O commands. It reflects the core I/O command path as defined by the NVMe specification. Users can create multiple Qpair objects.
  • Subsystem: Represents an NVMe subsystem, providing operations such as power on/off and subsystem-level reset. Users can provide callback functions to support new power control modules.
  • Others (Low-Level Objects): Include classes that model NVMe protocol data structures such as SQE (Submission Queue Entry), CQE (Completion Queue Entry), PRP (Physical Region Page), and PRPList. These are primarily used in metamode or advanced testing scenarios where direct manipulation of protocol-level fields is required.

6. Hello World

The Hello World example is the simplest PyNVMe3 test, yet it captures the entire flow of NVMe command submission and completion. It also serves as the best entry point to understand ns.cmd, the most fundamental way PyNVMe3 sends and reaps I/O.

import time
import pytest
import logging

import nvme as d

def test_hello_world(nvme0, nvme0n1, qpair, verify):
    # allocate DMA-capable buffers
    read_buf = d.Buffer(512)
    write_buf = d.Buffer(512)
    write_buf[10:21] = b'hello world'  # insert test pattern

    # callback triggered when the write completes
    def write_cb(cdw0, status):
        nvme0n1.read(qpair, read_buf, 0, 1)

    # issue the write; the callback submits the read
    nvme0n1.write(qpair, write_buf, 0, 1, cb=write_cb)

    # verify data pattern before and after
    assert read_buf[10:21] != b'hello world'
    qpair.waitdone(2)  # wait for both write and read to complete
    assert read_buf[10:21] == b'hello world'

At first glance, this script appears simple: it writes “hello world” into LBA0, then reads LBA0 back and checks the result. But under the surface, it demonstrates how ns.cmd embodies the asynchronous, queue-based nature of NVMe. When the script calls nvme0n1.write(), it does not block; instead, it places a submission queue entry into the qpair and immediately returns. The actual completion will only arrive later in the form of a CQE.

To handle this, the script attaches a callback function to the write. When PyNVMe3 reaps the write CQE, it triggers the callback, which issues a read command to the same LBA. This chaining of operations shows the natural way to express dependencies in an asynchronous system: the read does not wait in Python code; it is launched as soon as the device signals that the write has finished.

The line qpair.waitdone(2) is crucial. Since two commands have been submitted (the write and the read), the script tells PyNVMe3 to wait until both completions are collected. Unlike blocking system calls in traditional drivers, here waiting is explicit and deterministic: the script decides when and how many completions to wait for. This control makes it possible to synchronize precisely with the device, reproduce timing-sensitive bugs, or measure latency accurately.
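The submit/callback/waitdone mechanics can be modeled with a toy queue in plain Python (a sketch of the control flow only, not the PyNVMe3 internals):

```python
from collections import deque

class ToyQpair:
    """Toy model of asynchronous submission and explicit reaping.

    submit() returns immediately, like nvme0n1.write(); completions are
    only processed when the script calls waitdone(n), mirroring how
    PyNVMe3 makes waiting explicit and deterministic.
    """
    def __init__(self):
        self.pending = deque()   # commands in flight (FIFO in this toy model)

    def submit(self, op, cb=None):
        self.pending.append((op, cb))      # enqueue the SQE; do not block

    def waitdone(self, n: int):
        done = []
        for _ in range(n):
            op, cb = self.pending.popleft()  # reap one CQE
            done.append(op)
            if cb:
                cb()                         # chained command is issued here
        return done

q = ToyQpair()
# the write's callback submits the dependent read, as in Hello World
q.submit("write", cb=lambda: q.submit("read"))
assert len(q.pending) == 1        # only the write is queued so far
completed = q.waitdone(2)         # reap write, fire callback, reap read
assert completed == ["write", "read"]
```

The real driver reaps completions in whatever order the device posts them; the FIFO here only keeps the sketch short.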

Finally, the assertions before and after waitdone() highlight data movement in the test. With the verify fixture enabled, the driver computes and checks CRC checksums for every LBA, so the script does not need to implement data comparison logic.

This Hello World example demonstrates the essence of ns.cmd: commands are submitted to the device asynchronously, callbacks allow dependent operations to be chained, waitdone() provides deterministic synchronization, and data integrity is checked automatically in the background. From here, test authors can extend the same model to more complex scenarios—multiple queue pairs, mixed read/write workloads, or stress loops—but the fundamental mechanism remains unchanged.

ns.cmd is the fundamental building block of PyNVMe3, and Hello World is its simplest yet most illustrative demonstration. However, while ns.cmd is ideal for understanding the command flow and constructing fine-grained test logic, it becomes inefficient when used to generate large numbers of I/Os directly in Python. For sustained or batch workloads, the right tool is ioworker, which automates high-volume I/O generation with far greater efficiency.

7. ioworker

In PyNVMe3, ioworker is a built-in, high-performance I/O generator designed for producing synthetic NVMe workloads directly from Python scripts. It is conceptually similar to tools like fio, but implemented entirely on top of the SPDK user-space NVMe driver, and fully integrated into the PyNVMe3 API.

Unlike manually submitting I/Os via ns.cmd, which can be inefficient for high-volume workloads, ioworker autonomously issues and reclaims I/O commands according to user-defined parameters—maximizing both test development productivity and runtime performance.

7.1 Key Features

  • User-space speed: Leverages SPDK’s polling engine to bypass the kernel stack for extremely low latency and high throughput.
  • Simple Python API: Easily embedded into test scripts and CI environments.
  • Parameter-rich configuration: Control workload shape via io_size, qdepth, read_percentage, lba_random, iops, time, and more.
  • Built-in data verification: Seamless integration with PyNVMe3’s verify fixture for CRC-based integrity checks.
  • Automatic metrics: Returns IOPS, bandwidth, latency percentiles, and other statistics upon completion.

7.2 Basic Usage

The following script runs a 4KB-aligned random write workload for 2 seconds:

def test_ioworker(nvme0, nvme0n1, qpair):
    r = nvme0n1.ioworker(
        io_size=8,            # 4KB I/O (8 * 512B)
        lba_align=8,          # 4KB alignment
        lba_random=True,      # random access
        read_percentage=0,    # write-only workload
        time=2                # duration: 2 seconds
    ).start().close()

  • .start() launches the ioworker in a child process pinned to a CPU core.
  • .close() blocks until completion and returns workload statistics.

PyNVMe3 supports using ioworker with Python’s with statement to improve script readability and control:

with nvme0n1.ioworker(io_size=8, time=10) as w:
    while w.running:
        time.sleep(1)
        speed, iops = nvme0.iostat()
        logging.info("iostat: %dB/s, %d IOPS" % (speed, iops))

While the ioworker is running in the background, the main process remains responsive and can:

  • Monitor real-time I/O statistics
  • Send admin commands
  • Trigger resets, power cycles, or background operations

This asynchronous execution model is ideal for building stress tests, error recovery scenarios, and mixed workload tests.

7.3 Internal Design

Each ioworker instance:

  • Spawns a dedicated child process, pinned to a CPU core
  • Builds its own submission and completion queues
  • Executes I/O in tight polling loops for maximum throughput
  • Automatically tracks and reports performance data

ioworker is a core component of PyNVMe3’s design philosophy: combining raw performance with scriptable control. It replaces external tools like fio in most test workflows and unlocks precise workload generation, observability, and repeatability—entirely from Python. Whether for simple throughput checks or complex long-running tests, ioworker is the ideal tool for I/O generation at scale.

8. metamode

High-level interfaces like ns.cmd and ioworker provide excellent usability and performance by running through PyNVMe3’s NVMe driver (which itself wraps SPDK). They’re ideal for functional and performance testing. However, when you need to probe protocol corner cases or inject low-level errors that production drivers deliberately prevent, you need direct control of the queues and doorbells. That’s what metamode is for.

Figure 4. metamode: access the SSD in pure Python without the SPDK driver

metamode bypasses the PyNVMe3 NVMe driver and SPDK entirely, letting your script treat the SSD as a raw PCIe device. In metamode you build and manage the NVMe data path yourself:

  • Construct SQEs (Submission Queue Entries)—including fields like CID and PRP/SGL—then place them into the submission queue buffer.
  • Ring doorbells explicitly to notify the controller of new submissions.
  • Manage shared memory buffers via Buffer, PRP, and PRPList.
  • Inspect CQEs directly as they appear in the completion queue and drive your own completion handling logic.

This gives you protocol-level control to exercise behaviors and error conditions that are otherwise blocked by higher-level stacks.

8.1 Basic Example

The following test submits two write commands with the same CID—a condition disallowed by standard drivers but perfectly valid to test at the protocol level:

import nvme as d

def test_metamode_write(nvme0):
    # Construct Completion Queue
    cq = d.IOCQ(nvme0, qid=1, qsize=10, prp=d.PRP(10 * 16))

    # Construct Submission Queue and link it to the CQ
    sq = d.IOSQ(nvme0, qid=1, qsize=10, prp=d.PRP(10 * 64), cq=cq)

    # Manually construct an SQE; duplicate CID is intentional for error injection
    sqe = d.SQE(cid=(1 << 16), opc=1)  # 'opc=1' is typically WRITE

    # Place two identical SQEs and ring the SQ doorbell
    sq[0] = sqe
    sq[1] = sqe
    sq.tail = 2  # advance tail to notify hardware

This pattern is representative of metamode’s power: you control the exact SQE contents, queue pointers, and doorbells, and you observe raw completions from the CQ—enabling tests that are impossible through driver-mediated paths.
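One detail a metamode script must handle itself is the CQ phase tag. The pure-Python model below illustrates the protocol rule only (it is not the PyNVMe3 API): a CQE at the head is new when its phase bit matches the host's expected phase, and the expected phase inverts each time the head wraps past the end of the queue.

```python
class ToyCQ:
    """Pure-Python model of NVMe completion-queue phase-tag polling."""
    def __init__(self, qsize: int):
        self.entries = [None] * qsize   # CQE slots; None = never written
        self.head = 0                   # next slot the host will examine
        self.phase = 1                  # controller's first pass uses phase 1

    def controller_post(self, slot: int, cid: int, phase: int):
        """Stand-in for the controller DMA-writing a CQE into host memory."""
        self.entries[slot] = {"cid": cid, "p": phase}

    def poll_one(self):
        cqe = self.entries[self.head]
        if cqe is None or cqe["p"] != self.phase:
            return None                 # nothing new at the head yet
        self.head += 1
        if self.head == len(self.entries):
            self.head = 0
            self.phase ^= 1             # wrap: expected phase inverts
        return cqe

cq = ToyCQ(qsize=2)
assert cq.poll_one() is None            # empty queue: no completion
cq.controller_post(0, cid=1, phase=1)
assert cq.poll_one()["cid"] == 1        # phase matches: CQE is reaped
```

After reaping, a real metamode script would also write the new head to the CQ head doorbell, just as the write example above rings the SQ tail doorbell.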

8.2 When to use metamode

Use metamode when you must validate spec corner cases, perform error injection, or debug firmware behavior at the queue/doorbell level. For most functional coverage, ns.cmd is sufficient; for sustained or mixed workloads, ioworker delivers the best throughput and flexibility. When you need unfiltered, protocol-accurate access, metamode is the tool that closes the gap between the NVMe spec and an executable test.

8.3 Complicated SGL Chain

PyNVMe3 derives the SegmentSGL, LastSegmentSGL, DataBlockSGL, and BitBucketSGL classes from the Buffer class to construct the various SGL descriptors. The following script implements the SGL example given in the NVMe specification. The total data length in this example is 13KB (26 LBAs), divided into three SGL segments:

  • seg0: contains a 3KB block of memory and points to seg1;
  • seg1: contains a 4KB block of memory and a 2KB bit-bucket hole, and points to seg2;
  • seg2: the last segment, contains a 4KB block of memory.

The corresponding PyNVMe3 script, implemented in metamode for this complicated SGL test case, is as follows:

from nvme import IOCQ, IOSQ, PRP, SegmentSGL, LastSegmentSGL, DataBlockSGL, BitBucketSGL

def test_sgl_segment_example(nvme0, nvme0n1):
    sector_size = nvme0n1.sector_size
    cq = IOCQ(nvme0, 1, 10, PRP(10*16))
    sq = IOSQ(nvme0, 1, 10, PRP(10*64), cq=cq)

    # build the three-segment SGL chain described above
    sseg0 = SegmentSGL(16*2)
    sseg1 = SegmentSGL(16*3)
    sseg2 = LastSegmentSGL(16)
    sseg0[0] = DataBlockSGL(sector_size*6)   # 3KB data block
    sseg0[1] = sseg1                         # pointer to seg1
    sseg1[0] = DataBlockSGL(sector_size*8)   # 4KB data block
    sseg1[1] = BitBucketSGL(sector_size*4)   # 2KB hole, data discarded
    sseg1[2] = sseg2                         # pointer to the last segment
    sseg2[0] = DataBlockSGL(sector_size*8)   # 4KB data block

    # read 26 LBAs (13KB) through the SGL chain
    sq.read(cid=0, nsid=1, lba=100, lba_count=26, sgl=sseg0)
    sq.tail = sq.wpointer
    cq.waitdone(1)
    cq.head = cq.rpointer

    sq.delete()
    cq.delete()
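Because the descriptor chain is easy to miscount, here is a quick arithmetic check of the layout (assuming the 512-byte sectors used throughout this document):

```python
SECTOR = 512

# (kind, sectors) for each data-carrying descriptor in the example chain
chain = [
    ("data", 6),        # seg0: 3KB data block
    ("data", 8),        # seg1: 4KB data block
    ("bit_bucket", 4),  # seg1: 2KB hole (transferred but discarded)
    ("data", 8),        # seg2: 4KB data block, last segment
]

total_lbas = sum(n for _, n in chain)
stored_lbas = sum(n for kind, n in chain if kind == "data")

assert total_lbas == 26                    # matches lba_count=26 (13KB)
assert total_lbas * SECTOR == 13 * 1024    # 13KB total transfer
assert stored_lbas * SECTOR == 11 * 1024   # 11KB lands in host memory
```

The bit-bucket sectors are read by the device but never stored by the host, which is why only 11KB of the 13KB transfer reaches host memory.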

9. Hardware Support

PyNVMe3 is designed with broad hardware compatibility and integration in mind, enabling robust testing across a wide range of platforms and test environments.

9.1 Platform Compatibility

PyNVMe3 supports all mainstream x86-based systems, including those built with Intel and AMD processors. It is compatible with a variety of computing platforms such as:

  • Consumer laptops and desktops
  • Professional workstations
  • Enterprise-grade servers

This ensures that NVMe device testing can be performed across development, validation, and deployment environments with consistent behavior.

9.2 Power Fixtures

To support advanced test scenarios, PyNVMe3 provides seamless integration with industry-standard and custom power control systems:

  • Quarch PAM (Power Analysis Module): Full support for automated power cycling and control, enabling accurate and repeatable power-related testing.
  • Custom Power Switch Control: Users can register Python callbacks to integrate proprietary or in-house power control hardware into PyNVMe3 test scripts.
  • Power Consumption Sampling: Power telemetry data can be collected and analyzed during test execution, enabling validation of power states and efficiency.
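A custom power switch can be hooked in through plain Python callbacks, as noted for the Subsystem class in section 5. The sketch below uses a hypothetical relay object as a stand-in for in-house hardware; the callback names and signatures are illustrative assumptions, not the exact PyNVMe3 API:

```python
class FakeRelay:
    """Test double for a hypothetical in-house power switch."""
    def __init__(self):
        self.log = []
    def set(self, state: bool):
        self.log.append(state)

def make_power_callbacks(relay):
    """Build poweron/poweroff callbacks around any switch object.

    A test script would register callbacks like these with the Subsystem
    object to drive proprietary power control hardware.
    """
    def poweron():
        relay.set(True)    # real hardware would also wait for rails to settle
    def poweroff():
        relay.set(False)
    return poweron, poweroff

relay = FakeRelay()
poweron, poweroff = make_power_callbacks(relay)
poweroff()                 # a power-cycle test: off ...
poweron()                  # ... then back on
assert relay.log == [False, True]
```

Wrapping the hardware behind simple callables keeps test scripts portable across Quarch PAM and in-house switches alike.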

9.3 Management Fixtures

PyNVMe3 supports external communication interfaces used for out-of-band management:

  • TotalPhase Aardvark: Enables control over I²C/SMBus buses to inject low-level test sequences or commands.
  • Protocol Stack Extensions: Users can extend Python scripts to implement and validate full SMBus-based out-of-band management protocols, enabling comprehensive system-level testing.
  • VDM controller
  • I3C controller

9.4 High-Density Test Server

To facilitate large-scale and long-duration NVMe validation, PyNVMe3 is deployed on a custom-built 12-bay NVMe test server developed in-house. This platform is engineered for:

  • High concurrency
  • Enhanced thermal and power stability
  • Scalable test execution
  • Improved reliability during continuous stress testing

Figure 5. PyNVMe3 Test Server

This setup is particularly suited for qualification labs, regression testing environments, and production-level validation workflows.

10. Conclusion

PyNVMe3 delivers a modern, high-performance, and test-focused NVMe validation framework built on the strengths of SPDK while addressing its limitations in testability and extensibility. By exposing protocol-level control through a clean Python API, PyNVMe3 empowers developers, validation engineers, and firmware teams to create precise, deterministic, and reproducible test scenarios—from functional conformance and error injection to large-scale performance and soak testing. Its architecture combines:

  • User-space speed with kernel-free I/O for maximum throughput,
  • Protocol-level abstraction for intuitive NVMe modeling,
  • Advanced tooling such as cmdlog, ioworker, and checksum for observability and data integrity,
  • Flexible hardware integration, supporting power control, out-of-band interfaces, and dense test infrastructure.

Whether validating a single namespace or stress-testing an entire subsystem, PyNVMe3 provides the scalability, precision, and transparency needed for modern NVMe device development. It is not just a testing tool—it is a robust platform for building confidence in NVMe SSD quality, reliability, and performance.