PyNVMe3 Testing on Server

PyNVMe3 Testing on Server

Last Modified: August 1, 2024

Copyright © 2020-2024 GENG YUN Technology Pte. Ltd.
All Rights Reserved.

PyNVMe3 is a powerful and comprehensive NVMe SSD testing tool that sets rigorous standards for extreme performance and high reliability under heavy stress scenarios for enterprise SSDs. It also supports test engineers in developing customized test cases using its Python API. As a result, PyNVMe3 has been widely adopted in the industry, becoming a professional NVMe SSD testing software platform. Over the past year, we have further refined our software and developed an extensive library of test scripts for enterprise NVMe SSDs, based on our reference desktop and server testing platform.

1. Hardware Platform

While PyNVMe3 is a software-defined testing platform, we have also selected several hardware platforms for in-depth tuning. Users can choose the hardware platform that best suits their testing needs and enjoy the best user experience.

1.1 Desktop Testing Platform

During the R&D phase of NVMe SSDs, desktops are commonly used as the primary testing platform. Below is the configuration of our desktop testing platform:

  • Motherboard: Asus X670E ROG CROSSHAIR GENE webpage
  • CPU: AMD Ryzen™ 5 7600X
  • Memory: DDR5, 16GB x 2
  • OS Drive: 2.5″ SATA SSD

This CPU configuration offers excellent single-core performance and sufficient core count, supporting PCIe Gen5 and DDR5, meeting the requirements for enterprise NVMe SSD testing. Additionally, we have equipped it with two extra test fixtures, further enhancing the platform’s testing capabilities.

  1. PMU2 Power Module
    The green PCB board shown below provides many functions required for enterprise NVMe SSD testing:

    • AIC-U.2 adapter
    • Supports PCIe Gen5
    • Script-controlled main power switch
    • Script-controlled AUX power switch
    • SSD power consumption measurement
    • Compatible with Total Phase I2C module for MI testing
    • This power module also has an M.2 version supporting all the above functions
  2. Total Phase I2C Module

This hardware platform offers high cost-effectiveness, making it suitable for daily development and testing by R&D personnel. PyNVMe3 has developed numerous protocol, stress, power cycling, power consumption, and out-of-band management test scripts on this platform.

1.2 General Server Testing Platform

While desktop testing platforms are convenient for daily use by R&D personnel and cover a wide range of tests, their testing density and efficiency are not high. We can use general servers to improve the efficiency of routine testing, such as the Asus RS520A-E12-RS24U.

This 2U general server, equipped with an AMD CPU, supports PCIe Gen5 and can test up to 24 U.2 NVMe SSDs simultaneously, offering extremely high testing density and cost-effectiveness. The specific configuration is as follows:

  • CPU: 1 x AMD EPYC 9254 (24C/48T, 2.9GHz, 128MB L3 Cache)
  • Memory: 4 x 16GB DDR5-4800 ECC RDIMM
  • OS Drive: 2 x 2.5″ SATA SSD

General server platforms cannot connect to external test fixtures, so their testing coverage is not as extensive as desktops. In actual R&D processes, a combination of server and desktop testing platforms can balance testing efficiency and coverage.

PyNVMe3 customers can choose their own servers. Below are some selection recommendations:

  1. U.2 Slots: To improve cooling efficiency in actual tests, we can insert SSDs alternately, leaving more space for air cooling. Although the number of test drives is halved, the cooling efficiency of the test drives is improved. Each U.2 slot should support PCIe Gen5 and be directly connected to the CPU.
  2. CPU Cores: Users can allocate 2-4 CPU cores per DUT. PyNVMe3’s NVMe driver has been fully optimized, achieving over 60% higher single-core IOPS performance than the Linux kernel driver, significantly reducing the number of CPU cores required.
  3. Memory: PyNVMe3’s CRC table requires a large amount of DRAM. For example, an 8TB SSD with an LBA size of 512 bytes requires 16GB DRAM. It is recommended to equip 64GB DDR5 memory or more. Additionally, the number of memory channels is also crucial to ensure that memory access bandwidth does not become a bottleneck for SSD sequential read performance.

1.3 Customized Server Testing Platform

Based on our experience with the general server, we have developed a customized server testing platform in collaboration with our partners.

This customized server uses the Intel Xeon platform, supporting 12 PCIe Gen5 U.2 slots. The standard 3U chassis provides more space and better cooling, and it can easily control the fan speed via software to achieve a high-temperature testing environment.

2. PyNVMe3 Software Platform

For long-term read/write tests on many test drives, we recommend using servers to test multiple SSDs simultaneously, improving testing efficiency. If you need to test power cycling, power consumption data, and out-of-band management functions, we recommend using the desktop testing platform. PyNVMe3 can be deployed and used on any of the above hardware platforms with a unified experience. Please refer to the User’s Guide.

2.1 Server Platform Configuration

PyNVMe3 has been enhancing the user experience on server platforms. Users can specify the CPU core number, BDF address, and test cases for each test slot in the slot.conf file. As shown in the example below:

# slot_N = BDF, CPU, TESTS
slot_0  = 0000:6f:00.0,  0, ./scripts/production/01_normal_io_test.py::test_case1_16k_randrw_1day
slot_1  = 0000:70:00.0,  2, ./scripts/production/01_normal_io_test.py::test_case1_16k_randrw_1day
slot_2  = 0000:46:00.0,  4, ./scripts/production/01_normal_io_test.py::test_case1_16k_randrw_1day

With the slot.conf configuration file, we can simplify the test execution command line as follows:

make test slot=1

This command line will locate the BDF address of the test drive, the corresponding CPU core number, and the test case to be executed based on the slot.conf configuration. We can even use the following command line to start tests on 12 SSDs simultaneously in the background tasks.

for slot in {0..11}; do nohup make test slot=$slot &; done;

Since starting tests on 12 drives simultaneously will create a large number of log files in the results directory, we recommend clearing the results directory before testing.

2.2 WebUI

When testing a single drive, we can monitor the log files to get the current test status. However, when testing many SSDs on multiple platforms, this method becomes inefficient. Therefore, we provide a professional WebUI to get the test status, performance, and other information of a specified platform and slot in the test cluster, as shown below.

This WebUI offers several significant benefits:

  1. Real-Time Monitoring:
    • Through the WebUI, users can monitor all active DUTs (Devices Under Test) in real-time, including their slot numbers, BDF (Bus, Device, Function) addresses, and model names. This allows users to quickly and easily obtain the current status and location information of each SSD, eliminating the need for manual searching and recording.
  2. Performance Charts:
    • The WebUI provides detailed performance charts showing IOPS (Input/Output Operations Per Second) and throughput (in MB/s or GB/s) over a specified time range. Users can use these charts to rapidly assess the performance of SSDs and identify potential performance bottlenecks or anomalies.
  3. Comparison Feature:
    • The WebUI allows users to select and compare the performance of two DUTs. This is particularly useful for evaluating performance differences between different SSD models or the same model under different test conditions.
  4. Detailed Information:
    • In addition to performance metrics, the WebUI provides other detailed information about the selected DUT, such as firmware version, temperature, and health status. This enables users to gain a comprehensive understanding of the operating condition and potential issues of each SSD.
  5. Custom Monitoring Functions:
    • The WebUI allows users to define their own monitoring functions. PyNVMe3 will periodically collect data based on these custom functions and display it on the WebUI. This feature enables users to tailor the monitoring process to their specific needs and preferences, ensuring that the most relevant data is always at their fingertips.
  6. Cluster-Wide Monitoring:
    • The WebUI allows users to specify which machines in the entire test cluster to monitor. These machines can be servers, desktops, or even laptops. This flexibility ensures that users can centralize the monitoring and management of all relevant devices, regardless of their type or location within the cluster.
  7. No Impact on Existing Tests:
    • Importantly, the addition of this WebUI does not impact the functionality or performance of existing tests. Users can continue to run their tests as usual, with the added benefit of enhanced monitoring and data visualization capabilities.

By utilizing this professional WebUI, users can significantly improve the efficiency and accuracy of SSD testing, reduce the time spent on manual operations and monitoring, and ensure comprehensive and visualized test data. This is especially important in environments where many SSDs are tested across multiple platforms, aiding in centralized management and optimization of the testing process.

3. Enterprise SSD Test Scripts

On the above software and hardware testing platforms, we have developed numerous enterprise SSD test scripts. Customers can directly test these scripts or use them as references to develop their own test script libraries. The following scripts can be found in the PyNVMe3/scripts directory.

Test Category Script Directory Platform Requirements Test Duration Overview
Basic Protocol conformance Desktop Platform 1-2 hours Various protocol and function tests, including PCIe, NVMe, TCG, OCP, etc.
Product Testing production Server and Desktop Platforms 1-2 weeks Long-term, high-stress test cases for verifying product performance and reliability when the product is relatively stable.
Out-of-Band Management management Desktop Platform 1-2 hours Comprehensive out-of-band management test scripts.
Other Features features Desktop Platform 1-10 minutes Various enterprise features test script examples as a basis for secondary development by users, including multi-Namespace, SGL, PI, ZNS, SRIOV, CMB, MSIx, etc.

Some test cases require controlling the power of DUT, so we need to use the power module test fixtures. If the test platform does not have the power module installed, the test will be skipped. Out-of-band management tests require the Total Phase I2C adapter. These tests need to be performed on the aforementioned desktop testing platform.

3.1 Product Testing

With the server testing platform, we can significantly increase the scale and duration of the tests. We refer to this type of large-scale, long-duration testing as “Product Testing.” Product testing focuses on checking performance and reliability, including:

  1. Simulation of various common application scenarios
  2. Various extreme usage patterns targeting NAND characteristics and limitations
  3. High-stress tests involving a mix of different IO types
  4. Testing various FTL scenarios such as wear leveling, garbage collection, error handling, etc.
  5. Power cycling tests
  6. JEDEC workload tests

Each product test file requires 1-3 weeks of testing duration, but we can increase the level of concurrency by adding more DUTs, executing different test cases simultaneously. It is also possible to test SSDs from multiple vendors on the same server and compare the results.

3.2 Testing Progress

Here is a template of our testing progress.

Suite Test Status
Conformance
Admin
IO PASS
Reset PASS
Power
Register PASS
Controller PASS
TCG
Production
01_normal_io_test.py PASS
02_mix_io_test.py
03_data_model_test.py PASS
04_trim_format_test.py
05_small_range_test.py
06_jesd_workload_test.py
07_power_cycle_test.py PASS
08_io_stress_test.py
09_wl_stress_test.py
Management
01_mi_inband_test.py
02_basic_mgmt_cmd_test.py
03_mi_cmd_set_test.py
04_mi_admin_cmd_test.py
05_mi_control_primitive_test.py
06_mi_pcie_cmd_test.py
07_mi_feature_test.py
08_mi_error_inject_test.py
09_mi_stress_test.py
Features
CMB N/A
MPS
MSIx PASS
Metadata PASS
PI PASS
NS PASS
SGL N/A

The PyNVMe3 testing platform offers a comprehensive solution for enterprise SSD evaluations, combining robust software tools with versatile hardware support. Its extensive test scripts and professional WebUI enhance efficiency and precision in SSD testing, ensuring products meet the demanding standards of enterprise environments. This platform is integral for developers and engineers aiming to rigorously assess SSD performance and reliability.