
Scalable Performance Optimized RDMA Detailed Statistics

IP.com Disclosure Number: IPCOM000228893D
Publication Date: 2013-Jul-10
Document File: 7 page(s) / 95K

Publishing Venue

The IP.com Prior Art Database

Abstract

Statistics acquisition in highly concurrent systems is typically addressed by a best-effort software design pattern: speed takes precedence over accuracy. Said approach suffices for a macroscopic view of the product operation; however, it does not address the microscopic view necessary for SW serviceability/maintainability. We detail a mechanism whereby both speed and accuracy are achieved in a massively parallel application using RDMA protocols.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 49% of the total text.



For massively parallel, high-performance concurrent applications and use cases where statistics accuracy is required, no alternate method beyond the invention described herein can be found or envisioned to achieve these goals.

Statistics acquisition in highly concurrent systems is typically addressed by a best-effort software design pattern: speed takes precedence over accuracy. Said approach suffices for a macroscopic view of the product operation; however, it does not address the microscopic view necessary for SW serviceability/maintainability. A few examples are:


- Potential lack of accuracy in micro-benchmarks.
- Complications of low-level driver debugging using statistics.

Known solutions to this problem (if any):

Statistics acquisition models for NIC and FC adapters typically use a single statistics structure encompassing all relevant resources. At the device driver level, said structure tracks both HW- and SW-generated statistics.

At 1Gbps and 10Gbps NIC/FC speeds, this single structure is not a contention point, as the number of resources (i.e., parallel queues, engines) is quite small, typically 2-4 queues (Fig. 1). As line speeds increase to 40Gbps+ and 100Gbps+, the number of resources concurrently operating on the same statistics structure becomes a contention point, typically >=16 queues/engines (Fig. 2).

Most importantly, for both the NIC and FC models, the statistics are persistent memory. This means the structure containing the statistics persists for the lifetime of the driver which also matches the lifetime of the resources.

If the adapter is RDMA capable, then the number of concurrent queues is typically >2000 (Fig. 3). Clearly, if the previously discussed models cause contention at 16 queues, then the scale of RDMA queues requires a rethinking of the SW design patterns to achieve both speed and accuracy in statistics counters. Furthermore, RDMA resources are backed by dynamic memory. The resources are volatile, meaning they are constantly being created/destroyed. The protocol is largely analogous to sockets, where each instance comes and goes independently of the associated statistics.

The drawback of the aforementioned statistics solution is that a shared statistics structure for a chatty protocol such as RDMA will cause cache bouncing across the CPU complex as the number of cores/threads scales in concurrency. In a typical RDMA workload {HPC, DB, HFT}, if threads pseudo-concurrently access a statistics set within the same cache line (i.e., 128B), then a dirty-cache condition is exhibited. The resulting cache thrashing can result in system-wide performance impacts, especially on POWER* systems due to the NUMA-type architecture. A NUMA architecture has high costs for cache updates across the CPU complex; a sample cost from min->max is {Node, CEC, System}. This invention details a mechanism whereby highly concurrent IO devices such as RDMA...