Building the AI-RAN Data Path in OCUDU: Inline GPU Processing for PRACH and SRS

By TOSSI · Jun 23, 2026

NVIDIA GPUDirect RDMA delivers fronthaul packets directly into GPU memory, with no staging copy in host memory and no extra host-to-device transfer across PCIe. Mean PRACH detection latency drops by about 70%, and SRS channel estimation runs more than 20× faster per UE at 256 UEs, all on a workstation-class RTX A4000.

Figure: Inline GPU architecture in OCUDU. Fronthaul packets are DMA-transferred by the NIC directly into GPU VRAM via GPUDirect RDMA, bypassing host CPU memory. PRACH and SRS workloads execute entirely on the GPU, and no host-memory copies occur in the accelerated path.

As the telecommunications industry moves toward AI-RAN and heterogeneous compute architectures, accelerating Layer 1 processing is becoming increasingly important. Modern Open RAN deployments are no longer limited to CPUs alone. GPUs, hardware accelerators, and specialized processing engines are being introduced to meet growing performance demands while enabling new classes of radio intelligence.

However, many acceleration approaches still retain a significant bottleneck: data movement. Even when signal processing executes on a GPU, radio samples often travel through CPU memory before being copied across PCIe into the accelerator. Computation becomes faster, but the data path remains constrained by unnecessary memory transfers between devices.

There is also a common perception that GPU-accelerated RAN requires expensive data-center-class hardware. In practice, that assumption can become a deployment barrier of its own. AI-RAN should not be limited to organizations with access to large-scale accelerator infrastructure.

With the latest OCUDU release, we introduce a different approach. Inline GPU Acceleration for PRACH Detection and SRS Channel Estimation, powered by NVIDIA GPUDirect RDMA, enables radio data to arrive directly into GPU memory where processing begins immediately. The implementation runs on widely available workstation-class GPUs, including the NVIDIA RTX A4000, while preserving a migration path toward larger accelerator platforms as deployment requirements evolve.

At TOSSI, our vision is to enable a modular, accelerator-aware Open RAN ecosystem where compute resources can be introduced without disrupting existing architectures. This release represents another important step toward that goal.

Why the AI-RAN Data Path Matters

The future of AI-RAN is not defined solely by faster processors or larger AI models. It is defined by how efficiently radio data reaches those accelerators.

Traditional architectures introduce multiple memory copies before processing can begin. Each transfer consumes bandwidth, adds latency, and limits scalability. As more PHY functions, AI models, and advanced radio algorithms move toward accelerators, the cost of moving data increasingly outweighs the cost of computation itself.

To fully realize AI-RAN, radio data must be delivered directly to the compute platform where intelligence resides. That is precisely the problem inline GPU processing is designed to solve.

The Evolution of RAN Acceleration

The journey toward accelerator-native RAN can be viewed in three stages.

Traditional CPU Processing

In conventional Open RAN deployments, radio packets arrive in host memory and the CPU performs decompression and all PHY processing in software. While flexible, CPU-only processing becomes increasingly expensive as bandwidth, cell capacity, and user density increase.

Accelerator Offload

Previous OCUDU releases introduced hardware acceleration through Intel ACC100 for LDPC encoding and decoding, and CUDA Graph-accelerated PRACH detection. These additions significantly reduced computation time. However, radio samples still passed through CPU memory before reaching the accelerator. The accelerator performed the work; the CPU still moved the data.

Inline GPU Processing

The latest release removes that dependency. Using NVIDIA GPUDirect RDMA, fronthaul packets arriving from the network interface card are DMA-transferred directly into GPU VRAM. The payload never enters CPU memory. The GPU processes data where it already resides, and only compact processing results are returned to the host. This transforms the GPU from a device that receives work into the location where work naturally begins.

Inline GPU Architecture in OCUDU

OCUDU now introduces a dedicated accelerated data path alongside the existing CPU processing pipeline. PRACH and SRS packets are delivered directly into GPU memory. A lightweight listener thread examines packet headers and dispatches workloads to the appropriate acceleration pipeline. The payload itself remains entirely resident on the GPU throughout processing.
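
The header-only dispatch described above can be sketched on the host side as follows. This is a minimal illustration, not OCUDU's implementation: the `PacketDescriptor` fields, the `Dispatcher` class, and the idea of classifying packets by a configured antenna-carrier ID are all assumptions made for the example. The point it shows is that the listener touches only header metadata and a device address, never the payload bytes.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class PacketDescriptor:
    """Host-visible metadata for one fronthaul packet (hypothetical layout)."""
    eaxc_id: int      # antenna-carrier id parsed from the packet header
    slot: int         # slot index parsed from the packet header
    device_ptr: int   # address of the payload in GPU memory, never read on the host
    length: int       # payload length in bytes


class Dispatcher:
    """Routes descriptors to per-workload queues by header fields only."""

    def __init__(self, prach_eaxc, srs_eaxc):
        self.routes = {}
        for eaxc in prach_eaxc:
            self.routes[eaxc] = "prach"
        for eaxc in srs_eaxc:
            self.routes[eaxc] = "srs"
        # "other" collects traffic that is outside the accelerated path.
        self.queues = {"prach": deque(), "srs": deque(), "other": deque()}

    def dispatch(self, desc):
        target = self.routes.get(desc.eaxc_id, "other")
        self.queues[target].append(desc)   # only the descriptor moves
        return target


dispatcher = Dispatcher(prach_eaxc={4, 5}, srs_eaxc={8})
for pkt in (PacketDescriptor(4, 0, 0x7F0000000000, 28),
            PacketDescriptor(8, 0, 0x7F0000001000, 28),
            PacketDescriptor(1, 0, 0x7F0000002000, 28)):
    print(pkt.eaxc_id, "->", dispatcher.dispatch(pkt))
```

Keeping the queue entries down to a few integers is what lets a single lightweight thread keep up: the cost per packet does not depend on payload size.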

Key characteristics of the architecture include:

Direct NIC-to-GPU data delivery using GPUDirect RDMA
Zero host-memory copies for accelerated workloads
GPU-resident radio sample processing
Selective acceleration through configuration
Full backward compatibility with existing CPU-based deployments

This enables operators and developers to adopt GPU acceleration incrementally while preserving deployment flexibility.

Inline PRACH Detection

PRACH represents the first interaction between a user equipment (UE) and the network. Detection must occur within strict timing constraints while maintaining high reliability under varying radio conditions.

The inline PRACH pipeline executes entirely on the GPU:

BFP9 decompression
Resource extraction
Symbol accumulation
Zadoff-Chu correlation
Batched cuFFT IDFT
Non-coherent combining
GLRT detection
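
As a host-side reference for the first stage in the list above, the sketch below decompresses one 9-bit block-floating-point PRB. It assumes the usual O-RAN fronthaul layout: one parameter byte whose low 4 bits carry a shared exponent, followed by 12 I/Q pairs of 9-bit two's-complement mantissas packed most-significant-bit first (27 bytes). The function names are hypothetical, and in OCUDU this stage runs as GPU code rather than Python.

```python
IQ_PER_PRB = 24  # 12 subcarriers x (I, Q)


def unpack_signed(payload, width, count):
    """Unpack `count` MSB-first two's-complement integers of `width` bits."""
    bits = int.from_bytes(payload, "big")
    total_bits = len(payload) * 8
    mask = (1 << width) - 1
    values = []
    for i in range(count):
        raw = (bits >> (total_bits - (i + 1) * width)) & mask
        if raw & (1 << (width - 1)):   # sign bit set, so the value is negative
            raw -= 1 << width
        values.append(raw)
    return values


def bfp_decompress_prb(prb, iq_width=9):
    """Return 12 complex samples: mantissa scaled by 2**exponent."""
    n_payload = IQ_PER_PRB * iq_width // 8   # 27 bytes for 9-bit mantissas
    if len(prb) < 1 + n_payload:
        raise ValueError("PRB buffer too short")
    exponent = prb[0] & 0x0F
    mant = unpack_signed(prb[1:1 + n_payload], iq_width, IQ_PER_PRB)
    scale = 1 << exponent
    return [complex(mant[2 * k] * scale, mant[2 * k + 1] * scale)
            for k in range(IQ_PER_PRB // 2)]


def bfp_pack_prb(mantissas, exponent, iq_width=9):
    """Inverse packing helper, used here only to build test input."""
    bits = 0
    for m in mantissas:
        bits = (bits << iq_width) | (m & ((1 << iq_width) - 1))
    n_payload = len(mantissas) * iq_width // 8
    return bytes([exponent & 0x0F]) + bits.to_bytes(n_payload, "big")


prb = bfp_pack_prb([255, -256, 1, -1] + [0] * 20, exponent=3)
print(len(prb), bfp_decompress_prb(prb)[:2])
```

Each PRB is independent, which is why this stage maps naturally onto one GPU thread block (or a few threads) per PRB.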

The CUDA Graph infrastructure introduced in previous releases remains intact. The difference is that samples now originate directly from GPU memory rather than arriving through a host-to-device copy, removing an entire stage from the processing chain.
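
The correlation, IDFT, combining, and detection stages can be illustrated with a small CPU reference. This is a sketch under stated assumptions, not the OCUDU kernels: it uses a length-139 Zadoff-Chu root, a naive DFT in place of batched cuFFT, and a simple peak-to-floor ratio as a stand-in for the GLRT statistic. The threshold value and all function names are illustrative.

```python
import cmath
import math

N_ZC = 139  # short-preamble length; prime, so every root 1..138 is valid


def zadoff_chu(root, n_zc=N_ZC):
    """Root sequence x_u(n) = exp(-j*pi*u*n*(n+1)/N)."""
    return [cmath.exp(-1j * math.pi * root * n * (n + 1) / n_zc)
            for n in range(n_zc)]


def dft(x, inverse=False):
    """Naive O(N^2) DFT/IDFT; stands in for a batched FFT call."""
    n = len(x)
    sign = 1.0 if inverse else -1.0
    tw = [cmath.exp(sign * 2j * math.pi * k / n) for k in range(n)]
    out = [sum(x[m] * tw[(k * m) % n] for m in range(n)) for k in range(n)]
    return [v / n for v in out] if inverse else out


def power_delay_profile(rx, ref_freq):
    """Frequency-domain correlation followed by an IDFT."""
    prod = [y * r.conjugate() for y, r in zip(dft(rx), ref_freq)]
    return [abs(c) ** 2 for c in dft(prod, inverse=True)]


def detect(rx_antennas, root, threshold=20.0):
    """Return (peak lag, detected flag) after non-coherent combining."""
    ref_freq = dft(zadoff_chu(root, len(rx_antennas[0])))
    profiles = [power_delay_profile(rx, ref_freq) for rx in rx_antennas]
    combined = [sum(col) for col in zip(*profiles)]  # sum power across antennas
    peak = max(range(len(combined)), key=combined.__getitem__)
    floor = (sum(combined) - combined[peak]) / (len(combined) - 1)
    metric = combined[peak] / floor if floor > 0 else math.inf
    return peak, metric > threshold


root_seq = zadoff_chu(25)
shift = 17
rx = [root_seq[(n - shift) % N_ZC] for n in range(N_ZC)]  # noise-free cyclic shift
print(detect([rx, rx], root=25))
```

In this noise-free example the peak lands on the applied cyclic shift, while correlating against a different root yields a flat profile and no detection. A production detector would estimate the noise floor more carefully and map the peak lag to a preamble index and timing estimate.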

Figure: The inline PRACH pipeline. Every stage, from BFP9 decompression, resource extraction, symbol accumulation, Zadoff-Chu correlation, batched cuFFT IDFT, and non-coherent combining through GLRT detection, executes on the GPU, with samples sourced directly from GPU VRAM via GPUDirect RDMA.

Metric	Inline GPU	CPU
Mean Latency	106.1 µs	351.6 µs
Detections	11,350	11,350

The inline GPU path reduces mean PRACH detection latency from 351.6 µs to 106.1 µs, a reduction of approximately 70% (about 3.3× faster), while producing the same number of detections (11,350) as the CPU path.

Inline SRS Channel Estimation

SRS processing presents a different challenge. Unlike PRACH, SRS shares uplink resources with user traffic and requires coordination with MAC scheduling information. The OCUDU GPU pipeline performs:

Least-squares channel estimation
Reference sequence correlation
Timing advance estimation
Phase compensation
Noise variance calculation
Wideband channel coefficient generation

Generated channel estimates are published through lock-free interfaces and combined with MAC-layer context before being consumed by higher PHY functions. As user density increases, batched CUDA Graph execution allows many SRS occasions to be processed through a single graph replay, significantly reducing per-user processing overhead.
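
The estimation steps listed above can be sketched for a single UE as a CPU reference. This is a simplified illustration rather than the OCUDU kernels: it assumes a single-tap (pure delay) channel, estimates the delay from the average phase rotation between adjacent pilots, and takes the noise variance as the residual around the wideband mean. The function names, the reference sequence, and the 60 kHz pilot spacing are all assumptions chosen for the example.

```python
import cmath
import math


def ls_channel_estimate(rx, ref):
    """Per-pilot least-squares estimate H[k] = Y[k] * conj(X[k]) / |X[k]|^2."""
    return [y * x.conjugate() / (abs(x) ** 2) for y, x in zip(rx, ref)]


def estimate_delay(h, pilot_spacing_hz):
    """Delay from the average phase rotation between adjacent pilots."""
    acc = sum(h[k + 1] * h[k].conjugate() for k in range(len(h) - 1))
    return -cmath.phase(acc) / (2.0 * math.pi * pilot_spacing_hz)


def compensate_phase(h, delay_s, pilot_spacing_hz):
    """Remove the linear phase ramp caused by the estimated delay."""
    return [c * cmath.exp(2j * math.pi * pilot_spacing_hz * k * delay_s)
            for k, c in enumerate(h)]


def wideband_and_noise(h):
    """Wideband coefficient (mean) and residual power around it."""
    mean = sum(h) / len(h)
    noise_var = sum(abs(c - mean) ** 2 for c in h) / len(h)
    return mean, noise_var


spacing_hz = 2 * 30e3          # assumed: comb-2 pilots at 30 kHz subcarrier spacing
gain, delay_s = 0.5 + 0.25j, 100e-9
ref = [cmath.exp(1j * math.pi * k * k / 8) for k in range(48)]  # unit-modulus pilots
rx = [gain * cmath.exp(-2j * math.pi * spacing_hz * k * delay_s) * x
      for k, x in enumerate(ref)]

h = ls_channel_estimate(rx, ref)
tau = estimate_delay(h, spacing_hz)
coeff, noise_var = wideband_and_noise(compensate_phase(h, tau, spacing_hz))
print(tau, coeff, noise_var)
```

Every step is an element-wise operation or a reduction over pilots, so many UEs can be stacked into one batch and processed in a single launch, which is the property the batched graph replay relies on.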

Figure: The inline SRS pipeline. BFP9 decompression, reference sequence correlation, least-squares channel estimation, timing advance estimation, phase compensation, noise variance calculation, and wideband channel coefficient generation all run GPU-resident. Results are published via lock-free interfaces to higher PHY functions.

For a 64-PRB allocation:

Number of UEs	CPU cost per UE	GPU cost per UE
1	13.5 µs	44.1 µs
16	13.5 µs	2.7 µs
64	13.5 µs	0.84 µs
256	13.5 µs	0.59 µs

At a single UE, the CPU is faster (13.5 µs versus 44.1 µs per UE). As the number of UEs grows, batching spreads the GPU's fixed cost across users: the per-UE cost falls to 2.7 µs at 16 UEs (5× faster than the CPU) and to 0.59 µs at 256 UEs (about 23× faster), while remaining comfortably within the 500 µs slot budget.
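
To make the scaling in the table concrete, the short script below recomputes the speedup per row and the total time per batch. It assumes the table entries are per-UE costs, so that total time is the number of UEs times the per-UE cost, and that CPU processing is serial on one core; neither assumption is stated in the table itself.

```python
SLOT_BUDGET_US = 500.0
CPU_US_PER_UE = 13.5
GPU_US_PER_UE = {1: 44.1, 16: 2.7, 64: 0.84, 256: 0.59}  # values from the table


def scaling_rows():
    """Speedup and total time for each UE count in the table."""
    rows = []
    for n_ue, gpu_us in GPU_US_PER_UE.items():
        rows.append({
            "ues": n_ue,
            "speedup": CPU_US_PER_UE / gpu_us,
            "cpu_total_us": n_ue * CPU_US_PER_UE,   # assumes serial CPU execution
            "gpu_total_us": n_ue * gpu_us,          # assumes cost is per UE
        })
    return rows


for row in scaling_rows():
    fits = row["gpu_total_us"] <= SLOT_BUDGET_US
    print(f'{row["ues"]:>4} UEs  speedup {row["speedup"]:5.2f}x  '
          f'CPU total {row["cpu_total_us"]:7.1f} us  '
          f'GPU total {row["gpu_total_us"]:6.1f} us  within slot: {fits}')
```

Under these assumptions the GPU total stays between roughly 43 µs and 54 µs from 1 to 64 UEs and reaches about 151 µs at 256 UEs, which suggests a largely fixed cost per batch. A serial CPU path would need about 864 µs at 64 UEs and 3,456 µs at 256 UEs.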

Figure: PRACH and SRS benchmark results on an NVIDIA RTX A4000. Mean PRACH detection latency drops from 351.6 µs (CPU) to 106.1 µs (inline GPU); SRS per-UE cost at 256 UEs drops from 13.5 µs to 0.59 µs.

Why This Matters Beyond PRACH and SRS

The significance of this work extends beyond PRACH and SRS. Inline GPU processing fundamentally changes where radio data lives. Once uplink samples arrive directly into GPU memory, additional processing stages can operate on the same data without introducing new movement costs.

This enables a future where:

Channel estimation runs on GPUs
Neural receivers operate directly on radio samples
Beamforming algorithms leverage AI accelerators
Learned PHY components integrate seamlessly into existing stacks

The challenge shifts from moving data to applying intelligence.

Roadmap

Inline GPU processing provides the foundation AI-RAN requires: an architecture capable of delivering radio data to accelerators efficiently, consistently, and at scale.

Combined with OCUDU's xFAPI-based vendor-neutral Layer 1 and Layer 2 interoperability framework, future PHY implementations can evolve independently while remaining compatible with existing Layer 2 deployments. The result is a practical pathway toward:

GPU-native Open RAN deployments
AI-assisted radio processing
Heterogeneous compute environments
Accelerator-aware network architectures
Open and interoperable AI-RAN ecosystems

This is not simply a performance optimization. It is infrastructure for the next generation of Open RAN innovation. With PRACH and SRS now operating directly on GPU-resident radio samples, the next phase is already underway — extending inline acceleration to additional Layer 1 functions and bringing AI-native radio processing closer to reality.

Availability

Inline GPU acceleration for PRACH and SRS is available today in OCUDU.

GitHub: OCUDU-RAN — hwacc_gpu_inline branch
Documentation: Inline GPU PRACH & SRS — TOSSI Docs

At TOSSI, our mission is to bridge open telecom ecosystems while enabling the technologies that will power future mobile networks. From xFAPI interoperability to ACC100 acceleration and now inline GPU processing, each milestone contributes to a common objective: building an open, modular, and accelerator-aware foundation for AI-RAN. The future of AI-RAN starts with the data path, and that journey continues with OCUDU.