Pioneering Work in CUDA: GPU-based Video Processing

During my time as a working student at Dallmeier electronic, I had the opportunity to be among the first to explore NVIDIA CUDA for professional video surveillance technology. At a time when general-purpose GPU computing (GPGPU) was still in its infancy in the industry, I conducted the first tests and documented the technology for use in video processing.

The Challenge: Massively Parallel Processing

A conventional CPU reached its performance limits when decoding many high-resolution video streams (e.g., H.264) simultaneously. The GPU architecture offered a decisive advantage here: instead of a few complex cores, it has hundreds of simple processing units that are well suited to processing pixel data in parallel.

Implementation of the GPU Decoder

As part of my research work, I developed a framework for efficient video decoding that builds directly on the low-level NVIDIA API.

Core Components of the Architecture:

VideoSource & Parser: Splits the bitstream into processable packets. Because the standard NVIDIA API was primarily designed for files at the time, I developed a custom dalliVideoSource so that network streams could be processed with low latency.
Decoder Engine: Uses the dedicated hardware decoder on the GPU (NVDEC).
Postprocessing & Color Space: CUDA kernels perform fast color space conversion (NV12 to RGB/ARGB) directly in graphics memory, which avoids expensive copy operations between GPU and CPU.
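To illustrate the color space step, here is a minimal CPU reference sketch in C++ of an NV12 to ARGB conversion. It is not the original kernel code. It assumes BT.601 limited-range coefficients in fixed-point form and a single row pitch shared by both planes. The names yuvToArgb and nv12ToArgb are hypothetical. In a CUDA kernel, the body of the inner loop would typically run once per thread, with x and y derived from the block and thread indices.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Clamp an intermediate result to the 8-bit range [0, 255].
inline std::uint32_t clamp8(int v) {
    return static_cast<std::uint32_t>(std::min(255, std::max(0, v)));
}

// Convert one Y/U/V sample triple to packed ARGB (0xAARRGGBB).
// Assumption: BT.601 limited range (Y in 16..235), 8-bit fixed-point math.
inline std::uint32_t yuvToArgb(std::uint8_t y, std::uint8_t u, std::uint8_t v) {
    const int c = static_cast<int>(y) - 16;
    const int d = static_cast<int>(u) - 128;
    const int e = static_cast<int>(v) - 128;
    const std::uint32_t r = clamp8((298 * c + 409 * e + 128) / 256);
    const std::uint32_t g = clamp8((298 * c - 100 * d - 208 * e + 128) / 256);
    const std::uint32_t b = clamp8((298 * c + 516 * d + 128) / 256);
    return 0xFF000000u | (r << 16) | (g << 8) | b;
}

// Convert a whole NV12 frame to ARGB.
// NV12 layout: a full-resolution Y plane followed by one half-resolution
// plane of interleaved U/V bytes. Assumes even width and height, and the
// same pitch (bytes per row) for both planes.
std::vector<std::uint32_t> nv12ToArgb(const std::uint8_t* nv12,
                                      int width, int height,
                                      std::size_t pitch) {
    std::vector<std::uint32_t> argb(static_cast<std::size_t>(width) *
                                    static_cast<std::size_t>(height));
    const std::uint8_t* lumaPlane = nv12;
    const std::uint8_t* chromaPlane = nv12 + pitch * static_cast<std::size_t>(height);

    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            // Each 2x2 block of pixels shares one U/V pair.
            const std::uint8_t luma = lumaPlane[static_cast<std::size_t>(y) * pitch + x];
            const std::uint8_t* uv = chromaPlane
                + static_cast<std::size_t>(y / 2) * pitch
                + static_cast<std::size_t>(x / 2) * 2;
            argb[static_cast<std::size_t>(y) * width + x] = yuvToArgb(luma, uv[0], uv[1]);
        }
    }
    return argb;
}
```

Every output pixel depends only on its own luma sample and the chroma pair of its 2x2 block, so no pixel waits on another. That independence is what makes the conversion a natural fit for hundreds of GPU threads.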

Performance & Benchmarks

The results of my investigations laid the foundation for future Dallmeier products:

Latency: Pure decoding time per frame was reduced to an average of 4 ms.
Resolution: Initial tests with resolutions up to 4K (4080 x 2030) demonstrated the scalability of the approach.
Zero-Copy Design: Through Direct3D/OpenGL interoperability, images could be displayed directly from graphics memory without burdening the main processor.
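To put the 4 ms figure in perspective, the following back-of-the-envelope sketch in C++ turns a per-frame decode time into a rough throughput budget. It assumes that frames are decoded strictly one after another and that decoding is the only cost. The 25 fps stream rate used in the example is a hypothetical value, not a measured one, and both function names are made up for illustration.

```cpp
// Upper bound on decoded frames per second when frames are processed
// serially and each takes decodeMsPerFrame milliseconds.
constexpr double framesPerSecond(double decodeMsPerFrame) {
    return 1000.0 / decodeMsPerFrame;
}

// Number of whole streams at streamFps that fit into that budget.
// Assumption: no overhead besides decoding (no parsing, copying, display).
constexpr int maxStreams(double decodeMsPerFrame, double streamFps) {
    return static_cast<int>(framesPerSecond(decodeMsPerFrame) / streamFps);
}
```

Under these assumptions, framesPerSecond(4.0) yields 250 frames per second, and maxStreams(4.0, 25.0) yields 10 parallel streams. Real capacity depends on resolution, bitrate, and everything else in the pipeline.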

Conclusion

This work was pioneering in a field that is now standard in modern video technology and AI-supported image analysis. It combined low-level C++ programming with the mathematical foundations of massively parallel computing.