Working with CUDA

Mike Perkins

We’ve recently been working with a cool technology that is rapidly penetrating scientific and engineering computing but seems little known otherwise: CUDA (Compute Unified Device Architecture). In a nutshell, it is an SDK that lets you run parallelizable, compute-intensive applications on your Nvidia graphics card instead of serially on your CPU.

CUDA is one of a number of emerging technologies that all enable more or less the same concept. A similar approach is behind OpenCL, which is backed by Apple and purports to be more cross-platform, in that it will eventually allow developers to write code for a range of GPUs, not just Nvidia’s. At the moment, however, OpenCL is limited to Macs running Apple’s new Snow Leopard, so I’m not sure that’s any more open than CUDA.

Our use of CUDA involves running a moderately intensive, frame-based image-processing algorithm on extremely high-definition images. The images are 3840×1080 monsters, and the goal is to process 24 of them per second, so it’s definitely not something you can accomplish on a CPU alone.

Our client in this case had originally implemented the image-processing algorithm in MATLAB, so our first task was to convert from MATLAB to C. The resulting C code was the basis for using Nvidia’s CUDA SDK to get the algorithm running on the Nvidia board. We also had to tie the entire framework into Microsoft’s DirectShow architecture, because the data is delivered to us as H.264-encoded video.

The parallelization comes through the GPU chip’s many stream processors—also called thread processors. For example, the GeForce 8 GPU has 128 stream processors. Each stream processor comprises 1,024 registers and a 32-bit floating-point unit. The stream processors are grouped on the chip into 16 clusters of eight. Each cluster shares a common 16 KB memory and is referred to as a “core.”

From a software perspective, each stream processor executes multiple threads, each of which has its own memory, stack, and registers. On the GeForce 8 GPU, each stream processor can handle 96 concurrent threads (for a total of 12,288), although this number is seldom reached in practice. Fortunately, programmers do not need to write explicitly threaded code, because a hardware thread manager handles threading automatically. The biggest challenge for the programmer is to decompose the data properly so that all 128 stream processors stay busy.
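To make the decomposition concrete, here is a minimal sketch of how a 3840×1080 frame might be carved into blocks of threads, one thread per pixel. The kernel name, the 16×16 tile size, and the brightness operation are all illustrative assumptions, not our client’s actual algorithm.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: each thread brightens one pixel of the frame.
__global__ void brighten(unsigned char *pixels, int width, int height, int delta)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {           // guard the right/bottom edges
        int idx = y * width + x;
        int v = pixels[idx] + delta;
        pixels[idx] = v > 255 ? 255 : (unsigned char)v;
    }
}

// Host-side launch: 16x16 = 256 threads per block, enough blocks to tile
// the whole 3840x1080 frame (240 x 68 blocks). The hardware schedules the
// blocks across the stream processors automatically.
//
//     dim3 block(16, 16);
//     dim3 grid((3840 + 15) / 16, (1080 + 15) / 16);
//     brighten<<<grid, block>>>(d_pixels, 3840, 1080, 10);
```

The programmer only specifies the grid and block dimensions; keeping the grid large relative to the number of cores is what keeps all the stream processors occupied.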

In an efficient decomposition, the data is subdivided into chunks that can be allocated to a core. Algorithms execute most efficiently when the data for the threads executing on a core can all be stored in the core’s local memory. Data is transferred back and forth between the host CPU and the GPU’s global memory (i.e., graphics device memory) via DMA. Data is then transferred back and forth between device memory and core memory as needed. An efficient implementation minimizes the number of device-to-core transfers.
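The host-to-device leg of that data flow can be sketched with the standard CUDA runtime calls. This is a generic outline under assumed names (`process_frame`, `d_buf`), not our actual pipeline:

```cuda
#include <cuda_runtime.h>

// Illustrative host-side flow: copy a frame to device (global) memory,
// run kernels on it, and copy the result back.
void process_frame(const float *host_in, float *host_out, size_t n)
{
    float *d_buf = NULL;
    size_t bytes = n * sizeof(float);

    cudaMalloc((void **)&d_buf, bytes);                         // allocate device memory
    cudaMemcpy(d_buf, host_in, bytes, cudaMemcpyHostToDevice);  // DMA host -> device

    // ... launch kernels that read and write d_buf ...

    cudaMemcpy(host_out, d_buf, bytes, cudaMemcpyDeviceToHost); // DMA device -> host
    cudaFree(d_buf);
}
```

The device-to-core (shared memory) transfers, by contrast, happen inside the kernels themselves, which is why they are the ones the programmer must actively minimize.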

In CUDA terminology, a block of threads runs on each core. While one thread block is processing its data, other thread blocks (running on other cores) perform identical operations in parallel on other data. Thread blocks are required to execute independently, and they must be executable in any order. To specify a thread block’s operations, programmers define C functions called kernels; the kernel function is executed by every thread in the block. Fortunately, there is specific CUDA support for synchronizing the threads within a block.
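The within-block synchronization mentioned above is the `__syncthreads()` barrier. The hypothetical kernel below shows the common pattern: each block stages its tile of data in the core’s shared memory, waits at the barrier until every thread has loaded its element, and then computes from the fast copy. The kernel name and the three-tap averaging filter are assumptions for illustration.

```cuda
#include <cuda_runtime.h>

#define TILE 256

// Hypothetical kernel: a 1-D box filter computed from a shared-memory tile.
__global__ void box_filter(const float *in, float *out, int n)
{
    __shared__ float tile[TILE];             // the core's per-block memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n)
        tile[threadIdx.x] = in[i];           // each thread loads one element
    __syncthreads();                         // barrier: wait for the whole tile

    if (i < n) {
        // For simplicity, neighbors that fall outside this block's tile
        // are clamped to the tile edge (a real filter would handle halos).
        float mid   = tile[threadIdx.x];
        float left  = (threadIdx.x > 0) ? tile[threadIdx.x - 1] : mid;
        float right = (threadIdx.x < blockDim.x - 1 && i + 1 < n)
                          ? tile[threadIdx.x + 1] : mid;
        out[i] = (left + mid + right) / 3.0f;
    }
}
```

Note that `__syncthreads()` only synchronizes threads within one block; the independence requirement means there is no comparable barrier across blocks during a kernel launch.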

We have enjoyed this project immensely and are excited by CUDA’s prospects. We look forward to using CUDA on other projects in the future. More information on CUDA can be found on Nvidia’s website. I found this article particularly useful as an introduction.

Mike Perkins, Ph.D., is a managing partner of Cardinal Peak and an expert in algorithm development for video and signal processing applications.
