

January 23, 2014

Steve DiBartolomeo
Applications Manager

Antonio Morawski
Director of Software Development



Introduction

We examine whether using a GPU to perform the bitmap computations required to extract rotated DMD frames from a rasterized bitmap significantly improves throughput compared with using only CPU-based operations. The computation involves loading a chunk of the large bitmap and extracting the pixels for each frame; which bits are extracted depends on the frame's size and rotation. Successive frames are offset in the scan direction by a user-specified number of pixels.

  [Figure: the DMD is scanned across the mask]
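The extraction step described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the implementation that was benchmarked: the function names `extract_frame` and `extract_frames`, the list-of-rows bitmap representation, rotation about the frame centre, and nearest-neighbour sampling are all choices made here for clarity.

```python
import math

def extract_frame(bitmap, cx, cy, width, height, angle_deg):
    """Sample a width x height frame centred at (cx, cy), rotated by angle_deg.

    bitmap is a list of rows, each a list of 0/1 pixels (an assumed layout).
    Pixels that fall outside the bitmap are returned as 0.
    """
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    rows, cols = len(bitmap), len(bitmap[0])
    frame = []
    for j in range(height):
        v = j - (height - 1) / 2.0          # frame row offset from the centre
        row = []
        for i in range(width):
            u = i - (width - 1) / 2.0       # frame column offset from the centre
            # Rotate the frame offset into bitmap coordinates, then pick
            # the nearest bitmap pixel.
            x = int(round(cx + u * cos_t - v * sin_t))
            y = int(round(cy + u * sin_t + v * cos_t))
            inside = 0 <= x < cols and 0 <= y < rows
            row.append(bitmap[y][x] if inside else 0)
        frame.append(row)
    return frame

def extract_frames(bitmap, start_x, cy, width, height, angle_deg, step, count):
    """Extract count frames, advancing the centre by step pixels in x per frame."""
    return [
        extract_frame(bitmap, start_x + k * step, cy, width, height, angle_deg)
        for k in range(count)
    ]
```

Every output pixel is computed independently of the others, which is what makes the problem a natural candidate for both multi-threaded CPU code and a GPU kernel.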

The Two Alternatives

We are going to compare two approaches.

  1. multi-core CPUs
  2. GPU

[Figure: CPU vs. GPU computation]


Hardware

We put together two separate workstations of different capabilities as shown below:

                   Workstation A             Workstation B
CPU                i7-2600 @ 3.4 GHz         i7-4770 @ 3.5 GHz
                   4 cores (32 nm)           4 cores (22 nm)
RAM                16 GB                     32 GB
GPU                GTX 550 Ti                GTX 660
                   NVIDIA compute            NVIDIA compute
                   capability 2.1            capability 3.0
  CUDA cores       192                       960
  Graphics clock   900 MHz                   980 MHz
  Memory clock     4.1 Gbps                  6.0 Gbps
  Memory           1024 MB GDDR5             2048 MB GDDR5
  Memory BW        98.4 GB/s                 144.2 GB/s


Frame Parameters

We used the following frame size and parameters for our extraction.

[Figure: frame parameters for the experiment]

Test Results


          Workstation A                      Workstation B
Frames    CPU          GPU         Ratio     CPU          GPU         Ratio
1000      3.70 sec     1.00 sec    3.7       3.11 sec     1.09 sec    2.84
1500      5.41 sec     1.39 sec    3.89      4.11 sec     1.39 sec    2.95
2000      7.57 sec     1.95 sec    3.87      6.11 sec     1.80 sec    3.40
2500      9.97 sec     2.3 sec     4.34      7.92 sec     2.30 sec    3.45
3000      11.32 sec    Fail        -         10.31 sec    2.89 sec    3.57
4000      -            -           -         14.41 sec    4.92 sec    2.93
5000      -            -           -         17.88 sec    Fail        -
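The Ratio column is the CPU time divided by the GPU time. The short script below recomputes it from the timings in the table and also converts each run into frames per second, which is the figure a mask writer vendor would most likely ask about. The timings are copied from the table; the names `RESULTS` and `summarize` are made up for this sketch.

```python
# (frames, cpu_seconds, gpu_seconds) copied from the results table.
# None marks a GPU run that failed.
RESULTS = {
    "A": [(1000, 3.70, 1.00), (1500, 5.41, 1.39), (2000, 7.57, 1.95),
          (2500, 9.97, 2.3), (3000, 11.32, None)],
    "B": [(1000, 3.11, 1.09), (1500, 4.11, 1.39), (2000, 6.11, 1.80),
          (2500, 7.92, 2.30), (3000, 10.31, 2.89), (4000, 14.41, 4.92),
          (5000, 17.88, None)],
}

def summarize(rows):
    """Return (frames, cpu_fps, gpu_fps, speedup) for each run."""
    out = []
    for frames, cpu, gpu in rows:
        cpu_fps = frames / cpu
        gpu_fps = frames / gpu if gpu is not None else None
        speedup = cpu / gpu if gpu is not None else None
        out.append((frames, cpu_fps, gpu_fps, speedup))
    return out

for name, rows in RESULTS.items():
    print("Workstation", name)
    for frames, cpu_fps, gpu_fps, speedup in summarize(rows):
        if gpu_fps is None:
            print(f"  {frames:5d} frames: CPU {cpu_fps:6.0f} fps, GPU failed")
        else:
            print(f"  {frames:5d} frames: CPU {cpu_fps:6.0f} fps, "
                  f"GPU {gpu_fps:6.0f} fps, speedup {speedup:.2f}x")
```

For example, the 1000-frame run on Workstation A works out to about 270 frames per second on the CPU and 1000 frames per second on the GPU.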


Notes and Comments

The GPU is not giving us an order-of-magnitude improvement in throughput over the multi-threaded CPU approach; the measured speedup ranges from roughly 2.8x to 4.3x. How much of this is due to fundamental limitations and how much is due to our relative inexperience with GPU coding is still unknown. More investigation will be done.

We are not sure why the GPU fails for larger numbers of frames. It might be due to our code, since there appears to be sufficient memory on the GPU to hold both the input bitmap and the resulting frame data.

These times do not include the additional time it would take to move the DMD frame results from memory out to the actual mask writer.

The faster and more capable GPU does not produce results much more quickly. This suggests to me that the limitation may be in transferring data into the GPU.
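The comparison between the two GPUs can be made explicit by dividing the Workstation A GPU time by the Workstation B GPU time for each frame count that both machines completed. The timings come from the results table; the names `GPU_A`, `GPU_B` and `gpu_time_ratio` are made up for this sketch.

```python
# GPU seconds per frame count, copied from the results table
# (failed runs are left out).
GPU_A = {1000: 1.00, 1500: 1.39, 2000: 1.95, 2500: 2.3}
GPU_B = {1000: 1.09, 1500: 1.39, 2000: 1.80, 2500: 2.30,
         3000: 2.89, 4000: 4.92}

def gpu_time_ratio(a, b):
    """Return time_a / time_b for every frame count present in both runs."""
    common = sorted(set(a) & set(b))
    return {n: a[n] / b[n] for n in common}

ratios = gpu_time_ratio(GPU_A, GPU_B)
for n, r in ratios.items():
    print(f"{n} frames: GTX 550 Ti time / GTX 660 time = {r:.2f}")
```

A ratio near 1 means the two cards finished in about the same time. Every ratio here falls within 10% of 1, even though the GTX 660 has five times as many CUDA cores.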

It would be good to get some feedback from mask writer equipment companies as to the required frame rate.





