Using a GPU to Create DMD Bitmaps

January 23, 2014

Steve DiBartolomeo
Applications Manager

Antonio Morawski
Director of Software Development

Introduction

We examine whether using a GPU to perform the bitmap computations required to extract rotated DMD frames from a rasterized bitmap significantly improves throughput compared to using only CPU-based operations. The computation involves loading a chunk of the large bitmap and extracting the pixels for each frame; which bits are extracted depends on the frame's size and rotation. Successive frames are offset in the scan direction by a user-specified number of pixels.
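The extraction described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the code that was benchmarked: the function names, the nearest-pixel sampling, the choice of x as the scan direction, and the rule that samples outside the bitmap read as 0 are all hypothetical.

```python
import math

def extract_frame(bitmap, origin_x, origin_y, width, height, angle_deg):
    """Sample a width x height frame from bitmap (a list of pixel rows).

    The frame's top-left corner sits at (origin_x, origin_y) and its axes
    are rotated by angle_deg relative to the bitmap axes. Nearest-pixel
    sampling and zero fill outside the bitmap are assumptions of this sketch.
    """
    cos_a = math.cos(math.radians(angle_deg))
    sin_a = math.sin(math.radians(angle_deg))
    rows, cols = len(bitmap), len(bitmap[0])
    frame = []
    for v in range(height):
        row = []
        for u in range(width):
            # Rotate the frame-local coordinate (u, v) into bitmap space.
            x = int(round(origin_x + u * cos_a - v * sin_a))
            y = int(round(origin_y + u * sin_a + v * cos_a))
            inside = 0 <= x < cols and 0 <= y < rows
            row.append(bitmap[y][x] if inside else 0)
        frame.append(row)
    return frame

def extract_frames(bitmap, count, step, width, height, angle_deg):
    """Extract count frames, advancing the origin by step pixels along x,
    which this sketch assumes to be the scan direction."""
    return [extract_frame(bitmap, i * step, 0, width, height, angle_deg)
            for i in range(count)]
```

Each output pixel depends only on its own (u, v) coordinate and the frame origin, so every pixel of every frame can be computed independently. That independence is what makes the problem a candidate for both a multi-threaded CPU implementation and a GPU kernel.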

The Two Alternatives

We compare two approaches:

Multi-core CPU
GPU

Hardware

We put together two separate workstations of different capabilities as shown below:

	Workstation A	Workstation B
CPU	i7-2600 @ 3.4 GHz, 4 cores (32 nm)	i7-4770 @ 3.5 GHz, 4 cores (22 nm)
RAM	16 GB	32 GB
GPU	NVIDIA GTX 550 Ti, Compute Capability 2.1	NVIDIA GTX 660, Compute Capability 3.0
CUDA Cores	192	960
Graphics Clock	900 MHz	980 MHz
Memory Clock	4.1 Gbps	6.0 Gbps
Memory	1024 MB GDDR5	2048 MB GDDR5
Memory Bandwidth	98.4 GB/s	144.2 GB/s

Frame Parameters

We used the following frame size and parameters for our extraction.

Test Results

Frames	Workstation A			Workstation B
	CPU	GPU	Ratio	CPU	GPU	Ratio
1000	3.70 sec	1.00 sec	3.7	3.11 sec	1.09 sec	2.84
1500	5.41 sec	1.39 sec	3.89	4.11 sec	1.39 sec	2.95
2000	7.57 sec	1.95 sec	3.87	6.11 sec	1.80 sec	3.40
2500	9.97 sec	2.3 sec	4.34	7.92 sec	2.30 sec	3.45
3000	11.32 sec	Fail		10.31 sec	2.89 sec	3.57
4000				14.41 sec	4.92 sec	2.93
5000				17.88 sec	Fail
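To relate these timings to a frame rate, the rows where both the CPU and GPU runs completed can be converted to frames per second. The snippet below is a small sketch of that arithmetic; the variable and function names are ours, and the numbers are copied from the table above.

```python
# (frames, cpu_seconds, gpu_seconds) copied from the Test Results table.
# Rows where the GPU run failed or was not measured are omitted.
WORKSTATION_A = [
    (1000, 3.70, 1.00), (1500, 5.41, 1.39),
    (2000, 7.57, 1.95), (2500, 9.97, 2.3),
]
WORKSTATION_B = [
    (1000, 3.11, 1.09), (1500, 4.11, 1.39), (2000, 6.11, 1.80),
    (2500, 7.92, 2.30), (3000, 10.31, 2.89), (4000, 14.41, 4.92),
]

def summarize(rows):
    """Return (frames, cpu_fps, gpu_fps, speedup) for each table row."""
    return [
        (frames, round(frames / cpu), round(frames / gpu), round(cpu / gpu, 2))
        for frames, cpu, gpu in rows
    ]

if __name__ == "__main__":
    for name, rows in (("A", WORKSTATION_A), ("B", WORKSTATION_B)):
        for frames, cpu_fps, gpu_fps, speedup in summarize(rows):
            print(f"Workstation {name}: {frames} frames, "
                  f"CPU {cpu_fps} fps, GPU {gpu_fps} fps, speedup {speedup}x")
```

For example, 1000 frames in 1.00 sec on the Workstation A GPU works out to 1000 frames per second. Speedups recomputed from the rounded times can differ from the table's Ratio column in the last digit.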

Notes and Comments

The GPU is not giving us an order-of-magnitude improvement in throughput over a multi-threaded CPU approach; the measured speedup ranges from about 2.8x to 4.3x. How much of this is due to fundamental limitations and how much is due to our relative inexperience with GPU coding is still unknown. More investigation will be done.

We are not sure why the GPU fails for larger numbers of frames. The cause may lie in our code, since the GPU appears to have sufficient memory to hold both the input bitmap and the resulting frame data.

These times do not include the additional time it would take to move the DMD frame results from memory out to the actual mask writer.

The faster and more capable GPU in Workstation B does not produce results much more quickly. This implies to me that the limiting factor may be the transfer of data into the GPU.

It would be good to get some feedback from mask writer equipment companies as to the required frame rate.