New Architecture for QISLIB-MT and ACSRaster
August 1, 2016
When optimizing the throughput of a complex process, a good rule of thumb is to "balance" the various bottlenecks. There's no point in speeding up a stage if the following stage (or earlier stage) remains as a choke point.
Your architecture should also minimize unnecessary "waiting." You don't want a process thread waiting for the results of a upstream process to complete.
For wafer/reticle inspection applications we know:
a) the client program has a large list of many small regions (or windows) that it wants to extract from the GDSII layout. We're talking about hundreds to tens-of-thousands small windows.
b) because the clip is small, a single thread of our rasterizer will process it quickly and efficiently; there's not need to apply a large number of concurrent threads to do the raster work.
For large input files, polygon extraction is relatively slower, we want to use as many threads as possible for this task.
Our programmers came up with the following architecture.
1. Windows to rasterize (clips) are generated by the client program and are evenly distributed among the N queues; where N is either the number of cores in the computer (if left to default) or is user specified.
2. Eeach thread holds its own queue of clips which it processes serially.
3. The next window ready for processing. Only one window is rasterized per queue at a time.
4. Each queue commands a single instance of the qislib-mt polygon exploder. Each exploder can have independent parameters for cell, nesting level depth and layers (in addition to window coordinates) and traverses the quad-tree independently of all other exploder instances.
5. Each queue commands a single instance of the rasterizer. In this implementation, each polygon received from the exploder is rasterized on-the-fly without any buffering or storage.
6. Each queue/thread allocates one raster buffer at a time (the buffer is expected to be quite small - a few MBytes) for the windows currently being processed.
7. Upon completion, a raster buffer holding the bitmap for a given window is passed back to the client program for post-processing, formatting or whatever other purpose the client intends.
This architecture uses very little additional memory for each additional thread. Instead, memory is available for holding the quad-tree data and the GDSII data because any disk I/O would immediately become the limiting bottleneck.
Notice that there is no Boolean operation or library in use. The rasterizer does not need the polygon data clipped to the window extents. Of course, for other applications one might need Boolean operations and we will investigate such flows when such applications are presented to us.