University of Washington Mediaprocessor Computing Library (UW-MPCL)

Approaches to Implement University of Washington
Mediaprocessor Computing Library (UW-MPCL)
Image Computing Systems Laboratory
University of Washington
Seattle, WA 98195-2500

1. Introduction

This document describes the approaches and issues in implementing the University of Washington Mediaprocessor Computing Library (UW-MPCL) for each mediaprocessor supported by the University of Washington Mediaprocessor User Consortium. We have selected the following 12 functions to be included in the UW-MPCL:

1. invert8: 8-bit image invert
2. add8: add two 8-bit images
3. bin_dilate: dilate an image with a morphology structuring element
4. flip8_x: flip an 8-bit input image vertically (around the x axis)
5. flip8_y: flip an 8-bit input image horizontally (around the y axis)
6. transpose8: transpose an 8-bit input image
7. affine8: affine warping
8. conv8: 2D convolution with various kernel sizes
9. cfft: 2D complex fast Fourier transform (FFT)
10. dwt4: D4 wavelet transform
11. histo8: histogram generation for 8-bit input data
12. idct8x8: 8 x8 inverse discrete cosine transform (IDCT)
From our experience in developing two previous image computing libraries, we have found that the library users want the functions to be generic and flexible so that these functions can be used in various applications. To fulfill this goal, the image computing library functions need to support different image sizes. Therefore, we will try to implement most functions in the UW-MPCL to work on any image size. However, it is generally true that supporting arbitrary image sizes and removing the granularity limitation also introduce, in addition to the additional programming effort, some overhead in the data flow and tight loop code, which results in lower performance. In case an application requires only a single image size, the library users can modify the code and remove the support for arbitrary image sizes to obtain better performance.
The execution time of the UW-MPCL functions will include the following: (1) time to read the source image data from the external memory, (2) time to process the data according to the image computing function by the mediaprocessor, and then (3) time to store the processed results back to the external memory, i.e., the execution time includes the necessary input and output activities as well as pure processing. Our approach provides the performance numbers that can be realistically expected when the mediaprocessor is utilized in the actual systems with external memory. This is in contrast to some existing benchmark performance numbers handling this issue lightly, e.g., assuming that all data are present on-chip, thus the cache miss penalty and memory access latency tend to be ignored.
The functions will go through upgrades and updates when newer tools (compiler and simulator/software development board) and/or new versions of the mediaprocessor are made available. For effective and high-quality function development, the mediaprocessor chip vendors are expected to provide the development tools early to the UW Image Computing Systems Laboratory (ICSL) and initial training of several ICSL engineers/programmers.
All functions in the UW-MPCL are subject to detailed design reviews, white-box testing, and black-box testing. Prior to the implementation of any new UW-MPCL function, the high-level design of each new function is reviewed and critiqued during regular design review meetings. Once the design is approved and the function is coded, the programmers test their functions (white-box testing) under various conditions on the simulator or hardware and validate the outputs of the function by using the verification utilities of each function. Finally, at least one UW ICSL member is designated as a quality control person who tests each function under various conditions and reports any problems and suggestions to the function's author for corrections and improvements. Details of the algorithm verification procedure are documented in the Quality Assurance Guide.
An html file for each function will be provided to describe the algorithm specifics, synopsis, restrictions, function parameters, how to compile and run, performance and its condition, and restrictions.

2. 2D Block Transfer DMA Mode and Data Cache

Mediaprocessors are designed to provide high performance by utilizing both instruction-level parallelism and data-level parallelism. This increasing computing power makes how to transfer the necessary data between off-chip and on-chip memories without slowing down the core processor's performance a real challenge. Two methods, i.e., data cache and direct memory access (DMA), have been used to address this issue. Data cache utilizes the temporal and spatial locality of the accessed data while a DMA controller can transfer the data independently of the core processor.
The most frequently used DMA mode in image computing is the 2D block transfer even though there exist other DMA transfer modes, e.g., 3D block transfer [1] and guided transfer [2]. When an entire input image does not fit on-chip, the input image is transferred and processed in small blocks, where the 2D block transfer DMA is used to move each block between the off-chip and on-chip memories. Moreover, in order to keep the core processor from waiting for the data to arrive in the on-chip memory, the DMA controller can be used to manage the data movements concurrently with the processor's computation. This technique is illustrated in Figure 1 and is commonly known as double buffering. Four buffers, two for input blocks (ping_in_buffer and pong_in_buffer) and two for output blocks (ping_out_buffer and pong_out_buffer), are allocated in the on-chip memory. While the core processor computes on a current image block (e.g., block #2) from pong_in_buffer and stores the result in pong_out_buffer, the DMA controller moves the previously-calculated output block (e.g., block #1) in ping_out_buffer to the external memory and brings the next input block (e.g., block #3) from the external memory into ping_in_buffer. When the computation and data movements are both completed, the core processor and the DMA controller switch buffers with the core processor starting to use the ping buffers and the DMA controller working on the pong buffers.
Figure 1. Double buffering the data flow using a DMA controller.


If the mediaprocessor does not have a DMA controller but has data cache, there are several programming techniques to use data cache efficiently, e.g., loop fusion, loop interchange, and blocking [3]. In particular, blocking lets the program to process the image frame block-by-block, increasing the memory access locality. Blocking is useful when large 2D data are accessed in column-major order. Figure 2 illustrates the blocking technique used for image transposition as an example. In Figure 2(a), the transposition of a 2D image is performed as the input image is accessed row by row and each row is stored in column-major order. Note that each store operation requires the entire cache line to be loaded in the data cache before being overwritten (we assume write-allocate cache). If the data cache is smaller than N ´cache line size, excessive cache misses could occur as more columns are accessed. Figure 2(b) shows the blocking mechanism. With blocking, the input image is accessed in 2D blocks and transposed, and the results are stored in the output image. By setting the size of two 2D blocks (one for the input image and another for the output image) such that they fit in the data cache, we can achieve better cache utilization, thus obtain higher performance.

Figure 2. Image transposition using data cache.

3. Image Computing Functions

In this section, we describe several UW-MPCL image computing functions in detail.

3.1 2D convolution

In convolution, each output pixel is computed to be a weighted average of several neighboring input pixels. In the simplest form, generalized 2D convolution of an N ´N input image with an M ´ M convolution kernel is defined as
(1)


where f is the input image, h is the kernel, s is the scaling factor, and b is the convolved output image. How to optimally map this algorithm on mediaprocessors is discussed in [4]. The data access pattern for convolution is quite regular. For example, for convolution with a 3 ´ 3 kernel on the current pixel (x5), eight neighboring pixels (x1 through x9) as shown in Figure 3 are required. Also, to compute for the next adjacent pixel (x8), it requires a new set of eight neighboring pixels (x4 through x12). It is obvious that both the 2D block transfer DMA and cache-based approaches would work well in moving the data between the off-chip and on-chip memories.
Most DSPs provide multiply-accumulate (MAC) instructions to speed up the convolution operations. However, some recent mediaprocessors feature partitioned MAC and/or inner-product instructions. Figure 4 shows an example inner-product instruction. When computing Eq. (1) using partitioned MAC or inner-product instructions, the convolution kernel coefficients are usually stored in the registers. However, if the kernel width is greater than the number of partitions in the instruction, then the kernel can be subdivided into several sections and the tight loop is iterated multiple times, while accumulating the multiplication results.
Figure 3. Convolution data access pattern.
 
 
Figure 4. Inner-product instruction with 64-bit registers split into eight 8-bit partitions.
 

3.2 Affine warp

Affine warp is a very useful subset of image warping algorithms. It redefines the spatial relationship between points in the input image to generate the resultant output image while preserving parallel lines and equispacing between image points. The example affine warp transformations include rotation, scaling, shearing, flipping and translation. These transformations can be applied individually or in any combination as shown in Figure 5.


Original image
(a)The image has been shrunk in half (b)The image has been rotated by 87.
the x direction and also horizontally
sheared by 30°.
Figure 5. Examples of affine warp.


A mathematical equation for affine warp relating the output image to the input image (called inverse mapping) is shown in Eq. (2), where and are the discrete output image locations, and are the inverse-mapped input locations, and - are the six affine warp coefficients. .



(2)


For each discrete pixel in the output image, an inverse transformation with Eq. (2) results in a non-discrete sub-pixel location within the input image, from which the output pixel value is computed. To determine the gray-level output value at this non-discrete location, some form of interpolation, e.g., bilinear, has to be performed using the pixels around the mapped location. The interpolation can utilize MAC or inner-product instructions. In order to use the inner-product instructions, interpolation weighting factors need to be computed and pixels have to be rearranged so that the multiplication of the pixel values with their proper interpolation coefficients can be performed. This rearrangement of pixels for interpolation can be facilitated if the mediaprocessor supports partitioned and flexible data movements within a register. Computation of the weighting factors could use partitioned-multiply instructions. The detailed description of affine warp and its implementation on mediaprocessors can be found elsewhere [5].
While utilizing DMA, the output image can be segmented into multiple blocks, an example of which is illustrated in Figure 6. A given output block maps to a quadrilateral in the input space. The rectangular bounding block encompassing each quadrilateral is shown by the solid line in the input image. If the bounding block contains no source pixels (case 3 in Figure 6), the DMA controller could be programmed to write zeros in the output image block without bringing any input pixels on-chip. If the bounding block is partially filled (case 2 in Figure 6), then the DMA controller could be programmed to bring only valid input pixels on-chip and fill the rest of the output image block with zeros. If the bounding block is completely filled with valid input pixels (case 1 in Figure 6), the whole block is brought on-chip by the DMA controller. In addition, these data movements are double-buffered so that the data transfer time can be overlapped with the processing time.
With a cache-based affine warp implementation to compute output pixels in row-major order, the data cache behavior is largely dependent on the warping parameters. For example, the input image will be accessed in raster-scan order for the warping shown in Figure 5(a), while it is accessed in (almost) column-major order for the warping in Figure 5(b). If a mapped pixel location in the input image lies outside the boundary, the program should fill the corresponding output pixel with a zero value. The warping in Figure 5(b) might cause a large number of cache misses since each access to the input pixel requires the entire cache line to be filled and it needs to access the pixels in (almost) column-major order.
To overcome this problem, the program can be reorganized to produce output pixels block-by-block, using the blocking technique described in Section 2. If a 2D block of output pixels is generated at a time, the memory access for the input image will show better spatial locality. The memory access pattern in this case is similar to that of the DMA-based approach, except that the cases 2 and 3 in Figure 6 need to be handled by the core processor only and there is no double buffering.
Figure 6. Inverse mapping in affine warp. Dotted lines represent the inverse-mapped blocks.

3.3 2D FFT

An N ´ N 2D FFT using the Cooley-Tukey decomposition requires real multiplications and real additions [6]. While computationally efficient, such a direct 2D FFT decomposition leads to data references that are highly scattered throughout the image. For example, the first 2 x 2 butterfly on a 512 ´ 512 image would require the following pixels: x(0,0), x(0,256), x(256,0), and x(256,256). The large distances between the data references make it difficult to keep the necessary data for the same butterfly in the data cache or on-chip memory. Alternatively, a 2D FFT can be decomposed into row-column 1D FFTs, which can be computed by performing the 1D FFT on all the rows (row-wise FFTs) followed by all the columns (column-wise FFTs) of the row FFT results. This method requires 4N2log2N real multiplications and 6N2log2N real additions [6], which is 33% more multiplications and 9.1% more additions than the direct 2D FFT approach. In spite of this additional computation, the separable 2D FFT algorithm has been popular since all the data for the row or column-wise 1D FFT currently being computed can be stored in any reasonably-sized data cache or on-chip memory. The intermediate image is transposed after the row-wise 1D FFTs, so that another set of row-wise 1D FFTs can be performed. This is to reduce the number of SDRAM page misses (via block-by-block transpositions) that are costly and would occur many times otherwise (i.e., if 1D FFTs are performed along the columns of the intermediate (row-wise FFT results) image). One more transposition is needed before storing the final results.
However, as the performance of mediaprocessors increases and the computation time steadily decreases, the data I/O time during the transposition would become more significant. In this case, we can use our recently-developed method that performs the column-wise 1D FFTs in two passes to reduce the data I/O time in 2D FFT [7]. In each pass, one half of the stages of the column-wise FFTs are computed. In this two-pass approach, the computation along a column in each pass is decomposed into a number of independent sub-computations, each of which handles only a fraction of the entire column, thus leaving space in the on-chip memory for multiple columns that are to be processed at a time.
With this improved I/O handling method, the number of data transfers between the on-chip and off-chip memories is reduced to three compared to four in the row-column decomposition method using transposition, thus reducing the total amount of data I/O by approximately 25%.
The FFT's innermost tight loop (butterfly) basically consists of MAC operations. It will be facilitated if a mediaprocessor can execute multiple MAC operations and/or handle complex numbers and their MAC operations. Also, depending on the available functional units to compute the butterflies, different radix bases (e.g., radix 4 in addition to radix 2) can be explored for better performance.

3.4 Other functions

Other functions like invert8 (to subtract each 8-bit pixel of the image from 255 to produce an inverse image) and add8 (to add two 8-bit images) are point operations, i.e., each output pixel only depends on the input pixel(s) at the corresponding position. Both cache-based and DMA mechanisms can access the input image(s) in row-major order and produce the output image accordingly. However, due to the simple processing requirement in each tight loop, these functions tend to be I/O-bound, thus can be used to test how fast a specific mediaprocessor can transfer the image data. The add8 function also tests how efficient the instruction set is in dealing with partitioned operations since the addition, subtraction and multiplication of two integers could result in overflows in the output, requiring extra instructions to handle the situation.
The binary morphology function (bin_dilate) is selected to test how effectively the individual bits can be handled and processed, e.g., extracting the arbitrary number of bits from an arbitrary bit location in the register.
The flip8_x, flip8_y, and transpose8 (as well as affine8) functions test the processor's ability to manipulate the data. flip8_x (or vertical flip) can be easily implemented by reordering the rows, but the flip8_y (horizontal flip) and transpose functions can benefit from intelligent data movement instructions.
D4 wavelet (dwt4) is separable and have been traditionally decomposed into separate row-wise and column-wise wavelets. However, the transpositions of the intermediate data could incur a large amount of overhead (2D FFT has a similar problem).In order to reduce the memory access latency and/or cache miss penalty, judicious use of data cache prefetch instructions (if available), cache blocking technique (Figure 2), and/or direct memory access (DMA) must be explored and employed.
In histogram generation (histo8), accessing bins that accumulate the number of occurrences of the corresponding input pixel values should be performed in a serial fashion. If the same bin is accessed while the previous access to that bin is not finished (this could happen either in pipelined memory accesses to the same bin or parallel memory accesses to the same bin by multiple load/store units), an incorrect histogram might be created. To resolve this issue, we can create multiple bins in the first phase. The interdependency in the memory accesses can be eliminated using multiple bins, thus the memory accesses can be pipelined. These multiple bins are added together to result in the final histogram in the second phase.
8 ´ 8 IDCT (idct8x8) is an essential operation in various image and video compression standards, thus can play an important role in measuring the mediaprocessor performance. In the IDCT computation, MAC, inner-product, and data movement instructions can be used. However, traditionally well-accepted fast (e.g., Chen's and Lee's) IDCT algorithms to reduce the number of multiplications may not be optimal for some mediaprocessors, e.g., add/subtract operations could become the bottleneck. We have also found that 8 ´ 8 matrix transpositions of intermediate results incur quite a large overhead even with the availability of data movement instructions on many mediaprocessors. To avoid this transposition overhead, a hybrid approach can be used, e.g., direct matrix-multiply 1D IDCT with inner-product instructions in processing rows and Chen's IDCT with partitioned MAC instructions in processing columns. Since the partitioned MAC instructions can process multiple vertical columns at a time, there is no need for transpositions.If we know the statistics of IDCT coefficients, other IDCT algorithms, e.g., symmetric mapped IDCT (SMIDCT) [8], could be considered as well.Another concern is whether the IDCT result meets the IEEE-1180 mismatch test. In order to make the IDCT results compliant to the IEEE-1180 Standard, judicious use of rounding during multiplications and additions or longer bit fields to represent the data would be needed.

4. Conclusion

Modern mediaprocessors have a lot of on-chip computing and I/O resources for high performance via several levels of parallelism in the instructions (issuing multiple instructions per cycle), data (using partitioned operations), processors (using multiple processors in the chip), and overlapping processing and data I/O. Even with such a flexible and powerful architecture, to achieve good performance necessitates a careful analysis and design of algorithms to make the best use of these resources.
In developing the UW-MPCL functions, we will strive to optimize the functions at all levels of parallelism and data flow. Sometimes, optimizing them at one level (e.g., instructions or data level) requires programming changes in different levels (e.g., processors or between processing core and DMA). In other words, the optimization should be done globally and systematically, considering all different levels, factors and their interdependence.With these highly-optimized functions, the UW-MPCL will become a very effective and useful tool for helping the set makers to accurately yet easily evaluate the mediaprocessors for their applications and to develop their mediaprocessor-based products quickly.
It is also critical for mediaprocessor manufacturers to provide the development tools early to the UW ICSL and initial training of several ICSL engineers/programmers. Set makers can also help UW to develop and improve library functions of high quality by giving UW their feedback in using the UW-MPCL functions as they have done in the past.In addition to our team with experience, insights, creativity, hardwork, and striving for excellence, we needs to closely collaborate with chip vendors and set makers for the UW Mediaprocessor User Consortium to succeed, whose success would benefit both chip vendors and set makers.

References
1. K. Guttag, R. J. Gove, and J. R. Van Aken, "A single chip multiprocessor for multimedia: The MVP,'' IEEE Computer Graphics Application, vol. 12, no. 6, pp. 53-64, 1992.
2. C. Basoglu, Y. Kim, and V. Chalana, "A real-time scan conversion algorithm on commercially-available microprocessors," Ultrasonic Imaging, vol. 18, pp. 241-260, 1996.
3. D. A. Patterson and J. L. Hennessy, Computer Architecture A Quantitative Approach. Morgan Kaufmann Publishers, 2nd ed., San Francisco, CA, 1996.
4. R. Managuli, D. Kim, G. York and Y. Kim, "Mapping of 2D convolution on VLIW mediaprocessors for real-time performance," Journal of Electronic Imaging, vol. 9, pp. 327-335, 2000.
5. O. Evans, and Y. Kim, "Efficient implementation of image warping on a multimedia processor," Real-Time Imaging, vol. 4, pp. 417-428, 1998.
6. D. E. Dudgeon and R. M. Mersereau, Multidimensional Digital Signal Processing. Prentice Hall, Englewood Cliffs, NJ, 1984.
7. C. Mermer, D. Kim, and Y. Kim, "2D FFT on a modern programmable mediaprocessor using an efficient data flow," to be submitted to Parallel Computing, 2000.
8. A. C. Hung and T. H. Meng, "A comparison of fast inverse discrete cosine transform algorithms," Multimedia Systems, vol. 2, pp. 204-217, 1994.