Approaches to Implement University of Washington
Mediaprocessor Computing Library (UW-MPCL)
Image Computing Systems Laboratory
University of Washington
Seattle, WA 98195-2500
1. Introduction
This
document describes the approaches and issues in implementing the University
of Washington Mediaprocessor Computing Library (UW-MPCL) for each mediaprocessor
supported by the University of Washington Mediaprocessor User Consortium.
We have selected the following 12 functions to be included in the UW-MPCL:
1. invert8: 8-bit image invert
2. add8: add two 8-bit images
3. bin_dilate: dilate an image with a morphology structuring element
4. flip8_x: flip an 8-bit input image vertically (around the x axis)
5. flip8_y: flip an 8-bit input image horizontally (around the y axis)
6. transpose8: transpose an 8-bit input image
7. affine8: affine warping
8. conv8: 2D convolution with various kernel sizes
9. cfft: 2D complex fast Fourier transform (FFT)
10. dwt4: D4 wavelet transform
11. histo8: histogram generation for 8-bit input data
12. idct8x8: 8 x8 inverse discrete cosine transform (IDCT)
From
our experience in developing two previous image computing libraries, we
have found that the library users want the functions to be generic and
flexible so that these functions can be used in various applications. To
fulfill this goal, the image computing library functions need to support
different image sizes. Therefore, we will try to implement most functions
in the UW-MPCL to work on any image size. However, it is generally true
that supporting arbitrary image sizes and removing the granularity limitation
also introduce, in addition to the additional programming effort, some
overhead in the data flow and tight loop code, which results in lower performance.
In case an application requires only a single image size, the library users
can modify the code and remove the support for arbitrary image sizes to
obtain better performance.
The
execution time of the UW-MPCL functions will include the following: (1)
time to read the source image data from the external memory, (2) time to
process the data according to the image computing function by the mediaprocessor,
and then (3) time to store the processed results back to the external memory,
i.e., the execution time includes the necessary input and output activities
as well as pure processing. Our approach provides the performance numbers
that can be realistically expected when the mediaprocessor is utilized
in the actual systems with external memory. This is in contrast to some
existing benchmark performance numbers handling this issue lightly, e.g.,
assuming that all data are present on-chip, thus the cache miss penalty
and memory access latency tend to be ignored.
The
functions will go through upgrades and updates when newer tools (compiler
and simulator/software development board) and/or new versions of the mediaprocessor
are made available. For effective and high-quality function development,
the mediaprocessor chip vendors are expected to provide the development
tools early to the UW Image Computing Systems Laboratory (ICSL) and initial
training of several ICSL engineers/programmers.
All
functions in the UW-MPCL are subject to detailed design reviews, white-box
testing, and black-box testing. Prior to the implementation of any new
UW-MPCL function, the high-level design of each new function is reviewed
and critiqued during regular design review meetings. Once the design is
approved and the function is coded, the programmers test their functions
(white-box testing) under various conditions on the simulator or hardware
and validate the outputs of the function by using the verification utilities
of each function. Finally, at least one UW ICSL member is designated as
a quality control person who tests each function under various conditions
and reports any problems and suggestions to the function's author for corrections
and improvements. Details of the algorithm verification procedure are documented
in the Quality Assurance Guide.
An
html file for each function will be provided to describe the algorithm
specifics, synopsis, restrictions, function parameters, how to compile
and run, performance and its condition, and restrictions.
2. 2D Block Transfer DMA Mode and Data Cache
Mediaprocessors are designed
to provide high performance by utilizing both instruction-level parallelism
and data-level parallelism. This increasing computing power makes how to
transfer the necessary data between off-chip and on-chip memories without
slowing down the core processor's performance a real challenge. Two methods,
i.e., data cache and direct memory access (DMA), have been used to address
this issue. Data cache utilizes the temporal and spatial locality of the
accessed data while a DMA controller can transfer the data independently
of the core processor.
The most
frequently used DMA mode in image computing is the 2D block transfer even
though there exist other DMA transfer modes, e.g., 3D block transfer
[1]
and guided transfer [2].
When an entire input image does not fit on-chip, the input image is transferred
and processed in small blocks, where the 2D block transfer DMA is used
to move each block between the off-chip and on-chip memories. Moreover,
in order to keep the core processor from waiting for the data to arrive
in the on-chip memory, the DMA controller can be used to manage the data
movements concurrently with the processor's computation. This technique
is illustrated in Figure
1
and is commonly known as
double buffering. Four buffers, two for
input blocks (
ping_in_buffer and
pong_in_buffer) and two
for output blocks (
ping_out_buffer and
pong_out_buffer),
are allocated in the on-chip memory. While the core processor computes
on a current image block (e.g., block #2) from
pong_in_buffer and
stores the result in
pong_out_buffer, the DMA controller moves the
previously-calculated output block (e.g., block #1) in
ping_out_buffer
to the external memory and brings the next input block (e.g., block #3)
from the external memory into
ping_in_buffer. When the computation
and data movements are both completed, the core processor and the DMA controller
switch buffers with the core processor starting to use the
ping
buffers and the DMA controller working on the
pong buffers.
|
|
Figure 1. Double buffering the data flow using a DMA controller.
|
If the mediaprocessor
does not have a DMA controller but has data cache, there are several programming
techniques to use data cache efficiently, e.g., loop fusion, loop interchange,
and blocking [3].
In particular, blocking lets the program to process the image frame block-by-block,
increasing the memory access locality. Blocking is useful when large 2D
data are accessed in column-major order. Figure
2
illustrates the blocking technique used for image transposition as an example.
In Figure
2(a),
the transposition of a 2D image is performed as the input image is accessed
row by row and each row is stored in column-major order. Note that each
store operation requires the entire cache line to be loaded in the data
cache before being overwritten (we assume write-allocate cache). If the
data cache is smaller than N ´
cache
line size, excessive cache misses could occur as more columns are accessed. Figure
2(b)
shows the blocking mechanism. With blocking, the input image is accessed
in 2D blocks and transposed, and the results are stored in the output image.
By setting the size of two 2D blocks (one for the input image and another
for the output image) such that they fit in the data cache, we can achieve
better cache utilization, thus obtain higher performance.
Figure 2. Image transposition using data cache.
3. Image Computing Functions
In this
section, we describe several UW-MPCL image computing functions in detail.
3.1 2D convolution
In convolution,
each output pixel is computed to be a weighted average of several neighboring
input pixels. In the simplest form, generalized 2D convolution of an N ´N
input image with an M ´
M convolution kernel is defined as
|
(1)
|
where
f
is the input image,
h is the kernel,
s is the scaling factor,
and
b is the convolved output image. How to optimally map this algorithm
on mediaprocessors is discussed in [4].
The data access pattern for convolution is quite regular. For example,
for convolution with a 3 ´
3 kernel on the current pixel (x
5), eight neighboring pixels
(x
1 through x
9) as shown in Figure
3
are required. Also, to compute for the next adjacent pixel (x
8),
it requires a new set of eight neighboring pixels (x
4 through
x
12). It is obvious that both the 2D block transfer DMA and
cache-based approaches would work well in moving the data between the off-chip
and on-chip memories.
Most DSPs
provide multiply-accumulate (MAC) instructions to speed up the convolution
operations. However, some recent mediaprocessors feature partitioned MAC
and/or inner-product instructions. Figure
4
shows an example inner-product instruction. When computing Eq. (1) using
partitioned MAC or inner-product instructions, the convolution kernel coefficients
are usually stored in the registers. However, if the kernel width is greater
than the number of partitions in the instruction, then the kernel can be
subdivided into several sections and the tight loop is iterated multiple
times, while accumulating the multiplication results.
|
Figure 3. Convolution data access
pattern.
|
| |
|
|
Figure 4. Inner-product instruction
with 64-bit registers split into eight 8-bit partitions.
|
|
3.2 Affine
warp
Affine
warp is a very useful subset of image warping algorithms. It redefines
the spatial relationship between points in the input image to generate
the resultant output image while preserving parallel lines and equispacing
between image points. The example affine warp transformations include rotation,
scaling, shearing, flipping and translation. These transformations can
be applied individually or in any combination as shown in Figure
5.
Original image
|
|
|
|
(a)The
image has been shrunk in half
(b)The image has been rotated by 87.
|
|
the
x direction and also horizontally
|
|
sheared
by 30°.
|
|
Figure 5. Examples of affine warp.
|
A mathematical
equation for affine warp relating the output image to the input image (called
inverse mapping) is shown in Eq. (2),
where

and

are
the discrete output image locations,

and

are
the inverse-mapped input locations, and

-

are
the six affine warp coefficients. .
For each
discrete pixel in the output image, an inverse transformation with Eq. (2)
results in a non-discrete sub-pixel location within the input image, from
which the output pixel value is computed. To determine the gray-level output
value at this non-discrete location, some form of interpolation, e.g.,
bilinear, has to be performed using the pixels around the mapped location.
The interpolation can utilize MAC or inner-product instructions. In order
to use the inner-product instructions, interpolation weighting factors
need to be computed and pixels have to be rearranged so that the multiplication
of the pixel values with their proper interpolation coefficients can be
performed. This rearrangement of pixels for interpolation can be facilitated
if the mediaprocessor supports partitioned and flexible data movements
within a register. Computation of the weighting factors could use partitioned-multiply
instructions. The detailed description of affine warp and its implementation
on mediaprocessors can be found elsewhere [5].
While utilizing
DMA, the output image can be segmented into multiple blocks, an example
of which is illustrated in Figure
6.
A given output block maps to a quadrilateral in the input space. The rectangular
bounding block encompassing each quadrilateral is shown by the solid line
in the input image. If the bounding block contains no source pixels (case
3 in Figure
6),
the DMA controller could be programmed to write zeros in the output image
block without bringing any input pixels on-chip. If the bounding block
is partially filled (case 2 in Figure
6),
then the DMA controller could be programmed to bring only valid input pixels
on-chip and fill the rest of the output image block with zeros. If the
bounding block is completely filled with valid input pixels (case 1 in Figure
6),
the whole block is brought on-chip by the DMA controller. In addition,
these data movements are
double-buffered so that the data transfer
time can be overlapped with the processing time.
With a cache-based
affine warp implementation to compute output pixels in row-major order,
the data cache behavior is largely dependent on the warping parameters.
For example, the input image will be accessed in raster-scan order for
the warping shown in Figure
5(a),
while it is accessed in (almost) column-major order for the warping in Figure
5(b).
If a mapped pixel location in the input image lies outside the boundary,
the program should fill the corresponding output pixel with a zero value.
The warping in Figure
5(b)
might cause a large number of cache misses since each access to the input
pixel requires the entire cache line to be filled and it needs to access
the pixels in (almost) column-major order.
To overcome
this problem, the program can be reorganized to produce output pixels block-by-block,
using the blocking technique described in Section 2. If a 2D block of output
pixels is generated at a time, the memory access for the input image will
show better spatial locality. The memory access pattern in this case is
similar to that of the DMA-based approach, except that the cases 2 and
3 in Figure
6
need to be handled by the core processor only and there is no
double
buffering.
|
|
Figure 6. Inverse mapping in affine
warp. Dotted lines represent the inverse-mapped blocks.
|
3.3 2D FFT
An N ´
N 2D FFT using the Cooley-Tukey decomposition requires

real
multiplications and

real
additions [6].
While computationally efficient, such a direct 2D FFT decomposition leads
to data references that are highly scattered throughout the image. For
example, the first 2 x
2 butterfly on a 512 ´
512 image would require the following pixels:
x(0,0), x(0,256), x(256,0),
and
x(256,256). The large distances between the data references
make it difficult to keep the necessary data for the same butterfly in
the data cache or on-chip memory. Alternatively, a 2D FFT can be decomposed
into row-column 1D FFTs, which can be computed by performing the 1D FFT
on all the rows (row-wise FFTs) followed by all the columns (column-wise
FFTs) of the row FFT results. This method requires
4N2log2N
real
multiplications and
6N2log2N real additions
[6],
which is 33% more multiplications and 9.1% more additions than the direct
2D FFT approach. In spite of this additional computation, the separable
2D FFT algorithm has been popular since all the data for the row or column-wise
1D FFT currently being computed can be stored in any reasonably-sized data
cache or on-chip memory. The intermediate image is transposed after the
row-wise 1D FFTs, so that another set of row-wise 1D FFTs can be performed.
This is to reduce the number of SDRAM page misses (via block-by-block transpositions)
that are costly and would occur many times otherwise (i.e., if 1D FFTs
are performed along the columns of the intermediate (row-wise FFT results)
image). One more transposition is needed before storing the final results.
However,
as the performance of mediaprocessors increases and the computation time
steadily decreases, the data I/O time during the transposition would become
more significant. In this case, we can use our recently-developed method
that performs the column-wise 1D FFTs in two passes to reduce the data
I/O time in 2D FFT [7].
In each pass, one half of the stages of the column-wise FFTs are computed.
In this two-pass approach, the computation along a column in each pass
is decomposed into a number of independent sub-computations, each of which
handles only a fraction of the entire column, thus leaving space in the
on-chip memory for multiple columns that are to be processed at a time.
With this
improved I/O handling method, the number of data transfers between the
on-chip and off-chip memories is reduced to three compared to four in the
row-column decomposition method using transposition, thus reducing the
total amount of data I/O by approximately 25%.
The FFT's
innermost tight loop (butterfly) basically consists of MAC operations.
It will be facilitated if a mediaprocessor can execute multiple MAC operations
and/or handle complex numbers and their MAC operations. Also, depending
on the available functional units to compute the butterflies, different
radix bases (e.g., radix 4 in addition to radix 2) can be explored for
better performance.
3.4 Other functions
Other functions
like invert8 (to subtract each 8-bit pixel of the image from 255 to produce
an inverse image) and add8 (to add two 8-bit images) are point operations,
i.e., each output pixel only depends on the input pixel(s) at the corresponding
position. Both cache-based and DMA mechanisms can access the input image(s)
in row-major order and produce the output image accordingly. However, due
to the simple processing requirement in each tight loop, these functions
tend to be I/O-bound, thus can be used to test how fast a specific mediaprocessor
can transfer the image data. The add8 function also tests how efficient
the instruction set is in dealing with partitioned operations since the
addition, subtraction and multiplication of two integers could result in
overflows in the output, requiring extra instructions to handle the situation.
The binary
morphology function (bin_dilate) is selected to test how effectively the
individual bits can be handled and processed, e.g., extracting the arbitrary
number of bits from an arbitrary bit location in the register.
The flip8_x,
flip8_y, and transpose8 (as well as affine8) functions test the processor's
ability to manipulate the data. flip8_x (or vertical flip) can be easily
implemented by reordering the rows, but the flip8_y (horizontal flip) and
transpose functions can benefit from intelligent data movement instructions.
D4 wavelet
(dwt4) is separable and have been traditionally decomposed into separate
row-wise and column-wise wavelets. However, the transpositions of the intermediate
data could incur a large amount of overhead (2D FFT has a similar problem).In
order to reduce the memory access latency and/or cache miss penalty, judicious
use of data cache prefetch instructions (if available), cache blocking
technique (Figure
2),
and/or direct memory access (DMA) must be explored and employed.
In histogram
generation (histo8), accessing bins that accumulate the number of occurrences
of the corresponding input pixel values should be performed in a serial
fashion. If the same bin is accessed while the previous access to that
bin is not finished (this could happen either in pipelined memory accesses
to the same bin or parallel memory accesses to the same bin by multiple
load/store units), an incorrect histogram might be created. To resolve
this issue, we can create multiple bins in the first phase. The interdependency
in the memory accesses can be eliminated using multiple bins, thus the
memory accesses can be pipelined. These multiple bins are added together
to result in the final histogram in the second phase.
8 ´
8 IDCT (idct8x8) is an essential operation in various image and video compression
standards, thus can play an important role in measuring the mediaprocessor
performance. In the IDCT computation, MAC, inner-product, and data movement
instructions can be used. However, traditionally well-accepted fast (e.g.,
Chen's and Lee's) IDCT algorithms to reduce the number of multiplications
may not be optimal for some mediaprocessors, e.g., add/subtract operations
could become the bottleneck. We have also found that 8 ´
8 matrix transpositions of intermediate results incur quite a large overhead
even with the availability of data movement instructions on many mediaprocessors.
To avoid this transposition overhead, a hybrid approach can be used, e.g.,
direct matrix-multiply 1D IDCT with inner-product instructions in processing
rows and Chen's IDCT with partitioned MAC instructions in processing columns.
Since the partitioned MAC instructions can process multiple vertical columns
at a time, there is no need for transpositions.If
we know the statistics of IDCT coefficients, other IDCT algorithms, e.g.,
symmetric mapped IDCT (SMIDCT) [8], could be considered as well.Another
concern is whether the IDCT result meets the IEEE-1180 mismatch test. In
order to make the IDCT results compliant to the IEEE-1180 Standard, judicious
use of rounding during multiplications and additions or longer bit fields
to represent the data would be needed.
4. Conclusion
Modern
mediaprocessors have a lot of on-chip computing and I/O resources for high
performance via several levels of parallelism in the instructions (issuing
multiple instructions per cycle), data (using partitioned operations),
processors (using multiple processors in the chip), and overlapping processing
and data I/O. Even with such a flexible and powerful architecture, to achieve
good performance necessitates a careful analysis and design of algorithms
to make the best use of these resources.
In developing
the UW-MPCL functions, we will strive to optimize the functions at all
levels of parallelism and data flow. Sometimes, optimizing them at one
level (e.g., instructions or data level) requires programming changes in
different levels (e.g., processors or between processing core and DMA).
In other words, the optimization should be done globally and systematically,
considering all different levels, factors and their interdependence.With
these highly-optimized functions, the UW-MPCL will become
a very effective and useful tool for helping the set makers to accurately
yet easily evaluate the mediaprocessors for their applications and to develop
their mediaprocessor-based products quickly.
It is also critical
for mediaprocessor manufacturers to provide the development tools early
to the UW ICSL and initial training of several ICSL engineers/programmers.
Set makers can also help UW to develop and improve library functions of
high quality by giving UW their feedback in using the UW-MPCL functions
as they have done in the past.In
addition to our team with experience, insights, creativity, hardwork, and
striving for excellence, we needs to closely collaborate with chip vendors
and set makers for the UW Mediaprocessor User Consortium to succeed, whose
success would benefit both chip vendors and set makers.
References
1. K. Guttag, R. J. Gove, and J. R. Van Aken,
"A single chip multiprocessor for multimedia: The MVP,'' IEEE Computer
Graphics Application, vol. 12, no. 6, pp. 53-64, 1992.
2. C. Basoglu, Y. Kim, and V. Chalana, "A real-time
scan conversion algorithm on commercially-available microprocessors," Ultrasonic
Imaging, vol. 18, pp. 241-260, 1996.
3. D. A. Patterson and J. L. Hennessy, Computer
Architecture A Quantitative Approach. Morgan Kaufmann Publishers, 2nd
ed., San Francisco, CA, 1996.
4. R. Managuli, D. Kim, G. York and Y. Kim,
"Mapping of 2D convolution on VLIW mediaprocessors for real-time performance,"
Journal
of Electronic Imaging, vol. 9, pp. 327-335, 2000.
5. O. Evans, and Y. Kim, "Efficient implementation
of image warping on a multimedia processor," Real-Time Imaging, vol.
4, pp. 417-428, 1998.
6. D. E. Dudgeon and R. M. Mersereau, Multidimensional
Digital Signal Processing. Prentice Hall, Englewood Cliffs, NJ, 1984.
7. C. Mermer, D. Kim, and Y. Kim, "2D FFT on
a modern programmable mediaprocessor using an efficient data flow," to
be submitted to Parallel Computing, 2000.
8. A. C. Hung and T. H. Meng, "A comparison
of fast inverse discrete cosine transform algorithms," Multimedia Systems,
vol.
2, pp. 204-217, 1994.