Rapid Prototyping of OpenCV Image Processing Applications using ASP

Felix Mühlbauer∗, Michael Großhans∗, Christophe Bobda†

∗Chair of Computer Engineering, University of Potsdam, Germany
{muehlbauer,grosshan}@cs.uni-potsdam.de

†CSCE, University of Arkansas, USA
[email protected]

Abstract—Image processing is becoming more and more present in our everyday life. Driven by the requirements of miniaturization, low power and performance needed to provide intelligent processing directly inside the camera, embedded cameras will dominate the image processing landscape in the future. While the common approach to developing such embedded systems is to use sequentially operating processors, image processing algorithms are inherently parallel, so hardware devices like FPGAs provide a perfect match for building highly efficient systems. Unfortunately, hardware development is more difficult than software development and fewer experts are available. Automating the design process will leverage the existing infrastructure, providing faster time to market and quick investigation of new algorithms. We exploit ASP (answer set programming) for system synthesis, with the goal of generating an optimal hardware/software partitioning, a viable communication structure and the corresponding scheduling from an image processing application.

I. INTRODUCTION

Image processing is becoming more and more present in our everyday life. Mobile devices are able to automatically take a photo when detecting a smiling face, and intelligent cameras are used to monitor suspicious people and operations at airports. In production chains, smart cameras are being used for quality control. Besides those fields of application, many others are being considered and will be widened in the future. The challenge in developing such embedded image processing systems is that image processing often results in very high resource utilization, while embedded systems are usually equipped with only limited resources. The common approach is based on general purpose processor systems, which process data mainly sequentially. In contrast, image processing algorithms are inherently parallel, and thus hardware devices like FPGAs and ASICs provide the perfect match to develop highly efficient solutions. Unfortunately, compared to software development, only few hardware experts are available. Additionally, hardware development is error prone, difficult to debug and time consuming, leading to huge time-to-market. From an economic point of view two criteria for the development are important: time to market and the performance of the product. Automatic synthesis with the aim of generating an optimal

architecture according to the application will help provide the required performance in reasonable time. Our motivation is to design a development environment in which high-level software developers can leverage the speed of hardware accelerators without knowledge of low-level hardware design and integration. Our approach consists of running a ready-to-use and well known software library for computer vision on a processor featuring an operating system. We rely on very popular open source software like OpenCV [1] and the operating system Linux. This approach allows application developers to focus on the development of high-quality algorithms, which will then be implemented with the best performance. Given an application, deciding how to map a task to a processing element and defining the underlying communication infrastructure and protocol is a challenging task. In this paper we focus on the system synthesis problem of OpenCV applications targeting heterogeneous FPGA-based on-chip architectures. We use ASP (answer set programming) to prune the solution space. The goal is to find optimal solutions for task mapping and communication simultaneously, using constraints like timings and chip resources. This paper is structured as follows: After addressing related work, we explain our model for image processing architectures and the resulting design space. A brief introduction to ASP is followed by a description of the strategy for expressing and solving the problem in an ASP-like manner. The paper concludes with results and future work.

II. RELATED WORK

Search algorithms such as evolutionary algorithms are capable of solving very complex optimization problems. A generic approach is implemented by the PISA tool [2].
Here a search algorithm is a method which tries to find solutions for a given problem by iterating three steps: First, the evaluation of candidate solutions, second, the selection of promising candidates based on this evaluation and third, the generation of new candidates by variation of these selected

978-1-4577-0660-8/11/$26.00 © 2011 IEEE


candidates. Most evolutionary algorithms and simulated annealing fall into this category. PISA is mainly dedicated to multi-objective search, where the optimization problem is characterized by a set of conflicting goals. The module SystemCoDesigner (SCD) was developed to explore the design space of embedded systems. The solution found is not necessarily the best possible; in contrast, we use ASP to find an optimal solution. Ishebabi et al. [3] investigated several approaches for architecture synthesis for adaptive multi-processor systems on chip. Besides heuristic methods, ILP (integer linear programming), ASP and evolutionary algorithms were used. In our work the focus is less on processor allocation and more on communication, especially handling streaming data; the resulting complexity is different.

III. ARCHITECTURE

The flexibility of FPGAs allows for the generation of application specific architectures without modification of the hardware infrastructure. We are especially interested in centralized systems in which a coordinating processing element is supported by other slave processing elements. Generally speaking, these processing elements can be GPPs (general purpose processors) like those available in SMP systems, coprocessors like FPUs (floating point units), or dedicated hardware accelerators (AES encryption, discrete Fourier transformation, . . . ). Each processing element has different capabilities and interfaces, which define the way data exchange with the environment takes place. Communication is also a key component of a hardware/software system: while an efficient communication infrastructure will boost the performance, a poorly designed communication system will badly affect the whole system. In the following, our architecture model and the different paths of communication are described in detail. In general, the most important components for image processing are processing elements, memories and communication channels.
The specification will therefore focus in more detail on those components. In our architecture model we distinguish two kinds of processing elements: software executing processors and dedicated hardware accelerators that we call PUs (processing units). For each image processing function one or more implementations may exist in software or hardware, each of which has a different processing speed and resource utilization (BRAM, slices, memory, . . . ). Considering the communication, image data usually does not fit¹ into the limited on-chip memory of an FPGA and must be stored in an external memory. Because of the sequential nature (picture after picture) of image capture, the computation on video data is better organized and processed as a stream. The hardware representation of this idea is known as pipelining: several PUs are concatenated and build up a processing

¹A video with VGA resolution, 8 bit per color and 25 frames per second amounts to (640·480)·(3·8)·25 ≈ 175 megabit per second.
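As a quick sanity check, the footnote's estimate can be reproduced in a few lines (the ≈175 figure assumes binary megabits, i.e. division by 2²⁰; that interpretation is ours):

```python
# VGA video: 640x480 pixels, 3 colors x 8 bit each, 25 frames per second
bits_per_frame = (640 * 480) * (3 * 8)
bits_per_second = bits_per_frame * 25       # 184,320,000 bit/s
print(bits_per_second / 2**20)              # ~175.8 "binary" megabit/s
```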


Fig. 1. Example architecture according to our model (PU = processing unit, IM = interconnection module; chains of PUs, fed from the camera, are linked via SDI connections and IMs to the processor and the memory controller on the system bus, which leads to external memory).

chain. The data is processed while flowing through the PU chain. In order to allow a seamless integration of streaming oriented computation in a software environment, we implemented an interface called SDI (streaming data interface). The interface is simple and able to control the data flow to prevent data loss due to different speeds of the modules. It consists of the signals and protocols that allow an interlocking data transport across a chain of connected components. The SDI interface allows for reusing PUs in different computation contexts. A variety of processors available for FPGAs, like PowerPC, MicroBlaze, NIOS, LEON or OpenRISC, provide a dedicated interface to connect co-processors. These interfaces can be used for instruction set extensions but also as a dedicated communication channel to other modules. Hence, to build an architecture, a PU or chain of PUs can be connected to a memory or a processor. Furthermore, memories are accessible directly (e. g. internal BRAM) or via a shared bus (e. g. external memory). For these interconnections, so called IMs (interconnection modules) are introduced in our model to link an SDI interface to another interface. Figure 1 shows an example architecture.

IV. DESIGN SPACE AND SCHEDULING

The best architecture, based on a hardware/software partitioning, should be found and compared to a software-only solution. The search is based on a given set of image processing algorithms and a given set of objective functions and optimization constraints. These general requirements are usually the system performance, the chip's resource utilization or the power consumption. We use the task graph model to capture an application. It defines the dependencies between the different processing steps, from the capture of the raw image data to the production of results.
Besides a pool of software and hardware implementations, a database was filled with meta information about these implementations, like costs and interoperability. Assuming that for each task a software implementation exists, the costs of selecting a component are its processing time and its memory utilization. For computationally intensive tasks a hardware implementation is available. The important costs here are processing time, initial delay, throughput and chip utilization

(slices, BRAM, DSP, . . . ). This information is gathered e. g. by profiling of function calls or data flow analysis. The problem to be solved is to distribute the tasks to a selected set of processing elements while being aware of timings and scheduling. As mentioned earlier, communication is a key part, and often a trade-off between communication and processing has to be found. For example, for adjacent tasks it can be faster in total to process them all locally instead of transferring the data from one task in the middle to and from another high-speed module. While mapping the tasks to processors, two kinds of parallelism have to be considered: first, the parallel operation of independent processors as in SMP systems, and second, PUs processing streaming data in series, as in pipelining. Both kinds of parallelism have different impact on the scheduling. Figure 2 shows a Petri net modeling the behavior of the implementation of an image processing function. One byte of data is represented by one token, which traverses the chain from pin to pout according to the implementation type T. I is the number of pixels in one image or frame. The two paths on the left describe filter-like operations which consume one image with α bytes per pixel and output one image with possibly another data rate α′. The two paths on the right describe operations with a fixed result size r, like the image brightness. While stream-based implementations (T=0;2) work on a pixel-by-pixel basis and cause an initialization delay (modeled as transition t1 / t4) and a processing delay (t2 / t5), other implementations (T=1;3) need full access to the whole image and take a delay of t3 / t6. PUs introduce two additional parameters which are important to calculate the scheduling: the initial delay and the throughput of data, which must be considered for chained PUs. It takes some time after a PU has read the first data before the first result is available.
This delay determines the starting time of the next PU as specified in the task graph. Additionally, it is not always possible for a PU to operate at its maximum speed, because the speed is also determined by the speed of incoming data from the previous module. Thus, to calculate the actual operation speed of a PU, these two facts have to be taken into account too. The different operation speeds in different contexts make the decision for the best architecture more difficult. Furthermore, the interplay of components on the chip may incur communication bottlenecks. For example, parallel access to main memory by different modules will incur delays in the computation.
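To make the interplay of initial delay and throughput concrete, the following sketch models a chain of PUs with made-up numbers (a PU starts once its predecessor emits its first result, and the chain is throttled by its slowest stage; both the parameters and the model granularity are our assumptions, not taken from the paper's implementation):

```python
# Each PU: (initial_delay, cycles_per_pixel). The chain's first result
# appears after the sum of initial delays; the sustained rate is limited
# by the slowest stage, which throttles every PU downstream of it.
pu_chain = [(4, 1), (8, 2), (2, 1)]   # hypothetical 3-stage pipeline

first_result = sum(delay for delay, _ in pu_chain)
effective_rate = max(cpp for _, cpp in pu_chain)   # cycles per pixel

pixels = 640 * 480
total_cycles = first_result + pixels * effective_rate
print(first_result, effective_rate, total_cycles)
```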

Fig. 2. Petri net modeling the data flow within processing elements with different implementation types T (tokens travel from pin to pout; filter-like paths with data rates α and α′ on the left, fixed result size r on the right; delays t1 to t6 as described in the text).

V. ANSWER SET PROGRAMMING

Answer Set Programming (ASP) is a declarative programming paradigm which uses facts, rules and other language elements to specify a problem. Based on this formal description, an ASP solver can determine possible sets of facts which fit the given model.

A. Background

The basic concept for modeling ASP programs are rules of the form:

p0 ← p1, . . . , pm, not pm+1, . . . , not pn    (1)

For a rule r the sets body+(r) = {p1, . . . , pm}, body−(r) = {pm+1, . . . , pn} and head(r) = p0 are defined. To understand the concept of answer sets, the rule can be interpreted as follows: if an answer set A contains p1, . . . , pm, but not pm+1, . . . , pn, then p0 has to be inserted into this answer set. Additionally, to avoid unfounded solutions, for an answer set A which contains p0 there must exist one rule r such that head(r) = p0, body+(r) ⊆ A and body−(r) ∩ A = ∅. Of course, this is only an intuitive way to describe the wide field of answer sets; more precise definitions can be found in several works about answer set programming [4]. With regard to the answer set semantics, a solving strategy and some solving tools are needed to handle the proposed way of understanding logic programs. The first step in computing answer sets is to build a grounded version of the logic program: all variables are eliminated by duplicating rules for each possible constant and substituting the variable. For example, the program:

q(X) ← p(X).    p(1).    p(2).    (2)

is ground to:

q(1) ← p(1).    q(2) ← p(2).    p(1).    p(2).    (3)

Of course, the grounded version of the program can be much bigger than the original one. Another problem is that the grounder needs a complete domain for each variable. For this reason it is sometimes necessary to model such a domain manually; e. g. the grounder may need a time domain where all possible time values are explicitly given. After generating the grounded version, a SAT-like solver is used to compute all answer sets. The most common way to model logic programs for an ASP solver is to use the generate-and-test paradigm: some rules are responsible for generating a set of facts, and additionally there exist some constraints which have to be met such that the generated set is an answer set, and consequently a solution for the given problem. For this purpose ASP solvers support some special language extensions. Generating rules can be modeled using aggregates [5]:

l [ v0 = a0, v1 = a1, . . . , vn = an ] u.    (4)
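The grounding step from program (2) to (3) can be sketched with a toy single-variable grounder (illustrative only; real grounders like gringo are far more sophisticated):

```python
# Program (2): facts p(1), p(2) and the rule q(X) <- p(X).
facts = [("p", 1), ("p", 2)]
rules = [{"head": ("q", "X"), "body": [("p", "X")]}]

# Grounding: substitute every known constant for the variable X.
constants = sorted({arg for _, arg in facts})
ground_rules = []
for rule in rules:
    for c in constants:
        def subst(atom):
            pred, arg = atom
            return (pred, c if arg == "X" else arg)
        ground_rules.append((subst(rule["head"]),
                             [subst(a) for a in rule["body"]]))

# Program (3): q(1) <- p(1) and q(2) <- p(2), alongside the facts.
print(ground_rules)
```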

The brackets define a weighted sum of the atoms v0, . . . , vn with weights a0, . . . , an. The rule states that a subset A ⊆ {v0, . . . , vn} of true atoms exists such that the sum of weights is within the bounds [l; u]. Omitted weights default to 1. These rules can be used to generate different sets of atoms. Integrity constraints test whether a generated set of atoms is an answer set. They describe conditions that must not be true in any answer set and are written:

← p, q    (5)
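The generate-and-test idea behind such constraints can be mimicked by brute force: enumerate candidate atom sets, then discard those violating the constraint above. This is only an illustration; clasp's conflict-driven search is far more efficient:

```python
from itertools import combinations

atoms = ["p", "q", "r"]
candidates = [set(c) for n in range(len(atoms) + 1)
              for c in combinations(atoms, n)]

# Integrity constraint (5): no answer set may contain both p and q.
surviving = [s for s in candidates if not {"p", "q"} <= s]
print(len(candidates), len(surviving))  # 8 candidates, 6 survive
```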

In this example there exists no answer set containing both p and q. So far we have described all language features which are necessary to model our optimization problem.

B. Model

Our ASP model is structured in three parts: first, the problem description, including a task graph and constraints for the demanded architecture; second, a summary of meta information for all implementations, like costs, the mapping of tasks to hardware or software components and the interconnect mapping; third, the solver itself with all rules needed to find a solution. This separation (into files) also yields a highly flexible and reusable model. The basic idea of the model is to select an implementation for each task, connect the associated modules and finally consider timings and dependencies to build the scheduling. Details are described in the following sections.

C. Allocation of processors

First, each task needs to be mapped to exactly one component. In the introduced scenario this can be the main processor or a PU. The number of permutations in the model to map components is M!, where M is the maximum number of components allowed to be instantiated. To reduce symmetries, a component is defined to have two indices: cij, j ∈ [1; Ji]. With Ji defining the maximum possible number of instantiations of a component i, the number of permutations is reduced to J1! · . . . · Jn!. The values Ji are derived from the task graph.

Normally it is not necessary to instantiate a certain component more than once, thus Ji is often equal to 1. For each instantiated component cij and each task tn an atom Mtn cij is defined, where i specifies the implementation type of a component and j is an instance counter. Thus, for each task ti the number of mapped components must equal 1:

1 [ Mti c11 , . . . , Mti ckl ] 1.

(6)
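Rule (6) is a cardinality choice: exactly one mapping atom per task. Enumerating its solutions amounts to picking one component per task, which the sketch below does exhaustively (task and component names are hypothetical, chosen only for illustration):

```python
from itertools import product

tasks = ["gauss", "sobel", "gradient", "trace"]
components = ["cpu", "pu_a", "pu_b"]   # hypothetical instances c_ij

# Exactly one component per task, as rule (6) enforces.
mappings = [dict(zip(tasks, pick))
            for pick in product(components, repeat=len(tasks))]
print(len(mappings))  # 3**4 = 81 candidate mappings
```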

D. Data flow

After all processing units are instantiated, they need to be connected. Connections are derived from the edges in the task graph and can be simple point-to-point connections, but can also involve more components. For example, in case data should be transferred from a processing unit to a memory, the connection requires an IM in between to link the different interfaces. For each transfer n an atom Cncij ckl is defined, which indicates that data for that transfer is sent from component cij to ckl. This atom only exists if the transfer actually happens:

0 [ Cncij ckl ] 1.    (7)

Again, after generating atoms they have to be limited to useful ones. Modeling these constraints is similar to pathfinding algorithms. Assuming a transfer n describes the dependency between two tasks tx (source) and ty (sink), the following constraints need to be met:

← Mtx cij , [ Cncij ckl ] 0.

(8)

← Mty cij , [ Cnckl cij ] 0.

(9)

If a source task tx is mapped to a component cij, there must exist a component ckl to which cij sends data in transfer n (see rule 8). Similarly, if a sink task ty is mapped to a component cij, there must exist a component ckl which sends data to cij in transfer n (see rule 9). Otherwise the solution is invalid. Additionally, the model must ensure that there exists a path for each transfer between the source and sink components, and it has to avoid senseless connections, e. g. in the case of incompatible interfaces between two components.

E. Time

To evaluate the performance of a hardware architecture it is necessary to schedule all tasks in a temporal order that will ensure the minimal run-time of the algorithm. Modeling temporal behavior can be done with the help of a time domain, which defines a discrete and finite set of possible time slots. Each task is assigned to a time slot to indicate its starting time. Additionally, the task graph is extended by two special tasks to mark the start and the end of the computation. While the start task is assigned to time slot 0, the time slot of the end task indicates the total runtime and is used as the value to be optimized by the solver. Choosing a practical duration for a time slot is difficult. Selecting shorter time intervals results in a very accurate scheduling. However, a small time slot leads to an explosion of the number of possibilities that the solver has to

deal with. As a trade-off we choose a normalized time interval for a time slot, related to the fastest component: one time slot is the amount of time the fastest component takes to process a certain amount of data. The occupation of two time slots indicates that a component operates at half the speed of the fastest one. In our ASP model each task ti is assigned to exactly one time slot k, indicated by an atom Tti k:

1 [ Tti 1 , . . . , Tti m ] 1.

(10)

where m is the total number of time slots available in the time domain, given as a constant in our model. To meet the dependencies given by the task graph, a task ty may not start before its predecessor tx:

← Ttx kx , Tty ky , ky < kx .

(11)
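Constraint (11) can be read as a filter on candidate schedules: any assignment where a task is scheduled before its predecessor is rejected. A direct check in Python (task names and slot values are made up):

```python
# Task graph edges (predecessor, successor) and candidate start slots.
edges = [("gauss", "sobel"), ("sobel", "gradient")]
T = {"gauss": 0, "sobel": 2, "gradient": 1}

# Constraint (11): reject if a successor starts before its predecessor.
violations = [(a, b) for a, b in edges if T[b] < T[a]]
print(violations)  # [('sobel', 'gradient')] -> this schedule is invalid
```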

F. Synchronization

In section IV we explained why the maximum operation speed of a component may not be exhausted and why the actual speed depends on the processing context. Therefore an atom Sti k is defined as

1 [ Sti 1 , . . . , Sti p ] 1.

(12)

where k is the speed of the task ti. Similar to the time model, a relative criterion was chosen for modeling the speed, for the same reasons. A value of 1 implies that the fastest component needs one time slot to process a certain amount of data. The constant p is the maximum speed value, and thus the speed of the slowest component. The possible speed values for a component depend on the speed of the predecessor. If two tasks tx and ty are dependent and mapped on adjacent components, the assigned speed values have to be equal:

← Stx kx , Sty ky , kx ≠ ky .

(13)

To find a scheduling, some more helper values are needed. If the starting time and the speed of a task are known, then its end time can be determined. Similar to the definition of the starting time T in section V-E, the end time of a task ti is described by an atom Eti k:

Eti ke ← Sti d , Tti ks , ke = ks + d.

(14)

To introduce a local scheduling on each component, there may not exist any tasks with intersecting computation times. Thus, if two tasks tx and ty are mapped to the same component and ty starts after tx, then tx must have finished before ty starts:

← Ttx ks , Etx ke , Tty ky , ks ≤ ky , ky < ke .

(15)
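Constraint (15) serializes tasks sharing a component. The sketch below applies it to two hypothetical tasks mapped to the same component, deriving end times as in rule (14) (all names and numbers are invented for illustration):

```python
T = {"a": 0, "b": 2}          # start slots
S = {"a": 3, "b": 3}          # durations in slots (speed values, rule 12)
M = {"a": "cpu", "b": "cpu"}  # both tasks on the same component

E = {t: T[t] + S[t] for t in T}  # end times, as in rule (14)

# Constraint (15): on one component, a task starting later must not
# begin before the earlier task has finished.
def conflict(x, y):
    return M[x] == M[y] and T[x] <= T[y] < E[x]

print(conflict("a", "b"))  # True: b starts at slot 2, a ends at slot 3
```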

G. Resource utilization

With the rules introduced so far it is possible to build valid architectures. In the following, further rules are presented which have global influence on the quality of the generated architectures and ensure compliance with the general conditions. In detail, this concerns memory bandwidth, chip area utilization and total runtime.

One major issue is the utilization of the system bus or memory bus, because most data is stored in the main memory and the memory interface easily becomes a bottleneck. For each point in time it must be assured that the bus is not overloaded and that the speed of the attached components is throttled if necessary. In our model the speed of the system bus is given as a constant sb. For each time slot the traffic of all active bus transfers is summed up and compared to the system bus capacity. Furthermore, the traffic caused by a component is inversely proportional to its speed; e. g. if a component operates four times slower than the bus, the bus utilization is one quarter. An overload is expressed by the following inequality:

1/stc1 + . . . + 1/stcn ≥ 1/sb    (16)

For the time slot t the components c1, . . . , cn load the bus according to their individual speeds stck. For ease of reading, cij is shortened to ck. In ASP fractional numbers should be avoided and integer numbers used instead. Therefore the bus capacity is modeled as discrete work slots which can be allocated by active components. The constant p (introduced in rule 12) is derived from the slowest component, and hence the minimal bus load is 1/p if the bus speed sb equals 1. This also results in p as the number of needed slots, respectively p/sb for sb ≠ 1. To see this, consider that it is possible to normalize all speed values, including the maximum value p, by sb and get a new maximum value p′ = p/sb and a new bus speed s′b = 1. For a component ck sending data and operating at speed stck, the normalization coefficient is stck/sb. Thus ck uses sb/stck of the bus capacity and consequently allocates

(sb/stck) · p′ = (sb/stck) · (p/sb) = p/stck    (17)

work slots. With this normalization, inequality 16 becomes an integrity constraint using only integer numbers:

← ⌈p/stc1⌉ + . . . + ⌈p/stcn⌉ ≥ ⌊p′⌋.    (18)

Another issue concerning the general constraints of a solution is the chip area. As described before, the resource utilization r of each component is given as part of the meta information. For each instantiated component ck the value rk is represented by the atom Rck rk. With Ru defining the overall resource constraint, the integrity constraint

← Rc1 r1 , . . . , Rcn rn , Ru , r = Σn rn , u ≤ r.    (19)

rejects architectures which consume too many resources. This rule is replicated to handle different resources like slices or BRAMs. Finally, to obtain the optimized model, the total run-time should be minimized. As an indicator for the runtime, the end task te was

TABLE I
BRIEF META-INFORMATION FOR DIFFERENT IMPLEMENTATIONS

task        throughput (PPC)  throughput (PU)  slices  BRAM
gauss       16                1                2       2
sobel       16                1                2       2
gradient    8                 2                1       0
trace       16                -                -       -
system bus  -                 -                -       -
IM (bus)    -                 -                10      3

[Fig. 3, first architecture: gauss, sobel and gradient implemented as a PU chain from the camera, attached via an IM to the memory controller on the system bus; trace runs in software on the processor. Chip area: 18, time slots: 21.]

defined earlier. In the ASP model an aggregate is used to find the time slot of te : minimize [ Tte 1 = 1, . . . , Tte m = m ].

(20)

Each atom Tte k is weighted by its time slot number k. Because only one atom is true, the sum results in the time slot number of the end task and hence the total runtime.

VI. RESULTS

At the University of Potsdam, Germany, a collection of tools called POTASSCO [6] was developed to support the computation of answer sets. Some of these tools are trend-setting and award-winning in the wide field of logic programming [7]. We use the tools gringo and clasp to solve our problem. These applications are capable of handling optimization statements, which are very similar to aggregates: a sum of specific weighted literals is built, and the solver tries to optimize this sum during the solving process. As an example application for this paper we used the Canny edge detector, a common preprocessing stage for object recognition. The processing steps are: camera → Gauss filter (noise reduction) → Sobel filter (find edges) → calculate gradient of edges → trace edges to find contours. While for the first steps a hardware implementation is very fast, the tracing of edges has no consecutive memory access, so only a software implementation is assumed. Table I summarizes the resource utilization and the assumed throughput for each implementation. On our test system² the ASP solver needs about 3 seconds to find a solution. Figure 3 shows two different architectures generated while decreasing the constraint on the available chip area. The mapping of software tasks is illustrated with parallelograms and dashed arrows. In the bottom right corner of each drawing the consumption of chip area and the estimated run-time is given. Finally, the software-only solution (not shown) takes 65 time slots, compared to 21 for the most hardware intensive architecture. Our current ASP model is a first approach and thus not optimized. Nevertheless we want to give an idea of the performance of the solving process.
For the measurements, 1000 problems were generated randomly for each number of tasks from 5 up to 10, in particular the task graph and the meta information for the different implementations. Figure 4 shows an

²Desktop with Intel Core2 Duo processor, 3.16 GHz, 3.2 GB RAM


[Fig. 3, second architecture: gauss and sobel as PUs attached via an IM to the memory controller on the system bus; gradient and trace run in software on the processor. Chip area: 17, time slots: 37.]

Fig. 3. Resulting architectures for different chip area constraints. The pure software solution takes 65 time slots.

Fig. 4. ASP solver runtime measurements according to the number of tasks

exponential growth of the solving time relative to the number of tasks in the problem, which is not worse than expected.

VII. CONCLUSION AND FUTURE WORK

We have shown that answer set programming is a viable approach to solving complex problems like the architecture generation for data stream based hardware/software co-design systems. The advantage over evolutionary algorithms or heuristic methods is the guarantee of finding an optimal solution. Our development platform is an intelligent camera system based on a Virtex-4 FX FPGA with an embedded PowerPC hardcore processor. Because image data is normally huge and stored in the external DDR memory, the first IM developed connects the PLB (system bus) to an SDI component and vice versa. This module operates similarly to a DMA controller, but instead of just copying, the data is streamed out of and into the module between the load and store operations, so that it passes through PUs. Our next step is to improve the ASP model to be solved faster and to be more accurate, especially concerning the resolution of timings, and to examine more complex task graphs.

An extension of the POTASSCO tools, currently under heavy development and capable of handling real numbers, will be used for this purpose. Our ASP model is already capable of generating architectures which include partial reconfiguration of modules. The scheduling is also valid, except for the case that a module is used, reconfigured and directly used again. Here the delay for the reconfiguration must be considered in the scheduling, because it stalls the processing. It is possible to include this case in our model with few modifications. In the future we are going to combine the work of this paper with our work in the domain of partial reconfiguration.

REFERENCES

[1] Intel Inc., "Open Computer Vision Library," http://www.intel.com/research/mrl/research/opencv/, 2007.


[2] ETH Zurich, "PISA - A Platform and Programming Language Independent Interface for Search Algorithms," http://www.tik.ee.ethz.ch/pisa/, 2010.
[3] H. Ishebabi and C. Bobda, "Automated architecture synthesis for parallel programs on FPGA multiprocessor systems," Microprocess. Microsyst., vol. 33, no. 1, pp. 63–71, 2009.
[4] C. Anger, K. Konczak, T. Linke, and T. Schaub, "A glimpse of answer set programming," Künstliche Intelligenz, no. 1/05, pp. 12–17, 2005.
[5] M. Gebser, R. Kaminski, B. Kaufmann, M. Ostrowski, S. Thiele, and T. Schaub, A User's Guide to gringo, clasp, clingo, and iclingo, Nov. 2008.
[6] University of Potsdam, "Potassco - Tools for Answer Set Programming," http://potassco.sourceforge.net/, 2010.
[7] M. Denecker, J. Vennekens, S. Bond, M. Gebser, and M. Truszczyński, "The second answer set programming competition," in Proceedings of the Tenth International Conference on Logic Programming and Nonmonotonic Reasoning (LPNMR'09), ser. Lecture Notes in Artificial Intelligence, E. Erdem, F. Lin, and T. Schaub, Eds., vol. 5753. Springer-Verlag, 2009, pp. 637–654.