Proceedings of 2010 IEEE 17th International Conference on Image Processing

September 26-29, 2010, Hong Kong

A SCALABLE PARALLEL HARDWARE ARCHITECTURE FOR CONNECTED COMPONENT LABELING

Chung-Yuan Lin, Sz-Yan Li and Tsung-Han Tsai
Department of Electrical Engineering, National Central University, Taiwan, R.O.C.
{yashiro,siy_an,han}@dsp.ee.ncu.edu.tw

ABSTRACT

This paper reconsiders parallel connected component labeling, as used in binary image analysis, to address the high-throughput and intermediate-memory requirements of high-dimensional image sequences. It is based on a proposed dual-parallel connected component labeling method. The main idea is to break the sequentiality of the labeling procedure by separating the image into slices and correctly delimiting the extent of all connected components locally, on each slice, simultaneously. From the proposed method, a scalable architecture that can adapt to different throughput requirements is derived. The proposed architecture consists of local label assignment, local label fusion, and global process units. A forest structure is introduced to cope with both global and local label equivalences. Based on the forest structure, find and union operations are implemented to complete the entire connected component labeling within two raster scans. The performance of the proposed architecture, estimated in terms of clock cycles and memory requirements, is presented to justify the advantage of the design over previous implementations.

Index Terms: Connected component labeling, algorithm, scalable architecture, real-time

1. INTRODUCTION

Labeling connected components in a binary image is one of the most fundamental operations in pattern analysis and computer vision [1]. Since connected components in an image may have complicated geometric shapes and complex connectivity, labeling cannot be completed by parallel local operations alone; it also needs sequential operations [2]. Many sequential labeling algorithms based on raster scanning have been proposed. Raster-scan based methods comprise a label assigning step and a label merging step [3]. An image is scanned in the raster direction, and a provisional label is assigned to each new pixel that is not connected to any previously scanned pixel.

978-1-4244-7993-1/10/$26.00 ©2010 IEEE


Provisional labels assigned to the same connected component are called equivalent labels, and the label merging step merges the equivalent labels that belong to the same connected component into a unique label. Raster-scan based methods can be characterized by the number of image scans. One-scan algorithms read the input image once, but with irregular data access [4], which makes them poorly suited to hardware realization. Two-scan and multi-scan algorithms require several iterations, but hardware solutions can benefit from their simplicity and parallelism [5]-[12].

This paper proposes a dual-parallel labeling algorithm based on the two-scan scenario. The proposed algorithm tackles connected component labeling in a divide-and-conquer fashion, which forms the first layer of parallelism: the image is divided into slices, and intermediate labeling is performed within each slice. Each intermediate labeling processes four pixels concurrently, which forms the second layer of parallelism. Based on the proposed algorithm, a scalable architecture consisting of local label assignment, local label fusion, and global process units is presented. A forest structure is introduced to deal with equivalent label pairs in parallel. Find and union hardware is implemented to grow a local forest during the fusion procedure. Final labels corresponding to the connected components are obtained after the first scan. The global process unit merges the local forests into a global forest by reusing the find and union hardware. The forest-based architecture reduces the number of registers needed to buffer provisional labels while merging equivalent labels. Moreover, the scalability of the proposed architecture allows it to adapt easily to applications with various high-throughput requirements. The proposed design outperforms other architectures owing to its low register cost and high degree of parallelism.

The rest of this paper is organized as follows. Section 2 describes the proposed labeling algorithm. The hardware architecture is described in Section 3. In Section 4, the proposed architecture is compared with conventional labeling architectures, and a conclusion is given in Section 5.


Figure 1. The labeling window of the label assignment step. (a) Window for a particular pixel. (b) Window for the local assignment step.

2. PARALLEL LABELING ALGORITHM

An algorithm suited to hardware implementation has to exploit parallelism and keep the data access regular. In addition, the throughput must be high enough to allow real-time applications. This paper proposes the dual-parallel labeling algorithm to increase the throughput.

In the first scan of the proposed algorithm, the input image is divided into K equally sized slices. The local label assignment step is applied to each slice to generate provisional labels and a local equivalent label set. The local label fusion step then merges the local equivalent label set according to the connected components it describes, growing a local forest that represents the label equivalences. All local label assignment and fusion steps run in parallel.

Each local label assignment step processes four pixels concurrently, in four consecutive rows. The labeling window that processes one particular pixel is shown in Fig. 1(a). When an object pixel P is encountered, the four neighbors that have already been processed are examined in the following priority order: A, C, D, B. If none is labeled, a new label is assigned to the current pixel. If one neighbor is labeled, that label is propagated to the current pixel. If two neighbors, either [A, C] or [A, D], carry different labels u and v, an equivalent label pair (u, v) is generated. Since the local label assignment step labels four pixels in parallel, the labeling window is extended to a 9 by 5 window, as shown in Fig. 1(b).

Let G_i be the local equivalent label set that contains all equivalent label pairs generated from the i-th slice. The set G_i is resolved locally in the local label fusion step. Consider the array of pairs (u, v) of G_i, with u > v, sorted in decreasing order of u. A local forest of trees is grown from an initial set of trees in which every label is a one-node tree. The array LocalParent_i stores the parent of each label in slice i. Initially, LocalParent_i[u] = u, ∀ u. Two operations are introduced in the local label fusion step: Find(LocalParent_i, u) returns the root of the tree containing u and applies path compression, and Union(LocalParent_i, u, v) merges the trees containing u and v. All the local label fusion steps work concurrently, one per slice, as follows.

Local label fusion (G_i, LocalParent_i), ∀ i = 1, 2, …, K
    For each (u, v) ∈ G_i
        If (Find(LocalParent_i, u) ≠ Find(LocalParent_i, v))
            Union(LocalParent_i, u, v)

Find(LocalParent_i, u)
    If (LocalParent_i[u] ≠ u)
        LocalParent_i[u] = Find(LocalParent_i, LocalParent_i[u]);
    Return LocalParent_i[u];

Union(LocalParent_i, u, v)
    Link(LocalParent_i, Find(LocalParent_i, u), Find(LocalParent_i, v))

Link(LocalParent_i, u, v)
    If (u < v)
        LocalParent_i[v] = u;
    Else if (u > v)
        LocalParent_i[u] = v;
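The local assignment and fusion steps above can be sketched in software for a single slice. The Python sketch below is illustrative only and is not the authors' hardware design. It processes one pixel at a time rather than four. It assumes 8-connectivity, with the previously scanned neighbors taken as left, upper-left, upper-right, and upper; the exact positions of A, B, C, and D in Fig. 1 are not assumed. It records a pair for every neighbor whose label differs from the propagated one, rather than only the [A, C] and [A, D] cases, and it does not sort the pairs. The name label_slice and the convention that label 0 marks background are hypothetical.

```python
def find(parent, u):
    """Return the root of u, compressing the path on the way back."""
    if parent[u] != u:
        parent[u] = find(parent, parent[u])
    return parent[u]


def link(parent, u, v):
    """Attach the larger root under the smaller one."""
    if u < v:
        parent[v] = u
    elif u > v:
        parent[u] = v


def union(parent, u, v):
    """Merge the trees that contain u and v."""
    link(parent, find(parent, u), find(parent, v))


def label_slice(slice_rows):
    """First scan on one slice: provisional labels plus the local forest."""
    rows, cols = len(slice_rows), len(slice_rows[0])
    labels = [[0] * cols for _ in range(rows)]
    parent = [0]   # parent[u] == u initially; index 0 is background (assumption)
    pairs = []     # local equivalent label set

    for y in range(rows):
        for x in range(cols):
            if not slice_rows[y][x]:
                continue
            # Previously scanned neighbours (positions are an assumption).
            candidates = [(y, x - 1), (y - 1, x - 1), (y - 1, x + 1), (y - 1, x)]
            seen = [labels[j][i] for j, i in candidates
                    if j >= 0 and 0 <= i < cols and labels[j][i]]
            if not seen:
                parent.append(len(parent))        # new one-node tree
                labels[y][x] = len(parent) - 1
            else:
                labels[y][x] = seen[0]            # propagate the first label found
                pairs.extend((seen[0], other) for other in seen[1:]
                             if other != seen[0])

    # Local label fusion: resolve every equivalent pair with Find and Union.
    for u, v in pairs:
        if find(parent, u) != find(parent, v):
            union(parent, u, v)
    return labels, parent


image = [
    [1, 0, 1],
    [1, 1, 1],
    [0, 0, 0],
    [1, 0, 0],
]
labels, parent = label_slice(image)
print(parent)  # [0, 1, 1, 3]: provisional label 2 was merged into label 1
```

In this small image the two upper blobs receive provisional labels 1 and 2, which turn out to be connected through the second row, so the forest records 1 as the parent of 2. The isolated pixel in the last row keeps its own label 3.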

Figure 2. An example of the global label fusion step for K=4.

After the local label fusion step is completed, a local label offset is computed globally from the number L_i of labels assigned in each slice, so that the slices have non-overlapping label ranges. For slice 0, the local label offset LO_0 is zero; for slice s, the local label offset LO_s is the accumulated sum

$$LO_s = \sum_{i=0}^{s-1} L_i, \quad s = 1, 2, \ldots, K \tag{1}$$
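Equation (1) is an exclusive prefix sum over the per-slice label counts. The short Python sketch below shows one way to compute the offsets and shift a local parent array into the global label range. The function names are hypothetical, and the sketch assumes that each slice numbers its labels from 1 and reserves 0 for background, which the text does not state.

```python
from itertools import accumulate


def local_label_offsets(label_counts):
    """Return [LO_0, LO_1, ...] with LO_0 = 0 and LO_s = L_0 + ... + L_(s-1)."""
    offsets = [0]
    offsets.extend(accumulate(label_counts[:-1]))
    return offsets


def apply_offset(local_parent, offset):
    """Shift every non-background entry of a local parent array by the offset."""
    return [0] + [p + offset for p in local_parent[1:]]


counts = [3, 2, 4, 1]                   # labels assigned in slices 0..3
offsets = local_label_offsets(counts)
print(offsets)                          # [0, 3, 5, 9]
print(apply_offset([0, 1, 1, 3], 3))    # [0, 4, 4, 6]
```

After the shift, the labels of slice 1 occupy the range 4 to 5 in this example, directly after the three labels of slice 0, so no two slices share a label value.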

The global label fusion step is applied after the local label offset. All local forests are examined, and a global forest is grown by merging labels that belong to the same connected component but lie in two consecutive slices, and therefore in different local forests. The array GlobalParent stores the parent of each label in the whole image. The top and bottom rows of every slice are checked against the bottom or top row of the adjacent slice, which establishes the equivalent label set between two adjacent slices. Fig. 2 depicts this procedure. Initially, GlobalParent[u] = LocalParent_i[u], ∀ u, i. The procedure then works like the local label fusion step, the only difference being that LocalParent_i is replaced by GlobalParent. After the global label fusion, the label equivalence information of the entire image is available from the global forest.

3. THE SCALABLE ARCHITECTURE

The architecture for the dual-parallel labeling algorithm is shown in Fig. 3. The scalable architecture contains K local label assignment and fusion units and a global process unit. The global process unit, which comprises a global control unit and K local label offset units, controls the communication from the local to the global procedure. With this architecture, the K slices of an image are processed in parallel within two scans.

Figure 3. The proposed dual-parallel labeling architecture.

Two memories are required, one for label equivalence storage and one for labeled image storage. Both memories are divided into K banks, and the i-th bank holds the data of the i-th slice. During the first scan, each local label assignment unit reads previously labeled pixels from its labeled image memory bank and outputs provisional labels and the equivalent label set. The local label fusion unit then merges the label equivalences with the Find and Union operations. The merged label equivalences are held in the LocalParent array, which is stored in the equivalent label memory bank of the respective slice. Each local forest is grown throughout the first scan by successively merging equivalent label sets.

The detailed architecture of the local label assignment and fusion units is depicted in Fig. 4. The architecture consists of four pipeline stages: assigning, find and loading, union, and storing. The pipeline period is eight cycles, since the storing stage requires eight cycles to save four LocalParent labels to eight different addresses in the label equivalent memory bank. Each local label assignment unit contains four PEs and several registers that implement the labeling window shown in Fig. 1. Each PE generates a provisional label and an equivalent label pair, so each local assignment unit outputs four provisional labels and four equivalent label pairs. Four LocalParent labels are loaded from the equivalent label memory into the prior-parent label registers; a comparator selects the smallest label of each equivalent label pair as the index. Concurrently, the correspondence between the equivalent label pairs is examined, and the linkage among the four equivalent label pairs is set.
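The linkage and reassignment performed on four equivalent label pairs can be mimicked in software. The Python sketch below is a behavioral illustration, not a description of the actual circuit: it treats two pairs as linked when they share a label, directly or through a chain of other pairs, and gives every pair in a linked group the smallest LocalParent label of that group. The function name link_and_union is hypothetical; the pair values and parent labels are the ones used in the worked example of this section.

```python
def link_and_union(pairs, parents):
    """Return the reassigned parent label for each equivalent label pair.

    pairs   : equivalent label pairs produced in one pipeline period
    parents : LocalParent label loaded for each pair, in the same order
    """
    n = len(pairs)
    group = list(range(n))          # group[i] == i means pair i is a group root

    def root(i):
        while group[i] != i:
            group[i] = group[group[i]]   # shorten the chain while walking up
            i = group[i]
        return i

    # Set the linkage: pairs that share a label fall into the same group.
    for i in range(n):
        for j in range(i + 1, n):
            if set(pairs[i]) & set(pairs[j]):
                ri, rj = root(i), root(j)
                if ri != rj:
                    group[max(ri, rj)] = min(ri, rj)

    # The smallest parent label of each group becomes the new parent label.
    smallest = {}
    for i in range(n):
        r = root(i)
        smallest[r] = min(smallest.get(r, parents[i]), parents[i])
    return [smallest[root(i)] for i in range(n)]


pairs = [(4, 6), (7, 8), (6, 10), (5, 10)]
parents = [1, 2, 3, 4]
print(link_and_union(pairs, parents))  # [1, 2, 1, 1]
```

Pairs (4,6) and (6,10) share label 6, and (6,10) and (5,10) share label 10, so the three pairs form one group whose smallest parent label is 1, while (7,8) stays on its own with parent label 2.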
Consider the example where the equivalent label set is G_i = {(4,6), (7,8), (6,10), (5,10)}. The linkage among the pairs (4,6), (6,10), and (5,10) is set. The Union and Link processing unit gathers the linked equivalent label pairs and compares their respective LocalParent labels. The smallest LocalParent label is reassigned as the new LocalParent label of all the other linked labels. In the same example, let the respective LocalParent labels of G_i be {1,2,3,4}. Based on the linkage information, the smallest of the LocalParent labels {1,3,4} is reassigned, so the Union and Link processing unit outputs {1,2,1,1}. Finally, the four reassigned LocalParent labels in the posterior-parent label registers are stored to the equivalent label memory bank, using the labels of each pair as indexes.

Figure 4. The detailed architecture of local assignment and fusion.

When the first scan is completed, the number of labels in each slice is transmitted to the global control unit, and the K local label offset units are enabled. The content of each label equivalent memory bank is offset according to (1), so that the banks hold non-overlapping ranges of labels. The global label fusion step is then performed. The top and bottom rows of every slice are checked against the bottom or top row of the adjacent slice. Four boundary equivalent label pairs are loaded into the prior-parent label registers from the two equivalent label memory banks that belong to the two adjacent slices. The global forest is then grown by successively merging boundary equivalent pairs from two different local forests. The generated GlobalParent labels are stored to the respective equivalent label memory bank, with the global control unit changing the accessed address from a local index to a global index. The hardware shown in Fig. 4 is reused to generate GlobalParent.

In the second scan, the equivalent label memory serves as a look-up table for reassigning all labels. The labeled pixels stored in the labeled image memory bank are read into the reassigning units to obtain their final label values. Four pixels are processed concurrently in this unit as well.

4. EVALUATION AND DISCUSSION

We test various binary image sequences generated by the background subtraction proposed in [13]. The tested image sequence is in HD720 format at 30 Hz. Labeling results show that 1000 to 3000 labels are sufficient. Therefore, 4096 labels are provided, and each label requires 12 bits for the HD720 specification.

Consider an image of N columns and M rows that contains L labels. In the first raster scan, K local hardware units process K slices concurrently. Each local hardware unit processes four pixels per pipeline period, and each pipeline period consists of eight cycles. Therefore, the proposed architecture takes (8*N*M)/(4*K) cycles for local label assignment and fusion. After the first raster scan, the local label offset requires L/4 cycles to accumulate the labels stored in the label equivalent memory. Then, (log2 K)*N cycles are required for the global label fusion. In the second scan, the reassigning units take (4*N*M)/(4*K) cycles to complete the table look-up and obtain the final label image.

Table I compares the architectures in terms of execution cycles, register cost, and memory cost. The scale K of the local hardware is taken to be 4, and log2 L is denoted L'. The sequential architectures [6]-[9] clearly require more processing cycles. Among the parallel architectures [10]-[12], our architecture consumes the fewest cycles. Our design also has the lowest register cost among the parallel architectures. Its memory cost is slightly higher than the others because the label equivalences are stored in memory. In general, the hardware cost per bit of a register is much higher than that of SDRAM, so the proposed architecture can significantly reduce the area cost. If the image to be labeled is HD720 and contains 4096 regions, the proposed architecture requires 1.4 Kbit of registers and 9486 Kbit of memory, while [12] requires 61 Kbit of registers and 9437 Kbit of memory. This amounts to an area saving of 90%. The whole design is described in Verilog HDL. The synthesized gate count is 7.64K in 0.35 µm technology, and the maximum operating frequency is 100 MHz.
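The cycle counts given in this section can be checked with a few lines of arithmetic. The Python sketch below evaluates the four terms for one frame and converts the total into a required clock frequency. The frame size of 1280 by 720 pixels for HD720 is an assumption, since the text does not state N and M explicitly, and the function name cycle_counts is hypothetical.

```python
import math


def cycle_counts(n_cols, m_rows, num_labels, k_slices):
    """Per-frame cycle counts of the four phases, following the formulas above."""
    first_scan = (8 * n_cols * m_rows) // (4 * k_slices)     # assignment and fusion
    label_offset = num_labels // 4                           # local label offset
    global_fusion = int(math.log2(k_slices)) * n_cols        # global label fusion
    second_scan = (4 * n_cols * m_rows) // (4 * k_slices)    # look-up reassignment
    return first_scan, label_offset, global_fusion, second_scan


# Assumed HD720 frame: N = 1280 columns, M = 720 rows; L = 4096 labels, K = 4.
counts = cycle_counts(1280, 720, 4096, 4)
total = sum(counts)
required_mhz = total * 30 / 1e6          # 30 frames per second

print(counts)                  # (460800, 1024, 2560, 230400)
print(total)                   # 694784
print(round(required_mhz, 1))  # 20.8
```

For K=4 the two scan terms add up to 3MN/4, the offset term is L/4, and the fusion term is 2N, which matches the timing expression listed for the proposed architecture in Table I.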
Considering the HD720 format, the proposed architecture requires 21 MHz to achieve real-time processing with K=4.

Table I. Architecture performance comparison

Architecture        Timing (cycles)       Register (bits)     Memory (bits)
Rosenfeld [6]       3MN                   N/A                 2NML'
Haralick [7]        2mMN                  N/A                 NML' + ML'
Lumia [8]           4MN                   N/A                 NML' + ML'
Nicol [9]           2MN + 2M              NL'                 NML'
Ranganathan [10]    2MN + 5M              10NL'               NML'
Wang [11]           MN + 6M - 4           4NL'                NML'
Yang [12]           MN                    LL' + (N+3)L'       NML'
Proposed            3MN/4 + L/4 + 2N      116L'               NML' + LL'

5. CONCLUSION

A dual-parallel connected component labeling algorithm and its hardware architecture are proposed. Following a divide-and-conquer approach, the architecture contains a scalable number of local label assignment and fusion units and a global process unit. The local units, which process the labeling task concurrently, form the first layer of parallelism. A forest structure is introduced to resolve the local and global label equivalences. Based on the forest structure, a four-stage pipelined architecture for local label assignment and fusion is presented. Each local assignment and fusion unit processes four pixels in parallel, which forms the second layer of parallelism. The architecture is evaluated on various image sequences and compared to other architectures. Our architecture performs better than the others in both clock cycles and hardware cost. The proposed architecture requires only 15 MHz to reach real-time processing on the HD720 format.

6. REFERENCES

[1] R. C. Gonzalez and R. E. Woods, Digital Image Processing. Reading, MA: Addison Wesley, 1992.
[2] A. Rosenfeld and A. C. Kak, Digital Picture Processing, 2nd ed. San Diego, CA: Academic, 1982, vol. 2.
[3] H. M. Alnuweiri and V. K. Prasanna, "Parallel architectures and algorithms for image component labeling," IEEE Trans. Pattern Anal. Mach. Intell., vol. 14, pp. 1024-1034, Oct. 1992.
[4] F. Chang, C. J. Chen, and C. J. Lu, "A linear-time component-labeling algorithm using contour tracing technique," Comput. Vis. Image Understand., vol. 93, pp. 206-220, 2004.
[5] L. He, Y. Chao, and K. Suzuki, "A linear-time two-scan labelling algorithm," in IEEE International Conference on Image Processing, San Antonio, Texas, pp. 241-244, 2007.
[6] A. Rosenfeld and J. L. Pfaltz, "Sequential operation in digital picture processing," Journal of Association for Computing Machinery, vol. 13, pp. 471-494, 1966.
[7] R. M. Haralick, "Some neighborhood operations," in Real Time/Parallel Computing Image Analysis. New York: Plenum, pp. 11-35, 1981.
[8] R. Lumia et al., "A new connected components algorithm for virtual memory computers," Computer Vision, Graphics and Image Processing, vol. 22, pp. 287-300, 1983.
[9] C. J. Nicol, "A systolic approach for real time connected component labeling," Computer Vision and Image Understanding, vol. 61, pp. 17-31, 1995.
[10] N. Ranganathan et al., "A high speed systolic architecture for labeling connected components in an image," IEEE Transactions on System, Man and Cybernetics, vol. 25, pp. 415-423, 1995.
[11] Kuang-Bor Wang et al., "Parallel execution of a connected component labeling operation on a linear array architecture," Journal of Information Science and Engineering, vol. 19, pp. 353-370, 2003.
[12] S.-H. Yang et al., "VLSI architecture design for a fast parallel label assignment in binary image," in Proc. IEEE Int. Symp. Circuit Syst., vol. 3, May 2005.
[13] T.-H. Tsai, W.-T. Sheu, and C.-Y. Lin, "Foreground object detection based on multi-model background maintenance," IEEE Int. Symposium on Multimedia, 2007.
