A Script Matching Algorithm for Oriental Characters on PDAs

Mi-Gyung Cho
Research Institute of Computer Information & Communication, Pusan National University
PUSAN 609-735, KOREA
[email protected]

Hwan-Gue Cho
School of Computer Science and Engineering, Pusan National University
PUSAN 609-735, KOREA
[email protected]

Abstract

The recent growth of pen computers and personal digital assistants (PDAs) has made electronic ink a first-class object. One of the most important problems under this model is searching previously stored pen strokes for a given ink query. In this paper we proposed and implemented a writer-dependent ink matching algorithm for oriental characters, in particular Korean mixed with Chinese characters. The algorithm is deliberately simple, so that matching time stays acceptable on mobile computers with limited hardware. Experiments showed a matching rate of over 98% for Korean-only scripts and 94% for data mixing Korean and Chinese scripts.

1. Introduction

In the last few years a number of mobile computers such as PDAs have been released. These devices are intended to keep personal data such as schedules, notes and address books. Most PDAs are slow, have small screens, and make handwriting or stroke recognition difficult. Given the size limitations imposed by the need for portability, a pen is the most natural way to input data.

The usual way to treat handwritten text is to convert pen strokes into strings of ASCII characters. Once converted into ASCII, the data can be manipulated and searched in conventional ways. The problem is that handwriting recognition is prone to errors, and many users point out that it does not meet their needs. A more natural way to handle pen-stroke data is to treat it as a first-class datatype[1, 4]. Such data is generally called electronic ink. Although ink is a natural medium for human beings, it must be made editable and searchable to become a first-class datatype.

Searching handwritten cursive text is a challenging problem. A person cannot perfectly reproduce his own previously written characters. Hence, exact matching of a query is not appropriate in this case, and approximate ink matching is used to search ink data. During the last decade, many researchers have studied ink search and proposed ink matching algorithms[2, 3, 4].

In this paper we address the problem of searching for a given cursive script, especially Korean mixed with Chinese characters, on PDAs. Because PDAs have limited hardware compared with desktop computers, a searching algorithm must be simple. We proposed and implemented a simple, fast writer-dependent ink matching algorithm for oriental characters that uses their geometric characteristics. The proposed algorithm could find a given script within one second and also gave a high matching rate.

2. A Simple-fast matching algorithm

2.1. Geometric characteristics of oriental character

Most Korean characters consist of line-shaped and circle-shaped strokes. Lines are horizontal, vertical or slanted. Figure 1 shows the primitive strokes of Korean hand-printed characters[5]. Although stroke types 5 and 6 are each a single stroke, they can be split into stroke types 1 and 2 at the place where the direction of the stroke changes radically. Therefore, if we account only for the geometric characteristics of Korean hand-printed characters, they can be described with five primitive strokes: stroke types 1, 2, 3, 4 and 7 in figure 1. Chinese characters have similar characteristics but more primitive strokes.

Figure 1. Primitive strokes for Korean character

A cursive stroke tends to have many different representations, since there can be many variations of stroke segments for a given stroke. In addition, Korean and Chinese handwriting allows consecutive primitive strokes to be connected. Statistical reports show that more than 80% of Korean test characters have one or more strokes connected[5]. This is why cursive characters are very difficult to recognize. Figure 2 shows an example of a hand-printed character (a) for the letter ‘nahn’ and its various cursive forms. The cursive form (b) has no connection, while (c) and (d) each have one pair of consecutive strokes connected. In (e) connection occurs twice: between the first and second strokes, and between the second and last strokes.

1051-4651/02 $17.00 (c) 2002 IEEE

Figure 2. Various cursive forms for a letter ‘nahn’

When two or more strokes are connected, the direction of consecutive points changes radically at the connection. If we can separate the cursive strokes at that place, they become a sequence of primitive strokes. We used the curvature of the stroke curve to separate a cursive stroke; the details are given in the next section.

To identify variations of handwritten Korean and Chinese characters we defined the nine primitive strokes shown in figure 3. Stroke types 8 and 9 are for Chinese handwritten characters. Stroke types 5 and 6 are variations of stroke types 5 and 6 in figure 1 and can be found in both kinds of characters. Of course we do not consider all possible variations that could occur in handwritten Korean and Chinese text, because we want a simple matching algorithm for PDAs.

Figure 3. Primitive strokes for handwritten character

2.2. Stroke separation by curvature

There are two cases in which pen strokes are separated by curvature. First, handwritten oriental characters may have a connection between consecutive primitive strokes. In this case, the pen stroke is separated at the place where the connection occurs. Second, separation occurs at stroke types 5 and 6 in figure 1. These strokes are combinations of stroke types 1 and 2 in hand-printed characters. Depending on the user's writing style, however, they may be split into two primitive strokes, such as stroke types 1 and 2 or stroke types 1 and 3, or may be written as stroke type 5 or 6 in figure 3.

The curvature of a curve is a scalar quantity: the rate at which the unit tangent vector T of the curve changes direction with respect to arc length. Let C be a stroke curve in 2-space traced out by a vector function r(t). If C is parameterized by arc length s, then a unit tangent to the curve is given by dr/ds. The quantity |r'(t)| is related to arc length s by |r'(t)| = ds/dt. The unit vector T is defined as follows.

$$T = \frac{dr}{ds} = \frac{dr/dt}{ds/dt} = \frac{r'(t)}{|r'(t)|} \tag{1}$$

The curvature of C at a point is

$$k = \left|\frac{dT}{ds}\right| = \left|\frac{dT/dt}{ds/dt}\right| = \frac{|T'(t)|}{|r'(t)|} \tag{2}$$

By this definition, the curvature is large where the curve bends sharply, and a curve close to a straight line has curvature near zero. Figure 4 shows the change of curvature for two Korean cursive characters. We separated the stroke wherever its curvature exceeded a threshold value, which we determined from the average and standard deviation of the curvature of the given stroke curve. In figure 4, the marked points are the places where strokes were separated. The upper script (solid line) and the lower script (dotted line) were split into five and six primitive strokes, respectively.

Figure 4. Curvature of two Korean letters ‘jo’ and ‘guk’

After the cursive strokes have been separated, we determine the type of each separated stroke. The pen strokes are then represented as a sequence of primitive strokes, which we call a stroke feature vector. A stroke feature vector is stored in the following form. .... The stroke feature vector of the two scripts in figure 4 is .
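As an illustration of the curvature-based separation described in Section 2.2, the following Python sketch estimates curvature on a sampled stroke and splits the stroke where the curvature exceeds a threshold built from the mean and standard deviation. The discrete turning-angle estimate, the factor `c`, the local-maximum test and all function names are assumptions made for this sketch; the paper does not specify them.

```python
import math
from statistics import mean, pstdev


def discrete_curvature(points):
    """Approximate curvature at each interior sample of a stroke.

    Assumption: curvature ~ turning angle between the two adjacent
    segments divided by their mean length (a discrete |dT/ds|).
    End points get curvature 0.
    """
    curv = [0.0] * len(points)
    for i in range(1, len(points) - 1):
        (x0, y0), (x1, y1), (x2, y2) = points[i - 1], points[i], points[i + 1]
        ax, ay = x1 - x0, y1 - y0
        bx, by = x2 - x1, y2 - y1
        la, lb = math.hypot(ax, ay), math.hypot(bx, by)
        if la == 0 or lb == 0:
            continue  # repeated sample, no direction information
        # Unsigned turning angle from the cross and dot products.
        turn = abs(math.atan2(ax * by - ay * bx, ax * bx + ay * by))
        curv[i] = turn / ((la + lb) / 2)
    return curv


def split_stroke(points, c=1.0):
    """Split a stroke at local curvature peaks above mean + c * std.

    The factor c is a hypothetical tuning parameter.
    Adjacent pieces share the split point.
    """
    if len(points) < 3:
        return [points]
    curv = discrete_curvature(points)
    threshold = mean(curv) + c * pstdev(curv)
    pieces, start = [], 0
    for i in range(1, len(points) - 1):
        is_peak = curv[i] >= curv[i - 1] and curv[i] > curv[i + 1]
        if curv[i] > threshold and is_peak:
            pieces.append(points[start:i + 1])
            start = i
    pieces.append(points[start:])
    return pieces


# An L-shaped stroke: a horizontal run followed by a vertical run.
l_shape = [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1), (3, 2), (3, 3)]
pieces = split_stroke(l_shape)
```

In this example the only sharp bend is the corner at (3, 0), so the stroke is cut there into a horizontal piece and a vertical piece. A real implementation would run this after resampling and smoothing, since raw pen samples produce noisy curvature values.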


2.3. Proposed matching algorithm

We proposed a matching algorithm for writer-dependent ink search at the stroke level, and we implemented an application that runs on PDAs in order to evaluate it. The application manages personal information such as names, phone numbers, email addresses and short memos. We stored those data in the form of electronic ink and searched for the query ink using our matching algorithm. The proposed algorithm has five phases.

[Algorithm] A Simple-fast Matching Algorithm
Step 1. Preprocessing.
Step 2. Stroke separation.
Step 3. Determination of the stroke type.
Step 4. Computing the distance.
Step 5. Report of matching results.

Preprocessing : Instead of using the raw data from the pen, we performed preprocessing steps such as resampling, smoothing and hook removal. A hook is a short stroke segment with a radical change of direction, which may occur at the beginning or end of a stroke. We removed hooks using the curvature: high curvature at the first two points or the last two points of a stroke may indicate a hook.

Stroke separation : The stroke separation phase was explained in the previous section. Stroke type 7 was excluded from the separating step because it must be handled as one stroke. Using the angles and directions of consecutive points, we tested whether a stroke was of type 7. After this step every separated stroke is one of the primitive stroke types in figure 3.

Determination of the stroke type : We must determine the stroke type of each separated stroke. First, we made a local information sequence of the stroke from its angles and directions. The directions are RIGHT, LEFT, UP and DOWN. The angles are VERT (vertical line), HOL (horizontal line), RS (right slash) and LS (left slash). The local information sequence is represented as . Second, the local information sequence needed a tuning process, even though we had performed the preprocessing step. Finally, we determined the stroke type using the local information sequence. We then produced a stroke feature vector and saved it with the original ink data.

Computing the distance : To find the best-matching data we computed the similarity between the stroke feature vector of the input ink and the pre-computed stroke feature vectors in the database. We used dynamic programming to determine the edit distance between the sequences. Five edit operations are applied: insertion, deletion, substitution, split and merge. We added the merge and split operations to the three operations

of classic string matching in order to capture the characteristics of oriental characters. These two operations are applied only when stroke type 5 or 6 in figure 3 is found. Stroke type 5 may be written as the combination of stroke types 1 and 2, as the combination of stroke types 1 and 3, or as other combinations, depending on the user's writing style. The same rule applies to stroke type 6. In those cases we preferred a merge or a split operation to the other operations. The cost of every operation was pre-defined according to the geometric similarity between the primitive strokes.

We applied dynamic programming to find the best match in the ink database. In Eq. (3), dist(i, j) is the cost of matching the strokes of the query P up to the i-th against the strokes of a database script T up to the j-th. P and T represent the stroke sequences (p_1, p_2, ..., p_n) and (t_1, t_2, ..., t_m), respectively.

$$\mathrm{dist}(i, j) = \min \begin{cases}
\mathrm{dist}(i-1, j) + \mathrm{del}(p_i) \\
\mathrm{dist}(i, j-1) + \mathrm{ins}(t_j) \\
\mathrm{dist}(i-1, j-1) + \mathrm{sub}(p_i, t_j) \\
\mathrm{dist}(i-1, j-2) + \mathrm{split}(p_i, t_{j-1} t_j) \\
\mathrm{dist}(i-2, j-1) + \mathrm{merge}(p_{i-1} p_i, t_j)
\end{cases} \tag{3}$$

The initial conditions are

$$\begin{aligned}
\mathrm{dist}(0, 0) &= 0 \\
\mathrm{dist}(i, 0) &= \mathrm{dist}(i-1, 0) + \mathrm{del}(p_i) \\
\mathrm{dist}(0, j) &= \mathrm{dist}(0, j-1) + \mathrm{ins}(t_j)
\end{aligned} \tag{4}$$
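The recurrence in Eq. (3) with the initial conditions in Eq. (4) can be sketched in Python as follows. The paper does not give its cost values, so the unit costs, the 0.5 cost for split and merge, and the `DECOMPOSITIONS` table are placeholders assumed for illustration; the function names and the `best_matches` helper are likewise hypothetical.

```python
INF = float("inf")

# Assumed (unordered) decompositions of the compound stroke types 5 and 6,
# following the text: type 1 + type 2, or type 1 + type 3.
DECOMPOSITIONS = {frozenset((1, 2)), frozenset((1, 3))}


def indel_cost(stroke):
    """Placeholder cost for del(p_i) and ins(t_j)."""
    return 1.0


def sub_cost(p, t):
    """Placeholder substitution cost: free when the stroke types agree."""
    return 0.0 if p == t else 1.0


def split_cost(p, t_pair):
    """One query stroke of type 5 or 6 matched against two database strokes."""
    return 0.5 if p in (5, 6) and frozenset(t_pair) in DECOMPOSITIONS else INF


def merge_cost(p_pair, t):
    """Two query strokes matched against one database stroke of type 5 or 6."""
    return 0.5 if t in (5, 6) and frozenset(p_pair) in DECOMPOSITIONS else INF


def stroke_distance(P, T):
    """Edit distance between stroke feature vectors P (query) and T (stored)."""
    n, m = len(P), len(T)
    dist = [[INF] * (m + 1) for _ in range(n + 1)]
    dist[0][0] = 0.0
    for i in range(1, n + 1):                      # Eq. (4), first column
        dist[i][0] = dist[i - 1][0] + indel_cost(P[i - 1])
    for j in range(1, m + 1):                      # Eq. (4), first row
        dist[0][j] = dist[0][j - 1] + indel_cost(T[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            p, t = P[i - 1], T[j - 1]              # sequences are 0-indexed
            best = min(
                dist[i - 1][j] + indel_cost(p),    # deletion
                dist[i][j - 1] + indel_cost(t),    # insertion
                dist[i - 1][j - 1] + sub_cost(p, t),
            )
            if j >= 2:                             # split(p_i, t_{j-1} t_j)
                best = min(best, dist[i - 1][j - 2] + split_cost(p, (T[j - 2], t)))
            if i >= 2:                             # merge(p_{i-1} p_i, t_j)
                best = min(best, dist[i - 2][j - 1] + merge_cost((P[i - 2], p), t))
            dist[i][j] = best
    return dist[n][m]


def best_matches(query, database, k=4):
    """Return the k database keys with the smallest distance to the query."""
    return sorted(database, key=lambda name: stroke_distance(query, database[name]))[:k]


database = {"a": [1, 2, 3], "b": [5, 3], "c": [4, 4, 4, 4]}
ranked = best_matches([1, 2, 3], database, k=2)
```

With these placeholder costs, a query written as strokes 1, 2, 3 stays close to a stored script 5, 3, because the merge operation absorbs the difference between writing one compound stroke and writing its two parts. The table has (n + 1)(m + 1) entries, so the cost per comparison grows with the product of the two stroke counts, which is small for names of a few characters.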

Report of matching results : The four match results with the minimum distance are displayed on one page. Figure 5 shows snapshots of the program. In figure 5 (a), a user wrote a query script and clicked the “find” button. The program displayed the match results ranked from one to four in (b). The first result was found with a similarity of 91%. The other results were also very similar to the query. Our algorithm distinguished sharply between similar characters. When a user wants more detailed information such as the phone number and memo, he can display it by clicking the “info” button. The personal information is shown in (c).

Figure 5. Three snapshots of program result (Finding ‘Lee Sung Keun’): (a) query script, (b) ranked match results, (c) personal information


3. Experimental results

We implemented our system in Embedded Visual C++ and ran the experiments on a Pocket PC (Cassiopeia E-125 model). We prepared five data sets containing the names and personal information of 100, 150, 200, 250 and 300 persons, randomly selected from the telephone book. To measure the matching rate of our algorithm we conducted the experiment with the following model.

$$R = M(N, k, r) \tag{5}$$

where N is the total number of names in the database, k is the number of trials made to find the query, and r is the ranked order. If a mismatch occurred, we tried to find the same query again. If the desired data was found in the first result page, the match succeeded. The matching rate was measured as the percentage of queries for which the query name was matched with the correct name in the database.

Figures 6 and 7 show the average matching rate for various database sizes. When a mismatch occurred, we tried up to three times to find the query. The matching rate was over 98% for Korean-only scripts and 94% for data mixing Korean and Chinese scripts, as shown in figures 6 and 7. The matching rate differed between one trial and three trials: the more trials, the higher the matching rate. This is because no one writes the same word exactly the same way he or she did before.

Figure 6. The average matching rate vs. database size (Only Korean handwritten characters)

Figure 7. The average matching rate vs. database size (Mixed Korean with Chinese handwritten characters)

Our application showed that the proposed algorithm has three advantages. First, it is very practical on PDAs because it gives a high matching rate. Most people keep information about 100-300 persons in their address book, and our algorithm gave a matching rate of over 98% for a database of 300 entries. Second, it gives the user security for private information. When other people whose writing style is different try to find a query, the desired data is not found, because our algorithm is writer-dependent. Third, it is very fast. All queries were found within one second. This shows that our algorithm is very efficient, guaranteeing fast matching on mobile computers.

4. Conclusion

In this paper we proposed and implemented a script matching algorithm for oriental characters, in particular Korean mixed with Chinese characters. By experiment we showed that our algorithm was fast on PDAs with limited hardware and gave a high matching rate: over 98% for Korean-only scripts and 94% for data mixing Korean and Chinese scripts.

Acknowledgements

This research was supported by University IT Research Center Project.

References

[1] Walid G. Aref, Ibrahim Kamel, and Daniel P. Lopresti, "On Handling Electronic Ink," ACM Computing Surveys, Vol. 27, No. 4, pp. 564-567, 1995.

[2] Walid Aref and Daniel Barbara, "Supporting Electronic Ink Database," Information Systems, Vol. 24, No. 4, pp. 303-326, 1999.

[3] Ibrahim Kamel and Daniel Barbara, "Retrieving Electronic Ink by Content," IEEE Proceedings of International Workshop on Multimedia Database Management Systems, pp. 54-61, 1996.

[4] Lopresti, D. and Tomkins, A., "On the searchability of electronic ink," In Proceedings of the International Workshop on Frontiers in Handwriting Recognition, pp. 156-165, 1994.

[5] Hang Joon Kim and Pyeoung Kee Kim, "On-line recognition of cursive Korean characters using set of extended primitive strokes and fuzzy functions," Pattern Recognition Letters, Vol. 17, pp. 19-28, 1995.
