Jul 3, 2011
Temporal Expression Recognition and Temporal Relationship Extraction from Chinese Narrative Medical Records

ZhouXiaoJia, LiHaoMin*, LuXuDong, DuanHuiLong
College of Biomedical Engineering and Instrument Science, Key Laboratory of Biomedical Engineering, Ministry of Education
Zhejiang University, Hangzhou, China
[email protected]

Abstract: As the first step of temporal information understanding, temporal expression recognition directly affects every later use of temporal information, such as temporal relationship extraction. Compared with Western languages, on which much research has been done in the last decade, Chinese temporal expressions have many distinct characteristics in both word morphology and syntax. This paper presents a temporal expression recognition and temporal relationship extraction framework focusing on Chinese narrative medical records. First, this study developed a temporal expression syntax based on regular expressions through statistical analysis and classification of a corpus of 147 medical records from more than 30 clinical departments. The recognition results showed that the proposed method covers most temporal expressions in the narrative medical records. Building on the results of temporal expression recognition, a temporal relationship extraction method based on CRF (Conditional Random Fields) was proposed to automatically extract the temporal attributes of medical problems from Chinese narrative medical records. The test results showed that the accuracy of temporal relationship extraction using the chosen template file could reach 86.94%.

Keywords: Temporal Expression Recognition; Temporal Relationship Extraction; Temporal Representation and Reasoning (TRR); Medical Natural Language Processing (MLP)

I. INTRODUCTION

As the need for clinical information extraction grows, for example in the development of clinical decision support systems, automatically extracting clinical information from clinical texts has become an active research field [1]. Temporal information is of great significance in electronic medical records and other medical informatics systems [2]. Recognizing temporal information and extracting temporal relationships can serve various applications in medical informatics, such as clinical decision support and digital clinical pathways [3]. A series of related evaluations [4,5] has reflected the importance of temporal information. The U.S. Defense Advanced Research Projects Agency (DARPA) developed TIMEX2 on the basis of the TIMEX program; it is oriented to English news corpora and is used to annotate temporal phrases. In 2003 it was introduced for use with Chinese. In recent years, much attention has been paid to research on temporal information in Chinese news corpora [6]. Yuan Chunfa of Tsinghua University and Li Wenjie of the Chinese University of Hong Kong, who work on the extraction of temporal information and temporal relationships, have reported research results [7].

Relevant studies in the medical domain started in the late 1980s and have produced a series of results [8] published in computer science and medical journals and conferences. The most representative work on using temporal information from English narrative medical records is the TimeText system [9], developed by Li Zhou et al. That team statistically analyzed and classified the temporal expressions in 100 discharge summaries, summarized their characteristics, and achieved temporal relationship extraction based on a medical terminology lexicon and syntactic rules. However, little research has specifically addressed temporal expression analysis in the Chinese medical domain. Furthermore, news corpora differ greatly from medical records, and medical events occur very rarely in news corpora. As a consequence, results obtained on news corpora cannot be applied directly to the medical domain, and this gap needs to be filled.

This study summarized the common types of temporal expressions in Chinese narrative medical records and proposed a Chinese temporal expression syntax based on regular expressions.
Then, a temporal relationship extraction approach using Conditional Random Fields (CRF) was used to extract temporal relationships from the temporal expression recognition results. Experimental results showed that the regular expressions had good applicability and laid the foundation for follow-up use of temporal information from Chinese narrative medical records, and that the accuracy of temporal relationship extraction can exceed 86%.

Supported by National Natural Science Foundation of China (30900329) and China Postdoctoral Science Foundation (20090451467)

978-1-4244-5089-3/11/$26.00 ©2011 IEEE

II. TEMPORAL EXPRESSION CLASSIFICATION

Clinical records are written by professionally trained doctors, so the temporal expressions in them follow certain regular patterns. This study exhaustively analyzed 147 clinical records covering more than 30 clinical departments, including discharge summaries, out-patient records, progress notes and admission records. In total, 1379 simple time expressions were extracted, forming 1207 temporal expressions, each of which is a complete concept of time information: 1004 simple temporal expressions and 203 composite temporal expressions. Based on this analysis, the study also established a classification of temporal expressions in Chinese narrative medical records.

A. Simple temporal expression

Following the TIDES temporal expression classification criteria and considering the characteristics of temporal expressions in the medical domain, this study divided temporal expressions in clinical records into three categories (see Tab. 1).

Tab. 1. Classification and statistics of temporal expressions

Class | Sub-class | Frequency | Semantic | Example
Specific time expression | Date | 416 (30.17%) | TP | 1992-7-6, May 3, 1988, August
Specific time expression | Time of day (TOD) | 98 (7.11%) | TP | 21:00, 17:00
Specific time expression | Duration (Dur) | 233 (16.9%) | TD | 3 months, 4 days
Specific time expression | Age | 6 (0.44%) | TP | at 3 years old, at the age of 47
Specific time expression | Duration Range (DurR) | 19 (1.38%) | TD | 2-4 days, 2-3 weeks
Specific time expression | Date Range (DateR) | 2 (0.15%) | TD | the year 1987 to 1998
Fuzzy time expression | Duration as Time Point (DurTP) | 99 (7.18%) | TP | 2 weeks ago, after 4 days, 2 years ago
Fuzzy time expression | Relative Time (RTime) | 183 (13.10%) | TP | yesterday, today, last year, this month
Fuzzy time expression | Past, Now, Future (PNF) | 30 (2.15%) | - | at present, recently, in the past, in recent years, in recent months
Fuzzy time expression | Part of day (POD) | 112 (8.12%) | - | morning, afternoon, at night
Fuzzy time expression | Unspecified Duration (UDur) | 2 (0.15%) | TD | several months, several decades
Fuzzy time expression | Season | 19 (1.38%) | TP | spring, winter, autumn and winter
Fuzzy time expression | Modified Date (MDate) | 23 (1.67%) | TP | early in 1980, middle of March, late in October, during the July
Fuzzy time expression | Modified Duration (MDur) | 82 (5.95%) | TD | almost 5 months, more than 2 months
Event-based Time (EBT) | - | 55 (3.99%) | - | 3 days before admission, 6 months after surgery

Note: The semantic field of time information is classified as Time Point (TP) or Time Duration (TD); a dash in the Semantic column marks expressions for which it cannot be determined whether they denote a time point or a time period.

SPECIFIC TEMPORAL EXPRESSION: this kind of temporal expression denotes a clear time point or time period. Date refers to a time unit of a day or longer, generally expressed with year, month and day. Time of day (TOD) is expressed in units shorter than a day and mainly combines hours, minutes and seconds with numbers. Duration (Dur) represents a specific time span without any fuzzy modifier; its smallest granularity is the second and its largest is the year. Duration Range (DurR) and Date Range (DateR) are time range expressions: DurR describes a range of time spans and DateR describes a start date and an end date.

FUZZY TEMPORAL EXPRESSION: this kind of temporal expression carries a certain ambiguity. Duration as Time Point (DurTP) combines a duration with a time position word (before or after) and semantically represents a time point. Relative Time (RTime) contains no explicit figures and must be positioned on the time axis by reference to the context time or the recording time. Past, Now, Future (PNF) generally takes the recording time as its reference time and only vaguely describes the relationship between the referred time and the reference time, so a specific time cannot be determined. Part of day (POD) refers to a vague time period within one day. Unspecified Duration (UDur) refers to a fuzzy duration whose quantity is unspecified. Modified Date (MDate) combines a date with a time positioning word; the modifiers fall into four categories: beginning, middle, end of a period, and during a period. Modified Duration (MDur) combines a duration with a fuzzy modifier to indicate an inexact time period. Season can represent a single season or a combination of two seasons, such as "autumn and winter".

EVENT-BASED TIME (EBT): this kind of expression takes a clinical event such as hospitalization, discharge, admission or surgery as the reference time. Such information has important reference value in clinical reasoning, so it is listed separately.

B. Composite temporal expression

Owing to the flexibility of natural language, several types of simple temporal expressions (Date, TOD, POD, RTime, Season, MDate) can be combined to form a composite time. In the statistical corpus, about 1/6 of the temporal expressions are composite. Tab. 2 lists the composite temporal expressions formed from different simple temporal expressions.

Tab. 2. Composite time statistics

Composite | Num | Percentage | Example
Date+POD | 11 | 5.42% | morning, March 12, 2001
Date+POD+TOD | 5 | 2.46% | 11:00am, April 2
Date+Season | 9 | 4.43% | autumn, 1998
Date+TOD | 52 | 25.62% | 14:20, July 3
POD+TOD | 18 | 8.87% | 8:00pm
RTime+Date | 37 | 18.23% | last July 14
RTime+Date+POD | 4 | 1.97% | last month, 11 am
RTime+POD | 45 | 22.17% | last afternoon
RTime+Season | 6 | 2.96% | last summer
RTime+POD+TOD | 3 | 1.48% | 2:00pm, the day before yesterday
RTime+MDate | 13 | 6.40% | late October of this year
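The combination of simple expressions into composite ones can be illustrated with a minimal Python sketch. It is not the procedure used in this study: the Span type, the merge_composites function and the adjacency rule (at most one separating character, which must be a space or a comma) are assumptions made for illustration only.

```python
from dataclasses import dataclass

# Simple classes that the text lists as able to form composite expressions.
COMPOSABLE = {"Date", "TOD", "POD", "RTime", "Season", "MDate"}


@dataclass
class Span:
    start: int   # index of the first character
    end: int     # index one past the last character
    label: str   # simple class name, e.g. "Date"


def merge_composites(spans, text, max_gap=1):
    """Merge neighbouring composable spans into composite spans.

    Assumption (not from the paper): two spans form one composite when at
    most `max_gap` characters separate them and those characters are only
    spaces or commas.
    """
    merged = []
    for span in sorted(spans, key=lambda s: s.start):
        prev = merged[-1] if merged else None
        gap = text[prev.end:span.start] if prev else ""
        if (
            prev is not None
            and span.label in COMPOSABLE
            and set(prev.label.split("+")) <= COMPOSABLE
            and len(gap) <= max_gap
            and gap.strip(" ,,、") == ""
        ):
            # Extend the previous span and join the labels with "+".
            merged[-1] = Span(prev.start, span.end, prev.label + "+" + span.label)
        else:
            merged.append(Span(span.start, span.end, span.label))
    return merged


text = "去年7月14日下午入院"
spans = [Span(0, 2, "RTime"), Span(2, 7, "Date"), Span(7, 9, "POD")]
print(merge_composites(spans, text))
```

For the sample sentence, the three adjacent simple spans are merged into a single span labelled RTime+Date+POD, which matches one of the composite types in Tab. 2. Classes outside the composable set, such as Dur, are left as separate spans.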

III. IDENTIFYING TEMPORAL EXPRESSIONS

The analysis above shows that temporal expressions in medical records follow certain rules, so this study used regular expressions to identify the simple temporal expressions and then the composite temporal expressions. Tab. 3 lists regular expressions for several high-frequency types of simple time expressions. The 1207 temporal expressions were randomly divided into two parts: 1000 served as reference data for deriving the regular expressions, and the remaining 207 served as test data to validate them. The experimental results showed that the regular expressions could cover almost the entire corpus, with a recognition accuracy of more than 95%. Errors arose from the following factors. First, unintended spelling errors or stray characters such as white space or special symbols may occur within a temporal expression, so that the rules cannot identify it correctly. Second, some medical terms contain a temporal expression component used as a modifier, such as the clinical test "24-hour blood glucose levels". In addition, some Chinese expressions are ambiguous: they sometimes carry temporal information and sometimes mean something entirely different. For example, "10 号" may mean the 10th day of the month or No. 10 in a street address.

IV. TIME RELATIONSHIP EXTRACTION

In the temporal relationship extraction phase, the results of temporal expression recognition were not used directly but were first corrected manually to avoid secondary errors. Using the corrected results, together with a small medical terminology library and the conditional random field algorithm, the time properties of related medical problems can be extracted automatically [10]. Fig. 1 shows the process of extracting temporal relationships. The medical records were first annotated semantically with medical problems and temporal information to serve the CRF training task. Temporal relationships were tagged in a medical-problem-oriented mode; that is, only the temporal attributes of the medical problems of interest were tagged. A multiple cross-validation method was used to evaluate different CRF learning templates on a corpus comprising 63 real narrative medical records, and a general principle of template design was proposed. The results showed that the accuracy of temporal relationship extraction using the chosen template file could reach 86.94%.

A. Semantic annotation

Semantic annotation of medical problems used the reverse maximum matching algorithm [11], which depends on an in-house semantically annotated medical terminology lexicon. In each sample sentence, one specific medical problem was extracted and tagged as P (concerned problem); other medical problems were labeled OP (other problems).

B. Corpus preparation

The experiment used CRF++ as the conditional random fields tool. In this study, 319 sentences were selected from 63 real narrative records as the experimental data; they contained 1075 medical problems and 531 items of time information.
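The reverse maximum matching step used for semantic annotation can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the annotation tool used in this study: the tiny lexicon, the function names and the extra label O for tokens that are not medical problems are all hypothetical.

```python
def reverse_maximum_match(sentence, lexicon, max_len=None):
    """Segment `sentence` from right to left, preferring the longest lexicon entry."""
    if max_len is None:
        max_len = max((len(term) for term in lexicon), default=1)
    tokens = []
    end = len(sentence)
    while end > 0:
        # Try the longest candidate that ends at `end`, then shrink it.
        for size in range(min(max_len, end), 0, -1):
            candidate = sentence[end - size:end]
            # A single character is always accepted as a fallback token.
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                end -= size
                break
    tokens.reverse()
    return tokens


def tag_problems(tokens, lexicon, concerned):
    """Label the concerned problem P, other lexicon terms OP, everything else O.

    The O label is an assumption added for this sketch.
    """
    tagged = []
    for token in tokens:
        if token == concerned:
            tagged.append((token, "P"))
        elif token in lexicon:
            tagged.append((token, "OP"))
        else:
            tagged.append((token, "O"))
    return tagged


# Hypothetical two-entry lexicon: hypertension and diabetes.
lexicon = {"高血压", "糖尿病"}
tokens = reverse_maximum_match("患者有高血压和糖尿病", lexicon)
print(tag_problems(tokens, lexicon, concerned="高血压"))
```

With this lexicon the sentence is segmented into 患, 者, 有, 高血压, 和, 糖尿病; 高血压 is tagged P and 糖尿病 is tagged OP. Scanning from the right end is what distinguishes this method from forward maximum matching.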

Tab. 3. Regular expressions for time expressions

Time classification: Regular expressions

Date: (((?\d+)年(?!代))?((?月)?(?\d{1,2})[日号]|(?\d+)-(?\d+)-(?\d+) 日?|((?\d+)年(?!代))?((?\d+)月)|((?\d{4})年 )"|(?\d{2})/(?)\d{2}/(?\d{2,4}))

TOD: (?([0-2]?[0-9])|([一二三四五六七八九十]?[一二]))[::点时] (?[0-5][0-9])?[::分]?((?[0-5][0-9])秒?)?

Dur: ((?\d+|[一二三四五六七八九十]|(\d+[~~-]\d+))(?秒|分(?!之)钟?|个?小时|(天|日)|个? 星期|周(?!岁)|个?月)))|((?(?年))

DurTP: Dur+前|后

RTime: ([前今昨明次同本去](天|日|月|年)?)

POD: (?(?早[上晨])|(?上午)|(晨起时?)|(?[正中晌]午)|(午[后间])|(?下午)|((?傍晚)|(?夜间)|(?凌晨))" |([前今明昨次后][晨晚天日])

MDur: 近|约+Dur、Dur+余|以上|左右
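How patterns of the kind listed in Tab. 3 can be applied to a sentence is sketched below in Python. The patterns are deliberately simplified stand-ins written for this example, not the expressions used in this study; the group names, the priority order and the find_temporal function are assumptions.

```python
import re

# Simplified stand-ins for a few Tab. 3 classes; group names are assumptions.
DUR = (r"(?P<amount>\d+|[一二三四五六七八九十]+)"
       r"(?P<unit>秒|分钟|个?小时|天|日|个?星期|周(?!岁)|个?月|年)")

PATTERNS = [  # list order sets priority when two patterns start at the same place
    ("Date", re.compile(
        r"(?:(?P<year>\d{2,4})年(?!代))?(?P<month>\d{1,2})月"
        r"(?:(?P<day>\d{1,2})[日号])?"
        r"|(?P<y>\d{4})-(?P<m>\d{1,2})-(?P<d>\d{1,2})日?"
        r"|(?P<only_year>\d{4})年(?!代)")),
    ("TOD", re.compile(r"(?P<hour>[0-2]?\d)[::点时](?P<minute>[0-5]\d)?分?")),
    ("DurTP", re.compile(DUR + r"(?P<direction>[前后])")),
    ("Dur", re.compile(DUR)),
]


def find_temporal(text):
    """Return (start, end, label, matched_text) tuples in reading order."""
    found = []
    pos = 0
    while pos < len(text):
        best = None
        for label, pattern in PATTERNS:
            m = pattern.search(text, pos)
            # Keep the earliest match; on a tie the earlier pattern wins.
            if m and m.end() > m.start() and (
                best is None or m.start() < best[1].start()
            ):
                best = (label, m)
        if best is None:
            break
        label, m = best
        found.append((m.start(), m.end(), label, m.group()))
        pos = m.end()  # continue scanning after the accepted match
    return found


sample = "患者于2010年3月5日14:20入院,3天前出现头痛,持续2小时。"
for start, end, label, matched in find_temporal(sample):
    print(label, matched)
```

On the sample sentence the sketch reports a Date (2010年3月5日), a TOD (14:20), a DurTP (3天前) and a Dur (2小时). It would also tag 24小时 inside a test name such as 24小时血糖 as a Dur, which is the kind of error described in Section III.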

Fig 1. Process to extract temporal relationship

C. Template preparation

In the CRF learning process, an appropriate template guides the algorithm to use appropriate context information and obtain good learning results. If the context span is too large, the efficiency of machine learning drops and over-fitting may occur. This study prepared 6 template files; details can be found in [19].

D. Experiments and Results

The 319 annotated sentences were randomly divided into 5 groups, and 5-fold cross-validation was used to evaluate the accuracy of the different templates. This process was repeated 10 times, and the average over the 10 replications was taken as the final result. The final evaluation results for the 6 template files are shown in Fig. 2. The accuracy of temporal relationship extraction could exceed 86% when the CRF template was optimized.

Fig 2. Results of temporal relation extraction using different templates

V. DISCUSSION

Analysis of the actual corpus showed that temporal expressions in medical records have many characteristics that differ from those in other types of corpus, such as news corpora. Time granularity is an important issue in temporal expressions. The granularities of time durations, in order of frequency, were day, year, month, week and hour, of which the first three were dominant.

The temporal relationship extraction study was carried out on a small corpus, so some of the usual drawbacks of machine learning methods were also present. Some errors were caused by data sparseness, and some by the long distance between relevant pieces of information in the natural language description. Uncertain temporal relationships also occur in actual texts, which makes automatic extraction difficult.

VI. CONCLUSION

With the development of medical informatics and the widespread application of clinical support systems in China, automatic temporal expression recognition and temporal relationship extraction from Chinese narrative medical records will develop rapidly and be widely used. Given the state of the art of Chinese medical language processing, many obstacles to implementing such a solution still remain at present. But with rapid development and growing interest in this domain, we believe the framework proposed here for structuring temporal information in Chinese narrative clinical records could serve various medical information applications, such as information retrieval, time extraction and data summaries.

REFERENCES

[1] Xu H, Stenner SP, Doan S, Johnson KB, Waitman LR, Denny JC. MedEx: a medication information extraction system for clinical narratives. Journal of the American Medical Informatics Association, 2010, 17(1): 19-24.
[2] Shahar Y, Combi C. Editors' foreword: Intelligent temporal information systems in medicine. Special Issue from J Intell Info Syst, 1999, 13(12): 5-8.
[3] Augusto JC. Temporal reasoning for decision support in medicine. Artificial Intelligence in Medicine, 2005, 33(1): 1-24.
[4] ACE2007 evaluation plan. http://www.nist.gov/speech/tests/ace/ace07/doc/ace07-evalplan.v1.3a.pdf, 2007-05-28.
[5] SemEval-2007. http://nlp.cs.swarthmore.edu/semeval shtml. 2007-1.
[6] YongDong Xu, ZhiMing Xu, XiaoLong Wang, YuanChao Liu. Extraction and semantic computing of Chinese textual time information. Journal of Harbin Institute of Technology, 2007, 39(3): 43843.
[7] Wenjie, L, W. Kam-Fai, et al. A design of temporal event extraction from Chinese financial news. International Journal of Computer Processing of Oriental Languages, 2003, 16(1): 21-39.
[8] Keravnou ET. Temporal reasoning in medicine. Artif Intell Med, 1996, 8(3): 187-91.
[9] Zhou L, Friedman C, Simon P. System architecture for temporal information extraction, representation and reasoning in clinical narrative reports. Friedman CP. Proceedings of the 2005 AMIA Annual Symposium. Austin: AMIA Symposium, 2005. 869-73.
[10] XiaoJia Zhou, HaoMin Li, HuiLong Duan, XuDong Lu. "The automatic extraction of temporal relation from chinese narrative medical records using conditional random fields". Journal of Biomedical Engineering in China, 2010, 29(5).
[11] http://en.wikipedia.org/wiki/Matching