The New Clothes for an Old Cookbook

Cvetana Krstev, Duško Vitas, Miloš Utvić, Branislava Sandrih
University of Belgrade, Studentski trg 3, 11000 Belgrade, Serbia
{cvetana,vitas,misko}@matf.bg.ac.rs, [email protected]

Abstract

In this paper we present a rule-based, language-dependent system for compiling lists of ingredients from recipes as presented in cookbooks for domestic use. The problem involves the recognition of both food ingredients and expressions for measurements and amounts specific to the culinary domain. Our evaluation showed that for most types of recipes precision is higher than recall, with F1 = 0.883.

1. Introduction

In Natural Language Processing (NLP), numerical expressions are treated as one or more types of Named Entities (NE) (Sekine et al., 2002). They are often treated as a large class of NEs, as opposed to name-type expressions, which are usually based on proper names. This super-class would thus include time, count, measurement and various other numerical expressions such as money and percentage. For the recognition of these expressions a number of language-dependent rule-based systems have been developed, e.g. for French (Constant, 2009), Arabic (Habash and Roth, 2008), Croatian (Bekavac et al., 2009) and Serbian (Krstev et al., 2014), as well as some based on machine-learning methods, e.g. for Arabic (Saleh et al., 2011), and for Hindi and Bengali (Ekbal and Bandyopadhyay, 2009). Bautista et al. (2013) present a system that deals with numerical expressions in the culinary domain for the purpose of their simplification. In this paper we are interested only in measurement and count expressions, as they are frequently used in the culinary domain.

In the following sections we present the text we are working on (Sec. 2.), the lexical resources we rely upon (Sec. 3.), the rule-based system for compiling lists of ingredients from culinary recipes (Sec. 4.), and the evaluation results (Sec. 5.). Finally, we outline our future work (Sec. 6.).

2. About the cookbook

The cookbook for which we set out to produce new clothes, informally known as Pata's Cookbook, is by far the most popular cookbook in Serbia and a must for every household. Its first edition was published in 1939 under the title Moj kuvar ‘My Cookbook’. It contained recipes that were collected and prepared for publishing by Spasenija Pata Marković (1891–1957), the long-standing head of a school for young girls and the editor of the culinary section of the daily newspaper Politika. After the Second World War, the cookbook was published in 1956 under the title Veliki narodni kuvar ‘The Great People's Cookbook’, and after that it had at least 28 editions in Cyrillic and 9 editions in Latin script.[1] The records of the National Library of Serbia give circulation data for a few editions only, but the figures are stunning: for example, the circulation of the 26th edition, published in 1987, was 40,000.

We based the electronic version of this cookbook on its first post-war edition.[2] It has 871 pages, of which 548 contain the recipes for the nutrition of the healthy that we have considered. The other pages contain mostly general explanations and recommendations, e.g. for diets, for the nutrition of the sick, instructions for serving food, etc. The recipe part of the cookbook, containing 2,964 recipes, is divided into 16 sections (Starters, Soups, Stews, etc.), some of which are divided into subsections: e.g. the section Starters has 18 subsections, while the section Salads has none. Each recipe has a heading (a dish name) and one or several paragraphs giving more or less precise instructions for the preparation of the dish, presented in a standard recipe style. The recipes lack a list of ingredients with the needed quantities, which is a de facto standard in modern cookbooks, especially those consulted on the Web, as such lists enable quick filtering based on various criteria. Our task in this project is the automatic extraction of these lists from the text of the recipes.

The electronic version of the text was obtained by OCR, after which a thorough semi-automatic correction was performed using various lexical resources (see Section 3.). This task was time-consuming due to the many errors produced by OCR of the old Cyrillic print. In the next step the e-text was transformed into a valid XML document conforming to a simple DTD and to a stricter RELAX NG schema: we marked cookbook sections, recipes, headings and paragraphs.

[1] Three Cyrillic editions are missing from the on-line catalog of the National Library of Serbia, which records editions from the first in 1956 up to the 30th from 1997. There were two more editions, in 2003 and 2009, with inconsistent edition numbering.
[2] Велики народни кувар ; [предговор Спасенија-Пата Марковић]. - Београд : Народна књига, 1956 (Београд : Култура). - 871 стр., листа с таблама : илустр.

3. Lexical resources

For the processing of recipes we have used both general resources and resources specially produced for the processing of culinary domain texts. Among the general resources, the most important are comprehensive morphological electronic dictionaries of Serbian comprising the general lexicon and proper names (personal names, geographic names, etc.), both as simple- and multi-word units. The information in these dictionaries, as well as its format, can best be explained by the following examples:

belog hleba,beli hleb.N+Conc+Food+Prod+DOM=Culinary:ms2q
kolača,kolač.N+Conc+Food+Course+DOM=Culinary:ms2q

In these examples the word forms belog hleba and kolača are assigned the lemmas beli hleb ‘white bread’ and kolač ‘cake’, together with codes for grammatical information: non-animate (q) singular (s) masculine (m) forms in the genitive case (2). Before our research the e-dictionaries were already comprehensive, having many entries from the culinary domain: 1,933 simple-word and 1,641 multi-word entries. The analysis of Pata's e-text resulted in collecting some new entries: 246 simple-word and 151 multi-word entries. Moreover, we prepared a small e-dictionary specific to this text that contained entries written using old orthography (e.g. belgiski instead of contemporary belgijski ‘Belgian’), obsolete entries (e.g. rtenica instead of kičma ‘spine’), and entries whose meaning is not quite clear (e.g. tauk, a kind of chicken). There were 62 such entries.

Besides morpho-syntactic information, lexical forms are assigned some additional explanatory markers: semantic markers (+Conc: concrete object; +Food: food; +Prod: food product) and a domain marker (+DOM=Culinary). The set of markers used for the culinary domain relevant to this research is given in the upper part of Table 1. The enrichment of e-dictionaries with the lexicon of the culinary domain and the development of specific markers are described in more detail in (Vujičić Stanković et al., 2014).

Markers    Examples
+Food      food
+Alim      alimentation (jabuka ‘apple’)
+Prod      product (šećer ‘sugar’)
+Course    course (čorba ‘chowder’)
+Drink     drink (čaj ‘tea’)
+Ing       ingredient (zaslađivač ‘sweetener’)
---------------------------------------------------
+MesApp    approximate measure
+Cont      container (šoljica ‘small cup’)
+Por       portion (štangla ‘bar (of chocolate)’)
+Part      part (kora ‘peel/crust’)
+Wh        whole (vekna ‘loaf’)
+Set       set (svežanj ‘bunch’)

Table 1: Some semantic markers for culinary concepts (upper part) and measures (lower part)

The Named-Entity Recognition System for Serbian is a rule-based and lexicon-based system that recognizes and tags entities belonging to the following high-level categories: person names, geopolitical names, organization names, temporal expressions, and numerical expressions including measures, amounts and percentages. This NER system and its performance are described in (Krstev et al., 2014). For this specific research we have used the NER modules for measurement and amount expressions. The module for amount expressions had to be modified because we were looking only for those expressions involving food ingredients, e.g. dva jajeta ‘two eggs’.

The general NER module for measurement expressions is based on a part of the e-dictionaries containing standard measurement units and their abbreviations, some of which, mostly for weight and volume, are used in the culinary domain as well, e.g. gram, litar, kg, etc. However, the culinary domain is specific in its use of many approximate measures, especially in non-professional recipes aimed at domestic use. The identification and systematization of culinary-specific measures is described in (Krstev, Vujičić Stanković and Vitas, 2014), and they are briefly presented in the lower part of Table 1. At the beginning of this research the e-dictionaries contained 112 simple-word and 19 multi-word approximate measures. The general NER module for measurement expressions was modified to take these approximate measures into consideration as well. Finally, our system for ingredient extraction also uses some general shallow parsing modules for Serbian, such as a module for the identification of adjective-noun constructions.
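To make the dictionary format described above more concrete, the following minimal Python sketch splits an entry of the shape form,lemma.CODES:gram into its parts. It is only an illustration based on the two example entries quoted in this section: the names DictEntry, parse_entry and is_food are hypothetical, and the sketch assumes that the separators (comma, full stop, colon) never occur inside a form or a lemma, which real e-dictionaries do not guarantee.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DictEntry:
    form: str              # inflected form, e.g. "belog hleba"
    lemma: str             # lemma, e.g. "beli hleb"
    pos: str               # part of speech, e.g. "N"
    markers: List[str]     # semantic markers without "+", e.g. ["Conc", "Food", "Prod"]
    domain: Optional[str]  # value of the DOM= marker, if present
    gram: str              # grammatical codes, e.g. "ms2q"


def parse_entry(line: str) -> DictEntry:
    """Parse one 'form,lemma.CODES:gram' line (simplified: no escaped separators)."""
    form, rest = line.strip().split(",", 1)
    lemma, rest = rest.split(".", 1)
    codes, _, gram = rest.partition(":")
    pos, *tags = [c.strip() for c in codes.split("+")]
    domain = None
    markers = []
    for tag in tags:
        if tag.startswith("DOM="):
            domain = tag[len("DOM="):]
        else:
            markers.append(tag)
    return DictEntry(form, lemma, pos, markers, domain, gram)


def is_food(entry: DictEntry) -> bool:
    """Hypothetical filter: culinary-domain entries carrying the +Food marker."""
    return "Food" in entry.markers and entry.domain == "Culinary"


entry = parse_entry("belog hleba,beli hleb.N+Conc+Food+Prod+DOM=Culinary:ms2q")
print(entry.lemma, entry.markers, entry.gram, is_food(entry))
```

A filter such as is_food shows how the semantic and domain markers could be used to restrict recognition to food-related entries; the actual system performs this kind of selection inside its grammars.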

4. The ingredient extraction system

We can formulate our problem in the following way: we want to extract from a recipe all ingredients and their amounts, as precisely as they are given in the recipe itself. That means that we want to:

1. extract only the ingredients, i.e. the “input” for the final dish, and not the intermediate products;
2. avoid repetitions of the same ingredient, e.g. first when stating the necessary amount and then when explaining how to use it;
3. recognize, if stated, the state in which an ingredient is expected to be (e.g. fresh, pre-cooked, etc.). Sometimes this means that rather complex phrases have to be recognized, for instance 1 vezu opranog i sitno sečenog peršuna ‘1 bunch of washed and finely chopped parsley’;
4. recognize, if stated, which part of an ingredient should be used (e.g. the peel or the juice of an orange);
5. extract a necessary ingredient even if the amount is not given.

The recipe in Figure 1 illustrates these requirements.[3]

Uzeti 1 kilogram crna luka, ali sitna, veličine oraha[3]. Očistiti ga, oprati i cele glavičice spustiti na 1/4 litra ulja da se prže. Dodati i aleve paprike[5]. Pržiti luk[2] dok sasvim ne odmekne, a tada dodati 1/4 kilograma suvih šljiva, zajedno sa košticama[3], 6–7 kocaka šećera, 1 čašu bela vina i toliko isto vode. Ostaviti na štednjaku luk[2] dok se i šljive[2] dobro ne skuvaju. Ovaj se paprikaš[1] služi topao ili hladan.

Take 1 kilogram of onion, but small, the size of a walnut. Clean it, wash it and put the entire bulbs into 1/4 liter of oil to fry. Add also cayenne. Fry the onion until it completely softens, then add 1/4 kilograms of dried plums, together with seeds, 6–7 cubes of sugar, 1 glass of white wine and the same amount of water. Leave the onion on the stove until the plums are well cooked. This paprikaš is served warm or cold.

Figure 1: One of Pata's recipes: in bold are ingredients that should be recognized, in italic those that should not be recognized. Subscripts (shown here in brackets) indicate the requirements given above that they illustrate.

The system for ingredient extraction is implemented in the Unitex corpus processing system,[4] which supports the use of e-dictionaries (in the format presented in Sec. 3.) and local grammars, a formalism for describing various syntactic and semantic linguistic phenomena (Gross, 1999). Unitex e-dictionaries and local grammars are implemented as Finite-State Transducers (FSTs) that can process both unannotated and annotated text. Text is annotated at several levels using different FSTs at different levels: e-dictionaries for the first level of annotation and cascades of local grammars for higher levels (Friburger and Maurel, 2004).

[3] Our e-text is in Cyrillic, but we have presented it here in Latin script for easier readability.
[4] http://unitexgramlab.org/

In order to achieve our goal we used two cascades of graphs: the first one recognizes and annotates amounts and ingredients, and the second one translates Unitex annotations into XML tags. The first cascade consists of 4 high-level transducers: the first one uses lexical annotations (provided by e-dictionaries), while the others use both lexical annotations and annotations produced by previously applied transducers. The roles of these transducers are:

1. The first transducer performs the basic task: it recognizes and annotates ingredients preceded or followed by the quantity needed (expressed by a standard measure, e.g. 1/2 litre ‘1/2 liter’, by a culinary-specific measure, e.g. 2 supene kašike ‘2 soup spoons’, or by an adverb of quantity, e.g. malo ‘little’), and ingredients preceded by an amount expressed by digits, words or their combination (e.g. 2 i po ‘2 and a half’).
2. The second FST recognizes the parts that should be used of an already recognized ingredient, e.g. meso od 3-4 slane sardine ‘meat of 3-4 salted sardines’.
3. The third transducer tries to recognize ingredients for which no amount or quantity is given by recognizing an appropriate context, e.g. ukrasiti ulupanom slatkom pavlakom ‘decorate with whipped cream’.
4. The fourth FST recognizes an ingredient if it is preceded or followed by some already annotated ingredient. For instance, in ... malo kvasca, a zatim so, šećer i brašno... ‘...a little leaven, and then salt, sugar and flour...’, malo kvasca will be recognized by the first FST, while so, šećer i brašno will be recognized by the iterative application of the fourth FST.

The ingredients were tagged with a dedicated ingredient tag, while the necessary amounts were tagged with one of five different tags, listed here with the labels used for them in Table 2: (a) measure.exact, for exact measures expressed by standard or culinary units; (b) measure.range, for a range expressed by standard or culinary units; (c) amount.exact, for exact amounts expressed by digits or words or by adverbs; (d) amount.range, for a range of amounts; (e) amount.empty, where no amount or measure is given. The whole phrase was enclosed in a further tag.
PAPRIKAŠ NA ULJU
1 kilogram crna luka
1/4 litra ulja
1/4 kilograma suvih šljiva
6-7 kocaka šećera
1 čašu bela vina

Figure 2: Automatic tagging of the recipe in Fig. 1
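The interplay of the first and the fourth transducer can be illustrated with a small Python sketch. It is not the Unitex implementation: regular expressions stand in for local grammars, the word lists UNITS, ADVERBS and INGREDIENTS are tiny hypothetical lexicons replacing the e-dictionaries, the labels are borrowed from Table 2, and only ingredients that follow an annotated one are propagated (the real fourth FST also looks at preceding context). Numbers written in words, such as dva ‘two’, are not handled.

```python
import re

# Toy lexicons (assumptions for illustration; the real system uses e-dictionaries).
UNITS = ["kilogram", "kilograma", "litra", "kocaka", "čašu"]
ADVERBS = ["malo", "dosta"]
INGREDIENTS = ["crna luka", "ulja", "suvih šljiva", "šećera", "bela vina",
               "kvasca", "so", "šećer", "brašno"]


def alt(words):
    """Build a regex alternation, longest entries first."""
    return "|".join(re.escape(w) for w in sorted(words, key=len, reverse=True))


NUM = r"\d+(?:/\d+)?"
RANGE = rf"{NUM}\s*[-–]\s*{NUM}"

# Stage 1: quantity (number or range, optionally with a unit, or an adverb) + ingredient.
STAGE1 = re.compile(
    rf"(?<![\w/–-])(?:(?P<num>{RANGE}|{NUM})(?:\s+(?P<unit>{alt(UNITS)}))?"
    rf"|(?P<adv>{alt(ADVERBS)}))\s+(?P<ing>{alt(INGREDIENTS)})\b"
)
# Stage 4: an ingredient coordinated with an already annotated one.
NEXT = re.compile(rf"(?:,\s*a\s+zatim\s+|,\s*|\s+i\s+)(?P<ing>{alt(INGREDIENTS)})\b")


def label(m):
    """Map a stage-1 match to one of the labels used in Table 2."""
    if m.group("adv"):
        return "amount.exact"
    kind = "measure" if m.group("unit") else "amount"
    span = "range" if re.fullmatch(RANGE, m.group("num")) else "exact"
    return f"{kind}.{span}"


def stage1(text):
    """Return (ing_start, ing_end, label, quantity, ingredient) tuples."""
    return [(m.start("ing"), m.end("ing"), label(m),
             text[m.start():m.start("ing")].strip(), m.group("ing"))
            for m in STAGE1.finditer(text)]


def stage4(text, found):
    """Iteratively add ingredients that directly follow an annotated one."""
    found = list(found)
    starts = {f[0] for f in found}
    queue = list(found)
    while queue:
        end = queue.pop()[1]
        m = NEXT.match(text, end)
        if m and m.start("ing") not in starts:
            new = (m.start("ing"), m.end("ing"), "amount.empty", "", m.group("ing"))
            starts.add(new[0])
            found.append(new)
            queue.append(new)
    return sorted(found)


text = "malo kvasca, a zatim so, šećer i brašno"
for _, _, tag, qty, ing in stage4(text, stage1(text)):
    print(tag, repr(qty), ing)
```

Each newly added ingredient is pushed back onto the queue, which mimics the iterative application of the fourth FST: so is found because it follows kvasca, šećer because it follows so, and so on.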

5. Evaluation

The results of automatic tagging were evaluated for the whole text.[5] The task of the evaluators was formulated as follows: they were to add the attribute/value pair check="OK" to every tag for which everything was recognized and tagged correctly. The attribute value nok was assigned to a tag if something not representing a recipe ingredient was tagged, while the value uok was assigned if the scope of the tag was not correct: something was missing or something was erroneously added. If some ingredient was completely missing, the evaluators had to add the missing tag with the attribute check="MISS". For example, all tags in Figure 2 are correct and only one is missing, so the evaluator had to add a tag (Figure 3). All values except ok could be further qualified: a qualifier e meant that a tag was not correct or was missing due to some error in the input text (OCR error, incorrect spelling, etc.). In addition, for the value uok the added qualifier a meant that something was not correct with the quantity, while s meant that something was not correct with the ingredient.

[5] The evaluation was done by students of LIS at the Faculty of Philology, University of Belgrade, in the scope of the course Information Retrieval during the school year 2015/16. Some students dropped the course, which is why a small portion of the tagged text was not evaluated.

xxx aleve paprike

Figure 3: A tag missing from the list in Fig. 2

first evaluation
measure          ok       nok      uok      miss
measure.exact    80.5%    6.4%     7.1%     6.0%
measure.range    88.6%    0.3%     3.3%     7.8%
amount.exact     85.2%    5.2%     4.7%     5.0%
amount.range     83.2%    9.9%     3.6%     3.3%
amount.empty     42.9%    6.4%     7.9%     42.9%

second evaluation
measure          ok       nok      uok      miss
measure.exact    92.7%    1.7%     3.4%     2.2%
measure.range    90.9%    0.0%     0.0%     9.1%
amount.exact     87.7%    7.0%     3.7%     1.6%
amount.range     78.6%    21.4%    0.0%     0.0%
amount.empty     64.6%    7.2%     4.3%     24.0%

Table 2: Evaluation results in recognizing various types of measure and amount expressions

For 321 recipes out of 2,561 evaluated, all ingredients were correctly extracted (12.5%), and for 521 all ingredients were extracted, though maybe not all of them completely correctly (20%). For 71 recipes no ingredient was extracted correctly (2.8%). Above 80% of measure and amount expressions with ingredients were successfully recognized (Table 2), except where ingredients were mentioned without any amount. In that case, fewer than half of the ingredients were recognized.

The results of the first evaluation by recipe type are presented in Table 3. When calculating recall and precision we have taken the strict approach: true positives were ingredients annotated with ok, false positives those annotated with nok, and false negatives those annotated either with miss or uok. All these results should be taken cautiously, as the analysis of the evaluation showed that the evaluators (students) were not meticulous in performing their task. Also, in some cases they did not understand the instructions properly, e.g. the use of qualifiers. The instructions were not precise enough in some cases, so students were not sure how to deal with them. The most frequent uncertainties arose from: (a) the treatment of ‘water’ as an ingredient; (b) the treatment of repetitions of some ingredients, e.g. first with an amount and then without it; (c) the treatment of indirect mentions of ingredients, e.g. the verb posoliti ‘to salt’ implies ‘salt’ as an ingredient; (d) the treatment of optional or alternative ingredients. Some students tagged such cases as missing if the ingredient was not in the list, while others tagged them as incorrect if it was present.

Category        Recall    Precision    F1
starters        0.710     0.932        0.806
soups           0.720     0.929        0.811
noodles         0.824     0.994        0.901
stews           0.769     0.917        0.837
roasted meat    0.606     0.943        0.737
sauces          0.768     0.931        0.842
salads          0.813     0.929        0.867
dumplings       0.895     0.808        0.849
pasta           0.890     0.837        0.863
pies            0.641     0.892        0.746
tortes          0.767     0.901        0.829
cakes           0.796     0.976        0.877
desserts        0.841     0.925        0.881
sweets          0.886     0.903        0.895
drinks          0.785     0.939        0.855
pickles         0.600     0.784        0.680
total           0.735     0.918        0.816

Table 3: The results of the first evaluation presented by types of recipes; for most categories P > R

Although the results of the first evaluation were not completely reliable, they were nevertheless useful, as they helped us to discover the most frequent problems and to try to correct them. The problems that we addressed in this phase were:

1. The e-text was corrected (errors present in the source text as well as those introduced by OCR).
2. Lexical resources were improved, which consisted of enhancing dictionaries with new entries (10 simple- and 25 multi-word entries) and improving existing ones, e.g. by adding or replacing semantic markers, correcting inflectional paradigms of simple- and multi-word lemmas, etc.
3. The set of approximate measures was improved: a number of new entries were added (3 simple- and 9 multi-word); also, during the first round of evaluation it was established that some approximate measures were not useful for this task, mostly because they were words with many meanings, so they were excluded: vrh ‘top’, deo ‘part’, kraj ‘end’, red and sloj ‘layer’.
4. Local grammars that recognize amounts, measures and ingredients, as well as the overall structure of ingredient quantities, were enhanced and improved. For instance, a structure ingredient_quantity, as in mekan komad telećeg mesa u težini 3/4 kilograma ‘a soft piece of veal meat weighing 3/4 kilograms’, which was not covered by the initial grammars, was added. Also, we added the recognition of ingredients through indirect mentions. In order to do that we established a list of verbs used for this purpose, e.g. šećeriti ‘to sugar’ and kiseliti ‘to acidify’. For this last task we added a fifth local grammar to our cascade.

The second round of evaluation was performed on a part of the whole text that consisted of 10 randomly chosen recipes of each type (a total of 160 recipes). For this evaluation we adopted in advance some rules that eliminated the previously observed uncertainties: (a) water was not treated as an ingredient, as it is presumably present in every household; (b) repetitions of the same ingredients were marked with a new qualifier d added to the value ok (if otherwise correct); (c) indirect mentions of ingredients were tagged, and marked with the value miss if missing; (d) optional and alternative ingredients were tagged, and marked with the value miss if missing; their special status was not recorded.

Category        Recall    Precision    F1
starters        0.750     0.875        0.808
soups           0.900     0.933        0.915
noodles         0.879     0.944        0.911
stews           0.894     0.949        0.921
roasted meat    0.742     0.862        0.798
sauces          0.857     0.947        0.900
salads          0.896     0.945        0.920
dumplings       0.859     0.924        0.890
pasta           0.913     0.875        0.894
pies            0.875     0.824        0.848
tortes          0.889     0.923        0.906
cakes           0.930     0.922        0.926
desserts        0.842     0.914        0.877
sweets          0.933     0.954        0.943
drinks          0.750     0.846        0.795
pickles         0.780     0.696        0.736
total           0.863     0.903        0.883

Table 4: The results of the second evaluation; precision and recall are more balanced

The results of the second evaluation are presented in Table 4. Recall and precision were calculated as before, with a few small modifications: (a) tags marked as uok or miss with the modifier e were treated as correct if the sequence was not correctly recognized due to an error in the input text; (b) tags marked as ok or uok with the modifier d were skipped as they, being repetitions, can be filtered out in post-processing. The recognition of ingredient quantities (except for the range of amounts) was better, with the highest improvement achieved for the amount.empty tag (Table 2).

The analysis of measurement expressions in our evaluation sample showed that out of 609 such tags, 324 used standard measure units while the other 285 used 49 different approximate measures. The most frequently used is kašika ‘spoon’ with its synonyms velika kašika ‘big spoon’, kašika za jelo ‘spoon for dishes’ and kašika za supu ‘spoon for soups’; it appears 120 times. Among 361 amount expressions, 232 were numerical while 129 were adverbs (12 different ones), the most frequent of them being malo ‘little’ (91) and dosta ‘lot’ (12). Among these, some expressions typical of the culinary domain were used: na vrh noža ‘at the tip of the knife’, po volji ‘as desired’, po ukusu ‘according to taste’.
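The scores reported in Tables 3 and 4 can be related to the evaluators' tags with a short Python sketch. It is a minimal illustration rather than the evaluation script used for this paper: the function names are hypothetical, the counts in the usage example are invented, and counting both uok and miss tags as false negatives is an assumption of the sketch about the strict approach. The second part only recomputes F1 as the harmonic mean of the published precision and recall totals.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(ok: int, nok: int, uok: int, miss: int):
    """Strict scoring from counts of evaluator tags.

    Assumption of this sketch: ok = true positive, nok = false positive,
    uok and miss = false negatives.
    """
    tp, fp, fn = ok, nok, uok + miss
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_score(precision, recall)


# Invented counts, only to show the call.
p, r, f = precision_recall_f1(ok=80, nok=5, uok=10, miss=5)
print(f"P={p:.3f} R={r:.3f} F1={f:.3f}")

# Recomputing the "total" F1 values from the published precision and recall.
print(round(f1_score(0.918, 0.735), 3))  # Table 3 total: 0.816
print(round(f1_score(0.903, 0.863), 3))  # Table 4 total: 0.883
```

Under this scoring a uok tag lowers recall but not precision, which is one possible reading of why precision exceeds recall for most recipe types in the first evaluation.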

6. Conclusions

In this paper we presented the results we obtained in preparing a modernized version of an old, much appreciated cookbook. There are still things to be done: linking ingredient names that are less used today with their synonyms (through the Serbian WordNet) and replacing others (for instance, some colorants forbidden for use today). However, all this is part of a much broader project of collecting a large and versatile corpus of culinary books (and other domain texts) and producing lexical resources and tools for their processing.

Acknowledgments

This research was supported by the Serbian Ministry of Science under grants #178006 and #47003.

7. References

Bautista, Susana, Raquel Hervas, and Pablo Gervas, 2013. Accessible Numerical Information: Cookery Recipes as a Special Case. In 4th Int. Conf. Information and Communication Technology and Accessibility (ICTA). IEEE.
Bekavac, Božo, Željko Agić, Krešimir Šojat, and Marko Tadić, 2009. Detecting Measurement Expressions Using NooJ. In NooJ 2009 International Conference and Workshop.
Constant, Matthieu, 2009. Microsyntax of Measurement Phrases in French: Construction and Evaluation of a Local Grammar. In 8th International Workshop FSMNLP.
Ekbal, Asif and Sivaji Bandyopadhyay, 2009. A Conditional Random Field Approach for Named Entity Recognition in Bengali and Hindi. Linguistic Issues in Language Technology, 2(1):1–44.
Friburger, Nathalie and Denis Maurel, 2004. Finite-State Transducer Cascades to Extract Named Entities in Texts. Theoretical Computer Science, 313(1):93–104.
Gross, Maurice, 1999. A Bootstrap Method for Constructing Local Grammars. In N. Bokan (ed.), Proceedings of the Symposium on Contemporary Mathematics. University of Belgrade.
Habash, Nizar and Ryan Roth, 2008. Identification of Naturally Occurring Numerical Expressions in Arabic. In LREC.
Krstev, Cvetana, Ivan Obradović, Miloš Utvić, and Duško Vitas, 2014. A System for Named Entity Recognition Based on Local Grammars. Journal of Logic and Computation, 24(2):473–489.
Krstev, Cvetana, Staša Vujičić Stanković, and Duško Vitas, 2014. Approximate Measures in the Culinary Domain: Ontology and Lexical Resources. In Proceedings of the 9th Language Technologies Conference IS-LT 2014, pages 38–43.
Saleh, Iman, Lamia Tounsi, and Josef van Genabith, 2011. ZamAn and Raqm: Extracting Temporal and Numerical Expressions in Arabic. In Asia Information Retrieval Symposium. Springer.
Sekine, Satoshi, Kiyoshi Sudo, and Chikashi Nobata, 2002. Extended Named Entity Hierarchy. In LREC.
Vujičić Stanković, Staša, Cvetana Krstev, and Duško Vitas, 2014. Enriching Serbian WordNet and Electronic Dictionaries with Terms from the Culinary Domain. In Proceedings of the 7th GWC 2014.