Introduction to Computational Linguistics

Introduction to Computational Linguistics Fiorella C. Dotti UAM University of Salamanca

What is Computational Linguistics?

ACL definition: “The study of language from a computational perspective.”

An area of knowledge that combines theoretical and applied linguistics, statistics, computer science and mathematics, among other fields, in order to further our understanding of natural language and help us develop new language technologies.

What is Computational Linguistics?

The field is closely related to Natural Language Processing (NLP). The relationship between NLP and Computational Linguistics has been compared to the one between engineering and science: Computational Linguistics is more concerned with causes and origins, while NLP is more concerned with direct application.

What is Computational Linguistics?

In our area of study, the two fields constantly overlap. Most likely, we are interested in both things: improving recognition and finding out the underlying cause.

Approaches to CL and NLP

The first approaches were mostly rule-based. Rule-based approaches typically make intensive use of hand-crafted resources, and creating these resources is expensive.

Approaches to CL and NLP

Then statistical approaches came into use. They often do not rely on as much hand-crafted information as rule-based approaches, which makes them cheaper and, as a result, more popular.

Approaches to CL and NLP

Nevertheless, statistical approaches only work well for cases that are very frequent, so a combined approach (rule-based + statistical) is rising in popularity.

Some results of CL + NLP

❖ Speech recognition systems (NOT the same as voice recognition)
❖ Search engines
❖ Automatic ontology creation
❖ Automatic correction systems
❖ Sentiment analysis
❖ Automatic summarization
❖ Machine translation
❖ Automated natural language generation
❖ Natural language understanding

Speech recognition systems

The most common example is using a search engine or a digital assistant by speaking to your phone. Speech recognition systems use statistical techniques such as Hidden Markov Models to calculate the probability that one phoneme will be followed by another and to identify the most likely intended word or sentence.
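
The idea of "the probability that one unit will be followed by another" can be illustrated by counting pairs in a small sample. The sketch below is not a Hidden Markov Model and works on words rather than phonemes; the toy corpus and the function name `next_word_probability` are assumptions made for illustration only.

```python
from collections import Counter

# Toy corpus; a real system would be trained on far more data.
tokens = "the cat sat on the mat the cat slept".split()

# Count single words and adjacent word pairs (bigrams).
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

def next_word_probability(previous, following):
    """Estimate P(following | previous) from raw counts."""
    if unigram_counts[previous] == 0:
        return 0.0
    return bigram_counts[(previous, following)] / unigram_counts[previous]

p_cat = next_word_probability("the", "cat")
p_mat = next_word_probability("the", "mat")
print(p_cat, p_mat)
```

In this sample, "cat" follows "the" twice as often as "mat" does, so a system choosing between the two would prefer "cat".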

The photo depicts the launch of STS-26 (September 1988), the first return to flight mission after the Challenger accident. This was the first shuttle mission to use a non-critical speech recognition system. Weightlessness affected the astronaut’s articulation, so that templates created on the ground were ineffective, while templates that were created in microgravity were highly effective (as long as personal templates were created as well).

Another possible aerospace application

This video captures a real conversation between a hypoxic pilot and air traffic controllers. The pilot is physically and cognitively unable to control the plane effectively and can only respond to direct instructions. Do you think a speech recognition system could have helped him? How?

https://www.youtube.com/watch?v=_IqWal_EmBg

Search Engines

Like speech recognition systems, search engines calculate the probability of one word being followed by another, as well as the probability that a word refers to one topic rather than another (an area of study known as word sense disambiguation). They also identify keywords (something that Search Engine Optimization makes use of) and try to repair user error (e.g., typing “machne learning” would return suggested results for “machine learning”).
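
Repairing a misspelled query can be sketched with Python's standard `difflib` module, which ranks candidate words by string similarity. This is only an illustration of the idea, not how any particular search engine works; the vocabulary and the helper name `suggest` are hypothetical.

```python
import difflib

# Hypothetical vocabulary of known search terms.
vocabulary = ["machine", "learning", "matching", "marine"]

def suggest(word, known_words=vocabulary):
    """Return the closest known word, or the input unchanged if none is close."""
    matches = difflib.get_close_matches(word, known_words, n=1)
    return matches[0] if matches else word

query = "machne learning"
suggestion = " ".join(suggest(w) for w in query.split())
print(suggestion)
```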

Automatic Ontology creation

An ontology is a formal framework that we can use to represent knowledge. Natural language understanding and keyword extraction techniques, among others, can be used to extract pieces of information and their relationships to one another (e.g., an ontology built from Wikipedia → DBpedia).

Automatic error correction systems

Nowadays, it is not unusual to teach a class whose students have five or more different mother tongues. New European standards also demand learner autonomy. This is a very hard situation for teachers.

Automatic error correction systems

Automatic error correction systems help because:
1. They are always available, so students can practise at any time.
2. They are not limited to one native language: they can detect and trace errors from students with different L1s.

Sentiment analysis

Big companies invest large amounts of money in obtaining information about their customers. One of the main ways to do so is by monitoring and participating in social networks. “Community Managers” cannot stay up to date, in real time, on absolutely everything related to the brand.

Sentiment analysis

If a program can detect customers’ opinions and understand how they see the brand’s competitors, marketing campaigns can be finely targeted. There are many possible methods to use in this area (bag of words, Support Vector Machines, etc.). Many companies offer services in this area.
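
The bag-of-words idea mentioned above can be sketched in a few lines: a text is reduced to word counts, and the counts are compared against lists of positive and negative words. The two word lists below are a hypothetical mini-lexicon, and the scoring rule is a deliberately naive assumption; real systems use much larger resources or learned weights.

```python
from collections import Counter

# Hypothetical mini-lexicon, for illustration only.
POSITIVE = {"love", "great", "fast"}
NEGATIVE = {"hate", "slow", "broken"}

def bag_of_words(text):
    """Lowercase the text, strip basic punctuation, and count each word."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return Counter(w for w in words if w)

def polarity(text):
    """Positive minus negative word counts: above 0 is positive, below 0 negative."""
    bag = bag_of_words(text)
    positives = sum(bag[w] for w in POSITIVE)
    negatives = sum(bag[w] for w in NEGATIVE)
    return positives - negatives

score_a = polarity("I love this phone, the camera is great!")
score_b = polarity("Delivery was slow and the screen arrived broken.")
print(score_a, score_b)
```

Word order is ignored entirely, which is why this representation is called a "bag".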

Machine translation

Machine translation automatically translates a text from its source language into a target language. This is how CL started: in the 1950s, American defense agencies wanted to be able to translate scientific articles from Russian into English. Russian agencies were trying to do the same.

Machine Translation

Current examples include, most famously, Google Translate, which can detect the source language automatically. Most systems use dictionaries and parallel corpora (a corpus of texts in one language and their corresponding translations into other languages). Rule-based techniques provide a syntactic basis, while statistical techniques help with false-friend detection.

Natural Language Generation

The creation of natural language by a machine. We can determine how close the output is to what a human would say by using a series of tests. One of the best-known tests for this purpose is the Turing test.

Natural Language Generation

Often used to create a more “user-friendly” experience (databases, Q&A systems). Also useful for improving accessibility for users with disabilities that prevent them from speaking, reading, etc.

Natural Language Understanding

A ‘smarter’ computer: it entails not only being able to ‘read’ the text, but also making logical inferences from it. It is present in the technologies that we have reviewed, though it is also an area of research in its own right.

How do these systems work?

Several components:
A. Statistical systems
B. Linguistic systems
C. Programming
D. Extra resources (of any type)

Statistical systems: a quick intro

There are many statistical methods, but in essence most of them count the number of instances of a particular phenomenon, together with the circumstances surrounding it, and decide how likely it is that the phenomenon occurred by chance.

Statistical systems: a quick intro

Central to this idea is the concept of statistical significance: a result is statistically significant if it is unlikely to have happened by chance alone. There are tables of critical values that allow researchers to identify whether a result is statistically significant.
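
As one possible illustration of comparing a result against a table value, the sketch below computes a Pearson chi-square statistic for a 2x2 table of counts and compares it with 3.841, the standard critical value for 1 degree of freedom at the 0.05 level. The counts are made up, and the choice of the chi-square test is an assumption; the slides do not name a specific test.

```python
def chi_square_2x2(table):
    """Pearson chi-square statistic for a 2x2 table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count we would expect if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    return statistic

# Made-up counts: rows are two corpora, columns are
# (sentences with a construction, sentences without it).
observed = [[30, 70],
            [10, 90]]

CRITICAL_VALUE = 3.841  # chi-square table, 1 degree of freedom, alpha = 0.05

statistic = chi_square_2x2(observed)
is_significant = statistic > CRITICAL_VALUE
print(statistic, is_significant)
```

With these counts the statistic is well above the critical value, so the difference between the two corpora would be unlikely to arise by chance alone.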

Linguistic systems

Rules that are derived from linguistic knowledge, e.g.:
Example sentence: “He are busy”
Linguistic rule: the third person singular of the verb ‘to be’ is “is”.
Therefore, the sentence is incorrect.
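
A rule like the one above could be written down as a small lookup table plus a check. The sketch below assumes, for simplicity, that the subject pronoun is the first word and the verb is the second; the table `BE_FORMS` and the function `check_be_agreement` are hypothetical names, and a real rule-based system would need proper parsing.

```python
# Present-tense forms of 'to be' required by each subject pronoun.
BE_FORMS = {
    "i": "am",
    "he": "is", "she": "is", "it": "is",
    "we": "are", "you": "are", "they": "are",
}

def check_be_agreement(sentence):
    """Return None if the sentence passes the rule, else a suggested correction."""
    words = sentence.split()
    if len(words) < 2:
        return None
    subject, verb = words[0].lower(), words[1].lower()
    expected = BE_FORMS.get(subject)
    # The rule only applies to a known pronoun followed by a form of 'to be'.
    if expected is None or verb not in {"am", "is", "are"} or verb == expected:
        return None
    return " ".join([words[0], expected] + words[2:])

correction = check_be_agreement("He are busy")
print(correction)
```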

Programming

The backbone and glue of it all. It is not necessarily innovative; sometimes it just acts as a facilitating medium (you wouldn’t be able to process a 3,000,000-word corpus without some programming involved).

A programming primer

Computers are best suited to processing numerical information and can also deal with logical representations. They only appear to handle more advanced concepts because we run programs on top of more basic programs. The trick lies in reducing everything to its simplest expression so that the computer can understand it.

Example: Turn Off The Light

We could program a robot to walk down the aisle and turn off the light. The problem is that the robot’s computer cannot work with that instruction right away. We must break the problem into smaller pieces of information.

Example: Turn Off The Light

1. Ascertain that the light is on.
2. Walk down the aisle.
3. Find the light switch.
4. Press the light switch.

Example: Turn Off The Light

2. Walk down the aisle
★ Error: What is ‘walk’?
★ Error: What is ‘the aisle’?
★ Error: How do I know which way is down?

Example: Turn Off The Light

Walk = put one foot in front of the other in a straight line.
The aisle = where you will be walking. It is a straight line with walls on the sides (use computer vision, or sensors?).
Down = you will know the way down because it is further from where you are now (or there is light, or, if it is physically down, you can use your oscillometer).

Example: Turn Off The Light

We need to do this for each one of our subdivisions. Each one of these smaller steps would be defined in code: walking would be a function, the aisle would be a variable, etc. All the steps would ultimately become one function: turn off the light. The user would probably only see a button that lets them give the robot that order.
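
The decomposition could look roughly like the sketch below, where each small step is its own function and one top-level function combines them. There is no real robot here: the room and the aisle are simulated with plain dictionaries, and every name is a hypothetical stand-in.

```python
def light_is_on(room):
    return room["light_on"]

def walk_down(aisle):
    """Return the positions visited, one step at a time."""
    return list(range(aisle["length"] + 1))

def find_switch(room):
    return room["switch_position"]

def press_switch(room):
    room["light_on"] = not room["light_on"]

def turn_off_the_light(room, aisle):
    """The single command the user sees, built from the smaller steps."""
    if not light_is_on(room):      # 1. Ascertain that the light is on
        return "light already off"
    walk_down(aisle)               # 2. Walk down the aisle
    find_switch(room)              # 3. Find the light switch
    press_switch(room)             # 4. Press the light switch
    return "light turned off"

room = {"light_on": True, "switch_position": 5}   # simulated room
aisle = {"length": 5}                             # the aisle is a variable
result = turn_off_the_light(room, aisle)
print(result, room["light_on"])
```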

Python

Python is a good programming language for someone in computational linguistics:
1. Easy to learn. Reads almost like English.
2. Good for prototyping.
3. Good community. Many projects in CL and NLP use Python, and it is a de facto industry standard.
4. ‘Batteries included’ → lots of functionality right out of the box.

Data types

Data types exist in all programming languages. They are ways to store data. Some of the most widely used ones are:
➢ Strings
➢ Lists
➢ Dictionaries (‘mappings’)
➢ Tuples
➢ Numbers
➢ Sets

Data types: Strings

Strings are immutable (they never change). They can contain digits as well as text, but digits inside a string are treated as text, not as numbers.

    a_string = "The University of Salamanca"
    b_string = "12345"
    c_string = "10000"

Data types: Strings

If we add strings together, we only get concatenated strings (one goes after the other; the numbers are not treated as such):

    b_string + c_string    # "1234510000"
    a_string + b_string    # "The University of Salamanca12345"
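
A short sketch contrasting concatenation with numeric addition, reusing the variable names from the slides. Converting with `int()` is one way to make Python treat the digits as numbers; the `upper()` call is added only to show that string methods return a new string and leave the original untouched.

```python
a_string = "The University of Salamanca"
b_string = "12345"
c_string = "10000"

concatenated = b_string + c_string            # text joined end to end
numeric_sum = int(b_string) + int(c_string)   # converted to integers first

print(concatenated)
print(numeric_sum)

# Strings are immutable: methods return a new string, the original is unchanged.
shouted = a_string.upper()
print(a_string)
print(shouted)
```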

Data types: Strings

Strings are best used for information that won’t change and that should be kept in a certain order, e.g., a sentence:

    some_string = "The man that you saw yesterday is asking me for help."

Data type: List

A list is an ordered sequence of elements but, in contrast to a string, it is mutable: its order and length can change.

    shopping_list = ["milk", "cereal", "bacon"]

We can iterate over a list, that is, we can treat each one of its elements on its own. We can also sort it alphabetically and do many other things.
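
The operations just mentioned can be tried on the same shopping list. The extra item "apples" is an assumption added for the example.

```python
shopping_list = ["milk", "cereal", "bacon"]

# Lists are mutable: their length can change.
shopping_list.append("apples")

# Iterating treats each element on its own.
for item in shopping_list:
    print(item)

# Sorting in place changes the order (alphabetical for strings).
shopping_list.sort()
print(shopping_list)

# Removing the first element shortens the list again.
removed = shopping_list.pop(0)
print(removed, shopping_list)
```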

Data type: Tuple

A tuple is an ordered, immutable sequence of elements, usually a small one.

    tuple_bigram = ('a', 'walk')

Tuples are useful when the order of the elements matters and the elements should be kept together as a single fixed unit.
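
As a sketch of why tuples suit bigrams, the example below pairs each word of a made-up sentence with the word that follows it. Because tuples are immutable, they can also serve as dictionary keys, which is convenient for counting.

```python
tokens = "a walk in the park".split()

# Pair each word with its successor; zip yields tuples.
bigrams = list(zip(tokens, tokens[1:]))
print(bigrams)

# Tuples can be used as dictionary keys, for example to count bigrams.
counts = {}
for bigram in bigrams:
    counts[bigram] = counts.get(bigram, 0) + 1

# Trying to change a tuple raises an error.
first = bigrams[0]
try:
    first[0] = "the"
except TypeError as error:
    print("Tuples cannot be modified:", error)
```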

Data type: Dictionary

Dictionaries are similar to real-life dictionaries: each entry has a key and a value. The catch is that a Python dictionary cannot hold repeated keys and stores only one value per key, so a word with multiple translations becomes problematic.

    dictionary = {'map': 'mapa', 'cat': 'gato'...}
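
One common workaround for the one-value-per-key limitation, shown here as a sketch rather than as the approach the slides intend, is to store a list as the value. The extra entries ("gata", "bank", "banco", "orilla") are assumptions added for the example.

```python
dictionary = {"map": "mapa", "cat": "gato"}

# Assigning to an existing key replaces the old value.
dictionary["cat"] = "gata"
print(dictionary)

# Storing a list per key keeps several translations for one word.
translations = {"map": ["mapa"], "cat": ["gato"]}
translations["cat"].append("gata")

# setdefault creates the list the first time a new key is seen.
translations.setdefault("bank", []).append("banco")
translations["bank"].append("orilla")
print(translations)
```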

Data type: Numbers

Numbers can be integers, floats, complex numbers, etc. Integers and floats are the most commonly used. You can use numbers for any mathematical operation you want.

    my_integer = 1
    my_float = 2.0
    my_integer + my_float    # 3.0

Data type: Sets

Sets do not take duplicates. They are often used to remove duplicate instances:

    my_list = ['jam', 'toast', 'cereal', 'jam']
    my_set = set(my_list)
    # {'jam', 'toast', 'cereal'} (element order may vary)
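
A runnable version of the deduplication example. Sets do not preserve order, so the last lines show one way to drop duplicates while keeping the original order; that addition is not from the slides.

```python
my_list = ["jam", "toast", "cereal", "jam"]

# Converting to a set drops the repeated 'jam'.
my_set = set(my_list)
print(len(my_list), len(my_set))

# Membership tests work on sets as well.
print("toast" in my_set)

# dict.fromkeys keeps the first occurrence of each item, in order.
unique_in_order = list(dict.fromkeys(my_list))
print(unique_in_order)
```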

Data type: Booleans

Something is True or False:

    >>> my_favorite_animal = 'lion'
    >>> my_favorite_animal == 'giraffe'
    False

Iteration

To iterate means to go over elements in sequence.

    my_list = ['giraffe', 'lion', 'turtle']
    for element in my_list:
        print(element)

Output:

    giraffe
    lion
    turtle

Conditional Statements

If something is true, then do something. Else, do something else:

    if age [...] 5: print('still a while to go!') i = i - 1
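
The code on this slide is incomplete in the source, so the following is only a guess at the kind of example intended, not a reconstruction. It shows an `if`/`else` choice and a loop that repeats while a condition holds; the starting values, the threshold of 18 and the labels are all assumptions.

```python
age = 20  # hypothetical value

# If the condition is true, do one thing; else, do another.
if age < 18:
    status = "minor"
else:
    status = "adult"
print(status)

i = 8  # hypothetical starting value
messages = []

# The loop body repeats for as long as the condition stays true.
while i > 5:
    messages.append("still a while to go!")
    i = i - 1
print(len(messages), i)
```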

Functions

A collection of instructions that may or may not return a result:

    def count_to_ten(initial_number):
        while initial_number
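
The definition above is cut off in the source. The sketch below is one plausible completion, assuming the function counts upward from `initial_number` until it reaches 10; the loop condition, the printing and the returned list are assumptions, not the author's code.

```python
def count_to_ten(initial_number):
    """Print each number from initial_number up to 10 and return them as a list."""
    counted = []
    while initial_number <= 10:   # assumed stopping condition
        print(initial_number)
        counted.append(initial_number)
        initial_number += 1
    return counted

result = count_to_ten(7)
print(result)
```

Calling the function with a number above 10 returns an empty list, because the loop body never runs.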