Monte Carlo Simulation Generation Through Operationalization of Spatial Primitives

A Dissertation Presented to
The Faculty of the Graduate School of Arts and Sciences
Brandeis University
Department of Computer Science
Dr. James Pustejovsky, Advisor

In Partial Fulfillment of the Requirements for the Degree
Doctor of Philosophy

by Nikhil Krishnaswamy
August, 2017

The signed version of this form is on file in the Graduate School of Arts and Sciences. This dissertation, directed and approved by Nikhil Krishnaswamy’s committee, has been accepted and approved by the Graduate Faculty of Brandeis University in partial fulfillment of the requirements for the degree of:

DOCTOR OF PHILOSOPHY

Eric Chasalow, Dean of Arts and Sciences

Dissertation Committee:
Dr. James Pustejovsky, Chair
Dr. Kenneth D. Forbus, Dept. of Electrical Eng. and Comp. Sci., Northwestern University
Dr. Timothy J. Hickey, Dept. of Computer Science, Brandeis University
Dr. Marc Verhagen, Dept. of Computer Science, Brandeis University

© Copyright by Nikhil Krishnaswamy 2017

For my father

Acknowledgments

Like a wizard, a thesis never arrives late, or early, but precisely when it means to; but it would never arrive at all without the people who helped it along the way.

First and foremost, I would like to thank Prof. James Pustejovsky for taking a chance on a crazy idea and tirelessly pursuing opportunities for the topic and for me in particular, for always taking time to discuss ideas and applications any time, any place (during the week, on the weekend, with beer, without beer, on three continents). And for, when I requested to stay on at Brandeis after completing my Master's degree, writing a letter of recommendation on my behalf, to himself. That story usually kills.

I would like to thank my committee members: Dr. Marc Verhagen, for hours of stimulating discussions; Prof. Tim Hickey, for letting me poach his animation students to help me develop the simulation software; and Prof. Ken Forbus, with whom it is an honor to have the opportunity to share my research. Additionally, thank you all for your perceptive, thorough, and insightful feedback in molding the draft copy of this thesis into its final form.

To my friends and family, thank you; particularly to my wife, Heather, for letting this shady roommate of an idea move in with us; to my mother and stepfather, Uma Krishnaswami and Satish Shrikhande, for their unwavering faith and support—didn't I say not to worry? I promised I could handle it.

To the student workers who contributed many enthusiastic hours developing VoxSim, thank you; particularly to Jessica Huynh, Paul Kang, Subahu Rayamajhi, Amy Wu, Beverly Lum, and Victoria Tran. I'd probably have quit long ago without you.

To the community of Unity developers, whose collective knowledge I spent hours reading online. Special thanks to Neil Mehta of LaunchPoint Games for prompt response and service helping me debug his plugin.

To the faculty and staff of the Brandeis Master's program in Computational Linguistics, particularly to Prof. Lotus Goldberg for her consistent good wishes, cheer, advice, and ongoing interest in this thesis, and to all the students I've had the pleasure of getting to know along the way. Clearly there's something special going on here because I've done everything I can to avoid leaving.

Without Dr. Paul Cohen and the Communicating with Computers project at DARPA, this research would likely never have gotten off the ground. Working on the program I have enjoyed, and continue to enjoy, many fruitful collaborations that drove, and sometimes forced, development on VoxSim and have really put the software through a stress test. In this regard, I'd like to give particular thanks to Prof. Bruce Draper and his group at Colorado State University. Additional thanks also goes to Eric Burns at Lockheed Martin, for his advice and encouragement during the early part of my Ph.D., where I learned to spot opportunities as they arose, and seize them before time ran out.

This work was supported by Contract W911NF-15-C-0238 with the U.S. Defense Advanced Research Projects Agency (DARPA) and the Army Research Office (ARO). Approved for Public Release, Distribution Unlimited. The views expressed herein are mine and do not reflect the official policy or position of the Department of Defense or the U.S. Government. All errors and mistakes are, of course, my own.

I would like to dedicate this research and its culmination, this dissertation, to the memories of those family who passed on during my time in graduate school: my grandfather, Mr. V. Krishna Swami Iyengar; my grandmother, Mrs. Hema Krishnaswamy; my stepbrother, Kedar Shrikhande; and my father, Dr. Sumant Krishnaswamy.

Abstract

Monte Carlo Simulation Generation Through Operationalization of Spatial Primitives

A dissertation presented to the Faculty of the Graduate School of Arts and Sciences of Brandeis University, Waltham, Massachusetts, by Nikhil Krishnaswamy

Much existing work in text-to-scene generation focuses on generating static scenes, which leaves aside entire word classes such as motion verbs. This thesis introduces a system for generating animated visualizations of motion events by integrating dynamic semantics into a formal model of events, resulting in a simulation of an event described in natural language. Visualization, herein defined as a dynamic three-dimensional simulation and rendering that satisfies the constraints of an associated minimal model, provides a framework for evaluating the properties of spatial predicates in real time, but requires the specification of values and parameters that can be left underspecified in the model. Thus, there remains the matter of determining what, if any, the "best" values of those parameters are. This research explores a method of using a three-dimensional simulation and visualization interface to determine prototypical values for underspecified parameters of motion predicates, built on a game engine-based platform that allows the development of semantically-grounded reasoning components in areas at the intersection of theoretical reasoning and AI.

Contents

Abstract

1 Introduction
  1.1 Background
  1.2 Information-Theoretic Foundations
  1.3 Linguistic Underspecification in Motion Events
  1.4 Related Prior Work

2 Framework
  2.1 VoxML: Visual Object Concept Modeling Language
  2.2 VoxSim
  2.3 Spatial Reasoning
  2.4 Object Model
  2.5 Action Model
  2.6 Event Model
  2.7 Event Composition
  2.8 Specification Methods

3 Methodology and Experimentation
  3.1 Preprocessing
  3.2 Operationalization
  3.3 Monte Carlo Simulation
  3.4 Evaluation

4 Results and Discussion
  4.1 Human Evaluation Task 1 Results
  4.2 Human Evaluation Task 2 Results
  4.3 Automatic Evaluation Task Results
  4.4 Mechanical Turk Worker Response
  4.5 Summary

5 Future Directions
  5.1 Extensions to Methodology
  5.2 VoxML and Robotics
  5.3 Information-Theoretic Implications

A VoxML Structures
  A.1 Objects
  A.2 Programs
  A.3 Relations
  A.4 Functions

B Underspecifications

C [[TURN]]: Complete Operationalization

D Sentence Test Set

E Data Tables
  E.1 DNN with Unweighted Features
  E.2 DNN with Weighted Features
  E.3 DNN with Weighted Discrete Features
  E.4 DNN with Feature Weights Only
  E.5 Combined Linear-DNN with Unweighted Features
  E.6 Combined Linear-DNN with Weighted Features
  E.7 Combined Linear-DNN with Weighted Discrete Features
  E.8 Combined Linear-DNN with Feature Weights Only

F Publication History

List of Tables

2.1 Example voxeme properties
2.2 VoxML OBJECT attributes
2.3 VoxML OBJECT HEAD types
2.4 VoxML PROGRAM attributes
2.5 VoxML PROGRAM HEAD types
2.6 Example VoxML ATTRIBUTE scalar types

3.1 Test set of verbal programs and objects
3.2 Program test set with underspecified parameters
3.3 Number of videos captured per motion predicate

4.1 Acceptability judgments and statistical metrics for "move x" visualizations, conditioned on respecification predicate
4.2 Acceptability judgments and statistical metrics for "turn x" visualizations, conditioned on respecification predicate
4.3 Acceptability judgments and statistical metrics for unrespecified "turn x" visualizations, conditioned on rotation angle
4.4 Acceptability judgments and statistical metrics for "roll x" visualizations, conditioned on path length
4.5 Acceptability judgments and statistical metrics for "slide x" visualizations, conditioned on translocation speed
4.6 Acceptability judgments and statistical metrics for "spin x" visualizations respecified as "roll x," conditioned on path length
4.7 Acceptability judgments for unrespecified "spin x" visualizations, conditioned on rotation axis
4.8 Acceptability judgments and statistical metrics for "lift x" visualizations, conditioned on translocation speed and distance traversed
4.9 Acceptability judgments and statistical metrics for "put x touching y" visualizations, conditioned on relations between x and y at event start and completion
4.10 Acceptability judgments and statistical metrics for "put x touching y" visualizations, conditioned on x movement relative to y
4.11 Acceptability judgments and statistical metrics for "put x near y" visualizations, conditioned on distance between x and y at event start and completion
4.12 Acceptability judgments and statistical metrics for "put x near y" visualizations, conditioned on start and end distance intervals between x and y
4.13 Acceptability judgments and statistical metrics for "put x near y" visualizations, conditioned on distance between x and y and POV-relative orientation at event completion
4.14 Acceptability judgments and statistical metrics for "lean x" visualizations, conditioned on rotation angle
4.15 Acceptability judgments and statistical metrics for "flip x" visualizations, conditioned on rotation axis and symmetry axis
4.16 Acceptability judgments for "close x" visualizations, conditioned on motion manner
4.17 Acceptability judgments for "open x" visualizations, conditioned on motion manner
4.18 Probabilities and statistical metrics for selection of "move" predicate for "move x" event, conditioned on respecification predicate
4.19 Probabilities and statistical metrics for selection of "turn" predicate for "turn x" visualizations, conditioned on respecification predicate
4.20 Probabilities and statistical metrics for selection of "turn" predicate for unrespecified "turn x" visualizations, conditioned on rotation angle
4.21 Probabilities and statistical metrics for selection of "roll" predicate for "roll x" visualizations, conditioned on path length
4.22 Top 3 most likely predicate choices for "roll x" visualizations, conditioned on path length
4.23 Probabilities and statistical metrics for selection of "slide" predicate for "slide x" visualizations, conditioned on path length and translocation speed
4.24 Probabilities and statistical metrics for selection of "spin" predicate for "spin x" visualizations respecified as "roll x," conditioned on path length
4.25 Probabilities for selection of "spin" predicate for unrespecified "spin x" visualizations, conditioned on rotation axis
4.26 Probabilities and statistical metrics for selection of "lift" predicate for "lift x" visualizations, conditioned on translocation speed and distance traversed
4.27 Probabilities and statistical metrics for selection of "put on/in" predicate for "put x on/in y" visualizations, conditioned on translocation speed
4.28 Probabilities and statistical metrics for selection of "put touching" predicate for "put x touching y" visualizations, conditioned on relative orientation between x and y at event completion
4.29 Probabilities and statistical metrics for selection of "put near" predicate for "put x near y" visualizations, conditioned on distance traveled
4.30 Probabilities and statistical metrics for selection of "lean on" predicate for "lean x on y" visualizations, conditioned on rotation angle
4.31 Probabilities and statistical metrics for selection of "lean against" predicate for "lean x against y" visualizations, conditioned on rotation angle
4.32 Probabilities and statistical metrics for selection of "flip on edge" predicate for "flip x on edge" visualizations, conditioned on rotation axis and symmetry axis
4.33 Probabilities and statistical metrics for selection of "flip at center" predicate for "flip x at center" visualizations, conditioned on rotation axis and symmetry axis
4.34 Top 3 most likely predicate choices for "flip x {on edge, at center}" visualizations, conditioned on rotation axis and symmetry axis
4.35 Probabilities for selection of "close" predicate for "close x" visualizations, conditioned on motion manner
4.36 Top 3 most likely predicate choices for "close x" visualizations, conditioned on motion manner
4.37 Probabilities for selection of "open" predicate for "open x" visualizations, conditioned on motion manner
4.38 Top 3 most likely predicate choices for "open x" visualizations, conditioned on motion manner
4.39 Accuracy tables for baseline automatic evaluation

B.1 Underspecified parameters and satisfaction conditions

E.1 Accuracy tables for "vanilla" DNN automatic evaluation
E.2 Accuracy tables for DNN automatic evaluation with weighted features
E.3 Accuracy tables for DNN automatic evaluation with weighted discrete features
E.4 Accuracy tables for DNN automatic evaluation with feature weights alone
E.5 Accuracy tables for linear-DNN automatic evaluation
E.6 Accuracy tables for linear-DNN automatic evaluation with weighted features
E.7 Accuracy tables for linear-DNN automatic evaluation with weighted discrete features
E.8 Accuracy tables for linear-DNN automatic evaluation with feature weights alone

List of Figures

2.1 3D model of a bowl
2.2 Bowl with box collider shown in green
2.3 [[PLATE]], an OBJECT
2.4 [[PUT]], a PROGRAM
2.5 [[SMALL]], an ATTRIBUTE
2.6 [[TOUCHING]], a RELATION
2.7 [[TOP]], a FUNCTION
2.8 VoxSim architecture schematic
2.9 Dependency parse for Put the apple on the plate and transformation to predicate-logic form
2.10 VoxML structure for [[CUP]] with associated geometry
2.11 VoxML structures for [[ON]] and [[IN]]
2.12 Execution of "put the spoon in the mug"
2.13 Orientation-dependent visualizations of "put the lid on the cup"
2.14 Context-dependent visualizations of "put the paper on the TV"
2.15 Object model of lifting and dropping an object
2.16 Action model of lifting and dropping an object
2.17 Event model of lifting and dropping an object
2.18 Execution of "put the yellow block on the red block" using embodied agent
2.19 End state of "lean the cup on the block"
2.20 Unsatisfied vs. satisfied "lean"
2.21 VoxML structure for [[LEAN]] (compositional)
2.22 Visualization of "switch the blocks"
2.23 VoxML structure for [[SWITCH]]
2.24 VoxML structure for [[SWITCH]] (unconstrained)
2.25 Abbreviated VoxML type structure for [[ROLL]]

3.1 VoxML and DITL for put(y,z)
3.2 VoxML and DITL for slide(y)
3.3 VoxML and DITL for roll(y)
3.4 VoxML and DITL for turn(y)
3.5 VoxML and DITL for move(y)
3.6 C# operationalization of [[TURN]] (abridged)
3.7 Test environment with all objects shown
3.8 Snapshot from video capture in progress
3.9 Automatic capture process diagram
3.10 "Densified" feature vector for "open the book"
3.11 Sparse feature vector for "move the grape"
3.12 Sparse feature vector for "put the block in the plate"
3.13 HET1 task interface
3.14 HET2 task interface

4.1 Baseline accuracy on restricted choice set
4.2 Baseline accuracy on unrestricted choice set
4.3 "Vanilla" DNN accuracy on restricted choice set
4.4 "Vanilla" DNN accuracy on unrestricted choice set
4.5 DNN with weighted features accuracy on restricted choice set
4.6 DNN with weighted features accuracy on unrestricted choice set
4.7 DNN with weighted discrete features accuracy on restricted choice set
4.8 DNN with weighted discrete features accuracy on unrestricted choice set
4.9 DNN with feature weights only accuracy on restricted choice set
4.10 DNN with feature weights only accuracy on unrestricted choice set
4.11 Linear-DNN accuracy on restricted choice set
4.12 Linear-DNN accuracy on unrestricted choice set
4.13 Linear-DNN with weighted features accuracy on restricted choice set
4.14 Linear-DNN with weighted features accuracy on unrestricted choice set
4.15 Linear-DNN with weighted discrete features accuracy on restricted choice set
4.16 Linear-DNN with weighted discrete features accuracy on unrestricted choice set
4.17 Linear-DNN with feature weights only accuracy on restricted choice set
4.18 Linear-DNN with feature weights only accuracy on unrestricted choice set
4.19 Word clouds depicting worker response to HET1
4.20 Word clouds depicting worker response to HET2

C.1 C# operationalization of [[TURN]] (unabridged)

Chapter 1

Introduction

The expressiveness of natural language is difficult to translate into visuals, and much existing work in text-to-scene generation has focused on creating static images, such as WordsEye (Coyne and Sproat, 2001), LEONARD (Siskind, 2001), and work by Chang et al. (2015). This research describes an approach, centered on motion verbs, that uses a rich formal model of events and maps from a natural language expression, through Dynamic Interval Temporal Logic (Pustejovsky and Moszkowicz, 2011), into a 3D animated visualization. Building on a method for modeling natural language predicates in a 3D environment (Pustejovsky and Krishnaswamy, 2014), a modeling language to encode semantic knowledge about entities described in natural language in a composable way (Pustejovsky and Krishnaswamy, 2016a), and a spatial reasoner to generate visual simulations involving novel objects and events, this thesis presents a system, VoxSim, that uses the real-world semantics of objects and events to generate animated scenes in real time, without the need for a prohibitively complex animation interface (Krishnaswamy and Pustejovsky, 2016a,b).

Semantic interpretation requires access to both knowledge about words and how they compose. As the linguistic phenomena associated with lexical semantics have become better understood, several assumptions have emerged across most models of word meaning. These include the following:

• Lexical meaning can be analyzed componentially, either through predicative primitives or a system of types;

• The selectional properties of predicators can be explained in terms of these components;

• An understanding of event semantics and the different roles of event participants seems crucial for modeling linguistic utterances.

Lexical semantic analysis, in both theoretical and computational linguistics, typically involves identifying features in a corpus that differentiate the data points in meaningful ways.[1] Combining these strategies, we might, for instance, posit a theoretical constraint that we hope to justify through behavioral distinctions in the data. An example of this is the theoretical claim that motion verbs can be meaningfully divided into two classes: manner- and path-oriented predicates (Jackendoff, 1983; Talmy, 1985, 2000). These constructions can be viewed as encoding two aspects of meaning: how the movement is happening and where it is happening. The former strategy is illustrated in (a) and the latter in (b) (where m indicates a manner verb, and p indicates a path verb).

(a) The ball rolled_m.
(b) The ball crossed_p the room.

With both verb types, adjunction can make reference to the missing aspect of motion, by introducing a path (as in (c)) or the manner of movement (as in (d)).

(c) The ball rolled_m across the room.
(d) The ball crossed_p the room rolling.

Differences in syntactic distribution and grammatical behavior in large datasets, in fact, correlate fairly closely with the theoretical claims made by linguists using small introspective datasets (Harris, 1954; Durand, 2009). The path-manner distinction is a case where data-derived classifications should correlate nicely with theoretically-inspired predictions. However, it is often the case that lexical semantic distinctions are formal stipulations in a linguistic model based on unclear correlations between predicted classes and underlying corpus data, leaving the possibility that these class groupings are either arbitrary or derived from inappropriate data. As an example, the manner of movement class from Levin (1993) gives drive, walk, run, crawl, fly, swim, drag, slide, hop, roll as manner of motion verbs, although it is unclear what underlying data is used to make this grouping. Assuming the two-way distinction between path and manner predication of motion mentioned above, these verbs do, in fact, tend to pattern according to the latter class in the corpus. Given that they are all manner of motion verbs, however, any data-derived distinctions that emerge within this class will have to be made in terms of additional syntactic or semantic dimensions. While it is most likely possible to differentiate, for example, the verbs slide from roll, or walk from hop in a corpus, given enough data, it is important to realize that conceptual and theoretical modeling is often necessary to reveal the factors that semantically distinguish such linguistic expressions in the first place.

This problem can be approached with the use of minimal model generation. As Blackburn and Bos (2008) point out, theorem proving (essentially type satisfaction of a verb in one class as opposed to another) provides a "negative handle" on the problem of determining consistency and informativeness for an utterance, while model building provides a "positive handle" on both. In this research, simulation construction provides a positive handle on whether two manner of motion processes are distinguished in the underlying model. Further, the simulation must specify how they are distinguished, the analogue to informativeness. Factors in these event distinctions can be either temporal, such as rhythmic distinctions in run vs. walk, or spatial, such as limb motion in hop, for instance.

[1] Meaningful in terms of prior theoretical assumptions or observably differentiated behaviors.

1.1 Background

Notions of simulation figure in some work in the cognitive linguistics literature of the past decade (Bergen, 2012; Lakoff, 2009), but have been largely unaddressed by the computational linguistic community, due in part to arguments against the efficacy of simulation in explaining natural language understanding (Davis and Marcus, 2016), particularly regarding linguistic phenomena involving continuous ranges or underspecified values. Nonetheless, existing computational approaches to semantic processing, when taken together, provide a framework on which to implement a simulator as an extension of a model builder. This thesis endeavors to demonstrate that simulation, when modeled within a dynamic qualitative spatial and temporal semantics, can provide a robust environment for examining the interpretation of linguistic behaviors, including those described qualitatively. The result is a research instrument within which to test these interpretations, one that can be used to inform qualitative and quantitative models of motion events.

As mentioned, dynamic interpretations of event structures divide movement verbs into "path" and "manner of motion" verbs. In both cases, the location of the moving argument is reassigned at each step or frame, but where path verbs conduct the reassignment relative to a specified location, manner verbs do not, and the location is specified through an optional prepositional adjunct. "The spoon falls" and "The spoon falls into the cup" are both equally grammatical, but result in different "mental instantiations" of the event described. In order to render a visualization of these instantiations, or "simulations," a computational system must be able to infer path or manner information that is missing from the predicate, either from the objects involved or from their composition with the event.[2] This requires, at some level, real-world knowledge about the distinctions between minimal pairs of motion predicates, as well as of how the visual instantiations of lexical objects interact with their environment to enable reasoning. How these are modeled will be addressed in Section 2.3.

Visual instantiations of lexemes require an encoding of their situational context, or a habitat (Pustejovsky, 2013; McDonald and Pustejovsky, 2014), as well as afforded behaviors that the object can participate in, which are either Gibsonian or telic in nature (Gibson, 1977, 1979; Pustejovsky, 1995). For instance, a cup may afford containing another object, or being drunk from. Many event descriptions presuppose such conditions that rarely appear in linguistic data, but a visualization lacking them will make little sense to the observer. This linguistic "dark matter," conspicuous by its absence, is thus easily exposable through simulation.

[2] This assumes that there is no prior database of missing information accessible, such as one compiled from previously-enacted events or "remembered experiences" as is discussed by, e.g., Bod (1998). The semantic encoding of such information in a simulation context is discussed in Section 2.1.
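To make the per-frame reassignment distinction above concrete, the following is a minimal sketch, not VoxSim's actual implementation, of how a manner verb and a path verb might update an object's location each frame in a Unity-style component; the class, field, and default values are illustrative assumptions.

using UnityEngine;

// Illustrative only: contrasts per-frame location reassignment for a
// manner-of-motion predicate (no inherent goal) with a path predicate
// (reassignment relative to a specified location).
public class MotionPredicateSketch : MonoBehaviour
{
    public bool isPathVerb;                             // e.g., "cross" vs. "roll"
    public Vector3 goalLocation;                        // only meaningful for path verbs
    public Vector3 mannerDirection = Vector3.forward;   // only for manner verbs
    public float speed = 1.0f;                          // underspecified in the linguistic input

    void Update()
    {
        if (isPathVerb)
        {
            // Path verb: reassign location relative to the goal each frame.
            transform.position = Vector3.MoveTowards(
                transform.position, goalLocation, speed * Time.deltaTime);
        }
        else
        {
            // Manner verb: reassign location along a direction; any goal
            // would have to come from an optional adjunct ("across the room").
            transform.position += mannerDirection.normalized * speed * Time.deltaTime;
        }
    }
}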

1.2 Information-Theoretic Foundations

Questions on the meaning of sentences or concepts date to the beginning of philosophy, with philosophers from Plato to Kant offering their own takes on what constitutes a distinct concept. Gottlob Frege, in Über Sinn und Bedeutung (1892), offers the distinction between "sense" (Sinn) and "reference" (Bedeutung), where the sense is the thought expressed by a sentence and the reference is the truth value in some world (w). Where true synonyms would be indistinguishable in w, most subtle differences that arise out of differing lexical choices (e.g., event class, Aktionsart) do in fact create minimal contrasts that differentiate meaning.

Rudolf Carnap (1947) provides an interpretation of Frege, based on Tarski's technique for model-theoretic semantics (1936), that proposes a distinction between intensional meaning, the individual concept, and extensional meaning, what makes it true in some model M. Thus it can be argued that Frege's reference (Bedeutung) and Carnap's extensional meaning can be unified in the set of parameter values that make an action, property, or proposition true under M for the thought being expressed. That is, for a sentence describing an event (e.g., "the ball rolls"), there exists a set of ranges for parameters (speed, rotation, etc.) that make that sentence true under M and a contrasting set that make it false. Until these parameters have distinct values in M, the truth of the sentence under M is not possible to ascertain. An information theoretician like Shannon (1948) might say that the entropy of a predicate in a minimal model may, depending on the predicate, exist in a non-minimal state until values are assigned to the required parameters.
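As an illustrative gloss only (the thesis does not commit to this formalization), an underspecified parameter of such a predicate can be pictured as a random variable whose Shannon entropy is maximal while nothing is specified and falls to zero once the simulation commits to a single value:

% Entropy of an underspecified parameter X over its admissible values Val(X):
H(X) = -\sum_{x \in \mathrm{Val}(X)} p(x)\,\log_2 p(x)
% H(X) is maximal for a uniform p(x) (fully underspecified), and
% H(X) = 0 once a single value of X is fixed in the model M.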

1.3 Linguistic Underspecification in Motion Events

As minimal models allow components to be underspecified, it is permissible to simply model that "the ball rolled" without providing information such as direction, speed, size of the ball, friction between the ball and the supporting surface, etc.; this information can be specified, but the model is still considered complete without it. When a ball rolls in the real world, these model-unspecified components all have values assigned to them, even if those values are not specifically measured.

Following Carnap in his interpretation of Frege, the meaning of a sentence can be said to be determined by testing what would make it true (Soames, 2015). Therefore, given a visualization of an event generated from a known input sentence, and a pair (or set) of sentences potentially describing what appears in the visualization (and blind to the original input), the appropriateness of the visualization to a description can be assessed by a pairwise similarity judgment (Rumshisky et al., 2012) (i.e., asking an annotator to judge a video showing a ball rolling as matching the sentence "The ball rolls" or the sentence "The ball slides"). If a value assigned in the simulation to an underspecified parameter results in a visualization that is judged to not match the input sentence s, that visualization cannot be said to represent a model in which the proposition p denoted by s is true. Thus, there may be a set of values [a] for a parameter underspecified in s for which the resulting visualization represents a proposition that, in a Kripke semantics (Kripke, 1965) with a model M, satisfies M ⊨ p_s[a], and another set of values [b] which result in a proposition for which M ⊭ p_s[b]. The task is then to separate the two sets, and trying to solve this problem computationally entails two further tasks:

• Building a computationally coherent model of a world that can be evaluated from an embodied human perspective;

• Determining values for all salient components that create a simulation that satisfies a human judge's notion of a given event.

Additionally, event simulations where values for the aforementioned components overlap significantly suggest the existence of value ranges that define a prototypical notion of the event, a la Rosch (1973, 1983).
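A minimal sketch of how this set separation could be operationalized follows, using assumed types and names rather than VoxSim's actual API: sample an underspecified parameter, visualize and judge each sample, and partition the sampled values by acceptability. In the thesis's actual methodology the judgment comes from human annotators over captured videos; the stand-in delegate here only marks where that signal would enter.

using System;
using System.Collections.Generic;

// Illustrative sketch: Monte Carlo sampling of one underspecified parameter
// (e.g., rotation angle for "turn x"), partitioned by an acceptability
// judgment into satisfying ([a]) and non-satisfying ([b]) value sets.
public static class ParameterPartitionSketch
{
    public static (List<double> a, List<double> b) Partition(
        int samples,
        Func<double> sampleParameter,          // draws a candidate value
        Func<double, bool> judgedAcceptable)   // stands in for a human judge
    {
        var a = new List<double>();            // values under which M ⊨ p_s[a]
        var b = new List<double>();            // values under which M ⊭ p_s[b]
        for (int i = 0; i < samples; i++)
        {
            double value = sampleParameter();
            (judgedAcceptable(value) ? a : b).Add(value);
        }
        return (a, b);
    }
}

// Usage (hypothetical): partition rotation angles sampled uniformly in [0, 360).
// var rng = new Random();
// var (accepted, rejected) = ParameterPartitionSketch.Partition(
//     1000, () => rng.NextDouble() * 360.0, angle => angle >= 30.0);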

1.4 Related Prior Work

This work is related to frameworks and implementations for object placement and orientation in static scenes (Coyne and Sproat, 2001; Siskind, 2001; Chang et al., 2015). The introduced focus on motion verbs in early prototypes of this simulation work and related studies (Pustejovsky and Krishnaswamy, 2014; Pustejovsky, 2013) led to two additional lines of research: an explicit encoding for how an object is itself situated relative to its environment; and an operational characterization of how an object changes its location or how an agent acts on an object over time. Pustejovsky and others have developed the former into a semantic notion of situational context, called a habitat (Pustejovsky, 2013; McDonald and Pustejovsky, 2014), while the latter is addressed by dynamic interpretations of event structure, including Dynamic Interval Temporal Logic, or DITL (Pustejovsky and Moszkowicz, 2011; Pustejovsky, 2013). I would of course also be remiss not to mention the canonical example of a language parser hooked up to a "blocks world" environment, SHRDLU (Winograd, 1971), which served me, as it served so many others, as a model of both project aim and design mechanics.

The interdisciplinary nature of this work is influenced by cognitive linguistic work regarding the role of "embodiment" in interpreting mental simulations (Bergen, 2012; Narayanan, 1997; Feldman and Narayanan, 2004; Feldman, 2006), and by work interpreting the spatial aspects of cognitive reasoning in computational and algebraic frameworks (Randell et al., 1992; Bhatt and Loke, 2008; Mark and Egenhofer, 1995; Kurata and Egenhofer, 2007; Albath et al., 2010), along with temporal analogues (Allen, 1983). The notion of a verb enacted as a program over its arguments (Naumann, 1999) is foundational to the implementation, resulting in testable satisfaction conditions that are calculated compositionally with the affected objects, in terms of the aforementioned qualitative spatial reasoning approaches.

In implementing a platform to allow experimenting with the underspecification question mentioned in Section 1.3, this work has leveraged the Unity game engine by Unity Technologies (Goldstone, 2009). Game engines have the advantage of providing relatively user-friendly tools for developers to implement a variety of subsystems "out of the box," from graphics and rendering to UI to physics. This has allowed work to proceed beyond the scope of standard game engine components, into areas in the intersection between theoretical reasoning and AI and real-time game or "game-like" environments, in the vein of work presented by Forbus et al. (2002), Dill (2011), and Ma and McKevitt (2006), and the mapping of spatial constraints to animation by Bindiganavale and Badler (1998).

Chapter 2

Framework

The framework for this research rests on the definition of three terms:

1. Minimal model — the universe containing a set of arguments and a set of predicates, interpretations of those arguments, and subsets each defining the interpretation of a predicate (Gelfond and Lifschitz, 1988). For this research, each predicate is assumed to be a logic program and each argument is assumed to be a constant.

2. Simulation — the minimal model with values assigned to all unspecified variables. A minimal model can therefore be considered an underspecified simulation according to this definition, to which variable values can be assigned arbitrarily or by some rule or heuristic. This thesis is primarily concerned with using a visual simulation system to determine a set of "best practices" for assigning these values.

3. Visualization — the process by which each linguistic/semantic object in the simulation is linked to a "visual object concept," which is enacted within the virtual world with the variable values assigned by the simulation evaluated and reassigned at every frame according to the program encoded by the predicate in question. The final step is rendering, in which the computer draws the "finished product" at the frame rate specified by the visualization system.

In order to take the simulation from a fleshed-out model to a rendered visualization, requirements include, but are not limited to, the following components:

1. A minimal embedding space (MES) for the simulation must be determined. This is the 3D region within which the state is configured or the event unfolds;

2. Object-based attributes for participants in a situation or event need to be specified; e.g., orientation, relative size, default position or pose, etc.;

3. An epistemic condition on the object and event rendering, imposing an implicit point of view (POV);

4. Agent-dependent embodiment; this determines the relative scalar factors of an agent and its event participants to their surroundings, as the entity engages in the environment.

In order to construct a robust simulation from linguistic input, an event and its participants must be embedded within an appropriate minimal embedding space. This must sufficiently enclose the event localization, while optionally including room enough for a frame-of-reference visualization of the event (the viewer's perspective).

The above list enumerates the need for semantic-adjacent or "epi-semantic" information in simulation generation. This is the type of information that influences behavior, interpretation, and entailed consequences of events, but is not directly involved in representing the predicative force of a particular lexeme, a la qualia structure (Pustejovsky, 1995).

The modeling language VoxML (Visual Object Concept Markup Language) (Pustejovsky and Krishnaswamy, 2016a) forms the scaffold used to link lexemes to their visual instantiations, termed the "visual object concept" or voxeme. In parallel to a lexicon, a collection of voxemes is termed a voxicon. There is no requirement on a voxicon to have a one-to-one correspondence between its voxemes and the lexemes in the associated lexicon, which often results in a many-to-many correspondence. That is, the lexeme plate may be visualized as a [[SQUARE PLATE]], a [[ROUND PLATE]],[1] or other voxemes, and those voxemes in turn may be linked to other lexemes such as dish or saucer. Each voxeme is linked to an object geometry (if a noun, an OBJECT in VoxML), a DITL program (if a verb, a VoxML PROGRAM), an attribute set (VoxML ATTRIBUTEs), or a transformation algorithm (VoxML RELATIONs or FUNCTIONs). VoxML is used to specify the "epi-semantic" information beyond that which can be directly inferred from the geometry, DITL, or attribute properties. VoxSim does not rely on manually-specified categories of objects with identifying language, and instead procedurally composes the properties of voxemes in parallel with the lexemes they are linked with.

[1] Note on notation: discussion of voxemes in prose will be denoted in the style [[VOXEME]] and should be taken to refer to a visualization of the bracketed concept. Where relevant, images of actual visualizations will be provided as well.
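The many-to-many lexeme-voxeme correspondence described above can be sketched as follows; the class and method names are assumptions for illustration, not VoxSim's actual voxicon representation, which is a set of VoxML markup files linked to geometries.

using System.Collections.Generic;

// Illustrative sketch of a voxicon as a many-to-many mapping between
// lexemes and voxemes.
public class VoxiconSketch
{
    private readonly Dictionary<string, HashSet<string>> lexemeToVoxemes =
        new Dictionary<string, HashSet<string>>();

    public void Link(string lexeme, string voxeme)
    {
        if (!lexemeToVoxemes.TryGetValue(lexeme, out var voxemes))
        {
            voxemes = new HashSet<string>();
            lexemeToVoxemes[lexeme] = voxemes;
        }
        voxemes.Add(voxeme);
    }

    public IEnumerable<string> VoxemesFor(string lexeme) =>
        lexemeToVoxemes.TryGetValue(lexeme, out var v) ? v : new HashSet<string>();
}

// e.g., "plate" -> {[[SQUARE PLATE]], [[ROUND PLATE]]}, while [[ROUND PLATE]]
// may also be reachable from "dish" or "saucer" via additional Link calls.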

The properties of a voxeme may be specified in its VoxML markup or calculated from properties natively accessible by the Unity framework. A non-exhaustive list of voxeme properties and their accessibility is shown below in Table 2.1.

VoxML-specified | Unity-calculated
----------------|------------------------------
Concavity       | Physical object size
Symmetry        | Location/orientation
Semantic head   | Dimensionality
Event typing    | Non-nominal scalar value
NL predicate    | Event satisfaction condition

Table 2.1: Example voxeme properties

VoxML augments engine-accessible data structures such as geometries. As an illustrative example, let us consider a bowl. Common linguistic knowledge links the lexeme "bowl," when referring to a physical object (PHYSOBJ according to Pustejovsky's Generative Lexicon (GL) (1995)), with an object that, among other properties, typically has some concavity in its physical structure. It is a simple matter for a 3D artist to create a model of such an object, as shown below:

Figure 2.1: 3D model of a bowl


Unity (or any game engine)[2] has native access to object parameters such as size, position, and orientation that allows it to calculate certain additional information about the object, of the type enumerated in the right-hand (Unity-calculated) column of Table 2.1. Other properties of the object represented by the 3D model, such as those in the left-hand (VoxML-specified) column of Table 2.1, are difficult to calculate from the object geometry alone without a complex geometrical analysis algorithm, or impossible entirely without sophisticated AI:

• Whether or not the object is concave, flat, or convex;

• Symmetry of the object about a certain axis or around a certain plane;

• What, if any, component of the entity (object, event, etc.) is most semantically salient;

• Multiple predicates that may denote the entity represented by the geometry.

Many of these questions border on computer vision or discourse modeling problems, well outside the scope of this work. Thus, the software cannot procedurally add these parameters to its knowledge base at runtime. It is not computationally feasible, for example, to calculate collision volumes for every object in the voxicon that always closely map to the geometries. Instead, due to the time and resource constraints of creating a platform that is quickly deployable on the average user's (and developer's) hardware, automatically computed collision boxes must be used, such as that shown below.

Figure 2.2: Bowl with box collider shown in green

[2] Subsequent references to Unity functionality should be taken to refer to capabilities provided by most game engines in some shape or form. Unity is simply the platform of choice that has been used to implement the VoxSim software.

With this information alone, the simulator has no way of knowing that there are points on the exterior of the bowl's geometry (including the bottom of what we would call the bowl's "interior," the approximate location of which is indicated in Figure 2.2 with a white line) that are lower on the Y-axis than the collider volume at that same X- and Z-value. Human beings as language interpreters know this and can describe the difference as above, but the computer does not without a world-knowledge data bank, and furthermore cannot do anything with this information without the resources to access all relevant parameters. VoxML provides a compact way of representing a minimally-required set of parameters to enable compositional reasoning.
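As a rough illustration of this division of labor, engine-computable properties can be read off the geometry and transform at runtime, while properties such as concavity or the semantic head must be supplied by the VoxML markup. The component below is an assumed, simplified example, not part of VoxSim; the field values are placeholders.

using UnityEngine;

// Illustrative sketch: properties Unity can compute natively versus
// properties that must come from VoxML markup because they cannot be
// cheaply derived from the geometry or its box collider.
public class VoxemePropertiesSketch : MonoBehaviour
{
    // VoxML-specified (assumed fields, populated from markup elsewhere):
    public string concavity = "concave";        // e.g., for [[BOWL]]
    public string semanticHead = "hemiellipsoid"; // cognitive approximation of shape

    void Start()
    {
        // Unity-calculated: world-space size, position, and orientation.
        Bounds bounds = GetComponent<Renderer>().bounds;
        Vector3 size = bounds.size;
        Vector3 position = transform.position;
        Quaternion orientation = transform.rotation;

        // The box collider over-approximates a concave interior, so the
        // engine alone cannot tell that points inside the bowl sit below
        // the top of the collider volume; the markup supplies that fact.
        Debug.Log($"size={size}, pos={position}, rot={orientation}, " +
                  $"concavity={concavity}, head={semanticHead}");
    }
}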

2.1 VoxML: Visual Object Concept Modeling Language

The VoxML specification is laid out in greater detail in Pustejovsky and Krishnaswamy (2016a), but an abbreviated overview follows here. Following GL, VoxML entities are given a feature structure enumerating:

(a) Atomic Structure (GL FORMAL): objects expressed as basic nominal types.

(b) Subatomic Structure (GL CONST): mereotopological structure of objects.

(c) Event Structure (GL TELIC and GL AGENTIVE): origin and functions associated with an object.

(d) Macro Object Structure: how objects fit together in space and through coordinated activities.

VoxML covers five entity types: OBJECT, PROGRAM, ATTRIBUTE, RELATION, and FUNCTION, which are closely correlated to nouns, verbs, adjectives and adverbs, adpositions, and functional constructions (e.g. "top of x"), respectively. These entity types represent semantic knowledge of the associated real-world concepts as represented by three-dimensional models, and of events and attributes related to and enacted over these objects. VoxML is intended to overcome the limitations of existing 3D visual markup languages by allowing for the encoding of a wealth of semantic knowledge that can be exploited by a variety of systems and platforms, leading to multimodal simulations of real-world scenarios using conceptual objects that represent real-world semantic qualities. It shares many of the goals pursued in Dobnik et al. (2013) and Dobnik and Cooper (2013) in specifying a rigidly-defined type system for spatial representations associated with linguistic expressions, and is extensible for new needs and additions to the semantic model.

2.1.1

Objects

The VoxML O BJECT is used for modeling nouns. The set of O BJECT attributes is shown below: L EX T YPE H ABITAT A FFORD S TR E MBODIMENT

O BJECT’s lexical information O BJECT’s geometrical typing O BJECT’s habitat for actions O BJECT’s affordance structure O BJECT’s agent-relative embodiment

Table 2.2: VoxML O BJECT attributes The L EX attribute contains the subcomponents P RED, the lexical predicate denoting the object, and T YPE, the object’s type according to Generative Lexicon. The T YPE attribute (distinct from L EX’s T YPE subcomponent) contains information to define the object geometry in terms of primitives. H EAD is a primitive 3D shape that roughly describes the object’s form (e.g. calling an apple an “ellipsoid”), or the form of the object’s most semantically salient subpart. For completeness, possible H EAD values are grounded in mathematical formalisms defining families of polyhedra (Gr¨unbaum, 2003), and, for annotator’s ease, common primitives found across the “corpus” of 3D artwork and 3D modeling software3 (Giambruno, 2002). Using common 3D modeling primitives as convenience definitions provides some built-in redundancy to VoxML, as is found in NL description of structural forms. For example, a rectangular prism is the same as a parallelepiped that has at least two defined planes of reflectional symmetry, meaning that an object whose H EAD is rectangular prism could be defined two ways, an association which a reasoner can unify axiomatically. Possible values for H EAD are given below: H EAD

prismatoid, pyramid, wedge, parallelepiped, cupola, frustum, cylindroid, ellipsoid, hemiellipsoid, bipyramid, rectangular prism, toroid, sheet

Table 2.3: VoxML O BJECT H EAD types These values are not intended to reflect the exact structure of a particular geometry, but rather a 3

Mathematically curved surfaces such as spheres and cylinders are in fact represented, computed, and rendered as polyhedra by most modern 3D software.

13

CHAPTER 2. FRAMEWORK cognitive approximation of its shape, as is used in some image-recognition work (e.g. Goebel and Vincze (2007)). Object subparts are enumerated in C OMPONENTS. C ONCAVITY can be concave, flat, or convex, and refers to any concavity that deforms the H EAD shape. ROTAT S YM (rotational symmetry) defines any of the world’s three orthogonal axes around which the object’s geometry may be rotated for an interval of less than 360 degrees and retain identical form as the unrotated geometry. R EFLECT S YM (reflectional symmetry), is defined similarly—if an object may be bisected by a plane defined by two of the world’s three orthogonal axes and then reflected across that plane to obtain the same geometric form as the original object, it is considered to have reflectional symmetry across that plane. The values of ROTAT S YM and R EFLECT S YM are intended to be world-relative, because objects are always situated in a minimal embedding space defined by Cartesian coordinates, and the axes/planes of symmetry are those denoted in the world, not of the object. Thus, a tetrahedron— which in isolation has seven axes of rotational symmetry, no two of which are orthogonal—when placed in the MES such that it cognitively satisfies all “real-world” constraints, must be situated with one base downward (as a tetrahedron placed any other way will fall over). This reduces the salient in-world axes of rotational symmetry to one: the world’s Y-axis. When the orientation of the object is ambiguous relative to the world, the world should be assumed to provide the grounding value. The H ABITAT element defines habitats, which per Pustejovsky (2013) and McDonald and Pustejovsky (2014) are conditioning environments in which an object exists that enable or disable certain actions being taken with the object. These habitats may be I NTRINSIC to the object, regardless of what action it participates in, such as intrinsic orientations or surfaces. An example would be a computer monitor with an intrinsic front, and a geometry in which that intrinsic front faces along the positive Z-axis. We adopted the terminology of “alignment” of an object dimension, d ∈ {X, Y, Z}, with the dimension, d0 , of its embedding space, Ed0 , as follows: align(d, Ed0 ). E XTRINSIC habitats must be satisfied for particular actions to take place, such as a bottle that must be placed on its side in order to be rolled across a surface. In VoxML encoding, each habitat is given a label and a numerical index for future reference. A FFORD S TR describes the set of specific actions that may be taken with objects, subject to their current conditioning habitats. The habitats supply the requisite conditions, and the affordance structure encodes the actions that may be taken under those conditons, and the states that results from those actions being taken. There are low-level G IBSONIAN affordances, which involve ma14

CHAPTER 2. FRAMEWORK nipulation or maneuver-based actions (grasping, holding, lifting, touching, etc.); there are also TELIC affordances (Pustejovsky, 1995), which link directly to what goal-directed activity can be accomplished, by means of the G IBSONIAN affordances. E MBODIMENT qualitatively describes the S CALE of the object compared to an in-world agent (typically assumed to be a human) as well as whether the object is typically M OVABLE by that agent.                                                                         



plate 

LEX

  =   



PRED TYPE 

TYPE

       =        

= =

plate   physobj 

HEAD = sheet[1] COMPONENTS = surface[1], CONCAVITY = concave ROTAT S YM

=

REFLECT S YM

{Y } = {XY, Y Z}

   

base           

 



HABITAT

      =      

I NTR

  = [2]  

E XTR

= ...

align(Y, EY )      TOP = top(+Y )  UP



=

    



AFFORD STR

   =   

EMBODIMENT

A1 A2

= =

H[2 ] → [put(x, y)]contain(y, x)    H[2] → [grasp(x, [1])] 





  =   

SCALE = < agent    MOVABLE = true  

                                                                       

Figure 2.3: [[PLATE]], an O BJECT A complete O BJECT voxeme is linked to a geometry, but unlinked markup can be used to specify typical or prototypical visualization parameters. Numerals in brackets denote references to and reentrancies from other parameters of the voxeme. In the bottle example of an E XTRINSIC habitat discussed above, the habitat of a bottle on its side might be denoted as [3]UP = align(Y, E⊥Y ) (that is, the bottle “upward” vector is created by aligning its object-space Y axis with a vector perpendicular to the world-space Y axis). An associated affordance might be H[3] → [roll(x, [1])], where that habitat (habitat-3), affords the 15

CHAPTER 2. FRAMEWORK rolling of the object ([1]) by some agent x. In Figure 2.3, [1] denotes both the semantic H EAD and the “surface” subcomponent, indicating that they refer to the same part of the geometry (as H EAD is always linked to a geometric form). The H ABITAT structure [2] illustrates the role the habitat plays in activating the entity’s affordance structure. Namely, if the appropriate conditions are satisfied (defined by habitat-2), then the telic affordance associated with a plate is activated; every putting of x on y results in y containing x. Thus an affordance is notated as HABITAT → [ EVENT ] RESULT.
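The markup above can be pictured programmatically as follows. This is a simplified, hedged sketch with assumed type names, not the schema VoxSim uses to deserialize VoxML; it shows only the habitat-gating of affordances that Figure 2.3 illustrates.

using System.Collections.Generic;

// Simplified, illustrative mirror of a VoxML OBJECT entry such as [[PLATE]]:
// habitats gate which affordances are currently available.
public class ObjectVoxemeSketch
{
    public string Pred;                       // LEX.PRED, e.g. "plate"
    public string GlType;                     // LEX.TYPE, e.g. "physobj"
    public string Head;                       // TYPE.HEAD, e.g. "sheet"
    public List<string> Components = new List<string>();
    public string Concavity;                  // "concave" | "flat" | "convex"

    public class Habitat
    {
        public int Index;                     // e.g. 2 for habitat-2
        public Dictionary<string, string> Constraints =
            new Dictionary<string, string>(); // e.g. "UP" -> "align(Y, E_Y)"
    }

    public class Affordance
    {
        public int RequiredHabitat;           // habitat index that enables it
        public string Event;                  // e.g. "put(x, y)"
        public string Result;                 // e.g. "contain(y, x)" (may be empty)
    }

    public List<Habitat> Habitats = new List<Habitat>();
    public List<Affordance> Affordances = new List<Affordance>();

    // An affordance is usable only when its conditioning habitat holds.
    public IEnumerable<Affordance> Available(ISet<int> satisfiedHabitats)
    {
        foreach (var a in Affordances)
            if (satisfiedHabitats.Contains(a.RequiredHabitat))
                yield return a;
    }
}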

2.1.2

Programs

PROGRAM is used for modeling verbs. The current set of PROGRAM attributes is shown below:

  LEX                PROGRAM's lexical information
  TYPE               PROGRAM's event typing
  EMBEDDING SPACE    If different from the MES, PROGRAM's embedding space as a function of the participants and their changes over time

Table 2.4: VoxML PROGRAM attributes

Like OBJECTs, a PROGRAM's LEX attribute contains the subcomponents PRED, the predicate lexeme denoting the program, and TYPE, the program's type as given in a lexical semantic resource, e.g., its GL type. Top-level component TYPE contains the HEAD, its base form; ARGS, references to the participants; and BODY, subevents that are executed in the course of the program's operation. Top-level values for a PROGRAM's HEAD are given below:

  HEAD    state, process, transition
          assignment, test

Table 2.5: VoxML PROGRAM HEAD types

The [[HEAD]] of a program as shown above is given in terms of how the visualization of the action is realized. Basic program distinctions, such as test versus assignment, are included within this typology and further distinguished through subtyping.

[[PUT]]
  LEX = [ PRED = put, TYPE = transition_event ]
  TYPE = [ HEAD = transition,
           ARGS = [ A1 = x:agent, A2 = y:physobj, A3 = z:location ],
           BODY = [ E1 = grasp(x, y),
                    E2 = [while(hold(x, y), move(y))],
                    E3 = [at(y, z) → ungrasp(x, y)] ] ]

Figure 2.4: [[PUT]], a PROGRAM

No specified EMBEDDING SPACE indicates that the embedding space for the event is the same as the MES. This PROGRAM is agent-driven. Should no agent exist, the agent may be excluded from the argument structure (see Section 2.4: Object Model). When beginning the execution of a PROGRAM, any subevents that are already satisfied may be skipped; thus, if during the execution of [[PUT]] the intended path is blocked and the simulation system must replan, grasp(x, y) may be omitted when [[PUT]] resumes—there is no need to ungrasp and re-grasp the object. PROGRAM voxemes may be linked with an abstract visual representation, such as an image schema (Johnson, 1987; Lakoff, 1987).

2.1.3

Attributes

ATTRIBUTEs fall into families structured according to some SCALE (cf. Kennedy and McNally (1999)). The least constrained scale is a conventional sortal classification, and its associated attribute family is the set of pairwise disjoint and non-overlapping sortal descriptions (non-super types). VoxML terms this a nominal scale, following Stevens (1946) and Luce et al. (1990). A two-state subset of this domain is a binary classification. By introducing a partial ordering over values, we can have transitive closure, assuming all orderings are defined; this defines an ordinal scale. When fixed units of distance are imposed between the elements on the ordering, we arrive at an interval scale. When a zero value is introduced that can be defined in terms of a quantitative value, we have a rational scale.⁴ In reality there are many more attribute categories than those listed, but the goal in VoxML is to use these types as the basis for an underlying cognitive classification for creating measurements from different attribute types. In other words, these scale types denote how the attribute is to be reasoned with, not what its precise qualitative features are. As VoxML is intended to model visualizations of physical objects and programs, it is intended to model "the image is red" but not "the image is depressing". Examples of different SCALE types follow:

  Scale      Example domain   Example values
  ordinal    DIMENSION        big, little, large, small, long, short
  binary     HARDNESS         hard, soft
  nominal    COLOR            red, green, blue
  rational   MASS             1kg, 2kg, etc.
  interval   TEMPERATURE      0°C, 100°C, etc.

Table 2.6: Example VoxML ATTRIBUTE scalar types

VoxML also denotes an attribute's ARITY. transitive attributes are considered to describe object qualities that require comparison to object prototypes (e.g. the small cup vs. the big cup), whereas intransitive attributes do not require that comparison (a red cup is not red compared to other cups; it is red in and of itself). Finally, every attribute must be applied to an object, so attributes' ARG represents said object and its typing, denoted identically to the individual ARGS of VoxML PROGRAMs.



[[SMALL]]
  LEX = [ PRED = small ]
  TYPE = [ SCALE = ordinal,
           ARITY = transitive,
           ARG = x:physobj ]

Figure 2.5: [[SMALL]], an ATTRIBUTE

⁴The difference between an interval scale and rational scale is subtle, but they can be distinguished by a question of whether the zero value on the scale indicates "nothing" or "something." 0 kg indicates "no mass," while 0°C does not indicate "no temperature."

[[SMALL]] is transitive, meaning that a "small" x is small relative to a prototypical instantiation of x.

2.1.4

Relations

A RELATION's type structure specifies a binary CLASS of the relation: configuration or force dynamic, describing the nature of the relation that exists between the objects under its scope. These classes themselves have subvalues—for configurational relations these are values enumerated in a qualitative spatial relation calculus such as the Region Connection Calculus (Randell et al., 1992). For force-dynamic relations, subvalues are relations defined by forces between objects, such as "support" or "suspend," many of which are defined as resultant states in an affordance structure. Also specified are the arguments participating in the relations. These, as above, are represented as typed variables. CONSTR denotes an optional constraint on the relation, such as y→HABITAT→INTR[align], which denotes that the INTRINSIC habitat of y denoted by an align function must be satisfied by the current placement of y in order for the relation in question to be in effect.

[[TOUCHING]]
  LEX = [ PRED = is touching ]
  TYPE = [ CLASS = config,
           VALUE = EC,
           ARGS = [ A1 = x:3D, A2 = y:3D ],
           CONSTR = nil ]

Figure 2.6: [[TOUCHING]], a RELATION

2.1.5

Functions

FUNCTIONs' typing structures take as ARG the OBJECT voxeme being computed over. REFERENT takes any subparameters of the ARG that are semantically salient to the function, such as the voxeme's HEAD. If unspecified, the entire voxeme should be assumed as the referent. MAPPING denotes the type of transformation the function performs over the object, such as dimensionality reduction (notated as dimension(n):n-1 for a function that takes in an object of n dimensions and returns a region of n-1 dimensions). Finally, ORIENTATION provides three values: SPACE, which notes if the function is performed in world space, object space, or pov (camera-relative) space; AXIS, which notes the primary axis and direction the function exploits relative to that space; and ARITY, which returns transitive or intransitive based on the boolean value of a specified input variable (x[y]:intransitive denotes a function that returns intransitive if the value of y in x is true). Definitions of transitive and intransitive follow those for ATTRIBUTEs, so in the example below, ARITY of [[TOP]] would be intransitive if the INTRINSIC habitat top(+Y) of x is satisfied by the current placement of the object in question in the virtual world.

[[TOP]]
  LEX = [ PRED = top ]
  TYPE = [ ARG = x:physobj,
           REFERENT = x→HEAD,
           MAPPING = dimension(n):n-1,
           ORIENTATION = [ SPACE = world,
                           AXIS = +Y,
                           ARITY = x→HABITAT→INTR[top(axis)]:intransitive ] ]

Figure 2.7: [[TOP]], a FUNCTION

Like PROGRAMs, ATTRIBUTE, RELATION, and FUNCTION voxemes may be linked with an abstract visual representation where relevant.

2.2

VoxSim

VoxSim is the real-time language-driven event simulator implemented on top of the VoxML platform. A build of VoxSim can be downloaded at http://www.voxicon.net/download. The Unity project and latest source are at https://github.com/VoxML/VoxSim.


2.2.1

Software Architecture

VoxSim uses the Unity game engine (Goldstone, 2009) for graphics and I/O processing. Input is a simple natural language sentence, which is part-of-speech tagged, dependency-parsed, and transformed into a simple predicate-logic format. These NLP tasks may be handled with a variety of third-party tools, such as the ClearNLP parser (Choi and McCallum, 2013), SyntaxNet (Andor et al., 2016), or TRIPS (Ferguson et al., 1998), which interface with the simulation software using a C++ communications bridge and wrapper. 3D assets and VoxML-modeled entities (created with other Unity-based tools) are loaded externally, either locally or from a web server. Commands to the simulator may be input directly to the software UI, or may be sent over a generic network connection or using VoxSim Commander, a companion app for iOS.

[Figure 2.8 components: VoxSim Commander (iOS); Parser; VoxML Resources; Communications Bridge; Simulator; Voxeme Geometries (Unity)]

Figure 2.8: VoxSim architecture schematic

The simulator forms the core of VoxSim, and performs operations over geometries supplied to its resource library and informed by VoxML semantic markup. The communications bridge facilitates generic socket-level communication to third-party packages and exposes the simulator to commands given over a network connection so it can be easily hooked up to remote software. Arrows indicate the directionality of communication to or from each component.

Given a tagged and dependency-parsed sentence, we can then transform it into predicate-logic format using the root of the parse as the VoxML PROGRAM, which accepts as many arguments as are specified in its type structure, subsequently enqueuing arguments that are either constants (i.e. VoxML OBJECTs) or evaluate to constants at runtime (all other VoxML entity types). Other non-constant VoxML entity types are treated similarly, though they usually accept only one argument. Thus, the dependency arc CASE(plate, on) becomes on(plate). The resulting predicate-logic form is evaluated from the innermost first-order predicates outward until a single first-order representation is reached.


  put/VB  the/DT  apple/NN  on/IN  the/DT  plate/NN
  (arcs: ROOT → put; DOBJ(put, apple); NMOD(put, plate); CASE(plate, on); DET(apple, the); DET(plate, the))

  1. p := put(a[])        5. nmod := on(iobj)
  2. dobj := the(b)       6. iobj := the(c)
  3. b := (apple)         7. c := plate
  4. a.push(dobj)         8. a.push(nmod)

  put(the(apple),on(the(plate)))

Figure 2.9: Dependency parse for "Put the apple on the plate" and transformation to predicate-logic form
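The transformation sketched in Figure 2.9 can be illustrated in a few lines of Python; the arc triples and helper names below are illustrative stand-ins, not VoxSim's actual C# implementation:

import random  # not needed here; shown for symmetry with later sketches

def transform(sentence_root, arcs):
    """Transform a dependency parse into the nested predicate-logic form,
    e.g. put(the(apple), on(the(plate))). arcs are (label, head, dependent) triples."""
    def realize(token):
        term = token
        # DET(x, the) wraps x as the(x)
        for label, head, dep in arcs:
            if head == token and label == "DET":
                term = f"{dep}({term})"
        # CASE(x, on) wraps the realized x as on(the(x))
        for label, head, dep in arcs:
            if head == token and label == "CASE":
                term = f"{dep}({term})"
        return term

    args = [realize(dep) for label, head, dep in arcs
            if head == sentence_root and label in ("DOBJ", "NMOD")]
    return f"{sentence_root}({','.join(args)})"

arcs = [("DOBJ", "put", "apple"), ("NMOD", "put", "plate"),
        ("CASE", "plate", "on"), ("DET", "apple", "the"), ("DET", "plate", "the")]
print(transform("put", arcs))   # put(the(apple),on(the(plate)))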

2.2.2

LTSs to Manage Events

An event sequence can be viewed as a type of a labeled transition system (LTS) (van Benthem, 1991; van Benthem et al., 1994), in which the usage of distinct event classes comports with an LTS's notion of "process equivalence" where the equivalence relations respect but are not defined by the observational properties in two LTSs under comparison, as enumerated by van Benthem and Bergstra (1994). An event is distinguished by a label and its argument set as enumerated by its VoxML structure. It can be indicated and selected by name, with its argument structure being filled in from the linguistic parse or by various specification methods discussed subsequently, most abstractly in Section 2.8.

VoxSim manages its event queue by way of an "event manager" that maintains a list of the events currently requested of the system, under the assumption that the event at the front of a nonempty queue is the event currently being executed. Thus, the frontmost event always exists in a first-order state (Lωω), while other events may be of higher-order forms (Lω1ω) until they come to the front of the queue, at which point they are "first-orderized" (Keisler, 1971; Chang and Keisler, 1973; van Benthem et al., 1993) by the process depicted in Figure 2.9, where the result of the parse is what is inserted into the queue.

Should an event require precondition satisfaction, such as a [[PUT]] event requiring that the object be grasped before being moved, an event that will satisfy this precondition can be inserted into the event manager system at an appropriate point in the transition graph such that when the precondition is satisfied, the remainder of the initially-prompted event's subevent structure can resume execution from that point. Insertion into the event manager/transition graph can be nested, so if [[PUT]] requires that the object be grasped and [[GRASP]] requires that the object be touched, the call to [[GRASP]] from [[PUT]] can handle the precondition insertion of [[REACH FOR]] (the predicate label for touching + hand in grasp pose) before returning control back to [[GRASP]], and subsequently [[PUT]]. As events proceed, VoxSim maintains the current set of relations that exist between every pair of objects in the scene. Thus, should a precondition inserted into the event manager already be satisfied, the moment it comes to the front of the queue and is evaluated to first-order form, it will be removed from the transition graph as a satisfied event, keeping the system moving. Event sequences, as broadly implemented in the VoxSim event manager, are not necessarily deterministic (¬□∀xyz((Raxy ∧ Raxz) → y = z)), as information required to either initiate or resolve the event (sometimes both) must be computed from the composition of object and event properties as encoded in VoxML, or given value at runtime (discussed in Section 2.8).
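The event-manager behavior described above—precondition insertion ahead of the frontmost event, nesting, and the dequeuing of already-satisfied events—can be sketched as follows; the class, the precondition table, and the event encoding are all illustrative assumptions rather than VoxSim's actual API:

from collections import deque

# Illustrative precondition table (hypothetical; VoxSim derives these from VoxML BODY subevents)
PRECONDITIONS = {"put": "grasp", "grasp": "reach_for"}

class EventManager:
    """Toy sketch of queue-with-precondition-insertion. Events are (predicate, object) pairs."""

    def __init__(self, satisfied):
        self.queue = deque()
        self.satisfied = satisfied          # set of (predicate, object) already true in the scene

    def request(self, event):
        self.queue.append(event)

    def step(self):
        if not self.queue:
            return "idle"
        pred, obj = self.queue[0]
        # an inserted precondition already satisfied in the scene is dequeued immediately
        if (pred, obj) in self.satisfied:
            self.queue.popleft()
            return f"dequeued satisfied {pred}({obj})"
        # insert an unsatisfied precondition ahead of the current event (nesting happens naturally)
        pre = PRECONDITIONS.get(pred)
        if pre and (pre, obj) not in self.satisfied:
            self.queue.appendleft((pre, obj))
            return f"inserted {pre}({obj}) before {pred}({obj})"
        self.satisfied.add((pred, obj))
        self.queue.popleft()
        return f"executed {pred}({obj})"

em = EventManager(satisfied=set())
em.request(("put", "block"))
while (trace := em.step()) != "idle":
    print(trace)   # reach_for and grasp are inserted and executed before put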

2.3

Spatial Reasoning

VoxML is not intended to encode every piece of information needed to reason about objects and events. Rather, the VoxML-encoded information about the entity is interpreted at runtime when composed with current situational context.


[[CUP]]
  LEX = [ PRED = cup, TYPE = physobj ]
  TYPE = [ HEAD = cylindroid[1],
           COMPONENTS = surface, interior,
           CONCAVITY = concave,
           ROTATSYM = {Y},
           REFLECTSYM = {XY, YZ} ]
  HABITAT = [ INTR = [2][ UP = align(Y, EY), TOP = top(+Y) ],
              EXTR = ... ]
  AFFORD_STR = [ A1 = H[2] → [put(x, on([1]))]support([1], x),
                 A2 = H[2] → [put(x, in([1]))]contain([1], x),
                 A3 = H[2] → [grasp(x, [1])] ]
  EMBODIMENT = [ SCALE = ... ]

put(y,z)  w0 := loc(y); (wt ≠ z?; wt ≠ wt−1; d(z,wt−1) > d(z,wt))+

Figure 3.1: VoxML and DITL for put(y,z)

(a) put requires a prepositional adjunct which typically includes the destination location (z in Figure 3.1). The prepositional adjunct specifies the event's ending location, and at each time step, the location of the moving object along the path is expected to approach the destination by the measure of the distance function (various distance heuristics can be used to allow for convoluted or non-direct paths, such as those requiring object avoidance; this research uses A* path planning with the Manhattan distance heuristic). No value is given for the speed of motion. The DITL program encodes a test for whether or not the object has reached its destination (wt ≠ z?), and continuous, Kleene-iterated location change toward the destination (subject to the distance heuristic in use) if the test is not satisfied (wt ≠ wt−1; d(z,wt−1) > d(z,wt)). This maps to a satisfaction test d(z,loc(y)n) = 0, (loc(y)t ≠ loc(y)t−1)[0,n]?.²

²In these DITL programs, subscripted letters refer to object parameter values at time step t, and d refers to a distance function between two vectors.


[[SLIDE]]
  LEX = [ PRED = slide, TYPE = process ]
  TYPE = [ HEAD = process,
           ARGS = [ A1 = x:agent, A2 = y:physobj, A3 = z:surface ],
           BODY = [ E1 = grasp(x, y),
                    E2 = [while(hold(x, y), while(EC(y, z), move(y)))] ] ]

slide(y)  w0 := loc(y); (wt ≠ wt−1; EC(y,z))+

Figure 3.2: VoxML and DITL for slide(y)

(b) slide enforces a constraint on the motion of the object, requiring it to remain EC with the surface being slid across (z in Figure 3.2; only implicit in the NL utterance unless explicated in an adjunct). In the bare predicate, no reference is made to what speed or direction the object should be moving as long as the EC constraint is maintained. The DITL program requires a step-wise location change (wt ≠ wt−1). The object may rotate, but the rotation should not be proportional to the path length, path shape, and speed of the object (that could make the motion a rolling). It must maintain the EC constraint with the surface (EC(y,z)). These conditions map to a satisfaction test: EC(y,z)[0,n]?, loc(y)n ≠ loc(y)0, rot(y) ∝̸ Σt=0..n(loc(y)t − loc(y)t−1). The final parameter of the test, rot(y) ∝̸ Σt=0..n(loc(y)t − loc(y)t−1), is not expressed explicitly in the DITL program, but must be expressed in the satisfaction test because, while rotation is optional and may be arbitrary, there are conditions under which the nature of the rotation change would render the motion no longer a "slide."
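To make such a satisfaction test concrete, the following is a minimal sketch of checking the "slide" conditions over a logged trajectory of per-frame object states; the frame representation, the EC check, and the use of an effective rolling radius to test (non-)proportionality are all assumptions, not VoxSim's implementation:

import numpy as np

def satisfies_slide(frames, ec, radius, tol=0.1):
    """Check the slide satisfaction test over logged per-frame states.
    frames: dicts with 'loc' (3-vector) and 'rot' (accumulated rotation, radians);
    ec(frame) -> bool is a stand-in for the EC(y,z) relation; radius is the object's
    effective rolling radius, used to test (non-)proportionality of rotation to path."""
    if not all(ec(f) for f in frames):                       # EC(y,z) over [0,n]
        return False
    locs = np.array([f["loc"] for f in frames], dtype=float)
    if np.allclose(locs[-1], locs[0]):                       # loc(y)_n != loc(y)_0
        return False
    path_length = np.linalg.norm(np.diff(locs, axis=0), axis=1).sum()
    total_rot = abs(frames[-1]["rot"] - frames[0]["rot"])
    rolling_rot = path_length / radius                       # rotation a roll would produce
    # rotation may be arbitrary, but not proportional to the path (that would be a roll)
    return abs(total_rot - rolling_rot) > tol * rolling_rot

frames = [{"loc": (0, 0, 0), "rot": 0.0},
          {"loc": (0.5, 0, 0), "rot": 0.0},
          {"loc": (1.0, 0, 0), "rot": 0.0}]
print(satisfies_slide(frames, ec=lambda f: True, radius=0.25))   # True: translocation, no rotation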


[[ROLL]]
  LEX = [ PRED = roll, TYPE = process ]
  TYPE = [ HEAD = process,
           ARGS = [ A1 = x:agent, A2 = y:physobj, A3 = z:surface ],
           BODY = [ E1 = grasp(x, y),
                    E2 = [while(hold(x, y), while(EC(y, z), move(y), rotate(y)))] ] ]

roll(y)  w0 := loc(y), v0 := rot(y); (wt ≠ wt−1; abs(vt−v0) > abs(vt−1−v0); EC(y,z))+

Figure 3.3: VoxML and DITL for roll(y)

(c) roll enforces the same EC constraint as slide, and also makes no mention of the speed or direction of object translocation. The moving object's speed and direction of rotation are also not specified, but these are further constrained by the speed and direction of translocative motion when composed with the size of the object, so if a value is established for those parameters, the necessary direction and speed of rotation can be computed. The DITL program requires a location change (wt ≠ wt−1), a rotation change in a consistent direction (abs(vt−v0) > abs(vt−1−v0)), and maintenance of the EC constraint (EC(y,z)), and so maps to the satisfaction test EC(y,z)[0,n]?, loc(y)n ≠ loc(y)0, rot(y) ∝ Σt=0..n(loc(y)t − loc(y)t−1). Unlike slide, the total rotation change must be proportional to the path shape, path length, and movement speed to properly be considered a "roll."


[[TURN]]
  LEX = [ PRED = turn, TYPE = process ]
  TYPE = [ HEAD = process,
           ARGS = [ A1 = x:agent, A2 = y:physobj ],
           BODY = [ E1 = grasp(x, y),
                    E2 = [while(hold(x, y), rotate(y))] ] ]

turn(y)  w0 := rot(y); (wt ≠ wt−1)+

Figure 3.4: VoxML and DITL for turn(y)

(d) turn lexically singles out only the rotation of its argument. No reference is made to speed or direction of rotation. No reference is made to speed or direction of translocation either, but as the verb turn focuses only on the rotation, parameters of translocation theoretically have no bearing on the correctness of the event operationalization. However, the DITL program (wt ≠ wt−1)+ has only one parameter, the object's rotation, and so maps to the satisfaction test rot(y)n ≠ rot(y)0.

[[MOVE]]
  LEX = [ PRED = move, TYPE = process ]
  TYPE = [ HEAD = process,
           ARGS = [ A1 = x:agent, A2 = y:physobj ],
           BODY = [ E1 = grasp(x, y),
                    E2 = [while(hold(x, y), move(x))] ] ]

move(y)  w0 := loc(y), v0 := rot(y); ((wt ≠ wt−1) ∨ (vt ≠ vt−1))+

Figure 3.5: VoxML and DITL for move(y)

(e) move is one of the most (perhaps the most) underspecified motion verbs, as all path and manner of motion verbs are special cases of move. Tautologically, any kind of motion can be a move; this is reflected in the VoxML structure as well as in the fact that move shows up in a number of Levin verb classes (viz. "slide", "roll", "object alternation") (Levin, 1993), and maps to a wide variety of motion subtypes in VerbNet (Kipper et al., 2006). In bare move, speed and direction values are left unspecified, and a further open question is whether some other motion event may be enacted instead of the operationalization of move specifically and still satisfy the observer's definition of move. The DITL program ((wt ≠ wt−1) ∨ (vt ≠ vt−1))+ maps to the satisfaction test loc(y)n ≠ loc(y)0 ∨ rot(y)n ≠ rot(y)0. The location or rotation of the object (or both) may have changed, and it will have moved. As this test otherwise leaves the search space too wide open for a Monte Carlo method to provide much insight without generating a very large set of test simulations, we instead respecify "move" as a different event randomly selected from the test set, leaving "motion manner" as the underspecified parameter, as listed in Table 3.2 below.

The full list of programs and their satisfaction tests is given in Appendix B. Having determined the underspecified parameters for each predicate, the search space can further be constrained by filtering out parameters that, while not explicitly singled out in the linguistic utterance, can have their values inferred or calculated from known parameters. Examples of this type would be direction of motion in a [[PUT]] event or rotation speed in a [[ROLL]] event, which, as discussed in the sections above, are constrained by other parameters and must fall at a certain value in order for the event to execute. The full table of underspecified parameters tested on for each predicate is given below, along with their types. These parameters have been manually selected for saliency based on the criteria and rationale previously outlined in this section.

  Program              Underspecified parameters  Type       Possible values
  move(x)              motion manner              predicate  {turn(x), roll(x), slide(x), spin(x), lift(x), stack(x[]), put(x,on(y)), put(x,in(y)), put(x,near(y)), lean(x,on(y)), lean(x,against(y)), flip(x,edge(x)), flip(x,center(x))}
  turn(x)              rot speed                  float      {(0..12.5]}
                       rot axis                   axis       {X, Y, Z}
                       rot angle                  float      {(0°..180°]}
                       rot dir                    sign       {+, -}
                       motion manner              predicate  {roll(x), spin(x), lean(x,on(y)), lean(x,against(y)), flip(x,edge(x)), flip(x,center(x))}
  roll(x)              transloc dir               3-vector   V ∈ {⟨[0..1], 0, [0..1]⟩ | C(x)/2 < mag(V) ≤ 1}
  slide(x)             transloc speed             float      {(0..5]}
                       transloc dir               3-vector   V ∈ {⟨[0..1], 0, [0..1]⟩ | mag(V) ≤ 1}
  spin(x)              rot angle                  float      {(180°..540°]}
                       rot speed                  float      {(0..12.5]}
                       rot axis                   axis       {X, Y, Z}
                       rot dir                    sign       {+, -}
                       motion manner              predicate  {roll(x)}
  lift(x)              transloc speed             float      {(0..5]}
                       transloc dir               3-vector   {⟨0, y−y(x), 0⟩}
  stack(x[])           placement order            list       {[1,2], [2,1]}
  put(x,touching(y))   transloc speed             float      {(0..5]}
                       rel orientation            predicate  {left(y), right(y), behind(y), in front(y), on(y)}
  put(x,on(y))         transloc speed             float      {(0..5]}
  put(x,in(y))         transloc speed             float      {(0..5]}
  put(x,near(y))       transloc speed             float      {(0..5]}
                       transloc dir               3-vector   V ∈ {⟨y−x(x), y−y(x), y−z(x)⟩ | d(x,y) < d(edge(s(y),y)), IN(s(y)), ¬IN(y)}³
  lean(x,on(y))        rot angle                  float      {[25°..65°]}
  lean(x,against(y))   rot angle                  float      {[25°..65°]}
  flip(x,edge(x))      rot axis                   axis       {X, Y, Z}
                       symmetry axis              axis       {X, Y, Z}
  flip(x,center(x))    rot axis                   axis       {X, Y, Z}
                       symmetry axis              axis       {X, Y, Z}
  close(x)             motion speed               float      manner = put: {(0..5]}; manner = turn: {(0..12.5]}
  open(x)              motion speed               float      manner = put: {(0..5]}; manner = turn: {(0..12.5]}
                       transloc dir               3-vector   manner = put: {⟨y−x(x), y−y(x), y−z(x)⟩}
                       rot angle                  float      manner = turn: {(0°..180°)}

³s(y) represents the surface of the object currently supporting y.

Table 3.2: Program test set with underspecified parameters

In the case where the parameter is of type predicate, this requires the event to be respecified before executing (e.g., instantiating a [[MOVE]] event as a [[SPIN]]), and the respecified predicate may itself contain underspecified values, which must then be specified before execution. However, in the case where an underspecified event is respecified to another predicate that can itself be respecified to one of a set of events (e.g., respecifying [[MOVE]] to [[TURN]] which could itself be optionally realized as [[ROLL]], [[SPIN]], [[LEAN]], or [[FLIP]]), the respecified predicate is required to give value assignment to its underspecified parameters rather than attempting to respecify to a different predicate again. That is, a predicate cannot be respecified more than once, so if a [[MOVE]] is respecified to a [[TURN]], that [[TURN]] event must be given value assignment for its speed, axis, angle, and direction of rotation at that point. This is also why put(x,touching(y)) is not permitted as a respecification of move(x), as touching(y) itself requires respecification to another relation.
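A minimal sketch of the respecify-at-most-once behavior described above; the manner sets are illustrative excerpts of Table 3.2 (with the predicate itself standing in for the option of executing it directly), and the function is not VoxSim's actual selection code:

import random

# Illustrative excerpts of the "motion manner" value sets in Table 3.2 (not exhaustive)
MANNERS = {
    "move": ["turn", "roll", "slide", "spin", "lift", "put_on", "put_near", "lean_on", "flip_edge"],
    "turn": ["turn", "roll", "spin", "lean_on", "lean_against", "flip_edge", "flip_center"],
    "spin": ["spin", "roll"],
}

def choose_manner(predicate, depth=0):
    """Respecify a manner-underspecified predicate at most once; the chosen predicate
    must then receive concrete value assignments for its own underspecified parameters."""
    if depth == 1 or predicate not in MANNERS:
        return predicate                      # no further respecification allowed
    choice = random.choice(MANNERS[predicate])
    if choice == predicate:
        return predicate                      # e.g. a minimally specified "turn"
    return choose_manner(choice, depth + 1)

print(choose_manner("move"))   # e.g. 'roll', or 'turn' (which then gets random rotation values)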

3.2

Operationalization

Each VoxML/DITL primitive program maps to a method executed by VoxSim. Each of these methods, written in C# (although Unity also supports JavaScript and Boo scripting), operationalizes the parameters and tests of the program in real time over the specified arguments. Underspecified variables (discussed in Section 3.1 and shown in Tables 3.2 and B.1) must be assigned values for the software to run.

Operationalization follows a two-pass process, which is initiated once an unexecuted event reaches the front of the event queue. First, the satisfaction conditions under which the event will be considered complete must be calculated, with regard to the particulars of the objects involved, and informed by forward composition from the object and event VoxML, and the parameters must be given value through Monte Carlo simulation value assignment. This results in the event being fully evaluated down to its first-order form at the front of the event queue (as discussed in Section 2.2.2). In the second pass, the now-fully evaluated first-order event assigns to its VoxML [[OBJECT]] arguments the transformations required to satisfy the event as calculated during the first pass. In both passes, all predicates encountered are invoked in order to evaluate them. This allows for compactness and completeness of code and allows any changes to the verbal program to apply in both "evaluation" mode and "execution" mode without discrepancy. The only difference is that predicates invoked in "evaluation" mode are not allowed to make changes to the event manager/transition graph, with the exception of inserting events required to satisfy preconditions. This reserves dequeuing events for the "execution" mode.

An abridged, schematic C# operationalization of [[TURN]] follows. The entire method is printed in Appendix C. Many calls and references are made to the VoxSim, Unity, and .NET APIs in the course of operationalizing a predicate, for calculating geometric values and parameterizing inputs to various subsystems, but the code demonstrates the two types of "turn" event discussed in Section 2.7 and the process of assigning random values to underspecified variables.

public void TURN(object[] args) {
    ... // look for agent
    ... // add preconditions
    ... // add postconditions
    ... // override physics rigging
    ...
    if (args [0] is GameObject) {
        GameObject obj = (args [0] as GameObject);
        Voxeme voxComponent = obj.GetComponent<Voxeme> ();
        if (voxComponent != null) {
            if (!voxComponent.enabled) {
                voxComponent.gameObject.transform.parent = null;
                voxComponent.enabled = true;
            }

            if (args [1] is Vector3 && args [2] is Vector3) {
                // args[1] is local space axis
                // args[2] is world space axis
                if (args [3] is Vector3) {
                    // args[3] is world space axis
                    sign = Mathf.Sign (Vector3.Dot (Vector3.Cross (
                        obj.transform.rotation * (Vector3)args [1],
                        (Vector3)args [2]), (Vector3)args [3]));
                    angle = Vector3.Angle (
                        obj.transform.rotation * (Vector3)args [1],
                        (Vector3)args [2]);

                    // rotation from object axis [1]
                    // to world axis [2]
                    // around world axis [3]
                    if (voxComponent.turnSpeed == 0.0f) {
                        voxComponent.turnSpeed = RandomHelper.RandomFloat (0.0f, 12.5f,
                            (int)RandomHelper.RangeFlags.MaxInclusive);
                    }

                    targetRotation = (Quaternion.AngleAxis (sign * angle,
                        (Vector3)args [3]) * obj.transform.rotation).eulerAngles;
                    rotAxis = Constants.Axes.FirstOrDefault (
                        a => a.Value == (Vector3)args [3]).Key;
                }
                else {
                    // rotation from object axis [1] to world axis [2]
                    if (voxComponent.turnSpeed == 0.0f) {
                        voxComponent.turnSpeed = RandomHelper.RandomFloat (
                            0.0f, 12.5f, (int)RandomHelper.RangeFlags.MaxInclusive);
                    }

                    targetRotation = Quaternion.FromToRotation (
                        (Vector3)args [1], (Vector3)args [2]).eulerAngles;
                    angle = Vector3.Angle ((Vector3)args [1], (Vector3)args [2]);
                }
            }
            else {
                if (voxComponent.turnSpeed == 0.0f) {
                    voxComponent.turnSpeed = RandomHelper.RandomFloat (0.0f, 12.5f,
                        (int)RandomHelper.RangeFlags.MaxInclusive);
                }

                targetRotation = (obj.transform.rotation *
                    UnityEngine.Random.rotation).eulerAngles;
                angle = Quaternion.Angle (transform.rotation,
                    Quaternion.Euler (targetRotation));
            }

            voxComponent.targetRotation = targetRotation;
        }
    }

    // add to events manager
    ...
    // record parameter values
    ...
    return;
}

Figure 3.6: C# operationalization of [[TURN]] (abridged)

Individual segments of the VoxML program map to individual segments of the linked DITL program (see Section 3.1 for examples), and segments of the DITL program map directly to the C# code.

• GameObject obj = (args [0] as GameObject) accesses the DITL variable y, the object;
• targetRotation = Quaternion.[...] specifies the nature of the DITL rotation update wt ≠ wt−1;
• If three vector arguments are specified, then it is assumed that the third is the axis about which to constrain rotation. If only two are specified, rotation is calculated relative to the current orientation and takes the shortest path.

Due to the object-oriented nature of the VoxSim architecture, some generic operations are abstracted out to other classes rather than operationalized directly in the predicate. For example, the variable targetRotation in the above example sets the goal rotation of the object for the execution of the predicate (here [[TURN]]). The Voxeme "component" (Unity terminology for a certain type of member class instance that is updated every frame, providing a clean way of handling DITL and VoxML Kleene iterations) on the Unity GameObject that contains the VoxML [[OBJECT]] geometry is what actually handles the frame-to-frame update of the object geometry's location and orientation. Since all geometries in the simulation are part of voxemes, which contain the Voxeme component, there is no need to handle the state-to-state update in the predicate itself, and the operationalization only needs to specify the nature of the update between the start and end states.

Predicates for all VoxML entity types can be operationalized, subject to their typing constraints. For example, the declaration for ON, public Vector3 ON(object[] args), shows that the operationalization takes an object or list of objects and returns a location, which is in line with the configurational CLASS of the VoxML relation [[ON]].

3.3

Monte Carlo Simulation

Using the list of test sentences, the list of underspecified parameters and satisfaction conditions for each predicate, and their operationalizations in code, sets of visualized simulations were created for each sentence with random values assigned for each underspecified parameter in the event predicate. These simulations were evaluated for the best match between visualization and test sentence, should one or a set thereof exist.

Parameters were randomized in two ways:

(a) If a numerical/vector value is needed for a parameter such as speed or direction, it is assigned by a random number generator. The range of possible random values was constrained to values that allow the motion from event start to event end to be completed in under 15 seconds (this value was chosen arbitrarily to allow evaluators to complete their tasks in a reasonable amount of time);

(b) If the predicate denotes an underspecified manner of motion, a different predicate that satisfies the remainder of the original predicate's satisfaction test was chosen at random from the available specification set and then executed instead of the predicate from the input.

The value ranges available are shown in Table 3.2. Randomization was performed using the Unity engine's built-in randomizer, which uses a uniform distribution, in line with standard Monte Carlo methods (Sawilowsky, 2003). Resampling was allowed in cases where the randomly generated value violated some constraint on the predicate (e.g., generating a location inside another object or off the table).

All simulations were run in the same environment, with only the motion predicate and object participants changing. Objects were initialized in an X- and Z-axis-aligned grid pattern, as shown in Figure 3.7. During capture of an event, all objects not involved in the event were removed from the scene, as shown in Figure 3.8. Most events finished executing in 1-3 seconds. Video capture was automatically stopped after 15 seconds if the event had not yet completed.
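A minimal sketch of the value-assignment-with-resampling step described in (a) and (b) above, assuming a uniform sampler and a caller-supplied constraint check (the ranges and helper names are illustrative; VoxSim performs this inside Unity with the engine's built-in randomizer):

import random

# Illustrative ranges drawn from Table 3.2; the constraint check is a stand-in for
# VoxSim's in-engine collision/placement tests.
RANGES = {"transloc_speed": (0.0, 5.0), "rot_speed": (0.0, 12.5)}

def assign_value(param, violates_constraint, max_tries=100):
    """Sample a value uniformly, resampling while it violates a predicate constraint."""
    lo, hi = RANGES[param]
    for _ in range(max_tries):
        value = random.uniform(lo, hi)
        if not violates_constraint(param, value):
            return value
    raise RuntimeError(f"could not satisfy constraints for {param}")

# e.g. reject near-zero speeds that would not visibly move the object
speed = assign_value("transloc_speed", lambda p, v: v < 0.01)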


Figure 3.7: Test environment with all objects shown

Figure 3.8: Snapshot from video capture in progress

3.3.1

Automated Capture

Video capture was performed using the Flashback Video Recorder, an FFmpeg-based Unity package from LaunchPoint Games (http://launchpointgames.com/unity/flashback.html). Three videos were captured for each input sentence in the test set, each with values assigned anew to its underspecified parameters. Unity generates the parameters as discussed above and logs them to a SQL database, along with the path of the video file, the input string, the parsed predicate-logic form of the string, the parsed form with objects resolved to their internal unique names, the event predicate alone, the objects involved in the current simulation, and a feature vector of all underspecified parameters and their value assignments. As each event is completed (or in rare cases, times out before completing), VoxSim writes all changes to the database, requests a new event from the event spawner script over the communications bridge (which the event spawner retrieves from the input sentence lists), waits two seconds to allow Flashback to prepare to capture the event, and repeats the process until the list of input sentences is exhausted. Figure 3.9 shows a diagram of this process. The label at the bottom of each green box shows the language or framework that the component contained in that box runs on.

[Figure 3.9 components: Input Lists (Bash), Event Spawner (Python), Communications Bridge (C++), VoxSim (Unity/C#), SQL database]

Figure 3.9: Automatic capture process diagram

The final dataset consisted of 3,357 videos of simulated motion events. The distribution per predicate is given in Table 3.3.

  Program               # videos captured
  move(x)               45
  turn(x)               45
  roll(x)               45
  slide(x)              45
  spin(x)               45
  lift(x)               45
  put(x,touching(y))    630
  put(x,on(y))          582
  put(x,in(y))          174
  put(x,near(y))        580
  lean(x,on(y))         552
  lean(x,against(y))    503
  flip(x,edge(x))       9
  flip(x,center(x))     36
  close(x)              9
  open(x)               12

Table 3.3: Number of videos captured per motion predicate

3.3.2

Feature Vectors

For each video, the SQL database stores a sparse feature vector saved as a JSON string. Parameters that are fully specified by the predicate itself are left blank in the vector and only those features requiring value assignment are stored for that event instance. For uniformity, vectors are saved as "densified" vectors where the non-valued features are left as empty strings.

{
  "MotionSpeed":"12.21398",
  "MotionManner":"turn(front cover)",
  "TranslocSpeed":"",
  "TranslocDir":"",
  "RotSpeed":"",
  "RotAngle":"104.7686",
  "RotAxis":"",
  "RotDir":"",
  "SymmetryAxis":"",
  "PlacementOrder":"",
  "RelOrientation":"",
  "RelOffset":""
}

Figure 3.10: "Densified" feature vector for "open the book"

The densified vectors are made properly sparse for evaluation, meaning that some show many specified features where others may display only one.

{
  "MotionManner":"put(grape,near(apple))",
  "TranslocSpeed":"3.62548",
  "TranslocDir":"",
  "RelOffset":""
}

Figure 3.11: Sparse feature vector for "move the grape"

{
  "TranslocSpeed":"1.944506"
}

Figure 3.12: Sparse feature vector for "put the block in the plate"
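A minimal sketch of the densified-to-sparse conversion, under the assumption that sparsification keeps exactly the features relevant to the (possibly respecified) event predicate, as in Figure 3.11; the predicate-to-feature table is hypothetical:

import json

# Hypothetical mapping from predicate to the features it can value (cf. Table 3.2);
# illustrative only, not the table the capture pipeline actually uses.
RELEVANT = {
    "put": ["MotionManner", "TranslocSpeed", "TranslocDir", "RelOffset"],
}

def sparsify(densified_json, predicate):
    """Keep only the features relevant to the event's predicate, dropping the rest."""
    dense = json.loads(densified_json)
    return {k: dense.get(k, "") for k in RELEVANT[predicate]}

dense = '{"MotionManner":"put(grape,near(apple))","TranslocSpeed":"3.62548","RotSpeed":""}'
print(sparsify(dense, "put"))   # mirrors the shape of Figure 3.11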

3.3.3

Alternate Descriptions

Having captured the event test set visualizations for evaluation with respect to their original inputs, two alternate "captions" for each event were procedurally generated, for a total of three description options per video. Heuristics for generating the candidate descriptions were as follows (a sketch of this procedure follows the list):

1. One candidate is always the true original input sentence.

2. If the simulation required a respecification of the event to another predicate, one candidate sentence is constructed out of that respecification. For example, the "move the grape" event reflected in Figure 3.11 would include "put the grape near the apple" as a candidate sentence.

3. If the original input contains a prepositional adjunct, one candidate sentence is constructed by alternating that preposition with another that co-occurs with the same event predicate in the test set (i.e., "put on" vs. "put in/touching/near" or "lean on" vs. "lean against").

4. If the number of candidate sentences is less than 3, choose at random another predicate from the test set and apply it to the theme object of the original input.

5. Repeat steps 3 and 4 until the number of candidate sentences reaches 3.

The three sentences were then put in a randomized order before being logged to a new table in the same SQL database as the feature vectors and video file path information.
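A minimal sketch of the candidate-caption procedure (heuristics 1-5 above); the preposition table, predicate list, and helper signature are illustrative assumptions rather than the thesis's actual generation code:

import random

PREP_ALTERNATES = {"on": ["in", "touching", "near"], "against": ["on"], "near": ["on", "in"]}
TEST_PREDICATES = ["turn", "roll", "slide", "spin", "lift", "flip"]

def candidate_captions(original, respecified=None, theme=None, preposition=None):
    candidates = [original]                                   # heuristic 1
    if respecified:                                           # heuristic 2
        candidates.append(respecified)
    if preposition and len(candidates) < 3:                   # heuristic 3
        alt = random.choice(PREP_ALTERNATES[preposition])
        candidates.append(original.replace(f" {preposition} ", f" {alt} "))
    while len(candidates) < 3:                                # heuristics 4-5
        candidates.append(f"{random.choice(TEST_PREDICATES)} the {theme}")
    random.shuffle(candidates)                                # randomized presentation order
    return candidates[:3]

print(candidate_captions("move the grape", respecified="put the grape near the apple", theme="grape"))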

3.4

Evaluation

All code written for preprocessing and the human- and machine-driven evaluation tasks is available at https://github.com/nkrishnaswamy/thesis-docs-utils. The actual videos captured are stored on Amazon S3 (Simple Storage Service) from Amazon Web Services, at https://s3.amazonaws.com/voxsim-videos/. The link will lead to the document tree in which individual videos are sorted by the Key field.

3.4.1

Human-Driven Evaluation

Human-driven evaluation was conducted using the Amazon Mechanical Turk (MTurk) platform, which is widely considered to be a good source of high-quality, inexpensive data (Buhrmester et al., 2011). A series of Amazon MTurk HITs (Human Intelligence Tasks) were used to assess the correctness of each visualization relative to a description, or of each of the set of descriptions relative to a visualization. Human judgments of a visualization are given as “acceptable” or “unacceptable” relative to the event’s linguistic description whereas human judgments of a sentence are given as “acceptable” or “unacceptable” relative to a provided visualization of an event.

Human Evaluation Task #1 (HET1)

In the first evaluation task, "Turkers" (judges, or workers) were asked to choose, from A, B, and C, the three animated visualizations generated from a single input sentence, which one best depicted the sentence. The instructions given were:

Below is a sentence and three videos generated by an automatic natural language processing and simulation system using that sentence as input. Read the sentence, then watch each of the videos. Choose which of the videos, A, B or C, you think best shows what the sentence is saying.
• There are options if you think multiple videos satisfy the description equally well.
• "None" is also a valid choice if you think none of the videos accurately depict what is described.
• Use your intuition. It's better to go with your first instinct than to overthink the answer.

The multiple choice options offered were, where $DESCRIPTION refers to the input sentence for the individual HIT:

• Video A best represents $DESCRIPTION
• Video B best represents $DESCRIPTION
• Video C best represents $DESCRIPTION
• Videos A and B represent $DESCRIPTION equally well
• Videos A and C represent $DESCRIPTION equally well
• Videos B and C represent $DESCRIPTION equally well
• All videos represent $DESCRIPTION equally well
• None of these videos represent $DESCRIPTION well

Workers could also optionally briefly explain their answers.


Figure 3.13: HET1 task interface

Results from this experiment are presented in Section 4.1.

Human Evaluation Task #2 (HET2)

The second evaluation task flips HET1 around, giving workers a single video and the three associated candidate captions generated as described in Section 3.3.3. Workers were asked to choose the best description(s) for the video, if any. The instructions given were:

Below is a video generated by an automatic natural language processing and simulation system. Watch the video, then choose the sentence, 1, 2 or 3, that best describes what's being done in the video.
• There are options if you think multiple sentences describe the video equally well.
• "None" is also a valid choice if you think none of the sentences accurately describe what is depicted.
• Use your intuition. It's better to go with your first instinct than to overthink the answer.

The multiple choice options offered were, where $CANDIDATE{1, 2, 3} refer to the candidate sentences:

• $CANDIDATE1 best describes the events in the video
• $CANDIDATE2 best describes the events in the video
• $CANDIDATE3 best describes the events in the video
• $CANDIDATE1 and $CANDIDATE2 describe the events in the video equally well
• $CANDIDATE1 and $CANDIDATE3 describe the events in the video equally well
• $CANDIDATE2 and $CANDIDATE3 describe the events in the video equally well
• All sentences describe the events in the video equally well
• None of these sentences describe the events in the video well

As in HET1, workers could also optionally briefly explain their answers. Results from this experiment are presented in Section 4.2.

Figure 3.14: HET2 task interface

For these experiments, the number of possible visualizations per description or possible descriptions per visualization in each HIT was set to 3 in order to keep the number of choices per HIT as low as possible while still allowing for more than a pairwise comparison. Each HIT was completed by 8 individual workers, for a total of 35,808 individual evaluations: 8,952 for HET1 and 26,856 for HET2. Workers were paid $0.01 per individual task. Results underwent a QA process based on methods similar to those developed by Ipeirotis et al. (2010).


3.4.2

Automatic Evaluation

Automatic Evaluation Task (AET)

HET2 effectively requires annotators to predict which sentence was used to generate the visualization in question. Annotator agreement on a "most correct" description that correctly selects the original sentence indicates an event visualized (with variable values for underspecified parameters) that conforms to human notions, while disagreement, or agreement on an incorrect selection for the input sentence, should indicate a high likelihood that one or more variable values fall outside of prototypical ranges, resulting in confusion on the judges' parts as to what the original predicate was. As this closely resembles a classification task with a discrete label set, it is possible to create an analogous task using machine learning. Automatic evaluation allows us to quickly assess more visualizations per input sentence than human evaluation methods, but as mentioned runs a risk of evaluating according to an overfitted model. However, HET2 provides a human-evaluated dataset, and with it a unique opportunity to compare the machine learning results with a well-suited gold standard.

For this machine evaluation, a baseline maximum entropy logistic regression classifier was constructed using the Python Natural Language Toolkit (NLTK) (Loper and Bird, 2002) that took in the feature vectors collected during the automatic event capture process and, for each vector, predicted the most likely original input sentence from the three candidates provided to evaluators, as well as choosing the most likely input from all 1,119 input sentences in the test set. 10-fold cross-validation with a convergence threshold of .0001 and a cutoff of 1,000 training steps was conducted on three levels of granularity:

1. Predicting the verb only
2. Predicting the verb plus prepositional adjunct, if one exists
3. Predicting the entire input sentence

Once a baseline accuracy was captured, a multilayer neural network was constructed, consisting of four layers of 10, 20, 20, and 10 nodes, respectively, using the TensorFlow framework (Abadi et al., 2016). Using the aforementioned feature vectors as input, a variety of variations of this network were run:

1. A "vanilla" four-layer DNN
2. DNN with features weighted by IDF metric (see below)
3. DNN with IDF weights on the discrete features only⁴
4. DNN excluding feature values and including IDF-weighted binary presence or absence only
5. A combined linear-DNN classifier, using linear estimation for continuous features and DNN classification for discrete features
6. Combined linear-DNN classifier with features weighted by IDF metric
7. Combined linear-DNN classifier with IDF weights on the discrete features only
8. Combined linear-DNN classifier excluding feature values and including IDF-weighted binary presence or absence only

IDF as a feature weight

Following the intuition that the presence or absence of an underspecified parameter feature can be a strong predictor of the type of motion class, and the fact that certain classes of underspecified parameters occur across multiple motion classes while others are more specific, it becomes clear that this is effectively a term frequency-inverse document frequency metric of informativity across motion classes, where the feature vector is the "document" and the feature is the "term." Moreover, since each feature occurs a maximum of one time in each feature vector, tf for any feature and any vector is 1, leaving tf × idf = idf = log(|D| / |{d ∈ D : t ∈ d}|), where D is the "corpus" of feature vectors, d is an individual feature vector, and t is an individual feature, as a coarse-grained informativity metric of a given feature across this dataset.

10-fold cross-validation was run on all these neural net classifier variations at 1,000, 2,000, 3,000, 4,000, and 5,000 training steps, for a total of 1,200 iterations. Results from this experiment are presented in Section 4.3.
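The IDF weight just described can be computed directly over the logged feature vectors; the following is a minimal sketch under the assumption that each vector is a dict of valued features (the helper is illustrative, not the code used for these experiments):

import math

def idf_weights(vectors):
    """Compute IDF for each feature over a corpus of sparse feature vectors
    (tf is always 1, so tf-idf reduces to idf = log(|D| / |{d : t in d}|))."""
    n = len(vectors)
    df = {}
    for vec in vectors:
        for feature in vec:                 # each feature occurs at most once per vector
            df[feature] = df.get(feature, 0) + 1
    return {feature: math.log(n / count) for feature, count in df.items()}

corpus = [{"TranslocSpeed": "1.9"},
          {"TranslocSpeed": "3.6", "MotionManner": "put(grape,near(apple))"}]
print(idf_weights(corpus))   # the rarer MotionManner feature receives the higher weight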

⁴"Discrete" features are considered to be those features that are maximally specified by making a value assignment choice out of a set of categories. These features are motion manner, rotation axis, symmetry axis, placement order, and relative orientation. All others are considered to be "continuous" features.


Chapter 4

Results and Discussion

Raw and collated data from all tasks is available at https://github.com/nkrishnaswamy/thesis-docs-utils. This chapter presents a selection of the most interesting and informative results for each of the motion predicates in the study. Complete data is available as evaluation logs and SQL databases, containing data conditioned on every combination of recorded features, in the /docs/UnderspecificationAnalysis/ directory of the GitHub repository above. As discussed in the footnote at the end of the previous chapter, the features logged may be divided into "discrete" features, those features that are maximally specified by making a value assignment choice out of a set of categories; and "continuous" features, those features that take values in a continuous range. For analysis, continuous feature values were all plotted as a probability density over the relevant continuous random variable, and partitioned into subsets. This evaluation uses quintiles (q = 5), although other quantiles and partitions can be easily generated by passing alternate parameters to the evaluation scripts het1-generic-eval.py and het2-generic-eval.py (source code available on GitHub in the /utils/analysis/human-eval/ directory of the repository listed above). As the Automatic Evaluation Task is a neural net classifier well-equipped to handle continuous features, this partitioning was not required for that task.
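A minimal sketch of the quintile partitioning used for the continuous features, assuming the per-video parameter values have already been extracted (NumPy is used here for brevity; the released het1/het2 evaluation scripts may implement this differently):

import numpy as np

def quantile_bins(values, q=5):
    """Partition continuous feature values into q equal-probability bins (quintiles by default),
    returning the bin edges and each value's bin index."""
    values = np.asarray(values, dtype=float)
    edges = np.quantile(values, np.linspace(0, 1, q + 1))
    # clip so the maximum value falls in the last bin rather than overflowing it
    indices = np.clip(np.searchsorted(edges, values, side="right") - 1, 0, q - 1)
    return edges, indices

speeds = [0.3, 1.1, 2.4, 2.9, 3.3, 4.0, 4.7]
edges, bins = quantile_bins(speeds)
print(edges)   # the QU1..QU4 cut points used to condition the acceptability tables below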


4.1

Human Evaluation Task 1 Results

The tables below show the probability that an arbitrary judge would, given a description and a Monte Carlo visualization generated from that description, judge that visualization as best depicting the description, conditioned on various relevant parameters in the visualization. As multiple choices were allowed and eight evaluators judged each task, probabilities in each table will likely not sum to 1.
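These conditional probabilities can be computed directly from the per-worker judgments; the following is a minimal sketch assuming a table with one row per (video, worker) judgment (the column names are hypothetical and need not match the released logs):

import pandas as pd

# One row per (video, worker) judgment: the conditioning parameter and a 0/1 acceptability flag
judgments = pd.DataFrame({
    "respec_pred": ["put", "put", "stack", "stack", "lean"],
    "acceptable":  [1, 1, 0, 1, 1],
})

# P(acc | parameter value), plus the mean and (sample) standard deviation across parameter values
p_acc = judgments.groupby("respec_pred")["acceptable"].mean()
print(p_acc)
print("mu =", p_acc.mean(), "sigma =", p_acc.std())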

4.1.1

"Move"

  Pred     P(acc|Pred)
  stack    0.4375
  flip     0.4539
  spin     0.4875
  roll     0.5066
  lean     0.5769
  slide    0.6563
  turn     0.6818
  lift     0.7500
  put      0.7857

  µ ≈ 0.5929, σ ≈ 0.1304

Table 4.1: Acceptability judgments and statistical metrics for "move x" visualizations, conditioned on respecification predicate

66

CHAPTER 4. RESULTS AND DISCUSSION “spin,” “roll”), with “lean,” which as demonstrated in Section 2.7 involves both a rotation sequence and a translocation, falling somewhere in the middle. This may suggest that humans have a preference for translocations over rotations as prototypical instantiations or mental simulations of “move.” The exceptions, “stack” and “turn,” may then potentially be explained as follows: • “Stack” randomizes the placement order of the objects. If, for example, “move the block” is respecified as “stack the block and the x” where x is some randomly chosen other object, and the randomly-assigned placement order feature places the block on the bottom of the stack, x, not the block becomes the moving object, and judges are unlikely to evaluate such a visualization as acceptable for “move the block.” Conditioning on placement order would reveal such circumstances. • When “move” is respecified as “turn,” “turn” is required to not be respecified into another of its own subclasses (see Section 3.1.2), meaning that “move” in this case is respecified to a random rotation, or minimally specified turning. This this type of “turn” is considered more acceptable than, say, “spin,” “roll,” or “flip” may suggest an aversion to overspecifying an underspecified predicate such as “move.” Interpreting a “move” as a “spin,” “roll,” or “flip” is somewhat less perspicacious about the intent of labeling something a “move” than a more generic interpretation of “turn” is.

4.1.2

"Turn"

  Pred    P(acc|Pred)
  spin    0.4875
  roll    0.5066
  lean    0.5769
  flip    0.6195

  µ ≈ 0.5476, σ ≈ 0.0614

Table 4.2: Acceptability judgments and statistical metrics for "turn x" visualizations, conditioned on respecification predicate

We see very similar results for "turn" as for "move"—identical in fact for all predicate respecification options except for "flip"—due most likely to the fact that the same number of total visualizations (45) were generated for "turn x" as for "move x" over all x. "Lean" once again is the respecification that falls closest to the mean probability of acceptability for all respecified turns, although "flip" is preferred, falling approximately 1.17 standard deviations above the mean.

• P(acc|flip) ≈ 0.6195 ≈ µ + 1.17σ

Assessing this qualitatively, it can be conjectured that "flip" is a very obvious or dramatic kind of a turn. "Spin" could be argued to be this, too, but is in fact the least preferred, falling 0.98σ below the mean probability of acceptability for all respecified turns.

• P(acc|spin) ≈ 0.4875 ≈ µ − 0.98σ

This may be because "spin" is overspecified/less perspicacious relative to a generic "turn," especially if it rotates past 360° away from its starting orientation, passing out of the range of "turn" into a definitive "spin."

  Rot          P(acc|Rot)
  (0,QU1)      0.5179
  (QU1,QU2)    0.5500
  (QU2,QU3)    0.6637
  (QU3,QU4)    0.6198
  (QU4,∞)      0.4338

  µ ≈ 0.5570, σ ≈ 0.0896

Table 4.3: Acceptability judgments and statistical metrics for unrespecified "turn x" visualizations, conditioned on rotation angle

For the instances of "turn" that were not respecified to a different predicate, no clear trend appears based on the length of the rotation enacted, although evaluators did seem more likely to judge as acceptable turns that ended between QU2 and QU4, or moderate-to-long turns, while leaning against turns in the highest interval, perhaps because, like "spin" above, the rotation went on long enough to change the judge's qualitative labeling of the event. Evaluators seem to prefer events like this that are moderate in duration. Continuing continuously iterated events like "turn" for too long may in fact change the event class in most people's minds.

• P(acc|(QU2,QU3)) ≈ 0.6637 ≈ µ + 1.19σ
• P(acc|(QU3,QU4)) ≈ 0.6198 ≈ µ + 0.70σ
• P(acc|(QU4,∞)) ≈ 0.4338 ≈ µ − 1.38σ

4.1.3 “Roll”

  Dist        P(acc|QU)
  (0,QU1)     0.4539
  (QU1,QU2)   0.5319
  (QU2,QU3)   0.4830
  (QU3,QU4)   0.4800
  (QU4,∞)     0.6208

  µ ≈ 0.5139, σ ≈ 0.0661

Table 4.4: Acceptability judgments and statistical metrics for “roll x” visualizations, conditioned on path length

Since all parameters of “roll” can be calculated from object parameters given the length of the path traveled, we can condition acceptability judgments on this parameter; however, there does not appear to be a clear trend. Rolls in the longest interval appear to be quite strongly preferred by judges, as P(acc|(QU4,∞)) ≈ 0.6208 ≈ µ + 1.62σ, perhaps due to the very evident and obvious nature of a rolling motion along a long path. This observation, coupled with the preference for “flip” as a “turn,” may suggest a consideration of obviousness as a criterion for prototypicality of a motion event.

4.1.4 “Slide”

  Speed       P(acc|QU)
  (0,QU1)     0.5166
  (QU1,QU2)   0.6056
  (QU2,QU3)   0.6110
  (QU3,QU4)   0.6107
  (QU4,∞)     0.5715

  µ ≈ 0.5831, σ ≈ 0.0406

Table 4.5: Acceptability judgments and statistical metrics for “slide x” visualizations, conditioned on translocation speed

For “slide,” no clear pattern appeared when conditioning on path length. When conditioning on translocation speed, we see preference for the middle three intervals again. The lowest (here slowest) interval is the least preferred (P(acc|(0,QU1)) ≈ 0.5166 ≈ µ - 1.64σ). If the speed value generated is close enough to 0 it may be hard to see the object moving at all. The highest (fastest) interval falls closest to the mean probability for acceptability. This may exhibit the balancing act between preference for a moderate speed of motion and the “obvious” sliding that a fast motion speed would also demonstrate.

4.1.5 “Spin”

  Dist        P(acc|“roll”,QU)
  (0,QU1)     0.2500
  (QU1,QU2)   0.5625
  (QU2,QU3)   0.5313
  (QU3,QU4)   0.4063
  (QU4,∞)     0.6250

  µ ≈ 0.4750, σ ≈ 0.1489

Table 4.6: Acceptability judgments and statistical metrics for “spin x” visualizations respecified as “roll x,” conditioned on path length

  Axis   P(acc|Axis)
  X      0.6466
  Y      0.5263
  Z      0.5947

Table 4.7: Acceptability judgments for unrespecified “spin x” visualizations, conditioned on rotation axis

“Spin” may be optionally respecified as a “roll,” as shown in Table 4.6, and in all other cases is spun around a random axis (Table 4.7). Spins respecified as “roll” show strong dispreference for those rolls that travel the shortest distances and, with the exception of a drop in the interval (QU3,QU4), an overall increase in probability of acceptability.

• P(acc|(0,QU1)) ≈ 0.2500 ≈ µ - 1.51σ
• P(acc|(QU4,∞)) ≈ 0.6250 ≈ µ + 1.01σ

In cases (which were the majority of cases) where “spin” was enacted as a rotation around a randomly-chosen axis, evaluators preferred rotation around the X axis by over 5% when compared to the Z axis, and by over 12% when compared to rotation around the Y axis. Some of this may be due to the positioning of the camera, looking down the Z axis at the intersection of the X and Y axes, where asymmetrical objects that rotate toward the camera do so very evidently, invoking the obviousness criterion discussed above with a factor of perspective independence. For objects that look the same from all sides, such as a plate, it may have been difficult for judges to see the rotation around the Y axis.

4.1.6 “Lift”

  Speed       Dist        P(acc|QUs,QUd)
  (0,QU1)     (0,QU1)     0.4511
  (0,QU1)     (QU2,QU3)   0.3750
  (0,QU1)     (QU3,QU4)   0.4537
  (0,QU1)     (QU4,∞)     0.5625
  (QU1,QU2)   (0,QU1)     0.5161
  (QU1,QU2)   (QU1,QU2)   0.3750
  (QU1,QU2)   (QU2,QU3)   0.4063
  (QU1,QU2)   (QU3,QU4)   0.4712
  (QU1,QU2)   (QU4,∞)     0.6898
  (QU2,QU3)   (0,QU1)     0.4271
  (QU2,QU3)   (QU1,QU2)   0.5187
  (QU2,QU3)   (QU2,QU3)   0.5648
  (QU2,QU3)   (QU3,QU4)   0.4658
  (QU3,QU4)   (0,QU1)     0.4063
  (QU3,QU4)   (QU1,QU2)   0.6193
  (QU3,QU4)   (QU2,QU3)   0.4756
  (QU3,QU4)   (QU3,QU4)   0.5242
  (QU3,QU4)   (QU4,∞)     0.5789
  (QU4,∞)     (QU2,QU3)   0.5000
  (QU4,∞)     (QU3,QU4)   0.3728
  (QU4,∞)     (QU4,∞)     0.6000

  µs,d ≈ 0.5017, σs,d ≈ 0.0839
  µs=(0,QU1) ≈ 0.4606, σs=(0,QU1) ≈ 0.0771
  µs=(QU1,QU2) ≈ 0.5279, σs=(QU1,QU2) ≈ 0.1063
  µs=(QU2,QU3) ≈ 0.4941, σs=(QU2,QU3) ≈ 0.0603
  µs=(QU3,QU4) ≈ 0.5209, σs=(QU3,QU4) ≈ 0.0840
  µs=(QU4,∞) ≈ 0.4909, σs=(QU4,∞) ≈ 0.1139
  µd=(0,QU1) ≈ 0.4502, σd=(0,QU1) ≈ 0.0476
  µd=(QU1,QU2) ≈ 0.5043, σd=(QU1,QU2) ≈ 0.1228
  µd=(QU2,QU3) ≈ 0.4643, σd=(QU2,QU3) ≈ 0.0756
  µd=(QU3,QU4) ≈ 0.4575, σd=(QU3,QU4) ≈ 0.0545
  µd=(QU4,∞) ≈ 0.6078, σd=(QU4,∞) ≈ 0.0568

Table 4.8: Acceptability judgments and statistical metrics for “lift x” visualizations, conditioned on translocation speed and distance traversed

Aside from a (fairly strong) preference for “longer” lifts (µd=(QU4,∞) ≈ 0.6078 ≈ µs,d + 1.26σs,d), no clear patterns emerge when conditioning “lift” instances on translocation speed or distance traveled alone. When conditioning on the joint probability of both parameters, we find that evaluators were less likely to rate a “lift” event as acceptable when speed fell in (0,QU1) and distance fell in (QU1,QU2), when speed fell in (QU1,QU2) and distance fell in (QU1,QU2), or when speed fell in (QU4,∞) and distance fell in (QU3,QU4). The overall distribution is fairly random, with relatively high standard deviations across all speed and distance intervals, but these figures seem to suggest a slight preference for “moderate” or “average” speed and distance values for “lift.”
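Table 4.8 conditions the same judgments jointly on two parameters. A compact way to produce both the joint cells and the per-interval marginal means is a two-key group-by, sketched below with pandas; the column names are hypothetical, and since it is not shown here whether the reported µs,d and σs,d are taken over raw judgments or over the joint cell means, both aggregations appear.

    import pandas as pd

    def lift_statistics(df):
        """df: one row per judgment, with hypothetical columns
        'speed_qu', 'dist_qu' (interval labels) and 'accepted' (0/1)."""
        # P(acc | speed interval, distance interval): the joint cells of Table 4.8
        joint = df.groupby(["speed_qu", "dist_qu"])["accepted"].mean()
        # per-interval means over the joint cells (the mu_s and mu_d entries)
        mu_by_speed = joint.groupby(level="speed_qu").mean()
        mu_by_dist = joint.groupby(level="dist_qu").mean()
        # two candidate overall aggregations
        cell_level = (joint.mean(), joint.std())          # mean/std over the cells
        judgment_level = (df["accepted"].mean(),          # raw acceptance rate
                          df["accepted"].std())
        return joint, mu_by_speed, mu_by_dist, cell_level, judgment_level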

4.1.7 “Put”

  QSR (start)   P(acc|QSR)      QSR (end)     P(acc|QSR)
  behind(y)     0.5497          behind(y)     0.5474
  in front(y)   0.5692          in front(y)   0.5816
  left(y)       0.5753          left(y)       0.4995
  right(y)      0.5725          right(y)      0.5560
  on(y)         N/A             on(y)         0.6683

  µstart ≈ 0.5667, σstart ≈ 0.0116
  µend ≈ 0.5725, σend ≈ 0.0628

Table 4.9: Acceptability judgments and statistical metrics for “put x touching y” visualizations, conditioned on relations between x and y at event start and completion


  Movement (M)           P(acc|M)
  behind→behind(y)       0.5347
  behind→in front(y)     0.4758
  behind→left(y)         0.5014
  behind→right(y)        0.4888
  behind→on(y)           0.7453
  in front→behind(y)     0.4523
  in front→in front(y)   0.6447
  in front→left(y)       0.4601
  in front→right(y)      0.5756
  in front→on(y)         0.6234
  left→behind(y)         0.5732
  left→in front(y)       0.5853
  left→left(y)           0.5266
  left→right(y)          0.5211
  left→on(y)             0.6492
  right→behind(y)        0.5406
  right→in front(y)      0.5786
  right→left(y)          0.4777
  right→right(y)         0.5847
  right→on(y)            0.7081

  µM ≈ 0.5624, σM ≈ 0.0811
  µ→beh ≈ 0.5252, σ→beh ≈ 0.0515
  µ→fr ≈ 0.5711, σ→fr ≈ 0.0701
  µ→l ≈ 0.4911, σ→l ≈ 0.0289
  µ→r ≈ 0.5426, σ→r ≈ 0.0455
  µ→on ≈ 0.6815, σ→on ≈ 0.0554

Table 4.10: Acceptability judgments and statistical metrics for “put x touching y” visualizations, conditioned on x movement relative to y

  Dist (start)   P(acc|QU)      Dist (end)    P(acc|QU)
  (0,QU1)        N/A            (0,QU1)       0.7523
  (QU1,QU2)      0.3542         (QU1,QU2)     0.6207
  (QU2,QU3)      0.3829         (QU2,QU3)     0.3890
  (QU3,QU4)      0.4444         (QU3,QU4)     0.3655
  (QU4,∞)        0.4470         (QU4,∞)       0.1295

  µstart ≈ 0.4071, σstart ≈ 0.0461
  µend ≈ 0.4514, σend ≈ 0.2419

Table 4.11: Acceptability judgments and statistical metrics for “put x near y” visualizations, conditioned on distance between x and y at event start and completion


  Movement (M)          P(acc|M)
  (QU1,QU2)→(0,QU1)     0.7625
  (QU1,QU2)→(QU1,QU2)   0.4044
  (QU1,QU2)→(QU2,QU3)   0.2232
  (QU1,QU2)→(QU3,QU4)   0.1667
  (QU1,QU2)→(QU4,∞)     0.0682
  (QU2,QU3)→(0,QU1)     0.6848
  (QU2,QU3)→(QU1,QU2)   0.5703
  (QU2,QU3)→(QU2,QU3)   0.3750
  (QU2,QU3)→(QU3,QU4)   0.2788
  (QU2,QU3)→(QU4,∞)     0.1488
  (QU3,QU4)→(0,QU1)     1.000
  (QU3,QU4)→(QU1,QU2)   0.3750
  (QU3,QU4)→(QU2,QU3)   0.3750
  (QU3,QU4)→(QU3,QU4)   0.5417
  (QU3,QU4)→(QU4,∞)     0.2083
  (QU4,∞)→(0,QU1)       0.7698
  (QU4,∞)→(QU1,QU2)     0.6863
  (QU4,∞)→(QU2,QU3)     0.4217
  (QU4,∞)→(QU3,QU4)     0.4162
  (QU4,∞)→(QU4,∞)       0.1300

  µM ≈ 0.4303, σM ≈ 0.2521
  µ→(0,QU1) ≈ 0.8043, σ→(0,QU1) ≈ 0.1360
  µ→(QU1,QU2) ≈ 0.5090, σ→(QU1,QU2) ≈ 0.1462
  µ→(QU2,QU3) ≈ 0.3487, σ→(QU2,QU3) ≈ 0.0865
  µ→(QU3,QU4) ≈ 0.3509, σ→(QU3,QU4) ≈ 0.1631
  µ→(QU4,∞) ≈ 0.1388, σ→(QU4,∞) ≈ 0.0577

Table 4.12: Acceptability judgments and statistical metrics for “put x near y” visualizations, conditioned on start and end distance intervals between x and y


  Dist (end)   QSR           P(acc|QU,QSR)
  (0,QU1)      behind(y)     0.7730
  (0,QU1)      in front(y)   0.7349
  (0,QU1)      left(y)       0.7338
  (0,QU1)      right(y)      0.7712
  (QU1,QU2)    behind(y)     0.6701
  (QU1,QU2)    in front(y)   0.5797
  (QU1,QU2)    left(y)       0.6675
  (QU1,QU2)    right(y)      0.5819
  (QU2,QU3)    behind(y)     0.4151
  (QU2,QU3)    in front(y)   0.3644
  (QU2,QU3)    left(y)       0.3945
  (QU2,QU3)    right(y)      0.3825
  (QU3,QU4)    behind(y)     0.1713
  (QU3,QU4)    in front(y)   0.4308
  (QU3,QU4)    left(y)       0.2093
  (QU3,QU4)    right(y)      0.4699
  (QU4,∞)      behind(y)     0.0972
  (QU4,∞)      in front(y)   0.1401
  (QU4,∞)      left(y)       0.1250
  (QU4,∞)      right(y)      0.1348

  µend,qsr ≈ 0.4424, σend,qsr ≈ 0.2380
  µend=(0,QU1) ≈ 0.7532, σend=(0,QU1) ≈ 0.0218
  µend=(QU1,QU2) ≈ 0.6248, σend=(QU1,QU2) ≈ 0.0508
  µend=(QU2,QU3) ≈ 0.3891, σend=(QU2,QU3) ≈ 0.0213
  µend=(QU3,QU4) ≈ 0.3203, σend=(QU3,QU4) ≈ 0.1518
  µend=(QU4,∞) ≈ 0.1243, σend=(QU4,∞) ≈ 0.0191
  µqsr=beh ≈ 0.4253, σqsr=beh ≈ 0.2971
  µqsr=fr ≈ 0.4500, σqsr=fr ≈ 0.2246
  µqsr=l ≈ 0.4260, σqsr=l ≈ 0.2700
  µqsr=r ≈ 0.4681, σqsr=r ≈ 0.2362

Table 4.13: Acceptability judgments and statistical metrics for “put x near y” visualizations, conditioned on distance between x and y and POV-relative orientation at event completion

While “put on” and “put in” judgments do not show significant variation in acceptability based on their one underspecified parameter, translocation speed, some very interesting results appear in the examination of “put touching” and “put near” visualizations. We observe a lower likelihood for visualizations to be judged acceptable when the moving object moves from behind the stationary object to in front of it, and vice versa.

P(accept|behind→in front(y)) is approximately 0.4758, which is approximately 1.07 standard deviations below the mean of the population for all starting/ending QSR relation pairs. This may be explained as an effect of the point of view imposed by the camera position, which may make it difficult to see whether an object behind another object is actually making contact and satisfying the EC relation required by “touching,” especially if a larger object is occluding a smaller object.

Visualizations where the moving object ends to the left of the stationary object were also less likely to be judged acceptable. P(accept|left(y)) is approximately 1.16 standard deviations below the mean likelihood of acceptance over the population for all event-end QSR relations. This is apparently independent of the moving object’s starting location relative to the stationary object, but the dispreference is more significant for objects that start in front of, or to the right of, their destination.

• P(acc|in front→left(y)) ≈ 0.4601 ≈ µM - 1.26σM
• P(acc|right→left(y)) ≈ 0.4777 ≈ µM - 1.04σM

This could also be explained as an effect of the POV, in particular the distortion it causes in cases where larger objects closer to the camera (including laterally) may occlude objects further away, making it difficult to assess the satisfaction of the EC relation. Therefore, some objects that move from the right of another object to the left of it also move away from the camera, meaning that this effect is analogous to that seen in the behind(y) relations, and explains the similar result seen for in front→left(y) motions. However, this hypothesis would not explain the absence of a symmetric inclination against right(y) relations, so more experimentation or analysis is needed. Some of this may be related to features of the objects themselves, which are not strongly controlled for (discussed further in Section 4.5).

There is a strong preference for the on(y) specification of touching(y) over all others, which matches linguistic intuition. “On” necessarily implies an EC relation, which is expressed in the VoxML (Figure 2.11). P(accept|on(y)) falls approximately 1.52 standard deviations above the mean probability of acceptability of the population for all event-end relations. The strongest preference is for motion from behind(y) to on(y), where P(accept|behind→on(y)) is approximately 2.25 standard deviations above the mean likelihood for acceptability over the whole population conditioned on start-to-end motion. In terms of point of view effects, this may be due to an occluded object being brought into view and very obviously made to touch its destination in a visualization with no obstructed view. Where “touching” is an underspecified predicate, the relations entailed by “on,” while arguably somewhat overspecified as an interpretation of “touching” alone, seem to most clearly satisfy the qualitative specification of “touching” out of the options available. Notably, it is the only one not dependent on the relative point of view, suggesting that the relative point of view introduces some noise or confusion into the human judgments, potentially for the reasons discussed above, among others.

For “put near,” evaluators unsurprisingly preferred visualizations where the two objects ended up close to each other to those where the objects ended further apart.

• P(acc|(0,QU1)) ≈ µend + 1.24σend
• P(acc|(QU1,QU2)) ≈ µend + 0.70σend

While this seems like an obvious result, the fact that the quantitative data comports with intuition lends credence to the soundness of this simulation method of determining the presuppositions underlying motion and relation predicates. In the first three distance intervals, we observe a slight preference for events where the moving object finishes the event behind the stationary object.

• P(acc|(0,QU1),behind(y)) ≈ µend=(0,QU1),qsr + 0.90σend=(0,QU1),qsr
• P(acc|(QU1,QU2),behind(y)) ≈ µend=(QU1,QU2),qsr + 0.89σend=(QU1,QU2),qsr
• P(acc|(QU2,QU3),behind(y)) ≈ µend=(QU2,QU3),qsr + 1.22σend=(QU2,QU3),qsr

This may be an effect of foreshortening caused by the point of view, as with some of the “touching” specifications, which causes an object x which is behind(y) to appear closer to y than it actually is. When conditioning on the joint distribution of the distance interval and the QSR relation, as shown in Table 4.13, there is some apparent confusion in judgments of events in the fourth distance interval, where σ for the population of P(accept|QSR) is greater than .15, whereas in all other intervals σ for P(accept|QSR) falls between .019 and .051. This is possibly a factor of workers being unable to judge purely from the visuals whether an object that began its movement from a position in the fourth distance interval relative to the stationary object actually ended the event nearer than it began, whereas in preceding intervals, the resulting location was more likely to be unambiguously “near” regardless of starting location.

Table 4.12 shows the judges’ preferences for objects that moved between the different distance intervals, independent of direction or orientation. The quintiles were calculated based on the distributions of distances between objects at the end of the “put near” event, which is why Tables 4.11 and 4.12 show no objects beginning the event in the lowest distance interval. There is a clear preference for objects that move from a far interval to a near one, and conversely very low proportions of “acceptable” judgments for visualizations where the object moved from a near distance interval to a farther one. This reinforces the intuition that a qualitative term like “near” is understood to be inherently relative (Peters, 2007).
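The interval labels (0,QU1) through (QU4,∞) used throughout these tables are quintile bins; as noted above for “put near,” the cut points come from the empirical distribution of the parameter (here, the distance between the objects at event completion). The following is a minimal sketch of that binning using NumPy; the variable names are illustrative only.

    import numpy as np

    INTERVALS = ["(0,QU1)", "(QU1,QU2)", "(QU2,QU3)", "(QU3,QU4)", "(QU4,inf)"]

    def quintile_cuts(values):
        """QU1..QU4 cut points from the empirical distribution of a parameter."""
        return np.percentile(values, [20, 40, 60, 80])

    def interval_label(x, cuts):
        """Assign a value to one of the five quintile intervals."""
        return INTERVALS[int(np.searchsorted(cuts, x))]

    # e.g., for "put near": cuts are computed from the end-of-event distances, and both
    # the start and end distance of each visualization are then labeled with
    # interval_label before conditioning acceptability on those labels (Tables 4.11-4.12).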

4.1.8 “Lean”

  Angle       P(acc|QU)
  (0,QU1)     0.6117
  (QU1,QU2)   0.6403
  (QU2,QU3)   0.6694
  (QU3,QU4)   0.6443
  (QU4,∞)     0.6502

  µ ≈ 0.6432, σ ≈ 0.0208

Table 4.14: Acceptability judgments and statistical metrics for “lean x” visualizations, conditioned on rotation angle

For “lean,” the only parameter value left underspecified is the angle of the lean.1 The data gathered from the evaluators suggest near-equal preference for all angle intervals, with a low standard deviation (0.0208). The one clear dispreference is for the lowest interval, those angles closest to 25◦ (P(acc|(0,QU1)) ≈ 0.6117 ≈ µ - 1.51σ). This may be because the Unity engine physics takes over once the object has completed its motion, and on occasion with certain objects (e.g., a book), leaning it at a low angle causes the force of gravity to overwhelm and break the support relation between the leaning object and the supporting object, causing it to fall. This suggests that prototypical events may be required to remain satisfied after the application of any postconditions or effects imposed by world physics, independently of the event satisfaction itself.

1 The speed of rotation and translocation are also underspecified, but these are properties of the [[TURN]] and [[PUT]] subevents of “lean,” not the lean itself.


4.1.9 “Flip”

  Rot Axis   Symmetry Axis   P(acc|Axisrot,Axissym)
  X          Y               0.6193
  X          Z               0.6667
  Y          X               0.5417
  Y          Z               0.7500
  Z          X               0.3137
  Z          Y               0.6645

  µ ≈ 0.5927, σ ≈ 0.1527

Table 4.15: Acceptability judgments and statistical metrics for “flip x” visualizations, conditioned on rotation axis and symmetry axis

On “flip” instances, the outliers are objects with symmetry around the X axis rotating around the Z axis and objects with symmetry around the Z axis rotating around the Y axis.

1. P(acc|Z,X) ≈ 0.3137 ≈ µ - 1.83σ
2. P(acc|Y,Z) ≈ 0.7500 ≈ µ + 1.03σ

Axisrot = Z and Axissym = X is very strongly dispreferred, while Axisrot = Y and Axissym = Z is somewhat strongly preferred. As with “spin” (Table 4.7), this may be a factor of the object symmetry relative to the camera placement, where a rotation around the Y axis shows all sides of the object, making the “flip” motion obvious (as long as the object is not symmetric around the Y axis). A kind of conditioning to control for relative point of view, similar to that calculated for the “put touching” events, could clarify whether or not this is in fact the case.

4.1.10 “Close”

  Pred   P(acc|Pred)
  turn   0.6818
  put    0.7857

Table 4.16: Acceptability judgments for “close x” visualizations, conditioned on motion manner

For “close,” we see two types of motions: those where a subcomponent of the object is turned to close it (e.g., “close the book”), and those where another object is put on top of the object to be closed (e.g., “close the cup” → “put the lid (or disc) on the cup”). Evaluators seem to prefer the “put” type to the “turn” type. Although no instances of “close the book” were realized by placing another object on top of an open book, it seems unlikely that that would be considered an acceptable visualization or interpretation of “close the book.” It is likely, then, that acceptability of a “close” event, since it requires predicate respecification, is strongly conditioned on the features of the object rather than the motion event itself.

4.1.11 “Open”

  Pred   P(acc|Pred)
  turn   0.6818
  move   0.8750

Table 4.17: Acceptability judgments for “open x” visualizations, conditioned on motion manner

The same is probably true for “open.” Here we have two cases: the reverse of the “turn” event from “close,” where some subcomponent is turned to an arbitrary angle to open the object (such as a book), and instances of “open” where some object closing another object is moved to a different location or positioning (e.g., “move the lid” where the lid is currently sealing a cup). Evaluators strongly preferred the “move” event to the “turn” event.

4.2 Human Evaluation Task 2 Results

The tables below show the probability that an arbitrary judge would, given a Monte Carlo-generated visualization of the given predicate, identify that predicate as best describing the visualization, conditioned on various relevant parameters in the visualization. As multiple choices were allowed and eight judges evaluated each task, probabilities in each table will likely not sum to 1. The probabilities shown here are generally lower than the probabilities that fall out of HET1, due primarily to the fact that for each single visualization, workers were given three separate labels to choose from instead of a single label for three visualizations, spreading the distribution out over three potential captions and rendering the overall task three times as large.
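Concretely, each probability in the tables that follow is the fraction of judge-selections that included the target label among all judges shown a visualization with the given parameter value; since a judge may select several of the offered captions, these fractions need not sum to 1 across labels. A minimal sketch follows, with hypothetical record fields.

    from collections import defaultdict

    def selection_probability(tasks, label, param):
        """P(select=label | param value) for HET2-style tasks.

        tasks: list of dicts such as
               {"rot_interval": "(QU2,QU3)", "selections": [{"turn"}, {"turn", "move"}, ...]},
               where 'selections' holds one set of chosen labels per judge
               (eight judges per task in this experiment).
        label: the caption of interest, e.g. "turn".
        param: the (hypothetical) field to condition on, e.g. "rot_interval".
        """
        chose, shown = defaultdict(int), defaultdict(int)
        for task in tasks:
            value = task[param]
            for judge_choice in task["selections"]:
                shown[value] += 1
                chose[value] += int(label in judge_choice)
        return {v: chose[v] / shown[v] for v in shown}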


4.2.1 “Move”

  Pred    P(select=“move”|Pred)
  spin    0.0617
  turn    0.0909
  lift    0.0938
  flip    0.1491
  put     0.2500
  stack   0.2500
  roll    0.2961
  lean    0.3714
  slide   0.5938

  µ ≈ 0.2396, σ ≈ 0.1694

Table 4.18: Probabilities and statistical metrics for selection of “move” predicate for “move x” event, conditioned on respecification predicate

The probability of an arbitrary evaluator choosing the label “move” for a given visualization of a “move” is highly variable, with a very high standard deviation relative to the probability values for each individual candidate respecification predicate.

• P(select=“move”|spin) ≈ 0.0617 ≈ µ - 1.05σ
• P(select=“move”|slide) ≈ 0.5938 ≈ µ + 2.09σ

We also see a very different order of increasing preference for the label choices compared to the generated respecifications seen in Table 4.1. Most respecifications that were preferred by evaluators in HET1 are infrequently labeled a “move” here. The one point of convergence in the two tasks is “slide,” which is a frequently accepted respecification in HET1 (P(acc|slide) ≈ 0.6563) and the most likely event type to be labeled a “move” in this task. We might explain this by introducing a reflexivity qualification on prototypical motion events, such that events instantiated with a certain label by an event generator should be labeled the same way by an evaluator, in which case a sliding motion might be a good candidate for a prototypical “move” event.


4.2.2 “Turn”

  Pred   P(select=“turn”|Pred)
  lean   0.1048
  roll   0.2171
  spin   0.2346
  flip   0.4561

  µ ≈ 0.2532, σ ≈ 0.1470

Table 4.19: Probabilities and statistical metrics for selection of “turn” predicate for “turn x” visualizations, conditioned on respecification predicate

These results also show a very different order than the acceptability of the same predicates as visualizations in HET1. As with “move,” there is a point of consistency between the two sets of results. Here, that is “flip,” which is the most likely respecification to be labeled as a “turn” and the most likely type of “turn” visualization to be judged acceptable. This satisfies both reflexivity and obviousness with respect to prototypicality, and so “flip” might be considered a prototypical realization of a “turn” event.

  Rot         P(select=“turn”|Rot)
  (0,QU1)     0.2618
  (QU1,QU2)   0.2647
  (QU2,QU3)   0.2355
  (QU3,QU4)   0.2560
  (QU4,∞)     0.2426

  µ ≈ 0.2521, σ ≈ 0.0126

Table 4.20: Probabilities and statistical metrics for selection of “turn” predicate for unrespecified “turn x” visualizations, conditioned on rotation angle

For visualizations where “turn” was not respecified as a different predicate, the results can be conditioned on the angle of the rotation. No clear patterns emerge here, with a roughly equal distribution over all intervals. It appears that for an arbitrary rotation not part of any other distinct motion class, “turn” is the best overall label from the available choices.


4.2.3 “Roll”

  Dist        P(select=“roll”|QU)
  (0,QU1)     0.0300
  (QU1,QU2)   0.0568
  (QU2,QU3)   0.0531
  (QU3,QU4)   0.0483
  (QU4,∞)     0.0407

  µ ≈ 0.0458, σ ≈ 0.0107

Table 4.21: Probabilities and statistical metrics for selection of “roll” predicate for “roll x” visualizations, conditioned on path length

  Dist        Pred       P(select=Pred|QU)
  (0,QU1)     move       0.6816
  (0,QU1)     put near   0.2912
  (0,QU1)     lift       0.0412
  (QU1,QU2)   move       0.6639
  (QU1,QU2)   put near   0.2912
  (QU1,QU2)   roll       0.0568
  (QU2,QU3)   move       0.6434
  (QU2,QU3)   put near   0.3080
  (QU2,QU3)   lift       0.0624
  (QU3,QU4)   move       0.6007
  (QU3,QU4)   put near   0.3719
  (QU3,QU4)   lift       0.0611
  (QU4,∞)     move       0.5717
  (QU4,∞)     put near   0.4646
  (QU4,∞)     lift       0.0664

Table 4.22: Top 3 most likely predicate choices for “roll x” visualizations, conditioned on path length

There is a very low probability overall that evaluators would choose “roll” as the best label for a rolling event, regardless of path length. The most likely label choice for instances of “roll” overall was actually “move,” followed by “put near,” and then “lift” in most cases, although the overall probabilities for choosing “lift” are also very low. The occurrence of “lift” is hard to explain. Some of the “roll” visualizations bounce a little bit due to physics effects, and the low probabilities may also be attributable to evaluator error.


4.2.4 “Slide”

  Speed       Dist        P(select=“slide”|QUs,QUd)
  (0,QU1)     (0,QU1)     0.0300
  (0,QU1)     (QU2,QU3)   0.0214
  (0,QU1)     (QU3,QU4)   0.0178
  (0,QU1)     (QU4,∞)     0.0521
  (QU1,QU2)   (0,QU1)     0.0379
  (QU1,QU2)   (QU1,QU2)   0.0311
  (QU1,QU2)   (QU2,QU3)   0.0300
  (QU1,QU2)   (QU3,QU4)   0.0150
  (QU1,QU2)   (QU4,∞)     0.0498
  (QU2,QU3)   (0,QU1)     0.0311
  (QU2,QU3)   (QU1,QU2)   0.0310
  (QU2,QU3)   (QU2,QU3)   0.0385
  (QU2,QU3)   (QU3,QU4)   0.0563
  (QU3,QU4)   (0,QU1)     0.0217
  (QU3,QU4)   (QU1,QU2)   0.0978
  (QU3,QU4)   (QU2,QU3)   0.0179
  (QU3,QU4)   (QU4,∞)     0.0263
  (QU4,∞)     (0,QU1)     0.0323
  (QU4,∞)     (QU2,QU3)   0.0651
  (QU4,∞)     (QU3,QU4)   0.0655
  (QU4,∞)     (QU4,∞)     0.0558

  µs,d ≈ 0.0392, σs,d ≈ 0.0204
  µs=(0,QU1) ≈ 0.0303, σs=(0,QU1) ≈ 0.0154
  µs=(QU1,QU2) ≈ 0.0328, σs=(QU1,QU2) ≈ 0.0127
  µs=(QU2,QU3) ≈ 0.0392, σs=(QU2,QU3) ≈ 0.0119
  µs=(QU3,QU4) ≈ 0.0409, σs=(QU3,QU4) ≈ 0.0381
  µs=(QU4,∞) ≈ 0.0547, σs=(QU4,∞) ≈ 0.0156
  µd=(0,QU1) ≈ 0.0306, σd=(0,QU1) ≈ 0.0058
  µd=(QU1,QU2) ≈ 0.0533, σd=(QU1,QU2) ≈ 0.0385
  µd=(QU2,QU3) ≈ 0.0346, σd=(QU2,QU3) ≈ 0.0188
  µd=(QU3,QU4) ≈ 0.0387, σd=(QU3,QU4) ≈ 0.0260
  µd=(QU4,∞) ≈ 0.0460, σd=(QU4,∞) ≈ 0.0134

Table 4.23: Probabilities and statistical metrics for selection of “slide” predicate for “slide x” visualizations, conditioned on path length and translocation speed

Like “roll,” the probability of choosing “slide” as the best label for visualizations generated from the “slide” predicate is low, although we can see an increasing trend in favor of the “slide” label as motion speed rises.

• µs=(0,QU1) ≈ 0.0303 ≈ µs,d - 0.44σs,d
• µs=(QU4,∞) ≈ 0.0547 ≈ µs,d + 0.76σs,d

Overall, results from neither “roll” nor “slide” seem to be very informative about prototypicality in this case, possibly because both are fairly fully-specified events, not easily confused for anything else beyond their own hypernyms (e.g., “move”).

4.2.5 “Spin”

  Dist        P(select=“spin”|“roll”,QU)
  (0,QU1)     0.2500
  (QU1,QU2)   0.1563
  (QU2,QU3)   0.1250
  (QU3,QU4)   0.2500
  (QU4,∞)     0.1750

  µ ≈ 0.1913, σ ≈ 0.0565

Table 4.24: Probabilities and statistical metrics for selection of “spin” predicate for “spin x” visualizations respecified as “roll x,” conditioned on path length

  Axis   P(select=“spin”|Axis)
  X      0.0643
  Y      0.4137
  Z      0.0625

Table 4.25: Probabilities for selection of “spin” predicate for unrespecified “spin x” visualizations, conditioned on rotation axis

For instances of “spin” respecified as “roll,” no clear trend emerges that would identify a particular path length as making such an event more identifiable as a “spin.” Paths in the shortest interval and in (QU3,QU4) show some preference, but not a clearly explainable one relative to the total test set for respecified “spin,” and it is unclear how much of this is statistical noise due to the small size of this specific segment of the dataset. For unrespecified “spin,” there is a very strong preference for spin motions around the Y axis, making it clear that the prototypical notion of a “spin” is probably around that axis.


4.2.6 “Lift”

  Speed       Dist        P(select=“lift”|QUs,QUd)
  (0,QU1)     (QU1,QU2)   0.0523
  (0,QU1)     (QU2,QU3)   0.0256
  (0,QU1)     (QU3,QU4)   0.0800
  (0,QU1)     (QU4,∞)     0.1146
  (QU1,QU2)   (0,QU1)     0.0909
  (QU1,QU2)   (QU1,QU2)   0.0932
  (QU1,QU2)   (QU3,QU4)   0.0300
  (QU1,QU2)   (QU4,∞)     0.0746
  (QU2,QU3)   (0,QU1)     0.0104
  (QU2,QU3)   (QU1,QU2)   0.0354
  (QU2,QU3)   (QU2,QU3)   0.1106
  (QU2,QU3)   (QU3,QU4)   0.0938
  (QU2,QU3)   (QU4,∞)     0.0667
  (QU3,QU4)   (0,QU1)     0.0217
  (QU3,QU4)   (QU1,QU2)   0.0761
  (QU3,QU4)   (QU2,QU3)   0.1071
  (QU3,QU4)   (QU3,QU4)   0.0806
  (QU4,∞)     (0,QU1)     0.0753
  (QU4,∞)     (QU1,QU2)   0.0257
  (QU4,∞)     (QU2,QU3)   0.0888
  (QU4,∞)     (QU3,QU4)   0.0476
  (QU4,∞)     (QU4,∞)     0.0944

  µs,d ≈ 0.0680, σs,d ≈ 0.0317
  µs=(0,QU1) ≈ 0.0681, σs=(0,QU1) ≈ 0.0381
  µs=(QU1,QU2) ≈ 0.0722, σs=(QU1,QU2) ≈ 0.0293
  µs=(QU2,QU3) ≈ 0.0634, σs=(QU2,QU3) ≈ 0.0411
  µs=(QU3,QU4) ≈ 0.0714, σs=(QU3,QU4) ≈ 0.0358
  µs=(QU4,∞) ≈ 0.0664, σs=(QU4,∞) ≈ 0.0291
  µd=(0,QU1) ≈ 0.0496, σd=(0,QU1) ≈ 0.0395
  µd=(QU1,QU2) ≈ 0.0565, σd=(QU1,QU2) ≈ 0.0280
  µd=(QU2,QU3) ≈ 0.0830, σd=(QU2,QU3) ≈ 0.0395
  µd=(QU3,QU4) ≈ 0.0664, σd=(QU3,QU4) ≈ 0.0265
  µd=(QU4,∞) ≈ 0.0876, σd=(QU4,∞) ≈ 0.0215

Table 4.26: Probabilities and statistical metrics for selection of “lift” predicate for “lift x” visualizations, conditioned on translocation speed and distance traversed

Like “roll” and “slide,” “lift” receives overall low probabilities of being judged the best label for visualizations generated from “lift” events, but there seems to be a rising trend of probability for the “lift” label as the distance traveled rises (µd=(QU4,∞) ≈ 0.0876 ≈ µs,d + 0.62σs,d). Since in HET1 (Table 4.8) evaluators also preferred longer instances of “lift,” this comports with the reflexivity qualification on prototypicality introduced earlier.

4.2.7 “Put”

  Speed       P(select=“put on/in”|Speed)
  (0,QU1)     0.2016
  (QU1,QU2)   0.2182
  (QU2,QU3)   0.2372
  (QU3,QU4)   0.2334
  (QU4,∞)     0.2349

  µ ≈ 0.2251, σ ≈ 0.0151

Table 4.27: Probabilities and statistical metrics for selection of “put on/in” predicate for “put x on/in y” visualizations, conditioned on translocation speed

  QSR (end)     P(select=“put touching”|QSR)
  behind(y)     0.1982
  in front(y)   0.2706
  left(y)       0.3250
  right(y)      0.3333
  on(y)         0.3654

  µ ≈ 0.2985, σ ≈ 0.0656

Table 4.28: Probabilities and statistical metrics for selection of “put touching” predicate for “put x touching y” visualizations, conditioned on relative orientation between x and y at event completion

  Dist        P(select=“put near”|Dist)
  (0,QU1)     0.2912
  (QU1,QU2)   0.2912
  (QU2,QU3)   0.3080
  (QU3,QU4)   0.3719
  (QU4,∞)     0.4646

  µ ≈ 0.3454, σ ≈ 0.0745

Table 4.29: Probabilities and statistical metrics for selection of “put near” predicate for “put x near y” visualizations, conditioned on distance traveled

“Put on” and “put in” results show little variation when conditioned on translocation speed, with perhaps a slight preference for faster motions.

With the exception of this parameter, these are already well-specified motion predicates, so, combined with the results from HET1, there does not appear to be any particularly distinct set of “best values” for translocation speed in a prototypical “put on” or “put in” event.

The results for “put touching” also reflect the results in HET1, with a preference for the on(y) specification of touching(y) and a dispreference for behind(y), possibly due to the effects of the point of view on perceptibility of the EC constraint of a touching(y) relation.

• P(select=“put touching”|behind(y)) ≈ 0.1982 ≈ µ - 1.53σ
• P(select=“put touching”|on(y)) ≈ 0.3654 ≈ µ + 1.02σ

on(y), meanwhile, remains a very obvious interpretation of touching(y) and is clearly preferred as a specification in both visualization and labeling.

The results for “put near” display something interesting: rather than the clear preference for objects ending in close proximity that emerged in HET1 when conditioning on relative offset (Table 4.11), the trend in this task emerges when conditioning on the overall distance the moving object traveled, with a preference for longer paths (P(select=“put near”|(QU4,∞)) ≈ 0.4646 ≈ µ + 1.6σ). Longer paths are obvious motions, but they are also, in the case of moving an object “near” another object, demonstrative, in that they demonstrate the contrast between the beginning and ending state, and the clear typing of the event, as might be encoded in VoxML (for a [[PUT]], the typing is a “transition event,” a minimal distinction which a long path taken over the course of the event clearly demonstrates).

4.2.8 “Lean”

  Angle       P(select=“lean”|QU)
  (0,QU1)     0.4263
  (QU1,QU2)   0.4287
  (QU2,QU3)   0.4291
  (QU3,QU4)   0.4245
  (QU4,∞)     0.4110

  µ ≈ 0.4239, σ ≈ 0.0075

Table 4.30: Probabilities and statistical metrics for selection of “lean on” predicate for “lean x on y” visualizations, conditioned on rotation angle

  Angle       P(select=“lean”|QU)
  (0,QU1)     0.3319
  (QU1,QU2)   0.3495
  (QU2,QU3)   0.3555
  (QU3,QU4)   0.3217
  (QU4,∞)     0.3512

  µ ≈ 0.3420, σ ≈ 0.0145

Table 4.31: Probabilities and statistical metrics for selection of “lean against” predicate for “lean x against y” visualizations, conditioned on rotation angle

For both “lean on” and “lean against,” results show roughly equal probabilities for the respective two labels across angle intervals. The same type of results appears for unrespecified “turn,” although here there is an overall and universal preference for labeling a “lean” event with “on” as opposed to “against,” even though the two event programs are enacted identically.

4.2.9 “Flip”

  Rot Axis   Symmetry Axis   P(select=“flip on edge”|Axisrot,Axissym)
  X          Y               0.1080
  X          Z               0.1579
  Y          Z               0.3125
  Z          X               0.0833
  Z          Y               0.0789

  µ ≈ 0.1481, σ ≈ 0.0971

Table 4.32: Probabilities and statistical metrics for selection of “flip on edge” predicate for “flip x on edge” visualizations, conditioned on rotation axis and symmetry axis

  Rot Axis   Symmetry Axis   P(select=“flip at center”|Axisrot,Axissym)
  X          Y               0.3352
  X          Z               0.3333
  Y          X               0.2400
  Z          X               0.1875
  Z          Y               0.3750

  µ ≈ 0.2942, σ ≈ 0.0776

Table 4.33: Probabilities and statistical metrics for selection of “flip at center” predicate for “flip x at center” visualizations, conditioned on rotation axis and symmetry axis

  Rot Axis   Symmetry Axis   Pred   P(select=Pred|Axisrot,Axissym)
  X          Y               turn   0.4659
  X          Y               flip   0.3352
  X          Y               move   0.1818
  X          Z               turn   0.4211
  X          Z               flip   0.3333
  X          Z               move   0.2280
  Y          X               turn   0.4000
  Y          X               flip   0.2400
  Y          X               move   0.1200
  Y          Z               turn   0.4375
  Y          Z               flip   0.3125
  Y          Z               move   0.1250
  Z          X               move   0.2500
  Z          X               turn   0.2291
  Z          X               flip   0.1875
  Z          Y               turn   0.5197
  Z          Y               flip   0.3750
  Z          Y               move   0.2039

Table 4.34: Top 3 most likely predicate choices for “flip x {on edge, at center}” visualizations, conditioned on rotation axis and symmetry axis

Labeling probabilities for “flip on edge” and “flip at center” pattern differently, so they are presented separately here. Preference for a “flip on edge” label was highest for visualizations containing objects rotating around their Y axis when they had symmetry around their Z axis, whereas for a “flip at center” label, visualizations containing objects symmetric around their Y axis rotating around their Z axis were preferred. Overall, evaluators were more likely to label these visualizations with “turn” than “flip.” In some cases, this may have been because the object flipped on its edge would fall over after physics effects are evaluated following the event completion, violating the physics-independence quality of a prototypical motion event, but it may also indicate a level of “hedging bets” on the part of the evaluators, or a measure of certainty built into their evaluations. In other words, while the event visualized may have been somewhat less-than-definitely a “flip,” it was very securely a “turn,” making that the best label.


4.2.10 “Close”

  Pred   P(select=“close”|Pred)
  turn   0.2273
  put    0.2143

Table 4.35: Probabilities for selection of “close” predicate for “close x” visualizations, conditioned on motion manner

  Manner   Pred     P(select=Pred|Manner)
  put      put on   0.4821
  put      move     0.2500
  put      close    0.2143
  turn     turn     0.3636
  turn     close    0.2272
  turn     open     0.1932

Table 4.36: Top 3 most likely predicate choices for “close x” visualizations, conditioned on motion manner

For both types of “close” realization, “put” events and “turn” events, evaluators were roughly equally likely to choose “close” as the correct label. However, evaluators were much more likely to choose the more specific predicate, “put on” or “turn” (depending on event typing), as the correct label overall, as opposed to “close.” There is also a small incidence of “open” in the “turn” realization, which seems to be evaluator error or at least counterintuitive to the visualization given.

4.2.11 “Open”

  Pred   P(select=“open”|Pred)
  turn   0.1932
  move   0.6122

Table 4.37: Probabilities for selection of “open” predicate for “open x” visualizations, conditioned on motion manner

  Manner   Pred           P(select=Pred|Manner)
  turn     turn           0.3636
  turn     close          0.2272
  turn     open           0.1932
  move     open           0.6122
  move     move           0.5306
  move     lean against   0.0408

Table 4.38: Top 3 most likely predicate choices for “open x” visualizations, conditioned on motion manner

Evaluators were much more likely to label “move” enactments of “open” as “open” events, whereas, as with “close,” “turn” realizations were more likely to be given the label “turn.” This may suggest that, although “turn” is an underspecified motion predicate, “move” is even more so, and the more underspecified the motion predicate, the less likely evaluators are to choose it as an event label for a more fully-specified event visualization.

It should be noted that in this task, the heuristics used to generate the three choices for evaluators may potentially influence the results presented. By choosing to give alternative choices to the true input sentence that rely on 1) adjunct alternation, such as “put in” vs. “put on,” and 2) motion superclass/subclass distinctions, such as providing a “move” option for “slide” or a “turn” option for “lean,” these results do reinforce rigid taxonomies of motion classes where human interpretation may be more fluid.

4.3 Automatic Evaluation Task Results

The classifier source code for both the MaxEnt baseline and the DNN evaluator is available in the /utils/analysis/auto-eval/ directory of https://github.com/nkrishnaswamy/thesis-docs-utils.


4.3.1 Baseline: Maximum Entropy Logistic Regression

Granularity 1: Predicting Predicate Only
  3-way Choice:        Total 3357   Correct 1628   Incorrect 1729   µ Accuracy 48.50%   σ 0.29066   σ² 0.08448
  Unrestricted Choice: Total 3357   Correct 558    Incorrect 2799   µ Accuracy 16.62%   σ 0.09007   σ² 0.00811

Granularity 2: Predicting Predicate w/ Adjunct
  3-way Choice:        Total 3357   Correct 1182   Incorrect 2175   µ Accuracy 35.19%   σ 0.24375   σ² 0.05941
  Unrestricted Choice: Total 3357   Correct 522    Incorrect 2835   µ Accuracy 15.54%   σ 0.10732   σ² 0.01152

Granularity 3: Predicting Full Sentence
  3-way Choice:        Total 3357   Correct 1532   Incorrect 1825   µ Accuracy 45.64%   σ 0.02424   σ² 0.00059
  Unrestricted Choice: Total 3357   Correct 34     Incorrect 3323   µ Accuracy 1.01%    σ 0.00320   σ² 0.00001

Table 4.39: Accuracy tables for baseline automatic evaluation

Figure 4.1: Baseline accuracy on restricted choice set (bar chart: accuracy, in percent, for Predicate Only, Pred+Prep, and Full Sentence)

Figure 4.2: Baseline accuracy on unrestricted choice set (bar chart: accuracy, in percent, for Predicate Only, Pred+Prep, and Full Sentence)

Over a 10-fold cross-validation of the test set, using the saved feature vectors as training data, the MaxEnt classifier achieves 48.50% accuracy on selecting the correct event predicate alone, and 45.64% when selecting the correct sentence in its entirety.

The baseline results when selecting the predicate alone display a much higher variance than the results when selecting the entire sentence, pointing to the existence of some “confusing” features when judging the predicate by itself, or indicating some extra information provided by object features resulting in more consistent results across folds. Nevertheless, this baseline exhibits only 12-15% improvement over random chance in a three-way classification task. When the algorithm is not restricted to the same three choices given to evaluators for each labeling task (effectively increasing the choice from a 3-way classification to an 11-way classification when choosing just the predicate, or a 1,119-way classification when choosing the entire sentence), the accuracy drops quite drastically, to 16.62% on predicting the predicate alone, and to 1.01% on predicting the entire sentence, effectively reducing the accuracy to statistical error.
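For reference, the baseline procedure described above can be approximated with a standard multinomial logistic regression (the usual implementation of a maximum-entropy classifier) under 10-fold cross-validation. The sketch below uses scikit-learn and is an illustration of the procedure only, not the thesis baseline itself (whose source is in the repository listed at the start of Section 4.3); the restricted 3-way condition is modeled by taking the arg-max over only the three candidate captions.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import KFold

    def maxent_cv_accuracy(X, y, n_splits=10):
        """Mean/std of accuracy over a 10-fold cross-validation.
        X: saved feature vectors (numpy array), y: label array at some granularity."""
        scores = []
        for train_idx, test_idx in KFold(n_splits=n_splits, shuffle=True, random_state=0).split(X):
            clf = LogisticRegression(max_iter=1000)  # multinomial (softmax) for multiclass labels
            clf.fit(X[train_idx], y[train_idx])
            scores.append(clf.score(X[test_idx], y[test_idx]))
        return np.mean(scores), np.std(scores)

    def restricted_choice(clf, x, candidates):
        """Pick among the same three captions offered to the human evaluators."""
        probs = dict(zip(clf.classes_, clf.predict_proba([x])[0]))
        return max(candidates, key=lambda c: probs.get(c, 0.0))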

4.3.2 Deep Learning Results

The results from the deep learning classifiers are presented graphically, showing the differences in performance across neural network types and granularity levels, but also showing the effects of the number of training steps on each learning method. Full data tables in the format of Table 4.39 are available in Appendix E. Eight neural net configurations were run at 5 different training lengths, across 10 folds, at 3 levels of granularity apiece, for a total of 1,200 individual automatic evaluations. Aggregate results are presented below. The following charts are sorted by neural net type and learning method, with two graphs for each: one with the network’s choice restricted to the three possible captions available to human evaluators for the same visualization in HET2, and one with the choice set open to all options (predicate, predicate plus preposition, or full sentence, depending on granularity level). Each chart shows the three levels of granularity assessed for the baseline. Discussion follows after all the charts.
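The 1,200 evaluations come from fully crossing the experimental factors named above. A small sketch of that grid follows; the configuration names are shorthand for the eight network variations charted below, not identifiers from the thesis code.

    from itertools import product

    CONFIGS = [
        "dnn_unweighted", "dnn_weighted", "dnn_weighted_discrete", "dnn_weights_only",
        "linear_dnn_unweighted", "linear_dnn_weighted",
        "linear_dnn_weighted_discrete", "linear_dnn_weights_only",
    ]
    TRAINING_STEPS = [1000, 2000, 3000, 4000, 5000]
    FOLDS = range(10)
    GRANULARITIES = ["predicate", "predicate+preposition", "full sentence"]

    runs = list(product(CONFIGS, TRAINING_STEPS, FOLDS, GRANULARITIES))
    assert len(runs) == 1200  # 8 configurations x 5 training lengths x 10 folds x 3 granularities

    # for config, steps, fold, granularity in runs:
    #     accuracy = train_and_evaluate(config, steps, fold, granularity)  # hypothetical driver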


DNN with Unweighted Features

Figure 4.3: “Vanilla” DNN accuracy on restricted choice set (accuracy vs. training steps, 1,000-5,000, for Predicate Only, Pred+Prep, and Full Sentence)

Figure 4.4: “Vanilla” DNN accuracy on unrestricted choice set (accuracy vs. training steps, 1,000-5,000, for Predicate Only, Pred+Prep, and Full Sentence)

DNN with Weighted Features

Figure 4.5: DNN with weighted features accuracy on restricted choice set (accuracy vs. training steps, 1,000-5,000, for Predicate Only, Pred+Prep, and Full Sentence)

Figure 4.6: DNN with weighted features accuracy on unrestricted choice set (accuracy vs. training steps, 1,000-5,000, for Predicate Only, Pred+Prep, and Full Sentence)

DNN with Weighted Discrete Features

Figure 4.7: DNN with weighted discrete features accuracy on restricted choice set (accuracy vs. training steps, 1,000-5,000, for Predicate Only, Pred+Prep, and Full Sentence)

Figure 4.8: DNN with weighted discrete features accuracy on unrestricted choice set (accuracy vs. training steps, 1,000-5,000, for Predicate Only, Pred+Prep, and Full Sentence)

DNN with Feature Weights Only

Figure 4.9: DNN with feature weights only accuracy on restricted choice set (accuracy vs. training steps, 1,000-5,000, for Predicate Only, Pred+Prep, and Full Sentence)

Figure 4.10: DNN with feature weights only accuracy on unrestricted choice set (accuracy vs. training steps, 1,000-5,000, for Predicate Only, Pred+Prep, and Full Sentence)

Combined Linear-DNN with Unweighted Features

Figure 4.11: Linear-DNN accuracy on restricted choice set (accuracy vs. training steps, 1,000-5,000, for Predicate Only, Pred+Prep, and Full Sentence)

Figure 4.12: Linear-DNN accuracy on unrestricted choice set (accuracy vs. training steps, 1,000-5,000, for Predicate Only, Pred+Prep, and Full Sentence)

Combined Linear-DNN with Weighted Features

Figure 4.13: Linear-DNN with weighted features accuracy on restricted choice set (accuracy vs. training steps, 1,000-5,000, for Predicate Only, Pred+Prep, and Full Sentence)

Figure 4.14: Linear-DNN with weighted features accuracy on unrestricted choice set (accuracy vs. training steps, 1,000-5,000, for Predicate Only, Pred+Prep, and Full Sentence)

Combined Linear-DNN with Weighted Discrete Features

Figure 4.15: Linear-DNN with weighted discrete features accuracy on restricted choice set (accuracy vs. training steps, 1,000-5,000, for Predicate Only, Pred+Prep, and Full Sentence)

Figure 4.16: Linear-DNN with weighted discrete features accuracy on unrestricted choice set (accuracy vs. training steps, 1,000-5,000, for Predicate Only, Pred+Prep, and Full Sentence)

Combined Linear-DNN with Feature Weights Only

Figure 4.17: Linear-DNN with feature weights only accuracy on restricted choice set (accuracy vs. training steps, 1,000-5,000, for Predicate Only, Pred+Prep, and Full Sentence)

Figure 4.18: Linear-DNN with feature weights only accuracy on unrestricted choice set (accuracy vs. training steps, 1,000-5,000, for Predicate Only, Pred+Prep, and Full Sentence)

Discussion

All variations of the neural network method were able to identify the motion predicate alone with greater than 90% accuracy, even with only 1,000 training steps and even when given a choice of all available motion predicates, not just the three offered to human evaluators on the same individual task. The addition of a prepositional adjunct distinction caused accuracy to sink to somewhere in the 80-90% range for the restricted choice set, and the 50-60% range for the unrestricted choice set. This is still well above the baseline but represents a significant drop from results where the label set was limited to verbs and verbs alone. Two exceptions to this are the networks with feature weights only (Figure 4.10 and Figure 4.18), where accuracy remained in the vicinity of 75%. This phenomenon is discussed briefly below and in Section 4.5.1.

One general observation that emerged during early tests of the neural network learning method was that introducing IDF weighting to all the features seemed to add little in terms of final accuracy, and in fact often introduced noise that lowered accuracy at short training intervals (see Figure 4.3, where the “vanilla” DNN achieved 97.73%/81.88%/48.80% accuracy, as opposed to Figure 4.5, where the DNN with feature weights achieved 96.36%/80.27%/45.32% accuracy). This shortfall was usually made up with additional training, but even then rarely exceeded the performance of unweighted features. Meanwhile, assigning IDF weights to only the discrete features provided some increase in performance, notable mostly in the highest granularity level. This led to the intuition that the presence or absence of a feature may be a strong predictor of motion class. Since the distribution of underspecified parameter features varies through events in the test set (that is, some features occur in one event class only, others in multiple classes—which ties back into the notion that the amount and nature of spatial information provided by a predicate is variable), this was transformed into a TF-IDF metric (see discussion under Section 3.4.2). Both DNN and combined Linear-DNN methods that used feature IDF weights only, in place of actual feature values, actually outperformed all other methods. In the lowest granularity, the advantage was slight (typically up from ∼96% to ∼98% across 5,000 training steps). In the middle granularity, the advantage was in the vicinity of 20%, typically jumping from ∼55% to ∼75%. In the highest granularity, predicting the entire input sentence, the weights-only method drastically underperforms all the other methods at first, but this is quickly made up for by longer training. In the combined network, a gradual increase in performance over training time results in approximate parity with the other methods (Figure 4.17), whereas in the purely deep learning network, the weights-only method ends up besting the others by about 10% after 5,000 training steps (Figure 4.9).2

TensorFlow documentation typically recommends using a “wide” or linear classifier for continuous features and deep learning for discrete or categorical features, so it was thought that a combined Linear-DNN feeding continuous features to the linear nodes would be more appropriate to mixed input feature types than a wholly deep learning classifier, but overall the entirely deep learning method outperformed the combined method across all variations. The “deep” learning method is actually not very deep, consisting of only 4 layers of artificial neurons, but that is all that was needed to achieve results significantly above the baseline, with one exception (discussed shortly). Some tests on dev-test sets did not show much improvement with the addition of more layers or more artificial neurons per layer, although only a small set of variations was tested, and this testing was conducted informally. Results for these development tests are not recorded, but they can be replicated through the DNN code at https://github.com/nkrishnaswamy/thesis-docs-utils/tree/master/utils/analysis/auto-eval.

In the highest level of granularity, predicting the entire input sentence over the unrestricted choice set of all possible input labels, the deep learning methods all actually underperformed the baseline, reaching only as high as 0.63% accuracy to the baseline’s 1.01%. As all these figures (baseline included) fall in the range of statistical noise—barely better than random guessing—it seems clear that the label set in this trial, of 1,119 possible labels, is simply too large to classify using sparse feature data without a very sophisticated method. There do exist recurrent, convolutional, and sequence-to-sequence methods that could be suitable in theory and would be worth testing against this data.

In the first two levels of granularity, results are roughly the same across all training lengths. Often, there is a slight increase in performance with the addition of training steps, but in most cases this plateaus after about 3,000 or 4,000 steps, and only in a few variations of the neural net configuration does the classifier perform best when trained for 5,000 steps. In some cases at the finest-grained level, we see indications that further training may increase performance, particularly when using feature weights alone as input without values (see Figure 4.17), but more tests would need to be run to verify this. On the finest-grained trials, we also see significant improvement in performance between 1,000 and 2,000 training steps (e.g., Figure 4.9), or across the 1,000-5,000 training step range (Figure 4.17). Slight improvement from 1,000 to 2,000 training steps that plateaus afterwards is also common (e.g., Figure 4.13).

2 This is in the restricted choice set only.
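To make the architecture discussion above concrete, the following sketch shows the general shape of the purely deep configuration: a feed-forward network with 4 hidden layers over the per-visualization feature vectors. It uses the Keras API for brevity; the original implementation (available at the repository URL above) is in TensorFlow but may be organized differently, so the layer widths, optimizer, and training settings here are illustrative assumptions only. The combined Linear-DNN variant corresponds to the “wide and deep” pattern referred to above (e.g., TensorFlow's DNNLinearCombinedClassifier), with continuous features fed to the linear part and discrete features to the deep part.

    import tensorflow as tf

    def build_dnn(num_features, num_classes, hidden=(128, 64, 32, 16)):
        """Illustrative 4-hidden-layer DNN classifier over fixed-length feature vectors.
        The thesis reports four hidden layers; the widths here are placeholders."""
        inputs = tf.keras.Input(shape=(num_features,))
        x = inputs
        for width in hidden:
            x = tf.keras.layers.Dense(width, activation="relu")(x)
        outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)
        model = tf.keras.Model(inputs, outputs)
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        return model

    # "Feature weights only" variant: replace each feature value x_ij with the IDF weight
    # of feature j when the feature is present in instance i, and 0 otherwise
    # (the TF-IDF treatment referenced in Section 3.4.2), then train on those vectors:
    # model = build_dnn(num_features, num_classes)
    # model.fit(X_weights_only, y, epochs=5, batch_size=32)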

4.4 Mechanical Turk Worker Response

The human evaluation tasks also allowed workers to (optionally) provide short feedback explaining their decisions on each HIT. Since this field of the task was not required to be filled out, the data is not uniform and is also free form, making it difficult to assess quantitatively. However, a qualitative assessment can be provided to get a sense of how those workers who filled out that portion of the task were reasoning about it. A brief survey of this data is presented as word clouds, to provide a quick, intuitive, and visual assessment of the terms (here, uni- and bigrams) that occurred most frequently in worker comments.

Figure 4.19: Word clouds depicting worker response to HET1

For HET1, workers most often explained a choice of multiple videos for a given input by citing the fact that the videos displayed the same event (“perfectly matching,” “matching video,” “equally well,” “right process,” etc.). Less frequent but also prevalent was discussion of the objects involved.

Figure 4.20: Word clouds depicting worker response to HET2

For HET2, we see similar results (“right process,” “describe event,” etc.). Workers also commented on the task itself, sometimes ungrammatically (“much interesting”). One particular worker, who completed a lot of tasks, also described the videos as “nice work,” which, while validating, is not a very informative response. Overall, workers did not display much tendency to explain their decisions when they chose one distinct answer. Most explanation came when they chose multiple answers, or in some cases “none.” Word clouds were generated with the wordcloud Python package by Andreas Mueller (http://amueller.github.io/word_cloud/).
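As a pointer for reproducing figures like 4.19 and 4.20, the sketch below shows a minimal use of that wordcloud package; the input file name is hypothetical, and the package's default collocation handling is one way to surface the bigrams mentioned above.

    from wordcloud import WordCloud
    import matplotlib.pyplot as plt

    # comments.txt: hypothetical dump of the optional free-form worker feedback
    with open("comments.txt", encoding="utf-8") as f:
        text = f.read()

    # collocations=True (the default) lets frequent bigrams such as "right process"
    # appear in the cloud alongside single words
    cloud = WordCloud(width=800, height=400, background_color="white",
                      collocations=True).generate(text)

    plt.imshow(cloud, interpolation="bilinear")
    plt.axis("off")
    plt.show()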


4.5 Summary

The discussion heretofore has been primarily of qualitative inferences taken from the quantitative data, to evaluate the relevance of particular parameters and trends to the prototypicality of the visualization of a given motion event. In Chapter 2, I stated that the goal of this thesis was to determine a set of “best practices” for assigning values to an underspecified motion event in a visual simulation system. From the data revealed by conditioning the responses to HET1 and HET2 on various parameters, I would like to propose the following qualitative criteria for determining the prototypicality of an instance of a motion:

1. Perspicacity — When specifying the parameters of a motion predicate, a certain minimum level of additional information is required for the program to run, but adding too much information seems to have a greater chance of violating an evaluator’s notion of what the original (underspecified) predicate should be. For example, while a rolling or a spinning motion is technically a turning, the prototypical turning motion will have the minimum level of additional specification needed to execute the simulation, and no more. Too much added information tends to change the motion class in an evaluator’s eyes.

2. Obviousness — Motions that are too slight may be mistaken for software jitters or other artifacts and tend not to be classed as distinct motions. The motion should be made in such a way that it is undeniably denoting something.

3. Moderation — For parameters where evaluators have preferred value ranges, those ranges tend to fall in the center of the values available. Motions that are too slow, too fast, too far, not far enough, etc., tend to be less likely to be preferred. This must often be balanced against obviousness, in cases where a motion of longer length or duration is the most obvious kind of enactment.

4. Perspective-independence — The nature (path, manner, relative orientation, etc.) of the motion event should be identifiable from all points of view. An event that looks different from different perspectives is less likely to be preferred as a “best visualization.”

5. Physics-independence — Any physics effects applied after the completion of an event should not change the event class. For example, when a “lean” event is completed, if the leaned object falls off its support, that is unlikely to be considered a “lean.”

6. Reflexivity — An event visualization generated from a given label should be identifiable as that label. The prototypical “move” may be considered a “slide,” since “move” instances respecified as “slide” are more likely to be described as “move” events than other events respecified from “move.”

7. Demonstrativity — An event’s VoxML markup includes a semantic head that broadly indicates what class of program is enacted over the arguments, and the event visualization should demonstrate that class of program, such as a continually iterated process, a value assignment, etc. For instance, a prototypical transition event such as “put” will clearly demonstrate the contrast and distinction between the ¬φ and φ states encoded in its typing. In conjunction with the obviousness characteristic, a demonstrative motion will display a key characteristic of the motion that distinguishes it from a different motion.

8. Certainty — The label for the motion should be definite, certain, and unambiguous. If evaluators waver between two choices for a best visualization or motion label, the event depicted is unlikely to be prototypical for either of them. This characteristic serves to distinguish the labels of motion supertypes (e.g., “move” or “turn”) from more fully-specified motion subtypes (e.g., “slide” or “flip”).

“Best values” assigned to an underspecified motion predicate are therefore those values that satisfy the maximum number of these criteria that are relevant to that predicate. These criteria or maxims are not equally relevant to every predicate under examination, and a few warrant further examination:

(a) The quality of perspicacity as defined above suggests that, at least for motion predicates such as “move” or “turn,” which have a high amount of unspecified information, there exists a base level of information that defines that basic class of motion. This might be termed a “natural prototype” a la Rosch (1973, 1983). As these motions are also supertypes of more specific motions (such as “turn,” a supertype of “spin”), adding parameters that make the underspecified motion more closely approach a natural prototype of a more specified motion predicate may remove the motion from the prototype or “natural category” of the underspecified motion. This comports with the observation that spins and rolls are less likely to be considered acceptable instances of “turn” than arbitrary rotations that do not have another label associated with them in the simulation besides “turn.”

(b) The above observation also applies to the quality of reflexivity. This research relied on two human-driven evaluation tasks: one in which evaluators had to pick the best visualization(s) for a given label and another in which they had to pick the best label(s) for a given visualization. The prototypical motion event for a given class is one where these two classifications converge, such that given a motion class C and a visualization V, for V to be considered a prototypical realization of C by the reflexivity maxim, an arbitrary evaluator should be likely both to choose V as an acceptable or best visualization for C, and to choose C as an acceptable or best description of V.

(c) The quality of physics-independence suggests that humans evaluate motions not just on the interval from the beginning to the instant of satisfaction, but also consider the effects of the motion for at least a short period after the completion of the event, such that an instance of “lean” that does not continue the created support relation is not considered a “lean.” This suggests that the events are taken to be perfective and that prototypical events are judged relative to that Aktionsart.3 However, we should also consider that it may be that all the motions considered herein have natural perfective constructions, thus biasing judgment in that direction. Lexical distinctions such as these could be examined using simulations of semelfactive or atelic verbs (e.g., “blink,” “read,” etc.).

The motion predicates tested broadly fell into three types: those where all underspecified variables have a distinct range of “best values” according to the human judges (“move,” “put touching,” “put near”); those where the precise values of the underspecified variables are routinely judged immaterial (“lean” in HET1); and those where certain variables have a preferred range but others do not (the remainder). As expected, most fell into the last category. One interesting conclusion is that human judges appear to exhibit a preference for the minimum level of additional specification required. That is, there appears to be a preference against overspecification, a kind of Gricean maxim for motion events.

There is also the possibility that the “best values” chosen by the human judges may be correlated with object or scene properties such as the size of the object or of the environment/MES. Object features are reflected in some of the parameter values, such as relative offset ranges for “put near” events, but these are very weak signals. To further examine these kinds of effects, a similar set of experiments can be run in different environments to see if correlations emerge between Monte Carlo-generated value ranges and object-independent scene properties, such as the size of the total manipulable area beyond the MES. Conditioning on these additional variables could reveal the effects of object features even beyond obvious candidates like object size (a primary factor in occlusion, which may have affected evaluators’ judgments, as discussed in Section 4.1.7).

Since human evaluators were making their judgments on the basis of the visualized events, without access to the feature vectors that created them, it was difficult to quantitatively (or even qualitatively) assess the informativity of particular features in motion event labeling. As humans tend to be better judges of qualitative parameters than of precise quantitative values such as those provided by the feature vectors (particularly for the continuous features), the feature vectors are unlikely to have been much help to the human judges, although this question may be worth individual examination, particularly on the basis of individual features. At any rate, the machine learning-based AET provides a heuristic against which to measure individual feature informativity.

3 A property of predicates concerned with the internal temporal consistency of the denoted situation (Vendler, 1957; Comrie, 1976; Bache, 1985). The German term is roughly equivalent to “lexical aspect” in English.
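One way the environment-conditioned follow-up proposed above could be carried out is sketched below: test whether the parameter values accepted by evaluators track an object or scene property such as size. The file layout and column names are hypothetical; only the statistical test itself is standard.

import csv

from scipy.stats import pearsonr

offsets, sizes = [], []
with open("put_near_accepted.csv", newline="") as f:  # hypothetical export of accepted instances
    for row in csv.DictReader(f):
        offsets.append(float(row["relative_offset"]))    # accepted parameter value
        sizes.append(float(row["object_longest_axis"]))  # object/scene size proxy

r, p = pearsonr(offsets, sizes)
print("Pearson r = %.3f (p = %.3g)" % (r, p))

A near-zero correlation would support the observation that object features provide only weak signals in the chosen values; a strong correlation would suggest conditioning future experiments on scene or object size.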

4.5.1  Feature Informativity

Independent of its actual value, the presence or absence of a given underspecified feature turns out to be quite a strong predictor of motion class, and even without object features it can be used to automatically discriminate minimal pairs (or triplets) of complete sentences based on motion class alone, with relatively high accuracy, given sufficient training time. This is an interesting result that says something about the data and the task, and not just the machine learning method. The meaning of a motion event, at least according to this data, can be found as much in what is left out as in what is said. This axiom, in many senses and interpretations, seems apt to generalize to semantic informativity in language at large.
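A minimal sketch of how this kind of presence/absence informativity can be scored is given below, using mutual information between binary presence indicators and the motion-class label; the feature names and toy data are illustrative, not the actual capture schema or the method used for the AET.

import numpy as np
from sklearn.feature_selection import mutual_info_classif

# One row per captured event, one binary column per underspecified parameter
# (1 = the parameter was assigned a value for this event, 0 = it was absent).
X_presence = np.array([[1, 0, 1],
                       [1, 1, 0],
                       [0, 1, 0],
                       [1, 0, 1]])
y = np.array([0, 1, 2, 0])  # integer-coded motion classes (e.g., turn / roll / slide)

scores = mutual_info_classif(X_presence, y, discrete_features=True, random_state=0)
for name, s in zip(["rot_angle", "rot_axis", "translation_dist"], scores):
    print("%s: %.3f nats" % (name, s))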


Chapter 5

Future Directions

While it has been discussed how composing primitive behaviors into complex events requires in-depth knowledge of the entailments underlying event descriptions, corpora targeted toward needs such as this have only recently begun to be developed (Bowman et al., 2015). Annotated video and image datasets that do not restrict annotator input to a core vocabulary (e.g., Ronchi and Perona (2015)) contain mostly top-level super-events as opposed to the subevent data needed to perform automatic event composition, while those that require a restricted vocabulary rely on differing sets of primitives (Chao et al., 2015; Gupta and Malik, 2015). The data gathered through the course of this research can serve as one such dataset, built on what I believe to be a set of primitive actions grounded solidly both in language understanding and semantics and in three-dimensional mathematics. Also provided are a set of feature vectors describing the test set of motion events, experimental data inferring some “best values” for those features (should they exist), and a machine learning method resulting in an assessment of the informativity of those features.

The Brandeis University lab in which this research has been conducted has begun bootstrapping a dataset of videos annotated with event-subevent relations using ECAT, an internally-developed video annotation tool (Do et al., 2016). ECAT allows the annotation of video with labeled events, object participants, and subevents, which can be used to induce the common subevent structures for the labeled superevent. Videos have been recorded of a test set of verbs using simple objects, such as blocks and other objects like those used in the test set here. Similar annotation methods have also been linked to data from movies (Do, 2016; Kehat and Pustejovsky, 2016).

The features of motion events gathered in this dissertation, along with the analyzed informativity metrics, can be used to better understand motion semantics in narratives. Through this line of research, it can be demonstrated that there exist, for some underspecified variables in motion events, prototypical values that create event enactments (here visual) that comport with human notions of prototypicality for those events. These conclusions flow from analysis of a dataset linking visualized events to linguistic instantiations, which can serve as the beginnings of a corpus of motion events annotated with, effectively, the information “missing” from the linguistic instantiation, such that an utterance describing a motion event, composed with its voxeme and corpus data from a dataset such as this one, results in a complete event visualization with the missing “bits” assigned values, allowing the event to be computationally evaluated. Underlying that is VoxML, a robust, extensible framework for mapping natural language to a minimal model and then to a simulation through the dynamic semantics, DITL. The implementation of this framework is VoxSim, a software platform for simulation in a visual modality, which demonstrates the framework’s utility for answering theoretical linguistic questions about motion events. It should be stressed here that visualization is just one available modality. As technology improves, events may be simulated through other modalities, including aural, haptic, or proprioceptive ones.

The deep learning methods used here for event classification provide good results over some permutations of the dataset, but could be better over others, and fail utterly over others still. The notion should be entertained that, since a motion event from start to finish is a sequence, attempting to decode it from a single feature vector is perhaps not the best technique, and sequence-to-sequence methods could be explored for this type of task. In addition, since qualitative spatial relations are readily available in the data (either directly encoded in feature vectors or calculable from other feature values plus generally accessible knowledge about the scene), machine learning algorithms that perform analogical generalization over the QSR data could provide similar or better results, potentially with less data (cf. McLure et al. (2015)).

The machine evaluation presented herein was largely modeled on the second human evaluation task, requiring the machine learning algorithm to predict the original input sentence from the features of a generated visualization. Using this data and the human-evaluated data as a gold standard, a machine learning-based version of the reverse task is also possible, wherein VoxSim generates a set of new visualizations with different random values for the underspecified parameters, and the algorithm predicts which of those new instances are most likely to be judged acceptable by a human evaluator. These results could then be compared to the presented results from the first human evaluation task.
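As a hedged sketch of the sequence idea raised above, a recurrent classifier could consume the per-frame feature vectors of an event rather than a single flattened vector; the dimensions, class count, and toy data below are placeholders, and this is not the classifier evaluated in Chapter 4.

import numpy as np
from tensorflow.keras import layers, models

n_frames, n_features, n_classes = 60, 27, 3  # hypothetical dimensions

model = models.Sequential([
    layers.Input(shape=(n_frames, n_features)),
    layers.Masking(mask_value=0.0),              # allow zero-padded shorter events
    layers.LSTM(64),                             # summarize the frame sequence
    layers.Dense(n_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Toy data standing in for frame-by-frame captures of motion events.
X = np.random.rand(8, n_frames, n_features).astype("float32")
y = np.random.randint(0, n_classes, size=8)
model.fit(X, y, epochs=1, verbose=0)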


5.1  Extensions to Methodology

Extensions to the research methodology may involve using VoxSim in an augmented or virtual reality environment, to examine how human perception of motion events changes when the environment is fully immersive rather than viewed on a screen. A further variable could be added by introducing a disorienting factor to the human judges’ perception, investigating the intersection between spatial perception, cognition, and language processing in virtual environments.

The use of a virtual environment also affords alternative methods of gathering additional data to augment the dataset presented here, which was gathered by a very constrained and specified method in order to present a potentially first-of-its-kind dataset in a new field. Data on alternative features can be gathered with very slight adjustments to the automatic capture code; human evaluation tasks can be designed to evaluate event visualization acceptability on a scale rather than as a binary; and VoxSim’s foundation on game design, AI theory, and game engine technology readily lends itself to “gamification” approaches to data gathering that could obviate some budget considerations and constraints (Pelling, 2011; Deterding et al., 2011). One such approach might be modeled on the “ESP game” (Von Ahn and Dabbish, 2004), wherein pairs of anonymous players label the same image and receive points when their labels coincide, providing an impetus to keep playing. Using a virtually identical mechanic, with motion events generated by VoxSim instead of static images, could potentially generate a large set of video captions (a la those used in HET2), and the search space could be left open or restricted and then evaluated against both the HET2 results and the presented machine-learning results. This and related approaches could provide many methods for expanding the initial datasets presented in this thesis into a genuine crowd-sourced gold standard, and allow more flexibility in gathering new data about new events or object interactions.
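A toy sketch of the ESP-style pairing mechanic described above follows; the label normalization and point values are illustrative choices, not a specification of the proposed game.

def normalize(label):
    """Lowercase and collapse whitespace so trivially different captions match."""
    return " ".join(label.lower().split())

def score_round(label_a, label_b, points=10):
    """Both players earn `points` if their normalized captions for the same clip coincide."""
    return points if normalize(label_a) == normalize(label_b) else 0

print(score_round("Slide the cup", "slide the cup"))  # 10
print(score_round("slide the cup", "move the cup"))   # 0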

5.2  VoxML and Robotics

The VoxML framework also has relevance to the field of robotics. While a humanoid skeleton in a 3D environment is a directed, rooted graph with nodes laid out in the rough configuration of a human, representing the positions of the major joints, a robotic agent could be virtually represented by a similar type of graph structure with a different configuration, isomorphic to the locations of major pivot points on the robot’s external structure, such as those of graspers or robotic limbs.

A 3D representation of a robotic agent that is operating in the real world would then allow the simulation of events in the 3D world (such as moving the simulated robot around a simulated table that has simulated blocks on it) representing events and object configurations in the real world. The event simulation then generates position and orientation information for each object in the scene at each time step t, which is isomorphic to the real-world configuration in the same way that the robot’s virtual skeleton is isomorphic to its actual joint structure. This allows the real robot, acting as an agent, to be fed a set of translation and rotation “moves” by its virtual embodiment that is a nearly exact representation of the steps it would need to take to satisfy a real-world goal, such as navigating to a target or grasping an object (cf. Thrun et al. (2000); Rusu et al. (2008)). In short, the interdisciplinary nature of the research that led to the creation of VoxML and VoxSim naturally affords many extensions into other disciplines, fields, and specializations.
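The following is a minimal sketch, under assumed conventions, of the mapping just described: the simulator's per-timestep pose stream for an object or effector is converted into incremental translation and rotation "moves" that a robot controller could consume. The pose format and the send_move stub are hypothetical.

from dataclasses import dataclass

@dataclass
class Pose:
    x: float; y: float; z: float     # position in the simulated scene
    rx: float; ry: float; rz: float  # orientation as Euler angles (degrees)

def pose_deltas(trajectory):
    """Yield (translation, rotation) deltas between consecutive timesteps."""
    for prev, curr in zip(trajectory, trajectory[1:]):
        dt = (curr.x - prev.x, curr.y - prev.y, curr.z - prev.z)
        dr = (curr.rx - prev.rx, curr.ry - prev.ry, curr.rz - prev.rz)
        yield dt, dr

def send_move(dt, dr):
    print("translate", dt, "rotate", dr)  # stand-in for a real robot command interface

trajectory = [Pose(0, 0, 0, 0, 0, 0), Pose(0.1, 0, 0, 0, 15, 0), Pose(0.2, 0, 0, 0, 30, 0)]
for dt, dr in pose_deltas(trajectory):
    send_move(dt, dr)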

5.3  Information-Theoretic Implications

The resemblance between the incomplete information provided by a linguistic predicate and models of ignorance about a physical system is quite striking. Ω, the measure of ignorance of a physical system, can be thought of in terms of the number of quantitatively-defined microstates consistent with a qualitative macrostate (or “label”), and in the data gathered here about linguistic presuppositions we can observe similar patterns unfolding with regard to events, wherein some events (for instance, “lean”) appear to allow a number of configurations (or microstates) that would satisfy that label (or macrostate) according to human evaluators. We also observe circumstances in which the incomplete information provided by a predicate is actually further restricted to some set of values and metrics that appear to correlate more closely with prototypicality of a given motion event than others, suggesting that some level of information is added by the interpreter. That evaluators seem to be resistant to highly underspecified events such as “move” being too far overspecified, or having too much information content assigned to them, suggests that predicates may be bearers of a finite level of information entropy, meaning that a finite level of information is required to describe the system.1

If I may be allowed to philosophize for a moment, this suggests that representability of a proposition is ultimately grounded in physical reality. For a statement to have meaning, irrespective of its truth value, it must be assessed relative to some condition in the world, either as describing a true situation, or in contrast to one. A “roll” is technically a kind of turning, but a “flip” may be a better one. Adding too much, or the wrong kind of, information to an underspecified predicate can change its meaning in the eye of the interpreter. Different intensions and extensions, different senses and references, find their union in the set of parameter values that satisfy both, resulting in a realization that, to the interpreter under the model constructed by the current situational context, appears to be the truth.

1 Qualitative reasoning approaches to this notion have been forwarded by Kuipers (1994) and Joskowicz and Sacks (1991), where many of the examples presented involve physical systems. Other discussion of qualitative spatial reasoning in physical systems is presented in the qualitative physics literature, including Forbus (1988) and Faltings and Struss (1992), and a recent gamification approach is presented by Shute et al. (2013).
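The Ω analogy can be made concrete with a toy calculation: discretize one underspecified parameter, count the bins ("microstates") evaluators would accept for a label (the "macrostate"), and take the base-2 logarithm of that count as an entropy-like measure. The acceptance sets below are invented purely for illustration.

import math

def label_entropy(accepted_bins):
    """log2 of the number of parameter microstates consistent with a label."""
    omega = len(accepted_bins)
    return math.log2(omega) if omega else float("-inf")

# Hypothetical accepted bins of a normalized angle parameter for two labels.
accepted = {
    "lean": {b / 10 for b in range(1, 9)},  # tolerant label: many microstates
    "flip": {b / 10 for b in range(4, 6)},  # strict label: few microstates
}
for label, bins in accepted.items():
    print("%s: omega = %d, H = %.2f bits" % (label, len(bins), label_entropy(bins)))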


Appendix A

VoxML Structures

A.1  Objects

block
  LEX = [ PRED = block
          TYPE = physobj, artifact ]
  TYPE = [ HEAD = rectangular prism[1]
           COMPONENTS = nil
           CONCAVITY = flat
           ROTATSYM = {X, Y, Z}
           REFLECTSYM = {XY, XZ, YZ} ]
  HABITAT = [ INTR = [2]
              EXTR = ... ]
  AFFORD_STR = [ A1 = H[2] → [put(x, on([1]))]support([1], x)
                 A3 = H[2] → [grasp(x, [1])]hold(x, [1])
                 A4 = H[2] → [slide(x, [1])]H[2] ]
  EMBODIMENT = [ SCALE = ...
                 ... ]

...
  =   

        ...
        if (/* condition truncated in the source; compares against 0.0f */) {
            // log the values assigned to the underspecified rotation parameters
            OnPrepareLog (this, new ParamsEventArgs ("RotAngle", angle.ToString ()));

            OnPrepareLog (this, new ParamsEventArgs ("RotDir", sign.ToString ()));
        }

        if (rotAxis != string.Empty) {
            OnPrepareLog (this, new ParamsEventArgs ("RotAxis", rotAxis));
        }

        // signal that parameter values have been calculated, but only if the event
        // being executed is the top predicate of the last parse or a specification of it
        if ((Helper.GetTopPredicate (eventManager.lastParse) ==
                Helper.GetTopPredicate (eventManager.events [0])) ||
            (PredicateParameters.IsSpecificationOf (
                Helper.GetTopPredicate (eventManager.events [0]),
                Helper.GetTopPredicate (eventManager.lastParse)))) {
            OnParamsCalculated (null, null);
        }
    }
}

return;
}

Figure C.1: C# operationalization of [[TURN]] (unabridged)

The segment below “add to events manager” adds the event to the global events manager for tracking and updating the status and state-by-state relations between objects in the scene that result from this [[TURN]] action. OnPrepareLog and OnParamsCalculated handle logging the values assigned to the underspecified parameters to a SQL database for experimentation and evaluation. This feature is present in VoxSim for users wishing to capture simulated events and their properties.
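For analysis outside of VoxSim, the logged values can be pulled back out of the database with an ordinary SQL query; the table and column names below (and the use of sqlite3) are hypothetical placeholders for whatever schema a user configures, since the text above specifies only that a SQL database is used.

import sqlite3

conn = sqlite3.connect("voxsim_log.db")  # hypothetical local copy of the log
rows = conn.execute(
    "SELECT event_id, param_name, param_value "
    "FROM event_params WHERE param_name IN ('RotAngle', 'RotAxis', 'RotDir')"
).fetchall()

for event_id, name, value in rows:
    print(event_id, name, value)
conn.close()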


Appendix D

Sentence Test Set

1. open the book

16. move the apple

31. turn the apple

2. close the book

17. move the banana

32. turn the banana

3. close the cup

18. move the bowl

33. turn the bowl

4. open the cup

19. move the knife

34. turn the knife

5. close the bottle

20. move the pencil

35. turn the pencil

6. open the bottle

21. move the paper sheet

36. turn the paper sheet

7. move the block

22. turn the block

37. lift the block

8. move the ball

23. turn the ball

38. lift the ball

9. move the plate

24. turn the plate

39. lift the plate

10. move the cup

25. turn the cup

40. lift the cup

11. move the disc

26. turn the disc

41. lift the disc

12. move the book

27. turn the book

42. lift the book

13. move the blackboard

28. turn the blackboard

43. lift the blackboard

14. move the bottle

29. turn the bottle

44. lift the bottle

15. move the grape

30. turn the grape

45. lift the grape


46. lift the apple

72. flip the disc at center

98. roll the ball

47. lift the banana

73. flip the book at center

99. roll the plate

48. lift the bowl

74. flip the bottle at center

100. roll the cup

49. lift the knife

75. flip the grape at center

101. roll the disc

50. lift the pencil

76. flip the apple at center

102. roll the book

51. lift the paper sheet

77. flip the banana at center

103. roll the blackboard

52. spin the block

78. flip the bowl at center

104. roll the bottle

53. spin the ball

79. flip the knife at center

105. roll the grape

54. spin the plate

80. flip the pencil at center

106. roll the apple

55. spin the cup

81. flip the paper sheet at center

107. roll the banana

56. spin the disc

82. slide the block

108. roll the bowl

57. spin the book

83. slide the ball

109. roll the knife

58. spin the blackboard

84. slide the plate

110. roll the pencil

59. spin the bottle

85. slide the cup

111. roll the paper sheet

60. spin the grape

86. slide the disc

112. open the banana

61. spin the apple

87. slide the book

113. put the block in the plate

62. spin the banana

88. slide the blackboard

114. put the block in the cup

63. spin the bowl

89. slide the bottle

115. put the block in the disc

64. spin the knife

90. slide the grape

116. put the block in the bottle

65. spin the pencil

91. slide the apple

117. put the block in the bowl

66. spin the paper sheet

92. slide the banana

118. put the ball in the block

67. flip the block on edge

93. slide the bowl

119. put the ball in the plate

68. flip the book on edge

94. slide the knife

120. put the ball in the cup

69. flip the blackboard on edge

95. slide the pencil

121. put the ball in the disc

70. flip the plate at center

96. slide the paper sheet

122. put the ball in the bottle

71. flip the cup at center

97. roll the block

123. put the ball in the bowl


124. put the plate in the cup

150. put the banana in the bowl

125. put the plate in the disc

151. put the bowl in the plate

126. put the plate in the bowl

152. put the bowl in the cup

127. put the cup in the plate

153. put the bowl in the disc

128. put the cup in the bottle

154. put the knife in the block

129. put the cup in the bowl

155. put the knife in the ball

130. put the disc in the cup 131. put the disc in the bottle 132. put the disc in the bowl

156. put the knife in the plate 157. put the knife in the cup 158. put the knife in the book

133. put the book in the cup

159. put the knife in the blackboard

134. put the book in the bottle

160. put the knife in the bottle

135. put the book in the bowl

161. put the knife in the grape

136. put the bottle in the cup

162. put the knife in the apple

137. put the bottle in the disc

163. put the knife in the banana

138. put the bottle in the bowl

164. put the knife in the bowl

139. put the grape in the plate

165. put the knife in the pencil

140. put the grape in the cup

166. put the knife in the paper sheet

141. put the grape in the book 167. put the pencil in the cup 142. put the grape in the bottle 168. put the pencil in the book 143. put the grape in the bowl 144. put the apple in the plate 145. put the apple in the cup 146. put the apple in the bottle 147. put the apple in the bowl 148. put the banana in the plate 149. put the banana in the cup

174. put the block touching the disc 175. put the block touching the book 176. put the block touching the blackboard 177. put the block touching the bottle 178. put the block touching the grape 179. put the block touching the apple 180. put the block touching the banana 181. put the block touching the bowl 182. put the block touching the knife 183. put the block touching the pencil 184. put the block touching the paper sheet 185. put the ball touching the block 186. put the ball touching the plate

169. put the pencil in the bottle 170. put the pencil in the bowl

187. put the ball touching the cup

171. put the block touching the ball

188. put the ball touching the disc

172. put the block touching the plate

189. put the ball touching the book

173. put the block touching the cup

190. put the ball touching the blackboard


191. put the ball touching the bottle

207. put the banana touching the block

223. put the pencil touching the ball

192. put the ball touching the grape

208. put the bowl touching the block

224. put the paper sheet touching the ball

193. put the ball touching the apple

209. put the knife touching the block

225. put the plate touching the cup

194. put the ball touching the banana

210. put the pencil touching the block

226. put the disc touching the cup

195. put the ball touching the bowl

211. put the paper sheet touching the block

227. put the book touching the cup

196. put the ball touching the knife

212. put the plate touching the ball

228. put the blackboard touching the cup

197. put the ball touching the pencil

213. put the cup touching the ball

229. put the bottle touching the cup

198. put the ball touching the paper sheet

214. put the disc touching the ball

230. put the grape touching the cup

199. put the plate touching the block

215. put the book touching the ball

231. put the apple touching the cup

200. put the cup touching the block

216. put the blackboard touching the ball

232. put the banana touching the cup

201. put the disc touching the block

217. put the bottle touching the ball

233. put the bowl touching the cup

202. put the book touching the block

218. put the grape touching the ball

234. put the knife touching the cup

203. put the blackboard touching the block

219. put the apple touching the ball

235. put the pencil touching the cup

204. put the bottle touching the block

220. put the banana touching the ball

236. put the paper sheet touching the cup

205. put the grape touching the block

221. put the bowl touching the ball

237. put the cup touching the plate

206. put the apple touching the block

222. put the knife touching the ball

238. put the disc touching the plate


239. put the book touching the plate

255. put the apple touching the disc

271. put the pencil touching the book

240. put the blackboard touching the plate

256. put the banana touching the disc

272. put the paper sheet touching the book

241. put the bottle touching the plate

257. put the bowl touching the disc

273. put the plate touching the blackboard

242. put the grape touching the plate

258. put the knife touching the disc

274. put the cup touching the blackboard

243. put the apple touching the plate

259. put the pencil touching the disc

275. put the disc touching the blackboard

244. put the banana touching the plate

260. put the paper sheet touching the disc

276. put the book touching the blackboard

245. put the bowl touching the plate

261. put the plate touching the book

277. put the bottle touching the blackboard

246. put the knife touching the plate

262. put the cup touching the book

278. put the grape touching the blackboard

247. put the pencil touching the plate

263. put the disc touching the book

279. put the apple touching the blackboard

248. put the paper sheet touching the plate

264. put the blackboard touching the book

280. put the banana touching the blackboard

249. put the plate touching the disc

265. put the bottle touching the book

281. put the bowl touching the blackboard

250. put the cup touching the disc

266. put the grape touching the book

282. put the knife touching the blackboard

251. put the book touching the disc

267. put the apple touching the book

283. put the pencil touching the blackboard

252. put the blackboard touching the disc

268. put the banana touching the book

284. put the paper sheet touching the blackboard

253. put the bottle touching the disc

269. put the bowl touching the book

285. put the plate touching the bottle

254. put the grape touching the disc

270. put the knife touching the book

286. put the cup touching the bottle


287. put the disc touching the bottle

303. put the apple touching the grape

319. put the pencil touching the apple

288. put the book touching the bottle

304. put the banana touching the grape

320. put the paper sheet touching the apple

289. put the blackboard touching the bottle

305. put the bowl touching the grape

321. put the plate touching the banana

290. put the grape touching the bottle

306. put the knife touching the grape

322. put the cup touching the banana

291. put the apple touching the bottle

307. put the pencil touching the grape

323. put the disc touching the banana

292. put the banana touching the bottle

308. put the paper sheet touching the grape

324. put the book touching the banana

293. put the bowl touching the bottle

309. put the plate touching the apple

325. put the blackboard touching the banana

294. put the knife touching the bottle

310. put the cup touching the apple

326. put the bottle touching the banana

295. put the pencil touching the bottle

311. put the disc touching the apple

327. put the grape touching the banana

296. put the paper sheet touching the bottle

312. put the book touching the apple

328. put the apple touching the banana

297. put the plate touching the grape

313. put the blackboard touching the apple

329. put the bowl touching the banana

298. put the cup touching the grape

314. put the bottle touching the apple

330. put the knife touching the banana

299. put the disc touching the grape

315. put the grape touching the apple

331. put the pencil touching the banana

300. put the book touching the grape

316. put the banana touching the apple

332. put the paper sheet touching the banana

301. put the blackboard touching the grape

317. put the bowl touching the apple

333. put the plate touching the bowl

302. put the bottle touching the grape

318. put the knife touching the apple

334. put the cup touching the bowl


335. put the disc touching the bowl

351. put the grape touching the knife

367. put the knife touching the pencil

336. put the book touching the bowl

352. put the apple touching the knife

368. put the paper sheet touching the pencil

337. put the blackboard touching the bowl

353. put the banana touching the knife

369. put the plate touching the paper sheet

338. put the bottle touching the bowl

354. put the bowl touching the knife

370. put the cup touching the paper sheet

339. put the grape touching the bowl

355. put the pencil touching the knife

371. put the disc touching the paper sheet

340. put the apple touching the bowl

356. put the paper sheet touching the knife

341. put the banana touching the bowl

357. put the plate touching the pencil

342. put the knife touching the bowl

358. put the cup touching the pencil

343. put the pencil touching the bowl

359. put the disc touching the pencil

344. put the paper sheet touching the bowl

360. put the book touching the pencil

345. put the plate touching the knife

361. put the blackboard touching the pencil

346. put the cup touching the knife

362. put the bottle touching the pencil

347. put the disc touching the knife

363. put the grape touching the pencil

348. put the book touching the knife

364. put the apple touching the pencil

349. put the blackboard touching the knife

365. put the banana touching the pencil

350. put the bottle touching the knife

366. put the bowl touching the pencil


372. put the book touching the paper sheet 373. put the blackboard touching the paper sheet 374. put the bottle touching the paper sheet 375. put the grape touching the paper sheet 376. put the apple touching the paper sheet 377. put the banana touching the paper sheet 378. put the bowl touching the paper sheet 379. put the knife touching the paper sheet 380. put the pencil touching the paper sheet 381. put the block on the ball 382. put the block on the plate 383. put the block on the cup 384. put the block on the disc

385. put the block on the book

409. put the plate on the block

434. put the cup on the knife

386. put the block on the blackboard

410. put the plate on the ball

435. put the cup on the pencil

411. put the plate on the cup

436. put the cup on the paper sheet

387. put the block on the bottle 388. put the block on the grape 389. put the block on the apple

412. put the plate on the disc 413. put the plate on the book

437. put the disc on the block 438. put the disc on the ball

390. put the block on the banana

414. put the plate on the blackboard

391. put the block on the bowl

415. put the plate on the bottle

392. put the block on the knife

416. put the plate on the grape

393. put the block on the pencil

417. put the plate on the apple

442. put the disc on the blackboard

394. put the block on the paper sheet

418. put the plate on the banana

443. put the disc on the bottle

419. put the plate on the bowl

444. put the disc on the grape

395. put the ball on the block

420. put the plate on the knife

445. put the disc on the apple

396. put the ball on the plate

421. put the plate on the pencil

446. put the disc on the banana

397. put the ball on the cup

447. put the disc on the bowl

398. put the ball on the disc

422. put the plate on the paper sheet

399. put the ball on the book

423. put the cup on the block

449. put the disc on the pencil

424. put the cup on the ball

450. put the disc on the paper sheet

439. put the disc on the plate 440. put the disc on the cup 441. put the disc on the book

400. put the ball on the blackboard

425. put the cup on the plate

448. put the disc on the knife

451. put the book on the block

401. put the ball on the bottle

426. put the cup on the disc

402. put the ball on the grape

427. put the cup on the book

452. put the book on the ball 403. put the ball on the apple 404. put the ball on the banana 405. put the ball on the bowl 406. put the ball on the knife 407. put the ball on the pencil 408. put the ball on the paper sheet

428. put the cup on the blackboard 429. put the cup on the bottle

453. put the book on the plate 454. put the book on the cup 455. put the book on the disc

430. put the cup on the grape

456. put the book on the blackboard

431. put the cup on the apple

457. put the book on the bottle

432. put the cup on the banana

458. put the book on the grape

433. put the cup on the bowl

459. put the book on the apple


460. put the book on the banana

484. put the grape on the disc

508. put the banana on the block

461. put the book on the bowl

485. put the grape on the book

509. put the banana on the ball

462. put the book on the knife

486. put the grape on the blackboard

510. put the banana on the plate

487. put the grape on the bottle

512. put the banana on the disc

488. put the grape on the apple

513. put the banana on the book

489. put the grape on the banana

514. put the banana on the bottle

490. put the grape on the bowl

515. put the banana on the grape

491. put the grape on the knife

516. put the banana on the apple

492. put the grape on the pencil

517. put the banana on the bowl

463. put the book on the pencil 464. put the book on the paper sheet 465. put the blackboard on the paper sheet 466. put the bottle on the block 467. put the bottle on the ball 468. put the bottle on the plate 469. put the bottle on the cup

493. put the grape on the paper sheet

470. put the bottle on the disc

494. put the apple on the block

471. put the bottle on the book

495. put the apple on the ball

472. put the bottle on the blackboard

496. put the apple on the plate

511. put the banana on the cup

518. put the banana on the knife 519. put the banana on the pencil 520. put the banana on the paper sheet 521. put the bowl on the block

473. put the bottle on the grape 474. put the bottle on the apple

497. put the apple on the cup 498. put the apple on the disc

522. put the bowl on the ball 523. put the bowl on the plate 524. put the bowl on the cup

499. put the apple on the book

525. put the bowl on the disc

500. put the apple on the blackboard

526. put the bowl on the book

476. put the bottle on the bowl 477. put the bottle on the knife

501. put the apple on the bottle

478. put the bottle on the pencil

502. put the apple on the grape

479. put the bottle on the paper sheet

503. put the apple on the banana

475. put the bottle on the banana

527. put the bowl on the bottle

480. put the grape on the block 481. put the grape on the ball 482. put the grape on the plate 483. put the grape on the cup

528. put the bowl on the grape 529. put the bowl on the apple 530. put the bowl on the banana

504. put the apple on the bowl

531. put the bowl on the knife

505. put the apple on the knife

532. put the bowl on the pencil

506. put the apple on the pencil

533. put the bowl on the paper sheet

507. put the apple on the paper sheet


534. put the knife on the block

560. put the pencil on the paper sheet

577. put the block near the disc

561. put the paper sheet on the block

579. put the block near the blackboard 580. put the block near the bottle

539. put the knife on the book

562. put the paper sheet on the ball

540. put the knife on the blackboard

563. put the paper sheet on the plate

582. put the block near the apple

541. put the knife on the bottle

564. put the paper sheet on the cup

535. put the knife on the ball 536. put the knife on the plate 537. put the knife on the cup 538. put the knife on the disc

544. put the knife on the banana 545. put the knife on the bowl 546. put the knife on the pencil 547. put the knife on the paper sheet 548. put the pencil on the block 549. put the pencil on the ball 550. put the pencil on the plate

565. put the paper sheet on the disc 566. put the paper sheet on the book 567. put the paper sheet on the blackboard 568. put the paper sheet on the bottle 569. put the paper sheet on the grape 570. put the paper sheet on the apple

551. put the pencil on the cup 552. put the pencil on the disc 553. put the pencil on the book

581. put the block near the grape

583. put the block near the banana 584. put the block near the bowl

542. put the knife on the grape 543. put the knife on the apple

578. put the block near the book

571. put the paper sheet on the banana

585. put the block near the knife 586. put the block near the pencil 587. put the block near the paper sheet 588. put the ball near the block 589. put the ball near the cup 590. put the ball near the disc 591. put the ball near the book 592. put the ball near the blackboard 593. put the ball near the bottle 594. put the ball near the grape

572. put the paper sheet on the bowl

595. put the ball near the apple

573. put the paper sheet on the knife

597. put the ball near the bowl

557. put the pencil on the banana

574. put the paper sheet on the pencil

599. put the ball near the pencil

558. put the pencil on the bowl

575. put the block near the ball

600. put the ball near the paper sheet

559. put the pencil on the knife

576. put the block near the cup

601. put the plate near the block

554. put the pencil on the bottle 555. put the pencil on the grape 556. put the pencil on the apple


596. put the ball near the banana

598. put the ball near the knife

602. put the plate near the ball
603. put the plate near the cup

626. put the cup near the paper sheet 627. put the disc near the block

604. put the plate near the disc 628. put the disc near the ball 605. put the plate near the book 629. put the disc near the cup 606. put the plate near the blackboard 607. put the plate near the bottle

630. put the disc near the book 631. put the disc near the blackboard

608. put the plate near the grape 632. put the disc near the bottle 609. put the plate near the apple 633. put the disc near the grape 610. put the plate near the banana

634. put the disc near the apple

611. put the plate near the bowl

635. put the disc near the banana

612. put the plate near the knife

636. put the disc near the bowl

613. put the plate near the pencil

637. put the disc near the knife

614. put the cup near the block

638. put the disc near the pencil

615. put the cup near the ball

639. put the disc near the paper sheet

616. put the cup near the disc 640. put the book near the block

649. put the book near the bowl 650. put the book near the knife 651. put the book near the pencil 652. put the blackboard near the block 653. put the blackboard near the ball 654. put the blackboard near the cup 655. put the blackboard near the disc 656. put the blackboard near the book 657. put the blackboard near the bottle 658. put the blackboard near the grape 659. put the blackboard near the apple 660. put the blackboard near the banana

617. put the cup near the book 641. put the book near the ball

661. put the blackboard near the bowl

618. put the cup near the blackboard

642. put the book near the cup

619. put the cup near the bottle

643. put the book near the disc

662. put the blackboard near the knife

620. put the cup near the grape

644. put the book near the blackboard

663. put the blackboard near the pencil

645. put the book near the bottle

664. put the blackboard near the paper sheet

621. put the cup near the apple 622. put the cup near the banana 646. put the book near the grape 623. put the cup near the bowl 647. put the book near the apple 624. put the cup near the knife 625. put the cup near the pencil

648. put the book near the banana


665. put the bottle near the block 666. put the bottle near the ball 667. put the bottle near the cup

668. put the bottle near the disc
669. put the bottle near the book
670. put the bottle near the blackboard

690. put the grape near the paper sheet

711. put the banana near the grape

691. put the apple near the block

712. put the banana near the apple

692. put the apple near the ball

671. put the bottle near the grape

693. put the apple near the cup

672. put the bottle near the apple

694. put the apple near the disc

673. put the bottle near the banana

695. put the apple near the book

713. put the banana near the bowl 714. put the banana near the knife 715. put the banana near the pencil

674. put the bottle near the bowl

696. put the apple near the blackboard

675. put the bottle near the knife

697. put the apple near the bottle

716. put the banana near the paper sheet

676. put the bottle near the pencil

698. put the apple near the grape

717. put the bowl near the block

699. put the apple near the banana

718. put the bowl near the ball

700. put the apple near the bowl

720. put the bowl near the disc

701. put the apple near the knife

721. put the bowl near the book

702. put the apple near the pencil

722. put the bowl near the blackboard

677. put the bottle near the paper sheet 678. put the grape near the block 679. put the grape near the ball 680. put the grape near the cup 681. put the grape near the disc

703. put the apple near the paper sheet

682. put the grape near the book 683. put the grape near the blackboard

704. put the banana near the block 705. put the banana near the ball

684. put the grape near the bottle 706. put the banana near the cup 685. put the grape near the apple 707. put the banana near the disc 686. put the grape near the banana 687. put the grape near the bowl

708. put the banana near the book

719. put the bowl near the cup

723. put the bowl near the bottle 724. put the bowl near the grape 725. put the bowl near the apple 726. put the bowl near the banana 727. put the bowl near the knife 728. put the bowl near the pencil 729. put the bowl near the paper sheet 730. put the knife near the block

688. put the grape near the knife

709. put the banana near the blackboard

689. put the grape near the pencil

710. put the banana near the bottle


731. put the knife near the ball 732. put the knife near the cup 733. put the knife near the disc

734. put the knife near the book
735. put the knife near the blackboard
736. put the knife near the bottle
737. put the knife near the grape
738. put the knife near the apple
739. put the knife near the banana
740. put the knife near the bowl
741. put the knife near the pencil
742. put the knife near the paper sheet
743. put the pencil near the block
744. put the pencil near the ball
745. put the pencil near the cup
746. put the pencil near the disc

755. put the paper sheet near the block

773. lean the block on the blackboard

756. put the paper sheet near the ball

774. lean the block on the bottle

757. put the paper sheet near the cup 758. put the paper sheet near the disc 759. put the paper sheet near the book 760. put the paper sheet near the blackboard 761. put the paper sheet near the bottle 762. put the paper sheet near the grape 763. put the paper sheet near the apple 764. put the paper sheet near the banana

747. put the pencil near the book 748. put the pencil near the blackboard 749. put the pencil near the bottle 750. put the pencil near the grape

765. put the paper sheet near the bowl

775. lean the block on the grape 776. lean the block on the apple 777. lean the block on the banana 778. lean the block on the bowl 779. lean the block on the knife 780. lean the block on the pencil 781. lean the block on the paper sheet 782. lean the ball on the block 783. lean the ball on the plate 784. lean the ball on the cup 785. lean the ball on the disc 786. lean the ball on the book 787. lean the ball on the blackboard 788. lean the ball on the bottle

766. put the paper sheet near the knife

789. lean the ball on the grape

767. put the paper sheet near the pencil

790. lean the ball on the apple

768. lean the block on the ball

792. lean the ball on the bowl

751. put the pencil near the apple

791. lean the ball on the banana

769. lean the block on the plate 752. put the pencil near the banana

793. lean the ball on the knife

770. lean the block on the cup

794. lean the ball on the pencil

753. put the pencil near the bowl

771. lean the block on the disc

795. lean the plate on the block

754. put the pencil near the knife

772. lean the block on the book

796. lean the plate on the ball


797. lean the plate on the cup

822. lean the disc on the ball

847. lean the bottle on the block

798. lean the plate on the disc

823. lean the disc on the plate

848. lean the bottle on the ball

799. lean the plate on the book

824. lean the disc on the cup

849. lean the bottle on the plate

800. lean the plate on the blackboard

825. lean the disc on the book

850. lean the bottle on the cup 851. lean the bottle on the disc

801. lean the plate on the bottle

826. lean the disc on the blackboard

802. lean the plate on the grape

827. lean the disc on the bottle

803. lean the plate on the apple

828. lean the disc on the grape

804. lean the plate on the banana

829. lean the disc on the apple

805. lean the plate on the bowl

830. lean the disc on the banana

806. lean the plate on the knife

831. lean the disc on the bowl

807. lean the plate on the pencil

832. lean the disc on the knife

808. lean the cup on the block

833. lean the disc on the pencil

809. lean the cup on the ball

834. lean the book on the block

810. lean the cup on the plate

835. lean the book on the ball

811. lean the cup on the disc

836. lean the book on the plate

812. lean the cup on the book

837. lean the book on the cup

863. lean the grape on the cup

813. lean the cup on the blackboard

838. lean the book on the disc

864. lean the grape on the disc

839. lean the book on the blackboard

865. lean the grape on the book

814. lean the cup on the bottle 815. lean the cup on the grape

840. lean the book on the bottle

866. lean the grape on the blackboard

816. lean the cup on the apple

841. lean the book on the grape

867. lean the grape on the bottle

817. lean the cup on the banana

842. lean the book on the apple

818. lean the cup on the bowl

843. lean the book on the banana

869. lean the grape on the banana

819. lean the cup on the knife

844. lean the book on the bowl

870. lean the grape on the bowl

820. lean the cup on the pencil

845. lean the book on the knife

871. lean the grape on the knife

821. lean the disc on the block

846. lean the book on the pencil

872. lean the grape on the pencil

852. lean the bottle on the book 853. lean the bottle on the blackboard 854. lean the bottle on the grape 855. lean the bottle on the apple 856. lean the bottle on the banana 857. lean the bottle on the bowl 858. lean the bottle on the knife 859. lean the bottle on the pencil 860. lean the grape on the block


861. lean the grape on the ball 862. lean the grape on the plate

868. lean the grape on the apple

873. lean the apple on the block

895. lean the banana on the apple

874. lean the apple on the ball 896. lean the banana on the bowl 875. lean the apple on the plate

920. lean the knife on the grape 921. lean the knife on the apple

897. lean the banana on the knife

922. lean the knife on the banana

898. lean the banana on the pencil

923. lean the knife on the bowl

876. lean the apple on the cup 877. lean the apple on the disc

919. lean the knife on the bottle

878. lean the apple on the book

899. lean the bowl on the block

879. lean the apple on the blackboard

900. lean the bowl on the ball

924. lean the knife on the pencil 925. lean the pencil on the block 926. lean the pencil on the ball

901. lean the bowl on the plate

927. lean the pencil on the plate

902. lean the bowl on the cup

928. lean the pencil on the cup

903. lean the bowl on the disc

929. lean the pencil on the disc

882. lean the apple on the banana

904. lean the bowl on the book

930. lean the pencil on the book

883. lean the apple on the bowl

905. lean the bowl on the blackboard

931. lean the pencil on the blackboard

884. lean the apple on the knife

906. lean the bowl on the bottle

932. lean the pencil on the bottle

885. lean the apple on the pencil

907. lean the bowl on the grape

886. lean the banana on the block

908. lean the bowl on the apple

880. lean the apple on the bottle 881. lean the apple on the grape

889. lean the banana on the cup 890. lean the banana on the disc 891. lean the banana on the book 892. lean the banana on the blackboard 893. lean the banana on the bottle 894. lean the banana on the grape

934. lean the pencil on the apple

909. lean the bowl on the banana

935. lean the pencil on the banana

910. lean the bowl on the knife

936. lean the pencil on the bowl

911. lean the bowl on the pencil

937. lean the pencil on the knife

912. lean the knife on the block

938. lean the paper sheet on the block

887. lean the banana on the ball 888. lean the banana on the plate

933. lean the pencil on the grape

913. lean the knife on the ball 914. lean the knife on the plate 915. lean the knife on the cup

939. lean the paper sheet on the ball 940. lean the paper sheet on the plate

916. lean the knife on the disc 917. lean the knife on the book

941. lean the paper sheet on the cup

918. lean the knife on the blackboard

942. lean the paper sheet on the disc


943. lean the paper sheet on the book

959. lean the block against the grape

975. lean the plate against the bowl

944. lean the paper sheet on the blackboard

960. lean the block against the apple

976. lean the plate against the knife

945. lean the paper sheet on the bottle

961. lean the block against the banana

977. lean the plate against the pencil

946. lean the paper sheet on the grape

962. lean the block against the bowl

978. lean the cup against the block

947. lean the paper sheet on the apple

963. lean the block against the knife

979. lean the cup against the ball

948. lean the paper sheet on the banana

964. lean the block against the pencil

949. lean the paper sheet on the bowl

965. lean the plate against the block

950. lean the paper sheet on the knife

966. lean the plate against the ball

983. lean the cup against the blackboard

951. lean the paper sheet on the pencil

967. lean the plate against the cup

984. lean the cup against the bottle

952. lean the block against the ball

968. lean the plate against the disc

985. lean the cup against the grape

953. lean the block against the plate

969. lean the plate against the book

986. lean the cup against the apple

954. lean the block against the cup

970. lean the plate against the blackboard

987. lean the cup against the banana

955. lean the block against the disc

971. lean the plate against the bottle

988. lean the cup against the bowl

956. lean the block against the book

972. lean the plate against the grape

989. lean the cup against the knife

957. lean the block against the blackboard

973. lean the plate against the apple

990. lean the cup against the pencil

958. lean the block against the bottle

974. lean the plate against the banana

991. lean the disc against the block


980. lean the cup against the plate 981. lean the cup against the disc 982. lean the cup against the book

992. lean the disc against the ball
993. lean the disc against the plate
994. lean the disc against the cup
995. lean the disc against the book
996. lean the disc against the blackboard
997. lean the disc against the bottle
998. lean the disc against the grape
999. lean the disc against the apple
1000. lean the disc against the banana
1001. lean the disc against the bowl
1002. lean the disc against the knife
1003. lean the disc against the pencil
1004. lean the book against the block
1005. lean the book against the ball
1006. lean the book against the plate
1007. lean the book against the cup
1008. lean the book against the disc
1009. lean the book against the blackboard
1010. lean the book against the bottle
1011. lean the book against the grape
1012. lean the book against the apple
1013. lean the book against the banana
1014. lean the book against the bowl
1015. lean the book against the knife
1016. lean the bottle against the block
1017. lean the bottle against the ball
1018. lean the bottle against the plate
1019. lean the bottle against the cup
1020. lean the bottle against the disc
1021. lean the bottle against the book
1022. lean the bottle against the blackboard
1023. lean the bottle against the grape
1024. lean the bottle against the apple
1025. lean the bottle against the banana
1026. lean the bottle against the bowl
1027. lean the bottle against the knife
1028. lean the bottle against the pencil
1029. lean the grape against the block
1030. lean the grape against the ball
1031. lean the grape against the plate
1032. lean the grape against the cup
1033. lean the grape against the disc
1034. lean the grape against the book
1035. lean the grape against the blackboard
1036. lean the grape against the bottle
1037. lean the grape against the apple
1038. lean the grape against the banana
1039. lean the grape against the bowl
1040. lean the grape against the knife
1041. lean the grape against the pencil
1042. lean the apple against the block
1043. lean the apple against the plate
1044. lean the apple against the cup
1045. lean the apple against the disc
1046. lean the apple against the book
1047. lean the apple against the blackboard
1048. lean the apple against the bottle
1049. lean the apple against the banana
1050. lean the apple against the bowl
1051. lean the apple against the knife
1052. lean the apple against the pencil
1053. lean the banana against the block
1054. lean the banana against the ball
1055. lean the banana against the plate
1056. lean the banana against the cup
1057. lean the banana against the disc
1058. lean the banana against the book
1059. lean the banana against the blackboard
1060. lean the banana against the bottle
1061. lean the banana against the grape
1062. lean the banana against the apple
1063. lean the banana against the bowl
1064. lean the banana against the knife
1065. lean the banana against the pencil
1066. lean the bowl against the block
1067. lean the bowl against the ball
1068. lean the bowl against the plate
1069. lean the bowl against the cup
1070. lean the bowl against the disc
1071. lean the bowl against the book
1072. lean the bowl against the blackboard
1073. lean the bowl against the bottle
1074. lean the bowl against the grape
1075. lean the bowl against the apple
1076. lean the bowl against the banana
1077. lean the bowl against the knife
1078. lean the bowl against the pencil
1079. lean the knife against the block
1080. lean the knife against the ball
1081. lean the knife against the plate
1082. lean the knife against the cup
1083. lean the knife against the disc
1084. lean the knife against the book
1085. lean the knife against the blackboard
1086. lean the knife against the bottle
1087. lean the knife against the grape
1088. lean the knife against the apple
1089. lean the knife against the banana
1090. lean the knife against the bowl
1091. lean the knife against the pencil
1092. lean the pencil against the block
1093. lean the pencil against the ball
1094. lean the pencil against the plate
1095. lean the pencil against the cup
1096. lean the pencil against the disc
1097. lean the pencil against the book
1098. lean the pencil against the blackboard
1099. lean the pencil against the bottle
1100. lean the pencil against the grape
1101. lean the pencil against the apple
1102. lean the pencil against the banana
1103. lean the pencil against the bowl
1104. lean the pencil against the knife
1105. lean the paper sheet against the block
1106. lean the paper sheet against the ball
1107. lean the paper sheet against the plate
1108. lean the paper sheet against the cup
1109. lean the paper sheet against the disc
1110. lean the paper sheet against the book
1111. lean the paper sheet against the blackboard
1112. lean the paper sheet against the bottle
1113. lean the paper sheet against the grape
1114. lean the paper sheet against the apple
1115. lean the paper sheet against the banana
1116. lean the paper sheet against the bowl
1117. lean the paper sheet against the knife
1118. lean the paper sheet against the pencil
1119. put the book near the paper sheet
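The items above enumerate "lean" sentences over ordered pairs drawn from a fixed object inventory. The appendix does not state how the list was generated, and the actual test set numbers these items within a larger sequence and omits a handful of pairs, so the following is only an illustrative sketch of how a cross-product enumeration of this kind could be produced; the object names are taken from the items listed, while the function name and local numbering are hypothetical:

    # Illustrative sketch only: enumerate "lean the X against the Y" sentences
    # over ordered pairs of distinct objects, as in the listing above.
    objects = ["block", "ball", "plate", "cup", "disc", "book", "blackboard",
               "bottle", "grape", "apple", "banana", "bowl", "knife", "pencil",
               "paper sheet"]

    def lean_sentences(inventory):
        """Yield 'lean the X against the Y' for every ordered pair X != Y."""
        for subj in inventory:
            for targ in inventory:
                if subj != targ:
                    yield "lean the {0} against the {1}".format(subj, targ)

    for i, sentence in enumerate(lean_sentences(objects), start=1):
        print("{0}. {1}".format(i, sentence))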


Appendix E Data Tables


E.1  DNN with Unweighted Features

Granularity 1: Predicting Predicate Only

3-way Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    3268      89          97.37%       0.00763   0.00006
2000             3357    3277      80          97.64%       0.00879   0.00008
3000             3357    3285      72          97.88%       0.00541   0.00003
4000             3357    3283      74          97.82%       0.00450   0.00002
5000             3357    3285      72          97.88%       0.00541   0.00003

Unrestricted Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    3138      219         93.50%       0.01094   0.00012
2000             3357    3184      173         94.87%       0.01329   0.00018
3000             3357    3202      155         95.41%       0.00491   0.00002
4000             3357    3198      159         95.29%       0.00791   0.00006
5000             3357    3193      164         95.14%       0.00662   0.00004

Granularity 2: Predicting Predicate w/ Adjunct

3-way Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    2748      609         81.88%       0.00937   0.00009
2000             3357    2714      643         80.87%       0.01255   0.00016
3000             3357    2724      633         81.16%       0.01352   0.00018
4000             3357    2739      618         81.61%       0.02167   0.00047
5000             3357    2725      632         81.19%       0.01443   0.00021

Unrestricted Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    1867      1490        55.63%       0.01290   0.00017
2000             3357    1895      1462        56.46%       0.02063   0.00043
3000             3357    1867      1490        55.63%       0.02208   0.00049
4000             3357    1919      1438        57.18%       0.01810   0.00033
5000             3357    1899      1458        56.58%       0.01643   0.00027

Granularity 3: Predicting Full Sentence

3-way Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    1638      1719        48.80%       0.08193   0.00671
2000             3357    2143      1214        63.25%       0.02232   0.00050
3000             3357    2243      1114        66.83%       0.02235   0.00050
4000             3357    2177      1180        64.86%       0.02260   0.00051
5000             3357    2145      1212        63.91%       0.03331   0.00111

Unrestricted Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    8         3349        0.24%        0.00190   0.00000
2000             3357    14        3343        0.42%        0.00350   0.00001
3000             3357    12        3345        0.36%        0.00237   0.00001
4000             3357    22        3335        0.66%        0.00481   0.00002
5000             3357    17        3340        0.51%        0.00466   0.00002

Table E.1: Accuracy tables for “vanilla” DNN automatic evaluation
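In these tables, and in the sections that follow, the µ Accuracy, σ, and σ² columns report a mean accuracy with its standard deviation and variance, which implies aggregation over repeated evaluation runs of each configuration. A minimal sketch of that aggregation is given below for clarity; the per-run counts, the choice of the sample standard deviation, and the function name are illustrative assumptions rather than details taken from the evaluation code:

    # Illustrative sketch only: aggregate per-run accuracies into the
    # statistics reported in the tables (mean µ, std. dev. σ, variance σ²).
    import statistics

    def summarize_runs(correct_per_run, total):
        accuracies = [c / float(total) for c in correct_per_run]
        mu = statistics.mean(accuracies)
        sigma = statistics.stdev(accuracies)  # sample std. dev. (assumption)
        return mu, sigma, sigma ** 2

    # Hypothetical example: five runs of one configuration over 3,357 items.
    mu, sigma, var = summarize_runs([3268, 3270, 3265, 3272, 3266], 3357)
    print("µ = {:.2%}, σ = {:.5f}, σ² = {:.5f}".format(mu, sigma, var))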

E.2  DNN with Weighted Features

Granularity 1: Predicting Predicate Only

3-way Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    3234      123         96.36%       0.00756   0.00006
2000             3357    3275      82          97.58%       0.00701   0.00005
3000             3357    3284      73          97.85%       0.00771   0.00006
4000             3357    3284      73          97.85%       0.00703   0.00005
5000             3357    3268      89          97.67%       0.01541   0.00024

Unrestricted Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    3136      221         93.44%       0.01313   0.00017
2000             3357    3190      167         95.05%       0.00987   0.00010
3000             3357    3193      154         95.44%       0.01375   0.00019
4000             3357    3204      153         95.50%       0.00974   0.00009
5000             3357    3204      153         95.47%       0.01079   0.00012

Granularity 2: Predicting Predicate w/ Adjunct

3-way Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    2694      663         80.27%       0.01754   0.00031
2000             3357    2718      639         80.98%       0.01579   0.00025
3000             3357    2707      650         80.66%       0.02251   0.00051
4000             3357    2711      646         80.78%       0.01977   0.00039
5000             3357    2733      624         81.34%       0.01664   0.00028

Unrestricted Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    1851      1506        55.15%       0.01230   0.00015
2000             3357    1869      1488        55.69%       0.01621   0.00026
3000             3357    1895      1462        56.46%       0.01279   0.00016
4000             3357    1874      1483        55.84%       0.01694   0.00029
5000             3357    1903      1454        56.70%       0.01693   0.00029

Granularity 3: Predicting Full Sentence

3-way Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    1521      1836        45.32%       0.05113   0.00261
2000             3357    2018      1339        60.13%       0.03917   0.00153
3000             3357    2120      1237        63.17%       0.03076   0.00095
4000             3357    2091      1266        62.30%       0.03146   0.00099
5000             3357    2107      1250        62.78%       0.02653   0.00070

Unrestricted Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    5         3352        0.15%        0.00158   0.00000
2000             3357    8         3349        0.24%        0.00237   0.00001
3000             3357    8         3349        0.24%        0.00190   0.00000
4000             3357    7         3350        0.21%        0.00202   0.00000
5000             3357    20        3337        0.60%        0.00543   0.00003

Table E.2: Accuracy tables for DNN automatic evaluation with weighted features

E.3  DNN with Weighted Discrete Features

Granularity 1: Predicting Predicate Only

3-way Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    3239      118         96.51%       0.01140   0.00013
2000             3357    3286      71          97.91%       0.00589   0.00003
3000             3357    3285      72          97.88%       0.00577   0.00003
4000             3357    3283      74          97.82%       0.00530   0.00003
5000             3357    3289      68          98.00%       0.00393   0.00002

Unrestricted Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    3113      244         92.75%       0.00923   0.00009
2000             3357    3189      168         95.02%       0.00675   0.00005
3000             3357    3194      163         95.17%       0.00658   0.00004
4000             3357    3195      162         95.20%       0.01088   0.00012
5000             3357    3205      152         95.50%       0.00610   0.00004

Granularity 2: Predicting Predicate w/ Adjunct

3-way Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    2695      662         80.33%       0.01608   0.00026
2000             3357    2719      638         81.01%       0.01594   0.00025
3000             3357    2725      632         81.19%       0.01870   0.00035
4000             3357    2696      661         80.33%       0.01870   0.00035
5000             3357    2717      640         80.95%       0.01553   0.00024

Unrestricted Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    1852      1505        55.18%       0.01666   0.00028
2000             3357    1881      1476        56.04%       0.02072   0.00043
3000             3357    1900      1457        56.61%       0.00937   0.00009
4000             3357    1886      1471        56.19%       0.00688   0.00005
5000             3357    1895      1462        56.46%       0.01429   0.00020

Granularity 3: Predicting Full Sentence

3-way Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    1816      1541        54.11%       0.07782   0.00606
2000             3357    2192      1165        65.31%       0.02551   0.00065
3000             3357    2190      1167        65.25%       0.02977   0.00089
4000             3357    2182      1175        65.01%       0.01966   0.00039
5000             3357    2137      1220        63.67%       0.03622   0.00131

Unrestricted Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    6         3351        0.18%        0.00155   0.00000
2000             3357    15        3342        0.45%        0.00253   0.00001
3000             3357    18        3339        0.54%        0.00391   0.00002
4000             3357    15        3342        0.45%        0.00321   0.00001
5000             3357    20        3337        0.60%        0.00343   0.00001

Table E.3: Accuracy tables for DNN automatic evaluation with weighted discrete features

E.4  DNN with Feature Weights Only

Granularity 1: Predicting Predicate Only

3-way Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    3298      59          98.27%       0.00552   0.00003
2000             3357    3321      36          98.95%       0.00401   0.00002
3000             3357    3324      33          99.04%       0.00372   0.00001
4000             3357    3323      34          99.01%       0.00319   0.00001
5000             3357    3321      36          98.95%       0.00375   0.00001

Unrestricted Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    3226      131         96.12%       0.00664   0.00004
2000             3357    3253      104         96.93%       0.00562   0.00003
3000             3357    3257      100         97.05%       0.00503   0.00003
4000             3357    3259      98          97.10%       0.00437   0.00002
5000             3357    3258      99          97.07%       0.00494   0.00002

Granularity 2: Predicting Predicate w/ Adjunct

3-way Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    2784      573         82.95%       0.00957   0.00009
2000             3357    2797      560         83.34%       0.00672   0.00005
3000             3357    2800      557         83.43%       0.00896   0.00008
4000             3357    2790      567         83.13%       0.00943   0.00009
5000             3357    2794      563         83.25%       0.00859   0.00007

Unrestricted Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    2538      819         75.62%       0.01129   0.00013
2000             3357    2546      811         75.86%       0.00878   0.00008
3000             3357    2570      787         76.57%       0.00767   0.00006
4000             3357    2568      789         76.51%       0.00905   0.00008
5000             3357    2567      790         76.48%       0.00899   0.00008

Granularity 3: Predicting Full Sentence

3-way Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    381       2976        11.35%       0.05644   0.00319
2000             3357    2255      1102        67.19%       0.04904   0.00240
3000             3357    2342      1015        69.78%       0.02527   0.00064
4000             3357    2340      1017        69.72%       0.02343   0.00055
5000             3357    2320      1037        69.12%       0.01799   0.00032

Unrestricted Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    1         3356        0.03%        0.00095   0.00000
2000             3357    6         3351        0.18%        0.00253   0.00001
3000             3357    15        3342        0.45%        0.00470   0.00002
4000             3357    21        3336        0.63%        0.00356   0.00001
5000             3357    21        3336        0.63%        0.00453   0.00002

Table E.4: Accuracy tables for DNN automatic evaluation with feature weights alone

E.5  Combined Linear-DNN with Unweighted Features

Granularity 1: Predicting Predicate Only

3-way Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    3207      150         95.56%       0.00841   0.00007
2000             3357    3218      139         95.88%       0.00813   0.00007
3000             3357    3221      136         95.97%       0.00733   0.00005
4000             3357    3227      130         96.15%       0.00703   0.00005
5000             3357    3227      130         96.15%       0.00643   0.00004

Unrestricted Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    3049      308         90.85%       0.00684   0.00005
2000             3357    3061      296         91.20%       0.00538   0.00003
3000             3357    3066      291         91.35%       0.00499   0.00002
4000             3357    3069      288         91.44%       0.00543   0.00003
5000             3357    3071      286         91.50%       0.00513   0.00003

Granularity 2: Predicting Predicate w/ Adjunct

3-way Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    2691      666         80.18%       0.00700   0.00005
2000             3357    2697      660         80.36%       0.00914   0.00008
3000             3357    2704      653         80.57%       0.00914   0.00008
4000             3357    2716      641         80.92%       0.00934   0.00009
5000             3357    2725      632         81.19%       0.00705   0.00005

Unrestricted Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    1796      1561        53.51%       0.00486   0.00002
2000             3357    1804      1553        53.75%       0.00483   0.00002
3000             3357    1827      1530        54.43%       0.00574   0.00003
4000             3357    1841      1516        54.85%       0.00783   0.00006
5000             3357    1848      1509        55.06%       0.00720   0.00005

Granularity 3: Predicting Full Sentence

3-way Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    1779      1578        53.00%       0.04476   0.00200
2000             3357    1886      1471        56.19%       0.04404   0.00194
3000             3357    1922      1435        57.27%       0.04078   0.00166
4000             3357    1935      1422        57.65%       0.04352   0.00189
5000             3357    1950      1407        58.10%       0.04353   0.00189

Unrestricted Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    12        3345        0.36%        0.00190   0.00000
2000             3357    14        3343        0.42%        0.00210   0.00000
3000             3357    14        3343        0.42%        0.00210   0.00000
4000             3357    15        3342        0.47%        0.00253   0.00001
5000             3357    14        3343        0.42%        0.00251   0.00001

Table E.5: Accuracy tables for linear-DNN automatic evaluation

E.6  Combined Linear-DNN with Weighted Features

Granularity 1: Predicting Predicate Only

3-way Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    3219      138         95.91%       0.00683   0.00005
2000             3357    3223      134         96.03%       0.00628   0.00004
3000             3357    3226      131         96.12%       0.00771   0.00006
4000             3357    3224      133         96.06%       0.00593   0.00004
5000             3357    3222      135         96.00%       0.00674   0.00005

Unrestricted Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    3050      307         90.88%       0.00693   0.00005
2000             3357    3063      294         91.26%       0.00914   0.00008
3000             3357    3071      286         91.50%       0.00751   0.00006
4000             3357    3074      283         91.59%       0.00688   0.00005
5000             3357    3069      288         91.44%       0.00838   0.00007

Granularity 2: Predicting Predicate w/ Adjunct

3-way Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    2707      650         80.66%       0.00901   0.00008
2000             3357    2716      641         80.92%       0.00844   0.00007
3000             3357    2724      633         81.16%       0.00897   0.00008
4000             3357    2733      624         81.43%       0.00919   0.00008
5000             3357    2734      623         81.46%       0.01028   0.00011

Unrestricted Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    1812      1545        53.99%       0.00753   0.00006
2000             3357    1832      1525        54.58%       0.01039   0.00011
3000             3357    1837      1520        54.73%       0.01181   0.00014
4000             3357    1839      1518        54.79%       0.01024   0.00010
5000             3357    1830      1527        54.52%       0.00980   0.00010

Granularity 3: Predicting Full Sentence

3-way Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    1680      1677        50.05%       0.04326   0.00187
2000             3357    1781      1576        53.06%       0.03441   0.00118
3000             3357    1853      1504        55.21%       0.03388   0.00115
4000             3357    1889      1468        56.28%       0.03365   0.00113
5000             3357    1920      1437        57.21%       0.03123   0.00098

Unrestricted Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    10        3347        0.30%        0.00245   0.00001
2000             3357    12        3345        0.36%        0.00126   0.00000
3000             3357    11        3346        0.33%        0.00170   0.00000
4000             3357    13        3344        0.39%        0.00202   0.00000
5000             3357    15        3342        0.45%        0.00253   0.00001

Table E.6: Accuracy tables for linear-DNN automatic evaluation with weighted features

E.7  Combined Linear-DNN with Weighted Discrete Features

Granularity 1: Predicting Predicate Only

3-way Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    3207      150         95.56%       0.00841   0.00007
2000             3357    3216      141         95.88%       0.00869   0.00008
3000             3357    3221      136         95.97%       0.00733   0.00005
4000             3357    3227      130         96.15%       0.00703   0.00005
5000             3357    3227      130         96.15%       0.00643   0.00004

Unrestricted Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    3049      308         90.85%       0.00684   0.00005
2000             3357    3061      296         91.20%       0.00538   0.00003
3000             3357    3066      291         91.35%       0.00499   0.00002
4000             3357    3069      288         91.44%       0.00543   0.00003
5000             3357    3071      286         91.50%       0.00513   0.00003

Granularity 2: Predicting Predicate w/ Adjunct

3-way Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    2691      666         80.18%       0.00700   0.00005
2000             3357    2697      660         80.36%       0.00914   0.00008
3000             3357    2704      653         80.57%       0.00914   0.00008
4000             3357    2716      641         80.92%       0.00934   0.00009
5000             3357    2725      632         81.19%       0.00705   0.00005

Unrestricted Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    1796      1561        53.51%       0.00486   0.00002
2000             3357    1804      1553        53.75%       0.00483   0.00002
3000             3357    1827      1530        54.43%       0.00574   0.00003
4000             3357    1841      1516        54.85%       0.00783   0.00006
5000             3357    1848      1509        55.06%       0.00720   0.00005

Granularity 3: Predicting Full Sentence

3-way Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    1779      1578        53.00%       0.04476   0.00200
2000             3357    1886      1471        56.19%       0.04404   0.00194
3000             3357    1922      1435        57.27%       0.04078   0.00166
4000             3357    1935      1422        57.65%       0.04352   0.00189
5000             3357    1950      1407        58.10%       0.04353   0.00189

Unrestricted Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    12        3345        0.36%        0.00190   0.00000
2000             3357    14        3343        0.42%        0.00210   0.00000
3000             3357    14        3343        0.42%        0.00210   0.00000
4000             3357    15        3342        0.45%        0.00253   0.00001
5000             3357    14        3343        0.42%        0.00251   0.00001

Table E.7: Accuracy tables for linear-DNN automatic evaluation with weighted discrete features

E.8  Combined Linear-DNN with Feature Weights Only

Granularity 1: Predicting Predicate Only

3-way Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    3237      120         96.45%       0.00919   0.00008
2000             3357    3259      98          97.10%       0.00825   0.00007
3000             3357    3272      85          97.49%       0.00599   0.00004
4000             3357    3313      44          98.71%       0.00469   0.00002
5000             3357    3313      44          98.71%       0.00321   0.00001

Unrestricted Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    3103      254         92.46%       0.01059   0.00011
2000             3357    3171      186         94.48%       0.00626   0.00004
3000             3357    3207      150         95.56%       0.00576   0.00003
4000             3357    3244      113         96.66%       0.00624   0.00004
5000             3357    3247      110         96.75%       0.00522   0.00003

Granularity 2: Predicting Predicate w/ Adjunct

3-way Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    2720      637         81.04%       0.00827   0.00007
2000             3357    2781      576         82.86%       0.00854   0.00007
3000             3357    2796      561         83.31%       0.00857   0.00007
4000             3357    2792      565         83.19%       0.00864   0.00007
5000             3357    2798      559         83.37%       0.00799   0.00006

Unrestricted Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    2409      948         71.78%       0.01145   0.00013
2000             3357    2543      814         75.77%       0.00647   0.00004
3000             3357    2550      807         75.98%       0.00771   0.00006
4000             3357    2552      805         76.04%       0.00735   0.00005
5000             3357    2559      798         76.25%       0.00736   0.00005

Granularity 3: Predicting Full Sentence

3-way Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    755       2602        22.49%       0.01966   0.00039
2000             3357    670       2687        20.23%       0.05766   0.00332
3000             3357    1008      2349        30.03%       0.10858   0.01179
4000             3357    1520      1837        45.29%       0.14882   0.02215
5000             3357    1952      1405        57.89%       0.12705   0.01614

Unrestricted Choice
Training Steps   Total   Correct   Incorrect   µ Accuracy   σ         σ²
1000             3357    6         3351        0.18%        0.00287   0.00001
2000             3357    6         3351        0.18%        0.00287   0.00001
3000             3357    6         3351        0.18%        0.00320   0.00001
4000             3357    9         3348        0.27%        0.00296   0.00001
5000             3357    10        3347        0.30%        0.00280   0.00001

Table E.8: Accuracy tables for linear-DNN automatic evaluation with feature weights alone

Appendix F Publication History

• August, 2014 — *SEM workshop, COLING 2014, Dublin, Ireland (Pustejovsky and Krishnaswamy, 2014)

• May, 2016 — LREC 2016, Portorož, Slovenia (Pustejovsky and Krishnaswamy, 2016a)

• May, 2016 — ISA workshop, LREC 2016, Portorož, Slovenia (Do et al., 2016)

• August, 2016 — Spatial Cognition 2016, Philadelphia, PA, USA (Krishnaswamy and Pustejovsky, 2016a) (short paper)

• September, 2016 — CogSci 2016, Philadelphia, PA, USA (Pustejovsky and Krishnaswamy, 2016b)

• December, 2016 — GramLex workshop, COLING 2016, Osaka, Japan (Pustejovsky et al., 2016)

• December, 2016 — COLING 2016, Osaka, Japan (Krishnaswamy and Pustejovsky, 2016b)

• March, 2017 — AAAI Spring Symposium: Interactive Multisensory Object Perception for Embodied Agents, Stanford, CA, USA (Pustejovsky et al., 2017)

• Forthcoming, 2017 — Spatial Cognition X, Springer LNAI series (Krishnaswamy and Pustejovsky, 2016a) (extended paper)

Bibliography

Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI), Savannah, Georgia, USA, 2016.
Julia Albath, Jennifer L. Leopold, Chaman L. Sabharwal, and Anne M. Maglia. RCC-3D: Qualitative spatial reasoning in 3D. In CAINE, pages 74–79, 2010.
James F. Allen. Maintaining knowledge about temporal intervals. Communications of the ACM, 26(11):832–843, 1983.
Daniel Andor, Chris Alberti, David Weiss, Aliaksei Severyn, Alessandro Presta, Kuzman Ganchev, Slav Petrov, and Michael Collins. Globally normalized transition-based neural networks. arXiv preprint arXiv:1603.06042, 2016.
Carl Bache. Verbal aspect: a general theory and its application to present-day English. Syddansk Universitetsforlag, 1985.
Benjamin K. Bergen. Louder than words: The new science of how the mind makes meaning. Basic Books, 2012.
Mehul Bhatt and Seng Loke. Modelling dynamic spatial systems in the situation calculus. Spatial Cognition and Computation, 2008.
Rama Bindiganavale and Norman I. Badler. Motion abstraction and mapping with spatial constraints. In Modelling and Motion Capture Techniques for Virtual Environments, pages 70–82. Springer, 1998.
Patrick Blackburn and Johan Bos. Computational semantics. THEORIA. An International Journal for Theory, History and Foundations of Science, 18(1), 2008.
Rens Bod. Beyond grammar: An Experience-Based Theory of Language. CSLI Lecture Notes, 88, 1998.
Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326, 2015.
Michael Buhrmester, Tracy Kwang, and Samuel D. Gosling. Amazon's Mechanical Turk: a new source of inexpensive, yet high-quality, data? Perspectives on Psychological Science, 6(1):3–5, 2011.
Rudolf Carnap. Meaning and necessity: a study in semantics and modal logic. University of Chicago Press, 1947.
Gavin C. Cawley and Nicola L. C. Talbot. On over-fitting in model selection and subsequent selection bias in performance evaluation. Journal of Machine Learning Research, 11(Jul):2079–2107, 2010.
Angel Chang, Will Monroe, Manolis Savva, Christopher Potts, and Christopher D. Manning. Text to 3D scene generation with rich lexical grounding. arXiv preprint arXiv:1505.06289, 2015.
Chen Chung Chang and H. Jerome Keisler. Model theory, volume 73. Elsevier, 1973.
Yu-Wei Chao, Zhan Wang, Yugeng He, Jiaxuan Wang, and Jia Deng. HICO: A benchmark for recognizing human-object interactions in images. In Proceedings of the IEEE International Conference on Computer Vision, pages 1017–1025, 2015.
Jinho D. Choi and Andrew McCallum. Transition-based dependency parsing with selectional branching. In ACL (1), pages 1052–1062, 2013.
Bernard Comrie. Aspect: An introduction to the study of verbal aspect and related problems, volume 2. Cambridge University Press, 1976.
Bob Coyne and Richard Sproat. WordsEye: an automatic text-to-scene conversion system. In Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, pages 487–496. ACM, 2001.
Ernest Davis and Gary Marcus. The scope and limits of simulation in automated reasoning. Artificial Intelligence, 233:60–72, 2016.
Sebastian Deterding, Miguel Sicart, Lennart Nacke, Kenton O'Hara, and Dan Dixon. Gamification: using game-design elements in non-gaming contexts. In CHI'11 Extended Abstracts on Human Factors in Computing Systems, pages 2425–2428. ACM, 2011.
Kevin Dill. A game AI approach to autonomous control of virtual characters. In Interservice/Industry Training, Simulation, and Education Conference (I/ITSEC), 2011.
Tuan Do. Event-driven movie annotation using MPII movie dataset. 2016.
Tuan Do, Nikhil Krishnaswamy, and James Pustejovsky. ECAT: Event capture annotation tool. Proceedings of ISA-12: International Workshop on Semantic Annotation, 2016.
Simon Dobnik and Robin Cooper. Spatial descriptions in type theory with records. In Proceedings of IWCS 2013 Workshop on Computational Models of Spatial Language Interpretation and Generation (CoSLI-3). Citeseer, 2013.
Simon Dobnik, Robin Cooper, and Staffan Larsson. Modelling language, action, and perception in type theory with records. In Constraint Solving and Language Processing, pages 70–91. Springer, 2013.
Jacques Durand. On the scope of linguistics: data, intuitions, corpora. Corpus analysis and variation in linguistics, pages 25–52, 2009.
Boi Faltings and Peter Struss. Recent advances in qualitative physics. MIT Press, 1992.
Jerome Feldman. From molecule to metaphor: A neural theory of language. MIT Press, 2006.
Jerome Feldman and Srinivas Narayanan. Embodied meaning in a neural theory of language. Brain and Language, 89(2):385–392, 2004.
George Ferguson, James F. Allen, et al. TRIPS: An integrated intelligent problem-solving assistant. In AAAI/IAAI, pages 567–572, 1998.
Kenneth D. Forbus. Qualitative physics: Past, present and future. Exploring Artificial Intelligence, pages 239–296, 1988.
Kenneth D. Forbus, James V. Mahoney, and Kevin Dill. How qualitative spatial reasoning can improve strategy game AIs. IEEE Intelligent Systems, 17(4):25–30, 2002.
Gottlob Frege. On sense and reference. 1994, Basic Topics in the Philosophy of Language, Prentice-Hall, Englewood Cliffs, NJ, pages 142–160, 1892.
Michael Gelfond and Vladimir Lifschitz. The stable model semantics for logic programming. In ICLP/SLP, volume 88, pages 1070–1080, 1988.
Mark Giambruno. 3D graphics and animation. New Riders Publishing, 2002.
James J. Gibson. The theory of affordances. Perceiving, Acting, and Knowing: Toward an Ecological Psychology, pages 67–82, 1977.
James J. Gibson. The Ecological Approach to Visual Perception: Classic Edition. Psychology Press, 1979.
Peter Michael Goebel and Markus Vincze. A cognitive modeling approach for the semantic aggregation of object prototypes from geometric primitives: toward understanding implicit object topology. In Advanced Concepts for Intelligent Vision Systems, pages 84–96. Springer, 2007.
Will Goldstone. Unity Game Development Essentials. Packt Publishing Ltd, 2009.
Branko Grünbaum. Are your polyhedra the same as my polyhedra? In Discrete and Computational Geometry, pages 461–488. Springer, 2003.
Saurabh Gupta and Jitendra Malik. Visual semantic role labeling. arXiv preprint arXiv:1505.04474, 2015.
Zellig S. Harris. Distributional structure. Word, 10(2-3):146–162, 1954.
Panagiotis G. Ipeirotis, Foster Provost, and Jing Wang. Quality management on Amazon Mechanical Turk. In Proceedings of the ACM SIGKDD Workshop on Human Computation, pages 64–67. ACM, 2010.
Ray Jackendoff. Semantics and Cognition. MIT Press, 1983.
Richard Johansson, Anders Berglund, Magnus Danielsson, and Pierre Nugues. Automatic text-to-scene conversion in the traffic accident domain. In IJCAI, volume 5, pages 1073–1078, 2005.
Mark Johnson. The body in the mind: The bodily basis of meaning, imagination, and reason. University of Chicago Press, 1987.
Leo Joskowicz and Elisha P. Sacks. Computational kinematics. Artificial Intelligence, 51(1-3):381–416, 1991.
Gitit Kehat and James Pustejovsky. Annotation methodologies for vision and language dataset creation. IEEE CVPR Scene Understanding Workshop (SUNw), Las Vegas, 2016.
H. Jerome Keisler. Model theory for infinitary logic. 1971.
Christopher Kennedy and Louise McNally. From event structure to scale structure: Degree modification in deverbal adjectives. In Semantics and Linguistic Theory, volume 9, pages 163–180, 1999.
Kara Kipper, Anna Korhonen, Neville Ryant, and Martha Palmer. Extensive classifications of English verbs. In Proceedings of the 12th EURALEX International Congress, Turin, Italy, 2006.
Saul A. Kripke. Semantical analysis of intuitionistic logic I. Studies in Logic and the Foundations of Mathematics, 40:92–130, 1965.
Nikhil Krishnaswamy and James Pustejovsky. Multimodal semantic simulations of linguistically underspecified motion events. In Spatial Cognition X: International Conference on Spatial Cognition. Springer, 2016a.
Nikhil Krishnaswamy and James Pustejovsky. VoxSim: A visual platform for modeling motion language. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers. ACL, 2016b.
Benjamin Kuipers. Qualitative reasoning: modeling and simulation with incomplete knowledge. MIT Press, 1994.
Yohei Kurata and Max Egenhofer. The 9+ intersection for topological relations between a directed line segment and a region. In B. Gottfried, editor, Workshop on Behaviour and Monitoring Interpretation, pages 62–76, Germany, September 2007.
George Lakoff. Women, fire, and dangerous things: What categories reveal about the mind. Cambridge University Press, 1987.
George Lakoff. The neural theory of metaphor. Available at SSRN 1437794, 2009.
Beth Levin. English Verb Classes and Alternations: A Preliminary Investigation. University of Chicago Press, Chicago, 1993.
Edward Loper and Steven Bird. NLTK: The natural language toolkit. In Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics - Volume 1, pages 63–70. Association for Computational Linguistics, 2002.
Duncan Luce, David Krantz, Patrick Suppes, and Amos Tversky. Foundations of measurement, Vol. III: Representation, axiomatization, and invariance. 1990.
Minhua Ma and Paul McKevitt. Virtual human animation in natural language visualisation. Artificial Intelligence Review, 25(1-2):37–53, 2006.
Nadia Magnenat-Thalmann, Richard Laperrière, and Daniel Thalmann. Joint-dependent local deformations for hand animation and object grasping. In Proceedings of Graphics Interface, 1988.
David Mark and Max Egenhofer. Topology of prototypical spatial relations between lines and regions in English and Spanish. In Proceedings of the Twelfth International Symposium on Computer-Assisted Cartography, volume 4, pages 245–254, 1995.
David McDonald and James Pustejovsky. On the representation of inferences and their lexicalization. In Advances in Cognitive Systems, volume 3, 2014.
Matthew D. McLure, Scott E. Friedman, and Kenneth D. Forbus. Extending analogical generalization with near-misses. In AAAI, pages 565–571, 2015.
Srinivas Sankara Narayanan. KARMA: Knowledge-based active representations for metaphor and aspect. University of California, Berkeley, 1997.
Ralf Naumann. A dynamic approach to aspect: Verbs as programs. University of Düsseldorf, submitted to Journal of Semantics, 1999.
Nick Pelling. The (short) prehistory of gamification. Funding Startups (& other impossibilities), 2011.
James F. Peters. Near sets. Special theory about nearness of objects. Fundamenta Informaticae, 75(1-4):407–433, 2007.
James Pustejovsky. The Generative Lexicon. MIT Press, Cambridge, MA, 1995.
James Pustejovsky. Dynamic event structure and habitat theory. In Proceedings of the 6th International Conference on Generative Approaches to the Lexicon (GL2013), pages 1–10. ACL, 2013.
James Pustejovsky and Nikhil Krishnaswamy. Generating simulations of motion events from verbal descriptions. Lexical and Computational Semantics (*SEM 2014), page 99, 2014.
James Pustejovsky and Nikhil Krishnaswamy. VoxML: A visualization modeling language. In Nicoletta Calzolari (Conference Chair), Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Paris, France, May 2016a. European Language Resources Association (ELRA). ISBN 978-2-9517408-9-1.
James Pustejovsky and Nikhil Krishnaswamy. Visualizing events: Simulating meaning in language. In Proceedings of CogSci, 2016b.
James Pustejovsky and Jessica Moszkowicz. The qualitative spatial dynamics of motion. The Journal of Spatial Cognition and Computation, 2011.
James Pustejovsky, Nikhil Krishnaswamy, Tuan Do, and Gitit Kehat. The development of multimodal lexical resources. GramLex 2016, page 41, 2016.
James Pustejovsky, Nikhil Krishnaswamy, and Tuan Do. Object embodiment in a multimodal simulation. AAAI Spring Symposium: Interactive Multisensory Object Perception for Embodied Agents, 2017.
David Randell, Zhan Cui, and Anthony Cohn. A spatial logic based on regions and connections. In Morgan Kaufmann, editor, Proceedings of the 3rd International Conference on Knowledge Representation and Reasoning, pages 165–176, San Mateo, 1992.
Matteo Ruggero Ronchi and Pietro Perona. Describing common human visual actions in images. arXiv preprint arXiv:1506.02203, 2015.
Eleanor Rosch. Natural categories. Cognitive Psychology, 4(3):328–350, 1973.
Eleanor Rosch. Prototype classification and logical classification: The two systems. New Trends in Conceptual Representation: Challenges to Piaget's Theory, pages 73–86, 1983.
Anna Rumshisky, Nick Botchan, Sophie Kushkuley, and James Pustejovsky. Word sense inventories by non-experts. In LREC, pages 4055–4059, 2012.
Radu Bogdan Rusu, Zoltan Csaba Marton, Nico Blodow, Mihai Dolha, and Michael Beetz. Towards 3D point cloud based object maps for household environments. Robotics and Autonomous Systems, 56(11):927–941, 2008.
Shlomo S. Sawilowsky. You think you've got trivials? Journal of Modern Applied Statistical Methods, 2(1):21, 2003.
Claude Elwood Shannon. A mathematical theory of communication. ACM SIGMOBILE Mobile Computing and Communications Review, 5(1):3–55, 1948.
Valerie J. Shute, Matthew Ventura, and Yoon Jeon Kim. Assessment and learning of qualitative physics in Newton's Playground. The Journal of Educational Research, 106(6):423–430, 2013.
Jeffrey Mark Siskind. Grounding the lexical semantics of verbs in visual perception using force dynamics and event logic. Journal of Artificial Intelligence Research (JAIR), 15:31–90, 2001.
Scott Soames. Rethinking language, mind, and meaning. Princeton University Press, 2015.
Stanley Smith Stevens. On the theory of scales of measurement, 1946.
Leonard Talmy. Lexicalization patterns: semantic structure in lexical forms. In T. Shopen, editor, Language Typology and Semantic Description, Volume 3, pages 36–149. Cambridge University Press, 1985.
Leonard Talmy. Towards a cognitive semantics. MIT Press, 2000.
Alfred Tarski. On the concept of logical consequence. Logic, Semantics, Metamathematics, 2:1–11, 1936.
Sebastian Thrun, Michael Beetz, Maren Bennewitz, Wolfram Burgard, Armin B. Cremers, Frank Dellaert, Dieter Fox, Dirk Haehnel, Chuck Rosenberg, Nicholas Roy, et al. Probabilistic algorithms and the interactive museum tour-guide robot Minerva. The International Journal of Robotics Research, 19(11):972–999, 2000.
Johan van Benthem and Jan Bergstra. Logic of transition systems. Journal of Logic, Language and Information, 3(4):247–283, 1994.
Johan van Benthem, Jan Eijck, and Alla Frolova. Changing preferences. Centrum voor Wiskunde en Informatica, 1993.
Johan van Benthem, Jan van Eijck, and Vera Stebletsova. Modal logic, transition systems and processes. Journal of Logic and Computation, 4(5):811–855, 1994.
Johannes Franciscus Abraham Karel van Benthem. Logic and the flow of information. 1991.
Zeno Vendler. Verbs and times. The Philosophical Review, pages 143–160, 1957.
Luis Von Ahn and Laura Dabbish. Labeling images with a computer game. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 319–326. ACM, 2004.
Terry Winograd. Procedures as a representation for data in a computer program for understanding natural language. Technical report, DTIC Document, 1971.