Livingstone, D. and McGlinchey, S.J. (2004), "What Believability Testing Can Tell Us", presented at CGAIDE 2004, Reading, Nov. 8th-10th, 2004. ISBN: 0-9549016-0-6

WHAT BELIEVABILITY TESTING CAN TELL US

Stephen J. McGlinchey and Daniel Livingstone
School of Computing, University of Paisley
High Street, Paisley PA1 2BE
{stephen.mcglinchey,daniel.livingstone}@paisley.ac.uk

KEYWORDS

Believability, AI Evaluation

ABSTRACT

A number of recent works on AI for computer games have promoted the idea of 'believable' AI as being desirable: that is, AI that fools observers or players into thinking that a human, not a machine, is playing the game or controlling the characters that a player interacts with. We use the term believability testing for the process of testing whether game players and/or observers believe that they are interacting with (or watching interactions with) humans rather than machines. We review this idea of believability and show that, despite some flaws, believability testing can be useful in evaluating and improving AI for computer games.

INTRODUCTION

In general, game AI is less concerned than traditional academic AI with recreating intelligence, and more with creating the appearance of intelligence (see, e.g., Rabin 2002). As such, when creating machine opponents for computer and video games, what matters is not simply their ability to challenge players, but their ability to leave players with the impression that they have been competing against an opponent that thinks and acts as another human opponent would – in its use of strategy and tactics, in its reactions and actions. In this regard, game AI is much like a limited version of a Turing Test, in that success is determined by the ability to deceive people into thinking that the machine is human. Or at least it is a limited version of the conventional conception of the Turing Test, if not of the Imitation Game as specifically posited by Turing (Hayes and Ford 1995; Harnad 2000; Turing 1950).

If this is the case, then game AI is perhaps best tested by having players and observers somehow rate the AI in terms of humanness, or attempt to distinguish AI and human players. We present a brief review of past work on testing and evaluating game AI using these approaches, and, by conducting our own tests on a system developed to automatically create human-like AI opponents (McGlinchey 2003), we report on the value of such approaches. In doing so we consider how such evaluation might best be conducted, how the results should be interpreted, and what potential the extended use of believability testing has for improving the field of game AI. In this work we concentrate on the use of AI for controlling opponents which might conceivably be under human control, and deliberately leave aside issues relating to non-player character AI and believability in games – although here too the ability to evaluate believability is important (Mac Namee 2004, p123).

EVALUATING BELIEVABILITY

The process of judging how believable the AI opponents and characters in computer games are is by nature a subjective process, depending on what observers think of the behaviours they perceive. With no objective measure to rely on, believability can only be determined by asking observers whether or not they think a character or opponent is under human or machine control – for example by using questionnaires. Such believability testing was carried out on the Soar Quakebot, an AI player for Quake II (Laird and van Lent 1999). Laird and Duchi (2000) recorded games in which a human played against either one of several other human players or against the Soar Quakebot in one of several parameter configurations. Judges were shown videos of the games, and asked to rank the opponent player in each game for skill and degree of "humanness".

Although the amount of testing conducted was small, some interesting results were apparent. For example, bots with human-like decision times were rated as more human than bots with significantly faster or slower decision times, and bots with more tactical reasoning were rated as being more human-like than less tactical ones. Super-human performance was a give-away not only in decision times, but in aiming – some bots were too good at aiming to be considered human by the judges. From this preliminary work, three guidelines for action game character AI emerge (a sketch applying them is given after the list):

• Give the AI a human-like reaction and decision time (one twentieth to one tenth of a second).

• Avoid giving the AI superhuman abilities (e.g. overly precise aiming).

• Implement some tactical or strategic reasoning, so that the AI is not a purely reactive agent.
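Purely as an illustration of how these guidelines might be applied, the following Python sketch (our own construction, not drawn from the Soar Quakebot or any other system discussed here) buffers perceptions to impose a human-like reaction delay, perturbs the agent's aim, and includes one minimal tactical rule. All class, method and parameter names are hypothetical.

import random
from collections import deque

class BelievableShooterAgent:
    """Toy agent sketch illustrating the three guidelines above (hypothetical)."""

    def __init__(self, reaction_time=0.075, aim_error_deg=3.0):
        # Guideline 1: human-like decision delay (roughly 1/20 to 1/10 of a second).
        self.reaction_time = reaction_time
        # Guideline 2: deliberately imperfect aim instead of pixel-perfect shots.
        self.aim_error_deg = aim_error_deg
        self._percept_queue = deque()   # (timestamp, percept) pairs awaiting "perception"
        self.health = 100

    def observe(self, now, percept):
        """Buffer raw game-state observations; the agent only reacts to them later."""
        self._percept_queue.append((now, percept))

    def act(self, now):
        """Return an action for this frame, or None while still 'reacting'."""
        # Only consume percepts that are at least reaction_time old.
        percept = None
        while self._percept_queue and now - self._percept_queue[0][0] >= self.reaction_time:
            _, percept = self._percept_queue.popleft()
        if percept is None:
            return None

        # Guideline 3: a (very) small amount of tactical reasoning rather than
        # purely reactive behaviour: retreat when badly hurt, otherwise engage.
        if self.health < 30:
            return ("retreat_to_cover", None)

        # Guideline 2 again: perturb the ideal aim angle by a small random error.
        ideal_angle = percept["angle_to_enemy"]
        noisy_angle = ideal_angle + random.gauss(0.0, self.aim_error_deg)
        return ("shoot", noisy_angle)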

While questionnaires and large groups of test subjects can limit unwanted effects due to observer subjectivity, Mac Namee (2004) notes that careful questionnaire construction can help further reduce the influence of subjective judgements. Rather than ask how believable an AI is, Mac Namee evaluates different AIs by presenting observers with two at a time and asking them to decide which is more believable. Additional questions ask judges to comment on any differences they notice between the two versions of an AI implementation.

Mac Namee shows that believability is not only a subjective measure, but one that may be subject to cultural effects. One test in particular, in which subjects are shown two versions of a virtual bar populated by AI agents, demonstrates this. In one version agent behaviour is modelled according to rules in which agents attempt to satisfy competing goals – buy beer from the bar, drink it while sitting with friends, and go to the toilet as required. In the other version, new short-term goals were picked randomly every time agents completed an action. The one Italian test subject felt that having agents return to sit at the same table time after time was unrealistic, whereas the other (Irish) subjects mostly felt that this behaviour was more believable. While further tests to determine whether this, or other, results are indeed culturally significant differences have yet to be carried out, the possibility of such differences existing does appear quite real.

An additional, important difference noted by Mac Namee is that game-playing novices may find it difficult to notice any differences between different AI implementations. The experience of playing a game is so new to them that they may fail to notice significant differences between two versions of an AI – even where these differences are readily apparent to individuals with greater game-playing experience.

Related work also exists in software agent research, where conversational agents are used to front a range of applications and it is important to evaluate how users react to such agents. For example, Bickmore and Cassell (2001) look at how users interact with a conversational estate agent. Again, questionnaires are the prime means of gathering data on user reactions and perceptions, and statistical measures are presented to demonstrate how different types of users vary in their response to different versions of the software agent.

Commercially, player perception of AI agents in games was tested during the development of Halo (Butcher and Griesemer 2002). Amongst the findings, it was noted that AI behaviours needed highly exaggerated animations and visible effects in order to be noticeable to players. The counter-intuitive implication is that unrealistically over-emotive reactions and actions may appear to players to be more realistic than more subtle reactions. This might not be the case when observers rather than players rate the game, as observers may have more time to study the agents than players involved in the game would (Laird and Duchi 2000), but the aim of a game is to satisfy the player of the game – not an observer.

From this work, it appears that conducting believability testing can be a useful exercise – one from which general design guidelines, or very specific changes to improve the believability of the AI in a specific game, may be derived. In the next section we report on our own believability tests, and demonstrate how, in passing a believability test, we might yet deem that an AI has failed to reproduce human behaviour.
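Returning briefly to the pairwise-comparison methodology described at the start of this section, the following minimal Python sketch shows one way such responses might be tallied; the data structure and the version labels are hypothetical and not taken from Mac Namee's study.

from collections import Counter

def tally_pairwise(responses):
    """Report how often each AI version was judged the more believable of a pair.

    Each response is (version_a, version_b, preferred); the labels used in the
    example below ("goal_driven", "random_goals") are placeholders only.
    """
    wins, appearances = Counter(), Counter()
    for version_a, version_b, preferred in responses:
        wins[preferred] += 1
        appearances[version_a] += 1
        appearances[version_b] += 1
    # Fraction of comparisons each version "won".
    return {version: wins[version] / appearances[version] for version in appearances}

responses = [
    ("goal_driven", "random_goals", "goal_driven"),
    ("goal_driven", "random_goals", "goal_driven"),
    ("goal_driven", "random_goals", "random_goals"),
]
print(tally_pairwise(responses))
# {'goal_driven': 0.666..., 'random_goals': 0.333...}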

TESTING AN AI TRAINED ON PLAYER DATA

In (McGlinchey 2003) player data from games of Pong was captured and subsequently used to train Artificial Neural Network (ANN) Pong AI players. In this implementation of "Pong" (Figure 1), when the ball collides with a bat, the ball's velocity vector is reflected and then rotated, allowing the player to exert some control over the direction of rebound from their bat.
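The text does not specify how the rotation applied at impact is determined. As one possible reading, the Python sketch below assumes the deflection angle depends on the bat's vertical velocity at the moment of the hit, which would give the player the kind of control over the rebound described above; the function name and the control_gain parameter are ours.

import math

def rebound(ball_vx, ball_vy, bat_vy, control_gain=0.05):
    """Reflect the ball's velocity off a vertical bat, then rotate it slightly.

    The rotation term is an assumption for illustration: here the deflection
    angle grows with the bat's vertical speed at the moment of impact.
    """
    # Reflect the horizontal component (bounce off the bat face).
    vx, vy = -ball_vx, ball_vy
    # Rotate the reflected velocity vector by a small, bat-dependent angle.
    angle = control_gain * bat_vy
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    return (vx * cos_a - vy * sin_a, vx * sin_a + vy * cos_a)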

Figure 1: The game of "Pong"

It was observed that not only was the AI able to play a reasonably good game of Pong, but that it actually imitated elements of the play style of the players it had been trained on. For example, one player in particular would 'flick' the bat upon hitting the ball (as if to put spin on the ball, although this is not actually possible in this implementation of Pong), and when trained on data gathered from this player the AI would do likewise.

Although the AI imitated the distinct play style of whatever player it had been trained on, would it be able to fool judges in believability tests? To determine this, we needed to conduct our own believability tests. A number of games were recorded and played back to test subjects, who were asked to judge whether the controllers of each bat were human or machine. For each game demonstrated, judges were able to state whether they thought the left bat, right bat, both or neither were controlled by a human. An additional question asked judges to explain how they thought they were able to tell the difference between computer and human players.

The games shown to the judges consisted of a mix of these possible alternatives. The players consisted of three humans, three AIs trained on player data and an AI player programmed using simple logic. In total, eight games were shown to each subject: one with two human players, two with two AI players and the remaining games each with one human and one AI player. The programmed AI was only used in one of the games, and was played against a human player.

The programmed AI works as follows: whenever the opposing player hits a shot towards the AI player, the AI projects the point where the ball will intersect the bat's axis. After a short delay to simulate a human reaction time, the bat is moved gradually towards the point of intersection. In each iteration of the game's main loop, the bat's new position, p′, is given by

p′ = w·p + (1 − w)·d

where p is the bat's current position, d is the projected point of impact, and w is a weight parameter that affects the speed at which the bat is moved towards its target position.
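A minimal Python sketch of this update rule follows, using the value of w and the fallibility noise reported in the next paragraph. The function names are ours, the projection of the intercept point is assumed to have been computed elsewhere, and whether the random offset is applied once per shot or once per frame is our assumption (here it is applied once, when the intercept is projected).

import random

def perturbed_target(d, noise=20):
    """Jitter the projected intercept d by a uniform offset to make the AI
    fallible (the text reports +/-20 pixels against a 64-pixel-tall bat)."""
    return d + random.uniform(-noise, noise)

def update_bat(p, target, w=0.97):
    """Per-iteration bat movement: p' = w*p + (1 - w)*target, with w = 0.97."""
    return w * p + (1.0 - w) * target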

In our experiments, we used a value of 0.97 for w. To make the AI player fallible, the target position for the bat was moved by a random amount between -20 and +20 pixels, drawn from a uniform distribution (the bat's height was 64 pixels).

A Self-Organising Map (SOM) trained on player data was used to represent each of the other AI players. To find a target position to move the bat to, the ball's speed and position are fed into the network, and the first and second placed winning SOM nodes are selected. Both winners suggest a position for the AI player's bat to be moved to, and an interpolation of these values is used. In order to help give the bat a realistically smooth style of motion, the bat position is updated each frame, moving it towards the SOM's targeted position. The winner search may be performed in every iteration of the main loop, although this can be wasteful of processing resources, and it is acceptable to perform the winner search less often, e.g. every 300 milliseconds, with little or no noticeable effect on the game.

Once these adjustments were made, the SOMs, the programmed AI and the human players were recorded to build up the set of eight game recordings noted above. With this in place, a number of subjects were shown the recorded games and their completed questionnaires were reviewed.
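A rough Python sketch of how such a SOM-driven bat controller might be organised is given below. The node layout, the inverse-distance weighting of the two winners, and the per-frame smoothing value are assumptions made for illustration, not details taken from the original implementation.

import numpy as np

class SOMBatController:
    """Sketch of a bat controller driven by a trained Self-Organising Map.

    Each SOM node stores a prototype of the input (ball position and velocity)
    alongside a suggested bat position learned from recorded player data.
    """

    def __init__(self, prototypes, bat_positions, smoothing=0.9, search_interval=0.3):
        self.prototypes = np.asarray(prototypes, dtype=float)      # shape (n_nodes, 4)
        self.bat_positions = np.asarray(bat_positions, dtype=float)  # shape (n_nodes,)
        self.smoothing = smoothing            # per-frame smoothing weight (assumed value)
        self.search_interval = search_interval  # re-run winner search roughly every 300 ms
        self._target = 0.0
        self._last_search = float("-inf")

    def _winner_search(self, ball_state):
        """Select the first and second placed winning nodes and interpolate
        their suggested bat positions (inverse-distance weighting assumed)."""
        dists = np.linalg.norm(self.prototypes - ball_state, axis=1)
        first, second = np.argsort(dists)[:2]
        w1 = 1.0 / (dists[first] + 1e-9)
        w2 = 1.0 / (dists[second] + 1e-9)
        return (w1 * self.bat_positions[first] + w2 * self.bat_positions[second]) / (w1 + w2)

    def update(self, now, ball_state, bat_y):
        """Per-frame update: occasionally refresh the target position, then move
        the bat smoothly towards it rather than jumping straight there."""
        if now - self._last_search >= self.search_interval:
            self._target = self._winner_search(np.asarray(ball_state, dtype=float))
            self._last_search = now
        return self.smoothing * bat_y + (1.0 - self.smoothing) * self._target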

WHAT DOES TESTING THE HUMAN-PLAYER TRAINED AI TELL US?

The compiled results from initial tests seem quite promising. Overall, the rate at which an AI was correctly identified as such appears to be roughly at the level of chance. Slightly more responses indicate that judges mistook AI players for humans than vice-versa, and the AIs were mistaken for human players more often than they were correctly identified.

Looking at individual returns gives a slightly different picture. While one respondent correctly judged 14 out of 16 players, another misclassified 14 out of 16. Many of the subjects varied from chance significantly – either getting most responses correct, or most responses incorrect. This indicates that some of the observers were in fact able to distinguish between human and machine controllers – even though they made incorrect inferences about which was which.

By providing an additional free-text question that asked respondents to explain how they were able to distinguish the two types of player, we were able to check responses to see what aspects of the AI behaviour might have led to this result. The answers showed that some of the judges – in particular the judges who got most identifications correct and the judges who got most wrong – noticed that some bats moved with more jerky and sudden movements than others. In most cases, these were the AI bats – although some observers thought this behaviour was a marker of human control.

Believability testing has allowed us to test the current implementation of the player-trained SOM AI, and as changes are made it will provide a means for testing and comparing successive versions of the AI. Aside from telling us what must be done to improve our current AI (additional smoothing and/or adding momentum to the AI bat being the most immediate changes), these results have also demonstrated the problem inherent in having observers score game AI in terms of its humanness – what appears human-like to one judge might not to another.
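To make the point about per-judge deviation from chance concrete, the following Python sketch (ours, not part of the original analysis) summarises a single judge's 16 identifications with an exact two-sided binomial test against coin-flip guessing. The example counts mirror the two respondents mentioned above; treating each identification as an independent 50/50 guess is an assumption.

from math import comb

def two_sided_binomial_p(k, n, p=0.5):
    """Exact two-sided binomial test p-value against chance-level guessing."""
    probs = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    observed = probs[k]
    return sum(pr for pr in probs if pr <= observed + 1e-12)

def judge_summary(correct, total=16):
    """Summarise one judge: accuracy and deviation from chance.

    A judge who gets 14/16 *wrong* is as informative as one who gets 14/16
    right: both can evidently tell the two kinds of player apart, they simply
    disagree about which behaviour looks human.
    """
    return {
        "accuracy": correct / total,
        "p_vs_chance": two_sided_binomial_p(correct, total),
    }

# Illustrative values only, echoing the two respondents described in the text:
print(judge_summary(14))  # mostly correct
print(judge_summary(2))   # mostly wrong, but still far from chance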

BELIEVABILITY OF AI IN COMMERCIAL REAL-TIME STRATEGY GAMES

We have also embarked on a study of believability in commercial Real-Time Strategy (RTS) games. The slower pace and extended duration of RTS games mean that players have a good amount of time in which to observe the behaviour of their opponents. So rather than capture games and show them to observers, we are asking players to select a single game to play against an AI player in their own time. The players are then asked to complete a questionnaire on the behaviour of their AI opponent. Here, the respondents clearly know that they are facing machine opposition, but believability – the extent to which the AI acts as a human opponent might – remains both important to the player and testable.

Wetzel (2004) presents a review of persistent AI failures across genres of computer games, and suggests that the first step in fixing the problems is to find out what they are. By conducting believability testing, developers can find failures before their customers do, potentially fixing them so that their customers never find the mistakes. Although our RTS survey is in its very early stages, the returns do indicate that players are aware of failings that exist in some of the available games. We suspect not only that weaknesses exist in all games, but that classes of AI failure are common to large ranges of RTS games. As indicated by our preliminary findings, these are likely to allow players to adapt to and exploit the strategies employed by their AI opponents in a way that is not possible against other human opponents.

CONCLUSIONS

Believability testing has demonstrated its worth as a means of evaluating game AI. As well as indicating when a game AI succeeds in recreating human-like behaviour, testing can highlight why a game AI fails to do so – and this can allow developers to focus their efforts on the areas where their AI systems perform weakest.

Some care must be taken in conducting such tests, however. Where results from a large set of tests might indicate that an AI performs admirably in fooling observers, close inspection of the results can show otherwise – as has been the case with the tests conducted on our player-trained Pong-playing SOMs. 'Humanness' is clearly a highly subjective measure. In computer games, where only a tiny aspect of human behaviour is actually observable, and where different observers can have quite different expectations, attempting to measure humanness may be fatally flawed.

Finally, a crucial question about believability remains: for any given computer game AI, what is it supposed to believably be? Should an AI act believably like another human playing the game, or like a character in the game setting conceivably would act? Whether an AI soldier in some historical war setting should act like a modern player of the game, or like its historical model, will lead to quite different expectations about its behaviour. Accordingly, we have not presented criteria by which believability can be measured. In conducting believability tests it is most important to be clear about what we want believability to mean – and that remains a game designer's choice.

ACKNOWLEDGEMENTS

Daniel Livingstone would like to thank The Carnegie Trust for the Universities of Scotland for supporting his current research.

REFERENCES

Bickmore, Timothy, and Justine Cassell. 2001. Relational Agents: A Model and Implementation of Building User Trust. Paper read at SIGCHI Conference on Human Factors in Computing Systems, Seattle, WA.

Butcher, Chris, and Jaime Griesemer. 2002. The Illusion of Intelligence: The Integration of AI and Level Design in Halo. Paper read at Game Developers Conference, 21st-23rd March, San Jose, CA.

Harnad, S. 2000. Minds, Machines and Turing. Journal of Logic, Language and Information 9:425-445.

Hayes, P.J., and K.M. Ford. 1995. Turing Test Considered Harmful. Paper read at Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI-95), Montreal.

Laird, John E., and John C. Duchi. 2000. Creating Human-like Synthetic Characters with Multiple Skill Levels: A Case Study using the Soar Quakebot. Paper read at AAAI 2000 Fall Symposium: Simulating Human Agents, November 3rd-5th, North Falmouth, MA.

Laird, John E., and Michael van Lent. 1999. Developing an Artificial Intelligence Engine. Paper read at Proceedings of the Game Developers Conference, San Jose, CA.

Mac Namee, Brian. 2004. Proactive Persistent Agents: Using Situational Intelligence to Create Support Characters in Character-Centric Computer Games. PhD Thesis, Department of Computer Science, University of Dublin, Dublin.

McGlinchey, Stephen. 2003. Learning of AI Players from Game Observation Data. Paper read at Game-On 2003, 4th International Conference on Intelligent Games and Simulation, London.

Rabin, Steve. 2002. AI Game Programming Wisdom. Hingham, MA: Charles River Media, Inc.

Turing, A.M. 1950. Computing Machinery and Intelligence. Mind LIX (236):433-460.

Wetzel, Baylor. 2004. Step One: Document the Problem. Paper read at AAAI Workshop on Challenges in Game AI, July 25th-26th, 2004, San Jose, CA.