
Bootstrapping from Game Tree Search

Joel Veness†∗, David Silver‡, Will Uther∗†, Alan Blair†∗

University of New South Wales†, NICTA∗, University of Alberta‡

December 9, 2009


Presentation Overview

A new algorithm will be presented for learning heuristic evaluation functions for game tree search via self-play. Topics covered include:
- Why self-play is important
- The search bootstrapping approach
- Relationship to TD(λ) and TD-Leaf(λ)
- Empirical evaluation on Chess


Game Tree Search
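
As background for the slides that follow, here is a minimal depth-limited minimax sketch in Python; this is an illustration rather than code from the talk, and `evaluate`, `children`, and `is_terminal` are hypothetical helpers:

```python
def minimax(state, depth, weights, evaluate, children, is_terminal,
            maximising=True):
    """Depth-limited minimax: the heuristic H(s; w) scores the frontier."""
    if depth == 0 or is_terminal(state):
        return evaluate(state, weights)
    values = [minimax(child, depth - 1, weights, evaluate, children,
                      is_terminal, not maximising)
              for child in children(state)]
    return max(values) if maximising else min(values)
```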


Heuristic Evaluation Function

A heuristic evaluation function is a mapping $H(s) : \text{State} \to \mathbb{R}$. For this work we use:
- Parameterised representation: $H(s; \vec{w})$, $\vec{w} \in \mathbb{R}^n$
- State-dependent feature vector: $\vec{\phi}(s) : \text{State} \to \mathbb{R}^n$
- Linear combination of features: $H(s; \vec{w}) := \vec{\phi}(s) \cdot \vec{w}$

Problem: how to find a good $\vec{w}$?
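
A minimal sketch of such a linear evaluation in Python; `extract_features` is a hypothetical stand-in for a real feature extractor:

```python
import numpy as np

def evaluate(state, weights, extract_features):
    """Linear heuristic evaluation: H(s; w) = phi(s) . w."""
    phi = extract_features(state)   # feature vector phi(s) in R^n
    return float(np.dot(phi, weights))
```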


Constructing Evaluation Functions

Some alternative methods to find weights:
- Hand-tune (guess and test)
- Supervised learning / learn from expert play
- Self-play

Self-play has a number of potential benefits:
- No need for scored training examples
- Reduced knowledge engineering effort

but it can be hard to achieve in practice.


Updating Evaluation Functions

We will frequently be talking about updating $H(s; \vec{w})$ towards some target value $T \in \mathbb{R}$. The methods we consider are all:
- Online
- Based on stochastic gradient descent on either the squared error $\frac{1}{2}(T - H(s; \vec{w}))^2$ or the sum of squared errors $\frac{1}{2}\sum_s (T_s - H(s; \vec{w}))^2$
- Invoked after a real action (move) is taken

For this talk, we are more interested in exactly how we choose the training target(s).
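
As a concrete illustration, one such gradient step for the linear case looks as follows. This is a minimal sketch, not code from the talk; `eta` is an assumed step-size parameter:

```python
import numpy as np

def sgd_step(weights, phi, target, eta=1e-3):
    """One stochastic gradient descent step on 0.5 * (T - H(s; w))^2.

    For linear H(s; w) = phi . w, the gradient with respect to w is
    -(T - H) * phi, so we step a small amount in the opposite direction.
    """
    delta = target - float(np.dot(phi, weights))   # T - H(s; w)
    return weights + eta * delta * phi
```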


Self-Play with TD Learning

- Famously applied to Backgammon (TD-Gammon) by Tesauro
- Simple greedy action selection sufficed during training
- ... unfortunately, there are difficulties in highly tactical domains (e.g. Chess)

[Figure: TD updates each state in the game trajectory s1 → s2 → s3 → s4 (times t through t+3) towards the value of its successor state.]
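
A minimal sketch of the TD(0) variant of this idea under self-play; `features` and `evaluate` are the same hypothetical helpers as in the earlier sketches:

```python
import numpy as np

def td0_update(weights, s_t, s_next, features, evaluate, eta=1e-3):
    """TD(0) under self-play: move H(s_t; w) towards H(s_{t+1}; w).

    The target is the raw evaluation of the successor position;
    no search values are involved.
    """
    phi = features(s_t)                   # phi(s_t)
    target = evaluate(s_next, weights)    # H(s_{t+1}; w)
    delta = target - float(np.dot(phi, weights))
    return weights + eta * delta * phi
```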


TD-Leaf Learning

Introduced by Baxter et al., TD-Leaf combines game tree search and TD learning. Some well-known applications: Chess (KnightCap) and Checkers (Chinook).

[Figure: at each time step t through t+3, a game tree search is performed from states s1 ... s4; the leaf of the principal variation is updated towards the value found by the subsequent search.]
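
A rough Python sketch of a single TD-Leaf-style step (ignoring the eligibility traces of the full TD-Leaf(λ) algorithm); the hypothetical `minimax_search` returns both the root search value and the leaf of the principal variation:

```python
import numpy as np

def td_leaf_update(weights, s_t, s_next, minimax_search, features, eta=1e-3):
    """TD-Leaf-style step: the leaf of the current principal variation
    is updated towards the search value obtained at the next time step."""
    _, leaf_t = minimax_search(s_t, weights)         # principal leaf at time t
    value_next, _ = minimax_search(s_next, weights)  # search value at t+1
    phi = features(leaf_t)
    delta = value_next - float(np.dot(phi, weights))
    return weights + eta * delta * phi
```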


Limitations of TD-Leaf

Although undoubtedly an improvement over TD for certain types of games, a number of issues remain:
- Difficult to achieve strong results from randomly assigned weights alone
- Expert play in chess emerged only after material weights were initialised to expert values and likely opponent blunders were excluded
- KnightCap required a carefully controlled learning regime to learn. Is the deterministic case harder than the stochastic case?
- Higher computational overhead compared to TD(λ)


Our work in context...

Program      Game        Weights   Self-play   Performance
TD-Gammon    Backgammon  Random    Yes         World Class
Chinook      Checkers    Mixed     Yes         World Class
KnightCap    Chess       Mixed     No          Expert/Master
Meep (Us)    Chess       Random    Yes         Expert/Master

Notes:
- Results of KnightCap, starting from random weights and trained via self-play, were disappointing.
- The values of a checker and a king were fixed in Chinook.


An obvious, but important point...

The distribution over positions seen in search ≠ the distribution over positions seen over the board. E.g., contrast:

[Figure: contrasting example positions.]


TreeStrap: An Alternative Backup Scheme

Consider the following modified backup policy:

[Figure: within the search at each time step (t, t+1, ...), every node of the tree is updated towards the deeper minimax value computed beneath it.]
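
A minimal Python sketch of this TreeStrap(minimax) backup, assuming a hypothetical `search_tree` iterable that yields each interior node of the just-completed search together with its minimax value:

```python
import numpy as np

def treestrap_minimax_update(weights, search_tree, features, eta=1e-3):
    """TreeStrap(minimax) sketch: every node of the completed search
    tree is updated towards its own deeper minimax value, so a single
    search supplies many training examples."""
    delta_w = np.zeros_like(weights)
    for node, minimax_value in search_tree:
        phi = features(node)
        error = minimax_value - float(np.dot(phi, weights))
        delta_w += eta * error * phi        # accumulate over the whole tree
    return weights + delta_w
```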


TreeStrap Properties

Three main points:
- Backups come from a deeper search at the same time-step, not from subsequent searches.
- A single search provides many updates; potential to learn faster?
- Training examples come from more "representative" positions; potentially more robust?

Implementation:
- Extended to alpha-beta search, using a one-sided loss function (see the sketch after this list)
- High-performance programs all use transposition tables, so bound information is already available
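
Alpha-beta pruning leaves only bounds at most nodes, so the natural loss is one-sided: the heuristic is penalised only when it falls outside a node's proven interval. A sketch under the same assumptions as before, where the hypothetical `bounded_nodes` yields (node, lower, upper) triples, e.g. drawn from the transposition table:

```python
import numpy as np

def treestrap_alphabeta_update(weights, bounded_nodes, features, eta=1e-3):
    """TreeStrap(alpha-beta) sketch with a one-sided loss: a node's
    evaluation is only corrected when it lies outside the (lower,
    upper) bounds proven for it by the alpha-beta search."""
    delta_w = np.zeros_like(weights)
    for node, lower, upper in bounded_nodes:
        phi = features(node)
        h = float(np.dot(phi, weights))
        if h < lower:                        # proven too pessimistic
            delta_w += eta * (lower - h) * phi
        elif h > upper:                      # proven too optimistic
            delta_w += eta * (upper - h) * phi
    return weights + delta_w
```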


Experimental Setup

- Heuristic evaluation consists of a weighted linear combination of 1800 features
- 1m + 1s Fischer time controls used for training and evaluation (∼5 mins)
- Time taken for updates reduced overall thinking time
- Over 25,000 training games played to learn the weights
- Over 16,000 games played (time consuming!) in the local evaluation tournament
- 2,000 games used for online evaluation


Comparison to Existing Methods on Chess

[Figure: "Learning from self-play: Rating versus Number of training games". Elo rating (0-2500) versus number of training games (10^1 to 10^4, log scale) for TreeStrap(alpha-beta), RootStrap(alpha-beta), TreeStrap(minimax), TD-Leaf, and an untrained baseline.]


Performance at the Internet Chess Club

Blitz performance at the Internet Chess Club:

Algorithm       Training Partner   Rating
TreeStrap(αβ)   Self Play          1950-2197
TreeStrap(αβ)   Shredder           2154-2338

- Self-play weights correspond to expert / weak-master level play
- Strong-opponent weights correspond to master level play
- Scored 13.5/15 against International Master opposition online
- Learning by playing a strong opponent helps, but the effect is not as pronounced as it was for TD-Leaf


Highlights

- TreeStrap(·) method introduced, an alternative to TD-based approaches for self-play training in games.

With respect to Chess:
- An order-of-magnitude reduction in training time vs. TD-Leaf
- Simple greedy move selection sufficient for training
- First successful self-play result, starting from entirely random weights!


Questions / Marketing

Thank you for listening. Please visit us at W38 this evening, especially if you are interested in talking about:
- algorithmic details, e.g. TreeStrap(αβ)
- details of the chess-specific features
- how playing strength is measured
- the relationship of TreeStrap to other reinforcement learning techniques
- ways in which this work can be extended
