Google’s DeepMind group updated their game-learning algorithm, now called AlphaZero, and mastered chess.  I’ve watched some of the games, and AlphaZero elegantly destroyed the previous top computer chess program, Stockfish (computers have been better than humans at chess for about a decade now).  Part of what is intriguing about their claim is that the new algorithm learns entirely from self-play with no human data involved, and the learning process is apparently stunningly fast this way.

Something is weird to me about the training time they are reporting, though.  Key things we know about how AlphaZero works:

  1. Deep convolutional networks and reinforcement learning.  This is a classifier-based approach, and it is the one most likely to act like a human player.  One way to think about it: suppose you could take a chess board and classify it as a win for white, a win for black, or a draw (with a perfect classifier algorithm).  Then to move, you simply look at the position after each of your legal moves and pick the one that is the best case for your side (win or draw).  A sketch of this move-by-classification idea appears right after this list.
  2. Based on the paper, the classifier isn’t perfect (yet).  They describe using Monte-Carlo tree search (MCTS), which is how you would use a pretty good classifier.  Standard chess algorithms rely instead on alpha-beta (AB) tree search.  The key difference is that AB search is “broader”: it searches every possible move, response move, next move, and so on, as far as it can.  The challenge for games like chess (and even more so for Go) is that the number of positions to evaluate explodes exponentially.  With AB search, the faster the computer, the deeper you can look and the better the algorithm plays.  Stockfish, the reigning computer champ, was searching 70m positions/sec in the match with AlphaZero.  MCTS lets you search smarter and more selectively, checking only the moves that the current position makes likely to be good (which is what you need the classifier for).  AlphaZero is reported as searching only 80k positions/sec, roughly a thousand times fewer than Stockfish.  A toy version of this selective search follows the first sketch below.
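
Here is a minimal sketch of the move-by-classification idea from point 1, using the python-chess library.  To be clear, `evaluate` is a toy stand-in I made up (raw material balance squashed into a win probability); AlphaZero’s actual classifier is a deep network, and nothing below comes from DeepMind’s code.

```python
import math
import chess  # pip install chess

PIECE_VALUES = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
                chess.ROOK: 5, chess.QUEEN: 9, chess.KING: 0}

def evaluate(board: chess.Board) -> float:
    """Toy stand-in for the classifier: win probability for the side
    to move, here just material balance pushed through a logistic."""
    balance = sum(PIECE_VALUES[p.piece_type] * (1 if p.color == board.turn else -1)
                  for p in board.piece_map().values())
    return 1.0 / (1.0 + math.exp(-0.5 * balance))

def pick_move(board: chess.Board) -> chess.Move:
    """Classify the position after each legal move and pick the move
    that leaves the opponent with the worst prospects.  With a perfect
    classifier, this one-ply search would be all you need."""
    def opponent_winprob(move: chess.Move) -> float:
        board.push(move)
        p = evaluate(board)  # win probability for the side now to move
        board.pop()
        return p
    return min(board.legal_moves, key=opponent_winprob)
```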

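And a compressed sketch of the selective search from point 2: a bare-bones MCTS in the AlphaZero style, where there are no random rollouts and the classifier scores leaf positions directly.  It reuses `evaluate` and the imports from the sketch above; the uniform move priors stand in for a real policy network, and the exploration constant 1.5 is an arbitrary choice of mine, not the paper’s.

```python
class Node:
    """One edge of the search tree, holding the usual MCTS statistics."""
    def __init__(self, prior: float):
        self.prior = prior     # how promising the classifier thinks this move is
        self.visits = 0
        self.value_sum = 0.0   # from the perspective of the player who moved
        self.children = {}     # move -> Node

    def q(self) -> float:
        return self.value_sum / self.visits if self.visits else 0.0

def select_child(node: Node):
    """PUCT-style selection: prefer moves that have scored well, plus an
    exploration bonus for moves with high prior and few visits."""
    def puct(item):
        child = item[1]
        return child.q() + 1.5 * child.prior * math.sqrt(node.visits + 1) / (1 + child.visits)
    return max(node.children.items(), key=puct)

def mcts_move(board: chess.Board, simulations: int = 800) -> chess.Move:
    root = Node(prior=1.0)
    for _ in range(simulations):
        node, path, scratch = root, [], board.copy()
        # 1. Selection: follow PUCT down to an unexpanded leaf.  This is
        #    the "selective" part -- unpromising moves are rarely visited.
        while node.children:
            move, node = select_child(node)
            scratch.push(move)
            path.append(node)
        # 2. Expansion/evaluation: no rollout; just ask the classifier.
        if scratch.is_game_over():
            value = 0.0 if scratch.is_checkmate() else 0.5  # side to move lost, or drew
        else:
            moves = list(scratch.legal_moves)
            for m in moves:  # uniform priors stand in for a policy network
                node.children[m] = Node(prior=1.0 / len(moves))
            value = evaluate(scratch)  # win prob for the side to move at the leaf
        # 3. Backup: credit the result up the path, flipping perspective each ply.
        root.visits += 1
        for i, n in enumerate(reversed(path)):
            n.value_sum += (1 - value) if i % 2 == 0 else value
            n.visits += 1
    # Play the most-visited root move.
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]
```
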
That all makes sense.  In fact, this approach is one we were thinking about in the early 90s when I was in graduate school at CMU, talking about (and playing) chess with Fernand Gobet and Peter Jansen.  Fernand was a former chess professional, retrained as a Ph.D. in Psychology and doing a post-doc with Herb Simon.  Peter was a Computer Science Ph.D. (iirc) working on chess algorithms.  We hypothesized that it might be possible to play chess by using the patterns of pieces to predict the next move.  It was mostly idle speculation, though, since we didn’t have anything like the classifier algorithms used by DeepMind.  Our idea was that expert humans have a classifier, built from experience, that is sufficiently well-trained that they can do a selective search (like MCTS) of only a few hundred positions and manage to play as well as an AB algorithm that looks at billions of positions.  AlphaZero looks like basically this on steroids: a better classifier, more search, and world-champion-level performance.

The weird part to me is how fast it learned without human game data.  When we were thinking about this, we were going to use a big database of grandmaster games as input to the pattern learner (a pattern learner/chunker was Fernand’s project with Herb).  AlphaZero is reported as generating its own database of games to learn from by ‘playing itself’.  In the Methods section, the number of training games is listed at 44m.  That looks way too small to me.  If you are picking moves randomly, there are over a hundred million move sequences after just three moves by each side, and tens of billions after four.  AlphaZero’s training is described as running for up to 700k mini-batches, but even with all 44m games feeding those batches, there is nowhere near enough data to explore even a decent fraction of the first 10 or so moves.
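
The arithmetic behind that worry, as a back-of-envelope check (the ~30 legal moves per position and ~80 plies per game are standard rough figures of mine, not numbers from the paper):

```python
GAMES = 44_000_000   # training games reported in the paper's Methods
AVG_PLIES = 80       # rough average game length in half-moves (assumption)
BRANCHING = 30       # rough legal moves per position (assumption)

print(f"positions visited in self-play: ~{GAMES * AVG_PLIES:.1e}")  # ~3.5e9

for plies in (4, 6, 8, 10):
    print(f"move sequences {plies} plies deep: ~{BRANCHING ** plies:.1e}")
# ~8.1e5, ~7.3e8, ~6.6e11, ~5.9e14 -- random exploration can't even
# cover the first five full moves with only a few billion positions.
```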

Now if I were training a classifier as good as AlphaZero’s, what I would do is train it against Stockfish’s engine (the previous strongest player on the planet) for at least the first few million games, then turn it loose on playing itself to refine the classifier further.  You could still claim that you trained it “without human data,” but you couldn’t claim you trained it ‘tabula rasa’ with nothing but the rules of chess wired in.  Since the paper does make the tabula rasa claim, it doesn’t seem that this is what they did.
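
To be concrete, here is roughly the bootstrap curriculum I have in mind.  This is my speculation, explicitly not anything the paper describes; it uses python-chess’s UCI wrapper to drive a local Stockfish binary, and `pick_move` is the learner stub from the first sketch.

```python
import chess
import chess.engine

def teacher_games(n_games: int, stockfish_path: str = "stockfish"):
    """Phase 1 of the hypothetical curriculum: play the fledgling
    classifier against Stockfish and keep (positions, outcome) pairs
    as training data, before switching to pure self-play in phase 2."""
    engine = chess.engine.SimpleEngine.popen_uci(stockfish_path)
    data = []
    for _ in range(n_games):
        board, fens = chess.Board(), []
        while not board.is_game_over():
            if board.turn == chess.WHITE:
                move = pick_move(board)  # the learner (first sketch)
            else:
                move = engine.play(board, chess.engine.Limit(depth=10)).move
            fens.append(board.fen())
            board.push(move)
        data.append((fens, board.result()))  # "1-0", "0-1", or "1/2-1/2"
    engine.quit()
    return data
```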

Alternatively, their learning algorithm may be much faster early on than I’d expect.  If it quickly pruned away all the non-useful early moves, perhaps it could self-generate good training examples.  I still don’t understand how this would work, though, since in principle you can’t evaluate good/bad moves until the final move that delivers checkmate.  A chess beginner who has learned some ‘standard opening theory’ will make a bunch of good moves, then blunder and lose.  Learning from that game embeds a ‘credit assignment’ problem: identifying the blunder without penalizing the rating of the early, correct moves.  That kind of error is going to happen at very high rates in semi-random games.  Why doesn’t it take billions or trillions of games to solve the credit assignment problem?
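
The problem in miniature: with outcome-based training, every position in a game inherits the single final result.  The labeling function below is my illustration of that, not code from the paper:

```python
def training_targets(fens: list[str], result: str) -> list[tuple[str, float]]:
    """Label every position in a game with the final outcome, scored
    from White's point of view -- the crude z-style target.  A game with
    39 good White moves and one losing blunder labels all 40 positions
    0.0; only averaging over many games can separate the blunder from
    the good moves that preceded it."""
    z = {"1-0": 1.0, "0-1": 0.0, "1/2-1/2": 0.5}[result]
    return [(fen, z) for fen in fens]
```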

Humans learn chess in far fewer than billions of games.  A big part of that comes from directed (or deliberate) practice with a teacher.  The teacher can just be a friend who is a little better at chess; the student’s games are then guided toward the set of moves likely to be good, and our own human machine-learning algorithm (implicit learning) can extract the patterns and build our classifier.  The numbers reported on AlphaZero sound to me like it should have had a teacher.  Or there are some extra details about the learning algorithm I wish I knew.

But what I’d really like is access to the machine-learning algorithm to see how it behaves under constraints.  If our old hypothesis about how humans play chess is correct, we should be able to take the classifier, cap the number of look-ahead evaluations at a few hundred (or thousand), and get something that plays a lot more like a human than Stockfish does.
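
With the toy MCTS sketched earlier, the experiment is a few lines: throttle the simulation budget and see whether play degrades toward something human-like.  DeepMind’s actual network isn’t public, so this only exercises the toy version.

```python
for budget in (100, 400, 800, 10_000):
    move = mcts_move(chess.Board(), simulations=budget)
    print(f"{budget:>6} simulations -> opening move {move}")
```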

Links to the AlphaZero reporting paper:

https://arxiv.org/abs/1712.01815v1

https://arxiv.org/pdf/1712.01815v1.pdf