Nov 12

Lessons Learned on Scientific Field Work in Esports

Scene: Mandalay Bay Esports Stadium, a 12,000-seat arena in Las Vegas, Nevada. Two young men sip water and wipe their sweaty hands as they wait for a cue that it’s time to perform for the filled stadium seats and the two hundred thousand viewers online. Digital versions of themselves pose in sponsored gear on loop on the supersized screens above them. The cue is given – their faces harden. They don padded headsets to drown out the crowd. Their hands become a blur. The next 8 minutes will determine the Evo World Champion in Super Smash Bros Melee and the winner of $8,000 for a weekend of competitive gaming.

To compete at the level of a world champion, these players must perform at the peak of human cognition: based on knowledge they’ve accrued over a decade of play, they make high-level strategic and predictive gameplay choices and execute the corresponding sensorimotor action combinations with frame-by-frame precision. Think ultra-high-speed chess with extreme, precise motor demands and you’ve begun to grasp esports.

As a scientist, I’ve made it my goal to understand the cognitive, physiological, and social components that ultimately determine who gets up on that stage. The psychological study of expertise has made clear that both practice and talent support cognitive and sensorimotor skill acquisition and performance. However, to understand the complete picture, we must also look at emotion regulation skills that allow for performing under pressure and the role of identity and stereotypes in granting access to the game at all. My work aims to take each of these components into account to find out what makes a Top 100 player.

Over the summer, I attended four national tournaments to work with 20 of the Top 100 players in the Super Smash Bros Melee competitive scene. My protocol involved pre-tournament surveys, cortisol sampling (a 45-minute commitment each morning for participants), a wrist-based heart rate monitor to be worn for the whole tournament, a 45-minute interview on skill and social support, and a 20-minute cognitive testing session. My plan is to combine these measures to explain variance in both performance at that tournament and overall Top 100 ranking. Many of these measures had to be taken in the field, a setting often neglected in traditional psychology graduate student training. My pilot testing taught me three important lessons about field work, which I explore below.
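
To make the analysis plan concrete, here is a minimal sketch of one way such measures could be combined in a single regression. The variable names, the fabricated numbers, and the ordinary-least-squares choice are illustrative placeholders, not the actual study variables or analysis plan.

```python
# Illustrative sketch only: combine field measures to predict tournament placement.
# Variable names (cortisol_awakening, mean_hr, reaction_time, placement) are
# hypothetical placeholders, and all numbers are fabricated.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 20  # roughly the number of Top 100 players tested over the summer
df = pd.DataFrame({
    "cortisol_awakening": rng.normal(10, 3, n),  # morning saliva sample
    "mean_hr": rng.normal(85, 10, n),            # wrist monitor, tournament average
    "reaction_time": rng.normal(250, 30, n),     # ms, from the cognitive battery
})
# Fake outcome (tournament placement; lower is better) just so the code runs.
df["placement"] = 30 + 0.05 * df["reaction_time"] + rng.normal(0, 5, n)

# One regression pulling the measures together; real data would also need to
# handle missing cells (skipped samples, lost monitors) and the small sample.
model = smf.ols("placement ~ cortisol_awakening + mean_hr + reaction_time",
                data=df).fit()
print(model.summary())
```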

 

  • No matter how organized you are, things will go wrong and you will be stressed.

 

Be prepared to fail and MacGyver your way through travel and data collection.

For data collection, you can prepare for the worst by having access to everything both digitally and in hard copy and by keeping a checklist of all materials. Important tidbits like laptop chargers and labels tend to take a mental backseat to the primary study materials like questionnaires, but they’re just as important. Compartmentalizing helps too: I had a carry-on dedicated to study materials and organized it so that I could look into it and immediately know what I was missing.

But even when you’ve prepared, some things will be out of your control. Different tournament venues provide different testing environments, so I had variation in screen visibility, ambient noise, and subject comfort for every cognitive test and interview. Despite a lengthy verbal explanation and multiple printouts on how to do saliva sampling to measure daily cortisol rhythms, subjects still took their samples at the wrong time or not at all. A heart rate monitor is still MIA from a subject who forgot to return it. Despite my plans, every single aspect of the study was challenged by the environment and the subjects themselves, factors outside of my control.

Even things within your control will go wrong. The stress of travel and data collection will compromise your executive functioning, especially if you’re working solo to save grant money. I had to retest my very first subject because I forgot to unmute the computer even though audio cues are integral to the reaction time task – not to mention the multiple, small, silly mistakes I made from fatigue by the end of the month-long data collection marathon. Externalizing everything possible will unburden your working memory and, as a result, bolster your mental energy. When things go wrong, a master sheet of notes will help explain funky gaps in the data and inform judgments about which data to discard. Every detail belongs on paper so it won’t weigh you down: who was retested, the order of presentation, the testing environment, who wore what color watch, even which real names match up with gamertags.

In field work, so many factors are outside of your control that you cannot achieve perfection. Prepare the best you can to collect high-quality data, but also be forgiving of yourself for the mistakes and issues that will pop up along the way. My perfectionist tendencies in data collection only compounded my stress. Do maximum prep work so you can fall back on it when things go wrong or you feel overwhelmed.

 

  • Know the unique demands of your population.

 

In an ideal situation, you’ll be doing field research within your own community. Being a community member means you’re familiar with domain-specific skills and can measure them better in your work; you’ll have a network from which to recruit, split rooms, and socialize; and you’ll have a gut sense of ethics as far as what is exploitative versus uplifting to the community. But even if you aren’t a member of the community you’re studying, it’s worth it to do your research on the unique demands within it. Otherwise you risk compromising your science or even failing to collect data at all.

For example, knowing the schedules of your participants (and potential participants) is crucial. Perhaps at a military base or boarding school subjects do not have 45 minutes after awakening to complete a saliva sampling procedure. Similarly, I had many subjects wake up 20 minutes before competing, meaning I couldn’t measure their cortisol awakening response. Schedules are important for recruitment too: I don’t approach people who are competing in 10 minutes, and on the flip side, I would be comfortable recruiting someone knowing they were done for the day and just hanging out. (Also, as a rule of thumb, I never asked people to participate right after they lost.)

I also knew walking in that I would have trouble with subjects showing up. Although I had many people do surveys beforehand and commit to times to meet, actually scheduling someone at a tournament is impossible because their schedule will vary with their performance and a thousand other factors, like who wants to hang out right then. Plus Smashers tend to party at national tournaments, meaning lots of oversleeping, hangovers, and lost materials. Communicating with my subjects via text instead of email was a good way to keep up with ever-changing availability. Furthermore, my regular attendance at national tournaments meant more opportunities to fill in cells for each person rather than absolutely needing to complete everything in one weekend.

Finally, sexual assault is a problem in the Smash scene. As a woman, I was ready to navigate interactions differently with players who have a history of assaulting women – testing in public places rather than hotel rooms, limiting the information they know about me, that sort of thing. The only way I could know whom to be cautious around was by being embedded in the community’s network.

 

  • Your intersecting social identities determine your ability to do research, as well as the content and quality of the data you do collect.

 

Most gaming scenes, especially for competitive and fighting games, are majority male. I do not exaggerate when I say that women make up less than 5% of Smash tournament attendees and about 2% of competitors. My lay theory on that percentage difference is that although many of us wish to be respected and feared top players, competitors and commentators have an especially bright spotlight of the male gaze put upon them, making them the most vulnerable to sexism and abuse – so a lot of us engage with the community by organizing tournaments and players, making art, or creating photo and video content. Whatever role we occupy, it’s safe to say that in gaming spaces, women are The Other, not the standard player, and with that minority status comes stereotypes. Unavoidable stereotypes about my gender changed the way I thought, felt, and worked on a daily basis, energy that could have gone to another hour on the ground at a tournament, another dinner out to network, or another reminder text for cortisol samples.

A huge amount of my energy during my summer tournament run was spent on first-impression management. Every morning was a struggle to handle the girl-gamer stereotypes I might be slotted into, to balance impressions of my competence with my femininity. When I wore a skirt and makeup, I worried that people labeled me an “egirl,” with the accompanying stereotypes of social climbing, promiscuity, and two-facedness. And indeed, in those clothes, I had better success cold-recruiting players and making friends, but I was assumed to have no game competence – no one wanted to talk about the game, no one probed my scientific theories, but they were happy to feel out my social network and ask about my plans for the afterparty. If I presented more masculine, in a suit jacket and loose shirt, many top players hardly gave me a second glance after I introduced myself, but people listened longer when I talked about my theories and directed conversations in more academic directions. (I always relied on the authority of a suit jacket in formal research presentations at events.) In the end, no outfit could reconcile the competing identities of the scientist (dry, intelligent, unsexual, masculine, thus competent) and the gamer girl (comfortable in her body, approachable, and granted access to important figures in the community). My attempts to dress and act in between these two extremes only led to assumed incompetence in both the social and research domains in my first meetings with people.

Even beyond first impressions, I spent so much energy navigating conversations to establish who I am, to research subjects and community members alike. Sometimes it was more useful to talk about my Smash commentary first, to establish my in-group gamer status, and in subsequent conversations bring up my research. Other times, people were clearly engaged in the science but assumed I knew nothing about the game, an assumption that had to be corrected over multiple interactions filled with jokes and references to in-depth game knowledge or my commentary gigs before they would even talk to me in an interview on a natural (rather than dumbed-down) level.

Besides my appearance, I managed my social impression in other ways. Players take a video game sexism survey as part of my initial questionnaires, and I have no illusions that my status as a female researcher and gamer influences their likelihood of reporting their true thoughts. I do what I can to manage that: I recruited online rather than face-to-face as much as possible. And although I have never concealed my gender, the ambiguity of being a woman with a man’s name has afforded me much within the Smash scene.

If it isn’t obvious already, I’ll explicitly state it: every worry I’m expressing is directly linked to the integrity of my research. When I ask players about practice habits, I need them to speak to me using the highly domain-specific knowledge that may set them apart from one another – every Top 100 player is going to say they practice a certain number of hours, but how many use a CPU 20XX Fox set to approach for their chain-grab practice versus studying frame data on a computer before picking up the controller? When I ask them in surveys and in person about the role of women in the Smash community, I want them to be as comfortable as possible telling me the truth rather than sugar-coating their answers because they want to seem progressive to a scientist or because they want to date me. To even access members of a special population like the Top 100 requires networking and trust-building, all dependent on the above impressions. In the end, the identities that subjects assign to me are inextricable from the answers they give me and from my access to their expertise. Thus, it’s not something I can ignore.

In the end, my identities as a woman, scientist, and gamer all intersect to create a unique set of circumstances that, in the field and in the data, affect my research profoundly. I have a unique, diverse, and valuable perspective compared to the average gamer and games researcher, one reason I continue on despite the challenges. Arguably, I have better access to top-level players to study. In other ways, I am permanently an out-group member who may never learn the truth of gender-gaming stereotypes or high-level game-specific knowledge on my own. The lesson here is that every scientist in search of truth, in the lab or otherwise, needs to remember how their identity affects their access to and interpretation of truth in specific ways, every step of the way.

 

Field work requires a balance of priorities: the integrity of the research, reducing noise that comes from uncontrollable circumstances, and your wellbeing as a researcher. Hopefully, these lessons I’ve learned are valuable to other researchers hoping to dive into field work themselves.

 

 

Related readings:

  • Taylor, T. L. (2012). Raising the stakes: E-sports and the professionalization of computer gaming. Cambridge, MA: MIT Press.
  • Taylor, N. (2018). I’d rather be a cyborg than a gamerbro: How masculinity mediates research on digital play. MedieKultur: Journal of Media and Communication Research, 34(64), 21.
  • Consalvo, M. (2012). Confronting toxic gamer culture: A challenge for feminist game studies scholars. Ada: A Journal of Gender, New Media, and Technology, (1).

Oct 11

Implicit/Machine learning gender bias

I recently ran across the headline “Amazon scraps secret AI recruiting tool that showed bias against women” and realized it provides a nice example of a few points we’ve been discussing in the lab.

First, I have found myself describing on a few recent occasions that it is reasonable to think of implicit learning (IL) as the brain’s machine learning (ML) algorithm.  ML is a super-hot topic in AI and data science research, so this might be a useful analogy to help people understand what we mean by studying IL.  We characterize IL as the statistical extraction of patterns in the environment and the shaping of cognitive processing to maximize efficiency and effectiveness for these patterns.  And that’s the form of most of the really cool ML results popping up all over the place: automatic statistical extraction from large datasets that provides better prediction or performance than qualitative analysis had done.

Of course, we don’t know the form of the brain’s version of ML (there are a lot of different computational variations of ML) and we’re certainly bringing less computational power to our cognitive processing problems than a vast array of Google Tensor computing nodes.  But perhaps the analogy helps frame the important questions.

As for Amazon, the gender bias result they obtained is completely unsurprising once you realize how they approached the problem.  They trained an ML algorithm on previously successful job applicants to try to predict which new applicants would be most successful.  However, since the industry and therefore the training data were largely male-dominated, the ML algorithm picked up gender as a predictor of success.  ML algorithms make no effort to distinguish correlation from causality, so they will generally be extremely vulnerable to this kind of bias.
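
As a toy illustration of that failure mode, with entirely fabricated data (nothing here is Amazon’s actual system): a classifier trained on a male-skewed history of “successful” applicants ends up with a large weight on gender, because the historical labels were themselves tilted toward men.

```python
# Toy illustration with fabricated data: past hiring favored men independent of
# skill, so a classifier trained on those labels learns a weight on gender.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 5000
male = rng.binomial(1, 0.8, n)      # historically male-dominated applicant pool
skill = rng.normal(0, 1, n)         # the thing that should actually matter
# Biased historical outcome: being male raised the odds of being "successful".
hired = (skill + 1.5 * male + rng.normal(0, 1, n)) > 1.0

clf = LogisticRegression().fit(np.column_stack([skill, male]), hired)
print("weight on skill: ", clf.coef_[0][0])
print("weight on gender:", clf.coef_[0][1])  # clearly nonzero: the bias is learned
```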

But people are also vulnerable to bias, probably by basically the same mechanism.  If you work in a tech industry that is male dominated, IL will cause you to nonconsciously acquire a tendency to see successful people as more likely to be male.  Then male job applicants will look closer to that category and you’ll end up with an intuitive hunch that men are more likely to be successful — without knowing you are doing it or intending any bias at all against women.

An important consequence of this is that people exhibiting bias are not necessarily intentionally misogynistic (also note that women are vulnerable to the same bias).  Another is that there’s no simple cognitive control mechanism to make this go away.  People rely on their intuition and gut instincts, and you can’t simply tell them not to, because ignoring intuition feels uncomfortable and unfamiliar.  The only obvious solution is a systematic, intentional approach to changing the statistics through things like affirmative action.  A diverse environment will eventually break your IL-learned bias (how long this takes and what might accelerate it is where we should be looking to science), but it will never happen overnight and will be an ongoing process that is highly likely to be uncomfortable at first.

In theory, it should be a lot quicker to fix the ML approach.  You ought to be able to re-run the ML algorithm on a non-biased dataset with equal numbers of successful men and women.  I’m sure the Amazon engineers know that, but the fact that they abandoned the project instead suggests that the dataset must have been really biased initially.  You need a lot of data for ML, and if you restrict the input to twice the number of successful women (to keep the groups balanced), you won’t have enough data if the hiring process was biased in the past (prior bias would also be a likely reason you’d want to tackle the issue with AI in the first place).  They’d need to hire a whole lot more women — both successful and unsuccessful, btw, for the ML to work — and then retrain the algorithm.  But we knew that was the way out of bias before we even had ML algorithms to rediscover it.
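
Continuing the toy example: retraining only removes the learned gender weight if the new outcomes are themselves generated without bias, which is the “hire a lot more women and observe what happens” route. Again, fabricated data for illustration only.

```python
# Toy retraining: a balanced applicant pool whose success depends on skill alone.
# Merely resampling the old biased labels would not have this effect.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 5000
male = rng.binomial(1, 0.5, n)                 # balanced pool
skill = rng.normal(0, 1, n)
hired = (skill + rng.normal(0, 1, n)) > 1.0    # outcome now depends on skill only

clf = LogisticRegression().fit(np.column_stack([skill, male]), hired)
print("weight on gender after retraining:", clf.coef_[0][1])  # close to zero
```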

Original article via Reuters: https://www.reuters.com/article/us-amazon-com-jobs-automation-insight/amazon-scraps-secret-ai-recruiting-tool-that-showed-bias-against-women-idUSKCN1MK08G

Jul 02

Real-World Adventures in Science, part 1: Aconcagua

[In the lab, studies of decision-making are done with highly artificial tasks in tightly controlled situations (is this sine-wave grating an A or a B?).  However, our theory of how multiple memory systems contribute to decision making is supposed to apply to complex, real-world, high-leverage decision making.  Getting data on how well that works means venturing out into messy, uncontrolled contexts and struggling to collect data with definitions of the independent and dependent variables we hope make sense.  Kevin’s story is the first example of what we expect will be a series on adventures, lessons learned and maybe even ideas for formal controlled research inspired by case-study, semi-anecdotal data in the wild. – PJR]

On December 19, 2017, I found myself with a decision to make.  I was standing on the side of Mount Aconcagua in Argentina at an altitude of 19,700 feet, about three thousand feet from the summit of the highest mountain in the Western Hemisphere.  It was a snowy whiteout and the winds were extremely strong—this was Aconcagua’s infamous Viento Blanco (Spanish for “white wind”).  In addition to the inclement weather, I no longer had a functioning camping tent or stove, as they had not survived the strong winds.  I had been forced to rely on the kindness of other climbers headed for the summit to shelter overnight and melt ice for water.  The decision was: attempt the summit in the somewhat dangerous weather conditions or head back down the mountain now?

Planning for this trip had started many months earlier.  Some climbing colleagues and I had even been sleeping in low-oxygen tents for four months as part of training to pre-acclimatize to high altitude conditions.  That wasn’t a fun experience, and the feeling of chronic sleep deprivation was only partly balanced by some solid longitudinal data on changes in blood oxygen levels of the climbing team following overnight hypoxic simulations.  There was also the travel to Chile, the bus to Argentina, and days climbing up the mountain to this point.

Turning around meant not being able to follow through on all the invested effort.  The Air Force had built a social media campaign around the climb and everything.  But attempting a dangerous summit carries substantial risks.  As Paul pointed out later, we learn by trial and error in the lab, but high altitude mountaineering is often not a place where you can learn from a really bad choice – with life-or-death decisions, you might not get another chance.

The theoretical lab model is that decision making is a mix of deliberate processing and intuitions.  We had a climbing plan, we trained for a wide variety of situations, we tried to think everything through in advance.  But then there are intuitions in the moment, where you have a gut sense of a reasonable or an unreasonable amount of danger.  Decisions are supported by a mix of these explicit and implicit processes and we are interested in how the mixing of these processes is affected in real circumstances.

An idea we had was that low oxygen, high altitude climbing might asymmetrically impair explicit decision-making compared to implicit decision-making processes.  As a proof-of-concept, I tried to explore this idea with some concrete data by bringing tools for cognitive assessment on the climb.

An Android tablet was used to administer cognitive tasks to our climbing team at different camps up the mountain. Cognitive data was successfully collected at elevations over 18,000 feet! Working memory, the explicit ability to hold things in mind, was indexed using a digit span task. A simple reaction time test served as a stand-in for the goal of assessing implicit processing. The cognitive data will not be interpreted, as this was not systematic data collection, but the effort highlighted the need to develop an accurate index of the implicit system, which is fundamental for multiple memory systems research to progress in this domain.
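
For readers who have not seen these tasks, here are stripped-down terminal sketches of each. The actual tablet battery was different software; the list lengths, timings, and trial counts below are arbitrary illustration values.

```python
# Minimal sketches of the two measures; parameters are arbitrary, for illustration.
import random
import time

def digit_span(max_len=9):
    """Forward digit span: show a digit list briefly, then ask for it back."""
    span = 0
    for length in range(3, max_len + 1):
        digits = [str(random.randint(0, 9)) for _ in range(length)]
        print("Remember:", " ".join(digits))
        time.sleep(1.0 * length)      # roughly one second per digit
        print("\n" * 40)              # crude way to scroll the digits off screen
        answer = input("Type the digits back (no spaces): ").strip()
        if answer != "".join(digits):
            break
        span = length
    return span

def simple_rt(trials=5):
    """Simple reaction time: wait for a go signal, respond as fast as possible."""
    times = []
    for _ in range(trials):
        time.sleep(random.uniform(1.0, 3.0))   # unpredictable foreperiod
        start = time.perf_counter()
        input("GO! (press Enter)")
        times.append(time.perf_counter() - start)
    return sum(times) / len(times)

if __name__ == "__main__":
    print("Digit span:", digit_span())
    print("Mean reaction time (s):", round(simple_rt(), 3))
```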

Heart rate and GPS data were also recorded throughout the excursion using the Garmin Fenix 3 to complement the cognitive assessment. This wearable platform provided the ability to collect critical physiological and contextual data about the climbing team for improved safety monitoring and novel data analytics. The Android tablet and Garmin Fenix 3 were supplemented with an external USB connectable battery pack and a portable solar charger for multi-day data collection. Hardware testing on the mountain found that battery life was a limiting, yet manageable factor in this endeavor.

This proof-of-concept venture demonstrates the feasibility of collecting real-world cognitive and physiological data during high-altitude mountaineering. Back on the mountain, our climbing team made the decision not to push through the harsh weather for the summit of Aconcagua. Decision making here depended on the interaction of multiple memory systems, and a bad decision could have prevented us from safely returning to basecamp. It is imperative to understand how multiple memory systems interact during decision making in extreme environments, because that interaction could be the difference between life and death.

Jul 02

Expertise in Unusual Domains

It’s tempting to call this kind of thing ‘stupid human tricks’ but it’s really awesome human tricks.  I’m regularly fascinated by people who have pushed themselves to achieve extremely high levels of skill in offbeat areas.  The skill performance here is amazing, clearly the product of thousands of hours of practice.

 

With a lot of the more traditional skill areas where people push the performance boundaries way out, there is an obvious reward for getting very good.  Music, visual arts, sports, even chess, all have big audiences that will reward being among the best in the world.  Yoyo-ing has to be a bit different.  Maybe I just don’t know the size of the audience, but it seems like the kind of thing you get good at if you think juggling is too mainstream.

Pretty amazing to watch, though.

 

 

May 04

The “Dan Plan”

I mentioned the Dan Plan a while ago as a fascinating real-world self-experiment on the acquisition of expertise.  Dan, the eponymous experimenter and experimentee, quit his job to try to spend 10,000 hours playing golf to see if he could meet a standard of ‘internationally competitive,’ defined by winning a PGA Tour card — starting from no prior golf experience at all.

I remembered the project a while back and peeked in on it, and it seemed to be going slowly.  Then I ran across this write-up in the Atlantic summarizing the project and basically writing it off as a failure.  I disagree!

https://www.theatlantic.com/health/archive/2017/08/the-dan-plan/536592/

Data collection in the real world is really hard, and it seems remarkably unfair to Dan to say he “failed” when even this single data point is so interesting.  As a TL;DR, he logged around 6,000 hours of practice but then got sidetracked by injuries, bogged down by other constraints in life (e.g., making a living), and kind of petered out.  To think about what we learned from his experience, let’s review the expertise model implied here.

The idea is that the level of expertise is a function of two components: talent and training (good old nature/nurture), and what we don’t know is the relative impact of each.  The real hypothesis being tested is that maybe ‘talent’ isn’t so important and it’s just a matter of getting the right kind of training, i.e., “deliberate” practice.  As a note, Ericsson is both cited in the Atlantic article and was consulting actively with Dan through his training (Bob Bjork is also interviewed, although the relationship of his research to the project is more tenuous than suggested, and the article does not properly appreciate how good a golfer he is reputed to be).
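
To make the nature/nurture framing concrete, here is a toy saturating learning-curve model, invented purely for illustration (not from the article or from Ericsson): if ‘talent’ can set either the ceiling or the learning rate, two learners can look similar at a 6,000-hour snapshot and still end up in very different places at 10,000 hours, which is part of why a single partial run is hard to interpret.

```python
# Toy model, for illustration only: performance(h) = ceiling * (1 - exp(-h / tau)).
# "Talent" could set the ceiling, the time constant tau, or both.
import math

def performance(hours, ceiling, tau):
    return ceiling * (1 - math.exp(-hours / tau))

# Two made-up learners who look similar after 6,000 hours but diverge by 10,000:
for label, ceiling, tau in [("low ceiling, fast learner ", 75, 2500),
                            ("high ceiling, slow learner", 110, 6500)]:
    print(label,
          round(performance(6000, ceiling, tau), 1),
          round(performance(10000, ceiling, tau), 1))
```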

The real challenges here are embedded in understanding the details of both of these constructs.  For training, I doubt we really know what the optimal ‘deliberate practice’ is for a skill like golf.  Suppose Dan wasn’t getting the right kind of coaching, or, even harder, the right kind of coaching for him (suppose there are different optimal training programs that depend on innate qualities — that makes a right mess out of the simple nature/nurture frame).  One of the reasons we do lab studies with highly simplified tasks is to make the content issue more tractable, and even then, this can be hard.

And the end of the Dan Plan due to chronic injury points out an important part of the ‘talent’ question for skills that require a physical performance.  Talent/genetic factors can show up as peripheral or central (as in the nervous system).  For athletes, it is clear that peripheral differences are really important — size, weight, peculiarities of muscle/ligament structure — these all matter.  Nobody doubts that there are intrinsic, inherited, genetic aspects to those that can be influenced by exercise/diet, but only within significant limits.  Tendency towards injury is a related and somewhat subtle aspect of this that might act as a big genetic-based factor in who achieves the highest level of performance in physical skills.

The existence of peripheral differences does not actually tell us much about the relative importance of central, brain-based, individual differences in skill learning, which are closer to the questions we try to examine in the lab.  It could be the case that there are some people who learn more from each practice repetition and, as a result, achieve expertise more quickly (fwiw, we are failing to find this, although we are looking for it).  But if any novice can get to high-level expertise in 10,000 hours, then maybe that is not as big a factor.

In my estimation, Dan got pretty far in his 6k hours.  And even if he wasn’t able to fully test the core hypothesis due to injuries, it’d be great if this inspired some other work along this line.  Maybe somebody would invest a few million grant dollars into recruiting a bunch more people like Dan to spend ~5 years of full-time commitment on various cognitive or physical skills, see what the learning curves look like, and get some more data on the relative importance of talent and time in expertise.

Apr 05

Leela Chess

Courtesy of Jerry (ChessNetwork), I found out today about Leela and the LCzero chess project (http://lczero.org/).

This appears to be a replication of the Google DeepMind AlphaZero project with open source and distributed computing contributing to the pattern learner.

Among the cool aspects of the project is that you can play against the engine after different amounts of learning.  It’s apparently played around 2.5m games now and appears to have achieved “expert” strength.

There are a few oddities, though.  First, if it is learning through self-play, are these games versus humans helping it learn (that is, are they used in the learning algorithm)?  Practically it probably makes no difference if most of the training is from real self-play, but I’m curious.  Second, they report a graph of the engine’s playing ability, but the Elo is scaling to over 4k, which doesn’t make any sense (‘expert’ Elo is around 2000).  There really ought to be a better way to get a strength measure, for example, simply playing a decent set of games on any one of the online chess sites (chess.com, lichess.com).  The ability to do this and the stability of chess Elo measures are good reasons to use chess as a model for exploring pattern-learning AI compared to human ability.

But on the plus side, there is source code.  I haven’t looked at the github repository yet (https://github.com/glinscott/leela-chess) but it’d be really interesting to constrain the search to something more human-like (narrow and shallow) and see if it is possible to make the engine play more like a human.
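
As a sketch of how that could be probed from outside the engine (I have not tried this): the python-chess package can cap the engine’s search budget over UCI, which at least captures the “only a few hundred positions” idea without touching the source. The engine path below is a placeholder for a locally built binary.

```python
# Sketch, untested here: drive the engine with a tiny node budget via UCI.
# "./lczero" is a placeholder path; nodes=500 approximates "a few hundred positions".
import chess
import chess.engine

engine = chess.engine.SimpleEngine.popen_uci("./lczero")  # placeholder path
board = chess.Board()

# Let the node-starved engine play both sides for a handful of moves.
for _ in range(10):
    if board.is_game_over():
        break
    result = engine.play(board, chess.engine.Limit(nodes=500))
    board.push(result.move)

print(board)
engine.quit()
```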

Jerry posted a video on youtube of playing against this engine this morning (https://www.youtube.com/watch?v=6CLXNZ_QFHI).  He beats it in the standard way that you beat computers: play a closed position where you can gradually improve while avoiding tactical errors and the computer will get ‘stuck’ making waiting moves until it is lost.  Before they started beating everybody, computer chess algorithms were known to be super-human at tactical considerations and inferior to humans at positional play.  It turns out that if you search fast and deep enough, you can beat the best humans at positional play as well.  But the interesting question to me for the ML-based chess algorithms is whether you can get so good at the pattern learning part that you can start playing at a strong human level with little to no explicit alpha-beta search (just a few hundred positions rather than the billions Stockfish searches).

Jerry did note in the video that the Leela engine occasionally makes ‘human like’ errors, which is a good sign.  I think I’d need to unpack the code and play with it here in the lab to really do the interesting thing of figuring out the shape of the function, ability = f(million games played, search depth).

 

 

Jan 10

Adventures in data visualization

If you happen to be a fan of data-driven political analysis, you are probably also well aware of the ongoing challenge of how to effectively and accurately visualize maps of US voting patterns.  The debate over how to do this has been going on for decades but was nicely summarized in a 2016 article by the NYTimes Upshot section (https://www.nytimes.com/interactive/2016/11/01/upshot/many-ways-to-map-election-results.html).

Recently, Randall Munroe of XKCD fame came up with a really nicely done version:

https://xkcd.com/1939/

2016 Election Map

I was alerted to it by high praise for the approach from Vox:

https://www.vox.com/policy-and-politics/2018/1/8/16865532/2016-presidential-election-map-xkcd

 

Of particular note to me was that a cartoonist (albeit one with a strong science background) was the person who found the elegant solution to the density/population tradeoff issue across the US while also capturing the important mixing aspects (blue and red voters in every state).  That isn’t meant as a knock on the professional political and data scientists who hadn’t come up with this approach, but more of a note on how hard data visualization really is and how the best, creative, effective solutions might therefore come from surprising sources.

 

 

Dec 12

AlphaZero Beats Chess In 4 (!?) hours

Google’s DeepMind group updated their game learning algorithm, now called AlphaZero, and it has mastered chess.  I’ve seen the game play and it elegantly destroyed the previous top computer chess-playing algorithm, Stockfish (the computers have been better than humans for about a decade now).  Part of what is intriguing about their claim is that the new algorithm learns entirely from self-play with no human data involved — plus the learning process is apparently stunningly fast this way.

Something is weird to me about the training time they are reporting, though.  Key things we know about how AlphaZero works:

  1. Deep convolutional networks and reinforcement learning. This is a classifier-based approach that is going to act the most like a human player.  One way to think about this is: if you could take a chess board and classify it as a win for white, a win for black, or a draw (with a perfect classifier algorithm), then to move, you simply look at the position after each of your possible moves and pick the one that is the best case for your side (win or draw).  A toy version of this one-ply, classifier-driven loop is sketched just after this list.
  2. Based on the paper, the classifier isn’t perfect (yet). They describe using a Monte-Carlo tree search (MCTS), which is how you would use a pretty good classifier.  Standard chess algorithms rely more on alpha-beta (AB) tree search.  The key difference is that AB search is “broader” and searches every possible move, response move, next move, etc. as far as it can.  The challenge for games like chess (and even more for Go) is that the number of positions to evaluate explodes exponentially.  With AB search, the faster the computers, the deeper you can look and the better the algorithm plays.  Stockfish, the current world champ, was searching 70m positions/sec for the match with AlphaZero.  MCTS lets you search smarter, more selectively, and only check moves that the current position makes likely to be good (which is what you need the classifier for).  AlphaZero is reported as searching only 80k positions/sec, about a thousand times fewer than Stockfish.
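
To make point 1 concrete, here is a toy version of the “classify the position after every legal move and pick the best” loop, using the python-chess package. The evaluate() function below is just a material count standing in for a learned classifier, so it plays terribly; the point is only the one-ply, classifier-driven control flow, with no deep search at all.

```python
# Toy one-ply "classifier" player. A material count stands in for the learned
# network, so this is nothing like AlphaZero's strength; it only shows the shape
# of the move-selection loop described in point 1.
import chess

PIECE_VALUES = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
                chess.ROOK: 5, chess.QUEEN: 9, chess.KING: 0}

def evaluate(board: chess.Board) -> float:
    """Stand-in classifier: material balance from White's point of view."""
    score = 0.0
    for piece_type, value in PIECE_VALUES.items():
        score += value * len(board.pieces(piece_type, chess.WHITE))
        score -= value * len(board.pieces(piece_type, chess.BLACK))
    return score

def pick_move(board: chess.Board) -> chess.Move:
    """Score the position after each legal move and keep the best one for the mover."""
    best_move, best_score = None, None
    for move in board.legal_moves:
        board.push(move)
        score = evaluate(board)
        board.pop()
        better = (best_score is None
                  or (board.turn == chess.WHITE and score > best_score)
                  or (board.turn == chess.BLACK and score < best_score))
        if better:
            best_move, best_score = move, score
    return best_move

board = chess.Board()
print(pick_move(board))  # all scores tie at the start, so it returns the first legal move
```

Swap the material count for a strong learned evaluator and replace the single ply with a selective MCTS, and you have the recipe described above.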

That all makes sense.  In fact, this approach is one we were thinking about in the early 90s when I was in graduate school at CMU talking about (and playing) chess with Fernand Gobet and Peter Jansen.  Fernand was a former chess professional, retrained as a Ph.D. in Psychology and doing a post-doc with Herb Simon.  Peter was a Computer Science Ph.D. (iirc) working on chess algorithms.  We hypothesized that it might be possible to play chess by using the patterns of pieces to predict the next move.  However, it was mostly idle speculation since we didn’t have anything like the classifier algorithms used by DeepMind.  Our idea was that expert humans have a classifier built from experience that is sufficiently well-trained that they can do a selective search (like MCTS) of only a few hundred positions and manage to play as well as an AB algorithm that looked at billions of positions.  It looks like AlphaZero is basically this on steroids – a better classifier, more search and world champion level performance.

The weird part to me is how fast it learned without human game data.  When we were thinking about this, we were going to use a big database of grandmaster games as input to the pattern learner (a pattern learner/chunker was Fernand’s project with Herb).  AlphaZero is reported as generating its own database of games to learn from by ‘playing itself’.  In the Methods section, the number of training games is listed at 44m.  That looks way too small to me.  If you are picking moves randomly, there are more than 9m positions after 3 moves and several hundred billion positions after 4 moves.  AlphaZero’s training is described as being in up to 700k ‘batches’ but even if each of those batches has 44m simulations, there’s nowhere near enough games to explore even a decent fraction of the first 10 or so moves.

Now if I were training a classifier as good as AlphaZero, what I would do is to train it against Stockfish’s engine (the previous strongest player on the planet) for at least the first few million games, then later turn it loose on playing itself to try to refine the classifier further.  You could still claim that you trained it “without human data” but you couldn’t claim you trained it ‘tabula rasa’ with nothing but the rules of chess wired in.  So it doesn’t seem that they did that.

Alternately, their learning algorithm may be much faster early on than I’d expect.  If it effectively pruned the search space of all the non-useful early moves quickly, perhaps it could self-generate good training examples.  I still don’t understand how this would work, though, since you theoretically can’t evaluate good/bad moves until the final move that delivers checkmate.  A chess beginner who has learned some ‘standard opening theory’ will make a bunch of good moves, then blunder and lose.  Learning from that game embeds a ‘credit assignment’ problem of identifying the blunder without penalizing the rating of the early correct moves.  That kind of error is going to happen at very high rates in semi-random games. Why doesn’t it require billions or trillions of games to solve the credit assignment problem?

Humans learn chess in many fewer than billions of games.  A big part of that is coming from directed (or deliberate) practice from a teacher.  The teacher can just be a friend who is a little better at chess so that the student’s games played are guided towards the set of moves likely to be good and then our own human machine learning algorithm (implicit learning) can extract the patterns and build our classifier.  The numbers reported on AlphaZero sound to me like it should have had a teacher.  Or there are some extra details about the learning algorithm I wish I knew.

But what I’d really like is access to the machine-learning algorithm to see how it behaves under constraints.  If our old hypothesis about how humans play chess is correct, we should be able to use the classifier and reduce the number of look-ahead evaluations to a few hundred (or thousand) and it should play a lot more like a human than Stockfish does.

Links to the AlphaZero reporting paper:

https://arxiv.org/abs/1712.01815v1

https://arxiv.org/pdf/1712.01815v1.pdf

 

Nov 07

Cognitive Symmetry and Trust

A chain of speculative scientific reasoning from our work into really big social/society questions:

  1. Skill learning is a thing. If we practice something we get better at it and the learning curve goes on for a long time, 10,000 hours or more.  Because we can keep getting better for so many hours, nobody can really be a top-notch expert at everything (there isn’t time).  This is, therefore, among the many reasons why in group-level social functioning it is much better to specialize and have multi-person projects done by teams of people specializing in component steps (for tasks that are repeated regularly).  The economic benefits of specialization are massive and straightforward.
  2. However, getting people to work well in teams is hard. In most areas requiring cooperation, there is the possibility of ‘defecting’ instead of cooperating on progress – to borrow terms from the Prisoner’s Dilemma formalism.  That powerful little bit of game theory points out that in almost every one-time 2-person interaction, it’s always better to ‘defect,’ that is, to act in your own self-interest and screw over the other player.
  3. Yet, people don’t. In general, people are more altruistic than overly simple game-theory math would predict.  Ideas for why that model is wrong include (a) extending the model to repeated interactions where we can track our history with other players and therefore cooperation is rewarded by building a reputation; (b) that humans are genetically prewired for altruism (e.g., perhaps by getting internal extra reward from cooperating/helping); or (c) that social groups function by incorporating ‘punishers’ who provide extra negative feedback for the non-cooperators to reduce non-cooperation.
  4. These three alternatives aren’t mutually exclusive, but further consideration of the (3a) theory raises some interesting questions about cognitive capacity. We interact a lot with a lot of different people in our daily lives.  Is it possible to track and remember everything about our interactions in order to make optimal cooperate/defect decisions?  Herb Simon argued (Science, 1990) that we can’t possibly do this, working along the same lines as the ‘bounded rationality’ reasoning that won him the Nobel Prize in Economics.  His conclusion was that (3b) was more likely, and he showed that if there was a gene for altruism (he called it ‘docility’), it would breed into the population pretty effectively.
  5. No such gene has yet been identified, and I have spent some time thinking about alternate approaches based on potential cognitive mechanisms for dealing with the information overload of tracking everybody’s reputation. One really interesting heuristic I ran across is the Symmetry Hypothesis, which I have slightly recast for simplicity (a toy numeric version appears just after this list).  This idea is a hack to the PD where you can reason very simply as follows: if the person I am interacting with is just like me and reasons exactly as I do, then no matter what I decide, they are going to do the same, and in that case I can safely cooperate because the other player will too.  And if I defect, they will also (potentially allowing group social gains through competition, which is a separate set of ideas).
  6. Symmetry would apply in cases where the people you often interact with are cognitively homogeneous, that is, where everybody thinks ‘we all think alike,’ and where ‘we’ can be any social group (family, neighborhood, community, church, club, etc.).  If this is driving some decent fraction of altruistic behavior, you’d see strong tendencies for high levels of in-group trust (compared with out-group), particularly in groups that push people towards thinking similarly.  You clearly do see those things, but their existence doesn’t actually test the hypothesis – there are many theories that predict in-group/out-group formation, that these affect trust, and that people who identify with a group start to think similarly.  Of note, though, this idea is a little pessimistic because it suggests that groupthink leads to better trust and that social groups should tend to treat novel, independent thinkers poorly.
  7. Testing the theory would require data examining how important ‘thinks like me’ is to altruistic behavior and/or how important cognitive homogeneity is to existing strong social groups/identity. This is a potential area of social science research a bit outside our expertise here in the lab.
  8. But if true, the learning-related question (back to our work) is whether a tendency to rely on symmetry can be learned from our environment. I suspect yes, that feedback from successful social interactions would quickly reinforce and strengthen dependency on this heuristic.  I think that this could cause social groups to become more cognitively homogeneous in order to be more effectively cohesive.  Cognitively homogeneous groups would have higher trust, cooperate better and be more productive than non-homogeneous groups, out-competing them.  This could very well create a kind of cultural learning that would persist and look a lot like a genetic factor.  But if it was learned (rather than prewired), that would suggest we could increase trust and altruism beyond what we currently see in the world by learning to allow more diverse cognitive approaches and/or learning to better trust out-groups.
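
A toy numeric version of the symmetry shortcut in item 5, using the usual textbook Prisoner’s Dilemma payoffs (the numbers are the standard ones, not anything from the lab): ordinary best-response reasoning says defect no matter what the other player does, while the symmetry assumption restricts the comparison to the matched outcomes and says cooperate.

```python
# Toy illustration of the symmetry heuristic with standard PD payoffs.
# PAYOFF[(my_move, their_move)] = my payoff; "C" = cooperate, "D" = defect.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

def best_response(their_move: str) -> str:
    """Standard one-shot reasoning: best reply to a fixed opponent move."""
    return max(["C", "D"], key=lambda mine: PAYOFF[(mine, their_move)])

def symmetry_choice() -> str:
    """Symmetry heuristic: assume the other player mirrors whatever I choose,
    so only the (C, C) and (D, D) outcomes are on the table."""
    return max(["C", "D"], key=lambda mine: PAYOFF[(mine, mine)])

print(best_response("C"), best_response("D"))  # D D -> defection dominates
print(symmetry_choice())                       # C   -> cooperation wins
```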

 

I was moved to re-iterate this chain of ideas because it came up yet again in a conversational drift into politics among people in the lab, although our internal debates usually center around how different groups treat their out-groups and why.  Yesterday, the discussion started with observing that people we didn’t agree with often seemed to be driven by fear/distrust/hate of those in their out-groups.  However, it was not clear whether, if you didn’t feel that way, you had managed to see all of humanity as your in-group or had instead found/constructed an in-group that avoided negative perception of the out-groups.  We did not come to a conclusion.

FWIW, this line of thinking depends heavily on the Symmetry idea, which I discovered roughly 10 years ago via Brad DeLong’s blog (http://delong.typepad.com/sdj/2007/02/the_symmetry_ar.html).  According to the discussion there, it is also described as the Symmetry Fallacy and not positively viewed among real decision scientists.  I have recast it slightly differently here and suspect that among the annoying elements is that I’m using an underspecified model of bounded rationality.  That is, for me to trust you because you think like me, I’m assuming both of us have slightly non-rational decision processes that for unspecified reasons come to the same conclusion that we are going to trust each other.  Maybe there’s a style issue where a cognitive psychologist can accept a ‘missing step’ like this in thinking (we deal with lots of missing steps in cognitive processing) where a more logic/math approach considers that anathema.

Aug 30

Evidence and conclusions

I think this should be the last note on this topic for a while, but since it’s topical: a new piece of data popped up related to possible sources of gender differences in outcomes in STEM-related fields.

 

The new piece of data was reported in the NY Times Upshot section, titled “Evidence of a Toxic Environment for Women in Economics.”

 

The core finding is that on an anonymous forum used by economists (grad students, post-docs, profs), there are a lot of negatively gendered associations for posts in which the author makes an explicitly gendered reference.  In general, posts about men are more professional and posts about women are not (body-based, sex-related, and generally sexist).  Note that “more” here is quantified as a likelihood ratio, which is mathematically well defined but does not translate trivially into an effect size.  Because we always like to review primary sources, I dug up the source, which has a few curious features.

First, it’s an undergraduate thesis, not yet peer-reviewed, which is an unusual source.  However, I looked through it, and it is a sophisticated analysis that appears to be done in a careful, appropriate, and accurate way (mainly based on logistic regression).  The method looks strong, but it is complex enough that there could be an error hiding in some of the key definitions and terms.
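
For flavor only, here is a fabricated-data sketch of what a word-level logistic regression of this general sort looks like; it is not Wu’s actual pipeline, and the “words” and probabilities below are invented.

```python
# Fabricated-data sketch: predict whether a post refers to a woman from word
# indicators, then read exponentiated coefficients as odds ratios. Not Wu's code.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n = 2000
about_woman = rng.binomial(1, 0.3, n)
# Two invented word indicators: one appearance-related, one profession-related.
word_appearance = rng.binomial(1, np.where(about_woman == 1, 0.4, 0.1))
word_journal = rng.binomial(1, np.where(about_woman == 1, 0.2, 0.4))

X = np.column_stack([word_appearance, word_journal])
clf = LogisticRegression().fit(X, about_woman)
print("odds ratios:", np.exp(clf.coef_[0]))  # >1: word is more likely in posts about women
```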

Paper link:

https://www.dropbox.com/s/v6q7gfcbv9feef5/Wu_EJMR_paper.pdf?dl=0

Of note, the paper seems to be a careful mathematical analysis of something “everybody knows,” which is that anonymous forums frequently include high rates of stereotyped bias against women and minorities.  But be very careful with results that confirm what you already know; that’s how confirmation bias works.  In addition, economists may not be an effective proxy for all STEM fields.  I don’t know of a similar analysis for Psychology, for example.

But as an exercise, let’s consider the possibility that the analysis is done correctly and truly captures the fact that there are some people in economics who treat women significantly differently than they treat men, i.e., that their implicit bias affects the working environment.  So we have three data points to consider.

  1. There are fewer women in STEM fields than men
  2. There are biological differences between men and women (like height)
  3. There are environmental differences that may affect women’s ability to work in some STEM fields

The goal, as a pure-minded scientist, is to understand the cause of (1).  Why are there fewer women in STEM?  The far-too-frequent inference error is when people (like David Brooks) take (2) as evidence that (1) is caused by biological differences.  That’s simply an incorrect inference.

It turns out to be helpful to know about (3), but only because it should reduce your certainty that (2) implies (1).  It’s still critical to realize that we do not know that environment causes (1) either.  All we know is that we have multiple hypotheses consistent with the data, and we don’t yet know which is right.

What we do know is that (3) is objectively bad socially.  Even if (2) meant there were either mean or distributional differences between men and women, the normal distribution means there are still women in the upper tail, and if (3) keeps them out of STEM, that hurts everybody.

The googler’s memo assumes (2) and reinforces (3), which is clearly and objectively a fireable offense.

 
