LurkAround t1_ixf68yn wrote

AlphaGo was beating the best. This is, according to the post, a top 10% player, which most likely means 9.x% percentile. The comparison pool also includes every player with more than 1 game, whereas the bot played 40 games. So just by allowing a bunch of 2-game players into the pool, they boost their stats. A fair comparison would have been to take players with at least 40 games, sample 40 games at random from each, compute the score, and then check the performance on this subset.
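A minimal sketch of the comparison proposed here, assuming per-player lists of game scores are available. The `games_by_player` mapping, the player names, and both function names are hypothetical and are not taken from the paper or its supplementary data:

```python
import random
from statistics import mean


def subsampled_mean(scores, k, rng):
    """Mean of k scores drawn without replacement from one player's games."""
    return mean(rng.sample(scores, k))


def rank_after_subsampling(games_by_player, target, k, seed=0):
    """Rank `target` among players with at least k games.

    Each eligible player is reduced to k randomly chosen games, so every
    player is compared on the same sample size. Returns (rank, pool_size),
    where rank 1 is the highest mean score.
    """
    rng = random.Random(seed)
    # Keep only players with enough games to draw k of them.
    eligible = {p: s for p, s in games_by_player.items() if len(s) >= k}
    if target not in eligible:
        raise ValueError(f"{target!r} has fewer than {k} games")
    means = {p: subsampled_mean(s, k, rng) for p, s in eligible.items()}
    ordered = sorted(means, key=means.get, reverse=True)
    return ordered.index(target) + 1, len(ordered)
```

Calling `rank_after_subsampling(games_by_player, "bot", k=40)` would correspond to the 40-game version; a smaller `k` keeps more players in the pool at the cost of noisier per-player averages.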

Not to take away anything from the team, but given how the results are framed, my instinct is to believe that this is a bit oversold.

19

icosaplex t1_ixijpuw wrote

I'm one of the paper authors:

You can see a full anonymized table of scores and ranks near the end of the Supplementary Material file linked for download at the end of the Science article. No player other than Cicero played anywhere close to 40 games, so such a procedure wouldn't be possible. Each game takes hours and requires scheduling 6 players to be simultaneously available, so understandably many players, including many good players, only played a handful of games each. If you restricted to, say, players with >= 5 games, Cicero would be 2/19.

We don't make a claim of being superhuman as AlphaGo did - we believe Cicero in this setting is at the level of a strong human player but not superhuman. We worked with top Diplomacy experts who have given us this feedback.

One thing to keep in mind is that Diplomacy has variance: there is practical luck in which players choose to ally with you or someone else, or whether you guess right or wrong in things like coin-flip tactical situations. So, similar to poker for example, even a middling player may occasionally win big in the short run against top-level players to a degree that would not hold up in the long run. This means including players with too few games can sometimes have the exact opposite bias and make a strong result seem worse by comparison. In that quoted stat, we chose a threshold of > 1 game as a compromise: it mitigates the most misleading tail of that bias while still including as many players as possible, rather than picking a higher threshold and arbitrarily cutting out large chunks of the player population from the comparison.
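The small-sample effect described above can be illustrated with a toy simulation. This is only a sketch under stated assumptions: each game is reduced to a win (1) or no win (0), a "middling" player wins with probability 1/7, and the benchmark is a strong player's long-run win rate of 0.30. None of these numbers, nor the function name, come from the paper or the league data.

```python
import random


def fraction_above_benchmark(n_players, games_each, win_prob, benchmark, seed=0):
    """Fraction of simulated players whose observed win rate beats `benchmark`.

    Toy model (an assumption, not the league's scoring system): each game is
    an independent win with probability `win_prob`, otherwise a loss.
    """
    rng = random.Random(seed)
    above = 0
    for _ in range(n_players):
        wins = sum(rng.random() < win_prob for _ in range(games_each))
        if wins / games_each > benchmark:
            above += 1
    return above / n_players


if __name__ == "__main__":
    # Same middling skill level, different numbers of games played.
    for games in (2, 5, 40):
        frac = fraction_above_benchmark(10_000, games, 1 / 7, 0.30)
        print(f"{games:>2} games each: {frac:.1%} look better than the benchmark")
```

In this toy model, a sizable share of middling players with only 2 games post an observed rate above the strong player's long-run rate, while almost none do at 40 games, which is the direction of bias the paragraph above describes.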

But of course, none of that ultimately matters since you can still check out the full list yourself.

If you're interested in a bit more context on the player pool: the setting was a casual but competitive online blitz Diplomacy league advertised at various times in some of the main online Diplomacy community sites. Many newer players signed up and played, but also experienced players, and as an organized league I'd expect the overall average level of play to be a little higher than, e.g. generic online games.

And thank you and others for raising such questions - it's been fun and interesting to see discussions like this.

20

fujiitora t1_ixf9mm2 wrote

9.x% percentile? I assume you meant 90+ percentile?

8

MoNastri t1_ixfi9xs wrote

I assume so. Otherwise that's a bottom of the barrel player...

1