Amortize_Me_Daddy t1_ixdyg12 wrote
Very cool work. I saw this on my LinkedIn feed and immediately had to share it with my fiancé who is a huge fan of Risk and Diplomacy. To me, this seems like a much bigger deal than AlphaGo - can someone give me a sanity check?
I’m also interested in how much thought was put into the persuasiveness of generated messages when making a proposal. It seems like something way out of the scope of RL, but still quite important to optimize. I am just… astounded reading over that convo between France and Turkey. If you have time, would you mind offering some insight into the impressive “salesmanship” of CICERO’s language model?
LurkAroundLurkAround t1_ixf68yn wrote
AlphaGo was beating the best, this is, according to the post, a top 10% player, which most likely means 9.x% percentile. The comparison also includes players with more than 1 game, but Cicero played 40 games. So just by allowing a bunch of 2-game players they up their stats. A fair comparison would have been to take players with at least 40 games, sample 40 games randomly and compute the score, and then check the performance on this subset.
Not to take away anything from the team, but given how the results are framed, my instinct is to believe that this is a bit oversold.
icosaplex t1_ixijpuw wrote
I'm one of the paper authors:
You can see a full anonymized table of scores and ranks near the end of the Supplementary Material file linked for download at the end of the Science article. No player other than Cicero played anywhere close to 40 games, so such a procedure wouldn't be possible. Each game takes hours and requires scheduling 6 players to be simultaneously available, so understandably many players, including many good players, only played a handful of games each. If you restricted to, say, players with >= 5 games, Cicero would be 2/19.
We don't make a claim of being superhuman as AlphaGo did - we believe Cicero in this setting is at the level of a strong human player but not superhuman. We worked with top Diplomacy experts who have given us this feedback.
One thing to keep in mind is that Diplomacy has variance: there is practical luck in which players choose to ally with you or someone else, or whether you guess right or wrong in things like coin-flip tactical situations. So similar to, e.g. poker, even a middling player may occasionally win big in the short-run against top-level players to a degree that would not hold up in the long run. This means including players with too few games can sometimes have the exact opposite bias and make a strong result seem worse by comparison. In that quoted stat, we chose a threshold of > 1 game as a compromise between mitigating the most misleading tail of that bias, while still including as many players as possible rather than picking a higher threshold and arbitrarily cutting out large chunks of the player population from the comparison.
But of course, none of that ultimately matters since you can still check out the full list yourself.
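For anyone who wants to see the short-run variance effect described above, here is a minimal sketch. It is a toy model, not the paper's evaluation: the Gaussian score noise, the skill values, the pool size, and the function names `simulate_rank` and `mean_rank` are all assumptions made up for illustration, and none of the numbers come from the league data.

```python
import random
from statistics import mean


def simulate_rank(n_players=80, games_per_player=2, strong_games=40,
                  strong_skill=0.25, pool_skill=0.14, noise=0.3, seed=0):
    """Rank (1 = best) of one strong player's average score among a pool
    of weaker players who each played only a few games.

    All parameters are hypothetical; real Diplomacy scores are not Gaussian.
    """
    rng = random.Random(seed)

    def average_score(skill, n_games):
        # Toy assumption: each game's score is true skill plus Gaussian noise.
        return mean(rng.gauss(skill, noise) for _ in range(n_games))

    strong_avg = average_score(strong_skill, strong_games)
    pool_avgs = [average_score(pool_skill, games_per_player)
                 for _ in range(n_players)]
    # Rank is 1 plus the number of pool players whose observed average
    # happens to beat the strong player's observed average.
    return 1 + sum(avg > strong_avg for avg in pool_avgs)


def mean_rank(games_per_player, trials=200):
    """Average the strong player's rank over many independent simulations."""
    return mean(simulate_rank(games_per_player=games_per_player, seed=s)
                for s in range(trials))


if __name__ == "__main__":
    for g in (1, 2, 5, 10):
        print(f"pool games per player = {g:2d}: "
              f"mean rank of strong player = {mean_rank(g):.1f}")
```

Under these assumptions, the fewer games each pool player has, the more of them land a lucky average above the stronger player, so the strong player's rank tends to look worse even though its true skill is the highest in the pool.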
If you're interested in a bit more context on the player pool: the setting was a casual but competitive online blitz Diplomacy league advertised at various times in some of the main online Diplomacy community sites. Many newer players signed up and played, but also experienced players, and as an organized league I'd expect the overall average level of play to be a little higher than, e.g. generic online games.
And thank you and others for raising such questions - it's been fun and interesting to see discussions like this.
fujiitora t1_ixf9mm2 wrote
9.x% percentile? i assume you meant 90+ percentile?
MoNastri t1_ixfi9xs wrote
I assume so. Otherwise that's a bottom of the barrel player...
evanthebouncy t1_ixfiy4t wrote
iirc FAIR has work playing Hanabi, which requires some level of (non-verbal) communication. So a lot of the insights can be leveraged here as well.
TheAsianIsGamin t1_ixplwod wrote
It seems like the vast majority of CICERO's pitches are "here's an optimal play for you, you should do it not only because it's good for you but also because it's good for us." In other words, pointing players towards rationality. Of course, high level players in any social game are far more likely than their less skilled counterparts to want to make the rational play, so it's likely that there's some selection bias influencing how effective that salesmanship is. However, even high level players are governed by the emotions that break down game theory!
Here's an example case: I'm curious to see how CICERO responded in situations where they talk to Human A about a plan that requires Human B, but A doesn't trust B. How does CICERO respond to that? It may very well be that it doesn't get in those spots because it thinks about who's likeliest to align with whom and in what way. In this sense, it's playing to its strengths and not attempting plays it can't execute, which is an impressive strategic feat. But of course, I'm interested in seeing it try things it can't do - in this case, try a different mode of persuasion.