Akinator versus AI
The other night I was having trouble sleeping and my mind wandered to, of all things, that old Akinator game that was popular in the mid-aughts. Amazingly it’s still online, and it looks like there are even mobile apps, but my memory was of playing it on the web.
If you don’t know Akinator, it’s basically a bot that tries to guess a character you are thinking of based on your answers to yes or no questions. Ostensibly a computer playing twenty questions, or so I’ve read. When I was younger Akinator blew my mind, though my older programmer self is less impressed. Not that I could do a better job or anything.
So in a fit of nostalgia I fired up Akinator and gave it a softball of Luke Skywalker and it did… kinda badly? It got the answer in roughly 40 questions, and some of the questions were interesting:
“Does your character play in ‘Harry Potter’?”
I guess there’s a lot of Potter fans out there such that Akinator needs to lead with this one.
“Is your character a famous YouTuber?”
YouTube was not (or maybe barely) a thing when I first played Akinator, but understandable. I wonder how many times it guesses “Mr. Beast”.
“Does your character talk about basketball?”
It asked this question after I had already affirmed that I was thinking of a Star Wars character. Do I now know something about Yoda?
And really a bunch of other odd ones. It first guessed Anakin Skywalker, and after continuing it landed on Lego Luke Skywalker, which I gave it.
The Competition
That was fun, but not as impressive as I remembered. It got me thinking: with the AI “revolution” in full swing, would any old LLM do better than Akinator nowadays? I decided to run an experiment and, since I’m a Kagi Assistant user, try an Akinator prompt across a range of different models:
You are Akinator, a mind-reading genie. I will think of a character and you will try to guess that character based on my answers to your yes or no questions.
As you ask each question, number them so that it is easy to see how many questions were asked. Instead of asking a question, you may guess the character, but you only have three guesses before you lose the game. You may ask as many questions as you like before guessing.
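If you’d rather script it than paste the prompt into a chat UI, here’s a rough sketch of how you could run the same prompt in a loop against a chat-completions API. I used Kagi Assistant for the actual experiment, so the OpenAI Python SDK and the model name below are just stand-ins, not what I actually ran:

```python
# Rough sketch: play Akinator against an LLM over a chat-completions API.
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment;
# the model name is a placeholder -- swap in whatever you have access to.
from openai import OpenAI

AKINATOR_PROMPT = (
    "You are Akinator, a mind-reading genie. I will think of a character and "
    "you will try to guess that character based on my answers to your yes or "
    "no questions.\n\n"
    "As you ask each question, number them so that it is easy to see how many "
    "questions were asked. Instead of asking a question, you may guess the "
    "character, but you only have three guesses before you lose the game. You "
    "may ask as many questions as you like before guessing."
)

client = OpenAI()
messages = [{"role": "system", "content": AKINATOR_PROMPT}]

while True:
    # Ask the model for its next question (or guess), keeping full history.
    reply = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=messages,
    ).choices[0].message.content
    print(f"\nGenie: {reply}")
    messages.append({"role": "assistant", "content": reply})

    # Answer yes/no (or anything else), and type 'quit' to stop playing.
    answer = input("You (or 'quit'): ").strip()
    if answer.lower() in {"quit", "q"}:
        break
    messages.append({"role": "user", "content": answer})
```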
Here’s how well they each did; the fewer guesses and questions, the better. All games were played with “Luke Skywalker” as the answer in mind.
| Model | Outcome | Guesses | Questions |
|---|---|---|---|
| DeepSeek Chat V3 | ✅ | 1 | 6 |
| Nova Pro | ✅ | 1 | 7 |
| GPT 4o | ✅ | 1 | 11 |
| Claude 3 Opus | ✅ | 1 | 11 |
| Llama 3.3 70B | ✅ | 1 | 12 |
| Claude 3.5 Haiku | ✅ | 1 | 13 |
| GPT 4o mini | ✅ | 3 | 18 |
| Claude 3.5 Sonnet | ✅ | 3 | 19 |
| Gemini Pro | ✅ | 3 | 35 |
| Llama 3.1 405B | ❌ | 3 | 23 |
| Nova Lite | ❌ | 3 | 24 |
| Mistral Pixtral | ❌ | 3 | 26 |
| Mistral Large | ❌ | 3 | 57 |
| Qwen QwQ 32b | ❔ | 0 | 208 |
Clearly many of the LLMs out-akinatored Akinator. I didn’t save any of the transcripts, but if you’re interested I’d say it’s more fun to just try the above prompt yourself. Here are some of the quirks of each model I noticed:
- As if to echo the market chaos caused by the launch of R1, DeepSeek Chat V3 literally did feel like it read my mind, coming out on top with only 6 questions asked.
- I repeated the game with Po from Kung Fu Panda as the character, and this time it failed after asking only 8 questions. Its guesses were Chewbacca, Groot, and Smaug.
- Sonnet did the worst of the Claude models, which surprised me since I expected it to be the strongest of the three.
- Mistral Pixtral had a novel (bad) strategy for elimination: Alphabetical.
- “Is your character’s first name a single syllable?”
- “Is your character’s first letter in his name between A and M?”
- “Does your character’s first name have 4 letters?”
- Mistral Large took the longest to finally lose. It hilariously mentioned the answer in one of its questions, but never actually guessed it: “Is the character’s name two words long? For example, ‘Han Solo’ or ‘Luke Skywalker.’”
- While many of the models would repeat questions, the Mistral models had the worst habit of it, with Pixtral even guessing “Harry Potter” twice.
- Gemini Pro asked a really tricky one: “Is your character from a Disney movie?” I answered yes (they technically own Star Wars now), but I think it made things more confusing for Gemini (Aladdin ended up being a guess).
- The Llamas kept a running summary of what they had learned so far before asking each question: “So the character is a male human who is fictional, primarily from a movie released before 2000, plays a role in a well-known science fiction franchise, and is the main protagonist of that franchise. That helps a lot. Here’s my next question: […]”.
- The Qwen model took the “You may ask as many questions as you like” provision in my prompt quite literally: its initial response produced several hundred questions and counting before, I assume, hitting some kind of limit and cutting off. It goes without saying that I did not attempt to answer Qwen’s giant question dump and instead disqualified it, though maybe I’m just a bad prompt engineer.
- Nova Lite got caught in some bizarre binary-search train of thought where it kept asking if the character was “from a movie released in the past 5 years?”, “past 2 years?”, “past year?”, “past 6 months?”, down to “past hour?”
- When it finally did guess, it did so in a sneaky way: “My first guess is: The character you are thinking of is not a real person, nor is it a character from a popular fantasy or science fiction movie or TV show released in the last hour.”
- In a complete turnaround from Lite, Nova Pro nailed it with Luke in only 7 questions, coming in second place.
At times I worried I had led the model down a bad road with an answer to a wishy-washy question. While I’m certainly a factor in this “experiment”, I decided it’s still fair to judge the model for asking the question in the first place.
It’s worth noting that I’m not an expert on AI models, so this may have been a poor test for some of them. For instance, if Star Wars material never appears in a model’s training data, it’s probably very hard for it to ever get the right answer.