A screenshot of this question was making the rounds last week. But this article covers testing against all the well-known models out there.
Also includes outtakes on the ‘reasoning’ models.
Ok, if you’re willing to think out loud together, I’ll take that in good faith and respond in kind.
“It needed the rules, therefore it’s not reasoning” is doing a lot of work in your argument, and I think it’s where things come unstuck.
Every reasoning system needs premises - you, me, a 4-year-old. You cannot deduce conclusions from nothing. Demanding that a reasoner perform without premises isn’t a test of reasoning, it’s a demand for magic. Premise-dependence isn’t a bug, it’s the definition.
If you want to argue that humans auto-generate premises dynamically - fair point. But that’s a difference in where the premises come from, not whether reasoning is occurring.
Look again at what the rules actually were: https://pastes.io/rules-a-ph
No numbers, containers, or scenarios. Just abstract rules about how bounded systems work. Most aren’t even physics - they’re logical constraints. Premises, in the strict sense.
It’s the sort of logic a child learns informally via play. If we don’t call it “cheating” when kids learn the rules by knocking cups over, then telling the LLM “these are the rules” in a form it understands should be fair game.
When the LLM correctly handles novel chained problems - e.g. the 4oz cup already holding 3oz, with state tracked across two operations - that’s deriving conclusions from general premises applied to novel instances. That’s what deductive reasoning is, per the definition I cited. It’s what your kid groks (eventually).
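To make the state-tracking concrete, here’s a toy sketch of that kind of bounded-transfer arithmetic. The 4oz cup already holding 3oz is from the test; the jug, its contents, and the particular chain of pours are illustrative assumptions of mine, not the actual rules from the paste:

```python
# Toy model of bounded-container pours (illustrative only; not the actual test rules).
def pour(src_amount, dst_amount, dst_capacity):
    """Pour src into dst; dst can only accept up to its free space."""
    space = dst_capacity - dst_amount
    moved = min(src_amount, space)
    return src_amount - moved, dst_amount + moved  # new (src, dst) amounts

# Chained example: a jug holding 5oz pours into a 4oz cup already holding 3oz,
# then the cup is emptied and the jug pours again.
jug, cup = pour(5, 3, 4)      # only 1oz fits -> jug=4, cup=4
jug, cup = pour(jug, 0, 4)    # cup emptied, pour the rest -> jug=0, cup=4
print(jug, cup)               # 0 4
```

Getting that right requires applying the general “bounded container” premise to a state it has never seen before, which is exactly the premises-to-conclusions step at issue.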
“Without the rules it fails” - without context, humans make the same errors. Ask a 4-year-old whether a taller cup holds more fluid than a rounder one. Default assumptions under uncertainty aren’t a failure of reasoning, they’re a feature of any system with incomplete information.
“It’ll fail sometimes across 100 runs” - so do humans under load. Probabilistic performance doesn’t disqualify a process from being reasoning. It just makes it imperfect reasoning, which is the only kind that exists.
The Wizard of Oz analogy is vivid but does no logical work. “Complicated math and clever programming” describes implementation, not function. Your neurons are electrochemical signals running on evolved heuristics. If that rules out reasoning, it rules out all reasoning everywhere. If it doesn’t rule out yours, you need a principled account of why it rules out the LLM’s.
PS: I believe you’re wrong about the “100 runs = different outcomes” claim. With proper grounding, my local 4B model hit 0/120 hallucination flags and 15/15 identical outputs across repeated clinical test cases. Draft pre-publication data, methodology and raw outputs are here: https://codeberg.org/BobbyLLM/llama-conductor/src/branch/main/prepub/PAPER.md
I’m willing to test the liquid transformations thing and collect data. I might do that anyway. That little meme test is actually really good.
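If I do run it, the harness would be something like this rough sketch - assuming a generic OpenAI-compatible local endpoint; the URL, model name, rules file and prompt below are placeholders, not my actual setup:

```python
# Rough sketch of a repeat-run consistency check (placeholder endpoint/model/prompt).
import requests
from collections import Counter

URL = "http://localhost:8080/v1/chat/completions"   # any OpenAI-compatible local server
PROMPT = "A 4oz cup already holds 3oz. You pour 5oz into it. How much spills?"
RULES = open("rules.txt").read()                     # the abstract premises, given up front

def run_once():
    resp = requests.post(URL, json={
        "model": "local-4b",                         # placeholder model name
        "temperature": 0,                            # greedy decoding, for repeatability
        "messages": [
            {"role": "system", "content": RULES},
            {"role": "user", "content": PROMPT},
        ],
    })
    return resp.json()["choices"][0]["message"]["content"].strip()

answers = Counter(run_once() for _ in range(100))
print(answers.most_common())   # distinct answers and their counts across 100 runs
```

If the “100 runs = different outcomes” claim holds, that Counter should show a spread; if grounding does its job, it should collapse to one answer.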