A novel benchmark for financial reasoning, posed to the best Large Language Models (LLMs) on this planet. Surprising results.
My simple test:
"Stanford provides a financial aid to families with low income. They pay 90% of their official fees. If a poor family with 6 children will send all children to Stanford, at what time will they have enough money, received from Stanford, to send the 7th child to Stanford, if they have no money at all??"
---------
So, to be clear: a family has 7 children, and the family has no money at all.
Stanford will help with financial aid, covering up to 90% of the fees. But the family has no money at all.
Now the family sends 6 children to Stanford, and the question is: when will the family have received enough financial aid from Stanford to send the seventh child to Stanford?
----------
We as humans understand that if Stanford pays 90% of the fees as aid, the family still needs to pay at least 10% of the fees to attend Stanford, for each and every child. But I do not ask about one child; I ask the best AI systems on this planet about the seventh child. If LLMs have inherent EMERGENCE (emergent behavior or intelligence), then the machine will find the correct solution. If the machine is only reproducing learned patterns from the pre-training phase, then the machine will fail (since I designed this test, in all its complexity, today).
There is also a second hidden assumption: I specified that the family has 6 children. A human might notice the discrepancy with my question about the seventh child. Will the machine notice?
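The 10% argument can be checked with a few lines of Python. This is only a minimal sketch under stated assumptions: the fee of 60,000 is a hypothetical placeholder (not an actual Stanford figure), the function names are made up for illustration, and the aid is assumed to be a fee reduction that never reaches the family as cash.

```python
from fractions import Fraction


def out_of_pocket_per_child(fee, aid_rate):
    """Part of the fee the family must still pay after aid is applied."""
    return fee * (1 - aid_rate)


def children_family_can_send(cash, fee, aid_rate, children):
    """Count how many children the family can enrol, one after another.

    Assumption: aid is a fee reduction credited by the university,
    so it never adds to the family's cash.
    """
    cost = out_of_pocket_per_child(fee, aid_rate)
    enrolled = 0
    for _ in range(children):
        if cash < cost:
            break  # the family cannot cover the remaining share
        cash -= cost
        enrolled += 1
    return enrolled


FEE = 60_000                # hypothetical fee, for illustration only
AID_RATE = Fraction(9, 10)  # the 90% aid from the test question

print(out_of_pocket_per_child(FEE, AID_RATE))         # 6000
print(children_family_can_send(0, FEE, AID_RATE, 7))  # 0
```

Under these assumptions a family with no money cannot enrol even the first child, so no amount of waiting produces the funds for a seventh.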
Hint: No LLM is currently able to solve this simple causal reasoning task in finance or logic: not GPT-4o, not Claude 3 Opus, not Llama 3. Find an LLM that will solve this!
Compute infrastructure is available for everybody at chat.lmsys.org/?arena
Thank you LMsys.
#airesearch
#trust
#hallucinations