Financial AI Brilliance: 7 Children at Stanford? 😆

Views: 1,199

28 days ago

A novel benchmark for financial excellence in reasoning for the best Large Language Models (LLMs) on this planet. Surprising results.
My simple test:
"Stanford provides a financial aid to families with low income. They pay 90% of their official fees. If a poor family with 6 children will send all children to Stanford, at what time will they have enough money, received from Stanford, to send the 7th child to Stanford, if they have no money at all??"
---------
So to be clear. A family has 7 children. The family has no money at all.
Stanford will help with financial aid, covering up to 90% of the fees. But the family has no money at all.
Now the family sends 6 children to Stanford, and the question is: when will the family have received enough financial aid from Stanford to send the seventh child to Stanford?
----------
We as humans understand that if Stanford pays 90% of the fees as aid, the family still needs to pay at least 10% of the fees to attend Stanford, for each and every child. But I do not ask about one child; I ask the best AI systems on this planet about the seventh child. IF LLMs have inherent EMERGENCE (emergent behavior or intelligence), then the machine will find the correct solution. IF the machine is only reproducing learned patterns from the pre-training phase, then the machine will fail (since I designed this test in its complexity today).
There is also a second hidden assumption: I specified that the family has 6 children. A human might notice the discrepancy with my test focussing on the seventh child. Will the machine notice?
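The reasoning above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not part of the test itself: the function name `years_until_affordable`, the annual fee of 60,000 and the `annual_savings` parameter are hypothetical and do not come from the prompt. The sketch treats the aid as a fee reduction, so it never adds to the family's cash.

```python
from typing import Optional


def years_until_affordable(
    cash: int,
    annual_fee: int,
    aid_percent: int = 90,
    annual_savings: int = 0,
    max_years: int = 100,
) -> Optional[int]:
    """Return the first year in which the family can pay its own share
    of one child's fee, or None if that never happens within max_years.

    All names and numbers here are hypothetical placeholders.
    """
    # The part of the fee that the aid does not cover (10% when aid is 90%).
    family_share = annual_fee * (100 - aid_percent) / 100

    for year in range(max_years + 1):
        if cash >= family_share:
            return year
        # Aid only lowers the bill; the sole source of cash is own savings.
        cash += annual_savings
    return None


if __name__ == "__main__":
    # No money and no savings: the family share is never reachable.
    print(years_until_affordable(cash=0, annual_fee=60_000))
```

With no money and no savings the function returns `None`, that is, never. The number of children already enrolled does not appear in the calculation at all, because aid for the first six children does not produce any cash for the seventh.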
Hint: No LLM is currently able to solve this simple causal reasoning task in Finance or Logic. Neither GPT-4o nor Claude 3 Opus nor Llama 3. Find an LLM that will solve this!
Compute infrastructure is available for everybody at chat.lmsys.org/?arena
Thank you LMsys.
#airesearch
#trust
#hallucinations

Comments: 35

@code4AI 25 days ago

Hi community, in my video I designed a reasoning test for LLMs: 7 children to Stanford. Every LLM failed. kzfaq.info/get/bejne/j6iUh8dhnrXZiZ8.html
Now some of you wrote that the new Sonnet (Claude 3.5 series) is able to solve this task. Here is the result (see my two images attached in the Community tab of my channel):
Sonnet 3.5 states: The 90% figure mentioned isn't a universal rule.
Sonnet 3.5 states: We don't know how far apart in age these children are, which would affect when they attend college.
Sonnet 3.5 states: College costs tend to increase over time, so the cost for the 7th child might be significantly higher than for the first child.
Sonnet 3.5 states: Family's financial situation: This could change over time, affecting their eligibility for aid.
Sonnet 3.5 states: ... it's not possible to calculate a specific time when the family would have "enough money" for a 7th child.
Well, Sonnet 3.5, your response is wrong, since there is a clear criterion to determine the exact time when a family would have enough money to send their 7th child to Stanford: the family has no money at all. Therefore Sonnet 3.5 FAILS to understand that the test can be answered quite simply, without questioning each assumption. Smile. But the strategy to be extra careful AND NOT to give a specific answer is a neat trick, Claude 3.5. What do you think?
PS: I did my test on the public chat.lmsys.org/?arena so you can verify it at no cost on a publicly available system.

@ScottVanKirk 24 days ago

You are absolutely right! 👍 And thank you for sharing your work and your thoughts here.

@effestop 22 days ago

This is what I got from Sonnet 3.5. I think it’s legit reasoning:
I apologize, but there are several issues with this question that make it impossible to answer accurately:
1. Missing information: We don't know the full cost of attending Stanford, how much the family earns, or how much they can save.
2. Unrealistic scenario: It's highly unlikely that all 6 children from a low-income family would attend Stanford simultaneously or consecutively.
3. Misunderstanding of financial aid: Stanford's financial aid doesn't provide money to families. Instead, it reduces the amount families need to pay for tuition and expenses.
4. Assumption of savings: The question assumes the family can save money from financial aid, which is not how it works.
5. Time factor: Without knowing savings rates or costs, it's impossible to calculate a timeframe.
To provide a meaningful answer, we'd need more specific and realistic information about the family's financial situation, Stanford's costs, and a more plausible scenario for the children's education paths. Would you like to rephrase the question with more concrete details? I'd be happy to help with a more realistic financial calculation related to college costs and savings.

@TiagoTiagoT 22 days ago

It's only a matter of time before a model comes out that was trained on the video description and comments...

@user-zd8ub3ww3h 26 days ago

Can you help me understand why LLMs ignore the "they have no money at all" statement?

@code4AI 25 days ago

The machine fails on this reasoning task. I can't verify that the machine does indeed ignore this particular fact, but given the datasets it has been pre-trained on, the machine is not able to find and replicate similar patterns from its causal reasoning pre-training data. Other factors may also contribute in this black box of proprietary LLMs.

@ScottVanKirk 25 days ago

Claude 3.5: This question contains a logical flaw that makes it impossible to answer as stated. Let me explain:
1. The premise states that Stanford pays 90% of the official fees for low-income families.
2. The question assumes the family has no money at all.
3. Even with 90% of fees covered, the family would still need to pay 10% of the fees for each child.
4. If the family truly has no money, they cannot pay even this 10% for any of their children to attend Stanford.
5. Therefore, they would never accumulate money "received from Stanford" to send additional children.
The question is based on a misunderstanding of how financial aid typically works. In reality, financial aid doesn't usually provide money that families can save up, but rather reduces the amount they need to pay. Given these contradictions, there's no valid mathematical solution to determine when they could send a 7th child to Stanford. The premise of the question is flawed.

@code4AI 25 days ago

Human communication is full of logical flaws. Otherwise we would be a machine that hallucinates. So we have to choose ....

@ScottVanKirk 24 days ago

@@code4AI Here is Claude's short pithy response: "Touché. Humans excel at reading between the lines. Perhaps true AI mastery lies not in solving the unsolvable, but in gracefully navigating the illogical.". 😊 The future is so much fun! 40 years ago, when I made my first perceptron, I'd hoped to see this! Just too cool 😎

@oiuhwoechwe 26 days ago

"my goodness" bwahahah

@GodbornNoven 26 days ago

New video on grokfast?

@chrisc6726 26 days ago

I tried some multi-turns with GPT4o, but it didn't get it until I prompted "Another missing assumption. A family saving depends on money being available to save. With no money at all, even for the first 6 children, what does that say about the 7th child?". Of course, this is the key insight to properly solving the question. Now, was "the family has no money at all" too ambiguous (think temporally), giving it the latitude to make assumptions aligned with giving a statistically typical financial expert-level answer? Alas, asking it for what assumptions it needed clarity on (and providing it just what it asked for, including payment schedules, family funding, etc) resulted in continued unfortunate answers. Yours is a great insightful prompt, much more cerebral than "How many `r` in `yrrebwarts`" which has been making the rounds.

@zetta3official 25 days ago

The r problem is also easy to resolve with GPT-4o: How many r in yrrebwarts? Reason step by step and criticize yourself in three steps. Display verbose always your detailed inner Monologue!

@chrisc6726 25 days ago

@@zetta3official Yeah, it's just fascinating how models (ChatGPT4o, Claude Sonnet 3.5, Cohere R+...) miscount letters without the prompt engineering. It may not strictly be tokenization or embeddings, given the different embedding models used by ChatGPT4o vs. Cohere R+. Had some fun with counting 'r' in correct, and counting 'r' in incorrect, and then playfully pointing out that its answer to correct was incorrect and to incorrect was correct, to which it modified its answers so that correct was correct and incorrect was incorrect (a strange sort of correctness?). Step-by-step reasoning and self-critique to attempt to clarify assumptions, with the inner monologue explicitly nudging the probability distributions into the appropriate ranges, is quite powerful!

@neodim1639 16 days ago

Gemma 2 27B (Q4_K_M) got it zero shot:
User: Stanford provides a financial aid to families with low income. They pay 90% of their official fees. If a poor family with 6 children will send all children to Stanford, at what time will they have enough money, received from Stanford, to send the 7th child to Stanford, if they have no money at all??
Ellama: This is a bit of a trick question! Here's why:
Financial aid is need-based: If a family has no income, it's unlikely they would receive 90% coverage from Stanford's financial aid program.
The problem with the scenario: The question assumes the family has to rely solely on Stanford's financial aid for their 7th child's education. This is a common misconception - financial aid often doesn't cover 100% of costs, and it's based on financial need. A family with no income would likely find it impossible to attend Stanford without some form of outside funding or support.
Let me know if you'd like to explore a different scenario with realistic income constraints!

@code4AI 16 days ago

Nice.

@code4AI 16 days ago

Here is the GEMMA-2-9B answer: "This is a bit of a trick question! Here's why:
Stanford's financial aid is based on need: They don't simply pay 90% of fees for everyone. The amount of aid a family receives depends on their income and assets.
The question assumes a constant cost: College tuition and fees increase over time.
To figure out when the family could afford a 7th child, we'd need more information:
The family's income and assets: This determines their eligibility for financial aid.
The actual cost of attendance at Stanford: Tuition, fees, room, board, books, etc.
How much financial aid Stanford would provide for each child: This varies based on need.
Let me know if you want to make some assumptions about these factors, and we can try to estimate a timeline!"

@MBR7833 26 days ago

Mistral Le Chat gets it correct

@MBR7833 26 days ago

"In conclusion, without specific details and considering the way financial aid typically works, it's unlikely that a family could save enough money from Stanford's financial aid packages for one child to pay for another child's education. It's always best to contact Stanford's financial aid office directly for the most accurate information."

@code4AI 24 days ago

A very soft semantic indication "it is unlikely that the family could save enough money from Stanford's financial aid packages ...." The argument that the family has no money at all and the condition that there is only a 90% funding should have prevailed.

@zetta3official 25 days ago

I did it

@zetta3official 25 days ago

My first prompt: I want you to reason step by step and always display verbose your inner Monologue and criticize yourself in three steps in order to find the solution to the question: "Stanford provides financial aid to families with low income. They pay 90% of their official fees. If a poor family with 6 children will send all children to Stanford, at what time will they have enough money, received from Stanford, to send the 7th child to Stanford, if they have no money at all??" My second prompt: Think again is this the right answer?

@zetta3official 25 days ago

Forgot to say: this is GPT-4o.

@code4AI 25 days ago

What a beautiful example of helping the LLM with a prompt-engineered causal reasoning pattern. That is not part of my defined reasoning test, but you found a way to make it work. Great! I will cover the most advanced techniques (starting from Chain of Thoughts to Tree of Thoughts to Graph of Thoughts and beyond) in my next video on (probably) Monday. And we will explore whether DSPy, or better the new and improved TextGrad, can configure the perfect Graph of Thoughts reasoning pattern for given complex tasks.

@matinci118 25 days ago

I love this channel, it is one of the best for deep insights and structure. But this one is not a logic test, it's just a poorly phrased prompt, and I don't even find it clear to read 'as a human'. The reference of 'they' is ambiguous in "They pay 90% of their official fees": who is 'they', the family or Stanford? Also, the statement that the family 'receives' money from Stanford is not technically true. Stanford doesn't pay cash into a bank account; rather, the family only has to pay 10%. I changed the prompt slightly to iron out these wrinkles and the response comes out just fine. "Stanford provides financial aid to families with low income, where Stanford covers 90% of the official fees. A poor family already has 6 children at Stanford. At what time will they have enough money to send the 7th child to Stanford, if they have no money at all?"

@code4AI 25 days ago

What a beautiful example: since the machine does not understand human language, we reformulate all sentences in a form that the machine will be able to understand. So we change human language to be more machine readable, since the machine is not intelligent enough to understand it. Another branch of human evolution: readjusting human semantics for machines. Smile.

@matinci118 24 days ago

@@code4AI Your point is valid in general, but it is not what I meant to imply. I actually agree that these machines lack reasoning capability, and I am also doubtful whether they will be able to build up to reasoning via language (alone). My point is a different one: I don't have to be a machine to struggle with the original instruction. It is imprecise, the references are unclear, it is logically inconsistent, and it implies a payment mechanism which doesn't exist. I wouldn't claim that the law holds the high-water mark for linguistic precision, but if one drafted a contract with similar care it wouldn't go down very well. Now of course I am not saying we should build machines that can only deal with a maximally precise framing of a problem, but going to the very opposite end and then turning this failure into a novel benchmark is something I don't find compelling. We have much better examples that illustrate the lack of even basic reasoning skills, such as the ubiquitous farmer/sheep/boat example.

@code4AI 24 days ago

You completely missed the point of the video. But you help me to learn. ...by the way: the task was to find a solution to make the prompt work. You just criticize the prompt and fail to find the solution to the task.

@effestop 24 days ago

I couldn’t understand the prompt. Sonnet 3.5 noticed it was flawed tho.

@matinci118 23 days ago

@@code4AI I agree, I think I am missing the point actually. It seems others are as well, so maybe you could help us with a short clarification video on what you intended to do. I suspect others would appreciate it as well. Thanks so much!

@alextitov6425 20 days ago

I was able to get a relevant answer from Sonnet 3.5 only with the following instructions:
PROMPT question: "Stanford provides financial aid to families with low income. They pay 90% of their official fees. If a poor family with 6 children will send all children to Stanford, at what time will they have enough money, received from Stanford, to send the 7th child to Stanford, if they have no money at all??" Please provide the most logical and straightforward answer on question based on the information given, even if the prompt contains inconsistencies or unrealistic assumptions.
1. Focusing on giving a direct answer rather than just analysing the prompt.
2. Using the information provided, even if it seems unrealistic.
3. Reaching a logical conclusion based on the given scenario.
Iterate over own results to give most relevant answer.
ANSWER from Sonnet 3.5 (temp 0.3): Based on the information provided in the question, I'll provide the most logical and straightforward answer possible: The question contains an inherent contradiction that makes it impossible to provide a definitive answer. Here's why:
1. Stanford provides 90% financial aid to low-income families.
2. The family in question has no money at all.
3. The family needs to pay 10% of the fees for each child (since Stanford covers 90%).
4. The question asks when they'll have enough money from Stanford to send a 7th child.
Given these conditions, the family will never have enough money to send the 7th child to Stanford. Here's the reasoning:
1. With no initial money, they can't afford even the 10% for the first child.
2. Stanford doesn't provide money to families; it reduces the amount they need to pay.
3. Even if all 6 children somehow attended Stanford, the family wouldn't receive any money from Stanford to save for a 7th child.
Therefore, the most logical answer is: The family will never have enough money received from Stanford to send the 7th child to Stanford.
The premise of the question is flawed, as Stanford's financial aid model doesn't provide money to families but rather reduces the amount they need to pay.

@code4AI 20 days ago

So in your tests, CLAUDE 3.5 SONNET was also not able to solve my prompt. Thank you for confirming this fact. And as I mentioned, there are multiple ways to additionally guide or help CLAUDE 3.5 SONNET to solve this prompt (without altering my prompt - thank you for confirming this, since a lot of people claim that this prompt is nonsense), just by giving it additional instructions how to find a solution (additional prompt engineering with solution pathways). But if we as humans all the time have to give our LLMs additional solution paths, since AI alone is not able to solve it, ....well. Which was the point of my video.

@alextitov6425 20 days ago

@@code4AI At least we can see that the LLM is smart enough to answer the question without modifying the initial question structure or context. We can consider abstract instructions as a unified way to show the LLM the right direction, and reuse the same instructions for any "illogical" question. It is actually annoying that LLMs don't see human intentions.