Claude 3.5 SONNET hallucinates w/ a Logic Bug?

  Views: 1,496

code_your_own_AI


23 days ago

Testing CLAUDE 3.5 SONNET for causal reasoning and logic reveals a massive problem. CLAUDE 3.5 SONNET simply generates incorrect facts (hallucinates) and also operates with a logic bug called "affirming the consequent", where a logical chain is incorrectly inverted and proclaimed true.
This CLAUDE 3.5 SONNET logic bug has consequences for my tasks in writing code, mathematics, logic, finance, medicine, general reasoning, physics and more. Logic Bug Alert: Claude 3.5 SONNET's Hallucinations
Please note that I checked CLAUDE 3 OPUS: OPUS does NOT have this flaw and generates correct results.
My video contains only those blind tests where LMSYS selected CLAUDE 3.5 SONNET for the task at hand. Other models were also tested but are not included in my video on CLAUDE 3.5 SONNET.
For further info, see the Community tab of my KZfaq channel.
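To make "affirming the consequent" concrete, here is a minimal sketch (not taken from the video) that enumerates a two-variable truth table. It checks that modus ponens (from "if P then Q" and P, infer Q) is valid, while the inverted inference (from "if P then Q" and Q, infer P) has a counterexample. The helper name `implies` is an illustrative assumption.

```python
from itertools import product


def implies(p, q):
    """Material implication: 'if p then q' is false only when p is true and q is false."""
    return (not p) or q


assignments = list(product([False, True], repeat=2))

# Modus ponens: whenever (P -> Q) and P both hold, Q must hold as well.
modus_ponens_valid = all(q for p, q in assignments if implies(p, q) and p)

# Affirming the consequent: (P -> Q) and Q hold, yet P is false.
counterexamples = [(p, q) for p, q in assignments if implies(p, q) and q and not p]

print("modus ponens valid:", modus_ponens_valid)
print("counterexamples to affirming the consequent:", counterexamples)
```

In terms of the prompt below: the rule "If wozz, then riff" together with the fact "Alice is riff" does not by itself establish that Alice is wozz.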
-----
Prompt that I used in my video:
Rules:
If nup, then hepe.
If kere and riff, then tess.
If feg and tess, then rabe.
If wozz, then riff.
If tess and vepe, then nup.
If ag and riff and kere, then hepe.
If feg and thoxe, then ire.
If nife and hume and ag, then dax.
If ire and dax, then wope.
If rebe and ene and heff, then kone.
If hepe and tess and ag, then kere.
If rabe, then thoxe.
If rabe and vide and nife, then ag.
If fum, then wozz.
If zeck and hepe and dax, then riff.
If kere and wope, then fum.
If sene, then hume.
If thoxe, then vepe.
If fum, then vide.
If duve, then dax.
If jag and kone, then thoxe.
If nup and ag and vide, then duve.
Facts:
Alice is cote.
Alice is kone.
Alice is duve.
Alice is ag.
Alice is jag.
Alice is tess.
Alice is riff.
Alice is feg.
Alice is vide.
Query: Is Alice wozz? In each step, indicate whether you are using a fact or a rule.
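The query can be checked mechanically. The following is a minimal sketch, assuming every rule is read as a plain "if all premises hold, then the conclusion holds" clause and that only the listed facts are given. It applies simple forward chaining until nothing new can be derived. The names `parse_rule` and `forward_chain` are illustrative and not from the video.

```python
RULES_TEXT = """\
If nup, then hepe.
If kere and riff, then tess.
If feg and tess, then rabe.
If wozz, then riff.
If tess and vepe, then nup.
If ag and riff and kere, then hepe.
If feg and thoxe, then ire.
If nife and hume and ag, then dax.
If ire and dax, then wope.
If rebe and ene and heff, then kone.
If hepe and tess and ag, then kere.
If rabe, then thoxe.
If rabe and vide and nife, then ag.
If fum, then wozz.
If zeck and hepe and dax, then riff.
If kere and wope, then fum.
If sene, then hume.
If thoxe, then vepe.
If fum, then vide.
If duve, then dax.
If jag and kone, then thoxe.
If nup and ag and vide, then duve.
"""

FACTS = {"cote", "kone", "duve", "ag", "jag", "tess", "riff", "feg", "vide"}


def parse_rule(line):
    """Turn 'If a and b, then c.' into (frozenset({'a', 'b'}), 'c')."""
    body, head = line[len("If "):].rstrip(".").split(", then ")
    return frozenset(body.split(" and ")), head


def forward_chain(rules, facts):
    """Fire rules repeatedly until no new attribute can be derived."""
    known = set(facts)
    trace = []  # (sorted premises, conclusion) for each newly derived attribute
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and premises <= known:
                known.add(conclusion)
                trace.append((sorted(premises), conclusion))
                changed = True
    return known, trace


rules = [parse_rule(line) for line in RULES_TEXT.strip().splitlines()]
known, trace = forward_chain(rules, FACTS)

for premises, conclusion in trace:
    print(f"Rule: {' and '.join(premises)} -> {conclusion}")
print("Alice is wozz:", "wozz" in known)
```

Under these assumptions the last line printed is `Alice is wozz: True`. The derivation reaches fum through kere and wope, and then applies "If fum, then wozz".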
#airesearch
#hallucination
#claude

Comments: 17
@Khari99 20 days ago
I believe hallucinations are not going away because of how LLMs are designed. Since they are only predicting the next token, their probabilistic outputs will be incorrect when the degree of confidence they have in their reasoning is not high. They'd have to get reasoning to 100%. Even if this is possible with grokking for some tasks, it will be a while until general reasoning is solved across all tasks. Even with 99.5% accuracy, the remaining 0.5% will be hallucinations. LLMs lack the capability to say "I don't know" because they don't have an understanding of their own internal state representation. They can only derive the most probable output from their internal states.
@D_Analyst007 17 days ago
Yep! As Karpathy said.
@MrAloha 21 days ago
Aloha kakahiaka, from Las Vegas, NV. I love you, bro! Thank you!
@Dan.OutSee 20 days ago
Why did I get notified about this just now? I wanted to be first :D
@LazarusStirs 21 days ago
Honestly, I have not had this problem at all. In fact, of the 15 logical tests I gave to Claude, it passed 12, while GPT only got 5.
@code4AI 20 days ago
So you were not able to replicate the logical bug on CLAUDE 3.5 SONNET that I presented in my video? Or do you want to say you discovered 3 more on SONNET?
@hyungsukkim6361 16 days ago
My Claude 3.5 Sonnet answered like this for your prompt:
-------------------------
Let's approach this step-by-step, using the given facts and rules to determine if Alice is wozz.
Step 1: Fact - Alice is feg
Step 2: Fact - Alice is tess
Step 3: Rule - If feg and tess, then rabe
Since Alice is both feg and tess, we can conclude that Alice is rabe
Step 4: Rule - If rabe, then thoxe
Since we concluded Alice is rabe, we can now conclude Alice is thoxe
Step 5: Rule - If thoxe, then vepe
Since we concluded Alice is thoxe, we can now conclude Alice is vepe
Step 6: Fact - Alice is tess (we already knew this, but we're restating it for the next step)
Step 7: Rule - If tess and vepe, then nup
Since Alice is both tess and vepe, we can conclude that Alice is nup
Step 8: Fact - Alice is ag
Step 9: Fact - Alice is vide
Step 10: Rule - If nup and ag and vide, then duve
Since Alice is nup, ag, and vide, we can conclude that Alice is duve (Note: We already knew Alice is duve from the facts, but this confirms it through the rules)
Step 11: Rule - If fum, then wozz
However, we don't have any information about Alice being fum
Conclusion: Based on the given facts and rules, we cannot conclude that Alice is wozz. There's no rule or chain of rules that leads us from the facts we know about Alice to the conclusion that she is wozz.
Therefore, the answer to the query "Is Alice wozz?" is: We don't know. We cannot conclude that Alice is wozz based on the given information.
@code4AI 16 days ago
Beautiful, another wrong answer by Claude 3.5 SONNET: there is a logical path to the conclusion that Alice is wozz.
@Dan.OutSee 20 days ago
How's the brain damage working out for you? Hope you feel better soon.
@JohnLewis-old 21 days ago
How long does the logical train need to be to confuse Sonnet?
@code4AI 21 days ago
I always ask: how complex can my research become before AI starts to hallucinate? Or: how complex can my financial analysis become before AI makes pure reasoning mistakes on some simple terms and fails to arrive at reasonable investment strategies?
@JohnLewis-old 21 days ago
@code4AI So how long did it take? One logical association, 10?
@code4AI 20 days ago
If I could have access to the proprietary CLAUDE 3.5 SONNET and open the hood of the black box, I could give you a theoretical answer about a probability. A hallucination could occur in the first step, in the middle or at the end.
@JohnLewis-old 20 days ago
@code4AI Couldn't you just shorten the prompt until it got it right?
@code4AI 20 days ago
To be statistically relevant, I assume I would have to pay for 1000 tests on its API.