LLMs (w/ RAG) need a new Logic Layer (Stanford)

6,118 views

code_your_own_AI

1 day ago

New insights from Google DeepMind and Stanford University on the limitations of current LLMs (Gemini Pro, GPT-4 Turbo) regarding causal reasoning and logic.
Unfortunately, the human reasoning process and all its limitations are encoded in our LLMs, given the multitude of human conversations and reasoning traces on all online platforms (including the logical richness of social media - smile).
No AGI in sight: just rule hallucinations on top of factual hallucinations, and only a linear, sequential understanding. Our LLMs really do learn from us, including all our mathematical and logical limitations.
All rights with the authors:
2024-2-15
Premise Order Matters in Reasoning with Large Language Models
arxiv.org/pdf/2402.08939.pdf
00:00 Intro
01:00 Linear order of reasoning
04:14 Sensitive to premise order
06:31 Maths reasoning
09:52 Insights
12:07 Logical hallucinations
#airesearch
#reasoning
#logic

Comments: 46
@DrJaneLuciferian, 4 months ago
Please, please put links to papers in the text section of the videos.
@stoyanatanasov2699, 4 months ago
Some context: this happens to people too. They need to order the rules to come to the correct conclusion. The random-word rules look like hallucinations to humans as well, so it is normal that the LLM hallucinates in this case. It is a training-data problem (because people prefer chronological order), and it is a random-junk problem, since that input looks like a hallucination anyway.
@PrincessKushana, 4 months ago
I do wonder how a single particular human would fare under such a deep analysis of their cognition. Imagine all the little flaws, quirks, and imperfections of your own cognition being analysed to death like this. I feel it would be fascinating and revealing.
@henkhbit5748, 4 months ago
It is almost the same problem as what was discovered last year: LLMs trained on A -> B fail when asked about B -> A (the "Reversal Curse" paper).
@PrincessKushana, 4 months ago
Seems like you can mitigate this with a good prompt. The trick is to ask the LLM to reorder the problem, and then answer the reordered problem. Like so: "You are an 'Order of Operations Detection' module within an AI reasoning system. Your job is to meticulously analyze text for temporal and causal indicators, clarifying the sequence of events or operations within a problem statement. Your output is always a reordered sequence of events that is in correct chronological order as determined by the temporal and causal indicators in the text. Label this output as "reordered_problem". Once this is done, solve the problem statement labelled "reordered_problem"."
@RPG_Guy-fx8ns, 3 months ago
Triage Chronologically the following prompt, then answer it:
@PrincessKushana, 3 months ago
@RPG_Guy-fx8ns The problem with that prompt is that the LLM will struggle to know which version of the problem to solve. I have much higher success giving the updated problem a label and instructing the agent to solve the labelled version.
@RPG_Guy-fx8ns, 3 months ago
@PrincessKushana So something like: "The following is Prompt A. Triage Prompt A chronologically into Prompt B, then answer Prompt B."
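This reorder-then-solve pattern is easy to wire up as two separate calls. Below is a minimal sketch in Python, assuming the `openai` client package and a placeholder model name (`gpt-4-turbo`); the system prompt is adapted from the suggestion in the comment above.

```python
# Sketch: stage 1 reorders the premises, stage 2 solves only the reordered,
# labelled version. Assumes `pip install openai` and OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4-turbo"  # placeholder; any capable chat model should work

REORDER_SYSTEM = (
    "You are an 'Order of Operations Detection' module within an AI reasoning "
    "system. Analyze the text for temporal and causal indicators and output "
    "the premises reordered into correct chronological order. "
    "Label this output as 'reordered_problem'."
)

def reorder_then_solve(problem: str) -> str:
    # Stage 1: produce the labelled, reordered problem statement.
    reordered = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": REORDER_SYSTEM},
            {"role": "user", "content": problem},
        ],
    ).choices[0].message.content

    # Stage 2: instruct the model to solve only the labelled version.
    answer = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "user",
             "content": reordered + "\n\nSolve the problem labelled "
                        "'reordered_problem' step by step."},
        ],
    ).choices[0].message.content
    return answer
```

Splitting the two steps into separate calls also sidesteps the concern above about the model not knowing which version of the problem to solve.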
@mohit0901, 4 months ago
Really appreciate your vids! Thanks!
@DonaldKronos, 4 months ago
@ 8:45 the language model actually got it right, because it didn't assume that he only went home once. The way it's worded, no money was lost on his way home from withdrawing the 1,000; rather, the money was lost on his way home after converting his bills to five-dollar bills. A better answer would have been to point out the ambiguity in the wording and give both mathematical possibilities, but whoever scored that answer as wrong was making an unsupported assumption.
@tahamohd1409, 4 months ago
That's a phenomenal paper!
@suvirmisra, 4 months ago
Shows that we need to include logical reasoning and advanced logic textbooks in the corpus.
@IvarDaigon, 3 months ago
In the first example it looks like the LLM is ignoring the punctuation, so "The rink also has yellow cars." becomes "The rink has yellow cars, they have 3 times the number of blue cars." If you do RAG chunking at the paragraph level rather than the sentence level, and then sort your paragraphs into chronological order, this may reduce the problem.
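A small illustration of that idea in plain Python, with a stubbed relevance score instead of real embeddings; the assumptions are that paragraphs are separated by blank lines and that original document position is a usable proxy for chronological order.

```python
# Sketch: paragraph-level RAG chunking that re-sorts retrieved chunks
# back into document order before handing them to the LLM.
def chunk_paragraphs(document: str) -> list[tuple[int, str]]:
    """Split on blank lines and remember each paragraph's original position."""
    paragraphs = [p.strip() for p in document.split("\n\n") if p.strip()]
    return list(enumerate(paragraphs))

def relevance(query: str, paragraph: str) -> float:
    """Stub scorer based on word overlap; a real system would use embeddings."""
    q, p = set(query.lower().split()), set(paragraph.lower().split())
    return len(q & p) / max(len(q), 1)

def retrieve_in_document_order(document: str, query: str, k: int = 3) -> str:
    chunks = chunk_paragraphs(document)
    top_k = sorted(chunks, key=lambda c: relevance(query, c[1]), reverse=True)[:k]
    top_k.sort(key=lambda c: c[0])  # restore document order for the prompt
    return "\n\n".join(text for _, text in top_k)
```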
@zd676, 4 months ago
What if we include these reordered examples in the training itself? Or add examples in the prompt?
@tigran.aghababyan, 4 months ago
Perhaps it can be fixed by automatically putting premises in a linear order (self-prompting)?
@ikituriflash, 4 months ago
Thank you, best paper since the dawn of GPT-3.
@technovangelist, 4 months ago
Looking at your examples, I think the outcome is incredibly positive, not negative. LLMs have been absolutely incompetent at math logic. It's fun to play with, but the answers are usually terrible. These examples show that, once again, prompt engineering is very important, and maybe with the right prompt an LLM that has always been stupid at these things might not be quite as stupid as we thought ... for math logic problems.
@boonkiathan, 4 months ago
The word "They" in the moved line gets read as "These yellow cars have 3 times the number of blue cars." It's an ambiguity of speech we sometimes make in real life, due to English grammar, and it applies even to technical truncations and short forms, but the LLM will blunder such cases without any inkling or regret.
@SpongeBobImagination, 4 months ago
At 2:25, Chat with RTX's Mistral 7B int4 gets the correct answer with both of the questions. Here is the response to the second formulation of the question:
Question: A bumper car rink has 12 red cars. They have 2 fewer green cars than they have red cars. The rink also has yellow cars. They have 3 times the number of blue cars as they have green cars. If the rink has 75 cars in total, how many yellow cars do they have?
Mistral 7B int4's response: Let's break down the information given. There are 12 red cars. There are 2 fewer green cars than red cars, so there are 12 - 2 = 10 green cars. There are 3 times the number of blue cars as there are green cars, so there are 10 x 3 = 30 blue cars. The total number of cars in the rink is 75. Now we can use algebra to solve for the number of yellow cars. Let x be the number of yellow cars. We know that 12 + 10 + 30 + x = 75 (total number of cars). Simplifying the equation: 52 + x = 75. Subtracting 52 from both sides: x = 23. So, the rink has 23 yellow cars.
@kevon217, 4 months ago
need a logic balancer…
@suvirmisra, 4 months ago
I don't think this will happen if GPT-4 or other LLMs are trained on a propositional logic corpus. I had suggested that we include all types of logic textbooks, including modal and fuzzy logic, in the original corpus. That would certainly take care of the logical elements in the LLM. Maybe train a smaller model on logic and then merge it with standard models.
@JelckedeBoer, 4 months ago
Thanks, interesting stuff! I am going to test whether the logical ordering of my Python functions has an impact when the LLM works with the codebase. What do you think?
@OccamsPlasmaGun, 4 months ago
It would be interesting to see how an agent system performs that first identifies semantic primitives, then divides them into relevant and irrelevant statements, and then uses the selected statements to reason with.
@mintakan003, 4 months ago
Besides the hallucination, the sensitivity to ordering makes me wonder whether we have unrealistic expectations in getting an autoregressive engine to do logical reasoning. I suspect the future is some kind of LLM + deterministic framework, as suggested by AlphaGeometry. In this case, I wonder whether a hybrid framework would work, e.g. the LLM rewriting the sentences into Prolog and then evaluating from there. It would be another case of "tool use", which is better behaved.
@damianlewis7550, 4 months ago
Any chance of a video on how to perform Logic Filtering for RAG?
@code4AI, 4 months ago
In principle it is easy. Since the authors argue that this behaviour is not a mistake in the LLM processes, but that the LLM is simply learning from human datasets, from all the books written, from all social media posts (maybe a significant portion of the free internet and all human conversations), ... it reflects the way we as humans seem to prefer to think: in a linear fashion. Therefore, if we want to construct a theoretical intelligence with a higher logical complexity, we have at minimum two options. A: create a logical training dataset with higher reasoning complexities, and then pre-train your new LLM on this specific dataset (hint: not a synthetic dataset, since GPT-4-Turbo is linearly limited, given its training data). B: if we have non-linear or higher-order complexities, write code to reduce our higher complexities to the simpler, human-based complexity that GPT-4 can handle (hint: it is not the case that a complexity level 3 task can be reduced to three complexity level 1 tasks. Smile.). So Google is already working on it, and maybe I have a hint for you in my next video ...
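Option B can be read as a preprocessing step that linearizes the premises before they ever reach the model. Here is a toy sketch of that idea, assuming the premises have already been parsed into quantities and their dependencies (the parsing itself is the hard part and is not shown).

```python
# Sketch: topologically sort premises so every quantity appears after the
# quantities it depends on (Python 3.9+ standard library only).
from graphlib import TopologicalSorter

# Hand-parsed bumper-car example: quantity -> quantities it references.
dependencies = {
    "red": set(),
    "green": {"red"},                    # "2 fewer green cars than red cars"
    "blue": {"green"},                   # "3 times as many blue cars as green"
    "yellow": {"red", "green", "blue"},  # derived from the 75-car total
}

premise_text = {
    "red": "The rink has 12 red cars.",
    "green": "They have 2 fewer green cars than red cars.",
    "blue": "They have 3 times the number of blue cars as green cars.",
    "yellow": "The remaining cars out of 75 in total are yellow.",
}

order = TopologicalSorter(dependencies).static_order()
linearized_problem = " ".join(premise_text[q] for q in order)
print(linearized_problem)  # premises now appear in a forward, linear order
```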
@FredPauling, 4 months ago
Seems like the required training data could be synthetically generated in linear logic order then shuffled/modified algorithmically.
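A tiny sketch of that augmentation idea: keep the question and the gold answer fixed and emit shuffled permutations of the premises as extra training examples (the record format here is purely illustrative).

```python
# Sketch: generate order-perturbed copies of a reasoning problem for training.
import random

def shuffled_variants(premises: list[str], question: str, answer: str,
                      n_variants: int = 3, seed: int = 0) -> list[dict]:
    rng = random.Random(seed)
    variants = []
    for _ in range(n_variants):
        perm = premises[:]
        rng.shuffle(perm)               # perturb only the premise order
        variants.append({
            "prompt": " ".join(perm) + " " + question,
            "answer": answer,           # the gold answer is order-invariant
        })
    return variants

examples = shuffled_variants(
    premises=[
        "A bumper car rink has 12 red cars.",
        "They have 2 fewer green cars than red cars.",
        "They have 3 times the number of blue cars as green cars.",
    ],
    question="If the rink has 75 cars in total, how many yellow cars are there?",
    answer="23",
)
```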
@machina-ratiocinatrix, 4 months ago
LLMs have a non-monotonic temporal logic, the same that natural language has. Nothing to be surprised about. It has very little in common with Aristotelian logic. If you are interested, read Paul Hertz, Gerhard Gentzen, and so on, up to Dov Gabbay.
@vbywrde, 4 months ago
Since this is the case, what impact might this have on the use of DSPy, which relies, as far as I can tell, on logical operations from the LLM being accurate?
@automatescellulaires8543, 4 months ago
I asked a few large models, and they said I can.
@marvinkunz843, 4 months ago
Really interesting. Still, I think an argument can be made for interpretation, especially regarding the bill example. Lost bills while returning home -> was he at the bank when he exchanged the 20-dollar bills for 5-dollar bills? One could argue that he could not have done this at home and so might have lost ten 5-dollar bills. For some of the other examples, including the logical premises, however, the order should not have made a difference.
@slm6873, 4 months ago
Wonder how DeepSeek Math RL (7B) would do.
@edsonjr6972, 4 months ago
For the dollar bills one, I got the same wrong answer when I read it from left to right, not gonna lie.
@PhxVigo, 4 months ago
I did the same. I think that might be the point, though. As humans, we tend to work through things in order, so the majority of our written work follows this pattern. The LLMs, trained on human data, learn that this order is important and follow it. This leads to mistakes when rules are out of order. Maybe, then, generating synthetic data that is out of order and training on it will help with this issue?
@EdJones1960, 4 months ago
From @DonaldKronos (above): "The way it's worded, no money was lost on his way home from withdrawing the 1,000, but rather the money was lost on his way home after converting his bills to five-dollar bills." You and the person below (and the LLM) were correct.
@PhxVigo, 4 months ago
@EdJones1960 Interesting. I think the "while getting home" clause in the lost-money step is supposed to match up with the "after getting home" step. That gives the context to put the steps back in the correct order. For the lost-money step to come at the end, he would need to leave the house again to satisfy the "while getting home" part. I guess one could argue it makes no sense to convert money at your house, so he must have left to do that.
@vitalyl1327, 4 months ago
We already have logical reasoning tools that surpassed human ability decades ago: SMT solvers, Prolog and its variants, HOL and other proof assistants. Just let LLMs use them. I had quite a bit of success just providing a small 7B model with access to a Prolog interpreter.
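A minimal sketch of that tool-use pattern, assuming SWI-Prolog and the `pyswip` bindings are installed; the Prolog program here is hand-written to stand in for whatever the LLM would emit for the bumper-car example.

```python
# Sketch: the LLM translates the word problem into a Prolog program, and a
# real Prolog engine does the deduction and arithmetic instead of the LLM.
import tempfile
from pyswip import Prolog

# In a real pipeline this program would come from the LLM; hand-written here.
llm_emitted_program = """
count(red, 12).
count(green, G) :- count(red, R), G is R - 2.
count(blue, B) :- count(green, G), B is 3 * G.
count(yellow, Y) :-
    count(red, R), count(green, G), count(blue, B),
    Y is 75 - R - G - B.
"""

with tempfile.NamedTemporaryFile("w", suffix=".pl", delete=False) as f:
    f.write(llm_emitted_program)
    program_path = f.name

prolog = Prolog()
prolog.consult(program_path)

# Query the deterministic engine for the answer.
print(list(prolog.query("count(yellow, Y)")))  # expected: [{'Y': 23}]
```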
@densonsmith2, 4 months ago
It seems GPT-4 Turbo's performance did not degrade very much: its accuracy drops from 94.1% to 85.0% on the full dataset and from 100% to 89.9% on a subset where the original problems are correctly solved. This would still be nearly human, since for the GSM8K dataset "a bright middle school student should be able to solve every problem. It can be used for multi-step mathematical reasoning." I would think "bright" would mean in the top 25% or even top 10%?
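For reference, the error rates implied by those accuracy figures (plain arithmetic on the numbers quoted above):

```python
# Error rates implied by the quoted accuracies.
full_before, full_after = 100 - 94.1, 100 - 85.0      # 5.9% -> 15.0% errors
subset_before, subset_after = 100 - 100, 100 - 89.9   # 0.0% -> 10.1% errors
print(round(full_after / full_before, 2))  # ~2.5x more errors on the full set
```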
@damianlewis7550, 4 months ago
That’s almost a tripling of errors. I wouldn’t use an LLM to manage my finances on that basis. AGI via LLMs now looks further away than ever. The prompt whitespace and punctuation issue was bad enough, but this is far worse.
@albinasuikis8636, 4 months ago
@damianlewis7550 You pretend that they won't be able to solve this issue. Most people don't even manage their finances, let alone use an LLM to manage them. All intelligent people understand that you use an LLM in addition to your own intelligence, not as a substitute for it.
@mathematischerealitat6634, 4 months ago
Try to solve such tasks just with your brain. Then you will understand these LLMs much better.
@kras_mazov, 4 months ago
These flaws are suspiciously reminiscent of human flaws.
@phobosmoon4643, 4 months ago
I call it cognitive lambda calculus, and it's only bad because the people writing the chatbots are bad at this. This is the kind of thing you have to smoke 3 joints to understand, lmao. Attention doesn't have enough parentheses.
@efexzium, 4 months ago
Fascinating. I need an AI to keep up.
@dennisestenson7820, 4 months ago
Are we surprised? I'm not. If you're surprised about this, then you believe things about LLMs that aren't true.