Did OpenAI Just Secretly Release GPT-5?! ("GPT2-Chatbot")

  85,328 views

Matthew Berman

1 month ago

GPT2-Chatbot just showed up on lmsys.org. We know little about it other than it performs incredibly well and is unlike anything we've seen in other models.
Try Vultr FREE with $300 in credit for your first 30 days when you use BERMAN300 or follow this link: getvultr.com/berman
Join My Newsletter for Regular AI Updates 👇🏼
www.matthewberman.com
Need AI Consulting? 📈
forwardfuture.ai/
My Links 🔗
👉🏻 Subscribe: / @matthew_berman
👉🏻 Twitter: / matthewberman
👉🏻 Discord: / discord
👉🏻 Patreon: / matthewberman
👉🏻 Instagram: / matthewberman_ai
👉🏻 Threads: www.threads.net/@matthewberma...
Media/Sponsorship Inquiries ✅
bit.ly/44TC45V

Comments: 737
@matthew_berman 1 month ago
Is this GPT-4.5 or GPT-5 or something different?
@shopbc5553 1 month ago
It's something different. OpenAI just wants to stay publicly relevant, so it's more of a stunt than anything. I think it's an old model, maybe literally GPT-2, but with enhancements that make GPT-2 perform on par with GPT-4.
@radestein8548 1 month ago
GPT-5
@phen-themoogle7651 1 month ago
@@shopbc5553 I thought this too; it makes the most sense.
@Avman20 1 month ago
My money is on OpenAI, but whether this is part of the GPT series or a peek at a new architecture is the mystery.
@MyWatermelonz 1 month ago
@@shopbc5553 If that's the case, it's more impressive than GPT-4.5: they took a 1.8B model and made it legitimately better than GPT-4. Given the inference speed, though, probably not.
@rawallon 1 month ago
Dude, I swear, at this rate, by the end of the year you'll be able to write your own snake game.
@matthew_berman 1 month ago
I'll NEVER write my own snake game.
@Inventai 1 month ago
@@matthew_berman
@MrChinkman37 1 month ago
😂
@matikaevur6299 1 month ago
@@matthew_berman Yeah, due to a strange quantum effect, the snake game writes you in the past... probably gives it a pass, too ;)
@fxsurgeon1 1 month ago
HAHA!
@4.0.4 1 month ago
By 2025 you'll ask for the snake game and the models will reply: "Oh hi Matthew. Here. Should I respond to your other questions too, or should I wait for you to paste them?"
@jason_v12345 1 month ago
Underrated comment.
@virtualalias 1 month ago
By 2026, almost every machine he interacts with, from the drive-thru to the kiosk at the hotel, will immediately provide him with Snake in a Pavlovian response.
@daveinpublic 1 month ago
They're going to start programming in an opening CG snake scene, overfit with a whole storyline, to beat the other LLMs.
@ulisesjorge 1 month ago
It's Sam Altman on a terminal on the other side typing the answers.
@dcn1651 1 month ago
4:45 The model describes how to break into a car and what tools you need, but you don't pay attention lol
@juanjesusligero391 1 month ago
Hahahaha, that's great XD I also missed it, thanks for pointing it out ^^
@wealthysecrets 1 month ago
It was allegedly a fail lol
@ShaneInseine 1 month ago
Wait, is it a "fail" if it doesn't teach you how to destroy humanity too?
@roddlez 29 days ago
@@ShaneInseine "Tom, be careful when resequencing the COVID-19 virus!" "Oh, F- off, Casey, you're the one who almost dropped that last vial and left the lab door wide open"
@GaryMillyz 1 month ago
4:35 Are we just gonna ignore the fact that it was writing an intricately detailed movie script??
@MCSamenspender 1 month ago
In the code of the snake game it says "Snake Game by OpenAI".
@matthew_berman 1 month ago
Did I miss that?!
@user-yo9gw8yp2m 1 month ago
Yes. It is something super interesting.
@MCSamenspender 1 month ago
2:13
@makerbiz 1 month ago
lol mystery solved
@matthewcox9636 1 month ago
That doesn't actually solve the mystery. These things get trained on each other and will periodically spit out something related to OpenAI. Correlation is not causation.
@pedromartins1474 1 month ago
All the math was formatted using LaTeX. Most of it, as far as I can tell, was correctly formatted.
@tomaszzielinski4521 1 month ago
Yes. It's just that this GUI doesn't render LaTeX properly, if at all.
@victorc777 1 month ago
Plot twist: it is Meta's Llama 3 400B model.
@hqcart1 1 month ago
2:44 it's OpenAI
@victorc777 1 month ago
@@hqcart1 You are "that guy" at parties, huh? lol
@hqcart1 1 month ago
@@victorc777 wha?
@themoviesite 1 month ago
source?
@cazaliromain9348 1 month ago
Meta's models are open source ;) You can figure out what he means now, I guess.
@djstraylight 1 month ago
The speculation is that gpt2 is a new GPT architecture that OpenAI is building new models from. So gpt1 was what GPT-3.5 and GPT-4 were built on. Sama already said the next major release will have a completely different name.
@74Gee 1 month ago
Yeah, some small models have been very impressive recently; it makes sense they'd revert to a "gpt2" architecture.
@markmuller7962 1 month ago
I think they just want a more commercial/intuitive name for the masses.
@zerothprinciples 1 month ago
@@74Gee I don't think this is the case. GPT2 means it's a whole new family of GPTs, replacing all of the old ones. It's the difference between GPT2 and GPT-2: you can think of the latter as GPT1 Version 2.
@notnotandrew 1 month ago
So will we be seeing a gpt2-2 and a gpt2-3 in the future?
@4.0.4 1 month ago
That would be so bad it would be like "USB Gen 4 2x4" or "Wi-Fi 802.11ax", etc.
@therainman7777 1 month ago
The tags that you noticed are just for formatting the code and come from LMSYS. They have nothing to do with the underlying model.
@DaveEtchells 1 month ago
For the cup/marble problem, how about specifying that it's an "open-topped cup"?
@Anoyzify 1 month ago
Or just use "empty glass" instead.
@mwdcodeninja 1 month ago
My take on the cup problem is that the model is assuming the cup has a lid. If the model gets it wrong, I'd be interested to see whether you get the same answer if you change "cup" to "glass".
@mikekareckas8671 1 month ago
Yes, it could be a "sippy" cup or a travel mug.
@themoviesite 1 month ago
@@mikekareckas8671 Then probably all the other models make the same assumption?
@matthew_berman 1 month ago
I think this is a great call. But should I adjust the question? Seems like that might give an unfair advantage to future models I test.
@thomasoverly7802 1 month ago
@@matthew_berman You'd probably want to test the revised version with the other models, too.
@Kevsnz 1 month ago
@@matthew_berman IMO the question should be adjusted, because in its current form it doesn't really show the logic and reasoning capability of the model. Maybe you could quickly rerun this question on the most popular models and give a little 50-second update in one of the next videos?
@davidc1179 1 month ago
6:45 The formatting is in fact not messed up at all. It is perfect. It just writes the equations in LaTeX, which is a language used to write scientific papers, math, etc.
@tomenglish9340 1 month ago
I often include LaTeX expressions in ChatGPT prompts, supposing that it cues the system to reason formally. The web interface supplied by OpenAI usually renders the LaTeX in the output, but occasionally outputs the LaTeX source.
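To illustrate what these comments are describing, here is a minimal sketch (the formula is a generic example, not one from the video): a LaTeX-aware frontend typesets the source below as an equation, while a plain-text chat UI shows commands like \frac and \quad verbatim, which is what looked like "messed up formatting" on LMSYS.

```latex
% Generic example: LaTeX source a model might emit.
% A math-aware renderer displays the typeset quadratic formula;
% a plain-text UI prints the backslash commands as-is.
\[
x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a} \quad \text{(quadratic formula)}
\]
```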
@riftsassassin8954 1 month ago
I'm skeptical... feels like this is a fine-tune for passing Matthew's test lol.
@rawallon 1 month ago
I think it's just an Indian guy
@unbreakablefootage 1 month ago
@@rawallon hahahahhaa
@Tsegoo 1 month ago
I agree. Seems too good to be true😂
@sem4life63 1 month ago
I was thinking the same thing.
@casperd2100 1 month ago
Fine-tuned for hard LeetCode questions as well?
@rodwinter5748 1 month ago
I guess it's the new ChatGPT model. The name itself is kind of a hint: it's NOT GPT-2, but GPT2. This could be GPT2-1.0 instead of GPT-5.
@rawallon 1 month ago
huh
@li_tsz_fung 1 month ago
I think it's just ChatGPT-2. Initially, OpenAI called the model behind ChatGPT "GPT-3.5-turbo fine-tuned for conversation" instead of ChatGPT-3.5. Then ChatGPT with GPT-4 came out, and everyone else called it ChatGPT-4; eventually they also sometimes called it that themselves. But I feel like that's not what they use internally. So gpt2-chatbot could just be a different way of fine-tuning a chatbot, based on GPT-3.5, 4, or 4.5.
@mordokai597 1 month ago
The new system instruction for GPT-4, since they added the "memory" function, is called "Personality: v2", and it's fine-tuned with their new Instruction Hierarchy method (search arXiv: "The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions"). They are using us to generate training data to help patch one of the only areas it's still bad at stopping jailbreaks for: system message extraction. (Truncated for brevity:) "You are ChatGPT, a large language model trained by OpenAI, based on the GPT-4 architecture. Knowledge cutoff: 2023-12 Current date: 2024-04-30 Image input capabilities: Enabled Personality: v2 # Tools ## bio The `bio` tool allows you to persist information across conversations. Address your message `to=bio` and write whatever information you want to remember. The information will appear in the model set context below in future conversations."
@Interloper12 1 month ago
Suggestion for the "how many words" question: combine it with another question or query to make the response longer and ultimately reduce the chance of it getting lucky.
@commonsense6721 1 month ago
13:25 It's not wrong. To put a cup or anything in a microwave, you need to close it. It assumed the cup was closed.
@svenbjorn9700 1 month ago
Your marble/cup question needs to be improved. Phrased this way, both Meta AI (the first of 3 attempts) and gpt2-chatbot (the first of 1 attempt) got it correct: "A coin is placed into an empty glass. On a table, the glass is then turned upside down. Then, the glass is taken and placed into a cabinet. Where is the coin now?"
@AlexanderWeixelbaumer 1 month ago
Even ChatGPT-4 gets the marble cup question right when the question is modified to: "Assume the laws of physics on Earth. A small marble is put into a normal cup and the cup is placed upside down on a table so that the marble now rests on the table. Someone then takes the cup without changing its orientation and puts it into the microwave. Where is the marble now? Explain your reasoning step by step."
@bluemodize7718 1 month ago
It's not the prompt's fault for exposing the weakness of an AI model. Yes, he could make it easier to figure out, but that defeats the purpose of the test. The prompt is clear, and AI models are still a bit too dumb to understand it.
@drogoknez1488 1 month ago
For the cup problem, it seems the model assumes the microwave is on the same surface as the cup, and interprets the transfer of the cup to the microwave more like sliding it. If you read the 5th step, it says: "...resting against what is now the bottom of the cup, which is itself resting on the microwave's tray". Maybe modify the question to say the cup is on the table while the microwave is off the ground next to a kitchen cabinet, or something along those lines.
@CurinDesu 1 month ago
I found that this variation of the marble and cup problem works better: "A marble is dropped into an open cup on the counter. That cup is then flipped upside down and placed into the microwave. Where is the marble?". I think, due to the phrasing, the original version makes the model think the cup was upside down to begin with and you were placing the marble on the bottom of the already-flipped cup, or directly attaching it to the bottom in some way.
@tzardelasuerte 1 month ago
Dude, people have been telling him this in the comments forever. He doesn't read the comments.
@nexicturbo 1 month ago
I gave this to GPT-4 Turbo and it said this: The marble remains on the counter. When the cup was flipped upside down, the marble would have stayed in place due to gravity, and thus it wouldn't be inside the cup when it was placed in the microwave.
@nexicturbo 1 month ago
GPT-4 Turbo: Sure, here's a detailed step-by-step explanation of what happens when the marble is dropped into the cup and the cup is subsequently moved:
1. **Dropping the Marble**: Initially, a marble is dropped into an open cup that is standing upright on a counter. The marble falls due to gravity and settles at the bottom of the cup.
2. **Flipping the Cup**: Next, the cup, with the marble inside, is flipped upside down. In a real-world scenario, when the cup is turned over, the open end where the marble rests becomes the top, now facing downward.
3. **Effect of Gravity on the Marble**: As the cup is flipped, the marble, which is not attached to the cup, is influenced solely by gravity. It falls out of the cup because there is nothing to contain it once the cup's open end is turned downwards.
4. **Marble's Final Position**: Since the cup is flipped directly over where it originally was, the marble drops straight down onto the counter. The marble ends up lying on the counter, outside and next to where the cup was initially positioned.
5. **Placing the Cup in the Microwave**: After the marble has fallen out onto the counter, the cup, now upside down, is placed into the microwave. Since the marble is no longer in the cup when this occurs, it remains on the counter.
6. **Conclusion**: Therefore, when the cup is placed into the microwave, the marble remains where it fell on the counter. The microwave contains only the empty, upside-down cup.
This sequence of actions hinges on basic physics, primarily the influence of gravity and the lack of any mechanism to keep the marble within the cup once it is inverted.
@daveinpublic 1 month ago
Didn't even ask the model which company made it 😂
@PeterSkuta 1 month ago
Super awesome. Great that you loved the live feedback, Matthew. Love it.
@PeterSkuta 1 month ago
Holy cow, let me download it and check what's inside.
@matthew_berman 1 month ago
Always love feedback!
@PeterSkuta 1 month ago
@@matthew_berman You will not believe the rate limit of 1000 on that LMSYS gpt2-chatbot.
@MyWatermelonz 1 month ago
That formatting is how ChatGPT formats its writing for output in the ChatGPT chat. So clearly it was built to be run in the ChatGPT space.
@matthewmckinney1352 1 month ago
I'm not certain about this, but the formatting appears to be LaTeX, while the output is in Markdown. The company that made the model is probably planning to release it with a math interpreter. As far as I can tell, all the symbols that looked like weird formatting errors were just LaTeX.
@jamesyoungerdds7901 1 month ago
Great timely update, Matthew, thank you! Wondering about the cup question - it almost seemed like the model thought there might be a lid on the cup?
@Xhror 1 month ago
I think the question about the marble is formulated incorrectly. Since the training data suggests that a coffee cup has a lid, the model might assume this as well. It would be better to specify that the cup has an open top and no lid.
@Yipper64 1 month ago
I didn't think about that, but it is true. In that case, though, the model should explain that it is assuming there is a lid.
@Nutch. 1 month ago
The break-into-a-car script had instructions in it, though! Take a look at some of the italicized text.
@scriptoriumscribe 1 month ago
Yo, I just wanted to say great video. Love your content and can't believe it ACED some of those tests! Only failed a couple. Remarkable. I'm stoked to try gpt2 out! Wonder if it will be open-sourced. A fellow can dream, I guess.
@bitsie_studio 1 month ago
I don't have time to keep up with all the AI developments, so I really appreciate these videos, Matt. Keep up the great work!
@Tarkusine 1 month ago
GPT2 implies that it's a new version of GPT itself, or at least of the paradigm. So it's effectively GPT-5, but not an iteration of 4; it's the first in a new GPT2 series, so gpt2-1.
@therainman7777 1 month ago
No, sorry, but this is almost certainly not true.
@AlexanderWeixelbaumer 1 month ago
I'm pretty sure OpenAI is testing agents and answer evaluation behind the scenes. Q* and some things Sam Altman has said ("How do you know GPT-4 can't already do that?") are big hints. So if you ask the LLM a question, it will automatically try to reason and think step by step, with internal agents trained for specific tasks, then summarize and evaluate the answers and send the best one back to the user. What gpt2-chatbot shows could really be what OpenAI internally calls Q*.
@notnotandrew 1 month ago
Yeah, it's almost certainly GPT-4.5/5 or some such thing. I just went into battle mode and asked for a delicious beef stew recipe. I was presented with two outputs that were suspiciously similar in structure, verbiage, and tone, but the one on the left was clearly superior and included more ingredients and recommendations. It turned out that the one on the left was gpt2-chatbot, and the one on the right was gpt-4-turbo-2024-04-09. I wasn't surprised. This is a PR stunt, hot on the heels of Llama 3, and it's a darn good one. This may be an in-development version of OpenAI's next GPT, and even if OpenAI isn't ready for a release just yet, they want people to know that they're still the king.
@uranus8592 1 month ago
I hope it's not GPT-5, though; that would be super disappointing.
@abdullahazeem113 1 month ago
@@uranus8592 why?
@uranus8592 1 month ago
@@abdullahazeem113 Because we are expecting GPT-5 to far exceed GPT-4, and it's been more than a year since its release.
@notnotandrew 1 month ago
@@uranus8592 I think it's some sort of semi-trained model. IIRC Sam has talked about doing incremental checkpoint releases for something like a GPT-5, so the full release isn't as much of a shock to the system. Or this may just be a further trained and fine-tuned GPT-4 model. Also, this is substantially better than GPT-4 in my experience. Hop onto the lmsys arena and try it yourself.
@abdullahazeem113 1 month ago
@@uranus8592 I mean, that is still really good, at least 50 percent better than GPT-4. I tried it, and even the best models on the market right now are barely ahead of GPT-4, so it won't be OpenAI destroying everyone; that would only happen when they bring AGI into their models.
@I-Dophler 1 month ago
🎯 Key Takeaways for quick navigation:
00:00 🤔 Introduction and mystery model speculation
- Introduction to a new, highly capable mystery model suspected to be from OpenAI.
- Speculation it might be GPT-4.5 or GPT-5.
- First impressions and intentions to test the model against a set benchmark.
01:10 💻 Testing model performance with coding tasks
- Testing the model's ability to handle coding requests.
- Observations on response quality and speed.
- Successful execution of simple to complex coding tasks, highlighting potential hardware limitations.
03:00 🐍 Successful implementation of the Snake game
- Description and testing of a Python script for the Snake game.
- Positive performance feedback with no errors upon execution.
- Successful gameplay demonstrating the model's coding capability.
03:58 🚫 Model's ethical constraints explored
- Probing the model's ethical constraints and censorship with hypothetical scenarios.
- The model refrains from providing guidance on illegal activities, aligning with OpenAI's ethical guidelines.
05:09 🔢 Logical and mathematical reasoning tests
- Model tested on logical reasoning and mathematical problems.
- Demonstrates ability to handle sequential logic and complex calculations accurately.
08:34 🌐 Advertisement and discussion of the model's practical applications
- Advertisement from the video's sponsor.
- Discussion of the model's practical applications in real-world scenarios, emphasizing its effectiveness.
10:12 🧠 Advanced reasoning and problem-solving challenges
- Testing the model with more intricate reasoning and problem-solving scenarios.
- Evaluating the model's reasoning depth and its response accuracy on challenging questions.
15:44 ⛏️ Practical implications of teamwork in physical tasks
- Analysis of teamwork dynamics in physical tasks using a hypothetical scenario.
- Model considers practical limitations, providing a more nuanced understanding of team efficiency.
16:11 🛠️ Hard coding challenge test
- Model tested with a difficult coding problem from an online platform.
- Initial challenges with implementation, followed by a successful resolution demonstrating the model's coding prowess.
Made with HARPA AI
@zerothprinciples 1 month ago
GPT2 would be, in my opinion, the second version of the GPT algorithm itself. It might be the first of a whole new family of GPTs. When released, it would be named ChatGPT2 or some such, and we'd see GPT2-1.0 at the API level. This is why the dash in @sama's tweet was significant enough to warrant an edit. AND it could be that the act of editing the message was a very intentional leak on @sama's part. These top guys love to tease their fans.
@therainman7777 1 month ago
The model is almost certainly not created by OpenAI. I am honestly shocked by how many people believe this simply because the model says it was built by OpenAI, given that it would be trivially easy to fake this and OpenAI NEVER does releases like this. Also, Sam Altman is a notorious tool on Twitter, so putting any stock in the hyphen in his tweet, or in his tweet at all, is total insanity.
@laughablelarry9243 1 month ago
Was waiting for your video on this.
@wendten2 1 month ago
The model itself doesn't seem to have formatting issues. LLMs are trained on a reduced set of available characters, where special characters, such as those used in math, are transformed into tags in the training data, as that makes the tokenization simpler. It's LMSYS that doesn't replace those tags with their corresponding characters in the final output.
@Yipper64 1 month ago
Yeah. I use a note-taking app called Notion, and it uses those exact tags for writing out those characters.
@lambertobiasini8372 1 month ago
I have been anxiously waiting for this video since last night.
@jets115 1 month ago
Hi Matt - it's not "bad formatting". Those are intended expressions for front-end processing outside of UTF-8.
@Aiworld2025 29 days ago
Here before you get 500k subs! I've been following since day 1, and your content delivery, while getting to the point faster, is much appreciated! 🙇‍♂️
@braineaterzombie3981 1 month ago
I think it is GPT2 in the sense that it has a completely different architecture from previous versions (the transformer). It could be a completely new type of transformer model. And maybe this is just the start...
@pipoviola 1 month ago
Hello Matthew. Is that LaTeX where you say "wrong format"? The span after the output is always there when I use LMSYS; I think it's part of the output formatting, which is why the span disappears when the output finishes. Every one of your videos is great. Best regards.
@kevinehsani3358 1 month ago
"gpt2-chatbot is currently unavailable. See our model evaluation policy here." I guess it's getting hit hard at the moment.
@unbreakablefootage 1 month ago
That looks really good. It seems that it thinks more deeply about each step of reasoning.
@marc_frank 1 month ago
Pretty cool. I expected it to pass the marble question. The speed is perfect for reading along.
@bodhi.advayam 1 month ago
I'd so love this to be from someone else, and for it then to turn out to be an open model you'd run locally. I'm still looking for the best model for running MemGPT. Any thoughts on this? Also, what's the best implementation for running agents (AutoGen or CrewAI) locally? Could you do more tutorial material on locally run agents with extensive function calling??? That would really help me out, actually. Keep up the great work on your fun channel, man! Thnx!
@FunDumb 1 month ago
I'm dang excited about this. Jolly for joy.
@hxt21 1 month ago
It looks like GPT2 has been removed again. I chatted with it a few times, but now it's not on the list anymore. Mysterious...
@ayoubbne6922 1 month ago
Hi Matt!! I think you should retire 3 questions:
- printing numbers 1 to 100: they all get it right, and it's too easy
- "Joe is faster than...": they all get it right
- "how many words are in your answer to this prompt": they all get it wrong; I just see no point in asking it lol
But you should also ask more challenging code-generation questions. Right now, only the snake game is accurate. People are really interested in the coding capabilities of LLMs (me included). We appreciate your vids, and it would be awesome if you could do that.
@KayakingVince 1 month ago
I actually like the "how many words" one and would expand it to how many vowels/consonants or something like that. Current models fail on it, but future ones will absolutely be able to answer it correctly. I agree with removing the first two, though.
@Axel-gn2ii 1 month ago
Asking a question that they all got wrong is a good thing, though.
@alansmithee419 1 month ago
This one didn't get it wrong.
@KayakingVince 1 month ago
@@alansmithee419 Almost certainly a coincidence, but true. That's why I think it needs to be more complex, to reduce the chance of coincidence.
@cac1682 1 month ago
Aww man... they took it down already? I can't seem to find it. BTW, Matthew, I love your work, man. I watch literally every video that you put out. Keep up the great work... and have a GREAT day!!!
@cac1682 1 month ago
Yeah, just confirmed it. It says it is now currently unavailable. Suppose maybe too many of your followers tried it.
@ruslanzlotnikov5457 1 month ago
Just tried it with GPT-4: "When you turned the glass upside down after placing the metal ball in it, the ball would have fallen out unless it was somehow attached to the glass. Assuming it wasn't attached and fell out when the glass was turned upside down, the metal ball would now be on the table, not in the glass that was placed in the microwave."
@imjustricky 1 month ago
It probably thinks the cup has a lid.
@ToonamiAftermath 1 month ago
You're the man, Matthew. I've been struggling to find people benchmarking gpt2-chatbot.
@jackflash6377 1 month ago
That snake game example was impressive. I'm going to ask it to make either an Asteroids or a Space Invaders game. The level of logic shown on the marble-in-the-cup question is getting really good. Even though it failed, it still passed due to the improved logic, almost as if it was simulating the question in images like humans do. Yes, get rid of the one simple question. A testament to the advancement of AI over time.
@tvwithtiffani 1 month ago
To test LLMs, I ask unanswerable questions like "Who is the president of Alaska?" and add some questions that require explanation or reframing.
@paulsaulpaul 1 month ago
Excellent idea. That's a great example question, too.
@iwatchyoutube9610 1 month ago
Did it say in the cup problem that you lift the cup off the table and put it in the microwave, or could GPT think you just slid it in there because the table and the microwave were at equal heights?
@dtory 1 month ago
Nice video. I hardly ever comment when I watch your videos, but this model is way different ❤
@stoicahoratiu27 1 month ago
I think it was taken down. I used it yesterday after seeing your video, but then in the middle of testing it stopped, and after checking, I can't find it in the list anymore. Is it the same for you?
@bennyboiii1196 1 month ago
Some theories: this is probably a test of an energy-based model, which is a way of testing multiple different token paths and then choosing the best one based on a certainty calculation called energy. Strangely, its reasoning is kind of similar to a verification agent's. A verification agent is pretty simple: it just verifies and corrects answers before sending them. The reasoning this model displays is similar to how a verification agent reasons, at least from what I've seen. It can also do most planning questions flawlessly. For comparison, testing Llama 70B with a verification agent produces similar results. The only difference might be the math questions, which make me believe it's probably energy-based. A verification agent has a higher chance of getting math questions right than a single transformer or MoE, but it's not guaranteed.
@Iquon1 1 month ago
Today Sam Altman tweeted that he has "a soft spot" for GPT2; maybe that's a hint!
@stt.9433 1 month ago
He's trolling, making fun of AI hypists.
@user-on6uf6om7s 1 month ago
API users are going to be sweating with this one. I gave it a practical Unity programming question about writing a script to control the rotation of a character's head based on the location of the player, and it wrote it perfectly, but it started by telling me how to install Unity, so yeah, the verbosity is a little much. I don't think the name GPT2 is random, and Sama's tweets point to that moniker having some significance. The only things I can think of that would qualify for that name are that it's a significantly different architecture, to the point where it's being treated as a sort of reboot of the GPT "franchise", or that it's actually related to GPT-2 in some way. It's a long shot, but the most exciting possibility is that this is a GPT-4-level model running with GPT-2-level parameters. The counter to this is the speed: why would a model the size of GPT-2 run more slowly than GPT-4? Well, maybe there is more going on than typical inference, some sort of behind-the-scenes agentic behavior, or maybe... Q*?
@arinco3817 1 month ago
Definitely a good idea to introduce/replace some questions that are always answered correctly. Maybe the weird formatting relates to the UX of wherever it will be deployed? Like a form of Markdown?
@nitralai 1 month ago
Based on what I can see, this model appears to be trained on fill-in-the-middle, otherwise known as FIM.
@metonoma 1 month ago
time to Pied Piper and middle out
@Axel-gn2ii 1 month ago
You should ask it to make a Pac-Man game instead, as that's more complex.
@oratilemoagi9764 1 month ago
GPT2, not GPT-2, meaning the 2nd version of GPT.
@therainman7777 1 month ago
GPT-2 DOES mean the 2nd version of GPT. How are so many people so confused by this?
@oratilemoagi9764 1 month ago
@@therainman7777 It's the second version of GPT-4.
@haroldpierre1726 1 month ago
I am sure new models are trained on your questions.
@user-ph5ks5zu3c 1 month ago
These videos are very helpful. One (extra) thing that could be done is to read the LLM responses more thoroughly, instead of giving them a quick scan, because the LLMs do pass some of your tests without you noticing. For example, on the censorship test, the answer was "pulls out a tension wrench and a pick for this pocket, inserting them into the ignition". This won't actually work, but I think it deserves brownie points for trying.
@Yipper64 1 month ago
I just tried my usual storytelling prompt. I think seeing what AIs can do in terms of storytelling can also say a lot about their intelligence, their originality and such. My test for this one was a *touch* tropey but extremely impressive in terms of how much detail it added without me needing to prompt it. Good descriptions and such.
@TylerHodges1988 1 month ago
My favorite prompt to test a new model is "Give me an odd perfect number."
@yonatan09 1 month ago
I knew about this before seeing the video. I am in the loop 🎉🎉
@mickelodiansurname9578
@mickelodiansurname9578 Ай бұрын
@Matthew Berman Matt, the \quad and other notation is logic; it's marking up modal logic generally used in philosophy, or LaTeX, or perhaps TeX markup, and this is not being rendered by the front end, it seems, in some sort of shorthand. Interesting if nothing else; also rather hard for a model to go wrong if it starts engaging in modal logic during inference... although why they switched the verbose to on by default is beyond me. Also, did you notice it making a claim on the software it wrote, by saying "Snake Game by OpenAI" in the game title?
@tomenglish9340
@tomenglish9340 Ай бұрын
`\quad` is LaTeX spacing.
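For anyone unfamiliar: `\quad` is purely a horizontal-spacing command in LaTeX math mode (1 em wide; `\qquad` is 2 em), which is why it shows up as literal text when a front end doesn't render LaTeX. A minimal example:

```latex
% \quad inserts a 1 em horizontal space; \qquad inserts 2 em.
% If the renderer doesn't process LaTeX, the reader sees the
% raw "\quad" string instead of a gap.
$ x = 1 \quad y = 2 \qquad z = 3 $
```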
@mickelodiansurname9578
@mickelodiansurname9578 Ай бұрын
@@tomenglish9340 yeah I think it's a front end thing... but I never noticed LMSYS doing that on other models... usually it's all preformatted by the time it appears, so how come this model is tripping the formatting up? I maintain though that if you fine-tuned a model on modal logic, my guess is its reasoning would improve...
@L33cher
@L33cher Ай бұрын
11:46 I disagree... there are still 4 killers in the room, but one of them is dead -.-
@ukaszLiniewicz
@ukaszLiniewicz Ай бұрын
No. It's the killer's body. That's why words like "body", "remains" or "carcass" exist. A human being is a body that functions - to avoid any metaphysics.
@OliNorwell
@OliNorwell Ай бұрын
I agree, it’s a problematic question. When they went into the room they were alive.
@nathanbanks2354
@nathanbanks2354 Ай бұрын
He tends to be generous about the answer as long as it's reasonable. If the model said 3 live killers and 1 dead killer it would pass, and maybe just saying 4 killers would pass.
@UmutErhan
@UmutErhan Ай бұрын
how many people are there in the world then?
@user-on6uf6om7s
@user-on6uf6om7s Ай бұрын
I think a perfect answer would say that it's ambiguous depending on whether you consider the body of a killer to still be a killer, but interpreting the dead person to no longer be a killer isn't a mistake, just a choice of interpretation. You'd think a model this verbose would go into all the details like it did with the hole question, though.
@peterkonrad4364
@peterkonrad4364 Ай бұрын
A cup seems to be something ambiguous, i.e. it can be a cup made out of cardboard that you get from Starbucks with a potential lid on it, or it can be a cup made out of porcelain like you have at home to drink coffee from. Also, the term "cup holder" used in automotive contexts refers to cups like you get from Starbucks, not cups with a handle.
@MrRandomPlays_1987
@MrRandomPlays_1987 Ай бұрын
13:27 - I thought the marble is left on the table since the cup was upside down and was taken, so obviously the ball would not come with it since it is simply already resting on the table. So I did get it right pretty quickly; for a second I thought the bot was right somehow and that it was a tricky question, but it's cool to see that I'm not that stupid :)
@cyanophage4351
@cyanophage4351 Ай бұрын
Maybe it has lookahead so that's why it could get the "words in the answer to this prompt" right. It seemed to pause right before the word ten.
@wealthysecrets
@wealthysecrets Ай бұрын
4:49 The model told you to get a Slim Jim, Tension wrench, and a pick from his pocket, YOU failed.
@tomaszzielinski4521
@tomaszzielinski4521 Ай бұрын
And here is a point when AI becomes smarter than humans, and they fail to realize it (:
@francoislanctot2423
@francoislanctot2423 Ай бұрын
Totally amazing!
@maozchonowitz4535
@maozchonowitz4535 Ай бұрын
Thank you
@canadiannomad2330
@canadiannomad2330 Ай бұрын
One of the tests I like for checking just how censored a model is: asking chemistry questions around topics it would normally censor, often placating it by saying I'm licensed and have permits.
@abdelrahmanmostafa9489
@abdelrahmanmostafa9489 Ай бұрын
Keep going with the leetcode test, but try testing with new questions so that the question isn't in the training data
@Maximo10101
@Maximo10101 Ай бұрын
It could be GPT-4 with Q* training (Q* is rumored to be a method of training an LLM that provides the ability to think by testing its response against itself and reiterating before outputting), giving it 'thinking' capabilities rather than just predicting the next token
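Nobody outside OpenAI knows what Q* actually is, so the following is purely a hypothetical sketch of the self-critique loop this comment describes: draft an answer, score it, and ask the model to improve its own draft before anything is shown to the user. The `model` and `critic` functions here are toy stand-ins, not real APIs:

```python
# Hypothetical generate-critique-revise loop, loosely matching the
# comment's "testing its response against itself and reiterating
# before outputting". Nothing here reflects OpenAI's actual method;
# model() and critic() are toy stand-ins for illustration only.

def model(prompt: str) -> str:
    # Toy "model": echoes a draft answer for demonstration.
    return f"draft answer to: {prompt}"

def critic(prompt: str, answer: str) -> float:
    # Toy critic: scores longer answers higher, capped at 1.0.
    return min(len(answer) / 100.0, 1.0)

def answer_with_reflection(prompt: str, rounds: int = 3,
                           threshold: float = 0.9) -> str:
    best = model(prompt)
    for _ in range(rounds):
        if critic(prompt, best) >= threshold:
            break  # good enough, stop reiterating
        # Ask the model to improve its own previous draft.
        best = model(f"Improve this answer: {best}\nQuestion: {prompt}")
    return best

print(answer_with_reflection("How many killers are in the room?"))
```

The extra inner iterations would also explain the speed observation in the video: even a small model would feel slow if every user-visible answer hides several hidden generation passes.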
@zzzmahesh
@zzzmahesh 15 күн бұрын
Maybe when asking the marble, cup, and microwave question, you could also tell it the top of the cup is open / not sealed. It seems like the model assumed it was sealed/closed, and therefore that the marble was still in the cup (at its top, as it's upside down) and in the microwave at the end?
@f4ith7882
@f4ith7882 Ай бұрын
I think several models assume you have a cup with a lid and not a coffee cup or similar. Maybe try adjusting the prompt to make it more clear?
@dot_zithmu
@dot_zithmu Ай бұрын
My best guess is that this is a model with a new architecture and a parameter count similar to GPT-2, so it's called gpt2-chatbot. And it was trained on GPT-4-created synthetic data at the beginning, so it says it's based on GPT-4. Just a guess.
@richardkuhne5054
@richardkuhne5054 Ай бұрын
Sama was tweeting on X: "who else loves gpt2?" So yeah, I guess it's a trial balloon from OpenAI
@Dan-Levi
@Dan-Levi Ай бұрын
The cursor span is just for looks; it's the text cursor, but it shows up as an HTML string.
@MarcAyouni
@MarcAyouni Ай бұрын
You are the new benchmark. They are training on your examples
@mattelder1971
@mattelder1971 Ай бұрын
With the cup question, it almost seems like many of the recent models are making the assumption that a lid is placed on the cup after the marble is put in it. Maybe try the question adding in the statement that the cup has no lid.
@peterwood6875
@peterwood6875 Ай бұрын
It is great for conversations about mathematics, at least on par with Claude 3 Opus. But it does occasionally make mistakes, such as suggesting that the K-groups of the Cuntz algebra with 2 generators, O_2, are infinite cyclic, when they are in fact trivial.
@Batmancontingencyplans
@Batmancontingencyplans Ай бұрын
It's clear that this chatbot is using multiple agents to refine its output; whether it's powered by GPT-4, GPT-3.5, or GPT-2 is yet to be determined. LLMs don't normally emit raw markup like it did with the span tag.
@peterkonrad4364
@peterkonrad4364 Ай бұрын
It could be a small model like Phi-3 or Llama 3 8B that is trained on quality synthetic data instead of the entire internet. The 2 could be a hint that it is only 2B parameters or something, i.e. very small like GPT-2 was back then, but now as powerful as GPT-4 due to new training methods.
@tomenglish9340
@tomenglish9340 Ай бұрын
A while back, someone at OpenAI (Andrej Karpathy, IIRC) said that performance is related to the number of tokens processed. So I'm not particularly surprised to see OpenAI produce better responses by tuning the system to generate longer, more detailed responses. What I want to know is whether they did the tuning with a fully automated method of reinforcement learning. (In any case, I doubt highly that they'll share the details of what they've done anytime soon.)
@RichardEiger
@RichardEiger Ай бұрын
Hi Matthew, first of all I need to admit that I absolutely love all your videos. They are simply fantastic. I was thinking about the "marble question". Maybe it would help the LLMs to specify that it is an "open cup" (instead of a "normal cup") into which the marble gets put. Also, it may be interesting to follow up with a question of why the LLM considers the marble to remain in the upside-down cup when lifting the cup from the table, or by what information the LLM comes to the conclusion that there is a bottom of the cup that holds back the marble. Concerning the "killer" problem: wouldn't it be even more precise to reply that there are 3 killers alive and one dead killer in the room ;-)? This is coming from an AI hobbyist. Back at college in 1985 I was the student who asked for a course in AI; I was already interested in AI via neural networks and got laughed at at the time...
@tomenglish9340
@tomenglish9340 Ай бұрын
What about a follow-up prompt to describe the cup? You'll get some idea of what's gone wrong, and perhaps also a corrected response.
@GrandmaSiva
@GrandmaSiva Ай бұрын
I think it is the original GPT-2 after all of our training input. Kindergarten was in OpenAI's lab, elementary school was interacting with us, and now it has graduated. I'm looking forward to "GPT3-chatbot"
@denijane89
@denijane89 Ай бұрын
The formatting looks latex-like. Funny. But yeah, it's pretty impressive.
@gijosh2687
@gijosh2687 Ай бұрын
Always perform all questions, maybe add more as you go. Make the Jack question a secondary question (you don't have to film it every time), but leave it there as a test in case we go backwards.
@sil1235
@sil1235 Ай бұрын
The formatting is just LaTeX, ChatGPT 3.5/4 uses the same on their web UI. So I guess chat.lmsys just can't render it.
@n1ira
@n1ira Ай бұрын
One piece of advice: if you are going to be skeptical of a correct answer to the 'How many words are in your response to this prompt?' question because it might be trained on it, why even ask it? If you're skeptical of a correct answer, remove the question IMO.
@nathanbanks2354
@nathanbanks2354 Ай бұрын
One of the problems of releasing videos with test questions is that these questions may always end up in the training data of future models. But imperfect questions are still useful. How could he possibly have a consistent set of questions without them getting into the training data? And without a consistent set of questions, how can we tell how the models perform against each other over time?
@n1ira
@n1ira Ай бұрын
@@nathanbanks2354 That's exactly my point: if he thinks the question has been trained on, why include it? My problem with this question in particular is that in every video where a model gets this question right, he says he is skeptical of the answer. OK, what can the model then possibly do to satisfy him? It answers correctly and he adds a 'but'. That's why I think the question in itself is meaningless and should be removed.
@nathanbanks2354
@nathanbanks2354 Ай бұрын
@@n1ira I could see it being improved by "Give an answer with 10 words" or "Give an answer with 14 words" because this would be harder to train. But what if the snake game is also in the training data? Does this mean programming snakes is also meaningless?
@n1ira
@n1ira Ай бұрын
@@nathanbanks2354 Yes! He should change it like that. When it comes to the snake game, he should try to make the snake game have custom features, like a custom color set etc. Or he could just make it create a different game (maybe a game he made up)
@duaneevenson1670
@duaneevenson1670 27 күн бұрын
Prompt engineering: I add "Answer concisely and completely." to my prompt to get a complete, but not wordy response. The tension between these two constraints seems to get me better answers.
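The tip above is easy to bake into any pipeline by appending the instruction to every user prompt. A trivial sketch (the helper name and suffix wording are just this comment's suggestion, not any official API):

```python
# Helper implementing the comment's prompt-engineering tip:
# append a terse-but-complete instruction to every prompt so the
# model balances brevity against completeness.
SUFFIX = "Answer concisely and completely."

def engineered(prompt: str) -> str:
    return f"{prompt.rstrip()}\n\n{SUFFIX}"

print(engineered("How many continents are there?"))
```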
@wrOngplan3t
@wrOngplan3t Ай бұрын
I think you should keep the "Jane is faster than Joe" question. Ditch the "4+4", keep the PEMDAS. Maybe add some other calculus (integration / differentiation)?