Claude 3.5 Sonnet vs GPT-4o: Side-by-Side Tests

Views: 52,161

1 day ago

The ultimate showdown between two of the most advanced large language models on the market: OpenAI's GPT-4o and Anthropic's Claude 3.5 Sonnet. In this video, I put these models to the test in a series of head-to-head challenges to determine which one truly reigns supreme. I evaluate their responses to various prompts, awarding points to the model that delivers the best performance in each category. Will Claude 3.5 Sonnet live up to its reputation as the best LLM available, or will GPT-4o take the crown? Join me for an in-depth comparison and find out which model comes out on top!
I hope you learn something from this video. Comment with any questions, and I'll make sure to respond!
***
Link to text responses from the video: gist.github.com/patrickstorm/...
***
0:00 - Intro
0:27 - Highlights and Benchmarks of Claude 3.5 Sonnet
3:12 - Showdown rules
3:58 - Round 1: Creative Writing
6:55 - Round 2: Image Descriptions
9:09 - Round 3: Coding
15:31 - Round 4: Sentiment Analysis
17:05 - Round 5: Question Answering
20:45 - Round 6: Image Generation
21:07 - Round 7: Conversational Skills
22:26 - Round 8: Summarization
23:53 - Final results & What model am I going to use?

Comments: 151

@AnthonyGoubard 5 days ago

I think giving no points when both are correct may bias the final result. Let's say you've done 20 tests: 15 give the same result, GPT-4o is better in 1, and Claude Sonnet is better in 4. The score is then 4-1 for Claude Sonnet, but actually it's more like 19-16.

@luigideff 5 days ago

Yea exactly. It totally makes a perception difference.

@beautifulandtoolate 5 days ago

A draw is typically 0.5 points each

@FloatingWeeds2 5 days ago

Do 0.01 points each for a draw so you can have two different categories of data in one number 😘

@PatrickStorm_ 4 days ago

Good call, it looks a lot different when you include ties. I recalculated the final scores adding in a point for ties, and the final tally is GPT-4o: 17 and Claude 3.5 Sonnet: 19. That shows a clearer picture of how close these models are 👍

@PatrickStorm_ 4 days ago

Nice. That is my kind of useless optimization 😎 This would end up with GPT-4o at 6.11 and Claude 3.5 Sonnet at 8.11
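
The tie-scoring options in this thread can be checked with a few lines of Python. This is only a sketch: the 6 and 8 outright wins and the 11 ties are inferred from the totals quoted above (6.11 / 8.11 and 17 / 19), and the names `WINS`, `TIES`, and `final_scores` are made up for illustration.

```python
# Tallies inferred from the totals quoted in the thread (an assumption):
# 6 and 8 outright wins, plus 11 rounds scored as ties.
WINS = {"GPT-4o": 6, "Claude 3.5 Sonnet": 8}
TIES = 11


def final_scores(wins, ties, tie_value):
    """Return each model's total when every tie is worth `tie_value` points."""
    return {model: round(w + ties * tie_value, 2) for model, w in wins.items()}


# Compare the schemes suggested in the comments: 0, 0.01, 0.5, and 1 point per tie.
for tie_value in (0, 0.01, 0.5, 1):
    print(tie_value, final_scores(WINS, TIES, tie_value))
```

Whatever a tie is worth, the gap stays at 2 points; only the ratio between the two totals changes, which is why 19-17 reads as much closer than 8-6.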

@user-nl6dg2mp8p 5 days ago

Summary: 1. both are great. 2. don't use either for fact finding. 3. Since they are both free, use both simultaneously.

@PatrickStorm_ 4 days ago

A bit reductive :) But yeah, that's the gist! Both are really good and have different strengths.

@drlordbasil 5 days ago

Claude Sonnet is wayyy better for complex tasks and assistance in debugging.

@KroeSufos 1 day ago

Perfect for my work!

@Ivan7Kovnovic 5 days ago

The GDP 2018 question was actually answered correctly. According to every source I found on the internet, Germany was 4th and the UK was 5th.

@manuelcardoso5830 3 days ago

Agreed, I saw the same thing. Both AIs were right.

@gege151500 1 day ago

I actually found India on the IMF website?

@PatrickStorm_ 16 hours ago

Yup, I messed up here 😬 I saw the GDP based on PPP rankings, and mistook it for nominal GDP. The AIs got it right! Source: en.wikipedia.org/wiki/List_of_countries_by_GDP_(PPP)

@drlordbasil 5 days ago

Beautifully done on the video bro.

@blueicicle1973 5 days ago

it sounded a little biased towards Claude

@PatrickStorm_ 16 hours ago

Yeah, I have to figure out how to do this as a blind test next time. Thanks for the feedback

@suleymanbolek7296 5 days ago

This was the best comparison video on KZfaq. Great job man, subscribed.

@NithinJune 1 day ago

for the coding tests you should do a plagiarism check to check if it is straight ripping someone’s code

@briankgarland 5 days ago

I pay for both, primarily for coding, and haven't used 4o since Sonnet came out.

@vm_jayfus9332 4 days ago

Your channel deserves sooooo Much more attention😮

@promansarwar4374 5 days ago

Great content man. This channel is going to blow up soon

@MrAmad3us 5 days ago

Claude premium plan gives less messages / dollar. It’s significantly more consistent in long and complex convos, but you reach the 5h message limit quickly

@gideons6126 5 days ago

Cheaper per token if you use the API which is how I go for it but I agree with you about premium value

@fool-on-the-hill 5 days ago

Yeah, I ran into this quickly, not knowing there was a limit. The only limit I ran into with ChatGPT is the number of images I could create, and even that took me a while to reach. That stinks. Oh, and Claude can't do images… Currently, I'm subscribed to both to see if I can tell which one's better, but I'm not sure I can.

@ktms1188 3 days ago

Claude 3.5 and GPT-4o both have their strengths, and it's fascinating to see how they differ. Claude feels more human, like it's really trying to understand what I'm asking. Then again, I've noticed that with GPT's memory function the model seems to know a lot more about what I'm trying to ask, so it now gives much better answers, like Claude 3.5. My issue with Claude is that sometimes it hits those frustrating blocks and says it's unable to answer my question, which drives me nuts even when it's nothing controversial and it clearly would know the answer. I noticed in one of their talking points that this is one of the big things they are working on: they know it is overly restrictive and want to improve that. GPT-4o, on the other hand, is super analytical but occasionally needs me to rephrase my questions to get the best answers. I've been using both for a while now, and here's what I've found: Claude's Artifacts mode is mind-blowing, and the app is nice if you're on an iPhone or iPad, though there is no Android app. GPT's memory function is a game-changer, making it more accurate over time as it learns from our interactions. Wouldn't it be amazing if they combined the best of both worlds? I'd love to see a deep-dive comparison between custom GPTs like "Scholar" and the standard GPT-4o, especially for fact-based questions. Does the customization really boost accuracy?

@JosefTorkelsen 5 days ago

Great job on the video dude! I also agree with your results for yourself at the end that discusses how you plan to use them. I like Claude but without those extra things, Chat GPT is my daily driver.

@NithinJune 1 day ago

Claude genuinely seems so exciting

@RanLM1 2 days ago

Great video. Thank you. Subscribed

@truecuckoo 5 days ago

Isn't the biggest improvement with GPT-4o the audio conversational voice chat skills? Not converting audio to text prompts, but actually understanding the tone of the voice itself etc. It comes across as pretty nuanced in the demos I've seen.

@PatrickStorm_ 4 days ago

Absolutely! But that isn't available to anyone but insiders yet. But even without the audio, when GPT-4o came out, it was top of the leaderboards for pretty much everything.

@djayjp 5 days ago

19:32 You forgot to give the point to GPT-4o in the tally.

@AJBarea 5 days ago

he gives gpt-4o two points and claude 0 points for that whole category less than a minute later around 20:13

@user-ql3es4zi1f 5 days ago

GPT’s recognising Obama’s prank is astonishing

@zejdzglebiej 5 days ago

The question is, what do you mean by writing a better text? I'm afraid that you evaluate texts too positively, where there is understatement, and a lack of logical structure with opening and closing. You perceive it as an aura of mystery. That's why Clodie cheated on you, because what he couldn't do, you interpreted as good writing.

@u4icdissonance180 5 days ago

Excellent video, I think you did a good job of being objective. Just a note, if you're looking for conversation and support, tell GPT you're looking more for an emotionally supportive answer than a solution based answer. Then what you get out of it is very similar to Claude. Claude still deserves the point because your average user likely won't think to try that, but the option is there for people that want more conversational GPT.

@nahadmaniyot6790 2 days ago

Sure, here is a summary of the video in points:
* This video compares two large language models, Claude 3.5 Sonnet and GPT-4o, by giving them a series of head-to-head tests.
* The winner is determined by the video creator, Patrick Storm, based on his subjective criteria.
* Claude 3.5 Sonnet wins in creative writing, dialogue generation, sentiment analysis (partially), conversational skills, and summarization (when length is considered).
* GPT-4o wins in factual question answering and image generation (because Claude 3.5 Sonnet doesn't have an image generation model).
* They tie in summarizing a research paper and coding (when the prompt is simple).
* Overall, Claude 3.5 Sonnet performs better based on the video creator's evaluation.

@johnreimer4361 1 day ago

Ask Claude 3.5 Sonnet: "List the first 12 people in order that landed on the moon." The result is correct, and it shows that the 11th person was, in fact, Eugene Cernan. It pays to ask questions that force a step-by-step process of answering.

@AbhisheksinghbhadauriyaG 4 days ago

00:03 Claude 3.5 Sonnet outperforms GPT-4o in benchmarks
02:27 Claude 3.5 Sonnet outperforms GPT-4o in speed and live code demonstrations
04:46 Claude 3.5 Sonnet outperforms GPT-4o in creative writing tests
07:13 Comparison of performance between Claude 3.5 and GPT-4o models
09:52 Comparison of Claude 3.5 Sonnet and GPT-4o
12:27 Difference in code review of Claude 3.5 Sonnet and GPT-4o
14:57 Comparing Claude 3.5 Sonnet and GPT-4o
17:28 Comparison of GPT-4o and Claude 3.5 Sonnet performance on trivia questions
19:49 GPT-4o performed better in factual accuracy
22:04 Claude 3.5 Sonnet outperformed GPT-4o in understanding and summarizing human emotions
24:18 Claude 3.5 Sonnet offers better performance and cost saving

@djayjp 5 days ago

I'm pretty sure limes float if they have been squeezed and sink if they haven't.

@jerkface38 5 days ago

The esther question is also dumb. Let's be honest here, the info on the internet is ambiguous. Lots of info saying she was married in 1981 or 1982. Only the wiki says m. 1985 but with no real context. Had he asked both models why they said what they said, he would have gotten a refined response. Also, he only gave 1 point for image generation? I'm sorry but that's at least a 3 pointer even if some of the images suck.

@MudroZvon 5 days ago

Instant subscribe!

@Repz98 5 days ago

This video was really well made, and I enjoyed it through the entire video! I thought I was watching someone with 200k plus subs, based on the quality of this content. Keep it up, I’m subscribing now!

@PatrickStorm_ 4 days ago

I'm really happy you liked it :) and you made my day saying it was comparable to a channel with 200k subs!

@journees4300 5 days ago

This is not a well thought out test. There are actually standardized methods to compare AI models. Look for GLUE, COCO, MS MARCO, SQuAD, etc. It depends on which aspects you want to compare.

@peanutbutterjellybeans1336 1 day ago

ChatGPT-4o is my quick search engine. Claude 3.5 Sonnet is my main workhorse. The artifact feature in Claude 3.5 Sonnet is so good.

@Doran_Krotan 4 days ago

still waiting on an android claude app. * tick tock *

@nicholasfabris130 4 days ago

In round 5 UK was the correct answer

@maximiliandegarnerinvonmon6457 1 day ago

They both actually got it correct about the 5th highest GDP in 2018.

@maximiliandegarnerinvonmon6457 1 day ago

Gives a scary example of how we presume that we are superior 😂😂

@maximiliandegarnerinvonmon6457 1 day ago

We are usually 4th, but 2018 was the time we dropped for that year, and again just recently last year. It's a sticking point here due to elections and that's how we know 😂😂😂

@TheRealExecuter22 1 day ago

This was a great video, good pacing and structure without any empty ai hype bullshit. Very pleasing production and you got a beautiful room for filming!

@Imperial_Dynamics 22 hours ago

You did not increase the width of the menu bar in Claude's case to see if the buttons re-appear

@FrancescoDellaValle 3 days ago

I appreciate your work, but I found this video biased and inconsistent in judging the two models' responses. In two or three tests, you verbally preferred ChatGPT, yet you didn't award any points and declared a draw. This doesn't seem unbiased to me.

@SSS-100M 3 days ago

Your video is great! I can understand the difference between Claude 3.5 Sonnet and GPT-4o. Also, I canceled GPT-4o because Claude 3.5 Sonnet is better than GPT-4o. When I want to create an image, I use Gemini 1.5 Pro.

@mingistech 16 hours ago

I agree, Claude really needs voice chat added.

@Wonders_of_Reality 5 days ago

Oh! Big thank you for the light theme! You’ve just won a subscriber.

@user-ce8ut8hr9k 2 days ago

What should they have done with the 1 second interval question to keep to 1 second exactly?

@PatrickStorm_ 2 days ago

It can go pretty deep, but they should create a variable with the start time, then use that to calculate time until the next tick. This stackoverflow I just found has the problem and solution well laid out: stackoverflow.com/questions/29971898/how-to-create-an-accurate-timer-in-javascript

@user-ce8ut8hr9k 2 days ago

@@PatrickStorm_ That's a great link, thanks.
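
The idea in the reply above (remember the start time, then compute how long to wait until the next scheduled tick) can be sketched as follows. This is written in Python rather than the JavaScript used in the video, and the function names `delay_until_next_tick` and `run_ticks` are hypothetical; treat it as an illustration of the approach, not the code either model produced.

```python
import time


def delay_until_next_tick(start, now, interval, ticks_done):
    """Seconds to wait so that tick number `ticks_done + 1` lands on schedule."""
    # The target is always measured from the original start time, so a late
    # wake-up on one tick does not push every later tick back as well.
    target = start + (ticks_done + 1) * interval
    return max(0.0, target - now)


def run_ticks(count, interval=1.0, clock=time.monotonic, sleep=time.sleep):
    """Fire `count` ticks and return when each one fired, relative to the start."""
    start = clock()
    fired = []
    for n in range(count):
        sleep(delay_until_next_tick(start, clock(), interval, n))
        fired.append(clock() - start)
    return fired
```

Calling `run_ticks(3)` waits about three seconds in total. A naive loop that sleeps a fixed 1 second each time would instead drift by the accumulated overshoot of every sleep.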

@CyberPulseIT 5 days ago

Nice

@SurfCatten 5 days ago

Pretty impressive for a fairly small KZfaq channel! If you can get the visibility, I expect your channel will do very well. Unfortunately, for that to happen you need to go pretty far down the clickbait KZfaq algorithm route; there's just no other way to build views and subscriptions.

@fernandoz6329 5 days ago

In this type of showdown to find which is best, I think it would be useful for the evaluator not to know which model is creating each answer, to avoid being biased by personal preferences. There were a few answers where I disagreed, so maybe I'm biased too. For coding, DeepSeek 2 is outstanding.

@PatrickStorm_ 5 days ago

That is a really good point. I am very sure I would be able to distinguish between them even if it was a blind test, so I wonder how I could do that 🤔 I’ll keep this in mind for future comparison videos though. Thanks for the feedback.

@PatrickStorm_ 5 days ago

And yes, DeepSeek 2 looks really good, I haven’t tried it out yet, but it’s on my list!

@jasonk125 5 days ago

The question is, what are the gotchas in the JavaScript timer coding problem?

@OverLordGoldDragon 5 days ago

The link in "Link to text responses from the video" seems to point to something else.

@PatrickStorm_ 5 days ago

Apologies! That was a copy/paste issue. It's been updated with the correct link. Let me know if that works for you now!

@eburgwedel 6 days ago

Could have been a good comparison, but wasn’t - image gen and facts make absolutely no sense; reasoning was missing entirely. It did point out a few important things, though, so thank you.

@PatrickStorm_ 5 days ago

I appreciate the feedback! I hear you about the reasoning not being in there. That was a miss, future comparison videos will definitely have that section… and probably won’t have the facts section. Glad you got some stuff out of the video though!

@FloydCotton-hx4jh 4 days ago

It is funny to me how people do comparisons like this. Yours was very well done and quite entertaining. However, the speed at which these models leap frog one another makes this kind of a moot point.

@PatrickStorm_ 4 days ago

You’re right man. I was afraid a new, better model would come out before I even released the video. But I do think it’s worth evaluating new models to know what it is good at, and determine which model to use for different scenarios. Either way, I’m glad you got something out of the vid!

@avi7278 5 days ago

so does this mean no autrey?

@arthurfernandes6934 17 hours ago

GPT 4 is massively better in context, writing and everything that needs interpretation:

GPT 4 bunny history: Under the pulsing neon of 2147, Detective Eli Ramos adjusted his Temporal Badge. The rain-flickered holograms cast shadows as he stepped into the alley where time had fractured. The chrono-meter beeped: 1932, Chicago. Eli's target was Alvin Holloway, a time fugitive masquerading as a bootlegger. Using a purloined time device, Holloway had been jumping decades, altering events for personal gain. Eli slipped through the temporal gate, his trench coat merging seamlessly into the milieu of mobsters and flappers. The smoky speakeasy was loud with jazz, the air thick with conspiracy. He spotted Holloway immediately, charm and danger rolled into one. "You can't keep running, Alvin," Eli said, cornering him by the bar. "And you can't stop me," Holloway smirked, reaching for his concealed device. Quick as a blink, Eli drew his Temporal Disruptor. A beam of blue light, and Holloway froze, the device dropping, useless. "You're out of time," Eli quipped, cuffing Holloway. Back in 2147, the chief commended him. But Eli knew better. Time was a loop; his next case was just a ripple away. Somewhere, sometime, another fracture awaited. But for now, under the eternal city lights, Eli Ramos took a well-deserved breath.

GPT 4o: In the bustling heart of Neo-Victorian London, Detective Arlo Sinclair adjusted his bowler hat and checked his pocket watch. The gears inside whirred, signaling it was time. With a flick of his wrist, the watch's face shifted, revealing a hidden dial. He twisted it, and the world around him shimmered. Moments later, Arlo stood in the shadowy alley of a different era: New York City, 1942. The scent of fresh rain mingled with smoke from the distant war effort. He scanned the street and spotted his target: a man in a pinstripe suit carrying a briefcase too important to be left in this time. Arlo followed, his every step silent as he closed in on the man. When the moment was right, he sprang, snatching the briefcase. The man barely had time to react before Arlo twisted the dial again. The city melted away, replaced by the futuristic skyline of 2147. In his hidden office, Arlo placed the briefcase on a sleek table. Opening it, he smiled at the sight of the stolen schematics, safe from those who would misuse them. Time could be twisted, but justice always straightened its course. Arlo leaned back, ready for the next case. "Ai que tudooo," he murmured.

@timothyhernandez5141 2 days ago

How about Claude 3.5 vs GPT-4? Which is better 😅 Thanks

@hidd3n_ 3 days ago

imagine if we had their child, Claude Pete 4.5

@ibdense 4 days ago

I found this interesting.

@kylanoconnor3144 5 days ago

Hey Patrick, I wanted to reach out with some advice for an AI startup I just released but I can’t find your email, what’s the best way to reach you?

@PatrickStorm_ 4 days ago

Hey there! I just added it to my channel page, you can email me here: patrick@patrick-storm.com

@HERKELMERKEL 4 days ago

I was using ChatGPT-4o for creating artistic literature: poems, song lyrics, etc. Then Claude 3.5 came out, and I was blown away. It is literally an LLM (large language model) because it understands a lot more world languages than GPT and creates more beautiful lyrics than GPT. So in that field the winner is obvious, but there are some fields where it was not better.

@HERKELMERKEL 4 days ago

I am using the free version, and it was limited to a couple of prompts, so I write very long paragraphs to save some steps.

@vuhoang5903 5 days ago

My experience is GPT is still better than Claude in logic, reasoning and problem solving (for coding, math, data analysis,...)

@SharvindRao 4 days ago

Blah blah blah

@Monawwar 2 days ago

10:31 - what's that extension 😆

@229Mike 5 days ago

I don't know if I can agree with this fellow. I tried using Claude 3.5 Sonnet and it got a picture breakdown incorrect, judging by the timeline that was present. No problems with ChatGPT. And at that point, I just didn't wanna trust Claude.

@fast4549 5 days ago

You can actually tell Claude to generate an image and it will make an SVG

@Sindigo-ic6xq 6 days ago

although yes reasoning should absolutely be in

@PatrickStorm_ 5 days ago

Agreed! I’ll make sure to add reasoning into future comparisons. Thanks for the feedback :)

@Sindigo-ic6xq 5 days ago

@@PatrickStorm_ math, iq problems (nonvisual) and understanding of the world

@tekratek4077 4 days ago

I was always using GPT-4o for my tasks until I discovered Claude. I code with Claude and let GPT-4o debug it. Perfect combo for me. Claude understands Python and web design way better than GPT-4o.

@PatrickStorm_ 3 days ago

That is a really interesting approach. The two models obviously have different data sets, so they might have different things they notice in the code. Makes me really curious how the models combined like you use them would perform on benchmarks

@photelegy 5 days ago

7:15 In the image-recognition tests, I'm not sure if they really worked this out by themselves or if they found the image on the internet, e.g. on a forum where it's already explained. So they "just" had to rephrase it.

@byrnemeister2008 5 days ago

Well, Claude doesn't have internet access, so that's not really an option.

@photelegy 5 days ago

@@byrnemeister2008 OK. Thanks for the note. So what's the model's knowledge cutoff (after what date doesn't Claude know anything)?

@dabberdaffy256 3 days ago

Claude is a good demo but the chat length 😬

@igli4ko 22 hours ago

why no math section?

@RolandGustafsson 5 days ago

ChatGPT is way better at haikus than Claude sonnet. Using Claude for haikus reminded me of the ChatGPT 3.5 days. I've had the opposite experience with the actual stuff I do, 4o is better than Claude for my uses and I've done extensive tests. My prompts tend to be much more intensive than the ones you're asking, more detailed.

@Sindigo-ic6xq 6 days ago

really nice! better than most other reviews

@cbnewham5633 5 days ago

While you are correct about image generation being only available in gpt4o, it would have been good if you had a section on image generation using code. I quite often get GPT to create graphs or simple line graphics, which can be done easily in code. Eg: draw a graph of x to the power of 1.7, or draw an alien sprite from Space Invaders.

@cbnewham5633 5 days ago

PS: for GPT you of course have to tell it to do so using its code capabilities, otherwise it tries to do it using DALL-E, with amusing results.

@PatrickStorm_ 5 days ago

That's a good point. I could have done that in either the image generation or the code section. The new Artifacts feature is actually really cool when it comes to this sort of thing - I haven't tried using it for graphs yet though, I wonder how well that works. Now I'm curious!

@PatrickStorm_ 5 days ago

Alright, I tested it. It does show the graphs, but it really wants to use React and include libraries, but this just shows a blank screen in Artifacts. I told it to just use plain html with a canvas element, and it did show a graph, but it was wonky. Like I said in the video, I think Artifacts just isn't there yet to be truly useful. However, I bet the original React code that it gave me would have looked great. I think I'll add this test into future comparison videos like this. Thanks for the feedback!

@cbnewham5633 5 days ago

@@PatrickStorm_ GPT uses python to create its graphs. As I recall it uses some libraries to do the graphing - although as I'm not a python coder I have to admit I've never bothered to look at the code it produces. I've not tried Claude as yet for graphs, so I wonder what happens if you specify what language it should use for the graphing or icon generation.
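
For a sense of what "drawing a graph in code" involves, here is a minimal Python sketch of the x to the power of 1.7 example from this thread, written out as an SVG polyline using only the standard library. It is an assumption-laden illustration: the function name `power_curve_svg` and its parameters are made up here, and it is not the code that GPT or Claude actually generates (which, per the comments above, tends to rely on graphing libraries or React).

```python
def power_curve_svg(exponent=1.7, x_max=10.0, samples=50, width=400, height=300):
    """Return an SVG document plotting y = x ** exponent for 0 <= x <= x_max."""
    y_max = x_max ** exponent
    points = []
    for i in range(samples + 1):
        x = x_max * i / samples
        y = x ** exponent
        # Scale into pixel space. SVG's y axis points down, so flip it
        # to make the curve rise from the bottom-left corner.
        px = width * x / x_max
        py = height - height * y / y_max
        points.append(f"{px:.1f},{py:.1f}")
    joined = " ".join(points)
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">'
        f'<polyline fill="none" stroke="black" points="{joined}"/>'
        "</svg>"
    )
```

Writing the returned string to a file such as `curve.svg` and opening it in a browser shows the curve; axes and labels are left out to keep the sketch short.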

@Kutsushita_yukino 5 days ago

Claude 3.5 Sonnet lost the EQ that Opus had, though... so it's not reliable as a conversational model. Opus is still worth it if you get tired of Sonnet 3.5's robotic, bland responses. Try comparing them if you don't believe me. This is not subjective; it's a fact that Sonnet 3.5 sacrificed emotional intelligence for more specs.

@117ao 5 days ago

exactly!

@sergeyromanov2751 5 days ago

Your list of questions was not balanced. You focused too much on the language problems (where Claude 3.5 Sonnet is clearly ahead) and completely ignored the logic, reasoning and math problems (where GPT-4o would crush its opponent).

@PatrickStorm_ 4 days ago

Totally valid critique. I'll add those sections in to future model comparison videos.

@NithinJune 1 day ago

image generation should be minus points lmao

@USS_Daedalus 3 days ago

Maybe i switch to Sonnet

@YoKKJoni 3 days ago

been using claude for 6 months now.. never seen a reason to use anythign else..

@marsrocket 5 days ago

Claude tried to rhyme awash with splash. Interesting that it didn’t incorporate pronunciation. In time I suspect all AIs will incorporate many different types of knowledge.

@jaredf6205 5 days ago

They rhyme enough. You’d probably think Eminem can’t rhyme at all if you read the lyrics only.

@singhbhai 1 day ago

Claude is better at Coding really really fast.

@hooooman. 5 days ago

Its always true that competition between companies is good for us. But in this case, competition between these AI companies is not good for us(as a normal human being, as well as from a developer pov)💀

@elvis7859 5 days ago

Claude doesn't even search the internet for information; it's all from training

@PatrickStorm_ 4 days ago

Right, which is a major reason I am still using ChatGPT. In these tests I made sure ChatGPT wasn't doing internet searches though, to make the tests as apples-to-apples as I could.

@MasamuneX 5 days ago

i think you need more complex coding tests than pong

@Somebodythatoverthinks 5 days ago

While they keep saying "coming in a few weeks", Sonnet is gonna beat them at their own game 😂

@shauni1987 2 days ago

Why compare 4o for coding?

@caresvlbdjz 2 days ago

for coding claude 3.5 sonnet is much better in my tests

@jasonreviews 5 days ago

chat gpt can generate images.

@512Bytes 1 day ago

I have a video on my channel integrating all AI models into Linux: ChatGPT, Gemini, Claude and more.

@djfremen 5 days ago

Nice work. Also try to break the reasoning. For instance: “a farmer and a sheep on one side of a river. The farmer has a boat that can take one animal at a time. What is the fewest number of trips the farmer can take to have his animals on the opposite side of the river?” - this usually breaks reasoning of LLMs.

@PatrickStorm_ 5 days ago

Great call out. In future comparisons, I'll absolutely add in a reasoning test like the one you mentioned. Thanks for the feedback!
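
The puzzle quoted above is a trick question: with one farmer, one sheep, and no restrictions, a single crossing is enough. A small breadth-first search makes that easy to verify. This is a sketch under the assumption that there are no rules about which animals may be left together (the puzzle as quoted gives none); the function name `fewest_crossings` is hypothetical.

```python
from collections import deque


def fewest_crossings(animals=("sheep",)):
    """Breadth-first search for the minimum number of boat trips.

    A state is (farmer_bank, animal_banks); 0 is the starting bank, 1 the far bank.
    Assumption: no animal may eat another, since the quoted puzzle states no such rule.
    """
    start = (0, (0,) * len(animals))
    goal = (1, (1,) * len(animals))
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        (farmer, banks), trips = queue.popleft()
        if (farmer, banks) == goal:
            return trips
        # The farmer crosses alone (None) or with one animal from his own bank.
        for passenger in [None] + [i for i, b in enumerate(banks) if b == farmer]:
            new_banks = tuple(
                1 - b if i == passenger else b for i, b in enumerate(banks)
            )
            state = (1 - farmer, new_banks)
            if state not in seen:
                seen.add(state)
                queue.append((state, trips + 1))
    return None


print(fewest_crossings())  # the single-sheep puzzle from the comment
```

A model that answers with a multi-trip plan is pattern-matching to the classic wolf, goat, and cabbage riddle rather than reading the question as asked.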

@brezza6892 2 days ago

Germany was 4th not 5th. They were both correct and you were wrong. According to the WEF that is.

@py_man 3 days ago

Claude is way better than gpt4o in coding

@FJRyder 5 days ago

The answer was Leafy, trees are leafy. No points.

@TeamStationAI 5 days ago

Claude has no access to the internet.

@jimdye7431 5 days ago

the fact 4o can search the web makes it better imo

@jimdye7431 5 days ago

I wrote this prior to watching all the way through; your conclusion seemed to align with mine. I am a casual user and don't need to pay for both. I am not a coder, however, and don't need these systems for work. I use it in place of Google a lot now and to make fun pictures of my dog. I use it to explain things I read that I don't quite understand. I also use it to help me choose items to purchase over others, and to help brainstorm workouts, among other things. I barely talk to it though; maybe I will once they finally release the 4o voice, which will be who knows when at this point. Thanks for the video.

@PatrickStorm_ 5 days ago

Glad you enjoyed! If you aren't using it for code, I think ChatGPT is the way to go still. Thanks for the comments!

@Douchebagus 5 days ago

GPT-4o doesn't generate images; it instead uses DALL-E 3. I don't think that's a fair assessment of a language model when it has to rely on an image model, and it shouldn't be awarded points for that.

@PatrickStorm_ 5 days ago

That is a really fair point. The video is out, but if I could, I would retroactively adjust the final score to GPT-4o: 5 points and Claude 3.5 Sonnet: 8 points. I appreciate the feedback!

@iftekharhossen7221 1 day ago

First of all GPT-4o is free

@phillipbones7522 3 days ago

Videos like these are pointless for the simple fact that it's nothing more than comparing two software apps, with one being a half step either behind or in front. Unless Claude can take that full step forward, which it failed to do this time around, ChatGPT will just sprint forward again with another great leap. I can only afford one, and for now I'm sticking with ChatGPT as it has more potential for what I need it for.

@theibrahimet 1 day ago

he is biased for claude

@FuzzyWuzzaBer 1 day ago

I have to say that the way you presented this showed a heavy bias towards Sonnet 3.5. In each of the ties, EACH should have gotten a point, this way it validates each model in the function you were requesting. This is hardly subjective with the utter skewering of the results to Sonnet.

@cbgaming08 1 day ago

Did you even watch the video?

@cbgaming08 1 day ago

@@FuzzyWuzzaBer He clearly stated in the beginning that it is going to be a subjective comparison and that he is the owner of the video, so he's going to do what he wants.

@FuzzyWuzzaBer 16 hours ago

@@cbgaming08 Yes and I heard him.

@FuzzyWuzzaBer 16 hours ago

@@cbgaming08 I actually meant to say subjective, I just got my tenses mixed in the moment. My opinion stands.