Autonomous Open Source LLM Evaluator (Ollama)

Views: 3,947

Days ago

👊 Become a member and get access to GitHub and Code:
/ allaboutai
🤖 Great AI Engineer Course:
scrimba.com/learn/aiengineer?...
🔥 Open GitHub Repos:
github.com/AllAboutAI-YT/easy...
📧 Join the newsletter:
www.allabtai.com/newsletter/
🌐 My website:
www.allabtai.com
Today I take a look at my Autonomous Open Source LLM Evaluator, built with Ollama and GPT-4. This is a neat tool for testing open source LLMs on different tasks, such as problem solving and code generation.
00:00 Ollama LLM Eval Intro
00:21 Ollama LLM Eval Flowchart
01:28 LLM Evaluator Code 1
06:24 Test 1
08:30 LLM Evaluator Code 2
09:13 Test 2
10:53 Conclusion
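
The flow described above (several local models answer the same task, then a stronger model grades the answers) could be sketched roughly as follows. This is not the code from the video: `evaluate_models`, `rank`, `ask_model`, and `judge` are hypothetical names, and the two callables are placeholders where calls to a local Ollama model and to GPT-4 would go.

```python
from typing import Callable, Dict, List, Tuple


def evaluate_models(
    models: List[str],
    problem: str,
    ask_model: Callable[[str, str], str],
    judge: Callable[[str, str], float],
) -> Dict[str, float]:
    """Ask every model the same problem and let a judge score each answer."""
    scores: Dict[str, float] = {}
    for name in models:
        # Placeholder for a request to a locally served model.
        answer = ask_model(name, problem)
        # Placeholder for a request to the stronger judge model.
        scores[name] = judge(problem, answer)
    return scores


def rank(scores: Dict[str, float]) -> List[Tuple[str, float]]:
    """Sort models from highest to lowest score."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Stand-in callables so the sketch runs without any model server.
    canned = {"llama3": "4", "mistral": "5"}
    results = evaluate_models(
        ["llama3", "mistral"],
        "What is 2 + 2?",
        ask_model=lambda model, problem: canned[model],
        judge=lambda problem, answer: 1.0 if answer == "4" else 0.0,
    )
    print(rank(results))
```

Passing the model and judge in as callables keeps the loop testable offline and makes it easy to swap in a different judge.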

Comments: 18

@gunabalang9543 10 days ago

I love the way you are using AI .....

@MrSuntask 10 days ago

Livestream was great!

@Ms.Robot. 10 days ago

This is a useful idea🎉

@CarrotStick2165 10 days ago

aya:35b blows everything out of the water. Not ten times better than ChatGPT, but one hundred times better. It's slow, since it's a 35B model run locally, but I love it. Besides that, I use llama3 for most everyday tasks.

@ArseniyPotapov 9 days ago

I've built a similar system, but I noticed that the judge model sometimes hallucinates and gives high marks to obviously wrong solutions. I tried a jury of multiple judges (different big models). This improved judging quality but made it 8x slower. Also, with multiple judges you need to fuse their judgements into some consensus. It's just pretty slow, and all models do hallucinate.
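
One simple way to fuse a jury's scores into a consensus is to take the median and flag cases where the judges disagree strongly. The sketch below is only an illustration of that idea, not the commenter's system. `fuse_scores`, the 0-10 score scale, and the `max_spread` threshold are all assumptions.

```python
from statistics import median
from typing import Dict, Tuple


def fuse_scores(
    judge_scores: Dict[str, float], max_spread: float = 3.0
) -> Tuple[float, bool]:
    """Combine per-judge scores into (consensus, needs_review).

    The median is used because a single hallucinating judge cannot
    drag it far. needs_review is True when the gap between the highest
    and lowest score exceeds max_spread (an arbitrary threshold).
    """
    values = list(judge_scores.values())
    if not values:
        raise ValueError("at least one judge score is required")
    consensus = median(values)
    spread = max(values) - min(values)
    return consensus, spread > max_spread


if __name__ == "__main__":
    # Judge names and scores here are made up for the example.
    print(fuse_scores({"judge_a": 9, "judge_b": 8, "judge_c": 2}))
```

The review flag gives a cheap signal for which answers are worth checking by hand, though it does nothing about the extra latency of running several judges.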

@ArseniyPotapov 9 days ago

One of the problems almost all models suck at is the "fox, chicken and sack of grain" puzzle (also known as the "wolf, goat and cabbage" problem). All models recognize that it's a classic puzzle, but only a few can give a coherent solution without weird glitches.
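
Because this puzzle has a small, fully enumerable state space, a model's answer can be checked against a reference solution found by search instead of relying only on a judge model. Below is a minimal breadth-first search sketch. The state encoding and the names `solve` and `is_safe` are choices made for this illustration.

```python
from collections import deque
from typing import List, Optional

ITEMS = frozenset({"fox", "chicken", "grain"})
# Pairs that must never be left on a bank without the farmer.
UNSAFE = [{"fox", "chicken"}, {"chicken", "grain"}]


def is_safe(bank: frozenset) -> bool:
    """A bank is safe if it does not contain any forbidden pair."""
    return not any(pair <= bank for pair in UNSAFE)


def solve() -> Optional[List[Optional[str]]]:
    """Return the shortest list of cargo per crossing (None = empty boat)."""
    start = (ITEMS, "left")        # (items on the left bank, farmer's side)
    goal = (frozenset(), "right")
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        (left, farmer), path = queue.popleft()
        if (left, farmer) == goal:
            return path
        here = left if farmer == "left" else ITEMS - left
        for cargo in [None, *sorted(here)]:
            new_left = set(left)
            if cargo is not None:
                if farmer == "left":
                    new_left.discard(cargo)
                else:
                    new_left.add(cargo)
            new_left = frozenset(new_left)
            new_farmer = "right" if farmer == "left" else "left"
            # The bank the farmer has just left is now unattended.
            unattended = new_left if new_farmer == "right" else ITEMS - new_left
            if not is_safe(unattended):
                continue
            state = (new_left, new_farmer)
            if state not in seen:
                seen.add(state)
                queue.append((state, path + [cargo]))
    return None


if __name__ == "__main__":
    print(solve())
```

A model's proposed sequence of crossings could be replayed through the same `is_safe` check to see whether any intermediate step leaves a forbidden pair alone.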

@thenarrowgate3063 8 days ago

In the bubble sort evaluation, all the models that were evaluated as wrong (Mistral, Codestral, etc.) had a syntax error on line 1, because the model's output text was included as a line of code. The code itself was sound in all of them. So it is not a proper eval: the syntax error wasn't part of the LLM's code but of yours, and you need to check why your code worked for a couple of models but not the others. Other than that, it's a cool idea.
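
If the cause is what this comment suggests (explanatory text from the reply being executed along with the code), one common remedy is to pull out only the fenced code block before running anything. The sketch below assumes replies use Markdown-style triple-backtick fences. `extract_code` is a hypothetical helper, not a function from the video's code.

```python
import re

# Build the fence marker programmatically to keep this example readable.
FENCE = "`" * 3
CODE_BLOCK = re.compile(
    FENCE + r"[a-zA-Z0-9_+-]*[ \t]*\n(.*?)" + FENCE,
    re.DOTALL,
)


def extract_code(reply: str) -> str:
    """Return the body of the first fenced block, or the whole reply if none."""
    match = CODE_BLOCK.search(reply)
    if match:
        return match.group(1).strip()
    # No fence found: fall back to the raw reply and let the caller decide.
    return reply.strip()


if __name__ == "__main__":
    reply = (
        "Here is the bubble sort:\n"
        + FENCE + "python\n"
        + "def bubble_sort(xs):\n"
        + "    return sorted(xs)\n"
        + FENCE + "\nHope this helps!"
    )
    code = extract_code(reply)
    # compile() raises SyntaxError if stray prose slipped into the code.
    compile(code, "<llm-reply>", "exec")
    print(code)
```

Running `compile` before executing also separates "the model wrote invalid code" from "the harness passed prose to the interpreter", which is the distinction the comment is asking for.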

@JohnDoe-zx8bu 10 days ago

What is the sense of rating many models with a more powerful model if that is required for each problem? It would be much faster to just ask GPT-4 for the answer to the problem.

@TwoWayOrbitalStation 9 days ago

Because ChatGPT cannot be run locally. If you can evaluate which small local model is best for a task, then you can use that model locally on your PC. If you have sensitive code or sensitive information, you don't want to pass it through ChatGPT, since OpenAI will take your data, so you run locally. Not to mention, running locally is completely free, whereas using the ChatGPT API is going to cost you. The whole point is to test on basic examples, so you can then pick the model for a similar, more complex task.

@JohnDoe-zx8bu 9 days ago

@@TwoWayOrbitalStation The issue I see is that results might differ for slightly changed tasks. That is, for the current task you get the right result, but if you ask for an answer to a similar but different task, the answer might be wrong. So if you want to use small models locally, you need some different way to estimate results without ChatGPT.

@BinxNet 7 days ago

@@JohnDoe-zx8bu In this system, ChatGPT (being the best model) provides a rough approximation of the “best” solutions from those provided, saving you tokens on getting it to provide its own lengthy results. You can also use this in a multi-agent loop with whatever LLM it picked to improve output entirely on the user side, with no additional tokens. This is just where documentation and an understanding of what you’re asking of the LLMs comes into play. If you’re allowing the blind to lead the blind, of course it’s going to be horrible. However, if you need a bit of help doing this One Thing you just can’t get right, then problem solved. We are not at the stage of these tools being autonomous fix-alls for every problem. I’ve seen many people saying things like “ChatGPT almost broke my computer because I tried to get it to help me do this thing”, but the reality is, THEY almost broke their computer bc they had no fking idea what ChatGPT was telling them. Accountability is on the end-user here to determine what is and is not a useful output, and how it can then be applied.