Function Calling with Local Models & LangChain

Function Calling with Local Models & LangChain - Ollama, Llama3 & Phi-3

Рет қаралды 23,621

Күн бұрын

Code : github.com/samwit/agent_tutor...
🕵️ Interested in building LLM Agents? Fill out the form below
Building LLM Agents Form: drp.li/dIMes
👨‍💻Github:
github.com/samwit/langchain-t... (updated)
github.com/samwit/llm-tutorials
⏱️Time Stamps:
00:00 Intro
01:27 Phi-3 Model Blog
02:01 Gorilla Paper
02:12 Function Calling Leaderboard
03:00 Code Time
03:02 Set up Llama 3 with Ollama
03:45 Set up Prompt Template
05:31 Get a JSON Output from Llama 3
08:50 Get a Structured Responses using Ollama Functions
11:44 Phi-3 Model Demo
12:57 Tool Use and Function Calling Sample
15:09 Trying Tool Use and Function Calling with Phi-3

Пікірлер: 59

@OliNorwell 14 күн бұрын

I would recommend adding 'Langchain' to the title of the video, most of this is very langchain specific, for those specifically searching for that.

@samwitteveenai 14 күн бұрын

Very good point. Added! Thanks!

@marilynlucas5128 14 күн бұрын

@@samwitteveenai Great experiment you're running there but please consider using lm studio's new cli as well in your subsequent videos instead of ollama all the time. Also can you try using Anima's air llm library so you can run the llama 3 70B locally using layered inference?

@samwitteveenai 13 күн бұрын

I haven't heard of Anima's air llm library but will check it out

@theh1ve 13 күн бұрын

Lm studio isn't as 'open' as ollama so would restrict the use cases to just personal use.

@rickymehra1104 12 күн бұрын

Thank u sharing video of ollama with phi3 to run locally, hope u would come up wid more such videos to use ollama locally for different tasks. Pls mk more videos on phi3, llama3 with ollama.

14 күн бұрын

Excellent as usual! For Phi3 3.8b latest it works fine with: prompt_phi = PromptTemplate.from_template( """{context} Human: {question} AI:""" ) Otherwise you will get validation errors. All the best Sam!

@jiyuhen 14 күн бұрын

Thank you for doing this with Ollama, this was an really good explanation and helped me a lot!

@chriskingston1981 14 күн бұрын

Ah really needed this, I kept feeling, I want to learn function calling with llama3. Feels so good to use a local model with function calling, and langchain made it really easy to do. Love to experiment with it now, thank you so much for this video❤️❤️❤️ and thanks to langchain for making it easy to do function calling ❤️❤️❤️

@aa-xn5hc 14 күн бұрын

Amazing video 🙏🏻 Currently using crewai

@mshonle 14 күн бұрын

For local models I’ve found it’s helpful to at extra context at the very end of the prompt, in the assistant reply section (not the instruction section), kicking things off with “Sure, here is your JSON:” and then adding markdown syntax for preformatted text and then letting one of the end symbols be the final three backticks to close the markdown. It’s also helpful to write a custom grammar (like with llama.cpp) to constrain output to a specific schema even. (Depending on your setup this could slowdown inference if the constrained generation part isn’t running on the GPU.)

@sven262 13 күн бұрын

Thank you so much. Super helpful.

@eduardovernier7628 15 күн бұрын

Very cool! I've been using the instructor library with pydantic for structured output and had a lot of success on openai models, but it didn't work very well with local llms. I'll definitely try out your approach!

3 күн бұрын

Very useful, thanks!

@svenvanwier7196 2 сағат бұрын

I see you use a mac mini, could you talk more about what model and OS setup? Thinking of fun things to do with my 2011 2ghz i7 16gb ddr3 ram, a local something on my network if I could.

@hienngo6730 15 күн бұрын

Thank you for the informative videos as always. One note: if you want to run things all locally and want a lot better throughput, running the models using vLLM and serving the API with vLLM's OpenAI-compatible server is definitely the way to go. If you have a 24 GB VRAM GPU like a 3090 or 4090, you can run a GPTQ or AWQ quantized model, or just the full FP16 model and serve a large number of concurrent clients. With batching, you can get thousands of tokens per second in aggregate for responses if you run a lot of parallel clients.

@jay-dj4ui 14 күн бұрын

linux only, and I am not sure it has enough performance like that. Multiple API calling contiusely sounds great. just not sure....

@marilynlucas5128 14 күн бұрын

You can run the llama 3 70b model with as little as 4gb gpu using Anima's air llm library which enables layered inference.

@hienngo6730 14 күн бұрын

@@marilynlucas5128 I've never used this library before, what kind of tokens per second speed can you get? For reference, using LLaMA-3 70B with exllamav2 quantization at 2.4bpw on a single 4090, you can get around 36 tokens/second. With 2x4090s and 5.0bpw quantization, you get around 18 t/s.

@andyma1146 14 күн бұрын

Thanks for the video! I'd like to see an example of using DSPy to optimize a local model so that it can use tools more reliably. I'm actually not sure if this would work but I'd like to find out. 😃

@alx8439 14 күн бұрын

The biggest issue with function calling is that the way everyone suggests to use it is not very viable / economical, if you want your model to choose one out of many functions to call. I'll elaborate: in order for LLM to pick a function to use, you need to announce all those tools in advance and make sure it hasn't forgotten them, if you're going into multy turn chat. This means more context will be used just to make model aware about all these extra tools you want it to use and less context will be available for responses. There's probably some semantic router needs to be introduced in-between to give model only those tools which might be relevant to current question

@brianmorin5547 9 күн бұрын

100% my experience as well. In fact, I’ve only had success doing function calling by putting it at the individual run level rather than at the model level and only calling a single function that will be needed

@tonyrungeetech 8 күн бұрын

I have a video doing exactly this with a library called semantic router and crew-ai!

@MeirMichanie 13 күн бұрын

Thanks for the code and the explanation. In order to be usable, you should be able to execute the function feed the info back into the history of the conversation with the result of the function and then the llm should be able to use the results from the function to write the last message. For instance, lets say that the weather tool responds with just the temperature and nothing else, then the LLM should be able to respond back 'in Singapore the current temperature is ..." and in the same language as it was asked from the user.

@superstippi Күн бұрын

Absolutely agreed. It seems to be very hard to find information on how to do exactly that. The Phi-3 chat template doesn't seem to introduce a dedicated role for a function call result. So if it seems to be the "user" replying with a function call result, why would the model figure that it needs to phrase that into a coherent message? Also, I fail to get sensible output when there is more than one function declared and the model is supposed to be free to use a tool or reply directly. Often, I get long chunks of what appears to be training data appended to the initial reply.

@kenfink9997 13 күн бұрын

Great video as always! In future videos, could you please show how to do this with Ollama and langchain running on separate computers? I'd like to develop on Laptop or Colab with just inference running on my Desktop PC. And since Ollama doesnt currently do API keys, how do we secure the inference server and access it from a Colab notebook? Thanks!!

@Shiroikage98 14 күн бұрын

would love if you can explain this using the ollama python package. As someone else said this is very specific to langchain and i just cant find good information on how to use function calling with ollama.

@Carnivore69 12 күн бұрын

Great video. I was hoping this would give me a reason to try LangChain vs my own prompt/post-parsing for a web ui, but I'm actually getting better results than this demonstrates. I'm using llama3-8B via LM Studio. I think until these guys get their sh*t together and create a standard for output, this is going to be similar to the browser wars (standards). At the very least, they should all conform to current markdown standards or accept a config/spec for default output. Whoever comes out with an open source competitive model that does this is going to be the clear leader... for me anyway. ...And if such a model exists, please point me to it!! :)

@comfixit 15 күн бұрын

I have found Phi-3 truly impressive for its size, getting good results even for general inquiries. I almost wonder if you could just use Phi-3 if you don't need a super refined response. It's so light on resources comparatively for an LLM.

@samwitteveenai 15 күн бұрын

Agree it is a nice model especially when you consider its size

@jay-dj4ui 14 күн бұрын

Hi< Is that because we try to give it as much more accurate and better machine-readable input, so the model does not have to 'think' too much that it can follow the correct format like JSON and some basic function, and it can meet some complex requirements also. The way is more efficient and energy-saving.

@CraftPit 14 күн бұрын

Phi3 excels at creative language tasks, surpassing even GPT-4 in my tests. GPT-4 itself ranks Phi3's lyrics higher :)

@user-iu5ue4bv8q 12 күн бұрын

Thanks for sharing this, how can I use this json output funcution call format to combine the langchian agent functuion call framework , which. Use the llm.blind_tool to replace the llm=ChatOpenAI()? Will this work? Thanks

@MukulYadav-pw9se 13 күн бұрын

wow Sam!!!, this video is really helpful but i am facing challenge in running it on server as the response is not coming within 1 min and i am getting 504 Gateway Timeout error, i have used ollama docker image to install ollama but i am not able to find how to increase gateway timeout to 10 mins instead of default 1 min. Can you please help if you have faced such issue?

@madhudson1 13 күн бұрын

all looked well and good until you try feeding a question into the 'agent' that doesn't relate directly to: "get the current weather in a given location". I thought the whole point of function calling/tooling was to present the LLM with the opportunity to use tooling if necessary.

@harshkesharwani8730 2 күн бұрын

How to use chatOllama along with function calling. i want to pass messages along with functions same as open ai v1/chat/completions api provides.

@sumanthbalakrishnan285 14 күн бұрын

How do I incorporate function calling with follow up questions and memory. Say a user asks “what is the weather”. The model should be able ask “what place are you requesting for” and say the user replies “California” It should then make the function call with the mentioned arguments. Please let me know which direction I should look in order to achieve this.

@kallebysantos5167 15 күн бұрын

Is possible to fine tune a small language model for function call? For example, if we look to BERT models that perform zero-shot classification we can pass a set of labels to it, so maybe is possible to use a similar approach to get a very performatic model just for function calling, since LLMs are very huge and almost every time requires a GPU. I know that phi3 is very small but in my machine it takes like 3Gb of GPU.

@samwitteveenai 14 күн бұрын

Yes very possible to do the key is getting the dataset and most people aren't making their datasets for this public.

@pensiveintrovert4318 14 күн бұрын

I have been running gpt-pilot with Llama3-70b-instruct.Q5_K_M for a couple of weeks. The biggest problem I have, as far as I understand, is not function calling but rather the stability of the framework. It starts developing a bunch of files, but when I provide feedback, it may abandon the old files instead of correcting them, and starts creating a new set of files. Basically makes a mess.

@kaushiklade 10 күн бұрын

Hey, thats very helpful to understand how to run these models locally. Can u/anyone tell, how to actually do actual function call and pass that response to llm? Is it possible without LangGraph??? I want llm to decide which tool to call, once he decide that, llm should do entity extraction and then invoke tool, then returns ans back to llm and gives it to user. This was easy with AgentExecutor in OpenAI examples. Similar thing possible in Ollama?

@peterdecrem5872 14 күн бұрын

What was the name of the paper that shifts the probabilities to get json as response more likely?

@samwitteveenai 14 күн бұрын

Can check it out here github.com/1rgs/jsonformer

@alx8439 15 күн бұрын

At last someone finds a good use for agents - to give them some tasks you want accomplished and give loose it free overnight to use internet :)

@willjohnston8216 15 күн бұрын

I don't understand how this demonstrated using agents overnight on the Internet? I'd really like to know how to do that. What did I miss?

@alx8439 15 күн бұрын

@@willjohnston8216 Mr. Witteven just mentioned this as a possible implication. I was just glad more people to turn their minds into some real world use cases for agentic flows - like giving a topic for your agent and let it research it, find products / software, which you would never find in ads, do some data gathering and processing for you, providing helpful summaries on a hot topics you never have time to investigate properly yourself, etc etc etc

@harshkesharwani5621 15 күн бұрын

Can I use function calling with llama.cpp?

@samwitteveenai 15 күн бұрын

in theory yes but might need to mess with how to get it accept them etc.

@harshkesharwani5621 15 күн бұрын

How one can pass multiple functions and let model decide to use particular one. Does it supports multiple functions

@MavVRX 14 күн бұрын

The bind function takes in an array of functions so you can simply add the additional functions to the array separated by commas. E.g. [f1, f2]

@RobBominaar 14 күн бұрын

Well, actually, where are the functions? I only see a Json string.

@AIvetmed 15 күн бұрын

has someone tried to load the models other than using ollama like the huggingface transformer pipeline or in other words I would love to know how torun these models in Linux based servers like databricks where I am unable run ollama application in the background like in my windows PC?

@MavVRX 14 күн бұрын

Ollama already supports windows

@AIvetmed 14 күн бұрын

@@MavVRX for Linux based servers like databricks server

@samwitteveenai 14 күн бұрын

I made a Llama3 review deep dive video and show loading that in HF Transformers there in a colab

@StephenRayner 14 күн бұрын

You are not using latest version. It’s now called “bind” not bind_tools

@samwitteveenai 14 күн бұрын

I am using the latest langchain-experimental 0.58 the bind is used in the main function calling with prop models for the OllamaFunction they still have it as bind_tools. If I am missing something send me a link.