LocalAI LLM Testing: Viewer Questions on Mixed GPUs, and What Is Tensor Splitting? (AI Lab Session)

  1,199 views

RoboTF AI
1 day ago

Attempting to answer good viewer questions with a bit of testing in the lab.
We will take a look at using different GPUs in a mixed scenario, and go down the route of tensor splitting to get the best out of your mixed-GPU machines.
We will be using LocalAI, with an Nvidia 4060 Ti (16GB VRAM) alongside a Tesla M40 (24GB).
Grab your favorite after-work or weekend enjoyment tool and watch some GPU testing.
Recorded and best viewed in 4K
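For readers who want to try the tensor splitting discussed in the video at home: in LocalAI, this is typically set per model in its YAML definition. The sketch below is a hypothetical example; the file name, model name, and exact field names are assumptions based on LocalAI's llama.cpp backend options, so check the LocalAI docs for your version.

```yaml
# models/my-model.yaml - hypothetical LocalAI model definition (names are placeholders)
name: my-model
parameters:
  model: my-model.gguf
f16: true
gpu_layers: 33          # offload all layers to the GPUs
tensor_split: "90,10"   # ~90% of tensors on GPU 0, ~10% on GPU 1
main_gpu: "0"           # GPU that handles the small scratch tensors
```

The "90,10" ratio mirrors the split tested in the video between the 4060 Ti and the M40.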

Comments: 26
@jackflash6377 · 22 days ago
Outstanding! Glad I found this channel. Thank you, sir.
@RoboTFAI · 21 days ago
Thanks for watching!
@246rs246 · 25 days ago
I'm blown away by this comprehensive answer to my question. Thumbs up and I'm looking forward to more interesting videos.
@RoboTFAI · 24 days ago
Awesome, thank you!
@kevinclark1466 · 10 days ago
Great video! Looking forward to trying this…
@RoboTFAI · 9 days ago
Have fun!
@six1free · 25 days ago
Hands down one of the best YouTube channels out there, and I'm not just saying that because you featured my question :D I really do love how thoroughly you've answered it. This being the pause point, I'm going to guess that CUDA will do it all for you ("as if", I'm sure :D). I am so envious of your test rig; as it is, I'd need a data center for power. As for adding the other cards: further research on tensors, and rewatch this video when applicable :D Downloaded and saved to my good-tutorials (very long) playlist. Enjoy the well-deserved follow-through.
@RoboTFAI · 24 days ago
Thanks for the idea!
@SphereNZ · 20 days ago
Great video, great info, really appreciate it, thanks.
@RoboTFAI · 19 days ago
Appreciated!
@AkhilBehl · 26 days ago
This is absolutely awesome stuff.
@RoboTFAI · 24 days ago
Thanks!
@CoderJon · 11 days ago
Love your videos. I appreciate that you leave the interpretation of the results to us, but I would love a video discussing your own interpretation of the data. For example: why were your prompt-tokens-per-second results higher with the 90/10 split? I assume it's because there is some sort of parallel processing happening during prompt interpretation, but I am still new to the AI world, so I would love the education.
@RoboTFAI · 11 days ago
Much appreciated! I try to keep my mouth shut and let the data show the info. I'm definitely not an expert, just learning like everyone else. I never intended to create an actual channel; the first video was to back up a conversation with friends with hard data, and the testing app is for other uses in my lab. It's just turning into a place where we can all share some data and learn from it, or at least burn some of my power bill together!
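For readers curious how a split ratio turns into a per-GPU workload: llama.cpp-style backends divide the model's layers between GPUs roughly in proportion to the split. A minimal sketch (the layer count, ratios, and rounding rule here are illustrative assumptions, not the exact backend algorithm):

```python
# Hypothetical sketch of how a tensor split could divide a model's
# transformer layers between two GPUs (numbers are made up).

def split_layers(total_layers, ratios):
    """Assign whole layers to GPUs in proportion to the given split ratios."""
    total = sum(ratios)
    counts = [int(total_layers * r / total) for r in ratios]
    # Hand any rounding remainder to the first GPU
    counts[0] += total_layers - sum(counts)
    return counts

# e.g. 33 layers with a 90/10 split between a 4060 Ti and an M40
print(split_layers(33, [90, 10]))  # -> [30, 3]
```

Under a 90/10 split the faster card processes nearly all layers, which is one plausible reason prompt processing speeds up: less of the work sits on the slow card.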
@mbike314 · 14 days ago
Thank you for creating this valuable content. I am pleased to have discovered it. I am interested in some of the 4060s you mentioned. I sent an email. Please keep going with this channel! Wonderful stuff!
@RoboTFAI · 10 days ago
Thanks a ton! Didn't see any email - reach out at robot@robotf.ai or ping me on Reddit, etc.
@mbike314 · 4 days ago
Thank you. I did send it to the wrong address. Just resent it to the correct address.
@andre-le-bone-aparte · 23 days ago
Question: @03:14 - NVTOP is showing - 90+ Degrees (86 on the M40) Fahrenheit on each of those cards... WITHOUT any active usage? - That seems excessive. Currently running a 4x3090 setup at 79-degrees or lower, in-between queries.
@RoboTFAI · 23 days ago
The 4060s are stacked next to each other on the bench node in this test (I don't recommend that; they could use space between them since they have side-facing fans, which is why I normally use a lot of PCIe extenders), and they don't run their fans unless there is a load. The M40 in this test has an active fan on all the time. Also, I live in a hot climate and it's been 85-100 degrees (75+ in the workshop, as it's not conditioned) 🔥
@andre-le-bone-aparte · 23 days ago
@@RoboTFAI 👍- Just looking to learn ways to extend the life of these GPUs and increase performance for LLM usage when running 10 hours a day (work day, remote-work, as a code assistant)
@tsclly2377 · 25 days ago
I think loading is still an important factor. Do you use NVMe drives, like the large, high-write-endurance Optane P900 series, for fast loads? And FPGAs for pre-staging data (like video, pictures) reconstructed into a faster-to-use form?
@RoboTFAI · 24 days ago
I normally leave the unloaded-model test off, as it doesn't allow as much resolution in the smaller charts. I use Gen 4 NVMe M.2 drives in each of these systems (rated up to 5000/4800 MB/s... yeah, right).
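If you want to sanity-check a drive's real sequential read speed against its rated numbers before blaming model-load times on it, a quick sketch like the following works (the scratch-file approach here is an illustrative stand-in; for a meaningful number, point it at an actual multi-gigabyte GGUF file):

```python
# Rough sketch for estimating sequential read throughput of the drive a
# model would load from. The temp file below is a small stand-in for a
# real model file, so OS caching will inflate the number.
import os
import tempfile
import time

def read_throughput_mb_s(path, chunk=16 * 1024 * 1024):
    """Read the file in large chunks and return MB/s."""
    size = os.path.getsize(path)
    start = time.perf_counter()
    with open(path, "rb") as f:
        while f.read(chunk):
            pass
    elapsed = time.perf_counter() - start
    return size / (1024 * 1024) / elapsed

# Demonstrate with 64 MB of scratch data
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(64 * 1024 * 1024))
print(f"{read_throughput_mb_s(tmp.name):.0f} MB/s")
os.remove(tmp.name)
```

Dropping the page cache first (or reading a file larger than RAM) gives a figure closer to the drive's true cold-load speed.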
@Zeroduckies · 18 days ago
Or you can get 1TB of RAM and have a 500GB ramdisk ^^
@tsclly2377 · 16 days ago
@@Zeroduckies Using HP ML350p machines, one only gets up to 768GB of DRAM, which has to be LRDIMM, and that RAM runs on three channels, which actually slows it down compared to the two-channel 256GB configuration because of the required interleaving and processing. It is all in the specification PDF from HP. Only when going to the G11 model does one actually get significantly faster RAM (PCIe 5.0; HP skipped the 4.0 architecture in these machines) and larger capacity, at an astronomical increase in price. So when getting a 'loaded' 256GB-DRAM ML350p G8 as a trade for an older gamer machine with a GTX 1660 Ti and a pre-tenth-gen i7 (about a $300 value), one must look for a fast, economical memory solution, and that is where the Optane P900 cards come in (with their burst rates); one must also compare that against the rate the GPU can actually take in. This is a cheap way to move data in (and out) at rates comparable to DRAM, while only occupying PCIe 4 lanes. Now this is all fine and dandy, but in dual-CPU chipsets the PCIe lanes go all over the place, and that is a major consideration: the right and left sides are controlled by different CPUs, and SLI or NVLink can be required for OS recognition of the linked GPU cards, which is inherently required for proper function logging, along with the PCIe controllers on these machines. They are going to be slower than single-CPU, purpose-designed motherboards made by other companies, such as the multi-PCIe-16x SuperMicro or Gigabyte professional models that have come out specifically for this type of application and use NVMe arrays for storage. And then you are back to the amount of writes being applied to the storage.
@tbranch227 · 14 days ago
Can you run a larger model when you span cards? Or does your model need to fit on each card that you tensor-split across? And what happens to performance if you can run larger models by aggregating card RAM?
@RoboTFAI · 12 days ago
You can absolutely span a larger model between cards! These tests are actually doing that. Performance depends on the cards you are splitting between, but will fall somewhere between your lowest-end and highest-end cards (if they are different models). Running multiple cards doesn't necessarily increase performance; it's really for expanding your VRAM capacity.
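The aggregation the reply describes can be sanity-checked with back-of-the-envelope arithmetic before buying hardware. A minimal sketch, where the model size, per-card overhead, and split ratio are all illustrative assumptions:

```python
# Does a model's VRAM footprint fit across cards under a given tensor split?
# All numbers below are illustrative assumptions, not measured values.

def fits(model_gb, overhead_gb, cards_gb, ratios):
    """Return per-card usage and whether each share fits in that card's VRAM."""
    total = sum(ratios)
    usage = [(model_gb * r / total) + overhead_gb for r in ratios]
    return usage, all(u <= c for u, c in zip(usage, cards_gb))

# e.g. a ~20GB quantized model + ~1.5GB context/overhead per card,
# split 60/40 across a 16GB 4060 Ti and a 24GB Tesla M40
usage, ok = fits(20, 1.5, [16, 24], [60, 40])
print([round(u, 1) for u in usage], ok)  # -> [13.5, 9.5] True
```

The same check shows why a 40GB model would not fit this pair under a 60/40 split: the 4060 Ti's share alone would exceed its 16GB.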