Comments
@garylai5174 1 day ago
Nice video, thanks for this! I could be wrong, but one potential error I see: in this video, you said that "You can't do KV cache because you change the embeddings with every token you add." I don't think this is necessarily true, at least not for decoder architectures like GPTs. The previous tokens don't attend to the new tokens; they only attend to tokens to their left (there's a causal mask). When you add a new token, the relative positions between the previous tokens don't change. For example, if you add a 6th token to a sequence, the distance between token 1 and token 4 hasn't changed at all; therefore, the KV cache is still valid. It seems to me that yes, relative position embedding is inefficient, but not because it invalidates the KV cache; rather, it's because every time we add a new token, it needs to attend to all previous tokens twice: once for the regular attention calculation, and once for the relative positional embedding.
@EfficientNLP 1 day ago
Yes, that is correct. The KV cache can still be used in T5 relative positional embeddings, but it is less efficient because the relative position needs to be recalculated - so this is an extra step that cannot be cached, making the KV cache not as effective compared to absolute positional embeddings.
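For intuition, here is a minimal toy sketch of why the cache works with absolute positions (illustrative only: a single head, 2-dimensional vectors, and identity "projections"; real models use learned projections and many heads):

```python
import math

# Toy single-head causal attention with a KV cache.
def attend(q, keys, values):
    # softmax over scaled dot-product scores, then weighted sum of values
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(len(values[0]))]

k_cache, v_cache = [], []
outputs = []
for token_vec in [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]:
    # With absolute positions, K/V for earlier tokens never change, so we
    # only append the new token's K and V and reuse everything cached.
    k_cache.append(token_vec)
    v_cache.append(token_vec)
    outputs.append(attend(token_vec, k_cache, v_cache))
print(len(k_cache))  # 3 cached key vectors after 3 decoding steps
```

With T5-style relative positions, the attend step would additionally need a per-pair position bias recomputed for every new query, which is the extra uncached work mentioned above.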
@bryonwhite6359 1 day ago
I am Viet-Teochew, so our accents and words are a bit different. It feels weird to hear someone else speak Teochew with a different accent, as I've only ever heard my family speak it, haha. An example, I think: the word "to like" is 哈 hah.
@EfficientNLP 1 day ago
Yea, for sure - there are a lot of accents of Teochew! The one spoken in this video is the Raoping (饶平) dialect.
@bryonwhite6359 1 day ago
I am Teochew nang! But my family speaks a Vietnamese Teochew, so our accent is a bit different and some words are different.
@sp5394 6 days ago
Thank you very much. Great video! Clear, concise and yet covers most of the necessary details.
@forrest-forrest 12 days ago
Amazing. Some of my colleagues work on KV cache, and this video was a great introduction to the topic. Thank you!
@DurgaNagababuMolleti 13 days ago
Superb
@himsgpt 14 days ago
Can you make a video on grouped-query attention (GQA) and sliding-window optimization?
@EfficientNLP 14 days ago
Great ideas for future videos. Thanks for the suggestion!
@bonob0123 24 days ago
That was really nicely done. As a non-expert, I feel like I now have a good general idea of what a quantized model is. Thank you!
@laurentiupetrea3726 25 days ago
Finally! This was my 4th video on the topic, and I was lost, but this one did the trick!
@davidlee327 25 days ago
dude you are the mf goat
@seanyong1123 25 days ago
Amazing video! I was already really blown away when using Whisper for the first time in Cantonese; I was surprised that it was able to work even on a "dialect" of Chinese. Seeing it further fine-tuned to other dialects really shows how well the model scales with new data.

Also, great work on building out the pipeline for the Teochew dataset. I never would have thought to use those dialect TV shows as the base. It brings me back to the times when I'd hang out at my grandpa's place and he'd be watching Hokkien dramas.

It'd be interesting to see if training a separate class for Hokkien would improve both the Hokkien and Teochew performance of the model, since they're supposedly similar. I'm not a speaker of either, but at least from what I know from my mom, who's a native Hokkien speaker, it would seem that the two dialects are rather mutually intelligible. Perhaps the model might be able to pick up on both and hence have "more data" to work with.
@EfficientNLP 25 days ago
That's awesome! I think the models will improve quickly, and soon they will be able to speak your family's languages!
@christospapadopoulos7894 29 days ago
Nothing really new about this, it seems that big tech companies really do have it easier when publishing research
@EfficientNLP 29 days ago
That’s the way it tends to go! One small step at a time
@chaidaro 29 days ago
I haven't read the paper yet, but my understanding is that we sample from q(x) - p(x) because we want the most surprising token, the one the draft model does not anticipate. It should maximize the entropy, but then it should have a log in the equation; anyway, I've got to read the paper to understand the math.
@igorfilippov1221 1 month ago
Very clear explanation, thank you!
@mohitlamba6400 1 month ago
Thanks a ton for this crisp and precise explanation of why we use caching in transformers.
@abdohm809 1 month ago
Great video! Very helpful. I did the same for Moroccan dialect Arabic. I have a question: how did you make the tool with the histogram and the search box?
@EfficientNLP 1 month ago
That part I built in Streamlit. It's an easy way of spinning up a quick UI in Python.
@abdohm809 1 month ago
@@EfficientNLP I see, thanks. Did you extract the data from the TensorBoard TFEvent file?
@EfficientNLP 1 month ago
Not quite - the Streamlit visualization is separate from TensorBoard, which visualizes the training run as it progresses.
@abdohm809 1 month ago
@@EfficientNLP I see, thanks a lot
@akshaydevkarama3277 1 month ago
Great explanation, really helped me!
@andybrice2711 1 month ago
I picture this like mapping out a vector space in lower resolution by using a tree structure.
@yuanhu6031 1 month ago
Excellent video, great high level overview!
@voncolborn9437 1 month ago
Great video. Now I understand the importance of 'time to first token'. I like the short ones that are to the point on a topic; learning in smaller chunks works well for me. Thanks!
@nmstoker 1 month ago
Another great video! I have a question: is RVQ solely for compression, or could one conceivably do some processing of an RVQ code, operating on it as a representation of the data rather than on the uncompressed data? E.g., teach a model to classify sounds based just on the RVQ codes.
@EfficientNLP 1 month ago
Indeed, it is often useful to use quantized representations rather than the original vectors. One example that comes to mind is wav2vec2: it performs product quantization (not quite the same as RVQ, but similar in that it learns multiple discrete codebooks). It uses a masked-prediction self-supervised setup, where the model learns to predict the quantized targets, and this works better than predicting the vectors directly.
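For concreteness, here is a toy sketch of product quantization (the 2-D codebooks below are hand-picked for illustration; wav2vec2's actual codebooks are learned and much larger):

```python
# Toy product quantization: split a 4-dim vector into two halves and
# quantize each half with its own codebook.
codebooks = [
    [(0.0, 0.0), (1.0, 1.0)],   # codebook for dims 0-1
    [(0.0, 1.0), (1.0, 0.0)],   # codebook for dims 2-3
]

def pq_encode(vec, codebooks):
    indices = []
    for g, cb in enumerate(codebooks):
        chunk = vec[2 * g: 2 * g + 2]
        # pick the nearest codebook entry by squared Euclidean distance
        indices.append(min(range(len(cb)),
                           key=lambda i: sum((a - b) ** 2 for a, b in zip(cb[i], chunk))))
    return indices

print(pq_encode([0.9, 1.1, 0.2, 0.8], codebooks))  # [1, 0]
```

Unlike RVQ, which quantizes the same full vector repeatedly against successive residuals, product quantization splits the dimensions into groups and quantizes each group independently.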
@andreacacioli2612 1 month ago
Hey there, I am trying to reach out to you via email; could you please check? Anyway, here is my question: why does EnCodec's encoder output 75 frames of 128 dimensions per second? I mean, don't convolutions always just reduce dimensionality; why do they increase it? I would expect a single array with fewer elements in the time dimension. Could you please help? Thank you.
@EfficientNLP 1 month ago
Typically, when convolution layers reduce the dimension on the temporal axis, the dimension is increased by a similar amount on the channel axis. This way, the information is represented differently rather than being lost.
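As a rough sketch of the arithmetic (the 24 kHz sample rate and stride factors 2, 4, 5, 8 are my reading of the EnCodec configuration; treat them as assumptions):

```python
# Each conv stage divides the time axis by its stride while the channel
# count grows toward the final 128-dim embedding.
sample_rate = 24_000
strides = [2, 4, 5, 8]

total_stride = 1
for s in strides:
    total_stride *= s             # overall shrink factor on the time axis

frames_per_second = sample_rate // total_stride
print(total_stride, frames_per_second)  # 320x downsampling leaves 75 frames/s
```

So one second of 24,000 scalar samples becomes 75 frames of 128 numbers each: fewer time steps, but wider vectors at each step.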
@_XoR_ 1 month ago
I thought about using Voronoi-cell nearest-neighbor lookup for compressing latent spaces myself, but I also thought that some processes that generate the latent-space centroids of interest could benefit from weighted Voronoi tessellations / power diagrams, where, depending on the density of points or other features, we could weight a particular cell to make it more relevant.
@EfficientNLP 1 month ago
That's an interesting idea, and I don't know if it's been used in speech vector compression. You would require some additional space to store the weights of Voronoi cells in a weighted Voronoi tessellation, so it may or may not be as effective as using this space to do more rounds of RVQ.
@TuanPham-fc2oy 1 month ago
Nicely done!
@einsteinsapples2909 1 month ago
If you turn your voice tool into an extension that can work on any web page in Chrome, I would be interested. The way it is now can be helpful, but I have better alternatives; for example, I can just use ChatGPT's speech-to-text feature, which is very good.
@EfficientNLP 1 month ago
Great point. We are currently developing a voice writer Chrome extension, and it will be available soon!
@akhileshgotmare9812 1 month ago
Isn't batch decoding a bit impractical to assume when estimating the KV cache footprint of OPT-30B? I'd say for bsz = 1 (online decoding) it is still not that significant, ~1.4 GB.
@EfficientNLP 1 month ago
The ideal batch size depends on the size of the model and the memory available in your GPU hardware. You are correct that the KV cache would not take up much memory in the case of a batch size of 1; however, it would result in poor throughput and would not utilize the parallelism capabilities of the GPUs.
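As a back-of-envelope check (assuming OPT-30B's published shape of 48 layers and hidden size 7168, with fp16 cache entries and a 1024-token context; these numbers come from the OPT paper, not the video):

```python
# KV cache bytes = 2 (K and V) * layers * hidden_size * bytes_per_value
#                  * sequence_length * batch_size
layers, hidden, fp16_bytes = 48, 7168, 2
per_token = 2 * layers * hidden * fp16_bytes      # ~1.4 MB per token per sequence
seq_len = 1024

for batch in (1, 32):
    gb = per_token * seq_len * batch / 1e9
    print(f"batch={batch}: {gb:.1f} GB")  # ~1.4 GB at batch 1, ~45.1 GB at batch 32
```

This is why the cache is modest for online decoding but becomes a dominant memory cost once you batch requests for throughput.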
@wolpumba4099 1 month ago
*What is RVQ?*
* RVQ is a technique to compress vectors (like audio embeddings) into a few integers for efficient storage and transmission.
* It achieves higher fidelity than basic quantization methods, especially at low bitrates.

*How RVQ Works:*
1. *Codebook Quantization:* A set of representative vectors called "codebook vectors" is learned. Each vector is mapped to the closest codebook vector and represented by its index.
2. *Residual Calculation:* The difference between the original vector and the chosen codebook vector is calculated (the "residual vector").
3. *Iterative Quantization:* The residual vector is further quantized using a new codebook, and a new residual is calculated. This process repeats for multiple iterations.
4. *Representation:* The original vector is represented by a list of indices, each corresponding to a chosen codebook vector from a different iteration.

*RVQ in EnCodec (An Audio Compression Model):*
* EnCodec uses RVQ to compress audio embeddings, achieving good quality even at low bitrates (around 6 kbps).
* The number of RVQ iterations controls the bitrate/quality trade-off.

*Learning Codebook Vectors:*
* Initially, K-means clustering can be used to find good codebook vectors.
* For better performance, codebook vectors are fine-tuned during model training:
* *Codebook Update:* Codebook vectors are slightly moved towards the encoded vectors they represent.
* *Commitment Loss:* The encoder is penalized for producing vectors far from any codebook vector, encouraging it to produce easily quantizable representations.
* *Random Restarts:* Unused codebook vectors are relocated to areas where the encoder frequently produces vectors.

*Key Benefits & Applications:*
* RVQ enables efficient audio compression with smaller file sizes than traditional formats like MP3.
* It has potential applications in music streaming, voice assistants, and other audio-related technologies.

(I used Gemini 1.5 Pro to summarize the transcript.)
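The iterative scheme above can be sketched in a few lines of Python (toy scalar codebooks, hand-picked for illustration; real RVQ learns vector codebooks):

```python
def nearest(codebook, x):
    # index of the codebook entry closest to x
    return min(range(len(codebook)), key=lambda i: abs(codebook[i] - x))

def rvq_encode(x, codebooks):
    indices, residual = [], x
    for cb in codebooks:
        i = nearest(cb, residual)
        indices.append(i)
        residual -= cb[i]          # the leftover is quantized at the next stage
    return indices

def rvq_decode(indices, codebooks):
    # reconstruction is the sum of the chosen entries across stages
    return sum(cb[i] for cb, i in zip(codebooks, indices))

# Two stages: a coarse codebook, then a finer one for the residual.
codebooks = [[-1.0, 0.0, 1.0], [-0.2, -0.1, 0.0, 0.1, 0.2]]
idx = rvq_encode(0.87, codebooks)
print(idx, rvq_decode(idx, codebooks))  # two small indices reconstruct ~0.9
```

Each extra stage adds a few bits per frame and shrinks the reconstruction error, which is exactly the bitrate/quality dial described above.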
@EkShunya 1 month ago
😄
@siddharthvj1 1 month ago
How can I connect with you?
@EfficientNLP 1 month ago
I'm active on LinkedIn! Link in my profile.
@cherryfan9987 1 month ago
great video. thanks
@roomo7time 1 month ago
your explanation is amazing. thank you for your work
@user-qo7vr3ml4c 1 month ago
Great summary, thank you.
@ricardokullock2535 1 month ago
And if one were to quantize a distilled model? Is the outcome any good?
@EfficientNLP 1 month ago
Yes, these two techniques are often used together to improve efficiency.
@Nishant-xu1ns 1 month ago
excellent video sir
@sasagamershepi 1 month ago
Is that your wife's voice at the end of the demo?
@EfficientNLP 1 month ago
Indeed it is! She is a native speaker and I am not.
@AmirMahmoudi-je2pu 1 month ago
Nice video, and great voice writer! I have tried implementing it with the Transformers.js package and its Whisper model, but no luck yet, since the processing is heavy.
@EfficientNLP 1 month ago
There are a number of things you can do to speed up the whisper model. Some backends are more optimized depending on your hardware; faster-whisper is a popular one. You can also try smaller models: "base" is a good tradeoff that sacrifices some quality for better performance.
@laulinky334 1 month ago
Thanks for sharing. I am wondering how the target model checks the generated tokens of the draft model and produces the probability distribution q(x) for each token?
@EfficientNLP 1 month ago
This is due to the parallel nature of transformers: when given a sequence of tokens, they can produce the logits for all positions in a single forward pass, unlike generation, which must be done autoregressively.
@saramoeini4286 1 month ago
Hi, thanks for your video! If my encoder produces a series of tags for each word in the input sentence, and I want to use those tags to generate text that is correct based on the input and the encoder's generated tags, how can I use a decoder for this?
@EfficientNLP 1 month ago
I don't know of any model specifically designed for this, but one approach is to use a decoder model, where you can feed the text and tags in as a prompt (you may experiment with different ways of encoding this and see what works best).
@saramoeini4286 1 month ago
@@EfficientNLP Thank you.
@mslc22 1 month ago
Teochew also has subdialects, like many other dialects; Hakka, for example, also has many subdialects.
@zhuoxinzhan6896 2 months ago
Awesome project and well-explained talk! I am also a CS student from Teochew and have learned a lot from the video. 👍
@duytdl 2 months ago
I was talking to ChatGPT about why it only produces one token at a time, and it suggested I should check out "non-autoregressive models".
@vukrosic 2 months ago
Thank you for explaining it!
@MuhammadAli-dw7mv 2 months ago
nicely done
@RoyAAD 2 months ago
Awesome.
@weekendwarrior7933 2 months ago
Absolutely amazing explanation! Keep it up man
@Basant5911 2 months ago
Made very simple! One more variable, though, is choosing the right draft model: if one chooses a draft model that is too far from the larger model's distribution, that's also a problem.
@EfficientNLP 2 months ago
If the draft model is far from the target model's distribution, then speculative decoding will be less effective because it will have a higher rejection rate, thus reducing the speedup. However, the algorithm guarantees that the output sequence will be identical; therefore, even if the draft model is of poor quality, the text generation quality will not be affected.
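For the curious, the acceptance rule behind this identical-output guarantee can be sketched as follows (toy 3-token vocabulary with made-up distributions; p is the draft model and q the target, matching the notation in the comments above):

```python
import random

random.seed(0)

p_draft  = [0.70, 0.20, 0.10]   # draft model's distribution p(x) (made up)
q_target = [0.50, 0.40, 0.10]   # target model's distribution q(x) (made up)

def speculative_step():
    # 1. Draft a token from p.
    # 2. Accept it with probability min(1, q/p).
    # 3. On rejection, resample from the normalized positive part of (q - p).
    token = random.choices(range(3), weights=p_draft)[0]
    if random.random() < min(1.0, q_target[token] / p_draft[token]):
        return token
    diff = [max(0.0, q - p) for p, q in zip(p_draft, q_target)]
    return random.choices(range(3), weights=diff)[0]

# The resulting distribution matches the target q exactly, however poor the
# draft is; a poor draft only raises the rejection rate (and lowers speedup).
samples = [speculative_step() for _ in range(20_000)]
print(samples.count(0) / len(samples))  # close to q_target[0] = 0.5
```

Here token 0 is over-represented by the draft, so it is sometimes rejected; the rejected mass is redirected to token 1, restoring the target probabilities.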
@gnorts_mr_alien 2 months ago
For this to work, the two models need to have identical tokenizations, right? Is there any way around it?
@EfficientNLP 2 months ago
That's right - the two models need to use the same vocabulary so that we can compare their logits meaningfully.
@gnorts_mr_alien 2 months ago
@@EfficientNLP thank you for the quick response. that makes sense!
@murtazanazir9997 1 month ago
Not necessarily. We can retokenize the text predicted by the draft model, though that can be slow.
@hrsight 2 months ago
nice video
@420_gunna 2 months ago
Love your videos, thanks! If I had to give one request/critique, it'd be that I wish there were some slides in here similar to Samuel Albanie's videos: information-dense recaps that could be lifted out of the presentation and put into our notes (or into a PowerPoint for a paper club, or something).
@EfficientNLP 2 months ago
Interesting idea, though my videos often contain animations, drawings, screencasts, etc., and are not directly a recording of PowerPoint slides. Feel free to take screenshots of my videos for any educational purposes though!