Fine-tuning Whisper to learn my Chinese dialect (Teochew)

4,580 views

Efficient NLP

5 months ago

In this video, we train a speech recognition model for the Teochew language, also known as Chaozhou Dialect (潮州话). Teochew, spoken by 10 million people in Southern China, is part of the Min Nan language family and is distantly related to Mandarin and Cantonese. We set up a data pipeline and fine-tune OpenAI's Whisper to understand Teochew, using transfer learning from Mandarin and Cantonese. Check out how we inspect the training using TensorBoard, evaluate model outputs with Streamlit and Gradio, and learn about the linguistics of Teochew.
The model is open source and available: huggingface.co/efficient-nlp/teochew-whisper-medium
0:00 - Intro
0:35 - Basics of Teochew language
4:37 - Data pipeline
9:19 - Whisper model architecture
10:53 - Multitask training format
12:24 - Fine-tuning Whisper
15:52 - TensorBoard visualization
17:48 - Data inspection tool
19:21 - Evaluation and results
22:23 - Comparison with other languages
23:43 - Easy and hard cases
24:58 - Demo sentence 1
26:25 - Demo sentence 2

Comments: 46
@wolpumba4099 5 months ago
*Summary*

Teochew is a Chinese dialect spoken by 10 million people, yet it has no standardized writing system. The presenter's goal was to build a speech translation tool using OpenAI's Whisper model to translate Teochew speech into Mandarin. To build it, they compiled a dataset from 120 videos containing Teochew audio with embedded Mandarin subtitles. Using PaddleOCR by Baidu, they extracted the subtitles, amassing approximately 35,000 subtitle segments of up to 5 seconds each, for an aggregate speech corpus of about 35 hours. To enrich the corpus, the presenter added clearly enunciated samples voiced by a native Teochew speaker (his wife) and audio from WeChat conversations with family members, annotated with the SIL SayMore tool. The talk gives an in-depth analysis of the internal mechanics of the Whisper model and details the presenter's strategy for the project. He trained the medium model on his personal GPU and rented an A100 GPU for $25 to train the large-v3 model. The best version of the final model achieved a word error rate of approximately 30%, treating each Chinese character as an individual word. The model had difficulty with clips longer than 5 seconds and was less successful when the number of characters representing a word differs between Mandarin and Teochew.

- 0:00 Introduction: fine-tuning the OpenAI Whisper model for a specific family dialect of Chinese.

*Basics of Teochew Language*
- 0:36 Teochew (also known as "deu") is a dialect from the Minnan language family with around 10 million speakers.
- 0:55 Teochew is closely related to Hokkien but is only slightly mutually intelligible with it.
- 1:13 Teochew is considered one of the most conservative Chinese dialects.
- 1:47 The goal is transfer learning from Mandarin to Teochew, using Mandarin as a high-resource language to assist in model training.
- 2:20 Similarities and differences between Mandarin and Teochew in syntax, phonology, and vocabulary.

*Linguistic Features and Writing System*
- 3:26 Teochew is primarily a spoken language with no standardized writing system.
- 3:33 Different writing methods use Chinese characters to represent Teochew sounds.
- 4:27 The system takes Teochew speech as input and outputs Mandarin Chinese text.

*Data Pipeline*
- 4:37 Teochew is a low-resource language with no readily available datasets for model training.
- 5:05 Old movies and TV shows with Mandarin subtitles are used as a data source.
- 5:48 Videos are manually inspected to find ones suitable for data extraction.
- 6:11 Optical character recognition (OCR) extracts text from video subtitles.
- 6:29 PaddleOCR is found to be the most effective OCR tool (a sketch follows this summary).
- 7:11 The OCR process groups subtitles from consecutive frames into segments.
- 8:05 The pipeline includes cropping, OCR, segment grouping, audio clipping, deduplication, and a train-test split.
- 9:04 The resulting dataset comprises about 35,000 segmented clips, totaling approximately 35 hours of speech.

*Whisper Model Architecture*
- 9:19 Whisper uses an encoder-decoder Transformer architecture with audio as input.
- 9:31 Input audio is standardized to 30 seconds in length.
- 10:23 Whisper supports around 100 languages and is pre-trained on 680,000 hours of data.
- 10:36 The model comes in different sizes, with the large model having about 1.5 billion parameters.

*Multitask Training Format*
- 10:55 Whisper's training uses a multitask format with control tokens for language identification and task specification.
- 11:37 The model can transcribe or translate speech, with an option to include timestamps in the output.
- 12:03 For the Teochew language, the Chinese language token is used in lieu of a specific Teochew token.

*Fine-tuning Whisper*
- 12:24 Fine-tuning Whisper is facilitated by scripts and instructions provided by Hugging Face from a community event.

*Fine-tuning Process and Tools*
- 12:51 Hugging Face provides instructions and a community event for fine-tuning Whisper.
- 13:26 The `run_speech_recognition_seq2seq_streaming` script is a comprehensive tool for fine-tuning Whisper.
- 13:58 The data loading code was modified for simplicity.
- 14:16 Recommended training configurations are given per GPU memory size.

*Optimization and Training*
- 14:31 The speaker's GPU setup is relatively small, with 12 GB of memory.
- 14:52 Larger GPUs allow larger batch sizes and faster training.
- 15:21 The 8-bit Adam optimizer allows training larger models on smaller GPUs (a sketch follows this summary).
- 15:54 TensorBoard is used for training visualization.
- 16:01 Training the small model for 10 epochs took approximately 20 hours on the speaker's GPU.
- 16:24 Training the medium model takes about twice as long per epoch.
- 16:38 The large model, four times the size of the small model, would take significantly more time and compute.

*Training Visualization and Metrics*
- 15:56 TensorBoard shows the training duration per epoch and the learning rate schedule.
- 17:08 The training loss curve indicates when to stop training to avoid overfitting.

*Data Inspection and Debugging*
- 17:48 A Streamlit tool was built to debug the data pipeline and visualize the model's performance.
- 18:01 The tool includes a histogram of the word error rate and a search function to check the model's understanding of specific words.

*Evaluation Setup and Results*
- 19:22 Two evaluation sets represent careful and conversational speech.
- 19:36 The careful-speech condition involved slow, clear reading of phrases.
- 19:54 The conversational-speech condition involved recording natural conversations.
- 20:21 Word error rate was calculated treating each Chinese character as a separate word.
- 20:37 The untrained Whisper medium model performed poorly on the Teochew dialect.
- 21:12 Training on 35 hours of data improved performance significantly, with the medium model outperforming the small model.
- 21:37 The large model showed only marginal improvement despite its size and additional pre-training data.

*Comparison with Other Languages*
- 22:24 Word error rates are compared against the amount of training data available for different languages.
- 23:02 The speaker's results align with expectations given the available training data.

*Analysis of Easy and Hard Cases for the Model*
- 23:46 Words that are similar to Mandarin or high frequency are easier for the model to learn.
- 24:19 Words that differ from Mandarin and are uncommon are harder for the model.
- 24:42 Longer clips tend to get cut off because the model was trained on shorter data.

*Interactive Demo with Gradio*
- 25:03 The speaker references a Hugging Face blog post that includes a Gradio script for a UI demo (a sketch follows this summary).
- 25:21 The speaker's wife, a native Teochew speaker, tests the model on some sentences.

*Model Performance on Test Sentences*
- 25:40 The model performs well on a sentence about eating spicy food in Chongqing, with only a minor error on the word for "spicy."
- 26:26 A second sentence about traffic due to snow reveals the model's struggles with certain words and phrases.

*Conclusion and Invitation for Feedback*
- 27:48 The speaker concludes the video and invites feedback in the comments.
- 27:57 Plans to open-source the model are mentioned.
- 28:01 A call to action: like the video, subscribe, and turn on notifications for new machine learning content.

Disclaimer: I used ChatGPT-4 to summarize the video transcript. This method may make mistakes in recognizing words.
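To make the subtitle-OCR step concrete, here is a minimal sketch of that kind of extraction loop. It is reconstructed from the video's description, not the actual pipeline code (which is unreleased); the frame sampling rate, crop fraction, and the PaddleOCR 2.x result format are assumptions.

```python
import cv2
from paddleocr import PaddleOCR

ocr = PaddleOCR(lang="ch")  # Chinese text recognition model

def extract_subtitles(video_path, sample_every=10, crop_top=0.85):
    """Yield (timestamp_sec, subtitle_text) from the bottom strip of sampled frames."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % sample_every == 0:
            h = frame.shape[0]
            band = frame[int(h * crop_top):, :]  # crop the subtitle band
            result = ocr.ocr(band)
            if result and result[0]:  # result[0] holds the detected lines
                text = "".join(line[1][0] for line in result[0])
                yield frame_idx / fps, text
        frame_idx += 1
    cap.release()

# Consecutive frames with identical text would then be grouped into one
# segment, and the audio clipped to that segment's start/end timestamps.
```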
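The 8-bit Adam trick mentioned at 15:21 can be reproduced with the bitsandbytes library; a minimal sketch (the learning rate and model size are illustrative, not the author's settings):

```python
import bitsandbytes as bnb
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-medium")

# Optimizer states are stored in 8 bits instead of 32, roughly halving
# optimizer memory; this is what lets a medium model train on a 12 GB GPU.
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-5)
```

With the Hugging Face trainer, the same effect comes from passing `optim="adamw_bnb_8bit"` in the training arguments.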
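And the Gradio demo pattern from the Hugging Face blog post boils down to a few lines; a sketch against the released model (recent Gradio API assumed, not the author's exact script):

```python
import gradio as gr
from transformers import pipeline

asr = pipeline("automatic-speech-recognition",
               model="efficient-nlp/teochew-whisper-medium")

def transcribe(audio_path):
    # The pipeline handles resampling to 16 kHz and 30-second padding
    return asr(audio_path)["text"]

gr.Interface(
    fn=transcribe,
    inputs=gr.Audio(sources=["microphone", "upload"], type="filepath"),
    outputs="text",
    title="Teochew speech to Mandarin text",
).launch()
```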
@nmstoker 5 months ago
Beautifully explained! You struck the perfect balance: no dumbing down, and you clearly explained how things work and why you made particular choices. Thank you! 👍
@seanyong1123 25 days ago
Amazing video! I was already blown away when using Whisper for the first time in Cantonese; I was surprised it worked even on a "dialect" of Chinese. Seeing it fine-tuned further to other dialects really shows how well the model scales with new data. Also, great work on building out the pipeline for the Teochew dataset. I never would have thought to use those dialect TV shows as the base. It brings me back to the times when I'd hang out at my grandpa's place and he'd be watching Hokkien dramas. It would be interesting to see if training a separate class for Hokkien would improve both the Hokkien and Teochew performance of the model, since they're supposedly similar. I'm not a speaker of either, but from what I know from my mom, who's a native Hokkien speaker, the two dialects are rather mutually intelligible. Perhaps the model could pick up on both and hence have "more data" to work with.
@EfficientNLP 25 days ago
That's awesome! I think these models will improve quickly, and soon they will be able to speak your family's languages!
@_-JR01 5 months ago
Awesome video! You explained everything thoroughly but kept it easy to follow. Thank you for sharing your process.
@GonzaloAguilarDelgado 5 months ago
Amazing, best explanation ever! Even though I don't plan to train models for languages, it's delightful to watch. Chinese languages, training, models, verification, how to build training sets... everything is in this video! Thanks a lot for sharing. Will watch more. And learn more...
@zhuoxinzhan6896 2 months ago
Awesome project and well-explained talk! I am also a CS student from the Teochew region. I have learned a lot from the video. 👍
@YuNherd 5 months ago
Thank you for this demo. This helps AI recognize other uncommon languages.
@user-qq9jn3ez3x 5 months ago
I'm from the Chaoshan region. Many people here, especially young people, can't speak fluent Teochew; there are some words I can say in English but not in Teochew. The intonation of Teochew also differs a lot from region to region. This project is very helpful for learning Teochew.
@wolpumba4099 5 months ago
I enjoyed this talk.
@goldenshemesh 5 months ago
👏 Brilliant video, and a smart way to get training data and create a low-cost model for a niche language... hope it helps you in understanding your partner's family :)
@cnwonge 4 months ago
Very inspiring video. I've been looking for a method to improve Cantonese ASR accuracy with Whisper... this is the one.
@jiaqicai4858 5 months ago
My mom's family speaks Teochew and my dad's speaks Cantonese, so I can quite relate. Great work on the model training and the video! Learned a lot from the video as always, and please consider making a public repo. I'd like to try to contribute if I can haha
@EfficientNLP 5 months ago
We have just open-sourced the model -- link in the description!
@hw5622 5 months ago
Amazing! I speak Teochew too. This will be amazing for speaking with my parents.
@jachinlan5702 5 months ago
胶地人吗? (roughly: "A fellow Teochew?") It's amazing that a machine can learn such a language.
@user-mx6wj1ls6l 5 months ago
Appreciate this explanation, gaginang. Please share when it becomes open source! Would love to contribute!
@EfficientNLP 5 months ago
We have just open-sourced the model -- link in the description!
@haowang982 5 months ago
Very helpful tutorial! Can you share your data collection and training scripts? Thanks for sharing!
@EfficientNLP 5 months ago
We will open-source the model soon! But we cannot release the data or the data pipeline scripts, since they may contain copyrighted material.
@bryonwhite6359 1 day ago
I am Viet-Teochew, so our accents and words are a bit different. It feels weird to hear someone else speak Teochew with a different accent, as I've only ever heard my family speak it haha. An example, I think: the word "to like" is 哈 hah
@EfficientNLP 1 day ago
Yeah, for sure - there are a lot of accents of Teochew! The one spoken in this video is the Raoping (饶平) dialect.
@waynelau3256 5 months ago
Very interesting! I am Teochew as well, but I don't speak it well because Hokkien is more common in my country, so I ended up mixing them up and learning Hokkien more haha
@EfficientNLP 5 months ago
Interesting! Hokkien is quite closely related to Teochew, and it has some datasets that we are looking into leveraging for further transfer learning.
@waynelau3256 5 months ago
@EfficientNLP Yes! But isn't it difficult to source data for these low-resource languages?
@EfficientNLP 5 months ago
It is, yeah, but Hokkien is a slightly higher-resource language (~25M speakers), both in NLP corpora and in usable data like TV shows.
@yeqinghuang 2 months ago
Awesome share! Could you show your fine-tuning dataset format? Many thanks.
@EfficientNLP 2 months ago
The data used for fine-tuning consists of audio segments a few seconds in duration, each paired with the corresponding translation in Mandarin Chinese.
@yeqinghuang 2 months ago
@EfficientNLP Thanks. So there's no need to create a new dialect token when fine-tuning on the Chaozhou dialect? Just fine-tune it as the Chinese language?
@EfficientNLP 2 months ago
That's right - there is no way to create a new token for Teochew, so we just pretend it is Mandarin Chinese.
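For illustration, a minimal inference sketch of what that looks like: the Chinese language token is forced at decoding time via `get_decoder_prompt_ids` (the audio path is a placeholder, and the exact API may vary across transformers versions):

```python
import librosa
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("efficient-nlp/teochew-whisper-medium")
model = WhisperForConditionalGeneration.from_pretrained("efficient-nlp/teochew-whisper-medium")

# Load audio at the 16 kHz sampling rate Whisper expects
audio, _ = librosa.load("teochew_clip.mp3", sr=16000)  # placeholder path
input_features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features

# Force the decoder prefix <|zh|><|transcribe|>: there is no Teochew token,
# so the Chinese one stands in for it
forced_decoder_ids = processor.get_decoder_prompt_ids(language="zh", task="transcribe")

with torch.no_grad():
    predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])
```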
@MW-dg7gl 3 months ago
Can you share your code for extracting the data from the TV shows and how you split it? What programs and software did you use?
@EfficientNLP 3 months ago
We do not plan to release the source code of the data pipeline, as it consists of our own custom code combined with popular open-source libraries. The video should contain enough detail to reproduce it if desired. Let me know if you would like clarification on any part of it!
@abdohm809 1 month ago
Great video! Very helpful. I did the same for Moroccan dialect Arabic. I have a question: how did you build the tool with the histogram and the search box, please?
@EfficientNLP 1 month ago
That part I built in Streamlit. It's an easy way of spinning up a quick UI in Python.
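Roughly, such a tool can look like this (a simplified sketch of the idea; the CSV file and its column names are illustrative, not the exact schema used):

```python
# inspect.py - run with: streamlit run inspect.py
import numpy as np
import pandas as pd
import streamlit as st

# Assumed schema: one row per evaluation clip
df = pd.read_csv("eval_results.csv")  # columns: clip, reference, hypothesis, wer

st.header("Word error rate distribution")
counts, edges = np.histogram(df["wer"], bins=20)
st.bar_chart(pd.Series(counts, index=edges[:-1].round(2)))

query = st.text_input("Search transcripts for a word")
if query:
    mask = (df["reference"].str.contains(query, na=False)
            | df["hypothesis"].str.contains(query, na=False))
    st.dataframe(df.loc[mask, ["clip", "reference", "hypothesis", "wer"]])
```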
@abdohm809 1 month ago
@EfficientNLP I see, thanks. Did you extract the data from the TensorBoard TFEvent file?
@EfficientNLP 1 month ago
Not quite - the Streamlit visualization is separate from TensorBoard. TensorBoard visualizes the training run as it progresses.
@abdohm809 1 month ago
@EfficientNLP I see, thanks a lot
@wolpumba4099 5 months ago
At 17:43, why don't you plot the loss for the test dataset to be sure you are not overfitting?
@EfficientNLP 5 months ago
Yup, that is the correct way to do it, and it's what we did internally for model selection. We found that the test accuracy often improves even past the point where the training curve seems to overfit (this phenomenon is common in machine learning and is known as benign overfitting).
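For anyone reproducing this, the Hugging Face trainer can log an eval curve alongside the training loss so both show up in TensorBoard; a sketch with illustrative values (in newer transformers releases the argument is named `eval_strategy`):

```python
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="./whisper-teochew",
    evaluation_strategy="steps",  # also run the eval set periodically
    eval_steps=500,
    logging_steps=100,
    report_to=["tensorboard"],    # both curves end up in TensorBoard
)
```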
@codor5745 4 months ago
Hi, a quick question: alongside the MP3 audio files, what type of file did you save your transcriptions in? VTT? TSV? TXT?
@EfficientNLP 4 months ago
In my implementation, I have a CSV file with all the file paths to the MP3 clips and their text transcriptions. This allows my data loader to easily load random samples from different videos into a batch.
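A simplified sketch of loading such a manifest with the datasets library (the file name and the "path"/"text" column names are assumptions for illustration):

```python
import pandas as pd
from datasets import Dataset, Audio

df = pd.read_csv("manifest.csv")  # columns: path (to MP3 clip), text (Mandarin)
ds = Dataset.from_pandas(df).cast_column("path", Audio(sampling_rate=16000))
ds = ds.shuffle(seed=42)  # mixes clips from different videos into each batch

sample = ds[0]
print(sample["path"]["array"].shape, sample["text"])
```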
@sasagamershepi 1 month ago
Is that your wife's voice at the end of the demo?
@EfficientNLP 1 month ago
Indeed it is! She is a native speaker and I am not.
@hw5622 5 months ago
Hello again! Nice job! I have tried the model. In the code `predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)`, the forced_decoder_ids causes a bug, and I have to remove it for the `generate` method to work.
@EfficientNLP 5 months ago
Interesting -- removing the forced_decoder_ids may impact the model's performance, as it is responsible for telling the model that the language is Chinese. Could you please open a thread in the Discussions on the model's page? Please mention the versions of the Hugging Face libraries you are using and the specific error you're getting. Thanks!
@hw5622 5 months ago
@EfficientNLP Of course, will do later tonight.