@SnorkelAI Realistic LLM expectations and a glimpse into the future
@vivekpadman5248 1 month ago
Is this approach used at all three levels of training: base, instruct, and chat fine-tuning? And are there different things to consider for each?
@SnorkelAI 1 month ago
I'm not 100% clear on your question. Are you referring to pre-training, fine-tuning and alignment? If so, this approach could be used on fine-tuning and/or alignment. It could also theoretically be used on pre-training, but I suspect that would yield poor results.
@vivekpadman5248 1 month ago
@SnorkelAI yes that was exactly my question, thanks 😊. I have one follow-up question here. Why do you think it would yield poorer results in the pre-training phase? Any insights on that? And in that case, what kind (size and arch) of pretrained student model should be used with a specific teacher LLM, or would anything work?
@SnorkelAI 1 month ago
Sorry for the slow reply here. KZfaq didn't surface your reply comment the same way it did your initial comment. We're getting a bit outside the bounds of what can be reasonably answered within a KZfaq comment, but I think we can reasonably say this: Distilling a model means using its output to train a smaller model. For pre-training, that would mean creating an immense volume of raw generated outputs from the parent model. Several studies have shown that pre-training generative models on other models' generated output tends not to work so well. We don't yet fully understand why, but we understand that it is a questionable practice at present.
@vivekpadman5248 1 month ago
@SnorkelAI no worries man, getting such a nice detailed reply is all that matters. Ah, understood it properly now. Also, I guess the limits of the parameter size will come into the picture if we use it for pretraining. Clean data plus synthetic data is available now anyway. Thanks again 😊🙏
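For readers following this thread, the definition given in the reply above ("using its output to train a smaller model") can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: `teacher_label` is a hypothetical keyword rule standing in for an expensive teacher model (no real LLM or Snorkel Flow API is called), and the "student" is a simple word-vote table rather than a neural network.

```python
from collections import Counter, defaultdict


def teacher_label(text):
    """Hypothetical stand-in for a large teacher model (a keyword rule, not an LLM)."""
    lowered = text.lower()
    if any(word in lowered for word in ("great", "helpful", "nice")):
        return "positive"
    return "negative"


def train_student(unlabeled_texts, teacher):
    """Fit a tiny word-vote student using only labels produced by the teacher."""
    word_votes = defaultdict(Counter)
    for text in unlabeled_texts:
        label = teacher(text)  # the teacher's output is the training signal
        for word in text.lower().split():
            word_votes[word][label] += 1
    return word_votes


def student_predict(word_votes, text, default="negative"):
    """Sum the per-word votes; fall back to a default for unseen words."""
    totals = Counter()
    for word in text.lower().split():
        totals.update(word_votes.get(word, {}))
    return totals.most_common(1)[0][0] if totals else default


# Made-up unlabeled examples, for illustration only.
unlabeled_texts = [
    "great talk",
    "very helpful video",
    "audio was bad",
    "bad slides",
    "nice short summary",
]
student = train_student(unlabeled_texts, teacher_label)
print(student_predict(student, "helpful summary"))
print(student_predict(student, "bad audio"))
```

The student never sees a human label; whatever it knows comes from the teacher's outputs on unlabeled text, which is why the same recipe applied at pre-training scale would require generating an immense volume of teacher output.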
@vivekpadman5248 1 month ago
Very nice short informative video. I'm looking to create a distilled model on reasoning tasks for games which could run locally. This will help 😊 thanks
@SnorkelAI 1 month ago
Glad it was helpful!
@clashcodes0855 1 month ago
free?
@SnorkelAI 1 month ago
Included in Snorkel Flow. 😃
@RyluRocky 1 month ago
Well done!
@SnorkelAI 1 month ago
Thanks!
@chakpak 1 month ago
Programmatic labeling of images at scale is so cool. 🎉
@SnorkelAI 1 month ago
We think so too!
@chakpak 2 months ago
Wow! Preprocessing is 🔥🔥
@chandra7599 2 months ago
What does Wayfair do? An intro would be helpful to understand and connect with the content.
@SnorkelAI 2 months ago
Wayfair sells furniture and home goods online. That's good feedback, thanks!
@gobdovan 3 months ago
I came here to understand precisely how the Snorkel software assists with the issue. However, you discussed a general RAG system and mentioned that Snorkel expedited your process in different ways without specifying what Snorkel AI actually does. In your description, you stated, 'We explore how Snorkel Flow accelerated development of[...]'
@SnorkelAI 3 months ago
We're in early days on this kind of video content. The intent of this one was to talk about the case broadly and concisely. Were you looking for more of a product demonstration?
@gobdovan 3 months ago
@SnorkelAI, I appreciate the explanation. I was expecting a product demo based on the video's description, which mentioned exploring *how* Snorkel Flow accelerates development. I'm familiar with Snorkel the package, and the pain of developing labeling functions. The video was recommended to me, so I checked your GitHub for updates. It appears the repo hasn't been updated recently, and your focus seems to have shifted to Snorkel Flow. However, the video did not cover it in detail, and I couldn't find any comprehensive product presentations. Is Snorkel Flow aimed at larger corporations, or will it be available as a SaaS for broader access? Could you recommend any videos that present the product, particularly in the context of creating datasets for ASR/translation? I'm looking for more efficient methods to build such datasets.
@SnorkelAI 3 months ago
What you surmised is correct. Snorkel Flow is currently aimed at large enterprises. We will likely have more product demos heading for the channel soon. In the meantime, you can watch this one, which has a bit of product demo. kzfaq.info/get/bejne/kJicfJx_kpuxfoU.html You can also sign up for a product demo here: snorkel.ai/demo/
@arazmalek887 5 months ago
Thanks for the information, but listening to you talking like: 'aaaaaaa eeeeee anddddddddd' was really frustrating
@420_gunna 5 months ago
Thank you Snorkel for putting this channel together! All of your videos + guests have been compact and informative -- really good brand marketing, I think.
@420_gunna 5 months ago
When you talk about distillation requiring large, unlabeled datasets... to be clear for my understanding, it's not necessarily that they're unlabeled data, it's more like we don't care about the dataset's labels, and instead use the teacher model's output distribution as the replacement pseudolabel. I guess you COULD create a distilled model by training against some data distribution that the teacher wasn't itself trained on... but I can't imagine why you would want to do that 😄
@SnorkelAI 3 months ago
Sort of. Typically, you would use this for data that is, in fact, unlabeled: think sections of contracts or paragraphs from textbooks. You could also employ this approach for data that has labels that don't fit your desired schema, in which case your statement of "we don't care about the dataset's labels" would be 100% correct. As for your second comment, there could be a number of reasons you may want to do that. Perhaps the teacher LLM does quite well on a particular labeling task when given a highly-engineered prompt. This approach would let you transfer that performance into a smaller and cheaper model.
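The idea raised in this thread, using the teacher's output distribution as the pseudolabel, is often expressed as a cross-entropy between temperature-softened teacher and student distributions. The sketch below is a generic textbook-style formulation in plain Python, not Snorkel's implementation; the logits and the `temperature` value are made-up illustration numbers.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; a higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy of the student against the teacher's softened distribution."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(p * math.log(q) for p, q in zip(p_teacher, p_student))


# Hypothetical logits over three classes for a single unlabeled example.
teacher = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]  # roughly agrees with the teacher
far_student = [0.5, 1.0, 4.0]    # disagrees with the teacher

loss_close = distillation_loss(teacher, close_student)
loss_far = distillation_loss(teacher, far_student)
print(loss_close, loss_far)
```

No dataset label appears anywhere in the loss, which matches the point above: the example only needs an input and a teacher prediction, so unlabeled data (or data whose labels do not fit the schema) works equally well.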
@kendwyer1277 5 months ago
Very informative, thanks
@420_gunna 5 months ago
Awesome video! Data-centric AI is really awesome, and is a tractable space for the open source community to work in.
@SnorkelAI 3 months ago
It really is!
@axe863 7 months ago
Complicated Nonstationarity is really horrific for a wide range of methods/models
@riser9644 8 months ago
Link to the blog code or ppt would be good
@lionhuang9209 8 months ago
Where can we get the PPT?
@mechwarrior83 8 months ago
please
@askeletalghost 8 months ago
I simp so hard for Emad
@user-li6vs9xr7i 10 months ago
Hi, thanks for the nice sharing! Could you please provide the slides you use in the video so I can study further?
@yorailevi6747 10 months ago
"Tip 4: Toss out noisy examples. More data is not always better!" should be rephrased: toss out non-decisive/opaque examples while keeping the variability of examples.
@InquilineKea 1 year ago
Can she train on the video data of my entire life?
@InquilineKea 1 year ago
Why does she pattern match so hard with Fred Sala?
@annapurnasolutionsllc6463 1 year ago
Does California Institute of Technology pay women less than men, then, per Anima's comment?
@ayushsharma3148 1 year ago
Hey guys. I want to save this video to my YouTube playlist. Can you please enable the save / add to playlist option?
@NukulSharma 1 year ago
Tried HoloClean on bigger datasets; the tensors just explode out of memory. Any pointers that could help?
@irshviralvideo 1 year ago
Why use AI when you have simpler models that can be easy to explain???
@EuphonicEscapes 1 year ago
It is sad that there is almost no point in using an Apple Pencil any more. Or rather, if you were a digital pencil user... Your job went poof. People simply don't care about digital paintings any more.
@CalvinJKu 1 year ago
This is amazing. KZfaq needs to send more traffic to this!
@noraalturayeif996 1 year ago
Thank you for this great summary! Could you please share the slides?
@avinashmahure281 1 year ago
Thank you for sharing this event.
@lionhuang9209 1 year ago
Very useful!
@faithandherghosts 1 year ago
This is brave, important work. I’m grateful to have happened across this article. Just today, I reviewed news of dismissal of the accusations that members of the Fairfax, VA (US) Police Department protected a sex trafficking circle. The charges were dismissed due to evidence that the accuser (a Jane Doe) had identified as a consenting escort worker in her history of involvement with the individual at the center of the alleged trafficking ring. One thing that occurred to me in thinking about the efficacy of content skimming of sites hosting ads that recruitment hooks may be nested within, is that a lot of trafficking is off-web and involves youth that are trafficked (sold/bought) through in-person processes involving economic incentive pressure, false-choice coercion and direct threats levied against vulnerable people by powerful buyers in trafficking networks, and activities such as kidnapping and in-person drugging of victims, luring and enticement that leads to hostage involvement in the most dangerous networks in human/sex trafficking. Are there identified characteristics of heightened likelihood of trafficking in certain geographic areas - e.g. your mention of high rates of homeless youth being trafficked? Other possible metrics of likelihood might be having a known sex-tourism market, access to ports over-water transport (private boats, larger cargo/industry watercraft), and various socioeconomic measures (poverty increases, changes in industries due to disaster events or climate/seasonal tourism increase or decline/loss of protective factors such as NGOs serving vulnerable women and children, local and regional law enforcement presence being protective of victims or protective of perpetrators, stringency of port records and identification of international visitors, conflict or war-related events…etc. etc. 
When I reflected on the strong recruitment-trafficking line between Australia and wealthy western countries, I wondered about whether that activity may be showing secondary-broker activity relating to the conspicuously absent (non-English speaking, not online) SE Asian and Pacific Island recruitment/victim markets? I’m sure that the investigative authorities are looking into this, especially after the 200+ global child pornography/brutality arrests of last year…and, yet, because some of the buyer-markets in the trafficking economy are exceptionally covert in their activities, are there ways to identify potential likely areas of risk-of-victimization by some array of metrics that could help local/national/intl. authorities put measures in place to discourage trafficking and/or apprehend perpetrators, like port and water-access security cameras w 3rd party monitoring (because local law enforcement is known to be protective of some criminal networks in some places), access to responsive helplines and reporting lines in local languages, in-person safety measures and environmental protective measures in addition to security cameras at points of entry and exits connecting trafficking markets to the rest of the world (overland and by water) - things like lighting or businesses, police kiosks, clearing of possible low-visibility routes of on-foot transit for people who may be kidnapping or buying young people for trafficking, automobile traffic check-points? Similar metrics may be of use to discover new or transitory victim markets in the US - such as the Gulf Coast (which has a large population of SE Asian immigrants and workers in the shrimping and other port-based trade+travel+hospitality industries, as well as a lot of vulnerable people that may become increasingly vulnerable following destabilizing weather events, slow-season economies, increased costs of living, etc. 
South American coastal areas (and inland areas with road/river connections to buyer markets and export routes on both Caribbean and Pacific coasts) that have factors that increase risk of predator activity can be easily accessed by boat and transport to/from the US mainland and secondary transport+sale brokers may not be as heavily investigated as perhaps it ought to be…? I deeply appreciate the work you’ve done on data tracking to help show/predict active trafficking activity. I’m grateful that there are committed intl. investigators working to discover and end sex trafficking and child trafficking. Please be careful out there, because the people involved in some of these networks are very powerful and very dangerous, as you are likely aware. If you ever need someone to contract-work on reviewing and labeling data sets, I have training and experience in content analysis, theme-labeling, rubric-based scaling of similar content to hone specificity and gauge possible secondary/tertiary needs for additional or sub-theme labels. I love that sort of bot-mind work and I am 💯 % on board with the effort to use remotely-available information to shine a spotlight on the nodes and mechanics of online & offline trafficking network activities. Much appreciation - and be safe out there. Godspeed and much protection from all the good in the world.* *this is another way of saying ‘…all those nasty mf’er and all their evil ways are gonna be brought down like a sledgehammer-heavy bolt of lightning hitting the ground, and so they better not f* with anyone ‘cause they got eyes on ‘em from all over the sky.’ Cheers…and thanks for the good thinking and the chance to share thoughts here. 😁
@djethereal99 1 year ago
Paper link?
@SnorkelAI 1 year ago
arxiv.org/abs/2205.02318 here you go
@jonaslandsgesell4322 1 year ago
Nice summary video
@sachinvernekar6711 1 year ago
7:53 The PAC rule doesn't really apply here. What if we are able to label only a few types of easy cases? That means we are not uniformly labelling samples from the original data distribution.
@sinaghotbi 1 year ago
At 28:00, it was not clear to me why the accuracies are independent. Is that based on empirical evidence? Is it a weak assumption?
@astromikael 1 year ago
Great presentation - thank you!
@ayoolafakoya9841 1 year ago
Julien is very awesome
@uncle-millennium 2 years ago
Excellent presentation. Very lucid. Give this lady a raise.
@dermorgendanach93 2 years ago
Holy god!! That's awesome work, thanks for sharing
@charlesmartin7190 2 years ago
Excellent talk
@muhtasham32 2 years ago
Thanks for sharing, been waiting for this. Great work by the authors!!
@sheikhakbar2067 2 years ago
Thanks for compiling the list on GitHub, Sebastian. As an NLP enthusiast I want to know the current state of the art in NLP (English) and how I could use those techniques in Arabic.
@spatiallysaying8130 2 years ago
Tip 1: Make the labels y consistent
Tip 2: Use multiple labelers to spot inconsistencies
Tip 3: Clarify labeling instructions by tracking down ambiguous examples
Tip 4: Toss out noisy examples. More data is not always better!
Tip 5: Use error analysis to focus on subset of data to improve
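As a rough illustration of Tips 2 and 3 in the list above, here is one way labeler disagreement might be used to surface ambiguous examples. It is a minimal sketch, not something shown in the talk: the example IDs, the labels, and the `min_agreement` threshold are all hypothetical.

```python
from collections import Counter


def find_inconsistent(labels_by_example, min_agreement=1.0):
    """Return examples whose majority label has less than min_agreement support."""
    flagged = {}
    for example_id, labels in labels_by_example.items():
        if not labels:
            continue  # nothing to compare for unlabeled examples
        majority_label, majority_count = Counter(labels).most_common(1)[0]
        agreement = majority_count / len(labels)
        if agreement < min_agreement:
            flagged[example_id] = (majority_label, agreement)
    return flagged


# Hypothetical annotations from three labelers per example.
annotations = {
    "img_001": ["chair", "chair", "chair"],
    "img_002": ["sofa", "chair", "sofa"],
    "img_003": ["table", "desk", "shelf"],
}
flagged = find_inconsistent(annotations)
for example_id, (label, agreement) in flagged.items():
    print(example_id, label, round(agreement, 2))
```

Examples flagged this way are candidates for clarifying the labeling instructions (Tip 3) or, if they stay ambiguous, for removal (Tip 4).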
@trajesh81 2 years ago
Great Content!
@manoharmanchandia174 2 years ago
Good best wishes
@TanniaDubon 2 years ago
Great session! Thanks for sharing all that insight into building a valuable product.
@samlk5200 2 years ago
Thank you so much for sharing this event. Great work.👍