Snorkel AI

Comments

@brunoscaglione6831 16 days ago

Title is misleading

@SnorkelAI 12 days ago

What would you suggest as an alternate title?

@brunoscaglione6831 12 days ago

@SnorkelAI Realistic LLM expectations and a glimpse into the future

@vivekpadman5248 1 month ago

Is this approach used on all three levels of training: base, instruct, and chat fine-tuning? And are there different things to consider for each of them?

@SnorkelAI 1 month ago

I'm not 100% clear on your question. Are you referring to pre-training, fine-tuning and alignment? If so, this approach could be used on fine-tuning and/or alignment. It could also theoretically be used on pre-training, but I suspect that would yield poor results.

@vivekpadman5248 1 month ago

@SnorkelAI yes, that was exactly my question, thanks 😊. I have one follow-up question. Why do you think it would yield poorer results in the pre-training phase? Any insights on that? And in that case, what kind (size and architecture) of pretrained student model should be used with a specific teacher LLM, or would anything work?

@SnorkelAI 1 month ago

Sorry for the slow reply here. KZfaq didn't surface your reply comment the same way it did your initial comment. We're getting a bit outside the bounds of what can be reasonably answered within a KZfaq comment, but I think we can reasonably say this: Distilling a model means using its output to train a smaller model. For pre-training, that would mean creating an immense volume of raw generated outputs from the parent model. Several studies have shown that pre-training generative models on other models' generated output tends not to work so well. We don't yet fully understand why, but we understand that it is a questionable practice at present.

@vivekpadman5248 1 month ago

@SnorkelAI no worries man, getting such a nice detailed reply is all that matters. Ah, I understand it properly now. I also guess the limits of the parameter size will come into the picture if we use it for pre-training. Clean data plus synthetic data is available now anyway. Thanks again 😊🙏

@vivekpadman5248 1 month ago

Very nice short informative video. I'm looking to create a distilled model on reasoning tasks for games which could run locally. This will help 😊 thanks

@SnorkelAI 1 month ago

Glad it was helpful!

@clashcodes0855 1 month ago

free?

@SnorkelAI 1 month ago

Included in Snorkel Flow. 😃

@RyluRocky 1 month ago

Well done!

@SnorkelAI 1 month ago

Thanks!

@chakpak 1 month ago

Programmatic labeling of images at scale is so cool. 🎉

@SnorkelAI 1 month ago

We think so too!

@chakpak 2 months ago

Wow! Preprocessing is 🔥🔥

@chandra7599 2 months ago

What does Wayfair do... Intro would be helpful to understand and connect with the content.

@SnorkelAI 2 months ago

Wayfair sells furniture and home goods online. That's good feedback, thanks!

@gobdovan 3 months ago

I came here to understand precisely how the Snorkel software assists with the issue. However, you discussed a general RAG system and mentioned that Snorkel expedited your process in different ways without specifying what Snorkel AI actually does. In your description, you stated, 'We explore how Snorkel Flow accelerated development of[...]'

@SnorkelAI 3 months ago

We're in early days on this kind of video content. The intent of this one was to talk about the case broadly and concisely. Were you looking for more of a product demonstration?

@gobdovan 3 months ago

@SnorkelAI, I appreciate the explanation. I was expecting a product demo based on the video's description, which mentioned exploring *how* Snorkel Flow accelerates development. I'm familiar with Snorkel the package, and the pain of developing labeling functions. The video was recommended to me, so I checked your GitHub for updates. It appears the repo hasn't been updated recently, and your focus seems to have shifted to Snorkel Flow. However, the video did not cover it in detail, and I couldn't find any comprehensive product presentations. Is Snorkel Flow aimed at larger corporations, or will it be available as a SaaS for broader access? Could you recommend any videos that present the product, particularly in the context of creating datasets for ASR/translation? I'm looking for more efficient methods to build such datasets.

@SnorkelAI 3 months ago

What you surmised is correct. Snorkel Flow is currently aimed at large enterprises. We will likely have more product demos heading for the channel soon. In the meantime, you can watch this one, which has a bit of product demo. kzfaq.info/get/bejne/kJicfJx_kpuxfoU.html You can also sign up for a product demo here: snorkel.ai/demo/

@arazmalek887 5 months ago

Thanks for the information, but listening to you talking like: 'aaaaaaa eeeeee anddddddddd' was really frustrating

@420_gunna 5 months ago

Thank you Snorkel for putting this channel together! All of your videos + guests have been compact and informative -- really good brand marketing, I think.

@420_gunna 5 months ago

When you talk about distillation requiring large, unlabeled datasets... to be clear for my understanding, it's not necessarily that they're unlabeled data, it's more like we don't care about the dataset's labels, and instead use the teacher model's output distribution as the replacement pseudolabel. I guess you COULD create a distilled model by training against some data distribution that the teacher wasn't itself trained on... but I can't imagine why you would want to do that 😄

@SnorkelAI 3 months ago

Sort of. Typically, you would use this for data that is, in fact, unlabeled: think sections of contracts or paragraphs from textbooks. You could also employ this approach for data that has labels that don't fit your desired schema, in which case your statement of "we don't care about the dataset's labels" would be 100% correct. As for your second comment, there could be a number of reasons you may want to do that. Perhaps the teacher LLM does quite well on a particular labeling task when given a highly-engineered prompt. This approach would let you transfer that performance into a smaller and cheaper model.
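
The idea discussed in this exchange, using the teacher's output distribution as the pseudolabel for unlabeled data, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `teacher_probs` is a hypothetical stand-in for a large teacher model (not an LLM and not anything from Snorkel Flow), the inputs are one-dimensional toy data, and the student is a tiny logistic regression trained with plain gradient descent.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def teacher_probs(x):
    """Hypothetical teacher: returns P(class 1 | x). Stands in for a large model."""
    return sigmoid(3.0 * x - 1.0)

def train_student(inputs, soft_labels, lr=0.5, epochs=2000):
    """Fit a small logistic-regression student to the teacher's soft labels."""
    w, b = 0.0, 0.0
    n = len(inputs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, p_teacher in zip(inputs, soft_labels):
            p_student = sigmoid(w * x + b)
            # Gradient of the cross-entropy between teacher and student outputs.
            grad_w += (p_student - p_teacher) * x
            grad_b += p_student - p_teacher
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b

random.seed(0)
# Unlabeled inputs: no ground-truth labels are used anywhere below.
unlabeled = [random.uniform(-2.0, 2.0) for _ in range(200)]
# The teacher's output distribution serves as the pseudolabel.
pseudo_labels = [teacher_probs(x) for x in unlabeled]
w, b = train_student(unlabeled, pseudo_labels)
```

In this toy setup the student can match the teacher closely because both are logistic models; with a real teacher LLM the student would only approximate it, and the quality of the prompt used to produce the teacher's outputs would matter.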

@kendwyer1277 5 months ago

Very informative, thanks

@420_gunna 5 months ago

Awesome video! Data-centric AI is really awesome, and is a tractable space for the open source community to work in.

@SnorkelAI 3 months ago

It really is!

@axe863 7 months ago

Complicated Nonstationarity is really horrific for a wide range of methods/models

@riser9644 8 months ago

Link to the blog code or ppt would be good

@lionhuang9209 8 months ago

Where can we get the PPT?

@mechwarrior83 8 months ago

please

@askeletalghost 8 months ago

I simp so hard for Emad

@user-li6vs9xr7i 10 months ago

Hi, thanks for the nice sharing! Could you please provide the slides you use in the video so I can study further?

@yorailevi6747 10 months ago

"Tip 4: Toss out noisy examples. More data is not always better!" should be rephrased: toss out non-decisive/opaque examples while keeping the variability of the examples.

@InquilineKea 1 year ago

Can she train on the video data of my entire life

@InquilineKea 1 year ago

Why does she pattern match so hard with Fred Sala?

@annapurnasolutionsllc6463 1 year ago

Does California Institute of Technology pay women less than men, then, per Anima's comment?

@ayushsharma3148 1 year ago

Hey guys. I want to save this video to my youtube playlist. Can you please enable the save / add to playlist option?

@NukulSharma 1 year ago

Tried HoloClean on bigger datasets; the tensors just explode out of memory. Any pointers that could help?

@irshviralvideo 1 year ago

Why use AI when you have simpler models that can be easy to explain???

@EuphonicEscapes 1 year ago

It is sad that there is almost no point in using an Apple Pencil any more. Or rather, if you were a digital pencil user... Your job went poof. People simply don't care about digital paintings any more.

@CalvinJKu 1 year ago

This is amazing. KZfaq needs to send more traffic to this!

@noraalturayeif996 1 year ago

Thank you for this great summary! Could you please share the slides?

@avinashmahure281 1 year ago

Thank you for sharing this event.

@lionhuang9209 1 year ago

Very useful!

@faithandherghosts 1 year ago

This is brave, important work. I’m grateful to have happened across this article. Just today, I reviewed news of the dismissal of the accusations that members of the Fairfax, VA (US) Police Department protected a sex trafficking circle. The charges were dismissed due to evidence that the accuser (a Jane Doe) had identified as a consenting escort worker in her history of involvement with the individual at the center of the alleged trafficking ring. One thing that occurred to me in thinking about the efficacy of content skimming of sites hosting ads that recruitment hooks may be nested within is that a lot of trafficking is off-web and involves youth who are trafficked (sold/bought) through in-person processes involving economic incentive pressure, false-choice coercion and direct threats levied against vulnerable people by powerful buyers in trafficking networks, and activities such as kidnapping and in-person drugging of victims, luring and enticement that lead to hostage involvement in the most dangerous networks in human/sex trafficking. Are there identified characteristics of heightened likelihood of trafficking in certain geographic areas, e.g. your mention of high rates of homeless youth being trafficked? Other possible metrics of likelihood might be having a known sex-tourism market, access to ports and over-water transport (private boats, larger cargo/industry watercraft), and various socioeconomic measures (poverty increases, changes in industries due to disaster events or climate/seasonal tourism increase or decline, loss of protective factors such as NGOs serving vulnerable women and children, local and regional law enforcement presence being protective of victims or protective of perpetrators, stringency of port records and identification of international visitors, conflict or war-related events, etc.).
When I reflected on the strong recruitment-trafficking line between Australia and wealthy western countries, I wondered whether that activity may be showing secondary-broker activity relating to the conspicuously absent (non-English speaking, not online) SE Asian and Pacific Island recruitment/victim markets. I’m sure that the investigative authorities are looking into this, especially after the 200+ global child pornography/brutality arrests of last year…and yet, because some of the buyer markets in the trafficking economy are exceptionally covert in their activities, are there ways to identify likely areas of risk of victimization by some array of metrics that could help local/national/intl. authorities put measures in place to discourage trafficking and/or apprehend perpetrators? Examples might be port and water-access security cameras with 3rd-party monitoring (because local law enforcement is known to be protective of some criminal networks in some places), access to responsive helplines and reporting lines in local languages, and in-person safety measures and environmental protective measures in addition to security cameras at points of entry and exit connecting trafficking markets to the rest of the world (overland and by water): things like lighting or businesses, police kiosks, clearing of possible low-visibility routes of on-foot transit for people who may be kidnapping or buying young people for trafficking, and automobile traffic checkpoints. Similar metrics may be of use to discover new or transitory victim markets in the US, such as the Gulf Coast (which has a large population of SE Asian immigrants and workers in the shrimping and other port-based trade, travel, and hospitality industries, as well as a lot of vulnerable people who may become increasingly vulnerable following destabilizing weather events, slow-season economies, increased costs of living, etc.).
South American coastal areas (and inland areas with road/river connections to buyer markets and export routes on both Caribbean and Pacific coasts) that have factors that increase risk of predator activity can be easily accessed by boat and transport to/from the US mainland, and secondary transport and sale brokers may not be as heavily investigated as perhaps they ought to be…? I deeply appreciate the work you’ve done on data tracking to help show/predict active trafficking activity. I’m grateful that there are committed intl. investigators working to discover and end sex trafficking and child trafficking. Please be careful out there, because the people involved in some of these networks are very powerful and very dangerous, as you are likely aware. If you ever need someone to contract-work on reviewing and labeling data sets, I have training and experience in content analysis, theme-labeling, and rubric-based scaling of similar content to hone specificity and gauge possible secondary/tertiary needs for additional or sub-theme labels. I love that sort of bot-mind work and I am 💯% on board with the effort to use remotely-available information to shine a spotlight on the nodes and mechanics of online & offline trafficking network activities. Much appreciation - and be safe out there. Godspeed and much protection from all the good in the world.* *this is another way of saying ‘…all those nasty mf’er and all their evil ways are gonna be brought down like a sledgehammer-heavy bolt of lightning hitting the ground, and so they better not f* with anyone ‘cause they got eyes on ‘em from all over the sky.’ Cheers…and thanks for the good thinking and the chance to share thoughts here. 😁

@djethereal99 1 year ago

Paper link?

@SnorkelAI 1 year ago

arxiv.org/abs/2205.02318 here you go

@jonaslandsgesell4322 1 year ago

Nice summary video

@sachinvernekar6711 1 year ago

7:53 The PAC rule doesn't really apply here. What if we are able to label only a few types of easy cases? That would mean we are not uniformly labeling samples from the original data distribution.

@sinaghotbi 1 year ago

At 28:00, it was not clear to me why the accuracies are independent. Is that based on empirical evidence? Is it a weak assumption?

@astromikael 1 year ago

Great presentation - thank you!

@ayoolafakoya9841 1 year ago

Julien is very awesome

@uncle-millennium 2 years ago

Excellent presentation. Very lucid. Give this lady a raise.

@dermorgendanach93 2 years ago

Holy god!! That's awesome work, thanks for sharing

@charlesmartin7190 2 years ago

Excellent talk

@muhtasham32 2 years ago

Thanks for sharing, been waiting for this. Great work by the authors!!

@sheikhakbar2067 2 years ago

Thanks for compiling the list on GitHub, Sebastian. As an NLP enthusiast, I want to know the current state of the art in NLP (English) and how I could use those techniques in Arabic.

@spatiallysaying8130 2 years ago

Tip 1: Make the labels y consistent
Tip 2: Use multiple labelers to spot inconsistencies
Tip 3: Clarify labeling instructions by tracking down ambiguous examples
Tip 4: Toss out noisy examples. More data is not always better!
Tip 5: Use error analysis to focus on subset of data to improve
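
As a rough illustration of Tip 2, here is a minimal sketch of how disagreement between labelers could be surfaced. The function name `flag_inconsistent`, the `min_agreement` threshold, and the example IDs and labels are all hypothetical, chosen only for this example; they do not come from the video.

```python
from collections import Counter

def flag_inconsistent(annotations, min_agreement=1.0):
    """Return examples whose labelers agree less than `min_agreement`.

    `annotations` maps an example ID to the list of labels it received.
    Each flagged entry holds the most common label and its share of the votes.
    """
    flagged = {}
    for example_id, labels in annotations.items():
        top_label, top_count = Counter(labels).most_common(1)[0]
        agreement = top_count / len(labels)
        if agreement < min_agreement:
            flagged[example_id] = (top_label, agreement)
    return flagged

# Hypothetical annotations from three labelers per example.
annotations = {
    "img_001": ["cat", "cat", "cat"],
    "img_002": ["cat", "dog", "cat"],
    "img_003": ["dog", "cat", "bird"],
}
flagged = flag_inconsistent(annotations)
```

The flagged examples are candidates for Tip 3 (clarifying the instructions) or Tip 4 (dropping examples that stay ambiguous); which one applies is a judgment call for whoever owns the labeling guidelines.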

@trajesh81 2 years ago

Great Content!

@manoharmanchandia174 2 years ago

Good best wishes

@TanniaDubon 2 years ago

Great session! Thanks for sharing all that insight into building a valuable product.

@samlk5200 2 years ago

Thank you so much for sharing this event. Great work. 👍

@connorshorten6311 2 years ago

Great work with this, really enjoyed it!
