Have you tried different sizes of datasets, to see whether there is some underlying system cause? 230 million words is a lot, way more than I know... and it's all in one file... and not very parallel...
@abanoubha 3 days ago
What about Go?
@kilianklaiber6367 3 days ago
So Rust essentially takes half the time of Python... nice, but I thought Rust would be a lot faster.
@oterotube13 4 days ago
So in the end it's Julia vs. C.
@exxzxxe 9 days ago
This is the second time I have viewed this video. Thank you for performing the benchmark-testing work I would have had to do; it saved me quite a bit of time. Now a question: do you believe Mojo will progress to the point where its dictionary performance will equal or exceed Python's?
@ekbphd3200 9 days ago
You're very welcome! I'm glad that you enjoyed it. I hope and assume Mojo's native dictionary will get faster with future releases. In the changelog for Mojo v24.4 the creators say: "Significant performance improvements when inserting into a Dict. Performance on this metric is still not where we'd like it to be, but it is much improved." docs.modular.com/mojo/changelog#v244-2024-06-07 Given the "still not where we'd like it to be", I assume that they will continue to work on the native dictionary.
@murithiedwin2182 16 days ago
That's a significant speed improvement: 3x faster in the newer version. However, it still doesn't explain why Mojo code is still slower than identical Python code, given that Mojo was going for machine-code compilation (not bytecode) with Python's syntax and ease. From the little documentation I have read, the Mojo team explained that Mojo is not Python, but Python will be Mojo, in the sense that Python will instead be an interpreted subset of compiled Mojo: features of Python not yet implemented in Mojo will dynamically switch to run in an included actual Python runtime, but in the future Mojo will be self-contained and able to run all Python code on the Mojo runtime.
@ekbphd3200 16 days ago
Cool! Thanks for looking that up.
@melodyogonna 18 days ago
How come you know that dunder methods provide high-level sugar in Python, but you call the methods directly in Mojo? You don't need to call object.__len__(), object.__setitem__(), etc. directly in Mojo; they work pretty much the same way they do in Python.
@ekbphd3200 17 days ago
I'll have to try the sugared way in the future. Thanks for pointing this out.
@alextantos658 20 days ago
And one could also try out the Dictionaries.jl package in Julia, which is much more performant and efficient than the base Julia Dict type.
@ekbphd3200 18 days ago
Thanks for the idea. I just tried Dictionaries.jl to get the frequencies of words across 40k files with 230m words, and it was only slightly faster than Base.Dict (47s vs. 51s). I'll have to implement Dictionaries.jl with a deeply nested dictionary and see how it does.
@alextantos658 17 days ago
@@ekbphd3200 Thanks for the nice videos and the work! Besides Dictionaries.jl, Julia offers several other options from DataStructures.jl, such as SwissDict, and other data structures that are claimed to be faster. What I appreciate about Julia is its diverse range of options, often re-implemented within the language itself without needing to track/tune C implementations of basic operations. Therefore, while comparing base types between languages provides valuable insights, it doesn't fully capture the extent of Julia's capabilities. PS: I am a Python user and fan too.
@ekbphd3200 9 days ago
Here's a quick comparison with a simple frequency dictionary: kzfaq.info/get/bejne/iLWXhKSEsrTDnH0.html
@indibarsarkar3936 21 days ago
Please try splitting the data in half and assigning each half to a dictionary. Then measure the time taken to copy or interchange elements from one dictionary to the other. Maybe the problem is in the file management and not in the dictionary!
@ekbphd3200 19 days ago
Mojo's dictionary has improved in performance (when inserting items) with v24.4. I found a 4x increase in speed on a particular linguistic task. Take a watch: kzfaq.info/get/bejne/kNZ1jZirsryViKc.html
@TheRealHassan789 28 days ago
I wonder if the PyPy version of Python is even faster, since it has a JIT compiler... Would love to see that result.
@ekbphd3200 27 days ago
Good question/idea! I haven't yet tried PyPy. Sounds like a good research question to put to empirical testing!
@woolfel a month ago
The real benefit of Mojo is that it can easily target other hardware without having to write C code. Google's work with CUDA acceleration and pandas is a good example.
@davea136 a month ago
Hashmaps are made for O(1)-ish retrieval; insertion is far less important. So yeah, insertion gets slower, especially if you haven't chosen a hashing method specific to your data (this can be really important), but the true test of a hashmap/dictionary is how the size of the corpus affects retrieval. It would be interesting to run the experiment on the retrieval end, now that you know the performance of insertion, and see how they compare. (You may also want to see how pickling the dictionary and restoring it performs, since this is done a lot more than generating the initial map in research.) Also, for initial trials, maybe speed things up and get a rougher idea by increasing the quantity by orders of magnitude: 10, 100, 1000, etc. A sparser plot sometimes helps things jump out of the data more clearly. Thank you for the post, it was good fun!
@ekbphd3200 a month ago
Great ideas! Sounds like another test I need to run! Thanks for the comment.
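A minimal Python sketch of the retrieval-side experiment suggested above (the "corpus" here is synthetic random strings, purely for illustration, not the video's data):

```python
import random
import string
import time

random.seed(0)
# A synthetic corpus: 100k five-letter pseudo-words (stand-in for real text).
words = ["".join(random.choices(string.ascii_lowercase, k=5)) for _ in range(100_000)]

# Insertion: build the frequency dictionary.
t0 = time.perf_counter()
freqs = {}
for wd in words:
    freqs[wd] = freqs.get(wd, 0) + 1
insert_s = time.perf_counter() - t0

# Retrieval: look every token up again.
t0 = time.perf_counter()
hits = sum(wd in freqs for wd in words)
lookup_s = time.perf_counter() - t0
```

Comparing `insert_s` and `lookup_s` across corpus sizes would give the retrieval-vs-insertion picture the comment asks about.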
@alexeydmitrievich5970 a month ago
I think that "small" was about 10-50 keys, as most objects in everyday Python are actually these tiny dicts.
@ekbphd3200 a month ago
Okay. Yeah, that's small. Thanks for the clarification.
@murithiedwin2182 a month ago
Mojo's official benchmark against Python needs a comprehensive "Mojo feature to Python feature" comparison... We would like to see where the 35,000x speed-up claim came from...
@ekbphd3200 a month ago
Yeah, I think the specific task is important. As a linguist, I use strings a lot, and Python still seems to be faster than Mojo right now. I hope Mojo becomes faster than Python with strings at some point in the future. Thanks for the comment!
@niks660097 a month ago
@@ekbphd3200 The thing is, in Python a lot of string operations are handled by pure C code; for example, regex is completely done by an FSM in C. That's why regexes in Python are faster than in Java, Go, you name it, if the string you are searching is big enough. In my current company we were planning to move some code from Python to Go because of performance issues, but while profiling, Python's string regexes and dict performance caught everyone off guard.
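A tiny illustration of that C-backed string machinery: CPython's `re.findall` doing the tokenizing for a frequency dict (the pattern and text here are just examples):

```python
import re

text = "the quick brown fox jumps over the lazy dog the fox"

# \w+ matches runs of word characters; the matching loop runs in C,
# so Python-level code only sees the resulting list of tokens.
words = re.findall(r"\w+", text)

freqs = {}
for wd in words:
    freqs[wd] = freqs.get(wd, 0) + 1
```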
@jinsongli3128 a month ago
My test with the List also shows Mojo is slower than Python; not sure if I did anything wrong.
@ekbphd3200 a month ago
Thanks for that info! Yeah, I assume that at some point Mojo will be faster than Python at most tasks, but that's not the case yet with the linguistic tasks I do most often.
@rkidy a month ago
The curve really doesn't have a name, but it is an increasing function tending towards being linear. It is not exponential or parabolic; those terms refer to specific kinds of curves. The reason behind this is that Python dictionaries are implemented as hashmaps in C, which when empty take a roughly constant, very small amount of time to access, but when fuller take longer and longer, proportional to how many items are already inside.
@ekbphd3200 a month ago
Thanks for that clarification!
@ekbphd3200 a month ago
Thanks for that info!
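One way to look at the shape of that curve empirically in Python (a sketch; the batch size and key format are arbitrary choices, not the video's setup):

```python
import time

BATCH = 100_000
d = {}
batch_times = []
for b in range(10):
    t0 = time.perf_counter()
    for i in range(b * BATCH, (b + 1) * BATCH):
        d[f"key-{i}"] = i  # every key is new, so the dict keeps growing
    batch_times.append(time.perf_counter() - t0)

# Plotting batch_times against b shows how per-batch insert cost changes
# as the dictionary fills up, resize spikes included.
```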
@numeritos1799 a month ago
Hey there! I think the stuff you make is helpful and you should keep doing it :) That being said, you'll always have this problem with dictionaries (and most other data structures). Dealing with LOTS of memory is going to become a performance problem since, most likely, some of this memory is going to have to be stored on disk, and disk reads are very slow. Also, resizes of big data structures are particularly slow. This, btw, could explain some of the "outliers" that you have. I just want to point this out because maybe you were under the impression that this performance decrease as the dict gets bigger is due to the optimizations for small-sized dictionaries. It might play a part (I haven't looked at the implementation), but it should be something minor.
@ekbphd3200 a month ago
Thanks for this helpful info! I appreciate it.
@Queasy. a month ago
Nice video. I think that "exponential" was the right word to use to describe the increasing rate.
@ekbphd3200 a month ago
Thanks for the clarification!
@rkidy a month ago
It's not exponential. It's an increasing function approaching a linear relationship.
@Queasy. a month ago
@@rkidy You may be right that it's not exponential, but wouldn't it just be a polynomial function? What leads you to think it is approaching a linear relationship?
@Gskvj a month ago
Can R do a quadratic "regression" to bound the data set from below?
@ekbphd3200 a month ago
I'm not sure. I'll have to look into it. Thanks for the comment!
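Not an answer for R specifically, but here is what a plain quadratic fit looks like in Python with numpy, on made-up data (bounding the points from below would need something beyond ordinary least squares, e.g. quantile regression):

```python
import numpy as np

# Synthetic, illustrative timing data: roughly quadratic with noise.
rng = np.random.default_rng(0)
x = np.arange(1.0, 51.0)
y = 0.5 * x**2 + 3.0 * x + rng.normal(0.0, 5.0, size=x.size)

# Ordinary least-squares fit of y = a*x^2 + b*x + c.
a, b, c = np.polyfit(x, y, deg=2)
```

In R, the analogous ordinary fit would be lm(y ~ poly(x, 2)); fitting a lower envelope is a different problem from a standard regression.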
@bobdieter5296 a month ago
Hey, do you have the code for this test somewhere publicly visible, like GitHub? I would like to see (in a bit more detail) what was benchmarked here.
@ekbphd3200 a month ago
Sure. I previously forgot to link to my scripts. I just changed the description of the video to include links to my two scripts. Thanks for pointing this out!
@blackdereker4023 a month ago
Python dictionaries are essentially a hash map and naturally have a limited index range for the hashes. More items means more chances of collision, and then a linear search over the index has to be done.
@bertiesmith3021 a month ago
A well-designed hash map should resize itself to avoid this. There will be spikes at the resize points, and steps as you exhaust the memory caches. A smooth curve doesn't indicate a good design.
@blackdereker4023 a month ago
@@bertiesmith3021 The resizing part is the big problem here; with big data it is extremely costly to resize. That's why databases opt to use B+ trees, and analytical databases don't even have indexes, just partitions.
@dfs-comedy a month ago
@@blackdereker4023 A properly designed resizing hash table amortizes the resize costs, so the amortized cost of a search operation is O(1). Databases use B+ trees because for databases the thing being optimized is the number of disk accesses rather than CPU time; B+ trees can have large branching factors to reduce disk accesses.
@ekbphd3200 a month ago
Thanks for the info!
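A quick sanity check of the amortization point above, in Python (timings are machine-dependent, so treat the numbers as indicative only):

```python
import time

def avg_insert_time(n):
    """Build a dict of n integer keys; return average seconds per insert."""
    d = {}
    t0 = time.perf_counter()
    for i in range(n):
        d[i] = i
    return (time.perf_counter() - t0) / n

small = avg_insert_time(10_000)
large = avg_insert_time(1_000_000)
# With amortized O(1) inserts, the per-insert cost should stay within a
# small constant factor despite the 100x difference in table size.
```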
@prakhargupta2960 a month ago
Loved it.
@ekbphd3200 a month ago
Thanks!
@srijankumar9466 a month ago
Great video man 👍👍
@ekbphd3200 a month ago
Thanks bro!
@exxzxxe a month ago
This will be an issue for us. Thanks for the research!
@ekbphd3200 a month ago
You're very welcome! Thanks for watching and commenting!
@cmdlp4178 a month ago
Maybe in Mojo the strings are copied and reallocated as new objects, maybe because of the different StringKey type, or maybe (I am not that familiar with Mojo) because Dict stores and owns its own instance of the string, similar to C++, where the string and its associated memory are "owned" by e.g. the unordered_map. Maybe you should retry the measurement inserting floating-point numbers, for example, which might not be allocated in Mojo.
@ekbphd3200 a month ago
Good idea! I'll give that a try soon.
@user-ff5op7nd4e a month ago
Sir, can you share a quick tutorial on how to create a model for source separation using SpeechBrain in Google Colab from scratch? Please, sir!
@ekbphd3200 a month ago
I'd recommend their tutorial on source separation: speechbrain.github.io/tutorial_separation.html
@user-ff5op7nd4e a month ago
Sir, I do have a doubt regarding this project, and it is kind of urgent. There is this command: cd recipes/WSJ0Mix/separation; python train.py hparams/sepformer.yaml --data_folder=your_data_folder ... What is data_folder here? What exactly should I put in data_folder? Sir, I need your help.
@ekbphd3200 a month ago
Here's their tutorial: speechbrain.github.io/tutorial_separation.html
@Julian.u7 a month ago
Parabolic is not exponential. I would not use the term "exponential" so nonchalantly.
@ekbphd3200 a month ago
Yeah, that's a good point. Thanks!
@MarcoAntoniotti a month ago
You should not do scripts in Julia (or anything else). You pay the compilation time at each run.
@ekbphd3200 a month ago
Good point. I guess in the end I care about how long I have to wait for results once I run the script, so the compilation time is important to me too.
@MarcoAntoniotti a month ago
@@ekbphd3200 The issue is that you are not measuring things properly. You should have a one-line script that calls a function, in both Python and Julia. That way you should notice the difference.
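In Python, one way to follow that advice is to time the function call itself with `timeit`, so interpreter startup is excluded (a sketch; the word list here is made up):

```python
import timeit

def count_words(words):
    """Frequency dictionary, as in the video's benchmark task."""
    freqs = {}
    for wd in words:
        freqs[wd] = freqs.get(wd, 0) + 1
    return freqs

words = ["the", "cat", "the"] * 10_000

# Timing only the function excludes interpreter startup; the Julia
# analogue would also exclude per-run compilation of the script.
elapsed = timeit.timeit(lambda: count_words(words), number=10)
```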
@emmahalliday1503 2 months ago
Thank you so much. Really helpful explanation of the statistical measures! Would you consider a video on the keyword measures?
@ekbphd3200 a month ago
Great suggestion!
@ekbphd3200 a month ago
Here's a video on the keyword analysis (a.k.a. keyness analysis) function: kzfaq.info/get/bejne/bp2ElLWkztnHiWw.html
@patates1165 2 months ago
Nice video :)
@ekbphd3200 2 months ago
Thanks!
@melodyogonna 2 months ago
You're not supposed to call those dunder methods directly like that, lol. __contains__ is called by the language when you write: if wd in freqs. __len__ is called by the language when you write: len(wds). __getitem__ is called by the language when you write: freqs[wd]. They're used to design library APIs which fit orthogonally into the core of the language.
@starshipx1282 2 months ago
Nice observation.
@ekbphd3200 2 months ago
Thank you very much!
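The dunder point above can be seen in plain Python with a toy class (illustrative only, not the video's code):

```python
class Bag:
    """A tiny container that implements the dunder protocol."""

    def __init__(self):
        self.items = {}

    def __len__(self):
        return len(self.items)

    def __contains__(self, key):
        return key in self.items

    def __setitem__(self, key, value):
        self.items[key] = value

    def __getitem__(self, key):
        return self.items[key]

b = Bag()
b["word"] = 3        # sugar for b.__setitem__("word", 3)
n = len(b)           # sugar for b.__len__()
found = "word" in b  # sugar for b.__contains__("word")
```

The sugared forms and the direct dunder calls are the same operation; the sugar is just the idiomatic spelling.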
@ogblue8159 2 months ago
Thank you for sharing this. I wish the audio quality were better after the vocals are separated, though.
@ekbphd3200 2 months ago
Yeah, me too.
@chrismen83240 2 months ago
I tried everything; it seems like Julia doesn't know the type of the values in eachmatch, and that's causing recompilation (around 10% of the time is taken there). I think it would be great to change the code to use a preallocated vector of Char to avoid allocations, and to try to avoid regex matching even if it means looping over characters; as long as it compiles to machine code, it will be fine.
@ekbphd3200 2 months ago
Thanks for the tip!
@yuyuemo7534 2 months ago
Thank you! It is helpful~
@yuyuemo7534 2 months ago
But there is a little problem: when I add \b for "be", like "*_*_\bbe\b", AntConc 4.24 tells me it didn't find anything, and I have not selected Words/Case/Regex at all. Do you have any idea?
@ekbphd3200 2 months ago
Perhaps remove both \b. It's hard to tell without seeing exactly what your parameters are in AntConc.
@sunnyyshc 2 months ago
Hi @EKB PhD, thanks for sharing! I am trying to use whisper_timestamped to extract word-level timestamps from singing, so I can create karaoke from the result. However, the transcription quality went from almost perfect to completely useless when I switched from stable_whisper to whisper_timestamped. Same audio, using the "large" model in both. Any suggestions or insights on why this would happen? Also, given that the lyrics are always available, I wonder if there are other models that would create timestamps from audio given the full lyrics (I assume the timestamps would be much more accurate, since there is no guesswork on the content of the audio, only on when a word starts/ends). THANKS!!!
@sunnyyshc 2 months ago
FYI, I am looking into forced alignment (this aligns a transcript to audio) and WhisperX (which promises more accurate word timestamps).
Lmao, up until minute 9 I was expecting that you were going to explain it ^^
@ekbphd3200 2 months ago
Sorry. I'm curious to see when they will be able to optimize the native Mojo dictionary. In the meantime, there's a third-party Mojo dictionary that is faster. Take a look at this video: kzfaq.info/get/bejne/d9d0es5616qdmac.htmlsi=xx6_7G-srSQX3KjF
@mariusdrulea9049 3 months ago
If you make this little change in your Julia code, the runtime will be even faster: wds = String.(split(txt, " ")). This eliminates the allocations happening inside the loop, because the SubStrings were being converted to Strings inside the loop; now we do this outside the loop. I also recommend benchmarking with the @time macro.
@ekbphd3200 2 months ago
Thanks for the tips! I appreciate them. I'll look at implementing this soon.
@ekbphd3200 2 months ago
I tried what you suggested, but it actually took longer. See below:

### 27 seconds on average ###
t1 = now()
wds = split(txt, " ")
freqs = Dict{String,Int}()
for wd in wds
    freqs[wd] = get(freqs, wd, 0) + 1
end
t2 = now()

### 40 seconds on average ###
t1 = now()
wds = String.(split(txt, " "))
freqs = Dict{String,Int}()
for wd in wds
    freqs[wd] = get(freqs, wd, 0) + 1
end
t2 = now()

Do you see something I could do better?
@user-oc8ch8wh7o 3 months ago
The plot you showed at the end of the video was cut off by your camera. We couldn't see what the results were for the 1.11.0 run.
@oterotube13 3 months ago
It is a shame... the most important one is the hidden one! At least we know where the boxplot is in the figure.
@ekbphd3200 2 months ago
Yeah, I noticed that after I finished the video. I've got to check that type of thing before publishing future videos.
@ekbphd3200 2 months ago
Yeah. Darn. I need to check that in future videos.
@rauldurand 2 months ago
You may want to upload the video again after fixing it.
@edmondw6689 3 months ago
They recently announced they are open-sourcing their product, except LLVM (Apache License), so it should get much better, despite my bad impressions of some of the open-source authors in the past.
@CuriosidadesParaPensar-nn9sn 3 months ago
Please, could you make a video teaching how to use this program? I'm confused, but I want to use it. Where do I send my voice to clone?
@ekbphd3200 3 months ago
I just followed the code here: huggingface.co/coqui/XTTS-v2
@exxzxxe 3 months ago
So, Julia is fast (optimized) for numerical analysis, but not so much for other problems.
@ekbphd3200 3 months ago
I guess, but I don't know for sure. I'm simply trying to discover when it's faster than Python. I don't really care which one is quicker than the other, as I use both as needed.
@exxzxxe 3 months ago
Julia being slower is a surprise, given all the hype.
@ekbphd3200 3 months ago
Right?!
@exxzxxe 3 months ago
@@ekbphd3200 I used to worry about software execution speeds too. I spent 17 years working in computational math and physics in the supercomputer field at Cray and Thinking Machines (Fortran 90 on the massively parallel machines).
@therainman7777 2 months ago
It's because the Julia code was not written correctly for performance. The video author posted his code in a comment above. There are many issues in the code that do things in a non-Julianic way. If that's even a word 😆
@exxzxxe 2 months ago
@@therainman7777 Thanks!
@androth1502 3 months ago
My guess would be that the Mojo implementation is a quick, lazy thing just to get maps in, while Python uses a C implementation.
@ekbphd3200 3 months ago
Yeah, I assume that's what's going on.
@androth1502 3 months ago
@@ekbphd3200 Still not a good look for them to release something so unoptimized after boldly claiming to be thousands of times faster than Python. lol.
@brendanhansknecht4650 3 months ago
Python actually has a really optimized dictionary for its constraints (taking untyped keys and values). Mojo will catch up (and should surpass it, due to static types), but the current dict is new and mostly unoptimized. Also, this should be memory-bound instead of compute-bound, so even in the long term I would not expect much of a difference.
@ekbphd3200 3 months ago
Yeah, good points. Thanks.
@user-qz3nx4xy8c 3 months ago
Right now Mojo is just a Python-skin language; for everything you want to do, you must import a Python package...
@@ekbphd3200You, sir, are a gentleman and a scholar.
@hyperplano 4 months ago
In the Discord channel I read that the current dictionary implementation is not optimized. It's been included to allow developers to experiment more with the language, but it will be optimized in future releases.