Have you tried different sizes of datasets, to see whether there is some underlying system cause? 230 million words is a lot, way more than I know... and it's all in one file... and not very parallel...
@abanoubha 3 days ago
What about Go?
@kilianklaiber6367 3 days ago
So Rust essentially takes half the time of Python... nice, but I thought Rust would be a lot faster.
@oterotube13 4 days ago
So in the end it's Julia vs. C.
@exxzxxe 9 days ago
This is the second time I have viewed this video. Thank you for performing the benchmark-testing work I would have had to do; it saved me quite a bit of time. Now a question: do you believe Mojo will progress to the point where its dictionary performance will equal or exceed Python's?
@ekbphd3200 9 days ago
You're very welcome! I'm glad that you enjoyed it. I hope and assume Mojo's native dictionary will get faster with future releases. In the changelog for Mojo v24.4 the creators say: "Significant performance improvements when inserting into a Dict. Performance on this metric is still not where we'd like it to be, but it is much improved." docs.modular.com/mojo/changelog#v244-2024-06-07 Given the "still not where we'd like it to be", I assume that they will continue to work on the native dictionary.
@murithiedwin2182 16 days ago
That's a significant speed improvement: 3x faster in the newer version. However, it still doesn't explain why Mojo code is still slower than identical Python code, given that Mojo was going for machine-code compilation (not bytecode) with Python's syntax and ease. From the little documentation I have read, the Mojo team explained that Mojo is not Python, but Python will be Mojo, in the sense that Python will instead be an interpreted subset of compiled Mojo: features of Python not yet implemented in Mojo will dynamically switch to run in an included actual Python runtime, but in the future Mojo will be self-contained and able to run all Python code on the Mojo runtime.
@ekbphd3200 16 days ago
Cool! Thanks for looking that up.
@melodyogonna 18 days ago
How come you know that dunder methods provide high-level sugar in Python, but you call the methods directly in Mojo? You don't need to call object.__len__(), object.__setitem__(), etc. directly in Mojo; they work pretty much the same way they do in Python.
@ekbphd3200 17 days ago
I'll have to try the sugared way in the future. Thanks for pointing this out.
@alextantos658 20 days ago
And one could also try out the Dictionaries.jl package in Julia, which is much more performant and efficient than the base Julia Dict type.
@ekbphd3200 18 days ago
Thanks for the idea. I just tried Dictionaries.jl to get the frequencies of words across 40k files with 230m words, and it was only slightly faster than Base.Dict (47s vs. 51s). I'll have to implement Dictionaries.jl with a deeply nested dictionary and see how it does.
@alextantos658 17 days ago
@@ekbphd3200 Thanks for the nice videos and the work! Besides Dictionaries.jl, Julia offers several other options from DataStructures.jl, such as SwissDict, and other data structures that are claimed to be faster. What I appreciate about Julia is its diverse range of options, often re-implemented within the language itself without needing to track/tune C implementations of basic operations. Therefore, while comparing base types between languages provides valuable insights, it doesn't fully capture the extent of Julia's capabilities. PS: I am a Python user and fan too.
@ekbphd3200 9 days ago
Here's a quick comparison with a simple frequency dictionary: kzfaq.info/get/bejne/iLWXhKSEsrTDnH0.html
@indibarsarkar3936 21 days ago
Please try splitting the data in half and assigning each half to a dictionary. Then measure the time taken to copy or interchange elements from one dictionary to the other. Maybe the problem is in the file management and not in the dictionary!
@ekbphd3200 19 days ago
Mojo's dictionary has improved in performance (when inserting items) with v24.4. I found a 4x increase in speed on a particular linguistic task. Take a watch: kzfaq.info/get/bejne/kNZ1jZirsryViKc.html
@TheRealHassan789 28 days ago
I wonder if the PyPy version of Python is even faster, since it has a JIT compiler... Would love to see that result.
@ekbphd3200 27 days ago
Good question/idea! I haven't yet tried PyPy. Sounds like a good research question to put to empirical testing!
@woolfel a month ago
The real benefit of Mojo is that it can easily target other hardware without having to write C code. Google's work with CUDA acceleration and pandas is a good example.
@davea136 a month ago
Hashmaps are made for O(1)-ish retrieval; insertion is far less important. So yeah, insertion gets slower, especially if you haven't chosen a hashing method specific to your data (this can be really important), but the true test of a hashmap/dictionary is how the size of the corpus affects retrieval. It would be interesting to run the experiment on the retrieval end, now that you know the performance of insertion, and see how they compare. (You may also want to see how pickling the dictionary and restoring it performs, since this is done a lot more than generating the initial map in research.) Also, for initial trials, maybe speed things up and get a rougher idea by increasing the quantity by orders of magnitude: 10, 100, 1000, etc. A sparser plot sometimes helps things jump out of the data more clearly. Thank you for the post, it was good fun!
@ekbphd3200 a month ago
Great ideas! Sounds like another test I need to run! Thanks for the comment.
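A minimal Python sketch of the retrieval-side experiment suggested above (the "corpus" here is synthetic random strings, purely for illustration, not the video's data):

```python
import random
import string
import time

random.seed(0)
# A synthetic corpus: 100k five-letter pseudo-words (stand-in for real text).
words = ["".join(random.choices(string.ascii_lowercase, k=5)) for _ in range(100_000)]

# Insertion: build the frequency dictionary.
t0 = time.perf_counter()
freqs = {}
for wd in words:
    freqs[wd] = freqs.get(wd, 0) + 1
insert_s = time.perf_counter() - t0

# Retrieval: look every token up again.
t0 = time.perf_counter()
hits = sum(wd in freqs for wd in words)
lookup_s = time.perf_counter() - t0
```

Comparing `insert_s` and `lookup_s` across corpus sizes would give the retrieval-vs-insertion picture the comment asks about.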
@alexeydmitrievich5970 a month ago
I think that "small" was about 10-50 keys, as most objects in everyday Python are actually these tiny dicts.
@ekbphd3200 a month ago
Okay. Yeah, that's small. Thanks for the clarification.
@murithiedwin2182 a month ago
Mojo's official benchmark against Python needs a comprehensive "Mojo feature to Python feature" comparison... We would like to see where the 35,000x speed-up claim came from...
@ekbphd3200 a month ago
Yeah, I think the specific task is important. As a linguist, I use strings a lot, and Python still seems to be faster than Mojo right now. I hope Mojo becomes faster than Python with strings at some point in the future. Thanks for the comment!
@niks660097 a month ago
@@ekbphd3200 The thing is, in Python a lot of string operations are handled by pure C code; for example, regex is completely done by an FSM in C. That's why regexes in Python are faster than in Java, Go, you name it, if the string you are searching is big enough. In my current company we were planning to move some code from Python to Go because of performance issues, but while profiling, Python's string regexes and dict performance caught everyone off guard.
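A tiny illustration of that C-backed string machinery: CPython's `re.findall` doing the tokenizing for a frequency dict (the pattern and text here are just examples):

```python
import re

text = "the quick brown fox jumps over the lazy dog the fox"

# \w+ matches runs of word characters; the matching loop runs in C,
# so Python-level code only sees the resulting list of tokens.
words = re.findall(r"\w+", text)

freqs = {}
for wd in words:
    freqs[wd] = freqs.get(wd, 0) + 1
```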
@jinsongli3128 a month ago
My test with the List also shows Mojo is slower than Python; not sure if I did anything wrong.
@ekbphd3200 a month ago
Thanks for that info! Yeah, I assume that at some point Mojo will be faster than Python at most tasks, but that's not the case yet with the linguistic tasks I do most often.
@rkidy a month ago
The curve really doesn't have a name, but it is an increasing function tending towards being linear. It is not exponential or parabolic; those terms refer to specific kinds of curves. The reason behind this is that Python dictionaries are implemented as hashmaps in C, which when empty take a roughly constant, very small amount of time to access, but when fuller take longer and longer, proportional to how many items are already inside.
@ekbphd3200 a month ago
Thanks for that clarification!
@ekbphd3200 a month ago
Thanks for that info!
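One way to look at the shape of that curve empirically in Python (a sketch; the batch size and key format are arbitrary choices, not the video's setup):

```python
import time

BATCH = 100_000
d = {}
batch_times = []
for b in range(10):
    t0 = time.perf_counter()
    for i in range(b * BATCH, (b + 1) * BATCH):
        d[f"key-{i}"] = i  # every key is new, so the dict keeps growing
    batch_times.append(time.perf_counter() - t0)

# Plotting batch_times against b shows how per-batch insert cost changes
# as the dictionary fills up, resize spikes included.
```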
@numeritos1799 a month ago
Hey there! I think the stuff you make is helpful and you should keep doing it :) That being said, you'll always have this problem with dictionaries (and most other data structures). Dealing with LOTS of memory is going to become a performance problem since, most likely, some of this memory is going to have to be stored on disk, and disk reads are very slow. Also, resizes of big data structures are particularly slow. This, btw, could explain some of the "outliers" that you have. I just want to point this out because maybe you were under the impression that this performance decrease as the dict gets bigger is due to the optimizations for small-sized dictionaries. It might play a part (I haven't looked at the implementation), but it should be something minor.
@ekbphd3200 a month ago
Thanks for this helpful info! I appreciate it.
@Queasy. a month ago
Nice video. I think that "exponential" was the right word to use to describe the increasing rate.
@ekbphd3200 a month ago
Thanks for the clarification!
@rkidy a month ago
It's not exponential. It's an increasing function approaching a linear relationship.
@Queasy. a month ago
@@rkidy You may be right that it's not exponential, but wouldn't it just be a polynomial function? What leads you to think it is approaching a linear relationship?
@Gskvj a month ago
Can R do a quadratic "regression" to bound the data set from below?
@ekbphd3200 a month ago
I'm not sure. I'll have to look into it. Thanks for the comment!
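Not an answer for R specifically, but here is what a plain quadratic fit looks like in Python with numpy, on made-up data (bounding the points from below would need something beyond ordinary least squares, e.g. quantile regression):

```python
import numpy as np

# Synthetic, illustrative timing data: roughly quadratic with noise.
rng = np.random.default_rng(0)
x = np.arange(1.0, 51.0)
y = 0.5 * x**2 + 3.0 * x + rng.normal(0.0, 5.0, size=x.size)

# Ordinary least-squares fit of y = a*x^2 + b*x + c.
a, b, c = np.polyfit(x, y, deg=2)
```

In R, the analogous ordinary fit would be lm(y ~ poly(x, 2)); fitting a lower envelope is a different problem from a standard regression.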
@bobdieter5296 a month ago
Hey, do you have the code for this test somewhere publicly visible, like GitHub? I would like to see (in a bit more detail) what was benchmarked here.
@ekbphd3200 a month ago
Sure. I previously forgot to link to my scripts. I just changed the description of the video to include links to my two scripts. Thanks for pointing this out!
@blackdereker4023 a month ago
Python dictionaries are essentially a hash map and naturally have a limited index range for the hashes. More items means more chances of collision, and then a linear search over the index has to be done.
@bertiesmith3021 a month ago
A well-designed hash map should resize itself to avoid this. There will be spikes at the resize points, and steps as you exhaust the memory caches. A smooth curve doesn't indicate a good design.
@blackdereker4023 a month ago
@@bertiesmith3021 The resizing part is the big problem here; with big data it is extremely costly to resize. That's why databases opt to use B+ trees, and analytical databases don't even have indexes, just partitions.
@dfs-comedy a month ago
@@blackdereker4023 A properly designed resizing hash table amortizes the resize costs, so the amortized cost of a search operation is O(1). Databases use B+ trees because for databases the thing being optimized is the number of disk accesses rather than CPU time; B+ trees can have large branching factors to reduce disk accesses.
@ekbphd3200 a month ago
Thanks for the info!
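A quick sanity check of the amortization point above, in Python (timings are machine-dependent, so treat the numbers as indicative only):

```python
import time

def avg_insert_time(n):
    """Build a dict of n integer keys; return average seconds per insert."""
    d = {}
    t0 = time.perf_counter()
    for i in range(n):
        d[i] = i
    return (time.perf_counter() - t0) / n

small = avg_insert_time(10_000)
large = avg_insert_time(1_000_000)
# With amortized O(1) inserts, the per-insert cost should stay within a
# small constant factor despite the 100x difference in table size.
```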
@prakhargupta2960 a month ago
Loved it.
@ekbphd3200 a month ago
Thanks!
@srijankumar9466 a month ago
Great video man 👍👍
@ekbphd3200 a month ago
Thanks bro!
@exxzxxe a month ago
This will be an issue for us. Thanks for the research!
@ekbphd3200 a month ago
You're very welcome! Thanks for watching and commenting!
@cmdlp4178 a month ago
Maybe in Mojo the strings are copied and reallocated as new objects, maybe because of the different StringKey type, or maybe (I am not that familiar with Mojo) because Dict stores and owns its own instance of the string, similar to C++, where the string and its associated memory are "owned" by e.g. the unordered_map. Maybe you should retry the measurement inserting floating-point numbers, for example, which might not be allocated in Mojo.
@ekbphd3200 a month ago
Good idea! I'll give that a try soon.
@user-ff5op7nd4e a month ago
Sir, can you share a quick tutorial on how to create a model for source separation using SpeechBrain in Google Colab from scratch? Please, sir!
@ekbphd3200 a month ago
I'd recommend their tutorial on source separation: speechbrain.github.io/tutorial_separation.html
@user-ff5op7nd4e a month ago
Sir, I do have a doubt regarding this project, and it is kind of urgent. There is this command: cd recipes/WSJ0Mix/separation; python train.py hparams/sepformer.yaml --data_folder=your_data_folder ... What is data_folder here? What exactly should I put in data_folder? Sir, I need your help.
@ekbphd3200 a month ago
Here's their tutorial: speechbrain.github.io/tutorial_separation.html
@Julian.u7 a month ago
Parabolic is not exponential. I would not use the term "exponential" so nonchalantly.
@ekbphd3200 a month ago
Yeah, that's a good point. Thanks!
@MarcoAntoniotti a month ago
You should not do scripts in Julia (or anything else). You pay the compilation time at each run.
@ekbphd3200 a month ago
Good point. I guess in the end I care about how long I have to wait for results once I run the script, so the compilation time is important to me too.
@MarcoAntoniotti a month ago
@@ekbphd3200 The issue is that you are not measuring things properly. You should have a one-line script that calls a function, in both Python and Julia. That way you should notice the difference.
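In Python, one way to follow that advice is to time the function call itself with `timeit`, so interpreter startup is excluded (a sketch; the word list here is made up):

```python
import timeit

def count_words(words):
    """Frequency dictionary, as in the video's benchmark task."""
    freqs = {}
    for wd in words:
        freqs[wd] = freqs.get(wd, 0) + 1
    return freqs

words = ["the", "cat", "the"] * 10_000

# Timing only the function excludes interpreter startup; the Julia
# analogue would also exclude per-run compilation of the script.
elapsed = timeit.timeit(lambda: count_words(words), number=10)
```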
@emmahalliday1503 2 months ago
Thank you so much. Really helpful explanation of the statistical measures! Would you consider a video on the keyword measures?
@ekbphd3200 a month ago
Great suggestion!
@ekbphd3200 a month ago
Here's a video on the keyword analysis (a.k.a. keyness analysis) function: kzfaq.info/get/bejne/bp2ElLWkztnHiWw.html
@patates1165 2 months ago
Nice video :)
@ekbphd3200 2 months ago
Thanks!
@melodyogonna 2 months ago
You're not supposed to call those dunder methods directly like that, lol. __contains__ is called by the language when you write: if wd in freqs. __len__ is called by the language when you write: len(wds). __getitem__ is called by the language when you write: freqs[wd]. They're used to design library APIs which fit orthogonally into the core of the language.
@starshipx1282 2 months ago
Nice observation.
@ekbphd3200 2 months ago
Thank you very much!
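The dunder point above can be seen in plain Python with a toy class (illustrative only, not the video's code):

```python
class Bag:
    """A tiny container that implements the dunder protocol."""

    def __init__(self):
        self.items = {}

    def __len__(self):
        return len(self.items)

    def __contains__(self, key):
        return key in self.items

    def __setitem__(self, key, value):
        self.items[key] = value

    def __getitem__(self, key):
        return self.items[key]

b = Bag()
b["word"] = 3        # sugar for b.__setitem__("word", 3)
n = len(b)           # sugar for b.__len__()
found = "word" in b  # sugar for b.__contains__("word")
```

The sugared forms and the direct dunder calls are the same operation; the sugar is just the idiomatic spelling.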
@ogblue8159 2 months ago
Thank you for sharing this. I wish the audio quality were better after the vocals are separated, though.
@ekbphd3200 2 months ago
Yeah, me too.
@chrismen83240 2 months ago
I tried everything; it seems like Julia doesn't know the type of the values in eachmatch, and that's causing recompilation (around 10% of the time is taken there). I think it would be great to change the code to use a preallocated vector of Char to avoid allocations, and to try to avoid regex matching even if it means looping over characters; as long as it compiles to machine code, it will be fine.
@ekbphd3200 2 months ago
Thanks for the tip!
@yuyuemo7534 2 months ago
Thank you! It is helpful~
@yuyuemo7534 2 months ago
But there is a little problem: when I add \b for "be", like "*_*_\bbe\b", AntConc 4.24 tells me it didn't find anything, and I have not selected Words/Case/Regex at all. Do you have any idea?
@ekbphd3200 2 months ago
Perhaps remove both \b. It's hard to tell without seeing exactly what your parameters are in AntConc.
@sunnyyshc 2 months ago
Hi @EKB PhD, thanks for sharing! I am trying to use whisper_timestamped to extract word-level timestamps from singing, so I can create karaoke from the result. However, the transcription quality went from almost perfect to completely useless when I switched from stable_whisper to whisper_timestamped. Same audio, using the "large" model in both. Any suggestions or insights on why this would happen? Also, given that the lyrics are always available, I wonder if there are other models that would create timestamps from audio given the full lyrics (I assume the timestamps would be much more accurate, since there is no guesswork on the content of the audio, only on when a word starts/ends). THANKS!!!
@sunnyyshc 2 months ago
FYI, I am looking into forced alignment (this aligns a transcript to audio) and WhisperX (which promises more accurate word timestamps).
Lmao, up until minute 9 I was expecting that you were going to explain it ^^
@ekbphd3200 2 months ago
Sorry. I'm curious to see when they will be able to optimize the native Mojo dictionary. In the meantime, there's a third-party Mojo dictionary that is faster. Take a look at this video: kzfaq.info/get/bejne/d9d0es5616qdmac.htmlsi=xx6_7G-srSQX3KjF
@mariusdrulea9049 3 months ago
If you make this little change in your Julia code, the runtime will be even faster: wds = String.(split(txt, " ")). This eliminates the allocations happening inside the loop, because the SubStrings were being converted to Strings inside the loop; now we do this outside the loop. I also recommend benchmarking with the @time macro.
@ekbphd3200 2 months ago
Thanks for the tips! I appreciate them. I'll look at implementing this soon.
@ekbphd3200 2 months ago
I tried what you suggested, but it actually took longer. See below:

### 27 seconds on average ###
t1 = now()
wds = split(txt, " ")
freqs = Dict{String,Int}()
for wd in wds
    freqs[wd] = get(freqs, wd, 0) + 1
end
t2 = now()

### 40 seconds on average ###
t1 = now()
wds = String.(split(txt, " "))
freqs = Dict{String,Int}()
for wd in wds
    freqs[wd] = get(freqs, wd, 0) + 1
end
t2 = now()

Do you see something I could do better?
@user-oc8ch8wh7o 3 months ago
The plot you showed at the end of the video was cut off by your camera. We couldn't see what the results were for the 1.11.0 run.
@oterotube13 3 months ago
It is a shame... the most important one is the hidden one! At least we know where the boxplot is in the figure.
@ekbphd3200 2 months ago
Yeah, I noticed that after I finished the video. I've got to check that type of thing before publishing future videos.
@ekbphd3200 2 months ago
Yeah. Darn. I need to check that in future videos.
@rauldurand 2 months ago
You may want to upload the video again after fixing it.
@edmondw6689 3 months ago
They recently announced they are open-sourcing their product, except LLVM (Apache License), so it should get much better, despite my bad impressions of some of the open-source authors in the past.
@CuriosidadesParaPensar-nn9sn 3 months ago
Please, could you make a video teaching how to use this program? I'm confused, but I want to use it. Where do I send my voice to clone?
@ekbphd3200 3 months ago
I just followed the code here: huggingface.co/coqui/XTTS-v2
@exxzxxe 3 months ago
So, Julia is fast (optimized) for numerical analysis, but not so much for other problems.
@ekbphd3200 3 months ago
I guess, but I don't know for sure. I'm simply trying to discover when it's faster than Python. I don't really care which one is quicker than the other, as I use both as needed.
@exxzxxe 3 months ago
Julia being slower is a surprise, given all the hype.
@ekbphd3200 3 months ago
Right?!
@exxzxxe 3 months ago
@@ekbphd3200 I used to worry about software execution speeds too. I spent 17 years working in computational math and physics in the supercomputer field at Cray and Thinking Machines (Fortran 90 on the massively parallel machines).
@therainman7777 2 months ago
It's because the Julia code was not written correctly for performance. The video author posted his code in a comment above. There are many issues in the code that do things in a non-Julianic way. If that's even a word 😆
@exxzxxe 2 months ago
@@therainman7777 Thanks!
@androth1502 3 months ago
My guess would be that the Mojo implementation is a quick, lazy thing just to get maps in, while Python uses a C implementation.
@ekbphd3200 3 months ago
Yeah, I assume that's what's going on.
@androth1502 3 months ago
@@ekbphd3200 Still not a good look for them to release something so unoptimized after boldly claiming to be thousands of times faster than Python. lol.
@brendanhansknecht4650 3 months ago
Python actually has a really optimized dictionary for its constraints (taking untyped keys and values). Mojo will catch up (and should surpass it, due to static types), but the current dict is new and mostly unoptimized. Also, this should be memory-bound instead of compute-bound, so even in the long term I would not expect much of a difference.
@ekbphd3200 3 months ago
Yeah, good points. Thanks.
@user-qz3nx4xy8c 3 months ago
Right now Mojo is just a Python-skin language; for everything you want to do, you must import a Python package...
@@ekbphd3200You, sir, are a gentleman and a scholar.
@hyperplano 4 months ago
In the Discord channel I read that the current dictionary implementation is not optimized. It's been included to allow developers to experiment more with the language, but it will be optimized in future releases.