Does Python's dictionary get slower as it gets bigger?

  Рет қаралды 1,211

EKB PhD

EKB PhD

2 ай бұрын

In a previous video, I showed that Mojo's native dictionary object becomes progressively slower as the number of items increases. This video asks that same question about Python (v3.12.3).
Here's that previous video about Mojo's dictionary:
• How big is a "small" d...
Here's my Python script used in this video:
github.com/ekbrown/scripting_...
#python #corpuslinguistics

Пікірлер: 25
@Queasy.
@Queasy. 2 ай бұрын
Nice video. I think that "exponential" was the right word to use to describe the increasing rate.
@ekbphd3200
@ekbphd3200 2 ай бұрын
Thanks for the clarification!
@rkidy
@rkidy 2 ай бұрын
It’s not exponential. It’s an increasing function approaching a linear relationship.
@Queasy.
@Queasy. 2 ай бұрын
@@rkidy You may be right that it's not exponential, but wouldn't it just be a polynomial function? What leads you to think it is approaching a linear relationship?
@blackdereker4023
@blackdereker4023 2 ай бұрын
Python dictionaries are essentially a hash map and naturally have a limited index range for the hashes. More items means that are more chances of collision and having to do a linear search on the index.
@bertiesmith3021
@bertiesmith3021 2 ай бұрын
A well designed hash map should resize itself to avoid this. There will be spikes at the resize points, and steps as you exhaust the memory caches. A smooth curve doesn’t indicate a good design.
@blackdereker4023
@blackdereker4023 2 ай бұрын
@@bertiesmith3021 The resizing part is the big problem here, in big data is extremely costly to resize. That's why databases opt to use B+ Trees and analytical databases don't even have indexes just partitions.
@dfs-comedy
@dfs-comedy 2 ай бұрын
@@blackdereker4023 A properly-designed resizing hash table amortizes the resize costs so the amortized cost of a search operation is O(1). Databases use B+ trees because for databases, the thing being optimized is number of disk accesses rather than CPU time; B+ trees can have large branching factors to reduce disk accesses.
@ekbphd3200
@ekbphd3200 2 ай бұрын
Thanks for the info!
@rkidy
@rkidy 2 ай бұрын
The curve really doesn’t have a name, but it is an increasing function tending towards being linear. It is not exponential or parabolic, they refer to specific kinds of curves. The reason behind this is due to the fact python dictionaries are implemented as hashmaps in c, which when empty take roughly a constant, very small amount of time to access, but when fuller take longer and longer proportional to how many items are already inside it.
@ekbphd3200
@ekbphd3200 2 ай бұрын
Thanks for that clarification!
@ekbphd3200
@ekbphd3200 2 ай бұрын
Thanks for that info!
@davea136
@davea136 2 ай бұрын
hashmaps are made for O(1)ish retrieval, insertion is far less important. So yeah, insertion gets slower, especially if you haven't chosen a hashing method specific to your data (this can be really important), but the true test of a hashmap/dicitonary is how the size of the corpus affects retrieval. It would be interesting to run the experment on the retrieval end, now that you know the performance of insertion, ansd see how they compare. (You may also want to see how pickling the dictionary and restoring it performs, since this is done a lot more than generatign the initial map in research.) Also, for initial trials, maybe speed things up and get a rougher idea by increasing the quantity by orders of magnitude - 10, 100, 1000, etc. A sparser plot some times helps things jump out of the data more clearly. Thank you for the post, it was good fun!
@ekbphd3200
@ekbphd3200 2 ай бұрын
Great ideas! Sounds like another test I need to run! Thanks for the comment.
@numeritos1799
@numeritos1799 2 ай бұрын
Hey there! I think the stuff you make is helpful and you should keep doing it :) That being said though, you'll always have this problem with dictionaries (and most other data structures). Dealing with LOTS of memory is going to become a performance problem since, most likely, some of this memory is going to have to be stored in disk and disk reads are very. Also resizes of big data structures are particularly slow. This btw could explain some of the "outliers" that you have. I just wanna point this out because maybe you were under the impression that this performance decrease as the dict gets bigger is because of the optimizations for small-sized dictionaries. It might play a part (haven't looked at the implementation) but it should be something minor.
@ekbphd3200
@ekbphd3200 2 ай бұрын
Thanks for this helpful info! I appreciate it.
@srijankumar9466
@srijankumar9466 2 ай бұрын
Great video man 👍👍
@ekbphd3200
@ekbphd3200 2 ай бұрын
Thanks bro!
@exxzxxe
@exxzxxe 2 ай бұрын
This will be an issue for us. Thanks for research!
@ekbphd3200
@ekbphd3200 2 ай бұрын
You're very welcome! Thanks for watching, and commenting!
@Gskvj
@Gskvj 2 ай бұрын
can R do a quadratic "regression" to bound the data set from below?
@ekbphd3200
@ekbphd3200 2 ай бұрын
I’m not sure. I’ll have to look into it. Thanks for comment!
@prakhargupta2960
@prakhargupta2960 2 ай бұрын
loved it
@ekbphd3200
@ekbphd3200 2 ай бұрын
Thanks!
I've Read Over 100 Books on Python. Here are the Top 3
9:26
Python Programmer
Рет қаралды 325 М.
IQ Level: 10000
00:10
Younes Zarou
Рет қаралды 7 МЛН
БИМ БАМ БУМ💥
00:14
⚡️КАН АНДРЕЙ⚡️
Рет қаралды 3,8 МЛН
Inside Out Babies (Inside Out Animation)
00:21
FASH
Рет қаралды 18 МЛН
Reacting to Controversial Opinions of Software Engineers
9:18
Fireship
Рет қаралды 2 МЛН
Why is Mojo's dictionary slower (!) than Python's?
12:16
EKB PhD
Рет қаралды 4 М.
Is it worth it to call Rust from Python with PyO3?
8:50
Python's 5 Worst Features
19:44
Indently
Рет қаралды 105 М.
God-Tier Developer Roadmap
16:42
Fireship
Рет қаралды 7 МЛН
How big is a "small" dictionary in Mojo lang?
7:20
EKB PhD
Рет қаралды 1,5 М.
10 weird algorithms
9:06
Fireship
Рет қаралды 1,2 МЛН
My Initial Impresson Of Go
12:39
TheVimeagen
Рет қаралды 80 М.
Choosing Your Language: Python or Mojo?
14:33
ArjanCodes
Рет қаралды 112 М.
The most important Python script I ever wrote
19:58
John Watson Rooney
Рет қаралды 175 М.
IQ Level: 10000
00:10
Younes Zarou
Рет қаралды 7 МЛН