Does Python's dictionary get slower as it gets bigger?

Рет қаралды 1,211

2 ай бұрын

In a previous video, I showed that Mojo's native dictionary object becomes progressively slower as the number of items increases. This video asks that same question about Python (v3.12.3).
Here's that previous video about Mojo's dictionary:
• How big is a "small" d...
Here's my Python script used in this video:
github.com/ekbrown/scripting_...
#python #corpuslinguistics

Пікірлер: 25

@Queasy. 2 ай бұрын

Nice video. I think that "exponential" was the right word to use to describe the increasing rate.

@ekbphd3200 2 ай бұрын

Thanks for the clarification!

@rkidy 2 ай бұрын

It’s not exponential. It’s an increasing function approaching a linear relationship.

@Queasy. 2 ай бұрын

@@rkidy You may be right that it's not exponential, but wouldn't it just be a polynomial function? What leads you to think it is approaching a linear relationship?

@blackdereker4023 2 ай бұрын

Python dictionaries are essentially a hash map and naturally have a limited index range for the hashes. More items means that are more chances of collision and having to do a linear search on the index.

@bertiesmith3021 2 ай бұрын

A well designed hash map should resize itself to avoid this. There will be spikes at the resize points, and steps as you exhaust the memory caches. A smooth curve doesn’t indicate a good design.

@blackdereker4023 2 ай бұрын

@@bertiesmith3021 The resizing part is the big problem here, in big data is extremely costly to resize. That's why databases opt to use B+ Trees and analytical databases don't even have indexes just partitions.

@dfs-comedy 2 ай бұрын

@@blackdereker4023 A properly-designed resizing hash table amortizes the resize costs so the amortized cost of a search operation is O(1). Databases use B+ trees because for databases, the thing being optimized is number of disk accesses rather than CPU time; B+ trees can have large branching factors to reduce disk accesses.

@ekbphd3200 2 ай бұрын

Thanks for the info!

@rkidy 2 ай бұрын

The curve really doesn’t have a name, but it is an increasing function tending towards being linear. It is not exponential or parabolic, they refer to specific kinds of curves. The reason behind this is due to the fact python dictionaries are implemented as hashmaps in c, which when empty take roughly a constant, very small amount of time to access, but when fuller take longer and longer proportional to how many items are already inside it.

@ekbphd3200 2 ай бұрын

Thanks for that clarification!

@ekbphd3200 2 ай бұрын

Thanks for that info!

@davea136 2 ай бұрын

hashmaps are made for O(1)ish retrieval, insertion is far less important. So yeah, insertion gets slower, especially if you haven't chosen a hashing method specific to your data (this can be really important), but the true test of a hashmap/dicitonary is how the size of the corpus affects retrieval. It would be interesting to run the experment on the retrieval end, now that you know the performance of insertion, ansd see how they compare. (You may also want to see how pickling the dictionary and restoring it performs, since this is done a lot more than generatign the initial map in research.) Also, for initial trials, maybe speed things up and get a rougher idea by increasing the quantity by orders of magnitude - 10, 100, 1000, etc. A sparser plot some times helps things jump out of the data more clearly. Thank you for the post, it was good fun!

@ekbphd3200 2 ай бұрын

Great ideas! Sounds like another test I need to run! Thanks for the comment.

@numeritos1799 2 ай бұрын

Hey there! I think the stuff you make is helpful and you should keep doing it :) That being said though, you'll always have this problem with dictionaries (and most other data structures). Dealing with LOTS of memory is going to become a performance problem since, most likely, some of this memory is going to have to be stored in disk and disk reads are very. Also resizes of big data structures are particularly slow. This btw could explain some of the "outliers" that you have. I just wanna point this out because maybe you were under the impression that this performance decrease as the dict gets bigger is because of the optimizations for small-sized dictionaries. It might play a part (haven't looked at the implementation) but it should be something minor.

@ekbphd3200 2 ай бұрын

Thanks for this helpful info! I appreciate it.