
Typeahead Suggestion Design Deep Dive with Google SWE! | Systems Design Interview Question 6

6,245 views

Jordan has no life

1 day ago

Imma need me some typeahead real soon that's for sure
00:00 Introduction
01:15 Functional Requirements
02:15 Capacity Estimates
03:21 API Design
04:13 Database Schema
05:32 Architectural Overview

Comments: 30
@senatusconsultumultimum7815 2 years ago
What a cool data structure. I love the bottom-up update, very clean.
@OKJazzBro 1 year ago
At around 18:00 you state that the client can itself compute the hash of the string in order to avoid the linear-time (linear in the length of the string) look-up in the server's local cache. I raise 2 problems with this: 1. It doesn't really make a difference where the hash is computed as far as E2E latency is concerned. After all, hashing is a prerequisite to a hash table look-up, so it needs to happen sequentially before the look-up. 2. With hash tables, hashing is just the first step. After a hash match you have to do an exact key match to rule out the possibility of a hash collision / false match. So no matter what, it will not be a constant-time look-up and will depend on the length of the string.

Another thing: your approach of storing just references to leaf nodes to reconstruct the actual full words, instead of storing full words at each trie node, is very interesting. But I'm curious, do you know if it's done this way in a real-world application or is it something you came up with? I think it could be costlier than we might think to walk up a trie like that, because each node is a different hash table and it would amount to a ton of CPU cache misses. I'm inclined to think that simply storing full words at each trie node would be more congenial to the latency requirements (and this is how it's described in Alex Xu's book). With that approach (where we discard your "store trie leaf references" idea), the only way the trie helps us is by keeping the data structure more compact compared to a full-blown hash table that maps each possible prefix to the top K words.

Sorry if these are aspects you discussed in the video. Since most of the video is talk without much illustration (not complaining or anything, this is free stuff after all :) ), I end up having to replay bits and pieces of it, so it's plausible I missed the bigger picture.
@jordanhasnolife5163 1 year ago
1) Fair point! Looking back on it, I think I was just referring to reducing load on the cache, but you're definitely correct here; sometimes I don't even fully understand what I was saying. 2) Yeah, but that's usually bounded by a constant factor based on the load factor of the map. 3) This is actually the solution from Grokking. I haven't read Alex Xu's book, but I'm glad you told me so that I can check it out! I don't think you would even be using the CPU cache here, because while you need to store hash tables to go down the trie, going up the trie just requires a single pointer to a parent node in memory. Agreed with your point that storing all the top suggestions directly is ideal, but I believe one of the challenges of the problem is that all of the terms are too much to store (at least in memory).
@OKJazzBro 1 year ago
@jordanhasnolife5163 The CPU cache comes into the picture when you dereference those pointers to get the characters, right? It's in the same sense that a linked list is known to perform worse as a FIFO queue than a circular array. The data are scattered all around memory (not in contiguous blocks), so the CPU cache is often missing what you want and the look-up goes to memory.
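A short sketch to make the structure in this thread concrete: each trie node keeps a child hashmap for walking down, a single parent pointer for walking back up, and references to the leaf nodes of its top suggestions, which are reconstructed into full words on read. This is an illustrative Python layout only, not the exact one from the video or from Grokking.

```python
class TrieNode:
    def __init__(self, char="", parent=None):
        self.char = char          # character on the edge leading into this node
        self.parent = parent      # single pointer used to walk back up the trie
        self.children = {}        # char -> TrieNode, used to walk down the trie
        self.is_word_end = False
        self.top_leaves = []      # references to word-end nodes of the top suggestions


def insert(root, word):
    """Walk down the trie, creating nodes as needed; return the word-end node."""
    node = root
    for ch in word:
        if ch not in node.children:
            node.children[ch] = TrieNode(ch, parent=node)
        node = node.children[ch]
    node.is_word_end = True
    return node


def reconstruct(leaf):
    """Rebuild the full word by following parent pointers up to the root."""
    chars = []
    node = leaf
    while node.parent is not None:
        chars.append(node.char)
        node = node.parent
    return "".join(reversed(chars))


root = TrieNode()
leaf = insert(root, "dog")
root.children["d"].top_leaves.append(leaf)   # prefix "d" references the leaf, not the string
print(reconstruct(leaf))                     # -> "dog"
```

The cache-miss concern above is about reconstruct: each parent dereference can land in a different region of memory, whereas storing full strings at each node trades memory for a contiguous read.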
@scbs2007 1 year ago
++ Could clarify the requirements: touch upon Bloom filters to forget about one-hit wonders.
@jordanhasnolife5163 1 year ago
Sounds good, will do next time around
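A minimal sketch of the one-hit-wonder idea, in case it helps: only count a query once a Bloom filter says it has (probably) been seen before, so queries that appear exactly once never reach the trie-building pipeline. The filter here is hand-rolled for illustration; a real system would size it from the capacity estimates or use a library.

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1 << 20, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


seen_once = BloomFilter()
counts = {}  # stand-in for the real count store

def record_query(query):
    # First sighting only marks the filter; counting starts on the second sighting,
    # so one-hit wonders never take up space in the counts.
    if seen_once.might_contain(query):
        counts[query] = counts.get(query, 0) + 1
    else:
        seen_once.add(query)
```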
@ryandsouza9093 10 months ago
Great video, love the way you articulate your thoughts and your quick capacity estimates! Any reason why we are using REST? REST cannot stream data. We could use a WebSocket connection instead for bi-directional streaming, minimizing latency by avoiding the overhead of establishing a new TCP connection for each character.
@jordanhasnolife5163 10 months ago
I think that's reasonable! Though I guess it also means that you're maintaining a persistent connection to the server for as long as you have a text field clicked. Totally see your point though!
@De1n1ol 1 month ago
You don't establish a new TCP connection for each new character because of the keep-alive header, which is included by default since HTTP/1.1 and keeps the TCP connection open. WebSockets are redundant here.
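To make the keep-alive point concrete, here is a sketch using Python's requests library: a single Session reuses the underlying TCP connection across per-keystroke requests, which is the HTTP/1.1 behavior described above. The endpoint URL and query parameters are made up for illustration.

```python
import requests

# One Session keeps the underlying TCP connection alive (HTTP/1.1 keep-alive),
# so each keystroke reuses it instead of doing a fresh TCP handshake.
session = requests.Session()

def fetch_suggestions(prefix):
    # Hypothetical endpoint and params; the real API shape depends on the design.
    resp = session.get(
        "https://suggestions.example.com/v1/typeahead",
        params={"prefix": prefix, "limit": 10},
        timeout=0.2,
    )
    resp.raise_for_status()
    return resp.json()

# As the user types "d", "do", "dog", each call reuses the same connection.
for prefix in ["d", "do", "dog"]:
    print(fetch_suggestions(prefix))
```

A WebSocket would still save the per-request HTTP framing and allow server push, so the trade-off is mostly about connection management rather than TCP setup cost.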
@yuwang3736 17 days ago
Kinda confused about the cache part and HDFS: how often do you expect to update the cache? And is adding a query the only way the data in HDFS changes?
@jordanhasnolife5163 16 days ago
The cache is updated like once per day. I don't know what you mean by "change the HDFS", but I'm assuming you mean add data to it, in which case yes: a new query to our service will go through our stream processing solution and eventually be sunk to HDFS.
@yuwang3736 16 days ago
@jordanhasnolife5163 Cool, thanks. That is exactly the point. I thought some new queries could come from the input of the users.
@SureshBabu-wb1dg 1 year ago
Why do we need to compute the hash of the query string when we're using range-based (dynamic) partitioning for the trie? Could you explain a little bit more? Is it only required for the caching layer?
@jordanhasnolife5163 1 year ago
Yep, exactly - it's just for caching; the partitioning is probably best kept by range. We can cache based on a hashed string to avoid hot caches if need be.
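A small sketch of the split being described, with made-up shard boundaries and node counts: the trie servers are partitioned by prefix range so a whole prefix lives on one shard, while the cache key is a hash of the prefix so hot prefixes spread evenly across cache nodes.

```python
import bisect
import hashlib

# Range-based (dynamic) partitioning for the trie servers: each shard owns a
# contiguous range of starting characters, so a prefix maps to exactly one shard.
SHARD_UPPER_BOUNDS = ["f", "m", "s", "{"]  # "{" sorts just after "z"

def trie_shard_for(prefix):
    return bisect.bisect_right(SHARD_UPPER_BOUNDS, prefix[:1].lower())

# Hash-based key for the caching layer: spreads hot prefixes evenly across
# cache nodes regardless of where they fall lexicographically.
NUM_CACHE_NODES = 8

def cache_node_for(prefix):
    digest = hashlib.md5(prefix.encode()).hexdigest()
    return int(digest, 16) % NUM_CACHE_NODES

print(trie_shard_for("dog"), cache_node_for("dog"))
```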
@raj_kundalia 7 months ago
thank you!
@cambriandot3665 1 year ago
14:56 Architecture
@neek6327 2 years ago
Good stuff, man. Wanted to get your thoughts on an idea I had. Since the amount of data is so small, do you think it would be advantageous to store the tries (or the prefix->suggestions map) directly on the fetch servers to avoid a network call to Redis? I think we could apply the same ideas w.r.t. replication/sharding/load balancing to the tries stored on the fetch servers.
@jordanhasnolife5163 2 years ago
Seems pretty reasonable to me!
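A minimal sketch of what that would look like on a fetch server, with placeholder data: the prefix-to-top-K mapping lives in process memory, so the read path is a local dictionary lookup with no network call, and a background job would swap the dict out when a new periodic build lands.

```python
# Suggestions held directly in the fetch server's memory; a background refresh
# would replace this dict wholesale when the periodic rebuild finishes.
SUGGESTIONS = {
    "d":   ["dog", "door", "dominos"],
    "do":  ["dog", "door", "dominos"],
    "dog": ["dog", "dog park", "dog food"],
}

def lookup(prefix, k=10):
    # Local dictionary lookup; no call out to Redis on the read path.
    return SUGGESTIONS.get(prefix.lower(), [])[:k]

print(lookup("do"))  # -> ['dog', 'door', 'dominos']
```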
@blumaukko1790 1 year ago
How do you store a trie on Redis? AFAIK that data structure is not supported.
@jordanhasnolife5163 1 year ago
Tries are implemented with nested hashmaps, which I believe you can have in Redis.
@nomnom8692 6 months ago
@jordanhasnolife5163 I know this video is old, but in Redis you also can't store nested hashes. You can if you store it as a serialized string, but then your application has to deserialize it. I guess this doesn't change the design much, but instead of fetching from Redis, we could just store the trie in the (application) server itself, and every so often the server can re-fetch it from Redis to update the trie (repeating the serialization/deserialization).
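A minimal sketch of the workaround described in this thread, assuming the trie (as nested dicts) is serialized to JSON under a single Redis key and the application keeps a local copy that it refreshes on a timer. The key name and refresh interval are made up; the redis-py calls used (set/get) are the standard ones.

```python
import json
import time
import redis  # redis-py client

r = redis.Redis(host="localhost", port=6379)

# The trie as plain nested dicts, e.g. produced by the periodic batch build.
trie = {"d": {"o": {"g": {"_top": ["dog", "dog park"]}}}}

# Writer side: Redis has no nested hash type, so serialize the whole structure
# into one string value.
r.set("typeahead:trie", json.dumps(trie))

# Fetch-server side: keep a local deserialized copy and refresh it every so
# often, paying the deserialization cost once per refresh rather than per request.
local_trie = None
last_refresh = 0.0
REFRESH_SECONDS = 3600

def get_trie():
    global local_trie, last_refresh
    if local_trie is None or time.time() - last_refresh > REFRESH_SECONDS:
        local_trie = json.loads(r.get("typeahead:trie"))
        last_refresh = time.time()
    return local_trie
```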
@samatzhussipov1139 1 year ago
Good job! Can you release it in practice?)
@jordanhasnolife5163 1 year ago
Sorry, can you elaborate on that?
@ryan-bo2xi 9 months ago
@jordanhasnolife5163 lol 😂😂
@allaboutsystemdesign 1 year ago
Why the Hadoop File System for the add-query service?
@jordanhasnolife5163 1 year ago
Just so you can periodically run batch jobs on the data held there
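For a concrete picture of such a batch job, here is a sketch that takes the raw query log (one query per line, as it might be sunk to HDFS by the stream processor), counts frequencies, and emits the top-K suggestions per prefix for the next trie/cache build. Reading from HDFS itself is elided; the log format and constants are assumptions.

```python
from collections import Counter, defaultdict
import heapq

K = 5
MAX_PREFIX_LEN = 3

def build_top_k(log_lines):
    # Count how often each query appears in the day's log.
    counts = Counter(line.strip().lower() for line in log_lines if line.strip())

    # Attach each query's count to every prefix of it (up to a max length).
    per_prefix = defaultdict(list)
    for query, count in counts.items():
        for end in range(1, min(len(query), MAX_PREFIX_LEN) + 1):
            per_prefix[query[:end]].append((count, query))

    # Keep the top K queries per prefix, ready to load into the trie/cache.
    return {
        prefix: [q for _, q in heapq.nlargest(K, pairs)]
        for prefix, pairs in per_prefix.items()
    }

sample_log = ["dog", "dog", "door", "dominos", "dog park"]
print(build_top_k(sample_log))
```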
@yuwang3736 17 days ago
By adding a query, do you mean adding a term to the trie?
@erythsea 1 year ago
Stop letting the ladies sit on your face 🙄
@jordanhasnolife5163 1 year ago
I prefer when the dudes do it tbh