1: TinyURL + PasteBin | Systems Design Interview Questions With Ex-Google SWE

44,028 views

Jordan has no life
9 months ago

38 minutes in one go, my ex-girlfriend punching the air right now

Comments: 252
@prashantgodhwani6304 9 months ago
I swear this channel is such a hidden gem.
@jordanhasnolife5163 9 months ago
Thanks man! Hopefully not too hidden haha
@ronakshah725 2 months ago
I think this attracts a particular kind of viewer: those looking for in-depth, first-principles understanding rather than memorization. I think you also have a good balance of real use-case-driven thinking and hand waving where necessary. The latter is crucial, and people (including me at times) tend to overlook the importance of hand waving. Not everything deserves the same level of detail, but knowing which does is the game changer! Anyway, I will be subscribing and reviewing a good portion of this 😊
@dwivedys 2 months ago
You stole my thoughts! What is this man!!! Amazing…brilliant
@jporritt 5 months ago
I've gone through a lot of videos on this subject and this is by far the best. Not just regurgitating what they've read elsewhere, but actually coming from a place of proper understanding.
@jordanhasnolife5163 5 months ago
Thanks James!
@gchris6434 4 months ago
Great content! Probably the most in-depth TinyURL design I've seen. Most others don't actually get into write conflicts/locking.
@brandonwl 8 months ago
I appreciate the new 2.0 video style and how in depth it is. I found the older videos too brief considering the complexity of the systems they covered.
@ww12835 5 months ago
Thank you very much. I've been reading DDIA and Understanding Distributed Systems, so it's great to see how those concepts apply to the use cases you're showing. This is the first video of yours that I've seen; I'm planning to watch the whole series when I can, because you highlight the questions we should be asking when designing and weigh the tradeoffs of different solutions. It's amazing. I don't know if you've done this in other videos, but if I could ask for anything, it would be for you to change the pencil color when reviewing the flows of data; it makes it easier to find where you're pointing on the drawing board. Again, thank you very much, these videos are really helpful.
@jordanhasnolife5163 5 months ago
Sounds good - I've gotten this suggestion a couple of times now, so I'll give it a shot at some point
@hzoe 25 days ago
You are so underrated, brother... this is the best content on the internet regarding system design interviews, 100%. Who cares about an interviewer saying 'yes' to every question that the interviewee asks, or an interviewee coming up with the same 10 boxes for every single system design question, when you explain stuff like this... keep it up, you are saving lives!!
@gopald9962 5 months ago
You are very good, the best even, at driving the solution of a design problem by applying logical reasons rather than throwing something at my face i wouldn't understand. Thanks for existing!
@firezdog 3 months ago
I really like the format of these videos: starting with a naive solution and then iterating on each component as we discover performance problems. That way the initial DB schema and the overall graph of how requests/responses implement the features can be clearly presented, and the viewer gets a conceptual framework for evaluating the ways in which the simple picture needs to become more sophisticated.
@jordanhasnolife5163 3 months ago
Thanks for the feedback!
@mytabelle 6 months ago
Hey, first of all thank you for the videos. They've been really helpful for me prepping for an upcoming interview! I watched a separate video on TinyURL/PasteBin link generation whose main point was to avoid hashing, which can lead to collisions when we truncate the MD5 (or whatever) hash, like you mentioned in the video. Their suggestion was a cluster of key/ID generators, each with a range of URLs to distribute when a write request is made. They could be coordinated by ZooKeeper, would allocate only from their own chunks, and would refresh their ranges as needed. The only concern is that we lose that range of URLs (say 1 million) if a specific generator goes down, but since we have 2 trillion combinations if we use base 36 with 8 characters, and even more if we use base 62, this isn't a major concern. This would also speed up writes in the case there is a collision. What are your thoughts on this?
@jordanhasnolife5163 6 months ago
Hey, yeah, I touch upon that in this video. That's basically the key generation service, where you partition the keys and assign them in sequence. That works too! There may be some lock contention within a partition, but if you have enough partitions that it doesn't matter, that's fine too!
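To make the key generation service concrete, here is a minimal single-node sketch of the range-based allocation discussed in this exchange. All names are hypothetical, and a real deployment would lease each range from a coordinator such as ZooKeeper rather than hard-coding it:
```python
import threading

class KeyRangeAllocator:
    """Hands out sequential IDs from a pre-leased range.

    Contention is purely local to this node: no cross-node coordination
    is needed until the range runs out and a new lease is requested.
    """
    def __init__(self, start: int, end: int):
        self._next = start
        self._end = end
        self._lock = threading.Lock()

    def allocate(self) -> int:
        with self._lock:
            if self._next > self._end:
                raise RuntimeError("range exhausted; lease a new one")
            value = self._next
            self._next += 1
            return value

# Disjoint ranges mean two nodes can never hand out the same key.
node_a = KeyRangeAllocator(0, 999_999)
node_b = KeyRangeAllocator(1_000_000, 1_999_999)
```
As the comment above notes, the worst case for a crashed node is forfeiting the unused portion of its range, which is cheap given the size of the key space.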
@psychechip 5 months ago
I fucking love how you can explain complex shit and be funny at the same time. Keep up the good work!
@idiot7leon 4 months ago
Brief Outline
00:01:15 Problem Requirements
00:02:04 Performance Considerations
00:03:23 Link Generation
00:05:45 Assigning URLs - Replication
00:08:07 Assigning URLs - Caching
00:08:51 Assigning URLs - Partitioning
00:10:34 Assigning URLs - Single Node
00:11:29 Assigning URLs - Predicate Locks
00:14:12 Assigning URLs - Engine Implementation?
00:15:42 Assigning URLs - Database Choice
00:16:17 Maximizing Read Speeds
00:17:35 Maximizing Read Speeds - Hot Links
00:18:50 Maximizing Read Speeds - Populating the Cache
00:21:02 Analytics - Naive Solution
00:22:22 Analytics - Stream Processing
00:23:49 Analytics - Click Consumer
00:26:00 Analytics - Exactly Once?
00:29:16 Analytics - One Publisher Per Row
00:30:17 Deleting Expired Links
00:31:21 PasteBin - Huge Pastes
00:34:23 Final Diagram
Thanks, Jordan!
@jordanhasnolife5163 4 months ago
Thank you man! Legend!
@zalooooo 7 months ago
Awesome video! Couple thoughts:
- For storage, it may be useful to talk about content-addressable storage. Many pastes may be the same, and that will save space.
- I'm not sure what the CDN buys us here. Reading from S3 isn't crazy slow, but the big issue is that the object space is huge, and my hunch is that most of the time you'd get CDN misses in most regions. That is assuming you're not storing every paste in your CDN, which feels expensive. This is more a note that a CDN feels like a premature optimization.
- I really like your count approach. Another could be using a counter CRDT and incrementing server-side while serving the link. That being said, yours is more general purpose and scales for other types of analytics.
@jordanhasnolife5163 7 months ago
Thanks! Good call on the first point, you could take some sort of hash to deduplicate. For the CDN point, I mostly have it there just to talk about CDNs. In practice, caches seem to be one of those things that you just introduce as you need, so we'd have to see how our existing latencies were and then experiment here.
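A toy sketch of the content-addressable storage idea from the first point, with an in-memory dict standing in for S3:
```python
import hashlib

store: dict[str, bytes] = {}  # stand-in for S3: key -> blob

def save_paste(content: bytes) -> str:
    # Content-addressable key: identical pastes hash to the same key,
    # so each distinct paste is stored exactly once.
    key = hashlib.sha256(content).hexdigest()
    if key not in store:
        store[key] = content  # a duplicate paste skips the upload entirely
    return key
```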
@slover4384 2 months ago
At 12:34: you do not have to do any predicate query. Relational databases use B-trees and are not append-only like databases that use LSM trees. So an INSERT will fail if another INSERT beat it to the race, because the 2nd insert will notice the primary key in the B-tree. This sort of "INSERT and fail if the primary key exists" is very hard to accomplish in LSM-based DBs, as they are append-only. In fact, to accomplish this in an LSM-based database, you usually have to do a special slow operation that involves multiple phases and uses something like Paxos, Raft, or 2PC. So with a database like Postgres, one solution to this TinyURL problem is to just have different application servers INSERT new rows. If there was a race on the exact same primary key, the 2nd application server will get a "duplicate key exists" error and will know to retry or send the error to the client.
@jordanhasnolife5163 2 months ago
Yeah - I probably shouldn't have included this section in the video since it's just thrown people off. I meant to say that predicate locking was one way of accomplishing this, but like you mentioned in SQL databases they're gonna get this done for you on the primary key. Interesting stuff on the LSM tidbit, that's new to me! I imagine if the key was still in the memtable perhaps we wouldn't need too much additional complexity, as we'd see it was there, but had it been flushed this could get complex. Thanks for the insight, and I appreciate you pointing out these flaws!
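A sketch of the "INSERT and let the primary key arbitrate" approach from this exchange, using sqlite3 as a stand-in for Postgres/MySQL. The sliding-window probing on collision is illustrative, not the video's exact scheme:
```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE links (short TEXT PRIMARY KEY, long_url TEXT)")

def shorten(user_id: str, long_url: str) -> str:
    digest = hashlib.md5(f"{user_id}|{long_url}".encode()).hexdigest()
    for probe in range(10):
        short = digest[probe:probe + 8]  # slide an 8-char window on collision
        try:
            # The unique primary key makes the race safe: the losing
            # insert raises instead of silently overwriting.
            conn.execute("INSERT INTO links VALUES (?, ?)", (short, long_url))
            conn.commit()
            return short
        except sqlite3.IntegrityError:  # duplicate key: someone beat us to it
            continue
    raise RuntimeError("too many collisions; widen the key space")
```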
@advaitchabukswar4163 16 days ago
Once again, I learned a lot, and the read and write scaling techniques were helpful in my system design interview. Another write-scaling technique you could talk about is log-structured merge trees, for O(1) writes and O(log n) reads.
@jordanhasnolife5163 16 days ago
I wouldn't say writes there are O(1) since they also go into a tree, but it should be a much smaller tree and in memory, so maybe practically it is :)
@advaitchabukswar4163 16 days ago
@@jordanhasnolife5163 You are correct, they are ultimately not O(1), but if the writes are async then, for your request's sake, it's O(1), and at some point the entire log gets added in sorted order. Again, I am not an expert here, so I could be completely wrong, lol. Please let me know if I am. Will do more research for complete understanding.
@jordanhasnolife5163 15 days ago
@@advaitchabukswar4163 I don't think it is asynchronous, the write to the tree is synchronous as far as I'm aware
@advaitchabukswar4163 12 days ago
@@jordanhasnolife5163 I see…
@gammagorom 9 months ago
Thank you for all the system design content! I’ve been binging through the 2.0 playlist and this video does a great job putting all the pieces together. I don’t have any experience with system design interviews so I have two questions in particular: 1. If in an interview, someone were able to deliver a design with the same level of depth + complexity that you’ve described here, what level would they be placed at? Alternatively, what sort of depth would interviewers be looking for in say an L4 interview? 2. Many of the system design interview guides recommend to lay out a basic architecture first and then go into a deep dive at the end but in your videos, you weave both parts together. How would you recommend to split these parts up? For example, is database schema + choosing a type of database part of the deep dive or should it be part of the basic architecture? And how about for replication/partitioning?
@jordanhasnolife5163 9 months ago
1) At the end of the day, you're interviewing for a certain level, so you'd be placed at that level. I don't really think uplevels are ever going to happen, but failing a systems design interview might lead to a downlevel. I like to think that these videos are sufficient to pass just about any systems design interview, considering the results of my viewers, but I can't be sure, as I'm not an L5 and haven't had the test to try that theory out haha 2) Yeah, and I used to do my old videos this way. However what I've come to realize is that one's choice of database schema is intimately tied to their database architecture choice. So if I just list a schema outright, but don't go into what architecture I'm using, it might not make sense. In an actual interview, it may make sense to list a schema first, and then say you'll explain it later, but my main objective of these videos is to make them as clear as possible - so I hope that my choice to do this makes sense.
@scottmangiapane 3 months ago
6:40 never thought I'd laugh out loud during a system design video. Thanks for making my studying somewhat tolerable
@jordanhasnolife5163 3 months ago
Thanks Scott!
@BRBallin1 2 months ago
Using PH as your example URL makes you a legend
@jordanhasnolife5163 2 months ago
Legend is a strong word to use for what I am
@harikareddy5759 8 months ago
Hey Jordan, awesome video as usual! I really appreciate the way you break down tradeoffs. I checked out the TinyURL/Pastebin design from grokking's course, but it felt a bit sloppy to me, your version is so much better :) thank you!
@jordanhasnolife5163 8 months ago
Thanks Harika!
@ninad11081991 8 months ago
I'd love it if you could make content regarding what level of depth is expected from a Mid level SD interview vs Senior vs Staff! It would be cool to take the same problem and do a mock at each level so we can understand the differences! Would love to see a mock series between you and Gaurav tackling this. I've interviewed with companies where they haven't picked a level for me yet and they say we will determine it based on your performance. But I don't know what the expectation is exactly in the context of a system design interview.
@jordanhasnolife5163 8 months ago
To be honest, I just don't have enough context here, as I haven't conducted interviews for a company. I really wish I could say otherwise, but at the moment my objective is just to go as in-depth as possible.
@andriidanylov9453 7 months ago
Appreciated. It is a very thorough description/implementation for such an easy task.
@fane1159 1 month ago
FK, this man's content has its own system design. Love you buddy
@andreybraslavskiy522 3 months ago
Hi Jordan. You were talking about 16 TB of potential data, partitioning, and single-leader replication. Sounds like Mongo fits better than a relational solution. Thank you for the great content!
@jordanhasnolife5163 3 months ago
I think they're more or less equally feasible for this one, feel free to go with mongo, basically personal preference at this point
@Didneodl 9 months ago
Wonder why a key-value DB wouldn't be the DB choice. It doesn't require data relations, we can quickly check if a tiny URL has been used, and it also supports fast retrieval
@jordanhasnolife5163 9 months ago
Yeah I tried to say that this would be the best choice, but just doubt that storing all data in memory is feasible from a cost perspective, so wouldn't be surprised if your interviewer pushed you harder here.
@adi96adi 8 months ago
I was wondering the same thing. It's weird how costs usually come up in the "what to say in the beginning" interview advice, but aren't mentioned more in technology tradeoffs. I also would've thought that non-relational DBs are, for the most part, cheaper than relational ones, but I guess that's only up to the point where they stop being caches and start being proper DBs. Am I right or missing something?
@jordanhasnolife5163 7 months ago
@@adi96adi Yeah not exactly sure regarding the cost of non relational vs. relational dbs but I can say with certainty that storing in memory is more expensive and for something like tinyURL, likely to be overkill.
@ihor_biedin 18 days ago
Hi Jordan, great video! I'm not sure that we have to have userId in the table, because the same long URL can be used by different users, and the eventual short URL (hash) will be the same.
@jordanhasnolife5163 18 days ago
While I agree with you, you want to be able to attribute clicks to the links of different users, so it may be useful to have different short urls for the same long url
@sarrahdoesthings 2 months ago
I'm here for your eyes
@jordanhasnolife5163 2 months ago
👀
@thunderzeus8706 1 month ago
Hi Jordan, this is the best in-depth design video about TinyURL in any book or on any website. Thanks a lot! I do have two questions: 1. I am confused by the discussion around the concurrent writes part. In the video you first ruled out the multi-leader and leaderless replication designs. Then you discussed how to handle concurrent writes to a single row. But if we are choosing single-leader replication, would this still happen? Shouldn't the leader handle the concurrent writes with locking or queuing, and determine the "one with the later timestamp" to be unsuccessful (and maybe then assign a different unused shortURL to it)? 2. In many other sources, there is another way to assign the tiny URL: distributing sequentially increasing numbers in base 36/62 via a ticketing server. The ticketing server can scale itself by preallocating different ID ranges. I am curious what kind of partitioning this design could have to handle large read and write traffic? If we use hash-range partitioning, clearly all the writes will concentrate on one partition, which is BAD. Of course we could do another hash on the shortURL and partition on that to make it evenly distributed, but hashing itself is not free, and adds complexity. What's your point of view on that? Thanks in advance!
@jordanhasnolife5163 1 month ago
1) Yes, which is why we use single leader replication with locking :) 2) Yeah, that's the key generation approach that we describe. Each database covers a range of keys, and hence we should get linear increases in speed as we add more nodes. I don't see why all writes would concentrate on one partition if we use a hash range; you can just randomly route each incoming request to one of the partitions, and the partition will assign it a UUID.
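On question 2, a minimal base-62 encoder of the kind such a ticketing server would apply to its sequential IDs (the alphabet ordering is arbitrary):
```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def encode_base62(n: int) -> str:
    # 62**8 is roughly 218 trillion, so 8 characters comfortably cover
    # the ~1 trillion URLs estimated in the video.
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))

assert encode_base62(125) == "21"  # 125 = 2*62 + 1
```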
@isaacneale8421 1 month ago
Great video, especially the discussion on single leader vs multi leader databases here. A couple notes: I don’t think write through for the CDN is a good idea. If you wrote a 10 GB file there, it would be at the expense of other files that may be getting accessed frequently. We shouldn’t give priority to a large unknown file over something that has been accessed and we won’t know will be popular. This is assuming that CDNs have a bounded size. Maybe there is a fancy one out there that isn’t bounded. I think you could make an argument that you won’t have records that are accessed for a long time after they are created.
@jordanhasnolife5163 1 month ago
Yeah agree with all said, I don't think we know in advance what's popular here anyways so a write around approach seems better.
@kumarc4853 3 months ago
Learned a lot; concise, well explained, and also funny. I have a sys design interview today, thank you so much
@jordanhasnolife5163 3 months ago
Good luck!!
@michaelparrott128 9 months ago
One comment about the writer - if the writer is an API instance serving the requests to the user, and we're writing to the CDN/S3/DB, this could potentially take a long time and you'd likely want some kind of async mechanism to do these writes with a way of updating the user on its progress.
@jordanhasnolife5163 9 months ago
Agreed. That's potentially more of a client side change though, pretty sure that when uploading files to these services they expose the percentage completion
@krish000back 7 days ago
Awesome content. I wasn't expecting this much detail in a URL shortener. Two questions though: 1. How will partitioning into Kafka be done? Ideally, servers need to know which Kafka partition the message should be sent to, and some coordination service would be required to maintain that? 2. For writing large files for PasteBin: the CDN is a cache, and a cache will have an eviction policy. So if a URL with a gigantic file isn't used for days, and someone hits it after, let's say, weeks (assuming LRU), it might have been evicted, and the CDN would need to load it again anyway. Is the only reason to put it in the CDN first to make sure popular URLs stay in the CDN, so that even the first few users who hit it all get it pretty quickly?
@jordanhasnolife5163 7 days ago
1) Yeah kafka employs zookeeper internally. You can just tell it a "partition id" you want to read and it figures out the rest IIRC. 2) Yep basically! If it hasn't been hit in a while, no point of keeping it in the CDN anyways.
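On question 1, Kafka's default partitioner does essentially the following when events are keyed by short URL (Kafka actually uses murmur2 rather than MD5; this is just the shape of it):
```python
import hashlib

NUM_PARTITIONS = 64  # assumed; a real cluster sizes this for throughput

def partition_for(short_url: str) -> int:
    # Every click event for a given short URL lands on the same partition,
    # so one consumer owns that URL's counter and no locking is needed.
    h = int(hashlib.md5(short_url.encode()).hexdigest(), 16)
    return h % NUM_PARTITIONS
```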
@2020goro 6 months ago
11:21 Hi, here you mention the race condition when 2 users add/create the same actual URL, but I'm confused as to why this would happen if the hash function accounts for the userID already when generating the tinyURL? Wouldn't that make the tiny URLs unique for the 2 users, even for the same links?
@jordanhasnolife5163 6 months ago
It's super unlikely but as long as there are fewer hash combinations than userId and url combinations you can have some collisions
@vedant9173 1 month ago
Thanks for explaining the rationale behind your design decisions, very few videos do that
@cole_dm 8 days ago
Can you explain a bit more why an on-disk hash index would not work for this problem? I understand that on-disk hash indexes suffer from poor data locality and disk I/O optimization, but in this case we don't care about range queries, so locality is not a concern. Also, it seems to me that we can cache disk pages in memory for a hash index just as we would for a B+tree or similar, and achieve a similar probability of cache hits, as co-located shortURLs on disk have no increased probability of access. At the very least, it seems like we could use a hybrid solution with a B+tree to index our on-disk hash table and achieve constant-time value access once the hash table range containing our target key is paged into memory. PS love your channel dude, really interesting stuff 🫡
@jordanhasnolife5163 7 days ago
In this case it would work great if you keep the hashing part of it in memory and store the address on disk as the value. Agreed!
@cc-to2jn 9 months ago
oh wow u changed the original system design vids! Great I was thinking of reviewing soon :)
@jordanhasnolife5163 8 months ago
Welcome back ;)
@Ryan-sd6os 3 months ago
Hey Jordan, curious: why do we avoid multi-leader and use single-leader when both can have stale data issues?
@jordanhasnolife5163 3 months ago
I'm not concerned about stale data, I'm concerned with write conflicts, which would occur when two users both claim the same URL. Single leader replication doesn't have this issue.
@Ryan-sd6os 3 months ago
@@jordanhasnolife5163 thank you legend.
@edcoronado 8 months ago
Great video, man. A lot of system design videos are super generic, just "here's some Lego blocks that I know off the top of my head," and don't take a bottom-up approach where the system architecture is designed to support a specific algorithm rather than just a top-level cookie-cutter use case.
@jordanhasnolife5163 8 months ago
Appreciate that man!!
@SatinderSingh71 8 months ago
This series is definitely in a league of its own. Who da man! Jordan da man - Fellow dork
@PalanivelPurushothaman 1 month ago
Great video! I have a question about 29:51: is it exactly 1 idempotency key per row, and not a few, if we partition the Kafka topic by short URL? Can you please clarify?
@jordanhasnolife5163 1 month ago
Yes, I was saying this is if you don't partition by short URL
@harshdalal2891 5 months ago
Why do you need predicate locks for handling inserts in a MySQL DB? Why can't we just use the primary key / unique key constraints given by the DB to achieve uniqueness?
@jordanhasnolife5163 5 months ago
To be clear, we totally can, I just would imagine that under the hood that has to use some sort of predicate locking so that we don't create two rows with the same primary key
@vedantshinde2277 7 months ago
Great stuff!! So, to maintain the unique constraint using predicate locks, serializable transactions will have to be used, with first a select and then an insert, right?
@jordanhasnolife5163 7 months ago
Yep more or less!
@obidan2 6 months ago
@@jordanhasnolife5163 If there is already a unique constraint on the database for the short_url, I am confused as to why you would need to do anything in this case. The database will return a UNIQUE_CONSTRAINT failure to whoever tries to INSERT after a short_url was already inserted, and that client can retry with an increased number. Am I missing something?
@mcee311 9 months ago
For multi-leader / leaderless, can't we resolve the issue of conflicting writes by using consistent hashing?
@jordanhasnolife5163 9 months ago
You could say something along the lines of "for key x only write to replica 1 of partition 3". But then you've basically just re-created single leader replication.
@vipulspartacus7771 2 months ago
Hi Jordan, I have a few questions on how this can be deployed globally. Suppose you need to generate URLs for different users across the world. Then you will need to deploy the generator and datastores in different regions and generate unique hashes for each region. If not, there could be hash collisions, which would require you to check different databases on different continents. Maybe we need a UUID kind of solution, so that each region can have its own local store which can create and serve the content better. Also, you might be getting the view count from different continents. Even in this case, you need some sort of sharded-counter type of thing to consolidate counts from different regions. What are your thoughts?
@jordanhasnolife5163 2 months ago
I'd use a key generation service with a db leader in each region, and aggregate counts using stream processing via Kafka and flink, as that can take as long as I need. You can have DB follower replicas in other regions for faster reads.
@vipulspartacus7771 2 months ago
@@jordanhasnolife5163 By key generation service, do you mean something that will append a region-specific code to the hash? Otherwise, how can the key generated be unique, as verifying that it doesn't conflict with other URLs in other regions will take time.
@dhruvpatidar359 2 months ago
That Google logo and the stand behind it look like a ghost to me 😰😱
@choucw1206 8 months ago
Great video! However, can you elaborate on the following things related to short URL generation? 1. Most common hash functions, e.g. MD5, SHA-2, generate a string way longer than you need. How do you use them while still keeping hash collisions low? 2. Can you elaborate on how to detect a hash collision? You mention using linear probing, which kind of implies you want to do it in memory, so my guess is the service probably needs to first query the DB to check if this hash/key already exists. Won't that be a performance concern? 3. I've read a few other designs for generating the unique short URL, and some of them raise issues with hashing the original URL (optionally including other data like a timestamp). They usually suggest using a separate unique-ID-generation service/component and performing proper encoding like base 62, using the result as the short URL (besides the domain name part). Have you ever considered this, and what are your thoughts? Btw, unique ID generation strategies in distributed systems may be an interesting topic to discuss as well, if you haven't done it yet 🙂
@jordanhasnolife5163 8 months ago
1) I think you should just be able to truncate; if we found that this was leading to too many collisions, we could always use 10 characters instead of 8, for example. 2) Oh, on the contrary, I'm basically proposing we use application logic that says: if key x is free, take it; otherwise try to write key x+1 in the DB. This could be done a bit faster with something like a stored procedure.
@jordanhasnolife5163 8 months ago
3. I don't really see the difference, feel free to elaborate on how this improves things!
@SatinderSingh71 8 months ago
@@jordanhasnolife5163 Grokking says the key gen is great at solving hash collisions since everything is created beforehand by an offline process. It creates two tables in your DB (used & unused). The application can periodically grab a bunch of unused keys from the key generation service (KGS). Once used, the app tells the KGS to move the entry into used. That's their reasoning for it, anyway. Yours makes sense as well: just populate the DB with everything from the get-go, since it's only a few TBs, which is nothing.
@varshard0 7 months ago
@@jordanhasnolife5163 Generating a unique ID in sequence, then creating a base 36-64 string as the shortened URL, lets you avoid ever having to deal with a collision. A running number is guaranteed to be unique, unlike a hash. However, it runs the risk of being guessable, and could be DDoSed by making 2 shortened URLs that point to each other, creating an infinite loop.
@adithyasama7558 5 months ago
@@jordanhasnolife5163 So each url-assign node will be assigned a range, for example (0000 - 5555). This way no 2 nodes will ever try to insert the same short URL. Each node just increments its local counter whenever it creates a new short URL. This way there's no conflict resolution to be done.
@truptijoshi2535 1 month ago
You have drawn multiple instances of Kafka in the same partition. Does it also follow single-leader replication? If not, how does it determine which click-count event should go to which replica of the same partition?
@jordanhasnolife5163 1 month ago
As far as I can tell I only drew one Kafka instance per partition, but yes this would use single leader replication
@mmdip2001 1 month ago
Thanks for this video. One of the best I have seen on this topic. Question for you: you're saying we partition by the short URL key, where keys from A-M go to one node and keys from N-Z go to another node. How do we know in advance which bucket we will be generating the key for?
@jordanhasnolife5163 1 month ago
You take the hash of your inputs and then use some load balancer which listens to a service discovery layer (zookeeper) to go to the right node
@truptijoshi2535 1 month ago
Also, could you explain the sequence of workflows for PasteBin? Does it redirect to the long URL and then fetch the paste content from the CDN?
@jordanhasnolife5163 1 month ago
Yeah, perhaps after the service sees a certain number of loads for some paste it can place it in the CDN and replace the link for it in the database accordingly.
@fillywinks 8 months ago
Great video. I wonder if it would be considered overengineering to go for a keyed, time-windowed stream in Flink, sink each trigger to the DB idempotently (semi-event-sourcing style), and trigger a reduce on the DB once every x updates / once in a while (something like what RocksDB does). You know what, I think I just answered my own question writing this comment 😅
@jordanhasnolife5163 8 months ago
Ha, I think you could probably do it, just not sure if it's necessary!
@akbarkool 5 months ago
For the issue with multi-leader potentially causing write conflicts, a simple fix could be to hash both the incoming URL and the outgoing URL together. That would make the chances of the same hashed value vanishingly small, unless the business is also looking up p0rn sites
@jordanhasnolife5163 5 months ago
Haha yeah I think another thing to note here is the range space. We can hash on lots of fields but if we don't have that many hashing combinations we'll have collisions
@ShivangiSingh-wc3gk 21 days ago
In interviews, do we have to talk about all the ways we can do a part, e.g. all the ways to hash/partition/cache?
@jordanhasnolife5163 19 days ago
I guess discuss with your interviewer, but it's not unreasonable to touch upon why you wouldn't want to do a particular approach
@firezdog 3 months ago
For the caches, since we want to handle hot-keys, why not use LFU as opposed to LRU for the eviction policy?
@jordanhasnolife5163 3 months ago
I think they both have their pros and cons here, but a URL that was popular a year ago but nobody looks at now is still technically pretty "frequently used"
@ninad11081991 8 months ago
Is it really worth the tradeoff to use a relational DB here out of fear of write conflicts? Aren't we giving up a ton of write throughput by making this choice? What if we instead don't take the tinyURL as input from the user, just take their long URL, have our Cassandra pre-populated with tinyURLs, and then round-robin requests among our leaderless architecture?
@ninad11081991 8 months ago
Fun fact, Bloomberg actually rejected me because I used mysql in my system design interview for this problem
@jordanhasnolife5163 8 months ago
I think it's more about how you justify it. To clarify, I said to use single-leader replication, not necessarily a relational DB. That being said, if you round-robin requests around the leaderless architecture you propose, two people could very well still take the same shortURL on different replicas, which will lead to a write conflict.
@ninad11081991 8 months ago
@@jordanhasnolife5163 Makes sense! and Makes Sense. If I get asked this again, I'll bring up both considerations
@brandonwl 8 months ago
@@ninad11081991 It is very hard to shard SQL DBs but much easier using MongoDB/Cassandra. For MongoDB you can use the same strategy as the SQL DB shown in the video. But for Cassandra you need a separate KeyGen service, since write conflicts can happen in Cassandra. Do you think they failed you because they were looking for a Cassandra solution?
@sachin469D 9 months ago
Been watching all your videos. Can you do a video on Lucene, or on how, in general, a search index works with a given edit distance?
@Saurabh2816 9 months ago
Yeah, interested in Apache Lucene, Elasticsearch, and search indexing
@jordanhasnolife5163 9 months ago
I already have! kzfaq.info/get/bejne/qt9pd7SZspmWnYE.html&ab_channel=Jordanhasnolife Didn't go into Levenshtein distance indexing, I think it's probably too specific for the details of this particular channel but I'm sure there are some good resources explaining it somewhere online!
@jerryhuang472 5 months ago
Thanks for the vid. Wondering: what if the URL read succeeds but the write to the Kafka queue doesn't? Is there any decent way to solve this? I know you mention two-phase commit a bit, but was wondering if there's anything faster.
@jordanhasnolife5163 5 months ago
Nope, unfortunately nothing better than 2pc. If you figure it out though let me know, we'll be billionaires!
@moonbiscuit8742 8 months ago
Just found the channel, and now watching all your videos from a year ago. Appreciate that you're going through the rationale and building intuition about these concepts. Also, what would be your top 3 book recommendations?
@jordanhasnolife5163 8 months ago
DDIA, 50 shades of gray, the art of the deal
@jordanhasnolife5163 8 months ago
Thanks for the kind words!
@YeetYeetYe 10 days ago
Good content technically speaking, but a bit removed from an actual System Design interview. I found HelloInterview to be more geared towards passing the actual interviews (I swear I'm not shilling, just giving you feedback because I genuinely love your channel for the technical aspect).
@jordanhasnolife5163 10 days ago
I'm hearing this a lot these days. I could see it depending on the interviewer/level, but I'm pretty content to overprepare for these since I find it actually helps me at work too! And no worries, you can have preferences :)
@MasterRace023 2 months ago
At 28:55, why do we need to store an extra idempotency key for each additional Spark Streaming consumer? I'm not quite understanding that part
@jordanhasnolife5163 2 months ago
Well, each one is basically simultaneously aggregating some of the counts of people that have clicked our link. If one of them goes down and then comes back up, and tries to hit the database again with an operation to increment the counter for our link, we need to make sure we haven't already applied that operation (Spark Streaming could have gone down before the ack from the DB got back to it). Using an idempotency key allows the database to say: update y from consumer z has already been applied, don't do it again.
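A sketch of that idempotency check, with a dict standing in for the database; the key must be persisted in the same transaction as the count so the two can never diverge:
```python
rows = {"abc12345": {"clicks": 0, "last_idem_key": None}}  # stand-in for the DB

def apply_batch(short_url: str, delta: int, idem_key: str) -> None:
    row = rows[short_url]
    if row["last_idem_key"] == idem_key:
        return  # consumer restarted and re-sent a batch we already applied
    row["clicks"] += delta
    row["last_idem_key"] = idem_key  # written atomically with the count

apply_batch("abc12345", 10, "batch-17")
apply_batch("abc12345", 10, "batch-17")  # duplicate delivery is a no-op
assert rows["abc12345"]["clicks"] == 10
```
Keeping only the last key suffices precisely because a single consumer per URL delivers its batches in order.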
@dmitrigekhtman1082 6 months ago
If you write to S3 first and then the DB write fails, you're left with untracked, orphaned junk data from the failed request. Another approach would be to write to the DB first to persist the intent to upload the paste, do the rest async, and make sure to communicate something reasonable to the user while the upload is in progress.
@jordanhasnolife5163 6 months ago
While this is true, I prefer orphaned junk to the alternative of a paste that links to nothing. I could just have a cleanup job run in the background once a month!
@dmitrigekhtman1082 6 months ago
A paste that links to nothing is definitely bad. Recurring cleanup works and is necessary in any case, at least as a fallback for bugs. But it's more reliable to have your normal path be to write something to the DB first [chunk_id: xxx, status: need_to_upload], then take (idempotent) action [upload to S3 if not done already] based on the persisted goal state, then record that you're done [chunk_id: xxx, status: uploaded]. Less things to clean up this way. The relevant buzzword is reconciliation.
@jordanhasnolife5163 6 months ago
@@dmitrigekhtman1082 Seems like a fair point to me!
@truptijoshi2535 2 months ago
Hi Jordan! Great video. I have a question about probing. Should the application logic check if the hash is available? Since we are creating a hash based on the long URL + userID + timestamp, why would there be any conflict? Should we also consider UUIDs, Twitter's Snowflake, or base-62 conversion to avoid probing? Could you please explain more about the hashing function?
@jordanhasnolife5163 2 months ago
Just because those are our inputs to the hash function doesn't mean we'll generate a unique output. If we only have 6 characters in our short link, for example, some things will hash to the same value.
@kanani749 3 months ago
Hey Jordan, great video. Wanted to ask how the analytics calculation would work if we were to use atomic write operations (I am referencing DDIA Chapter 7 - Preventing Lost Updates - Atomic Write Operations). From my understanding, we are basically trying to implement a read-modify-write cycle, and it would still boil down to a lock, which you mentioned. If we designed a distributed Redis store partitioned by URL to handle the counter incrementing, would the main issue be maintaining an accurate count during failover? Could using a consensus algorithm help solve this issue, or would it reduce performance and throughput too much?
@jordanhasnolife5163 3 months ago
We can use atomics, but atomics are built around grabbing locks. If we have too many people incrementing the same counter, we could have contention. In my solution, I batch all of the updates and increment them in intervals, via a real-time stream consumer. By having just one consumer per URL, I don't need to worry about locking. Since Redis is single threaded, I suppose your idea works. That being said, again, we are limited by the throughput of the Redis node. While a consensus algorithm would ensure that we don't lose any writes, I'd agree with you that it's probably too slow and doesn't make sense here.
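A framework-free sketch of that batching: aggregate an interval's clicks in memory, then emit one increment per distinct URL (db.increment is a hypothetical atomic-add call on the database):
```python
from collections import Counter

def flush(events: list[str], db) -> None:
    # events: short URLs clicked since the last flush. Pre-aggregating
    # turns len(events) row updates into one per distinct URL, so hot
    # links cause far less write contention.
    for short_url, delta in Counter(events).items():
        db.increment(short_url, delta)
```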
@matthewsaucedo2471 9 months ago
25:25 What is the difference b/t Spark Streaming and Flink, if you were to just use a larger batch in Flink than your example (you said 10, but consider we use a more comparable 100)?
@jordanhasnolife5163 9 months ago
I'd say none then really, just wanted the opportunity to use something other than flink lol
@matthewsaucedo2471 8 months ago
I’ve never used either technology (only Kinesis streams from AWS) so was just curious!
@truptijoshi2535 1 month ago
Hi Jordan! If we are pre-generating the keys, then we cannot use the user ID and long URL for the hash, right? Then how do we generate the 8-character unique short URL?
@jordanhasnolife5163 1 month ago
I'm not sure what you mean when you say pregenerating keys, presumably you mean the materialized rows approach. In this case we do no hashing and just randomly assign a write to one of the partitions. If you meant the hashing approach we hash the user id and long URL and deal with any collisions via probing.
@truptijoshi2535 1 month ago
@@jordanhasnolife5163 Yes, for the materialized rows approach, how are we generating unique keys? Are we using UUIDs?
@5atc 8 months ago
When you say you would use database partitioning, do you actually mean sharding? How do you approach discussing these two topics during an interview?
@jordanhasnolife5163 8 months ago
I've never really known them to be different terms
@Luzkan 6 months ago
Sup Jordan! After 2:35 I went ahead and made a design too, and I'm looking forward to hearing an opinion on my thoughts; feel free to confront and bully me based on my decisions. You hard-constrained the choices around no write conflicts (with the example of an identical hash for two different URLs made simultaneously by two users). I thought that situation would be extremely rare in real life (and even then, we could add extra complexity to how we generate the tinyURL to assure its uniqueness, with a salt maybe, or a distributed counter, but tbh only after such an event had happened a considerable number of times). Thus, I wouldn't want to discard options based on that edge case. That enables the Cassandra option, which looks very nice for a system which is in fact read-heavy, as you said, but! each tinyURL visit (which is a read) is also a write because of the inc() counter requirement. I really like your idea of going with mini-batch updates and Spark Streaming, though. I think this is also a good design, I'm not holding a strong opinion on either of those, and now what? :D Cheers! Thank you for the content, keep up the great work!
@jordanhasnolife5163 6 months ago
Yeah! I think Cassandra could be interesting here, assuming that you used a leaderless counter (so basically a CRDT), and I agree that IRL conflicts should be relatively rare. Since there are a lot fewer hashes than there are URLs different sites may hash to the same result but ideally by then enough time will have passed that Cassandra could have performed anti entropy so we could see that the hash was taken and then probe to another. Thanks, I appreciate the kind words!
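The leaderless counter CRDT mentioned here, in miniature: a grow-only counter, where each replica increments only its own slot and merges take an element-wise max, so concurrent increments never conflict:
```python
class GCounter:
    def __init__(self, replica_id: str):
        self.replica_id = replica_id
        self.slots: dict[str, int] = {}

    def increment(self, n: int = 1) -> None:
        self.slots[self.replica_id] = self.slots.get(self.replica_id, 0) + n

    def merge(self, other: "GCounter") -> None:
        # Element-wise max is commutative, associative, and idempotent,
        # so replicas converge regardless of anti-entropy order.
        for rid, v in other.slots.items():
            self.slots[rid] = max(self.slots.get(rid, 0), v)

    def value(self) -> int:
        return sum(self.slots.values())
```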
@Anonymous-ym6st 4 months ago
Another question: when you mention that a few TB should be fine for one single machine, I am wondering what storage can usually be handled by one machine in practice? A few hundred TB?
@jordanhasnolife5163 4 months ago
Depends how you're storing the data, but just look up the prices of hard drives and look up how many you can fit in an average server
@ariali2067 5 months ago
Hey Jordan, you mentioned that we shouldn't use a write-back cache because that would incur the same issue as multi-leader/leaderless replication. Curious why we cannot use Redis as a write-back cache here? Redis supports single-leader replication, right? Thanks in advance!
@jordanhasnolife5163 5 months ago
Hi Aria - you can use redis as a write back cache, and redis supports single leader replication. But a write back cache means that you write to a cache first and eventually flush that to your database. If you haven't watched my playlist prior to this one, I think you may get some benefit from drilling down those concepts first! These videos should then all make more sense.
@radosawmul8963 4 months ago
What strikes me is the question of whether Kafka can actually handle 2 trillion partitions 👀 For example, Firehose can handle a max of 5k partitions per delivery stream. I think some more hybrid approach would be in order here.
@radosawmul8963 4 months ago
29:40 Unless you didn’t mean to have a partition for each url, but just a range-based partitioning. Because that was kinda clear only in the last architecture diagram, not during analytics deep dive. If so then all good 😅
@jordanhasnolife5163 4 months ago
Yep you got it!
@Henry-wq7ki 3 months ago
If we are single-leader, why do we need a predicate lock? Only one client will access the row, right?
@jordanhasnolife5163 3 months ago
Not if it doesn't exist yet!
@rakeshvarma8091 4 months ago
Great video, Jordan. I have a couple of questions. 1. You're saying we will partition by short URL key, where keys from A-M go to one node and keys from N-Z go to another node. But why do we need this kind of partitioning? Consistent hashing distributes keys evenly among available servers, doesn't it? Is it wise to use consistent hashing everywhere, i.e. for DB partitioning, cache partitioning & Kafka queue partitioning? 2. At kzfaq.info/get/bejne/a7xmf8Sena2-n2g.html, can you shed some more light on the idempotency key for achieving exactly-once processing? How does keeping the last written key help in avoiding duplicate updates for clicks? We might get clicks for that same key later as well, right, i.e. subsequent clicks on the URL? How do we differentiate between these two?
@jordanhasnolife5163 4 months ago
1) We are using consistent hashing, albeit on the range of short keys. That's because the short keys themselves are hashes, no need to hash them again (there's no reason you couldn't if you wanted). 2) Each batch of clicks that we handle in spark streaming gets assigned a random key id (you can hash the incoming data or something so it is deterministic), called an idempotency key. When we change the value in the db, we check (in the clicks database) to make sure we haven't seen that idempotency key so that we don't accidentally update our db twice.
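For point 1, a sketch of range-partitioning directly on the short key; since the keys are themselves hash output they're already uniformly distributed, and the split points below are purely illustrative:
```python
import bisect

SPLIT_POINTS = ["8", "g", "o", "w"]  # hypothetical boundaries over base-36 keys

def partition_for_key(short_url: str) -> int:
    # Hash output is uniform over the alphabet, so equal-width ranges
    # receive roughly equal load. Returns a partition index in 0..4.
    return bisect.bisect(SPLIT_POINTS, short_url)
```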
@zhonglin5985 4 months ago
Hi Jordan, nice content! A question -- why would it be a problem if we used Spark Streaming to calculate total counts? You mentioned specifically that it will fail if there are multiple Spark Streaming consumers. What does that mean?
@jordanhasnolife5163 4 months ago
I just meant that if you have many consumers for one url, now your counts are split between many consumers. That's ok, but you'll face contention when incrementing the counter in the db from multiple places.
@OF-nf2cb 2 months ago
Hi Jordan, thanks for this video! I don't get why we need to worry about two people adding rows with the same primary "tinyUrl" key. Doesn't every ACID-compliant DB prevent adding two of the same row like that, or am I missing something very obvious??
@jordanhasnolife5163 2 months ago
Yup! (Admittedly I don't know how this would work when many partitions are involved), but this is how they'd do it
@OF-nf2cb 2 months ago
Sorry, I still don't understand 😢 I was referring to 11:15, where it talks about locking for adding new rows. I thought databases already did this, so why do we need to worry about it?
@jordanhasnolife5163 2 months ago
@@OF-nf2cb Because it's a systems design interview, and we should know how things work :)
@OF-nf2cb 2 months ago
@@jordanhasnolife5163 Ah, I see, so you were talking about how DBs handle it internally? I was thinking you meant we were implementing that
@jordanhasnolife5163 2 months ago
@@OF-nf2cb Depends on the db :) as you mentioned, most SQL ones will get it done for ya via primary key constraints
@Henry-wq7ki 3 months ago
Can you please clarify how you assign short URLs? The second option, with materialized conflicts?
@jordanhasnolife5163 3 months ago
Put some URLs on multiple database partitions, and have enough partitions that the increased throughput of many partitions is able to offset the lock contention required to assign the next URL (by primary key) to a given user that gets routed to that partition.
@Henry-wq7ki 3 months ago
@@jordanhasnolife5163 thanks
@tarunnurat 6 months ago
Hi Jordan, great vid. Just had a question. In your DB schema, you mentioned having an index on the tinyURL. But given that the tinyURLs in each row are unique, wouldn't they be the primary key for our rows? And if so, why would we add an index on a primary key?
@jordanhasnolife5163 6 months ago
Same thing basically - not all databases sort on that primary key I believe so I'm just saying this to be clear
@Anonymous-ym6st 4 months ago
Thanks for the great content! I am wondering, for the write-back cache and data consistency, is it possible to make sure the same user always writes to and reads from the same cache? In that case, would there still be any data consistency problem, or anything I'm not considering correctly (apart from edge cases where the cache node fails and they can only read from a new cache)?
@jordanhasnolife5163 4 months ago
Yeah in theory if you use consistent hashing to put the same user on the same write back cache this should be ok, but as you mentioned if that cache goes down we're in trouble
@yashagarwal8249 6 months ago
Hey Jordan, love your content! I'm a bit confused about when the S3 data is being used here, if we are always writing to CDN first and fetching from it as well. Is it a fallback if some data gets evicted from a particular edge CDN server?
@jordanhasnolife5163 6 months ago
Hey Yash! We aren't always writing to the CDN here! Rather, we use the CDN as a write around cache, where pastes that are accessed many times ideally will eventually be loaded in the CDN. For the first write, we go to S3.
@tunepa4418 5 months ago
How can we achieve or guarantee uniqueness of the tiny URL field across multiple shards? Wouldn't this be too expensive, especially if we are using predicate locks?
@jordanhasnolife5163 5 months ago
You shard by the tinyurl ID so that you don't have to!
@satadhi 3 months ago
Hello Jordern, is it possible to get these slides (or whatever they are)? They would be very helpful for revision. Thank you :)
@jordanhasnolife5163 3 months ago
Eventually (when I am done with this current series) I will upload everything in batch - Jordern
@orhemi8076 7 months ago
What are the benefits of choosing a SQL DB over NoSQL? NoSQL can scale horizontally much more easily
@jordanhasnolife5163 7 months ago
What makes you say that noSQL scales horizontally much easier? This may be true when we have a lot of distributed joins to perform, but in the case in which we don't, I really don't see any advantage. We get equal amounts of data locality either way.
@ilqarilyasov 4 months ago
5:27 You mentioned that the hash function provides a unique value. Are you sure? Doesn't the hash function of a hash table typically return an index, to map keys to indices in the underlying array?
@jordanhasnolife5163 4 months ago
A hash function is any function that deterministically maps some element of a domain to a different element of a range
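Concretely, the kind of hash function being described: deterministic over its inputs, with truncation shrinking the range, which is exactly why collisions become possible:
```python
import hashlib

def short_code(user_id: str, long_url: str, ts: int) -> str:
    # Deterministic: the same (user, URL, timestamp) always yields the same code.
    payload = f"{user_id}|{long_url}|{ts}".encode()
    digest = hashlib.sha256(payload).hexdigest()
    return digest[:8]  # 16**8 ~ 4 billion codes: far fewer than possible inputs
```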
@bennyamor 4 months ago
Creating a partition for each short URL in Kafka could be problematic: a large number of partitions in Kafka leads to increased overhead in terms of memory usage, disk space, and metadata management. Each partition in Kafka consumes system resources such as file descriptors, memory, and network connections, and adds management complexity...
@jordanhasnolife5163 4 months ago
Not proposing this - when I say partition by x it's me using shorthand for partition by hash range of x
@bennyamor 3 months ago
@@jordanhasnolife5163 thanks
@fedpet 2 months ago
Why do you think hash indexes are only in-memory? AFAIK Postgres, for example, supports persistent on-disk hash indexes
@jordanhasnolife5163 2 months ago
Even if they weren't in memory, the performance would be bad due to random seeks. There do seem to be augmentations of normal indexes with hash ones, but those are gonna be in memory.
@fedpet 1 month ago
@@jordanhasnolife5163 AFAIK, hash indexes used to perform worse in Postgres because they were not optimized. They are now, and should outperform B-trees for equality searches. In a B-tree, for those searches, you are still paying O(log N) every time
@pranapavan 5 months ago
@jordan do you have slides for the problems, just like you have for the system design concepts? Thank you, keep rocking!
@jordanhasnolife5163 5 months ago
Yep! I'm waiting until the series is done to get them all posted in batch
@AizazShahid-ck8cn 1 month ago
If we are doing something (probing, etc.) to avoid collisions, then why do we have to care which replication we use? Then there should be no conflict, no?
@jordanhasnolife5163 1 month ago
We still have eventual consistency. I can grab a short URL x on node A and you can grab it on node B, and we won't know that there's a write conflict until we synchronize.
@AizazShahid-ck8cn 29 days ago
@@jordanhasnolife5163 Wouldn't HBase make the most sense in that case?
@jordanhasnolife5163 29 days ago
@@AizazShahid-ck8cn I think that anything with single leader replication would. I don't think we get much benefit from being column oriented here, and if we're prioritizing for read speeds perhaps we're better suited with a B-tree.
@vanderer_prashant 6 months ago
If we wanted to add a new feature to the system - generating QR codes instead of short URLs - what would the changes be, if any?
@jordanhasnolife5163 6 months ago
Yeah I'd say that's effectively the same. I'd imagine there's some form of string representation of a QR code and we would just store that as the primary key in the database instead of a short link.
@LeoLeo-nx5gi 8 months ago
Firstly, very awesome content, thanks a ton. Secondly, I had a query regarding a point you mentioned on using a Kafka queue for analytics, where each short URL would belong to a partition, i.e. we might have 2 trillion partitions. I think that would be very bad in general, as it would increase latency; would it even be possible to create that many partitions? (There should be some upper bound, I think.) Thanks!!
@jordanhasnolife5163 8 months ago
I think you could make that argument about any partitioning schema to be fair! When I say partition by x, I typically mean partition by the hash range of x if that clears things up at all. My point is I just want all values for key x to be on the same node.
@LeoLeo-nx5gi 8 months ago
@@jordanhasnolife5163 ahh thanks this is helpful
@yrfvnihfcvhjikfjn 8 months ago
1 trillion urls? That's a moderately sized pornography collection 😏
@jordanhasnolife5163 8 months ago
Bit of a connoisseur myself
@himanimadan3547 6 months ago
I have a question - how would we get the same shortlink "abc" at 5:58 if we are using hash function(userId, timestamp, long URL)? If the hash function generates the same value, wouldn't it be resolved by probing? Also, the hash function has different values in it, like userId and long URL.
@jordanhasnolife5163 6 months ago
There are only so many short URLs, so even if you have different inputs, they could theoretically hash to the same one. And yes, this would be resolved by probing, but in order to actually know that the key is taken and we need to probe, we have to be sure to lock on that row so that we don't have two users grabbing a key at the same time.
@himanimadan3547
@himanimadan3547 6 ай бұрын
@@jordanhasnolife5163 Thanks! I watched many videos from this channel and all are great!
@tmehmood007
@tmehmood007 3 ай бұрын
How do we find out when the tinyURL was accessed for the first time?
@jordanhasnolife5163
@jordanhasnolife5163 3 ай бұрын
Every read has to access the database anyway, more or less, so just write the timestamp then, and don't overwrite it if it already exists. You'll need to use locking here, but to avoid the overhead you can first read the timestamp without the lock, and only if it's null grab the lock and update it, in order to minimize contention.
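A minimal sketch of that double-checked pattern (psycopg2 again; the first_accessed_at column is hypothetical):

```python
def record_first_access(conn, short_url, now):
    with conn, conn.cursor() as cur:
        # Cheap unlocked read first: almost every request takes this path.
        cur.execute(
            "SELECT first_accessed_at FROM urls WHERE short_url = %s",
            (short_url,),
        )
        row = cur.fetchone()
        if row and row[0] is None:
            # The IS NULL predicate makes the write a no-op if another
            # request set the timestamp between our read and this update.
            cur.execute(
                "UPDATE urls SET first_accessed_at = %s "
                "WHERE short_url = %s AND first_accessed_at IS NULL",
                (now, short_url),
            )
```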
@adrian333dev
@adrian333dev 3 ай бұрын
You've got good humor
@ishanvijay507
@ishanvijay507 7 ай бұрын
Hey man, just came across this channel and I see two playlists for system design problems. Are the videos from the first playlist good enough to cover ground for now, until you finish the videos in this playlist? Just asking if there's any reason behind updating them, as they were only a year old.
@jordanhasnolife5163
@jordanhasnolife5163 7 ай бұрын
Yeah I think so, but I'm just trying to be more visual in the second series. The first one is just a lot of me talking with no background images.
@hemanthaugust7217
@hemanthaugust7217 9 ай бұрын
How do you generate a hash code (base36) from the URL & timestamp with minimal collisions? I've seen other designs generating a sequence number in a distributed fashion and converting that number to base36. We could avoid that approach if you can share how to generate such a hash code.
@jordanhasnolife5163
@jordanhasnolife5163 8 ай бұрын
I'm a bit confused here, why don't you just use any typical hash function?
@hemanthaugust7217
@hemanthaugust7217 8 ай бұрын
@@jordanhasnolife5163 a hash function like MD5/SHA-256 might generate a unique code, but when it's truncated to 7-8 chars it can lead to a lot of collisions. I've not tested it, but the probability can be high.
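A minimal sketch of the truncated-hash approach under discussion. With 8 base36 characters there are 36^8 (roughly 2.8 trillion) slots, and by the birthday bound collisions appear long before the space fills, which is why the design pairs this with probing rather than assuming uniqueness:

```python
import hashlib
import time

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"  # base36

def short_code(user_id: str, long_url: str, length: int = 8) -> str:
    # Hash (userId, timestamp, longUrl), then re-encode the digest in
    # base36 and keep the first `length` characters.
    raw = f"{user_id}:{time.time()}:{long_url}".encode()
    n = int.from_bytes(hashlib.sha256(raw).digest(), "big")
    out = []
    for _ in range(length):
        n, rem = divmod(n, 36)
        out.append(ALPHABET[rem])
    return "".join(out)
```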
@varshard0
@varshard0 7 ай бұрын
I wouldn't recommend using a sequence number, as it's likely to be guessable. An attacker could then predict the next shortened URL and try to have two shortened URLs pointing at each other, creating an infinite loop that hurts the server.
@aram_mkr
@aram_mkr 4 ай бұрын
Any thoughts on using sharded counters for click counts vs. Kafka stream processing?
@jordanhasnolife5163
@jordanhasnolife5163 4 ай бұрын
Yep, I did mention counter CRDTs, which IMO are basically the same
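A minimal sketch of a grow-only counter CRDT (G-counter), which is essentially a sharded counter: each shard increments only its own slot, merges are conflict-free, and the total is the sum:

```python
class GCounter:
    def __init__(self, shard_id: str):
        self.shard_id = shard_id
        self.counts = {shard_id: 0}

    def increment(self, n: int = 1) -> None:
        # Each shard only ever bumps its own entry: no coordination needed.
        self.counts[self.shard_id] = self.counts.get(self.shard_id, 0) + n

    def merge(self, other: "GCounter") -> None:
        # Elementwise max is commutative and idempotent, so replicas converge.
        for shard, count in other.counts.items():
            self.counts[shard] = max(self.counts.get(shard, 0), count)

    def value(self) -> int:
        return sum(self.counts.values())
```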
@adithyasama7558
@adithyasama7558 5 ай бұрын
Can you explain a bit more about the technology choices? You picked MySQL as it has all the requirements you came up with, but wouldn't scaling be easier with something like MongoDB? MongoDB also has indexes, so wouldn't Mongo be the choice in terms of faster development?
@jordanhasnolife5163
@jordanhasnolife5163 5 ай бұрын
Can you clarify your assumptions here? Why does mongo lead to faster development? Why does mongo scale better? I think that people often say this about NoSQL databases without much evidence to back it up, so that's going to be my retort for you here. Generally, most developers are familiar with SQL databases, and I believe that I can get away with using the relational model here, both in the shape of data, and in the sense that I want the ability to have ACID transactions and use single leader replication. Could I use mongo? Sure - but I don't see any clear advantage that it has over MySQL when both work perfectly fine in this situation.
@sahilkalamkar5332
@sahilkalamkar5332 5 ай бұрын
Heyy @@jordanhasnolife5163, a beginner here: what about the argument that NoSQL is easier to scale horizontally? I always get a bit confused on this point. I think the horizontal-scaling claim relates to data locality in NoSQL: we don't need to perform joins across nodes. But how do I actually choose between SQL and NoSQL? Should I always go for SQL if it fits, and otherwise try NoSQL?
@mickeyp1291
@mickeyp1291 7 ай бұрын
B-trees are not scalable, and Kafka uses LSM tables as its main system and is highly scalable. So using ULIDs or UUIDs and replicating over a Kafka queue might increase capacity without degrading your performance, I think (although I'm not an expert in this system type).
@jordanhasnolife5163
@jordanhasnolife5163 7 ай бұрын
1) I don't know where you're getting that Kafka uses LSM trees, because it's basically just writing to a commit log, therefore not requiring any sort of indexing 2) What makes you say that B-trees aren't scalable? And what do you mean by "not scalable"? Let me know if I'm missing anything!
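For intuition on point 1, a toy append-only commit log (conceptually what one Kafka partition amounts to) shows why no index is needed:

```python
log = []  # one partition, conceptually

def append(record) -> int:
    # Producer path: O(1) append; the offset is just the position.
    log.append(record)
    return len(log) - 1

def read_from(offset: int):
    # Consumer path: sequential reads starting at a remembered offset.
    return log[offset:]
```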
@shuaizhang6502
@shuaizhang6502 4 ай бұрын
Kafka does not use an LSM tree or any B-tree index
@jordanhasnolife5163
@jordanhasnolife5163 4 ай бұрын
@@shuaizhang6502 Agree
@mariusz.zlotucha
@mariusz.zlotucha 5 ай бұрын
Hi, could you estimate the cost of all the SQL DBs from your example? I'm guessing it's pretty huge :)
@jordanhasnolife5163
@jordanhasnolife5163 5 ай бұрын
Hey, I'm not too familiar with exact pricing here, but you can go on AWS, check out how much it costs for an Amazon RDS instance, and then multiply that by the number of databases that you need as we discuss in the capacity estimates section.
@rishabhsaxena8096
@rishabhsaxena8096 7 ай бұрын
Hey Jordan, in the case of one publisher per row (29:34) you mentioned that we will have one Spark streamer per tiny URL id. But here we have trillions of URLs; how is it feasible to have a trillion streamers in this case?
@jordanhasnolife5163
@jordanhasnolife5163 7 ай бұрын
Hey Rishabh! What I meant there is that the data for each tiny url id belongs on one spark streaming node, but you can have a spark streaming node responsible for many tiny url ids. You basically partition the consumers/kafka queues on tiny url id.
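A minimal sketch with kafka-python (the "clicks" topic and broker address are hypothetical): keying each record by short URL is what pins all of a URL's events to one partition, and therefore to one consumer/streaming node:

```python
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode(),
)

def publish_click(short_url: str, ts: float) -> None:
    # Same key, same partition, same consumer, no matter how many
    # distinct short URLs exist in total.
    producer.send("clicks", key=short_url.encode(), value={"ts": ts})
```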
@siddharthgupta6162
@siddharthgupta6162 6 ай бұрын
Was exactly my doubt too. Thanks!
@KratosProton
@KratosProton 9 ай бұрын
Great video Jordan, one question though. Will this playlist have different questions than your last system design questions playlist, or will it be the same?
@KratosProton
@KratosProton 9 ай бұрын
BTW much more detailed explanation than the last playlist video, highly appreciated! Keep it up!!
@KratosProton
@KratosProton 9 ай бұрын
The quality, not the other thing!
@jordanhasnolife5163
@jordanhasnolife5163 9 ай бұрын
I'm going to incorporate different modifications to questions, but I'd say the questions themselves will be the same services. For example, my last TinyURL video didn't include anything about analytics. I may add a couple of new ones at the end of the series depending on how things are going.
@KratosProton
@KratosProton 9 ай бұрын
@@jordanhasnolife5163Great, thanks dude!
@user-vv8fw4fj5i
@user-vv8fw4fj5i 9 ай бұрын
I also like the way he talks while writing this time; it makes it easier for me to understand.
@hardikmaheshwari8241
@hardikmaheshwari8241 11 күн бұрын
V good content
@firewater586
@firewater586 5 ай бұрын
Great content!
@mickeyp1291
@mickeyp1291 7 ай бұрын
Loving your videos. As an enterprise system architect for the past 7 years, however, I feel that I don't understand WHY you're using hashing. Do you plan to find that hash again? This is the classic ULID approach, which was intended for replication and consistency issues. When looking up the ULID it can be indexed, and this would solve all the issues and problems you faced. Hashing is for when you need to take raw data and authenticate it somehow. I think you need to revisit this solution: you are using hashing as an indexing method, and that is a red flag for me.
@jordanhasnolife5163
@jordanhasnolife5163 7 ай бұрын
I really don't see a functional difference between hashing and a UUID here, but I agree with you that this is a perfectly acceptable option. Considering that we're taking in things like the userId and timestamp in order to create our hash, I'd imagine that we'd have equally well distributed data. According to github.com/oklog/ulid, the ULID is created from a timestamp and some random data, which effectively just makes it a variant on a hash. We don't really care about whether or not we can sort these, because we're not running any range queries here. Another advantage of hashing is that it allows me to deduplicate requests if somebody submits multiple identical ones, to stop them from creating multiple tiny URLs. I think there are many benefits to hashing beyond just authentication. Nonetheless, I appreciate your opinion and think that something like a random UUID makes for a totally fine key here and in practice works just as well as, or better than, a hash.
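A toy illustration of the dedup point: a deterministic hash of (userId, longUrl) maps identical requests to the same candidate key, while a random UUID/ULID mints a new one on every retry (the inputs here are hypothetical):

```python
import hashlib
import uuid

def dedup_key(user_id: str, long_url: str) -> str:
    # Same inputs always yield the same candidate short URL.
    return hashlib.sha256(f"{user_id}:{long_url}".encode()).hexdigest()[:8]

assert dedup_key("u1", "https://example.com") == dedup_key("u1", "https://example.com")
print(uuid.uuid4() == uuid.uuid4())  # False: random IDs never dedupe retries
```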