Dev Deletes Entire Production Database, Chaos Ensues

2,630,751 views

Kevin Fang · 1 day ago

If you're tasked with deleting a database, make sure you delete the right one.
Sources:
about.gitlab.com/blog/2017/02...
about.gitlab.com/blog/2017/02...
Notes:
1:05 - The middle bullet point about the account that had 47,000 IPs was never mentioned in the postmortem (there was an initial report the day of and a more detailed postmortem a bit over a week after that). Perhaps that was a red herring which they figured out later on didn't really matter.
3:07 - I made the error say too many open connections since it's easier to understand than semaphores
3:39 - This part was confusing, since the postmortem and the initial report conflicted. The postmortem said the engineers believed pg_basebackup was failing because previous attempts had created some files in the data directory, while the initial report said the theory was that the data directory merely existing (despite being empty) was the problem. Either way, the engineers really wanted to delete the data directory, but exactly why is unclear.
4:37 - They probably didn't check for backups in this order. I'm sure team-member-1 immediately called out he had taken a backup 6 hours earlier, and then they just had to verify the other backups in case there was a better one.
6:21 - Being reported by a troll does not automatically remove a user; it flags the account for manual review. The account was then incorrectly deleted after review.
Chapters:
0:00 Seconds before disaster
0:16 Part 1: Database issues
2:21 Part 2: The rm -rf moment
4:32 Part 3: Restore from backup
6:13 Part 4: Post incident discoveries
7:27 Lessons learned
9:46 The fate of team-member-1
10:11 ???
Music:
- Thriller Trailer Teaser Tense by Cold Cinema • Thriller Trailer Tease...
- Finding the Balance by Kevin MacLeod
- Eyes Gone Wrong by Kevin MacLeod
- Desert City by Kevin MacLeod
- Jane Street by TrackTribe

Comments: 2,600
@VestigialHead · 1 year ago
Damn I cannot even imagine the stress that admin was feeling after he realised he deleted DB1. He must have aged twenty years.
@1996Pinocchio · 1 year ago
The legendary Onosecond.
@NS-sd3mn · 1 year ago
@@1996Pinocchio I see you watch Tom Scott
@youngstellarobjects · 1 year ago
The stress should really be minimal if you have a backup and restore procedure that actually works and you know how it works. Mistakes happen. The problem wasn't the delete command, it was the nonexistent backups and documentation.
@LeoVital · 1 year ago
@@youngstellarobjects Nah, still stressful. Most companies aren't making a backup on every write that happens to a DB, so whoever deletes a DB knows that they've just made an oopsie that will cause a lot of headache for multiple people. And probably cost a lot of money for the company as well.
@pqsk · 1 year ago
As long as you have a backup there's no problem. I've done this before, but if there's no backup you prolly die of stress 😅😅😅
@Misanthrope84 · 1 year ago
"You think it's expensive to hire a professional? Wait till you hire an amateur" - some old wise businessman.
@urbexingTss · 1 year ago
that indeed is wise
@shahriar0247 · 1 year ago
Loll
@blue5659 · 1 year ago
A professional costs you in bold, italic, and underline. An amateur mostly costs you in fine print.
@-na-nomad6247 · 1 year ago
The person here is not an amateur; anyone can get brain farts, especially when working an unexpected overnight. You should try it sometime — sorry, you should try it sometime; you'll start seeing ducks and rabbits in the shell.
@Misanthrope84 · 1 year ago
@@-na-nomad6247 I'm a veteran in the DevOps field. This comedy of mistakes could never have happened to me, because I follow a protocol, which these guys obviously did not. They were guessing and experimenting as if it were an ephemeral development environment. Their level of fatigue had little to do with their failure to understand the commands they were running.
@Chris_Cross · 1 year ago
The fact they live streamed while trying to restore the data is a truly epic move.
@xpusostomos · 7 months ago
Hope it was monetized
@godjhaka7376 · 6 months ago
@@xpusostomos That's why they livestream and post anyway. Not to educate but to make money.
@Elesario · 4 months ago
Sounds like they had the spare bandwidth ;P
@joseaca1010 · 4 months ago
Programmer vtuber when?
@kv4648 · 3 months ago
@@joseaca1010 We already have one: Vedal
@Webmage101 · 11 months ago
I think the biggest problem (seemingly addressed at 6:21) is the fact they could delete an employee account by spam reporting it.
@alex_zetsu · 10 months ago
Actually at the time of the video, what they addressed was the fact that deleting an account could cause problems with the server, it seems they didn't actually stop trolls from deleting an employee's account. I'd have thought employee accounts would be protected. The trolls didn't even get admin powers through privilege escalation, they just reported the target.
@Milenakos · 10 months ago
read the video description
@DevinDTV · 10 months ago
@@Milenakos every company says they do a manual review, but none of them actually do
@Milenakos · 10 months ago
​@@DevinDTV source??? (edit: i was mostly complaining about you just saying they are lying out of thin air)
@Therealpro2 · 9 months ago
​@@Milenakos source????????????????????????????????????????????
@SIMULATAN · 1 year ago
So you're telling me a platform as big as GitLab went down because one engineer picked the wrong SSH session? Damn that makes me feel way better about my mistakes lol
@shahriar0247 · 1 year ago
I would highly suggest using a customized shell. I use oh-my-zsh and customize my theme to show git info, hostname (sometimes), and a lot more; not because I want to know which SSH session I'm in, but because I like the design :)
@syedmohammadsannan964 · 1 year ago
Dude IKR! No one engineer should have that much power, to shut down an entire company's operations for even a second.
@0xCAFEF00D · 1 year ago
@@syedmohammadsannan964 No, someone has to have that. The general problem is that there's no safety net.

I don't mean to suggest this is a good solution, because safe-rm is just jank, but using safe-rm would most likely have saved this situation. If you replace rm with a symlink to safe-rm, you can configure a blacklist on production that doesn't allow deleting the database or other critical data.

I find many things about safe-rm unsafe, though. It doesn't protect you if you cd into a directory and then do rm -rf *; a better program would simply resolve the path it's trying to delete and refuse if the blacklist covers it. It also doesn't allow custom messages in its blacklist. What you want is for a bad rm -rf to send a warning to the user; otherwise there's no guarantee people won't just work around the friction. For example, you're most likely going to protect the backup server with the same blacklist as production, just to avoid differences between the two, so a developer in this situation would expect to hit the block when deleting the Postgres data on either server, and the block itself tells them nothing. If you can configure messages, you can call attention to the hostname.

The goal is just to add friction to dangerous actions. rm has always been risky because it's so easy.
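The blacklist idea above can be sketched as a small shell wrapper. This is hypothetical (the paths, function name, and messages are illustrative, and this is not how safe-rm itself is implemented): it resolves every argument to an absolute path and refuses if it falls under a protected prefix, printing the hostname in the warning so the user notices which machine they are on.

```shell
#!/usr/bin/env bash
# Guarded rm sketch: refuse to delete anything under a blocklisted path.
# The blocklist entries here are illustrative examples.
BLOCKLIST=("/var/opt/gitlab/postgresql/data" "/var/lib/postgresql")

guarded_rm() {
    local target abs blocked
    for target in "$@"; do
        [[ "$target" == -* ]] && continue     # skip flags like -rf
        abs=$(realpath -m -- "$target")       # resolve even if path doesn't exist
        for blocked in "${BLOCKLIST[@]}"; do
            if [[ "$abs" == "$blocked" || "$abs" == "$blocked"/* ]]; then
                echo "REFUSED on $(hostname): '$abs' is blocklisted" >&2
                return 1
            fi
        done
    done
    command rm "$@"   # nothing matched; run the real rm
}
```

With this sourced in the shell, `guarded_rm -rf /var/lib/postgresql` fails with a warning even with -rf, while deletes outside the blocklist pass through unchanged.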
@Darkk6969 · 1 year ago
@@0xCAFEF00D I always check the hostname of the server and triple-check the directory before using rm -rf. If in doubt, I mv the data to a different directory as a backup; if everything still works, I go back and delete the old directory. The same thing happened to Pixar with Toy Story 2: a storage admin ran rm -rf on a directory by mistake and practically wiped out the movie. Luckily an employee had a copy of the data offsite at the time, and they were able to rebuild the movie from it.
@BuyHighSellLo · 1 year ago
@@0xCAFEF00D No, NO single employee should have enough privilege to bring down anything business-sensitive, except maybe the CTO. These operations should all require a sign-off or check from someone else first, just like how one person usually shouldn't be able to push code by themselves; they need one or more approvals before that.
@rosscads · 1 year ago
Given the trouble they were in after the deletion, a recovery time of 24h and a recovery point of 6h is actually pretty heroic. Especially considering the stress they would have been under. 😰
@TheDaern · 1 year ago
​@@L2002 Because of this? They were open and honest about their screwups which, for me, makes them a pretty good organisation to deal with. Plenty of others would not be and, at the end of the day, this stuff does happen from time to time. My measure of a company is not how well they work day to day, but how they handle adversity. Everyone screws up eventually and it's how you handle this that marks out the good ones from the bad ones. Also, a company who almost lost a production DB because of failed backups is unlikely to do it again ;-)
@MunyuShizumi · 1 year ago
@@L2002 Ah, yes, because Microsoft never has outages, data loss, or data leak incide- oh wait..
@sinnlos229 · 1 year ago
@@L2002 Care to elaborate? Because everyone else here, including me, disagrees.
@titan5064 · 1 year ago
Don't feed the troll, clearly not someone who's ever worked with computers on a proper level
@realpillboxer · 1 year ago
@@titan5064 exactly. Their handle is "L" -- they are a literal walking loss (loser).
@Nick77ab2 · 1 year ago
This is why problems like this are actually sometimes good. Extremely stressful, of course, but they found so many issues and fixed them all. Amazing.
@federicocaputo9966 · 1 year ago
You're assuming they fixed them all... until it breaks again.
@JeyC_ · 1 year ago
@@federicocaputo9966 At least next time they'll have the experience to know what to do, and what not to do.
@brett2258 · 1 year ago
That's a really good positive approach right there!
@djweavergamesmaster · 11 months ago
reminds me of that one ProZD skit, where the villain fixes everything
@mikabakker1 · 10 months ago
@@federicocaputo9966 that is life
@dragonfire4869 · 1 year ago
This reminds me of Toy Story, and how like a month before release the entire animation was accidentally deleted, causing absolute panic and hell at Disney. Luckily, one employee had the whole thing on a hard drive that they were taking home to work on. Her initials are on one of the number plates of one of the cars in the film. Always make a backup. Edit: She was a project manager who had to work from home, and the numberplate was actually "Rm Rf" in reference to the notorious line of code that did it.
@mrsharpie7899 · 11 months ago
I don't remember if it was the day-saving employee's initials, or RM-RF that was on the license plate
@alimanski7941 · 11 months ago
It was Toy Story 2, and the easter egg was in Toy Story 4, where the license plate had "rm rf" in it
@ScruffyNZ. · 11 months ago
they fired that person recently
@atulyadav3197 · 11 months ago
@@ScruffyNZ. Yes, I heard this too
@GoatzombieBubba · 10 months ago
@@ScruffyNZ. That person should be happy to not work for a woke company like Disney.
@jarrod752 · 1 year ago
_Luckily team 1 took a snapshot 6 hours before..._ This happened to me. I copied a client's database to my development environment about 2 hours before they accidentally wiped it. They called our company explaining what happened, and word got around that I had a copy. Our company looked like a hero that day, and I got a bunch of credit for good luck.
@abelkibebe577 · 1 year ago
You are a Legend :)
@mipmipmipmipmip · 1 year ago
I think this was how most of Toy Story was saved. It's also bad security practice :)
@ilyasziani5504 · 1 year ago
@@mipmipmipmipmip Why is it bad security practice?
@amyx231 · 1 year ago
And now you routinely copy the client database every 24 hours?
@jarrod752 · 1 year ago
@@amyx231 Actually, due to the nature of my current work, I have a script I run on demand, roughly every few days as needed, that takes a snapshot. I usually get around to deleting everything more than a month old about twice a year, or whenever my dev server starts complaining about space.
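An on-demand snapshot script like the one described above can be sketched roughly as follows. This is a minimal illustration, not the commenter's actual script: the database name, backup directory, and 30-day retention are all assumptions.

```shell
#!/usr/bin/env bash
# Sketch: dump a Postgres database to a timestamped, compressed file,
# then prune snapshots older than the retention window.

snapshot_db() {
    local db="$1" dir="$2" keep_days="${3:-30}"
    mkdir -p "$dir" || return 1
    local stamp
    stamp=$(date +%Y%m%d_%H%M%S)
    # pg_dump writes SQL to stdout; compress it into the snapshot file
    pg_dump "$db" | gzip > "$dir/${db}_${stamp}.sql.gz" || return 1
    # delete snapshots older than keep_days
    find "$dir" -name "${db}_*.sql.gz" -mtime +"$keep_days" -delete
}
```

Usage would look like `snapshot_db clientdb /srv/backups 30`, run by hand or from cron. The timestamp in the filename keeps repeated runs from overwriting each other.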
@maxcohn3228 · 1 year ago
Something my first boss taught me (when I broke something big in production in my first few weeks) is that postmortems are for identifying problems in a system and how to prevent them, not for assigning blame to individuals. This is huge. Making sure to identify why it was even possible for something like this to happen, and how to prevent it in the future, is a great way to handle a postmortem like this. Good on the GitLab team.
@lhpl · 1 year ago
Good boss. Bad ones often like it when things are done fast and "efficiently", and when that establishes a culture of unsafe practices, things will go fine, maybe for a long time. Then one day a human error occurs. Typically, such a boss will blame the person who "did" it, even if the cause was the unsound culture. If as an employee you try to work safely, you get criticised for being slow and inefficient (and you technically are).
@FireWyvern870 · 1 year ago
Yeah, things like this are a problem of the system, not the fault of the operators.
@honkhonk8009 · 1 year ago
You only fire people for their character, not for the inevitable fuckup. Also, you've basically sunk money into training this dude through that fuckup, so sacking him right after you paid to give him that experience is counterproductive.
@gownerjones1450 · 1 year ago
Also very cool that they did it completely in public even with livestreams. This will hopefully help other companies avoid mistakes like that.
@FlabbyTabby · 1 year ago
Depends. Many times, it's used as on opportunity to kick out people they consider undesirable, even if they're great employees.
@CryShana · 1 year ago
When I was still a junior developer at a startup, I was working on a PHP online store. Every time we upgraded the site, we would first do it on staging, then copy it over to production. The whole process was kind of annoying, as there was no streamlined upgrade flow yet and no documentation anywhere; it was a relatively new project we had taken over. I had upgraded it before, so I knew what to do, and I just did the thing I always did.

I was close to finishing, and we had an office meeting coming up soon with lunch afterwards, so I wanted to be done before that, and I rushed a bit. When I was copying files to production, I overlooked something: I had also copied the staging config file (which contained database access info, etc.) to the production location, overwriting the production config.

After the copying finished, thinking I was finally done, I relaxed and prepared for the meeting. As I was closing everything, I refreshed the production site just to see if it worked. And then I realized... Articles weren't appearing, images weren't loading, errors everywhere. Initially I didn't believe this was production at all, probably just localhost or something, RIGHT?? After re-refreshing and confirming I had actually broken production, panic set in.

Instead of informing anyone, I quietly moved closer to my computer and started looking for what was wrong, with 100% focus; I don't think I was ever as focused as I was then. I didn't have time to inform anyone, it would only cause unnecessary delays. I had to restore the site ASAP. I remember sweating... the meeting was starting, and I remember colleagues asking "if I was coming", and I just blurted "ye ye, just checking some things..." completely "calmly" while PANICKING to fix the site as soon as possible.

Luckily I found the source of the mistake within a minute, recovered a backup config file, and everything was fixed, followed by a huge sigh of relief. The site must have been down for only around two minutes. No one ever noticed what I had done, and I joined the meeting as if nothing had happened, even though I was sweating and breathing quickly to calm myself down. That was a long time ago, and I still remember that panic very well. Now I always make sure I have quick recovery options available in case something goes wrong, and where possible I automate the upgrade process to minimize human error.
@valdimer11 · 2 months ago
Well done. Having made mistakes like that, I can completely understand how you were feeling in that moment and how your brain just went "in the zone". It's only ever happened to me twice but I will NEVER forget them.
@yt-sh · 8 days ago
Good lessons, thank you
@vjndr32 · 6 days ago
Mann, we all have our fair share of breaking production.
@obanjespirit2895 · 5 days ago
I did something similar, but while testing on what I thought was the dev server. I'd had some close calls before, but this time I fcked up. I was super high, but I was always high, so I doubt that was it. I quickly had to go and undo the changes, but I was so shook that I made a Chrome extension that puts up some graphics and ominous 40k Mechanicus music whenever I go on a live domain. Haven't made the same mistake since.
@JeffThePoustman · 1 year ago
Ugh, felt that "he slammed CTRL+C harder than he ever had before" (3:55). The only thing worse than deleting your own data is deleting everyone else's. In this case the poor guy kinda did both. Great story arc.
@ic6406 · 4 months ago
Yeah, I guess it was the most stressful moment of his life, realizing what he'd done. I think he had a huge blackout.
@ludoviclagouardette7020 · 1 year ago
The rule I apply for backups is that no one should be connected to both a backup server and a primary at the same time; two people should work together. The employee who was logged into both DBs should really have been two physically separated employees.
@act.13.41 · 1 year ago
That is an excellent rule.
@refuzion1314 · 1 year ago
Yes, but in the case that only one employee is available and he has to connect to both, he should either use different color schemes for the different servers OR do it all in one shell window, disconnecting and reconnecting to whichever server he has to edit. That way it's a lot harder to execute commands on the wrong server.
@thoriumbr · 1 year ago
I try to follow this rule myself. Every time I have to connect to a prod server to get anything, I disconnect as soon as I get the info before getting back to the test/dev server window.
@thoriumbr · 1 year ago
@@refuzion1314 Different color schemes look good but don't work during an outage, when you are stressed, exhausted, or distracted. Sounds nice, but the mental load during a crisis is too high to pay attention to that.
@onemprod · 1 year ago
I can't tell you how easy it is to accidentally overwrite the wrong file. While I was working on a test machine with a USB stick plugged in to save my progress, I saved the script, thought I had saved it in the local directory, and then copied the unmodified script over the version I had just saved to the USB stick...
@gosnooky · 1 year ago
Imagine for a moment, that you're that guy. That feeling of pure dread and the adrenaline rush immediately after the realization of what you've just done. We've all felt it at some point.
@omniphage9391 · 10 months ago
At my first job, I got a 2 AM call: in my first two weeks at the company, I had accidentally left a process in prod shut down after maintenance, leading to intensive-care patient data not making it into connected systems. Looking back, the entire company was set up super amateurishly, yet they operate in several hospitals in my country.
@PixelSlayer247 · 10 months ago
Having exited my game without being sure I saved my progress before, this is very relatable.
@thephlophers · 10 months ago
the onosecond
@stacilynn604 · 10 months ago
like hitting a car in a parking lot 😵
@ashesagainst7236 · 9 months ago
At my second IT job I accidentally truncated an important table in the prod DB. The stress was immense but we identified a ton of issues and the team was pretty supportive. My boss ended up begging upper management to get us a backup server but they determined it wasn't important enough. The company went belly-up a few years later because of a ransomware attack they couldn't recover from.
@mxbx307 · 1 year ago
There is an awful lot that could be learned from this.
1) "Soft delete" first: use mv to rename the data, e.g. MyData to something like MyData_old or MyData_backup, or just mv it out of the way so you can restore it later if needed. Don't just rm -rf it from orbit.
2) Script all your changes. Everything you need to do should be wrapped in a peer-reviewed script, and you just run the script, so that only the pre-agreed actions get done. Do not go off piste; do not just SSH into prod boxes and start flinging arbitrary commands around.
3) Change control, as above.
4) If you have server A and server B, you should NOT have both shell sessions open on the same machine. Either use a separate machine entirely or, better still, get a buddy to log onto server A from their end while you get on server B from yours. Total separation.
5) Do not ever just su into root. Use sudo, or a carefully managed solution such as CyberArk, to get the root creds when needed.
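The "soft delete" in point 1 can be sketched as a tiny helper. The function name and naming scheme are illustrative; the timestamp suffix keeps repeated soft deletes from colliding, and the new name is printed so it can be moved back if the delete was a mistake.

```shell
#!/usr/bin/env bash
# Soft delete sketch: rename with a timestamp instead of rm -rf,
# so the data can be restored instantly if the wrong thing was moved.

soft_delete() {
    local target="$1"
    [ -e "$target" ] || { echo "no such path: $target" >&2; return 1; }
    local stamp
    stamp=$(date +%Y%m%d_%H%M%S)
    mv -- "$target" "${target}_old_${stamp}" || return 1
    echo "${target}_old_${stamp}"   # print the new name for easy restore
}
```

Once everything demonstrably works without the renamed copy, it can be removed for real in a later, calmer step.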
@magicmulder · 7 months ago
Also, for (2): never try to "improve" anything during the actual operation. I once prepared a massive Oracle migration that I had timed to take about 3 hours; preparation took three weeks. As I was watching the export script for the first schema during the actual migration, I thought "why not run two export jobs concurrently, it'll save some time". Yeah, it made the whole thing slow to a crawl, so it ended up taking 6 hours. Boss was furious. So no, never try to "improve" during the actual operation, no matter how big you think your original oversight was.
@lashlarue7924 · 7 months ago
100%, upvoted.
@xpusostomos · 7 months ago
I religiously never delete anything
@thedemolitionsexpertsledge5552 · 7 months ago
I have no idea what any of this means but I feel like this is bad
@alvinbontuyan8083 · 4 months ago
Fucking up catastrophically with Bash commands is a canon event. It is religion for me to always copy a file/directory to "xxx.bak" before doing anything sensitive
@TheDrTrouble · 1 year ago
The best practice is to rename the directory or file to something else first. I don't know how developers stay so calm when using deletion commands.
@setasan · 11 months ago
Well, when you live in a poor country, underpaid by a fucking contractor company, with an overloaded team... shit happens.
@schwingedeshaehers · 10 months ago
I "deleted" one of my own programs with the cp command: I wanted to copy the config and the main file into a subdirectory, but forgot to add the directory after them, so it wrote the config over the main file. I could recover an older version of the file from the SD card by manually reading the raw contents of that region, since the filesystem doesn't overwrite a file in place but writes it to a new location.
@Funnywargamesman · 10 months ago
On a home system? Absolutely. In a working environment? Doubtful. Maybe it would be acceptable at a small company, but creating an orphan database that may or may not contain sensitive information, with no one in charge of it, or worse, no one who KNOWS ABOUT it, would be awful. God help you if it contains financial, medical, or government records.
@AndrewARitz · 10 months ago
@@Funnywargamesman You don't create it to keep around forever; you create it as a failsafe while you're doing potentially dangerous stuff, like deleting a whole database.
@Funnywargamesman · 10 months ago
@@AndrewARitz I cannot tell you how many times "temporary" things have become permanent on purpose, let alone the times people have said they were going to do something, like deleting a temp database they copied locally because their permissions didn't let them use it remotely, and then proceeded to forget to delete it. This will be especially true with the most sensitive databases: "because it's more important, so we should make a copy first, right?" Security is everyone's job, and if you do (typically) irresponsible things like copying databases "as a failsafe", chances are you will form a habit and do it with a sensitive database. If you think YOU won't do it, that's fine, but assuming you are of average intelligence, remember that 50% of people are dumber than you, and some of them get REAL dumb. If policy allows it, THEY will do it. This is exactly why I said home environments and really tiny companies could be different; there it could be fine. Chances are, if you don't know the name of every single person in your company off the top of your head, it is too large to be that lax with data protection and management. Take it or leave it; it's my opinion.
@randomgeocacher · 1 year ago
A helpful hack is to set the production terminal to red and the test terminal to blue, or something like that. Just a small helper to avoid human f'ups when you need to run manual commands on sensitive systems.
@tacokoneko · 1 year ago
I second this. I also use colors to differentiate multiple environments.
@vaisakhkm783 · 1 year ago
Changing the prompt color is easy, but it makes a huge difference.
@Wampa842 · 1 year ago
I use colored bash prompts to differentiate machine roles: my work PC uses a green scheme, non-production and testing servers use blue, backups use orange, and production servers use yellow letters on a red background. It's very hard to miss.
@darrionwhitfield46 · 1 year ago
I use oh-my-posh with different themes
@iUUkk · 1 year ago
Both database servers were actually used in production.
@helmchen1239 · 1 year ago
I once accidentally ran chmod -R 0777 /var because I missed a dot before the slash (in a web project with a /var folder), which, as I have now learned, can make a Unix system totally unresponsive. I can very well understand how it feels the moment you realize what you have just done. It cost us a few hundred euros and kept two technicians busy for a weekend afternoon. Lessons learned; today we can laugh about it.
@Darkk6969 · 1 year ago
Yeah, Unix/Linux will do what you tell it to do without any warnings. Pretty sure you sat there wondering why the command was taking so long before you realized your mistake. Right there is the "Oh Shit" moment. 😀 Lucky for me, I use VMs, so I can always revert to a previous snapshot.
@desoroxxx · 1 year ago
the onosecond
@parlor3115 · 1 year ago
@@Darkk6969 What if you ran it on the host?
@FurriousFox · 1 year ago
@@parlor3115 he doesn't, Noah only runs things in virtualized environments, making snapshots every minute
@aarondewindt · 1 year ago
Why does it make it unresponsive? I accidentally chmod 0777'd the entire "/" once, and well, I had to start again from scratch. Thankfully I was just creating a custom Ubuntu image with some preinstalled software for one of my professors, so it only cost me time. Still, I never figured out why opening up the permissions would lock everything up.
@Dairunt1 · 1 year ago
One of my most stressful moments as a software designer was when I accidentally broke a test environment right before a meeting with our client. I managed to get the project running on a second test environment, but that really taught me the importance of backups and of telling the rest of the staff about a problem ASAP.
@christopherg2347 · 1 year ago
If you are working with multiple shells, VMs, remote sessions, or the like, make sure they are color-coded based on the machine you are running against! It can be as simple as picking a different color scheme in Windows. It is just too easy to mess up when all the visual difference is a single number somewhere in the header.
@neekfenwick · 4 months ago
Yep, I came here to say this. For any serious system I connect to, I use different params for my session; in my case I like old-fashioned xterm, something like: alias u@s="xterm -fg white -bg '#073f00' -e 'ssh user@server'". It's very useful to see the green, red, blue etc. colouring and be sure which system you're talking to.
@Kalmaro4152 · 3 months ago
It's very nice that Linux shells actually support setting session colors
@GanerRL · 1 year ago
Imagine mass-flagging some employee's account to mess with them and managing to bring down the entire site by proxy.
@batorerdyniev9805 · 1 year ago
What
@hypenheimer · 1 year ago
Bot
@GanerRL · 1 year ago
@@hypenheimer beep boop
@Jacob-ABCXYZ · 1 year ago
How to take down a site, the stealthy way
@kulled · 1 year ago
@@hypenheimer nah. it was probably a minecraft shorts bot account before he bought it though.
@LordHonkInc · 1 year ago
"rm -rf" is one of those commands I have huge respect for, because it reminds me of looking down the barrel of a gun (or any similar example of your choosing). Best case, you do it a) seldom, b) after a lot of strict, practiced checks, and c) only when there's no alternative. Unfortunately, the worst case is when you _think_ you're in that best-case scenario.
@givenfool6169 · 1 year ago
I sourced my bash history like an idiot about a week ago. I have so many cd's and "rm -rf ./"'s and other awful things in there. I somehow got lucky: I hadn't used sudo in that terminal, so it stopped at a sudo password check before running anything truly hellish; just a bunch of cd's and some commands that require a sourced environment to execute. Super lucky. I could have wiped out everything, because just a couple of commands after that was an "rm -rf ./", and it had already cd'd into root.
@henningerhenningstone691 · 1 year ago
@@givenfool6169 Lmao, it had never once occurred to me what havoc accidentally sourcing the bash history could wreak, since it had never occurred to me that it's even possible (because why the hell would you?!). But of course it is. What an eye opener!
@givenfool6169 · 1 year ago
@@henningerhenningstone691 Yeah, I was trying to source my updated .bashrc, but my tab completion is set up to cycle through anything that starts with whatever's been typed (it even ignores case), so I tabbed and hit enter. Big mistake. I guess this is why the default completion requires you to type out the rest of the file name when there are multiple potential completions.
@Shadowserpant00 · 1 year ago
@@henningerhenningstone691 bro idk wtf you're talking about and it's scaring me
@oliverford5367 · 1 year ago
Do ll first, make sure you actually want to delete that directory, then press up and change ll to rm.
@robbybankston4238 · 1 month ago
I'm glad they didn't fire the engineer. It shows the difference in mindset at organizations that treat something like this as a learning experience (albeit an expensive one). Many corporations would have fired the engineer as soon as the issue was resolved, without hesitation. Credit to the orgs that care about their team members and are more concerned with lessons learned.
@Tmccreight25Gaming · 2 months ago
Ultimate workplace comeback: "At least I've never nuked the entire database"
@usellstech-ip2sg · 1 month ago
Better to have someone who knows what to do, than someone who has never experienced it
@reyynerp · 12 hours ago
they work remotely
@matthias916 · 1 year ago
I once accidentally deleted 2000 rows in one of my company's production databases. Everything was restored 5 minutes later, but it felt so bad; I can't imagine what deleting an entire database would feel like.
@marco56702 · 1 year ago
terrible, sending the queries makes you shiver
@varunkhadse5869 · 9 months ago
I guess the panic was next-level, because both DBs were deleted.
@Rncko · 8 months ago
It feels like lighting a torch onto a sea of currency bank notes... that belongs to the company. (and company is just about to release year end bonus)
@Atulnavadiya · 7 months ago
I have good hands-on experience with SQL databases at my company, but I'd check my query at least 10 times before executing it... we had clients' data from more than 10 years saved in the database.
@TrevoltIV · 7 months ago
@@marco56702Right, I’m always quadruple checking every query to make sure my retarded ass didn’t type delete * or something
@MechMK1 · 1 year ago
For this reason, all our servers have color-coded prompts. Dev/Testing servers are green. Staging is yellow. Prod is bright red. When you enter a shell, you immediately see if you are on a server that is "safe" to mess around with, or not. The advantage to doing this in addition to naming your server something like "am03pddb", is that you don't have to consciously read anything. Doesn't matter if you accidentally SSH into the wrong server. If you meant to SSH into a "safe" server, then the bright red prompt will alert you that you are on prod. And if you meant to SSH into a prod server, then you better take the time to read which server it actually is.
@tacokoneko · 1 year ago
i agree except there are only so many colors, so if manually controlling a lot of different machines (something that could maybe be avoided depending on what the servers do) i believe it's important to use unique memorable hostnames. the two servers in this story had hostnames 1 character apart and the same length, unless the names were all changed for the artwork
@seedmole · 1 year ago
@@tacokoneko Yeah like imagine if those two characters were visually similar ones, like any combo of 3, 5, 6 and 8. Fatigued eyes could easily misleadingly "confirm" that you're on the right one when you're not.
@makuru_dd3662 · 1 year ago
Also, don't ever, ever work on the live database directly, a lesson I have learned the hard way many times on my own.
@MunyuShizumi · 1 year ago
@@makuru_dd3662 That statement makes no sense. No matter how critical a system is, you'll have to perform some kind of maintenance at least semi-regularly.
@makuru_dd3662 · 1 year ago
@@MunyuShizumi You make a backup first. Yes, you need to maintain it, but not by making massive untested changes.
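A minimal sketch of the color-coded-prompt idea discussed in this thread, for each environment's ~/.bashrc; the hostname patterns are hypothetical and would need to match your actual fleet's naming scheme:

```shell
# Pick the prompt color by hostname so prod is unmistakably red.
case "$(hostname)" in
  *prod*)  PS1='\[\e[41m\][PROD]\[\e[0m\] \u@\h:\w\$ '  ;;  # red background
  *stage*) PS1='\[\e[43m\][STAGE]\[\e[0m\] \u@\h:\w\$ ' ;;  # yellow background
  *)       PS1='\[\e[42m\][DEV]\[\e[0m\] \u@\h:\w\$ '   ;;  # green background
esac
```

The `\[ \]` markers tell bash the escape sequences are zero-width, so line editing doesn't misalign; the point, as the comment above says, is that the color registers even when you're too tired to read the hostname.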
@ChosenOne-wz6km · 1 year ago
This video is awesome! The step by step analysis of what occurred during the outage coupled with the story telling format helped me learn some things I didn't know about database recovery procedures. Please make more videos in this format!
@theultimatetrashman887 · 1 year ago
the realization of what you're doing before it finishes is so cruel and happens so often. That's why when you're doing a job, you always do it slowly but correctly
@xmorse · 1 year ago
The real problem here is that you can delete any user data by simply mass reporting him
@technicolourmyles · 1 year ago
I'm seeing a lot of serious problems here... I guess this is why I never heard of GitLab before.
@PatalJunior · 1 year ago
I highly doubt it's instantly deleted; probably someone made the decision to delete it (it could just be an account spamming a bunch of mess onto repositories, and that isn't good either).
@FighteroftheNightman · 11 months ago
​@@technicolourmylesthey're literally the 2nd largest enterprise git solution provider in the world.
@nonamepasserbya6658 · 11 months ago
When in doubt, it's probably 4chan. That low-hanging fruit aside, it's not a good thing if someone can just do that with a bot account. Maybe granting employees special anti-report protection could help until they find a more permanent solution against those trolls.
@Webmage101 · 11 months ago
​@@PatalJunior6:21 literally says they fucked up by not making it check the details before deletion
@build-things · 1 year ago
As an engineer for a large company you got me in the feels talking about asking for help or posting a pr and then seeing all the mistakes you made😊
@stingrae789 · 1 year ago
In my previous position I worked closely with one guy and we used to joke about how we were using each other as a rubber duck :D.
@EChan-eu2co · 1 year ago
The buzzword is SRE and postmortems are supposed to be blameless now...
@jillfizzard1018 · 1 year ago
This is why you first mark the PR as a draft and read over the changes one more time before marking it as ready.
@mortache · 1 year ago
@@stingrae789 Damn I didn't know this thing has a name! I legit have done this before while discussing weird math problems
@ErikPelyukhno · 10 months ago
Your editing is phenomenal. What an insane series of events 😂 Glad gitlab was able to get back to running, seeing all that public documentation was refreshing to see since it shows they were being transparent about their continued mistakes and their recovery process.
@jamesrosemary2932 · 1 year ago
A long time ago we implemented a policy that absolutely nobody operates the production console alone. There always has to be someone else looking over your shoulder to point out oversights like the one in the video.
@HazySkies · 1 year ago
"Slams Ctrl+C harder than he ever had before" As a relatively new linux user, I felt that one.
@ss-to7ii · 9 months ago
As a new Linux user use the "-i" flag for "interactive" when using rm and a couple other commands.
@KR-tk8fe · 6 months ago
As a windows user, I was very confused
@LC-uh8if · 5 months ago
@@KR-tk8fe CTRL+C. On most Unix/Linux based CLIs, this combination aborts whatever command you were running. Technically, it sends a SIGINT (Interrupt) to the foreground process (active program), which usually causes the program to terminate, though it can be programmed to handle it differently. Its basically, the Oh Shit or This is taking too long button.
@MrCmon113 · 2 months ago
​@@LC-uh8ifIsn't that the same in Windows terminals? 🤔
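The -i suggestion earlier in this thread is usually wired in as shell aliases; a sketch (some distros ship this as the default for root):

```shell
# Prompt before every destructive file operation. When you really mean it,
# bypass the alias deliberately with a leading backslash (\rm) or -f.
alias rm='rm -i'
alias cp='cp -i'
alias mv='mv -i'
```

This is a speed bump, not a seatbelt: scripts don't use interactive aliases, and `rm -rf` passes -f, which suppresses -i. It mostly protects you from fat-fingering a one-off command.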
@DomskiPlays · 1 year ago
Our prod server has no staging environment or anything like that. I've asked the DB admin if the data and schema are safe in case someone accidentally deletes everything, and they told me everything is backed up daily. Kinda scared that I don't know how or where this is happening except for a job.
@indyalx · 1 year ago
I checked my database backup script a couple days ago and noticed it hadn't backed up in 5 days O_O I SLAMMED the manual backup immediately. Then went and fixed the issue and made sure it would notify if there was no backup in 6 hours.
@CMDRSweeper · 1 year ago
The next question is... "Have you tested the backups?" If they can't say for sure WHEN they were tested... Be very afraid...
@indyalx · 1 year ago
@@CMDRSweeper we load the prod backup into staging nightly
@forbiddenera · 1 year ago
6 hour full backups, mirroring/replicas, multiple servers and daily volume backups..
@robertbeisert3315 · 1 year ago
"Trust me, bro" only works in Dev. Every other environment needs regular verification.
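The freshness alert described earlier in this thread can be a small cron-able check; the directory, file pattern, and alert hook below are all placeholders:

```shell
# Shout if no backup file is newer than 6 hours (360 minutes).
BACKUP_DIR=/tmp/backup_demo
mkdir -p "$BACKUP_DIR" && touch "$BACKUP_DIR/db.dump"   # demo data only
if find "$BACKUP_DIR" -name '*.dump' -mmin -360 | grep -q .; then
  echo "backup fresh"
else
  echo "ALERT: no backup in the last 6 hours"   # wire in mail/pager here
fi
```

Checking for the presence of a *recent* file, rather than for the absence of errors, is what catches the GitLab-style failure mode where the backup job silently stops producing anything at all.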
@danusminimus9557 · 1 year ago
Seen your video history and the evolution of your videos - this format is amazing and you're really good at it :D
@jfbeam · 1 year ago
The #1 thing I learned WAY EARLY on in my IT career (three decades): Never delete anything you can't _immediately_ put back. Never do anything you can't undo. Instead of deleting the data directory, _rename_ it. If you're on the wrong system, that can easily be fixed. (and on a live db server, that alone will be enough of a mess to clean up.) As for backups, if you aren't actively checking that (a) they've run, (b) they've completed successfully, and (c) they're actually usable... well, this is the shit you end up in. (The fact they're actively hiding ("lying") about this fiasco should be criminal.)
@kurenaigames5357 · 1 year ago
yea renaming is the key. first rename, then set up everything, and then delete the renamed folder like a few months later.
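The rename-first approach from this thread in sketch form; the path is illustrative, not GitLab's actual data directory:

```shell
# Park the directory under a timestamped name instead of rm -rf.
# Undo is a single mv back; real deletion can wait weeks.
dir=/tmp/pg_data_demo
mkdir -p "$dir"
parked="${dir}.trash.$(date +%Y%m%d%H%M%S)"
mv "$dir" "$parked"
echo "parked as $parked"
```

On the same filesystem, mv is a metadata-only rename, so it's also instant even for a 300 GB data directory, unlike the delete it replaces.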
@jhyland87 · 1 year ago
A few places I worked at as a Linux admin or engineer, the shell prompts (PS1) were color-coded. Green was dev, yellow was QA, and red meant you're in prod. Worked like a charm.
@blackbot7113 · 8 months ago
Yeah, that's the way I do it as well, just the other way round (red being test). Extends to the UI as well - if the theme is red, you're on the test instance of Jira, not the real one.
@jhyland87 · 8 months ago
@@blackbot7113 Yeah, it's a very wise thing to do imo. Currently, I work at a bank, and I recommended we have the header in the UI of the colleague and customer portal be different colors for lower environments, as well as the PS1 prompt on the servers. And I kinda got snickered at and got a reply along the lines of "How about we just pay attention to the server and page we're on?" It's crazy, because it's such an easy change to implement and almost entirely prevents anyone making such silly (yet catastrophic) mistakes. Edit: I make the PS1 prompt for my own user on the servers different colors, but that only helps so much since I sudo into other service users (or root). Additionally, we "rehydrate" the servers every couple of months, which means they get re-provisioned/deployed, so any of those settings get wiped out entirely. For it to be permanent, it needs to be added to the Dockerfile.
@daigennki · 1 year ago
Awesome work on the video!! I love the editing being both funny and straight to the point, and your narration is easy to understand too. You seriously deserve more attention.
@rishavmasih9450 · 1 year ago
Oh God my heart started sinking when you said he noticed the shell he was running the command in.
@minsiam · 9 months ago
When I was just starting in a company, I accidentally deleted all the ticket intervals from the database. Causing all the tickets to close immediately and make some massive spam to the admins. I was really terrified of the situation and didn't know what to do, we didn't have any backup as well. I apologized as much as I can and didn't make another mistake like this again in years, sometimes mistakes make you work harder and be more careful in life.
@karmatraining · 1 year ago
An old best practice that so many people these days seem to forget or never have heard about is that every week, you try to pull a random file from your backup system, whatever that is. (Or systems, in this case). You will learn SO MUCH about how horribly your backups are structured by doing this - so many people think they set up good backup systems but never continuously test them in any way, and then they get big surprises (like the GitLab team) when they do need to fall back on them.
@matthewstott3493 · 1 year ago
Testing to verify backups, replication, failover and the like is absolutely critical. As new scenarios occur, having a feedback loop to update the plan is key. It's a continuous process that most shops have learned the hard way. It is boring and tedious but if you don't test you will experience catastrophic consequences.
@-TheBugLord · 1 year ago
Exactly. Just like a dam, if there is a weak-point at the bottom, it all may come crumbling down. There needs to be a lot of redundancy when it comes to backups. Especially when it comes to a big server. An engineer accidentally removing a database should not have that catastrophic of consequences.
@esa4573 · 1 year ago
Yeah, the general rule is/should exist for having to be ready for stuff like that. If your fuckup is non-recoverable or a massive pain, you did something wrong. I'm sure a lot of companies are practically "trained" for when someone yeets the whole database or service.
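A toy version of the weekly spot-check described at the top of this thread; a real drill restores into a scratch database, but the shape is the same (paths and files are placeholders):

```shell
# Pull one file from the backup set and compare it to the live copy.
live=/tmp/live_demo; bak=/tmp/bak_demo
mkdir -p "$live" "$bak"
echo "hello" > "$live/a.txt" && cp "$live/a.txt" "$bak/a.txt"   # demo data
pick=$(ls "$bak" | head -n 1)   # a real drill would pick at random (shuf -n 1)
if cmp -s "$bak/$pick" "$live/$pick"; then
  echo "restore check OK: $pick"
else
  echo "ALERT: backup of $pick does not match live"
fi
```

The point isn't the comparison itself; it's that the drill exercises the whole restore path every week, so a broken pipeline surfaces long before you need it at 3 AM.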
@sortebill · 10 months ago
your content is really good, please keep up making these mini documentaries about tech failures!
@swaggy3987 · 7 months ago
What's far more impressive about this whole situation is how calm the engineers were in handling the situation. That to me is far more valuable than having engineers that are too gun-shy to make prod db changes at 12AM and panic when something goes wrong.
@WackoMcGoose · 1 year ago
As a former Amazonian (only QA for the now-ended Scout program, sadly), I read quite a few cautionary tales on the internal wiki about Wrong Window Syndrome. Sometimes, not even color-coded terminals and "break-glass protocols" (setting certain Very Spicy commands to only be usable if a second user grants the first user a time-limited permission via LDAP groups) is enough to save you from porking a prod database.
@Skyline_NTR · 1 year ago
This interests me. Got any resources/links to set that up (dangerous commands temporarily allowed by time-limited permissions via LDAP)
@WackoMcGoose · 1 year ago
@@Skyline_NTR Afraid not, it was several pay grades above me both in job role and in coding knowledge, and I lost access to the company slack back in december so I can't really ask anyone...
@ProgrammingP123 · 11 months ago
@@WackoMcGoose Ahh were you laid off also??? I was lol
@WackoMcGoose · 11 months ago
@@ProgrammingP123 Yup, they disbanded the entire Scout division and then put a company-wide hiring freeze a month later so I had no hope of transferring...
@wojtekpolska1013 · 1 year ago
respect for not firing the guy, it was obviously just a small mistake, and it wasn't his fault that the backups didn't work. it shouldn't be possible for 1 command to completely delete everything in the first place. Good that they didn't just use him as a scapegoat :p
@yerpderp6800 · 1 year ago
If they fired him they would just reintroduce the possibility of the same thing happening again in the future. I'm pretty sure the old employee will be paranoid for a loooong time and will double-check from now on lol. An expensive lesson but a lesson nonetheless.
@tuxie93 · 1 year ago
Yep and he'll train new employees making super sure to emphasize triple checking before deleting from prod.
@D00000T · 9 months ago
That’s Unix systems for you. Their open nature makes them super useful for a lot of things but it’s also so easy to break them. Plus that old trick of telling new linux users that sudo rm -rf is a cool easter egg command wouldn’t be the same with more safeties and preventions.
@BitTheByte · 9 months ago
What if I want to delete everything? I don’t want a baby proofed OS. I want an OS that does what I want. Even if I want to burn it all
@wojtekpolska1013 · 9 months ago
@@BitTheByte why buy a computer at that point lol
@chrisfung443 · 1 month ago
Lucky that he found the data in a manual snapshot instead of a backup. Can't imagine how admin-1 was feeling in that moment.
@derpnerpwerp · 1 year ago
This reminds me of all the times I have been in the wrong ssh session just before doing something that would have been pretty bad. I set up custom PS1 prompts to tell me exactly what environment, cluster, etc. I am in, and even colorize them accordingly, but the problem is you start to just ignore them after a while. It's also kinda dangerous when stuff that is manual and potentially damaging becomes fairly routine.
@streetchronicles5693 · 1 year ago
Yesterday I was added to a support team because we are getting a lot of tickets from users not waiting long enough for a service to load and closing the connection early. I died laughing from this story.
@Simone-uu8ne · 1 year ago
All things aside, that wasn't that bad. Yeah, they weren't operational for 24h, but it made many other companies reexamine their fault management. For example, my uni professor told us about this incident so we could grasp the importance of backups and testing.
@gblargg · 1 year ago
I think the biggest issue was losing 6 hours of commits and comments.
@kookie-py · 1 year ago
@@gblargg people will cope
@gblargg · 1 year ago
@@kookie-py Agreed, virtually all of them will have the commits locally as well. Just noting that the data loss is a bigger deal than mere downtime.
@kookie-py · 1 year ago
@@gblargg right
@_Titanium_ · 1 year ago
This is why programming in general is great, nobody dies if you fuck up. (Obvious exceptions, medical, aviation etc)
@Dobaspl · 4 months ago
Even before I started working in one company, one IT specialist deleted the directories of the new CC-supporting system. This was shortly after its implementation into production. Worse still, it turned out that the backup process was not working properly. For a week, the team responsible for programming this system practically stayed at work, recreating the environment almost from scratch. :D
@felixbluwox · 1 year ago
One idea to help prevent this is setting up the ssh sessions so each one has a different fore/background color. Let's say the prod machine has a green foreground and the backup is blue, and make it standard for everyone working with the terminals; that way it'll be harder to get confused between the two. You can even have multiple ssh terminals and assign each one a different foreground color.
@TonytheCapeGuy · 1 year ago
I can just imagine the relief that team felt when they find SOMETHING that they could use to restore files.
@CarrotCastle · 1 year ago
One of my first jobs in IT was working as a big data admin and this video allows me to re-live the spicy moments of that job but with none of the responsibility attached
@ChandravijayAgrawal · 4 months ago
One thing I learned from all this is to never run a delete command casually, and if you do, paste a screenshot of the command in your group chat before running it.
@markh3684 · 1 year ago
Mistakes in the moment happen. I'm focusing more on the "we thought things were working as expected" parts. The backup process familiarity, backups not going to S3, Postgres version mismatches, insufficient WALs space, alert email failures, diligence on abuse deletes... These were all things that could have been and should have been caught way before the actual incident.
@hummel6364 · 1 year ago
In my vocational school I had a subject simply called "Databases", and our teacher there once told us a story about how one of his co-workers lost his job. In essence he did everything right: created his backups and backup scripts, and everything worked. At some point during the lifetime of the server this was running on, someone replaced a hard drive for whatever reason. This led to a change of the device UUID, which he had hard-coded into his backup script. When the main database failed a year or two later, they tried restoring from this backup only to find that there was none. It wasn't even really his fault; the only mistake he made was not implementing enough fail-safes. Nowadays we have it comparatively easy with all the automatic monitoring and notifications, but this was at least 30 years ago.
@thewhitefalcon8539 · 1 year ago
I guess that could have been solved by testing the backups. Install the database software on a spare server or just your own workstation, and then restore the backup onto it
@hummel6364 · 1 year ago
@@thewhitefalcon8539 well the backup ran properly for years, he just never thought that the UUID might change
@thewhitefalcon8539 · 1 year ago
@@hummel6364 I suppose as long as he's employed he should probably be checking the backup at least every couple months. Would I have remembered to do that? I dunno, but I'm not employed as a database admin.
@yerpderp6800 · 1 year ago
​@@hummel6364 yeah he kind of deserves to be fired...feel like it should be common sense the hdd could fail, no good excuse to not expect that. You should almost never hardcode stuff, not sure why they thought it was okay to hardcode the uuid of a drive that would one day fail.
@hummel6364 · 1 year ago
@@yerpderp6800 I think the idea was that the device might change from sdX to sdY when other drives are added or removed, so using the UUID was the only simple and safe way to do it.
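One way around the hard-coded-UUID trap from this thread is to verify the target at run time and fail loudly. `mountpoint` is from util-linux, and the path below is a placeholder:

```shell
# Refuse to "back up" into an unmounted directory, which would silently
# write onto the root filesystem instead of the backup drive.
check_backup_mount() {
  if mountpoint -q "$1"; then
    echo "mounted: $1"
  else
    echo "ABORT: $1 is not a mountpoint" >&2
    return 1
  fi
}
check_backup_mount /    # / is always a mountpoint, so this prints "mounted: /"
```

The check doesn't care whether the drive was swapped and the UUID changed; it only cares whether something is actually mounted where the backup is supposed to go.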
@joseaca1010 · 1 year ago
i cant even imagine the sheer terror Team Member 1 felt when he saw the db in which he ran the delete command
@CryptbloomEnjoyer · 8 months ago
I know the exact feeling of terror the moment you realize the command you just ran is about to cause havoc
@SteveAcomb · 1 year ago
Great video! Well produced content about software engineering war/horror stories are exactly what I’ve been looking for, keep it up!
@edc2186 · 1 year ago
As a dev for a large company who has been on a number of late night calls, I literally gasped at this. But good on the team to work through the issue, and good on management to keep these guys around
@k7y · 1 year ago
At my previous job anything done on production servers required a change request which takes about a week to get approved and complex commands had to be tested on Lab environment before they could be copy pasted to production server.
@AndreGreeff · 1 year ago
I must say, I heard many stories about this.. but that was a very nice summary of the nitty-gritty details, thank you. (:
@daryl9915 · 1 year ago
A couple of jobs ago, I had a colleague who managed to do worse than this. I think they were playing about with learning Terraform and managed to delete the entire account. Prod servers, databases, the dev/qa servers, disk images, even the backups. Luckily it was a smaller account hosting a handful of tiny trivial legacy sites, but even so, we didn't see them for the rest of the week after that mishap
@lashlarue7924 · 7 months ago
😱😱😱😱😱😱😱😱😱😱😱😱😱😱
@justdoityourself7134 · 1 year ago
Having a live screenshare with team members watching might seem a little wasteful. But for critical procedures like this, it is well worth the added cost.
@Navak_ · 10 months ago
Most people don't see the importance of such extreme level of caution until it's too late. It's like handling a firearm.
@shashankh7768 · 7 months ago
The story telling/edit is unmatched. Hands down best docu/short movie on youtube😂!
@TomSM5 · 7 months ago
Nice to hear that they didn't fire him. He followed the correct procedure; some of the behavior, like the lag caused by the command, was unknown and could have been avoided with clear documentation. Also, when people are tired late at night, mistakes do happen, and anyone can be the victim.
@bennythetiger6052 · 1 year ago
This video made me say "Oh... my... God..." way too many times 😂😂. Felt like some Chernobyl documentary about a bad sequence of actions. Love it! This is very insightful as to what things can take place in these types of environments, as well as what measures can prevent major fails like that. It's also super interesting to see that, no matter how perfect a software system is, humans will still find a way to screw it up 😂
@blazi_0 · 1 year ago
Bro, let's also not forget the damage had already been done. The server was down for like 18 hours, and thousands of PRs, comments, issues, and projects were all deleted permanently. This should be a bigger deal.
@mrsharpie7899 · 11 months ago
I'd love to see the USCSB do an animation on this incident lmao
@hchris96 · 1 year ago
I didn’t realize I would like these videos, but you are a good storyteller for production issues and I hope to see more in the future I am gonna share this with some of my coworkers
@jim2lane · 10 months ago
OMG, we have all been there haven't we? That awful, dreadful realization after deleting something that you shouldn't have. Mine was back in the days of manual code backups, before ALM tools were ubiquitous like today. I thought I had taken the last three days of code changes and overwritten the old backups that were no longer needed. And then I realized that I had done the exact opposite, and just deleted three complete days of coding - and would now have to recreate them from scratch 😒😭
@stevencoetzee1597 · 1 year ago
By far the most suspense I have felt during a dev story
@jeromesimms · 1 year ago
Wow! This was great and so interesting. I'm so glad I found this channel. I would love to hear more in depth analysis of software engineering fails
@johnthomas2970 · 1 year ago
This gives me good insight on why our tech team keeps breaking shit….
@eswarnichtsmehrfrei · 3 months ago
All my backup jobs have to report to an uptime service.
@tatsuuuuuu · 8 months ago
Linux actually can, in certain circumstances, "undo" this wild kind of situation. Having ZFS as the file system will allow you to revert to a previous image of the filesystem. It's like versioning, but for the entire file system. Of course it takes up quite a bit of space, so it's not done that often; software installs are automatic snapshot points, for instance. But you can trigger one manually when you think you're about to do something you're unsure about. (Since the selection of save states is at GRUB, yes, an unbootable system is still recoverable if you still have GRUB.)
@bmo3778 · 1 year ago
I barely understand anything here, but all I can say is massive thanks to the team who have worked hard, advancing our computer tech to the current state we have!
@jumbo_mumbo1441 · 1 year ago
Honestly the worst part of this was all the backup failures
@Qbe_Root · 18 days ago
The fact that running a raw rm command on the backup server was anywhere close to an expected procedure is also a problem, like you could make an alias for that specific rm command called "wipe_backup" or something, only on the backup server, so that trying to run the command in the wrong SSH window would just error out with "command not found". But anyway, good on them for not firing the guy and realizing such a catastrophic failure is never a single person's fault
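A sketch of the guard this comment describes; the hostnames and path are hypothetical, and the hostname is a parameter here so the branch logic is easy to exercise:

```shell
# The destructive command only does anything on the backup host;
# run anywhere else, it refuses and exits nonzero.
wipe_backup() {
  host="${1:-$(hostname)}"
  case "$host" in
    db2*|*backup*)
      echo "would run: rm -rf /var/opt/demo/backup_data on $host" ;;
    *)
      echo "refusing: $host is not the backup server" >&2
      return 1 ;;
  esac
}
wipe_backup db1.example.com || echo "(blocked, as intended)"
```

Run in the wrong SSH window, the command fails closed instead of deleting anything, which is exactly the property a raw `rm -rf` lacks.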
@thunderchild4816 · 1 year ago
This is my worst nightmare. I was on a release where a database administrator did something similar and he spent a good 10 minutes swearing. We had backups but it was a pain to find which one to use and get it setup.
@CharlesChacon · 1 year ago
I’m pretty sure this event only ended up affecting things like comments and issues, but not the actual git repositories themselves, which would have been a huge relief, I imagine. Still, this was one of the most interesting things I’ve ever followed and ended up motivating me to learn a ton about databases, cloud practices, devops, and everything-as-code culture. Thanks for providing such a great lesson, GL. And huge kudos to them for transparency
@MrB10N1CLE · 1 year ago
3:52 it was at this moment when the viewers collectively scream, transcending space-time and raising a cosmic choir of dread and regret.
@Emophiliac2 · 7 months ago
This reminds me of when I was working at another company's site back in the late 80s. One of the company's employees had logged into a VMS system and then connected to a Unix system from there. At some point, the employee accidentally started a 'rm -rf /' command. He realized the mistake and did a ^C. Unfortunately, all that did was log him out of the VMS system. The command continued to run on the Unix system until it finally removed something important enough so as to crash the system. It took a few days to restore their computer. And that also resulted in the removal of Admin privileges to most employees. Funnily enough, I was able to keep my Admin privilege, not being an employee.
@whoman0385 · 7 months ago
that's why you always use a shell that says "hey, you're rm -rf'ing something, you sure about that?"
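No standard shell does this out of the box, but a homegrown wrapper function is a common sketch of the idea:

```shell
# Ask before any recursive rm; the answer is read from stdin.
confirm_rm() {
  case " $* " in
    *" -rf "*|*" -fr "*|*" -r "*|*" -R "*)
      printf 'rm %s on host %s: proceed? [y/N] ' "$*" "$(hostname)"
      read -r ans
      [ "$ans" = "y" ] || { echo "aborted"; return 1; } ;;
  esac
  command rm "$@"
}
echo n | confirm_rm -rf /tmp/does_not_matter || true   # anything but "y" aborts
```

Printing the hostname in the prompt is the useful part for an incident like this one: the confirmation doubles as a last-second "which server am I on?" check.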
@Rametesaima · 1 year ago
I've always been paranoid when working in Prod. Always make it a point to have at least the Ops Lead on a screen-sharing session where I show what I'm doing while requesting affirmative acknowledgement of each step before proceeding. It's annoying. It's slow. But boy ohh boy does it make me feel safer.
@isaiahsmith6016 · 1 year ago
It may be slow but look at it this way. You're probably saving a lot more time in the long run by preventing something horrible from happening in the first place.
@Socsob · 1 year ago
This is so cool to know the inner workings of a team like this
@dany2685 · 8 months ago
I am working as a bank programmer and we have two important servers in production. One is in sync with the main one. If the main one is broken or something does not work properly we change it to the other one. Also they have many ways to backup like multiple storage units and maximum security of who has access to data. We had some issues on testing platform where a guy accidentally deleted the database but we had backup in less than 30 min made by our sys admin guy. We did not ever have any tragic issue on production.
@sarsaparillasunset3873 · 1 year ago
lol, that's why for the production database CLI, I set the background color of my terminal to red to make sure I don't confuse it with the staging environment. But I also use Amazon RDS and let it manage backup for me, because I know setting up that stuff is a PITA, particularly because you'll need to test the replication, failover, the rollback logs, the snapshotting, to make sure all your config actually works.
@blank001 · 1 year ago
One strict rule I always follow when connecting to prod servers via ssh or a DB UI agent (pgAdmin) is to always use different background colors: red for prod, green for staging, black for test and local, plus double-checking every command. You can never be sure enough.
@GeorgeTsiros · 1 year ago
I like how the terminal has the decoration of some linux-y windowmanager, but the message boxes are winXP xD
@cc3 · 9 months ago
I deleted the main site from our backend in my first month as a full stack developer. Fortunately i figured out how to rebuild the apache server and clone the repository but i definitely worked well past my hours that day and the stress was crazy
@loupassakischristos9758 · 1 year ago
I experienced something similar a couple years ago, it's the kind of thing that you think only can happen to others but yeah... I had to delete some specific data from the production database, I created the sql requests and executed them to the testing environment. The dataset between those databases is completely different, and the requests passed without any issue. But when I passed them to production they were taking way too long and then I realised. I almost had a panic attack. I reported the incident immediately and was mentally prepared to be fired. Fortunately we could retrieve most data from a backup and the lost ones were not that big of an issue. I still work in the same company :p
@jonix24mejor · 1 year ago
And yes... this is exactly the reason why I didn't study programming / engineering in college, and instead opted for graphic design / communication. if I write or design something wrong and it gets published, well, at worst the publication stays published as a reminder of my mistake, in programming all it takes is one finger mistake, misremembering something or just a simple distraction and you can absolutely wipe an entire company's network infrastructure out of existence.
@malborboss
@malborboss a year ago
We need more videos like this one. This was amazingly interesting
@JaxVideos
@JaxVideos 3 months ago
A long time ago, in our thriving software shop, on a corporate network of 50 or so SGI workstations and some heavier iron, a script's rm -rf line accidentally picked up a space character after one of the leading '/'s of a file name. As all disks were remotely mounted, this became a corporate-wide total deletion, after midnight, with the main server room locked tight.
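A hedged sketch of a sanity check that catches that exact failure mode: validate the target path before any recursive delete, refusing anything empty, anything that resolves to the filesystem root, and anything containing whitespace (the stray space that turns one shell argument into two). The function name and rules here are just an illustration, not a standard recipe.

```python
import os

def safe_rmtree_target(path: str) -> str:
    """Validate a path before handing it to a recursive delete; raise on anything scary."""
    if not path or path.isspace():
        raise ValueError("refusing to delete: empty path")
    if any(ch.isspace() for ch in path):
        # A space after '/' in a shell command splits into two arguments,
        # one of which may be '/' itself -- the disaster described above.
        raise ValueError(f"refusing to delete: whitespace in {path!r}")
    normalized = os.path.normpath(path)
    if normalized == os.sep:
        raise ValueError("refusing to delete filesystem root")
    return normalized
```

In a script you would call this on the variable right before `shutil.rmtree(...)` (or before building an `rm -rf` command line), so a corrupted or concatenated path fails loudly instead of deleting everything it can reach.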
@MPSmaruj
@MPSmaruj 10 months ago
Also, one thing I used to scoff at when I was a newbie was assigning names as aliases to your servers: actual words instead of numbers. It seemed a little asinine to me at first, but even in this scenario, it's much easier to confuse db1 and db2 than, e.g., amelie and bertrand.
@hououinkyouma2426
@hououinkyouma2426 a year ago
Can't wait for part 2
@kevinfaang
@kevinfaang a year ago
I could just be missing the sarcasm, but if you're referring to the ending, Google Bard isn't exactly the best at being factually accurate...
@Xanhast
@Xanhast a year ago
@@kevinfaang maybe he's being ominous :o
@eboubaker3722
@eboubaker3722 a year ago
Wow, the amount of stuff I learned here is huge. Please make more reviews like these! I subscribed and turned on notifications, please don't disappoint me
@MichaelJordan-hi4ed
@MichaelJordan-hi4ed 11 months ago
This genuinely made my day.
@NexusGamingRadical
@NexusGamingRadical a year ago
The only tech lead I've had so far told me almost exactly the same story after I got stressed out over breaking some layout on our production website. It instantly de-stressed me and was a great intro to software engineering :D
@iTsBadboyJay
@iTsBadboyJay a year ago
Absolute nightmare. Loved every minute of this