DESeq2 Basics Explained | Differential Gene Expression Analysis

DESeq2 Basics Explained | Differential Gene Expression Analysis | Bioinformatics 101

Views: 79,409

A basic task in the analysis of count data from RNA-seq is the detection of differentially expressed genes. DESeq2 is one of the most commonly used R packages for differential gene expression analysis. In this video, I have tried to explain the DESeq2 model and provide some intuition about what goes on behind this package and the steps performed to call differentially expressed genes.
I have tried my best to keep it simple and explain it to the best of my knowledge. Please feel free to leave your comments below. I am happy to hear your thoughts, as well as any links to articles/blogs/papers that you think explain these concepts better! Let's use this space to share resources and learn more!
Here are some resources that helped me to understand some of these concepts:
1. bioconductor.org/packages/rele...
2. genomebiology.biomedcentral.c...
3. www.biostars.org/p/278684/
4. www.biostars.org/p/316488/
5. uclouvain-cbio.github.io/WSBI....
Chapters
0:00 Intro
0:29 A typical study design
1:32 Features of RNA-Seq counts data
3:04 Poisson distribution for counts data
5:14 Why is Poisson not the best model?
6:58 Negative Binomial is the way to go!
8:46 DESeq2 steps
9:32 Biases in counts data
12:29 Estimate Size Factor (median of ratios method)
16:37 Estimate Dispersions
20:00 Generalized Linear Models
24:21 Hypothesis testing
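The Poisson and Negative Binomial chapters come down to how the variance relates to the mean. Here is a minimal sketch in plain Python (not DESeq2 code): under Poisson the variance equals the mean, while the Negative Binomial, as parameterized in DESeq2, adds a term governed by the dispersion α. The mean counts and the dispersion of 0.1 are made-up values for illustration.

```python
def poisson_variance(mu):
    """Poisson: the variance equals the mean."""
    return mu


def nb_variance(mu, alpha):
    """Negative Binomial (DESeq2 parameterization): var = mu + alpha * mu^2."""
    return mu + alpha * mu ** 2


# Hypothetical mean counts and a hypothetical dispersion of 0.1.
for mu in (10, 100, 1000):
    print(mu, poisson_variance(mu), nb_variance(mu, alpha=0.1))
```

The gap between the two variances grows with the mean, which is why Poisson tends to underestimate the variability of highly expressed genes.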
Show your support and encouragement by buying me a coffee:
www.buymeacoffee.com/bioinfor...
To get in touch:
Website: bioinformagician.org/
Github: github.com/kpatel427
Email: khushbu_p@hotmail.com
#bioinformagician #bioinformatics #deseq2 #differentialgeneexpressionanalysis #rnaseq #fpkm #rpkm #tpm #normalization #rna #ncbi #genomics #beginners #tutorial #howto #omics #research #biology #ngs

Comments: 103

@wesleyeliasbheringbarrios8108 2 years ago

Based on the videos I've seen from your channel, everything is really great! Everything we bioinformatics beginners need most: theory explained without complication, in a way that is easy to understand, and guiding us step by step through the process with a list of videos in logical order. I just have to thank you for all your commitment and 100% accurate work on these videos, PLEASE continue! I will credit you in several of my presentations, thank you very much!

@Bioinformagician 2 years ago

I really appreciate your kind words; they really encourage me to keep doing this :) Thank you very much!

@kitdordkhar4964 2 years ago

You are a great teacher! I enjoy watching the detailed theory and analysis. Your tutorial is very helpful. Cheers!

@BarcodeIIIlIIllIlll 1 year ago

I am writing a thesis that is partially reliant on bioinformatics and have no experience with deseq2. This video was immensely helpful in getting me up to speed in general understanding. Thank you very much!

@alicekao6305 1 year ago

Thank you so much! This is very clear. I like your series of videos on the logic behind each bioinformatics package. Coming from a biology background, I think it's extremely important to know the basics of each package and identify the best tool to use when I get my data.

@chrisspeed8432 1 year ago

This was incredibly helpful. I plan to watch it again and take detailed notes along the way. Thank You!

@coolalexpcs 4 months ago

You explain this in a very clear and logical way! Appreciate it

@amus21455 1 year ago

It is superrrrr helpful!!!!!!! This is the best video about DESeq for someone with zero background like me!

@kmrsongh 1 year ago

Really very helpful video tutorial. I appreciate the effort you made in explaining the DESeq2 background statistics. You explain them perfectly and in a very simple manner. Very helpful for us. Thanks a lot and keep sharing such informative videos.

@QAKS1264 2 years ago

Very helpful, clear and accurate explanation. Thank you.

@andydavidson3097 5 months ago

Very well done! I watched lots of videos on DESeq2, and nobody else explains the underlying math!

@IndigoIndustrial 1 year ago

Very impressive. More scientists should be engaging the way you do.

@abhisheksawalkar1018 11 months ago

Simply excellent. Everything was explained using lucid examples. Very good for beginners.

@priyankabiotech87 2 years ago

U made it so simplified..loved ur explanation..thank you

@preeti97rox 1 year ago

Thank you for being so helpful to everyone!

@anvieb1293 1 year ago

This is such a valuable and informative video, thanks so much!

@user-up7ms2cs7m 2 months ago

Thank you! This was helpful! My study design was complex, as I was looking at 4 different conditions, with one reference level.

@devinjones7271 1 year ago

This is SO helpful! Thank you!!!

@bobyang8491 2 years ago

very helpful!! Thanks for teaching!

@chibrina 10 months ago

this is amazingly helpful as a beginner, thank you

@riaztabassum8395 1 year ago

very detailed and simplest explanation. 👌

@joseoviedo4529 1 year ago

Hello, I truly appreciate your videos and explanations. They are very clear and concise. I do have a request, though, for a future video. Could you do a how-to on gene set analysis using a GO class annotation, and on how to filter the desired genes from the completed DE analysis data frame? Thank you for all you do. Keep it up!

@reakal7740 2 years ago

Great video! Congrats!

@humphreygardner6982 2 months ago

Really superb! Thank you!

@grace-426 2 months ago

I am so thankful for your tutorials. Can you please make a video on how to manage so many genes and how to come to some conclusion after getting so many genes?

@aldaszarnauskas27 1 year ago

Great video and explanation!!!

@muhammadhafizsulaiman7163 2 years ago

Very smart person. Great explanation

@MahdiAbdul-Jabbar 2 months ago

Awesome video!

@KTROWS 4 months ago

Amazing job explaining.

@jamesrauschendorfer9396 2 years ago

This is super helpful!

@tushardhyani3931 2 years ago

Thank you for this video !!

@VenuraHerathPhotography 2 years ago

Keep up the good work! Would love to see a tutorial on edgeR time-series differential analysis.

@Bioinformagician 2 years ago

Will plan a video covering this. Thanks for the suggestion :)

@harshasatuluri4540 2 years ago

Very clear!

@islamalmsarrhad2152 8 months ago

That was epic.. Many thanks

@angelamoreira5023 2 years ago

Excellent!!!

@khalildabrat4593 2 years ago

Very helpful!

@pooriasalehi5402 11 months ago

really really thanks ma'am, it's amazing, I owe you.

@amirhosseinshafieian3951 2 years ago

Really love that. After watching lots of videos on KZfaq, I finally understood what's going on thanks to your video. The only part I could not understand was the MLE part; if it is feasible for you, please make a video to elaborate on it in more detail. Thanks a lot

@Bioinformagician 2 years ago

I will think about making a separate video explaining MLE. Thanks :)

@benjaminbergey5512 2 years ago

One of the clearest explanations of the DESeq pipeline - thank you. Question about using the GLM: Do you know where I might find an example of a calculation for a given gene? I had a bit of difficulty following through the calculations, and I think a concrete example (just with arbitrary data) might help me grasp it better.

@Bioinformagician 2 years ago

I am glad you found this video helpful! Check out this paper: www.ncbi.nlm.nih.gov/pmc/articles/PMC7873980/ It does a fantastic job explaining single and multi factor linear models with calculations.

@amrsalaheldinabdallahhammo663 2 years ago

Thanks for that video. You are a genius :)

@AA-gl1dr 2 years ago

excellent video.

@sarahnawaz6925 3 months ago

Amazing💯

@leia2636 2 years ago

wow that was magical

@aditimehta4886 2 years ago

Hey Khushbu, really nice explanation.😊

@a.k.nikson3987 24 days ago

Awesome !

@kobrarahimi9164 2 years ago

it was great 100 out of 100.

@1993dana15 2 years ago

crispy clear

@PriyaDas-zw5hn 1 year ago

Hi Dr. Khushbu, Thank you for the very informative videos. Learning a lot from these. I had a query: if we have a time series of treated and untreated samples, should the pairs of treated and untreated at each time point be considered separately for estimating size factors?

@abdourahamandjibotassiou4367 10 months ago

very nice

@rays_of_hopes 1 year ago

Thank you so much mam

@ayaqz3144 25 days ago

thank you

@ghadeeralkurdi174 2 years ago

Could I ask what ranges you used for the x and y axes in the mean vs. variance plot at 6:27?

@farihachaudhary577 1 year ago

Hi there, I just wanted to ask whether we can use DESeq analysis for unpaired data. I have 11 normal (control) samples and about 160 tumor samples. Or should we go with paired data?

@CarlMedriano 1 year ago

Thanks for this info. I am just a bit lost, especially when I try to calculate using gene D, which resulted in a GM of 0 and reference values of 0. Wouldn't the following steps result in 0 (assuming that values /0 are just set to 0)?

@georgyjogen2859 1 year ago

Hi, I really like your video. Thank you for the channel once again. It's a blessing. I have a small doubt. At 11:27 you said that since gene D is not expressed in the treated condition, the total of 42 from untreated needs to be divided among the 3 expressed genes, causing them to be inflated. How is that? Could you please explain? Thanks in advance
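One way to see the inflation asked about here is a toy simulation. This is a minimal sketch with hypothetical abundances, not the numbers from the video: both samples get the same number of reads, genes A to C have identical true abundance in both conditions, and gene D is switched off in the treated sample. The function name `expected_counts` and the depth of 1000 reads are assumptions for illustration.

```python
def expected_counts(abundance, depth):
    """Split a fixed number of reads across genes in proportion to abundance."""
    total = sum(abundance.values())
    return {gene: depth * a / total for gene, a in abundance.items()}


# Hypothetical true abundances: genes A-C are identical in both conditions,
# gene D is expressed only in the untreated sample.
untreated = {"A": 10, "B": 20, "C": 30, "D": 40}
treated = {"A": 10, "B": 20, "C": 30, "D": 0}

depth = 1000  # same sequencing depth for both samples (assumed)
counts_untreated = expected_counts(untreated, depth)
counts_treated = expected_counts(treated, depth)

for gene in untreated:
    print(gene, counts_untreated[gene], round(counts_treated[gene], 1))
```

Because the sequencer still produces the same number of reads, the reads that gene D would have taken are spread over A, B and C in the treated sample, so their counts rise even though their true expression did not change.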

@adrianozaghi9209 1 year ago

Thank you so much, the paper about this algorithm is complex asf

@NguyenThiPhuongLan-in5cd 1 year ago

Hi, may I ask: if we have n=3 biological replicates per group and 2 groups, how can we put them into 2 groups? Just calculate the mean of the read counts for each gene in each group?

@kevinradja 2 years ago

Really love your video and is inspiring me to also try making my own videos and test my knowledge. At 14:30 there's an error when you are estimating the size factor. The geometric mean is calculated by the mean of the natural log of the counts (ln because that is what DESeq2 uses). Taking the log turns the Pi symbol in the paper into a sigma of logs. Might also be good to mention that it isn't square root if you have more than two conditions. If I'm wrong though, someone please let me know!

@Bioinformagician 2 years ago

Thank you for pointing out that error. You are right, DESeq2 uses natural logs, and the geometric mean is the 1/nth power of the product of the n terms. I should have mentioned it. However, the two methods agree; the small difference below comes only from rounding the log sum to 2.99. Just for the explanation, I chose the multiplying method because it has fewer steps, which makes it easier to understand and gets the point across :) Geometric mean with log method: log(2) + log(10) = 2.99; 2.99/2 = 1.495; exp(1.495) (taking the antilog) = 4.459337. Geometric mean with multiply method: sqrt(2*10) = 4.472136

@kevinradja 2 years ago

That's a great point and shows why we take the log! With large outliers, averages of logs are less affected than regular averages, but they don't change much when the values are close. Also, do you plan on making a video on the dispersion in DESeq2 in more detail? There's so much more in the paper I didn't understand at all.

@Bioinformagician 2 years ago

@@kevinradja I will surely think about making a video on dispersion in more detail :)
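The geometric-mean comparison from this thread can be checked with a few lines of plain Python. This is a minimal sketch, with function names chosen here for illustration. For positive values the product form and the log form are the same quantity, so any gap between them comes from rounding intermediate results.

```python
import math


def geometric_mean_product(values):
    """n-th root of the product of n values."""
    return math.prod(values) ** (1 / len(values))


def geometric_mean_logs(values):
    """exp of the mean of natural logs; equivalent for positive values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


counts = [2, 10]  # the two counts used in the reply above
gm_product = geometric_mean_product(counts)
gm_logs = geometric_mean_logs(counts)
gm_rounded = math.exp(2.99 / 2)  # log sum rounded to 2.99, as in the reply

print(gm_product, gm_logs, gm_rounded)
```

The first two values agree to floating-point precision (about 4.472136), while the third reproduces the 4.459337 that results from rounding the log sum.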

@saranyasweet 1 year ago

Ma'am, could you please post videos on how to do DGE for raw 16S rDNA paired-end data in FASTQ format?

@juanete69 1 month ago

Before getting the counts... do we need to align our reads?

@emojiman745 2 years ago

I may have missed it, but what do we do with the replicates? You mentioned the replicates in the study design segment (00:38), but the calculations you display are about one group. Should we take the mean of the samples and make them into one column? One column for the treated (mean of the b1, b2 and b3 for t1 and b1, b2 and b3 for t2) and one column for the untreated (mean of the B1, B2 and B3 for T1 and B1, B2 and B3 for T2)?

@Bioinformagician 2 years ago

Apologies if I wasn't clear in my video, there are ways to handle technical replicates. Check this section out from DESeq2 vignette: bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#collapsing-technical-replicates With regards to biological replicates, you should NOT collapse biological replicates.

@poojasavla6240 1 year ago

bro i love you

@LongboardTrickfreak 2 years ago

I might be mistaken, but are you sure the values for calculating the median in step 3 (est. size factors) are correct? When I calculate them with R I get 0.45, for instance, for the normalization factor of untreated. Shouldn't the median be one of the values? Apart from that: great video, helped me a lot!

@Bioinformagician 2 years ago

Thanks for reaching out! I am sure, the median of values 0, 0.45, 0.55, 0.58 is 0.5. I calculated it using R as well.

@user-sf1ys2wl4k 1 year ago

Hey, the calculations in the video are correct. But maybe you were confused because those are medians, not means. In the case of 4 values, you have to take two values in the middle and then the average of them;) So we take 0.45 and 0.55 and get 0.50.
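The median rule described in this reply can be sketched as follows, using only the four values quoted in the thread. With an even number of values, the median is the average of the two middle ones, so it does not have to be one of the inputs.

```python
import statistics

ratios = [0, 0.45, 0.55, 0.58]  # values quoted in the reply above

ordered = sorted(ratios)
middle_two = ordered[1:3]            # even count: take the two middle values
manual_median = sum(middle_two) / 2  # and average them

print(manual_median, statistics.median(ratios))
```

The 0 in this list presumably comes from gene D. A later reply in the comments notes that such non-finite ratios are filtered out before the median is taken; with only 0.45, 0.55 and 0.58 left, the median would be 0.55.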

@georgeanthonywalters-marra9628 2 years ago

Hello, this was an awesome and very informative video! I've been trying to learn more about CRISPR screen analysis (specifically MAGeCK). Are you familiar at all with analysis of CRISPR screens and would you say that the concepts in this video would be transferable? Thank you so much!

@Bioinformagician 2 years ago

Unfortunately, I have not worked with CRISPR screen data before, so I am unable to answer whether these concepts are transferable.

@georgeanthonywalters-marra9628 2 years ago

@@Bioinformagician No problem!

@patticat 3 months ago

Is this the video where design factor was explained? I'm coming from another of your videos where you say "if you don't know design factor, look at my previous video" but you never said which one. I think this one was a good candidate, however, I am still very confused as to how to use the design factor.. that was x= 0 or x =1? or what was that when you added two conditions? I'm super lost with the last 2 seconds of explanation there.. if you have another video explaining this, which one is it? Thanks! Everything else is on point!

@clutch3171 5 months ago

this is secretly genius

@shetalkzz8842 3 months ago

Can I run DESeq2 in Galaxy to find differentially expressed miRNAs?

@nikitamaurya4518 17 days ago

I am confused between the normalization method explained in this video and the normalization method explained in another video [Difference between RPKM/FPKM and TPM | RNA-Seq Normalization Methods | Bioinformatics 101]. Which normalization is correct?

@you-mingliu3261 2 years ago

Great video, but I'm still confused about the dispersion α. For one gene, is α estimated separately in the control group and the treatment group (so there are 2 α values for one gene)? Or is there only one α for each gene, meaning the mean and the variance are calculated across the control and treatment groups?

@Bioinformagician 2 years ago

As far as my understanding goes, it is the latter. The mean and variance are calculated across all groups, so there is only one α for each gene.
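To make the "one α per gene" idea concrete, here is a rough method-of-moments sketch that inverts var = μ + αμ². It is not how DESeq2 estimates dispersion (DESeq2 uses a likelihood-based fit and then shrinks the estimates toward a trend). The function name and the counts are hypothetical, and the counts are assumed to be normalized replicates that share the same expected mean.

```python
import statistics


def moment_dispersion(normalized_counts):
    """Crude estimate of alpha from var = mu + alpha * mu^2."""
    mu = statistics.mean(normalized_counts)
    var = statistics.variance(normalized_counts)  # sample variance (n - 1)
    # Clamp at zero: the variance can fall below the mean by chance.
    return max((var - mu) / mu ** 2, 0.0)


# Hypothetical normalized counts for one gene across replicates.
gene_counts = [90, 100, 110, 130, 70]
print(moment_dispersion(gene_counts))
```

Here the mean is 100 and the sample variance is 500, so the estimate is (500 - 100) / 100² = 0.04.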

@wansabaiinjapan1586 7 months ago

Very excellent explanation. Thank you! I am very new to the field. I have questions about which values to use to make heatmaps, Venn diagrams, etc. At 15:49, we get the median of ratios and normalize our samples with this value to obtain norm_values for each gene in each sample. Before I use these values to plot a heatmap, do I need to transform them to log2 again? Or do I need to convert them to z-scores? If yes, how do I get a z-score for each gene in each sample? Sorry for asking so many questions. Thanks in advance!

@adaobiokafor9546 6 months ago

For visualizations, you need to scale (i.e. calculate z-scores). Just use the scale() function in R.
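What a per-gene z-score means can be sketched in plain Python. This is an illustration under assumptions, not a prescribed workflow: the normalized counts are hypothetical, and taking log2 with a pseudocount of 1 before scaling is one common choice rather than something stated in the thread.

```python
import math
import statistics


def row_zscores(values):
    """Center on the gene's mean and divide by its sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # n - 1 denominator, as R's scale() uses
    return [(v - mean) / sd for v in values]


# Hypothetical normalized counts: gene -> one value per sample.
normalized = {
    "geneA": [4.0, 8.0, 16.0, 32.0],
    "geneB": [100.0, 120.0, 90.0, 110.0],
}

for gene, counts in normalized.items():
    log_counts = [math.log2(c + 1) for c in counts]  # pseudocount of 1 avoids log2(0)
    print(gene, [round(z, 2) for z in row_zscores(log_counts)])
```

In R, scale() centers and scales columns, so for a genes-by-samples matrix the usual idiom is to transpose before and after, e.g. t(scale(t(mat))).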

@alexyang274 2 years ago

Question regarding the coefficients for fitting the linear model: from my understanding, based on this explanation, the linear model can theoretically accommodate an unlimited number of coefficients. In the DESeq2 vignette, Michael Love mentions that while DESeq2 can do this, it is perhaps easier to concatenate multiple factors into a single variable and have DESeq2 perform its linear modeling this way. Can you explain why this is the case? And how can this extend from a 2-factor design to an n-factor design and so forth?

@Bioinformagician 2 years ago

Can you point me to the section in the vignette where Michael Love talks about concatenating multiple factors into a single variable?

@alexyang274 2 years ago

@@Bioinformagician In the vignette, it is under the subheading "Interactions". Copied and pasted from the vignette, Love writes: "Initial note: Many users begin to add interaction terms to the design formula, when in fact a much simpler approach would give all the results tables that are desired. We will explain this approach first, because it is much simpler to perform. If the comparisons of interest are, for example, the effect of a condition for different sets of samples, a simpler approach than adding interaction terms explicitly to the design formula is to perform the following steps: (1) combine the factors of interest into a single factor with all combinations of the original factors; (2) change the design to include just this factor, e.g. ~ group. Using this design is similar to adding an interaction term, in that it models multiple condition effects which can be easily extracted with results."

@Bioinformagician 2 years ago

Thank you for pointing me to this. I want to bring in a little context here, since without it this can be misleading. I have tried to explain it here: khushbupatel.notion.site/Interaction-terms-DESeq2-5a4a75b83adc4fe89576e6ee9b00daf0 Hope this clears your confusion and answers your question. Thanks! :)
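The "combine the factors into a single factor" step quoted from the vignette can be sketched as plain label construction. This is an illustration only: the sample sheet, the factor names `genotype` and `condition`, and the underscore separator are hypothetical, and the code does not call DESeq2.

```python
# Hypothetical sample sheet with two factors per sample.
samples = [
    {"sample": "s1", "genotype": "WT", "condition": "untreated"},
    {"sample": "s2", "genotype": "WT", "condition": "treated"},
    {"sample": "s3", "genotype": "KO", "condition": "untreated"},
    {"sample": "s4", "genotype": "KO", "condition": "treated"},
]

# Combine the two factors into one label per sample.
for row in samples:
    row["group"] = f"{row['genotype']}_{row['condition']}"

groups = sorted({row["group"] for row in samples})
print(groups)
```

Each combination of the original factors becomes one level of the new factor, which is what lets a design of the form ~ group compare any two combinations directly. With more factors, the label simply gains more parts.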

@justsoil15 1 year ago

I use Docker and the command line to run DESeq2. How do I save plots to PNG files?

@user-uq7gw5ll5r 4 months ago

Ma'am, can you help me analyse an RNA sequence database using the DESeq2 tool, please?

@user-zc9jl2to3h 1 year ago

At 22:53, why do you say that "y - B0 = log(y) - log(B0)"? Isn't that incorrect?
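The comment alone does not show exactly what was on the slide, so the following is only a sketch of the standard relationship. It assumes a single-factor design where x is a 0/1 indicator (0 for untreated, 1 for treated) and the log2 link that DESeq2 uses for the expected normalized count q. The subtraction happens on the log scale, so the coefficient is a log2 fold change rather than a difference of raw values.

```latex
\begin{align*}
\log_2 q &= \beta_0 + \beta_1 x, \qquad x \in \{0, 1\} \\
x = 0 \text{ (untreated)}: \quad \log_2 q_{\text{untreated}} &= \beta_0 \\
x = 1 \text{ (treated)}: \quad \log_2 q_{\text{treated}} &= \beta_0 + \beta_1 \\
\beta_1 &= \log_2 q_{\text{treated}} - \log_2 q_{\text{untreated}}
         = \log_2 \frac{q_{\text{treated}}}{q_{\text{untreated}}}
\end{align*}
```

Read this way, the intercept already is a log value, so subtracting it from another log value is consistent. An identity between a raw difference and a difference of logs would not hold in general.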

@jatinderchera1613 1 year ago

Hello mam. Your video is very helpful especially for beginners like me. I have some queries and I would be very grateful if you can help me out. We got RNAseq done from a company and they have provided us with analyzed data. My queries are : 1. They have provided PCA plot and they have mentioned the following, "DESeq2 generates PCA plot based on a matrix of normalized read counts,the result typically depends only on the few most strongly expressed transcripts because of showing largest absolute differences between control and treated samples." The plot they provided showed very high variance among the biological replicates of one treatment group (due to lower read count in some samples). Is there any way to get around this by considering some other features (apart from read counts) to compute variances ? 2. They have also provided RPKM values of various genes that are unique to specific treatment groups. I observed some of the genes had 'zero' reads in some of the replicates of the same treatment group. Can we consider these genes for our analyses ? 3. I also observed completely identical RPKM values for many genes in the list (identical even upto 9 decimal places). What could be the reason for this and can we proceed with the analyses of such genes ? Any help from your side would be highly appreciated. 😊

@Bioinformagician 1 year ago

1. Do you happen to know how low the read counts are among the biological replicates of that one treatment group? You could perhaps take a look at the pre-alignment and post-alignment QC, especially the total number of reads and the total number of uniquely mapped reads for each sample. Another way to identify noisy/problematic samples is to use a distance matrix to get similarities or dissimilarities across samples. 2. You could get total counts for genes across all samples and see if these genes with 0 reads have consistently low read counts across other samples as well. We would ideally want to remove genes with fewer than 10 total read counts across all samples. You could be more stringent and set a higher number. 3. This seems suspicious. I would recommend generating the RPKM/TPM values yourself.

@jatinderchera1613 1 year ago

Thank you very much for your response mam. I am very new to such data types. I am learning everything from scratch so I will try my best to carry out whatever you suggested.
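The low-count filter suggested in point 2 of the reply can be sketched in a few lines. The gene names and counts are hypothetical; the threshold of 10 total reads is the one mentioned in the reply.

```python
# Hypothetical raw counts: gene -> counts per sample.
counts = {
    "geneA": [0, 2, 1, 0],
    "geneB": [15, 22, 9, 30],
    "geneC": [3, 4, 2, 1],
}

MIN_TOTAL = 10  # threshold from the reply above; raise it to be more stringent

# Keep genes whose counts summed over all samples reach the threshold.
kept = {gene: row for gene, row in counts.items() if sum(row) >= MIN_TOTAL}
print(sorted(kept))
```

Genes with fewer than 10 reads in total are dropped, while a gene with exactly 10 is kept.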

@relaxstation600 1 year ago

13:40 step1

@donklike09 1 year ago

Awesome! But how is 2/0.5 = 4.016...? Isn't it just 4? (16:14) And the same goes for the other numbers from the untreated.

@Bioinformagician 1 year ago

You’re right. The discrepancy is due to rounding. If you don’t round the numbers, you would get 4.016 instead of 4.
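The rounding effect can be reproduced from the two numbers in the comment. This is a sketch under one assumption: the only rounded quantity is the displayed size factor, so the unrounded value can be backed out from 2 / 4.016.

```python
count = 2
reported = 4.016  # normalized value shown in the video, per the comment

# Assumption: the only rounding is in the displayed size factor.
implied_size_factor = count / reported  # about 0.498
displayed_size_factor = round(implied_size_factor, 1)

print(implied_size_factor, displayed_size_factor, count / displayed_size_factor)
```

Dividing by the displayed 0.5 gives 4, while dividing by the unrounded value of about 0.498 gives 4.016.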

@snekhai 2 years ago

When you normalize counts, and have 0/0 (your sample D), why do you assign 0?

@Bioinformagician 2 years ago

In step 1, to calculate the geometric mean, we take the square root of the product of the counts in all samples. For sample D, the product is 30 x 0 = 0. The square root of 0 is 0. Hence 0.

@pgresner 2 years ago

Yes, but then, in Step 2, you divide 30/0 (which is infinity) and even 0/0 (which is undefined), so why do you get 0's for untreated/ref and treated/ref? Is this some kind of convention or just a mistake?

@Bioinformagician 2 years ago

@@pgresner It’s a mistake. They should be Inf instead of 0s. I didn’t mention a very important point, non-finite values (i.e Inf, -Inf and NaN) are filtered out and not used to calculate the median. Thank you for pointing it out, I shall put a note about this in the description.
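Putting the correction from this thread together, here is a minimal sketch of the median-of-ratios idea with the zero-count gene excluded. It is written in plain Python and follows DESeq2's approach only as I understand it (log ratios to the per-gene geometric mean, median per sample, non-finite values dropped). The function name `size_factors` and the counts are hypothetical, not the table from the video.

```python
import math
import statistics


def size_factors(counts):
    """Median-of-ratios size factors; counts maps gene -> list of counts per sample."""
    n_samples = len(next(iter(counts.values())))
    # Log geometric mean per gene; genes with any zero are skipped because
    # log(0) is -inf and their ratios would be non-finite.
    log_geo_means = {
        gene: sum(math.log(c) for c in row) / len(row)
        for gene, row in counts.items()
        if all(c > 0 for c in row)
    }
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(counts[g][j]) - lgm for g, lgm in log_geo_means.items()]
        factors.append(math.exp(statistics.median(log_ratios)))
    return factors


# Hypothetical counts (untreated, treated); gene D has a zero in one sample.
counts = {
    "A": [10, 20],
    "B": [30, 60],
    "C": [50, 100],
    "D": [30, 0],
}
print(size_factors(counts))
```

In this toy table the treated sample has exactly twice the counts of the untreated one for genes A to C, so the two size factors differ by a factor of 2, and gene D has no influence on either.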