Everything You Need To Master Actor Critic Methods | Tensorflow 2 Tutorial

46,967 views

Machine Learning with Phil

3 years ago

In this brief tutorial you'll learn the fundamentals of deep reinforcement learning and the basic concepts behind actor critic methods. We'll cover the Markov decision process, the agent's policy, reward discounting and why it's necessary, and the actor critic algorithm. We'll then implement an actor critic agent in TensorFlow 2 to handle the CartPole environment from the OpenAI Gym.
Actor critic methods form the basis for more advanced algorithms such as deep deterministic policy gradients, soft actor critic, and twin delayed deep deterministic policy gradients, among others.
You can find the code for this video here:
github.com/philtabor/KZfaq-...
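For orientation, here is a minimal, self-contained sketch of the kind of agent the video builds: a single network with a value head and a softmax policy head, actions sampled from a TensorFlow Probability categorical distribution, and a one-step TD update inside a GradientTape. It is not the repository's exact code; the layer sizes, learning rate, episode count, and the classic Gym API (4-tuple step return) are illustrative assumptions.

```python
import gym
import tensorflow as tf
import tensorflow_probability as tfp
from tensorflow import keras


class ActorCritic(keras.Model):
    """Shared body with two heads: state value V(s) and policy pi(a|s)."""
    def __init__(self, n_actions, fc1_dims=1024, fc2_dims=512):
        super().__init__()
        self.fc1 = keras.layers.Dense(fc1_dims, activation='relu')
        self.fc2 = keras.layers.Dense(fc2_dims, activation='relu')
        self.v = keras.layers.Dense(1, activation=None)
        self.pi = keras.layers.Dense(n_actions, activation='softmax')

    def call(self, state):
        x = self.fc2(self.fc1(state))
        return self.v(x), self.pi(x)


class Agent:
    def __init__(self, n_actions, alpha=3e-4, gamma=0.99):
        self.gamma = gamma
        self.actor_critic = ActorCritic(n_actions)
        self.optimizer = keras.optimizers.Adam(learning_rate=alpha)

    def choose_action(self, observation):
        state = tf.convert_to_tensor([observation], dtype=tf.float32)
        _, probs = self.actor_critic(state)
        dist = tfp.distributions.Categorical(probs=probs)
        return int(dist.sample()[0])

    def learn(self, observation, action, reward, observation_, done):
        state = tf.convert_to_tensor([observation], dtype=tf.float32)
        state_ = tf.convert_to_tensor([observation_], dtype=tf.float32)
        with tf.GradientTape() as tape:
            state_value, probs = self.actor_critic(state)
            state_value_, _ = self.actor_critic(state_)
            state_value = tf.squeeze(state_value)
            state_value_ = tf.squeeze(state_value_)
            # One-step TD error; the terminal state is given a value of zero.
            delta = reward + self.gamma * state_value_ * (1 - int(done)) - state_value
            dist = tfp.distributions.Categorical(probs=probs)
            log_prob = dist.log_prob(action)
            actor_loss = -log_prob * delta      # policy gradient term
            critic_loss = delta ** 2            # value regression term
            total_loss = actor_loss + critic_loss
        grads = tape.gradient(total_loss, self.actor_critic.trainable_variables)
        self.optimizer.apply_gradients(zip(grads, self.actor_critic.trainable_variables))


if __name__ == '__main__':
    env = gym.make('CartPole-v0')
    agent = Agent(n_actions=env.action_space.n)
    for episode in range(1800):
        observation, done, score = env.reset(), False, 0
        while not done:
            action = agent.choose_action(observation)
            observation_, reward, done, info = env.step(action)
            agent.learn(observation, action, reward, observation_, done)
            observation, score = observation_, score + reward
        print(f'episode {episode}  score {score:.1f}')
```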
Learn how to turn deep reinforcement learning papers into code:
Get instant access to all of my courses, including the new Prioritized Experience Replay course, with my subscription service. $29 a month gets you 42 hours of instructional content plus future updates, added monthly.
Discounts available for Udemy students (enrolled longer than 30 days). Just send an email to sales@neuralnet.ai
www.neuralnet.ai/courses
Or, pickup my Udemy courses here:
Deep Q Learning:
www.udemy.com/course/deep-q-l...
Actor Critic Methods:
www.udemy.com/course/actor-cr...
Curiosity Driven Deep Reinforcement Learning:
www.udemy.com/course/curiosit...
Natural Language Processing from First Principles:
www.udemy.com/course/natural-...
Reinforcement Learning Fundamentals:
www.manning.com/livevideo/rei...
Here are some books / courses I recommend (affiliate links):
Grokking Deep Learning in Motion: bit.ly/3fXHy8W
Grokking Deep Learning: bit.ly/3yJ14gT
Grokking Deep Reinforcement Learning: bit.ly/2VNAXql
Come hang out on Discord here:
/ discord
Need personalized tutoring? Help on a programming project? Shoot me an email! phil@neuralnet.ai
Website: www.neuralnet.ai
Github: github.com/philtabor
Twitter: / mlwithphil

Comments: 77
@MachineLearningwithPhil 3 years ago
This content is sponsored by my Udemy courses. Level up your skills by learning to turn papers into code. See the links in the description.
@youssefmaghrebi6963 2 months ago
What mattered was the explanation of those little details that everyone ignores, because they simply don't understand them the way you do, so thanks a lot.
@georgesantiago4871 3 years ago
Your videos have defogged all these concepts for me. Thank you so much!!!
@gabrielvalentim197 1 year ago
Thank you for your videos Phil. They are very informative and help me to understand more and more about this content!
@MachineLearningwithPhil 1 year ago
Glad to be of service, Gabriel
@portiseremacunix 3 years ago
Thanks! Saved and will watch later.
@softerseltzer 3 years ago
Very clear and nice explanation, thank you!
@fernandadelatorre7724 3 years ago
You are so so great! Saving up to buy your courses, your videos have been so helpful :)
@MachineLearningwithPhil 3 years ago
Thank you Maria.
@Falconoo7383 1 year ago
Thank you, Dr Phil.
@hameddamirchi 3 years ago
Thanks, Dr. Phil. I think it would be a good idea, in addition to showing the results in the command line, to show the environment render after the model learns.
@ronmedina429 3 years ago
Thanks for the content Dr. Phil. :-)
@MachineLearningwithPhil 3 years ago
Thanks for watching
@rahulrahul7966 2 years ago
Hi Phil, thanks for the video. Can you please explain how the score improves as the iterations progress even though we are sampling the actions randomly?
@hamidrezamirtaheri5414 3 years ago
Would it be possible to label these precious lectures with a kind of sequential index (per topic) as you enrich them, so that someone coming to them has an idea of where best to start and how to follow along? Many thanks for sharing your exceptional skills.
@ahmadalhilal9118 3 years ago
Very informative. Can we adjust the actor-critic functionality to decide the output (resulting from the softmax) and update the gradients accordingly? Since RL starts learning from scratch, I would like to use a heuristic's output as the final softmax output to speed up learning. Is that possible?
@fantashio 3 years ago
Great! Keep it up
@herbertk9266 3 years ago
Thank you sir
@user-lz6ud7yk1l 2 years ago
Your video helps!
@jahcane3711 3 years ago
Hey @Phil, I have been following along, loving the content. Now I'm wondering: on a scale of 0-1, what is the probability you will do a video on implementing CURL: Contrastive Unsupervised Representations for RL?
@SogaNightwalker 3 years ago
Can't watch now, but leaving a comment to get this video going :D
@alinouruzi5371 3 years ago
Thanks very much
@oussama6253 3 years ago
Thank you!
@MachineLearningwithPhil 3 years ago
Thanks for watching
@robertotomas 3 years ago
Hahaha, I feel like I'm in my AI/ML class. Every week's lecture discussion starts with everyone saying thank you 😀 it's awesome. I love this video so far, still watching, but it is amazingly clear. So I totally agree: thank you!
@DjThrill3r 2 years ago
Hey, I have a question. Do you have a source or literature where the concept that the value function and the policy both originate from the same network is explained, and why this is possible? Thanks.
@fastestwaydown 2 years ago
Really well made video, both from the theoretical standpoint and, coding-wise, super clear to understand. One small error: in the theoretical section you mixed up different notations for the reward. The most commonly used notation (also the one in the Sutton & Barto book you mention) indexes the reward by the next state, i.e. notation 1: S0, A0, R1, S1, A1, R2, ... In other literature it may be written as notation 2: S0, A0, R0, S1, A1, R1, ... At 7:15 you used notation 1 (and the sum notation is also slightly off: it needs to run from t to T-1, not from 0 to T-1, though you fixed it in the discounted version of the formula). At 12:24 and 13:24 you used notation 2 for the delta equation (it needs to be R_t+1 instead). I really loved the video, and I leave this comment to help clear up some of the confusion I had myself when studying these topics :)
@MachineLearningwithPhil 2 years ago
Thanks for the clarification Benjamin
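For readers following the exchange above, the Sutton & Barto convention it refers to (standard definitions, not transcribed from the video) writes the return and the one-step TD error as:

```latex
G_t = \sum_{k=t}^{T-1} \gamma^{\,k-t} R_{k+1},
\qquad
\delta_t = R_{t+1} + \gamma\, V(S_{t+1}) - V(S_t)
```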
@tarifcemay3823 2 years ago
Hi Phil. I thought prob_ratio must equal one if we replay the same action, since the actor is updated after the replay. Am I right?
@lichking1362 7 months ago
Hi, can we use this method for decision making too?
@elhouarizohier3824 1 year ago
How would you use this method in the context of reinforcement learning from human preferences?
@Corpsecreate 3 years ago
Hey Phil. For some reason, when I use this actor critic method (or REINFORCE) in a poker environment (Texas hold'em), it always learns to fold with 100% probability. If I use a dueling DQN approach, it works correctly, plays the stronger hands, and folds the weaker ones. It seems that I am running into a local optimum (since rewards are negative when you bet and only positive at the end of the episode if you win) where folding always has the maximum reward on the first timestep (0 instead of some negative number). I am using a gamma of 0.999. Would you have any idea what's going on here?
@papersandchill 2 years ago
You need a better exploration strategy. PG methods are on-policy, which means there is a higher tendency to get stuck in local minima.
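One common way to act on that advice, sketched here as an assumption rather than anything shown in the video, is to add an entropy bonus to the actor loss so the policy is penalized for collapsing onto a single action such as always folding:

```python
import tensorflow as tf
import tensorflow_probability as tfp


def actor_loss_with_entropy(probs, action, delta, entropy_coef=0.01):
    """Policy-gradient loss plus an entropy bonus that discourages premature collapse."""
    dist = tfp.distributions.Categorical(probs=probs)
    log_prob = dist.log_prob(action)
    # Higher entropy (more exploration) lowers the loss; entropy_coef sets the trade-off.
    return -log_prob * delta - entropy_coef * dist.entropy()
```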
@MrArv83 1 year ago
Video time 6:04: for two flips, do we need to multiply by 2? E(2 flips) will still be 0, since 0 x 2 = 0.
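For concreteness, assuming the coin-flip example pays +1 for heads and -1 for tails (an assumption about the example, not a quote from the video), linearity of expectation gives:

```latex
\mathbb{E}[\text{one flip}] = \tfrac{1}{2}(+1) + \tfrac{1}{2}(-1) = 0,
\qquad
\mathbb{E}[\text{two flips}] = 2 \cdot \mathbb{E}[\text{one flip}] = 0
```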
@SaurabDulal 3 years ago
I don't see action_space used anywhere in the code; don't we need it when sampling the action?
@sriharihumbarwadi5981 2 years ago
From the RL book by Sutton & Barto, the one-step actor-critic uses the semi-gradient method to update the critic network. That means state_value_, _ = self.actor_critic(state_) should not be included inside the GradientTape. This is confirmed by the pseudocode in Sutton & Barto, where w is updated as w = w + alpha*delta*grad(V(s, w)) (here V and w represent the critic network and its parameters, respectively). But if we include state_value_, _ = self.actor_critic(state_) inside the GradientTape, the update gains an additional grad(V(s', w)) term! (Here s' is the next state, i.e. state_ in the code.)
@MachineLearningwithPhil 2 years ago
Page 274. The delta term is proportional to the difference in the value function of successive states. Both gradients (actor and critic) have a delta term in them.
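For anyone who wants the strict semi-gradient update described in the comment, a minimal sketch (assuming a Keras critic model that returns V(s); not the video's code) is to stop gradients through the next-state value so that only grad(V(s, w)) enters the critic update:

```python
import tensorflow as tf


def semi_gradient_delta(critic, state, reward, state_, done, gamma=0.99):
    """One-step TD error where the target R + gamma * V(s') is held constant,
    matching the one-step actor-critic pseudocode in Sutton & Barto."""
    v = tf.squeeze(critic(state))
    v_ = tf.squeeze(critic(state_))
    target = reward + gamma * tf.stop_gradient(v_) * (1.0 - float(done))
    return target - v
```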
@ellenamori1549 3 years ago
Thank you for the tutorial. One question: in your application the agent learns after every step it takes in the environment. How about learning in a batch after each episode?
@MachineLearningwithPhil 3 years ago
Generally not the way it's done with actor critic. It's a temporal difference method, so it learns every time step. Policy Gradient is based on Monte Carlo methods and does what you described.
@ellenamori1549 3 years ago
@@MachineLearningwithPhil Thank you!
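In symbols (standard definitions, not from the video): actor-critic bootstraps from a one-step target at every time step, while REINFORCE-style policy gradient waits for the full Monte Carlo return at the end of the episode:

```latex
\text{TD target: } R_{t+1} + \gamma\, V(S_{t+1})
\qquad\text{vs.}\qquad
\text{MC return: } G_t = \sum_{k=t}^{T-1} \gamma^{\,k-t} R_{k+1}
```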
@Falconoo7383 1 year ago
AttributeError: module 'tensorflow' has no attribute 'contrib'. Can anybody help me solve this error?
@KrimmStudios 3 years ago
Thank you! Just wondering where the learning rates alpha and beta are implemented?
@MachineLearningwithPhil 3 years ago
21:35 Learning rates come into play when we compile the models with an optimizer. I didn't specify a learning rate, so it uses the default values.
@KrimmStudios 3 years ago
@@MachineLearningwithPhil I see. Thanks again
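If you would rather set the rates explicitly than rely on the defaults, here is a small sketch of standard Keras usage (the rate value and helper name are arbitrary examples, not taken from the video):

```python
from tensorflow import keras


def compile_with_lr(model: keras.Model, alpha: float = 1e-4) -> keras.Model:
    """Compile a model with an explicit Adam learning rate instead of the default."""
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=alpha))
    return model
```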
@Falconoo7383 1 year ago
Which TensorFlow version is good for this?
@nawaraghi 3 years ago
I really appreciate your explanation. I tried to run it on FrozenLake and NChain, and it didn't work even though I changed the input_dims from 8 to 1. Any hints or help on how I can alter the code to work on FrozenLake?
@MachineLearningwithPhil 3 years ago
Frozen lake isn't an appropriate environment for the algorithm. FL is for tabular methods, not approximate ones. In other words, neural nets won't really work.
@JousefM 3 years ago
Comment for the algorithm! :)
@MachineLearningwithPhil 3 years ago
Thanks Jousef!
@davideaureli6971 3 years ago
Hi @Phil, thank you for this amazing video. Just one question about your loss (the critic loss): is it possible that it explodes using delta**2? Because the gradient after that gives me all NaN values. Any advice?
@MachineLearningwithPhil 3 years ago
Strange. What environment? Make sure the ln term isn't exploding.
@davideaureli6971 3 years ago
@@MachineLearningwithPhil I've just noticed that the NaN values appear when one probability goes to 0 in our probs tensor. Can we just add a small quantity to prevent this? And is this the reason for the NaN in the gradient, because we have a derivative of 0?
@MachineLearningwithPhil 3 years ago
Ln of 0 is undefined. You can just add some small value, yes.
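A small sketch of that fix (illustrative, not from the repository): clamp the probabilities away from zero and renormalize before building the distribution, so the log-probability stays finite:

```python
import tensorflow as tf
import tensorflow_probability as tfp


def safe_categorical(probs, eps=1e-8):
    """Clip probabilities away from 0, renormalize, then build the distribution."""
    probs = tf.clip_by_value(probs, eps, 1.0)
    probs = probs / tf.reduce_sum(probs, axis=-1, keepdims=True)
    return tfp.distributions.Categorical(probs=probs)
```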
@davideaureli6971 3 years ago
@@MachineLearningwithPhil Another question: in a problem where the values predicted by the actor are in a completely different range from the critic's (actor -> (0,1), while critic -> (-80,160)), is it really difficult to find the optimal combination with just one network?
@Jnaaify 3 years ago
@@davideaureli6971 Hi! I have the same problem, but I can't get it fixed. How did you do it? Thanks!
@fawadnizamani761 3 years ago
Why do we pass the softmax probabilities to the TFP categorical distribution? Can we not just select the highest-probability action from the softmax output? I'm not really good at understanding the math, so I'm having a hard time figuring it out.
@Jnaaify 3 years ago
I am wondering the same thing. It looks like it also works if you just take the action with the highest probability.
@davideaureli6971 2 years ago
I think it is to implement the exploration part for the agent.
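A small illustrative sketch of that trade-off (an assumption about usage, not code from the video): sample from the distribution while training so the agent explores, and take the argmax when evaluating a trained agent:

```python
import tensorflow as tf
import tensorflow_probability as tfp


def choose_action(probs, training=True):
    """Sample stochastically during training; act greedily at evaluation time."""
    if training:
        dist = tfp.distributions.Categorical(probs=probs)
        return int(dist.sample()[0])
    return int(tf.argmax(probs, axis=-1)[0])
```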
@LidoList 3 years ago
Thanks for the great tutorial. Does one game mean one episode?
@raffaeledelgaudio2724 2 years ago
Usually, yes.
@papersandchill 2 years ago
An episode typically ends when the environment is reset. (This never happens in the real world, unless the real world itself is a simulator, like a game, for example chess.)
@Falconoo7383 1 year ago
ImportError: This version of TensorFlow Probability requires TensorFlow version >= 2.9; Detected an installation of version 2.8.0. Please upgrade TensorFlow to proceed. I am getting this error; can anybody help me solve it? I already upgraded TensorFlow but got the same error again. @Machine Learning with Phil
@ashishsarkar3998 3 years ago
Please make another tutorial on deep Q learning with TensorFlow 2.
@ShortVine 3 years ago
He already made it; check his channel.
@selcukkara82 2 years ago
Hi Phil, I am a beginner. Can you tell me whether the critic is still needed after training is complete? That is, is the actor alone enough after training? Thanks.
@papersandchill 2 years ago
Only the actor!
@selcukkara82 2 years ago
@@papersandchill Thank you so much.
@tunestar 3 years ago
Every time people show one of those math formulas on YouTube, a baby panda dies somewhere in the world.
@MachineLearningwithPhil 3 years ago
Call the WWF!
@alexandralicht1023 3 years ago
@@MachineLearningwithPhil I am looking to set up render() for an RL environment, i.e. env.render(). Do you have any videos related to this?
@alinouruzi5371 3 years ago
goodddddddddddddddddddddddddddddddddddddddd
@birinhos 2 years ago
What is the game? Edit: OK, cart pole...
@filipesa1038 2 years ago
"Probability of getting heads multiplied by the reward for getting heads": in my case it's most likely zero.
@SpringerJen 3 months ago
hi