Does your PPO agent fail to learn?

15,325 views

RL Hugh

A year ago

One hyper-parameter could improve the stability of learning and help your agent explore!
We investigate how to improve the reliability of training with the Stable Baselines 3 library and ViZDoom, using the PyTorch deep neural network library and the Python 3 language.
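For reference, a minimal sketch of where this hyper-parameter lives in Stable Baselines 3's PPO. The environment and the value 0.01 are placeholders (the video uses ViZDoom; CartPole stands in here so the snippet runs on its own):

```python
from stable_baselines3 import PPO

# ent_coef is the entropy regularization coefficient discussed in the video.
# SB3's default is 0.0; a small positive value rewards the policy for staying
# spread out over actions, which can help exploration.
model = PPO(
    "MlpPolicy",
    "CartPole-v1",   # stand-in for a ViZDoom gym wrapper
    ent_coef=0.01,   # illustrative value, not a tuned recommendation
    verbose=1,
)
model.learn(total_timesteps=100_000)
```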

Comments: 34
@philippk5446
@philippk5446 A year ago
In this context, you should always track your KL divergence, since a high KL divergence may indicate over-exploration.
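For context, Stable Baselines 3 makes this straightforward: PPO logs an approximate KL divergence (train/approx_kl) when TensorBoard logging is enabled, and the target_kl argument stops the update epochs early if the KL grows too large. A hedged sketch with illustrative values:

```python
from stable_baselines3 import PPO

model = PPO(
    "MlpPolicy",
    "CartPole-v1",
    target_kl=0.03,           # stop the epoch loop early if approx KL exceeds this
    tensorboard_log="./tb/",  # then watch train/approx_kl next to the reward curves
    verbose=1,
)
model.learn(total_timesteps=50_000)
```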
@datauniverses187
@datauniverses187 A year ago
Wth, the default is 0.0! No wonder my agents get stuck in a local optimum.
@underlecht
@underlecht A year ago
Awesome material. Please continue with these insights, very interesting and useful.
@rlhugh
@rlhugh A year ago
Thanks! Please let me know if there's anything you'd be curious to see investigated.
@vladyslavkorenyak872
@vladyslavkorenyak872 4 months ago
Hello Sir. Do you have any insight into the "use_sde" variable in PPO Stable Baselines 3? It supposedly activates "generalized State Dependent Exploration", but I couldn't find any clear results about its pros and cons.
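For anyone else wondering: to my understanding, use_sde switches PPO from independent per-step action noise to generalized State-Dependent Exploration, and in Stable Baselines 3 it only applies to continuous (Box) action spaces, so it would not affect a discrete-action Doom setup. A hedged sketch on a continuous-control task:

```python
from stable_baselines3 import PPO

model = PPO(
    "MlpPolicy",
    "Pendulum-v1",       # gSDE needs a continuous action space
    use_sde=True,        # exploration noise becomes a function of the state features
    sde_sample_freq=4,   # resample the noise matrix every 4 steps (-1 = once per rollout)
    verbose=1,
)
model.learn(total_timesteps=20_000)
```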
@willnutter1194
@willnutter1194 A year ago
Great videos, really enjoy your style of communication and thoughts. Thanks for making them :)
@petarulev9021
@petarulev9021 A year ago
I have the exact same problem of overfitting: my agent learns very useful stuff, but at some point it just overfits to one action. That's why I take the checkpoint from before the overfitting, but that's a nasty fix. I just incorporated entropy regularization and my model is training. The data is incredibly noisy; I will let you know about the result. In the meantime, I am wondering how kl_coeff influences the whole process. What do you think about it, and about the relation between entropy regularization and kl_coeff? I would appreciate a video or a comment. Cheers, Petar
@remcopoelarends9888
@remcopoelarends9888 A year ago
Very nice video! Could you maybe make a video explaining and setting the hyperparameters of PPO in SB3? Keep up the good work!
@rlhugh
@rlhugh A year ago
Thanks! Any particular parameter(s) that you are most interested in?
@remcopoelarends9888
@remcopoelarends9888 A year ago
@@rlhugh The ones that are less self-explanatory, such as clip_range, normalize_advantage, ent_coef, max_grad_norm and use_sde.
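For readers following along, a hedged sketch of where those arguments sit in the SB3 PPO constructor, with their rough roles; the values are illustrative, not tuned:

```python
from stable_baselines3 import PPO

model = PPO(
    "MlpPolicy",
    "CartPole-v1",             # for ViZDoom you would typically use "CnnPolicy" on image observations
    clip_range=0.2,            # how far the new policy's probability ratio may move per update
    normalize_advantage=True,  # standardize advantages within each mini-batch
    ent_coef=0.01,             # entropy bonus weight (the subject of this video)
    max_grad_norm=0.5,         # clip the gradient norm before each optimizer step
    use_sde=False,             # generalized state-dependent exploration (continuous actions only)
    verbose=1,
)
```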
@RoboticusMusic
@RoboticusMusic 3 months ago
It might be more helpful to explain and demo what entropy regularization is, what it does, and the history of the concept and its different forms. The rest would be pretty intuitive.
@rlhugh
@rlhugh 3 months ago
Thank you for the feedback. Very useful, and I appreciate it :)
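Until such a video exists, a rough conceptual sketch of the idea: the policy's entropy is added to the objective with weight ent_coef, so collapsing onto a single action is penalized. This mirrors the idea only, not SB3's exact implementation:

```python
import torch
from torch.distributions import Categorical

logits = torch.tensor([[2.0, 0.1, 0.1]])       # unnormalized preferences over 3 actions
dist = Categorical(logits=logits)

policy_loss = torch.tensor(0.5)                # placeholder for the PPO surrogate loss
ent_coef = 0.01
entropy_bonus = dist.entropy().mean()          # large when the policy is spread out

loss = policy_loss - ent_coef * entropy_bonus  # minimizing this rewards higher entropy
print(float(entropy_bonus), float(loss))
```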
@SP-db6sh
@SP-db6sh A year ago
Make a video on using FinRL
A year ago
It's a great video. I am tuning the Kp and Ki gains with PPO reinforcement learning. The result is constant across the whole trajectory of the robot's movement, so I would like to know why the result comes out constant. Am I doing something wrong, or is that fine? I really appreciate your comments. Thanks!
@joelkarlsson9869
@joelkarlsson9869 2 months ago
So the entreg is the same as ent_coef in PPO, or did I misunderstand you?
@rlhugh
@rlhugh 2 months ago
Yes, that's correct.
@Meditator80
@Meditator80 A year ago
Really fantastic videos 🎉
@rlhugh
@rlhugh A year ago
Thank you!
@hoseashpm7810
@hoseashpm7810 A year ago
Every episode, my PPO agent's cumulative reward seems very "noisy". Meaning the average cumulative reward increases, but the instantaneous cumulative reward looks like a noisy signal. I tried tips on designing a reward function with a gradient, and tried changing the entropy loss weight, yet it just does not reach a consistent policy. I feel like pulling my hair out now.
@rlhugh
@rlhugh A year ago
Somehow I missed this comment earlier. Yeah, the reward usually is very noisy. In TensorBoard, there is an option to smooth the graph. The same option exists in MLflow, and probably Weights and Biases too. But what do you mean by 'instantaneous cumulative reward'? Isn't the cumulative reward, by definition, the sum of all rewards from time 0 until some time T?
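For context, the smoothing TensorBoard applies is (roughly) an exponential moving average over the logged values; a minimal sketch of the same idea on made-up episode returns:

```python
def smooth(values, weight=0.9):
    """TensorBoard-style exponential smoothing of a noisy scalar series."""
    smoothed, last = [], values[0]
    for v in values:
        last = weight * last + (1 - weight) * v
        smoothed.append(last)
    return smoothed

episode_returns = [12, 3, 25, 7, 30, 11, 40]  # made-up, deliberately noisy returns
print(smooth(episode_returns))
```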
@hoseashpm7810
@hoseashpm7810 A year ago
@@rlhugh Hi Hugh, thanks for the tip. By "instantaneous" I meant the cumulative reward at the end of every episode. I used MATLAB for designing the agent. I ended up switching to a double DQN with a discrete action space, and it learned a lot faster and smoother. Maybe my knowledge of PPO sucks. I tried extending the training time, but the PPO agent gets stuck somehow.
@rlhugh
@rlhugh A year ago
Interesting. Good info. Thank you! Do you have any thoughts on what about your task might make it more amenable to value function learning? What are some of the characteristics of your input and output space that might be different from, e.g., playing Doom using the screen as input?
@vialomur__vialomur5682
@vialomur__vialomur5682 A year ago
Thanks!
@SP-db6sh
@SP-db6sh A year ago
I regret only seeing this 6 months later. Can you make a video on creating a custom env for a system like the user experience of a new app, or a trading bot?
@rlhugh
@rlhugh A year ago
So, firstly, I don't have experience with using RL for trading. Secondly, my gut intuition is that one uses RL when one's actions affect the environment, or at least the current state. However, unless you are making giant trades, your trading actions will not much affect your environment, i.e. the price, I think? The state does include things like how much money you have and what stock you own. However, I'm not sure that how much stock you own and how much money you have will much affect an estimate of the value of a stock. I would imagine that supervised learning is all you need, and would be much more efficient? What makes you feel that RL could be appropriate for estimating the value of a stock, or taking actions on stock?
@rlhugh
@rlhugh A year ago
(I suppose one option could be to create a simulator by using stock prices from a year or so ago, and assuming that one's stock trades do not affect the market price?)
@rlhugh
@rlhugh A year ago
What timeframe were you thinking of using for each step of RL? E.g. 5 minutes? 1 day? 1 week? 1 month? Do you know where one could obtain prices for the stocks that you are interested in trading, from e.g. a year ago, at the level of granularity that you want to train RL on?
@p4ros960
@p4ros960 A year ago
@@rlhugh Keep in mind that price does not mean anything in trading.
@rlhugh
@rlhugh A year ago
@@p4ros960 Can you elaborate on that? AFAIK, all securities with stocks as the underlying asset have a value that depends on the price of the underlying stock? For example, if you sell a call, the more the price of the underlying stock goes up, the more money you will lose when that call is exercised, I think?
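On the simulator idea a few comments up, a very rough sketch of a replay-style environment over historical prices, assuming one's own trades do not move the market. Everything here (the price series, the reward definition, the two-action setup) is illustrative only:

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces

class HistoricalPriceEnv(gym.Env):
    """Replays a fixed price series; action 0 = hold cash, action 1 = hold the asset."""

    def __init__(self, prices):
        super().__init__()
        self.prices = np.asarray(prices, dtype=np.float32)
        self.action_space = spaces.Discrete(2)
        self.observation_space = spaces.Box(0.0, np.inf, shape=(1,), dtype=np.float32)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.t = 0
        return self.prices[self.t:self.t + 1], {}

    def step(self, action):
        log_return = np.log(self.prices[self.t + 1] / self.prices[self.t])
        reward = float(log_return) if action == 1 else 0.0  # paid only while holding
        self.t += 1
        terminated = self.t >= len(self.prices) - 1
        return self.prices[self.t:self.t + 1], reward, terminated, False, {}

# Toy random-walk prices stand in for real historical data.
prices = 100.0 + np.cumsum(np.random.default_rng(0).normal(0.0, 1.0, size=500))
env = HistoricalPriceEnv(prices)
# env could then be passed to PPO("MlpPolicy", env) as usual.
```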
@Bvic3
@Bvic3 6 months ago
What's 100k steps? You run 100 times 1 epoch of learning on 1000 frames?
@rlhugh
@rlhugh 6 months ago
Steps relate to the simulation, not to the learning. A step is one iteration of: receive an observation, take one action. Epochs of learning etc. are configured separately. You can choose to run 5 epochs of learning over each batch of steps, for example, which would result in each step being used in 5 different training epochs.
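To make that concrete, a hedged sketch of the two separate knobs in Stable Baselines 3 (values illustrative):

```python
from stable_baselines3 import PPO

model = PPO(
    "MlpPolicy",
    "CartPole-v1",
    n_steps=1000,    # simulation steps collected (per env) before each update
    batch_size=100,  # mini-batch size used during the update
    n_epochs=5,      # each collected step is reused in 5 passes of gradient updates
    verbose=1,
)
model.learn(total_timesteps=100_000)  # total simulation steps across all rollouts
```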
@Bvic3
@Bvic3 6 months ago
@@rlhugh Ok, thanks. That's what I expected, but I just wanted confirmation.