Decision Transformer: Reinforcement Learning via Sequence Modeling (Research Paper Explained)

60,248 views

Yannic Kilcher

1 day ago

#decisiontransformer #reinforcementlearning #transformer
Proper credit assignment over long timespans is a fundamental problem in reinforcement learning. Even methods designed to combat this problem, such as TD-learning, quickly reach their limits when rewards are sparse or noisy. This paper reframes offline reinforcement learning as a pure sequence modeling problem, with the actions being sampled conditioned on the given history and desired future rewards. This allows the authors to use recent advances in sequence modeling using Transformers and achieve competitive results in Offline RL benchmarks.
OUTLINE:
0:00 - Intro & Overview
4:15 - Offline Reinforcement Learning
10:10 - Transformers in RL
14:25 - Value Functions and Temporal Difference Learning
20:25 - Sequence Modeling and Reward-to-go
27:20 - Why this is ideal for offline RL
31:30 - The context length problem
34:35 - Toy example: Shortest path from random walks
41:00 - Discount factors
45:50 - Experimental Results
49:25 - Do you need to know the best possible reward?
52:15 - Key-to-door toy experiment
56:00 - Comments & Conclusion
Paper: arxiv.org/abs/2106.01345
Website: sites.google.com/berkeley.edu...
Code: github.com/kzl/decision-trans...
Trajectory Transformer: trajectory-transformer.github...
Upside-Down RL: arxiv.org/abs/1912.02875
Abstract:
We present a framework that abstracts Reinforcement Learning (RL) as a sequence modeling problem. This allows us to draw upon the simplicity and scalability of the Transformer architecture, and associated advances in language modeling such as GPT-x and BERT. In particular, we present Decision Transformer, an architecture that casts the problem of RL as conditional sequence modeling. Unlike prior approaches to RL that fit value functions or compute policy gradients, Decision Transformer simply outputs the optimal actions by leveraging a causally masked Transformer. By conditioning an autoregressive model on the desired return (reward), past states, and actions, our Decision Transformer model can generate future actions that achieve the desired return. Despite its simplicity, Decision Transformer matches or exceeds the performance of state-of-the-art model-free offline RL baselines on Atari, OpenAI Gym, and Key-to-Door tasks.
Authors: Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, Aravind Srinivas, Igor Mordatch
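To make the conditioning described in the abstract concrete, here is a minimal sketch (not the authors' code; function names and the toy trajectory are illustrative, and embedding/timestep details are omitted) of how a trajectory becomes the interleaved (return-to-go, state, action) token sequence that a causally masked transformer is trained on.

# Minimal sketch of the trajectory representation described in the abstract.

def returns_to_go(rewards):
    """Suffix sums of the reward sequence: R_t = r_t + r_{t+1} + ... + r_T."""
    rtg, running = [], 0.0
    for r in reversed(rewards):
        running += r
        rtg.append(running)
    return list(reversed(rtg))

def build_sequence(states, actions, rewards):
    """Interleave (return-to-go, state, action) triples into one token sequence."""
    rtg = returns_to_go(rewards)
    tokens = []
    for R, s, a in zip(rtg, states, actions):
        tokens += [("rtg", R), ("state", s), ("action", a)]
    return tokens

# Toy trajectory: 3 steps, reward only at the end.
print(build_sequence(states=[0, 1, 2],
                     actions=["left", "left", "right"],
                     rewards=[0.0, 0.0, 1.0]))
# Returns-to-go are [1.0, 1.0, 1.0]; a causally masked transformer trained on
# such sequences predicts each action token from everything that precedes it.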
Links:
TabNine Code Completion (Referral): bit.ly/tabnine-yannick
KZfaq: / yannickilcher
Twitter: / ykilcher
Discord: / discord
BitChute: www.bitchute.com/channel/yann...
Minds: www.minds.com/ykilcher
Parler: parler.com/profile/YannicKilcher
LinkedIn: / yannic-kilcher-488534136
BiliBili: space.bilibili.com/1824646584
If you want to support me, the best thing to do is to share out the content :)
If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: www.subscribestar.com/yannick...
Patreon: / yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n

Comments: 105
@paxdriver 3 years ago
I love what you're doing here. Thank you, man
@sofia.eris.bauhaus 3 years ago
general intelligence can be achieved by maximizing the Schmids that are Hubed.
@hongyihuang3560 3 years ago
“this must just have gotten in here by *accident*” right...
@samanthaqiu3416 3 years ago
I commented exactly the same thing, but deleted it and liked yours once I saw you had commented it first.
@yimingqu2403 3 years ago
I knew I'd find this. What an accident.
@naromsky 4 months ago
Yep, nice touch.
@seanohara1608 3 years ago
Scary to think that there might be “youngsters” that watch these videos who do not know what an LSTM is. I love living in a time with this pace of innovation.
@SirPlotsalot 2 years ago
It's nuts that in a few years they'll talk about LSTMs as an "old" technique...
@DennisBakhuis 3 years ago
"I realize, some of you youngsters don't know what an LSTM actually is" - oh boy, am I getting old now?
@matiasdanieltrapagliamansi3109 2 years ago
Crystal clear, you are awesome, man! Thanks a lot.
@JTMoustache 3 years ago
The fact that conditioning on the past works better probably means the problem is non-Markovian with respect to the state representation initially chosen for the task. Conditioning on past states and rewards (and actions, why not) enriches the state and allows the model to better discriminate the best action. It is limited in terms of context size, but much richer than classic RL, where the system is assumed to be Markovian and a single state is all you get.
Also, credit assignment happens whatever the size of the context, because the reward gets propagated backwards in time as the agent encounters states that are close enough. If that were not the case, it should be even worse in more classic RL models than in this one, because they only update a single state-action value rather than this rich (and smoothed) state representation. It is because value = current reward + future value that the reward is progressively propagated back. (You maximize undiscounted rewards while defining a value function with discounted future rewards, so the series converges over an infinite horizon.)
Also interesting: in the planning-as-inference literature, you also condition on the "optimality" of your action, similarly to conditioning on the reward, although there it does not matter what the value of the reward is, simply that it is the optimal trajectory.
@tetamusha 3 years ago
Conditioning on the past is also beneficial in POMDPs, which are still considered Markovian. The classical solutions to POMDPs revolve around approximating the agent's belief that it is in each of the states, and this belief is refined by taking into account the history of observations.
@GuillermoValleCosmos 3 years ago
I think there's a misunderstanding in the example at 32:00. The issue with a limited context length is the following: if the action that would lead to reward R depends on an action you took a long time ago, then your limited-context policy can't know which action it should take. However, that problem is *not* solved by Q-learning/dynamic programming. If you use a policy with limited context, even with Q-learning, you cannot learn which action you should take, for precisely the same reason of having an input to your policy network that doesn't carry enough information. I think this is a problem of partial observability, not of RL credit assignment.
On the other hand, if your issue is the time separation between a critical action and the reward, as opposed to the observation context length, then the approach in the paper is fine, as I think they are using *returns* (i.e. accumulated rewards over the whole episode). I found your example a bit vague, but I think your criticism was more along the lines of the above point.
@NothingButFish 3 years ago
As long as the problem has the Markov property (which is the case, since the authors assume the agent acts in an MDP, so there is full observability), it is always possible to predict the optimal action given knowledge of only the current state. The issue with context length would be that if the predicted reward were only a function of the context, a crucial (state, action) pair that leads to a high reward outside the context could not be properly accounted for. However, since they train on reward-to-go (which is the total reward going forward in the episode, not just within the context), even a context length of 1 should work, which the authors comment on. But they found that a higher context length tends to work better, which I'd imagine is because it just makes the transformer generalize better naturally.
@Stochine 3 years ago
@@NothingButFish Exactly. If this were not the case, no traditional tabular reinforcement learning method would have convergence and optimality guarantees. With optimal Q-values, and if the Markov property truly holds, a single state is enough information to perform optimally. However, if the Markov property does not hold, then for high-dimensional state spaces it gets a little more fuzzy. With only a single state in a high-dimensional state space, the agent may have a hard time distinguishing between individual states (they may look roughly the same, e.g., staring at a wall, but at different parts of a level: they could look exactly the same to the agent). This is one of the true cases where the size of the context becomes a problem.
@user93237 2 years ago
@NothingButFish Maybe I am misunderstanding something, but it seems strange a context length of 1 would work in the general case. E.g. imagine a grid world agent moving to the right around an obstacle on the final step finally reaching the reward. That would be the only training example with max_reward -> left. Now imagine the agent would start to the right of the goal, then the only example corresponding to high reward would lead into the wrong direction.
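To make the reward-to-go discussion in this thread concrete, here is a minimal rollout sketch of the evaluation procedure the paper describes: the commanded return is decremented by each observed reward as the episode unfolds. The Gym-style `env` and the `model.predict` call are hypothetical placeholders, not the authors' API.

# Sketch of rollout-time conditioning on reward-to-go (hypothetical `env` and
# `model`; not the authors' evaluation code).

def rollout(model, env, target_return, max_steps=1000, context_len=20):
    state = env.reset()
    rtg, states, actions = [target_return], [state], []
    total = 0.0
    for _ in range(max_steps):
        # Condition on the most recent K (return-to-go, state, action) tokens.
        action = model.predict(rtg[-context_len:], states[-context_len:],
                               actions[-context_len:])
        state, reward, done, _ = env.step(action)
        total += reward
        # The commanded return shrinks by the reward just observed, so the
        # remaining "reward-to-go" stays consistent with how training data looked.
        rtg.append(rtg[-1] - reward)
        states.append(state)
        actions.append(action)
        if done:
            break
    return total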
@mgostIH 3 years ago
Developing an entire literature around game theory, monte carlo tree search, domain specific methods: 🤢 Throwing a transformer on it: 😎
@miikalehtimaki1136 3 years ago
Sutton's bitter lesson at work.
@dylancope 3 years ago
You threw in Schmidhuber's 2019 paper, but it's also interesting to note how this approach goes back to Hutter 2005, with General Reinforcement Learning as Solomonoff Induction + Utility Theory.
@user93237 2 years ago
How is Solomonoff Induction related to UDRL? SI is a formalization of Occam's Razor, i.e. that future observations are best predicted by the shortest programs that fit the observations. AIXI extends on that doing Bayesian updating on how much reward each program produces, which is basically value function learning. UDRL kind of does the reverse, learning reward to action mappings.
@scottmiller2591 3 years ago
Schmidhuber often gets into my folders as the earliest dated file - I don't know how I keep screwing up, but it's good to hear I'm not the only one.
@fuma9532 3 years ago
Right on time, I was just delving into those two papers :)
@scottmiller2591 3 years ago
Reward discounting is equivalent to 1 minus the per-turn probability of leaving the game before it is finished. You can have cyclic behavior whenever the rewards cannot be expressed as a potential; it's not quite the same thing as stability of the reward. Without reward discounting you may see things like asymptotic explosions, which are not cyclic.
@menzithesonofhopehlope7201 3 years ago
I love how AI researchers from different firms work together 😊
@martinschulze5399 2 years ago
I think they are dominating the paper market too heavily :(
@CyberneticOrganism01 2 years ago
Thanks for the video, it's very helpful. Ultimately I think this architecture design is awkward (requiring the expected reward to predict an action), and that we're just trying to explain something that doesn't make too much sense. The Transformer can output a probability distribution over a vocabulary, and in this respect it is perfectly suitable for the RL setting, where we need a probability distribution over actions. The problem lies in other aspects, in particular the many **layers** of Transformers, which make it unstable to train in the RL setting (as pointed out in Parisotto's paper). I see this paper as an early attempt to put the Transformer into RL, and this, if successful, would be an AGI prototype. We're very close to having an AGI 😀
@martinschulze5399 2 years ago
Well explained, thanks.
@BehnamManeshgar 2 years ago
Thanks for the video, it is very helpful. At 36:50 the toy example seems to be correct, though. During generation it uses prior knowledge from its experiences to make a decision, so it has -3 from the yellow graph and -1 from the blue graph. During generation it does not care about the real reward it gets (which is -2); what matters, and what is shown in the figure, is the expected reward from prior knowledge.
@dojutsu6861 2 years ago
As someone from an NLP background where transformers are prevalent, I must point out that the concern about limited span is partially addressed by lines of research on how to expand the look-back possible with transformers.
@vornam_nachnam_7044 1 year ago
In a discrete action space (the paper uses a continuous action space), does that mean we would need to add an "action start" token for t=0? Because to predict the action at time t, we have to encode the reward we want for time step t, the state at time step t, and the action at time step t-1. Or did I get something wrong?
@pjbontrager 3 years ago
I would imagine that by discount factor they were referring to gamma. Since Q-learning is a TD(0) algorithm, there is no lambda to tune. One good intuition for the meaning/purpose of a discount factor is as a proxy for the likelihood that your agent will survive to reach a future reward. It's more about tuning how far back it can look for credit assignment, which affects how stable the learning process is.
@pjbontrager 3 years ago
Two other notes: it seems like it'd be pretty cheap to find the highest reward you can ask for - just double the highest reward from your training data and do a binary search. Second, for the context size, you could break sequences into subsequences and have the final reward of a subsequence be the sum of the future reward left out, though I guess you would say this is bringing the dynamic programming aspect back in.
@YannicKilcher 3 years ago
Good idea, I guess that approach would work quite well. Maybe needs a bit of consideration for unbounded reward environments, but essentially we're also limited there by the demonstrations we have
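A rough sketch of the search suggested in this thread: double past the best return seen in the data, then binary-search on the return the policy actually achieves. `evaluate_with_target` is a hypothetical function standing in for running the return-conditioned policy, and the positive-return assumption is mine.

# Sketch of a binary search for a usable conditioning return (assumptions:
# positive returns, and a hypothetical evaluate_with_target callback).

def find_usable_target(evaluate_with_target, best_in_data, tol=1.0):
    lo, hi = best_in_data, 2.0 * best_in_data   # assumes best_in_data > 0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        achieved = evaluate_with_target(mid)
        if achieved >= mid - tol:   # the policy can (roughly) deliver this much
            lo = mid
        else:
            hi = mid
    return lo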
@shayanjawed6277 3 years ago
Wallah, I thought this paper was super interesting and wished it would show up on Yannic's channel, and here it is! :D
@DanielHernandez-rn6rp 3 years ago
Perhaps this was already pointed out, and I apologise for sounding overly rigorous! At 17:50 you start describing the fundamental intuition of temporal difference learning by saying "Q^{\pi}(s) = r + Q^{\pi}(s')". Which is great, but that's the value function (V(s)), not the state-action value function (Q(s, a)), which also takes an action in its function signature. For the purpose of your explanation it doesn't really matter, but I'll leave this comment here just in case. Keep up the amazing work. And congrats on your recent graduation, Dr. Kilcher :D
@YannicKilcher 3 years ago
Yes, good catch, thank you very much!
@ademord 3 years ago
What is temporal difference learning?
@DanielHernandez-rn6rp 3 years ago
@@ademord, the wikipedia article for it does a fairly good job at explaining what it's about (en.wikipedia.org/wiki/Temporal_difference_learning). The goal is: find a function that maps states to future expected reward (V(s) -> float), and you do this by minimizing the difference (or discrepancy in layman terms) between your estimated values at point in time (t) and (t+1).
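For readers following this thread, a minimal tabular TD(0) sketch of that update rule: nudge V(s) toward r + gamma * V(s') for each observed transition. The three-state toy chain is an assumption for illustration, not from the paper.

# Minimal tabular TD(0) sketch of the idea discussed above.
from collections import defaultdict

def td0_update(V, transitions, alpha=0.1, gamma=0.99):
    for s, r, s_next, done in transitions:
        target = r + (0.0 if done else gamma * V[s_next])
        V[s] += alpha * (target - V[s])
    return V

V = defaultdict(float)
# Toy 3-state chain with reward only at the end.
episode = [("s0", 0.0, "s1", False), ("s1", 0.0, "s2", False), ("s2", 1.0, "s2", True)]
for _ in range(200):
    td0_update(V, episode)
print({s: round(v, 2) for s, v in V.items()})  # values propagate back from the final reward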
@binjianxin7830 3 years ago
It feels like the notion of "reward" was confused with "return". The discount factor is just gamma; lambda belongs to the λ-return in TD(λ).
@duynguyenmau5551 2 years ago
Does anyone know which PDF editor he was using in the video?
@Kram1032 3 years ago
Couldn't you simply do a hybrid version where you pass in Q-values at the start of the context length, or something like that?
@YannicKilcher 3 years ago
Nice idea, though that would be admitting you still need the old techniques and that this just extends the addressable context :D
@Kram1032 3 years ago
@@YannicKilcher yeah, of course, if you really want to make the point that you truly don't need Q-values (which the authors seem to want to do), that's fair. But honestly it seems much better an idea to me to mix and match the techniques to get The Best Of Both Worlds as it were. Perhaps also connect it with those decaying-memory transformers (or something similar) so you can potentially get much longer-scale horizons as for most games (and surely also most situations irl), it's only gonna be a fairly small fraction of information you actually need to remember for very long times. Most stuff is relevant for only a few timesteps. (To avoid the silly kind of single- or few-step behavior that a lot of AIs with small contexts do, where they repeat the same actions over and over)
@axelbrunnbauer778 3 years ago
Was thinking exactly the same.
@vishalbatchu2433 3 years ago
Great videos as always! Which tool/software do you use to get an infinite PDF canvas to draw on?
@YannicKilcher 3 years ago
OneNote
@user-sf3lr9ey8c 5 months ago
Great video.
@oshri8 3 years ago
Hi, great video as always. I have a problem with the term "offline RL": not every policy-learning algorithm is reinforcement learning. The main problem that RL tries to solve is not reward assignment but the exploration vs. exploitation tradeoff. If there is no exploration, it is not RL.
@user-lv8rf9tm1f 3 years ago
I would like to question the statement that the lack of exploration is the reason not to call this algorithm a reinforcement learning one. The reason is that RL's main goal is to learn a policy using simulation and/or a dataset of trajectories taken from a Markov decision process, meaning you have access to the reward function and the next state as feedback from the environment (with or without an internal model). This feedback is the criticism here, so if we cannot access this feedback, either from data or from an existing simulation, it is not RL. Meanwhile, the exploration/exploitation tradeoff is just one problem of RL, as are credit assignment and so on. I might be wrong and would like to hear your thoughts. I guess some terms are used wrongly or not accurately in this video, like inverse RL or offline learning, but the idea is well presented, and the link to transformers is kind of the key in this paper, so other things can be inaccurate to some extent.
@vivekpadman5248 6 months ago
That's a nice point, but what I feel is that supervised training is still different: in offline reinforcement learning there is an agent performing actions similar to the agent you are training (same action space), and the feedback comes from looking at the actions taken by that agent and the rewards from the environment. The only difference is that our agent cannot take actions under its own policy. I hope I got the point across. It's like learning to drive just by watching someone drive - not the same as reading a driving book, is it? 😊
@MarkusFjellheim 1 month ago
32:00 I think the model could still learn what actions outside the context window it should take to get future rewards. Even with a context window of 1, it could look at its training data and see 2 random agents taking different actions and having different returns to go and from this know what actions are better. If you have enough training data the model should be able to generalize over the environment and make these predictions in general
@salmankhalildurrani 1 year ago
Great job, make more videos!
@danielalorbi 3 years ago
11:40 lmao, thanks Grandpa Yannic
@kargarisaac 3 years ago
There is another new and interesting paper by Sergey Levine's group, called the Trajectory Transformer.
@NothingButFish 3 years ago
Since the paper implies that a prior is used on the data to essentially extract the highest-reward trajectories, I'm skeptical that it would work well on problems with nondeterministic dynamics. For example, for a case where a particular action has a 50/50 chance of producing a reward of 100 or a reward of -100, the bad trajectories would be thrown out and it would learn that the state,action in question leads to a reward of 100, when in fact on average it leads to a reward of 0. A different action for that state that always gives a reward of 90 would be a better prediction for "action that leads to reward of 100." Or am I misunderstanding?
@vivekpadman5248 6 months ago
Well, I think the reward-to-go mechanism has to handle the reward properly here. It's not a limitation of the architecture but of the dataset. If the dataset is built properly, with proper exploration, it should contain both the +100 and -100 cases 😅
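A small simulation of the concern raised in this thread, with assumed toy numbers: filtering for the highest-return trajectories makes a 50/50 +/-100 action look better than a deterministic +90 one, even though its expected return is 0.

# Toy demonstration of the selection bias under stochastic dynamics.
import random

random.seed(0)
logged = []
for _ in range(10_000):
    if random.random() < 0.5:
        logged.append(("risky", random.choice([100, -100])))  # 50/50 outcome
    else:
        logged.append(("safe", 90))                           # deterministic outcome

top = sorted(logged, key=lambda x: x[1], reverse=True)[:1000]  # keep only "best" trajectories
risky_kept = sum(1 for a, _ in top if a == "risky")
avg_risky = (sum(r for a, r in logged if a == "risky")
             / sum(1 for a, _ in logged if a == "risky"))
print(f"risky actions among the top trajectories: {risky_kept}/1000")
print(f"true average return of the risky action: {avg_risky:.1f}")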
@G12GilbertProduction 3 years ago
I have a question: can you add a link for the second paper?
@YannicKilcher 3 years ago
Done
@AkashMishra23 3 years ago
OMFG, I was just going through this paper today
@harinkumar1073 3 years ago
I think Transformer-XL will help solve this problem because it learns dependencies beyond the context length.
@jerryyao8491 3 years ago
That's great if I can listen well
@rohanpandey2037 3 years ago
2:25 LMAO, you're really cultivating this Schmidhuber beef
@m.s.2753 3 years ago
Very interesting, and thank you for the video. I like your critical thoughts; I would never dare to question a Berkeley, Facebook, Google paper ^^ Couldn't one combine this with the TD idea by adding the value V to the sequence?
@aniruddhadatta925 3 years ago
Interestingly, this has been partially implemented in the recent Google Research Football Kaggle competition, where a multi-headed attention model was used as a feature extractor for the actor-critic network of PPO.
@dermitdembrot3091 3 years ago
DQNs learn a function q = f(a) from actions to expected returns. Upside-down RL learns something like the inverse, a = f^-1(q). The real f is well defined but not necessarily bijective, so f^-1 may not exist. In other words: for some q there is no action that leads to q, and for some q there is more than one such action. That makes UDRL feel unnatural to me and makes me wonder why it's "working". What did I miss? (Note that I omitted the conditioning on states and histories.)
@michaelguan883 3 years ago
I think that’s why function approximation is necessary for Upside down RL. You don’t need an exact map, just trust the neural net to map one to another approximately and hope for the best.
@dermitdembrot3091 3 years ago
@@michaelguan883 Well, I was already assuming function approximation, and we are approximating a function f^-1 that doesn't necessarily exist. It apparently works in practice, but conceptually it seems odd.
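A toy illustration of the non-bijectivity point in this thread: a return-conditioned policy only has to pick some logged action whose observed return is close to the commanded one, even if no action achieves it exactly and several actions fit. The nearest-neighbour lookup and the toy data below are stand-ins, not how UDRL or the Decision Transformer is actually implemented.

# Toy "inverse" mapping from desired return to action over logged data.
logged = [("a", 1.0), ("b", 5.0), ("c", 5.0), ("d", -2.0)]  # (action, observed return)

def command_to_action(desired_return, data):
    # Nearest-neighbour stand-in for the learned f^-1; ties are broken
    # arbitrarily, which is exactly the non-uniqueness discussed above.
    return min(data, key=lambda pair: abs(pair[1] - desired_return))[0]

print(command_to_action(5.0, logged))   # several actions fit; one is returned
print(command_to_action(10.0, logged))  # no action truly achieves 10; the closest is returned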
@petemoss3160 1 year ago
I am building character chatbots and AI agents, and my dreams have been filled with something like this: taking an arbitrary array of states and actions in a time series with reward. This, combined with behavior cloning in the wild, would be widely applicable. Starting by designing a hard-coded agent with the states and actions, and logging everything with time series and reward scores, should be enough to generate the data for this model. I gotta start playing around with this.
@freemind.d2714 3 years ago
Sequence modeling, Transformer, memory model, back to RL... kind of feels like researchers are running in a circle here...
@patf9770 3 years ago
Seems to me like various approaches are converging towards an optimum.
@freemind.d2714 3 years ago
@@patf9770 Local optimum by researcher descent
@tetamusha 3 years ago
I think the reason they don't need a discount factor is because their model does not use the return, which is the expected sum of total rewards, isn't it? Discounting is theoretically necessary because the return is the expected value of an infinite sum, which is ill-defined (it could go to infinity or simply never stabilize). By using the discount, you guarantee that this sum reaches a stable point, which has an expectation. By the explanation in the video, their model seems to "ask" for the action that makes the agent observe a given immediate reward, not the return. They don't need gamma for that because they don't approximate the return.
@AntonPechenko 1 year ago
I'm afraid I might be repeating what was already mentioned in other comments. It seems the explanation of this paper is not correct in the details. The transformer is conditioned on the reward-to-go R^, and this is how it addresses the mentioned context problem (and, yes, R^ is that noisy target that was also mentioned). So no matter how far away the actual reward is, the undiscounted sum still gives access to the successful action. Also, the transformer doesn't try to infer the reward from the context. I think the context might not matter much as long as you are solving an MDP. Thank you for this video and the other videos!
@JuanCamiloGamboaHiguera 3 years ago
Discount factors are not a design choice. You need them for infinite horizon, otherwise the sum blows up and the expectation is undefined. Here they don't "need" them because they're working with a finite horizon formulation.
@quickdudley 2 years ago
Discount factor is one way to handle infinite horizon but I've also read about other ways of doing it. I can't recall exactly how they work but something to do with predicting the advantage of each action instead of predicting reward directly.
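For completeness, the standard argument behind this thread on why a discount factor makes the infinite-horizon return well defined, assuming bounded per-step rewards:

% With |r_t| <= r_max, the geometric series bounds the discounted return,
% whereas the undiscounted infinite sum need not converge.
\[
  \left| \sum_{t=0}^{\infty} \gamma^{t} r_t \right|
  \;\le\; \sum_{t=0}^{\infty} \gamma^{t} r_{\max}
  \;=\; \frac{r_{\max}}{1-\gamma},
  \qquad 0 \le \gamma < 1 .
\]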
@HappyMathDad 1 year ago
But Yannic, you just said that the Q function can't go back more than maybe 20 steps. So the context problem isn't truly addressed there either, is it?
@Not_a_Pro_117 3 years ago
I think you're mistaking online/offline RL for on-policy/off-policy RL
@norik1616 3 years ago
So Schmidhuber WAS first? 😂🤣
@user93237 2 years ago
Schmidhuber cites Kaelbling, L. P., 1993, "Learning to Achieve Goals" as the originator of similar ideas.
@muhammadaliyu3076 3 years ago
I love UC Berkeley
@galchinsky 3 years ago
Why doesn't Schmidhuber like the transformers? Or does he?
@patham9 3 years ago
No idea but the limited context window is clearly problematic.
@YannicKilcher 3 years ago
Why would he not like them? He invented them.
@patham9 3 years ago
@@YannicKilcher No, that's wrong. Transformers were proposed in "Attention Is All You Need" by Vaswani et al.
@mouduge 3 years ago
@@patham9 Yannic is joking: Schmidhuber invented many many things (incl. LSTMs), but people have often failed to cite his prior contributions, and he loudly complained about it (e.g., his memorable NeurIPS rant at the end of a talk by Ian Goodfellow on GANs), so now there's a recurring joke in the community where people attribute everything to Schmidhuber (a bit like the Chuck Norris meme). Transformers have largely replaced LSTMs in many applications, so some people have claimed that LSTMs are dead, but Schmidhuber defends his invention (e.g., by pointing out the limited context size of transformers). So I'm not sure he hates transformers, but he definitely still loves LSTMs.
@patham9 3 years ago
@@mouduge LSTMs were developed by Sepp Hochreiter and first described in his diploma thesis; Schmidhuber was just his advisor. I agree that mis-crediting is a serious issue in the ML field, besides the trial-and-error nature which distinguishes it from a real science.
@herp_derpingson 3 years ago
0:55 I read the causal transformer as casual transformer.
11:40 Can we call it "model-only reinforcement learning"?
32:50 I don't think it is safe to say that all Q-learning/RNN-based learning will be able to incorporate information from that far back in the past into the current decision. It can, but it is not guaranteed, and in practice it might forget.
50:20 I think this "any reward" thing can be quite useful in developing AI for video games. We don't want a computer opponent to play the hardest it can; the human player should be able to dial down the difficulty.
So this paper just throws SARSA into a transformer? That's it?
@YannicKilcher 3 years ago
I think the "model" term is usually reserved for some sort of externally sourced model (simulator, etc.) or an explicitly termed "learned model". The question is more whether we can call it RL at all :D
Yes, absolutely right: Q-learning etc. of course also have giant trouble learning these things, but at least they could in theory.
And yes, it's almost SARSA, except the "r" here is the reward-to-go, which makes things a bit more interesting.
And yes, great idea. I think the paper actually references AlphaStar, which also conditions on opponent MMR to adjust its strength/strategy in response. Slightly different, but it means it should be possible.
@maximkazhenkov11 3 years ago
10:20 Causal transformer? More like casual transformer! 😂
@lifestil 1 year ago
Lol schmidhuber 😂 That's some funny inside baseball!
@paolofazzini6460 5 months ago
Did anybody ever tell you that your voice resembles Roger Federer's? 😀
@XOPOIIIO 3 years ago
43:30 The reason to discount future rewards is that it's not smart enough.
@ssssssstssssssss 3 years ago
Calling this reinforcement learning is a stretch. This is more akin to imitation learning as it is modeling a group of agents. RL is not a modeling problem.
@pjbontrager 3 years ago
But it’s not exactly imitation learning because it can find sequences that are better than its training sequences. These can then be bootstrapped to have it explore even better sequences. I think the only way to describe it is upside down RL
@ssssssstssssssss 3 years ago
​@@pjbontrager It is similar to upside down reinforcement learning. But that is not reinforcement learning either because it will not "reinforce" good decisions until it gets to some optimal decision given the data. In order to progress toward an optimal decision in this example, you'd need to introduce reinforcement learning or some other decision algorithm and learn the optimal reward to predict, since that is an action.
@pjbontrager 3 years ago
@@ssssssstssssssss well it models where the rewards are and then dynamically searches that space at runtime. If you use it in an online setting, it would find new rewards and add those to its training set. This would allow it to keep optimizing without a need for another algorithm.
@user93237 2 years ago
@@pjbontrager For people not seeing why UDRL can be better than imitation learning: Imitation only learns a state -> action mapping, so it can only perform as well as the teacher. UDRL, OTOH, learns a reward -> action mapping, so one can plug in an even higher reward than the "teacher" or experiences so far, and, hopefully, it will extrapolate to actions that achieve an even higher reward. When training on these achieved rewards, one can train an even better model and then, possibly, one can run it with even higher reward as input, thereby bootstrapping ever better performance.
@HappyMathDad 1 year ago
@@ssssssstssssssss that sounds like an overspecification of RL. There are many problems where we can't really judge progress until we get an answer.
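A sketch of the bootstrapping loop described a few comments above, where the model is repeatedly retrained on its own rollouts conditioned on ever higher returns. `train`, `rollout_with_target`, the dataset format, and the boost factor are all hypothetical placeholders; whether such bootstrapping actually keeps improving is exactly what this thread debates.

# Hypothetical bootstrapping loop for a return-conditioned policy.

def bootstrap(train, rollout_with_target, dataset, rounds=5, boost=1.2):
    for _ in range(rounds):
        model = train(dataset)                        # fit return -> action mapping
        best = max(ret for _, ret in dataset)         # best return seen so far
        traj, achieved = rollout_with_target(model, boost * best)
        dataset.append((traj, achieved))              # keep whatever was actually achieved
    return train(dataset)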
@bdennyw1 3 years ago
If you are watching this video, I'm sure you'd like the Introduction to Reinforcement Learning videos from David Silver. kzfaq.info/get/bejne/aNaHqZp4tNzZlmQ.html
@vivekpadman5248 6 months ago
Look, offline RL is unsolvable, as even we humans are not able to learn something without doing it. Yes, better architectures that need very little online training could be learned, but without online training it's impossible.
@Bill0102 5 months ago
I'm immersed in this. I read a book with a similar theme, and I was completely immersed. "The Art of Saying No: Mastering Boundaries for a Fulfilling Life" by Samuel Dawn
@aniruddhadatta925 3 years ago
I guess this is one step backwards from artificial general intelligence, and since someone has already made the Schmidhuber joke... 😂😂
@marcolehmann6477 3 years ago
😂 "I realized some of you youngsters might not actually know what an LSTM is" kzfaq.info/get/bejne/Y8iliK-ey53IlZc.html