A Very Simple Transformer Encoder for Time Series Forecasting in PyTorch

3,340 views

Let's Learn Transformers Together


1 day ago

The purpose of this video is to dissect and learn about the "Attention Is All You Need" transformer model by using bare-bones PyTorch classes to forecast time series data.
Code Repo:
github.com/BrandenKeck/pytorc...
Very helpful:
github.com/oliverguhr/transfo...
github.com/ctxj/Time-Series-T...
github.com/huggingface/transf...
Attention Is All You Need:
arxiv.org/pdf/1706.03762.pdf
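To make the setup concrete, here is a minimal sketch of the kind of encoder-only model the video describes: a per-point linear embedding, positional embeddings, an nn.TransformerEncoder, and a linear forecasting head. The class name, layer sizes, and the learned positional embedding are illustrative assumptions, not the exact code in the repo.

```python
import torch
import torch.nn as nn

class SimpleTimeSeriesTransformer(nn.Module):
    """Encoder-only transformer for univariate forecasting (illustrative only)."""

    def __init__(self, d_model=64, nhead=4, num_layers=2, seq_len=48, horizon=1):
        super().__init__()
        # Each scalar observation is projected to a d_model-dimensional vector
        self.embed = nn.Linear(1, d_model)
        # Learned positional embeddings for positions 0..seq_len-1
        self.pos = nn.Parameter(torch.zeros(1, seq_len, d_model))
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Map the encoding of the last position to `horizon` future values
        self.head = nn.Linear(d_model, horizon)

    def forward(self, x):
        # x: (batch, seq_len) of raw observations
        z = self.embed(x.unsqueeze(-1)) + self.pos   # (batch, seq_len, d_model)
        z = self.encoder(z)                          # (batch, seq_len, d_model)
        return self.head(z[:, -1, :])                # (batch, horizon)

model = SimpleTimeSeriesTransformer()
forecast = model(torch.randn(8, 48))                 # -> (8, 1)
```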

Comments: 9
@mohamedkassar7441 6 days ago
Thanks!
@thouys9069 1 month ago
nice man! it's these case studies that really generate insight. good stuff
@lets_learn_transformers 1 month ago
Thank you!
@jeanlannes4522 1 month ago
Hello man, great videos. Really helpful links. I have a question: do you pass every time series datapoint (for every single batch) through a linear layer? What is the intuition behind this "dimension augmentation", if I may call it that? I see a lot of Conv1D being used and am trying to understand how to perform a good embedding. I feel like most papers on TSF with transformers aren't clear on this matter.
@lets_learn_transformers 1 month ago
Hi @jeanlannes4522 - thank you! You are correct: each element of each time series is embedded "individually". Conv1D may be a better embedding approach for many (possibly most/all) problems. I used the linear approach because it was easy for me to understand, as it is almost an exact analog of word embedding with PyTorch's nn.Embedding() layer. The intuition (as far as I understand it) is that the model learns a vector representation for each individual "datapoint". When the datapoints are words in an NLP problem, these vectors are a great measure of similarity between two words. For a problem with continuous data this makes less sense, because you could just as easily measure similarity with the simple distance between two points. So when the Linear layer learns that something like 0.55 and 0.56 are similar, it's not as meaningful. One could argue that Conv1D performs a similar task, but it considers neighboring values in the embedding process, so it could generate "smarter" embeddings, e.g. 0.55 on an "increasing trajectory/slope" is different from 0.55 on a "decreasing trajectory/slope". This is something that I may try on my own now that you mention it! Do you mind sharing any sources where this is used, if you have them on hand?
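For reference, here is a sketch of the two embedding options discussed in this reply, with illustrative tensor shapes and a kernel size that are assumptions rather than the repo's settings: a per-point nn.Linear projection versus an nn.Conv1d that also sees neighboring values.

```python
import torch
import torch.nn as nn

x = torch.randn(8, 48)   # (batch, seq_len) of raw values
d_model = 64

# (a) Per-point embedding: each scalar is projected independently,
#     roughly analogous to nn.Embedding() for word tokens.
point_embed = nn.Linear(1, d_model)
z_linear = point_embed(x.unsqueeze(-1))                  # (8, 48, 64)

# (b) Conv1D embedding: each position's vector also depends on its
#     neighbours, so 0.55 on an increasing slope can embed differently
#     from 0.55 on a decreasing slope.
conv_embed = nn.Conv1d(in_channels=1, out_channels=d_model, kernel_size=3, padding=1)
z_conv = conv_embed(x.unsqueeze(1)).transpose(1, 2)      # (8, 48, 64)
```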
@jeanlannes4522 1 month ago
@@lets_learn_transformers Thanks for your answer. There is a philosophical question that remains: if every word has a meaning, does a single datapoint of a time series have one too? Or only a sequence of these datapoints? Should you tokenize your time series at the single-datapoint scale, or at the scale of a few points to capture a little meaning (like a pattern: increasing, flat, decreasing, volatile, etc.)? But then how do you compress your data? The question of multivariate time series also remains (what if we have p features, p > 1?). One could argue that some words taken alone do not have a "meaning" (it, 's, _, ', .)... It is a difficult question. To get back to what you are doing: are you training the weights of your nn.Linear(1, embed_size) with the big transformer backprop? Just to make sure I understand what you are doing. I am not sure augmenting the dimension of a single datapoint makes sense; I really think you have to work with sub-windows of the original time series. But who knows... I believe Conv1D is interesting too. I don't know if one is allowed to leak future neighboring values, but at least the past values can add meaning to the datapoint embedding, as you say: an "increasing trajectory" added to a given value. The first time I read about this being used was in "MTS-Mixers: Multivariate Time Series Forecasting via Factorized Temporal and Channel Mixing" and "Financial Time Series Forecasting using CNN and Transformer".
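On the concern about leaking future neighboring values: one standard way to avoid it is a causal convolution that pads only on the left, so each position's embedding depends on past and present values alone. The sketch below illustrates that idea; the class name and sizes are made up for illustration and are not taken from the video or the repo.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConvEmbedding(nn.Module):
    """Conv1D embedding that only sees past/present values (no future leak)."""

    def __init__(self, d_model=64, kernel_size=3):
        super().__init__()
        self.left_pad = kernel_size - 1              # pad on the left only
        self.conv = nn.Conv1d(1, d_model, kernel_size)

    def forward(self, x):                            # x: (batch, seq_len)
        x = x.unsqueeze(1)                           # (batch, 1, seq_len)
        x = F.pad(x, (self.left_pad, 0))             # causal: no right padding
        return self.conv(x).transpose(1, 2)          # (batch, seq_len, d_model)

z = CausalConvEmbedding()(torch.randn(8, 48))        # -> (8, 48, 64)
```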
@lets_learn_transformers 1 month ago
@@jeanlannes4522 I completely agree - thank you for a great discussion. The nn.Linear weights are trained via backprop upstream from the Transformer Encoder. It's possible this behaves OK only because I'm using a very small Transformer; the linear layer might be far too simple with a larger model. I ran some experiments on the sunspots data and found the two approaches comparable - but since I'm not going in depth with hyperparameters or early stopping, it's hard to tell how good the results are. Do you mind if I make a short follow-up video about this discussion? Would you like your name included / not included in the video?
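To illustrate the "trained via backprop upstream from the Transformer Encoder" point: because the embedding is an ordinary nn.Linear module, its weights sit in the same parameter list as the encoder and are updated by the same optimizer step. The layer sizes and dummy data below are illustrative assumptions, not the repo's settings.

```python
import torch
import torch.nn as nn

# Stand-in pipeline: Linear embedding -> TransformerEncoder -> linear head.
d_model = 32
embed = nn.Linear(1, d_model)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=1)
head = nn.Linear(d_model, 1)

params = list(embed.parameters()) + list(encoder.parameters()) + list(head.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)
loss_fn = nn.MSELoss()

x, y = torch.randn(8, 48), torch.randn(8, 1)           # dummy window / next-step target
optimizer.zero_grad()
pred = head(encoder(embed(x.unsqueeze(-1)))[:, -1, :])
loss = loss_fn(pred, y)
loss.backward()          # gradients flow through the encoder...
optimizer.step()         # ...and update the embedding weights in the same step
```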
@isakwangensteen6577 1 month ago
When you say you extended the forecasting window, do you mean that the model now outputs more time step predictions, or are you still just predicting one timestep into the future and unrolling the model for more days?
@lets_learn_transformers 1 month ago
Hi @isakwangensteen6577 - sorry for the lack of clarity. I mean that the model now outputs more time step predictions!
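For readers following along, here is a sketch contrasting the two interpretations raised in the question: a direct multi-step head (what the reply says is being done) versus autoregressive unrolling of a one-step model. The `one_step_model` in the helper is a hypothetical one-step forecaster, not code from the repo.

```python
import torch
import torch.nn as nn

# (a) Direct multi-step: widen the output head so one forward pass emits
#     `horizon` future values at once (the approach described in the reply).
horizon, d_model = 7, 64
multi_step_head = nn.Linear(d_model, horizon)     # encoded state -> 7 predictions

# (b) Autoregressive unrolling: keep a one-step model and feed each
#     prediction back in as the newest observation.
def unroll(one_step_model, window, horizon):
    window = window.clone()                       # (batch, seq_len)
    preds = []
    for _ in range(horizon):
        step = one_step_model(window)             # (batch, 1) one-step forecast
        preds.append(step)
        window = torch.cat([window[:, 1:], step], dim=1)   # slide the window
    return torch.cat(preds, dim=1)                # (batch, horizon)
```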