This document is the reference for the Temporal Encoder, one of the most important upstream modules in the project: its embeddings feed multiple downstream risk and analyst models. If its outputs are missing, misaligned, or low quality, the Technical Analyst, Volatility Model, Drawdown Model, and Regime Detection module are all affected.
The Temporal Encoder belongs to the encoder layer of the system.
INPUT DATA FAMILY
└── Time-Series Market Data
    └── features_temporal.csv
        └── Shared Temporal Attention Encoder
            ├── Technical Analyst
            ├── Volatility Model
            ├── Drawdown Risk Model
            └── MTGNN Regime Detection
The encoder does not make a trading decision by itself. Its job is to convert a rolling 30-day market-feature sequence into a dense representation of recent price behaviour.
The output embedding is then reused by downstream modules, which avoids forcing every model to relearn basic temporal market structure from scratch.
Financial market behaviour is sequential: a single daily row cannot capture trends, volatility clustering, or momentum shifts that unfold over days or weeks. The Temporal Encoder therefore converts a 30-day sequence into a compressed vector that represents the recent state of a ticker, giving downstream models access to richer context than raw single-day features.
The project intentionally uses an attention-based encoder rather than making a GNN the main time-series encoder. GNNs are reserved for correlation/contagion and regime graph modules. This preserves clean architectural boundaries:
Temporal Encoder → learns per-ticker temporal behaviour
StemGNN / MTGNN → learns cross-asset or graph-based structure
The main alternatives were LSTM, CNN, and attention-based models.
| Candidate | Strength | Limitation for this project |
|---|---|---|
| LSTM | Good sequential baseline | Processes steps one at a time, limiting GPU parallelism; can struggle with long-range dependencies |
| CNN | Efficient local pattern extraction | Fixed receptive field; does not dynamically decide which days matter |
| Transformer Encoder | Parallel, attention-based, flexible dependency modelling | Requires careful regularisation and enough data |
The project chose a Transformer-style temporal encoder because it can learn which days in a window matter most and can process time steps in parallel on GPU.
This is especially useful for financial data because relevant signals may not always be the most recent row. An abnormal volume spike, a volatility cluster, or a momentum reversal several days earlier may matter more than yesterday’s movement.
The encoder consumes the final engineered market-feature file:
data/yFinance/processed/features_temporal.csv
This file is created by the market data pipeline after yFinance/Stooq/Kaggle filling, calendar alignment, missing-value handling, and no-leakage feature engineering.
The active model uses the following 10 engineered features:
| Feature | Meaning | Use |
|---|---|---|
| `log_return` | Daily log return | Basic return/momentum movement |
| `vol_5d` | 5-day realised volatility | Short-term instability |
| `vol_21d` | 21-day realised volatility | Medium-term instability |
| `rsi_14` | 14-day RSI | Momentum/overbought/oversold signal |
| `macd_hist` | MACD histogram | Trend/momentum divergence |
| `bb_pos` | Bollinger band position | Relative price location in band |
| `volume_ratio` | Volume relative to recent average | Volume abnormality |
| `hl_ratio` | High-low range ratio | Intraday range / volatility proxy |
| `price_pos` | Price position indicator | Relative price state |
| `spy_corr_63d` | Rolling correlation with SPY proxy | Market co-movement/context |
The code checks these fields during inspection. Missing features indicate the market pipeline has not been completed correctly.
Each training or embedding sample is a rolling sequence:
(batch_size, seq_len, n_features)
The final active configuration uses:
seq_len = 30
n_features = 10
So a typical batch has shape:
(batch_size, 30, 10)
Each sample corresponds to one ticker and one end date. The sequence covers the 30 trading days ending on that date.
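A minimal sketch of how such windows can be built for one ticker, assuming a frame keyed by `ticker` and `date`; the `rolling_windows` helper is hypothetical, and the project's real implementation is the `MarketSequenceDataset` referenced later in this document:

```python
import numpy as np
import pandas as pd

SEQ_LEN = 30  # trading days per window (active configuration)

FEATURES = [
    "log_return", "vol_5d", "vol_21d", "rsi_14", "macd_hist",
    "bb_pos", "volume_ratio", "hl_ratio", "price_pos", "spy_corr_63d",
]

def rolling_windows(df: pd.DataFrame):
    """Build an (n_windows, 30, 10) array plus (ticker, date) metadata for one ticker."""
    df = df.sort_values("date").reset_index(drop=True)
    values = df[FEATURES].to_numpy(dtype=np.float32)
    windows, end_dates = [], []
    for end in range(SEQ_LEN, len(df) + 1):
        windows.append(values[end - SEQ_LEN:end])  # 30 consecutive trading days
        end_dates.append(df.loc[end - 1, "date"])  # window labelled by its final date
    manifest = pd.DataFrame({"ticker": df["ticker"].iloc[0], "date": end_dates})
    return np.stack(windows), manifest
```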
The project uses chronological chunks to avoid look-ahead bias.
| Chunk | Training period | Validation period | Test period | Purpose |
|---|---|---|---|---|
| Chunk 1 | 2000–2004 | 2005 | 2006 | Early historical period |
| Chunk 2 | 2007–2014 | 2015 | 2016 | Crisis/post-crisis period |
| Chunk 3 | 2017–2022 | 2023 | 2024 | Recent market period |
This split structure is important because financial data is time ordered. Random train/test splitting would leak future market conditions into training.
The encoder must fit its normalisation and model only on the training portion for the relevant chunk, then apply that fitted state to validation and test embeddings.
Input sequence: (batch, 30, 10)
│
├── Linear input projection: 10 → d_model
│
├── Sinusoidal positional encoding
│
├── Transformer Encoder layers
│
├── Pooling
│ ├── last_hidden
│ ├── mean_pooled
│ └── attention_pooled
│
└── Temporal embedding: (batch, d_model)
The raw 10-dimensional feature vector at each time step is projected into the model dimension:
self.input_projection = nn.Linear(n_input_features, d_model)
This allows the transformer to operate in a richer hidden space.
Because transformer attention does not inherently know sequence order, sinusoidal positional encoding is added to the projected inputs.
The positional encoding lets the model distinguish early, middle, and recent days inside the 30-day window.
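For reference, a minimal sketch of the standard sinusoidal formulation; the project's own implementation may differ in detail:

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Standard sinusoidal position table, shape (seq_len, d_model)."""
    position = torch.arange(seq_len).unsqueeze(1).float()          # (seq_len, 1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
    return pe
```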
The encoder uses PyTorch’s `nn.TransformerEncoderLayer` with `batch_first=True` and `norm_first=True`.

Important architectural clarification: the project’s “no residual shortcuts between major modules” rule does not forbid residual connections inside a Transformer block. Transformer residuals are part of the standard internal mechanism required for stable deep attention training.
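As a concrete illustration, a minimal construction sketch using PyTorch's public API. The layer count, head count, width, and dropout shown are the chunk 3 HPO values reported later in this document; every other argument (e.g. `dim_feedforward`) is left at the PyTorch default, which is an assumption about the real configuration in `code/encoders/temporal_encoder.py`:

```python
import torch.nn as nn

# Chunk 3 HPO values (see best_params_chunk3.json below); remaining
# arguments stay at PyTorch defaults, which is an assumption.
d_model, n_heads, n_layers, dropout = 256, 4, 4, 0.1258

encoder_layer = nn.TransformerEncoderLayer(
    d_model=d_model,
    nhead=n_heads,
    dropout=dropout,
    batch_first=True,   # inputs arrive as (batch, seq_len, d_model)
    norm_first=True,    # pre-norm: LayerNorm before each attention/FFN sublayer
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
```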
The model returns a dictionary rather than a single tensor:
{
"sequence": x,
"last_hidden": last_hidden,
"mean_pooled": mean_pooled,
"attention_pooled": attn_pooled,
}
This design makes the encoder flexible for different downstream modules.
| Output | Shape | Meaning |
|---|---|---|
| `sequence` | `(batch, seq_len, d_model)` | Full hidden sequence |
| `last_hidden` | `(batch, d_model)` | Representation of the latest state |
| `mean_pooled` | `(batch, d_model)` | Average sequence representation |
| `attention_pooled` | `(batch, d_model)` | Learned weighted representation |
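The three pooled views are cheap reductions of the full hidden sequence. The `AttentionPooling` module below is a hypothetical sketch of one common formulation, not necessarily the project's exact code:

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Learned weighted average over time steps (one common formulation)."""
    def __init__(self, d_model: int):
        super().__init__()
        self.score = nn.Linear(d_model, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.score(x), dim=1)  # (batch, seq_len, 1)
        return (weights * x).sum(dim=1)                # (batch, d_model)

def pool_outputs(x: torch.Tensor, attn_pool: AttentionPooling) -> dict:
    # x: full hidden sequence, shape (batch, seq_len, d_model)
    return {
        "sequence": x,
        "last_hidden": x[:, -1, :],        # latest time step
        "mean_pooled": x.mean(dim=1),      # uniform average over days
        "attention_pooled": attn_pool(x),  # learned weighting over days
    }
```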
The final operational embeddings used by downstream modules are 256-dimensional: HPO selected d_model=256 for all three chunks.
The encoder is trained with a self-supervised masked prediction task.
Random time steps are masked and the model is trained to reconstruct the original feature values at masked positions.
Original sequence:
[t1, t2, t3, ..., t30]
Masked input:
[t1, 0, t3, ..., t30]
Target:
recover the true feature vector at the masked position
The loss is mean squared error on masked positions only:
loss = MSE(predicted_masked_values, true_masked_values)
This makes the encoder learn temporal structure without needing hand-written supervised labels.
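A minimal sketch of this objective, assuming the model reconstructs the full feature sequence and using an illustrative `mask_ratio`; the real training loop lives in `code/encoders/temporal_encoder.py`:

```python
import torch
import torch.nn.functional as F

def masked_reconstruction_loss(model, x: torch.Tensor, mask_ratio: float = 0.15) -> torch.Tensor:
    """Mask random time steps, reconstruct them, and score MSE on masked positions only."""
    # x: (batch, seq_len, n_features)
    mask = torch.rand(x.shape[:2], device=x.device) < mask_ratio  # (batch, seq_len)
    x_masked = x.clone()
    x_masked[mask] = 0.0                 # zero out the masked time steps
    pred = model(x_masked)               # assumed to return (batch, seq_len, n_features)
    return F.mse_loss(pred[mask], x[mask])  # MSE over masked positions only
```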
The encoder is shared across multiple downstream tasks, so it should learn a general-purpose temporal representation rather than optimise directly for any single task. Self-supervised masked reconstruction supports this: to fill in masked days, the encoder must internalise the temporal structure of the features themselves rather than a task-specific target.
The encoder uses feature normalisation so that features with large numeric ranges do not dominate the model.
The feature normaliser stores:
mean(feature)
std(feature)
and transforms:
x_normalised = (x - mean) / std
Normalisation must be fitted on the training split only for a chunk.
Validation and test embeddings must reuse the training-fitted normaliser. They must not fit their own normalisers using validation/test data because that would leak future distribution information into inference.
In the final operational run, train-only normalisers were saved under each chunk model folder, for example:
outputs/models/TemporalEncoder/chunk2/normalizer.npz
outputs/models/TemporalEncoder/chunk3/normalizer.npz
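A hypothetical fit/save/reload sketch of this discipline. The `train_features.npy`/`val_features.npy` inputs, the epsilon guard, and the `mean`/`std` key names are illustrative assumptions; the transform itself matches the formula above:

```python
import numpy as np

train_features = np.load("train_features.npy")  # hypothetical (n_rows, 10) training matrix
val_features = np.load("val_features.npy")      # hypothetical validation matrix

# Fit on the training split only.
mean = train_features.mean(axis=0)
std = train_features.std(axis=0) + 1e-8  # epsilon guard is an assumption

# Save the per-chunk, train-fitted state (key names assumed).
np.savez("outputs/models/TemporalEncoder/chunk2/normalizer.npz", mean=mean, std=std)

# Validation/test embedding runs reload the train-fitted state instead of refitting.
stats = np.load("outputs/models/TemporalEncoder/chunk2/normalizer.npz")
val_normalised = (val_features - stats["mean"]) / stats["std"]
```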
The Temporal Encoder uses Optuna TPE search before final training.
The encoder is upstream of many models; poor temporal embeddings degrade the Technical Analyst, Volatility Model, Drawdown Risk Model, and Regime Detection outputs alike.
HPO is therefore not optional for the thesis-quality version.
The code searches over:
| Parameter | Search space / type |
|---|---|
| `n_layers` | 2 to 6 |
| `n_heads` | 2, 4, 8 |
| `d_model` | 64, 128, 256 |
| `dropout` | continuous range |
| `attention_dropout` | continuous range |
| `learning_rate` | log-scale range |
| `weight_decay` | log-scale range |
| `warmup_steps` | candidate values |
| `batch_size` | candidate values |
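A hedged sketch of what the Optuna TPE objective could look like. The discrete choices match the table above; the continuous/log-scale bounds, the warmup and batch-size candidates, the trial count, and the `train_and_validate` helper are all illustrative assumptions:

```python
import optuna

def objective(trial: optuna.Trial) -> float:
    params = {
        "n_layers": trial.suggest_int("n_layers", 2, 6),
        "n_heads": trial.suggest_categorical("n_heads", [2, 4, 8]),
        "d_model": trial.suggest_categorical("d_model", [64, 128, 256]),
        "dropout": trial.suggest_float("dropout", 0.0, 0.3),
        "attention_dropout": trial.suggest_float("attention_dropout", 0.0, 0.3),
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True),
        "weight_decay": trial.suggest_float("weight_decay", 1e-6, 1e-3, log=True),
        "warmup_steps": trial.suggest_categorical("warmup_steps", [500, 1000, 2000]),
        "batch_size": trial.suggest_categorical("batch_size", [32, 64, 128]),
    }
    return train_and_validate(params)  # hypothetical: short training run, returns val loss

study = optuna.create_study(direction="minimize", sampler=optuna.samplers.TPESampler(seed=42))
study.optimize(objective, n_trials=50)
```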
The best parameters are saved under:
outputs/codeResults/TemporalEncoder/hpo/best_params_chunk1.json
outputs/codeResults/TemporalEncoder/hpo/best_params_chunk2.json
outputs/codeResults/TemporalEncoder/hpo/best_params_chunk3.json
Chunk 1:
{
"params": {
"n_layers": 3,
"n_heads": 2,
"d_model": 256,
"dropout": 0.12946407491739656,
"attention_dropout": 0.17204918134328429,
"learning_rate": 0.00032050950331890453,
"weight_decay": 0.0003087360648369751,
"warmup_steps": 1000,
"batch_size": 64
},
"value": 1.7111254166455785
}
Chunk 2:
{
"params": {
"n_layers": 6,
"n_heads": 8,
"d_model": 256,
"dropout": 0.17854162005663654,
"attention_dropout": 0.19488416358230604,
"learning_rate": 0.0004982755066410893,
"weight_decay": 0.0002305298448157034,
"warmup_steps": 1000,
"batch_size": 32
},
"value": 0.17857432030974485
}
Chunk 3:
{
"params": {
"n_layers": 4,
"n_heads": 4,
"d_model": 256,
"dropout": 0.12580726647987625,
"attention_dropout": 0.1243864375089663,
"learning_rate": 0.00045877596861583427,
"weight_decay": 9.225880372153381e-06,
"warmup_steps": 1000,
"batch_size": 32
},
"value": 0.19655309588954853
}
Each chunk stores model checkpoints under:
outputs/models/TemporalEncoder/chunk{n}/
Expected files include:
best_model.pt
latest_model.pt
training_history.csv
training_summary.json
effective_config.json
normalizer.npz
model_freezed/model.pt
model_unfreezed/model.pt
The training function supports resuming from:
latest_model.pt
training_history.csv
If interrupted, the next run can continue from the last saved epoch. The training history is saved incrementally so progress is not lost when a remote session disconnects.
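A hypothetical resume sketch, assuming `model` and `optimizer` are already constructed and that `latest_model.pt` stores `model`/`optimizer`/`epoch` entries (the real checkpoint layout may differ):

```python
from pathlib import Path
import pandas as pd
import torch

ckpt_dir = Path("outputs/models/TemporalEncoder/chunk2")

start_epoch = 0
latest = ckpt_dir / "latest_model.pt"
if latest.exists():
    state = torch.load(latest, map_location="cpu")
    model.load_state_dict(state["model"])          # assumed checkpoint keys
    optimizer.load_state_dict(state["optimizer"])
    start_epoch = state["epoch"] + 1

# training_history.csv is appended incrementally, so whatever exists so far is usable.
history_path = ckpt_dir / "training_history.csv"
history = pd.read_csv(history_path) if history_path.exists() else pd.DataFrame()
```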
The Temporal Encoder training stage was the slowest part of the downstream setup because full training over millions of 30-day windows is expensive. In practice, once the validation loss had plateaued and a usable best_model.pt existed, embeddings could be generated from the best checkpoint.
This was especially important for Chunk 2 and Chunk 3, where the project needed embeddings urgently to unblock downstream risk modules.
The key practical rule is:
If best_model.pt exists, validation loss has stabilised, and embeddings are the blocking dependency, generate embeddings from best_model.pt rather than waiting for unnecessary extra epochs.
Embeddings are saved under:
outputs/embeddings/TemporalEncoder/
For every chunk and split, the encoder produces:
chunk{n}_{split}_embeddings.npy
chunk{n}_{split}_manifest.csv
Example:
outputs/embeddings/TemporalEncoder/chunk2_train_embeddings.npy
outputs/embeddings/TemporalEncoder/chunk2_train_manifest.csv
The completed production run produced finite 256-dimensional embeddings for all chunks.
| Chunk | Split | Embedding shape | Manifest shape |
|---|---|---|---|
| Chunk 1 | train | (3,065,000, 256) | (3,065,000, 2) |
| Chunk 1 | val | (555,000, 256) | (555,000, 2) |
| Chunk 1 | test | (552,500, 256) | (552,500, 2) |
| Chunk 2 | train | (4,960,000, 256) | (4,960,000, 2) |
| Chunk 2 | val | (555,000, 256) | (555,000, 2) |
| Chunk 2 | test | (555,000, 256) | (555,000, 2) |
| Chunk 3 | train | (3,700,000, 256) | (3,700,000, 2) |
| Chunk 3 | val | (550,000, 256) | (550,000, 2) |
| Chunk 3 | test | (547,500, 256) | (547,500, 2) |
All sampled embeddings were verified as finite during the final audit.
The .npy embedding arrays contain only numeric vectors. Downstream modules need to know:
embedding row i → which ticker and which date?
This mapping is stored in manifest files:
chunk{n}_{split}_manifest.csv
The minimal manifest columns are:
ticker,date
Some manifest-building tools may also include:
seq_start,seq_end
The manifest reconstructs the same rolling-window order used by the MarketSequenceDataset:
For each ticker:
sort by date
build 30-day windows
assign the embedding date to the final date of the window
The helper script:
code/encoders/build_embedding_manifest.py
exists to rebuild manifest files if embeddings already exist but row metadata is missing.
For every split:
len(embeddings) == len(manifest)
If this is false, downstream model training must not proceed until alignment is fixed.
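A short guard that downstream loaders can run before training (paths shown for the chunk 2 train split):

```python
import numpy as np
import pandas as pd

emb = np.load("outputs/embeddings/TemporalEncoder/chunk2_train_embeddings.npy", mmap_mode="r")
manifest = pd.read_csv("outputs/embeddings/TemporalEncoder/chunk2_train_manifest.csv")

assert len(emb) == len(manifest), (
    f"alignment broken: {len(emb)} embedding rows vs {len(manifest)} manifest rows"
)
```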
The Temporal Encoder contributes XAI at the embedding-generation stage.
The attention pooling mechanism identifies which time steps in the rolling window mattered most for the embedding.
Expected outputs:
outputs/results/TemporalEncoder/xai/chunk{n}_{split}_attention_weights.npy
outputs/results/TemporalEncoder/xai/chunk{n}_{split}_attention_weights.csv
These files help answer:
Which days in the 30-day window were most important for the temporal representation?
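A hypothetical inspection of those files, assuming the saved array has shape `(n_samples, seq_len)`:

```python
import numpy as np

weights = np.load("outputs/results/TemporalEncoder/xai/chunk2_val_attention_weights.npy")
mean_per_position = weights.mean(axis=0)               # average attention per window day
top_positions = np.argsort(mean_per_position)[::-1][:5]
print("most attended window positions (0..29, 29 = end date):", top_positions)
```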
Gradient-based importance is computed over a small sample of embeddings.
Expected outputs:
outputs/results/TemporalEncoder/xai/chunk{n}_{split}_feature_importance.npy
outputs/results/TemporalEncoder/xai/chunk{n}_{split}_feature_importance.csv
These files help answer:
Which engineered market features had the strongest influence on the embedding?
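A hypothetical saliency sketch; the exact gradient target the project uses is not specified here, so the embedding-norm objective below is an assumption (`model` and `batch` are assumed in scope):

```python
import torch

x = batch.clone().requires_grad_(True)       # (batch, 30, 10) normalised inputs
embedding = model(x)["attention_pooled"]     # (batch, d_model)
embedding.norm(dim=1).sum().backward()       # gradient of embedding magnitude w.r.t. inputs
feature_importance = x.grad.abs().mean(dim=(0, 1))  # (10,) one score per input feature
```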
Temporal Encoder XAI should be interpreted as representation-level explanation, not final decision explanation.
The encoder explains what shaped the embedding. It does not explain the final Buy/Hold/Sell decision. Final explanation is produced later by module-level XAI, position sizing XAI, quantitative/qualitative synthesis, and fusion explanation.
Technical Analyst: consumes Temporal Encoder embeddings and learns directional technical scores (`trend_score`, `momentum_score`, `timing_confidence`).

Volatility Model: consumes embeddings as learned market-state features and predicts the future volatility outputs used by the risk engine and position sizing.

Drawdown Risk Model: consumes embeddings to estimate expected drawdown and related downside path-risk signals.

MTGNN Regime Detection: consumes temporal embeddings, together with FinBERT/text and macro context, to help classify market regime state.
code/encoders/temporal_encoder.py
code/encoders/build_embedding_manifest.py
data/yFinance/processed/features_temporal.csv
outputs/codeResults/TemporalEncoder/hpo/best_params_chunk1.json
outputs/codeResults/TemporalEncoder/hpo/best_params_chunk2.json
outputs/codeResults/TemporalEncoder/hpo/best_params_chunk3.json
outputs/models/TemporalEncoder/chunk1/
outputs/models/TemporalEncoder/chunk2/
outputs/models/TemporalEncoder/chunk3/
outputs/embeddings/TemporalEncoder/chunk1_train_embeddings.npy
outputs/embeddings/TemporalEncoder/chunk1_train_manifest.csv
...
outputs/embeddings/TemporalEncoder/chunk3_test_embeddings.npy
outputs/embeddings/TemporalEncoder/chunk3_test_manifest.csv
outputs/results/TemporalEncoder/xai/
All commands below are single-line commands to match the project execution preference.
cd ~/fin-glassbox && python code/encoders/temporal_encoder.py inspect --repo-root .
cd ~/fin-glassbox && python code/encoders/temporal_encoder.py hpo --repo-root . --chunk 1 --device cuda
cd ~/fin-glassbox && python code/encoders/temporal_encoder.py train-best --repo-root . --chunk 1 --device cuda
cd ~/fin-glassbox && python code/encoders/temporal_encoder.py train-best --repo-root . --chunk 1 --device cuda && python code/encoders/temporal_encoder.py train-best --repo-root . --chunk 2 --device cuda && python code/encoders/temporal_encoder.py train-best --repo-root . --chunk 3 --device cuda
If the active code version supports performance flags, use a large embedding batch:
cd ~/fin-glassbox && python code/encoders/temporal_encoder.py embed --chunk 1 --split train --device cuda --batch-size 4096 --num-workers 8 --prefetch-factor 4 && python code/encoders/temporal_encoder.py embed --chunk 1 --split val --device cuda --batch-size 4096 --num-workers 8 --prefetch-factor 4 && python code/encoders/temporal_encoder.py embed --chunk 1 --split test --device cuda --batch-size 4096 --num-workers 8 --prefetch-factor 4
cd ~/fin-glassbox && python code/encoders/build_embedding_manifest.py
cd ~/fin-glassbox && python -c "import numpy as np,pandas as pd,pathlib; base=pathlib.Path('outputs/embeddings/TemporalEncoder'); files=['chunk1_train','chunk1_val','chunk1_test','chunk2_train','chunk2_val','chunk2_test','chunk3_train','chunk3_val','chunk3_test']; [print(f, np.load(base/f'{f}_embeddings.npy',mmap_mode='r').shape, pd.read_csv(base/f'{f}_manifest.csv').shape, 'finite_sample=', float(np.isfinite(np.load(base/f'{f}_embeddings.npy',mmap_mode='r')[:10000]).mean())) for f in files]"
The Temporal Encoder is considered complete only if all of the following are true:
| Check | Required result |
|---|---|
| `features_temporal.csv` exists | Yes |
| All 10 input features exist | Yes |
| Chunk HPO files exist | Yes, for chunks used in final system |
| `best_model.pt` exists | Yes, per chunk |
| `model_freezed/model.pt` exists | Yes, per chunk |
| `normalizer.npz` exists | Yes, per chunk |
| Train/val/test embeddings exist | Yes, per chunk |
| Train/val/test manifests exist | Yes, per chunk |
| Embedding rows equal manifest rows | Yes |
| Embedding finite sample ratio | 1.0 expected |
| XAI sample files exist | Strongly preferred |
Stalled or interrupted training happened during production. The practical fix was to use the best available checkpoint and generate embeddings directly once validation loss had plateaued.
Also check manifest alignment: `len(embeddings)` must equal `len(manifest)`, the manifest columns must include ticker/date, and the dates must be parseable.
If validation/test embedding uses a split-fitted normaliser, leakage can occur. Rebuild or copy the train-only normaliser for that chunk.
The downstream modules eventually require all chunks. If only Chunk 1 exists, the Temporal Encoder is not complete for final backtesting.
At the current final project state, the Temporal Encoder is complete for all three chunks:
Chunk 1 train/val/test embeddings + manifests: complete
Chunk 2 train/val/test embeddings + manifests: complete
Chunk 3 train/val/test embeddings + manifests: complete
This unblocked the rest of the risk engine and analyst stack.
The Temporal Encoder is the project’s shared market-sequence representation model. It converts 30-day sequences of engineered market features into 256-dimensional embeddings used by technical, volatility, drawdown, and regime modules.
Its importance comes from being upstream of several models: the final system depends on its embeddings being present, finite, and correctly aligned with their manifests for every chunk. This module is now a completed core encoder component in the fin-glassbox architecture.