fin-glassbox

Shared Temporal Attention Encoder

1. Document Purpose

This document is the reference for the Shared Temporal Attention Encoder: what it does, how it is trained, which files it reads and writes, and how its outputs are validated.

The Temporal Encoder is one of the most important upstream modules in the project because its embeddings feed multiple downstream risk and analyst models. If its outputs are missing, misaligned, or low quality, the Technical Analyst, Volatility Model, Drawdown Model, and Regime Detection module are all affected.


2. Role in the Full Architecture

The Temporal Encoder belongs to the encoder layer of the system.

INPUT DATA FAMILY
└── Time-Series Market Data
    └── features_temporal.csv
        └── Shared Temporal Attention Encoder
            ├── Technical Analyst
            ├── Volatility Model
            ├── Drawdown Risk Model
            └── MTGNN Regime Detection

The encoder does not make a trading decision by itself. Its job is to convert a rolling 30-day market-feature sequence into a dense representation of recent price behaviour.

The output embedding is then reused by downstream modules, which avoids forcing every model to relearn basic temporal market structure from scratch.


3. Why This Module Exists

Financial market behaviour is sequential. A single daily row is not enough to capture trends, volatility clustering, or momentum reversals that develop over several days.

The Temporal Encoder converts a 30-day sequence into a compressed vector that represents the recent state of a ticker. This gives downstream models access to richer context than raw single-day features.

The project intentionally uses an attention-based encoder rather than making a GNN the main time-series encoder. GNNs are reserved for correlation/contagion and regime graph modules. This preserves clean architectural boundaries:

Temporal Encoder  → learns per-ticker temporal behaviour
StemGNN / MTGNN   → learns cross-asset or graph-based structure

4. Design Decision: Why Transformer-Based Attention

The main candidates were LSTM, CNN, and Transformer-style attention encoders.

| Candidate | Strength | Limitation for this project |
| --- | --- | --- |
| LSTM | Good sequential baseline | Processes time steps sequentially; can be slow and can struggle with long-range dependencies and parallel GPU utilisation |
| CNN | Efficient local pattern extraction | Fixed receptive field; does not dynamically decide which days matter |
| Transformer encoder | Parallel, attention-based, flexible dependency modelling | Requires careful regularisation and enough data |

The project chose a Transformer-style temporal encoder because it can learn which days in a window matter most and can process time steps in parallel on GPU.

This is especially useful for financial data because relevant signals may not always be the most recent row. An abnormal volume spike, a volatility cluster, or a momentum reversal several days earlier may matter more than yesterday’s movement.


5. Input Data

5.1 Source file

The encoder consumes the final engineered market-feature file:

data/yFinance/processed/features_temporal.csv

This file is created by the market data pipeline after yFinance/Stooq/Kaggle filling, calendar alignment, missing-value handling, and no-leakage feature engineering.

5.2 Required columns

The active model uses the following 10 engineered features:

| Feature | Meaning | Use |
| --- | --- | --- |
| log_return | Daily log return | Basic return/momentum movement |
| vol_5d | 5-day realised volatility | Short-term instability |
| vol_21d | 21-day realised volatility | Medium-term instability |
| rsi_14 | 14-day RSI | Momentum/overbought/oversold signal |
| macd_hist | MACD histogram | Trend/momentum divergence |
| bb_pos | Bollinger band position | Relative price location in band |
| volume_ratio | Volume relative to recent average | Volume abnormality |
| hl_ratio | High-low range ratio | Intraday range / volatility proxy |
| price_pos | Price position indicator | Relative price state |
| spy_corr_63d | Rolling correlation with SPY proxy | Market co-movement/context |

The inspection command checks for these fields. Missing features indicate that the market data pipeline has not completed correctly.

5.3 Input shape

Each training or embedding sample is a rolling sequence:

(batch_size, seq_len, n_features)

The final active configuration uses:

seq_len = 30
n_features = 10

So a typical batch has shape:

(batch_size, 30, 10)

Each sample corresponds to one ticker and one end date. The sequence covers the 30 trading days ending on that date.
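
As an illustration only, the sketch below shows one way the rolling windows and their (ticker, date) labels could be assembled. The real MarketSequenceDataset may order tickers and windows differently; the 'ticker' and 'date' column names in features_temporal.csv are assumptions here.

import numpy as np
import pandas as pd

FEATURES = ["log_return", "vol_5d", "vol_21d", "rsi_14", "macd_hist",
            "bb_pos", "volume_ratio", "hl_ratio", "price_pos", "spy_corr_63d"]
SEQ_LEN = 30

def build_windows(df: pd.DataFrame) -> tuple[np.ndarray, pd.DataFrame]:
    """Build (n_samples, 30, 10) windows plus a (ticker, date) manifest."""
    windows, meta = [], []
    for ticker, group in df.sort_values("date").groupby("ticker"):
        values = group[FEATURES].to_numpy(dtype=np.float32)
        dates = group["date"].to_numpy()
        for end in range(SEQ_LEN, len(group) + 1):
            windows.append(values[end - SEQ_LEN:end])   # 30 consecutive trading days
            meta.append((ticker, dates[end - 1]))       # window is labelled by its last date
    manifest = pd.DataFrame(meta, columns=["ticker", "date"])
    return np.stack(windows), manifest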


6. Chronological Split Design

The project uses chronological chunks to avoid look-ahead bias.

| Chunk | Training period | Validation period | Test period | Purpose |
| --- | --- | --- | --- | --- |
| Chunk 1 | 2000–2004 | 2005 | 2006 | Early historical period |
| Chunk 2 | 2007–2014 | 2015 | 2016 | Crisis/post-crisis period |
| Chunk 3 | 2017–2022 | 2023 | 2024 | Recent market period |

This split structure is important because financial data is time ordered. Random train/test splitting would leak future market conditions into training.

The encoder must fit its normalisation and model only on the training portion for the relevant chunk, then apply that fitted state to validation and test embeddings.
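
As a small sketch, applying the Chunk 2 boundaries from the table above could look like the snippet below; the exact boundary dates used in the real pipeline and the handling of the 'date' column are assumptions.

import pandas as pd

def split_chunk2(df: pd.DataFrame) -> dict[str, pd.DataFrame]:
    """Chronological split for Chunk 2: train 2007-2014, val 2015, test 2016."""
    date = pd.to_datetime(df["date"])
    return {
        "train": df[(date >= "2007-01-01") & (date <= "2014-12-31")],
        "val": df[(date >= "2015-01-01") & (date <= "2015-12-31")],
        "test": df[(date >= "2016-01-01") & (date <= "2016-12-31")],
    }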


7. Model Architecture

7.1 High-level structure

Input sequence: (batch, 30, 10)
    │
    ├── Linear input projection: 10 → d_model
    │
    ├── Sinusoidal positional encoding
    │
    ├── Transformer Encoder layers
    │
    ├── Pooling
    │   ├── last_hidden
    │   ├── mean_pooled
    │   └── attention_pooled
    │
    └── Temporal embedding: (batch, d_model)

7.2 Input projection

The raw 10-dimensional feature vector at each time step is projected into the model dimension:

self.input_projection = nn.Linear(n_input_features, d_model)

This allows the transformer to operate in a richer hidden space.

7.3 Positional encoding

Because transformer attention does not inherently know sequence order, sinusoidal positional encoding is added to the projected inputs.

The positional encoding lets the model distinguish early, middle, and recent days inside the 30-day window.
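
A minimal sketch of a standard sinusoidal positional encoding module is shown below. The project's exact implementation may differ; this is the textbook sine/cosine formulation added to the projected inputs.

import math
import torch
import torch.nn as nn

class SinusoidalPositionalEncoding(nn.Module):
    """Fixed sine/cosine position signal added to the projected inputs."""

    def __init__(self, d_model: int, max_len: int = 30):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1)                 # (max_len, 1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe)                                # fixed, not learned

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); add the encoding for the first seq_len positions
        return x + self.pe[: x.size(1)]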

7.4 Transformer encoder

The encoder stacks PyTorch’s nn.TransformerEncoderLayer blocks; the number of layers, attention heads, model dimension, and dropout rates are selected per chunk by HPO (section 10).

Important architectural clarification:

The project’s “no residual shortcuts between major modules” rule does not forbid residual connections inside a Transformer block. Transformer residuals are part of the standard internal mechanism required for stable deep attention training.
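
For orientation, a sketch of how the encoder stack could be built is shown below, using the chunk 2 HPO values from section 10.3 (dropout rounded). The dim_feedforward width and batch_first setting are assumptions, and nn.TransformerEncoderLayer exposes only a single dropout argument, so how the project applies its separate attention_dropout is not shown here.

import torch.nn as nn

d_model = 256
encoder_layer = nn.TransformerEncoderLayer(
    d_model=d_model,
    nhead=8,                       # chunk 2 HPO result
    dim_feedforward=4 * d_model,   # assumption: conventional 4x feed-forward width
    dropout=0.1785,                # chunk 2 HPO result, rounded
    batch_first=True,              # keeps tensors as (batch, seq_len, d_model)
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)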

7.5 Pooling outputs

The model returns a dictionary rather than a single tensor:

{
    "sequence": x,
    "last_hidden": last_hidden,
    "mean_pooled": mean_pooled,
    "attention_pooled": attn_pooled,
}

This design makes the encoder flexible for different downstream modules.

| Output | Shape | Meaning |
| --- | --- | --- |
| sequence | (batch, seq_len, d_model) | Full hidden sequence |
| last_hidden | (batch, d_model) | Representation of the latest state |
| mean_pooled | (batch, d_model) | Average sequence representation |
| attention_pooled | (batch, d_model) | Learned weighted representation |

The final operational embeddings used by downstream modules are 256-dimensional, after HPO selected d_model=256 for every chunk (see section 10.3).
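
A minimal sketch of how the three pooled outputs can be produced from the hidden sequence is shown below; the single-layer scoring head used for attention pooling is an assumption about the implementation.

import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Learned weighted average over time steps."""

    def __init__(self, d_model: int):
        super().__init__()
        self.score = nn.Linear(d_model, 1)   # one scalar relevance score per time step

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        weights = torch.softmax(self.score(x), dim=1)   # (batch, seq_len, 1)
        return (weights * x).sum(dim=1)                 # (batch, d_model)

# Given the encoder output x of shape (batch, seq_len, d_model):
#   last_hidden      = x[:, -1, :]      latest time step
#   mean_pooled      = x.mean(dim=1)    simple average over the window
#   attention_pooled = AttentionPooling(d_model)(x)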


8. Training Objective

The encoder is trained with a self-supervised masked prediction task.

8.1 Masked temporal reconstruction

Random time steps are masked and the model is trained to reconstruct the original feature values at masked positions.

Original sequence:
[t1, t2, t3, ..., t30]

Masked input:
[t1, 0, t3, ..., t30]

Target:
recover the true feature vector at the masked position

The loss is mean squared error on masked positions only:

loss = MSE(predicted_masked_values, true_masked_values)

This makes the encoder learn temporal structure without needing hand-written supervised labels.
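
A minimal sketch of that loss, assuming a boolean mask that marks the zeroed time steps:

import torch

def masked_reconstruction_loss(pred: torch.Tensor,
                               target: torch.Tensor,
                               mask: torch.Tensor) -> torch.Tensor:
    """MSE over masked time steps only.

    pred, target: (batch, seq_len, n_features); mask: (batch, seq_len) bool,
    True where the input time step was masked out.
    """
    per_step = ((pred - target) ** 2).mean(dim=-1)   # (batch, seq_len) per-step error
    return per_step[mask].mean()                     # average over masked positions only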

8.2 Why self-supervised training is suitable here

The encoder is shared across multiple downstream tasks, so it should learn a general-purpose temporal representation rather than optimising directly for one task only.

Self-supervised masked reconstruction pushes the encoder to learn how the engineered features evolve and co-move inside a window, and that general temporal structure transfers across the downstream tasks without task-specific labels.


9. Normalisation and Leakage Control

The encoder uses feature normalisation so that features with large numeric ranges do not dominate the model.

9.1 Normalisation rule

The feature normaliser stores:

mean(feature)
std(feature)

and transforms:

x_normalised = (x - mean) / std

9.2 No-leakage rule

Normalisation must be fitted on the training split only for a chunk.

Validation and test embeddings must reuse the training-fitted normaliser. They must not fit their own normalisers using validation/test data because that would leak future distribution information into inference.

In the final operational run, train-only normalisers were saved under each chunk model folder, for example:

outputs/models/TemporalEncoder/chunk2/normalizer.npz
outputs/models/TemporalEncoder/chunk3/normalizer.npz
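
A minimal sketch of a train-only normaliser that follows the mean/std rule above; the class name and the key names stored inside normalizer.npz are assumptions.

import numpy as np

class FeatureNormalizer:
    """Per-feature z-score normaliser fitted on the training split only."""

    def fit(self, x_train: np.ndarray) -> "FeatureNormalizer":
        # x_train: (n_rows, n_features) drawn from the chunk's training period only
        self.mean = x_train.mean(axis=0)
        self.std = x_train.std(axis=0) + 1e-8   # guard against zero variance
        return self

    def transform(self, x: np.ndarray) -> np.ndarray:
        return (x - self.mean) / self.std

    def save(self, path: str) -> None:
        np.savez(path, mean=self.mean, std=self.std)

    @classmethod
    def load(cls, path: str) -> "FeatureNormalizer":
        data = np.load(path)
        norm = cls()
        norm.mean, norm.std = data["mean"], data["std"]
        return norm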

10. Hyperparameter Optimisation

The Temporal Encoder uses Optuna TPE search before final training.

10.1 Why HPO matters

The encoder is upstream of many models. Poor temporal embeddings reduce the quality of the Technical Analyst, Volatility Model, Drawdown Risk Model, and Regime Detection outputs.

HPO is therefore not optional for the thesis-quality version.

10.2 HPO search space

The code searches over:

| Parameter | Search space / type |
| --- | --- |
| n_layers | 2 to 6 |
| n_heads | 2, 4, 8 |
| d_model | 64, 128, 256 |
| dropout | continuous range |
| attention_dropout | continuous range |
| learning_rate | log-scale range |
| weight_decay | log-scale range |
| warmup_steps | candidate values |
| batch_size | candidate values |

The best parameters are saved under:

outputs/codeResults/TemporalEncoder/hpo/best_params_chunk1.json
outputs/codeResults/TemporalEncoder/hpo/best_params_chunk2.json
outputs/codeResults/TemporalEncoder/hpo/best_params_chunk3.json
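
A minimal sketch of how the Optuna TPE search could be wired up is shown below. Only the parameter names mirror the table in 10.2; the concrete ranges, the trial count, and the train_and_evaluate helper are assumptions.

import optuna

def objective(trial: optuna.Trial) -> float:
    params = {
        "n_layers": trial.suggest_int("n_layers", 2, 6),
        "n_heads": trial.suggest_categorical("n_heads", [2, 4, 8]),
        "d_model": trial.suggest_categorical("d_model", [64, 128, 256]),
        "dropout": trial.suggest_float("dropout", 0.05, 0.3),
        "attention_dropout": trial.suggest_float("attention_dropout", 0.05, 0.3),
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True),
        "weight_decay": trial.suggest_float("weight_decay", 1e-6, 1e-3, log=True),
        "warmup_steps": trial.suggest_categorical("warmup_steps", [500, 1000, 2000]),
        "batch_size": trial.suggest_categorical("batch_size", [32, 64, 128]),
    }
    return train_and_evaluate(params)   # hypothetical helper: short training run, returns validation loss

study = optuna.create_study(direction="minimize", sampler=optuna.samplers.TPESampler(seed=42))
study.optimize(objective, n_trials=30)
print(study.best_params, study.best_value)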

10.3 Known final HPO examples from the production run

Chunk 1:

{
  "params": {
    "n_layers": 3,
    "n_heads": 2,
    "d_model": 256,
    "dropout": 0.12946407491739656,
    "attention_dropout": 0.17204918134328429,
    "learning_rate": 0.00032050950331890453,
    "weight_decay": 0.0003087360648369751,
    "warmup_steps": 1000,
    "batch_size": 64
  },
  "value": 1.7111254166455785
}

Chunk 2:

{
  "params": {
    "n_layers": 6,
    "n_heads": 8,
    "d_model": 256,
    "dropout": 0.17854162005663654,
    "attention_dropout": 0.19488416358230604,
    "learning_rate": 0.0004982755066410893,
    "weight_decay": 0.0002305298448157034,
    "warmup_steps": 1000,
    "batch_size": 32
  },
  "value": 0.17857432030974485
}

Chunk 3:

{
  "params": {
    "n_layers": 4,
    "n_heads": 4,
    "d_model": 256,
    "dropout": 0.12580726647987625,
    "attention_dropout": 0.1243864375089663,
    "learning_rate": 0.00045877596861583427,
    "weight_decay": 9.225880372153381e-06,
    "warmup_steps": 1000,
    "batch_size": 32
  },
  "value": 0.19655309588954853
}

11. Training, Resume, and Checkpointing

11.1 Checkpoint files

Each chunk stores model checkpoints under:

outputs/models/TemporalEncoder/chunk{n}/

Expected files include:

best_model.pt
latest_model.pt
training_history.csv
training_summary.json
effective_config.json
normalizer.npz
model_freezed/model.pt
model_unfreezed/model.pt

11.2 Resume behaviour

The training function supports resuming from:

latest_model.pt
training_history.csv

If interrupted, the next run can continue from the last saved epoch. The training history is saved incrementally so progress is not lost when a remote session disconnects.
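
A minimal sketch of that resume logic, assuming latest_model.pt stores model state, optimiser state, and the last completed epoch (the real checkpoint keys may differ):

import os
import pandas as pd
import torch

def load_resume_state(model, optimizer, ckpt_dir: str):
    """Restore the latest checkpoint and history if a previous run was interrupted."""
    ckpt_path = os.path.join(ckpt_dir, "latest_model.pt")
    history_path = os.path.join(ckpt_dir, "training_history.csv")
    start_epoch, history = 0, []
    if os.path.exists(ckpt_path):
        ckpt = torch.load(ckpt_path, map_location="cpu")
        model.load_state_dict(ckpt["model_state"])
        optimizer.load_state_dict(ckpt["optimizer_state"])
        start_epoch = ckpt["epoch"] + 1
    if os.path.exists(history_path):
        history = pd.read_csv(history_path).to_dict("records")
    return start_epoch, history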

11.3 Practical runtime lesson

The Temporal Encoder training stage was the slowest part of the downstream setup because full training over millions of 30-day windows is expensive. In practice, once the validation loss had plateaued and a usable best_model.pt existed, embeddings could be generated from the best checkpoint.

This was especially important for Chunk 2 and Chunk 3, where the project needed embeddings urgently to unblock downstream risk modules.

The key practical rule is:

If best_model.pt exists, validation loss has stabilised, and embeddings are the blocking dependency, generate embeddings from best_model.pt rather than waiting for unnecessary extra epochs.

12. Embedding Generation

12.1 Output directory

Embeddings are saved under:

outputs/embeddings/TemporalEncoder/

12.2 Output files

For every chunk and split, the encoder produces:

chunk{n}_{split}_embeddings.npy
chunk{n}_{split}_manifest.csv

Example:

outputs/embeddings/TemporalEncoder/chunk2_train_embeddings.npy
outputs/embeddings/TemporalEncoder/chunk2_train_manifest.csv
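
A minimal sketch of batched embedding generation from a trained checkpoint; which pooled output is written to the .npy file is an assumption (attention_pooled is used here for illustration).

import numpy as np
import torch
from torch.utils.data import DataLoader

@torch.no_grad()
def generate_embeddings(model, dataset, device: str = "cuda", batch_size: int = 4096) -> np.ndarray:
    """Run the trained encoder over one split and stack the pooled embeddings."""
    model.eval().to(device)
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=False, num_workers=8)
    parts = []
    for batch in loader:                          # batch: (B, 30, 10), already normalised
        out = model(batch.to(device))
        parts.append(out["attention_pooled"].cpu().numpy())
    return np.concatenate(parts, axis=0)          # (n_samples, d_model)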

12.3 Final production embedding shapes

The completed production run produced finite 256-dimensional embeddings for all chunks.

| Chunk | Split | Embedding shape | Manifest shape |
| --- | --- | --- | --- |
| Chunk 1 | train | (3,065,000, 256) | (3,065,000, 2) |
| Chunk 1 | val | (555,000, 256) | (555,000, 2) |
| Chunk 1 | test | (552,500, 256) | (552,500, 2) |
| Chunk 2 | train | (4,960,000, 256) | (4,960,000, 2) |
| Chunk 2 | val | (555,000, 256) | (555,000, 2) |
| Chunk 2 | test | (555,000, 256) | (555,000, 2) |
| Chunk 3 | train | (3,700,000, 256) | (3,700,000, 2) |
| Chunk 3 | val | (550,000, 256) | (550,000, 2) |
| Chunk 3 | test | (547,500, 256) | (547,500, 2) |

All sampled embeddings were verified as finite during the final audit.


13. Manifest Alignment

13.1 Why manifests are required

The .npy embedding arrays contain only numeric vectors. Downstream modules need to know:

embedding row i → which ticker and which date?

This mapping is stored in manifest files:

chunk{n}_{split}_manifest.csv

The minimal manifest columns are:

ticker,date

Some manifest-building tools may also include:

seq_start,seq_end

13.2 Manifest generation logic

The manifest reconstructs the same rolling-window order used by the MarketSequenceDataset:

For each ticker:
    sort by date
    build 30-day windows
    assign the embedding date to the final date of the window

The helper script:

code/encoders/build_embedding_manifest.py

exists to rebuild manifest files if embeddings already exist but row metadata is missing.
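
A minimal sketch of that rebuild logic, assuming features_temporal.csv contains 'ticker' and 'date' columns and that the dataset enumerates windows per ticker in date order:

import pandas as pd

SEQ_LEN = 30

def build_manifest(features_csv: str) -> pd.DataFrame:
    """Rebuild (ticker, date) rows in rolling-window order, labelled by the window's last date."""
    df = pd.read_csv(features_csv, usecols=["ticker", "date"], parse_dates=["date"])
    rows = []
    for ticker, group in df.sort_values("date").groupby("ticker"):
        dates = group["date"].tolist()
        for end in range(SEQ_LEN - 1, len(dates)):
            rows.append({"ticker": ticker, "date": dates[end]})
    return pd.DataFrame(rows)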

13.3 Alignment validation rule

For every split:

len(embeddings) == len(manifest)

If this is false, downstream model training must not proceed until alignment is fixed.
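
The rule can be checked directly, for example for the chunk 2 training split:

import numpy as np
import pandas as pd

emb = np.load("outputs/embeddings/TemporalEncoder/chunk2_train_embeddings.npy", mmap_mode="r")
man = pd.read_csv("outputs/embeddings/TemporalEncoder/chunk2_train_manifest.csv")
assert len(emb) == len(man), f"misaligned: {len(emb)} embeddings vs {len(man)} manifest rows"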


14. XAI Outputs

The Temporal Encoder contributes XAI at the embedding-generation stage.

14.1 Attention XAI

The attention pooling mechanism identifies which time steps in the rolling window mattered most for the embedding.

Expected outputs:

outputs/results/TemporalEncoder/xai/chunk{n}_{split}_attention_weights.npy
outputs/results/TemporalEncoder/xai/chunk{n}_{split}_attention_weights.csv

These files help answer:

Which days in the 30-day window were most important for the temporal representation?

14.2 Gradient feature importance

Gradient-based importance is computed over a small sample of embeddings.

Expected outputs:

outputs/results/TemporalEncoder/xai/chunk{n}_{split}_feature_importance.npy
outputs/results/TemporalEncoder/xai/chunk{n}_{split}_feature_importance.csv

These files help answer:

Which engineered market features had the strongest influence on the embedding?
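
A minimal sketch of a plain gradient-magnitude attribution over the input features; the attribution method actually used by the project may differ.

import torch

def gradient_feature_importance(model, batch: torch.Tensor) -> torch.Tensor:
    """Average |d embedding / d input| per input feature over a small sample.

    batch: (B, 30, 10). Returns a (10,) importance vector.
    """
    model.eval()
    x = batch.clone().requires_grad_(True)
    out = model(x)["attention_pooled"]       # (B, d_model)
    out.norm(dim=-1).sum().backward()        # scalar objective so gradients flow to x
    return x.grad.abs().mean(dim=(0, 1))     # average over batch and time dimensions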

14.3 XAI limitations

Temporal Encoder XAI should be interpreted as representation-level explanation, not final decision explanation.

The encoder explains what shaped the embedding. It does not explain the final Buy/Hold/Sell decision. Final explanation is produced later by module-level XAI, position sizing XAI, quantitative/qualitative synthesis, and fusion explanation.


15. Downstream Consumers

15.1 Technical Analyst

Consumes Temporal Encoder embeddings and learns directional technical scores:

trend_score
momentum_score
timing_confidence

15.2 Volatility Model

Consumes embeddings as learned market-state features and predicts future volatility outputs used by the risk engine and position sizing.

15.3 Drawdown Risk Model

Consumes embeddings to estimate expected drawdown and related downside path-risk signals.

15.4 Regime Detection

Consumes temporal embeddings, together with FinBERT/text and macro context, to help classify market regime state.


16. File Structure

16.1 Code files

code/encoders/temporal_encoder.py
code/encoders/build_embedding_manifest.py

16.2 Input file

data/yFinance/processed/features_temporal.csv

16.3 HPO files

outputs/codeResults/TemporalEncoder/hpo/best_params_chunk1.json
outputs/codeResults/TemporalEncoder/hpo/best_params_chunk2.json
outputs/codeResults/TemporalEncoder/hpo/best_params_chunk3.json

16.4 Model files

outputs/models/TemporalEncoder/chunk1/
outputs/models/TemporalEncoder/chunk2/
outputs/models/TemporalEncoder/chunk3/

16.5 Embedding files

outputs/embeddings/TemporalEncoder/chunk1_train_embeddings.npy
outputs/embeddings/TemporalEncoder/chunk1_train_manifest.csv
...
outputs/embeddings/TemporalEncoder/chunk3_test_embeddings.npy
outputs/embeddings/TemporalEncoder/chunk3_test_manifest.csv

16.6 XAI files

outputs/results/TemporalEncoder/xai/

17. CLI Commands

All commands below are single-line commands to match the project execution preference.

17.1 Inspect data

cd ~/fin-glassbox && python code/encoders/temporal_encoder.py inspect --repo-root .

17.2 Run HPO for a chunk

cd ~/fin-glassbox && python code/encoders/temporal_encoder.py hpo --repo-root . --chunk 1 --device cuda

17.3 Train best model for one chunk

cd ~/fin-glassbox && python code/encoders/temporal_encoder.py train-best --repo-root . --chunk 1 --device cuda

17.4 Train all chunks

cd ~/fin-glassbox && python code/encoders/temporal_encoder.py train-best --repo-root . --chunk 1 --device cuda && python code/encoders/temporal_encoder.py train-best --repo-root . --chunk 2 --device cuda && python code/encoders/temporal_encoder.py train-best --repo-root . --chunk 3 --device cuda

17.5 Fast embedding generation for one chunk

If the active code version supports performance flags, use a large embedding batch:

cd ~/fin-glassbox && python code/encoders/temporal_encoder.py embed --chunk 1 --split train --device cuda --batch-size 4096 --num-workers 8 --prefetch-factor 4 && python code/encoders/temporal_encoder.py embed --chunk 1 --split val --device cuda --batch-size 4096 --num-workers 8 --prefetch-factor 4 && python code/encoders/temporal_encoder.py embed --chunk 1 --split test --device cuda --batch-size 4096 --num-workers 8 --prefetch-factor 4

17.6 Build or repair manifests

cd ~/fin-glassbox && python code/encoders/build_embedding_manifest.py

17.7 Verify all embedding outputs

cd ~/fin-glassbox && python -c "import numpy as np,pandas as pd,pathlib; base=pathlib.Path('outputs/embeddings/TemporalEncoder'); files=['chunk1_train','chunk1_val','chunk1_test','chunk2_train','chunk2_val','chunk2_test','chunk3_train','chunk3_val','chunk3_test']; [print(f, np.load(base/f'{f}_embeddings.npy',mmap_mode='r').shape, pd.read_csv(base/f'{f}_manifest.csv').shape, 'finite_sample=', float(np.isfinite(np.load(base/f'{f}_embeddings.npy',mmap_mode='r')[:10000]).mean())) for f in files]"

18. Validation Checklist

The Temporal Encoder is considered complete only if all of the following are true:

| Check | Required result |
| --- | --- |
| features_temporal.csv exists | Yes |
| All 10 input features exist | Yes |
| Chunk HPO files exist | Yes, for chunks used in final system |
| best_model.pt exists | Yes, per chunk |
| model_freezed/model.pt exists | Yes, per chunk |
| normalizer.npz exists | Yes, per chunk |
| Train/val/test embeddings exist | Yes, per chunk |
| Train/val/test manifests exist | Yes, per chunk |
| Embedding rows equal manifest rows | Yes |
| Embedding finite sample ratio | 1.0 expected |
| XAI sample files exist | Strongly preferred |

19. Troubleshooting

19.1 Training is too slow

This happened during production. The practical fix was to use the best available checkpoint and generate embeddings directly once validation had plateaued.

Also check the embedding batch size, DataLoader worker and prefetch settings, and that the run is actually using the GPU (--device cuda), as in the section 17.5 command.

19.2 Embeddings exist but downstream modules fail

Check manifest alignment:

len(embeddings) must equal len(manifest)

Also verify manifest columns include ticker/date and that dates are parseable.

19.3 Normaliser missing

If normalizer.npz is missing, or if validation/test embedding generation fits its own normaliser on that split, leakage can occur. Rebuild or copy the train-only normaliser for that chunk.

19.4 Chunk 2 or Chunk 3 missing

The downstream modules require all chunks eventually. If only Chunk 1 exists, Temporal Encoder is not complete for final backtesting.


20. Final Status

At the current final project state, the Temporal Encoder is complete for all three chunks:

Chunk 1 train/val/test embeddings + manifests: complete
Chunk 2 train/val/test embeddings + manifests: complete
Chunk 3 train/val/test embeddings + manifests: complete

This unblocked the rest of the risk engine and analyst stack.


21. Summary

The Temporal Encoder is the project’s shared market-sequence representation model. It converts 30-day sequences of engineered market features into 256-dimensional embeddings used by technical, volatility, drawdown, and regime modules.

Its importance comes from being upstream of several models. The final system depends on it being complete for all three chunks, trained and normalised without leakage, and aligned row-for-row with its manifests.

This module is now a completed core encoder component in the fin-glassbox architecture.