# code/encoders/ Folder Documentation

The code/encoders/ directory contains the upstream representation-learning modules for *An Explainable Multimodal Neural Framework for Financial Risk Management*. These encoders convert raw or engineered financial data into dense representations that downstream analysts, risk modules, and the fusion stage consume.
The directory supports two major modalities:

- time-series market data (Shared Temporal Attention Encoder)
- financial text data (FinBERT Financial Text Encoder)
The encoders do not make final trading decisions. Their purpose is to produce reusable embeddings that downstream modules can consume consistently:
```text
Raw / engineered data
          │
          ▼
       Encoders
          │
          ├── Temporal embeddings: ticker-date market-sequence representation
          └── FinBERT embeddings: filing/text-event representation
          │
          ▼
Analysts, Risk Engine, Regime Model, Fusion
```
```text
code/encoders/
├── temporal_encoder.py
├── finbert_encoder.py
├── build_embedding_manifest.py
├── run_finbert_full_pipeline.py
├── run_finbert_resume_after_hpo.py
├── TemporalEncoder.md
├── FinBERT_Encoder.md
└── TextEncoder.md
```
| File | Role |
|---|---|
| `temporal_encoder.py` | Shared attention-based encoder for market time-series windows. |
| `finbert_encoder.py` | Full FinBERT MLM fine-tuning, HPO, embedding extraction, PCA projection, and model export pipeline. |
| `build_embedding_manifest.py` | One-time helper for reconstructing temporal embedding manifests when needed. |
| `run_finbert_full_pipeline.py` | Full orchestration script for FinBERT HPO, MLM training, embedding extraction, and PCA projection. |
| `run_finbert_resume_after_hpo.py` | Resume script for FinBERT training and embedding generation after HPO has already produced best parameters. |
| File | Role |
|---|---|
| `TemporalEncoder.md` | Full documentation for the Shared Temporal Attention Encoder. |
| `FinBERT_Encoder.md` | Full documentation for the FinBERT Financial Text Encoder. |
| `TextEncoder.md` | Extended text encoder context, data contracts, training decisions, and downstream interface notes. |
The encoders sit directly after data processing and before specialised downstream modules:
```text
INPUT DATA
├── Time-Series Market Data
│   └── Shared Temporal Attention Encoder
│       ├── Technical Analyst
│       ├── Volatility Model
│       ├── Drawdown Risk Model
│       └── Regime Detection
│
└── Financial Text Data
    └── FinBERT Financial Text Encoder
        ├── Sentiment Analyst
        ├── News Analyst
        ├── Qualitative Analyst
        └── Regime Detection
```
The project intentionally avoids using one monolithic model. Instead, encoders produce modality-specific representations that are consumed by specialised modules.
The Temporal Encoder converts rolling market-feature windows into 256-dimensional ticker-date embeddings. It captures time dependencies in price, volatility, momentum, volume, and market-position indicators.
It is shared across multiple downstream modules so that technical and risk models use a consistent market representation.
Primary input: `data/yFinance/processed/features_temporal.csv`
Expected feature columns include:

- `log_return`
- `vol_5d`
- `vol_21d`
- `rsi_14`
- `macd_hist`
- `bb_pos`
- `volume_ratio`
- `hl_ratio`
- `price_pos`
- `spy_corr_63d`
The model uses rolling sequence windows, typically:

- sequence length: 30 trading days
- embedding dimension: 256
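These rolling windows can be built with numpy's `sliding_window_view`; the feature matrix, sizes, and day count below are illustrative stand-ins, not the project's real data:

```python
import numpy as np

# Hypothetical feature matrix for one ticker: 100 trading days x 10 features,
# ordered oldest to newest (the numbers here are illustrative).
n_days, n_features, seq_len = 100, 10, 30
features = np.random.default_rng(0).normal(size=(n_days, n_features))

# Build every rolling 30-day window for this ticker. Each window ends on the
# ticker-date that its embedding will represent.
windows = np.lib.stride_tricks.sliding_window_view(
    features, window_shape=seq_len, axis=0
)                                      # shape: (71, 10, 30)
windows = windows.transpose(0, 2, 1)   # -> (num_windows, seq_len, n_features)

print(windows.shape)  # (71, 30, 10)
```

Each row `windows[i]` is the 30-day history ending on day `i + 29`, which is the date its embedding is keyed to in the manifest.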
The Temporal Encoder uses:

- input projection
- positional encoding
- Transformer encoder layers
- pooling
- a 256-dimensional embedding output
The main technical encoder is deliberately attention-based rather than plain LSTM/CNN. GNNs are not used here; graph modelling is reserved for contagion and regime risk modules.
The Temporal Encoder uses self-supervised masked temporal reconstruction. Parts of the input sequence are masked and the model learns to reconstruct the hidden market features. This allows it to learn market-state representations without requiring supervised labels at the encoder stage.
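A minimal numpy sketch of the masked-reconstruction objective (the mask rate and the stand-in "model" are illustrative; the real encoder reconstructs masked features with its Transformer):

```python
import numpy as np

rng = np.random.default_rng(42)
seq_len, n_features = 30, 10
window = rng.normal(size=(seq_len, n_features))  # one market-feature window

# Hide 5 of the 30 timesteps (~15%, an illustrative mask rate).
mask = np.zeros(seq_len, dtype=bool)
mask[rng.choice(seq_len, size=5, replace=False)] = True
masked_window = window.copy()
masked_window[mask] = 0.0  # masked positions are invisible to the model

# In the real encoder a Transformer predicts the hidden features from context;
# here "column mean of the visible timesteps" stands in for the model.
prediction = np.tile(masked_window[~mask].mean(axis=0), (seq_len, 1))

# The reconstruction loss is computed ONLY over the masked positions.
loss = np.mean((prediction[mask] - window[mask]) ** 2)
print(f'masked timesteps: {mask.sum()}, reconstruction MSE: {loss:.4f}')
```

Because the target is the input itself, no supervised labels are needed at this stage.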
The normaliser must be fitted only on the relevant train split. Validation and test splits reuse the train-fitted normaliser. This prevents validation/test distribution information from leaking into training.
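The train-only fitting rule can be illustrated with a plain standardiser on synthetic data (a sketch; the project's actual normaliser implementation may differ):

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(loc=5.0, scale=2.0, size=(1000, 10))
val   = rng.normal(loc=5.5, scale=2.5, size=(200, 10))  # drifted distribution

# Fit statistics on the TRAIN split only.
mu, sigma = train.mean(axis=0), train.std(axis=0)

# Reuse the train-fitted statistics on every split -- never refit on val/test.
train_z = (train - mu) / sigma
val_z   = (val - mu) / sigma

print(train_z.mean(axis=0).round(2))  # ~0 by construction
print(val_z.mean(axis=0).round(2))    # NOT forced to 0: the drift stays visible
```

Refitting on val/test would hide exactly the distribution shift the model must be evaluated under.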
Expected output directory: `outputs/embeddings/TemporalEncoder/`

Expected files:

```text
chunk1_train_embeddings.npy
chunk1_train_manifest.csv
chunk1_val_embeddings.npy
chunk1_val_manifest.csv
chunk1_test_embeddings.npy
chunk1_test_manifest.csv
chunk2_train_embeddings.npy
chunk2_train_manifest.csv
chunk2_val_embeddings.npy
chunk2_val_manifest.csv
chunk2_test_embeddings.npy
chunk2_test_manifest.csv
chunk3_train_embeddings.npy
chunk3_train_manifest.csv
chunk3_val_embeddings.npy
chunk3_val_manifest.csv
chunk3_test_embeddings.npy
chunk3_test_manifest.csv
```

Embeddings are stored as `.npy`; manifests are stored as `.csv` with ticker/date row alignment.
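A minimal round-trip of this alignment contract, using illustrative file names in a temporary directory (not the real outputs):

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

tmp = Path(tempfile.mkdtemp())

# Illustrative outputs: 4 embedding rows and a manifest with matching rows.
emb = np.random.default_rng(1).normal(size=(4, 256)).astype(np.float32)
manifest = pd.DataFrame({
    'ticker': ['AAPL', 'AAPL', 'MSFT', 'MSFT'],
    'date':   ['2006-01-03', '2006-01-04', '2006-01-03', '2006-01-04'],
})
np.save(tmp / 'chunk1_test_embeddings.npy', emb)
manifest.to_csv(tmp / 'chunk1_test_manifest.csv', index=False)

# Downstream loading: row i of the array belongs to row i of the manifest.
arr = np.load(tmp / 'chunk1_test_embeddings.npy', mmap_mode='r')
man = pd.read_csv(tmp / 'chunk1_test_manifest.csv')
assert len(arr) == len(man), 'row-count mismatch: alignment is unsafe'
row = man.iloc[2]
print(row['ticker'], row['date'], arr[2].shape)  # embedding for that ticker-date
```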
The FinBERT encoder converts SEC filing text chunks into 256-dimensional text embeddings. These embeddings are used by the Sentiment Analyst, News Analyst, Qualitative Analyst, and Regime Detection modules.
The pipeline starts from FinBERT and performs domain-adaptive masked language modelling on the project’s SEC filings corpus. After training, it extracts 768-dimensional FinBERT hidden representations and projects them to 256 dimensions using Incremental PCA fitted only on the train split.
The final embedding dimension matches the Temporal Encoder output dimension:

- Temporal embedding: 256
- FinBERT text embedding: 256
This makes multimodal integration cleaner and keeps downstream model sizes controlled.
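A toy illustration of why matched dimensions help (the real fusion module is more elaborate than a concatenation; the arrays below are synthetic stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)
temporal = rng.normal(size=(8, 256)).astype(np.float32)  # stand-in temporal embeddings
text     = rng.normal(size=(8, 256)).astype(np.float32)  # stand-in FinBERT embeddings

# Equal per-modality dimensions make late fusion a simple, symmetric
# concatenation, and neither modality dominates the input width.
fused = np.concatenate([temporal, text], axis=1)
print(fused.shape)  # (8, 512)
```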
Primary dataset: `final/filings_finbert_chunks_balanced_25y_cap40000.csv`
Important metadata fields include:

- `chunk_id`
- `doc_id`
- `year`
- `form_type`
- `cik`
- `filing_date`
- `accession`
- `source_name`
- `chunk_index`
- `word_count`
The project uses three chronological chunks:

- Chunk 1: 2000–2004 train, 2005 val, 2006 test
- Chunk 2: 2007–2014 train, 2015 val, 2016 test
- Chunk 3: 2017–2022 train, 2023 val, 2024 test
This split design is critical for preventing look-ahead bias.
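The split above can be expressed as a year → (chunk, split) lookup (a sketch; the pipeline's real assignment code may differ):

```python
def assign_chunk_split(year: int):
    """Map a filing year to its (chunk, split) under the chronological design."""
    table = [
        (range(2000, 2005), 1, 'train'), (range(2005, 2006), 1, 'val'), (range(2006, 2007), 1, 'test'),
        (range(2007, 2015), 2, 'train'), (range(2015, 2016), 2, 'val'), (range(2016, 2017), 2, 'test'),
        (range(2017, 2023), 3, 'train'), (range(2023, 2024), 3, 'val'), (range(2024, 2025), 3, 'test'),
    ]
    for years, chunk, split in table:
        if year in years:
            return chunk, split
    raise ValueError(f'year {year} outside the covered 2000-2024 range')

print(assign_chunk_split(2003))  # (1, 'train')
print(assign_chunk_split(2015))  # (2, 'val')
print(assign_chunk_split(2024))  # (3, 'test')
```

Because every val/test year lies strictly after its train years, no model ever sees text from its own future.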
Expected output directory: `outputs/embeddings/FinBERT/`

Expected final 256-dimensional files:

```text
chunk1_train_embeddings.npy
chunk1_train_metadata.csv
chunk1_train_manifest.json
chunk1_val_embeddings.npy
chunk1_val_metadata.csv
chunk1_val_manifest.json
chunk1_test_embeddings.npy
chunk1_test_metadata.csv
chunk1_test_manifest.json
chunk2_train_embeddings.npy
chunk2_train_metadata.csv
chunk2_train_manifest.json
chunk2_val_embeddings.npy
chunk2_val_metadata.csv
chunk2_val_manifest.json
chunk2_test_embeddings.npy
chunk2_test_metadata.csv
chunk2_test_manifest.json
chunk3_train_embeddings.npy
chunk3_train_metadata.csv
chunk3_train_manifest.json
chunk3_val_embeddings.npy
chunk3_val_metadata.csv
chunk3_val_manifest.json
chunk3_test_embeddings.npy
chunk3_test_metadata.csv
chunk3_test_manifest.json
```
Intermediate 768-dimensional files may also exist, especially during PCA generation:

```text
chunk*_train_embeddings768.npy
chunk*_val_embeddings768.npy
chunk*_test_embeddings768.npy
chunk*_pca_768_to_256.pkl
chunk*_pca_manifest.json
```
PCA must be fitted on train split only.
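A train-only PCA fit can be sketched with scikit-learn's `IncrementalPCA` on synthetic data (sample counts and batch sizes are illustrative; each `partial_fit` batch must contain at least 256 rows):

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(0)
train768 = rng.normal(size=(600, 768)).astype(np.float32)  # stand-in train embeddings
val768   = rng.normal(size=(100, 768)).astype(np.float32)  # stand-in val embeddings

# Fit on the TRAIN split only, in batches (each batch holds >= 256 rows).
pca = IncrementalPCA(n_components=256)
for start in range(0, len(train768), 300):
    pca.partial_fit(train768[start:start + 300])

# Apply the train-fitted projection to every split -- never refit on val/test.
train256 = pca.transform(train768)
val256   = pca.transform(val768)
print(train256.shape, val256.shape)  # (600, 256) (100, 256)
```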
## build_embedding_manifest.py

This helper reconstructs temporal embedding manifests from `features_temporal.csv`. It maps each embedding row back to:
- `ticker`
- `date`
- `seq_start`
- `seq_end`
The manifest is essential because downstream modules must know which ticker-date each embedding row represents. The helper is useful if manifests are missing, corrupted, or need to be regenerated after embedding production.
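The mapping can be sketched as follows (illustrative synthetic dates; the real helper derives everything from features_temporal.csv):

```python
import pandas as pd

seq_len = 30  # rolling window length used by the Temporal Encoder

def build_manifest(dates_by_ticker):
    """One manifest row per embedding: the window [seq_start, seq_end] ends on `date`."""
    rows = []
    for ticker, dates in dates_by_ticker.items():
        for i in range(seq_len - 1, len(dates)):
            rows.append({
                'ticker': ticker,
                'date': dates[i],                    # ticker-date the row represents
                'seq_start': dates[i - seq_len + 1],
                'seq_end': dates[i],
            })
    return pd.DataFrame(rows)

# Illustrative: 40 synthetic business days for one ticker -> 11 embedding rows.
dates = pd.bdate_range('2006-01-02', periods=40).strftime('%Y-%m-%d').tolist()
manifest = build_manifest({'AAPL': dates})
print(manifest.shape)  # (11, 4)
```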
Main command:

```shell
cd ~/fin-glassbox && python code/encoders/build_embedding_manifest.py
```
Modern Temporal Encoder runs usually write manifests directly during embedding generation, so this file is mostly a repair/one-time utility.
## run_finbert_full_pipeline.py

Runs the full FinBERT lifecycle:

1. HPO on a chunk sample
2. Train MLM on chunks 1, 2, and 3
3. Extract 768-dimensional embeddings
4. Fit train-only PCA
5. Project train/val/test embeddings to 256 dimensions

This script is intended for a clean full rerun when sufficient time and GPU capacity are available.
## run_finbert_resume_after_hpo.py

Resumes FinBERT training after HPO has already completed. It reads best parameters from `outputs/codeResults/FinBERT/hpo/finbert_mlm_chunk3_final_best_params.json`.
Then it trains/resumes all chunks and regenerates embeddings/PCA outputs. This is useful when HPO has already succeeded and training or projection needs to continue after interruption.
Inspect:

```shell
cd ~/fin-glassbox && python code/encoders/temporal_encoder.py inspect --repo-root .
```

Run HPO:

```shell
cd ~/fin-glassbox && python code/encoders/temporal_encoder.py hpo --repo-root . --chunk 1 --trials 30 --device cuda
```

Train best:

```shell
cd ~/fin-glassbox && python code/encoders/temporal_encoder.py train-best --repo-root . --chunk 1 --device cuda
```

Generate embeddings for one split:

```shell
cd ~/fin-glassbox && python code/encoders/temporal_encoder.py embed --repo-root . --chunk 1 --split train --device cuda --batch-size 4096 --num-workers 8 --prefetch-factor 4
```

Generate all embeddings:

```shell
cd ~/fin-glassbox && python code/encoders/temporal_encoder.py embed-all --repo-root . --device cuda --batch-size 4096 --num-workers 8 --prefetch-factor 4
```
Inspect:

```shell
cd ~/fin-glassbox && python code/encoders/finbert_encoder.py inspect --repo-root .
```

Run HPO:

```shell
cd ~/fin-glassbox && python code/encoders/finbert_encoder.py hpo --repo-root . --chunk 3 --trials 12 --processor cuda
```

Train MLM:

```shell
cd ~/fin-glassbox && python code/encoders/finbert_encoder.py train-mlm --repo-root . --chunk 1 --processor cuda
```

Train all MLM chunks:

```shell
cd ~/fin-glassbox && python code/encoders/finbert_encoder.py train-all-mlm --repo-root . --processor cuda
```

Export frozen/unfrozen model folders:

```shell
cd ~/fin-glassbox && python code/encoders/finbert_encoder.py freeze --repo-root . --chunk 1
```

Extract 768-dimensional embeddings:

```shell
cd ~/fin-glassbox && python code/encoders/finbert_encoder.py embed768 --repo-root . --chunk 1 --split train --eval-batch-size 64 --workers 6 --overwrite
```

Fit PCA on train split:

```shell
cd ~/fin-glassbox && python code/encoders/finbert_encoder.py fit-pca --repo-root . --chunk 1 --pca-batch-size 4096 --overwrite
```

Project split to 256 dimensions:

```shell
cd ~/fin-glassbox && python code/encoders/finbert_encoder.py project-pca --repo-root . --chunk 1 --split train --pca-batch-size 4096 --overwrite
```

Full embedding helper:

```shell
cd ~/fin-glassbox && python code/encoders/finbert_encoder.py embed --repo-root . --chunk 1 --split train --processor cuda --overwrite
```

Full all-chunk embedding helper:

```shell
cd ~/fin-glassbox && python code/encoders/finbert_encoder.py embed-all --repo-root . --processor cuda --overwrite
```
Audit Temporal Encoder outputs:

```shell
cd ~/fin-glassbox && python - <<'PY'
import numpy as np
import pandas as pd
from pathlib import Path

base = Path('outputs/embeddings/TemporalEncoder')
for c in [1, 2, 3]:
    for s in ['train', 'val', 'test']:
        emb = base / f'chunk{c}_{s}_embeddings.npy'
        man = base / f'chunk{c}_{s}_manifest.csv'
        if not emb.exists() or not man.exists():
            print(f'MISSING chunk{c}_{s}: emb={emb.exists()} manifest={man.exists()}')
            continue
        arr = np.load(emb, mmap_mode='r')
        m = pd.read_csv(man)
        finite = float(np.isfinite(arr[:min(len(arr), 10000)]).mean())
        print(f'chunk{c}_{s}: emb={arr.shape}, manifest={m.shape}, finite_sample={finite:.6f}')
PY
```
Audit FinBERT outputs:

```shell
cd ~/fin-glassbox && python - <<'PY'
import numpy as np
import pandas as pd
from pathlib import Path

base = Path('outputs/embeddings/FinBERT')
for c in [1, 2, 3]:
    for s in ['train', 'val', 'test']:
        emb = base / f'chunk{c}_{s}_embeddings.npy'
        meta = base / f'chunk{c}_{s}_metadata.csv'
        if not emb.exists() or not meta.exists():
            print(f'MISSING chunk{c}_{s}: emb={emb.exists()} metadata={meta.exists()}')
            continue
        arr = np.load(emb, mmap_mode='r')
        m = pd.read_csv(meta)
        finite = float(np.isfinite(arr[:min(len(arr), 10000)]).mean())
        print(f'chunk{c}_{s}: emb={arr.shape}, metadata={m.shape}, finite_sample={finite:.6f}')
PY
```
The encoders are not decision modules, but they still support explainability.
The Temporal Encoder preserves, for every embedding row, the ticker, date, and sequence-window boundaries that produced it. This allows downstream explanations to refer back to the time window that produced a representation.
The FinBERT encoder preserves chunk-level metadata alignment (`chunk_id`, `doc_id`, `filing_date`, and related fields) for every embedding row. The encoder itself does not produce final sentiment explanations; instead, it preserves the alignment required for downstream text analysts to generate event-level XAI.
Temporal embeddings are consumed by:

- Technical Analyst
- Volatility Risk Model
- Drawdown Risk Model
- Regime Detection
- Quantitative Analyst (indirectly, through upstream risk/technical outputs)

FinBERT embeddings are consumed by:

- Sentiment Analyst
- News Analyst
- Qualitative Analyst
- Regime Detection
- Fusion (indirectly, through qualitative branch outputs)
If temporal manifests are missing, corrupted, or misaligned, regenerate them with:

```shell
cd ~/fin-glassbox && python code/encoders/build_embedding_manifest.py
```
Use the best saved checkpoint if validation has plateaued, then generate embeddings with large inference batches. The project already demonstrated that embedding extraction can be much faster than prolonged training.
Ensure PCA was fitted on train split only and then applied to train/val/test. Do not fit PCA separately on validation or test.
Check the `.npy` shape against the metadata CSV row count. Any mismatch means downstream row alignment is unsafe.
Check whether output filenames, dimensions, and metadata schemas remained consistent. Downstream modules expect 256-dimensional embeddings.
For a clean full rerun:
1. Process raw data and create clean market/text datasets.
2. Train or load Temporal Encoder.
3. Generate Temporal Encoder embeddings and manifests for all chunks.
4. Train or resume FinBERT MLM.
5. Extract FinBERT 768-dimensional embeddings.
6. Fit PCA on train split only.
7. Project FinBERT embeddings to 256 dimensions.
8. Audit all shapes, manifests, metadata, and finite ratios.
9. Run downstream analyst and risk modules.
10. Run quantitative/qualitative synthesis and fusion.
Use these documents for deeper module-level details:

- `TemporalEncoder.md`
- `FinBERT_Encoder.md`
- `TextEncoder.md`
This README is the folder-level overview; the linked files provide the detailed implementation and methodology notes.
code/encoders/ is the project’s representation-learning layer. It turns large-scale market and text data into aligned embeddings that downstream modules can train on without repeatedly processing raw data. Its main engineering responsibilities are:
- chronological correctness
- train-only normalisation/projection
- reproducible embedding files
- metadata/manifest alignment
- GPU-efficient inference
- XAI traceability
Without this directory, the rest of the multimodal framework would not have stable, reusable inputs. It is therefore one of the foundational layers of the full financial risk-management system.