Project: An Explainable Multimodal Neural Framework for Financial Risk Management
Module: FinBERT Financial Text Encoder
Primary implementation: code/encoders/finbert_encoder.py
Pipeline helpers: run_finbert_full_pipeline.py, run_finbert_resume_after_hpo.py
Previous documentation reference: TextEncoder.md
Output root: outputs/embeddings/FinBERT/, outputs/models/FinBERT/, outputs/results/FinBERT/, outputs/codeResults/FinBERT/
This document is the comprehensive updated documentation for the FinBERT encoder in the fin-glassbox project. It replaces the older text-encoder documentation as the primary reference for how SEC filing text is transformed into reusable financial text embeddings.
The FinBERT encoder is the project’s main text encoder. It converts SEC filing chunks into dense vector representations that are consumed by:
Sentiment Analyst
News Analyst
Regime Risk Module
Qualitative Analyst
Fusion Engine
XAI Layer
The encoder does not directly output Buy/Hold/Sell decisions. It produces reusable representations for downstream specialist models.
The final architecture separates market, graph, macro, risk, and text processing. FinBERT is responsible for the text stream:
SEC filing text chunks
↓
FinBERT domain-adaptive MLM fine-tuning
↓
768-dimensional pooled FinBERT embeddings
↓
train-only PCA projection
↓
256-dimensional text embeddings
↓
Sentiment Analyst / News Analyst / Regime Risk / Qualitative Analyst / Fusion
The encoder belongs to the Data Processing / Encoder Layer, not the final analyst layer. Its output is a learned representation; interpretation happens in downstream modules.
FinBERT is used because the input language is financial. SEC filings contain specialised terminology, risk disclosures, accounting language, forward-looking statements, management discussion, business sections, governance text, and risk factors. General language models are less aligned with this domain.
The project uses FinBERT as the starting point for domain adaptation. The base model is:
ProsusAI/finbert
The project then performs domain-adaptive fine-tuning using Masked Language Modelling on the project’s own SEC filing corpus.
The FinBERT stage is primarily self-supervised domain adaptation, not supervised market prediction.
The training objective is:
Masked Language Modelling loss
This means the model learns SEC disclosure language better, but it does not directly learn returns, drawdowns, risk classes, or sentiment labels. Those tasks are handled later by downstream trained modules.
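The masking step can be sketched without the Hugging Face stack. The helper below mirrors the standard BERT recipe (select ~15% of tokens; of those, 80% become [MASK], 10% a random token, 10% are left unchanged), which is the behaviour of transformers' DataCollatorForLanguageModeling. The mask_token_id and vocab_size values are placeholders for illustration, not values read from the project:

```python
import numpy as np

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_probability=0.15, seed=0):
    """Apply BERT-style MLM masking; labels are -100 except at masked positions."""
    rng = np.random.default_rng(seed)
    input_ids = np.array(input_ids, dtype=np.int64)
    labels = np.full_like(input_ids, -100)             # -100 = ignored by the loss

    masked = rng.random(input_ids.shape) < mlm_probability
    labels[masked] = input_ids[masked]                 # model must predict the original token

    roll = rng.random(input_ids.shape)
    replace = masked & (roll < 0.8)                    # 80% of masked: [MASK] token
    randomise = masked & (roll >= 0.8) & (roll < 0.9)  # 10%: random token; last 10%: unchanged
    input_ids[replace] = mask_token_id
    input_ids[randomise] = rng.integers(0, vocab_size, randomise.sum())
    return input_ids, labels
```

The loss is then computed only at positions where labels differ from -100, which is exactly what makes this self-supervised: no market labels are involved.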
This distinction is important for thesis defensibility:
FinBERT improves the representation of financial text.
Sentiment Analyst and News Analyst learn task-specific predictions from that representation.
Qualitative Analyst learns how to combine those task-specific outputs.
The project uses three chronological chunks:
| Chunk | Train years | Validation year | Test year |
|---|---|---|---|
| Chunk 1 | 2000–2004 | 2005 | 2006 |
| Chunk 2 | 2007–2014 | 2015 | 2016 |
| Chunk 3 | 2017–2022 | 2023 | 2024 |
This design prevents look-ahead bias. Each chunk has its own model/export and embedding set. The train split is used for fitting the chunk’s PCA projection; validation and test are transformed using the train-fitted PCA only.
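The table above maps directly onto a year filter. A minimal sketch, assuming the corpus exposes the year column described later in the metadata schema:

```python
import pandas as pd

# Chronological split boundaries, one entry per chunk (inclusive year ranges).
CHUNKS = {
    1: {"train": (2000, 2004), "val": (2005, 2005), "test": (2006, 2006)},
    2: {"train": (2007, 2014), "val": (2015, 2015), "test": (2016, 2016)},
    3: {"train": (2017, 2022), "val": (2023, 2023), "test": (2024, 2024)},
}

def split_chunk(df: pd.DataFrame, chunk: int) -> dict:
    """Return train/val/test frames for one chunk, filtered strictly by year."""
    out = {}
    for split, (lo, hi) in CHUNKS[chunk].items():
        out[split] = df[df["year"].between(lo, hi)].reset_index(drop=True)
    return out
```

Because the splits never overlap in time and validation/test always follow training years, no filing from the future can influence a chunk's model or its PCA.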
finbert_encoder.py
This is the main implementation. It includes the configuration, dataset, trainer, embedding-extraction, PCA-projection, and HPO components described below.
run_finbert_full_pipeline.py
This is a helper runner for executing the full FinBERT process across chunks. It is useful for scripted execution, but the primary reusable implementation is still finbert_encoder.py.
run_finbert_resume_after_hpo.py
This helper exists to resume or continue the pipeline after HPO. It is useful in long-running GPU workflows where HPO, training, and embedding generation may be executed in separate sessions.
TextEncoder.md
This is the older documentation reference. It is now superseded by this file, but it remains useful as historical context for early encoder design and output expectations.
FinBERTConfig
Central configuration object. It defines the encoder's core settings, including .env handling. Important defaults include:
base_model_name = ProsusAI/finbert
max_length = 512
base_embedding_dim = 768
projection_dim = 256
batch_size = 24
eval_batch_size = 64
mlm_probability = 0.15
pca_batch_size = 4096
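As a sketch, these defaults map naturally onto a dataclass. This is illustrative only; the real FinBERTConfig also covers paths and .env handling:

```python
from dataclasses import dataclass

@dataclass
class FinBERTConfigSketch:
    """Illustrative subset of the encoder configuration defaults."""
    base_model_name: str = "ProsusAI/finbert"
    max_length: int = 512            # token truncation length
    base_embedding_dim: int = 768    # raw pooled FinBERT dimension
    projection_dim: int = 256        # final PCA output dimension
    batch_size: int = 24             # MLM training batch size
    eval_batch_size: int = 64        # embedding-extraction batch size
    mlm_probability: float = 0.15    # fraction of tokens masked for MLM
    pca_batch_size: int = 4096       # IncrementalPCA streaming batch size
```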
SECChunkTextDataset
Loads the SEC filing text chunks from the final text corpus CSV.
Expected columns include:
chunk_id
doc_id
year
form_type
cik
filing_date
accession
source_name
chunk_index
word_count
text
It supports selecting a specific chunk and split, mirroring the CLI's --chunk and --split options.
TokenizedMLMDataset
Converts SEC text into tokenised inputs for masked language modelling. It performs truncation to the configured maximum sequence length and returns the tensors needed by the MLM data collator.
TokenizedEmbeddingDataset
Converts SEC text into padded tokenised inputs for deterministic embedding extraction. It also preserves row-level metadata.
CheckpointManager
Manages checkpoint saving and recovery. It saves:
latest_checkpoint.pt
chunk{n}/latest_checkpoint.pt
chunk{n}/best_checkpoint.pt
chunk{n}/epoch_XXX.pt
This makes long-running MLM training resumable.
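The bookkeeping behind resumability can be sketched with JSON files standing in for the .pt checkpoints. The real CheckpointManager uses torch.save and also stores model and optimizer state; only the latest/best pattern is shown here:

```python
import json
from pathlib import Path

def save_checkpoint(dir_: Path, epoch: int, val_loss: float, best: float) -> float:
    """Write the latest checkpoint, promote to best if val_loss improved; return new best."""
    dir_.mkdir(parents=True, exist_ok=True)
    state = {"epoch": epoch, "val_loss": val_loss}
    (dir_ / "latest_checkpoint.json").write_text(json.dumps(state))
    if val_loss < best:
        (dir_ / "best_checkpoint.json").write_text(json.dumps(state))
        best = val_loss
    return best

def resume_epoch(dir_: Path) -> int:
    """Return the next epoch to run (0 if no checkpoint exists yet)."""
    p = dir_ / "latest_checkpoint.json"
    return json.loads(p.read_text())["epoch"] + 1 if p.exists() else 0
```

This is why a killed GPU session can pick up mid-training: latest_checkpoint tracks progress, while best_checkpoint preserves the lowest-validation-loss state for the freeze step.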
FinBERTMLMTrainer
Runs domain-adaptive MLM training, with checkpointing, validation-loss tracking, and early stopping (see the train-mlm CLI flags below).
FinBERTBaseEmbeddingExtractor
Loads the frozen model and extracts 768-dimensional mean-pooled embeddings. The output is written as memory-mapped .npy arrays with matching metadata CSV and manifest JSON.
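Mean pooling with an attention mask can be sketched in NumPy. The real extractor applies this arithmetic to the model's last hidden state in PyTorch; this illustrates the computation, not the project's exact code:

```python
import numpy as np

def mean_pool(last_hidden_state: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors over non-padding positions.

    last_hidden_state: (batch, seq_len, hidden), e.g. hidden = 768
    attention_mask:    (batch, seq_len), 1 = real token, 0 = padding
    """
    mask = attention_mask[..., None].astype(last_hidden_state.dtype)  # (b, s, 1)
    summed = (last_hidden_state * mask).sum(axis=1)                   # (b, hidden)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                    # avoid divide-by-zero
    return summed / counts
```

Masked pooling matters because padded positions would otherwise dilute the document vector, making embeddings depend on batch padding rather than filing content.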
FinBERTPCAProjector
Fits an IncrementalPCA model on train-only 768-dimensional embeddings, then transforms train/validation/test embeddings into 256-dimensional final vectors.
This train-only PCA rule is critical:
PCA is fit on train only.
Validation/test are transformed only.
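A sketch of this rule using scikit-learn's IncrementalPCA, which the projector is described as using. Dimensions here are toy-sized; the real arrays are (n_rows, 768) projected to (n_rows, 256), streamed in pca_batch_size batches:

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

def fit_train_only_pca(train: np.ndarray, n_components: int, batch_size: int) -> IncrementalPCA:
    """Fit PCA on the TRAIN split only, streaming in batches (memmap-friendly)."""
    pca = IncrementalPCA(n_components=n_components, batch_size=batch_size)
    for start in range(0, len(train), batch_size):
        pca.partial_fit(train[start:start + batch_size])
    return pca

rng = np.random.default_rng(0)
train = rng.normal(size=(64, 16))
val, test = rng.normal(size=(10, 16)), rng.normal(size=(10, 16))

pca = fit_train_only_pca(train, n_components=4, batch_size=32)
# val/test are TRANSFORMED only -- they are never passed to fit/partial_fit.
val_proj, test_proj = pca.transform(val), pca.transform(test)
```

Note that each partial_fit batch must contain at least n_components rows, which is one reason the default pca_batch_size (4096) is far larger than the 256 components.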
FinBERTHyperparameterSearch
Runs Optuna-based HPO for MLM settings. HPO is used to estimate useful training hyperparameters, but final training can still be extended if validation loss is clearly improving.
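The HPO loop can be illustrated without Optuna. The real class uses Optuna's study/trial API, but the shape is the same: sample hyperparameters, run a short MLM trial, score by validation loss, keep the best. The search ranges below are illustrative stand-ins, not the project's configured ranges:

```python
import random

def sample_params(rng: random.Random) -> dict:
    """Sample one MLM hyperparameter candidate (illustrative ranges)."""
    return {
        "lr": 10 ** rng.uniform(-5.0, -4.3),          # roughly 1e-5 .. 5e-5
        "mlm_probability": rng.uniform(0.10, 0.20),
        "warmup_ratio": rng.uniform(0.0, 0.1),
    }

def random_search(objective, n_trials: int = 20, seed: int = 0):
    """Return (best_params, best_score); lower objective (val loss) is better."""
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(n_trials):
        params = sample_params(rng)
        loss = objective(params)   # in the real pipeline: a short MLM run's val loss
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss
```

Because each trial trains only briefly, the winning parameters indicate a good region, not a final epoch budget, which is exactly why the document warns against treating HPO output as a blind instruction.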
FinBERTProjectedEncoder
This class is a future supervised-stage wrapper for a trainable 768-to-256 projection. It is not the main path for MLM-only embedding generation. The current final embeddings use train-only PCA projection.
The default dataset path is:
final/filings_finbert_chunks_balanced_25y_cap40000.csv
This file is expected to contain cleaned SEC filing chunks with metadata and text. The dataset is intentionally not the full raw SEC corpus; it is a cleaned, filtered, and balanced text dataset suitable for FinBERT processing.
The design supports the project’s storage and compute constraints: text extraction was large and difficult, but final FinBERT training uses a manageable chunked corpus.
Models are written under:
outputs/models/FinBERT/chunk1/
outputs/models/FinBERT/chunk2/
outputs/models/FinBERT/chunk3/
Expected model/export files include:
latest_checkpoint.pt
best_checkpoint.pt
model_freezed/
model_unfreezed/
The frozen export is used for deterministic embedding extraction. The unfrozen export is useful if additional fine-tuning is needed later.
For each chunk and split:
outputs/embeddings/FinBERT/chunk{chunk}_{split}_embeddings768.npy
outputs/embeddings/FinBERT/chunk{chunk}_{split}_metadata.csv
outputs/embeddings/FinBERT/chunk{chunk}_{split}_manifest768.json
These are the direct FinBERT pooled embeddings before dimensionality reduction.
For each chunk:
outputs/embeddings/FinBERT/chunk{chunk}_pca_768_to_256.pkl
outputs/embeddings/FinBERT/chunk{chunk}_pca_manifest.json
The PCA is fit on the train split only.
For each chunk and split:
outputs/embeddings/FinBERT/chunk{chunk}_{split}_embeddings.npy
outputs/embeddings/FinBERT/chunk{chunk}_{split}_manifest.json
The corresponding metadata CSV is shared with the 768 extraction:
outputs/embeddings/FinBERT/chunk{chunk}_{split}_metadata.csv
The final 256-dimensional embeddings are the primary downstream inputs.
Each embedding row must align exactly with one metadata row.
For each split:
chunk{chunk}_{split}_embeddings.npy row i
↕
chunk{chunk}_{split}_metadata.csv row i
The metadata contains the document identity and filing context:
chunk_id
doc_id
year
form_type
cik
filing_date
accession
source_name
chunk_index
word_count
This alignment is essential for traceability: every embedding row can be mapped back to its source filing, which downstream analysts and the XAI layer rely on.
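A minimal alignment check, assuming only the naming scheme above:

```python
import numpy as np
import pandas as pd
from pathlib import Path

def check_alignment(emb_path: Path, meta_path: Path) -> int:
    """Assert the row-level embeddings/metadata contract; return the row count."""
    emb = np.load(emb_path, mmap_mode="r")   # memmap: cheap even for large arrays
    meta = pd.read_csv(meta_path)
    if len(emb) != len(meta):
        raise ValueError(
            f"row mismatch: {len(emb)} embedding rows vs {len(meta)} metadata rows"
        )
    return len(emb)
```

Running this per chunk and split before training downstream analysts catches the most damaging failure mode (silent row misalignment) early.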
cd ~/fin-glassbox && python code/encoders/finbert_encoder.py inspect --repo-root .
cd ~/fin-glassbox && python code/encoders/finbert_encoder.py hpo --repo-root . --chunk 1 --trials 20 --batch-size 16 --eval-batch-size 64 --workers 6
HPO should be treated as guidance, not a blind final instruction. If HPO only tries very short training runs, final MLM training may still need more epochs if validation loss is clearly improving.
cd ~/fin-glassbox && python code/encoders/finbert_encoder.py train-mlm --repo-root . --chunk 1 --epochs 6 --batch-size 16 --eval-batch-size 64 --workers 6 --lr 2.5e-5 --weight-decay 0.0003 --warmup-ratio 0.03 --mlm-probability 0.14 --gradient-accumulation-steps 1 --early-stop-patience 2
cd ~/fin-glassbox && python code/encoders/finbert_encoder.py freeze --repo-root . --chunk 1
cd ~/fin-glassbox && python code/encoders/finbert_encoder.py embed768 --repo-root . --chunk 1 --split train --eval-batch-size 64 --workers 6 --overwrite && python code/encoders/finbert_encoder.py embed768 --repo-root . --chunk 1 --split val --eval-batch-size 64 --workers 6 --overwrite && python code/encoders/finbert_encoder.py embed768 --repo-root . --chunk 1 --split test --eval-batch-size 64 --workers 6 --overwrite
cd ~/fin-glassbox && python code/encoders/finbert_encoder.py fit-pca --repo-root . --chunk 1 --pca-batch-size 4096 --overwrite && python code/encoders/finbert_encoder.py project-pca --repo-root . --chunk 1 --split train --pca-batch-size 4096 --overwrite && python code/encoders/finbert_encoder.py project-pca --repo-root . --chunk 1 --split val --pca-batch-size 4096 --overwrite && python code/encoders/finbert_encoder.py project-pca --repo-root . --chunk 1 --split test --pca-batch-size 4096 --overwrite
cd ~/fin-glassbox && python code/encoders/finbert_encoder.py train-mlm --repo-root . --chunk 1 --base-model-name outputs/models/FinBERT/chunk1/model_unfreezed --epochs 15 --batch-size 16 --eval-batch-size 64 --workers 6 --lr 2.5e-5 --weight-decay 0.0003 --warmup-ratio 0.03 --mlm-probability 0.14 --gradient-accumulation-steps 1 --early-stop-patience 2 --no-resume && python code/encoders/finbert_encoder.py freeze --repo-root . --chunk 1 && python code/encoders/finbert_encoder.py embed768 --repo-root . --chunk 1 --split train --eval-batch-size 64 --workers 6 --overwrite && python code/encoders/finbert_encoder.py embed768 --repo-root . --chunk 1 --split val --eval-batch-size 64 --workers 6 --overwrite && python code/encoders/finbert_encoder.py embed768 --repo-root . --chunk 1 --split test --eval-batch-size 64 --workers 6 --overwrite && python code/encoders/finbert_encoder.py fit-pca --repo-root . --chunk 1 --pca-batch-size 4096 --overwrite && python code/encoders/finbert_encoder.py project-pca --repo-root . --chunk 1 --split train --pca-batch-size 4096 --overwrite && python code/encoders/finbert_encoder.py project-pca --repo-root . --chunk 1 --split val --pca-batch-size 4096 --overwrite && python code/encoders/finbert_encoder.py project-pca --repo-root . --chunk 1 --split test --pca-batch-size 4096 --overwrite
cd ~/fin-glassbox && python code/encoders/finbert_encoder.py train-mlm --repo-root . --chunk 2 --base-model-name outputs/models/FinBERT/chunk2/model_unfreezed --epochs 15 --batch-size 16 --eval-batch-size 64 --workers 6 --lr 2.5e-5 --weight-decay 0.0003 --warmup-ratio 0.03 --mlm-probability 0.14 --gradient-accumulation-steps 1 --early-stop-patience 2 --no-resume && python code/encoders/finbert_encoder.py freeze --repo-root . --chunk 2 && python code/encoders/finbert_encoder.py embed768 --repo-root . --chunk 2 --split train --eval-batch-size 64 --workers 6 --overwrite && python code/encoders/finbert_encoder.py embed768 --repo-root . --chunk 2 --split val --eval-batch-size 64 --workers 6 --overwrite && python code/encoders/finbert_encoder.py embed768 --repo-root . --chunk 2 --split test --eval-batch-size 64 --workers 6 --overwrite && python code/encoders/finbert_encoder.py fit-pca --repo-root . --chunk 2 --pca-batch-size 4096 --overwrite && python code/encoders/finbert_encoder.py project-pca --repo-root . --chunk 2 --split train --pca-batch-size 4096 --overwrite && python code/encoders/finbert_encoder.py project-pca --repo-root . --chunk 2 --split val --pca-batch-size 4096 --overwrite && python code/encoders/finbert_encoder.py project-pca --repo-root . --chunk 2 --split test --pca-batch-size 4096 --overwrite
cd ~/fin-glassbox && python code/encoders/finbert_encoder.py train-mlm --repo-root . --chunk 3 --base-model-name outputs/models/FinBERT/chunk3/model_unfreezed --epochs 10 --batch-size 16 --eval-batch-size 64 --workers 6 --lr 2.959475667731825e-05 --weight-decay 0.0003438172512115178 --warmup-ratio 0.030662654054832286 --mlm-probability 0.13584657285442728 --gradient-accumulation-steps 1 --early-stop-patience 2 --no-resume && python code/encoders/finbert_encoder.py freeze --repo-root . --chunk 3 && python code/encoders/finbert_encoder.py embed768 --repo-root . --chunk 3 --split train --eval-batch-size 64 --workers 6 --overwrite && python code/encoders/finbert_encoder.py embed768 --repo-root . --chunk 3 --split val --eval-batch-size 64 --workers 6 --overwrite && python code/encoders/finbert_encoder.py embed768 --repo-root . --chunk 3 --split test --eval-batch-size 64 --workers 6 --overwrite && python code/encoders/finbert_encoder.py fit-pca --repo-root . --chunk 3 --pca-batch-size 4096 --overwrite && python code/encoders/finbert_encoder.py project-pca --repo-root . --chunk 3 --split train --pca-batch-size 4096 --overwrite && python code/encoders/finbert_encoder.py project-pca --repo-root . --chunk 3 --split val --pca-batch-size 4096 --overwrite && python code/encoders/finbert_encoder.py project-pca --repo-root . --chunk 3 --split test --pca-batch-size 4096 --overwrite
cd ~/fin-glassbox && python - <<'PY'
import numpy as np
import pandas as pd
from pathlib import Path

base = Path('outputs/embeddings/FinBERT')
for chunk in [1, 2, 3]:
    print(f'===== chunk{chunk} =====')
    for split in ['train', 'val', 'test']:
        emb = base / f'chunk{chunk}_{split}_embeddings.npy'
        meta = base / f'chunk{chunk}_{split}_metadata.csv'
        man = base / f'chunk{chunk}_{split}_manifest.json'
        if emb.exists():
            arr = np.load(emb, mmap_mode='r')
            print(emb, arr.shape, 'finite=', float(np.isfinite(arr[:min(1000, len(arr))]).mean()))
        else:
            print('MISSING', emb)
        if meta.exists():
            print(meta, pd.read_csv(meta, nrows=1).shape, 'rows=', sum(1 for _ in open(meta)) - 1)
        else:
            print('MISSING', meta)
        print(('OK ' if man.exists() else 'MISSING ') + str(man))
PY
cd ~/fin-glassbox && python - <<'PY'
import pandas as pd
from pathlib import Path

for c in [1, 2, 3]:
    p = Path(f'outputs/results/FinBERT/chunk{c}_mlm_history.csv')
    print('\n===== chunk', c, '=====')
    if not p.exists():
        print('missing', p)
        continue
    h = pd.read_csv(p)
    print(h.tail(10).to_string(index=False))
    if 'val_loss' in h.columns:
        print('best val_loss:', float(h['val_loss'].min()))
        print(h.loc[h['val_loss'].idxmin()].to_string())
PY
cd ~/fin-glassbox && ls -lh --time-style=long-iso outputs/embeddings/FinBERT/chunk*_embeddings.npy outputs/embeddings/FinBERT/chunk*_metadata.csv | sort
FinBERT should not be trained indefinitely. Use validation loss trends.
Continue training if:
validation loss is still clearly decreasing
and GPU time is available
and downstream sentiment/news performance remains poor
Stop training if:
validation loss plateaus
validation loss worsens
learning rate has decayed close to zero
or improvements become too small to justify more GPU time
A practical rule:
If new best validation loss improves by less than about 1–2% over several epochs, stop and move downstream.
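This rule can be made concrete. The 1.5% threshold and three-epoch window below are one way to parameterise the document's suggestion, not fixed project values:

```python
def should_stop(val_losses: list, min_rel_improvement: float = 0.015,
                patience: int = 3) -> bool:
    """Stop when the best validation loss of the last `patience` epochs fails to
    beat the earlier best by at least `min_rel_improvement` (relative)."""
    if len(val_losses) <= patience:
        return False                      # not enough history to judge a plateau
    best_before = min(val_losses[:-patience])
    best_recent = min(val_losses[-patience:])
    return best_recent > best_before * (1 - min_rel_improvement)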
If Sentiment Analyst accuracy remains poor after improved FinBERT embeddings, the bottleneck is likely downstream: sentiment labels, class balance, target definitions, or the analyst architecture itself.
Sentiment Analyst
Consumes:
outputs/embeddings/FinBERT/chunk{chunk}_{split}_embeddings.npy
outputs/embeddings/FinBERT/chunk{chunk}_{split}_metadata.csv
Produces sentiment polarity, confidence, uncertainty, and related task outputs.
News Analyst
Consumes the same or derived FinBERT-based representations and predicts event impact, importance, risk relevance, volatility-spike risk, and drawdown-risk relevance.
Regime Risk Module
May use FinBERT embeddings aggregated to a stock/date context and combined with temporal embeddings and macro variables.
Qualitative Analyst
Does not directly re-encode text. It combines the outputs of Sentiment Analyst and News Analyst into a trained qualitative branch signal.
Fusion Engine
Will later consume qualitative and quantitative branch outputs. FinBERT's contribution enters Fusion indirectly through the trained text-side modules.
FinBERT itself is not the final explanation layer. However, it supports XAI through strict embedding-to-metadata alignment and per-split manifests.
This allows downstream explanations to point back to the original filing context.
Example XAI chain:
Final decision explanation
→ Qualitative Analyst explanation
→ News/Sentiment Analyst explanation
→ FinBERT embedding row
→ SEC filing metadata
→ document ID / accession / section / filing date
This traceability is central to the project’s “glass-box” philosophy.
Final FinBERT artefacts use:
.npy for embeddings
.csv for metadata and training histories
.json for manifests and configuration summaries
.pkl for PCA models only
.pt for PyTorch checkpoints
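The manifest schema is not specified in this document, but a typical manifest for a memory-mapped .npy export records enough to reload and validate the array. The field names below are illustrative, not the project's exact schema:

```python
import json
import numpy as np
from pathlib import Path

def write_manifest(emb_path: Path, manifest_path: Path) -> dict:
    """Record shape, dtype, and row count for an embeddings array (illustrative fields)."""
    arr = np.load(emb_path, mmap_mode="r")
    manifest = {
        "file": emb_path.name,
        "rows": int(arr.shape[0]),
        "dim": int(arr.shape[1]),
        "dtype": str(arr.dtype),
    }
    manifest_path.write_text(json.dumps(manifest, indent=2))
    return manifest
```

Pairing each array with a small JSON record like this lets the completion checklist below be verified without loading any embeddings.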
The project intentionally avoids unnecessary intermediate outputs in final documentation. Raw 768 embeddings may be retained during development, but final downstream modules should use the 256-dimensional embeddings unless a specific experiment requires raw 768 features.
Symptom:
embedding rows != metadata rows
Fix: rerun the affected embed768 command and then rerun PCA projection.
Symptom:
PCA appears to have been fit on validation or test embeddings (leakage risk)
Fix: delete the PCA file for that chunk and rerun fit-pca using only the train embeddings.
Symptom:
Frozen model directory not found
Fix:
cd ~/fin-glassbox && python code/encoders/finbert_encoder.py freeze --repo-root . --chunk 1
HPO may choose short runs because it is optimising a small trial budget. This should not automatically limit final domain-adaptive MLM training. Use validation-loss trends to decide final epochs.
Improving FinBERT may help, but it will not guarantee high sentiment accuracy. If downstream accuracy remains low, inspect sentiment labels, class balance, target definitions, and analyst architecture.
A chunk is considered complete when all of the following exist:
outputs/models/FinBERT/chunk{chunk}/model_freezed/
outputs/models/FinBERT/chunk{chunk}/model_unfreezed/
outputs/embeddings/FinBERT/chunk{chunk}_train_embeddings.npy
outputs/embeddings/FinBERT/chunk{chunk}_val_embeddings.npy
outputs/embeddings/FinBERT/chunk{chunk}_test_embeddings.npy
outputs/embeddings/FinBERT/chunk{chunk}_train_metadata.csv
outputs/embeddings/FinBERT/chunk{chunk}_val_metadata.csv
outputs/embeddings/FinBERT/chunk{chunk}_test_metadata.csv
outputs/embeddings/FinBERT/chunk{chunk}_train_manifest.json
outputs/embeddings/FinBERT/chunk{chunk}_val_manifest.json
outputs/embeddings/FinBERT/chunk{chunk}_test_manifest.json
outputs/embeddings/FinBERT/chunk{chunk}_pca_768_to_256.pkl
outputs/embeddings/FinBERT/chunk{chunk}_pca_manifest.json
The FinBERT Encoder is the reusable financial text representation layer for the project. It adapts FinBERT to SEC filing language using masked language modelling, exports frozen models, extracts 768-dimensional document embeddings, projects them to 256 dimensions using train-only PCA, and preserves strict row-level metadata alignment.
Its role is not to make decisions. Its role is to provide the text-side representation that allows downstream analyst modules to reason about sentiment, event risk, news impact, regime context, and qualitative evidence.
The final architecture depends on FinBERT because it is the bridge between raw financial disclosure text and the system’s explainable multimodal decision process.