Project: fin-glassbox — Explainable Multimodal Neural Framework for Financial Risk Management
Module: Text Encoder / SEC Filing Encoder
Encoder family: FinBERT, domain-adaptive fine-tuned with Masked Language Modelling
Final embedding dimensionality: 256
Status: Completed, verified, committed, and pushed
Final commit: 88c8817
The FinBERT Text Encoder converts SEC filing text chunks into dense, chronologically safe, row-aligned numerical embeddings for downstream financial-risk modelling. It is the text stream of the wider multi-modal architecture, where final asset representations are expected to combine:
Temporal embedding: 128 dimensions
Text embedding: 256 dimensions
Fundamental embedding: 128 dimensions
Total asset vector: 512 dimensions
The encoder was fine-tuned using domain-adaptive Masked Language Modelling (MLM) on SEC filing text chunks. This is a legitimate self-supervised fine-tuning stage: FinBERT weights were updated to adapt the financial language model to SEC disclosure language before producing final embeddings.
The current encoder is not yet supervised on market labels such as returns, volatility, drawdown, or risk classes. That will be a later phase after labelled outcomes are generated from market data.
The completed run produced:
9 final 256-dimensional embedding matrices
9 row-aligned metadata CSV files
9 final manifest JSON files
3 frozen FinBERT model exports
3 unfrozen FinBERT model exports
The final outputs are organised by three chronological chunks:
Chunk 1: train 2000–2004, validation 2005, test 2006
Chunk 2: train 2007–2014, validation 2015, test 2016
Chunk 3: train 2017–2022, validation 2023, test 2024
The FinBERT Text Encoder is responsible for transforming raw SEC filing text chunks into compact learned representations that can be consumed by:
Sentiment Analyst
News Analyst
Regime Model
Fusion Model
Risk Engine
Downstream explainability modules
The encoder addresses the text stream of the architecture. It does not directly make risk predictions. Instead, it produces stable, reusable embeddings that downstream modules can interpret, aggregate, classify, score, or fuse with market, fundamental, macroeconomic, and graph-based features.
The encoder's outputs are designed for direct reuse: the final .npy embedding arrays can be loaded by downstream modules without rerunning FinBERT.
Main encoder implementation:
code/encoders/finbert_encoder.py
Full end-to-end pipeline runner:
code/encoders/run_finbert_full_pipeline.py
Resume-after-HPO runner:
code/encoders/run_finbert_resume_after_hpo.py
Specification document:
code/config/TextEncoder.md
Recommended future downstream modules:
code/analysts/text_embedding_loader.py
code/analysts/sentiment_analyst.py
code/analysts/news_analyst.py
code/analysts/AnalystModels.md
Primary dataset (only present on the remote GPU):
final/filings_finbert_chunks_balanced_25y_cap40000.csv
This file contains SEC filing text chunks from 2000 to 2024.
Expected columns:
chunk_id
doc_id
year
form_type
cik
filing_date
accession
source_name
chunk_index
word_count
text
The text column is the direct textual input to FinBERT. All other columns are treated as metadata and are preserved into the output metadata CSVs.
Final input dataset size:
Rows excluding header: 989,244
Rows including header: 989,245
Coverage: 25 years, 2000–2024
Cap: approximately 40,000 chunks per year
Final year distribution after balancing/capping:
2000: 34,620
2001: 34,624
2002: 40,000
2003: 40,000
2004: 40,000
2005: 40,000
2006: 40,000
2007: 40,000
2008: 40,000
2009: 40,000
2010: 40,000
2011: 40,000
2012: 40,000
2013: 40,000
2014: 40,000
2015: 40,000
2016: 40,000
2017: 40,000
2018: 40,000
2019: 40,000
2020: 40,000
2021: 40,000
2022: 40,000
2023: 40,000
2024: 40,000
The encoder uses three non-overlapping chronological chunks. Each chunk has its own model export and PCA projection, fitted only on that chunk’s training split.
Chunk 1:
Train: 2000–2004
Validation: 2005
Test: 2006
Final row counts:
chunk1_train: 189,244 rows
chunk1_val: 40,000 rows
chunk1_test: 40,000 rows
Chunk 2:
Train: 2007–2014
Validation: 2015
Test: 2016
Final row counts:
chunk2_train: 320,000 rows
chunk2_val: 40,000 rows
chunk2_test: 40,000 rows
Chunk 3:
Train: 2017–2022
Validation: 2023
Test: 2024
Final row counts:
chunk3_train: 240,000 rows
chunk3_val: 40,000 rows
chunk3_test: 40,000 rows
These splits are designed to prevent temporal leakage. Validation and test years are always later than their corresponding training years.
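As a hedged illustration, the chronological splits above can be expressed as a plain year mapping and applied to the dataset's year column; the dictionary and helper name below are illustrative and not part of the committed pipeline.
import pandas as pd

CHUNK_YEARS = {
    1: {"train": range(2000, 2005), "val": [2005], "test": [2006]},
    2: {"train": range(2007, 2015), "val": [2015], "test": [2016]},
    3: {"train": range(2017, 2023), "val": [2023], "test": [2024]},
}

def select_split(df: pd.DataFrame, chunk: int, split: str) -> pd.DataFrame:
    # Filter rows on the documented `year` metadata column for one chunk/split.
    years = set(CHUNK_YEARS[chunk][split])
    return df[df["year"].isin(years)].reset_index(drop=True)

# Example: chunk 1 training rows (years 2000-2004)
# train_df = select_split(pd.read_csv("final/filings_finbert_chunks_balanced_25y_cap40000.csv"), 1, "train")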
Base model:
ProsusAI/finbert
Model family:
BERT-based financial language model
Tokenizer:
AutoTokenizer
MLM training model:
AutoModelForMaskedLM
Embedding extraction model:
AutoModel
Maximum token length:
512 tokens
Texts longer than 512 tokens are truncated during tokenisation.
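A minimal sketch of this tokenisation step, assuming the Hugging Face AutoTokenizer for ProsusAI/finbert; the fixed padding strategy shown is an assumption, dynamic padding would also work.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")

def tokenize_chunks(texts):
    # Anything beyond 512 tokens is truncated, matching the rule stated above.
    return tokenizer(
        list(texts),
        padding="max_length",
        truncation=True,
        max_length=512,
        return_tensors="pt",
    )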
The completed training stage used:
Domain-adaptive Masked Language Modelling
This means FinBERT was adapted to SEC filing language by masking tokens inside filing chunks and training the model to predict the masked tokens.
Current objective:
Predict masked tokens from SEC filing text chunks
The model was not trained directly on:
future returns
future excess returns
volatility spikes
drawdown outcomes
risk labels
Those supervised targets will be generated later and may be used for a second-stage supervised fine-tuning run.
The current text dataset contains SEC text and metadata but does not yet contain robust supervised financial labels. Therefore, MLM was the correct first-stage fine-tuning objective.
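A minimal sketch of one MLM training step under these assumptions: the standard Hugging Face DataCollatorForLanguageModeling handles the masking, and the 15% mask probability and learning rate are illustrative rather than the HPO-selected values.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")
model = AutoModelForMaskedLM.from_pretrained("ProsusAI/finbert").cuda()
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

def mlm_step(examples):
    # `examples` is a list of dicts with input_ids/attention_mask per text chunk;
    # the collator masks random tokens and builds the labels tensor.
    batch = {k: v.cuda() for k, v in collator(examples).items()}
    loss = model(**batch).loss  # cross-entropy on the masked positions only
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()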
MLM adapts the model to:
SEC disclosure language
risk factor wording
MD&A wording
governance disclosures
legal reporting structure
financial statement phrasing
material event language
formal issuer-reporting style
Recommended lifecycle:
Stage 1: Domain-adaptive MLM fine-tuning on SEC filings
Stage 2: Generate supervised market/risk labels
Stage 3: Optional supervised FinBERT fine-tuning
Stage 4: Freeze encoder
Stage 5: Generate stable embeddings
Stage 6: Feed embeddings into analysts/fusion/risk models
Remote GPU run hardware:
CPU: AMD Ryzen 5 7600
CPU cores: 6
RAM: 64 GB
GPU: NVIDIA GeForce RTX 3090 Ti
GPU memory: 24 GB
CUDA: working after NVIDIA driver reboot/fix
Python environment:
Virtual environment: venv3.12.7
Python: 3.12.7
PyTorch: CUDA-enabled
Important CUDA validation command:
python -c "import torch; print('torch', torch.__version__); print('cuda available', torch.cuda.is_available()); print('gpu', torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'NO CUDA')"
Expected CUDA validation result:
cuda available True
gpu NVIDIA GeForce RTX 3090 Ti
Final full-run configuration used these operational assumptions:
processor: cuda
fp16: true
num_workers: 6
batch_size: 16 (selected by HPO; confirmed in the final-run logs)
eval_batch_size: 32 (selected by HPO; confirmed in the final-run logs)
epochs: 3 per chunk
random_seed: 42
sample_mode: balanced-year
Training was performed chunk-wise:
Chunk 1: 3 MLM epochs
Chunk 2: 3 MLM epochs
Chunk 3: resumed and completed to 3 MLM epochs
Total planned full training schedule:
3 chunks × 3 epochs = 9 chunk-level MLM epochs
Hyperparameter search was implemented with:
Optuna
TPE sampler
Median pruner
SQLite persistent study
Search storage during training:
outputs/codeResults/FinBERT/hpo/finbert_optuna.db
Best parameter file during training:
outputs/codeResults/FinBERT/hpo/finbert_mlm_chunk3_final_best_params.json
Trial history during training:
outputs/codeResults/FinBERT/hpo/finbert_mlm_chunk3_final_trials.csv
The HPO stage was run on chunk 3 using a training sample. The selected parameters were then reused for full chronological training across chunks 1, 2, and 3.
The run_finbert_resume_after_hpo.py runner does not rerun HPO. It loads the best HPO parameters and performs/resumes full training and embedding extraction.
HPO artefacts are not part of the final deliverable commit.
Chunk 2 MLM training history:
epoch 1: train_loss = 1.573783, val_loss = 1.083834
epoch 2: train_loss = 1.101170, val_loss = 0.990627
epoch 3: train_loss = 1.031179, val_loss = 0.952888
Chunk 2 improved every epoch and exported successfully.
Chunk 3 MLM training history (resumed from checkpoint):
epoch 2: train_loss = 1.676405, val_loss = 1.175561
epoch 3: train_loss = 1.201251, val_loss = 1.071729
Chunk 3 resumed from checkpoint, completed training, improved validation loss, and exported successfully.
Chunk 1 completed successfully and produced model exports and embeddings. Exact epoch-level values should be read from:
outputs/results/FinBERT/chunk1_mlm_history.csv
This file was treated as a training artefact and is not part of the minimal final deliverable unless intentionally retained.
During the training run, checkpoints were saved after every epoch for fault tolerance.
Training-time checkpoint types:
latest_checkpoint.pt
best_checkpoint.pt
epoch_001.pt
epoch_002.pt
epoch_003.pt
Checkpoint contents included:
chunk_id
model_state
optimizer_state
scheduler_state
scaler_state
best_val_loss
no_improve
history
config
epoch
saved_at
Important resume rule:
Chunk-specific training must only resume from that chunk's latest checkpoint.
A global latest checkpoint must not be used across chunks.
A checkpoint-resume bug was corrected so chunk-specific training does not accidentally reuse a checkpoint from a different chunk.
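A minimal sketch of that resume rule; the per-chunk directory layout is inferred from the excluded checkpoint paths listed below, and the helper name is illustrative.
from pathlib import Path
import torch

def load_chunk_checkpoint(chunk_id: int, base: str = "outputs/models/FinBERT"):
    # Only ever look inside this chunk's own directory, never a global checkpoint.
    ckpt_path = Path(base) / f"chunk{chunk_id}" / "latest_checkpoint.pt"
    if not ckpt_path.exists():
        return None  # start the chunk from scratch
    state = torch.load(ckpt_path, map_location="cpu")
    # Guard against the corrected bug: refuse to resume from another chunk's state.
    if state["chunk_id"] != chunk_id:
        raise ValueError(f"checkpoint belongs to chunk {state['chunk_id']}, not chunk {chunk_id}")
    return state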
Checkpoints were useful during training but are not final deliverables. They were deleted before final storage cleanup and excluded from the final commit.
Excluded from final deliverable:
outputs/models/FinBERT/latest_checkpoint.pt
outputs/models/FinBERT/**/latest_checkpoint.pt
outputs/models/FinBERT/**/best_checkpoint.pt
outputs/models/FinBERT/**/epoch_*.pt
outputs/models/FinBERT/hpo/
After each chunk finished training, the best available model was exported in two forms.
Frozen model exports:
outputs/models/FinBERT/chunk1/model_freezed/
outputs/models/FinBERT/chunk2/model_freezed/
outputs/models/FinBERT/chunk3/model_freezed/
Unfrozen model exports:
outputs/models/FinBERT/chunk1/model_unfreezed/
outputs/models/FinBERT/chunk2/model_unfreezed/
outputs/models/FinBERT/chunk3/model_unfreezed/
The frozen model is intended for stable embedding extraction and downstream reproducibility.
The unfrozen model is preserved so that later supervised fine-tuning can continue from the domain-adapted model rather than from the original ProsusAI/finbert base.
Typical exported Hugging Face directory contents:
config.json
generation_config.json
model.safetensors
special_tokens_map.json
tokenizer.json
tokenizer_config.json
vocab.txt
Frozen model directories additionally include:
FREEZE_NOTE.json
Embedding extraction was performed after model export.
Pooling method:
Mean pooling over the last hidden states using the attention mask (see the sketch after this list)
Raw embedding dimensionality:
768
Final downstream dimensionality:
256
Data type:
float32
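A minimal sketch of the masked mean pooling referenced above, assuming a Hugging Face AutoModel forward pass; variable names are illustrative.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")
model = AutoModel.from_pretrained("ProsusAI/finbert").eval()

@torch.no_grad()
def mean_pool_embed(texts):
    enc = tokenizer(list(texts), padding=True, truncation=True, max_length=512, return_tensors="pt")
    hidden = model(**enc).last_hidden_state             # (batch, seq_len, 768)
    mask = enc["attention_mask"].unsqueeze(-1).float()  # (batch, seq_len, 1); zero for padding
    summed = (hidden * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return (summed / counts).to(torch.float32)          # (batch, 768) raw embeddings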
The raw 768-dimensional embeddings were generated temporarily during the pipeline, but they were not retained in the final minimal deliverable commit.
The architecture requires a 256-dimensional text vector. The encoder therefore projects FinBERT’s 768-dimensional output to 256 dimensions using IncrementalPCA.
Projection procedure per chunk:
1. Extract 768-dimensional train embeddings.
2. Fit IncrementalPCA on train split only.
3. Transform train, validation, and test splits with the train-fitted PCA.
4. Save final 256-dimensional embeddings.
This avoids validation/test leakage because PCA is fitted only on the training split.
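A minimal sketch of this per-chunk projection, assuming scikit-learn's IncrementalPCA; the batch size and function name are illustrative.
import numpy as np
from sklearn.decomposition import IncrementalPCA

def fit_and_project(train_768, val_768, test_768, n_components=256, batch_size=4096):
    # Step 2: fit on the training split only; fit() consumes the data in
    # batches of `batch_size`, so memory stays bounded.
    pca = IncrementalPCA(n_components=n_components, batch_size=batch_size)
    pca.fit(train_768)
    print("explained variance sum:", pca.explained_variance_ratio_.sum())
    # Step 3: transform every split with the train-fitted PCA.
    project = lambda x: pca.transform(x).astype(np.float32)
    return pca, project(train_768), project(val_768), project(test_768)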
Observed explained variance ratios:
chunk1 PCA 768→256 explained variance sum: 0.9876773542022481
chunk2 PCA 768→256 explained variance sum: 0.9776208752921536
chunk3 PCA 768→256 explained variance sum: 0.9805763307622046
PCA files were generated during the run:
outputs/embeddings/FinBERT/chunk1_pca_768_to_256.pkl
outputs/embeddings/FinBERT/chunk2_pca_768_to_256.pkl
outputs/embeddings/FinBERT/chunk3_pca_768_to_256.pkl
PCA files are training/projection artefacts and were excluded from the final minimal deliverable commit after the final 256-dimensional embeddings were generated.
All downstream modules should treat the final text embeddings as the stable public interface.
Input to downstream model per row:
shape: (256,)
dtype: float32
Batch input:
shape: (batch_size, 256)
dtype: float32
Final files:
outputs/embeddings/FinBERT/chunk1_train_embeddings.npy
outputs/embeddings/FinBERT/chunk1_val_embeddings.npy
outputs/embeddings/FinBERT/chunk1_test_embeddings.npy
outputs/embeddings/FinBERT/chunk2_train_embeddings.npy
outputs/embeddings/FinBERT/chunk2_val_embeddings.npy
outputs/embeddings/FinBERT/chunk2_test_embeddings.npy
outputs/embeddings/FinBERT/chunk3_train_embeddings.npy
outputs/embeddings/FinBERT/chunk3_val_embeddings.npy
outputs/embeddings/FinBERT/chunk3_test_embeddings.npy
Final verified shapes:
chunk1_train: (189244, 256)
chunk1_val: (40000, 256)
chunk1_test: (40000, 256)
chunk2_train: (320000, 256)
chunk2_val: (40000, 256)
chunk2_test: (40000, 256)
chunk3_train: (240000, 256)
chunk3_val: (40000, 256)
chunk3_test: (40000, 256)
Important correction:
An earlier smoke-test file produced chunk3_val with shape (1000, 256).
This was detected; the file was regenerated and re-verified, and the corrected artefact was amended into the final commit and pushed.
The final correct chunk3_val shape is (40000, 256).
Verification command:
python -c 'import numpy as np; from pathlib import Path; base=Path("outputs/embeddings/FinBERT"); [print(f"chunk{c}_{s}", np.load(base/f"chunk{c}_{s}_embeddings.npy", mmap_mode="r").shape) for c in (1,2,3) for s in ("train","val","test")]'
Expected output:
chunk1_train (189244, 256)
chunk1_val (40000, 256)
chunk1_test (40000, 256)
chunk2_train (320000, 256)
chunk2_val (40000, 256)
chunk2_test (40000, 256)
chunk3_train (240000, 256)
chunk3_val (40000, 256)
chunk3_test (40000, 256)
Each embedding matrix has a matching metadata file.
Metadata files:
outputs/embeddings/FinBERT/chunk1_train_metadata.csv
outputs/embeddings/FinBERT/chunk1_val_metadata.csv
outputs/embeddings/FinBERT/chunk1_test_metadata.csv
outputs/embeddings/FinBERT/chunk2_train_metadata.csv
outputs/embeddings/FinBERT/chunk2_val_metadata.csv
outputs/embeddings/FinBERT/chunk2_test_metadata.csv
outputs/embeddings/FinBERT/chunk3_train_metadata.csv
outputs/embeddings/FinBERT/chunk3_val_metadata.csv
outputs/embeddings/FinBERT/chunk3_test_metadata.csv
The row order of each metadata CSV must match the row order of the corresponding .npy file exactly.
Expected metadata fields:
chunk_id
doc_id
year
form_type
cik
filing_date
accession
source_name
chunk_index
word_count
Downstream loaders must enforce:
number of embedding rows == number of metadata rows
metadata year values match expected split years
metadata order is never shuffled unless embeddings are shuffled identically
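A minimal sketch of these checks; the expected split years come from the chronological chunk definition earlier in this document, and the function name is illustrative.
import numpy as np
import pandas as pd

def check_alignment(emb_path, meta_path, expected_years):
    emb = np.load(emb_path, mmap_mode="r")
    meta = pd.read_csv(meta_path)
    assert emb.shape[0] == len(meta), "embedding rows != metadata rows"
    assert emb.shape[1] == 256, "unexpected embedding dimensionality"
    assert set(meta["year"].unique()) <= set(expected_years), "metadata years outside split"
    return emb, meta

# Example: the chunk 1 validation split should contain only year 2005.
# check_alignment("outputs/embeddings/FinBERT/chunk1_val_embeddings.npy",
#                 "outputs/embeddings/FinBERT/chunk1_val_metadata.csv", {2005})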
Each final 256-dimensional embedding file has a manifest.
Manifest files:
outputs/embeddings/FinBERT/chunk1_train_manifest.json
outputs/embeddings/FinBERT/chunk1_val_manifest.json
outputs/embeddings/FinBERT/chunk1_test_manifest.json
outputs/embeddings/FinBERT/chunk2_train_manifest.json
outputs/embeddings/FinBERT/chunk2_val_manifest.json
outputs/embeddings/FinBERT/chunk2_test_manifest.json
outputs/embeddings/FinBERT/chunk3_train_manifest.json
outputs/embeddings/FinBERT/chunk3_val_manifest.json
outputs/embeddings/FinBERT/chunk3_test_manifest.json
Expected manifest fields:
chunk_id
split
rows
dim
embedding_file
metadata_file
pca_file
sha256_embeddings
created_at
projection
The pca_file field may reference a PCA file that existed during generation but was not retained in the minimal final commit. The final .npy embeddings are the stable deliverable.
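A minimal sketch for verifying a manifest against its embedding file. It assumes embedding_file holds a path resolvable from the repository root and that sha256_embeddings is the digest of the raw .npy bytes; both assumptions should be checked against the actual manifest writer.
import hashlib
import json
from pathlib import Path

def verify_manifest(manifest_path):
    manifest = json.loads(Path(manifest_path).read_text())
    digest = hashlib.sha256(Path(manifest["embedding_file"]).read_bytes()).hexdigest()
    ok = digest == manifest["sha256_embeddings"]
    print(manifest["chunk_id"], manifest["split"], "rows", manifest["rows"],
          "dim", manifest["dim"], "sha256_ok", ok)
    return ok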
Allowed formats:
.npy final embeddings
.csv metadata
.json manifests and configuration
.pt temporary PyTorch checkpoints during training only
.pkl temporary PCA files during projection only
Hugging Face model directories for exported models
Disallowed for this module:
Parquet
The final committed deliverable intentionally uses .npy, .csv, .json, and Hugging Face model.safetensors directories.
The final pushed deliverable includes the necessary files for downstream use:
code/encoders/finbert_encoder.py
code/encoders/run_finbert_full_pipeline.py
code/encoders/run_finbert_resume_after_hpo.py
outputs/embeddings/FinBERT/chunk*_train_embeddings.npy
outputs/embeddings/FinBERT/chunk*_val_embeddings.npy
outputs/embeddings/FinBERT/chunk*_test_embeddings.npy
outputs/embeddings/FinBERT/chunk*_train_metadata.csv
outputs/embeddings/FinBERT/chunk*_val_metadata.csv
outputs/embeddings/FinBERT/chunk*_test_metadata.csv
outputs/embeddings/FinBERT/chunk*_train_manifest.json
outputs/embeddings/FinBERT/chunk*_val_manifest.json
outputs/embeddings/FinBERT/chunk*_test_manifest.json
outputs/models/FinBERT/chunk*/model_freezed/
outputs/models/FinBERT/chunk*/model_unfreezed/
The final pushed commit:
88c8817 add final FinBERT encoder models and embeddings
Final push confirmation:
0d9a597..88c8817 main -> main
The following were excluded from the final deliverable because they are training/intermediate artefacts:
outputs/models/FinBERT/hpo/
outputs/models/FinBERT/latest_checkpoint.pt
outputs/models/FinBERT/**/latest_checkpoint.pt
outputs/models/FinBERT/**/best_checkpoint.pt
outputs/models/FinBERT/**/epoch_*.pt
outputs/embeddings/FinBERT/*embeddings768.npy
outputs/embeddings/FinBERT/*manifest768.json
outputs/embeddings/FinBERT/*pca_768_to_256.pkl
outputs/embeddings/FinBERT/*pca_manifest.json
outputs/codeResults/FinBERT/hpo/
outputs/codeResults/FinBERT/*.log
outputs/results/FinBERT/hpo/
Recommended .gitignore rules:
# FinBERT non-deliverable heavy artefacts
outputs/models/FinBERT/hpo/
outputs/models/FinBERT/latest_checkpoint.pt
outputs/models/FinBERT/**/latest_checkpoint.pt
outputs/models/FinBERT/**/best_checkpoint.pt
outputs/models/FinBERT/**/epoch_*.pt
outputs/embeddings/FinBERT/*embeddings768.npy
outputs/embeddings/FinBERT/*manifest768.json
outputs/embeddings/FinBERT/*pca_768_to_256.pkl
outputs/embeddings/FinBERT/*pca_manifest.json
outputs/codeResults/FinBERT/hpo/
outputs/codeResults/FinBERT/*.log
outputs/results/FinBERT/
Recommended safety check before committing:
git diff --cached --name-only | grep -E "checkpoint|epoch_.*\.pt|/hpo/|embeddings768|manifest768|pca_768|pca_manifest|\.log" && echo "BAD FILES STAGED - STOP" || echo "STAGED FILES ARE CLEAN"
Recommended tracked-file check:
git ls-files | grep -E "checkpoint|epoch_.*\.pt|/hpo/|embeddings768|manifest768|pca_768|pca_manifest|\.log" && echo "BAD TRACKED FILES - STOP" || echo "TRACKED FILES ARE CLEAN"
Large binary files should be tracked with Git LFS.
Recommended LFS tracking:
git lfs track "outputs/embeddings/FinBERT/*.npy" "outputs/models/FinBERT/**/*.safetensors"
Useful verification:
git lfs ls-files | grep -E "outputs/embeddings/FinBERT|outputs/models/FinBERT"
Final push uploaded the relevant LFS objects successfully.
Any downstream loader for the text encoder should implement the following checks:
the final 256-dimensional .npy embedding file exists
the matching row-aligned metadata CSV exists and has the same number of rows
.npy files are loaded with mmap_mode='r' for memory efficiency where possible
Recommended loader target:
code/analysts/text_embedding_loader.py
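A minimal sketch of such a loader, combining the existence, memory-mapping, and row-alignment checks; the function name and base-path constant are suggestions.
from pathlib import Path
import numpy as np
import pandas as pd

BASE = Path("outputs/embeddings/FinBERT")

def load_text_embeddings(chunk: int, split: str):
    emb_path = BASE / f"chunk{chunk}_{split}_embeddings.npy"
    meta_path = BASE / f"chunk{chunk}_{split}_metadata.csv"
    if not emb_path.exists() or not meta_path.exists():
        raise FileNotFoundError(f"missing embeddings or metadata for chunk{chunk}_{split}")
    embeddings = np.load(emb_path, mmap_mode="r")  # (rows, 256) float32, memory-mapped
    metadata = pd.read_csv(meta_path)
    if embeddings.shape[0] != len(metadata):
        raise ValueError("embedding and metadata row counts do not match")
    return embeddings, metadata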
The Sentiment Analyst and News Analyst should consume the final 256-dimensional embeddings.
Recommended input contract:
embedding: np.ndarray # shape (256,), dtype float32
metadata: dict # row-aligned SEC filing metadata
Recommended batch contract:
embeddings: np.ndarray # shape (batch_size, 256), dtype float32
metadata_df: DataFrame # length == batch_size
Recommended MLP preference:
Use tanh for hidden activations.
Use sigmoid only when mathematically required for binary probability output.
Suggested downstream modules:
code/analysts/text_embedding_loader.py
code/analysts/sentiment_analyst.py
code/analysts/news_analyst.py
Sentiment Analyst:
Input:
256-dimensional FinBERT text embedding
optional metadata features such as year, form_type, word_count, chunk_index
Possible outputs:
sentiment_score in [-1, 1]
uncertainty_score
64-dimensional sentiment representation for fusion
Recommended architecture:
Input 256
LayerNorm
Linear 256 → 128
Tanh
Dropout
Linear 128 → 64
Tanh
Dropout
Output head(s)
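A minimal PyTorch sketch of this architecture; the dropout rate and exact output heads are illustrative, with tanh keeping sentiment_score in [-1, 1] and sigmoid reserved for the probability-like uncertainty head.
import torch
import torch.nn as nn

class SentimentAnalyst(nn.Module):
    def __init__(self, in_dim=256, hidden=128, rep_dim=64, dropout=0.1):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.LayerNorm(in_dim),
            nn.Linear(in_dim, hidden), nn.Tanh(), nn.Dropout(dropout),
            nn.Linear(hidden, rep_dim), nn.Tanh(), nn.Dropout(dropout),
        )
        self.sentiment_head = nn.Linear(rep_dim, 1)
        self.uncertainty_head = nn.Linear(rep_dim, 1)

    def forward(self, x):  # x: (batch, 256) float32 text embedding
        rep = self.backbone(x)  # 64-dimensional representation for fusion
        sentiment = torch.tanh(self.sentiment_head(rep)).squeeze(-1)        # in [-1, 1]
        uncertainty = torch.sigmoid(self.uncertainty_head(rep)).squeeze(-1)
        return {"sentiment_score": sentiment, "uncertainty_score": uncertainty, "representation": rep}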
Important note:
Do not claim supervised sentiment learning unless labels exist.
Initial implementation may support dummy labels, pseudo-label hooks, and future supervised labels.
News Analyst:
Input:
256-dimensional text embedding
metadata describing filing year, form type, document id, ticker/CIK linkage, and filing date
Possible outputs:
news/event importance score
risk relevance score
novelty/event intensity score
64-dimensional or 128-dimensional analyst representation for fusion
Recommended architecture:
Chunk-level embedding encoder
Optional document-level aggregation
Optional ticker-date-level aggregation
MLP scoring heads with tanh hidden activations
Potential aggregation methods:
mean pooling across chunks per document
attention pooling across chunks per document
form-type-aware pooling
recency-weighted pooling across filings per ticker/date
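A minimal sketch of attention pooling over the chunks of one document, one of the options above; tensor shapes follow the 256-dimensional chunk embeddings and everything else is illustrative.
import torch
import torch.nn as nn

class ChunkAttentionPooler(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)  # attention weights learned during supervised training

    def forward(self, chunk_embeddings, mask=None):
        # chunk_embeddings: (docs, max_chunks, 256); mask: (docs, max_chunks), 1 = real chunk.
        scores = self.scorer(chunk_embeddings).squeeze(-1)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        weights = torch.softmax(scores, dim=-1).unsqueeze(-1)
        return (weights * chunk_embeddings).sum(dim=1)  # (docs, 256) document embedding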
Supervised labels should be generated from market and fundamentals data using:
CIK
ticker mapping
filing_date
future stock prices
market benchmark prices
trading calendar
Recommended labels:
future_excess_return_10d_class
future_excess_return_30d_class
future_volatility_spike_30d
future_drawdown_risk_30d
future_realised_volatility_30d
future_abnormal_return_5d
future_abnormal_return_20d
Anti-leakage rules:
Use only prices after the filing date.
If intraday filing timestamp is unavailable, use the next trading day as event start.
Compute classification thresholds using training years only.
Apply training thresholds unchanged to validation and test years.
Attach document-level labels to all chunks of the same document.
Do not use validation/test distributions to set thresholds.
Do not standardise using validation/test statistics.
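A minimal sketch of the threshold rule for one label such as future_excess_return_10d_class; the tercile cut points and column names are illustrative, but the thresholds are computed from training years only and reused unchanged for validation and test.
import numpy as np
import pandas as pd

def add_excess_return_class(df, train_years, col="future_excess_return_10d"):
    # Thresholds from the TRAIN year distribution only (anti-leakage rule above).
    lo, hi = df.loc[df["year"].isin(train_years), col].quantile([1 / 3, 2 / 3])
    df = df.copy()
    df[col + "_class"] = np.select(
        [df[col] <= lo, df[col] >= hi], [0, 2], default=1  # 0 = low, 1 = mid, 2 = high
    )
    return df, (lo, hi)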
Current embeddings are chunk-level. Downstream financial targets will usually be document-level, ticker-date-level, or asset-date-level.
Potential hierarchy:
chunk embedding → document embedding → ticker-date embedding → asset risk input
Recommended document-level aggregation options:
simple mean pooling over chunks belonging to the same doc_id
attention pooling where attention weights are learned during supervised training
max/mean hybrid pooling
form-type-aware pooling
risk-section-weighted pooling if section metadata is available
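A minimal sketch of the simplest option above, mean pooling of chunk embeddings per doc_id, using the row-aligned metadata to group rows; the function name is illustrative.
import numpy as np
import pandas as pd

def mean_pool_by_document(embeddings, metadata):
    # embeddings: (rows, 256) row-aligned with metadata; returns one pooled vector per doc_id.
    groups = metadata.groupby("doc_id").indices  # doc_id -> positional row indices
    pooled = np.stack([np.asarray(embeddings[idx]).mean(axis=0) for idx in groups.values()])
    return list(groups.keys()), pooled.astype(np.float32)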
Recommended ticker-date aggregation options:
most recent filing embedding
mean of all filings within a lookback window
recency-weighted average
event-type weighted average
attention over filings in lookback window
Random seed:
42
Precision:
mixed precision fp16 during CUDA training
float32 for saved embeddings
Final embedding format:
.npy
Metadata format:
.csv
Manifest/configuration format:
.json
Model export format:
Hugging Face directory with model.safetensors
Parquet usage:
Not used for this module
Inspect dataset:
python code/encoders/finbert_encoder.py inspect
Train one chunk:
python code/encoders/finbert_encoder.py train-mlm --chunk 3 --epochs 3 --workers 6
Run HPO:
python code/encoders/finbert_encoder.py hpo --chunk 3 --trials 12 --max-rows 30000 --workers 6
Extract raw 768 embeddings for one split:
python code/encoders/finbert_encoder.py embed768 --chunk 3 --split val --workers 6 --eval-batch-size 64 --overwrite
Fit PCA:
python code/encoders/finbert_encoder.py fit-pca --chunk 3 --overwrite
Project to final 256 dimensions:
python code/encoders/finbert_encoder.py project-pca --chunk 3 --split val --overwrite
Run complete embedding pipeline:
python code/encoders/finbert_encoder.py embed-all
Run resume-after-HPO full pipeline:
python -u code/encoders/run_finbert_resume_after_hpo.py
The production run was executed through the resume-after-HPO runner:
python -u code/encoders/run_finbert_resume_after_hpo.py 2>&1 | tee outputs/codeResults/FinBERT/finbert_resume_$(date +%Y%m%d_%H%M%S).log
This command:
loads saved HPO best parameters
does not rerun HPO
trains/resumes chunks 1, 2, and 3
exports model_freezed and model_unfreezed per chunk
extracts raw 768-dimensional embeddings
fits IncrementalPCA on training split only
saves final 256-dimensional embeddings
writes metadata and manifest files
Before using the text embeddings downstream, run:
python -c 'import numpy as np; from pathlib import Path; base=Path("outputs/embeddings/FinBERT"); [print(f"chunk{c}_{s}", np.load(base/f"chunk{c}_{s}_embeddings.npy", mmap_mode="r").shape) for c in (1,2,3) for s in ("train","val","test")]'
Required output:
chunk1_train (189244, 256)
chunk1_val (40000, 256)
chunk1_test (40000, 256)
chunk2_train (320000, 256)
chunk2_val (40000, 256)
chunk2_test (40000, 256)
chunk3_train (240000, 256)
chunk3_val (40000, 256)
chunk3_test (40000, 256)
Check no forbidden files are staged:
git diff --cached --name-only | grep -E "checkpoint|epoch_.*\.pt|/hpo/|embeddings768|manifest768|pca_768|pca_manifest|\.log" && echo "BAD FILES STAGED - STOP" || echo "STAGED FILES ARE CLEAN"
Check no forbidden files are tracked:
git ls-files | grep -E "checkpoint|epoch_.*\.pt|/hpo/|embeddings768|manifest768|pca_768|pca_manifest|\.log" && echo "BAD TRACKED FILES - STOP" || echo "TRACKED FILES ARE CLEAN"
Final status:
SEC filing chunk dataset is complete.
FinBERT domain-adaptive MLM fine-tuning is complete.
CUDA training on RTX 3090 Ti succeeded.
Chunk 1, chunk 2, and chunk 3 models were exported.
Final 256-dimensional train/validation/test embeddings were generated for all chunks.
The incorrect chunk3_val smoke-test embedding was detected and replaced.
Final embedding shapes were verified.
Unnecessary checkpoints, raw 768 embeddings, PCA files, and HPO trial artefacts were excluded from the final deliverable.
Final deliverables were committed and pushed.
Final pushed commit:
88c8817 add final FinBERT encoder models and embeddings
The FinBERT Text Encoder is now ready for downstream modules, especially:
text_embedding_loader.py
sentiment_analyst.py
news_analyst.py
fusion model input preparation
risk engine integration