fin-glassbox

Project Setup

This document describes how to prepare a Linux or WSL2 environment for fin-glassbox, the repository for An Explainable Multimodal Neural Framework for Financial Risk Management.

The recommended development environment is Linux with Python 3.12.7, Git LFS, a virtual environment, and CUDA-capable PyTorch when GPU acceleration is available.


Recommended baseline:

Operating system: Ubuntu 22.04 or Ubuntu 24.04, native Linux or WSL2
Python: 3.12.7
Virtual environment: venv3.12.7
GPU: NVIDIA GPU recommended for model training and embedding generation
CUDA PyTorch: install according to the CUDA version available on the machine
Disk: large local storage recommended for SEC filings, market panels, embeddings, and model outputs

The project can run inspection and smaller data-processing tasks on CPU, but encoder training, embedding generation, graph modules, and neural module training are substantially faster with CUDA.
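The repository's scripts take an explicit --device flag, but a quick way to confirm what is available is a small helper of the kind below. This is a generic sketch, not repository code; it falls back to CPU when PyTorch is not installed:

```python
# Pick the best available device: CUDA if PyTorch sees a GPU, else CPU.
def pick_device() -> str:
    try:
        import torch  # optional dependency; fall back to CPU if unavailable
        if torch.cuda.is_available():
            return "cuda"
    except ImportError:
        pass
    return "cpu"

print(pick_device())
```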


System packages

Check Ubuntu version:

lsb_release -a

Update packages:

sudo apt update

Install base tools:

sudo apt install -y python3 python3-venv python3-distutils git git-lfs build-essential wget curl

Note: on Ubuntu 24.04 the python3-distutils package is no longer available (distutils was removed from the standard library in Python 3.12); omit it from the list there.

Install build dependencies commonly needed by pyenv and scientific Python packages:

sudo apt install -y make zlib1g-dev libncurses5-dev libgdbm-dev libnss3-dev libssl-dev libreadline-dev libffi-dev libsqlite3-dev libbz2-dev llvm xz-utils tk-dev libxml2-dev libxmlsec1-dev liblzma-dev

Optional monitoring tools:

sudo apt install -y htop tmux nvtop sysstat

Clone the repository

Clone the repository and enter it:

git clone https://github.com/ib-hussain/fin-glassbox.git
cd fin-glassbox

If the repository is already cloned, enter the repository root:

cd ~/fin-glassbox

Use the actual path on your machine if the repository is stored elsewhere.


Git LFS

The repository may use Git LFS for large files. Install and initialise Git LFS:

git lfs install

Pull LFS-tracked files:

git lfs pull

If local repository-level LFS initialisation is required:

git lfs install --local

Install Python 3.12.7 with pyenv

Install pyenv:

curl -fsSL https://pyenv.run | bash

Add pyenv to Bash startup configuration:

cat >> ~/.bashrc << 'PYENVEOF'
export PYENV_ROOT="$HOME/.pyenv"
export PATH="$PYENV_ROOT/bin:$PATH"
eval "$(pyenv init -)"
PYENVEOF

Reload the shell:

source ~/.bashrc

Verify pyenv:

pyenv --version

Install Python 3.12.7:

pyenv install 3.12.7

Set Python 3.12.7 locally for the repository:

cd ~/fin-glassbox && pyenv local 3.12.7

Verify Python version:

cd ~/fin-glassbox && python --version

Expected:

Python 3.12.7

Create and activate the virtual environment

Create the virtual environment:

cd ~/fin-glassbox && python -m venv venv3.12.7

Activate it:

cd ~/fin-glassbox && source venv3.12.7/bin/activate

Verify activation:

which python
python --version
pip --version

The Python path should point inside venv3.12.7.
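Activation can also be confirmed from inside Python: in a virtual environment, sys.prefix points at the venv while sys.base_prefix points at the base interpreter. A generic check, not repository code:

```python
import sys

# Inside a venv, sys.prefix (the active environment) differs from
# sys.base_prefix (the interpreter the venv was created from).
def in_virtualenv() -> bool:
    return sys.prefix != sys.base_prefix

print("virtualenv active:", in_virtualenv())
print("interpreter:", sys.executable)
```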


Install Python dependencies

Upgrade packaging tools:

cd ~/fin-glassbox && python -m pip install --upgrade pip setuptools wheel

Install repository dependencies:

cd ~/fin-glassbox && pip install -r requirements_linux_venv.txt

If the requirements file on a machine is stale or does not match a known-working environment, capture a pinned snapshot of the working environment with:

cd ~/fin-glassbox && pip freeze > requirements.txt

For CUDA-specific PyTorch, install the PyTorch wheel recommended for your CUDA and driver version, then rerun the rest of the dependency installation if needed.

Confirm important packages:

cd ~/fin-glassbox && python -c "import torch, pandas, numpy, sklearn, optuna, transformers; print('torch=', torch.__version__); print('cuda=', torch.cuda.is_available()); print('pandas=', pandas.__version__); print('numpy=', numpy.__version__); print('optuna=', optuna.__version__); print('transformers=', transformers.__version__)"

CUDA and GPU checks

Check the NVIDIA driver and GPU state:

nvidia-smi

Check PyTorch CUDA availability:

cd ~/fin-glassbox && python -c "import torch; print(torch.cuda.is_available()); print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'CPU only')"

Monitor GPU usage during long runs:

watch -n 1 nvidia-smi

Use tmux for long-running training or embedding jobs:

tmux new -s fin-glassbox

Detach from a tmux session with Ctrl+B, then D.

Reattach:

tmux attach -t fin-glassbox

Expected repository paths

The repository expects a consistent folder structure. Common paths include:

code/encoders/
code/analysts/
code/gnn/
code/riskEngine/
code/fusion/
data/yFinance/processed/
data/FRED_data/outputs/
data/graphs/
data/sec_edgar/
outputs/embeddings/
outputs/models/
outputs/results/
outputs/codeResults/
outputs/cache/
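
The layout can be verified with a short check script like the following sketch, which simply tests whether the directories listed above exist. The helper name is hypothetical, not part of the repository:

```python
from pathlib import Path

# Directories the pipeline expects, taken from the list above.
EXPECTED_DIRS = [
    "code/encoders", "code/analysts", "code/gnn", "code/riskEngine", "code/fusion",
    "data/yFinance/processed", "data/FRED_data/outputs", "data/graphs", "data/sec_edgar",
    "outputs/embeddings", "outputs/models", "outputs/results",
    "outputs/codeResults", "outputs/cache",
]

def missing_dirs(repo_root: str = ".") -> list[str]:
    """Return expected directories that do not exist under repo_root."""
    root = Path(repo_root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]

if __name__ == "__main__":
    missing = missing_dirs(".")
    print("all present" if not missing else f"missing: {missing}")
```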

Most CLI scripts accept:

--repo-root .
--device cuda
--chunk 1
--split train|val|test

Run commands from the repository root unless a module-specific document says otherwise.
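
As an illustration only, the shared flags above could be parsed like this; the repository's actual scripts define their own parsers, which may differ in defaults and accepted values:

```python
import argparse

# Illustrative parser for the flags shared by most CLI scripts (a sketch,
# not the repository's real argument handling).
def build_parser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(description="common fin-glassbox CLI flags (sketch)")
    p.add_argument("--repo-root", default=".", help="path to the repository root")
    p.add_argument("--device", default="cuda", choices=["cuda", "cpu"])
    p.add_argument("--chunk", type=int, default=1, help="data chunk index")
    p.add_argument("--split", default="train", choices=["train", "val", "test"])
    return p

args = build_parser().parse_args(["--device", "cpu", "--split", "val"])
print(args.device, args.split, args.chunk)  # → cpu val 1
```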


Environment validation

Compile key Python files:

cd ~/fin-glassbox && python -m py_compile code/encoders/temporal_encoder.py code/encoders/finbert_encoder.py code/fusion/fusion_layer.py code/fusion/final_fusion.py

Inspect Fusion inputs:

cd ~/fin-glassbox && python code/fusion/final_fusion.py inspect --repo-root .

Run Fusion smoke test:

cd ~/fin-glassbox && python code/fusion/final_fusion.py smoke --repo-root . --device cuda

Run encoder inspection commands according to the module documentation:

cd ~/fin-glassbox && python code/encoders/temporal_encoder.py inspect --repo-root .
cd ~/fin-glassbox && python code/encoders/finbert_encoder.py --help

Smoke-test commands

Use smoke tests before long training jobs. A smoke test should check imports, model construction, data shape assumptions, XAI output shape, and basic forward/backward behaviour.
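
The checks a smoke test performs can be illustrated framework-agnostically. The sketch below uses NumPy stand-ins for the real PyTorch modules: it builds a tiny linear "model", asserts the output shape, takes one gradient step, and verifies everything stays finite:

```python
import numpy as np

# Framework-agnostic sketch of what a smoke test verifies: construction,
# shape assumptions, a forward pass, a gradient step, and finiteness.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))          # tiny "model": one linear layer
x = rng.normal(size=(2, 8))          # batch of 2, feature dim 8 (shape assumption)

y = x @ W                            # forward pass
assert y.shape == (2, 4), "unexpected output shape"

loss = float((y ** 2).mean())        # scalar loss
grad_W = 2 * x.T @ y / y.size        # closed-form gradient of mean(y**2) w.r.t. W
W -= 0.01 * grad_W                   # one optimiser step

assert np.isfinite(loss) and np.isfinite(W).all(), "non-finite values"
print("smoke ok, loss:", round(loss, 4))
```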

Fusion:

cd ~/fin-glassbox && python code/fusion/final_fusion.py smoke --repo-root . --device cuda

Technical Analyst:

cd ~/fin-glassbox && python code/analysts/technical_analyst.py smoke --repo-root . --device cuda

Drawdown Risk Module:

cd ~/fin-glassbox && python code/riskEngine/drawdown.py smoke --repo-root . --device cuda

Volatility Risk Module:

cd ~/fin-glassbox && python code/riskEngine/volatility.py smoke --repo-root . --device cuda

MTGNN Regime Module:

cd ~/fin-glassbox && python code/gnn/mtgnn_regime.py smoke --repo-root . --device cuda

StemGNN Contagion Module:

cd ~/fin-glassbox && python code/gnn/stemgnn_contagion.py smoke --repo-root . --device cuda --ticker-limit 32 --batch-size 2 --num-workers 0 --cpu-threads 6 --epochs 1 --max-train-windows 4 --max-eval-windows 2

Use module-specific documentation for exact command options.


Working with large data and outputs

This project can generate very large files, so monitor disk usage regularly.

Recommended disk checks:

df -h
du -h --max-depth=2 outputs | sort -h | tail -30
du -h --max-depth=2 data | sort -h | tail -30
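
Where du is unavailable, or for scripting, a portable Python equivalent can total a tree's size. This is a generic helper, not a repository utility:

```python
import os

# Portable alternative to `du`: total size in bytes of a directory tree,
# skipping symlinks so linked files are not double-counted.
def tree_size(path: str) -> int:
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            fp = os.path.join(dirpath, name)
            if not os.path.islink(fp):
                total += os.path.getsize(fp)
    return total

print(f"{tree_size('.') / 1e9:.2f} GB")
```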

Common troubleshooting

ModuleNotFoundError

Run commands from the repository root:

cd ~/fin-glassbox

Activate the virtual environment:

source venv3.12.7/bin/activate

Reinstall requirements if needed:

pip install -r requirements_linux_venv.txt

CUDA is not detected

Check driver:

nvidia-smi

Check PyTorch:

python -c "import torch; print(torch.cuda.is_available())"

If CUDA is unavailable, reinstall PyTorch with the correct CUDA wheel for your driver/CUDA setup.

Out-of-memory errors

Reduce batch size or node limit. Examples:

cd ~/fin-glassbox && python code/fusion/final_fusion.py smoke --repo-root . --device cuda --batch-size 512
cd ~/fin-glassbox && python code/gnn/mtgnn_regime.py predict --repo-root . --chunk 1 --split test --device cuda --node-limit 512

Close other GPU jobs or inspect memory:

nvidia-smi

Old schema detected

Some downstream modules require specific output schemas. If a module reports old schema columns, rerun the upstream module that produced the stale output.

For Fusion, Quantitative Analyst outputs must include trained attention columns such as:

top_attention_risk_driver
attention_pooled_risk_score
risk_attention_volatility
risk_attention_drawdown
risk_attention_var_cvar
risk_attention_contagion
risk_attention_liquidity
risk_attention_regime
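
A stale output can be detected before running Fusion by checking for these columns directly. The sketch below uses generic pandas with a deliberately incomplete example frame; the real output file path and format depend on the upstream module:

```python
import pandas as pd

# Attention columns Fusion requires from the Quantitative Analyst output,
# taken from the list above.
REQUIRED = {
    "top_attention_risk_driver", "attention_pooled_risk_score",
    "risk_attention_volatility", "risk_attention_drawdown",
    "risk_attention_var_cvar", "risk_attention_contagion",
    "risk_attention_liquidity", "risk_attention_regime",
}

def missing_columns(df: pd.DataFrame) -> set[str]:
    """Return required attention columns absent from the frame."""
    return REQUIRED - set(df.columns)

# Example with a deliberately incomplete frame:
df = pd.DataFrame(columns=["risk_attention_volatility", "risk_attention_drawdown"])
print(sorted(missing_columns(df)))
```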

Non-finite loss or NaN outputs

Check input finite ratios using the module's inspect command.

Use --fresh where appropriate to remove stale checkpoints or HPO databases.

Too many open files

Raise the open-file limit for the shell session:

ulimit -n 4096

For very multiprocessing-heavy jobs, reduce worker count or avoid excessive parallel DataLoader workers.
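
The same limit can also be raised from inside a Python process via the standard resource module (Unix only); this mirrors ulimit -n for the current process and its children:

```python
import resource

# Raise the soft open-file limit for this process, capped at the hard limit.
def raise_nofile(target: int = 4096) -> int:
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    new_soft = min(target, hard) if hard != resource.RLIM_INFINITY else target
    if new_soft > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]

print("open-file soft limit:", raise_nofile())
```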


Contributor notes