fin-glassbox

Market Data Pipeline

Overview
Pipeline Architecture
Source Data Inventory
Pipeline Stages
Final Output Files
Feature Specifications
Coverage Statistics
Ticker Universe
Data Quality Controls
Downstream Module Mapping
File Manifest
Reproduction

1. Overview

The Market Data Pipeline acquires, cleans, standardizes, merges, and gap-fills U.S. stock market time-series data for 2000-01-03 to 2024-12-31, then engineers features from it. The final OHLCV panel has 6,286 NYSE trading days per ticker, which gives the wide returns matrix 6,285 rows. The pipeline combines five independent sources into a complete, analysis-ready dataset for the Technical Stream and Risk Engine of the financial risk management framework.

Key Metrics

Metric	Value
Tickers	2,500 (top by coverage from 4,534 candidates)
Trading days	6,286 in the OHLCV panel, 6,285 rows in the wide returns matrix (NYSE calendar, 2000-2024)
Total data points	15,715,000 per feature file
OHLCV completeness	100% (zero missing values)
Returns NaN rate	0.0%
Sources merged	5 (yfinance, Stooq, Huge Market Dataset, Kaggle NYSE/NASDAQ, Kaggle OTC)
Pipeline scripts	7

2. Pipeline Architecture

┌─────────────────────────────────────────────────────────────┐
│                    SOURCE DATA                              │
├───────────┬──────────┬────────────┬───────────┬────────────┤
│ yfinance  │  Stooq   │   Huge     │  Kaggle   │   Kaggle   │
│ (group    │  (.txt)  │  Market    │ NYSE/NAS  │    OTC     │
│  member)  │          │  Dataset   │   DAQ     │            │
└─────┬─────┴────┬─────┴─────┬──────┴─────┬─────┴──────┬─────┘
      │          │           │            │            │
      v          v           v            v            v
┌─────────────────────────────────────────────────────────────┐
│  STAGE 1: Extraction & Standardization                     │
│  • yfin_extracter.py   — Rename .txt→.csv, filter tickers  │
│  • yfin_standardize_sources.py — Unify column formats      │
└─────────────────────────┬───────────────────────────────────┘
                          │
                          v
┌─────────────────────────────────────────────────────────────┐
│  STAGE 2: Merge & Price Alignment                          │
│  • yfin_merge_sources.py — Scale Stooq to adjusted prices  │
│  • Union dates, prefer adjusted sources                    │
└─────────────────────────┬───────────────────────────────────┘
                          │
                          v
┌─────────────────────────────────────────────────────────────┐
│  STAGE 3: Master Panel Construction                        │
│  • yfin_build_complete_panel.py — 6,288 rows per ticker    │
│  • NaN for missing dates                                   │
└─────────────────────────┬───────────────────────────────────┘
                          │
                          v
┌─────────────────────────────────────────────────────────────┐
│  STAGE 4: Gap Filling                                      │
│  • yfin_fill_from_kaggle.py — Fill from Kaggle 1962-2024   │
│  • yfin_dataFilling_pipeline.py — Multi-layer imputation   │
│    Layer 1: Trim to 6,286 dates                            │
│    Layer 2: Linear interpolation (≥6000 day stocks)        │
│    Layer 3: Trend projection (≥50% coverage)               │
│    Layer 4: EWMA + ratio fill (remaining)                  │
└─────────────────────────┬───────────────────────────────────┘
                          │
                          v
┌─────────────────────────────────────────────────────────────┐
│  STAGE 5: Feature Engineering                              │
│  • yfin_engineer_features.py — 30 features per ticker      │
│  • 4 output files: returns wide, returns long,             │
│    liquidity features, temporal features                   │
└─────────────────────────┬───────────────────────────────────┘
                          │
                          v
┌─────────────────────────────────────────────────────────────┐
│              FINAL OUTPUT (4 files)                         │
│  • returns_panel_wide.csv     (277 MB, 2500×6285)          │
│  • returns_long.csv           (785 MB, 15.7M rows)         │
│  • liquidity_features.csv     (1.2 GB, 15.7M rows)         │
│  • features_temporal.csv      (2.8 GB, 15.7M rows)         │
└─────────────────────────────────────────────────────────────┘

3. Source Data Inventory

3.1 Primary Source: yfinance (Group Member Download)

Attribute	Value
Source	Yahoo Finance via `yfinance` Python library
Format	CSV (converted from Parquet)
Tickers	4,247 (from Wikipedia S&P 500/400/600 + Nasdaq-100 + DJIA + ETFs)
Date range	2000-01-03 to 2024-12-31
Columns	`date, open, high, low, close, volume, dividends, stock_splits`
Price type	Adjusted close (split-adjusted)
File	`data/yFinance/processed/ohlcv_panel.csv` (16.3M rows)

3.2 Stooq Historical Database

Attribute	Value
Source	Stooq.com — free historical database
Format	`.txt` files (comma-separated), one per ticker
Files	12,021 files across `nasdaq stocks/`, `nasdaq etfs/`, `nyse stocks/`, `nyse etfs/`, `nysemkt stocks/`
Date range	1962–2024+
Columns	`<TICKER>,<PER>,<DATE>,<TIME>,<OPEN>,<HIGH>,<LOW>,<CLOSE>,<VOL>,<OPENINT>`
Date format	YYYYMMDD (e.g., `19991118`)
Price type	Unadjusted (raw trading prices)
After filter	4,390 tickers matching primary tickers, 15.8M rows

3.3 Boris Marjanovic “Huge Market Dataset”

Attribute	Value
Source	Kaggle: price-volume-data-for-all-us-stocks-etfs
Format	CSV, one per ticker (`aapl.us.csv`)
Files	8,539 files in `Stocks/` and `ETFs/` directories
Date range	1999–2017 (last updated 11/10/2017)
Columns	`Date, Open, High, Low, Close, Volume, OpenInt`
Price type	Adjusted for splits and dividends
After filter	2,589 tickers matching primary tickers, 7.7M rows

3.4 Kaggle NYSE/NASDAQ/NYSE-A/OTC 1962-2024

Attribute	Value
Source	Kaggle: nasdaq-nyse-nyse-a-otc-daily-stock-1962-2024
Format	4 large CSV files
Files	`NYSE 1962-2024.csv` (10.8M rows), `NASDAQ 1962-2024.csv` (11.5M rows), `NYSE A 1973-2024.csv` (1.0M rows), `OTC 1972-2024.csv` (2.5M rows)
Total	25.9M rows, 6,291 unique tickers
Columns	`Date, Ticker, Exchange, Open, High, Low, Close, Adj Close, Volume`
Price type	Both unadjusted close and adjusted close available

3.5 NYSE Trading Calendar

Attribute	Value
Source	Derived from FRED macro data pipeline
File	`data/market_dates_ONLY_NYSE.csv`
Dates	6,288 NYSE trading days, 2000-01-03 to 2024-12-31

4. Pipeline Stages

Stage 1: Extraction & Standardization

Scripts: data/yfin_extracter.py, data/yfin_standardize_sources.py

What it does:

Renames all .txt files to .csv in the Stooq directory (12,021 files)
Filters files to only those matching primary_tickers.csv (4,534 tickers)
Moves non-matching files to irrelevant/ directories
Standardizes column formats between sources:
- Stooq: Converts <TICKER>,<PER>,<DATE>,... to date,open,high,low,close,volume
- Huge Market: Drops OpenInt, normalizes column names
- Both: Date format → YYYY-MM-DD, volume → int, OHLC → float

Results:

Stooq: 4,390 tickers, 15.8M rows (1 error: emi.us.csv)
Huge Market: 2,589 tickers, 7.7M rows (5 errors: corrupted files)
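
The Stooq standardization step can be illustrated with a minimal sketch. It is not the code in `yfin_standardize_sources.py`; the function name `standardize_stooq` and the two sample rows are hypothetical, and only the column layout and target formats come from the description above.

```python
import io

import pandas as pd

# Hypothetical two-row sample in the Stooq layout described in section 3.2.
raw = io.StringIO(
    "<TICKER>,<PER>,<DATE>,<TIME>,<OPEN>,<HIGH>,<LOW>,<CLOSE>,<VOL>,<OPENINT>\n"
    "AAPL.US,D,19991118,000000,0.80,0.82,0.79,0.81,1000000,0\n"
    "AAPL.US,D,19991119,000000,0.81,0.83,0.80,0.82,1200000,0\n"
)


def standardize_stooq(source) -> pd.DataFrame:
    """Map a Stooq file to date, open, high, low, close, volume."""
    df = pd.read_csv(source)
    # Strip the angle brackets and lowercase the header names.
    df.columns = [c.strip("<>").lower() for c in df.columns]
    df = df.rename(columns={"vol": "volume"})
    # YYYYMMDD integers -> YYYY-MM-DD strings.
    parsed = pd.to_datetime(df["date"].astype(str), format="%Y%m%d")
    df["date"] = parsed.dt.strftime("%Y-%m-%d")
    ohlc = ["open", "high", "low", "close"]
    df[ohlc] = df[ohlc].astype(float)
    df["volume"] = df["volume"].astype("int64")
    # Drop ticker, per, time, and openint by selecting the target columns.
    return df[["date", *ohlc, "volume"]]


stooq = standardize_stooq(raw)
print(stooq)
```

The Huge Market files would need only the rename and the `OpenInt` drop, since their dates are assumed to be ISO-formatted already.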

Stage 2: Merge & Price Alignment

Script: data/yfin_merge_sources.py

What it does:

For tickers present in both Stooq and Huge Market (2,576 tickers):
- Computes ratio = Huge_close / Stooq_close on overlapping dates
- Scales all Stooq OHLC by this ratio (adjusts for cumulative splits)
- Unions dates from both sources
- Prefers Huge Market prices on overlapping dates (already adjusted)
For tickers in one source only: uses as-is

Price Adjustment Validation (sample):

Ticker	Ratio (Huge/Stooq)	Stability (±)	Interpretation
AAPL	4.2740	0.002%	~4:1 cumulative split
MSFT	1.0898	0.3%	Minor splits + dividends
JPM	1.1560	0.00005%	Perfectly constant
GE	0.1664	0.15%	Reverse split

Results: 4,403 tickers, 16.0M rows
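
The ratio-based alignment for one ticker might look like the sketch below. How the per-date ratios are reduced to one constant is not stated above, so the median is an assumption, and `merge_adjusted`, `toy_frame`, and the sample prices are hypothetical.

```python
import pandas as pd

OHLC = ["open", "high", "low", "close"]


def merge_adjusted(stooq: pd.DataFrame, huge: pd.DataFrame):
    """Scale unadjusted Stooq OHLC to the Huge Market level, then union dates.

    Both frames are indexed by date and carry OHLC and volume columns.
    """
    overlap = stooq.index.intersection(huge.index)
    # Assumption: one constant per ticker, the median ratio on overlapping dates.
    ratio = (huge.loc[overlap, "close"] / stooq.loc[overlap, "close"]).median()
    scaled = stooq.copy()
    scaled[OHLC] = scaled[OHLC] * ratio
    # combine_first keeps Huge Market rows where both exist
    # and adds the dates that only Stooq covers.
    merged = huge.combine_first(scaled).sort_index()
    return merged, ratio


def toy_frame(closes, dates) -> pd.DataFrame:
    """Build a flat OHLCV frame for illustration (all four prices equal)."""
    close = pd.Series(closes, index=pd.to_datetime(dates), dtype=float)
    return pd.DataFrame(
        {"open": close, "high": close, "low": close, "close": close, "volume": 1000}
    )


days = ["2017-11-08", "2017-11-09", "2017-11-10", "2017-11-13"]
stooq_raw = toy_frame([100.0, 102.0, 104.0, 106.0], days)
huge_adj = toy_frame([25.0, 25.5, 26.0], days[:3])

merged, ratio = merge_adjusted(stooq_raw, huge_adj)
print(ratio)
print(merged["close"])
```

In this toy case the last date exists only in the Stooq frame, so its close comes from the scaled Stooq price.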

Stage 3: Master Panel Construction

Script: data/yfin_build_complete_panel.py

What it does:

Loads exact 6,288 NYSE trading dates from market_dates_ONLY_NYSE.csv
For each ticker, pulls data from merged sources (preferring merged/ over ohlcv_panel)
Forces every ticker to have all 6,288 date rows (NaN for missing dates)
Aligns to NYSE calendar — no weekend/holiday gaps

Results: 4,252 tickers × 6,288 dates = 26,736,576 rows
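
Forcing a ticker onto the calendar amounts to a reindex. The sketch below uses a five-day hypothetical calendar and made-up prices rather than the real `market_dates_ONLY_NYSE.csv`.

```python
import pandas as pd

# Hypothetical five-day calendar standing in for market_dates_ONLY_NYSE.csv.
calendar = pd.DatetimeIndex(
    ["2000-01-03", "2000-01-04", "2000-01-05", "2000-01-06", "2000-01-07"],
    name="date",
)

# One ticker with two calendar dates and one off-calendar (Saturday) row.
observed = pd.DataFrame(
    {"close": [10.0, 11.0, 12.0]},
    index=pd.to_datetime(["2000-01-03", "2000-01-05", "2000-01-08"]),
)

# reindex keeps exactly the calendar dates: missing dates become NaN
# and rows outside the calendar are dropped.
aligned = observed.reindex(calendar)
print(aligned)
```

Every ticker ends up with the same row count, which is what makes the later wide pivot rectangular.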

Stage 4: Gap Filling

Scripts: data/yfin_fill_from_kaggle.py, data/yfin_dataFilling_pipeline.py

What it does:

4A. Kaggle Fill (partial): Filled 17,835 rows from the Kaggle 1962-2024 dataset for 14 tickers.

4B. Final Fill Pipeline (4 layers):

Layer	Method	Target Stocks	What It Does
Layer 1	Trim	All 2,500	Max 6,286 dates (drop last 2 incomplete NYSE dates)
Layer 2	Linear interpolation	≥6,000 day stocks	Fills gaps ≤10 consecutive days
Layer 3	Trend projection	≥50% coverage	Local linear regression on mirrored series + OHLC/Close ratio fill
Layer 4	EWMA + ratio fill	Remaining	Exponential weighted moving average + OHLC/Close median ratios

Results: 2,500 tickers × 6,286 dates = 15,715,000 rows, zero NaN
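
Layer 2's rule (fill only gaps of at most 10 consecutive days) needs more than `interpolate(limit=10)`, because that call would also fill the first 10 values of a longer gap. The sketch below is one way to express the rule; `interpolate_short_gaps` is a hypothetical name and the series is made up.

```python
import numpy as np
import pandas as pd


def interpolate_short_gaps(s: pd.Series, max_gap: int = 10) -> pd.Series:
    """Linearly fill interior NaN runs of at most max_gap values."""
    is_nan = s.isna()
    # Give every run of consecutive NaN / non-NaN values its own id.
    run_id = is_nan.astype(int).diff().ne(0).cumsum()
    # Length of the run each position belongs to.
    run_len = run_id.map(run_id.value_counts())
    # limit_area="inside" leaves leading and trailing NaN untouched.
    filled = s.interpolate(method="linear", limit_area="inside")
    # Keep interpolated values only where the gap is short enough.
    return filled.where(~is_nan | (run_len <= max_gap))


# One 1-day gap followed by one 11-day gap.
prices = pd.Series([1.0, np.nan, 3.0] + [np.nan] * 11 + [15.0])
out = interpolate_short_gaps(prices)
print(out)
```

The 1-day gap is filled with 2.0 and the 11-day gap stays NaN, which leaves it for the later layers.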

Stage 5: Feature Engineering

Script: data/yfin_engineer_features.py

What it does:

Computes all features strictly backward-looking (no lookahead bias)
Processes in chunks of 200 tickers for memory efficiency
Produces 4 output files

Results: 4 output files, ~15.7M rows each

5. Final Output Files

5.1 `ohlcv_final.csv` — Master OHLCV Panel

Attribute	Value
File	`data/yFinance/processed/ohlcv_final.csv`
Rows	15,715,000
Columns	`date, ticker, open, high, low, close, volume`
Tickers	2,500
Dates	6,286 (2000-01-03 to 2024-12-31)
NaN rate	0%
Price type	Adjusted (split and dividend adjusted)

5.2 `returns_panel_wide.csv` — Wide Returns Matrix

Attribute	Value
File	`data/yFinance/processed/returns_panel_wide.csv`
Size	277.2 MB
Shape	6,285 rows × 2,501 columns (date + 2,500 tickers)
Format	Wide matrix, dates as rows, tickers as columns
Values	Log returns: ln(close_t / close_{t-1})
NaN rate	0.0%
Primary user	StemGNN Contagion Module, VaR, CVaR

5.3 `returns_long.csv` — Long Returns

Attribute	Value
File	`data/yFinance/processed/returns_long.csv`
Size	785.4 MB
Rows	15,712,457
Columns	`date, ticker, log_return, simple_return`
Primary user	Volatility Model, Drawdown Model

5.4 `liquidity_features.csv` — Liquidity Metrics

Attribute	Value
File	`data/yFinance/processed/liquidity_features.csv`
Size	1,228.4 MB
Rows	15,715,000
Columns	`date, ticker, dollar_volume, volume_zscore, volume_ratio, turnover_proxy`
Primary user	Liquidity Risk Module

5.5 `features_temporal.csv` — Temporal Encoder Features

Attribute	Value
File	`data/yFinance/processed/features_temporal.csv`
Size	2,831.5 MB
Rows	15,715,000
Columns	`date, ticker, log_return, vol_5d, vol_21d, rsi_14, macd_hist, bb_pos, volume_ratio, hl_ratio, price_pos, spy_corr_63d`
Primary user	Shared Temporal Attention Encoder

6. Feature Specifications

6.1 Returns Features

Feature	Formula	Window	Notes
`log_return`	ln(close_t / close_{t-1})	1 day	Primary return metric
`simple_return`	(close_t − close_{t-1}) / close_{t-1}	1 day	Alternative metric
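
As a small worked example of the two formulas, the snippet below computes both returns on three made-up closes and checks the identity log_return = ln(1 + simple_return).

```python
import numpy as np
import pandas as pd

# Three hypothetical closing prices.
close = pd.Series([100.0, 102.0, 99.96], name="close")

# (close_t - close_{t-1}) / close_{t-1}
simple_return = close.pct_change()
# ln(close_t / close_{t-1})
log_return = np.log(close / close.shift(1))

# The first row has no previous close, so both returns are NaN there.
print(pd.DataFrame({"simple_return": simple_return, "log_return": log_return}))
```

The missing first row is why a 6,286-date panel produces 6,285 return rows per ticker.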

6.2 Volatility Features

Feature	Formula	Window	Notes
`vol_5d`	std(log_return) × √252	5 days	Short-term annualized vol
`vol_21d`	std(log_return) × √252	21 days	Monthly annualized vol
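
A rolling annualized volatility can be sketched as follows. The table does not say whether the standard deviation is the sample or population version; the sketch assumes the pandas default (sample, `ddof=1`), and the returns are randomly generated.

```python
import numpy as np
import pandas as pd


def annualized_vol(log_return: pd.Series, window: int,
                   periods_per_year: int = 252) -> pd.Series:
    """Trailing standard deviation of log returns, scaled by sqrt(252)."""
    # rolling(window) ends at the current row, so no future data is used.
    return log_return.rolling(window).std() * np.sqrt(periods_per_year)


rng = np.random.default_rng(0)
log_return = pd.Series(rng.normal(0.0, 0.01, size=60))

vol_5d = annualized_vol(log_return, 5)
vol_21d = annualized_vol(log_return, 21)
print(vol_5d.tail(3))
print(vol_21d.tail(3))
```

Each feature is NaN until its window is full, so `vol_21d` has 20 leading NaN values per ticker unless they are filled afterwards.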

6.3 Technical Indicators

Feature	Formula	Window	Range	Notes
`rsi_14`	Wilder’s RSI	14 days	0–100	Relative Strength Index
`macd_hist`	MACD − Signal	12/26/9	Unbounded	MACD histogram
`bb_pos`	(close − lower) / (upper − lower)	20 days, 2σ	0–1	Bollinger Band position
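
The three indicators can be sketched with pandas alone. Several details are assumptions rather than facts about `yfin_engineer_features.py`: Wilder smoothing is approximated by an exponential average with alpha = 1/14 (no SMA seed), the Bollinger bands use the population standard deviation, and nothing is clipped, so `bb_pos` leaves the 0 to 1 range whenever the close is outside the bands.

```python
import numpy as np
import pandas as pd


def rsi_wilder(close: pd.Series, period: int = 14) -> pd.Series:
    """RSI with Wilder-style smoothing (EWM with alpha = 1 / period)."""
    delta = close.diff()
    gain = delta.clip(lower=0.0)
    loss = (-delta).clip(lower=0.0)
    avg_gain = gain.ewm(alpha=1.0 / period, adjust=False, min_periods=period).mean()
    avg_loss = loss.ewm(alpha=1.0 / period, adjust=False, min_periods=period).mean()
    rs = avg_gain / avg_loss  # inf when there are no losses -> RSI of 100
    return 100.0 - 100.0 / (1.0 + rs)


def macd_hist(close: pd.Series, fast: int = 12, slow: int = 26,
              signal: int = 9) -> pd.Series:
    """MACD line minus its signal line."""
    macd = (close.ewm(span=fast, adjust=False).mean()
            - close.ewm(span=slow, adjust=False).mean())
    return macd - macd.ewm(span=signal, adjust=False).mean()


def bb_pos(close: pd.Series, window: int = 20, n_std: float = 2.0) -> pd.Series:
    """Position of the close between the lower and upper Bollinger bands."""
    mid = close.rolling(window).mean()
    sd = close.rolling(window).std(ddof=0)  # assumption: population std
    lower, upper = mid - n_std * sd, mid + n_std * sd
    return (close - lower) / (upper - lower)


rng = np.random.default_rng(1)
walk = pd.Series(100.0 + rng.normal(0.0, 1.0, size=200).cumsum())
rsi = rsi_wilder(walk)
hist = macd_hist(walk)
pos = bb_pos(walk)
print(pd.DataFrame({"rsi_14": rsi, "macd_hist": hist, "bb_pos": pos}).tail())
```

All three functions use only the current and earlier rows, which matches the backward-looking requirement in section 9.1.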

6.4 Price Range Features

Feature	Formula	Window	Notes
`hl_ratio`	(high − low) / close	1 day	Daily range normalized by close
`price_pos`	(close − 21d_low) / (21d_high − 21d_low)	21 days	Position within recent range
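
A sketch of the two range features follows. The table does not say whether `21d_low` and `21d_high` come from the `low`/`high` columns or from `close`; the sketch assumes the former, and the prices are made up.

```python
import numpy as np
import pandas as pd


def price_range_features(df: pd.DataFrame) -> pd.DataFrame:
    """Compute hl_ratio and price_pos from high, low, and close columns."""
    # Assumption: the 21-day extremes are taken from the low and high columns.
    low_21 = df["low"].rolling(21).min()
    high_21 = df["high"].rolling(21).max()
    out = pd.DataFrame(index=df.index)
    out["hl_ratio"] = (df["high"] - df["low"]) / df["close"]
    out["price_pos"] = (df["close"] - low_21) / (high_21 - low_21)
    return out


# Hypothetical steadily rising stock with a constant 2-point daily range.
close = pd.Series(100.0 + np.arange(30.0))
bars = pd.DataFrame({"high": close + 1.0, "low": close - 1.0, "close": close})

range_features = price_range_features(bars)
print(range_features.iloc[[0, 20, 29]])
```

Under this assumption `price_pos` stays within 0 to 1, because the close always lies between the window's lowest low and highest high.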

6.5 Volume Features

Feature	Formula	Window	Notes
`dollar_volume`	close × volume	1 day	Dollar trading volume
`volume_zscore`	(volume − mean_21d) / std_21d	21 days	Abnormal volume detection
`volume_ratio`	volume / mean_21d	21 days	Relative volume
`turnover_proxy`	volume / mean_252d	252 days	Long-term volume context
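
The four volume features can be sketched as below. The windows are assumed to include the current day, the standard deviation is the pandas default (sample), and the input data is randomly generated.

```python
import numpy as np
import pandas as pd


def liquidity_features(df: pd.DataFrame) -> pd.DataFrame:
    """Compute the four volume features from close and volume columns."""
    vol = df["volume"].astype(float)
    mean_21 = vol.rolling(21).mean()
    out = pd.DataFrame(index=df.index)
    out["dollar_volume"] = df["close"] * vol
    out["volume_zscore"] = (vol - mean_21) / vol.rolling(21).std()
    out["volume_ratio"] = vol / mean_21
    out["turnover_proxy"] = vol / vol.rolling(252).mean()
    return out


rng = np.random.default_rng(2)
n = 300
sample = pd.DataFrame({
    "close": 50.0 + rng.normal(0.0, 0.5, size=n).cumsum(),
    "volume": rng.integers(1_000, 5_000, size=n),
})

liq = liquidity_features(sample)
print(liq.tail(3))
```

`turnover_proxy` needs a full year of history, so it is NaN for the first 251 rows of each ticker in this sketch.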

6.6 Benchmark Features

Feature	Formula	Window	Notes
`spy_corr_63d`	Rolling Pearson corr with SPY returns	63 days	Market correlation proxy
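
A rolling correlation against SPY can be computed per column on a wide returns frame. In the sketch below `INV` and `NOISE` are hypothetical tickers and all returns are random.

```python
import numpy as np
import pandas as pd


def rolling_corr_with_benchmark(returns_wide: pd.DataFrame, benchmark: str,
                                window: int = 63) -> pd.DataFrame:
    """Trailing Pearson correlation of every column with the benchmark."""
    bench = returns_wide[benchmark]
    return returns_wide.apply(lambda col: col.rolling(window).corr(bench))


rng = np.random.default_rng(3)
spy = pd.Series(rng.normal(0.0, 0.01, size=120))
returns_wide = pd.DataFrame({
    "SPY": spy,
    "INV": -spy,                             # perfectly anti-correlated
    "NOISE": rng.normal(0.0, 0.01, size=120),  # unrelated series
})

spy_corr_63d = rolling_corr_with_benchmark(returns_wide, "SPY")
print(spy_corr_63d.tail(3))
```

SPY's correlation with itself is 1 by construction, so that column carries no information for the SPY row of the feature file.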

7. Coverage Statistics

7.1 Final Coverage (Post-Filling)

Tier	Tickers	% of Universe
6,286 days (full)	2,500	100%
≥5,000 days	2,500	100%
≥4,000 days	2,500	100%
≥3,000 days	2,500	100%

7.2 Source Contribution (Pre-Filling)

Source	Tickers Contributed	Rows
yfinance (ohlcv_panel)	4,228	16.2M
Stooq	4,390	15.8M
Huge Market Dataset	2,589	7.7M
Kaggle NYSE/NASDAQ	3,476	17,835 filled
Merged (Stooq + Huge)	4,403	16.0M

7.3 Fill Pipeline Effectiveness

Layer	Method	Stocks Affected	Cells Filled
Kaggle pre-fill	Direct lookup	14	17,835
Layer 1	Trim	2,500	—
Layer 2	Linear interpolation	~1,560	—
Layer 3	Trend projection	~885	—
Layer 4	EWMA + ratio	~1,200	—
Total		2,500	~10M

8. Ticker Universe

8.1 Selection Process

Initial universe: 4,534 tickers from primary_tickers.csv (SEC CIK-mapped) + cik_ticker_map_cleaned.csv (4,428 tickers)
Data availability filter: 4,451 tickers had data in at least one source
Coverage ranking: Sorted by number of NYSE trading days with data
Final selection: Top 2,500 tickers (cutoff: 3,017 days / 48.0% coverage)
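
The coverage ranking can be sketched as a group-count followed by a sort. The function name `select_top_by_coverage` and the three-ticker panel are hypothetical, and how the real pipeline breaks ties at the cutoff is not stated.

```python
import numpy as np
import pandas as pd


def select_top_by_coverage(panel: pd.DataFrame, n: int) -> pd.Series:
    """Return the n tickers with the most non-missing closes."""
    # count() ignores NaN, so it measures days with data per ticker.
    coverage = panel.groupby("ticker")["close"].count()
    # A stable sort keeps alphabetical order among tied tickers.
    return coverage.sort_values(ascending=False, kind="mergesort").head(n)


# Hypothetical panel: 3 tickers on a 4-day calendar.
panel = pd.DataFrame({
    "ticker": ["AAA"] * 4 + ["BBB"] * 4 + ["CCC"] * 4,
    "close": [1.0, 2.0, 3.0, 4.0,
              1.0, np.nan, 3.0, 4.0,
              np.nan, np.nan, np.nan, 4.0],
})

top = select_top_by_coverage(panel, n=2)
print(top)
print(top / 4)  # coverage as a fraction of the calendar
```

With the real panel, the same division by the calendar length gives the coverage percentage quoted for the cutoff.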

8.2 Sector Composition (SIC → GICS)

GICS Sector	Tickers	%
Financials	~500	20%
Information Technology	~350	14%
Health Care	~300	12%
Industrials	~280	11%
Consumer Discretionary	~250	10%
Energy	~180	7%
Real Estate	~150	6%
Consumer Staples	~120	5%
Materials	~120	5%
Utilities	~70	3%
Communication Services	~60	2%
Other	~120	5%

8.3 ETF Coverage

ETF	In Universe	Description
SPY	✅	S&P 500
QQQ	✅	Nasdaq-100
DIA	✅	Dow Jones Industrial
IWM	❌	Russell 2000 (filtered out)
XLK-XLC	❌	Sector ETFs (filtered out)

9. Data Quality Controls

9.1 No Lookahead Bias

All features are computed using strictly backward-looking windows. The pipeline enforces:

Returns use pct_change() (previous close only)
Rolling windows use .rolling(window) with no center=True
No future data is used in any imputation step
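
One way to test a feature for lookahead is to recompute it on a truncated history and compare. The helper below is a hypothetical check, not part of the pipeline: a backward-looking feature gives identical values on the shared prefix, while a centered window does not.

```python
import numpy as np
import pandas as pd


def is_backward_looking(feature_fn, series: pd.Series, cut: int) -> bool:
    """True if the first `cut` feature values ignore data after `cut`."""
    full = feature_fn(series).iloc[:cut]
    truncated = feature_fn(series.iloc[:cut])
    return bool(np.allclose(full, truncated, equal_nan=True))


rng = np.random.default_rng(4)
log_return = pd.Series(rng.normal(0.0, 0.01, size=100))


def trailing(s: pd.Series) -> pd.Series:
    return s.rolling(21).std()


def centered(s: pd.Series) -> pd.Series:
    # Uses rows after the current one.
    return s.rolling(5, center=True).mean()


trailing_ok = is_backward_looking(trailing, log_return, cut=60)
centered_ok = is_backward_looking(centered, log_return, cut=60)
print(trailing_ok, centered_ok)
```

The centered window fails because its values near the cut depend on rows that the truncated series does not contain.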

9.2 Price Adjustment Consistency

All prices in the final panel are split and dividend adjusted (comparable to adjusted close). The merge process scaled Stooq’s unadjusted prices using a per-ticker constant ratio calibrated against Huge Market’s adjusted prices.

9.3 Verification Checks

Check	Result
NaN in OHLCV	0 cells
NaN in returns wide	0.0%
Date monotonicity	Passed (sorted ascending)
Ticker count	2,500 (exact)
Dates per ticker	6,286 (exact)
NYSE calendar alignment	Verified against `market_dates_ONLY_NYSE.csv`
Duplicate (ticker, date) pairs	None
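
Most of the checks in the table can be expressed in a few lines of pandas. The sketch below is not the project's verification script; `verify_panel` and the toy panel are hypothetical, and dates are assumed to be ISO strings so that string order equals chronological order.

```python
import pandas as pd


def verify_panel(df: pd.DataFrame, n_tickers: int, n_dates: int) -> dict:
    """Run structural checks on a long OHLCV panel."""
    value_cols = ["open", "high", "low", "close", "volume"]
    per_ticker = df.groupby("ticker")["date"]
    return {
        "no_nan": bool(df[value_cols].notna().all().all()),
        "no_duplicate_pairs": not bool(df.duplicated(["ticker", "date"]).any()),
        "ticker_count": df["ticker"].nunique() == n_tickers,
        "dates_per_ticker": bool((per_ticker.nunique() == n_dates).all()),
        "dates_ascending": bool(
            per_ticker.apply(lambda s: s.is_monotonic_increasing).all()
        ),
    }


# Toy panel: 2 tickers x 3 dates with constant prices.
dates = ["2000-01-03", "2000-01-04", "2000-01-05"]
toy = pd.DataFrame({
    "date": dates * 2,
    "ticker": ["AAA"] * 3 + ["BBB"] * 3,
    "open": 1.0, "high": 1.0, "low": 1.0, "close": 1.0, "volume": 100,
})

report = verify_panel(toy, n_tickers=2, n_dates=3)
# Appending a repeated row should trip the duplicate check.
broken = pd.concat([toy, toy.iloc[[0]]], ignore_index=True)
broken_report = verify_panel(broken, n_tickers=2, n_dates=3)
print(report)
print(broken_report)
```

For the real file the call would use `n_tickers=2500` and `n_dates=6286`, the values listed in the table above.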

9.4 Known Limitations

Synthetic fill for early/late periods: ~38% of cells were filled using statistical methods (Layers 3-4). These are approximations and carry uncertainty.
No corporate actions tracking: Dividend and split information was used for price adjustment but is not preserved as separate features.
ETF universe limited: Only SPY, QQQ, DIA are present. 12 sector ETFs were filtered out during top-2,500 selection due to lower coverage.
Market cap is proxy only: market_cap_proxy uses median dollar volume, not actual market capitalization.

10. Downstream Module Mapping

10.1 Which File Feeds Which Module

Module	Input File(s)	Features Used
Shared Temporal Attention Encoder	`features_temporal.csv`	All 10 features (sequence of 30 days)
Technical Analyst (BiLSTM)	Temporal Encoder output	128-dim embedding
Volatility Model (GARCH+MLP)	`returns_long.csv` + Temporal Encoder output	log_return sequence + 128-dim embedding
Drawdown Model (BiLSTM)	Temporal Encoder output	128-dim embedding (30-90 day sequence)
Historical VaR	`returns_panel_wide.csv`	2-year rolling returns per ticker
CVaR / Expected Shortfall	`returns_panel_wide.csv`	2-year rolling returns per ticker
GNN Contagion Risk (StemGNN)	`returns_panel_wide.csv` + graph snapshots	Returns matrix (N_stocks × T=30)
Liquidity Risk Module	`liquidity_features.csv`	dollar_volume, volume_zscore, volume_ratio, turnover_proxy
Regime Detection (MTGNN)	Temporal Encoder output + FinBERT embeddings	128-dim temporal + 256-dim text
Position Sizing Engine	All risk module outputs	Aggregated risk scores
Cross-Asset Graph Builder	`returns_panel_wide.csv`	Correlation matrix, sector mapping, beta

10.2 Data Flow Diagram

features_temporal.csv ──→ Temporal Encoder ──→ Technical Analyst
                                          ──→ Volatility Model
                                          ──→ Drawdown Model
                                          ──→ Regime Detection

returns_panel_wide.csv ──→ VaR / CVaR
                      ──→ StemGNN Contagion
                      ──→ Cross-Asset Graph Builder

returns_long.csv ──→ Volatility Model (GARCH)

liquidity_features.csv ──→ Liquidity Risk Module

11. File Manifest

11.1 Scripts

data/
├── yfin_extracter.py              # Rename .txt→.csv, filter to primary tickers
├── yfin_standardize_sources.py    # Unify Stooq + Huge Market column formats
├── yfin_merge_sources.py          # Scale prices, union dates, prefer adjusted
├── yfin_build_complete_panel.py   # Build 6288×N complete matrix with NaN
├── yfin_fill_from_kaggle.py       # Fill from Kaggle 1962-2024 dataset
├── yfin_dataFilling_pipeline.py    # 4-layer statistical fill pipeline
└── yfin_engineer_features.py      # Feature engineering (4 output files)

11.2 Final Data Files

data/yFinance/processed/
├── ohlcv_final.csv                # 15.7M rows, fully filled OHLCV (2,500 × 6,286)
├── returns_panel_wide.csv         #   277 MB, log returns (6,285 × 2,500)
├── returns_long.csv               #   785 MB, ticker-date-level returns
├── liquidity_features.csv         # 1,228 MB, volume-based features
├── features_temporal.csv          # 2,832 MB, 10 features for Temporal Encoder
├── common_tickers.csv             # 54 tickers with both market + fundamentals
├── tickers_with_coverage.csv      # Per-ticker coverage statistics
└── master_coverage_complete.csv   # Final coverage report

11.3 Intermediate Files (Retained for Audit)

data/yFinance/
├── yFinance.md                    # This file
├── merged/                        # 4,397 per-ticker CSV files (Stooq+Huge merged)
├── raw/                           # 5,741 raw yfinance Parquet downloads
├── raw_metadata/                  # 5,740 yfinance metadata JSON files
├── Huge_Market_Dataset/           # Filtered Boris Marjanovic CSVs
├── d_us_txt/                      # Filtered Stooq CSVs
└── nasdaq-nyse-nyse-a-otc-daily-stock-1962-2024/  # Kaggle source CSVs

12. Reproduction

12.1 Full Pipeline Execution Order

# Stage 1: Extract and standardize
python data/yfin_extracter.py --dir "data/yFinance/d_us_txt" --tickers "data/primary_tickers.csv" --workers 4
python data/yfin_extracter.py --dir "data/yFinance/Huge_Market_Dataset" --tickers "data/primary_tickers.csv" --workers 4 --skip-rename
python data/yfin_standardize_sources.py --all --workers 4

# Stage 2: Merge sources
python data/yfin_merge_sources.py --workers 4

# Stage 3: Build complete panel
python data/yfin_build_complete_panel.py --workers 4

# Stage 4: Fill gaps
python data/yfin_fill_from_kaggle.py --workers 4
python data/yfin_dataFilling_pipeline.py --workers 4

# Stage 5: Engineer features
python data/yfin_engineer_features.py --workers 4

12.3 Verification Commands

# Check OHLCV completeness
python -c "
import pandas as pd
df = pd.read_csv('data/yFinance/processed/ohlcv_final.csv')
print(f'NaN in close: {df[\"close\"].isna().sum()}')
print(f'Tickers: {df[\"ticker\"].nunique()}')
print(f'Dates: {df[\"date\"].nunique()}')
"

# Check returns matrix
python -c "
import pandas as pd
df = pd.read_csv('data/yFinance/processed/returns_panel_wide.csv', nrows=0)
print(f'Tickers: {len(df.columns) - 1}')
"

Document Version

Version: 1.0
Date: 26 April 2026
