fin-glassbox

EXECUTIVE SUMMARY

We have completed the fundamentals data engineering pipeline from raw SEC EDGAR bulk archives through to a clean, point-in-time quarterly fundamentals table for U.S. issuers. This covers Family 3 (Fundamental Company Data) of the five required data families in your architecture.


DATA PROCESSED: SCALE AND SCOPE

submissions.zip: 900,000+ JSON files -> 6 CSV families (inventory, entities, recent filings, filing files, former names, errors); millions of rows (partitioned)
companyfacts.zip: 19,514 JSON files -> 121,896,885 flattened fact rows across 72 partitioned CSVs (121.9M facts)
Final fundamentals: derived from the above -> 7,310 point-in-time quarterly rows (7,310 company-periods)

FILES GENERATED AND THEIR PURPOSES

1. Issuer Master Files

Location: data/sec_edgar/processed/issuer_master/

issuer_master.csv (970,000+ rows)

Purpose: Complete registry of every SEC filer in the EDGAR system.

Key fields:

Why this matters:

cik_ticker_map.csv (5,644 rows)

Purpose: Clean, production-ready mapping from CIK to primary ticker for publicly traded U.S. companies only.

Key fields:

Why this matters:

issuer_master_summary.json

Purpose: Pipeline metadata and statistics for reproducibility.

Key stats:


2. Core Fundamentals Files

Location: data/sec_edgar/processed/fundamentals/

core_fundamentals_quarterly.csv (7,310 rows)

Purpose: Point-in-time quarterly and annual fundamentals for U.S. public companies.

Key fields (27 concepts extracted):

Identifiers: cik, ticker, fiscal_year, fiscal_period (Q1-Q4, FY)
Timing: filing_date, period_end_date, form_type (10-K, 10-Q), accession
Income Statement: revenue, cost_of_revenue, gross_profit, operating_expenses, operating_income, net_income, eps_basic, eps_diluted, shares_basic
Balance Sheet (Assets): total_assets, current_assets, cash_and_equivalents, inventory, ppe_net, goodwill, intangible_assets
Balance Sheet (Liabilities): total_liabilities, current_liabilities, long_term_debt, short_term_debt
Balance Sheet (Equity): shareholders_equity, retained_earnings
Cash Flow: operating_cash_flow, investing_cash_flow, financing_cash_flow, capex, free_cash_flow

Point-in-Time Method (Option A - Strict): For each (cik, fiscal_year, fiscal_period):

  1. Identify the earliest filing date among all 10-K/10-Q filings for that period
  2. Extract facts only from that earliest filing
  3. This ensures no restated/later-amended values leak into historical analysis
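The three steps above can be sketched in pandas. Column names (`cik`, `fiscal_year`, `fiscal_period`, `filing_date`) mirror the fields listed earlier; the toy data is illustrative, not from the pipeline:

```python
import pandas as pd

# Toy facts: two filings for the same (cik, fiscal_year, fiscal_period);
# the later one simulates a restated value that must NOT be used.
facts = pd.DataFrame({
    "cik": [320193, 320193],
    "fiscal_year": [2023, 2023],
    "fiscal_period": ["Q1", "Q1"],
    "filing_date": pd.to_datetime(["2023-02-03", "2023-05-10"]),
    "revenue": [117_154_000_000, 118_000_000_000],  # restatement differs
})

keys = ["cik", "fiscal_year", "fiscal_period"]

# Steps 1-2: keep only the row(s) from the earliest filing per period.
earliest = facts.loc[facts.groupby(keys)["filing_date"].idxmin()]

# Step 3: no later-amended values survive, so backtests only see what
# was known on the original filing date.
```

The same idea extends to many facts per filing by first selecting the earliest accession per period, then filtering all facts to that accession.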

Why this matters:

fundamentals_summary.json

Purpose: Pipeline metadata and concept coverage.

Key stats:


WHY THE OUTPUT IS SMALL (7,310 ROWS) - IMPORTANT CONTEXT

The pipeline ingested 121.9M fact rows but emitted only 7,310 output rows. This is expected and correct. Here is why:

  1. Sparsity of XBRL filings: The 121.9M facts include every XBRL tag reported (thousands of distinct concepts) across all filing types (8-K, DEF 14A, S-1, etc.). We filter to:
    • Only 10-K and 10-Q forms
    • Only 27 core concepts
    • Only U.S. public companies (5,644 CIKs)
  2. Point-in-time deduplication: Multiple filings/amendments for the same period are collapsed to a single row (earliest filing).

  3. Coverage is building over time: Many companies in the 5,644 list may not have XBRL facts for all periods yet (newer filers, data gaps).

  4. This is sufficient: 7,310 company-periods across 5,644 tickers is a solid foundation. You can:
    • Join with price data to get forward returns
    • Train tabular models (XGBoost/LightGBM)
    • Derive features (growth rates, ratios)
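As a sketch of the filtering in point 1, assuming a flattened facts frame with `form_type`, `concept`, and `cik` columns (the column names and sample values are illustrative, not the pipeline's actual schema):

```python
import pandas as pd

# Illustrative whitelists: the real pipeline uses 27 concepts and 5,644 CIKs.
core_concepts = {"revenue", "net_income", "total_assets"}
us_ciks = {320193, 789019}

facts = pd.DataFrame({
    "cik": [320193, 320193, 999999],
    "form_type": ["10-Q", "8-K", "10-Q"],
    "concept": ["revenue", "revenue", "revenue"],
    "value": [1.0, 2.0, 3.0],
})

filtered = facts[
    facts["form_type"].isin(["10-K", "10-Q"])   # only annual/quarterly reports
    & facts["concept"].isin(core_concepts)      # only the core concepts
    & facts["cik"].isin(us_ciks)                # only the U.S. public universe
]
```

Only the first row survives: the 8-K is dropped by form type, and the unknown CIK by the universe filter.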

As you add more years of data (via incremental SEC updates), this table will grow.


HOW THIS DATA FEEDS YOUR ARCHITECTURE

Fundamental Encoder/Model: core_fundamentals_quarterly.csv (✅ Ready)
Fundamental Analyst: core_fundamentals_quarterly.csv (✅ Ready)
Qualitative Analysis (fundamental branch): derived ratios from fundamentals (⏳ Next step)
Risk Engine (leverage/liquidity signals): balance sheet fields (⏳ Feature engineering needed)
U.S. Universe Definition: cik_ticker_map.csv (✅ Ready)
Market Data Joining: join tickers with yfinance (⏳ Next phase)

WHAT REMAINS IN THE FUNDAMENTALS PIPELINE

Step 4: Derived Ratios and Features

Goal: Transform raw fundamentals into model-ready features.

Tasks:

  1. Growth metrics (YoY, QoQ):
    • Revenue growth
    • Earnings growth
    • Cash flow growth
  2. Profitability ratios:
    • Gross margin = gross_profit / revenue
    • Operating margin = operating_income / revenue
    • Net margin = net_income / revenue
  3. Efficiency ratios:
    • ROA = net_income / total_assets
    • ROE = net_income / shareholders_equity (handle negative equity)
  4. Leverage ratios:
    • Debt/Equity = long_term_debt / shareholders_equity
    • Current ratio = current_assets / current_liabilities
  5. Valuation metrics (requires price data join):
    • P/E, P/B, P/S, EV/EBITDA proxies

Output: fundamentals_features.csv with forward-filled quarterly data.
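The growth and ratio tasks above might look like this in pandas. Column names follow the fundamentals table; the numbers are toy data, and winsorization/outlier handling is deliberately omitted:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "cik": [1, 1, 1, 1, 1],
    "period_end_date": pd.to_datetime(
        ["2022-03-31", "2022-06-30", "2022-09-30", "2022-12-31", "2023-03-31"]),
    "revenue": [100.0, 110.0, 120.0, 130.0, 140.0],
    "gross_profit": [40.0, 44.0, 50.0, 52.0, 60.0],
    "net_income": [10.0, 11.0, 12.0, 13.0, 14.0],
    "total_assets": [500.0] * 5,
    "shareholders_equity": [200.0, 200.0, 200.0, 200.0, -50.0],
}).sort_values(["cik", "period_end_date"])

g = df.groupby("cik")

# Growth: QoQ = 1 period back, YoY = 4 quarters back.
df["revenue_qoq"] = g["revenue"].pct_change(1)
df["revenue_yoy"] = g["revenue"].pct_change(4)

# Profitability and efficiency ratios.
df["gross_margin"] = df["gross_profit"] / df["revenue"]
df["roa"] = df["net_income"] / df["total_assets"]

# ROE: mask non-positive equity rather than emit a misleading negative ratio.
df["roe"] = np.where(df["shareholders_equity"] > 0,
                     df["net_income"] / df["shareholders_equity"],
                     np.nan)
```

Grouping by `cik` before `pct_change` is the important detail: it prevents one company's last quarter from leaking into the next company's growth rate.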

Step 5: Point-in-Time Alignment with Market Data

Goal: Ensure no lookahead when joining with prices.

Method:
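One standard way to enforce this (a sketch; the document's exact method is left unspecified here) is an as-of join keyed on filing_date, so each trading day only sees the most recent already-filed fundamentals:

```python
import pandas as pd

# Fundamentals become visible on filing_date, not period_end_date.
fundamentals = pd.DataFrame({
    "ticker": ["AAPL", "AAPL"],
    "filing_date": pd.to_datetime(["2023-02-03", "2023-05-05"]),
    "revenue": [117.2, 94.8],  # toy values, $B
})

prices = pd.DataFrame({
    "ticker": ["AAPL"] * 3,
    "date": pd.to_datetime(["2023-02-02", "2023-02-03", "2023-05-08"]),
    "close": [150.8, 154.5, 173.5],
})

# As-of join: for each price date, take the latest filing on or before it.
panel = pd.merge_asof(
    prices.sort_values("date"),
    fundamentals.sort_values("filing_date"),
    left_on="date", right_on="filing_date",
    by="ticker", direction="backward",
)
# 2023-02-02 has no filing yet, so its revenue is NaN: no lookahead.
```

An optional conservatism step is to lag `filing_date` by one trading day, since a filing posted after the close is not tradeable that same day.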


WHAT REMAINS IN THE OVERALL DATA PIPELINE

Family 1: Time-Series Market Data (Not Started)

Status: ⏳ Pending
Source: yfinance (prototype), potentially Alpha Vantage/Finnhub
Scope: daily OHLCV for 5,644 tickers × 15+ years
Output: market_data/ with price panels, returns, technical indicators

Family 2: Financial Text Data (Partially Complete)

Status: 🔄 Raw downloads in progress
Source: SEC EDGAR filings (.txt files)
Completed:

Remaining:

Family 4: Macro/Regime Data (Not Started)

Status: ⏳ Pending
Source: FRED API
Scope: interest rates, inflation, VIX, credit spreads, unemployment
Output: macro/ with aligned daily/monthly series
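Assuming the FRED pulls arrive as monthly series, aligning them to a daily trading calendar is mostly a reindex-and-forward-fill (a sketch with made-up values; the series name is illustrative):

```python
import pandas as pd

# Monthly unemployment-style series (toy values).
monthly = pd.Series(
    [3.5, 3.6, 3.7],
    index=pd.to_datetime(["2023-01-02", "2023-02-01", "2023-03-01"]),
    name="unrate",
)

# Reindex to business days and forward-fill: each day carries the most
# recently published monthly value, never a future one.
daily_index = pd.bdate_range("2023-01-01", "2023-03-10")
daily = monthly.reindex(daily_index).ffill()
```

For true point-in-time macro work, FRED's ALFRED vintages are the stricter option, since headline series are revised; forward-filling the revised series is a common but weaker approximation.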

Family 5: Cross-Asset Relation Data (Not Started)

Status: ⏳ Pending
Source: derived from the price panel
Scope: rolling correlations, sector graphs for GNN
Output: graphs/ with edge lists and adjacency matrices
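The rolling-correlation graph can be sketched as: compute pairwise return correlations over a trailing window, then keep edges above a threshold. Window, threshold, tickers, and returns below are all placeholders:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
tickers = ["AAA", "BBB", "CCC"]

# Fake daily returns; BBB is built to co-move with AAA, CCC is independent.
base = rng.normal(0, 0.01, 60)
returns = pd.DataFrame({
    "AAA": base,
    "BBB": base + rng.normal(0, 0.002, 60),
    "CCC": rng.normal(0, 0.01, 60),
})

window = 60       # trailing window length (placeholder)
threshold = 0.5   # minimum |corr| to create an edge (placeholder)

corr = returns.tail(window).corr()

# Edge list: upper triangle only, filtered by the threshold.
edges = [
    (a, b, corr.loc[a, b])
    for i, a in enumerate(tickers)
    for b in tickers[i + 1:]
    if abs(corr.loc[a, b]) >= threshold
]
```

Recomputing this per rebalance date yields the time-varying edge lists; stacking the thresholded `corr` matrices gives the adjacency tensors a GNN would consume.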


RECOMMENDED NEXT STEPS

Based on your development priorities (data acquisition first), I recommend one of the following options:

Option A: Complete Fundamentals Features (1-2 days)

python data/sec_fundamentals_features.py

This would add derived ratios and growth metrics, making the fundamentals table directly usable for modeling.

Option B: Resume Filings Text Pipeline (3-5 days)

# 1. Audit downloaded filings
python data/sec_filings_audit.py

# 2. Parse sections from 10-K/10-Q
python data/sec_filings_parser.py --forms 10-K,10-Q --sections risk,mdna

# 3. Build cleaned text corpus
python data/sec_filings_corpus.py

Option C: Bootstrap Market Data (2-3 days)

python data/market_data_yfinance.py --tickers data/sec_edgar/processed/issuer_master/cik_ticker_map.csv --years 15
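If you prototype this with yfinance directly, something along these lines keeps requests manageable. The map path and `primary_ticker` column follow the files described above; the batch size of 100 is an arbitrary choice, and the persistence step is left as a stub:

```python
import pandas as pd

def chunked(items, size):
    """Yield successive batches of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def load_universe(path):
    """Read the CIK->ticker map and return a sorted, deduplicated ticker list."""
    df = pd.read_csv(path)
    return sorted(df["primary_ticker"].dropna().unique())

if __name__ == "__main__":
    import yfinance as yf  # pip install yfinance
    tickers = load_universe(
        "data/sec_edgar/processed/issuer_master/cik_ticker_map.csv")
    for batch in chunked(tickers, 100):
        # One bulk request per batch, 15 years of daily bars.
        px = yf.download(batch, period="15y", interval="1d",
                         group_by="ticker", auto_adjust=False)
        # ... persist px per ticker under market_data/ ...
```

Batching matters at this scale: 5,644 single-ticker requests will hit rate limits, while bulk downloads amortize the HTTP overhead.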

SUMMARY TABLE: FILES YOU WILL USE GOING FORWARD

cik_ticker_map.csv: universe definition (5,644 tickers); key fields: cik, primary_ticker
core_fundamentals_quarterly.csv: model input for the Fundamental Analyst; all 36 fields
issuer_master.csv: reference for CIK lookups; key fields: cik, entity_name, is_us_issuer

The fundamentals pipeline is production-ready. The text pipeline is mid-flight. Market data is next.

Which step do you want to tackle next?