We have completed the fundamentals data engineering pipeline from raw SEC EDGAR bulk archives through to a clean, point-in-time quarterly fundamentals table for U.S. issuers. This covers Family 3 (Fundamental Company Data) of the five required data families in your architecture.
| Source | Raw Files | Processed To | Final Rows |
|---|---|---|---|
| submissions.zip | 900,000+ JSON files | 6 CSV families (inventory, entities, recent filings, filing files, former names, errors) | ~Millions (partitioned) |
| companyfacts.zip | 19,514 JSON files | 121,896,885 flattened fact rows (72 partitioned CSVs) | 121.9M facts |
| Final fundamentals | Derived from above | 7,310 quarterly rows (point-in-time) | 7,310 periods |
Location: data/sec_edgar/processed/issuer_master/
issuer_master.csv (970,000+ rows)Purpose: Complete registry of every SEC filer in the EDGAR system.
Key fields:
cik / padded_cik: Unique SEC identifierentity_name: Legal name of fileris_us_issuer: Boolean flag indicating U.S. incorporation/domicileprimary_ticker / primary_exchange: Trading symbol for public companiessic / sic_description: Industry classificationhas_submissions / has_companyfacts: Data availability flagsWhy this matters:
is_us_issuer=1 flag (851,569 entities) enables downstream filtering.cik_ticker_map.csv (5,644 rows)Purpose: Clean, production-ready mapping from CIK to primary ticker for publicly traded U.S. companies only.
Key fields:
cik / padded_cik: SEC identifierprimary_ticker: Trading symbol (e.g., AAPL, MSFT)entity_name: Company nameprimary_exchange: NYSE, Nasdaq, etc.Why this matters:
issuer_master_summary.jsonPurpose: Pipeline metadata and statistics for reproducibility.
Key stats:
Location: data/sec_edgar/processed/fundamentals/
core_fundamentals_quarterly.csv (7,310 rows)Purpose: Point-in-time quarterly and annual fundamentals for U.S. public companies.
Key fields (27 concepts extracted):
| Category | Fields |
|---|---|
| Identifiers | cik, ticker, fiscal_year, fiscal_period (Q1-Q4, FY) |
| Timing | filing_date, period_end_date, form_type (10-K, 10-Q), accession |
| Income Statement | revenue, cost_of_revenue, gross_profit, operating_expenses, operating_income, net_income, eps_basic, eps_diluted, shares_basic |
| Balance Sheet - Assets | total_assets, current_assets, cash_and_equivalents, inventory, ppe_net, goodwill, intangible_assets |
| Balance Sheet - Liabilities | total_liabilities, current_liabilities, long_term_debt, short_term_debt |
| Balance Sheet - Equity | shareholders_equity, retained_earnings |
| Cash Flow | operating_cash_flow, investing_cash_flow, financing_cash_flow, capex, free_cash_flow |
Point-in-Time Method (Option A - Strict):
For each (cik, fiscal_year, fiscal_period):
Why this matters:
fundamentals_summary.jsonPurpose: Pipeline metadata and concept coverage.
Key stats:
You processed 73.6M observations but only got 7,310 output rows. This is expected and correct. Here’s why:
Point-in-time deduplication: Multiple filings/amendments for the same period are collapsed to a single row (earliest filing).
Coverage is building over time: Many companies in the 5,644 list may not have XBRL facts for all periods yet (newer filers, data gaps).
As you add more years of data (via incremental SEC updates), this table will grow.
| Architecture Component | Data Source | Status |
|---|---|---|
| Fundamental Encoder/Model | core_fundamentals_quarterly.csv |
✅ Ready |
| Fundamental Analyst | core_fundamentals_quarterly.csv |
✅ Ready |
| Qualitative Analysis (fundamental branch) | Derived ratios from fundamentals | ⏳ Next step |
| Risk Engine (leverage/liquidity signals) | Balance sheet fields | ⏳ Feature engineering needed |
| U.S. Universe Definition | cik_ticker_map.csv |
✅ Ready |
| Market Data Joining | Join tickers with yfinance | ⏳ Next phase |
Goal: Transform raw fundamentals into model-ready features.
Tasks:
gross_profit / revenueoperating_income / revenuenet_income / revenuenet_income / total_assetsnet_income / shareholders_equity (handle negative equity)long_term_debt / shareholders_equitycurrent_assets / current_liabilitiesOutput: fundamentals_features.csv with forward-filled quarterly data.
Goal: Ensure no lookahead when joining with prices.
Method:
filing_datefiling_date for forward returnsStatus: ⏳ Pending
Source: yfinance (prototype), potentially Alpha Vantage/Finnhub
Scope: Daily OHLCV for 5,644 tickers × 15+ years
Output: market_data/ with price panels, returns, technical indicators
Status: 🔄 Raw downloads in progress Source: SEC EDGAR filings (.txt files) Completed:
Remaining:
Status: ⏳ Pending
Source: FRED API
Scope: Interest rates, inflation, VIX, credit spreads, unemployment
Output: macro/ with aligned daily/monthly series
Status: ⏳ Pending
Source: Derived from price panel
Scope: Rolling correlations, sector graphs for GNN
Output: graphs/ with edge lists and adjacency matrices
Based on your development priorities (data acquisition first), I recommend:
python data/sec_fundamentals_features.py
This would add derived ratios and growth metrics, making the fundamentals table directly usable for modeling.
# 1. Audit downloaded filings
python data/sec_filings_audit.py
# 2. Parse sections from 10-K/10-Q
python data/sec_filings_parser.py --forms 10-K,10-Q --sections risk,mdna
# 3. Build cleaned text corpus
python data/sec_filings_corpus.py
python data/market_data_yfinance.py --tickers data/sec_edgar/processed/issuer_master/cik_ticker_map.csv --years 15
| File | Purpose | Key Fields |
|---|---|---|
cik_ticker_map.csv |
Universe definition (5,644 tickers) | cik, primary_ticker |
core_fundamentals_quarterly.csv |
Model input for Fundamental Analyst | All 36 fields |
issuer_master.csv |
Reference for CIK lookups | cik, entity_name, is_us_issuer |
The fundamentals pipeline is production-ready. The text pipeline is mid-flight. Market data is next.
Which step do you want to tackle next?