Flexible data-mining strategies (Chen-Lopez-Lira-Zimmermann)
Verified May 16, 2026 · tested with live GitHub raw fetch (SignalsTheoryChecked.csv, 200, real columns)
The flexible data-mining dataset (Chen, Lopez-Lira & Zimmermann, “Peer-reviewed theory does not help predict the cross-section of stock returns”) is a free benchmark of ~30,000 long-short strategies built from every constructible CRSP/Compustat accounting ratio, plus a classification of which published signals are theory-motivated vs. purely empirical. The ZeroPaper pipeline uses it to ask whether a “novel” predictor is actually distinguishable from data mining.
- Cost: free, no auth (public GitHub + public Google Drive).
- Code: https://github.com/chenandrewy/flex-mining
- Bulk data: a public Google Drive folder (see below).
Access
Option 1: Signal-theory classification (small, start here)

```python
import pandas as pd

url = (
    "https://raw.githubusercontent.com/chenandrewy/flex-mining/"
    "main/DataInput/SignalsTheoryChecked.csv"
)
signals = pd.read_csv(url)  # signalname, Authors, Year, Journal, theory, …
```

Option 2: Full data-mined returns (large, Google Drive)

```python
# pip install gdown
import gdown

gdown.download_folder(
    "https://drive.google.com/drive/folders/1SZe_aF4ZNvK4ZRx2jQUE1j19KQvBaqWr",
    output="data/flex-mining/",
    remaining_ok=True,
)
```

Key files: `DataMinedLongShortReturnsEW.csv`, `DataMinedLongShortReturnsVW.csv` (~30K strategies, monthly).
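Because the Drive folder is large, it can help to guard the download so it runs only once. The following is a minimal sketch, not part of the flex-mining repo: the helper name `ensure_flex_mining` and the "any CSV present means cached" rule are assumptions made for illustration.

```python
from pathlib import Path


def ensure_flex_mining(folder_url: str, out_dir: str = "data/flex-mining/") -> Path:
    """Download the Drive folder only if out_dir holds no CSV files yet.

    Hypothetical helper: the cache check (any *.csv under out_dir) is an
    assumption, not something the dataset authors prescribe.
    """
    out = Path(out_dir)
    if out.is_dir() and any(out.rglob("*.csv")):
        return out  # cached copy found; skip the network call

    import gdown  # imported lazily so the cached path works without gdown

    out.mkdir(parents=True, exist_ok=True)
    gdown.download_folder(folder_url, output=out_dir, remaining_ok=True)
    return out
```

Calling `ensure_flex_mining(<Drive folder URL>)` at the top of a pipeline keeps repeated runs from refetching the data; delete the directory to force a fresh download.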
Gotchas (the ones that bite pipelines)
The reason to read this page rather than the repo. The small GitHub CSV was fetched live on the date above (200, real columns); the bulk returns live in the linked Drive folder.
- The bulk data is on Google Drive: use `gdown`. Plain `requests` on a Drive folder URL does not work. `pip install gdown` and use `download_folder`. This is the most common failure here.
- It’s large (~500 MB+). Download once to `data/flex-mining/` and cache. Load big CSVs with `usecols=` / chunking or you will OOM.
- Start from the small file. `SignalsTheoryChecked.csv` (~65 KB, straight from GitHub) is enough to scope an analysis before pulling 30K strategies.
- EW vs VW are different files. State which; they tell different stories.
- It’s a benchmark distribution, not a strategy list. The point is to compare a published signal’s return against the distribution of mined returns (is it in the right tail?), not to trade the 30K.
- Drive folder IDs can rotate. If `download_folder` fails, check the GitHub README for the current link before assuming the data moved.
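The `usecols=` / chunking advice can be sketched as follows. The real column layout of the returns files is not documented on this page, so the example runs on a tiny synthetic CSV with placeholder column names; `load_columns` is a hypothetical helper, not a repo function.

```python
import os
import tempfile

import pandas as pd


def load_columns(path: str, usecols: list[str], chunksize: int = 100_000) -> pd.DataFrame:
    """Read only `usecols`, one chunk at a time, then stitch the chunks together."""
    chunks = pd.read_csv(path, usecols=usecols, chunksize=chunksize)
    return pd.concat(chunks, ignore_index=True)


# Tiny synthetic file standing in for a large returns CSV.
# Column names are placeholders, NOT the real schema.
demo = pd.DataFrame({
    "date": ["2000-01", "2000-02", "2000-03"],
    "strat_a": [0.1, -0.2, 0.3],
    "strat_b": [0.0, 0.5, -0.1],
})
tmp_path = os.path.join(tempfile.mkdtemp(), "demo_returns.csv")
demo.to_csv(tmp_path, index=False)

# Peek at the header without loading any rows, then pull only what is needed.
header = pd.read_csv(tmp_path, nrows=0).columns.tolist()
subset = load_columns(tmp_path, usecols=["date", "strat_a"], chunksize=2)
print(header, subset.shape)
```

Reading the header first with `nrows=0` shows the actual column names before any large read. Most of the memory saving comes from `usecols`; chunking mainly pays off when each chunk is filtered or aggregated before being kept.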
What’s inside
| File / folder | Description |
|---|---|
| `DataMinedLongShortReturnsEW.csv` | EW long-short returns, ~30K strategies |
| `DataMinedLongShortReturnsVW.csv` | VW long-short returns, ~30K strategies |
| `DataInput/SignalsTheoryChecked.csv` | Published signals: theory vs. empirical |
| `Risk-vs/` | Risk vs. mispricing decomposition outputs |
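A quick way to scope an analysis from the small classification file is to tabulate its `theory` column. The sketch below uses a synthetic frame with the column names listed on this page (`signalname`, `Year`, `theory`); the label values and rows are made up, since the real categories are not reproduced here.

```python
import pandas as pd

# Synthetic stand-in for SignalsTheoryChecked.csv; all values are made up.
signals = pd.DataFrame({
    "signalname": ["sig_a", "sig_b", "sig_c", "sig_d"],
    "Year": [1995, 2004, 2004, 2012],
    "theory": ["label_1", "label_2", "label_1", None],
})

# Count signals per classification label, keeping missing labels visible.
counts = signals["theory"].value_counts(dropna=False)

# Cross-tabulate label by publication year to see coverage over time.
by_year = pd.crosstab(signals["Year"], signals["theory"].fillna("unlabeled"))
print(counts)
print(by_year)
```

Swapping the synthetic frame for the `pd.read_csv(url)` result from Option 1 should give the same tables on the real labels, assuming the column names match.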
Standard operations
- Benchmark published vs. mined: is a published signal in the right tail of the mined return distribution?
- Sufficiency: how much cross-sectional variation does the published set capture vs. the full mined set?
- Pre-publication: do soon-to-be-published patterns differ before vs. after publication?
- Theory value: do theory-motivated signals beat empirical ones, conditional on mined performance?
- Always state EW vs VW, sample period, and any filters.
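The first operation, placing a signal within the mined distribution, can be sketched as below. This is an illustrative approach and not the authors' method: it assumes a wide layout (rows are months, columns are strategies), ranks on the t-statistic of the mean return, and uses synthetic random data in place of the real files. The function names are hypothetical.

```python
import numpy as np
import pandas as pd


def t_stat(returns: pd.Series) -> float:
    """t-statistic of the mean monthly return (NaNs dropped)."""
    r = returns.dropna()
    return float(r.mean() / (r.std(ddof=1) / np.sqrt(len(r))))


def mined_percentile(mined: pd.DataFrame, candidate: pd.Series) -> float:
    """Share of mined strategies whose t-stat is below the candidate's."""
    mined_t = mined.apply(t_stat, axis=0)  # one t-stat per column (strategy)
    return float((mined_t < t_stat(candidate)).mean())


# Synthetic stand-in: 240 months x 1,000 zero-mean "mined" strategies.
rng = np.random.default_rng(0)
mined = pd.DataFrame(rng.normal(0.0, 2.0, size=(240, 1000)))
candidate = pd.Series(rng.normal(0.5, 2.0, size=240))  # true mean 0.5 per month

pct = mined_percentile(mined, candidate)
print(f"candidate beats {pct:.1%} of mined strategies on t-stat")
```

A value near 1.0 puts the candidate in the right tail. With the real data, run this separately on the EW and VW files and report the sample period alongside the result.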
Citation
Chen, A. Y., J. Lopez-Lira, and T. Zimmermann. “Peer-reviewed theory does not help predict the cross-section of stock returns.” Data and code: https://github.com/chenandrewy/flex-mining, accessed YYYY-MM-DD.