
Flexible data-mining strategies (Chen-Lopez-Lira-Zimmermann)

Verified May 16, 2026 · tested with live GitHub raw fetch (SignalsTheoryChecked.csv, 200, real columns)

Tags: asset-pricing, factors, anomalies, free, academic

The flexible data-mining dataset (Chen, Lopez-Lira & Zimmermann, “Peer-reviewed theory does not help predict the cross-section of stock returns”) is a free benchmark of ~30,000 long-short strategies built from every constructible CRSP/Compustat accounting ratio, plus a classification of which published signals are theory-motivated vs. purely empirical. The ZeroPaper pipeline uses it to ask whether a “novel” predictor is actually distinguishable from data mining.

Option 1 — Signal-theory classification (small, start here)

import pandas as pd
url = ("https://raw.githubusercontent.com/chenandrewy/flex-mining/"
"main/DataInput/SignalsTheoryChecked.csv")
signals = pd.read_csv(url) # signalname, Authors, Year, Journal, theory, …
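
Once the small file is loaded, a quick tabulation of the theory column is usually enough to scope an analysis. The sketch below is a minimal illustration, not the authors' method: it assumes only that the frame has a `theory` column (as in the comment above), makes no assumption about which labels that column contains, and `theory_breakdown` is a hypothetical helper name.

```python
import pandas as pd


def theory_breakdown(signals: pd.DataFrame, column: str = "theory") -> pd.Series:
    """Count published signals per classification label.

    `column` defaults to "theory", the column name shown in the loading
    snippet. The labels themselves are whatever the file contains; this
    helper does not assume any particular set of values.
    """
    if column not in signals.columns:
        raise KeyError(f"expected a {column!r} column, got {list(signals.columns)}")
    # dropna=False keeps unclassified rows visible instead of silently dropping them.
    return signals[column].value_counts(dropna=False)
```

Calling `theory_breakdown(signals)` on the frame loaded above returns one row per label with its count, which is a cheap sanity check before pulling the large files.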

Option 2 — Full data-mined returns (large, Google Drive)

# pip install gdown
import gdown
gdown.download_folder(
"https://drive.google.com/drive/folders/1SZe_aF4ZNvK4ZRx2jQUE1j19KQvBaqWr",
output="data/flex-mining/", remaining_ok=True)
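
Because the folder is large, it helps to skip the download when the files are already on disk. The following is a sketch under stated assumptions: `ensure_flex_mining` is a hypothetical helper, and it treats the presence of one of the key files anywhere under the output directory as evidence of a completed download. It searches recursively because the exact layout inside the Drive folder is not confirmed here.

```python
from pathlib import Path

DRIVE_FOLDER_URL = (
    "https://drive.google.com/drive/folders/1SZe_aF4ZNvK4ZRx2jQUE1j19KQvBaqWr"
)


def ensure_flex_mining(
    output_dir: str = "data/flex-mining/",
    marker: str = "DataMinedLongShortReturnsEW.csv",
) -> Path:
    """Download the Drive folder only if `marker` is not already cached.

    Assumption: finding `marker` anywhere under `output_dir` means an
    earlier download finished. A partial download could fool this check.
    """
    out = Path(output_dir)
    if out.is_dir() and any(out.rglob(marker)):
        return out  # cached copy found, no network call

    # Imported lazily so a cache hit does not require gdown to be installed.
    import gdown

    out.mkdir(parents=True, exist_ok=True)
    # Same call as above; output_dir is passed through unchanged.
    gdown.download_folder(DRIVE_FOLDER_URL, output=output_dir, remaining_ok=True)
    return out
```

If the download fails, the function raises whatever gdown raises; it does not retry or look up a new folder link.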

Key files: DataMinedLongShortReturnsEW.csv, DataMinedLongShortReturnsVW.csv (~30K strategies, monthly).

The notes below are the reason to read this page rather than the repo. The small GitHub CSV was fetched live on the verification date above (HTTP 200, real columns returned); the bulk returns live in the linked Drive folder.

  • The bulk data is on Google Drive, so use gdown. A plain requests call on a Drive folder URL does not work; pip install gdown and use download_folder. This is the most common failure here.
  • It’s large (~500 MB+). Download once to data/flex-mining/ and cache. Load big CSVs with usecols= / chunking or you will OOM.
  • Start from the small file. SignalsTheoryChecked.csv (~65 KB, straight from GitHub) is enough to scope an analysis before pulling 30K strategies.
  • EW vs VW are different files. State which; they tell different stories.
  • It’s a benchmark distribution, not a strategy list. The point is to compare a published signal’s return against the distribution of mined returns (is it in the right tail?), not to trade the 30K.
  • Drive folder IDs can rotate. If download_folder fails, check the GitHub README for the current link before assuming the data moved.
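
The usecols= / chunking advice above can be sketched as a streaming aggregation. This is a minimal illustration under loud assumptions: it assumes the returns file is in long format (one row per strategy per month), and the column names are passed in by the caller because they are not confirmed on this page. Inspect the header first, for example with `pd.read_csv(path, nrows=5)`. The helper name `mean_return_by_strategy` is hypothetical.

```python
import pandas as pd


def mean_return_by_strategy(path, id_col, ret_col, chunksize=500_000):
    """Stream a long-format returns CSV and average `ret_col` per `id_col`.

    Only two columns are read (usecols) and only `chunksize` rows are held
    in memory at a time, so peak memory stays far below the file size.
    `id_col` and `ret_col` are supplied by the caller; they are NOT known
    column names of the dataset.
    """
    sums = None
    counts = None
    reader = pd.read_csv(path, usecols=[id_col, ret_col], chunksize=chunksize)
    for chunk in reader:
        grouped = chunk.groupby(id_col)[ret_col]
        chunk_sum, chunk_count = grouped.sum(), grouped.count()
        # fill_value=0 handles strategies that appear in only some chunks.
        sums = chunk_sum if sums is None else sums.add(chunk_sum, fill_value=0)
        counts = chunk_count if counts is None else counts.add(chunk_count, fill_value=0)
    if sums is None:
        raise ValueError("no rows were read from the file")
    return sums / counts
```

Accumulating sums and counts separately, rather than averaging per-chunk means, keeps the result exact when a strategy's rows are split across chunks.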
| File / folder | Description |
| --- | --- |
| DataMinedLongShortReturnsEW.csv | EW long-short returns, ~30K strategies |
| DataMinedLongShortReturnsVW.csv | VW long-short returns, ~30K strategies |
| DataInput/SignalsTheoryChecked.csv | Published signals: theory vs. empirical |
| Risk-vs/ | Risk vs. mispricing decomposition outputs |
  • Benchmark published vs. mined: is a published signal in the right tail of the mined return distribution?
  • Sufficiency: how much cross-sectional variation does the published set capture vs. the full mined set?
  • Pre-publication: do soon-to-be-published patterns differ before vs. after publication?
  • Theory value: do theory-motivated signals beat empirical ones, conditional on mined performance?
  • Always state EW vs VW, sample period, and any filters.
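
The right-tail question in the first bullet can be made concrete with a simple empirical share. The sketch below is one plain way to do it, not necessarily the comparison used in the paper: `right_tail_share` is a hypothetical helper, and the choice of statistic (mean return, t-statistic, or something else, computed on the same weighting and sample period for both sides) is left to the caller.

```python
from bisect import bisect_left


def right_tail_share(mined_stats, published_stat):
    """Fraction of mined strategies whose statistic is >= the published one.

    A small value means few mined strategies do as well, i.e. the published
    signal sits far in the right tail of the mined distribution.
    """
    ordered = sorted(mined_stats)
    if not ordered:
        raise ValueError("mined_stats is empty")
    # bisect_left gives the number of mined values strictly below the target.
    strictly_below = bisect_left(ordered, published_stat)
    return (len(ordered) - strictly_below) / len(ordered)


# Toy numbers for illustration only; these are not values from the dataset.
toy_mined = [0.1, 0.5, 1.0, 2.0, 3.0]
print(right_tail_share(toy_mined, 2.0))  # 2 of 5 mined values are >= 2.0
```

Whatever statistic is used, report it alongside the EW or VW choice, the sample period, and any filters, since each of these changes the mined distribution being compared against.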

Chen, A. Y., J. Lopez-Lira, and T. Zimmermann. “Peer-reviewed theory does not help predict the cross-section of stock returns.” Data and code: https://github.com/chenandrewy/flex-mining, accessed YYYY-MM-DD.
