With better information, policymakers, business owners, researchers, and data scientists could better understand this key indicator of the American economy.
We believe that a reliable, unbiased, and extensible dataset can be built from the Common Crawl, an openly available snapshot of publicly accessible web pages on the World Wide Web, captured every one or two months.
We will use the Common Crawl in combination with big data and supervised machine learning techniques to classify web pages as business-related or not. The resulting dataset will be made available to the Federal Reserve for critical macroeconomic analysis of the US SMB sector. True Business Data (TBD) will contain:
- Business URL
- Business Address(es)
- Common Crawl date
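
As a rough illustration of the three fields listed above, a single record might be modeled as follows. This is only a sketch: the class name `BusinessRecord`, the field names, the `to_row` helper, and the sample values are all assumptions made for illustration, not a published schema.

```python
from dataclasses import dataclass, asdict
from datetime import date
from typing import List


@dataclass(frozen=True)
class BusinessRecord:
    """One hypothetical True Business Data row (names are assumptions)."""
    url: str              # Business URL
    addresses: List[str]  # Business address(es); a site may list several
    crawl_date: date      # Date of the Common Crawl snapshot the page came from


def to_row(record: BusinessRecord) -> dict:
    """Flatten a record into a plain dict suitable for CSV or JSON export."""
    row = asdict(record)
    # Dates are not JSON-serializable by default, so store an ISO 8601 string.
    row["crawl_date"] = record.crawl_date.isoformat()
    return row


# Placeholder values only; not real data.
example = BusinessRecord(
    url="https://example.com",
    addresses=["123 Main St, Springfield, IL"],
    crawl_date=date(2020, 1, 15),
)
print(to_row(example))
```

Keeping the addresses as a list reflects the "Address(es)" wording: one business URL may map to several locations, and keying each row on URL plus crawl date would let successive snapshots be compared over time.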
So what? — True Business Data comparisons with existing business datasets
Fig. 1: Overview of TBD vs other existing business data
Fig. 2: Comparing TBD vs existing data on key dimensions of Openness and Granularity