Identifying, categorizing and validating every single business in the US, using a dataset that comprises the entire Common Crawl, is an enormous data problem. As a result, we decided to narrow the initial scope of our investigation to something more manageable:
Initial scoping
Can we identify every business in Berkeley, California using the Common Crawl?
To answer this question we broke the problem down into tractable stages.
First Scan
In our first pass over the Common Crawl we ran through every single file in the most recent corpus (TODO: validate figures; roughly 37,000 files, 55 TB, February 2016 crawl). During this scan of the entire crawl we looked for every URL whose page contained a Berkeley address, grouped the results by website, and collected statistics such as: total number of web pages, number of web pages with a Berkeley address, total number of characters, number of characters from web pages with Berkeley addresses, etc. We did this using the following logic in our code:
import re

# Regex for a Berkeley, CA address ending, e.g. "Berkeley, CA 94704"
addr_pattern = re.compile(r"berkeley,? ca\.? 947\d\d", re.IGNORECASE | re.DOTALL).search

out = set()
for domain, pages in domain_dict.items():
    for url in pages:
        address = pages[url]['address']
        match = addr_pattern(address)
        if match is None:
            continue  # no Berkeley address on this page
        linebreak, end = match.span()
        street = address[:linebreak]

        # Walk the text before "Berkeley, CA 947xx" backwards, word by word,
        # until we hit a numeric token (the house number) or give up.
        teerts = street[::-1].strip().split()
        number_things = 0
        seen_blocks = 0
        for block in teerts:
            if seen_blocks < 1:
                seen_blocks += 1
            elif seen_blocks > 6:
                break  # give up after a handful of words
            elif re.search(r"\d+", block):
                number_things += 1
                break
            else:
                seen_blocks += 1

        if not number_things:
            continue  # no house number found; skip this page

        # If the word just beyond the house number also contains digits,
        # include it as part of the street as well.
        next_index = seen_blocks + number_things
        if next_index < len(teerts) and re.search(r"\d+", teerts[next_index]):
            street_words = teerts[:next_index + 1]
        else:
            street_words = teerts[:next_index]
        out.add(domain + '\t' + " ".join(street_words)[::-1] + '\t'
                + address[linebreak:end] + '\r\n')

with open("address-pull.tsv", "w") as out_f:
    for line in out:
        out_f.write(line)
After we gathered the websites and their statistics, we used this information to select which web pages to include in our dataset. We iterated on this numerous times but ultimately settled on the following logic:
- If the website had fewer than 1K pages, we included every single one.
- If it had more, we included only the first 1K pages with a Berkeley address. Most sites did not come close to that number, but a few big ones did.
One additional decision made here was to exclude any URL belonging to the 10,000 most frequent domains in the crawl. These domains account for a significant percentage of the entire Common Crawl and include huge commerce, news and informational sites like Amazon, Yahoo and Wikipedia. The 10,000th site on this list contained roughly 700K pages in a single crawl, so we feel confident that very few local businesses were lost by excluding such large sites.
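As an illustration, here is a minimal sketch of these selection rules, assuming a site_stats mapping from domain to a list of page records carrying a has_berkeley_address flag, and a top_10k_domains set; these names are hypothetical stand-ins for the statistics gathered during the first scan:

MAX_PAGES = 1000

def select_pages(site_stats, top_10k_domains):
    # site_stats: {domain: [page_record, ...]} in crawl order (assumed)
    selected = {}
    for domain, pages in site_stats.items():
        if domain in top_10k_domains:
            continue  # skip huge commerce/news/reference sites
        if len(pages) < MAX_PAGES:
            selected[domain] = pages  # small site: keep every page
        else:
            # large site: keep only the first 1K pages with a Berkeley address
            selected[domain] = [p for p in pages if p['has_berkeley_address']][:MAX_PAGES]
    return selected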
Second scan
Having generated an index file of 9,108 websites that contained a Berkeley address on at least one page, we ran a second pass over the Common Crawl. We used this list of websites as a filter on the crawl, which took us from roughly 1 TB of data down to around 10 GB. The result of all this was 865K web pages with the following information:
- URL
- Text content
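The filtering itself was conceptually simple. Here is a minimal sketch, assuming the 9,108 domains are loaded into a set called berkeley_domains and that iter_crawl_pages yields (url, text) pairs from the crawl; both names are hypothetical:

from urllib.parse import urlparse

def second_scan(berkeley_domains, iter_crawl_pages):
    # Yield only the pages whose domain appears in the first-scan index.
    for url, text in iter_crawl_pages():
        domain = urlparse(url).netloc.lower()
        if domain in berkeley_domains:
            yield {'url': url, 'content': text}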
Business classification
Fig. 2: Identifying and classifying businesses
The next step was to identify which of the websites we'd discovered represented businesses, versus those that simply happened to list Berkeley addresses (e.g. lists of locations, housing websites, etc.).
Stage 1: Manual website classification
We began this process by manually classifying over 900 of the roughly 9K available websites (>10%). We added two boolean feature flags: is this website a business (1/0), and is this website a Berkeley business (1/0). The second field was a strict subset of the first (i.e. no website could be flagged as not being a business yet flagged as being a Berkeley business).
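That subset constraint is easy to check mechanically. A quick illustrative check, assuming the labels live in a tab-separated file with columns is_business and is_berkeley_business (hypothetical names, not the original file layout):

import pandas as pd

labels = pd.read_csv("manual_labels.tsv", sep="\t")  # hypothetical file name
# Every website flagged as a Berkeley business must also be flagged as a business.
assert labels.loc[labels['is_berkeley_business'] == 1, 'is_business'].eq(1).all()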
A note on identifying businesses.
What is a business, and how do you identify one from its website? This was a non-trivial question to answer, and one we spent much time discussing and refining as we classified more websites. Our working definition was:
Any entity that seeks to make profit by selling goods or services.
Therefore we excluded .org websites that were demonstrably asking for donations or volunteer time, as well as .edu websites, which are generally not-for-profit.
The resulting 900 classified websites would become our training set for our classification model, and the raw data can be found here. You can also find clean versions of the labels here.
Stage 2: Build classification model
The output of our classification model needed to be at the website level: each website classified as either a business (1) or not a business (0). We established that a stacking ensemble classifier would be needed to achieve this. Classifying websites involved two main steps:
- The first step involves classifying web pages as either business or non-business. This actually includes two stages, which we'll explain in more detail below.
- The second step involves taking the output from each web page, aggregating it per website, and predicting a final classification for the website.
Stage 2: Step 1, classifying web pages
Our starting data was per web page, so we had the URL and the content. From these two data points we created a training dataset with the following features:
- Title text - the first line of the content was always the title, but we trimmed it at 100 characters
- Content text - everything after the title was treated as the content
- Title length - total number of characters
- Content length - same as above
- URL depth - the depth of the relative path of the URL, e.g. example.com/home = 1, example.com/home/blog = 2
- Domain/website - extracted from the URL; this is used in the step 2 aggregation
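As a sketch, the per-page feature extraction described above might look like the following; the function and field names are ours, not the original code:

from urllib.parse import urlparse

def page_features(url, content):
    # The first line of the content is the title, trimmed at 100 characters.
    title, _, body = content.partition("\n")
    title = title[:100]
    path = urlparse(url).path.strip("/")
    return {
        "title": title,
        "content": body,
        "title_length": len(title),
        "content_length": len(body),
        "url_depth": len(path.split("/")) if path else 0,  # /home = 1, /home/blog = 2
        "domain": urlparse(url).netloc,  # used for the step 2 aggregation
    }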
To train our web-page-level model, we determined we could use the title and content to train text-based classification models. We called these our stage 1 models and trained them separately as two logistic regression models taking TF-IDF weights (a classic method of assigning 'importance' to words in a document corpus) as input. We then generated class probabilities from each model and added them as features to our training and test data in preparation for stage 2.
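A minimal sketch of these stage 1 models using scikit-learn, assuming a pandas DataFrame train (e.g. built from the per-page features sketched above) with title, content and a page-level is_business label; the column names are hypothetical:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Two independent text models: one on titles, one on page content.
title_model = make_pipeline(TfidfVectorizer(), LogisticRegression())
content_model = make_pipeline(TfidfVectorizer(), LogisticRegression())

title_model.fit(train['title'], train['is_business'])
content_model.fit(train['content'], train['is_business'])

# The business-class probabilities become new features for stage 2.
train['p_title'] = title_model.predict_proba(train['title'])[:, 1]
train['p_content'] = content_model.predict_proba(train['content'])[:, 1]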
The stage 2 model then received the two probabilities from the previous models, along with the title length, content length and URL depth. These five features were used to train a second logistic regression for a final web-page-level classification. Two results were collected from this model: the web page classification and the probability estimate of the business class (i.e. of label 1).
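Continuing the sketch above, and again using hypothetical column names, the stage 2 page-level model might look like this:

from sklearn.linear_model import LogisticRegression

# Five features: two stage 1 probabilities plus three simple page statistics.
page_feature_cols = ['p_title', 'p_content', 'title_length', 'content_length', 'url_depth']

page_model = LogisticRegression()
page_model.fit(train[page_feature_cols], train['is_business'])

# Keep both the hard prediction and the business-class probability per page.
train['page_pred'] = page_model.predict(train[page_feature_cols])
train['page_proba'] = page_model.predict_proba(train[page_feature_cols])[:, 1]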
Stage 2: Step 2, classifying websites
In this second step, we took those two values for each web page and aggregated them by the corresponding website extracted from the URL. From that data we generated the following two features per website:
- Mean business classification prediction (i.e. the average of the 1/0 predictions across all web pages associated with the website).
- Mean business class probability estimate.
Finally, we used these two features to train our last logistic regression classifier, which incorporated all the information collected so far to render a final prediction. At each of the three stages, precision and recall improved, benefiting from the signal each model accumulated.
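To close the loop, here is a sketch of this final step, continuing the hypothetical column names used above and assuming a per-website label series site_labels indexed by domain:

from sklearn.linear_model import LogisticRegression

# Aggregate the page-level outputs per website.
site_features = train.groupby('domain').agg(
    mean_pred=('page_pred', 'mean'),
    mean_proba=('page_proba', 'mean'),
)

# Final website-level classifier over the two aggregated features.
site_model = LogisticRegression()
site_model.fit(site_features, site_labels.loc[site_features.index])
site_predictions = site_model.predict(site_features)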