Identifying, categorizing and validating every single business in the US, using a dataset that comprises the entire Common Crawl, is a truly huge data problem.

As a result we decided to narrow the initial scope of our investigations to something more manageable:


Initial scoping

Can we identify every business in Berkeley, California using the Common Crawl?

Fig. 1: Stage 1 of creating True Business Data

To answer this question we broke the problem down into tractable stages.


First scan

In our first pass over the Common Crawl we ran through every single file in the most recent corpus (37,000 files and 55 TB, from the February 2016 crawl; TODO: validate these figures). During this scan of the entire crawl we looked for every URL whose page contained a Berkeley address, grouped the results by website, and collected statistics such as: total number of web pages, number of web pages with a Berkeley address, total number of characters, and number of characters from web pages with Berkeley addresses. We did this using the following logic in our code:

import re

# domain_dict is assumed to already hold the scan output, mapping each
# domain to a dict of {url: {'address': <raw address text>, ...}}.
addr_pattern = re.compile(r"berkeley,? ca\.? 947\d\d", re.IGNORECASE | re.DOTALL).search

out = set()

for domain, pages in domain_dict.items():

    for url in pages:
        address = pages[url]['address']

        # Locate the "Berkeley, CA 947xx" portion of the raw address text;
        # skip the page if no Berkeley address is present.
        match = addr_pattern(address)
        if not match:
            continue
        start, end = match.span()

        city_state_zip = address[start:end]
        street = address[:start]
        # Reverse the street portion so we can scan its tokens from the end,
        # working backwards towards the house number.
        teerts = street[::-1].strip().split()

        # Search for the house number: skip the first token, give up after
        # six tokens, and stop at the first token containing a digit.
        number_things = 0
        seen_blocks = 0
        for block in teerts:
            if seen_blocks < 1:
                seen_blocks += 1
            elif seen_blocks > 6:
                break
            elif re.search(r"\d+", block):
                number_things += 1
                break
            else:
                seen_blocks += 1

        # No house number found in the street portion: not a usable address.
        if not number_things:
            continue

        try:
            # If the next token also contains digits, keep it as part of the street.
            more_numbers = re.search(r"\d+", teerts[seen_blocks + number_things])
            if more_numbers:
                street_tokens = teerts[:seen_blocks + number_things + 1]
            else:
                street_tokens = teerts[:seen_blocks + number_things]
            # Un-reverse the tokens and record domain, street and city/state/zip.
            out.add(domain + '\t' + " ".join(street_tokens)[::-1] + '\t'
                    + city_state_zip + '\r\n')
        except IndexError:
            pass

with open("address-pull.tsv", "w") as out_f:
    for line in out:
        out_f.write(line)
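
The per-site statistics described above (page counts and character counts, split by whether a page contains a Berkeley address) can be accumulated with a small helper along the lines of the sketch below. The pages iterable of (domain, url, text) tuples and the function name are assumptions for illustration rather than the actual crawl-processing job:

import re
from collections import defaultdict

addr_pattern = re.compile(r"berkeley,? ca\.? 947\d\d", re.IGNORECASE).search

def collect_site_stats(pages):
    """Accumulate per-domain statistics from an iterable of (domain, url, text) tuples."""
    stats = defaultdict(lambda: {
        "total_pages": 0,
        "berkeley_pages": 0,
        "total_chars": 0,
        "berkeley_chars": 0,
    })
    for domain, url, text in pages:
        site = stats[domain]
        site["total_pages"] += 1
        site["total_chars"] += len(text)
        # Pages whose text contains a Berkeley address are counted separately.
        if addr_pattern(text):
            site["berkeley_pages"] += 1
            site["berkeley_chars"] += len(text)
    return stats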

After we gathered the websites and their statistics, we used this information to select which web pages to include in our dataset. We iterated on this numerous times but ultimately settled on the following logic (a code sketch of the full selection filter appears below):

  1. If the website had fewer than 1K pages, we included every single one.
  2. If it had more, we included only the first 1K pages with a Berkeley address. Most sites did not come close to that number, but a few big ones did.

One additional decision made here was to exclude any URLs belonging to the 10,000 most frequently occurring domains. These domains represent a significant percentage of the entire Common Crawl and include huge commerce, news and informational sites like Amazon, Yahoo and Wikipedia. The 10,000th site on this list contained ~700K pages in a single crawl, and we feel confident that very few local businesses were removed by excluding such large sites.
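
Putting these rules together, the selection filter looks roughly like the sketch below. The site_stats mapping (domain to a list of pages, each flagged with whether it contains a Berkeley address) and the top_domains set of the 10,000 most frequent domains are hypothetical names used for illustration:

PAGE_CAP = 1000  # per-site page limit described above

def select_pages(site_stats, top_domains):
    """Apply the page-selection rules: drop the most frequent domains,
    keep whole small sites, and cap large sites at the first 1K
    Berkeley-address pages."""
    selected = {}
    for domain, pages in site_stats.items():
        # Skip the 10,000 most frequent domains (Amazon, Yahoo, Wikipedia, ...).
        if domain in top_domains:
            continue
        if len(pages) < PAGE_CAP:
            # Rule 1: small site, keep every page.
            selected[domain] = list(pages)
        else:
            # Rule 2: large site, keep only the first 1K pages that
            # contain a Berkeley address.
            berkeley_pages = [p for p in pages if p["has_berkeley_address"]]
            selected[domain] = berkeley_pages[:PAGE_CAP]
    return selected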


Second scan

Having generated an index of 9,108 websites that contained a Berkeley address, we ran a second pass on the Common Crawl. We used this list of websites as a filter on the crawl, going from 1 TB of data down to around 10 GB (the filtering step is sketched below). The result of all this was 865K web pages with the following information:

  • URL
  • Text content
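
As an illustration of this filtering pass (not the actual job that ran over the crawl), the sketch below keeps only the records whose host appears in the index produced by the first scan. The records iterable of (url, text) pairs and the index file name are assumptions for the example:

from urllib.parse import urlparse

def load_index(path="berkeley-site-index.txt"):
    """Load the 9,108-domain index produced by the first scan
    (assumed here to be a one-domain-per-line text file)."""
    with open(path) as f:
        return {line.strip().lower() for line in f if line.strip()}

def filter_records(records, indexed_domains):
    """Yield only the (url, text) records whose host matches an indexed website."""
    for url, text in records:
        host = urlparse(url).netloc.lower()
        # Match "example.com" as well as "www.example.com".
        bare_host = host[4:] if host.startswith("www.") else host
        if host in indexed_domains or bare_host in indexed_domains:
            yield url, text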

Business classification

Fig. 2: Identifying and classifying businesses

The next step was to identify which of the websites we'd discovered represented businesses, versus those that simply happened to list Berkeley addresses (e.g. lists of locations, housing websites, etc.).

Stage 1: Manual website classification

We began this process by manually classifying over 900 websites out of the 9K available (>10%). We added two boolean flags to each: is this website a business (1/0), and is this website a Berkeley business (1/0). The second flag was a strict subset of the first (i.e. a website could not be flagged as not being a business yet still be flagged as being a Berkeley business).

A note on identifying businesses

What is a business, and how do you identify one from its website? This was a non-trivial question to answer, and one we kept discussing and refining the longer we spent classifying websites. Our working definition was:

Any entity that seeks to make profit by selling goods or services.

Therefore we excluded .org websites that were demonstrably asking for donations or volunteer time, as well as .edu websites, which are generally not-for-profit.

The resulting 900 classified websites became the training set for our classification model, and the raw data can be found here. You can also find clean versions of the labels here.

Stage 2: Build classification model

Our classification model needed to produce a label at the website level: business (1) or not a business (0). We settled on a stacked ensemble of classifiers to achieve this. Classifying websites involved two main steps:

  1. The first step involves classifying web pages as either business or non-business. This actually includes two stages, which we'll explain in more detail below.
  2. The second step involves taking the output from each web page, aggregating it per website, and predicting a final classification for the website.

Stage 2: Step 1, classifying web pages

Our starting data was per web page: the URL and the text content. From these two data points we created a training dataset with the following features (a sketch of the feature extraction follows this list):

  • Title text - the first line of the content was always the title, but we trimmed it at 100 characters
  • Content text - the rest of the content was considered the content
  • Title length - total number of characters
  • Content length - same as above
  • URL depth - the depth of the relative path of the URL, e.g. example.com/home = 1, example.com/home/blog = 2
  • Domain/website - extracted from the URL, this is used in the step 2 aggregation
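
As a rough illustration (not our exact pipeline code), these features can be derived from a (url, content) pair along the following lines; the helper name is hypothetical, and the 100-character title cut-off mirrors the description above:

from urllib.parse import urlparse

def page_features(url, content):
    """Build the per-page feature dict described above from a raw (url, content) pair."""
    lines = content.split("\n", 1)
    title = lines[0][:100]                     # first line of content, trimmed to 100 chars
    body = lines[1] if len(lines) > 1 else ""  # everything after the title

    parsed = urlparse(url)
    # Depth of the relative path: example.com/home -> 1, example.com/home/blog -> 2
    path_parts = [part for part in parsed.path.split("/") if part]
    return {
        "title_text": title,
        "content_text": body,
        "title_length": len(title),
        "content_length": len(body),
        "url_depth": len(path_parts),
        "domain": parsed.netloc,  # used for the step 2 aggregation
    }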

To train our web-page-level model we determined we could use the title and the content to train text-based classification models. We called these our stage 1 models: two logistic regressions, trained separately on the title text and the content text, each taking TF-IDF weights as input (a classic method of assigning 'importance' to words in a document corpus). We then took the probability each model produced and added it as a feature to our training and test data in preparation for stage 2.
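
A minimal sketch of one such stage 1 model using scikit-learn, with hypothetical array names (train_titles, train_labels, all_titles) standing in for our actual data:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# One stage 1 model: TF-IDF weights fed into a logistic regression.
# An identical pipeline is trained separately on the content text.
title_model = make_pipeline(TfidfVectorizer(), LogisticRegression())
title_model.fit(train_titles, train_labels)

# Probability of the business class (label 1), added as a feature for stage 2.
title_prob = title_model.predict_proba(all_titles)[:, 1]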

The stage 2 model then took the two probabilities from the previous models and added the title length, the content length and the URL depth. These five features were used to train a second logistic regression that produced the final web-page-level classification. We collected two results from this model: the web-page-level classification and the probability estimate of the business class (i.e. of label 1).
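
Continuing the sketch above (again with hypothetical array names), the stage 2 page-level model stacks the two stage 1 probabilities with the three page statistics:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Five features per page: two stage 1 probabilities plus three page statistics.
page_matrix = np.column_stack(
    [title_prob, content_prob, title_length, content_length, url_depth]
)

page_model = LogisticRegression()
page_model.fit(page_matrix[train_idx], train_labels)

# Outputs carried forward to the website-level step.
page_pred = page_model.predict(page_matrix)              # 1/0 per web page
page_prob = page_model.predict_proba(page_matrix)[:, 1]  # P(business) per web page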

Stage 2: Step 2, classifying websites

In this second step, we took those two values per web page and aggregated them by the website extracted from each URL. From that data we generated the following two features per website:

  • Mean business classification prediction (i.e. the average of the 1/0 labels across all web pages associated with the website).
  • Mean business class probability estimate.

Finally, we used these two features to train our last logistic regression classifier, which incorporated all the signal collected so far to render a final website-level prediction (sketched below). At each of the three stages, precision and recall increased, benefiting from the signal accumulated by each preceding model.
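
A compact sketch of this aggregation and final model using pandas and scikit-learn, with hypothetical names (page_domains, labelled_domains, site_labels) standing in for the per-page outputs and manual labels described above:

import pandas as pd
from sklearn.linear_model import LogisticRegression

# One row per web page: the website it belongs to plus the step 1 outputs.
pages = pd.DataFrame({
    "domain": page_domains,
    "page_pred": page_pred,  # 1/0 page-level classification
    "page_prob": page_prob,  # page-level probability of the business class
})

# Aggregate per website: mean prediction and mean probability.
site_features = pages.groupby("domain")[["page_pred", "page_prob"]].mean()

# Final website-level classifier trained on the manually labelled sites.
site_model = LogisticRegression()
site_model.fit(site_features.loc[labelled_domains], site_labels)

site_predictions = site_model.predict(site_features)  # 1 = business, 0 = not a business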
