Given the scale of the Common Crawl dataset, it was clear from the outset that creating an open source dataset from scratch would be a challenge. We have demonstrated that, through the application of data science, it is possible to do so.

We have created a working foundation and an example of this in practice, as outlined in this handbook and on our site. As with any large undertaking, however, our work is not without limitations. Here we outline further considerations and improvements that we would encourage in future iterations of True Business Data.

Improving classification through further labeled data

As outlined in our 'Creating True Business Data' section, we used manually labeled training data for Berkeley to train a business URL classification model, and we saw strong performance from this model in terms of accuracy and precision. As with all supervised machine learning problems, more labeled data would both improve the model's measured performance during development and, more fundamentally, improve its generalizability to the rest of the Common Crawl. A Mechanical Turk strategy could be employed to gather this additional labeled data in a cost-effective manner.
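As a rough illustration of what a URL classification pipeline of this kind might look like, the sketch below trains a simple character n-gram logistic regression on a hypothetical labeled URL file. The file name, column names, and choice of model are assumptions for illustration only, not the exact pipeline used for TBD.

```python
# Illustrative sketch only: a character n-gram logistic regression for
# classifying URLs as business / non-business. The file name, column
# names, and model choice are hypothetical, not the TBD pipeline itself.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline

# Hypothetical labeled data: one URL per row, label 1 = business site.
labeled = pd.read_csv("labeled_urls.csv")  # columns: url, is_business

X_train, X_test, y_train, y_test = train_test_split(
    labeled["url"], labeled["is_business"], test_size=0.2, random_state=42
)

# Character n-grams capture URL structure (tokens such as ".com/contact").
model = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)

# Report precision and recall on the held-out split.
print(classification_report(y_test, model.predict(X_test)))
```

More labels, gathered via Mechanical Turk or otherwise, would feed directly into a split like this, giving both a larger training set and a more trustworthy held-out evaluation.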

This work would go a long way toward resolving the major outstanding concerns with the TBD data, namely its implied precision and accuracy (which are hard to quantify) and its generalizability.

Future potential for validation

While our data is novel, and therefore not directly comparable with any other open dataset, we have outlined several avenues for further validation. One method would be to validate TBD against Google Maps, Yelp, and Yellow Pages.

Using a grid-based random sampling technique, individual streets could be selected from maps of the U.S. For these randomly selected streets, the TBD data could then be manually validated against other sources such as Google Maps, for example by cross-checking the businesses listed on those streets across the different sources.
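As a minimal sketch of what the sampling step might look like, the snippet below draws random grid cells from a rough bounding box around the contiguous U.S. The bounding box, cell size, and sample size are assumptions; mapping sampled cells to actual streets would require a map source such as OpenStreetMap and is not shown.

```python
# Minimal sketch of grid-based random sampling over the contiguous U.S.
# Bounding box, grid resolution, and sample size are assumed values.
import random

LAT_MIN, LAT_MAX = 24.5, 49.4    # rough contiguous-U.S. latitude range
LON_MIN, LON_MAX = -125.0, -66.9
CELL_DEG = 0.25                  # grid cell size in degrees
N_SAMPLES = 50                   # number of cells to audit manually

random.seed(42)

n_rows = int((LAT_MAX - LAT_MIN) / CELL_DEG)
n_cols = int((LON_MAX - LON_MIN) / CELL_DEG)

# Draw grid cells uniformly at random; each cell becomes one audit unit
# whose streets are then cross-checked against Google Maps, Yelp, etc.
sampled_cells = random.sample(
    [(r, c) for r in range(n_rows) for c in range(n_cols)], N_SAMPLES
)

for r, c in sampled_cells:
    lat = LAT_MIN + r * CELL_DEG
    lon = LON_MIN + c * CELL_DEG
    print(f"Audit cell: lat {lat:.2f}..{lat + CELL_DEG:.2f}, "
          f"lon {lon:.2f}..{lon + CELL_DEG:.2f}")
```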

Further validation methods could include:

  • Contacting businesses directly to confirm their existence
  • Gaining access to a complete proprietary dataset for a limited period to cross-validate accuracy, e.g. through a collaboration with Google or Yelp

What are we missing?

'One popular and persistent misconception about Common Crawl, however, is to think that it is truly representative for the Web as a whole.' (http://www.heppnetz.de/files/commoncrawl-cold2015.pdf)

Prior research has shown that the Common Crawl does not contain 'a copy of the web', but rather those pages that the CC spider is both 1) allowed to crawl and 2) programmed to crawl. Many subdomains of certain sites may not be fully crawled because of how the spider is directed. The Common Crawl prioritizes pages based on a ranking of page importance derived from a set of URLs provided by the search engine Blekko.
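One way to get a feel for these coverage gaps is to query the public Common Crawl CDX index for a domain of interest and count how many of its URLs were actually captured in a given snapshot. The sketch below is only an illustration: the crawl ID and domain are example values, and very large domains would need the API's pagination options rather than a single request.

```python
# Rough sketch: check how much of a given domain appears in one Common
# Crawl snapshot via the public CDX index API (https://index.commoncrawl.org/).
# CRAWL_ID and DOMAIN are example values; substitute ones relevant to you.
import json
import requests

CRAWL_ID = "CC-MAIN-2016-50"  # example snapshot ID
DOMAIN = "example.com"

resp = requests.get(
    f"https://index.commoncrawl.org/{CRAWL_ID}-index",
    params={"url": DOMAIN, "matchType": "domain", "output": "json"},
    timeout=60,
)
resp.raise_for_status()

# The API returns one JSON record per line; counting distinct captured
# URLs (including subdomains) gives a sense of how thoroughly the domain
# was crawled. Large domains may require the API's pagination parameters.
records = [json.loads(line) for line in resp.text.splitlines() if line]
unique_urls = {rec["url"] for rec in records}
print(f"{DOMAIN}: {len(unique_urls)} distinct URLs in {CRAWL_ID}")
```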

Conclusion

We focused this project on building out the methodology and proving the thesis that a novel open source dataset could be created from the Common Crawl. We feel we have proven this decisively, while also acknowledging that there are still major improvements to be made to the groundwork we have laid. We hope that this work proves useful to others and spurs further data science projects around U.S. businesses and the Common Crawl.

Jaime Vilalpando, Michael Kennedey, Stephen Tracy MIDS 2016

results matching ""

    No results matching ""