To validate our model, we held out 25% of our manually classified data as a test set. We achieved the following accuracy, precision, recall, and F1 scores on Berkeley-based businesses (an illustrative sketch of this evaluation follows the list):
- Accuracy: 85.4%
- Recall: 39.5%
- Precision: 85%
- F1: 54%
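The sketch below shows the shape of this evaluation: a 75/25 train/test split followed by the four metrics reported above. The synthetic data from `make_classification` and the logistic regression classifier are stand-ins for illustration, not our actual features or model.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Placeholder data standing in for our ~900 manually labeled websites;
# the real pipeline's features and classifier are not shown here.
X, y = make_classification(n_samples=900, n_features=20, random_state=0)

# Hold out 25% of the labeled data as a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(f"Accuracy:  {accuracy_score(y_test, y_pred):.1%}")
print(f"Recall:    {recall_score(y_test, y_pred):.1%}")
print(f"Precision: {precision_score(y_test, y_pred):.1%}")
print(f"F1:        {f1_score(y_test, y_pred):.1%}")
```

Here `test_size=0.25` corresponds to the 25% holdout described above.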
We chose a model that was biased towards precision, i.e. correctly identifying true businesses. While the recall was low, these results should be viewed in light of the fact that our model was built on ~900 manually labeled examples. Given the small number of labeled websites, our model scores had high variance; merely using different seeds when sampling the training/test split led to vastly different recall scores, ranging from 53% to 74% (a spread of 21 percentage points), as the sketch below illustrates. As a result, we believe this provides a solid benchmark that further manual classification will vastly improve.
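A minimal sketch of this seed-sensitivity check, under the same illustrative assumptions as above (synthetic data, logistic regression): re-sample the split with different seeds and measure how far recall moves.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Same placeholder data as in the previous sketch.
X, y = make_classification(n_samples=900, n_features=20, random_state=0)

# Re-sample the 75/25 split with different seeds and record the recall
# on each resulting test set.
recalls = []
for seed in range(20):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=seed)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    recalls.append(recall_score(y_te, model.predict(X_te)))

spread = max(recalls) - min(recalls)
print(f"Recall range: {min(recalls):.1%} to {max(recalls):.1%} "
      f"({spread * 100:.0f} ppts)")
```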
We also trained a second model on all businesses, i.e. including those located outside Berkeley. Here our model scores improved dramatically thanks to the larger set of labeled examples:
- Accuracy: 74.7%
- Recall: 74.3%
- Precision: 77.2%
- F1: 75.7%
In the evaluation, we discuss the need for more labeled data to improve the performance of the TBD models.