Co-authored with Patrick Mutchler.
Two weeks ago we kicked off the MetaPhone project, a crowdsourced study of phone metadata. Our aim is to inform policy and legal debates surrounding dragnet surveillance programs. We are exceedingly grateful to the hundreds of users who have joined. If you have not yet participated, you can still grab the MetaPhone app for Android.
Today we are excited to share some preliminary results: We can predict many romantic relationships. Automatically. Using solely phone metadata.
We began by dividing the problem into two parts. First, is a person in a relationship? If yes, then second, which number belongs to the person’s significant other?
The second problem, we quickly discovered, is much easier to solve. In our sample of individuals with significant others, the SO is the most-called number for over 60% of participants, and the most-texted number for over 70% of participants. Plenty of room for improvement, to be sure, but those simple features are a decent start.
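The "most-called" and "most-texted" baselines amount to a frequency count over a participant's log. Here is a minimal sketch, assuming a hypothetical log format of `(kind, number)` pairs; the function name, record layout, and phone numbers are illustrative and not taken from the MetaPhone app.

```python
from collections import Counter

def guess_significant_other(records, kind="call"):
    """Return the number that appears most often among records of the given kind.

    `records` is an iterable of (kind, number) pairs, a hypothetical format.
    Returns None when there are no records of that kind.
    """
    counts = Counter(number for k, number in records if k == kind)
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Made-up log for illustration only.
log = [
    ("call", "555-0101"), ("call", "555-0101"), ("call", "555-0199"),
    ("text", "555-0142"), ("text", "555-0142"), ("text", "555-0101"),
]

most_called = guess_significant_other(log, "call")
most_texted = guess_significant_other(log, "text")
```

The two baselines can disagree, as they do in this toy log, which is one reason combining call and text features leaves room for improvement.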
Determining whether a user is in a relationship proved trickier. After a burst of rapid machine learning development, we were able to build a dating detector with good performance.
For those interested in the gory details: We began by selecting participants who provided a relationship status on Facebook. Individuals who were “Single” were labeled negative (no offense intended), and those who were “In a relationship,” “Engaged,” or “Married” were labeled positive. Next, we generated features from call and text patterns, including histograms of count and length. We used 10-fold cross-validation to generate training and testing splits, and we randomly upsampled the singles in each training split to account for class imbalance among participants. For each fold, we built a k-nearest neighbor classifier [1] from the training data and calculated a receiver operating characteristic (ROC) curve over the testing data. Finally, we averaged the curves, as plotted below.
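For readers who want to experiment, the steps above could be sketched roughly as follows with scikit-learn. This is an illustrative sketch, not the study's actual code: the synthetic features, the stratified folds, the fixed false-positive-rate grid used for averaging, and the name `mean_roc` are all assumptions.

```python
import numpy as np
from sklearn.metrics import roc_curve
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.utils import resample

def mean_roc(X, y, n_splits=10, n_neighbors=5, seed=0):
    """Average per-fold ROC curves; y is 1 for 'has SO' and 0 for 'single'."""
    grid = np.linspace(0.0, 1.0, 101)  # common false-positive-rate grid
    fold_tprs = []
    # Stratified folds are an assumption; they keep both classes in every test split.
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for i, (train, test) in enumerate(folds.split(X, y)):
        X_tr, y_tr = X[train], y[train]
        pos, neg = X_tr[y_tr == 1], X_tr[y_tr == 0]
        # Randomly upsample singles in the training split only, never in the test split.
        if len(neg) < len(pos):
            neg = resample(neg, replace=True, n_samples=len(pos),
                           random_state=seed + i)
        X_bal = np.vstack([pos, neg])
        y_bal = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])
        clf = KNeighborsClassifier(n_neighbors=n_neighbors).fit(X_bal, y_bal)
        # Column 1 of predict_proba is the estimated probability of the positive class.
        scores = clf.predict_proba(X[test])[:, 1]
        fpr, tpr, _ = roc_curve(y[test], scores)
        interp_tpr = np.interp(grid, fpr, tpr)
        interp_tpr[0] = 0.0  # pin the curve to the origin
        fold_tprs.append(interp_tpr)
    return grid, np.mean(fold_tprs, axis=0)

# Synthetic stand-in data: 120 positives and 40 negatives with 4 made-up features.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(1.0, 1.0, (120, 4)), rng.normal(0.0, 1.0, (40, 4))])
y = np.array([1] * 120 + [0] * 40)

fpr_grid, mean_tpr = mean_roc(X, y)
```

Upsampling inside the loop, after the split, keeps duplicated singles from leaking into the test fold. Interpolating each fold's curve onto a shared grid is one common way to average ROC curves; other averaging schemes exist.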
In less jargon: The graph reflects a tradeoff. We can get more individuals with significant others right, the vertical axis, in exchange for getting more singles wrong, the horizontal axis. This means, roughly, we can guess six in ten individuals with SOs right and get relatively few singles wrong. Or, if we accept getting one in three singles wrong, we can jump to getting over eight in ten individuals with SOs right.
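Reading an operating point off such a curve is a simple lookup: pick the share of singles you are willing to get wrong, then find the best share of individuals with SOs you can get right within that budget. The sketch below assumes a hypothetical helper `tpr_at_fpr`, and the curve points are invented for illustration; they are not the study's results.

```python
def tpr_at_fpr(fpr, tpr, max_fpr):
    """Highest true-positive rate among curve points whose false-positive rate is <= max_fpr."""
    return max((t for f, t in zip(fpr, tpr) if f <= max_fpr), default=0.0)

# Invented ROC points, NOT the study's curve.
toy_fpr = [0.0, 0.1, 0.25, 0.5, 1.0]
toy_tpr = [0.0, 0.4, 0.7, 0.9, 1.0]

strict = tpr_at_fpr(toy_fpr, toy_tpr, 0.1)       # few singles wrong
lenient = tpr_at_fpr(toy_fpr, toy_tpr, 1.0 / 3)  # up to one in three singles wrong
```

Loosening the budget on the horizontal axis never lowers the achievable value on the vertical axis, which is the tradeoff the graph expresses.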
The relationship statuses that we studied are not, by the way, volunteered to the general public. Only about one in four were configured to display to a stranger on Facebook, even though that’s the default.
These are, to emphasize, preliminary results. We will have more, better, and higher-confidence findings as additional users (like you!) participate. We still have much more work ahead on the MetaPhone project. This is just a first, promising step towards confirming metadata’s importance.
Many thanks to our colleagues at Stanford and Princeton for their invaluable suggestions. All views are solely our own.
1. We set the number of neighbors at 5, the default in sklearn’s implementation.