Sunday’s New York Times included a story about how the presidential campaigns are making extensive use of third-party web trackers. In response to privacy concerns, “[o]fficials with both campaigns emphasize[d] that [tracking] data collection is ‘anonymous.’”1
The campaigns are wrong: tracking data is very often identified or identifiable. Arvind Narayanan has previously written a comprehensive and accessible explanation of why web tracking is hardly anonymous; my survey paper on web tracking provides more extensive discussion.
One of the ways in which web tracking data can become identified or identifiable is “leakage”: data flowing to trackers from the websites that users interact with. Leakage most commonly occurs when a website includes identifying information in a page URL or title. Embedded third parties receive the identifying information if they receive the URL (e.g. referrer headers) or the title (e.g. document.title). Even a little identifying information leakage thoroughly undermines the privacy properties of web tracking: once a user’s identity leaks to a tracker, all of the tracker’s past, present, and future data about the user becomes identifiable.
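To make the leakage mechanism concrete, here is a minimal sketch of how one might check whether known identifiers for a test account show up in the URL or title that an embedded third party would receive. It is an illustration only, not the methodology from my prior study; the function name `find_leaks`, the test account values, and the `example.org` URL are all hypothetical.

```python
from urllib.parse import unquote_plus


def find_leaks(identifiers, referrer_url, page_title):
    """Return (label, where) pairs for each identifier that appears in the
    referrer URL or page title a third party would receive."""
    # Decode percent-escapes and '+' so "Leland%20Stanford" matches "Leland Stanford".
    haystacks = {
        "url": unquote_plus(referrer_url).lower(),
        "title": page_title.lower(),
    }
    leaks = []
    for label, value in identifiers.items():
        needle = value.lower()
        for where, text in haystacks.items():
            if needle in text:
                leaks.append((label, where))
    return leaks


# Hypothetical test account and page; none of these values come from the campaign sites.
identifiers = {"username": "leland.stanford", "name": "Leland Stanford", "zip": "94305"}
leaks = find_leaks(
    identifiers,
    referrer_url="https://example.org/profile/leland.stanford?zip=94305",
    page_title="Leland Stanford | Example Campaign",
)
print(leaks)
```

A simple substring match like this will miss identifiers that are hashed or otherwise transformed before transmission, so it gives a lower bound on leakage rather than a complete picture.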
Web services frequently fail to account for information leakage in their design and testing; a study I conducted last year found that over half of popular websites were leaking identifying information.2 More than a few website operators have made inaccurate representations about the information they share with third parties; in just the past year the Federal Trade Commission settled deception claims against both Facebook and Myspace for falsely disclaiming identifying information leakage.
The Times coverage piqued my curiosity: Are the campaigns identifying their supporters to third-party trackers? Are they directly undermining the anonymity properties that they are so quick to invoke?
Yes, they are. I tested the two leading candidate websites using the methodology from my prior study of identifying information leakage. Both leak. The following sections describe my observations from the Barack Obama and Mitt Romney campaign websites.
- Username. Several pages include the username in their URL or title, including the user preferences page, the social organizing “Dashboard” profile page, the Dashboard profile editing page, and the Dashboard personal statistics page.3
In my testing, the username leaked to ten companies.4
A username is often personally identifying. It might simply be a user’s name, or it could enable linking other public accounts and information about the user. Several companies have already deployed effective username linkage in their products.
The default username selection on barackobama.com facilitates identifying users. If the user registers with a Facebook account, the username defaults to the user’s name in dot-separated format (e.g. leland.stanford). If the user signs up with just an email address, the default username is the first part of the user’s email address, which will often be some form of the user’s name or a fanciful username shared with other services.
The design of the Dashboard website also enables connecting a username to a user’s identity. Any signed-in user (including someone trying to identify tracking data) can look up a user’s profile from their username. Unless a user opts out, their profile page will include their name.
- Name. The title of the Dashboard profile page incorporates the user’s name. A script on the page reports impression information to Chartbeat, including the page title.5
- Street Address and ZIP Code. If a user searches for an organizing team in Dashboard, the results page includes the query street address and ZIP code in its URL. It appears new Dashboard users are required to search for a team.
Similarly, the results page for finding an event includes the query ZIP code in its URL.
I spotted the street address and ZIP code leaking to nine companies, and just the ZIP code leaking to one other company.6
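As a rough illustration of how little effort it takes for a recipient to recover a search query from a leaked URL, the sketch below parses the query string of a results-page URL shaped like the teams search. The host `dashboard.example` and the helper name `query_fields` are assumptions for the example; only the `street` and `zip` parameter names and sample values mirror the URL format I observed.

```python
from urllib.parse import urlsplit, parse_qs


def query_fields(referrer_url):
    """Return the first value of each query parameter in a URL."""
    query = urlsplit(referrer_url).query
    # parse_qs decodes percent-escapes and turns '+' into a space.
    return {name: values[0] for name, values in parse_qs(query).items()}


# Hypothetical host; the parameter layout follows the teams search results URL.
referrer = "https://dashboard.example/teams/match?street=353+Serra+Mall&zip=94305"
fields = query_fields(referrer)
print(fields)
```

Any third party that logs referrer headers could run something equivalent over its existing records, which is why a street address in a URL is effectively a street address in every embedded tracker's logs.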
- Name. The post-login landing page and most preference pages include the user’s name in their title.
Scripts from two companies collect the page title as part of their impression reporting.7
- Partial Email Address. If a user registers with their Facebook account, the post-login landing page URL incorporates the first part of the user’s email address (with non-alphanumeric characters removed).
Thirteen companies received the partial email address.8
- User ID. Many pages include a unique user ID in their URL, which leaks to the same companies.9
The ID itself is not identifying information, and mittromney.com does not provide social functionality that would facilitate mapping a user ID to a user’s name. It appears, however, that a quirk in mittromney.com can allow anyone (even if not logged in) to determine a user’s name from their ID. If the user has recently visited a URL that includes their user ID, anyone who visits that URL in the following (very roughly) fifteen minutes can view the user’s name in the page heading.10
A tracker could identify users by waiting for them to land on one of these URLs, then visiting it and extracting the user’s name. Alternatively, anyone in possession of tracking data could periodically test these user ID URLs.
- ZIP Code. The results page for an events search includes the query ZIP code in its URL.11
The ZIP code leaked to the same companies as the partial email address and user ID.
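The two transformations described above, a partial email address with non-alphanumeric characters removed and a page title carried in a reporting URL, can be sketched as follows. This is my reconstruction of the observable behavior, not the site's actual code; the function names and the `example.com` address are hypothetical, and I am assuming the site preserves letter case.

```python
import re
from urllib.parse import quote


def partial_email_slug(email):
    """First part of an email address with non-alphanumeric characters removed."""
    local_part = email.split("@", 1)[0]
    return re.sub(r"[^A-Za-z0-9]", "", local_part)


def encoded_title(title):
    """Percent-encode a page title as it would appear in a query parameter."""
    return quote(title, safe="")


# Hypothetical email address; the title follows the post-login landing page format.
slug = partial_email_slug("leland.stanford@example.com")
title_param = encoded_title("Leland Stanford | Mitt Romney for President")
print(slug)
print(title_param)
```

Neither transformation hides anything: the slug is usually still a recognizable name, and the encoded title decodes back to the original string.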
The major presidential campaigns both fell short of best practices in their website design and testing, and they both misrepresented their privacy practices to the Times. The Gray Lady also deserves a light rap on the knuckles for insufficiently scrutinizing the campaigns’ anonymity assertions.
But, in my view, the greatest takeaway is that the myth of web tracking’s anonymity has proven remarkably resilient, despite compelling research results and practical experience to the contrary. Companies and trade groups in the tracking business community frequently invoke unfounded claims of anonymity. Policymakers, website operators, and journalists all too often repeat those claims, even, apparently, when they’re of the highest caliber.
My hope is that this episode serves as a learning opportunity and a reminder: there is no such thing as anonymous web tracking.
Thanks to Ed Felten and Arvind Narayanan for valuable comments on a draft. All errors are solely my own.
1. An Obama campaign spokesman went even further, asserting “[w]e do not provide any personal information to outside entities.” The
3. For example, respectively,
4. The companies were: Akamai (CDN used by Chartbeat), Amazon (Amazon Web Services used by the campaign and New Relic), BrightTag, Chartbeat, Facebook, Google (Analytics, DoubleClick, and Hosted Libraries), Hoefler & Frere-Jones (typography.com), New Relic, Think Realtime, and Zendesk. Here and throughout this post I err on the side of comprehensiveness in listing third parties that receive data. Opinions differ on the privacy risks associated with various service providers (e.g. Akamai, Amazon Web Services, and Google Analytics). My intent is not to take a position on that issue, but rather, convey sufficient information to satisfy readers across the spectrum of views.
6. The results page for a Dashboard teams search has a URL formatted like https://dashboard.barackobama.com/teams/match?street=353+Serra+Mall&zip=94305.... The results page URL for an events search has a format like https://my.barackobama.com/page/event/search_results?...zip_radius%5B0%5D=94305. I observed the street address and ZIP code leak to: Akamai (CDN used by Chartbeat), Amazon (Amazon Web Services used by the campaign and New Relic), Chartbeat, Facebook, Google (Analytics), Hoefler & Frere-Jones (typography.com), New Relic, Optimizely, and Zendesk. ZIP code also leaked to BrightTag and Google (Maps API).
7. The post-login landing page title has the form Leland Stanford | Mitt Romney for President. A ShareThis script reports back to a URL like https://l.sharethis.com/pview?...title=Leland%20Stanford%20%7C%20Mitt%20Romney%20for%20President..., and a Syncapse script contacts a URL like
8. An example post-login landing page URL for a Facebook user: https://www.mittromney.com/users/lelandstanford. The thirteen companies who received the first portion of the user’s email address were: Adobe (Typekit), Akamai (hosting used by the campaign), Amazon (Amazon Web Services used by New Relic and Search Discovery), Compete, comScore (Scorecard Research), Facebook, Google (Ad Services and DoubleClick), Lotame, New Relic, Optimizely, Search Discovery, ShareThis, and Syncapse.
9. No matter how a user registers, many pages include a unique user ID in their URL. A preferences page, for example, might have the URL https://www.mittromney.com/user/123456789/edit. If the user signs up without a social network login, the post-login landing page has the generic URL https://www.mittromney.com/user. If the user signs up with a Facebook or Twitter account, however, the landing page URL also includes a unique user ID, but assigned with a different scheme. The Facebook ID allocation system is discussed above; Twitter post-login URLs take the form
10. My best hypothesis is that this property arises from a caching misconfiguration; page content is correctly dynamic between users, but page headings are incorrectly cached for a period independent of user permissions.
11. Thanks to Natasha Singer for identifying ZIP code leakage on