Original at the Stanford Center for Internet and Society.
Despite all the attention they’ve received in the debates around online privacy, cookies are far from the only way to track a user. Broadly speaking, a website can either stash a unique identifier anyplace in the browser (“tagging”)[1] or explore features of the browser until it becomes unique (“fingerprinting”).[2] Tracking technologies that do not rely on cookies are often referred to as “supercookies,” and they are widely viewed as unsavory in the computer security community because they continue tracking even when a user clears her cookies to preserve privacy. Sometimes a site will use a supercookie to “respawn” its original identifier cookie, creating a “zombie cookie,” which has been the basis of several lawsuits.
In one of our recent FourthParty web measurement crawls we included a cookie clearing step to emulate a user’s privacy choice. We observed that after clearing the browser’s cookies an identifier cookie (named “MUID” for “machine unique identifier”) respawned on live.com, a Microsoft domain. We dug into Microsoft’s cross-domain cookie syncing code and discovered two independent supercookie mechanisms, one of which was respawning cookies. We contacted Microsoft with our observations, and we have collaborated to assist in rectifying the issues we uncovered. Here is what we know.
Thanks, once again, to Jovanni Hernandez and Akshay Jagadeesh for their indispensable research assistance.
Microsoft’s cookie syncing script would, in some cases, function as a cache cookie and respawn the MUID cookie.
One of the foundational concepts in web security is the cookie same-origin policy: cookies can only be read and modified by the domain that set them. If domains collaborate they can trivially circumvent the same-origin policy and share cookies with each other; this practice is called “cookie syncing.” Cookie syncing often raises privacy concerns. For example, in online advertising real-time bidding, cookie syncing allows a single advertising exchange to notify many advertising networks and data aggregators whenever a user visits a website. That said, there are some unequivocally legitimate use cases for cookie syncing, such as when a company has spread its business across multiple domains (e.g. amazon.com and amazonaws.com).
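To make the mechanics concrete, here is a minimal sketch of cookie syncing between two cooperating domains. It assumes the first domain embeds a pixel request to a partner and passes its identifier in a query parameter. The domain `partner.example`, the `uid` parameter, the identifier values, and the function names are all hypothetical; this is not any particular company's implementation.

```javascript
// Hypothetical sketch of cookie syncing; every name here is an assumption.

// Runs on the first domain: build the URL of a pixel hosted by the partner,
// carrying the first domain's identifier in the query string.
function buildSyncUrl(partnerOrigin, userId) {
  const url = new URL("/sync.gif", partnerOrigin);
  url.searchParams.set("uid", userId);
  return url.toString();
}

// Runs on the partner's server: read the identifier passed in the URL and
// store it next to the partner's own cookie value for the same browser.
function recordSync(syncTable, requestUrl, partnerCookieId) {
  const passedId = new URL(requestUrl).searchParams.get("uid");
  if (passedId !== null) {
    syncTable.set(partnerCookieId, passedId);
  }
  return syncTable;
}

const syncUrl = buildSyncUrl("http://partner.example", "ABC123");
const syncTable = recordSync(new Map(), syncUrl, "partner-cookie-42");
console.log(syncUrl);
console.log(syncTable.get("partner-cookie-42"));
```

Because the browser attaches the partner's own cookie to the pixel request, the partner sees both identifiers at once and can join them without ever reading the first domain's cookie directly.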
Microsoft uses cookie syncing to share identifiers across many of its web properties, including bing.com, microsoft.com, msn.com, live.com, and xbox.com. Microsoft also syncs its MUID cookie to atdmt.com, the domain for its Atlas third-party advertising network. We found that one of Microsoft’s cookie syncing scripts (wlHelper.js) included an instruction to set the MUID cookie, and the script would get cached indefinitely.[3] If the cached script ran and no MUID cookie was present, the script would set a cookie with its stored MUID. Here is a slightly simplified example snippet of the relevant code:[4]
var id_muid = "5CBC2F2396F14F4EBA255A695D313CD1";
var muidValue = null;
// the MUID cookie value is read into muidValue
…
if (muidValue == null && id_muid != null) {
  …
  // cookieDomain is set to "; domain=" + the current domain
  var cookieSettings = cookieDomain + "; expires=Fri, 01 Jan 2021 00:00:00 GMT; path=/;";
  document.cookie = "MUID=" + id_muid + cookieSettings;
}
We identified wlHelper.js scripts on several Microsoft domains:
http://analytics.atdmt.com/Scripts/wlHelper.js?i=MUID
http://analytics.live.com/Scripts/wlHelper.js?i=MUID
http://analytics.microsoft.com/Scripts/wlHelper.js?i=MUID
http://analytics.msn.com/Scripts/wlHelper.js?i=MUID
In our crawling data from the Alexa world top 10,000 sites we found that one or more wlHelper.js scripts were loaded when the browser visited:
http://www.microsoft.com/en-us/default.aspx
http://www.microsoftstore.com/store/msstore/DisplayHomePage
http://www.msn.com/
http://ca.msn.com/
http://es.msn.com/
A user would have her MUID respawned if she: 1) ever visited a site with a wlHelper.js embedded, 2) cleared her cookies but not her cache, and 3) visited a site with the same wlHelper.js embedded and no MUID. It is difficult to estimate the number of users affected by Microsoft’s respawning without knowing more about traffic to Microsoft’s web properties and the conditions under which it would set an MUID. We would note that Microsoft’s web properties are popular destinations with tens of millions of visitors per day.
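The three conditions can be walked through with a toy model of a browser that keeps separate cookie and cache stores. This is a simplified sketch, not Microsoft's code: the `makeBrowser` and `visit` helpers and the placeholder MUID strings are hypothetical, and the model assumes the server bakes the visitor's current MUID into the script body only when it actually serves the script.

```javascript
// Toy model: a browser with independent cookie and cache stores.
function makeBrowser() {
  return { cookies: new Map(), cache: new Map() };
}

// Simulate visiting a page that embeds the script at helperUrl.
// freshMuid is the value the (modeled) server would issue to a new visitor.
function visit(browser, helperUrl, freshMuid) {
  if (!browser.cache.has(helperUrl)) {
    // Cache miss: the server assigns a MUID if none exists and embeds the
    // current MUID in the script it returns, which the browser then caches.
    if (!browser.cookies.has("MUID")) {
      browser.cookies.set("MUID", freshMuid);
    }
    browser.cache.set(helperUrl, { idMuid: browser.cookies.get("MUID") });
  }
  // The script runs, whether freshly fetched or loaded from cache.
  const script = browser.cache.get(helperUrl);
  if (!browser.cookies.has("MUID") && script.idMuid != null) {
    browser.cookies.set("MUID", script.idMuid); // the respawn step
  }
  return browser.cookies.get("MUID");
}

const helperUrl = "http://analytics.msn.com/Scripts/wlHelper.js?i=MUID";
const browser = makeBrowser();

const firstVisit = visit(browser, helperUrl, "OLD-MUID");       // condition 1
browser.cookies.clear();                                        // condition 2: cookies only
const afterCookieClear = visit(browser, helperUrl, "NEW-MUID"); // condition 3

browser.cookies.clear();
browser.cache.clear();                                          // clearing both stores
const afterFullClear = visit(browser, helperUrl, "NEW-MUID");

console.log(firstVisit, afterCookieClear, afterFullClear);
```

In this model the old identifier survives a cookie-only clear and disappears only when the cache is cleared as well, which is why the second condition matters.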
Once a cookie respawned, we often saw it get sent to other Microsoft domains. In the FourthParty data above, for example, the old MUID was passed to atdmt.com.
http://c.atdmt.com/c.gif?…&MXFR=5CBC2F2396F14F4EBA255A695D313CD1
Microsoft therefore had, in at least this case, sufficient information to trivially associate the user’s interactions with msn.com, live.com, and atdmt.com from before and after cookie clearing.[6]
Microsoft’s cookie syncing script included an ETag cookie.
ETags are a simple cache control mechanism built into HTTP. A website can assign a version number to a resource; when the browser requests the resource again and the version hasn’t changed, the website can just tell the browser to use its cached copy. It had long been known that, instead of storing a version number, an ETag could be used to store a user identifier (an “ETag cookie”). Two weeks ago a research team at the University of California, Berkeley discovered the first instance of ETag cookies in use.
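As a rough illustration of how an ETag can double as an identifier, the sketch below models a server-side handler as a plain function. The handler name, the header object shape, and the identifier values are hypothetical; the only real protocol details are the `ETag` response header, the `If-None-Match` request header, and the 200 and 304 status codes.

```javascript
// Sketch of an "ETag cookie": the server places an identifier in the ETag
// where a content version would normally go. Names are hypothetical.
function handleScriptRequest(requestHeaders, newIdentifier) {
  const presented = requestHeaders["if-none-match"];
  if (presented !== undefined) {
    // The browser is revalidating its cached copy, and in doing so it
    // hands back whatever the server stored in the ETag earlier.
    return { status: 304, headers: { etag: presented }, identifier: presented };
  }
  // First request: tag the response with a per-user value.
  const etag = '"' + newIdentifier + '"';
  return { status: 200, headers: { etag: etag }, identifier: etag };
}

// First request: nothing cached, so the server issues an identifier.
const firstResponse = handleScriptRequest({}, "ID-1");

// Later request: the browser revalidates with If-None-Match, so the server
// sees the original identifier even if it was prepared to issue "ID-2".
const laterResponse = handleScriptRequest(
  { "if-none-match": firstResponse.headers.etag },
  "ID-2"
);

console.log(firstResponse.status, firstResponse.identifier);
console.log(laterResponse.status, laterResponse.identifier);
```

Clearing cookies does not touch this value, because the browser keeps it with the cached resource rather than in the cookie store.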
We found that, in addition to functioning as a cache cookie, Microsoft’s wlHelper.js script was associated with an ETag cookie containing the MUID.
sqlite> select name, value from cookies where host='.live.com' and name='MUID' limit 1;
MUID 5CBC2F2396F14F4EBA255A695D313CD1
sqlite> select http_response_headers.name, http_response_headers.value from http_responses, http_response_headers where http_responses.id = http_response_headers.http_response_id and http_responses.url='http://analytics.atdmt.com/Scripts/wlHelper.js?i=MUID' and http_response_headers.name='Etag' limit 1;
Etag "5CBC2F2396F14F4EBA255A695D313CD13698"
The practical effect was that if a user cleared her cookies but not her cache, subsequent requests for wlHelper.js would be accompanied by both the new MUID value (in a cookie) and the old MUID value (in the ETag). This pairing of old and new MUIDs gave Microsoft sufficient information to associate user interactions with its domains from before and after cookie clearing.[7]
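The pairing described above could be extracted from a single request with very little code. The sketch below is an illustration only, not Microsoft's backend: it assumes, based on the crawl output shown earlier, that the ETag value begins with the 32-character MUID, and the request object shape, function names, and the "new" MUID value are hypothetical placeholders.

```javascript
// Pull a 32-hex-character MUID from the start of an ETag header value.
// Assumption: the ETag is the MUID followed by a short suffix, as in the
// crawl data above. Returns null if the value does not fit that pattern.
function muidFromEtag(etagHeader) {
  const match = /^"?([0-9A-F]{32})/i.exec(etagHeader);
  return match ? match[1].toUpperCase() : null;
}

// Given one request's If-None-Match value and MUID cookie, return the
// old/new pair if they differ, or null if there is nothing to link.
function linkIdentifiers(request) {
  const oldMuid = muidFromEtag(request.ifNoneMatch || "");
  const newMuid = request.cookieMuid || null;
  if (oldMuid && newMuid && oldMuid !== newMuid) {
    return { oldMuid: oldMuid, newMuid: newMuid };
  }
  return null;
}

const linked = linkIdentifiers({
  ifNoneMatch: '"5CBC2F2396F14F4EBA255A695D313CD13698"', // ETag from the crawl
  cookieMuid: "0123456789ABCDEF0123456789ABCDEF"         // placeholder new MUID
});
console.log(linked);
```

A request whose cookie and ETag carry the same MUID yields `null`, since there is no before-and-after pair to join.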
In addition to supercookies, we spotted two other issues with Microsoft’s advertising practices.
Microsoft’s targeted advertising opt-out button was invisible in Chrome and Safari.
Microsoft operates its own advertising choice page. (Note that Microsoft only allows users to opt out of seeing behaviorally targeted advertising; it does not remove its identifier cookies after a user has opted out, nor does it make any promise to stop tracking.) We observed that the opt-out link on Microsoft’s advertising choice page was invisible for Chrome and Safari users.
Microsoft fixed their opt-out button after we called the issue to their attention.
Microsoft’s approach to segregating advertising data does not meaningfully protect user privacy.
In a 2007 report entitled “Privacy Protections in Microsoft’s Ad Serving System and the Process of ‘De-identification,’” Microsoft’s privacy team explains how the company segregates identified Windows Live user information from de-identified third-party advertising data.
One of Microsoft’s goals is to serve targeted ads in a manner that protects user privacy. To avoid using the LiveID cookie to serve per-user ads—because, as described earlier, it is directly associated with information that could personally identify the user—Microsoft has created an “Anonymous” ID, called the ANID, on which its ad serving capabilities are based.
When a user first registers with Windows Live or MSN, a LiveID and an ANID are created simultaneously. The ANID is derived by applying a one-way cryptographic hash function to the LiveID. A one-way cryptographic hash function ensures that there is no practical way of deriving the original value from the resulting hash value—that is, the process cannot be reversed to obtain the original number.
Microsoft makes several expansive claims about its advertising system’s privacy guarantees.
Because all personally and directly identifying information about a user is stored on servers in association with a LiveID rather than an ANID, there is no practical way to link data stored in association with an ANID back to any data on Microsoft servers that could personally and directly identify an individual user.
Furthermore, to associate any of the ANID-based data in the Microsoft ad system with an individual user, an internal or external attacker would not only need access to the ad serving system (to access the data), the Windows Live ID system (to access all LiveIDs ever issued) and the hashing algorithm but would also need a massive computing infrastructure to run the algorithm on each and every LiveID ever created to try to find the ANID in question.
And in 2008 testimony before the Senate Commerce Committee, attached to a 2009 comment to the Federal Trade Commission:
As a result of this “deidentification” process, search query data and data about Web surfing behavior used for ad targeting is associated with an anonymized identifier rather than an account identifier that could be used to personally and directly identify a consumer.
Microsoft’s attempt at “anonymous” advertising data does not achieve nearly so much. First, as Arvind Narayanan noted in a recent blog post, de-identified online tracking data is far from anonymous. Even using completely unassociated identifiers for Windows Live user information and advertising data would not provide much of a privacy guarantee.[8]
Second, Microsoft’s use of a cryptographic hash to generate its ANID cookie contributes little privacy protection. The privacy threat that Microsoft is attempting to mitigate is a comparison between a user’s ANID and LiveID. Cryptographic hashing does not make comparison of two known values a challenge: on the contrary, comparison is a core use case for cryptographic hashing. With knowledge of Microsoft’s hash function, an employee or outsider could trivially compare any LiveID to any ANID.
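The point about comparison can be shown in a few lines. The sketch below uses SHA-256 from Node's built-in `crypto` module purely as a stand-in, since the quoted report does not name the hash function; the LiveID strings and function names are likewise hypothetical.

```javascript
const crypto = require("crypto");

// Stand-in derivation: SHA-256 is an assumption, not Microsoft's
// actual (unnamed) hash function.
function deriveAnid(liveId) {
  return crypto.createHash("sha256").update(liveId, "utf8").digest("hex");
}

// Checking whether a known LiveID corresponds to a known ANID takes one
// hash computation and one string comparison. Nothing is "reversed".
function matchesAnid(liveId, anid) {
  return deriveAnid(liveId) === anid;
}

const exampleLiveId = "liveid-0001"; // hypothetical identifier
const exampleAnid = deriveAnid(exampleLiveId);

console.log(matchesAnid(exampleLiveId, exampleAnid)); // true
console.log(matchesAnid("liveid-0002", exampleAnid)); // false
```

The one-way property prevents recovering a LiveID from an ANID alone. It does nothing to stop someone who already holds a candidate LiveID from testing it.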
Closing Thoughts
The online advertising industry is currently locking horns in Washington to prove it can regulate itself. Several trade organizations and private firms have billed themselves as rigorous watchdogs. And yet, in our analysis of one of the most prolific online advertising networks, we found significant privacy shortcomings that even a cursory privacy audit would have uncovered. It is increasingly difficult to accept industry claims that recent negative discoveries reflect “just a few bad apples.” And it is more than a little troubling that a few research groups and occasional press coverage seem to be the only present checks on one of the most privacy-invasive industries in history.
Thanks to Ashkan Soltani for providing feedback on a draft.
[1] See Evercookie, Flash Cookies and Privacy II, and An Analysis of Private Browsing Modes in Modern Browsers.
[2] See How Unique Is Your Web Browser? and Any person… a pamphleteer.
[3] The wlHelper.js script was served with a two-day cache expiry. Subsequent requests after the cache expired received a 304 response to keep the cached version with another two-day expiry.
[4] Microsoft replaced its cookie syncing system, including wlHelper.js, after we alerted them to our findings. An example of the old script is available on Google Code.
[5] We also saw the microsoft.com MUID cookie respawn, but not through wlHelper.js. We are still working to discover the additional supercookie mechanism on microsoft.com.
[6] Web measurement provides limited insight into a website’s backend. In cases where a domain’s cookie respawned, it is quite likely that new tracking information was associated with old tracking information. We cannot say how Microsoft used its data in cases where a domain’s MUID didn’t respawn but it received an old MUID from another domain that did respawn. At minimum it seems unlikely Microsoft discarded this information from all logs.
[7] As above, we cannot say what Microsoft did with its ETag cookie data. It again would be unlikely Microsoft discarded this information from all logs.
[8] Google follows this approach: it serves its third-party advertising content from doubleclick.net, and uses a Doubleclick identifier, instead of serving from google.com and using a Google identifier.