Co-authored by Arvind Narayanan.
Debates over web tracking and Do Not Track tend to be framed as a clash between consumer privacy and business need. That’s not quite right. There is, in fact, a spectrum of possible tradeoffs between business interests and consumer privacy.
Our aim with the Tracking Not Required series is to show how those tradeoffs are not at all linear; it is possible to swap a little functionality for a lot of privacy. We only use technologies that are already deployed in browsers, and the solutions we propose are externally verifiable.1
We focus on issues at the center of Do Not Track negotiations in the World Wide Web Consortium. Advertising companies have pledged to stop forms of ad targeting once a user enables Do Not Track, but many maintain that tracking is essential for a litany of “operational uses.” The Tracking Not Required series demonstrates how business functionality can be implemented without exposing users to the risks of tracking.
This first post addresses frequency capping in online advertising, the most frequently cited “operational use” necessitating tracking.
Background
When an online advertiser places a bid, it often sets a “frequency cap” on how many times a user may see a particular ad or ad campaign. Many advertising companies implement frequency capping with a unique ID cookie; when a user loads an ad, the ad company looks up past ad views using the ID and imposes frequency caps. The ID cookie approach is understandable: frequency capping becomes a straightforward database lookup. But ID cookies come at a significant privacy cost: they enable effective tracking of a user’s browsing activities.
Algorithm
When the browser loads a page, for each ad slot:
- The advertising company sends a list of ads it is considering for display, including a frequency cap for each ad. The list is ordered by preference.
- A script iterates through the list. For each ad in the list, the script checks a local ad viewing history to determine whether the ad is frequency capped. It selects the first ad that is uncapped.
- The script sends the uncapped ad back to the ad company for display or entry into an ad exchange auction.
When an ad is displayed:
- The script records the impression in the local ad viewing history.
- The advertising company bills the ad as usual (per impression, per click, per action,2 or a hybrid).
Explanation
Our algorithm leverages several design features to improve the privacy properties of frequency capping.
- Client-side storage. The ad company does not store the user’s ad viewing history.
- Query-response. Our algorithm shares information only about ads that might be shown, not all ads the user has seen.
- Client-side logic. The ad company learns whether certain ads are capped or not; it does not learn their view counts.
- Server-side preprocessing. Our algorithm shares only the first uncapped ad, not the capping state of each ad in the list.
Our algorithm protects user privacy by limiting the number of states the browser can be in. When $latex m$ ads are under consideration, our algorithm only allows the browser to be in one of $latex m$ states–each of the ads might be selected as the first uncapped ad.3 The maximum information entropy contributed is $latex log_2 m$ bits.4, 5
The discussion has thus far centered on choosing a single ad. An ad company will often select multiple ads of multiple formats when populating a page. The associated maximum information entropy is $latex sum_{i=0}^{k}{log_2{m_i choose n_i}}$ for a page with $latex k$ ad formats, where $latex m_i$ is the number of ads considered in the $latex i$th format and $latex n_i$ is the number of ads selected in the $latex i$th format.
For a concrete example, consider a New York Times article page, which often features a banner graphical ad, a sidebar graphical ad, and two footer text ads. If there are three ad formats and five possible ads for each format, an ad network that populates all four slots would gain at most 7.97 bits of information entropy from frequency capping. In other words, the ad network cannot learn more about the average user than if it had set a one-character cookie!
Implementation
Several advertising companies have objected that privacy-preserving frequency capping is not feasible in implementation. We respectfully disagree.
Performance is a non-issue. Our algorithm imposes negligible requirements on browser storage6 and computation.7 As for network latency, the approach would at most require an additional roundtrip–and for the many ad companies that already load an ad over multiple roundtrips, it wouldn’t necessitate any extra roundtrip.
There would, of course, be some software engineering required for implementation. The necessary scripting is straightforward; we developed a prototype implementation of our algorithm in hours. (Source is available on GitHub.) Backend changes may be more demanding; ad companies would have to shift from selecting particular ads for display to generating preference-ordered lists of possible ads. While assuredly not trivial, we do not see how the engineering would be unusually complicated.
1. A website’s compliance with our proposals could be externally verified in a number of ways, such as with an automated auditing tool (e.g. FourthParty), through use of a trusted implementation, or by an independent auditing firm.
2. Privacy-preserving per-action and hybrid billing can be accomplished by querying the impression history in local storage when a billable action occurs.
3. We assume here that at least one ad is uncapped, and we assume in the later discussion that at least $latex n_i$ ads are uncapped. One way to guarantee those assumptions is to designate certain ads as fallbacks. If those assumptions do not hold, the maximum information entropy is $latex log_2 left(m+1right)$ bits for a single format, single ad selection and $latex sum_{i=0}^{k}{log_2{sum_{j=0}^{n_i}{m_i choose j}}}$ bits for multiple format, multiple ad selection.
4. This property follows from the Gibbs inequality. The information entropy is $latex log_2 m$ bits only if users are evenly distributed across the state space. Where the state space is dynamic and the distribution of users among states is skewed, both of which are very likely to occur with the approach we propose, the information entropy will be significantly less than $latex log_2 m$ bits.
5. We analyze our algorithm for marginal privacy impact, that is, the extent to which it makes user privacy worse off. IP address and User-Agent already contribute substantial information entropy and allow tracking many users.
6. For example, if an ad is represented by a 4 byte identifier and frequency count is stored in 1 byte, frequency caps for 100,000 ads consume merely 500 KB. HTML 5 local storage can hold at least 5 MB in the major browsers; Indexed Database and Web SQL can store even more.
7. In our testing, modern browsers can perform dozens of lookups in an HTML 5 local storage instance with over 100,000 keys in mere milliseconds.