Advancing Empirical Legal Scholarship

Modern quantitative analysis has upended the social sciences and, in recent years, made exciting inroads with law. How complex are the nation’s statutes?1 Did a shift in Supreme Court voting dodge President Roosevelt’s court-packing plan?2 How do courts apply fair use doctrine in copyright cases?3 What factors determine the outcome of intellectual property litigation?4 Researchers have begun to answer these and many more questions through the use of empirical methodologies.

Academics have vaulted numerous hurdles to advance this far, including deep institutional siloing and specialization. But barriers do still exist, and one of the greatest remaining is, quite simply, data. There is no easy-to-get, easy-to-process compilation of America’s primary legal materials. In the status quo, researchers are compelled to spend far too much of their time foraging for datasets instead of conducting valuable analysis. Consequences include diminished scholarly productivity, scant uniformity among published works, and—most frustratingly—deterrence for prospective researchers.

My hope is to facilitate empirical legal scholarship by providing machine-readable primary legal materials. In this first release of data, I have prepared XML versions of the U.S. Code and opinions of the Supreme Court of the United States, through approximately early 2012. Subsequent releases may include additional primary legal materials. I would greatly appreciate feedback from the academic community, particularly with regards to the XML schema, text formatting, and prioritizing materials for release.

Update January 13, 2014: The data is now hosted on Amazon S3 in a requester pays bucket. If you have not properly configured your request, you will receive an “Access Denied” error.

United States Code: ZIP (110 MB)
Supreme Court of the United States Opinions: ZIP (348 MB)

Please note, this is a personal project. It is not related to my coursework or research at Stanford University.

1. Michael J. Bommarito II & Daniel M. Katz, A Mathematical Approach to the Study of the United States Code, 389 Physica A 4195 (2010), available at
2. Daniel E. Ho & Kevin M. Quinn, Did a Switch in Time Save Nine?, 2 J. Legal Analysis 69 (2010), available at
3. Matthew Sag, Predicting Fair Use, 73 Ohio St. L.J. 47 (2012), available at
4. Mihai Surdeanu et al., Risk Analysis for Intellectual Property Litigation, Proc. 13th Int’l Conf. on Artificial Intelligence & L. 116 (2011), available at

Presidential Identifying Information

Sunday’s New York Times included a story about how the presidential campaigns are making extensive use of third-party web trackers. In response to privacy concerns, “[o]fficials with both campaigns emphasize[d] that [tracking] data collection is ‘anonymous.’”1

The campaigns are wrong: tracking data is very often identified or identifiable. Arvind Narayanan has previously written a comprehensive and accessible explanation of why web tracking is hardly anonymous; my survey paper on web tracking provides more extensive discussion.

One of the ways in which web tracking data can become identified or identifiable is “leakage”—data flowing to trackers from the websites that users interact with. Leakage most commonly occurs when a website includes identifying information in a page URL or title. Embedded third parties receive the identifying information if they receive the URL (e.g. referrer headers) or the title (e.g. document.title). Even a little identifying information leakage thoroughly undermines the privacy properties of web tracking: once a user’s identity leaks to a tracker, all of the tracker’s past, present, and future data about the user becomes identifiable.

Web services frequently fail to account for information leakage in their design and testing; a study I conducted last year found that over half of popular websites were leaking identifying information.2 More than a few website operators have made inaccurate representations about the information they share with third parties; in just the past year the Federal Trade Commission settled deception claims against both Facebook and Myspace for falsely disclaiming identifying information leakage.

The Times coverage piqued my curiosity: Are the campaigns identifying their supporters to third-party trackers? Are they directly undermining the anonymity properties that they are so quick to invoke?

Yes, they are. I tested the two leading candidate websites using the methodology from my prior study of identifying information leakage. Both leak. The following sections describe my observations from the Barack Obama and Mitt Romney campaign websites.

The Trouble with ID Cookies: Why Do Not Track Must Mean Do Not Collect

Original at the Stanford Center for Internet and Society.

Co-authored by Arvind Narayanan.

The debate over the meaning of Do Not Track has raged for well over a year now. The primary forum is the W3C Tracking Protection Working Group, with frequent sparring in the press and capitals worldwide. There are, broadly, two Do Not Track proposals: one chiefly backed by the ad industry, and another advanced by privacy advocates [1]. These proposals reflect vastly different visions for Do Not Track with vastly different practical consequences. The two sides have unsurprisingly been at loggerheads, with scant movement towards resolution of the key issues.

Tracking Not Required: Advertising Measurement

Co-authored by Arvind Narayanan.

Measurement is central to online advertising: it facilitates billing, performance measurement, targeting decisions, spending allocation, and more. In a pair of earlier posts we explained how advertisement frequency capping and behavioral targeting are achievable without compiling a user’s browsing history. This post similarly proposes practical, privacy-improved approaches to advertising measurement.


Do Track: Browser-Based Do Not Track Exceptions

Users hold widely varying preferences on web tracking.1 Some don’t mind the practice. Some object to it entirely. Many trust certain organizations to follow them around the web.

Do Not Track accomodates these divergent preferences in two ways. First, browsers and other user agents include an option for universally signaling a preference against tracking (“DNT: 1”). Firefox, Internet Explorer, and Safari have all integrated this feature, and Chrome will support it by the end of the year. Second, a user can configure exceptions to the universal signal. Some websites may choose to build a proprietary “out-of-band” exception mechanism, using ordinary web technologies, that trumps the “DNT: 1” signal. The Do Not Track Cookbook includes an example of how a Facebook out-of-band exception mechanism might appear.

The W3C Do Not Track standard will provide another option: a simple JavaScript interface that allows a website to request an exception, paired with a signal that some tracking is allowed (“DNT: 0”).


Tracking Not Required: Behavioral Targeting

Original at 33 Bits of Entropy.

Co-authored by Arvind Narayanan and Subodh Iyengar.

In the first installment of the Tracking Not Required series, we discussed a relatively straightforward case: frequency capping. Now let’s get to the 800-pound gorilla, behaviorally targeted advertising, putatively the main driver of online tracking. We will show how to swap a little functionality for a lot of privacy.

Admittedly, implementing behavioral targeting on the client is hard and will require some technical wizardry. It doesn’t come for “free” in that it requires a trade-off in terms of various privacy and deployability desiderata. Fortunately, this has been a fertile topic of research over the past several years, and there are papers describing solutions at a variety of points on the privacy-deployability spectrum. This post will survey these papers, and propose a simplification of the Adnostic approach — along with prototype code — that offers significant privacy and is straightforward to implement.


Tracking Not Required: Frequency Capping

Co-authored by Arvind Narayanan.

Debates over web tracking and Do Not Track tend to be framed as a clash between consumer privacy and business need. That’s not quite right. There is, in fact, a spectrum of possible tradeoffs between business interests and consumer privacy.

Our aim with the Tracking Not Required series is to show how those tradeoffs are not at all linear; it is possible to swap a little functionality for a lot of privacy. We only use technologies that are already deployed in browsers, and the solutions we propose are externally verifiable.1

We focus on issues at the center of Do Not Track negotiations in the World Wide Web Consortium. Advertising companies have pledged to stop forms of ad targeting once a user enables Do Not Track, but many maintain that tracking is essential for a litany of “operational uses.” The Tracking Not Required series demonstrates how business functionality can be implemented without exposing users to the risks of tracking.

This first post addresses frequency capping in online advertising, the most frequently cited “operational use” necessitating tracking.


Third-Party Web Tracking: Policy and Technology

John Mitchell and I have written a new paper that synthesizes research on policy and technology issues surrounding third-party web tracking. It will appear at the IEEE Symposium on Security and Privacy in May.


In the early days of the web, content was designed and hosted by a single person, group, or organization. No longer. Webpages are increasingly composed of content from myriad unrelated “third-party” websites in the business of advertising, analytics, social networking, and more. Third-party services have tremendous value: they support free content and facilitate web innovation. But third-party services come at a privacy cost: researchers, civil society organizations, and policymakers have increasingly called attention to how third parties can track a user’s browsing activities across websites.

This paper surveys the current policy debate surrounding third-party web tracking and explains the relevant technology. It also presents the FourthParty web measurement platform and studies we have conducted with it. Our aim is to inform researchers with essential background and tools for contributing to public understanding and policy debates about web tracking.

The FTC’s Chairman Groks Do Not Track

Last Thursday the White House hosted a major event on online privacy. Much of the public attention focused on a long-awaited White House report and a commitment by an online advertising self-regulatory group to implement components of the Do Not Track technology. Both the Electronic Frontier Foundation and the Center for Democracy and Technology have written detailed reviews of what transpired.

There has been scant focus on Federal Trade Commission Chairman Jon Leibowitz’s brief remarks on Do Not Track. That’s a mistake.