Google+ and Privacy: A Roundup
By all accounts, Google has done a great job with Plus, both on privacy and on the closely related goal of better capturing real-life social nuances. [1] This article will summarize the privacy discussions I’ve had in the first few days of using the service and the news I’ve come across.
The origin of Circles
“Circles,” as you’re probably aware, is the big privacy-enhancing feature. A presentation titled “The Real-Life Social Network” by user-experience designer Paul Adams almost exactly a year ago went viral in the tech community; it looks likely this was the genesis, or at least a crystallization, of the Circles concept.
But Adams defected to Facebook a few months later, which led to speculation that this spelled the end of whatever plans Google may have had for the concept. Little did the world know at the time that Plus was a company-wide, bet-the-farm initiative involving 30 product teams and hundreds of engineers, and that the departure of one designer made no difference.
Meanwhile, Facebook introduced a friend-lists feature but it was DOA. When you’re staring at a giant list of several hundred “friends” — Facebook doesn’t do a good job of discouraging indiscriminate friending — categorizing them all is intimidating to say the least. My guess is that Facebook was merely playing the privacy communication game.
Why are circles effective?
I did an informal poll to see if people are taking advantage of Circles to organize their friend groups. Admittedly, I was looking at a tech-savvy, privacy-conscious group of users, but the response was overwhelming, and it was enough to convince me that Circles will be a success. There’s a lot of excitement among the early user community as they collectively figure out the technology as well as the norms and best practices for Circles. For example, this tip on how to copy a circle has been shared over 400 times as I write this.
One obvious explanation is that Circles captures real-life boundaries, and this is what users have been waiting for all along. That’s no doubt true, but I think there’s more to it than that. Multiple people have pointed out how the exemplary user interface for creating circles encouraged them to explore the feature. It is gratifying to see that Google has finally learned the importance of interface and interaction design in getting social right.
There are several other UI features that contribute to the success of Circles. When friending someone, you’re forced to pick one or more circles, instead of being allowed to drop them into a generic bucket and categorize them later. But in spite of this, the UI is so good that I find it no harder than friending on Facebook.
In addition, you have to pick circles to share each post with (but again the interface makes it really easy). Finally, each post has a little snippet that shows who can see it, which has the effect of constantly reminding you to mind the information flow. In short, it is nearly impossible to ignore the Circles paradigm.
The resharing bug
Google+ tries to balance privacy with Twitter-like resharing, which is always going to be tricky. Amusing inconsistencies result if you share a post with a circle that doesn’t include the original poster. A more serious issue, pointed out by many people including an FT blogger, is that “limited” posts can be publicly reshared. To their credit, Google engineers acknowledged it and quickly disabled the feature.
Meanwhile, some have opined that this issue is “totally bogus” and that this is how life works and how email works, in that when you tell someone a secret, they could share it with others. I strongly disagree, for two reasons.
First, this is not how the real world (or even email) works. Someone can repeat a secret you told them in real life, or forward an email, but they typically won’t broadcast it to the whole world. We’re talking about making something public here, something that will be forever associated with your real name and could very well come up in a web search.
Second, user-interface hints are an important and well-established way of nudging privacy-impacting behaviors. If there’s a ‘share’ button with a ‘public’ setting, many users will assume that it is OK to do just that. Twitter used to allow public retweets of protected tweets, and a study found that this had been done millions of times. In response, Twitter removed this ability. The Privicons project seeks to embed similar hints in emails.
In other words, the privacy skeptics are missing the point: the goal of the feature is not to technologically prevent leakage of protected information, but to better communicate to users what’s OK to share and what isn’t. And in this case, the simplest way to do that is to remove the 1-click ability to share protected content publicly, and instead let users copy-paste if they really want to do that. It would also make sense to remind users to be careful when they’re resharing a limited post with their circles, which, I’m happy to see, is exactly what Google is doing.

The tip you now see when you share a limited post (with another limited group). This is my favorite Google+ feature.
A window into your circles
Paul Ohm points out that if someone shares content with a set of circles that includes you, you get to see 21 users who are part of those circles, apparently picked at random. [2] This means that if you look at these lists of 21 over time you can figure out a lot about someone’s circles, and possibly decipher them completely. Note that by default your profile shows a list of users in your circles, but not who’s in which circle, which for most people is significantly more sensitive.
In my view, this is an interesting finding, but not anything Google needs to fix; the feature is very useful (and arguably privacy-enhancing) and the information leakage is an inevitable tradeoff. But it’s definitely something that users would do well to be aware of: the secrecy of your circles is far from bulletproof.
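The observation can be made concrete with a toy simulation (hypothetical code; Google's actual sampling logic is unknown and may not be uniformly random): each post leaks a random sample of up to 21 circle members, and an observer who simply unions the samples across posts recovers the membership quickly.

```python
import random

def visibility_snippet(members, k=21, rng=random):
    """One post's 'who can see this' snippet: up to k members, sampled at random."""
    return set(rng.sample(sorted(members), min(k, len(members))))

def reconstruct(members, num_posts, seed=0):
    """An observer unions the snippets seen across many posts."""
    rng = random.Random(seed)
    seen = set()
    for _ in range(num_posts):
        seen |= visibility_snippet(members, rng=rng)
    return seen

hidden_circle = {"user%02d" % i for i in range(100)}
print(len(reconstruct(hidden_circle, 1)))    # 21: a single post reveals 21 members
print(len(reconstruct(hidden_circle, 50)))   # almost certainly all 100 by now
```

With a 100-member circle, the chance that any given member evades 50 independent samples is (79/100)^50, i.e. vanishingly small, which is why the secrecy erodes so fast.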
Speaking of which, the network visibility of different users on their profile page confused me terribly, until I realized Google+ is A/B testing that privacy setting! These are the two possibilities you could see when you edit your profile and click the circles area in the left sidebar: A, B. This is very interesting and unusual. At any rate, very few users seem to have changed the defaults so far, based on a random sample of a few dozen profiles.
Identity and distributed social networking
Some people are peeved that Google+ discourages you from participating pseudonymously. I don’t think a social network that wants to target the mainstream and wants to capture real-world relationships has any real choice about this. In fact, I want it to go further. Right now, Google+ often suggests I add someone I’ve already added, which turns out to be because I’ve corresponded with multiple email addresses belonging to that person. Such user confusion could be minimized if the system did some graph-mining to automatically figure out which identities belong to the same person. [3]
A related question is what this will mean for distributed social networking, which was hailed a year ago as the savior of privacy and user control. My guess is that Google+ will take the wind out of it — Google Takeout gives you a significant degree of control over your data. Further, due to the Apple-Twitter integration and the success of Android, the threat of Facebook monopolizing identities has been obliterated; there are at least three strong players now.
Another reason why Google+ competes with distributed social networks: for people worried about the social networking service provider (or the Government) reading their posts, client-side encryption on top of Google+ could work. The Circles feature is exactly what is needed to make encrypted posts viable, because you can make a circle of those who are using a compatible encryption/decryption plugin. At least a half-dozen such plugins have been created over the years (examples: 1, 2), but it doesn’t make much sense to use these over Facebook or Twitter. Once the Google+ developer API rolls out, I’m sure we’ll see yet another avatar of the encrypted status message idea, and perhaps the n-th time will be the charm.
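A sketch of how such a plugin might work on top of Circles, using a toy stream cipher purely for illustration (a real plugin would use an audited primitive such as AES-GCM, and the out-of-band distribution of the circle key is hand-waved here):

```python
import hashlib
import secrets

def keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy SHA-256 counter-mode keystream. Illustration only; not audited crypto."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]

def encrypt_for_circle(circle_key: bytes, post: str):
    """What the plugin would substitute for the plaintext before posting."""
    nonce = secrets.token_bytes(16)
    data = post.encode()
    ciphertext = bytes(a ^ b for a, b in zip(data, keystream(circle_key, nonce, len(data))))
    return nonce, ciphertext

def decrypt(circle_key: bytes, nonce: bytes, ciphertext: bytes) -> str:
    """What a circle member's plugin would do on reading the post."""
    plaintext = bytes(a ^ b for a, b in zip(ciphertext, keystream(circle_key, nonce, len(ciphertext))))
    return plaintext.decode()

# The key is shared out-of-band with everyone in, say, a 'plugin users' circle.
circle_key = secrets.token_bytes(32)
nonce, ct = encrypt_for_circle(circle_key, "a limited post")
print(decrypt(circle_key, nonce, ct))
```

The point is that the service stores only `nonce` and `ct`; the circle metaphor tells the plugin exactly who should hold the key.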
Concluding thoughts
Two years ago, I wrote that there’s a market case for a privacy-respecting social network to fill Livejournal’s shoes. Google+ seems poised to fulfill most of what I anticipated in that essay; the asymmetric nature of relationships and the ability to present different facets of one’s life to different people are two important characteristics that the two social networks have in common. [4]
Many have speculated on whether, and to what extent, Google+ is a threat to Facebook. One recurring comparison is Facebook as “ghetto” compared to Plus, such as in this image making the rounds on Reddit, reminiscent of Facebook vs. Myspace a few years ago. This perception of “coolness” and “class” is the single biggest thing Google+ has got going for it, more than any technological feature.
It’s funny how people see different things in Google+. While I’m planning to use Google+ as a Livejournal replacement for protected posts, since that’s what fits my needs, the majority of the commentary has compared it to Facebook. A few think it could replace Twitter, generalizing from their own corner of the Google+ network where people haven’t been using the privacy options. Forbes, being a business publication, thinks LinkedIn is the target. I’ve seen a couple of commenters saying they might use it instead of Yammer, another business tool. According to yet other articles, Flickr, Skype and various other Internet companies should be shaking in their boots. Have you heard the parable of the blind men and the elephant?
In short, Google+ is whatever you want it to be, and probably a better version of it. It’s remarkable that they’ve pulled this off without making it a confusing, bloated mess. Myspace founder Tom Anderson seems to have the most sensible view so far: Google+ is simply a better … Google, in that the company now has a smoother, more integrated set of services. You’d think people would have figured it out from the name!
[1] I will use the term “privacy” in this article to encompass both senses.
[2] It’s actually 22 users, including yourself and the poster. It’s not clear just how random the list is; in my perusal, mutual friends seem to be preferentially picked.
[3] I am not suggesting that Google+ should prevent users from having multiple accounts, although Circles makes it much less useful/necessary to have multiple accounts.
[4] On the other hand, when it comes to third party data collection, I do not believe that the market can fix itself.
I’m grateful to Joe Hall, Jonathan Mayer, and many, many others with whom I had interesting discussions, mostly via Google+ itself, on the topics that led to this post.
To stay on top of future posts, subscribe to the RSS feed or follow me on Twitter or Google+.
Data-mining Contests and the Deanonymization Dilemma: a Two-stage Process Could Be the Way Out
Anonymization, once the silver bullet of privacy protection in consumer databases, has been shown to be fundamentally inadequate by the work of many computer scientists including myself. One of the best defenses is to control the distribution of the data: strong acceptable-use agreements including prohibition of deanonymization and limits on data retention.
These measures work well when outsourcing data to another company or a small set of entities. But what about scientific research and data-mining contests involving personal data? Contest prizes are big and only getting bigger, and contests by their very nature involve wide data dissemination. Are legal restrictions meaningful or enforceable in this context?
I believe that having participants sign and fax a data-use agreement is much better from the privacy perspective than letting them download the data with a couple of clicks. However, I am sympathetic to the argument I hear from contest organizers that every extra step will result in a big drop-off in the participation rate. Basic human psychology suggests that instant gratification is crucial.
That is a dilemma. But the more I think about it, the more I’m starting to feel that a two-step process could be a way to get the best of both worlds. Here’s how it would work.
For the first stage, the current minimally intrusive process is retained, but the contestants don’t get to download the full data. Instead, there are two possibilities.
- Release data on only a subset of users, minimizing the quantitative risk. [1]
- Release a synthetic dataset created to mimic the characteristics of the real data. [2]
For the second stage, there are various possibilities, not mutually exclusive:
- Require contestants to sign a data-use agreement.
- Restrict the contest to a shortlist of best performers from the first stage.
- Switch to an “online computation model” where participants upload code to the server (or make database queries over the network) and obtain results, rather than download data.
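The third option can be sketched as a minimal server interface (hypothetical code; no real contest's API is implied): contestants submit scoring code, the server runs it against the private data, returns only the aggregate score, and logs every submission for later auditing.

```python
import time

class ContestServer:
    """Minimal 'online computation model': contestants upload code
    rather than download data. Hypothetical interface for illustration."""

    def __init__(self, private_data):
        self._data = private_data        # list of (features, label) rows; never leaves the server
        self.audit_log = []              # every submission is logged and can be audited

    def evaluate(self, contestant_id, predict):
        self.audit_log.append((time.time(), contestant_id, getattr(predict, "__name__", "?")))
        correct = sum(predict(x) == y for x, y in self._data)
        return correct / len(self._data)  # the contestant sees only the score

server = ContestServer([(0, 0), (1, 1), (2, 0), (3, 1)])
score = server.evaluate("team42", lambda x: x % 2)   # submitted model: label = parity
print(score)                   # 1.0 on this toy data
print(len(server.audit_log))   # 1: the submission was logged
```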
Overstock.com recently announced a contest that conformed to this structure—a synthetic data release followed by a semi-final and a final round in which selected contestants upload code to be evaluated against real data. The reason for this structure appears to be partly privacy and partly the fact that they are trying to improve the performance of their live system, and performance needs to be judged in terms of impact on real users.
In the long run, I really hope that an online model will take root. The privacy benefits are significant: high-tech machinery like differential privacy works better in this setting. Even if such techniques are not employed, and even though contestants could in theory extract all the data by issuing malicious queries, the fact that queries are logged and might be audited should serve as a strong deterrent against such mischief.
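For instance, a server answering counting queries could add Laplace noise calibrated to the query's sensitivity, the textbook differential-privacy mechanism. A minimal sketch, with illustrative parameter choices:

```python
import math
import random

def laplace_noise(scale, rng):
    # Inverse-CDF sampling from the Laplace(0, scale) distribution
    u = rng.random() - 0.5
    return -scale * math.copysign(1, u) * math.log(1 - 2 * abs(u))

def private_count(records, predicate, epsilon, rng):
    """Differentially private count: a counting query has sensitivity 1,
    so Laplace noise with scale 1/epsilon suffices for epsilon-DP."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(0)
records = list(range(1000))
noisy = private_count(records, lambda r: r < 300, epsilon=1.0, rng=rng)
print(round(noisy))   # close to the true count of 300
```

Each such query consumes some of a privacy budget; the server can refuse further queries once the budget is spent, which is exactly the kind of control the download model can never offer.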
The advantages of the online model go beyond privacy. For example, I served on the Heritage Health Prize advisory board, and we discussed mandating a limit on the amount of computation that contestants were allowed. The motivation was to rule out algorithms that needed so much hardware firepower that they couldn’t be deployed in practice, but the stipulation had to be rejected as unenforceable. In an online model, enforcement would not be a problem. Another potential benefit is the possibility of collaboration between contestants at the code level, almost like an open-source project.
[1] Obtaining informed consent from the subset whose data is made publicly available would essentially eliminate the privacy risk, but the caveat is the possibility of selection bias.
[2] Creating a synthetic dataset from a real one without leaking individual data points and at the same time retaining the essential characteristics of the data is a serious technical challenge, and whether or not it is feasible will depend on the nature of the specific dataset.
In Silicon Valley, Great Power but No Responsibility
I saw a tweet today that gave me a lot to think about.
A rather intricate example of social adaptation to technology. If I understand correctly, the cousins in question are taking advantage of the fact that liking someone’s status/post on Facebook generates a notification for the poster that remains even if the post is immediately unliked. [1]
What’s humbling is that such minor features have the power to affect so many, and so profoundly. What’s scary is that the feature is so fickle. If Facebook starts making updates available through a real-time API, like Google Buzz does, then the ‘like’ will stick around forever on some external site and users will be none the wiser until something goes wrong. Similar things have happened: a woman was fired because sensitive information she put on Twitter and then deleted was cached by an external site. I’ve written about the privacy dangers of making public data “more public”, including the problems of real-time APIs. [2]
As complex and fascinating as the technical issues are, the moral challenges interest me more. We’re at a unique time in history in terms of technologists having so much direct power. There’s just something about the picture of an engineer in Silicon Valley pushing a feature live at the end of a week, and then heading out for some beer, while people halfway around the world wake up and start using the feature and trusting their lives to it. It gives you pause.
This isn’t just about privacy or just about people in oppressed countries. RescueTime estimates that 5.3 million hours were spent worldwide on Google’s Les Paul doodle feature. Was that a net social good? Who is making the call? Google has an insanely rigorous A/B testing process to optimize between 41 shades of blue, but do they have any kind of process in place to decide whether to release a feature that 5.3 million hours—eight lifetimes—are spent on?
For the first time in history, the impact of technology is being felt worldwide and at Internet speed. The magic of automation and ‘scale’ dramatically magnifies effort and thus bestows great power upon developers, but it also comes with the burden of social responsibility. Technologists have always been able to rely on someone else to make the moral decisions. But not anymore—there is no ‘chain of command,’ and the law is far too slow to have anything to say most of the time. Inevitably, engineers have to learn to incorporate social costs and benefits into the decision-making process.
Many people have been raising awareness of this—danah boyd often talks about how tech products make a mess of many things: privacy for one, but social nuances in general. And recently at TEDxSiliconValley, Damon Horowitz argued that technologists need a moral code.
But here’s the thing—and this is probably going to infuriate some of you—I fear that these appeals are falling on deaf ears. Hackers build things because it’s fun; we see ourselves as twiddling bits on our computers, and generally don’t even contemplate, let alone internalize, the far-away consequences of our actions. Privacy is viewed in oversimplified access-control terms and there isn’t even a vocabulary for a lot of the nuances that users expect.
The ignorant are at least teachable, but I often hear a willful disdain for moral issues. Anything that’s technically feasible is seen as fair game and those who raise objections are seen as incompetent outsiders trying to rain on the parade of techno-utopia. The pronouncements of executives like Schmidt and Zuckerberg, not to mention the writings of people like Arrington and Scoble who in many ways define the Valley culture, reflect a tone-deaf thinking and a we-make-the-rules-get-over-it attitude.
Something’s gotta give.
[1] It’s possible that the poster is talking about Twitter, and by ‘like’ they mean ‘favorite’. This makes no difference to the rest of my arguments; if anything it’s stronger because Twitter already has a Firehose.
[2] Potential bugs are another reason that this feature is fickle. As techies might recognize, ensuring that a like doesn’t show up after an item is unliked maps to the problem of update propagation in a distributed database, for which the CAP theorem tells us there are fundamental tradeoffs between consistency and availability. Indeed, Facebook often has glitches of exactly this sort—you might notice it when a comment notification shows up but the comment doesn’t, or vice versa, or different people see different like counts, etc.
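As a toy illustration of the propagation problem (hypothetical code, nothing like Facebook's actual infrastructure): with asynchronous replication, an 'unlike' that has not yet reached a replica, or an external cache fed by a real-time API, leaves the like visible there.

```python
class Replica:
    def __init__(self):
        self.likes = set()

primary, cache = Replica(), Replica()
pending = []   # asynchronous replication queue from primary to cache

def like(user, post):
    primary.likes.add((user, post))
    pending.append(("add", (user, post)))

def unlike(user, post):
    primary.likes.discard((user, post))
    pending.append(("discard", (user, post)))

def replicate(n):
    """Apply only the first n queued updates to the cache."""
    for op, item in pending[:n]:
        getattr(cache.likes, op)(item)

like("alice", "post1")
unlike("alice", "post1")     # immediately unliked...
replicate(1)                 # ...but only the 'like' has propagated so far
print(("alice", "post1") in cache.likes)    # True: the stale copy still shows it
print(("alice", "post1") in primary.likes)  # False
```

Once the stale copy lives on an external site, there is no replication queue left to fix it, which is the danger of a real-time API discussed above.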
[ETA] I see this essay as somewhat complementary to my last one on how information technology enables us to be more private contrasted with the ways in which it also enables us to publicize our lives. There I talked about the role of consumers of technology in determining its direction; this article is about the role of the creators.
[Edit 2] Changed the British spelling ‘wilful’ to American.
Thanks to Jonathan Mayer for comments on a draft.
The Many Ways in Which the Internet Has Given Us More Privacy
There are many, many things that digital technology allows us to do more privately today than we ever could. Consider:
The ability of marginalized or oppressed individuals to leverage the privacy of online communication tools to unite in support of a cause, or simply to find each other, has been earth-shattering.
- It has played a key role in the ongoing Middle East uprisings. The Internet helps primarily by enabling rapid communication and coordination, but being able to do it covertly—clumsy governmental hacking attempts notwithstanding—is an equally important aspect.
- Clay Shirky tells the story of how some of meetup.com’s most popular groups were (ir)religious communities that don’t find support in broader U.S. culture — Pagans, ex-Jehovah’s Witnesses, atheists, etc.
- STD-positive individuals can use online dating sites targeted at their group. Can you imagine the Sisyphean frustration of trying to date offline and find a compatible partner if you have an STD?
In the political realm, the anonymity afforded by Wikileaks is leading to a challenge to the legitimacy of high-level government actors, if not entire governments. Bitcoin is another anonymity technology that shows the potential to have serious political effects. [1]
Most of us benefit at an everyday level from improved privacy. When we read, search, or buy online, people around us don’t find out about it. This is vastly more private than checking out a book from a library or buying something at a store. [2]
We’ve benefited not only in our mundane activities, but our kinky ones as well. We take and exchange naked pictures all the time, something we could never do back when it meant getting film developed at the store. And slightly over half of us have taken advantage of the fact that “hiding one’s porn” is trivial today compared to the bad old days of magazines.
I could go on—I haven’t even mentioned the uses of Tor or encryption, freely available to anyone willing to invest a little effort—but I’ve made my point. Of course, I’ve only presented one half of the story. The other half, that technology is also allowing us to expose ourselves in ways never before, has been told so many times by so many people, and so loudly, that it is drowning out meaningful conversation about privacy.
Having presented the above evidence, I posit that technology by itself is actually largely neutral with respect to privacy, in that it enhances the privacy of some types of actions and encumbers that of others. Which direction society takes is up to us. In other words, I’m asserting the negation of technological determinism, applied to privacy.
While I do believe that privacy-infringing technologies have been adopted more pervasively than privacy-enhancing ones, I would say that the disparity is far smaller than it is generally thought to be. Why the mismatch in perception? A curious collective cognitive bias. Observe that almost every one of the examples above is generally seen as a new kind of activity enabled by technology whereas they are really examples of technology allowing us to do a familiar activity, but with more privacy (among other benefits).
Another reason for the cognitive bias is our tendency to focus on the dangers and the negatives of technology. Let’s go back to the nude pictures example: just about everyone does it, but only a small number—perhaps 1%?—suffer some harm from it. As Schneier says, if it’s in the news, don’t worry about it.
To the extent that privacy-infringing technologies have been more successful, it’s a choice we’ve collectively made. Demand for social networking has been so strong that the sector has somehow invented a halfway workable business model, even though it took several tries to get there. But demand for encryption has been so weak that the market never matured enough to make it usable to the general public.
The disparity could be because we don’t know what’s good for us—volumes have been written about this—but it could also be partly because there are costs and benefits to giving up our privacy, and the benefits, in proportion to the costs, are rather higher than they are generally made out to be.
Those are all questions worth pondering, but I hope I have convinced you of this: the idea that information technology inherently invades privacy is oversimplified and misleading. If we’re giving up privacy, we have only ourselves to blame.
[1] Many privacy-enhancing technologies are morally ambiguous. I’m merely listing the ways in which people benefit from privacy, regardless of whether they’re using it for good or evil.
[2] It is probably true that the Internet has made it easier for government, advertisers etc. to track your activities. But it doesn’t change the fact that there’s a privacy benefit to regular people in an everyday context, who are far more concerned about keeping secrets from their family, friends and neighbors than about abstract threats.
[ETA] This essay examines the role of consumers in shaping the direction of technology, whereas the next one looks at the role of creators.
Thanks to Ann Kilzer for comments on a draft.
Bad Internet Law: What Techies Can Do About It
The bad news is that fighting specific laws as they come up is an uphill battle. What has changed in the last ten years is that the Internet has thoroughly permeated society, and therefore the interest groups pushing these laws are much more determined to get their way. The good news is that lawmakers are reasonably receptive to arguments from both sides. So far, however, they are not hearing nearly enough of our side of the story. It’s time for techies to step up and get more actively involved in policy if we hope to preserve what we’ve come to see as our way of life. Here’s how you can make a difference.
1. Stick to your strengths—explain technology. The primary reason Washington is prone to making bad tech law is that policymakers don’t understand tech, and don’t understand how bits are different from atoms. Not only is educating policymakers on tech more effective, but as a technologist you’ll also have more credibility if you stick to doing that rather than opining on specific policy measures.
2. Don’t go it alone. Giving equal weight to every citizen’s input on individual issues may or may not be a good idea in theory, but it certainly doesn’t work that way in practice. Money, expertise, connections and familiarity with the system all count. You’ll find it much easier to be heard and to make a difference if you combine your efforts with an existing tech policy group. You’ll also learn the ropes much more quickly by networking. Organizations like the EFF are always looking for help from outside technologists.
3. Charity begins at home—talk to your policy people. If you work at a large tech company, you’re already in a great position: your company has a policy group, a.k.a. lobbyists. Help them with their understanding of tech and business constraints, and have them explain the policy issues they’re involved in. Engineers often view the in-house policy and legal groups as a bunch of lawyers trying to impose arbitrary rules. This attitude hurts in the long run.
4. Learn to navigate the Three Letter Agencies. “The Government” is not a monolithic entity. To a first approximation there are the two Houses, a variety of Agencies, Departments and Commissions, the state legislatures and the state Attorneys General. They differ in their responsibilities, agendas, means of citizen participation and the receptiveness to input on technology. It can be bewildering at first but don’t worry too much about it; you can pick it up as you go along. Weird but true: most Internet issues in the House are handled by the “Energy and Commerce” subcommittee!
While I have focused on bad Internet laws, since that is where the tech/politics disconnect is most obvious, there are certainly many laws and regulations that have a largely positive, or at least mixed, reception in technology circles. Net neutrality is a prominent example; I am myself involved in the Do Not Track project. These are good opportunities to get involved as well, since there is always a shortage of technical folks. I would suggest picking one or two issues, even though it might be tempting to speak out about everything you have an opinion on.
To those of you who are about to post something like, “What’s the point? Congresspeople are all bought and paid for and aren’t going to listen to us anyway,” I have two things to say:
- Tech policy is certainly hard because of the huge chasm, but cynicism is unwarranted. Lawmakers are willing to listen and you will have an impact if you stick with it.
- If you’re not interested, that’s your prerogative. But please refrain from discouraging others who’re fighting for your rights. Defeatism and apathy are part of the problem.
Finally, here are some tech policy blogs and resources if you feel like “lurking” before you’re ready to jump in.
The Surprising Effectiveness of Prizes as Catalysts of Innovation
Although the strategic use of prizes to foster sci-tech innovation has a long history, it has exploded in the last two decades—35% annual growth on average, or doubling every 2.3 years.[1] Much has been said on the topic, but I have yet to see a clear answer to the core mystery:
Why do prizes work?
Specifically, why are they more effective than simply hiring people to do it? The question is more complex than it sounds, and a valid explanation must address the following:
- Why shouldn’t government and industry research funding be switched over entirely to a prize-based model?
- Why did the prize revolution happen in the last two decades, and not earlier?
- How do prizes succeed in spite of the massive duplication of effort that you’d expect due to numerous contestants trying to solve the same problem?
Prizes exploit the productivity-reward imbalance
In many fields there is a huge disparity—an order of magnitude or more—between the productivity of the top performers and that of the median performers. The structure of the corporation, having co-evolved with the industrial revolution to harness workers building railroads or weaving textiles, is fundamentally limited in its ability to reward employees in creative endeavors in proportion to their contribution, or even to measure it. Academia is a little better due to the precedence of fame over monetary reward, but has its own problems.
Enter prizes. The winner-take-all structure gives individuals or small organizations of exceptional caliber a chance to earn prestige as well as cash that they don’t otherwise have a shot at.
Prizes channel existing research funding
The Netflix prize attracted 34,000 contestants. At an average of just 1 hour (valued at $100) per contestant, the monetary value of the time spent on the contest dwarfs the prize amount. And the majority of contestants—or at least the ones with a serious chance—were already employed as researchers. This effect is broadly true: for example, contestants spent a total of over $100 million in pursuit of the Ansari X Prize which carries a $10 million award.
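The back-of-the-envelope arithmetic, using the deliberately conservative one-hour figure above (the Netflix grand prize purse was $1 million):

```python
contestants = 34_000
hours_each = 1              # a deliberate lower bound
dollars_per_hour = 100
time_value = contestants * hours_each * dollars_per_hour

prize_purse = 1_000_000     # the Netflix grand prize
print(time_value)                 # 3400000
print(time_value / prize_purse)   # 3.4x the purse, even at one hour per contestant
```

The serious teams spent thousands of hours each, so the true ratio is orders of magnitude higher.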
This is in no way meant to be a criticism of prizes—sure, prizes direct attention away from other problems, but one expects that on average, problems for which prizes are offered are more important than others.
Nor does the ability of prizes to spur effort far in excess of the monetary award necessarily mean that contestant behavior is irrational, since the prestige and media attention are typically worth far more than the cash, and because failure to win the prize doesn’t mean the effort is wasted.
That said, the well-known human tendency to systematically overestimate one’s own abilities certainly has a role in explaining the power of prizes to attract talent. According to the same McKinsey report linked above, “many of the participants that we interviewed were absolutely convinced they were going to win [the Ansari X Prize], if not this year, then surely the next.”
What about democratization?
The openness of prizes is often advanced as a key reason for their superiority over traditional research funding. There are two very different components to this assertion: the first is that prizes encourage hybridization of expertise from different fields, given that researchers often fall into the trap of collaborating only within their own communities. There is evidence for this from a study of Innocentive.
The second argument is that prizes allow even non-expert members of the general public, who might otherwise never be involved in research, to participate. I find this argument unconvincing and there is little evidence to support it, if you ignore anecdotes from the 19th century when science funding was meager by today’s standards. However, crowdsourcing to the public seems a good strategy for prizes that are more about problem solving than original research. Challenge.gov may be a good example, depending on how it pans out.
The Internet as an enabler
Now let’s look at the three auxiliary questions I posed above. My explanation for prize effectiveness—self-selection, redirection of funding, and interdisciplinary collaboration—can answer them comfortably. If all research funding were based on prizes, it would defeat the purpose, since prizes serve only to redirect existing research funding.
The rapid growth of the sector since 1990 is an obvious indication that the Internet had something to do with it. But how exactly? I think there are several reasons. First, the Internet could be making it easier for experts from different physical locations and/or areas of expertise to team up and to collaborate.
Second, increased reach, shorter cycles and improved economies of scale in most markets in the Internet era have exacerbated the performance-reward imbalance, as well as making the imbalance more obvious to all involved. This is a factor fueling the startup revolution as well.
Finally, and perhaps crucially, I believe the Internet has largely nullified one of the key disadvantages of prizes, which is duplication of effort. The Netflix prize, for one, was marked by a remarkable degree of sharing, and sponsors of new contests are increasingly tweaking the process to ensure that teams build on each other’s ideas.
These factors are only going to accelerate in the future, which suggests that the torrid growth of prizes in number and amount is going to continue for some time to come. There are now many companies dedicated to running these contests—Innocentive is the leader, and Kaggle is a startup focused on the data-mining space. Exciting times.
[1] My numbers are based on this McKinsey report which seems by far the most comprehensive study of prizes and is well worth reading for anyone interested in the subject. The aggregate purse of prizes over $100,000 grew from $50MM to $302MM from 1991 to 2008, during which period the share of “inducement prizes,” the kind we’re concerned with here, showed remarkable growth from 3% of the total to 78%.
Thanks to Steve Lohr for pointers to research when he interviewed me for his NYTimes Bits piece, and to @dan_munz and other Twitter followers for useful suggestions.
To stay on top of future posts, subscribe to the RSS feed or follow me on Twitter.
Price Discrimination is All Around You
This is the first in a series of articles that will show how we’re at a turning point in the history of price discrimination and discuss the consequences. This article presents numerous examples of traditional price discrimination that you see today, many of which are funny, sad, or downright devious.
Price discrimination, more euphemistically known as differential pricing and dynamic pricing, exploits the fact that in any transaction each customer has a different “willingness to pay.”
What is “willingness to pay,” and how does the seller determine it? To illustrate, let me quote a hilarious story by Steve Blank on selling enterprise software. The protagonist is one Sandy Kurtzig.
Since it was the first non-IBM enterprise software on IBM mainframes, [when] she got her first potential order, she didn’t know how to price it. It must have been back in the mid-’70s. She’s [with] this buyer, has a P.O. on his desk, negotiating pricing with Sandy.
…
So, Sandy said she goes into the buyer who says, “How much is it?”
And Sandy gulped and picked the biggest number she thought anybody would ever rationally pay. And said, “$75,000.” And she said all the buyer did was write down $75,000.
And she realized, shit, she left money on the table. … And she said, “Per year.”
And the buyer wrote down, “Per year.”
And she went, oh, crap what else? She said, “There’s maintenance.”
He said, “How much?”
“25 percent per year.”
And he said, “That’s too much.”
She said, “15 percent.”
And he said, “OK.”
Sadly, not all transactions are as much fun as pricing enterprise software ;-) The price usually has to be determined without meeting the buyer face to face. There are three types of price discrimination based on how the price is determined:
- Each buyer is charged a custom price. (Traditionally, there has never been enough data to do this.)
- Price depends on an attribute of the buyer such as age or gender.
- Different price for different categories of buyers, with the seller somehow getting the buyer to reveal which category they fall into. As we’ll see, hilarity frequently ensues.
Additionally, each buyer may be sold the same product, or it could be customized to each segment—in the extreme case, to each buyer. This is called product differentiation.
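Why sellers go to all this trouble is simple arithmetic: a single price leaves money on the table at both ends. A toy sketch (all numbers made up) makes the point:

```python
# Toy illustration of why segmented pricing beats any single price.
# Segment names, willingness-to-pay (WTP) values, and buyer counts are
# hypothetical. Each buyer purchases iff price <= their WTP; marginal
# cost is taken to be zero, as with the museum example below.

wtp = {"student": 6, "regular": 10, "business": 25}     # USD, assumed
buyers = {"student": 50, "regular": 100, "business": 20}

def revenue(price):
    """Revenue at a single uniform price: only segments whose WTP
    clears the price actually buy."""
    return sum(n for seg, n in buyers.items() if wtp[seg] >= price) * price

# The best a one-price seller can do:
best_uniform = max(revenue(p) for p in wtp.values())

# Perfect discrimination: charge each segment exactly its WTP.
segmented = sum(wtp[seg] * n for seg, n in buyers.items())

print(f"best uniform price revenue: {best_uniform}")   # 1200
print(f"segmented revenue: {segmented}")               # 1800
```

With these numbers, segmentation raises revenue by 50 percent, and every trick in the examples below is an attempt to capture some of that gap.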
Alright. Time to dive into some examples.
1. Student discounts at movies, museums, etc. are one of the simplest types of price discrimination. Students are generally poorer and more price sensitive, so the business hopes to attract more of them by making it cheaper.
Why museums and movies, and not say grocery stores? Two reasons: first, if the grocery store tried it, they’d quickly run into the problem of resale by the group that qualifies for the lower price. (It could manifest as parents sending their kids to get groceries.) The museum doesn’t have this problem because they ask for a student ID.
Second, grocery stores set prices pretty close to their marginal cost anyway, so there’s not as much scope for variable pricing. With museums, on the other hand, it costs them next to nothing to admit an extra visitor. All of their costs are fixed costs.
2. Ladies’ night at bars is another simple example of price discrimination based on an attribute (gender). Rather than women having a lower willingness to pay, it is perhaps more accurate to say that men are more desperate to get in :-)
Interestingly, this is one of the few examples whose legality is questionable. Wikipedia has a good survey. Also, it is not a “pure” example since the point of ladies’ night is not just to get more women through the door but also, indirectly, to get more men through the door.
3. A less obvious example is the variation of gas prices (and other commodities) within the same chain across locations. This is because people in richer ZIP codes are willing to pay more on average.
An important caveat: some of the variation is typically explainable by differences in marginal cost (such as rent) between different locations, but not all of it.
4. Financial aid at universities is a rather complex case of price discrimination. Instead of charging different rates to different students, the seller has a base rate and gives discounts (aid) to qualifying students.
You can see aid programs in humanitarian/political terms or in economic terms; the two paradigms are not in conflict with each other. In the economic view, students with higher scores receive aid because they have more college options and are therefore more price-sensitive. Poorer students and minorities receive aid because they are less able/willing to pay.
In the examples so far, the attributes that factor into discrimination are either obvious (gender, race, location) or ones the buyer has an interest in disclosing to the seller (student status, financial need). Now let’s look at examples where the seller has to be crafty in getting the buyer to disclose them.
5. Car prices vary greatly between market segments, far more than can be explained by differences in marginal cost. Car buyers segment themselves because owning a higher-end car is a status symbol.
The same principle applies to numerous other product categories like wine and coffee. But there, at least, you’re getting a nominally superior product for the higher price. Let’s look at examples where buyers voluntarily pay more for the same product.
6. Dell.com used to ask customers if they were home users, small businesses, or other categories. The prices for the same products varied according to the category you declared. There was no legally binding reason to be honest about your disclosure, and no enforcement mechanism.
Now for a more devious example.
7. “Staples brazenly sends out different office supply catalogs with different prices to the same customers. The price-sensitive buyers know which to buy from. The inattentive ones pay extra.” [source]
A similar example: restaurants with long menus sometimes highlight some popular choices on the first page. The same items are available in the long-form menu for cheaper, if only you knew where they’re buried.
These examples illustrate an extremely common form of price discrimination: the seller makes the discount contingent on effort or attention, so that only price-sensitive buyers bother to jump through the hoops while everyone else pays full price. This theme is so fundamental that it has been practiced for thousands of years in the form of haggling.
8. The jumping-through-hoops principle suggests that it makes economic sense for the seller to make discounts hard to get. Nowhere is this more apparent than with Black Friday deals—stand in ridiculously long lines all night to get fabulous discounts. Wealthier customers who don’t bother doing so will get much less of a discount during regular store hours, even on Black Friday.
9. More examples of hard-to-get discounts: woot.com, mailing-list deals and Southwest Airlines DING. Many of these involve artificial scarcity and time-limitations to make them more difficult to get, thus ensuring that those who take advantage are buyers who might otherwise not buy at all.
10. Perhaps the most extreme example of roping in buyers who might otherwise not buy is deliberately crippling your own product, known in economics as damaged goods.
IBM did this with its popular LaserPrinter by adding chips that slowed down the printing to about half the speed of the regular printer. The slowed printer sold for about half the price, under the IBM LaserPrinter E name.
That example and more like it are from here. And a more poignant example from railways of long ago:
It is not because of the few thousand francs which would have to be spent to put a roof over the third-class carriages or to upholster the third-class seats that some company or other has open carriages with wooden benches. What the company is trying to do is to prevent the passengers who can pay the second class fare from traveling third class; it hits the poor, not because it wants to hurt them, but to frighten the rich. And it is again for the same reason that the companies, having proved almost cruel to the third-class passengers and mean to the second-class ones, become lavish in dealing with first-class passengers. Having refused the poor what is necessary, they give the rich what is superfluous.
These examples should make clear that damaged goods are not about saving costs: the crippled version exists solely to keep buyers who can afford the full price from choosing the cheaper option.
11. There are endless examples of clever tricks to learn the customer’s price-sensitivity in the airline industry. The price for the same seat can vary greatly depending on a variety of factors. The most well-known one is that you get lower prices if your trip spans a weekend, because it probably means you’re not a business traveler.
12. First class and business class seating on airlines is also price discrimination, but of a very different kind. Here it’s not different prices for the same product but different prices for slightly different products. Buyers segment themselves due to product differentiation, a phenomenon we’ve seen before with cars.
The first class/economy price spread can often be as high as 10x, which illustrates the wide range of customers’ willingness to pay. For a variety of reasons, most other markets haven’t managed to attain such a high price spread.
Aaaaaand we’re done with the examples!
Note that this is far from a complete list—I haven’t covered clearance sales, loyalty programs and frequent flyer miles, hi-lo pricing, drug prices that vary by country, and so forth, but I hope I’ve convinced you that price discrimination in some form already happens in nearly every market.
But here’s the kicker: I’ve deliberately left out what I consider the most important class of examples, because I’m going to devote a whole article to it. I will argue that this emerging form of price discrimination is going to explode in popularity and dwarf anything we’ve seen so far. Feel free to guess what I’m thinking about in the comments, and stay tuned!
Many thanks to Justin Brickell, Alejandro Molnar and Adam Bossy for useful discussions and comments. Thanks also to my Twitter followers for putting up with my ‘tweetathon’ on this topic two months ago and providing feedback.
“You Might Also Like:” Privacy Risks of Collaborative Filtering
I have a new paper titled “You Might Also Like:” Privacy Risks of Collaborative Filtering with Joe Calandrino, Ann Kilzer, Ed Felten and Vitaly Shmatikov. We developed new “statistical inference” techniques and used them to show how the public outputs of online recommender systems, such as the “You Might Also Like” lists you see on many websites, can reveal individual purchases and preferences. Joe spoke about it at the IEEE S&P conference at Oakland earlier today.
Background: inference and statistical inference. The paper is about techniques for inference. At its core, inference is a simple concept: deducing that some event has occurred based on its effect on other observable events or objects, often seemingly unrelated. Think Sherlock Holmes, whether something simple such as the idea of a smoking gun, now so well known that it’s a cliché, or something more subtle like the curious incident of the dog in the night-time.
Today, inference has evolved a great deal, and in our data-rich world, inference often means statistical inference. Detection of extrasolar planets is a good example of making deductions from the faintest clues: A planet orbiting a star makes the star wobble slightly, which affects the velocity of the star with respect to the Earth. And this relative velocity can be deduced from the displacement in the parent star’s spectral lines due to the Doppler effect, thus inferring the existence of a planet. Crazy!
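The chain from spectral clue to conclusion is remarkably short. A quick sketch, using the non-relativistic Doppler formula and a made-up wavelength shift, shows just how faint the evidence is:

```python
# Non-relativistic Doppler: radial velocity v = c * (delta_lambda / lambda_0).
# The observed shift below is a hypothetical illustrative value;
# the rest wavelength is the hydrogen-alpha line.

c = 299_792_458.0          # speed of light, m/s
lambda_0 = 656.281e-9      # hydrogen-alpha rest wavelength, m
delta_lambda = 1.2e-13     # observed wavelength shift, m (made up)

v = c * (delta_lambda / lambda_0)
print(f"inferred stellar wobble: ~{v:.0f} m/s")
```

A shift of about one part in ten million of the wavelength corresponds to a stellar wobble of tens of meters per second, roughly a sprinter’s pace, and from that a planet is inferred.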
Web privacy. But back to the paper: what we did was to develop and apply inference techniques in the web context, specifically recommender systems, in a way that no one had thought of before. As you may have noticed, just about every website publicly shows relationships between related items—products, videos, books, news articles, etc.— and these relationships are derived from purchases or views, which are private information. What if the public listings could be reverse engineered, so that we can infer a user’s purchases from them? As the abstract says:
Many commercial websites use recommender systems to help customers locate products and content. Modern recommenders are based on collaborative filtering: they use patterns learned from users’ behavior to make recommendations, usually in the form of related-items lists. The scale and complexity of these systems, along with the fact that their outputs reveal only relationships between items (as opposed to information about users), may suggest that they pose no meaningful privacy risk.
In this paper, we develop algorithms which take a moderate amount of auxiliary information about a customer and infer this customer’s transactions from temporal changes in the public outputs of a recommender system. Our inference attacks are passive and can be carried out by any Internet user. We evaluate their feasibility using public data from popular websites Hunch, Last.fm, LibraryThing, and Amazon.

Consider a user Alice who’s made numerous purchases, some of which she has reviewed publicly. Now she makes a new purchase which she considers sensitive. But this new item, because of her purchasing it, has a nonzero probability of entering the “related items” list of each of the items she has purchased in the past, including the ones she has reviewed publicly. And even if it is already in the related-items list of some of those items, it might improve its rank on those lists because of her purchase. By aggregating dozens or hundreds of these observations, the attacker has a chance of inferring that Alice purchased something, as well as the identity of the item she purchased.
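The aggregation step above can be sketched in a few lines. This is a toy illustration of the idea, with made-up item lists, and not the paper’s actual algorithm: it scores candidate items by how many of the observed related-items lists they newly entered or rose in between two snapshots.

```python
# Simplified sketch of aggregating related-items observations.
# lists_before / lists_after map each publicly observed item (e.g. the
# items Alice reviewed) to its ordered related-items list at two times.
from collections import Counter

def infer_candidates(lists_before, lists_after):
    """Score items by how many observed lists they entered or rose in."""
    scores = Counter()
    for item in lists_before:
        before, after = lists_before[item], lists_after[item]
        for candidate in after:
            is_new = candidate not in before
            rose = (not is_new) and after.index(candidate) < before.index(candidate)
            if is_new or rose:
                scores[candidate] += 1
    return scores.most_common()

# Hypothetical snapshots for three items Alice has reviewed publicly:
before = {"A": ["x", "y", "z"], "B": ["y", "z"], "C": ["z", "x"]}
after  = {"A": ["x", "s", "y"], "B": ["s", "y"], "C": ["z", "s"]}
print(infer_candidates(before, after))  # "s" scores highest across lists
```

A single list change means nothing, but an item that simultaneously surfaces in many of Alice’s lists is strong evidence of her purchase; the real attack refines this with auxiliary information and careful statistics.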
It’s a subtle technique, and the paper has more details than you can shake a stick at if you want to know more.
We evaluated the attacks we developed against several websites of a diverse nature. Numerically, our best results are against Hunch, a recommendation and personalization website. There is a tradeoff between the number of inferences and their accuracy. When optimized for accuracy, our algorithm inferred a third of the test users’ secret answers to Hunch questions with no error. Conversely, if asked to predict the secret answer to every secret question, the algorithm had an accuracy of around 80%.
Impact. It is important to note that we’re not claiming that these sites have serious flaws, or even, in most cases, that they should be doing anything different. On sites other than Hunch—Hunch had an API that provided exact numerical correlations between pairs of items—our attacks worked only on a small proportion of users, although it is sufficient to demonstrate the concept. (Hunch has since eliminated this feature of the API, for reasons unrelated to our research.) We also found that users of larger sites are much safer, because the statistical aggregates are computed from a larger set of users.
But here’s why we think this paper is important:
- Our attack applies to a wide variety of sites—essentially every site with an online catalog of some sort. While we discuss various ways to mitigate the attack in the paper, there is no bulletproof “fix.”
- It undermines the widely accepted dichotomy between “personally identifiable” individual records and “safe,” large-scale, aggregate statistics. Furthermore, it demonstrates that the dynamics of aggregate outputs (i.e., their variation with time) constitute a new vector for privacy breaches. Dynamic behavior of high-dimensional aggregates like item similarity lists falls beyond the protections offered by any existing privacy technology, including differential privacy.
- It underscores the fact that modern systems have vast “surfaces” for attacks on privacy, making it difficult to protect fine-grained information about their users. Unintentional leaks of private information are akin to side-channel attacks: it is very hard to enumerate all aspects of the system’s publicly observable behavior which may reveal information about individual users.
That last point is especially interesting to me. We’re leaving digital breadcrumbs online all the time, whether we like it or not. And while algorithms to piece these trails together might seem sophisticated today, they will probably look mundane in a decade or two if history is any indication. The conversation around privacy has always centered around the assumption that we can build technological tools to give users—at least informed users—control over what they reveal about themselves, but our work suggests that there might be fundamental limits to those tools.
See also: Joe Calandrino’s post about this paper.
Insights on fighting “Protect IP” from a Q&A with Congresswoman Lofgren
Summary. Appeals to free speech and chilling effects are at best temporary measures in the fight against Protect IP and domain seizures. Even if we win this time it will keep coming back in modified form; the only way to defeat it for good is to convince Washington that artists are in fact thriving, that piracy is not the real problem, and that takedown efforts are not in the interest of society. We in the tech world know this, but we are doing a poor job of making ourselves heard in Washington, and this needs to change.
As most of you know, the Protect IP Act is a horrendous piece of proposed legislation sponsored by the “content industry” that gives branches of the Government powers to seize domain names at will, force websites to remove links, etc. Congresswoman Zoe Lofgren has been one of the very few legislators fighting the good fight, speaking out against this grave threat to free speech.
I was invited to a brown bag lunch with Rep. Lofgren at Mozilla today. (Mozilla has gotten involved in this because of the events surrounding the Mafiaafire add-on and Homeland Security.) I asked the Congresswoman this question (paraphrased):
“Does the strategy of domain-name seizures even have a prayer of achieving the intended outcome, or is it going to lead to something similar to the Streisand effect, as we’ve seen happen repeatedly on the Internet? Tools for circumvention of censorship in dictatorial regimes, that we can all get behind and that the U.S. government has often funded, may be morally different from tools for circumvention of anti-infringement efforts, but they are technologically identical.” [Princeton professor and now FTC chief technologist Ed Felten has pointed this out in a related context.]
In response, Rep. Lofgren pivoted to the point that seemed to be her favorite theme of the day—the tech world needs to come up with ways to monetize online content, she said. Unless that happens, it’s not looking good for our side in the long run.
At first I was slightly annoyed by her not addressing my question, but after she pivoted a couple more times to the same point in answer to other questions, I started to pay close attention.
What the Congresswoman was saying was this:
- The only way to convince Washington to drop this issue for good is to show that artists and musicians can get paid on the Internet.
- Currently they are not seeing any evidence of this. The Congresswoman believes that new technology needs to be developed to let artists get paid. I believe she is entirely wrong about this; see below.
- The arguments that have been raised by tech companies and civil liberties groups in Washington all center around free speech; there is nothing wrong with that but it is not a viable strategy in the long run because the issue is going to keep coming back.
Let’s zoom in on point 2 above. We techies all say we have the answers. New technology is not needed, we say. The dinosaurs of the content industries need to adapt their business models. Piracy is not correlated with a decrease in sales. Piracy happens not because it is cheaper, but because it is more convenient. Businesses need to compete with piracy rather than trying to outlaw it. Artists who’ve understood this are already thriving.
Washington is willing to listen to this. But no one is telling it to them.
There are a million blog posts that make the points above. But those don’t have an impact in Congress. “You vote up articles on Reddit all day,” Rep. Lofgren said. “Guess what, we don’t check Reddit in Washington.” Yes, she actually said that. The exact wording might be off, but she said words to that effect. She also pointed out that the tech industry spends by far the least amount of effort on lobbying. The entire industry has fewer representatives, apparently, than individual companies from many other sectors do.
A lot of information that we consider common knowledge is not available in Washington. It needs to be in a digestible form; for example, academic studies with concrete numbers that can be cited will be particularly useful. But a simple and important first step is to start communicating with policymakers. In my dealings with them, I’ve found them more willing to listen than I would have thought. So here’s my plea to the community to redirect some of the energy that we expend writing blog posts and expressing outrage into something more constructive.


