The Master Switch and the Centralization of the Internet

One of the most important trends in the recent evolution of the Internet has been the move towards centralization and closed platforms. I’m interested in this question in the context of social networks—analyzing why no decentralized social network has yet taken off, whether one ever will, and whether a decentralized social network is important for society and freedom. With this in mind, I read Tim Wu’s ‘The Master Switch: The Rise and Fall of Information Empires,’ a powerful book that will influence policy debates for some time to come. My review follows.

‘The Master Switch’ has two parts. The former discusses the history of communications media through the twentieth century and shows evidence for “The Cycle” of open innovation → closed monopoly → disruption. The latter, shorter part is more speculative and argues that the same fate will befall the Internet, absent aggressive intervention.

The first part of the book is unequivocally excellent, packed with historical facts both grand and small. Wu makes a strong case that radio, telephony, film and television have all taken much the same path.

A point that Wu drives home repeatedly is that while free speech in law is always spoken of in the context of Governmental controls, the private entities that own or control the medium of speech play a far bigger role in practice in determining how much freedom of speech society has. In the U.S., we are used to regulating Governmental barriers to speech but not private ones, and a lot of the book is about exposing the problems with this approach.

An interesting angle the author takes is to look at the motives of the key men who shaped the “information industries” of the past. This is apposite given the enormous impact on history that each of these few has had, and I felt it added a layer of understanding compared to a purely factual account.

But let’s cut to the chase—the argument about the future of the Internet. I wasn’t sure whether I agreed or disagreed until I realized Wu is making two different claims, a weak one and a strong one, and does not separate them clearly.

The weak claim is simply that an open Internet is better for society in the long run than a closed one. Open and closed here are best understood via the exemplars of Google and Apple. Wu argues this reasonably well, and in any case not much argument is needed—most of us would consider it obvious on the face of it.

The strong claim, and the one that is used to justify intervention, is that a closed Internet will have such crippling effects on innovation and such chilling effects on free speech that it is our collective duty to learn from history and do something before the dystopian future materializes. This is where I think Wu’s argument falls short.

To begin with, Wu doesn’t have a clear reason why the Internet will follow the previous technologies, except, almost literally, “we can’t be sure it won’t.” He overstates the similarities and downplays the differences.

Second, I believe Wu doesn’t fully understand technology and the Internet in some key ways. Bizarrely, he appears to believe that the Internet’s predilection for decentralization is due to our cultural values rather than technological and business realities prevalent when these systems were designed.

Finally, Wu has a tendency to see things in black and white, in terms of good and evil, which I find annoying, and more importantly, oversimplified. He quotes this sentence approvingly: “Once we replace the personal computer with a closed-platform device such as the iPad, we replace freedom, choice and the free market with oppression, censorship and monopoly.” He also says that “no one denies that the future will be decided by one of two visions,” in the context of iOS and Android. It isn’t clear why he thinks they can’t coexist the way the Mac and PC have.

Regardless of whether one buys his dystopian prognostications, Wu’s paradigm of the “separations principle” is to be taken seriously. It is far broader than even net neutrality. There appear to be two key pillars: a separation of platforms from content, and limits on corporate structures to facilitate this—mainly vertical, but also horizontal, as in the case of media conglomerates.

Interestingly, Wu wants the separations principle to be more of a societal-corporate norm than Governmental regulation. That said, he does call for more powers to the FCC, which is odd given that he is clear on the role that State actors have played in the past in enabling and condoning monopoly abuse:

Again and again in the histories I have recounted, the state has shown itself an inferior arbiter of what is good for the information industries. The federal government’s role in radio and television from the 1920s to the 1960s, for instance, was nothing short of a disgrace. In the service of chain broadcasting, it wrecked a vibrant, decentralized AM marketplace. At the behest of the ascendant radio industry, it blocked the arrival and prospects of FM radio, and then it put the brakes on television, reserving it for the NBC-CBS duopoly. Finally, from the 1950s through the 1960s, it did everything in its power to prevent cable television from challenging the primacy of the networks.

To his credit, Wu does seem to be aware of the contradiction, and appears to argue that the Government agencies can learn and change. It does seem like a stretch, however.

In summary, Wu deserves major kudos both for the historical treatment and for some very astute insights about the Internet. For example, in the last 2-3 years, Apple, Facebook, and Twitter have all made dramatic moves toward centralization, control and closed platforms. Wu seems to have foreseen this general trend more clearly than most techies did.[1] The book does have drawbacks, and I don’t agree that the Internet will go the way of past monopolies without intervention. It should be very interesting to see what moves Wu will make now that he will be advising the FTC.

[1] While the book was published in late 2010, I assume that Wu’s ideas are much older.

To stay on top of future posts, subscribe to the RSS feed or follow me on Twitter.

March 23, 2011 at 7:51 pm Leave a comment

Privacy and the Market for Lemons, or How Websites Are Like Used Cars

I had a fun and engaging discussion on the “Paying With Data” panel at the South by Southwest conference; many thanks to my co-panelists Sara Marie Watson, Julia Angwin and Sam Yagan. I’d like to elaborate here on a concept that I briefly touched upon during the panel.

The market for lemons

In a groundbreaking paper 40 years ago, economist George Akerlof explained why so many used cars are lemons. The key is “asymmetric information:” the seller of a car knows more about its condition than the buyer does. This leads to “adverse selection” and a negative feedback spiral, with buyers tending to assume that there are hidden problems with cars on the market, which brings down prices and disincentivizes owners of good cars from trying to sell, further reinforcing the perception of bad quality.

In general, a market with asymmetric information is in danger of developing these characteristics:

  • buyers/consumers lack the ability to distinguish between high- and low-quality products
  • sellers/service providers lose the incentive to focus on quality
  • the bad gradually crowds out the good, since poor-quality products are cheaper to produce
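The spiral is easy to see in a toy simulation (my own illustration, not something from Akerlof’s paper): if buyers can only price a car at the market’s average quality, every above-average car gets withdrawn, which lowers the average, which drives out the next tier, and so on.

```python
import statistics

def adverse_selection(qualities, rounds=5):
    """Toy lemon-market simulation: buyers offer the average quality as the
    price; sellers whose (privately known) quality exceeds the offer withdraw,
    so the average quality of cars left on the market keeps falling."""
    market = list(qualities)
    history = []
    for _ in range(rounds):
        if not market:
            break
        offer = statistics.mean(market)   # buyers can only price the average
        history.append(offer)
        market = [q for q in market if q <= offer]  # above-average cars exit
    return history

# Quality spread from lemons (1) to cream puffs (10); each round the
# offered price drops as the better cars leave the market.
print(adverse_selection(range(1, 11)))
```

Only the lemons remain at the end, even though nobody lied about anything; the asymmetry alone is enough.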

Information security and privacy suffer from this problem at least as much as used cars do.

The market for security products and certification

Bruce Schneier describes how various security products, such as USB drives, have turned into a lemon market. And in a fascinating paper, Ben Edelman analyzes data from TRUSTe certifications and comes to some startling conclusions [emphasis mine]:

Widely-used online “trust” authorities issue certifications without substantial verification of recipients’ actual trustworthiness. This lax approach gives rise to adverse selection: The sites that seek and obtain trust certifications are  actually less trustworthy than others. Using a new dataset on web site safety, I demonstrate that sites certified by the best-known authority, TRUSTe, are more than twice as likely to be untrustworthy as uncertified sites. This difference remains statistically and economically significant when restricted to “complex” commercial sites.

[…]
In a 2004 investigation after user complaints, TRUSTe gave Gratis Internet a clean bill of health. Yet subsequent New York Attorney General litigation uncovered Gratis’ exceptionally far-reaching privacy policy violations — selling 7.2 million users’ names, email addresses, street addresses, and phone numbers, despite a privacy policy exactly to the contrary.

[…]
TRUSTe’s “Watchdog Reports” also indicate a lack of focus on enforcement. TRUSTe’s postings reveal that users continue to submit hundreds of complaints each month. But of the 3,416 complaints received since January 2003, TRUSTe concluded that not a single one required any change to any member’s operations, privacy statement, or privacy practices, nor did any complaint require any revocation or on-site audit. Other aspects of TRUSTe’s watchdog system also indicate a lack of diligence.

The market for personal data

In the realm of online privacy and data collection, the information asymmetry results from a serious lack of transparency around privacy policies. The website or service provider knows what happens to data that’s collected, but the user generally doesn’t. This arises due to several economic, architectural, cognitive and regulatory limitations/flaws:

  • Each click is a transaction. As a user browses around the web, she interacts with dozens of websites and performs hundreds of actions per day. It is impossible to make a privacy decision with every click, or to have a meaningful business relationship with each website and hold it accountable for its data collection practices.
  • Technology is hard to understand. Companies can often get away with meaningless privacy guarantees such as “anonymization” as a magic bullet, or “military-grade security,” a nonsensical term. The complexity of private browsing mode has led to user confusion and a false sense of safety.
  • Privacy policies are filled with legalese and no one reads them, which means that disclosures made therein count for nothing. Yet, courts have upheld them as enforceable, disincentivizing websites from finding ways to communicate more clearly.

Collectively, these flaws have led to a well-documented market failure—there’s an arms race to use all means possible to entice users to give up more information, as well as to collect it passively through ever-more intrusive means. Self-regulatory organizations become captured by those they are supposed to regulate, and therefore their effectiveness quickly evaporates.

TRUSTe seems to be up to some shenanigans in the online tracking space as well. As many have pointed out, the TRUSTe “Tracking Protection List” for Internet Explorer is in fact a whitelist, allowing about 4,000 domains—almost certainly from companies that have paid TRUSTe—to track the user. Worse, installing the TRUSTe list seems to override the blocking of a domain via another list!

Possible solutions

The obvious response to a market with asymmetric information is to correct the information asymmetry—for used cars, it involves taking it to a mechanic, and for online privacy, it is consumer education. Indeed, the What They Know series has done just that, and has been a big reason why we’re having this conversation today.

However, I am skeptical that the market can be fixed through consumer awareness alone. Many of the factors I’ve laid out above involve fundamental cognitive limitations, and while consumers may be well-educated about the general dangers prevalent online, that does not necessarily help them make fine-grained decisions.

It is for these reasons that some sort of Government regulation of the online data-gathering ecosystem seems necessary. Regulatory capture is of course still a threat, but less so than with self-regulation. Jonathan Mayer and I point out in our FTC Comment that ad industry self-regulation of online tracking has been a failure, and argue that the FTC must step in and enforce Do Not Track.

In summary, information asymmetry occurs in many markets related to security and privacy, leading in most cases to a spiraling decline in quality of products and services from a consumer perspective. Before we can talk about solutions, we must clearly understand why the market won’t fix itself, and in this post I have shown why that’s the case.

Update. TRUSTe president Fran Maier responds in the comments.

Update 2. Chris Soghoian points me to this paper analyzing privacy economics as a lemon market, which seems highly relevant.

Thanks to Jonathan Mayer for helpful feedback.


March 18, 2011 at 4:37 pm 6 comments

Link Prediction by De-anonymization: How We Won the Kaggle Social Network Challenge

The title of this post is also the title of a new paper of mine with Elaine Shi and Ben Rubinstein. You can grab a PDF or a web-friendly HTML version generated using my Project Luther software.

A brief de-anonymization history. As early as the first version of my Netflix de-anonymization paper with Vitaly Shmatikov back in 2006, a colleague suggested that de-anonymization can in fact be used to game machine-learning contests—by simply “looking up” the attributes of de-anonymized users instead of predicting them. We off-handedly threw in a paragraph in our paper discussing this possibility, and a New Scientist writer seized on it as an angle for her article.[1] Nothing came of it, of course; we had no interest in gaming the Netflix Prize.

During the years 2007-2009, Shmatikov and I worked on de-anonymizing social networks. The paper that resulted (PDF, HTML) showed how to take two graphs representing social networks and map the nodes to each other based on the graph structure alone—no usernames, no nothing. As you might imagine, this was a phenomenally harder technical challenge than our Netflix work. (Backstrom, Dwork and Kleinberg had previously published a paper on social network de-anonymization; the crucial difference was that we showed how to put two social network graphs together rather than search for a small piece of graph-structured auxiliary information in a large graph.)

The context for these two papers is that data mining on social networks—whether online social networks, telephone call networks, or any type of network of links between individuals—can be very lucrative. Social networking websites would benefit from outsourcing “anonymized” graphs to advertisers and such; we showed that the privacy guarantees are questionable-to-nonexistent since the anonymization can be reversed. No major social network has gone down this path (as far as I know), quite possibly in part because of the two papers, although smaller players often fly under the radar.

The Kaggle contest. Kaggle is a platform for machine learning competitions. They ran the IJCNN social network challenge to promote research on link prediction. The contest dataset was created by crawling an online social network—which was later revealed to be Flickr—and partitioning the obtained edge set into a large training set and a smaller test set of edges augmented with an equal number of fake edges. The challenge was to predict which edges were real and which were fake. Node identities in the released data were obfuscated.

There are many, many anonymized databases out there; I come across a new one every other week. I pick de-anonymization projects if they will advance the art significantly (yes, de-anonymization is still partly an art) or if they are fun. The Kaggle contest was a bit of both, and so when my collaborators invited me to join them, it was too juicy to pass up.

The Kaggle contest is actually much more suitable to game through de-anonymization than the Netflix Prize would have been. As we explain in the paper:

One factor that greatly affects both [the privacy risk and the risk of gaming]—in opposite directions—is whether the underlying data is already publicly available. If it is, then there is likely no privacy risk; however, it furnishes a ready source of high-quality data to game the contest.

The first step was to do our own crawl of Flickr; this turned out to be relatively easy. The two graphs (the Kaggle graph and our Flickr crawl) were 95% similar, as we were later able to determine. The difference is primarily due to Flickr users adding and deleting contacts between Kaggle’s crawl and ours. Armed with the auxiliary data, we set about the task of matching up the two graphs based on the structure. To clarify: our goal was to map the nodes in the Kaggle training and test datasets to real Flickr nodes. That would allow us to simply look up the pairs of nodes in the test set in the Flickr graph to see whether or not the edge exists.

De-anonymization. Our effort validated the broad strategy in my paper with Shmatikov, which consists of two steps: “seed finding” and “propagation.” In the former step we somehow de-anonymize a small number of nodes; in the latter step we use these as “anchors” to propagate the de-anonymization to more and more nodes. In this step the algorithm feeds on its own output.

Let me first describe propagation because it is simpler.[2] As the algorithm progresses, it maintains a (partial) mapping between the nodes in the true Flickr graph and the Kaggle graph. We iteratively try to extend the mapping as follows: pick an arbitrary as-yet-unmapped node in the Kaggle graph, find the “most similar” node in the Flickr graph, and if they are “sufficiently similar,” they get mapped to each other.

Similarity between a Kaggle node and a Flickr node is defined as cosine similarity between the already-mapped neighbors of the Kaggle node and the already-mapped neighbors of the Flickr node (nodes mapped to each other are treated as identical for the purpose of cosine comparison).

In the diagram, the blue nodes have already been mapped. The similarity between A and B is 2 / (√3·√3) = ⅔. Whether or not edges exist between A and A’ or B and B’ is irrelevant.
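Here is a minimal sketch of that similarity computation (the node labels and partial mapping below are made up for illustration):

```python
from math import sqrt

def cosine_similarity(kaggle_neighbors, flickr_neighbors, mapping):
    """Cosine similarity between a Kaggle node's already-mapped neighbors
    and a Flickr node's already-mapped neighbors.  Kaggle-side neighbors are
    translated through the current partial mapping so that both neighbor
    sets live in the Flickr graph before comparing."""
    mapped = {mapping[n] for n in kaggle_neighbors if n in mapping}
    flickr_mapped = {n for n in flickr_neighbors if n in mapping.values()}
    if not mapped or not flickr_mapped:
        return 0.0
    common = len(mapped & flickr_mapped)
    return common / sqrt(len(mapped) * len(flickr_mapped))

# The example from the diagram: A and B each have three mapped neighbors,
# two of which coincide, giving 2 / (sqrt(3) * sqrt(3)) = 2/3.
mapping = {"k1": "f1", "k2": "f2", "k3": "f3", "k5": "f5"}
print(cosine_similarity({"k1", "k2", "k3"}, {"f1", "f2", "f5"}, mapping))
# → 0.6666666666666666 (= 2/3)
```

Treating mapped nodes as identical, as in the last line of the function, is exactly the convention described above.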

There are many heuristics that go into the “sufficiently similar” criterion, which are described in our paper. Due to the high percentage of common edges between the graphs, we were able to use a relatively pure form of the propagation algorithm; the one in my paper with Shmatikov, in contrast, was filled with many more messy heuristics.

Those elusive seeds. Seed identification was far more challenging. In the earlier paper, we didn’t do seed identification on real graphs; we only showed it possible under certain models for error in the auxiliary information. There we used a “pattern-search” technique, as did the Backstrom et al. paper. It wasn’t clear whether this method would work here, for reasons I won’t go into.

So we developed a new technique based on “combinatorial optimization.” At a high level, this means that instead of finding seeds one by one, we try to find them all at once! The first step is to find a set of k (we used k=20) nodes in the Kaggle graph and k nodes in our Flickr graph that are likely to correspond to each other (in some order); the next step is to find this correspondence.

The latter step is the hard one, and essentially involves solving an NP-hard problem: finding a permutation that minimizes a certain weighting function. During the contest I basically stared at this page of numbers for a couple of hours, and then wrote down the mapping, which to my great relief turned out to be correct! But later we were able to show how to solve it in an automated and scalable fashion using simulated annealing, a well-known technique for approximately solving NP-hard problems at small enough problem sizes. This method is one of the main research contributions of our paper.
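For the curious, here is roughly what that annealing step looks like in code. This is a bare-bones sketch with an illustrative cost matrix and cooling schedule, not the actual parameters or weighting function from our paper: given a k×k matrix of mismatch costs, we search for the permutation minimizing the total cost by swapping two positions at a time.

```python
import math
import random

def anneal_permutation(cost, steps=20000, t0=1.0, cooling=0.9995, seed=0):
    """Search for a permutation p minimizing sum(cost[i][p[i]]) by simulated
    annealing: propose a random swap, always accept it if it lowers the cost,
    accept it with probability exp(-delta/T) if it raises the cost, and let
    the temperature T cool geometrically."""
    rng = random.Random(seed)
    k = len(cost)
    perm = list(range(k))
    total = sum(cost[i][perm[i]] for i in range(k))
    t = t0
    for _ in range(steps):
        i, j = rng.sample(range(k), 2)
        delta = (cost[i][perm[j]] + cost[j][perm[i]]
                 - cost[i][perm[i]] - cost[j][perm[j]])
        if delta < 0 or rng.random() < math.exp(-delta / t):
            perm[i], perm[j] = perm[j], perm[i]
            total += delta
        t *= cooling
    return perm, total

# Tiny example; by enumeration, the optimal assignment is the reversal
# [2, 1, 0] with total cost 1 + 2 + 1 = 4.
cost = [[9, 5, 1],
        [4, 2, 8],
        [1, 7, 9]]
print(anneal_permutation(cost))
```

At contest scale the objective is far lumpier than this toy matrix, which is precisely why the random uphill moves early on matter: pure greedy descent gets trapped.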

After carrying out seed identification, and then propagation, we had de-anonymized about 65% of the edges in the contest test set, and the accuracy was about 95%. The main reason we didn’t succeed on the other third of the edges was that one or both of the nodes had a very small number of contacts/friends, resulting in too little information to de-anonymize. Our task was far from over: combining de-anonymization with regular link prediction also involved nontrivial research insights, for which I will again refer you to the relevant section of the paper.

Lessons. The main question that our work raises is where this leaves us with respect to future machine-learning contests. One necessary step that would help a lot is to amend contest rules to prohibit de-anonymization and to require source code submission for human verification, but as we explain in the paper:

The loophole in this approach is the possibility of overfitting. While source-code verification would undoubtedly catch a contestant who achieved their results using de-anonymization alone, the more realistic threat is that of de-anonymization being used to bridge a small gap. In this scenario, a machine learning algorithm would be trained on the test set, the correct results having been obtained via de-anonymization. Since successful [machine learning] solutions are composites of numerous algorithms, and consequently have a huge number of parameters, it should be possible to conceal a significant amount of overfitting in this manner.

As with the privacy question, there are no easy answers. It has been over a decade since Latanya Sweeney’s work provided the first dramatic demonstration of the privacy problems with data anonymization; we still aren’t close to fixing things. I foresee a rocky road ahead for machine-learning contests as well. I expect I will have more to say about this topic on this blog; stay tuned.

[1] Amusingly, it was a whole year after that before anyone paid any attention to the privacy claims in that paper.

[2] The description is from my post on the Kaggle forum which also contains a few additional details.


March 9, 2011 at 12:30 pm 4 comments

In Which I Interrupt Your Regularly Scheduled Programming to Talk about Immigration Policy

On a recent trip to India for the winter break, I needed to renew my US visa. Like many people working on computer security and other subjects on the “Technology Alert List,” I ended up getting stuck there while my application was sent back to the US Department of State, where they supposedly make sure I’m not conducting espionage. I was lucky—I was “only” delayed by a little over a month. (I’m told that the wait used to be several months, and applicants would often give up.) Nonetheless, it was hugely disruptive: I missed a conference where I was supposed to speak, multiple panels and innumerable meetings.

There are several absurd aspects to the way the State Department and the Consulate process these applications:

  • Processing takes a highly variable amount of time. If it always took a month it wouldn’t be nearly as bad, but since it sometimes takes several months, it wrecks your ability to schedule things.
  • The consulate is highly understaffed. A decision to reject an applicant or stick them in limbo is made based on a 1-2 minute interview.
  • I’ve already been in the country for 6.5 years. Besides, my leaving the country was entirely voluntary, and I’m not required to renew my visa unless I choose to leave. One would think that if I were up to something I would have done it by now, or at least not have left.
  • There is no way to get this time-consuming background check done while I’m still in the country.
  • All of this would be justifiable in some way if the system at least worked. But the determination of whether an applicant is working on something sensitive depends entirely on what they put on their application; worse, it’s based on keyword matching. It is often possible to reword your application to avoid these keywords if you know how; I wasn’t smart enough to do so.

Immigrants are not the only ones harmed by the muddleheaded visa policy and the fickle behavior of the visa overlords—all Americans are. The H-1B lottery, processing delays and other visa problems end up sending skilled workers and scientists back home, which hurts the economy. In fact, the US spends taxpayer money to educate Ph.D.s and then encourages or forces them to leave.

As with many problems of Government, a major factor here seems to be that there is a vast and bloated immigration apparatus mired in rules and with no central oversight. Are there things an ordinary person can do to help improve the situation? I’d welcome any thoughts on the issue.


February 23, 2011 at 4:31 pm 6 comments

One Click Frauds and Identity Leakage: Two Trends on a Collision Course

One of my favorite computer security papers of 2010 is by Nicolas Christin, Sally Yanagihara and Keisuke Kamataki on “one click frauds,” a simple yet shockingly effective form of social engineering endemic to Japan. I will let the authors explain:

In the family apartment in Tokyo, Ken is sitting at his computer, casually browsing the free section of a mildly erotic website. Suddenly, a window pops up, telling him,

Thank you for your patronage! You successfully registered for our premium online services, at an incredible price of 50,000 JPY. Please promptly send your payment by bank transfer to ABC Ltd at Ginko Bank, Account 1234567. Questions? Please contact us at 080-1234-1234.

Your IP address is 10.1.2.3, you run Firefox 3.5 over Windows XP, and you are connecting from Tokyo.

Failure to send your payment promptly will force us to mail you a postcard reminder to your home address. Customers refusing to pay will be prosecuted to the fullest extent of the law. Once again, thank you for your patronage!

A sample postcard reminder is shown on the screen, and consists of a scantily clad woman in a provocative pose. Ken has a sudden panic attack: He is married, and, if his wife were to find out about his browsing habits, his marriage would be in trouble, possibly ending in divorce, and public shame. In his frenzied state of mind, Ken also fears that, if anybody at his company heard about this, he could possibly lose his job. Obviously, those website operators know who he is and where he lives, and could make his life very difficult. Now, 50,000 JPY (USD 500) seems like a small price to pay to make all of this go away. Ken immediately jots down the contact information, goes to the nearest bank, and acquits himself of his supposed debt.

Ken has just been the victim of a relatively common online scam perpetrated in Japan, called “One Click Fraud.” In this fraud, the “customer,” i.e., the victim, does not enter any legally binding agreement, and the perpetrators only have marginal information about the client that connected to their website (IP address, User-Agent string), which does not reveal much about the user. However, facing a display of authority stressed by the language used, including the notion that they are monitored, and a sense of shame from browsing sites with questionable contents, most victims do not realize they are part of an extortion scam. Some victims even call up the phone numbers provided, and, in hopes of resolving the situation, disclose private information, such as name or address, to their tormentors, which makes them even more vulnerable to blackmail.

As a result, One Click Frauds have been very successful in Japan. Annual police reports show that the estimated amount of monetary damages stemming from One Click Frauds and related confidence scams are roughly 26 billion JPY per year (i.e., USD 260 million/year). [emphasis mine]

The authors offer a fascinating economic analysis based on a near-exhaustive collection of fraud reports over a several-year period. Each scam offers three types of data points: the domain name where the scam appeared, the phone number the victim is asked to call, and the bank account number into which the victim is asked to deposit the money. They plot the graph of all links between the ~500 domains, ~700 bank accounts and ~200 phone numbers, and report, among other nifty findings, that at most 13 groups are responsible for over half of all one-click frauds. Based on simple cost estimates, they also find that for each scam operated, the scammers recover their costs (bank account fee, bandwidth, etc.) with as few as 4 victims per year.
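The grouping step boils down to computing connected components over shared identifiers. A small union-find sketch (with made-up report data, not the paper’s dataset) shows the idea: any two reports sharing a domain, bank account, or phone number get merged into one suspected group.

```python
def group_scams(reports):
    """Union-find over scam reports: reports sharing any identifier
    (domain, bank account, or phone number) end up in the same group."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    # Link each report to every identifier it mentions.
    for i, report in enumerate(reports):
        for identifier in report:
            union(("report", i), identifier)

    # Collect reports by the root of their component.
    groups = {}
    for i in range(len(reports)):
        groups.setdefault(find(("report", i)), []).append(i)
    return sorted(groups.values())

# Hypothetical reports as (domain, bank_account, phone) tuples.
reports = [
    ("scam-a.jp", "acct-1", "080-1111"),
    ("scam-b.jp", "acct-1", "080-2222"),   # shares acct-1 with report 0
    ("scam-c.jp", "acct-9", "080-9999"),   # unrelated
]
print(group_scams(reports))  # → [[0, 1], [2]]
```

With ~1,400 identifier nodes, as in the paper’s data, this runs instantly; the interesting work is in collecting and cleaning the reports, not the graph algorithm.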

In this post I want to talk about the possible evolution of one-click frauds. At some point, either due to public awareness campaigns or due to saturation, the Japanese public will catch on to the fact that the attempted blackmail is fake and that the websites don’t actually have their identity. When this happens the scammers will be forced to up their game. Another impetus for increasing sophistication is making the fraud work outside Japan—the current version probably won’t; the instinctive obedience to apparent authority seems characteristically Japanese.

And by ‘up their game,’ I mean that the scammers will probably get wise to the fact that they can discover the victim’s actual identity, and establish a credible threat instead of a fake one.

Readers of this blog know that I have announced or reported numerous attacks/vulnerabilities under the “ubercookies” series (1, 2, 3, 4, and part of 5) that allow a website to uncover a visitor’s identity, i.e., a Google/Facebook/Twitter handle. At the same time, connecting an online profile or email address to real-world information is becoming increasingly easy to automate. Putting two and two together, it is clear why one-click frauds could get very serious any day.

What might stop this logical progression of one-click frauds? Perhaps all identity-leak vulnerabilities will be found and fixed, but that’s a rather naïve hope, as the history of malware shows. Or maybe the public will eventually learn to resist the scam even in the face of a credible threat. That will take a long time, however, and a lot of damage will be done by then. Perhaps the technical skills required will remain beyond the reach of the scammers. But experience suggests that with a sufficiently lucrative prize, technical sophistication is no barrier—all it takes is one or two actual hackers; script-kiddie scammers can take care of the rest.

The best hope, as with any scam, is law enforcement. The authors list several factors, many specific to Japan, why the prosecution probability for one-click frauds is currently low. In addition, penalties for those who do get caught are also low: “One Click Frauds very often do not meet the legal tests necessary for qualifying as “fraud,” as in the vast majority of cases, the victim pays up immediately, and there is no active blackmailing effort from the miscreant.” A version of the scam that involved identity stealing would likely fall under the US Computer Fraud and Abuse Act or an equivalent, and would thus be more clearly illegal. Will this make a difference? Let’s wait and see.


February 21, 2011 at 5:30 pm 2 comments

The Linkability of Usernames: a Step Towards “Uber-Profiles”

Daniele Perito, Claude Castelluccia, Mohamed Ali Kaafar, and Pere Manils have a neat paper, How Unique and Traceable are Usernames?, that addresses the following question:

Suppose you find the same username on different online services, what is the probability that these usernames refer to the same physical person?

The background for this investigation is that there is tremendous commercial value in linking together every piece of online information about an individual. While the academic study of constructing “uber-profiles” by linking social profiles is new (see Large Online Social Footprints—An Emerging Threat for another example), commercial firms have long been scraping profiles, aggregating them, and selling them on the grey market. Well-known public-facing aggregators such as Spokeo mainly use public records, but online profiles are quickly becoming part of the game.

Paul Ohm has even talked of a “database of ruin.” No matter what moral view one takes of this aggregation, the technical questions are fascinating.

The research on Record Linkage could fill an encyclopedia (see here for a survey) but most of it studies traditional data types such as names and addresses. This paper is thus a nice complement.

Usernames are particularly useful for carrying out linkage across different sites for two reasons:

  • They are almost always available, especially on systems with pseudonymous accounts.
  • When comparing two databases of profiles, usernames are a good way to quickly find candidate matches before exploring other attributes.

The mathematical heavy-lifting that the authors do is described by the following:

… we devise an analytical model to estimate the uniqueness of a user name, which can in turn be used to assign a probability that a single username, from two different online services, refers to the same user

and

we extend this model to cases when usernames are different across many online services … experimental data shows that users tend to choose closely related usernames on different services.

For example, my Google handle is ‘randomwalker’ and my Twitter username is ‘random_walker’. Perito et al.’s model can calculate how obscure the username ‘random_walker’ is, as well as how likely it is that ‘random_walker’ is a mutation of ‘randomwalker’, and come up with a combined score representing the probability that the two accounts refer to the same person. Impressive.
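The mutation half of such a score can be approximated with plain edit distance. The sketch below is a crude heuristic in the spirit of the model, not the paper’s actual estimator (which also weighs how rare each username is):

```python
def levenshtein(a, b):
    # classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def match_score(u1, u2):
    # similarity decays with normalized edit distance;
    # 1.0 means identical, 0.0 means entirely different
    d = levenshtein(u1, u2)
    return 1.0 - d / max(len(u1), len(u2))
```

On my handles, `match_score('randomwalker', 'random_walker')` is above 0.9, since a single underscore insertion separates them; a real model would convert this into a probability by also asking how likely it is that two *different* people independently picked such similar names.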

The authors also present experimental results. For example, they find that with a sample of 20,000 usernames drawn from a real dataset, their algorithms can find the right match about 60% of the time with a negligible error rate (i.e., 40% of the time it doesn’t produce a match, but it almost never errs). That said, I find the main strength of the paper to be in the techniques more than the numbers.

Their models know all about the underlying natural language patterns, such as the fact that ‘random_walker’ is more meaningful than say ‘rand_omwalker’. This is achieved using what are called Markov models. I really like this class of techniques; I used Markov models many years ago in my paper on password cracking with Vitaly Shmatikov to model how people pick passwords.

The setting studied by Perito et al. is when two or more offline databases of usernames are available. Another question worth considering is determining the identity of a person behind a username via automated web searches. See my post on de-anonymizing Lending Club data for an empirical analysis of this.

There is a lot to be said about the psychology behind username choice. Ben Gross’s dissertation is a fascinating look at the choice of identifiers for self-representation. I myself am very attached to ‘randomwalker’; I’m not sure why that is.

A philosophical question related to this research is whether it is better to pick a unique username or a common one. The good thing about a unique username is that you stand out from the crowd. The bad thing about a unique username is that you stand out from the crowd. The question gets even more interesting (and consequential) if you’re balancing Googlability and anonymity in the context of naming your child, but that’s a topic for another day.


February 16, 2011 at 5:19 pm

A Cryptographic Approach to Location Privacy

I have a new paper, “Location Privacy via Private Proximity Testing,” with Narendran Thiagarajan, Mugdha Lakhani, Mike Hamburg and Dan Boneh. Mike spoke about it at NDSS earlier this week, where it won a distinguished paper award.

What is Private Proximity Testing?

The premise behind our paper is simple: smartphone-based location services today require you to reveal your location to the service provider. Is it possible to have at least a limited set of location services without revealing your location?

One might ask why this is useful, since your carrier tracks your location anyway. The answer is that while you might (grudgingly) trust your carrier with your location, you might not trust Facebook, Loopt, Foursquare, or whatever the newest location startup is.

We show that it is indeed possible to provide location functionality in a private manner: specifically, it is possible to do proximity testing with privacy. What this means is that a pair of friends will be automatically notified when they are nearby, but otherwise no information about their locations will be revealed to anyone.

This is a strong notion of privacy—not only does the service provider never learn your location, your friends don’t learn your location either; when you are nearby, they learn only that you are nearby. This is appropriate given the loose notion of ‘friend’ in online social networking.

Note that our concept is a natural fit for the background-service model, where the location app sits in the background and constantly monitors your location, whereas most commercial apps today use the check-in model, where explicit user action is required to transmit data or provide service. We will return to this point later.

Tessellations

[Figure: three overlapping hexagonal grids, with one grid cell highlighted.]

The way we detect when two friends are nearby is by dividing the plane[1] into a system of 3 overlapping hexagonal grids. Cryptographic protocols for “Private Equality Testing” allow a pair of users to compare if they are within the same grid cell, but otherwise reveal nothing. By repeating this protocol for each of the 3 grids, they learn if they are close to each other.
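The grid-cell computation can be sketched as follows; the cell size and grid offsets are illustrative assumptions, not the paper’s parameters:

```python
import math

def hex_cell(x, y, size, ox=0.0, oy=0.0):
    """Cell ID of point (x, y) in a pointy-top hexagonal grid with
    cell size `size`, whose origin is shifted by (ox, oy)."""
    px, py = x - ox, y - oy
    # convert to fractional axial coordinates
    q = (math.sqrt(3) / 3 * px - py / 3) / size
    r = (2 / 3 * py) / size
    # cube rounding: snap to the nearest hexagon
    cx, cz = q, r
    cy = -cx - cz
    rx, ry, rz = round(cx), round(cy), round(cz)
    dx, dy, dz = abs(rx - cx), abs(ry - cy), abs(rz - cz)
    if dx > dy and dx > dz:
        rx = -ry - rz
    elif dy > dz:
        ry = -rx - rz
    else:
        rz = -rx - ry
    return (rx, rz)

def cells(x, y, size=250.0):
    # three grids shifted against each other, so that two nearby points
    # land in the same cell of at least one grid (offsets illustrative)
    return [hex_cell(x, y, size, ox, oy)
            for ox, oy in [(0, 0), (size / 2, size / 3), (size, 2 * size / 3)]]
```

Each friend computes their three cell IDs and the pair runs Private Equality Testing once per grid; matching in any grid means “nearby,” and a mismatch reveals nothing further.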

For details of how this works, and why simpler methods won’t work, you’ll have to read the paper.

[1] The curvature of the Earth can be ignored since the distances across which our app is intended to work are small.

Theory and Practice

My favorite aspect of this paper is that our research spans the spectrum from math to implementation. This is something that Stanford CS is especially good at.

On the theory front, our contributions were mainly new Private Equality Testing algorithms. Not quite brand-new, but optimizations of existing algorithms. At one point we were really excited about having come up with an algorithm based on an improvement to an arcane complexity-theoretic result called Barrington’s theorem, and were looking forward to what would almost certainly have been the first time ever that it had been implemented in actual software! Unfortunately we later found a more efficient algorithm that used much more prosaic math.

Location tags: because every point in space-time has a fingerprint

On to a completely different part of the paper. Think about all the electromagnetic waves and signals floating around us all the time, varying from point to point, constantly changing and carrying data—GPS, GSM, Bluetooth, WiFi, and many, many others. By extracting entropy from these signals, everyone at a given place at a given time has a shared secret—unpredictable if you’re not at the right place at the right time. Think of the possibilities!

We call these shared secrets location tags. The catch is that the tags extracted by two people are largely equal, but not exactly. What we show in the paper is a cryptographic version of error correction that enables using these approximately-equal secrets as if they were exactly equal. Location tags were introduced by my co-author Boneh and others in an earlier paper; we adapted their work to enable the idea of a shared secret for each time and place.
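The flavor of that error correction can be conveyed with a toy code-offset secure sketch built on a repetition code. This is an illustration only: real fuzzy extractors use far more efficient codes, and this sketch omits the privacy analysis entirely.

```python
import secrets

K = 5  # repetition factor: majority vote corrects up to 2 flips per block

def encode(bits):
    # repetition-code encoder: repeat each bit K times
    return [b for b in bits for _ in range(K)]

def decode(code):
    # majority-vote decoder over each block of K bits
    return [1 if sum(code[i * K:(i + 1) * K]) > K // 2 else 0
            for i in range(len(code) // K)]

def sketch(w):
    # code-offset construction: publish s = w XOR c for a random codeword c
    c = encode([secrets.randbits(1) for _ in range(len(w) // K)])
    return [wi ^ ci for wi, ci in zip(w, c)]

def recover(w_noisy, s):
    # w' XOR s = c XOR noise; decode back to c, then c XOR s = original w
    c = encode(decode([wi ^ si for wi, si in zip(w_noisy, s)]))
    return [ci ^ si for ci, si in zip(c, s)]
```

If Alice publishes `sketch(w)` of her location tag `w`, then Bob, holding a tag that differs in only a few positions, recovers `w` exactly and both can hash it (say, with SHA-256) into the same shared key; anyone without a close-enough tag learns nothing useful.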

There are many possible uses for location tags. We use them to ensure that it isn’t possible to spoof your location and try to “cheat.” This is a big problem for Foursquare for example. Here’s another possible use: let’s say a conference wants to have an encrypted chatroom. Instead of handing out keys or passwords—insecure and inconvenient—how about automatically extracting the key from the audio of the conference room! This restricts access to those in the room, and also has forward secrecy, since there are no long-term keys.

This part of our paper is theoretical. We did the math but didn’t build it. The main limitation is the ability of phone hardware to extract location tags. Currently the main viable method is using WiFi traffic; we showed experimentally that robust tags can be extracted within a few seconds. We’re confident that as hardware improves, location tag-based cryptography will become more practical.

Adoption. We talked to both Google and Facebook about adopting our technology in their products. Their responses were lukewarm at best. One barrier seemed to be that current services are committed to the check-in model, whereas our method only works in the background-service model. Ironically, I believe that a major reason the check-in model won (even though Loopt, which took the early lead, was a background app) was privacy—users weren’t comfortable broadcasting their location to their service provider and their friends all the time.

While that was somewhat disappointing, the applicability of our research extends well beyond the consumer web, for example in enterprise or even military settings. Imagine a product manager who wants to track who is attending which events, but wants to guarantee employees that no other information is being collected. Our app is a perfect fit for this scenario.

We’re happy that our ideas are out there, and are always looking to talk to people in the industry who might be interested in making our concept and prototype a reality.

Special thanks to students Frank Wang and Kina Winoto for helping us with the implementation.

There are more blog posts in the pipeline related to this paper. For one, I learnt a lot about the challenges of trying to get crypto adopted in the real world. For another, I’m very excited about a sub-project of this paper called SocialKeys that aims to make encryption transparent, largely eliminating the idea of key management from the user perspective. Stay tuned!


February 14, 2011 at 4:28 pm

The Unsung Success of CAN-SPAM

In today’s debate around Do Not Track, detractors frequently make a comparison to the CAN-SPAM Act and how it failed to stop spam. Indeed, in 2010 an average of 183 billion spam emails were sent per day, so clearly the law was thoroughly ineffective.

Or was it?

Decrying the effect of CAN-SPAM by looking at the total number, or even the percentage, of spam emails betrays a lack of understanding of what the Act was intended to do and how laws operate in general. Clearly, the Act does nothing to deter spammers in Ukraine or China; it’s not like the legislators were oblivious to this. To understand the positive effects that CAN-SPAM has had, it is necessary to go back to 2003 and see why spam filters weren’t working very well back then.

[Figure: an email represented as a point in feature space. Typically thousands of dimensions are used, but only three are shown here.]

To a first approximation, a spam filter, like all machine learning-based “classifiers,” works by representing an email as a point in a multi-dimensional space and looking at which side of a surface (such as a “hyperplane”) it falls on. The hyperplane is “learned” by looking at already-classified emails. When you click the “report spam” button, you’re “training” this classifier, and it tweaks the hyperplane to become slightly more accurate in the future.

For emails that look obviously like spam, the classifier will never make a mistake, no matter how many millions of them it sees. The emails that it has trouble with are those that have some properties of ham and some properties of spam — those close to the boundary.
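The learning step can be illustrated with a toy perceptron over made-up features (link count, fraction of capital letters, presence of an unsubscribe link). Real filters use vastly richer features and better algorithms; this is just the hyperplane-tweaking idea in miniature:

```python
# toy labeled emails: ([num_links, caps_fraction, has_unsubscribe], is_spam)
TRAIN = [
    ([5, 0.8, 1], 1), ([7, 0.9, 1], 1), ([6, 0.7, 0], 1),   # spam
    ([0, 0.1, 0], 0), ([1, 0.0, 0], 0), ([0, 0.2, 1], 0),   # ham
]

def train(data, epochs=20, lr=0.1):
    w, b = [0.0, 0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in data:
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            err = y - pred   # each "report spam" click is one such correction
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b

def classify(x, w, b):
    # which side of the learned hyperplane does the email fall on?
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
```

After training, an email far on the spammy side of the hyperplane (many links, mostly capitals) is classified 1, and an obviously hammy one 0; the hard cases are the points near the boundary, which is exactly the essay’s point.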

It is difficult for spammers to make their emails look legitimate, because ultimately they need to sell you a penis-enlargement product or whatever other scam they’re peddling. Back in the good old days when spam filters were hand-coded, they’d use tricks like replacing the word Viagra with Vi@gra. But the magic of machine learning ensures that modern filters will automatically update themselves very quickly.

Ham that looks like spam is much more of a problem. E-mail marketing is a grey area, and marketers will do anything they can to entice you to open their messages. Why honestly title your email “October widget industry newsletter” when you can instead title it “You gotta check this out!!”  Compounding this problem is the fact that people get much more upset by false positives (legitimate messages getting lost) than false negatives (spam getting through to inbox).

It now becomes obvious how CAN-SPAM made honest people honest (and the bad guys easier to prosecute) and how that changed the game. The rules basically say, “don’t lie.” If you look at a corpus of email today, you’ll find that the spectrum that used to exist is gone — there’s obviously legitimate e-mail (that intends to comply) and obviously illegitimate e-mail (that doesn’t care). The blue dots in the picture have been forced to migrate up — or risk being in violation. As you can imagine, spam filters have a field day in this type of situation.

And I can prove it. Instead of looking at how much spam is sent, let’s look at how much spam is getting through. Obviously this is harder to measure, but there is a simple proxy: search volume. The logic is straightforward: people who have a spam problem will search for it, in the hope of doing something about it.

[Figure: search volume for spam-related queries over time. Note: data is not available before 2004.]

A-ha! A five-fold decrease since CAN-SPAM was passed. That doesn’t prove that the decrease is necessarily due to the Act, but it does prove that those who claim spam is still a major problem have no clue what they’re talking about.

There’s unsolicited email that is legitimate under CAN-SPAM; most people would consider these to be spam as well. Here’s where another provision of the Act comes in: one-click unsubscribe. Michael Dayah reports on an experiment showing that for this type of spam, unsubscription is almost completely effective.

Incidentally, his view of CAN-SPAM concurs with mine:

The CAN-SPAM act then strongly bifurcated spammers. Some came into the light and followed the rules, using relevant subjects, no open relays, understandable language, and an unsubscribe link that supposedly functioned. Others went underground, doing their best to skirt the content filtering with nonsense text and day-old Chinese landing domains.

I would go so far as to say that the Act is a good model for the interplay between law and technology in solving a difficult problem. I’m not sure to what extent the lawmakers anticipated the developments that followed its passage, but CAN-SPAM is completely undeserving of the negative, even derisive reputation that it has acquired.


December 20, 2010 at 5:37 pm

An Academic Wanders into Washington D.C.

I was on a Do Not Track panel in Washington D.C. last week. I spent a day in the city and had many informal conversations with policy people. It was fascinating to learn from close range how various parts of the Government work. If I could sum it up in a single phrase, it would be “so many smart people, so many systemic problems.”

What follows is obviously the account of an outsider, and I’m sure there are many nuances I’ve missed. That said, an outsider’s view can sometimes provide fresh perspective. So without further ado, here are some of my observations.

A deep chasm. Techies are by and large oblivious to what goes on in D.C., and have a poor mental picture of what regulators are or aren’t involved in. For example, I attended part of a talk on antitrust concerns around the Google search algorithm, and it blew my mind to realize that something techies think of as their playground comes under serious regulatory scrutiny. (I hear the Google antitrust issue is really big in the EU, and the US is catching up.) Equally, the policy world is quite lacking in tech expertise.

Libertarian influence. While the libertarian party is not mainstream in the US, libertarian think tanks and lobbying groups exercise significant influence in D.C. While that gladdens me as a libertarian, one unfortunate thing that appears to be common to all think tanks is toeing the party line at the expense of critical thinking. I’m not sure there can be a market failure so complete that libertarian groups will consider acknowledging the need for some government intervention.

A new kind of panel. The panel I attended was very different from what I’m used to. In a scientific or technical panel, there is an underlying truth even if the participants may disagree about some things. Policy panels seem to be very different: each participant represents a group that has an entrenched position and there is no scope for actual debate or any possibility of changing one’s mind. The panel is instead a forum for the speakers to state their respective positions for the benefit of the media and the public. There is nothing wrong with this, but it took me a while to grasp.

Lobbyists. The American public is deeply concerned about the power of lobbyists. But lobbyists perform the valuable function of providing domain expertise to legislators and regulators. Of course, the problem is that they also have the role of trying to get favorable treatment for the industry groups they represent, and these roles cannot be disentangled.

The solution is to increase the power of the counterweights to lobbyists, i.e., consumer advocates, environmental groups etc. A loose analogy is that if we’re worried about wealthy individuals getting better treatment from the judicial system, the answer is not to get rid of lawyers, but to improve the quality of public prosecutors and defenders. I don’t know if the lobbyist imbalance can ever be completely eliminated, but I think it can be drastically mitigated.

A humble suggestion. Generalizing my experiences in the tech field, I suspect that the Government lacks domain expertise in virtually every area, hence the dependence on lobbyists. If only more academics were to get involved in policy, it seems to me that it would solve both problems mentioned above — it would address the lack of expertise and it would shift the balance of advocacy in favor of consumers. (There are certainly many law scholars involved in policy, but I’m thinking more of scientists and social scientists here — those who have domain knowledge.)

To reiterate, I believe that a greater involvement of academics in policy has the potential to hugely improve how government works. But how do we make that happen? I have a couple of suggestions. First, government people seem to have a tendency to listen to whoever talks the loudest in Washington; instead, they should make an effort to seek out people with actual expertise. Second, I hope academics will take into account benefits like increased visibility and consider moonlighting in policy circles.

Thanks to Joe Calandrino for comments on a draft.


December 6, 2010 at 7:38 pm

Web Crawlers and Privacy: The Need to Reboot Robots.txt

This is a position paper I co-authored with Pete Warden and will be discussing at the upcoming IAB/IETF/W3C Internet privacy workshop this week.


Privacy norms, rules and expectations in the real world go far beyond the “public/private” dichotomy. Yet in the realm of web crawler access control, we are tied to this binary model via the robots.txt allow/deny rules. This position paper describes some of the resulting problems and argues that it is time for a more sophisticated standard.

The problem: privacy of public data. The first author has argued that individuals often expect privacy constraints on data that is publicly accessible on the web. Some examples of such constraints relevant to the web-crawler context are:

  • Data should not be archived beyond a certain period (or at all).
  • Crawling a small number of pages is allowed, but large-scale aggregation is not.
  • “Linkage of personal information to other databases is prohibited.

Currently there is no way to specify such restrictions in a machine-readable form. As a result, sites resort to hacks such as identifying and blocking crawlers whose behavior they don’t like, without clearly defining acceptable behavior. Other sites specify restrictions in the Terms of Service and bring legal action against violators. This is clearly not a viable solution — for operators of web-scale crawlers, manually interpreting and encoding the ToS restrictions of every site is prohibitively expensive.

There are two reasons why the problem has become pressing: first, there is an ever-increasing quantity of behavioral data about users that is valuable to marketers — in fact, there is even a black market for this data — and second, crawlers have become very cheap to set up and operate.

The desire for control over web content is by no means limited to user privacy concerns. Publishers concerned about copyright are equally in search of a better mechanism for specifying fine-grained restrictions on the collection, storage and dissemination of web content. Many site owners would also like to limit the acceptable uses of data for competitive reasons.

The solution space. Broadly, there are three levels at which access/usage rules may be specified: site-level, page-level and DOM element-level. Robots.txt is an example of a site-level mechanism, and one possible solution is to extend robots.txt. A disadvantage of this approach, however, is that the file may grow too large, especially on sites with user-generated content that may wish to specify per-user policies.

A page-level mechanism thus sounds much more suitable. While there is already a “robots” attribute to the META tag, it is part of the robots.txt specification and has the same limitations on functionality. A different META tag is probably an ideal place for a new standard.
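To make the idea concrete, a page-level policy might express retention and aggregation constraints in a single tag. The name and syntax below are purely hypothetical, for illustration only; no such standard exists:

```html
<!-- hypothetical syntax, not an existing standard -->
<meta name="data-policy"
      content="noarchive; max-retention=30d; noaggregate; nolink">
```

A compliant crawler would read this once per page, which keeps the parsing overhead far below that of DOM element-level tagging.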

Taking it one step further, tagging at the DOM element-level using microformats to delineate personal information has also been proposed. A possible disadvantage of this approach is the overhead of parsing pages that crawlers will have to incur in order to be compliant.

Conclusion. While the need to move beyond the current robots.txt model is apparent, it is not yet clear what should replace it. The challenge in developing a new standard lies in accommodating the diverse requirements of website operators and precisely defining the semantics of each type of constraint without making it too cumbersome to write a compliant crawler. In parallel with this effort, the development of legal doctrine under which the standard is more easily enforceable is likely to prove invaluable.


December 5, 2010 at 7:54 pm



About 33bits.org

I’m an associate professor of computer science at Princeton. I research (and teach) information privacy and security, and moonlight in technology policy.

This is a blog about my research on breaking data anonymization, and more broadly about information privacy, law and policy.

For an explanation of the blog title and more info, see the About page.
