Web Privacy Measurement: Genesis of a Community

Last week I participated in the Web Privacy Measurement conference at Berkeley. It was a unique event because the community is quite new and this was our very first gathering. The WSJ Data Transparency hackathon is closely related; the Berkeley conference can be thought of as an academic counterpart. So it was doubly fascinating for me — both for the content and because of my interest in the sociology of research communities.

A year ago I explained that there is an information asymmetry when it comes to online privacy, leading to a “market for lemons.” The asymmetry exists for two main reasons: one is that companies don’t disclose what data they collect about you and what they do with it; the second is that even if they do, end users don’t have the capacity to aggregate and process that information and make decisions on the basis of it.

The Web Privacy Measurement community essentially exists to mitigate this asymmetry. The primary goal is to ferret out what is happening to your data online, and a secondary one is making this information useful by pushing for change, building tools for opt-out and control, comparison of different players, etc. The size of the community is an indication of how big the problem has gotten.

Before anyone starts trotting out the old line, “see, the market can solve everything!”, let me point out that the event schedule demonstrates, if anything, the opposite. The majority of what is produced here is intended wholly or partly for the consumption of regulators. Like many others, I found the “What privacy measurement is useful for policymakers?” panel to be the most interesting one. And let’s not forget that most of this is Government-funded research to begin with.

This community is very different from the others that I’ve belonged to. The mix of backgrounds is extraordinary: researchers mainly from computing and law, and a small number from other disciplines. Most of the researchers are academics, but a few work for industrial research labs, a couple are independent, and one or two work in Government. There were also people from companies that make privacy-focused products/services, lawyers, hobbyists, scholars in the humanities, and ad-industry representatives. Overall, the community has a moderately adversarial relationship with industry, naturally, and a positive relationship with the press, regulators and privacy advocates.

The make-up is somewhat similar to the (looser-knit) group of researchers and developers building decentralized architectures for personal data, a direction that my coauthors and I have taken a skeptical view of in this recent paper. In both cases, the raison d’être of the community is to correct the imbalance of power between corporations and the public. There is even some overlap between the two groups of people.

The big difference is that the decentralization community, typified by Diaspora, mostly tries to mount a direct challenge and overthrow the existing order, whereas our community is content to poke, measure, and expose, and hand over our findings to regulators and other interested parties. So our potential upside is lower — we’re not trying to put a stop to online tracking, for example — but the chance that we’ll succeed in our goals is much higher.

Exciting times. I’m curious to see how things evolve. But this week I’m headed to PLSC, which remains my favorite privacy-related conference.

Thanks to Aleecia McDonald for reviewing a draft.

To stay on top of future posts, subscribe to the RSS feed or follow me on Google+.

June 4, 2012 at 8:47 am

Selfish Reasons to do Peer Review, and Other Program Committee Observations

I’ve been on several program committees in the last year and a half. As I’ve written earlier, getting a behind-the-scenes look at how things work significantly improved my perception of research and academia. This post is a more elaborate set of observations based on my experience. It is targeted both at my colleagues, in the hope of starting a discussion, and at outsiders, as a continuation of my series on explaining how the scientific community functions (that began with the post linked above).

Benefits of doing peer review. Peer review is often considered a burden that one grudgingly accepts in order to keep the system working. But in my experience, especially for a junior researcher, the effort is well worth the time.

The most obvious advantage of being on a PC is that it forces you to read papers. Now if you’re the type that never needs external motivation to get things accomplished, this wouldn’t matter to you — you’d do literature study on a regular basis anyway. But many of us aren’t that disciplined; I’m certainly not.

There are also insights you get that you can’t reproduce by having perfect self-discipline. PC work gives you a raw, unfiltered look into the research that people have chosen to work on. This is a 6-month-or-so head start for getting on top of emerging trends compared to only reading published papers. You also get a better idea of common pitfalls to avoid.

Finally, peer review is one of the rare opportunities to read papers critically (it is harder with published work because it doesn’t have as many loopholes). This is not a natural skill for most people — our cognitive biases predispose us to confuse good rhetoric with sound logic.

Which type of meeting? I’ve been on PCs with all three types of discussions: physical meetings, phone meetings and online. I think it’s important to have a meeting, whether physical or phone. I learn a lot, and the outcome feels fairer. Besides, quite often one reviewer is able to point out something the others have missed. Chairs of online-only PCs do try to elicit some interaction between reviewers, but for hard-to-explain but easy-to-understand reasons, the bandwidth in an interactive meeting tends to be much higher.

Phone meetings are suitable for smaller conferences and workshops. In my experience, members mostly tend to go on mute and tune out except when the papers they reviewed are being discussed. I don’t necessarily see a problem with this.

In physical meetings, I’ve found that members often make comments or voice opinions on papers they haven’t really read. I don’t think this is in the best interest of fair reviewing (although I’ve heard a contrary opinion). I wonder if a strategy involving smaller breakout groups would be more effective.

The one advantage of not having a meeting is of course that it saves time. I’ve found that the time commitment for the meeting is about a third of the reviewing time (for both physical and phone meetings), which I don’t consider to be too much of a burden given the improved outcomes.

Overall, my experience from these meetings is that members act professionally for the most part without egos or emotions getting in the way. While there is inevitably some randomness in the process, I believe that the horror stories of careless reviewers — everyone has at least one to narrate — are exaggerated. One possible reason for this misunderstanding is that there is a lot that’s discussed at meetings after the reviews are written, and often this feedback doesn’t make it into the reviews.

Problem areas. Finally, here are some aspects of PCs that I think could be improved. I have deliberately omitted the most common problems (such as an untenable number of submissions and low acceptance rates) that everybody knows and talks about. Instead, these are less frequently discussed yet (IMO) fairly important issues.

Lost reviews. Since reviewers aren’t perfect, sometimes bad papers with persistent authors manage to get published by being resubmitted to other venues until they hit a relatively sloppy panel of reviewers. The reason this works (when it does) is that past reviews of a recycled paper are “lost”. This is a shame; it wastes reviewer effort and lowers the overall quality of publications.

Community boundaries. As a reviewer I’ve started to realize how difficult it is to publish in other communities’ venues. As an example, at security conferences we often see papers by outsiders that have something useful to say, but are unfortunately inadequately familiar with the “central dogma” of crypto/security research, namely adversarial thinking. [1] While I can see the temptation to reject these papers with a cursory note, I think we should be patient with these people, explain how we do things and if possible offer to work with them to improve the paper.

Unfruitful directions. Sometimes research directions don’t pan out, either because the world has moved on and the underlying assumptions are no longer true, or because the technical challenges are too hard. But researchers naturally resist having to change their research area, and so there are lots of papers written on topics that stopped being relevant years ago. The reason these papers keep getting published is that they are assigned for review to other people working in the same area. I’ve seen program chairs make an effort to push back on this, but the current situation is far from optimal.

In conclusion, my opinion is that peer review in my community is a relatively well-functioning process, albeit with a lot of scope for improvement. I believe this improvement can be accomplished in an evolutionary way without having to change anything too radically.

[1] The crypto/security community essentially derives its identity from adversarial thinking. Incidentally, I feel that it is not always suitable for privacy, which is why I believe computer scientists who study privacy should stop viewing ourselves as a subset of the security community.

May 2, 2012 at 9:37 am

A Critical Look at Decentralized Personal Data Architectures

I have a new paper with the above title, currently under peer review, with Vincent Toubiana, Solon Barocas, Helen Nissenbaum and Dan Boneh (the Adnostic gang). We argue that distributed social networking, personal data stores, vendor relationship management, etc. — movements that we see as closely related in spirit, and which we collectively term “decentralized personal data architectures” — aren’t quite the panacea that they’ve been made out to be.

The paper is only a synopsis of our work so far — in our notes we have over 80 projects, papers and proposals that we’ve studied, so we intend to follow up with a more complete analysis. For now, our goal is to kick off a discussion and give the community something to think about. The paper was a lot of fun to write, and we hope you will enjoy reading it. We recognize that many of our views and conclusions may be controversial, and we welcome comments.

Abstract:

While the Internet was conceived as a decentralized network, the most widely used web applications today tend toward centralization. Control increasingly rests with centralized service providers who, as a consequence, have also amassed unprecedented amounts of data about the behaviors and personalities of individuals.

Developers, regulators, and consumer advocates have looked to alternative decentralized architectures as the natural response to threats posed by these centralized services.  The result has been a great variety of solutions that include personal data stores (PDS), infomediaries, Vendor Relationship Management (VRM) systems, and federated and distributed social networks.  And yet, for all these efforts, decentralized personal data architectures have seen little adoption.

This position paper attempts to account for these failures, challenging the accepted wisdom in the web community on the feasibility and desirability of these approaches. We start with a historical discussion of the development of various categories of decentralized personal data architectures. Then we survey the main ideas to illustrate the common themes among these efforts. We tease apart the design characteristics of these systems from the social values that they (are intended to) promote. We use this understanding to point out numerous drawbacks of the decentralization paradigm, some inherent and others incidental. We end with recommendations for designers of these systems for working towards goals that are achievable, but perhaps more limited in scope and ambition.

February 21, 2012 at 8:27 am

Is Writing Style Sufficient to Deanonymize Material Posted Online?

I have a new paper appearing at IEEE S&P with Hristo Paskov, Neil Gong, John Bethencourt, Emil Stefanov, Richard Shin and Dawn Song on Internet-scale authorship identification based on stylometry, i.e., analysis of writing style. Stylometric identification exploits the fact that we all have a ‘fingerprint’ based on our stylistic choices and idiosyncrasies with the written word. To quote from my previous post speculating on the possibility of Internet-scale authorship identification:

Consider two words that are nearly interchangeable, say ‘since’ and ‘because’. Different people use the two words in a differing proportion. By comparing the relative frequency of the two words, you get a little bit of information about a person, typically under 1 bit. But by putting together enough of these ‘markers’, you can construct a profile.

The basic idea that people have distinctive writing styles is very well-known and well-understood, and there is an extremely long line of research on this topic. This research began in modern form in the early 1960s when statisticians Mosteller and Wallace determined the authorship of the disputed Federalist Papers, and were featured in TIME magazine. It is never easy to make a significant contribution in a heavily studied area. No surprise, then, that my initial blog post was written about three years ago, and the Stanford-Berkeley collaboration began in earnest over two years ago.

Impact. So what exactly did we achieve? Our research has dramatically increased the number of authors that can be distinguished using writing-style analysis: from about 300 to 100,000. More importantly, the accuracy of our algorithms drops off gently as the number of authors increases, so we can be confident that they will continue to perform well as we scale the problem even further. Our work is therefore the first time that stylometry has been shown to have serious implications for online anonymity.[1]

Anonymity and free speech have been intertwined throughout history. For example, anonymous discourse was essential to the debates that gave birth to the United States Constitution. Yet a right to anonymity is meaningless if an anonymous author’s identity can be unmasked by adversaries. While there have been many attempts to legally force service providers and other intermediaries to reveal the identity of anonymous users, courts have generally upheld the right to anonymity. But what if authors can be identified based on nothing but a comparison of the content they publish to other web content they have previously authored?

Experiments. Our experimental methodology is set up to directly address this question. Our primary data source was the ICWSM 2009 Spinn3r Blog Dataset, a large collection of blog posts made available to researchers by Spinn3r.com, a provider of blog-related commercial data feeds. To test the identifiability of an author, we remove k random posts (typically 3) from the corresponding blog, treat those posts as if they were anonymous, and apply our algorithm to try to determine which blog they came from. In these experiments, the labeled (identified) and unlabeled (anonymous) texts are drawn from the same context. We call this post-to-blog matching.
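The post-to-blog setup described above can be sketched in a few lines. This is only an illustration under assumptions: the function name `split_blog`, the fixed seed and the toy post list are hypothetical and are not taken from the paper.

```python
import random

def split_blog(posts, k=3, seed=0):
    """Hold out k random posts as 'anonymous' text; the rest stay labeled."""
    if len(posts) <= k:
        raise ValueError("need more than k posts so some labeled text remains")
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    held_out = set(rng.sample(range(len(posts)), k))
    anonymous = [p for i, p in enumerate(posts) if i in held_out]
    labeled = [p for i, p in enumerate(posts) if i not in held_out]
    return labeled, anonymous

# Hypothetical toy blog with six posts.
posts = [f"post {i}" for i in range(6)]
labeled, anonymous = split_blog(posts, k=3)
```

The classifier would then be trained on `labeled` (together with every other blog) and asked which blog the `anonymous` posts came from.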

In some applications of stylometric authorship recognition, the context for the identified and anonymous text might be the same. This was the case in the famous study of the Federalist Papers: each author hid his name from some of his papers, but wrote about the same topic. In the blogging scenario, an author might decide to selectively distribute a few particularly sensitive posts anonymously through a different channel. But in other cases, the unlabeled text might be political speech, whereas the only available labeled text by the same author might be a cooking blog, i.e., the labeled and unlabeled text might come from different contexts. Context encompasses much more than topic: the tone might be formal or informal; the author might be in a different mental state (e.g., more emotional) in one context versus the other, etc.

We feel that it is crucial for authorship recognition techniques to be validated in a cross-context setting. Previous work has fallen short in this regard because of the difficulty of finding a suitable dataset. We were able to obtain about 2,000 pairs (and a few triples, etc.) of blogs, each pair written by the same author, by looking at a dataset of 3.5 million Google profiles and searching for users who listed more than one blog in the ‘websites’ field.[2] We are thankful to Daniele Perito for sharing this dataset. We added these blogs to the Spinn3r blog dataset to bring the total to 100,000. Using this data, we performed experiments as follows: remove one of a pair of blogs written by the same author, and use it as unlabeled text. The goal is to find the other blog written by the same author. We call this blog-to-blog matching. Note that although the number of blog pairs is only a few thousand, we match each anonymous blog against all 99,999 other blogs.

Results. Our baseline result is that in the post-to-blog experiments, the author was correctly identified 20% of the time. This means that when our algorithm uses three anonymously published blog posts to rank the possible authors in descending order of probability, the top guess is correct 20% of the time.

But it gets better from there. In 35% of cases, the correct author is one of the top 20 guesses. Why does this matter? Because in practice, algorithmic analysis probably won’t be the only step in authorship recognition, and will instead be used to produce a shortlist for further investigation. A manual examination may incorporate several characteristics that the automated analysis does not, such as choice of topic (our algorithms are scrupulously “topic-free”). Location is another signal that can be used: for example, if we were trying to identify the author of the once-anonymous blog Washingtonienne we’d know that she almost certainly resides in or around Washington, D.C. Alternately, a powerful adversary such as law enforcement may require Blogger, WordPress, or another popular blog host to reveal the login times of the top suspects, which could be correlated with the timing of posts on the anonymous blog to confirm a match.
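To make the difference between "top guess" and "top 20 guesses" concrete, here is a small sketch of how such a top-k accuracy could be computed. The rankings and author names are made-up toy data, and `top_k_accuracy` is a hypothetical helper rather than anything from the paper.

```python
def top_k_accuracy(rankings, true_authors, k):
    """Fraction of trials whose true author appears among the first k guesses."""
    hits = sum(1 for ranked, truth in zip(rankings, true_authors) if truth in ranked[:k])
    return hits / len(true_authors)

# Hypothetical ranked guesses for four anonymous samples (best guess first).
rankings = [
    ["alice", "bob", "carol"],
    ["bob", "alice", "carol"],
    ["carol", "bob", "alice"],
    ["bob", "carol", "alice"],
]
true_authors = ["alice", "alice", "alice", "carol"]

top1 = top_k_accuracy(rankings, true_authors, 1)  # only the first guess counts
top2 = top_k_accuracy(rankings, true_authors, 2)  # a shortlist of two counts
```

On this toy data the top guess is right for one sample in four, while a shortlist of two contains the right author for three in four, which is the same effect as the 20% versus 35% figures above, at a much smaller scale.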

We can also improve the accuracy significantly over the baseline of 20% for authors for whom we have more than an average number of labeled or unlabeled blog posts. For example, with 40–50 labeled posts to work with (the average is 20 posts per author), the accuracy goes up to 30–35%.

An important capability is confidence estimation, i.e., modifying the algorithm to also output a score reflecting its degree of confidence in the prediction. We measure the efficacy of confidence estimation via the standard machine-learning metrics of precision and recall. We find that we can improve precision from 20% to over 80% with only a halving of recall. In plain English, what these numbers mean is: the algorithm does not always attempt to identify an author, but when it does, it finds the right author 80% of the time. Overall, it identifies 10% (half of 20%) of authors correctly, i.e., 10,000 out of the 100,000 authors in our dataset. Strong as these numbers are, it is important to keep in mind that in a real-life deanonymization attack on a specific target, it is likely that confidence can be greatly improved through methods discussed above — topic, manual inspection, etc.
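The precision/recall trade-off can be illustrated with a toy threshold rule: only guesses whose confidence clears the threshold count as attempts. The scores below are invented, and the simple cutoff is an assumption made for illustration, not the confidence estimator from the paper.

```python
def precision_recall(predictions, threshold):
    """predictions: list of (confidence, is_correct) pairs, one per anonymous author.

    Only guesses with confidence >= threshold count as attempted.
    """
    attempted = [ok for conf, ok in predictions if conf >= threshold]
    correct = sum(attempted)
    precision = correct / len(attempted) if attempted else 0.0
    recall = correct / len(predictions)
    return precision, recall

# Hypothetical scores for ten authors: two of the ten top guesses are right (20%).
predictions = [
    (0.9, True), (0.8, False), (0.4, True), (0.3, False), (0.3, False),
    (0.2, False), (0.2, False), (0.1, False), (0.1, False), (0.1, False),
]
baseline = precision_recall(predictions, threshold=0.0)    # attempt everything
selective = precision_recall(predictions, threshold=0.85)  # attempt only when confident
```

Applying the same arithmetic to the numbers quoted above: 10,000 correct identifications at 80% precision would correspond to roughly 12,500 attempted identifications out of 100,000 authors, and fewer if precision is above 80%.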

We confirmed that our techniques work in a cross-context setting (i.e., blog-to-blog experiments), although the accuracy is lower (~12%). Confidence estimation works really well in this setting as well and boosts accuracy to over 50% with a halving of recall. Finally, we also manually verified that in cross-context matching we find pairs of blogs that are hard for humans to match based on topic or writing style; we describe three such pairs in an appendix to the paper. For detailed graphs as well as a variety of other experimental results, see the paper.

We see our results as establishing early lower bounds on the efficacy of large-scale stylometric authorship recognition. Having cracked the scale barrier, we expect accuracy improvements to come easier in the future. In particular, we report experiments in the paper showing that a combination of two very different classifiers works better than either, but there is a lot more mileage to squeeze from this approach, given that ensembles of classifiers are known to work well for most machine-learning problems. Also, there is much work to be done in terms of analyzing which aspects of writing style are preserved across contexts, and using this understanding to improve accuracy in that setting.

Techniques. Now let’s look in more detail at the techniques I’ve hinted at above. The author identification task proceeds in two steps: feature extraction and classification. In the feature extraction stage, we reduce each blog post to a sequence of about 1,200 numerical features (a “feature vector”) that acts as a fingerprint. These features fall into various lexical and grammatical categories. Two example features: the frequency of uppercase words, the number of words that occur exactly once in the text. While we mostly used the same set of features that the authors of the Writeprints paper did, we also came up with a new set of features that involved analyzing the grammatical parse trees of sentences.

An important component of feature extraction is to ensure that our analysis is purely stylistic. We do this in two ways: first, we preprocess the blog posts to filter out signatures, markup, or anything that might not be directly entered by a human. Second, we restrict our features to those that bear little resemblance to the topic of discussion. In particular, our word-based features are limited to stylistic “function words” that we list in an appendix to the paper.
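As a rough sketch of what topic-free feature extraction might look like, the snippet below computes the two example features mentioned above plus a handful of function-word frequencies. The five-word `FUNCTION_WORDS` list, the tokenizing regex and the function name are assumptions for illustration; the real feature set has about 1,200 entries and is described in the paper.

```python
import re
from collections import Counter

# Hypothetical, tiny function-word list; the paper's actual list is in its appendix.
FUNCTION_WORDS = ["since", "because", "the", "of", "and"]

def extract_features(text):
    """Return a small topic-independent feature vector for one post."""
    words = re.findall(r"[A-Za-z']+", text)
    total = len(words) or 1  # avoid dividing by zero on empty posts
    counts = Counter(w.lower() for w in words)
    # Share of all-caps words (single letters such as "I" are ignored).
    uppercase_freq = sum(1 for w in words if len(w) > 1 and w.isupper()) / total
    # Number of distinct words that occur exactly once in the text.
    hapax_count = sum(1 for c in counts.values() if c == 1)
    # Relative frequency of each function word.
    function_freqs = [counts[w] / total for w in FUNCTION_WORDS]
    return [uppercase_freq, hapax_count] + function_freqs

features = extract_features("I left early because the talk ran long, and the room was HOT.")
```

Nothing in this vector depends on what the post is about, only on how it is written, which is the point of restricting word-based features to function words.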

In the classification stage, we algorithmically “learn” a characterization of each author (from the set of feature vectors corresponding to the posts written by that author). Given a set of feature vectors from an unknown author, we use the learned characterizations to decide which author it most likely corresponds to. For example, viewing each feature vector as a point in a high-dimensional space, the learning algorithm might try to find a “hyperplane” that separates the points corresponding to one author from those of every other author, and the decision algorithm might determine, given a set of hyperplanes corresponding to each known author, which hyperplane best separates the unknown author from the rest.
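The hyperplane picture can be made concrete with a deliberately simple stand-in. The sketch below trains one perceptron per author (one-vs-rest) and picks the author whose hyperplane scores the unknown posts highest. The perceptron, the scoring rule and the two-feature toy data are assumptions chosen for brevity and are not the classifiers evaluated in the paper.

```python
def train_one_vs_rest(samples, epochs=50):
    """samples: list of (feature_vector, author). Returns {author: (weights, bias)}.

    Each author gets a hyperplane w.x + b = 0 trained by a plain perceptron
    (a stand-in classifier, not necessarily the one used in the paper).
    """
    dim = len(samples[0][0])
    models = {}
    for author in sorted({a for _, a in samples}):
        w, b = [0.0] * dim, 0.0
        for _ in range(epochs):
            for x, a in samples:
                target = 1 if a == author else -1
                score = sum(wi * xi for wi, xi in zip(w, x)) + b
                if target * score <= 0:  # misclassified: nudge the hyperplane
                    w = [wi + target * xi for wi, xi in zip(w, x)]
                    b += target
        models[author] = (w, b)
    return models

def identify(models, unknown_vectors):
    """Pick the author whose hyperplane gives the unknown posts the highest total score."""
    def total(author):
        w, b = models[author]
        return sum(sum(wi * xi for wi, xi in zip(w, x)) + b for x in unknown_vectors)
    return max(models, key=total)

# Hypothetical 2-feature vectors for two authors.
samples = [([1.0, 0.1], "alice"), ([0.9, 0.2], "alice"),
           ([0.1, 1.0], "bob"), ([0.2, 0.8], "bob")]
models = train_one_vs_rest(samples)
guess = identify(models, [[0.95, 0.15], [0.85, 0.1]])
```

With 100,000 authors and about 1,200 features the same idea needs far more careful engineering, which is where the performance work mentioned in the post comes in.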

We made several innovations that allowed us to achieve the accuracy levels that we did. First, contrary to some previous authors who hypothesized that only relatively straightforward “lazy” classifiers work for this type of problem, we were able to avoid various pitfalls and use more high-powered machinery. Second, we developed new techniques for confidence estimation, including a measure very similar to “eccentricity” used in the Netflix paper. Third, we developed techniques to improve the performance (speed) of our classifiers, detailed in the paper. This is a research contribution by itself, but it also enabled us to rapidly iterate the development of our algorithms and optimize them.
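For readers unfamiliar with eccentricity: in the Netflix paper it measures how far the best match stands out from the runner-up, in units of the standard deviation of all candidate scores. A minimal sketch of that idea follows; the score lists are invented, and the measure used in the stylometry paper is only described as "very similar", so this should not be read as its exact definition.

```python
import statistics

def eccentricity(scores):
    """Gap between the best and second-best score, in units of standard deviation."""
    ordered = sorted(scores, reverse=True)
    spread = statistics.pstdev(scores)
    if spread == 0:
        return 0.0  # all candidates tie, so there is no standout match
    return (ordered[0] - ordered[1]) / spread

# Hypothetical candidate-author scores for two anonymous samples.
clear_match = eccentricity([9.0, 2.0, 1.5, 1.0, 1.0])  # one author stands out
ambiguous = eccentricity([5.0, 4.9, 1.0, 1.0, 1.0])    # two authors nearly tie
```

A high value suggests the top guess is trustworthy; a low value suggests the algorithm should decline to name an author, which is how a confidence score turns into the precision gains described under Results.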

In an earlier article, I noted that we don’t yet have as rigorous an understanding of deanonymization algorithms as we would like. I see this paper as a significant step in that direction. In my series on fingerprinting, I pointed out that in numerous domains, researchers have considered classification/deanonymization problems with tens of classes, with implications for forensics and security-enhancing applications, but that to explore the privacy-infringing/surveillance applications the methods need to be tweaked to be able to deal with a much larger number of classes. Our work shows how to do that, and we believe that insights from our paper will be generally applicable to numerous problems in the privacy space.

Concluding thoughts. We’ve thrown open the doors for the study of writing-style based deanonymization that can be carried out on an Internet-wide scale, and our research demonstrates that the threat is already real. We believe that our techniques are valuable by themselves as well.

The good news for authors who would like to protect themselves against deanonymization is that manually changing one’s style appears to be enough to throw off these attacks. Developing fully automated methods to hide traces of one’s writing style remains a challenge. For now, few people are aware of the existence of these attacks and defenses; all the sensitive text that has already been anonymously written is also at risk of deanonymization.

[1] A team from Israel has studied authorship recognition with 10,000 authors. While this is interesting and impressive work, and bears some similarities with ours, they do not restrict themselves to stylistic analysis, and therefore the method is comparatively limited in scope. Incidentally, they have been in the news recently for some related work.

[2] Although the fraction of users who listed even a single blog in their Google profile was small, there were more than 2,000 users who listed multiple. We did not use the full number that was available.

February 20, 2012 at 9:40 am

An Update on Career Plans and Some Observations on the Nature of Research

I’ve had a wonderful time at Stanford these last couple of years, but it’s time to move on. I’m currently in the middle of my job search, looking for faculty and other research positions. In the next month or two I will be interviewing at several places. It’s been an interesting journey.

My Ph.D. years in Austin were productive and blissful. When I finished and came West, I knew I enjoyed research tremendously, but there were many aspects of research culture that made me worry if I’d fit in. I hoped my postdoc would give me some clarity.

Happily, that’s exactly what happened, especially after I started being an active participant in program committees and other community activities. It’s been an enlightening and humbling experience. I’ve come to realize that in many cases, there are perfectly good reasons why frequently-criticized aspects of the culture are just the way they are. Certainly there are still facets that are far from ideal, but my overall view of the culture of scientific research and the value of research to society is dramatically more positive than it was when I graduated.

Let me illustrate. One of my major complaints when I was in grad school was that almost nobody does interdisciplinary research (which is true — the percentage of research papers that span different disciplines is tiny). Then I actually tried doing it, and came to the obvious-in-retrospect realization that collaborating with people who don’t speak your language is hard.

Make no mistake, I’m as committed to cross-disciplinary research as I ever was (I just finished writing a grant proposal with Profs. Helen Nissenbaum and Deirdre Mulligan). I’ve gradually been getting better at it and I expect to do a lot of it in my career. But if a researcher makes a decision to stick to their sub-discipline, I can’t really fault them for that.

As another example, consider the lack of a “publish-then-filter” model for research papers, a whole two decades after the Web made it technologically straightforward. Many people find this incomprehensibly backward and inefficient. Academia.edu founder Richard Price wrote an article two days ago arguing that the future of peer review will look like a mix of Pagerank and Twitter. Three years ago, that could have been me talking. Today my view is very different.

Science is not a popularity contest; Pagerank is irrelevant as a peer-review mechanism. Basically, scientific peer review is the only process that exists for systematically separating truths from untruths. Like democracy, it has its problems, but at least it works. Social media is probably the worst analogy: it seems to be better at amplifying falsehoods than facts. Wikipedia-style crowdsourcing has its strengths, but it can be hit-or-miss.

To be clear, I think peer review is probably going to change; I would like it to be done in public, for one. But even this simple change is fraught with difficulty: how would you ensure that reviewers aren’t influenced by each other’s reviews? This is an important factor in the current system. During my program committee meetings, I came to realize just how many of these little procedures for minimizing bias are built into the system and how seriously people take the spirit of this process. Revamping peer review while keeping what works is going to be slow and challenging.

Moving on, some of my other concerns have been disappearing due to recent events. Restrictive publisher copyrights are a perfect example. I have more of a problem with this than most researchers do — I did my Master’s in India, which means I’ve been on the other side of the paywall. But it looks like that pot may finally have boiled over. I think it’s only a matter of time now before open access becomes the norm in all disciplines.

There are certainly areas where the status quo is not great and not getting any better. Today if a researcher makes a discovery that’s not significant enough to write a paper about, they choose not to share that discovery at all. Unfortunately, this is the rational behavior for a self-interested researcher, because there is no way to get credit for anything other than published papers. Michael Nielsen’s excellent book exploring the future of networked science gives me some hope that change may be on the horizon.

I hope this post has given you a more nuanced appreciation of the nature of scientific research. Misconceptions about research and especially about academia seem to be widespread among the people I talk to both online and offline; I harbored a few myself during my Ph.D., as I said earlier. So I’m thinking of doing posts like this one on a semi-regular basis on this blog or on Google+. But that will probably have to wait until after my job search is done.

February 7, 2012 at 11:05 am

Printer Dots, Pervasive Tracking and the Transparent Society

So far in the fingerprinting series, we’ve seen how a variety of objects and physical devices [1, 2, 3, 4], often even supposedly identical ones, can be uniquely fingerprinted. This article is non-technical; it is an opinion on some philosophical questions about tracking and surveillance.

Here’s a fascinating example of tracking that’s all around you but that you’re probably unaware of:

Color laser printers and photocopiers print small yellow dots on every page for tracking purposes.

My source for this is the EFF’s Seth Schoen, who has made his presentation on the subject available.

The dots are not normally visible, but can be seen by a variety of methods such as shining a blue LED flashlight, magnification under a microscope or scanning the document with a commodity scanner. The pattern of dots typically encodes the device serial number and a timestamp; some parts of the code are yet unidentified. There are interesting differences between the codes used by different manufacturers. [1] Some examples are shown in the pictures. There’s a lot more information in the presentation.

Pattern of dots from three different printers: Epson, HP LaserJet and Canon.

Schoen says the dots could have been the result of the Secret Service pressuring printer manufacturers to cooperate, going back as far as the 1980s. The EFF’s Freedom of Information Act request on the matter from 2005 has been “mired in bureaucracy.”

The EFF as well as the Seeing Yellow project would like to see these dots gone. The EFF has consistently argued against pervasive tracking. In this article on biometric surveillance, they say:

EFF believes that perfect tracking is inimical to a free society. A society in which everyone’s actions are tracked is not, in principle, free. It may be a livable society, but would not be our society.

Eloquently stated. You don’t have to be a privacy advocate to see that there are problems with mass surveillance, especially by the State. But I’d like to ask the question: can we really hope to stave off a surveillance society forever, or are efforts like the Seeing Yellow project just buying time?

My opinion is that it is impossible to put the genie back into the bottle — the cost of tracking every person, object and activity will continue to drop exponentially. I hope the present series of articles has convinced you that even if privacy advocates are successful in preventing the deployment of explicit tracking mechanisms, just about everything around you is inherently trackable. [2]

And even if we can prevent the State from setting up a surveillance infrastructure, there are undeniable commercial benefits in tracking everything that’s trackable, which means that private actors will deploy this infrastructure, as they’ve done with online tracking. If history is any indication, most people will happily allow themselves to be tracked in exchange for free or discounted services. From there it’s a simple step for the government to obtain the records of any person of interest.

If we accept that we cannot stop the invention and use of tracking technologies, what are our choices? Our best hope, I believe, is a world in which the ability to conduct tracking and surveillance is symmetrically distributed, a society in which ordinary citizens can and do turn the spotlight on those in power, keeping that power in check. On the other hand, a world in which only the government, large corporations and the rich are able to utilize these technologies, but themselves hide under a veil of secrecy, would be a true dystopia.

Another important principle is for those who do conduct tracking to be required to be transparent about it, to have social and legal processes in place to determine what uses are acceptable, and to allow opting out in contexts where that makes sense. Because ultimately what matters in terms of societal freedom is not surveillance itself, but how surveillance affects the balance of power. To be sure, the society I describe — pervasive but transparent tracking, accessible to everyone, and with limited opt-outs — would be different from ours, and would take some adjusting to, but that doesn’t make it worse than ours.

I am hardly the first to make this argument. A similar position was first prominently articulated by David Brin in his 1999 book The Transparent Society. What the last decade has shown is just how inevitable pervasive tracking is. For example, Brin focused too much on cameras and assumed that tracking people indoors would always be infeasible. That view seems almost quaint today.

Let me be clear: I have absolutely no beef with efforts to oppose pervasive tracking. Even if being watched all of the time is our eventual destiny, society won’t be ready for it any time soon — these changes take decades if not generations. The pace at which the industry wants to make us switch to “living in public” is far faster than we’re capable of. Buying time is therefore extremely valuable.

That said, embracing the Transparent Society view has important consequences for civil libertarians. It suggests working toward an achievable if sub-optimal goal instead of an ideal but impossible one. It also suggests that the “democratization of surveillance” should be encouraged rather than feared.

Here are some currently hot privacy and civil-liberties issues that I think will have a significant impact on the distribution of power in a ubiquitous-surveillance society: the right to videotape on-duty police officers and other public officials, transparent government initiatives including FOIA requests, and closer to my own interests, the Do Not Track opt-out mechanism, and tools like FourthParty which have helped illuminate the dark world of online tracking.

Let me close by calling out one battle in particular. Throughout this series, we’ve seen that fingerprinting techniques have security-enhancing applications (such as forensics), as well as privacy-infringing ones, but that most research papers on fingerprinting consider only the former. I believe the primary reason is that funding is for the most part available only for the former type of research and not for the latter. However, we need a culture of research into privacy-infringing technologies, whether funded by federal grants or otherwise, in order to achieve the goals of symmetry and transparency in tracking.

[1] Note that this is just an encoding and not encryption. The current system allows anyone to read the dots; public-key encryption would allow at least nominally restricting the decoding ability to only law-enforcement personnel, but there is no evidence that this is being done.

[2] This is analogous to the cookies-vs-fingerprinting issue in online tracking, and why cookie-blocking alone is not sufficient to escape tracking.

October 18, 2011 at 11:35 am 5 comments

Everything Has a Fingerprint — Don’t Forget Scanners and Printers

Previous articles in this series looked at fingerprinting of blank paper, digital cameras and RFID chips. This article will discuss scanners and printers, rounding out the topic of physical-device fingerprinting.

To readers who’ve followed the series so far, it should come as no surprise that scanners can be fingerprinted, and this can be used to match an image to the device that scanned it. Scanners capture images via a process similar to digital cameras, so the underlying principle used in fingerprinting is the same: characteristic ‘pattern noise’ in the sensor array as well as idiosyncrasies of the algorithms used in the post-processing pipeline. The former is device-specific whereas the latter is make/model specific.

There are two important differences, however, that make scanner fingerprinting more difficult: first, scanner sensor arrays are one-dimensional (the sensor moves along the length of the device to generate the image), which means that there is much less entropy available from sensor imperfections. Second, the paper may not be placed in the same part of the scanner bed each time, which rules out a straightforward pixel-wise comparison.

A group at Purdue has been very active in this area, as well as in printer identification, which I will discuss later in this article. These two papers are very relevant for our purposes. The application they have in mind is forensics; in this context, it can be assumed that the investigator has physical possession of the scanner to generate a fingerprint against which a scanned image of unknown or uncertain origin can be tested.

To extract 1-dimensional noise from a 2-dimensional scanned image, the authors first extract 2-dimensional noise, in a process similar to what is used in camera fingerprinting, and then they collapse each noise pattern into a single row, which is the average of all the rows. Simple enough.
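The two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the box-filter denoiser is a crude stand-in for whatever filter the papers use, images are plain lists of rows of grey levels, and the names `box_denoise`, `noise_residual` and `collapse_rows` are made up for this example.

```python
from statistics import fmean

def box_denoise(image, radius=1):
    """Crude stand-in denoiser (an assumption): mean over a small window, clipped at the edges."""
    height, width = len(image), len(image[0])
    smoothed = []
    for y in range(height):
        row = []
        for x in range(width):
            window = [image[j][i]
                      for j in range(max(0, y - radius), min(height, y + radius + 1))
                      for i in range(max(0, x - radius), min(width, x + radius + 1))]
            row.append(fmean(window))
        smoothed.append(row)
    return smoothed

def noise_residual(image):
    """2-D noise estimate: the original image minus its denoised version."""
    smoothed = box_denoise(image)
    return [[p - s for p, s in zip(row, srow)] for row, srow in zip(image, smoothed)]

def collapse_rows(noise):
    """Average all rows into a single row, one value per sensor element (column)."""
    return [fmean(column) for column in zip(*noise)]
```

Calling `collapse_rows(noise_residual(scan))` on a scanned image would give the 1-dimensional noise vector; averaging over rows is what suppresses the per-row random noise while keeping whatever is characteristic of each element of the sensor.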

Dealing with the other problem, the lack of synchronicity, is trickier. There are broadly two approaches: (1) try to synchronize the image by trying various alignments, or (2) extract fingerprints using statistical features of the image that are robust against desynchronization. The authors use the latter approach, mainly moment-based features of the noise vector.
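To see why moment-based features sidestep the alignment problem, here is a minimal sketch. The particular features chosen (mean, standard deviation, skewness, kurtosis) and the function name `moment_features` are assumptions for illustration; the papers' actual feature set is not reproduced here.

```python
from math import sqrt
from statistics import fmean

def moment_features(noise_row):
    """Return (mean, standard deviation, skewness, kurtosis) of a 1-D noise vector."""
    mu = fmean(noise_row)

    def central_moment(order):
        return fmean([(v - mu) ** order for v in noise_row])

    variance = central_moment(2)
    if variance == 0:
        # Constant vector: higher moments are undefined, so report zeros.
        return (mu, 0.0, 0.0, 0.0)
    std = sqrt(variance)
    return (mu, std, central_moment(3) / std ** 3, central_moment(4) / variance ** 2)
```

These statistics do not depend on where in the vector each value sits, so shifting the noise vector (in the idealized case of a circular shift) leaves them unchanged. The price is that they summarize the vector very coarsely, which fits with the modest number of devices the technique can tell apart.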

Here are the results. At the native resolution of scanners, 1200–4800 dpi, they were able to distinguish between 4 scanners with an average accuracy of 96%, including a pair with identical make and model. In subsequent work, they improved the feature extraction to be able to handle images that are reduced to 200 dpi, which is typically the resolution used for saving and emailing images. While they achieved 99.9% accuracy in classifying 10 scanners, they can no longer distinguish devices of identical make and model.

The authors claim that a correlation-based approach — searching for the right alignment between two images, and then directly comparing the noise vectors — won’t work. I am skeptical about this claim. The fact that it hasn’t worked so far doesn’t mean it can’t be made to work. If it does work, it is likely to give far higher accuracies and be able to distinguish between a much larger number of devices.

The privacy implications of scanner fingerprinting are of an analogous nature to digital camera fingerprinting: a whistleblower exposing scanned documents may be deanonymized. However, I would judge the risk to be much lower: scanners usually aren’t personal devices, and a labeled corpus of images scanned by a particular device is typically not available to outsiders.

The Purdue group have also worked on printer identification, both laser and inkjet. In laser printers, one prominent type of observable signature arising from printer artifacts is banding — alternating light and dark horizontal bands. The bands are subtle and not noticeable to the human eye. But they are easily algorithmically detectable, constituting a 1–2% deviation from average intensity.

Fourier Transform of greyscale amplitudes of a background fill (printed with an HP LaserJet)

Banding can be demonstrated by printing a constant grey background image, scanning it, measuring the row-wise average intensities and taking the Fourier Transform of the resulting 1-dimensional vector. One such plot is shown here: the two peaks (132 and 150 cycles/inch) constitute the signature of the printer. The amount of entropy here is small — the two peak frequencies — and unsurprisingly the authors believe that the technique is good enough to distinguish between printer models but not individual printers.
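The measurement described above is easy to sketch. The following uses a plain discrete Fourier transform from the standard library so that it stays self-contained; the function names and the `rows_per_inch` parameter (the scan resolution along the page) are assumptions for illustration, and a real analysis would use an FFT library on a much larger scan.

```python
import cmath
import math

def row_means(image):
    """Average grey level of each scanned row."""
    return [sum(row) / len(row) for row in image]

def dft_magnitudes(signal):
    """Magnitudes of the DFT of a mean-removed signal, for bins 0..n/2."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [s - mean for s in signal]
    magnitudes = []
    for k in range(n // 2 + 1):
        acc = sum(c * cmath.exp(-2j * math.pi * k * t / n) for t, c in enumerate(centered))
        magnitudes.append(abs(acc))
    return magnitudes

def peak_frequencies(image, rows_per_inch, top=2):
    """Return the `top` strongest banding frequencies in cycles per inch."""
    magnitudes = dft_magnitudes(row_means(image))
    n = len(image)
    strongest = sorted(range(1, len(magnitudes)), key=lambda k: magnitudes[k], reverse=True)[:top]
    # Bin k means k cycles over n rows; n rows span n / rows_per_inch inches.
    return sorted(k * rows_per_inch / n for k in strongest)
```

Run on a scan of a grey fill, `peak_frequencies` would return the handful of numbers that make up the signature, which makes it plain how little entropy is involved.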

Detecting banding in printed text is difficult because the power of the signal dominates the power of the noise. Instead the authors classify individual letters. By extracting a set of statistical features and applying an SVM classifier, they show that instances of the letter ‘e’ from 10 different printers can be correctly classified with an accuracy of over 90%.

Needless to say, by combining the classification results from all the ‘e’s in a typical document, they were able to match documents to printers 100% of the time in their tests. Presumably the same method would apply for all other characters, but wasn’t tested due to the additional manual effort required for different shapes.
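One simple way to combine per-letter results is a majority vote, sketched below. Whether the authors used exactly this rule is an assumption; the function name `match_document` and the printer labels are hypothetical.

```python
from collections import Counter

def match_document(letter_predictions):
    """Pick the printer predicted most often across all letters in a document.

    letter_predictions: one predicted printer label per classified 'e'.
    Returns (winning label, fraction of letters that voted for it).
    """
    votes = Counter(letter_predictions)
    printer, count = votes.most_common(1)[0]
    return printer, count / len(letter_predictions)
```

For example, `match_document(["hp"] * 9 + ["canon"])` picks `"hp"` with a 0.9 vote share. With hundreds of letters per page, a per-letter classifier that is right about 90% of the time will almost never lose the vote, which is why the document-level accuracy can be so much higher than the letter-level one.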

Vertical lines printed by three different inkjet printers

Inkjet printers seem to be even more variable than laser printers; an example is shown in the picture taken from this paper. I found it a bit hard to discern exactly what the state of the art is, but I’m guessing that if it isn’t already possible to detect different printer models with essentially perfect accuracy, it will soon be.

The privacy implications of printer identification, in the context of a whistleblower who wishes to print and mail some documents anonymously, would seem to be minimal. If you’re printing from the office, printer logs (that record a history of print jobs along with user information) would probably be a more realistic threat. If you’re using a home printer, there is typically no known set of documents that came from your printer to compare against, unless law enforcement has physical possession of your printer.

October 11, 2011 at 10:02 am 1 comment

Fingerprinting of RFID Tags and High-Tech Stalking

Previous articles in this series looked at fingerprinting of blank paper and digital cameras. This article is about fingerprinting of RFID, a domain where research has directly investigated the privacy threat, namely tracking people in public.

The principle behind RFID fingerprinting is the same as with digital cameras:

Microscopic physical irregularities due to natural structure and/or manufacturing defects cause observable, albeit tiny, behavioral differences.

The basics. First let’s get the obvious question out of the way: why are we talking about devious methods of identifying RFID chips, when the primary raison d’être of RFID is to enable unique identification? Why not just use them in the normal way?

The answer is that fingerprinting, which exploits the physical properties of RFID chips rather than their logical behavior, allows identifying them in unintended ways and in unintended contexts, and this is powerful. RFID applications, for example in e-passports or smart cards, can often be cloned at the logical level, either because there is no authentication or because authentication is broken. Fingerprinting can make the system (more) secure, since fingerprints arise from microscopic randomness and there is no known way to create a tag with a given fingerprint.

If sensor patterns in digital cameras are a relatively clean example of fingerprinting, RF (and anything to do with the electromagnetic spectrum in general) is the opposite. First, the data is an arbitrary waveform instead of a fixed-size sequence of bits. This means that a simple point-by-point comparison won’t work for fingerprint verification; the task is conceptually more similar to algorithmically comparing two faces. Second, the probe signal itself is variable. RFID chips are passive: they respond to the signal produced by the reader (and draw power from it).[1] This means that the fingerprinting system is in full control of what kind of signal to interrogate the chip with. It’s a bit like being given a blank canvas to paint on.

Techniques. A group at ETH Zurich has done some impressive work in this area. In their 2009 paper, they report being able to compare an RFID card with a stored fingerprint and determine if they are the same, with an error rate of 2.5%–4.5% depending on settings.[2] They use two types of signals to probe the chip with — “burst” and “sweep” — and extract features from the response based on the spectrum.

Chip response to different signals. Fingerprints are extracted from characteristic features of these responses.

Other papers have demonstrated different ways to generate signals/extract features. A University of Arkansas team exploited the minimum power required to get a response from the tag at various frequencies. The authors achieved a 94% true-positive rate using 50 identical tags, with only a 0.1% false-positive rate. (About 6% of the time, the algorithm didn’t produce an output.)

Yet other techniques, namely the energy and Q factor of higher harmonics, were studied in a couple of papers out of NIST. In the latter work, they experimented with 20 cards, consisting of 4 batches of 5 ‘identical’ cards each. The overall identification accuracy was 96%.

It seems safe to say that RFID fingerprinting techniques are still in their infancy, and there is much room for improvement by considering new categories of features, by combining different types of features, or by using different classification algorithms on the extracted features.

Privacy. RF fingerprinting, like other types of fingerprinting, shows a duality between security-enhancing and privacy-infringing applications, but in a less direct way. There are two types of RFID systems: “near-field” based on inductive coupling, used in contactless smartcards and the like, and “far field” based on backscatter, used in vehicle identification, inventory control, etc. The papers discussed so far pertain to near-field systems. There are no real privacy-infringing applications of near-field RF fingerprinting, because you can’t get close enough to extract a fingerprint without the owner of the tag knowing about it. Far-field systems, to which we will now turn, are ideally suited to high-tech stalking.

Fingerprinting provides the ability to enhance the security of near-field RFID systems and to infringe privacy in the context of far-field RFID chips.

In a recent paper, the Zurich team mentioned earlier investigated the possibility of tracking people in a shopping mall based on strategically placed sensors, assuming that shoppers have several (far-field) RFID tags on them. The point is that it is possible to design chips that prevent tracking at the logical level by authenticating the reader, but this is impossible at the physical level.

Why would people have RFID tags on them? Tags used for inventory control in stores and not deactivated at the point of sale are one increasingly common possibility — they would end up in shopping bags (or even on clothes being worn, although that’s less likely). RFID tags in wallets and medical devices are another source; these are tags that the user wants to be present and functional.

What makes the tracking device the authors built powerful is that it is low-cost and can be operated surreptitiously at some distance from the victim: up to 2.75 meters, or 9 feet. They show that 5.4 bits of entropy can be extracted from a single tag, which means that 5 tags on a person gives 22 bits, easily enough to distinguish everyone who might be in a particular mall.

To assess the practical privacy risk, technological feasibility is only one dimension. We also need to ask who the adversary is and what the incentives are. Tracking people, especially shoppers, in physical space has the strongest incentive of all: selling products. While online tracking is pervasive, the majority of shopping dollars are still spent offline, and there’s still no good way to automatically identify people when they are in the vicinity in order to target offers to them. Facial recognition technology is highly error-prone and creeps people out, and that’s where RF fingerprinting comes in.

That said, RF fingerprinting is only one of the many ways of passively tracking people en masse in physical space — unintentional leaks of identifiers from smartphones and logical-layer identification of RFID tags seem more likely — but it’s probably the hardest to defend against. It is possible to disable RFID tags, but this is usually irreversible and it’s difficult to be sure you haven’t missed any. RFID jammers are another option but they are far from easy to use and are probably illegal in the U.S. One of the ETH Zurich researchers suggests tinfoil wrapping when going out shopping :-)

[1] Active RFID chips exist but most commercial systems use passive ones, and that’s what the fingerprinting research has focused on.

[2] They used a population of 50 tags, but this number is largely irrelevant since the experiment was one of binary classification rather than 1-out-of-n identification.

 

Thanks to Vincent Toubiana for comments on a draft.

October 4, 2011 at 1:20 pm Leave a comment

No Two Digital Cameras Are the Same: Fingerprinting Via Sensor Noise

The previous article looked at how pieces of blank paper can be uniquely identified. This article continues the fingerprinting theme to another domain, digital cameras, and ends by speculating on the possibility of applying the technique on an Internet-wide scale.

For various kinds of devices like digital cameras and RFID chips, even supposedly identical units that come out of a manufacturing plant behave slightly differently in characteristic ways, and can therefore be distinguished based on their output or behavior. How could this be? The unifying principle is this:

Microscopic physical irregularities due to natural structure and/or manufacturing defects cause observable, albeit tiny, behavioral differences.

Digital camera identification belongs to a class of techniques that exploits ‘pattern noise’ in the ‘sensor arrays’ that capture images. The same techniques can be used to fingerprint a scanner by analyzing pixel-level patterns in the images scanned by it, but that’ll be the focus of a later article.

A long-exposure dark frame. Three ‘hot pixels’ and some other sensor noise can be seen.

A photo taken in the absence of any light doesn’t look completely black; a variety of factors introduce noise. There is random noise that varies in every image, but there is also ‘pattern noise’ due to inherent structural defects or irregularities in the physical sensor array. The key property of the latter kind of noise is that it manifests the same way in every image taken by the camera.[1] Thus, the total noise vector produced by a camera is not identical between images, nor is it completely independent.

The pixel-level noise components in images taken by the same camera are correlated with each other.

Nevertheless, separating the pattern noise from random noise and the image itself — after all, a good camera will seek to minimize the strength or ‘power’ of the noise in relation to the image — is a very difficult task, and is the primary technical challenge that camera fingerprinting techniques must address.

Security vs. privacy. A quick note about the applications of camera fingerprinting. We saw in the previous article that there are security-enhancing and privacy-infringing applications of document fingerprinting. In fact, this is almost always the case with fingerprinting techniques. [2]

Camera fingerprinting can be used on the one hand for detecting forgeries (e.g., photoshopped images), and to aid criminal investigations by determining who (or rather, which camera) might have taken a picture. On the other hand, it could potentially also be used for unmasking individuals who wish to disseminate photos anonymously online.

Sadly, most papers studying fingerprinting study only the former type of application, which is why we’ll have to speculate a bit on the privacy impact, even though the underlying math of fingerprinting is the same.

Most fingerprinting techniques have both security-enhancing and privacy-infringing applications. The underlying principles are the same but they are applied slightly differently.

Another point to note is that because of the focus on forensics, most of the work in this area so far has studied distinguishing different camera models. But there are some preliminary results on distinguishing ‘identical’ cameras, and it appears that the same techniques will work.

In more detail. Let’s look at what I think is the most well-known paper on sensor pattern noise fingerprinting, by Binghamton University researchers Jan Lukáš, Jessica Fridrich, and Miroslav Goljan. [3] Here’s how it works: the first step is to build a reference pattern of a camera from multiple known images taken from it, so that later an unsourced image can be compared against these reference patterns. The authors suggest using at least 50, but for good measure, they use 320 in their experiments. In the forensics context, the investigator probably has physical possession of the camera and therefore can generate an unlimited number of images. We’ll discuss what this requirement means in the privacy-breach context later.

There are two steps to build the reference pattern. First, for each image, a denoising filter is applied, and the denoised image is subtracted from the original to leave only the noise. Next, the noise is averaged across all the reference images — this way the random noise cancels out and leaves the pattern noise.

Comparing a new image to a reference pattern, to test if it came from that camera, is easy: extract the noise from the test image, and compare this noise pixel-by-pixel with the reference noise. The noise from the test image includes random noise, so the match won’t be close to perfect, but nevertheless the correlation between the two noise patterns will be roughly equal to the contribution of pattern noise towards the total noise in the test image. On the other hand, if the test image didn’t come from the same camera, the correlation will be close to zero.
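Here is a minimal sketch of the averaging and comparison steps, assuming the noise residuals have already been extracted (original minus denoised) and flattened into equal-length lists of pixel values. The function names are made up for this example, and `simulated_residual` is a purely hypothetical generator of test data, not part of the method.

```python
import random
from math import sqrt
from statistics import fmean

def reference_pattern(residuals):
    """Average noise residuals pixel by pixel; random noise tends to cancel, pattern noise remains."""
    return [fmean(pixel_values) for pixel_values in zip(*residuals)]

def correlation(a, b):
    """Pearson correlation between two equal-length noise vectors."""
    mean_a, mean_b = fmean(a), fmean(b)
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    norm = sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    return sum(x * y for x, y in zip(da, db)) / norm if norm else 0.0

def simulated_residual(pattern, rng, random_sigma=3.0):
    """Hypothetical test data: a camera's fixed pattern plus fresh random noise for each image."""
    return [p + rng.gauss(0.0, random_sigma) for p in pattern]
```

In a simulation of this kind, with two made-up cameras and a reference built from 50 residuals of the first, the correlation with a new residual from the same camera is clearly positive while the correlation with a residual from the other camera hovers around zero. A decision rule is then just a threshold on `correlation`, chosen to trade off false accepts against false rejects.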

The authors experimented with nine cameras, of which two were from the same brand and model (Olympus Camedia C765). In addition, two other cameras had the same type of sensor. There was not a single error in their 2,700 tests, including those involving the two ‘identical’ cameras — in each case, the algorithm correctly identified which of the nine cameras a given image came from. By extrapolating the correlation curves, they conservatively estimate that for a False Accept Rate of 10^-3, their method achieves a False Reject Rate of anywhere between 10^-2 and 10^-10 or even less depending on the camera model and camera settings.

The takeaway from this seems to be that distinguishing between cameras of different models can be performed with essentially perfect accuracy. Distinguishing between cameras of the same model also seems to have very high accuracy, but it is hard to generalize because of the small sample size.

Improvements. Impressive as the above numbers are, there are at least two major ways in which this result can be, and has been, improved. First, the Binghamton paper is focused on a specific signal, sensor noise. But there are several stages in the image acquisition and processing pipeline in the camera, each of which could leave idiosyncratic effects on the image. This paper out of Turkey incorporates many such effects by considering all patterns of certain types that occur in the lower order (least significant) bits of the image, which seems like a rather powerful technique.

The effects other than sensor noise seem to help more with identifying the camera model than the specific device, but to the extent that the former is a component of the latter, it is useful. They achieve a 97.5% accuracy among 16 test cameras — but with cellphone cameras with pictures at a resolution of just 640×480.

Second is the effect of the scene itself on the noise. Denoising transformations are not perfect — sharp boundaries look like noise. The Binghamton researchers picked their denoising filter (a wavelet transform) to minimize this problem, but a recent paper by Chang-Tsun Li claims to do it better, and shows even better numerical results: with 6 cameras (all different models), accurate (over 99%) identification for image fragments cropped to just 256 x 512.

What does this mean for privacy? I said earlier that there is a duality between security and privacy, but let’s examine the relationship in more detail. In privacy-infringing applications like mass surveillance, the algorithm need not always produce an answer, and it can occasionally be wrong when it does. The penalty for errors is much lower. On the other hand, the matching algorithm in surveillance-like applications needs to handle a far larger number of candidate cameras. The key point is:

The parameters of fingerprinting algorithms can usually be tweaked to handle a larger number of classes (i.e., devices) at the expense of accuracy.

My intuition is that state-of-the-art techniques, configured slightly differently, should allow probabilistic deanonymization from among tens of thousands of different cameras. A Flickr or Picasa profile with a few dozen images should suffice to fingerprint a camera.[4] Combined with metadata such as location, this puts us within striking distance of Internet-scale source-camera identification from anonymous images. I really hope there will be some serious research on this question.

Finally, a word on defenses. If you find yourself in a position where you wish to anonymously publicize a sensitive photograph you took, but your camera is publicly tied to your identity because you’ve previously shared pictures on social networks (and who hasn’t), how do you protect yourself?

Compressing the image is one possibility, because that destroys the ‘lower-order’ bits that fingerprinting crucially depends on. However, it would have to be way more aggressive than most camera defaults (JPEG quality factor ~60% according to one of the studies, whereas defaults are ~95%). A different strategy is rotating the image slightly in order to ‘desynchronize’ it, throwing off the fingerprint matching. An attack that defeats this will have to be much more sophisticated and will have a far higher error rate.

The deanonymization threat here is analogous to writing-style fingerprinting: there are simple defenses, albeit not foolproof, but sadly most users are unaware of the problem, let alone solutions.

[1] That was a bit simplified; mathematically, there is an additive component (dark signal nonuniformity) and a multiplicative component (photoresponse nonuniformity). The former is easy to correct for, and higher-end cameras do, but the latter isn’t.

[2] Much has been said about the tension between security and privacy at a social/legal/political level, but I’m making a relatively uncontroversial technical statement here.

[3] Fridrich is incidentally one of the pioneers of speedcubing, i.e., speed-solving the Rubik’s cube.

[4] The Binghamton paper uses 320 images per camera for building a fingerprint (and recommends at least 50); the Turkey paper uses 100, and Li’s paper 50. I suspect that if more than one image taken from the unknown camera is available, then the number of reference images can be brought down by a corresponding factor.

September 19, 2011 at 9:25 am 5 comments

Everything Has a Fingerprint: The Case of Blank Paper

This article is the first in a series that looks at “fingerprinting” techniques and the implications for privacy.

Unique-identification techniques similar to fingerprints have been applied in an astonishing variety of contexts in recent decades. Biometrics like iris and DNA profiling are well known, but there are lesser known methods like hand geometry, as well as “behavioral biometrics” like voice, handwriting, typing patterns, and even gait analysis. Many techniques for deanonymization, the principal topic of this blog, work by “fingerprinting” people’s preferences, habits, or style.

But this article is not about biometrics, nor is it about fingerprinting of content or complex systems such as a web browser in conjunction with the OS and the user.[1] I will instead discuss one of the most surprising domains of fingerprinting — blank paper.

This is what paper looks like up close — far from being smooth, it has a rich natural structure. Even considering this, the state-of-the-art study on fingerprinting of physical documents, by Will Clarkson and colleagues at Princeton, achieves something remarkable: they show how to extract fingerprints from paper using just commodity scanners, and no microscopic technology. The fingerprint survives when the document/paper is printed on, written or scribbled on, or even soaked in water.

A small (10mm tall) region of paper scanned from two different angles — top-to-bottom and left-to-right

The image above, taken from the Princeton paper, shows what the output of a scanner looks like. Not quite the resolution of the microscopic image, but a lot of structure is still visible. The key technique is: by scanning the paper at different orientations and comparing the images, the height at each point is estimated, from which a 3-D map of the not-so-flat surface of the paper is constructed.

These 3-D maps can be used as fingerprints, but for efficiency they look at the maps of only about 100 randomly picked small “patches” on the paper. To further compress the extracted information, they do a “dimensionality reduction,” resulting in a 400 byte “feature vector” for each piece of paper, which is the fingerprint.

To verify or compare an observed fingerprint against a stored one, they simply look at the Hamming distance between the two bit-vectors. Why does this simple comparison technique succeed? Comparison of two human fingerprints is a lot more difficult, after all. It’s because a rectangular piece of paper has a nice property that human skin doesn’t: when the objects being fingerprinted have a precise, fixed geometry, fingerprint verification is easy — it is just a pointwise comparison of the corresponding features.
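The pointwise comparison is short enough to write out. Here is a minimal sketch, assuming each fingerprint is stored as a Python `bytes` object of equal length; the function name `hamming_fraction` is mine.

```python
def hamming_fraction(a: bytes, b: bytes) -> float:
    """Return the fraction of bit positions at which two fingerprints differ."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    if not a:
        raise ValueError("fingerprints must not be empty")
    # XOR leaves a 1 exactly where the two bytes disagree; count those bits.
    differing = sum(bin(x ^ y).count("1") for x, y in zip(a, b))
    return differing / (8 * len(a))
```

Because bit i of one fingerprint always corresponds to bit i of the other, no alignment or search step is needed before comparing.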

The result of such comparisons is this: two fingerprints from different pieces of paper match in roughly 50% of the bits, almost always in the 45%–55% range. Two fingerprints from the same piece of paper, on the other hand, differ in less than 5% of the bits, and occasionally up to 20% of bits if it has been handled particularly badly, such as by soaking. Therefore it is straightforward to infer whether or not two fingerprints came from the same piece of paper.
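The gap between the two cases is wide enough that a single cut-off separates them. The sketch below simulates this with random 400-byte fingerprints. The 30% threshold is an assumption of mine, chosen only because it sits between the 20% and 45% figures above, and the bit-flip noise model in `rescan` is a crude stand-in for real wear and tear. All names are hypothetical.

```python
import random

N_BYTES = 400            # 3200-bit fingerprints, as in the text
MATCH_THRESHOLD = 0.30   # assumed cut-off between "up to 20%" and "45% to 55%"


def bit_difference(a, b):
    """Fraction of differing bits between two equal-length fingerprints."""
    xor = int.from_bytes(a, "big") ^ int.from_bytes(b, "big")
    return bin(xor).count("1") / (8 * len(a))


def same_sheet(a, b, threshold=MATCH_THRESHOLD):
    """Decide whether two fingerprints came from the same piece of paper."""
    return bit_difference(a, b) < threshold


def random_fingerprint(rng):
    """Simulate the fingerprint of a fresh, unrelated sheet."""
    return bytes(rng.getrandbits(8) for _ in range(N_BYTES))


def rescan(fp, flip_probability, rng):
    """Simulate a later scan by flipping each bit independently (toy noise model)."""
    noisy = bytearray(fp)
    for i in range(len(noisy)):
        for bit in range(8):
            if rng.random() < flip_probability:
                noisy[i] ^= 1 << bit
    return bytes(noisy)
```

For example, `same_sheet(fp, rescan(fp, 0.05, rng))` models a well-preserved sheet, and `same_sheet(fp, random_fingerprint(rng))` models a comparison against a different sheet.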

Readers familiar with the “33 bits of entropy” concept might notice that the fingerprint here is 400 bytes long, or 3200 bits, which is ridiculously high. There are surely fewer than 2^50 pieces of paper in the world (that’s a million for every person), which means that these fingerprints should easily be able to uniquely identify every piece of paper in the world. [2] The authors estimate that the chance of an error is no more than 1 in 10^148. In other words, for all practical purposes they achieve perfect accuracy.
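A back-of-the-envelope calculation shows why so many bits leave so much headroom. The sketch below uses an idealized model of my own, in which the bits of two unrelated fingerprints are independent fair coin flips and a "match" means differing in at most 20% of the bits. This is not the authors' analysis, so the result should not be expected to reproduce their figure of 1 in 10^148; it only illustrates the order of magnitude involved. The function names are hypothetical.

```python
import math

N_BITS = 3200       # fingerprint length in bits
MATCH_BITS = 640    # 20% of 3200: the loosest "same sheet" case in the text
LOG2_SHEETS = 50    # 2^50 sheets of paper, as in the text


def log10_false_match(n_bits=N_BITS, max_differing=MATCH_BITS):
    """log10 of the chance that two unrelated fingerprints look like a match.

    Idealized assumption: each bit differs independently with probability 1/2,
    so the number of differing bits is Binomial(n_bits, 1/2).
    """
    close_vectors = sum(math.comb(n_bits, k) for k in range(max_differing + 1))
    return math.log10(close_vectors) - n_bits * math.log10(2)


def log10_any_false_match(log2_sheets=LOG2_SHEETS):
    """Union bound over every pair of sheets: pairs * per-pair probability."""
    # With 2^s sheets there are fewer than 2^(2s - 1) distinct pairs.
    log10_pairs = (2 * log2_sheets - 1) * math.log10(2)
    return log10_pairs + log10_false_match()
```

Under these assumptions, even after multiplying by every possible pair among 2^50 sheets, the chance of a single false match anywhere remains astronomically small.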

What are the implications? As the authors point out, document identification “has a wide range of applications, including detecting forged currency and tickets, authenticating passports, and halting counterfeit goods.” On the negative side, it “could also be applied maliciously to de-anonymize printed surveys and to compromise the secrecy of paper ballots.”

[1] This is often referred to as device fingerprinting, but I find that a poor choice of terminology and will reserve that term for a different concept in this series.

[2] It is hard to estimate entropy exactly in cases like this, but the feature vector is obtained via Principal Component Analysis, which makes it likely that the entropy is close to the maximum value of 3200 bits.

Thanks to Will Clarkson for reviewing a draft of this post.

September 13, 2011 at 10:41 am 1 comment

About 33bits.org

I’m an associate professor of computer science at Princeton. I research (and teach) information privacy and security, and moonlight in technology policy.

This is a blog about my research on breaking data anonymization, and more broadly about information privacy, law and policy.

For an explanation of the blog title and more info, see the About page.
