The Shifting Privacy Left Podcast
Shifting Privacy Left features lively discussions on the need for organizations to embed privacy by design into the UX/UI, architecture, engineering / DevOps and the overall product development processes BEFORE code or products are ever shipped. Each Tuesday, we publish a new episode that features interviews with privacy engineers, technologists, researchers, ethicists, innovators, market makers, and industry thought leaders. We dive deeply into this subject and unpack the exciting elements of emerging technologies and tech stacks that are driving privacy innovation; strategies and tactics that win trust; privacy pitfalls to avoid; privacy tech issues ripped from the headlines; and other juicy topics of interest.
S2E17 - Noise in the Machine: How to Assess, Design & Deploy 'Differential Privacy' with Damien Desfontaines (Tumult Labs)
In this week’s episode, I speak with Damien Desfontaines, also known by the pseudonym “Ted”, who is the Staff Scientist at Tumult Labs, a startup leading the way on differential privacy. Earlier in his career, Damien led an anonymization consulting team at Google, and he specializes in making it easy to safely anonymize data. He earned his PhD at ETH Zurich, where he wrote his thesis, and holds a Master's degree in Mathematical Logic and Theoretical Computer Science.
Tumult Labs’ platform makes differential privacy useful by making it easy to create innovative, privacy-enabling data products that can be safely shared and used widely. In this conversation, we focus our discussion on differential privacy techniques, including what’s next in its evolution, common vulnerabilities, and how to implement differential privacy on your platform.
When it comes to protecting personal data, Tumult Labs has three stages in their approach. These are Assess, Design, and Deploy. Damien takes us on a deep dive into each with use cases provided.
Topics Covered:
- Why there's such a gap between academia and the corporate world
- How differential privacy's strong privacy guarantees are a result of strong assumptions; and why the biggest blockers to DP deployments have been education & usability
- When to use "local" vs "central" differential privacy techniques
- Advancements in technology that enable the private collection of data
- Tumult Labs' Assessment approach to deploying differential privacy, where a customer defines its 'data publication' problem or question
- How the Tumult Analytics platform can help you build differential privacy algorithms that satisfy 'fitness for use' requirements
- Why using gold standard techniques like differential privacy to safely release, publish, or share data has value far beyond compliance
- How data scientists can make the analysis & design more robust to better preserve privacy; and the tradeoff between utility on very specific tasks & number of tasks that you can possibly answer
- Damien's work assisting the IRS & Department of Education in deploying differential privacy to safely publish and share data publicly via the College Scorecards project
- How to address security vulnerabilities (i.e. potential attacks) to differentially private datasets
- Where you can learn more about differential privacy
- How Damien sees this space evolving over the next several years
Resources Mentioned:
- Join the Tumult Labs Slack
- Learn about Tumult Labs
Copyright © 2022 - 2024 Principled LLC. All rights reserved.
Debra Farber 0:00
Hello, I am Debra J. Farber. Welcome to The Shifting Privacy Left Podcast, where we talk about embedding privacy by design and default into the engineering function to prevent privacy harms to humans, and to prevent dystopia. Each week we'll bring you unique discussions with global privacy technologists and innovators working at the bleeding-edge of privacy research and emerging technologies, standards, business models and ecosystems.
Debra Farber 0:27
Today, I'm delighted to welcome my next guest, Damien Desfontaines, Staff Scientist at Tumult Labs, a startup focusing on differential privacy. Before that, he led an anonymization consulting team at Google, and he specializes in making it easy to safely anonymize data. He also earned his PhD and did his thesis at ETH Zurich, and a master's degree in Mathematical Logic and Theoretical Computer Science. Tumult Labs's platform makes differential privacy useful by making it easy to create innovative privacy-enabling data products that can be safely shared and used widely. And so today, we're going to be focusing our discussion on differential privacy techniques.
Debra Farber 1:15
Welcome, Damien.
Damien Desfontaines 1:17
Thank you, Debra. It's a pleasure to be here.
Debra Farber 1:19
Excellent. Okay, so first, why don't you tell us a little bit about your background: how did you get into tackling this data science privacy problem with differential privacy? And also, I just discovered that you use the name 'Ted' as a pseudonym online, so I'd love for you to include why you do that in your answer.
Damien Desfontaines 1:38
Sure. So, the pseudonym, 'Ted,' goes back to, I think, a long time ago when I was first on the Internet, and I was looking for a pseudonym such that if you looked it up on Google, you would not find me. And then, over time, I kept it and my identities became intermingled because, as it turns out, maintaining anonymity online over time is very difficult. So, at some point, I just gave up.
Damien Desfontaines 1:57
I went into differential privacy, I think, a little randomly - no pun intended. When I was at Google and joined the privacy team, I noticed that there was a really big disconnect between the way anonymization practice works in industry and what I could see computer scientists focusing on in academia, looking at the papers that were being written on the topic. In academia, it seemed like everybody used this new fancy, cool thing called 'differential privacy,' which gives you a bunch of cool guarantees that we're gonna get to later. But in industry, this technology was not used a lot, and people were still using more old-school techniques that I'm also going to talk a bit about. I was very interested in the reasons behind that disconnect, and so that's what I initially focused my PhD thesis on back then.
Debra Farber 2:46
Oh, excellent. Well, then I would love to hear some of the reasons that you discovered through your research. Like, why is there such a gap between academia and the corporate world?
Damien Desfontaines 2:58
So, just to give a little context, what I mean by 'old school techniques,' or sometimes 'ad hoc techniques' (people also use the word 'syntactic privacy'), are techniques like removing names, removing things that look like identifiers, or only publishing statistics when there are enough people in them. And that's what people do when they want to, let's say, share sensitive data with partners without revealing information about individuals.
Damien Desfontaines 3:25
So, these techniques were born a few decades ago. And, it turns out that in academia, until 2010, most of the research was focusing on those; but successive researchers looked at those and found cases where they didn't work, and essentially created successful attacks on these old-school techniques to get individual information out of even aggregated data. So, they invented this new notion called 'differential privacy,' which was proposed in 2006 and provided much stronger guarantees. But differential privacy makes very strong assumptions about the attacker, which is good from a formal mathematical perspective, but might not feel realistic in practice when you look at it initially. And, it also feels complicated. You have to add some randomness to your data. You have to do all sorts of data transformation operations to make that possible. And, the very notion of having to add noise to data was scary to people, which I think was one of the main reasons why it didn't really catch on in industry, at least not initially.
Debra Farber 4:31
So, then what is the differential privacy guarantee? And how was there a mismatch between the true guarantee and, I guess, the expectation of what differential privacy is supposed to do from a corporate perspective?
Damien Desfontaines 4:50
So, when I started my research, what I actually set out to do was to try and look at these old-school definitions and say, "Okay, maybe they don't work perfectly all the time and you can make theoretical attacks against them, but maybe in practical scenarios they're good enough. Maybe we can interpret their guarantees in the same language as differential privacy." What does that mean, the same language? What differential privacy tells you is that it gives you a bound on how much probabilistic information you can learn about a person by looking at the anonymized data.
Damien Desfontaines 5:22
So let's say, for example, that initially I don't know whether a particular person in the data has a cancer diagnosis associated with their record. Initially, I have no idea about this. I'm gonna give it, let's say, 50%-50% odds. If I had to guess, I would maybe give you $1 and maybe get $2 if I'm right and lose my $1 if I'm wrong. What differential privacy tells you is, after seeing the output of the anonymized data, after looking at the data that was published, maybe I can increase that knowledge a little bit. So, instead of having, let's say, 1:1 odds, I'm going to have 2:1 odds of getting the answer right, but not more than this. So, I can't gain full certainty, which is very powerful. It means I provably cannot learn anything for sure about an individual. And, even better, I can quantify that. I can tell you exactly how much more I can learn. That's very strong, and that's not a property that any other privacy definition gave you at the time.
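For listeners who want the formal statement behind that odds argument, the standard definition of differential privacy can be written as follows (this is the textbook definition, not anything specific to Tumult Labs; the 2:1-odds example corresponds roughly to epsilon = ln 2):
```latex
% epsilon-differential privacy: for any two datasets D and D' differing in one
% person's data, and for any set S of possible outputs of the mechanism M,
\Pr[\,M(D) \in S\,] \;\le\; e^{\varepsilon} \cdot \Pr[\,M(D') \in S\,]
% Bayesian reading: an attacker's prior odds about one person's record can grow
% by at most a factor of e^{\varepsilon}. Starting from 1:1 odds, epsilon = ln 2
% gives at most the 2:1 posterior odds in Damien's example.
```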
Damien Desfontaines 6:21
So, I initially set out to figure out whether we could come up with the same kind of formulation of a guarantee for old-school notions under certain assumptions. Differential privacy, for example, doesn't depend on what auxiliary data the attacker uses, and it assumes an attacker that is very strong and has infinite computing power at their disposal. So, with these strong assumptions, I thought maybe we could relax them and get similar results for these old-school privacy notions. What I discovered is that, actually, the main obstacle to the wide-scale deployment of differential privacy is more along the lines of education and usability.
Damien Desfontaines 6:56
Basically, if you make it easier to use these much stronger techniques, people will actually use them and find that they work well for their use cases. But that work of transforming theory into practice and making things usable, scalable, explainable, and so on, that's sort of the whole point. And, that's what we're working on right now.
Debra Farber 7:15
Excellent. Well, thank you for that summary of, you know, kind of how you view this space and what differential privacy is. And so, there are different types of differential privacy from what I understand: 'local' versus 'central' differential privacy. When would you use one over the other, and what are the differences between those two approaches?
Damien Desfontaines 7:38
Right. So 'central differential privacy' is the model that is most deployed right now, and it essentially happens when you have one organization - we usually call that the 'data custodian' - who has a lot of data about many people, so they have access to everybody's records, and so on. This happens a lot, right? Think about a government agency having records on citizens, or big tech companies having interaction data with their services, and so forth, and they want to publish insights about the data, statistics about the data, or even machine learning models trained on that data, externally, while making sure that this does not leak information about individuals. They have all the data initially, and then they publish something about the data. That's central differential privacy, because it requires having a central point where all of the data is.
Damien Desfontaines 8:28
'Local differential privacy' is a technique you can use when you want the central aggregator to not have access to the real data. So instead, every person who sends data to the aggregator will add noise, will add some randomness to their own data, then pass it to the aggregator, and then the aggregator will publish some statistics based on that. Local differential privacy has this nice thing where you make fewer trust assumptions. You don't have to trust the aggregator, because the privacy guarantee applies to them. Each user is hiding their own true data from the aggregator.
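To make the local model concrete, here is a minimal Python sketch of randomized response, the classic local-DP mechanism for collecting a yes/no statistic. It's an illustration only, with names and parameters of my choosing; it is not how any particular production system, Tumult's included, implements data collection.
```python
import math
import random

def randomize(true_bit: bool, epsilon: float) -> bool:
    """Each user perturbs their own yes/no answer before sending it.

    With probability e^eps / (e^eps + 1) they report the truth, otherwise
    they report the opposite. The aggregator never sees the true bit,
    which is the local-DP trust model.
    """
    p_truth = math.exp(epsilon) / (math.exp(epsilon) + 1)
    return true_bit if random.random() < p_truth else not true_bit

def estimate_count(reports: list[bool], epsilon: float) -> float:
    """The aggregator debiases the noisy reports to estimate the true 'yes' count."""
    p = math.exp(epsilon) / (math.exp(epsilon) + 1)
    n = len(reports)
    observed_yes = sum(reports)
    # E[observed_yes] = p * true_yes + (1 - p) * (n - true_yes); solve for true_yes.
    return (observed_yes - (1 - p) * n) / (2 * p - 1)

# Toy example: 1,000,000 users, 5% truly answer 'yes', epsilon = 1 per report.
random.seed(0)
truth = [random.random() < 0.05 for _ in range(1_000_000)]
reports = [randomize(bit, epsilon=1.0) for bit in truth]
print(sum(truth), round(estimate_count(reports, epsilon=1.0)))
```
The debiasing step is also where the accuracy cost shows up: with a million users and an epsilon of 1 per report, the estimate above is typically off by around a thousand, which previews the square-root-of-n limitation Damien describes next.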
Damien Desfontaines 9:03
The problem with local differential privacy is that the total amount of noise, the total uncertainty you have to add to your data to collect statistics, is much, much larger. So, it really only works for applications where you're okay having lots of uncertainty in your data and you really don't want to have to trust the aggregator. That use case exists, but there are also other techniques you can use in that context that avoid that amount of noise. At this stage, though, I don't know of real-world deployments using those, so it's mostly research.
Damien Desfontaines 9:44
Today, most of the big names who deploy differential privacy use the central model: Google, Microsoft, U.S. Census Bureau, Wikimedia, etc.
Debra Farber 9:53
Do you think this is mostly because there's more research in the central model, because the research is being sponsored by large organizations? Or is it that it's harder to do local differential privacy? Is there just less funding for that?
Damien Desfontaines 10:09
Yeah, there are theoretical limitations, as in fundamental limitations to what you can do.
Debra Farber 10:13
Okay.
Damien Desfontaines 10:13
So basically, one rule of thumb was published in a paper by Google folks - Google also had one of the very first real-world deployments of differential privacy, which was actually local differential privacy in Google Chrome. The researchers who published this reported that if you have n users - let's say a million - you can, at best, observe behavior that is shared by at least the square root of n people. So, if you have a million users, you can maybe detect behavior that's common to 1,000 people - the square root of 1 million - but not fewer; and often, you need even more people in order to get any signal out. So, in many data analysis applications, that sort of fundamental limit - and you can prove fairly easily that you can't do better with local DP - severely limits the range of applicable use cases.
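As a rough, back-of-the-envelope version of that rule of thumb (my paraphrase; the constant and the exact statement in the paper Damien mentions will differ):
```latex
% Local DP with n users: the debiased estimate of a count carries noise whose
% standard deviation grows on the order of sqrt(n), scaled by a constant that
% depends on epsilon, so only behaviors shared by roughly sqrt(n) or more
% users rise above the noise:
\text{detectable group size} \;\gtrsim\; c_{\varepsilon}\,\sqrt{n},
\qquad n = 10^{6} \;\Rightarrow\; \sqrt{n} = 1000 .
```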
Damien Desfontaines 11:08
In parallel, in the last 10 years, there's been more research on this problem of collecting data privately, and there are other approaches to doing this. Rather than only adding noise locally, you can do things like multi-party computation, or secure aggregation, or federated learning - this whole set of techniques that can be used in that context - and you can actually combine them with some central differential privacy afterwards. That feels more promising for the 'data collection problem' than local differential privacy.
Damien Desfontaines 11:38
Central differential privacy, on the other hand, has been proven useful over and over again, and it's currently gaining adoption because the properties it displays both provide strong guarantees and are usable and useful in practice.
Debra Farber 11:54
That's really helpful; and I know you can kind of mix and match different privacy enhancing technologies based on use cases. When you say stronger guarantees, it sounds also like you've got simpler deployment with more accuracy, right? So, you wouldn't even use local differential privacy in some cases, because you have more accurate, easier-to-deploy privacy enhancing technologies for those use cases - is that what I'm hearing?
Damien Desfontaines 12:24
About use cases - are you talking about the data collection, or the data sharing and publishing?
Debra Farber 12:30
You know, I wasn't making a distinction. Yeah. Why don't you make that distinction.
Damien Desfontaines 12:33
Okay, so basically, central DP solves the problem of publishing or sharing data that you own, that you already have, that is in your custody, and where you already know what the data is.
Damien Desfontaines 12:46
Local differential privacy solves a different problem, which is: how do you collect data in a private way? How do you get statistics about people without knowing their true answers?
Debra Farber 12:54
Got it. Okay.
Damien Desfontaines 12:56
The techniques are not one or the other, it's usually the problem statement that sort of tells you whether one is a good option or whether you should use the other one.
Debra Farber 13:03
Makes sense. So, I know from your website that, to ensure effective sharing of sensitive data, Tumult Labs provides a suite of offerings to help organizations of any size assess, design, and deploy differentially private solutions. I really want the audience to better understand this approach to deploying differentially private solutions. So, if you're game, I'd like to list each stage and then have you expound upon what some of those best practices are and what the thought process is at each stage when it comes to protecting personal data. Sound good?
Damien Desfontaines 13:43
For sure. Yeah.
Debra Farber 13:44
Great. So, the stages are assess, design, and deploy. Right? First is the assess stage. What goes on in this stage?
Damien Desfontaines 13:52
So, at this stage, the organization we're talking to has a 'data publication' problem or question. They have some data that they hold, and they want to share it with a somewhat untrusted third party or publish it to the world. The first step for us is to understand: what is the problem that this data publication solves? Often, people start out thinking, "I would like to publish this data to encourage research, or to allow people to derive useful insights, to do ad measurements, or to answer a question," and we try to get to the root of what that fundamental question is. How is the data going to be used?
Damien Desfontaines 14:30
In that process, you try to determine not just what the problem is, but also how the solution's utility is going to be quantified. What will success mean in the context of that data publication? You usually start the discussion at a high level - "What problem do you solve?" - and the result of that process is a set of error metrics. We're going to look at the difference between the data we're going to publish and the [inaudible] that we'd like to publish. We're going to look at how much of the data we are able to get insights about, how much of the data was suppressed, and what the relative noise in the data is compared to the actual signal - all of these error metrics we can quantify and translate into mathematical terms. This is useful for us because then we know what to optimize for when designing an algorithm in the next stage, but it's also often very useful for the organization to really understand what success looks like for them, what the goal of the operation is, and what the value of the data product they are gonna ship is going to be for them.
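To make "error metrics" concrete, here is a hedged, generic Python sketch of the kind of metrics such an assessment might produce (relative error, share of statistics within a tolerance, suppression rate). These are illustrative choices of mine, not Tumult Labs' actual metric definitions.
```python
import numpy as np

def error_metrics(true_counts: np.ndarray, released_counts: np.ndarray,
                  suppressed: np.ndarray) -> dict:
    """Toy error metrics comparing a DP release to the noiseless statistics.

    `suppressed` marks statistics dropped from the release, e.g. because
    they fell below a publication threshold.
    """
    kept = ~suppressed
    rel_error = (np.abs(released_counts[kept] - true_counts[kept])
                 / np.maximum(true_counts[kept], 1))
    return {
        "median_relative_error": float(np.median(rel_error)),
        "share_within_5_percent": float(np.mean(rel_error <= 0.05)),
        "suppression_rate": float(np.mean(suppressed)),
    }

# Hypothetical example: 5 statistics, 1 of them suppressed before release.
true_counts = np.array([1200, 40, 310, 9, 77])
released = np.array([1193, 44, 305, 0, 81])
suppressed = np.array([False, False, False, True, False])
print(error_metrics(true_counts, released, suppressed))
```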
Damien Desfontaines 15:37
Typically, the outcome of this is an understanding by the organization of what can realistically be done with the privacy guarantees we propose. It often includes an explanation, for them and/or for the data users, of the privacy guarantees they're gonna get. What we sell is privacy technology, right? We sell tools that allow people to publish data and still sleep well at night. So, that second part is very important; we have to convince them why this data release can be made safe enough for their legal, privacy, and ethical requirements. And we usually produce reports, presentations, error metrics, and so on, to set us up for success in the next stage.
Debra Farber 16:20
Nice, thank you so much for that. And then, the second stage would be design. So, you're starting to figure out, I guess, the architecture and how you're going to design this process. So tell us about that.
Damien Desfontaines 16:31
That's right. So, once we know what we're trying to build, once we know what we're designing for, we as experts help you (or do it for you) write an algorithm that will generate the data you want to generate in a differentially private way. Two things to note here. First, everything we do in this stage is built on a general-purpose differential privacy platform that we built, called Tumult Analytics. It's also open source, so you can go and try to build your first differentially private algorithms using our open source platform today, if you want to. I'm sure we can have the link next to the podcast.
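For listeners who want to try the open-source library Damien mentions, a first differentially private count looks roughly like the sketch below, adapted from my recollection of the public Tumult Analytics tutorials. The exact module paths and required arguments (for example, protected_change) have changed across versions, so treat this as an approximation and follow the current documentation.
```python
from pyspark.sql import SparkSession
from tmlt.analytics.privacy_budget import PureDPBudget
from tmlt.analytics.protected_change import AddOneRow
from tmlt.analytics.query_builder import QueryBuilder
from tmlt.analytics.session import Session

spark = SparkSession.builder.getOrCreate()
# Hypothetical input file: one row per person in the sensitive dataset.
members_df = spark.read.csv("members.csv", header=True, inferSchema=True)

# A Session wraps the sensitive data and enforces a total privacy budget.
session = Session.from_dataframe(
    privacy_budget=PureDPBudget(epsilon=1.0),
    source_id="members",
    dataframe=members_df,
    protected_change=AddOneRow(),  # protect adding/removing one row
)

# Spend part of the budget on a differentially private row count.
count_query = QueryBuilder("members").count()
result = session.evaluate(count_query, privacy_budget=PureDPBudget(epsilon=0.5))
result.show()
```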
Damien Desfontaines 17:07
And second, that process of designing an algorithm is a very iterative one. What often happens is a first algorithm is produced either by us - Tumult Labs experts - or by the organization itself. In the engagement we have with Wikimedia, for example, Wikimedia data scientists are directly using our platform to build an algorithm. Then we run the algorithm, we evaluate the results against the error metrics defined in the previous stage, and the very first version of the algorithm is almost never good enough. So we look at where there is unacceptable utility loss, or too much suppression, or something that is not going as well as we expected in the data. And then we iteratively optimize some parameters, revisit the total privacy protection we are providing by increasing or decreasing the privacy parameter (that's the thing that quantifies how much probabilistic information an attacker can gain), and we iterate. This often requires refining the error metrics over time, as you realize that there are maybe things you hadn't thought of that are also important to solving the problem. At the end of the design process, you end up with a prototype algorithm that satisfies the fitness-for-use requirements, so that releasing the data produced by this algorithm will actually solve the problem that you're trying to solve and will make your data users happy.
Debra Farber 18:25
That makes sense. And I just want to point out to the audience, too, how nothing about what we've said so far has anything to do with compliance. It's just good privacy practices, preventing privacy harms, putting these practices in place to protect humans. Right? So, this isn't about compliance, although it can get you to regulatory compliance in how you're sharing data and such. But, I just wanted to throw that out there because I think it's a really important point to underscore that privacy, like security, really should just be baked into an organization and all aspects of the business, especially around protecting personal data.
Damien Desfontaines 19:03
I think you're completely right. What we tell our customers is that using gold standard techniques like differential privacy to safely release, publish, or share data goes beyond compliance. But very clearly, part of the value proposition is that it gets you to the compliance bar, because it's the gold standard among these techniques.
Debra Farber 19:25
Yes, yes. It gets you there. But what you really want to do is unlock the value of your data so that you can use it in ways that are profitable to the company and get you to your KPIs or whatever your goals are, but in a way that is privacy-protective and meets the regulation. So...
Damien Desfontaines 19:41
Absolutely.
Debra Farber 19:41
Yeah. Okay, why don't you tell us about the third stage, and that's deployment. How do you help customers deploy this technology?
Damien Desfontaines 19:49
So, deployment: once you have an algorithm to generate the data, it's often not as simple as clicking a button to actually push that thing to production. Often, you need to do things like document the algorithm and explain how it works and what its properties are to the people who are going to use the data. So...
Debra Farber 20:07
And that is a requirement in the EU.
Damien Desfontaines 20:10
Right. Another aspect of deployment is if you're doing a data release that is continuous over time - for example, with Wikimedia, we are releasing data every day based on the logs generated every single day. The initial strategy, the initial algorithm, was designed using maybe some historical data, and you want to make sure that the assumptions you made then are still holding up over time, and that there's no data drift that will make your algorithm perform worse over time. So, part of deployment is to understand how that can happen, how you can catch and detect it when it happens, and how you can correct for it. Finally, there's all sorts of technical support involved in actually shipping this stuff to large organizations and clusters and so on. So, we have a great team of engineers who can help you through that.
Debra Farber 21:01
Thank you for that. I really appreciate it. So, some use cases around data publication (so publishing data) or data sharing are well suited to the use of differential privacy, while some are not. In fact, you wrote a really comprehensive blog post on this for Tumult Labs. And again, I'll share that in the show notes. Can you shed some light on what are good use cases for differential privacy techniques, and what use cases may not be good candidates? Basically, what's the litmus test for when it's appropriate to use differential privacy?
Damien Desfontaines 21:35
Yeah, it's probably simpler than you'd expect. Essentially, differential privacy, at its core, is going to try and hide any individual contribution from the data in the end. So, let's say you're doing something very simple, like counting the number of people with a certain diagnosis in your dataset. A single person being added to or removed from the dataset can at most influence that total statistic by one. Right? Because either they have this diagnosis or they don't, but the total count will not change by much because of that person's data. So, all you need to do to satisfy differential privacy for this specific statistic is to add some random noise - you can imagine rolling a die and adding the result to it - and the magnitude of that noise needs to cover the contribution of any given person. If their contribution is one, you just need to add random noise with a variance or standard deviation of, you know, 1, 2, 5, 10 - not a very large number, if your statistics are big enough.
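A minimal sketch of that counting example in Python, using the Laplace mechanism - the standard textbook way to add noise calibrated to a sensitivity of one. This is an illustration under those assumptions, not how Tumult Analytics implements it internally (and naive floating-point Laplace sampling has the vulnerabilities Damien discusses later).
```python
import numpy as np

rng = np.random.default_rng(0)

def dp_count(true_count: int, epsilon: float) -> float:
    """Release a count with epsilon-differential privacy.

    Adding or removing one person changes a count by at most 1 (its
    sensitivity), so Laplace noise with scale 1/epsilon is enough to
    hide any single person's contribution.
    """
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Hypothetical example: 1,234 people with a given diagnosis in the dataset.
print(dp_count(1234, epsilon=1.0))  # noise std ~1.4: negligible next to 1,234
print(dp_count(1234, epsilon=0.1))  # stronger privacy, noise scale 10: still small here
```
The same noise that is negligible on a count of 1,234 would swamp a count of 3, which is exactly the "statistics big enough" caveat Damien mentions.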
Damien Desfontaines 22:34
So, that's essentially the litmus test: can the statistic that you're outputting, that you want to generate, be significantly changed if you just change the data of one single person? If a single person can have a great influence on the result of your analysis, if it's critical that you catch changes in individual behavior in your analysis, then differential privacy is not going to work, because what the algorithm is always going to try to do is cover this individual contribution. Conversely...
Debra Farber 23:05
Is it an outlier?
Damien Desfontaines 23:06
This is a typical...this is a typical example.
Debra Farber 23:09
Okay.
Damien Desfontaines 23:10
So, in a use case like 'outlier detection,' if you're trying to detect the data points that are outliers in the data, then very clearly, differential privacy is not going to allow you to do this well, because it's designed to prevent this kind of inference about specific individuals. The nice thing about that litmus test is it also works the other way around. So, if fundamentally the thing you want to get out of your data is global insights - statistics about the behavior of multiple people at once - then often it's not gonna cost you a lot of utility to add the noise necessary to get these very, very strong privacy guarantees.
Damien Desfontaines 23:45
So, use cases around measuring aggregate user behavior, computing statistics over medium to large groups - all of these kinds of use cases typically work very well with differential privacy. One possible exception here is, again, if the statistics are such that a single person can have a very outsized influence on them. If you imagine you're in a room and you're trying to get the average salary of the people in this room, and all of a sudden Jeff Bezos enters the room, very clearly that's going to make that number go up wildly. This single person is going to have a wildly large influence on the data. In data analysis, that's often something you don't want to happen. When you're doing statistics, when you're doing science, you don't want your analysis to be brittle and to depend on the value of individual people. You want your analysis to be robust, and you don't want the results, the knowledge you're gonna get from this, to wildly change if a single person has different data. So, whenever you want to do something that's robust in that sense, then differential privacy is going to be a great fit.
Debra Farber 24:54
So then how can data scientists make the analysis more robust to better preserve privacy?
Damien Desfontaines 25:01
Well, if we go back to this average salary example: if what you want is some information about the salary of the people in a room, the average is actually not a great metric, because those outlier data points can make your metric swing wildly. And if you get a high result, you don't know whether it was because there was a single super-large outlier or because the average salary of the people in the room was actually that high. So, in that sense, you usually want to do one of two things: either consider other kinds of statistics - like using medians instead of averages, which are typically a little more robust - or you can say, "Okay, if we have outlier data points, we're gonna bound the contribution that they make." Maybe every salary above $1 million, we're just gonna pretend it's $1 million. This way, they're gonna have a bounded, measurable maximum influence over the data, which makes your statistics more robust. And it typically improves both the robustness of your analysis and how well differential privacy is going to perform for this use case.
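Here is a hedged Python sketch of that clamping idea (generic code of my own, not Tumult's API): bounding each salary at $1 million bounds the sensitivity of the sum, which in turn bounds the Laplace noise needed. For brevity, it treats the number of people as public.
```python
import numpy as np

rng = np.random.default_rng(1)

def dp_mean_clamped(salaries: np.ndarray, clamp: float, epsilon: float) -> float:
    """Differentially private mean with per-person values clamped to [0, clamp].

    After clamping, one person can change the sum by at most `clamp`, so
    Laplace noise with scale clamp/epsilon protects the sum. The count is
    treated as public here to keep the sketch short.
    """
    clipped = np.clip(salaries, 0.0, clamp)
    noisy_sum = clipped.sum() + rng.laplace(loc=0.0, scale=clamp / epsilon)
    return noisy_sum / len(salaries)

# Hypothetical group of 500 salaries around $80k, plus one extreme outlier.
salaries = np.append(rng.normal(80_000, 15_000, size=500), 150_000_000)
print(round(salaries.mean()))                                          # dragged up to ~$380k by the outlier
print(round(dp_mean_clamped(salaries, clamp=1_000_000, epsilon=1.0)))  # stays near $80k, plus DP noise
```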
Debra Farber 25:59
Yeah, that makes a lot of sense. What do you do if you don't yet know how the data will be used? I would hope that data scientists would try to formulate how data should be used before it's collected; but, you know, I know that's not always the case. So, what would be the approach if you don't know how the data will be used?
Damien Desfontaines 26:17
Well, this goes back to the process we were describing before. The reason why we spent a lot of time working on having a process that gets people to good outcomes and solves their data publication problems is because that initial assess stage, to understand the requirements better, is really going to make a huge difference in how happy people are going to be with the results. If you don't know how the data will be used, it's very hard to know what to optimize for. And, a little bit like machine learning has all of these parameters you need to fine-tune to get a good model in the end, differential privacy often requires some manual optimization to get to exactly the utility requirements you need.
Damien Desfontaines 27:01
The vaguer the utility requirements are, the more likely you are to discover too late that maybe the data is not good enough for what you wanted to do. All hope is not lost: if you don't know in advance exactly what the data is going to be used for, there are techniques that allow you to generate either synthetic data or statistics that are good enough to answer a large range of possible queries. But, in general, there's a tradeoff between how much flexibility - how many different unknown use cases you want to be able to tackle with the published data - and how well you're going to be able to tackle each one of those use cases. This tradeoff between utility on very specific tasks and the number of tasks that you can possibly answer is not just a differential privacy thing. There's a fundamental result that says that if you can answer too many different types of queries too precisely, then you can reconstruct the data and actually have catastrophic privacy loss on the inputs. So basically, differential privacy allows you to quantify that and to make sure you're not releasing too much. That tradeoff means that the more precise and explicit you can be about your success metrics, the better off you'll be with your deployment strategy.
Debra Farber 28:22
That makes a lot of sense. It also aligns with kind of the mantra that privacy folks have been saying for a long time: don't share data, or basically manipulate data, without first understanding what you're going to be doing and why. It's purpose-driven, really, right?
Damien Desfontaines 28:39
It's also in regulation, right? GDPR has this whole thing about purpose limitation, and understanding what the data will be used for is going to help you with the strategy. It's also going to make your compliance story much more rock solid, because you know what you're optimizing for, what problem you're solving, and why you're processing the data.
Debra Farber 28:56
Absolutely. Just a major tenet of privacy. Absolutely. So, I know there's an example of Tumult Labs having worked with the IRS on deploying differential privacy. Can you tell us a little bit about that use case?
Damien Desfontaines 29:11
For sure. Yeah. So, the use case we worked on with the IRS is called "College Scorecards." The goal of this data release is to help the Department of Education in the U.S. understand, and give public information about, the earnings of students who graduate with certain degrees at certain schools. So, what they do is they join data from the Department of Education, which knows which students go into which degree, with IRS data, which obviously contains very sensitive data about the taxes and the income that people declare. They join these two datasets together and then publish differentially private statistics on the data to allow people to get an idea of how much people earn after certain degrees.
Damien Desfontaines 29:12
All of the data is now published online on the website called the College Scorecards. We can also add the link in the description. And yeah, that was generated using technology that we developed. One interesting thing that I find sort of fascinating about that use case is that what we hear from these customers - from the IRS and Department of Education - is that the value provided during that engagement was not just being able to publish more statistics with better privacy guarantees (which is typically something we achieved with differential privacy), but also that it made a lot of tradeoffs much more quantifiable, much more tangible.
Damien Desfontaines 30:37
Before using differential privacy for these use cases, there used to be dozens and dozens of email exchanges between the two government agencies, one of them wanting more data from the other, and the other not being quite sure how to juggle the Department of Education's need for better data with, obviously, the very strong regulation that governs the IRS and forbids it from revealing individual information. So, before that, there were lots and lots of endless discussions, and it was very hard to quantify the risk or to understand the value. Because when you're not using anything formal to define privacy, you can't understand the additional risk of being a little more precise, of drilling down one more dimension in the statistics you're releasing, and so on. Differential privacy makes all of this explicit. So, it makes the process of choosing a tradeoff that works for everybody much smoother and much, much faster.
Debra Farber 31:31
Thank you for that. That's a really compelling example and just really illustrates the benefits of this technology. So, sometimes deployments of differential privacy have vulnerabilities susceptible to exploitation, based on what I understand. Can you tell us about some of the challenges here? You know, what does vulnerability mean in this context, as opposed to like a security context? And then, what approaches should data scientists take to avoid these vulns?
Damien Desfontaines 32:00
Right. So, as I was saying earlier, we're developing a platform called Tumult Analytics that enables these deployments of differential privacy. In developing this platform, we learned that there are a few very difficult engineering challenges that emerge when trying to take the results from the scientific literature on differential privacy and turn them into a really neat software package. One of these issues is security. It turns out that if you have a discrete probability background, you can probably open a textbook about differential privacy, or go to my blog, and understand the basic math behind the technique. It looks simple on the outside; the basic techniques are very simple. A comparison I like a lot: it's the same issue with cryptography. If you know a little bit of math and you want to understand how cryptography works, you can understand the RSA algorithm fairly easily; it's not rocket science. It's not complicated.
Damien Desfontaines 32:58
However, when you take these things and bring them down to the level of code, it's much, much more difficult to get right than you'd expect. In the cryptographic community, there are things like timing attacks and all sorts of other software vulnerabilities that can make the perfect math in your paper not actually translate into real guarantees in practice. And differential privacy is essentially the same. Any time you're trying to do something complicated, it's very hard to track the security of your program, the end-to-end privacy of your program. And even the very basic primitives can have surprising vulnerabilities. So, there are two things we're doing to tackle these questions, specifically on the security of the primitives.
Damien Desfontaines 33:42
One class of vulnerabilities has existed in the literature for a while, but we discovered new instances of that class. There's a translation from the continuous math in the papers: when a scientific paper says you take a random sample from a normal distribution, it means you're taking a sample with arbitrary precision - it's like a magic number with infinite precision; you would need an infinite sheet of paper to write it down completely. Of course, computers don't work this way; computers are finite machines. So, they translate this into a floating-point value, and in that translation, differential privacy can get vulnerabilities: the surprising behavior of floating-point values can lead to very subtle problems. It's hard to explain here, but we have a blog post with diagrams that hopefully explains it better. So, there, we basically developed our own algorithm that we can show to be secure against these vulnerabilities, and we shipped that in our platform.
Damien Desfontaines 34:42
The second class of problems, which I was talking about earlier, is the fact that even though the building blocks are simple, building complicated algorithms out of these building blocks quickly gets intractable and very hard to understand. To grapple with that complexity, we built Tumult Analytics on top of a software foundation we call Tumult Core, which is made of very small individual components that, individually, are very simple and very easy to audit, understand, and verify that they're doing the right thing. When you assemble these building blocks together, you get an end-to-end privacy proof that only depends on the validity of each individual component. And so, this modularity allows us to implement very complex use cases - we're helping the U.S. Census Bureau with data releases that are technically very, very complicated - and we can take on that complexity without it resulting in a lack of credibility or difficulty understanding the privacy guarantees, because all you need to do is audit the code component by component, which, by the way, third-party auditors have done for our code. And we get very positive feedback about that approach to software safety.
Debra Farber 35:53
Excellent. Where do you recommend that listeners of The Shifting Privacy Left Podcast should go to learn more about differential privacy? You know, are there communities or standards organizations or conferences that they should know about?
Damien Desfontaines 36:09
One of the things we did when we published that open source library in Python last December is that it comes with a series of tutorials that explain how to use it and how to generate your first differentially private statistics. We wrote it thinking that our idealized user is somebody with prior experience with data analysis: they know how to use Python, maybe they've used Pandas, or NumPy, or Spark, but they know nothing about differential privacy. And we walk them through very practical examples and scenarios - code that you can copy, paste, and modify - of how you actually make a certain data publication differentially private. So, I think it's a pretty good resource, and I would encourage people to check it out, especially if the way you learn best is by doing - by actually running some code, modifying it, and seeing how it works. If you're doing this, then also join our Slack, because we're there and we can answer questions, and we always love to get feedback on both the platform and the onboarding experience of the software library.
Damien Desfontaines 37:16
If you're more interested in understanding how it works on a more abstract basis, or in understanding more about the guarantees it provides, I have a blog post series on my personal website that also walks you through the basics of differential privacy and assumes very little technical knowledge. I've written it with the idea that you know what a probability is, but that's pretty much all you need to figure out how this stuff works. We can also add the link to that; I can send you the link.
Debra Farber 37:49
Excellent.
Damien Desfontaines 37:49
You're asking specifically about communities and standards and conferences, and these are things that don't really have a super-strong existence right now, because differential privacy is still very new. There have been talks in the community about doing some standardization - I know there are some efforts along these lines at NIST - but it's very much the early days of that. I don't know about communities; we opened the Slack channel two months ago, so it's very new and nascent, but you're welcome to join and ask any questions about your use cases. Yeah, conferences - most conferences I know of are mostly academic. I had the pleasure of participating in Data Privacy Week last month, where I think I met you, Debra.
Debra Farber 38:32
Yes. Yeah. The Rise of Privacy Tech's Data Privacy Week event.
Damien Desfontaines 38:36
That's right. That's a nice place to meet people who work in the space and learn from us.
Debra Farber 38:42
I agree. Thanks for plugging that one.
Debra Farber 38:47
Okay, so my last question to you is, how do you see this space evolving in 5-10 years? I know that's far off, you know, but maybe that's too far off. Maybe you want to do two years. But what's next in the evolution of differential privacy?
Damien Desfontaines 39:02
So, I'll answer the long-term question first because I am gonna go back to the comparison I had earlier with cryptography.
Debra Farber 39:10
Okay.
Damien Desfontaines 39:11
20-25 years ago, cryptographers in academia were starting to lay the formal foundations of the science. Right? They were starting to make the case that if you want to deploy cryptography, you need to have formal proofs that your algorithms are solid. You need proofs that show very strong security guarantees. You need a layer of mathematical grounding underneath everything you do. And you need trusted implementations, made and maintained by experts, that a large number of people use. At the time, when they were trying to bang that drum, a significant fraction of the industry was like, "You know, this HTTPS thing looks complicated. I'm not really doing anything sensitive on my website. I probably don't need that." Or people were rolling their own cryptographic thing and saying, "I can't see a problem with this. It looks decent to me." And 25 years ago, it was kind of normal to hear these kinds of arguments. Obviously, nowadays, we know this is nonsense. We know we need formal guarantees about what we're doing. We know we need secure implementations that are well maintained, because getting this stuff right is complicated, and the hand-wavy way of doing security just doesn't cut it.
Damien Desfontaines 40:25
I think today, in safe data publishing and differential privacy, we're in a similar place to where cryptography was all that time ago. We still have to bang that drum to tell people, "No, if you remove the names and identifiers from a dataset, that doesn't mean it's safe. No, just because you came up with a clever way of adding some randomness here or there, or generalizing some of the attributes, doesn't mean you're doing privacy right. If you take any PhD student and ask them to break your stuff, they will probably break your stuff in a matter of weeks." But this has not yet completely permeated the industry; it's still an acceptable thing to not use any formal guarantees. My hope, to answer your initial question, is that in 10 years that will have changed, and it will be normal for people to rely on strong, provable guarantees when they want to achieve a high level of privacy. So, that's the long term, and I hope we can get there. Yeah.
Debra Farber 41:28
I think you're right. I think this is kind of the rise of privacy assurance and guarantees.
Damien Desfontaines 41:34
I also want to add one thing, because when you ask how I see differential privacy evolving, I also think about what we're doing right now to move in that direction. And I think the number one thing that we're investing heavily in right now, and what the field needs, is usability. Today, if you start from zero and you want to deploy a differentially private algorithm, it's still pretty complicated. Right? Up until a few months ago, I think the existing open source tools out there were meant for experts and very hard to use. We're trying to change this, but it's obviously not a flip-of-a-switch process; it's iterative. We're making the interface better. We're inventing new science to make it easy to have good defaults for everything, and so on.
Damien Desfontaines 42:19
We need to make tools and processes that are standardized, that are easy to use, and that require the least amount of technical background possible, in order to put the technology in the hands of more people and to drastically increase the number of use cases. That's what we're investing in today, and that's where I'm really excited to see this evolving in the very near future - hopefully, a couple of years.
Debra Farber 42:41
Yeah, I think so too. I think in a few years, you'll have more of that abstraction layer - not that you just press a button, but you put in certain parameters, then kind of press a button, and the guarantees are there. Right? So, making it more usable makes a lot of sense. And, I think a lot of training, too. Right? Those who use differential privacy are typically data scientists, and I know privacy is kind of a new vertical for them to think about; they might not really have understood what's involved in privacy until learning about these new techniques and systematizing them into their workflows. So, it makes a lot of sense. Anything else before we close for today?
Damien Desfontaines 43:23
No, I think it was a great discussion. Thank you so much for the fantastic questions, Debra. It was great.
Debra Farber 43:29
Oh, I'm so glad you had a good time. I did as well. Thank you for joining us today on Shifting Privacy Left to discuss differential privacy techniques, vulnerabilities, and where to plug into the community. Until next Tuesday, everyone, when we'll be back with engaging content and another great guest.
Debra Farber 43:48
Thanks for joining us this week on Shifting Privacy Left. Make sure to visit our website shiftingprivacyleft.com where you can subscribe to updates so you'll never miss a show. While you're at it, if you found this episode valuable, go ahead and share it with a friend. And, if you're an engineer who cares passionately about privacy, check out Privado: the developer-friendly privacy platform and sponsor of the show. To learn more, go to privado.ai. Be sure to tune in next Tuesday for a new episode. Bye for now.