This week, I'm joined by Katharine Jarmul, Principal Data Scientist at Thoughtworks & author of the forthcoming book, "Practical Data Privacy: Enhancing Privacy and Security in Data." Katharine began asking questions similar to those of today's ethical machine learning community as a university student working on her undergrad thesis during the war in Iraq. She focused that research on natural language processing and investigated the statistical differences between embedded & non-embedded reporters. In our conversation, we discuss ethical & secure machine learning approaches, threat modeling against adversarial attacks, the importance of distributed data setups, and what Katharine wants data scientists to know about privacy and ethical ML.
Katharine believes that we should never fall victim to a 'techno-solutionist' mindset where we believe that we can solve a deep societal problem with tech alone. However, by solving issues around privacy & consent in data collection, we can more easily address the challenges of ethical ML. In fact, ML research is finally beginning to broaden and include the intersections of law, privacy, and ethics. Katharine anticipates that data scientists will embrace PETs that facilitate data sharing in a privacy-preserving way; and, she evangelizes an end to the normalization of sending ML data from one company to another.
Copyright © 2022 - 2023 Principled LLC. All rights reserved.
Debra Farber 0:00
Hello, I am Debra J. Farber. Welcome to The Shifting Privacy Left Podcast, where we talk about embedding privacy by design and default into the engineering function to prevent privacy harms to humans, and to prevent dystopia. Each week, we'll bring you unique discussions with global privacy technologists and innovators working at the bleeding edge of privacy research and emerging technologies, standards, business models, and ecosystems.
Debra Farber 0:27
Today, I'm delighted to welcome my next guest, Katharine Jarmul, Principal Data Scientist at ThoughtWorks, and author of the book, "Data Wrangling with Python," and the forthcoming book, "Practical Data Privacy: Enhancing Privacy and Security in Data." She's interested in ethical and secure machine learning in real-world production systems and is passionate about automating data workflows and distributed data setups.
Katharine Jarmul 0:59
Thanks, Debra. I'm excited to be here.
Debra Farber 1:02
Excellent. Well, I know we have a lot to talk about today. In fact, why don't we just start off with you telling us a little bit about how you ended up focusing on ethical machine learning; and then with a focus on privacy, what motivates your work in this space, and what was your career path?
Katharine Jarmul 1:18
It probably starts way, way back when I was working on my undergrad thesis, which was basically doing natural language processing on coverage of the war in Iraq.
Debra Farber 1:31
Wow. That's fascinating. Were you working for the military?
Katharine Jarmul 1:35
No, I was a university student, and I wanted to investigate the difference in reporting between embedded and non-embedded reporters. And for those of you that maybe weren't alive or paying attention to the media landscape at that point in time, the war in Iraq was one of the first wars where journalists were officially deployed embedded in troops. Whereas, of course, you had other war reporters that were kind of, you know, working the same way as normal freelancers, or that traveled there on the budget of their own media agencies and so forth.
Katharine Jarmul 2:09
I wanted to study whether there were statistical differences between these two groups of reporters. And indeed, what I ended up finding is that there were statistical differences between reporters who were fluent in Arabic and those who were not, which was a pretty fun realization. And that probably sparked my initial interest in how we think about the ethics of whatever topic we're studying in a data-driven manner. And then, for many years, I went and did other things. I was a teacher. I was a data journalist. And then, I ended up back in the field of natural language processing, and this ethical question at that point in time had shifted because I rejoined natural language processing around the time that we started shifting to deep learning methods, and they started working.
Katharine Jarmul 3:00
So this was, you know, 2011, 2012, and we started having Word2Vec, and at that point in time, it greatly shifted the way that we build language models. It also brought up a lot of ethical concerns about these less-supervised or completely unsupervised setups, which we also see today with the GPT-3-based models like ChatGPT. All of these are, you know, shown massive, massive spans of text, much of it collected from the internet, much of it collected maybe without consent - we can debate that later. We show them to learning algorithms, and we also expand the memory of these models. And the implications from an ethical standpoint become quite large because, when they're trained with that much internet text, quite a lot of really awful stuff from the internet ends up in these language models and ends up being repeatable. So, I did some research some years back on Word2Vec that kind of started my work in the ethical machine learning space.
Debra Farber 4:13
Wow. So, then how did you get inspired to write a book on practical privacy because, you know, that's clearly not centered on machine learning?
Katharine Jarmul 4:22
Yeah, good question. So, I was in the quote, unquote, 'ethical machine learning' space, I guess you could say, for a while; and I was intrigued by a lot of the work by the communities there. But, what quickly became apparent to me is that the technology part of the ethical machine learning space is probably the easiest part and the least impactful part. And I don't mean 'easy' in the sense that it's not really important work that researchers will continue doing; I mean it in the sense that the hardest part of ethical machine learning - and really of any ethical approach to technology - is not falling into a 'techno-solutionist' mindset, where we believe that we can solve a societal problem with technology.
Katharine Jarmul 5:15
And I think that some of the amazing work that, for example, Timnit Gebru and others are doing...the DAIR Research Lab that Timnit runs - these researchers were really asking the right questions and pushing the field forward from a multidisciplinary point of view. And I asked myself, "Hey, what's the contribution that I can make?" At the same time, obviously, being an NLP researcher, or even an NLP engineer, you end up seeing quite a lot of what you later realize are 'private datasets.' And, I became interested in that part of the question and how it related to the ethical part of the question.
Katharine Jarmul 5:59
There were several really interesting pieces of work, including Cynthia Dwork's work on 'Fairness Through Awareness.' And, via those pieces of work, I started wondering to myself, and then working on, "If we can solve the privacy parts of the problem and the consent parts of data collection, and so forth, maybe we also have a way to essentially jumpstart the ethical parts of the problem; and, maybe that's an area where I can contribute something meaningful."
Debra Farber 6:32
I'm glad that you're assisting in that area, and that you're writing a book on this because, I mean, my podcast is called Shifting Privacy Left. Right? So the idea is that you're addressing privacy earlier in the product development process, before you ever ship the code. Right? And, I focused so much on the GRC side for so long, and kept going, "Well, we're just in a compliance 'paper chase' right now." It just keeps getting buried in, you know, DSARs and data deletion requests and making sure that the new rights for individuals are being obeyed by organizations (which is important). But I kept trying to make the argument in multiple companies that if you just train up your data scientists to use privacy enhancing technologies, you could not only prevent some of the unethical problems and the compliance paper chase, and minimize your risks, but you could also increase and unlock the value of your data.
Debra Farber 7:37
And so, yeah, I'm just really excited you're writing a book on this to help educate more in that space, especially when you say it's one of the easier problems compared to some of the other machine learning challenges. I'm guessing, when you say easy, is it because there is an assurance you can kind of place around the inputs and expect a certain output based on math; or is it something else?
Katharine Jarmul 8:03
I think, for PETs and so forth...I guess it depends on the privacy enhancing technology. So, one of the things that I found when I got interested in this is that it was really hard to learn about these technologies. Most of the media that was available was research papers, and it was really starting from a research level and research point of view - which I strongly encourage; it's great. But, one of the reasons for writing the book is that it's a fairly-steep learning curve. Once you have the basic building blocks, though, it's quite easy if you already have a data scientist brain to kind of see where they fit and to think through, "Okay, if we can do this mathematically, it will fit with these types of models or systems or analysis I'm trying to do."
Katharine Jarmul 8:54
And, I think most of the basis of PETs, as you already know, is math and statistical knowledge, which, of course, a lot of data scientists have already honed and focused on for most of their career. So, I think the communities end up lending themselves to each other in a really nice, supportive way. And, I don't think it's always easy. I guess what I think is that maybe it's an easy step in comparison to questioning what is right or wrong, or what the moral implications of machine learning in our world are - which, of course, is a huge, really difficult question that, as a technologist...I can answer as a person, but as a technologist, I don't really have the right skillset to answer or address that question. What I do have is a skillset where, if there are smaller, more actionable parts of the problem, and the addressing of those issues is led by a multidisciplinary team of experts whose expertise ranges well beyond what my skills are, then I can help by being the one that knows how to do the math, or knows how to implement or leverage this or that cryptographic method. If I can be that person, then I can contribute in a meaningful way; and, I think that's...to me, that seems easier. Of course, if you're not skilled in math and statistics, maybe other parts of the problem would be easier.
Debra Farber 10:33
That makes sense, yeah.
Katharine Jarmul 10:34
Other people's problems are like, "Oh, that seems really hard. But this one, I could do."
Debra Farber 10:42
So, who is the intended audience for the book, and what do you hope readers will come away learning?
Katharine Jarmul 10:49
So, I wrote the book primarily with a focus on data scientists because, again, they have this nice skill set that I think makes it a little bit faster for them to learn some of these fundamental or foundational aspects of privacy enhancing technologies. But, I've gotten some feedback. I'm at ThoughtWorks, which means I have a wide network of technologists now, and I had some internal reading groups and feedback from internal folks. And interestingly enough, there were a lot of people from QA and software and heads of technology that wanted to read it. And I was a bit concerned. I was like, "Well, you're not really the target audience. I don't know if it will all make sense to you." But actually, I got some overwhelming feedback from those folks that they learned something. They probably didn't like going through all the math sections, but they did like kind of the overall picture.
Katharine Jarmul 11:47
As you know, privacy engineering is as much about figuring out how to architect it into a system, and figuring out how to govern it, as it is about understanding the mathematics behind differential privacy or encrypted computation. And so, I think there's a little bit in there for everyone; but, the people who I wrote it for are data scientists.
Debra Farber 12:09
That makes sense. And I think it's essential to know your audience. Even in this podcast, I've chosen to make it pretty technical content, but I think anyone could pick it up and learn something interesting. It's just a matter of willingness and whether or not you have the curiosity to kind of delve into that world.
Katharine Jarmul 12:26
Yeah, I think having a podcast like this is so great because, like I said, when I started my journey, it was like, "Want to watch this five-hour video of a conference on privacy enhancing technologies?" And I was like, "Not really, but if it's the only way, I'll do it."
Katharine Jarmul 12:46
It's really nice that there's material that's highly-technical but still accessible to a wide audience. Right? Because there's a lot of technologists who really care about privacy.
Debra Farber 12:57
Yeah, I agree; and, I think there's definitely a thirst amongst privacy engineers and the community at large for more knowledge-sharing, more content, and more upskilling. And, you know, there are some people who have just kind of transitioned into a role of quote, unquote, 'privacy engineer,' and it means something different to the different companies that are hiring for that title. So, as that world is shaking out to kind of figure out what a privacy engineer does within a business and, you know, what the overlapping skill sets are and such, I'm hoping to be a small part of helping out there by bringing on guests such as you - guests who can go deep into their area of expertise and then help surface it through questions, meeting people where they're at to better understand one concept at a time.
Debra Farber 13:49
So, I'm having the time of my life. And I think by the end of this, I'll convince you to start your own podcast. So, okay, you have said 'ethical machine learning' a few times. In fact, I included that in your intro, and I just want you to define what ethical machine learning means to you. What areas must be addressed for machine learning to be considered ethical?
Katharine Jarmul 14:12
I wonder this often myself, and I don't think I'm the right person to define ethical machine learning. I want to point out the work of DAIR again, which is Timnit Gebru's research lab. Timnit was one of the researchers who worked on Ethical AI at Google and then was fired for criticizing some of Google's practices around large language models and the biases encountered in these large language models - societal biases, I mean, not weights and biases, so to speak. And DAIR is doing some really amazing work in the space and continues, I think, to be the thought leader in what we should call 'ethical machine learning.'
Katharine Jarmul 15:00
And then there's also The Algorithmic Justice League, which is Joy Buolamwini's work on creating algorithmic justice and defining what it means. And there's also the FAT ML community - 'Fairness, Accountability, and Transparency in Machine Learning' - which is an amazing research community meeting yearly, and I think now also hosting workshops pretty regularly at NeurIPS and some of the other large machine learning conferences. Most of these, though, come from a quite technical point of view. Maybe some are more multidisciplinary, but they're still technical points of view. But, I think there's a broadening area of research looking at legal intersections of ethical machine learning - how we legally start to define ethics - and I've also met quite a few artists and philosophers working in AI in recent years. I think that conversation is now happening across multiple areas of expertise and in multiple locations.
Katharine Jarmul 16:01
And it must be said, too, that, you know, a lot of machine learning is being defined mainly in the Global North. The impact that it then has on the Global South, and on datasets emerging from the Global South, is also a huge thing that needs to be more hotly-debated than it currently is; and, I think that can easily be shown with the treatment of workers - Kate Crawford's work on 'Anatomy of an AI System,' and looking at folks, like the folks in Kenya paid $2 a day, you know, to train GPT-3 and so forth.
Katharine Jarmul 16:42
And so I think there are so many aspects we can look at: from what it means to create an AI system, and how we can make workers' rights and the remuneration of that work more evenly-distributed, and make sure people know how these things are actually collected and trained, to how they're deployed and how the models themselves behave.
Katharine Jarmul 17:07
But then also, what are the after-effects of the deployment of these systems? So, the work at Data & Society, and some of Danah Boyd's work in the space, is also looking at: what does it mean when these systems are deployed? How do they unequally impact different parts of society? And, we can see this very, very clearly when we look at gig workers and the rise of gig work. You know, maybe the model is fair, but how it impacts, let's say, drivers versus the drivees (the customers) is maybe massively different. And the 'quality of life' impact of these algorithms is also massively different. And so, I think it expands so hugely that you can't capture it in one sentence. I mean, there's really important work on all of those aspects. And, a little bit of what I've contributed to, and some of what the book talks about, is how maybe privacy can help some of these bits become a bit more fair, or at least more consensual. And that, I think, is a pretty important piece of the puzzle.
Debra Farber 18:18
Absolutely. And that was a great answer. I think it helped raise the fact that it's not a simple answer, and that the ethics part goes beyond just the model itself to what it takes to get that model created - and then, especially, the gig workers and such. So, thank you for that.
Debra Farber 18:38
So, I know that fairness, accountability, and transparency are core to ethical machine learning, especially in the way that the EU frames it. What's interesting to me is that those are also three of the goals of ethical privacy. GDPR lists fairness, accountability, and transparency as some of the goals of that regulation as well. So, it almost seems to me that privacy and ethical machine learning have quite a bit of overlap. Are you seeing that as well? Do people talk about trying to meet the bar for fairness, accountability, and transparency across both dimensions - both ethical machine learning and privacy - within the data science community, or not?
Katharine Jarmul 19:23
I think that the bridge is starting. I think what's difficult is that privacy has kind of not been the central question of a lot of the ethical machine learning research; and in a lot of the privacy research, you take on certain assumptions - that the users have been contacted, or the data subjects have consented, and so forth. But, what we're seeing happening, or at least what I'm noticing, is that those conversations are starting to get closer to one another. And regulation like the GDPR and the AI Act in Europe has explicitly put them next to each other, which I think starts to make you say, "Ah ha, yeah, transparency. If we're talking about transparency in machine learning, what does it mean? Well, maybe it means providing explanations for our predictions; or maybe it means allowing users the ability to opt out; or maybe it means both; or maybe it means some combination that we haven't thought of yet."
Katharine Jarmul 20:27
And I think that maybe it even means transparency about the core training sources and the consent of those training sources. All of this starts to bring the two topics much closer to one another. I think it's really interesting from a fairness aspect because I wonder to myself, "How fair is it if your code was used, for example, to train GitHub's Copilot, and GitHub gets money for - we won't go into intellectual property rights - essentially something that you created? Where does fairness play into that? Where does privacy play into that?" And I think all of this touches a lot on Helen Nissenbaum's work on 'contextual integrity' and trying to think through what the context of machine learning and these large-scale data collection systems is in connection to the way that we do things now. I think most people haven't really adapted how they interact or post things online to the way that we do machine learning today.
Debra Farber 21:35
It's fascinating. As a consultant, are you seeing additional trends across different data science teams when it comes to privacy and new ethical approaches to sharing data? Are they trying to solve for mostly the same needs, or is there a lot of variety?
Katharine Jarmul 21:55
I mean, I think data sharing is still such a problem-ridden part of the way that we do data. In fact, you know, it's one of the reasons why I wanted to join ThoughtWorks. Before ThoughtWorks, I was working in the field of encrypted machine learning with a team formerly called Dropout Labs (now called Cape Privacy) that, at the time at least, was working on how to do encrypted deep learning with multiple data owners. Really fun stuff - I learned a lot about crypto.
Debra Farber 22:32
The other kind of crypto.
Katharine Jarmul 22:33
Debra Farber 22:35
The original - the OG of crypto.
Katharine Jarmul 22:39
And, you know, the problem we were trying to solve is: if you want to build a machine learning model together, why share data by shipping it back and forth and just signing a bunch of contracts? Why not share data encrypted and still get the value? Then you can make decisions about who can use the model, who can see the model, and so on and so forth. And, I think that with data sharing, when you look at it today, the technologies that we have for how we could share data are so advanced, and yet what's implemented is like from 1998 or something. Like, "Why, why, why is this the way that we're doing it?" And I think, personally, what I would love to see in my career is the shift to embrace privacy enhancing technologies in data sharing, and to no longer normalize the idea of actually sending data to another company, no matter what contract you have.
Debra Farber 23:41
You mean not sending it at all, or obfuscating it in some way that's safe?
Katharine Jarmul 23:45
Yeah, I mean, either use encrypted computation in areas where it makes sense - where, really, you only need the result - or leverage tools like differential privacy, or even make data products available for the people that you'd normally share data with, and instead share some sort of insights, hopefully with something like differential privacy as part of the protection. And to me, that much more mirrors the legal language; when you read through these data sharing agreements, they're basically asking for something like this. And I think the reasons it's not more often deployed are: A) not enough knowledge about privacy enhancing technologies; and B) not enough of the legal staff really knowing that these technologies are actually ready to go into production systems. Are some of them still cutting-edge? Yes, so to speak. But, these are by no means something that can't be supported anymore.
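[Editor's note: To make the idea of sharing differentially private insights instead of raw data concrete, here is a minimal sketch of the Laplace mechanism, which underlies the differential privacy tools mentioned above. The function name, the toy age values, and the parameter choices are illustrative, not from the conversation; real deployments also track a privacy budget across repeated queries.]

```python
import numpy as np

def dp_mean(values, lower, upper, epsilon, rng=None):
    """Release the mean of `values` with epsilon-differential privacy
    via the Laplace mechanism. Clipping to [lower, upper] bounds the
    sensitivity of the mean: one person's record can shift the result
    by at most (upper - lower) / n."""
    rng = rng or np.random.default_rng()
    clipped = np.clip(values, lower, upper)
    n = len(clipped)
    sensitivity = (upper - lower) / n
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return clipped.mean() + noise

# A data holder shares only the noisy aggregate, never the raw ages.
ages = np.array([23, 35, 41, 29, 52, 61, 34, 47, 38, 30])
released = dp_mean(ages, lower=18, upper=90, epsilon=1.0,
                   rng=np.random.default_rng(42))
```

The partner organization receives `released` - a useful statistic with a quantified privacy guarantee - rather than the underlying records, which is roughly the arrangement most data sharing agreements describe in legal language.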
Debra Farber 24:46
I have a feeling we're at the dawn of an explosion of privacy enhancing technology use. The UN just came out with a report, maybe this past week - its research into PETs. I have not read it yet. Have you, by any chance?
Katharine Jarmul 25:03
Debra Farber 25:04
It's okay. I haven't either. But, we've even seen the U.S. government come out with a request for information, I believe, on privacy enhancing technologies. The Rise of Privacy Tech delivered a paper, or a response basically, to address that - suggesting perhaps we look a little wider than just PETs and maybe expand the category to other privacy tech solutions that could meet the needs of organizations. But, I do think there's a lot more movement where governments are investigating privacy enhancing technologies; and once they come out with these reports, it almost kind of percolates up the chain and, you know, lands on someone's desk in business where they're like, "Maybe we should look into this." It looks almost like the government is putting a stamp of approval on them in some way. What are your thoughts on where PETs are gonna go in terms of their uptake within organizations in the near future - let's say, the next two years?
Katharine Jarmul 26:08
Yeah, and just to contribute: I know that the Singapore government has also launched some investigation into PETs and more wide-scale use, as well as the UK government. So, I just wanted to double-underline your comment there. I think the amount of attention that this is getting from regulatory agencies and from governments and policymakers is definitely growing with time. I myself have been working in the area of privacy technology now for six years, and I've been able to see how much we've shifted - how much has gone from, "Oh, wouldn't it be nice if we could run this at scale?" to "Okay, this is actually something that these companies are already running at scale, or at least running on these types of systems; maybe this can be more widely deployed." Even just the number of tools that you can now use to support some of the basic and better techniques is growing - things like format-preserving encryption for pseudonymization, or things like differential privacy (although one has to be careful with what you buy there and whether it can fulfill end-to-end privacy guarantees). I think it's pretty amazing to see how it's happening.
Katharine Jarmul 27:25
And I had a chance, as part of the book, to talk a lot with Damien Desfontaines, who is one of the core researchers at Tumult Labs, which worked with the U.S. Census Bureau on deploying the differentially private U.S. Census; and, it's just really cool that this stuff is actually already happening. At large-scale technology companies, at many of the FANGs, at some of the cloud providers, and even at places like the U.S. government, these are already being used at scale. And yes, it's an investment; it's not, you know, 'set it and forget it.' But, it's an investment that people are actively making. And, I think that's really a testament to where the field has gone and the fact that it's out of the research lab now and into production systems.
Debra Farber 28:14
Makes sense to me. So tell us, what are some of your favorite privacy enhancing technologies?
Katharine Jarmul 28:19
Oh, that's a very difficult question.
Debra Farber 28:21
What's your pet PET?
Katharine Jarmul 28:24
Do you have a pet PET?
Debra Farber 28:26
I'm not sure if it's necessarily considered a PET, but I am really excited about self-sovereign identity. So, that's one that I dive into a lot. I haven't actually done a podcast on it yet, but that one really excites me about the future - mostly because if we could get enough companies to implement the protocols and standards and such, you know, that effectuate self-sovereign identity and create that ecosystem - which is going to take time; this will take about 10, 15, 20 years - but if we could do that, I do believe that like 99% of the privacy problems we have today would go away, given that individuals would be more in control of how their data is used or not, or be able to revoke it; and you'd have all these, you know, zero-trust relationships where you can share data about yourself without revealing the underlying data.
Debra Farber 29:20
So, I could say I'm over 21. Right? Because my bank tells a company that I'm over 21, but I don't have to show my ID that has my address on it and other personal information. So, yeah, I think it hits a whole bunch of privacy problems with what seems like one giant stone, but it's really like a whole host of protocols, and is still in development. So, it's not exactly like ripe for primetime yet, but it's the one that I'm most excited about.
Katharine Jarmul 29:49
Yeah, that one is really cool. I love the idea of keeping data as distributed as possible. So, I'm a big fan of distributed data setups, and I think the hardest part of that question becomes validation of identity in a non-centralized manner; and therefore, I think that's one of the key building blocks. Probably my favorite PET is 'multi-party computation' because that's the cryptography that I had a chance to work on and work with. And I think that, in combination with distributed data and self-sovereign identity, it could create a fundamentally different way that we do data today, and could democratize - and make much more privacy-friendly - the way that individuals control and manage their data: when and how they contribute to analysis, when and how they contribute to machine learning models, when and how they contribute to data sharing, and so forth. And, I think that could be extremely powerful. But obviously, that means getting each of those, you know, to the state where they need to go - where I could say, "Hey, I actually don't want any data shared from my device anymore. I'd like to keep all the data local, and I'd only ever like to share it in encrypted form; and you are only allowed to decrypt computations that I contribute to if you can prove to me that this many participants have taken part, or if you can prove to me that differential privacy noise was added to the tensors at some point in time, or if you can meet these other requirements that I have."
And I think there's a possibility that we get there - and I would love to see it in the time of my career. All of the pieces and the building blocks we know are there; they just need to be kind of adopted, as you said, by a larger number of companies, and also made more present and more user-friendly for the users, who I think would opt into this stuff if they knew it was available and if it was shown to them in a way where they can understand it.
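[Editor's note: The multi-party computation idea discussed above often builds on additive secret sharing. Here is a toy sketch of that arithmetic - the party count, salary figures, and function names are all illustrative - showing how three data owners could compute a joint sum without any of them ever pooling their raw values. Real MPC protocols add authentication, malicious-security checks, and support for multiplication, none of which is shown here.]

```python
import secrets

PRIME = 2**61 - 1  # field modulus; each share is uniform in [0, PRIME)

def share(value, n_parties):
    """Split `value` into n additive shares that sum to it mod PRIME.
    Any subset of n-1 shares is uniformly random and reveals nothing."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

def reconstruct(shares):
    """Recombine shares into the original value (mod PRIME)."""
    return sum(shares) % PRIME

# Three hypothetical data owners jointly compute the sum of their
# private values. Each owner sends one share to each party; every
# party adds the shares it holds, and only those partial sums are
# combined at the end - the raw inputs are never pooled anywhere.
salaries = [52_000, 61_000, 47_000]
all_shares = [share(s, 3) for s in salaries]
partial_sums = [sum(owner_shares[i] for owner_shares in all_shares) % PRIME
                for i in range(3)]
total = reconstruct(partial_sums)  # equals sum(salaries)
```

Because addition commutes with the sharing, each party computes on meaningless-looking numbers, yet the reconstructed `total` is exact - which is the "share data encrypted and still get the value" idea Katharine describes.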
Debra Farber 32:08
Yeah, I agree. One of the things that gets me excited about privacy is knowing what's being worked on - what privacy enhancing technologies, or privacy tech, are creating for the future. So, I've gone into some standards and am looking at certain specifications, and I'm finding that it's giving me a renewed sense of hope and optimism for privacy. Whereas, most people looking at the world are, you know, feeling powerless about the amount of money that's made from them through advertising, or their data being taken without, you know, consent; and just feeling like they don't have the ability to have control over their own information. Whether or not you can actually be in control of your own data is not the topic I want to go into; but the feeling of being out of control right now is what makes people super pessimistic and wary of technology and distrustful. And so, it's just a weird dichotomy to have most people feeling that way while, as an expert, I'm telling them that I think the future is going to be pretty rosy, in my opinion. You know, there's definitely a disconnect that feels uncomfortable. But, you know, that's where the evangelist in me comes out and starts trying to paint the picture of the future.
Katharine Jarmul 33:34
Yeah, exactly. I don't know about you, but sometimes when I talk with folks that aren't in the privacy space, or in the technology space, they're like, "Whoa! I didn't even know that was possible. That's so cool. Why can't I do it on my iPhone right now?" So I think there's a growing consumer demand. And as privacy engineers and other privacy technologists who are listening to your show deploy these technologies, hopefully they're also chatting with the product people and saying, "Hey, you know what? This is a perk that we can offer people. This is something we can advertise: 'Hey, look, we're handling data responsibly.'" And I think it can be truly a win-win: it also reduces risk and all these other things. So, I think it can be a big business win as long as businesses say, "Hey, this is something that we really want to do and want to invest in."
Debra Farber 34:35
Right. I want to turn to talking about adversarial attacks. This is something that's really interesting to me because, well, my fiancé is a hacker. He works in bug bounty, and as a result of being together for eight years, I've learned a lot about that space, and I find it fascinating. What I see here is that adversarial attacks on machine learning models could be things like adversarial input, feature and model extraction, certain privacy attacks - and I'd love for you to unpack what that means in the machine learning space, because it's definitely not the same thing as security per se. Like, you're not going to have a bunch of hackers that you don't know submit bugs through a bug bounty platform; their ability to do blackbox testing is going to be, I would think, pretty impossible. Right? For machine learning, wouldn't you have to have access to the models?
Katharine Jarmul 35:34
I definitely think more people should be doing that. I mean, even some of what we see with the prompt engineering and prompt hacking of the GPT-3-based models - it's pretty extraordinary how quickly folks with a hacker mindset learn clever ways to say, "Hmm, can I get it to show me the prompts that it's actually getting? Can I get it to reveal anything about the training data?" All of these are active things that are happening.
Katharine Jarmul 36:06
And one of the other things that's happening in the machine learning space is we're starting to ship more models to the edge. So, we're shipping models to edge devices, to phones, to cars. And this means that the model is literally sitting on a piece of hardware where somebody can probably find a USB port, or another port, to extract it. And then, this person has the model. So, let's even forget about advanced stealing techniques: you have direct access to the model. And when you have direct access to the model, if you know what you're doing, there's a number of adversarial attacks that you can plan, that you can propose. You can also use the model to reveal information about the training data and so forth.
Katharine Jarmul 36:55
So, one of the classic ones is the 'model inversion attack,' where you pick a particular target class and have the model reveal to you what it knows about that class. In the case of, let's say, facial recognition or voice recognition or anything biometric-related, this is extremely dangerous because the target is the biometric details of that individual. So, it's dangerous not only from a security perspective, but also a privacy perspective.
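The core mechanic of model inversion is gradient ascent on the *input* rather than the weights: with white-box access, the attacker optimizes an input until the model scores it highly for the target class. The tiny logistic "model" below is a stand-in invented for this sketch - a real attack would target an actual trained (and possibly extracted) model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for an extracted model: one logistic unit with fixed weights.
# In a real attack these come from the stolen model, not from a sketch.
w = rng.normal(size=8)
b = 0.0

def model_confidence(x):
    """Model's confidence that x belongs to the target class."""
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

def invert(steps=300, lr=0.5):
    """Gradient ascent on the input to maximize the target-class score.
    The reconstructed x approximates what the model 'thinks' the class
    looks like - for biometric models, that can be a face or voiceprint."""
    x = np.zeros(8)
    for _ in range(steps):
        p = model_confidence(x)
        x += lr * p * (1 - p) * w  # d(sigmoid(w.x + b))/dx = p(1-p) * w
    return x

x_hat = invert()
assert model_confidence(x_hat) > 0.9  # reconstruction scores highly
```

Note the privacy point Katharine makes: nothing here requires the original training data, only access to the model itself.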
Katharine Jarmul 37:25
But then there's other attacks like 'membership inference attacks,' which try to answer, "Was this person, or was this example, in the model's training data set?" And those have become more and more advanced over time - to the point where there's now more generalized ways you can do membership inference to give you details about the types of people, or the types of population, that the model saw in the training data.
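The simplest form of the membership inference idea is a loss-threshold test, as in Yeom et al.'s formulation: models tend to have lower loss on examples they were trained on, so confidently-correct predictions suggest membership. The confidence vectors and the threshold below are hypothetical values for illustration:

```python
import numpy as np

def cross_entropy(model_probs, true_label):
    """Cross-entropy loss of the queried model on one example."""
    return -np.log(model_probs[true_label] + 1e-12)

def is_member(model_probs, true_label, threshold=0.5):
    """Loss-threshold membership inference: flag low-loss examples
    as likely members of the training set."""
    return cross_entropy(model_probs, true_label) < threshold

# Hypothetical confidence vectors returned by a queried 3-class model:
trained_on = np.array([0.97, 0.02, 0.01])  # confidently correct -> low loss
unseen     = np.array([0.40, 0.35, 0.25])  # uncertain -> higher loss

assert is_member(trained_on, true_label=0)      # flagged as likely member
assert not is_member(unseen, true_label=0)      # flagged as non-member
```

This only needs query access to prediction confidences, which is part of why, as discussed above, even black-box deployments leak information about their training data.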
Katharine Jarmul 37:52
And within all of that, there's also been some really interesting work on model explanations and on outliers in these large language models and other large machine learning models. We can think of the multi-dimensional vector space that the model has been trained on - has seen and now essentially has saved, in a variety of ways, in its architecture - and we can use the model itself to find sparse regions of the decision space and actually output that information. And that is readily seen and easily hackable, even by people without machine learning knowledge. When we look at things like Stable Diffusion or DALL-E: give it a prompt with an obscure artist's name, something you think it maybe hadn't seen many examples of. Or go to GPT-3, ask the model who you are, and see what response it gives you. You can sometimes find little pockets where you're like, "Wow, this is an outlier space, and this is directly revealing the training data," because the training data in that particular vector space - or tensor space, really, the multi-dimensional space - was so sparse that the model basically memorized these outliers. And there, we not only have security risks, but deep, deep privacy risks.
Debra Farber 39:24
Wow. So, would you suggest then that bug bounty programs - I mean, it wouldn't necessarily be ethical disclosure, but do you think that bug bounty programs should now include potential bugs for machine learning? Or, when you consider these bugs, something like an ethics bug?
Katharine Jarmul 39:45
I mean, I guess it depends what it does. Right? One of the things that OpenAI did: they basically put a reinforcement learning, or agent-based learning, method on top of the language model because - at least my guess is - the language model itself was a bit too dangerous to just use without any filter. And so the reinforcement learning - as far as I know, it was agent-based reinforcement learning - goes out and explores what are potentially useful answers to a prompt. And then they poured tons of money, as far as I've heard, into doing reinforcement learning with workers - who were paid unfair wages - to try to make sure that it wouldn't respond in really icky ways, like displaying racism or homophobia or sexism and other things like that. Because what we often find with these large models - and you could also see it in the first release of Stable Diffusion - is there were examples of revenge porn; there were examples of executions; there were all of these other examples, because they basically pulled as much image data from the internet as they could.
Katharine Jarmul 41:00
And when we think of attacks on these models, yeah, we can think of adversarial attacks where I want it to give me a certain result, and I try to poison the data, or I try to learn enough about the model to elicit that response, or hide something in my prompt that gets it to do that. Yes, those are valid security attacks, too. But there's also the massive damage risk and brand risk for companies releasing these models, or things based on these models, if something comes out that's really horrific. Right? Personally, I would be surprised if it didn't become a thing where you say, "Hey, send us any bugs you might find that are either huge brand risks, or privacy risks, or security loopholes that we might have missed." And we can especially think of this in the case of something like self-driving cars, where it's probably not going to be feasible for these companies to think of every potential scenario. It's much cooler to have hackers, like your partner and so forth, kind of help with that. Right? To bring the really knowledgeable hacker mindset to something like machine learning and figure it out - because we only know what we know so far. Maybe there's even stuff we don't know right now that could be attack vectors we haven't even thought about.
Debra Farber 42:27
That makes sense. It seems like there might be an opportunity there. It certainly would upskill the external hackers to make them more aware of what they should be looking for, rather than, I guess, training data scientists to be hackers and submit to bug bounty programs. But maybe there's an opportunity to overlap there. Who knows?
Katharine Jarmul 42:47
Yeah, it'd be a cool collaboration.
Debra Farber 42:49
It would. I think it would be.
Debra Farber 42:52
So let's see, you were talking earlier on about consent, and I was hoping we could talk now about maybe LLMs and the lack of consent in creating large language models.
Katharine Jarmul 43:07
Yeah. So with these large models, there is a bit of a conundrum that we're in. I come from the early NLP days, which means I've actually done quite a lot of work - easily found online - on 'web scraping.' And I think that we kind of just marched forward as a field being like, "Okay, we can build better models if we have more data. One way that we get more data is we go scrape it. Let's go scrape more data, and then let's train and see if we improve." And it wasn't that prominent at that point in time to think about the privacy aspects, because all of this still felt so new, and it didn't feel like we were ever going to be able to build machine learning models that were going to be so powerful. Right?
Katharine Jarmul 44:02
That has since changed. Now it's indeed the case that the field has moved on through amazing research. Right? And I'm not somebody that's against further research and development of these, but I am somebody that's deeply aware of some of both the privacy and the ethical issues of these massive, massive models. And I think that we have to, at some point, start asking as an industry - particularly machine learning and data science folks, but also the greater field of anybody that deals with data, including privacy engineers and so forth - what really makes sense here? How can we create a more consensual experience for people who want to be remunerated for their own work? So let's say you're a journalist, or you're an artist, or you're a writer, and you're happy to contribute your work to these models. But if they're used to, let's say, produce more art based on your art, maybe you'd like a few pennies per user; or maybe you'd like to be made aware that some art is being made and you want to approve it; or, you know, just give some credit where credit's due.
Katharine Jarmul 45:21
And I think that's also the case when we look at, let's say, the image generation models and generating faces of people who don't exist. You know, that face existed somewhere. There's pieces of that face - and there's a really interesting paper called "This Person (Probably) Exists," where they reverse engineered some of the 'This Person Does Not Exist' examples back to the actual celebrity dataset that they came from. But I think the thing is: maybe we now know we can build really powerful, performant models, and now it's time to ask, "Can we build powerful models with privacy?" I think the answer is 'yes.' I hope that some of the listeners of your show believe the answer is 'yes' too, and I think it's time for the machine learning community to truly embrace privacy and consent, and maybe even rethink the whole way we collect data - making it more transparent, more fair, more privacy-based, more consent-based. I think that would be a really cool next step.
Debra Farber 46:31
I agree. I think that would be a really cool next step. I wonder if the political will is there, whether at OpenAI or other companies. It sounds like there's an arms race right now to see whose chatbot is going to win; and if we asked OpenAI to throw out the training set that ChatGPT was trained on, would they be willing? I'm thinking, 'definitely no.'
Katharine Jarmul 46:59
I mean, I think you're right there.
Debra Farber 47:01
Right? They would lose their lead. They would have to go back to the drawing board. But even if they came to market arguably unethically because they did not include privacy - you know, it's not like you can just add or subtract a box of privacy from the product. Right? It has to be built into the whole design.
Debra Farber 47:20
So, what can they do, if anything? I mean, I have to admit, I've even used it - well, maybe it wasn't that; whatever the Lensa app uses - I've used one of these models. I gave up my photos to be part of the training set - which I didn't care about personally in that moment - so that I could get, you know, cartoonish-looking pictures of me done in a certain way, which I wanted and I got. But then there was this whole debate in the public sphere about whether it's even ethical to contribute data at all to an unethical data model.
Katharine Jarmul 47:57
Yeah, yeah. But I have to jump in real quick, because it's such a dangerous slope that we're on if we start blaming the users for the privacy problems or the ethical problems. Like, I absolutely 100% disagree with the idea of, "Well, you're at fault if you're ever on Facebook," or something like this. No - we should have better standards as an industry, so that if folks in my life want to reach out on WhatsApp, or whatever it is, it shouldn't be shameful to want to use something cool, like the Lensa app or another thing you want to try out. Absolutely not. That's totally the wrong way to think about the power in that relationship.
Katharine Jarmul 48:49
Instead, I think - or what I hope, and I know from this conversation what you hope too - is that the industry needs to up the ante and say, "You know what? We can offer you something cool like this and make sure that your data remains more protected; make sure your data won't be saved and stored on our servers unless you click this extra special box; or, if you're involved in a machine learning training round, maybe you get access to the model, or maybe you get dividends or something." I mean, there's so many possibilities we could think of if we decided, "Hmm, the user is at the center of their data." If we're going to use the user's data - and guess what, we could get so much more data if users actually wanted to give their data. Right?
Katharine Jarmul 49:39
So, scraping the web is the hard way to do it. If we really wanted to build amazing large language models, we would create ways where people feel comfortable uploading much more text that they're willing to share, because they get a copy of the model, or because they get to contribute to something that's cool and is going to be used in a way that they find meaningful. Right? And I think this conversation has to shift so that the blame is never on the user. The responsibility is on the companies making these models and the people involved in producing them. And they need to own the power that they have in that relationship and say, "You know what? I'm going to do a really cool privacy-respecting Lensa app" - and, yeah, you heard about it here first.
Debra Farber 50:34
Awesome. Yeah, that makes sense. I guess I was just thinking more in terms of, as an ethicist myself, it's probably not the best leadership to go, "Oh, wait, you're telling me it's trained on other people's data? Let me go give my data to it and try it out." But, you know, I did. I needed to understand how it worked - and then, obviously, I just wanted to do it and kind of convinced myself that that was why. But it's kind of like with Facebook: even back when you had even fewer privacy protections, all the privacy experts were going on Facebook because they had to advise their clients who were using it - how it worked, what settings they suggested, things along those lines. So, I do agree with you; I was just saying that, as an ethicist myself, it's a little embarrassing to admit that I went and did something I thought was unethical. But, here I am, doing it on my own show for all of you to hear.
Katharine Jarmul 51:38
Yeah, I think machine learning will keep going on, and I think it's cool to see where we can go. I hope that researchers in the machine learning and privacy space focus on ways that we can do it better as we move forward.
Debra Farber 51:55
Agreed, especially ways that we can make it safe. Oh my god, this episode has just flown by. Honestly, I feel like I could speak to you for hours, but unfortunately, we're at the conclusion of our talk. Is there anything you want to plug before we say goodbye?
Katharine Jarmul 52:12
Nothing top of mind. But if you're curious to learn more, I have a newsletter called 'Probably Private,' and I have the book coming out - it should be published probably May or June this year - called "Practical Data Privacy." And yeah, I'm excited to hear responses. Gift it to somebody if you want to get them into privacy engineering and so far they're just into statistics and math.
Debra Farber 52:41
Well, thank you so much. I know I'm definitely gonna read your book, and I'm gonna put links to those resources in the show notes.
Debra Farber 52:52
So Katherine, thank you so much for joining us today on Shifting Privacy Left to discuss ethical machine learning and PETs.
Debra Farber 53:01
Until next Tuesday, everyone, when we'll be back with engaging content and another great guest.
Debra Farber 53:09
Thanks for joining us this week on Shifting Privacy Left. Make sure to visit our website: shiftingprivacyleft.com where you can subscribe to updates so you'll never miss a show. While you're at it, if you've found this episode valuable, go ahead and share it with a friend. And, if you're an engineer who cares passionately about privacy, check out Privado: the developer-friendly privacy platform and sponsor of this show. To learn more, go to Privado.ai. Be sure to tune in next Tuesday for a new episode. Bye for now.