This week, we welcome Lipika Ramaswamy, Senior Applied Scientist at Gretel AI, a privacy tech company that makes it simple to generate anonymized and safe synthetic data via APIs. Previously, Lipika worked as a Data Scientist at LeapYear Technologies and as a Machine Learning Researcher at Harvard University's Privacy Tools Project.
Lipika’s interest in both machine learning and privacy comes from her love of math and things that can be defined with equations. Her interest was piqued in grad school when she accidentally walked into a classroom holding a lecture on Applying Differential Privacy for Data Science. The intersection of data with the privacy guarantees that we have available today has kept her hooked ever since.
Thank you to our sponsor, Privado, the developer-friendly privacy platform
There's a lot to unpack when it comes to synthetic data & privacy guarantees, and Lipika takes listeners on a deep dive into these compelling topics. She finds it elegant that privacy assurances like differential privacy revolve around math and statistics at their core. Essentially, she loves building things with 'usable privacy' & security that people can easily use. We also delve into the metrics tracked in the Gretel Synthetic Data Report, which assesses both the 'statistical integrity' & 'privacy levels' of a customer's training data.
Copyright © 2022 - 2023 Principled LLC. All rights reserved.
Debra Farber 0:00
Hello, I am Debra J. Farber. Welcome to The Shifting Privacy Left Podcast where we talk about embedding privacy by design and default into the engineering function to prevent privacy harms to humans, and to prevent dystopia. Each week we'll bring you unique discussions with global privacy technologists and innovators working at the bleeding-edge of privacy research and emerging technologies, standards, business models and ecosystems.
Debra Farber 0:27
Today, I'm delighted to welcome my next guest, Lipika Ramaswamy, Senior Applied Scientist at Gretel.AI, a privacy tech company that makes it simple to generate anonymized and safe synthetic data via APIs so you can innovate faster and preserve privacy while doing it. Previously, she worked as a Data Scientist at LeapYear Technologies and as a Machine Learning Researcher at Harvard University's Privacy Tools Project. So, as you might imagine, today, we're going to be chatting about synthetic data.
Debra Farber 1:05
Lipika Ramaswamy 1:06
Thanks, Debra. Glad to be here.
Debra Farber 1:08
Oh, I'm so excited to dive into this topic. You know, to kick off the discussion, I'd love for you to tell us a little bit about your origin story, your interest in machine learning and privacy, and how you ended up in the field.
Lipika Ramaswamy 1:21
Yeah, I've always been interested in maths. So, I've always been a big fan of things that can be defined with equations: Greek alphabets, numbers. And so, you know, the intersection of data with the techniques that we have now just sort of piqued my interest in machine learning about five to seven years ago. When I was in graduate school, I kind of accidentally walked into a classroom. It was a lecture on applying differential privacy for data science; and, I just fell in love with the topic. It was...it really hooked me. And, of course, you know, the societal implications of privacy, and the more data that we put out there, the more there's a need to think about our own personal privacy, and also just, you know, what we're creating for future generations. And also, it's just really, really elegant how solutions revolve around mathematics and statistics at the core. So, all of that really hooked me. I really enjoy what I do on a day-to-day basis. At the core of it, it's math, and I love doing that every day and building things that people can use without ever having to think about an equation.
Debra Farber 2:30
Well, I'm thankful for people like you who are really excited to do math every day - because that is not me. But I do enjoy the topic itself. So, let's get into it. What is 'synthetic data'? When should it be used, and how would an organization use synthetic data across different job roles?
Lipika Ramaswamy 2:50
Yeah, I like to start with a definition - I did not write this; I picked it up from an NVIDIA blog, but it's really very accurate. So, I'd like to start with this: 'synthetic data' is annotated information that computer simulations or algorithms generate as an alternative to real-world data. So, there's a lot to unpack there, and a lot of it gets into the how and why, which I think we'll cover a little bit later on. But, the important thing is that it's an alternative to real-world data.
Lipika Ramaswamy 3:22
So when should it be used? It should be used when real-world data is unavailable, or perhaps should be unavailable - so, when there are privacy concerns about sharing real data. Often, you know, a lot of the data that is collected out there in the world is sensitive in nature because it's about people or it's about private processes. So, when there are privacy concerns, it's really great to use synthetic data when there are regulatory concerns or issues with taking data across borders, that style of thing. And honestly, when there's a simple lack of ground truth data, synthetic data is a great alternative.
Lipika Ramaswamy 4:02
As far as who should be using it as an organization, honestly, anyone can, but what we've seen the most is, you know, developers who are building in test environments, so they don't really need real data that might be proprietary or sensitive, and it might take them forever to get access to that. Synthetic data is really useful then. Data scientists who are trying to draw insights from a dataset where they don't necessarily need to know personal information - like, you know, this record is about Lipika; this record is about Debra - but, they need to know general information such as, you know, our usage patterns on Zoom, for example. So, data scientists can get great value from synthetic data. Machine learning researchers who might be building a production grade predictive model and they're trying to improve performance of a minority class - that's another great population. And honestly, like anyone lacking permissions to view real data; these are typically the places in an organization where we see synthetic data can be really, really valuable.
Debra Farber 5:09
Thank you for that. Yeah. You know, I think some people think that it's just mostly data scientists that are using synthetic data and not necessarily developers...or just certain use cases, and not necessarily seeing that there's multiple use cases. So, thanks for that explanation. It's really helpful.
Debra Farber 5:24
How do you go about creating synthetic data? What's, what is that process like?
Lipika Ramaswamy 5:28
There are so many ways to create synthetic data, some obviously superior to others. The simplest way literally could be, you know, I open an Excel spreadsheet or choose my favorite tool, even a text box, and I simply start typing up, you know, random names, random integers, random places of birth, and any other information that I want. Even though that came out of my brain, it's not real unless I'm writing about real people I know. And so, that in itself is synthetic. You could use something like a Python program to sort of simulate, you know, real people or real events and draw randomly from a distribution, and that's a synthetic data set. Now, these are not very good synthetic data sets - they may not necessarily provide statistical value for some downstream job, or teach you something insightful about a true population - but these are still methods. So, these things are still synthetic datasets.
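The "draw randomly from a distribution" approach Lipika mentions can be sketched in a few lines of Python. This is purely a toy illustration - every name, field, and parameter below is invented:

```python
import random

# Toy, hand-chosen "population" parameters - all invented.
FIRST_NAMES = ["Asha", "Ben", "Carla", "Dev", "Ema"]
MEAN_AGE, STDDEV_AGE = 40, 12

def make_synthetic_record(rng: random.Random) -> dict:
    """Draw one fake record from simple, hand-picked distributions."""
    return {
        "name": rng.choice(FIRST_NAMES),
        "age": max(0, round(rng.gauss(MEAN_AGE, STDDEV_AGE))),
        "is_smoker": rng.random() < 0.2,  # assume a 20% smoking rate
    }

rng = random.Random(42)
synthetic = [make_synthetic_record(rng) for _ in range(1000)]
```

As Lipika notes, data like this is "synthetic" but not very useful: the distributions were guessed, not learned, so it tells you nothing about any real population.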
Debra Farber 6:29
So it's definitely a wider definition than the fancy stuff that tools like Gretel provide. So, synthetic data is a pretty broad category, it sounds like.
Lipika Ramaswamy 6:38
Yeah, that's right. Anything that's not real, I like to call 'synthetic' - just from my own personal nomenclature. That's how I define it. So, you've probably heard the term 'fake data' a lot. And yes, a lot of fake data is synthetic data, but there's a distinguishing factor. I like to think of it like clothing: you know, synthetic fiber is not something that comes from the natural environment, but it still serves a purpose, right? Like, you and I probably wear synthetic fibers on a day-to-day basis. It's all over my house, I know that for sure. But it serves a purpose, and it does its job pretty well. It's not fake; it's not going to, you know, fall apart the second I touch it. It's not just there for, I don't know, set design; it actually provides value. And so, that's kind of the distinction that I see between fake and synthetic data, where synthetic data is still actually valuable as long as I know what it's valuable for.
Debra Farber 7:30
Well, I know we're gonna get into that later to talk about some use cases. But first, I definitely want to ask you how you make synthetic data 'privacy-preserving' at a high-level, because we'll dive deeper into some specifics.
Lipika Ramaswamy 7:43
Yeah, so I guess to get back to the question of how we actually create synthetic data, aside from those like very, sort of primitive, quote, unquote, examples that I described. The typical way to do this is using a 'generative model,' and a generative model is effectively...like it's a machine learning model, some sort of system that takes data as an input and provides data as an output. And, so effectively, what those models are trying to do is to learn the underlying distribution - like what are the relationships in the data set that really give meaning to the data set, right? So, is there really a relationship between smoking and lung cancer? What are the different ways to measure that and can that be captured sort of mathematically in a way that I can draw from that mathematical model and produce more samples of it, and in turn, provide data as an output? So that's sort of the general way of creating synthetic data.
Lipika Ramaswamy 8:36
So as you might imagine, you're trying to learn general trends about a population - like, really, what makes that population tick? What is the relationship? Are there some outliers in that relationship? Yes, there might be, but unless I have a lot of examples of them or a good understanding of those outliers, chances are that mathematical model is not going to learn all that much about them. But, there are ways in which things go wrong, and there's the term in machine learning that's called 'overtraining' or 'memorization.' Effectively, you can train any mathematical model, you know, to death and it'll learn literally everything about the sample that you provided, and it can just regurgitate that. And so, that's also a problem with synthetic data producing mechanisms, or algorithms.
Debra Farber 9:24
So if you trained it on something that included personal data as an input and you did just a tremendous amount of training, it could pretty much generate almost an identical output of your personal data?
Lipika Ramaswamy 9:36
Yep, 100%. You could train...like within Gretel, you could take the LSTM model - it's just a simple language model; it's not pre-trained or anything. You just take the data set that you have, train it for, like, 2,000 epochs; then, you try generating some data from it; it's probably just going to give you what was in the training set. And so, there are ways to prevent that, obviously, and a lot of work in the machine learning community and the deep learning community is on 'how do we prevent overtraining,' and how do we improve generalizability so that we're not just regurgitating things from the training set and we're actually learning something about the population that it comes from.
Debra Farber 10:16
Fascinating. So why use synthetic data instead of other techniques like tokenization, anonymization, aggregation, and others?
Lipika Ramaswamy 10:29
Okay, so tokenization and anonymization, I'll speak to those two first. Tokenization is just the process of taking some sensitive identifier in a data set - so, take my name, Lipika, and just replace it with, I don't know, 'Patient 1,' or 'ID1.' That's one technique to remove very-obviously-sensitive information from a data set. But the data set is still referencing things about me. So if it's about my recent visit to the doctor, it's going to have exact measurements that the doctor took, exact notes that the doctor took, and that's all going to reference me, me particularly. And so, even if my name is removed, I just told literally everyone that I just went to the doctor. You might be able to find out which doctor I went to and maybe find a dataset that they release for public research, and that includes my data. So, now you can kind of figure out...you link the dates back, and you figure out that, you know, this is my particular PHI. And so, that's one way that it's open to vulnerabilities. That's what's called a 'linkage attack.'
Lipika Ramaswamy 11:36
Basically, the idea with tokenization is, it's not very good at doing much aside from hiding names or hiding sensitive information, because our data doesn't really live in isolation. Right? I might put out the same types of data in different places, and if someone has access to multiple of those things, even if they're anonymized, they might be able to link them back together. There are tons of really famous studies on this. There's the Netflix Challenge that I won't explain in detail. But effectively, Netflix released an anonymized...like a tokenized dataset, where they replaced names with IDs and, you know, they removed any users or any of their customers who didn't have over a certain number of records in the data set. And they released that information about the consumption, and they put out this challenge saying, "Hey, machine learning community, build a really cool recommendation system," and then there were two privacy researchers who took that data set and linked it back to an IMDb data set, which is public information - like, your usernames are public. And, they were able to identify folks through that.
Lipika Ramaswamy 12:41
So the idea is, you know, no data lives in isolation, and that's why something like tokenization is very susceptible to privacy attacks. Anonymization, too, is very similar. There are many...I guess, notions of what anonymization is - sometimes it's a catch-all term; sometimes it's very specific - but take something like 'k-anonymity.' The technical way of describing it is: if you use k-anonymity as the definition, there are at least k minus one other people who look exactly like me in a data set - or exactly like any individual in the data set. And so, that's sort of the definition of "anonymization," and even that is vulnerable to linkage attacks after a certain point because there's so much data out there that can be linked.
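The k-anonymity property Lipika describes can be checked mechanically: group the records by their quasi-identifier values and verify that every group contains at least k records. A rough sketch (the column names and data here are invented):

```python
from collections import Counter

def is_k_anonymous(records, quasi_ids, k):
    """True if every combination of quasi-identifier values appears in
    at least k records, i.e. each person hides among k - 1 others."""
    groups = Counter(tuple(r[q] for q in quasi_ids) for r in records)
    return all(count >= k for count in groups.values())

data = [
    {"zip": "02138", "age_band": "30-39", "diagnosis": "flu"},
    {"zip": "02138", "age_band": "30-39", "diagnosis": "asthma"},
    {"zip": "02139", "age_band": "40-49", "diagnosis": "flu"},
]
# The lone "02139" record is unique on (zip, age_band), so k=2 fails.
print(is_k_anonymous(data, ["zip", "age_band"], k=2))  # False
```

Note that even a dataset passing this check can still be linked to outside data on other attributes, which is exactly the weakness discussed above.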
Lipika Ramaswamy 13:34
The final method is aggregation, which...I throw a lot of different things in here...which is literally anything that's an aggregate of data. So, that goes all the way from, you know, an average to a machine learning model or a deep learning model, which is effectively a lot of different aggregates put together with complex, fancy math, and they produce some really awesome things. But, all of that is aggregation, and that is also vulnerable to privacy attacks when there are successive releases of these aggregates. So, you know, let's say my company released average salary information; the next day, my friend joined the company, and then they released another aggregate. You could compare those two, and you could figure out what my friend makes.
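The salary example is a classic 'differencing attack,' and it takes nothing more than arithmetic. A toy sketch with made-up numbers:

```python
def recover_newcomer_salary(avg_before, n_before, avg_after):
    """Difference two released averages to recover one new salary."""
    total_before = avg_before * n_before
    total_after = avg_after * (n_before + 1)
    return total_after - total_before

# A 10-person company releases an average salary of 100k; a friend
# joins, and the refreshed average over 11 people is 102k. The
# friend's exact salary falls straight out of the two aggregates:
print(recover_newcomer_salary(100_000, 10, 102_000))  # 122000
```

Neither release on its own reveals anything about the newcomer; it's the pair of successive aggregates that leaks the exact value.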
Lipika Ramaswamy 14:25
So, there are many different attacks. I mean, these are just like a small set of examples. But, all of these techniques are susceptible to not just the things that I mentioned, but also like 'memorization attacks,' 'model inversion attacks,' 'model poisoning attacks,' 'model extraction attacks' - there's just so many different ways that data on its own or an aggregate can be compromised. And, so the reason that synthetic data is very useful is because it's not actually linked to any particular individual. Right? It's based on sort of a model of the population, and so by definition that doesn't link to any specific individual.
Debra Farber 15:08
That's really helpful. And it's also, I think, really helpful to understand all of these potential vulnerabilities that could be exploited when using tokenization, anonymization, or aggregation. I don't think these are things we generally talk about in the larger privacy community. So, really, thank you for going through all that.
Debra Farber 15:26
By any chance, do you know if there are security researchers working on privacy research that, you know...coming up with ways to kind of address these attacks? Or is the answer just like "No, we're all just working on synthetic data because those attacks are things we don't want to deal with?"
Lipika Ramaswamy 15:43
I think there are formal methods for achieving privacy that a lot of folks are working on, and they may or may not be related to synthetic data. But, in general, it's a set of algorithms that are intended to be privacy-preserving. So, formally, there's the world of 'privacy enhancing technologies,' right? So, depending on sort of the model of the world you're working with - whether your data is distributed, whether it lives in one place, what types of insights you're trying to derive from it - you might choose to use differential privacy standalone, differential privacy with federated learning, multi-party computation, right? Like, there are a number of things that can be used, and can be used in combination with each other.
Lipika Ramaswamy 16:24
There's a lot out there; it really depends on what's the data model and adversarial modeling you're working with and what you're trying to get out from your data. I think the complexity with a lot of these methods is that they're very specific to your end goal. So, for example - with differential privacy, if all I wanted to do was to reveal that, you know, that information about average salary at X Company, and I wanted to do that every month or every year in an effort to drive transparency, or, you know, whatever, choose something - if that was my end goal and I always wanted to produce an aggregate (like an average or a median), then I can build mechanisms with differential privacy that do that really well; and they protect the identity or protect the information of each individual with a certain high probability. So, it might be a probability of like 95%, a probability of 98%, you tweak those parameters.
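Lipika's example - repeatedly releasing an average salary under differential privacy - can be sketched with the Laplace mechanism, the standard way to release a single numeric statistic with a differential privacy guarantee. This is a rough illustration; the bounds, epsilon, and salary data are all made up:

```python
import math
import random

def dp_average(values, lower, upper, epsilon, rng):
    """Release an average via the Laplace mechanism.
    Values are clamped to [lower, upper], so one person can shift the
    mean by at most (upper - lower) / n, i.e. the query's sensitivity."""
    n = len(values)
    clamped = [min(max(v, lower), upper) for v in values]
    true_mean = sum(clamped) / n
    scale = (upper - lower) / n / epsilon
    # Sample Laplace(0, scale) by inverse-CDF from one uniform draw.
    u = rng.random() - 0.5
    noise = -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_mean + noise

rng = random.Random(0)
salaries = [52_000, 61_000, 48_000, 75_000] * 25  # 100 made-up salaries
print(dp_average(salaries, 0, 200_000, epsilon=1.0, rng=rng))
```

The released number is close to the true mean but noisy, and the epsilon parameter is the knob Lipika alludes to: smaller epsilon means stronger protection and more noise.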
Lipika Ramaswamy 17:27
But, in that case, differential privacy is really great, because all I ever want to do is produce this one statistic, and that statistic is going to drive value in a certain way. And so, in that case, differential privacy is great; it works really well. But, if you're trying to share more than just one statistic - or more than just one machine learning model, or one type of machine learning model - synthetic data provides a good way to do this without having to change the tools that a data scientist, machine learning researcher, or developer uses; it just sort of fits in exactly where the real data fit in.
Debra Farber 18:06
Wow, that's really compelling. So, I know, you've mentioned some use cases already, but like, what would you say are some good use cases versus poor use cases for using synthetic data? Like, you know, you've got multiple tools in the toolbox, as you mentioned, as privacy enhancing technologies, you know, when are you like, "This is perfect for synthetic data, and this is like not really ever useful for synthetic data?"
Lipika Ramaswamy 18:28
Yeah, as far as poor use cases go: first, when you're only ever trying to do one thing with your data set in a public-facing manner - like the example of producing one statistic, or a handful of statistics, and doing it consistently over time. I think synthetic data might just be unnecessary for that, and different methods like differential privacy are better-suited.
Lipika Ramaswamy 18:56
Another sort of poor use case for synthetic data is when you don't actually have good training data, or your training data is not representative of the population that you're trying to model. So, you know, if you don't have enough information, you only have a handful of records, or you just don't have good data to train synthetic data models with, it's typically not going to yield very good solutions or very good synthetic data. A lot of this has changed with the advent of 'foundation models' and foundation models being so accessible, but in large part, there's still a quality control problem there. So, I think that we haven't quite gotten to that yet, but I'll leave that there just to pause.
Debra Farber 19:40
Okay, yeah. Well we'll definitely get into quality a little later on. But, before that, I'd like to understand if there's any common misperceptions around synthetic data generally?
Lipika Ramaswamy 19:53
Yeah, I think a big one is that it is too immature as a technology or it can't be used as 'drop and replace' for a lot of functions and a lot of use cases. In general, we have very powerful models now. So, in the generative modeling world, there's the ability to use, you know, not just GANs (or 'generative adversarial networks') or 'variational autoencoders' or language models trained from scratch, there's also the ability to use foundation models or models that have been trained on so much public data or quote unquote "public data" that really can drive value and provide insight where you might not have that much data to train with, or you might not necessarily know what exactly the input should be.
Lipika Ramaswamy 20:43
So, I think it's not that nascent a technology; there are a lot of good use cases; but, there also have to be good guardrails. So, you know, you certainly need to have some idea about why you're using synthetic data, what you hope to accomplish with it, and what the downstream task is. But, I think that's the case with any new technology, or really any technology. You can't just, you know, be using it blindly. There's a lot of Twitter content that can be generated by ChatGPT, and that's great.
Debra Farber 21:12
I was just gonna say the irony of that statement with OpenAI's ChatGPT - like, all the rage right now - where it seems that there haven't been guardrails put in place and it was trained on all public data that includes our own data and pictures of us, without consent. And, I even know they're hiring right now for someone who could put some guardrails and policies around some things. So, a perfect example of...you know, before a technology's released, it would be great if organizations built the ethics and guardrails into their products and services. Sorry, that was my PSA right there.
Lipika Ramaswamy 21:48
No, I think it's incredibly valuable. Like, privacy shouldn't be an after-thought, and very often it is, unfortunately. But, things are changing, and that's where I really do believe that in a privacy-first world, we'll have a lot less of these challenges. But, I guess the trouble is, we don't know what might be coming in the future. It's a really good argument for something like differential privacy, which is future-proofed. But, you know, there are concerns with differential privacy, like its actual usability and the hit that, you know, downstream accuracy takes with the addition of noise. But, that's a whole separate topic; we can talk about that, and there are people much more experienced than me who can speak to that better.
Debra Farber 22:30
Sure, sure. Yeah. Let's move on now to privacy assurances and what Gretel calls "validation." You know, I guess in the data science world, you're validating your models and such, but I know Gretel offers its customers the Gretel Synthetic Report, which assesses how well the statistical integrity of a customer's training data was maintained and what the privacy level of the synthetic data is. And to me, that's really exciting, because some of the things we're missing in the world of privacy, generally, are metrics, empirical data, assurances - that what you're doing will produce a particular threshold of an output. And so, can you tell us a little more about these assurances and how they assist your customers from a privacy perspective?
Lipika Ramaswamy 23:19
Yeah. So this Synthetic Report that we produce with literally every single model that is trained in the Gretel ecosystem provides information on how the outputs compare to the inputs. You know, if we talk about generative models as understanding the underlying distributions, then the first thing to do would be to ask: how well are distributions maintained in the synthetic data set - the output data set - right? And so that's one of the very basic things to measure, but incredibly important. So you know, if a distribution was bi-modal, was that maintained? If a distribution was uniform, was that maintained? And, if you have a boolean variable, what's the ratio of ones and zeros produced? So that type of thing is incredibly important, just for a first glance, right? Like, you want to make sure that in a univariate world, everything looks right.
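One simple univariate check of the kind described - comparing category frequencies between a training column and a synthetic column - is total variation distance. A rough sketch, not Gretel's actual metric:

```python
from collections import Counter

def total_variation_distance(real_col, synth_col):
    """0.0 means identical category frequencies; 1.0 means disjoint."""
    real_freq = Counter(real_col)
    synth_freq = Counter(synth_col)
    n_real, n_synth = len(real_col), len(synth_col)
    categories = set(real_freq) | set(synth_freq)
    return 0.5 * sum(abs(real_freq[c] / n_real - synth_freq[c] / n_synth)
                     for c in categories)

# A boolean column: the synthetic set preserves the 50/50 ratio...
print(total_variation_distance([1, 1, 0, 0], [1, 0, 1, 0]))  # 0.0
# ...but here the ratio of ones and zeros is flipped, so distance is high.
print(total_variation_distance([1, 1, 1, 0], [0, 0, 0, 1]))  # 0.5
```

This answers exactly the "ratio of ones and zeros" question for boolean columns, and extends unchanged to any categorical column.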
Lipika Ramaswamy 24:19
Then we move one step higher to the multivariate world. I mean, most datasets that we train on contain multiple columns. And so, we want to make sure that the correlations in the data set are roughly baked in. So, you know, if age is correlated with graying, do we see that in the training set and do we also see that in the synthetic set, and what's the difference in those correlations? So that's also something that we measure.
Lipika Ramaswamy 24:46
And the final thing that's very relevant for downstream machine learning tasks, is how well do we maintain sort of the underlying deep structure...so, more than just looking at two-way correlations, looking at, like, overall deep structure in that data and how well is it maintained? And, we measure that using 'dimensionality reduction methods' like 'principal components analysis.' So really reducing the data to two dimensions or three dimensions of principal components and comparing those. So, that's the way that we assess sort of the quality of the synthetic data. That's what we have in our Synthetic Report at the moment; we're always thinking about how this report can be most general and applicable to the vast majority of data sets and data types that we see. We're also actively working on how to make this better.
Lipika Ramaswamy 25:38
There are many, many different metrics out there. So, there are many ways...you can find a lot of stuff on GitHub on how to compare two data sets. But ultimately, a lot of this is industry-specific. So, we're trying to build something that's general and also working on things that are incredibly specific to industries, incredibly specific to data types. And, that introduces a lot of complexity. But, at the very base level, these three things we have found are incredibly important.
Lipika Ramaswamy 26:08
As far as the privacy that's maintained in the data set goes, we, at the moment, measure the privacy protection mechanisms that are used during model training, and also after the fact. So, one of the things that we really, really focus on is "privacy filters;" and this work was actually born out of a paper that showed that large language models can replicate training data. And, it was published, I want to say, in early 2021 - a big paper out of Google. And effectively, you know, they showed that if you introduce a secret in your training data set, it can be replayed by a large language model.
Lipika Ramaswamy 26:49
And so, what we found is that what's really vulnerable with large language models is outliers - and, obviously, repeated training instances, which happens with overtraining and such. And so we built privacy filters, specifically outlier filters and similarity filters, to detect these things and also to filter them out of the generated data. So, it's sort of a data-dependent way of getting rid of things that look either overly-similar to a training record or that look like outliers in the data set. And so, we just want to get rid of those so that they can't ever be used for privacy attacks.
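A similarity filter of the kind described can be sketched as follows - drop any generated record that sits too close to some training record. This is a toy illustration, not Gretel's actual implementation; it uses a simple normalized field-mismatch distance and invented field names:

```python
def similarity_filter(training, generated, min_distance):
    """Keep only generated records that differ enough from every
    training record (fraction of mismatched fields as the distance)."""
    def distance(a, b):
        fields = list(a.keys())
        return sum(a[f] != b[f] for f in fields) / len(fields)
    return [g for g in generated
            if all(distance(g, t) >= min_distance for t in training)]

train = [{"age": 34, "zip": "02138", "smoker": True}]
generated = [
    {"age": 34, "zip": "02138", "smoker": True},   # exact replay: dropped
    {"age": 51, "zip": "94110", "smoker": False},  # dissimilar: kept
]
kept = similarity_filter(train, generated, min_distance=0.5)
print(len(kept))  # 1
```

The key property is that the exact replay of a training record - precisely what the memorization attacks above exploit - can never appear in the released synthetic data.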
Lipika Ramaswamy 27:34
So, those are the two things that we really focus on, and we think they provide a lot of value to our customers. There are other things that we do provide as optional, of course, with one of our models - the language model, or the LSTM. There's 'overfitting protection' that effectively provides early stopping: the ability to set a validation set aside during training for measurement of performance. So, we have a few different mechanisms, and we're working on more. So, things are coming.
Debra Farber 28:08
Awesome. So, the privacy guarantees are built into the product, which I think makes it super helpful for companies to then be able to, you know, demonstrate privacy assurance within their organizations. So, you know, bravo, I think this is a really exciting work.
Lipika Ramaswamy 28:27
Debra Farber 28:27
Now, if listeners are interested in learning more about synthetic data, are there communities that you can recommend they plug into?
Lipika Ramaswamy 28:35
Yeah, there's so many great places where you can find content. I wouldn't say just Googling because there's a lot of random content out there. However, a great place to start is communities on Discord. So, there's a synthetic data community hosted by Gretel on Discord, and that's a great place to engage with folks regardless of your level of comfort with the topic itself.
Debra Farber 28:58
If you give me the link, I'll put it in the show notes so that people can go straight to that Discord community.
Lipika Ramaswamy 29:05
Yeah. Yeah, that's right. It's on Discord. And, aside from that, I think there are plenty of great blogs on synthetic data. The Gretel blog is great. We produce a lot of content and a lot of things from our research team, stuff that's in progress...also, a lot about our product and how to use it and just, you know, like in three lines of code, get up and running, or even just point-and-click and get something out of it. So, there's some great stuff out there if you're itching to try using synthetic data or try generating synthetic data.
Lipika Ramaswamy 29:34
I also highly encourage folks who use new and buzzy tools - choose your favorite of the OpenAI models or StabilityAI models - if you're using those types of models, I highly encourage you to really look at the research behind them. I know, you know, looking at arXiv and looking at a PDF that's like 20 pages long might be very intimidating, but even just the abstract can be very useful...and looking at results, because a lot of these models are evaluated on sort of standard data sets that exist in the community. We need to, I think, be better about introducing more diverse data sets and diverse evaluation benchmarks to make sure that the models that we're producing, and the new research that's coming out, is actually net positive. But, I highly encourage folks who use these models to look through the research, because the foundation of all this is the research and all these papers that, of course, are peer-reviewed. But, it's always good to know what the good, bad, and ugly is; and, it usually comes straight from the mouths of the authors. So, it's a really great place to look as long as you're not intimidated by those 20 pages or 30 pages.
Debra Farber 30:51
You mentioned 'the good, the bad, and the ugly.' What would you say is currently ugly about the space?
Lipika Ramaswamy 30:58
I think there is a lot of bias, and maybe not as much of a realization of how a lot of these models are built - especially those for public consumption - and just misconceptions about what they can and should do. A lot of this is changing; there's a lot of good content out there on exactly what to expect from models that are available for public consumption, or what to expect from diffusion models, right, like the new hot thing in images. So, there's a lot of good content out there, but it's very hard to reach that content. So, I think just, you know, public consumption without knowing the risks - that's potentially very ugly.
Debra Farber 31:46
That brings me to the fact that with this brand new technology - basically LLMs - being talked about as hopefully widely-used in the future, and being such a breakthrough...whose job is it to educate the public on this technology? Because if you look at other technologies of the past, very often the platform providers push a lot of responsibility and a lot of ethics onto users themselves. You know, "Just consent to this. Great. Consent to this giant contract of things you don't quite understand. Awesome. Now we can use your data to do all these things" - but without giving individuals any education on how to protect yourself or preserve your own privacy by doing X, Y, and Z. And so, you know, given that you're deep in the synthetic data space and data science, who do you think should be responsible for that education?
Lipika Ramaswamy 32:45
Debra Farber 33:49
So, things like...right now, to some people it looks like an LLM outputs data with high confidence even when that data is not accurate. Right? And so, to the layperson, that could look like it's stating truth, when in reality it's just reflecting its training data. So, I'm trying to understand who's going to educate the public on what synthetic data can and cannot do, what LLMs can and cannot do - on the risks generally. You know, my personal opinion is that it's the job of the platforms to educate the public. But, in this case, the synthetic data providers are actually vendors that aren't typically interacting with the consumer; it's a B2B kind of play. But, with LLMs, there is a consumer interface; and so, I believe that it's the job of the LLM providers - like OpenAI or Google or Microsoft, whoever has the model - to educate the public. And, the reason I ask this is that I really think the industry needs a reminder that you can't just do all the fun, 'happy path' stuff; you've got to educate the market on how to differentiate between what is output from an LLM versus what is written by a human. Right? Is there a watermark? How do you prevent consumers from being confused, or prevent disinformation or misinformation? You know, I think all of these things should not be put on the individual, because there's just no way that lay people would be able to learn all that information themselves, right?
Lipika Ramaswamy 35:39
Oh, yeah. 100%. And so my comment about, you know, reading privacy policies and whatnot is just so my data is not part of the training set that's used for god-knows-what - a machine learning system that I know is being built, or maybe I don't even know what's being built, or maybe I'll find out it's being built in 20 years. But, I think there's also a distinction that has to be made between, as you said, models producing output that looks real versus something that someone actually wrote. And, there are some private citizens who've produced literature on - or models that try to distinguish between - these two. So, like, is this something that ChatGPT produced, or is it something that a human wrote - those types of models. And I think OpenAI has something on that as well.
Lipika Ramaswamy 36:30
I think, overall, the struggle is that there's no universally-digestible way to produce this type of educational content for those using these tools, because everyone comes in at a different level. So, there might be users like me, who can literally, you know, go and parse the models if I had access, or at least read the papers and understand what they're doing. Or, you know, it could be my mom, who just thinks it's really cool. But, there's no content out there that can speak to both of us in ways that are equally satisfying to each of us. So, yeah, I think more conversation about this is really important. I don't know if you saw - everyone is just kind of very amused by these consumer interfaces, but also kind of scared of them. There was recent news about a company that laid off a lot of its content creation team because they were just going to switch to, or at least try using, large language models to produce content. And that's interesting, because that's a case where people are scared of what such technologies might do. But also, there's a lot of value that such technologies can provide. The question is: when is it useful? When is it not? Who needs to be responsible?
Debra Farber 37:57
When it's dangerous.
Lipika Ramaswamy 37:59
Yeah, yeah. When is it dangerous? That's right. And to what extent do users need to know the internals? I mean, I think there's a lot to unpack there. Certainly, regulation of a lot of these generative modeling techniques and their uses - and also of what type of training data can be used - would be incredibly helpful. I think that was just a lot of rambling. I don't have any answers.
Debra Farber 38:25
No, no, that's fair.
Lipika Ramaswamy 38:26
It's incredibly cool.
Debra Farber 38:27
It's fair. I mean, I'm really excited about LLMs, but I'm also really worried about the potential harms they could cause, given that it's like the 'Wild West.' I'm even more nervous about the amount of money that investors, who are now so excited about it, are pouring into the space. And I really want them to just set aside some of that money to educate consumers. And that never happens. So, this is just my plea to remind folks that the responsibility is on those who are building these models, but also on those building the platforms that enable the building of the models.
Debra Farber 38:58
You did mention before, you know, when it's your data, and you're deciding whether or not you want to, I don't know, let's say, upload a few pictures to something that's going to output some really fun, generative art with your face on it, but kind of slightly modified, right? Like the Lensa app, for instance. I tried that myself, you know, just to check it out. One of the things I wanted to point out to the audience is that the photos that you upload, or whatever you provide to the model, are now also going to train the model for the future. So, you're not just getting an individualized output; you are now contributing to the training of the model, and so you're giving your personal data - your photos - to this company. So, that is the fine print that you're supposed to look at in the Terms of Service and such, right, that you were suggesting before?
Lipika Ramaswamy 39:50
Yeah, that's right. I think that also speaks to, you know, that's a case where you're consenting in one way, knowingly or unknowingly, to provide your data to improve the performance of models. But then there's also this larger question of what is 'public data'? What actually constitutes public data? Is it fair to take something from Twitter and call that public? Did I, or you, or any Twitter user write that with the intention of having it be used to generate something new? Probably not. At least, I didn't have that intention when I wrote my first tweet, like, I don't know, a decade ago. So, that's the big question: what is public data? How do we better define that? Does somebody define it? Does, you know, the government define it?
Debra Farber 40:38
It's a great point. I know, in the EU, just because data is public, it could still be personal data, based on context. It's a great point that we don't have this distinction in the U.S. But, privacy-wise, information that's provided is always about the context in which we provide it. Like, yes, we might have made something public, but only in context. If you're going to take all my tweets, and then do something with them, and manipulate them, and put them in another context - that's something I did not necessarily consent to. So, you know, it would be a violation of "contextual privacy," but it wouldn't necessarily be illegal right now. It would be unethical, though, according to norms of privacy and data protection. So, it is definitely an evolving space, and I, for one, am, you know, gonna keep my ears peeled to hear about what changes happen in the space.
Debra Farber 41:35
So, with that, are there any conferences or research or papers that you'd like to plug before we conclude for today?
Lipika Ramaswamy 41:44
Yeah. Gretel just hosted the first synthetic data conference for developers yesterday, which was February the 8th. The talks will all be on YouTube.
Debra Farber 41:53
Yes, I attended about three hours' worth of that. I thought it was excellent content.
Lipika Ramaswamy 41:58
Great. Yeah, all the talks will be on YouTube soon, and I highly recommend getting into one or more of them. They come in at different levels, and my particular favorite was a talk on leveraging privacy-enhancing technologies with large foundation models by Peter Kairouz. So, I think there's some really, really great content that's come out of this conference. Otherwise, I really, again, encourage folks: if you're using models, you know, try to read the abstract of the paper that presented the model; or try to read up on blogs or just any content provided by the publisher of the model on what you can expect and generally how they built it. It's so valuable to know where the things you're using are coming from; and I don't think access to that should be dependent on your technical capabilities. There's plenty of content that is accessible to everybody at every level.
Debra Farber 43:00
Well, Lipika, thank you so much for joining us today on Shifting Privacy Left to discuss privacy and synthetic data.
Lipika Ramaswamy 43:07
Thanks, Debra. It was fun.
Debra Farber 43:11
Thanks for joining us today, everyone. Until next Tuesday when we'll be back with engaging content and another great guest.
Debra Farber 43:20
Thanks for joining us this week on Shifting Privacy Left. Make sure to visit our website shiftingprivacyleft.com where you can subscribe to updates so you'll never miss a show. While you're at it, if you found this episode valuable, go ahead and share it with a friend. And, if you're an engineer who cares passionately about privacy, check out Privado: the developer-friendly privacy platform and sponsor of this show. To learn more go to Privado.ai. Be sure to tune in next Tuesday for a new episode. Bye for now.