The Shifting Privacy Left Podcast

S2E29 - "Synthetic Data in AI: Challenges, Techniques & Use Cases" with Andrew Clark and Sid Mangalik (Monitaur)

September 26, 2023 Debra J. Farber / Andrew Clark and Sid Mangalik Season 2 Episode 29

This week I welcome Dr. Andrew Clark, Co-founder & CTO of Monitaur, a trusted domain expert on the topic of machine learning, auditing and assurance; and Sid Mangalik, Research Scientist at Monitaur and PhD student at Stony Brook University. I discovered Andrew and Sid's new podcast show, The AI Fundamentalists Podcast. I very much enjoyed their lively episode on Synthetic Data & AI, and am delighted to introduce them to my audience of privacy engineers.

In our conversation, we explore why data scientists must stress test their model validations, especially for consequential systems that affect human safety and reliability. In fact, we have much to learn from the aerospace engineering field, which has been using ML/AI since the 1960s. We discuss the best and worst use cases for synthetic data; problems with LLM-generated synthetic data; what can go wrong when your AI models lack diversity; how to build fair, performant systems; & synthetic data techniques for use with AI.

Topics Covered:

  • What inspired Andrew to found Monitaur and focus on AI governance
  • Sid’s career path and his current PhD focus on NLP
  • What motivated Andrew & Sid to launch their podcast, The AI Fundamentalists
  • Defining 'synthetic data' & why academia takes a more rigorous approach to synthetic data than industry
  • Whether the outputs of LLMs are synthetic data & the problem with training LLM base models on this data
  • The best and worst 'synthetic data' use cases for ML/AI
  • Why the 'quality' of input data is so important when training AI models 
  • Thoughts on OpenAI's announcement that it will use LLM-generated synthetic data; and critique of OpenAI's approach, the AI hype machine, and the problems with 'growth hacking' corner-cutting
  • The importance of diversity when training AI models; using 'multi-objective modeling' for building fair & performant systems
  • Andrew unpacks the "fairness through unawareness fallacy"
  • How 'randomized data' differs from 'synthetic data'
  • 4 techniques for using synthetic data with ML/AI: 1) the Monte Carlo method; 2) Latin hypercube sampling; 3) Gaussian copulas; & 4) random walking
  • What excites Andrew & Sid about synthetic data and how it will be used with AI in the future


Andrew Clark:

One of the things that people are now talking about is, "Okay, we can't use data that's on the internet now because so much of it is generated, so can we make fake data to train our models on instead of real data?" And there are a lot of issues there; that's not helping either, because you're now creating. . . One of the key things to highlight with synthetic data is that, although there are academic methods - and we can get into them - for how you can best replicate data, synthetic data is not 'real data,' and one of the key things you miss is

Andrew Clark:

the interplay, the interconnectedness between inputs to a model, which is very important. You can't replicate those, even with the most in-depth methods, at a generalized level - we can get into what that means a little bit - but it creates problems: a lack of diversity of language, bias issues, and a lot of things there. I'll let Sid add some more from the LLM-specific space, but there's a major disconnect here. And just moving to synthetic data versus training from open source data - neither one is actually good for model training. You need curated data.

Debra J Farber:

Welcome everyone to The Shifting Privacy Left Podcast. I'm your host and resident privacy guru, Debra J Farber. Today I'm delighted to welcome my next two guests: Dr. Andrew Clark and Sid Mangalik, Co-hosts of The AI Fundamentalists Podcast. Dr. Andrew Clark is Monitaur's Co-founder and Chief Technology Officer. A trusted domain expert on the topic of machine learning, auditing and assurance, Andrew built and deployed machine learning auditing solutions at Capital One. He also contributed to ML auditing standards at organizations like ISACA and the UK's ICO. Andrew co-founded Monitaur to lead the way in end-to-end AI and machine learning assurance, enabling independently verifiable proof that AI/ML systems are functioning responsibly and as expected. Before Monitaur, Andrew served as an Economist and Modeling Advisor for several prominent crypto-economic projects while at BlockScience. Sid Mangalik is a Research Scientist at Monitaur, and he's working towards his PhD at Stony Brook University with a focus on Natural Language Processing and Computational Social Sciences. Welcome, Andrew and Sid. I'm so excited that you're here.

Andrew Clark:

Thanks, really excited to be here as well, and thank you so much for reaching out. Looks like a fantastic podcast and audience and we're really excited to be here.

Debra J Farber:

Great. I'm excited for you to be here as well. So, I was searching for podcasts focused on privacy enhancing technologies on this website called Podchaser, which lets you search across all podcasts everywhere, and I came across episode five of The AI Fundamentalists Podcast, which focused on the pros and cons of using synthetic data for ML/AI. Now, I've been following how synthetic data can be used for privacy use cases, but what struck me about this episode was how you underscored that synthetic data may not be all that useful for machine learning and AI purposes, except for a few key scenarios. I found the discussion really riveting. You guys are just a fountain of knowledge, and I thought it would be helpful to my audience of privacy engineers to get a deep dive on exactly what these limitations are. So, to start off, Andrew, why don't you tell us what motivated you to found Monitaur and focus on AI governance? And, I guess, tell us a little bit about Monitaur as well?

Andrew Clark:

Definitely, thank you. Monitaur is, as you described earlier at a high level, a machine learning assurance and governance company. I have an undergrad in accounting and was working in the accounting field, and I realized accounting is a great career, but it's really not for me. So, I taught myself how to program in Python, started taking extra statistics and economics classes, and started moving over to the more technical side of the house. My first job was as an IT auditor, doing Sarbanes-Oxley work - Sarbanes-Oxley is one of the compliance regulations that came out of Enron and some of those accounting disasters, so we call it 'the full employment act for accountants and lawyers.' There's lots of rote work you have to do in IT auditing and things of that nature. So, I started figuring out how I could automate these processes; continuous auditing was a big thing at the time. Where can we use machine learning and data analytics to try and enhance the audit process? As I did that deep dive, I got a Master's Degree in Data Science and just kept getting more interested in this field and learning more. I started looking at how we traditionally audit and evaluate systems: you get a bank reconciliation and make sure that 2+2=4 if you're auditing the finances of a company, for example, and in IT auditing we also check who has access to things - very black-and-white scenarios. Well, as I kept digging deeper, and specifically as machine learning came on the scene, I saw that model auditing and assurance has historically been done at a more aggregated, high level - looking at residual plots and things like that - without the granularity and re-performance that traditional auditing brings to financial and technical audits. So, I started working with ISACA on guidance: with machine learning models, it's less about (and this is something we can get into as well) the type of model - machine learning versus statistical modeling versus deep neural networks. Any of those is just a paradigm of modeling.

Andrew Clark:

The key difference is, with the rise of big data and machine learning becoming popular, the real reason we want to increase that assurance is that these systems are being used in ways that affect end users. Traditionally - with the techniques I mentioned that were originally developed, looking at residual plots and things like that - you were looking at liquidity measurements for a bank, or forecasting, those sorts of requirements that modeling had traditionally been used for. There's a very different level of assurance that you need if you are making decisions about Debra, Andrew, and Sid. You want to be making sure that you're not biased, that you're fair. Those are very specific things, and you can re-perform and understand how that system is working. So, I really started getting excited about that and digging in. I did presentations, worked with ISACA, worked with Capital One on some of their systems for a while, setting up that audit process, and really kept digging deeper into

Andrew Clark:

there's a major gap here in how we govern AI systems - which we're going to use as the umbrella term. All modeling can kind of go under the umbrella of AI. AI is really the field of trying to have automated computer systems that replicate human activity. That's very broad; you can even have RPA and some other things underneath there. But for any of these AI systems that are becoming more prominent, being used for decisions about Andrew, Debra and Sid, we want to make sure that the level of governance and assurance is higher.

Andrew Clark:

So, that's what really prompted the founding of Monitaur. We started very machine learning-esque, and we've expanded now to really solid model governance, focusing on the fundamentals you need to get correct so that we can be confident our systems are performing as expected. That's where it goes well with privacy and those specific attributes: how can we as a society be comfortable that insurance underwriting, healthcare diagnoses, all of these things are now being done by machines? How can we get comfortable with that? So the genesis of Monitaur is providing that 'batteries included' software package that can help through those parts of your journey by enforcing best practices.

Debra J Farber:

That's pretty awesome and, my goodness, do we need that. Right? I mean, is there even a framework that you're incorporating, or are you kind of coming up with best practices based on your experience and knowledge of how machine learning works and is deployed today?

Andrew Clark:

That's a great question. Short answer: we've developed our own internal framework, looking at all these different resources, with a lot of influences from systems engineering - Sid and I can talk about that - and from my work doing this with multiple companies for a long time. There are also a lot of frameworks out there that we reference and map to. The thing is, there's not a definitive framework. In privacy or cybersecurity, for example, there's the NIST Cybersecurity Framework that everybody kind of rallies around; in AI and ML, there's not a standard like that. NIST came out with an AI Risk Management Framework, which Sid and I were both contributors to through that whole process, but nobody's really said you must use it. So, there are frameworks that exist, but they're very high-level and conceptual, where we try to be a little bit more targeted. That's one of the issues with AI and ML right now: there's not that definitive framework everybody must follow, which is why people are talking about regulation. The EU AI Act is probably going to be the first mover in that space.

Debra J Farber:

Yeah, that makes sense. Thank you. Okay, Sid, tell us about your career path: why did you choose to focus on this space, and how did you end up at Monitaur?

Sid Mangalik:

Yeah, so I've followed a very traditional path, you could say. I was always a computer scientist at heart. Going back maybe 10 or 11 years, I found that I was really passionate about AI and using these types of systems to solve human problems and understand humans. So, I naturally gravitated towards natural language processing: understanding how language exists as a medium between humans and computer systems, and how we can understand each other in some sense. Later on, I discovered that what I really wanted to do was 'AI for good' and 'AI for improving human outcomes' and generally making people happier. So, I started off my career working at Capital One as a data engineer, doing classic pre-pipeline, pre-AI work. That's where I met Andrew, and we hit it off really well. He started Monitaur and he said, "We need a data scientist. Are you ready to hop on board?" and I couldn't say yes fast enough, because I was so excited to do data science and AI work.

Sid Mangalik:

About a year into that, I decided it was time for me to start a PhD. I thought, "I really need to become a subject matter expert in this. I really want to pursue my passion, and I really want to do this type of AI in the service of making human lives better." The research I work on in my PhD looks at mental health outcomes across the US using NLP systems, and we do a lot of work for The National Institute of Mental Health and the NIH at large. It just feels like a really great fit, because we are focused on making safe, unbiased, and fair AI models available for the general public, which is going to have to interface with these models and live in this new AI world.

Debra J Farber:

Awesome, so thanks for that. Let's talk about your new podcast that you brought to life, called The AI Fundamentalists Podcast. What motivated you to start it, and who's your intended audience?

Andrew Clark:

So, The AI Fundamentalists - I forget exactly the genesis of it. I think Sid and I were just talking about things happening in this space. Specifically, we're both passionate about the current state of LLMs, and not necessarily in the way that the media is portraying them. Something like, "Hey, there's a major gap of people speaking the truth on how these systems actually work." One of the things Sid was telling me was that a lot of his fellow researchers in NLP are not as happy or as confident in these systems as what the media is portraying, and there's just a major gap.

Andrew Clark:

And our whole focus is: how do we follow best practices? How do we build systems properly? We saw a major space that needed to be filled. Let's go back to first principles. Let's think about how you should build AI systems that are safe, fair, and performant. What are those building blocks? How do we start helping practitioners that wanna be better, that wanna be that 1% better every day, that wanna be professionals? How can we help them cut through the noise and learn the key building blocks they need?

Andrew Clark:

Because whether you're building a generative AI LLM system or a simple model that's gonna predict Pop-Tart usage at your company, the fundamentals at the lowest level are the same. Understand what you're building and why. Understand the data. Understand the modeling process. How are you gonna evaluate that this is doing what it should be doing? All of those aspects are the same across the board, and they're also not new. There's lots of discussion around "oh, all this new AI stuff" - it's not new at all. If you listen to The AI Fundamentalists Podcast, we keep talking about NASA and the Apollo program all the time, because that's where the genesis of a lot of these larger-scale modeling systems and the really in-depth systems engineering approach came from - and it really goes back even farther than that. So, we really bring that fundamentalist mindset for practitioners that don't believe the hype and want to keep improving and make safe, responsible systems. Sid, is there anything you'd like to add to that?

Sid Mangalik:

Yeah, I think that's great. I mean, to me, The AI Fundamentalists is about two people that love AI, wanting to talk about AI and wanting it to exist in the real world. Things like hype, and big investors excited by buzzwords who don't understand how data science and AI work, taking over the field is not gonna be healthy for our field, for building trust in our field, and for getting it out there and using it to solve problems. So, it's really for us to talk to other data scientists and people who work in this field about doing AI the right way, which is oftentimes the hard way.

Debra J Farber:

Yeah, usually the hard way. It makes sense, especially if you're gonna go make testable systems and you gotta kind of plan that stuff out from the beginning, from the design phase I would think. Thank you for that. I know that today we're going to be centering our conversation around synthetic data. Before we get started on that, do you mind describing or defining what synthetic data is? I realize this might vary according to industry or academia, but what are some working definitions?

Sid Mangalik:

Yeah. So I think a very simple definition of synthetic data is: we want to create data which looks as close to the original source data as possible. If we have some sample data from customers and we only have a hundred customers, but we would like to supplement that data to tell a fuller and richer story, we sometimes need to create synthetic data sets that can augment the original data set, or act as a proxy or clone for it. Synthetic data does not mean that you pull out a random generator or your dice or your Ouija board and remake the data totally at random. You wanna make data that looks like the original data. In terms of industry versus academia, industry is going to be more interested in making data that is essentially a carbon copy of the original data, insofar as you can use it for privacy. Academia is really interested in making data that is meaningfully, statistically similar.

Debra J Farber:

So that's an interesting slight difference between the two. Has that created tension at all?

Sid Mangalik:

I think we've seen tensions insofar as what is considered good synthetic data. Academics want synthetic data that is virtually indistinguishable from the original data, whereas in industry it can just be, "What's sufficient? I eyeball it and it looks fine." It's not the same rigorous process that we expect in an academic setting.
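For readers who want to see what that academic-style rigor can look like in practice, here is a minimal sketch (an editorial illustration, not from the episode) of comparing a synthetic table against the real one: per-column Kolmogorov-Smirnov tests for the marginals plus a correlation-matrix gap for the joint structure. The DataFrames and any cutoffs are hypothetical.

```python
# Minimal sketch: statistically compare real vs. synthetic tabular data.
# Column names, data, and thresholds are illustrative assumptions.
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

def compare_real_vs_synthetic(real: pd.DataFrame, synthetic: pd.DataFrame) -> None:
    # 1) Marginals: a two-sample KS test per numeric column.
    for col in real.select_dtypes(include=np.number).columns:
        stat, p = ks_2samp(real[col], synthetic[col])
        print(f"{col}: KS statistic={stat:.3f}, p-value={p:.3f}")

    # 2) Joint structure: how far apart are the correlation matrices?
    gap = (real.corr(numeric_only=True) - synthetic.corr(numeric_only=True)).abs()
    print("max absolute correlation difference:", gap.values.max())

# Usage (hypothetical): compare_real_vs_synthetic(real_customers, synthetic_customers)
```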

Debra J Farber:

That's interesting. That makes sense. That also kind of tracks with how you look at HIPAA regulations when it comes to de-identification - which we're not gonna debate today, whether it's good or bad or actually sufficient for privacy these days - but when you look at that, there's a little more leeway for academics doing research on the data sets than if you were releasing them to non-academics. So, that makes sense to me; I've seen that play out over the years. So, the AI community has been talking about using open source information, or information that's scraped from the internet, to help create synthetic data. But this approach seems to be backfiring. Some argue that LLM output is synthetic data, and when LLMs ingest the outputs of other models, I hear there might be some degrading going on. What are your thoughts on doing this?

Andrew Clark:

Yeah, that's definitely happening, where you're having the outputs of other LLM models fed back in. Fundamentally, the current state of large language models is that they're trained to sound human, not to be accurate. The objective function - every time you train a model, there's a specific objective function for what you're optimizing - and that's one of the key things when we get into privacy and bias: what are you optimizing for? In this case, when these models were built, they were optimized to sound like a human, not to be factual. So, it's a very important distinction.

Andrew Clark:

What's gonna hinder them being adopted? They might look really cool if you say, "Hey, write me an essay," and everybody gets all excited about it. This is that disconnect of why The AI Fundamentalists was started. But can you actually use this in a business? That really changes when you realize that you have to be correct about things. So, we started running into issues with copyright and things like that from scraping data off the internet that was written by humans; and now people are using ChatGPT and LLM models to generate content, which goes back online and gets ingested again.

Andrew Clark:

And there are some recent papers out on how the latest versions of the GPT models are degrading; they're not as accurate, they're not even as smart as they used to be, because of that feedback loop of ingesting other LLM data. So it's kind of a downward spiral. So, one of the things that people are now talking about is, "Okay, we can't use data that's on the internet now because so much of it is generated, so can we make fake data to train our models on instead of real data?" And that's where there's a lot of issues; that's not helping either, because you're now creating. . .

Andrew Clark:

one of the key things to highlight with synthetic data is that, although there are academic methods - and we can get into them - for how you can best replicate data, synthetic data is not real data, and one of the key things you miss is the interplay. The interconnectedness between inputs to a model is very important, and you can't replicate those, even with the most in-depth methods, at a generalized level. We can get into what that means a little bit, but it creates problems: a lack of diversity of language, bias issues, and a lot of things there. I'll let Sid add some more from the LLM-specific space, but there's a major disconnect here in just moving to synthetic data versus training from open source data - neither one is actually good for model training. You need curated data.

Sid Mangalik:

Yeah, so I'll just pick up right there.

Sid Mangalik:

So, using LLM outputs as synthetic data poses some pretty immediate and maybe obvious issues, and you can think about this on your own.

Sid Mangalik:

In terms of, if you've used ChatGPT before, you may have felt that it has a certain tone. You may have felt that it has a certain consistency in the way that it likes to speak. And maybe you've even received an email and thought, "I think ChatGPT wrote this email," and this really highlights the lack of diversity, the lack of variance, that comes with synthetic outputs.

Sid Mangalik:

And this is not a surprise, because, remember, all AI modeling is about creating high-level, broad-stroke representations of the real world - of what human language looks like. We're not going to get the deeper nuances, the different styles of talking, maybe even the same kinds of mistakes we would expect from human authors. And so, when you keep spiraling - ingesting and outputting, ingesting and outputting the same data over and over again - you only further entrench the biases and repeated behaviors that are already in AI modeling systems. So, you do see a natural degradation process. Maybe you've even experientially, anecdotally, felt that ChatGPT is not as good as it used to be, because now they need to put more and more data into it. But the data they're putting into it is this synthetic data, and it's not getting a rich diet of nutritious data. It's just being given the same rehashed data over and over and over again.
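As a toy illustration of the feedback loop Sid describes (not code discussed on the episode), you can fit a simple model to data, sample from the fit, refit on those samples, and repeat. With a finite sample each generation, the fitted spread tends to decay and the tails disappear, a simplified picture of the degradation often called model collapse:

```python
# Toy "model collapse" loop: each generation trains only on the previous
# generation's outputs. The Gaussian here is a stand-in for a real model.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=200)   # "real" data, true std = 1.0

for generation in range(1, 501):
    mu, sigma = data.mean(), data.std()           # "train" on the current data
    data = rng.normal(mu, sigma, size=200)        # next generation sees only generated data
    if generation % 100 == 0:
        print(f"generation {generation}: fitted std = {sigma:.3f}")  # tends to shrink
```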

Debra J Farber:

And here we're talking mostly about the base models, right? What's happening with those who build on top of LLMs? Are they able to adjust for the bad outputs, or is it pretty much, "No - if you build on top of something that's uncertain or not necessarily a great output, then it's going to be more garbage in, garbage out"?

Sid Mangalik:

There are still only early studies in this field. I actually was just recently a co-author on a paper where we discussed this idea, looking at downstream tasks for AI systems: is fine-tuning going to make any difference given the quality of the underlying model? It seems like you can mitigate a good amount of problems in your specific domain. Giving it really good data to fine-tune on will make it really good at your use case. But you will have a lot of trouble changing the baseline model to adapt to your needs in broader contexts. If someone has a customer service bot and you've trained it to be really good at customer service, that doesn't mean you've fixed any of the factuality problems it had before.

Andrew Clark:

And one key thing to highlight, to underscore what Sid just said: we're assuming that your fine-tuning data is very high quality, that there's a sufficient amount of it, that it has those correlations, and that it's not synthetically generated - it's real data from your use case. So, while you can mitigate some of the base model issues, you need extremely good data that you're then training on. There is still good data in the system; you're just allowing a little bit of synthetic at the base level.

Debra J Farber:

Okay, got it. So then, maybe you could tell us what are some use cases where synthetic data is most useful and then what are some use cases where it's least useful?

Andrew Clark:

I'll take the pros and then, Sid, you can take the cons. So, pros - as Sid actually started discussing earlier, it's supplemental data. If you don't have enough records, it can be helpful to expand your data set a little bit more, because modeling, specifically machine learning, needs lots of data. You want to have enough data to train your model properly, so sometimes augmenting your data set is good. One of the ways I personally like to use it for augmentation is to create data that is a little bit outside the norm, because one of the basic issues with machine learning and AI models is that they get overfit. So, if your data set is focused on one specific thing - let's use an LLM example: we've only looked at history books, but now we want the model to write poetry. Well, it's not going to be a one-to-one match, right? So you want to make sure you have enough poetry examples; maybe you add some synthetic poetry to help expand the data set so the model isn't overfit. That sort of expansion of your data set is a good way to use it. And that leads into one of the best uses we've found.

Andrew Clark:

One of the discussions we had on The AI Fundamentalists Podcast was using it for stress testing. This is, I think, the most valid use for synthetic data. For instance - we talked about wanting safe, performant systems - we need to determine: where is this model safe and performant? If I have pricing information from Utah and I have a great model that works in Utah off of those demographics, that doesn't mean I can take that same model and move it to Massachusetts. The demographics, the income levels - there's lots of things that could be different there. So we need to test whether it can perform correctly in Massachusetts. We would need to stress test it and figure out where it breaks down.
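Here is a minimal sketch of that kind of stress test under made-up assumptions (the feature names, distributions, and tree-based model are illustrative, not Monitaur's method): train on one region's distribution, then shift the inputs step by step and watch the error grow.

```python
# Covariate-shift stress test sketch: train on "Utah-like" data, then probe
# performance as incomes drift beyond the training range. Everything here
# (features, coefficients, model choice) is a hypothetical example.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(1)

def make_region(n, income_mean, miles_mean):
    income = rng.normal(income_mean, 15_000, n)
    miles = rng.normal(miles_mean, 5_000, n)
    price = 200 + 0.01 * income + 0.005 * miles + rng.normal(0, 30, n)
    return np.column_stack([income, miles]), price

X_train, y_train = make_region(2_000, income_mean=65_000, miles_mean=14_000)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

# Synthetic "stress" scenarios: shift the income distribution and measure error.
# Tree-based models extrapolate poorly outside their training range, so error
# tends to grow as the test distribution drifts away from the training data.
for shift in [0, 20_000, 50_000, 100_000]:
    X_test, y_test = make_region(1_000, income_mean=65_000 + shift, miles_mean=14_000)
    mae = mean_absolute_error(y_test, model.predict(X_test))
    print(f"income shift +{shift:>6}: MAE = {mae:.1f}")
```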

Andrew Clark:

Once we know where that model falls apart, we know whether it hits the business objectives and where to put those monitoring bounds. So stress testing - there's a huge amount of model validation research, and it goes back to that NASA discussion - and supplemental synthetic data, I think, is one of the best ways to do it. Stress testing is the gap: most people in ML don't do it, and that's something we think is very important. Linking synthetic data to privacy, one area where you can definitely use it: differential privacy is technically synthetic data. There are two flavors of that - we could get into it later if you'd like - global and local, and basically you're adding noise at either the aggregator level or the local level, depending on who the trusted party is. So, we can dig into that later if we'd like. But that's another way where you're basically anonymizing certain attributes, as you mentioned with HIPAA earlier.
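As a rough sketch of the global-versus-local distinction Andrew mentions (the counting query, epsilon value, and randomized-response mechanism are illustrative choices, not a production recipe):

```python
# Differential privacy sketch on a counting query.
# Global (central) DP: a trusted aggregator sees raw data and adds noise once.
# Local DP: each person randomizes their own answer before sharing it.
import numpy as np

rng = np.random.default_rng(42)
smoker = rng.integers(0, 2, size=1_000)     # hypothetical sensitive 0/1 attribute
epsilon = 1.0
sensitivity = 1.0                           # one person changes the count by at most 1

# Global DP: Laplace noise added to the true count by the aggregator.
global_count = smoker.sum() + rng.laplace(0.0, sensitivity / epsilon)

# Local DP via randomized response: tell the truth with probability p, else flip.
p_truth = np.exp(epsilon) / (np.exp(epsilon) + 1)
flip = rng.random(smoker.size) > p_truth
reported = np.where(flip, 1 - smoker, smoker)
# Debias the aggregate of the noisy reports.
local_count = (reported.sum() - smoker.size * (1 - p_truth)) / (2 * p_truth - 1)

print(f"true: {smoker.sum()}, global DP: {global_count:.0f}, local DP: {local_count:.0f}")
```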

Debra J Farber:

Fascinating. Thanks. And then, Sid, what are some of the maybe not-so-great use cases for synthetic data?

Sid Mangalik:

Yeah.

Sid Mangalik:

So, I think Andrew's examples here are great because they really talk about how synthetic data is really useful when you need to push the boundaries of what you can safely collect.

Sid Mangalik:

Maybe you don't want to collect outcomes for people that are on the edges of these groups, or maybe they're just really hard to find.

Sid Mangalik:

So this is a really great use case for collecting data that's essentially not possible to collect.

Sid Mangalik:

Where this becomes a lot less useful is when you're trying to learn about the base distribution itself - when you're trying to learn about the original group of people being studied. If you create synthetic data by just repeating that data and feeding it back to itself, you'll create a system that becomes narrow, maybe even overfit, looking more and more like that original data set; your model sees, "Well, I have so many great examples of person A" - twenty times over - and becomes something of a person-A classifier or regressor. This is the same problem with LLMs: if you just put in the base distribution again, that center of the distribution, over and over, you narrow the standard deviations, you narrow the variance, and you end up with a very singular model that becomes hard to tame for those outside tasks - which is what synthetic data is much better for.

Debra J Farber:

That makes sense. I also see that it could be a privacy problem as well. Right? If you're training the model with pictures of Debra Farber, and then you throw in a few pictures of other people, it's still most likely going to spit out something that looks like me. I could see it being a privacy challenge, or I could see where some of these copyright issues get in the way, if the model is over-trained on either personal data or copyrightable material. Talk to me about the importance of the quality of the inputs.

Andrew Clark:

That's huge. We like to say on The AI Fundamentalists Podcast that 'we're bringing stats back.' Data science as a field kind of grew out of statistics not giving the world what it wanted, and statistics has taken a back seat. Essentially, what we're realizing here - and people are learning the hard way with some of these systems - is that you really have to have quality data that captures those complex inter-relationships. Although there are a lot of research areas on how to create better synthetic data, you still have to have some sort of good data. Data augmentation is really what the research is doing here - making better data - but you have to have that good core data. And how do you know data is good? How do you capture those inter-relationships? Surveying and all the traditional statistical evaluation techniques - all of those things are vitally important for building these AI systems. You have to have those core good inputs, and maybe you can sprinkle in some synthetic to help expand that or to stress test.

Andrew Clark:

But quality of data inputs - machine learning and AI are really garbage in, garbage out. Your model is only as good as the data you feed in. That's why people have the 'data is the new oil' discussions, but it's quality data. We've gone from one extreme - you had all this big data and didn't know how to handle it - to "okay, we trained off the internet, we have copyright issues, we don't know how to handle these things, let's just generate our own." But you can build really solid models. One of the big downfalls of the deep neural network structure is the amount of data it requires. Sometimes the parsimonious, easier-to-use models can be better, because they handle smaller inputs and smaller data sets better. So, the quality of your data is directly correlated with the quality of your modeling system.

Debra J Farber:

That makes sense. So, after the backlash against using open source data, or data scraped from the internet that was written by humans, Sam Altman from OpenAI has been talking about using LLM-generated synthetic data. What do you think about this approach with LLMs generally?

Sid Mangalik:

Yeah, I mean, this is exactly what we've talked about before. It is this problem of just regurgitating data over and over and over again, and there's been a lot of great research coming out of Harvard and Oxford where they're trying to study the negative effects of this type of approach. People are already a few steps ahead in trying this out and seeing what happens, and these models really, really suffer in these settings. And so, while it's going to be great to tell investors, "Hey, GPT-5 has twice the data GPT-4 had," if that isn't high quality data, we're just going to be creating weaker and weaker models over time. They might look better and better at the specific use cases that show up in the demos, but it's not going to create more factual models. It's not going to create models that are better aligned to human needs. It's just going to look more like human text, but that's very superficial.

Debra J Farber:

Why do you think they're taking this approach? Is it just about PR? Is it about investment - getting people to think that this is going to be the path forward for optimizing LLMs? How will they be able to demonstrate to the public, or even to Microsoft, who's giving them quite a bit of money as a partner, that they should continue down this path if, in the end, using LLM-generated synthetic data degrades the quality?

Andrew Clark:

Depends. We have the appropriate responses, and then we have the real responses.

Debra J Farber:

Oh, I want the real responses. Unless you don't want to say so on this podcast, but I would love to hear what you think the bullshit is from the PR.

Andrew Clark:

Yeah, well, I'll take a swing and then, Sid, we'll piñata this a little bit. I'm not a big fan of OpenAI and their structure. They actually started as a research lab - 'open' is in the name - and they were going to open source everything; they were going to be a good community member helping everyone expand the technology. They've really become a closed source monopoly. They were even naive enough to say, "Let's suggest regulations that we can then create a moat around ourselves with" - which, obviously, we've never seen that before.

Andrew Clark:

There's a lot of hubris. In the computer science community, there are individuals who are very egotistical, and there's great marketing around machine learning and these concepts: "It's new, it's brand new, we invented it here." Well, NASA was doing it in the 60s; it was just called engineering back then. There's this hype train that comes out of Silicon Valley - because of the success of Facebook and things like that - that as long as you have a computer science degree and you hang out in Silicon Valley, you're infallible and you can just do awesome stuff.

Debra J Farber:

Especially if you're a white male. Sorry, I just wanted to get that in there.

Andrew Clark:

Yeah, well, it's very true. No, I agree, I agree. And there's that hype around synthetic data too: it's a research area, so what could possibly go wrong? There's that lack of fundamental understanding, that lack of doing things the hard way. Everything is growth hacking, hacking this, hacking that, trying to cut corners and just assuming it's all going to work out. I think there's a lot of that culture and, you're right, it's a white male culture that operates that way. And honestly, I think they're just asking, how can we quickly make money, sell it, flip it, create a moat around ourselves? I really think it's operating that way.

Sid Mangalik:

I think we should remember a little bit of how we got into this situation in the first place. We got here because there was a huge backlash from creators, from people that own data, from people that create text data, saying, "I did not consent to this. I did not ask for my data to be put into this model. I certainly did not ask for it to be made public and available to everyone for free on the internet." And this underscores a common and consistent issue we see with OpenAI. When we talk to lay people about this, they're excited, they're ecstatic. They're like, "Wow, this AI is here, it's cool, it wasn't here before. I can't believe this is happening all of a sudden." But people that are in NLP labs don't feel this way.

Sid Mangalik:

This type of work has been out and ready for almost five years now. This is not new research. Google had these types of models long, long before OpenAI published ChatGPT. Why didn't they share them? Because they felt it was irresponsible. They didn't feel they were ready. They didn't feel they were in good shape. They hadn't done the proper work to make the data safe, secure, and consented. And then OpenAI does it very openly, very recklessly, and it feels like a band-aid solution. They're in a situation where they're finally getting backlash for using everyone's data, and they say, "Fine, we'll just make our own data."

Debra J Farber:

Fascinating. It'll be interesting to see how that actually develops. I am closely watching potential legislation in the area. I mean, gosh, I have so much to say about how they're coming to market and I might just save it for another conversation because it could take up its own conversation, basically. Let's switch to something that I want to say, not the opposite, but talk about ethics a little bit. So, talk to me about the importance of diversity in the training of AI models.

Andrew Clark:

That's huge. You mentioned it a little bit earlier with the image use case: a bunch of image models trained off of just your face, and then what happens when they get a different face? That's been a major problem in training data - there have been a lot of white faces in image data sets, and that's created issues. There are cases where synthetic data and up-sampling / down-sampling really do make sense, because these models, despite the PR around them, are not very intelligent. We talked earlier about optimization; you optimize over something.

Andrew Clark:

So, an area of research that Sid and I have been trying to highlight is 'multi-objective modeling,' where your model has to be both performant and fair. You have to find the best model that satisfies both, versus most of these models - ChatGPT is focused on "how can I make the model that sounds the most human," not "sounds the most human and is not racist." They're not focusing on the fairness. Now, you could; it just means you need more training data, and you need representative data. That's another reason why statistics is so important. Everybody likes to make fun of statisticians and polls and the U.S. Census Bureau and all those things, but that's a really hard job focused on "how can we accurately represent the underlying population?" So statistics is huge: if we're building a model that has those implications for end users, as we talked about earlier, we want to make sure it is representative of the underlying population - all the demographics, all the socioeconomic statuses, all of those attributes. We also want to make sure that minorities in that group who aren't well represented are accounted for - although, if you look at the U.S. in general, we are a very diverse group, so you should normally be able to use that U.S. data.

Andrew Clark:

The problem is, some of the samples people use aren't representative. But let's play it out anyway. Say you have a set of data and you still want to make sure it's fair towards minorities, even if you don't have enough data for them. You can either up-sample, or you can use multi-objective modeling to ensure fairness and control bias. This is where it gets really complicated, and why we don't like this willy-nilly 'synthetic data for everything' approach. There are some great use cases for synthetic data, but they're more about adding robustness, stress testing, and safety to systems. So, for instance, if I'm looking at a set of billionaires, maybe there aren't enough African-American or female billionaires; well, I'll make up a couple so my data set is more balanced when I'm modeling billionaires. That's a great case for synthetic data. Saying, "I'm just going to make up a set of data out of synthetic data with no reference point" - that's a scary use case, and that's what OpenAI is, at least publicly, suggesting.
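To make the multi-objective idea concrete, here is a minimal sketch (an editorial illustration, not Monitaur's method): a logistic model optimized for predictive loss plus a demographic-parity penalty, with the data, the fairness metric, and the weighting all chosen purely for illustration.

```python
# Scalarized multi-objective training sketch: loss = log-loss + weight * parity gap.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(7)
n = 2_000
group = rng.integers(0, 2, n)                                   # hypothetical protected attribute
x = np.column_stack([rng.normal(size=n), rng.normal(size=n) + 0.5 * group])
y = (x[:, 0] + 0.3 * group + rng.normal(scale=0.5, size=n) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def objective(w, fairness_weight=2.0):
    p = sigmoid(x @ w[:2] + w[2])
    log_loss = -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))
    parity_gap = abs(p[group == 0].mean() - p[group == 1].mean())
    return log_loss + fairness_weight * parity_gap              # both objectives in one scalar

w_plain = minimize(lambda w: objective(w, fairness_weight=0.0), np.zeros(3), method="Nelder-Mead").x
w_fair = minimize(objective, np.zeros(3), method="Nelder-Mead").x

for name, w in [("accuracy only", w_plain), ("accuracy + fairness", w_fair)]:
    p = sigmoid(x @ w[:2] + w[2])
    acc = ((p > 0.5) == y).mean()
    gap = abs(p[group == 0].mean() - p[group == 1].mean())
    print(f"{name}: accuracy={acc:.3f}, parity gap={gap:.3f}")
```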

Debra J Farber:

Oh, fascinating. Okay.

Andrew Clark:

Yeah. Synthetic data can be used for good, but it's also going to be used for bad. On that note - and I'm not going to talk about OpenAI here - what I normally notice, and we help a lot of companies with bias and fairness, is that the majority of people actually do want to do the right thing, and yet these models can be so easily biased. With the image researchers we talked about, I don't think anybody set out to build a biased model (maybe it's partly because there's a white male culture, and I'm sure there are bad apples), but not everybody is saying, "I want to make a biased model." A lot of times it's just ignorance, or not understanding the nuance. Right? So, these complexities are huge, and that's why we really need those best practices for building fair, performant systems and making sure we're doing it responsibly. Synthetic data is definitely part of it, but it is not a panacea. You have to know what you're doing, take it slow, and make sure you have balanced data.

Debra J Farber:

So, don't move fast and break things? That's right. For consequential systems. For consequential systems. That makes sense.

Sid Mangalik:

Yeah, and this isn't just a human issue. This is an AI modeling problem, too. Right? This is class imbalance, point blank. If you don't let your model see these types of outcomes, if you don't let it see healthy outcomes for minority groups, you will only create a model that assumes that they only have unhealthy outcomes and will never allow them to have the same type of fairness and equity that other people experience in these models.
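For a bare-bones view of the class-imbalance point (again an illustrative sketch with simulated data, not the guests' code), compare an unweighted classifier against one that reweights the rare class:

```python
# Class-imbalance sketch: with roughly 10% positives, an unweighted model
# tends to miss the rare class; class weighting is one standard mitigation.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

rng = np.random.default_rng(3)
n = 5_000
x = rng.normal(size=(n, 3))
y = (x[:, 0] + rng.normal(scale=2.0, size=n) > 2.9).astype(int)   # roughly 10% positives

plain = LogisticRegression().fit(x, y)
weighted = LogisticRegression(class_weight="balanced").fit(x, y)

print("recall, unweighted:           ", recall_score(y, plain.predict(x)))
print("recall, class_weight=balanced:", recall_score(y, weighted.predict(x)))
```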

Debra J Farber:

Yeah, I mean, you could see how - maybe not those building the models, but a nefarious billionaire or someone with a lot of power (I'm not addressing anyone specifically) who wants to, I don't know, curtail a certain population somewhere - could inject bias into these models, or somehow create constraints in these models that do harm, on purpose, to certain populations. So, yeah, that's scary. You want responsible AI.

Debra J Farber:

Yeah, I could see why, and I think that there's not a great way for media to really understand how can they actually cut through the hype and talk about the true challenges that are going on today. Right now, they've been pretty much distracted by the far future potential risks to humanity that those bringing LLMs to market talk about, to avoid speaking about the risks that are inherent in the models today that could hurt people today. I don't know what the answer is to get the media more educated on it so that they can actually have a real discussion with the public at their level. But gosh, do I hope that they get there.

Andrew Clark:

Yeah, that's what's really tough: as we're digging into this, there are so many layers, it's hard to explain in a quick note. And, honestly, with OpenAI, I would like to think there's nobody doing anything nefarious there. They just don't know how to do this bias mitigation, or how to do synthetic data in a way that captures those relationships. It's not a straightforward thing. It's not even very well known in the ML community.

Andrew Clark:

So I'd like to think - and I have no reason not to - that OpenAI just doesn't understand some of these things. They're not trying to be bad. Maybe they are, but we don't have any evidence that they are. (Debra: That makes sense.) It's just very difficult: how does the media portray this, how do you understand this type of thing, and how do we even educate our universities that this is a thing? Google understood this; that's why they did not want to release Bard until OpenAI did it first - then they had to. OpenAI, I honestly think, was just 'move fast and break things' and didn't think about it. That's honestly what I think happened.

Debra J Farber:

I agree. Watching it come to market, I remember, a few weeks before - I don't remember if it was Google or Meta; I think it was Meta - had released some new LLM or something new to the AI community, and people were asking, "Why have you released this? It's not ready for general release; there are so many issues with it," and there was some backlash online. And then, all of a sudden, OpenAI goes, boom, here's ours. Right? They provided API access to it and really hyped it so that everybody and their mother could try it - at least the free version, not the commercial version, which is even worse because there are fewer protections in there. It really felt like a PR campaign of: let's capitalize on the interest in AI now, let's make it available everywhere in the world all at once and really hype it so that regulators can't shut us down. That's what it felt like to me, as someone who's been in privacy for 18 years and has watched how companies come to market - I sit on several privacy tech advisory boards and help with go-to-market - and, watching this happen, it really feels that way to me.

Debra J Farber:

I'm not saying that the people who work for the company themselves are bad or nefarious, but I think the way they came to market was arrogant and dared regulators to act. So, to me it just seems a little ridiculous, given that Google and Meta and the big tech companies already know what happens. They're already under 20-year FTC consent decrees for past actions, whether around privacy or misleading users. So, they've learned their lessons, or they're currently under consent decrees and don't want to rock that boat because they're being audited.

Debra J Farber:

Right? OpenAI doesn't have any of those restrictions, and what I see from the Silicon Valley investors is, "Well, okay, we'll just make this a line item - all of the potential lawsuits and whatever are a line item against the potential billions in, you know, annual revenue. At least we'll be able to come to market and make billions of dollars, and then we'll just have to pay for the lawsuits and any fines later; at least we get the big market share." So, that's the part I'm most angry about: the way that they came to market. But I do see that each PhD who works there isn't necessarily, you know, bad or evil - it could just be ignorance. It certainly isn't helping to have this kind of stress on the data scientists who are working to try to fix something that might not be fixable. So, I know I've heard you talk in the past about this concept of the 'fairness through unawareness fallacy.' Can you unpack that a little bit for us?

Andrew Clark:

Yes, I'll take the high level and then we can add some details. So, fairness through unawareness is where a lot of companies think that if they're not looking at an attribute, they're fair. Say I have a tabular data set - just like an Excel spreadsheet, think of it with rows and columns - with information about pricing for an insurance policy, as an example. So I'll have a bunch of information: this person lives in Tennessee, this person drives a Ford pickup, they drive around 30,000 miles a year, that kind of information. Right? Well, I'm going to just remove age and gender and ethnicity, "because my model doesn't see that stuff, it's fair. I don't have to look at it." Actually, in certain industries, like insurance, you're not allowed to look at it. In federal housing, they actually keep an aggregated statistical database on that to make sure their models are fair - that's a rabbit hole. But anyway, you basically remove any of those sensitive attributes and train your model without them, and then there's no way to track it; you don't know whether you're fair or not, but you think you're fair because you didn't train on those attributes.

Andrew Clark:

It's the assumption that if I had age in the system, it might lead to ageism, but because I took age out, it's not going to do anything. The problem is those inner correlations and relationships - the same reason you can't get synthetic data that's as accurate as real data. Those relationships still exist. Maybe there's a proxy: "Well, people who drive Ford pickups and live in this specific area" - and it turns out they're actually rich white males, and that correlation still exists in the data set, along with those inner relationships. So, you're just blind to the fact that there might be bias, thinking, "Because I'm not training on age or gender or ethnicity, I'm magically not biased," which is a fallacy.
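A toy simulation makes the fallacy easy to see (all variables here are invented for illustration): the model never sees the protected attribute, yet a correlated proxy carries it straight into the scores.

```python
# "Fairness through unawareness" sketch: drop the protected attribute, keep a
# proxy, and the model's outputs still differ sharply by group.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(11)
n = 10_000
protected = rng.integers(0, 2, n)                      # e.g., a demographic group
proxy = protected + rng.normal(scale=0.4, size=n)      # e.g., a geographic feature tied to it
other = rng.normal(size=n)
y = (0.8 * protected + other + rng.normal(scale=0.5, size=n) > 0.5).astype(int)

# The model is trained "blind": it never sees `protected`...
model = LogisticRegression().fit(np.column_stack([proxy, other]), y)
scores = model.predict_proba(np.column_stack([proxy, other]))[:, 1]

# ...yet its scores still track the protected attribute through the proxy.
print("mean score, group 0:", round(scores[protected == 0].mean(), 3))
print("mean score, group 1:", round(scores[protected == 1].mean(), 3))
```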

Debra J Farber:

Oh, that's interesting. Okay, I get that. That makes sense.

Andrew Clark:

And, as we talked about earlier with multi-objective modeling, it's counterintuitive, but we actually want to know ethnicity, gender, and age, because we can then train our model to explicitly not be biased. There are ways to say, "If you flip gender, or ethnicity, or any of those attributes, I want the exact same response." So if I flip ethnicity, I should get the exact same pricing; that variable should not have any impact. I can train my model to do that. But I don't know if that's the case if I just threw out the attribute - and oftentimes, as we've figured out, there actually is implicit bias that you just can't see. So, that's kind of a dangerous solution.
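The flip test Andrew describes can be written down in a few lines; this sketch assumes a binary protected attribute stored as one column of the feature matrix, and the model, column index, and tolerance are placeholders.

```python
# Counterfactual flip test: flip the protected attribute, rescore, and report
# how often the prediction changes. A model trained to ignore that attribute
# should report a rate of 0.0.
import numpy as np

def counterfactual_flip_rate(model, X, protected_col, tolerance=1e-6):
    X_flipped = X.copy()
    X_flipped[:, protected_col] = 1 - X_flipped[:, protected_col]   # flip the 0/1 attribute
    original = model.predict_proba(X)[:, 1]
    flipped = model.predict_proba(X_flipped)[:, 1]
    return float(np.mean(np.abs(original - flipped) > tolerance))

# Usage (hypothetical): rate = counterfactual_flip_rate(pricing_model, X, protected_col=3)
```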

Debra J Farber:

That's really helpful - definitely a fallacy to avoid. Okay, awesome. And you've also spoken in the past about the difference between 'randomized data' and 'synthetic data.' How is randomized data different?

Sid Mangalik:

Yeah, so I mean you can really think about randomized versus synthetic data as "is the data that I get at the end of this coherent? If you just create purely random data, you'll create 17-year-olds that have been through three divorces and are the senior executive of the bank.

Andrew Clark:

Great examples.

Sid Mangalik:

Right - that's just random data. This is not coherent data. This doesn't make sense. This doesn't look like anything we've seen before. The goal of synthetic data is not to be random data; it's to be data that looks like the original data, acts like the original data, and whose variables interact the way real data's variables interact.

Debra J Farber:

Excellent, that's a really helpful definition. I really appreciate it. Okay, so there are several techniques that you've described in the past for using synthetic data with ML/AI: the Monte Carlo method, Latin hypercube sampling, Gaussian copulas (I hope I'm saying that right), and random walking. Do you mind walking us through those four methods? Obviously at a high level, but if you wanted to use a use case, that would be helpful for the audience. Maybe start with the Monte Carlo method: what are the pros and cons of using this technique for synthetic data?

Andrew Clark:

Sure, I'll take Monte Carlo and Latin hypercube and then, Sid, you can do the last two. So, the Monte Carlo method is named after the famous Monte Carlo casino in Monaco, I believe. In general, it's a more intelligent way of sampling: you're doing repeated random sampling to obtain numeric results.

Andrew Clark:

Normally, what you're trying to do is use randomness to solve deterministic problems. It's something we do when we're trying to test a system or determine all the possible outcomes. We'll run through a scenario, say, 100 times, where you have some randomness factors or stochastic generators. Essentially, we're trying to represent "here are the different scenarios for economic growth," for instance, those sorts of things. You have this whole system where you run through it, sampling each time, and you're trying to figure out what the true value really is by perturbing that input space and tweaking attributes. It's really a system for running experiments, if you will, to define what the outcomes could be for a system.

Andrew Clark:

So, Latin hypercube sampling is a more intelligent way to sample, instead of just random sampling. Essentially, if you think of a chessboard, it helps you make sure you hit all of the squares on the chessboard, versus random sampling. Sid's example of a 17-year-old with three divorces who's the CEO of a bank was great: if you're just randomly sampling the input space, you can get some crazy outputs. Latin hypercube tries to be a little bit more intelligent about sampling for Monte Carlo simulation. I used to do economic simulations, trying to build economies and figure out different growth patterns and things like that. So, Monte Carlo is a really good technique for stress testing models, for exploring complex systems - what are all the possible inputs and outputs - and for defining how you should build a system. It can be used specifically for those stress testing discussions we talked about earlier, to generate synthetic data for those different scenarios.
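Here is a compact sketch of the two sampling ideas, using a made-up stand-in for the system under test (the input ranges, toy pricing function, and sample size are illustrative): plain Monte Carlo draws inputs at random, while Latin hypercube sampling stratifies the input space - the chessboard Andrew describes - so the same number of samples covers it more evenly.

```python
# Monte Carlo vs. Latin hypercube sampling over a toy 2-D input space.
import numpy as np
from scipy.stats import qmc

def pricing_model(income, miles):                 # hypothetical system under test
    return 300 + 0.004 * income + 0.01 * miles

rng = np.random.default_rng(5)
low, high = np.array([30_000, 2_000]), np.array([150_000, 40_000])

# 1) Plain Monte Carlo: independent uniform draws over the input space.
mc_inputs = rng.uniform(low, high, size=(500, 2))

# 2) Latin hypercube: one draw per stratum along each dimension, then rescaled.
lhs_inputs = qmc.scale(qmc.LatinHypercube(d=2, seed=5).random(n=500), low, high)

for name, inputs in [("Monte Carlo", mc_inputs), ("Latin hypercube", lhs_inputs)]:
    outputs = pricing_model(inputs[:, 0], inputs[:, 1])
    print(f"{name}: output range {outputs.min():.0f} to {outputs.max():.0f}, "
          f"mean {outputs.mean():.0f}")
```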

Debra J Farber:

Oh cool.

Sid Mangalik:

And I think, on the other side, Gaussian copulas and random walking solve the other problem.

Sid Mangalik:

So, if the first two solve the problem of picking good samples, Gaussian copulas and random walks solve the problem of making coherent selections.

Sid Mangalik:

So the Gaussian copula is a technique grounded in our favorite bell curve, the famous normal distribution, and in thinking about how random variables can be interrelated, creating output data that mimics the shape and the correlations of the original. If you take a correlation matrix of the original data and a correlation matrix of your synthetic data, you want them to look as similar as possible, right? You want age and number of divorces to be highly correlated; you don't want those to be inversely correlated or something. So this helps make data that matches the shape of the original data, even in the interaction space. It's funny that I'm covering random walking here, because this is Andrew's PhD topic, but it's a similar idea: we're walking through the data and progressing through it in a natural way. We consider two variables, then walk over to the third variable, then the fourth, then the fifth, in a logical progression, which lets you get closer to the shape of the original data by, in a sense, walking the original data.
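In code, the Gaussian-copula idea Sid sketches can look like the following (a simplified editorial sketch: real implementations handle discrete columns, ties, and heavier-tailed dependence, and the column layout here is hypothetical).

```python
# Gaussian copula sketch: sample correlated normals, map them to uniforms with
# the normal CDF, then map the uniforms back through each real column's own
# empirical distribution so marginals and correlations both roughly match.
import numpy as np
from scipy.stats import norm

def gaussian_copula_synthesize(real: np.ndarray, n_samples: int, seed=0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    corr = np.corrcoef(real, rowvar=False)                 # dependence structure of real data
    z = rng.multivariate_normal(np.zeros(real.shape[1]), corr, size=n_samples)
    u = norm.cdf(z)                                        # correlated uniforms in (0, 1)
    return np.column_stack(                                # invert each empirical marginal
        [np.quantile(real[:, j], u[:, j]) for j in range(real.shape[1])]
    )

# Usage (hypothetical): synth = gaussian_copula_synthesize(real_table, n_samples=1_000)
# np.corrcoef(synth, rowvar=False) should land close to the real correlation matrix.
```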

Debra J Farber:

Awesome. Okay, thank you so much. If people want to learn more about these techniques, is there somewhere they could go for a more in-depth overview of synthetic data techniques for ML and AI? Do you know where they can go for that?

Andrew Clark:

That's a tough one. This is where it gets a little tricky. You can research some of these techniques - Gaussian copulas, for instance, are a huge research field; in fact, because of the underlying distributional assumptions, they got hit really hard after the financial crisis and were one of the contributing factors in some of it. There's not, that I'm aware of - Sid, weigh in if there are any good textbooks or anything - a definitive "look here for synthetic data" resource. Definitely, you can reach out to us and we're happy to point you to different things. It's very tough, and that's part of this lack of a fundamentals-first approach: how do people even learn some of these techniques? It's very much embedded within individual disciplines - aerospace engineering does a lot of Monte Carlo, computational finance does this, complex systems engineering does that. We try to take that interdisciplinary approach, but it's hard to find a single resource to direct people towards.

Sid Mangalik:

Yeah, sometimes it feels like you just start on Wikipedia. You learn about the techniques, you pick up a stats textbook - there are some great online stats textbooks - and you learn about the methods, but there's not necessarily a space yet - and this is very much a growing, nascent space - for taking these techniques and bringing them to AI modeling. This is still a relatively new idea, so while the math is there, it hasn't quite been married to AI yet.

Debra J Farber:

Got it. Well, maybe you guys can write that book, and I would be glad to promote it, because clearly data scientists need this information. I'm not saying you have to go and write a book, but somebody should, because clearly people are thirsty to do the right thing when it comes to building models and they don't necessarily know what to do. Okay, my last question for you guys before we end today, so we can end on a really positive note: what are you most excited about when it comes to synthetic data and how it will be used in the future?

Andrew Clark:

I love simulations; my PhD work is on them. I love building complex systems and stress testing models, and I really think the future is raising awareness of these techniques and building systems that are safe, performant, and reliable by simulating and stress testing them. As we move toward modeling systems and AI systems being used for consequential decisions, we really have to start doing that stress testing step. The OCC, the Office of the Comptroller of the Currency, after the financial crisis set out a model risk management framework that banks have to follow, as an example. Model validation with effective, objective challenge is a benchmark there, but really only the large banks have to follow it, and stress testing is likewise something only the larger banks have to do.

Andrew Clark:

But outside of that realm, model validation and stress testing haven't really been utilized much, besides in aerospace engineering, where they're used a lot for spaceships and fighter planes and all that kind of fun stuff. (Debra: For human safety purposes there. Exactly.) Exactly - that's where it's been used. There are whole fields of safety engineering and reliability engineering that NASA helped pioneer; Boeing uses this for building 737s and all that kind of stuff, right? There are those other fields doing these things. Once we're building consequential systems that do affect human safety and reliability, we need to start bringing those techniques into building these systems. Before ChatGPT goes live, run a bunch of safety and reliability engineering tests on it, that kind of thing. So I'm excited about the potential of using synthetic data to do those things. And that's where synthetic data is not new - it's building scenarios. So I think there's a great area for research there. Just one caveat for the profession: don't think of it as a replacement for real data for training your models; use it to augment and stress test your models.

Debra J Farber:

Excellent. Sid, anything to add to that?

Sid Mangalik:

Yeah, I mean, I don't think the field is ready yet, but I'm really excited about the potential of synthetic data as a way to do unintrusive privacy for data processing and management. We have patients and medical data and we want to learn from them, and it can be difficult sometimes to have large sample sizes, to have safe sample sizes, to collect data from patients who aren't just the standard profile we always collect. Synthetic data might pose one of the first chances for us to have good-quality, privacy-forward data available to us. There are still a lot of problems we're figuring out in this field, like what it looks like if we're just entrenching the same biases again and again, but there's a really strong possibility that in the next decade we could see synthetic data being used in a lot of use cases where we couldn't safely do it before.

Andrew Clark:

Thank you so much for having us on. I think this was a really fun discussion. It's been great talking with you, and yeah, this is a huge topic - privacy and data are becoming more and more important - so thank you for having this podcast. I think it's definitely something very much needed.

Debra J Farber:

Excellent. Well, Andrew and Sid, thank you so much for joining us today on The Shifting Privacy Left podcast. Until next Tuesday, everyone, when we'll be back with engaging content and another great guest. . . or guests. Thanks for joining us this week on Shifting Privacy Left. Make sure to visit our website, shiftingprivacyleft.com, where you can subscribe to updates so you'll never miss a show. While you're at it, if you found this episode valuable, go ahead and share it with a friend. And, if you're an engineer who cares passionately about privacy, check out Privado: the developer-friendly privacy platform and sponsor of the show. To learn more, go to privado.ai. Be sure to tune in next Tuesday for a new episode. Bye for now.

Introducing Andrew Clark & Sid Mangalik
What motivated Andrew to found Monitaur and focus on AI governance
Sid shares his career path, why he chose to focus on AI governance, and how he ended up at Monitaur
The definition of 'synthetic data' and why academia takes a more rigorous approach to deploying and testing synthetic data than industry does
Whether the output of LLMs are synthetic data and the problem with continuing to train LLM base models with this data
What 'synthetic data' use cases are most helpful when it comes to AI, and which ones are the most unhelpful?
Andrew & Sid discuss why the 'quality' of input data is so important for training AI models; and discussion of OpenAI's announcement that it plans to use LLM-generated synthetic data
Andrew & Sid critique OpenAI's approach, the AI hype machine, and the problems with cutting corners via 'growth hacking'
Andrew emphasizes the importance of diversity when training AI models and using 'multi-objective modeling'
Andrew unpacks the "fairness through unawareness fallacy" for us
Sid explains the difference between using 'randomized data' and 'synthetic data' with a fun example
Andrew & Sid describe 4 techniques for using synthetic data with ML/AI: 1) the Monte Carlo method; 2) Latin hypercube sampling; 3) gaussian copulas; & 4) random walking
Andrew & Sid describe what they are each most excited about when it comes to synthetic data and how it will be used in the future
