The Shifting Privacy Left Podcast
Shifting Privacy Left features lively discussions on the need for organizations to embed privacy by design into the UX/UI, architecture, engineering / DevOps and the overall product development processes BEFORE code or products are ever shipped. Each Tuesday, we publish a new episode that features interviews with privacy engineers, technologists, researchers, ethicists, innovators, market makers, and industry thought leaders. We dive deeply into this subject and unpack the exciting elements of emerging technologies and tech stacks that are driving privacy innovation; strategies and tactics that win trust; privacy pitfalls to avoid; privacy tech issues ripped from the headlines; and other juicy topics of interest.
The Shifting Privacy Left Podcast
S2E30: "LLMs, Knowledge Graphs, & GenAI Architectural Considerations" with Shashank Tiwari (Uno)
This week's guest is Shashank Tiwari, a seasoned engineer and product leader who started with algorithmic systems of Wall Street before becoming Co-founder & CEO of Uno.ai, a pathbreaking autonomous security company. He started with algorithmic systems on Wall Street and then transitioned to building Silicon Valley startups, including previous stints at Nutanix, Elementum, Medallia, & StackRox. In this conversation, we discuss ML/AI, large language models (LLMs), temporal knowledge graphs, causal discovery inference models, and the Generative AI design & architectural choices that affect privacy.
Topics Covered:
- Shashank describes his origin story, how he became interested in security, privacy, & AI while working on Wall Street; & what motivated him to found Uno
- The benefits to using "temporal knowledge graphs," and how knowledge graphs are used with LLMs to create a "causal discovery inference model" to prevent privacy problems
- The explosive growth of Generative AI, it's impact on the privacy and confidentiality of sensitive and personal data, & why a rushed approach could result in mistakes and societal harm
- Architectural privacy and security considerations for: 1) leveraging Generative AI, and those to avoid certain mechanisms at all costs; 2) verifying, assuring, & testing against "trustful data" rather than "derived data;" and 3) thwarting common Generative AI attack vectors
- Shashank's predictions for Enterprise adoption of Generative AI over the next several years
- Shashank's thoughts on proposed and future AI-related legislation may affect the Generative AI market overall and Enterprise adoption more specifically
- Shashank's thoughts on the development of AI standards across tech stacks
Resources Mentioned:
- Check out episode S2E29: Synthetic Data in AI: Challenges, Techniques & Use Cases with Andrew Clark and Sid Mangalik (Monitaur.ai)
Guest Info:
Privacy assurance at the speed of product development. Get instant visibility w/ privacy code scans.
Shifting Privacy Left Media
Where privacy engineers gather, share, & learn
Disclaimer: This post contains affiliate links. If you make a purchase, I may receive a commission at no extra cost to you.
Copyright © 2022 - 2024 Principled LLC. All rights reserved.
The idea is, at least in my mind, that until this technology matures and until there is a good understanding of how we can partition it, stay away from putting any confidential data whatsoever on those public LLMs. It's an absolute no-no. The same way that you would have absolute controls over confidential data today not being put in certain places or sitting out there in plain text unencrypted. So, the same kind of policy, that same kind of rigor needs to be applied, I think, from a standpoint of making these onto the public LLMs. So that's definitely one good architecture principle I would say one should start with.
Debra J Farber:Welcome everyone to Shifting Privacy Left. I'm your host and resident privacy guru, Debra J Farber. Today I'm delighted to welcome my next guest, Shashank Tawari, Co-founder and CEO of Uno, a path-breaking autonomous security company. He started with algorithmic systems on Wall Street and then transitioned to building Silicon Valley startups, including previous stints at Nutanix, Elementum, Medallia, and StackRox. As a seasoned engineering and product leader, Shashank builds highly scalable systems and high-performing teams. He has keen interest and expertise in a number of very disciplines. That includes cybersecurity, artificial intelligence, mathematics, investment analysis, trading, law, marketing, and accounting. Today, we're going to talk about machine learning and AI, large language models, temporal knowledge graphs, causal discovery, and the design and architectural choices that affect privacy. Welcome,
Shashank Tiwari:Thank you for having me, Debra. I'm very glad to be here on your fantastic podcast.
Debra J Farber:Thank you so much. To open up this discussion, why don't you tell us about your origin story? How did you get interested in security, privacy, and AI; and, what led you to found Uno?
Shashank Tiwari:Yeah, absolutely. I think. For me, as for many people who've been two decades and more, a lot of this was serendipitous and natural, if you may, as we started building systems for the internet, as we started building the modern stack. My first encounter, or my first real work in the area of privacy and security, really started almost two decades back when I was still back in Wall Street working on, as you included as part of my introduction, algorithmic systems up there. The stacks looked very different. It was a data center- driven world, but there was certainly a very high awareness and a very important part of the strategy to make everything secure because security mattered. Security mattered, of course, from a standpoint of how we see today in terms of unauthorized access and breaches, but also from an IP standpoint, that we had our own intellectual property and that nobody else got hold of that. That's where I first started dabbling with it. Back in those days, we had far fewer tools. The world looked much, I would say, simpler to an extent, even from an attack surface standpoint. The privacy nightmare that has been, I would say, unfurled with the rise of social media and the now AI didn't exist. Privacy in those days was all about making sure you had the right controls and the right level of access and encryption and other few things to keep it all secure and nicely tied up. That's where I started. As I started building more and more technology-centric companies and started looking at the core layers of data, the core layers of intelligence being the foundation for many of these companies, security and privacy played a very important part; because wherever there is data, there is the aspect of security and privacy. That's my background, if you may, of how I segued into the space and grew into it over the years.
Shashank Tiwari:More specifically, in terms of Uno, we got started about a couple of years back. In fact, this October, we'll celebrate exactly our second birthday as a company. Of course, this company was born out of our experiences from prior companies, including StackRox, where I ran engineering for a bit; and then, of course, other places where we had built a lot of scalable, secure platforms. Those certainly played into the puzzle. Of course, the timing was right. Certain technologies and certain things become available at certain points in time in their evolution and it was just the AI moment and problems that we had been wanting to solve for a long time appeared more solvable. That's when we got going. That's really the story behind Uno. In the two years, it's been a lot of fun, lots of learning and lots of contribution back to this community.
Debra J Farber:Awesome. I know that you are using some really interesting technologies. Can you talk to us about what is a "temporal knowledge graph?" How is that used with LLMs to create what you call causal discovery? How can this be helpful to privacy? I know I'm just throwing a whole bunch of terms out there, but maybe you can thread those together and give us an overview.
Shashank Tiwari:Indeed, LLMs or large language models, of course have become commonly known, and credit goes to Chat GPT for democratizing it, at least for people who are non-technical and people who are not deep in this privacy / security puzzle. Everyone's heard of it. Yes, everyone's very enamored by the fact that machines have become smart enough to hold intelligible conversations with humans and appear like it's a human across the buyer. That is something that I think we are getting very familiar with across board. There's power in that technology, because that technology can end up doing things that we as humans do today, including a large part of the puzzle that is either very mundane or is difficult to do because it involves a lot of effort or plowing through very large amounts of data - things like summarization, or generating stories, or extracting the meaning out of paragraphs, or reading documents for us and telling us what the gist of it is. Then, of course, also asking questions and then having a meaningful conversation. All of this the language models provide, which is fantastic, and we certainly use that as a part of our stack as well at Uno. The problem, though, is when you start working with serious enterprise- grade technology, especially as it relates to security and privacy and infrastructure the world that we live in. There's a problem that the language models pose, which is a serious one. That is really bordered around the fact that, as intelligible as it sounds, it's not really an intelligent machine. It's not a thinking machine and it makes things up. There's a lot of talk about that. There is, of course, the word "hallucination that is being used to summarize that whole problem space, where the language model pretends to know and speaks with confidence, but in reality what it's saying is actually untrue or complete fiction. This is good when you're doing creative stuff and writing new stories, but it's extremely dangerous when you are doing some enterprise- grade decision making, especially as it relates, as I mentioned, to security, the works. In that realm. You will need to temper that down with some state of reality. You need to bring in a state of reality that can be enmeshed and mixed with these language models and somehow get the best of both. That's where we came up with this interesting innovation, which of course builds on years of work that lots of people have done over the years to extend this concept of knowledge graphs.
Shashank Tiwari:Knowledge graphs by themselves are not new. Knowledge graphs have been in academia and in research for a long time, for a couple of decades at least. They were originally used largely to represent things - concepts that we as humans know and the relationship between those concepts. Essentially, if you had Person A and Person B and Person A and Person B were friends, that notion could be captured in a knowledge graph. If Person A lived in a certain city, well, that could also be modeled in a knowledge graph. If the Person A had also certain sets of choices or certain kinds of behavioral traits, those could also be thrown in. Essentially, knowledge graph was looked at as a way to abstract and encapsulate human knowledge. That's where the word knowledge graph comes from, of course, because it embeds relationships between things - the word 'graph.' That's where the origin of that idea is.
Shashank Tiwari:When you relate it back to systems and the stack and, again, concerns about security and privacy, one of the realities we live in is that no threat, no attack really is just a point in time activity. There's nothing that happens at the very moment and there is nothing before and after that. There is no such thing. If you start looking at most of these privacy and security concerns, it's almost like unfolding of a story. There is either a deliberate or inadvertent activity or action that occurs that leads to some sort of an effect. That effect leads to either a compromise or some sort of an exfiltration and so on and so forth, but in any case it's usually over a time window. It's almost as if the story has a beginning, and then the middle where all the activity happens, and then an end and an unfortunate end sometimes where the data is leaked or the privacy is compromised.
Shashank Tiwari:And so, from our standpoint, when we started piecing that part of the puzzle, the important piece that we really kind of thought of early, and I think it's very pertinent, was that we need some sort of a time machine. We need a concept which is not just point- in- time, but something that we can walk back in time, walk forward in time, and reason through time. So, we need to capture almost like in time windows and think of it in terms of our temporal dimension: where we started, how it's evolved, where is it headed, and then, of course, even reason through it from "what if? Standpoint of "where could it go. And again, we're talking security / privacy, so of course we'll overlay this with constructs that are familiar in security engineering and privacy engineering and then kind of predict what the outcomes could be like or what the effectors could be. So, all of this gets encapsulated essentially - the concept of the knowledge and the time machine into this temporal knowledge graph that we have built - and we've set up a patent- pending on it and a bunch of other work, intellectual property work that we have done around it. We, of course, want to democratize and take this forward and have the world take advantage of this because we feel that technology is very powerful, the construct is very interesting, and very applicable for the world that we live in.
Shashank Tiwari:Now, last but not the least, there is the third aspect, which is of "causal discovery." So, we already sort of covered the beauty of the language models (the LLMs), talked about the temporal knowledge graph; now, there is one reality, though, which is sort of interesting to think about, especially in the world of, again, breaches, compromises, exploits, and any kind of security or privacy-centric study. You could not run it really like an experiment; and what I mean by that is if you have to introduce a new drug in the market - and I'm just shifting gears a little bit and giving an example because I think sometimes examples make things a little more tangible, if you may - so if you are introducing a new drug, you can conduct very controlled experiments. You can figure out what the drug does. You can have a small sample. You have a control sample. And then, you can study how the drug behaves and, of course, if it meets the success criteria; and if it meets the parameters that one is expecting, you essentially approve it and then it becomes generally available, and then the whole masses, the population, can take advantage of that drug.
Shashank Tiwari:Now, unfortunately, when you look at that sort of a model, which is called a causal inference model - where you're actually deliberately affecting and then studying what is the impact of those effectors - that cannot be applied in a world of cyber and privacy. Not always, sometimes you can in terms of "what if? Analysis, but definitely you cannot apply that post-facto because the activity has already occurred. So it's not like you could recreate the activity or go and conduct the hack yourself. Of course you do, sometimes in controlled environments, with pen testing and red teaming and those kinds of activities, but most often than not you are essentially given a post-facto report and then you have to go and thread the needle there and figure out what may have been the reasons behind it.
Shashank Tiwari:What's the root cause? What caused it? How did it start? So, you know, this is a very classic problem in research about what is called causal discovery, where the idea is that you should be able to see something and then thread back the needle and figure out what's the root cause behind it. So, that problem space is very pertinent in the world that we live in. And, we kind of bring in these three ideas together: 1) large language models; 2) the temporal knowledge graph that I just spoke about; and then, of course 3) the concept of discovering root cause by observing, if you may, the causal discovery piece. We combine these three things and, you know, try and solve the problems in the world of privacy and security. So that's what we do.
Debra J Farber:Wow, that definitely sounds like a lot of new technology put together to solve some age old problems. I'm excited to hear about that. How do you think, more generally, that this explosive growth of AI and LLMs (so, generative AI) has impacted privacy and confidentiality?
Shashank Tiwari:Yeah, that's a very important discussion that I think is happening across companies and across thought leaders today. You make a fantastic point, you know, and if I were to sort of respond to your point, I think there are two parts of it. One is that, yes, there is a massive rise of Gen AI, which means that people will take it places. They will use it in various different ways that we don't know what it would mean in terms of implications and confidentiality and privacy. And then, the second part of the puzzle is there are some pieces of that, or some evidence already out there, where it's becoming very clear that if we don't have a good thoughtful approach to it, we could end up making mistakes that could be very expensive. That could really cost us a lot as a society at large. Now, to give you a little complexity - you know the scope of complexity involved here, Debra - and if you start looking at the world of AI, the world of AI and largely the world of language models, which is where Gen AI is kind of sprouting from, relies a lot on what I would call, broadly, a ton of black box constructs. So, there are a lot of these neural networks. There are a lot of these abstractions. There are a lot of these billions of passes and you may have heard these numbers being thrown around, that you know so-and-so is a 40 billion model or a 60 billion model or a 100 billion model or what have you. But, in other words, what it's really saying is, in very simple words, is that it's to a point of complexity and to a point of iteration where it has become a black box, where walking back from it and trying to understand how and why it's doing what it's doing is non-trivial. It's a very complex problem.
Shashank Tiwari:So, when you look at that complex technology, which also is sort of a black box by design, I think it poses some fundamental challenges from a standpoint of "hat if I made a mistake? Right? So, for example, what if some data was leaked inadvertently out there? Right Like, not intentionally, not from a hack standpoint, but somebody - just some benign fun with the technology - ended up leading to the passing of some confidential data or some important private data onto that large black box. Can you actually extract that out and can you delete it from it? That's the next question because that's what happens in normal life. If, for some reason, you spill out something by mistake, well, you quickly try to find traces of that and you're trying to erase that and delete it; and I'm talking about more of a remediation from an inadvertent case as a starting point.
Shashank Tiwari:Now, even something like that, which is fairly straightforward in a normal case, is actually very complex in the world of AI because you would not be able to find where it has gone, where all it has mixed in, and you know what has happened with that data set that got leaked. In fact, there's an analogy that I like to draw, and I think in my head that's actually a good analogy to think of, and it comes to something like the modern AI technology is like if you took a glass of water and you walk down to the Pacific Coast and any of the lovely beaches that you know we all live close enough to, and then we take that glass of water and pour it into the ocean, and then, a few minutes later we said, "Well, I need to find the water for my glass of water in the Pacific Ocean. Right, and that's an impossible problem. Right, like there is no way for anyone to go and figure that out. It's all mixed in, it's become the Pacific Ocean right, like your glass of water has lost its identity and it's enmeshed in it and it's you know, it's all water right.
Shashank Tiwari:And so, in some sense, the problem space is like that in terms of data leakage onto AI. You know, if your data has leaked onto some sort of a public LLM model, it's very difficult to trace it; it's very difficult to retract; it's very difficult to take that out right. So, that's one humongous problem that you know we are living with now, which means we've got to find ways, we've got to find mechanisms that first we avoid at all costs - which again, by itself, is a humongous challenge because of course you could have controls around deliberate pieces potentially, but inadvertent mistakes do happen. So, even that part of the puzzle is vulnerable to an extent, and so that's one big part of the problem around privacy and confidentiality.
Shashank Tiwari:The second big part of the piece, which I think is being talked about quite a bit borders on what you would call trust. This kind of relates back to hallucination and made- up things as your reality and then made- up things get mixed with each other; and as more and more data is generated using AI, I think it raises the second-level, second-order complexity, of which one is truthful data with a source that I can attribute and go back to and make sure that it's valid and truthful and makes sense and I can rely on versus something that's derived data and perhaps even completely fictitious like the so-called "alternate facts, if you may, that have become part of the facts, right. So that's another part of the problem that becomes very complex from a privacy, identity, confidentiality, fidelity standpoint, because you might start seeing in the future people's identities or information or very critical sort of private facts being essentially mutated and mutilated, right, and then it would be very difficult to then thread it back and even reconcile and say, hey, which one is the made up, nonsensical one and which one is the actual stuff. So I think that's another big challenge as professionals, we sort of live with every day. And then, there is a third part of it, and of course I could keep rattling off many of them because there are quite a few and many more emerging.
Shashank Tiwari:But the third other big part of the puzzle, which is the newer attack vectors that become possible with these sort of GenAI-type technology, and some of it is being talked about. I think there is a lot of conversation around malware that you can quickly create. There's also talked about the old type of attacks in this new world, like the sort of cousins of SQL injection becoming prompt injection in the world of LLMs and things like those that we are certainly talking about. But, I think it also poses a bunch of other important dimensions around access control, around encrypted data sets, around how do you make sure that you keep certain things partitioned - because that's what keeps the confidentiality of that particular data set, and then of course, also evolution of that data set. So, I think there are a ton of challenges. AI is, I would say, the new piece that will keep us very busy, both from a standpoint of policy and also from a standpoint of controls and how we approach it over the next few years.
Debra J Farber:That's really interesting, and you're really highlighting a lot of the complexities of the current outstanding questions of the day, where everyone's trying to implement generative AI into their products across the world right now and there's still regulatory uncertainties, policies that need to be put in place, controls, like you said. So, I'm going to read back to you the highlights - you mentioned three separate areas of complexity that are creating challenges in this space - and maybe you could then tell us a little bit about how you would suggest that developers and architects approach these challenges. What kind of design choices and architectural choices and trade-offs there might be. Does that sound good to you?
Shashank Tiwari:Absolutely. Yeah, let's do it.
Debra J Farber:Okay, great. You talked about mechanisms to avoid at all costs and you gave some examples. What kind of, I would think here, architectural choices are you thinking about when you say you want to avoid certain mechanisms at all costs? I'm assuming here you're talking about protecting privacy and confidentiality of personal data.
Shashank Tiwari:Indeed. Yeah, yeah. Of course, this is an ideal way of looking at it, and one would have to come up with variants of it, which are a little bit more practical, because obviously people will make mistakes and things will happen. Bottom line is that I think, if you have a certain highly confidential data set, the ones that you are very sure as a privacy engineer or as a protector of those very valuable assets that we cannot afford for these to get compromised; for example, in our context, in the world that we all live in, it could be as simple as social security data or health records, or you could look at people's important financial trail data or stuff like that, which is very personal, which is very confidential and which is also very private, and then, of course, multiple other things like that. I think my take there, from an architectural standpoint for enterprises which want to take advantage of the new language models and technologies is that definitely there is one rule they should follow straight out of the gate, which is do not mix any of this extremely confidential data with public LLMs. Just keep it out. Don't even think about having a guardrail and working with it.
Shashank Tiwari:There are a lot of proponents - I'll call them the modern AI firewall for the lack of better word - essentially creating filters, creating certain kinds of brokers, if you may, in the middle. This is the technology which I would say resembles or looks like what DLP technology, and CASB technology, and perhaps even firewalls had, but of course, translated into the world of AI, so they look and sound different. But, conceptually, they're kind of similar, where the idea is that "on't worry, I'll have this super smart filter that is going to somehow figure out that there are certain data sets that are confidentiality, and I'll make sure and I'll protect you from inadvertently or otherwise leaking this over to the public LLM like Open AI, as an example, or by mistake copying that into ChatGPT or one of those types of problems that we've been talking about quite a bit. There are quite a few examples of those also. For example, I believe some developers at Samsung by mistake took their code through it in there and then it became publicly- accessible and they by mistake leaked their IP. So, things like those. I think the idea is, at least in my mind, that until this technology matures and until there is a good understanding of how we can partition it, stay away from putting any confidential data whatsoever on those public LLMs. Just, it's an absolute no- no. The same way that you would have absolute controls over confidential data today not being put in certain places or sitting out there in plain text unencrypted. So ,the same kind of policy, that same kind of rigor needs to be applied, I think, from a standpoint of making these onto the public LLMs. That's definitely one good architecture principle I would say one should start with now.
Shashank Tiwari:The second piece is that if companies and enterprises are struggling with this dichotomy of, "Hey, I want to take advantage of AI because AI can help me further my cause and, you know, help my customers better, but at the same time, I'm worried. Worried about sort of leaking it out into the internet and avoiding things at all costs. I think you could look at alternatives. You could certainly look at alternatives where you host and manage your own private LLM. There are options now. You know, even a few months back that was a difficult one and they want that many choices.
Shashank Tiwari:But today, you've got open source LLMs, you've got self hosted LLMs that, assuming you've got GPUs and other sort of underlying stack available, you could run it internally. The cloud providers are providing private partitions, where Google and AWS and Azure they all have ways where you can run your own LLM boundaries, which is partitioned off from the rest of the world. I would say those are relatively safer. They're not 100% safe. You could still run into some trouble and you know we can
Shashank Tiwari:talk about that a little bit, but you know that's certainly better than public LLM. So, that's another option if you really want to go down the path. Tread it carefully. But, architecturally look at some sort of a self- hosted, private, contained, or even if there's a third party provider, they should be able to partition that off and provide that for you. Just as a case in point, that's what we do. As you know, we work with banking customers, regulated entities, and that's an absolute piece of decision that we made. We don't send out any kind of confidential data or any kind of data that even looks confidential vaguely on to the public LLMs. We just keep out of it. So, I think that's one architecture principle. I would mention the second part. I think there is also to make sure that you can abstract out the data sets as much as possible. You know, this is not new technology; this is the old technology of where you use some sort of obfuscation techniques and privacy, or you know you sort of de- identified or anonymize those kind of data sets and then use them for something - for analytics, for example, which is what a lot of companies do today for data analytics. I would say use the same kind of a puzzle. You of course, might have the challenge of losing some of the meaning in language models because you may have abstracted out to a point where it may not 100% make sense to the language models, but nonetheless, I think that's a safer route. Definitely have that as the first- level filter of anonymizing, taking out the identities, and then obfuscating it as much as possible. Then, maybe run abstract versions of it so that even if some of that gets leaked out by mistake or you end up seeing that out in public, nobody can really relate it back or abuse that potentially private data or the source of private data. So, I think those are some of the pieces architecturally I think where we need to start.
Shashank Tiwari:I would also caution a little bit to the folks: don't over index on using access controllers, some of these modern AI filters or AI firewalls type of technology because they're all very new and it's a funny sort of amplification problem. LLMs and the technology around it, not just the LLMs, but you could apply the same thing to vector databases or other kinds of stack elements within that world are all evolving rapidly.
Shashank Tiwari:So, what we see of them, let's say, last month, and what we see of them today, they're fairly different. Lot more features and therefore with it lot more threat vectors have been added and you know it will continue to happen. So it's a sort of the classic catching up game that I think many of these newer vendors will also play for the foreseeable future. And so, I feel like, if you're serious and you've got confidential data, you should, you know, essentially take the path of caution and tread it a little carefully, and not get over confident that "ey, there's a particular technology that seems to have solved this and over index on it. That would be my take here from an architectural principle standpoint.
Debra J Farber:It's good advice that if something is new technology, especially when it comes to AI, that you really do want it to be tested and battle tested, I guess, before you trust it within your own organization. The next area that you mentioned that has a lot of complexity was trust. What advice would you give, what design choices, what architectural approaches would you recommend so that we can better verify, assure, and test against trustful data? How do we go about making data trustful?
Shashank Tiwari:It's a complex one. It's a difficult problem. If you look at the broader space of trust, I think that's only become an amplified, complex problem across board. Even before AI was becoming so much of an important part of our society. We saw this conversation even sort of pop up in the world of social media, and you know all the content generated today where we don't know what is being said or what is being put out there is even real. So, there's a lot of conversation around it. Of course, AI is generating things, content even in the enterprise space and in the sort of critical application spaces, which is sort of making it worse. I think it all starts with, architecturally at least, two things that I think you could start thinking about. Again, much of this is going to evolve, but these, I think, at least in my opinion, are the right ways to start looking at it. One is the parallel, or the good old way of verifying any content where you go back to the source. Right? You relate it back to the source and say, "well, let me make sure, firstly, that it relates back to something which is, you know, a trustable source. In a way, you are applying the transitive relationship that if the source is trustable, the content they have generated is also trustworthy. Right? So, I think there's a little bit of that where you could go back to the source. Now, going back to the source and the world of AI is non-trivial. You have a content out there on the internet and if you cite the source of where that is from, you could actually very quickly walk back and check that. In the case of AI, because it's generated and meshed with some data that comes from a specific source and some that is completely new, done by the AI engine, by Gen AI, it gets very difficult to actually have that very clean traceability, very clean provenance to the source. So, this problem is being looked at. Again, both industry and research is looking at it and it's the whole space of what is called "Explainable AI.
Shashank Tiwari:This also coming up because there is a ton of debate in society around fairness and bias in the world of AI, and there have been cases where the automated intelligence systems powered by AI have made decisions that have baffled people. For example, I think there was one case, if I have my memory serves me right, where husband and wife applied for the same card to a particular bank. One was approved, one was rejected, and then later it was kind of traced back to say it was probably a bias against women. There was another one where I think Google was doing something in imagine. It started showing certain kinds of images for certain racial profiles, which was just essentially not only inappropriate, but at the same time, the technology became untrustworthy. So, there is a lot of that, I would say, more alarming examples that are coming up. And so, people are talking about how do we go back and measure the fairness? How do we go back and measure the biases, which again relates back to the same problem: the fidelity of the information. What's the source? What caused it to generate what it generated? So, certainly, I think, keep your eyes out for this explainable AI bit. There are a few open source and a few other frameworks that are emerging. I'm pretty confident that we'll see a lot more companies emerge in this space over a period of time, and then, of course, existing AI providers, vendors, platforms, will also want to bake this into their mix. So that's one way sort of looking at trust.
Shashank Tiwari:The second way of looking at trust - which is a tricky one, but still nonetheless an important one and certainly can come handy sometimes - is looking at the ramifications of that information. So, I would give an example which is more close home. Say, if you just went today to the language model and you went and asked a simple question, say, "Hey, which is the most critical vulnerability that can be exploited between so and so date?" the language model, in most cases, most language models will come back and give you an answer very confidently. They'll say, yeah, it is this CVE. Blah, blah, blah, it was exploited here. You know, here's the fix, this is what you should do. And then, when you look at it, you will feel like, yeah, this is very actionable and this is so smart, this is so great.
Shashank Tiwari:The reality, though, is often, when you dig a little deeper and double click into it, you'll realize that, "actually, there's a lot of inaccuracy in that information. For example, I saw once when I was playing with it, that one of these LLMs claimed was the most critical vulnerability was not even a critical severity vulnerability. It was actually a medium severity and there was no fix available for it. So, you will see all these kinds of things kind of emerge that will sound very confident, but you know they aren't. So thereby, my sort of way of approaching that is that I think, as humans, we need to rely on human judgment and reinforcement loops. We need to have a second dataset if you may have some form, which is where, for example, in our world, we try to supplement that with Knowledge Graph and other things.
Shashank Tiwari:But, there are other ways also of supplementing it.
Shashank Tiwari:There's a whole world today that is called Retrieval, Augmented, or Grounded Generation type of technique where the idea is I look at a second source; I look at a third source; and I try to make sure that those two / three sources agree with each other - that there's some sort of a consensus among them and they're saying the same thing.
Shashank Tiwari:So, I think that might be helpful as well, like that's the second approach that I think most architectural decision makers should certainly consider, where they can look at alternative sources and then make sure that all these two / three different ways lead to the same path or at least an adjacent path.
Shashank Tiwari:Having said that, I do want to say again something looking out a little bit, sort of a crystal ball - this might get more difficult over a period of time because as we'll have a larger proportion of generated data as opposed to real organic data; and I believe there are some research and some numbers out there which is saying that it's going to outnumber by a few multiples. So, 5 years, 10 years, fast forward, we might have actually 5 times or 10 times more generated data than real data; and so, in that case, I think reconciling and using some of these techniques is going to become even more challenging because there's just so much generated data. Even the tracing back of the source of a generated data might lead to data, generated data and so I think walking back, that would become an extremely difficult problem, down to the provenance of it.
Debra J Farber:Yes, in fact, last week's episode actually deals a lot with that exact topic and we cover synthetic data and some of the challenges of training LLMs on that synthetic data for the exact challenges that you were highlighting there. So, if people want to learn more, maybe check out that episode. The last area of complexity that you would originally mentioned was that there are Generative AI, a new technology being deployed, that there are now new attack vectors. So, maybe you could talk a little bit more in depthly there about a few of those vectors. You mentioned prompt injections with LLMs and access controls around encrypted data sets and some others. Whatever comes to your mind, what are some of the design decisions and architectural decisions that you would want to highlight in order to prevent compromise of data through these attack vectors?
Shashank Tiwari:Yeah, yeah, absolutely. I think if you look at attack vectors, as I was saying earlier, I think we are still in the early innings of that. So, you're going to see a lot of evolution of that attacks - attack vectors and attack types - similar to how we've seen in almost every other technology stack; and I would draw a parallel, like, if you went back 2009 or 2010 timeframe, there were certain ways of attacking the cloud or certain ways that you would be worried about protecting the cloud. Fast forward, you know, 10, 12 years or so, it's a very wide variety of attacks that are possible on the cloud, and that has, in fact, also led to the emergence of a lot of different categories and ways of defending and ways of attacking and so on and so forth. So, I think AI will go through that evolution as well, and perhaps even more, because it's a whole completely new, unknown space within the attack vectors. Firstly, just from an attack vector standpoint, there are, in my mind, loosely put, two classes of attack vectors. One is what I would like to call the garden variety type, the attack vectors that we are familiar with. Yes, they're dangerous, but they're preventable. We understand them. Among those, I would put things like all these newer malware that are being generated. GenAI is kind of doing the work that otherwise hackers would be doing. Sure enough, it's making the hackers more productive so they can generate more malware at a faster rate. But, at the end of the day, it's malware. And, at the end of the day, you would do the same things you would do to prevent malware? So, there is that, the way to approach it.
Shashank Tiwari:Similarly, if you start looking at prompt injection, prompt injection is a serious problem because we're still trying to unearth the power of the prompt, if you may, which is nothing but basically inputs of the number of text or tokens that we send into a particular language model. And then, of course, the basis that the language model comes back and, you know, does wonderful things. And so, if you send in data that can make it behave in funny ways or do things that you don't want it to do, well, you can cut up the system and you can take it to your advantage, similar to how you could do things like protocol stuffing or SQL injection, where the intent wasn't that of the maker, but you know a hacker could abuse that and take advantage of it. So, we have the same kind of a challenge there - prompt injection as well - and there are a few more like that around access control. Then, of course, we already discussed the data leak kind of problems that can happen.
Shashank Tiwari:So, for these - I'm not trying to trivialize anything, but I would still like to call them - "the garden variety problems in security and privacy, because these are known, we have understood it as professionals, there are tools and checks and balances, at least at a broader level, that are available, and we'll adapt them and create new ones for AI. So, there is that segment. As with every other privacy and security concern, I think it starts with basic hygiene. So, make sure that you have the right controls; make sure you validate things; make sure you have the right protections, as you would do with a normal case with malware or any injection technique. And, for the most part, I think you would start heading in the right direction; but nonetheless, I think you have to very deliberately think about those because you'll have to firstly know about them right before you can start protecting. Now out of these attack vectors, there's also something to be kept in mind that you will see a lot more of these emerge.
Shashank Tiwari:Like I was saying earlier, what we are seeing is probably a small subset of what we will see, even six months out or one year out or two years out. So, keep an eye out for it. The most important part here would be to stay educated, to understand where this technology is headed. What kind of attacks are we seeing, learning from it and, kind of adding controls to it. Now, in this, some of these attack vectors, there is also an element of maturity of the technology or the controls that we have. So, for example, one classic problem is a very simple, good, old problem, but access control: fine-grained access control and vector databases. That is still evolving. It doesn't exist the way that it's there in mature databases. So, yes, it could also be abused and you could get hold of certain parts of these vector sets if you may. That, in turn, could then be abused to do various kinds of attacks, just as an example; and so, in that case you will also see those vendors step up and start adding those so-called enterprise-grade features, which are more baked, hardened things that match the expectations of the fine-grained access control and confidentiality, and protections that we sort of grown up with now with more standard technology. So that's one type of attacks.
Shashank Tiwari:And then the second type of attacks, which is more dangerous, and I think that these could lead to some serious problems, especially because, if you start looking out - although we have been talking about this in science fiction, I don't think we have really internalized it as society - is that the future is going to have a lot of intelligent systems around us and we will start relying heavily on it. From the car we drive, to home appliances, to machinery, to even heavy industrial machinery, to the flights we take, to systems that reconcile our finances, to things that protect us, to things that purify our water for the city, to our lighting system, all of these are going to have an element of AI in them. They'll all be run by, or decided by, systems that are essentially some versions, some evolved version of AI that we are already down the path on. So, in that world where you are seeing this AI becoming sort of part of society. It becomes sort of an adjacent robot next to us, and many of them will be in the sense that
Shashank Tiwari:You know the Terminator style, if you may, but the reality is there'll be fewer of those, but there'll be the invisible type of robots - smart robots that will affect our lives every day.
Shashank Tiwari:For example, the switch that would be figuring out how much a filtration should happen with the water that comes in our taps would probably be governed by AI, and we already headed in that direction. So, it's not the same as a humanoid robot, but it's doing some very critical pieces, because poisoning of water in a city could be a war-skill problem, and so things like those, I think, are going to be a big problem.
Shashank Tiwari:Things like those is where the attack vectors get even more interesting; and I think the attack vectors around that could be in the form of model poisoning or adversaries really bringing in data sets that very slowly seeps into and kind of pollutes the whole decision-making matrix. So now, you're looking at these drip feed model attacks, which are difficult to detect, difficult to trace, and then difficult to unwind once they're baked in into it. Some of this has been discussed in this whole adversarial machine learning concept a few years back. But, I feel the real attacks are going to be way more sophisticated, way more complex than even some of the initial theoretical modeling that many of us have indulged in. So I think, like that attack vector is serious, it's very big and you know, unfortunately it's coming and I think we've got to start getting ready for it.
Debra J Farber:Oh, wow, okay.
Debra J Farber:Well, that's definitely a nice, scary exploration of all the security problems with untested AI systems that are going online. So, that actually brings up my next question, which is where do you see all this going? I know you're looking at it from more of a potential threat perspective, but if we're looking at enterprises specifically, how are they currently approaching at least generative AI? Obviously, AI has been around for a long time, so I just want to focus on generative AI for right now. It looks like everybody and their mother is coming out with the ability to integrate generative AI into a lot of services to help with creating chatbots that make it easier for humans to have more of a conversation level kind of interface with their computers.
Debra J Farber:A lot of those companies seem to be startups and have the ability to be nimble and make some bold choices, and if they fail, they fail; if they win, they could really win. But, what about enterprises that are a lot more conservative about bringing in innovative new technology without having rigorous testing and validation and assurances? Where do you see enterprises right now and where do you see them going in like two years? Five years?
Shashank Tiwari:Well, firstly, the excitement around Gen AI is very high, and I think that is a cross board. You see large enterprises, small startups, and everybody else in the middle is talking about it, wanting to explore it, wanting to get their hands on it. I think, like most technology over the last couple of decades, it's probably being driven from people's experiences in the consumer world. So, just like we saw with mobile phones as an example, first people started using that for their personal and fun things and then eventually creeped it into the enterprise; and then, in many cases, it became the primary way to access certain kinds of things. So, similarly, I think a lot of folks working across larger traditional companies, mid-sized companies, traditional sectors, technology-centric sectors, government, and then the ones in the startup world, are getting their hands on it, or getting exposed to it, through the sort of consumer-centric conversations, starting, of course - I have to give credit, especially from a marketing and messaging standpoint to ChatGPT because it really became very commonly known. So, I think that, with that little taste in their mouth, I feel like the world is hungry for it, and I see this in conversations as well. We as a company engage with a lot of large banks, global 1000 companies, traditional sector, even government sector, and what we are seeing is that there's definitely a very active desire and an active intent to go figure out where they could use Gen AI. To an extent, I think many leaders are also feeling like the time is now and if they were too slow and if somehow they did not have this is a part of their deliberate strategy over the next few years, they would probably be left behind.
Shashank Tiwari:So there's definitely a lot of energy around Gen AI now. Whether that'll translate into enterprise applications wholesale migrating over to Gen AI - just like you could argue early 2000s where all traditional client server and desktop apps eventually made their way into web apps and people stopped building desktop apps - in that big transition will be seen over the next two years, five years, seven years, more and more newer applications, or rather every newer application in enterprise and every existing application becoming essentially an AI- powered application. My hunch is likely "yes. Now, what shape and form it would come in, how fast it would come. I think we will see that. You know that time will tell us, but certainly the wind is blowing in that direction, and everybody wants to sail in it, right. So I think, like that's definitely very clear, very apparent.
Shashank Tiwari:Now, what I would mention, though, is that I think, when it comes to Gen AI, the world is not evenly distributed, like there are certain sectors that are both super excited and also very scared about it. For example, those who are writers and creatives, they love the energy that Gen AI is bringing into the mix and the companies around it, but they're also feeling the competitive pressure from that technology. So, there is a little bit of conflict going on in that space, even enterprises in that space. We might end up seeing things like editorial businesses, or publications, or media production, filmmaking - those guys will really feel the heat one way or the other, or they would just take that technology and unleash some newer kinds of creativity. So, both could happen. They're certainly watching it closely.
Shashank Tiwari:In the traditional sector, I think there are some areas which is seeing a lot more traction. For example, "hey, can I get a smart assistant to? You know, give me good financial advice? Or, you know, can I, for example, in the world of cyber and privacy and security, where one suffers from a lot of skill gap. Well, "can I bridge that skill gap by getting these smarts on my side? Or those companies that are looking at very large amounts of data sets, "an I use AI to kind of munch through it and make sense of it at a very fast pace? So on so forth. There's definitely a little bit more, I would say, domain specific use cases that has got people more hooked onto it as opposed to every sector, right so there's certainly a little bit of a you know, hotspotting, if you may, or you know areas where AI or Gen AI is certainly seeing more love than broadly, but I think we'll see it. We'll see it permeate and go across sectors over the next few years.
Shashank Tiwari:Now there's one part, though, that I wanna mention that there's also a ton of hype around Gen AI, right?
Shashank Tiwari:So, as it happens with most technology, I do believe that it is very promising, but at this point in time in history, it's also a little oversold, and that sometimes creates its own problems where a lot of the believers have basically come out with very strong projections, very strong sort of their own crystal ball gazing and foresight and said, "eah, this is gonna transform over the next 24 months or, you know, next five years, and this is what the world is gonna look like, and, you know, throw numbers around it and gotten the whole world excited, which is cool, which is great;
Shashank Tiwari:but I do think that they would go through its own trough of discontentment and illusion, if you may, where companies might not see the value realization as fast as it's promised, and then you might see some pullback. So, I think that will be part of this adoption curve, where you will see a lot of people step up very quickly, build a lot of Gen AI powered applications, and then a few of them get suspicious of the value it actually brings to the table and then kind of roll back a little bit. Then, there's gonna be a second version or second avatar of this Gen AI movement, which will eventually then get everyone towards that sort of overall success. So, that's at least the way I kind of perceive this.
Debra J Farber:So, that's a good jumping off point, I guess, before we close, to ask you about your feelings on regulation then, because obviously there's gonna be rounds of attempts at regulating AI. I also wanna throw out there that there are plenty of laws on the books that apply it AI today. It's just not specifically about AI, but it's not like there's no laws. There are plenty of laws, but around maybe getting more transparency, and fairness and getting rid of bias, and you know we're starting to see this out of Europe, and rumblings of legislation being proposed in the U. S. But, do you think that's going to be pushing the second hype wave or the second wave of generative AI companies as we move on - that that'll be influenced by regulatory requirements?
Shashank Tiwari:100%.
Shashank Tiwari:I think regulation is gonna play a much bigger part than we think in the world of AI, and I think you said it very well, Debra. Regulation is already here. It's just that it's not specifically worded for Gen AI, but many existing regulations apply to it as much as it applies to a ton of other technologies out there. So, I think there is going to be some effort across board, across enterprises, across the legal sector, across the regulators themselves, of trying to firstly translate that to the world of AI. Because some things have been drafted in the world of traditional, more legacy systems and you know just like they were reworded and redrafted in the world of the modern internet and the mobile phone era.
Shashank Tiwari:They will need to be going through a second iteration, if you may, in the world of AI? So, certainly there is that - existing regulations sort of modifying for Gen AI. Then, of course, on top of it, and especially as it relates to privacy and confidentiality (which has been discussed quite a bit), fairness, you know things of that nature, we are going to see a mass of newer regulations as well. Some addendums to what we have and some completely new, fresh ones. EU, of course, is two steps ahead in regulation, and in fact there's a little joke in the technology community that I think they have learned to innovate in regulation more than anything else, in EU.
Shashank Tiwari:But, there is a little bit of that. They put out the EU AI law, or something, I believe, which is already out there I was, you know, sifting through it; and they've already started putting some guidelines, which I think is great because the discussion started. It's also a bit dangerous because it's early. And then, of course, we have our own Congress discussing some of this very actively right, where they're having hearings; they're getting imports; they're trying to understand; they're setting up committees to see where AI is headed, what are the risks, what kind of regulations are required. I think there's going to be a culmination of that. It's all going to come together. My bet is 2024 or 25, you'll see some big AI regulations come. The reason I also say 2024- 25 is because, as it happens, we'll go through our election cycle and then, as the new Congress gets formed (either the same one or some new version of it) they will then want to go and put certain newer laws together. I have a feeling that AI regulation is also going to be part of that puzzle. But, we will end up seeing 2025 for sure - some big AI regulations coming in some form because it's so much in the conversation. I don't think they can just slip it under the rug or ignore it. So, I think that's definitely coming. Now, to your second part of that
Shashank Tiwari:I think it'll have ramifications on a couple of different fronts. One is that you know it will lead to maturity of the technology and maturity of tools and products in this space, because they will be pressured to, they'll be forced to rise up to the occasion. And then also, I think it will drive awareness because once people start spending time, energy, money on anything, they become more aware. Right? I think that regulation will be the forcing function where you will start having conversations in boardrooms and at the operational level, and people start figuring out how it translates Similar to, in a very orthogonal way,
Shashank Tiwari:I'll take an example. The SEC came out with a very simple ruling of "hey, if you have a breach, you've got a report within four days or so, right? Like if it's a material incident, as they have put it out there. And now suddenly everyone's talking about incidents, you know, and reporting, and MTTRs (i. e., mean time to repair) and what have you. It's a conversation that wasn't having as much in the boardrooms and with executives, but these days I speak to every CISO, and it's on their mind. Like everyone is talking about it. So, same way, I think once the AI regulation comes in, you will see everyone discussing and figuring out the ramifications of it. For sure.
Debra J Farber:Yeah, that makes a lot of sense. I know we're already over time, but I feel like there's a tangential question there about standards. Legislation is gonna be carving out some of the things you can't do or some of the things you must do, but it doesn't give you the pathway of getting there. And it seems to me it makes good sense, if you know regulations coming, that there should be work done on different standards for generative AI tech stacks, I guess. Are you aware of anything that's being worked on right now? Is that something that experts in this space should have already started on and they're a little behind? Or, are we right on time and there's work being done that you could report on for us? What are your thoughts on standards?
Shashank Tiwari:Yeah, absolutely. I think there is already a bunch of that work underway and, as I see it, the standards are coming from two different directions. One, the open source community and the sort of community- led standards, which are being put out there. Y ou're seeing that come through from various directions. One is, of course, from a security threat standpoint, or just identifying good best practices. More recently, for example, OWASP has put out their own list for LLMs. Some of these bodies which think on the same lines, and NIST and others also kind of formulating some. So, you start seeing some of those standards come more organically. Definitely, I think that's happening and some of them will get formalized and be fairly widely adopted over the next few months, I would think. So I think that's definitely happening.
Shashank Tiwari:Our second piece that I also see coming - I haven't seen any tangible discussion on that yet, but I'm 100% sure that it's there somewhere behind the scenes - some work is going on where you'll start seeing some of this creeping into the bigger standards. So, like the NIST standards itself. The NIST series of standards that are very important for enterprises, or certain reporting standards kind of pushed by the SEC, or you know, you'll start seeing, even in terms of compliance frameworks or regulatory frameworks or best practices. Again, if a large number of systems are GenAI- powered systems, well they couldn't be just flying blind in that space. They would have to have say something about it. So, I think you would start seeing those creeping in, perhaps as addendums, perhaps in the next revisions you would start seeing some of that flow in for sure.
Debra J Farber:Well, thank you so much. Any closing remarks before we close up for today?
Shashank Tiwari:Well, thank you for having me. I really enjoyed the conversation. Hope you did and hope the audience likes it, too. It's a new space, so I think the only part I would say before we close is that with any new thing, there is obviously risk and danger, but there's also fun and opportunity. Being an entrepreneur, I'm always biased on fun and opportunity. So, I would say, go for it. You know, go learn more. Enjoy the GenAI as much as you stay cautious and stay aware.
Debra J Farber:I think that's really good advice to give a nice balance between skepticism, between being paranoid and being innovative. I think that's a great way to close. So, Shashank, thank you so much for joining us today on The Shifting Privacy Left Podcast. Until next Tuesday, everyone, when we'll be back with engaging content and another great guest. Thanks for joining us this week on Shifting Privacy Left. Make sure to visit our website, shiftingprivacyleft. com, where you can subscribe to updates so you'll never miss a show. While you're at it, if you found this episode valuable, go ahead and share it with a friend. And, if you're an engineer who cares passionately about privacy, check out Privado: the developer-friendly privacy platform and sponsor of this show. To learn more, go to privado. ai. Be sure to tune in next Tuesday for a new episode. Bye for now.