How do you mitigate data security and privacy issues in this new AI world?
That was the topic of a discussion hosted by Fasoo at the 2023 Gartner Security and Risk Management Summit.
Fasoo CTO and COO Ron Arden spoke with Tad Mielnicki, Co-Founder and COO of Overwatch Data, and Jamie Holcombe, CIO of the US Patent and Trademark Office, about the challenges of using generative AI responsibly.
In Part 1 of this discussion, A Conversation on Risks of Using Generative AI and How to Mitigate Them, we focused on the potential risks of using generative AI and some measures to mitigate them.
In Part 2 of this conversation, the group focused on advice for organizations as they try to navigate and manage the risks of using AI while enjoying the benefits.
Ron Arden: We’ve been hinting at some mitigating strategies. I was thinking there are certain companies, and even countries, that started talking about blocking access to these tools. And to me, it’s too late, since the genie’s out of the bottle. So, what do you think about these policies? And how would we go about really mitigating or alleviating some of the risks that people are trying to guard against?
Jamie Holcombe: Well, as I’ve said, I’ve put a prohibition on its use for our examiners, but at the same time, I want to encourage people to use it for public information. If you’re trying to find public information, searching a public database for something that might be prior art, I think that’s a great idea, so that you don’t have to spend a lot of money creating your application and you know whether something’s already out there or not.
The other thing is just to get an idea. I don’t know if a lot of people have tried it, but you should. I put a little thing in there about my workout routine and what diet I needed, and then what ingredients I needed to buy at the grocery store. It’s really good for stuff like that.
But when you’re talking about more serious things, where you have to worry about the privacy of individuals, that’s where the governance needs to be. As you said before about the legislative side, lawmakers always seem to be behind in intellectual property law, but things are set by precedent and by case law, as you were saying. So, one of the things about these generative models is they’re 17 miles wide and two inches deep.
It’s just Wikipedia on steroids, right? I mean, you’ve got to check your sources, and if your sources don’t hold up, i.e. in the FBI case and so forth, it’s really bad.
Tad Mielnicki: Well, the LLM is only as good as the data that it’s trained on.
Google recently released Bard, and the difference there is it’s connected to the Internet. So, you’re getting not only training data, but you’re also accessing the open Web, which is super interesting. ChatGPT, I think, has an integration for that as well.
You can deny access in closed environments like your office, but you can’t deny access when people leave the office. Work isn’t confined to your physical location anymore. Work travels with you. I think that’s going to be a hard policy to enforce because people will go home. People will take work home. People will do certain things off their company and government networks.
And so, controlling this or trying to keep it in a box is impossible. So, what do you do?
You teach people what these actually are, what the good uses are, versus what the bad uses are.
But inside a company environment, I think it comes down to data governance. If you are properly segmenting your data and governing data access, that’s what counts. What are PII, PFI, and PHI? What is critical? What is proprietary? If you’re segmenting your data environments properly, then you can launch these LLMs in safer parts of your network to give your employees the opportunity to start utilizing them, start seeing where the benefits are, and do some discovery on their own without compromising your company.
It’s probably harder for the government to do that, but it’s probably not far off to think that the Patent Office will have an LLM to allow examiners to go through every patent in the history of creation to do some of their research.
So, it’s coming, but we can’t put it back in the bottle. We have to approach it with a great deal of respect, and that may mean slowing down a little bit, which is hard for me to say considering we’re utilizing this in my company every day. I’m a hypocrite.
Ron Arden: I was going to ask Jamie; do you see this as becoming more like what you’re doing in the Patent & Trademark Office? Do you see that becoming more common in the government writ large? That people are just going to block things until they figure out the right guardrails?
Jamie Holcombe: I think it’s both. You need to block and you need to encourage. We already have a head start. All this stuff just happened since March, but we’ve been doing it in earnest for two and a half years now. So, to your point, we do have the search algorithm for the examiners, but it’s contained within, and we’re learning about generative AI versus other AI.
What does generative mean? It generates. Well, so does everything else. Why is that word being used? In my definition, I believe that they mean generative in the sense that it’s unsupervised. And when you have an unsupervised algorithm that runs forever and then spits something out, I don’t think the quality is very good.
So, what we’ve done is we’ve actually put supervision in the examiner search such that they’re in the loop, on the loop, and over the loop so that they pass the red face test. And that’s saying, is this really the result? Is it really up or down? What’s the relative ranking in this? And without that type of supervision, you just get weird results.
So, I think generative AI, if it’s unsupervised, is really bad quality.
Tad Mielnicki: I like the word generative because it confuses people, and then they don’t have to think about all of this technology as Skynet from the Terminator coming, because we’re not close to that.
Ron Arden: Yeah, I wanted to piggyback on something you had mentioned, Tad. We were chatting earlier about this governance idea. Say you have documents, a lot of unstructured data inside your organization, which we all have; it’s everywhere. And it’s got PHI and PII and even proprietary trade secrets. If you put a governance layer on it, you encrypt it, you control access, and you let your private LLM go at it. Well, if it can’t read that data, then by default, you’ve prevented your environment from ingesting sensitive data. So, it sounds like we could do the same thing with a public environment. If I prevent a user from copying and pasting something sensitive out of a document, so it can’t go into ChatGPT, then I’ve put the governance right there.
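The governance gate Ron describes, checking content for sensitive data before it can leave the protected environment, can be sketched in a few lines. This is a minimal, hypothetical illustration: the pattern list, labels, and function names are assumptions for the example, not any specific product’s rules, and a real deployment would rely on classification and encryption at the data layer rather than regex alone.

```python
import re

# Illustrative sensitive-data patterns (a real system would use far
# richer classification: labels, encryption state, document metadata).
SENSITIVE_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def classify(text: str) -> set:
    """Return the set of sensitive-data labels found in the text."""
    return {label for label, pat in SENSITIVE_PATTERNS.items() if pat.search(text)}

def gate_prompt(text: str) -> str:
    """Refuse to release text (e.g., into a public LLM prompt) if it
    carries any sensitive label; otherwise pass it through unchanged."""
    labels = classify(text)
    if labels:
        raise PermissionError(f"Blocked: prompt contains {sorted(labels)}")
    return text
```

The point of the sketch is where the check sits: at the boundary between governed data and the outside tool, so blocked content never reaches the prompt at all.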
Tad Mielnicki: Before I founded Overwatch, I helped build a company, or the security arm of a company called Egnyte, which was a data security and governance platform. And for years, people in the security and governance world have said, hey, watch what you’re putting out into the open world. Watch what you’re putting out about yourself into the open web.
Well, now we’ve given everybody this amazing opportunity to query with some degree of context all of the data that these LLMs can touch. So, one of my hopes is that as people start to utilize these more, they maybe put less information about themselves so freely out there, right? Because now it’s a lot easier to pull that information down and to draw some level of connection because that data science is being augmented to a certain degree. But I do think it all goes down to data governance.
If you want to secure your data, you govern it properly because data is a living and breathing thing. It moves, and it travels.
If you’re not encrypting it and governing it at the data layer, then you’re in trouble already.
And now, with these language models, you’ll probably be a little bit deeper in the muck.
Jamie Holcombe: How about this challenge? I challenge the industry, because just like we had a revolution when we went to TCP/IP, we broadcast our packets, right? And, oh my God, that was unheard of; you had to have circuits. And yet, okay, the world now runs on Internet routing, right? Well, we’ve also moved into relational databases and data governance and so forth.
If we could have a time to live for packets on a network, why can’t we have a time to live for elements in a data store? And I don’t mean the archives. Of course, you need to keep data, but in your transactional logs, transactional databases, things that are not meant to be kept, why don’t we have a time to live on those data field elements? And if we could create standards around that, a lot of data governance would be taken care of.
So, my challenge to the industry is actually to do that. It’s not an original thought; there’s a lot of R&D going on about this, but I’d like to get the message out that I think it’s a great idea.
Ron Arden: It’s interesting because I was thinking one of the things we were talking about is part of the governance. If you’re looking at a document, and we can do this with our technologies, I can essentially put a Time to Live on it. I could put a validity period on it. So, adding that into a data store would obviously be the next logical step. Yeah, that’s cool.
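Holcombe’s challenge, a time to live on individual elements in a data store, analogous to the TTL field on IP packets, can be sketched with a tiny in-memory store. This is a hypothetical illustration: the class and method names are assumptions made up for the example, and production systems would enforce expiry in the storage engine itself (the way key-value stores like Redis do with per-key expiry).

```python
import time

class ExpiringStore:
    """Toy key-value store where each element can carry its own TTL."""

    def __init__(self):
        self._data = {}  # key -> (value, expiry timestamp or None)

    def put(self, key, value, ttl_seconds=None):
        """Store a value; with ttl_seconds set, it expires after that long."""
        expiry = time.monotonic() + ttl_seconds if ttl_seconds is not None else None
        self._data[key] = (value, expiry)

    def get(self, key, default=None):
        """Return the value, purging it silently if its TTL has passed."""
        item = self._data.get(key)
        if item is None:
            return default
        value, expiry = item
        if expiry is not None and time.monotonic() >= expiry:
            del self._data[key]  # expired: remove rather than return
            return default
        return value
```

Transactional scratch data would be written with a short TTL, while archival records would be written with none, which is exactly the distinction Holcombe draws between transactional logs and archives.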
Do the scenarios mentioned in this conversation sound familiar? Most organizations are struggling with the same questions about generative AI.
Join us for part 3 of this conversation soon.
The transcript of this conversation has been shortened and edited for clarity and the blog format.