Founders in Arms Podcast

Distinguishing AI Deepfakes From Reality with Ben Colman, Co-Founder and CEO of Reality Defender

Immad and Raj sit down with Ben Colman to discuss the threat posed by deepfakes and how we can distinguish between AI and reality.

Discover innovative insights and highlights from the Curiosity Podcast by following us on TikTok.

You can also listen and subscribe to Curiosity on YouTube, Spotify, and Apple Podcasts.

Transcription of our conversation with Ben Colman, Co-Founder and CEO of Reality Defender

Immad Akhund: Welcome to the Curiosity Podcast, where we go deep on a wide variety of technical topics with the smartest leaders in the world. I'm Immad Akhund, CEO and co-founder of Mercury.

Rajat Suri: And I'm Raj Suri, co-founder of Lima, Presto, and Lyft. And today we are talking to Ben Colman, who is the co-founder and CEO of Reality Defender. This is a company that detects AI photos, videos, text, and other types of deepfakes, and tells a variety of different users what the probability is of that type of media being AI generated. Very interesting, very relevant topic to today's AI revolution. Immad, what are you curious to talk to Ben about?

Immad Akhund: You know, these examples of deepfakes are getting so much easier to produce. Gabor, who's a friend of ours, recently made a deepfake version of himself using just off-the-shelf internet software and recorded something with it. I actually was just thinking how me and you produce all of this podcast content. It's probably going to be very easy to fake us, right? It's an obvious worry and I just don't know, what is the world gonna be like in five years time where all content is trivial to fake, you know? Ben really shows us one potential answer to how we can live in that world.

Rajat Suri: Yeah, I think there's a lot of questions about that. You just opened up a huge topic, but what's the future of podcasts in a world where AI-generated content becomes really easy, just all types of content. But I think with Ben, we're focusing really on cybersecurity-type topics and things like voice identification for banks and other types of security and verification tools. Are they going to be as relevant? So they're at the forefront of this wave of AI-generated content. So yeah, excited to talk to him and see what Reality Defender has to offer in this space and learn more about the technology underlying it too.

Immad Akhund: Welcome, Ben. Thanks for coming.

Ben Colman: Thank you guys for having me.

Immad Akhund: Yeah, let's start off with Reality Defender. What inspired you to work on this idea?

Ben Colman: You know, I've been obsessed with the intersection of cybersecurity and data science my whole life, so it's really been a natural evolution of my education and my passions in this space. I've worked on startups in the space, I've worked for big companies. I interned at Google during grad school. I worked at Goldman Sachs in cybersecurity and cybersecurity strategy. It's really all I know and I love it.

Immad Akhund: Do you want to give the audience a little bit of an idea of what Reality Defender does and where it's going?

Ben Colman: Reality Defender is a deepfake and generative AI detection platform focused on detecting AI-generated or manipulated audio, video, images, and text. At a high level, we focus on inference, which is probabilistic: looking for different types of anomalies and features we can extract, whether from the pixel layer of an image or from the spectrogram, which is effectively the audio waveform, and doing it without any kind of ground truth. So, compared to other solutions that focus on watermarking and provenance, we assume we will never have the ground truth. We don't need or use any PII, and we find a probabilistic answer so that folks, whether they're a mid-level risk analyst or a senior AI practitioner, can make immediate decisions on whether a voice or a face is indeed real or AI-generated.

Immad Akhund: You know, I've seen some crazy stuff recently. There's obviously like some of the social posts where, I think there was one recently where they made it sound like it was Zuckerberg and he was announcing something and it turned out to be fake. But it looked very real. I mean, Zuckerberg sounds a bit robotic anyway, so maybe that made it easier to fake. But I guess when you think about categories of things that are faked and what are the incentives for people faking it, like, obviously, this was more of a meme thing, but there's a lot more kind of dangerous usage of this, right? What are like the main categories that you've seen and you think about?

Ben Colman: Yeah, let me first say that at Reality Defender and also personally, we believe that the vast majority of AI use cases are going to solve really hard challenges for humanity. And also within generative AI, it's going to solve a lot of issues and really support every level of job and also the entire human race. But with regard to AI used for different types of fraud or misinformation, it's really only limited by your own imagination because the tools are so available. There's thousands of tools that are just a Google search away and no real checks and balances on who or what they're being used for. It means that you could create something really entertaining or really embarrassing on people you work with. Women are hugely affected by this, as well as people in office. Elections can be changed. We saw a number of elections in Europe where literally at the last minute people's voices were deepfaked and it skewed the election just because of that.

Immad Akhund: There was an election that was influenced already in Europe? I actually didn't hear about this at all. Which election and what was the significant change? I mean, I guess it's hard to prove.

Ben Colman: In the Slovakian election, there was a deepfake audio clip which appeared to make one of the candidates say that they're going to increase taxes on beer.

Immad Akhund: Oh no, don't touch the beer.

Ben Colman: Yeah, yeah, it went viral and it legitimately affected the results of the election. And, you know, obviously we saw the deepfake of President Biden just a few months ago, and what was interesting on that was it wasn't even that great of a deepfake. It used, you know, off-the-shelf tools. I don't remember if it was PlayHT or ElevenLabs, but it was a static pre-recorded deepfake. And what we're expecting, unfortunately…

Immad Akhund: What did the Biden deepfake say or do? 

Ben Colman: There were a number of them. I believe the one that went viral was a deepfake audio clip announcing that voting hours were going to be extended. So trying to get people to potentially not vote and then miss out on voting. But what's interesting there is, again, it was very low tech. And where we're seeing things going, unfortunately, is much higher tech, where it's not going to be President Biden. It's going to be, you know, our husbands, our wives, our parents, our bosses. And it's going to be in real time. And it's going to say, “Hey Immad, I need you in the office at 7. I know it's voting day, but you know, it's an emergency,” or it's going to be your wife or your husband saying, “Hey, I need you home immediately.” Also, the corollary to that is something even scarier than throwing an election, which is deepfake ransom phone calls.

Immad Akhund: I've heard that, and that's already happening, right? I've heard of cases of this.

Ben Colman: It's already happening. It used to be, which is still scary, it used to be, “Hello, I have your daughter,” and now it's, “Hello, I am your daughter.” And again, with current tools available, I'll try and avoid naming which ones, but they're just a Google search away, you need just a few seconds of somebody's voice. Now, the voice won't know any information about you beyond what you type, so it won't get the right slang you use or share information that isn't public, but it will at least sound like you. If you asked me what the clues are to telling the difference, I'd say it's already too late. The only opportunity is to use AI to detect AI, which is what we do at Reality Defender.

Immad Akhund: So one question I had is, by the time something goes viral, it's kind of too late. By the time I get a hostage call, it's kind of too late, because how do I plug you in in the middle of a call, right? So how do you get there before the thing is too late?

Ben Colman: Yes, it's a good point there. I mean, if I hear an audio clip of what sounds like my seven-year-old, and if they're asking for a few thousand dollars, am I going to take the chance or not? Even as a practitioner in the space, I'm not going to take a chance. I'll probably pay it. But what can people do and how do you get ahead of the problem? There's not a lot individual consumers can do, which is why at a country level, we're focused on encouraging different regulatory proposals and frameworks that will require the platforms themselves to scan things in real time. So you'll get a notification the same way you get a notification on your phone that perhaps you're receiving spam from a telemarketer. It's the same thing.

Looking at social media and streaming platforms, you already get an alert saying, this video contains adult images, please log in and confirm you're over 18, or this contains violence. Even in the last Super Bowl, there were a number of plays where somebody might have gotten hurt, or it was quite aggressive or violent, and it's grayed out or pixelated, and it says, Immad, please click to confirm you want to see this. Or things like underage imagery, which are automatically blocked. Or things less dangerous, but where, again, there's regulation on it: if you're uploading the latest Drake song, YouTube will actually check for it and will block or flag your account for doing that. And so what we're asking, and what we expect to happen, is not to say generative AI or deepfakes are bad, not to say they should be blocked, but just to say that some limited amount of information should be shared with the consumer to make an informed decision. Instead of relying on platforms that best case have content moderators, but the moderators don't do anything unless you and I flag it. It's like saying, hey Immad, please check every email for ransomware. Please read the code. You're not going to do that.

Immad Akhund: So basically rely on the platform to inform the user. I mean, some cases, like if I receive a deepfake AI call, I'd be fine with that just being blocked before I even have to speak to it. That's up to you.

Ben Colman: I block all telemarketing calls. I get a call on my Verizon phone and I select it. It says, do you want to receive them or not? And I said, block absolutely everything. But right now, consumers don't have a choice because they just don't know.

Rajat Suri: Ben, it's a fascinating company. What's the status of it right now? Is it being used? Who's using it? How many people are using it? Is it consumers? Is it businesses? Who's your primary customer? Be interesting to know where you are with the technology and usage of it.

Ben Colman: Yeah, our platform is focused on supporting large enterprises, large companies, and government users. We currently don't offer a solution for consumers. We believe, ethically and morally, that platforms should do the work and scan these for consumers. And we look to regulators to require that. And until then, unfortunately, no platform is really doing a lot. But the users that are actively engaging… Sorry, can you repeat that?

Rajat Suri: What were you saying? You said no platform is helping consumers?

Ben Colman: I was saying that, much like other types of content moderation use cases, we really rely on our elected officials and regulators to set some minimum standards there. And until then, unfortunately, most platforms are just not going to automatically do anything. They'll ask you and I to either flag it, or they'll have community notes, or they just won't do anything at all. But as far as our clients, they're kind of in a few main buckets. One of them is banks, financial services institutions who are scanning for fake voices and fake faces. Another one is large media organizations scanning content before they put it on TV. And the third vertical is kind of this grouping of government defense use cases where they're, again, trying to establish whether a piece of media is indeed real or not real to make decisions that affect lives.

Rajat Suri: OK, interesting. So you don't think there's a use case for someone like Immad to install Reality Defender on his laptop or his phone, and Reality Defender is checking all the incoming calls and even the inbound email he's getting?

Ben Colman: Yeah, it absolutely is a use case. It's absolutely a use case. It's a very important use case. The challenge there is that the computation requirements still need cloud compute. So, you know, very similar to the early days of antivirus software: anyone who's over 40 years old, like me, will remember a time when you picked which files you wanted to scan, scan this one, scan this one. And then they'll remember, probably in the early 2000s, you got an email from your company or your school saying, hey, Raj, please log out of your computer at six o'clock because we're going to update the system and check for computer viruses. And now we take it for granted because it's happening locally on our phones and computers in real time, and we only realize it when, you know, my mother will send me a presentation of cat videos, true story, and it'll have a trojan horse or ransomware or some kind of APT in it, and I'll get an email from Outlook or a notification from Gmail saying, we blocked this thing because it had something bad in it. And I go, wow, that's great. It's happening. So for us, we're very much in the kind of '80s, early '90s era. It's still computationally expensive. You still need cloud compute. But just this last Monday, we spoke at NVIDIA's GTC conference, and one of the things we announced in our kind of amazing collaboration with NVIDIA is that we see, within the next few years, us running our software on the edge and on devices themselves. So your use case is very close to our heart and we hope to move forward to support it.

Rajat Suri: So for these enterprises and government bodies that are using your platform, they're actually like selecting which media they want to run through it. It's like basically they have a process where they say, well, here's all the stuff that is coming in that we want to double check so we'll run it through your software.

Ben Colman: Yeah, we're an API-first platform, so they can stream as many assets as they want across audio, video, images, and text, but also real-time communications, real-time phone calls. And so, you know, we'll be careful to avoid sharing any names, but it's highly likely that you and your listeners are being protected when you call your bank, or when someone calls your bank trying to impersonate you to access your account or execute a wire transfer, given the prevalence of tools that offer real-time fake voice generation.
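
To make the "API-first, stream an asset, get back a probabilistic verdict" workflow concrete, here is a minimal, entirely hypothetical client sketch. The endpoint, field names, and response shape are invented for illustration; the actual Reality Defender API is not documented in this conversation.

```python
import requests  # third-party: pip install requests

# Hypothetical endpoint and credential -- placeholders only, not the real API.
API_URL = "https://api.example.com/v1/scan"
API_KEY = "YOUR_API_KEY"

def scan_audio(path: str) -> dict:
    """Upload one audio asset and return a (hypothetical) per-model and combined score."""
    with open(path, "rb") as f:
        resp = requests.post(
            API_URL,
            headers={"Authorization": f"Bearer {API_KEY}"},
            files={"file": f},
            data={"modality": "audio"},
            timeout=30,
        )
    resp.raise_for_status()
    # Assumed response shape, e.g. {"combined": 0.87, "per_model": {...}}
    return resp.json()

# Example (would require a real endpoint and file):
# print(scan_audio("suspect_call.wav"))
```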

Rajat Suri: And how good is the technology to actually identify deepfakes and other sort of spoof videos and voice recordings? I'm guessing some categories are really easy to identify and some categories aren't. You mentioned you don't use watermarking or something like that. So are you looking for certain patterns in terms of how they're produced? Maybe you can give us some insight into the technology.

Ben Colman: Yeah, we're looking for different types of features that are indicative of known generative platforms, but also unknown platforms, where we forecast the technology going. So we're trying both to identify and predict whether things are real and whether things are fake. We're taking what's called an ensemble approach, building independent, interconnected models that all look for different things and provide both individual scores and a more collective score. But it's all based on probabilities and confidence intervals.
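
As a rough illustration of the ensemble idea Ben describes, here is a minimal sketch in which several hypothetical per-modality detectors each report their own probability and a simple weighted average stands in for the collective score; the detector names, weights, and fusion rule are assumptions, not Reality Defender's actual method.

```python
from dataclasses import dataclass

@dataclass
class DetectorResult:
    name: str            # which sub-model produced the score
    score: float         # probability the asset is AI-generated, in [0, 1]
    weight: float = 1.0  # how much this detector counts in the aggregate

def aggregate(results: list[DetectorResult]) -> dict:
    """Combine independent detector scores into one collective score.

    A weighted average is a stand-in for whatever fusion the real system uses;
    the point is that each model reports its own probability and the platform
    also returns a combined one.
    """
    total_weight = sum(r.weight for r in results)
    combined = sum(r.score * r.weight for r in results) / total_weight
    return {
        "per_model": {r.name: round(r.score, 3) for r in results},
        "combined": round(combined, 3),
    }

# Hypothetical scores from three independent image detectors.
print(aggregate([
    DetectorResult("gan_artifacts", 0.91),
    DetectorResult("diffusion_noise", 0.72),
    DetectorResult("frequency_domain", 0.85),
]))
```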

Rajat Suri: Yeah. So I guess what kind of stuff do your models look for when you're analyzing a piece of video or audio?

Ben Colman: For vision, so for images and video, we're looking for indications of different types of generative techniques, whether they're GAN or diffusion, or different types of noise within the pixel layer. We're doing frequency-domain analysis, looking for indicators of off frequencies that could indicate diffusion, for example. We're also looking at thousands of other small things that individually might not mean a lot, but together are representative of different types of manipulation or generation. And that's just in vision; we do similar things for audio. The last item is text, which is a newer modality for us, focusing on different types of predictability and entropy measurements that might indicate different types of large language models being used. And, you know, across all of these, we are state of the art.
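
Here is a toy sketch of one kind of frequency-domain feature of the sort mentioned above: the share of spectral energy above a radial cutoff in an image's 2D FFT. The cutoff, the specific feature, and the random "image" are illustrative assumptions; a production detector would extract many such features and feed them to trained models.

```python
import numpy as np

def high_frequency_ratio(gray: np.ndarray, cutoff: float = 0.25) -> float:
    """Crude frequency-domain feature: share of spectral energy above a radial cutoff.

    Generative pipelines can leave unusual high-frequency fingerprints, so a
    feature like this could feed a downstream classifier; real systems use
    thousands of such signals, not one.
    """
    spectrum = np.fft.fftshift(np.fft.fft2(gray))
    energy = np.abs(spectrum) ** 2
    h, w = gray.shape
    yy, xx = np.mgrid[0:h, 0:w]
    # Normalized radial distance from the center of the shifted spectrum.
    radius = np.hypot((yy - h / 2) / (h / 2), (xx - w / 2) / (w / 2))
    high = energy[radius > cutoff].sum()
    return float(high / energy.sum())

# Hypothetical 256x256 grayscale array standing in for a face crop.
fake_face = np.random.rand(256, 256)
print(f"high-frequency energy ratio: {high_frequency_ratio(fake_face):.3f}")
```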

You asked us about accuracy, Raj. It'd be easy for us or any researcher in the space to claim we're 99.9% accurate. The challenge that took us a few years to realize is that everyone who says that is focusing on the same datasets to train on and the same datasets to test on. It's almost like studying the exam and then taking the exam: you kind of know the answers beforehand. And so what we developed is both individual modality teams building the models and, equally important, teams specifically focused on building the datasets themselves. So we build our own datasets.

We also build a number of tools we could commercialize, but right now they're used internally to measure the potential skew or bias of models and also of the datasets themselves. We noticed a few things. If you go on Wikipedia and ask how many races there are, it'll probably say there's 64 or 128, when there are effectively thousands. So we, in certain arenas, try to expand that out. Instead of looking at race, we look at what's called the Monk Skin Tone Scale, which is a kind of continuous gradient. And we try to treat all these things as having kind of infinite dimensionality across age, gender, skin color, dialect, language, and then the factorial expansion if you combine any variation of all those. So if you took a 2D or even 3D visualization and increased that to 4,000 dimensions, that would get close to how our software works in looking at any piece of media or streaming content.

Immad Akhund: What is the actual false positive and false negative rate for the different types of media? Is one easier to catch than others? It seems like video would be easier to catch than text would be. I don't know if you publish these things, but what are they roughly?

Ben Colman: It's a bit of a trick question because it kind of depends on what we agree is the benchmarking dataset. So we're quite accurate, but we're also constantly updating our benchmarking datasets to try and break our own models. Some organizations push out updates once a year or once a quarter. We're pushing updates every few weeks, and we actually think we could do it even faster than that. So we're trying to always be on the bleeding edge, not only of what we are seeing, which is reacting really quickly, but also of what we haven't seen yet, where we forecast the research in the space is going. We can also fine-tune it. Certain use cases might be more concerned with false positives versus false negatives.
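
To illustrate the false positive / false negative trade-off and why a decision threshold can be tuned per use case, here is a small sketch with made-up scores and labels; the numbers are invented and only show the mechanics.

```python
import numpy as np

def error_rates(scores: np.ndarray, labels: np.ndarray, threshold: float):
    """False positive / false negative rates for a given decision threshold.

    scores: detector probabilities that an asset is fake, in [0, 1]
    labels: ground truth, 1 = fake, 0 = real
    """
    predicted_fake = scores >= threshold
    fp = np.sum(predicted_fake & (labels == 0))
    fn = np.sum(~predicted_fake & (labels == 1))
    fpr = fp / max(np.sum(labels == 0), 1)
    fnr = fn / max(np.sum(labels == 1), 1)
    return fpr, fnr

# Made-up benchmark: 6 real assets and 6 fakes with illustrative scores.
scores = np.array([0.05, 0.12, 0.30, 0.45, 0.55, 0.62,   # real
                   0.48, 0.70, 0.81, 0.88, 0.93, 0.99])  # fake
labels = np.array([0] * 6 + [1] * 6)

# A stricter threshold lowers false positives but raises false negatives.
for t in (0.4, 0.6, 0.8):
    fpr, fnr = error_rates(scores, labels, t)
    print(f"threshold={t:.1f}  FPR={fpr:.2f}  FNR={fnr:.2f}")
```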

We're quite robust. Our clients are tier-one banks and media and governments, who test everything. And, you know, we're not really competing with any startups; we're competing against very large companies, companies like Amazon, companies like Microsoft, and we're blowing them out of the water, mostly because this is the only thing we're doing. We're only focusing on inference detection, while they're focusing on different things, whether that's generation or voice authentication, where they're retaining Immad, your voiceprint, or Raj, your faceprint. We don't need or use those, which also means that our clients are much more comfortable that we don't retain any data we can lose. If you're a face recognition company and you're telling a bank, trust us with all of your employee data and all your customers' faces, and you get hacked and lose that, a faceprint can't be reset. There's no pressing a button to reset it. And the analogy we'll use is 23andMe, which admitted a few months ago that they were hacked, even though it happened a year earlier. Your DNA is kind of the ultimate private key, not resettable, and that's going to affect people for generations. So as a policy, and also because of our foundational research, we have not needed that, and that's helped us move a lot quicker in many use cases where the data we're scanning is very delicate and critical.

Immad Akhund: Got it. So would you say, at least for videos and photos, you're near 100% detection on AI-generated content, and the same for audio?

Ben Colman: It really depends on the use cases, but we're very accurate. I think we have to be careful on quoting any accuracies publicly, but I will say we're focusing a lot on human media. So, you know, human faces, human voices, and then expanding from there to other types of use cases that are not specifically voices and faces.

Immad Akhund: Every time there's a new model, like some new open-source thing, or GPT-5 comes out, or there's a new DALL-E, does it suddenly kind of break all your models? Where like, oh, now this is more advanced and the person seems more real, and now you have to go retrain on all that?

Ben Colman: I mean, the fakes are getting better at faking, which is a bit of a cat-and-mouse game. It's kind of whack-a-mole, but I'd argue that's more a feature than a bug. It's also true of any kind of cybersecurity. In our case, the real is not changing, so we're focusing on detecting both the fakeness and the realness. And part of our data team's efforts are developing what we believe is the largest corpus of both real and generated media for our training. You know, we don't sell it, we don't license it, it's only for us internally. Any public dataset is obviously available to bad actors, who can use it to get around models. And for the good guys, it's easy to think you're at 100% because, again, you've seen the exam answers before the exam. But you asked about how these different modalities differ. You mentioned that you thought video and vision were easier. There's a lot more dimensionality in vision versus audio; there's a lot more data. We have individual teams that are state-of-the-art and bleeding-edge on all of those, so each one has its own kind of advantages.

Immad Akhund: But you're saying the more dimensionality makes it easier to detect a fake, right?

Ben Colman: The more dimensionality makes it harder to train a model to detect.

Immad Akhund: But it's also got more information to detect something.

Ben Colman: It does. It does. And we're multimodal, so let's say we have a video of you, Immad: we're scanning both the video and the audio and the correlation between them, like sound versus mouth movement. So we're handling each modality separately and also combining the insights together to provide a clearer picture. And then we're doing what's called XAI, explainable AI. We're giving our clients a visualization to say, not just here's our probability and the reasons why something is real or fake, but also, on this image, we don't know what it is, but let's say it's Immad's face, here's a heat map demonstrating where our model is finding the most relevancy at the pixel layer.
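
One simple way to see the heat-map style of explanation described above is occlusion sensitivity: blank out patches of the image and watch how much the detector's score drops. This sketch, including the toy detector, is an assumption for illustration, not necessarily how Reality Defender's explainability works.

```python
import numpy as np

def occlusion_heatmap(image: np.ndarray, detector, patch: int = 16) -> np.ndarray:
    """Re-score the image with each patch blanked out and record the score drop.

    Large drops mean the detector was relying on that region -- a simple
    stand-in for the pixel-level heat maps described above.
    """
    base = detector(image)
    h, w = image.shape
    heat = np.zeros((h // patch, w // patch))
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            occluded = image.copy()
            occluded[i:i + patch, j:j + patch] = image.mean()
            heat[i // patch, j // patch] = base - detector(occluded)
    return heat

# Toy "detector": scores an image by the variance in its top-left quadrant,
# so the heat map should light up there. Purely illustrative.
def toy_detector(img: np.ndarray) -> float:
    return float(img[:64, :64].var())

img = np.random.rand(128, 128)
img[:64, :64] *= 3  # plant a high-variance region the toy detector keys on
print(occlusion_heatmap(img, toy_detector, patch=32).round(3))
```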

Rajat Suri: There are so many questions. It's really fascinating. You're right at the forefront of a really important sea change in our lives and in overall technology, right? So there's a lot of things that I'm sure that you're contemplating, which are very relevant to the whole AI revolution. So I guess one of the questions I had, is there a future where it's basically impossible to detect a deepfake or some kind of AI-generated video or audio because they're generated so well? Even for text right now, I can see it's probably impossible for large sets of text to detect whether it's AI or human.

Ben Colman: Yeah, I think that this isn't a new kind of evolution of technology, and we kind of look at the endpoint antivirus space that we think we're kind of mirroring. And so if you think about it, we're in the very early chapters focusing just more specifically on realness or fakeness. But as we expand, we'll start pulling other types of data, whether we're looking at content and intent and context and truthfulness and different types of correlations between what Raj, you, or you, Immad, said or did, whether the video matches the text next to it. So I would argue that as computational power gets more powerful, we'll start being able to look at a lot more things that'll tell us a lot more of the picture that'll let us continue to stay on the forefront ahead of these threats.

Rajat Suri: I mean, even for now, right now, the Gmail responses, the auto responses that they have, I mean, they're basically human-like, right? So you would never know if someone just pressed a Gmail button or they actually typed out a response. It would be impossible to actually detect that. I don't know if the comparison to viruses stands. I mean, in terms of cybersecurity evolution, I get that comparison. But the actual technology is quite different.

Ben Colman: Yeah, I think for text, you might be right, because there's a lot less data to measure, especially in shorter text, sentences and paragraphs. So on our side, particular clients will look at very large things, and while they can use our solution to scan, for example, a tweet, I wouldn't think it'd be tremendously insightful versus looking at, say, a dissertation. But text is a very, very small part of our platform. The majority of users are looking at image, video, and audio.
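
A toy version of the "predictability and entropy" idea for text is character-level Shannon entropy; real systems would rely on language-model perplexity and far richer statistics, and this sketch mainly shows why a tweet-length sample is a weak signal compared with a long document.

```python
import math
from collections import Counter

def char_entropy(text: str) -> float:
    """Shannon entropy (bits per character) of the character distribution.

    A toy stand-in for the predictability measures mentioned above. Short
    texts give noisy, uninformative values, which is one reason a tweet is
    a much weaker signal than a dissertation.
    """
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(f"{char_entropy('a short tweet-sized string'):.2f} bits/char")
print(f"{char_entropy('a much longer passage of ordinary prose ' * 50):.2f} bits/char")
```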

Rajat Suri: For the image thing, there was a famous incident recently with a member of the royal family in the UK: the princess digitally altered a picture, people originally accepted it, and then they found out it was altered. I actually looked at the picture. I couldn't figure out how they found out it was altered. Do you know how that happened? How did they find out it was altered, and what kind of process did the media go through to even figure that out? And would you have caught that? I'm kind of curious about that one.

Ben Colman: Yeah, on our side, we wouldn't term this a deepfake. There's a spectrum from deepfakes to cheapfakes to post-processing, whether that's Photoshop's Magic Eraser, which a lot of folks use, or just the tools available on your iPhone. I have small kids, and sometimes they're not very photogenic, and you can combine a few pictures and kind of get the best of all of them. From my understanding, it was a combination of both of those. And our solution has a model that looks for visual noise, so we're able to identify different anomalies at the pixel layer. But I think most of what people looked at was more common-sense reasoning, looking at things that just didn't look right. I don't think there was an area where there were six fingers, but there are certain areas where things in front and behind were not consistent. So, yes, we detected an anomaly, but no, I wouldn't deem this a deepfake. And the majority of our clients are looking at actual fraud use cases and less at consumer use cases.

Immad Akhund: When you think about this election coming up, which is pretty soon, you know, the internet's probably going to go completely crazy and there will probably be tons of deepfakes. And at least so far, these platforms don't have protections in place. So, you know, the 2024 election is potentially going to be one of the most deepfaked and crazy elections.

Ben Colman: Not just the US, Immad, you know, 65% of the world is having elections in 2024.

Immad Akhund: I feel like the U.S., whatever happens, the U.S. always gets it worse. So I feel like it's going to be kind of crazy.

Ben Colman: For the last 10 days, I've been on a bit of a round-the-world trip meeting with policymakers and practitioners and security officials. I was at The Hague speaking with global CISOs, heads of cybersecurity at tier-one banks. I was in London speaking to government defense officials at Cephas to the West. I spoke on a panel with media executives. And then this last Monday, I was at NVIDIA GTC speaking with practitioners in the AI space and on a panel with NVIDIA itself. And the commonality is that the US is far behind really every other country. We look at countries like Singapore and Taiwan and Japan, the European Union, the UK, and they all have simple minimum regulations in the space. They might not block or flag things, but they're at least providing some kind of indication, or they're setting guidelines and a timetable for industry to actually create some protections themselves. Elections are a bit of a forcing function, and we're part of a number of election and geopolitical crash teams that are trying to support different protections here. But it'll certainly be a chaotic year until we have some clear regulations.

Immad Akhund: What do you think's the, I guess, queasiest thing that might happen?

Ben Colman: We kind of play this game in a lot of conversations. What we've discussed and heard is not so much a deepfake about a presidential candidate, but deepfakes targeting very junior election officials, trying to get them to do things that would affect their ability to actually complete the election. So imagine a world where junior election volunteers are told not to show up in certain arenas, or certain roads to those voting booths are somehow blocked off due to some fake instructions. Those are the kinds of things that I think a lot about: these hyper-local, hyper-personalized deepfakes affecting folks that wouldn't traditionally be thought of as targets. They're not the head of cybersecurity. They're just regular people trying to go about their regular lives.

Immad Akhund: I was thinking it would be either Biden or Trump saying something, but maybe that's like the too obvious one.

Ben Colman: It's already happening. I mean, one of the people you mentioned is making actual fakes of the other one right now. But yeah, I think it's going to be much more personalized and local.

Rajat Suri: I think that this Biden and Trump fake thing is not likely to happen because there's so much media scrutiny over those videos. Like, you know, the moment a deepfake comes out, it's like in headlines, you know, and then whoever made that fake is kind of traced down.

Immad Akhund: And like, there's a lot of… I mean, these things go viral so quickly. Like, it's like within one hour, maybe a million people see it or two million people see it. And yeah, you know, the correction to the news never gets as spread as the news. I think the correction to the deepfake won't get spread as much as the deepfake.

Rajat Suri: But people will ask, is this real? I think people nowadays, I don't know, my gut reaction to any video is like, I'll just wait for it to be verified.

Immad Akhund: Yeah, but you're a sophisticated consumer, you're a technologist. I don't think the average American knows the amount of deepfakery that's possible right now. I don't think most people know.

Ben Colman: I mean, my parents are highly educated and they're sending me things. Now, they're questioning whether it's real or not, but they're still sharing it and propagating it.

Immad Akhund: Yeah, think about all the WhatsApp groups. It's not even just social media, right?

Ben Colman: Yeah, and you think about things, without naming any platforms, whether it's community notes or these kinds of false protections that are really setting false expectations that things will be protected, when they only actually come into effect once things have already gone viral. And the toothpaste is already out of the tube. You can't put it back in.

Rajat Suri: But don't you think that that loop of verification is also getting faster? Like where people are, you know, the community notes, for example, is kind of an antibody to the lie that spreads. Don't you think that's really faster?

Ben Colman: I mean, no. If you look at the most recent example, the deepfake Taylor Swift pornography had already been shared and re-shared millions, if not tens of millions, of times, and it went off platform. Suddenly it was on WhatsApp and Telegram, private communications that aren't even being scanned. So, given that platforms are already checking in real time for viruses and for music uploads, to see if you're uploading the latest Drake song, they could do exactly this before any one person has seen it, let alone millions. So I don't think community notes, or even the social media flow of right-clicking and sending something to a content moderator, is getting faster. We can give you examples. We have a whole deepfake life online: our VP of Human Engagement is a deepfake. He started kind of as a test of platforms, but he's got a full life online. He's on dating platforms, he was applying for jobs. And even when we flag that we created him, it's a 10-step process and nothing happens. Six months later, he's still there.

Rajat Suri: You created a fake online person.

Ben Colman: We typically do it as we're demonstrating our solution; we're showing our clients how easy it is for fraudsters to do. And part of the challenge was also to demonstrate that platforms are not removing them either. So again, this is scary, that anyone can do this with no technical ability. Five years ago, you had to be an expert to do it, and still, to make a computer virus, you've got to be an expert. But in this space, you just need a phone and access to the internet, and you can do all of it in one click.

Immad Akhund: Why do you think the platforms won't watermark these things in the future? Like, it feels like almost all of the kind of, you know, creators of these models are incentivized to, like, reduce deepfakes or make them findable, at least. But you don't think they'll watermark these things?

Ben Colman: There's a number of organizations that are focusing on provenance and watermarking, whether it's the big tech companies or industry associations like C2PA. But that presupposes not only that the watermarks are secure, and there's a recent Wired article showing that they are hackable, but also that all the bad actors are going to follow the rules. If you put up a sign saying it's now a federal crime to rob banks, is the guy that's already robbing banks, which is already illegal, going to stop doing the crime because there's a more prevalent and obvious law?

Immad Akhund: Well, I would assume that the bad guys are still using the same models that have the watermark, but you're saying that there'll always be models out there that don't have the watermark.

Ben Colman: I'm saying they'll either hack an account, they'll hack your account to do it, or they'll just spin up their own instance of whichever generative platform they choose and run it locally. We interviewed a guy who, in the middle of the interview, changed race, changed accent, changed hair. Raj, he had glasses and the glasses were gone. And, you know, he made it to the next round. Did you hire him? He made it to the next round. And we were like, wow, that's impressive, how'd you do that, kind of walk us through it. And we were shocked. Maybe he did it locally, but no: like most startups, he applied for a hundred thousand dollars of free Amazon AWS credit, and he burned through $10,000 of credit for a 15-minute phone call, a 15-minute Zoom. Just to show that it's absolutely possible to do this. And he did not use any off-the-shelf models. He used open source, spun it up, and hosted it himself. That would completely get around this idea that, oh wow, provenance and watermarking and other types of content authentication will solve everything.

Rajat Suri: Yeah, I have one final question, Ben. I'm really curious to hear how you think the space evolves over the next decade or two. Do you think every person and every company will have a version of Reality Defender embedded in all of their inbound and outbound digital channels? I've heard this theory, I've talked to Paul Buchheit about this, and he's mentioned it publicly: everyone has their own AI assistant, and then the AI assistants can detect other people's AI assistants and see how real they are. And yeah, we'd love to hear your thoughts on what the future looks like.

Ben Colman: Yeah, there's a two-part question here. It's like, do I think that this will be ubiquitous? I think the answer is yes. I think we'll all have AI assistants, and I think there'll always be use cases where a bank will say, hey, Raj, please turn off your AI assistant. We want the real you to actually authenticate this wire transfer or changing the address on your bank account. But I also think it's a perimeter approach, and I think that if you look at any other kind of classic cybersecurity implementation within a large bank or actually just a consumer operating system, it's a combination of multiple solutions all working together. And so right now, while our clients are using it directly, where we think we're going to move to is kind of providing our insights into kind of these larger engines and kind of these model of models. 

You know, we will be a signal that gets used. Maybe we're the strongest signal, or maybe there are other things that are potentially more important in determining the answer. And we're starting to do that today. Here's one example: insider threat, where Raj is breaking protocol by using ChatGPT and therefore leaking company-private data. We're not going to serve our tool directly to you; we're going to go through other solutions that a company is already using, that are already monitoring your email servers. So it'll be more of an integration approach, an amalgamation of dozens of different platforms providing different types of insights, to get to that level of protecting a user or an action or an upload.

Rajat Suri: Thank you, Ben. It was great chatting with you today and we really learned a lot. All the best on your company and we look forward to talking again soon and hearing how things evolve.

Ben Colman: Thanks, guys. Thank you, Raj, thank you Immad.

Immad Akhund: Alright, so we just wrapped up with Ben. It was another interesting episode. What was the most surprising thing to you, Raj?

Rajat Suri: Yeah, it was a really interesting conversation. I was quite impressed by how deeply he's thought about all the different types of AI deepfake detection. And some of the technology underlying this I thought was quite interesting, in terms of finding areas where there might be more noise or diffusion in these models. I'm eager to learn more, actually, about how the technology detects an AI-generated video or picture. I'm a little bit concerned, though, about whether the technology is actually good enough, to be honest, because it seems like AI is moving at such a fast pace. Could companies like Reality Defender actually stay up to speed with how good the generative models are?

Immad Akhund: 100%. Right now, I can see it working very well, right? Like, I mean, I look at AI stuff and I'm like, okay, there's something a bit off about most AI-generated video and audio. And it is just a huge question of where it goes in the future. One thing that I thought was kind of funny: someone was telling me how, with a large chunk of text, one of the best ways of telling it's AI is because it's too good. Humans just make way more errors and have weird idiosyncrasies, which the AI kind of removes. And I can imagine there are some similar elements of that, where AI will always tend to do something one way. But yeah, there'll be adversarial AIs as well, which are deliberately trying to defeat people like Reality Defender, which will be hard.

Rajat Suri: Yeah. I think this is the beginning of the war between these white hat AIs and the black hat AIs, right?

Immad Akhund: You know, the other thing that could happen is maybe people just don't believe anything anymore. Like you don't believe anything you see on the internet. You don't believe a phone call that you receive, like, unless you see it in person, maybe like everything is just like, has a massive disclaimer in your head about it.

Rajat Suri: Personally, for me, that's where I've already landed. I don't take any phone calls. No one's going to be able to threaten me with a ransom because I don't take any phone calls except from a contact. So there's so much spam. And also with text messages, there's so much spam now. I just hit stop to all text messages if it's not from a contact. Same with videos I see on the internet. I don't even bother seeing it if it's not from a trusted source.

Immad Akhund: I mean, every now and then, even like the Zuckerberg video I talked about, when it came up, you know, it sounded just like him and it all sounded very reasonable. It took me about halfway in before I was like, wait a second, this is absurd. I think there will be situations where, you know, I always say, hey, I'll never get hit by a phishing website, but every now and then they construct the thing well enough, and I'm a very sophisticated anti-phishing user, but I've definitely fallen for it at least once or twice. It doesn't matter how disbelieving you are and how much you stick to trusted sources. There's going to be something you see that sits in your biases and confirms them, and you think it's real.

Rajat Suri: Yeah, I can definitely see that. I think the most sophisticated phishing I saw was at one of my companies, where they texted my employees pretending to be me, sent some text messages back and forth, and actually had some information about me, and then asked them to buy some gift cards or do something. It actually convinced a lot of my employees. I'm sure you get hit by attacks like that all the time.

Immad Akhund: I mean, all day long. Did that actually convince your employees? I have a channel in our Slack called phishing, and every single time someone gets hit with something, we post it there, because with phishing you just need to always remember it so you can't be continuously socially engineered. And maybe it's the same thing here. You just need to continuously think, is this fake? Is this fake? Maybe it kind of spoils the experience on the internet, that everything can be fake, but maybe it improves it too. You know, now you have tons of interesting content that anyone can create as well. So there's definitely an upside. Yeah, yeah, yeah.

Rajat Suri: So it'll be interesting to see where it goes. I think we're describing two different futures, right? One is like, this gets so good that people get gypped or people just don't believe in anything anymore. Because maybe once they've been gypped once or twice, they're like, well, I'm not going to trust this medium anymore. I have this theory where I think brands are going to be more important than ever because of these deepfakes and stuff. This actually helps big companies and long-term brands.

Immad Akhund: It actually helps journalism, which has been in decline; maybe media organizations are one of the few beneficiaries.

Rajat Suri: No, I mean, that is a very contrarian point, because you hear all these people criticize mainstream media, but actually I think mainstream media is more important than ever, because they become the trusted gatekeepers, the way I know some content is real. If I've been to the Wall Street Journal website and seen the video there, I'm a lot more likely to believe that than some random X account. But everyone criticizes mainstream media, yet it's actually, I think, becoming more important than ever. I think the analog might be early 1800s newspapers. Everyone was distributing newsletters and stuff like that, and there could be a lot of fake news in those newsletters; anyone could have a printing press and make any news up. It was only the custom New York Times letterhead that people would actually trust, because that's an expensive thing to fake, right? It's an expensive letterhead, or masthead, I think it's called. So fake news and fake stuff has been around for a long time. This is not a new problem.

Immad Akhund: Ben's bet is that they will always be able to detect it. And every website slash platform you go to is going to say like, you know, detected fake, detected real, things like that. And that would be the other answer.

Rajat Suri: Yeah, I think there are real cycles of credibility that people go through. So if we put out a deepfake podcast, or someone put out a deepfake podcast, that person would lose trust, which has real repercussions, so they wouldn't be trusted again. So I think this concept of trust, and who the counterparty is, really matters. Anyways, fascinating topic, fascinating discussion, and I'm excited to see what the future holds for Ben and all the other companies in this space.

Immad Akhund: Yeah, thanks for joining. Make sure you subscribe, like, leave us reviews, and see you next time.
