“I only believe in what I see and hear with my own eyes and ears,” said the Swiss-American psychiatrist Elisabeth Kubler-Ross. However, this maxim can no longer be applied to what we see on our screens today.
Day after day, we are bombarded with images and videos of all kinds. All this content shapes our beliefs, our convictions and the way we think, so there is a pressing need to be sure these images are real. Yet with Photoshop and video-editing software, we can no longer simply trust that any image we see is the original.
In fact, recent advances in artificial intelligence further blur the line between what is real and what is fake, and allow anyone with a computer or smartphone to manipulate content and visuals. So let's start with a short overview of the advances that allow artificial systems to recognize and generate visual and audio content.
Machine learning has driven major breakthroughs in the visual recognition abilities of artificial intelligence. Machine learning algorithms use computational methods to "learn" directly from data, rather than following rules fixed in advance by a programmer. These algorithms improve their performance adaptively as the number of samples available for learning increases.
The human voice, with all its subtlety and nuance, has proven extremely difficult for an artificial intelligence to imitate.
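The idea of "learning directly from data" can be illustrated with a toy sketch: a 1-nearest-neighbor classifier whose behavior comes entirely from its examples rather than from hand-coded rules. The data and labels below are invented for illustration; real systems learn from millions of samples.

```python
# Toy illustration of learning from data: no rule for "a" vs "b" is ever
# written down; the decision comes from whichever stored example is closest.

def nearest_neighbor_predict(train, point):
    """Label a point with the label of its closest training example."""
    closest = min(train, key=lambda ex: abs(ex[0] - point))
    return closest[1]

# Hypothetical training examples: (feature, label).
train = [(1.0, "a"), (2.0, "a"), (8.0, "b"), (9.0, "b")]

print(nearest_neighbor_predict(train, 1.5))  # near the "a" cluster -> "a"
print(nearest_neighbor_predict(train, 8.5))  # near the "b" cluster -> "b"
```

Adding more training pairs refines the decision boundary without changing a line of code, which is the sense in which such algorithms "improve adaptively" as data grows.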
When Siri, Alexa or our GPS speaks to us, it's fairly obvious that a machine is talking, because virtually all text-to-speech systems on the market rely on a pre-recorded set of words, phrases and utterances (recorded by voice actors), which are then stitched together to produce complete sentences.
In an effort to bring a little more life to the automated voice, Lyrebird, an artificial-intelligence startup, and the Chinese technology giant Baidu, with Deep Voice, have developed algorithms that can mimic anyone's voice and read a text with a predefined emotion or intonation. A year ago, the technology required about 30 minutes of audio to create an artificial clip. Now it can produce better results from just 3 seconds. In other words, a 3-second recording of any person speaking is enough for the software to synthesize their voice; you then type any text, and the synthesized voice reads it aloud. It's like magic!
Another possibility emerging from this research is the ability to create facial animations from speech audio in real time. Not only does the algorithm take an audio file of a voice as its source, it can also be instructed on the emotional state the facial expression should convey. This is a technology that will eventually be used for dubbing. Which brings us to new work from the University of Washington, where researchers have created a tool that takes audio files, converts them into realistic mouth movements, and grafts those onto an existing video. The end result is a video of someone saying something they never said. There are also techniques to replicate body movements: dancing, for example.
Now that we've covered the latest advances, it's time to look at the deepfake phenomenon. A deepfake is a piece of image and audio synthesis generated by artificial intelligence that makes it possible to superimpose parts of one video onto another: for example, one actor's face onto another's body, or a person's mouth made to say any text.
The phenomenon began in the fall of 2017, when an anonymous Reddit user registered under the pseudonym "Deepfakes" posted several videos showing famous actresses such as Daisy Ridley, Gal Gadot, Emma Watson and Scarlett Johansson engaging in pornographic acts. The problem: these videos were fake. The actresses never participated in any such clips. Since then, many deepfake videos have appeared on the internet, such as actors inserted into films they never played in.
In the end, the possibilities are enormous. Simple software and apps are enough to achieve impressive results, and it is very difficult to distinguish real videos from fake ones.
The dangers of these technologies are obvious. It doesn't take much to recognize that making someone appear to say whatever you want can have very serious consequences. The potential risks are vast and difficult to counteract, and it has already happened. In May 2018, a deepfake showing Donald Trump asking Belgian citizens to abandon the Paris climate agreement was broadcast on social networks. The video provoked a heated reaction, until the Belgian political party sp.a acknowledged that it was behind it.
Deepfakes are a gift for political parties seeking to manipulate and influence elections in one direction or the other. In the most serious cases, a deepfake could spark a conflict or even a war, for instance if a fabricated video showed a political leader threatening another country. Tensions between nations should not be exacerbated, yet some people might deliberately seek to provoke conflicts for personal gain.
Or imagine the following: a video of Donald Trump is published on YouTube. He announces that he has just launched nuclear missiles toward Russia. In less than a minute, the video goes viral across all social networks. A minute later, the Russians launch a nuclear attack against the United States. Thirty seconds after that, the video is revealed to be fake. Too late: the ten largest cities in the United States are in ruins, and World War III has supposedly begun.
A politician caught in a scandal could claim that a deepfake caused it and that he had nothing to do with it, even when he did. Conversely, if an ordinary person were accused of murder and a video showed him committing the crime, what are his chances of proving his innocence? Judicial institutions will have to adopt reforms to cope with these technological developments.
So, what are the solutions? Are we doomed to no longer trust anything, to become hyper-cynical about the world, and to be unable to disprove anything we are accused of? No, because when one technology poses a danger, another comes along to counterbalance the threat. If there are tools to create counterfeits, there are also tools to detect them.
Thus, artificial intelligence is being trained to recognize deepfake videos. For example, one algorithm is capable of analyzing variations in blood flow across a face. With each heartbeat, our face gets slightly redder. This is undetectable to the naked eye, but not to an AI. So if a face has been superimposed onto another video to create a deepfake, the fake face will show no heartbeat, and the AI will flag the video as fake. Deepfake creators will certainly find ways to bypass these algorithms, but new detection methods will follow: a race similar to the one between hackers and antivirus software. Soon, we could have a browser plug-in responsible for detecting deepfakes on Facebook, YouTube and other platforms. As soon as we watch a video, a small green light could tell us the video is 90% likely to be authentic, and a red one that it is 90% likely to be fake.
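The blood-flow idea can be sketched in miniature: average the redness of the face region in each video frame, then look for a periodic component in the heartbeat range (roughly 1–2 Hz). This is an illustrative toy, not the published detector; the frame values below are synthetic, whereas a real system would extract them from actual video.

```python
import math

def dominant_frequency(signal, fps):
    """Return the frequency (Hz) with the largest DFT magnitude, excluding DC."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [s - mean for s in signal]  # remove the constant skin tone
    best_k, best_mag = 1, 0.0
    for k in range(1, n // 2):
        re = sum(centered[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
        im = sum(centered[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
        mag = math.hypot(re, im)
        if mag > best_mag:
            best_k, best_mag = k, mag
    return best_k * fps / n

fps = 30  # frames per second
# Synthetic per-frame face redness: a faint 1.2 Hz pulse (72 bpm) over 10 s.
frames = [0.5 + 0.01 * math.sin(2 * math.pi * 1.2 * t / fps) for t in range(300)]
print(round(dominant_frequency(frames, fps), 1))  # prints 1.2
```

A real face yields a peak in the plausible pulse range; a synthesized face pasted onto the video would show no such periodicity, which is the cue the detector exploits.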
In short, keep in mind that just because something looks realistic doesn't mean it's true. We should be very careful when judging the content we see online. If we stumble upon a viral video of a politician confessing to something extremely inappropriate or controversial, we must immediately doubt its veracity. If we're not careful, we risk being manipulated to a degree history has never seen before.