Krishna Bharat, computer scientist. Ex-Googler and founder of Google News. Investor and advisor to technology startups. A news addict who serves on the boards of the Columbia and Stanford (JSK) journalism programs, he shares his views on detecting fake news in real time.
Last November, a friend told me about his extended family of Filipino Americans in the Fresno area. In a matter of days, they went from feeling conflicted about Trump's candidacy to voting for him. They are Catholics, and once they heard that the Pope had endorsed Trump, their minds were set. Of course, the papal endorsement never happened. This is an example of a fake news wave that went viral and deceived millions.
Here is the same story in a Facebook post, shared by the group North Carolina For Donald Trump. They have 65,000 followers, and you can see how shares by dozens of influential groups could spread this to millions.
On the other side, a site called winningdemocrats.com published a hoax claiming that Ireland was officially accepting “refugees,” which also got plenty of attention. This is a bipartisan problem. Journalism is hard work. Fake news for influence and profit is all too easy.
This made me wonder what Facebook and other platforms could have done to detect these waves of misinformation in real time. Could they have taken countermeasures? If caught early, could they have slowed the spread or flagged it as unreliable news?
Platforms must act
As many have pointed out, the fight against fake news is best handled by the major platforms – Facebook, Twitter, Google, Microsoft, Yahoo and Apple. They control the arteries through which most of the world's information and influence flows. They are in the best position to curb misinformation. Their engineering teams have the techniques to detect it and the tools needed to respond.
Both social media and search engines have engineering “levers” (think: ranking flexibility) and product options to reduce exposure, flag news as fake, or stop waves of misinformation altogether. They will make these decisions individually based on the severity of the issue and how their organization balances information accuracy and author freedom.
Google Search focuses on access to information. Facebook sees itself as a facilitator of free expression. Each may resolve things differently.
“Our approach will focus less on banning misinformation, and more on surfacing additional perspectives and information, including that fact checkers dispute an item's accuracy.” – Mark Zuckerberg
In this article I prefer not to get into politics, and I would like to focus on detection rather than advocating for a specific response. Whatever the response, if you can detect fake news in real time you can do something about it.
Real-time detection, in this context, does not mean seconds. If the news does not spread, action may be unnecessary. In practice, rapid response could mean minutes or hours: long enough for an algorithm to detect a wave of suspicious-looking news that is gaining momentum, potentially from multiple sources.
Also long enough to gather evidence for review by humans, who may choose to stop the wave before it becomes a tsunami.
I know a thing or two about news processing algorithms. It is my belief that detection is manageable.
I also know that it is probably not a good idea to take action based solely on what the algorithm says. It is important to keep humans in the loop, both for corporate accountability and as a sanity check.
In particular, a human referee would be able to do proactive fact checks. In the example above, a Facebook or Twitter representative could have called the Vatican's press office and established that the story was false. If there is no obvious person to call, they could consult major news sources and fact-checking sites to get a read on the situation.
There will be ambiguous cases and situations where verification is elusive. Human referees may decide to wait and monitor the wave for a while before they intervene. Over time, a machine learning system could learn from the results, start using more tests, and train itself to become smarter.
What is a wave? A wave, in my terminology, is a set of articles that make the same (possibly false) claim, plus the associated social media posts. A wave is significant if its engagement is growing. Since the cost of human intervention is high, it only makes sense to flag significant waves whose features suggest misinformation.
The goal of the detection algorithm is to flag suspicious waves before they cross an exposure threshold, so that humans can do something about them.
To make this concrete: say a social media platform decides it wants to have fully dealt with a fake news wave by the time it reaches 10,000 shares. To achieve this, it may want the wave flagged at 1,000 shares, so that human raters have time to study it and respond. For search, you could count queries and clicks instead of shares, and the thresholds could be higher, but the general logic is the same.
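To make the threshold logic concrete, here is a minimal Python sketch. The 1,000 and 10,000 share figures come straight from the example above; the `Wave` structure and its field names are hypothetical, not something any platform has published.

```python
from dataclasses import dataclass

FLAG_THRESHOLD = 1_000    # flag for human review at this share count (example figure)
EXPOSURE_LIMIT = 10_000   # the wave should be fully handled before this point (example figure)

@dataclass
class Wave:
    story_id: str
    share_count: int
    flagged: bool = False

def maybe_flag(wave: Wave) -> bool:
    """Flag a growing wave early enough that human raters can act
    well before it approaches the exposure limit."""
    if not wave.flagged and wave.share_count >= FLAG_THRESHOLD:
        wave.flagged = True
    return wave.flagged

def breached_limit(wave: Wave) -> bool:
    """True if the wave has already crossed the point where the
    platform wanted it fully dealt with."""
    return wave.share_count >= EXPOSURE_LIMIT
```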
Algorithmic Detection
To detect anomalous behavior we have to look beneath the surface and also notice what is not happening.
What makes fake news detection manageable is that platforms can look at articles and posts, not just in isolation, but in the context of everything that is being said about that topic in real time. This expanded and timely context makes all the difference.
Take the “Pope supports Trump” story.
If you are an average Facebook user and the article was shared with you by a friend, you have no reason not to believe it. We have a truth bias that makes us want to believe things written in news format, especially when they are endorsed by someone we know.
That is why newly minted fake news sites try hard to appear legitimate. Some are run by “Macedonian teenagers” for profit, others by political operatives or foreign actors seeking to influence elections. As sites get flagged and blacklisted, new ones are created out of necessity.
A skeptic would ask: how likely is it that endingthefed.com, a relatively obscure source, is one of the first to report that the Pope has endorsed Trump, while established sources like the New York Times, Washington Post, BBC, Fox News, CNN, etc., and even the Vatican News Service, have nothing to say about it? That would seem unnatural. It would be even more suspicious if the sites carrying the story were all newly registered or had a history of fake news. This is the logic we are going to use, but with some automation.
To do this at scale, an algorithm would look at all the recent articles (from well-known and obscure sources alike) that have been getting some play in the last 6 to 12 hours on a particular social network or search engine. To limit the scope, we might require a match on trigger terms (e.g. names of politicians, controversial topics) or news categories (e.g. politics, crime, immigration). This might reduce the set to around 10,000 articles. These articles can then be analyzed and grouped into story buckets, based on common traits: significant keywords, dates, quotes, phrases, and so on. None of this is technically complex. Computer scientists have been doing it for decades and call it “document clustering.”
This technique has been used successfully on Google News and Bing News to group articles by story and compare editorial activity across stories. If two different sources mention “Pope” and “Trump” and some variant of “endorse” within a short time window, their articles will end up in the same story bucket. This essentially lets us capture the full coverage of a story across multiple news sources. Add the social context, that is, the posts that refer to these articles, and you have the complete wave. More importantly, this lets us find out comprehensively which sources and authors are propagating the story and which are not.
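As a rough illustration of the story-bucket idea, the sketch below clusters article titles by keyword overlap. This is not the production approach used by Google News or Bing News, which rely on much richer features (entities, quotes, dates, body text); the stopword list, the Jaccard threshold and the example titles are all invented for illustration.

```python
import re

STOPWORDS = {"the", "a", "an", "of", "to", "in", "on", "for", "and", "that", "is", "from"}

def keywords(title: str) -> frozenset:
    """Lowercased significant words from a title."""
    words = re.findall(r"[a-z']+", title.lower())
    return frozenset(w for w in words if w not in STOPWORDS and len(w) > 2)

def jaccard(a: frozenset, b: frozenset) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster_articles(titles, threshold=0.3):
    """Greedy single-pass clustering: each title joins the first existing
    bucket whose representative keywords it overlaps with enough."""
    buckets = []  # list of (representative_keywords, [titles])
    for title in titles:
        kw = keywords(title)
        for rep_kw, members in buckets:
            if jaccard(kw, rep_kw) >= threshold:
                members.append(title)
                break
        else:
            buckets.append((kw, [title]))
    return [members for _, members in buckets]

if __name__ == "__main__":
    demo = [
        "Pope Francis endorses Donald Trump for president",
        "BREAKING: Pope endorses Trump, shocks world",
        "Ireland officially accepting refugees from the US",
    ]
    for bucket in cluster_articles(demo):
        print(bucket)  # the two Pope/Trump titles land in the same bucket
```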
To assess whether the wave needs to be flagged as suspicious, the algorithm will need to look at features of both the story cluster and the social media cloud surrounding it.
Specifically:
- Is the wave about a politically charged topic? Does it match a set of words that tend to appear in partisan discourse?
- Is engagement growing quickly? How many views or shares per hour?
- Does it contain recent or new sources? Sources with domains that have been transferred?
- Are there sources with a track record of credible journalism?
- Are there questionable sources in the wave?
(A) Sources flagged as fake news by fact-checking sites (e.g. Snopes, Politifact)
(B) Sources frequently co-cited on social media alongside known fake news sources.
(C) Sources that bear a resemblance to known fake news providers in their affiliation, website structure, DNS record, etc.
- Is it being shared by users or forums that have historically spread fake news? Are known trolls or conspiracy theorists spreading it?
- Are credible sources covering the news? As time passes this becomes a powerful signal. A growing story that is not picked up by credible sources is suspect.
- Have any of the articles been flagged as fake by (credible) users?
Each of the above points can be evaluated by computers. Not perfectly, but well enough to serve as a signal. Carefully constructed logic can combine these signals into a final score that rates how suspicious the wave is.
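A minimal sketch of how such signals might be combined is shown below. The feature names mirror the checklist above, but the weights and the flagging threshold are invented for illustration; a real system would tune them from rater feedback rather than hard-coding them.

```python
# Hypothetical signal weights; each upstream heuristic or classifier emits a value in [0, 1].
SIGNAL_WEIGHTS = {
    "politically_charged":        1.0,  # trigger terms / partisan vocabulary
    "fast_growth":                1.5,  # shares or views per hour above baseline
    "new_or_transferred_domains": 2.0,
    "lacks_credible_coverage":    2.5,  # no established outlet has the story
    "known_fake_sources":         3.0,  # flagged by fact-checking sites
    "spread_by_known_bad_actors": 1.5,
    "user_fake_flags":            1.0,
}

def suspicion_score(signals: dict) -> float:
    """Weighted sum of the per-signal values for one wave."""
    return sum(SIGNAL_WEIGHTS[name] * value
               for name, value in signals.items()
               if name in SIGNAL_WEIGHTS)

def should_flag(signals: dict, threshold: float = 5.0) -> bool:
    """Flag the wave for human review once the combined score is high enough."""
    return suspicion_score(signals) >= threshold
```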
When a wave has the characteristics of fake news, the algorithm can flag it for human attention and potentially apply temporary brakes. This buys time and ensures the wave does not cross the high-water mark (10,000 shares or views) while the evaluation is in progress.
With each wave the human judges evaluate – and there can be several dozen a day – the system receives feedback. This in turn allows algorithmic/neural network parameters to be tuned and helps extend the track record of sources, authors and forums. Even waves that could not be stopped in time, but later proved to be misinformation, can contribute to improving the model. Over time this should make detection more accurate, reducing the incidence of false alarms at the flagging step.
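One simple way to realize this feedback loop is a logistic-regression style weight update, sketched below: each human verdict nudges the signal weights toward fewer false alarms. The learning rate and the shape of the inputs are assumptions for the sake of the example, not details from this article.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def update_weights(weights: dict, signals: dict, is_fake: bool, lr: float = 0.05) -> dict:
    """One stochastic-gradient step on the log loss for a single human-rated wave.
    `signals` maps feature names to values in [0, 1]; `is_fake` is the rater's verdict."""
    z = sum(weights.get(k, 0.0) * v for k, v in signals.items())
    error = sigmoid(z) - (1.0 if is_fake else 0.0)  # predicted probability minus label
    new_weights = dict(weights)
    for k, v in signals.items():
        new_weights[k] = new_weights.get(k, 0.0) - lr * error * v
    return new_weights
```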
Free expression and abuse
Policing free expression is a slippery slope and inevitably a bad idea.
It is important that platforms' policing of fake news be both defensible and transparent. Defensible, in the sense that they explain their policies and how they enforce them in a way the public is comfortable with. I would expect them to define fake news narrowly, covering only factual claims that are demonstrably false. They should avoid policing opinions or claims that cannot be verified. Platforms like to avoid controversy, and a narrow, sharp definition will keep them out of the fray.
In terms of transparency, I would expect all news identified as fake and slowed or blocked to be publicly disclosed. They can choose to delay this, but should disclose, within a reasonable time (say, 15 days), all news that was affected. This, above all, will prevent abuse by the platform. Google, Facebook and others already publish transparency reports revealing censorship and surveillance requests from governments and law enforcement. It is appropriate that they also be transparent about the content they act to limit.
Having been on the other side of this issue, I can think of reasons why the details of the detection algorithm may need to be kept secret. A platform in an arms race with fake news producers may find its strategy stops working if too much is made public.
One compromise would be to document the implementation details and make them available for internal scrutiny by (a panel of) employees, and perhaps for audit by authorized external counsel. When it comes to encouraging good corporate conduct, employees are the first line of defense. They are technically capable and come from across the political spectrum. They can confirm that there is no political bias in the implementation.
The biggest challenge to stopping fake news is not technical. It is operational will.
The scale and success of our major platforms made this full-scale assault on truth possible in the first place. They are also best positioned to fix it. They can deploy sensors, pull levers, and crush fake news by denying it traffic and revenue.
My concern is whether the leadership at these companies recognizes the moral imperative and has the will to take this on at scale, invest in the engineering that is needed, and act with the seriousness it deserves. Not because they are being disingenuous and it benefits their business, which I genuinely believe is not a factor, but because they may think it is too difficult and do not want to be held responsible for mistakes. There is no commercial imperative to do this, and there may be accusations of bias or censorship, so why bother?
If they are willing to go beyond that and take on the problem – and recent evidence suggests they are (e.g. Facebook paying fact checkers, ranking changes at Google) – I think their users and the press will appreciate and support them. With transparency and the right response they can do immense good for society and help ensure that democracies function properly. The alternative is terrifying.