Fake News Datasets

Fake news datasets, papers and, where available, code widely used in research

* We do not necessarily endorse the views expressed or the facts presented in any of these works.

CREDBANK

A large-scale social media corpus with associated credibility annotations


The CREDBANK corpus was collected between mid-October 2014 and the end of February 2015. It is a collection of streaming tweets tracked over this period, topics in this tweet stream, topics classified as events or non-events, and events annotated with credibility ratings.
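
A rough sketch of how the corpus might be loaded for exploration, assuming one of its tab-delimited data files; the filename below is a placeholder rather than the official one.

```python
import pandas as pd

# Hypothetical example: the corpus is distributed as tab-delimited files;
# the filename below is a placeholder, not the official one.
ratings = pd.read_csv("credbank_event_ratings.tsv", sep="\t")

print(ratings.shape)
print(ratings.head())
```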

Detailed information about this dataset can be found in the publication Mitra, T. & Gilbert, E. (2015). Credbank: A large-scale social media corpus with associated credibility annotations. Proceedings of the International AAAI Conference on Web and Social Media, 9(1), 258–267.

Emergent

A dataset for stance classification


The Emergent dataset is derived from a digital journalism project for rumour debunking. It contains 300 rumoured claims, each collected and labelled by journalists with an estimate of its veracity (true, false, or unverified), and 2,595 associated news articles. Each associated article is summarised into a headline and labelled to indicate whether its stance is for, against, or observing the claim, where observing indicates that the article merely reports the claim.
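
A minimal sketch of inspecting the claim and article labels, assuming a CSV export of the dataset; the filename and column names below are hypothetical placeholders, not necessarily the released schema.

```python
import pandas as pd

# Hypothetical sketch: the filename and column names ("claimId",
# "claimTruthiness", "articleStance") are placeholders, not necessarily
# the released schema.
df = pd.read_csv("emergent_claims_articles.csv")

# Article-level stance distribution: for / against / observing
print(df["articleStance"].value_counts())

# Claim-level veracity distribution: true / false / unverified
print(df.drop_duplicates("claimId")["claimTruthiness"].value_counts())
```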

Detailed information about this dataset can be found in the publication Ferreira, W. & Vlachos, A. (2016). Emergent: a novel data-set for stance classification. In Proceedings of the 2016 conference of the North American chapter of the association for computational linguistics: Human language technologies. ACL.

FakeNewsNet

A dataset for fake news detection research


FakeNewsNet provides a minimal version of its latest dataset, with samples of fake and real news collected from PolitiFact and from GossipCop. The complete dataset cannot be distributed because of Twitter's privacy policy and news publishers' copyright; however, the code in the repository can be used to download the news articles from the publishers' websites and the relevant social media data from Twitter.
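
A minimal sketch of loading the bundled samples, assuming the CSV files shipped with the repository (e.g. politifact_fake.csv and politifact_real.csv) and their usual column layout; adjust the names if your copy differs.

```python
import pandas as pd

# Sketch assuming the sample CSVs shipped with the FakeNewsNet repository;
# column names such as "title" and "tweet_ids" are assumed from that layout.
fake = pd.read_csv("dataset/politifact_fake.csv")
real = pd.read_csv("dataset/politifact_real.csv")

fake["label"] = 1  # fake
real["label"] = 0  # real
news = pd.concat([fake, real], ignore_index=True)

print(news["label"].value_counts())
print(news[["title", "label"]].head())
```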

Detailed information about this dataset can be found in the publication Shu, K., et al. (2018). Fakenewsnet: A data repository with news content, social context, and spatiotemporal information for studying fake news on social media. arXiv:1809.01286.

FakevsSatire

A dataset of fake news and satire stories


A dataset of fake news and satire stories that were hand-coded and verified and, in the case of fake news, paired with rebutting stories. The dataset also includes a thematic content analysis of the articles, identifying major themes such as hyperbolic support or condemnation of a figure, conspiracy theories, racist themes, and the discrediting of reliable sources. The publicly available dataset includes the full text of the articles, links to the original stories, rebutting articles for the fake news items, and thematic codes.

Detailed information about this dataset can be found in the publication Golbeck, J., et al. (2018). Fake news vs satire: A dataset and analysis. Proceedings of the 10th ACM Conference on Web Science.

FEVER

Annotation platform and baselines for fact extraction and verification


The FEVER dataset consists of 185,441 claims generated by modifying sentences retrieved from Wikipedia and subsequently verified without knowledge of the sentences they were derived from. The claims are classified as Supported, Refuted, or NotEnoughInfo by annotators, with an inter-annotator agreement of 0.6841 Fleiss κ (substantial agreement). The authors develop a pipeline approach using both baseline and state-of-the-art components and compare it to suitably designed oracles.
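
A minimal sketch of reading the claims, assuming the commonly distributed JSON Lines release with claim and label fields; field and label names may differ in other versions.

```python
import json
from collections import Counter

# Sketch assuming the JSON Lines release (e.g. train.jsonl) with "claim" and
# "label" fields; field and label names may differ in other versions.
labels = Counter()
with open("train.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        labels[record["label"]] += 1  # SUPPORTS / REFUTES / NOT ENOUGH INFO

print(labels)
```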

Detailed information about this dataset can be found in the publication Thorne, J., et al. (2018). FEVER: a Large-scale Dataset for Fact Extraction and VERification. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1.

LIAR

A benchmark dataset for fake news detection


LIAR is a publicly available dataset for fake news detection. The authors collected 12.8K manually labelled short statements in various contexts from PolitiFact, spanning a decade; PolitiFact provides a detailed analysis report and links to source documents for each case.
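
A minimal sketch of loading the training split, assuming the header-less TSV release; the column names below follow the original release's documentation and should be treated as assumptions if your copy differs.

```python
import pandas as pd

# Sketch assuming the header-less TSV release (train.tsv / valid.tsv / test.tsv);
# the column names below follow the original README and are assumptions
# if your copy differs.
columns = [
    "id", "label", "statement", "subject", "speaker", "job_title",
    "state", "party", "barely_true_counts", "false_counts",
    "half_true_counts", "mostly_true_counts", "pants_on_fire_counts", "context",
]
train = pd.read_csv("train.tsv", sep="\t", header=None, names=columns)

# Six-way veracity labels: pants-fire, false, barely-true, half-true, mostly-true, true
print(train["label"].value_counts())
```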

Detailed information about this dataset can be found in the publication Wang, W.Y. (2017). "Liar, liar pants on fire": A new benchmark dataset for fake news detection. arXiv:1705.00648.

MuMiN

A large-scale multilingual multimodal fact-checked misinformation social network dataset


MuMiN is a benchmark dataset for automatic misinformation detection models. The dataset is structured as a heterogeneous graph and features 21,565,018 tweets and 1,986,354 users, belonging to 26,048 Twitter threads, discussing 12,914 fact-checked claims from 115 fact-checking organisations in 41 different languages, spanning a decade.
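
A minimal sketch using the companion mumin Python package; the constructor arguments and methods shown follow its documented usage but may differ between versions, and a Twitter bearer token is needed to rehydrate the tweets.

```python
# Sketch assuming the companion `mumin` package (pip install mumin); the API
# shown follows its documented usage but may differ between versions.
from mumin import MuminDataset

# A Twitter bearer token is required to rehydrate the tweets.
dataset = MuminDataset(twitter_bearer_token="<YOUR_BEARER_TOKEN>", size="small")
dataset.compile()  # downloads the graph and rehydrates tweets via the Twitter API

# The compiled dataset exposes node tables (claims, tweets, users, ...) as dataframes.
print(dataset.nodes["claim"].head())
```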

Detailed information about this dataset can be found in the publication Nielsen, D.S. & McConville, R. (2022). MuMiN: A Large-Scale Multilingual Multimodal Fact-Checked Misinformation Social Network Dataset. Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval.

NELA-GT-2018

Large multi-labelled news dataset for the study of misinformation in news articles


A dataset of 713k articles collected between February and November 2018 from 194 news and media outlets, including what the authors consider mainstream, hyper-partisan, and conspiracy sources. They incorporate ground-truth ratings of the sources from eight assessment sites (NewsGuard, Pew Research Center, Wikipedia, OpenSources, Media Bias/Fact Check (MBFC), AllSides, BuzzFeed News, and PolitiFact) covering multiple dimensions of veracity, including reliability, bias, transparency, adherence to journalistic standards, and consumer trust.
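
A purely illustrative sketch of joining articles with the source-level ratings; the filenames and column names below are placeholders, since the exact release format may vary.

```python
import pandas as pd

# Purely illustrative: filenames and column names are placeholders, since the
# exact release format (per-source article files plus a source-level labels
# table) may vary between versions of NELA-GT-2018.
labels = pd.read_csv("labels.csv")             # one row per source, columns per assessment site
articles = pd.read_csv("articles_sample.csv")  # hypothetical flat export with a "source" column

merged = articles.merge(labels, on="source", how="left")
print(merged.head())
```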

Detailed information about this dataset can be found in the publication Nørregaard, J., et al. (2019). NELA-GT-2018: A large multi-labelled news dataset for the study of misinformation in news articles. Proceedings of the International AAAI Conference on Web and Social Media, 13.

PHEME

A dataset of rumours and non-rumours


The PHEME dataset contains a collection of Twitter rumours and non-rumours posted during breaking news. The five breaking news topics provided with the dataset are as follows: Charlie Hebdo, Ferguson, Germanwings Crash, Ottawa Shooting and Sydney Siege.
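
A minimal sketch of walking the dataset, assuming the commonly distributed directory layout of per-event folders split into rumours and non-rumours; the paths are assumptions if your copy is organised differently.

```python
import json
from pathlib import Path

# Sketch assuming the commonly distributed layout:
# <event>/<rumours|non-rumours>/<thread_id>/source-tweet/<id>.json (plus reactions/);
# the paths are assumptions if your copy is organised differently.
root = Path("pheme-rnr-dataset")

for event_dir in sorted(p for p in root.iterdir() if p.is_dir()):
    for label in ("rumours", "non-rumours"):
        source_tweets = list((event_dir / label).glob("*/source-tweet/*.json"))
        print(f"{event_dir.name:20s} {label:12s} {len(source_tweets)} threads")
        if source_tweets:
            tweet = json.loads(source_tweets[0].read_text(encoding="utf-8"))
            print("  example source tweet:", tweet.get("text", "")[:80])
```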

Detailed information about this dataset can be found in the publication Zubiaga, A., et al. (2017). Exploiting Context for Rumour Detection in Social Media. Lecture Notes in Computer Science.