Open Datasets by Third Parties about Coronavirus and Misinformation
This is a selection of open datasets suggested by MediaFutures mentors for the 1st Open Call. Participants are free to use these or other datasets.
COVID-19 Fact-checkers Dataset
Social Media Lab - Ryerson University
The COVID-19 Fact Checkers Dataset is a comprehensive list of over 200 active fact-checking organizations and groups that verify COVID-19 misinformation. The dataset is maintained by Ryerson University’s Social Media Lab as part of an international initiative to study the proliferation of COVID-19 misinformation and to map fact-checking activities around the world in partnership with the World Health Organization (WHO). It was created to give the public a better understanding of the COVID-19 fact-checking ecosystem and is intended to help policy makers and others make data-informed decisions in the fight against COVID-19 misinformation.
The CoronaVirusFacts/DatosCoronaVirus Alliance Database
Database that gathers all of the falsehoods detected by the CoronaVirusFacts/DatosCoronaVirus alliance. This database unites fact-checkers in more than 70 countries and includes articles published in at least 40 languages.
The Pennsylvania State University
Diverse COVID-19 healthcare misinformation dataset, including fake news on websites and social platforms, along with users' social engagement with such news. It includes 4,251 news items, 296,000 related user engagements, 926 social platform posts about COVID-19, and ground truth labels.
Sardar Vallabhbhai National Institute of Technology
Dataset comprising labelled news articles of misinformation spread during the infodemic. It contains approximately 2,800 news articles, real and fake, scraped from Poynter and other fact-checking sites. It also contains information such as the source URL, publication date and country of origin of each news article. Potential applications of this dataset include research on classification of intention, the study of spatial and temporal features, and linguistic indicators that can provide further insight into misinformation and help mitigate its effect as much as possible.
The GDELT Project monitors the world broadcast, print, and web news from nearly every corner of every country in over 100 languages and identifies the people, locations, organizations, themes, sources, emotions, counts, quotes, images and events driving our global society every second of every day, creating a free open platform for computing on the entire world.
Webhose’s free datasets
News articles, blog posts and online discussions that mention “CoronaVirus”
COVID19 Infodemics Observatory
CoMuNe Lab - Fondazione Bruno Kessler
Results from the analysis of infodemics due to unreliable content in online social media. Specifically, public posts on Twitter, analyzed with state-of-the-art machine learning techniques for: (1) population emotional state; (2) bot/human classification; (3) news reliability.
Carnegie Mellon University
Twitter misinformation dataset called "CMU-MisCov19" with 4,573 annotated tweets over 17 themes around the COVID-19 discourse. It also includes an annotation codebook for the different COVID-19 themes on Twitter, along with their descriptions and examples, for the community to use for collecting further annotations. Further details about the dataset and our analysis based on it can be found at https://arxiv.org/abs/2008.00791. In adherence to Twitter’s terms and conditions, full tweet JSONs are not provided; instead, a ".csv" file with the tweet IDs is supplied so that the tweets can be rehydrated. The dataset also provides the annotations and the creation date of each tweet so that the results of our analyses can be reproduced.
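Several datasets in this list release only tweet IDs, so a first step is usually to read the IDs and group them into batches before sending them to whatever rehydration tool or lookup endpoint is available. The following is a minimal sketch of that preparation step only. The column name `tweet_id`, the sample rows and the batch size of 100 are assumptions made for illustration and are not taken from the dataset documentation; no Twitter API call is made.

```python
import csv
import io
from typing import Iterator, List


def read_tweet_ids(csv_text: str, id_column: str = "tweet_id") -> List[str]:
    """Return the non-empty IDs found in the given column of a CSV document.

    The column name is hypothetical; adjust it to the header of the real file.
    IDs are kept as strings so that no tool converts them to floats by accident.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    ids = []
    for row in reader:
        value = (row.get(id_column) or "").strip()
        if value:
            ids.append(value)
    return ids


def batched(ids: List[str], size: int = 100) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


# Made-up sample standing in for the real ".csv" file.
sample = "tweet_id,annotation\n1,theme_a\n2,theme_b\n3,theme_a\n"
ids = read_tweet_ids(sample)
batches = list(batched(ids, size=2))
print(batches)
```

Each batch can then be handed to the rehydration client of your choice. Tweets that have since been deleted or made private will not be returned, so the rehydrated set may be smaller than the ID list.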
University of Southern California
Ongoing collection of Tweet IDs associated with the novel coronavirus COVID-19 (SARS-CoV-2), which commenced on January 28, 2020. Twitter’s search API was used to gather historical Tweets from the preceding 7 days, leading to the first Tweets in our dataset dating back to January 21, 2020. Twitter’s streaming API was leveraged to follow specified accounts and to collect, in real time, Tweets that mention specific keywords. To comply with Twitter’s Terms of Service, only the Tweet IDs of the collected Tweets are publicly released. The data is released for non-commercial research use.
Coronavirus (COVID-19) Tweets Dataset
Jawaharlal Nehru University
This dataset includes CSV files that contain the IDs and sentiment scores of tweets related to the COVID-19 pandemic. The tweets have been collected by an ongoing project deployed at https://live.rlamsal.com.np. The model monitors the real-time Twitter feed for coronavirus-related tweets using 90+ different keywords and hashtags that are commonly used when referring to the pandemic. The dataset was wholly redesigned on March 20, 2020, to comply with the content redistribution policy set by Twitter.
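As a rough illustration of how files of IDs and sentiment scores might be summarised, the sketch below counts tweets per sentiment bucket. It assumes a headerless two-column layout (tweet ID, score), scores roughly in the range -1 to 1, and a hypothetical neutral band of ±0.05; none of these details come from the description above, so check the actual files before relying on them.

```python
import csv
import io
from collections import Counter


def bucket(score: float, neutral_band: float = 0.05) -> str:
    """Map a numeric score to a coarse label (the thresholds are an assumption)."""
    if score > neutral_band:
        return "positive"
    if score < -neutral_band:
        return "negative"
    return "neutral"


def sentiment_counts(csv_text: str) -> Counter:
    """Count rows per sentiment bucket, skipping rows that are not (id, score)."""
    counts = Counter()
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) != 2:
            continue  # ignore blank or malformed lines
        _tweet_id, raw_score = row
        try:
            counts[bucket(float(raw_score))] += 1
        except ValueError:
            continue  # ignore a header line or a non-numeric score
    return counts


# Made-up sample rows: tweet ID, sentiment score.
sample = "1001,0.5\n1002,-0.3\n1003,0.0\n1004,0.8\n"
counts = sentiment_counts(sample)
print(dict(counts))
```

Because only the ID and score are read, this kind of summary can be computed without rehydrating the tweets. Any analysis of the tweet text itself still requires rehydration.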
Institutional and news media tweet dataset for COVID-19 social science research
Universitat Autònoma de Barcelona
Open access data repository for a dataset of institutional and news media tweets during the COVID-19 pandemic
COVID-19 Reddit Algo-Tracker
COVID-19 content being promoted by Reddit algorithms
COVID-19 related content across Wikipedia projects
Coronavirus en YouTube
Universitat Politècnica de València
[ES] This dataset contains, on the one hand, the initial sample of 73,268 videos retrieved from YouTube for specific queries related to COVID-19 and Spain and, on the other hand, the final sample of 39,702 videos in which the term coronavirus, covid-19 or SARS-CoV-2 appears explicitly in the title or description of the video.
CORD-19: The Covid-19 Open Research Dataset
Allen Institute for AI
CORD-19 is a resource of over 200,000 scholarly articles, including over 100,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses. This freely available dataset is provided to the global research community to apply recent advances in natural language processing and other AI techniques to generate new insights in support of the ongoing fight against this infectious disease. There is a growing urgency for these approaches because of the rapid acceleration in new coronavirus literature, making it difficult for the medical research community to keep up.
COVID-19 Data Repository
Johns Hopkins University
Data repository for the 2019 Novel Coronavirus Visual Dashboard operated by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE). It is also supported by the ESRI Living Atlas Team and the Johns Hopkins University Applied Physics Lab (JHU APL).
Data on COVID-19
Our World in Data
The complete COVID-19 dataset is a collection of the COVID-19 data maintained by Our World in Data. It is updated daily and includes data on confirmed cases, deaths, hospitalizations, and testing, as well as other variables of potential interest.
COVID-19 World Survey Data API
University of Maryland
API for accessing the daily global Facebook symptoms survey data
Data for the Open COVID-19 Data Working Group
University of Washington
Location for summaries and analysis of data related to n-CoV 2019, first reported in Wuhan, China
CCCSL: CSH Covid-19 Control Strategies List
Complexity Science Hub
A structured open dataset of government interventions in response to COVID-19
Health Intervention Tracking for COVID-19 (HIT-COVID) Data
Boston University and Johns Hopkins University
The Health Intervention Tracking for COVID-19 (HIT-COVID) project tracks the implementation and relaxation of public health and social measures (PHSMs) taken by governments to slow transmission of SARS-CoV-2 globally. Hundreds of volunteer data contributors were trained and given standardized field definitions and access to an online forum for asking questions and sharing ideas.
Mozilla COVID dataset
Data about user browsing in Mozilla Firefox to understand social distancing
The CoVidAffect dataset
Data from CoVidAffect, a nationwide citizen science project that aims to provide longitudinal data on mood changes following the COVID-19 outbreak in Spain
COVID-19 Mobility Monitoring project
ISI Foundation and Cuebiq
Data from the COVID-19 Mobility Monitoring project, which analyses anonymized location data to understand the effect of mobility restrictions and behavioral changes on the current international COVID-19 outbreak.
A series of projects to identify, collect, and analyze the value that data can provide in responding to the ongoing COVID-19 pandemic
MediaFutures is funded by the European Union's Horizon 2020 Programme, under grant agreement number 951962. MediaFutures is a Europe-wide consortium. This website is managed on behalf of the consortium by Eurecat, whose main address is Carrer de Bilbao, 72, 08013 Barcelona (Spain).