Datasets

Open Datasets by Third Parties about Coronavirus and Misinformation

This is a selection of open datasets suggested by MediaFutures mentors for the 1st Open Call. Participants are free to use these or other datasets.

COVID-19 Fact-checkers Dataset

Social Media Lab - Ryerson University


The COVID-19 Fact Checkers Dataset is a comprehensive list of over 200 active fact-checking organizations and groups that verify COVID-19 misinformation. The dataset is maintained by Ryerson University’s Social Media Lab as part of an international initiative to study the proliferation of COVID-19 misinformation and to map fact-checking activities around the world in partnership with the World Health Organization (WHO). It was created to give the public a better understanding of the COVID-19 fact-checking ecosystem and is intended for use by policy makers and others to make data-informed decisions in the fight against COVID-19 misinformation.

The CoronaVirusFacts/DatosCoronaVirus Alliance Database

Poynter Institute


Database that gathers all of the falsehoods detected by the CoronaVirusFacts/DatosCoronaVirus alliance. The alliance unites fact-checkers in more than 70 countries, and the database includes articles published in at least 40 languages.

CoAID

The Pennsylvania State University


Diverse COVID-19 healthcare misinformation dataset, including fake news on websites and social platforms, along with users' social engagement with such news. It includes 4,251 news items, 296,000 related user engagements, 926 social platform posts about COVID-19, and ground truth labels.

COVID19FN

Sardar Vallabhbhai National Institute of Technology


Dataset of labelled news articles containing misinformation spread during the infodemic. It contains approximately 2,800 news articles, real and fake, scraped from Poynter and other fact-checking sites. It also contains information such as the source URL, publication date and country of origin of each article. Potential applications include research on classification of intention, the study of spatial and temporal features, and linguistic indicators that can provide further insight into misinformation and help mitigate its effect.

GDELT

Google Jigsaw


The GDELT Project monitors the world broadcast, print, and web news from nearly every corner of every country in over 100 languages and identifies the people, locations, organizations, themes, sources, emotions, counts, quotes, images and events driving our global society every second of every day, creating a free open platform for computing on the entire world.

Webhose’s free datasets

Webhose


News articles, blog posts and online discussions that mention “CoronaVirus”

COVID19 Infodemics Observatory

CoMuNe Lab - Fondazione Bruno Kessler


Results from the analysis of infodemics due to unreliable content in online social media. Specifically, public posts on Twitter, analyzed with state-of-the-art machine learning techniques for: (1) population emotional state; (2) bot/human classification; (3) news reliability.

CMU-MisCov19

Carnegie Mellon University


Twitter misinformation dataset called "CMU-MisCov19" with 4,573 annotated tweets over 17 themes around the COVID-19 discourse. It also includes an annotation codebook for the different COVID-19 themes on Twitter, along with their descriptions and examples, for the community to use for collecting further annotations. Further details about the dataset and our analysis of it can be found at https://arxiv.org/abs/2008.00791. In adherence to Twitter’s terms and conditions, full tweet JSONs are not provided; instead, a ".csv" file with the tweet IDs is provided so that the tweets can be rehydrated. The dataset also provides the annotations and the creation date of each tweet so that the results of our analyses can be reproduced.
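Rehydrating from an ID-only release usually starts by reading the IDs and grouping them into batches for whichever lookup client is used. The following is a minimal sketch of that preparation step only; it makes no API calls. The column name `tweet_id`, the batch size of 100, and the sample rows are assumptions for illustration, not details taken from the dataset.

```python
import csv
import io
from typing import Iterator, List

# Assumed per-request ID limit; check the documentation of the lookup
# service you actually use before relying on this value.
BATCH_SIZE = 100


def read_tweet_ids(csv_file, id_column: str = "tweet_id") -> List[str]:
    """Read tweet IDs from a CSV with a header row.

    IDs are kept as strings so that large values are never rounded.
    Blank cells and repeated IDs are skipped. The column name is a
    hypothetical default.
    """
    reader = csv.DictReader(csv_file)
    seen = set()
    ids = []
    for row in reader:
        tweet_id = (row.get(id_column) or "").strip()
        if tweet_id and tweet_id not in seen:
            seen.add(tweet_id)
            ids.append(tweet_id)
    return ids


def batched(ids: List[str], size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


if __name__ == "__main__":
    # Made-up sample rows standing in for a real ID file.
    sample = io.StringIO(
        "tweet_id,annotation\n1,theme_a\n2,theme_b\n2,theme_b\n3,theme_a\n"
    )
    for batch in batched(read_tweet_ids(sample), size=2):
        print(",".join(batch))
```

Each batch can then be passed to a rehydration tool or API client of your choice, subject to the platform's current terms and rate limits.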

COVID-19-TweetIDs

University of Southern California


Ongoing collection of tweet IDs associated with the novel coronavirus COVID-19 (SARS-CoV-2), which commenced on January 28, 2020. Twitter’s search API was used to gather historical Tweets from the preceding 7 days, leading to the first Tweets in our dataset dating back to January 21, 2020. Twitter’s streaming API was leveraged to follow specified accounts and also to collect, in real time, Tweets that mention specific keywords. To comply with Twitter’s Terms of Service, only the Tweet IDs of the collected Tweets are publicly released. The data is released for non-commercial research use.

Coronavirus (COVID-19) Tweets Dataset

Jawaharlal Nehru University


This dataset includes CSV files that contain IDs and sentiment scores of the tweets related to the COVID-19 pandemic. The tweets have been collected by an ongoing project deployed at https://live.rlamsal.com.np. The model monitors the real-time Twitter feed for coronavirus-related tweets using 90+ different keywords and hashtags that are commonly used when referencing the pandemic. This dataset was wholly redesigned on March 20, 2020, to comply with the content redistribution policy set by Twitter.
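A file of tweet IDs with sentiment scores can be split by polarity before any rehydration, which reduces the number of tweets to look up. The sketch below assumes a header row with hypothetical columns `tweet_id` and `sentiment`, and assumes scores centered on zero; the actual file layout and score range should be checked against the dataset's own documentation. The sample rows are made up.

```python
import csv
import io
from typing import Dict, List


def split_by_sentiment(csv_file, margin: float = 0.0) -> Dict[str, List[str]]:
    """Group tweet IDs into negative, neutral and positive buckets.

    A score above `margin` counts as positive, below `-margin` as
    negative, and anything in between as neutral. Column names are
    hypothetical.
    """
    buckets: Dict[str, List[str]] = {"negative": [], "neutral": [], "positive": []}
    for row in csv.DictReader(csv_file):
        score = float(row["sentiment"])
        if score > margin:
            buckets["positive"].append(row["tweet_id"])
        elif score < -margin:
            buckets["negative"].append(row["tweet_id"])
        else:
            buckets["neutral"].append(row["tweet_id"])
    return buckets


if __name__ == "__main__":
    # Made-up rows for illustration only.
    sample = io.StringIO("tweet_id,sentiment\n11,0.5\n12,-0.25\n13,0.0\n14,0.05\n")
    result = split_by_sentiment(sample, margin=0.1)
    for label, ids in result.items():
        print(label, ids)
```

Using a non-zero `margin` treats weakly scored tweets as neutral, which may be preferable when the scores come from an automatic model.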

Institutional and news media tweet dataset for COVID-19 social science research

Universitat Autònoma de Barcelona


Open-access data repository for a dataset of institutional and news media tweets during the COVID-19 pandemic

COVID-19 Reddit Algo-Tracker

Cornell University


COVID-19 content being promoted by Reddit algorithms

WMF COVID-19

Wikimedia Foundation


COVID-19 related content across Wikipedia projects

Coronavirus en YouTube

Universitat Politècnica de València


[ES] This dataset contains, on the one hand, the initial sample of 73,268 videos retrieved from YouTube for specific queries related to COVID-19 and Spain and, on the other hand, the final sample of 39,702 videos in which the term coronavirus, covid-19 or SARS-CoV-2 appears explicitly in the title or description of the video.

CORD-19: The Covid-19 Open Research Dataset

Allen Institute for AI


CORD-19 is a resource of over 200,000 scholarly articles, including over 100,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses. This freely available dataset is provided to the global research community to apply recent advances in natural language processing and other AI techniques to generate new insights in support of the ongoing fight against this infectious disease. There is a growing urgency for these approaches because of the rapid acceleration in new coronavirus literature, making it difficult for the medical research community to keep up.

COVID-19 Data Repository

Johns Hopkins University


Data repository for the 2019 Novel Coronavirus Visual Dashboard operated by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE). It is also supported by the ESRI Living Atlas Team and the Johns Hopkins University Applied Physics Lab (JHU APL).

Data on COVID-19

Our World in Data


The complete COVID-19 dataset is a collection of the COVID-19 data maintained by Our World in Data. It is updated daily and includes data on confirmed cases, deaths, hospitalizations, and testing, as well as other variables of potential interest.

COVID-19 World Survey Data API

University of Maryland


API for accessing the daily global Facebook symptoms survey data

Data for the Open COVID-19 Data Working Group

University of Washington


Location for summaries and analysis of data related to n-CoV 2019, first reported in Wuhan, China

CCCSL: CSH Covid-19 Control Strategies List

Complexity Science Hub


A structured open dataset of government interventions in response to COVID-19

Health Intervention Tracking for COVID-19 (HIT-COVID) Data

Boston University and Johns Hopkins University


The Health Intervention Tracking for COVID-19 (HIT-COVID) project tracks the implementation and relaxation of public health and social measures (PHSMs) taken by governments to slow transmission of SARS-CoV-2 globally. Hundreds of volunteer data contributors were trained and provided with standardized field definitions and access to an online forum for asking questions and sharing ideas.

Mozilla COVID dataset

Mozilla Foundation


Data about user browsing in Mozilla Firefox to understand social distancing

The CoVidAffect dataset

CoVidAffect project


Data from CoVidAffect, a nationwide citizen science project that aims to provide longitudinal data on mood changes following the COVID-19 outbreak in Spain

COVID-19 Mobility Monitoring project

ISI Foundation and Cuebiq


Data from the COVID-19 Mobility Monitoring project, which analyses anonymized location data to understand the effect of mobility restrictions and behavioral changes on the current international COVID-19 outbreak.

#Data4COVID19

The GovLab


A series of projects to identify, collect, and analyze the value that data can provide in responding to the ongoing COVID-19 pandemic