Open datasets and dataset collections by third parties
Components publication and research group
A collection of large media-related datasets, including 2.7 million news articles and essays from the last 7 years and 10,000 articles from the front page of the Times.
Data.europa.eu, the official portal for European data, is the access point for open data from Europe and its Member States. It already includes over 1 million datasets from 36 states, labelled and categorized according to different criteria. It is a good starting point when looking for datasets on a variety of topics.
Kaggle dataset collection
Kaggle (Google LLC)
Collaborative collection of datasets for data science and machine learning. It includes datasets shared by the community on a very wide range of topics, and with different licences and formats.
Planet.osm - OpenStreetMap data dumps
OpenStreetMap is a collaborative project to create a free editable geographic database of the world. Planet.osm is a weekly dump of the entire OpenStreetMap geographic database, covering the whole planet. Smaller files called extracts contain OpenStreetMap data for individual continents, countries, and metropolitan areas.
Webhose’s free datasets
A collection of free datasets that includes data from a range of different sources, languages and categories. Data sources include news articles, blog posts and online discussions.
Wikipedia data dumps
The Wikimedia downloads provide data dumps for all the Wikimedia projects, including Wikipedia in each language edition. The dumps are updated monthly, and include the latest version of the wiki, as well as the complete log of all the previous revisions of each page in XML format.
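Because the full-history dumps are very large XML files, they are usually read as a stream rather than loaded into memory. The following is a minimal sketch of that idea, assuming the standard MediaWiki export layout of `<page>` elements containing a `<title>` and one or more `<revision>` elements. The function name `iter_pages`, the helper `local_name` and the tiny `SAMPLE` document are illustrative assumptions, not part of the Wikimedia tooling.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace, e.g. '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, revision_count) for each <page> in a dump stream.

    `source` is a file name or a binary file object. Elements are
    cleared once processed so memory use stays low on large dumps.
    """
    title, revisions = None, 0
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "revision":
            revisions += 1
            elem.clear()  # drop the revision text we no longer need
        elif name == "page":
            yield title, revisions
            title, revisions = None, 0
            elem.clear()


# Hypothetical miniature dump, used only to demonstrate the function.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example</title>
    <revision><id>1</id><text>first</text></revision>
    <revision><id>2</id><text>second</text></revision>
  </page>
  <page>
    <title>Other</title>
    <revision><id>3</id><text>only</text></revision>
  </page>
</mediawiki>"""

for page_title, count in iter_pages(io.BytesIO(SAMPLE)):
    print(page_title, count)
```

For a real dump, which is normally distributed compressed, a file object opened with `bz2.open(path, "rb")` could be passed to `iter_pages` in place of the in-memory sample. The path and the choice of dump file are left to the reader.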
World news media - GDELT
The GDELT Project monitors the world's broadcast, print, and web news from nearly every corner of every country in over 100 languages and identifies the people, locations, organizations, themes, sources, emotions, counts, quotes, images and events driving our global society every second of every day, creating a free open platform for computing on the entire world.
MediaFutures is funded by the European Union's Horizon 2020 Programme, under grant agreement number 951962. MediaFutures is a Europe-wide consortium. This website is managed on behalf of the consortium by Eurecat, whose main address is Carrer de Bilbao, 72, 08013 Barcelona (Spain).