Data Scraping

Open tools for retrieving data from online platforms

This is a selection of open-source tools suggested by Mediafutures mentors. Participants are free to use these or other tools.

Twitter API

Twitter API official documentation


Through the Twitter API it is possible to retrieve public tweets about specific topics or query terms, and to monitor the debate in real time. The documentation includes tools and libraries for working with the API in different programming languages, as well as step-by-step tutorials.
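As an illustration of retrieving public tweets for a query term, here is a minimal sketch using only the Python standard library. It assumes the v2 "recent search" endpoint and bearer-token authentication; the function names are hypothetical, and endpoint availability and access levels depend on the current API terms, so check the official documentation before relying on it.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint: v2 recent search (public tweets from roughly the last week).
SEARCH_URL = "https://api.twitter.com/2/tweets/search/recent"


def build_search_request(query, bearer_token, max_results=10):
    """Build (but do not send) a request for recent tweets matching `query`."""
    params = urllib.parse.urlencode({"query": query, "max_results": max_results})
    return urllib.request.Request(
        f"{SEARCH_URL}?{params}",
        headers={"Authorization": f"Bearer {bearer_token}"},
    )


def search_recent_tweets(query, bearer_token, max_results=10):
    """Send the request and return the list of tweet objects (may be empty)."""
    request = build_search_request(query, bearer_token, max_results)
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return payload.get("data", [])
```

Calling `search_recent_tweets("climate change", token)` with a valid token would return a list of dictionaries, one per tweet. Building the request separately from sending it makes the query easy to inspect without network access.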

Reddit API

PRAW Python library


Python Reddit API Wrapper (PRAW) is a Python library for accessing Reddit data through the Reddit API, in compliance with the platform’s policies and the API’s rules.
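A short sketch of how PRAW might be used to collect the current "hot" posts of a subreddit. The credentials are placeholders you obtain by registering an application with Reddit, and the helper names `submission_to_row` and `fetch_hot_posts` are hypothetical. The `praw` import is placed inside the function so that the formatting helper can be used on its own.

```python
def submission_to_row(submission):
    """Keep a few public fields of a submission as a plain dict."""
    return {
        "id": submission.id,
        "title": submission.title,
        "score": submission.score,
        "num_comments": submission.num_comments,
    }


def fetch_hot_posts(subreddit_name, client_id, client_secret, user_agent, limit=10):
    """Return the current 'hot' submissions of a subreddit as a list of dicts."""
    import praw  # third-party: pip install praw

    reddit = praw.Reddit(
        client_id=client_id,
        client_secret=client_secret,
        user_agent=user_agent,
    )
    hot = reddit.subreddit(subreddit_name).hot(limit=limit)
    return [submission_to_row(submission) for submission in hot]
```

PRAW handles rate limiting for you, but a descriptive `user_agent` string is still expected by the Reddit API rules.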

Wikipedia API

Wikipedia-API Python library


Python library for querying the Wikipedia API. Through the API it is possible to retrieve data from Wikipedia in real time, including the text and metadata of any Wikipedia page, as well as images, links between articles, and the revision history.
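The following sketch shows one way to fetch a page's summary and outgoing links with the Wikipedia-API library. It assumes a recent release of the library, where the `Wikipedia` constructor takes a `user_agent` argument; the helper names and the user-agent string are hypothetical.

```python
def page_overview(page):
    """Summarize a page object as a plain dict (text, metadata, outgoing links)."""
    return {
        "title": page.title,
        "url": page.fullurl,
        "summary": page.summary,
        "links": sorted(page.links),  # page.links maps linked titles to page objects
    }


def fetch_page(title, user_agent, language="en"):
    """Return an overview of the page, or None if it does not exist."""
    import wikipediaapi  # third-party: pip install wikipedia-api

    wiki = wikipediaapi.Wikipedia(user_agent=user_agent, language=language)
    page = wiki.page(title)
    return page_overview(page) if page.exists() else None
```

For example, `fetch_page("Web scraping", "my-research-project (me@example.org)")` would return a dictionary for the English article, or `None` if no such page exists.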

Facebook Marketing API

pySocialWatcher Python library


A social data collector for the Facebook Marketing API. This tool queries the Facebook Marketing API for the number of Facebook users who share a specific interest, filtered by demographic criteria such as country, age range, gender, education level, language and citizenship.

Data scraping in Python

Scrapy


Fast, high-level web crawling and scraping framework for Python. Scrapy can be used to build a crawler that collects data from different websites.

Web corpus curation

Hyphe


Website crawler with a built-in web interface for exploration and control. It allows for creating a web corpus as a set of web pages and the links between them, curated and organized by the user. The exploration starts from an initial seed of one or more websites defined by the user, which is then expanded in one or more iterations.
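The seed-and-expand idea described above can be sketched as a breadth-first expansion over a link graph. This is a toy illustration and not Hyphe's implementation: the `outlinks` dictionary stands in for actual crawling, there is no user curation step between iterations, and all names are hypothetical.

```python
def expand_corpus(seeds, outlinks, iterations=1):
    """Grow a corpus from `seeds` by following links for `iterations` rounds.

    `outlinks` maps each site to the sites it links to (a stand-in for crawling).
    Returns the set of sites in the corpus and the set of (source, target) links.
    """
    corpus = set(seeds)
    links = set()
    frontier = set(seeds)
    for _ in range(iterations):
        discovered = set()
        for site in frontier:
            for target in outlinks.get(site, ()):
                links.add((site, target))
                if target not in corpus:
                    discovered.add(target)
        corpus |= discovered
        frontier = discovered  # only newly found sites are crawled next round
    return corpus, links
```

Each iteration crawls only the sites discovered in the previous one, so the number of iterations bounds how far the corpus can drift from the initial seed.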

YouTube data extraction

YouTube Data Tools


Collection of simple tools for extracting data from the YouTube platform via the YouTube Data API v3. The scripts that query the YouTube API can be run locally or launched directly online through a web interface.

Tumblr data extraction

Tumblr Tool


Script that extracts data from the microblogging and social networking website Tumblr, and allows for creating co-tag networks and tabular post stats. Data can be queried through a web interface.
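To make the notion of a co-tag network concrete: two tags are linked whenever they appear on the same post, and the link weight is the number of posts they share. The sketch below computes these weights from lists of tags; it is an illustration of the concept, not the Tumblr Tool's own code, and the function name is hypothetical.

```python
from collections import Counter
from itertools import combinations


def cotag_network(posts_tags):
    """Count how often each pair of tags appears together on the same post."""
    edges = Counter()
    for tags in posts_tags:
        # Normalize case and whitespace, and drop duplicates within a post.
        unique = sorted({tag.strip().lower() for tag in tags if tag.strip()})
        edges.update(combinations(unique, 2))
    return edges
```

The resulting counter maps alphabetically ordered tag pairs to weights, which can be exported as an edge list for network analysis software.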

Twitter Capture and Analysis Toolset

DMI TCAT


Toolset to retrieve tweets from the Twitter API, and to refine and analyze them in various ways. DMI-TCAT provides robust and reproducible data capture and analysis, and interlinks with existing analytical software.

Reconstructing Twitter ID datasets

Hydrator


Electron-based desktop application for hydrating Twitter ID datasets. The Twitter Terms of Service do not allow sharing full Twitter datasets, but tweet IDs can be shared. Hydrator reconstructs a whole Twitter dataset, in JSON or CSV format, from the tweet IDs by querying the Twitter API for the data corresponding to each ID.
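Hydration boils down to sending the shared IDs back to the API in batches. The sketch below shows only the batching step, under the assumption that the v2 tweet lookup endpoint is used and accepts up to 100 IDs per request; it is not Hydrator's own code, and the names are hypothetical.

```python
import urllib.parse

# Assumed endpoint and per-request limit for the v2 tweet lookup.
LOOKUP_URL = "https://api.twitter.com/2/tweets"
BATCH_SIZE = 100


def batches(tweet_ids, size=BATCH_SIZE):
    """Yield consecutive slices of `tweet_ids` with at most `size` items each."""
    for start in range(0, len(tweet_ids), size):
        yield tweet_ids[start:start + size]


def lookup_urls(tweet_ids, size=BATCH_SIZE):
    """Return one lookup URL per batch of tweet IDs."""
    return [
        f"{LOOKUP_URL}?{urllib.parse.urlencode({'ids': ','.join(batch)})}"
        for batch in batches(tweet_ids, size)
    ]
```

Each URL would then be requested with an authenticated client. Tweets that have been deleted or made private since the IDs were collected are not returned, so a hydrated dataset is usually smaller than the original.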