Data Collection

Open tools for retrieving data from online platforms

DMI-TCAT

Twitter Capture and Analysis Toolset


Toolset to retrieve tweets from the Twitter API, and to refine and analyze them in various ways. DMI-TCAT provides robust and reproducible data capture and analysis, and interlinks with existing analytical software.

Facebook Marketing API

pySocialWatcher Python library


Social data collector for the Facebook Marketing API. It queries the API for the number of Facebook users who share a specific interest, filtered by demographic criteria such as country, age range, gender, education level, language, and citizenship.

Hydrator

Reconstructing Twitter ID datasets


Electron-based desktop application for hydrating Twitter ID datasets. The Twitter Terms of Service do not allow sharing full Twitter datasets, but tweet IDs can be shared. Hydrator reconstructs a full Twitter dataset, in JSON or CSV format, from a list of tweet IDs by querying the Twitter API for the data corresponding to each ID.
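Hydration usually means sending tweet IDs to the API in groups rather than one at a time. The sketch below shows only that batching step and makes no network calls. The function name batch_ids and the default batch size of 100 are assumptions for illustration (100 matches the per-request limit of Twitter's ID lookup endpoints); this is not Hydrator's own code.

```python
from typing import Iterable, Iterator, List


def batch_ids(tweet_ids: Iterable[str], batch_size: int = 100) -> Iterator[List[str]]:
    """Yield lists of at most `batch_size` tweet IDs, skipping blank entries.

    Hypothetical helper: each yielded batch would be sent to the API
    in a single lookup request by the calling code.
    """
    batch: List[str] = []
    for tweet_id in tweet_ids:
        tweet_id = tweet_id.strip()
        if not tweet_id:
            continue  # ignore empty lines in an ID file
        batch.append(tweet_id)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly shorter, batch
```

A file with one tweet ID per line can be passed in directly, since an open file is an iterable of lines.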

Hyphe

Web corpus curation


Website crawler with a built-in web interface for exploration and control. It builds a web corpus as a set of web pages and the links between them, curated and organized by the user. Exploration starts from an initial seed of one or more websites defined by the user and is expanded over one or more iterations.
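The seed-and-expand idea can be illustrated on a toy link graph held in memory. This is a minimal sketch under stated assumptions, not Hyphe's implementation: the function expand_corpus and the links dictionary are hypothetical, nothing is crawled, and the manual curation step that Hyphe gives the user is left out.

```python
from typing import Dict, List, Set


def expand_corpus(links: Dict[str, List[str]], seeds: List[str], iterations: int) -> Set[str]:
    """Grow a corpus from seed sites by following outgoing links.

    `links` maps each site to the sites it links to (assumed already known).
    Each iteration adds the sites linked from the previous iteration's
    newly discovered sites.
    """
    corpus: Set[str] = set(seeds)
    frontier: Set[str] = set(seeds)
    for _ in range(iterations):
        discovered: Set[str] = set()
        for site in frontier:
            for target in links.get(site, []):
                if target not in corpus:
                    discovered.add(target)
        if not discovered:
            break  # nothing new to add, stop early
        corpus |= discovered
        frontier = discovered
    return corpus


# Toy link graph with made-up site names
toy_links = {
    "a.org": ["b.org", "c.org"],
    "b.org": ["d.org"],
    "c.org": ["a.org"],
    "d.org": ["e.org"],
}
one_step = expand_corpus(toy_links, ["a.org"], iterations=1)
two_steps = expand_corpus(toy_links, ["a.org"], iterations=2)
```

In a curated workflow, the user would review the newly discovered sites after each iteration and keep only the relevant ones before expanding again.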

Reddit API

PRAW Python library


Python Reddit API Wrapper (PRAW) is a Python library for accessing Reddit data through the Reddit API, in compliance with the platform’s policies and the API’s rules.

Scrapy

Data scraping in Python


Fast, high-level web crawling and scraping framework for Python. Scrapy is useful for building a crawler that collects data from multiple websites.

Tumblr Tool

Tumblr data extraction


Script that extracts data from the microblogging and social networking website Tumblr, and allows for creating co-tag networks and tabular post stats. Data can be queried through a web interface.
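A co-tag network links two tags whenever they appear on the same post, with the edge weight counting how many posts they share. The sketch below builds such weighted edges from lists of tags. It is an illustration under assumptions, not the tool's own code: the function cotag_edges, the lowercase normalization, and the sample posts are all hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, List, Tuple


def cotag_edges(posts: Iterable[List[str]]) -> Counter:
    """Count how often each pair of tags occurs together on a post.

    Tags are lowercased and de-duplicated per post (an assumption of this
    sketch), and each pair is stored in alphabetical order.
    """
    edges: Counter = Counter()
    for tags in posts:
        unique = sorted({tag.strip().lower() for tag in tags if tag.strip()})
        for pair in combinations(unique, 2):
            edges[pair] += 1
    return edges


# Made-up posts, each represented by its list of tags
sample_posts = [
    ["art", "photo"],
    ["Art", "photo", "travel"],
    ["travel"],
]
edges = cotag_edges(sample_posts)
```

The resulting counter can be written out as an edge list (source, target, weight) for use in network analysis software.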

Twarc

Command line tool and Python library for collecting and archiving Twitter JSON data via the Twitter API


twarc has separate commands, twarc and twarc2, for working with the older v1.1 API and with the newer v2 API and Academic Access, respectively. It also has an ecosystem of plugins for processing the collected data. While its primary use is academic, it works just as well with the "Standard" v2 API and the "Premium" v1.1 API.

Twitter API

Twitter API official documentation


Through the Twitter API it is possible to retrieve public tweets about specific topics or query terms, and to monitor the discussion in real time. The documentation includes tools and libraries for working with the API in different programming languages, as well as step-by-step tutorials.

Wikipedia API

Wikipedia-API Python library


Python library for easily querying the Wikipedia API. Through the API it is possible to retrieve data from Wikipedia in real time, including the text and metadata of any Wikipedia page, as well as images, links between articles, and the revision history.
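To show what a request for page text looks like underneath, the sketch below builds a query URL for the MediaWiki Action API using only the standard library. It deliberately does not use the Wikipedia-API library and sends no request. The function name build_extract_url is hypothetical, and the choice of the English Wikipedia endpoint and of plain-text extracts is an assumption for illustration.

```python
from urllib.parse import urlencode

# English Wikipedia endpoint of the MediaWiki Action API (assumed for this sketch)
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_extract_url(title: str) -> str:
    """Return a URL requesting the plain-text extract of one page as JSON."""
    params = {
        "action": "query",     # use the query module
        "prop": "extracts",    # ask for page text extracts
        "explaintext": 1,      # plain text instead of HTML
        "titles": title,       # page title to look up
        "format": "json",      # response format
    }
    return API_ENDPOINT + "?" + urlencode(params)


url = build_extract_url("Data science")
```

Fetching the URL with any HTTP client returns a JSON document; a wrapper library hides this step and exposes the page text and metadata as Python objects.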

YouTube Data Tools

YouTube data extraction


Collection of simple tools for extracting data from the YouTube platform via the YouTube API v3. The scripts that query the API can be run locally or launched directly online through a web interface.