Digital Behavioral Data: Datasets

Social scientists are increasingly drawing on web data to analyze social behavior, opinion formation, cultural preferences, or political polarization. Collecting social media data and other digital behavioral data (DBD) up to the standards of social science research is a non-trivial task and often a challenge for individual researchers. GESIS develops innovative methods for the collection of digital behavioral data in the social sciences. In accordance with the proprietary and privacy restrictions that apply, we provide the resulting data for scientific re-use. GESIS offers a range of collected, curated, and augmented datasets; these data are transparent, ready to use, and often accompanied by additional materials or tools. We concentrate on topical data relevant to the social sciences, training data – e.g., for attribute or opinion detection – and large datasets that can be further mined for individual research purposes.

German Federal Elections

Topical Collection
Source: Twitter, Facebook

These datasets result from the social media monitoring of Facebook and Twitter during the German federal election campaigns of 2013 and 2017. The project collected the tweets and Facebook posts of political candidates and organizations, as well as user engagement with this content; we continue to cover the 2021 election.

2013 Data | 2017 Data | 2013 Report | 2017 Report | Tool | Paper

TweetsCOV19

Longitudinal Crawl
Source: Twitter

A semantically annotated corpus of tweets related to the COVID-19 pandemic, capturing online discourse about various aspects of the pandemic and its societal impact from October 2019 onwards. The dataset contains precomputed entity and sentiment annotations and extracted tweet metadata. The data are publicly available.

Description | Report | Data

'Call me sexist but' (CMSB)

Topical Collection, Training Data
Source: Twitter, Crowdsourced

The 'Call me sexist but' dataset (CMSB) is part of our work to analyze different dimensions of sexism in social media, including overt hostile sexism, 'benevolent' sexism, and more subtle forms that pose a particular challenge for automatic detection techniques. With this research we aim to improve methods for, e.g., addressing sexism on online platforms.

Data | Paper | GitHub | GitHub

Politicians in Wikipedia

Topical Collection
Source: Wikipedia, DBpedia

The dataset contains information about international politicians from DBpedia, including name, gender, nationality, and for many also their political party affiliation. The dataset is based on the English DBpedia dump from October 2015.
The data was used to create an interactive visualization of politicians' networks.

Data | Visualization

Historical Narratives

Topical Collection
Source: Wikipedia

These data allow mining timelines and detecting temporal focal points of written history across languages on Wikipedia. Articles related to the history of all UN member states were extracted and compared in 30 language editions. Our computational approach makes it possible to identify historical focal points quantitatively.

Data | Paper | Paper

TweetsKB

Longitudinal Crawl
Source: Twitter

TweetsKB is a public corpus of semantically annotated tweets based on a permanent Twitter crawl. The dataset currently contains more than 2.0 billion tweets, spanning February 2013 to the present. Metadata about the tweets, extracted entities, sentiments, hashtags, and user mentions are shared as a public knowledge base.

Description | Data | Paper

TokTrack Wikipedia

Platform Data
Source: Wikipedia

This dataset contains every instance of every token (≈ word) ever written in undeleted, non-redirect English Wikipedia articles up to October 2016, a total of 13,545,349,787 instances. We also offer "WikiWho", a service tool for tracking collaborative knowledge production on Wikipedia.

Data | WikiWhoTool | WikiWhoTutorial | Report