GESIS Guides to Digital Behavioral Data
This guide offers an introduction to digital behavioral data (DBD), a rich new source for social science research providing insights into human and algorithmic behavior in the digital age. DBD encompass digital observations of interactions, recorded by online platforms or sensors like mobile phones, cameras, or wearables.
These data offer new opportunities to study societal developments and individual behavior in a granular, unobtrusive, and multi-dimensional way. By leveraging DBD, researchers can gain a deeper understanding of the dynamics of digital societies and investigate a wide range of social and socio-technical phenomena, from political polarization to online health behaviors or algorithmic inequalities.
We outline a definition, features, and types of DBD. We differentiate between organic (often also called found) and designed DBD, highlighting the varying levels of control over the data generation process that researchers may possess. We further distinguish the collection of DBD via platform-centered and user-centered research designs. Finally, we discuss the advantages and disadvantages of different types of DBD and provide concrete research examples, showing that DBD present significant opportunities for social scientists to gain valuable insights into the rapidly evolving digital media landscape and its impact on society.
1 Definitions and scope
Digitalization has changed almost all areas of life, be it communication, work, learning and teaching, shopping, traveling, or finding a job, a partner, or any piece of information. We all interact in and with socio-technical systems in which algorithms play a decisive role. Thus, digital behavioral data have become a primary source for monitoring and analyzing the structural development of digitalized societies and the behavior of humans therein.
With digital behavioral data, we can tackle well-known social science research questions with new forms of data, get hold of data in spaces that are left blank by traditional quantitative methods, and, most notably, study in vivo the major changes in private and public spheres that digitalization is bringing about, such as the effects of social media and Artificial Intelligence (AI) on democracies, social cohesion, or individual well-being.
In particular since the emergence of large online platforms (Poell et al., 2019), social life “in the network” (Lazer et al., 2009) leaves vast amounts of digital traces that might be noisy and unstructured but can be turned into research data for leveraging new insights into long-standing social science research questions and novel phenomena that emerge in socio-technical systems.
We call these research data “digital behavioral data” (DBD). DBD encompass digital observations of human or algorithmic behavior. DBD are generated (1) through interactions and content production online (e.g., on platforms such as Google, Facebook or websites on the World Wide Web) or (2) by software or sensors for recording specific processes (e.g., smartphones, RFID sensors, satellites, street view cameras, or web tracking).
What we regard as DBD might change and needs to be expanded in the future. In particular, the “behavioral” dimension of the data might undergo critical revisions. While patterns of behavior have traditionally been studied for human agents, be it individual or collective agents, we now also have to consider “algorithmic behavior” or “machine behavior” (Rahwan et al., 2019). This is not an attempt to “humanize” technical systems; however, we conceptually need to consider that those systems are not fixed structures but are constantly changing, adapting, and interacting with (and, increasingly, in place of) humans. In that respect, we do not merely observe human agents behaving within a technical system, but human agents, technical structures, and algorithms co-producing socio-technical environments that structure social interactions.
There are overlaps with the concept of “big data”, as digital behavioral data often, but not always, share the “four V” characteristics (volume, variety, velocity, and veracity) (Fröhling et al., 2023; Kohne et al., 2021). There are also overlaps of DBD with the concept of “digital trace data”, with digital traces being the basis for measuring and explaining digital behavior. In our understanding of digital behavioral data, we focus on the societally relevant aspects of human and algorithmic behavior and the research perspectives derived from these. It is this social science perspective, asking for “theoretically interesting constructs” and “societal relevance”, that turns data from mere traces into meaningful information that can be used to ask and answer social science research questions (Howison et al., 2011).
The use of DBD is closely connected to the research field of computational social science that has formed over the last one and a half decades. In their seminal paper of 2009, Lazer and colleagues outlined how the – then new – field is based on large-scale data, how it is data-driven, and how it can reveal “patterns of individual and group behavior, with the potential to transform our understanding of our lives, organizations, and societies” (Lazer et al., 2009). In 2020, a sequel to the paper re-examined the now prospering field; there, the term “behavioral data” became even more central and was explicitly highlighted as an attribute of the data in focus (Lazer et al., 2020).
2 Features of digital behavioral data
A lot has been written about the promises and challenges that digital behavioral data hold for social science research. Advocates as well as critics of DBD usually start with an account of features and characteristics that often – sometimes implicitly – compare these new types of data to more traditional data like surveys and their features or the standards applied to them. Since the field is still rather new and evolving, there is no such thing as one canonical understanding of DBD and their core features. In this guide, we will highlight a few working definitions and outline properties we consider helpful. These accounts might overlap or emphasize different aspects, but together they add up to a more detailed picture that comes close to a comprehensive account of the data at stake.
The four (or more) “Vs” of big data are the best-known shortcut for big data characteristics: Volume (large-scale data), variety (structured or not, dynamic, etc.), velocity (real-time or close to), and veracity (authenticity, trustworthiness); sometimes value (benefits that can be derived from the data) is added which makes it five. Moreover, there are numerous attempts to define and expand the Vs beyond that.
Conceptual overviews and introductions to the field have emphasized the innovative characteristics of DBD (Golder & Macy, 2014; Lazer et al., 2009, 2020; Salganik, 2019; Strohmaier & Wagner, 2014; Wagner et al., 2021). If we compare these approaches, we find similar characteristics under various labels that may be synthesized as follows:
- volume, voluminous, big – refer to scale,
- veracity, unobtrusive, reliable, nonreactive – refer to the presumed “authentic” nature of data that is not confounded by an observational situation or experimental setup,
- velocity, always-on, real-time – refer to the continuous but contingent stream of data,
- variety, unstructured, incomplete, drifting, dirty – refer to the difficulties of collecting data up to the standards of social science research,
- nonrepresentative, unstructured, multimodal – refer to the data not having been created for research purposes but being “found”,
- algorithmically confounded – refers to the challenge that it is often difficult to tell what aspects of the data are the result of algorithmic processes, and to what extent the data mirror system affordances or reflect human behavior.
These features of DBD especially apply to the big “found” data from online platforms that long have dominated research – both igniting the enthusiasm about these new data sources and the critique of them not meeting the quality standards of social science research.
3 Types of digital behavioral data and their collection modes
Various typologies exist to categorize different research designs and modes of collecting DBD. In this chapter, we outline the differences between found vs. designed DBD as well as platform-centered vs. user-centered data collection approaches (Lazer et al., 2020). We aim to synthesize the two typologies and outline their cross-relations with existing modes of DBD collection.
3.1 Types of digital behavioral data
Found data vs. designed data. Since the 1960s, researchers have distinguished between unobtrusive and obtrusive research methods (Webb et al., 1966). Unobtrusive (or non-reactive) methods allow social scientists to study human behavior without any interventions or direct interaction with study subjects. Because researchers do not influence the data generation process, these methods offer higher ecological validity but they provide researchers with only limited degrees of control over their research design. In the case of DBD, researchers often differentiate between data that is collected with unobtrusive methods (so-called “found data” or “organic data”) and more obtrusive data collections through technical interventions in or the creation of the environment in which the data is generated (so-called “designed data”) (Strohmaier & Wagner, 2014).
DBD qualify as found data when collected from technical systems (e.g., social media or web platforms) without controlling the data generation process. Typical examples are the traces that humans leave on Facebook, WhatsApp, or YouTube as a byproduct of their interactions with online platforms. Big platform data are usually, but not always, collected via Application Programming Interfaces (APIs) or web scraping, without the users of the platform contributing to the data collection process (see section 3.2 for a more detailed discussion of different data collection modes).
DBD can be considered designed data when the research design controls the data generation process. Typically, researchers use specific research infrastructures such as online panels (e.g., the GESIS Panel.dbd) or artificial online environments (e.g., to mimic social media platforms) for collecting designed digital data. The data generation process is controlled by the researcher, e.g., via strategic recruitment of participants. A more specific variant of designed digital data collection is the online experiment, where researchers manipulate a targeted aspect of a technical environment or the user experience. Here, too, the researcher controls the data generation process via the research design.
In sum, both types of DBD – found and designed – come with different affordances for research practice and they build on different modes of data collection. Both types of DBD also require different procedures with regard to privacy, data protection, data quality assurance and research ethics (Breuer et al., 2025).
Platform-centered vs. user-centered data collections. Another taxonomy is the distinction between platform-centered vs. user-centered data collection approaches. Here, the exclusive criterion for differentiating the two paradigms is the data collection approach, not the research interest or other criteria. The predominant approach in computational social science (CSS) has been platform-centered data collection, where researchers create a sample of platform data based on theoretically relevant entities like users, topics, hashtags, search queries, or time. Web scraping and APIs are the predominant methods for collecting platform-centered data (Table 1, cell 1). Platform collaborations (e.g., Social Science One), in which researchers get access to some parts of the platform data, are another approach that – depending on the implementation of the Digital Services Act of the European Union – might become more prominent in the future. For designed platform-centered data collections, selected groups of researchers have also cooperated with platforms to conduct experiments on the platform (Table 1, cell 2). Such collaborations typically require researchers to work closely with a company and its research units.
Platforms’ APIs and regulations (such as terms of service) limit the types of DBD that one can collect from platforms. Platform-centered data collections typically do not allow for drawing inferences about broader target populations, but they enable platform-specific, platform-comparative, and longitudinal studies of socio-technical phenomena such as the spread of misinformation or online campaigning by politicians. Against the backdrop of severe restrictions to data access imposed by platforms, researchers have started to devise their own solutions for collecting relevant DBD in the “post-API age” (Freelon, 2018). User-centered data collections typically require the data owner to actively contribute to the data collection – e.g., by giving informed consent, installing research software, and replying to survey questions. User-centered data collections link DBD with individual-level information from a potentially wide range of survey instruments, e.g., on demographics, attitudes, socio-psychological traits, trust, or evaluations of other societal groups (Stier et al., 2020). The unit of analysis is, in most cases, the individual person. If the representativity of the sample can be ensured, such designs also allow for making inferences about specific target populations of interest (Table 1, cell 4).
A typical workflow starts with a systematic sampling and recruitment of participants, either from online access panels, crowd platforms, social media ads, or offline settings. Participants are then surveyed about their demographics and key variables of interest related to the project’s research interest, before participants grant their informed consent and install research software, follow a data donation workflow, or participate in an online experiment. In most user-centered data collections, researchers therefore have full control over the data generation process. However, variants of found user-centered data collections do exist (Table 1, cell 3), for instance, “citizen science” projects where interested citizens can self-select into donating data, without a systematic sampling or any survey component.
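A core step of this workflow is linking each participant's survey responses with their behavioral records. The sketch below illustrates the idea with a simple merge on a pseudonymous participant ID; the field names (`pid`, `age_group`, `misinfo_domains_visited`) and record structures are hypothetical stand-ins, not part of any specific study.

```python
# Hypothetical records: in a real study, one table would come from the
# survey platform and the other from the tracking or donation tool,
# both keyed by a shared pseudonymous participant ID.
survey = [
    {"pid": "p01", "age_group": "18-29", "party_id": "A"},
    {"pid": "p02", "age_group": "30-49", "party_id": "B"},
]
tracking = [
    {"pid": "p01", "misinfo_domains_visited": 4},
    {"pid": "p02", "misinfo_domains_visited": 0},
]

def link_by_pid(survey_rows, dbd_rows):
    """Merge survey responses with behavioral measures on participant ID,
    so that individual-level covariates can explain observed behavior."""
    dbd_index = {row["pid"]: row for row in dbd_rows}
    return [
        {**s, **dbd_index[s["pid"]]}
        for s in survey_rows
        if s["pid"] in dbd_index
    ]

linked = link_by_pid(survey, tracking)
print(linked[0]["age_group"], linked[0]["misinfo_domains_visited"])
```

Keeping the link key pseudonymous, as sketched here, is also what allows the survey and the behavioral data to be stored and processed separately for data protection purposes.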
Table 1 synthesizes the most important collection modes for DBD and how they map with the taxonomies discussed above.
| | Found data (non-reactive; no control or interruption of the data-generation process) | Designed data (researchers control and influence the data-generation process) |
|---|---|---|
| Platform-centered data collection (without user collaboration) | (1) | (2) |
| User-centered data collection (with user collaboration) | (3) | (4) |

Table 1. Types of digital behavioral data and their collection modes

To illustrate, consider the example of the burgeoning field of research on the prevalence of misinformation on online platforms. A researcher could collect platform data identifying all accounts that posted certain pieces of misinformation and their social network connections from a platform like X/Twitter, allowing them to create a representative picture of the spread of misinformation narratives and how much engagement these generate on the platform. Alternatively, researchers could recruit a sample of X/Twitter users via a survey and ask them to donate information such as the accounts the participants were following or the tweets they were seeing. Linked with the survey responses, such a dataset would allow for estimating which variables, such as socio-demographics or party identification, predict exposure to misinformation on X/Twitter. This user-centered approach would allow for explaining individual engagement with misinformation, but it would not enable the holistic understanding of the phenomenon in its entirety that a platform-centered approach can provide. This example demonstrates that both approaches can be applied in the same research area but are suitable for different types of research questions.
3.2 Modes of collecting digital behavioral data
In the past, there have been three main pathways to obtain DBD for research: (1) self-administered data collections, (2) collaborating with companies, and (3) buying data. The latter two models reflect institutional inequalities in the field, as few researchers are in the privileged position to collaborate with companies; a recent and notable example is the collaboration of independent researchers with Meta to study the use and effects of Facebook and Instagram during the 2020 US election (Guess et al., 2023). Some researchers also subscribe to paid data provision models like the one recently introduced by the platform X. By far the most prevalent approach, however, is the self-administered collection of online data, which we will cover in more detail in this section.
APIs. With web APIs, platform data can be accessed in an automated and structured way. Most APIs are provided by the platforms themselves, and access is often tiered, with free and paid options for which users often must go through a vetting process. How an API functions can be opaque, e.g., how the data from the API is sampled or moderated (Pfeffer et al., 2018). The provenance of the data might be unknown, and APIs can change over time or get shut down, posing a challenge to researchers who need comparable and longitudinal data. While APIs often enable researchers to easily access public data from social media platforms, they rarely provide access to all their data in unrestricted quantities – Wikipedia being a notable exception in that regard. Accordingly, some information may be visible to researchers in a browser but not available through the corresponding API. Web APIs are usually accessed via wrappers that interact with the API through more common programming languages, such as Python or R. Essentially, API wrappers are built on top of the APIs (Soldner et al., 2024).
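The typical pattern behind API-based collection, whether accessed directly or through a wrapper, is a loop that follows pagination cursors while respecting rate limits. The sketch below illustrates that pattern under assumptions: `fetch_page` is a stand-in that simulates a paginated API response, and the response fields (`items`, `next_cursor`) are hypothetical, as real APIs name and structure these differently.

```python
import time

def fetch_page(cursor=None):
    """Stand-in for a real API call (hypothetical endpoint and format).
    A real implementation would send an HTTP request and return the
    parsed JSON response."""
    pages = {
        None: {"items": [{"id": 1}, {"id": 2}], "next_cursor": "p2"},
        "p2": {"items": [{"id": 3}], "next_cursor": None},
    }
    return pages[cursor]

def collect_all(delay_seconds=0.0):
    """Follow pagination cursors until the API reports no further page.
    A short delay between requests respects typical rate limits."""
    items, cursor = [], None
    while True:
        response = fetch_page(cursor)
        items.extend(response["items"])
        cursor = response["next_cursor"]
        if cursor is None:
            return items
        time.sleep(delay_seconds)

records = collect_all()
print(len(records))  # 3 records collected across two pages
```

In practice, such loops also need error handling for changed or discontinued endpoints, which, as noted above, is a recurring problem for longitudinal data collections.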
Web scraping. Web scraping is conducted via tools or scripts that capture the content of websites (“scrapers”), which can be complex due to the heterogeneous and continuously changing structures of websites. Thus, scrapers often require more time to be implemented and maintained but leave researchers with large degrees of freedom with regard to the web data they aim to collect. With a custom scraper, the data collection procedure is also very transparent (Soldner et al., 2024).
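As a minimal illustration of what a scraper does, the sketch below extracts all hyperlink targets from an HTML document using only Python's standard library. The inline HTML snippet is an invented example; a real scraper would fetch pages over HTTP and typically use a dedicated parsing library.

```python
from html.parser import HTMLParser

class LinkScraper(HTMLParser):
    """Collect the targets of all <a href="..."> tags on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

# In practice the HTML would come from an HTTP request; here we use
# a small inline document for illustration.
html_page = (
    '<p>See <a href="/about">about</a> and '
    '<a href="https://example.org">example</a>.</p>'
)
scraper = LinkScraper()
scraper.feed(html_page)
print(scraper.links)
```

The maintenance burden mentioned above arises exactly here: whenever the target website changes its markup, the tag- and attribute-matching logic has to be adapted.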
Smartphone-based data collection. Since people use their smartphones throughout the day, DBD can be gathered in situ and in real time via dedicated research apps. Measurements such as app usage, location patterns, or interaction histories are readily available in the predominant operating systems iOS and Android and capture valuable ecologically valid behavioral traces for research (Lux & Wieland, 2025). DBD collected via smartphones are especially powerful in combination with a mobile experience sampling component, allowing researchers to survey participants multiple times over the course of a day (Otto et al., 2022).
Data donation. Researchers have recently started to use data donations to collect DBD. Data donations are made possible by Art. 15 of the General Data Protection Regulation (GDPR), which grants natural persons the right to obtain a copy of their personal data from a controller that is processing their personal data (Boeschoten et al., 2022). Study participants can request an electronic copy of their data, receive it by email or as a download and can transfer their data packages to researchers. Data donation has been described as “partnering with users to collect big data” (Halavais, 2019) and is particularly helpful if access to the data is otherwise not possible or difficult, as is the case with platform APIs.
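Before a donated data package is transferred to researchers, it is usually filtered down to the study-relevant behavioral records, with directly identifying fields removed. The sketch below illustrates this filtering step; the export structure (`profile`, `watch_history`) is entirely hypothetical, as real GDPR exports differ substantially between platforms.

```python
import json

# Hypothetical structure of a donated data package; real exports
# obtained under Art. 15 GDPR vary considerably across platforms.
raw_export = json.dumps({
    "profile": {"name": "Jane Doe", "email": "jane@example.org"},
    "watch_history": [
        {"video_id": "abc", "timestamp": "2025-01-03T10:15:00"},
        {"video_id": "def", "timestamp": "2025-01-03T11:02:00"},
    ],
})

def extract_research_data(package_json):
    """Keep only the behavioral records relevant to the study and drop
    directly identifying profile fields before transfer to researchers."""
    package = json.loads(package_json)
    return {"watch_history": package["watch_history"]}

donation = extract_research_data(raw_export)
print(len(donation["watch_history"]))
```

In actual data donation tools, this filtering typically happens locally on the participant's device, so that identifying information never reaches the research team.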
Web tracking. Web tracking refers to the process of collecting detailed behavioral data through research tools, typically browser extensions that participants install on their devices (Mangold & Stier, 2025). These tools capture users' web activity in real time, tracking the websites they visit, the content they interact with, and their browsing habits over time. As a technique for studying digital behavior, web tracking offers immediate, granular insights into online interactions, providing a level of detail that sets it apart from other research methods.
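A common first step when analyzing such tracking logs is aggregating raw URL-level visits to the domain level. The sketch below shows this with Python's standard library; the log format (timestamp, URL pairs) and the example domains are invented for illustration, as real tracking extensions record richer data.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical browsing log; real web tracking tools capture far more
# detail (referrers, tab activity, dwell times, etc.).
visits = [
    ("2025-01-03T10:00:00", "https://news.example.org/article1"),
    ("2025-01-03T10:05:00", "https://social.example.com/feed"),
    ("2025-01-03T10:20:00", "https://news.example.org/article2"),
]

def domain_counts(log):
    """Aggregate URL-level visits to domain-level visit counts, a common
    first reduction step when analyzing web tracking data."""
    return Counter(urlparse(url).netloc for _, url in log)

counts = domain_counts(visits)
print(counts.most_common(1))
```

Aggregations like this also reduce the privacy sensitivity of the data, since full URLs often contain personal information that domain-level measures do not.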
Integrated data collection infrastructures for DBD. In addition to these self-administered data collections conducted by individual research teams, infrastructure institutions such as GESIS are building up panel infrastructures that support researchers in collecting DBD and make DBD datasets available for secondary research. The GESIS Panel.dbd Digital Behavioral Data Sample is an example of a data collection infrastructure that combines an online panel and repeated surveys with DBD such as smartphone-based data collections, web tracking, or data donations.
4 Social science research with digital behavioral data
Why should social scientists be interested in using DBD? A compelling argument is the widespread adoption of digital devices and social media and their potential influence on individuals and society. Social media platforms such as Facebook and TikTok collectively have approximately 5.2 billion users worldwide, representing about 64% of the global population (as of January 2025) (Larson, 2025). Their widespread reach and influence on what information individuals are exposed to, how they form opinions, and how they interact with others make them highly relevant to the social sciences.
Researchers have highlighted that social media has evolved into a critical tool for civic engagement. It serves as a platform for organizing protests, raising awareness about pressing social issues, and mobilizing public support. For example, movements like the Arab Spring, Black Lives Matter, and #MeToo gained international momentum largely through their visibility and collective organization via social media platforms. The decentralized, anonymous, user-driven nature of these tools allows marginalized voices to be amplified and reach global audiences, often bypassing traditional media gatekeepers. However, these same features also make social media a breeding ground for anti-social behavior. This highlights that social media not only fosters positive civic participation but can also enable harmful forms of communication and societal processes, such as discrimination of minorities or radicalization.
Social media platforms have become increasingly important during election campaigns around the world. Political candidates and parties use platforms like Facebook, X/ Twitter, and Instagram to communicate directly with voters, spread campaign messages, and engage in targeted advertising. Data-driven algorithms enable campaigns to micro-target specific demographics, tailoring content to influence voter behavior. Research has shown that social media can both mobilize voters and contribute to political polarization, raising critical questions about its role in democratic processes. Moreover, the use of social media has been linked to mental health and well-being. While these platforms offer opportunities for connection and expression, there is a lively scholarly debate on their association with increased levels of anxiety, depression, and loneliness, especially among adolescents and young adults.
To study collective or platform-specific phenomena such as civic engagement, radicalization, or discrimination, and dynamic societal processes (e.g., the spread of information over time), researchers typically use found platform-centered data collections. The advantages of platform data for this research interest are twofold: First, the relevant group of users that contribute to the phenomenon of interest is typically hard to reach (e.g., activists, politicians, extremists); and second, the study of dynamic societal processes requires snowball sampling (i.e., groups of users that are connected). For example, Table 2 (cell 1) highlights a study that explores how protest information spreads in online networks using found platform data from Twitter (González-Bailón et al., 2013). The study provides insight not only into the prevalence of protest information on social media but also into the dynamic social process of how this information spreads. Studies that audit platforms also often use found, platform-centered data. For example, Hannák et al. (2017) show gender and racial inequalities in online freelancer platforms using found platform-centered data that was obtained via web scraping.
Found user-centered data collections shown in Table 2 (cell 3) can be used to study the phenomena from a consumer rather than a producer perspective (i.e., exploring patterns of exposure to dehumanizing content instead of studying who produces it). Platforms such as Open Humans or NGOs such as Algorithm Watch host open data donation calls to which everyone can contribute. For example, data donations of individuals from Fitbit and Apple HealthKit projects are currently used to support health research on physical inactivity as a risk factor for preventable chronic conditions (Greshake Tzovaras et al., 2019). Other examples are TikTok data donation initiatives for the 2025 Australian and German elections promoted by media organizations (Fell & Tan, 2025).
To explore causal effects of social platforms on individuals or groups, designed platform- or user-centered data collections are typically used since they allow for controlling the data generation process, e.g., via strategic sampling or experimental treatments of selected participants. Table 2 (cell 2) highlights a multi-wave field experiment that was conducted in cooperation with Meta on Facebook (Nyhan et al., 2023), where the authors modified the data generation process by changing a platform feature. Researchers tested the effect of exposure to content from like-minded sources during the 2020 US presidential election. They found that exposure to content from like-minded sources on social media is very common but that experimentally reducing its prevalence did not affect levels of polarization, beliefs, or attitudes. Since no user collaboration (besides signing informed consent) was necessary for the collection of DBD, we consider these data as designed platform-centered data.
Designed user-centered data collections require user collaboration and control over the data generation process, e.g., by randomly sampling participants from well-defined populations, measuring and blocking confounders, or randomly assigning participants to treatments. Table 2 (cell 4) highlights a study (Eichstaedt et al., 2018) that recruited study participants from patients visiting a large urban academic emergency department and combined Facebook data donations and medical records of patients to explore whether the language of social media posts contains early signals of depression.
Another example of a designed user-centered data collection is the study by Salganik et al. (2006), who created an artificial online market in which a convenience sample of 14,341 participants downloaded previously unknown songs either with or without knowledge of previous participants' choices. The results show that increasing the strength of social influence affected both the inequality and unpredictability of success. A study in the context of the COVID-19 pandemic combined online panel surveys with mobile tracking data to measure the usage of Germany’s official contact tracing app and the effects of interventions (Munzert et al., 2021).
Table 2 lists exemplary studies that fill the categories from Table 1.
| | Found data | Designed data |
|---|---|---|
| Platform-centered data collection | (1) Study of how protest information spreads in online networks (González-Bailón et al., 2013). Audit of gender and racial inequalities on online freelancer platforms (Hannák et al., 2017) | (2) Experiment that tests the effect of reducing exposure to content from like-minded sources (Nyhan et al., 2023) |
| User-centered data collection | (3) Data donations from Fitbit and Apple HealthKit projects used for health research (Greshake Tzovaras et al., 2019). TikTok data donation initiatives for the 2025 Australian and German elections (Fell & Tan, 2025) | (4) Data donations from patients showed that social media data can predict depression (Eichstaedt et al., 2018). Experimental study of inequality and unpredictability in an artificial cultural market (Salganik et al., 2006). Study on tracking and promoting the usage of a COVID-19 contact tracing app (Munzert et al., 2021) |

Table 2. Examples from digital behavioral data research
5 Challenges and outlook
Digitalization has transformed nearly every aspect of our lives, making the demand for diverse and methodologically rigorous research more critical than ever. By leveraging both found and designed, as well as platform- and user-centered data collections, social scientists can more effectively investigate the broad societal impacts of digital platforms. Each approach offers unique advantages and limitations, and thoughtful methodological choices are essential for producing valid, ethical, and impactful insights. While DBD offers many opportunities to social scientists, some challenges persist and require special attention.
Data quality. A challenge for social scientists who are used to working with well-documented, structured data that have been collected using established procedures lies in the nature of DBD – their volume, velocity, variety, and veracity. DBD are often noisy and unstructured, collected rather ad hoc using black-box instruments, and may suffer from rapid depreciation as platforms and their data sharing strategies change. Consequently, collecting DBD in a systematic and transparent way, harmonizing and integrating DBD from different sources, and assessing the quality of DBD and potential biases in the data that may lead to validity concerns remain persistent challenges for computational social scientists. In short, DBD require special procedures for ensuring data quality (for an overview of data quality frameworks for DBD, see Fröhling et al., 2025).
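Some of these quality checks can be operationalized as simple descriptive indicators computed before analysis. The sketch below reports two such indicators, the share of duplicate records and the share of incomplete records, for a batch of scraped posts; the record format and field names are hypothetical.

```python
def quality_report(records, required_fields):
    """Report two basic quality indicators for a batch of DBD records:
    the share of duplicates and the share of incomplete records."""
    total = len(records)
    unique = {tuple(sorted(r.items())) for r in records}
    complete = [
        r for r in records
        if all(r.get(field) is not None for field in required_fields)
    ]
    return {
        "duplicate_share": 1 - len(unique) / total,
        "incomplete_share": 1 - len(complete) / total,
    }

# Hypothetical posts scraped from a platform.
posts = [
    {"id": 1, "text": "hello", "timestamp": "2025-01-01"},
    {"id": 1, "text": "hello", "timestamp": "2025-01-01"},  # duplicate
    {"id": 2, "text": "hi", "timestamp": None},             # missing field
]
report = quality_report(posts, required_fields=["id", "text", "timestamp"])
print(report)
```

Indicators like these are only a starting point; the data quality frameworks cited above cover systematic biases that no record-level check can detect.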
Methods quality. Measurements that are based on DBD often rely on machine learning (ML) and other statistical models which learn relationships between observable data and constructs or phenomena of interest. These models are very powerful – they can detect patterns and relations without having to predefine parametric models and can react to concept drifts – but they typically are complex black boxes. Ensuring the validity and reliability of those models and methods, as well as their reusability and interpretability, is an important challenge for the field (for an introduction to the reproducibility of computational analyses, see Bleier, 2025).
Data linking. Despite their many advantages, DBD can also be limited for research purposes when used in isolation. Similarly, there are many limitations associated with conventional data sources for social scientists, which led to an increased interest in DBD in the first place. As outlined by Salganik (2019), it is especially the combination of “custom-made” data, such as from surveys, and “ready-made” data such as DBD that unlocks the biggest potential for innovative and societally relevant research (for linking DBD with survey data, see Stier et al., 2020).
Data access. Academia faces a lack of reliable access to DBD. For a long time, researchers have primarily relied on data access options provided by platforms. Over the last years, restrictions or even the discontinuation of APIs have been disruptive for academic research. Although regulation, notably the EU’s Digital Services Act, now requires platforms to provide more transparency by making data accessible to researchers, there are several barriers to that – ranging from missing infrastructure for preserving and accessing such data to standards for data quality and reproducibility that are yet to be developed.
Outlook. The fact that these challenges to successfully working with digital behavioral data have been increasingly identified, discussed, and addressed in the research community over the last decade may be seen as an indication of the field’s increasing maturity. The use of DBD has progressed from an early El Dorado-like fascination with new and often data-driven opportunities to a widespread integration of DBD in diverse social science research portfolios across topics and approaches. Collectively addressing and finding solutions for the problems and barriers mentioned will further propel the adoption of these data, move them closer to becoming part of “normal (social) science”, and leverage their potential to give meaningful answers in the context of social science research questions.
6 Notes
All links in the text and the reference list were retrieved on June 4, 2025.