Text and Data Mining

Text and data mining comprises the development and application of methods for extracting knowledge relevant to the social sciences from unstructured text and data streams.

Our research on Text and Data Mining

  • Detection of statistical regularities in data and text and alignment of these regularities with variables of interest such as political leaning or gender
  • Combination of digital behavioral data and survey data to create new types of user models
  • Semantic enrichment and analysis of collaboratively generated documents (e.g., Wikipedia articles or scientific publications) and the social dynamics of the creation process (e.g., conflicts, productivity)
  • Statistical modelling of sequential human behavior (e.g., the decisions made when navigating the web or individual movement in urban surroundings)
  • Detection, disambiguation, and linking of entities of interest to the social sciences in academic publications (especially references to research data)
  • Extraction of key information from texts and (semi-)automatic indexing

Publications

  • Daikeler, Jessica, Leon Fröhling, Indira Sen, Lukas Birkenmaier, Tobias Gummer, Jan Schwalbach, Henning Silber, Bernd Weiß, Katrin Weller, and Clemens Lechner. 2024. "Assessing Data Quality in the Age of Digital Social Research: A Systematic Review." Social Science Computer Review. doi: https://doi.org/10.1177/08944393241245395.
  • Sen, Indira, Dennis Assenmacher, Mattia Samory, Isabelle Augenstein, Wil van der Aalst, and Claudia Wagner. 2023. "People Make Better Edits: Measuring the Efficacy of LLM-Generated Counterfactually Augmented Data for Harmful Language Detection." In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP, edited by Houda Bouamor, Juan Pino, and Kalika Bali, 10480–10504. Singapore: Association for Computational Linguistics.
  • Dahou, Abdelhalim Hafedh, Mohamed Amine Cheragui, and Ahmed Abdelali. 2023. "Performance Analysis of Arabic Pre-Trained Models on Named Entity Recognition Task." In Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing, edited by Ruslan Mitkov and Galia Angelova, 458–467. Shoumen: INCOMA Ltd. https://aclanthology.org/2023.ranlp-1.51.pdf.
  • Diera, Andor, Abdelhalim Hafedh Dahou, Lukas Galke, Fabian Karl, Florian Sihler, and Ansgar Scherp. 2023. "GenCodeSearchNet: A Benchmark Test Suite for Evaluating Generalization in Programming Language Understanding." In Proceedings of the 1st GenBench Workshop on (Benchmarking) Generalisation in NLP. Association for Computational Linguistics (ACL). doi: https://doi.org/10.18653/v1/2023.genbench-1.2.
  • Dahou, Abdelhalim Hafedh, and Brigitte Mathiak. 2023. "Subject Classification of Software Repository." In Proceedings of the 15th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - KDIR, 1, 30–38. SciTePress. doi: https://doi.org/10.5220/0012159600003598.

Title | Start | End | Funder
Kompetenzzentrum Datenqualität in den Sozialwissenschaften (KODAQS) | 2023-11-15 | 2026-11-14 | Federal government (Bund)
NFDI for Data Science and Artificial Intelligence (NFDI4DS) | 2021-10-01 | 2026-09-30 | DFG
NFDI for Business, Economic and Related Data (BERD@NFDI) | 2021-10-01 | 2026-09-30 | DFG
Dehumanization Online: Measurement and Consequences (Professorinnenprogramm) (DeHum) | 2021-01-01 | 2026-09-30 | SAW (Leibniz)
TraininG towards a society of data-saVvy inforMation prOfessionals to enable open leadership INnovatio (MOVING) | 2016-04-01 | 2019-03-31 | Other third-party funding

Find out more about our consulting and services: