Knowledge Graph Infrastructure

The Knowledge Graph (KG) infrastructure aims to link social science research data and resources across GESIS and to make them interoperable and findable on the Web. It is based on a social science knowledge graph that links the GESIS data collections with each other and with established vocabularies, social science data sources, and established knowledge bases on the Web such as Wikidata.

The KG will also be enriched with extracted entities, such as variables, and with extracted links, for example between publications and research data. Both survey data and digital behavioral data are taken into account. The rich information in the Social Science Knowledge Graph will be integrated into GESIS services such as the GESIS-wide search to support users, e.g. when they search for research data.

Building on this, the infrastructure will provide and link additional knowledge graphs that hold data, entities, and relationships relevant to social science research topics. One example is ClaimsKG, a graph of annotated claims extracted from fact-checking websites.

For the development of the knowledge graph infrastructure in general and the social science knowledge graph in particular, methods of information extraction, entity interlinking, coreference resolution and data fusion are being investigated and applied.
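As a rough illustration of what entity interlinking can look like in its simplest form, the following sketch matches free-text dataset keywords against the labels of a controlled vocabulary and emits link triples. It is not the method used at GESIS: the URIs under `example.org`, the sample records, the function names, and the choice of `dcterms:subject` as the linking predicate are all assumptions made for this example, and real interlinking usually goes well beyond exact label matching.

```python
import re
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

# Hypothetical vocabulary: concept URI -> preferred label.
VOCABULARY: Dict[str, str] = {
    "http://example.org/vocab/concept/1": "Social Inequality",
    "http://example.org/vocab/concept/2": "Voting Behavior",
}

# Hypothetical dataset records: dataset URI -> free-text keywords.
DATASETS: Dict[str, List[str]] = {
    "http://example.org/dataset/a": ["voting  behavior", "Trust"],
    "http://example.org/dataset/b": ["social inequality"],
}

# Assumed linking predicate (Dublin Core Terms "subject").
DCTERMS_SUBJECT = "http://purl.org/dc/terms/subject"


def normalize(label: str) -> str:
    """Lowercase a label and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", label.strip().lower())


def interlink(
    datasets: Dict[str, List[str]],
    vocabulary: Dict[str, str],
    predicate: str = DCTERMS_SUBJECT,
) -> List[Triple]:
    """Link each dataset to every concept whose label matches a keyword."""
    # Index the vocabulary by normalized label for constant-time lookup.
    index = {normalize(label): uri for uri, label in vocabulary.items()}
    links: List[Triple] = []
    for dataset_uri, keywords in datasets.items():
        for keyword in keywords:
            concept_uri = index.get(normalize(keyword))
            if concept_uri is not None:
                links.append((dataset_uri, predicate, concept_uri))
    return links


if __name__ == "__main__":
    # Print the links in an N-Triples-like form.
    for s, p, o in interlink(DATASETS, VOCABULARY):
        print(f"<{s}> <{p}> <{o}> .")
```

Keywords without a matching concept (here "Trust") produce no link. In practice, that is where fuzzy matching, disambiguation, or manual curation would come in.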

  • ClaimsKG: ClaimsKG is a knowledge graph that contains claims and their evaluations from fact-checking websites and links relevant entities to DBpedia concepts. The KG currently holds 28,383 claims from 6 English-language websites.
  • EXCITE: In the EXCITE - Extraction of Citations from PDF Documents project, procedures were developed to extract and structure literature citations from scientific publications. The extracted references (over 1 million) were delivered to the Open Citations Corpus (OCC). Of these, over 300,000 links to publications in GESIS data collections were identified, which will be integrated into the Social Science Knowledge Graph.
  • GESIS Research Graph: In the GESIS Research Graph project, a prototype graph has been developed that links publications, research data, projects and people. The GESIS Research Graph is based on the Knowledge Graph infrastructure and contains over 110,000 publications, over 6,200 research records, and over 53,000 research projects.
  • GESIS-wide search: The Knowledge Graph infrastructure is integrated into the backend of the GESIS-wide search and thus provides users with structured information on linked research data, publications, etc.
  • InFoLiS: In the project InFoLiS - Integration of Research Data and Literature, a method has been investigated and developed for detecting citations of research datasets in scientific publications. The resulting links between publications and research data are integrated into the Social Science Knowledge Graph.
  • MOVING: In the project MOVING, methods were investigated and developed to disambiguate authors. The methods are used to disambiguate person names from various data sources in the Knowledge Graph infrastructure, as well as to identify and resolve duplicates in the records.
  • OpenMinTeD: In the OpenMinTeD project, methods have been investigated and developed to identify mentions of variables in scientific publications. The 415 generated links between publications and variables will be integrated into the Social Science Knowledge Graph.
  • Question Feature Sample: A sample knowledge graph of GESIS survey questions annotated with question features, concretely the information type.
  • SoMeSci: SoMeSci is the most comprehensive gold standard corpus, exposed as an open knowledge graph, about software mentions in scientific articles, providing training samples for Named Entity Recognition, Relation Extraction, Entity Disambiguation, and Entity Linking. The data consists of 4,397,422 triples describing the metadata and context of 3,756 mentions in 1,367 articles.
  • SoftwareKG: SoftwareKG is a knowledge graph that contains information about software mention statements from more than 51,000 scientific articles from the social sciences. It enables analysis of the provenance of research results, the attribution of developers, and software citation in general. In addition, information about whether and how the software and its source code are available allows an assessment of the state and role of open source software in science in general.
  • SoRa: In the project SoRa - Social Spatial Research Data Infrastructure, a knowledge graph is under development that describes social science survey data at study, variable and question level. So far, the graph represents two complementary datasets from different institutes and will be extended by links to spatial data.
  • TheSoz: The Thesaurus for the Social Sciences (TheSoz) is a controlled vocabulary which contains about 8,000 concepts (recommended keywords) from the Social Sciences. Topics from all social science disciplines are included.
  • TweetsCOV19: TweetsCOV19 is a semantically annotated corpus of tweets about the COVID-19 pandemic. It is a subset of TweetsKB and aims at capturing online discourse about various aspects of the pandemic and its societal impact. The dataset consists of 20,112,480 tweets in total, posted by 7,384,417 users, and reflects the societal discourse about COVID-19 on Twitter from October 2019 to December 2020.
  • TweetsKB: TweetsKB is a knowledge graph hosted at GESIS that includes metadata about 1.5 billion tweets (Feb. 2013 - Mar. 2018) and serves as a resource for social science research. Using information extraction methods, sentiments, entities, hashtags, and user mentions were extracted and published as linked data through a structured RDF schema.
  • Bensmann, Felix, and Benjamin Zapilko. 2023. ScienceLinker - Python Package. https://pypi.org/project/sciencelinker/.
  • Sack, Harald, Torsten Schrade, Oleksandra Bruns, Etienne Posthumus, Tabea Tietz, Ebrahim Norouzi, Jörg Waitelonis, Heike Fliegl, Linnaea Söhn, Julia Tolksdorf, Jonatan Jalle Steller, Abril Azócar Guzmán, Said Fathalla, Ahmad Zainul Ihsan, Volker Hofmann, Stefan Sandfeld, Felix Fritzen, Amir Laadhar, Sonja Schimmler, and Peter Mutschke. 2023. "Knowledge Graph Based RDM Solutions: NFDI4Culture - NFDI-MatWerk - NFDI4DataScience." In 1st Conference on Research Data Infrastructure (CoRDI) - Connecting Communities, edited by York Sure-Vetter and Carole Goble. doi: https://doi.org/10.52825/CoRDI.v1i.371.
  • Kartal, Yavuz Selim, Sotaro Takeshita, Tornike Tsereteli, Kai Eckert, Henning Kroll, Philipp Mayr-Schlegel, Simone Paolo Ponzetto, Benjamin Zapilko, and Andrea Zielinski. 2022. Towards Automated Survey Variable Search and Summarization in Social Science Publications. arXiv. doi: https://doi.org/10.48550/arXiv.2209.06804. http://arxiv.org/abs/2209.06804.
  • Zloch, Matthäus, Maribel Acosta, Daniel Hienert, Stefan Conrad, and Stefan Dietze. 2021. "Characterizing RDF graphs through graph-based measures – framework and assessment." Semantic Web 12 (5): 789-812. doi: https://doi.org/10.3233/SW-200409.
  • Bensmann, Felix, Andrea Papenmeier, Dagmar Kern, Benjamin Zapilko, and Stefan Dietze. 2020. "Semantic annotation, representation and linking of survey data." In Semantic systems. In the era of knowledge graphs. SEMANTICS 2020, edited by Eva Blomqvist, Paul Groth, Victor de Boer, Tassilo Pellegrini, Mehwish Alam, Tobias Käfer, Peter Kieseberg, Sabrina Kirrane, Albert Meroño-Peñuela, and Harshvardhan Pandit, Lecture Notes in Computer Science 12378, 53-69. Cham: Springer. doi: https://doi.org/10.1007/978-3-030-59833-4_4. https://link.springer.com/chapter/10.1007/978-3-030-59833-4_4.
