AI-assisted Linking

Knowledge Graphs (KGs) connect data and information, making related facts easier to find, understand, and use. They enhance data integration, search, and decision-making by structuring information into meaningful relationships, enabling AI-driven insights and context-aware applications. The AI-assisted linking activities in KTS comprise methods, infrastructures, and tools to generate Knowledge Graphs.

Methods and infrastructures from various third-party funded projects at GESIS contribute to AI-assisted linking. The generated KGs are closely integrated into existing GESIS offerings, such as the GESIS KG in GESIS Search, or into offerings of other communities such as the NFDI.

The KGs are geared towards interoperability: they use established W3C standards and vocabularies such as schema.org, DDI, the NFDIcore ontology and others to make data on the web more interoperable and reusable for people and machines, e.g. through APIs. Findability and interoperability are further improved by reusing Persistent Identifiers (PIDs) from common PID systems.
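
To make the combination of a shared vocabulary and PIDs concrete, the following is a minimal sketch of how a link between a publication and a dataset could be expressed as JSON-LD with schema.org terms. It is an illustration only, not the actual modelling used in the GESIS KGs: the function names, the titles, and the DOIs (`10.0000/...`) are hypothetical placeholders.

```python
import json


def doi_url(doi: str) -> str:
    """Turn a bare DOI into its resolvable PID form."""
    return f"https://doi.org/{doi}"


def publication_citing_dataset(
    pub_doi: str, pub_title: str, dataset_doi: str, dataset_name: str
) -> dict:
    """Describe a publication that cites a dataset, using schema.org terms.

    Both resources are identified by their PID, so other graphs that
    reuse the same PIDs refer to the same nodes.
    """
    return {
        "@context": "https://schema.org/",
        "@type": "ScholarlyArticle",
        "@id": doi_url(pub_doi),
        "name": pub_title,
        # schema.org "citation": a reference to another creative work.
        "citation": {
            "@type": "Dataset",
            "@id": doi_url(dataset_doi),
            "name": dataset_name,
        },
    }


# Placeholder values, invented for this example.
record = publication_citing_dataset(
    pub_doi="10.0000/example-paper",
    pub_title="An Example Study",
    dataset_doi="10.0000/example-dataset",
    dataset_name="Example Survey 2020",
)
print(json.dumps(record, indent=2))
```

Because both ends of the link are PIDs, a consumer can merge this record with descriptions of the same resources from other sources without any name matching.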

  • GSAP The goal of the GESIS Scholarly Annotation Project (GSAP) is to create a corpus and develop automatic tools for information extraction and linking of machine learning models and related entities such as methods, datasets and tasks from scientific publications. Such tools will help to unlock new sources of knowledge. The project is a joint effort of the DFG-funded projects BERD@NFDI, Unknown Data and NFDI4DataScience.
  • InFoLiS In the project InFoLiS (Integration of Research Data and Literature), a method was investigated and developed for detecting citations of research datasets in scientific publications. The resulting 99227 links between publications and research datasets are integrated into the GESIS KG and are also available in GESIS Search.
  • ClaimsKG ClaimsKG is a knowledge graph that contains claims and their evaluations from fact-checking websites and links relevant entities to DBpedia concepts. The latest release of ClaimsKG covers 74066 claims and 72128 claim reviews. The data was scraped in January 2023 from 13 fact-checking websites and contains claims published between 1996 and January 31, 2023. The claim reviews (fact checks) likewise date from 1996 to 2023.
  • gesisDataSearch KG This Knowledge Graph contains metadata on harvested social science datasets. It is the KG representation of the content served by the GESIS Data Search Portal. The KG currently contains metadata of 11965 datasets. 
  • GESIS Knowledge Graph The GESIS Knowledge Graph (GESIS KG) represents metadata of scientific resources available in GESIS Search and their semantic relationships in an integrated and consistent form and makes them accessible for reuse. The current version of the GESIS KG contains metadata from 474201 scientific resources and 168362 links between them, of which 99227 have been generated automatically.
  • GESIS Research Graph The GESIS Research Graph was a case study in collaboration with the Research Graph Foundation in which a prototype graph was developed that links publications, research data, projects and people. The GESIS Research Graph was based on the GESIS KG and contained over 110,000 publications, over 6,200 research records, and over 53,000 research projects.
  • Question Feature Sample A sample knowledge graph of GESIS survey questions annotated with question features, concretely the information type. The KG represents 4024 unique questions with over 12000 annotated information types (26 distinct information types including subtypes) like willingness, acceptance, demography, perception, assessment, and others. 
  • SoMeSci SoMeSci is the most comprehensive gold standard corpus, exposed as an open knowledge graph, about software mentions in scientific articles, providing training samples for Named Entity Recognition, Relation Extraction, Entity Disambiguation, and Entity Linking. The data consists of 4397422 triples, describing metadata and context of 3756 mentions in 1367 articles.
  • SoftwareKG SoftwareKG is a knowledge graph that contains information about software mention statements from more than 51,000 scientific articles from the social sciences. It enables analysis of the provenance of research results, the attribution of developers, and software citation in general. Additionally, information about whether and how the software and its source code are available allows an assessment of the state and role of open source software in science in general.
  • The Thesaurus for the Social Sciences (TheSoz) is a controlled vocabulary which contains about 8000 concepts (recommended keywords) from the Social Sciences. Topics from all social science disciplines are included. 
  • TweetsCOV19 TweetsCOV19 is a semantically annotated corpus of Tweets about the COVID-19 pandemic. It is a subset of TweetsKB and aims at capturing online discourse about various aspects of the pandemic and its societal impact. The dataset consists of 41,307,082 tweets in total, posted by 12,825,911 users, and reflects the societal discourse about COVID-19 on Twitter from October 2019 to August 2022.
  • TweetsKB TweetsKB is a knowledge graph hosted at GESIS that includes metadata about 3.1 billion tweets (Feb. 2013 - June 2023) and serves as a resource for social science research. Using information extraction methods, sentiments, entities, hashtags, and user mentions were extracted and published as linked data through a structured RDF schema. 
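
The graphs listed above share one underlying idea: facts are stored as subject-predicate-object triples that can be filtered by pattern. The sketch below illustrates this with a handful of invented tweet annotations (hashtags, entities, user mentions, sentiment). All identifiers and predicate names (`ex:hasHashtag`, `tweet:1`, and so on) are hypothetical and do not reproduce the actual TweetsKB schema.

```python
from typing import Iterable, Iterator, Optional

Triple = tuple[str, str, str]

# Hypothetical identifiers and predicates, for illustration only.
TRIPLES: list[Triple] = [
    ("tweet:1", "rdf:type", "ex:Tweet"),
    ("tweet:1", "ex:hasHashtag", "covid19"),
    ("tweet:1", "ex:mentionsEntity", "dbpedia:World_Health_Organization"),
    ("tweet:1", "ex:positiveSentiment", "0.25"),
    ("tweet:2", "rdf:type", "ex:Tweet"),
    ("tweet:2", "ex:hasHashtag", "lockdown"),
    ("tweet:2", "ex:mentionsUser", "user:42"),
]


def match(
    triples: Iterable[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> Iterator[Triple]:
    """Yield triples matching the pattern; None acts as a wildcard."""
    for triple in triples:
        if (
            (s is None or triple[0] == s)
            and (p is None or triple[1] == p)
            and (o is None or triple[2] == o)
        ):
            yield triple


def tweets_with_hashtag(triples: Iterable[Triple], tag: str) -> list[str]:
    """Return the sorted subjects of all triples that carry the given hashtag."""
    return sorted({s for s, _, _ in match(triples, p="ex:hasHashtag", o=tag)})


print(tweets_with_hashtag(TRIPLES, "covid19"))
```

A production graph would be queried through a triple store rather than a Python list, but the pattern-matching principle is the same.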

  • GRAPHIA (Knowledge Graphs, AI Services and Next Generation Instrumentation for R&D in Social Sciences and Humanities) aims to create the first comprehensive Social Science and Humanities (SSH) Knowledge Graph designed to integrate fragmented data into a unified entry point. 
  • InFoLiS In the project InFoLiS (Integration of Research Data and Literature), a method was investigated and developed for detecting citations of research datasets in scientific publications. The resulting 99227 links between publications and research datasets are integrated into the GESIS KG and are also available in GESIS Search.
  • MOVING In the project MOVING, methods were investigated and developed to disambiguate authors. The methods have been further developed and will be used to disambiguate person names from various data sources as well as to identify and resolve duplicates in the records. 
  • OpenMinTeD In the OpenMinTeD project, methods have been investigated and developed to identify the mentions of variables in scientific publications. The generated 415 links between publications and variables are integrated into the GESIS Knowledge Graph. 
  • OUTCITE In the OUTCITE - “Reference Understanding in the Social Sciences” project, procedures were developed to extract and structure literature citations from scientific publications. The extracted references (over 1 million) were delivered to the Open Citations Corpus (OCC). Of these, over 300,000 links to publications in GESIS data collections were identified, which will be integrated into the GESIS Knowledge Graph.
  • VADIS In the VADIS (VAriable Detection, Interlinking and Summarization) project, references to survey variables within scholarly articles have been identified. Semantic links based on these references have been created and made available as a Knowledge Graph. In total, around 1300 sentences with variable references have been identified from SSOAR publications. Additionally, TL;DR summaries have been generated for English-language publications.
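
As a rough illustration of the duplicate-resolution step mentioned for person names, the sketch below groups name variants under a simple blocking key (normalized surname plus first initial). This is a deliberately naive heuristic chosen for the example and is not the method developed in MOVING; the function names and all sample names are invented.

```python
import unicodedata
from collections import defaultdict


def normalize(text: str) -> str:
    """Lower-case the text and strip diacritics (e.g. 'Müller' -> 'muller')."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.lower().strip()


def blocking_key(name: str) -> str:
    """Map 'Surname, Given' or 'Given Surname' to 'surname_initial'."""
    if "," in name:
        surname, given = (part.strip() for part in name.split(",", 1))
    else:
        parts = name.split()
        if not parts:
            return ""
        surname, given = parts[-1], " ".join(parts[:-1])
    return f"{normalize(surname)}_{normalize(given)[:1]}"


def group_candidates(names: list[str]) -> dict[str, list[str]]:
    """Group name strings that share a blocking key as duplicate candidates."""
    groups: dict[str, list[str]] = defaultdict(list)
    for name in names:
        groups[blocking_key(name)].append(name)
    return dict(groups)


# Invented sample names.
sample = ["Müller, Anna", "Anna Muller", "A. Muller", "Schmidt, Bernd"]
for key, members in group_candidates(sample).items():
    print(key, members)
```

Names that share a key are only candidates for the same person. A real disambiguation pipeline would compare further evidence such as co-authors, affiliations, or topics before merging records.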

  • BERD@NFDI GESIS' main focus in BERD@NFDI is the development of innovative methods for the extraction of metadata and relevant entities from unstructured sources, the development of a harvesting infrastructure, and the improvement of the findability of data and resources on the Web. 
  • GESIS Search GESIS Search lets you find information about social science research data, variables, publications on research data and open access publications. Links between contents are based on the GESIS KG and displayed directly in the result list.
  • KGI4NFDI KGI4NFDI advocates for a central and reusable Knowledge Graph Infrastructure (KGI) to enhance interoperability within the research domain and support the NFDI's objectives. It aims to provide essential components including a Knowledge Graph (KG) registry and a service for accessing KGs across NFDI projects. Additionally, the service intends to empower research communities to create decentralised KG instances using standardised approaches, technologies, and expertise. 
  • NFDI4DataScience The overarching objective of NFDI4DS is the development, establishment, and sustainment of a national research data infrastructure (NFDI) for the Data Science and Artificial Intelligence community in Germany. This will also deliver benefits for a wider community requiring data analytics solutions, within the NFDI and beyond. KTS’ focus is on extracting relevant entities (such as research datasets, benchmarks, machine learning models and research software) from scholarly documents and on representing and linking such artifacts in Research Knowledge Graphs.
  • SoRa The infrastructure built in the project SoRa - Social Spatial Research Data Infrastructure makes it possible to link social science and spatial science research data in a data protection-compliant manner, enabling the analysis of interdisciplinary research questions at the intersection of these domains.

  • Otto, Wolfgang, Sharmila Upadhyaya, and Stefan Dietze. 2024. "Enhancing Software-Related Information Extraction via Single-Choice Question Answering with Large Language Models." In Natural Scientific Language Processing and Research Knowledge Graphs. NSLP 2024, edited by Georg Rehm, Stefan Dietze, Sonja Schimmler, and Frank Krüger, Lecture Notes in Computer Science 14770, 289-306. Cham: Springer Nature. doi: https://doi.org/10.1007/978-3-031-65794-8_21. https://link.springer.com/content/pdf/10.1007/978-3-031-65794-8.pdf.
  • Backes, Tobias, Anastasiia Iurshina, Muhammad Ahsan Shahid, and Philipp Mayr. 2024. "Comparing free reference extraction pipelines." International Journal on Digital Libraries 25 (4): 841–853. doi: https://doi.org/10.1007/s00799-024-00404-6. https://zenodo.org/records/11072332.
  • Kartal, Yavuz Selim, Muhammad Ahsan Shahid, Sotaro Takeshita, Tornike Tsereteli, Andrea Zielinski, Benjamin Zapilko, and Philipp Mayr-Schlegel. 2024. "VADIS -- a VAriable Detection, Interlinking and Summarization system." In Advances in Information Retrieval: 46th European Conference on Information Retrieval, ECIR 2024, Glasgow, UK, March 24–28, 2024, Proceedings, Part V, edited by Nazli Goharian, Nicola Tonellotto, Yulan He, Aldo Lipani, Graham McDonald, Craig Macdonald, and Iadh Ounis, Lecture Notes in Computer Science 14612, 223-228. Springer, Cham. doi: https://doi.org/10.1007/978-3-031-56069-9_22.
  • Sack, Harald, Torsten Schrade, Oleksandra Bruns, Etienne Posthumus, Tabea Tietz, Ebrahim Norouzi, Jörg Waitelonis, Heike Fliegl, Linnaea Söhn, Julia Tolksdorf, Jonatan Jalle Steller, Abril Azócar Guzmán, Said Fathalla, Ahmad Zainul Ihsan, Volker Hofmann, Stefan Sandfeld, Felix Fritzen, Amir Laadhar, Sonja Schimmler, and Peter Mutschke. 2023. "Knowledge Graph Based RDM Solutions: NFDI4Culture - NFDI-MatWerk - NFDI4DataScience." In 1st Conference on Research Data Infrastructure (CoRDI) - Connecting Communities, edited by York Sure-Vetter and Carole Goble. doi: https://doi.org/10.52825/CoRDI.v1i.371.
  • Otto, Wolfgang, Matthäus Zloch, Lu Gan, Saurav Karmakar, and Stefan Dietze. 2023. "GSAP-NER: A Novel Task, Corpus, and Baseline for Scholarly Entity Extraction Focused on Machine Learning Models and Datasets." In Findings of the Association for Computational Linguistics: EMNLP 2023, edited by Houda Bouamor, Juan Pino, and Kalika Bali, 8166-8176. Singapore: Association for Computational Linguistics. https://aclanthology.org/2023.findings-emnlp.548.