
KTS Research Labs

Welcome to the labs page of the department Knowledge Technologies for the Social Sciences (KTS) of GESIS. KTS conducts research in applied computer science, in particular in areas such as information retrieval, information extraction & NLP, semantic technologies, and human-computer interaction, in order to innovate digital services and research data infrastructures for the social sciences. Here you will find reusable outcomes of our recent research and development projects, organized into research datasets, applications & demos, and tools & pipelines.

Feel free to explore our technologies and get in touch with us.

Research datasets

SoMeSci - A 5 Star Open Data Gold Standard Knowledge Graph of Software Mentions in Scientific Articles

Description:

Knowledge about software used in scientific investigations is important for several reasons, for instance, to enable an understanding of provenance and methods involved in data handling. However, software is usually not formally cited, but rather mentioned informally within the scholarly description of the investigation, raising the need for automatic information extraction and disambiguation. Given the lack of reliable ground truth data, we present SoMeSci - Software Mentions in Science - a gold standard knowledge graph of software mentions in scientific articles. It contains high quality annotations (IRR: κ = .82) of 3756 software mentions in 1367 PubMed Central articles. Besides the plain mention of the software, we also provide relation labels for additional information, such as the version, the developer, a URL or citations. Moreover, we distinguish between different types, such as application, plugin or programming environment, as well as different types of mentions, such as usage or creation. To the best of our knowledge, SoMeSci is the most comprehensive corpus about software mentions in scientific articles, providing training samples for Named Entity Recognition, Relation Extraction, Entity Disambiguation, and Entity Linking.

Download link / Web Access

Website: https://data.gesis.org/somesci/
SPARQL endpoint: https://data.gesis.org/somesci/sparql
Dataset: https://zenodo.org/record/4701763
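The endpoint speaks the standard SPARQL protocol, so it can be queried with plain HTTP. A minimal Python sketch; the query shown is a generic triple count, not a SoMeSci-specific query, and result parsing follows the SPARQL 1.1 JSON results format:

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://data.gesis.org/somesci/sparql"

def sparql_select(endpoint, query, timeout=30):
    """Send a SELECT query and return the parsed SPARQL JSON results."""
    params = urllib.parse.urlencode(
        {"query": query, "format": "application/sparql-results+json"})
    with urllib.request.urlopen(endpoint + "?" + params, timeout=timeout) as resp:
        return json.load(resp)

def rows(results):
    """Flatten SPARQL JSON results into a list of plain dicts."""
    return [{var: b[var]["value"] for var in b}
            for b in results["results"]["bindings"]]

# Example (requires network access):
#   res = sparql_select(ENDPOINT, "SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }")
#   print(rows(res))
```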

Link to Source Code

Github: https://github.com/dave-s477/SoMeSci_Code

Publications

  • David Schindler, Felix Bensmann, Stefan Dietze and Frank Krüger. 2021. SoMeSci - A 5 Star Open Data Gold Standard Knowledge Graph of Software Mentions in Scientific Articles. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management (CIKM '21). Association for Computing Machinery, New York, NY, USA, 4574–4583. DOI: https://doi.org/10.1145/3459637.3482017

Team

SoftwareKG_Social and SoftwareKG_Pubmed

SoftwareKG_Social and SoftwareKG_PubMed are two projects by the University of Rostock and GESIS - Leibniz Institute for the Social Sciences that document software mentions in scientific papers.

These two knowledge graphs (KGs) enable users to inspect and query, and consequently understand, the role of software in science. SoftwareKG_Social is our first release from 2019, based on a dataset of 51,000 articles from the social sciences, while SoftwareKG_PubMed, from 2021, comprises a much more sophisticated analysis of a larger dataset of more than 3 million PubMed Central articles. Both KGs use a very similar data model and originated from extraction methods that were iteratively improved.

Download link / Web Access

Website SoftwareKG

Website SoftwareKG PubMed

Website SoftwareKG Social

Publications:

  • Schindler D, Bensmann F, Dietze S, Krüger F. (2022). The role of software in science: a knowledge graph-based analysis of software mentions in PubMed Central. PeerJ Computer Science 8:e835 https://doi.org/10.7717/peerj-cs.835
  • Schindler D., Zapilko B., Krüger F. (2020) Investigating Software Usage in the Social Sciences: A Knowledge Graph Approach. In: Harth A. et al. (eds) The Semantic Web. ESWC 2020. Lecture Notes in Computer Science, vol 12123. Springer, Cham. https://doi.org/10.1007/978-3-030-49461-2_16

Team

Dataset of Natural Language Queries for E-Commerce (VACOS-NLQ)

Description

The natural language query dataset (VACOS-NLQ) is a collection of 3540 written queries for product search. With the increasing importance of voice search (e.g., voice assistants such as Siri or Alexa, or mobile search), understanding natural language queries becomes an essential challenge for product search engines. We collected written queries in natural language from English native speakers for two products: laptops and jackets. The queries are enriched with basic information about the participant (age, gender, domain knowledge). The laptop queries are further annotated with vague words and key facts to allow research on product search and natural language processing. The dataset is available under a CC BY-NC-SA 3.0 licence.
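The dataset is distributed as JSON Lines (one JSON object per line, as the `.jsonl` download suggests). A minimal loading sketch in Python; the field names used below are illustrative, not the dataset's actual schema:

```python
import io
import json

def load_jsonl(fp):
    """Parse a JSON Lines file: one JSON object per non-empty line."""
    return [json.loads(line) for line in fp if line.strip()]

# Field names here are illustrative, not the dataset's actual schema.
sample = io.StringIO(
    '{"query": "a light laptop for travelling", "product": "laptop"}\n'
    '{"query": "warm jacket for winter hikes", "product": "jacket"}\n')
records = load_jsonl(sample)
laptop_queries = [r["query"] for r in records if r["product"] == "laptop"]
```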

Download link / Web Access

https://git.gesis.org/papenmaa/chiir21_naturallanguagequeries/blob/master/VACOS_NLQ_data.jsonl

Link to source code (GitLab)

https://git.gesis.org/papenmaa/chiir21_naturallanguagequeries 

Publications

  • Papenmeier, Andrea, Alfred Sliwa, Dagmar Kern, Daniel Hienert, Ahmet Aker, and Norbert Fuhr. 2021. "Dataset of Natural Language Queries for E-Commerce." In Proceedings of the 2021 ACM SIGIR Conference on Human Information Interaction and Retrieval (CHIIR '21), March 14--19, 2021, Canberra, ACT, Australia, doi: http://dx.doi.org/10.1145/3406522.3446043

Team

TweetsCOV19

Description

TweetsCOV19 is a semantically annotated corpus of tweets about the COVID-19 pandemic. It is a subset of TweetsKB and aims at capturing online discourse about various aspects of the pandemic and its societal impact. Metadata information about the tweets as well as extracted entities, sentiments, hashtags and user mentions are exposed in RDF using established RDF/S vocabularies. This dataset consists of 8,151,524 tweets in total, posted by 3,664,518 users and reflects the societal discourse about COVID-19 on Twitter in the period of October 2019 until April 2020.

Download link / Web Access

Website: https://data.gesis.org/tweetscov19/ 
SPARQL endpoint: https://data.gesis.org/tweetscov19/sparql (Graph IRI: http://data.gesis.org/tweetscov19)

Publications

  • Dimitrov, D., Baran, E., Fafalios, P., Yu, R., Zhu, X., Zloch, M., and Dietze, S. TweetsCOV19 - A Knowledge Base of Semantically Annotated Tweets about the COVID-19 Pandemic, 29th ACM International Conference on Information & Knowledge Management (CIKM 2020), Resource Track, ACM 2020.

Team

TweetsKB

Description

TweetsKB is a public RDF corpus of anonymized data for a large collection of annotated tweets. The dataset currently contains data for more than 1.5 billion tweets, spanning more than 5 years (February 2013 - March 2018). Metadata information about the tweets as well as extracted entities, sentiments, hashtags and user mentions are exposed in RDF using established RDF/S vocabularies. For the sake of privacy, we encrypt the tweet IDs and usernames, and we do not provide the text of the tweets.
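The exact encryption scheme is not described here; as an illustration of the general idea, identifiers can be replaced by one-way pseudonyms, e.g. with a salted hash. This is an illustrative stand-in, not TweetsKB's actual method:

```python
import hashlib

def pseudonymize(identifier, salt="project-secret"):
    """One-way pseudonym for an identifier. This salted hash is only an
    illustrative stand-in; TweetsKB's actual scheme is not specified here."""
    return hashlib.sha256((salt + identifier).encode("utf-8")).hexdigest()

alias = pseudonymize("some_user")  # stable across the corpus, not reversible
```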

Download link / Web Access

Website: https://data.gesis.org/tweetskb/
SPARQL endpoint: https://data.gesis.org/tweetskb/sparql (Graph URI: https://data.gesis.org/tweetskb/)

Publications

  • P. Fafalios, V. Iosifidis, E. Ntoutsi, and S. Dietze, TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets, 15th Extended Semantic Web Conference (ESWC'18), Heraklion, Crete, Greece, June 3-7, 2018.

Team

Sowiport User Search Sessions Data Set (SUSS)

Description

This data set contains individual search sessions from the transaction log of the academic search engine sowiport (www.sowiport.de). The data was collected over a period of one year (between 2nd April 2014 and 2nd April 2015). The web server log files and specific JavaScript-based logging techniques were used to capture the usage behaviour within the system. All activities are mapped to a list of 58 actions. This list covers all types of activities and pages that can be carried out/visited within the system (e.g. typing a query, visiting a document, selecting a facet, etc.). For each action, a session id, the date stamp and additional information (e.g. queries, document ids, and result lists) are stored. The session id is assigned via browser cookie and allows tracking user behaviour over multiple searches. Based on the session id and date stamp, the step in which an action is conducted and the length of the action are included in the data set as well. The data set contains 558,008 individual search sessions and a total of 7,982,427 log entries. The average number of actions per search session is 7.
The dataset 'SUSS-16-17' is a follow-up of the Sowiport User Search Sessions Data Set (SUSS) dataset.
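As a sketch of how such a log can be analyzed, the snippet below (with made-up entries, not actual SUSS data) groups entries by session id and computes the average number of actions per session:

```python
from collections import defaultdict
from statistics import mean

# Made-up log entries for illustration: (session_id, timestamp, action)
log = [
    ("s1", "2014-04-02T10:00:00", "search"),
    ("s1", "2014-04-02T10:00:30", "view_document"),
    ("s1", "2014-04-02T10:01:10", "select_facet"),
    ("s2", "2014-04-02T11:00:00", "search"),
]

def actions_per_session(entries):
    """Count the number of logged actions per session id."""
    counts = defaultdict(int)
    for session_id, _timestamp, _action in entries:
        counts[session_id] += 1
    return dict(counts)

counts = actions_per_session(log)
avg_actions = mean(counts.values())  # the real SUSS average is about 7
```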

Link to source code (GitLab)

https://git.gesis.org/amur/SUSS-16-17

Download link / Web Access

http://dx.doi.org/10.7802/1380

Publications

  • Kacem, A., & Mayr, P. (2018). Analysis of Search Stratagem Utilisation. Scientometrics, 116(2), 1383–1400. doi.org/10.1007/s11192-018-2821-8
  • Hienert, D., Sawitzki, F., & Mayr, P. (2015). Digital Library Research in Action – Supporting Information Retrieval in Sowiport. D-Lib Magazine, 21(3/4). doi.org/10.1045/march2015-hienert

Team

German Bundestag Elections 2013: Twitter usage by electoral candidates

Description

The data is a result of a research project at GESIS which aimed to explore social media communication related to the election of the German parliament on September 22nd, 2013. The data includes tweets by candidates and a file describing the key attributes of the candidates and lists their Twitter and Facebook accounts. Tweets were collected for candidates of all covered parties except the AfD. All data was publicly available at the time of data collection. Cases in which a Twitter or Facebook account was not used as part of the role as a candidate (i.e., private accounts and accounts merely used for private postings) were not included. For legal reasons, only the following data can be shared: (1) A list of all candidates that were considered in the project, their key attributes and, if available, the identification of their Twitter and Facebook accounts. (2) A list of Tweet-IDs which can be used to retrieve the original tweets of the candidates which they posted between June and December 2013. It includes the Tweet-ID and an ID identifying the candidate. The data describing the candidates include variables with the following content: a subsequent number, name of candidate, first name, member of which party ("AfD", "CDU", "CSU", "Die LINKE", "FDP", "GRUENE", "PIRATEN", "SPD"), state (e.g. "Bayern"), is listed (yes, no), is direct candidate (yes, no), constituency (e.g., "Aachen I"), has facebook account (yes, no), facebook_link, has twitter account (yes, no), twitter_screenname and variables on the frequency of twitter use.
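As a sketch of how the two shared files can be combined, the snippet below joins a tweet-ID list to candidate attributes via the candidate ID. The column names are hypothetical, and the tweet texts themselves would still have to be re-retrieved ("rehydrated") via the Twitter API:

```python
import csv
import io

# Hypothetical column names and values for illustration; the published
# files document their own headers.
candidates_csv = "candidate_id,name,party,twitter_screenname\n1,Jane Doe,SPD,jdoe\n"
tweets_csv = "tweet_id,candidate_id\n123456789,1\n987654321,1\n"

candidates = {row["candidate_id"]: row
              for row in csv.DictReader(io.StringIO(candidates_csv))}

tweets_by_candidate = {}
for row in csv.DictReader(io.StringIO(tweets_csv)):
    tweets_by_candidate.setdefault(row["candidate_id"], []).append(row["tweet_id"])
```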

Download link / Web Access

http://dx.doi.org/10.4232/1.12319

Publications

  • Kaczmirek, L., Mayr, P., Vatrapu, R., Bleier, A., Blumenberg, M., Gummer, T., … Wolf, C. (2014). Social Media Monitoring of the Campaigns for the 2013 German Bundestag Elections on Facebook and Twitter. Retrieved from www.gesis.org/fileadmin/upload/forschung/publikationen/gesis_reihen/gesis_arbeitsberichte/WorkingPapers_2014-31.pdf
  • Mayr, P., & Weller, K. (2017). Think Before You collect: Setting Up a Data Collection Approach for Social Media Studies. In L. Sloan & A. Quan-Haase (Eds.), The SAGE Handbook of Social Media Research Methods (pp. 107–124). London: SAGE Publications Ltd.

Team

ClaimsKG - A knowledge graph of annotated claims

Description

ClaimsKG is a knowledge graph of linked annotated claims harvested from fact-checking websites. The KG facilitates structured queries about claims, their truth values or other kinds of metadata. ClaimsKG is generated through a (semi-)automated pipeline which harvests claims and respective metadata from popular fact-checking sites on a regular basis, lifts data into an RDF/S model, which exploits established schema such as schema.org and NIF, and annotates claims with related entities from DBpedia.

In summary, we provide (1) a data model for representing claims, (2) a pipeline for crawling and extracting claims from fact-checking websites, (3) a set of open-source tools for data extraction and lifting following the introduced model, which all are applied to provide (4) an openly available dynamic large-scale knowledge base of claims and associated metadata.
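As a toy illustration of the lifting step, a harvested claim can be serialized as N-Triples using schema.org properties. The IRIs and property selection below are illustrative, not the exact ClaimsKG model:

```python
def ntriple(subject, predicate, obj, literal=False):
    """Serialize one RDF statement as an N-Triples line."""
    o = '"%s"' % obj if literal else "<%s>" % obj
    return "<%s> <%s> %s ." % (subject, predicate, o)

SCHEMA = "http://schema.org/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
claim = "http://example.org/claim/1"  # illustrative IRI

triples = [
    ntriple(claim, RDF_TYPE, SCHEMA + "ClaimReview"),
    ntriple(claim, SCHEMA + "claimReviewed", "The sky is green", literal=True),
    ntriple(claim, SCHEMA + "reviewRating", "http://example.org/rating/false"),
]
```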

Link to source code (GitLab)

https://github.com/claimskg

Download link / Web Access

Website: https://data.gesis.org/claimskg/site
SPARQL endpoint: https://data.gesis.org/claimskg/sparql 

Team

lodcc

Description

As the availability and the inter-connectivity of RDF datasets grow, so does the necessity to understand the structures of the data. Understanding the topology of RDF graphs can guide and inform the development of e.g. synthetic dataset generators, sampling methods, index structures, or query optimizers. This work proposes two resources: (i) a software framework able to acquire, prepare, and perform a graph-based analysis on the topology of large RDF graphs, and (ii) results on a graph-based analysis of 280 datasets from the LOD Cloud with values for 28 graph measures computed with the framework.
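To illustrate what a graph-based measure looks like, the sketch below computes a few degree statistics over an RDF graph viewed as a directed edge list. This is plain Python for illustration, not the framework itself:

```python
from collections import Counter

# An RDF graph viewed as a directed graph: (subject, object) edges.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]

def degree_measures(edge_list):
    """A few of the simpler graph measures: node/edge counts and degrees."""
    out_deg = Counter(s for s, _o in edge_list)
    in_deg = Counter(o for _s, o in edge_list)
    nodes = set(out_deg) | set(in_deg)
    return {
        "nodes": len(nodes),
        "edges": len(edge_list),
        "mean_degree": 2 * len(edge_list) / len(nodes),
        "max_out_degree": max(out_deg.values()),
        "max_in_degree": max(in_deg.values()),
    }

measures = degree_measures(edges)
```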

Download link / Web Access

Source Code: https://git.gesis.org/matthaeus/lodcc 

Web page: https://data.gesis.org/lodcc/2017-08/ 

Link to demo / prototype

https://data.gesis.org/lodcc/2017-08/

Link to source code (GitLab)

https://git.gesis.org/matthaeus/lodcc

Publications

  • Zloch, M., Acosta, M., Hienert, D., Dietze, S., & Conrad, S. (2019). A Software Framework and Datasets for the Analysis of Graph Measures on RDF Graphs. In ESWC 2019, Portoroz, Slovenia, 2-4 June, 2019.

Team

An Open Testbed for Author Name Disambiguation Evaluation

Description

We identified 5,408 authors in DBLP who have a unique identification number. These 5,408 authors and their publications form the gold standard. The data was taken from a DBLP dump downloaded on May 1, 2015.

Download link / Web Access

http://dx.doi.org/10.7802/1234

Publications

  • Momeni, F., & Mayr, P. (2016). Evaluating Co-authorship Networks in Author Name Disambiguation for Common Names. In 20th International Conference on Theory and Practice of Digital Libraries (TPDL 2016) (pp. 386–391). doi.org/10.1007/978-3-319-43997-6_31

Team

Sowiport user queries sample (SQS)

Description

This data set contains a random sample of 1,800 user queries taken from the transaction log of the academic search engine sowiport (www.sowiport.de). The queries (mainly German query terms) were extracted from a larger set of randomly chosen user sessions which were recorded in sowiport between September 1st 2014 and March 1st 2015. To reduce noise in the data set, we selected sessions in which at least two different searches were conducted and at least one document was clicked. In addition, we excluded all searches for numbers (mostly ISSN numbers). The selected queries were sorted randomly and manually assessed by a domain expert. The randomness was introduced to reduce potential biases in the assessment: when assessing multiple queries from one session, the previous queries might influence the decision for following queries, as they are then evaluated within a context. The 1,800 user queries were categorized into the 29 facets of the subject categories used in the Thesaurus Social Sciences (TheSoz, see lod.gesis.org/thesoz/en.html and sowiport.gesis.org/thesaurus). Multiple facets could be applied, as some queries cover multiple topics.

Download link / Web Access

http://dx.doi.org/10.7802/1372

Publications

  • Hienert, D., Sawitzki, F., & Mayr, P. (2015). Digital Library Research in Action – Supporting Information Retrieval in Sowiport. D-Lib Magazine, 21(3/4). doi.org/10.1045/march2015-hienert

Main Responsible / Contact Person

LRMI Datasets

Description

The LRMI dataset provides a research corpus to investigate the spread, adoption and context of learning resources and related metadata on the Web. It contains (schema.org) Web markup of learning resources extracted from the Web Data Commons, i.e. the Common Crawl, of the years 2013-2015. For this, all markup was extracted that contains or co-occurs with any term of the LRMI vocabulary, which provides the schema.org terms for markup of educational resources. For the year 2015, this resulted in more than 44 million markup statements extracted from 1.82 billion web pages. In order to improve data quality, heuristics for data cleansing were applied.
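A sketch of the extraction criterion: keep any markup item that uses (or co-occurs with) an LRMI term. The term list below is a small illustrative subset, not the full vocabulary:

```python
# A small subset of LRMI terms for illustration (see the LRMI specification
# for the complete vocabulary).
LRMI_TERMS = {"learningResourceType", "educationalAlignment",
              "educationalUse", "typicalAgeRange"}

def mentions_lrmi(item):
    """True if a markup item (property -> value dict) uses any LRMI term."""
    return any(prop in LRMI_TERMS for prop in item)

items = [
    {"name": "Algebra course", "learningResourceType": "course"},
    {"name": "News article", "datePublished": "2015-01-01"},
]
lrmi_items = [item for item in items if mentions_lrmi(item)]
```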

Download link / Web Access

Publications

  • Taibi, D., Dietze, S., Towards embedded markup of learning resources on the Web: a quantitative Analysis of LRMI Terms Usage, in Companion Publication of the IW3C2 WWW 2016 Conference, IW3C2 2016, Montreal, Canada, April 11, 2016

  • Dietze, S., Taibi, D., Yu, R., Barker, P., d’Aquin, M., Analysing and Improving embedded Markup of Learning Resources on the Web, 26th International World Wide Web Conference (WWW2017), full research paper at Digital Learning track, Perth, Australia, April 2017.

Team

SAL - Search log with user knowledge assessment data

Description

This dataset includes 1,100 search sessions conducted by crowd workers, spanning 11 information needs for different topics randomly selected from the TREC 2014 Web Track dataset. It includes knowledge assessment data before and after each of the 100 search sessions per information need.

Features

Search session log, user knowledge test for a specific topic before and after a search session
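A common operationalization of knowledge gain is the change in the fraction of correctly answered test items between the pre- and post-session assessments. The publications define the exact measure; this is only a minimal sketch:

```python
def knowledge_gain(pre_correct, post_correct, total_items):
    """Knowledge gain as the change in the fraction of correct answers."""
    return (post_correct - pre_correct) / total_items

# A worker answering 4/10 items correctly before and 7/10 after the session:
gain = knowledge_gain(pre_correct=4, post_correct=7, total_items=10)
```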

Download link / Web Access

https://sites.google.com/view/predicting-user-knowledge 

Publications

  • Yu, R. , Gadiraju, U. , Holtz, P. , Rokicki, M. , Kemkes, P. and Dietze, S. Predicting User Knowledge Gain in Informational Search Sessions. 41st International ACM SIGIR Conference on Research and Development in Information Retrieval, 2018.

  • Gadiraju, U. , Yu, R. , Dietze, S. and Holtz, P.  Analyzing Knowledge Gain of Users in Informational Search Sessions on the Web. 2018 ACM on Conference on Human Information Interaction and Retrieval (CHIIR), 2018

Team

Applications & demos


GESIS Research Graph

Description

The GESIS Research Graph is a trusted research graph that makes connections between high-value collections (research datasets) and other scholarly works, such as publications and grants, discoverable. For building the graph, the Switchboard software, developed by the Data Description Registry Interoperability (DDRI) WG of the Research Data Alliance (RDA), has been implemented to aggregate, connect and publish the research information from GESIS data collections.

Link to source code (GitLab)

https://github.com/researchgraph

Link to demo / prototype

http://researchgraph.org/gesis/

Team

Opening Scholarly Communication in the Social Sciences (OSCOSS)

Description

Scholarly communication in the social sciences is centered around publications, in which data also play a key role. The increasingly collaborative scientific process, from a project plan, to collecting data, to interpreting them in a paper and submitting it for peer review, to publishing an article, to, finally, its consumption by readers, is insufficiently supported by contemporary information systems. They support every individual step, but media discontinuities between steps cause inefficiency and loss of information: word processors lack direct access to data; reviewers cannot provide feedback inside the environment in which authors revise their papers; open access web publishing is constrained to document formats designed for paper printing, neglecting the Web's accessibility and interactivity potential; finally, readers, seeing a single frozen view of the underlying data in a paper, are unable to access the full extent of the data and to make observations beyond the restricted scope chosen by the author.
With the collaborative document editor Fidus Writer and the Open Journal Systems we chose a stable technical foundation. We secure user acceptance by respecting the characteristics of the traditional processes social scientists are used to: web publications must have the same high-quality layout as print publications, and information must remain citable by stable page numbers. To ensure we meet these requirements, we will work closely with the publishers of methods, data, analyses (mda) and Historical Social Research (HSR), two international peer-reviewed open access journals published by GESIS, and build early demonstrators for usability evaluation.
OSCOSS is funded by the DFG in the Open Access Transformation programme.

Link to source code (GitLab)

https://github.com/OSCOSS

https://github.com/fiduswriter

Link to demo / prototype

https://fiduswriter.gesis.org/

Publications

  • Sadeghi, A., Capadisli, S., Wilm, J., Lange, C., & Mayr, P. (2019). Opening and Reusing Transparent Peer Reviews with Automatic Article Annotation. Publications, 7(1). doi.org/10.3390/publications7010013
  • Mayr, P., & Lange, C. (2017). The Opening Scholarly Communication in Social Sciences project OSCOSS. In P. Hauke, A. Kaufmann, & V. Petras (Eds.), Bibliothek – Forschung für die Praxis. Festschrift für Konrad Umlauf zum 65. Geburtstag (pp. 433–444). De Gruyter. Retrieved from arxiv.org/abs/1611.04760.
  • Sadeghi, A., Wilm, J., Mayr, P., & Lange, C. (2017). Opening Scholarly Communication in Social Sciences by Connecting Collaborative Authoring to Peer Review. Information - Wissenschaft & Praxis

Team

EXCITE – Extraction of Citations from PDF Documents

Description

The EXCITE project, jointly run by WeST (Institute for Web Science and Technologies, University of Koblenz-Landau) in Koblenz and GESIS (Leibniz Institute for the Social Sciences) in Cologne, is funded by the Deutsche Forschungsgemeinschaft (DFG) with the aim of extracting citations from social science publications and making more citation data available to researchers. With respect to this objective, a set of algorithms for information extraction and matching has been developed, focusing on social science publications in the German language. EXCITE provides different online services to extract and segment citations. Moreover, other online tools are available to create more gold standard data.

The demo is a toolchain of citation extraction software with a particular focus on the German-language social sciences, provided as a public service of the project. Behind the scenes, this page uses CERMINE for extracting content from PDF files and Exparser for reference string extraction and segmentation.

Download link / Web Access

http://excite.west.uni-koblenz.de/website/

Link to demo / prototype

http://excite.west.uni-koblenz.de/excite

Link to source code (GitLab)

https://github.com/exciteproject

Publications

  • Körner, M., Ghavimi, B., Mayr, P., Hartmann, H., & Staab, S. (2017). Evaluating Reference String Extraction Using Line-Based Conditional Random Fields: A Case Study with German Language Publications. In M. Kirikova, K. Nørvåg, G. A. Papadopoulos, J. Gamper, R. Wrembel, J. Darmont, & S. Rizzi (Eds.), New Trends in Databases and Information Systems (Vol. 767, pp. 137–145). Cham: Springer International Publishing. doi.org/10.1007/978-3-319-67162-8_15

Team

GWSBeta

Description

In the social sciences, researchers search for information on the Web, but it is most often distributed across different websites, search portals, digital libraries, data archives, and databases. GESIS Search is an integrated search system for social science information that allows finding information around research data in one digital library. Users can search for research data sets, publications, survey variables, questions from questionnaires, and survey instruments and tools. Information items are linked to each other so that users can see, for example, which publications contain data citations to research data. The integration and linking of different kinds of information increases their visibility, making it easier for researchers to find information for re-use.

Features

  • Integrated search over research data sets, publications, survey variables, questions from questionnaires, survey instruments and tools

  • Links between information items

Download link / Web Access

https://searchtest.gesis.org

Publications

  • Daniel Hienert, Dagmar Kern, Katarina Boland, Benjamin Zapilko, Peter Mutschke. (to appear). "A Digital Library for Research Data and Related Information in the Social Sciences." In Proceedings of JCDL 2019.

Team

Mining Acknowledgement Texts in Web of Science (MinAck)

Description

The focus of the MinAck project is the detection and quantitative analysis of acknowledged entities using the Flair NLP framework. We trained a named entity recognition (NER) model and applied it to a larger corpus of Web of Science (WoS) articles that include acknowledgements.

The NER model was trained on a dataset containing over 600 annotated sentences from acknowledgement texts written in scientific articles stored in WoS. The training was performed using the NER model with Flair embeddings. The Flair embeddings model uses stacked embeddings, i.e., a combination of contextual string embeddings with GloVe.

Our NER model (see datasets below) is able to recognize 6 entity types: funding agencies (FUND), corporations (COR), universities (UNI), individuals (IND), grant numbers (GRNB) and miscellaneous (MISC).
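Once a tagger (such as the Flair-based model above) has produced (span, label) predictions, they can be aggregated by the six entity types. A small post-processing sketch with made-up predictions:

```python
from collections import Counter

ENTITY_TYPES = {"FUND", "COR", "UNI", "IND", "GRNB", "MISC"}

def count_entities(predictions):
    """Aggregate (span_text, label) predictions by the six MinAck types."""
    return Counter(label for _text, label in predictions
                   if label in ENTITY_TYPES)

# Made-up predictions for illustration (the grant number is fictitious):
preds = [("Deutsche Forschungsgemeinschaft", "FUND"),
         ("AB 1234/5-6", "GRNB"),
         ("University of Rostock", "UNI")]
entity_counts = count_entities(preds)
```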

Download link/ Web Access

Website: https://kalawinka.github.io/minack/

Datasets: https://doi.org/10.5281/zenodo.5776202

Online demo

You can try our NER tagger demo by following this link: https://mybinder.org/v2/gh/kalawinka/minack/main?labpath=example_model.ipynb

This demo is an interactive notebook built with Jupyter Notebook and Binder. Two options are available: you can try the model with our example acknowledgement, or you can type in your own acknowledgement text. To use the demo, launch one cell after another and follow the instructions written in the notebook.

Tools & pipelines


InFoLiS - Integration of research data and literature in the social sciences

Description

The goal of the InFoLiS project was to connect research data and publications. In this context, a tool has been created which identifies and extracts citations of research data in scientific publications. These citations are used for generating links between these datasets and publications. The generated links can be made available for a seamless integration into different retrieval systems. All services for link creation are publicly usable as web services.

Download link / Web Access

http://infolis.github.io/

Link to source code (GitLab)

https://github.com/infolis

Publications

  • Boland, K. & Mathiak, B. (2013). Connecting Literature and Research Data. In IASSIST 2013 - Data Innovation: Increasing Accessibility, Visibility, and Sustainability, Cologne, Germany, May 29-31, 2013.
  • Boland, K.; Ritze, D.; Eckert, K.; Mathiak, B. (2012): Identifying references to datasets in publications. In: Zaphiris, P.; Buchanan, G.; Rasmussen, E.; Loizides, F. (Hrsg.): Proceedings of the Second International Conference on Theory and Practice of Digital Libraries (TPDL 2012), S.150-161, 2012.
  • Mathiak, B.; Boland K. (2015): Challenges in Matching Dataset Citation Strings to Datasets in Social Science. D-Lib Magazine 21 (1/2). doi.org/10.1045/january2015-mathiak
  • Ritze, D.; Boland, K. (2013): Integration of Research Data and Research Data Links into Library Catalogues. Proceedings of the International Conference on Dublin Core and Metadata Applications (DC 2013), 2013.

Team

WHOSE

Description

WHOSE is a framework for the analysis of search behavior of real users in different environments and different domains based on log data. The logging component can easily be integrated into real-world IR systems for generating and analyzing new log data. Furthermore, due to a supplementary mapping it is also possible to analyze existing log data. For every IR system different actions and filters can be defined. This allows system operators and researchers to use the framework for the analysis of user search behavior in their IR systems and to compare it with others. Using a graphical user interface they have the possibility to interactively explore the data set from a broad overview down to individual sessions.
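The mapping of raw log data to defined actions can be sketched as a list of pattern rules applied to logged URLs. The rules below are illustrative, since WHOSE lets each operator define their own actions and filters:

```python
import re

# Illustrative action definitions (name, URL pattern); WHOSE lets system
# operators define their own actions and filters.
ACTION_RULES = [
    ("search", re.compile(r"/search\?q=")),
    ("view_document", re.compile(r"/document/\d+")),
    ("select_facet", re.compile(r"[?&]facet=")),
]

def map_action(url):
    """Map a logged URL to the first matching action, or 'other'."""
    for name, pattern in ACTION_RULES:
        if pattern.search(url):
            return name
    return "other"

actions = [map_action(u) for u in
           ["/search?q=migration", "/document/123", "/about"]]
```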

Features

  • Logging interaction data
  • Mapping of existing log data to user actions
  • Visualizing user actions
  • Interactively explore behavioral data within the GUI

Link to source code (GitLab)

https://git.gesis.org/iir/whole-session-evaluation-framework

Publications

  • Hienert, Daniel, Wilko van Hoek, Alina Weber, and Dagmar Kern. 2015. "WHOSE – A Tool for Whole-Session Analysis in IIR." In Advances in Information Retrieval: 37th European Conference on IR Research, ECIR 2015, Vienna, Austria, March 29 - April 2, 2015. Proceedings, Lecture Notes in Computer Science 9022, 172-183. Springer. arxiv.org/abs/1504.06961.

Team

Reading Protocol

Description

In Interactive Information Retrieval (IIR) experiments, the user's gaze motion on web pages is often recorded with eye tracking. The data is used to analyze gaze behavior or to identify Areas of Interest (AOIs) the user has looked at. The Reading Protocol software breaks eye tracking data down to the textual level by considering the HTML structure of the web pages. This has several advantages for the analyst. First and foremost, it can easily be identified, on a large scale, what has actually been viewed and read on the stimulus pages by the subjects. Second, the web page structure can be used to filter by AOIs. Third, gaze data of multiple users can be presented on the same page, and fourth, fixation times on text can be exported and further processed in other tools.
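The core idea of mapping gaze data to text can be sketched as a point-in-box test against word bounding boxes derived from the rendered HTML. This is a simplification of what the software actually does, with made-up coordinates:

```python
# Word bounding boxes from the rendered page: (word, x, y, width, height).
# Coordinates are made up for illustration.
word_boxes = [("reading", 10, 10, 60, 16), ("protocol", 75, 10, 70, 16)]

def word_at(x, y, boxes):
    """Return the word whose bounding box contains the fixation point."""
    for word, bx, by, width, height in boxes:
        if bx <= x <= bx + width and by <= y <= by + height:
            return word
    return None  # fixation outside any word, e.g. on whitespace or images

fixations = [(20, 18), (100, 15), (300, 200)]
fixated_words = [word_at(x, y, word_boxes) for x, y in fixations]
```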

Features

  • Breaking down eye-tracking data to the word level
  • Interactively explore word-eye-fixations in the GUI

Link to demo / prototype:

http://vizgr.org/reading_protocol/

Link to source code (GitLab)

https://git.gesis.org/iir/reading-protocol

Publications

  • Hienert, Daniel, Dagmar Kern, Matthew Mitsui, Chirag Shah, and Nicholas J. Belkin. 2019. "Reading Protocol: Understanding what has been read in Interactive Information Retrieval Tasks." In CHIIR '19 Proceedings of the 2019 Conference on Human Information Interaction and Retrieval, 73-81. New York: ACM. doi: http://dx.doi.org/10.1145/3295750.3298921.

Team

Preference-based Search

Description

Finding a product online can be a challenging task for users. Faceted search interfaces, often in combination with recommenders, can support users in finding a product that fits their preferences. However, those preferences are not always equally weighted: some might be more important to a user than others (e.g. red is the favorite color, but blue is also fine) and sometimes preferences are even contradictory (e.g. the lowest price vs. the highest performance). Often, there is even no product that meets all preferences. In those cases, faceted search interfaces reach their limits. In this project, we investigate the potential of a search interface, which allows a preference-based ranking based on weighted search and facet terms. 
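The preference-based ranking idea can be sketched as a weighted score: each (facet, value) preference carries a weight, and products are ranked by the sum of the weights they satisfy instead of being filtered out. A minimal illustration, not the project's actual implementation:

```python
def preference_score(product, preferences):
    """Weighted sum of satisfied (facet, value) preferences; products that
    miss some preferences are ranked lower instead of being filtered out."""
    return sum(weight for (facet, value), weight in preferences.items()
               if product.get(facet) == value)

# Red is the favorite color (weight 1.0), but blue is also fine (0.5):
preferences = {("color", "red"): 1.0, ("color", "blue"): 0.5,
               ("price", "low"): 0.8}
products = [
    {"name": "A", "color": "blue", "price": "low"},   # score 1.3
    {"name": "B", "color": "red", "price": "high"},   # score 1.0
]
ranked = sorted(products,
                key=lambda p: preference_score(p, preferences), reverse=True)
```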

Features

  • A search system with user frontend which allows adjusting the preferences for different facets.

Link to source code (GitLab)

https://git.gesis.org/iir/preferenced-based-search

Publications

  • Kern, Dagmar, Wilko van Hoek, and Daniel Hienert. 2018. "Evaluation of a Search Interface for Preference-Based Ranking - Measuring User Satisfaction and System Performance." In NordiCHI '18 Proceedings of the 10th Nordic Conference on Human-Computer Interaction, 184-194. New York: ACM. doi: http://dx.doi.org/10.1145/3240167.3240170.

Team

Variable Detection and Linking

Description

In the OpenMinTeD project (http://openminted.eu/), methods have been investigated and developed to identify mentions of survey variables in scientific publications. The variable detection and disambiguation method developed has been tested on a subset of variables.

Download link / Web Access

https://services.openminted.eu/landingPage/application/51d1f81b-aa0f-4675-bb87-8c720779e949 

Link to source code (GitLab)

https://github.com/openminted/uc-tdm-socialsciences 

Publications

  • Zielinski, Andrea, and Peter Mutschke. 2018. "Towards a Gold Standard Corpus for Variable Detection and Linking in Social Science Publications." In Proceedings of LREC 2018
  • Zielinski, Andrea, and Peter Mutschke. 2017. "Mining Social Science Publications for Survey Variables." In Proceedings of the Second Workshop on Natural Language Processing and Computational Social Science, Vancouver, Canada, August 3, 2017, edited by Dirk Hovy, Svitlana Volkova, and David Bamman, 47–52. Association for Computational Linguistics. aclweb.org/anthology/W17-29.

Team

ReshapeRDF

Description

ReshapeRDF is a CLI tool that provides versatile functionality to inspect and reshape large RDF dumps. It is designed to interact closely with the Unix CLI tool set.

Processing RDF data at scale can be an error-prone job. Common triple stores offer certain functionality for querying and manipulating RDF data, but only few can efficiently handle mass data (say, more than 200 million statements). Typical operations like data import and SPARQL (update) queries tend to be time-consuming and inconvenient to use in comprehensive reshaping operations.

Thus, when working with moderately structured graph data, a solution can be to refrain from using a triple store and to work with dump files instead. Recurring reshaping tasks include extracting entities of a certain class from a large dataset, subdividing a dataset into blocks according to a certain property (blocking), filtering the data, extracting and removing resources and statements, renaming properties, and similar operations.

The tool at hand allows for these operations. ReshapeRDF's working principle is to process RDF data as a stream of N-Triples (lines), which allows it to be used in Unix tool set environments and for scripting.

An early version of this tool was used in the linked.swissbib.ch project.
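The line-stream working principle can be sketched in a few lines of Python: because each N-Triples statement is one line, filtering by predicate reduces to line-wise string handling and composes with Unix tools. This is an illustration of the principle, not ReshapeRDF's actual code:

```python
def filter_by_predicate(lines, predicate_iri):
    """Yield only the N-Triples lines with the given predicate. Streaming
    line by line keeps memory flat and composes with Unix pipelines."""
    needle = "<%s>" % predicate_iri
    for line in lines:
        parts = line.split(None, 2)  # subject, predicate, rest
        if len(parts) == 3 and parts[1] == needle:
            yield line

triples = [
    '<http://ex.org/a> <http://xmlns.com/foaf/0.1/name> "Alice" .',
    '<http://ex.org/a> <http://ex.org/age> "30" .',
]
names = list(filter_by_predicate(triples, "http://xmlns.com/foaf/0.1/name"))
```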

Features

  • Convert between various RDF formats

  • Remove duplicates

  • Rename properties

  • Split datasets

  • Merge datasets

  • Sort datasets

  • Extract resources, statements, subjects, predicates, objects by patterns

  • Extract resources by list

  • Filter resources by patterns

  • Filter resources according to a list

  • Close interaction with Unix tool set

  • ... Refer to our Guide for more information

Link to source code (GitLab)

https://git.gesis.org/bensmafx/reshapeRDF 

Publications

  • Bensmann, Felix, Benjamin Zapilko, and Philipp Mayr. 2017. "Interlinking Large-scale Library Data with Authority Records." Frontiers in Digital Humanities 4 (5): 1-13. doi: dx.doi.org/10.3389/fdigh.2017.00005.

Team