WTS Research Labs

Welcome to the labs page of the department Knowledge Technologies for the Social Sciences (WTS) of GESIS. WTS conducts research in applied computer science, in particular in areas such as information retrieval, information extraction and NLP, semantic technologies, and human-computer interaction, in order to innovate digital services and research data infrastructures for the social sciences. Here you will find reusable outcomes of our recent research and development projects, such as research datasets, applications and demos, and tools and pipelines.

Feel free to explore our technologies and get in touch with us.

Research datasets

Sowiport User Search Sessions Data Set (SUSS)

Description

This data set contains individual search sessions from the transaction log of the academic search engine sowiport (www.sowiport.de). The data was collected over a period of one year (between 2nd April 2014 and 2nd April 2015). Web server log files and specific JavaScript-based logging techniques were used to capture usage behaviour within the system. All activities are mapped to a list of 58 actions. This list covers all types of activities that can be carried out and pages that can be visited within the system (e.g. typing a query, visiting a document, selecting a facet). For each action, a session id, the date stamp and additional information (e.g. queries, document ids, and result lists) are stored. The session id is assigned via a browser cookie and allows tracking user behaviour over multiple searches. Based on the session id and date stamp, the step in which an action is conducted and the length of the action are included in the data set as well. The data set contains 558,008 individual search sessions and a total of 7,982,427 log entries. The average number of actions per search session is 7.
The dataset 'SUSS-16-17' is a follow-up to the Sowiport User Search Sessions Data Set (SUSS).
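
As a rough illustration of how a step number and an action length could be derived from the session id and date stamp, consider the following minimal Python sketch. The field names (`session_id`, `timestamp`, `action`) and the sample rows are hypothetical and do not reflect the actual SUSS file layout; the sketch also assumes that an action lasts until the next action of the same session begins.

```python
from collections import defaultdict
from datetime import datetime

def add_step_and_length(log_entries):
    """Annotate each entry with its step in the session and its length in seconds.

    Assumption: an action lasts until the next action of the same session starts;
    the last action of a session gets length None.
    """
    sessions = defaultdict(list)
    for entry in log_entries:
        sessions[entry["session_id"]].append(entry)

    annotated = []
    for entries in sessions.values():
        entries.sort(key=lambda e: e["timestamp"])
        for i, entry in enumerate(entries):
            nxt = entries[i + 1] if i + 1 < len(entries) else None
            length = (nxt["timestamp"] - entry["timestamp"]).total_seconds() if nxt else None
            annotated.append({**entry, "step": i + 1, "length": length})
    return annotated

# Hypothetical sample rows, not taken from the data set
sample_log = [
    {"session_id": "s1", "timestamp": datetime(2014, 4, 2, 10, 0, 0), "action": "query"},
    {"session_id": "s1", "timestamp": datetime(2014, 4, 2, 10, 0, 30), "action": "view_document"},
    {"session_id": "s2", "timestamp": datetime(2014, 4, 2, 11, 0, 0), "action": "query"},
]
rows = add_step_and_length(sample_log)
```

In the published data set these values are already included, so a computation like this would only be needed when working with comparable raw logs.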

Link to source code (GitLab)

https://git.gesis.org/amur/SUSS-16-17

Download link / Web Access

http://dx.doi.org/10.7802/1380

Publications

  • Kacem, A., & Mayr, P. (2018). Analysis of Search Stratagem Utilisation. Scientometrics, 116(2), 1383–1400. doi.org/10.1007/s11192-018-2821-8
  • Hienert, D., Sawitzki, F., & Mayr, P. (2015). Digital Library Research in Action – Supporting Information Retrieval in Sowiport. D-Lib Magazine, 21(3/4). doi.org/10.1045/march2015-hienert

Team

Name | Phone
Dr. Philipp Mayr | +49 (221) 47694-533

German Bundestag Elections 2013: Twitter usage by electoral candidates

Description

The data is the result of a research project at GESIS which aimed to explore social media communication related to the election of the German parliament on September 22nd, 2013. The data includes tweets by candidates and a file that describes the key attributes of the candidates and lists their Twitter and Facebook accounts. Tweets were collected for candidates of all covered parties except the AfD. All data was publicly available at the time of data collection. Cases in which a Twitter or Facebook account was not used as part of the role as a candidate (i.e., private accounts and accounts merely used for private postings) were not included. For legal reasons, only the following data can be shared: (1) A list of all candidates that were considered in the project, their key attributes and, if available, the identification of their Twitter and Facebook accounts. (2) A list of Tweet-IDs which can be used to retrieve the original tweets that the candidates posted between June and December 2013. It includes the Tweet-ID and an ID identifying the candidate. The data describing the candidates includes variables with the following content: a sequential number, name of candidate, first name, party membership ("AfD", "CDU", "CSU", "Die LINKE", "FDP", "GRUENE", "PIRATEN", "SPD"), state (e.g. "Bayern"), is listed (yes, no), is direct candidate (yes, no), constituency (e.g., "Aachen I"), has facebook account (yes, no), facebook_link, has twitter account (yes, no), twitter_screenname and variables on the frequency of Twitter use.

Download link / Web Access

http://dx.doi.org/10.4232/1.12319

Publications

  • Kaczmirek, L., Mayr, P., Vatrapu, R., Bleier, A., Blumenberg, M., Gummer, T., … Wolf, C. (2014). Social Media Monitoring of the Campaigns for the 2013 German Bundestag Elections on Facebook and Twitter. Retrieved from www.gesis.org/fileadmin/upload/forschung/publikationen/gesis_reihen/gesis_arbeitsberichte/WorkingPapers_2014-31.pdf
  • Mayr, P., & Weller, K. (2017). Think Before You Collect: Setting Up a Data Collection Approach for Social Media Studies. In L. Sloan & A. Quan-Haase (Eds.), The SAGE Handbook of Social Media Research Methods (pp. 107–124). London: SAGE Publications Ltd.

Team

Name | Phone
Dr. Philipp Mayr | +49 (221) 47694-533

ClaimsKG - A knowledge graph of annotated claims

Description

ClaimsKG is a knowledge graph of linked annotated claims harvested from fact-checking websites. The KG facilitates structured queries about claims, their truth values or other kinds of metadata. ClaimsKG is generated through a (semi-)automated pipeline which harvests claims and respective metadata from popular fact-checking sites on a regular basis, lifts data into an RDF/S model, which exploits established schema such as schema.org and NIF, and annotates claims with related entities from DBpedia.

In summary, we provide (1) a data model for representing claims, (2) a pipeline for crawling and extracting claims from fact-checking websites, (3) a set of open-source tools for data extraction and lifting following the introduced model, which all are applied to provide (4) an openly available dynamic large-scale knowledge base of claims and associated metadata.

Link to source code (GitHub)

https://github.com/claimskg

Download link / Web Access

Website: https://data.gesis.org/claimskg/site
SPARQL endpoint: https://data.gesis.org/claimskg/sparql 
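
To illustrate how the SPARQL endpoint might be addressed programmatically, here is a minimal Python sketch that builds (but does not send) a GET request following the SPARQL 1.1 Protocol. The query is deliberately generic and assumes nothing about the ClaimsKG data model; the function name `build_sparql_request` is made up for this example, and whether the endpoint accepts this exact request form has not been verified here.

```python
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "https://data.gesis.org/claimskg/sparql"  # endpoint listed above

# Generic query: lists a few classes used in the graph without
# assuming any particular vocabulary.
QUERY = "SELECT DISTINCT ?type WHERE { ?s a ?type } LIMIT 10"

def build_sparql_request(endpoint, query):
    """Build (but do not send) a SPARQL 1.1 Protocol GET request."""
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for the standard SPARQL JSON results format.
    return Request(url, headers={"Accept": "application/sparql-results+json"})

request = build_sparql_request(ENDPOINT, QUERY)
```

Passing `request` to `urllib.request.urlopen` would execute the query; the JSON response could then be parsed with the `json` module.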

Team

Name | Phone
Dr. Benjamin Zapilko (contact) | +49 (221) 47694-515
Prof. Dr. Stefan Dietze | +49 (221) 47694-421
Matthäus Zloch M.Sc. | +49 (221) 47694-534
M.A. Katarina Boland | +49 (221) 47694-513

lodcc

Description

As the availability and the inter-connectivity of RDF datasets grow, so does the necessity to understand the structures of the data. Understanding the topology of RDF graphs can guide and inform the development of e.g. synthetic dataset generators, sampling methods, index structures, or query optimizers. This work proposes two resources: (i) a software framework able to acquire, prepare, and perform a graph-based analysis on the topology of large RDF graphs, and (ii) results on a graph-based analysis of 280 datasets from the LOD Cloud with values for 28 graph measures computed with the framework.
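
To give a flavour of what a graph-based analysis of an RDF graph involves, the following minimal Python sketch computes a few simple degree-based measures by treating every triple as a directed edge from subject to object. The selection of measures and the function name `degree_measures` are illustrative assumptions; this is not the lodcc framework's implementation, and it does not reproduce its 28 measures.

```python
from collections import Counter

def degree_measures(triples):
    """Treat each (subject, predicate, object) triple as a directed edge s -> o."""
    out_deg, in_deg, vertices = Counter(), Counter(), set()
    for s, _p, o in triples:
        out_deg[s] += 1
        in_deg[o] += 1
        vertices.update((s, o))
    n, m = len(vertices), len(triples)
    return {
        "vertices": n,
        "edges": m,
        "max_out_degree": max(out_deg.values(), default=0),
        "max_in_degree": max(in_deg.values(), default=0),
        "mean_degree": m / n if n else 0.0,  # average out-degree (= average in-degree)
    }

# Toy graph with made-up identifiers
toy_triples = [
    ("ex:a", "ex:knows", "ex:b"),
    ("ex:a", "ex:knows", "ex:c"),
    ("ex:b", "ex:knows", "ex:c"),
]
measures = degree_measures(toy_triples)
```

For graphs of LOD Cloud size, an in-memory approach like this would likely be too slow, which is one reason a dedicated framework is useful.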

Link to demo / prototype

https://data.gesis.org/lodcc/2017-08/

Link to source code (GitLab)

https://git.gesis.org/matthaeus/lodcc

Publications

  • Zloch, M., Acosta, M., Hienert, D., Dietze, S., & Conrad, S. (2019). A Software Framework and Datasets for the Analysis of Graph Measures on RDF Graphs. In ESWC 2019, Portoroz, Slovenia, 2-4 June 2019 (to be published).

Team

Name | Phone
Matthäus Zloch M.Sc. (contact) | +49 (221) 47694-534
Dr. Daniel Hienert | +49 (221) 47694-525

An Open Testbed for Author Name Disambiguation Evaluation

Description

We identified 5,408 authors in DBLP who have a unique identification number. These 5,408 authors and their publications form the gold standard. The figures are based on DBLP data downloaded on 1 May 2015.

Download link / Web Access

http://dx.doi.org/10.7802/1234

Publications

  • Momeni, F., & Mayr, P. (2016). Evaluating Co-authorship Networks in Author Name Disambiguation for Common Names. In 20th International Conference on Theory and Practice of Digital Libraries (TPDL 2016) (pp. 386–391). doi.org/10.1007/978-3-319-43997-6_31

Team

Name | Phone
Dr. Philipp Mayr (contact) | +49 (221) 47694-533
Fakhri Momeni | +49 (221) 47694-544

Sowiport user queries sample (SQS)

Description

This data set contains a random sample of 1,800 user queries taken from the transaction log of the academic search engine sowiport (www.sowiport.de). The queries (mainly German query terms) were extracted from a larger set of randomly chosen user sessions recorded in sowiport between September 1st 2014 and March 1st 2015. To reduce noise in the data set, we selected sessions in which at least two different searches were conducted and at least one document was clicked on. In addition, we excluded all searches for numbers (mostly ISSN numbers). The selected queries were sorted randomly and manually assessed by a domain expert. The random order was introduced to reduce potential biases in the assessment: when multiple queries from one session are assessed in sequence, the previous queries might influence the decision for the following ones, as they are then evaluated within a context. The 1,800 user queries were categorized into the 29 facets of the subject categories used in the Thesaurus Social Sciences (TheSoz, see lod.gesis.org/thesoz/en.html and sowiport.gesis.org/thesaurus). Multiple facets could be assigned to a query, as some queries cover multiple topics.
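
The selection steps described above can be sketched in a few lines of Python. The session structure (`queries`, `clicks`), the regular expression used to recognise numeric searches, and the sample sessions are assumptions made for this illustration; the actual extraction procedure may have differed in its details.

```python
import random
import re

# Assumption: a "search for a number" is a query made up of digits, spaces and
# hyphens, optionally ending in X (as in an ISSN check digit).
NUMERIC_QUERY = re.compile(r"\d[\d\s\-]*[\dxX]?")

def select_queries(sessions, seed=0):
    """Keep queries from sessions with >= 2 different searches and >= 1 click."""
    selected = []
    for session in sessions:
        if len(set(session["queries"])) < 2 or session["clicks"] < 1:
            continue
        selected.extend(q for q in session["queries"] if not NUMERIC_QUERY.fullmatch(q))
    # Random order, so that queries from one session are not assessed together.
    random.Random(seed).shuffle(selected)
    return selected

# Hypothetical sessions, not taken from the data set
sessions = [
    {"queries": ["arbeitsmarkt", "arbeitsmarkt jugend"], "clicks": 2},  # kept
    {"queries": ["migration"], "clicks": 1},                            # only one search
    {"queries": ["familie", "0340-0425"], "clicks": 1},                 # kept, number dropped
    {"queries": ["wahlen", "wahlen 2013"], "clicks": 0},                # no document click
]
sample = select_queries(sessions)
```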

Download link / Web Access

http://dx.doi.org/10.7802/1372

Publications

  • Hienert, D., Sawitzki, F., & Mayr, P. (2015). Digital Library Research in Action – Supporting Information Retrieval in Sowiport. D-Lib Magazine, 21(3/4). doi.org/10.1045/march2015-hienert

Contact Person

Name | Phone
Dr. Philipp Mayr | +49 (221) 47694-533

LRMI Datasets

Description

The LRMI dataset provides a research corpus to investigate the spread, adoption and context of learning resources and related metadata on the Web. It contains (schema.org) Web markup of learning resources extracted from the Web Data Commons, which is based on the Common Crawl, for the years 2013-2015. For this purpose, all markup that contains or co-occurs with any LRMI vocabulary term was extracted; LRMI provides the schema.org vocabulary for the markup of educational resources. For the year 2015, this resulted in more than 44 million markup statements extracted from 1.82 billion web pages. In order to improve data quality, heuristics for data cleansing were applied.

Download link / Web Access

Publications

  • Taibi, D., Dietze, S., Towards embedded markup of learning resources on the Web: a quantitative Analysis of LRMI Terms Usage, in Companion Publication of the IW3C2 WWW 2016 Conference, IW3C2 2016, Montreal, Canada, April 11, 2016

  • Dietze, S., Taibi, D., Yu, R., Barker, P., d’Aquin, M., Analysing and Improving embedded Markup of Learning Resources on the Web, 26th International World Wide Web Conference (WWW2017), full research paper at Digital Learning track, Perth, Australia, April 2017.

Team

Name | Phone
Ran Yu (contact) | +49 (221) 47694-483
Prof. Dr. Stefan Dietze | +49 (221) 47694-421

SAL - Search log with user knowledge assessment data

Description

This dataset includes 1,100 search sessions conducted by crowd workers, spanning 11 information needs on different topics randomly selected from the TREC 2014 Web Track 2 dataset. It includes knowledge assessment data from before and after each of the 100 search sessions per information need.

Features

Search session log, user knowledge test for a specific topic before and after a search session

Download link / Web Access

https://sites.google.com/view/predicting-user-knowledge 

Publications

  • Yu, R., Gadiraju, U., Holtz, P., Rokicki, M., Kemkes, P. and Dietze, S. Predicting User Knowledge Gain in Informational Search Sessions. 41st International ACM SIGIR Conference on Research and Development in Information Retrieval, 2018.

  • Gadiraju, U., Yu, R., Dietze, S. and Holtz, P. Analyzing Knowledge Gain of Users in Informational Search Sessions on the Web. 2018 ACM on Conference on Human Information Interaction and Retrieval (CHIIR), 2018.

Team

Name | Phone
Ran Yu (contact) | +49 (221) 47694-483
Prof. Dr. Stefan Dietze | +49 (221) 47694-421

TweetsKB

Description

TweetsKB is a public RDF corpus of anonymized data for a large collection of annotated tweets. The dataset currently contains data for more than 1.5 billion tweets, spanning more than 5 years (February 2013 - March 2018). Metadata information about the tweets as well as extracted entities, sentiments, hashtags and user mentions are exposed in RDF using established RDF/S vocabularies. For the sake of privacy, we encrypt the tweet IDs and usernames, and we do not provide the text of the tweets.

Download link / Web Access

Website: https://data.gesis.org/tweetskb/
SPARQL endpoint: https://data.gesis.org/tweetskb/sparql (Graph URI: https://data.gesis.org/tweetskb/)

Publications

  • P. Fafalios, V. Iosifidis, E. Ntoutsi, and S. Dietze, TweetsKB: A Public and Large-Scale RDF Corpus of Annotated Tweets, 15th Extended Semantic Web Conference (ESWC'18), Heraklion, Crete, Greece, June 3-7, 2018.

Team

Name | Phone
Prof. Dr. Stefan Dietze (contact) | +49 (221) 47694-421
Matthäus Zloch M.Sc. | +49 (221) 47694-534
Felix Bensmann | +49 (221) 47694-524

Applications & demos

GESIS Research Graph

Description

The GESIS Research Graph is a trusted research graph that makes the connections between high-value collections (research datasets) and other scholarly works such as publications and grants discoverable. To build the graph, the Switchboard software, developed by the Data Description Registry Interoperability (DDRI) WG of the Research Data Alliance (RDA), has been implemented to aggregate, connect and publish the research information from GESIS data collections.

Link to source code (GitHub)

https://github.com/researchgraph

Link to demo / prototype

http://researchgraph.org/gesis/

Team

Name | Phone
Dr. Benjamin Zapilko | +49 (221) 47694-515

Opening Scholarly Communication in the Social Sciences (OSCOSS)

Description

Scholarly communication in the social sciences is centered around publications, in which data also play a key role. The increasingly collaborative scientific process, from a project plan, to collecting data, to interpreting them in a paper and submitting it for peer review, to publishing an article, to, finally, its consumption by readers, is insufficiently supported by contemporary information systems. They support every individual step, but media discontinuities between steps cause inefficiency and loss of information: word processors lack direct access to data; reviewers cannot provide feedback inside the environment in which authors revise their papers; open access web publishing is constrained to document formats designed for paper printing that neglect the Web's accessibility and interactivity potential; finally, readers, seeing a single frozen view of the underlying data in a paper, are unable to access the full extent of the data and to make observations beyond the restricted scope chosen by the author.
With the collaborative document editor Fidus Writer and Open Journal Systems, we build on a stable technical foundation. We secure user acceptance by respecting the characteristics of the traditional processes social scientists are used to: web publications must have the same high-quality layout as print publications, and information must remain citable by stable page numbers. To ensure we meet these requirements, we will work closely with the publishers of methods, data, analyses (mda) and Historical Social Research (HSR), two international peer-reviewed open access journals published by GESIS, and build early demonstrators for usability evaluation.
OSCOSS is funded by the DFG in the Open Access Transformation programme.

Link to source code (GitHub)

https://github.com/OSCOSS

https://github.com/fiduswriter

Link to demo / prototype

https://fiduswriter.gesis.org/

Publications

  • Sadeghi, A., Capadisli, S., Wilm, J., Lange, C., & Mayr, P. (2019). Opening and Reusing Transparent Peer Reviews with Automatic Article Annotation. Publications, 7(1). doi.org/10.3390/publications7010013
  • Mayr, P., & Lange, C. (2017). The Opening Scholarly Communication in Social Sciences project OSCOSS. In P. Hauke, A. Kaufmann, & V. Petras (Eds.), Bibliothek – Forschung für die Praxis. Festschrift für Konrad Umlauf zum 65. Geburtstag (pp. 433–444). De Gruyter. Retrieved from arxiv.org/abs/1611.04760.
  • Sadeghi, A., Wilm, J., Mayr, P., & Lange, C. (2017). Opening Scholarly Communication in Social Sciences by Connecting Collaborative Authoring to Peer Review. Information - Wissenschaft & Praxis

Team

Name | Phone
Dr. Philipp Mayr (contact) | +49 (221) 47694-533
Fakhri Momeni | +49 (221) 47694-544

EXCITE – Extraction of Citations from PDF Documents

Description

The EXCITE project, jointly run by WeST (Institute for Web Science and Technologies, University of Koblenz-Landau) in Koblenz and GESIS (Leibniz Institute for Social Sciences) in Cologne, is funded by the Deutsche Forschungsgemeinschaft (DFG) with the aim of extracting citations from social science publications and making more citation data available to researchers. To this end, a set of algorithms for information extraction and matching has been developed, focusing on social science publications in the German language. EXCITE provides different online services to extract and segment citations. Moreover, other online tools are available to create more gold standard data.

The demo is a toolchain of citation extraction software with a particular focus on the German-language social sciences, offered as a public service of the project. Behind the scenes, it uses CERMINE to extract content from PDF files and Exparser for reference string extraction and segmentation.

Download link / Web Access

http://excite.west.uni-koblenz.de/website/

Link to demo / prototype

http://excite.west.uni-koblenz.de/excite

Link to source code (GitHub)

https://github.com/exciteproject

Publications

  • Körner, M., Ghavimi, B., Mayr, P., Hartmann, H., & Staab, S. (2017). Evaluating Reference String Extraction Using Line-Based Conditional Random Fields: A Case Study with German Language Publications. In M. Kirikova, K. Nørvåg, G. A. Papadopoulos, J. Gamper, R. Wrembel, J. Darmont, & S. Rizzi (Eds.), New Trends in Databases and Information Systems (Vol. 767, pp. 137–145). Cham: Springer International Publishing. doi.org/10.1007/978-3-319-67162-8_15

Team

Name | Phone
Dr. Philipp Mayr (contact) | +49 (221) 47694-533
Behnam Ghavimi | +49 (221) 47694-251
Azam Hosseini | +49 (221) 47694-573

GWSBeta

Description

In the social sciences, researchers search for information on the Web, but this information is most often scattered across different websites, search portals, digital libraries, data archives, and databases. GESIS search is an integrated search system for social science information that allows finding information related to research data in one digital library. Users can search for research data sets, publications, survey variables, questions from questionnaires, survey instruments and tools. Information items are linked to each other so that users can see, for example, which publications contain data citations to research data. The integration and linking of different kinds of information increases their visibility so that it is easier for researchers to find information for re-use.

Features

  • Integrated search over research data sets, publications, survey variables, questions from questionnaires, survey instruments and tools

  • Links between information items

Download link / Web Access

https://searchtest.gesis.org

Publications

  • Daniel Hienert, Dagmar Kern, Katarina Boland, Benjamin Zapilko, Peter Mutschke. (to appear). "A Digital Library for Research Data and Related Information in the Social Sciences." In Proceedings of JCDL 2019.

Team

Name | Phone
Dr. Daniel Hienert (contact) | +49 (221) 47694-525
Dr. Dagmar Kern | +49 (221) 47694-536
M.A. Katarina Boland | +49 (221) 47694-513
Dr. Benjamin Zapilko | +49 (221) 47694-515
Peter Mutschke M.A. | +49 (221) 47694-500

Tools & pipelines

InFoLiS - Integration of research data and literature in the social sciences

Description

The goal of the InFoLiS project was to connect research data and publications. In this context, a tool has been created which identifies and extracts citations of research data in scientific publications. These citations are used for generating links between these datasets and publications. The generated links can be made available for a seamless integration into different retrieval systems. All services for link creation are publicly usable as web services.

Download link / Web Access

http://infolis.github.io/

Link to source code (GitHub)

https://github.com/infolis

Publications

  • Boland, K. & Mathiak, B. (2013). Connecting Literature and Research Data. In IASSIST 2013 - Data Innovation: Increasing Accessibility, Visibility, and Sustainability, Cologne, Germany, May 29-31, 2013.
  • Boland, K.; Ritze, D.; Eckert, K.; Mathiak, B. (2012): Identifying references to datasets in publications. In: Zaphiris, P.; Buchanan, G.; Rasmussen, E.; Loizides, F. (Eds.): Proceedings of the Second International Conference on Theory and Practice of Digital Libraries (TPDL 2012), pp. 150-161, 2012.
  • Mathiak, B.; Boland, K. (2015): Challenges in Matching Dataset Citation Strings to Datasets in Social Science. D-Lib Magazine 21 (1/2). doi.org/10.1045/january2015-mathiak
  • Ritze, D.; Boland, K. (2013): Integration of Research Data and Research Data Links into Library Catalogues. Proceedings of the International Conference on Dublin Core and Metadata Applications (DC 2013), 2013.

Team

Name | Phone
M.A. Katarina Boland (contact) | +49 (221) 47694-513
Dr. Benjamin Zapilko | +49 (221) 47694-515

WHOSE

Description

WHOSE is a framework for the analysis of search behavior of real users in different environments and different domains based on log data. The logging component can easily be integrated into real-world IR systems for generating and analyzing new log data. Furthermore, due to a supplementary mapping it is also possible to analyze existing log data. For every IR system different actions and filters can be defined. This allows system operators and researchers to use the framework for the analysis of user search behavior in their IR systems and to compare it with others. Using a graphical user interface they have the possibility to interactively explore the data set from a broad overview down to individual sessions.

Features

  • Logging interaction data
  • Mapping of existing log data to user actions
  • Visualizing user actions
  • Interactively explore behavioral data within the GUI

Link to source code (GitLab)

https://git.gesis.org/iir/whole-session-evaluation-framework

Publications

  • Hienert, Daniel, Wilko van Hoek, Alina Weber, and Dagmar Kern. 2015. "WHOSE – A Tool for Whole-Session Analysis in IIR." In Advances in Information Retrieval: 37th European Conference on IR Research, ECIR 2015, Vienna, Austria, March 29 - April 2, 2015. Proceedings, Lecture Notes in Computer Science 9022, 172-183. Springer. arxiv.org/abs/1504.06961.

Team

Name | Phone
Dr. Daniel Hienert | +49 (221) 47694-525

Reading Protocol

Description

In Interactive Information Retrieval (IIR) experiments the user’s gaze motion on web pages is often recorded with eye tracking. The data is used to analyze gaze behavior or to identify Areas of Interest (AOI) the user has looked at. The reading protocol software breaks eye tracking data down to the textual level by considering the HTML structure of the web pages. This has a lot of advantages for the analyst. First and foremost, it can easily be identified on a large scale what has actually been viewed and read on the stimuli pages by the subjects. Second, the web page structure can be used to filter to AOIs. Third, gaze data of multiple users can be presented on the same page, and fourth, fixation times on text can be exported and further processed in other tools.
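
The idea of breaking gaze data down to the word level can be illustrated with a minimal Python sketch that sums fixation durations per word, based on word bounding boxes. The data layout (fixations as `(x, y, duration_ms)` tuples, boxes as `(left, top, right, bottom)` in page pixels) and the function name are assumptions for this example; how the Reading Protocol software actually derives word positions from the HTML structure is not shown here.

```python
from collections import defaultdict

def fixation_time_per_word(fixations, word_boxes):
    """Sum fixation durations (ms) for each word whose box contains the gaze point."""
    totals = defaultdict(float)
    for x, y, duration_ms in fixations:
        for word, (left, top, right, bottom) in word_boxes.items():
            if left <= x <= right and top <= y <= bottom:
                totals[word] += duration_ms
                break  # assumption: word boxes do not overlap
    return dict(totals)

# Hypothetical word positions and fixations
word_boxes = {"eye": (0, 0, 30, 12), "tracking": (35, 0, 95, 12)}
fixations = [(10, 5, 200), (50, 6, 150), (60, 4, 100), (200, 200, 300)]
times = fixation_time_per_word(fixations, word_boxes)
```

Fixations that fall outside every word box, like the last one in the sample, are simply ignored in this sketch.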

Features

  • Breaking down eye-tracking data to the word level
  • Interactively explore word-eye-fixations in the GUI

Link to demo / prototype

http://vizgr.org/reading_protocol/

Link to source code (GitLab)

https://git.gesis.org/iir/reading-protocol

Publications

  • Hienert, Daniel, Dagmar Kern, Matthew Mitsui, Chirag Shah, and Nicholas J. Belkin. 2019. "Reading Protocol: Understanding what has been read in Interactive Information Retrieval Tasks." In CHIIR '19 Proceedings of the 2019 Conference on Human Information Interaction and Retrieval, 73-81. New York: ACM. doi: http://dx.doi.org/10.1145/3295750.3298921.

Team

Name | Phone
Dr. Daniel Hienert (contact) | +49 (221) 47694-525
Dr. Dagmar Kern | +49 (221) 47694-536

Preference-based Search

Description

Finding a product online can be a challenging task for users. Faceted search interfaces, often in combination with recommenders, can support users in finding a product that fits their preferences. However, those preferences are not always equally weighted: some might be more important to a user than others (e.g. red is the favorite color, but blue is also fine) and sometimes preferences are even contradictory (e.g. the lowest price vs. the highest performance). Often, there is even no product that meets all preferences. In those cases, faceted search interfaces reach their limits. In this project, we investigate the potential of a search interface, which allows a preference-based ranking based on weighted search and facet terms. 
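
One simple way to picture a preference-based ranking is to score each product by the summed weights of the preferences it satisfies, so that products missing some preferences are ranked lower instead of being filtered out. The following Python sketch shows this under stated assumptions: the scoring function, the facet names and the weights are invented for illustration and are not the ranking function evaluated in the project.

```python
def rank_by_preferences(products, preferences):
    """Rank products by the summed weights of the preferences they satisfy."""
    def score(product):
        return sum(
            weights.get(product.get(facet), 0.0)
            for facet, weights in preferences.items()
        )
    return sorted(products, key=score, reverse=True)

# Hypothetical user preferences: facet -> {facet value: weight}
preferences = {
    "color": {"red": 1.0, "blue": 0.6},  # red is the favourite, blue is also fine
    "price_range": {"low": 0.8},
}
products = [
    {"name": "B", "color": "red", "price_range": "high"},    # score 1.0
    {"name": "C", "color": "green", "price_range": "high"},  # score 0.0
    {"name": "A", "color": "blue", "price_range": "low"},    # score 0.6 + 0.8
]
ranking = [p["name"] for p in rank_by_preferences(products, preferences)]
```

Unlike a strict facet filter, this approach still returns results when no product meets all preferences.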

Features

  • A search system with user frontend which allows adjusting the preferences for different facets.

Link to source code (GitLab)

https://git.gesis.org/iir/preferenced-based-search

Publications

  • Kern, Dagmar, Wilko van Hoek, and Daniel Hienert. 2018. "Evaluation of a Search Interface for Preference-Based Ranking - Measuring User Satisfaction and System Performance." In NordiCHI '18 Proceedings of the 10th Nordic Conference on Human-Computer Interaction, 184-194. New York: ACM. doi: http://dx.doi.org/10.1145/3240167.3240170.

Team

Name | Phone
Dr. Daniel Hienert | +49 (221) 47694-525

Variable Detection and Linking

Description

In the OpenMinTeD project (http://openminted.eu/) methods have been investigated and developed to identify the mentions of variables in scientific publications. The variable detection and disambiguation method developed has been tested on a subset of variables.

Download link / Web Access

https://services.openminted.eu/landingPage/application/51d1f81b-aa0f-4675-bb87-8c720779e949 

Link to source code (GitHub)

https://github.com/openminted/uc-tdm-socialsciences 

Publications

  • Zielinski, Andrea, and Peter Mutschke. 2018. "Towards a Gold Standard Corpus for Variable Detection and Linking in Social Science Publications." In Proceedings of LREC 2018
  • Zielinski, Andrea, and Peter Mutschke. 2017. "Mining Social Science Publications for Survey Variables." In Proceedings of the Second Workshop on Natural Language Processing and Computational Social Science, Vancouver, Canada, August 3, 2017, edited by Dirk Hovy, Svitlana Volkova, and David Bamman, 47–52. Association for Computational Linguistics. aclweb.org/anthology/W17-29.

Team

Name | Phone
Peter Mutschke M.A. (contact) | +49 (221) 47694-500
Dr. Andrea Zielinski | +49 (221) 47694-212

ReshapeRDF

Description

ReshapeRDF is a CLI tool that provides versatile functionality to inspect and reshape large RDF dumps. It is designed to interact closely with the Unix CLI tool set.

Processing RDF mass data can be an error-prone job. Common triple stores offer certain functionality for querying and manipulating RDF data, but only few can efficiently handle mass data (say, more than 200 million statements) at the same time. Typical operations like data import and SPARQL (update) queries tend to be time-consuming and inconvenient to use in comprehensive reshaping operations.

Thus, when working with moderately structured graph data, a solution can be to refrain from using a triple store and to work with dump files instead. Recurring reshaping tasks include extracting entities of a certain class from a large dataset, subdividing a dataset into blocks according to a certain property (blocking), filtering the data, extracting and removing resources and statements, renaming properties, and similar operations.

The tool at hand allows for these operations. ReshapeRDF’s working principle is to process RDF data as a stream of N-Triples (lines), which allows it to be used in Unix tool set environments and for scripting.

An early version of this tool was used in the linked.swissbib.ch project.
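
The line-oriented working principle can be illustrated with a small Python sketch that renames a property while streaming over N-Triples lines. This is not ReshapeRDF's code and does not use its commands or options; the function name `rename_property` and the example IRIs are made up, and the naive whitespace splitting assumes well-formed N-Triples with one statement per line.

```python
def rename_property(lines, old_iri, new_iri):
    """Yield N-Triples lines with predicate <old_iri> replaced by <new_iri>."""
    for line in lines:
        # Split into subject, predicate and the rest (object plus final dot).
        parts = line.rstrip("\n").split(None, 2)
        if len(parts) == 3 and parts[1] == f"<{old_iri}>":
            parts[1] = f"<{new_iri}>"
            yield " ".join(parts) + "\n"
        else:
            yield line  # pass all other lines through unchanged

# Made-up example statements
dump = [
    '<http://example.org/a> <http://example.org/old> "x" .\n',
    '<http://example.org/a> <http://example.org/other> <http://example.org/b> .\n',
]
renamed = list(rename_property(dump, "http://example.org/old", "http://example.org/new"))
```

Because the function consumes and produces plain lines one at a time, it could be fed from `sys.stdin` and written to `sys.stdout`, which is what makes this style of processing combine well with tools such as `sort` or `grep`.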

Features

  • Convert between various RDF formats

  • Remove duplicates

  • Rename properties

  • Split datasets

  • Merge datasets

  • Sort datasets

  • Extract resources, statements, subjects, predicates, objects by patterns

  • Extract resources by list

  • Filter resources by patterns

  • Filter resources according to a list

  • Close interaction with Unix tool set

  • ... Refer to our Guide for more information

Link to source code (GitLab)

https://git.gesis.org/bensmafx/reshapeRDF 

Publications

  • Bensmann, Felix, Benjamin Zapilko, and Philipp Mayr. 2017. "Interlinking Large-scale Library Data with Authority Records." Frontiers in Digital Humanities 4 (5): 1-13. doi: dx.doi.org/10.3389/fdigh.2017.00005.

Team

Name | Phone
Felix Bensmann | +49 (221) 47694-524