A Framework for Finding, Linking, and Enriching Social Science Linked Data (ScienceLinker)
Leader: Dr. Benjamin Zapilko
Scientists in applied empirical research regularly search for datasets. More precisely, they look for specific data points within datasets (e.g. variables in social science research) that allow them to investigate their research interest. These datasets serve multiple purposes, such as answering a particular research question, replicating a finding with an alternative dataset, or merging two datasets to widen the options for analysis or to fill in missing values. However, finding suitable data and measures to support an individual hypothesis is a challenging task. Beyond that, there are practically no frameworks that identify datasets similar to a given one.
In many cases, a researcher will be able to find the desired data at a research data centre. However, given the mass of data available on the web as a result of the Open Data movement, additional relevant datasets are likely to be available outside organized infrastructures such as research data centres. A further consideration is the manual effort still required to exploit the datasets found for interlinking. This applies, for instance, to enriching one's own datasets with additional content from the found data and metadata, and also to subsequent publication in journals, on self-archiving platforms or on the web.
This project, “ScienceLinker”, pursues two approaches to these challenges: (1) the development of methods to identify Linked Open Data that is compatible in terms of content and quality; (2) the application of Semantic Web technologies to leverage the data for linking, enrichment and subsequent publishing. The developed techniques will support non-domain users by applying automation whenever possible. Hence, the framework under development aims to guide the user through the following five steps: (1) the automatic identification of a set of related datasets published as Linked Open Data; (2) the assessment of a dataset in terms of compatibility and quality; (3) the linking of entities referenced in the dataset to the identified datasets; (4) the enrichment of the dataset by applying a set of entity-type-specific rules to infer additional information about the entities, including via non-identity links; and (5) the processing of the enriched dataset for publication on self-archiving platforms, as Linked Data or via further publication channels.
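As a rough illustration of steps (3) and (4) only, the following minimal Python sketch links a record to an entity by normalized label match and then applies a type-specific rule to copy additional properties. It is not the ScienceLinker implementation: the `Record` class, the `link` and `enrich` functions, the `EXTERNAL` source, the rule table and all URIs are hypothetical and serve only to make the workflow concrete. A real system would query Linked Open Data sources rather than an in-memory dictionary.

```python
from dataclasses import dataclass, field

# Toy stand-in for an external Linked Data source (hypothetical URIs and values).
EXTERNAL = {
    "http://example.org/place/cologne": {
        "label": "Cologne", "type": "Place", "country": "Germany"},
    "http://example.org/place/mannheim": {
        "label": "Mannheim", "type": "Place", "country": "Germany"},
}

# Hypothetical entity-type-specific rules: properties to copy per entity type.
ENRICHMENT_RULES = {"Place": ["country"]}


@dataclass
class Record:
    """One row of a researcher's dataset that mentions a place by name."""
    place: str
    links: dict = field(default_factory=dict)
    extra: dict = field(default_factory=dict)


def normalize(label):
    """Make label comparison insensitive to case and surrounding whitespace."""
    return label.strip().casefold()


def link(record, source):
    """Step 3 (sketch): link the mentioned entity by exact normalized label match."""
    for uri, props in source.items():
        if normalize(props["label"]) == normalize(record.place):
            record.links["place"] = uri
    return record


def enrich(record, source, rules):
    """Step 4 (sketch): copy the properties that the rule for the entity type allows."""
    uri = record.links.get("place")
    if uri is None:
        return record  # nothing was linked, so nothing can be inferred
    props = source[uri]
    for key in rules.get(props["type"], []):
        if key in props:
            record.extra[key] = props[key]
    return record


linked = enrich(link(Record(" cologne "), EXTERNAL), EXTERNAL, ENRICHMENT_RULES)
unlinked = enrich(link(Record("Atlantis"), EXTERNAL), EXTERNAL, ENRICHMENT_RULES)
```

Exact label matching is the simplest possible linking strategy and is used here only to keep the sketch short; records without a match are left unchanged rather than guessed at.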
The investigations and developments in this project will be kept generic with respect to domain in order to maximize the potential for re-use in other applications. For instance, in the case of the social sciences, candidate Linked Data sources need not be scientific or come from the social science domain at all; examples are DBpedia and Geonames. Likewise, we will integrate the ScienceLinker framework into the established data integration platform Karma, which has been developed at ISI, so that it can also be executed in a neutral environment.