Validating, Matching and Retrieving "Non-Source Items" in the Social Science (OUTCITE)
Abstract
Making bibliographic data available for researchers, scholars and others is
important in all disciplines to ensure easy and fast access to the literature
and other scientific resources such as research
datasets. To this end, many publishers strive to index their publications in
bibliographic databases enabling the linking of publications in a citation
graph. Still, a significant part of citation data in disciplines such as social
science is not accessible via bibliographic databases. Our previous project
EXCITE has addressed this problem and has successfully narrowed the gap between
the availability of citation data in the social sciences and other disciplines.
EXCITE has researched, developed, and deployed powerful tools (https://www.gesis.org/en/research/external-funding-projects/overview-external-funding-projects/excite)
that localize, extract and segment reference strings in PDF documents and then
match them against bibliographic databases. EXCITE has also integrated the
extracted citation data from social science publications into the Open
Citations Corpus (http://opencitations.net). One of the main conclusions
derived from EXCITE is that the metadata of 60% of the cited papers and other
scientific resources is not covered by the available bibliographic databases. The
extracted reference strings (items) that could not be matched are called
non-source items. Non-source items include incomplete or erroneous references
as well as references that indeed do not exist in the available bibliographic
databases, especially references to datasets, websites and other material. This
finding of EXCITE coincides with our observations on other major bibliographic
databases, such as Web of Science or Semantic Scholar.
The main goal of OUTCITE is to research, develop and deploy a toolchain that
follows up on the output of the EXCITE pipeline in order to link non-source
items to their sources. To this end, we will draw on the knowledge and
expertise gained in EXCITE to overcome the challenges foreseen in OUTCITE.
Specifically, we will develop a set of algorithms dedicated to understanding non-source items
(challenge C1) and to overcoming the problem of their duplicate occurrences (C2) by gathering
them into clusters. Subsequently, new algorithms and methods will be developed to derive
correct and complete representations from these clusters (C3). These representations
will then be looked up using web search engines, so that the existence of the
publication is confirmed and the corresponding source is retrieved (C4). To
ensure a high-quality result at the end of the project, we will use, adapt and
extend the technologies reviewed in the state of the art so far. Machine
learning and data science techniques will be used in each phase of the
project to reach a satisfactory level of quality, and an end-to-end optimization concept
will be adopted to achieve high output quality across all phases of OUTCITE.
To this end, each phase will not only deliver its results but also propagate
an estimate of their quality. This mechanism will allow each phase
of the chain to interpret the output of the preceding phase and thus adapt its
processes to increase the output's quality. At the end of the project and
similar to what has been accomplished in EXCITE, the developed techniques,
tools and enriched reference index will be made available under open-source
licenses, integrated into the GESIS Search infrastructure and ingested into the
Open Citations Corpus.
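
To make the clustering (C2), representation (C3) and quality-propagation ideas above more concrete, the following minimal Python sketch may help. It is an illustration under stated assumptions, not the OUTCITE toolchain: token-set Jaccard similarity is only a simple stand-in for the algorithms the project intends to develop, and all names (tokens, jaccard, Cluster, cluster_references, representative), the threshold of 0.6 and the sample reference strings are hypothetical.

```python
import re
from collections import Counter
from dataclasses import dataclass


def tokens(reference: str) -> frozenset[str]:
    """Lowercase a reference string and return its set of alphanumeric tokens."""
    return frozenset(re.findall(r"[a-z0-9]+", reference.lower()))


def jaccard(a: frozenset[str], b: frozenset[str]) -> float:
    """Token-set overlap in [0, 1]; 1.0 means identical token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


@dataclass
class Cluster:
    members: list[str]
    # Hypothetical quality estimate handed on to the next phase:
    # mean similarity of the members to the first member of the cluster.
    cohesion: float


def cluster_references(references: list[str], threshold: float = 0.6) -> list[Cluster]:
    """Greedy single pass: a string joins the first cluster whose first
    member is at least `threshold` similar, else it starts a new cluster."""
    groups: list[list[str]] = []
    for ref in references:
        for group in groups:
            if jaccard(tokens(ref), tokens(group[0])) >= threshold:
                group.append(ref)
                break
        else:
            groups.append([ref])

    clusters = []
    for group in groups:
        first = tokens(group[0])
        sims = [jaccard(tokens(member), first) for member in group]
        clusters.append(Cluster(group, sum(sims) / len(sims)))
    return clusters


def representative(cluster: Cluster) -> str:
    """Crude stand-in for C3: pick the member whose tokens are most
    widely shared within the cluster (ties go to the earliest member)."""
    counts = Counter(t for member in cluster.members for t in tokens(member))
    return max(cluster.members, key=lambda m: sum(counts[t] for t in tokens(m)))


# Made-up reference strings, for illustration only.
sample = [
    "Smith, J. (2010). Social capital and trust. Journal of Sociology, 12(3), 45-67.",
    "Smith J 2010 Social capital and trust, Journal of Sociology 12",
    "Mueller, A. (2015). Panel survey dataset on voting behaviour.",
]
for cluster in cluster_references(sample):
    print(round(cluster.cohesion, 2), "|", representative(cluster))
```

In this sketch the cohesion value plays the role of the propagated quality estimate: a later phase, such as the web search in C4, could treat clusters with low cohesion more cautiously. How OUTCITE actually computes and uses such estimates is not specified in this abstract.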