GESIS Leibniz Institute for the Social Sciences

Validating, Matching and Retrieving "Non-Source Items" in the Social Science (OUTCITE)



Abstract

Making bibliographic data available to researchers, scholars and others is
important in all disciplines to ensure easy and fast access to the literature
and to other scientific resources such as research datasets. To this end, many
publishers strive to index their publications in bibliographic databases, which
enables publications to be linked in a citation graph. Still, a significant part
of the citation data in disciplines such as the social sciences is not
accessible via bibliographic databases. Our previous project EXCITE addressed
this problem and successfully narrowed the gap between the availability of
citation data in the social sciences and in other disciplines. EXCITE
researched, developed, and deployed powerful tools
(https://www.gesis.org/en/research/external-funding-projects/overview-external-funding-projects/excite)
that localize, extract and segment reference strings in PDF documents and then
match them against bibliographic databases. EXCITE also integrated the citation
data extracted from social science publications into the Open Citations Corpus
(http://opencitations.net). One of the main conclusions derived from EXCITE is
that the metadata of 60% of the cited papers and other scientific resources is
missing from the available bibliographic databases. The extracted reference
strings (items) that could not be matched are called non-source items.
Non-source items include incomplete or erroneous references as well as
references that are genuinely absent from the available bibliographic databases,
especially references to datasets, websites and other material. This finding of
EXCITE coincides with our observations on other major bibliographic databases,
such as Web of Science or Semantic Scholar.

The main goal of OUTCITE is to research, develop and deploy a toolchain that
follows up on the output produced by the EXCITE pipeline in order to link
non-source items to their sources. To this end, we will apply the knowledge and
expertise gained in EXCITE to overcome the challenges foreseen in OUTCITE.
Specifically, we will develop a set of algorithms dedicated to understanding
non-source items (challenge C1) and to resolving their duplicate occurrences
(C2) by grouping them into clusters. Subsequently, new algorithms and methods
will be developed to derive correct and complete representations from these
clusters (C3). These representations will then be looked up using web search
engines, such that the existence of the publication is confirmed and the
corresponding source is retrieved (C4). To ensure a high-quality result at the
end of the project, we will use, adapt and extend the state-of-the-art
technologies reviewed so far. Machine learning and data science techniques will
be actively used in each phase of the project to reach a satisfactory level of
quality, and an end-to-end optimization concept will be adopted to achieve high
output quality across all phases of OUTCITE. To this end, each phase will not
only provide its output but also propagate an estimate of that output's quality.
This mechanism will allow each phase of the chain to interpret the output of the
preceding phase and adapt its own processing to increase the quality of the
final output. At the end of the project, and similar to what has been
accomplished in EXCITE, the developed techniques, tools and enriched reference
index will be made available under open-source licenses, integrated into the
GESIS Search infrastructure and ingested into the Open Citations Corpus.
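
To make the clustering (C2), representation (C3) and quality-propagation ideas above more concrete, the following is a minimal Python sketch and not the OUTCITE implementation. Everything in it is an assumption made for illustration: the `Reference` fields, the title-only similarity based on `difflib`, the greedy clustering, the 0.85 threshold, the majority-vote consensus, and the agreement-based quality score. The example references are invented.

```python
from collections import Counter
from dataclasses import dataclass
from difflib import SequenceMatcher
import re

FIELDS = ("author", "title", "year")  # hypothetical segmentation fields


@dataclass
class Reference:
    """A segmented reference string (illustrative structure only)."""
    author: str
    title: str
    year: str


def normalize(text: str) -> str:
    """Lowercase and drop punctuation so trivial variants compare equal."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()


def similarity(a: Reference, b: Reference) -> float:
    """Title similarity in [0, 1]; a stand-in for a learned matcher."""
    return SequenceMatcher(None, normalize(a.title), normalize(b.title)).ratio()


def cluster(refs, threshold=0.85):
    """Greedy clustering: a reference joins the first cluster whose first
    member is similar enough, otherwise it starts a new cluster."""
    clusters = []
    for ref in refs:
        for members in clusters:
            if similarity(ref, members[0]) >= threshold:
                members.append(ref)
                break
        else:
            clusters.append([ref])
    return clusters


def consensus(members):
    """Majority vote per field, plus a crude quality estimate: the mean
    share of cluster members that agree with the winning value."""
    record, agreement = {}, []
    for field in FIELDS:
        values = [normalize(getattr(m, field)) for m in members]
        values = [v for v in values if v]  # ignore missing fields
        if not values:
            record[field] = ""
            agreement.append(0.0)
            continue
        winner, count = Counter(values).most_common(1)[0]
        record[field] = winner
        agreement.append(count / len(members))
    return record, sum(agreement) / len(agreement)


# Invented example data: three noisy variants of one reference, one other item.
refs = [
    Reference("Smith, J.", "Survey Methods in Practice", "1999"),
    Reference("Smith J", "Survey methods in practice.", "1999"),
    Reference("Smith, J.", "Survey Methods in Practise", ""),
    Reference("Doe, A.", "Panel Data Codebook", "2005"),
]
clusters = cluster(refs)
record, quality = consensus(clusters[0])
print(len(clusters), record, round(quality, 3))
```

In this sketch the quality score travels with the consensus record, so a later step such as the web search in C4 could, for example, issue broader queries for clusters with low agreement. How OUTCITE actually estimates and uses quality is not specified here.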



Project duration
01.08.2021 – 31.07.2023

Sponsored by

Deutsche Forschungsgemeinschaft