Validating, Matching and Retrieving "Non-Source Items" in the Social Science (OUTCITE)
Abstract
Making bibliographic data available for researchers, scholars and others is important in all disciplines to ensure easy and fast access to the literature and other scientific resources such as research datasets. To this end, many publishers strive to index their publications in bibliographic databases, enabling the linking of publications in a citation graph. Still, a significant part of the citation data in disciplines such as the social sciences is not accessible via bibliographic databases. Our previous project EXCITE addressed this problem and successfully narrowed the gap between the availability of citation data in the social sciences and in other disciplines. EXCITE researched, developed and deployed tools (https://www.gesis.org/en/research/external-funding-projects/overview-external-funding-projects/excit...) that locate, extract and segment reference strings in PDF documents and then match them against bibliographic databases. EXCITE also integrated the extracted citation data from social science publications into the Open Citations Corpus (http://opencitations.net). One of the main conclusions derived from EXCITE is that the metadata of 60% of the cited papers and other scientific resources is not contained in the available bibliographic databases. The extracted reference strings (items) that could not be matched are called non-source items. Non-source items include incomplete or erroneous references as well as references that indeed do not exist in the available bibliographic databases, especially references to datasets, websites and other material. This finding of EXCITE coincides with our observations on other major bibliographic databases, such as Web of Science or Semantic Scholar.
The main goal of OUTCITE is to research, develop and deploy a toolchain that follows up on the output of the EXCITE pipeline in order to link non-source items to their sources. To this end, we will apply the knowledge and expertise gained in EXCITE to the challenges foreseen in OUTCITE. Specifically, we will develop a set of algorithms dedicated to understanding non-source items (challenge C1) and to overcoming the problem of their duplicate occurrences by gathering them into clusters (C2). Subsequently, new algorithms and methods will be developed to derive correct and complete representations from these clusters (C3). These representations will then be looked up via web search engines, so that the existence of the publication is confirmed and the corresponding source is retrieved (C4). To ensure a high-quality result at the end of the project, we will use, adapt and extend current state-of-the-art technologies. Machine learning and data science techniques will be used in each phase of the project, and an end-to-end optimization concept will be adopted to achieve high output quality across all phases of OUTCITE. To this end, each phase will not only provide its outcome but also propagate an estimate of that outcome’s quality. This mechanism will allow each phase of the chain to interpret the output of the preceding phase and adapt its own processing to increase the overall output quality. At the end of the project, and similar to what has been accomplished in EXCITE, the developed techniques, tools and enriched reference index will be made available under open-source licenses, integrated into the GESIS Search infrastructure and ingested into the Open Citations Corpus.
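To make the clustering and quality-propagation ideas above more concrete, the following is a minimal illustrative sketch in Python. It is not the OUTCITE implementation: the token-based Jaccard similarity, the greedy clustering, the threshold of 0.6, the "longest member" heuristic for the representative, and all names (`tokens`, `jaccard`, `Cluster`, `cluster_references`) are assumptions chosen only for illustration. The sketch groups duplicate reference strings (C2), picks one representative per cluster (a crude stand-in for C3), and exposes a cohesion score that a later phase could read as a quality estimate.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field


def tokens(reference: str) -> frozenset[str]:
    """Return the lowercase alphanumeric tokens of a reference string."""
    return frozenset(re.findall(r"[a-z0-9]+", reference.lower()))


def jaccard(a: frozenset[str], b: frozenset[str]) -> float:
    """Token-set overlap in [0, 1]; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


@dataclass
class Cluster:
    members: list[str] = field(default_factory=list)
    # Similarity of each non-seed member to the cluster's first member (seed).
    similarities: list[float] = field(default_factory=list)

    def representative(self) -> str:
        # Hypothetical heuristic: assume the longest string is the most complete.
        return max(self.members, key=len)

    def cohesion(self) -> float:
        # Hypothetical quality estimate handed to the next phase:
        # mean similarity to the seed; a singleton is trivially cohesive.
        if not self.similarities:
            return 1.0
        return sum(self.similarities) / len(self.similarities)


def cluster_references(references: list[str], threshold: float = 0.6) -> list[Cluster]:
    """Greedy single pass: join the first cluster whose seed is similar enough."""
    clusters: list[Cluster] = []
    seeds: list[frozenset[str]] = []
    for ref in references:
        toks = tokens(ref)
        for cluster, seed in zip(clusters, seeds):
            sim = jaccard(toks, seed)
            if sim >= threshold:
                cluster.members.append(ref)
                cluster.similarities.append(sim)
                break
        else:
            # No existing cluster matched: this reference seeds a new one.
            clusters.append(Cluster(members=[ref]))
            seeds.append(toks)
    return clusters


# Made-up reference strings, for illustration only.
refs = [
    "Smith, J. (2010). Survey methods in sociology. Berlin: Example Press.",
    "Smith J 2010 Survey methods in sociology",
    "Doe, A. (2015). Panel data on voting behaviour [dataset].",
]
for c in cluster_references(refs):
    print(len(c.members), round(c.cohesion(), 2), c.representative())
```

In this sketch the first two strings fall into one cluster and the dataset reference forms its own. A real pipeline would likely compare segmented fields (author, year, title) rather than raw tokens and would learn the threshold from labelled data; the cohesion value only indicates how a phase might pass a quality signal downstream.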