Extraction of Citations from PDF Documents
Project lead: Dr. Philipp Mayr
Research department: Knowledge Technologies for the Social Sciences (Wissenstechnologien für Sozialwissenschaften, WTS)
The shortage of citation data for the international and German social sciences is well known to researchers in the field and has itself often been the subject of academic studies. Citation data is the basis of effective information retrieval, recommendation systems and knowledge discovery processes. As a result, the accessibility of information in the social sciences currently lags behind other fields (e.g. the natural sciences) where more citation data is available. The EXCITE project narrows this supply gap by developing a toolchain of citation extraction software that will be applied to existing databases of scientific literature and made available to researchers during and after the project period, with a particular focus on the German-language social sciences.
To reach this goal, the project will develop a set of algorithms for extracting citation and reference information from PDF documents and for matching reference strings against bibliographic databases. Extraction of citations will be performed as a five-step process: (1) extraction of text from source documents, (2) identification of reference sections and other forms of embedded reference information within the text, (3) segmentation of individual references into their constituent fields such as author, title, etc., (4) matching of reference strings against databases of bibliographic information, and (5) export of matched references to reusable formats so that other researchers can use the results. The developed system will consist of a corresponding five-step framework; it will use state-of-the-art technologies where appropriate and extend them where necessary. Special attention will be given to optimizing the citation extraction process as a whole, using machine learning methods to ensure that the extracted data meets a satisfactory quality threshold.
This will be achieved by annotating the intermediate output of the extraction process with probability values denoting the certainty of each processing step, allowing subsequent steps to determine the most likely interpretation of the data. The extracted citation data will then be integrated into pre-existing bibliographic services of the proposing institutes (Sowiport and related-work.net), and published as linked open data under permissive licenses to enable reuse. The software will also be made accessible as a web service API, to allow third parties to extract citation data from arbitrary publications. Additionally, the software developed as part of the project will be published under open-source licenses.
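The idea of passing probability-annotated intermediate output from one step to the next can be illustrated with a minimal Python sketch. It is not the project's implementation: the names `Hypothesis`, `run_pipeline`, `locate_reference` and `segment_reference`, the toy reference string, and all probability values are assumptions made up for illustration. Each step returns several candidate interpretations with a confidence value, and the pipeline keeps the candidates whose joint probability across steps is highest.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Hypothesis:
    """One candidate interpretation plus its probability (hypothetical type)."""
    value: Any
    prob: float


# A step maps one input to several scored candidate outputs.
Step = Callable[[Any], List[Hypothesis]]


def run_pipeline(source: Any, steps: List[Step], beam_width: int = 5) -> List[Hypothesis]:
    """Run all steps, keeping the `beam_width` most likely joint interpretations."""
    beam = [Hypothesis(source, 1.0)]
    for step in steps:
        expanded = []
        for hyp in beam:
            for out in step(hyp.value):
                # Joint probability = probability so far * confidence of this step.
                expanded.append(Hypothesis(out.value, hyp.prob * out.prob))
        expanded.sort(key=lambda h: h.prob, reverse=True)
        beam = expanded[:beam_width]
    return beam


def locate_reference(text: str) -> List[Hypothesis]:
    """Toy stand-in for step 2: two guesses about where the reference ends."""
    return [
        Hypothesis("Doe (2001). Some title. Some Journal. 12 Appendix", 0.6),
        Hypothesis("Doe (2001). Some title. Some Journal.", 0.4),
    ]


def segment_reference(ref: str) -> List[Hypothesis]:
    """Toy stand-in for step 3: split into fields, confident only for 3 parts."""
    parts = ref.rstrip(".").split(". ")
    if len(parts) == 3:
        fields = {"author_year": parts[0], "title": parts[1], "venue": parts[2]}
        return [Hypothesis(fields, 0.9)]
    return [Hypothesis({"raw": ref}, 0.3)]


results = run_pipeline("...full document text...", [locate_reference, segment_reference])
best = results[0]
print(best.value, round(best.prob, 2))
```

In this toy run the locally preferred guess from the second step (0.6) loses overall, because the segmentation step is much less certain about it (0.6 × 0.3 = 0.18 versus 0.4 × 0.9 = 0.36). Keeping several scored candidates instead of committing early is what lets a later step correct an earlier one.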
Project duration: 1.9.2016 - 31.8.2018
- WeST – Institute for Web Science and Technologies, University of Koblenz-Landau