SmartER Affiliations: Harvesting and Extraction of Affiliation Data to Extend Open Repositories (SmartER)
Abstract
DBLP provides search functionality over the metadata of scientific publications in computer science, together with links to their PDF full texts. To this end, DBLP maintains high-quality metadata on author names, titles, and venues, including a unique identification of authors whenever possible. Currently, DBLP records unique author identifications for 5.25 million documents. As the next step, we plan to extend the DBLP database with affiliation information wherever it is available. We compiled three use cases in which users benefit from affiliation data. These include direct benefits, such as new and useful search functionality, and indirect benefits, such as better author disambiguation and a more accurate data basis for scientometric studies and for measuring scientific output. The goal of this project is to address all three use cases and to make affiliations a first-class citizen in the DBLP data environment.
We have broken this challenge down into four tasks: get the data, extract the data, integrate it into both the backend and the frontend of DBLP, and introduce the data to the community. Specifically, we will build a focused hybrid crawler that automatically discovers and harvests metadata from structured and unstructured web sources. The focused crawler learns to follow links to bibliographic metadata in different formats, such as RDF on the Web of Data, PDF full texts, web sites, and custom APIs. We download the content, then extract and cleanse the metadata from these sources. For example, we apply entity recognition to extract metadata from a PDF, in particular the authors' affiliations, and match the extracted metadata against external knowledge bases, e.g., lists of known affiliations. The extracted information is fused into a metadata record based on an extended, provenance-aware metadata model. We ingest the new metadata records into the DBLP database, where analysts manually inspect, edit, and confirm them using the DBLP editorial manager. This iterative manual inspection generates feedback, which is used both to improve the machine learning model for extracting affiliation information and to optimize the crawling strategies of the hybrid focused crawler.
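The matching and fusion step described above can be illustrated with a minimal sketch. It is not the project's implementation: the list of known affiliations, the similarity threshold, the record fields, and all function names are hypothetical, and a generic string similarity from Python's standard library stands in for whatever matching technique the project actually uses.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional
import unicodedata

# Hypothetical stand-in for an external knowledge base of known affiliations.
KNOWN_AFFILIATIONS = [
    "Universität Ulm",
    "Schloss Dagstuhl – Leibniz Center for Informatics",
]


def normalize(text: str) -> str:
    """Lowercase, strip combining accents, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return " ".join(without_accents.lower().split())


def match_affiliation(raw: str, threshold: float = 0.8) -> Optional[str]:
    """Return the most similar known affiliation, or None below the threshold."""
    best_name, best_score = None, 0.0
    for name in KNOWN_AFFILIATIONS:
        score = SequenceMatcher(None, normalize(raw), normalize(name)).ratio()
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None


@dataclass
class AffiliationRecord:
    """Assumed shape of a provenance-aware record (illustrative only)."""
    author: str
    raw_affiliation: str                 # string as extracted from the source
    matched_affiliation: Optional[str]   # entry from the known list, if any
    source: str                          # provenance: where the string came from
    confirmed: bool = False              # set after manual editorial review


def build_record(author: str, raw: str, source: str) -> AffiliationRecord:
    """Fuse an extracted affiliation string with its match and provenance."""
    return AffiliationRecord(author, raw, match_affiliation(raw), source)
```

Keeping the raw string and its source next to the matched entry lets an analyst see what was extracted and from where before confirming the record, which is the role the editorial review plays in the workflow above.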
Through user studies, we tailor the new affiliation interface to the users' needs, and we take care to integrate the new information into the ongoing author disambiguation and quality assurance processes of DBLP's editorial management system. Last but not least, all gathered information will be made publicly available in accordance with the FAIR principles as part of the ongoing DBLP effort to support the e-research community with high-quality and trustworthy datasets, which are used worldwide by thousands of researchers and software developers.
Runtime
2023-10-01 – 2026-09-30
Partners
- Schloss Dagstuhl – Leibniz Center for Informatics
- Universität Ulm
Funding
Deutsche Forschungsgemeinschaft