Smart Harvesting 2

Team: Nadine Dulisch, TBA
Leader: Dr. Brigitte Mathiak
Scientific unit: Knowledge Technologies for the Social Sciences (WTS)


The automatic extraction and preparation of bibliographic data is one of the major problems in maintaining bibliographic databases. The follow-up project "Smart Harvesting II" therefore continues the productive cooperation between the database operators dblp (computer science) and GESIS (social sciences) in order to solve shared problems.

The predecessor project focused on developing a learning wrapper that uses the existing database to generate extraction rules automatically. However, this is not always possible because of the multitude of technologies used on the Web. In particular, page content that is generated dynamically at runtime (e.g. via AJAX calls) remains a major challenge.

The current project therefore focuses on developing a wrapper framework for rule-based data extraction that non-computer scientists can also use by writing simple extraction rules. Both navigation and extraction are to operate on the DOM trees underlying the HTML pages. For this purpose, OXPath (an extension of XPath), the addressing scheme of the University of Oxford, will be integrated into the wrappers in cooperation with that university. Furthermore, monitoring tools are to be created that enable non-programmers (such as librarians) to monitor the entire data extraction process and to open up new data sources.
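The idea of rule-based extraction on a DOM tree can be illustrated with a minimal sketch. It does not use OXPath itself; it assumes plain XPath-style paths as supported by Python's standard library, and the sample page, the rule layout (`RULES`) and the function name `extract` are hypothetical choices made for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page. Real-world HTML is often not
# well-formed XML and would need a more tolerant parser.
PAGE = """
<html><body>
  <div class="record">
    <span class="title">Paper A</span>
    <span class="author">Ada Example</span>
    <span class="author">Bob Sample</span>
  </div>
  <div class="record">
    <span class="title">Paper B</span>
    <span class="author">Cy Placeholder</span>
  </div>
</body></html>
"""

# Declarative extraction rules (assumed layout): one path that selects
# each record node, plus one path per field, relative to the record.
RULES = {
    "record": ".//div[@class='record']",
    "fields": {
        "title": "./span[@class='title']",
        "authors": "./span[@class='author']",
    },
}


def extract(page, rules):
    """Apply the rules to the parsed tree and return one dict per record."""
    root = ET.fromstring(page)
    records = []
    for node in root.findall(rules["record"]):
        record = {}
        for field, path in rules["fields"].items():
            # Collect the text of every element matched by the field path.
            record[field] = [el.text.strip() for el in node.findall(path) if el.text]
        records.append(record)
    return records


if __name__ == "__main__":
    for rec in extract(PAGE, RULES):
        print(rec)
```

Because the rules are data rather than code, adapting the wrapper to a new page layout in this sketch only means editing the paths in `RULES`. Pages that build their content at runtime are outside what this static example can handle.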

At the same time, the databases are to be cleaned up by means of author disambiguation in order to provide a more reliable data basis. The software implemented in the predecessor project Smart Harvesting for disambiguating newly acquired data will be extended with a further component that detects homonyms and synonyms in the existing data. In particular, the differences between publication cultures (computer science ↔ social sciences) revealed in the previous project will be addressed, since in many cases very different strategies have to be applied.
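To make the two terms concrete: a synonym is one person appearing under several name spellings, a homonym is one name string shared by several persons. The following is a minimal sketch of how candidates for both could be flagged. It is not the project's method; the name normalization, the co-author overlap heuristic, the sample records and all function names are assumptions chosen for illustration.

```python
from collections import defaultdict

# Hypothetical records: (author name as printed, set of co-author names).
RECORDS = [
    ("Jane Q. Doe", {"A. Smith", "B. Jones"}),
    ("J. Doe", {"A. Smith"}),
    ("J. Doe", {"C. Brown"}),
]


def name_key(name):
    """Reduce a name to 'first initial + last name', lower-cased."""
    parts = name.replace(".", " ").split()
    return f"{parts[0][0]} {parts[-1]}".lower()


def synonym_candidates(records):
    """Group distinct spellings that share the same normalized key."""
    groups = defaultdict(set)
    for name, _ in records:
        groups[name_key(name)].add(name)
    return {key: names for key, names in groups.items() if len(names) > 1}


def homonym_candidates(records):
    """Flag name strings that occur with non-overlapping co-author sets."""
    by_name = defaultdict(list)
    for name, coauthors in records:
        by_name[name].append(coauthors)
    flagged = set()
    for name, sets in by_name.items():
        for i in range(len(sets)):
            for j in range(i + 1, len(sets)):
                if not sets[i] & sets[j]:
                    flagged.add(name)
    return flagged


if __name__ == "__main__":
    print(synonym_candidates(RECORDS))
    print(homonym_candidates(RECORDS))
```

Both functions only produce candidates for review. Whether two spellings really denote the same person, or one spelling really hides two persons, would still have to be decided with further evidence.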


15.4.2016 - 15.4.2018

  • C. Michels, R. R. Fayzrakhmanov, M. Ley, E. Sallinger and R. Schenkel, "OXPath-Based Data Acquisition for dblp," 2017 ACM/IEEE Joint Conference on Digital Libraries (JCDL), Toronto, ON, 2017, pp. 1-2.
  • M. Neumann, J. Steinberg and P. Schaer, "Web-Scraping for Non-Programmers: Introducing OXPath for Digital Library Metadata Harvesting," Code4Lib Journal, no. 38, 2017.
  • P. Schaer and M. Neumann, "Enriching Existing Test Collections with OXPath," in Experimental IR Meets Multilinguality, Multimodality, and Interaction: 8th International Conference of the CLEF Association, CLEF 2017, Dublin, Ireland, September 11-14, 2017, Proceedings, G. J. F. Jones, S. Lawless, J. Gonzalo, L. Kelly, L. Goeuriot, T. Mandl, L. Cappellato and N. Ferro, Eds., Lecture Notes in Computer Science, vol. 10456, 2017.