GESIS Leibniz Institute for the Social Sciences

Smart Harvesting 2


Leader

Dr. Brigitte Mathiak

Team

Nadine Dulisch

Abstract

The automatic extraction and preparation of bibliographic data is one of the major problems in maintaining bibliographic databases. The follow-up project "Smart Harvesting II" therefore continues the productive cooperation between the database operators dblp (computer science) and GESIS (social sciences) in order to solve common problems.

In the predecessor project, the focus was on developing a learning wrapper that uses the current database to automatically generate extraction rules. However, this is not always possible because of the multitude of technologies used on the Web. In particular, page content generated dynamically at runtime (e.g. via AJAX calls) still presents a great challenge.

The focus of the current project is therefore the development of a wrapper framework for rule-based data extraction that non-computer scientists can also use by writing simple extraction rules. Both navigation and extraction are to be done by parsing the DOM trees underlying the HTML pages. For this purpose, the addressing scheme OXPath (an extension of XPath) will be integrated into the wrappers in cooperation with the University of Oxford, where it was developed. Furthermore, monitoring tools are to be created that enable non-programmers (such as librarians) to monitor the entire data extraction process and develop new data sources.
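To illustrate what rule-based extraction over a DOM tree can look like, here is a minimal Python sketch. It is not the project's framework and it does not use OXPath. It relies only on the limited XPath subset of the standard library's `xml.etree.ElementTree`, and the page snippet, the `RULES` layout, and the `extract` function are all hypothetical names chosen for this example. It also assumes well-formed markup, whereas real web pages would first need an HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page snippet (real HTML would need an HTML parser).
PAGE = """
<html><body>
  <div class="pub">
    <span class="title">Learning Wrappers</span>
    <span class="author">A. Example</span>
    <span class="author">B. Sample</span>
  </div>
  <div class="pub">
    <span class="title">Rule-Based Extraction</span>
    <span class="author">C. Placeholder</span>
  </div>
</body></html>
"""

# Declarative rules (assumed layout): one path selects the record nodes,
# and one path per field is evaluated relative to each record node.
RULES = {
    "record": ".//div[@class='pub']",
    "fields": {
        "title": "./span[@class='title']",
        "authors": "./span[@class='author']",
    },
}


def extract(page, rules):
    """Apply the rules to the parsed tree; return one dict of lists per record."""
    root = ET.fromstring(page.strip())
    records = []
    for node in root.findall(rules["record"]):
        record = {}
        for name, path in rules["fields"].items():
            # Collect the text of every element matched by the field's path.
            record[name] = [el.text.strip() for el in node.findall(path) if el.text]
        records.append(record)
    return records


if __name__ == "__main__":
    for rec in extract(PAGE, RULES):
        print(rec)
```

Keeping the rules in a plain data structure, separate from the traversal code, is one way a non-programmer could add a new data source by editing paths only.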

At the same time, the databases are to be cleaned up by means of author disambiguation, so that a more reliable data basis is ensured. The software already implemented in the predecessor project Smart Harvesting for disambiguating newly acquired data is to be extended with a further component that reveals homonyms and synonyms in the existing data. In particular, the differences between the publication cultures (computer science ↔ social sciences) revealed in the previous project will be addressed, since in many cases very different strategies have to be applied.
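As a rough illustration of the synonym side of this problem (one person appearing under several name spellings), the following Python sketch groups name variants under a crude blocking key. It is not the project's disambiguation software. The key (accent-folded last name plus first initial), the function names, and the sample names are assumptions made for this example. Such a key only proposes candidates for review, and it cannot separate homonyms, i.e. different people who share a name.

```python
import unicodedata
from collections import defaultdict


def name_key(name):
    """Crude blocking key (an assumption): folded last name plus first initial."""
    # Decompose accented characters and drop the combining marks.
    folded = unicodedata.normalize("NFKD", name)
    folded = "".join(c for c in folded if not unicodedata.combining(c)).lower()
    if "," in folded:
        # "Last, First" form
        last, first = [part.strip() for part in folded.split(",", 1)]
    else:
        # "First Last" form; the final token is taken as the last name
        parts = folded.split()
        first, last = " ".join(parts[:-1]), parts[-1]
    return (last, first[:1])


def synonym_candidates(names):
    """Return groups of distinct spellings that share a blocking key."""
    groups = defaultdict(set)
    for name in names:
        if name.strip():  # skip empty entries
            groups[name_key(name)].add(name)
    return {key: sorted(variants) for key, variants in groups.items() if len(variants) > 1}


if __name__ == "__main__":
    # Hypothetical sample names, not taken from dblp or GESIS data.
    sample = ["Müller, Anna", "Anna Muller", "A. Müller", "Bernd Schmidt"]
    print(synonym_candidates(sample))
```

Which key is appropriate would likely differ between publication cultures, for example where full first names are common in one field and initials in another.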



Runtime

2016-04-15 – 2018-04-15