Smart Harvesting II
Abstract
The automatic extraction and preparation of bibliographic data is one
of the major challenges in maintaining bibliographic databases. The
follow-up project "Smart Harvesting II" therefore continues the
productive cooperation between the database operators dblp (computer
science) and GESIS (social sciences) in order to solve common problems.
In the predecessor project, the focus was on developing a learning
wrapper that uses the data already held in the database to
automatically generate extraction rules. However, automatic rule
generation is not always possible because of the multitude of
technologies used on the Web; in particular, page content generated
dynamically at runtime (e.g. using AJAX calls) still presents a great
challenge.
The focus of the current project is therefore the development of a
wrapper framework for rule-based data extraction that can also be used
by non-computer scientists by means of simple extraction rules. Both
navigation and extraction are to be done by traversing the DOM trees
underlying the HTML pages. For this purpose, the addressing scheme
OXPath (an extension of XPath) from the University of Oxford will be
integrated into the wrappers in cooperation with that university.
Furthermore, monitoring tools are to be created that enable
non-programmers (such as librarians) to monitor the entire data
extraction process and to open up new data sources.
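The idea of rule-based extraction over a DOM tree can be illustrated
with a minimal sketch. It does not use OXPath; it only uses the limited
XPath subset of Python's standard library, and the sample page, the
RULES layout, and the extract function are assumptions made purely for
illustration. A real wrapper would also need an HTML parser that
tolerates malformed markup and a way to handle dynamically loaded
content.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (assumption for illustration).
PAGE = """
<html><body>
  <div class="paper">
    <span class="title">Learning Wrappers</span>
    <span class="author">A. Author</span>
    <span class="author">B. Author</span>
  </div>
  <div class="paper">
    <span class="title">Rule-Based Extraction</span>
    <span class="author">C. Author</span>
  </div>
</body></html>
"""

# Declarative rules: one path selects each record, one path per field.
# The rule format is hypothetical, not the project's actual format.
RULES = {
    "record": ".//div[@class='paper']",
    "fields": {
        "title": "./span[@class='title']",
        "authors": "./span[@class='author']",
    },
}


def extract(page, rules):
    """Apply the rules to the page and return one dict per record."""
    root = ET.fromstring(page)
    records = []
    for node in root.findall(rules["record"]):
        record = {}
        for name, path in rules["fields"].items():
            # Every field is returned as a list of stripped text values.
            record[name] = [
                el.text.strip() for el in node.findall(path) if el.text
            ]
        records.append(record)
    return records


records = extract(PAGE, RULES)
for record in records:
    print(record)
```

Keeping the rules as plain data, separate from the extraction code, is
what would allow a non-programmer to adapt a wrapper to a new source by
editing paths only.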
At the same time, the databases are to be cleaned up by means of author
disambiguation, so that a more reliable data basis is ensured. The
software for disambiguating newly acquired data, already implemented in
the predecessor project Smart Harvesting, is to be extended with a
further component that reveals homonyms and synonyms in the existing
data. In particular, the differences between the publication cultures
(computer science ↔ social sciences) revealed in the previous project
will be addressed, since in many cases very different strategies have
to be applied.
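To make the two notions concrete: synonyms are several name spellings
that refer to one person, and homonyms are several persons sharing one
name. The following sketch shows one simple way candidates of both
kinds could be flagged. It is not the project's disambiguation method;
the key function, the coauthor heuristic, and all sample names are
assumptions chosen for illustration, and the results are only
candidates that would still need review.

```python
import unicodedata
from collections import defaultdict


def name_key(name):
    """Reduce a name to 'surname first-initial', ignoring diacritics."""
    ascii_name = (
        unicodedata.normalize("NFKD", name)
        .encode("ascii", "ignore")
        .decode()
    )
    parts = ascii_name.lower().replace(".", " ").split()
    return f"{parts[-1]} {parts[0][0]}"


def synonym_candidates(names):
    """Group distinct spellings that collapse to the same key."""
    groups = defaultdict(set)
    for name in names:
        groups[name_key(name)].add(name)
    return [sorted(g) for g in groups.values() if len(g) > 1]


def homonym_candidates(publications):
    """Flag names whose papers split into disjoint coauthor clusters.

    publications: iterable of (author_name, set_of_coauthor_names).
    """
    by_name = defaultdict(list)
    for name, coauthors in publications:
        by_name[name].append(set(coauthors))
    flagged = []
    for name, coauthor_sets in by_name.items():
        clusters = []  # pairwise disjoint sets of coauthors
        for current in coauthor_sets:
            if not current:
                continue  # solo papers carry no coauthor evidence
            merged, rest = set(current), []
            for cluster in clusters:
                if cluster & merged:
                    merged |= cluster
                else:
                    rest.append(cluster)
            clusters = rest + [merged]
        if len(clusters) > 1:
            flagged.append(name)
    return flagged


names = ["Jürgen Müller", "J. Muller", "Jurgen Muller", "Anna Schmidt"]
pubs = [
    ("Wei Wang", {"A"}),
    ("Wei Wang", {"A", "B"}),
    ("Wei Wang", {"X"}),
    ("Anna Schmidt", {"C"}),
    ("Anna Schmidt", {"C", "D"}),
]
print(synonym_candidates(names))
print(homonym_candidates(pubs))
```

A coauthor-based heuristic of this kind relies on multi-author papers
being the norm, so it would likely need different thresholds or
different evidence in fields where single-author publications are more
common.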