Open Mining Infrastructure for Text and Data (OpenMinTeD)
Abstract
Recent years have witnessed an unparalleled upsurge in the quantities
of digital data, with their volume doubling every three years. In the
world of science, researchers worldwide generate over 1.5 million
publications on an annual basis. While these vast amounts of new data
and information can undoubtedly offer new insights and give rise to new
opportunities for analytics and improved understanding, reading and
analysing them all is just as clearly beyond human capacities.
Text and data mining is emerging as a powerful tool for harnessing
data and discovering their value, by analysing structured and
unstructured datasets and content at multiple levels and in many
different dimensions in order to discover concepts and entities in the
world, patterns they may follow and relations they engage in, and on
this basis annotate, index, classify and visualise such content.
Scientific publications as a whole cover a wide range of scientific
areas, each with its own terminology, conventions and way of using
language to express and communicate knowledge. In addition, they are
written in different natural languages and are subject to varying
access rights and/or restrictions. In the same vein, text mining
tools and platforms have been built either for mining linguistically
generic text or for specific domains and languages, each with more or
less its own technical and linguistic specifications. Over the last
decade, text mining tools have been integrated into text mining
platforms, which ensures a level of interoperability between tools and
components within the same platform, while initiatives for
cross-platform interoperability have been launched in recent years.
However, neither text mining tools nor integrated platforms are easily
discoverable by end users (researchers, curators, librarians, policy
makers, etc.), and they are documented in various ways, which makes
searching for and discovering them a challenging task.
OpenMinTeD aspires to enable the creation of an infrastructure that
fosters and facilitates the use of text and data mining technologies in
the scientific publications world and beyond, by both application domain
users and text-mining experts. It builds upon existing text mining
tools, workflows and platforms and renders them discoverable, through
appropriate registries, and interoperable, through an interoperability
layer based, to the extent possible, on existing standards. It promotes
awareness of the benefits of text mining, supports the training of
text mining users and developers alike, and
demonstrates the merits of the approach through a number of use cases
identified by scholars and experts from different scientific areas,
ranging from generic scholarly communication to life sciences
(bioinformatics, biochemistry, etc.), food and agriculture, and social
sciences and humanities literature.
GESIS will provide the requirements for the social sciences use cases,
evaluate the corresponding implementation and participate in the
working groups on the interoperability framework specifications.
Furthermore,
GESIS will be actively involved in community engagement and training
activities.