Open Mining Infrastructure for Text and Data
Leader: Peter Mutschke
Scientific unit: Wissenstechnologien für Sozialwissenschaften (WTS)
Recent years have witnessed an unparalleled upsurge in the quantity of digital data, with its volume doubling every three years. In science alone, researchers worldwide produce over 1.5 million publications a year. These vast amounts of new data and information can undoubtedly offer new insights, open up new opportunities for analytics and improve understanding, but reading and analysing them is just as clearly beyond human capacity.
Text and data mining is emerging as a powerful tool for discovering value in data. It analyses structured and unstructured datasets and content at multiple levels and along many dimensions in order to discover concepts and entities in the world, the patterns they follow and the relations they engage in, and on this basis annotates, indexes, classifies and visualises such content.
Scientific publications span a wide range of scientific areas, each with its own terminology, conventions and way of using language to express and communicate knowledge. They are, moreover, written in different natural languages and subject to varying access rights and restrictions. In the same vein, text mining tools and platforms have been built either for linguistically generic text or for particular domains and languages, each with more or less its own technical and linguistic specifications. Over the last decade, text mining tools have been integrated into text mining platforms, which ensures a level of interoperability between tools and components within the same platform, and initiatives for cross-platform interoperability have been launched in recent years. However, neither text mining tools nor integrated platforms are easily discoverable by end users (researchers, curators, librarians, policy makers, etc.), and they are documented in varying ways, which makes searching for and discovering them a challenging task.
OpenMinTeD aspires to enable the creation of an infrastructure that fosters and facilitates the use of text and data mining technologies in the scientific publications world and beyond, by both application domain users and text mining experts. It builds upon existing text mining tools, workflows and platforms and renders them discoverable, through appropriate registries, and interoperable, through an interoperability layer based, to the extent possible, on existing standards. It raises awareness of the benefits of text mining, supports the training of text mining users and developers alike, and demonstrates the merits of the approach through a number of use cases identified by scholars and experts from different scientific areas, ranging from generic scholarly communication to life sciences (bioinformatics, biochemistry, etc.), food and agriculture, and social sciences and humanities literature.
GESIS will provide the requirements for the social sciences use cases, evaluate the corresponding implementation and participate in the working groups on the interoperability framework specifications. Furthermore, GESIS will be actively involved in community engagement and training activities.
Runtime: 01.06.2015 – 31.05.2018
Partners:
- ATHENA Research and Innovation Center in Information, Communication and Knowledge Technologies (Greece) (Lead)
- University of Manchester (UK)
- Technische Universität Darmstadt (Germany)
- Institut National de la Recherche Agronomique (France)
- European Molecular Biology Laboratory (Germany)
- Agro-Know (Greece)
- Stichting LIBER (Netherlands)
- University of Amsterdam (Netherlands)
- Open University (UK)
- École Polytechnique Fédérale De Lausanne (Switzerland)
- Fundación Centro Nacional de Investigaciones Oncologicas Carlos III (Spain)
- The University of Sheffield (UK)
- Greek Research and Technology Network (Greece)
- Frontiers Media SA (Switzerland)
Publications:
- Zielinski, Andrea, and Peter Mutschke. 2018. "Towards a Gold Standard Corpus for Variable Detection and Linking in Social Science Publications." In Proceedings of LREC 2018.
- Zielinski, Andrea, and Peter Mutschke. 2017. "Mining Social Science Publications for Survey Variables." In Proceedings of the Second Workshop on Natural Language Processing and Computational Social Science, Vancouver, Canada, August 3, 2017, edited by Dirk Hovy, Svitlana Volkova, and David Bamman, 47–52. Association for Computational Linguistics. aclweb.org/anthology/W17-29.