Scalable Disambiguation of Institutions for Web of Science
Disambiguation of Institutions
Leader: Dr. Philipp Mayr
Scientific unit: Knowledge Technologies for the Social Sciences
The level of ambiguity in institution names varies across data aggregators, depending on the metadata they provide. The Web of Science (WoS) supports basic structured fields such as university, street address, city, ZIP code and country name. Nevertheless, authors inevitably fill in these fields in different styles. Often it is not obvious whether two references to institutions are variants of one name or denote different institutions. Authors enter their affiliation differently from one publication to the next or introduce spelling errors. In addition, the corresponding author enters the affiliation for a co-authored publication in their own format. For author name disambiguation, this variation limits the usefulness of the affiliation string. Moreover, to study institutions through their authors and publications, institution names must be grouped so that each institution worldwide appears exactly once, together with all of its authors and publications. One problem here is that different authors translate the name of an institution into English in different ways. Besides the multiple name variants introduced by authors, further problems arise when institutions change their address, merge or split.
The aim of this project is to introduce a method for disambiguating institution names in WoS, a large-scale bibliometric database. For this purpose, the multiple variants of an institution's name are grouped together. To evaluate the method, we will use the institution coding implemented by Bielefeld University.
The planned approach will make it possible to disambiguate institutions for the entire WoS:
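The description above does not say which evaluation measure will be used. As a minimal sketch, assuming pairwise precision, recall and F1 (a common choice for clustering-style disambiguation), a predicted grouping could be compared with a reference coding as follows. The function names and the toy labels are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def same_cluster_pairs(labels):
    """Return the set of index pairs (i, j), i < j, that share a label."""
    by_label = {}
    for idx, label in enumerate(labels):
        by_label.setdefault(label, []).append(idx)
    return {
        pair
        for members in by_label.values()
        for pair in combinations(members, 2)
    }


def pairwise_scores(predicted, gold):
    """Pairwise precision, recall and F1 of a predicted grouping.

    `predicted` and `gold` are equally long lists that assign a cluster
    label to each address record (hypothetical input format).
    """
    pred_pairs = same_cluster_pairs(predicted)
    gold_pairs = same_cluster_pairs(gold)
    true_positives = len(pred_pairs & gold_pairs)
    precision = true_positives / len(pred_pairs) if pred_pairs else 1.0
    recall = true_positives / len(gold_pairs) if gold_pairs else 1.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example: five address records, reference coding vs. predicted groups.
gold = ["A", "A", "A", "B", "B"]
predicted = [0, 0, 1, 1, 1]
scores = pairwise_scores(predicted, gold)
```

Pairwise measures count every pair of records placed in the same group, so large institutions dominate the score; whether that is desirable depends on the evaluation goal.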
- Entity resolution instead of entity linking, building on previous work, to overcome database sparsity.
- A sophisticated blocking method, building on previous work, to overcome the complexity problem.
- Transformation of observed addresses into normalized (comparable) representations.
- Description of hierarchical relationships which are encoded in the blocking graph.
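To make the normalization and blocking steps more concrete, the following is a minimal sketch and not the project's actual method. It assumes a hypothetical record format with `name`, `city` and `country` fields, a tiny hypothetical abbreviation table, a simple city/country blocking key and a Jaccard threshold of 0.5. The blocking method from the previous work listed below is more sophisticated, and hierarchical relationships are not modelled here.

```python
import re
import unicodedata
from collections import defaultdict
from itertools import combinations

# Hypothetical abbreviation table; a realistic one would be much larger.
ABBREVIATIONS = {"univ": "university", "inst": "institute", "dept": "department"}


def normalize(text):
    """Lowercase, strip accents and punctuation, expand known abbreviations."""
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_only = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    tokens = re.findall(r"[a-z0-9]+", ascii_only.lower())
    return tuple(ABBREVIATIONS.get(token, token) for token in tokens)


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def resolve(records, threshold=0.5):
    """Group record indices that likely refer to the same institution."""
    # Blocking: only records with the same city and country are compared,
    # which avoids comparing every record with every other record.
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        key = (normalize(rec["country"]), normalize(rec["city"]))
        blocks[key].append(idx)

    names = [set(normalize(rec["name"])) for rec in records]
    parent = list(range(len(records)))  # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for members in blocks.values():
        for a, b in combinations(members, 2):
            if jaccard(names[a], names[b]) >= threshold:
                parent[find(a)] = find(b)

    groups = defaultdict(list)
    for idx in range(len(records)):
        groups[find(idx)].append(idx)
    return sorted(groups.values())


records = [
    {"name": "Univ Bielefeld", "city": "Bielefeld", "country": "Germany"},
    {"name": "Bielefeld University", "city": "Bielefeld", "country": "Germany"},
    {"name": "Univ. of Bielefeld", "city": "Bielefeld", "country": "Germany"},
    {"name": "GESIS Leibniz Inst Social Sci", "city": "Cologne", "country": "Germany"},
]
groups = resolve(records)
```

In this toy input the three Bielefeld variants end up in one group and the Cologne record stays separate. A plain token overlap like this would also merge genuinely different institutions in the same city that share most of their name, which is one reason a more careful similarity model is needed at WoS scale.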
Runtime: 1.10.2019 – 31.12.2019
- Tobias Backes. Effective unsupervised author disambiguation with relative frequencies. In Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries, pages 203–212. ACM, 2018.
- Tobias Backes. The impact of name-matching and blocking on author disambiguation. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pages 803–812. ACM, 2018.