Establishing Contextual Dataset Retrieval - transferring concepts from document to dataset retrieval (ConDATA)

Team: Zeljko Carevic, Dwaipayan Roy
Leader: Dr. Philipp Mayr
Scientific unit: Dep. Knowledge Technologies for the Social Sciences (WTS)


Information Retrieval Systems (IRS) face new challenges due to the growth and diversity of data and users. An IRS analyses the query submitted by a user and explores collections of unstructured or semi-structured data (e.g. text, images, videos, web pages) in order to deliver items that best match the user’s intent and interests. These challenges become more pronounced when the result of a query is no longer of a familiar type, such as a web page or a literature reference, but a dataset or data collection. Dataset retrieval is a new and challenging research area because the output is not a text that can easily be indexed and retrieved.

Our contextualisation approach is shown as a simplified schematic in Figure 1. At the start of each session, little information is available about the user and their information needs. This corresponds to a low session context on the system side and high uncertainty on the user side. Over time the user interacts with the system by entering query terms, looking at datasets and accepting recommendations from the system. With each interaction the system’s knowledge about the user and their information needs grows, resulting in a more concrete representation of the user via the session context. At the same time the uncertainty is reduced and the search results are tailored to the user’s search interests.

In this project we aim to include the user context, based on issued queries, reformulated queries, seen documents etc., in order to provide the user with a personalised ranking of datasets relevant to their interests (see the green dotted box in Figure 1). Another example of contextualisation in real-life applications is query expansion, where a user’s search query is expanded with related terms derived from the user’s context.
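As an illustration of query expansion driven by a session context, the following minimal Python sketch accumulates term counts from earlier queries and viewed dataset titles, and then appends the most frequent context terms to a new query. It is not the project's implementation: the class `SessionContext`, the tiny stopword list, the whitespace tokenizer and the parameter `k` are all assumptions made for this example.

```python
from collections import Counter

# Tiny illustrative stopword list (an assumption, not from the project).
STOPWORDS = {"the", "of", "and", "in", "a", "on", "for", "to"}


def tokenize(text):
    """Lowercase, split on whitespace and drop stopwords."""
    return [t for t in text.lower().split() if t not in STOPWORDS]


class SessionContext:
    """Hypothetical session context: term counts over all interactions."""

    def __init__(self):
        self.term_counts = Counter()

    def add_interaction(self, text):
        """Record the text of a query, a viewed dataset title, etc."""
        self.term_counts.update(tokenize(text))

    def expand_query(self, query, k=2):
        """Append the k most frequent context terms not already in the query."""
        query_terms = tokenize(query)
        candidates = [
            (term, count)
            for term, count in self.term_counts.items()
            if term not in query_terms
        ]
        # Highest count first; ties broken alphabetically for determinism.
        candidates.sort(key=lambda tc: (-tc[1], tc[0]))
        return query_terms + [term for term, _ in candidates[:k]]


ctx = SessionContext()
ctx.add_interaction("youth unemployment")                      # earlier query
ctx.add_interaction("Youth unemployment survey Germany 2015")  # viewed dataset
ctx.add_interaction("Labour market survey Germany")            # viewed dataset
expanded = ctx.expand_query("unemployment")
print(expanded)
```

The longer the session, the more evidence the counter holds, which mirrors the growing session context described above. A real system would likely weight terms more carefully than by raw counts.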

The main goal of the ConDATA project is to transfer well-established contextualisation approaches from document retrieval to dataset retrieval in order to personalise search results. To achieve this, we propose a user-centred retrieval approach in which we analyse the behaviour of researchers during dataset retrieval tasks. This overall goal is divided into the following sub-goals:

  • Sub-Goal 1: User behaviour in dataset retrieval

The goal is to analyse the search behaviour of users during a dataset retrieval task to determine strategies that are commonly employed.

  • Sub-Goal 2: Data representation

The goal is to create an indexable and retrievable representation that is suitable for describing a dataset.

  • Sub-Goal 3: Develop a user profile modeling approach

By utilising implicit relevance feedback, the goal is to develop a rich user profile that provides an abstract representation of the user’s search interests and search behaviour.

  • Sub-Goal 4: Evaluation in a living lab environment

The objective is to follow a user-driven implementation process that is evaluated in a living lab environment (GESIS Search). Our goal is to compare different contextualisation approaches on real-life search tasks to determine which is most suitable for dataset retrieval.
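To make sub-goals 2 and 3 more concrete, the sketch below builds a simple user profile from implicit relevance feedback and uses it to re-rank datasets by the similarity between the profile and each dataset's metadata text. It is only one possible reading of the sub-goals: the interaction kinds, their weights in `INTERACTION_WEIGHTS`, the bag-of-words representation and the function names are assumptions made for illustration, not the approach chosen in the project.

```python
import math
from collections import Counter

# Hypothetical weights: stronger implicit signals contribute more.
INTERACTION_WEIGHTS = {"query": 1.0, "view": 2.0, "export": 3.0}


def tokenize(text):
    """Lowercase and split on whitespace."""
    return text.lower().split()


def build_profile(interactions):
    """Build a weighted term profile from (kind, text) interaction pairs."""
    profile = Counter()
    for kind, text in interactions:
        weight = INTERACTION_WEIGHTS.get(kind, 0.0)  # unknown kinds are ignored
        for term in tokenize(text):
            profile[term] += weight
    return profile


def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rerank(profile, datasets):
    """Order dataset ids by similarity of their metadata text to the profile."""
    scored = [
        (cosine(profile, Counter(tokenize(text))), ds_id)
        for ds_id, text in datasets.items()
    ]
    # Most similar first; ties broken by id for determinism.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [ds_id for _, ds_id in scored]


interactions = [
    ("query", "election survey"),
    ("view", "german election study 2017"),
]
datasets = {
    "ds1": "household income panel",
    "ds2": "german election study 2013",
    "ds3": "election polls",
}
ranking = rerank(build_profile(interactions), datasets)
print(ranking)
```

In a living lab setting, alternative weightings or representations of this kind could be compared against each other on real search sessions.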


The overall contribution of the project will be a recommendation of successful techniques, approaches and concepts for contextualised dataset retrieval, together with implementations in GESIS Search. The contextualisation will be developed using state-of-the-art methods that are widely implemented and can thus easily be reproduced by other researchers in the field.


01.01.2019 – 31.07.2021



  • Carevic, Z., Roy, D., & Mayr, P. (2020). Characteristics of Dataset Retrieval Sessions: Experiences from a Real-Life Digital Library. In M. Hall, T. Merčun, T. Risse, & F. Duchateau (Eds.), Digital Libraries for Open Knowledge (Vol. 12246, pp. 185–193). Springer International Publishing.
  • Carevic, Z. (2020). Contextualised Stratagem Browsing in Digital Libraries. Ph.D. thesis.
  • Chandrasekaran, M. K., Mayr, P., Yasunaga, M., Freitag, D., Radev, D., & Kan, M.-Y. (2019). Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2019). SIGIR’19 Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, 1441–1443.
  • Biswas, C., Ganguly, D., Roy, D., & Bhattacharya, U. (2019). Privacy Preserving Approximate K-means Clustering. Proceedings of the 28th ACM International Conference on Information and Knowledge Management - CIKM ’19, 1321–1330.
  • Roy, D., Saha, S., Mitra, M., Sen, B., & Ganguly, D. (2019). I-REX: A Lucene Plugin for EXplainable IR. Proceedings of the 28th ACM International Conference on Information and Knowledge Management - CIKM ’19, 2949–2952.
  • Carevic, Z., Schüller, S., Mayr, P., & Fuhr, N. (2018). Contextualised Browsing in a Digital Library’s Living Lab. In Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries (pp. 89–98). Fort Worth, Texas, USA: ACM New York, NY, USA.
  • Koesten, L., Mayr, P., Groth, P., Simperl, E., & de Rijke, M. (2018). Report on the DATA:SEARCH’18 workshop - Searching Data on the Web. SIGIR Forum, 52(2), 117–124.