GESIS Leibniz Institute for the Social Sciences
Standards, Guidelines and Quality Assurance for Gender Equality in Academia

Big data and gender sensitivity

Results from an analysis of the German Academic Web (GAW) crawl on gender sensitivity measures at German higher education institutions

On this page we present results from the StaRQ project on big data and gender competence/gender sensitivity. As part of the BMBF-funded project, an interdisciplinary team with expertise in computational social science, gender studies and higher education research is working across departments to analyze gender equality measures in academia. The German Academic Web (GAW) crawl serves as the data basis for various research questions. The aim of the study presented here is to compare theory and practice regarding the gender sensitivity activities of higher education institutions. In recent years, the topic has become more prominent in connection with the prerequisites for a sustainable transformation of higher education institutions ("fix the knowledge"). This means that, in addition to equal opportunities officers and teaching staff, other target groups are also being considered, in particular people with management responsibilities and those involved in appointment procedures. The analysis aims, on the one hand, to make gender sensitivity measures transparent and, on the other hand, to identify the target groups of these measures.

In the following, we use the term gender sensitivity, as this better reflects the objective of the measures, and the term has become established in the higher education context. Since 2015, the terms gender sensitivity, gender-sensitive and gender sensitization have been used more frequently than gender competence on pages of higher education institutions.

The German Academic Web (GAW) crawl is a web archive that is collected twice a year. Since 2014, it has been based on a seed list of URLs of German universities with the right to award doctorates (see list of universities in Germany, Wikipedia). With these pages as a starting point, further URLs are recursively detected and archived. At the end of the process, which takes about two weeks, approximately 100 million pages and their associated metadata are stored for each half-year. The metadata includes, among other things, the time of loading and a unique identifier.
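The recursive discovery of URLs from a seed list can be illustrated with a small sketch. It is not the GAW crawler: the link graph `LINKS`, the `example` URLs, the `max_pages` limit and the restriction to the hosts of the seed URLs are all assumptions made for illustration, and a real crawler would fetch pages over HTTP and store the metadata mentioned above.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical in-memory link graph standing in for the live web:
# each URL maps to the URLs that are linked from that page.
LINKS = {
    "https://uni-a.example/": [
        "https://uni-a.example/gleichstellung",
        "https://uni-b.example/",
    ],
    "https://uni-a.example/gleichstellung": [
        "https://uni-a.example/",
        "https://external.example/news",
    ],
    "https://uni-b.example/": ["https://uni-b.example/workshops"],
    "https://uni-b.example/workshops": [],
}


def crawl(seeds, links, max_pages=1000):
    """Breadth-first discovery of pages reachable from the seed URLs.

    Assumption: only hosts that appear in the seed list are followed.
    """
    allowed_hosts = {urlparse(seed).netloc for seed in seeds}
    queue = deque(seeds)
    seen = set(seeds)
    archived = []
    while queue and len(archived) < max_pages:
        url = queue.popleft()
        archived.append(url)
        for target in links.get(url, []):
            if target not in seen and urlparse(target).netloc in allowed_hosts:
                seen.add(target)
                queue.append(target)
    return archived


one_seed = crawl(["https://uni-a.example/"], LINKS)
two_seeds = crawl(["https://uni-a.example/", "https://uni-b.example/"], LINKS)
```

With a single seed, only the two pages on that host are collected. Adding the second seed also brings in the pages of the second institution, while the external page is never followed.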

Figure 1: Locations of higher education institutions in Germany from which websites were archived in the GAW crawl. (cf., retrieved 07.09.2022)

This analysis aimed at finding out which specific gender sensitivity events are offered at the higher education institutions and which target groups are addressed by them. In terms of methodology, the StaRQ project team chose a combination of bottom-up and top-down analysis.

GAW crawl: sub-corpora

A top-down analysis was carried out for various research questions in order to gain a thematic insight into the crawled web pages. In this case, top-down means that the entire corpus was searched for pages relevant to the analysis. These pages were then examined for thematic connections to determine which other topics might be of interest in answering the questions. This process is a compromise between pre-selection and openness to unexpected results when looking for related topics.

The biggest technical challenge is the immense volume of archived web pages. Therefore, in order to make the GAW crawl manageable for a thematic analysis, a pre-selection was created using word lists (filter terms) adapted to the questions. When creating the sub-corpora, only those web pages were used as a basis in which one of the terms appeared. With the help of regular expressions, the spellings of the filter terms were generalized before the search, thus increasing the number of web pages found.

The following table gives an overview of the sub-corpora of the GAW crawl created in the project, each of which addresses a different research question.

Sub-corpus | Time span | Scope (#websites) | Filter terms
GAW-StaRQ-Recruitment | 2019 | 2,733 | Berufungs*, Headhunting*, etc. + Gleichstellung*, etc.
GAW-StaRQ-Mentoring | 2015-2021 | 10,249 | Mentor*, Mentee*, etc.
GAW-StaRQ-Gender Sensitivity | 2015-2021 | 1,800 | genderkompeten*, gendersensib*, geschlechterkompeten*, geschlechtersensib*
GAW-StaRQ-Non-Binary | 2015-2021 | 690 | transgeschlecht*, nicht-binär*, etc.

Table 1: Key figures of the sub-corpora created for the period 2015-2021. The scope varies depending on the terms used for filtering and on the time span selected for examination. Terms with an asterisk (*) were generalized using regular expressions. The time span defines how many time slices of the GAW crawl are examined.
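The following sketch shows one way such a filter could work, using the filter terms of the gender sensitivity sub-corpus from Table 1. How the project generalized the spellings is not documented here, so the sketch assumes the simplest reading: the asterisk stands for any run of word characters and matching ignores case. The function names, URLs and page texts are hypothetical.

```python
import re

# Filter terms of the gender sensitivity sub-corpus, taken from Table 1.
FILTER_TERMS = [
    "genderkompeten*",
    "gendersensib*",
    "geschlechterkompeten*",
    "geschlechtersensib*",
]


def compile_filter(terms):
    """Build one case-insensitive regex from filter terms.

    Assumption: a trailing '*' means "any further word characters".
    """
    fragments = []
    for term in terms:
        stem = term.rstrip("*")
        suffix = r"\w*" if term.endswith("*") else ""
        fragments.append(re.escape(stem) + suffix)
    return re.compile(r"\b(?:" + "|".join(fragments) + r")\b", re.IGNORECASE)


def build_subcorpus(pages, pattern):
    """Keep only the pages whose text contains at least one filter term."""
    return {url: text for url, text in pages.items() if pattern.search(text)}


# Hypothetical page texts for illustration.
pages = {
    "https://uni-a.example/lehre": "Workshop zur gendersensiblen Lehre für Dozierende.",
    "https://uni-a.example/mensa": "Speiseplan der Mensa für diese Woche.",
    "https://uni-b.example/gb": "Fortbildung: Genderkompetenz in Berufungsverfahren.",
}
subcorpus = build_subcorpus(pages, compile_filter(FILTER_TERMS))
```

Here the pages on teaching and on appointment procedures are kept, and the canteen page is dropped because none of the terms occur in it.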

Analyzing the gender sensitivity corpus

Step 1 | Top-down analysis: topic modeling

At the beginning of the analysis, we investigated which thematic contexts could be identified in the gender sensitivity sub-corpus. For this purpose, topic modeling (cf. archived version of the original article) was conducted. Topic modeling is an unsupervised procedure based on Bayesian statistics that determines a distribution of topic clusters (topics) for each document. The words used on a web page can thus be assigned, in percentage terms, to the individual topics found in the entire sub-corpus. The topics, in turn, are characterized by the words assigned to them across the web pages: each topic is described by a distribution over the terms that occur particularly frequently within it. An example of this are the terms shown in Figure 2, which describe the topic “Educational events”. Such summary names were chosen for the topics by members of the StaRQ team with gender expertise.
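To make the idea of per-document topic distributions concrete, here is a minimal sketch. The text above does not name the algorithm used; the sketch assumes latent Dirichlet allocation (LDA) fitted with collapsed Gibbs sampling, which fits the description but may differ from the project's actual setup. The toy documents, hyperparameters and function name are hypothetical, and a real analysis would rely on an established library rather than this plain-Python version.

```python
import random


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized documents.

    Returns (doc_topic, topic_word): the topic shares of each document
    and the word probabilities of each topic.
    """
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    word_id = {word: i for i, word in enumerate(vocab)}
    V, K = len(vocab), num_topics

    n_dk = [[0] * K for _ in docs]      # tokens of document d assigned to topic k
    n_kw = [[0] * V for _ in range(K)]  # assignments of word v to topic k
    n_k = [0] * K                       # total tokens assigned to topic k
    z = []                              # current topic of every token

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z.append([])
        for word in doc:
            k = rng.randrange(K)
            z[d].append(k)
            n_dk[d][k] += 1
            n_kw[k][word_id[word]] += 1
            n_k[k] += 1

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                v, k = word_id[word], z[d][i]
                # Take the token out of the counts before resampling it.
                n_dk[d][k] -= 1
                n_kw[k][v] -= 1
                n_k[k] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][v] + beta) / (n_k[t] + V * beta)
                    for t in range(K)
                ]
                k = rng.choices(range(K), weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][v] += 1
                n_k[k] += 1

    doc_topic = [
        [(n_dk[d][t] + alpha) / (len(doc) + K * alpha) for t in range(K)]
        for d, doc in enumerate(docs)
    ]
    topic_word = [
        {vocab[v]: (n_kw[t][v] + beta) / (n_k[t] + V * beta) for v in range(V)}
        for t in range(K)
    ]
    return doc_topic, topic_word


# Hypothetical toy documents (already tokenized).
docs = [
    ["workshop", "fortbildung", "training", "workshop"],
    ["seminar", "vorlesung", "lehre", "studium"],
    ["workshop", "training", "weiterbildung"],
    ["lehre", "studium", "seminar", "vorlesung"],
]
doc_topic, topic_word = lda_gibbs(docs, num_topics=2)
```

Each row of `doc_topic` sums to one and corresponds to the percentage assignment of a page's words to topics. Sorting an entry of `topic_word` by probability yields the kind of word list that is used to name a topic.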

Eight topics were identified in the gender sensitivity sub-corpus. The team’s gender experts considered the following topics to be relevant for the comparison of theory and practice: “Language and literature”, “Studies and teaching” and “Educational events”. These three topics were selected for further analysis; the further research on the topic of educational events is described here as an example.

Figure 2: The two-dimensional representation of the eight main topics found with the help of topic modeling in the sub-corpus “gender sensitivity”.

The distance in the arrangement reflects the similarity in content. The word list characterizes the selected topic 3. Here, the magenta-colored bar describes the probability that a word is present in documents assigned to the topic. The blue bar gives an indication of how often a term appears in the entire sub-corpus. The longer the bar, the more general the term.


Step 2 | Thematic bottom-up analysis on educational events

The topic analysis identified educational events as a subject that should be further explored with the help of the gender sensitivity corpus. More specifically, educational events for teaching gender sensitivity were to be identified and, in a bottom-up analysis, the specific mentions were to be collected based on the pages on which they were found. In particular, the question was whether relevant educational events on the topic could be collected using this method and the data from the GAW crawl.

Initially, a list of names was created to identify educational events. In addition to the expertise of the project team, automatically generated suggestions were also used. First, initial names of educational events were chosen (e.g. lecture, seminar, conference). Then, similar words were suggested automatically by a word embedding model created for this purpose. In total, this semi-automatic procedure identified 80 relevant terms, which were grouped into four manually created categories.
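The suggestion step can be sketched as a nearest-neighbour search in the embedding space. The vectors below are made-up three-dimensional toy values, not output of the project's model, and the function names are hypothetical. A real word embedding model would be trained on the corpus and have far more dimensions.

```python
import math

# Hypothetical toy vectors for illustration only.
EMBEDDINGS = {
    "seminar": [0.9, 0.1, 0.0],
    "vorlesung": [0.8, 0.2, 0.1],
    "workshop": [0.7, 0.6, 0.0],
    "fortbildung": [0.5, 0.8, 0.1],
    "mensa": [0.0, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if one of them is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def suggest(seed_terms, embeddings, top_n=2):
    """Rank non-seed words by their best similarity to any seed term."""
    known_seeds = [term for term in seed_terms if term in embeddings]
    if not known_seeds:
        return []
    scores = {
        word: max(cosine(vector, embeddings[seed]) for seed in known_seeds)
        for word, vector in embeddings.items()
        if word not in seed_terms
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


suggestions = suggest({"seminar"}, EMBEDDINGS)
```

In this toy setup the seed term "seminar" yields "vorlesung" and "workshop" as candidates. As described above, such suggestions are only candidates: the decision whether a term is relevant remains a manual one.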

These four identified categories were used to cluster the 80 terms:

  • teaching courses (lectures, etc.)
  • personnel development and junior staff development (workshops, continuing education, etc.)
  • scientific events (conferences, etc.)
  • certificates

For a detailed analysis, the category “personnel development and junior staff development”, with its most prominent sub-categories “workshops”, “continuing education”, “advanced training” and “training” on gender sensitivity, was examined as an example. For this purpose, all sentences in the gender sensitivity sub-corpus were analyzed. If a sentence contained both a word from one of the sub-categories workshops, continuing education, advanced training or training and a mention of gender sensitivity (similar spellings and word stems were included), it was automatically added to the analysis database.
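The sentence-level rule described above can be sketched as follows. The German event terms (Workshop, Weiterbildung, Fortbildung, Schulung), the spelling variants covered by the gender pattern and the naive sentence splitter are assumptions for illustration; the project's actual term lists and tooling are not given here.

```python
import re

# Assumed German equivalents of the four sub-categories.
EVENT = re.compile(
    r"\b(workshops?|weiterbildung\w*|fortbildung\w*|schulung\w*)\b",
    re.IGNORECASE,
)
# Assumed spelling variants, e.g. "gendersensibel", "Gender-Kompetenz".
GENDER = re.compile(
    r"\b(?:gender|geschlechter)[- ]?(?:sensib|kompeten)\w*",
    re.IGNORECASE,
)


def find_sentences(pages):
    """Collect sentences that mention an event term and gender sensitivity."""
    findings = []
    for url, text in pages.items():
        # Naive split after sentence-final punctuation.
        for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
            event = EVENT.search(sentence)
            if event and GENDER.search(sentence):
                findings.append(
                    {"url": url, "term": event.group(1).lower(), "sentence": sentence}
                )
    return findings


# Hypothetical page texts for illustration.
pages = {
    "https://uni-a.example/mentoring": (
        "Das Programm dauert zwei Jahre. "
        "Die Mentees nehmen an Workshops zur Gendersensibilisierung teil."
    ),
    "https://uni-b.example/lehre": (
        "Gendersensible Lehre ist uns wichtig. Der Workshop findet im Mai statt."
    ),
}
findings = find_sentences(pages)
```

Only the second sentence of the first page is recorded. On the second page, the event term and the gender term occur in different sentences, so neither sentence qualifies.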

Next, the 136 sentence findings on the websites were examined in more detail and annotated. A combined approach of automated search and manual review and coding was thus used here. A sentence finding was considered relevant if it concerned personnel development and junior staff development events on gender sensitivity by and/or for members of the higher education institution. The sub-category workshops had the highest number of hits, with 84 sentence findings, and showed high relevance: only one sentence finding was deemed not relevant in terms of content.

Since it is possible that several found sentences reference the same workshop, the analysis resulted in 74 identified workshop references. In five cases, two different sentences were explanations of the same workshop. In another five cases, they were nearly identical sentences (near duplicates) extracted from the same URLs at different times. Since in these cases the sentences were not exactly the same, they were counted as different sentence findings (see Example 1). However, these findings also referenced the same workshop.

Example 1: Near duplicate:

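„Darüber hinaus werden die Mentees ermuntert, innerhalb der Förderphase an Workshops zur Gendersensibilisierung (fakultätsübergreifend organisiert) teilzunehmen.“
(English: "In addition, mentees are encouraged to take part in workshops on gender sensitization (organized across faculties) during the funding phase.")

„Darüber hinaus werden die Mentees ermuntert, innerhalb der zweijährigen Förderphase an (fakultätsübergreifend organisierten) Workshops zur Gendersensibilisierung teilzunehmen.“
(English: "In addition, mentees are encouraged to take part in workshops on gender sensitization (organized across faculties) during the two-year funding phase.")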

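The two mentions in the example come from the same web page, which was crawled at different times. Between the crawls the page was edited: the word "zweijährigen" (two-year) was added, and the parenthetical "(fakultätsübergreifend organisiert)" was moved in front of "Workshops" and inflected accordingly.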
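Near duplicates such as the two sentences in Example 1 can be flagged automatically before the manual review. The sketch below uses Jaccard similarity on word sets with a threshold of 0.8. Both the measure and the threshold are assumptions chosen for illustration, not the procedure documented for the project.

```python
import re


def tokens(sentence):
    """Lower-cased set of word tokens; punctuation is ignored."""
    return set(re.findall(r"\w+", sentence.lower()))


def jaccard(a, b):
    """Share of distinct words that two sentences have in common."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0


def is_near_duplicate(a, b, threshold=0.8):
    """True for sentences that are very similar but not identical."""
    return a != b and jaccard(a, b) >= threshold


# The two sentence findings from Example 1.
older = (
    "Darüber hinaus werden die Mentees ermuntert, innerhalb der Förderphase "
    "an Workshops zur Gendersensibilisierung (fakultätsübergreifend organisiert) "
    "teilzunehmen."
)
newer = (
    "Darüber hinaus werden die Mentees ermuntert, innerhalb der zweijährigen "
    "Förderphase an (fakultätsübergreifend organisierten) Workshops zur "
    "Gendersensibilisierung teilzunehmen."
)
similarity = jaccard(older, newer)
flagged = is_near_duplicate(older, newer)
```

The two sentences share 15 of their 18 distinct words, which gives a similarity of about 0.83, so the pair is flagged. Because a word-set measure ignores word order, the moved parenthetical does not lower the score.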

Figure 3: Workshop findings related to gender sensitivity on websites of higher education institutions in the gender sensitivity corpus. A workshop may be annotated on the basis of multiple sentence findings. Sentence findings that are included in the corpus at different points in time in slightly edited versions are a special case (near duplicates, see Example 1). (n=74 workshops)

Due to the large number of sentence findings and their high relevance, the sub-category workshops was selected as an example for a more in-depth analysis. For this purpose, the relevant references were annotated with the target groups named in the workshop descriptions on the websites. The target groups were divided into six groups for analysis. Target groups outside of higher education institutions, such as students or teachers at schools, were not included in the analysis. This resulted in a new, reduced sample of relevant workshops for target groups within higher education institutions (n=60). The following chart (Figure 4) shows the distribution of the identified target groups among the workshops (multiple answers possible).

Figure 4: Target group distribution within the higher education institutions for workshops.
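Because a workshop can address several target groups, the shares in such a distribution are computed per workshop and may add up to more than 100 percent. The following sketch illustrates the counting. The workshop identifiers, the annotations and the group labels are hypothetical and are not the six groups or the data used in the analysis.

```python
from collections import Counter

# Hypothetical annotations: each workshop has one or more target groups.
annotations = {
    "workshop-01": ["teaching staff"],
    "workshop-02": ["teaching staff", "equal opportunities officers"],
    "workshop-03": ["people with management responsibilities"],
    "workshop-04": ["teaching staff", "appointment committee members"],
}


def target_group_shares(annotations):
    """Share of workshops that address each target group."""
    counts = Counter(
        group for groups in annotations.values() for group in set(groups)
    )
    total = len(annotations)
    return {group: count / total for group, count in counts.most_common()}


shares = target_group_shares(annotations)
```

In this toy data, three of four workshops address teaching staff, and the shares sum to 1.5 rather than 1 because two workshops name two groups each.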

In summary, the process can be described as follows:

Figure 5: Process illustration from big data to useful insights


The methods outlined here illustrate the interaction of automated and manual procedures. They also reflect the result of a learning process in which, on the one hand, the content-focused work benefited more and more from automated procedures and, on the other hand, the selection of automated procedures was increasingly adapted to the needs of the content-related work. This interweaving led to precise, usable results. We saw that measures for gender sensitivity are still primarily developed and offered in the context of studies and teaching. In connection with appointment committees, gender equality officers are addressed more frequently than other members of the committee. People with management responsibilities, on the other hand, are rarely mentioned explicitly as a target group of gender sensitivity measures. However, it must be taken into account that the web crawl can only examine publicly available data. In order to make good practice transparent and to promote mutual learning, it would be desirable for institutions to publish relevant measures.

Funding number: 01FP1901