Data Ingest

Data ingest plays an important role in long-term preservation. The more precisely the submission of data is planned and carried out, the more easily future preservation measures can be planned and realized. For this reason, as much information on the study as possible is collected and documented in detail. After the study has been submitted, its content is verified and enhanced with further information.

Validation

After data are submitted to the archive, a validation is carried out to check the following aspects:

  • Documentation check: Which survey instruments were used? In which formats is the documentation available, and is it complete?
  • Data check:
    • Data formats: are they complete?
    • Do the data and the project match? (Does the dataset correspond to the questionnaire?)
    • Check of weighting, wild codes, coding errors, duplicates, etc.
    • Check of completeness and comprehensibility of labels
    • Plausibility check
    • Check that data protection issues are addressed (data granularity, e.g. with respect to geographic coverage or occupational classification)
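
Two of the data checks above, wild codes and duplicates, can be illustrated with a minimal Python sketch. It is not the archive's actual tooling: the function names, the codebook layout (variable name mapped to the set of defined codes), and the sample records are assumptions made for illustration only.

```python
from collections import Counter

def find_wild_codes(records, codebook):
    """Return {variable: sorted values that are not defined in the codebook}."""
    wild = {}
    for var, valid_codes in codebook.items():
        unexpected = {
            row[var] for row in records
            if var in row and row[var] not in valid_codes
        }
        if unexpected:
            wild[var] = sorted(unexpected)
    return wild

def find_duplicate_ids(records, id_var):
    """Return the sorted respondent IDs that occur more than once."""
    counts = Counter(row[id_var] for row in records)
    return sorted(rid for rid, n in counts.items() if n > 1)

# Hypothetical codebook and data, invented for this example.
codebook = {"sex": {1, 2, 9}, "agree": {1, 2, 3, 4, 5, 8, 9}}
records = [
    {"id": 1, "sex": 1, "agree": 3},
    {"id": 2, "sex": 2, "agree": 7},  # 7 is not a defined code
    {"id": 2, "sex": 3, "agree": 5},  # repeated id, and 3 is not a defined code
]

print(find_wild_codes(records, codebook))   # {'sex': [3], 'agree': [7]}
print(find_duplicate_ids(records, "id"))    # [2]
```

Such a script only flags candidates. Whether a flagged value is a genuine error still has to be decided against the questionnaire and the depositor's documentation.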

Each study receives a study number and is recorded in the GESIS data catalogue (DBK).

The data are stored in the archive system with further information (archive agreement, correspondence between archive and depositor, etc.).

Two additional standard variables are added (study number; version number and date of version). The subsequent steps depend on the original material, so processing can comprise, for example:

  • Completion of missing or unclear variable or value labels
  • Changes in variable order (adjustment to questionnaire)
  • Removal or aggregation of variables due to data protection issues
  • Harmonization
  • Cumulating
  • etc.

Any alterations made to the data are documented and saved together with the data set.

Some large-scale social surveys (ALLBUS, Eurobarometer, EVS, ISSP, Politbarometer) are supervised by special teams.

Versioning

Data stored in the archive are revised and changed even after their publication. For example, errors discovered later are corrected, or the data are augmented with additional variables or interviews. Assigning version numbers ensures that datasets used for publications can be identified together with their study number, allowing for unique referencing and citation.

A persistent identifier (DOI name) assigned to each version also makes the data easier to locate. DOI names link the user directly to the study description in the DBK.

Changes are documented on three levels: Major.Minor.Revision (e.g. 2.1.0):

 1. Position – Major:

  • Addition of one or more new samples (usually countries) in an integrated or cumulative data set
  • Addition of one or more new waves in a cumulative data set
  • Addition/deletion of one or more variables in a data set
  • Addition/deletion of one or more cases in a data set
  • Enhanced processing for a higher class (usually class 1)

 2. Position – Minor:

  • Changes relevant to the meaning of a variable, or completions in the data set (labels, recoding, data formats…)

 3. Position – Revision:

  • Changes that do not affect the meaning of a variable (e.g. correction of spelling mistakes)
  • Simple revision of labels without change in meaning

Example:

A spelling mistake in a data set with version 1.2.3 is corrected (→ 1.2.4), a variable is recoded (→ 1.3.0), and a variable is added (→ 2.0.0). If all the changes are made at once, version number 2.0.0 is assigned. If only the first two changes are made, version number 1.3.0 is assigned.
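
The rule in this example can be written down as a short Python sketch. It is an illustration only: the function name `next_version` and the level labels are assumptions, and the resetting of lower positions to zero is inferred from the example (1.2.4 → 1.3.0 → 2.0.0) rather than stated as a separate rule.

```python
LEVELS = ("major", "minor", "revision")

def next_version(version, changes):
    """Return the version number after applying a batch of changes.

    `changes` holds one level name per change; the highest level present
    decides which position is raised, and the positions after it are
    reset to zero.
    """
    unknown = set(changes) - set(LEVELS)
    if unknown:
        raise ValueError(f"unknown change level(s): {sorted(unknown)}")
    major, minor, revision = (int(part) for part in version.split("."))
    if "major" in changes:
        return f"{major + 1}.0.0"
    if "minor" in changes:
        return f"{major}.{minor + 1}.0"
    if "revision" in changes:
        return f"{major}.{minor}.{revision + 1}"
    return version  # no changes, version stays as it is

# The cases from the example above, starting from version 1.2.3.
print(next_version("1.2.3", ["revision"]))                    # 1.2.4
print(next_version("1.2.3", ["revision", "minor"]))           # 1.3.0
print(next_version("1.2.3", ["revision", "minor", "major"]))  # 2.0.0
```

Applying the three changes one after another gives the same final result as applying them together, which matches the sequence 1.2.4, 1.3.0, 2.0.0 in the example.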