Data Preparation

Translating and cleaning the Vannmiljø dataset

This page outlines the data preparation steps required to standardize the Vannmiljø dataset. The original dataset contains Norwegian terminology and various missing (NA) values. Our goal is to translate the names into English, enforce consistent snake_case naming conventions, and handle missing or inconsistent values appropriately.

1. Column Name Translation

The first step in the data preparation process is renaming the columns. The original Norwegian column names (often CamelCase or mixed) are converted to standardized, lowercase English snake_case names. This makes the dataset much easier to work with in R and SQL.

The table below shows the exact mapping from the original column names to the new column names:

Original Norwegian Name (old) New English Name (new)
Vannlokalitet_kode site_code
Vannlokalitetsnavn site_name
Betegnelse label
Type site_type
Aktivitet_id activity_id
Aktivitet_navn activity_name
Oppdragsgiver client
Oppdragstaker contractor
Parameter_id param_id
Parameter_navn param_name
Parameter_casnr cas_no
Medium_id medium_id
Medium_navn medium_name
LatinskNavn_id taxon_id
VitenskapligNavn scientific_name
Provetakmetode_id sample_method
Analysemetode_id analysis_method
Tid_provetak sample_time
Ovre_dyp upper_depth
Nedre_dyp lower_depth
DybdeEnhet depth_unit
Filtrert_Prove is_filtered
UnntasKlassifisering exclude_class
Operator operator
Verdi value
Listenavn list_name
Enhet unit
Provenr sample_no
Deteksjonsgrense lod
Kvantifiseringsgrense loq
Opprinnelse origin
Ant_verdier n_values
Kommentar comment
Arkiv archive
Produktbeskrivelse product_desc
UTM33 Ost (X) utm33_x
UTM33 Nord (Y) utm33_y

2. Data Cleaning and Value Conversion

After renaming the columns, the internal row values must be cleaned and translated. The data cleaning process addresses several inconsistencies in the raw data:

  • Handling Missing Entities: Missing (NA) values in contractor are replaced with "Unknown". For client, missing values or placeholders like "0" are also replaced with "Unknown".
  • Standardizing Methods: Missing values or the Norwegian word "UKJENT" (Unknown) in sample_method and analysis_method are standardized to "Unknown".
  • Fixing Depth Variables: Missing values (NA) in upper_depth and lower_depth are assumed to be surface level and set to 0.0. The column types are safely coerced to numeric.
  • Boolean Conversions:
    • The is_filtered column is evaluated: if the value is "Filtrert" (Filtered), it is converted to TRUE; otherwise, FALSE. The result is saved to a new filtered column.
    • The archive column uses "j" (for ja, meaning yes). This is converted to a standard TRUE/FALSE boolean.
  • Translating Parameters: The Norwegian chemical parameter names in param_name are mapped to their English equivalents (e.g., Kobber to Copper, Sink to Zinc).