Data Preparation
Translating and cleaning the Vannmiljø dataset
This page outlines the data preparation steps required to standardize the Vannmiljø dataset. The original dataset contains Norwegian terminology and various missing (NA) values. Our goal is to translate the names into English, enforce consistent snake_case naming conventions, and handle missing or inconsistent values appropriately.
1. Column Name Translation
The first step in the data preparation process is renaming the columns. The original Norwegian column names (a mix of CamelCase, underscores, and, for the UTM coordinates, spaces and parentheses) are converted to standardized, lowercase English snake_case names. This makes the dataset easier to work with in R and SQL.
The table below shows the exact mapping from the original column names to the new column names:
| Original Norwegian Name (old) | New English Name (new) |
|---|---|
| Vannlokalitet_kode | site_code |
| Vannlokalitetsnavn | site_name |
| Betegnelse | label |
| Type | site_type |
| Aktivitet_id | activity_id |
| Aktivitet_navn | activity_name |
| Oppdragsgiver | client |
| Oppdragstaker | contractor |
| Parameter_id | param_id |
| Parameter_navn | param_name |
| Parameter_casnr | cas_no |
| Medium_id | medium_id |
| Medium_navn | medium_name |
| LatinskNavn_id | taxon_id |
| VitenskapligNavn | scientific_name |
| Provetakmetode_id | sample_method |
| Analysemetode_id | analysis_method |
| Tid_provetak | sample_time |
| Ovre_dyp | upper_depth |
| Nedre_dyp | lower_depth |
| DybdeEnhet | depth_unit |
| Filtrert_Prove | is_filtered |
| UnntasKlassifisering | exclude_class |
| Operator | operator |
| Verdi | value |
| Listenavn | list_name |
| Enhet | unit |
| Provenr | sample_no |
| Deteksjonsgrense | lod |
| Kvantifiseringsgrense | loq |
| Opprinnelse | origin |
| Ant_verdier | n_values |
| Kommentar | comment |
| Arkiv | archive |
| Produktbeskrivelse | product_desc |
| UTM33 Ost (X) | utm33_x |
| UTM33 Nord (Y) | utm33_y |
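The renaming step can be sketched as a simple dictionary lookup. The snippet below is a minimal illustration in plain Python, not the pipeline's actual code: `COLUMN_MAP` holds only a subset of the table above, and `rename_columns` and the sample record are hypothetical names and values introduced for this example.

```python
# Subset of the mapping table above; extend with the remaining rows as needed.
COLUMN_MAP = {
    "Vannlokalitet_kode": "site_code",
    "Vannlokalitetsnavn": "site_name",
    "Tid_provetak": "sample_time",
    "Verdi": "value",
    "Enhet": "unit",
    "UTM33 Ost (X)": "utm33_x",
    "UTM33 Nord (Y)": "utm33_y",
}


def rename_columns(row, column_map=COLUMN_MAP):
    """Return a copy of `row` with keys renamed according to `column_map`.

    Keys that are absent from the mapping are kept unchanged, so a partial
    mapping never silently drops data.
    """
    return {column_map.get(key, key): value for key, value in row.items()}


# Hypothetical record, for illustration only (not real Vannmiljø data).
raw_row = {"Vannlokalitet_kode": "X-001", "Verdi": "1.5", "Enhet": "mg/l"}
renamed = rename_columns(raw_row)
print(renamed)
```

Keeping unmapped keys as they are is a design choice for this sketch: it makes a missing table entry visible in the output instead of discarding the column.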
2. Data Cleaning and Value Conversion
After renaming the columns, the values themselves must be cleaned and translated. The data cleaning process addresses several inconsistencies in the raw data:
- Handling Missing Entities: Missing (`NA`) values in `contractor` are replaced with `"Unknown"`. For `client`, missing values or placeholders like `"0"` are also replaced with `"Unknown"`.
- Standardizing Methods: Missing values or the Norwegian word `"UKJENT"` (Unknown) in `sample_method` and `analysis_method` are standardized to `"Unknown"`.
- Fixing Depth Variables: Missing values (`NA`) in `upper_depth` and `lower_depth` are assumed to be surface level and set to `0.0`. The column types are safely coerced to numeric.
- Boolean Conversions:
  - The `is_filtered` column is evaluated: if the value is `"Filtrert"` (Filtered), it is converted to `TRUE`; otherwise, `FALSE`. The result is saved to a new `filtered` column.
  - The `archive` column uses `"j"` (for ja, meaning yes). This is converted to a standard `TRUE`/`FALSE` boolean.
- Translating Parameters: The Norwegian chemical parameter names in `param_name` are mapped to their English equivalents (e.g., Kobber to Copper, Sink to Zinc).
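The cleaning rules above can be sketched as a single row-level function. This is a minimal illustration in plain Python under several assumptions that the text does not confirm: `None` and empty strings stand in for `NA`, `"0"` is the only `client` placeholder handled, depth values that cannot be parsed are treated like missing values, any `archive` value other than `"j"` becomes `False`, and the translation table contains only the two pairs named in the list. The names `clean_row`, `clean_depth`, `is_missing`, and `PARAM_TRANSLATIONS` are hypothetical.

```python
UNKNOWN = "Unknown"

# Only the two translations named in the text; the full lookup is not given.
PARAM_TRANSLATIONS = {"Kobber": "Copper", "Sink": "Zinc"}


def is_missing(value):
    """Treat None and blank strings as missing (stand-ins for NA)."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def clean_depth(value):
    """Coerce a depth to float; missing values become 0.0 (surface level).

    Assumption: values that cannot be parsed are handled like missing ones.
    """
    if is_missing(value):
        return 0.0
    try:
        return float(value)
    except (TypeError, ValueError):
        return 0.0


def clean_row(row):
    """Return a cleaned copy of a row that already uses the English names."""
    out = dict(row)

    # Missing entities: NA in contractor; NA or the "0" placeholder in client.
    if is_missing(out.get("contractor")):
        out["contractor"] = UNKNOWN
    if is_missing(out.get("client")) or out.get("client") == "0":
        out["client"] = UNKNOWN

    # Methods: NA or the Norwegian "UKJENT" become "Unknown".
    for col in ("sample_method", "analysis_method"):
        if is_missing(out.get(col)) or out.get(col) == "UKJENT":
            out[col] = UNKNOWN

    # Depths: missing means surface level; everything is coerced to float.
    for col in ("upper_depth", "lower_depth"):
        out[col] = clean_depth(out.get(col))

    # Booleans: "Filtrert" -> True in a new column; "j" (ja) -> True.
    out["filtered"] = out.get("is_filtered") == "Filtrert"
    out["archive"] = out.get("archive") == "j"

    # Parameter names: translate known names, leave unknown ones untouched.
    name = out.get("param_name")
    out["param_name"] = PARAM_TRANSLATIONS.get(name, name)
    return out


# Hypothetical record, for illustration only.
example = {
    "contractor": None, "client": "0",
    "sample_method": "UKJENT", "analysis_method": None,
    "upper_depth": None, "lower_depth": "2.5",
    "is_filtered": "Filtrert", "archive": "j",
    "param_name": "Kobber",
}
cleaned = clean_row(example)
print(cleaned)
```

The function works on a copy and leaves `is_filtered` in place alongside the new `filtered` column, so the original value remains available for checking the conversion.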