| Fraction | Size range | Description |
|---|---|---|
| Clay | < 2 µm | Very fine particles, high surface area |
| Silt | 2 – 63 µm | Fine particles that settle slowly |
| Sand | 63 – 2000 µm | Medium particles comprising five sub-fractions (63–125, 125–250, 250–500, 500–1000, 1000–2000 µm) |
| Gravel | > 2000 µm | Coarse particles > 2 mm; typically minor in open marine settings |
Sediment Particle Size Fractions
From raw grain-size measurements to standardised fractions
Overview
Marine sediment cores collected as part of this project were analysed for grain-size distribution. The raw data report particle abundance as percentages across a variety of size classes and aggregate measures.
Data Quality Issues
1. Non-standard measurement operators
Not all measurements are simple point estimates. Each value in the source data is accompanied by an operator describing its relationship to the true value:
| Operator | Meaning | Treatment |
|---|---|---|
| = | Exact measurement | Used as-is |
| < | Below detection or quantification limit | Replaced with half the reported value (mid-point convention) |
| > | Exceeds the upper measurement range (lower bound only) | Used as-is (treated as a measured lower bound) |
2. Decimal-point errors
Values well above 100 % were present in the source data, most likely caused by decimal points being omitted or misplaced during data entry in spreadsheet software. For example, a true value of 45.9 % was recorded as 45903.
3. Overlapping and redundant parameters
The source data contain both aggregate parameters and their constituent sub-fractions for the same sample:
- Fines < 63 µm (FINS) is the sum of Clay + Silt, but some samples report FINS alongside separate Clay and Silt measurements, double-counting that portion.
- Coarse > 63 µm (GSMF_63) is similarly redundant when individual sand sub-fractions or a gravel measurement are also present.
Including both aggregate and component values inflates the apparent total well beyond 100 % and prevents straightforward fraction conversion.
Processing Pipeline
The data are processed in six numbered steps (Steps 0–5), with Step 0 divided into three pre-processing sub-steps (0, 0b and 0c). Each step feeds directly into the next.
Step 0: Operator adjustment
Measurement operators are resolved into usable numeric values using the conventions shown in the table above. All subsequent steps operate on these adjusted values.
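The operator conventions can be sketched as a small function. This is a minimal illustration rather than the project's actual code; the function name adjust_for_operator is hypothetical, and only the three operators listed in the table above are handled.

```python
def adjust_for_operator(value, operator):
    """Resolve a (value, operator) pair into one usable number.

    Hypothetical helper: follows the conventions in the operator table.
    """
    if operator == "=":
        return value          # exact measurement, used as-is
    if operator == "<":
        return value / 2.0    # below limit: half the reported value
    if operator == ">":
        return value          # lower bound, used as-is
    raise ValueError(f"unknown operator: {operator!r}")
```

For example, a value reported as "< 0.5" would be carried forward as 0.25.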
Step 0b: Decimal-point correction
For each parameter, a background median is computed from all valid values (i.e. those already in the range 0–100 %). Values exceeding 100 % are then corrected by dividing by successive powers of ten (10, 100, 1000, …) and selecting the candidate that is both ≤ 100 % and closest to that parameter’s background median. Values that cannot be recovered this way are set to missing and flagged.
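One way to picture this correction is the sketch below. It is an illustration under stated assumptions, not the project's implementation: the function names and the max_power cap are hypothetical, and the rule for "cannot be recovered" (a negative value, or no valid background median) is assumed because the text does not spell it out.

```python
from statistics import median


def background_median(values):
    """Median of the values that already lie in the valid 0-100 % range."""
    valid = [v for v in values if 0 <= v <= 100]
    return median(valid) if valid else None


def correct_decimal(value, bg_median, max_power=6):
    """Return a corrected value, or None when no recovery is possible.

    max_power (hypothetical) caps the powers of ten that are tried.
    """
    if 0 <= value <= 100:
        return value                      # already valid, leave untouched
    if value < 0 or bg_median is None:
        return None                       # assumed "unrecoverable" cases
    # Candidates: value / 10, value / 100, ... that land at or below 100 %.
    candidates = [value / 10 ** p for p in range(1, max_power + 1)]
    candidates = [c for c in candidates if c <= 100]
    if not candidates:
        return None
    # Keep the candidate closest to the parameter's background median.
    return min(candidates, key=lambda c: abs(c - bg_median))
```

With a background median near 48 %, the recorded value 45903 would be mapped to 45.903 rather than 4.5903, because the former lies closer to the median.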
Step 0c: Overlap removal and rescaling
Redundant aggregate parameters are removed from any sample where their components are also present. Specifically:
- FINS is dropped if either GSMF2 (Clay) or GSMF2_63 (Silt) is present.
- GSMF_63 is dropped if any sand sub-fraction or GSMF_2000 (Gravel) is present.
After deduplication, the remaining values for each sample are summed. If the total exceeds 100 %, all values are rescaled proportionally so that the sample sums to exactly 100 %.
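The deduplication and rescaling rules could look roughly like the following. This is a sketch only: the function name is hypothetical, a sample is assumed to be a dict mapping parameter codes to percentages, and the codes for the sand sub-fractions are passed in by the caller because they are not listed in this document.

```python
CLAY, SILT, FINES = "GSMF2", "GSMF2_63", "FINS"
COARSE, GRAVEL = "GSMF_63", "GSMF_2000"


def remove_overlaps_and_rescale(sample, sand_codes):
    """Drop redundant aggregates, then rescale if the sum exceeds 100 %.

    sample: dict of parameter code -> percentage (assumed layout).
    sand_codes: caller-supplied codes of the sand sub-fractions.
    """
    cleaned = dict(sample)
    # FINS is redundant when clay or silt is reported separately.
    if FINES in cleaned and (CLAY in cleaned or SILT in cleaned):
        del cleaned[FINES]
    # GSMF_63 is redundant when any sand sub-fraction or gravel is present.
    has_coarse_part = GRAVEL in cleaned or any(c in cleaned for c in sand_codes)
    if COARSE in cleaned and has_coarse_part:
        del cleaned[COARSE]
    total = sum(cleaned.values())
    if total > 100:
        # Proportional rescaling so the sample sums to 100 %.
        cleaned = {k: v * 100.0 / total for k, v in cleaned.items()}
    return cleaned
```

Totals at or below 100 % are left unchanged, which matches the rule that rescaling applies only when the sum exceeds 100 %.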
Step 1: Wide-format restructuring
The cleaned long-format data are pivoted to wide format, giving one row per sample × sediment layer with columns for each grain-size parameter and its operator.
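The restructuring can be illustrated with plain dictionaries. The sketch assumes each long-format record has the keys sample_id, sediment_no, parameter, value and operator; the last three names are hypothetical, as is the "_op" suffix used for the operator columns.

```python
def pivot_wide(rows):
    """Pivot long-format records to one record per (sample, layer).

    Each parameter becomes a value column plus an '<parameter>_op'
    column holding its operator (naming is an assumption).
    """
    wide = {}
    for r in rows:
        key = (r["sample_id"], r["sediment_no"])
        record = wide.setdefault(key, {})
        record[r["parameter"]] = r["value"]
        record[r["parameter"] + "_op"] = r["operator"]
    return wide
```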
Step 2: Arithmetic derivation (four passes)
The four target fractions are derived algebraically from whichever measurements are available, without using any background statistics. The derivation exploits the following constraints:
- Clay + Silt = Fines
- Sand + Gravel = Coarse
- Fines + Coarse = 100 %
Four passes are made in order:
| Pass | Action |
|---|---|
| 1 | Assign directly measured values |
| 2a | Compute group totals (Fines or Coarse) from the sum of their components |
| 2b | Derive a missing component by subtracting the known one from the group total |
| 3 | Apply the 100 % constraint to derive the missing group total |
| 4 | Repeat Pass 2b using group totals that became available in Pass 3 |
This resolves the vast majority of combinations present in the dataset without any assumptions about the sediment composition.
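The pass order can be sketched as follows. This is a simplified illustration, not the project's code: the function name and the lowercase keys are hypothetical, sand is assumed to be already summed from its sub-fractions, and missing values are represented by None.

```python
def derive_fractions(measured):
    """Apply the passes in order; unknown values stay None."""
    keys = ("clay", "silt", "sand", "gravel", "fines", "coarse")
    v = {k: measured.get(k) for k in keys}   # Pass 1: direct values
    groups = (("fines", "clay", "silt"), ("coarse", "sand", "gravel"))

    def pass_2a():
        # Group total from the sum of both components.
        for total, a, b in groups:
            if v[total] is None and v[a] is not None and v[b] is not None:
                v[total] = v[a] + v[b]

    def pass_2b():
        # Missing component = group total minus the known component.
        for total, a, b in groups:
            if v[total] is None:
                continue
            if v[a] is None and v[b] is not None:
                v[a] = v[total] - v[b]
            elif v[b] is None and v[a] is not None:
                v[b] = v[total] - v[a]

    def pass_3():
        # Fines + Coarse = 100 % yields the missing group total.
        if v["fines"] is None and v["coarse"] is not None:
            v["fines"] = 100.0 - v["coarse"]
        elif v["coarse"] is None and v["fines"] is not None:
            v["coarse"] = 100.0 - v["fines"]

    pass_2a()
    pass_2b()
    pass_3()
    pass_2b()   # Pass 4: repeat 2b with totals found in Pass 3
    return v
```

For a sample with clay = 10, silt = 30 and sand = 55, Pass 2a gives fines = 40, Pass 3 gives coarse = 60, and Pass 4 gives gravel = 5.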
Step 3: Background ratio estimation
For samples that arithmetic derivation could not fully resolve, background ratios are estimated from all samples where both components of a group are arithmetically known:
\[r_\text{clay/fines} = \text{median}\!\left(\frac{\text{Clay}}{\text{Clay} + \text{Silt}}\right), \qquad r_\text{gravel/coarse} = \text{median}\!\left(\frac{\text{Gravel}}{\text{Sand} + \text{Gravel}}\right)\]
A set of overall background proportions (Clay : Silt : Sand : Gravel) is also derived from samples where all four fractions are known, for use as a final fallback.
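The two within-group ratios could be computed along these lines. The sketch assumes each sample is a dict with lowercase keys and None for unknown fractions, and it skips groups whose total is zero; those choices and all names are illustrative. The overall four-way proportions are not shown because the text does not state how they are aggregated.

```python
from statistics import median


def _ratio(sample, part, other):
    """part / (part + other), or None if either value is unusable."""
    a, b = sample.get(part), sample.get(other)
    if a is None or b is None or a + b <= 0:
        return None
    return a / (a + b)


def background_ratios(samples):
    """Median clay/fines and gravel/coarse ratios across samples."""
    clay = [r for s in samples if (r := _ratio(s, "clay", "silt")) is not None]
    gravel = [r for s in samples if (r := _ratio(s, "gravel", "sand")) is not None]
    return {
        "clay_fines": median(clay) if clay else None,
        "gravel_coarse": median(gravel) if gravel else None,
    }
```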
Step 4: Within-group background splits
If a group total (Fines or Coarse) is known but neither of its components could be derived arithmetically, the group total is split using the background ratios from Step 3.
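As a small illustration, a known group total can be divided with a background ratio like this. The function name is hypothetical; the ratio is the share assigned to the first component (Clay for Fines, Gravel for Coarse).

```python
def split_group(total, ratio):
    """Split a group total into (first, second) using a background ratio."""
    first = total * ratio
    return first, total - first
```

With Fines = 40 % and a clay/fines ratio of 0.25, this yields Clay = 10 % and Silt = 30 %.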
Step 5: General background imputation
Any fraction still unknown after Steps 2–4 is assigned a share of the remaining percentage budget (100 % minus the sum of known fractions). The share is proportional to that fraction’s background weight. Fractions already resolved are never modified at this stage.
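A possible reading of this step is sketched below. It assumes the background weights are renormalised over the fractions that are still missing, so that the completed sample sums to 100 %, and that the budget is floored at zero; the text does not confirm either detail, and the names are hypothetical.

```python
def impute_remaining(fractions, weights):
    """Fill None entries with weighted shares of the remaining budget.

    fractions: dict of fraction name -> percentage or None.
    weights: background weight per fraction (any positive scale).
    """
    out = dict(fractions)
    missing = [k for k, v in fractions.items() if v is None]
    if not missing:
        return out
    known_sum = sum(v for v in fractions.values() if v is not None)
    budget = max(0.0, 100.0 - known_sum)       # assumed floor at zero
    weight_sum = sum(weights[k] for k in missing)
    for k in missing:
        if weight_sum > 0:
            out[k] = budget * weights[k] / weight_sum
        else:
            out[k] = budget / len(missing)     # assumed equal-split fallback
    return out
```

Known fractions are copied through unchanged, in line with the rule that resolved values are never modified at this stage.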
Quality Confidence Flag
Every sample receives a single qc_confidence rating based on accumulated penalty points from the three pre-processing steps.
| Source | Penalty points |
|---|---|
| Sum of fractions 100–101 % (minor rounding) | +1 |
| Sum of fractions 101–110 % (moderate conflict) | +2 |
| Sum of fractions 110–200 % (serious conflict) | +3 |
| Sum of fractions > 200 % (likely multiple errors) | +4 |
| Decimal-point correction applied (Step 0b) | +1 |
| Redundant aggregate removed (Step 0c) | +1 |
The total penalty is then mapped to a confidence level as follows:
| Total penalty | qc_confidence | Interpretation |
|---|---|---|
| 0 | high | No data quality issues detected |
| 1 | medium | One minor issue (e.g. rounding-only excess or one decimal correction) |
| 2 | low | Two minor or one moderate issue |
| 3 | very_low | Substantial conflict in the original data; values corrected by rescaling |
| 4 or more | unreliable | Severe conflict; results may not be reliable |
Researchers are encouraged to filter on qc_confidence according to the requirements of their analysis. A common starting point is to retain "high" and "medium" samples for quantitative work, and to inspect "low" samples on a case-by-case basis.
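The penalty scheme can be expressed compactly as below. This is a sketch under assumptions: the function name is hypothetical, each range is treated as inclusive of its upper bound (the tables do not say which side the boundaries fall on), and the decimal-correction and aggregate-removal penalties are applied at most once per sample.

```python
def qc_confidence(raw_sum, decimal_corrected, aggregate_removed):
    """Map a sample's pre-processing history to a confidence label."""
    points = 0
    if raw_sum > 200:
        points += 4      # likely multiple errors
    elif raw_sum > 110:
        points += 3      # serious conflict
    elif raw_sum > 101:
        points += 2      # moderate conflict
    elif raw_sum > 100:
        points += 1      # minor rounding excess
    points += int(decimal_corrected) + int(aggregate_removed)
    labels = {0: "high", 1: "medium", 2: "low", 3: "very_low"}
    return labels.get(points, "unreliable")


def keep_for_quantitative_work(label):
    """The suggested starting filter: retain high and medium samples."""
    return label in ("high", "medium")
```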
Derivation Method Labels
In addition to the confidence flag, each of the four output fractions carries a label describing how it was derived:
| Method | Meaning |
|---|---|
| direct | The fraction was measured directly in the original data |
| arithmetic | The fraction was calculated from other measurements using mass-balance arithmetic (no assumptions about composition) |
| background | At least one background ratio or overall proportion was required to estimate this fraction |
Data Availability
A tab-delimited file containing the particle fractions is available for download at the following location.
Column Definitions
The final exported dataset contains one row per sample × sediment layer. The columns are described below.
| Column | Type | Description |
|---|---|---|
| sample_id | character | Unique sample identifier, linking to the broader dataset |
| sediment_no | integer | Sediment layer number within the core (1 = topmost measured layer) |
| clay_pct | numeric | Clay fraction (< 2 µm), as a percentage of the total sediment |
| silt_pct | numeric | Silt fraction (2–63 µm), as a percentage |
| sand_pct | numeric | Sand fraction (63–2000 µm), as a percentage |
| gravel_pct | numeric | Gravel fraction (> 2000 µm), as a percentage |
| total_pct | numeric | Sum of the four fractions; equals 100 % for all fully processed samples |
| clay_method | character | Derivation method for clay: 'direct', 'arithmetic', or 'background' |
| silt_method | character | Derivation method for silt |
| sand_method | character | Derivation method for sand |
| gravel_method | character | Derivation method for gravel |
| any_op_adjusted | logical | TRUE if any input measurement had a non-exact operator (i.e. '<', '>', or 'ND') |
| qc_confidence | factor | Data quality confidence level: 'high', 'medium', 'low', 'very_low', or 'unreliable' |
Fractions estimated using background ratios (method = "background") are model-derived and carry more uncertainty than directly measured or arithmetically derived values. The qc_confidence column summarises the overall reliability of each row and should be the primary filter applied before analysis.