Sediment Particle Size Fractions

From raw grain-size measurements to standardised fractions

Overview

Marine sediment cores collected as part of this project were analysed for grain-size distribution. The raw data report particle abundance as percentages across a variety of size classes and aggregate measures. The processing described here converts them into the four standardised fractions listed below.

Fraction	Size range	Description
Clay	< 2 µm	Very fine particles, high surface area
Silt	2 – 63 µm	Fine particles that settle slowly
Sand	63 – 2000 µm	Medium particles comprising five sub-fractions (63–125, 125–250, 250–500, 500–1000, 1000–2000 µm)
Gravel	> 2000 µm	Coarse particles > 2 mm; typically minor in open marine settings

Data Quality Issues

1. Non-standard measurement operators

Not all measurements are simple point estimates. Each value in the source data is accompanied by an operator describing its relationship to the true value:

Operator	Meaning	Treatment
=	Exact measurement	Used as-is
<	Below detection or quantification limit	Replaced with half the reported value (mid-point convention)
>	Exceeds the upper measurement range (lower bound only)	Used as-is (treated as a measured lower bound)

2. Decimal-point errors

Values well above 100 % were present in the source data, most likely because decimal points were omitted or misplaced during data entry in spreadsheet software. For example, a true value of 45.903 % was recorded as 45903.

3. Overlapping and redundant parameters

The source data contain both aggregate parameters and their constituent sub-fractions for the same sample:

Fines < 63 µm (FINS) is the sum of Clay + Silt, but some samples report FINS alongside separate Clay and Silt measurements, double-counting that portion.
Coarse > 63 µm (GSMF_63) is similarly redundant when individual sand sub-fractions or a gravel measurement are also present.

Including both aggregate and component values inflates the apparent total well beyond 100 % and prevents straightforward fraction conversion.

Processing Pipeline

The data are processed in six sequential steps (Steps 0–5), with Step 0 split into three pre-processing sub-steps (0, 0b and 0c). Each step feeds directly into the next.

Step 0: Operator adjustment

Measurement operators are resolved into usable numeric values using the conventions shown in the table above. All subsequent steps operate on these adjusted values.
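
The operator conventions can be sketched as a small function. This is a minimal illustration rather than the project's actual code: the function name is hypothetical, and raising an error for an unrecognised operator is an assumption, since the source does not say how such cases are handled.

```python
def adjust_for_operator(value, operator):
    """Resolve a (value, operator) pair into a single usable number."""
    if operator == "<":
        # Below detection/quantification limit: use half the reported value.
        return value / 2.0
    if operator in ("=", ">"):
        # Exact values and lower bounds are used as reported.
        return value
    # Assumption: anything else is treated as an error in this sketch.
    raise ValueError(f"unrecognised operator: {operator!r}")


adjusted = adjust_for_operator(4.0, "<")  # half of the reported limit
```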

Step 0b: Decimal-point correction

For each parameter, a background median is computed from all valid values (i.e. those already in the range 0–100 %). Values exceeding 100 % are then corrected by dividing by successive powers of ten (10, 100, 1000, …) and selecting the candidate that is both ≤ 100 % and closest to that parameter’s background median. Values that cannot be recovered this way are set to missing and flagged.
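
A possible implementation of this correction is sketched below. The function names and the `max_power` cap are assumptions made for illustration; the source does not state how many powers of ten are tried or exactly when a value counts as unrecoverable.

```python
from statistics import median


def background_median(values):
    """Median of the values already inside the valid 0-100 % range."""
    valid = [v for v in values if 0 <= v <= 100]
    return median(valid) if valid else None


def correct_decimal(value, bg_median, max_power=6):
    """Return a corrected value, or None if no candidate is usable."""
    if value <= 100:
        return value  # nothing to correct
    # Candidates: value / 10, value / 100, ... (max_power is an assumed cap).
    candidates = [value / 10 ** k for k in range(1, max_power + 1)]
    candidates = [c for c in candidates if c <= 100]
    if not candidates or bg_median is None:
        return None  # treated as missing; the caller would flag the record
    # Keep the candidate closest to the parameter's background median.
    return min(candidates, key=lambda c: abs(c - bg_median))


bg = background_median([38.0, 41.5, 44.0, 45903])  # 45903 is excluded
fixed = correct_decimal(45903, bg)                  # picks 45903 / 1000
```

With a background median near 40 %, the candidate 45.903 is chosen over 4.5903 and smaller values because it lies closest to the median.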

Step 0c: Overlap removal and rescaling

Redundant aggregate parameters are removed from any sample where their components are also present. Specifically:

FINS is dropped if either GSMF2 (Clay) or GSMF2_63 (Silt) is present.
GSMF_63 is dropped if any sand sub-fraction or GSMF_2000 (Gravel) is present.

After deduplication, the remaining values for each sample are summed. If the total exceeds 100 %, all values are rescaled proportionally so that the sample sums to exactly 100 %.
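
The deduplication and rescaling rules can be sketched as follows. The parameter codes FINS, GSMF2, GSMF2_63, GSMF_63 and GSMF_2000 are taken from the text above; the codes for the sand sub-fractions are not given in this document, so they are passed in by the caller and the name used in the example is a placeholder.

```python
def remove_overlaps_and_rescale(sample, sand_codes):
    """sample maps parameter code -> adjusted percentage for one sample."""
    kept = dict(sample)
    # FINS is redundant when Clay or Silt is reported separately.
    if "GSMF2" in kept or "GSMF2_63" in kept:
        kept.pop("FINS", None)
    # GSMF_63 is redundant when any sand sub-fraction or Gravel is reported.
    if "GSMF_2000" in kept or any(code in kept for code in sand_codes):
        kept.pop("GSMF_63", None)
    total = sum(kept.values())
    # Rescale proportionally only when the total exceeds 100 %.
    if total > 100:
        kept = {code: v * 100.0 / total for code, v in kept.items()}
    return kept


# "SAND_SUB_A" is a placeholder code, not a real parameter name.
raw = {"FINS": 60.0, "GSMF2": 20.0, "GSMF2_63": 40.0, "GSMF_63": 50.0}
cleaned = remove_overlaps_and_rescale(raw, sand_codes=("SAND_SUB_A",))
```

In this example FINS is dropped, GSMF_63 is kept because no sand sub-fraction or gravel value is present, and the remaining 110 % is scaled down to 100 %.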

Step 1: Wide-format restructuring

The cleaned long-format data are pivoted to wide format, giving one row per sample × sediment layer with columns for each grain-size parameter and its operator.
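
A minimal sketch of the pivot, using only the standard library. The long-format field names `parameter`, `value` and `operator`, and the `_op` column suffix, are assumptions for illustration; `sample_id` and `sediment_no` follow the exported column names.

```python
from collections import defaultdict


def pivot_wide(records):
    """Turn long-format records into one dict per (sample, layer)."""
    wide = defaultdict(dict)
    for rec in records:
        key = (rec["sample_id"], rec["sediment_no"])
        # One column for the value and one for its operator.
        wide[key][rec["parameter"]] = rec["value"]
        wide[key][rec["parameter"] + "_op"] = rec["operator"]
    return dict(wide)


long_rows = [
    {"sample_id": "S1", "sediment_no": 1, "parameter": "GSMF2",
     "value": 12.0, "operator": "="},
    {"sample_id": "S1", "sediment_no": 1, "parameter": "GSMF2_63",
     "value": 30.0, "operator": "="},
    {"sample_id": "S1", "sediment_no": 2, "parameter": "FINS",
     "value": 1.0, "operator": "<"},
]
wide_rows = pivot_wide(long_rows)
```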

Step 2: Arithmetic derivation (four passes)

The four target fractions are derived algebraically from whichever measurements are available, without using any background statistics. The derivation exploits the following constraints:

Clay + Silt = Fines
Sand + Gravel = Coarse
Fines + Coarse = 100 %

Four passes are made in order:

Pass	Action
1	Assign directly measured values
2a	Compute group totals (Fines or Coarse) from the sum of their components
2b	Derive a missing component by subtracting the known one from the group total
3	Apply the 100 % constraint to derive the missing group total
4	Repeat Pass 2b using group totals that became available in Pass 3

This resolves the vast majority of combinations present in the dataset without any assumptions about the sediment composition.
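
The passes can be sketched as below. This is an illustrative reading of the table, not the project's code: the function and argument names are hypothetical, sand is assumed to have already been summed from its sub-fractions, and no handling of negative results is attempted.

```python
def derive_fractions(clay=None, silt=None, sand=None, gravel=None,
                     fines=None, coarse=None):
    """Return (fractions, methods, group_totals) after the four passes."""
    f = {"clay": clay, "silt": silt, "sand": sand, "gravel": gravel}
    # Pass 1: directly measured values keep the label 'direct'.
    method = {k: ("direct" if v is not None else None) for k, v in f.items()}
    g = {"fines": fines, "coarse": coarse}
    members = {"fines": ("clay", "silt"), "coarse": ("sand", "gravel")}

    def subtract_known_component():
        # Pass 2b / Pass 4: missing component = group total - known one.
        for grp, (a, b) in members.items():
            if g[grp] is None:
                continue
            for known, missing in ((a, b), (b, a)):
                if f[known] is not None and f[missing] is None:
                    f[missing] = g[grp] - f[known]
                    method[missing] = "arithmetic"

    # Pass 2a: group totals from the sum of both components.
    for grp, (a, b) in members.items():
        if g[grp] is None and f[a] is not None and f[b] is not None:
            g[grp] = f[a] + f[b]
    subtract_known_component()  # Pass 2b

    # Pass 3: Fines + Coarse = 100 %.
    if g["fines"] is None and g["coarse"] is not None:
        g["fines"] = 100.0 - g["coarse"]
    elif g["coarse"] is None and g["fines"] is not None:
        g["coarse"] = 100.0 - g["fines"]

    subtract_known_component()  # Pass 4
    return f, method, g


fractions, methods, totals = derive_fractions(clay=10, silt=30, sand=55)
# Gravel is recovered as 100 - (10 + 30) - 55 = 5, labelled 'arithmetic'.
```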

Step 3: Background ratio estimation

For samples that arithmetic derivation could not fully resolve, background ratios are estimated from all samples where both components of a group are arithmetically known:

\[r_\text{clay/fines} = \text{median}\!\left(\frac{\text{Clay}}{\text{Clay} + \text{Silt}}\right)\]
\[r_\text{gravel/coarse} = \text{median}\!\left(\frac{\text{Gravel}}{\text{Sand} + \text{Gravel}}\right)\]

A set of overall background proportions (Clay : Silt : Sand : Gravel) is also derived from samples where all four fractions are known, for use as a final fallback.
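
The two ratios defined above could be computed along these lines. The function and key names are hypothetical, and skipping samples whose group total is zero is an assumption made to avoid division by zero. The overall Clay : Silt : Sand : Gravel proportions are left out because the text does not specify how they are aggregated.

```python
from statistics import median


def background_ratios(samples):
    """samples: dicts with optional 'clay', 'silt', 'sand', 'gravel' keys."""
    clay_ratios, gravel_ratios = [], []
    for s in samples:
        clay, silt = s.get("clay"), s.get("silt")
        sand, gravel = s.get("sand"), s.get("gravel")
        # Use a sample only when both components of the group are known.
        if clay is not None and silt is not None and clay + silt > 0:
            clay_ratios.append(clay / (clay + silt))
        if sand is not None and gravel is not None and sand + gravel > 0:
            gravel_ratios.append(gravel / (sand + gravel))
    return {
        "clay_fines": median(clay_ratios) if clay_ratios else None,
        "gravel_coarse": median(gravel_ratios) if gravel_ratios else None,
    }


ratios = background_ratios([
    {"clay": 10.0, "silt": 30.0, "sand": 55.0, "gravel": 5.0},
    {"clay": 20.0, "silt": 20.0, "sand": 60.0, "gravel": 0.0},
    {"clay": 5.0, "silt": 45.0},
])
```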

Step 4: Within-group background splits

If a group total (Fines or Coarse) is known but neither of its components could be derived arithmetically, the group total is split using the background ratios from Step 3.
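
As a sketch of the split, assuming a background ratio has already been estimated (the ratio 0.25 and all names below are illustrative, not values from the dataset):

```python
def split_group(fractions, methods, group_total, ratio, first, second):
    """Split a known group total between two unresolved components.

    ratio is the background share of `first` within the group,
    e.g. clay/fines or gravel/coarse.
    """
    out_f, out_m = dict(fractions), dict(methods)
    # Act only when the total is known and neither component is resolved.
    if (group_total is None or out_f[first] is not None
            or out_f[second] is not None):
        return out_f, out_m
    out_f[first] = group_total * ratio
    out_f[second] = group_total - out_f[first]
    out_m[first] = out_m[second] = "background"
    return out_f, out_m


fr = {"clay": None, "silt": None, "sand": 55.0, "gravel": 5.0}
me = {"clay": None, "silt": None, "sand": "direct", "gravel": "arithmetic"}
fr2, me2 = split_group(fr, me, 40.0, 0.25, "clay", "silt")
# Fines of 40 % split into 10 % clay and 30 % silt.
```

Returning copies rather than modifying the inputs keeps the arithmetically derived values untouched, which makes it easier to audit which fractions relied on a background ratio.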

Step 5: General background imputation

Any fraction still unknown after Steps 2–4 is assigned a share of the remaining percentage budget (100 % minus the sum of known fractions). The share is proportional to that fraction’s background weight. Fractions already resolved are never modified at this stage.
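
The budget-sharing rule can be illustrated as follows. The names and the example weights are hypothetical. Clamping a negative budget to zero and splitting equally when all relevant weights are zero are assumptions, since the text does not cover those cases.

```python
def impute_remaining(fractions, weights):
    """Fill unknown fractions from the remaining percentage budget."""
    out = dict(fractions)
    unknown = [k for k, v in out.items() if v is None]
    if not unknown:
        return out
    known_sum = sum(v for v in out.values() if v is not None)
    budget = max(100.0 - known_sum, 0.0)  # assumption: never negative
    weight_sum = sum(weights[k] for k in unknown)
    for k in unknown:
        if weight_sum > 0:
            # Share proportional to the fraction's background weight.
            out[k] = budget * weights[k] / weight_sum
        else:
            out[k] = budget / len(unknown)  # assumption: equal split
    return out  # fractions that were already known are left unchanged


# Illustrative background weights (Clay : Silt : Sand : Gravel).
bg_weights = {"clay": 10.0, "silt": 30.0, "sand": 55.0, "gravel": 5.0}
filled = impute_remaining(
    {"clay": None, "silt": None, "sand": 60.0, "gravel": 0.0}, bg_weights
)
# The remaining 40 % is shared 10 : 30 between clay and silt.
```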

Quality Confidence Flag

Every sample receives a single qc_confidence rating based on accumulated penalty points from the three pre-processing steps.

Source	Penalty points
Sum of fractions > 100 % up to 101 % (minor rounding)	+1
Sum of fractions 101–110 % (moderate conflict)	+2
Sum of fractions 110–200 % (serious conflict)	+3
Sum of fractions > 200 % (likely multiple errors)	+4
Decimal-point correction applied (Step 0b)	+1
Redundant aggregate removed (Step 0c)	+1

Total penalty	qc_confidence	Interpretation
0	high	No data quality issues detected
1	medium	One minor issue (e.g. rounding-only excess or one decimal correction)
2	low	Two minor or one moderate issue
3	very_low	Substantial conflict in the original data; values corrected by rescaling
4 or more	unreliable	Severe conflict; results may not be reliable

Researchers are encouraged to filter on qc_confidence according to the requirements of their analysis. A common starting point is to retain "high" and "medium" samples for quantitative work, and to inspect "low" samples on a case-by-case basis.
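
The two tables above can be combined into a short scoring function, sketched here under stated assumptions: the upper bound of each sum range is treated as inclusive, a sum of exactly 100 % carries no penalty, and the decimal-correction and aggregate-removal penalties are each applied at most once per sample. None of these details is spelled out in the tables, and the function names are hypothetical.

```python
def sum_penalty(total_pct):
    """Penalty points for the pre-rescaling sum of fractions."""
    if total_pct <= 100:
        return 0
    if total_pct <= 101:   # minor rounding
        return 1
    if total_pct <= 110:   # moderate conflict
        return 2
    if total_pct <= 200:   # serious conflict
        return 3
    return 4               # likely multiple errors


LABELS = {0: "high", 1: "medium", 2: "low", 3: "very_low"}


def qc_confidence(total_pct, decimal_corrected, aggregate_removed):
    """Map accumulated penalty points to a qc_confidence label."""
    points = (sum_penalty(total_pct)
              + int(decimal_corrected)
              + int(aggregate_removed))
    return LABELS.get(points, "unreliable")  # 4 or more points


label = qc_confidence(105.0, decimal_corrected=True, aggregate_removed=False)
# 2 points for the sum plus 1 for the decimal correction gives 'very_low'.
```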

Derivation Method Labels

In addition to the confidence flag, each of the four output fractions carries a label describing how it was derived:

Method	Meaning
direct	The fraction was measured directly in the original data
arithmetic	The fraction was calculated from other measurements using mass-balance arithmetic (no assumptions about composition)
background	At least one background ratio or overall proportion was required to estimate this fraction

Data Availability

A tab-delimited file containing the particle fractions is available for download at the following location:

pilot_vannmiljo_particles.tsv.gz

Column Definitions

The final exported dataset contains one row per sample × sediment layer. The columns are described below.

Column	Type	Description
sample_id	character	Unique sample identifier, linking to the broader dataset
sediment_no	integer	Sediment layer number within the core (1 = topmost measured layer)
clay_pct	numeric	Clay fraction (< 2 µm), as a percentage of the total sediment
silt_pct	numeric	Silt fraction (2–63 µm), as a percentage
sand_pct	numeric	Sand fraction (63–2000 µm), as a percentage
gravel_pct	numeric	Gravel fraction (> 2000 µm), as a percentage
total_pct	numeric	Sum of the four fractions; equals 100 % for all fully processed samples
clay_method	character	Derivation method for clay: 'direct', 'arithmetic', or 'background'
silt_method	character	Derivation method for silt
sand_method	character	Derivation method for sand
gravel_method	character	Derivation method for gravel
any_op_adjusted	logical	TRUE if any input measurement had a non-exact operator (i.e. '<', '>', or 'ND')
qc_confidence	factor	Data quality confidence level: 'high', 'medium', 'low', 'very_low', or 'unreliable'

Note

Fractions estimated using background ratios (method = "background") are model-derived and carry more uncertainty than directly measured or arithmetically derived values. The qc_confidence column summarises the overall reliability of each row and should be the primary filter applied before analysis.
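
For readers working outside R, a minimal way to load the file and apply the confidence filter is sketched below. The function name and the default filter are illustrative, UTF-8 encoding is an assumption, and all values are returned as strings, so numeric columns need converting before use.

```python
import csv
import gzip


def load_particles(path, keep=("high", "medium")):
    """Read the gzipped TSV and keep rows with the given qc_confidence."""
    rows = []
    # Assumption: the file is UTF-8 encoded with a single header row.
    with gzip.open(path, "rt", encoding="utf-8", newline="") as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            if row["qc_confidence"] in keep:
                rows.append(row)
    return rows


# Example call, assuming the file has been downloaded locally:
# rows = load_particles("pilot_vannmiljo_particles.tsv.gz")
# clay = [float(r["clay_pct"]) for r in rows]
```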