7 Biomass Mapping

In order to quantify the biomass growth rates and the total emissions lost due to degradation, we have developed a multi-temporal modelling approach to estimate the spatial variability and distribution of aboveground biomass.

To achieve this we have adopted a machine learning workflow that uses the following data sources: aerial LiDAR scan (ALS) data from 2016 describing canopy height, USGS Landsat 8 Level 2, Collection 2, Tier 1, Sentinel-1 SAR GRD: C-band Synthetic Aperture Radar Ground Range Detected, log scaling, and forest plot data from the Kuamut Project area.

These data are used as inputs to a Machine Learning Pipeline, the structure of which is depicted in Figure 7.1 below:

Figure 7.1: Machine Learning Pipeline Flow Chart

7.1 Input data Aquisition and Processing.

7.1.1 Landsat 8

Annual composites of Landsat 8 (L8) imagery were created for the Kuamut project area using Google earth engine. All images, in a given year, with a cloud cover faction < 50% were retained, cloud masking of images was carried out using the rgeeExtra package (Aybar et al. 2020). L8 values were then rescaled to provide the correct floating point value for each band. The following spectral indices (presented in Table 7.1) were then calculated using {rgeeExtra} package (Aybar et al. 2020)- see Awesome Spectral Indices (David Montero 2022) for further information:

Table 7.1: Table of Spectral Indices

Median composite L8 images were then generated for years 2015:2022. The following bands were removed from the resulting rasters as they were not considered to be informative for the purposes of biomass estimation and/or contained significant data gaps: Aerosol Quality Assessment, Emissivity standard deviation, Pixel Quality, Radiometric saturation, Distance to cloud, Surface temperature uncertainty.

7.1.2 Sentinel 1 (S1)

All S1 images, for a given year, that intersected the area of interest were collected. Each image was preprocessed using the following steps according to (Mullissa et al. 2021):

Erroneous values can occur on the edges of S1 imagery, these effects were masked for every image used in each annual collection.
A monotemporal Gamma despeckle filter, with a kernel size of 9 was used to reduce speckle noise in each of the images.
slope correction was then applied using a volumetric method with slope derived from SRTM 30 m data.
The Ratio between VV and VH polorizaion bands was calculated by dividing the VV band by the VH band and was subsequently converted to dB.

Each image therefore included 3 bands: VV, VH and VV:VH ratio. The median composite of all images from each year was then derived. All S1 preprocessing was carried out using Google Earth Engine and the {rgee} package (Aybar et al. 2020).

7.1.3 Terrain Data

The Copernics GLO 30 m global digital evelation model (DEM) was used to derive terrain data for the area of interest.

Further terrain metrics were derived from this DEM which included: slope, aspect, minimal curvature, mean curvature. Slope and aspect were calculated using the {terra} R package (Hijmans 2023). minimal curvature and mean curvature were calculated using the {Whitebox} R package. curvature metrics were calculated on a resampled DEM with a resoution of 150 m.

7.1.4 Temporal smoothing

Temporal smoothing/harmonisation was essential in order to enable the application of the same model across different time steps, ensuring consistent prediction and error estimates through time.

7.1.4.1 Monitoring Report 1 (2016:2021):

All the bands from the rasters discussed above were combined into a data cube, comprising a total of 41 bands. In order to harmonize the data between annual composites, it was necessary to apply temporal smoothing across each band of the data cube for all years (2015:2022). Although this Monitoring report focuses on the years 2016:2021, including years 2015 and 2022 improves the temporal smoothing estimation for those non-endpoint years and enables growth rate estimates for 2021.

This was achieved by fitting a loess regression for every cell of the data cube across time. This was applied to every band in the data cube using a span value of 2.0.

7.1.4.2 Monitoring Report 2 (2022:2023)

For this monitoring period, a full year of imagery was not yet available (2024). Therefore in order to mitigate the risk of over-fitting during the temporal smoothing process, we opted to use a linear interpolation method to smooth the data cube. The linear fit was applied for all years from 2015-2023 but only the year 2023 was carried forward for further analysis. For the MR1 period we had limited concern of over-fitting because forest plot data were collected in 2022, ensuring the constraint of predicted values for 2022. Adopting linear smoothing for 2023 conservatively allows for extrapolation in the absence of additional field data.

7.2 Canopy Height Model prediction.

A tree Canopy Height Model (CHM) was generated from LiDAR data, collected in 2016 as part of Asner et al. (2018). Using the gdalwarp utility (via the {ezwarp} package (Graham 2024)), we resampled the 2 m canopy height model to align with our data cube at a resolution of 30 m, using bilinear interpolation.

The coverage of the CHM is not complete, approximating 59260/84800 ha (~70%) of the Kuamut project area. All regions that intersected the CHM were used to create a data frame comprising the measured canopy height values from the CHM and all of the environmental/ earth observation data contained within the data cube.

7.3 Canopy Height Model Machine learning.

The {mlr3} (Lang et al. 2019) package and its ecosystem was used to structure and execute all parts of the machine learning pipeline. Spatial resampling was carried out using the {mlr3spatiotempcv} package (Schratz and Becker 2022).

Environmental variables were filtered based on the importance value, using an untrained LightGBM (LGBM) model. Then, sequential forward feature selection was used to determine the combination of environmental metrics that could produce a model with the lowest RMSE value. This was done using the default LGBM. The following features were therefore retained and used to tune and train the canopy height model: aspect, l8_CVI, l8_EVI, l8_EVI2, l8_MNLI, l8_NDVI, l8_RDVI, l8_SR_B3, l8_SR_B5, l8_SR_B6, l8_SR_B7, l8_ST_DRAD, l8_ST_URAD, l8_kVARI, s1_Ratio_RTC, s1_VH_RTC, slope.
A LGBM model was tuned, using the retained features. Spatial coordinate-based k-means clustering was used to generate resampling sets, reducing the effect of spatial autocorrelation in the selection of hypertuned parameters; 6 folds were used in each resampling iteration. Hyperband tuning was used to derive the optimal tuning parameters for the model. The following parameters were explored in the hyperparamter search space: learning_rate, max_bin, num_iterations, num_leaves. tuning parameter selection was based on a multi metric criterion, considering, the iteration RMSE and R^2.
Spatial resampling was undertaken to evaluate the performance of the ensemble model. Again, a 6-fold k-means coordinate clustering approach was used to split training and test sets. The results of this cross-validation are presented in Figure 7.2.

Figure 7.2: Predicted vs observed canopy height model data - trained using LiDAR data from 2016. These are the results of a spatially-clustered cross validation with six folds.

The model was then used to predict canopy height for years 2015:2022. The resulting predictions stored as spatial raster data.

7.4 Above Ground Biomass Machine Learning model.

Canopy height model predictions for all years were appended to their respective year’s data cube as an additional band to form the input data cube for biomass model development.

Forest plots were undertaken in 2016 and 2020 in the Kuamut project area . Data cubes for the corresponding years were used to extract the environmental features that interacted each plot area for each year it was surveyed using the {exactextractr} R package, which uses a weighted calculation for partially intersected cells, providing better accuracy than an all-touches or all-contained method. The total Above Ground Carbon (AGC) was converted to Aboveground Carbon density (ACD) by dividing the AGC by the planimetric area of the plot. The derived dataframe was used to train a model to predict ACD as follows:

Manual inspection of data correlation and model performance was carried out, highlighting the following variables as being of particular importance for the prediction of ACD: Modelled canopy Height (CHM), landsat bands B1 and B7, the following spectral vegetation indices: AVI, CVI, EVIGEMI, MNLI, sentinel1 VV and VH bands, and topographic mean curvature.
The best performing model was selected using the {MuMin} R package. Here, a master model is provided containing the maximal model with all components. This included all control variables as additive effects and also the interaction between the CHM and all landsat 8 derived bands and spectral indices. All variable combinations were tested that included between 6-10 control features, two of which were required to be modelled canopy height and mean curvature (because of their importance). mean curvature identifies upland areas and ridgelines, which in the Kuamut project area are typically less effected by historical logging activity, due to their inaccessibility.
An ensemble model was then constructed from: a XGboost model from the {XGboost} R package, a support vector machine SVM, and a General Linear Model (GLM) with a gaussian family and identity link function. The predictions of these models were then combined using a ‘super learner’, in this case a simple linear regression, in line with the methods described by Polley and Van der Laan (2010). The ensemble model was constructed using the {mlr3pipelines} package Binder et al. (2021) to create a single learner which enables simultaneous tuning of all individual models.
spatial resampling was once again carried out to evaluate model performance.

Figure 7.3: Predicted vs observed Above Ground Carbon Density (ACD) model results; comparing modelled ACD with plot-measured ACD. left hand panel shows predictions on full model. Right hand panel shows the results from spatially-clustered cross-validation.

Finally, the model was trained on the full forest plot inventory and predicted across the entire Kuamut area. Model predictions were cleaned by setting all predictions less than zero to zero and all predictions greater than 300 to NA.

Figure 7.4: Animated map depciting the change in Above Ground Carbon Density between 2016 and 2021.

The previous steps including feature selection, ensemble model creation, resampling, model training and prediction were repeated for Merchantable Volume with cutting limits of 10 cm, 31 cm, and 40 cm.

Figure 7.5: Merchantable Volume maps for 2016 including the 31 cm and 41 cm cutting limit models.

Figure 7.6: Predicted vs observed Merchantable Volume (MV) model results; comparing modelled MV with plot-measured ACD. This example shows the MV results with a 31cm cutting limit. Similar models were generated for 10 and 40cm limits. left hand panel shows predictions on full model. Right hand panel shows the results from spatially-clustered cross-validation.

7.5 Model Uncertainty

Prediction interval bands were generated for all volume and biomass raster maps. The first band in all maps is the prediction derived from the model, The second is the lower 2.5% Prediction Interval and the third is the 97.5% prediction interval.

Prediction intervals were generated as follows:

Bootstrap resampling of the tuned model was carried out with a total of 1000 iterations.
The bootstrapped predictions of the xgboost, svm models and the ensemble model, for all iterations,were aggregated. A quantile regression was fitted using xgboost and svm resamples as predictors and the ensemble model as response variable at both the 2.5 and 97.5 percentiles. The observed vs predicted plot for the quantile regression is presented in Figure 7.7.

Figure 7.7: Merchantable volume prediction intervals and bootstrapped resamples; visualised as predicted vs. observed Timber Volume.

The Quantile regression models for the upper and lower prediction intervals were then used to predict the uncertainty spatially across the raster map. This required the generation of map bands for glm, xgboost and svm models before applying the two quantile regression models. An example of this spatial prediction for the three bands is shown below in Figure 7.8

Figure 7.8: Above Ground Carbon Density model for 2023, showing the upper and lower prediction limits.

These uncertainty bands are subsequently used in Chapter 18 to estimate the uncertainty of total timber volume estimates.

Asner, Gregory P, Philip G Brodrick, Christopher Philipson, Nicolas R Vaughn, Roberta E Martin, David E Knapp, Joseph Heckler, et al. 2018. “Mapped Aboveground Carbon Stocks to Advance Forest Conservation and Recovery in Malaysian Borneo.” Biological Conservation 217: 289–310.

Aybar, Cesar, Quisheng Wu, Lesly Bautista, Roy Yali, and Antony Barja. 2020. “Rgee: An r Package for Interacting with Google Earth Engine.” Journal of Open Source Software.

Binder, Martin, Florian Pfisterer, Michel Lang, Lennart Schneider, Lars Kotthoff, and Bernd Bischl. 2021. “mlr3pipelines - Flexible Machine Learning Pipelines in r.” Journal of Machine Learning Research 22 (184): 1–7. https://jmlr.org/papers/v22/21-0281.html.

David Montero, Miguel D. Mahecha, Cesar Aybar. 2022. “Spectral: Awesome Spectral Indices Deployed via the Google Earth Engine JavaScript API.” The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences XLVIII-4/W1-2022: 301–6. https://doi.org/10.5194/isprs-archives-XLVIII-4-W1-2022-301-2022.

Graham, Hugh. 2024. Ezwarp: A Utility Tool for Working with Gdal Warp in r. https://github.com/Permian-Global-Research/ezwarp.

Hijmans, Robert J. 2023. Terra: Spatial Data Analysis. https://rspatial.org/.

Lang, Michel, Martin Binder, Jakob Richter, Patrick Schratz, Florian Pfisterer, Stefan Coors, Quay Au, Giuseppe Casalicchio, Lars Kotthoff, and Bernd Bischl. 2019. “mlr3: A Modern Object-Oriented Machine Learning Framework in R.” Journal of Open Source Software, December. https://doi.org/10.21105/joss.01903.

Mullissa, Adugna, Andreas Vollrath, Christelle Odongo-Braun, Bart Slagter, Johannes Balling, Yaqing Gou, Noel Gorelick, and Johannes Reiche. 2021. “Sentinel-1 SAR Backscatter Analysis Ready Data Preparation in Google Earth Engine.” Remote Sensing 13 (10). https://doi.org/10.3390/rs13101954.

Polley, Eric C, and Mark J Van der Laan. 2010. “Super Learner in Prediction.”

Schratz, Patrick, and Marc Becker. 2022. Mlr3spatiotempcv: Spatiotemporal Resampling Methods for ’Mlr3’. https://CRAN.R-project.org/package=mlr3spatiotempcv.