Working with larger-than-memory FLUXNET data
Source:vignettes/fluxnet-duckdb.Rmd
fluxnet-duckdb.RmdWhen working with a large number of site, especially with daily or
hourly data, you may encounter out-of-memory errors when trying to read
in data with flux_read(). As an alternative,
fluxnet has a set of experimental functions that ingest CSV
files into a DuckDB database that can
be queried using dplyr functions without pulling data into
memory.
Building the database
Assuming you’ve already downloaded files with
flux_download() and extracted the CSVs with
flux_extract() (see
vignette("fluxnet", "fluxnet")), you can build the database
using flux_db_build().
flux_download(site_ids = c("AR-TF2", "NO-Hur", "AU-Emr"))
flux_extract()
manifest <- flux_discover_files()
flux_db_build(manifest)#> Downloading data from specified sites.
#> DD / ERA5 → 3 sites, 134 site-years across 3 files
#> DD / FLUXMET → 3 sites, 8 site-years across 3 files
#> HH / ERA5 → 3 sites, 134 site-years across 3 files
#> HH / FLUXMET → 3 sites, 8 site-years across 3 files
#> MM / ERA5 → 3 sites, 134 site-years across 3 files
#> MM / FLUXMET → 3 sites, 8 site-years across 3 files
#> WW / ERA5 → 3 sites, 134 site-years across 3 files
#> WW / FLUXMET → 3 sites, 8 site-years across 3 files
#> YY / ERA5 → 3 sites, 134 site-years across 3 files
#> YY / FLUXMET → 3 sites, 8 site-years across 3 files
#> ℹ Opening DuckDB connection to '
]8;;file:///var/folders/9x/8tfdxtbx3_173cs_8jqsyjph0000gn/T/Rtmp8bYb6v/fluxnet_vignette109c5441a5f55/fluxnet.duckdb/var/folders/9x/8tfdxtbx3_173cs_8jqsyjph0000gn/T/Rtmp8bYb6v/fluxnet_vignette109c5441a5f55/fluxnet.duckdb
]8;;'
#>
#>
ℹ Recording CSV manifest in database
#>
✔ Recording CSV manifest in database [18ms]
#>
#>
#>
ℹ Reading in annual data CSVs
#>
✔ Reading in annual data CSVs [44ms]
#>
#>
#>
ℹ Reading in monthly data CSVs
#>
✔ Reading in monthly data CSVs [67ms]
#>
#>
#>
ℹ Reading in weekly data CSVs
#>
✔ Reading in weekly data CSVs [285ms]
#>
#>
#>
ℹ Reading in daily data CSVs
#>
✔ Reading in daily data CSVs [945ms]
#>
#>
#>
ℹ Reading in hourly/half-hourly data CSVs
#>
✔ Reading in hourly/half-hourly data CSVs [8s]
#>
#>
#>
ℹ Closing connection.
#>
✔ Closing connection. [1.6s]
The DuckDB database will only contain data, not metadata from BIF or
BIFVARINFO CSV files. See ?flux_badm() and
?flux_varinfo() for working with those files.
Using the database
If you’re new to working with database connections in R, it may be
helpful to start with the introduction vignette for dbplyr
(vignette("dbplyr", "dbplyr")).
flux_db_connect() is a helpful wrapper around
DBI::dbConnect() that ensures the correct
duckdb() driver is used.
con <- flux_db_connect()#> ℹ Opening DuckDB connection to '
You can list the tables in the database with
DBI::dbListTables() and “open” a specific table with
tbl()
DBI::dbListTables(con)
#> [1] "annual" "daily" "hourly" "manifest" "monthly" "weekly"
annual <- tbl(con, "annual")
annual
#> # Source: table<annual> [?? x 335]
#> # Database: DuckDB 1.5.2 [root@Darwin 25.5.0:R 4.6.0//private/var/folders/9x/8tfdxtbx3_173cs_8jqsyjph0000gn/T/Rtmp8bYb6v/fluxnet_vignette109c5441a5f55/fluxnet.duckdb]
#> TIMESTAMP TA_ERA TA_ERA_NIGHT TA_ERA_NIGHT_SD TA_ERA_DAY TA_ERA_DAY_SD SW_IN_ERA LW_IN_ERA VPD_ERA PA_ERA P_ERA WS_ERA TA_F_MDS
#> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1981 4.90 4.20 1.01 5.43 1.24 104. 278. 2.33 98.7 4035. 2.89 NA
#> 2 1982 4.71 3.99 1.12 5.25 1.47 103. 277. 2.45 98.7 3769. 2.85 NA
#> 3 1983 5.07 4.28 1.06 5.68 1.24 99.9 282. 2.38 98.7 3778. 2.82 NA
#> 4 1984 4.61 3.86 1.06 5.17 1.31 101. 277. 2.40 98.8 3755. 2.73 NA
#> 5 1985 4.33 3.50 1.04 4.97 1.32 100. 279. 2.26 98.7 3925. 2.79 NA
#> 6 1986 3.52 2.62 0.998 4.20 1.31 106. 273. 2.16 98.5 3897. 3.02 NA
#> 7 1987 4.20 3.34 1.05 4.84 1.38 104. 278. 2.32 98.7 3785. 2.86 NA
#> 8 1988 4.24 3.42 1.03 4.84 1.34 104. 279. 2.28 98.8 3761. 2.90 NA
#> 9 1989 4.44 3.59 1.07 5.11 1.38 102. 278. 2.39 98.7 3741. 2.94 NA
#> 10 1990 4.26 3.46 1.02 4.87 1.28 105. 277. 2.34 98.6 3990. 2.93 NA
#> # ℹ more rows
#> # ℹ 322 more variables: TA_F_MDS_QC <dbl>, TA_F_MDS_NIGHT <dbl>, TA_F_MDS_NIGHT_SD <dbl>, TA_F_MDS_NIGHT_QC <dbl>, TA_F_MDS_DAY <dbl>,
#> # TA_F_MDS_DAY_SD <dbl>, TA_F_MDS_DAY_QC <dbl>, TA_F <dbl>, TA_F_QC <dbl>, TA_F_NIGHT <dbl>, TA_F_NIGHT_SD <dbl>, TA_F_NIGHT_QC <dbl>,
#> # TA_F_DAY <dbl>, TA_F_DAY_SD <dbl>, TA_F_DAY_QC <dbl>, SW_IN_F_MDS <dbl>, SW_IN_F_MDS_QC <dbl>, SW_IN_F <dbl>, SW_IN_F_QC <dbl>,
#> # LW_IN_F_MDS <dbl>, LW_IN_F_MDS_QC <dbl>, LW_IN_F <dbl>, LW_IN_F_QC <dbl>, LW_IN_JSB <dbl>, LW_IN_JSB_QC <dbl>, LW_IN_JSB_ERA <dbl>,
#> # LW_IN_JSB_F <dbl>, LW_IN_JSB_F_QC <dbl>, VPD_F_MDS <dbl>, VPD_F_MDS_QC <dbl>, VPD_F <dbl>, VPD_F_QC <dbl>, PA_F <dbl>, PA_F_QC <dbl>,
#> # P_F <dbl>, P_F_QC <dbl>, WS_F <dbl>, WS_F_QC <dbl>, USTAR <dbl>, USTAR_QC <dbl>, PPFD_IN <dbl>, PPFD_IN_QC <dbl>, CO2_F_MDS <dbl>, …Unlike the results of flux_read(), this object is a
“lazy” tibble, meaning it will only be pulled into memory when you want
it to be by using collect()
class(annual)
#> [1] "tbl_duckdb_connection" "tbl_dbi" "tbl_sql" "tbl_lazy" "tbl"
class(collect(annual))
#> [1] "tbl_df" "tbl" "data.frame"You can apply many tidyverse functions to a
“lazy” tibble and only upon running collect() will they be
translated into a SQL database query and pull the results into
memory.
annual %>%
mutate(p_gapfilled = 1 - NEE_VUT_REF_QC) %>%
filter(p_gapfilled < 0.5) %>%
select(site_id, TIMESTAMP, starts_with("GPP_"), starts_with("NEE_")) %>%
collect()
#> # A tibble: 4 × 192
#> site_id TIMESTAMP GPP_NT_VUT_REF GPP_NT_VUT_USTAR50 GPP_NT_VUT_MEAN GPP_NT_VUT_SE GPP_NT_VUT_05 GPP_NT_VUT_16 GPP_NT_VUT_25
#> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 AR-TF2 2017 301. 299. 296. 1.11 287. 288. 289.
#> 2 AU-Emr 2012 537. 540. 536. 1.36 519. 525. 528.
#> 3 AU-Emr 2013 399. 396. 396. 0.885 383. 389. 393.
#> 4 NO-Hur 2024 1120. 1086. 1105. 11.2 958. 1021. 1059.
#> # ℹ 183 more variables: GPP_NT_VUT_50 <dbl>, GPP_NT_VUT_75 <dbl>, GPP_NT_VUT_84 <dbl>, GPP_NT_VUT_95 <dbl>, GPP_NT_CUT_REF <dbl>,
#> # GPP_NT_CUT_USTAR50 <dbl>, GPP_NT_CUT_MEAN <dbl>, GPP_NT_CUT_SE <dbl>, GPP_NT_CUT_05 <dbl>, GPP_NT_CUT_16 <dbl>, GPP_NT_CUT_25 <dbl>,
#> # GPP_NT_CUT_50 <dbl>, GPP_NT_CUT_75 <dbl>, GPP_NT_CUT_84 <dbl>, GPP_NT_CUT_95 <dbl>, GPP_DT_VUT_REF <dbl>, GPP_DT_VUT_USTAR50 <dbl>,
#> # GPP_DT_VUT_MEAN <dbl>, GPP_DT_VUT_SE <dbl>, GPP_DT_VUT_05 <dbl>, GPP_DT_VUT_16 <dbl>, GPP_DT_VUT_25 <dbl>, GPP_DT_VUT_50 <dbl>,
#> # GPP_DT_VUT_75 <dbl>, GPP_DT_VUT_84 <dbl>, GPP_DT_VUT_95 <dbl>, GPP_DT_CUT_REF <dbl>, GPP_DT_CUT_USTAR50 <dbl>, GPP_DT_CUT_MEAN <dbl>,
#> # GPP_DT_CUT_SE <dbl>, GPP_DT_CUT_05 <dbl>, GPP_DT_CUT_16 <dbl>, GPP_DT_CUT_25 <dbl>, GPP_DT_CUT_50 <dbl>, GPP_DT_CUT_75 <dbl>,
#> # GPP_DT_CUT_84 <dbl>, GPP_DT_CUT_95 <dbl>, NEE_CUT_REF <dbl>, NEE_VUT_REF <dbl>, NEE_CUT_REF_QC <dbl>, NEE_VUT_REF_QC <dbl>, …The “lazy” tibble annual above does not work
with flux_qc() (yet).
When you are done, you can close the connection.
DBI::dbDisconnect(con)Updating the Database
When you download new sites or updated releases of existing sites,
you can update the database two ways. The first is to simply delete the
fluxnet.duckdb file created by flux_db_build()
and then run flux_db_build() again. However, this may take
some time, especially with a lot of hourly data, and you may have
deleted some of the CSVs already ingested into the database. As an
alternative, you can run flux_db_update().
flux_download(site_ids = "BR-Ji3")
flux_extract()
manifest <- flux_discover_files()
flux_db_update(manifest)#> Downloading data from specified sites.
#> DD / ERA5 → 4 sites, 179 site-years across 4 files
#> DD / FLUXMET → 4 sites, 15 site-years across 4 files
#> HH / ERA5 → 4 sites, 179 site-years across 4 files
#> HH / FLUXMET → 4 sites, 15 site-years across 4 files
#> MM / ERA5 → 4 sites, 179 site-years across 4 files
#> MM / FLUXMET → 4 sites, 15 site-years across 4 files
#> WW / ERA5 → 4 sites, 179 site-years across 4 files
#> WW / FLUXMET → 4 sites, 15 site-years across 4 files
#> YY / ERA5 → 4 sites, 179 site-years across 4 files
#> YY / FLUXMET → 4 sites, 15 site-years across 4 files
#> ℹ Opening DuckDB connection to '
]8;;file:///var/folders/9x/8tfdxtbx3_173cs_8jqsyjph0000gn/T/Rtmp8bYb6v/fluxnet_vignette109c5441a5f55/fluxnet.duckdb/var/folders/9x/8tfdxtbx3_173cs_8jqsyjph0000gn/T/Rtmp8bYb6v/fluxnet_vignette109c5441a5f55/fluxnet.duckdb
]8;;'
#>
#>
ℹ Reading in annual data CSVs
#>
✔ Reading in annual data CSVs [22ms]
#>
#>
#>
ℹ Updating/Inserting annual data into database
#>
✔ Updating/Inserting annual data into database [33ms]
#>
#>
#>
ℹ Reading in monthly data CSVs
#>
✔ Reading in monthly data CSVs [49ms]
#>
#>
#>
ℹ Updating/Inserting monthly data into database
#>
✔ Updating/Inserting monthly data into database [36ms]
#>
#>
#>
ℹ Reading in weekly data CSVs
#>
✔ Reading in weekly data CSVs [126ms]
#>
#>
#>
ℹ Updating/Inserting weekly data into database
#>
✔ Updating/Inserting weekly data into database [57ms]
#>
#>
#>
ℹ Reading in daily data CSVs
#>
✔ Reading in daily data CSVs [420ms]
#>
#>
#>
ℹ Updating/Inserting daily data into database
#>
✔ Updating/Inserting daily data into database [444ms]
#>
#>
#>
ℹ Reading in hourly/half-hourly data CSVs
#>
✔ Reading in hourly/half-hourly data CSVs [1.1s]
#>
#>
#>
ℹ Updating/Inserting hourly data into database
#>
✔ Updating/Inserting hourly data into database [6.4s]
#>
#>
#>
ℹ Updating CSV manifest in database
#>
✔ Updating CSV manifest in database [65ms]
#>
#>
#>
ℹ Closing connection.
#>
✔ Closing connection. [176ms]
Now we can see the new site is in the data
con <- flux_db_connect()
hourly <- tbl(con, "hourly")
hourly %>% pull(site_id) %>% unique()
DBI::dbDisconnect(con)#> ℹ Opening DuckDB connection to '
#> [1] "AR-TF2" "AU-Emr" "NO-Hur" "BR-Ji3"
Currently there is a known issue with flux_db_update()
where it will fail if new CSVs have columns not already in the database
tables. In that case, you’ll have to delete the database and regenerate
it with flux_db_build()