From tidy data to raster

Tidy data

  • Wickham (2014): Tidy tabular data: Each variable is a column, each observation is a row, and each type of observational unit is a table.
  • Below is a dataset from Baltagi (2008)
  • Observation is a state in a particular year
  • Variable is a measured parameter (see below)
  • pcap: public capital stock
  • hwy: highway and streets
  • water: water and sewer facilities
  • util: other public buildings and structures
  • pc: private capital stock
  • gsp: gross state product
  • emp: labor input measured by the employment in non-agricultural payrolls
  • unemp: state unemployment rate
state    year  pcap      hwy      water    util     pc        gsp    emp     unemp
ALABAMA  1970  15032.67  7325.80  1655.68  6051.20  35793.80  28418  1010.5  4.7
ALABAMA  1971  15501.94  7525.94  1721.02  6254.98  37299.91  29375  1021.9  5.2
ALABAMA  1972  15972.41  7765.42  1764.75  6442.23  38670.30  31303  1072.3  4.7
ALABAMA  1973  16406.26  7907.66  1742.41  6756.19  40084.01  33430  1135.5  3.9
ALABAMA  1974  16762.67  8025.52  1734.85  7002.29  42057.31  33749  1169.8  5.5
ALABAMA  1975  17316.26  8158.23  1752.27  7405.76  43971.71  33604  1155.4  7.7
  • This is considered a long table and is great for modelling and visualization (see the sketch after this list).
  • It is bad for memory (many values are repeated).
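
For example, long data plugs straight into plotting and modelling interfaces. A minimal sketch (assuming the ggplot2 package is installed; the plot choice is only illustrative):

library(ggplot2)

# One unemployment line per state: long data maps columns directly onto aesthetics
ggplot(Produc, aes(x = year, y = unemp, group = state)) +
    geom_line(alpha = 0.3)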

Long tables contain many repeated values:

data(Produc, package = "plm")  # load the Produc dataset (see Tasks below)
library(dplyr)                 # n_distinct()

length(Produc$state)
## [1] 815
n_distinct(Produc$state)
## [1] 48

length(Produc$year)
## [1] 815
n_distinct(Produc$year)
## [1] 17


# total number of cells = rows × columns
n_values <- dim(Produc) |> 
  prod()

n_values
## [1] 8150

Wide tables have fewer repetitions.

To demonstrate this, we convert the table from long to wide.

library(tidyr)    # pivot_wider()
library(tibble)   # column_to_rownames()

# Pivoting must be done per variable (here: unemp)
Produc_wide <- Produc |> 
    select(state, year, unemp) |> 
    pivot_wider(names_from = state, values_from = unemp) |> 
    column_to_rownames("year")

We can either omit the column “year” entirely, since it is implicit in the row position (\(\text{year}_i = 1970 + (i - 1)\)), or, as above, use it as a rowname.
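
As a small sketch (using Produc_wide from the chunk above), the year can be recovered from the row position:

# Row 1 corresponds to 1970, row 17 to 1986
years <- 1970 + seq_len(nrow(Produc_wide)) - 1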

Long vs. Wide

Long / tidy:

state    year  pcap      hwy      water    util     pc        gsp    emp     unemp
ALABAMA  1970  15032.67  7325.80  1655.68  6051.20  35793.80  28418  1010.5  4.7
ALABAMA  1971  15501.94  7525.94  1721.02  6254.98  37299.91  29375  1021.9  5.2
ALABAMA  1972  15972.41  7765.42  1764.75  6442.23  38670.30  31303  1072.3  4.7
ALABAMA  1973  16406.26  7907.66  1742.41  6756.19  40084.01  33430  1135.5  3.9
ALABAMA  1974  16762.67  8025.52  1734.85  7002.29  42057.31  33749  1169.8  5.5

Wide / untidy:

ALABAMA ARIZONA ARKANSAS CALIFORNIA COLORADO CONNECTICUT DELAWARE FLORIDA GEORGIA IDAHO ILLINOIS INDIANA IOWA KANSAS KENTUCKY LOUISIANA MAINE MARYLAND MASSACHUSETTS MICHIGAN MINNESOTA MISSISSIPPI MISSOURI MONTANA NEBRASKA NEVADA NEW_HAMPSHIRE NEW_JERSEY NEW_MEXICO NEW_YORK NORTH_CAROLINA NORTH_DAKOTA OHIO OKLAHOMA OREGON PENNSYLVANIA RHODE_ISLAND SOUTH_CAROLINA SOUTH_DAKOTA TENNESSE TEXAS UTAH VERMONT VIRGINIA WASHINGTON WEST_VIRGINIA WISCONSIN WYOMING
1970 4.7 4.4 5.0 7.2 4.4 5.6 4.8 4.4 4.1 5.8 4.1 5.0 3.7 4.8 5.0 6.6 5.7 3.3 4.6 6.7 4.2 4.8 3.3 5.5 3.1 5.9 3.3 4.6 5.9 4.5 4.3 4.6 5.4 4.4 6.2 4.5 5.2 5.0 3.3 4.8 4.4 6.1 4.9 3.4 9.1 6.1 3.9 4.5
1971 5.2 NA 5.4 8.8 4.0 8.9 5.7 4.9 3.9 6.3 5.1 5.7 4.2 5.5 5.5 7.0 7.6 4.2 6.6 7.6 4.4 4.8 4.9 6.3 3.6 7.0 4.7 5.7 6.2 6.6 4.8 5.3 6.5 4.9 6.6 5.4 6.8 5.3 3.7 5.0 4.9 6.4 6.8 3.6 10.0 6.5 4.5 4.5
1972 4.7 4.2 4.6 7.6 3.6 8.2 4.7 4.5 4.1 6.2 5.1 4.5 3.6 4.0 4.8 6.1 7.0 4.7 6.4 7.0 4.3 3.9 4.2 6.2 3.4 7.0 4.5 5.8 5.8 6.7 4.0 4.9 5.5 4.5 5.7 5.4 6.5 4.2 3.7 3.6 4.5 6.1 6.5 3.6 9.5 6.5 4.2 4.0
1973 3.9 4.1 4.1 7.0 3.4 5.7 4.6 4.3 3.9 5.6 4.1 4.2 2.9 3.1 4.4 6.0 5.9 3.5 6.7 5.8 4.4 3.6 3.7 6.3 3.3 6.2 3.9 5.6 5.7 5.4 3.5 5.1 4.3 4.2 5.3 4.8 6.2 3.7 3.3 3.0 3.9 5.7 5.6 3.6 7.7 5.7 4.1 3.5
1974 5.5 5.6 4.8 7.7 3.8 6.2 6.0 6.2 5.0 6.0 4.5 5.9 3.0 3.5 4.5 6.7 6.7 3.7 7.2 8.7 4.3 4.1 4.5 6.7 3.8 7.5 3.6 6.9 6.3 6.3 4.5 3.0 5.0 4.4 7.5 5.1 7.3 4.5 3.5 3.9 4.3 5.9 6.9 4.0 7.2 5.9 4.6 3.6

Long vs. Wide: Number of cells

How many cells / values do we have after this transformation?

n_values_new <- Produc_wide |> 
    dim() |> 
    prod()

# since there are 8 measured variables, we would need 8 such wide tables:
n_values_new <- n_values_new * 8

n_values_new
## [1] 6528
# before we had:
n_values
## [1] 8150

→ This is a reduction of about 20% (from 8150 to 6528 cells).
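
The reduction can be computed directly from the two counts:

1 - n_values_new / n_values  # ≈ 0.20, i.e. a reduction of roughly 20 %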

Dataframe → Matrix

Fewer repetitions and a smaller memory footprint are only part of the advantage:

  • All columns now have the same datatype (dbl)
    • This means they can be stored in a matrix / array
    • This gives us a big speed advantage (e.g. calculating the mean over all values is roughly 8x faster; see the sketch below)
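
A minimal sketch of the conversion plus one way to time the comparison (assuming the bench package is installed; the exact speed-up will differ from the 8x figure and depends on data size and machine):

# Convert the wide dataframe (years as rownames, states as columns) into a matrix
Produc_matrix <- as.matrix(Produc_wide)

bench::mark(
    dataframe = Produc |> summarise(mean(unemp)),  # dplyr route on the long table
    matrix    = mean(Produc_matrix, na.rm = TRUE), # matrix route (one explicit NA)
    check     = FALSE                              # return types differ, compare timings only
)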

Dataframe → Matrix

  • Missing values are now explicit (a sketch for locating this NA follows below):

which(is.na(Produc_matrix))
## [1] 19

  • Before, missing values were implicit (note that ARIZONA jumps from 1970 to 1972 below):
# A tibble: 6 × 10
  state    year   pcap   hwy water  util     pc   gsp   emp unemp
  <chr>   <dbl>  <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>
1 ALABAMA  1984 19257. 8656. 2235. 8366. 59447. 45118 1388.  11  
2 ALABAMA  1985 19433. 8726. 2253. 8454. 60688. 46849 1427.   8.9
3 ALABAMA  1986 19723. 8813. 2309. 8601. 61629. 48409 1463.   9.8
4 ARIZONA  1970 10148. 4557. 1628. 3964. 23586. 19288  547.   4.4
5 ARIZONA  1972 10978. 4848. 1615. 4515. 26059. 23289  646.   4.2
6 ARIZONA  1973 11598. 4963. 1648. 4987. 27305. 25244  714.   4.1
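
Returning to the matrix, a small base-R sketch (arrayInd()) locates the NA reported above:

# Translate the linear index into a (row, column) position in the 17 × 48 matrix
arrayInd(which(is.na(Produc_matrix)), dim(Produc_matrix))
# → row 2, column 2: the second year (1971) of the second state column (ARIZONA)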

Dataframe → Matrix

To detect missing values, cases must be made complete first:

Produc |> 
    complete(state, year) |>                    # ← make cases complete 
    filter(is.na(pcap)) # ← filter by NA
# A tibble: 1 × 10
  state    year  pcap   hwy water  util    pc   gsp   emp unemp
  <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ARIZONA  1971    NA    NA    NA    NA    NA    NA    NA    NA

Limitations

  • Matrices are only advantageous if they are densely populated (few NAs)
  • Speed and memory footprint only matter if the data is large (a measuring sketch follows below)
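
A sketch for checking the memory side with base R’s object.size(); the outcome depends on how the state column is stored (factor vs. character):

# Memory footprint: long table restricted to one variable vs. the wide matrix
object.size(Produc[, c("state", "year", "unemp")])
object.size(Produc_matrix)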

Tasks / Exercises

  1. Import the Produc dataset (from the package plm) using the following code:

data(Produc, package = "plm")

  2. Convert the data from long to wide.

  3. Convert the data to a matrix.

  4. Calculate mean unemployment first using the original dataframe, then the new matrix. Compare the speed difference.