state | year | pcap | hwy | water | util | pc | gsp | emp | unemp |
---|---|---|---|---|---|---|---|---|---|
ALABAMA | 1970 | 15032.67 | 7325.80 | 1655.68 | 6051.20 | 35793.80 | 28418 | 1010.5 | 4.7 |
ALABAMA | 1971 | 15501.94 | 7525.94 | 1721.02 | 6254.98 | 37299.91 | 29375 | 1021.9 | 5.2 |
ALABAMA | 1972 | 15972.41 | 7765.42 | 1764.75 | 6442.23 | 38670.30 | 31303 | 1072.3 | 4.7 |
ALABAMA | 1973 | 16406.26 | 7907.66 | 1742.41 | 6756.19 | 40084.01 | 33430 | 1135.5 | 3.9 |
ALABAMA | 1974 | 16762.67 | 8025.52 | 1734.85 | 7002.29 | 42057.31 | 33749 | 1169.8 | 5.5 |
ALABAMA | 1975 | 17316.26 | 8158.23 | 1752.27 | 7405.76 | 43971.71 | 33604 | 1155.4 | 7.7 |
From tidy data to raster
Tidy data
- Wickham (2014): Tidy tabular data: Each variable is a column, each observation is a row, and each type of observational unit is a table.
- Below is a dataset from Baltagi and Baltagi (2008)
- Observation is a state in a perticular year
- Variable is a measured parameter (see below)
Parameter
- pcap: public capital stock
- hwy: highway and streets
- water: water and sewer facilities
- util: other public buildings and structures
- pc: private capital stock
- gsp: gross state product
- emp: labor input measured by the employment in non–agricultural payrolls
- unemp: state unemployment rate
Note
- This is considered a long table and is great for modelling and visualization.
- Its bad for memory (a lot of repetitions)
Long tables have a lot of repetitions:
length(Produc$state)
## [1] 815
n_distinct(Produc$state)
## [1] 48
length(Produc$year)
## [1] 815
n_distinct(Produc$year)
## [1] 17
<- dim(Produc) |>
n_values prod()
n_values## [1] 8150
Wide tables have less repetitions.
To demonstrate we convert a long to wide.
# Pivoting must be done per variable
<- Produc |>
Produc_wide select(state, year, unemp) |>
pivot_wider(names_from = state, values_from = unemp) |>
column_to_rownames("year")
Note
We can either omit the column “year”, (since this is implicit knowledge, \(row_i + 1970\)), or use it as a rowname
.
Long vs Wide
Long / tidy:
state | year | pcap | hwy | water | util | pc | gsp | emp | unemp |
---|---|---|---|---|---|---|---|---|---|
ALABAMA | 1970 | 15032.67 | 7325.80 | 1655.68 | 6051.20 | 35793.80 | 28418 | 1010.5 | 4.7 |
ALABAMA | 1971 | 15501.94 | 7525.94 | 1721.02 | 6254.98 | 37299.91 | 29375 | 1021.9 | 5.2 |
ALABAMA | 1972 | 15972.41 | 7765.42 | 1764.75 | 6442.23 | 38670.30 | 31303 | 1072.3 | 4.7 |
ALABAMA | 1973 | 16406.26 | 7907.66 | 1742.41 | 6756.19 | 40084.01 | 33430 | 1135.5 | 3.9 |
ALABAMA | 1974 | 16762.67 | 8025.52 | 1734.85 | 7002.29 | 42057.31 | 33749 | 1169.8 | 5.5 |
Wide / untidy:
ALABAMA | ARIZONA | ARKANSAS | CALIFORNIA | COLORADO | CONNECTICUT | DELAWARE | FLORIDA | GEORGIA | IDAHO | ILLINOIS | INDIANA | IOWA | KANSAS | KENTUCKY | LOUISIANA | MAINE | MARYLAND | MASSACHUSETTS | MICHIGAN | MINNESOTA | MISSISSIPPI | MISSOURI | MONTANA | NEBRASKA | NEVADA | NEW_HAMPSHIRE | NEW_JERSEY | NEW_MEXICO | NEW_YORK | NORTH_CAROLINA | NORTH_DAKOTA | OHIO | OKLAHOMA | OREGON | PENNSYLVANIA | RHODE_ISLAND | SOUTH_CAROLINA | SOUTH_DAKOTA | TENNESSE | TEXAS | UTAH | VERMONT | VIRGINIA | WASHINGTON | WEST_VIRGINIA | WISCONSIN | WYOMING | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1970 | 4.7 | 4.4 | 5.0 | 7.2 | 4.4 | 5.6 | 4.8 | 4.4 | 4.1 | 5.8 | 4.1 | 5.0 | 3.7 | 4.8 | 5.0 | 6.6 | 5.7 | 3.3 | 4.6 | 6.7 | 4.2 | 4.8 | 3.3 | 5.5 | 3.1 | 5.9 | 3.3 | 4.6 | 5.9 | 4.5 | 4.3 | 4.6 | 5.4 | 4.4 | 6.2 | 4.5 | 5.2 | 5.0 | 3.3 | 4.8 | 4.4 | 6.1 | 4.9 | 3.4 | 9.1 | 6.1 | 3.9 | 4.5 |
1971 | 5.2 | NA | 5.4 | 8.8 | 4.0 | 8.9 | 5.7 | 4.9 | 3.9 | 6.3 | 5.1 | 5.7 | 4.2 | 5.5 | 5.5 | 7.0 | 7.6 | 4.2 | 6.6 | 7.6 | 4.4 | 4.8 | 4.9 | 6.3 | 3.6 | 7.0 | 4.7 | 5.7 | 6.2 | 6.6 | 4.8 | 5.3 | 6.5 | 4.9 | 6.6 | 5.4 | 6.8 | 5.3 | 3.7 | 5.0 | 4.9 | 6.4 | 6.8 | 3.6 | 10.0 | 6.5 | 4.5 | 4.5 |
1972 | 4.7 | 4.2 | 4.6 | 7.6 | 3.6 | 8.2 | 4.7 | 4.5 | 4.1 | 6.2 | 5.1 | 4.5 | 3.6 | 4.0 | 4.8 | 6.1 | 7.0 | 4.7 | 6.4 | 7.0 | 4.3 | 3.9 | 4.2 | 6.2 | 3.4 | 7.0 | 4.5 | 5.8 | 5.8 | 6.7 | 4.0 | 4.9 | 5.5 | 4.5 | 5.7 | 5.4 | 6.5 | 4.2 | 3.7 | 3.6 | 4.5 | 6.1 | 6.5 | 3.6 | 9.5 | 6.5 | 4.2 | 4.0 |
1973 | 3.9 | 4.1 | 4.1 | 7.0 | 3.4 | 5.7 | 4.6 | 4.3 | 3.9 | 5.6 | 4.1 | 4.2 | 2.9 | 3.1 | 4.4 | 6.0 | 5.9 | 3.5 | 6.7 | 5.8 | 4.4 | 3.6 | 3.7 | 6.3 | 3.3 | 6.2 | 3.9 | 5.6 | 5.7 | 5.4 | 3.5 | 5.1 | 4.3 | 4.2 | 5.3 | 4.8 | 6.2 | 3.7 | 3.3 | 3.0 | 3.9 | 5.7 | 5.6 | 3.6 | 7.7 | 5.7 | 4.1 | 3.5 |
1974 | 5.5 | 5.6 | 4.8 | 7.7 | 3.8 | 6.2 | 6.0 | 6.2 | 5.0 | 6.0 | 4.5 | 5.9 | 3.0 | 3.5 | 4.5 | 6.7 | 6.7 | 3.7 | 7.2 | 8.7 | 4.3 | 4.1 | 4.5 | 6.7 | 3.8 | 7.5 | 3.6 | 6.9 | 6.3 | 6.3 | 4.5 | 3.0 | 5.0 | 4.4 | 7.5 | 5.1 | 7.3 | 4.5 | 3.5 | 3.9 | 4.3 | 5.9 | 6.9 | 4.0 | 7.2 | 5.9 | 4.6 | 3.6 |
Long vs. Wide: Number of cells
How many cells / values do we have after this transformation?
<- Produc_wide |>
n_values_new dim() |>
prod()
# since we have 8 variables, we multiply by 8:
<- n_values_new*8
n_values_new
n_values_new
[1] 6528
# before we had:
n_values
[1] 8150
→ This is a reduction of 20%
Dataframe → Matrix
Less repetitions / smaller memory footprint is only part of the advantage:
- All columns now have the same datatype (
dbl
)- This means, they can be stored in a matrix / array
- This gives us a big speed advantage (e.g. calculating the mean over all values is 8x faster)
Dataframe → Matrix
- Missing values are now explicit
which(is.na(Produc_matrix))
[1] 19
- Before, missing values were implicit:
# A tibble: 6 × 10
state year pcap hwy water util pc gsp emp unemp
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ALABAMA 1984 19257. 8656. 2235. 8366. 59447. 45118 1388. 11
2 ALABAMA 1985 19433. 8726. 2253. 8454. 60688. 46849 1427. 8.9
3 ALABAMA 1986 19723. 8813. 2309. 8601. 61629. 48409 1463. 9.8
4 ARIZONA 1970 10148. 4557. 1628. 3964. 23586. 19288 547. 4.4
5 ARIZONA 1972 10978. 4848. 1615. 4515. 26059. 23289 646. 4.2
6 ARIZONA 1973 11598. 4963. 1648. 4987. 27305. 25244 714. 4.1
Dataframe → Matrix
To detect missing values, cases must be made complete first:
|>
Produc complete(state, year) |> # ← make cases complete
filter(is.na(pcap)) # ← filter by NA
# A tibble: 1 × 10
state year pcap hwy water util pc gsp emp unemp
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ARIZONA 1971 NA NA NA NA NA NA NA NA
Limitations
- Matrices are only advantages if they are densely populated (little
NA
s) - Speed and memory footprint is only relevant if the data is large
Tasks / Exercises
- Import the
Produc
dataset (from the packageplm
) using the following code:
data(Produc, package = "plm")
Convert the data from long to wide
Convert the data to a matrix
Calculate mean unemployment first using the original dataframe, then the new matrix. Compare the speed difference.