Chapter 3 Data transformation

3.1 Covid Data

Our first data source on Covid already comes in a handy csv format, so we can directly read it into R. However, an issue with this data set is that it comes in a long format with each country repeated for hundreds of times among the rows. There are also some redundant information which we won’t be using in our analysis - an example of this is the source column since all 37036 entries in this data set have the exact same value for source as “Epidemic intelligence, national weekly data”, which wouldn’t be helpful in this case.

So for the transformation, we first split the whole data set into two smaller data sets based on the indicator variable (cases/deaths). For each of the subset, we then used a function to extract the cumulative_count and the weekly_count separately, and subsequently used the ·pivot_wider()· function to transform the data from long to wide format. After the transformation, we also ordered each data set by continent and country names. As a result, we obtained 4 final data sets - tidy_case_count, tidy_case_cum, tidy_death_count, tidy_death_cum. As suggested by their names, these are the weekly count and cumulative count for cases and deaths respectively. In these four data sets, each row represents a single country (or a continent) and the columns contain 92 week time points from the \(1^{st}\) week of 2020 to the \(39^{th}\) week of 2021.

Generated Data Description
tidy_case_count COVID-19 weekly case count
tidy_death_count COVID-19 weekly death count
tidy_case_cum COVID-19 cumulative case count
tidy_death_cum COVID-19 cumulative death count

Each of these four data we generate is of dimension 222*95. Each row represents a country or a continent and each column is the statistical count of COVID-19 related data. In our following analysis, we will be using both the geographical and time series data to generate insights.

Here’s an example of what the data looks like (a subset of tidy_case_cum):

country country_code continent 2020-01-02 2020-01-09 2020-04-23 2020-11-19 2021-09-23
Algeria DZA Africa 0 0 3382 74862 230789
Angola AGO Africa 0 0 26 14493 58603
Benin BEN Africa 0 0 64 2916 23890
Botswana BWA Africa 0 0 22 9594 179220
Burkina Faso BFA Africa 0 0 632 2735 14335
Burundi BDI Africa 0 0 15 662 18529

3.2 BOP Data

Our second data source on international trade is a bit more problematic. The original data comes in as excel files and is therefore not organized as data frame format by itself. There are also quite a bit of notes and annotations written outside of the tables in these files, which we would need to take care of.

In order to follow a reproducible workflow, we still tried and succeeded in using R to read in the files and transform it into a more desirable format. One of the major issues with the excel files is that it uses sub headers within the table, which causes inconsistencies in the column/row names and left the tabular form with a lot of blank entries. In order to handle this, we decided to fix these column/row names by pasting the general main headlines and the underlying sub headers together. By doing so, the newly generated column/row names would now contain all the necessary identifying information and we can then organize and clean the enclosing data accordingly. In addition, we removed the annotations and irrelevant entries by specifying the row numbers at which they lie in. We repeated the above steps for each of our 10 selected files with some minor adjustment to fit each case properly.

You can find below some details of what our processed data look like:

1. Exhibit 1 - U.S. International Trade in Goods and Services

  • -> int_trade_1 (37 obs. * 10 vars)

Column Names of int_trade_1:

##  [1] "Period"            "Balance Total"     "Balance Goods (1)"
##  [4] "Balance Services"  "Exports Total"     "Exports Goods (1)"
##  [7] "Exports Services"  "Imports Total"     "Imports Goods (1)"
## [10] "Imports Services"

int_trade_1 has monthly record of Exports, Imports, and Balance for each of Goods, Services, and Total (Goods and Services combined). The time period is from Jan. 2019 to Aug. 2021. We will mainly use it to see the fluctuation in the US imports and exports from 2019 to 2021 and compare the balance for goods and services across time.

2. Exhibit 8 - U.S. Imports by End-Use Category and Commodity

  • -> imp_trade_cat (137 obs. * 9 vars)
  • -> imp_trade_tot (6 obs. * 7 vars)
## [1] "Item (1)"            "August 2021"         "July 2021 (R)"      
## [4] "Monthly Change"      "Year-to-Date 2021"   "Year-to-Date 2020"  
## [7] "Year-to-Date Change" "Category"

imp_trade_cat and imp_trade_tot have the import of goods for “August 2021”, “July 2021 (R)”, “Monthly Change”, “Year-to-Date 2021” “Year-to-Date 2020”, and “Year-to-Date Change”. We will mainly be focusing on the Year-to-Date values in our further analysis. imp_trade_cat has 137 obs. for each of the specific end-use category, wheres imp_trade_tot is a summary that only has 6 obs. for each of the principal categories. By examining these end-use categories, we will be able to have an in depth view of the State’s dependency on each category of goods’ import.

Some example of categories in imp_trade_tot:

##  [1] "Alcoholic beverages, excluding wine" "Wine, beer, and related products"   
##  [3] "Food oils, oilseeds"                 "Other foods"                        
##  [5] "Feedstuff and foodgrains"            "Nuts"                               
##  [7] "Cocoa beans"                         "Bakery products"                    
##  [9] "Dairy products and eggs"             "Fruits, frozen juices"              
## [11] "Cane and beet sugar"                 "Tea, spices, etc."                  
## [13] "Vegetables"                          "Nonagricultural foods, etc."        
## [15] "Fish and shellfish"                  "Green coffee"                       
## [17] "Meat products"                       "Chemicals-organic"                  
## [19] "Copper"                              "Petroleum products, other"

6 Principal Categories in imp_trade_tot:

## [1] "Foods, feeds, and beverages"            
## [2] "Industrial supplies and materials"      
## [3] "Capital goods, except automotive"       
## [4] "Automotive vehicles, parts, and engines"
## [5] "Consumer goods"                         
## [6] "Other goods"

3. Exhibit 7 - U.S. Exports of Goods by End-Use Category and Commodity

  • -> exp_trade_cat (128 obs. * 8 vars)
  • -> exp_trade_tot (6 obs. * 7 vars)

Similar to Exhibit 8, exp_trade_cat and exp_trade_tot have the export of goods for the above variables. exp_trade_cat has 128 obs. for each of the specific end-use category, wheres exp_trade_tot is a summary that only has 6 obs. for each of the principal categories. By examining these end-use categories, we will be able to have an in depth view of what the US exports are mainly used for.

4. Exhibit 12 - U.S. Trade in Goods

  • -> us_trade_good (37 obs. * 9 vars)
## [1] "Period"                                   
## [2] "Balance - Total Balance of Payments Basis"
## [3] "Balance - Total Census Basis"             
## [4] "Exports - Total Balance of Payments Basis"
## [5] "Exports - Net Adjustments"                
## [6] "Exports - Total Census Basis"             
## [7] "Imports - Total Balance of Payments Basis"
## [8] "Imports - Net Adjustments"                
## [9] "Imports - Total Census Basis"

From this dataset, we extracted a us_trade_good that includes the balance, import amount and export amount of goods involved in the international trade from January 2019 to August 2021. This organized variable could serve as complementary dataset of the dataset exh1.xlsx, and further examine the proportion of goods among all imports and exports.

5. Exhibit 13 - U.S. Trade in Goods by Principal End-Use Category

  • -> us_trade_pic_export (23 obs. * 10 vars)
  • -> us_trade_pic_import (23 obs. * 10 vars)

Column Names of us_trade_pic_export:

##  [1] "Period"                          "Total Balance of Payments Basis"
##  [3] "Net  Adjustments"                "Total Census Basis"             
##  [5] "Foods, Feeds, & Beverages"       "Industrial Supplies (2)"        
##  [7] "Capital Goods"                   "Automotive Vehicles, etc."      
##  [9] "Consumer Goods"                  "Other Goods"

us_trade_pic_export and us_trade_pic_import have the monthly Export or Import for each of the 6 principal end-use categories. The time period is from Jan. 2019 to Aug. 2021. In addition to exh7 and exh8, here we get more information about these principal end-use categories over this period of time and can therefore see their fluctuations.

6. Exhibit 14 - U.S. Trade in Goods by Selected Countries and Areas - Current Year

  • -> trade_countries_2021 (49 obs. * 11 vars)

Column Names of trade_countries_2021:

##  [1] "Item (1)"                     "Balance August\r\n2021"      
##  [3] "Balance July\r\n2021"         "Balance Year-to-Date\r\n2021"
##  [5] "Exports August\r\n2021"       "Exports July\r\n2021"        
##  [7] "Exports Year-to-Date\r\n2021" "Imports August\r\n2021"      
##  [9] "Imports July\r\n2021"         "Imports Year-to-Date\r\n2021"
## [11] "Continent"

trade_countries_2021 has the 2021 Trade in Goods BOP data for 49 selected trading partners of US. We will be using the Year-to-Date data on export and import for different countries to visualize a geographical map that displays the trading amount of each country with US.

7. Exhibit 14a - U.S. Trade in Goods by Selected Countries and Areas - Prior Year

  • -> trade_countries_2020 (49 obs. * 11 vars)

Similar to the previous one, trade_countries_2020 has the 2020 Trade in Goods BOP data for 49 selected trading partners of US. And we will be using the Year-to-Date data on export and import for different countries to plot the world map.

8. Exhibit 20 - U.S. Trade in Goods and Services by Selected Countries and Areas - BOP Basis

  • -> countries_quarters_bal (19 obs. * 10 vars)
  • -> countries_quarters_exp (19 obs. * 10 vars)
  • -> countries_quarters_imp (19 obs. * 10 vars)

The original file contains the quarterly BOP of Trade in Goods and Services for 19 selected countries who are active trading partners of the US. We split it into three data sets countries_quarters_bal, countries_quarters_exp, and countries_quarters_imp that summarizes Balance, Export, and Import respectively.

Each of the above data contains quarterly and annually data which allows us to make a more detailed comparison:

## [1] "First Quarter\r\n2020"  "Second Quarter\r\n2020" "Third Quarter\r\n2020" 
## [4] "Fourth Quarter\r\n2020" "First Quarter\r\n2021"  "Second Quarter\r\n2021"
## [1] "Annual\r\n2018" "Annual\r\n2019" "Annual\r\n2020"

19 active trading partners of US included:

##  [1] "Brazil"                "Canada"                "China"                
##  [4] "France"                "Germany"               "Hong Kong"            
##  [7] "India"                 "Italy"                 "Japan"                
## [10] "Korea, South"          "Mexico"                "Saudi Arabia"         
## [13] "Singapore"             "Taiwan"                "United Kingdom"       
## [16] "All other countries"   "CAFTA-DR"              "European Union"       
## [19] "South/Central America"

9. Exhibit 20a - U.S. Trade in Goods by Selected Countries and Areas - BOP Basis

  • -> countries_quarters_goods_bal (19 obs. * 10 vars)
  • -> countries_quarters_goods_exp (19 obs. * 10 vars)
  • -> countries_quarters_goods_imp (19 obs. * 10 vars)

Similar to Exhibit 20, this file contains the quarterly BOP for 19 selected countries who are active trading partners of the US, except that it focuses on Trade in Goods only. We also generated three data sets countries_quarters_goods_bal, countries_quarters_goods_exp, and countries_quarters_goods_imp that summarizes Balance, Export, and Import respectively.

10. Exhibit 20b - U.S. Trade in Services by Selected Countries and Areas

  • -> countries_quarters_services_bal (19 obs. * 10 vars)
  • -> countries_quarters_services_exp (19 obs. * 10 vars)
  • -> countries_quarters_services_imp (19 obs. * 10 vars)

Similar to Exhibit 20, this file contains the quarterly BOP for 19 selected countries who are active trading partners of the US, except that it focuses on Trade in Services only. We also generated three data sets countries_quarters_services_bal, countries_quarters_services_exp, and countries_quarters_services_imp that summarizes Balance, Export, and Import respectively.