The sfo_stats
dataset provides monthly
statistics on San Francisco International Airport’s air traffic landing
between July 2005 and December 2020. The following vignette demonstrate
some approaches for exploring the dataset. As the structure of the
sfo_stats
is similar to the sfo_passengers
dataset, we will repeat the same data prep steps as shown on the
previous vignette. We will use the dplyr and
plotly packages for data manipulation and
visualization, respectively.
For simplicity, let’s use a shorter name , d
, for the
dataset:
library(sfo)
library(dplyr)
library(plotly)
d <- sfo_stats
head(d)
#> activity_period operating_airline operating_airline_iata_code
#> 1 202212 Air Canada AC
#> 2 202212 Delta Air Lines DL
#> 3 202212 Delta Air Lines DL
#> 4 202212 WestJet Airlines WS
#> 5 202212 WestJet Airlines WS
#> 6 202212 WestJet Airlines WS
#> published_airline published_airline_iata_code geo_summary geo_region
#> 1 Air Canada AC International Canada
#> 2 Delta Air Lines DL Domestic US
#> 3 Delta Air Lines DL Domestic US
#> 4 WestJet Airlines WS International Canada
#> 5 WestJet Airlines WS International Canada
#> 6 WestJet Airlines WS International Canada
#> landing_aircraft_type aircraft_body_type aircraft_manufacturer aircraft_model
#> 1 Passenger Wide Body Boeing B773
#> 2 Passenger Narrow Body Airbus A319
#> 3 Passenger Narrow Body Airbus A320
#> 4 Passenger Narrow Body Boeing B38M
#> 5 Passenger Narrow Body Boeing B737
#> 6 Passenger Narrow Body Boeing B738
#> aircraft_version landing_count total_landed_weight
#> 1 - 1 554000
#> 2 - 163 22459444
#> 3 - 43 6114514
#> 4 - 5 764000
#> 5 - 5 646000
#> 6 - 3 438900
Next, let’s reformat the period indicator,
activity_period
to a Date
object, setting the
first day of the month as the default day:
d$date <- as.Date(paste(substr(d$activity_period, 1,4),
substr(d$activity_period, 5,6),
"01", sep ="/"))
We can see, with the str
command, the stucture of the
dataset:
str(d)
#> 'data.frame': 57381 obs. of 15 variables:
#> $ activity_period : int 202212 202212 202212 202212 202212 202212 202212 202212 202212 202212 ...
#> $ operating_airline : chr "Air Canada" "Delta Air Lines" "Delta Air Lines" "WestJet Airlines" ...
#> $ operating_airline_iata_code: chr "AC" "DL" "DL" "WS" ...
#> $ published_airline : chr "Air Canada" "Delta Air Lines" "Delta Air Lines" "WestJet Airlines" ...
#> $ published_airline_iata_code: chr "AC" "DL" "DL" "WS" ...
#> $ geo_summary : chr "International" "Domestic" "Domestic" "International" ...
#> $ geo_region : chr "Canada" "US" "US" "Canada" ...
#> $ landing_aircraft_type : chr "Passenger" "Passenger" "Passenger" "Passenger" ...
#> $ aircraft_body_type : chr "Wide Body" "Narrow Body" "Narrow Body" "Narrow Body" ...
#> $ aircraft_manufacturer : chr "Boeing" "Airbus" "Airbus" "Boeing" ...
#> $ aircraft_model : chr "B773" "A319" "A320" "B38M" ...
#> $ aircraft_version : chr "-" "-" "-" "-" ...
#> $ landing_count : int 1 163 43 5 5 3 22 33 19 269 ...
#> $ total_landed_weight : int 554000 22459444 6114514 764000 646000 438900 8360000 5660127 2779700 42313700 ...
#> $ date : Date, format: "2022-12-01" "2022-12-01" ...
The data set has 11 categorical variables and two numeric variables -
landing_count
and total_landed_weight
.
Let’s start with viewing the total monthly number of landing in SFO:
d %>%
group_by(date) %>%
summarise(landing_count = sum(landing_count)) %>%
plot_ly(x = ~ date, y = ~ landing_count,
type = "scatter", mode = "lines") %>%
layout(title = "Montly Landing in SFO Airport",
yaxis = list(title = "Number of Landing"),
xaxis = list(title = "Source: San Francisco data portal (DataSF)"))
As can seen in the aggregate plot above, the data has:
We can use plotly’s fill plot to review the distribution of landing at SFO by geo region:
d %>%
group_by(date, geo_region) %>%
summarise(landing_count = sum(landing_count)) %>%
as.data.frame() %>%
plot_ly(x = ~ date,
y = ~ landing_count,
type = 'scatter',
mode = 'none',
stackgroup = 'one',
groupnorm = 'percent', fillcolor = ~ geo_region) %>%
layout(title = "Dist. of Landing at SFO by Region",
yaxis = list(title = "Percentage",
ticksuffix = "%"))
As expected, we can notice the change in geo’s landing distribution since March 2020 due to the Covid19 pandemic.
The aircraft_manufacturer
column, as the name implies,
provides the the aircraft manufacture. Let’s summarize the total landing
during 2019, the most recent full calendar year, by the manufacturer
type:
d %>%
filter(activity_period >= 201901 & activity_period < 202001,
aircraft_manufacturer != "") %>%
group_by(aircraft_manufacturer) %>%
summarise(total_landing = sum(landing_count),
`.groups` = "drop") %>%
arrange(-total_landing) %>%
plot_ly(labels = ~ aircraft_manufacturer,
values = ~ total_landing) %>%
add_pie(hole = 0.6) %>%
layout(title = "Landing Distribution by Aircraft Manufacturer During 2019")
Similarly, we can add the aircract_body_type
and get the
distribution of landing airplans during 2019 by manufacturer and body
type (e.g., wide, narrow, etc.):
d %>%
filter(activity_period >= 201901 & activity_period < 202001,
aircraft_manufacturer != "") %>%
group_by(aircraft_manufacturer, aircraft_body_type) %>%
summarise(total_landing = sum(landing_count),
`.groups` = "drop") %>%
arrange(-total_landing)
#> # A tibble: 9 × 3
#> aircraft_manufacturer aircraft_body_type total_landing
#> <chr> <chr> <int>
#> 1 Boeing Narrow Body 156286
#> 2 Airbus Narrow Body 112296
#> 3 Boeing Wide Body 51900
#> 4 Embraer Regional Jet 48648
#> 5 Bombardier Regional Jet 41724
#> 6 Airbus Wide Body 11506
#> 7 Bombardier Narrow Body 2028
#> 8 McDonnell Douglas Narrow Body 6
#> 9 McDonnell Douglas Wide Body 2
A Sankey plot enables us to get a distribution flow of some numeric
value by multiple categorical variables. In the following example, we
will use the sankey_ly
function to plot the distribution of
landing during 2019 by geo, flight type, and aircraft details:
d %>%
filter(activity_period >= 201901 & activity_period < 202001,
aircraft_manufacturer != "") %>%
group_by(geo_region, landing_aircraft_type,
aircraft_manufacturer, aircraft_model,
aircraft_body_type) %>%
summarise(total_landing = sum(landing_count),
groups = "drop") %>%
sankey_ly(cat_cols = c("geo_region",
"landing_aircraft_type",
"aircraft_manufacturer",
"aircraft_model",
"aircraft_body_type"),
num_col = "total_landing",
title = "SFO Landing Summary by Geo Region and Aircraft Type During 2019")