Introduction
Most of us can agree that education is a major step in shaping an individual's lifestyle. But a degree can leave you either one step away from a high-paying job or deep in student loan debt with a job that barely covers your needs. I am curious to understand how the choice of major shapes an individual's opportunities. In the analysis below I first tackle some general questions to get a feel for the data, and then use inferential statistics to understand the returns of college majors and whether a given major is worth the investment. Throughout, I assume that the student is open to any major.
The research questions!
Which major should a student planning to pursue a graduate degree in the United States of America choose in order to get the most out of the investment in terms of earnings?
Does a student's gender have any influence on the major they pursue, and how do earnings compare?
Why do I care?
I remember contacting several seniors who had pursued a graduate degree and asking them whether the degree was worth it financially. Luckily, I mostly got positive feedback and went ahead. But it might not be the same for all majors. I want to analyse the opportunities across all majors and how they differ from those of my own. This could help someone make an educated call on whether to go ahead with a graduate degree in a particular major.
Why should others care?
Pursuing a degree is a huge commitment and often involves a hefty investment of time and money, which in some cases means leaving a current job and taking out a student loan. It is therefore crucial to analyse what opportunities the degree could offer financially after completion.
Data
Data source:
The “College majors” data is from the American Community Survey 2010-2012 Public Use Microdata Series. It contains 5 files, split by level of education and age. In this analysis I use two of them: majors-list and recent-grads.
The other data set, “Science_and_Engineering_through_years”, contains the estimated total number of students taking a Science and Engineering major for the years 2010 through 2019, also from the American Community Survey Public Use Microdata Series.
College majors data:
majors-list.csv:
- List of majors with their FOD1P codes and major categories.
recent-grads.csv:
- File contains the basic earnings and labor force information and detailed breakdown, including by sex and by the type of job recent graduates got (ages <28).
Science_and_Engineering_through_years:
science-engineering-2010-2019.csv:
- File contains the year and the estimated total number of students taking Science and Engineering Major.
Links to data:
- <https://data.fivethirtyeight.com/>
- <https://github.com/fivethirtyeight/data/tree/master/college-majors>
- <https://data.census.gov/cedsci/table?q=education&tid=ACSST1Y2019.S1502>
Data collection:
The ACS PUMS files are a set of records from individual people or housing units, with disclosure protection enabled so that individuals or housing units cannot be identified. The data is from an observational study from the year 2010 through 2012 for “college majors” and from the year 2010 through 2019 for “Science_and_Engineering_through_years” data.
Units of observations:
majors-list description
Header | Description |
---|---|
FOD1P | Recorded field of degree - first entry |
Major_code | Major code, FOD1P in ACS PUMS |
Major | Major description |
recent-grads description
Header | Description |
---|---|
Rank | Rank by median earnings |
Major_code | Major code, FOD1P in ACS PUMS |
Major | Major description |
Major_category | Category of major from Carnevale et al |
Total | Total number of people with major |
Sample_size | Sample size (unweighted) of full-time, year-round ONLY (used for earnings) |
Men | Male graduates |
Women | Female graduates |
ShareWomen | Women as share of total |
Employed | Number employed (ESR == 1 or 2) |
Full_time | Employed 35 hours or more |
Part_time | Employed less than 35 hours |
Full_time_year_round | Employed at least 50 weeks (WKW == 1) and at least 35 hours (WKHP >= 35) |
Unemployed | Number unemployed (ESR == 3) |
Unemployment_rate | Unemployed / (Unemployed + Employed) |
Median | Median earnings of full-time, year-round workers |
P25th | 25th percentile of earnings |
P75th | 75th percentile of earnings |
College_jobs | Number with job requiring a college degree |
Non_college_jobs | Number with job not requiring a college degree |
Low_wage_jobs | Number in low-wage service jobs |
science-and-engineering-2010-2019 description
Header | Description |
---|---|
Year | Year of census estimate |
Science_Engineering | Estimated total number of students taking Science and Engineering Major |
Variables:
The primary set of files that will be used are majors-list.csv, recent-grads.csv and science-and-engineering-2010-2019. All the variables in the above files will be used in the study.
Type of study:
The data is from an observational study from year 2010 to 2012 for college majors data and from 2010 to 2019 for Science and Engineering majors data.
Data clean-up and checks:
Load Packages:
library(tidyverse)
library(scales)
library(naniar)
Load data
majors_list <- read.csv("./majors-list.csv")
recent_grads <- read.csv("./recent-grads.csv")
se_data <- read.csv("./se_2010_2019.csv")
majors_list <- as_tibble(majors_list)
recent_grads <- as_tibble(recent_grads)
se_data <- as_tibble(se_data)
se_data <- se_data %>%
rename(
Science_Engineering_Total = Science_Engineering
)
Check for missing values
Let's check for missing values across the data sets.
vis_miss(majors_list)
The NA in the majors list data represents an education level below a bachelor's degree.
vis_miss(recent_grads)
The only major with partially missing data is FOOD SCIENCE, in the Agriculture & Natural Resources category. It is missing the total number of graduates, the number of men and women, and the share of women.
vis_miss(se_data)
There is no missing data for either Year or Science_Engineering_Total.
Exploratory Data Analysis
Let's explore the recent grads data first, followed by the Science and Engineering majors data.
Recent Graduates:
Summary:
drop_cols <- c('Rank','Major_code')
summary_data <- recent_grads %>% select(-one_of(drop_cols))
summary(summary_data)
## Major Total Men Women
## Length:173 Min. : 124 Min. : 119 Min. : 0
## Class :character 1st Qu.: 4550 1st Qu.: 2178 1st Qu.: 1778
## Mode :character Median : 15104 Median : 5434 Median : 8386
## Mean : 39370 Mean : 16723 Mean : 22647
## 3rd Qu.: 38910 3rd Qu.: 14631 3rd Qu.: 22554
## Max. :393735 Max. :173809 Max. :307087
## NA's :1 NA's :1 NA's :1
## Major_category ShareWomen Sample_size Employed
## Length:173 Min. :0.0000 Min. : 2.0 Min. : 0
## Class :character 1st Qu.:0.3360 1st Qu.: 39.0 1st Qu.: 3608
## Mode :character Median :0.5340 Median : 130.0 Median : 11797
## Mean :0.5222 Mean : 356.1 Mean : 31193
## 3rd Qu.:0.7033 3rd Qu.: 338.0 3rd Qu.: 31433
## Max. :0.9690 Max. :4212.0 Max. :307933
## NA's :1
## Full_time Part_time Full_time_year_round Unemployed
## Min. : 111 Min. : 0 Min. : 111 Min. : 0
## 1st Qu.: 3154 1st Qu.: 1030 1st Qu.: 2453 1st Qu.: 304
## Median : 10048 Median : 3299 Median : 7413 Median : 893
## Mean : 26029 Mean : 8832 Mean : 19694 Mean : 2416
## 3rd Qu.: 25147 3rd Qu.: 9948 3rd Qu.: 16891 3rd Qu.: 2393
## Max. :251540 Max. :115172 Max. :199897 Max. :28169
##
## Unemployment_rate Median P25th P75th
## Min. :0.00000 Min. : 22000 Min. :18500 Min. : 22000
## 1st Qu.:0.05031 1st Qu.: 33000 1st Qu.:24000 1st Qu.: 42000
## Median :0.06796 Median : 36000 Median :27000 Median : 47000
## Mean :0.06819 Mean : 40151 Mean :29501 Mean : 51494
## 3rd Qu.:0.08756 3rd Qu.: 45000 3rd Qu.:33000 3rd Qu.: 60000
## Max. :0.17723 Max. :110000 Max. :95000 Max. :125000
##
## College_jobs Non_college_jobs Low_wage_jobs
## Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 1675 1st Qu.: 1591 1st Qu.: 340
## Median : 4390 Median : 4595 Median : 1231
## Mean : 12323 Mean : 13284 Mean : 3859
## 3rd Qu.: 14444 3rd Qu.: 11783 3rd Qu.: 3466
## Max. :151643 Max. :148395 Max. :48207
##
From the above summary it can be seen that across all majors:
- Total students count ranges from 124 to 393735
- The minimum number of men is 119 and maximum is 173809, while minimum number of women is 0 and maximum is 307087.
- On average the share of women across majors is 0.5222
- On average the unemployment rate across majors is 0.06819 (the median is 0.06796)
- The median salary ranges from 22000 to 110000
1. What are the 15 most popular majors?
majors_sorted <- recent_grads %>%
arrange(desc(Median)) %>%
mutate(Major = str_to_title(Major),
Major = fct_reorder(Major, Median))
majors_sorted %>%
mutate(Major = fct_reorder(Major, Total)) %>%
arrange(desc(Total)) %>%
head(15) %>%
ggplot(aes(Major, Total, fill = Major_category)) +
geom_col() +
coord_flip() +
scale_y_continuous() +
labs(x = "Major",
y = "Total number of graduates",
title = "15 most popular majors")+
scale_fill_discrete(name = "Major Category")
Psychology is the most popular major among recent grads, followed by Business Management and Administration. Although a Psychology major tops the list, about 1/3 of the 15 most popular majors are from the Business category.
2. What are the 15 least popular majors?
majors_sorted %>%
mutate(Major = fct_reorder(Major, Total)) %>%
arrange(desc(Total)) %>%
tail(15) %>%
ggplot(aes(Major, Total, fill = Major_category)) +
geom_col() +
coord_flip() +
scale_y_continuous() +
labs(x = "Major",
y = "Total number of graduates",
title = "15 least popular majors")+
scale_fill_discrete(name = "Major Category")
Military Technologies is the least popular major, with fewer than 500 people taking it. The Engineering and Education categories together dominate the least popular majors.
Which major category is most popular?
majors_sorted %>%
group_by(Major_category) %>%
summarize(Total = mean(Total, na.rm = TRUE)) %>% # Food Science has a missing Total
mutate(Major_category = fct_reorder(Major_category, Total)) %>%
ggplot(aes(Major_category, Total)) +
geom_col() +
scale_y_continuous() +
coord_flip()+
labs(x = "Major Category",
y = "Average number of graduates per major",
title = "Average # of graduate students across Major categories")
Business is the most popular major category, while Engineering and Interdisciplinary are the least popular. Although Psychology is the most popular major, Psychology and Social Work is only the 4th most popular major category.
Which major categories earn the most?
majors_sorted %>%
mutate(Major_category = fct_reorder(Major_category, Median)) %>%
ggplot(aes(Major_category, Median, fill = Major_category)) +
geom_boxplot() +
scale_y_continuous(labels = dollar_format()) +
expand_limits(y = 0) +
coord_flip() +
theme(legend.position = "none")+
labs(x = "Major Category",
y = "Median Salary",
title = "Median Salaries across Major categories")
It's interesting that the least popular major categories earn the most, while the most popular ones sit lower on the scale. Business and Social Science are the only categories among the five most popular that also rank high on median pay.
How is the gender distribution across popular majors?
majors_sorted %>%
arrange(desc(Total)) %>%
head(15) %>%
mutate(Major = fct_reorder(Major, Total)) %>%
gather(Gender, Number, Men, Women) %>%
ggplot(aes(Major, Number, fill = Gender)) +
geom_col() +
coord_flip()+
labs(x = "Major",
y = "Total Number of Graduates",
title = "Gender Distribution across top 15 majors")
Women are most numerous in Psychology, Nursing and Elementary Education, and least numerous in Finance.
How does the gender distribution correlate with earnings?
by_major_category <- majors_sorted %>%
filter(!is.na(Total)) %>%
group_by(Major_category) %>%
summarize(Men = sum(Men),
Women = sum(Women),
Total = sum(Total),
MedianSalary = median(Median)) %>%
mutate(ShareWomen = Women / Total) %>%
arrange(desc(ShareWomen))
library(ggrepel)
by_major_category %>%
mutate(Major_category = fct_lump(Major_category, 8)) %>%
ggplot(aes(ShareWomen, MedianSalary, color = Major_category)) +
geom_point() +
geom_smooth(aes(group = 1), method = "lm") +
geom_smooth(method = "lm") +
geom_text_repel(aes(label = Major_category)) +
expand_limits(y = 0)+
labs(x = "Share of Women",
y = "Median salary of Major category",
title = "Correlation of gender distribution and earnings")+
scale_color_discrete(name = "Major Category") # the plot maps color, not fill
majors_sorted %>%
ggplot(aes(ShareWomen, Median, color = Major_category)) +
geom_point() +
geom_smooth(aes(group = 1), method = "lm") +
labs(x = "Share of Women",
y = "Median salary of major",
title = "Correlation of gender distribution and earnings")+
scale_color_discrete(name = "Major Category")+
expand_limits(y = 0)
Most STEM categories have the lowest share of women and high pay, while Health and Arts have the highest share of women. Business and Social Science sit near the center, with an almost equal share of women and men.
How has the estimated total of Science and Engineering majors changed over the years?
plot(se_data$Year, se_data$Science_Engineering_Total , main="Science and engineering major total graduates 2010-2019",
xlab="year", ylab="Science and Engineering Major Estimated Total", pch=19)
There seems to be a steady increase in the number of students taking Science and Engineering majors over the years.
Inference
indata <- majors_sorted %>%
filter(!is.na(Total)) %>%
group_by(Major_category) %>%
summarize(Men = sum(Men),
Women = sum(Women),
Total = sum(Total),
Employed = sum(Employed),
Full_time = sum(Full_time),
Part_time = sum(Part_time),
Full_time_year_round = sum(Full_time_year_round),
Unemployed = sum(Unemployed),
Unemployed_rate = sum(Unemployed) / (sum(Unemployed) + sum(Employed)), # denominator must be parenthesized
College_jobs = sum(College_jobs),
Non_college_jobs = sum(Non_college_jobs),
Low_wage_jobs = sum(Low_wage_jobs),
MedianSalary = median(Median),
Emp_total = sum(Employed)+ sum(Unemployed)) %>%
mutate(ShareWomen = Women / Total) %>%
arrange(desc(Employed))
indata <- as_tibble(indata)
eng_tech <- filter(indata, Major_category =="Engineering" | Major_category =="Computers & Mathematics")
business <- filter(indata, Major_category == "Business")
arts <- filter (indata, Major_category == "Arts")
Two proportion Z-test:
The analysis above shows that Engineering, Computers & Mathematics and Business have the highest median salaries, in that order, yet most students opt for a Business major over an Engineering or Computers & Mathematics major. I wanted to understand whether that is related to the employment rate of these major categories.
Why z test? :
The data only contains counts for the different categories and median salaries from samples. As we have no information on the mean or standard deviation, but we do have counts of employed and unemployed graduates, we compare employment proportions using a two-proportion z-test.
Hypothesis of a two sample Z-test (Upper tailed):
Question: Is the employment rate of people who took E/CM (Engineering or Computers & Mathematics) higher than that of people who took Business?
Null Hypothesis \(\implies\) \(H_0: p_e \le p_b\)
Alternate Hypothesis \(\implies\) \(H_A: p_e > p_b\)
and let the significance level be \(\alpha\) = 0.05
Decision rule: Since \(\alpha\) = 0.05 and we are using an upper-tailed z-test, the decision rule is to “Reject \(H_0\) if Z \(\ge\) 1.645”.
Assumptions:
Here, let ‘e’ represent ‘Engineering, Computers & Mathematics major’ and ‘b’ represent ‘Business major’.
The value of z is given by
\[z = \frac{p_e - p_b}{\sqrt{p (1 - p) \left(\frac{1}{n_e} + \frac{1}{n_b}\right)}}\]
The proportion of Engineering, Computers & Mathematics graduates employed is given by \(p_e\):
\[p_e = \frac{Employed_e}{Employed_e + Unemployed_e} \implies p_e = 0.9317863\]
The proportion of Business graduates employed is given by \(p_b\):
\[p_b = \frac{Employed_b}{Employed_b + Unemployed_b} \implies p_b = 0.9316484\]
The overall (pooled) proportion of graduates employed is given by \(p\):
\[p = \frac{Employed_e + Employed_b}{(Employed_e + Unemployed_e) + (Employed_b + Unemployed_b)} \implies p = 0.9317003\]
The group sizes \(n_e\) and \(n_b\) are the numbers of Engineering, Computers & Mathematics graduates and Business graduates in the labor force (employed plus unemployed), which are the denominators of \(p_e\) and \(p_b\):
\[n_e = 706456, \quad n_b = 1168619\]
A two-proportion z-test is valid only when the sample size (n) is large enough, i.e. \(n_e p\), \(n_e(1-p)\), \(n_b p\) and \(n_b(1-p)\) should all be \(\ge\) 5. Here, as \(n_e p = 658205.3\), \(n_e(1-p) = 48250.71\), \(n_b p = 1088803\) and \(n_b(1-p) = 79816.29\), we can proceed to perform the z-test on the data.
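The sample-size condition can be reproduced from the figures quoted above. This is a minimal sketch in base R; the variable names are illustrative, and the inputs are the rounded values reported in this section rather than values read from the data files.

```r
# Figures quoted in the text above (rounded as reported)
n_e <- 706456      # E/CM group size: employed + unemployed
n_b <- 1168619     # Business group size: employed + unemployed
p   <- 0.9317003   # pooled proportion employed

# Expected counts of employed and unemployed graduates in each group
expected <- c(n_e * p, n_e * (1 - p), n_b * p, n_b * (1 - p))
names(expected) <- c("E/CM employed", "E/CM unemployed",
                     "Business employed", "Business unemployed")
print(round(expected, 1))

# The normal approximation is usually considered reasonable
# when every expected count is at least 5
all_ok <- all(expected >= 5)
print(all_ok)
```

Every expected count is several orders of magnitude above the threshold, so the condition is not a concern for this comparison.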
Let's gather the required data into a tibble for ease of reference.
temp_vs_b <- eng_tech %>%
summarize(Men = sum(Men),
Women = sum(Women),
Major_category = "Engineering, Computers & Mathematics",
Total = sum(Total),
Employed = sum(Employed),
Full_time = sum(Full_time),
Part_time = sum(Part_time),
Full_time_year_round = sum(Full_time_year_round),
Unemployed = sum(Unemployed),
Unemployed_rate = sum(Unemployed) / (sum(Unemployed) + sum(Employed)), # denominator must be parenthesized
College_jobs = sum(College_jobs),
Non_college_jobs = sum(Non_college_jobs),
Low_wage_jobs = sum(Low_wage_jobs),
MedianSalary = median(MedianSalary),
Emp_total = sum(Employed)+ sum(Unemployed)) %>%
mutate(ShareWomen = Women / Total) %>%
arrange(desc(Employed))
temp_vs_b <- temp_vs_b %>% add_row(business)
temp_vs_a <- temp_vs_b %>% add_row(arts)
Test of the hypothesis \(H_0: p_e \le p_b\)
res_temp_vs_b <- prop.test(x=c(temp_vs_b$Employed[1],temp_vs_b$Employed[2]), n=c(temp_vs_b$Emp_total[1],temp_vs_b$Emp_total[2]),alternative = "greater")
res_temp_vs_b
##
## 2-sample test for equality of proportions with continuity correction
##
## data: c(temp_vs_b$Employed[1], temp_vs_b$Employed[2]) out of c(temp_vs_b$Emp_total[1], temp_vs_b$Emp_total[2])
## X-squared = 0.12939, df = 1, p-value = 0.3595
## alternative hypothesis: greater
## 95 percent confidence interval:
## -0.0004884284 1.0000000000
## sample estimates:
## prop 1 prop 2
## 0.9317863 0.9316484
Interpretation of Result:
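prop.test reports a chi-squared statistic (with continuity correction) rather than z; for a two-sample comparison its square root is the corresponding z value, \(\sqrt{0.12939} \approx 0.36\), which is below the critical value of 1.645. The p-value is 0.3595, which is greater than \(\alpha\) = 0.05, so the result is not significant and we fail to reject the null hypothesis that the employment rate of people who took E/CM (Engineering or Computers & Mathematics) is less than or equal to that of those who took Business.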
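As a rough cross-check of the prop.test output, the z statistic can also be computed directly from the formula in the Assumptions section. The sketch below uses the rounded proportions and group sizes reported above and omits the continuity correction that prop.test applies by default, so its result should be close to, but not identical to, the square root of the reported X-squared.

```r
# Proportions and group sizes as reported in this section (rounded)
p_e <- 0.9317863; n_e <- 706456     # Engineering, Computers & Mathematics
p_b <- 0.9316484; n_b <- 1168619    # Business

# Pooled proportion employed across both groups
p <- (p_e * n_e + p_b * n_b) / (n_e + n_b)

# Two-proportion z statistic, without continuity correction
z <- (p_e - p_b) / sqrt(p * (1 - p) * (1 / n_e + 1 / n_b))

# Upper-tailed p-value, matching alternative = "greater"
p_value <- pnorm(z, lower.tail = FALSE)

# Decision rule: reject H0 only if z is at least the 95th percentile of N(0, 1)
reject_h0 <- z >= qnorm(0.95)

print(c(z = z, p_value = p_value))
print(reject_h0)
```

With these inputs z comes out at roughly 0.36, well below the 1.645 cutoff, which agrees with the decision reached from the prop.test p-value.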
Regression Analysis:
Since the College majors data consists only of counts and median salaries for the different major categories, I could not find a suitable regression model to apply to it.
Application of Approach:
Fit a linear regression model to the yearly se_data series to predict the number of students taking Science and Engineering majors from the year.
Formulation
Predict the number of students taking a Science and Engineering major given the year.
Check for assumptions
plot(se_data$Year, se_data$Science_Engineering_Total , main="Science and Engineering Major Estimated Total over years 2010-2019",
xlab="year", ylab="Science and Engineering Major Estimated Total", pch=19)
par(mfrow=c(1, 2)) # divide graph area in 2 columns
boxplot(se_data$Year, main="year", sub=paste("Outlier rows: ", boxplot.stats(se_data$Year)$out)) # box plot for 'Year'
boxplot(se_data$Science_Engineering_Total, main="Science_Engineering_Total", sub=paste("Outlier rows: ", boxplot.stats(se_data$Science_Engineering_Total)$out)) # box plot for 'Science_Engineering_Total'
cor(se_data$Year, se_data$Science_Engineering_Total)
## [1] 0.995729
As the correlation is close to 1, there is a strong positive linear relationship between the two variables.
se_lr <- lm( Science_Engineering_Total ~ Year,data = se_data)
summary(se_lr)
##
## Call:
## lm(formula = Science_Engineering_Total ~ Year, data = se_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -230108 -189820 3226 151470 358029
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.460e+09 4.861e+07 -30.03 1.64e-09 ***
## Year 7.360e+05 2.413e+04 30.50 1.45e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 219200 on 8 degrees of freedom
## Multiple R-squared: 0.9915, Adjusted R-squared: 0.9904
## F-statistic: 930.5 on 1 and 8 DF, p-value: 1.448e-09
The residuals look close to symmetrical. The least squares fit has an intercept of -1.460e+09 and a slope of 7.360e+05, i.e. an estimated increase of about 736,000 students per year. The coefficient of Year is statistically significant, as its p-value of 1.45e-09 is less than 0.05. This indicates that Year is a useful linear predictor of the number of people taking a Science and Engineering major within the observed range; it does not by itself guarantee accurate forecasts.
As the R-squared is 0.9915, Year explains about 99% of the variation in Science_Engineering_Total. The Adjusted R-squared is the R-squared adjusted for the number of parameters in the model. The model has 8 residual degrees of freedom.
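For a simple linear regression with a single predictor, R-squared equals the square of the Pearson correlation, and the residual degrees of freedom equal the number of observations minus the two estimated coefficients. A quick check using the values reported above (a sketch; the only inputs are the printed correlation and the assumption of one observation per year from 2010 through 2019):

```r
# Correlation between Year and Science_Engineering_Total, as printed above
r <- 0.995729

# One estimate per year from 2010 through 2019 (assumed from the data description)
n_obs <- length(2010:2019)

r_squared <- r^2        # should match Multiple R-squared up to rounding
resid_df  <- n_obs - 2  # intercept and slope are both estimated

print(round(r_squared, 4))
print(resid_df)
```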
plot(Science_Engineering_Total ~ Year,data=se_data)
abline(se_lr, col="blue")
plot(se_lr)
testdata<- data.frame(Year=2020)
predict(se_lr,testdata)
## 1
## 26923972
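A point prediction on its own hides how uncertain an extrapolation is. One way to express that uncertainty is a prediction interval from predict(). The sketch below is illustrative only: `toy` is a hypothetical stand-in with simulated values, not the ACS figures, and `toy_lr` plays the role of se_lr.

```r
# Hypothetical stand-in for se_data: a linear trend plus noise (simulated, NOT the ACS data)
set.seed(1)
toy <- data.frame(Year = 2010:2019)
toy$Science_Engineering_Total <- 1.9e7 + 7.4e5 * (toy$Year - 2010) +
  rnorm(nrow(toy), sd = 2e5)

# Same model form as se_lr
toy_lr <- lm(Science_Engineering_Total ~ Year, data = toy)

# 95% prediction interval for one year beyond the observed range
pred <- predict(toy_lr, newdata = data.frame(Year = 2020),
                interval = "prediction", level = 0.95)
print(pred)  # columns: fit (point prediction), lwr and upr (interval bounds)
```

Applying the same call to se_lr and testdata would give a range around the 2020 estimate rather than a single number.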
AIC(se_lr)
## [1] 278.0983
BIC(se_lr)
## [1] 279.006
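AIC and BIC are relative measures: they are meaningful only when compared against the values for alternative models fitted to the same data, so on their own they do not show whether the fit is good. The high R-squared above is what suggests that the model captures the linear trend well.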
Conclusion
Summary:
From the analysis it can be observed that the largest share of graduates falls under the Business major category, which is also the category with the 3rd highest median salary. Despite having the highest median salaries, the Engineering and Computers & Mathematics categories fail to attract as many graduates, with Engineering among the least popular by average number of graduates per major. The linear regression analysis shows a steady increase in the number of students opting for Science and Engineering majors over time.
Answers for research questions:
- Engineering looks like a good choice of major if you are aiming for a high-income job after graduation, followed by Computers & Mathematics and Business.
- Majors with a high share of women tend to be lower paying, and the share of women is noticeably lower in high-paying majors. This is an observed association in the plots above, not a formally tested effect.
Limitations:
This analysis is based on data from the American Community Survey 2010-2012. Geography and university can strongly influence compensation and opportunities, and neither is taken into consideration here. In addition, each organization has its own pay scale, which is subject to inflation and can vary for the same role and experience level even within the organization. We only consider the median salary of each major.
Future research:
Data relating to geography, job role and experience could be appended to the college majors data to paint a better picture and address the current limitations of the data. The regression model could be used to understand whether the number of students graduating from Science and Engineering majors meets the job demand of a given year.