College Majors Project Report

Priyanka

12/06/2021

Introduction

Most of us can agree that education is a step in shaping an individual's lifestyle. But you can be either one degree away from a high-paying job or deep in student loan debt with a job that barely meets your needs. I am curious to understand how the major of a degree shapes an individual's opportunities. In the analysis below I tackle some general questions to get an overview of the data, then use inferential statistics to understand the returns of college majors and whether a given major is worth the investment. Throughout, I assume that the student is open to any major.

The research questions!

What major should a student planning to pursue a graduate degree in the United States of America choose in order to get the most out of the investment in terms of earnings?
Does the gender of a student have any influence on the major the student pursues, and how do their earnings look?

Why do I care?

I remember contacting several seniors who had pursued a graduate degree and asking them whether the degree was worth it financially. Luckily, I mostly got positive feedback and went ahead. But it might not be the same for all majors. I want to analyse the opportunities across all majors and how they differ from those of mine. This could help someone make an educated call on whether to go ahead with pursuing a graduate degree in a particular major.

Why should others care?

Pursuing a degree is a huge commitment and often involves a hefty investment of time and money, which in some cases means leaving a current job and taking out a student loan. Thus it is crucial to analyse what opportunities the degree could offer financially after completion.

Data

Data source:

“College majors” data is from the American Community Survey 2010-2012 Public Use Microdata Series. The data contains 5 files segregated by level of education and age. In this analysis I use two of them, namely majors-list and recent-grads.

The other data set, “Science_and_Engineering_through_years”, is the estimated total number of students taking a Science and Engineering major over the years 2010 through 2019, also from the American Community Survey Public Use Microdata Series.

College majors data:

majors-list.csv:
- List of majors with their FOD1P codes and major categories.
recent-grads.csv:
- File contains the basic earnings and labor force information and detailed breakdown, including by sex and by the type of job recent graduates got (ages <28).

Science_and_Engineering_through_years:

science-engineering-2010-2019.csv:
- File contains the year and the estimated total number of students taking Science and Engineering Major.

Links to data:

-   <https://data.fivethirtyeight.com/>
-   <https://github.com/fivethirtyeight/data/tree/master/college-majors>
-   <https://data.census.gov/cedsci/table?q=education&tid=ACSST1Y2019.S1502>

Data collection:

The ACS PUMS files are a set of records from individual people or housing units, with disclosure protection enabled so that individuals or housing units cannot be identified. The data is from an observational study from the year 2010 through 2012 for “college majors” and from the year 2010 through 2019 for “Science_and_Engineering_through_years” data.

Units of observations:

majors-list description

Header	Description
FOD1P	Recorded field of degree - first entry
Major_code	Major code, FOD1P in ACS PUMS
Major	Major description

recent-grads description

Header	Description
Rank	Rank by median earnings
Major_code	Major code, FOD1P in ACS PUMS
Major	Major description
Major_category	Category of major from Carnevale et al
Total	Total number of people with major
Sample_size	Sample size (unweighted) of full-time, year-round ONLY (used for earnings)
Men	Male graduates
Women	Female graduates
ShareWomen	Women as share of total
Employed	Number employed (ESR == 1 or 2)
Full_time	Employed 35 hours or more
Part_time	Employed less than 35 hours
Full_time_year_round	Employed at least 50 weeks (WKW == 1) and at least 35 hours (WKHP >= 35)
Unemployed	Number unemployed (ESR == 3)
Unemployment_rate	Unemployed / (Unemployed + Employed)
Median	Median earnings of full-time, year-round workers
P25th	25th percentile of earnings
P75th	75th percentile of earnings
College_jobs	Number with job requiring a college degree
Non_college_jobs	Number with job not requiring a college degree
Low_wage_jobs	Number in low-wage service jobs

science-and-engineering-2010-2019 description

Header	Description
Year	Year of census estimate
Science_Engineering	Estimated total number of students taking Science and Engineering Major

Variables:

The primary set of files that will be used are majors-list.csv, recent-grads.csv and science-and-engineering-2010-2019. All the variables in the above files will be used in the study.

Type of study:

The data is from an observational study from year 2010 to 2012 for college majors data and from 2010 to 2019 for Science and Engineering majors data.

Data clean-up and checks:

Load Packages:

library(tidyverse)
library(scales)
library(naniar)

Load data

majors_list <- read.csv("./majors-list.csv")
recent_grads <- read.csv("./recent-grads.csv")
se_data <- read.csv("./se_2010_2019.csv")

majors_list <- as_tibble(majors_list)
recent_grads <- as_tibble(recent_grads)
se_data <- as_tibble(se_data)
se_data <- se_data %>% 
  rename(
    Science_Engineering_Total = Science_Engineering
    )

Check for missing values

Let's check for any missing values across the data.

vis_miss(majors_list)

The NA in the majors list data represents an education level below a bachelor's degree.

vis_miss(recent_grads)

The only major with partially missing data is FOOD SCIENCE, in the Agriculture & Natural Resources category. It has no data for the total number of graduates, the number of men and women, or the share of women.

vis_miss(se_data)

There is no missing data for either Year or Science_Engineering_Total.
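A numeric count can complement the vis_miss plots. The following is a minimal sketch in base R on a small made-up data frame (the `toy` object and its values are hypothetical and stand in for recent_grads); it counts missing values per column and lists the affected rows.

```r
# Hypothetical toy data standing in for recent_grads; all values are made up.
toy <- data.frame(
  Major = c("A", "B", "C"),
  Total = c(100, NA, 250),
  Men   = c(60, NA, 100),
  Women = c(40, NA, 150)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Rows that contain at least one missing value
rows_with_na <- toy[!complete.cases(toy), ]
print(rows_with_na)
```

Applied to the real tibbles, the same two calls would show which columns and which majors are incomplete without reading them off a plot.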

Exploratory Data Analysis

Let's explore the recent grads data first, followed by the Science and Engineering majors data.

Recent Graduates:

Summary:

drop_cols <- c('Rank','Major_code')
summary_data <- recent_grads %>% select(-one_of(drop_cols))
summary(summary_data)

##     Major               Total             Men             Women       
##  Length:173         Min.   :   124   Min.   :   119   Min.   :     0  
##  Class :character   1st Qu.:  4550   1st Qu.:  2178   1st Qu.:  1778  
##  Mode  :character   Median : 15104   Median :  5434   Median :  8386  
##                     Mean   : 39370   Mean   : 16723   Mean   : 22647  
##                     3rd Qu.: 38910   3rd Qu.: 14631   3rd Qu.: 22554  
##                     Max.   :393735   Max.   :173809   Max.   :307087  
##                     NA's   :1        NA's   :1        NA's   :1       
##  Major_category       ShareWomen      Sample_size        Employed     
##  Length:173         Min.   :0.0000   Min.   :   2.0   Min.   :     0  
##  Class :character   1st Qu.:0.3360   1st Qu.:  39.0   1st Qu.:  3608  
##  Mode  :character   Median :0.5340   Median : 130.0   Median : 11797  
##                     Mean   :0.5222   Mean   : 356.1   Mean   : 31193  
##                     3rd Qu.:0.7033   3rd Qu.: 338.0   3rd Qu.: 31433  
##                     Max.   :0.9690   Max.   :4212.0   Max.   :307933  
##                     NA's   :1                                         
##    Full_time        Part_time      Full_time_year_round   Unemployed   
##  Min.   :   111   Min.   :     0   Min.   :   111       Min.   :    0  
##  1st Qu.:  3154   1st Qu.:  1030   1st Qu.:  2453       1st Qu.:  304  
##  Median : 10048   Median :  3299   Median :  7413       Median :  893  
##  Mean   : 26029   Mean   :  8832   Mean   : 19694       Mean   : 2416  
##  3rd Qu.: 25147   3rd Qu.:  9948   3rd Qu.: 16891       3rd Qu.: 2393  
##  Max.   :251540   Max.   :115172   Max.   :199897       Max.   :28169  
##                                                                        
##  Unemployment_rate     Median           P25th           P75th       
##  Min.   :0.00000   Min.   : 22000   Min.   :18500   Min.   : 22000  
##  1st Qu.:0.05031   1st Qu.: 33000   1st Qu.:24000   1st Qu.: 42000  
##  Median :0.06796   Median : 36000   Median :27000   Median : 47000  
##  Mean   :0.06819   Mean   : 40151   Mean   :29501   Mean   : 51494  
##  3rd Qu.:0.08756   3rd Qu.: 45000   3rd Qu.:33000   3rd Qu.: 60000  
##  Max.   :0.17723   Max.   :110000   Max.   :95000   Max.   :125000  
##                                                                     
##   College_jobs    Non_college_jobs Low_wage_jobs  
##  Min.   :     0   Min.   :     0   Min.   :    0  
##  1st Qu.:  1675   1st Qu.:  1591   1st Qu.:  340  
##  Median :  4390   Median :  4595   Median : 1231  
##  Mean   : 12323   Mean   : 13284   Mean   : 3859  
##  3rd Qu.: 14444   3rd Qu.: 11783   3rd Qu.: 3466  
##  Max.   :151643   Max.   :148395   Max.   :48207  
##

From the above summary it can be seen that across all majors:

Total students count ranges from 124 to 393735
The minimum number of men is 119 and maximum is 173809, while minimum number of women is 0 and maximum is 307087.
On average the share of women across majors is 0.5222
On average the unemployment rate across majors is 0.06819 (the median is 0.06796)
The median salary ranges from 22000 to 110000

1. What are the 15 most popular majors?

majors_sorted <- recent_grads %>%
  arrange(desc(Median)) %>%
  mutate(Major = str_to_title(Major),
         Major = fct_reorder(Major, Median))

majors_sorted %>%
  mutate(Major = fct_reorder(Major, Total)) %>%
  arrange(desc(Total)) %>%
  head(15) %>%
  ggplot(aes(Major, Total, fill = Major_category)) +
  geom_col() +
  coord_flip() +
  scale_y_continuous() +
  labs(x = "Major",
       y = "Total number of graduates",
       title = "15 most popular majors")+
  scale_fill_discrete(name = "Major Category")

Psychology seems to be the most popular major among recent grads, followed by Business Management and Administration. Although the Psychology major tops the list, about a third of the 15 majors are from the Business category.

2. What are the 15 least popular majors?

majors_sorted %>%
  mutate(Major = fct_reorder(Major, Total)) %>%
  arrange(desc(Total)) %>%
  tail(15) %>%
  ggplot(aes(Major, Total, fill = Major_category)) +
  geom_col() +
  coord_flip() +
  scale_y_continuous() +
  labs(x = "Major",
       y = "Total number of graduates",
       title = "15 least popular majors")+
  scale_fill_discrete(name = "Major Category")

Military Technologies is the least popular major, with fewer than 500 people taking it. The Engineering and Education categories together dominate the least popular majors.

Which major category is most popular?

majors_sorted %>%
  group_by(Major_category) %>%
  summarize(Total = mean(Total))%>%
  mutate(Major_category = fct_reorder(Major_category, Total)) %>%
  ggplot(aes(Major_category, Total)) +
  geom_col() +
  scale_y_continuous() +
  coord_flip()+
  labs(x = "Major Category",
       y = "Average number of graduates per major",
       title = "Average # of graduate students across Major categories")

Business seems to be the most popular major category (by average number of graduates per major), while Engineering and Interdisciplinary are the least popular. Although Psychology is the most popular major, Psychology & Social Work is only the 4th most popular major category.
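The ranking above depends on averaging Total within each category rather than summing it. As a rough illustration, the sketch below uses a made-up data frame (the category names "X" and "Y" and the counts are hypothetical) to show that the two summaries can rank categories differently.

```r
# Hypothetical toy data; category names and counts are made up.
toy <- data.frame(
  Major_category = c("X", "X", "X", "Y"),
  Total = c(100, 200, 300, 500)
)

# Average graduates per major within each category (what mean(Total) measures)
avg_by_cat <- tapply(toy$Total, toy$Major_category, mean)

# Total graduates in each category (what sum(Total) would measure)
sum_by_cat <- tapply(toy$Total, toy$Major_category, sum)

print(avg_by_cat)  # Y has the larger average
print(sum_by_cat)  # X has the larger total
```

A category with many small majors can therefore look unpopular by average even when it has a large overall head count.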

Which major categories earn the most?

majors_sorted %>%
  mutate(Major_category = fct_reorder(Major_category, Median)) %>%
  ggplot(aes(Major_category, Median, fill = Major_category)) +
  geom_boxplot() +
  scale_y_continuous(labels = dollar_format()) +
  expand_limits(y = 0) +
  coord_flip() +
  theme(legend.position = "none")+
  labs(x = "Major Category",
       y = "Median Salary",
       title = "Median Salaries across Major categories")

It's interesting that the least popular major categories earn the most, while the most popular ones are further down the scale. Business and Social Science are the only categories among the five most popular that also rank among the highest in median pay.

How is the gender distribution across popular majors?

majors_sorted %>%
  arrange(desc(Total)) %>%
  head(15) %>%
  mutate(Major = fct_reorder(Major, Total)) %>%
  gather(Gender, Number, Men, Women) %>%
  ggplot(aes(Major, Number, fill = Gender)) +
  geom_col() +
  coord_flip()+
  labs(x = "Major",
       y = "Total Number of Graduates",
       title = "Gender Distribution across top 15 majors")

Women are most numerous in the majors Psychology, Nursing and Elementary Education, and least numerous in Finance.

How does the gender distribution correlate with earnings?

by_major_category <- majors_sorted %>%
  filter(!is.na(Total)) %>%
  group_by(Major_category) %>%
  summarize(Men = sum(Men),
            Women = sum(Women),
            Total = sum(Total),
            MedianSalary = median(Median)) %>%
  mutate(ShareWomen = Women / Total) %>%
  arrange(desc(ShareWomen))

library(ggrepel)

by_major_category %>%
  mutate(Major_category = fct_lump(Major_category, 8)) %>%
  ggplot(aes(ShareWomen, MedianSalary, color = Major_category)) +
  geom_point() +
  # One overall trend line; a separate fit per colour group has too few points to be meaningful
  geom_smooth(aes(group = 1), method = "lm") +
  geom_text_repel(aes(label = Major_category)) +
  expand_limits(y = 0) +
  labs(x = "Share of Women",
       y = "Median salary of Major category",
       title = "Correlation of gender distribution and earnings") +
  # The plot maps Major_category to colour, so the colour scale carries the legend title
  scale_color_discrete(name = "Major Category")

majors_sorted %>%
  ggplot(aes(ShareWomen, Median, color = Major_category)) +
  geom_point() +
  geom_smooth(aes(group = 1), method = "lm") +
  expand_limits(y = 0) +
  labs(x = "Share of Women",
       y = "Median salary of Major",
       title = "Correlation of gender distribution and earnings") +
  scale_color_discrete(name = "Major Category")

Most STEM programs have the lowest share of women and high pay, while Health and Arts have the highest share of women. Business and Social Science seem to be right at the center, with an almost equal share of women and men.
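The plots suggest a negative association but do not quantify it. One possible way to do so is a correlation test, sketched below on made-up values (the `toy` data frame is hypothetical and is not taken from recent_grads); with the real data the inputs would be the ShareWomen and Median columns.

```r
# Hypothetical toy values; NOT taken from recent_grads.
toy <- data.frame(
  ShareWomen = c(0.10, 0.25, 0.40, 0.55, 0.70, 0.85),
  Median     = c(62000, 55000, 47000, 42000, 36000, 33000)
)

# Pearson correlation with a two-sided test of H0: correlation = 0
ct <- cor.test(toy$ShareWomen, toy$Median)
print(ct$estimate)  # sample correlation
print(ct$p.value)   # p-value of the test
```

Such a test describes majors, not individuals: a negative correlation across majors does not by itself show that women earn less than men within a given major.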

How has the estimated Science and Engineering total changed over the years?

plot(se_data$Year, se_data$Science_Engineering_Total , main="Science and engineering major total graduates 2010-2019",
   xlab="year", ylab="Science and Engineering Major Estimated Total", pch=19)

There seems to be a steady increase in the number of students taking Science and Engineering Majors over the years

Inference

indata <- majors_sorted %>%
  filter(!is.na(Total)) %>%
  group_by(Major_category) %>%
  summarize(Men = sum(Men),
            Women = sum(Women),
            Total = sum(Total),
            Employed = sum(Employed),
            Full_time = sum(Full_time),
            Part_time = sum(Part_time),
            Full_time_year_round = sum(Full_time_year_round),
            Unemployed = sum(Unemployed),
            # Parentheses are needed: without them R computes (Unemployed / Unemployed) + Employed
            Unemployed_rate = sum(Unemployed) / (sum(Unemployed) + sum(Employed)),
            College_jobs = sum(College_jobs),
            Non_college_jobs = sum(Non_college_jobs),
            Low_wage_jobs = sum(Low_wage_jobs),
            MedianSalary = median(Median),
            Emp_total = sum(Employed)+ sum(Unemployed)) %>%
  mutate(ShareWomen = Women / Total) %>%
  arrange(desc(Employed))
  
  
  

indata <- as_tibble(indata)
eng_tech <- filter(indata, Major_category =="Engineering" | Major_category =="Computers & Mathematics")
business <- filter(indata, Major_category == "Business")
arts <- filter (indata, Major_category == "Arts")

Two proportion Z-test:

As we could see from the above analysis, Engineering, Computers & Mathematics and Business have the highest median salaries, in that order. Yet most students opt for a Business major over an Engineering or Computers & Mathematics major. I wanted to understand whether that is related to the employment rate of the major categories.

Why z test? :

The data only contains category totals (counts) and median salaries from the samples. As we have counts of employed and unemployed graduates rather than the mean or standard deviation of a continuous variable, we compare the two employment proportions with a two-proportion z-test.

Hypothesis of a two sample Z-test (Upper tailed):

Question: Is the employment rate of people who took E/CM (Engineering or Computers & Mathematics) greater than that of those who took Business?

Null Hypothesis \(\implies\) H₀ : p_e \(\le\) p_b

Alternate Hypothesis \(\implies\) H_A : p_e > p_b

and let significance level \(\alpha\) = 0.05

Decision rule: Since \(\alpha\) = 0.05 and we are using an upper-tailed z-test, the decision rule is to reject H₀ if Z \(\ge\) 1.645.
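The critical value in the decision rule comes from the standard normal distribution. A short sketch of how it can be obtained in R (the variable names `alpha` and `z_crit` are only illustrative):

```r
alpha <- 0.05

# Upper-tailed critical value: the point with probability alpha to its right
z_crit <- qnorm(1 - alpha)
print(round(z_crit, 3))
```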

Assumptions:

Here, let ‘e’ represent ‘Engineering, Computer and Mathematics Major’ and ‘b’ represent ‘Business Major’.

The value of z is given by
\[z = \frac{p_e - p_b}{\sqrt{p (1 - p) (1/n_e + 1/n_b)}}\]
The proportion of Engineering, Computer and Mathematics graduates employed is given by \(p_e\):
\[p_e = \frac{Employed_e}{Employed_e + Unemployed_e} \implies p_e = 0.9317863\]
The proportion of Business graduates employed is given by \(p_b\):
\[p_b = \frac{Employed_b}{Employed_b + Unemployed_b} \implies p_b = 0.9316484\]
The overall proportion of graduates employed is given by \(p\):
\[p = \frac{Employed_e + Employed_b}{(Employed_e + Unemployed_e) + (Employed_b + Unemployed_b)} \implies p = 0.9317003\]
The number of Engineering, Computer and Mathematics graduates counted (employed plus unemployed) is given by \(n_e\), and the corresponding number of Business graduates by \(n_b\):

\[n_e = 706456, n_b = 1168619\]

A two-proportion z-test is valid only when the sample sizes are large enough, i.e. \(n_ep\), \(n_e(1-p)\), \(n_bp\) and \(n_b(1-p)\) should each be \(\ge\) 5. Here, as \(n_ep = 658205.3\), \(n_e(1-p) = 48250.71\), \(n_bp = 1088803\) and \(n_b(1-p) = 79816.29\), we can proceed to perform the z-test on the data.
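The z formula can be evaluated by hand from the figures quoted above. The sketch below does this in base R; it omits the continuity correction that prop.test applies by default, so its result is close to, but not identical with, the prop.test output. Variable names such as `p_pool` and `se` are only illustrative.

```r
# Values as reported in this report
p_e <- 0.9317863; p_b <- 0.9316484
n_e <- 706456;    n_b <- 1168619

# Pooled proportion of employed graduates across both groups
p_pool <- (p_e * n_e + p_b * n_b) / (n_e + n_b)

# Large-sample condition: every expected count should be at least 5
expected <- c(n_e * p_pool, n_e * (1 - p_pool), n_b * p_pool, n_b * (1 - p_pool))
stopifnot(all(expected >= 5))

# z statistic and upper-tailed p-value (no continuity correction)
se <- sqrt(p_pool * (1 - p_pool) * (1 / n_e + 1 / n_b))
z <- (p_e - p_b) / se
p_value <- pnorm(z, lower.tail = FALSE)

print(c(z = z, p_value = p_value))
```

Because z falls well below 1.645, the hand calculation leads to the same decision as the decision rule stated earlier.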

Let's gather the required data into a tibble for ease of reference.

temp_vs_b  <- eng_tech %>%
  summarize(Men = sum(Men),
            Women = sum(Women),
            Major_category = "Engineering, Computers & Mathematics",
            Total = sum(Total),
            Employed = sum(Employed),
            Full_time = sum(Full_time),
            Part_time = sum(Part_time),
            Full_time_year_round = sum(Full_time_year_round),
            Unemployed = sum(Unemployed),
            # Parentheses are needed: without them R computes (Unemployed / Unemployed) + Employed
            Unemployed_rate = sum(Unemployed) / (sum(Unemployed) + sum(Employed)),
            College_jobs = sum(College_jobs),
            Non_college_jobs = sum(Non_college_jobs),
            Low_wage_jobs = sum(Low_wage_jobs),
            MedianSalary = median(MedianSalary),
            Emp_total = sum(Employed)+ sum(Unemployed)) %>%
  mutate(ShareWomen = Women / Total) %>%
  arrange(desc(Employed))

temp_vs_b <- temp_vs_b %>% add_row(business)
temp_vs_a <- temp_vs_b %>% add_row(arts)

Test for hypothesis, H₀ : p_e \(\le\) p_b

res_temp_vs_b <- prop.test(x=c(temp_vs_b$Employed[1],temp_vs_b$Employed[2]), n=c(temp_vs_b$Emp_total[1],temp_vs_b$Emp_total[2]),alternative = "greater") 

res_temp_vs_b

## 
##  2-sample test for equality of proportions with continuity correction
## 
## data:  c(temp_vs_b$Employed[1], temp_vs_b$Employed[2]) out of c(temp_vs_b$Emp_total[1], temp_vs_b$Emp_total[2])
## X-squared = 0.12939, df = 1, p-value = 0.3595
## alternative hypothesis: greater
## 95 percent confidence interval:
##  -0.0004884284  1.0000000000
## sample estimates:
##    prop 1    prop 2 
## 0.9317863 0.9316484

Interpretation of Result:

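From the results of the test, the p-value is 0.3595, which is greater than \(\alpha\) = 0.05. The result is therefore not significant, and we fail to reject the null hypothesis that the employment rate of people who took E/CM (Engineering or Computers & Mathematics) is less than or equal to that of those who took Business. Note that prop.test reports its statistic as X-squared, which for two groups is the square of the z statistic (here with continuity correction), so z is about 0.36, well below the critical value of 1.645.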
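The link between the reported X-squared and the z decision rule can be checked directly. A small sketch, using only the statistic printed in the prop.test output (the variable names are illustrative):

```r
x_squared <- 0.12939  # statistic reported by prop.test

# For two groups, z is the square root of X-squared;
# the positive root applies because prop 1 exceeds prop 2
z <- sqrt(x_squared)

# Upper-tailed p-value from the standard normal distribution
p_value <- pnorm(z, lower.tail = FALSE)

print(c(z = z, p_value = p_value))
```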

Regression Analysis:

Since the College majors data consists of only population and median salary over different major categories, I could not think of a suitable regression model that could be applied on the data.

Application of Approach:

Fit a linear regression model to the se_data time series to predict the number of students taking Science and Engineering majors, given the year.

Formulation

Predict number of students taking Science and Engineering major given the year.

Check for assumptions

plot(se_data$Year, se_data$Science_Engineering_Total , main="Science and Engineering Major Estimated Total over years 2010-2019",
   xlab="year", ylab="Science and Engineering Major Estimated Total", pch=19)

par(mfrow=c(1, 2))  # divide graph area in 2 columns
boxplot(se_data$Year, main="year", sub=paste("Outlier rows: ", boxplot.stats(se_data$Year)$out))  # box plot for 'Year'
boxplot(se_data$Science_Engineering_Total, main="Science_Engineering_Total", sub=paste("Outlier rows: ", boxplot.stats(se_data$Science_Engineering_Total)$out))  # box plot for 'Science_Engineering_Total'

cor(se_data$Year, se_data$Science_Engineering_Total)

## [1] 0.995729

As the value is close to 1, there is a strong positive linear relationship between the variables.

se_lr <- lm( Science_Engineering_Total ~ Year,data = se_data)
summary(se_lr)

## 
## Call:
## lm(formula = Science_Engineering_Total ~ Year, data = se_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -230108 -189820    3226  151470  358029 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.460e+09  4.861e+07  -30.03 1.64e-09 ***
## Year         7.360e+05  2.413e+04   30.50 1.45e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 219200 on 8 degrees of freedom
## Multiple R-squared:  0.9915, Adjusted R-squared:  0.9904 
## F-statistic: 930.5 on 1 and 8 DF,  p-value: 1.448e-09

The residuals look roughly symmetric around zero. The least-squares fitted line has intercept -1.460e+09 and slope 7.360e+05, i.e. an estimated increase of about 736,000 students per year. The p-value for Year (1.45e-09) is below 0.05, so the slope is statistically significantly different from zero. A significant slope indicates a linear trend over 2010-2019; it does not by itself guarantee reliable predictions, especially for years outside the observed range.

As the R-squared is 0.9915, Year explains about 99% of the variation in Science_Engineering_Total. The adjusted R-squared is the R-squared penalized for the number of predictors in the model. The residual degrees of freedom are 8 (10 observations minus 2 estimated coefficients).
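With a single predictor, R-squared is simply the squared correlation, and the adjusted value follows from the sample size. The sketch below reproduces both figures from numbers already reported in this section; `n`, `k` and the other names are illustrative.

```r
r <- 0.995729  # correlation between Year and Science_Engineering_Total
n <- 10        # yearly observations, 2010 through 2019
k <- 1         # one predictor (Year)

# R-squared of a simple linear regression equals the squared correlation
r_squared <- r^2

# Adjusted R-squared penalizes for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

print(round(c(r_squared = r_squared, adj_r_squared = adj_r_squared), 4))
```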

plot(Science_Engineering_Total ~ Year,data=se_data)
abline(se_lr, col="blue")

plot(se_lr)

testdata<- data.frame(Year=2020)
predict(se_lr,testdata)

##        1 
## 26923972
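A point forecast says nothing about its uncertainty. As a hedged sketch, the code below shows how predict can also return a 95% prediction interval. It runs on a synthetic series (the `toy` data frame is made up and is not the real se_data), so the printed numbers are illustrative only.

```r
# Hypothetical synthetic series; NOT the real se_data values.
set.seed(1)
toy <- data.frame(Year = 2010:2019)
toy$Total <- 1000 + 50 * (toy$Year - 2010) + rnorm(10, sd = 10)

fit <- lm(Total ~ Year, data = toy)

# Point forecast plus 95% prediction interval for the following year
pred <- predict(fit, newdata = data.frame(Year = 2020),
                interval = "prediction", level = 0.95)
print(pred)
```

With the real model, the same call with se_lr and testdata would give lower and upper bounds around the 2020 forecast.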

AIC(se_lr)

## [1] 278.0983

BIC(se_lr)

## [1] 279.006

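AIC and BIC are only meaningful when compared across candidate models fitted to the same data; on their own, the values do not show whether the fit is good. The evidence that the model captures the relationship well is the high R-squared and the significant slope reported above.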
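To see what the AIC and BIC figures are made of, they can be rebuilt from the residual standard error in the regression summary. The sketch below assumes the usual Gaussian log-likelihood for a linear model and counts three parameters (intercept, slope, error variance); because the reported standard error is rounded, the results match the output only approximately.

```r
n   <- 10      # observations (8 residual df + 2 coefficients)
rse <- 219200  # residual standard error from the regression summary

# Residual sum of squares recovered from the residual standard error
rss <- rse^2 * (n - 2)

# Gaussian log-likelihood evaluated at the ML variance estimate rss / n
log_lik <- -n / 2 * (log(2 * pi) + log(rss / n) + 1)

n_par <- 3  # intercept, slope, and error variance
aic <- -2 * log_lik + 2 * n_par
bic <- -2 * log_lik + log(n) * n_par

print(round(c(AIC = aic, BIC = bic), 2))
```

The log(rss / n) term shows that both criteria depend on the units of the response, which is why their absolute size is not informative.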

Conclusion

Summary:

From the analysis it can be observed that the majority of graduate students fall under the Business major category, which is also the major category with the 3rd highest median salary. Despite having the highest median salaries, the Engineering and Computers & Mathematics major categories fail to attract more graduate students, with Engineering attracting the fewest of them all. The linear regression analysis shows a steady increase in the number of students opting for Science and Engineering majors over time.

Answers for research questions:

Engineering looks like a good option for a major if you are aiming at high-income jobs after graduation, followed by Computers & Mathematics and Business.
Most women opt for lower-paying majors, and the share of women is noticeably lower in high-paying majors. This pattern is descriptive; no significance test was run on it.

Limitations:

This analysis is based on data from the American Community Survey 2010-2012. Geography and university can largely influence compensation and opportunities, and neither is taken into consideration here. In addition, each organization has its own pay scale, which can be subject to inflation and can vary for the same role and experience level even within the organization. We consider only the median salary of each major.

Future research:

Data relating to geography, job role and experience can be appended to the college majors data to paint a better picture and address the current limitations of the data. The regression model can be used to examine whether the number of students graduating from Science and Engineering majors meets the job demand of a given year.
