8 min read · Jan 20, 2020

EDA | Analysis of World Happiness Score

Exploratory data analysis (EDA) is the process of exploring data to find relationships between variables, identify trends and patterns, and test assumptions. The purpose of EDA is simple: to know your dataset better before moving on to more advanced topics. This EDA series consists of three articles: data cleansing, data exploration, and feature engineering, each using an end-to-end example and plain English. In this article, we will talk about data exploration and hypothesis testing using Python. Let’s get started!

Introduction to the analysis

This research focuses on the World Happiness Report, a landmark survey of the state of global happiness that ranks 156 countries by how happy their citizens perceive themselves to be. By the end, I want to answer two questions: are we getting happier (testing an assumption), and what makes a country’s citizens happy (identifying relationships)?

The data is from Kaggle. Originally there are five separate datasets, one happiness report per year. We can use df.head() and df.info() to take a quick look at each one. None of the datasets has missing values, but the column names are not consistent. For example, 2019.csv and 2018.csv have no region column (only countries), and variable names differ between files even when they represent the same thing. It’s necessary to decide what you want to do with the data first, and then prepare the dataset so that it can be processed easily. In my project, I want to answer whether the happiness score changed from 2015 to 2019. I know I will need a boxplot with year on the x-axis and happiness score on the y-axis, so I need to merge the datasets (2015.csv, 2017.csv and 2019.csv) into one large dataset with ‘year’ as an additional column. This is a great article about how to combine Pandas data frames. Here’s how it looks at the end:
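The merge step described above can be sketched roughly like this. It is a minimal illustration, not the exact code from my notebook: the tiny data frames stand in for the CSV files, the scores are made up, and the column names and the RENAME mapping are assumptions about how the yearly files differ.

```python
import pandas as pd

# Tiny stand-ins for pd.read_csv("2015.csv") etc. The values are made up and
# the differing column names are assumptions used only for illustration.
df_2015 = pd.DataFrame({"Country": ["A", "B"], "Happiness Score": [7.0, 3.0]})
df_2017 = pd.DataFrame({"Country": ["A", "B"], "Happiness.Score": [7.1, 3.2]})
df_2019 = pd.DataFrame({"Country or region": ["A", "B"], "Score": [7.2, 3.5]})

# Hypothetical mapping from each file's header to one shared name.
RENAME = {
    "Happiness Score": "score",
    "Happiness.Score": "score",
    "Score": "score",
    "Country": "country",
    "Country or region": "country",
}

def tidy(df, year):
    """Rename columns to the shared names and tag every row with its year."""
    out = df.rename(columns=RENAME)[["country", "score"]].copy()
    out["year"] = year
    return out

frames = {2015: df_2015, 2017: df_2017, 2019: df_2019}
happiness = pd.concat(
    [tidy(df, year) for year, df in frames.items()],
    ignore_index=True,
)
print(happiness)
```

Stacking the years in "long" format like this is what makes it easy to put year on the x-axis of a boxplot later.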

Data Exploration

Data exploration is the process of applying statistics and visualization techniques to uncover the hidden information in data.

Data Type

It’s important to know the data types before anything else, because the data type determines how we approach a data science problem. If the target is categorical, we can formulate a classification problem; if it is continuous, we can formulate a regression problem. In this example, the categorical data are year, country and region. The continuous data are score, GDP_per_capita, social_support, freedom_to_make_life_choices, generosity and healthy_life_expectancy. To know our dataset better, we can just be naturally curious about the data.

region/country: How is the happiness score distributed across the map? Which countries’ citizens are the happiest?

year: Are we getting happier? Are we getting richer, healthier and more socially active?

continuous factors: What’s the relationship between the happiness score and each life factor? What’s the relationship between the factors themselves (e.g. freedom to make life choices vs GDP, social support vs health…)?

The more curious we are about the dataset, the better we can understand it!

Knowing by seeing👀

How is the happiness score spread across the world?

Sadly, I can’t show the interactive map plot here, but we can still clearly see that Western Europe, North America, and Australia & New Zealand have high happiness scores, while Sub-Saharan Africa and Southern Asia have low scores. Why do we need to know this? Because Simpson's paradox exists! For example, after analyzing the overall trend of the world happiness score, we might conclude that we are getting happier. But when we dig deeper and look at the trend in each region separately, it’s possible that people in most regions are not getting happier. The reason could be that one specific region developed fast and people there became much happier. That one region raised the overall happiness level, while the other regions stayed the same or even got worse.
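To make that scenario concrete, here is a toy sketch with completely made-up numbers (three hypothetical regions A, B and C, not real data). It only shows how an overall average can rise while most regions decline.

```python
import pandas as pd

# Made-up scores: region A improves sharply, regions B and C decline slightly.
toy = pd.DataFrame({
    "region": ["A", "A", "B", "B", "C", "C"],
    "year":   [2015, 2019, 2015, 2019, 2015, 2019],
    "score":  [4.0, 6.0, 6.0, 5.8, 5.0, 4.9],
})

# Overall picture: mean score per year across all regions.
overall = toy.groupby("year")["score"].mean()

# Regional picture: one row per region, one column per year, plus the change.
by_region = toy.pivot(index="region", columns="year", values="score")
by_region["change"] = by_region[2019] - by_region[2015]

print(overall)
print(by_region)
```

The overall mean goes up from 2015 to 2019, yet two of the three regions have a negative change, which is exactly why the regional view is worth checking.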

What can we learn from the boxplot? Outliers, the median and the data range (minimum and maximum)! Why are those valuable to us? Because by understanding how our data is spread out, we can judge how well a summary such as the mean represents the data. A boxplot is especially helpful when there are many groups and we need to compare their distributions, just like in our case: comparing the happiness score distribution across 10 regions! From the boxplot, we can clearly see that the Middle East and Northern Africa has the largest variability. That means the mean happiness score alone is a poor summary of that region!
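The numbers a boxplot draws can also be computed directly. The sketch below uses made-up scores for two hypothetical regions X and Y, and the helper box_stats is my own illustrative name; it applies the usual 1.5 × IQR rule for flagging outliers.

```python
import pandas as pd

# Made-up scores for two hypothetical regions: X is tight, Y is spread out.
toy = pd.DataFrame({
    "region": ["X"] * 5 + ["Y"] * 5,
    "score":  [5.0, 5.1, 5.2, 5.3, 5.4, 3.0, 4.0, 5.0, 6.0, 7.0],
})

def box_stats(scores):
    """Return the median, IQR and outlier count that a boxplot would show."""
    q1, median, q3 = scores.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    n_outliers = ((scores < lower) | (scores > upper)).sum()
    return pd.Series({"median": median, "iqr": iqr, "n_outliers": n_outliers})

# One row of statistics per region.
summary = pd.DataFrame(
    {region: box_stats(group) for region, group in toy.groupby("region")["score"]}
).T
print(summary)
```

A region with a large IQR, like Y here, is one where a single mean hides a lot of variation between countries.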

Hypothesis Testing

Let’s get back on track: are we getting happier?

From the boxplot, it looks like there is no significant change over time. We’ll check that with a hypothesis test:

H0: There is no change in the happiness score from 2015 to 2019.
Ha: There is a change in the happiness score from 2015 to 2019.

Before anything else, we have to know whether the variable is normally distributed. If it is, we can use a parametric test, which relies on parameter estimates (mean, standard deviation…) to represent the information in the data. If we can’t assume a distribution for the variable, we can conduct a non-parametric test (a distribution-free method).

Three ways to test normality:

scipy.stats.describe(): returns the number of observations, min & max, mean, variance, skewness and kurtosis
visually see it: distribution plot
scipy.stats.shapiro(): returns the Shapiro-Wilk W statistic; values close to 1 indicate that the distribution is similar to a normal distribution. The test also provides a p-value: p < 0.05 indicates non-normality at the 5% significance level
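The two SciPy checks in the list above could be run roughly as follows. This is a sketch on synthetic data: the array scores is a randomly generated stand-in for one year's happiness scores, and its mean, spread and size are assumptions, not values from the report.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for one year's happiness scores (assumed shape and size).
rng = np.random.default_rng(0)
scores = rng.normal(loc=5.4, scale=1.1, size=150)

# Descriptive statistics, including skewness and (excess) kurtosis.
summary = stats.describe(scores)
print(f"n={summary.nobs}, min/max={summary.minmax}")
print(f"skewness={summary.skewness:.3f}, kurtosis={summary.kurtosis:.3f}")

# Shapiro-Wilk test: W close to 1 and a large p-value are consistent with normality.
w_stat, p_value = stats.shapiro(scores)
print(f"Shapiro-Wilk W={w_stat:.3f}, p={p_value:.3f}")
```

Note that scipy.stats.describe() reports excess kurtosis by default, so a normal distribution scores around 0 rather than 3.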

result of normality test for happiness score in 2015, 2017, 2019

A good rule of thumb is that anything with kurtosis from -2.5 to 2.5 and skewness from -1.5 to 1.5 is close enough for a t-test to work well.
Shapiro-Wilk test: a W statistic close to 1 indicates that the distribution is similar to a normal distribution. The test also provides a p-value: one under 0.05 indicates non-normality at the 5% significance level.

Result: kurtosis and skewness don’t look too bad, but the distribution plot and the Shapiro-Wilk test show non-normality in the happiness score for 2015. I will conduct both a parametric test and a non-parametric test. The parametric test is still reasonable because the sample size is large enough: according to the central limit theorem, the sampling distribution of the mean of independent random variables from any distribution with finite variance will be normal or nearly normal when the sample is large.

F_onewayResult(statistic=0.17523715047000546, pvalue=0.8393148311404635)
KruskalResult(statistic=0.3799820728679535, pvalue=0.8269665464793793)

The p-values of both the one-way ANOVA and the Kruskal-Wallis H test are larger than 0.05, so we fail to reject the null hypothesis: there is no significant change in happiness. We have no evidence that we are getting happier :(
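For reference, the two tests can be called as in the sketch below. The three groups here are synthetic samples drawn from the same distribution (an assumption made purely for illustration), so the printed numbers will not match the results reported above.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for the 2015, 2017 and 2019 happiness scores.
rng = np.random.default_rng(42)
groups = {year: rng.normal(5.4, 1.1, size=150) for year in (2015, 2017, 2019)}

# Parametric test: one-way ANOVA compares the group means.
anova = stats.f_oneway(*groups.values())

# Non-parametric test: Kruskal-Wallis H compares the groups using ranks.
kruskal = stats.kruskal(*groups.values())

alpha = 0.05  # significance level
for name, result in [("One-way ANOVA", anova), ("Kruskal-Wallis", kruskal)]:
    decision = "reject H0" if result.pvalue < alpha else "fail to reject H0"
    print(f"{name}: statistic={result.statistic:.3f}, "
          f"p={result.pvalue:.3f} -> {decision}")
```

Running both side by side is a cheap sanity check: if the parametric and non-parametric tests agree, the conclusion does not hinge on the normality assumption.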

What actually affects people’s happiness level?

Correlation

Two ways to identify correlation between factors:

1. visually see it: seaborn.scatterplot() / seaborn.regplot() / seaborn.heatmap() / seaborn.pairplot()

2. scipy.stats.pearsonr()

Result:

The factor GDP_per_capita VS happiness score Pearsonr Score for year 2019 is (0.7964065217934377, 6.421354314507238e-34)
The factor social_support VS happiness score Pearsonr Score for year 2019 is (0.7725760557206915, 8.463944246134181e-31)
The factor healthy_life_expectancy VS happiness score Pearsonr Score for year 2019 is (0.7795802474390381, 1.1247986484545201e-31)
The factor freedom_to_make_life_choices VS happiness score Pearsonr Score for year 2019 is (0.5581598777207817, 1.4077799289784974e-13)
The factor generosity VS happiness score Pearsonr Score for year 2019 is (0.0829869946144349, 0.3143237050610324)
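Output in the format above could come from a loop like the following sketch. The data frame here is synthetic: I generate a score that depends on GDP_per_capita and not on generosity, so the coefficients are illustrative only and the generating numbers are assumptions.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic data: score is built from GDP_per_capita plus noise,
# while generosity is generated independently of score.
rng = np.random.default_rng(1)
n = 150
gdp = rng.normal(0.9, 0.4, n)
df = pd.DataFrame({
    "GDP_per_capita": gdp,
    "generosity": rng.normal(0.18, 0.1, n),
    "score": 3.4 + 2.2 * gdp + rng.normal(0, 0.3, n),
})

# pearsonr returns the correlation coefficient and a two-sided p-value.
results = {}
for factor in ["GDP_per_capita", "generosity"]:
    r, p = stats.pearsonr(df[factor], df["score"])
    results[factor] = (r, p)
    print(f"The factor {factor} VS happiness score: r={r:.3f}, p={p:.3g}")
```

The first number in each pair is the Pearson coefficient (between -1 and 1), and the second is the p-value for the null hypothesis of no linear correlation.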

Not surprisingly, the factors most strongly correlated with happiness are GDP, social support and life expectancy. These three also show strong correlations among themselves (e.g. GDP with health, GDP with social support…).

Does every region show the same pattern?

The closer the points of one color gather, the more similar the countries in that region are. There’s still a clear correlation with GDP per capita even within every region. But when it comes to healthy life expectancy, the correlation weakens. In northern Europe, countries share a very similar level of life expectancy, but their happiness scores differ a lot; meanwhile, a huge gap in healthy life expectancy exists within Sub-Saharan Africa, but countries there maintain a similar happiness level. We should be cautious about drawing conclusions from the macro level alone. When we dig deeper into each group, it may show a different pattern. This is the biggest lesson I got from this EDA.
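The same comparison can be made numerically by computing the correlations within each region. Here is a minimal sketch on synthetic data, assuming two hypothetical regions in which the score depends on GDP_per_capita but not on healthy_life_expectancy.

```python
import numpy as np
import pandas as pd

# Synthetic data for two hypothetical regions with different base scores.
rng = np.random.default_rng(7)
frames = []
for region, base in [("Region A", 7.0), ("Region B", 4.5)]:
    gdp = rng.normal(1.0, 0.2, 40)
    frames.append(pd.DataFrame({
        "region": region,
        "GDP_per_capita": gdp,
        "healthy_life_expectancy": rng.normal(0.8, 0.1, 40),
        "score": base + 2.0 * (gdp - 1.0) + rng.normal(0, 0.2, 40),
    }))
df = pd.concat(frames, ignore_index=True)

# Pearson correlation of each factor with the score, computed per region.
factors = ["GDP_per_capita", "healthy_life_expectancy"]
by_region = pd.DataFrame({
    region: group[factors].corrwith(group["score"])
    for region, group in df.groupby("region")
}).T
print(by_region)
```

Each row is a region and each column is a factor, so a factor that correlates strongly overall but weakly inside every region shows up immediately.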

Conclusion

We first looked at how the happiness score is spread over the map, and we found that countries within each region share a similar happiness level. In the hypothesis test, we failed to reject the null hypothesis, so we found no evidence that we are getting happier. Then we tried to identify the relationships between the factors themselves, and between each factor and the happiness score, to figure out what makes citizens happier. That’s GDP, social support and life expectancy. During the exploration, we found that we can’t draw conclusions from the global scope alone. We saw how a lurking variable can affect the research conclusion, which gives us a lesson: sometimes we should break our analysis down to the micro level. As a next step, I might run the hypothesis tests in each region to verify the conclusion we got at the macro level: WE ARE NOT GETTING HAPPIER.

In the end, I just want to say I really appreciate this supportive tech community. I started my data science journey just one month ago, and along the way I got so much help from intelligent, caring, encouraging peers, my amazing mentor Athanasios, and the knowledgeable technical coaches at Thinkful. I will keep sharing my thoughts and findings: on the one hand, sharing is loving; on the other, it helps me keep track of my learning process. JIAYOU! 🥰

Any corrections and suggestions are greatly appreciated! Thanks!