# Exploratory data analysis

## 1.1 – Data Analysis, what is it?

Whenever we make a decision in day-to-day life, we think about what happened last time or what is likely to happen if we choose a particular option. In other words, we analyse the past or anticipate the future and decide on that basis, using our memories and expectations as the data. That is data analysis in its simplest form. When an analyst does the same thing for business purposes, working from recorded data, it is called Data Analysis.

## 1.2 – Exploratory Data Analysis, what is it?

Exploratory data analysis (EDA) is a process, or a philosophy, in which we approach the data free of any preconceived assumptions or hypotheses. We first look for patterns in the data, and only then impose views on it and fit models.

Exploratory data analysis can also be used to:

- detect any errors (outliers or anomalies) in the data
- check the assumptions made by any models or statistical tests
- identify the most important/influential variables
- develop parsimonious models, that is, models that explain the data with the minimum number of variables necessary.
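As an illustration of the first use, here is a minimal sketch of one common way to flag possible outliers. It assumes the "1.5 × IQR" rule of thumb, which is a convention and not something the text prescribes. The function name `iqr_outliers` and the claim amounts are made up for the example.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying outside Q1 - k*IQR and Q3 + k*IQR.

    k=1.5 is a common rule of thumb (an assumption here), not a fixed law.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Invented claim amounts; 120 looks like a possible data-entry error.
claims = [12, 15, 14, 13, 16, 15, 14, 120]
print(iqr_outliers(claims))
```

A flagged value is only a candidate for investigation. Whether it is an error or a genuine extreme observation still has to be judged from the context.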

## 1.3 – Data sets and EDA.

For numerical data, this process will include the calculation of summary statistics and the use of data visualisations.

For a single variable, EDA will involve calculating summary statistics (such as mean, median, quartiles, standard deviation, IQR and skewness) and drawing suitable diagrams (such as histograms, boxplots, quantile-quantile (Q-Q) plots and a line chart for time series/ordered data).
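The summary statistics listed above can be sketched with Python's standard `statistics` module. This is only an illustrative example: the helper name `summarise` and the data are invented, and the skewness is computed as the third central moment divided by the second central moment raised to the power 1.5 (divisor n), which is one of several definitions in use.

```python
import statistics

def summarise(values):
    """Return basic summary statistics for a single numerical variable."""
    n = len(values)
    mean = statistics.fmean(values)
    q1, _, q3 = statistics.quantiles(values, n=4)  # lower and upper quartiles
    # Central moments with divisor n, used for the skewness coefficient.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return {
        "mean": mean,
        "median": statistics.median(values),
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
        "sd": statistics.stdev(values),  # sample standard deviation (divisor n - 1)
        "skewness": m3 / m2 ** 1.5,
    }

# Invented sample with one large value, so it is skewed to the right.
sample = [2, 4, 4, 5, 7, 9, 25]
for name, value in summarise(sample).items():
    print(f"{name}: {value:.3f}")
```

For this sample the mean sits above the median and the skewness is positive, both pointing to a long right-hand tail. A histogram or boxplot of the same data would show the same feature visually.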

For bivariate or multivariate data, EDA will involve calculating the summary statistics for each variable and calculating correlation coefficients between each pair of variables. Data visualisation will typically involve scatterplots between each pair of variables.

For multivariate data sets with large dimensionality, techniques such as cluster analysis and principal components analysis (a method closely related to, and sometimes loosely called, factor analysis) can be used to reduce the complexity of the data set.
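To give a feel for how principal components analysis reduces complexity, the sketch below finds only the first principal component of a tiny data set using power iteration on the sample covariance matrix. It is a teaching sketch under stated assumptions, not how the analysis would normally be run: in practice a numerical library would be used, variables are often standardised first, and the function names and data here are invented.

```python
import math

def covariance_matrix(rows):
    """Sample covariance matrix (divisor n - 1) of rows of observations."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centred = [[r[j] - means[j] for j in range(p)] for r in rows]
    return [
        [sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(p)]
        for i in range(p)
    ]

def first_principal_component(rows, iterations=200):
    """Return (direction, share of total variance) for the first component.

    Power iteration: repeatedly multiply a vector by the covariance matrix
    and rescale it to unit length. It converges to the leading eigenvector
    provided the starting vector is not orthogonal to it.
    """
    cov = covariance_matrix(rows)
    p = len(cov)
    v = [1.0 / math.sqrt(p)] * p  # arbitrary unit-length starting vector
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Variance along v, i.e. the leading eigenvalue (v' C v).
    variance = sum(
        v[i] * sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)
    )
    total = sum(cov[i][i] for i in range(p))
    return v, variance / total

# Invented data: three variables that move almost in step with each other.
rows = [(1, 2.0, 1.1), (2, 4.1, 1.9), (3, 5.9, 3.2), (4, 8.2, 3.9), (5, 9.9, 5.1)]
direction, share = first_principal_component(rows)
print([round(x, 3) for x in direction], round(share, 4))
```

When one component accounts for nearly all of the total variance, as it should for data like these, the three original variables can be summarised by a single derived variable with little loss of information.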

## 1.4 – Scatter plot.

Scatterplots are the first step to visualise the data and assess the shape of any correlation between a pair of variables. The strength of that correlation is measured by the sample correlation coefficient, which takes a value from -1 to +1.
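Assuming the coefficient meant here is Pearson's sample correlation coefficient (the usual choice for linear correlation), it can be written for paired observations $(x_i, y_i)$, $i = 1, \dots, n$, with sample means $\bar{x}$ and $\bar{y}$ as:

$$
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}
$$

The numerator measures how the two variables vary together, and the denominator rescales it by their individual spreads, which is what confines $r$ to the range from -1 to +1.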

## 1.5 – Linear correlation.

Linear correlation between a pair of variables looks at the strength of the linear relationship between them.
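The point that this coefficient captures only the linear part of a relationship can be seen in a short sketch. It assumes Pearson's product-moment formula, and the function name `pearson_r` and the data are made up for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson's sample correlation coefficient for paired observations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [-3, -2, -1, 0, 1, 2, 3]
increasing = [2 * v + 1 for v in x]   # exact straight line, upward slope
decreasing = [-2 * v + 1 for v in x]  # exact straight line, downward slope
curved = [v ** 2 for v in x]          # exact relationship, but not linear

print(pearson_r(x, increasing))
print(pearson_r(x, decreasing))
print(pearson_r(x, curved))
```

The two straight lines give coefficients at the extremes of the range, while the symmetric curve gives a coefficient of zero even though y is completely determined by x. A coefficient near zero therefore rules out a linear relationship only, which is why the scatterplot should always be inspected as well.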

The diagrams below show the various degrees of correlation:


Written by Pratyaksh (pursuing graduation from St. Xavier's Calcutta and has cleared 2 actuarial papers)