# An introduction to inferential statistics

While descriptive statistics summarize the characteristics of a data set, **inferential statistics** help you come to conclusions and make predictions based on your data.

When you have collected data from a sample, you can use inferential statistics to understand the larger population from which the sample is taken.

Inferential statistics have two main uses:

- making **estimates** about populations (for example, the mean SAT score of all 11th graders in the US).
- **testing hypotheses** to draw conclusions about populations (for example, the relationship between SAT scores and family income).

## Descriptive versus inferential statistics

**Descriptive statistics** allow you to *describe* a data set, while **inferential statistics** allow you to make *inferences* based on a data set.

### Descriptive statistics

Using descriptive statistics, you can report characteristics of your data:

- The **distribution** concerns the frequency of each value
- The **central tendency** concerns the averages of the values
- The **variability** concerns how spread out the values are

In descriptive statistics, there is no uncertainty – the statistics precisely describe the data that you collected. If you collect data from an entire population, you can directly compare these descriptive statistics to those from other populations.
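As a minimal sketch, the three characteristics above can be computed with Python's standard library. The scores are hypothetical values invented for illustration, and the choice of range and standard deviation as spread measures is an assumption, not the only option.

```python
from collections import Counter
import statistics

# Hypothetical test scores, invented for illustration.
scores = [70, 85, 85, 90, 60, 85, 70, 95]

# Distribution: how often each value occurs.
distribution = Counter(scores)

# Central tendency: three common averages.
mean = statistics.mean(scores)
median = statistics.median(scores)
mode = statistics.mode(scores)

# Variability: how spread out the values are.
value_range = max(scores) - min(scores)
std_dev = statistics.pstdev(scores)  # describes exactly these values, no inference

print(dict(distribution))
print(mean, median, mode)
print(value_range, round(std_dev, 2))
```

Nothing here is estimated: each number is an exact summary of the eight values in `scores`.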

### Inferential statistics

Most of the time, you can only acquire data from samples, because it is too difficult or expensive to collect data from the whole population that you’re interested in.

While descriptive statistics can only summarize a sample’s characteristics, inferential statistics use your sample to make reasonable guesses about the larger population.

With inferential statistics, it’s important to use random and unbiased sampling methods. If your sample isn’t representative of your population, then you can’t make valid statistical inferences.

### Sampling error in inferential statistics

Since the size of a sample is always smaller than the size of the population, some of the population isn’t captured by sample data. This creates **sampling error**, which is the difference between the true population values (called parameters) and the measured sample values (called statistics).

Sampling error arises any time you use a sample, even if your sample is random and unbiased. For this reason, there is always some uncertainty in inferential statistics. However, using probability sampling methods reduces this uncertainty.
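A small simulation can make sampling error concrete. The sketch below assumes a hypothetical population of 100,000 normally distributed scores (mean 500, standard deviation 100, both invented) and draws simple random samples from it. The function names are illustrative.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population, invented for illustration.
population = [random.gauss(500, 100) for _ in range(100_000)]
parameter = statistics.fmean(population)  # population mean (a parameter)

def sampling_error(sample_size):
    """Draw a simple random sample and return statistic minus parameter."""
    sample = random.sample(population, sample_size)
    statistic = statistics.fmean(sample)  # sample mean (a statistic)
    return statistic - parameter

def typical_error(sample_size, repeats=200):
    """Average absolute sampling error over many repeated samples."""
    errors = [abs(sampling_error(sample_size)) for _ in range(repeats)]
    return statistics.fmean(errors)

small = typical_error(25)
large = typical_error(2500)
print(f"n=25: {small:.2f}   n=2500: {large:.2f}")
```

Even though every sample is random and unbiased, the error is never exactly zero. In this setup it shrinks as the sample grows.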

## Estimating population parameters from sample statistics

The characteristics of samples and populations are described by numbers called statistics and parameters:

- A **statistic** is a measure that describes the sample (e.g., sample mean).
- A **parameter** is a measure that describes the whole population (e.g., population mean).

Sampling error is the difference between a parameter and a corresponding statistic. Since in most cases you don’t know the real population parameter, you can use inferential statistics to estimate these parameters in a way that takes sampling error into account.

There are two important types of estimates you can make about the population: point estimates and interval estimates.

- A **point estimate** is a single value estimate of a parameter. For instance, a sample mean is a point estimate of a population mean.
- An **interval estimate** gives you a range of values where the parameter is expected to lie. A **confidence interval** is the most common type of interval estimate.

Both types of estimates are important for gathering a clear idea of where a parameter is likely to lie.
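The two kinds of estimate can be sketched together. The helper `estimate_mean` below is hypothetical, and it assumes a random sample large enough for the normal approximation to be reasonable; with small samples, a t-based interval would be somewhat wider. The sample itself is simulated, not real data.

```python
import math
import random
import statistics
from statistics import NormalDist

def estimate_mean(sample, confidence=0.95):
    """Return (point_estimate, (lower, upper)) for a population mean.

    Assumption: the sample is random and large enough that the normal
    approximation is acceptable.
    """
    n = len(sample)
    point = statistics.fmean(sample)                       # point estimate
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)         # about 1.96 for 95%
    margin = z * standard_error
    return point, (point - margin, point + margin)         # interval estimate

random.seed(3)
# Hypothetical sample of 100 scores, simulated for illustration.
sample = [random.gauss(1000, 200) for _ in range(100)]

point, (lower, upper) = estimate_mean(sample)
print(f"point estimate: {point:.1f}")
print(f"95% interval estimate: {lower:.1f} to {upper:.1f}")
```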

### Confidence intervals

A **confidence interval** uses the variability around a statistic to come up with an interval estimate for a parameter. Confidence intervals are useful for estimating parameters because they take sampling error into account.

While a point estimate gives you a precise value for the parameter you are interested in, a confidence interval tells you the uncertainty of the point estimate. They are best used in combination with each other.

Each confidence interval is associated with a confidence level. A confidence level tells you the percentage of intervals that would contain the true parameter if you repeated the study many times with new samples.

A 95% confidence interval means that if you repeat your study with a new sample in exactly the same way 100 times, you can expect about 95 of the resulting intervals to contain the population parameter.

You cannot say for sure whether any single interval contains the actual population parameter. That’s because you can’t know the true value of the population parameter without collecting data from the full population.

However, with random sampling and a suitable sample size, you can reasonably expect your confidence interval to contain the parameter a certain percentage of the time.
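One way to check this expectation is to simulate a case where the parameter is known. The sketch below assumes a hypothetical normal population (mean 50, standard deviation 10) and counts how often a normal-approximation 95% interval contains the true mean. All names and numbers are illustrative.

```python
import math
import random
import statistics
from statistics import NormalDist

random.seed(0)

TRUE_MEAN = 50.0  # hypothetical parameter, known only because we simulate
TRUE_SD = 10.0
Z_95 = NormalDist().inv_cdf(0.975)

def interval_covers_parameter(sample_size=100):
    """Draw one sample, build a 95% interval, report whether it contains TRUE_MEAN."""
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(sample_size)]
    mean = statistics.fmean(sample)
    margin = Z_95 * statistics.stdev(sample) / math.sqrt(sample_size)
    return mean - margin <= TRUE_MEAN <= mean + margin

repeats = 2000
coverage = sum(interval_covers_parameter() for _ in range(repeats)) / repeats
print(f"{coverage:.1%} of intervals contained the true mean")
```

The printed share should land close to 95%, though not exactly on it, because the simulation itself is subject to chance.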

## Hypothesis testing

**Hypothesis testing** is a formal process of statistical analysis using inferential statistics. The goal of hypothesis testing is to compare populations or assess relationships between variables using samples.

Hypotheses, or predictions, are tested using statistical tests. Statistical tests also estimate sampling errors so that valid inferences can be made.
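The general logic of a statistical test can be sketched with a permutation test. This is not one of the tests named in this article; it is used here only because it fits in a few lines of standard-library Python. It asks how often a difference at least as large as the observed one appears when group labels are shuffled at random. The scores are hypothetical.

```python
import random
import statistics

random.seed(1)

def permutation_test(group_a, group_b, n_permutations=5000):
    """Two-sided permutation test for a difference in means (approximate p-value)."""
    observed = abs(statistics.mean(group_a) - statistics.mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        random.shuffle(pooled)  # reassign group labels at random
        diff = abs(statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    # Adding 1 to both counts keeps the estimate from being exactly zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)

# Hypothetical exam scores, invented for illustration.
tutored = [78, 85, 92, 88, 95, 81, 90, 87]
control = [72, 75, 80, 68, 77, 74, 79, 70]

p_value = permutation_test(tutored, control)
print(f"p = {p_value:.4f}")
```

A small p-value means a difference this large would rarely arise from sampling error alone.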

Statistical tests can be parametric or non-parametric. Parametric tests are considered more statistically powerful because, when their assumptions hold, they are more likely to detect an effect if one exists.

Parametric tests make assumptions that include the following:

- the population that the sample comes from follows a normal distribution of scores
- the sample size is large enough to represent the population
- the variances, a measure of spread, of each group being compared are similar

When your data violates any of these assumptions, non-parametric tests are more suitable. Non-parametric tests are called “distribution-free tests” because they don’t assume that the population data follow a particular distribution.

Statistical tests come in three forms: tests of comparison, correlation or regression.

### Comparison tests

**Comparison tests** assess whether there are differences in means, medians or rankings of scores of two or more groups.

To decide which test suits your aim, consider whether your data meets the conditions necessary for parametric tests, the number of samples, and the levels of measurement of your variables.

Means can only be found for interval or ratio data, while medians and rankings are more appropriate measures for ordinal data.

Comparison test | Parametric? | What’s being compared? | Samples |
---|---|---|---|
t-test | Yes | Means | 2 samples |
ANOVA | Yes | Means | 3+ samples |
Mood’s median | No | Medians | 2+ samples |
Wilcoxon signed-rank | No | Distributions | 2 samples |
Wilcoxon rank-sum (Mann-Whitney U) | No | Sums of rankings | 2 samples |
Kruskal-Wallis H | No | Mean rankings | 3+ samples |
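To illustrate one row of the table, here is a rough sketch of the Wilcoxon rank-sum (Mann-Whitney U) test. It makes simplifying assumptions that are labeled in the code: no tied values, and a normal approximation without continuity correction for the p-value. A statistics package would handle these details more carefully. The data are hypothetical.

```python
import math
from statistics import NormalDist

def mann_whitney_u(group_a, group_b):
    """Return (U, approximate two-sided p-value).

    Assumptions: no tied values; normal approximation without
    continuity correction, so the p-value is only approximate.
    """
    labelled = sorted([(v, "a") for v in group_a] + [(v, "b") for v in group_b])
    # Ranks run from 1 (smallest value) to n1 + n2 (largest value).
    rank_sum_a = sum(
        rank for rank, (_, label) in enumerate(labelled, start=1) if label == "a"
    )
    n1, n2 = len(group_a), len(group_b)
    u_a = rank_sum_a - n1 * (n1 + 1) / 2
    u = min(u_a, n1 * n2 - u_a)
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u  # never positive, because u is the smaller U
    return u, 2 * NormalDist().cdf(z)

# Hypothetical measurements for two independent groups.
group_a = [12.1, 14.3, 11.8, 15.2, 13.4, 12.9, 14.8, 13.1]
group_b = [15.9, 16.4, 14.1, 17.2, 15.5, 16.8, 14.6, 16.1]

u, p_value = mann_whitney_u(group_a, group_b)
print(f"U = {u}, p = {p_value:.4f}")
```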

### Correlation tests

**Correlation tests** determine the extent to which two variables are associated.

Although Pearson’s *r* is the most statistically powerful test, Spearman’s *r* is appropriate for interval and ratio variables when the data doesn’t follow a normal distribution.

Of the tests listed here, the chi square test of independence is the only one that can be used with nominal variables.

Correlation test | Parametric? | Variables |
---|---|---|
Pearson’s r | Yes | Interval/ratio variables |
Spearman’s r | No | Ordinal/interval/ratio variables |
Chi square test of independence | No | Nominal/ordinal variables |
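The difference between the first two coefficients can be sketched by hand. The functions below are illustrative implementations, not a library API, and they compute only the coefficients, not their significance tests. Spearman’s coefficient is taken as Pearson’s *r* applied to ranks, with tied values sharing an average rank.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def average_ranks(values):
    """Rank from 1 (smallest); tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman_r(x, y):
    """Spearman rank correlation: Pearson's r computed on the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Hypothetical data with a monotonic but non-linear relationship.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]

print(round(pearson_r(x, y), 3), round(spearman_r(x, y), 3))
```

Because `y` always increases with `x`, the rank-based coefficient reaches 1, while Pearson’s *r* stays slightly below 1 because the relationship is not a straight line.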

### Regression tests

**Regression tests** assess whether changes in predictor variables are associated with changes in an outcome variable. On their own they do not prove causation; that depends on the research design. You can decide which regression test to use based on the number and types of variables you have as predictors and outcomes.

Most of the commonly used regression tests are parametric. If your data is not normally distributed, you can perform data transformations.

Data transformations help you make your data normally distributed using mathematical operations, like taking the square root of each value.
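As a rough illustration, the sketch below applies a square-root transformation to hypothetical right-skewed data and compares skewness before and after. The gamma-distributed data and the `skewness` helper are assumptions made for the example. The transformed values end up closer to symmetric, not perfectly normal.

```python
import math
import random
import statistics

random.seed(7)

def skewness(values):
    """Population skewness: 0 for symmetric data, positive for a long right tail."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean([((v - mean) / sd) ** 3 for v in values])

# Hypothetical right-skewed measurements (for example, waiting times).
raw = [random.gammavariate(2.0, 3.0) for _ in range(5000)]

# Square-root transformation: compresses large values more than small ones.
transformed = [math.sqrt(v) for v in raw]

print(f"skewness before: {skewness(raw):.2f}")
print(f"skewness after:  {skewness(transformed):.2f}")
```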

Regression test | Predictor | Outcome |
---|---|---|
Simple linear regression | 1 interval/ratio variable | 1 interval/ratio variable |
Multiple linear regression | 1+ interval/ratio variable(s) | 1 interval/ratio variable |
Logistic regression | 1+ any variable(s) | 1 binary variable |
Nominal regression | 1+ any variable(s) | 1 nominal variable |
Ordinal regression | 1+ any variable(s) | 1 ordinal variable |
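For the first row of the table, a least-squares fit can be sketched in a few lines. The function name and data are hypothetical, and the sketch covers only the fitted line, not the significance test of the slope.

```python
def simple_linear_regression(x, y):
    """Return (slope, intercept) of the least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (
        sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
        / sum((a - mean_x) ** 2 for a in x)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: hours studied (predictor) and exam score (outcome).
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 61, 67, 72]

slope, intercept = simple_linear_regression(hours, scores)
print(f"predicted score = {intercept:.1f} + {slope:.1f} * hours")
```

In this made-up data set, each additional hour of study goes with a predicted increase of about 4.9 points.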

## Frequently asked questions about inferential statistics

- What’s the difference between descriptive and inferential statistics?
  **Descriptive statistics** summarize the characteristics of a data set. **Inferential statistics** allow you to test a hypothesis or assess whether your data is generalizable to the broader population.

- What’s the difference between a statistic and a parameter?
  A **statistic** refers to measures about the **sample**, while a **parameter** refers to measures about the **population**.

- What is sampling error?
  A sampling error is the difference between a population parameter and a sample statistic.

- What is hypothesis testing?
  Hypothesis testing is a formal procedure for investigating our ideas about the world using statistics. It is used by scientists to test specific predictions, called hypotheses, by calculating how likely it is that a pattern or relationship between variables could have arisen by chance.