Ch. 1 - Statistical Concepts

Class: STAT-211


Notes:

Outline:

Intro

Research

The first step in conducting research is to identify the topics or
questions that are to be investigated. A clearly laid out research
question is helpful in identifying what subjects or cases should be
studied, and what measurements (variables) are important.

It is also important to consider how data are collected so that they
are reliable and help achieve the research goals.

A good statistical study design should include, but is not limited to:

Some key terminology

Data
Data is any collection of numbers, characters, texts, images,
graphs, symbols, or some combination of them that conveys
factual information used as a basis for scientific analysis, logical
reasoning, interpretation, and making decisions.

Descriptive Statistics
Organizing and summarizing the data or information collected in a
study using graphical representation, and tabulation of data, and
computing various summary measures based on the observations.

Inferential Statistics or Statistical Inference
Formal scientific principles for drawing conclusions about the
unknown population or population parameter from data or
information.

Population and Parameter

Population

Definition

Parameter

The objective of a study is to investigate certain summary measures, such as an average or a proportion, that describe the entire population. These measures are known as parameters.

Definition

Example

Example

Example

Pasted image 20250828143651.png|300

Rationale: Information obtained on a smaller scale (the sample) is
projected to give an idea of the larger scale (the population).

Sampling

Sampling

Sample

Statistic

Statistical Inference

Usually, a statistic is used to draw meaningful scientific conclusions
or decisions (that is, inferences) about an unknown parameter. This process of drawing inferences about an unknown population or a population parameter is called Statistical Inference.

Pasted image 20250828142244.png|300

Exploratory Analysis to Inference

Sampling Variability

Selecting a Simple Random Sample

Implementation of Simple Random Sampling

Sampling With and Without Replacement

Sampling without replacement (SRSWOR)

Sampling with replacement (SRSWR)

When a population is very large, sampling with and without replacement are practically equivalent.
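
As a minimal sketch in R, the two schemes differ only in the `replace` argument of the base function `sample()`. The population of 20 labeled units and the seed are illustrative assumptions, not part of the notes.

```r
# Hypothetical population: units labeled 1 to 20 (assumed for illustration)
population <- 1:20
set.seed(211)  # arbitrary seed so the draws are reproducible

# Simple random sample without replacement (SRSWOR): no unit can repeat
srswor <- sample(population, size = 5, replace = FALSE)

# Simple random sample with replacement (SRSWR): a unit may be drawn more than once
srswr <- sample(population, size = 5, replace = TRUE)

srswor
srswr
```

With a very large population the chance of drawing the same unit twice becomes negligible, which is why the two schemes behave almost identically in that case.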

Non-representative Samples

Biased Sampling: A Classic Example

What went wrong?

Variables

Variables

A variable is any characteristic or measurement that can be
determined for each member of a population or a sample.

Example

Typically, variables are denoted by lowercase letters such as x, y, z, etc.

Observe the difference between a parameter and a variable.

Classification of Variables

It’s worthwhile to note that variables can be of different types.

Pasted image 20250828150415.png|400

Quantitative variables

A variable is quantitative or numerical if observations on it take numerical values that represent different magnitudes of the variable, and the difference (or ratio) between two possible values has a consistent, meaningful interpretation.

Example
Age, annual income, mercury concentration in water, number of car accidents, number of siblings, etc.

Discrete variables

Example
Number of children in a family, number of siblings, number of pets in a household, years of education, number of students, number of pairs of shoes, number of car accidents at a busy intersection etc.

Continuous variables

Example
Height, weight, and age of an individual, monthly household income and expenditure, price, distance, amount of time to complete an assignment, amount of daily precipitation, temperature, daily rainfall, mercury concentration, concrete strength, etc.

Example 1
Example 2

Qualitative variable

Nominal and ordinal variables

Qualitative variables can be nominal or ordinal.

Nominal variables

A categorical variable is said to be nominal if it has levels that correspond to names of the categories, with no implied ordering.

Example
Hair color, eye color, gender, ethnicity, race, nationality, marital status or political preference of an individual. There is no natural ordering to these levels.

Ordinal variables

A categorical variable is said to be ordinal if it has some sort of ordered structure to the underlying levels.

Example

Treating ordinal data as if they were quantitative could lead to serious
misinterpretation. The difference between “Group 1” and “Group 2” need not mean the same as the difference between “Group 2” and “Group 3”, and so on.

Example

Example
What types of variables are these?

Multivariate Data

When two or more measurements are made on an observational unit, we
have bivariate or, more generally, multivariate data. A data matrix is a convenient and common way to organize multivariate data. Each row of
a data matrix corresponds to a unique case (observational unit), and
each column corresponds to a variable.

This is what a typical multivariate data matrix looks like:
Pasted image 20250902142114.png|500
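
In R, a data matrix of this kind is usually stored as a data frame. The sketch below uses made-up values and hypothetical variable names (`age`, `siblings`, `eye_color`) purely to show the row-per-case, column-per-variable layout.

```r
# Hypothetical data matrix: 4 cases (rows) and 3 variables (columns)
students <- data.frame(
  age       = c(19, 21, 20, 22),                    # quantitative (continuous)
  siblings  = c(1, 0, 3, 2),                        # quantitative (discrete)
  eye_color = c("brown", "blue", "brown", "green")  # qualitative (nominal)
)

students        # each row is one observational unit
dim(students)   # number of rows (cases) and columns (variables)
```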

Displaying the data

Displaying Data

Statistics isn’t always inference or prediction.

Sometimes making a good visualization leads to important
insights.

The distribution of a variable describes what values are likely (or
unlikely) to appear across the range of possible values.

Visualizing the distribution of variables can provide key insights.

Frequency Distribution

Displaying Qualitative Data

A survey is conducted on 225 individuals asking how satisfied they are with a particular product:
Pasted image 20250902142823.png|350

What kind of chart or graph can be used to display the data in this
table?

  1. Bar Graph
  2. Pie Chart

Displaying Qualitative Data: Bar graphs

Bar graphs display a vertical bar for each category. The height of
each bar represents either counts (“frequencies”) or percentages
(“relative frequencies”) for that category.
Pasted image 20250902143339.png|500

A disadvantage of bar graphs is that the categories are ordered
alphabetically (by default), which may sometimes obscure patterns
in the display. Sorting the bars from largest to smallest makes the
bar graph easier to read and interpret.
Pasted image 20250902143706.png|500
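
A minimal R sketch of a sorted bar graph follows. The category labels and counts are hypothetical stand-ins, not the survey table from these notes; only `sort()` and `barplot()` from base R are used.

```r
# Hypothetical satisfaction counts (NOT the survey data shown above)
counts <- c(Dissatisfied = 30, Neutral = 55, Satisfied = 90, "Very satisfied" = 50)

# Sort the categories from largest to smallest count before plotting
sorted_counts <- sort(counts, decreasing = TRUE)

# Bar heights as frequencies
barplot(sorted_counts, ylab = "Frequency", main = "Satisfaction (sorted)")

# Bar heights as relative frequencies
barplot(sorted_counts / sum(sorted_counts), ylab = "Relative frequency")
```

For an ordinal variable such as satisfaction, keeping the natural order of the levels can be just as informative as sorting by count; which to use is a judgment call.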

Displaying Qualitative Data: Pie Charts

Categories are represented by wedges of a circle, and each wedge is
proportional in size to the relative frequency (percentage) of its
category, provided the categories do not overlap with each other.

Pasted image 20250902143759.png|500

Still, pie charts have some limitations

Pasted image 20250902144205.png|300

Displaying Quantitative Variables

Quantitative variables take more care to visualize as there are often
a large number of possible values.
(Question: What would a bar graph look like for a quantitative variable?)

The usual features we want to learn from quantitative data are

There are different ways for the graphical display of quantitative data

Histogram

A histogram provides a versatile way to visualize quantitative data

  1. Divide the range of the data into intervals of equal width.

    • Also called bins
  2. Count the number of observations in each interval, creating a frequency table.

    • Also called bin frequencies
  3. On the horizontal axis, label the endpoints of the intervals.

  4. Draw a ‘bar’ over each interval with area equal to the relative frequency of the corresponding interval so that the total area of all the bars is 1.

  5. Label and title appropriately.

    (The height of each bar may also be equal to either the frequency or the relative frequency of the corresponding interval. However, it is usually not recommended due to certain issues.)
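
The steps above can be sketched in R with the base function `hist()`. Setting `freq = FALSE` draws bars whose areas are the relative frequencies. The simulated heights are an assumption for illustration and are not the data used elsewhere in these notes.

```r
# Simulated heights in inches (assumed data, for illustration only)
set.seed(211)
heights <- rnorm(100, mean = 70, sd = 3)

# freq = FALSE scales the bar heights so that bar AREA equals relative frequency
h <- hist(heights, freq = FALSE, xlab = "Height (inches)",
          main = "Histogram of simulated heights")

# Check: the sum over bins of (height x width) should equal 1
total_area <- sum(h$density * diff(h$breaks))
total_area
```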

Histogram - Example

Histogram based on the heights (measured to the nearest half
inch) of 100 male semi-professional soccer players:
Pasted image 20250902144649.png|500
The histogram gives an idea about the center, the spread, and the shape of the data distribution.

Important Remarks on the Choice of Bins
$$\text{Bin width} = 2 \times \mathrm{IQR} \times n^{-1/3}$$

where IQR stands for the inter-quartile range, a measure of the spread
of a frequency distribution, and n is the total number of
observations.
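
This choice of bin width, 2 × IQR × n^(-1/3), is commonly known as the Freedman-Diaconis rule. A minimal R sketch, using simulated data as an assumption, computes it by hand and then asks `hist()` to use the same rule via `breaks = "FD"`:

```r
set.seed(211)
x <- rexp(200, rate = 1)   # simulated data, purely illustrative

n <- length(x)
bin_width <- 2 * IQR(x) * n^(-1/3)   # bin width = 2 x IQR x n^(-1/3)
bin_width

# Approximate number of bins needed to cover the range of the data
n_bins <- ceiling(diff(range(x)) / bin_width)
n_bins

# Base R applies the same rule when breaks = "FD"
hist(x, breaks = "FD", freq = FALSE)
```

Note that `hist()` treats the resulting number of bins as a suggestion and rounds the break points to convenient values, so the plotted bins may not match `bin_width` exactly.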

Density Function

Imagine we have the possibility of infinite amounts of data

Density Function for Continuous Data
Pasted image 20250902145743.png|500
For continuous data, assuming we have a sufficiently large number of
observations, a histogram can be well approximated by a smooth, continuous curve (function) known as the (probability) density function.

Note:

  1. The bars of a histogram always lie on or above the horizontal axis
    • The density curve approximates the histogram
    • So the associated density function must always be non-negative
  2. The total area of our histogram is 1
    • Total area under the density curve is 1
    • This is due to the relative frequencies having a total sum of 1

Shape

Mode: Value that corresponds to a prominent peak or mound.

Pasted image 20250902151141.png|500

  1. Mode is approximately 100
  2. Prominent peak on the left is about 0, prominent peak on the right is about 9.
    • The center of this distribution is the mid-point; in this example it is 5
    • This data distribution is actually roughly symmetric

Symmetric vs. Skewed distributions

Pasted image 20250902151556.png|500

Example of skewed distributions
Pasted image 20250902151834.png|500

Scatterplot

A scatterplot is a powerful graphical tool to explore the
relationship between two quantitative variables.

We examine a scatterplot to study the association between two
quantitative variables. How do values on one variable change as
values of the other variable change?

You can describe the overall pattern of a scatterplot by the trend,
direction, and strength of the relationship between the two variables.

We will return to this when we talk about ‘simple linear regression’.

Example of scatterplot

Birth Weight (g) vs. Gestational Age (weeks)
Pasted image 20250902152035.png|400

Proportion, Average and Variance

Proportions

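When the variable of interest is categorical, such as Success or Failure, Strength of Opinion, Type of Car, and so on, then interest lies in the proportion in each category:

$$p_i = \frac{N_i}{N}, \qquad \hat{p}_i = \frac{n_i}{n}$$

Here p_i is the population proportion for category i (N_i of the N population units fall in that category), and p̂_i is the corresponding sample proportion (n_i of the n sample units).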
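
As a small illustration in R, sample proportions are the category counts divided by the sample size. The car-type data below are hypothetical.

```r
# Hypothetical sample of a categorical variable (type of car)
car_type <- c("sedan", "SUV", "sedan", "truck", "SUV", "sedan", "sedan", "truck")

counts <- table(car_type)              # number of sample units in each category
p_hat  <- counts / length(car_type)    # sample proportion for each category
p_hat

sum(p_hat)   # proportions across all categories add up to 1
```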

Population Average/Mean

Consider a population consisting of N units, and let X1,X2,...,XN denote the values of some numerical variable in the population.

Then the population average or mean, denoted by µ, is defined as
the arithmetic average of all numerical values in the statistical
population:

$$\mu = \frac{1}{N}\left[X_1 + X_2 + \cdots + X_N\right] = \frac{1}{N}\sum_{i=1}^{N} X_i.$$

(The above definition of the population mean is not complete. Here, we implicitly assumed that the population is finite, and all the variable values in the statistical population have equal weights which may not hold in practice. We need probability theory to introduce a more formal definition of the population mean.)

Sample Average/Mean

If a sample of size n is randomly selected from the population, and
if x1,x2,...,xn denote the variable values corresponding to the
sample units, then the sample average or the sample mean is

Pasted image 20250902152758.png|275

The sample mean x̄ approximates but is, in general, different from
the population mean µ.

The notation
Pasted image 20250902152813.png|275
is used to denote the sum of the x values, and should be read as
‘summation over xi where the index variable i ranges from 1 to n’.
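
A quick R sketch of the sample mean, computed both from the summation and with the built-in `mean()`; the four values are a hypothetical sample.

```r
x <- c(2, 4, 6, 8)   # hypothetical sample of size n = 4

n <- length(x)
x_bar_manual <- sum(x) / n   # (x1 + x2 + ... + xn) / n
x_bar_manual                 # 5

mean(x)                      # built-in function, same result
```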

Sample Median

Let x1,...,xn be the values of a numerical variable in a randomly
selected sample of size n. Further assume that these values are
arranged in an ascending order (that is, from smallest to largest).

Then, the sample median, denoted xMe, is defined as the value
that satisfies the following:

Assume the data values x1,...,xn are arranged in an ascending
order:

Example:
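
A minimal R sketch of the sample median, assuming the usual convention (middle value when n is odd, average of the two middle values when n is even). The helper name `sample_median` and both samples are hypothetical.

```r
# Sample median under the usual convention (assumed here):
# middle value if n is odd, average of the two middle values if n is even
sample_median <- function(x) {
  x <- sort(x)          # arrange in ascending order
  n <- length(x)
  if (n %% 2 == 1) {
    x[(n + 1) / 2]
  } else {
    (x[n / 2] + x[n / 2 + 1]) / 2
  }
}

odd_sample  <- c(7, 3, 9, 1, 5)       # hypothetical, n = 5
even_sample <- c(7, 3, 9, 1, 5, 11)   # hypothetical, n = 6

sample_median(odd_sample)    # 5
sample_median(even_sample)   # 6
median(even_sample)          # built-in function agrees
```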

Comparing Mean and Median

Comparing the median to the mean:

In a skewed distribution, the mean is farther out in the long tail than is the median.

Pasted image 20250904142053.png|400

If the data distribution is unimodal, the following relationships typically hold:

  1. Symmetric: Mean = Median = Mode
  2. Right-skewed: Mean > Median > Mode
  3. Left-skewed: Mean < Median < Mode
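
These relationships can be checked informally by simulation. The sketch below uses exponential draws as an assumed example of right-skewed data and their negatives as a left-skewed mirror image; it compares only the mean and the median.

```r
set.seed(211)
right_skewed <- rexp(1000, rate = 1)   # simulated data with a long right tail
left_skewed  <- -right_skewed          # mirror image: long left tail

# Right-skewed: the mean is pulled toward the long right tail
c(mean = mean(right_skewed), median = median(right_skewed))

# Left-skewed: the mean is pulled toward the long left tail
c(mean = mean(left_skewed), median = median(left_skewed))
```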

Measures of Spread

Much information is lost in reducing a list of observations to a single summary measure, such as the mean or median. A measure of center alone is not enough to adequately describe a quantitative variable.

Consider for example, the following distributions:

Pasted image 20250904142633.png|300

A measure of center or location tells us nothing about the spread or variability of the data.

Population Variance and Standard Deviation

Consider a population consisting of N units, and let X1,X2,...,XN
denote the values of some numerical variable in the population.
Then the population variance, denoted by σ^2, is defined as

Pasted image 20250904143205.png|200

(The above definition of the population variance is not complete. Here, we implicitly assumed that the population is finite, and all the variable values in the statistical population have equal weights which may not hold in practice. We need probability theory to introduce a more formal definition of the population variance.)

The positive square root of the population variance is called the
population standard deviation and is denoted by σ. That is,

Pasted image 20250904143217.png|200

Sample Variance and Standard Deviation

If a sample x1,x2,...,xn is randomly selected from the population,
then the sample variance, denoted by s^2, is defined as

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

(The above definition with divisor (n−1) makes the sample variance s^2 an unbiased estimator of the population variance σ^2. This means that it neither systematically underestimates nor overestimates σ^2; it is right on target on average.)

The positive square root of the sample variance s^2 is called the
sample standard deviation and is denoted by s, that is,

$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$
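
As a sketch, the sample variance and standard deviation can be computed in R directly from the definition (sum of squared deviations from the sample mean, divided by n − 1) and compared with the built-in `var()` and `sd()`, which use the same divisor. The sample is hypothetical.

```r
x <- c(2, 4, 6, 8)   # hypothetical sample
n <- length(x)
x_bar <- mean(x)

# Sum of squared deviations from the sample mean, divided by (n - 1)
s2_manual <- sum((x - x_bar)^2) / (n - 1)
s_manual  <- sqrt(s2_manual)

c(manual = s2_manual, builtin = var(x))   # sample variance
c(manual = s_manual,  builtin = sd(x))    # sample standard deviation
```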

Properties of Variance or Standard Deviation

Median, Quartiles, and Boxplots

Let x1,...,xn be the values of a quantitative variable in a randomly selected sample of size n.

We previously defined the sample median xMe, whereby at least 50% of the observations are no larger than xMe, and at least 50% of the observations are no smaller than xMe.

We can extend this concept to other fractions, such as

Question: What would be the second quartile Q2?

Answer: The median xMe

Example:

Formula for finding Quartiles

Suppose we wish to find the k-th quartile, for k = 1,2,3.

  1. Let n be the total number of observations or data points.
  2. Assume the observations x1,x2,...,xn are arranged in ascending order.
  3. Compute h = (n−1)×(k/4)+1. Note that h may have a decimal part; it does not need to be an integer.
  4. Then the k-th quartile is given by the following equation:

$$Q_k = x_{\lfloor h \rfloor} + (h - \lfloor h \rfloor) \times \left(x_{\lceil h \rceil} - x_{\lfloor h \rfloor}\right),$$

where
- ⌊h⌋ = largest integer less than or equal to h (floor of h), and
- ⌈h⌉ = smallest integer greater than or equal to h (ceiling of h).
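
The procedure can be sketched as a small R function: compute h = (n−1)×(k/4)+1, then interpolate between the observations at positions ⌊h⌋ and ⌈h⌉. The function name `quartile` and the test vector are hypothetical. This interpolation is the same one R's `quantile()` uses by default (`type = 7`), so the two should agree.

```r
# k-th quartile (k = 1, 2, 3) following the steps above
quartile <- function(x, k) {
  x  <- sort(x)                   # arrange in ascending order
  n  <- length(x)
  h  <- (n - 1) * (k / 4) + 1     # position, possibly non-integer
  lo <- floor(h)
  hi <- ceiling(h)
  x[lo] + (h - lo) * (x[hi] - x[lo])   # linear interpolation
}

x <- c(3, 1, 4, 1, 5, 9, 2, 6)    # hypothetical data

manual  <- sapply(1:3, function(k) quartile(x, k))
builtin <- unname(quantile(x, c(0.25, 0.50, 0.75), type = 7))

manual
builtin
```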

Example

We have a sample of size n = 12:

9.39,7.04,7.17,13.28,10.53,7.46,11.97,8.39,21.06,12.68,13.19,8.50

Find the quartiles and the sample average

Manually

Process:
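
One way to carry out the manual computation is sketched below. It applies h = (n−1)×(k/4)+1 with n = 12 and then interpolates between the observations at positions ⌊h⌋ and ⌈h⌉, using the sorted data 7.04, 7.17, 7.46, 8.39, 8.50, 9.39, 10.53, 11.97, 12.68, 13.19, 13.28, 21.06.

```latex
\begin{aligned}
k = 1:\quad h &= 11 \times \tfrac{1}{4} + 1 = 3.75, \\
Q_1 &= x_3 + 0.75\,(x_4 - x_3) = 7.46 + 0.75 \times 0.93 = 8.1575 \\[4pt]
k = 2:\quad h &= 11 \times \tfrac{2}{4} + 1 = 6.5, \\
Q_2 &= x_6 + 0.5\,(x_7 - x_6) = 9.39 + 0.5 \times 1.14 = 9.96 \\[4pt]
k = 3:\quad h &= 11 \times \tfrac{3}{4} + 1 = 9.25, \\
Q_3 &= x_9 + 0.25\,(x_{10} - x_9) = 12.68 + 0.25 \times 0.51 = 12.8075 \\[4pt]
\bar{x} &= \frac{1}{12}\sum_{i=1}^{12} x_i = \frac{130.66}{12} \approx 10.888
\end{aligned}
```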

Using R

Open a new R script and create an R vector

# Create a vector of the observations
# (note that the observations are not arranged in ascending order here)
x = c(9.39,7.04,7.17,13.28,10.53,7.46,11.97,8.39,21.06,12.68,13.19,8.50)

x = sort(x)   # Arrange x in ascending order (optional: quantile() sorts internally)

# Specify the probability (order) of each quantile
Q1 = quantile(x,0.25)
Q2 = quantile(x,0.50)
Q3 = quantile(x,0.75)

# Print the values of the three quartiles
Q1
Q2
Q3

Output:

Q1
8.1575
Q2
9.96
Q3
12.8075

Inter-Quartile Range

Pasted image 20250904150936.png|300

We define the inter-quartile range (IQR) to be
IQR = Q3 − Q1

This gives us another indication of how ‘spread out’ the values are.

Question: What is the IQR for the preceding example?

Interquartile range in R

IQR = Q3 - Q1
IQR

Inter-Quartile Range and Outliers

Definition:

Question: Identify the presence of any potential outlier in the preceding example.
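
A sketch of one common way to answer this in R follows. It assumes the widely used 1.5 × IQR rule, under which a value is flagged as a potential outlier if it lies below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR. The notes do not spell out the definition here, so treat the rule and the names `lower_fence` and `upper_fence` as assumptions.

```r
x <- c(9.39, 7.04, 7.17, 13.28, 10.53, 7.46, 11.97, 8.39, 21.06, 12.68, 13.19, 8.50)

Q1  <- unname(quantile(x, 0.25))
Q3  <- unname(quantile(x, 0.75))
iqr <- Q3 - Q1            # same value as the built-in IQR(x)

# Assumed rule: flag values more than 1.5 * IQR beyond the quartiles
lower_fence <- Q1 - 1.5 * iqr
upper_fence <- Q3 + 1.5 * iqr

outliers <- x[x < lower_fence | x > upper_fence]
outliers
```

Under this assumed rule the IQR is 4.65 and the fences are 1.1825 and 19.7825, so the observation 21.06 is flagged as a potential outlier.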

Box Plots

Box plots (also called box-and-whisker plots or box-whisker plots) give a good graphical image of the concentration of the data.

They show how far the extreme values are from most of the data.

A box plot is constructed from five descriptive measures:

These measures together are called the five-number summary of a data set.
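
As a sketch in R, assuming the five measures are the minimum, Q1, the median, Q3, and the maximum (the usual convention), the summary for the earlier sample of size 12 can be obtained with `quantile()` and drawn with `boxplot()`:

```r
x <- c(9.39, 7.04, 7.17, 13.28, 10.53, 7.46, 11.97, 8.39, 21.06, 12.68, 13.19, 8.50)

# Minimum, Q1, median, Q3, maximum
five_num <- quantile(x, probs = c(0, 0.25, 0.50, 0.75, 1))
five_num

# Vertical box plot of the same data
boxplot(x, main = "Box plot of the sample", ylab = "x")
```

`boxplot()` computes the edges of the box with a slightly different quartile convention (hinges), so the drawn box may not line up exactly with the Q1 and Q3 values printed by `quantile()`.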

Anatomy of a Box Plot

Example of a vertical Box plot:
Pasted image 20250904151835.png|400

Comparing Histogram and Box-plot

Pasted image 20250904152316.png|600

  1. Histogram: Data distribution is very symmetric

    • Median lies at an equal distance from Q3 and Q1
      • Q2 − Q1 = Q3 − Q2

    Box-plot: Median lies approximately at the same distance from Q3 and Q1

    • Approximately an equal number of observations on the upper and lower whiskers
  2. Histogram: Skewed to the left

    • Distance between Q2 and Q1 is much larger than between Q2 and Q3
      • Q2 − Q1 > Q3 − Q2

    Box-plot: The median is shifted upward, closer to Q3 than to Q1
  3. Histogram: Right skewed distribution

    Box-plot: Median is at a much smaller distance to Q1 than to Q3

    • Strong indicator that the underlying data distribution is a right skewed distribution

Which Measure Should We Use?

Summary - Which Measure To Use

| | Symmetric, Bell-Shaped | Skewed or Outliers |
|---|---|---|
| Measure of Center | Mean | Median |
| Measure of Spread | Standard Deviation | IQR |