Ch. 1 - Statistical Concepts

Notes:

Outline:

Population and Parameter
Sampling
Variable
Displaying the Data
Proportion, Average and Variance

Intro

Research

The first step in conducting research is to identify topics or
questions that are to be investigated. A clearly laid out research
question is helpful in identifying what subjects or cases should be
studied, and what measurements (variables) are important.

It is also important to consider how data are collected so that they
are reliable and help achieve the research goals.

A good statistical study design should include, but is not limited to:

identifying the statistical model, the population, and the parameters of interest
rephrasing the objectives of the study in terms of questions to be answered about the parameters
determining the variables of interest, selection of the experimental units, data collection method, number of units to be sampled, etc.

Some key terminology

Data
Data is any collection of numbers, characters, texts, images,
graphs, symbols, or some combinations of them that conveys
factual information used as a basis for scientific analysis, logical
reasoning, interpretation, and making decisions.

Descriptive Statistics
Organizing and summarizing the data or information collected in a
study using graphical representation, and tabulation of data, and
computing various summary measures based on the observations.

Inferential Statistics or Statistical Inference
Formal scientific principles for drawing conclusions about the
unknown population or population parameter from data or
information.

Population and Parameter

Population

Definition

A Population is a group or collection of subjects or objects (living, non–living or abstract) whose properties are being investigated to answer the research questions.
- Population members are called population units.
- A population doesn’t always refer to people. It can mean a group containing elements of anything you want to study, such as objects, events, organizations, countries, species, organisms, etc.
- The exact population will depend on the scope of the study.

Parameter

The objective of a study is to investigate certain summary measures, such as an average or a proportion, that describe the entire population. These are known as parameters.

Definition

A parameter is a numerical summary measure or quantity used to represent a specific characteristic or feature of the population.

Example

The proportion of defective items among all items of a certain manufactured product.
The proportion of voters who favor Candidate A in an election.
Average weekly rate of accidents at a busy intersection.
Average mercury concentration in swordfish in the Atlantic Ocean.

Example

We want to study the average mercury concentration in swordfish in the Atlantic Ocean.
- Population: The collection of all swordfish in the Atlantic Ocean.
- Parameter: The average mercury concentration in swordfish in the Atlantic Ocean.
  - Quantifies a specific feature of the population as a whole
Question: Is it possible to catch all the swordfish in the Atlantic Ocean and measure their mercury content?
- No. The first issue is that you do not know the exact population size. The second issue is that, at any given point in time, you do not know the location of each swordfish.
- It is not possible for us to learn the exact true value: we do not know our population exactly, and it is infeasible to take measurements on each of the subjects.

Example

We might be interested in learning about the average weight of middle-aged Americans.
The population is all middle-aged Americans and the parameter of interest is their average weight
- Is it possible to know the exact value of this parameter?
  - No, there are millions of middle-aged Americans, it is impossible to measure each of them.
Just imagine how much time, cost and manpower it will require to observe the weights of all middle-aged Americans!
Sampling and statistics allow us to ‘estimate’ the parameter instead.
We might use the average weight of a random sample of 100 middle-aged Americans to estimate the parameter.
To recover the parameter exactly and with certainty, we would need to measure every unit in the population, which is called a census.
A census is rarely used, often due to cost, time, and physical considerations.
Thus, in order to answer questions regarding parameters, we obtain a sample.

Pasted image 20250828143651.png|300

Subset of the original population
Chosen to be a good representative of the original population

Rationale: Information obtained at a smaller scale (sample) would
be projected to obtain an idea of the bigger level (population).

Sampling

Sample

A sample is a subset of the population, ideally chosen so that it has the same characteristics as the population of interest.
Each individual member of a sample is called a sampling unit.
A measurement taken on an individual sampling unit is called a sample observation.
The collection of all the sample observations is referred to as the data.

Statistic

A statistic is a quantitative measure that represents the property of interest in the sample.
- Sample mean
- Sample proportion
A statistic is often used to guess or to estimate the actual value of the unknown parameter of interest.
- Example: We use the statistic sample proportion to draw conclusions about an unknown population proportion (the parameter)

Statistical Inference

Usually, a statistic is used to draw meaningful scientific conclusions
or decisions (or, inference) about an unknown parameter. This process of drawing inference about an unknown population or a population parameter is called Statistical Inference.

Pasted image 20250828142244.png|300

Take a sample based on some appropriate scientific principles
Calculate the value of the sample mean and sample proportion
Once this value is calculated we can draw an inference about the unknown population mean/proportion

Exploratory Analysis to Inference

Sampling is natural
Think about sampling something you are cooking - you taste (examine) a small part of what you’re cooking to get an idea about the dish as a whole.
When you taste a spoonful of soup and decide the spoonful you tasted isn’t salty enough, that’s exploratory analysis.
If you generalize and conclude that your entire soup needs salt, that’s an inference.
- From the sample level you try to get an idea at the population level, this is called statistical inference
For your inference to be valid, the spoonful you tasted (the sample) needs to be representative of the entire pot (the population).
- If your spoonful comes only from the surface and the salt is collected at the bottom of the pot, what you tasted is probably not representative of the whole pot.
  - Here your sample is a poor representative of the population
- If you first stir the soup thoroughly before you taste, your spoonful will more likely be representative of the whole pot.
  - Here your sample is probably a good representative of the population

Sampling Variability

When we compute a statistic from a sample, we just observe one of its possible values.
Even if the same study design is used more than once under identical conditions, a different sample is likely to occur.
Thus, different samples of the same size will possibly yield different values of the statistic.
- The average mercury concentration computed from 25 water samples will differ from one set of 25 samples to another
- The proportion of voters who favor a candidate in an election will be different in three different samples of voters
This inherent variation in the observed values of the statistic across all possible samples of a given size is referred to as sampling variation.
- If there were no sampling variation, there would be no need for statistics
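
Sampling variation is easy to see in a small simulation. The sketch below is only an illustration: the population is a list of made-up mercury concentrations, and the names `population` and `sample_means` are hypothetical. It draws three samples of size 25 from the same population and computes the sample mean of each.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 made-up mercury concentrations (ppm).
population = [random.gauss(1.0, 0.3) for _ in range(10_000)]

# Draw three samples of the same size and compute the statistic on each.
sample_means = [
    statistics.mean(random.sample(population, 25)) for _ in range(3)
]

print("population mean:", round(statistics.mean(population), 3))
print("sample means:   ", [round(m, 3) for m in sample_means])
```

The three printed sample means will generally differ from each other and from the population mean, even though the sampling procedure is identical each time.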

Selecting a Simple Random Sample

Proper extrapolation of sample information to the population requires that the sample be representative of the population.
A sample of size n is a simple random sample (SRS) if the selection process ensures that every sample of size n has an equal chance of being selected.
Fact:
- In simple random sampling, every population unit has the same chance of being included in the sample in a given draw.
Other Random Sampling Schemes:
- Stratified Random Sampling
- Cluster Random Sampling
- Multi-stage Random Sampling
- Systematic Random Sampling

Implementation of Simple Random Sampling

How do we select an SRS of size n from a population of N units?
- STEP 1: Assign to each unit a number from 1 to N.
- STEP 2: Write each number on a slip of paper, place the N slips of paper in an urn, and shuffle them.
- STEP 3: Select n slips of paper at random, one at a time.
Example
- The classic urn example.
  - People tended to use this method a few decades back
- Of course, typically software is used.
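
As a rough illustration of how software replaces the urn, the sketch below uses Python's standard `random.sample`, which draws without replacement. The function name `simple_random_sample` and the values N = 20 and n = 5 are assumptions made for this example.

```python
import random

def simple_random_sample(units, n, seed=None):
    """Draw n units without replacement; every subset of size n is equally likely."""
    rng = random.Random(seed)
    return rng.sample(units, n)

N = 20
labels = list(range(1, N + 1))                     # STEP 1: number the units 1..N
srs = simple_random_sample(labels, n=5, seed=42)   # STEPS 2-3: shuffle and draw n slips
print(srs)
```

Passing a seed only makes the draw reproducible; omit it to get a fresh sample on each run.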

Sampling With and Without Replacement

Sampling without replacement (SRSWOR)

Once a unit’s number is drawn, it cannot be redrawn
- A population unit can be included in a sample at most once
- It is either selected or not
An SRS is obtained by sampling without replacement
Observations are dependent because the outcome of one draw affects the probabilities of subsequent draws.
- The population of available units changes with each selection: once a unit is selected and not replaced, it is no longer available, which alters the sample space for the next draw

Sampling with replacement (SRSWR)

After a unit’s slip of paper is chosen, it is put back in the urn
A population unit could be included in the sample anywhere between 0 and n times
Observations are independent, because every draw is made from the same full urn.

When a population is very large, sampling with and without replacement are practically equivalent.
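
A short calculation shows why the two schemes become practically equivalent for large populations. The sketch below is an illustration under simple assumptions: it computes the chance that a particular not-yet-drawn unit is picked on the second draw, given that a different unit was picked first. The function name `second_draw_prob` is hypothetical.

```python
from fractions import Fraction

def second_draw_prob(N, with_replacement):
    """P(a given unit is drawn second | a different unit was drawn first)."""
    remaining = N if with_replacement else N - 1   # units still in the urn
    return Fraction(1, remaining)

for N in (10, 1_000_000):
    p_wr = second_draw_prob(N, with_replacement=True)
    p_wor = second_draw_prob(N, with_replacement=False)
    print(N, float(p_wr), float(p_wor), float(p_wor - p_wr))
```

Without replacement the probability changes from 1/N to 1/(N - 1) after the first draw, which is the dependence described above. For N = 10 the change is noticeable; for N = 1,000,000 it is negligible.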

Non-representative Samples

Non-representative samples arise whenever parts of the population of interest are systematically under-represented in the sample.
This is called selection bias.
Two examples of non-representative samples are voluntary and convenience samples.
- A voluntary sample often occurs when the individuals choose to be included in the sample voluntarily.
  - Example: You are asked to provide feedback when you eat at a restaurant
- Convenience sampling is a method of collecting data from population members who are conveniently available to participate in the study.
  - Example: Suppose a university wants to estimate the average of some characteristic of its students, and it surveys the first 100 students found in front of the library.

Biased Sampling: A Classic Example

In 1936, the American Literary Digest magazine collected around 2.4 million postal surveys and predicted that the Republican candidate Alfred Landon would beat the Democratic candidate Franklin D. Roosevelt by an overwhelming margin in the U.S. presidential election.
- It was a very large sample size for that time, so what went wrong?
The poll showed that Landon would likely be the overwhelming winner and FDR would get only 43% of the votes.
Election result: FDR won, with 62% of the votes.
The magazine was completely discredited because of the poll, and was soon discontinued.

What went wrong?

The Literary Digest survey represented a sample collected from readers of the magazine, supplemented by records of registered automobile owners and telephone users.
- Was not a good representation of the whole population
These groups had incomes well above the erstwhile national average (remember, this is Great Depression era) resulting in lists of voters far more likely to support Republicans than a truly typical voter of the time.
While larger samples are preferable, and the Literary Digest election poll was based on a huge sample of size 2.4 million, the sample was biased. Consequently, the sample failed to yield an accurate prediction.
In contrast, a poll of only 50 thousand citizens selected by George Gallup’s organization successfully predicted the result, leading to the popularity of the Gallup poll.
- It predicted the outcome of the election in advance
- The sample was collected properly and was a good representation of the true US voting population
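
A small simulation can make the point concrete. The sketch below is not a reconstruction of the 1936 polls: the population, the group sizes, and the support rates are all made-up numbers, chosen only so that overall support for a hypothetical candidate F is 62%. It compares a large sample drawn from an unrepresentative list with a much smaller simple random sample.

```python
import random

rng = random.Random(1936)

# Hypothetical electorate (made-up numbers):
# 20,000 affluent voters, 40% of whom support candidate F;
# 80,000 other voters, 67.5% of whom support candidate F.
affluent = [1] * 8_000 + [0] * 12_000
others = [1] * 54_000 + [0] * 26_000
population = affluent + others

true_p = sum(population) / len(population)   # parameter: overall support

# Large but biased sample: drawn only from the affluent list.
biased = rng.sample(affluent, 10_000)
# Small simple random sample from the whole population.
srs = rng.sample(population, 500)

biased_est = sum(biased) / len(biased)
srs_est = sum(srs) / len(srs)
print("true:", true_p, "biased n=10000:", biased_est, "SRS n=500:", srs_est)
```

In this toy setting the biased estimate stays near the affluent group's 40% no matter how large the sample is, while the small SRS lands close to the true proportion.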

Variables

A variable is any characteristic or measurement that can be
determined for each member of a population or a sample.

Example

air pressure, temperature, humidity, whether it rained
number of centimeters of precipitation, mercury concentration
income, expenditure, race, ethnicity, nationality, height, weight, marital status, hair color, eye color of an individual
point or letter grades obtained by a STAT 211 student
gasoline prices, make and model of a car, number of cars sold
volumes of import/export, production, sell of a commercial product

Typically, variables are denoted by lower case letters like x, y, z etc.

Random variables are usually denoted with capital letters, A, B, C etc.

Observe the difference between a parameter and a variable.

A parameter is a numerical summary that quantifies a specific feature or characteristic of the population as a whole
A variable is a characteristic that can be measured on each and every one of the sampling units

Classification of Variables

It’s worthwhile to consider that variables can be of different types.

Pasted image 20250828150415.png|400

numerical variables = quantitative
categorical variables = qualitative
There are other types of variables
- Example: Spatio-temporal variables
  - Depend both on time and space

Quantitative variables

A variable is quantitative or numerical if observations on it take numerical values that represent different magnitudes of the variable, and the difference (or ratio) between two possible values has a consistent, meaningful interpretation.

Example
Age, annual income, mercury concentration in water, number of car accidents, number of siblings, etc.

Discrete variables

Quantitative variables can either be discrete or continuous.
A quantitative variable is discrete if its possible values form a set of separate or discrete numbers, such as 0, 1, 2, 3,....

Example
Number of children in a family, number of siblings, number of pets in a household, years of education, number of students, number of pairs of shoes, number of car accidents at a busy intersection etc.

A discrete variable may assume finite or countably infinitely many possible values.
A discrete variable may assume decimal values, but they must be discrete or separate

Continuous variables

A quantitative variable is continuous if it takes values in an interval (bounded or unbounded).

Example
Height, weight, and age of an individual, monthly household income and expenditure, price, distance, amount of time to complete an assignment, amount of daily precipitation, temperature, daily rainfall, mercury concentration, concrete strength, etc.

Continuous variables have uncountably many possible values. (Theoretically) A continuous variable can be measured up to any desired degree of accuracy.
- From a practical viewpoint this statement has a limitation
The distinction between discrete and continuous data is not always clear-cut. Sometimes it even depends on the scale or the measurement technique.

Example 1

Variable: Height of an individual (cm)
- Continuous variable
Perhaps the height of an individual is expressed as follows:
- 165 cm
- 167.3 cm
- 170 cm
- 172.8 cm
- 175 cm
- ...
- All these heights are expressed as discrete values; they are all separate numbers
But we still can't call the height of an individual a discrete variable. Why?
- Example of an actual height of an individual: 170.15698324751 cm
  - Is it possible to devise a measuring instrument that can capture this height?
    - No, that would be absurd
  - Whatever height we observe is the result of the limitations of the measurement system;
    - it does not mean that the true value for the sample unit is exactly the recorded value.
  - This height will probably be reported as 170.1 cm or 170.2 cm

Example 2

You ask for the age of a sample unit
- They respond: 21.67532657312
This will be a continuous variable

Qualitative variable

A categorical or qualitative variable takes only a finite, and usually small, number of values or categories, which are not necessarily numerical.
Usually expressed through labels/tags/categories.
- Type of Residence (Apartment, Condo, Townhouse,...)
- Undergraduate major at a university (Physics, Chemistry, Mathematics, Statistics, Biology, Computer Science,...)
- Car manufacturer (Honda, Hyundai, Subaru, Toyota,...)
The values of a qualitative variable are called its levels or categories.
Qualitative variables serve to subdivide the data set into categories and are also sometimes referred to as factors.

Nominal and ordinal variables

Qualitative variables can be nominal or ordinal.

Nominal variables

A categorical variable is said to be nominal if it has levels that correspond to names of the categories, with no implied ordering.

Example
Hair color, eye color, gender, ethnicity, race, nationality, marital status or political preference of an individual. There is no natural ordering to these levels.

Ordinal variables

A categorical variable is said to be ordinal if it has some sort of ordered structure to the underlying levels.

Implied natural order

Example

Socioeconomic status and age-groups of people
Height measured as “tall”, “medium”, and “short”
Survey responses like “strongly disagree”, “disagree”, “neutral”, “agree”, and “strongly agree” (known as Likert scale)
Letter grades of students (“A”, “B”, “C”, “D”, and “F”).

Treating ordinal data as if it were quantitative could lead to serious
misinterpretation. The difference between two groups “Group 1” and “Group 2” need not mean the same as the difference between groups “Group 2” and “Group 3”, etc.

Example

Age groupings such as 0−5, 6−10, 11−20, 30−50, etc do not correspond to equal intervals of time.
Likert scales often are coded numerically (as for example 1,2,3,4,5). But not only is the difference between two successive values inconsistent, the very meaning of the scale can vary among the respondents.
- This is an example of a categorical variable
  - Category of 1 is actually the code name of: Extremely Dissatisfied
  - Category of 2 corresponds to: Dissatisfied
  - Category of 3 corresponds to: Average
  - Category of 4 corresponds to: Satisfied
  - Category of 5 corresponds to: Extremely Satisfied
- Differences between level 2 and 1, and differences between 5 and 4, are they the same?
  - No, you cannot say that the difference between “extremely dissatisfied” and “dissatisfied” is the same as the difference between “satisfied” and “extremely satisfied”
- The very interpretation of this scale may vary from person to person

Example
What types of variables are these?

The survival time of a cancer patient after receiving a new treatment for cancer
The number of ticks found on a cow entering an inspection station
- Quantitative discrete
The average rainfall during August in College Station
- Quantitative continuous
The number of touchdowns thrown during an NFL game
- Discrete quantitative
Letter grades of STAT 211 students in Fall 2024
- Qualitative and ordinal
SSN (or UIN)
ZIP Codes
- Code names of administrative geographical regions, used to uniquely identify each region
- If you add, multiply, or take the difference of two ZIP codes, do you get a meaningful result?
- No, ZIP codes are not quantitative variables

Multivariate Data

When two or more measurements are made on an observational unit, we
have bivariate or, more generally, multivariate data. A data matrix is a convenient and common way to organize multivariate data. Each row of
a data matrix corresponds to a unique case (observational unit), and
each column corresponds to a variable.

This is what a typical multivariate data matrix looks like:
Pasted image 20250902142114.png|500

The above table displays rows 1, 2, 3, and 50 of a data set for 50 randomly sampled loans offered through Lending Club.
Each row in the table represents a single loan. The columns represent characteristics, called variables, for each of the loans.
The first row represents a loan of $7,500 with an interest rate of 7.34%, where the borrower is based in Maryland (MD) and has an income of $70,000.
Example:
- There are 50 rows, one for each of the 50 experimental units (individual loans). For each unit we take 7 measurements (variables).

Displaying the data

Displaying Data

Statistics isn’t always inference or prediction.

Sometimes making a good visualization leads to important
insights.

The distribution of a variable describes what values are likely (or
unlikely) to appear across the range of possible values.

Visualizing the distribution of variables can provide key insights.

Frequency Distribution

Frequency - number of times each data value or category occurs in the data.
Frequency Distribution - a table or function listing all the observed variable values or class intervals (in the case of a quantitative variable), or the categories or levels (in the case of a qualitative variable), along with their corresponding frequencies.
Relative Frequency - proportion of times a data value or category occurs in the data.
Percentage Relative Frequency - relative frequency expressed as percentage
- Percentage of times a particular data value occurs in the data
- 0% - 100%
- Sum of all percentage relative frequencies = 100%
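
These three quantities are straightforward to compute. The sketch below uses a made-up list of survey responses (the data and the variable names are assumptions for illustration) and builds the frequency, relative frequency, and percentage relative frequency of each category.

```python
from collections import Counter

# Hypothetical survey responses (made up for illustration).
responses = ["agree", "neutral", "agree", "disagree",
             "agree", "neutral", "agree", "disagree"]

freq = Counter(responses)                          # frequency of each category
n = sum(freq.values())                             # total number of observations
rel_freq = {c: k / n for c, k in freq.items()}     # relative frequency
pct = {c: 100 * p for c, p in rel_freq.items()}    # percentage relative frequency

for c in freq:
    print(f"{c:10s} {freq[c]:3d} {rel_freq[c]:.3f} {pct[c]:.1f}%")
```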

Displaying Qualitative Data

A survey is conducted on 225 individuals asking how satisfied they are with a particular product:
Pasted image 20250902142823.png|350

This is an example of an ordinal categorical variable
Counts: number of individuals belonging to each category
Note that all relative frequencies here are rounded
- When you sum the rounded relative frequencies, the total will very likely not equal one; it may be slightly more or less than one
- To overcome this issue, round the relative frequencies of all categories except the last one, and compute the last relative frequency as 1 minus the sum of the other rounded relative frequencies. In this way the relative frequencies sum to exactly one
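
The rounding adjustment can be sketched as follows. This is a minimal illustration, assuming made-up counts; the function name `rounded_relative_frequencies` is hypothetical.

```python
def rounded_relative_frequencies(counts, digits=2):
    """Round every relative frequency except the last, which absorbs the remainder."""
    n = sum(counts)
    rounded = [round(c / n, digits) for c in counts[:-1]]
    last = round(1 - sum(rounded), digits)   # 1 minus the sum of the others
    return rounded + [last]

counts = [1, 1, 1]   # hypothetical counts: each category holds 1/3 of the data
print(rounded_relative_frequencies(counts))
```

Rounding each 1/3 separately would give three values of 0.33, which sum to 0.99; letting the last category absorb the remainder restores a total of 1.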

What kind of chart or graph can be used to display the data in this
table?

Bar Graph
Pie Chart

Displaying Qualitative Data: Bar graphs

Bar graphs display a vertical bar for each category. The height of
each bar represents either counts (“frequencies”) or percentages
(“relative frequencies”) for that category.
Pasted image 20250902143339.png|500

Over each category we construct a vertical bar, such that its height equals the frequency of the category.
On the top of the bar you can see the percentage relative frequency for the category
Each bar is a category
Is this vertical bar graph appealing?
- No, note that all of these categories are ordered alphabetically
- As a result these bars are somewhat of a zigzagging pattern
- You should order your categories in terms of decreasing or increasing frequencies.

A disadvantage of bar graphs is that the categories are ordered
alphabetically (by default), which may sometimes obscure patterns
in the display. Sorting the bars from largest to smallest makes the
bar graph easier to read and interpret.
Pasted image 20250902143706.png|500

Now this bar chart is much more appealing

Displaying Qualitative Data: Pie Charts

Categories are represented by wedges in a circle and are
proportional in size to the percentage relative frequency of each
category, provided the categories do not overlap with each other.

Pasted image 20250902143759.png|500

You can find the category labels in the top right corner of the pie chart
The area of each wedge is proportional to the percentage relative frequency of its category
Note: Here the groups do not overlap, which means that an individual cannot belong to more than one group simultaneously.

Still pie charts have some limitations

Cannot be used if some of the groups overlap with each other; you can still use a bar chart in such a situation
When a categorical variable has many categories, it can be difficult to read and interpret the pie chart.
Even with the numbers on the outside, it is pretty much impossible to tell the percentage of any slice of the pie.

Pasted image 20250902144205.png|300

Displaying Quantitative Variables

Quantitative variables take more care to visualize as there are often
a large number of possible values.
(Question: What would a bar graph look like for a quantitative variable?)

The usual features we want to learn from quantitative data are:

What is the spread of the data?
(Wide versus narrow)
What are the typical values of the data?
(Generally characterized by the center of the data)
What is the shape of the distribution of the data?

There are different ways for the graphical display of quantitative data

Histogram

A histogram provides a versatile way to visualize quantitative data

Divide the range of the data into intervals of equal width.
- Also called bins
Count the number of observations in each interval, creating a frequency table.
- Also called bin frequencies
On the horizontal axis, label the endpoints of the intervals.
Draw a ‘bar’ over each interval with area equal to the relative frequency of the corresponding interval so that the total area of all the bars is 1.
Label and title appropriately.

(The height of each bar may also be equal to either the frequency or the relative frequency of the corresponding interval. However, it is usually not recommended due to certain issues.)
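
The construction steps above can be sketched in plain Python. This is an illustration only: the data are made up, the function name `density_histogram` is hypothetical, and all values are assumed to lie between `lo` and `hi`. Each bar's height is its relative frequency divided by the bin width, so that each bar's area equals the relative frequency and the areas sum to 1.

```python
def density_histogram(data, lo, hi, k):
    """Split [lo, hi] into k equal-width bins; return (edges, counts, heights)."""
    width = (hi - lo) / k
    edges = [lo + i * width for i in range(k + 1)]
    counts = [0] * k
    for x in data:
        # Bins are [left, right); a value equal to hi goes in the last bin.
        i = min(int((x - lo) / width), k - 1)
        counts[i] += 1
    n = len(data)
    # Height = relative frequency / width, so area = relative frequency.
    heights = [c / n / width for c in counts]
    return edges, counts, heights

# Hypothetical heights in inches (made up for illustration).
data = [60.5, 63.0, 65.5, 66.0, 66.5, 67.0, 67.5, 68.0, 69.5, 72.0]
edges, counts, heights = density_histogram(data, lo=60, hi=76, k=8)
total_area = sum(h * (edges[1] - edges[0]) for h in heights)
print(counts, round(total_area, 6))
```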

Histogram - Example

Histogram based on the heights (measured to the nearest half
inch) of 100 male semi-professional soccer players:
Pasted image 20250902144649.png|500
Provides an idea about the center, the spread and the shape of the data distribution.

8 contiguous class intervals (bins)
- 60-62
- 62-64
- 64-66
- 66-68
- 68-70
- 70-72
- 72-74
- 74-76
Bin frequency
- At the top of each bar you will find the bin frequency for each class interval
- The sum of all of these bin frequencies equals 100
- The histogram should be constructed so that the area of each bar equals the relative frequency of its class interval
Features just by visual inspection:
- Range:
  - minimum >= 60
  - maximum <= 76
  - Range = maximum - minimum
    - <= 76 - 60 = 16
    - 16 gives you an idea of the overall spread of the distribution
- Prominent peaks
  - There is only 1 prominent peak
  - This data distribution is certainly uni-modal
  - Its peak is probably at about 67
    - mode ~ 67
- Imagine a smooth continuous curve that will approximate this histogram
  - We will call this curve a density curve later
  - Most observations cluster over the center of the curve
  - The midpoint of this curve is the mid-value at the peak
    - Probably the point 67
    - Treat it as the center of the distribution
  - Draw a hypothetical vertical line passing through the center of the distribution
    - Left side is similar to right side
    - This would be an example of a nearly symmetric distribution
    - We will talk about this soon!
- mode ~ 67
- mean ~ 67
- median ~ 67
Note: The common bin width has been chosen to be 2. Why not 3, 4, or 5?
- Technically speaking, you can choose any positive real number
- But you need to be careful that the common bin width is neither too large nor too small for the data
- Imagine considering the interval width to be 10, what would happen to the number of bins?
  - It decreases
  - Structure would be very different
- If the interval is very small, the number of bins will increase
  - This would not be very meaningful for us.

Important Remarks on the Choice of Bins

Choice of the number of bins (or, equivalently common bin width) is somewhat subjective.
If you use too few bins, the histogram doesn’t really portray the data very well, and fails to convey any information about the distribution.
If you have too many bins, you get a broken comb look, which also doesn’t give a sense of the distribution.
Square Root Choice: Number of bins k ≈ √n (rounded to the next positive integer)
Freedman-Diaconis Rule:

$$\text{Bin width} = 2 \times \mathrm{IQR} \times n^{-\frac{1}{3}}$$

where IQR stands for the Inter Quartile Range, a measure of spread
of a frequency distribution, and n being the total number of
observations.
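
Both rules are easy to compute. The sketch below is an illustration with hypothetical data and function names. Note that quartile conventions differ between textbooks and software; here the IQR comes from Python's `statistics.quantiles` with its default method, which is an assumption and may not match the convention used in this course.

```python
import math
import statistics

def sqrt_bins(n):
    """Square-root choice: sqrt(n), rounded up to an integer."""
    return math.ceil(math.sqrt(n))

def freedman_diaconis_width(data):
    """Bin width = 2 * IQR * n^(-1/3)."""
    q1, _, q3 = statistics.quantiles(data, n=4)   # the three quartile cut points
    return 2 * (q3 - q1) * len(data) ** (-1 / 3)

data = list(range(1, 101))   # hypothetical data: the integers 1..100
print(sqrt_bins(len(data)), round(freedman_diaconis_width(data), 2))
```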

Density Function

Imagine we have the possibility of infinite amounts of data

You can have an indefinitely large number of sample observations

Density Function for Continuous Data
Pasted image 20250902145743.png|500
For continuous data, assuming we have a sufficiently large number of
observations, a histogram can be well approximated by a smooth, continuous curve (function) known as the (probability) density function.

Note:

The bars of a histogram always lie on or above the horizontal axis
- The density curve approximates the histogram
- So the associated density function must always be non-negative
The total area of our histogram is 1
- So the total area under the density curve is 1
- This is because the relative frequencies have a total sum of 1

Shape

Mode: Value that corresponds to a prominent peak or mound.

If the histogram has a single prominent peak, the data distribution is called a uni-modal distribution.
There may be more than one mode (prominent peak).
If the histogram has multiple prominent peaks, the data distribution is said to be multi-modal.
- Bi-modal in case of two prominent peaks.
- Tri-modal in case of three prominent peaks, and so on.
If a histogram has no prominent peaks, it is called a uniform distribution.
For a multi-modal distribution, considering mode as a measure of center can be misleading.

Pasted image 20250902151141.png|500

Mode is approximately 100
Prominent peak on the left is about 0, prominent peak on the right is about 9.
- The center of this distribution should be the mid-point; in this example it is 5
- This data distribution is actually roughly symmetric

This shows that the mode is not always a measure of center.

Symmetric vs. Skewed distributions

Pasted image 20250902151556.png|500

Symmetric Distribution: One side of the distribution will be identical to the other side of the distribution
Asymmetric or skewed distribution:
- Skewed to the left: Left tail decays slowly toward the left as compared to the rest of the distribution
- Skewed to the right: Right tail is stretched farther to the right as compared to the rest of the distribution

Example of skewed distributions
Pasted image 20250902151834.png|500

Most income levels lie to the left of the distribution

Scatterplot

A scatterplot is a powerful graphical tool to explore the
relationship between two quantitative variables.

We examine a scatterplot to study the association between two
quantitative variables. How do values on one variable change as
values of the other variable change?

You can describe the overall pattern of a scatterplot by the trend,
direction, and strength of the relationship between the two variables.

Trend: linear, curved, clusters, no pattern
Direction: positive, negative, no direction
Strength: how closely the points fit the trend

We will return to this when we talk about ‘simple linear regression’.

Example of scatterplot

Birth Weight (g) vs. Gestational Age (weeks)
Pasted image 20250902152035.png|400

Above average gestational age corresponds to above average birth weight
- There is some kind of relationship between these two variables
Note that the trend is linear
- Explains the data well
- The points tend to cluster about this continuous linear form
- Try to fit a curved linear functional form
Trend has a positive slope
- It has a positive direction

Proportion, Average and Variance

Proportions

When the variable of interest is categorical, such as Success or Failure, Strength of Opinion, Type of Car, and so on, interest lies in the proportion in each category.

If the population has $N$ units, and $N_{i}$ units are in category $i$, then the population proportion of category $i$ is

$p_{i} = \frac{N_{i}}{N}$

If a sample of size $n$ is taken from this population, and $n_{i}$ sample units are in category $i$, then the sample proportion of category $i$ is

$\hat{p}_{i} = \frac{n_{i}}{n}$

The sample proportion $\hat{p}_{i}$ approximates (or estimates) the population proportion $p_{i}$.
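
A minimal R sketch of the sample proportion calculation. The vector `responses` is hypothetical data made up for illustration; it is not from the lecture.

```r
# Hypothetical sample of n = 10 responses to a categorical variable
responses <- c("Success", "Failure", "Success", "Success", "Failure",
               "Success", "Failure", "Success", "Success", "Success")

n     <- length(responses)   # sample size n
n_i   <- table(responses)    # counts n_i in each category
p_hat <- n_i / n             # sample proportions: n_i / n

p_hat                        # Failure = 0.3, Success = 0.7
```

The proportions over all categories always add up to 1.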

Population Average/Mean

Consider a population consisting of $N$ units, and let $X_{1}, X_{2}, \dots, X_{N}$ denote the values of some numerical variable in the population.

Then the population average or mean, denoted by $\mu$, is defined as
the arithmetic average of all numerical values in the statistical
population:

$\mu = \frac{1}{N} [X_{1} + X_{2} + \dots + X_{N}] = \frac{1}{N} \sum_{i = 1}^{N} X_{i}$ .

(The above definition of the population mean is not complete. Here, we implicitly assumed that the population is finite, and that all the variable values in the statistical population have equal weights, which may not hold in practice. We need probability theory to introduce a more formal definition of the population mean.)

Sample Average/Mean

If a sample of size $n$ is randomly selected from the population, and
if $x_{1}, x_{2}, \dots, x_{n}$ denote the variable values corresponding to the
sample units, then the sample average or the sample mean is

Pasted image 20250902152758.png|275

The sample mean $\bar{x}$ approximates but is, in general, different from
the population mean $\mu$ .

The notation
Pasted image 20250902152813.png|275
is used to denote the sum of the x values, and should be read as
‘summation over $x_{i}$ where the index variable $i$ ranges from 1 to $n$’.

Sample Median

Let $x_{1}, \dots, x_{n}$ be the values of a numerical variable in a randomly
selected sample of size $n$. Further assume that these values are
arranged in ascending order (that is, from smallest to largest).

Then, the sample median, denoted $x_{Me}$ , is defined as that value
which satisfies the following:

at least 50% of the observations ≤ $x_{Me}$ , and
at least 50% of the observations ≥ $x_{Me}$ .

The median depends directly on the position (rank) of the observations
The mean depends directly on the values of all the observations

Assume the data values $x_{1}, \dots, x_{n}$ are arranged in ascending
order:

If n is odd, the sample median $x_{Me}$ would be $x_{\frac{n + 1}{2}}$
If n is even, the sample median $x_{Me}$ would be $(x_{\frac{n}{2}} + x_{\frac{n}{2} + 1}) / 2$

Example:

Sample: 1, 5, 6, 4, 3 (in ascending order: 1, 3, 4, 5, 6)
$\bar{x}$ = (1 + 5 + 6 + 4 + 3) / 5
= 19/5 = 3.8
n = 5 (odd)
$x_{Me}$ = $x_{\frac{n + 1}{2}}$ = $x_{3}$ = 4
% of observations <= 4
- = 3/5 * 100
- = 60%
% of observations >= 4
- = 3/5 * 100
- = 60%
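
The example above can be checked with a short R sketch. It uses the same five values; the variable names are chosen here for illustration.

```r
# The five observations from the example (not yet sorted)
x <- c(1, 5, 6, 4, 3)

x_bar <- mean(x)      # sample mean: 19 / 5 = 3.8
x_me  <- median(x)    # sample median: middle value of 1, 3, 4, 5, 6

# Percentage of observations at or below / at or above the median
pct_below <- mean(x <= x_me) * 100
pct_above <- mean(x >= x_me) * 100

c(mean = x_bar, median = x_me, pct_below = pct_below, pct_above = pct_above)
```

Both percentages come out to 60%, so the "at least 50%" condition holds on each side.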

Comparing Mean and Median

Comparing the median to the mean:

Advantages of the median: resistant to extreme values, easy to describe
Disadvantages of the median: not as mathematically tractable, need to sort the data to calculate it

In a skewed distribution, the mean is farther out in the long tail than is the median.

For skewed distributions the median is preferred because it better represents what is typical.
- Example: Income distribution
  - Heavily skewed to the right
  - Consider the median income instead of the average income
- Example: Lifespan distribution
  - Skewed to the left
  - Use median
However, despite this advantage of the sample median, the sample mean is often preferred because of its mathematical tractability.

Pasted image 20250904142053.png|400

If the data distribution is unimodal, the following relationships typically hold (as a rule of thumb):

Symmetric: Mean = Median = Mode
Right-skewed: Mean > Median > Mode
Left-skewed: Mean < Median < Mode
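
A small R sketch of how a long right tail pulls the mean but not the median. The `incomes` values are hypothetical (in thousands) and chosen only for illustration.

```r
# Hypothetical incomes (in thousands); the last value creates a long right tail
incomes <- c(30, 35, 40, 45, 50, 400)

mean_all   <- mean(incomes)     # 600 / 6 = 100
median_all <- median(incomes)   # (40 + 45) / 2 = 42.5

# The same sample without the extreme value
trimmed        <- incomes[incomes < 400]
mean_trimmed   <- mean(trimmed)     # 40
median_trimmed <- median(trimmed)   # 40

c(mean_all, median_all, mean_trimmed, median_trimmed)
```

One extreme value moves the mean from 40 to 100, while the median only moves from 40 to 42.5. With the extreme value included, the mean exceeds the median, as expected for a right-skewed sample.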

Measures of Spread

Much information is lost in reducing a list of observations to a single summary measure, such as the mean or the median. A measure of center alone is not enough to adequately describe a quantitative variable.

Both the mean and the median are measures of location

Consider for example, the following distributions:

Pasted image 20250904142633.png|300

If we only report the center of this distribution (100), would that be enough to describe its other characteristics?
- No
- For the taller distribution, most observations are concentrated in a small interval around the mean (100)
- The other distribution in the background has a smaller height and longer tails on both sides, so more of its observations lie far from the center = a greater degree of variability
- Takeaway: Although both distributions have the same measure of location, they have different degrees of variability
There are various measures of the spread of data
- Sample variance
- Sample standard deviation
- Range

A measure of center (or location) tells us nothing about the spread or variability of the data.

Population Variance and Standard Deviation

Consider a population consisting of $N$ units, and let $X_{1}, X_{2}, \dots, X_{N}$
denote the values of some numerical variable in the population.
Then the population variance, denoted by $\sigma^{2}$, is defined as

Pasted image 20250904143205.png|200

(The above definition of the population variance is not complete. Here, we implicitly assumed that the population is finite, and all the variable values in the statistical population have equal weights which may not hold in practice. We need probability theory to introduce a more formal definition of the population variance.)

In practice, a population is often effectively infinite (or too large to enumerate), which is why the more formal definition is needed

The positive square root of the population variance is called the
population standard deviation and is denoted by σ. That is,

Pasted image 20250904143217.png|200

Sample Variance and Standard Deviation

If a sample $x_{1}, x_{2}, \dots, x_{n}$ is randomly selected from the population,
then the sample variance, denoted by $S^{2}$, is defined as

$S^{2} = \frac{1}{n - 1} \sum_{i = 1}^{n} (x_{i} - \bar{x})^{2}$

Often used to draw inferences about the unknown population variance $\sigma^{2}$
Unbiased estimator of the population variance $\sigma^{2}$
On average, $S^{2}$ neither underestimates nor overestimates $\sigma^{2}$

(The above definition with divisor $(n - 1)$ makes the sample variance $S^{2}$ an unbiased estimator of the population variance $\sigma^{2}$. This means it neither underestimates nor overestimates $\sigma^{2}$, and is exactly on target on average.)

The positive square root of the sample variance $S^{2}$ is called the
sample standard deviation and is denoted by $S$ , that is,

$S = \sqrt{\frac{1}{n - 1} \sum_{i = 1}^{n} (x_{i} - \bar{x})^{2}}$
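
A minimal R sketch that applies the two formulas directly and compares them with the built-in `var()` and `sd()`, which also use the divisor n − 1. The data are the five values 1, 5, 6, 4, 3 from the earlier median example; the names `s2_manual` and `s_manual` are chosen here for illustration.

```r
x <- c(1, 5, 6, 4, 3)
n <- length(x)
x_bar <- mean(x)                            # 3.8

# Sample variance: sum of squared deviations divided by (n - 1)
s2_manual <- sum((x - x_bar)^2) / (n - 1)   # 14.8 / 4 = 3.7

# Sample standard deviation: positive square root of the variance
s_manual <- sqrt(s2_manual)

# The built-in functions should give the same values
c(manual = s2_manual, builtin = var(x))
c(manual = s_manual,  builtin = sd(x))
```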

Any measure of variability of a data distribution must always be non-negative
If $S^{2}$ is small (close to 0), it means a small variability. In the limiting case:
- $S^{2} = 0$
- $S = 0$
- $x_{i}$ = $\bar{x}$ for all i
  - Means all observations must be equal
  - In such a case, any appropriate measure of spread should be zero
If the sample variance is large, the sample standard deviation is also large, which means a large variability
- There must be some observations which are away from the center of the distribution
- If the data distribution is heavily skewed, that is going to have a great impact on the values of $S$ and $S^{2}$
  - They are not suitable for use in the presence of outliers or heavily skewed distributions (extreme cases)

Properties of Variance or Standard Deviation

A small variance or standard deviation indicates a small amount of variation among the variable values.
If the variance or the standard deviation is zero, all the observations must be equal to the mean, and conversely. In case of no variability, a measure of spread must be zero.
A large variance or standard deviation indicates a large degree of variability among the variable values, and there are some values which are away from the mean.
Both sample variance and sample SD are not resistant to outliers. Strong skewness, or the presence of a few outliers can greatly increase their values.
For any two real numbers $\alpha$ and $\beta$,

$\mathrm{Var}(\alpha + \beta x) = \beta^{2} \mathrm{Var}(x)$, and $\mathrm{SD}(\alpha + \beta x) = |\beta| \mathrm{SD}(x)$.

(True for both population and sample.)
- This result says that both the variance and the standard deviation are invariant under a change of location, but depend on a change of scale
  - Example: Adding 5 points to every student's score does not change the variability
  - Example: Multiplying every score by 2 changes the spread of the distribution: the variance is multiplied by 4 and the standard deviation by 2
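
The location and scale property can be checked numerically in R. The `scores` vector is hypothetical data made up for illustration.

```r
# Hypothetical exam scores
scores <- c(62, 75, 81, 90, 58)

# Change of location: adding 5 points to everyone leaves the variance unchanged
var_shifted <- var(scores + 5)

# Change of scale: doubling every score multiplies the variance by 2^2 = 4
var_scaled <- var(2 * scores)

# The standard deviation scales by |beta|, so a factor of -2 gives 2 * SD
sd_negscaled <- sd(-2 * scores)

c(var(scores), var_shifted, var_scaled / var(scores), sd_negscaled / sd(scores))
```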

Median, Quartiles, and Boxplots

Let $x_{1}, \dots, x_{n}$ be the values of a quantitative variable in a randomly selected sample of size $n$.

We previously defined the sample median $x_{Me}$, whereby at least 50% of the observations are no larger than $x_{Me}$, and at least 50% of the observations are no smaller than $x_{Me}$.

We can extend this concept to other fractions, such as

The first quartile Q1 (0.25 quantile): It’s that value such that
- at least 25% of observations ≤ Q1, and
- at least 75% of observations ≥ Q1.
The third quartile Q3 (0.75 quantile): It’s that value such that
- at least 75% of observations ≤ Q3, and
- at least 25% of observations ≥ Q3.

Question: What would be the second quartile Q2?

Answer: The sample median $x_{Me}$. At least 50% of the observations are ≤ Q2, and at least 50% are ≥ Q2.

Generalization:

Given: 0 < p < 1
- The 100p-th percentile (equivalently, the p-th quantile) is that value ${\hat{x}}_{1 - p}$ for which
  1. At least 100p% of the observations are <= ${\hat{x}}_{1 - p}$
  2. At least 100(1-p)% of the observations are >= ${\hat{x}}_{1 - p}$

Formula for finding Quartiles

Suppose we wish to find the k-th quartile, for k = 1,2,3.

Let n be the total number of observations or data points.
Assume the observations $x_{1}, x_{2}, \dots, x_{n}$ are arranged in ascending order.
Compute $h = (n - 1) \times (k / 4) + 1$ . Note that $h$ does not need to be an integer; it may have a decimal part.
Then the k-th quartile is given by the following equation:

$x_{\lfloor h \rfloor} + (h - \lfloor h \rfloor) \times (x_{\lceil h \rceil} - x_{\lfloor h \rfloor})$,

where
- ⌊h⌋ = largest integer less than or equal to h (floor of h), and
- ⌈h⌉ = smallest integer greater than or equal to h (ceiling of h).

Example

We have a sample of size n = 12:

9.39,7.04,7.17,13.28,10.53,7.46,11.97,8.39,21.06,12.68,13.19,8.50

Find the quartiles and the sample average

Manually

Process:

Order observations in an ascending fashion:
- 7.04, 7.17, 7.46, 8.39, 8.50, 9.39, 10.53, 11.97, 12.68, 13.19, 13.28, 21.06
There are 12 Observations.
Calculate Q1
- Order of quartile: 1
- h = $(n - 1) * 1 / 4 + 1 = 11 / 4 + 1 = 3.75$
  - ⌊h⌋ (floor of h) = 3
  - ⌈h⌉ (ceiling of h) = 4
- Q1 = $x_{\lfloor h \rfloor} + (h - \lfloor h \rfloor) * (x_{\lceil h \rceil} - x_{\lfloor h \rfloor})$
  - = $x_{3} + (3.75 - 3) (x_{4} - x_{3})$
  - = $7.46 + 0.75 (8.39 - 7.46)$
  - = $8.1575$
Calculate Q2 (xMe) - Median
- Order of quartile: 2
- n = 12 is even
- Q2 = $\frac{1}{2} (x_{6} + x_{7}) = \frac{9.39 + 10.53}{2} = \frac{19.92}{2} = 9.96$
Calculate Q3
- Order of quartile: 3
- h = $(n - 1) \times (\frac{3}{4}) + 1 = 11 \times \frac{3}{4} + 1 = 9.25$
  - ⌊h⌋ (floor of h) = 9
  - ⌈h⌉ (ceiling of h) = 10
- Q3 = $x_{9} + (9.25 - 9) * (x_{10} - x_{9})$
  - 12.68 + ...
  - ...

Using R

Open a new R script and create an R vector

# Create a vector of the observations
x = c(9.39,7.04,7.17,13.28,10.53,7.46,11.97,8.39,21.06,12.68,13.19,8.50)
	# Note the observations are not arranged in ascending order here

x = sort(x)   # This will arrange x in ascending order

# Need to specify the order (probability) of each quantile
Q1 = quantile(x,0.25)
Q2 = quantile(x,0.50)
Q3 = quantile(x,0.75)

# We are ready to print the values of these 3 quantiles
Q1
Q2
Q3

Q1
8.1575
Q2
9.96
Q3
12.8075

Exactly the same results as when we manually used the equations
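
The agreement is not a coincidence: by default `quantile()` uses `type = 7`, which interpolates at position h = (n − 1)p + 1, the same rule as the formula above. As a sketch, the manual formula can be written as a small R function and compared with the built-in. The helper name `quartile_k` is hypothetical and introduced here only for illustration. The last line also computes the sample average that the example asks for.

```r
x <- sort(c(9.39, 7.04, 7.17, 13.28, 10.53, 7.46,
            11.97, 8.39, 21.06, 12.68, 13.19, 8.50))

# k-th quartile (k = 1, 2, 3) from the interpolation formula in these notes.
# Assumes x_sorted is already in ascending order.
quartile_k <- function(x_sorted, k) {
  n  <- length(x_sorted)
  h  <- (n - 1) * (k / 4) + 1
  lo <- floor(h)
  hi <- ceiling(h)
  x_sorted[lo] + (h - lo) * (x_sorted[hi] - x_sorted[lo])
}

manual  <- sapply(1:3, function(k) quartile_k(x, k))
builtin <- unname(quantile(x, c(0.25, 0.50, 0.75)))

manual          # 8.1575  9.9600 12.8075
builtin         # same values
x_bar <- mean(x)  # 130.66 / 12, about 10.89
x_bar
```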

Inter-Quartile Range

Pasted image 20250904150936.png|300

About 50% of the data points lie between Q1 and Q3
...

We define the inter-quartile range (IQR) to be
$IQR = Q_{3} - Q_{1}$

Situations:
- Q3 - Q1 is close to 0
  - About 50% of the observations fall inside a very narrow interval
    - In this situation the data have a very high concentration of values = a very small degree of variability
- Q3 - Q1 is large (the quartiles are far apart)
  - The overall spread of the dataset is very high
  - Variability of the dataset is quite large
- The difference Q3 - Q1 is regarded as a measure of concentration or spread for a given distribution

This gives us another indication of how ‘spread out’ the values are.

Question: What is the IQR for the preceding example?

Interquartile range in R

IQR = Q3 - Q1
IQR
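
For the preceding example the answer works out to 12.8075 − 8.1575 = 4.65. A self-contained R sketch that computes it from the quartiles and compares it with the built-in `IQR()` function, which uses the same default quantile rule. The names `iqr_manual` and `iqr_builtin` are chosen here for illustration.

```r
x <- c(9.39, 7.04, 7.17, 13.28, 10.53, 7.46,
       11.97, 8.39, 21.06, 12.68, 13.19, 8.50)

q <- quantile(x, c(0.25, 0.75))     # Q1 and Q3

iqr_manual  <- unname(q[2] - q[1])  # Q3 - Q1 = 4.65
iqr_builtin <- IQR(x)               # built-in equivalent

c(manual = iqr_manual, builtin = iqr_builtin)
```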

Inter-Quartile Range and Outliers

The IQR is the range of the middle 50% of the data values.
The IQR is resistant to outliers.
The IQR can help to determine potential outliers.

Definition:

An observation x is said to be an outlier if

x < Q1 − 1.5 × IQR or x > Q3 + 1.5 × IQR.
- Lower Fence (LF) = Q1 − 1.5 × IQR
- Upper Fence (UF) = Q3 + 1.5 × IQR
- Outlier: an observation that does not fall between the lower and upper fences

An observation x is said to be an extreme outlier if

x < Q1 − 3 × IQR or x > Q3 + 3 × IQR.

Question: Identify the presence of any potential outlier in the preceding example.
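
One way to answer this in R, as a sketch that applies the fence definitions above to the sample of size 12. The variable names are chosen here for illustration.

```r
x <- c(9.39, 7.04, 7.17, 13.28, 10.53, 7.46,
       11.97, 8.39, 21.06, 12.68, 13.19, 8.50)

q1  <- unname(quantile(x, 0.25))   # 8.1575
q3  <- unname(quantile(x, 0.75))   # 12.8075
iqr <- q3 - q1                     # 4.65

lower_fence <- q1 - 1.5 * iqr      # 1.1825
upper_fence <- q3 + 1.5 * iqr      # 19.7825

# Outliers: observations outside the fences
outliers <- x[x < lower_fence | x > upper_fence]

# Extreme outliers: observations beyond 3 * IQR from the quartiles
extreme <- x[x < q1 - 3 * iqr | x > q3 + 3 * iqr]

outliers   # 21.06
extreme    # none
```

The value 21.06 lies above the upper fence of 19.7825, so it is a potential outlier. It is below Q3 + 3 × IQR = 26.7575, so it is not an extreme outlier.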

Box Plots

Box plots (also called box-and-whisker plots or box-whisker plots) give a good graphical image of the concentration of the data.

They also show how far the extreme values are from most of the data.

A box plot is constructed from five descriptive measures:

minimum
first quartile
median
third quartile
maximum

These measures together are called the five-number summary of a data set.

Anatomy of a Box Plot

Example of a vertical Box plot:
Pasted image 20250904151835.png|400

The box represents the middle 50% of the data.
The upper (right) whisker represents the upper 25% of the data.
- Its limit is the largest observation within the upper fence
The lower (left) whisker represents the lower 25% of the data.
- It is constructed down to the smallest observation within the lower fence
- In this example it also corresponds to the smallest observation, since there are no other observations below it
Any observation above the upper whisker or below the lower whisker would be a potential outlier.
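
A sketch of these quantities in R for the earlier sample of size 12. One caveat: R's `boxplot()` draws the box at Tukey's hinges rather than at the quartiles from `quantile()`, so the box edges and fences can differ slightly from the hand calculation. Here `plot = FALSE` returns the numbers without drawing; dropping that argument draws the plot.

```r
x <- c(9.39, 7.04, 7.17, 13.28, 10.53, 7.46,
       11.97, 8.39, 21.06, 12.68, 13.19, 8.50)

# Five-number summary using the quartile formula from these notes
five <- quantile(x, c(0, 0.25, 0.50, 0.75, 1))
five

# Statistics that boxplot() would draw, without opening a plot
b <- boxplot(x, plot = FALSE)
b$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
b$out     # points plotted individually as potential outliers
```

For this sample the upper whisker stops at 13.28, the largest observation within the upper fence, and 21.06 is flagged as a potential outlier under both conventions.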

Comparing Histogram and Box-plot

Pasted image 20250904152316.png|600

Histogram: Data distribution is very symmetric
- Median lies at an equal distance from Q3 and Q1
  - Q2-Q1 = Q3-Q2
Box-plot: Median lies approximately at the same distance from Q3 and Q1
- Approximately an equal number of observations on the upper and lower whiskers
Histogram: Skewed to the left
- Distance between Q2 and Q1 is much larger than between Q2 and Q3
  - Q2-Q1 > Q3-Q2
Box-plot: The median is shifted a little upward, closer to Q3 than to Q1
Histogram: Right skewed distribution
Box-plot: Median is at a much smaller distance to Q1 than to Q3
- Strong indicator that the underlying data distribution is a right skewed distribution

Which Measure Should We Use?

The mean and the standard deviation are good measures of center, and spread, respectively, for symmetric bell-shaped distributions.
If the data are heavily skewed or have outliers, the median and the IQR are more appropriate measures of center and spread, respectively.

Summary - Which Measure To Use

| | Symmetric Bell-Shaped | Skewed or Outliers |
| --- | --- | --- |
| Measure of Center | Mean | Median |
| Measure of Spread | Standard Deviation | IQR |