4. Random Variables and Categorical Variables

9 minute video showing how to solve the Random Variables HW

Note. Some of the mathematics might not display properly on your cell phone. If this is the case, try viewing in landscape mode, or better yet, on a regular computer screen.

Note. An R script to make a bar plot of a distribution is at the bottom of this page.

Easy Definition. A “variable” in statistics gives you information about the members of a population. If the information is a measurement that gives you a number (like height, weight, GPA), the measurement is called a random variable. If the information is descriptive (like eye color, nationality, favorite sport), it is called a categorical variable.

Here is a more abstract, general definition of random and categorical variables.

Let S be the set of all possible outcomes of some process, and suppose that P is a probability on S. Then any function from S to the real numbers is called a random variable. You can think of a random variable as a measurement, such as height, weight, GPA, or income: almost anything that produces a number.

Any function from S to a category is called a categorical variable (or a nominal variable). Categories are things like color, food, country, people’s names, anything descriptive.

Note. Occasionally numbers can be used as categories, for example, if the number is being used to identify something (rather than measure it). Examples of categorical variables that are numeric: zip codes, telephone numbers, social security numbers, student ID numbers.

Example of Random and Categorical Variable when S is a population. Typically, when studying a population we’ll make many different types of measurements (random variables) and we’ll divide the population into many different categories. In the Figure below we have the random variables X = height, Y = weight, and the categorical variables c = favorite color and h = home state (state they live in).

The Figure below shows a table called a “data frame”. It holds the data from a sample of size 3, the sample consisting of {abe, ben, chris}. The rows correspond to the members of the sample. The columns correspond to random or categorical variables.
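As a minimal sketch of how such a table might be built in R, the code below creates a data frame for the sample {abe, ben, chris}. The column names X, Y, c, and h follow the text, but every value (and the units) is invented here purely for illustration and is not taken from the Figure.

```r
# Hypothetical data: all values below are made up for illustration only.
sample_df <- data.frame(
  X = c(70, 68, 72),                # height (inches is an assumed unit)
  Y = c(160, 150, 180),             # weight (pounds is an assumed unit)
  c = c("blue", "green", "red"),    # favorite color
  h = c("NY", "NJ", "CT"),          # home state
  row.names = c("abe", "ben", "chris")
)
print(sample_df)

# Random variables are numeric columns; categorical variables are not
print(sapply(sample_df, is.numeric))
```

Each row is one member of the sample, and each column is one variable, which matches the layout described above.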

Example of a Random Variable (toss a coin twice). We toss a fair coin twice. The set of all possible outcomes for this situation is:

S = {HH, HT, TH, TT},

where, for example:

HT = heads on the first toss and tails on the second toss.

Let X be the random variable which counts how many heads. So:

X(HH) = 2
X(HT) = 1
X(TH) = 1
X(TT) = 0

So, we say X takes on the values 0, 1, 2.

If the coin is fair, then heads and tails are equally likely on each toss, so all 4 outcomes in S are also equally likely. Let P be the equally likely probability on S, so:

P(HH) = P(HT) = P(TH) = P(TT) = 1/4

Recall that X counts heads, so:

P(X = 0) = P(TT) = 1/4
P(X = 1) = P(HT, TH) = P(HT) + P(TH) = 1/4 + 1/4 = 2/4
P(X = 2) = P(HH) = 1/4

The list:
P(X = 0) = 1/4
P(X = 1) = 2/4
P(X = 2) = 1/4
is called the distribution of X.

Note that the distribution always sums up to 1.
In this example we have 1/4 + 2/4 + 1/4 = 1.
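The calculation above can be reproduced in R. This is only a sketch: the variable names S, X, P, and dist_X are chosen here to mirror the notation in the text.

```r
# Sample space for two tosses of a fair coin
S <- c("HH", "HT", "TH", "TT")

# X counts the H's in each outcome
X <- sapply(strsplit(S, ""), function(tosses) sum(tosses == "H"))
names(X) <- S
print(X)

# Each outcome has probability 1/4; add up the probabilities for each value of X
P <- rep(1 / length(S), length(S))
dist_X <- tapply(P, X, sum)
print(dist_X)

# The distribution sums to 1
print(sum(dist_X))
```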

Distribution of a Random Variable

If the random variable X takes on only N distinct (finitely many) values:

$x_1, x_2, \ldots, x_N$

then the distribution of X is the list

$P(X = x_1), P(X = x_2), \ldots, P(X=x_N)$

Note. The distribution always sums up to 1.

$P(X = x_1) + P(X = x_2) + \cdots + P(X = x_N) = 1$
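One way to check a candidate distribution in R is sketched below. The helper name is_distribution and its tolerance argument are hypothetical, introduced only for this illustration; the check also uses the standard fact that probabilities are never negative.

```r
# Hypothetical helper: TRUE if p is a list of probabilities that sums to 1
is_distribution <- function(p, tol = 1e-9) {
  all(p >= 0) && abs(sum(p) - 1) < tol
}

is_distribution(c(1/4, 2/4, 1/4))   # the two-toss distribution: TRUE
is_distribution(c(1/4, 1/4, 1/4))   # sums to 3/4, so FALSE
```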

Example (toss a coin three times). We toss a fair coin three times. The set S of all possible outcomes is:

S = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT }
|S| = 8

It’s a fair coin, so each of the |S| = 8 outcomes is equally likely. So:

P(HHH) = P(HHT) = P(HTH) = P(HTT) = P(THH) = P(THT) = P(TTH) = P(TTT) = 1/8

Let X be the random variable on S that counts how many H’s are in an outcome. So, X can take on the values 0, 1, 2, 3.

We find the distribution of X:

P(X = 0) = P(TTT) = 1/8
P(X = 1) = P(TTH, THT, HTT) = P(TTH) + P(THT) + P(HTT) = 1/8 + 1/8 + 1/8 = 3/8
P(X = 2) = P(HHT, HTH, THH) = P(HHT) + P(HTH) + P(THH) = 1/8 + 1/8 + 1/8 = 3/8
P(X = 3) = P(HHH) = 1/8

The list:
P(X = 0) = 1/8
P(X = 1) = 3/8
P(X = 2) = 3/8
P(X = 3) = 1/8
is the distribution of X. Note that the distribution always sums to 1. In this example we have 1/8 + 3/8 + 3/8 + 1/8 = 8/8 = 1.
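As a sketch, the same distribution can be found in R by listing all 8 outcomes and counting heads. The column names first, second, and third are assumptions made for readability.

```r
# All 2^3 = 8 outcomes of three tosses, one row per outcome
tosses <- expand.grid(first  = c("H", "T"),
                      second = c("H", "T"),
                      third  = c("H", "T"),
                      stringsAsFactors = FALSE)

# X counts the heads in each row
X <- rowSums(tosses == "H")

# Outcomes are equally likely, so P(X = k) = (number of outcomes with k heads) / 8
dist_X <- table(X) / nrow(tosses)
print(dist_X)
```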

Working with Random Variables and Distributions

Question 1. Suppose we toss a fair coin three times. Let X be the random variable that counts how many heads we get. Find $P(X \geq 2)$.

Answer to Question 1. Using the distribution of X which we calculated in the previous example we get:

$P(X \geq 2) = P(X = 2) + P(X=3)$
$= \dfrac{3}{8} + \dfrac{1}{8} $
$ = \dfrac{4}{8} = 0.5$
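The same sum can be computed in R from the distribution. This is a small sketch; the names values, probs, and p_at_least_2 are chosen here for illustration.

```r
# Distribution of X for three tosses of a fair coin
values <- 0:3
probs  <- c(1, 3, 3, 1) / 8

# P(X >= 2): add the probabilities of the values that are at least 2
p_at_least_2 <- sum(probs[values >= 2])
print(p_at_least_2)   # 0.5
```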

Question 2. Suppose we toss a fair coin four times. Let X be the random variable that counts how many heads we get. Find the distribution of X.

Answer to Question 2. We showed in an earlier example that the set of all possible outcomes for tossing a coin 3 times is

$S_3 = \{HHH, HHT, HTH, HTT, $
$THH, THT, TTH, TTT \}$
$|S_3| = 2^3 = 8$

If we toss a coin 4 times the set of all possible outcomes is

$S_4 = S_3 \times \{H, T\}$
$ = \{HHHH, HHTH, HTHH, HTTH, $
$THHH, THTH, TTHH, TTTH, $
$HHHT, HHTT, HTHT, HTTT, $
$THHT, THTT, TTHT, TTTT \}$
and so, by the above, or just using the multiplication principle, we get
$ |S_4| = |S_3| \times 2 = 2^3 \times 2 = 2^4 = 16$
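The construction of the four-toss outcomes from the three-toss outcomes can be sketched in R as follows. The names S3 and S4 mirror the notation in the text.

```r
# Outcomes for three tosses
S3 <- c("HHH", "HHT", "HTH", "HTT", "THH", "THT", "TTH", "TTT")

# Append H to every outcome, then append T to every outcome
S4 <- c(paste0(S3, "H"), paste0(S3, "T"))
print(S4)

# |S4| = |S3| * 2
print(length(S4))   # 16
```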

Since the coin is fair all 16 possible outcomes are equally likely. Hence:

$P(X = 0) = P(TTTT) = 1/16$
$P(X = 1) = P(HTTT, THTT, $
$TTHT, TTTH ) = 4/16$
By symmetry (interchanging H and T turns an outcome with 1 head into an outcome with 3 heads, and an outcome with 0 heads into one with 4 heads) it follows that:
$P(X = 3) = 4/16$
$P(X = 4) = 1/16$
We can find P(X = 2) using the fact that the distribution always sums to 1.

$P(X = 0) + P(X = 1) + P(X = 2) + P(X = 3) + P(X = 4) = 1$
So,
$P(X = 2) = 1 - (\ P(X = 0) + P(X = 1) + P(X = 3) + P(X = 4)\ )$
$P(X = 2) = 1 - \left( \dfrac{1}{16} + \dfrac{4}{16} + \dfrac{4}{16} + \dfrac{1}{16} \right )$
$P(X = 2) = 1 - \left( \dfrac{10}{16} \right )$
$P(X = 2) = \dfrac{6}{16}$

So, we get the distribution of X:
$P(X = 0) = 1/16$
$P(X = 1) = 4/16$
$P(X = 2) = 6/16$
$P(X = 3) = 4/16$
$P(X = 4) = 1/16$
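As a cross-check, the sketch below counts heads over all 16 outcomes directly in R, instead of using symmetry and the sum-to-1 rule. The variable names are chosen here for illustration.

```r
# All 2^4 = 16 outcomes of four tosses, one row per outcome
tosses <- expand.grid(rep(list(c("H", "T")), 4), stringsAsFactors = FALSE)

# X counts the heads in each row
X <- rowSums(tosses == "H")

# Number of outcomes for each value of X, then the probabilities
counts <- table(X)
print(counts)
print(counts / nrow(tosses))
```

Direct counting gives 6 outcomes with exactly two heads, which agrees with P(X = 2) = 6/16.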

We can represent this distribution as a bar plot. See Figure below.

Here is the R script that was used to create the bar plot shown above.

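# Bar plot of the distribution of X (number of heads in 4 tosses of a fair coin)
probs <- c(1, 4, 6, 4, 1) / 16   # P(X = 0), ..., P(X = 4)

# barplot() returns the x-coordinates of the bar midpoints
b <- barplot(probs,
             names.arg = 0:4,
             ylim = c(0, 0.5),
             xlab = "X",
             ylab = "Probability",
             main = "Distribution of X",
             col = "blue")

# Label each bar with its probability, just above the top of the bar
text(b, probs,
     labels = c("1/16", "4/16", "6/16", "4/16", "1/16"),
     adj = c(0.5, -0.5))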

For help with using R see my R webpage:
https://mccarthymat150.commons.gc.cuny.edu/r/

Professor McCarthy Statistics

Mat 150 BMCC
