Exercises - BIOSTATISTICS A first course in probability theory and statistics for biologists

1. In a study area three insect species (A, B, C) can be found. We place an insect trap, and after some time we look at the insects that have been caught. Let

E₁ := a specimen of species A can be found in the trap, P(E₁) := 0.1,

E₂ := a specimen of species B can be found in the trap, P(E₂) := 0.2,

E₃ := a specimen of species C can be found in the trap, P(E₃) := 0.3.

Assuming that the events E₁, E₂, E₃ are independent, find the probability of the following events:

(a) There is no specimen of species A in the trap.

There is no specimen of species B in the trap.

There is no specimen of species C in the trap.

(b) There is an insect in the trap.

(d) There are specimens of all the three species in the trap.

(e) There are specimens of at most two species in the trap.

(Cf. Example 1.7.) Solution:

(a) P(Ē₁) = 1 − P(E₁) = 0.9, P(Ē₂) = 1 − P(E₂) = 0.8, P(Ē₃) = 1 − P(E₃) = 0.7.

(b) P(E₁ + E₂ + E₃) = P(E₁ + E₂) + P(E₃) − P((E₁ + E₂)E₃)

= P(E₁) + P(E₂) − P(E₁E₂) + P(E₃) − P(E₁E₃ + E₂E₃)

= P(E₁) + P(E₂) − P(E₁)P(E₂) + P(E₃) − [P(E₁E₃) + P(E₂E₃) − P(E₁E₂E₃)]

= P(E₁) + P(E₂) + P(E₃) − P(E₁)P(E₂) − P(E₁)P(E₃) − P(E₂)P(E₃) + P(E₁)P(E₂)P(E₃) = 0.496.

(d) P(E₁E₂E₃) = P(E₁)P(E₂)P(E₃) = 0.006.

(e) The event is the complement of E₁E₂E₃, so its probability is 1 − P(E₁E₂E₃) = 0.994.
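The results of parts (b), (d) and (e) can be cross-checked by brute force. The following is a minimal sketch, assuming (as the exercise does) that the three events are independent; it enumerates all eight presence/absence patterns and sums the probabilities of those belonging to each event. The helper name `prob_of` is illustrative and does not come from the text.

```python
from itertools import product

# Probabilities from the exercise: P(E1), P(E2), P(E3).
p = [0.1, 0.2, 0.3]

def prob_of(outcome):
    """Probability of one presence/absence pattern, assuming independence."""
    result = 1.0
    for present, p_i in zip(outcome, p):
        result *= p_i if present else 1.0 - p_i
    return result

# All 2^3 presence/absence patterns for species A, B, C.
outcomes = list(product([True, False], repeat=3))

p_any = sum(prob_of(o) for o in outcomes if any(o))               # part (b)
p_all = sum(prob_of(o) for o in outcomes if all(o))               # part (d)
p_at_most_two = sum(prob_of(o) for o in outcomes if sum(o) <= 2)  # part (e)

print(round(p_any, 3), round(p_all, 3), round(p_at_most_two, 3))
```

Up to floating-point rounding, the three sums agree with the values 0.496, 0.006 and 0.994 obtained in the solution.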

2. The following table contains data from a study describing the relationship between lung cancer and smoking as a risk factor. Let us consider the relative frequencies of the events as their probabilities.

        S       S̄
C      483      76
C̄      982    1412

S := the person smokes,

C := the person has lung cancer.

(a) Determine the probability that a randomly chosen person has lung cancer provided that he or she smokes / does not smoke.

(b) Calculate the relative risk of smoking, i.e., the ratio P(has lung cancer | smokes) / P(has lung cancer | does not smoke).

Solution:

(a) P(the chosen person has lung cancer provided that he or she smokes) = P(C|S) = 483 / (483 + 982) ≈ 0.330.

P(the chosen person has lung cancer provided that he or she does not smoke) = P(C|S̄) = 76 / (76 + 1412) ≈ 0.051.

(b) The relative risk of smoking is P(C|S) / P(C|S̄) ≈ 0.330 / 0.051 ≈ 6.471.
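The arithmetic of this exercise can be reproduced in a few lines. This is only a sketch using the four counts of the table (483, 76, 982, 1412); the variable names are illustrative. It also shows the effect of rounding: dividing the rounded conditional probabilities gives the value 6.471 quoted above, while dividing the unrounded ones gives approximately 6.455.

```python
# Counts from the table: lung cancer (C) or not, smoker (S) or not.
cancer_smoker, cancer_nonsmoker = 483, 76
no_cancer_smoker, no_cancer_nonsmoker = 982, 1412

# Conditional relative frequencies, used as probabilities.
p_c_given_s = cancer_smoker / (cancer_smoker + no_cancer_smoker)
p_c_given_not_s = cancer_nonsmoker / (cancer_nonsmoker + no_cancer_nonsmoker)

# Relative risk from unrounded values, and from values rounded to 3 decimals.
relative_risk = p_c_given_s / p_c_given_not_s
relative_risk_rounded_inputs = round(p_c_given_s, 3) / round(p_c_given_not_s, 3)

print(round(p_c_given_s, 3), round(p_c_given_not_s, 3))
print(round(relative_risk, 3), round(relative_risk_rounded_inputs, 3))
```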

Remark: Let the set of individuals involved in a biological experiment be Ω; a given treatment is applied to the subset E of Ω, and Ē is the control group. If we observe two kinds of results during the treatment (F and F̄, e.g., recovery and deterioration, or survival and death), then it is customary to arrange the data in the following so-called contingency table:

         E        Ē
F      |EF|     |ĒF|     |F|
F̄      |EF̄|     |ĒF̄|     |F̄|
       |E|      |Ē|      |Ω|

3. In the 1980’s a lung screening detected 90% of TB infections, and in 1% of all cases it indicated infection falsely. At that time the proportion of the infected people was 5·10⁻⁴ in the whole population.

(a) At that time what was the probability that the screening result was positive provided that the examined person was infected? (This is called the sensitivity of the examination.)

(b) At that time what was the probability that the screening result was negative provided that the examined person was healthy? (This is called the specificity of the examination.)

(d) How do you explain that in those years among 23 people with a positive screening result there was one infected on average? How could this proportion have been improved?

Solution: Let E: the person is infected, F: the result of the screening is positive.

(a) The sensitivity is P(F|E) = 0.9.

(b) On the basis of the property of conditional probability given the complement event, for the specificity we obtain P(F̄|Ē) = 1 − P(F|Ē) = 1 − 0.01 = 0.99.

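(c) By Bayes' theorem, the probability that a person with a positive screening result is infected is

P(E|F) = P(F|E)·P(E) / [P(F|E)·P(E) + P(F|Ē)·P(Ē)] = 0.9·5·10⁻⁴ / [0.9·5·10⁻⁴ + 0.01·(1 − 5·10⁻⁴)] ≈ 0.043.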

(d) Since 1 / 0.043 ≈ 23, at that time among 23 people with a positive screening result there was one infected on average. The proportion of the infected people among those with a positive screening result is independent of the size of the population. For example, the table for 10,000,000 residents looks like this:

                 infected (E)    healthy (Ē)
positive (F)         4500           99,950        104,450
negative (F̄)
                     5000        9,995,000     10,000,000

The proportion of infected among those with a positive result is 4500 : 104,450 ≈ 0.043. By increasing the sensitivity of the screening to the maximal value of 1, the proportion would only slightly improve (5000 : 104,950 ≈ 0.048). We could more effectively improve this proportion by increasing the specificity.

For instance, if the screening had falsely detected the infection in only 0.1% of the cases instead of 1%, then, assuming the original sensitivity and the original proportion of infected, the proportion of the infected people among those with a positive result would have been 4500 : 14,495 ≈ 0.310, which is approximately 7.2 times greater than the original proportion.
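The three scenarios compared above (original test, sensitivity raised to 1, false-positive rate lowered to 0.1%) can be recomputed with Bayes' theorem. The sketch below is illustrative only; the function name `positive_predictive_value` and its parameter names are assumptions introduced here, not notation from the text.

```python
def positive_predictive_value(sensitivity, false_positive_rate, prevalence):
    """P(infected | positive result), computed by Bayes' theorem."""
    true_positive = sensitivity * prevalence
    false_positive = false_positive_rate * (1.0 - prevalence)
    return true_positive / (true_positive + false_positive)

prevalence = 5e-4  # proportion of infected people in the population

original = positive_predictive_value(0.9, 0.01, prevalence)
perfect_sensitivity = positive_predictive_value(1.0, 0.01, prevalence)
better_specificity = positive_predictive_value(0.9, 0.001, prevalence)

print(round(original, 3))             # original screening
print(round(perfect_sensitivity, 3))  # sensitivity raised to 1
print(round(better_specificity, 3))   # false-positive rate lowered to 0.1%
print(round(better_specificity / original, 1))  # improvement factor
```

Because only ratios enter the formula, the population size does not appear anywhere in the computation, which is the point made in part (d).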

4. In a large population, of the alleles A and a belonging to an autosomal locus, the relative frequency of the dominant allele A is p. Assuming total panmixia and the absence of selection, find the relative frequencies of the zygotes’ (and individuals’) genotypes in the next generation.

Solution: During the formation of a zygote both parents give a single allele to the offspring. The possible outcomes are A and a, so the sample space is Ω₀ := {A, a}, the event algebra A₀ is the power set of Ω₀, and the values of the probability function P₀: A₀ → R are

P₀(∅) = 0, P₀({A}) = p, P₀({a}) = 1 − p =: q, P₀(Ω₀) = 1.

Denote the product of the probability space (Ω₀, A₀, P₀) with itself by (Ω̂, Â, P̂), which, in accordance with experience, describes the genotype of the zygote under the given conditions. On the sample space

Ω̂ = {(A♂, A♀), (A♂, a♀), (a♂, A♀), (a♂, a♀)}

by the identification of the pairs of alleles (A♂, a♀) and (a♂, A♀) as

(A♂, A♀) ↦ AA, (A♂, a♀) ↦ Aa, (a♂, A♀) ↦ Aa, (a♂, a♀) ↦ aa

we obtain a finite probability space on the sample space

Ω := {AA, Aa, aa}

consisting of three elements, where the values of the probability function P on the sets containing exactly one elementary event are

P({AA}) =p², P({Aa}) = 2pq, P({aa}) =q²,

independently of the proportions of genotypes AA, Aa, aa in the parent generation, and, consequently, the latter proportions do not change in the following generations any more (Hardy–Weinberg’s law).
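The invariance stated by the Hardy–Weinberg law can be checked numerically. The following is a minimal sketch under the assumptions of the exercise (panmixia, no selection); the value p = 0.3 and the function names are hypothetical and chosen only for illustration. The allele frequency of A in a generation is recovered from the genotype proportions as P({AA}) plus half of P({Aa}).

```python
def zygote_frequencies(p):
    """Genotype frequencies p^2, 2pq, q^2 for allele frequency p of A."""
    q = 1.0 - p
    return {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}

def allele_frequency(genotypes):
    """Frequency of allele A among the gametes of a given generation."""
    return genotypes["AA"] + 0.5 * genotypes["Aa"]

p = 0.3  # hypothetical frequency of the dominant allele A

first_generation = zygote_frequencies(p)
p_next = allele_frequency(first_generation)  # equals p, since p^2 + pq = p(p + q)
second_generation = zygote_frequencies(p_next)

print(first_generation)
print(second_generation)
```

Up to floating-point rounding, both generations show the same genotype proportions, as the law asserts.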

Chapter 4 Random variables

In the previous chapters random phenomena were investigated in an elementary approach. Now our knowledge will be extended by applying further tools of mathematical analysis. We introduce the notion of random variable, and then we deal with functions and numbers characterizing random variables.

4.1 Random variables

In applications of probability theory, we often encounter experiments where we can expediently assign a number to every possible outcome; see the example below. In other cases we have to transform random phenomena to scalars because the original elements of the probability space are unknown.

If A is an event algebra given on Ω, then the function X: Ω → R is called a random variable if for all intervals I ⊂ R the preimage

X⁻¹(I) := {ω ∈ Ω : X(ω) ∈ I}

is an event in the set A.

In other experiments we can naturally assign n-tuples of numbers, more precisely vectors of these numbers, to the outcomes. For these cases the notion of random vector will be introduced (see Chapter 6).

Example 4.1. Consider families with four children. Taking into account the sexes and the order of births of the children, these families can be classified into 2⁴ = 16 types, inasmuch as the first, the second, the third and the fourth child can be either a boy or a girl. On the sample space

Ω :={(b, b, b, b),(b, b, b, g),(b, b, g, b),(b, g, b, b),(g, b, b, b), . . . ,(g, g, g, g)}

let the function X: Ω → R be the random variable that assigns to each type of family the number of boys in the given family. Then

X :

(b, b, b, b) ↦ 4,

(b, b, b, g), (b, b, g, b), (b, g, b, b), (g, b, b, b) ↦ 3,

(b, b, g, g), (b, g, b, g), (b, g, g, b), (g, b, b, g), (g, b, g, b), (g, g, b, b) ↦ 2,

(b, g, g, g), (g, b, g, g), (g, g, b, g), (g, g, g, b) ↦ 1,

(g, g, g, g) ↦ 0.

Two other random variables are

Y := 4−X, which means the number of girls in the family,

Z :=|2X−4|, which gives the absolute difference between the numbers of boys and girls in the family.
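The random variables of Example 4.1 are ordinary functions on a finite sample space, so they can be written out directly. The sketch below enumerates the 16 family types and evaluates X, Y and Z on each; representing a family as a tuple of the characters "b" and "g" is an assumption made here for illustration.

```python
from collections import Counter
from itertools import product

# Sample space: all birth orders of four children, "b" = boy, "g" = girl.
omega = list(product("bg", repeat=4))

def X(family):
    """Number of boys in the family."""
    return family.count("b")

def Y(family):
    """Number of girls in the family."""
    return 4 - X(family)

def Z(family):
    """Absolute difference between the numbers of boys and girls."""
    return abs(2 * X(family) - 4)

# How many family types are mapped to each value of X.
count_by_boys = Counter(X(family) for family in omega)

for family in omega:
    print("".join(family), X(family), Y(family), Z(family))
```

The counts in `count_by_boys` correspond to the sizes of the preimages X⁻¹({k}) listed in the example: 1, 4, 6, 4 and 1 family types for k = 4, 3, 2, 1 and 0.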

It is easy to show that if X is a random variable and a is a real number, then aX is also a random variable, and if Y is a random variable on the same probability space and b is a real number, then aX + bY is also a random variable. Moreover, if X, Y are random variables on the same probability space and f is a real-valued continuous function of two variables, then f∘(X, Y) is also a random variable on the same probability space.

In a wider sense the value of a random variable as a function can be a rank, that is, an element of an ordinal scale (ordinal or rank-valued random variable). In other cases the value of the random variable represents a qualitative category (categorical or qualitative random variable). Both of these types of random variables will occur in the book.
