Algorithms and Inference


Statistics is the science of learning from experience, particularly experience that arrives a little bit at a time: the successes and failures of a new experimental drug, the uncertain measurements of an asteroid’s path toward Earth. (Source: Efron, B., & Hastie, T. (2016). Computer Age Statistical Inference: Algorithms, Evidence, and Data Science. Cambridge: Cambridge University Press.)

First, however, we want to discuss a less philosophical, more operational division of labor that applies to both theories: between the algorithmic and inferential aspects of statistical analysis. (Source: Efron, B., & Hastie, T. (2016). Computer Age Statistical Inference: Algorithms, Evidence, and Data Science. Cambridge: Cambridge University Press.)

"Very broadly speaking, algorithms are what statisticians do while inference says why they do them. A particularly energetic brand of the statistical enterprise has flourished in the new century, data science, emphasizing algorithmic thinking rather than its inferential justification." (Source: Efron, B., & Hastie, T. (2016). Computer Age Statistical Inference: Algorithms, Evidence, and Data Science. Cambridge: Cambridge University Press.)

Here averaging is the algorithm, while the standard error provides an inference of the algorithm’s accuracy. It is a surprising, and crucial, aspect of statistical theory that the same data that supplies an estimate can also assess its accuracy. (Source: Efron, B., & Hastie, T. (2016). Computer Age Statistical Inference: Algorithms, Evidence, and Data Science. Cambridge: Cambridge University Press.)

The point is that the algorithm comes first and the inference follows at a second level of statistical consideration. (Source: Efron, B., & Hastie, T. (2016). Computer Age Statistical Inference: Algorithms, Evidence, and Data Science. Cambridge: Cambridge University Press.)


Suppose you have at hand a random sample $x = \{ x_1, x_2, \ldots, x_n \}$ of size $n$ collected from some population, and that the values $x_1, x_2, \ldots, x_n$ refer to some phenomenon of interest (say, the number of students enrolled in elementary school in $n$ municipalities of the state of Minas Gerais).

You need an estimate of the population mean ($\mu$). With the random sample $x = \{ x_1, x_2, \ldots, x_n \}$ at hand, the best estimate you can obtain for the population mean ($\mu$) is the mean ($\bar{x}$) of this random sample.

Here, our algorithm for computing the best estimate of the parameter of interest is the arithmetic mean:

$$ \bar{x} = \sum_{i=1}^n \frac{x_i}{n} $$

But what is the uncertainty associated with this estimate? In other words, how precise is the number $\bar{x}$?

According to the Central Limit Theorem (CLT), under certain conditions, the distribution of the sample means $\bar{X}$ is nearly normal, centered at the population mean $\mu$, with a standard deviation $SE$ equal to the population standard deviation $\sigma$ divided by the square root of the sample size $n$.

$$ \bar{X} \sim N\left(mean=\mu, SE=\frac{\sigma}{\sqrt{n}}\right) $$

Então, o desvio padrão (standard deviation) $SE = \frac{\sigma}{\sqrt{n}}$ desta distribuição das médias das amostras (sample means) $\bar{X}$ é, na verdade, uma medida do erro da nossa estimativa $\bar{x}$ e, portanto, $SE$ fornece uma inferência a respeito da precisão do nosso algoritmo (a média aritmética). Por este motivo, o desvio padrão (standard deviation) $SE$ desta distribuição das medias das amostras (sample means) $\bar{X}$ é denominado de erro padrão (standard error).
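
As a quick empirical check of this statement (a minimal simulation sketch that is not part of the original notebook; the standard normal population, sample size, and seed below are arbitrary choices), we can draw many samples of a fixed size and compare the standard deviation of their means against $\sigma / \sqrt{n}$:

import numpy as np

# hypothetical setup: a standard normal population and samples of size n = 50
rng = np.random.default_rng(0)
sigma, n, n_replications = 1.0, 50, 10_000

# each row is one sample of size n; take the mean of every row
sample_means = rng.normal(loc=0.0, scale=sigma, size=(n_replications, n)).mean(axis=1)

print("sd of the sample means:", round(sample_means.std(), 4))  # empirical standard error
print("sigma / sqrt(n):       ", round(sigma / np.sqrt(n), 4))  # CLT prediction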

We typically do not know the population standard deviation $\sigma$, but we can estimate it with the sample standard deviation $s$. Hence,

$$ SE \approx \hat{se} = \frac{s}{\sqrt{n}} $$

So, instead of reporting only the estimate $\bar{x}$, we can also convey the uncertainty about it by reporting $\bar{x} \pm \hat{se}$.
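
For a single sample, the estimate and its standard error take only a couple of lines of NumPy (a minimal sketch with made-up numbers):

import numpy as np

# made-up sample: enrollment counts (in thousands of students) for eight municipalities
sample = np.array([12.0, 9.5, 14.2, 11.1, 10.8, 13.3, 12.7, 9.9])

x_bar = sample.mean()                                # the algorithm: the arithmetic mean
se_hat = sample.std(ddof=1) / np.sqrt(len(sample))   # the inference: the estimated standard error

print(f"estimate: {x_bar:.2f} +/- {se_hat:.2f}")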

In [1]:
from scipy import stats as st
import numpy as np

# collects random samples (over a range of sizes) from the population distribution and
# computes estimates of the mean and standard error
def collect_samples_and_calc_mean_SE_estimations(pop_distribution, start_size, end_size, increment_size):
    random_sample_sizes = []
    estimated_means = []
    estimated_standard_errors = []

    for random_sample_size in range(start_size,end_size,increment_size):
        random_sample = pop_distribution.rvs(size=random_sample_size)
        estimated_mean = np.mean(random_sample)
        estimated_standard_error = np.std(random_sample, ddof=1) / np.sqrt(random_sample_size)

        random_sample_sizes.append(random_sample_size)
        estimated_means.append(estimated_mean)
        estimated_standard_errors.append(estimated_standard_error)
        
    return random_sample_sizes, estimated_means, estimated_standard_errors
In [2]:
# definition of the population distribution
pop_par_mu = 0
pop_par_sd = 1
pop_distribution = st.norm(loc=pop_par_mu, scale=pop_par_sd)

random_sample_sizes = []
estimated_means = []
estimated_SEs = []

# collects samples, calculates estimates and updates arrays
# start_size=10, end_size=100, increment_size=20
sizes, means, SEs = collect_samples_and_calc_mean_SE_estimations(pop_distribution, 10, 100, 20)
random_sample_sizes.extend(sizes)
estimated_means.extend(means)
estimated_SEs.extend(SEs)

# collects samples, calculates estimates and updates arrays
# start_size=100, end_size=1000, increment_size=100
sizes, means, SEs = collect_samples_and_calc_mean_SE_estimations(pop_distribution, 100, 1000, 100)
random_sample_sizes.extend(sizes)
estimated_means.extend(means)
estimated_SEs.extend(SEs)

# collects samples, calculates estimates and updates arrays
# start_size=1000, end_size=10000, increment_size=1000
sizes, means, SEs = collect_samples_and_calc_mean_SE_estimations(pop_distribution, 1000, 10000, 1000)
random_sample_sizes.extend(sizes)
estimated_means.extend(means)
estimated_SEs.extend(SEs)

# collects samples, calculates estimates and updates arrays
# start_size=10000, end_size=100000, increment_size=10000
sizes, means, SEs = collect_samples_and_calc_mean_SE_estimations(pop_distribution, 10000, 100000, 10000)
random_sample_sizes.extend(sizes)
estimated_means.extend(means)
estimated_SEs.extend(SEs)
In [3]:
import pandas as pd
import altair as alt
alt.renderers.enable('notebook')  # notebook renderer (not needed in JupyterLab); run this once per session

def plot_truemean_estimatedmean_SE(true_mean, estimated_means, estimated_standard_errors, random_sample_sizes):
    data = pd.DataFrame({
        'true_mean': true_mean, 'random_sample_size': random_sample_sizes, 
        'estimated_mean': estimated_means, 'estimated_se': estimated_standard_errors
    })
    
    true_mean_plot = alt.Chart(data).mark_rule(color='red').encode(
        y='true_mean:Q',
        size=alt.value(1)
    )
    
    estimated_mean_plot = alt.Chart(data).mark_line().encode(
        x='random_sample_size:O',
        y='estimated_mean'
    )
    
    standard_error_plot = alt.Chart(data).mark_area(opacity=0.3).encode(
        x='random_sample_size:O',
        y='mu_hat_sup:Q',
        y2='mu_hat_inf:Q'
    ).transform_calculate(
        mu_hat_sup='datum.estimated_mean + datum.estimated_se',
        mu_hat_inf='datum.estimated_mean - datum.estimated_se',
    )
      
    chart = (true_mean_plot + estimated_mean_plot + standard_error_plot).properties(title='(est_mean +- se)  |  (true_mean)')
    return chart
In [4]:
plot_truemean_estimatedmean_SE(pop_par_mu, estimated_means, estimated_SEs, random_sample_sizes)
Out[4]:

A Regression Example

This section refers to the following example from the book:

Figure 1.1 concerns a study of kidney function. Data points $\left( x_i, y_i \right)$ have been observed for $n=157$ healthy volunteers, with $x_i$ the $i$th volunteer’s age in years, and $y_i$ a composite measure “tot” of overall function. Kidney function generally declines with age, as evident in the downward scatter of the points. The rate of decline is an important question in kidney transplantation: in the past, potential donors past age $60$ were prohibited, though, given a shortage of donors, this is no longer enforced. (Source: Efron, B., & Hastie, T. (2016). Computer Age Statistical Inference: Algorithms, Evidence, and Data Science. Cambridge: Cambridge University Press.)

Kidney fitness data of Figure 1.1

The data used here are the ones made available on the book's own website at https://web.stanford.edu/~hastie/CASI/data.html.

Data:

In [5]:
kidney_df = pd.read_csv("data/kidney.txt", delimiter=' ')
kidney_df.head()
Out[5]:
age tot
0 18 2.44
1 19 3.86
2 19 -1.22
3 20 2.30
4 21 0.98
In [6]:
# scatter plot of dataset
kidney_df_plot = alt.Chart(kidney_df).mark_point().encode(
    alt.X('age:Q', scale=alt.Scale(zero=False)),
    y='tot:Q',
    size=alt.value(10)
).properties(title="kidney dataset")

kidney_df_plot
Out[6]:

Linear regression model

Algorithmic aspect of linear regression model

In [7]:
import statsmodels.formula.api as smf

# fit linear model
kidney_ols_model = smf.ols(formula='tot ~ age', data=kidney_df).fit()
# predict dataset
kidney_ols_model_tot_pred = kidney_ols_model.predict(kidney_df['age'])

# line plot that represents the linear model
kidney_ols_model_data = pd.DataFrame({'age': kidney_df['age'], 'tot_pred': kidney_ols_model_tot_pred})
kidney_ols_model_plot = alt.Chart(kidney_ols_model_data).mark_line(color="green").encode(x='age', y='tot_pred')

kidney_ols_model_plot_title = "OLS model (tot = {0:.2f} + {1:.3f}age)".format(
    kidney_ols_model.params[0], kidney_ols_model.params[1]
)

# dataset + model plots
plot = (kidney_df_plot + kidney_ols_model_plot).properties(
    title=kidney_ols_model_plot_title
)
plot
Out[7]:

Inferential aspect of linear regression model
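
Before the code, a short note on the formula it uses (it follows the regression notes linked in the code comments): the estimated standard error of the fitted regression line at a new age $x_0$ is

$$ \hat{se}(x_0) = \frac{s}{\sqrt{n}} \sqrt{1 + \frac{(x_0 - \bar{x})^2}{\operatorname{Var}(x)}} = s \sqrt{\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}, $$

where $s$ is the standard error of the regression, $\bar{x}$ is the mean of the observed ages, and $\operatorname{Var}(x)$ is their (divide-by-$n$) variance. This is what the function calculate_regression_SEs below computes; note that the error grows as $x_0$ moves away from $\bar{x}$.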

In [8]:
# calculates an inference (standard errors) for the linear regression model
# see http://people.duke.edu/~rnau/mathreg.htm
def calculate_regression_SEs(kidney_ols_model, ages, s, n, mean_ages, var_ages):
    tots_prev = []
    SEs = []
    for age in ages:
        tot_prev = kidney_ols_model.predict({'age':age})
        tots_prev.append(tot_prev[0])

        SE = (s / np.sqrt(n)) * np.sqrt(1 + ((age - mean_ages)**2)/var_ages)
        SEs.append(SE)
        
    return tots_prev, SEs
In [9]:
mean_ages = np.mean(kidney_df['age'])
var_ages = np.var(kidney_df['age'])
n = len(kidney_df.index)

# "standard error of the model" or "standard error of the regression"
s = np.std(kidney_ols_model.resid, ddof=1) * np.sqrt((n-1)/(n-2))

some_ages = [20, 30, 40, 50, 60, 70, 80]

# calculates an inference (standard errors) for the linear regression model
# see http://people.duke.edu/~rnau/mathreg.htm
kidney_ols_model_tots_prev, kidney_ols_model_SEs = calculate_regression_SEs(
    kidney_ols_model, some_ages, s, n, mean_ages, var_ages
)

table_1_1_part1 = pd.DataFrame({
    'ages':some_ages, 
    'linear reg. predic.':np.round(kidney_ols_model_tots_prev, decimals=2),
    'linear reg. std error':np.round(kidney_ols_model_SEs, decimals=2)
})
table_1_1_part1
Out[9]:
ages linear reg. predic. linear reg. std error
0 20 1.29 0.21
1 30 0.50 0.15
2 40 -0.28 0.15
3 50 -1.07 0.19
4 60 -1.86 0.26
5 70 -2.64 0.34
6 80 -3.43 0.42
In [10]:
def make_CIs_bar_plot(ages, tots_prev, SEs):
    cis_bar_data = pd.DataFrame({
        'age': ages,
        'ci_bar_lb': [(tot_prev - 2*se) for tot_prev, se in zip(tots_prev, SEs)],
        'ci_bar_ub': [(tot_prev + 2*se) for tot_prev, se in zip(tots_prev, SEs)]
    })

    cis_bar_plot = alt.Chart(cis_bar_data).mark_rule(color='red').encode(
        y='ci_bar_lb:Q', y2='ci_bar_ub:Q', x='age:Q', size=alt.value(1)
    )
    
    return cis_bar_plot
In [11]:
kidney_ols_model_cis_bar_plot = make_CIs_bar_plot(some_ages, kidney_ols_model_tots_prev, kidney_ols_model_SEs)

# dataset + model + cis plots   
figure_1_1 = (kidney_df_plot + kidney_ols_model_plot + kidney_ols_model_cis_bar_plot).properties(
    title=kidney_ols_model_plot_title
)
figure_1_1
Out[11]:

In [12]:
def make_CIs_line_plot(ages, tots_prev, SEs):
    cis_line_data = pd.DataFrame({
        'age': ages,
        'ci_ln_lb': [(tot_prev - 2*se) for tot_prev, se in zip(tots_prev, SEs)],
        'ci_ln_ub': [(tot_prev + 2*se) for tot_prev, se in zip(tots_prev, SEs)]
    })

    cis_line_lb_plot = alt.Chart(cis_line_data).mark_line(color='red').encode(
        y='ci_ln_lb:Q', x='age:Q'
    )
    cis_line_ub_plot = alt.Chart(cis_line_data).mark_line(color='red').encode(
        y='ci_ln_ub:Q', x='age:Q'
    )
    
    return cis_line_lb_plot + cis_line_ub_plot
In [13]:
# calculates an inference (standard errors) for the linear regression model
# see http://people.duke.edu/~rnau/mathreg.htm
kidney_ols_model_tots_prev, kidney_ols_model_SEs = calculate_regression_SEs(
    kidney_ols_model, kidney_df['age'], s, n, mean_ages, var_ages
)

kidney_ols_model_cis_line_plot = make_CIs_line_plot(
    kidney_df['age'], kidney_ols_model_tots_prev, kidney_ols_model_SEs
)

# dataset + model + cis plots   
plot = (kidney_df_plot + kidney_ols_model_plot + kidney_ols_model_cis_line_plot).properties(
    title=kidney_ols_model_plot_title
)
plot
Out[13]:

A modern computer-based algorithm lowess

Algorithmic aspect of lowess

In [14]:
from scipy.interpolate import interp1d
import statsmodels.api as sm
lowess = sm.nonparametric.lowess

kidney_lowess_model = lowess(kidney_df['tot'], kidney_df['age'], frac=1./3)
kidney_lowess_model_x = list(zip(*kidney_lowess_model))[0]
kidney_lowess_model_y = list(zip(*kidney_lowess_model))[1]
kidney_lowess_model_f = interp1d(kidney_lowess_model_x, kidney_lowess_model_y, bounds_error=False)

kidney_lowess_model_tots_prev = kidney_lowess_model_y

# curve plot that represents the lowess model
kidney_lowess_model_data = pd.DataFrame({'age': kidney_df['age'], 'tot_pred': kidney_lowess_model_tots_prev})
kidney_lowess_model_plot = alt.Chart(kidney_lowess_model_data).mark_line(color="green").encode(x='age', y='tot_pred')

# dataset + model plots
kidney_lowess_model_plot_title = "Local polynomial lowess(y,x,1/3)"
plot = (kidney_df_plot + kidney_lowess_model_plot).properties(
    title=kidney_lowess_model_plot_title
)
plot
Out[14]:

Inferential aspect of lowess
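
lowess has no simple closed-form standard error, so the cells below estimate it with the bootstrap: resample the $(x_i, y_i)$ pairs with replacement, refit lowess on each resampled dataset, and use the spread of the refitted curves at each age as the error estimate. With $B$ bootstrap replications and $\hat{y}^{*b}(x)$ the lowess prediction at age $x$ from the $b$-th replication, the bootstrap standard error is the usual

$$ \widehat{se}_{\mathrm{boot}}(x) = \left[ \frac{1}{B-1} \sum_{b=1}^{B} \left( \hat{y}^{*b}(x) - \bar{\hat{y}}^{*}(x) \right)^{2} \right]^{1/2}, \qquad \bar{\hat{y}}^{*}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{y}^{*b}(x). $$

The code below uses $B = 250$ replications and delegates the final spread calculation to the bootstrapped package.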

In [15]:
# calculates an inference (standard errors) for the lowess model using bootstrap
replications = 250
kidney_lowess_model_bootstrap_df = pd.DataFrame(columns=['replication', 'age', 'tot_pred'])
for rep in range(replications):
    bootstrap_idxs = np.random.choice(n, n)
    new_kidney_df = kidney_df.iloc[bootstrap_idxs]
    new_kidney_df = new_kidney_df.sort_values(by=['age'])
    
    new_kidney_lowess_model = lowess(new_kidney_df['tot'], new_kidney_df['age'], frac=1./3)
    new_kidney_lowess_model_x = list(zip(*new_kidney_lowess_model))[0]
    new_kidney_lowess_model_y = list(zip(*new_kidney_lowess_model))[1]
    new_kidney_lowess_model_f = interp1d(new_kidney_lowess_model_x, new_kidney_lowess_model_y, bounds_error=False)
    
    tot_pred = new_kidney_lowess_model_y
    df_row = pd.DataFrame({
        'replication': [rep]*n,
        'age': new_kidney_df['age'],
        'tot_pred': tot_pred
    })
    
    # insert predictions for ages 40 and 50, which do not exist in the original dataset
    tot_pred_for_ages_40_50 = new_kidney_lowess_model_f([40, 50])    
    # update df_row
    df_row = pd.concat([
        df_row, 
        pd.DataFrame({'replication':[rep, rep], 'age':[40, 50], 'tot_pred': tot_pred_for_ages_40_50})
    ])
    df_row = df_row.sort_values(by=['age'])
    
    kidney_lowess_model_bootstrap_df = pd.concat([kidney_lowess_model_bootstrap_df, df_row])
In [16]:
# filter the first 25 bootstrapped lowess models
filter_data = kidney_lowess_model_bootstrap_df[(kidney_lowess_model_bootstrap_df.replication < 25)]

# curve plot that represents the lowess model
kidney_lowess_model_bootstrap_plot = alt.Chart(filter_data).mark_line().encode(
    x='age:Q', y='tot_pred:Q', color=alt.Color('replication:O', legend=None)
)

# dataset + model plots
figure_1_3 = (kidney_df_plot + kidney_lowess_model_bootstrap_plot).properties(
    title="25 first Local polynomial lowess(y,x,1/3) from bootstrap"
)
figure_1_3
Out[16]:

In [17]:
import bootstrapped.bootstrap as bs
import bootstrapped.stats_functions as bs_stats

# calculates an inference (standard errors) using bootstrap
def calculate_bootstrap_SEs(bootstrap_data, ages):
    SEs = []
    for age in ages:
        filter_bootstrap_df = bootstrap_data[(bootstrap_data['age'] == age)]
        bs_std = bs.bootstrap(np.array(filter_bootstrap_df['tot_pred']), stat_func=bs_stats.std)
        SEs.append(bs_std.value)
        
    return SEs
In [18]:
kidney_lowess_model_tots_prev_some_ages = kidney_lowess_model_f(some_ages)

# calculates an inference (standard errors) for the lowess model using bootstrap
kidney_lowess_model_SEs = calculate_bootstrap_SEs(kidney_lowess_model_bootstrap_df, some_ages)

table_1_1 = pd.DataFrame({
    'ages':some_ages, 
    'linear reg. predic.':table_1_1_part1['linear reg. predic.'],
    'linear reg. std error':table_1_1_part1['linear reg. std error'],
    'lowess predic.':np.round(kidney_lowess_model_tots_prev_some_ages, decimals=2),
    'bootstrap std error':np.round(kidney_lowess_model_SEs, decimals=2)
})
table_1_1
Out[18]:
ages linear reg. predic. linear reg. std error lowess predic. bootstrap std error
0 20 1.29 0.21 1.66 0.56
1 30 0.50 0.15 0.65 0.22
2 40 -0.28 0.15 -0.59 0.34
3 50 -1.07 0.19 -1.27 0.33
4 60 -1.86 0.26 -1.91 0.36
5 70 -2.64 0.34 -2.68 0.46
6 80 -3.43 0.42 -3.50 0.61

... Table 1.1 shows the lowess estimates and their standard errors. We have paid a price for the increased flexibility of lowess, its standard errors roughly doubling those for linear regression. (Source: Efron, B., & Hastie, T. (2016). Computer Age Statistical Inference: Algorithms, Evidence, and Data Science. Cambridge: Cambridge University Press.)

In [19]:
kidney_lowess_model_cis_bar_plot = make_CIs_bar_plot(
    some_ages, kidney_lowess_model_tots_prev_some_ages, kidney_lowess_model_SEs
)

# dataset + model + cis plots   
figure_1_2 = (kidney_df_plot + kidney_lowess_model_plot + kidney_lowess_model_cis_bar_plot).properties(
    title=kidney_lowess_model_plot_title
)
figure_1_2
Out[19]:

In [20]:
%%time

# calculates an inference (standard errors) for the lowess model using bootstrap
kidney_lowess_model_SEs = calculate_bootstrap_SEs(kidney_lowess_model_bootstrap_df, kidney_df['age'])
Wall time: 56.4 s
In [21]:
kidney_lowess_model_cis_line_plot = make_CIs_line_plot(
    kidney_df['age'], kidney_lowess_model_tots_prev, kidney_lowess_model_SEs
)

# dataset + model + cis plots   
plot = (kidney_df_plot + kidney_lowess_model_plot + kidney_lowess_model_cis_line_plot).properties(
    title=kidney_lowess_model_plot_title
)
plot
Out[21]:

Hypothesis Testing

This section refers to the following example from the book:

Our second example concerns the march of methodology and inference for hypothesis testing rather than estimation: $72$ leukemia patients, $47$ with ALL (acute lymphoblastic leukemia) and $25$ with AML (acute myeloid leukemia, a worse prognosis) have each had genetic activity measured for a panel of $7128$ genes. The histograms in Figure 1.4 compare the genetic activities in the two groups for gene 136. (Source: Efron, B., & Hastie, T. (2016). Computer Age Statistical Inference: Algorithms, Evidence, and Data Science. Cambridge: Cambridge University Press.)

Leukemia data

The data used here are the ones made available on the book's own website at https://web.stanford.edu/~hastie/CASI/data.html.

Data:

  • Gene expression measurements on 72 leukemia patients, 47 "ALL" (see section 1.2), 25 "AML".
  • These data arise from the landmark Golub et al (1999) Science paper.
  • There is a larger set consisting of 7128 genes, which was used in Chapters 1, 10, 11, and possibly elsewhere.
  • It is stored as the 7128 x 72 matrix (10MB) leukemia_big.csv, with the column names denoting the class labels.
  • The histograms in Figure 1.4 arise from row 136 of this matrix, and the histogram in Figure 1.5 is of the 7128 two-sample t-test statistics on the rows (genes).
  • https://web.stanford.edu/~hastie/CASI_files/DATA/leukemia_big.csv
In [22]:
leukemia_original_df = pd.read_csv("data/leukemia_big.csv", delimiter=',')
leukemia_original_df.head()
Out[22]:
ALL ALL.1 ALL.2 ALL.3 ALL.4 ALL.5 ALL.6 ALL.7 ALL.8 ALL.9 ... AML.15 AML.16 AML.17 AML.18 AML.19 AML.20 AML.21 AML.22 AML.23 AML.24
0 -1.533622 -0.867610 -0.433172 -1.671903 -1.187689 -1.127234 -1.045409 -0.106917 -1.198796 -1.190899 ... -0.436650 -1.274708 -0.681458 -0.876610 -0.624022 -0.431628 -1.435259 -0.671954 -1.013161 -0.969482
1 -1.235673 -1.275501 -1.184492 -1.596424 -1.335256 -1.113730 -0.800880 -0.745177 -0.849312 -1.190899 ... -0.915483 -1.354363 -0.653559 -1.096250 -1.066594 -1.335256 -1.204586 -0.751457 -0.889592 -1.080988
2 -0.333983 0.375927 -0.459196 -1.422571 -0.797493 -1.362768 -0.671954 -1.175674 0.320813 0.646610 ... -0.736156 -0.022153 -0.037455 -0.567335 -1.100749 -0.552938 -0.948874 -0.231657 -0.742163 -0.779500
3 0.488702 0.444011 0.436264 0.193353 0.235632 -0.360312 0.184941 0.425653 0.333983 0.235270 ... 0.083781 0.356562 0.416241 0.533986 0.227505 0.416816 0.408202 0.326556 0.361813 0.298864
4 -1.300893 -1.229660 -1.325882 -1.818329 -1.311206 -1.513975 -1.651624 -1.339555 -0.593132 0.133302 ... -1.547444 -1.264475 -1.512318 -1.469583 -1.283472 -0.977672 -1.090178 -1.545120 -1.174272 -1.443183

5 rows × 72 columns

In [23]:
leukemia_df = pd.DataFrame(columns=['patient', 'leukemia', 'gene', 'genetic_activity'])
patient = 0
total_genes = len(leukemia_original_df.index)
for column in leukemia_original_df.columns:
    patient = patient + 1
    leukemia = "ALL" if (column.find("ALL") != -1) else "AML"
    
    df_row = pd.DataFrame({
        'patient': [patient]*total_genes,
        'leukemia': [leukemia]*total_genes,
        'gene': np.arange(1,(total_genes+1)),
        'genetic_activity': leukemia_original_df[column]
    })
    
    leukemia_df = pd.concat([leukemia_df, df_row], ignore_index=True)

leukemia_df.head()
Out[23]:
patient leukemia gene genetic_activity
0 1 ALL 1 -1.533622
1 1 ALL 2 -1.235673
2 1 ALL 3 -0.333983
3 1 ALL 4 0.488702
4 1 ALL 5 -1.300893
In [24]:
leukemia_gene136_df = leukemia_df[(leukemia_df.gene == 136)]
leukemia_gene136_df.head()
Out[24]:
patient leukemia gene genetic_activity
135 1 ALL 136 0.918695
7263 2 ALL 136 1.634002
14391 3 ALL 136 0.459587
21519 4 ALL 136 0.637966
28647 5 ALL 136 0.344038
In [25]:
def draw_hist_plot(values, x_axis_limits=None): 
    data = pd.DataFrame({'scores': values})
    hist = None
    if x_axis_limits is None:
        hist = alt.Chart(data).mark_bar().encode(
            alt.X('scores:Q', bin=True),
            alt.Y('count()')
        )
    else:
        hist = alt.Chart(data).mark_bar().encode(
            alt.X('scores:Q', bin=True, scale=alt.Scale(domain=x_axis_limits)),
            alt.Y('count()')
        )
        
    rule = alt.Chart(data).mark_rule(color='red').encode(
        x='mean(scores)',
        size=alt.value(2)
    )
    return hist + rule
In [26]:
# filter only gene 136 by leukemia type
leukemia_gene136_ALL_df = leukemia_gene136_df[(leukemia_gene136_df.leukemia == "ALL")]
leukemia_gene136_AML_df = leukemia_gene136_df[(leukemia_gene136_df.leukemia == "AML")]

# generates histograms
leukemia_gene136_ALL_hist = draw_hist_plot(leukemia_gene136_ALL_df['genetic_activity'])
leukemia_gene136_AML_hist = draw_hist_plot(leukemia_gene136_AML_df['genetic_activity'])

# calculates means
mean_leukemia_gene136_ALL = np.mean(leukemia_gene136_ALL_df['genetic_activity'])
mean_leukemia_gene136_AML = np.mean(leukemia_gene136_AML_df['genetic_activity'])

# set plot titles
leukemia_gene136_ALL_hist = leukemia_gene136_ALL_hist.properties(
    title="ALL scores - mean {0:.3f}".format(mean_leukemia_gene136_ALL)
)
leukemia_gene136_AML_hist = leukemia_gene136_AML_hist.properties(
    title="AML scores - mean {0:.3f}".format(mean_leukemia_gene136_AML)
)

# draw histograms
figure_1_4 = leukemia_gene136_ALL_hist | leukemia_gene136_AML_hist
figure_1_4
Out[26]:

two-sided t-test

Methodology aspect of a two-sided t-test
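
For reference, the statistic computed in the next cell is the standard two-sample pooled t-statistic (the code delegates the calculation to statsmodels' ttest_ind with usevar='pooled'):

$$ t = \frac{\bar{x}_{\mathrm{AML}} - \bar{x}_{\mathrm{ALL}}}{\widehat{sd} \, \sqrt{\frac{1}{n_{\mathrm{AML}}} + \frac{1}{n_{\mathrm{ALL}}}}}, \qquad \widehat{sd}^{\,2} = \frac{(n_{\mathrm{AML}} - 1)\, s_{\mathrm{AML}}^{2} + (n_{\mathrm{ALL}} - 1)\, s_{\mathrm{ALL}}^{2}}{n_{\mathrm{AML}} + n_{\mathrm{ALL}} - 2}. $$

Under the null hypothesis of no difference between the groups (and normal sampling), $t$ follows a Student's $t$ distribution with $n_{\mathrm{AML}} + n_{\mathrm{ALL}} - 2 = 70$ degrees of freedom, which is where the degrees-of-freedom value printed below comes from.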

In [27]:
from statsmodels.stats.weightstats import ttest_ind

# run a two-sided t-test for gene 136
t_statistic_gene136, p_value_gene136, dof = ttest_ind(
    leukemia_gene136_AML_df['genetic_activity'], leukemia_gene136_ALL_df['genetic_activity'],
    value=0, alternative='two-sided', usevar='pooled'
)

print("t-statistic for gene 136 = {:.2f}".format(t_statistic_gene136))
print("p-value for gene 136 = {:.4f}".format(p_value_gene136))
print("degrees of freedom = {}".format(dof))
t-statistic for gene 136 = 3.01
p-value for gene 136 = 0.0036
degrees of freedom = 70.0

Inferential aspect of a two-sided t-test

A small significance level (or “p-value”) is a statement of statistical surprise: something very unusual has happened if in fact there is no difference in gene $136$ expression levels between ALL and AML patients. We are less surprised by $t = 3.01$ if gene $136$ is just one candidate out of thousands that might have produced “interesting” results. (Source: Efron, B., & Hastie, T. (2016). Computer Age Statistical Inference: Algorithms, Evidence, and Data Science. Cambridge: Cambridge University Press.)
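
As a sanity check on where that p-value comes from (a minimal sketch, independent of the notebook's variables), the two-sided p-value is twice the Student-$t(70)$ upper-tail probability beyond $|t|$:

from scipy import stats as st

# two-sided p-value for the observed t-statistic under the theoretical null distribution
t_obs, dof = 3.01, 70
p_value = 2 * st.t.sf(abs(t_obs), df=dof)   # sf(x) = 1 - cdf(x), the upper-tail probability
print(f"p-value ~ {p_value:.4f}")           # should be close to the 0.0036 reported above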

In [28]:
%%time

# run a two-sided t-test for all genes
t_statistics = []
genes = leukemia_df['gene'].drop_duplicates()
for gene in genes:
    # filter gene in question by leukemia type
    leukemia_gene_df = leukemia_df[(leukemia_df.gene == gene)]
    leukemia_gene_ALL_df = leukemia_gene_df[(leukemia_gene_df.leukemia == "ALL")]
    leukemia_gene_AML_df = leukemia_gene_df[(leukemia_gene_df.leukemia == "AML")]
    
    # run a two-sided t-test
    t_statistic, p_value, _ = ttest_ind(
        leukemia_gene_AML_df['genetic_activity'], leukemia_gene_ALL_df['genetic_activity'],
        value=0, alternative='two-sided', usevar='pooled'
    )
    
    t_statistics.append(t_statistic)
Wall time: 4min 27s
In [29]:
# counts how many t-statistics are greater than the one for gene 136
t_statistics_GT_gene136 = (np.asarray(t_statistics) > t_statistic_gene136)
t_statistics_GT_gene136 = np.sum(t_statistics_GT_gene136)

# calculates the proportion of t-statistics that are greater than the one for gene 136
total_statistics = len(t_statistics)
proportion_statistics_GT_gene136 = t_statistics_GT_gene136/float(total_statistics)

print("total genes = {0}".format(total_statistics))
print("t-statistic for gene 136 = {:.2f}".format(t_statistic_gene136))
print("number of genes w/ t-stat > t-stat-gene136 = {}".format(t_statistics_GT_gene136))
print("proportion of genes w/ t-stat > t-stat-gene136 = {:.1f}%".format(proportion_statistics_GT_gene136*100))
total genes = 7128
t-statistic for gene 136 = 3.01
number of genes w/ t-stat > t-stat-gene136 = 414
proportion of genes w/ t-stat > t-stat-gene136 = 5.8%
In [30]:
alt.data_transformers.enable('default', max_rows=None)

hist_tstats_df = pd.DataFrame({'t_statistic': t_statistics})
hist_tstats_plot = alt.Chart(hist_tstats_df).mark_bar().encode(
    alt.X('t_statistic:Q', bin=alt.Bin(step=(5/13))),
    alt.Y('count()')
)

t_statistic_gene136_mark = alt.Chart(hist_tstats_df).mark_rule(color='red').encode(
    x='t_statistic_gene136:Q',
    size=alt.value(1)
).transform_calculate(
    t_statistic_gene136=str(t_statistic_gene136)
)

plot = (hist_tstats_plot + t_statistic_gene136_mark).properties(
    title="Two-sided t-statistics for 7128 genes  |  t-statistic for gene 136"
)
plot
Out[30]:

In [31]:
# theoretical null distribution for the t-test of this example
t_distribution_70dof = st.t(df=dof)
In [32]:
def draw_density_plot(distribution, distribution_name, color):
    # Create the points to draw the probability density function
    x_i = distribution.ppf(0.001)
    x_f = distribution.ppf(0.999)
    x = np.linspace(x_i, x_f, 100)
    f_x = distribution.pdf(x)
    
    # draw the density function f(x)
    data = pd.DataFrame({'x': x, 'f(x)': f_x})
    density = alt.Chart(data).mark_line(color=color).encode(
        x='x:Q',
        y='f(x):Q'
    ).properties(title='{0}'.format(distribution_name))
    
    return density
In [33]:
t_distribution_70dof_plot = draw_density_plot(t_distribution_70dof, "t-distribution with 70 dof", 'yellow')
t_distribution_70dof_plot
Out[33]:

In [34]:
number_of_bins = (np.max(t_statistics) - np.min(t_statistics))/(5/13)
number_of_bins = int(np.ceil(number_of_bins))

# calculates the:
#   "pdf_at_bins" values of the histogram so that the result is the value of the probability 
#                 density function (pdf) at the bin, normalized such that the integral over 
#                 the range is 1.
#   "bin_edges" 
pdf_at_bins, bin_edges = np.histogram(t_statistics, bins=number_of_bins, density=True)

hist_tstats_df = pd.DataFrame({'pdf': pdf_at_bins, 't_statistic':bin_edges[:len(bin_edges)-1]})
hist_tstats_plot = alt.Chart(hist_tstats_df).mark_bar().encode(
    x='t_statistic:Q',
    y='pdf:Q'
)
hist_tstats_plot
Out[34]:

In [35]:
figure_1_5 = (hist_tstats_plot + t_distribution_70dof_plot).properties(
    title="Two-sided t-statistics for 7128 genes | theoretical null density for the t-statistic"
)
figure_1_5
Out[35]:

The histogram implies that in this study there is something wrong with the theoretical null distribution (“Student’s t with $70$ degrees of freedom”), the smooth curve in Figure 1.5. It is much too narrow at the center, where presumably most of the genes are reporting non-significant results. (Source: Efron, B., & Hastie, T. (2016). Computer Age Statistical Inference: Algorithms, Evidence, and Data Science. Cambridge: Cambridge University Press.)

We will see in Chapter 15 that a low false-discovery rate, i.e., a low chance of crying wolf over an innocuous gene, requires $t$ exceeding $6.16$ in the ALL/AML study. Only $47$ of the $7128$ genes make the cut. False-discovery-rate theory is an impressive advance in statistical inference, incorporating Bayesian, frequentist, and empirical Bayesian (Chapter 6) elements. It was a necessary advance in a scientific world where computer-based technology routinely presents thousands of comparisons to be evaluated at once. (Source: Efron, B., & Hastie, T. (2016). Computer Age Statistical Inference: Algorithms, Evidence, and Data Science. Cambridge: Cambridge University Press.)
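
With the t_statistics list computed earlier still in memory, the cutoff quoted in this passage can be checked directly (a minimal sketch; the count should land close to the 47 genes mentioned in the book):

import numpy as np

# count genes whose two-sided t-statistic exceeds the false-discovery-rate cutoff in absolute value
fdr_cutoff = 6.16
n_discoveries = np.sum(np.abs(t_statistics) > fdr_cutoff)
print(f"genes with |t| > {fdr_cutoff}: {n_discoveries} of {len(t_statistics)}")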

References beyond the book itself:
