Statistics is the science of learning from experience, particularly experience that arrives a little bit at a time: the successes and failures of a new experimental drug, the uncertain measurements of an asteroid's path toward Earth. (Source: Efron, B., & Hastie, T. (2016). Computer Age Statistical Inference: Algorithms, Evidence, and Data Science. Cambridge: Cambridge University Press.)
First, however, we want to discuss a less philosophical, more operational division of labor that applies to both theories: between the algorithmic and inferential aspects of statistical analysis. (Source: Efron, B., & Hastie, T. (2016). Computer Age Statistical Inference: Algorithms, Evidence, and Data Science. Cambridge: Cambridge University Press.)
"Very broadly speaking, algorithms are what statisticians do while inference says why they do them. A particularly energetic brand of the statistical enterprise has flourished in the new century, data science, emphasizing algorithmic thinking rather than its inferential justification." (Source: Efron, B., & Hastie, T. (2016). Computer Age Statistical Inference: Algorithms, Evidence, and Data Science. Cambridge: Cambridge University Press.)
Here averaging is the algorithm, while the standard error provides an inference of the algorithm's accuracy. It is a surprising, and crucial, aspect of statistical theory that the same data that supplies an estimate can also assess its accuracy. (Source: Efron, B., & Hastie, T. (2016). Computer Age Statistical Inference: Algorithms, Evidence, and Data Science. Cambridge: Cambridge University Press.)
The point is that the algorithm comes first and the inference follows at a second level of statistical consideration. (Source: Efron, B., & Hastie, T. (2016). Computer Age Statistical Inference: Algorithms, Evidence, and Data Science. Cambridge: Cambridge University Press.)
Suppose you have at hand a random sample $x = \{ x_1, x_2, \ldots, x_n \}$ of size $n$ collected from some population, and that the values $x_1, x_2, \ldots, x_n$ measure some phenomenon of interest (say, the number of students enrolled in elementary school in each of $n$ municipalities of the state of Minas Gerais).
You need an estimate of the population mean $\mu$. With the random sample $x = \{ x_1, x_2, \ldots, x_n \}$ at hand, the best estimate you can produce for the population mean $\mu$ is the mean $\bar{x}$ of this random sample.
Here, our algorithm for computing the best estimate of the parameter of interest is the arithmetic mean:
$$ \bar{x} = \sum_{i=1}^n \frac{x_i}{n} $$
But what is the uncertainty of this estimate? In other words, how accurate is the number $\bar{x}$?
According to the Central Limit Theorem (CLT), under certain conditions (more details here), the distribution of the sample means $\bar{X}$ is nearly normal, centered at the population mean $\mu$, with a standard deviation $SE$ equal to the population standard deviation $\sigma$ divided by the square root of the sample size $n$:
$$ \bar{X} \sim N\left(\text{mean}=\mu,\ SE=\frac{\sigma}{\sqrt{n}}\right) $$
The standard deviation $SE = \frac{\sigma}{\sqrt{n}}$ of this distribution of sample means $\bar{X}$ is therefore a measure of the error of our estimate $\bar{x}$, and so $SE$ provides an inference about the accuracy of our algorithm (the arithmetic mean). For this reason, the standard deviation $SE$ of the distribution of sample means $\bar{X}$ is called the standard error.
Typically, however, we do not know the population standard deviation $\sigma$, but we can estimate it with the sample standard deviation $s$. Hence,
$$ SE \approx \hat{se} = \frac{s}{\sqrt{n}} $$
So, instead of reporting only the estimate $\bar{x}$, we can report some uncertainty about this estimate as $\bar{x} \pm \hat{se}$.
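As a minimal sketch of this algorithm/inference pairing (assuming a synthetic $N(0, 1)$ sample in place of real data), we can compute $\bar{x}$ and $\hat{se}$ directly:
import numpy as np
rng = np.random.default_rng(42)  # assumption: a synthetic N(0, 1) sample stands in for real data
random_sample = rng.normal(loc=0, scale=1, size=100)
x_bar = np.mean(random_sample)  # the algorithm: the arithmetic mean
se_hat = np.std(random_sample, ddof=1) / np.sqrt(len(random_sample))  # the inference: its estimated standard error
print("estimate: {0:.3f} +- {1:.3f}".format(x_bar, se_hat))
The simulation below repeats this computation for growing sample sizes, so we can watch $\hat{se}$ shrink as $n$ grows.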
from scipy import stats as st
import numpy as np
# collects random samples (for a range of sizes) from the population distribution and
# calculates estimates of the mean and the standard error
def collect_samples_and_calc_mean_SE_estimations(pop_distribution, start_size, end_size, increment_size):
    random_sample_sizes = []
    estimated_means = []
    estimated_standard_errors = []
    for random_sample_size in range(start_size, end_size, increment_size):
        random_sample = pop_distribution.rvs(size=random_sample_size)
        estimated_mean = np.mean(random_sample)
        # sample standard deviation (ddof=1) divided by sqrt(n)
        estimated_standard_error = np.std(random_sample, ddof=1) / np.sqrt(random_sample_size)
        random_sample_sizes.append(random_sample_size)
        estimated_means.append(estimated_mean)
        estimated_standard_errors.append(estimated_standard_error)
    return random_sample_sizes, estimated_means, estimated_standard_errors
# definition of the population distribution
pop_par_mu = 0
pop_par_sd = 1
pop_distribution = st.norm(loc=pop_par_mu, scale=pop_par_sd)
random_sample_sizes = []
estimated_means = []
estimated_SEs = []
# collects samples, calculates estimates and updates arrays
# start_size=10, end_size=100, increment_size=20
sizes, means, SEs = collect_samples_and_calc_mean_SE_estimations(pop_distribution, 10, 100, 20)
random_sample_sizes.extend(sizes)
estimated_means.extend(means)
estimated_SEs.extend(SEs)
# collects samples, calculates estimates and updates arrays
# start_size=100, end_size=1000, increment_size=100
sizes, means, SEs = collect_samples_and_calc_mean_SE_estimations(pop_distribution, 100, 1000, 100)
random_sample_sizes.extend(sizes)
estimated_means.extend(means)
estimated_SEs.extend(SEs)
# collects samples, calculates estimates and updates arrays
# start_size=1000, end_size=10000, increment_size=1000
sizes, means, SEs = collect_samples_and_calc_mean_SE_estimations(pop_distribution, 1000, 10000, 1000)
random_sample_sizes.extend(sizes)
estimated_means.extend(means)
estimated_SEs.extend(SEs)
# collects samples, calculates estimates and updates arrays
# start_size=10000, end_size=100000, increment_size=10000
sizes, means, SEs = collect_samples_and_calc_mean_SE_estimations(pop_distribution, 10000, 100000, 10000)
random_sample_sizes.extend(sizes)
estimated_means.extend(means)
estimated_SEs.extend(SEs)
import pandas as pd
import altair as alt
alt.renderers.enable('notebook')  # notebook only (not JupyterLab); run this command once per session
def plot_truemean_estimatedmean_SE(true_mean, estimated_means, estimated_standard_errors, random_sample_sizes):
    data = pd.DataFrame({
        'true_mean': true_mean, 'random_sample_size': random_sample_sizes,
        'estimated_mean': estimated_means, 'estimated_se': estimated_standard_errors
    })
    true_mean_plot = alt.Chart(data).mark_rule(color='red').encode(
        y='true_mean:Q',
        size=alt.value(1)
    )
    estimated_mean_plot = alt.Chart(data).mark_line().encode(
        x='random_sample_size:O',
        y='estimated_mean'
    )
    standard_error_plot = alt.Chart(data).mark_area(opacity=0.3).encode(
        x='random_sample_size:O',
        y='mu_hat_sup:Q',
        y2='mu_hat_inf:Q'
    ).transform_calculate(
        mu_hat_sup='datum.estimated_mean + datum.estimated_se',
        mu_hat_inf='datum.estimated_mean - datum.estimated_se',
    )
    chart = (true_mean_plot + estimated_mean_plot + standard_error_plot).properties(title='(est_mean +- se) | (true_mean)')
    return chart
plot_truemean_estimatedmean_SE(pop_par_mu, estimated_means, estimated_SEs, random_sample_sizes)
This section refers to the following example from the book:
Figure 1.1 concerns a study of kidney function. Data points $\left( x_i, y_i \right)$ have been observed for $n=157$ healthy volunteers, with $x_i$ the $i$th volunteer's age in years, and $y_i$ a composite measure "tot" of overall function. Kidney function generally declines with age, as evident in the downward scatter of the points. The rate of decline is an important question in kidney transplantation: in the past, potential donors past age $60$ were prohibited, though, given a shortage of donors, this is no longer enforced. (Source: Efron, B., & Hastie, T. (2016). Computer Age Statistical Inference: Algorithms, Evidence, and Data Science. Cambridge: Cambridge University Press.)
The data used here are the ones made available on the book's own website at https://web.stanford.edu/~hastie/CASI/data.html.
Data:
kidney_df = pd.read_csv("data/kidney.txt", delimiter=' ')
kidney_df.head()
# scatter plot of dataset
kidney_df_plot = alt.Chart(kidney_df).mark_point().encode(
    alt.X('age:Q', scale=alt.Scale(zero=False)),
    y='tot:Q',
    size=alt.value(10)
).properties(title="kidney dataset")
kidney_df_plot
Algorithmic aspect of linear regression model
import statsmodels.formula.api as smf
# fit linear model
kidney_ols_model = smf.ols(formula='tot ~ age', data=kidney_df).fit()
# predict dataset
kidney_ols_model_tot_pred = kidney_ols_model.predict(kidney_df['age'])
# line plot that represents the linear model
kidney_ols_model_data = pd.DataFrame({'age': kidney_df['age'], 'tot_pred': kidney_ols_model_tot_pred})
kidney_ols_model_plot = alt.Chart(kidney_ols_model_data).mark_line(color="green").encode(x='age', y='tot_pred')
kidney_ols_model_plot_title = "OLS model (tot = {0:.2f} + {1:.3f}age)".format(
    kidney_ols_model.params[0], kidney_ols_model.params[1]
)
# dataset + model plots
plot = (kidney_df_plot + kidney_ols_model_plot).properties(
    title=kidney_ols_model_plot_title
)
plot
Inferential aspect of linear regression model
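For a simple linear regression fitted by least squares, the standard error of the fitted value at a point $x$ (an age, here) has a closed form; with $s$ the standard error of the regression, it is
$$ \hat{se}(x) = s \sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}} $$
This is exactly what `calculate_regression_SEs` below computes: since `np.var` uses `ddof=0` by default, $\mathrm{var} = \sum_i (x_i - \bar{x})^2 / n$, and $(s/\sqrt{n})\sqrt{1 + (x - \bar{x})^2/\mathrm{var}}$ reduces to the formula above.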
# calculates an inference (standard errors) for the linear regression model
# see http://people.duke.edu/~rnau/mathreg.htm
def calculate_regression_SEs(kidney_ols_model, ages, s, n, mean_ages, var_ages):
    tots_prev = []
    SEs = []
    for age in ages:
        tot_prev = kidney_ols_model.predict({'age': age})
        tots_prev.append(tot_prev[0])
        # (s / sqrt(n)) * sqrt(1 + (age - mean)^2 / var) == s * sqrt(1/n + (age - mean)^2 / sum((x - mean)^2))
        SE = (s / np.sqrt(n)) * np.sqrt(1 + ((age - mean_ages)**2) / var_ages)
        SEs.append(SE)
    return tots_prev, SEs
mean_ages = np.mean(kidney_df['age'])
var_ages = np.var(kidney_df['age'])
n = len(kidney_df.index)
# "standard error of the model" or "standard error of the regression"
s = np.std(kidney_ols_model.resid, ddof=1) * np.sqrt((n-1)/(n-2))
some_ages = [20, 30, 40, 50, 60, 70, 80]
# calculates an inference (standard errors) for the linear regression model
# see http://people.duke.edu/~rnau/mathreg.htm
kidney_ols_model_tots_prev, kidney_ols_model_SEs = calculate_regression_SEs(
    kidney_ols_model, some_ages, s, n, mean_ages, var_ages
)
table_1_1_part1 = pd.DataFrame({
    'ages': some_ages,
    'linear reg. predic.': np.round(kidney_ols_model_tots_prev, decimals=2),
    'linear reg. std error': np.round(kidney_ols_model_SEs, decimals=2)
})
table_1_1_part1
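As a cross-check on the hand-rolled formula (a sketch assuming statsmodels' `get_prediction` API, whose `summary_frame()` exposes the standard error of the fitted mean in a `mean_se` column), the same numbers can be obtained directly from the fitted model:
# cross-check: statsmodels computes the same fitted values and standard errors itself
# (assumption: OLSResults.get_prediction / summary_frame with a 'mean_se' column)
kidney_ols_pred = kidney_ols_model.get_prediction(pd.DataFrame({'age': some_ages}))
kidney_ols_pred.summary_frame()[['mean', 'mean_se']].round(2)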
def make_CIs_bar_plot(ages, tots_prev, SEs):
    cis_bar_data = pd.DataFrame({
        'age': ages,
        'ci_bar_lb': [(tot_prev - 2*se) for tot_prev, se in zip(tots_prev, SEs)],
        'ci_bar_ub': [(tot_prev + 2*se) for tot_prev, se in zip(tots_prev, SEs)]
    })
    cis_bar_plot = alt.Chart(cis_bar_data).mark_rule(color='red').encode(
        y='ci_bar_lb:Q', y2='ci_bar_ub:Q', x='age:Q', size=alt.value(1)
    )
    return cis_bar_plot
kidney_ols_model_cis_bar_plot = make_CIs_bar_plot(some_ages, kidney_ols_model_tots_prev, kidney_ols_model_SEs)
# dataset + model + cis plots
figure_1_1 = (kidney_df_plot + kidney_ols_model_plot + kidney_ols_model_cis_bar_plot).properties(
    title=kidney_ols_model_plot_title
)
figure_1_1
def make_CIs_line_plot(ages, tots_prev, SEs):
    cis_line_data = pd.DataFrame({
        'age': ages,
        'ci_ln_lb': [(tot_prev - 2*se) for tot_prev, se in zip(tots_prev, SEs)],
        'ci_ln_ub': [(tot_prev + 2*se) for tot_prev, se in zip(tots_prev, SEs)]
    })
    cis_line_lb_plot = alt.Chart(cis_line_data).mark_line(color='red').encode(
        y='ci_ln_lb:Q', x='age:Q'
    )
    cis_line_ub_plot = alt.Chart(cis_line_data).mark_line(color='red').encode(
        y='ci_ln_ub:Q', x='age:Q'
    )
    return cis_line_lb_plot + cis_line_ub_plot
# calculates an inference (standard errors) for the linear regression model
# see http://people.duke.edu/~rnau/mathreg.htm
kidney_ols_model_tots_prev, kidney_ols_model_SEs = calculate_regression_SEs(
    kidney_ols_model, kidney_df['age'], s, n, mean_ages, var_ages
)
kidney_ols_model_cis_line_plot = make_CIs_line_plot(
    kidney_df['age'], kidney_ols_model_tots_prev, kidney_ols_model_SEs
)
# dataset + model + cis plots
plot = (kidney_df_plot + kidney_ols_model_plot + kidney_ols_model_cis_line_plot).properties(
    title=kidney_ols_model_plot_title
)
plot
Algorithmic aspect of lowess
Lowess (locally weighted scatterplot smoothing) fits a curve by running a weighted regression in a moving window around each point, using the fraction `frac` of nearest neighbors; unlike the linear model above, it has no simple formula for its standard errors.
from scipy.interpolate import interp1d
import statsmodels.api as sm
lowess = sm.nonparametric.lowess
kidney_lowess_model = lowess(kidney_df['tot'], kidney_df['age'], frac=1./3)
kidney_lowess_model_x = list(zip(*kidney_lowess_model))[0]
kidney_lowess_model_y = list(zip(*kidney_lowess_model))[1]
kidney_lowess_model_f = interp1d(kidney_lowess_model_x, kidney_lowess_model_y, bounds_error=False)
kidney_lowess_model_tots_prev = kidney_lowess_model_y
# curve plot that represents the lowess model
# lowess returns (age, fit) pairs sorted by age, so pair its own x values with the fitted y values
kidney_lowess_model_data = pd.DataFrame({'age': kidney_lowess_model_x, 'tot_pred': kidney_lowess_model_tots_prev})
kidney_lowess_model_plot = alt.Chart(kidney_lowess_model_data).mark_line(color="green").encode(x='age', y='tot_pred')
# dataset + model plots
kidney_lowess_model_plot_title = "Local polynomial lowess(y,x,1/3)"
plot = (kidney_df_plot + kidney_lowess_model_plot).properties(
    title=kidney_lowess_model_plot_title
)
plot
Inferential aspect of lowess
Since there is no closed-form standard error for a lowess fit, the accuracy assessment is done by the bootstrap: resample the $n$ data points with replacement, refit lowess on each resample, and take the variability of the fitted curves across replications as the measure of error.
# calculates an inference (standard errors) for the lowess model using bootstrap
replications = 250
kidney_lowess_model_bootstrap_df = pd.DataFrame(columns=['replication', 'age', 'tot_pred'])
for rep in range(replications):
    bootstrap_idxs = np.random.choice(n, n)
    new_kidney_df = kidney_df.iloc[bootstrap_idxs]
    new_kidney_df = new_kidney_df.sort_values(by=['age'])
    new_kidney_lowess_model = lowess(new_kidney_df['tot'], new_kidney_df['age'], frac=1./3)
    new_kidney_lowess_model_x = list(zip(*new_kidney_lowess_model))[0]
    new_kidney_lowess_model_y = list(zip(*new_kidney_lowess_model))[1]
    new_kidney_lowess_model_f = interp1d(new_kidney_lowess_model_x, new_kidney_lowess_model_y, bounds_error=False)
    tot_pred = new_kidney_lowess_model_y
    df_row = pd.DataFrame({
        'replication': [rep]*n,
        'age': new_kidney_df['age'],
        'tot_pred': tot_pred
    })
    # insert predictions for ages 40 and 50, which do not exist in the original dataset
    tot_pred_for_ages_40_50 = new_kidney_lowess_model_f([40, 50])
    # update df_row
    df_row = pd.concat([
        df_row,
        pd.DataFrame({'replication': [rep, rep], 'age': [40, 50], 'tot_pred': tot_pred_for_ages_40_50})
    ])
    df_row = df_row.sort_values(by=['age'])
    kidney_lowess_model_bootstrap_df = pd.concat([kidney_lowess_model_bootstrap_df, df_row])
# filter the first 25 bootstrapped lowess models
filter_data = kidney_lowess_model_bootstrap_df[(kidney_lowess_model_bootstrap_df.replication < 25)]
# curve plot that represents the bootstrapped lowess models
kidney_lowess_model_bootstrap_plot = alt.Chart(filter_data).mark_line().encode(
    x='age:Q', y='tot_pred:Q', color=alt.Color('replication:O', legend=None)
)
# dataset + model plots
figure_1_3 = (kidney_df_plot + kidney_lowess_model_bootstrap_plot).properties(
    title="First 25 local polynomial lowess(y,x,1/3) curves from the bootstrap"
)
figure_1_3
import bootstrapped.bootstrap as bs
import bootstrapped.stats_functions as bs_stats
# calculates an inference (standard errors) using bootstrap
def calculate_bootstrap_SEs(bootstrap_data, ages):
    SEs = []
    for age in ages:
        filter_bootstrap_df = bootstrap_data[(bootstrap_data['age'] == age)]
        bs_std = bs.bootstrap(np.array(filter_bootstrap_df['tot_pred']), stat_func=bs_stats.std)
        SEs.append(bs_std.value)
    return SEs
kidney_lowess_model_tots_prev_some_ages = kidney_lowess_model_f(some_ages)
# calculates an inference (standard errors) for the lowess model using bootstrap
kidney_lowess_model_SEs = calculate_bootstrap_SEs(kidney_lowess_model_bootstrap_df, some_ages)
table_1_1 = pd.DataFrame({
    'ages': some_ages,
    'linear reg. predic.': table_1_1_part1['linear reg. predic.'],
    'linear reg. std error': table_1_1_part1['linear reg. std error'],
    'lowess predic.': np.round(kidney_lowess_model_tots_prev_some_ages, decimals=2),
    'bootstrap std error': np.round(kidney_lowess_model_SEs, decimals=2)
})
table_1_1
... Table 1.1 shows the lowess estimates and their standard errors. We have paid a price for the increased flexibility of lowess, its standard errors roughly doubling those for linear regression. (Source: Efron, B., & Hastie, T. (2016). Computer Age Statistical Inference: Algorithms, Evidence, and Data Science. Cambridge: Cambridge University Press.)
kidney_lowess_model_cis_bar_plot = make_CIs_bar_plot(
    some_ages, kidney_lowess_model_tots_prev_some_ages, kidney_lowess_model_SEs
)
# dataset + model + cis plots
figure_1_2 = (kidney_df_plot + kidney_lowess_model_plot + kidney_lowess_model_cis_bar_plot).properties(
    title=kidney_lowess_model_plot_title
)
figure_1_2
%%time
# calculates an inference (standard errors) for the lowess model using bootstrap
# (lowess returned pairs sorted by age, so use its x values to keep ages and predictions aligned)
kidney_lowess_model_SEs = calculate_bootstrap_SEs(kidney_lowess_model_bootstrap_df, kidney_lowess_model_x)
kidney_lowess_model_cis_line_plot = make_CIs_line_plot(
    kidney_lowess_model_x, kidney_lowess_model_tots_prev, kidney_lowess_model_SEs
)
# dataset + model + cis plots
plot = (kidney_df_plot + kidney_lowess_model_plot + kidney_lowess_model_cis_line_plot).properties(
    title=kidney_lowess_model_plot_title
)
plot
This section refers to the following example from the book:
Our second example concerns the march of methodology and inference for hypothesis testing rather than estimation: $72$ leukemia patients, $47$ with ALL (acute lymphoblastic leukemia) and $25$ with AML (acute myeloid leukemia, a worse prognosis) have each had genetic activity measured for a panel of $7128$ genes. The histograms in Figure 1.4 compare the genetic activities in the two groups for gene 136. (Source: Efron, B., & Hastie, T. (2016). Computer Age Statistical Inference: Algorithms, Evidence, and Data Science. Cambridge: Cambridge University Press.)
The data used here are the ones made available on the book's own website at https://web.stanford.edu/~hastie/CASI/data.html.
Data:
leukemia_original_df = pd.read_csv("data/leukemia_big.csv", delimiter=',')
leukemia_original_df.head()
leukemia_df = pd.DataFrame(columns=['patient', 'leukemia', 'gene', 'genetic_activity'])
patient = 0
total_genes = len(leukemia_original_df.index)
for column in leukemia_original_df.columns:
    patient = patient + 1
    leukemia = "ALL" if (column.find("ALL") != -1) else "AML"
    df_row = pd.DataFrame({
        'patient': [patient]*total_genes,
        'leukemia': [leukemia]*total_genes,
        'gene': np.arange(1, (total_genes+1)),
        'genetic_activity': leukemia_original_df[column]
    })
    leukemia_df = pd.concat([leukemia_df, df_row], ignore_index=True)
leukemia_df.head()
leukemia_gene136_df = leukemia_df[(leukemia_df.gene == 136)]
leukemia_gene136_df.head()
def draw_hist_plot(values, x_axis_limits=None):
    data = pd.DataFrame({'scores': values})
    hist = None
    if x_axis_limits is None:
        hist = alt.Chart(data).mark_bar().encode(
            alt.X('scores:Q', bin=True),
            alt.Y('count()')
        )
    else:
        hist = alt.Chart(data).mark_bar().encode(
            alt.X('scores:Q', bin=True, scale=alt.Scale(domain=x_axis_limits)),
            alt.Y('count()')
        )
    rule = alt.Chart(data).mark_rule(color='red').encode(
        x='mean(scores)',
        size=alt.value(2)
    )
    return hist + rule
# filter only gene 136 by leukemia type
leukemia_gene136_ALL_df = leukemia_gene136_df[(leukemia_gene136_df.leukemia == "ALL")]
leukemia_gene136_AML_df = leukemia_gene136_df[(leukemia_gene136_df.leukemia == "AML")]
# generates histograms
leukemia_gene136_ALL_hist = draw_hist_plot(leukemia_gene136_ALL_df['genetic_activity'])
leukemia_gene136_AML_hist = draw_hist_plot(leukemia_gene136_AML_df['genetic_activity'])
# calculates means
mean_leukemia_gene136_ALL = np.mean(leukemia_gene136_ALL_df['genetic_activity'])
mean_leukemia_gene136_AML = np.mean(leukemia_gene136_AML_df['genetic_activity'])
# set plot titles
leukemia_gene136_ALL_hist = leukemia_gene136_ALL_hist.properties(
    title="ALL scores - mean {0:.3f}".format(mean_leukemia_gene136_ALL)
)
leukemia_gene136_AML_hist = leukemia_gene136_AML_hist.properties(
    title="AML scores - mean {0:.3f}".format(mean_leukemia_gene136_AML)
)
# draw histograms
figure_1_4 = leukemia_gene136_ALL_hist | leukemia_gene136_AML_hist
figure_1_4
Methodology aspect of a two-sided t-test
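For two groups of sizes $n_1$ and $n_2$, the pooled two-sample t-statistic computed below (`usevar='pooled'`) is
$$ t = \frac{\bar{x}_{AML} - \bar{x}_{ALL}}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}, \qquad s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}, $$
with $n_1 + n_2 - 2 = 25 + 47 - 2 = 70$ degrees of freedom in this study.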
from statsmodels.stats.weightstats import ttest_ind
# run a two-sided t-test for gene 136
t_statistic_gene136, p_value_gene136, dof = ttest_ind(
    leukemia_gene136_AML_df['genetic_activity'], leukemia_gene136_ALL_df['genetic_activity'],
    value=0, alternative='two-sided', usevar='pooled'
)
print("t-statistic for gene 136 = {:.2f}".format(t_statistic_gene136))
print("p-value for gene 136 = {:.4f}".format(p_value_gene136))
print("degrees of freedom = {}".format(dof))
Inferential aspect of a two-sided t-test
A small significance level (or "p-value") is a statement of statistical surprise: something very unusual has happened if in fact there is no difference in gene $136$ expression levels between ALL and AML patients. We are less surprised by $t = 3.01$ if gene $136$ is just one candidate out of thousands that might have produced "interesting" results. (Source: Efron, B., & Hastie, T. (2016). Computer Age Statistical Inference: Algorithms, Evidence, and Data Science. Cambridge: Cambridge University Press.)
%%time
# run a two-sided t-test for all genes
t_statistics = []
genes = leukemia_df['gene'].drop_duplicates()
for gene in genes:
    # filter the gene in question by leukemia type
    leukemia_gene_df = leukemia_df[(leukemia_df.gene == gene)]
    leukemia_gene_ALL_df = leukemia_gene_df[(leukemia_gene_df.leukemia == "ALL")]
    leukemia_gene_AML_df = leukemia_gene_df[(leukemia_gene_df.leukemia == "AML")]
    # run a two-sided t-test
    t_statistic, p_value, _ = ttest_ind(
        leukemia_gene_AML_df['genetic_activity'], leukemia_gene_ALL_df['genetic_activity'],
        value=0, alternative='two-sided', usevar='pooled'
    )
    t_statistics.append(t_statistic)
# counts how many t-statistics are greater than the one for gene 136
# (wrap the list in np.array so the elementwise comparison is valid)
t_statistics_GT_gene136 = (np.array(t_statistics) > t_statistic_gene136)
t_statistics_GT_gene136 = np.sum(t_statistics_GT_gene136)
# calculates the proportion of t-statistics that are greater than the one for gene 136
total_statistics = len(t_statistics)
proportion_statistics_GT_gene136 = t_statistics_GT_gene136 / float(total_statistics)
print("total genes = {0}".format(total_statistics))
print("t-statistic for gene 136 = {:.2f}".format(t_statistic_gene136))
print("number of genes w/ t-stat > t-stat-gene136 = {}".format(t_statistics_GT_gene136))
print("proportion of genes w/ t-stat > t-stat-gene136 = {:.1f}%".format(proportion_statistics_GT_gene136*100))
alt.data_transformers.enable('default', max_rows=None)
hist_tstats_df = pd.DataFrame({'t_statistic': t_statistics})
hist_tstats_plot = alt.Chart(hist_tstats_df).mark_bar().encode(
    alt.X('t_statistic:Q', bin=alt.Bin(step=(5/13))),
    alt.Y('count()')
)
t_statistic_gene136_mark = alt.Chart(hist_tstats_df).mark_rule(color='red').encode(
    x='t_statistic_gene136:Q',
    size=alt.value(1)
).transform_calculate(
    t_statistic_gene136=str(t_statistic_gene136)
)
plot = (hist_tstats_plot + t_statistic_gene136_mark).properties(
    title="Two-sided t-statistics for 7128 genes | t-statistic for gene 136"
)
plot
# theoretical null distribution for the t-test of this example
t_distribution_70dof = st.t(df=dof)
def draw_density_plot(distribution, distribution_name, color):
    # create the points at which to evaluate the probability density function
    x_i = distribution.ppf(0.001)
    x_f = distribution.ppf(0.999)
    x = np.linspace(x_i, x_f, 100)
    f_x = distribution.pdf(x)
    # draw the density function f(x)
    data = pd.DataFrame({'x': x, 'f(x)': f_x})
    density = alt.Chart(data).mark_line(color=color).encode(
        x='x:Q',
        y='f(x):Q'
    ).properties(title='{0}'.format(distribution_name))
    return density
t_distribution_70dof_plot = draw_density_plot(t_distribution_70dof, "t-distribution with 70 dof", 'yellow')
t_distribution_70dof_plot
number_of_bins = (np.max(t_statistics) - np.min(t_statistics)) / (5/13)
number_of_bins = int(np.ceil(number_of_bins))
# normalized histogram: "pdf_at_bins" holds the value of the empirical probability
# density function (pdf) at each bin, normalized so that the integral over the
# range is 1; "bin_edges" holds the bin boundaries
pdf_at_bins, bin_edges = np.histogram(t_statistics, bins=number_of_bins, density=True)
hist_tstats_df = pd.DataFrame({'pdf': pdf_at_bins, 't_statistic': bin_edges[:-1]})
hist_tstats_plot = alt.Chart(hist_tstats_df).mark_bar().encode(
    x='t_statistic:Q',
    y='pdf:Q'
)
hist_tstats_plot
figure_1_5 = (hist_tstats_plot + t_distribution_70dof_plot).properties(
    title="Two-sided t-statistics for 7128 genes | theoretical null density for the t-statistic"
)
figure_1_5
The histogram implies that in this study there is something wrong with the theoretical null distribution ("Student's t with $70$ degrees of freedom"), the smooth curve in Figure 1.5. It is much too narrow at the center, where presumably most of the genes are reporting non-significant results. (Source: Efron, B., & Hastie, T. (2016). Computer Age Statistical Inference: Algorithms, Evidence, and Data Science. Cambridge: Cambridge University Press.)
We will see in Chapter 15 that a low false-discovery rate, i.e., a low chance of crying wolf over an innocuous gene, requires $t$ exceeding $6.16$ in the ALL/AML study. Only $47$ of the $7128$ genes make the cut. False-discovery-rate theory is an impressive advance in statistical inference, incorporating Bayesian, frequentist, and empirical Bayesian (Chapter 6) elements. It was a necessary advance in a scientific world where computer based technology routinely presents thousands of comparisons to be evaluated at once. (Source: Efron, B., & Hastie, T. (2016). Computer Age Statistical Inference: Algorithms, Evidence, and Data Science. Cambridge: Cambridge University Press.)
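As a quick sanity check of the quoted claim (a sketch; the exact count depends on the sign convention and on reproducing the book's computations faithfully), we can count how many of the $7128$ t-statistics computed above exceed $6.16$ in absolute value:
# count genes whose two-sided t-statistic exceeds the FDR threshold quoted in the book
fdr_threshold = 6.16
significant_genes = np.sum(np.abs(np.array(t_statistics)) > fdr_threshold)
print("genes with |t| > {0} = {1}".format(fdr_threshold, significant_genes))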
References beyond the book itself: