This document is intended to help you prepare for the final exam. No answers will be posted, but I suggest you attempt the activities and seek feedback from the Assistance Hub tutors or Alec and Liza in Friday labs.
Remember that calculators are not required for this exam. You will see that previous exams have a lot of calculation questions, whereas ours will not. Please be prepared for some differences. Class activities, assignments and labs are the best guide to coverage, in addition to this document.
I think it will take a very well prepared student about one minute per point. The exam is 2 hours long plus reading time so even if you’re spending two minutes per point, I hope most students won’t feel overly time pressured.
Make sure you’re familiar with:
The class activities and discussions from W9 L3 & W10 L1.
Assignment 4 (which refers to Nemes et al. (2009)).
Week 09 Lab Task discussion questions.
Suppose you are interested in using a simulation study to explore the distribution of p-values associated with a \(\beta\) coefficient in logistic regression when the null hypothesis is true.
Specifically, you are interested in setting up a bootstrap with the following criteria:
a binary response variable
an explanatory variable that is a factor with 3 levels, "a", "b" and "c", all with equal probability of getting a 1 or 0 for the response.
Write code to simulate 10,000 repetitions for samples of size 100, saving the value of the p-value associated with the coefficient for “b”.
Plot these p-values.
What is this plot?
Is this an example of parametric bootstrapping, non-parametric bootstrapping, or something else?
Statistics is often called both “an art and a science”. In this question, we will take a look at the actual world of fine art. As a lot of the data on forgeries, and the AI tools being developed to detect them, are sensitive and/or proprietary, we’ll be working with simulated data for this question, but based on some real-world estimates related to our questions of interest.
The following text will be provided in the exam.
Rob R. Faiksette Esq. (or at least that’s what it says on his business card) heard about your excellent statistical skills and would like your help with some data from his small gallery where he sells paintings. It’s very unfortunate, but several of his clients have recently discovered paintings they purchased from him were, in fact, (high quality!) forgeries!
The data set art_data.csv contains the following variables:
title: character; the title of the painting
forged: logical; whether the art was discovered to be forged or not.
tech1, tech2, tech3, tech4: numeric; “proprietary” metrics based on a high resolution scan
price: numeric; how much the piece sold for in 000s of Euros.
has_the: logical; whether the title of the artwork has the word “the” in it.
This dataset contains 498 paintings Faiksette has sold in his gallery. According to him, tech1, tech2, tech3, and tech4 are “proprietary metrics” that he can’t disclose too many details of due to their commercial sensitivity. All he would say is that they were calculated based on high-resolution scans of the paintings, taking into account features like brush stroke patterns and pigments used.
Faiksette had another statistician helping him, but they left when they found out their payment would be made in paintings. Some of the code and outputs they developed are shown below. Use them to answer the following questions.
Optional (not examinable): Always look at your data. Even if there isn’t a T-Rex 🦖 hiding in your data, I think you may notice some anomalies with Rob R. Faiksette Esq.’s stock of paintings…
The following text will be provided in the exam.
Blakely et al. (2018) explore how differences in socioeconomic position (SEP) and smoking behaviours (Smoking) between Māori and Europeans (Ethnicity) in Aotearoa New Zealand explain differences in mortality outcomes (Mortality) (i.e., did a person die or not). The paper uses a ‘causal mediation’ approach (and also bootstrapping!) to estimate that, controlling for age (Age), almost HALF of the disparity in mortality outcomes between males (Sex) in these two groups was explained by differences in socioeconomic position. The six variable names are given in parentheses.
Can you identify whether prediction or explanation is the main purpose of a study? (See W10 L1 class discussion)
Can you identify the design of a study? (Observational vs designed experiment.) (See W10 L1 class discussion)
Can you identify what model would be most appropriate (at least to start with) for investigating a research question based on a brief description of the response variable? (A big part of the whole first half of the course)
Can you write null and alternative hypotheses associated with beta coefficients in regression?
Can you interpret a confidence interval in context?
Look at the sample answers for the test for an idea of what goes into a good answer for this (and the A1 & A2 sample answers also).
Can you identify the level of confidence from code? (See W10 L1 class discussion).
Do you understand the relationship between p-values and confidence intervals?
Do you know the null value on the log/logit scale, multiplicative scale, etc?
Based on a p-value, could you tell if a null value was in or outside of a CI? Vice versa?
What is a prediction interval and how does it compare to a confidence interval?
Why do we do cross-validation?
Do you understand how different information criteria (AICc, AIC, BIC) relate to model complexity, how their penalties work, etc.? (Handout 8, recapped with the evaporation example in Handout 15, see class discussion from W10 L2).
Using GAMs as a tool for model development and selection (see class discussion from W10 L2 & 3).
What is a variance inflation factor and what can it tell us?
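As a refresher on that last item (a sketch for revision, not a model answer): the variance inflation factor of an explanatory variable is 1 / (1 - R²), where R² comes from regressing that variable on all the other explanatory variables. The example below is an illustration only. Python with NumPy is an assumption (it is not necessarily the software used in this course), the function name variance_inflation_factors is hypothetical, and the data are simulated purely for demonstration.

```python
import numpy as np


def variance_inflation_factors(X):
    """Return the VIF of each column of X (n rows, p explanatory variables).

    Hypothetical helper for illustration: each column is regressed on the
    remaining columns (plus an intercept) and VIF = 1 / (1 - R^2).
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n), others])
        # Ordinary least squares fit of column j on the other columns
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        resid = y - design @ coef
        r_squared = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
        vifs[j] = 1.0 / (1.0 - r_squared)
    return vifs


# Simulated (made-up) explanatory variables: x3 is nearly a copy of x1
rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + 0.1 * rng.normal(size=n)

vifs = variance_inflation_factors(np.column_stack([x1, x2, x3]))
print(np.round(vifs, 2))
```

In this simulated setting x1 and x3 carry almost the same information, so their VIFs should be large, while the VIF for x2 should sit close to 1.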