Introduction

This document is intended to help you prepare for the final exam. No answers will be posted, but I suggest you attempt the activities and seek feedback from the Assistance Hub tutors, or from Alec and Liza in the Friday labs.

Remember that calculators are not required for this exam. You will see that previous exams have a lot of calculation questions, whereas ours will not, so please be prepared for some differences. Class activities, assignments and labs are the best guide to coverage, in addition to this document.

Overview

  • The exam is worth 50 points overall.
  • It is organised into three sections.
  • There are 8 pages, including a cover page (below).
  • Content from all weeks of the course is eligible for coverage, but the focus will be on handouts covered in the second half of the semester.

Time management

I think a very well prepared student will need about one minute per point. The exam is 2 hours long plus reading time, so even at two minutes per point the 50 points take 100 of the 120 minutes available, and I hope most students won’t feel overly time pressured.

Question 1: Simulation study [20 marks]

Make sure you’re familiar with:

  • The class activities and discussions from W9 L3 & W10 L1.

  • Assignment 4 (which refers to Nemes et al. (2009)).

  • Week 09 Lab Task discussion questions.

Activity

Suppose you are interested in using a simulation study to explore the distribution of p-values associated with a \(\beta\) coefficient in logistic regression when the null hypothesis is true.

Specifically, you are interested in setting up a bootstrap with the following criteria:

  • a binary response variable

  • an explanatory variable that is a factor with 3 levels, "a", "b" and "c", all with equal probability of getting a 1 or 0 for the response.

  1. Write code to simulate 10,000 repetitions for samples of size 100, saving the value of the p-value associated with the coefficient for “b”.

  2. Plot these p-values.

  3. What is this plot?

  4. Is this an example of parametric bootstrapping, non-parametric bootstrapping, or something else?
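One possible way to set up parts 1 and 2 is sketched below. It is only a sketch, written in Python with the standard library, which is an assumption on my part since this document does not fix a language. Two further assumptions are labelled in the comments: each observation's factor level is drawn at random with equal probability, and "a" is the reference level. Rather than calling a model-fitting routine, the sketch uses the closed-form log odds ratio and Wald standard error, which should match the coefficient and standard error a logistic regression reports when the only predictor is a single factor. The function names are hypothetical. Parts 3 and 4 are left for you to discuss with a tutor.

```python
import math
import random


def wald_p_value_b(n=100, rng=random):
    """Simulate one sample under the null; return the Wald p-value for level "b".

    Returns None in the (very rare) case of an empty cell, where the
    estimate is not finite.
    """
    # counts[level] = [number of 0 responses, number of 1 responses]
    counts = {"a": [0, 0], "b": [0, 0], "c": [0, 0]}
    for _ in range(n):
        level = rng.choice("abc")  # assumption: levels drawn with equal probability
        y = 1 if rng.random() < 0.5 else 0  # null: P(y = 1) = 0.5 at every level
        counts[level][y] += 1

    a0, a1 = counts["a"]
    b0, b1 = counts["b"]
    if min(a0, a1, b0, b1) == 0:
        return None

    # Assumption: "a" is the reference level, so the coefficient for "b" is the
    # log odds ratio of b versus a, with the usual Wald standard error.
    beta_b = math.log(b1 / b0) - math.log(a1 / a0)
    se_b = math.sqrt(1 / a0 + 1 / a1 + 1 / b0 + 1 / b1)
    z = beta_b / se_b
    # Two-sided p-value from the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_p_values(reps=10_000, n=100, seed=1):
    """Repeat the simulation reps times, dropping any undefined results."""
    rng = random.Random(seed)
    p_values = []
    for _ in range(reps):
        p = wald_p_value_b(n, rng)
        if p is not None:
            p_values.append(p)
    return p_values


def text_histogram(p_values, bins=10, width=50):
    """Return a plain-text plot of binned p-value counts."""
    counts = [0] * bins
    for p in p_values:
        counts[min(int(p * bins), bins - 1)] += 1
    scale = width / max(counts)
    lines = []
    for i, c in enumerate(counts):
        bar = "#" * round(c * scale)
        lines.append(f"{i / bins:.1f}-{(i + 1) / bins:.1f} | {bar} {c}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(text_histogram(simulate_p_values()))
```

In the course software you would more likely fit the model directly and pull the p-value out of the coefficient table. The structure stays the same: generate data under the null, fit, store one p-value, repeat.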

Question 2: Art? aRt? [20 marks]

Statistics is often called both “an art and a science”. In this question, we will take a look at the actual world of fine art. As a lot of the data on forgeries, and the AI tools being developed to detect them, are sensitive and/or proprietary, we’ll be working with simulated data for this question, but based on some real-world estimates related to our questions of interest.

Blurb

The following text will be provided in the exam.

Rob R. Faiksette Esq. (or at least that’s what it says on his business card) heard about your excellent statistical skills and would like your help with some data from his small gallery where he sells paintings. It’s very unfortunate, but several of his clients have recently discovered paintings they purchased from him were, in fact, (high quality!) forgeries!

The data set art_data.csv contains the following variables:

  • title: character; the title of the painting

  • forged: logical; whether the art was discovered to be forged or not.

  • tech1, tech2, tech3, tech4: numeric; “proprietary” metrics based on a high resolution scan

  • price: numeric; how much the piece sold for, in thousands of euros.

  • has_the: logical; whether the title of the artwork has the word “the” in it.

This dataset contains 498 paintings Faiksette has sold in his gallery. According to him, tech1, tech2, tech3, and tech4, are “proprietary metrics” that he can’t disclose too many details of due to their commercial sensitivity. All he would say is that they were calculated based on high-resolution scans of the paintings, taking into account features like brush stroke patterns and pigments used.

Faiksette had another statistician helping him, but they left when they found out their payment would be made in paintings. Some of the code and outputs they developed are shown below. Use them to answer the following questions.

Suggested preparation

  • Load the data and explore it
  • Consider some appropriate models
    • Consider the variance inflation factors
    • Use GAMs to investigate any non-linearity
    • Find the best model according to AIC and BIC
  • What statistic would be most appropriate for considering which model works best for prediction? How would this work with cross-validation?
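As a reminder of how the information criteria mentioned above penalise complexity, here is a minimal Python sketch of the standard AIC, AICc and BIC formulas. The function name and the log-likelihood values are hypothetical and do not come from the art data. Only the sample size of 498 is taken from the blurb.

```python
import math


def information_criteria(log_lik, k, n):
    """Return AIC, AICc and BIC for a fitted model.

    log_lik: maximised log-likelihood
    k: number of estimated parameters
    n: number of observations
    """
    aic = -2 * log_lik + 2 * k
    # AICc adds a small-sample correction that shrinks as n grows.
    aicc = aic + (2 * k * (k + 1)) / (n - k - 1)
    # BIC charges log(n) per parameter instead of 2.
    bic = -2 * log_lik + k * math.log(n)
    return {"AIC": aic, "AICc": aicc, "BIC": bic}


if __name__ == "__main__":
    n = 498  # number of paintings in the blurb
    # Hypothetical log-likelihoods for a small and a larger model.
    for label, log_lik, k in [("small", -250.0, 3), ("large", -247.5, 6)]:
        print(label, information_criteria(log_lik, k, n))
```

Because log(n) exceeds 2 once n is larger than about 7, BIC penalises each extra parameter more heavily than AIC at this sample size, so the two criteria can disagree about which model is best.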

Optional (not examinable): Always look at your data. Even if there isn’t a T-Rex 🦖 hiding in your data, I think you may notice some anomalies with Rob R. Faiksette Esq.’s stock of paintings…

Question 3: Causal mediation [10 marks]

Blurb

The following text will be provided in the exam.

Blakely et al. (2018) explore how differences in socioeconomic position (SEP) and smoking behaviours (Smoking) between Māori and Europeans (Ethnicity) in Aotearoa New Zealand explain differences in mortality outcomes (Mortality) (i.e., did a person die or not). The paper uses a ‘causal mediation’ approach (and also bootstrapping!) to estimate that, controlling for age (Age), almost HALF of the disparity in mortality outcomes between males (Sex) in these two groups was explained by differences in socioeconomic position. The six variable names are shown in parentheses.

Activity

  1. Redraw the causal diagram from Figure 1B with the following requirements:
    1. Simplify the grey box to just be “Ethnicity”.
    2. Separate Age and Sex and have them just be initial nodes, with no arrows leading into them, only out.
    3. Colour the direct path between Ethnicity and Mortality RED.
    4. Colour any indirect paths between Ethnicity and Mortality BLUE.
    5. Colour any other direct effects on Mortality GREEN.
  2. Based on the causal diagram, which variables should be included in a model if we want to estimate:
    1. the total effect of Ethnicity on Mortality?
    2. the direct effect of Ethnicity on Mortality?

General

  • Can you identify whether prediction or explanation is the main purpose of a study? (See W10 L1 class discussion)

  • Can you identify the design of a study? (Observational vs designed experiment.) (See W10 L1 class discussion)

    • Could you explain whether or not the same research question could be answered by the other design type? (See W10 L2 class discussion)
  • Can you identify what model would be most appropriate (at least to start with) for investigating a research question based on a brief description of the response variable? (A big part of the whole first half of the course)

  • Can you write null and alternative hypotheses associated with beta coefficients in regression?

  • Can you interpret a confidence interval in context?

    • Look at sample answers for the test for an idea of what goes into a good answer for this. (A1 & A2 sample answers also).

    • Can you identify the level of confidence from code? (See W10 L1 class discussion).

  • Do you understand the relationship between p-values and confidence intervals?

    • Do you know the null value on the log/logit scale, multiplicative scale, etc?

    • Based on a p-value, could you tell if a null value was in or outside of a CI? Vice versa?

  • What is a prediction interval and how do they compare to confidence intervals?

  • Why do we do cross-validation?

  • How different information criteria (AICc, AIC, BIC) relate to model complexity, how their penalties work, etc. (Handout 8, recapped with the evaporation example in Handout 15, see class discussion from W10 L2).

  • Using GAMs as a tool for model development and selection (see class discussion from W10 L2 & 3).

  • What is a variance inflation factor and what can it tell us?

    • Can you recognise the code that does this/a description of the matrix operations?
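For the items above on p-values and confidence intervals, the following Python sketch may help you check your own reasoning. It assumes a Wald test and a Wald interval built from the same estimate and standard error. The function name and the example numbers are hypothetical. Under that assumption, the null value falls outside the interval exactly when the two-sided p-value is below one minus the confidence level.

```python
import math
from statistics import NormalDist


def wald_summary(estimate, se, null_value=0.0, level=0.95):
    """Wald confidence interval and two-sided p-value on the log/logit scale."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - (1 - level) / 2)
    lower = estimate - z_crit * se
    upper = estimate + z_crit * se
    z = (estimate - null_value) / se
    p_value = 2 * (1 - std_normal.cdf(abs(z)))
    return {
        "ci": (lower, upper),
        "p_value": p_value,
        # Exponentiating moves everything to the multiplicative scale,
        # where a null value of 0 becomes exp(0) = 1.
        "ci_multiplicative": (math.exp(lower), math.exp(upper)),
        "null_multiplicative": math.exp(null_value),
    }


if __name__ == "__main__":
    # Hypothetical estimates and standard errors, for illustration only.
    for estimate, se in [(0.8, 0.3), (0.4, 0.3)]:
        result = wald_summary(estimate, se)
        lower, upper = result["ci"]
        print(estimate, se, round(result["p_value"], 4), lower <= 0.0 <= upper)
```

Intervals built a different way, such as profile-likelihood intervals, need not line up exactly with a Wald p-value, so check which kind of interval your code produces.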

References

Blakely, T., G. Disney, L. Valeri, J. Atkinson, A. Teng, N. Wilson, and L. Gurrin. 2018. “Socioeconomic and Tobacco Mediation of Ethnic Inequalities in Mortality over Time: Repeated Census-Mortality Cohort Studies.” Epidemiology 29 (4): 506–16. https://doi.org/10.1097/EDE.0000000000000842.
Nemes, Sandor, Johanna M Jonasson, Annika Genell, et al. 2009. “Bias in Odds Ratios by Logistic Regression Modelling and Sample Size.” BMC Medical Research Methodology 9 (1): 56.