STAT 507 — Statistical Data Science

Author

David Kepplinger

Published

January 17, 2025

Administrative

  • Course meetings: Wednesdays, Jan 22 – Apr 30 (final exam period: May 7). No class on March 12th (Spring break).
    • Class time: 4:30 – 7:10 P.M. in Innovation Hall 135.
  • Instructor: Dr. David Kepplinger (he/him/his)
    • Email: dkepplin@gmu.edu
    • Office: Nguyen Engineering Building (ENGR), Room 1711
    • Office hour: Thursday, 2 – 3 P.M. virtually over Zoom (Meeting ID: 910 8972 0315; Passcode: 368242)
  • Canvas course page: https://canvas.gmu.edu/courses/28878

Course description

Overview

The following topics will be covered in some detail.

  • Introduction to statistical programming languages:
    • Writing code in R (with the “tidyverse”) and Python
    • Common data structures
    • Interfacing with cloud computing frameworks/libraries
    • Parallel computing
    • Interactive visualizations
    • Data wrangling and accessing data from multiple sources
  • Project organization, including:
    • git and GitHub for version control and collaborating
    • Managing analysis environments using renv for R and Miniconda for Python
    • Literate programming
  • Overview of common statistical methods, including:
    • Linear & logistic regression
    • Ensemble methods for regression & classification
    • Resampling techniques for inference
    • Multiple testing
    • Dimension reduction
    • Smoothing
    • Penalized regression

Learning outcomes

After successfully completing this course you are expected to have mastered the following.

  • Using R and Python for data science projects
  • Managing large data analysis projects
  • Running data analyses on high-performance computing (HPC) clusters and in other unattended environments (cloud computing)
    • Using the command line to manage data analysis jobs
  • Understanding the utility and applicability of common statistical methods encountered in data science projects
  • Applying approximate inference using resampling techniques
  • Understanding and mitigating sources of bias in observational studies
  • Working in teams and communicating effectively

Prerequisites

Recommended Prerequisite: STAT 250 or STAT 344 or equivalent, some familiarity with programming concepts, or permission of the instructor.

Textbooks

The main textbooks for this course are

  • Wickham, H., Çetinkaya-Rundel, M., Grolemund, G. (2023). R for Data Science. 2nd Edition. O’Reilly. This book is available online for free.
  • Yu, B., Barter, R.L. (2024). Veridical Data Science. MIT Press. This book is available online for free.
  • VanderPlas, J. (2022). Python Data Science Handbook. 2nd Edition. O’Reilly. This book is available online for free.
  • Peng, R.D., Matsui, E. (2018). The Art of Data Science. Leanpub. Available online for free.
  • Peng, R.D. (2022). R Programming for Data Science. Leanpub. Available online for free.
  • James, G., Witten, D., Hastie, T., Tibshirani, R. (2021). An Introduction to Statistical Learning with Applications in R. 2nd Edition. Springer. This book is available online for free.

More advanced material, which will likely not be covered in class but may be of interest to some learners, can be found in

  • Wickham, H. (2019). Advanced R. Chapman and Hall/CRC. doi:10.1201/b17487. Available online.

Additional articles and book chapters may be posted in our Canvas course as different topics are discussed.

Academic integrity

Mason is an Honor Code university; please see the Office for Academic Integrity for a full description of the code and the honor committee process. Three fundamental principles to follow at all times are that:

  1. all work submitted be your own, as defined by the assignment;
  2. when you use the work, the words, or the ideas of others, including fellow students or online sites, you give full credit through accurate citations; and
  3. if you are uncertain about the ground rules on a particular assignment or exam, ask for clarification.

No grade is important enough to justify academic misconduct.

Use of generative AI

You may use Generative AI tools whenever you believe they would be useful to your learning of course material. You are particularly encouraged to leverage Generative AI to fully understand code shared by the instructor and/or code from textbooks. You must properly cite the tool(s) you use and include a statement of usage. In R or Python code, provide the citations and the statement of usage as comments. All academic integrity violations will be reported to the Office for Academic Integrity.
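As an illustration only, a citation and statement of usage written as code comments might look like the sketch below. The syllabus does not prescribe a format, so the layout, the placeholder tool name, and the function `bootstrap_means` are all hypothetical; check the assignment instructions for what is actually required.

```python
# Statement of Generative AI usage (hypothetical layout, not an official template)
#   Tool:     <name and version of the Generative AI tool used>
#   Used for: explaining how resampling with replacement works
#   My work:  the function below was written and tested by me
import random


def bootstrap_means(values, n_resamples=1000, seed=507):
    """Return the means of `n_resamples` bootstrap resamples of `values`."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n = len(values)
    # Each resample draws n values with replacement from the original data.
    return [sum(rng.choices(values, k=n)) / n for _ in range(n_resamples)]
```

Placing the statement at the top of the file keeps it visible to anyone who opens the script, which is one reasonable way to satisfy the requirement that the usage be disclosed in the code itself.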

Generative AI tools may only be used if following the fundamental principles of the Honor Code. This includes being honest about the use of these tools for submitted work and including citations when using the work of others, whether individual people or Generative-AI tools.

Although you are unrestricted with your use of Generative AI tools, you will be responsible for any incorrect, biased, or unethical information that is submitted. Your assignment grade will reflect the inclusion of any material that is incorrect or offensive.

Logistics

The class is scheduled as a face-to-face meeting on campus, with several classes conducted virtually over Zoom. Please see the schedule on the Canvas page for details.

All learners taking courses with a face-to-face component are required to follow the university’s public health and safety precautions and procedures outlined on the University’s Safe Return to Campus webpage. If the campus closes, or if a class meeting needs to be canceled or adjusted due to weather or other concerns, learners should check the Canvas course for updates on how to continue learning and for information about any changes to events or assignments.

Communications

The Canvas site for this course is the primary channel of communication. Please check the Canvas course regularly for updates! Information posted on the Canvas site includes

  • announcements,
  • lecture notes,
  • homework assignments, quizzes, midterm and final exam,
  • changes to the posted office hours,
  • handouts and readings.

Any question related to concepts and topics should be asked on the Course Q&A. Questions will be visible to all registered students, and everyone is expected to actively participate in answering questions posted by peers. Active participation in answering questions will be counted towards the participation grade.

E-mail communication must be restricted to questions relating to sensitive and confidential information (such as grade concerns, personal circumstances requiring specific accommodations, etc.).

  • E-mails will be returned within 2 business days and may not be returned on weekends/holidays.
  • When you send an e-mail to me, please put STAT 507 at the beginning of the subject line.
  • E-mails related to this course must be sent and received via your Mason e-mail account. E-mails sent from other e-mail accounts may not be answered. (This is a university policy and part of your guaranteed rights under FERPA.)
  • E-mails with questions that should be posted to our course Q&A may not be answered.

Should you have concerns that you may not be able to fully participate or engage in any of the activities listed below, please do not hesitate to contact me by e-mail or to speak to me in person during office hours or after class. We can discuss alternative arrangements that suit your needs.

Hardware requirements

We will frequently use laptop computers for in-class activities. Please be respectful of your peers and your instructor and do not engage in activities that are unrelated to the class.

Software requirements

This class will use the following interpreters and programming environments:

Activities and assignments in this course may sometimes use web-conferencing software (Zoom). In addition to the requirements above, you are required to have a device with a functional camera and microphone. In an emergency, you can connect through a telephone call, but video connection is the expected norm.

Grading

Your grade in this course will be based on in-class activities, bi-weekly homework assignments of different types, and various submissions related to a data analysis project.

The instructor reserves the right to change the weights if needed.

Assignment                     # of graded assignments    Weight each    Weight total
In-class participation         10                         1%             10%
In-class activities            9                          4%             36%
1st Data Science Case Study    1                          20%            20%
2nd Data Science Case Study    1                          34%            34%
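
The weights in the table above can be combined into a course percentage as in the following sketch. It is not an official grade calculator: the function name `course_percentage` and the convention that every score is a fraction between 0 and 1 are assumptions made for illustration. The lowest of the 10 in-class activity scores is dropped, as described under “In-class activities”.

```python
def course_percentage(participation, activities, case_study_1, case_study_2):
    """Combine component scores (fractions in [0, 1]) into a percentage.

    participation: 10 scores, each worth 1%
    activities:    10 scores, the lowest is dropped, the other 9 are worth 4% each
    case_study_1:  overall score for the first case study (worth 20%)
    case_study_2:  overall score for the second case study (worth 34%)
    """
    if len(participation) != 10 or len(activities) != 10:
        raise ValueError("expected 10 participation and 10 activity scores")
    kept = sorted(activities, reverse=True)[:9]  # drop the lowest activity score
    return (
        1 * sum(participation)
        + 4 * sum(kept)
        + 20 * case_study_1
        + 34 * case_study_2
    )
```

For example, full marks everywhere except a single missed in-class activity still yields 100%, because the missed activity is the one that is dropped.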

Written and oral communication are integral parts of any statistical work, and as such, grammar, style, and spelling are part of the grading rubrics applied to all deliverables. You are strongly encouraged to use the resources and tutoring offered by the Writing Center (https://writingcenter.gmu.edu).

Unless clearly communicated in advance, all assignments in this course are designated as individual assignments, which are to be undertaken independently. You may discuss your ideas with others but everything you turn in must be your own work. You may not share analyses, graphs, code, and other materials. You are responsible for making sure that there is no reason to doubt that the work you hand in is your own.

Attendance

Attendance is mandatory and in-class participation counts towards your final grade. You are responsible for material covered in class and announcements made during class. In case of approved absence, you are expected to get notes from your peers and submit the in-class activity on an alternative schedule as arranged with the instructor.

Participation

Success in this course requires active participation in in-class activities and discussions, for which you will need to prepare in advance. Accordingly, you are expected to prepare for each class period by

  • reading the corresponding sections of textbooks or research articles to be covered in class,
  • reviewing class materials posted on Canvas,
  • familiarizing yourself with the use of the covered methods and techniques.

In-class activities

There will be 10 in-class activities throughout the term, which will vary in length and content. The activity with your lowest score will not count towards your final grade. In-class activities are due the day after the lecture at 11:59 PM and must be submitted on Canvas.

Late submissions will be penalized by reducing the total number of points possible by 10% of the original total for each day late. For example, an in-class activity worth a total of 10 points will be worth at most 9 points when submitted within the first 24 hours after the due date, 8 points when submitted 24–48 hours after the due date, and 7 points when submitted 48–72 hours after the due date. Submissions will not be accepted more than 72 hours (3 days) past the due date.
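
The late-penalty rule can be sketched in a few lines of Python. The function name `max_points_when_late` is hypothetical, and the handling of the exact 24-, 48- and 72-hour boundaries (treated here as belonging to the earlier window) is an assumption, since the policy does not spell it out.

```python
import math


def max_points_when_late(total_points, hours_late):
    """Maximum achievable points for a submission `hours_late` hours past due.

    Returns None if the submission would no longer be accepted.
    """
    if hours_late <= 0:
        return float(total_points)  # on time: no penalty
    if hours_late > 72:
        return None  # more than 3 days late: not accepted
    days_late = math.ceil(hours_late / 24)  # every started day counts in full
    # Deduct 10% of the original total for each day late.
    return total_points * (10 - days_late) / 10
```

With a 10-point activity, submissions that are 5, 30, and 60 hours late can earn at most 9, 8, and 7 points respectively, matching the example in the policy.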

Case studies

There will be two graded case studies conducted over the course of the semester. For each case study you will be graded on the case study report, the rigor and validity of the analysis as well as the analysis code (including reproducibility). Grading rubrics will be shared on Canvas.

Case study 1

The topic and scope of the first case study will be provided by the instructor. All work is to be done individually.

Case study 1                  Weight
Report                        5%
Data analysis                 10%
Code and reproducibility      5%
Total                         20%

Case study 2

The second case study will be done in pairs to practice communication and teamwork skills. Each team must decide on the topic and scope of their case study, with guidance and approval from the instructor.

In addition to the components of the first case study, the teams must also give an oral presentation about their case study. You must also review a draft report from one other team.

Case study 2                  Weight
Report                        8%
Data analysis                 13%
Presentation                  5%
Peer review                   3%
Code and reproducibility      5%
Total                         34%

Grading Policies

Percentage grade      Letter grade
≥ 90%                 A
≥ 80% but < 90%       B
≥ 70% but < 80%       C
< 70%                 F
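
The cutoffs above translate into a simple lookup, sketched below. The function name `letter_grade` is hypothetical, and the sketch assumes the percentage is compared against the cutoffs without any rounding, which the policy does not specify.

```python
def letter_grade(percentage):
    """Map a course percentage (0 to 100) to a letter grade."""
    if percentage >= 90:
        return "A"
    if percentage >= 80:
        return "B"
    if percentage >= 70:
        return "C"
    return "F"
```

Under this assumption, a percentage of 89.9 maps to a B and exactly 90 maps to an A.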

Regrading policies

You have at most one week after a score is posted for an assignment to appeal the score. If you want parts of an assignment regraded, send an e-mail to the instructor specifying the question/part and the reason for requesting a review. If you do not notify the instructor in writing of any issues with your score within that week, the posted score stands (whether or not it is correct).

Further policies

Please see the George Mason Common Course Policies for additional policies governing this course.