The Applied Social Data Science Programme is a postgraduate diploma (PG Dip.) offered by the Department of Political Science and funded by the HEA’s Springboard+ initiative. The target student group for Springboard+ programmes is unemployed people with a previous history of employment, those in employment, and those returning to the workforce. In practice, approximately half of students are recent graduates. The programme is currently close to completing its second year, and next year it will also be offered as a paid MSc (i.e. with an additional 30 ECTS dissertation). As well as Springboard+ students, ASDS modules are also offered to Political Science PhD researchers, and the Statistics I and II modules are compulsory methods classes for first-year PhD researchers.
Following year one of the programme, it became clear from student feedback that teaching did not always meet expectations, particularly regarding the retraining and upskilling aspect of Springboard+ programmes. Students felt that the programme was too narrowly focussed on purely academic skills appropriate for PhD researchers and those wishing to pursue a career in higher education, and did not place enough emphasis on the broader skillset required to work as a data scientist in the private or public sector.
A need was therefore identified to improve professionalisation and to introduce students to the wider set of techniques necessary to be competitive in the job market. These include managing the data science workflow, familiarity with common data science platforms for collaboration, and the more practical coding skills needed to wrangle data and communicate results.
To meet this need I redesigned my teaching for Statistics I, a compulsory 10 ECTS module offered to both ASDS students and first-year PhD researchers. Teaching on this 10-week module is currently divided between two hours per week of lectures, given by a colleague, and two hours per week of tutorials/labs, taught by me. In year one of the programme these tutorials were divided into two one-hour sessions for two groups of students.
My proposed solution involved combining the two groups into a single two-hour class, and then using the additional time available to restructure the pedagogical approach of the tutorials towards a more praxis-oriented method. Rather than treating professional skill acquisition as separate from the learning outcomes, and simply pointing students toward additional external resources, I attempted to model each tutorial around a (realistic) data science project workflow, including the systems and tools which would be necessary to complete the task.
Within this approach an emphasis was placed on iteration: the same processes were repeated each week, with progressive complications and technical challenges added to stretch students whilst reinforcing a mental map of data science as a set of practices.
To begin with, processes were introduced simply as workflow (i.e., without an explicit requirement to code or engage with overly technical resources); once the motivation for an approach was clear, the technical aspect of practice was then gradually introduced, until students were able to grasp both why and how certain processes were followed.
Implementation of the new teaching strategy involved advancing pedagogy in three specific areas: firstly, new teaching material was developed in skills for workflow management and professionalisation; secondly, synchronous teaching sessions were redesigned to include greater focus on collaboration, group work and peer learning; and finally, additional opportunities for formative assessment and continuous feedback were integrated into the teaching design. Each of these areas involved embedding digital pedagogy in some form.
To embed good workflow practices, the first five weeks of tutorials were designed as discrete mini projects. I researched online resources in which data scientists described their own workflow in a pragmatic way, and eventually decided on the approach used by Pat Schloss (Riffomonas), a data scientist working in genetics who uses a combination of R Studio with github for version control and collaborative work.
In the existing teaching approach to this module students were already required to sign up to github. Github is an online platform for hosting and sharing code which makes use of git, a version control system. Git and github are both widely used in industry, and skill acquisition in both is thus useful for professionalisation. However, git/github were previously used only as a method for distributing and grading assignments, with no explicit instruction provided, and students thus struggled both with motivation (why) and implementation (how).
I therefore decided to explicitly integrate git and github within the weekly workflow of tutorials. Within the Statistics I repository, each week I placed a tutorial folder which was further divided into sub-folders for code, data and results. When students forked the main repository they gained access to copies of these folders on their own system, which they could then work with in R Studio. In this way, I could model for students both the process needed to upload their own assignments and good practice in managing future data science projects (version control, separation of code from data from results, etc.).
[Figure: the Statistics I github repository]
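As a minimal sketch of this set-up, a student might fork and clone the tutorial repository from within R using the usethis package; the repository name and course account below are placeholders rather than the actual module repository.

```r
# Sketch only: "course-account/statistics-1" is a placeholder repository name,
# and this assumes the usethis package is installed and a github personal
# access token has been configured.
library(usethis)

# Fork the course repository to the student's own github account and clone
# the fork locally, ready to be opened as an R Studio project.
create_from_github("course-account/statistics-1", fork = TRUE)

# Each weekly tutorial folder then keeps code, data and results separate:
# tutorial-03/
#   code/      (R scripts and RMD write-ups)
#   data/      (raw data, never edited by hand)
#   results/   (figures and rendered output)
```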
As the term progressed students were required each week to update their github repositories, an iterative process which reinforced learning and helped students understand in an intuitive way how git/github works and, more importantly, why data scientists use it to keep track of projects and work collaboratively. Students were also continually required to interface git/github with R Studio, another important professional requirement, as much of data science involves getting different systems to interface with each other.
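As an illustration of that weekly cycle, the stage, commit and push steps can be driven directly from R, for example with the gert package; the file path and commit message below are placeholders, and students could equally use the git pane in R Studio or the command line.

```r
# Illustrative weekly update using the gert package; paths and messages
# are placeholders, not the actual assignment files.
library(gert)

git_add("tutorial-03/code/exercise.R")        # stage this week's work
git_commit("Add week 3 tutorial exercise")    # record a versioned snapshot
git_push()                                    # publish to the student's fork on github
```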
Further professionalisation was encouraged through use of R markdown (see below) and LaTeX, both forms of markup language used to communicate results, but which require some degree of expertise to use well. Different mini projects required students to write up their results in RMD or LaTeX.
Finally, an added benefit of using git/github with RMD concerned the availability of materials: tutorial guides, written in RMD, could be published to github and viewed by students as html web pages. This both demonstrated to students a viable method for communicating their own results and provided a stable platform for tutorial resources which was not limited by access to Blackboard.
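As a sketch of this step, an RMD tutorial guide or write-up can be rendered to a standalone html page with the rmarkdown package (the file and folder names below are placeholders), and the resulting html then committed to github and served, for example, through GitHub Pages.

```r
# Render an RMD tutorial guide to a self-contained html page; the paths are
# placeholders, and this assumes the rmarkdown package (and pandoc) is available.
library(rmarkdown)

render("tutorial-03/code/tutorial-guide.Rmd",
       output_format = "html_document",
       output_dir    = "tutorial-03/results")
```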
A typical requirement of professional data scientists is the ability to work collaboratively with colleagues, while the academic literature on teaching and learning has also identified peer learning as an important component of the student learning process.
In the first year of the programme, few structured opportunities for collaboration (synchronous or asynchronous) were offered to students as part of their learning, while spontaneous collaboration on problem sets resulted in more than one case of plagiarism. This could at least partially be explained by a lack of clear guidance to students on the extent to which collaboration was permitted, and perhaps by too great a focus on individual work.
I therefore decided to incorporate a group work element within the module: as students improved their practical skills across the first five weeks of the programme, short synchronous exercises involving collaboration were worked into the tutorials. Following reading week in the middle of term, the final six classes were structured around a single, large group project, the house price project.
This project involved students working in groups of four or five to find the best model to predict house prices, using a dataset taken from Practical Statistics for Data Scientists (Bruce, Bruce and Gedeck). I had previously found this textbook to be useful for students, as it presents the main concepts and models without being overly technical.
The house price project was designed to introduce students gradually to the fundamental concept of linear regression, with additional technical aspects added each week, from modelling non-linear relationships, to diagnosing non-normal errors and outliers. By taking a practical, applied approach to these concepts (which are often taught in a very abstract, mathematically intense manner), I tried to emphasise to students an intuitive understanding of regression, and to model data science to them as a practice, rather than a checklist of formal models and equations.
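A sketch of the kind of modelling cycle the project followed each week, using base R's lm(); the data frame and variable names here are placeholders rather than the exact columns of the textbook dataset.

```r
# Sketch of the weekly modelling cycle: fit, inspect, diagnose, refine.
# 'houses' and its columns (price, sqft, bedrooms) are placeholder names.
houses <- read.csv("house-price-project/data/house_prices.csv")

# Early weeks: a simple linear regression of price on floor area.
fit <- lm(price ~ sqft, data = houses)
summary(fit)

# Later weeks add complications: non-linear terms and further predictors...
fit2 <- lm(price ~ poly(sqft, 2) + bedrooms, data = houses)

# ...and diagnostics for non-normal errors and influential outliers.
plot(fit2, which = 1:2)   # residuals vs fitted values, normal Q-Q plot
```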
Group work was also an opportunity for embedding digital pedagogy: students were invited to find their own approach to working collaboratively, with many deciding to set up their own github repositories, while communication of results involved students deciding whether to use RMD or LaTeX. As previously in the module, weekly tutorial guides were provided in html and RMD formats, with data shared with students through github.
The shift to group work and practical, project-oriented pedagogy also enabled greater opportunity for formative assessment and continuous feedback. In terms of embedding digital pedagogy, this is perhaps one area where there remains room for improvement: students were encouraged to upload their completed work during synchronous teaching to a Blackboard discussion board. This approach saw mixed results: students were often reluctant to share their models this way, and so I will consider alternative approaches for next year’s module, including use of Vevox, which I briefly experimented with on another module later in the academic year, with positive results.
Nonetheless, the ability for students to work on problems during synchronous sessions and then discuss outcomes (rather than passively watch the instructor work through the problems) resulted in improved student engagement relative to the first year of the programme. In this respect, the doubling of tutorials in length from one to two hours allowed for more in-depth exploration of techniques; in practice, a single hour does not allow sufficient time for this kind of interactive teaching approach.
Another innovation I brought to this year’s module concerned asynchronous feedback paired with formative assessment, specifically in relation to the house price project. I do not have any authority within the module for setting summative assessment, and given the significant workload placed on students I had decided not to set any asynchronous formative tasks for tutorials. However, during the house price project I did ask students to try to complete whatever they were working on in class and to upload their group’s results each week to the Blackboard discussion board. I found this asynchronous approach worked much better than in-class use of the discussion board, and when I also committed to providing students with feedback on their uploaded material, engagement improved further.