How to Teach Data Science – zyBooks Guide
I’ve thought about how to teach data science – a lot.
Six years ago I was a statistics professor at a Midwestern university when my dean asked me to create a data science class from scratch. That was a tough challenge, because at the time there wasn’t a clear consensus on what a data science course should even look like.
And you know what? There still isn’t.
Data science is such a new field that the pedagogy isn’t yet etched in granite. Instructors often shift into this subject from other disciplines, like I did, and have to figure it out on their own (again, like I did). I joined zyBooks to help solve this problem, and create the data science textbook I wish I had back at school.
Although the discipline is still evolving, through my own experience, research and talking with instructors across the country, I’ve developed the following set of best practices that are invaluable in helping students master this subject – best practices you can put into action in your own classroom right away.
Best Practices for Teaching Data Science
- Key topics for a data science course
- Use a variety of real-life datasets
- Communication skills are essential – teach ‘em
- Lecturing isn’t enough
- Coding is key…
- …but data science is not just programming
- Encourage good coding practices
- Use the “data science lifecycle” in your classroom
- Meet students where they are (I’ll explain)
- Use frequent, meaningful assessments
- Bonus: Teaching data science with zyBooks
1. Key topics for a data science course
First, a not-so-rhetorical question: What is data science, anyway? Think of the discipline as the entire process of working with a dataset and extracting meaningful insights from it. While no consensus curriculum yet exists, you’ll want to cover the following foundational topics in your classes:
Data Wrangling
How do you manipulate or structure a dataset?
Data Visualization
How do you “see” a dataset to understand the relationships within?
Modeling Data
How do you make meaningful insights or describe relationships between features in your dataset?
Data Wrangling
This is the process of manipulating the structure and formatting of a dataset to answer a particular research question. Show how:
- Datasets may need to be “tidied” into a row-column format
- Features or variables in a dataset may need to be combined or split
- New features might be calculated based on existing features
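The three wrangling moves above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed approach: it uses only the standard library, and the table of city temperatures, the column names and the `tidy` helper are all made up for the example. In a real course you would more likely reach for a data-frame library.

```python
# Hypothetical "wide" dataset: one row per city, one column per month.
wide_rows = [
    {"location": "Dayton, OH", "jan_temp_f": 30.0, "jul_temp_f": 75.0},
    {"location": "Austin, TX", "jan_temp_f": 52.0, "jul_temp_f": 86.0},
]


def tidy(rows):
    """Reshape wide rows into tidy rows: one observation per row."""
    tidy_rows = []
    for row in rows:
        # Split one feature ("City, ST") into two features.
        city, state = [part.strip() for part in row["location"].split(",")]
        for column in ("jan_temp_f", "jul_temp_f"):
            month = column.split("_")[0]
            temp_f = row[column]
            tidy_rows.append({
                "city": city,
                "state": state,
                "month": month,
                "temp_f": temp_f,
                # Calculate a new feature from an existing one.
                "temp_c": round((temp_f - 32) * 5 / 9, 1),
            })
    return tidy_rows


tidy_rows = tidy(wide_rows)
for row in tidy_rows:
    print(row)
```

Each output row now holds exactly one city-month observation, which is the shape most plotting and modeling tools expect.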
Data Visualization
A picture is worth a thousand words, and so is a plot! Creating static and dynamic data visualizations to explore and describe relationships in a dataset is the baguette and butter of a data scientist’s work. Demonstrate how:
- Data visualizations should be clearly formatted and accessible
- Good data visualizations don’t try to show too much. Simpler can be better!
Modeling Data
Models are algorithmic or mathematical tools for making predictions and describing relationships in a dataset. Explain:
- How models in data science come from machine learning, statistics, and artificial intelligence
- How to use a model, and when each model is appropriate
- How to evaluate a model, and choose the best model from a series of options
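One way to make “evaluate a model and choose the best one” concrete is to fit two candidate models on training data and compare their errors on data the models have not seen. The sketch below assumes a made-up dataset of study hours and exam scores, a deliberately simple hold-out split (the last two observations), and hypothetical helper names; it uses only the Python standard library.

```python
# Made-up data: hours studied (x) and exam score (y).
hours = [1, 2, 3, 4, 5, 6, 7, 8]
scores = [52, 55, 61, 64, 70, 73, 79, 82]

# Hold out the last two observations for evaluation.
train_x, test_x = hours[:6], hours[6:]
train_y, test_y = scores[:6], scores[6:]


def fit_mean(xs, ys):
    """Baseline model: always predict the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Simple linear regression fitted by ordinary least squares."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def mean_squared_error(model, xs, ys):
    """Average squared gap between predictions and observed values."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


candidates = {
    "mean baseline": fit_mean(train_x, train_y),
    "linear regression": fit_line(train_x, train_y),
}
test_errors = {
    name: mean_squared_error(model, test_x, test_y)
    for name, model in candidates.items()
}
best_model = min(test_errors, key=test_errors.get)
print(test_errors)
print("Best model on held-out data:", best_model)
```

Comparing against a simple baseline is a useful habit to teach: a model is only interesting if it beats the “always guess the average” strategy on held-out data.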
Programming
Programming is an essential tool in data science. Languages like Python and R are important for data wrangling, data visualization, and modeling data. These languages are also used to put data science models “into production” so that companies can make real-time decisions based on incoming information. Students with a computer science background may need less programming instruction than students without coding experience.
Consider your students’ interests and future goals when planning the sequence of topics for your course. For a single course in data science, for example, you may want to focus just on data wrangling and visualization, or on an overview of data science models.
2. Use a variety of real-life datasets
Just say no to boring datasets!
Data science is being applied to basically everything now – business, medicine, social sciences, sports, and on and on – so we’re surrounded by fascinating datasets.
Your students come from a variety of backgrounds and disciplines, so give them a wide range of datasets they can really sink their molars into. They’ll enjoy a richer experience in your class, and gain hands-on experience with real-life challenges that will be invaluable for their future careers.
It’s easy to dig up cool datasets. Try these three free repositories, for starters:
Free Collections of Fascinating Datasets
Tidy Tuesday
New datasets from a wide range of sources are added every Tuesday, including datasets on Bigfoot sightings, cosmetic brands and Bob Ross paintings.
Kaggle
Open-source datasets and dataset competitions with prizes. Datasets available include the most streamed songs of all time on Spotify; brain tumor images; and fast-fashion eco-data.
UC Irvine Machine Learning Repository
Over 600 datasets and counting, including datasets on income predictions from census data; landmine detection; and one of the first datasets ever, on irises (nearly 90 years old!).
3. Communication skills are essential – teach ‘em
While data scientists are technical experts, it’s also critical that they effectively communicate their findings to a variety of audiences. So it’s imperative to teach communication skills to your students.
Writing assignments, labs and term projects are excellent opportunities to practice written and verbal abilities. And tools like Jupyter Notebooks, Google Colab and RMarkdown give students the chance to write code and describe their findings in a single document.
Video: “Explain it like I’m your grandma”
Dr. Schwab-McCoy breaks down how to develop communication skills in the data science classroom.
4. Lecturing isn’t enough
Data science is a discipline where you learn by doing. Simple as that. So just lecturing won’t work. Instead, you’ll want to join your students on their learning journey.
What do I mean by that?
Hold live classroom demonstrations where you write code, execute it, and interpret the data. Don’t be afraid to make mistakes – show your students that you’re learning by doing just like them. And engage them by assigning lab activities and short coding exercises they complete during class. Keep it interactive and dynamic, ask loads of questions, and prompt discussion.
5. Coding is key…
As I mentioned earlier, programming languages like Python and R are crucial to data science. Giving your students well-documented sample code or template analyses is a great way to get them started. And this is where live coding during lectures becomes so important – show your students that programming isn’t always straightforward and mistakes are the name of the game (and okay!).
6. …but data science is not just programming
Coding is only part of the story. Unless we really understand what the features in our data are representing, and what we’re trying to learn, the code might not tell us anything. We can create a cool graph with Python. But if it doesn’t address the research question, what’s the point?
Another heads up: Avoid introducing too much code too quickly so you don’t lose sight of the bigger picture, and lose your students at the same time.
Video: How to get students engaged from Day One
Dr. Schwab-McCoy shares her approach to getting – and keeping – students engaged.
7. Encourage good coding practices
Emphasizing good coding practices in the classroom can be tricky, but there’s an immediate payoff for both instructors and students: It’s easier for you to review and grade nicely formatted code, and also easier for students to share and review it with their peers.
How to get there?
- Set expectations of code formatting through your own examples
- Teach students how to write meaningful, well-documented comments in their code
- Share a style guide that outlines things like formatting and naming conventions
- Remind students that in the real world, code doesn’t just have to work; it must be readable and accessible to your coworkers, managers and other stakeholders
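As an illustration of the practices listed above, here is the kind of short, readable example you might show in class. It is only a sketch: the function name, the step-count data and the choice to raise an error on an all-missing column are assumptions made for the example, not rules from any particular style guide.

```python
def mean_without_missing(values):
    """Return the mean of `values`, skipping missing entries (None).

    Raises ValueError if every entry is missing, so the caller
    notices an empty column instead of silently getting 0.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to average")
    return sum(observed) / len(observed)


# Descriptive names make the analysis readable without extra comments.
daily_steps = [4200, None, 6100, 5300]
average_daily_steps = mean_without_missing(daily_steps)
print(f"Average daily steps: {average_daily_steps:.0f}")
```

The docstring says what the function does and why it fails loudly, and the variable names carry the meaning of the data, so a grader or classmate can follow the analysis without running it.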
8. Use the “data science lifecycle” in your classroom
Data scientists approach a research problem through a deliberate five-step process, otherwise known as the data science lifecycle. Use it as a model for critical thinking in your classroom. Assignments, projects and discussion examples should all emulate these five steps, to help your students learn how to effectively conceptualize and analyze complex data.
Five Steps of the Data Science Lifecycle
Gathering data
Identify what data is available and relevant, and collect new data if necessary
Cleaning data
Reformat datasets, create new features, and address unusual or missing values
Exploring data
Create visualizations, calculate descriptive statistics, and identify possible relationships
Modeling data
Use statistical or algorithmic techniques to make predictions or measure relationships
Interpreting data
Describe conclusions and make recommendations
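To show students how the five steps fit together, it can help to walk through a tiny end-to-end analysis where each step is its own function. The sketch below is one possible classroom example, not a standard template: the ad-spend data is made up, the function names are hypothetical, and the model is a plain least-squares line written with the Python standard library.

```python
from statistics import mean


def gather():
    """Step 1: gather data (here, a made-up in-memory table)."""
    return [
        {"ad_spend": "10", "sales": "25"},
        {"ad_spend": "20", "sales": "45"},
        {"ad_spend": "", "sales": "50"},  # missing value
        {"ad_spend": "30", "sales": "65"},
        {"ad_spend": "40", "sales": "85"},
    ]


def clean(raw_rows):
    """Step 2: drop rows with missing values and convert text to numbers."""
    return [
        (float(r["ad_spend"]), float(r["sales"]))
        for r in raw_rows
        if r["ad_spend"] != "" and r["sales"] != ""
    ]


def explore(rows):
    """Step 3: calculate descriptive statistics."""
    return {
        "n": len(rows),
        "mean_ad_spend": mean(x for x, _ in rows),
        "mean_sales": mean(y for _, y in rows),
    }


def model(rows, summary):
    """Step 4: fit a least-squares line, sales = intercept + slope * ad_spend."""
    mx, my = summary["mean_ad_spend"], summary["mean_sales"]
    slope = (
        sum((x - mx) * (y - my) for x, y in rows)
        / sum((x - mx) ** 2 for x, _ in rows)
    )
    return {"slope": slope, "intercept": my - slope * mx}


def interpret(fit):
    """Step 5: state the conclusion in plain language."""
    return (
        "In this sample, each extra unit of ad spend is associated "
        f"with about {fit['slope']:.1f} more units of sales."
    )


raw = gather()
rows = clean(raw)
summary = explore(rows)
fit = model(rows, summary)
conclusion = interpret(fit)
print(conclusion)
```

Keeping one function per step makes it easy to assign, discuss or grade each stage of the lifecycle separately.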
Video: Reinforcing the Data Science Lifecycle
Dr. Schwab-McCoy explains how she emphasizes this pivotal concept in the classroom.
9. Meet students where they are
Some data science courses require programming or statistics courses as a prerequisite. Others don’t. No one pathway into data science exists; no one curriculum does either.
(For example, we offer three different versions of our data science zyBooks.)
Students come to data science from a wide range of disciplines, so it’s really important to understand their background to tailor your course to where they are. For example, in a class where students have taken introductory statistics but haven’t done much programming, you might want to spend more time at the beginning on the ins and outs of, say, Python. But if students are already familiar with coding, you can jump right into more advanced aspects of data visualization and modeling.
10. Use frequent, meaningful assessments
For data science, frequent assessment is the golden rule.
Small, meaningful assessments help your students build up their knowledge and confidence bit by bit by bit. Weekly coding tasks, in-class labs and ongoing projects are more effective ways to assess critical data science thinking than quizzes or exams, and they keep students from getting tripped up by one huge final exam.
Grading all these assessments can be a big challenge, of course. Relying on Jupyter Notebooks, Google Colab or RMarkdown can help. Since code runs live in these environments, they give you a quick check on the quality of each submission.
Remember, think “small, constant checkpoints.” Assessments can be as simple as filling in the blanks in a Jupyter notebook, running an analysis of code you’ve given them, or writing a short interpretation of what they’re finding. All this will help students stay on track, and help you gauge the pace of class, and adjust as needed as the term progresses.
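A fill-in-the-blank checkpoint like the ones described above might look like the following sketch. Everything here is hypothetical: the `median` exercise, the `run_checkpoint` helper and the test cases are invented for illustration, and the blank has been filled in so the example runs as written.

```python
def median(values):
    """Student exercise: return the median of a non-empty list of numbers."""
    ordered = sorted(values)
    middle = len(ordered) // 2
    # --- In the student version, the lines below are left blank. ---
    if len(ordered) % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


def run_checkpoint(func):
    """Run a few quick checks and return (passed, total)."""
    cases = [
        ([3, 1, 2], 2),
        ([4, 1, 3, 2], 2.5),
        ([7], 7),
    ]
    passed = sum(1 for values, expected in cases if func(values) == expected)
    return passed, len(cases)


passed, total = run_checkpoint(median)
print(f"Checkpoint: {passed}/{total} checks passed")
```

Students get immediate feedback from the pass count, and you can scan the same number across a class to see whether it is time to slow down or move on.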
11. BONUS: Teaching data science with zyBooks
Because data science is so dynamic, requiring coding and live investigation of datasets, I feel that interactive, web-native zyBooks are the ideal format for studying this discipline. So much so that I helped create the groundbreaking Data Science Foundations zyBooks series!
These books cover the entire range of real-life tasks that data scientists might face in their daily practice. Here are quick tips to get the most out of them:
- zyBooks are great for active learning, so assign reading and Participation Activities before class to increase student accountability and identify points to cover during lecture
- Use built-in Jupyter notebooks as a starting point. Datasets and sample code can be expanded on in class or as homework. Data Science Foundations uses real datasets, which can be downloaded from the appendix; feel free to augment with your own
- Challenge Activities and zyLabs are great to use as homework assignments, or as precursors to your own assignments
- Programming is an important part of data science, but not the only part, of course. Data Science Foundations builds conceptual understanding before diving into programming
Video: How to Teach Data Science with zyBooks
Dr. Schwab-McCoy walks through best practices for teaching data science with zyBooks.