Data Science 101 – is there a “consensus curriculum”?
Our previous posts in the Data Science Curriculum series have covered computing competencies and program outlines for data science majors. In our final post, we’ll explore the introduction to data science course.
Data science is still a young discipline
The earliest data science courses are only about 10 years old, which means that introductory data science courses vary widely from school to school. Unlike first programming or statistics courses, a true consensus curriculum for data science has yet to emerge. However, some topics are widely taught at the introductory level.
In Fall 2019, researchers at Creighton University sent a survey to mathematics, statistics, and computer science faculty asking them to indicate which of 34 topics were covered in their introductory data science course. Sixty-eight faculty responded and completed the topic ranking.
The most common topics listed were:
Topic | Proportion of courses |
Exploratory data analysis | 82% |
Data cleaning and wrangling | 75% |
Data ethics and responsible data use | 63% |
Data curation and data quality | 53% |
Linear and logistic regression | 53% |
Reproducible research | 51% |
Data lifecycle and data collection | 50% |
Research methods | 41% |
Data architecture, data types, and data formats | 40% |
Text mining | 40% |
Customizing data visualizations | 40% |
Supervised machine learning | 38% |
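For instructors who want to work with these figures directly, the table can be entered as a small data set. The following is a minimal sketch in Python, not part of the original study: the names `intro_coverage` and `topics_at_or_above` are hypothetical, and the 50% cutoff is an arbitrary choice for illustration.

```python
# Percent of intro courses covering each topic, copied from the table above.
intro_coverage = {
    "Exploratory data analysis": 82,
    "Data cleaning and wrangling": 75,
    "Data ethics and responsible data use": 63,
    "Data curation and data quality": 53,
    "Linear and logistic regression": 53,
    "Reproducible research": 51,
    "Data lifecycle and data collection": 50,
    "Research methods": 41,
    "Data architecture, data types, and data formats": 40,
    "Text mining": 40,
    "Customizing data visualizations": 40,
    "Supervised machine learning": 38,
}


def topics_at_or_above(coverage, threshold):
    """Return the topics whose reported coverage (in percent) meets the threshold."""
    return [topic for topic, pct in coverage.items() if pct >= threshold]


# Illustrative cutoff (an assumption): topics covered in at least half of the courses.
majority_topics = topics_at_or_above(intro_coverage, 50)

for topic in majority_topics:
    print(f"{topic}: {intro_coverage[topic]}%")
```

Changing the threshold gives a quick way to compare a planned syllabus against what other intro courses reported covering.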
Data exploration, data wrangling, and data ethics were the three most common topics in the intro data science course
Basic models like linear and logistic regression, the data lifecycle, and data types were also commonly covered. Supervised machine learning algorithms and applications like text mining and custom visualizations rounded out the most common topics.
Other topics were frequently reported as covered somewhere in the data science curriculum, but not necessarily in the introductory course. These include (with the percentage of respondents reporting coverage elsewhere):
- Linear algebra: matrix manipulation, eigenvalues, singularity (74% covered elsewhere)
- Traditional statistical inference: hypothesis tests, confidence intervals (66%)
- Relational and non-relational databases (59%)
- Experimental design, modeling, and planning (57%)
- Simulation-based inference: bootstrapping, randomization tests (53%)
- Optimization and numerical algorithms (53%)
- Systems engineering and software engineering principles (51%)
- Unsupervised machine learning (47%)
- Big data technologies: batch and parallel processing (46%)
- Supervised machine learning (41%)
- Cloud computing (41%)
Data science courses have continued to evolve since 2019, so some topics may be more or less important in your course. The Data Science Foundations zyBook covers all of the essential data science topics and more, allowing you to customize your curriculum. Please visit the How to Teach Data Science – zyBooks Guide for additional resources and best practices.
For more information, check out the original study: Aimee Schwab-McCoy, Catherine M. Baker & Rebecca E. Gasper (2021) Data Science in 2020: Computing, Curricula, and Challenges for the Next 10 Years, Journal of Statistics and Data Science Education, 29:sup1, S40-S50, DOI: 10.1080/10691898.2020.1851159