DATA 1030: Introduction to Topics in Data and Computational Science


Course Info

Lecture Time: TTh 9am-11:50am
Lecture Room: CIT Center (Thomas Watson CIT) 219
Instructor: Dan Potter
TAs: Ajay Balaji, Ashley Chen, Shivani Guturu and Nathaniel Ostrer
HTAs: Luke Zhu and Charlene Wang

Important Docs

Contact Info

Course Schedule

Open Hours

All TA hours are held on the 9th floor of the SciLi.

Course Description

Data Science is an exciting new discipline that has begun to emerge around data and computation. Traditionally, statistics has focused on understanding and modeling data, and computer science has focused on algorithms and efficient computation. Data scientists use both fields, as well as domain-specific knowledge, to gain insight, build models and design systems for making decisions that are in part data-driven.

DATA1030, Introduction to Topics in Data and Computational Science, is a double-credit course that contains three threads of instruction. The first is designed to give students hands-on experience with some of the applied statistical concepts and software tools that are essential for modern data science. Gathering data, data wrangling, exploratory data analysis, and machine learning are all part of this thread of the course.

The second thread is a review of many of the Algorithms and Data Structures that one would typically encounter in an Undergraduate Computer Science program. The approach here is also hands-on, with short presentations of theoretical material followed by many coding problems and questions. The goal here is two-fold. First, it ensures students are familiar with the algorithmic underpinnings of the computational and data systems used for Data Science at scale. Second, it develops the kind of programming skills that are commonly tested during technical interviews.

The third thread in the course is an introduction to what I like to call Data Systems. Data Systems organize data for efficient storage and/or computation. Relational Databases are a canonical example of a Data System. As they scale in size, Relational Databases typically rely on interconnected computing and storage systems. Issues such as parallelism and reliability are conveniently hidden under the hood. However, the performance tradeoffs one experiences when using a Relational Database are tied to the underlying data structures, algorithms and system latencies built into its design. Similar considerations affect all the more recent entrants into the Data Systems space, including NoSQL, GPUs and cloud computing.

By the end of the course, students will be comfortable creating and working with end-to-end data science pipelines. They should also feel confident in their ability to solve programming problems, and be able to demonstrate this ability to others. Finally, they should be comfortable working in at least one Data Systems computing environment and be able to make knowledgeable comparisons between different Data Systems computing environments.

Diversity and Inclusion

Our intent is that this course provide a welcoming environment for all students who satisfy the prerequisites. Our TAs have undergone training in diversity and inclusion, and all members of the CS community, including faculty and staff, are expected to treat one another in a professional manner. If you feel you have not been treated in a professional manner by any of the course staff, please contact any of Dan (the instructor), Ugur Cetintemel (Dept. Chair), Tom Doeppner (Vice Chair) or Laura Dobler (diversity and inclusion staff member). We will take all complaints about unprofessional behavior seriously. Your suggestions are encouraged and appreciated. Please let Dan know of ways to improve the effectiveness of the course for you personally, or for other students or student groups. To access student support services and resources, and to learn more about diversity and inclusion in CS, please visit

Prof. Krishnamurthi has good notes on this area.