Syllabus Overview
Health Data Analytics & AI
Prof. Javed Mostafa, University of Toronto
Profile: https://discover.research.utoronto.ca/54363-javed-mostafa
Email: dr.javedm@utoronto.ca
Teaching assistant 1: Nibras Ar Rakib, email: nibras.rakib@mail.utoronto.ca
Teaching assistant 2: JunHyuk Song, email: rruysong@skku.edu
Course website: https://it4.world/
SKKU International Summer Semester Schedule: https://summer.skku.edu/summer/program/Course_DATA.do
What diseases tend to occur more frequently in neighborhoods near chemical plants? How can an oncologist view and understand the long-term progression of cancer in one of her patients? The health analytics course will introduce students to a broad range of informatics and data analytics methods used to answer these kinds of questions. The course will focus on building students’ practical skills for processing and analyzing both structured and unstructured clinical data by taking them through the implementation of specific statistical methods and basic machine learning methods (applied AI techniques) to understand trends in individual and population health. By taking this course, students will become aware of the range of tools in a data analyst’s and applied AI expert’s toolbox and learn which ones to use to answer their own clinical or population health research questions.
Reading Materials
- Textbook: Berman, J. J. (2018). Methods in Medical Informatics: Fundamentals of Healthcare Programming in Perl, Python, and Ruby. CRC Press.
- Selected interactive chapters used in the course are hosted on a cloud platform (Google Colab). We will share the URLs of the scripts through the course website (https://it4.world/) on 30 June 2025, during the first interactive class session. Instructions for accessing a read-write version for the course will also be given during that first class.
- Resource: Click to view
- Lectures: Click to view
Course Requirements and Grading
- Class Participation – 35%
- Project Mock Presentation – 5%
- Group project
- Assignment 1 – 10%
- Assignment 2 – 15%
- Assignment 3 – 15%
- Project Completion & Presentation – 20%
Participation grades will account for forum posting and synchronous session engagement. Responding to other students’ questions in the forum will also positively impact your participation grade. Additional information on the project development and deliverables will be shared in the class.
A final grade of 60 or above is considered a Pass.
TA office location: Room 9B316 on Basement Level 3
Appendix A
Assignment 1: Hypothesis or a Research Question
The goal of Assignment 1 is to develop a research question (hypothesis) based on healthcare data and to understand and implement a method to address it. Review and discuss the first few cases (Chapters 19, 22, and 24) in Part IV of the textbook with your group. You can also consider a case based on the Chapter 5 dataset (please see Notes). These are examples of the kinds of questions you can answer with the tools and methods you are learning in this course. At this point, you don’t have to understand the code used to answer the question; just focus on getting a feel for the type and scope of the problem you’re expected to work on. Your first assignment as a group is to choose a dataset (associated with Chapters 19, 22, 24, or 5) and to define and contextualize a research question (hypothesis) that employs that dataset. For this assignment, you will:
- Identify a set of datasets you plan to work with for your project based on Chapters 19, 22, 24, or 5. Choose only one case as described in chapters 19, 22, 24, or 5. The files are available under the “Project Data” directory of the Google Drive: Click here.
- Chapter 19: Case Study: Emphysema Rates
- Mortality datasets (subset): mort1999us.dat.
- Mortality datasets (full set): MORT1999us.zip.
- Codebook for mortality dataset: CodebookForMort1999.pdf.
- ICD 10 datasets: each10.txt (Codebook instructions can be found in chapter-6 of the book).
- Chapter 22: Case Study: Ranking the Death-Certifying Process, by State
- Mortality datasets (subset): mort1999us.dat.
- Mortality datasets (full set): Mort1999us.zip.
- Codebook for mortality dataset: CodebookForMort1999.pdf.
- States information: cdc_states.txt.
- Chapter 24: Case Study: Sickle Cell Rates
- Mortality datasets (subset): mort1996us.dat, mort1999us.dat, mort2002us.dat, and mort2004us.dat.
- Mortality datasets (full set): MORT1996.zip, Mort1999us.zip, Mort2002us.zip, Mort2004us.zip.
- Codebook for mortality dataset: CodebookForMort1996.pdf, CodebookForMort1999.pdf, CodebookForMort2002.pdf, CodebookForMort2004.pdf.
- ICD 10 datasets: each10.txt (Codebook instructions can be found in chapter-6 of the book).
- Chapter 5_MeSH: Case Study: Nomenclature and Taxonomies
- Medical Subject Heading or MeSH: d2009.bin.
- In the book, if you look at Chapter-5, you will find the data file “d2009.bin” being used. Please study chapter 5 closely if you wish to understand the content of the d2009.bin data file.
- Please describe the dataset using the following criteria:
- Dataset name.
- Source of the dataset.
- Record count.
- Types of data (text, numeric or categorical).
- Describe a research question related to a medical or health sciences topic that could be answered using the dataset you have selected and the methods learned in this course. See Part IV (Chapter 19, 22, or 24) of the book for more examples.
- An example of a research question:
- “If alpha-1 antitrypsin disease mutations play a significant contributory role in the pathogenesis of emphysema in the general population.” *
- *Example taken from “Methods in Medical Informatics, Page 270”.
- Justify why your research question is an important one. Identify one or two relevant research articles. Provide a brief overview of relevant research and include references to the articles.
- An example of justification: On occasion, a difference in the rate of occurrence of tumors among races may be due to identifiable (and mutable) exposure to carcinogens, or living conditions. Sometimes, socioeconomic conditions account for the differences. Occasionally, the differences lie in genetic traits that are found more often in one race than in another, and this may lead to an intervention that modifies the effect of the trait on the development of cancer [1].
- Note the reference [1] after the justification. You must include a reference to at least one article in your Assignment 1. Example taken from “Methods in Medical Informatics, Page 275”.
Reference:
DeCroo, S., Kamboh, M.I., Ferrell, R.E. Population genetics of alpha-1-antitrypsin polymorphism in US whites, US blacks and African blacks. Hum Hered 41:215–221, 1991; Hutchison, D.C.S. Alpha-1-antitrypsin deficiency in Europe: Geographical distribution of Pi types S and Z. Resp Med 92:367–377, 1998.
Notes: We have also shared the MeSH dataset from Chapter 5. Students can also consider using the MeSH dataset for assignment 1. As Chapter 5 does not have any specific case studies similar to Chapters 19, 22, and 24, we are sharing some interesting ideas that can be used to develop a project using the MeSH dataset.
- Develop an interactive system to analyze and visualize the MeSH database.
- Create a search system to search the MeSH database, given a query.
- Create a search system that can accept a search term, map the term with MeSH, search PubMed using MeSH terms, and return the results.
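As a starting point for the search ideas above, here is a minimal sketch of a MeSH lookup. It assumes that d2009.bin follows the MeSH ASCII layout, in which each record starts with a `*NEWRECORD` line and carries `MH = ` (heading) and `MN = ` (tree number) fields. The function names and the sample records are illustrative only and are not taken from the course scripts.

```python
from collections import defaultdict

def parse_mesh(lines):
    """Map each MeSH heading (MH) to its list of tree numbers (MN)."""
    records = defaultdict(list)
    heading = None
    for raw in lines:
        line = raw.strip()
        if line == "*NEWRECORD":
            heading = None              # a new descriptor record begins
        elif line.startswith("MH = "):
            heading = line[5:]
            records[heading]            # register the heading even if no MN follows
        elif line.startswith("MN = ") and heading is not None:
            records[heading].append(line[5:])
    return dict(records)

def search_mesh(records, query):
    """Return headings that contain the query (case-insensitive), sorted."""
    q = query.lower()
    return sorted(h for h in records if q in h.lower())

# Tiny hypothetical sample standing in for d2009.bin.
sample = """*NEWRECORD
RECTYPE = D
MH = Emphysema
MN = C23.550.325

*NEWRECORD
RECTYPE = D
MH = Pulmonary Emphysema
MN = C08.381.495.389.750
""".splitlines()

mesh = parse_mesh(sample)
hits = search_mesh(mesh, "emphysema")
print(hits)
```

To try this on the real file, pass an open file handle to `parse_mesh` instead of the sample lines.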
Responses to this assignment should be no more than 2 pages in length, single spaced, excluding any references or appendices. Include one additional cover page at the top of the document mentioning the assignment number, name of the project, group members’ names, and submission date. One group member should submit the assignment in PDF via email to dr.javedm@utoronto.ca and nibras.rakib@mail.utoronto.ca, no later than 11:59 PM on Monday, July 7, 2025.
Assignment 2: Provide a Project Outline
Now that you have defined your research question or hypothesis, it is time to plan your approach to address it. Throughout the course you have learned many methods and tools for answering medical questions; in this assignment you will choose one or more of these methods and tools best suited to help you answer your question. What methods and tools you choose depends significantly on both your question and the dataset(s) you chose. For this assignment, follow the two steps below:
- What general approach do you plan to apply to your dataset to answer your identified question? Which methods and tools do you plan to employ? Why are these the appropriate methods and tools? By methods and tools, we mean the specific script or script algorithm that you will use to answer the research question. It should be chosen from the examples covered in class.
- At the beginning of your document, you should briefly restate your research question. This is also a chance to redefine your research question based on feedback to Assignment 1, if you choose to do so.
Responses to this assignment should be no more than 2 pages in length, single spaced, excluding any references or appendices. Include one additional cover page at the top of the document mentioning the assignment number, name of the project, group members’ names, and submission date. One group member should submit the assignment in PDF via email to dr.javedm@utoronto.ca and nibras.rakib@mail.utoronto.ca, no later than 11:59 PM on Wednesday, July 9, 2025.
Assignment 3: Script Algorithm
Based on your identified dataset, research question, and approach, you’ll develop a script algorithm to operationalize your plan. These should be formatted similarly to the script algorithms presented throughout the textbook.
- Consider each step needed to get from your raw dataset to the answer to your research question.
- Write the complete step-by-step script algorithm (pseudocode). (See p. 270 in the textbook for an example.)
- Write a brief introduction to restate your research question and approach developed in Assignments 1 and 2 and to contextualize your script algorithm.
- Write a brief conclusion describing the type of output you would expect from following the steps in the script algorithm, and the information you would get from this output to answer your research question.
Responses to this assignment should be no longer than 2 pages in length, single spaced, excluding references or appendices. Include one additional cover page at the top of the document mentioning the assignment number, name of the project, group members’ names, and submission date. One group member should submit the assignment in PDF via email to dr.javedm@utoronto.ca and nibras.rakib@mail.utoronto.ca, no later than 11:59 PM on Thursday, July 17, 2025.
Notes: Between the completion of assignment 3 and final presentation, you are expected to write code operationalizing your pseudocode (script algorithm).
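To illustrate what operationalizing a script algorithm can look like, here is a minimal sketch that counts records whose cause-of-death code starts with a given ICD-10 prefix. The column positions (`CODE_START`, `CODE_END`), the sample lines, and the function name are assumptions made for illustration. The real positions must be taken from the codebook for your mortality file.

```python
# Hypothetical fixed-width layout; the codebook defines the real positions.
CODE_START, CODE_END = 10, 14

def count_by_prefix(lines, prefix):
    """Step 1: read each record. Step 2: slice out the code field.
    Step 3: count the records whose code starts with the prefix."""
    total = matched = 0
    for line in lines:
        if len(line) < CODE_END:
            continue                    # skip blank or truncated lines
        total += 1
        code = line[CODE_START:CODE_END].strip()
        if code.startswith(prefix):
            matched += 1
    return matched, total

# Made-up records: a 10-character id followed by a 4-character code.
sample = [
    "0000000001J439 extra",
    "0000000002I219 extra",
    "0000000003J431 extra",
    "",
]
matched, total = count_by_prefix(sample, "J43")
rate = matched / total if total else 0.0
print(matched, total, round(rate, 3))
```

Each numbered step in the pseudocode maps to one or two lines of code, which makes it easy to check the program against the script algorithm submitted for Assignment 3.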
Appendix B
Assignment 4: Final presentation
Wednesday, July 23rd, 2025 at 9AM – 11AM
Your group will present on your research question, approach, and script algorithm. Each presentation should be a maximum of 15 minutes in length, with 2 minutes for questions. In your presentation, you should:
- Describe and contextualize your research question: why is this an important question to answer?
- Describe your dataset and the overall approach you chose to answer the question.
- Walk through your script algorithm to explain how you planned to operationalize your approach.
- Describe the results based on your operationalization of your script algorithm.
- An outline of the presentation and appropriate instructions for each slide has been provided: Click to view.
Weekly Class Schedule
Week 1: Monday June 30th – Thursday July 3rd
Monday (30 June): Lecture 1 & Lab 1
- Topics of the class: Introduction to health analytics and informatics. Review of syllabus, class requirements and parsing and transforming text files.
- Book Chapter: Chapter 1: Parsing and transforming text files.
- Steps you need to follow for this class:
- You must have a Google account.
- Instructions for creating Google account: Click to view
- Video instruction: Click to view
- You must download the data files to your computer.
- Instructions for downloading the data files: Click to view
- Video instruction: Click to view
- Click to download the data files
- You must extract the data files on your computer.
- Instructions for extracting the data files: Click to view
- Video instruction: Click to view
- You must upload the data files to the Google account.
- Instructions for uploading the data files to your Google account: Click to view
- Video instruction: Click to view
- You must run the script in your Google account.
- Instructions for running script in your Google account: Click to view
- Video instruction: Click to view
- Chapter 1 script URL: Click to view
- You can also save the script to your Google account.
- Instructions for saving the script in your Google account: Click to view
- Video instruction: Click to view
- You can also modify the script in your Google account.
- Instructions for modifying the script in your Google account: Click to view
- Video instruction: Click to view
Tuesday (1 July): Lecture 2 & Lab 2
- Topics of the class: Parsing and transforming text files.
- Book Chapter: Chapter 1: Parsing and transforming text files.
- Please follow steps 1 to 5 from the first class (30 June) before executing the script below (you do not need to do this again if you have done this once).
- Steps you need to follow for this class:
- You must run the script in your Google account.
- Chapter 1 script URL: Click to view
- You can also save the script to your Google account. Instructions for saving the script in your Google account: Click to view
- You can also modify the script in your Google account. Instructions for modifying the script in your Google account: Click to view
- Assignment 1 (Deadline: 7 July, 2025)
- Steps you need to follow for this assignment:
- You must read the instruction from Syllabus Overview -> Appendix A -> Assignment 1.
- Dataset for Assignment 1: Click to view
- Chapter 19, 22
- The Mort1999us.zip file contains the full dataset. If you want to use the subset of the data, please use mort1999us.dat file.
- Chapter 24
- The MORT1996.zip, Mort1999us.zip, Mort2002us.zip, and Mort2004us.zip files contain the full datasets. If you want to use a subset of the data, please use the mort1996us.dat, mort1999us.dat, mort2002us.dat, and mort2004us.dat files.
Wednesday (2 July): Lecture 3 & Lab 3
- Topics of the class: Utility scripts for specific tasks.
- Book Chapter: Chapter 2: Utility Scripts.
- Please follow steps 1 to 5 from the first class (30 June) before executing the script below (you do not need to do this again if you have done this once).
- Steps you need to follow for this class:
- You must run the script in your Google account.
- Chapter 2 script URL: Click to view
- You can also save the script to your Google account. Instructions for saving the script in your Google account: Click to view
- You can also modify the script in your Google account. Instructions for modifying the script in your Google account: Click to view
Thursday (3 July): Lecture 4 & Lab 4
- Topics of the class:
- Utility scripts for specific tasks.
- Ways to represent data in images.
- Book Chapter:
- Chapter 2: Utility scripts.
- Chapter 3: Viewing and modifying images.
- Please follow steps 1 to 5 from the first class (30 June) before executing the script below (you do not need to do this again if you have done this once).
- Steps you need to follow for this class:
- You must run the script in your Google account.
- Chapter 2 script URL: Click to view
- Chapter 3 script URL: Click to view
- You can also save the script to your Google account. Instructions for saving the script in your Google account:
- Chapter 2: Click to view
- Chapter 3: Click to view
- You can also modify the script in your Google account. Instructions for modifying the script in your Google account:
- Chapter 2: Click to view
- Chapter 3: Click to view
- Assignment 2 (Deadline: 9 July, 2025)
- Steps you need to follow for this assignment:
- You must read the instruction. Instruction for assignment 2: Syllabus Overview -> Appendix A -> Assignment 2.
Week 2: Monday July 7th – Thursday July 10th
Monday (7 July): Lecture 5 & Lab 5
- Topics of the class:
- Parsing and transforming text files.
- Utility scripts for specific tasks.
- Viewing and modifying images.
- Complete exercises.
- Book Chapter: Chapter 1, 2, and 3.
- Submit Assignment 1.
Tuesday (8 July): Lecture 6 & Lab 6
- Topics of the class: Indexing.
- Book Chapter: Chapter 4: Indexing text.
- Please follow steps 1 to 5 from the first class (30 June) before executing the script below (you do not need to do this again if you have done this once).
- Steps you need to follow for this class:
- You must run the script in your Google account.
- Chapter 4 script URL: Click to view
- You can also save the script to your Google account. Instructions for saving the script in your Google account: Click to view
- You can also modify the script in your Google account. Instructions for modifying the script in your Google account: Click to view
Wednesday (9 July): Lecture 7 & Lab 7
- Topics of the class:
- Indexing
- MeSH.
- Book Chapter:
- Chapter 4: Indexing text.
- Chapter 5: The National Library of Medicine’s Medical Subject Headings (MeSH).
- Please follow steps 1 to 5 from the first class (30 June) before executing the script below (you do not need to do this again if you have done this once).
- Steps you need to follow for this class:
- You must run the script in your Google account.
- Chapter 4 script URL: Click to view
- Chapter 5 script URL: Click to view
- You can also save the script to your Google account. Instructions for saving the script in your Google account:
- Chapter 4: Click to view
- Chapter 5: Click to view
- You can also modify the script in your Google account. Instructions for modifying the script in your Google account:
- Chapter 4: Click to view
- Chapter 5: Click to view
- Assignment 3 (Deadline: 17 July, 2025)
- Steps you need to follow for this assignment:
- You must read the instruction. Instruction for assignment 3: Syllabus Overview -> Appendix A -> Assignment 3.
- Assignment 4 (Deadline: 23 July, 2025)
- Steps you need to follow for this assignment:
- You must read the instruction. Instruction for assignment 4: Syllabus Overview -> Appendix B -> Assignment 4.
- Submit Assignment 2
Thursday (10 July): Lecture 8 & Lab 8
- Topics of the class:
- MeSH
- Controlled Vocabularies (ICD).
- Book Chapter:
- Chapter 5: The National Library of Medicine’s Medical Subject Headings (MeSH).
- Chapter 6: The International Classification of Diseases.
- Please follow steps 1 to 5 from the first class (30 June) before executing the script below (you do not need to do this again if you have done this once).
- Steps you need to follow for this class:
- You must run the script in your Google account.
- Chapter 5 script URL: Click to view
- Chapter 6 script URL: Click to view
- You can also save the script to your Google account. Instructions for saving the script in your Google account:
- Chapter 5: Click to view
- Chapter 6: Click to view
- You can also modify the script in your Google account. Instructions for modifying the script in your Google account:
- Chapter 5: Click to view
- Chapter 6: Click to view
Week 3: Monday July 14th – Thursday July 17th
Monday (14 July): Lecture 9 & Lab 9
- Topics of the class:
- Controlled Vocabularies (ICD)
- NLM.
- Book Chapter:
- Chapter 6: The International Classification of Diseases.
- Chapter 9: PubMed.
- Please follow steps 1 to 5 from the first class (30 June) before executing the script below (you do not need to do this again if you have done this once).
- Steps you need to follow for this class:
- You must run the script in your Google account.
- Chapter 6 script URL: Click to view
- Chapter 9 script URL: Click to view
- You can also save the script to your Google account. Instructions for saving the script in your Google account:
- Chapter 6: Click to view
- Chapter 9: Click to view
- You can also modify the script in your Google account. Instructions for modifying the script in your Google account:
- Chapter 6: Click to view
- Chapter 9: Click to view
Tuesday (15 July): Lecture 10 & Lab 10
- Topics of the class: Working with External Data: API access to PubMed and local data extracted from PubMed.
- Book Chapter:
- Chapter 9: PubMed.
- Please follow steps 1 to 5 from the first class (30 June) before executing the script below (you do not need to do this again if you have done this once).
- Steps you need to follow for this class:
- You must run the script in your Google account.
- Chapter 9 script URL: Click to view
- You can also save the script to your Google account. Instructions for saving the script in your Google account:
- Chapter 9: Click to view
- You can also modify the script in your Google account. Instructions for modifying the script in your Google account:
- Chapter 9: Click to view
Wednesday (16 July): Lecture 11 & Lab 11
- Topics of the class: Leveraging Taxonomies and Classification Schemes.
- Book Chapter: Chapter 10: Taxonomy.
- Please follow steps 1 to 5 from the first class (30 June) before executing the script below (you do not need to do this again if you have done this once).
- Steps you need to follow for this class:
- You must run the script in your Google account.
- Chapter 10 script URL: Click to view
- You can also save the script to your Google account. Instructions for saving the script in your Google account:
- Chapter 10: Click to view
- You can also modify the script in your Google account. Instructions for modifying the script in your Google account:
- Chapter 10: Click to view
Thursday (17 July): Lecture 12 & Lab 12
- Topics of the class:
- Leveraging Taxonomies and Classification Schemes
- Structuring Data and Manipulating Structured Data
- Group Project Progress
- Mock Demos.
- Book Chapter:
- Chapter 10: Taxonomy.
- Chapter 18: Describing Data with Data Using XML
- Please follow steps 1 to 5 from the first class (30 June) before executing the script below (you do not need to do this again if you have done this once).
- Steps you need to follow for this class:
- You must run the script in your Google account.
- Chapter 10 script URL: Click to view
- Chapter 18 script URL: Click to view
- You can also save the script to your Google account. Instructions for saving the script in your Google account:
- Chapter 10: Click to view
- Chapter 18: Click to view
- You can also modify the script in your Google account. Instructions for modifying the script in your Google account:
- Chapter 10: Click to view
- Chapter 18: Click to view
- Submit Assignment 3
Week 4: Monday July 21st – Thursday July 24th
Monday (21 July): Lecture 13 & Lab 13
- Topics of the class:
- Structuring Data and Manipulating Structured Data.
- Book Chapter:
- Chapter 18: Describing Data with Data Using XML
- Please follow steps 1 to 5 from the first class (30 June) before executing the script below (you do not need to do this again if you have done this once).
- Steps you need to follow for this class:
- You must run the script in your Google account.
- Chapter 18 script URL: Click to view
- You can also save the script to your Google account. Instructions for saving the script in your Google account:
- Chapter 18: Click to view
- You can also modify the script in your Google account. Instructions for modifying the script in your Google account:
- Chapter 18: Click to view
- ETTA
Tuesday (22 July): Lecture 14 & Lab 14
- Topics of the class:
- Structuring Data and Manipulating Structured Data.
- Mining Information.
- Book Chapter:
- Chapter 18: Describing Data with Data Using XML.
- Chapter 14: Autocoding.
- Please follow steps 1 to 5 from the first class (30 June) before executing the script below (you do not need to do this again if you have done this once).
- Steps you need to follow for this class:
- You must run the script in your Google account.
- Chapter 18 script URL: Click to view
- Chapter 14 script URL: Click to view
- You can also save the script to your Google account. Instructions for saving the script in your Google account:
- Chapter 18: Click to view
- Chapter 14: Click to view
- You can also modify the script in your Google account. Instructions for modifying the script in your Google account:
- Chapter 18: Click to view
- Chapter 14: Click to view
Wednesday (23 July): Lecture 15
- Topics of the class: Project Presentations.
- Submit Assignment 4
Thursday (24 July): Lecture 16
- Topics of the class: Class Wrap-up.
SQLite
- For this course, we are going to use the SQLite database. To connect to a SQLite database, enter and revise data, and retrieve results, follow the instructions below.
- Instructions for Windows 11: Click to view
- Instructions for macOS: Click to view
- Bulk CSV file:
- Click to download
- Please review the Chapter 5 script (section 5.6 (Bulk Import – Downloading Article Information from PubMed)) to understand the preparation of the CSV file.
- Chapter 5 Script URL: Click to view
- Instructions to import bulk CSV file into the SQLite:
- Instructions for Windows 11: Click to view
- Instructions for macOS: Click to view
- Official SQLite Documentation: Click to view
- Structured query language as understood by SQLite: Click to view
- Built-in aggregate functions for SQLite: Click to view
- Date and time functions: Click to view
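The workflow above (connect, bulk import a CSV file, revise data, and query with aggregate and date functions) can be sketched with Python's built-in `sqlite3` and `csv` modules. This is only an illustrative sketch: the `articles` table, its columns, and the sample rows are assumptions, and the course's bulk CSV file may use different columns.

```python
import csv
import io
import sqlite3

# Hypothetical CSV content standing in for the bulk CSV file.
csv_text = """pmid,title,pub_date
1,Emphysema rates,2019-03-01
2,Sickle cell trends,2019-11-15
3,MeSH indexing,2020-06-30
"""

conn = sqlite3.connect(":memory:")      # use a file path for a persistent database
conn.execute(
    "CREATE TABLE articles (pmid INTEGER PRIMARY KEY, title TEXT, pub_date TEXT)"
)

# Bulk import: convert each CSV row to a tuple and insert all rows at once.
reader = csv.DictReader(io.StringIO(csv_text))
rows = [(int(r["pmid"]), r["title"], r["pub_date"]) for r in reader]
conn.executemany("INSERT INTO articles VALUES (?, ?, ?)", rows)
conn.commit()

# Revise one record.
conn.execute(
    "UPDATE articles SET title = ? WHERE pmid = ?", ("MeSH indexing basics", 3)
)
conn.commit()

# Aggregate function (COUNT) combined with a date function (strftime).
per_year = conn.execute(
    "SELECT strftime('%Y', pub_date) AS year, COUNT(*) "
    "FROM articles GROUP BY year ORDER BY year"
).fetchall()
revised_title = conn.execute(
    "SELECT title FROM articles WHERE pmid = 3"
).fetchone()[0]
conn.close()
print(per_year, revised_title)
```

The `?` placeholders let SQLite handle quoting of the values, which is safer than building SQL strings by hand.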
GraphViz
In this course, we are going to use graph visualization software called “Graphviz”. Graphviz is widely used in healthcare, bioinformatics, networking, software engineering, and machine learning. For details, please visit: https://graphviz.org/.
- The script below uses the Python implementation of Graphviz.
- To use the script, first download the term-term similarity matrix (a CSV file) from the URL below:
- To understand how we populated the CSV file, please go to the “TF-IDF tab” and review the “Additional Script”.
- Term-term similarity matrix CSV file: Click to download
- Upload the term-term similarity matrix CSV file into the Google Drive (“My Drive/HealthDataAnalyticsData/” folder)
- Execute the script below:
- Graph viz python implementation: Click to view
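To give a feel for how a term-term similarity matrix becomes a graph, here is a minimal sketch that writes Graphviz DOT text using only the Python standard library. It is not the course script. It assumes that the first row and first column of the CSV hold the terms, that rows and columns are in the same order, and that the matrix is symmetric. The sample values and the `threshold` parameter are hypothetical.

```python
import csv
import io

# Hypothetical matrix; the first row and first column hold the terms.
csv_text = """term,asthma,emphysema,smoking
asthma,1.0,0.4,0.2
emphysema,0.4,1.0,0.7
smoking,0.2,0.7,1.0
"""

def matrix_to_dot(csv_file, threshold=0.3):
    """Return DOT text with one edge per term pair at or above the threshold."""
    reader = csv.reader(csv_file)
    terms = next(reader)[1:]                    # header row minus the corner cell
    rows = [row for row in reader if row]       # ignore blank lines
    lines = ["graph terms {"]
    for i, row in enumerate(rows):
        source = row[0]
        # Upper triangle only, so each undirected pair is emitted once.
        for j in range(i + 1, len(terms)):
            weight = float(row[j + 1])
            if weight >= threshold:
                lines.append(
                    f'  "{source}" -- "{terms[j]}" [label="{weight:.2f}"];'
                )
    lines.append("}")
    return "\n".join(lines)

dot = matrix_to_dot(io.StringIO(csv_text))
print(dot)
```

Saving the output to a file such as `terms.dot` and running `dot -Tpng terms.dot -o terms.png` renders the graph, provided Graphviz is installed.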
- To use the Graphviz software, we are going to follow the steps below:
- Download the grammar file (also known as scripting file) for defining Graphviz nodes, edges, and graphs.
- Click to download
- Please review Chapter-18 for details for generating grammar file from RDF schema.
- Chapter 18 script URL: Click to view
- Upload the grammar file to Graphviz.
- Instructions for Windows 11: Click to view
- Instructions for macOS: Click to view
- Visualize the RDF schema with Graphviz.
- Instructions for Windows 11: Click to view
- Instructions for macOS: Click to view
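For orientation, a grammar file of this kind is plain text in the Graphviz DOT language. The sketch below shows one way such a file could be produced from subject-predicate-object triples. The triples and the function name are hypothetical, and the Chapter 18 script may build its file differently.

```python
def triples_to_dot(triples, name="schema"):
    """Return DOT text for a directed graph with one labeled edge per triple."""
    lines = [f"digraph {name} {{"]
    for subject, predicate, obj in triples:
        lines.append(f'  "{subject}" -> "{obj}" [label="{predicate}"];')
    lines.append("}")
    return "\n".join(lines)

# Made-up RDF schema statements used only for illustration.
triples = [
    ("Patient", "rdfs:subClassOf", "Person"),
    ("hasDiagnosis", "rdfs:domain", "Patient"),
    ("hasDiagnosis", "rdfs:range", "Diagnosis"),
]
grammar = triples_to_dot(triples)
print(grammar)
```

Nodes are created implicitly by the edges that mention them, so the file only needs to list the edges.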
TF-IDF
The script below contains the following program.
- Term frequency and inverse document frequency
Additional Script URL: Click to view
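As a rough illustration of the idea, here is a minimal TF-IDF sketch using only the Python standard library. It assumes one common variant, with tf as the term count divided by the document length and idf as ln(N / document frequency). The course script may use a different weighting, and the sample documents are made up.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document.
    tf = term count / document length; idf = ln(N / document frequency)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

docs = [
    "emphysema smoking lung",
    "sickle cell anemia",
    "lung cancer smoking",
]
scores = tf_idf(docs)
for term, weight in sorted(scores[0].items(), key=lambda kv: -kv[1]):
    print(f"{term}: {weight:.3f}")
```

In the first document, "emphysema" receives a higher weight than "smoking" because it appears in only one of the three documents.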