CS 5854: Computational Systems Biology

CS 5854: Computational Systems Biology

Spring 2023, 4pm-5:15pm, Mondays and Wednesdays

Mailing list: class-cs-5854-20668-202301-g@vt.edu

About the Course

  • What is Computational Systems Biology?
  • What is the focus of this course?
  • Who should take this course?
  • Pre-requisites
  • Introductory Videos
  • Course structure
  • What is Computational Systems Biology?

    Cells, tissues, organs and organisms are systems of components whose interactions have been defined, refined, and optimised over hundreds of millions of years of evolution. Computational systems biology is a field that aims at a system-level understanding of biological systems by analysing biological data using computational techniques. Systems biology aims to answer the following key questions by integrating experimental and computational approaches:

    1. What are the basic structures and properties of the biological networks in a living cell?
    2. How does a biological system behave over time under various conditions?
    3. How does a biological system maintain its robustness and stability?
    4. How can we modify or construct biological systems to achieve desired properties?
    Answers to these questions require breakthroughs in the fields of biology, chemistry, computer science, engineering, mathematics and other fields together with an evolution of our educational structures. The explosive progress of genome sequencing projects and the massive amounts of data that high-throughput experiments in DNA microarrays, proteomics, and metabolomics yield drive advances in this field. Sophisticated computational ideas process these data sources in an effort to systematically analyse and unravel the complex biological phenomena that take place in a cell.

    What is the focus of this course?

    As mentioned above, cell and molecular biology are awash in data. Over the last 20 years, numerous computational methods have been developed that can assimilate and integrate massive quantities of data in order to find hidden patterns in them that may contain useful biological information. Of late, single-cell technologies have come to the fore. This iteration of the course will stress computational methods to analyse these data, especially single-cell RNA-seq measurements. We will discuss pseudotime computation, clustering of cells and discovery of cell types, gene regulatory networks, and integrated analyses of multiple datasets.

    Who should take this course?

    You should take this course if you are curious to find out how the latest research is shaping our understanding of how the living cell behaves as a system. The course will cover the latest research in computational systems biology, primarily in the context of molecular interaction networks. We will spend a significant part of the course on examining how the analysis of single-cell datasets and other high-throughput data is crucial to progress in this area. The course is geared towards graduate students whose main research interest is bioinformatics or who use bioinformatic tools and techniques in their research.

    There are many exciting and profound issues that researchers in this area are actively investigating, such as the robustness of biological systems, network structures and dynamics, and applications to drug discovery. During this course, we will come across many interesting open research problems. Taking this course might be an excellent way to create research topics and projects for your Master's or Ph.D. thesis in the area of bioinformatics/computational biology. In this course, you will be able to communicate and work with students and researchers with varied backgrounds. In addition, Virginia Tech is humming with research activities in this area.

    Pre-requisites

    The course is open to students with graduate standing. I hope that both students with computational backgrounds and students with experience in the life sciences will take this course. If you find this course interesting but are not sure whether your background matches the pre-requisites, please talk to me.

    Computer Science graduate students: the Data and Algorithm Analysis (CS 4104) or similar course is a pre-requisite. It will help if you also have taken Algorithms in Bioinformatics (CS 5124) and a course on combinatorics and graph theory such as Applied Combinatorics (MATH 3134). An introductory molecular biology course such as Biological Paradigms for Bioinformatics will provide extremely useful biological background.

    Life science graduate students: I expect that you have taken courses in biochemistry, cell biology, and molecular biology. A course like Computation for Life Sciences (CS 5045) provides very useful computational background.

    Introductory Videos

    For students with computational backgrounds, I have listed some videos below that provide introductions into molecular and cell biology.

    • The Cell (7:21 min): an overview of cell structure from Nucleus Medical Media

    Course structure

    The course will primarily be driven by lectures and by seminars where one or more students present a related group of papers from literature. I will try to arrange papers that cover both biological and computational aspects. Ideally, I would like a group to contain students with backgrounds in computer science, mathematics, and/or statistics and students with backgrounds in biology and chemistry.

    Your grade will depend on your presentation (20%), on class participation (30%), and a final project (50%). The final project is a group software project. I will define software projects that are inspired by the papers you present in class. The project will involve creating some new software or using existing software innovatively combined with some intensive biological analysis of the results. You are welcome to suggest a project to me.

Table 1: Schedule (subject to change throughout the semester). Links in the "Topic and Papers" column point to specific papers assigned for each class. Links in "Presenter" column point to the slides for the lecture.
Date Topic and Papers Presenter(s)
Jan 18, 2023 Class cancelled  
Jan 23, 2023 Introduction to Computational Systems Biology T. M. Murali
Jan 25, 2023 Introduction to Computational Systems Biology, continued. Discussion of papers T. M. Murali
Jan 30, 2023 Pathways on demand: automated reconstruction of human signaling networks T. M. Murali
Feb 1, 2023 Pathways on demand, continued T. M. Murali
Feb 6, 2023 Pathways on demand, continued T. M. Murali
Feb 8, 2023 Pathways on demand, continued T. M. Murali
Feb 13, 2023 Pathways on demand, continued T. M. Murali
Feb 15, 2023 Class projects T. M. Murali
Feb 20, 2023 Introduction to single-cell RNA-seq data T. M. Murali
Feb 20, 2023 Benchmarking algorithms for gene regulatory network inference from single-cell transcriptomic data T. M. Murali
Feb 22, 2023 Benchmarking, continued T. M. Murali
Feb 27, 2023 Benchmarking, continued T. M. Murali
Mar 1, 2023 Benchmarking, continued T. M. Murali
Mar 6, 2023 No class (Spring break)  
Mar 8, 2023 No class (Spring break)  
Mar 13, 2023 Growing DAGs: Optimization Functions for Pathway Reconstruction Algorithms Nabayan Chaudhury and Parva Patel
Mar 15, 2023 Growing DAGs: Optimization Functions for Pathway Reconstruction Algorithms Nabayan Chaudhury and Parva Patel
Mar 20, 2023 Class cancelled  
Mar 22, 2023 Mutual interactors as a principle for phenotype discovery in molecular interaction networks Bernard Moussad and Pouyan Shirzadian
Mar 27, 2023 Mutual interactors as a principle for phenotype discovery in molecular interaction networks Bernard Moussad and Pouyan Shirzadian
Mar 29, 2023 Identifying tumor cells at the single-cell level using machine learning Amartya Dutta and Shivansh Mathur
Apr 3, 2023 Identifying tumor cells at the single-cell level using machine learning Amartya Dutta and Shivansh Mathur
Apr 5, 2023 DIALOGUE maps multicellular programs in tissue from single-cell or spatial transcriptomics data Neeti Gandhi and Dallin Leatham
Apr 10, 2023 DIALOGUE maps multicellular programs in tissue from single-cell or spatial transcriptomics data Neeti Gandhi and Dallin Leatham
Apr 12, 2023 scGNN is a novel graph neural network framework for single-cell RNA-Seq analyses Maryam Haghani and Yiqi Su
Apr 17, 2023 scGNN is a novel graph neural network framework for single-cell RNA-Seq analyses Maryam Haghani and Yiqi Su
Apr 19, 2023 scGNN is a novel graph neural network framework for single-cell RNA-Seq analyses Maryam Haghani and Yiqi Su
Apr 24, 2023 NetAct: a computational platform to construct core transcription factor regulatory networks using gene activity Sahar Heydari and Reza Mazloom
Apr 26, 2023 NetAct: a computational platform to construct core transcription factor regulatory networks using gene activity Sahar Heydari and Reza Mazloom
May 1, 2023 Final project presentations  
May 3, 2023 Final project presentations  

Reading Notes

  • PathLinker (January 30, 2023 to February 12, 2023)
  • BEELINE (February 20, 2023 to March 1, 2023)
  1. Assigned for January 30, 2023 to February 12, 2023 Pathways on demand: automated reconstruction of human signaling networks, Anna Ritz, Christopher L Poirel, Allison N Tegge, Nicholas Sharp, Kelsey Simmons, Allison Powell, Shiv D Kale, and T M Murali, npj Systems Biology and Applications, volume 2, Article number: 16002 (2016)

    • Read the "Editorial Summary" on the right hand side.
    • Read the "Abstract".
      • As you read it, think about whether there is an algorithm that you already know that might do what PathLinker does, at least to the extent that you can glean its functionality from a few sentences.
      • You may not know what NetPath and KEGG are. Don't worry. I will explain.
      • You can ignore the sentences about CFTR.
      • There are some claims about PathLinker's success. As you read the paper and listen to my lectures, ask yourself if you are convinced that these claims are accurate.
    • Now, it is time for the "Introduction". There may be many terms here and in the rest of the paper that are new and whose meaning is unclear. Not a problem. We will discuss all the important ones in class. As always, when in doubt, post a query on Piazza!
      • What does PathLinker do? Does Figure 1 help you figure out the answer to this question?
      • Think of "receptors" as starting points of a journey (sources) and transcription factors as destinations (targets). Do you now have a better sense of what PathLinker does? What algorithms might you use for this task?
      • There is something "human" and "yeast". You can ask a question during class to clarify this difference.
      • The Introduction then states two properties that a good reconstruction algorithm must have. As a reader, it makes sense to have the expectation that PathLinker will have both these properties but other, competing algorithms will lack one or both. You will have to check whether this expectation works out as you read the paper.
      • Further down, the paper is fairly explicit about what PathLinker computes. What are the inputs to PathLinker? What is the output? How is the output related to the input? Does this problem make sense to you? Do you know of an algorithm that solves the problem?
      • Immediately, the authors claim PathLinker satisfies both properties! Can this be true?!
      • The next para extols PathLinker's virtues. There is something about "recall" (there is another important concept called "precision"). Read about them on the Wikipedia.
    • Let us read "Results" next. I know, it is strange to read the results before knowing what algorithms the paper compares! But it turns out that this strange order (which is common in bioinformatics and biology journals) is actually quite effective and instructive in the case of this paper.
      • First, you notice that the paper compares PathLinker to many algorithms. We will discuss as many as possible in class. You can certainly follow the link to the Supplementary Materials to read more about the algorithms. But be warned that these descriptions are terse.
      • There is the notion of positives and negatives. Do you understand them? Or at least have an approximate notion of what they may mean?
      • Now comes the main set of results (those important words "precision" and "recall" again). We will discuss each of these results in detail.
      • First, Figure 2(a). What are the authors plotting? What will an ideal precision-recall curve look like? Do the claims made by the authors about the superiority of PathLinker hold true?
      • Now, let us look at Figure 2(b). How did the evaluation change between Figures 2(a) and 2(b). The authors say that they ignored some edges. Do you understand what properties these edges have? What is the takeaway from comparing Figure 2(a) to 2(b)?
      • Try to explain to someone else in the class what Figure 2(c) is all about. Note that the authors are comparing only PathLinker and RWR by this point. Is ignoring the other algorithms warranted?
      • The next few plots (Figures 2e, 2f, 2g) are all about PathLinker. What are the different types of analysis being done? Are you convinced they are useful? Is there some evaluation you would like to make this is missing in this paper?
      • Now comes a complex figure on the Wnt pathway (Figure 3). The left-hand side of Figure 3a is actually part of another figure. Which one is it? What can you say about the algorithms named here?
      • What do the authors conclude from coampring the network layouts in the right of Figure 3(a) to the layout in Figure 3(b)?
      • Ignore the section "Exploring the role of CFTR in Wnt signaling" for now.
    • The next section to read is "Materials and Methods".
      • First up, is the "PathLinker" section. There is an explicit description of the inputs to the pathway reconstruction problem. Make sure this list matches what you have gleaned from reading this paper so far.
      • What range can the weight of an edge take? Is it sensible? Does this range have a bearing on any step of the algorithm?
      • How do you decipher "the \(k\) highest scoring loopless paths that begin at any receptor in \(S\) and terminate at any TR in \(T\)"? Key phrases are "highest scoring path", "loopless", and "any".
      • Did the description of PathLinker put the cart before the horse? Where is the score of a path defined? It comes in the next sentence.
      • Then the text describes how PathLinker modifies the original graph in two ways. First is the addition of an artificial source and sink. This change is relevant to the word "any" above.
      • Second is the transformation of the weight of each edge. The text explains the reason for this transformation. Of course, to understand the explanation, you have to read both the paper on Yen's algorithm and Supplementary Section S6. Do your best to understand both. I will give you a high-level, informal sketch of Yen' algorithm in class.
      • The final paragraph of "PathLinker" includes the sentence: "By construction, the interactions in the k shortest paths are a subset of those in the (k+1) shortest paths." What is the significance of this sentence.
      • Does the value of \(k = 20,000\) seem very high?
    • Read the "Datasets" section.
      • Scan citations 3, 4, 5, and 40 to determine what types of interactions the human protein network contains.
      • You can read citation 17 on how to compute edge weights, if you are interested. I will skip over this computation.
      • What can you learn about how the authors determine the receptors and TRs in each pathway?
    • Let us move on to "Wnt pathway reconstructions".

      • What is the connection between the points marked with pinkish-red dots on the left of Figure 3a and the network layouts on the right of Figure 3a?
      • Which networks on the right of Figure 3a match your conception of a pathway?
      • Now look at Figure 3b. What do the numbers on each node mean? What do the node colours (green, pink, orange, gray) mean? What strategy did the authors use to assign these colours? Where will you find information on this strategy? I will ask detailed questions in class.
      • Why did the authors focus on RYK-CFTR-DAB2?
      • Now we will consider "Exploring the role of CFTR in Wnt signaling" to come.
        • There are two types of experiments that measure the output of the Wnt pathway. What are these?
        • Each bar except the first in Figure 4a corresponds a specific Wnt protein. Why does the $y$-axis start at 1? What can you conclude from this $y$-range? What is the main purpose of this experiment?
        • Several subsequent figures use siRNA-mediated silencing. siRNAs are a fascinating, Nobel-prize winning concept. I encourage you to read about them so that you know their main function in these experiments, at the very least.
        • What is the purpose of the experiments in Figure 4b? The "Anti" in the names of the different plots refers to antibodies used to measure the level of the respective protein. What do you glean from the statement "In the No Wnt control cells, cellular levels of β-catenin … increased as cellular protein levels of Dab2 decreased"?
        • The real tests of the predictions are in Figure 4c,d.e! Figures 4c/d are the counterpart of Figure 4a but with siRNAs for Ryk, CFTR, and Dab2. What should the white bars in Figure 4c/d correspond to it in Figure 4a?
        • Read the text in the paper and see if you can make sense of the results in Figure 4c,d/e. There are two parts.
        • First, what happens in the absence of Dab2 or CFTR?
        • Second, what happens (differently from Dab2 and CFTR)) in the absense of Ryk?
        • Now understand the model presented in Figure 5. Does it correspond to the results in Figure 4?
  2. Assigned for February 20, 2023 to March 1, 2023 Benchmarking algorithms for gene regulatory network inference from single-cell transcriptomic data, Aditya Pratapa, Amogh P. Jalihal, Jeffrey N. Law, Aditya Bharadwaj & T. M. Murali, volume 17, pages 147–154, Nature Methods, 2020
    • This paper is all about gene regulatory networks (GRNs) and automated methods to reconstruct them. Reading it will be a challenging exercise in going back-and-forth across different parts, navigating the (poorly-formatted) Methods section, and correlating the text with figures in the main text, the supplementary figures, and the supplementary section.
    • The first paragraph of the introduction does not even cite a paper on GRNs. So a reader has to rely on their knowledge or reading to determine what the authors mean. The Wikipedia page on GRNs is a bit abstract for my taste. As an alternative, read Hierarchical Differentiation of Myeloid Progenitors Is Encoded in the Transcription Factor Network (Introduction and Model Construction section of Results). I have already shown Figure 1 from this paper in class. What type of network are the authors of this paper creating? Can you correlate the information in Table 1 to Figure 1? Hopefully, you have a good understanding of what a GRN means after reading this paper.
    • Understand the motivation for this paper (final sentence of the second paragraph of the paper). Again, there are no citations to justify this sentence.
      • Read the "Performance evaluation" section of the paper to remind yourself about AUPRC, AUROC, and learn about EP and AUPRC ratio and EP ratios. What do we mean by top-\(k\) edges?
      • I will be more elaborate in my lecture using the SCODE and SINGE algorithms as examples.
      • Homework for each of you. Read a few of the other papers cited at the end of the sentence "Despite these challenges, over a dozen methods have been developed or used to infer GRNs from single-cell data." By "other papers", I mean ignore the SCODE and SINGE papers. Your homework is to find additional evidence to support the final sentence in the second paragraph of the paper. Send me the evidence you find (you can cut and paste text and figures from the paper you have selected) by email with the subject "CS 5854: support for BEELINE motivation" by 11:59pm on Tuesday, Feb 21,
    • What are the various components of BEELINE? See Figure 1 and the text around it.
    • The second paragraph of the main text says "there are no widely accepted ground-truth datasets for assessing algorithm accuracy". How did BEELINE address this problem?
    • A critical aspect of BEELINE is the difference between the simulated data (from synthetic networks or curated models) and single-cell RNA-seq data.
      • Do you understand why the authors use synthetic networks or curated models? In the evaluation pipeline, what role do these networks play?
      • How do the single-cell RNA-seq data arise for such a network? What method does the paper use?
    • It is very important to understand the dynamics of the synthetic networks shown in the first column of Supplementary Figure 1. Consider the "Linear" network. Assume each gene can be in state 0 (off) or 1 (on) at any time. The rules are as follows: a gene is turned on if and only if any of its regulators is on and none of its inhibitors are off. If a gene only has inhibitors, then it is turned on if and only if none of its inhibitors is on. Now think of simulating this network. At any time point, pick a gene uniformly at random and apply the rule to compute its state. How do you see the Linear network evolving? What about Cycle network? What about the Bifurcating network? Think about their names. It is really important to do this exercise yourself before the lecture.
    • Now we turn our attention to BoolODE. Try to answer these questions by reading Read the first paragraph of "BoolODE: converting Boolean models to ODEs", ignoring the subsequent sub-sections with equations, and reading the remaining sub-sections.
      • What does BoolODE do? A high-level understanding is sufficient. The mathematical details are not important now.
      • How long do we perform each simulation?
      • How do we define (expression profile) for a cell? How did we create simulated datasets?
      • Do we need to compute pseudotime? This concept is new, so you will have to do auxiliary reading.
    • Now turn your attention to "Datasets", especially "Creating datasets from synthetic networks". Read until the beginning of "Creating datasets from curated models".
    • As you are reading, also examine Supplementary Figure 1 and the text in the "Datasets from synthetic networks" in "Results". It is important to correlate the steps in "Datasets" to the columns of the figure and the corresponding main text to gain a complete understanding of the results. Why are the authors going to all this trouble?
      • What did the authors cluster in column (c)? Which clustering algorithm did they use?
      • What value did we use if a GRN method required time-ordered cells? Why?
      • What do we learn by comparing plots in column (b) to those in column(c)?
      • Does GeneNetWeaver (column (d)) replicate trajectories expected from synthetic networks?
    • So far, there has been no mention of any specific GRN algorithms! What criteria did the authors use to select them? Do they make sense to you? There is a brief description of each algorithm in "Regulatory network inference algorithms". What are the broad features that are common to more than one method. Correspond your conclusions with "Properties" columns of Figure 6.
      • Which algorithms were developed for bulk RNA-seq data?
      • How did we evaluate methods that output undirected GRNs?
      • How did we handle methods with parameters?
    • Now let us look at the results of running BEELINE on the datasets created for the toy networks. Which figures should you look at? As you examine these figures and read the corresponding text in "Results", answer the following questions:
      • What is AUPRC ratio?
      • What is a top-\(k\) GRN?
      • How did we compute stability?
      • What are the numbers inside the cells in Figure 2? Why don't all cells have numbers inside them?
    • (Added for class on February 27, 2023) There were other analyses done on the toy networks in the paper.
      • How many cells were simulated for the results in Figure 2.
      • What other counts of cells were simulated? Was there any comparison of the results between different pairs of number of cells? There is a specific suplementary note that addresses this question. What did it find?
      • What did the authors check in Supplementary Note 1.4?
      • What about Supplementary Note 1.5?
    • Now we can start reading about the curated Boolean network.
      • Why did the authors curate Boolean models from the literature?
      • To what types of cells or process do the selected models correspond?
      • How many proteins and interactions do they contain?
      • Read the paper on the mCAD model, especially the caption for Figure 2. Do you see anything interesting about the interactions?
    • Read "Creating datasets from curated models" in "Methods" and see Figure 3.
      • Why did the authors perform two types of clustering here (column (c) and “behind the scenes” for column (d))?
      • What did they compute in column (d)?
      • What do we learn by comparing plots in column (c) to those in column (b)?
      • How well do pseudotime values computed by Slingshot (column (d)) correspond to simulation times of the cells (column (b))?
    • There is an additional validation of the simulations that appears in the paragraph (ii) of Supplementary Note 2.1 and in Figures 6 to 9 in the supplement. What is this analysis? Read the paper on the VSC model. Figure 1 is critical here. How many steady states should the model have? What are their names? What gene expression values characterise each steady state? How can we check the t-SNE plots of the BoolODE-simulated data to visually associate clusters of cells to each steady state?
    • Read "Datasets from curated models" in "Results" and see Figure 4 and Supplementary Figures 2 and 4.
      • Do you notice a trend when comparing the box plots in Supplementary Figure 2 to those in Supplementary Figure 4? Did the algorithms perform change in any way? Did some algorithm improve its AUPRC considerably? Did some algorithm's AUPRC decrease substantially?
      • What is the difference between the first two sets of columns in Figure 4 and the last two sets of columns in this figure?
      • what conclusions on algorithm performance can you draw by comparing the ordering of algorithms in Figure 4 with the ranking in Figure 2?
    • There are many other analyses of the results on Boolean models. Read Supplementary Notes 2.3 to 2.6 carefully. I will discuss some of them carefully in class. The supplementary note on indirect edges is especially important.
    • So far, we have ignored the details of the BoolODE method. It is time to rectify that. Read that section in "Methods". It make take a few times to understand the idea. The key point is the text following \(X = (P \vee Q) \wedge \lnot (R)\).
    • (Added for class on March 1, 2023) We now consider the final set of results, those on experimental single cell RNA-seq data.
      • What do the authors say is the motivation to study the performance of GRN inference algorithms on these data?
      • Let us first look at the datasets used by the authors ("Experimental single-cell RNA-seq datasets" in "Methods" and Supplementary Table 3). Read the steps taken to process each dataset.
        • Why do the authors need to specify starting cell type?
        • Were there any multi-lineage datasets?
        • Were there any time-series datasets?
        • What do the numbers in “#genes” and “#TFs” denote? How did the authors select genes for analysis?
      • Now let us think about the ground truth networks ("Ground-truth networks collection and processing" in "Methods" and "Supplementary Table 4).
        • How many types of networks?
        • What was the rationale for selecting them?
        • What are the differences between these networks?
      • Which algorithms did the authors apply to experimental data? What criteria did they use to select these methods? Read the second and third paragraphs of "Experimental single-cell RNA-seq datasets" in "Results".
      • Did the authors apply the algorithms to all the genes and TFs in each scRNA-seq data or did they apply some filtering? Where will you look to answer this question? (Hint: it should be in the "Datasets" section.)
      • Look at the results in Figure 5. Focus on the results for mouse datasets and "TFs + 500 genes". Note that the algorithms appear jn columns now. (why?)
        • Why did the authors use the EPR?
        • How did they measure network density?
        • What trends did they observe in the EPR? Compare the results for STRING to those for Nonspecific networks and to those for Cell-type-specific networks.
        • Why did PPCOR's performance drop for experimental datasets?
      • Are the results for human datasets any different from those for mouse data?
      • Read four supplementary notes. We will discuss them briefly in class.
        • Consensus with direct clustering of gene expression data (Supplementary Note 3.2)
        • Gene selection strategy (Supplementary Note 3.3)
        • Stability across runs (Supplementary Note 3.4)
        • Similarity across algorithms (Supplementary Note Figure 20)
      • Supplementary Note 3.5 is important. First read Wisdom of crowds for robust gene network inference, especially Figures 2a and 3a and the accompanying text. Now read Supplementary 3.5. How are the results here starkly different from "Wisdom of Crowds"?
      • Finally, we will walk through the various observations made on the results in Figure 6.
      • I will also share my observations on how BEELINE has been used since February 2020 when it was published.
      • And remind me to tell you about the timeline for developing BEELINE, writing the manuscript, submitting it, and responding to reviews.

Author: "T. M. Murali"

Created: 2023-04-24 Mon 16:16