Summary

In this practical we will use R to make some simple predictions about the level of polymorphism in a sample of DNA sequences. Our predictions will be based on theorectical and simulation results obtained from coalescent models.

Getting Started

Connecting to UNIX server

To get started, we will connect to one of the Unix servers on Campus. UNIX is an operating system designed specifically for multiple users and scientific computing. We will use a terminal server to transmit encrypted commands between your Windows PC and the Unix server.

To get started, go to the Start Menu, and in select the Programs Folder and then Internet Applications. Start Putty. To request a secure connection, make sure that the SSH Port is selected. Next, we will need to specify a Unix server name in the hostname box.

Biostatistics Students: You can select one of the Department's compute servers by specifying one of compute1-compute5 in the hostname box. For example, to connect to compute3, specify compute3.sph.umich.edu in the hostname box and select Open.

Statistics and Bioinformatics Students: If you have access to one your home Department's compute servers, you can specify it in the hostname box. Otherwise, connect to fantasia.sph.umich.edu and use your uniqname and student id number to login.

Putty can be downloaded freely from here.

Unix Very Basics

You should now get to a Unix prompt. At the Unix prompt, you will be able to run specific programs or manipulate files and folders. Some key commands for this session will be:

   pwd          - Report the current working folder
   ls           - List the contents of the current folder
   cd           - Change between different folders
   mkdir        - Creates a new folder
   file filename - Provides information on the contents of a file
   pico         - A simple text editor
   R            - A statistics package

I recommend you get started by creating a folder for Biostatistics 666 practicals. Execute the following series of commands:

   cd                         Returns you to your home directory 
   mkdir 666                  Create 666 coursework directory 
   cd 666                    
   mkdir lab1                 Create lab1 directory  
   cd lab1

At any stage you can use the pwd, ls and file commands to get information about your current working directory or a specific file

You can open multiple connections to the same Unix server, and you may find it convenient to do so. You could then use one connection to edit code and a separate connection to run R.

Getting Started with R

From the 666/lab1 directory, start R. To do this, simple execute the command R. If you want to install R in your own computer or look at the documentation you can get it from www.r-project.org.

R starts with a simple prompt. Like the Unix shell, R will execute simple commands you give it. While Unix commands generally focus on manipulating files and directories, R commands are designed to evaluate simple mathematical expressions and provide statistical summaries of data.

Try the following simple R commands:

   1 + 2 * 3                          Evaluate simple expression 
   sqrt(2)                            
   x <- 1                             Assign value 1 to x 
   x                                  Display the value of x 
   y <- 1:3                           Assign the values 1,2,3 to y 
   y                                  Display the value of y 
   y * 2
   y + 1
   y * y                              A few more simple expressions 
   square <- function(x) x * x        Define square function 
   square(4)
   square(1:3)                        And use it 
   z <- rep(0,3)                      Define a vector of 3 values, all 0 
   z
   c(y, z)                            Concatenate arrays y and z 
   objects()                          List current objects 
   rnorm(1)                           Random normal deviate 
   rexp(1)                            Random draw from exponential distribution
   sample(y,2)                        Select two elements from y at random 
   intersect(1:10, 5:15)              Intersection of two sets 
   setdiff(1:10, 5:15)                Difference between two sets (not cumutative) 

Probability of coalescent events

If you feel confortable with the basics of R, lets define some simple functions related to the coalescent, using the formulas defined in class. For example, we could start with some functions that estimate the probability of a coalescent event:

   Pca <- function(n, N = 1000) {

      # This function will approximate the probability of no
      # coalescent events in a sample of n chromosomes from
      # a diploid population of size N

      1.0 - (n * (n-1)) / (4 * N)

      }

   PcaExact <- function(n, N = 1000) {

      # This function will calculate the probability of no
      # coalescent events in a sample of n chromosomes from
      # a diploid population of size N

      result = 1.0

      for (i in 1:(n-1)) {
         result <- result * (2*N - i)/ (2*N)
         }
   
      result
      }

The first function uses the approximate formula to calculate the probability that, in the previous generation, n individuals have distinct ancestors in a population of size N. The second argument takes a default value of 1000. Try for example:

   # Look at effect of sample size
   Pca(n = 2)
   Pca(n = 4)
   Pca(n = 10)
   Pca(n = 20)

   # Look at effect of population size
   Pca(n = 2)
   Pca(n = 2, 10)
   Pca(n = 2, 100)
   Pca(n = 2, 1000)

   # Or compare exact and approximate results
   Pca(n = 5, 10)
   Pca(n = 5, 100)

   PcaExact(n = 5, 10)
   PcaExact(n = 5, 100)

Number of segregating sites

Below are functions that estimate the probability of various numbers of segregating sites in a sequence. Rather than retyping them, you can cut and paste these into your R (Copy in your browser edit menu, then right-click within the PuTTY window).

   # The implementations below are conceptually simple, but crude
   # they will be very slow when N or n is large

   P2 <- function (S, N = 1000, mu = 10^-6) {

      # This function will calculate the probability of
      # exactly S segregating sites in a sample of two
      # chromosomes from a population of size N and
      # with mutation rate mu

      theta <- 4 * N * mu
 
      (theta / (1 + theta)) ^ S * (1/(1 + theta))
      } 

   Qn <- function (S, n, N = 1000, mu = 10^-6) {

      # This function will calculate the probability that
      # exactly S mutations occur before the next coalescent
      # event in a sample of n chromosomes from a population
      # of size N with mutation rate mu

      theta <- 4 * N * mu

      (theta / (theta + n - 1)) ^ S * ((n - 1)/(n - 1 + theta))
      }

   Pn <- function(S, n, N = 1000, mu = 10^-6) {
 
      # This function will calculate the probability that
      # exactly S mutations are present in a sample of n
      # chromosomes from a population of size N with
      # mutation rate mu

      if (n > 2) {

         result <- 0

         for (i in 0:S)
            result <- result + Pn(i, n-1, N, mu) * Qn(S-i, n, N, mu)
         }
      else
         result <- P2(S, N, mu)

      result
      }

After you define these functions, consider the following problem: in a sample of 5 chromosomes from Northern Europe 10 variants were identified within the a gene that protects the skin from sunlight. In a sample of 6 chromosomes from East Asia, only 1 variant was identified. Assuming that the effective population size in each of these populations is N = 5000 and the mutation rate within the gene is mu = 10^-4 per generation, estimate the probability of observing these numbers of variants.

Single Locus Coalescent

The functions below generate a simple one locus coalescent tree. The consists of a series of brances. Each branch has a specific length, a total number of descendants and an ancestor or parent branch.

SingleLocusTree <- function(n = 10, N = 1000, debug = FALSE) {

   # Length of each branch in the tree
   lengths <- rep(0, n)

   # Number of descendants for each node
   descendants <- rep(1, n)

   # Parent node for each node
   parent <- rep(0, n)

   # Descendant nodes for each node
   activeNodes <- 1:n

   nextNode = n + 1

   for (i in n:2)
      {
      # sample time to next coalescent event
      t <- rexp(1, rate = (i*(i-1)/2)/N)

      # all current nodes increase in length by time t
      lengths[activeNodes] <- lengths[activeNodes] + t

      # next we select two nodes to coalesce at random
      coalescence <- sample(activeNodes, 2)

      if (debug)
         cat("After ", t, "generations: Nodes ", coalescence[1], " and ", coalescence[2], "coalesce\n")

      # assign their parental nodes
      parent[coalescence] = nextNode

      # remove these from the list of active nodes
      activeNodes <- setdiff(activeNodes, coalescence)

      # create a new node
      parent[nextNode] <- 0
      lengths[nextNode] <- 0
      descendants[nextNode] <- descendants[coalescence[1]] + descendants[coalescence[2]]

      # And update list of active nodes
      activeNodes <- c(activeNodes, nextNode)

      nextNode <- nextNode + 1
      }

  list(descendants = descendants, parent = parent, lengths = lengths)
  }

GetTreeSpectrum <- function(tree, n)
  {
  spectrum <- rep(0, n)

  for (i in 1:length(tree$descendants))
      spectrum[tree$descendants[i]] <- spectrum[tree$descendants[i]] + tree$lengths[i]

  spectrum
  }

Start by generating a simple tree and trying to draw it out on paper. You can do this by executing SingleLocusTree(n = 3). If you want more detail on what is happening, you can also include the debug flag, e.g., run SingleLocusTree(n = 3, debug = TRUE).

To generate several trees and tally their branch lengths by number of descendants try the following:

   n <- 10
   spectrum <- rep(0, n)
 
   for (i in 1:100)
      {
      tree <- SingleLocusTree(n)
      spectrum <- GetTreeSpectrum(tree, n) + spectrum
      }

   spectrum <- spectrum / 100

Two locus tree

If you are curious, this is what a two locus simulator looks like...

TwoLocusTree <- function(n, N = 1000, r = 0.0001, debug = FALSE)
      {
      # This function generates trees for two linked loci

      # Then we initialize the tree with n sequences for
      # the right hand locus
      rightDescendants <- rep(1, n)
      rightLengths     <- rep(0, n)
      rightParent      <- rep(0, n)

      # First we initialize the tree with n sequences for
      # the left hand locus
      leftDescendants  <- rep(1, n)
      leftLengths      <- rep(0, n)
      leftParent       <- rep(0, n)

      activePairs       <- 1:n
      activeLeft        <- vector()
      activeRight       <- vector()

      nextNode <- n + 1

      startN <- n

      while (length(activePairs) + max(length(activeLeft), length(activeRight)) > 1) {

          # Calculate probability of recombination
          prec <- length(activePairs) * r

          # Calculate probability of coalescence
          n    <- length(activePairs) + length(activeLeft) + length(activeRight)
          pca  <- n * (n - 1) / (2 * N)

          # Calculate time to next event
          t    <- rexp(1, pca + prec)

          if (debug) cat(length(activePairs), length(activeLeft), length(activeRight),
                         "After ", t, " generations: ")

          # Update times for all nodes
          indexRight <- c(activeRight, activePairs)
          rightLengths[indexRight] <- rightLengths[indexRight] + t

          indexLeft <- c(activeLeft, activePairs)
          leftLengths[indexLeft] <- leftLengths[indexLeft] + t

          # Was this a coalescence or recombination event?
          if (rbinom(1, 1, pca / (pca + prec)) == 1) {

             # Pick two nodes at random ...
             coalescent <- sample(c(activeLeft, activeRight, activePairs), 2)

             if (debug) cat("Nodes ", coalescent[1], " and ", coalescent[2], " coalesce into ", nextNode, "\n")

             # Set parents appropriately
             rightParent[intersect(coalescent, indexRight)] <- nextNode
             leftParent[intersect(coalescent, indexLeft)] <- nextNode

             # Create ancestor node
             rightParent[nextNode] <- leftParent[nextNode] <- 0
             rightLengths[nextNode] <- leftLengths[nextNode] <- 0
             rightDescendants[nextNode] <- sum(rightDescendants[coalescent])
             leftDescendants[nextNode]  <- sum(leftDescendants[coalescent])

             # Delete coalescing nodes from appropriate list
             activePairs <- setdiff(activePairs, coalescent)
             activeLeft  <- setdiff(activeLeft, coalescent)
             activeRight <- setdiff(activeRight, coalescent)

             # Add new ancestral node to appropriate list
             if (rightDescendants[nextNode] == 0) {
                activeLeft <- c(activeLeft, nextNode)
             } else if (leftDescendants[nextNode] == 0) {
                activeRight <- c(activeRight, nextNode)
             } else
                activePairs <- c(activePairs, nextNode)

             # Check for special situations where one locus finishes coalescing
             if (length(activePairs) == 0)  {
                if (length(activeRight) == 1) activeRight <- vector()
                if (length(activeLeft) == 1) activeLeft <- vector()
                }

             if ((length(activePairs) == 1))  {
                if (length(activeRight) == 0) {
                   activeLeft <- c(activeLeft, activePairs)
                   activePairs <- vector()
                   }
                if (length(activeLeft) == 0) {
                   activeRight <- c(activeRight, activePairs)
                   activePairs <- vector()
                   }
                }

             nextNode <- nextNode + 1

          } else {

             # Pick one node at random to recombine
             recombinant <- ifelse(length(activePairs)==1, activePairs, sample(activePairs, 1))

             if (debug) cat("Node ", recombinant, " splits into ", nextNode, " and ", nextNode + 1, "\n")

             # Assign different ancestors to each stretch
             rightParent[recombinant] <- nextNode
             leftParent[recombinant]  <- nextNode+1

             activeRight <- c(activeRight, nextNode)
             activeLeft <- c(activeLeft, nextNode + 1)

             # Create ancestor for right portion
             rightParent[nextNode] <- leftParent[nextNode] <- 0
             rightLengths[nextNode] <- leftLengths[nextNode] <- 0
             rightDescendants[nextNode] <- rightDescendants[recombinant]
             leftDescendants[nextNode]  <- 0
             nextNode <- nextNode + 1

             # Create ancestor for left portion
             rightParent[nextNode] <- leftParent[nextNode] <- 0
             rightLengths[nextNode] <- leftLengths[nextNode] <- 0
             rightDescendants[nextNode] <- 0
             leftDescendants[nextNode]  <- leftDescendants[recombinant]
             nextNode <- nextNode + 1

             # Remove recombinant node from active list
             activePairs <- setdiff(activePairs, recombinant)
             }
          }

      left <- list(descendants = leftDescendants, parent = leftParent, lengths = leftLengths)
      right <- list(descendants = rightDescendants, parent = rightParent, lengths = rightLengths)

      list(left = left, right = right)
      }