In this practical we will use R to make some simple predictions about the level of polymorphism in a sample of DNA sequences. Our predictions will be based on theorectical and simulation results obtained from coalescent models.
To get started, we will connect to one of the Unix servers on Campus. UNIX is an operating system designed specifically for multiple users and scientific computing. We will use a terminal server to transmit encrypted commands between your Windows PC and the Unix server.
To get started, go to the Start Menu, and in select the Programs Folder and then Internet Applications. Start Putty. To request a secure connection, make sure that the SSH Port is selected. Next, we will need to specify a Unix server name in the hostname box.
Biostatistics Students: You can select one of the Department's compute servers by specifying one of compute1-compute5 in the hostname box. For example, to connect to compute3, specify compute3.sph.umich.edu in the hostname box and select Open.
Statistics and Bioinformatics Students: If you have access to one your home Department's compute servers, you can specify it in the hostname box. Otherwise, connect to fantasia.sph.umich.edu and use your uniqname and student id number to login.
Putty can be downloaded freely from here.
You should now get to a Unix prompt. At the Unix prompt, you will be able to run specific programs or manipulate files and folders. Some key commands for this session will be:
pwd - Report the current working folder ls - List the contents of the current folder cd - Change between different folders mkdir - Creates a new folder file filename - Provides information on the contents of a file pico - A simple text editor R - A statistics package
I recommend you get started by creating a folder for Biostatistics 666 practicals. Execute the following series of commands:
cd Returns you to your home directory mkdir 666 Create 666 coursework directory cd 666 mkdir lab1 Create lab1 directory cd lab1
At any stage you can use the pwd, ls and file commands to get information about your current working directory or a specific file
You can open multiple connections to the same Unix server, and you may find it convenient to do so. You could then use one connection to edit code and a separate connection to run R.
From the 666/lab1 directory, start R. To do this, simple execute the command R. If you want to install R in your own computer or look at the documentation you can get it from www.r-project.org.
R starts with a simple prompt. Like the Unix shell, R will execute simple commands you give it. While Unix commands generally focus on manipulating files and directories, R commands are designed to evaluate simple mathematical expressions and provide statistical summaries of data.
Try the following simple R commands:
1 + 2 * 3 Evaluate simple expression sqrt(2) x <- 1 Assign value 1 to x x Display the value of x y <- 1:3 Assign the values 1,2,3 to y y Display the value of y y * 2 y + 1 y * y A few more simple expressions square <- function(x) x * x Define square function square(4) square(1:3) And use it z <- rep(0,3) Define a vector of 3 values, all 0 z c(y, z) Concatenate arrays y and z objects() List current objects rnorm(1) Random normal deviate rexp(1) Random draw from exponential distribution sample(y,2) Select two elements from y at random intersect(1:10, 5:15) Intersection of two sets setdiff(1:10, 5:15) Difference between two sets (not cumutative)
If you feel confortable with the basics of R, lets define some simple functions related to the coalescent, using the formulas defined in class. For example, we could start with some functions that estimate the probability of a coalescent event:
Pca <- function(n, N = 1000) { # This function will approximate the probability of no # coalescent events in a sample of n chromosomes from # a diploid population of size N 1.0 - (n * (n-1)) / (4 * N) } PcaExact <- function(n, N = 1000) { # This function will calculate the probability of no # coalescent events in a sample of n chromosomes from # a diploid population of size N result = 1.0 for (i in 1:(n-1)) { result <- result * (2*N - i)/ (2*N) } result }
The first function uses the approximate formula to calculate the probability that, in the previous generation, n individuals have distinct ancestors in a population of size N. The second argument takes a default value of 1000. Try for example:
# Look at effect of sample size Pca(n = 2) Pca(n = 4) Pca(n = 10) Pca(n = 20) # Look at effect of population size Pca(n = 2) Pca(n = 2, 10) Pca(n = 2, 100) Pca(n = 2, 1000) # Or compare exact and approximate results Pca(n = 5, 10) Pca(n = 5, 100) PcaExact(n = 5, 10) PcaExact(n = 5, 100)
Below are functions that estimate the probability of various numbers of segregating sites in a sequence. Rather than retyping them, you can cut and paste these into your R (Copy in your browser edit menu, then right-click within the PuTTY window).
# The implementations below are conceptually simple, but crude # they will be very slow when N or n is large P2 <- function (S, N = 1000, mu = 10^-6) { # This function will calculate the probability of # exactly S segregating sites in a sample of two # chromosomes from a population of size N and # with mutation rate mu theta <- 4 * N * mu (theta / (1 + theta)) ^ S * (1/(1 + theta)) } Qn <- function (S, n, N = 1000, mu = 10^-6) { # This function will calculate the probability that # exactly S mutations occur before the next coalescent # event in a sample of n chromosomes from a population # of size N with mutation rate mu theta <- 4 * N * mu (theta / (theta + n - 1)) ^ S * ((n - 1)/(n - 1 + theta)) } Pn <- function(S, n, N = 1000, mu = 10^-6) { # This function will calculate the probability that # exactly S mutations are present in a sample of n # chromosomes from a population of size N with # mutation rate mu if (n > 2) { result <- 0 for (i in 0:S) result <- result + Pn(i, n-1, N, mu) * Qn(S-i, n, N, mu) } else result <- P2(S, N, mu) result }
After you define these functions, consider the following problem: in a sample of 5 chromosomes from Northern Europe 10 variants were identified within the a gene that protects the skin from sunlight. In a sample of 6 chromosomes from East Asia, only 1 variant was identified. Assuming that the effective population size in each of these populations is N = 5000 and the mutation rate within the gene is mu = 10^-4 per generation, estimate the probability of observing these numbers of variants.
The functions below generate a simple one locus coalescent tree. The consists of a series of brances. Each branch has a specific length, a total number of descendants and an ancestor or parent branch.
SingleLocusTree <- function(n = 10, N = 1000, debug = FALSE) { # Length of each branch in the tree lengths <- rep(0, n) # Number of descendants for each node descendants <- rep(1, n) # Parent node for each node parent <- rep(0, n) # Descendant nodes for each node activeNodes <- 1:n nextNode = n + 1 for (i in n:2) { # sample time to next coalescent event t <- rexp(1, rate = (i*(i-1)/2)/N) # all current nodes increase in length by time t lengths[activeNodes] <- lengths[activeNodes] + t # next we select two nodes to coalesce at random coalescence <- sample(activeNodes, 2) if (debug) cat("After ", t, "generations: Nodes ", coalescence[1], " and ", coalescence[2], "coalesce\n") # assign their parental nodes parent[coalescence] = nextNode # remove these from the list of active nodes activeNodes <- setdiff(activeNodes, coalescence) # create a new node parent[nextNode] <- 0 lengths[nextNode] <- 0 descendants[nextNode] <- descendants[coalescence[1]] + descendants[coalescence[2]] # And update list of active nodes activeNodes <- c(activeNodes, nextNode) nextNode <- nextNode + 1 } list(descendants = descendants, parent = parent, lengths = lengths) } GetTreeSpectrum <- function(tree, n) { spectrum <- rep(0, n) for (i in 1:length(tree$descendants)) spectrum[tree$descendants[i]] <- spectrum[tree$descendants[i]] + tree$lengths[i] spectrum }
Start by generating a simple tree and trying to draw it out on paper. You can do this by executing SingleLocusTree(n = 3). If you want more detail on what is happening, you can also include the debug flag, e.g., run SingleLocusTree(n = 3, debug = TRUE).
To generate several trees and tally their branch lengths by number of descendants try the following:
n <- 10 spectrum <- rep(0, n) for (i in 1:100) { tree <- SingleLocusTree(n) spectrum <- GetTreeSpectrum(tree, n) + spectrum } spectrum <- spectrum / 100
If you are curious, this is what a two locus simulator looks like...
TwoLocusTree <- function(n, N = 1000, r = 0.0001, debug = FALSE) { # This function generates trees for two linked loci # Then we initialize the tree with n sequences for # the right hand locus rightDescendants <- rep(1, n) rightLengths <- rep(0, n) rightParent <- rep(0, n) # First we initialize the tree with n sequences for # the left hand locus leftDescendants <- rep(1, n) leftLengths <- rep(0, n) leftParent <- rep(0, n) activePairs <- 1:n activeLeft <- vector() activeRight <- vector() nextNode <- n + 1 startN <- n while (length(activePairs) + max(length(activeLeft), length(activeRight)) > 1) { # Calculate probability of recombination prec <- length(activePairs) * r # Calculate probability of coalescence n <- length(activePairs) + length(activeLeft) + length(activeRight) pca <- n * (n - 1) / (2 * N) # Calculate time to next event t <- rexp(1, pca + prec) if (debug) cat(length(activePairs), length(activeLeft), length(activeRight), "After ", t, " generations: ") # Update times for all nodes indexRight <- c(activeRight, activePairs) rightLengths[indexRight] <- rightLengths[indexRight] + t indexLeft <- c(activeLeft, activePairs) leftLengths[indexLeft] <- leftLengths[indexLeft] + t # Was this a coalescence or recombination event? if (rbinom(1, 1, pca / (pca + prec)) == 1) { # Pick two nodes at random ... coalescent <- sample(c(activeLeft, activeRight, activePairs), 2) if (debug) cat("Nodes ", coalescent[1], " and ", coalescent[2], " coalesce into ", nextNode, "\n") # Set parents appropriately rightParent[intersect(coalescent, indexRight)] <- nextNode leftParent[intersect(coalescent, indexLeft)] <- nextNode # Create ancestor node rightParent[nextNode] <- leftParent[nextNode] <- 0 rightLengths[nextNode] <- leftLengths[nextNode] <- 0 rightDescendants[nextNode] <- sum(rightDescendants[coalescent]) leftDescendants[nextNode] <- sum(leftDescendants[coalescent]) # Delete coalescing nodes from appropriate list activePairs <- setdiff(activePairs, coalescent) activeLeft <- setdiff(activeLeft, coalescent) activeRight <- setdiff(activeRight, coalescent) # Add new ancestral node to appropriate list if (rightDescendants[nextNode] == 0) { activeLeft <- c(activeLeft, nextNode) } else if (leftDescendants[nextNode] == 0) { activeRight <- c(activeRight, nextNode) } else activePairs <- c(activePairs, nextNode) # Check for special situations where one locus finishes coalescing if (length(activePairs) == 0) { if (length(activeRight) == 1) activeRight <- vector() if (length(activeLeft) == 1) activeLeft <- vector() } if ((length(activePairs) == 1)) { if (length(activeRight) == 0) { activeLeft <- c(activeLeft, activePairs) activePairs <- vector() } if (length(activeLeft) == 0) { activeRight <- c(activeRight, activePairs) activePairs <- vector() } } nextNode <- nextNode + 1 } else { # Pick one node at random to recombine recombinant <- ifelse(length(activePairs)==1, activePairs, sample(activePairs, 1)) if (debug) cat("Node ", recombinant, " splits into ", nextNode, " and ", nextNode + 1, "\n") # Assign different ancestors to each stretch rightParent[recombinant] <- nextNode leftParent[recombinant] <- nextNode+1 activeRight <- c(activeRight, nextNode) activeLeft <- c(activeLeft, nextNode + 1) # Create ancestor for right portion rightParent[nextNode] <- leftParent[nextNode] <- 0 rightLengths[nextNode] <- leftLengths[nextNode] <- 0 rightDescendants[nextNode] <- rightDescendants[recombinant] leftDescendants[nextNode] <- 0 nextNode <- nextNode + 1 # Create ancestor for left portion rightParent[nextNode] <- leftParent[nextNode] <- 0 rightLengths[nextNode] <- leftLengths[nextNode] <- 0 rightDescendants[nextNode] <- 0 leftDescendants[nextNode] <- leftDescendants[recombinant] nextNode <- nextNode + 1 # Remove recombinant node from active list activePairs <- setdiff(activePairs, recombinant) } } left <- list(descendants = leftDescendants, parent = leftParent, lengths = leftLengths) right <- list(descendants = rightDescendants, parent = rightParent, lengths = rightLengths) list(left = left, right = right) }