libStatGen Software
/*
 *  Copyright (C) 2010  Regents of the University of Michigan
 *
 *   This program is free software: you can redistribute it and/or modify
 *   it under the terms of the GNU General Public License as published by
 *   the Free Software Foundation, either version 3 of the License, or
 *   (at your option) any later version.
 *
 *   This program is distributed in the hope that it will be useful,
 *   but WITHOUT ANY WARRANTY; without even the implied warranty of
 *   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 *   GNU General Public License for more details.
 *
 *   You should have received a copy of the GNU General Public License
 *   along with this program.  If not, see <http://www.gnu.org/licenses/>.
 */

#ifndef __MEMORYMAP_H
#define __MEMORYMAP_H
#include <sys/types.h>
#include <fcntl.h>

#if defined(_WIN32)
#include <windows.h>
#endif

///
/// There is a pair of related data structures in the operating system,
/// and also a few simple algorithms, that explain why your processes are
/// waiting forever.
///
/// The symptom you have is that they are getting little or no CPU time,
/// as shown by the command 'top'. The machine will appear to have
/// available CPU time (look at the Cpu(s): parameter - if less than 100%,
/// you have available CPU). The real key, however, is to look at the
/// 'top' column with the label 'S' - that is the status of the process,
/// and it is crucial to understanding what is going on.
///
/// In your instance, the 'S' column for your karma jobs is 'D', which
/// means the process is waiting for data - it is doing something that
/// waits for the filesystem to return data to it. Usually this is a C
/// call like read() or write(), but it also happens in large processes
/// whose memory was copied to disk and re-used for other purposes (this
/// is called paging).
///
/// So, a bit of background on the operating system... there is a CPU
/// scheduler that takes a list of waiting processes and picks one to
/// run. If a job is waiting for the disk, there is no point in picking
/// it to run, since it is blocked until the disk returns data. The
/// scheduler marks the process with 'D' and moves on to the next
/// process to schedule.
///
/// In terms of data structures, there are two that matter for this
/// example. The first is a linear list of disk buffers that are stored
/// in RAM and controlled by the operating system, usually called the
/// disk buffer pool. When a program asks for data from the disk, this
/// list can be scanned quickly to see if the data is already in RAM -
/// if so, no disk operation needs to take place.
///
/// In the case of the normal Unix read() and write() calls, when the
/// operating system is done finding the page, it copies the data into a
/// buffer supplied by the process that requested it (in the case of a
/// read() - a write() is the opposite). This copy operation is slow and
/// inefficient, but gets the job done.
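///
/// As a concrete illustration, that buffered read() path looks roughly
/// like this (a minimal sketch, not part of this class; the file name
/// is hypothetical):
/// \code
/// #include <unistd.h>
/// char buffer[4096];
/// int fd = ::open("reference.umfa", O_RDONLY);
/// // The kernel finds the page in the disk buffer pool (reading from
/// // disk if it has to), then copies it into 'buffer':
/// ssize_t n = ::read(fd, buffer, sizeof(buffer));
/// ::close(fd);
/// \endcode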
///
/// So overall, you gain some efficiency in a large memory system by
/// having this disk buffer pool data structure, since you aren't
/// re-reading the disk over and over to get data that you already have
/// in RAM. However, it is less efficient than it might be because of
/// the extra buffer copying.
///
/// Now we come to memory mapped files, and karma. The underlying system
/// call of interest to us is mmap(), and it is called in MemoryMap.cpp.
/// What it does and how it works are important to understanding its
/// benefits, and frankly, most people don't care about it because it is
/// seemingly complex.
///
/// Two things are important to know. Firstly, there is a data structure
/// in the CPU called the page table, which is mostly contained in the
/// CPU hardware itself; all memory accesses for normal user processes
/// like karma go through this hardware page table. Secondly, it is very
/// fast for the operating system to put together a page table that
/// 'connects' a bunch of memory locations in your user program's address
/// space to the disk buffer pool pages.
///
/// The combination of those two facts means that you can implement a
/// 'zero copy' approach to reading data: the data in the disk buffer
/// pool becomes directly readable by the program without the operating
/// system ever having to actually copy it, as it does for read() or
/// write().
///
/// So the benefit of mmap() is that when the underlying disk pages are
/// already in the disk buffer pool, a hardware data structure gets
/// built, the call returns, and the data is available at full processor
/// speed with no intervening copy and no waiting for disk or anything
/// else. It is as near to instantaneous as you can possibly get. This
/// works whether it is 100 bytes or 100 gigabytes.
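///
/// That zero-copy path boils down to a single mmap() call (again a
/// minimal sketch, not this class's implementation; error handling is
/// omitted and the file name is hypothetical):
/// \code
/// #include <sys/mman.h>
/// #include <sys/stat.h>
/// int fd = ::open("reference.umfa", O_RDONLY);
/// struct stat st;
/// ::fstat(fd, &st);
/// // No data is copied here - the kernel only builds page table entries
/// // that point at the disk buffer pool pages:
/// void *base = ::mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
/// char first = ((char *) base)[0]; // faulted in directly from the pool
/// \endcode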
///
/// So, the last part of the puzzle is why your program winds up in 'D'
/// (data wait), and what to do about it.
///
/// The disk buffer pool is a linear list of blocks ordered by the time
/// and date of access. A process runs every once in a while to take the
/// oldest of those pages and free them, during which it also has to
/// update the hardware page tables of any processes referencing them.
///
/// So on wonderland, most file access (wget, copy, md5sum, anything
/// else) is constantly putting fresh pages at the front of the list,
/// and karma index files, having been opened a while ago, are prime
/// candidates for being paged out. The reason they get paged out, as
/// far as I know, is that in any given second of execution, nowhere
/// near the entire index is getting accessed... so at some point, at
/// least one page gets sent back to disk (well, flushed from RAM).
/// Once that happens, a cascading effect sets in: the longer the
/// process waits, the older the other pages get, so more of them get
/// reclaimed, and the slower it gets, until karma is at a standstill,
/// waiting for pages to be brought back into RAM.
///
/// Now in an ideal world, karma would rapidly recover, and it can...
/// sometimes. The problem is that your karma job is accessing data all
/// over that index, which looks essentially like pure random I/O to the
/// underlying filesystem. There is about a 10 to 1 performance
/// difference between accessing the disk sequentially and accessing it
/// randomly.
///
/// So to make karma work better, the first thing I do when starting
/// karma is force it to read all of the disk pages in order. This pulls
/// the entire index into memory via sequential reads, which is the best
/// case possible. There are problems - for example, if three karma jobs
/// start at once, the disk I/O is no longer as purely sequential as we
/// would like. Also, if the filesystem is busy taking care of other
/// programs, then even if karma thinks it is forcing sequential I/O,
/// the net result looks more random. This happens when the system is
/// starting to break down (thrashing), and it will certainly stall,
/// look very very slow, or crash.
///
/// The upshot of all of this is that when a single reference is shared,
/// it is more likely that all the pages will be in the disk buffer pool
/// to begin with, which reduces startup time to nearly zero. It is also
/// the ideal situation in terms of sharing the same reference among,
/// say, 24 copies of karma on wonderland - the only cost is the
/// hardware page table that gets set up to point to all of the disk
/// buffers.
///
/// As I mentioned a paragraph back, the pages can still get swapped
/// out, even with dozens of karma jobs running. A workaround I created
/// is a program in utilities called mapfile - it simply accesses the
/// data in sequential order, over and over, to help ensure that all of
/// the pages stay at the head of the disk buffer pool and are therefore
/// less likely to get swapped out.
///
/// The benefit of such a program (mapfile) is greater on wonderland,
/// where a lot of processes are competing for memory and disk buffers.
///
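/// That sequential warm-up - in the spirit of the prefetch() method
/// declared below, though the real implementation lives in
/// MemoryMap.cpp - amounts to touching one byte per page, front to
/// back (a sketch using this class's members):
/// \code
/// // Sequential access lets the kernel issue ordered I/O and pulls
/// // the whole mapping into the disk buffer pool:
/// volatile char sink = 0;
/// for (size_t i = 0; i < mapped_length; i += page_size)
///     sink += ((char *) data)[i];
/// \endcode
///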
class MemoryMap
{
#if defined(_WIN32)
    HANDLE      file_handle;
    HANDLE      map_handle;
    DWORD       page_size;
#else
    int         fd;
    size_t      page_size;
#endif
    off_t       offset;
    size_t      mapped_length;
    size_t      total_length;
    bool        useMemoryMapFlag;
public:

    void *data;

    MemoryMap();

    virtual ~MemoryMap();

    void debug_print();

    void constructor_clear();

    void destructor_clear();

    virtual bool allocate();

    /// open a previously created mapped vector
    ///
    /// useMemoryMapFlag will determine whether it
    /// uses mmap() or malloc()/read() to populate
    /// the memory
    virtual bool open(const char * file, int flags = O_RDONLY);

    /// create the memory mapped file on disk
    ///
    /// a file will be created on disk with the header
    /// filled in. The caller must then populate elements
    /// using (*this).set(index, value).
    //
    virtual bool create(const char * file, size_t size);

    /// store in allocated memory (malloc), not mmap:
    ///
    /// This is for code that needs to handle more flexibly
    /// the case when an mmap() file _might_ be available,
    /// but if it is not, we want to load the data anyway as
    /// a convenience to the user.
    /// GenomeSequence::populateDBSNP does exactly this.
    //
    virtual bool create(size_t size);

    bool close();
    void test();
    size_t length()
    {
        return mapped_length;
    }

    char operator[](unsigned int index)
    {
        return ((char *) data)[index];
    };
    int prefetch();     // force pages into RAM

    //
    // set or unset use of the mmap() call in ::open().
    // This flag must be set before ::open() is called;
    // if it is set afterwards, it has no effect.
    //
    void useMemoryMap(bool flag=true)
    {
        useMemoryMapFlag = flag;
    }
};

#endif
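/*
 * Example usage (a sketch based on the declarations above; the file
 * name is hypothetical and error checking is omitted):
 *
 *     MemoryMap map;
 *     map.useMemoryMap(true);      // must be set before open()
 *     map.open("reference.umfa");  // map (or read) the whole file
 *     map.prefetch();              // touch every page to pull it into RAM
 *     char first = map[0];         // direct access to the mapped data
 *     map.close();
 */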