libStatGen Software
MemoryMap.h
/*
 *  Copyright (C) 2010  Regents of the University of Michigan
 *
 *   This program is free software: you can redistribute it and/or modify
 *   it under the terms of the GNU General Public License as published by
 *   the Free Software Foundation, either version 3 of the License, or
 *   (at your option) any later version.
 *
 *   This program is distributed in the hope that it will be useful,
 *   but WITHOUT ANY WARRANTY; without even the implied warranty of
 *   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 *   GNU General Public License for more details.
 *
 *   You should have received a copy of the GNU General Public License
 *   along with this program.  If not, see <http://www.gnu.org/licenses/>.
 */

#ifndef __MEMORYMAP_H
#define __MEMORYMAP_H
#include <sys/types.h>
#include <fcntl.h>

#if defined(_WIN32)
#include <windows.h>
#endif
///
/// There are a pair of related data structures in the operating system,
/// and also a few simple algorithms, that explain why your processes are
/// waiting forever.
///
/// The symptom you have is that they are getting little or no CPU time,
/// as shown by the command 'top'.  The machine will appear to have
/// available CPU time (look at the Cpu(s): parameter - if less than 100%,
/// you have available CPU).  The real key, however, is to look at the
/// 'top' column with the label 'S' - that is the status of the process,
/// and it is crucial to understanding what is going on.
///
/// In your instance, the 'S' column for your karma jobs is 'D', which
/// means the process is waiting for data.  This is because the process is
/// doing something that is waiting for the filesystem to return data to
/// it.  Usually this is because of a C call like read() or write(), but
/// it also happens in large processes whose memory was copied to disk and
/// the RAM re-used for other purposes (this is called paging).
///
/// So, a bit of background on the operating system... there is a CPU
/// scheduler that takes a list of waiting processes and picks one to
/// run - if a job is waiting for the disk, there is no point in picking
/// it to run, since it is blocked, waiting for the disk to return data.
/// The scheduler marks the process with 'D' and moves on to the next
/// process to schedule.
///
/// There are two data structures that matter for this example.  The
/// first is a linear list of disk buffers that are stored in RAM and
/// controlled by the operating system.  This is usually called the disk
/// buffer pool.  When a program asks for data from the disk, this list
/// can be scanned quickly to see if the data is already in RAM - if so,
/// no disk operation needs to take place.
///
/// Now in the case of the normal Unix read() and write() calls, when the
/// operating system is done finding the page, it copies the data into a
/// buffer to be used by the process that requested it (in the case of a
/// read() - a write() goes the opposite direction).  This copy operation
/// is slow and inefficient, but gets the job done.
///
/// So overall, you gain some efficiency in a large memory system by
/// having this disk buffer pool data structure, since you aren't
/// re-reading the disk over and over to get the same data that you
/// already have in RAM.  However, it is less efficient than it might be
/// because of the extra buffer copying.
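As a concrete illustration of the copy-based path, a plain read() fills a caller-supplied buffer with a second copy of data the kernel may already hold in its buffer pool. This sketch is ours, not part of libStatGen; the helper name is illustrative.

```cpp
#include <cassert>
#include <cstring>
#include <fcntl.h>
#include <unistd.h>

// Conventional I/O: the kernel locates the page in the disk buffer pool
// (or reads it from disk), then copies it into `buf` - an extra copy of
// data that may already be resident in RAM.
static ssize_t readIntoBuffer(const char *path, char *buf, size_t len)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    ssize_t n = read(fd, buf, len);   // the copy happens here
    close(fd);
    return n;
}
```

Every call pays for this copy, which is exactly the overhead the mmap() approach below avoids.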
///
/// Now we come to memory mapped files, and karma.  The underlying system
/// call of interest to us is mmap(), used in MemoryMap.cpp.  What it
/// does and how it works are important to understanding its benefits,
/// and frankly, most people don't care about it because it is
/// seemingly complex.
///
/// Two things are important to know: firstly, there is a data structure
/// called the page table, which is mostly contained in the CPU hardware
/// itself.  All memory accesses for normal user processes like karma go
/// through this hardware page table.  Secondly, it is very fast for the
/// operating system to put together a page table that 'connects' a bunch
/// of memory locations in your user program's address space to the disk
/// buffer pool pages.
///
/// The combination of those two facts means that you can implement a
/// 'zero copy' approach to reading data: the data that is in the disk
/// buffer pool is directly readable by the program without the operating
/// system ever having to actually copy the data, as it does for read()
/// or write().
///
/// So the benefit of mmap() is that when the underlying disk pages are
/// already in the disk buffer pool, a hardware data structure gets built,
/// the call returns, and the data is available at full processor speed
/// with no intervening copy of the data, no waiting for disk, nor
/// anything else.  It is as near to instantaneous as you can possibly
/// get.  This works whether it is 100 bytes or 100 gigabytes.
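A minimal sketch of that zero-copy path, independent of the MemoryMap class (the helper name and temp-file handling are illustrative assumptions, not library code):

```cpp
#include <cassert>
#include <string>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Map a file's first `len` bytes and read them through the mapping.
// The pointer returned by mmap() aliases the kernel's disk buffer pool
// pages directly; nothing is copied until we build the return string
// (done here only so the result is easy to check).
static std::string readViaMmap(const char *path, size_t len)
{
    int fd = open(path, O_RDONLY);
    assert(fd >= 0);
    void *p = mmap(nullptr, len, PROT_READ, MAP_SHARED, fd, 0);
    assert(p != MAP_FAILED);
    std::string result(static_cast<const char *>(p), len);
    munmap(p, len);
    close(fd);
    return result;
}
```

Between mmap() and munmap() the program can index the mapping like an ordinary array, which is what MemoryMap's operator[] below relies on.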
///
/// So, the last part of the puzzle is why your program winds up in 'D'
/// (data wait), and what to do about it.
///
/// The disk buffer pool is a linear list of blocks ordered by the time
/// and date of access.  A process runs every once in a while to take the
/// oldest of those pages and free them, during which it also has to
/// update the hardware page tables of any processes referencing them.
///
/// So on wonderland, most file access (wget, copy, md5sum, anything else)
/// is constantly putting fresh pages at the front of the list, and karma
/// index files, having been opened a while ago, are prime candidates for
/// being paged out.  The reason they get paged out, as far as I know, is
/// that in any given second of execution nowhere near the entire index
/// is getting accessed... so at some point at least one page gets sent
/// back to disk (well, flushed from RAM).  Once that happens, a
/// cascading effect follows: the longer the process waits, the older
/// the other pages get, the more of them get reclaimed, and the slower
/// it gets, until karma is at a standstill, waiting for pages to be
/// brought back into RAM.
///
/// Now in an ideal world, karma would rapidly recover, and it can...
/// sometimes.  The problem is that your karma job is accessing data all
/// over that index, so to the underlying filesystem it looks essentially
/// like pure random I/O.  There is about a 10 to 1 performance
/// difference between accessing the disk sequentially as compared to
/// randomly.
///
/// So to make karma work better, the first thing I do when starting
/// karma is force it to read all of the disk pages in order.  This
/// forces the entire index into memory via sequential reads, which is
/// the best case possible.  There are problems - for example, if three
/// karma jobs start at once, the disk I/O is no longer as purely
/// sequential as we would like.  Also, if the filesystem is busy taking
/// care of other programs, even if karma thinks it is forcing sequential
/// I/O, the net result looks more random.  This happens when the system
/// is starting to break down (thrashing), and it will certainly stall,
/// look very very slow, or crash.
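The sequential pre-read can be sketched as touching one byte per page, in ascending address order, so the kernel sees a strictly sequential access pattern. This is a simplified stand-in for what MemoryMap::prefetch() does, not the actual implementation; madvise(MADV_SEQUENTIAL) would be a complementary hint.

```cpp
#include <cassert>
#include <cstddef>
#include <unistd.h>

// Touch one byte per page, front to back.  On a freshly mmap()ed region
// this faults the pages in sequentially - the best case for the disk.
// Returns the number of pages touched.
static long touchSequentially(const char *base, size_t length)
{
    long pageSize = sysconf(_SC_PAGESIZE);
    volatile char sink = 0;
    long touched = 0;
    for (size_t offset = 0; offset < length; offset += (size_t)pageSize)
    {
        sink = (char)(sink + base[offset]);  // the read faults the page in
        ++touched;
    }
    (void)sink;
    return touched;
}
```

The `volatile` sink keeps the compiler from optimizing the reads away; without it the whole loop could be eliminated and nothing would be faulted in.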
///
/// The upshot of all of this is that when a single reference is shared,
/// it is more likely that all the pages will be in the disk buffer pool
/// to begin with, reducing startup time to nearly zero.  It is also the
/// ideal situation in terms of sharing the same reference among, say,
/// 24 copies of karma on wonderland - the only cost is the hardware
/// page table that gets set up to point to all of the disk buffers.
///
/// As I mentioned a paragraph back, the pages can still get swapped out,
/// even with dozens of karma jobs running.  A workaround I created is a
/// program in utilities called mapfile - it simply repeatedly accesses
/// the data in sequential order to help ensure that all of the pages are
/// at the head of the disk buffer pool, and therefore less likely to get
/// swapped out.
///
/// The benefit of such a program (mapfile) is greater on wonderland,
/// where a lot of processes are competing for memory and disk buffers.
///
///
class MemoryMap
{
#if defined(_WIN32)
    HANDLE      file_handle;
    HANDLE      map_handle;
    DWORD       page_size;
#else
    int fd;
    size_t page_size;
#endif
    off_t offset;
    size_t mapped_length;
    size_t total_length;
    bool    useMemoryMapFlag;
public:

    void *data;

    MemoryMap();

    virtual ~MemoryMap();

    void debug_print();

    void constructor_clear();

    void destructor_clear();

    virtual bool allocate();

    /// open a previously created mapped vector
    ///
    /// useMemoryMapFlag determines whether it
    /// uses mmap() or malloc()/read() to populate
    /// the memory
    virtual bool open(const char * file, int flags = O_RDONLY);

    /// create the memory mapped file on disk
    ///
    /// a file will be created on disk with the header
    /// filled in.  The caller must then populate elements
    /// using (*this).set(index, value).
    //
    virtual bool create(const char * file, size_t size);

    /// store in allocated memory (malloc), not mmap:
    ///
    /// This is for code that needs to handle more flexibly
    /// the case when an mmap() file _might_ be available,
    /// but if it is not, we want to load the data as a
    /// convenience to the user.
    /// GenomeSequence::populateDBSNP does exactly this.
    //
    virtual bool create(size_t size);

    bool close();
    void test();
    size_t length()
    {
        return mapped_length;
    }

    char operator[](unsigned int index)
    {
        return ((char *)data)[index];
    }
    int prefetch();     // force pages into RAM

    //
    // set or unset use of the mmap() call in ::open().
    // This flag must be set before ::open() is called;
    // if it is set afterwards, it has no effect.
    //
    void useMemoryMap(bool flag=true)
    {
        useMemoryMapFlag = flag;
    }
};

#endif