A pair of related data structures in the operating system, plus a few simple algorithms, explain why your processes are waiting forever. More...
#include <MemoryMap.h>
Public Member Functions

    void debug_print ()
    void constructor_clear ()
    void destructor_clear ()
    virtual bool open (const char *file, int flags=O_RDONLY)
        open a previously created mapped vector
    virtual bool create (const char *file, size_t size)
        create the memory mapped file on disk
    virtual bool create (size_t size)
        store in allocated memory (malloc), not mmap
    bool close ()
    void test ()
    size_t length ()
    char operator[] (unsigned int index)
    int prefetch ()
    void useMemoryMap (bool flag=true)

Public Attributes

    void * data
A pair of related data structures in the operating system, plus a few simple algorithms, explain why your processes are waiting forever.
The symptom you have is that they are getting little or no CPU time, as shown in the command 'top'. The machine will appear to have available CPU time (look at the Cpu(s): parameter - if less than 100%, you have available CPU). The real key, however, is to look at the 'top' column with the label 'S' - that is the status of the process, and crucial to understanding what is going on.
In your instance, the 'S' column for your karma jobs is 'D', which means the process is waiting for data - it is doing something that requires the filesystem to return data to it. Usually this is because of a C call like read() or write(), but it also happens in large processes whose memory was copied to disk and re-used for other purposes (this is called paging).
So, a bit of background on the operating system: there is a CPU scheduler that takes a list of waiting processes and picks one to run. If a job is waiting for the disk, there is no point in picking it, since it is blocked until the disk returns data. The scheduler marks the process with 'D' and moves on to the next process to schedule.
In terms of data structures, there are two we care about for this example. The first is a linear list of disk buffers that are stored in RAM and controlled by the operating system, usually called the disk buffer pool. When a program asks for data from the disk, this list can be scanned quickly to see if the data is already in RAM - if so, no disk operation needs to take place.
Now in the case of the normal Unix read() and write() calls, once the operating system has found the page, it copies the data into a buffer to be used by the process that requested it (for a read(); a write() goes in the opposite direction). This copy operation is slow and inefficient, but gets the job done.
So overall, you gain some efficiency in a large memory system by having this disk buffer pool data structure, since you aren't re-reading the disk over and over to get the same data that you already have in RAM. However, it is less efficient than it might be because of the extra buffer copying.
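To make that extra copy concrete, here is a minimal read() sketch. The file name is hypothetical; the point is that the kernel first brings the page into the disk buffer pool, then copies the requested bytes into the caller's buffer.

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main()
    {
        char buffer[4096];

        // "reference.umfa" is a hypothetical file name, used only for
        // illustration.
        int fd = open("reference.umfa", O_RDONLY);
        if (fd == -1) { perror("open"); return 1; }

        // The kernel finds (or reads) the page in the disk buffer pool,
        // then copies the bytes into 'buffer' - that copy is the extra
        // cost described above.
        ssize_t n = read(fd, buffer, sizeof(buffer));
        if (n == -1) perror("read");

        close(fd);
        return 0;
    }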
Now we come to memory mapped files and karma. The underlying system call of interest to us is mmap(), used in MemoryMap.cpp. What it does and how it works are important to understanding its benefits, and frankly, most people don't bother with it because it seems complex.
Two things are important to know. First, there is a data structure called the page table, which is mostly contained in the CPU hardware itself; all memory accesses by normal user processes like karma go through this hardware page table. Second, it is very fast for the operating system to build a page table that 'connects' a range of memory locations in your user program's address space to the disk buffer pool pages.
The combination of those two facts means that you can implement a 'zero copy' approach to reading data: the data in the disk buffer pool is directly readable by the program without the operating system ever having to copy it, as it does for read() or write().
So the benefit of mmap() is that when the underlying disk pages are already in the disk buffer pool, a hardware data structure gets built, then the program returns, and the data is available at full processor speed with no intervening copy of the data, or waiting for disk or anything else. It is as near to instantaneous as you can possibly get. This works whether it is 100 bytes or 100 gigabytes.
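For contrast with the read() sketch above, here is a minimal mmap() sketch of the zero-copy path (again with a hypothetical file name). After the mapping is built, the bytes are read in place from the disk buffer pool.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main()
    {
        // "reference.umfa" is again a hypothetical file name.
        int fd = open("reference.umfa", O_RDONLY);
        if (fd == -1) { perror("open"); return 1; }

        struct stat buf;
        if (fstat(fd, &buf)) { perror("fstat"); return 1; }

        // Build page table entries pointing at the disk buffer pool;
        // no data is copied here.
        char *data = (char *) mmap(NULL, buf.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (data == MAP_FAILED) { perror("mmap"); return 1; }

        // If the page is already resident, this access runs at full
        // processor speed with no copy and no disk wait.
        char first = data[0];
        (void) first;

        munmap(data, buf.st_size);
        close(fd);
        return 0;
    }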
So, the last part of the puzzle is why your program winds up in 'D' (data wait), and what to do about it.
The disk buffer pool is a linear list of blocks ordered by the time and date of access. A process runs every once in a while to take the oldest of those pages and free them; in doing so, it also has to update the hardware page tables of any processes referencing them.
So on wonderland, most file access (wget, copy, md5sum, anything else) is constantly putting fresh pages at the front of the list, and karma index files, having been opened a while ago, are prime candidates for being paged out. The reason they get paged out, as far as I know, is that in any given second of execution nowhere near the entire index is being accessed... so at some point at least one page gets sent back to disk (well, flushed from RAM). Once that happens, a cascade begins: the longer the job waits, the older its remaining pages get, the more of them get reclaimed, and the slower it runs, until karma is at a standstill, waiting for pages to be brought back into RAM.
Now in an ideal world, karma would rapidly recover, and sometimes it can. The problem is that your karma job is accessing data all over that index, so to the underlying filesystem it looks essentially like pure random I/O. There is roughly a 10 to 1 performance difference between accessing the disk sequentially and accessing it randomly.
So to make karma work better, the first thing I do when starting karma is force it to read all of the disk pages in order. This pulls the entire index into memory with sequential reads, which is the best case possible. There are problems: if three karma jobs start at once, the disk I/O is no longer as purely sequential as we would like, and if the filesystem is busy serving other programs, then even though karma thinks it is forcing sequential I/O, the net result looks more random. This is when the system starts to break down (thrashing), and it will certainly stall, look very very slow, or crash.
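A minimal sketch of that forced sequential read, assuming the index is already mapped (the helper name is illustrative; the class's prefetch() member presumably serves this purpose):

    #include <stddef.h>
    #include <unistd.h>

    // Touch one byte per page, in ascending order, so the kernel issues
    // sequential reads to fault the whole mapping in. The helper name is
    // illustrative; 'data' and 'length' stand in for the mapped region.
    static void forceSequentialRead(const char *data, size_t length)
    {
        const size_t pageSize = (size_t) sysconf(_SC_PAGESIZE);
        volatile char sink = 0;
        for (size_t offset = 0; offset < length; offset += pageSize)
            sink += data[offset];
        (void) sink;
    }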
The upshot of all of this is that when a single reference is shared, it is more likely that all of its pages are already in the disk buffer pool, which reduces startup time to nearly zero. It is also the ideal situation for sharing the same reference among, say, 24 copies of karma on wonderland - the only cost is the hardware page table that gets set up to point at the shared disk buffers.
As I mentioned a paragraph back, the pages can still get swapped out, even with dozens of karma jobs running. A workaround I created is a program in utilities called mapfile - it simply accesses the data repeatedly in sequential order, to help keep all of the pages at the head of the disk buffer pool, where they are less likely to get swapped out.
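A sketch of the same idea (the real mapfile lives in utilities; this loop and its sleep interval are illustrative only, not mapfile's actual source):

    #include <stddef.h>
    #include <unistd.h>

    // Periodically re-touch every page, in order, so the pages keep
    // moving back to the young end of the buffer pool's age-ordered
    // list. Runs until killed, like a daemon.
    static void keepPagesWarm(const char *data, size_t length)
    {
        const size_t pageSize = (size_t) sysconf(_SC_PAGESIZE);
        for (;;)
        {
            volatile char sink = 0;
            for (size_t offset = 0; offset < length; offset += pageSize)
                sink += data[offset];
            (void) sink;
            sleep(60);   // the interval is an assumption, not mapfile's setting
        }
    }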
The benefit of such a program (mapfile) is greater on wonderland, where a lot of processes are competing for memory and disk buffers.
Definition at line 155 of file MemoryMap.h.
bool MemoryMap::create ( size_t size ) [virtual]
store in allocated memory (malloc), not mmap:
This is for code that needs to handle more flexibly the case when an mmap() file _might_ be available; if it is not, we want to load the data anyway as a convenience to the user. GenomeSequence::populateDBSNP does exactly this.
Definition at line 279 of file MemoryMap.cpp.
References create().
{
    return create(NULL, size);
}
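A hedged sketch of the pattern this supports - the helper name, file name, and size are illustrative, and this is not the actual GenomeSequence::populateDBSNP code:

    #include <MemoryMap.h>

    bool loadIndex(MemoryMap &map, size_t expectedSize)
    {
        if (!map.open("dbSNP.bin"))        // open() returns false on success
            return true;                   // the pre-built mmap() file was there

        // No pre-built file: fall back to plain allocated memory with
        // create(size), then populate map.data by hand.
        if (map.create(expectedSize))      // create() returns true on failure
            return false;

        // ... fill map.data from the original input here ...
        return true;
    }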
bool MemoryMap::create ( const char * file, size_t size ) [virtual]
create the memory mapped file on disk
A file will be created on disk with the header filled in. The caller must then populate the elements using (*this).set(index, value).
Definition at line 229 of file MemoryMap.cpp.
References open().
Referenced by create().
{

    if (file==NULL)
    {
        data = calloc(size, 1);
        if (data==NULL) return true;
    }
    else
    {
        int mmap_prot_flag = PROT_READ | PROT_WRITE;

        fd = ::open(file, O_RDWR|O_CREAT|O_TRUNC, 0666);
        if (fd==-1)
        {
            fprintf(stderr, "MemoryMap::open: can't create file '%s'\n",(const char *) file);
            constructor_clear();
            return true;
        }

        lseek(fd, (off_t) size - 1, SEEK_SET);
        char ch = 0;
        if(write(fd, &ch, 1)!=1) {
            perror("MemoryMap::create:");
            throw std::logic_error("unable to write at end of file");
        }

        data = ::mmap(
            NULL,            // start
            size,
            mmap_prot_flag,  // protection flags
            MAP_SHARED,      // share/execute/etc flags
            fd,
            offset
        );
        if (data == MAP_FAILED)
        {
            ::close(fd);
            unlink(file);
            perror("MemoryMap::open");
            constructor_clear();
            return true;
        }
        mapped_length = size;
        total_length = size;
    }
    return false;
}
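A usage sketch for this variant (the file name and size are illustrative):

    #include <MemoryMap.h>

    bool buildMappedFile()
    {
        MemoryMap map;

        // create() returns true on failure, matching the code above.
        if (map.create("reference.umfa", 4096))
            return false;

        // The region at map.data is now writable; the description above
        // says to populate elements with (*this).set(index, value).
        map.close();
        return true;
    }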
bool MemoryMap::open ( const char * file, int flags = O_RDONLY ) [virtual]
open a previously created mapped vector
useMemoryMapFlag determines whether open() uses mmap() or malloc()/read() to populate the memory
Reimplemented in GenomeSequence, and MemoryMapArray< elementT, indexT, cookieVal, versionVal, accessorFunc, setterFunc, elementCount2BytesFunc, arrayHeaderClass >.
Definition at line 117 of file MemoryMap.cpp.
Referenced by create().
{

    struct stat buf;

    int mmap_prot_flag = PROT_READ;
    if (flags != O_RDONLY) mmap_prot_flag = PROT_WRITE;

    fd = ::open(file, flags);
    if (fd==-1)
    {
        fprintf(stderr, "MemoryMap::open: file %s not found\n", (const char *) file);
        constructor_clear();
        return true;
    }
    if (fstat(fd, &buf))
    {
        perror("MemoryMap::open");
        constructor_clear();
        return true;
    }
    mapped_length = buf.st_size;
    total_length = mapped_length;

    if (useMemoryMapFlag)
    {

        int additionalFlags = 0;

        // try this for amusement ... not portable:
        // additionalFlags |= MAP_HUGETLB;
        // #define USE_LOCKED_MMAP
#if defined(USE_LOCKED_MMAP)
        // MAP_POPULATE only makes sense if we are reading
        // the data
        if (flags != O_RDONLY)
        {
            // furthermore, according to Linux mmap page, populate only
            // works if the map is private.
            additionalFlags |= MAP_POPULATE;
            additionalFlags |= MAP_PRIVATE;
        }
        else
        {
            additionalFlags |= MAP_SHARED;
        }
#else
        additionalFlags |= MAP_SHARED;
#endif

        data = ::mmap(
            NULL,            // start
            mapped_length,
            mmap_prot_flag,  // protection flags
            additionalFlags,
            fd,
            offset
        );
        if (data == MAP_FAILED)
        {
            ::close(fd);
            std::cerr << "Error: Attempting to map " << mapped_length << " bytes of file "
                      << file << ":" << std::endl;
            perror("MemoryMap::open");
            constructor_clear();
            return true;
        }

#if defined(USE_LOCKED_MMAP)
        //
        // non-POSIX, so non portable.
        // This call is limited by the RLIMIT_MEMLOCK resource.
        //
        // In bash, "ulimit -l" shows the limit.
        //
        if (mlock(data, mapped_length))
        {
            std::cerr << "Warning: Attempting to lock " << mapped_length << " bytes of file " << file << ":" << std::endl;
            perror("unable to lock memory");
            // not a fatal error, so continue
        }
#endif

        // these really don't appear to have any greatly useful effect on
        // Linux. Last time I checked, the effect on Solaris and AIX was
        // exactly what was documented and was significant.
        //
        madvise(data, mapped_length, MADV_WILLNEED); // warning, this could hose the system

    }
    else
    {
        data = (void *) malloc(mapped_length);
        if (data==NULL)
        {
            ::close(fd);
            perror("MemoryMap::open");
            constructor_clear();
            return true;
        }
        ssize_t resultSize = read(fd, data, mapped_length);
        if (resultSize!=(ssize_t) mapped_length)
        {
            ::close(fd);
            perror("MemoryMap::open");
            constructor_clear();
            return true;
        }
    }
    return false;
}
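Finally, a usage sketch for open() itself (the file name is illustrative):

    #include <fcntl.h>
    #include <MemoryMap.h>

    bool readMappedFile()
    {
        MemoryMap map;
        map.useMemoryMap(true);             // use mmap(), not malloc()/read()

        // open() returns true on failure, matching the code above.
        if (map.open("reference.umfa", O_RDONLY))
            return false;

        char firstByte = map[0];            // operator[] indexes the mapped bytes
        (void) firstByte;

        map.close();
        return true;
    }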