The idea
The usual way to work with a file: open, then read(fd, buf, n) into your
own buffer, then work with that buffer. This is a copy: page cache into a user buffer.
With mmap: open, then mmap(fd), and you get a pointer that you read and
write like an ordinary array. No read(). The kernel loads pages lazily on
a page fault.
int fd = open("data.bin", O_RDONLY);
const unsigned char *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
// reading p[42] needs no read() call; its page is loaded on the first access
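A fuller version of the same sequence, as a minimal sketch: it takes the mapping length from fstat(), checks for MAP_FAILED, and unmaps when done. The helper name byte_at is hypothetical and only serves the illustration.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Return the byte at `index` in the file at `path`, or -1 on any error.
   byte_at is a hypothetical helper, not part of any library. */
int byte_at(const char *path, size_t index)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size <= 0 || index >= (size_t)st.st_size) {
        close(fd);
        return -1;
    }

    size_t len = (size_t)st.st_size;
    const unsigned char *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                 /* the mapping stays valid after close() */
    if (p == MAP_FAILED)
        return -1;

    int value = p[index];      /* first touch of this page triggers the fault */
    munmap((void *)p, len);
    return value;
}
```

Calling byte_at("data.bin", 42) reads one byte without any read() call; only the page holding that byte has to be loaded.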
Kinds of mappings
File-backed:
- MAP_PRIVATE is a private copy. Writes do not reach the file. Used to load binaries and libraries.
- MAP_SHARED makes changes visible to everyone who mapped the same file, and they are saved to disk. This is shared memory through a file.
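The difference between the two file-backed kinds can be observed directly. The sketch below writes one byte through a mapping of either kind; poke_first_byte is a hypothetical helper, and the file is assumed to exist and be non-empty.

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Overwrite the first byte of a non-empty file through a mapping of the
   given kind (MAP_PRIVATE or MAP_SHARED). Returns 0 on success, -1 on error.
   poke_first_byte is a hypothetical helper name. */
int poke_first_byte(const char *path, int kind, unsigned char value)
{
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    unsigned char *p = mmap(NULL, 1, PROT_READ | PROT_WRITE, kind, fd, 0);
    close(fd);
    if (p == MAP_FAILED)
        return -1;

    /* MAP_PRIVATE: the write lands in a private copy of the page.
       MAP_SHARED:  the write lands in the page shared with the file. */
    p[0] = value;

    if (kind == MAP_SHARED)
        msync(p, 1, MS_SYNC);  /* push the change to disk now */
    munmap(p, 1);
    return 0;
}
```

After poke_first_byte(path, MAP_PRIVATE, 'X') the file on disk is unchanged; after poke_first_byte(path, MAP_SHARED, 'Y') its first byte is 'Y'.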
Anonymous (MAP_ANONYMOUS, no fd):
- MAP_PRIVATE | MAP_ANONYMOUS is an ordinary heap. This is how malloc() works for large allocations.
- MAP_SHARED | MAP_ANONYMOUS is shared between fork children.
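The two anonymous kinds behave differently across fork(). In this sketch a child overwrites a value and the parent checks what it sees afterwards; value_after_child_write is a hypothetical helper name.

```c
#define _DEFAULT_SOURCE        /* for MAP_ANONYMOUS under strict -std modes */
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Map one int anonymously with `kind` (MAP_SHARED or MAP_PRIVATE), set it
   to 1, let a forked child overwrite it with 2, and return what the parent
   sees afterwards. Returns -1 on error. */
int value_after_child_write(int kind)
{
    int *cell = mmap(NULL, sizeof *cell, PROT_READ | PROT_WRITE,
                     kind | MAP_ANONYMOUS, -1, 0);
    if (cell == MAP_FAILED)
        return -1;
    *cell = 1;

    pid_t pid = fork();
    if (pid < 0) {
        munmap(cell, sizeof *cell);
        return -1;
    }
    if (pid == 0) {            /* child: write and exit immediately */
        *cell = 2;
        _exit(0);
    }
    waitpid(pid, NULL, 0);     /* parent: wait until the child is done */

    int seen = *cell;
    munmap(cell, sizeof *cell);
    return seen;
}
```

With MAP_SHARED the parent sees the child's 2. With MAP_PRIVATE the child wrote to its own copy-on-write page, so the parent still sees 1.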
Why use it
- Databases and search engines. Lucene, LMDB, and (optionally) SQLite map their data files, so the page cache does the caching for them. Postgres does not: it reads data files into its own shared_buffers with ordinary I/O.
- Loading binaries. Every ELF file and .so is mapped (/proc/<pid>/maps).
- Large files with random access. Map the file, jump around by offset. The kernel loads only the pages you touch.
- IPC between processes. MAP_SHARED on one file gives a fast shared region between independent processes, with no copies.
- Memory-mapped I/O for devices (/dev/mem).
/dev/shm: POSIX shared memory
This is a tmpfs, made for shared mappings:
int fd = shm_open("/myseg", O_CREAT | O_RDWR, 0600);
ftruncate(fd, 1024 * 1024);
void *p = mmap(NULL, 1024 * 1024, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
// p points to 1 MB; another process that calls shm_open("/myseg") and maps it sees the same data
The file actually lives in /dev/shm/myseg, which you can see with ls /dev/shm.
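Both sides of such an exchange might look like the sketch below: one function creates the segment and stores a message, the other attaches to the existing segment and copies the message out. The names map_segment, seg_write, seg_read, and SEG_SIZE are hypothetical. On glibc older than 2.34, shm_open needs linking with -lrt.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define SEG_SIZE 4096          /* hypothetical segment size for this sketch */

/* Map the named segment read-write; create and size it when `create` is set. */
static void *map_segment(const char *name, int create)
{
    int fd = shm_open(name, create ? (O_CREAT | O_RDWR) : O_RDWR, 0600);
    if (fd < 0)
        return MAP_FAILED;
    if (create && ftruncate(fd, SEG_SIZE) < 0) {
        close(fd);
        return MAP_FAILED;
    }
    void *p = mmap(NULL, SEG_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    return p;
}

/* Writer side: create the segment and store a NUL-terminated message. */
int seg_write(const char *name, const char *msg)
{
    char *p = map_segment(name, 1);
    if (p == MAP_FAILED)
        return -1;
    strncpy(p, msg, SEG_SIZE - 1);
    p[SEG_SIZE - 1] = '\0';
    munmap(p, SEG_SIZE);       /* the segment outlives this mapping */
    return 0;
}

/* Reader side: attach to an existing segment and copy the message out. */
int seg_read(const char *name, char *out, size_t cap)
{
    if (cap == 0)
        return -1;
    char *p = map_segment(name, 0);
    if (p == MAP_FAILED)
        return -1;
    strncpy(out, p, cap - 1);
    out[cap - 1] = '\0';
    munmap(p, SEG_SIZE);
    return 0;
}
```

The segment persists until shm_unlink(name) is called or the machine reboots, so a reader can attach long after the writer has exited.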
The size limit of /dev/shm defaults to 50% of RAM (the tmpfs default). Change it by remounting:
sudo mount -o remount,size=8G /dev/shm
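Does Postgres use /dev/shm for shared_buffers? No. Since version 9.3 it
allocates shared_buffers as an anonymous shared mapping
(MAP_SHARED | MAP_ANONYMOUS) and keeps only a small System V segment; older
versions used System V shared memory for everything. It does use /dev/shm for
dynamic shared memory (parallel query workers). /dev/shm is used more by
Redis, scientific computing, and video processing.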
When mmap hurts
- Network filesystems. Mappings on NFS and CIFS can show stale or inconsistent data, because cache coherence across the network is weak.
- Huge files on 32-bit. A 32-bit process has only 3-4 GB of virtual address space (VAS), so a larger file cannot be mapped in one piece.
- Append-heavy workloads. Growing a mapped file over and over is expensive; an ordinary write() is better.
- Fault storms. Random access to a cold file means thousands of major faults and can stall.
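One common workaround for the address-space limit (not something the list above prescribes) is to map a small window at a page-aligned offset instead of the whole file. A minimal sketch, with the hypothetical helper name byte_in_window; on a 32-bit build, large offsets also need a 64-bit off_t, for example by compiling with -D_FILE_OFFSET_BITS=64.

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Return the byte at `offset`, mapping only the page that contains it.
   Returns -1 on error. byte_in_window is a hypothetical helper name. */
int byte_in_window(const char *path, off_t offset)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0 || offset < 0 || offset >= st.st_size) {
        close(fd);
        return -1;
    }

    /* The offset passed to mmap must be a multiple of the page size. */
    off_t page = (off_t)sysconf(_SC_PAGESIZE);
    off_t start = offset - (offset % page);
    size_t len = (size_t)page;

    const unsigned char *win = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, start);
    close(fd);
    if (win == MAP_FAILED)
        return -1;

    int value = win[offset - start];
    munmap((void *)win, len);
    return value;
}
```

A real program would keep a larger window mapped and slide it only when an access falls outside, since every mmap/munmap pair costs a system call.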
madvise: hints to the kernel
madvise(addr, len, MADV_SEQUENTIAL); // I will read sequentially: large readahead
madvise(addr, len, MADV_RANDOM); // random access: disable readahead
madvise(addr, len, MADV_DONTNEED); // these pages are no longer needed: drop them
madvise(addr, len, MADV_HUGEPAGE); // merge into huge pages where possible
madvise(addr, len, MADV_DONTFORK); // do not duplicate in the child on fork
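In context, a hint is issued right after mmap and before the access pattern it describes. The sketch below scans a file front to back with MADV_SEQUENTIAL; sum_bytes_sequential is a hypothetical helper name. madvise is advisory, so its return value is deliberately ignored here.

```c
#define _DEFAULT_SOURCE        /* for madvise under strict -std modes */
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Sum all bytes of a file through a mapping, telling the kernel up front
   that the access is sequential. Returns -1 on error. */
long sum_bytes_sequential(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size <= 0) {
        close(fd);
        return -1;
    }

    size_t len = (size_t)st.st_size;
    const unsigned char *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);
    if (p == MAP_FAILED)
        return -1;

    /* A hint only: the scan is still correct if the kernel rejects it. */
    (void)madvise((void *)p, len, MADV_SEQUENTIAL);

    long sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += p[i];

    munmap((void *)p, len);
    return sum;
}
```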
Debugging and observation
cat /proc/<pid>/maps # all mappings of the process
pmap -x <pid> # with sizes and RSS
cat /proc/<pid>/smaps | head -30 # per region block:
# Size: 4 kB <- VSZ
# Rss: 4 kB <- actually in RAM
# Pss: 2 kB <- proportional (divided across shared)
# Shared_Clean: 4 kB <- shared, clean
# Private_Dirty: 0 kB
# Swap: 0 kB
PSS (Proportional Set Size) is the better metric for "actually in use": a 100MB shared library across 10 processes gives each one a PSS of 10MB, while each RSS is 100MB.
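The same totals can be collected programmatically by summing the Rss and Pss lines of smaps. A minimal sketch for Linux; smaps_totals is a hypothetical helper name, and passing "self" as the pid inspects the calling process.

```c
#include <stdio.h>

/* Sum the Rss and Pss lines of /proc/<pid>/smaps, in kB.
   Returns 0 on success, -1 if the file cannot be opened. */
int smaps_totals(const char *pid, long *rss_kb, long *pss_kb)
{
    char path[64];
    snprintf(path, sizeof path, "/proc/%s/smaps", pid);

    FILE *f = fopen(path, "r");
    if (!f)
        return -1;

    char line[256];
    long v;
    *rss_kb = 0;
    *pss_kb = 0;
    while (fgets(line, sizeof line, f)) {
        /* "Rss:" and "Pss:" must match exactly, so fields such as
           Pss_Anon or SwapPss are not counted. */
        if (sscanf(line, "Rss: %ld kB", &v) == 1)
            *rss_kb += v;
        else if (sscanf(line, "Pss: %ld kB", &v) == 1)
            *pss_kb += v;
    }
    fclose(f);
    return 0;
}
```

For a process that shares libraries with others, the Pss total comes out below the Rss total; summing Pss over all processes counts each shared page once.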
How it ties into other parts
- A shared file mapping is the page cache (page-cache). The same page is visible both through an ordinary read() and through mmap.
- Anonymous mmap is heap, and it swaps out when RAM runs short (swap).
- All processes (process-and-pid) share libc.so.6 through MAP_PRIVATE mappings (the read-only code pages are shared).
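The page cache link can be checked from a single process: a byte written with an ordinary pwrite() shows up in an existing shared mapping of the same file with no remapping. A minimal sketch for Linux on a local filesystem; write_then_peek is a hypothetical helper name, and the file is assumed to be non-empty.

```c
#define _DEFAULT_SOURCE        /* for pwrite under strict -std modes */
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map the first byte of a file, overwrite that byte with pwrite(), and
   return what the mapping shows afterwards. Returns -1 on error. */
int write_then_peek(const char *path, unsigned char value)
{
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size < 1) {
        close(fd);
        return -1;
    }

    /* volatile: every p[0] below is a real load from the mapped page. */
    const volatile unsigned char *p =
        mmap(NULL, 1, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return -1;
    }

    (void)p[0];                /* touch the page so it is mapped before the write */

    int seen = -1;
    if (pwrite(fd, &value, 1, 0) == 1)
        seen = p[0];           /* same page cache page, so the new byte is visible */

    munmap((void *)p, 1);
    close(fd);
    return seen;
}
```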