mmap: files and shared memory

The idea

The usual way to work with a file: open, then read(fd, buf, n) into your own buffer, then work with that buffer. This is a copy: page cache into a user buffer.

With mmap: open, then mmap(fd), and you get a pointer that you read and write like an ordinary array. No read(). The kernel loads pages lazily on a page fault.

int fd = open("data.bin", O_RDONLY);
unsigned char *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
// reading p[42] is the first access: the page fault loads that page from the file
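The fragment above leaves out the file length and error handling. A fuller sketch, assuming a hypothetical helper named count_byte_mmap (not from any library), maps a whole file read-only and scans it like an array:

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: count how often `target` occurs in the file at
 * `path` by mapping it read-only. Returns the count, or -1 on error. */
long count_byte_mmap(const char *path, unsigned char target)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {              /* mmap rejects a zero length */
        close(fd);
        return 0;
    }

    size_t len = (size_t)st.st_size;
    const unsigned char *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                          /* the mapping stays valid after close */
    if (p == MAP_FAILED)
        return -1;

    long count = 0;
    for (size_t i = 0; i < len; i++)    /* the first touch of a page faults it in */
        if (p[i] == target)
            count++;

    munmap((void *)p, len);
    return count;
}
```

There is no read() call and no user buffer: the loop reads the page cache directly through the mapping.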

Kinds of mappings

File-backed:

MAP_PRIVATE is a private copy-on-write mapping. Writes stay inside the process and never reach the file. Used to load binaries and libraries.
MAP_SHARED makes changes visible to everyone who mapped the same file, and the kernel writes them back to disk (msync() forces the write-back). This is shared memory through a file.
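The difference between the two file-backed kinds can be checked directly. The sketch below uses a hypothetical helper, byte_after_mapped_write, that writes one byte through a mapping and then reads the file with pread(). It relies on Linux behaviour, where a MAP_SHARED store is visible to pread() right away because both go through the same page cache; strictly portable code would call msync() first.

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: store 'X' through a mapping of `path` created with
 * `flags` (MAP_SHARED or MAP_PRIVATE), then read the same byte back from
 * the file with pread(). Returns the byte the file holds, or -1 on error.
 * The file must be at least one byte long. */
int byte_after_mapped_write(const char *path, int flags)
{
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    char *p = mmap(NULL, 1, PROT_READ | PROT_WRITE, flags, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return -1;
    }

    /* MAP_PRIVATE: the store triggers copy-on-write into a private page.
     * MAP_SHARED: the store dirties the page cache page itself. */
    p[0] = 'X';

    char from_file = 0;
    ssize_t n = pread(fd, &from_file, 1, 0);

    munmap(p, 1);
    close(fd);
    return n == 1 ? (unsigned char)from_file : -1;
}
```

For a file that starts with the byte 'a', the MAP_PRIVATE call leaves 'a' in the file and the MAP_SHARED call changes it to 'X'.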

Anonymous (MAP_ANONYMOUS, no fd):

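MAP_PRIVATE | MAP_ANONYMOUS is ordinary zero-filled private memory, the same kind as the heap. This is how malloc() serves large allocations (glibc switches to mmap above 128 KiB by default).
MAP_SHARED | MAP_ANONYMOUS is shared between a process and the children it forks afterwards.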
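A small experiment shows the two anonymous kinds side by side. The helper name value_after_child_write is made up for this sketch: it maps one int, lets a forked child store 42 in it, and reports what the parent sees afterwards.

```c
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS under strict -std= modes */
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: map one int anonymously with `flags`, let a forked
 * child store 42 in it, and return what the parent reads afterwards.
 * Returns -1 on error. */
int value_after_child_write(int flags)
{
    int *cell = mmap(NULL, sizeof *cell, PROT_READ | PROT_WRITE, flags, -1, 0);
    if (cell == MAP_FAILED)
        return -1;
    *cell = 0;

    pid_t pid = fork();
    if (pid < 0) {
        munmap(cell, sizeof *cell);
        return -1;
    }
    if (pid == 0) {            /* child: write and leave */
        *cell = 42;
        _exit(0);
    }
    waitpid(pid, NULL, 0);     /* parent: the child's store is done by now */

    int seen = *cell;
    munmap(cell, sizeof *cell);
    return seen;
}
```

With MAP_SHARED | MAP_ANONYMOUS the parent reads 42. With MAP_PRIVATE | MAP_ANONYMOUS the child's store lands in its own copy-on-write page, so the parent still reads 0.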

Why use it

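Databases and search engines. Lucene and LMDB map their data files, and SQLite can do so optionally (PRAGMA mmap_size). The page cache does the caching for them. Postgres is a counterexample: it reads its data files with ordinary read() and write() calls into its own shared_buffers.
Loading binaries. Every ELF executable and .so is mapped into the process; the mappings are listed in /proc/<pid>/maps.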
Large files with random access. Map the file, jump around by offset. The kernel loads only the pages you touch.
IPC between processes. MAP_SHARED on one file gives a fast shared region between independent processes, with no copies.
Memory-mapped I/O for devices (/dev/mem).

/dev/shm: POSIX shared memory

/dev/shm is a tmpfs mount that backs POSIX shared memory objects:

int fd = shm_open("/myseg", O_CREAT | O_RDWR, 0600);
ftruncate(fd, 1024 * 1024);   // a new object has size 0, so set the size before mapping
void *p = mmap(NULL, 1024 * 1024, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
// p points to 1 MB; another process calls shm_open("/myseg", O_RDWR, 0) and sees the same data

The file actually lives in /dev/shm/myseg, which you can see with ls /dev/shm.
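A writer and reader pair with error handling might look like the sketch below. The names shm_put, shm_get, and SEG_SIZE are illustrative, not part of any API, and in real use the two functions would run in different processes.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define SEG_SIZE 4096   /* illustrative segment size */

/* Writer side (hypothetical helper): create or reuse the segment and copy
 * the string `msg` into it. Returns 0 on success, -1 on error. */
int shm_put(const char *name, const char *msg)
{
    int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, SEG_SIZE) < 0) {      /* a new object has size 0 */
        close(fd);
        return -1;
    }
    char *p = mmap(NULL, SEG_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                              /* the mapping outlives the fd */
    if (p == MAP_FAILED)
        return -1;
    strncpy(p, msg, SEG_SIZE - 1);
    p[SEG_SIZE - 1] = '\0';
    munmap(p, SEG_SIZE);
    return 0;
}

/* Reader side (hypothetical helper): open the existing segment read-only
 * and copy its string into `out` (outlen must be at least 1).
 * Returns 0 on success, -1 on error. */
int shm_get(const char *name, char *out, size_t outlen)
{
    int fd = shm_open(name, O_RDONLY, 0);
    if (fd < 0)
        return -1;
    const char *p = mmap(NULL, SEG_SIZE, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);
    if (p == MAP_FAILED)
        return -1;
    strncpy(out, p, outlen - 1);
    out[outlen - 1] = '\0';
    munmap((void *)p, SEG_SIZE);
    return 0;
}
```

The segment persists until shm_unlink(name) is called or the machine reboots, so whoever owns it should unlink it when done. Older glibc versions may need -lrt at link time for shm_open.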

By default a tmpfs is capped at 50% of RAM, and that is the usual size of /dev/shm. Change it by remounting:

sudo mount -o remount,size=8G /dev/shm
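A program that sizes its segments dynamically can ask the filesystem how much room it has instead of assuming the default. This is a sketch built on statvfs(); the helper names fs_total_bytes and fs_avail_bytes are invented here.

```c
#include <sys/statvfs.h>

/* Hypothetical helper: total size in bytes of the filesystem that holds
 * `path`, or 0 if statvfs() fails. */
unsigned long long fs_total_bytes(const char *path)
{
    struct statvfs vfs;
    if (statvfs(path, &vfs) != 0)
        return 0;
    return (unsigned long long)vfs.f_blocks * vfs.f_frsize;
}

/* Hypothetical helper: bytes still available to unprivileged users on the
 * filesystem that holds `path`, or 0 if statvfs() fails. */
unsigned long long fs_avail_bytes(const char *path)
{
    struct statvfs vfs;
    if (statvfs(path, &vfs) != 0)
        return 0;
    return (unsigned long long)vfs.f_bavail * vfs.f_frsize;
}
```

Calling fs_avail_bytes("/dev/shm") before ftruncate() lets the program fail early rather than hit SIGBUS later, when a page of an oversized segment cannot be allocated.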

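Does Postgres use /dev/shm for shared_buffers? No. Since version 9.3 the main shared memory segment is an anonymous shared mmap plus a small System V segment; older versions used System V shared memory for all of it. Its dynamic shared memory (used by parallel queries) does default to POSIX shared memory on Linux, so those segments show up in /dev/shm. /dev/shm is used more by Redis, scientific computing, and video processing.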

When mmap hurts

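Network filesystems. On NFS and CIFS, changes made through a mapping on one client may not become visible to other clients when you expect, so consistency across the network is not guaranteed.
Huge files on 32-bit systems. The mapping must fit in the virtual address space (VAS), which is 4 GB in total on 32-bit and typically about 3 GB for user space.
Append-heavy workloads. Growing a mapped file over and over is expensive; an ordinary write() is better.
Fault storms. Random access to a cold file can trigger thousands of major faults, each one a blocking disk read, and the process stalls.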
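To see whether a workload is suffering from faults, the process can read its own counters with getrusage(). The sketch below uses made-up names (fault_counts, touch_and_count, fresh_anon_faults). It touches one byte per page and reports the faults taken; a fresh anonymous mapping serves as a safe demo, and it produces minor faults only.

```c
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS under strict -std= modes */
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

struct fault_counts { long minor; long major; };

/* Hypothetical helper: read one byte from each page of [base, base+len)
 * and report how many page faults the process took while doing so. */
struct fault_counts touch_and_count(volatile unsigned char *base, size_t len)
{
    struct rusage before, after;
    size_t page = (size_t)sysconf(_SC_PAGESIZE);

    getrusage(RUSAGE_SELF, &before);
    for (size_t off = 0; off < len; off += page)
        (void)base[off];                 /* volatile read: cannot be optimized out */
    getrusage(RUSAGE_SELF, &after);

    struct fault_counts fc = {
        after.ru_minflt - before.ru_minflt,   /* served from memory */
        after.ru_majflt - before.ru_majflt,   /* needed disk I/O */
    };
    return fc;
}

/* Demo: fault counts for a fresh anonymous mapping of `npages` pages.
 * Returns {-1, -1} if the mapping fails. */
struct fault_counts fresh_anon_faults(size_t npages)
{
    struct fault_counts fail = { -1, -1 };
    size_t len = npages * (size_t)sysconf(_SC_PAGESIZE);
    unsigned char *m = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (m == MAP_FAILED)
        return fail;
    struct fault_counts fc = touch_and_count(m, len);
    munmap(m, len);
    return fc;
}
```

Pointing touch_and_count at a mapping of a large file that is not in the page cache is where the major count climbs.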

madvise: hints to the kernel

madvise(addr, len, MADV_SEQUENTIAL);    // I will read sequentially: aggressive readahead
madvise(addr, len, MADV_RANDOM);        // random access: readahead is not useful
madvise(addr, len, MADV_DONTNEED);      // these pages are no longer needed: drop them (private anonymous pages read back as zeros)
madvise(addr, len, MADV_HUGEPAGE);      // allow transparent huge pages for this range
madvise(addr, len, MADV_DONTFORK);      // do not make this range available to the child after fork
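MADV_DONTNEED is the hint most likely to surprise, because on Linux it takes effect immediately and discards data. A minimal sketch, with a hypothetical helper name, shows the effect on private anonymous memory:

```c
#define _DEFAULT_SOURCE   /* for madvise() and MAP_ANONYMOUS under strict -std= modes */
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: fill one private anonymous page with 0xAB, drop it
 * with MADV_DONTNEED, and return the first byte as read afterwards.
 * Returns -1 on error. */
int byte_after_dontneed(void)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    memset(p, 0xAB, len);                  /* the page is now resident and dirty */

    if (madvise(p, len, MADV_DONTNEED) != 0) {
        munmap(p, len);
        return -1;
    }

    int value = p[0];   /* on Linux this access faults in a fresh zero page */
    munmap(p, len);
    return value;
}
```

The mapping itself stays valid after the call; only its contents are gone. For a file-backed mapping the next access reloads the page from the file instead.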

Debugging and observation

cat /proc/<pid>/maps              # all mappings of the process
pmap -x <pid>                     # with sizes and RSS
cat /proc/<pid>/smaps | head -30  # per-region block:
# Size:                4 kB    <- virtual size of the region
# Rss:                 4 kB    <- actually in RAM
# Pss:                 2 kB    <- proportional (shared pages divided among sharers)
# Shared_Clean:        4 kB    <- shared, clean
# Private_Dirty:       0 kB    <- private, modified
# Swap:                0 kB    <- swapped out

PSS (Proportional Set Size) is the better metric for "actually in use": a fully resident 100 MB shared library mapped by 10 processes gives each one a PSS of 10 MB, while each RSS reports the full 100 MB.
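To get a per-process total, the per-region values have to be summed. The sketch below assumes a hypothetical helper, sum_smaps_kb, that adds up one field across all regions of a file in smaps format.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: sum one field (for example "Pss:") over every
 * region in an smaps-format file. Returns the total in kB, or -1 if the
 * file cannot be opened. */
long sum_smaps_kb(const char *path, const char *field)
{
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;

    size_t flen = strlen(field);
    char line[512];
    long total = 0;

    while (fgets(line, sizeof line, f)) {
        long kb;
        /* The field name must start the line, so "Pss:" does not match
         * "SwapPss:" or "Pss_Anon:". */
        if (strncmp(line, field, flen) == 0 &&
            sscanf(line + flen, "%ld", &kb) == 1)
            total += kb;
    }

    fclose(f);
    return total;
}
```

Calling sum_smaps_kb("/proc/self/smaps", "Pss:") and the same with "Rss:" shows how far the two metrics diverge for the current process; the gap is the memory it shares with others.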

How it ties into other parts

A MAP_SHARED file mapping points straight at the page cache. The same page is visible both through an ordinary read() and through mmap.
Anonymous mmap memory behaves like the heap, and it is swapped out when RAM runs short.
All processes share libc.so.6 through MAP_PRIVATE mappings: the read-only code pages are never written, so one physical copy serves everyone.
