Kernel developers MM documentation

The below documents describe MM internals with different level of details ranging from notes and mailing list responses to elaborate descriptions of data structures and algorithms.

 

Architecture Page Table Helpers

Generic MM expects architectures (with MMU) to provide helpers to create, access and modify page table entries at various levels for different memory functions. These page table helpers need to conform to common semantics across platforms. The following tables describe the expected semantics, which can also be tested during boot via the CONFIG_DEBUG_VM_PGTABLE option. All future changes here or in the debug test need to be kept in sync.

PTE Page Table Helpers

pte_same

Tests whether both PTE entries are the same

pte_bad

Tests a non-table mapped PTE

pte_present

Tests a valid mapped PTE

pte_young

Tests a young PTE

pte_dirty

Tests a dirty PTE

pte_write

Tests a writable PTE

pte_special

Tests a special PTE

pte_protnone

Tests a PROT_NONE PTE

pte_devmap

Tests a ZONE_DEVICE mapped PTE

pte_soft_dirty

Tests a soft dirty PTE

pte_swp_soft_dirty

Tests a soft dirty swapped PTE

pte_mkyoung

Creates a young PTE

pte_mkold

Creates an old PTE

pte_mkdirty

Creates a dirty PTE

pte_mkclean

Creates a clean PTE

pte_mkwrite

Creates a writable PTE

pte_wrprotect

Creates a write protected PTE

pte_mkspecial

Creates a special PTE

pte_mkdevmap

Creates a ZONE_DEVICE mapped PTE

pte_mksoft_dirty

Creates a soft dirty PTE

pte_clear_soft_dirty

Clears a soft dirty PTE

pte_swp_mksoft_dirty

Creates a soft dirty swapped PTE

pte_swp_clear_soft_dirty

Clears a soft dirty swapped PTE

pte_mknotpresent

Invalidates a mapped PTE

ptep_get_and_clear

Clears a PTE

ptep_get_and_clear_full

Clears a PTE

ptep_test_and_clear_young

Clears young from a PTE

ptep_set_wrprotect

Converts into a write protected PTE

ptep_set_access_flags

Converts into a more permissive PTE
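
As a concrete illustration of these semantics, below is a minimal sketch in the spirit of the CONFIG_DEBUG_VM_PGTABLE checks (the actual tests live in mm/debug_vm_pgtable.c; helper prototypes vary between kernel versions, e.g. pte_mkwrite() takes an extra vma argument on newer kernels, so treat this as schematic):

#include <linux/mm.h>

static void __init pte_helper_sanity_sketch(unsigned long pfn, pgprot_t prot)
{
	pte_t pte = pfn_pte(pfn, prot);

	/* Accessed and dirty bits: the mk/clear pairs must round-trip. */
	WARN_ON(!pte_young(pte_mkyoung(pte)));
	WARN_ON(pte_young(pte_mkold(pte_mkyoung(pte))));
	WARN_ON(!pte_dirty(pte_mkdirty(pte)));
	WARN_ON(pte_dirty(pte_mkclean(pte_mkdirty(pte))));

	/* Write permission: pte_wrprotect() must undo pte_mkwrite(). */
	WARN_ON(!pte_write(pte_mkwrite(pte)));
	WARN_ON(pte_write(pte_wrprotect(pte_mkwrite(pte))));

	/* An entry must always compare equal to itself. */
	WARN_ON(!pte_same(pte, pte));
}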

PMD Page Table Helpers

pmd_same

Tests whether both PMD entries are the same

pmd_bad

Tests a non-table mapped PMD

pmd_leaf

Tests a leaf mapped PMD

pmd_huge

Tests a HugeTLB mapped PMD

pmd_trans_huge

Tests a Transparent Huge Page (THP) at PMD

pmd_present

Tests a valid mapped PMD

pmd_young

Tests a young PMD

pmd_dirty

Tests a dirty PMD

pmd_write

Tests a writable PMD

pmd_special

Tests a special PMD

pmd_protnone

Tests a PROT_NONE PMD

pmd_devmap

Tests a ZONE_DEVICE mapped PMD

pmd_soft_dirty

Tests a soft dirty PMD

pmd_swp_soft_dirty

Tests a soft dirty swapped PMD

pmd_mkyoung

Creates a young PMD

pmd_mkold

Creates an old PMD

pmd_mkdirty

Creates a dirty PMD

pmd_mkclean

Creates a clean PMD

pmd_mkwrite

Creates a writable PMD

pmd_wrprotect

Creates a write protected PMD

pmd_mkspecial

Creates a special PMD

pmd_mkdevmap

Creates a ZONE_DEVICE mapped PMD

pmd_mksoft_dirty

Creates a soft dirty PMD

pmd_clear_soft_dirty

Clears a soft dirty PMD

pmd_swp_mksoft_dirty

Creates a soft dirty swapped PMD

pmd_swp_clear_soft_dirty

Clears a soft dirty swapped PMD

pmd_mkinvalid

Invalidates a mapped PMD [1]

pmd_set_huge

Creates a PMD huge mapping

pmd_clear_huge

Clears a PMD huge mapping

pmdp_get_and_clear

Clears a PMD

pmdp_get_and_clear_full

Clears a PMD

pmdp_test_and_clear_young

Clears young from a PMD

pmdp_set_wrprotect

Converts into a write protected PMD

pmdp_set_access_flags

Converts into a more permissive PMD

PUD Page Table Helpers

pud_same

Tests whether both PUD entries are the same

pud_bad

Tests a non-table mapped PUD

pud_leaf

Tests a leaf mapped PUD

pud_huge

Tests a HugeTLB mapped PUD

pud_trans_huge

Tests a Transparent Huge Page (THP) at PUD

pud_present

Tests a valid mapped PUD

pud_young

Tests a young PUD

pud_dirty

Tests a dirty PUD

pud_write

Tests a writable PUD

pud_devmap

Tests a ZONE_DEVICE mapped PUD

pud_mkyoung

Creates a young PUD

pud_mkold

Creates an old PUD

pud_mkdirty

Creates a dirty PUD

pud_mkclean

Creates a clean PUD

pud_mkwrite

Creates a writable PUD

pud_wrprotect

Creates a write protected PUD

pud_mkdevmap

Creates a ZONE_DEVICE mapped PUD

pud_mkinvalid

Invalidates a mapped PUD [1]

pud_set_huge

Creates a PUD huge mapping

pud_clear_huge

Clears a PUD huge mapping

pudp_get_and_clear

Clears a PUD

pudp_get_and_clear_full

Clears a PUD

pudp_test_and_clear_young

Clears young from a PUD

pudp_set_wrprotect

Converts into a write protected PUD

pudp_set_access_flags

Converts into a more permissive PUD

HugeTLB Page Table Helpers

pte_huge

Tests a HugeTLB

pte_mkhuge

Creates a HugeTLB

huge_pte_dirty

Tests a dirty HugeTLB

huge_pte_write

Tests a writable HugeTLB

huge_pte_mkdirty

Creates a dirty HugeTLB

huge_pte_mkwrite

Creates a writable HugeTLB

huge_pte_wrprotect

Creates a write protected HugeTLB

huge_ptep_get_and_clear

Clears a HugeTLB

huge_ptep_set_wrprotect

Converts into a write protected HugeTLB

huge_ptep_set_access_flags

Converts into a more permissive HugeTLB

SWAP Page Table Helpers

__pte_to_swp_entry

Creates a swapped entry (arch) from a mapped PTE

__swp_to_pte_entry

Creates a mapped PTE from a swapped entry (arch)

__pmd_to_swp_entry

Creates a swapped entry (arch) from a mapped PMD

__swp_to_pmd_entry

Creates a mapped PMD from a swapped entry (arch)

is_migration_entry

Tests a migration (read or write) swapped entry

is_write_migration_entry

Tests a write migration swapped entry

make_migration_entry_read

Converts into read migration swapped entry

make_migration_entry

Creates a migration swapped entry (read or write)
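
To illustrate the migration entry helpers above, here is a short hedged sketch using the helper names listed in the table (newer kernels have renamed some of these, so treat the snippet as schematic; page is assumed to be an anonymous page under migration):

#include <linux/swapops.h>

/* Create a write migration entry for 'page', then downgrade it to read. */
swp_entry_t entry = make_migration_entry(page, 1);
WARN_ON(!is_migration_entry(entry));
WARN_ON(!is_write_migration_entry(entry));

make_migration_entry_read(&entry);
WARN_ON(is_write_migration_entry(entry));
WARN_ON(!is_migration_entry(entry));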

[1] 

 

Memory Balancing

Started Jan 2000 by Kanoj Sarcar <>

Memory balancing is needed for !__GFP_ATOMIC and !__GFP_KSWAPD_RECLAIM as well as for non __GFP_IO allocations.

The first reason why a caller may avoid reclaim is that the caller can not sleep due to holding a spinlock or is in interrupt context. The second may be that the caller is willing to fail the allocation without incurring the overhead of page reclaim. This may happen for opportunistic high-order allocation requests that have order-0 fallback options. In such cases, the caller may also wish to avoid waking kswapd.
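
For example, an opportunistic high-order allocation of this kind might look like the following sketch (the flag combination is illustrative: with neither __GFP_DIRECT_RECLAIM nor __GFP_KSWAPD_RECLAIM set, the request neither enters reclaim nor wakes kswapd):

#include <linux/gfp.h>

/* Opportunistic order-4 attempt: no reclaim, no kswapd wakeup, no warning. */
struct page *page = alloc_pages(__GFP_NOWARN | __GFP_NORETRY, 4);
if (!page)
	/* Fall back to a plain order-0 allocation. */
	page = alloc_pages(GFP_KERNEL, 0);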

__GFP_IO allocation requests are made to prevent file system deadlocks.

In the absence of non sleepable allocation requests, it seems detrimental to be doing balancing. Page reclamation can be kicked off lazily, that is, only when needed (aka zone free memory is 0), instead of making it a proactive process.

That being said, the kernel should try to fulfill requests for direct mapped pages from the direct mapped pool, instead of falling back on the dma pool, so as to keep the dma pool filled for dma requests (atomic or not). A similar argument applies to highmem and direct mapped pages. OTOH, if there is a lot of free dma pages, it is preferable to satisfy regular memory requests by allocating one from the dma pool, instead of incurring the overhead of regular zone balancing.

In 2.2, memory balancing/page reclamation would kick off only when the _total_ number of free pages fell below 1/64th of total memory. With the right ratio of dma and regular memory, it is quite possible that balancing would not be done even when the dma zone was completely empty. 2.2 has been running on production machines of varying memory sizes, and seems to be doing fine even with the presence of this problem. In 2.3, due to HIGHMEM, this problem is aggravated.

In 2.3, zone balancing can be done in one of two ways: depending on the zone size (and possibly of the size of lower class zones), we can decide at init time how many free pages we should aim for while balancing any zone. The good part is, while balancing, we do not need to look at sizes of lower class zones, the bad part is, we might do too frequent balancing due to ignoring possibly lower usage in the lower class zones. Also, with a slight change in the allocation routine, it is possible to reduce the memclass() macro to be a simple equality.

Another possible solution is that we balance only when the free memory of a zone _and_ all its lower class zones falls below 1/64th of the total memory in the zone and its lower class zones. This fixes the 2.2 balancing problem, and stays as close to 2.2 behavior as possible. Also, the balancing algorithm works the same way on the various architectures, which have different numbers and types of zones. If we wanted to get fancy, we could assign different weights to free pages in different zones in the future.

Note that if the size of the regular zone is huge compared to dma zone, it becomes less significant to consider the free dma pages while deciding whether to balance the regular zone. The first solution becomes more attractive then.

The appended patch implements the second solution. It also “fixes” two problems: first, kswapd is woken up as in 2.2 on low memory conditions for non-sleepable allocations. Second, the HIGHMEM zone is also balanced, so as to give a fighting chance for replace_with_highmem() to get a HIGHMEM page, as well as to ensure that HIGHMEM allocations do not fall back into the regular zone. This also makes sure that HIGHMEM pages are not leaked (for example, in situations where a HIGHMEM page is in the swapcache but is not being used by anyone).

kswapd also needs to know about the zones it should balance. kswapd is primarily needed in a situation where balancing can not be done, probably because all allocation requests are coming from intr context and all process contexts are sleeping. For 2.3, kswapd does not really need to balance the highmem zone, since intr context does not request highmem pages. kswapd looks at the zone_wake_kswapd field in the zone structure to decide whether a zone needs balancing.

Page stealing from process memory and shm is done if stealing the page would alleviate memory pressure on any zone in the page’s node that has fallen below its watermark.

watermark[WMARK_MIN/WMARK_LOW/WMARK_HIGH]/low_on_memory/zone_wake_kswapd: These are per-zone fields, used to determine when a zone needs to be balanced. When the number of pages falls below watermark[WMARK_MIN], the hysteretic field low_on_memory gets set. This stays set till the number of free pages becomes watermark[WMARK_HIGH]. When low_on_memory is set, page allocation requests will try to free some pages in the zone (providing GFP_WAIT is set in the request). Orthogonal to this is the decision to poke kswapd to free some zone pages. That decision is not hysteresis based, and is done when the number of free pages is below watermark[WMARK_LOW]; in which case zone_wake_kswapd is also set.
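
Schematically, the decisions described above look like the sketch below (pseudo-code, not actual kernel code; zone_free_pages(), try_to_free_zone_pages() and wake_up_kswapd() are hypothetical placeholders, while low_on_memory, zone_wake_kswapd and the watermark array are the fields named in the text):

/* Hysteresis for direct reclaim: set at WMARK_MIN, cleared at WMARK_HIGH. */
if (zone_free_pages(zone) < zone->watermark[WMARK_MIN])
	zone->low_on_memory = 1;
else if (zone_free_pages(zone) > zone->watermark[WMARK_HIGH])
	zone->low_on_memory = 0;

if (zone->low_on_memory && (gfp_mask & __GFP_WAIT))
	try_to_free_zone_pages(zone);		/* reclaim pages in this zone */

/* Orthogonal, non-hysteretic decision: poke kswapd below WMARK_LOW. */
if (zone_free_pages(zone) < zone->watermark[WMARK_LOW]) {
	zone->zone_wake_kswapd = 1;
	wake_up_kswapd(zone);
}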

(Good) Ideas that I have heard:

  1. Dynamic experience should influence balancing: the number of failed requests for a zone can be tracked and fed into the balancing scheme.

  2. Implement a replace_with_highmem()-like replace_with_regular() to preserve dma pages.

 

 

Cleancache

Motivation

Cleancache is a new optional feature provided by the VFS layer that potentially dramatically increases page cache effectiveness for many workloads in many environments at a negligible cost.

Cleancache can be thought of as a page-granularity victim cache for clean pages that the kernel’s pageframe replacement algorithm (PFRA) would like to keep around, but can’t since there isn’t enough memory. So when the PFRA “evicts” a page, it first attempts to use cleancache code to put the data contained in that page into “transcendent memory”, memory that is not directly accessible or addressable by the kernel and is of unknown and possibly time-varying size.

Later, when a cleancache-enabled filesystem wishes to access a page in a file on disk, it first checks cleancache to see if it already contains it; if it does, the page of data is copied into the kernel and a disk access is avoided.

Transcendent memory “drivers” for cleancache are currently implemented in Xen (using hypervisor memory) and zcache (using in-kernel compressed memory) and other implementations are in development.

FAQs are included below.

Implementation Overview

A cleancache “backend” that provides transcendent memory registers itself to the kernel’s cleancache “frontend” by calling cleancache_register_ops, passing a pointer to a cleancache_ops structure with funcs set appropriately. The functions provided must conform to certain semantics as follows:

Most important, cleancache is “ephemeral”. Pages which are copied into cleancache have an indefinite lifetime which is completely unknowable by the kernel and so may or may not still be in cleancache at any later time. Thus, as its name implies, cleancache is not suitable for dirty pages. Cleancache has complete discretion over what pages to preserve and what pages to discard and when.

Mounting a cleancache-enabled filesystem should call “init_fs” to obtain a pool id which, if positive, must be saved in the filesystem’s superblock; a negative return value indicates failure. A “put_page” will copy a (presumably about-to-be-evicted) page into cleancache and associate it with the pool id, a file key, and a page index into the file. (The combination of a pool id, a file key, and an index is sometimes called a “handle”.) A “get_page” will copy the page, if found, from cleancache into kernel memory. An “invalidate_page” will ensure the page no longer is present in cleancache; an “invalidate_inode” will invalidate all pages associated with the specified file; and, when a filesystem is unmounted, an “invalidate_fs” will invalidate all pages in all files specified by the given pool id and also surrender the pool id.
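
For illustration, a skeletal backend registration might look roughly as follows. This is a hedged sketch: the operation names come from the text above, but the exact cleancache_ops prototypes have varied across kernel versions, so the signatures (and struct cleancache_filekey) should be treated as schematic, and the my_* callbacks are hypothetical:

#include <linux/cleancache.h>

/* Hypothetical backend callbacks; bodies omitted. */
static int my_init_fs(size_t pagesize) { return 0; /* pool id, or < 0 */ }
static void my_put_page(int pool, struct cleancache_filekey key,
			pgoff_t index, struct page *page) { }
static int my_get_page(int pool, struct cleancache_filekey key,
		       pgoff_t index, struct page *page) { return -1; }
static void my_invalidate_page(int pool, struct cleancache_filekey key,
			       pgoff_t index) { }
static void my_invalidate_inode(int pool, struct cleancache_filekey key) { }
static void my_invalidate_fs(int pool) { }

static struct cleancache_ops my_cleancache_ops = {
	.init_fs	  = my_init_fs,
	.put_page	  = my_put_page,
	.get_page	  = my_get_page,
	.invalidate_page  = my_invalidate_page,
	.invalidate_inode = my_invalidate_inode,
	.invalidate_fs	  = my_invalidate_fs,
};

/* Somewhere in the backend's init code: */
cleancache_register_ops(&my_cleancache_ops);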

An “init_shared_fs”, like init_fs, obtains a pool id but tells cleancache to treat the pool as shared using a 128-bit UUID as a key. On systems that may run multiple kernels (such as hard partitioned or virtualized systems) that may share a clustered filesystem, and where cleancache may be shared among those kernels, calls to init_shared_fs that specify the same UUID will receive the same pool id, thus allowing the pages to be shared. Note that any security requirements must be imposed outside of the kernel (e.g. by “tools” that control cleancache). Or a cleancache implementation can simply disable shared_init by always returning a negative value.

If a get_page is successful on a non-shared pool, the page is invalidated (thus making cleancache an “exclusive” cache). On a shared pool, the page is NOT invalidated on a successful get_page so that it remains accessible to other sharers. The kernel is responsible for ensuring coherency between cleancache (shared or not), the page cache, and the filesystem, using cleancache invalidate operations as required.

Note that cleancache must enforce put-put-get coherency and get-get coherency. For the former, if two puts are made to the same handle but with different data, say AAA by the first put and BBB by the second, a subsequent get can never return the stale data (AAA). For get-get coherency, if a get for a given handle fails, subsequent gets for that handle will never succeed unless preceded by a successful put with that handle.

Last, cleancache provides no SMP serialization guarantees; if two different Linux threads are simultaneously putting and invalidating a page with the same handle, the results are indeterminate. Callers must lock the page to ensure serial behavior.

Cleancache Performance Metrics

If properly configured, monitoring of cleancache is done via debugfs in the /sys/kernel/debug/cleancache directory. The effectiveness of cleancache can be measured (across all filesystems) with:

succ_gets

number of gets that were successful

failed_gets

number of gets that failed

puts

number of puts attempted (all “succeed”)

invalidates

number of invalidates attempted

A backend implementation may provide additional metrics.

FAQ

  • Where’s the value? (Andrew Morton)

Cleancache provides a significant performance benefit to many workloads in many environments with negligible overhead by improving the effectiveness of the pagecache. Clean pagecache pages are saved in transcendent memory (RAM that is otherwise not directly addressable to the kernel); fetching those pages later avoids “refaults” and thus disk reads.

Cleancache (and its sister code “frontswap”) provide interfaces for this transcendent memory (aka “tmem”), which conceptually lies between fast kernel-directly-addressable RAM and slower DMA/asynchronous devices. Disallowing direct kernel or userland reads/writes to tmem is ideal when data is transformed to a different form and size (such as with compression) or secretly moved (as might be useful for write-balancing for some RAM-like devices). Evicted page-cache pages (and swap pages) are a great use for this kind of slower-than-RAM-but-much-faster-than-disk transcendent memory, and the cleancache (and frontswap) “page-object-oriented” specification provides a nice way to read and write – and indirectly “name” – the pages.

In the virtual case, the whole point of virtualization is to statistically multiplex physical resources across the varying demands of multiple virtual machines. This is really hard to do with RAM and efforts to do it well with no kernel change have essentially failed (except in some well-publicized special-case workloads). Cleancache – and frontswap – with a fairly small impact on the kernel, provide a huge amount of flexibility for more dynamic, flexible RAM multiplexing. Specifically, the Xen Transcendent Memory backend allows otherwise “fallow” hypervisor-owned RAM to not only be “time-shared” between multiple virtual machines, but the pages can be compressed and deduplicated to optimize RAM utilization. And when guest OS’s are induced to surrender underutilized RAM (e.g. with “self-ballooning”), page cache pages are the first to go, and cleancache allows those pages to be saved and reclaimed if overall host system memory conditions allow.

And the identical interface used for cleancache can be used in physical systems as well. The zcache driver acts as a memory-hungry device that stores pages of data in a compressed state. And the proposed “RAMster” driver shares RAM across multiple physical systems.

  • Why does cleancache have its sticky fingers so deep inside the filesystems and VFS? (Andrew Morton and Christoph Hellwig)

The core hooks for cleancache in VFS are in most cases a single line and the minimum set are placed precisely where needed to maintain coherency (via cleancache_invalidate operations) between cleancache, the page cache, and disk. All hooks compile into nothingness if cleancache is config’ed off and turn into a function-pointer-compare-to-NULL if config’ed on but no backend claims the ops functions, or to a compare-struct-element-to-negative if a backend claims the ops functions but a filesystem doesn’t enable cleancache.

Some filesystems are built entirely on top of VFS and the hooks in VFS are sufficient, so don’t require an “init_fs” hook; the initial implementation of cleancache didn’t provide this hook. But for some filesystems (such as btrfs), the VFS hooks are incomplete and one or more hooks in fs-specific code are required. And for some other filesystems, such as tmpfs, cleancache may be counterproductive. So it seemed prudent to require a filesystem to “opt in” to use cleancache, which requires adding a hook in each filesystem. Not all filesystems are supported by cleancache only because they haven’t been tested. The existing set should be sufficient to validate the concept, the opt-in approach means that untested filesystems are not affected, and the hooks in the existing filesystems should make it very easy to add more filesystems in the future.

The total impact of the hooks to existing fs and mm files is only about 40 lines added (not counting comments and blank lines).

  • Why not make cleancache asynchronous and batched so it can more easily interface with real devices with DMA instead of copying each individual page? (Minchan Kim)

The one-page-at-a-time copy semantics simplifies the implementation on both the frontend and backend and also allows the backend to do fancy things on-the-fly like page compression and page deduplication. And since the data is “gone” (copied into/out of the pageframe) before the cleancache get/put call returns, a great deal of race conditions and potential coherency issues are avoided. While the interface seems odd for a “real device” or for real kernel-addressable RAM, it makes perfect sense for transcendent memory.

  • Why is non-shared cleancache “exclusive”? And where is the page “invalidated” after a “get”? (Minchan Kim)

The main reason is to free up space in transcendent memory and to avoid unnecessary cleancache_invalidate calls. If you want inclusive, the page can be “put” immediately following the “get”. If put-after-get for inclusive becomes common, the interface could be easily extended to add a “get_no_invalidate” call.

The invalidate is done by the cleancache backend implementation.

  • What’s the performance impact?

Performance analysis has been presented at OLS’09 and LCA’10. Briefly, performance gains can be significant on most workloads, especially when memory pressure is high (e.g. when RAM is overcommitted in a virtual workload); and because the hooks are invoked primarily in place of or in addition to a disk read/write, overhead is negligible even in worst case workloads. Basically cleancache replaces I/O with memory-copy-CPU-overhead; on older single-core systems with slow memory-copy speeds, cleancache has little value, but in newer multicore machines, especially consolidated/virtualized machines, it has great value.

  • How do I add cleancache support for filesystem X? (Boaz Harrash)

Filesystems that are well-behaved and conform to certain restrictions can utilize cleancache simply by making a call to cleancache_init_fs at mount time. Unusual, misbehaving, or poorly layered filesystems must either add additional hooks and/or undergo extensive additional testing… or should just not enable the optional cleancache.

Some points for a filesystem to consider:

  • The FS should be block-device-based (e.g. a ram-based FS such as tmpfs should not enable cleancache)

  • To ensure coherency/correctness, the FS must ensure that all file removal or truncation operations either go through VFS or add hooks to do the equivalent cleancache “invalidate” operations

  • To ensure coherency/correctness, either inode numbers must be unique across the lifetime of the on-disk file OR the FS must provide an “encode_fh” function.

  • The FS must call the VFS superblock alloc and deactivate routines or add hooks to do the equivalent cleancache calls done there.

  • To maximize performance, all pages fetched from the FS should go through the do_mpage_readpage routine or the FS should add hooks to do the equivalent (cf. btrfs)

  • Currently, the FS blocksize must be the same as PAGESIZE. This is not an architectural restriction, but no backends currently support anything different.

  • A clustered FS should invoke the “shared_init_fs” cleancache hook to get best performance for some backends.

  • Why not use the KVA of the inode as the key? (Christoph Hellwig)

If cleancache would use the inode virtual address instead of inode/filehandle, the pool id could be eliminated. But, this won’t work because cleancache retains pagecache data pages persistently even when the inode has been pruned from the inode unused list, and only invalidates the data page if the file gets removed/truncated. So if cleancache used the inode kva, there would be potential coherency issues if/when the inode kva is reused for a different file. Alternately, if cleancache invalidated the pages when the inode kva was freed, much of the value of cleancache would be lost because the cache of pages in cleancache is potentially much larger than the kernel pagecache and is most useful if the pages survive inode cache removal.

  • Why is a global variable required?

The cleancache_enabled flag is checked in all of the frequently-used cleancache hooks. The alternative is a function call to check a static variable. Since cleancache is enabled dynamically at runtime, systems that don’t enable cleancache would suffer thousands (possibly tens-of-thousands) of unnecessary function calls per second. So the global variable allows cleancache to be enabled by default at compile time, but have insignificant performance impact when cleancache remains disabled at runtime.

  • Does cleancache work with KVM?

The memory model of KVM is sufficiently different that a cleancache backend may have less value for KVM. This remains to be tested, especially in an overcommitted system.

  • Does cleancache work in userspace? It sounds useful for memory hungry caches like web browsers. (Jamie Lokier)

No plans yet, though we agree it sounds useful, at least for apps that bypass the page cache (e.g. O_DIRECT).

Last updated: Dan Magenheimer, April 13 2011

 

Free Page Reporting

Free page reporting is an API by which a device can register to receive lists of pages that are currently unused by the system. This is useful in the case of virtualization where a guest is then able to use this data to notify the hypervisor that it is no longer using certain pages in memory.

For a driver, typically a balloon driver, to make use of this functionality it will allocate and initialize a page_reporting_dev_info structure. The field within the structure it must populate is the “report” function pointer used to process the scatterlist. It must also guarantee that it can handle at least PAGE_REPORTING_CAPACITY worth of scatterlist entries per call to the function. A call to page_reporting_register will register the page reporting interface with the reporting framework assuming no other page reporting devices are already registered.
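
A hedged sketch of such a driver registration follows; my_report() and my_prdev are hypothetical names, and the callback prototype (prdev, scatterlist, nents) reflects the page reporting framework at the time of writing, so treat the details as illustrative:

#include <linux/page_reporting.h>

/* Process up to PAGE_REPORTING_CAPACITY scatterlist entries of free pages. */
static int my_report(struct page_reporting_dev_info *prdev,
		     struct scatterlist *sgl, unsigned int nents)
{
	/* e.g. hand the ranges to the hypervisor, then return 0 on success. */
	return 0;
}

static struct page_reporting_dev_info my_prdev = {
	.report = my_report,
};

/* In the driver's probe/init path: */
int err = page_reporting_register(&my_prdev);

/* ... and before the driver is removed: */
page_reporting_unregister(&my_prdev);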

Once registered the page reporting API will begin reporting batches of pages to the driver. The API will start reporting pages 2 seconds after the interface is registered and will continue to do so 2 seconds after any page of a sufficiently high order is freed.

Pages reported will be stored in the scatterlist passed to the reporting function with the final entry having the end bit set in entry nent - 1. While pages are being processed by the report function they will not be accessible to the allocator. Once the report function has been completed the pages will be returned to the free area from which they were obtained.

Prior to removing a driver that is making use of free page reporting it is necessary to call page_reporting_unregister to have the page_reporting_dev_info structure that is currently in use by free page reporting removed. Doing this will prevent further reports from being issued via the interface. If another driver or the same driver is registered it is possible for it to resume where the previous driver had left off in terms of reporting free pages.

Alexander Duyck, Dec 04, 2019

 

Frontswap

Frontswap provides a “transcendent memory” interface for swap pages. In some environments, dramatic performance savings may be obtained because swapped pages are saved in RAM (or a RAM-like device) instead of a swap disk.

(Note, frontswap – and cleancache (merged at 3.0) – are the “frontends” and the only necessary changes to the core kernel for transcendent memory; all other supporting code – the “backends” – is implemented as drivers. See the LWN.net article for a detailed overview of frontswap and related kernel parts.)

Frontswap is so named because it can be thought of as the opposite of a “backing” store for a swap device. The storage is assumed to be a synchronous concurrency-safe page-oriented “pseudo-RAM device” conforming to the requirements of transcendent memory (such as Xen’s “tmem”, or in-kernel compressed memory, aka “zcache”, or future RAM-like devices); this pseudo-RAM device is not directly accessible or addressable by the kernel and is of unknown and possibly time-varying size. The driver links itself to frontswap by calling frontswap_register_ops to set the frontswap_ops funcs appropriately and the functions it provides must conform to certain policies as follows:

An “init” prepares the device to receive frontswap pages associated with the specified swap device number (aka “type”). A “store” will copy the page to transcendent memory and associate it with the type and offset associated with the page. A “load” will copy the page, if found, from transcendent memory into kernel memory, but will NOT remove the page from transcendent memory. An “invalidate_page” will remove the page from transcendent memory and an “invalidate_area” will remove ALL pages associated with the swap type (e.g., like swapoff) and notify the “device” to refuse further stores with that swap type.
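
A skeletal backend built from the operations described above might look like the sketch below. The my_* functions are hypothetical, and the frontswap_ops prototypes shown are approximate (they have changed across kernel versions), so treat this as schematic:

#include <linux/frontswap.h>

static void my_init(unsigned type) { /* prepare for swap device 'type' */ }
static int my_store(unsigned type, pgoff_t offset, struct page *page)
{ return -1; /* reject: the kernel then writes to the real swap device */ }
static int my_load(unsigned type, pgoff_t offset, struct page *page)
{ return -1; /* not found */ }
static void my_invalidate_page(unsigned type, pgoff_t offset) { }
static void my_invalidate_area(unsigned type) { }

static struct frontswap_ops my_frontswap_ops = {
	.init		 = my_init,
	.store		 = my_store,
	.load		 = my_load,
	.invalidate_page = my_invalidate_page,
	.invalidate_area = my_invalidate_area,
};

/* Registered once during backend initialization: */
frontswap_register_ops(&my_frontswap_ops);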

Once a page is successfully stored, a matching load on the page will normally succeed. So when the kernel finds itself in a situation where it needs to swap out a page, it first attempts to use frontswap. If the store returns success, the data has been successfully saved to transcendent memory and a disk write and, if the data is later read back, a disk read are avoided. If a store returns failure, transcendent memory has rejected the data, and the page can be written to swap as usual.

If a backend chooses, frontswap can be configured as a “writethrough cache” by calling frontswap_writethrough(). In this mode, the reduction in swap device writes is lost (and also a non-trivial performance advantage) in order to allow the backend to arbitrarily “reclaim” space used to store frontswap pages to more completely manage its memory usage.

Note that if a page is stored and the page already exists in transcendent memory (a “duplicate” store), either the store succeeds and the data is overwritten, or the store fails AND the page is invalidated. This ensures stale data may never be obtained from frontswap.

If properly configured, monitoring of frontswap is done via debugfs in the /sys/kernel/debug/frontswap directory. The effectiveness of frontswap can be measured (across all swap devices) with:

failed_stores

how many store attempts have failed

loads

how many loads were attempted (all should succeed)

succ_stores

how many store attempts have succeeded

invalidates

how many invalidates were attempted

A backend implementation may provide additional metrics.

FAQ

  • Where’s the value?

When a workload starts swapping, performance falls through the floor. Frontswap significantly increases performance in many such workloads by providing a clean, dynamic interface to read and write swap pages to “transcendent memory” that is otherwise not directly addressable to the kernel. This interface is ideal when data is transformed to a different form and size (such as with compression) or secretly moved (as might be useful for write-balancing for some RAM-like devices). Swap pages (and evicted page-cache pages) are a great use for this kind of slower-than-RAM-but-much-faster-than-disk “pseudo-RAM device” and the frontswap (and cleancache) interface to transcendent memory provides a nice way to read and write – and indirectly “name” – the pages.

Frontswap – and cleancache – with a fairly small impact on the kernel, provides a huge amount of flexibility for more dynamic, flexible RAM utilization in various system configurations:

In the single kernel case, aka “zcache”, pages are compressed and stored in local memory, thus increasing the total anonymous pages that can be safely kept in RAM. Zcache essentially trades off CPU cycles used in compression/decompression for better memory utilization. Benchmarks have shown little or no impact when memory pressure is low while providing a significant performance improvement (25%+) on some workloads under high memory pressure.

“RAMster” builds on zcache by adding “peer-to-peer” transcendent memory support for clustered systems. Frontswap pages are locally compressed as in zcache, but then “remotified” to another system’s RAM. This allows RAM to be dynamically load-balanced back-and-forth as needed, i.e. when system A is overcommitted, it can swap to system B, and vice versa. RAMster can also be configured as a memory server so many servers in a cluster can swap, dynamically as needed, to a single server configured with a large amount of RAM… without pre-configuring how much of the RAM is available for each of the clients!

In the virtual case, the whole point of virtualization is to statistically multiplex physical resources across the varying demands of multiple virtual machines. This is really hard to do with RAM and efforts to do it well with no kernel changes have essentially failed (except in some well-publicized special-case workloads). Specifically, the Xen Transcendent Memory backend allows otherwise “fallow” hypervisor-owned RAM to not only be “time-shared” between multiple virtual machines, but the pages can be compressed and deduplicated to optimize RAM utilization. And when guest OS’s are induced to surrender underutilized RAM (e.g. with “selfballooning”), sudden unexpected memory pressure may result in swapping; frontswap allows those pages to be swapped to and from hypervisor RAM (if overall host system memory conditions allow), thus mitigating the potentially awful performance impact of unplanned swapping.

A KVM implementation is underway and has been RFC’ed to lkml. And, using frontswap, investigation is also underway on the use of NVM as a memory extension technology.

  • Sure there may be performance advantages in some situations, but what’s the space/time overhead of frontswap?

If CONFIG_FRONTSWAP is disabled, every frontswap hook compiles into nothingness and the only overhead is a few extra bytes per swapon’ed swap device. If CONFIG_FRONTSWAP is enabled but no frontswap “backend” registers, there is one extra global variable compared to zero for every swap page read or written. If CONFIG_FRONTSWAP is enabled AND a frontswap backend registers AND the backend fails every “store” request (i.e. provides no memory despite claiming it might), CPU overhead is still negligible – and since every frontswap fail precedes a swap page write-to-disk, the system is highly likely to be I/O bound and using a small fraction of a percent of a CPU will be irrelevant anyway.

As for space, if CONFIG_FRONTSWAP is enabled AND a frontswap backend registers, one bit is allocated for every swap page for every swap device that is swapon’d. This is added to the EIGHT bits (which was sixteen until about 2.6.34) that the kernel already allocates for every swap page for every swap device that is swapon’d. (Hugh Dickins has observed that frontswap could probably steal one of the existing eight bits, but let’s worry about that minor optimization later.) For very large swap disks (which are rare) on a standard 4K pagesize, this is 1MB per 32GB swap.

When swap pages are stored in transcendent memory instead of written out to disk, there is a side effect that this may create more memory pressure that can potentially outweigh the other advantages. A backend, such as zcache, must implement policies to carefully (but dynamically) manage memory limits to ensure this doesn’t happen.

  • OK, how about a quick overview of what this frontswap patch does in terms that a kernel hacker can grok?

Let’s assume that a frontswap “backend” has registered during kernel initialization; this registration indicates that this frontswap backend has access to some “memory” that is not directly accessible by the kernel. Exactly how much memory it provides is entirely dynamic and random.

Whenever a swap-device is swapon’d frontswap_init() is called, passing the swap device number (aka “type”) as a parameter. This notifies frontswap to expect attempts to “store” swap pages associated with that number.

Whenever the swap subsystem is readying a page to write to a swap device (c.f swap_writepage()), frontswap_store is called. Frontswap consults with the frontswap backend and if the backend says it does NOT have room, frontswap_store returns -1 and the kernel swaps the page to the swap device as normal. Note that the response from the frontswap backend is unpredictable to the kernel; it may choose to never accept a page, it could accept every ninth page, or it might accept every page. But if the backend does accept a page, the data from the page has already been copied and associated with the type and offset, and the backend guarantees the persistence of the data. In this case, frontswap sets a bit in the “frontswap_map” for the swap device corresponding to the page offset on the swap device to which it would otherwise have written the data.
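
In code, the store-side hook described here looks roughly like the following simplified sketch of the kernel's swap-out path (the real logic lives in mm/page_io.c; swap_writepage_sketch is a hypothetical stand-in and error handling is omitted):

#include <linux/frontswap.h>
#include <linux/swap.h>

/* Schematic swap-out path: try frontswap first, else write to disk. */
int swap_writepage_sketch(struct page *page, struct writeback_control *wbc)
{
	if (frontswap_store(page) == 0) {
		/* Backend accepted the page; no disk I/O is needed. */
		set_page_writeback(page);
		unlock_page(page);
		end_page_writeback(page);
		return 0;
	}
	/* Backend refused; fall back to the normal swap device write. */
	return __swap_writepage(page, wbc, end_swap_bio_write);
}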

When the swap subsystem needs to swap-in a page (swap_readpage()), it first calls frontswap_load() which checks the frontswap_map to see if the page was earlier accepted by the frontswap backend. If it was, the page of data is filled from the frontswap backend and the swap-in is complete. If not, the normal swap-in code is executed to obtain the page of data from the real swap device.

So every time the frontswap backend accepts a page, a swap device read and (potentially) a swap device write are replaced by a “frontswap backend store” and (possibly) a “frontswap backend load”, which are presumably much faster.

  • Can’t frontswap be configured as a “special” swap device that is just higher priority than any real swap device (e.g. like zswap, or maybe swap-over-nbd/NFS)?

No. First, the existing swap subsystem doesn’t allow for any kind of swap hierarchy. Perhaps it could be rewritten to accommodate a hierarchy, but this would require fairly drastic changes. Even if it were rewritten, the existing swap subsystem uses the block I/O layer which assumes a swap device is fixed size and any page in it is linearly addressable. Frontswap barely touches the existing swap subsystem, and works around the constraints of the block I/O subsystem to provide a great deal of flexibility and dynamicity.

For example, the acceptance of any swap page by the frontswap backend is entirely unpredictable. This is critical to the definition of frontswap backends because it grants completely dynamic discretion to the backend. In zcache, one cannot know a priori how compressible a page is. “Poorly” compressible pages can be rejected, and “poorly” can itself be defined dynamically depending on current memory constraints.

Further, frontswap is entirely synchronous whereas a real swap device is, by definition, asynchronous and uses block I/O. The block I/O layer is not only unnecessary, but may perform “optimizations” that are inappropriate for a RAM-oriented device including delaying the write of some pages for a significant amount of time. Synchrony is required to ensure the dynamicity of the backend and to avoid thorny race conditions that would unnecessarily and greatly complicate frontswap and/or the block I/O subsystem. That said, only the initial “store” and “load” operations need be synchronous. A separate asynchronous thread is free to manipulate the pages stored by frontswap. For example, the “remotification” thread in RAMster uses standard asynchronous kernel sockets to move compressed frontswap pages to a remote machine. Similarly, a KVM guest-side implementation could do in-guest compression and use “batched” hypercalls.

In a virtualized environment, the dynamicity allows the hypervisor (or host OS) to do “intelligent overcommit”. For example, it can choose to accept pages only until host-swapping might be imminent, then force guests to do their own swapping.

There is a downside to the transcendent memory specifications for frontswap: Since any “store” might fail, there must always be a real slot on a real swap device to swap the page. Thus frontswap must be implemented as a “shadow” to every swapon’d device with the potential capability of holding every page that the swap device might have held and the possibility that it might hold no pages at all. This means that frontswap cannot contain more pages than the total of swapon’d swap devices. For example, if NO swap device is configured on some installation, frontswap is useless. Swapless portable devices can still use frontswap but a backend for such devices must configure some kind of “ghost” swap device and ensure that it is never used.

  • Why this weird definition about “duplicate stores”? If a page has been previously successfully stored, can’t it always be successfully overwritten?

Nearly always it can, but no, sometimes it cannot. Consider an example where data is compressed and the original 4K page has been compressed to 1K. Now an attempt is made to overwrite the page with data that is non-compressible and so would take the entire 4K. But the backend has no more space. In this case, the store must be rejected. Whenever frontswap rejects a store that would overwrite, it also must invalidate the old data and ensure that it is no longer accessible. Since the swap subsystem then writes the new data to the real swap device, this is the correct course of action to ensure coherency.

  • What is frontswap_shrink for?

When the (non-frontswap) swap subsystem swaps out a page to a real swap device, that page is only taking up low-value pre-allocated disk space. But if frontswap has placed a page in transcendent memory, that page may be taking up valuable real estate. The frontswap_shrink routine allows code outside of the swap subsystem to force pages out of the memory managed by frontswap and back into kernel-addressable memory. For example, in RAMster, a “suction driver” thread will attempt to “repatriate” pages sent to a remote machine back to the local machine; this is driven using the frontswap_shrink mechanism when memory pressure subsides.

  • Why does the frontswap patch create the new include file swapfile.h?

The frontswap code depends on some swap-subsystem-internal data structures that have, over the years, moved back and forth between static and global. This seemed a reasonable compromise: Define them as global but declare them in a new include file that isn’t included by the large number of source files that include swap.h.

Dan Magenheimer, last updated April 9, 2012

 

High Memory Handling

By: Peter Zijlstra <>

High memory (highmem) is used when the size of physical memory approaches or exceeds the maximum size of virtual memory. At that point it becomes impossible for the kernel to keep all of the available physical memory mapped at all times. This means the kernel needs to start using temporary mappings of the pieces of physical memory that it wants to access.

The part of (physical) memory not covered by a permanent mapping is what we refer to as ‘highmem’. There are various architecture dependent constraints on where exactly that border lies.

In the i386 arch, for example, we choose to map the kernel into every process’s VM space so that we don’t have to pay the full TLB invalidation costs for kernel entry/exit. This means the available virtual memory space (4GiB on i386) has to be divided between user and kernel space.

The traditional split for architectures using this approach is 3:1, 3GiB for userspace and the top 1GiB for kernel space:

+--------+ 0xffffffff
| Kernel |
+--------+ 0xc0000000
|        |
| User   |
|        |
+--------+ 0x00000000

This means that the kernel can at most map 1GiB of physical memory at any one time, but because we need virtual address space for other things - including temporary maps to access the rest of the physical memory - the actual direct map will typically be less (usually around ~896MiB).

Other architectures that have mm context tagged TLBs can have separate kernel and user maps. Some hardware (like some ARMs), however, have limited virtual space when they use mm context tags.

The kernel contains several ways of creating temporary mappings:

  • vmap(). This can be used to make a long duration mapping of multiple physical pages into a contiguous virtual space. It needs global synchronization to unmap.

  • kmap(). This permits a short duration mapping of a single page. It needs global synchronization, but is amortized somewhat. It is also prone to deadlocks when using in a nested fashion, and so it is not recommended for new code.

  • kmap_atomic(). This permits a very short duration mapping of a single page. Since the mapping is restricted to the CPU that issued it, it performs well, but the issuing task is therefore required to stay on that CPU until it has finished, lest some other task displace its mappings.

    kmap_atomic() may also be used by interrupt contexts, since it does not sleep and the caller may not sleep until after kunmap_atomic() is called.

    It may be assumed that k[un]map_atomic() won’t fail.

When and where to use kmap_atomic() is straightforward. It is used when code wants to access the contents of a page that might be allocated from high memory (see __GFP_HIGHMEM), for example a page in the pagecache. The API has two functions, and they can be used in a manner similar to the following:

/* Find the page of interest. */
struct page *page = find_get_page(mapping, offset);

/* Gain access to the contents of that page. */
void *vaddr = kmap_atomic(page);

/* Do something to the contents of that page. */
memset(vaddr, 0, PAGE_SIZE);

/* Unmap that page. */
kunmap_atomic(vaddr);

Note that the kunmap_atomic() call takes the result of the kmap_atomic() call not the argument.

If you need to map two pages because you want to copy from one page to another you need to keep the kmap_atomic calls strictly nested, like:

vaddr1 = kmap_atomic(page1);
vaddr2 = kmap_atomic(page2);
memcpy(vaddr1, vaddr2, PAGE_SIZE);
kunmap_atomic(vaddr2);
kunmap_atomic(vaddr1);

The cost of creating temporary mappings can be quite high. The arch has to manipulate the kernel’s page tables, the data TLB and/or the MMU’s registers.

If CONFIG_HIGHMEM is not set, then the kernel will try and create a mapping simply with a bit of arithmetic that will convert the page struct address into a pointer to the page contents rather than juggling mappings about. In such a case, the unmap operation may be a null operation.
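
That “bit of arithmetic” is essentially the permanent lowmem mapping: the virtual address can be computed directly from the page frame number, roughly as in the sketch below (compare page_address()/lowmem_page_address() in the kernel; trivial_kmap/trivial_kunmap are hypothetical names):

#include <linux/mm.h>
#include <linux/pfn.h>

/* Non-highmem case: no mapping is created, only an address computation. */
static inline void *trivial_kmap(struct page *page)
{
	return __va(PFN_PHYS(page_to_pfn(page)));
}

static inline void trivial_kunmap(struct page *page)
{
	/* Nothing to undo: lowmem is permanently mapped. */
}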

If CONFIG_MMU is not set, then there can be no temporary mappings and no highmem. In such a case, the arithmetic approach will also be used.

The i386 arch, under some circumstances, will permit you to stick up to 64GiB of RAM into your 32-bit machine. This has a number of consequences:

  • Linux needs a page-frame structure for each page in the system and the pageframes need to live in the permanent mapping, which means:

  • you can have 896M/sizeof(struct page) page-frames at most; with struct page being 32-bytes that would end up being something in the order of 112G worth of pages; the kernel, however, needs to store more than just page-frames in that memory…

  • PAE makes your page tables larger - which slows the system down as more data has to be accessed to traverse in TLB fills and the like. One advantage is that PAE has more PTE bits and can provide advanced features like NX and PAT.

The general recommendation is that you don’t use more than 8GiB on a 32-bit machine - although more might work for you and your workload, you’re pretty much on your own - don’t expect kernel developers to really care much if things come apart.

 

HugeTLB Pages

Overview

The intent of this file is to give a brief summary of hugetlbpage support in the Linux kernel. This support is built on top of the multiple page size support provided by most modern architectures. For example, x86 CPUs normally support 4K and 2M (1G if architecturally supported) page sizes, the ia64 architecture supports multiple page sizes (4K, 8K, 64K, 256K, 1M, 4M, 16M, 256M) and ppc64 supports 4K and 16M. A TLB is a cache of virtual-to-physical translations. Typically this is a very scarce resource on a processor. Operating systems try to make the best use of the limited number of TLB resources. This optimization is more critical now as bigger and bigger physical memories (several GBs) are more readily available.

Users can use the huge page support in Linux kernel by either using the mmap system call or standard SYSV shared memory system calls (shmget, shmat).
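
For example, a userspace program can request an anonymous huge page mapping with MAP_HUGETLB as in the hedged sketch below (this assumes a kernel built with hugetlb support and huge pages available in the pool; the length must be a multiple of the huge page size):

#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>

#define LENGTH (256UL * 1024 * 1024)	/* multiple of the huge page size */

int main(void)
{
	void *addr = mmap(NULL, LENGTH, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (addr == MAP_FAILED) {
		perror("mmap");
		return EXIT_FAILURE;
	}

	/* Touch the memory so the huge pages are actually faulted in. */
	((char *)addr)[0] = 1;

	munmap(addr, LENGTH);
	return 0;
}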

First the Linux kernel needs to be built with the CONFIG_HUGETLBFS (present under “File systems”) and CONFIG_HUGETLB_PAGE (selected automatically when CONFIG_HUGETLBFS is selected) configuration options.

The /proc/meminfo file provides information about the total number of persistent hugetlb pages in the kernel’s huge page pool. It also displays default huge page size and information about the number of free, reserved and surplus huge pages in the pool of huge pages of default size. The huge page size is needed for generating the proper alignment and size of the arguments to system calls that map huge page regions.

The output of cat /proc/meminfo will include lines like:

HugePages_Total: uuu
HugePages_Free:  vvv
HugePages_Rsvd:  www
HugePages_Surp:  xxx
Hugepagesize:    yyy kB
Hugetlb:         zzz kB

where:

HugePages_Total

is the size of the pool of huge pages.

HugePages_Free

is the number of huge pages in the pool that are not yet allocated.

HugePages_Rsvd

is short for “reserved,” and is the number of huge pages for which a commitment to allocate from the pool has been made, but no allocation has yet been made. Reserved huge pages guarantee that an application will be able to allocate a huge page from the pool of huge pages at fault time.

HugePages_Surp

is short for “surplus,” and is the number of huge pages in the pool above the value in /proc/sys/vm/nr_hugepages. The maximum number of surplus huge pages is controlled by /proc/sys/vm/nr_overcommit_hugepages.

Hugepagesize

is the default huge page size (in kB).

Hugetlb

is the total amount of memory (in kB), consumed by huge pages of all sizes. If huge pages of different sizes are in use, this number will exceed HugePages_Total * Hugepagesize. To get more detailed information, please, refer to /sys/kernel/mm/hugepages (described below).

/proc/filesystems should also show a filesystem of type “hugetlbfs” configured in the kernel.

/proc/sys/vm/nr_hugepages indicates the current number of “persistent” huge pages in the kernel’s huge page pool. “Persistent” huge pages will be returned to the huge page pool when freed by a task. A user with root privileges can dynamically allocate more or free some persistent huge pages by increasing or decreasing the value of nr_hugepages.

Pages that are used as huge pages are reserved inside the kernel and cannot be used for other purposes. Huge pages cannot be swapped out under memory pressure.

Once a number of huge pages have been pre-allocated to the kernel huge page pool, a user with appropriate privilege can use either the mmap system call or shared memory system calls to use the huge pages. See the discussion of Using Huge Pages, below.

The administrator can allocate persistent huge pages on the kernel boot command line by specifying the “hugepages=N” parameter, where ‘N’ = the number of huge pages requested. This is the most reliable method of allocating huge pages as memory has not yet become fragmented.

Some platforms support multiple huge page sizes. To allocate huge pages of a specific size, one must precede the huge pages boot command parameters with a huge page size selection parameter “hugepagesz=<size>”. <size> must be specified in bytes with optional scale suffix [kKmMgG]. The default huge page size may be selected with the “default_hugepagesz=<size>” boot parameter.

Hugetlb boot command line parameter semantics

hugepagesz

Specify a huge page size. Used in conjunction with hugepages parameter to preallocate a number of huge pages of the specified size. Hence, hugepagesz and hugepages are typically specified in pairs such as:

hugepagesz=2M hugepages=512

hugepagesz can only be specified once on the command line for a specific huge page size. Valid huge page sizes are architecture dependent.

hugepages

Specify the number of huge pages to preallocate. This typically follows a valid hugepagesz or default_hugepagesz parameter. However, if hugepages is the first or only hugetlb command line parameter it implicitly specifies the number of huge pages of default size to allocate. If the number of huge pages of default size is implicitly specified, it can not be overwritten by a hugepagesz,hugepages parameter pair for the default size.

For example, on an architecture with 2M default huge page size:

hugepages=256 hugepagesz=2M hugepages=512

will result in 256 2M huge pages being allocated and a warning message indicating that the hugepages=512 parameter is ignored. If a hugepages parameter is preceded by an invalid hugepagesz parameter, it will be ignored.

default_hugepagesz

Specify the default huge page size. This parameter can only be specified once on the command line. default_hugepagesz can optionally be followed by the hugepages parameter to preallocate a specific number of huge pages of default size. The number of default sized huge pages to preallocate can also be implicitly specified as mentioned in the hugepages section above. Therefore, on an architecture with 2M default huge page size:

hugepages=256
default_hugepagesz=2M hugepages=256
hugepages=256 default_hugepagesz=2M

will all result in 256 2M huge pages being allocated. Valid default huge page size is architecture dependent.

When multiple huge page sizes are supported, /proc/sys/vm/nr_hugepages indicates the current number of pre-allocated huge pages of the default size. Thus, one can use the following command to dynamically allocate/deallocate default sized persistent huge pages:

echo 20 > /proc/sys/vm/nr_hugepages

This command will try to adjust the number of default sized huge pages in the huge page pool to 20, allocating or freeing huge pages, as required.

On a NUMA platform, the kernel will attempt to distribute the huge page pool over the set of allowed nodes specified by the NUMA memory policy of the task that modifies nr_hugepages. The default for the allowed nodes–when the task has default memory policy–is all on-line nodes with memory. Allowed nodes with insufficient available, contiguous memory for a huge page will be silently skipped when allocating persistent huge pages. See the discussion below of the interaction of task memory policy, cpusets and per node attributes with the allocation and freeing of persistent huge pages.

The success or failure of huge page allocation depends on the amount of physically contiguous memory that is present in system at the time of the allocation attempt. If the kernel is unable to allocate huge pages from some nodes in a NUMA system, it will attempt to make up the difference by allocating extra pages on other nodes with sufficient available contiguous memory, if any.

System administrators may want to put this command in one of the local rc init files. This will enable the kernel to allocate huge pages early in the boot process when the possibility of getting physical contiguous pages is still very high. Administrators can verify the number of huge pages actually allocated by checking the sysctl or meminfo. To check the per node distribution of huge pages in a NUMA system, use:

cat /sys/devices/system/node/node*/meminfo | fgrep Huge

/proc/sys/vm/nr_overcommit_hugepages specifies how large the pool of huge pages can grow, if more huge pages than /proc/sys/vm/nr_hugepages are requested by applications. Writing any non-zero value into this file indicates that the hugetlb subsystem is allowed to try to obtain that number of “surplus” huge pages from the kernel’s normal page pool, when the persistent huge page pool is exhausted. As these surplus huge pages become unused, they are freed back to the kernel’s normal page pool.

When increasing the huge page pool size via nr_hugepages, any existing surplus pages will first be promoted to persistent huge pages. Then, additional huge pages will be allocated, if necessary and if possible, to fulfill the new persistent huge page pool size.

The administrator may shrink the pool of persistent huge pages for the default huge page size by setting the nr_hugepages sysctl to a smaller value. The kernel will attempt to balance the freeing of huge pages across all nodes in the memory policy of the task modifying nr_hugepages. Any free huge pages on the selected nodes will be freed back to the kernel’s normal page pool.

Caveat: Shrinking the persistent huge page pool via nr_hugepages such that it becomes less than the number of huge pages in use will convert the balance of the in-use huge pages to surplus huge pages. This will occur even if the number of surplus pages would exceed the overcommit value. As long as this condition holds–that is, until nr_hugepages+nr_overcommit_hugepages is increased sufficiently, or the surplus huge pages go out of use and are freed– no more surplus huge pages will be allowed to be allocated.

With support for multiple huge page pools at run-time available, much of the huge page userspace interface in /proc/sys/vm has been duplicated in sysfs. The /proc interfaces discussed above have been retained for backwards compatibility. The root huge page control directory in sysfs is:

/sys/kernel/mm/hugepages

For each huge page size supported by the running kernel, a subdirectory will exist, of the form:

hugepages-${size}kB

Inside each of these directories, the same set of files will exist:

nr_hugepages
nr_hugepages_mempolicy
nr_overcommit_hugepages
free_hugepages
resv_hugepages
surplus_hugepages

which function as described above for the default huge page-sized case.

Interaction of Task Memory Policy with Huge Page Allocation/Freeing

Whether huge pages are allocated and freed via the /proc interface or the /sysfs interface using the nr_hugepages_mempolicy attribute, the NUMA nodes from which huge pages are allocated or freed are controlled by the NUMA memory policy of the task that modifies the nr_hugepages_mempolicy sysctl or attribute. When the nr_hugepages attribute is used, mempolicy is ignored.

The recommended method to allocate or free huge pages to/from the kernel huge page pool, using the nr_hugepages example above, is:

numactl --interleave <node-list> echo 20 \
    >/proc/sys/vm/nr_hugepages_mempolicy

or, more succinctly:

numactl -m <node-list> echo 20 >/proc/sys/vm/nr_hugepages_mempolicy

This will allocate or free abs(20 - nr_hugepages) to or from the nodes specified in <node-list>, depending on whether number of persistent huge pages is initially less than or greater than 20, respectively. No huge pages will be allocated nor freed on any node not included in the specified <node-list>.

When adjusting the persistent hugepage count via nr_hugepages_mempolicy, any memory policy mode–bind, preferred, local or interleave–may be used. The resulting effect on persistent huge page allocation is as follows:

  1. Regardless of mempolicy mode [see the NUMA memory policy documentation], persistent huge pages will be distributed across the node or nodes specified in the mempolicy as if “interleave” had been specified. However, if a node in the policy does not contain sufficient contiguous memory for a huge page, the allocation will not “fallback” to the nearest neighbor node with sufficient contiguous memory. To do this would cause undesirable imbalance in the distribution of the huge page pool, or possibly, allocation of persistent huge pages on nodes not allowed by the task’s memory policy.

  2. One or more nodes may be specified with the bind or interleave policy. If more than one node is specified with the preferred policy, only the lowest numeric id will be used. Local policy will select the node where the task is running at the time the nodes_allowed mask is constructed. For local policy to be deterministic, the task must be bound to a cpu or cpus in a single node. Otherwise, the task could be migrated to some other node at any time after launch and the resulting node will be indeterminate. Thus, local policy is not very useful for this purpose. Any of the other mempolicy modes may be used to specify a single node.

  3. The nodes allowed mask will be derived from any non-default task mempolicy, whether this policy was set explicitly by the task itself or one of its ancestors, such as numactl. This means that if the task is invoked from a shell with non-default policy, that policy will be used. One can specify a node list of “all” with numactl --interleave or --membind [-m] to achieve interleaving over all nodes in the system or cpuset.

  4. Any task mempolicy specified–e.g., using numactl–will be constrained by the resource limits of any cpuset in which the task runs. Thus, there will be no way for a task with non-default policy running in a cpuset with a subset of the system nodes to allocate huge pages outside the cpuset without first moving to a cpuset that contains all of the desired nodes.

  5. Boot-time huge page allocation attempts to distribute the requested number of huge pages over all on-line nodes with memory.

Per Node Hugepages Attributes

A subset of the contents of the root huge page control directory in sysfs, described above, will be replicated under the system device of each NUMA node with memory in:

/sys/devices/system/node/node[0-9]*/hugepages/

Under this directory, the subdirectory for each supported huge page size contains the following attribute files:

nr_hugepages
free_hugepages
surplus_hugepages

The ‘free_’ and ‘surplus_’ attribute files are read-only. They return the number of free and surplus [overcommitted] huge pages, respectively, on the parent node.

The nr_hugepages attribute returns the total number of huge pages on the specified node. When this attribute is written, the number of persistent huge pages on the parent node will be adjusted to the specified value, if sufficient resources exist, regardless of the task’s mempolicy or cpuset constraints.

Note that the number of overcommit and reserve pages remain global quantities, as we don’t know until fault time, when the faulting task’s mempolicy is applied, from which node the huge page allocation will be attempted.

Using Huge Pages

If the user applications are going to request huge pages using mmap system call, then it is required that system administrator mount a file system of type hugetlbfs:

mount -t hugetlbfs \
    -o uid=<value>,gid=<value>,mode=<value>,pagesize=<value>,size=<value>,\
    min_size=<value>,nr_inodes=<value> none /mnt/huge

This command mounts a (pseudo) filesystem of type hugetlbfs on the directory /mnt/huge. Any file created on /mnt/huge uses huge pages.

The uid and gid options set the owner and group of the root of the file system. By default the uid and gid of the current process are taken.

The mode option sets the mode of the root of the file system to value & 01777. This value is given in octal. By default the value 0755 is picked.

If the platform supports multiple huge page sizes, the pagesize option can be used to specify the huge page size and associated pool. pagesize is specified in bytes. If pagesize is not specified the platform’s default huge page size and associated pool will be used.

The size option sets the maximum value of memory (huge pages) allowed for that filesystem (/mnt/huge). The size option can be specified in bytes, or as a percentage of the specified huge page pool (nr_hugepages). The size is rounded down to HPAGE_SIZE boundary.

The min_size option sets the minimum value of memory (huge pages) allowed for the filesystem. min_size can be specified in the same way as size, either bytes or a percentage of the huge page pool. At mount time, the number of huge pages specified by min_size are reserved for use by the filesystem. If there are not enough free huge pages available, the mount will fail. As huge pages are allocated to the filesystem and freed, the reserve count is adjusted so that the sum of allocated and reserved huge pages is always at least min_size.

The option nr_inodes sets the maximum number of inodes that /mnt/huge can use.

If the size, min_size or nr_inodes option is not provided on the command line then no limits are set.

For the pagesize, size, min_size and nr_inodes options, you can use [G|g]/[M|m]/[K|k] to represent giga/mega/kilo. For example, size=2K has the same meaning as size=2048.

While read system calls are supported on files that reside on hugetlb file systems, write system calls are not.

Regular chown, chgrp, and chmod commands (with right permissions) could be used to change the file attributes on hugetlbfs.
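
For illustration, the sketch below (assuming the /mnt/huge mount shown above, an arbitrary file name and an arbitrary 256MB length) creates a file on the hugetlbfs mount and maps it; it can only succeed if the huge page pool can back the mapping:

/* Illustrative only; assumes hugetlbfs is mounted on /mnt/huge as
 * shown above and that the huge page pool is large enough. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define LENGTH (256UL * 1024 * 1024)    /* must be a multiple of the huge page size */

int main(void)
{
        int fd = open("/mnt/huge/example", O_CREAT | O_RDWR, 0755);
        if (fd < 0) {
                perror("open");
                exit(1);
        }

        void *addr = mmap(NULL, LENGTH, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, 0);
        if (addr == MAP_FAILED) {
                perror("mmap");
                unlink("/mnt/huge/example");
                exit(1);
        }

        memset(addr, 0, LENGTH);        /* touching the mapping faults in huge pages */

        munmap(addr, LENGTH);
        close(fd);
        unlink("/mnt/huge/example");
        return 0;
}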

Also, it is important to note that no such mount command is required if applications are going to use only shmat/shmget system calls or mmap with MAP_HUGETLB. For an example of how to use mmap with MAP_HUGETLB see map_hugetlb below.
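
A minimal sketch of the MAP_HUGETLB case follows; the 256MB length is an arbitrary choice, must be a multiple of the huge page size, and the mmap() will fail if the default huge page pool is empty:

/* Illustrative MAP_HUGETLB sketch; see
 * tools/testing/selftests/vm/map_hugetlb.c for the full selftest. */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define LENGTH (256UL * 1024 * 1024)

int main(void)
{
        void *addr = mmap(NULL, LENGTH, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (addr == MAP_FAILED) {
                perror("mmap");         /* e.g. no huge pages in the default pool */
                exit(1);
        }

        memset(addr, 1, LENGTH);        /* faults in huge pages */

        munmap(addr, LENGTH);
        return 0;
}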

Users who wish to use hugetlb memory via shared memory segments should be members of a supplementary group, and the system admin needs to configure that gid into /proc/sys/vm/hugetlb_shm_group. It is possible for the same or different applications to use any combination of mmaps and shm* calls, though the mount of the filesystem will be required for using mmap calls without MAP_HUGETLB.
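
A hedged sketch of the shared memory path (assuming the caller's gid has been configured in /proc/sys/vm/hugetlb_shm_group, or the caller has CAP_IPC_LOCK, and that the pool has enough huge pages):

/* Illustrative SHM_HUGETLB sketch; see
 * tools/testing/selftests/vm/hugepage-shm.c for the full selftest. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#define LENGTH (256UL * 1024 * 1024)

int main(void)
{
        int shmid = shmget(IPC_PRIVATE, LENGTH,
                           SHM_HUGETLB | IPC_CREAT | SHM_R | SHM_W);
        if (shmid < 0) {
                perror("shmget");
                exit(1);
        }

        char *shmaddr = shmat(shmid, NULL, 0);
        if (shmaddr == (char *)-1) {
                perror("shmat");
                shmctl(shmid, IPC_RMID, NULL);
                exit(1);
        }

        memset(shmaddr, 0, LENGTH);     /* faults in huge pages */

        shmdt(shmaddr);
        shmctl(shmid, IPC_RMID, NULL);  /* mark the segment for removal */
        return 0;
}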

Syscalls that operate on memory backed by hugetlb pages only have their lengths aligned to the native page size of the processor; they will normally fail with errno set to EINVAL or exclude hugetlb pages that extend beyond the length if not hugepage aligned. For example, munmap(2) will fail if memory is backed by a hugetlb page and the length is smaller than the hugepage size.

Examples

map_hugetlb

see tools/testing/selftests/vm/map_hugetlb.c

hugepage-shm

see tools/testing/selftests/vm/hugepage-shm.c

hugepage-mmap

see tools/testing/selftests/vm/hugepage-mmap.c

The libhugetlbfs library provides a wide range of userspace tools to help with huge page usability, environment setup, and control.

 

Hugetlbfs Reservation

Overview

Huge pages as described in the HugeTLB Pages section above are typically preallocated for application use. These huge pages are instantiated in a task’s address space at page fault time if the VMA indicates huge pages are to be used. If no huge page exists at page fault time, the task is sent a SIGBUS and often dies an unhappy death. Shortly after huge page support was added, it was determined that it would be better to detect a shortage of huge pages at mmap() time. The idea is that if there were not enough huge pages to cover the mapping, the mmap() would fail. This was first done with a simple check in the code at mmap() time to determine if there were enough free huge pages to cover the mapping. Like most things in the kernel, the code has evolved over time. However, the basic idea was to ‘reserve’ huge pages at mmap() time to ensure that huge pages would be available for page faults in that mapping. The description below attempts to describe how huge page reserve processing is done in the v4.10 kernel.

Audience

This description is primarily targeted at kernel developers who are modifying hugetlbfs code.

The Data Structures

resv_huge_pages

This is a global (per-hstate) count of reserved huge pages. Reserved huge pages are only available to the task which reserved them. Therefore, the number of huge pages generally available is computed as (free_huge_pages - resv_huge_pages).

Reserve Map

A reserve map is described by the structure:

struct resv_map {
        struct kref refs;
        spinlock_t lock;
        struct list_head regions;
        long adds_in_progress;
        struct list_head region_cache;
        long region_cache_count;
};

There is one reserve map for each huge page mapping in the system. The regions list within the resv_map describes the regions within the mapping. A region is described as:

struct file_region {
        struct list_head link;
        long from;
        long to;
};

The ‘from’ and ‘to’ fields of the file region structure are huge page indices into the mapping. Depending on the type of mapping, a region in the resv_map may indicate reservations exist for the range, or reservations do not exist.

Flags for MAP_PRIVATE Reservations

These are stored in the bottom bits of the reservation map pointer.

#define HPAGE_RESV_OWNER    (1UL << 0)

Indicates this task is the owner of the reservations associated with the mapping.

#define HPAGE_RESV_UNMAPPED (1UL << 1)

Indicates task originally mapping this range (and creating reserves) has unmapped a page from this task (the child) due to a failed COW.

Page Flags

The PagePrivate page flag is used to indicate that a huge page reservation must be restored when the huge page is freed. More details will be discussed in the “Freeing huge pages” section.

Reservation Map Location (Private or Shared)

A huge page mapping or segment is either private or shared. If private, it is typically only available to a single address space (task). If shared, it can be mapped into multiple address spaces (tasks). The location and semantics of the reservation map is significantly different for the two types of mappings. Location differences are:

  • For private mappings, the reservation map hangs off the VMA structure. Specifically, vma->vm_private_data. This reserve map is created at the time the mapping (mmap(MAP_PRIVATE)) is created.

  • For shared mappings, the reservation map hangs off the inode. Specifically, inode->i_mapping->private_data. Since shared mappings are always backed by files in the hugetlbfs filesystem, the hugetlbfs code ensures each inode contains a reservation map. As a result, the reservation map is allocated when the inode is created.

Creating Reservations

Reservations are created when a huge page backed shared memory segment is created (shmget(SHM_HUGETLB)) or a mapping is created via mmap(MAP_HUGETLB). These operations result in a call to the routine hugetlb_reserve_pages():

int hugetlb_reserve_pages(struct inode *inode,
                          long from, long to,
                          struct vm_area_struct *vma,
                          vm_flags_t vm_flags)

The first thing hugetlb_reserve_pages() does is check if the NORESERVE flag was specified in either the shmget() or mmap() call. If NORESERVE was specified, then this routine returns immediately as no reservations are desired.

The arguments ‘from’ and ‘to’ are huge page indices into the mapping or underlying file. For shmget(), ‘from’ is always 0 and ‘to’ corresponds to the length of the segment/mapping. For mmap(), the offset argument could be used to specify the offset into the underlying file. In such a case, the ‘from’ and ‘to’ arguments have been adjusted by this offset.

One of the big differences between PRIVATE and SHARED mappings is the way in which reservations are represented in the reservation map.

  • For shared mappings, an entry in the reservation map indicates a reservation exists or did exist for the corresponding page. As reservations are consumed, the reservation map is not modified.

  • For private mappings, the lack of an entry in the reservation map indicates a reservation exists for the corresponding page. As reservations are consumed, entries are added to the reservation map. Therefore, the reservation map can also be used to determine which reservations have been consumed.

For private mappings, hugetlb_reserve_pages() creates the reservation map and hangs it off the VMA structure. In addition, the HPAGE_RESV_OWNER flag is set to indicate this VMA owns the reservations.

The reservation map is consulted to determine how many huge page reservations are needed for the current mapping/segment. For private mappings, this is always the value (to - from). However, for shared mappings it is possible that some reservations may already exist within the range (to - from). See the section Reservation Map Modifications for details on how this is accomplished.

The mapping may be associated with a subpool. If so, the subpool is consulted to ensure there is sufficient space for the mapping. It is possible that the subpool has set aside reservations that can be used for the mapping. See the section Subpool Reservations for more details.

After consulting the reservation map and subpool, the number of needed new reservations is known. The routine hugetlb_acct_memory() is called to check for and take the requested number of reservations. hugetlb_acct_memory() calls into routines that potentially allocate and adjust surplus page counts. However, within those routines the code is simply checking to ensure there are enough free huge pages to accommodate the reservation. If there are, the global reservation count resv_huge_pages is adjusted something like the following:

if (resv_needed <= (free_huge_pages - resv_huge_pages))
        resv_huge_pages += resv_needed;

Note that the global lock hugetlb_lock is held when checking and adjusting these counters.

If there were enough free huge pages and the global count resv_huge_pages was adjusted, then the reservation map associated with the mapping is modified to reflect the reservations. In the case of a shared mapping, a file_region will exist that includes the range ‘from’ - ‘to’. For private mappings, no modifications are made to the reservation map as lack of an entry indicates a reservation exists.

If hugetlb_reserve_pages() was successful, the global reservation count and reservation map associated with the mapping will be modified as required to ensure reservations exist for the range ‘from’ - ‘to’.

Consuming Reservations/Allocating a Huge Page

Reservations are consumed when huge pages associated with the reservations are allocated and instantiated in the corresponding mapping. The allocation is performed within the routine alloc_huge_page():

struct page *alloc_huge_page(struct vm_area_struct *vma,
                             unsigned long addr, int avoid_reserve)

alloc_huge_page is passed a VMA pointer and a virtual address, so it can consult the reservation map to determine if a reservation exists. In addition, alloc_huge_page takes the argument avoid_reserve which indicates reserves should not be used even if it appears they have been set aside for the specified address. The avoid_reserve argument is most often used in the case of Copy on Write and Page Migration where additional copies of an existing page are being allocated.

The helper routine vma_needs_reservation() is called to determine if a reservation exists for the address within the mapping (vma). See the section Reservation Map Helper Routines for detailed information on what this routine does. The value returned from vma_needs_reservation() is generally 0 or 1: 0 if a reservation exists for the address, 1 if no reservation exists. If a reservation does not exist, and there is a subpool associated with the mapping, the subpool is consulted to determine if it contains reservations. If the subpool contains reservations, one can be used for this allocation. However, in every case the avoid_reserve argument overrides the use of a reservation for the allocation. After determining whether a reservation exists and can be used for the allocation, the routine dequeue_huge_page_vma() is called. This routine takes two arguments related to reservations:

  • avoid_reserve, this is the same value/argument passed to alloc_huge_page()

  • chg, even though this argument is of type long, only the values 0 or 1 are passed to dequeue_huge_page_vma. If the value is 0, it indicates a reservation exists (see the section “Reservations and Memory Policy” for possible issues). If the value is 1, it indicates a reservation does not exist and the page must be taken from the global free pool if possible.

The free lists associated with the memory policy of the VMA are searched for a free page. If a page is found, the value free_huge_pages is decremented when the page is removed from the free list. If there was a reservation associated with the page, the following adjustments are made:

SetPagePrivate(page);   /* Indicates allocating this page consumed
                         * a reservation, and if an error is
                         * encountered such that the page must be
                         * freed, the reservation will be restored. */
resv_huge_pages--;      /* Decrement the global reservation count */

Note, if no huge page can be found that satisfies the VMA’s memory policy an attempt will be made to allocate one using the buddy allocator. This brings up the issue of surplus huge pages and overcommit which is beyond the scope of reservations. Even if a surplus page is allocated, the same reservation based adjustments as above will be made: SetPagePrivate(page) and resv_huge_pages--.

After obtaining a new huge page, (page)->private is set to the value of the subpool associated with the page if it exists. This will be used for subpool accounting when the page is freed.

The routine vma_commit_reservation() is then called to adjust the reserve map based on the consumption of the reservation. In general, this involves ensuring the page is represented within a file_region structure of the region map. For shared mappings where the reservation was present, an entry in the reserve map already existed so no change is made. However, if there was no reservation in a shared mapping or this was a private mapping a new entry must be created.

It is possible that the reserve map could have been changed between the call to vma_needs_reservation() at the beginning of alloc_huge_page() and the call to vma_commit_reservation() after the page was allocated. This would be possible if hugetlb_reserve_pages was called for the same page in a shared mapping. In such cases, the reservation count and subpool free page count will be off by one. This rare condition can be identified by comparing the return value from vma_needs_reservation and vma_commit_reservation. If such a race is detected, the subpool and global reserve counts are adjusted to compensate. See the section Reservation Map Helper Routines for more information on these routines.

Instantiate Huge Pages

After huge page allocation, the page is typically added to the page tables of the allocating task. Before this, pages in a shared mapping are added to the page cache and pages in private mappings are added to an anonymous reverse mapping. In both cases, the PagePrivate flag is cleared. Therefore, when a huge page that has been instantiated is freed no adjustment is made to the global reservation count (resv_huge_pages).

Freeing Huge Pages

Huge page freeing is performed by the routine free_huge_page(). This routine is the destructor for hugetlbfs compound pages. As a result, it is only passed a pointer to the page struct. When a huge page is freed, reservation accounting may need to be performed. This would be the case if the page was associated with a subpool that contained reserves, or the page is being freed on an error path where a global reserve count must be restored.

The page->private field points to any subpool associated with the page. If the PagePrivate flag is set, it indicates the global reserve count should be adjusted (see the section Consuming Reservations/Allocating a Huge Page for information on how these are set).

The routine first calls hugepage_subpool_put_pages() for the page. If this routine returns a value of 0 (which does not equal the value passed, 1) it indicates reserves are associated with the subpool, and this newly freed page must be used to keep the number of subpool reserves above the minimum size. Therefore, the global resv_huge_pages counter is incremented in this case.

If the PagePrivate flag was set in the page, the global resv_huge_pages counter will always be incremented.

Subpool Reservations

There is a struct hstate associated with each huge page size. The hstate tracks all huge pages of the specified size. A subpool represents a subset of pages within a hstate that is associated with a mounted hugetlbfs filesystem.

When a hugetlbfs filesystem is mounted a min_size option can be specified which indicates the minimum number of huge pages required by the filesystem. If this option is specified, the number of huge pages corresponding to min_size are reserved for use by the filesystem. This number is tracked in the min_hpages field of a struct hugepage_subpool. At mount time, hugetlb_acct_memory(min_hpages) is called to reserve the specified number of huge pages. If they can not be reserved, the mount fails.

The routines hugepage_subpool_get/put_pages() are called when pages are obtained from or released back to a subpool. They perform all subpool accounting, and track any reservations associated with the subpool. hugepage_subpool_get/put_pages are passed the number of huge pages by which to adjust the subpool ‘used page’ count (down for get, up for put). Normally, they return the same value that was passed or an error if not enough pages exist in the subpool.

However, if reserves are associated with the subpool a return value less than the passed value may be returned. This return value indicates the number of additional global pool adjustments which must be made. For example, suppose a subpool contains 3 reserved huge pages and someone asks for 5. The 3 reserved pages associated with the subpool can be used to satisfy part of the request. But, 2 pages must be obtained from the global pools. To relay this information to the caller, the value 2 is returned. The caller is then responsible for attempting to obtain the additional two pages from the global pools.
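
A simplified userspace model of that accounting may make the return value convention clearer; this is not the kernel code, and the structure, field and function names below are illustrative only:

/* Simplified model of the hugepage_subpool_get_pages() return value
 * convention.  NOT the kernel implementation; it only illustrates how
 * a request against subpool reserves can return the number of pages
 * that must still come from the global pools. */
#include <stdio.h>

struct subpool_model {
        long reserved;          /* huge pages set aside for this subpool */
};

/* Returns how many of 'delta' pages must be taken from the global
 * pools after the subpool's own reserves have been consumed. */
static long subpool_get_pages(struct subpool_model *spool, long delta)
{
        long from_global;

        if (spool->reserved >= delta) {
                spool->reserved -= delta;
                from_global = 0;                /* fully covered by reserves */
        } else {
                from_global = delta - spool->reserved;
                spool->reserved = 0;
        }
        return from_global;
}

int main(void)
{
        struct subpool_model spool = { .reserved = 3 };

        /* Ask for 5 pages: 3 come from subpool reserves, 2 from the
         * global pools, so 2 is returned to the caller. */
        printf("need from global pools: %ld\n", subpool_get_pages(&spool, 5));
        return 0;
}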

COW and Reservations

Since shared mappings all point to and use the same underlying pages, the biggest reservation concern for COW is private mappings. In this case, two tasks can be pointing at the same previously allocated page. One task attempts to write to the page, so a new page must be allocated so that each task points to its own page.

When the page was originally allocated, the reservation for that page was consumed. When an attempt to allocate a new page is made as a result of COW, it is possible that no huge pages are free and the allocation will fail.

When the private mapping was originally created, the owner of the mapping was noted by setting the HPAGE_RESV_OWNER bit in the pointer to the reservation map of the owner. Since the owner created the mapping, the owner owns all the reservations associated with the mapping. Therefore, when a write fault occurs and there is no page available, different action is taken for the owner and non-owner of the reservation.

In the case where the faulting task is not the owner, the fault will fail and the task will typically receive a SIGBUS.

If the owner is the faulting task, we want it to succeed since it owned the original reservation. To accomplish this, the page is unmapped from the non-owning task. In this way, the only reference is from the owning task. In addition, the HPAGE_RESV_UNMAPPED bit is set in the reservation map pointer of the non-owning task. The non-owning task may receive a SIGBUS if it later faults on a non-present page. But, the original owner of the mapping/reservation will behave as expected.

Reservation Map Modifications

The following low level routines are used to make modifications to a reservation map. Typically, these routines are not called directly. Rather, a reservation map helper routine is called which calls one of these low level routines. These low level routines are fairly well documented in the source code (mm/hugetlb.c). These routines are:

long region_chg(struct resv_map *resv, long f, long t);
long region_add(struct resv_map *resv, long f, long t);
void region_abort(struct resv_map *resv, long f, long t);
long region_count(struct resv_map *resv, long f, long t);

Operations on the reservation map typically involve two operations:

  1. region_chg() is called to examine the reserve map and determine how many pages in the specified range [f, t) are NOT currently represented.

    The calling code performs global checks and allocations to determine if there are enough huge pages for the operation to succeed.

    1. If the operation can succeed, region_add() is called to actually modify the reservation map for the same range [f, t) previously passed to region_chg().

    2. If the operation can not succeed, region_abort is called for the same range [f, t) to abort the operation.

Note that this is a two step process where region_add() and region_abort() are guaranteed to succeed after a prior call to region_chg() for the same range. region_chg() is responsible for pre-allocating any data structures necessary to ensure the subsequent operations (specifically region_add()) will succeed.

As mentioned above, region_chg() determines the number of pages in the range which are NOT currently represented in the map. This number is returned to the caller. region_add() returns the number of pages in the range added to the map. In most cases, the return value of region_add() is the same as the return value of region_chg(). However, in the case of shared mappings it is possible for changes to the reservation map to be made between the calls to region_chg() and region_add(). In this case, the return value of region_add() will not match the return value of region_chg(). It is likely that in such cases global counts and subpool accounting will be incorrect and in need of adjustment. It is the responsibility of the caller to check for this condition and make the appropriate adjustments.
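
The two step usage pattern can be modelled in userspace with a toy reserve map; the sketch below is not mm/hugetlb.c (it uses a small bitmap of page indices instead of a file_region list and omits the region cache and error handling), but it shows why the return values of region_chg() and region_add() normally match:

/* Toy model of the reservation map two step pattern.  A real reserve
 * map is a list of struct file_region; here a small bitmap of huge
 * page indices stands in for it.  Sketch only, not the kernel code. */
#include <stdbool.h>
#include <stdio.h>

#define MAP_PAGES 64

static bool reserved[MAP_PAGES];        /* toy reserve "map" */

/* Step 1: how many pages in [f, t) are NOT yet represented? */
static long region_chg(long f, long t)
{
        long missing = 0;
        for (long i = f; i < t; i++)
                if (!reserved[i])
                        missing++;
        return missing;
}

/* Step 2a: commit - add the range to the map, return pages added. */
static long region_add(long f, long t)
{
        long added = 0;
        for (long i = f; i < t; i++)
                if (!reserved[i]) {
                        reserved[i] = true;
                        added++;
                }
        return added;
}

/* Step 2b: abort - nothing was preallocated in this toy model, so
 * there is nothing to undo. */
static void region_abort(long f, long t) { (void)f; (void)t; }

int main(void)
{
        region_add(0, 4);                 /* pretend 4 pages are already reserved */

        long chg = region_chg(2, 10);     /* step 1: 6 pages missing in [2, 10) */
        bool enough_free_pages = true;    /* stand-in for the global pool check */

        if (enough_free_pages) {
                long add = region_add(2, 10);           /* step 2: commit */
                printf("chg=%ld add=%ld\n", chg, add);  /* normally equal */
        } else {
                region_abort(2, 10);                    /* step 2: abort */
        }
        return 0;
}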

The routine region_del() is called to remove regions from a reservation map. It is typically called in the following situations:

  • When a file in the hugetlbfs filesystem is being removed, the inode will be released and the reservation map freed. Before freeing the reservation map, all the individual file_region structures must be freed. In this case region_del is passed the range [0, LONG_MAX).

  • When a hugetlbfs file is being truncated. In this case, all allocated pages after the new file size must be freed. In addition, any file_region entries in the reservation map past the new end of file must be deleted. In this case, region_del is passed the range [new_end_of_file, LONG_MAX).

  • When a hole is being punched in a hugetlbfs file. In this case, huge pages are removed from the middle of the file one at a time. As the pages are removed, region_del() is called to remove the corresponding entry from the reservation map. In this case, region_del is passed the range [page_idx, page_idx + 1).

In every case, region_del() will return the number of pages removed from the reservation map. In VERY rare cases, region_del() can fail. This can only happen in the hole punch case where it has to split an existing file_region entry and can not allocate a new structure. In this error case, region_del() will return -ENOMEM. The problem here is that the reservation map will indicate that there is a reservation for the page. However, the subpool and global reservation counts will not reflect the reservation. To handle this situation, the routine hugetlb_fix_reserve_counts() is called to adjust the counters so that they correspond with the reservation map entry that could not be deleted.

region_count() is called when unmapping a private huge page mapping. In private mappings, the lack of an entry in the reservation map indicates that a reservation exists. Therefore, by counting the number of entries in the reservation map we know how many reservations were consumed and how many are outstanding (outstanding = (end - start) - region_count(resv, start, end)). Since the mapping is going away, the subpool and global reservation counts are decremented by the number of outstanding reservations.

Reservation Map Helper Routines

Several helper routines exist to query and modify the reservation maps. These routines are only interested in reservations for a specific huge page, so they just pass in an address instead of a range. In addition, they pass in the associated VMA. From the VMA, the type of mapping (private or shared) and the location of the reservation map (inode or VMA) can be determined. These routines simply call the underlying routines described in the section “Reservation Map Modifications”. However, they do take into account the ‘opposite’ meaning of reservation map entries for private and shared mappings and hide this detail from the caller:

long vma_needs_reservation(struct hstate *h,
                           struct vm_area_struct *vma,
                           unsigned long addr)

This routine calls region_chg() for the specified page. If no reservation exists, 1 is returned. If a reservation exists, 0 is returned:

long vma_commit_reservation(struct hstate *h,
                            struct vm_area_struct *vma,
                            unsigned long addr)

This calls region_add() for the specified page. As in the case of region_chg and region_add, this routine is to be called after a previous call to vma_needs_reservation. It will add a reservation entry for the page. It returns 1 if the reservation was added and 0 if not. The return value should be compared with the return value of the previous call to vma_needs_reservation. An unexpected difference indicates the reservation map was modified between calls:

void vma_end_reservation(struct hstate *h,
                         struct vm_area_struct *vma,
                         unsigned long addr)

This calls region_abort() for the specified page. As in the case of region_chg and region_abort, this routine is to be called after a previous call to vma_needs_reservation. It will abort/end the in progress reservation add operation:

long vma_add_reservation(struct hstate *h,
                         struct vm_area_struct *vma,
                         unsigned long addr)

This is a special wrapper routine to help facilitate reservation cleanup on error paths. It is only called from the routine restore_reserve_on_error(). This routine is used in conjunction with vma_needs_reservation in an attempt to add a reservation to the reservation map. It takes into account the different reservation map semantics for private and shared mappings. Hence, region_add is called for shared mappings (as an entry present in the map indicates a reservation), and region_del is called for private mappings (as the absence of an entry in the map indicates a reservation). See the section “Reservation cleanup in error paths” for more information on what needs to be done on error paths.

Reservation Cleanup in Error Paths

As mentioned above, reservation map modifications are performed in two steps. First vma_needs_reservation is called before a page is allocated. If the allocation is successful, then vma_commit_reservation is called. If not, vma_end_reservation is called. Global and subpool reservation counts are adjusted based on success or failure of the operation and all is well.

Additionally, after a huge page is instantiated the PagePrivate flag is cleared so that accounting when the page is ultimately freed is correct.

However, there are several instances where errors are encountered after a huge page is allocated but before it is instantiated. In this case, the page allocation has consumed the reservation and made the appropriate subpool, reservation map and global count adjustments. If the page is freed at this time (before instantiation and clearing of PagePrivate), then free_huge_page will increment the global reservation count. However, the reservation map indicates the reservation was consumed. This resulting inconsistent state will cause the ‘leak’ of a reserved huge page. The global reserve count will be higher than it should and prevent allocation of a pre-allocated page.

The routine restore_reserve_on_error() attempts to handle this situation. It is fairly well documented. The intention of this routine is to restore the reservation map to the way it was before the page allocation. In this way, the state of the reservation map will correspond to the global reservation count after the page is freed.

The routine restore_reserve_on_error itself may encounter errors while attempting to restore the reservation map entry. In this case, it will simply clear the PagePrivate flag of the page. In this way, the global reserve count will not be incremented when the page is freed. However, the reservation map will continue to look as though the reservation was consumed. A page can still be allocated for the address, but it will not use a reserved page as originally intended.

There is some code (most notably userfaultfd) which can not call restore_reserve_on_error. In this case, it simply modifies the PagePrivate so that a reservation will not be leaked when the huge page is freed.

Reservations and Memory Policy

Per-node huge page lists existed in struct hstate when git was first used to manage Linux code. The concept of reservations was added some time later. When reservations were added, no attempt was made to take memory policy into account. While cpusets are not exactly the same as memory policy, this comment in hugetlb_acct_memory sums up the interaction between reservations and cpusets/memory policy:

/*
 * When cpuset is configured, it breaks the strict hugetlb page
 * reservation as the accounting is done on a global variable. Such
 * reservation is completely rubbish in the presence of cpuset because
 * the reservation is not checked against page availability for the
 * current cpuset. Application can still potentially OOM'ed by kernel
 * with lack of free htlb page in cpuset that the task is in.
 * Attempt to enforce strict accounting with cpuset is almost
 * impossible (or too ugly) because cpuset is too fluid that
 * task or memory node can be dynamically moved between cpusets.
 *
 * The change of semantics for shared hugetlb mapping with cpuset is
 * undesirable. However, in order to preserve some of the semantics,
 * we fall back to check against current free page availability as
 * a best attempt and hopefully to minimize the impact of changing
 * semantics that cpuset has.
 */

Huge page reservations were added to prevent unexpected page allocation failures (OOM) at page fault time. However, if an application makes use of cpusets or memory policy there is no guarantee that huge pages will be available on the required nodes. This is true even if there are a sufficient number of global reservations.

Hugetlbfs regression testing

The most complete set of hugetlb tests are in the libhugetlbfs repository. If you modify any hugetlb related code, use the libhugetlbfs test suite to check for regressions. In addition, if you add any new hugetlb functionality, please add appropriate tests to libhugetlbfs.

– Mike Kravetz, 7 April 2017

 

Kernel Samepage Merging

Overview

KSM is a memory-saving de-duplication feature, enabled by CONFIG_KSM=y, added to the Linux kernel in 2.6.32. See mm/ksm.c for its implementation, and the LWN articles that introduced it.

KSM was originally developed for use with KVM (where it was known as Kernel Shared Memory), to fit more virtual machines into physical memory, by sharing the data common between them. But it can be useful to any application which generates many instances of the same data.

The KSM daemon ksmd periodically scans those areas of user memory which have been registered with it, looking for pages of identical content which can be replaced by a single write-protected page (which is automatically copied if a process later wants to update its content). The number of pages that the KSM daemon scans in a single pass and the time between passes are configured using the sysfs interface described below.

KSM only merges anonymous (private) pages, never pagecache (file) pages. KSM’s merged pages were originally locked into kernel memory, but can now be swapped out just like other user pages (but sharing is broken when they are swapped back in: ksmd must rediscover their identity and merge again).

Controlling KSM with madvise

KSM only operates on those areas of address space which an application has advised to be likely candidates for merging, by using the madvise(2) system call:

int madvise(addr, length, MADV_MERGEABLE)

The app may call

int madvise(addr, length, MADV_UNMERGEABLE)

to cancel that advice and restore unshared pages: whereupon KSM unmerges whatever it merged in that range. Note: this unmerging call may suddenly require more memory than is available - possibly failing with EAGAIN, but more probably arousing the Out-Of-Memory killer.

If KSM is not configured into the running kernel, madvise MADV_MERGEABLE and MADV_UNMERGEABLE simply fail with EINVAL. If the running kernel was built with CONFIG_KSM=y, those calls will normally succeed: even if the KSM daemon is not currently running, MADV_MERGEABLE still registers the range for whenever the KSM daemon is started; even if the range cannot contain any pages which KSM could actually merge; even if MADV_UNMERGEABLE is applied to a range which was never MADV_MERGEABLE.

If a region of memory must be split into at least one new MADV_MERGEABLE or MADV_UNMERGEABLE region, the madvise may return ENOMEM if the process will exceed vm.max_map_count (see the max_map_count sysctl documentation).

Like other madvise calls, they are intended for use on mapped areas of the user address space: they will report ENOMEM if the specified range includes unmapped gaps (though working on the intervening mapped areas), and might fail with EAGAIN if there is not enough memory for internal structures.

Applications should be considerate in their use of MADV_MERGEABLE, restricting its use to areas likely to benefit. KSM’s scans may use a lot of processing power: some installations will disable KSM for that reason.
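
A minimal sketch of registering an area with KSM follows (assuming a kernel built with CONFIG_KSM=y; the mapping size and fill pattern are arbitrary):

/* Illustrative MADV_MERGEABLE sketch. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        size_t len = 64 * 1024 * 1024;
        char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) {
                perror("mmap");
                exit(1);
        }

        memset(buf, 0x5a, len);         /* many identical pages for ksmd to merge */

        if (madvise(buf, len, MADV_MERGEABLE)) {
                perror("madvise");      /* EINVAL if CONFIG_KSM is not enabled */
                exit(1);
        }

        /* With /sys/kernel/mm/ksm/run set to 1, ksmd will gradually merge
         * the identical pages; watch pages_sharing grow in sysfs. */
        pause();
        return 0;
}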

KSM daemon sysfs interface

The KSM daemon is controlled by sysfs files in /sys/kernel/mm/ksm/, readable by all but writable only by root:

pages_to_scan

how many pages to scan before ksmd goes to sleep e.g. echo 100 > /sys/kernel/mm/ksm/pages_to_scan.

Default: 100 (chosen for demonstration purposes)

sleep_millisecs

how many milliseconds ksmd should sleep before next scan e.g. echo 20 > /sys/kernel/mm/ksm/sleep_millisecs

Default: 20 (chosen for demonstration purposes)

merge_across_nodes

specifies if pages from different NUMA nodes can be merged. When set to 0, ksm merges only pages which physically reside in the memory area of same NUMA node. That brings lower latency to access of shared pages. Systems with more nodes, at significant NUMA distances, are likely to benefit from the lower latency of setting 0. Smaller systems, which need to minimize memory usage, are likely to benefit from the greater sharing of setting 1 (default). You may wish to compare how your system performs under each setting, before deciding on which to use. merge_across_nodes setting can be changed only when there are no ksm shared pages in the system: set run 2 to unmerge pages first, then to 1 after changing merge_across_nodes, to remerge according to the new setting.

Default: 1 (merging across nodes as in earlier releases)

run

  • set to 0 to stop ksmd from running but keep merged pages,

  • set to 1 to run ksmd e.g. echo 1 > /sys/kernel/mm/ksm/run,

  • set to 2 to stop ksmd and unmerge all pages currently merged, but leave mergeable areas registered for next run.

Default: 0 (must be changed to 1 to activate KSM, except if CONFIG_SYSFS is disabled)

use_zero_pages

specifies whether empty pages (i.e. allocated pages that only contain zeroes) should be treated specially. When set to 1, empty pages are merged with the kernel zero page(s) instead of with each other as it would happen normally. This can improve the performance on architectures with coloured zero pages, depending on the workload. Care should be taken when enabling this setting, as it can potentially degrade the performance of KSM for some workloads, for example if the checksums of pages candidate for merging match the checksum of an empty page. This setting can be changed at any time, it is only effective for pages merged after the change.

Default: 0 (normal KSM behaviour as in earlier releases)

max_page_sharing

Maximum sharing allowed for each KSM page. This enforces a deduplication limit to avoid high latency for virtual memory operations that involve traversal of the virtual mappings that share the KSM page. The minimum value is 2 as a newly created KSM page will have at least two sharers. The higher this value the faster KSM will merge the memory and the higher the deduplication factor will be, but the slower the worst case virtual mappings traversal could be for any given KSM page. Slowing down this traversal means there will be higher latency for certain virtual memory operations happening during swapping, compaction, NUMA balancing and page migration, in turn decreasing responsiveness for the caller of those virtual memory operations. The scheduler latency of other tasks not involved with the VM operations doing the virtual mappings traversal is not affected by this parameter as these traversals are always schedule friendly themselves.

stable_node_chains_prune_millisecs

specifies how frequently KSM checks the metadata of the pages that hit the deduplication limit for stale information. Smaller millisecs values will free up the KSM metadata with lower latency, but they will make ksmd use more CPU during the scan. It’s a noop if not a single KSM page hit the max_page_sharing yet.

The effectiveness of KSM and MADV_MERGEABLE is shown in /sys/kernel/mm/ksm/:

pages_shared

how many shared pages are being used

pages_sharing

how many more sites are sharing them i.e. how much saved

pages_unshared

how many pages unique but repeatedly checked for merging

pages_volatile

how many pages changing too fast to be placed in a tree

full_scans

how many times all mergeable areas have been scanned

stable_node_chains

the number of KSM pages that hit the max_page_sharing limit

stable_node_dups

number of duplicated KSM pages

A high ratio of pages_sharing to pages_shared indicates good sharing, but a high ratio of pages_unshared to pages_sharing indicates wasted effort. pages_volatile embraces several different kinds of activity, but a high proportion there would also indicate poor use of madvise MADV_MERGEABLE.

The maximum possible pages_sharing/pages_shared ratio is limited by the max_page_sharing tunable. To increase the ratio max_page_sharing must be increased accordingly.
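
A small sketch that reads the two counters and prints the sharing ratio (assuming sysfs is mounted at /sys and KSM has merged at least one page):

/* Illustrative reader of the KSM effectiveness counters. */
#include <stdio.h>

static long read_counter(const char *path)
{
        long val = -1;
        FILE *f = fopen(path, "r");

        if (f) {
                if (fscanf(f, "%ld", &val) != 1)
                        val = -1;
                fclose(f);
        }
        return val;
}

int main(void)
{
        long shared  = read_counter("/sys/kernel/mm/ksm/pages_shared");
        long sharing = read_counter("/sys/kernel/mm/ksm/pages_sharing");

        if (shared <= 0 || sharing < 0) {
                fprintf(stderr, "KSM counters unavailable or no pages shared\n");
                return 1;
        }
        printf("pages_shared=%ld pages_sharing=%ld ratio=%.2f\n",
               shared, sharing, (double)sharing / shared);
        return 0;
}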

– Izik Eidus, Hugh Dickins, 17 Nov 2009

 

Kernel Samepage Merging

KSM is a memory-saving de-duplication feature, enabled by CONFIG_KSM=y, added to the Linux kernel in 2.6.32. See mm/ksm.c for its implementation, and the LWN articles that introduced it.

The userspace interface of KSM is described in the Kernel Samepage Merging section above.

Design

Overview

A few notes about the KSM scanning process, to make it easier to understand the data structures below:

In order to reduce excessive scanning, KSM sorts the memory pages by their contents into a data structure that holds pointers to the pages’ locations.

Since the contents of the pages may change at any moment, KSM cannot just insert the pages into a normal sorted tree and expect it to find anything. Therefore KSM uses two data structures - the stable and the unstable tree.

The stable tree holds pointers to all the merged pages (ksm pages), sorted by their contents. Because each such page is write-protected, searching on this tree is fully assured to be working (except when pages are unmapped), and therefore this tree is called the stable tree.

The stable tree node includes information required for reverse mapping from a KSM page to virtual addresses that map this page.

In order to avoid large latencies of the rmap walks on KSM pages, KSM maintains two types of nodes in the stable tree:

  • the regular nodes that keep the reverse mapping structures in a linked list

  • the “chains” that link nodes (“dups”) that represent the same write protected memory content, but each “dup” corresponds to a different KSM page copy of that content

Internally, the regular nodes, “dups” and “chains” are represented using the same struct stable_node structure.

In addition to the stable tree, KSM uses a second data structure called the unstable tree: this tree holds pointers to pages which have been found to be “unchanged for a period of time”. The unstable tree sorts these pages by their contents, but since they are not write-protected, KSM cannot rely upon the unstable tree to work correctly - the unstable tree is liable to be corrupted as its contents are modified, and so it is called unstable.

KSM solves this problem by several techniques:

  1. The unstable tree is flushed every time KSM completes scanning all memory areas, and then the tree is rebuilt again from the beginning.

  2. KSM will only insert into the unstable tree, pages whose hash value has not changed since the previous scan of all memory areas.

  3. The unstable tree is a RedBlack Tree - so its balancing is based on the colors of the nodes and not on their contents, assuring that even when the tree gets “corrupted” it won’t get out of balance, so scanning time remains the same (also, searching and inserting nodes in an rbtree uses the same algorithm, so we have no overhead when we flush and rebuild).

  4. KSM never flushes the stable tree, which means that even if it were to take 10 attempts to find a page in the unstable tree, once it is found, it is secured in the stable tree. (When we scan a new page, we first compare it against the stable tree, and then against the unstable tree.)

If the merge_across_nodes tunable is unset, then KSM maintains multiple stable trees and multiple unstable trees: one of each for each NUMA node.

Reverse mapping

KSM maintains reverse mapping information for KSM pages in the stable tree.

If a KSM page is shared between less than max_page_sharing VMAs, the node of the stable tree that represents such KSM page points to a list of struct rmap_item and the page->mapping of the KSM page points to the stable tree node.

When the sharing passes this threshold, KSM adds a second dimension to the stable tree. The tree node becomes a “chain” that links one or more “dups”. Each “dup” keeps reverse mapping information for a KSM page with page->mapping pointing to that “dup”.

Every “chain” and all “dups” linked into a “chain” enforce the invariant that they represent the same write protected memory content, even if each “dup” will be pointed by a different KSM page copy of that content.

This way the stable tree lookup computational complexity is unaffected if compared to an unlimited list of reverse mappings. It is still enforced that there cannot be KSM page content duplicates in the stable tree itself.

The deduplication limit enforced by max_page_sharing is required to avoid the virtual memory rmap lists to grow too large. The rmap walk has O(N) complexity where N is the number of rmap_items (i.e. virtual mappings) that are sharing the page, which is in turn capped by max_page_sharing. So this effectively spreads the linear O(N) computational complexity from rmap walk context over different KSM pages. The ksmd walk over the stable_node “chains” is also O(N), but N is the number of stable_node “dups”, not the number of rmap_items, so it has not a significant impact on ksmd performance. In practice the best stable_node “dup” candidate will be kept and found at the head of the “dups” list.

High values of max_page_sharing result in faster memory merging (because there will be fewer stable_node dups queued into the stable_node chain->hlist to check for pruning) and higher deduplication factor at the expense of slower worst case for rmap walks for any KSM page which can happen during swapping, compaction, NUMA balancing and page migration.

The stable_node_dups/stable_node_chains ratio is also affected by the max_page_sharing tunable, and a high ratio may indicate fragmentation in the stable_node dups, which could be solved by introducing fragmentation algorithms in ksmd which would refile rmap_items from one stable_node dup to another stable_node dup, in order to free up stable_node “dups” with few rmap_items in them, but that may increase the ksmd CPU usage and possibly slow down the read-only computations on the KSM pages of the applications.

The whole list of stable_node “dups” linked in the stable_node “chains” is scanned periodically in order to prune stale stable_nodes. The frequency of such scans is defined by stable_node_chains_prune_millisecs sysfs tunable.

Reference

struct mm_slot

ksm information per mm that is being scanned

Definition

struct mm_slot {
  struct hlist_node link;
  struct list_head mm_list;
  struct rmap_item *rmap_list;
  struct mm_struct *mm;
};

Members

link

link to the mm_slots hash list

mm_list

link into the mm_slots list, rooted in ksm_mm_head

rmap_list

head for this mm_slot’s singly-linked list of rmap_items

mm

the mm that this information is valid for

struct ksm_scan

cursor for scanning

Definition

struct ksm_scan {
  struct mm_slot *mm_slot;
  unsigned long address;
  struct rmap_item **rmap_list;
  unsigned long seqnr;
};

Members

mm_slot

the current mm_slot we are scanning

address

the next address inside that to be scanned

rmap_list

link to the next rmap to be scanned in the rmap_list

seqnr

count of completed full scans (needed when removing unstable node)

Description

There is only the one ksm_scan instance of this cursor structure.

struct stable_node

node of the stable rbtree

Definition

struct stable_node {
  union {
    struct rb_node node;
    struct {
      struct list_head *head;
      struct {
        struct hlist_node hlist_dup;
        struct list_head list;
      };
    };
  };
  struct hlist_head hlist;
  union {
    unsigned long kpfn;
    unsigned long chain_prune_time;
  };
#define STABLE_NODE_CHAIN -1024
  int rmap_hlist_len;
#ifdef CONFIG_NUMA
  int nid;
#endif
};

Members

{unnamed_union}

anonymous

node

rb node of this ksm page in the stable tree

{unnamed_struct}

anonymous

head

(overlaying parent) migrate_nodes indicates temporarily on that list

{unnamed_struct}

anonymous

hlist_dup

linked into the stable_node->hlist with a stable_node chain

list

linked into migrate_nodes, pending placement in the proper node tree

hlist

hlist head of rmap_items using this ksm page

{unnamed_union}

anonymous

kpfn

page frame number of this ksm page (perhaps temporarily on wrong nid)

chain_prune_time

time of the last full garbage collection

rmap_hlist_len

number of rmap_item entries in hlist or STABLE_NODE_CHAIN

nid

NUMA node id of stable tree in which linked (may not match kpfn)

struct rmap_item

reverse mapping item for virtual addresses

Definition

struct rmap_item {
  struct rmap_item *rmap_list;
  union {
    struct anon_vma *anon_vma;
#ifdef CONFIG_NUMA
    int nid;
#endif
  };
  struct mm_struct *mm;
  unsigned long address;
  unsigned int oldchecksum;
  union {
    struct rb_node node;
    struct {
      struct stable_node *head;
      struct hlist_node hlist;
    };
  };
};

Members

rmap_list

next rmap_item in mm_slot’s singly-linked rmap_list

{unnamed_union}

anonymous

anon_vma

pointer to anon_vma for this mm,address, when in stable tree

nid

NUMA node id of unstable tree in which linked (may not match page)

mm

the memory structure this rmap_item is pointing into

address

the virtual address this rmap_item tracks (+ flags in low bits)

oldchecksum

previous checksum of the page at that virtual address

{unnamed_union}

anonymous

node

rb node of this rmap_item in the unstable tree

{unnamed_struct}

anonymous

head

pointer to stable_node heading this list in the stable tree

hlist

link into hlist of rmap_items hanging off that stable_node

– Izik Eidus, Hugh Dickins, 17 Nov 2009