NAME
jemalloc,
malloc.conf —
the default system allocator
LIBRARY
Standard C Library (libc, -lc)
SYNOPSIS
const char * _malloc_options;
DESCRIPTION
The jemalloc allocator is a general-purpose concurrent
malloc(3) implementation
specifically designed to be scalable on modern multi-processor systems. It is
the default user space system allocator in
NetBSD.
When the first call is made to one of the memory allocation routines such as
malloc() or
realloc(), various flags that
affect the workings of the allocator are set or reset. These are described
below.
The “name” of the file referenced by the symbolic link named
/etc/malloc.conf, the value of the environment variable
MALLOC_OPTIONS, and the string pointed to by the
global variable
_malloc_options will be interpreted, in
that order, character by character as flags.
Most flags are single letters. Uppercase letters indicate that the behavior is
set, or on, and lowercase letters mean that the behavior is not set, or off.
The following options are available.
- A
- All warnings (except for the warning about unknown flags
being set) become fatal. The process will call
abort(3) in these cases.
- H
- Use
madvise(2) when pages
within a chunk are no longer in use, but the chunk as a whole cannot yet
be deallocated. This is primarily of use when swapping is a real
possibility, due to the high overhead of the madvise()
system call.
- J
- Each byte of new memory allocated by
malloc() or realloc() will be
initialized to 0xa5. All memory released by free() or
realloc() will be overwritten with 0x5a. This is intended
for debugging and will impact performance negatively.
- K
- Increase/decrease the virtual memory chunk size by a factor
of two. The default chunk size is 1 MB. This option can be specified
multiple times.
- N
- Increase/decrease the number of arenas by a factor of two.
The default number of arenas is four times the number of CPUs, or one if
there is a single CPU. This option can be specified multiple times.
- P
- Various statistics are printed at program exit via an
atexit(3) function. This has
the potential to cause deadlock for a multi-threaded process that exits
while one or more threads are executing in the memory allocation
functions. Therefore, this option should only be used with care; it is
primarily intended as a performance tuning aid during application
development.
- Q
- Increase/decrease the size of the allocation quantum by a
factor of two. The default quantum is the minimum allowed by the
architecture (typically 8 or 16 bytes). This option can be specified
multiple times.
- S
- Increase/decrease the size of the maximum size class that
is a multiple of the quantum by a factor of two. Above this size,
power-of-two spacing is used for size classes. The default value is 512
bytes. This option can be specified multiple times.
- U
- Generate “utrace” entries for
ktrace(1), for all
operations. Consult the source for details on this option.
- V
- Attempting to allocate zero bytes will return a
NULL
pointer instead of a valid pointer. (The
default behavior is to make a minimal allocation and return a pointer to
it.) This option is provided for System V compatibility. This option is
incompatible with the X option.
- X
- Rather than return failure for any allocation function,
display a diagnostic message on
stderr
and cause
the program to drop core (using
abort(3)). This option should
be set at compile time by assigning the string "X" to the global variable
_malloc_options in the source code.
- Z
- Each byte of new memory allocated by
malloc() or realloc() will be
initialized to 0. Note that this initialization only happens once for each
byte, so realloc() does not zero memory that was
previously allocated. This is intended for debugging and will impact
performance negatively.
Extra care should be taken when enabling any of the options in production
environments. The
A,
J, and
Z options are intended for testing and debugging. An
application which changes its behavior when these options are used is flawed.
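The flag handling described above can be sketched in C. This is an illustration only, not the allocator's own code: the structure, its field names, and the functions apply_flags(), apply_all() and scale_by_shift() are hypothetical. The sketch assumes that an uppercase K or N doubles the corresponding value, that the lowercase letter halves it, and that later sources override earlier ones because they are processed last.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical settings record; the field names are illustrative only. */
struct flag_state {
    bool abort_on_warning;  /* A / a */
    bool junk_fill;         /* J / j */
    bool zero_fill;         /* Z / z */
    int  chunk_shift;       /* K / k: net number of doublings */
    int  arena_shift;       /* N / n: net number of doublings */
};

/* Interpret one flag string character by character. */
void apply_flags(struct flag_state *st, const char *flags)
{
    if (flags == NULL)
        return;
    for (; *flags != '\0'; flags++) {
        switch (*flags) {
        case 'A': st->abort_on_warning = true;  break;
        case 'a': st->abort_on_warning = false; break;
        case 'J': st->junk_fill = true;         break;
        case 'j': st->junk_fill = false;        break;
        case 'Z': st->zero_fill = true;         break;
        case 'z': st->zero_fill = false;        break;
        case 'K': st->chunk_shift++;            break;
        case 'k': st->chunk_shift--;            break;
        case 'N': st->arena_shift++;            break;
        case 'n': st->arena_shift--;            break;
        default:
            /* Unknown flag: ignored in this sketch; the allocator warns. */
            break;
        }
    }
}

/* Process the three sources in the documented order. */
void apply_all(struct flag_state *st, const char *link_name,
    const char *env_value, const char *global_value)
{
    apply_flags(st, link_name);     /* name behind /etc/malloc.conf */
    apply_flags(st, env_value);     /* MALLOC_OPTIONS */
    apply_flags(st, global_value);  /* _malloc_options */
}

/* Scale a base value by a net number of doublings or halvings. */
size_t scale_by_shift(size_t base, int shift)
{
    return shift >= 0 ? base << shift : base >> -shift;
}
```

For example, with "aJ" from the symbolic link, "KK" from the environment and "jk" from the program, junk filling ends up off and the chunk size ends up doubled once relative to its default.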
IMPLEMENTATION NOTES
The
jemalloc allocator uses multiple arenas in order to reduce
lock contention for threaded programs on multi-processor systems. This works
well with regard to threading scalability, but incurs some costs. There is a
small fixed per-arena overhead, and additionally, arenas manage memory
completely independently of each other, which means a small fixed increase in
overall memory fragmentation. These overheads are not generally an issue,
given the number of arenas normally used. Note that using substantially more
arenas than the default is not likely to improve performance, mainly due to
reduced cache performance. However, it may make sense to reduce the number of
arenas if an application does not make much use of the allocation functions.
Memory is conceptually broken into equal-sized chunks, where the chunk size is a
power of two that is greater than the page size. Chunks are always aligned to
multiples of the chunk size. This alignment makes it possible to find metadata
for user objects very quickly.
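The reason alignment helps can be shown with a short sketch. Because every chunk starts at a multiple of the chunk size, the chunk that contains an address is found by clearing the low bits of that address. The constant and function names below are hypothetical, and the 1 MB chunk size is simply the documented default.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed chunk size: the documented default of 1 MB (a power of two). */
#define EX_CHUNK_SIZE ((uintptr_t)1 << 20)

/* Start address of the chunk containing addr: clear the low bits. */
uintptr_t chunk_base(uintptr_t addr)
{
    return addr & ~(EX_CHUNK_SIZE - 1);
}

/* Byte offset of addr within its chunk: keep only the low bits. */
size_t chunk_offset(uintptr_t addr)
{
    return (size_t)(addr & (EX_CHUNK_SIZE - 1));
}
```

Both operations are a single mask, so no search is needed to get from a user pointer to the chunk that holds its metadata.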
User objects are broken into three categories according to size:
- Small objects are smaller than one page.
- Large objects are smaller than the chunk size.
- Huge objects are a multiple of the chunk size.
Small and large objects are managed by arenas; huge objects are managed
separately in a single data structure that is shared by all threads. Huge
objects are used by applications infrequently enough that this single data
structure is not a scalability issue.
Each chunk that is managed by an arena tracks its contents in a page map as runs
of contiguous pages (unused, backing a set of small objects, or backing one
large object). The combination of chunk alignment and chunk page maps makes it
possible to determine all metadata regarding small and large allocations in
constant time.
Small objects are managed in groups by page runs. Each run maintains a bitmap
that tracks which regions are in use. Allocation requests can be grouped as
follows.
- Allocation requests that are no more than half the
quantum (see the Q option) are rounded up to the nearest
power of two (typically 2, 4, or 8).
- Allocation requests that are more than half the quantum,
but no more than the maximum quantum-multiple size class (see the
S option) are rounded up to the nearest multiple of the
quantum.
- Allocation requests that are larger than the maximum
quantum-multiple size class, but no larger than one half of a page, are
rounded up to the nearest power of two.
- Allocation requests that are larger than half of a page,
but small enough to fit in an arena-managed chunk (see the
K option), are rounded up to the nearest run size.
- Allocation requests that are too large to fit in an
arena-managed chunk are rounded up to the nearest multiple of the chunk
size.
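The rounding rules above can be sketched as a single function. The parameter values are assumptions chosen for illustration: a 16-byte quantum, the default 512-byte maximum quantum-multiple class, 4096-byte pages and the default 1 MB chunk. Two further simplifications are hypothetical: run sizes are taken to be multiples of the page size, and the largest arena-managed request is taken to be one page less than the chunk size. The names are not part of any interface.

```c
#include <stddef.h>

/* Assumed parameters; actual values depend on the platform and options. */
#define EX_QUANTUM    ((size_t)16)          /* see the Q option */
#define EX_SMALL_MAX  ((size_t)512)         /* see the S option */
#define EX_PAGE       ((size_t)4096)
#define EX_CHUNK      ((size_t)1 << 20)     /* see the K option */
#define EX_ARENA_MAX  (EX_CHUNK - EX_PAGE)  /* hypothetical limit */

/* Smallest power of two that is not less than n. */
static size_t pow2_ceil(size_t n)
{
    size_t p = 1;

    while (p < n)
        p <<= 1;
    return p;
}

/* Round n up to the nearest multiple of m. */
static size_t round_up(size_t n, size_t m)
{
    return (n + m - 1) / m * m;
}

/* Size actually reserved for a request of n bytes under the assumptions. */
size_t size_class(size_t n)
{
    if (n <= EX_QUANTUM / 2) {              /* tiny: powers of two */
        size_t p = pow2_ceil(n);
        return p < 2 ? 2 : p;
    }
    if (n <= EX_SMALL_MAX)                  /* multiples of the quantum */
        return round_up(n, EX_QUANTUM);
    if (n <= EX_PAGE / 2)                   /* powers of two */
        return pow2_ceil(n);
    if (n <= EX_ARENA_MAX)                  /* runs, assumed page multiples */
        return round_up(n, EX_PAGE);
    return round_up(n, EX_CHUNK);           /* multiples of the chunk size */
}
```

Under these assumptions a 100-byte request occupies 112 bytes, a 513-byte request occupies 1024 bytes, and a 5000-byte request occupies 8192 bytes.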
Allocations are packed tightly together, which can be an issue for
multi-threaded applications. If you need to assure that allocations do not
suffer from cache line sharing, round your allocation requests up to the
nearest multiple of the cache line size.
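A minimal sketch of that rounding follows. The 64-byte line size is an assumption, since the real value depends on the processor, and the names are hypothetical.

```c
#include <stddef.h>

/* Assumed cache line size; must be a power of two for the mask to work. */
#define EX_CACHE_LINE ((size_t)64)

/* Round a request up to the nearest multiple of the cache line size. */
size_t cacheline_round(size_t n)
{
    return (n + EX_CACHE_LINE - 1) & ~(EX_CACHE_LINE - 1);
}
```

A per-thread counter structure might then be allocated with malloc(cacheline_round(sizeof(struct counter))), where struct counter stands for whatever type the application uses.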
DEBUGGING
The first thing to do is to set the
A option. This option
forces a coredump (if possible) at the first sign of trouble, rather than the
normal policy of trying to continue if at all possible.
It is probably also a good idea to recompile the program with suitable options
and symbols for debugger support.
If the program starts to give unusual results, dump core, or generally behave
differently without emitting any of the messages described in the DIAGNOSTICS
section, it is likely because it depends on the storage being filled with zero
bytes. Try running it with the
Z option set; if that
improves the situation, this diagnosis has been confirmed. If the program
still misbehaves, the likely problem is accessing memory outside the allocated
area.
Alternatively, if the symptoms are not easy to reproduce, setting the
J option may help provoke the problem. In truly difficult
cases, the
U option, if supported by the kernel, can provide
a detailed trace of all calls made to these functions.
Unfortunately,
jemalloc does not provide much detail about the
problems it detects; the performance impact for storing such information would
be prohibitive. There are a number of allocator implementations available on
the Internet which focus on detecting and pinpointing problems by trading
performance for extra sanity checks and detailed diagnostics.
ENVIRONMENT
The following environment variables affect the execution of the allocation
functions:
MALLOC_OPTIONS
- If the environment variable
MALLOC_OPTIONS
is set, the characters it contains
will be interpreted as flags to the allocation functions.
EXAMPLES
To dump core whenever a problem occurs:
ln -s 'A' /etc/malloc.conf
To specify in the source that a program does no return value checking on calls
to these functions:
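A minimal sketch of such a program fragment is shown below. It assumes that a file-scope definition of _malloc_options in the program is the one the allocator reads on its first call; the declared type follows the SYNOPSIS. The helper copy_string() is hypothetical and only illustrates code that omits the check.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Assumption: this definition is the variable the allocator consults.
 * With X set, an allocation failure aborts the process instead of
 * returning NULL.
 */
const char *_malloc_options = "X";

/* Hypothetical helper that relies on X: no NULL check after malloc(). */
char *copy_string(const char *s)
{
    size_t len = strlen(s) + 1;
    char *p = malloc(len);

    memcpy(p, s, len);
    return p;
}
```

Code written this way is only safe on a system whose allocator honors the X option; elsewhere the return value still has to be checked.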
DIAGNOSTICS
If any of the memory allocation/deallocation functions detect an error or
warning condition, a message will be printed to file descriptor
STDERR_FILENO. Errors will result in the process
dumping core. If the
A option is set, all warnings are
treated as errors.
The
_malloc_message variable allows the programmer to
override the function which emits the text strings forming the errors and
warnings if for some reason the
stderr
file descriptor
is not suitable for this. Please note that doing anything which tries to
allocate memory in this function is likely to result in a crash or deadlock.
All messages are prefixed by “⟨progname⟩: (malloc)”.
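A replacement message function that avoids allocation might look like the following sketch. The hook signature is an assumption (four string fragments that together form one message), because this page does not state it. The names message_fd, last_message and fixed_buffer_message() are hypothetical.

```c
#include <string.h>
#include <unistd.h>

/* Hypothetical destination; a program might store a log descriptor here. */
int message_fd = STDERR_FILENO;

/* Static storage, so that building the message never allocates memory. */
char last_message[256];

/* Assumed hook signature: four fragments that are written back to back. */
void fixed_buffer_message(const char *p1, const char *p2,
    const char *p3, const char *p4)
{
    const char *parts[4] = { p1, p2, p3, p4 };
    size_t used = 0;

    for (int i = 0; i < 4; i++) {
        size_t room = sizeof(last_message) - 1 - used;
        size_t len = strlen(parts[i]);

        if (len > room)
            len = room;     /* truncate rather than overflow */
        memcpy(last_message + used, parts[i], len);
        used += len;
    }
    last_message[used] = '\0';

    /* write(2) needs no heap memory, unlike the stdio functions. */
    ssize_t written = write(message_fd, last_message, used);
    (void)written;
}

/*
 * On a system that provides the hook, installation would be a single
 * assignment made before the first allocation. It is shown as a comment
 * because the symbol is not portable:
 *
 *     _malloc_message = fixed_buffer_message;
 */
```

The function uses only a fixed buffer and write(2), which keeps it clear of the crash or deadlock described above.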
SEE ALSO
emalloc(3),
malloc(3),
memory(3),
memoryallocators(9)
Jason Evans, A Scalable Concurrent malloc(3) Implementation for FreeBSD,
http://people.freebsd.org/~jasone/jemalloc/bsdcan2006/jemalloc.pdf,
April 16, 2006, BSDCan 2006.
Poul-Henning Kamp, Malloc(3) revisited, Proceedings of the FREENIX Track:
1998 USENIX Annual Technical Conference, USENIX Association,
http://www.usenix.org/publications/library/proceedings/usenix98/freenix/kamp.pdf,
June 15-19, 1998.
Paul R. Wilson, Mark S. Johnstone, Michael Neely, and David Boles,
Dynamic Storage Allocation: A Survey and Critical Review,
University of Texas at Austin,
ftp://ftp.cs.utexas.edu/pub/garbage/allocsrv.ps, 1995.
HISTORY
The
jemalloc allocator became the default system allocator
first in
FreeBSD 7.0 and then in
NetBSD 5.0. In both systems it replaced the older
so-called “phkmalloc” implementation.
AUTHORS
Jason Evans <jasone@canonware.com>