Add doc describing native allocator.

This includes tests that a native allocator has to pass, and the ways to benchmark a native allocator on Android.

Test: NA
Change-Id: I5e1f4b3eea46ee2240e63a487ca3dfabecff81c0
parent b2bd7c575e
commit 4316d43f4d
|
@ -0,0 +1,340 @@
|
|||
# Native Memory Allocator Verification
|
||||
This document describes how to verify the native memory allocator on Android.
|
||||
This procedure should be followed when upgrading or moving to a new allocator.
|
||||
A minor upgrade might not need to run all of the benchmarks; however,
|
||||
at least the
|
||||
[SQL Allocation Trace Benchmark](#sql-allocation-trace-benchmark),
|
||||
[Memory Replay Benchmarks](#memory-replay-benchmarks) and
|
||||
[Performance Trace Benchmarks](#performance-trace-benchmarks) should be run.
|
||||
|
||||
It is important to note that there are two modes for a native allocator
|
||||
to run in on Android. The first is the normal allocator; the second is
|
||||
called the svelte config, which is designed to run on memory constrained
|
||||
systems and be a bit slower, but take less PSS (proportional set size). To enable the svelte config,
|
||||
add this line to the `BoardConfig.mk` for the given target:
|
||||
|
||||
MALLOC_SVELTE := true
|
||||
|
||||
The `BoardConfig.mk` file is usually found in the directory
|
||||
`device/<DEVICE_NAME>/` or in a subdirectory.
|
||||
|
||||
When evaluating a native allocator, make sure that you benchmark both
|
||||
versions.
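
Before benchmarking, it can help to confirm which config a target uses. The function below is a minimal sketch: `is_svelte_enabled` is a hypothetical helper name, and it only looks for the exact `MALLOC_SVELTE := true` form shown above in the one file it is given, so a setting made in an included makefile would be missed.

```shell
# Succeed if the given BoardConfig.mk enables the svelte allocator config.
# Usage: is_svelte_enabled device/<DEVICE_NAME>/BoardConfig.mk
is_svelte_enabled() {
    # Match "MALLOC_SVELTE := true", tolerating extra whitespace.
    grep -Eq '^[[:space:]]*MALLOC_SVELTE[[:space:]]*:=[[:space:]]*true[[:space:]]*$' "$1"
}
```

For example, `is_svelte_enabled device/<DEVICE_NAME>/BoardConfig.mk && echo svelte` prints `svelte` only when the line is present.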
|
||||
|
||||
## Android Extensions
|
||||
Android supports a few non-standard functions and mallopt controls that
|
||||
a native allocator needs to implement.
|
||||
|
||||
### Iterator Functions
|
||||
These are functions that are used to implement a memory leak detector
|
||||
called `libmemunreachable`.
|
||||
|
||||
#### malloc\_disable
|
||||
This function, when called, should pause all threads that are making a
|
||||
call to an allocation function (malloc/free/etc). When a call
|
||||
is made to `malloc_enable`, the paused threads should start running again.
|
||||
|
||||
#### malloc\_enable
|
||||
This function, when called, does nothing unless there was a previous call
|
||||
to `malloc_disable`. This call will unpause any thread which is making
|
||||
a call to an allocation function (malloc/free/etc) when `malloc_disable`
|
||||
was called previously.
|
||||
|
||||
#### malloc\_iterate
|
||||
This function enumerates all of the allocations currently live in the
|
||||
system. It is meant to be called after a call to `malloc_disable` to
|
||||
prevent further allocations while this call is being executed. To
|
||||
see what is expected for this function, the best description is the
|
||||
tests for this function in `bionic/tests/malloc_iterate_test.cpp`.
|
||||
|
||||
### Mallopt Extensions
|
||||
These are mallopt options that Android requires for a native allocator
|
||||
to work efficiently.
|
||||
|
||||
#### M\_DECAY\_TIME
|
||||
When set to zero, `mallopt(M_DECAY_TIME, 0)`, it is expected that an
|
||||
allocator will attempt to purge and release any unused memory back to the
|
||||
kernel on free calls. This is important in Android to avoid consuming extra
|
||||
PSS.
|
||||
|
||||
When set to non-zero, `mallopt(M_DECAY_TIME, 1)`, an allocator can delay the
|
||||
purge and release action. The amount of delay is up to the allocator
|
||||
implementation, but it should be a reasonable amount of time. The jemalloc
|
||||
allocator was implemented to have a one second delay.
|
||||
|
||||
The drawback to this option is that most allocators do not have a separate
|
||||
thread to handle the purge, so the decay is only handled when an
|
||||
allocation operation occurs. For server processes, this can mean that
|
||||
PSS is slightly higher when the server is waiting for the next connection
|
||||
and no other allocation calls are made. The `M_PURGE` option is used to
|
||||
force a purge in this case.
|
||||
|
||||
For all applications on Android, the call `mallopt(M_DECAY_TIME, 1)` is
|
||||
made by default. The idea is that it allows application frees to run a
|
||||
bit faster, while only increasing PSS a bit.
|
||||
|
||||
#### M\_PURGE
|
||||
When called, `mallopt(M_PURGE, 0)`, an allocator should purge and release
|
||||
any unused memory immediately. The argument for this call is ignored. If
|
||||
possible, this call should clear thread cached memory if it exists. The
|
||||
idea is that this can be called to purge memory that has not been
|
||||
purged when `M_DECAY_TIME` is set to one. This is useful if you have a
|
||||
server application that does a lot of native allocations and the
|
||||
application wants to purge that memory before waiting for the next connection.
|
||||
|
||||
## Correctness Tests
|
||||
These are the tests that should be run to verify an allocator is
|
||||
working properly according to Android.
|
||||
|
||||
### Bionic Unit Tests
|
||||
The bionic unit tests contain a small number of allocator tests. These
|
||||
tests are primarily verifying Android extensions and non-standard behavior
|
||||
of allocation routines such as what happens when a non-power of two alignment
|
||||
is passed to memalign.
|
||||
|
||||
To run all of the compliance tests:
|
||||
|
||||
adb shell /data/nativetest64/bionic-unit-tests/bionic-unit-tests --gtest_filter="malloc*"
|
||||
adb shell /data/nativetest/bionic-unit-tests/bionic-unit-tests --gtest_filter="malloc*"
|
||||
|
||||
The allocation tests are not meant to be complete, so it is expected
|
||||
that a native allocator will have its own set of tests that can be run.
|
||||
|
||||
### CTS Entropy Test
|
||||
In addition to the bionic tests, there is also a CTS test that is designed
|
||||
to verify that the addresses returned by malloc are sufficiently randomized
|
||||
to help defeat potential security bugs.
|
||||
|
||||
Run this test thusly:
|
||||
|
||||
atest AslrMallocTest
|
||||
|
||||
If there are multiple devices connected to the system, use `-s <SERIAL>`
|
||||
to specify a device.
|
||||
|
||||
## Performance
|
||||
There are multiple ways to evaluate the performance of a native
|
||||
allocator on Android. One is allocation speed in various scenarios,
|
||||
another is total PSS taken by the allocator.
|
||||
|
||||
The last is virtual address space consumed in 32 bit applications. There is
|
||||
a limited amount of address space available in 32 bit apps, and there have
|
||||
been allocator bugs that cause memory failures when too much virtual
|
||||
address space is consumed. For 64 bit executables, this can be ignored.
|
||||
|
||||
### Bionic Benchmarks
|
||||
These are the microbenchmarks that are part of the bionic suite of
|
||||
benchmarks. These benchmarks can be built using this command:
|
||||
|
||||
mmma -j bionic/benchmarks
|
||||
|
||||
These benchmarks are only used to verify the speed of the allocator and
|
||||
ignore anything related to PSS and virtual address space consumed.
|
||||
|
||||
#### Allocate/Free Benchmarks
|
||||
These are the benchmarks to verify the allocation speed of a loop doing a
|
||||
single allocation, touching every page in the allocation to make it resident
|
||||
and then freeing the allocation.
|
||||
|
||||
To run the benchmarks with `mallopt(M_DECAY_TIME, 0)`, use these commands:
|
||||
|
||||
adb shell /data/benchmarktest64/bionic-benchmarks/bionic-benchmarks --benchmark_filter=stdlib_malloc_free_default
|
||||
adb shell /data/benchmarktest/bionic-benchmarks/bionic-benchmarks --benchmark_filter=stdlib_malloc_free_default
|
||||
|
||||
To run the benchmarks with `mallopt(M_DECAY_TIME, 1)`, use these commands:
|
||||
|
||||
adb shell /data/benchmarktest64/bionic-benchmarks/bionic-benchmarks --benchmark_filter=stdlib_malloc_free_decay1
|
||||
adb shell /data/benchmarktest/bionic-benchmarks/bionic-benchmarks --benchmark_filter=stdlib_malloc_free_decay1
|
||||
|
||||
The last value in the output is the size of the allocation in bytes. It is
|
||||
useful to look at these kinds of benchmarks to make sure that there are
|
||||
no outliers, but these numbers should not be used to make a final decision.
|
||||
If these numbers are slightly worse than the current allocator, the
|
||||
single thread numbers from trace data are a better representation of
|
||||
real world situations.
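
The four commands above differ only in the benchmark directory and the filter suffix, so they can be driven from one loop. This is a sketch, not part of bionic: `run_bionic_benchmark` and the `ADB` override variable are hypothetical names, and it assumes `adb shell` reports the exit status of the remote command.

```shell
# Run one bionic benchmark family for both ABIs and both decay settings.
# $1 is the benchmark name without its suffix, e.g. stdlib_malloc_free.
# Set ADB=echo to print the adb arguments instead of running them.
run_bionic_benchmark() {
    name="$1"
    for dir in benchmarktest64 benchmarktest; do
        for mode in default decay1; do
            "${ADB:-adb}" shell \
                "/data/$dir/bionic-benchmarks/bionic-benchmarks" \
                "--benchmark_filter=${name}_${mode}" || return 1
        done
    done
}
```

Calling `run_bionic_benchmark stdlib_malloc_free` runs all four variants in sequence. The same helper should work for the other benchmark names in this document that follow the `_default` / `_decay1` suffix convention.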
|
||||
|
||||
#### Multiple Allocations Retained Benchmarks
|
||||
These are the benchmarks that examine how the allocator handles multiple
|
||||
allocations of the same size at the same time.
|
||||
|
||||
The first set of these benchmarks does a set number of 8192 byte allocations
|
||||
in one loop, and then frees all of the allocations at the end of the loop.
|
||||
Only the time it takes to do the allocations is recorded; the frees are not
|
||||
counted. The value of 8192 was chosen since the jemalloc native allocator
|
||||
had issues with this size. It is possible other sizes might show different
|
||||
results, but, as mentioned before, these microbenchmark numbers should
|
||||
not be used as absolutes for determining if an allocator is worth using.
|
||||
|
||||
This benchmark is designed to verify that there is no performance issue
|
||||
related to having multiple allocations alive at the same time.
|
||||
|
||||
To run the benchmarks with `mallopt(M_DECAY_TIME, 0)`, use these commands:
|
||||
|
||||
adb shell /data/benchmarktest64/bionic-benchmarks/bionic-benchmarks --benchmark_filter=stdlib_malloc_multiple_8192_allocs_default
|
||||
adb shell /data/benchmarktest/bionic-benchmarks/bionic-benchmarks --benchmark_filter=stdlib_malloc_multiple_8192_allocs_default
|
||||
|
||||
To run the benchmarks with `mallopt(M_DECAY_TIME, 1)`, use these commands:
|
||||
|
||||
adb shell /data/benchmarktest64/bionic-benchmarks/bionic-benchmarks --benchmark_filter=stdlib_malloc_multiple_8192_allocs_decay1
|
||||
adb shell /data/benchmarktest/bionic-benchmarks/bionic-benchmarks --benchmark_filter=stdlib_malloc_multiple_8192_allocs_decay1
|
||||
|
||||
For these benchmarks, the last parameter is the total number of allocations to
|
||||
do in each loop.
|
||||
|
||||
The other variation of this benchmark is to always do forty allocations in
|
||||
each loop, but vary the size of the forty allocations. As with the other
|
||||
benchmark, only the time it takes to do the allocations is tracked; the
|
||||
frees are not counted. Forty allocations is an arbitrary number that could
|
||||
be modified in the future. It was chosen because a version of the native
|
||||
allocator, jemalloc, showed a problem at forty allocations.
|
||||
|
||||
To run the benchmarks with `mallopt(M_DECAY_TIME, 0)`, use these commands:
|
||||
|
||||
adb shell /data/benchmarktest64/bionic-benchmarks/bionic-benchmarks --benchmark_filter=stdlib_malloc_forty_default
|
||||
adb shell /data/benchmarktest/bionic-benchmarks/bionic-benchmarks --benchmark_filter=stdlib_malloc_forty_default
|
||||
|
||||
To run the benchmarks with `mallopt(M_DECAY_TIME, 1)`, use these commands:
|
||||
|
||||
adb shell /data/benchmarktest64/bionic-benchmarks/bionic-benchmarks --benchmark_filter=stdlib_malloc_forty_decay1
|
||||
adb shell /data/benchmarktest/bionic-benchmarks/bionic-benchmarks --benchmark_filter=stdlib_malloc_forty_decay1
|
||||
|
||||
For these benchmarks, the last parameter in the output is the size of the
|
||||
allocation in bytes.
|
||||
|
||||
As with the other microbenchmarks, an allocator with numbers in the same
|
||||
range as the current values is usually sufficient to consider making
|
||||
a switch. The trace benchmarks are more important than these benchmarks
|
||||
since they simulate real world allocation profiles.
|
||||
|
||||
#### SQL Allocation Trace Benchmark
|
||||
This benchmark is a trace of the allocations performed when running
|
||||
the SQLite BenchMark app.
|
||||
|
||||
This benchmark is designed to verify that the allocator will be performant
|
||||
in a real world allocation scenario. SQL operations were chosen as a
|
||||
benchmark because these operations tend to do lots of malloc/realloc/free
|
||||
calls, and they tend to be on the critical path of applications.
|
||||
|
||||
To run the benchmarks with `mallopt(M_DECAY_TIME, 0)`, use these commands:
|
||||
|
||||
adb shell /data/benchmarktest64/bionic-benchmarks/bionic-benchmarks --benchmark_filter=malloc_sql_trace_default
|
||||
adb shell /data/benchmarktest/bionic-benchmarks/bionic-benchmarks --benchmark_filter=malloc_sql_trace_default
|
||||
|
||||
To run the benchmarks with `mallopt(M_DECAY_TIME, 1)`, use these commands:
|
||||
|
||||
adb shell /data/benchmarktest64/bionic-benchmarks/bionic-benchmarks --benchmark_filter=malloc_sql_trace_decay1
|
||||
adb shell /data/benchmarktest/bionic-benchmarks/bionic-benchmarks --benchmark_filter=malloc_sql_trace_decay1
|
||||
|
||||
These numbers should be at least as good as those of the current allocator.
|
||||
|
||||
### Memory Trace Benchmarks
|
||||
These benchmarks measure all three axes of a native allocator: PSS, virtual
|
||||
address space consumed, and speed of allocation. They are designed to
|
||||
run on a trace of the allocations from a real world application or system
|
||||
process.
|
||||
|
||||
To build this benchmark:
|
||||
|
||||
mmma -j system/extras/memory_replay
|
||||
|
||||
This will build two executables:
|
||||
|
||||
/system/bin/memory_replay32
|
||||
/system/bin/memory_replay64
|
||||
|
||||
And these two benchmark executables:
|
||||
|
||||
/data/benchmarktest64/trace_benchmark/trace_benchmark
|
||||
/data/benchmarktest/trace_benchmark/trace_benchmark
|
||||
|
||||
#### Memory Replay Benchmarks
|
||||
These benchmarks display PSS, virtual memory consumed (VA space), and do a
|
||||
bit of performance testing on actual traces taken from running applications.
|
||||
|
||||
The trace data includes what thread does each operation, so the replay
|
||||
mechanism will simulate this by creating threads and replaying the operations
|
||||
on a thread as if it was rerunning the real trace. The only issue is that
|
||||
this is a worst case scenario for allocations happening at the same time
|
||||
in all threads since it collapses all of the allocation operations to occur
|
||||
one after another. This will cause a lot of threads to allocate at the same
|
||||
time. The trace data does not include timestamps,
|
||||
so it is not possible to create a completely accurate replay.
|
||||
|
||||
To generate these traces, see the [Malloc Debug documentation](https://android.googlesource.com/platform/bionic/+/master/libc/malloc_debug/README.md),
|
||||
the option [record\_allocs](https://android.googlesource.com/platform/bionic/+/master/libc/malloc_debug/README.md#record_allocs_total_entries).
|
||||
|
||||
To run these benchmarks, first copy the trace files to the target and
|
||||
unzip them using these commands:
|
||||
|
||||
adb push system/extras/dumps /data/local/tmp
|
||||
adb shell 'cd /data/local/tmp/dumps && for name in *.zip; do unzip $name; done'
|
||||
|
||||
Since all of the traces come from applications, the `memory_replay` program
|
||||
will always call `mallopt(M_DECAY_TIME, 1)` before running the trace.
|
||||
|
||||
Run the benchmark thusly:
|
||||
|
||||
adb shell memory_replay64 /data/local/tmp/dumps/XXX.txt
|
||||
adb shell memory_replay32 /data/local/tmp/dumps/XXX.txt
|
||||
|
||||
Where XXX.txt is the name of a trace file.
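
To replay every trace in one pass, the two commands above can be wrapped in a loop. The sketch below assumes the traces were unzipped into `/data/local/tmp/dumps` as described earlier and that the file names contain no spaces; `replay_all_traces` and the `ADB` override variable are hypothetical names.

```shell
# Replay every .txt trace in a device directory with both executables.
# $1 is the on-device directory; it defaults to /data/local/tmp/dumps.
replay_all_traces() {
    dir="${1:-/data/local/tmp/dumps}"
    # List the trace files on the device and strip any carriage returns.
    traces=$("${ADB:-adb}" shell "ls $dir/*.txt" | tr -d '\r')
    # Fail if no trace files were found.
    [ -n "$traces" ] || return 1
    for trace in $traces; do
        for replay in memory_replay64 memory_replay32; do
            echo "== $replay $trace"
            "${ADB:-adb}" shell "$replay" "$trace" || return 1
        done
    done
}
```

Running `replay_all_traces > replay.log` keeps the PSS and VA space output for every trace in one file for later comparison.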
|
||||
|
||||
Every 100000 allocation operations, a dump of the PSS and VA space will be
|
||||
performed. At the end, a final PSS and VA space number will be printed.
|
||||
For the most part, the intermediate data can be ignored, but it is always
|
||||
a good idea to look over the data to verify that no strange spikes are
|
||||
occurring.
|
||||
|
||||
The performance number is a measure of the time it takes to perform all of
|
||||
the allocation calls (malloc/memalign/posix_memalign/realloc/free/etc).
|
||||
For any call that allocates a pointer, the time for the call and the time
|
||||
it takes to make the pointer completely resident in memory is included.
|
||||
|
||||
The performance numbers for these runs tend to have a wide variability so
|
||||
they should not be used as absolute values for comparison against the
|
||||
current allocator. But, they should be in the same range as the current
|
||||
values.
|
||||
|
||||
When evaluating an allocator, one of the most important traces is the
|
||||
camera.txt trace. The camera application does very large allocations,
|
||||
and some allocators might leave large virtual address maps around
|
||||
rather than delete them. When that happens, it can lead to allocation
|
||||
failures and would cause the camera app to abort/crash. It is
|
||||
important to verify that when running this trace using the 32 bit replay
|
||||
executable, the virtual address space consumed is not much larger than the
|
||||
current allocator. A small increase (on the order of a few MBs) would be okay.
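
The comparison can be made mechanical once the final VA space numbers for both allocators have been read from the replay output by hand. The function below is only a sketch: `va_increase_ok` is a hypothetical helper, both sizes are assumed to be in bytes, and the 5 MB default limit is an assumption standing in for "a few MBs", not a documented threshold.

```shell
# Succeed if the candidate allocator's VA usage exceeds the baseline by at
# most limit_mb megabytes. $1 and $2 are integer byte counts; $3 is the
# limit in MB and defaults to 5, which is an assumed value.
va_increase_ok() {
    baseline="$1"
    candidate="$2"
    limit_mb="${3:-5}"
    increase=$((candidate - baseline))
    [ "$increase" -le $((limit_mb * 1024 * 1024)) ]
}
```

For example, `va_increase_ok 100000000 102000000` succeeds because the increase is about 2 MB, while `va_increase_ok 100000000 200000000` fails.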
|
||||
|
||||
NOTE: When a native allocator calls mmap, it is expected that the allocator
|
||||
will name the map using the call:
|
||||
|
||||
prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, <PTR>, <SIZE>, "libc_malloc");
|
||||
|
||||
If the native allocator creates a different name, then it is necessary to
|
||||
modify the file:
|
||||
|
||||
system/extras/memory_replay/NativeInfo.cpp
|
||||
|
||||
The `GetNativeInfo` function needs to be modified to include the name
|
||||
of the maps that this allocator includes.
|
||||
|
||||
In addition, in order for the frameworks code to keep track of the memory
|
||||
of a process, any named maps must be added to the file:
|
||||
|
||||
frameworks/base/core/jni/android_os_Debug.cpp
|
||||
|
||||
Modify the `load_maps` function and add a check of the new expected name.
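
One way to check which name an allocator gives its maps is to look at a process's maps file, where named anonymous maps are listed as `[anon:<name>]`. The helper below is a sketch that works on a saved copy of the file; `count_named_maps` is a hypothetical name, and reading `/proc/<PID>/maps` for another process on the device usually requires root.

```shell
# Count the mappings in a saved /proc/<PID>/maps file that carry the given
# anonymous-map name. $2 defaults to libc_malloc.
count_named_maps() {
    maps_file="$1"
    name="${2:-libc_malloc}"
    # -F treats the brackets literally; -c prints the number of matching lines.
    grep -cF "[anon:$name]" "$maps_file"
}
```

A copy of the file can be taken with `adb shell cat /proc/<PID>/maps > maps.txt`, after which `count_named_maps maps.txt` prints how many maps are named `libc_malloc`. A count of zero for a process that is known to allocate suggests the allocator uses a different name.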
|
||||
|
||||
#### Performance Trace Benchmarks
|
||||
This is a benchmark that treats the trace data as if all allocations
|
||||
occurred in a single thread. This is the scenario that could
|
||||
happen if all of the allocations are spaced out in time so no thread
|
||||
ever does an allocation at the same time as another thread.
|
||||
|
||||
Run these benchmarks thusly:
|
||||
|
||||
adb shell /data/benchmarktest64/trace_benchmark/trace_benchmark
|
||||
adb shell /data/benchmarktest/trace_benchmark/trace_benchmark
|
||||
|
||||
When run without any arguments, the benchmark will run over all of the
|
||||
traces and display data. It takes many minutes to complete these runs in
|
||||
order to get as accurate a number as possible.
|