Measuring Performance

This guide covers:

  • how we measure memory usage

  • how we measure startup time

  • which additional flags Quarkus applies to native-image by default

  • the Coordinated Omission problem in tools

All of our tests are run on the same hardware for a given batch. It goes without saying, but it’s better when you say it.

How do we measure memory usage

When measuring the footprint of a Quarkus application, we measure Resident Set Size (RSS) and not the JVM heap size, which is only a small part of the overall problem. The JVM not only allocates native memory for the heap (-Xms, -Xmx) but also for the structures the JVM itself needs to run your application. Depending on the JVM implementation, the total memory allocated for an application will include, but is not limited to:

  • Heap space

  • Class metadata

  • Thread stacks

  • Compiled code

  • Garbage collection

Native Memory Tracking

In order to view the native memory used by the JVM, you can enable the Native Memory Tracking (NMT) feature in HotSpot.

Enable NMT on the command line:

-XX:NativeMemoryTracking=[off | summary | detail]

NOTE: enabling this feature adds a 5-10% performance overhead.

It is then possible to use jcmd to dump a report of the native memory usage of the HotSpot JVM running your application:

jcmd <pid> VM.native_memory [summary | detail | baseline | summary.diff | detail.diff | shutdown] [scale= KB | MB | GB]
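The report is easy to post-process with standard tools. As a rough sketch, the Total line of a summary report can be reduced to a single committed-memory figure; the sample line below is illustrative, not output from a real run:

```shell
# Sample "Total" line in the format printed by
# `jcmd <pid> VM.native_memory summary` (illustrative values):
nmt_total='Total: reserved=1851444KB, committed=545788KB'

# Extract the committed kilobytes and convert to MB
echo "$nmt_total" \
  | sed 's/.*committed=\([0-9]*\)KB.*/\1/' \
  | awk '{ printf "committed: %.0f MB\n", $1 / 1024 }'
```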

Cloud Native Memory Limits

It is important to measure the whole memory to see the impact of a cloud native application. This is particularly true in container environments, which will kill a process based on its full RSS memory usage.

Likewise, don’t fall into the trap of only measuring private memory: the memory the process uses that is not shareable with other processes. While private memory might be a useful measure in an environment deploying many different applications (and thus sharing a lot of memory), it is very misleading in environments like Kubernetes/OpenShift.

Measuring Memory Correctly on Docker

In order to measure memory correctly, DO NOT use docker stats or anything derived from it (e.g. ctop). This approach only measures a subset of the in-use resident pages, while the Linux kernel, cgroups and cloud orchestration providers account for the full resident set (when determining whether a process is over its limits and should be killed).

To measure accurately, a similar set of steps to those for measuring RSS on Linux should be performed. The docker top command allows running a ps command on the container host against the processes in the container instance. By combining this with output formatting parameters, the rss value can be returned:

docker top <CONTAINER ID> -o pid,rss,args

For example:

 $ docker top $(docker ps -q --filter ancestor=quarkus/myapp) -o pid,rss,args

PID                 RSS                 COMMAND
2531                27m                 ./application -Dquarkus.http.host=0.0.0.0

Alternatively, one can jump directly into a privileged shell (root on the host), and execute a ps command directly:

 $ docker run -it --rm --privileged --pid=host justincormack/nsenter1 /bin/ps -e -o pid,rss,args | grep application
 2531  27m ./application -Dquarkus.http.host=0.0.0.0

If you happen to be running on Linux, you can execute the ps command directly, since your shell is the same as the container host:

ps -e -o pid,rss,args | grep application
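The rss column printed by ps is in kilobytes. As a small sketch (piping a captured sample line rather than live ps output, so the numbers are reproducible), it can be converted to megabytes with awk:

```shell
# A captured ps-style sample line: pid, rss (KB), command
sample='2531 27648 ./application -Dquarkus.http.host=0.0.0.0'

# Convert the rss column from KB to MB
echo "$sample" | awk '{ printf "PID %s uses %.1f MB RSS\n", $1, $2 / 1024 }'
```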

Platform Specific Memory Reporting

In order to not incur the performance overhead of running with NMT enabled, we measure the total RSS of a JVM application using tools specific to each platform.

Linux

The Linux pmap and ps tools provide a report of the native memory map for a process:

 $ ps -o pid,rss,command -p <pid>

   PID   RSS COMMAND
 11229 12628 ./target/getting-started-1.0.0-SNAPSHOT-runner
 $ pmap -x <pid>

 13150:   /data/quarkus-application -Xmx100m -Xmn70m
 Address           Kbytes     RSS   Dirty Mode  Mapping
 0000000000400000   55652   30592       0 r-x-- quarkus-application
 0000000003c58000       4       4       4 r-x-- quarkus-application
 0000000003c59000    5192    4628     748 rwx-- quarkus-application
 00000000054c0000     912     156     156 rwx--   [ anon ]
 ...
 00007fcd13400000    1024    1024    1024 rwx--   [ anon ]
 ...
 00007fcd13952000       8       4       0 r-x-- libfreebl3.so
 ...
 ---------------- ------- ------- -------
 total kB         9726508  256092  220900

Each memory region that has been allocated for the process is listed:

  • Address: Start address of virtual address space

  • Kbytes: Size (kilobytes) of virtual address space reserved for region

  • RSS: Resident set size (kilobytes). This is the measure of how much memory space is actually being used

  • Dirty: dirty pages (both shared and private) in kilobytes

  • Mode: Access mode for memory region

  • Mapping: Includes application regions and Shared Object (.so) mappings for process

The RSS column of the total kB line reports the total native memory the process is using.
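As a sanity check, the RSS column can be summed back up per region with awk; the rows below are copied from the truncated listing above, so this only reproduces a partial total:

```shell
# Three rows from the pmap listing: address, Kbytes, RSS (KB), Dirty, ...
printf '%s\n' \
  '0000000000400000   55652   30592       0 r-x-- quarkus-application' \
  '0000000003c58000       4       4       4 r-x-- quarkus-application' \
  '0000000003c59000    5192    4628     748 rwx-- quarkus-application' \
  | awk '{ rss += $3 } END { printf "partial RSS: %d kB\n", rss }'
```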

macOS

On macOS, you can use ps x -o pid,rss,command -p <PID>, which lists the RSS for a given process in KB (1024 bytes).

$ ps x -o pid,rss,command -p 57160

  PID    RSS COMMAND
57160 288548 /Applications/IntelliJ IDEA CE.app/Contents/jdk/Contents/Home/jre/bin/java

This means IntelliJ IDEA consumes 281.8 MB of resident memory.
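That figure is simply the RSS column (in KB) divided by 1024:

```shell
# 288548 KB reported by ps, converted to MB
awk 'BEGIN { printf "%.1f MB\n", 288548 / 1024 }'
```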

How do we measure startup time

Some frameworks use aggressive lazy initialization techniques. It is important to measure the startup time to first request to most accurately reflect how long a framework needs to start. Otherwise, you will miss the time the framework actually takes to initialize.

Here is how we measure startup time in our tests.

We create a sample application that logs timestamps for certain points in the application lifecycle.

import java.text.SimpleDateFormat;
import java.util.Date;

import jakarta.enterprise.event.Observes;
import jakarta.ws.rs.GET;
import jakarta.ws.rs.Path;
import jakarta.ws.rs.Produces;
import jakarta.ws.rs.QueryParam;
import jakarta.ws.rs.core.MediaType;

import io.quarkus.runtime.StartupEvent;

@Path("/")
public class GreetingEndpoint {

    private static final String template = "Hello, %s!";

    @GET
    @Path("/greeting")
    @Produces(MediaType.APPLICATION_JSON)
    public Greeting greeting(@QueryParam("name") String name) {
        // Logs the time at which the request is served
        System.out.println(new SimpleDateFormat("HH:mm:ss.SSS").format(new Date()));
        String suffix = name != null ? name : "World";
        return new Greeting(String.format(template, suffix));
    }

    void onStart(@Observes StartupEvent startup) {
        // Logs the time at which the application has started
        System.out.println(new SimpleDateFormat("HH:mm:ss.SSS").format(new Date()));
    }
}

We start a loop in a shell, sending requests to the REST endpoint of the sample application we are testing:

$ while [[ "$(curl -s -o /dev/null -w '%{http_code}' localhost:8080/greeting)" != "200" ]]; do sleep .00001; done

In a separate terminal, we start the application we are timing, printing the time at which it starts:

$ date +"%T.%3N" &&  ./target/quarkus-timing-runner

10:57:32.508
10:57:32.512
2019-04-05 10:57:32,512 INFO  [io.quarkus] (main) Quarkus 0.11.0 started in 0.002s. Listening on: http://127.0.0.1:8080
2019-04-05 10:57:32,512 INFO  [io.quarkus] (main) Installed features: [cdi, rest, rest-jackson]
10:57:32.537

The difference between the final timestamp and the first timestamp is the total startup time for the application to serve the first request.
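A sketch of that subtraction, using the first and last timestamps from the run above:

```shell
# First timestamp (printed by `date`) and last timestamp (printed by the
# endpoint when the first request is served)
start='10:57:32.508'
first_request='10:57:32.537'

# Convert HH:mm:ss.SSS to milliseconds and subtract
awk -v a="$start" -v b="$first_request" 'BEGIN {
  split(a, s, "[:.]"); split(b, e, "[:.]")
  ms_a = ((s[1] * 60 + s[2]) * 60 + s[3]) * 1000 + s[4]
  ms_b = ((e[1] * 60 + e[2]) * 60 + e[3]) * 1000 + e[4]
  printf "time to first request: %d ms\n", ms_b - ms_a
}'
```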

Additional flags applied by Quarkus

When Quarkus invokes GraalVM native-image, it applies some additional flags by default.

You might want to know about the following flags in case you’re comparing performance characteristics with other builds.

Disable fallback images

Fallback native images are a feature of GraalVM to "fall back" to run your application in the normal JVM, should the compilation to native code fail for some reason.

Quarkus disables this feature by setting -H:FallbackThreshold=0: this ensures you get a compilation failure rather than risking not noticing that the application is unable to actually run in native mode.

If you instead want to just run in Java mode, that’s totally possible: just skip the native-image build and run it as a jar.

Disable Isolates

Isolates are a neat feature of GraalVM, but Quarkus isn’t using them at this stage.

Disable via -H:-SpawnIsolates.

Disable auto-registration of all Service Loader implementations

Quarkus extensions can automatically pick the right services they need, while GraalVM’s native-image defaults to including all services it can find on the classpath.

We prefer listing services explicitly, as it produces better optimised binaries. Disable this behaviour as well by setting -H:-UseServiceLoaderFeature.

Others …​

This section is provided as high-level guidance, but cannot presume to be comprehensive, as some flags are controlled dynamically by extensions, the platform you’re building on, configuration details, your code, and possibly a combination of these.

Generally speaking, the flags listed here are those most likely to affect performance metrics, but in the right circumstances one could observe non-negligible impact from other flags too.

If you’re going to investigate some differences in detail, make sure to check what Quarkus is invoking exactly: when the build plugin produces a native image, the full command line is logged.

Coordinated Omission Problem in Tools

When measuring the performance of a framework like Quarkus, the latency experienced by users is especially interesting, and there are many different tools for measuring it. Unfortunately, many of them fail to measure latency correctly and instead exhibit the Coordinated Omission problem: the tools fail to account for the delay in submitting new requests when the system is under load, and aggregate the numbers in a way that makes the latency and throughput figures very misleading.
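A toy illustration with made-up numbers: a load generator intends to send one request every 10 ms for 10 seconds, and the system stalls for 1 second. A closed-loop tool records a single 1000 ms sample during the stall; on the intended fixed schedule, the requests queued behind the stall would have observed latencies of 1000, 990, …, 10 ms:

```shell
awk 'BEGIN {
  interval = 10                    # intended request interval (ms)
  fast = 10                        # normal response time (ms)
  n = 900; sum = n * fast          # 9 s of the run behaves normally

  # Closed-loop tool: the 1 s stall contributes one 1000 ms sample
  naive_sum = sum + 1000; naive_n = n + 1

  # Fixed schedule: queued requests see 1000, 990, ..., 10 ms
  for (l = 1000; l >= interval; l -= interval) { sum += l; n++ }

  printf "closed-loop mean: %.1f ms, corrected mean: %.1f ms\n",
         naive_sum / naive_n, sum / n
}'
```

The closed-loop tool dramatically underestimates the mean latency because it simply stops sampling while it waits for the stalled response.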

A good walkthrough of the issue is this video, where Gil Tene, the author of wrk2, explains the problem; Quarkus Insights #22 has John O’Hara from the Quarkus performance team showing how it can show up.

Although that video and the related papers and articles all date back to 2015, even today you will find tools that fall short on the coordinated omission problem.

Tools that, at the time of writing, are known to exhibit the problem and should NOT be used for measuring latency/throughput (they may be used for other things):

  • JMeter

  • wrk

Tools that are known to not be affected are:

  • wrk2

  • hyperfoil

Mind you, these tools are no better than your own understanding of what they measure; thus, even when using wrk2 or hyperfoil, make sure you understand what the numbers mean.