add slow request sampler #4978

Dogacel · 2023-06-21T12:09:41Z

Motivation:

Add

Modifications:

Implement and() and or() operations for samplers.
Implement a percentile sampler that samples a value when it falls at or above the x-th percentile.
Update the logging decorator to support sampling based on percentile slowness:
- Give a lower bound: a request faster than this bound is never sampled, even if it is on the x-th percentile, because it is still fast enough.
- Give an upper bound: a request slower than this bound is always logged, even if it is not on the x-th percentile, because that latency is not acceptable.
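The interaction between the percentile check and the two bounds can be sketched as follows. This is a minimal illustration, not the code in this PR: the class name SlowRequestDecision and the LongPredicate standing in for the percentile sampler are hypothetical.

```java
import java.util.function.LongPredicate;

// Hypothetical sketch: combines a percentile decision with a lower and an upper bound.
final class SlowRequestDecision {
    private final LongPredicate percentileSampler; // stand-in for a real percentile sampler
    private final long lowerBoundMillis;
    private final long upperBoundMillis;

    SlowRequestDecision(LongPredicate percentileSampler, long lowerBoundMillis, long upperBoundMillis) {
        this.percentileSampler = percentileSampler;
        this.lowerBoundMillis = lowerBoundMillis;
        this.upperBoundMillis = upperBoundMillis;
    }

    boolean isSampled(long durationMillis) {
        // Always consult the percentile sampler first so that it observes every request.
        final boolean onPercentile = percentileSampler.test(durationMillis);
        if (durationMillis < lowerBoundMillis) {
            return false; // too fast to be interesting, even if it is on the percentile
        }
        if (durationMillis > upperBoundMillis) {
            return true; // unacceptable latency, even if it is not on the percentile
        }
        return onPercentile;
    }

    public static void main(String[] args) {
        // Pretend the current percentile threshold is 20 ms.
        final SlowRequestDecision decision = new SlowRequestDecision(t -> t >= 20, 50, 1000);
        System.out.println(decision.isSampled(30));   // false: below the lower bound
        System.out.println(decision.isSampled(60));   // true: on the percentile and within bounds
        System.out.println(decision.isSampled(5000)); // true: above the upper bound
    }
}
```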

Result:

TODO

Some TODOs:

Consider having a separate sampler for each endpoint.
Performance and memory tests.
Take snapshots every once in a while.

trustin

So far so good, @Dogacel! I left some comments which might be helpful for leaving the draft status. 🙇

core/src/main/java/com/linecorp/armeria/common/util/AndSampler.java

core/src/main/java/com/linecorp/armeria/common/util/OrSampler.java

trustin · 2023-07-05T08:26:11Z

core/src/main/java/com/linecorp/armeria/common/util/Sampler.java

+     * Returns a sampler that applies logical or operator to both samplers decisions.
+     */
+    default Sampler<T> or(Sampler<T> other) {
+        return new OrSampler<>(this, other);


Suggested change

return new OrSampler<>(this, other);

return new OrSampler<>(this, requireNonNull(other, "other"));

Can we also consider the situation where more than two samplers are chained? e.g. a.or(b).or(c) by accepting an array of samplers in OrSampler.<init>()?

If we accept an array of samplers, a new syntax can be used a.or(b, c, d). Is this syntax something we want?

Because .or returns another sampler, we can safely call another .or as you described.

a.or(b).or(c) looks better than a.or(b, c) in my opinion, what do you think?

Ah, I was actually talking about the constructor of OrSampler(). or() should accept only one parameter. OrSampler.or() could be optimized so that OrSampler doesn't wrap another OrSampler.
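One way to read this suggestion is sketched below: or() still takes a single argument, but the composite keeps a flat list so that a.or(b).or(c) yields one sampler holding three delegates instead of a nested pair. The names SimpleSampler and FlatOrSampler are hypothetical stand-ins, not the Sampler and OrSampler types in this PR.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Objects;

// Hypothetical minimal sampler interface, used only for this sketch.
interface SimpleSampler<T> {
    boolean isSampled(T value);

    default SimpleSampler<T> or(SimpleSampler<T> other) {
        Objects.requireNonNull(other, "other");
        final List<SimpleSampler<T>> list = new ArrayList<>();
        list.add(this);
        list.add(other);
        return new FlatOrSampler<>(list);
    }
}

// Holds all delegates in one flat list instead of wrapping another composite.
final class FlatOrSampler<T> implements SimpleSampler<T> {
    private final List<SimpleSampler<T>> samplers;

    FlatOrSampler(List<SimpleSampler<T>> samplers) {
        this.samplers = Collections.unmodifiableList(new ArrayList<>(samplers));
    }

    int size() {
        return samplers.size();
    }

    @Override
    public boolean isSampled(T value) {
        boolean result = false;
        for (SimpleSampler<T> sampler : samplers) {
            // '|=' always evaluates the right-hand side, so every delegate sees the value.
            result |= sampler.isSampled(value);
        }
        return result;
    }

    @Override
    public SimpleSampler<T> or(SimpleSampler<T> other) {
        Objects.requireNonNull(other, "other");
        // Append to a copy of the existing list rather than wrapping this composite.
        final List<SimpleSampler<T>> list = new ArrayList<>(samplers);
        list.add(other);
        return new FlatOrSampler<>(list);
    }
}

final class FlatOrSamplerDemo {
    public static void main(String[] args) {
        final SimpleSampler<Long> a = v -> v > 100;
        final SimpleSampler<Long> b = v -> v < 0;
        final SimpleSampler<Long> c = v -> v == 5;
        final SimpleSampler<Long> chained = a.or(b).or(c);
        System.out.println(((FlatOrSampler<Long>) chained).size()); // 3
        System.out.println(chained.isSampled(5L));                  // true
    }
}
```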

trustin · 2023-07-05T08:26:29Z

core/src/main/java/com/linecorp/armeria/common/util/Sampler.java

+     * Returns a sampler that applies logical and operator to both samplers decisions.
+     */
+    default Sampler<T> and(Sampler<T> other) {
+        return new AndSampler<>(this, other);


Suggested change

return new AndSampler<>(this, other);

return new AndSampler<>(this, requireNonNull(other));

Can we also consider the situation where more than two samplers are chained? e.g. a.and(b).and(c) by accepting an array of samplers in AndSampler.<init>()?

Should we add a static version of and() and or()? (not sure if this is the best idea though. let me know what you think.)

I did not like how the static function looks syntactically. Because there are no extension functions in java we need to call them such as Sampler.and(a, b) instead of a.and(b) right?

Right. I'm fine with not adding a static method. I just wish And/OrSampler doesn't wrap another And/OrSampler, which is suboptimal.

Hmm.. I am not sure if I am 100% following. Why is wrapping another sampler not optimal?

If we are talking about "short circuiting" the evaluation, that's something we don't want to do.

core/src/main/java/com/linecorp/armeria/common/util/Sampler.java

trustin · 2023-07-05T08:32:24Z

core/src/main/java/com/linecorp/armeria/common/util/TimeWindowPercentileSampler.java

+                                           .percentilePrecision(2)
+                                           .minimumExpectedValue(1.0)
+                                           .maximumExpectedValue(Double.POSITIVE_INFINITY)


Should we also make these three properties configurable? They are important for trading off between memory and accuracy.

Should we expose those variables to the user? Or do we want to make them system properties?

Because we have so many things here, I tried to simplify as much as possible to keep LoggingDecorator simple. So, do system properties or flags make more sense?

Ah I have one idea, let's re-use MoreMeters.

.merge(MoreMeters.distributionStatisticConfig())

I will only give percentiles and expiry policy. Rest should be shared configuration by default.

Sounds good to me!

trustin · 2023-07-05T08:32:44Z

core/src/main/java/com/linecorp/armeria/common/util/TimeWindowPercentileSampler.java

+                                           .minimumExpectedValue(1.0)
+                                           .maximumExpectedValue(Double.POSITIVE_INFINITY)
+                                           .expiry(Duration.ofMillis(windowLengthMillis))
+                                           .bufferLength(3)


This also should be configurable.

I believe the following helps with configuration

.merge(MoreMeters.distributionStatisticConfig());

So basically we are using the same buffer length we use for distribution statistics, which makes sense to me. Any thoughts?

trustin · 2023-07-05T08:33:26Z

core/src/main/java/com/linecorp/armeria/common/util/TimeWindowPercentileSampler.java

+                                           .expiry(Duration.ofMillis(windowLengthMillis))
+                                           .bufferLength(3)
+                                           .build();
+        this.histogram = new TimeWindowPercentileHistogram(Clock.SYSTEM, distributionStatisticConfig, true);


We should make the Clock injectable with @VisibleForTesting to test it properly.

Added a test-only constructor that takes a clock 👍

core/src/main/java/com/linecorp/armeria/common/util/TimeWindowPercentileSampler.java

ikhoon · 2023-07-12T13:35:19Z

core/src/main/java/com/linecorp/armeria/server/logging/LoggingService.java

+            final boolean isSlow = slowRequestSampler.isSampled(requestLog.totalDurationNanos());
+            final boolean successOrFailure;
            if (ctx.config().successFunction().isSuccess(ctx, requestLog)) {
-                return successSampler.isSampled(ctx);
+                successOrFailure = successSampler.isSampled(ctx);
+            } else {
+                successOrFailure = failureSampler.isSampled(ctx);
            }
-            return failureSampler.isSampled(ctx);


If isSlow is true, can we skip the additional samplings?

Samplers are stateful. If we short-circuit, the skipped sampler won't record the value; for example, a counting sampler won't count the actual values.

There are tests that verify this. Is there similar behavior here?

armeria/core/src/test/java/com/linecorp/armeria/common/util/SamplerTest.java

Lines 139 to 158 in 01847b8

void andOrNotShortCircuited() {
    final SampleOnce first = new SampleOnce();
    final SampleOnce second = new SampleOnce();

    assertThat(first.and(second).isSampled(0)).isTrue();
    assertThat(first.isSampled(0)).isFalse();
    assertThat(second.isSampled(0)).isFalse();

    first.reset();
    second.reset();

    assertThat(first.or(second).isSampled(0)).isTrue();
    assertThat(first.isSampled(0)).isFalse();
    assertThat(second.isSampled(0)).isFalse();

    first.reset();
    second.reset();

    assertThat(second.and(second).isSampled(0)).isFalse();
}
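To make the point about stateful samplers concrete, here is a small self-contained sketch. The SampleOnce class below is a hypothetical reconstruction (true on the first call after a reset, false afterwards); the real test helper may differ. It contrasts the eager `|` operator with the short-circuiting `||` operator.

```java
final class SampleOnceDemo {
    /** Hypothetical stateful sampler: true on the first call after a reset, false afterwards. */
    static final class SampleOnce {
        private boolean used;

        boolean isSampled() {
            if (used) {
                return false;
            }
            used = true;
            return true;
        }

        void reset() {
            used = false;
        }
    }

    /** Eager OR: both samplers are always consulted, so both update their state. */
    static boolean orEager(SampleOnce a, SampleOnce b) {
        return a.isSampled() | b.isSampled();
    }

    /** Short-circuit OR: 'b' is skipped whenever 'a' returns true. */
    static boolean orShortCircuit(SampleOnce a, SampleOnce b) {
        return a.isSampled() || b.isSampled();
    }

    public static void main(String[] args) {
        final SampleOnce first = new SampleOnce();
        final SampleOnce second = new SampleOnce();

        System.out.println(orEager(first, second)); // true
        System.out.println(second.isSampled());     // false: 'second' was already consumed

        first.reset();
        second.reset();

        System.out.println(orShortCircuit(first, second)); // true
        System.out.println(second.isSampled());            // true: 'second' was never consulted
    }
}
```

With the short-circuiting variant, the second sampler's state no longer reflects every request, which is the behavior the quoted test guards against.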

core/src/main/java/com/linecorp/armeria/common/util/TimeWindowPercentileSampler.java

core/src/main/java/com/linecorp/armeria/common/util/Sampler.java

Dogacel · 2023-10-07T13:54:27Z

core/src/main/java/com/linecorp/armeria/server/annotation/decorator/LoggingDecorator.java

+    /**
+     * Sample the requests if they are slower than the {@code slowRequestSamplingPercentile} percent
+     * of the requests.
+     */
+    float slowRequestSamplingPercentile() default -1.0f;
+
+    /**
+     * Slow request percentiles are calculated over the last {@code slowRequestSamplingWindowMilliseconds}.
+     */
+    long slowRequestSamplingWindowMilliseconds() default 10 * 60 * 1000;
+
+    /**
+     * Don't sample the requests if they are faster than the {@code slowRequestSamplingLowerBoundMilliseconds}.
+     * Should be used with {@link #slowRequestSamplingUpperBoundMilliseconds()}.
+     */
+    long slowRequestSamplingLowerBoundMilliseconds() default 0L;
+
+    /**
+     * Always sample the requests if they are slower than the {@code slowRequestSamplingUpperBoundMilliseconds}.
+     */
+    long slowRequestSamplingUpperBoundMilliseconds() default Long.MAX_VALUE;
+


@trustin @ikhoon @minwoox

I would like to continue working on this if you see any value in it. I think it is worth adding so that Armeria ships a good observability tool out of the box, similar to printing failures with this decorator.

For example, at my company we currently work around this with a method that has only a hard-coded upper bound to capture slow requests.

If we would like to discuss the interface we should provide to achieve this, I am open to it. I know this might not be any priority for Armeria but I would be happy to hear from you when you are available 🙂

I personally think the existing defaults look sensible: the feature is disabled by default, and users set values to enable it.

codecov · 2023-10-07T14:28:18Z

Codecov Report

Attention: Patch coverage is 85.71429% with 15 lines in your changes missing coverage. Please review.

Project coverage is 74.04%. Comparing base (b8eb810) to head (16fd779).
Report is 635 commits behind head on main.

Files with missing lines	Patch %	Lines
...meria/common/util/TimeWindowPercentileSampler.java	79.48%	6 Missing and 2 partials ⚠️
.../armeria/server/logging/LoggingServiceBuilder.java	80.00%	1 Missing and 2 partials ⚠️
...ion/decorator/LoggingDecoratorFactoryFunction.java	88.23%	0 Missing and 2 partials ⚠️
...a/com/linecorp/armeria/common/util/AndSampler.java	87.50%	1 Missing ⚠️
...va/com/linecorp/armeria/common/util/OrSampler.java	87.50%	1 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##               main    #4978      +/-   ##
============================================
+ Coverage     73.95%   74.04%   +0.08%     
- Complexity    20115    21142    +1027     
============================================
  Files          1730     1839     +109     
  Lines         74161    78166    +4005     
  Branches       9465     9983     +518     
============================================
+ Hits          54847    57878    +3031     
- Misses        14837    15589     +752     
- Partials       4477     4699     +222


github-actions · 2023-10-07T15:12:59Z

🔍 Build Scan® (commit: `16fd779`)

Job name	Status	Build Scan®
build-windows-latest-jdk-19	✅	https://ge.armeria.dev/s/am2vq4osk6w2e
build-self-hosted-unsafe-jdk-8	✅	https://ge.armeria.dev/s/f2sdb4zsacjzs
build-self-hosted-unsafe-jdk-19-snapshot-blockhound	✅	https://ge.armeria.dev/s/5qluyncrasbz2
build-self-hosted-unsafe-jdk-17-min-java-17-coverage	✅	https://ge.armeria.dev/s/ukmoytd7rva5o
build-self-hosted-unsafe-jdk-17-min-java-11	✅	https://ge.armeria.dev/s/pv7wwrjnhb6ya
build-self-hosted-unsafe-jdk-17-leak	❌ (failure)	https://ge.armeria.dev/s/pycyjosfvbf5q
build-self-hosted-unsafe-jdk-11	✅	https://ge.armeria.dev/s/645dfs4ckagy6
build-macos-12-jdk-19	❌ (failure)	https://ge.armeria.dev/s/wevqaeq3tvtiy

ikhoon · 2023-10-17T09:14:07Z

core/src/main/java/com/linecorp/armeria/common/util/TimeWindowPercentileSampler.java

+    private final long windowLengthMillis;
+    private final TimeWindowPercentileHistogram histogram;
+    @VisibleForTesting
+    static long SNAPSHOT_UPDATE_MILLIS = 1000L;


How about adding a secondary constructor to set this value rather than updating a static constant?

@VisibleForTesting
TimeWindowPercentileSampler(float percentile, long windowLengthMillis, Clock clock,
                            long snapshotUpdateMillis) { ... }

ikhoon · 2023-10-17T09:16:15Z

core/src/main/java/com/linecorp/armeria/common/util/TimeWindowPercentileSampler.java

+
+    @Override
+    public String toString() {
+        return "TimeWindowPercentileSampler with " + windowLengthMillis + " ms window and " + percentile +


Code style) Should we instead use MoreObjects.toStringHelper()?

Looks sensible, good to know 👍

ikhoon · 2023-10-17T09:22:04Z

core/src/main/java/com/linecorp/armeria/common/util/TimeWindowPercentileSampler.java

+            }
+        }
+
+        final Double percentileValue = histogramSnapshot.percentileValues()[0].value();


Suggested change

final Double percentileValue = histogramSnapshot.percentileValues()[0].value();

final double percentileValue = histogramSnapshot.percentileValues()[0].value();

Oops, Kotlin habits 😆

Hmm turns out it doesn't work this way. Unit tests just started failing. I believe .value() just returns an object thus I need it this way.

Or this needs to change

return t >= percentileValue.longValue();

I don't know why it would fail anyway.

I see, it just rounds down with .longValue(), which works better in my case because I don't really care about the decimal points. This causes the maximum value to never be sampled because it is just an approximation 🤔
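One possible reading of this exchange is that the histogram's approximate percentile can land slightly above the largest recorded value, so an exact double comparison rejects that value while the truncating comparison accepts it. The sketch below illustrates only that arithmetic; the class name, method names, and the value 1023.5 are made up for illustration.

```java
final class PercentileComparison {
    /** Compares against the percentile value truncated toward zero by longValue(). */
    static boolean sampledWithTruncation(long t, Double percentileValue) {
        return t >= percentileValue.longValue();
    }

    /** Compares against the exact double value; 't' is widened to double. */
    static boolean sampledExact(long t, double percentileValue) {
        return t >= percentileValue;
    }

    public static void main(String[] args) {
        // Assume the largest recorded value is 1023 but the approximation reports 1023.5.
        final long max = 1023L;
        final double approximatePercentile = 1023.5;

        System.out.println(sampledWithTruncation(max, approximatePercentile)); // true
        System.out.println(sampledExact(max, approximatePercentile));          // false
    }
}
```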

ikhoon · 2023-10-17T09:31:09Z

core/src/main/java/com/linecorp/armeria/common/util/TimeWindowPercentileSampler.java

+        histogram.recordLong(t);
+
+        if (lastSnapshotMillis + SNAPSHOT_UPDATE_MILLIS <= clock.wallTime()) {
+            if (isTakingSnapshot.compareAndSet(false, true)) {


Should we implement a double-checking pattern for the update? histogramSnapshot can be set in succession regardless of SNAPSHOT_UPDATE_MILLIS if:

Two threads stay between L70~L71.

Thread A finishes isTakingSnapshot.set(false)

Thread B starts isTakingSnapshot.compareAndSet(false, true)

I see, I am just duplicating the if condition in L70 to L72 to double check. Does it sound good?
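The double-check being discussed can be sketched like this. It is an illustration under assumptions, not the PR's code: SnapshotRefresher, the LongSupplier clock, and the snapshot counter are hypothetical, and incrementing the counter stands in for taking the histogram snapshot.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.LongSupplier;

// Hypothetical sketch of a double-checked, rate-limited snapshot refresh.
final class SnapshotRefresher {
    private final LongSupplier clockMillis;
    private final long updateIntervalMillis;
    private final AtomicBoolean isTakingSnapshot = new AtomicBoolean();
    private final AtomicInteger snapshotCount = new AtomicInteger();
    private volatile long lastSnapshotMillis;

    SnapshotRefresher(LongSupplier clockMillis, long updateIntervalMillis) {
        this.clockMillis = clockMillis;
        this.updateIntervalMillis = updateIntervalMillis;
        lastSnapshotMillis = clockMillis.getAsLong();
    }

    void maybeRefresh() {
        // First check: cheap test that avoids the CAS on the common path.
        if (isStale() && isTakingSnapshot.compareAndSet(false, true)) {
            try {
                // Second check: another thread may have refreshed and released the flag
                // between our first check and our successful CAS.
                if (isStale()) {
                    snapshotCount.incrementAndGet(); // stands in for taking the snapshot
                    lastSnapshotMillis = clockMillis.getAsLong();
                }
            } finally {
                isTakingSnapshot.set(false);
            }
        }
    }

    int snapshotCount() {
        return snapshotCount.get();
    }

    private boolean isStale() {
        return lastSnapshotMillis + updateIntervalMillis <= clockMillis.getAsLong();
    }

    public static void main(String[] args) {
        final long[] now = { 0L };
        final SnapshotRefresher refresher = new SnapshotRefresher(() -> now[0], 1000L);
        now[0] = 1000L;
        refresher.maybeRefresh();
        refresher.maybeRefresh(); // no second snapshot within the same interval
        System.out.println(refresher.snapshotCount()); // 1
    }
}
```

Injecting the clock as a supplier also makes the interval easy to exercise in a test without sleeping.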

core/src/main/java/com/linecorp/armeria/common/util/TimeWindowPercentileSampler.java

ikhoon · 2023-10-17T10:04:25Z

core/src/main/java/com/linecorp/armeria/server/annotation/decorator/LoggingDecorator.java

+     * Don't sample the requests if they are faster than the {@code slowRequestSamplingLowerBoundMilliseconds}.
+     * Should be used with {@link #slowRequestSamplingUpperBoundMilliseconds()}.
+     */
+    long slowRequestSamplingLowerBoundMilliseconds() default 0L;


Suggested change

long slowRequestSamplingLowerBoundMilliseconds() default 0L;

long slowRequestSamplingLowerBoundMillis() default 0L;

ikhoon · 2023-10-17T10:05:08Z

core/src/main/java/com/linecorp/armeria/server/annotation/decorator/LoggingDecorator.java

+    /**
+     * Always sample the requests if they are slower than the {@code slowRequestSamplingUpperBoundMilliseconds}.
+     */
+    long slowRequestSamplingUpperBoundMilliseconds() default Long.MAX_VALUE;


Suggested change

long slowRequestSamplingUpperBoundMilliseconds() default Long.MAX_VALUE;

long slowRequestSamplingUpperBoundMillis() default Long.MAX_VALUE;

ikhoon · 2023-10-17T10:08:05Z

core/src/main/java/com/linecorp/armeria/server/annotation/decorator/LoggingDecorator.java

+    /**
+     * Sample the requests if they are slower than the {@code slowRequestSamplingPercentile} percent
+     * of the requests.
+     */
+    float slowRequestSamplingPercentile() default -1.0f;
+
+    /**
+     * Slow request percentiles are calculated over the last {@code slowRequestSamplingWindowMilliseconds}.
+     */
+    long slowRequestSamplingWindowMilliseconds() default 10 * 60 * 1000;
+
+    /**
+     * Don't sample the requests if they are faster than the {@code slowRequestSamplingLowerBoundMilliseconds}.
+     * Should be used with {@link #slowRequestSamplingUpperBoundMilliseconds()}.
+     */
+    long slowRequestSamplingLowerBoundMilliseconds() default 0L;
+
+    /**
+     * Always sample the requests if they are slower than the {@code slowRequestSamplingUpperBoundMilliseconds}.
+     */
+    long slowRequestSamplingUpperBoundMilliseconds() default Long.MAX_VALUE;
+


I personally think the existing values look sensible which are disabled by default and set values to enable the feature.

ikhoon · 2023-10-17T10:09:58Z

...n/java/com/linecorp/armeria/server/annotation/decorator/LoggingDecoratorFactoryFunction.java

+                                .successSamplingRate(successSamplingRate)
+                                .failureSamplingRate(failureSamplingRate);
+
+        if (parameter.slowRequestSamplingPercentile() >= 0.0f) {


Should users be able to use just the hard limit slowRequestSamplingUpperBoundMillis, without a percentile?

Sure, I think it makes sense 👍

ikhoon · 2023-10-17T10:14:27Z

core/src/main/java/com/linecorp/armeria/server/logging/LoggingServiceBuilder.java

+                                                               long windowMilliseconds,
+                                                               long slowRequestSamplingLowerBoundMilliseconds,
+                                                               long slowRequestSamplingUpperBoundMilliseconds) {
+        final Sampler<Long> percentileMatches;


Should we early return if some values are disabled and raise an exception if some of them are illegal?

if ((slowRequestPercentile <= 0.0 || windowMilliseconds <= 0) &&
    slowRequestSamplingLowerBoundMilliseconds == 0 &&
    slowRequestSamplingUpperBoundMilliseconds == Long.MAX_VALUE) {
    return;
}

Sure. I don't think I need to check the lower bound, though. The lower bound only excludes requests, so users are free to pass whatever value they want there.
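A sketch of how the early return and the validation might fit together, following the condition proposed above and the earlier agreement that the upper bound should work without a percentile. Everything here is hypothetical: the class, the Mode enum, and in particular the rule that a negative upper bound is illegal are assumptions made for illustration. The lower bound is deliberately left unchecked, as discussed.

```java
// Hypothetical sketch: decides which slow-request sampling mode the parameters select.
final class SlowRequestSamplingConfig {
    enum Mode { DISABLED, UPPER_BOUND_ONLY, PERCENTILE }

    static Mode resolve(float percentile, long windowMillis, long upperBoundMillis) {
        // Assumption for this sketch: a negative upper bound is treated as illegal.
        if (upperBoundMillis < 0) {
            throw new IllegalArgumentException(
                    "upperBoundMillis: " + upperBoundMillis + " (expected: >= 0)");
        }

        // Mirrors the proposed condition: non-positive percentile or window disables the percentile.
        final boolean percentileEnabled = percentile > 0.0f && windowMillis > 0;
        final boolean upperBoundEnabled = upperBoundMillis != Long.MAX_VALUE;

        if (!percentileEnabled && !upperBoundEnabled) {
            return Mode.DISABLED; // nothing configured, so return early
        }
        if (!percentileEnabled) {
            return Mode.UPPER_BOUND_ONLY; // hard limit without a percentile
        }
        return Mode.PERCENTILE;
    }

    public static void main(String[] args) {
        System.out.println(resolve(-1.0f, 600_000L, Long.MAX_VALUE)); // DISABLED
        System.out.println(resolve(-1.0f, 600_000L, 3_000L));         // UPPER_BOUND_ONLY
        System.out.println(resolve(0.99f, 600_000L, Long.MAX_VALUE)); // PERCENTILE
    }
}
```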

Dogacel · 2023-10-20T13:17:07Z

I have realized my test coverage was missing in some methods, with the latest commit it should be resolved.

Still there are uncovered lines such as toString(), is it fine to skip them?

@ikhoon @minwoox ?

Dogacel · 2024-01-20T23:12:45Z

Hi team, would like to continue on this. It's been a rough couple weeks, LMK if you still see value in this feature, thanks ❤️

trustin · 2024-04-09T08:13:02Z

@Dogacel Sorry that we were not as responsive as we wished. Yes, we're definitely interested in this PR. I left a couple small comments.

Dogacel added 3 commits June 21, 2023 15:05

add slow request sampler

1ffe771

document and lint properly

229901f

try 2

0a0bc05

trustin reviewed Jul 5, 2023


ikhoon reviewed Jul 12, 2023


Dogacel added 4 commits July 14, 2023 14:44

Merge branch 'main' into dogac/add-slow-request-sampler

5a23443

comment resolution

01847b8

Merge branch 'main' into dogac/add-slow-request-sampler

a85b8f8

use distribution config

7222a1c

Dogacel marked this pull request as ready for review July 22, 2023 11:35

Dogacel requested review from jrhee17 and minwoox as code owners July 22, 2023 11:35

Dogacel commented Oct 7, 2023


minwoox added this to the 1.26.0 milestone Oct 11, 2023

minwoox added the new feature label Oct 11, 2023

ikhoon reviewed Oct 17, 2023


ikhoon modified the milestones: 1.26.0, 1.27.0 Oct 17, 2023

Dogacel added 2 commits October 20, 2023 14:38

Merge branch 'main' into dogac/add-slow-request-sampler

b17cef3

comment resolution and much more tests

a26c7be

Dogacel force-pushed the dogac/add-slow-request-sampler branch from 7499911 to a26c7be Compare October 20, 2023 13:17

ikhoon modified the milestones: 1.27.0, 1.28.0 Jan 16, 2024

Merge branch 'main' into dogac/add-slow-request-sampler

28ba89d

jrhee17 modified the milestones: 1.28.0, 1.29.0 Apr 8, 2024

Merge branch 'main' into dogac/add-slow-request-sampler

16fd779

minwoox modified the milestones: 1.29.0, 1.30.0 May 21, 2024

github-actions bot added the Stale label Jul 13, 2024

ikhoon modified the milestones: 1.30.0, 1.31.0 Aug 1, 2024

github-actions bot removed the Stale label Aug 17, 2024

github-actions bot added the Stale label Sep 30, 2024

github-actions bot removed the Stale label Oct 13, 2024

jrhee17 modified the milestones: 1.31.0, 1.32.0 Nov 5, 2024

github-actions bot added the Stale label Dec 29, 2024

minwoox modified the milestones: 1.32.0, 1.33.0 Feb 12, 2025

github-actions bot removed the Stale label Feb 17, 2025

ikhoon modified the milestones: 1.33.0, 1.34.0 Aug 1, 2025

jrhee17 modified the milestones: 1.34.0, 1.35.0 Nov 24, 2025

