Skip to content

Conversation

@Dogacel
Copy link
Contributor

@Dogacel Dogacel commented Jun 21, 2023

Motivation:

Add

Modifications:

  • Implement and and or operations for samplers.
  • Implement a percentile sampler that samples if data is on x-percentile.
  • Update the logging decorator to support sampling based on percentile slowness
    • Give a lower bound so that even if the request is on x-percentile, don't sample it because it is still too fast.
    • Give an upper bound so that even if the request is not on x-percentile, always log it because this latency is not acceptable.

Result:

  • TODO

Some TODOs:

  • Consider having a separate sampler for each endpoint.
  • Performance and memory tests.
  • Take snapshots every once in a while.

Copy link
Member

@trustin trustin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So far so good, @Dogacel! I left some comments which might be helpful for leaving the draft status. 🙇

* Returns a sampler that applies logical or operator to both samplers decisions.
*/
default Sampler<T> or(Sampler<T> other) {
return new OrSampler<>(this, other);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return new OrSampler<>(this, other);
return new OrSampler<>(this, requireNonNull(other, "other"));

Can we also consider the situation where more than two samplers are chained? e.g. a.or(b).or(c) by accepting an array of samplers in OrSampler.<init>()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we accept an array of samplers, a new syntax can be used a.or(b, c, d). Is this syntax something we want?

Because .or returns another sampler, we can safely call another .or as you described.

a.or(b).or(c) looks better than a.or(b, c) in my opinion, what do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I was actually talking about the constructor of OrSampler(). or() should accept only one parameter. OrSampler.or() could be optimized so that OrSampler doesn't wrap another OrSampler.

* Returns a sampler that applies logical and operator to both samplers decisions.
*/
default Sampler<T> and(Sampler<T> other) {
return new AndSampler<>(this, other);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return new AndSampler<>(this, other);
return new AndSampler<>(this, requireNonNull(other));

Can we also consider the situation where more than two samplers are chained? e.g. a.and(b).and(c) by accepting an array of samplers in AndSampler.<init>()?

Should we add a static version of and() and or()? (not sure if this is the best idea though. let me know what you think.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not like how the static function looks syntactically. Because there are no extension functions in java we need to call them such as Sampler.and(a, b) instead of a.and(b) right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. I'm fine with not adding a static method. I just wish And/OrSampler doesn't wrap another And/OrSampler, which is suboptimal.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm.. I am not sure if I am 100% following. Why wrapping another sampler is not optimal?

If we are talking about "short circuiting" the evaluation, that's something we don't want to do.

Comment on lines 39 to 41
.percentilePrecision(2)
.minimumExpectedValue(1.0)
.maximumExpectedValue(Double.POSITIVE_INFINITY)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also make these three properties configurable? They are important for trading off between memory and accuracy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we expose those variables to the user? Or do we want to make them system properties?

Because we have so many things here, I tried to simplify as much as possible to keep LoggingDecorator simple. So, do system properties or flags make more sense?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I have one idea, let's re-use MoreMeters.

.merge(MoreMeters.distributionStatisticConfig())

I will only give percentiles and expiry policy. Rest should be shared configuration by default.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me!

.minimumExpectedValue(1.0)
.maximumExpectedValue(Double.POSITIVE_INFINITY)
.expiry(Duration.ofMillis(windowLengthMillis))
.bufferLength(3)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also should be configurable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the following helps with configuration

.merge(MoreMeters.distributionStatisticConfig());

So basically we are using the same buffer length we use for distribution statistics, which makes sense on my side. Any thoughts?

.expiry(Duration.ofMillis(windowLengthMillis))
.bufferLength(3)
.build();
this.histogram = new TimeWindowPercentileHistogram(Clock.SYSTEM, distributionStatisticConfig, true);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should make the Clock injectable with @VisisbleForTesting to test it properly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a test-only constructor that takes a clock 👍

Comment on lines +67 to -74
final boolean isSlow = slowRequestSampler.isSampled(requestLog.totalDurationNanos());
final boolean successOrFailure;
if (ctx.config().successFunction().isSuccess(ctx, requestLog)) {
return successSampler.isSampled(ctx);
successOrFailure = successSampler.isSampled(ctx);
} else {
successOrFailure = failureSampler.isSampled(ctx);
}
return failureSampler.isSampled(ctx);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If isSlow is true, can we skip the additional samplings?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Samplers are stateful, if we short cut, it would mean sampler won't record the value. I.e. counting sampler won't count actual values.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are tests that verify this there is similar behavior here?

void andOrNotShortCircuited() {
final SampleOnce first = new SampleOnce();
final SampleOnce second = new SampleOnce();
assertThat(first.and(second).isSampled(0)).isTrue();
assertThat(first.isSampled(0)).isFalse();
assertThat(second.isSampled(0)).isFalse();
first.reset();
second.reset();
assertThat(first.or(second).isSampled(0)).isTrue();
assertThat(first.isSampled(0)).isFalse();
assertThat(second.isSampled(0)).isFalse();
first.reset();
second.reset();
assertThat(second.and(second).isSampled(0)).isFalse();
}

@Dogacel Dogacel marked this pull request as ready for review July 22, 2023 11:35
@Dogacel Dogacel requested review from jrhee17 and minwoox as code owners July 22, 2023 11:35
Comment on lines 86 to 107
/**
* Sample the requests if they are slower than the {@code slowRequestSamplingPercentile} percent
* of the requests.
*/
float slowRequestSamplingPercentile() default -1.0f;

/**
* Slow request percentiles are calculated over the last {@code slowRequestSamplingWindowMilliseconds}.
*/
long slowRequestSamplingWindowMilliseconds() default 10 * 60 * 1000;

/**
* Don't sample the requests if they are faster than the {@code slowRequestSamplingLowerBoundMilliseconds}.
* Should be used with {@link #slowRequestSamplingUpperBoundMilliseconds()}.
*/
long slowRequestSamplingLowerBoundMilliseconds() default 0L;

/**
* Always sample the requests if they are slower than the {@code slowRequestSamplingUpperBoundMilliseconds}.
*/
long slowRequestSamplingUpperBoundMilliseconds() default Long.MAX_VALUE;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@trustin @ikhoon @minwoox

I would like to continue working on this if you find any value on this. I find value on adding this so there is a good observability tool coming out-of-the box with Armeria, similar to printing failures with this decorator.

For example in my company, we implemented a method which has an hardcoded upper-bound only to capture slow requests right now to work-around this.

If we would like to discuss the interface we should provide to achieve this, I am open to it. I know this might not be any priority for Armeria but I would be happy to hear from you when you are available 🙂

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally think the existing values look sensible which are disabled by default and set values to enable the feature.

@codecov
Copy link

codecov bot commented Oct 7, 2023

Codecov Report

Attention: Patch coverage is 85.71429% with 15 lines in your changes missing coverage. Please review.

Project coverage is 74.04%. Comparing base (b8eb810) to head (16fd779).
Report is 635 commits behind head on main.

Files with missing lines Patch % Lines
...meria/common/util/TimeWindowPercentileSampler.java 79.48% 6 Missing and 2 partials ⚠️
.../armeria/server/logging/LoggingServiceBuilder.java 80.00% 1 Missing and 2 partials ⚠️
...ion/decorator/LoggingDecoratorFactoryFunction.java 88.23% 0 Missing and 2 partials ⚠️
...a/com/linecorp/armeria/common/util/AndSampler.java 87.50% 1 Missing ⚠️
...va/com/linecorp/armeria/common/util/OrSampler.java 87.50% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #4978      +/-   ##
============================================
+ Coverage     73.95%   74.04%   +0.08%     
- Complexity    20115    21142    +1027     
============================================
  Files          1730     1839     +109     
  Lines         74161    78166    +4005     
  Branches       9465     9983     +518     
============================================
+ Hits          54847    57878    +3031     
- Misses        14837    15589     +752     
- Partials       4477     4699     +222     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@minwoox minwoox added this to the 1.26.0 milestone Oct 11, 2023
private final long windowLengthMillis;
private final TimeWindowPercentileHistogram histogram;
@VisibleForTesting
static long SNAPSHOT_UPDATE_MILLIS = 1000L;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about adding a secondary constructor to set this value rather than updating a static constant?

    @VisibleForTesting
    TimeWindowPercentileSampler(float percentile, long windowLengthMillis, 
                                Clock clock, long snapshotUpdateMillis) {
        ...
    }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


@Override
public String toString() {
return "TimeWindowPercentileSampler with " + windowLengthMillis + " ms window and " + percentile +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code style) Should we instead use MoreObjects.toStringHelper()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks sensible, good to know 👍

}
}

final Double percentileValue = histogramSnapshot.percentileValues()[0].value();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
final Double percentileValue = histogramSnapshot.percentileValues()[0].value();
final double percentileValue = histogramSnapshot.percentileValues()[0].value();

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, Kotlin habits 😆

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm turns out it doesn't work this way. Unit tests just started failing. I believe .value() just returns an object thus I need it this way.

Or this needs to change

        return t >= percentileValue.longValue();

I don't know why it would it fail anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, it just rounds down with .longValue() which works better in my case because I don't really care about the decimal points. This causes maximum value to be never sampled because it is just an approximation 🤔

histogram.recordLong(t);

if (lastSnapshotMillis + SNAPSHOT_UPDATE_MILLIS <= clock.wallTime()) {
if (isTakingSnapshot.compareAndSet(false, true)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we implement a double-checking pattern for the update? histogramSnapshot can be set in succession regardless of SNAPSHOT_UPDATE_MILLIS if:

  • Two threads stay between L70~L71.
  • Thread A finishes isTakingSnapshot.set(false)
  • Thread B starts isTakingSnapshot.compareAndSet(false, true)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, I am just duplicating the if condition in L70 to L72 to double check. Does it sound good?

* Don't sample the requests if they are faster than the {@code slowRequestSamplingLowerBoundMilliseconds}.
* Should be used with {@link #slowRequestSamplingUpperBoundMilliseconds()}.
*/
long slowRequestSamplingLowerBoundMilliseconds() default 0L;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
long slowRequestSamplingLowerBoundMilliseconds() default 0L;
long slowRequestSamplingLowerBoundMillis() default 0L;

/**
* Always sample the requests if they are slower than the {@code slowRequestSamplingUpperBoundMilliseconds}.
*/
long slowRequestSamplingUpperBoundMilliseconds() default Long.MAX_VALUE;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
long slowRequestSamplingUpperBoundMilliseconds() default Long.MAX_VALUE;
long slowRequestSamplingUpperBoundMillis() default Long.MAX_VALUE;

Comment on lines 86 to 107
/**
* Sample the requests if they are slower than the {@code slowRequestSamplingPercentile} percent
* of the requests.
*/
float slowRequestSamplingPercentile() default -1.0f;

/**
* Slow request percentiles are calculated over the last {@code slowRequestSamplingWindowMilliseconds}.
*/
long slowRequestSamplingWindowMilliseconds() default 10 * 60 * 1000;

/**
* Don't sample the requests if they are faster than the {@code slowRequestSamplingLowerBoundMilliseconds}.
* Should be used with {@link #slowRequestSamplingUpperBoundMilliseconds()}.
*/
long slowRequestSamplingLowerBoundMilliseconds() default 0L;

/**
* Always sample the requests if they are slower than the {@code slowRequestSamplingUpperBoundMilliseconds}.
*/
long slowRequestSamplingUpperBoundMilliseconds() default Long.MAX_VALUE;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally think the existing values look sensible which are disabled by default and set values to enable the feature.

.successSamplingRate(successSamplingRate)
.failureSamplingRate(failureSamplingRate);

if (parameter.slowRequestSamplingPercentile() >= 0.0f) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should users simply use the hard limit slowRequestSamplingUpperBoundMillis without percentile?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I think it makes sense 👍

long windowMilliseconds,
long slowRequestSamplingLowerBoundMilliseconds,
long slowRequestSamplingUpperBoundMilliseconds) {
final Sampler<Long> percentileMatches;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we early return if some values are disabled and raise an exception if some of them are illegal?

if ((slowRequestPercentile <= 0.0 || windowMilliseconds <= 0) &&
    slowRequestSamplingLowerBoundMilliseconds == 0 &&
    slowRequestSamplingUpperBoundMilliseconds == Long.MAX_VALUE) {
   return;
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I don't think I need to check the lower bound. Lower bound is just for ignoring so I guess users are free to pass there whatever they want regardless.

@ikhoon ikhoon modified the milestones: 1.26.0, 1.27.0 Oct 17, 2023
@Dogacel
Copy link
Contributor Author

Dogacel commented Oct 20, 2023

I have realized my test coverage was missing in some methods, with the latest commit it should be resolved.

Still there are uncovered lines such as toString(), is it fine to skip them?

@ikhoon @minwoox ?

@Dogacel Dogacel force-pushed the dogac/add-slow-request-sampler branch from 7499911 to a26c7be Compare October 20, 2023 13:17
@ikhoon ikhoon modified the milestones: 1.27.0, 1.28.0 Jan 16, 2024
@Dogacel
Copy link
Contributor Author

Dogacel commented Jan 20, 2024

Hi team, would like to continue on this. It's been a rough couple weeks, LMK if you still see value in this feature, thanks ❤️

@jrhee17 jrhee17 modified the milestones: 1.28.0, 1.29.0 Apr 8, 2024
@trustin
Copy link
Member

trustin commented Apr 9, 2024

@Dogacel Sorry that we were not as responsive as we wished. Yes, we're definitely interested in this PR. I left a couple small comments.

@minwoox minwoox modified the milestones: 1.29.0, 1.30.0 May 21, 2024
@github-actions github-actions bot added the Stale label Jul 13, 2024
@ikhoon ikhoon modified the milestones: 1.30.0, 1.31.0 Aug 1, 2024
@github-actions github-actions bot removed the Stale label Aug 17, 2024
@github-actions github-actions bot added the Stale label Sep 30, 2024
@github-actions github-actions bot removed the Stale label Oct 13, 2024
@jrhee17 jrhee17 modified the milestones: 1.31.0, 1.32.0 Nov 5, 2024
@github-actions github-actions bot added the Stale label Dec 29, 2024
@minwoox minwoox modified the milestones: 1.32.0, 1.33.0 Feb 12, 2025
@github-actions github-actions bot removed the Stale label Feb 17, 2025
@ikhoon ikhoon modified the milestones: 1.33.0, 1.34.0 Aug 1, 2025
@jrhee17 jrhee17 modified the milestones: 1.34.0, 1.35.0 Nov 24, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants