-
Notifications
You must be signed in to change notification settings - Fork 969
add slow request sampler #4978
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
add slow request sampler #4978
Conversation
trustin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So far so good, @Dogacel! I left some comments which might be helpful for leaving the draft status. 🙇
core/src/main/java/com/linecorp/armeria/common/util/AndSampler.java
Outdated
Show resolved
Hide resolved
core/src/main/java/com/linecorp/armeria/common/util/OrSampler.java
Outdated
Show resolved
Hide resolved
| * Returns a sampler that applies logical or operator to both samplers decisions. | ||
| */ | ||
| default Sampler<T> or(Sampler<T> other) { | ||
| return new OrSampler<>(this, other); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| return new OrSampler<>(this, other); | |
| return new OrSampler<>(this, requireNonNull(other, "other")); |
Can we also consider the situation where more than two samplers are chained? e.g. a.or(b).or(c) by accepting an array of samplers in OrSampler.<init>()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we accept an array of samplers, a new syntax can be used a.or(b, c, d). Is this syntax something we want?
Because .or returns another sampler, we can safely call another .or as you described.
a.or(b).or(c) looks better than a.or(b, c) in my opinion, what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I was actually talking about the constructor of OrSampler(). or() should accept only one parameter. OrSampler.or() could be optimized so that OrSampler doesn't wrap another OrSampler.
| * Returns a sampler that applies logical and operator to both samplers decisions. | ||
| */ | ||
| default Sampler<T> and(Sampler<T> other) { | ||
| return new AndSampler<>(this, other); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| return new AndSampler<>(this, other); | |
| return new AndSampler<>(this, requireNonNull(other)); |
Can we also consider the situation where more than two samplers are chained? e.g. a.and(b).and(c) by accepting an array of samplers in AndSampler.<init>()?
Should we add a static version of and() and or()? (not sure if this is the best idea though. let me know what you think.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not like how the static function looks syntactically. Because there are no extension functions in java we need to call them such as Sampler.and(a, b) instead of a.and(b) right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. I'm fine with not adding a static method. I just wish And/OrSampler doesn't wrap another And/OrSampler, which is suboptimal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm.. I am not sure if I am 100% following. Why wrapping another sampler is not optimal?
If we are talking about "short circuiting" the evaluation, that's something we don't want to do.
| .percentilePrecision(2) | ||
| .minimumExpectedValue(1.0) | ||
| .maximumExpectedValue(Double.POSITIVE_INFINITY) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we also make these three properties configurable? They are important for trading off between memory and accuracy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we expose those variables to the user? Or do we want to make them system properties?
Because we have so many things here, I tried to simplify as much as possible to keep LoggingDecorator simple. So, do system properties or flags make more sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I have one idea, let's re-use MoreMeters.
.merge(MoreMeters.distributionStatisticConfig())I will only give percentiles and expiry policy. Rest should be shared configuration by default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me!
| .minimumExpectedValue(1.0) | ||
| .maximumExpectedValue(Double.POSITIVE_INFINITY) | ||
| .expiry(Duration.ofMillis(windowLengthMillis)) | ||
| .bufferLength(3) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This also should be configurable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe the following helps with configuration
.merge(MoreMeters.distributionStatisticConfig());
So basically we are using the same buffer length we use for distribution statistics, which makes sense on my side. Any thoughts?
| .expiry(Duration.ofMillis(windowLengthMillis)) | ||
| .bufferLength(3) | ||
| .build(); | ||
| this.histogram = new TimeWindowPercentileHistogram(Clock.SYSTEM, distributionStatisticConfig, true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should make the Clock injectable with @VisisbleForTesting to test it properly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a test-only constructor that takes a clock 👍
core/src/main/java/com/linecorp/armeria/common/util/TimeWindowPercentileSampler.java
Outdated
Show resolved
Hide resolved
core/src/main/java/com/linecorp/armeria/common/util/TimeWindowPercentileSampler.java
Outdated
Show resolved
Hide resolved
| final boolean isSlow = slowRequestSampler.isSampled(requestLog.totalDurationNanos()); | ||
| final boolean successOrFailure; | ||
| if (ctx.config().successFunction().isSuccess(ctx, requestLog)) { | ||
| return successSampler.isSampled(ctx); | ||
| successOrFailure = successSampler.isSampled(ctx); | ||
| } else { | ||
| successOrFailure = failureSampler.isSampled(ctx); | ||
| } | ||
| return failureSampler.isSampled(ctx); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If isSlow is true, can we skip the additional samplings?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Samplers are stateful, if we short cut, it would mean sampler won't record the value. I.e. counting sampler won't count actual values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are tests that verify this there is similar behavior here?
armeria/core/src/test/java/com/linecorp/armeria/common/util/SamplerTest.java
Lines 139 to 158 in 01847b8
| void andOrNotShortCircuited() { | |
| final SampleOnce first = new SampleOnce(); | |
| final SampleOnce second = new SampleOnce(); | |
| assertThat(first.and(second).isSampled(0)).isTrue(); | |
| assertThat(first.isSampled(0)).isFalse(); | |
| assertThat(second.isSampled(0)).isFalse(); | |
| first.reset(); | |
| second.reset(); | |
| assertThat(first.or(second).isSampled(0)).isTrue(); | |
| assertThat(first.isSampled(0)).isFalse(); | |
| assertThat(second.isSampled(0)).isFalse(); | |
| first.reset(); | |
| second.reset(); | |
| assertThat(second.and(second).isSampled(0)).isFalse(); | |
| } |
core/src/main/java/com/linecorp/armeria/common/util/TimeWindowPercentileSampler.java
Outdated
Show resolved
Hide resolved
| /** | ||
| * Sample the requests if they are slower than the {@code slowRequestSamplingPercentile} percent | ||
| * of the requests. | ||
| */ | ||
| float slowRequestSamplingPercentile() default -1.0f; | ||
|
|
||
| /** | ||
| * Slow request percentiles are calculated over the last {@code slowRequestSamplingWindowMilliseconds}. | ||
| */ | ||
| long slowRequestSamplingWindowMilliseconds() default 10 * 60 * 1000; | ||
|
|
||
| /** | ||
| * Don't sample the requests if they are faster than the {@code slowRequestSamplingLowerBoundMilliseconds}. | ||
| * Should be used with {@link #slowRequestSamplingUpperBoundMilliseconds()}. | ||
| */ | ||
| long slowRequestSamplingLowerBoundMilliseconds() default 0L; | ||
|
|
||
| /** | ||
| * Always sample the requests if they are slower than the {@code slowRequestSamplingUpperBoundMilliseconds}. | ||
| */ | ||
| long slowRequestSamplingUpperBoundMilliseconds() default Long.MAX_VALUE; | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like to continue working on this if you find any value on this. I find value on adding this so there is a good observability tool coming out-of-the box with Armeria, similar to printing failures with this decorator.
For example in my company, we implemented a method which has an hardcoded upper-bound only to capture slow requests right now to work-around this.
If we would like to discuss the interface we should provide to achieve this, I am open to it. I know this might not be any priority for Armeria but I would be happy to hear from you when you are available 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I personally think the existing values look sensible which are disabled by default and set values to enable the feature.
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #4978 +/- ##
============================================
+ Coverage 73.95% 74.04% +0.08%
- Complexity 20115 21142 +1027
============================================
Files 1730 1839 +109
Lines 74161 78166 +4005
Branches 9465 9983 +518
============================================
+ Hits 54847 57878 +3031
- Misses 14837 15589 +752
- Partials 4477 4699 +222 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
| private final long windowLengthMillis; | ||
| private final TimeWindowPercentileHistogram histogram; | ||
| @VisibleForTesting | ||
| static long SNAPSHOT_UPDATE_MILLIS = 1000L; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about adding a secondary constructor to set this value rather than updating a static constant?
@VisibleForTesting
TimeWindowPercentileSampler(float percentile, long windowLengthMillis,
Clock clock, long snapshotUpdateMillis) {
...
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
|
|
||
| @Override | ||
| public String toString() { | ||
| return "TimeWindowPercentileSampler with " + windowLengthMillis + " ms window and " + percentile + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code style) Should we instead use MoreObjects.toStringHelper()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks sensible, good to know 👍
| } | ||
| } | ||
|
|
||
| final Double percentileValue = histogramSnapshot.percentileValues()[0].value(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| final Double percentileValue = histogramSnapshot.percentileValues()[0].value(); | |
| final double percentileValue = histogramSnapshot.percentileValues()[0].value(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, Kotlin habits 😆
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm turns out it doesn't work this way. Unit tests just started failing. I believe .value() just returns an object thus I need it this way.
Or this needs to change
return t >= percentileValue.longValue();
I don't know why it would it fail anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, it just rounds down with .longValue() which works better in my case because I don't really care about the decimal points. This causes maximum value to be never sampled because it is just an approximation 🤔
| histogram.recordLong(t); | ||
|
|
||
| if (lastSnapshotMillis + SNAPSHOT_UPDATE_MILLIS <= clock.wallTime()) { | ||
| if (isTakingSnapshot.compareAndSet(false, true)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we implement a double-checking pattern for the update? histogramSnapshot can be set in succession regardless of SNAPSHOT_UPDATE_MILLIS if:
- Two threads stay between L70~L71.
- Thread A finishes
isTakingSnapshot.set(false) - Thread B starts
isTakingSnapshot.compareAndSet(false, true)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, I am just duplicating the if condition in L70 to L72 to double check. Does it sound good?
core/src/main/java/com/linecorp/armeria/common/util/TimeWindowPercentileSampler.java
Outdated
Show resolved
Hide resolved
| * Don't sample the requests if they are faster than the {@code slowRequestSamplingLowerBoundMilliseconds}. | ||
| * Should be used with {@link #slowRequestSamplingUpperBoundMilliseconds()}. | ||
| */ | ||
| long slowRequestSamplingLowerBoundMilliseconds() default 0L; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| long slowRequestSamplingLowerBoundMilliseconds() default 0L; | |
| long slowRequestSamplingLowerBoundMillis() default 0L; |
| /** | ||
| * Always sample the requests if they are slower than the {@code slowRequestSamplingUpperBoundMilliseconds}. | ||
| */ | ||
| long slowRequestSamplingUpperBoundMilliseconds() default Long.MAX_VALUE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| long slowRequestSamplingUpperBoundMilliseconds() default Long.MAX_VALUE; | |
| long slowRequestSamplingUpperBoundMillis() default Long.MAX_VALUE; |
| /** | ||
| * Sample the requests if they are slower than the {@code slowRequestSamplingPercentile} percent | ||
| * of the requests. | ||
| */ | ||
| float slowRequestSamplingPercentile() default -1.0f; | ||
|
|
||
| /** | ||
| * Slow request percentiles are calculated over the last {@code slowRequestSamplingWindowMilliseconds}. | ||
| */ | ||
| long slowRequestSamplingWindowMilliseconds() default 10 * 60 * 1000; | ||
|
|
||
| /** | ||
| * Don't sample the requests if they are faster than the {@code slowRequestSamplingLowerBoundMilliseconds}. | ||
| * Should be used with {@link #slowRequestSamplingUpperBoundMilliseconds()}. | ||
| */ | ||
| long slowRequestSamplingLowerBoundMilliseconds() default 0L; | ||
|
|
||
| /** | ||
| * Always sample the requests if they are slower than the {@code slowRequestSamplingUpperBoundMilliseconds}. | ||
| */ | ||
| long slowRequestSamplingUpperBoundMilliseconds() default Long.MAX_VALUE; | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I personally think the existing values look sensible which are disabled by default and set values to enable the feature.
| .successSamplingRate(successSamplingRate) | ||
| .failureSamplingRate(failureSamplingRate); | ||
|
|
||
| if (parameter.slowRequestSamplingPercentile() >= 0.0f) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should users simply use the hard limit slowRequestSamplingUpperBoundMillis without percentile?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I think it makes sense 👍
| long windowMilliseconds, | ||
| long slowRequestSamplingLowerBoundMilliseconds, | ||
| long slowRequestSamplingUpperBoundMilliseconds) { | ||
| final Sampler<Long> percentileMatches; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we early return if some values are disabled and raise an exception if some of them are illegal?
if ((slowRequestPercentile <= 0.0 || windowMilliseconds <= 0) &&
slowRequestSamplingLowerBoundMilliseconds == 0 &&
slowRequestSamplingUpperBoundMilliseconds == Long.MAX_VALUE) {
return;
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I don't think I need to check the lower bound. Lower bound is just for ignoring so I guess users are free to pass there whatever they want regardless.
7499911 to
a26c7be
Compare
|
Hi team, would like to continue on this. It's been a rough couple weeks, LMK if you still see value in this feature, thanks ❤️ |
|
@Dogacel Sorry that we were not as responsive as we wished. Yes, we're definitely interested in this PR. I left a couple small comments. |
Motivation:
Add
Modifications:
andandoroperations for samplers.Result:
Some TODOs: