Checkable: update next_check ts in ExecuteCheck only if it's needed

Since the scheduler accounts for already running checks, we only need to
update the `next_check` timestamp in `Checkable::ExecuteCheck()` only
where it actually makes sense to do so, and as for local checks this
doesn't make sense at all. There only two cases where we need to update
the next check beforehand:

1) The execute command event is sent to a connected remote endpoint, so
   we need to set the next check to a time in the future until we actually
   receive the check result back from the remote endpoint. However, it must
   not be too far in the future to avoid that the check is not re-run for
   too long in case the remote endpoint never responds.
2) The check is a remote check, but either the endpoint is currently syncing
   replay logs or not connected at all, and we are within the magical 5min
   cold startup window. In these cases, the check is effectively skipped, and
   there will be no check result for it coming in, we manually update the next
   check normally as if the check was executed.

In the other cases, either the check is executed locally, which means the
`m_RunningCheck` flag already prevents the scheduler from re-running the check,
or this is a remote check and the endpoint is not connected, but we are outside
the cold startup window, in which case we also don't do anything as we've already
called `Checkable::ProcessCheckResult()` with an appropriate error state, which
in turn will call `Checkable::UpdateNextCheck()`.
This commit is contained in:
Yonas Habteab 2025-09-24 09:37:48 +02:00
parent 5fe6b68b89
commit f049f5258a

View File

@ -576,12 +576,6 @@ void Checkable::ExecuteCheck(const WaitGroup::Ptr& producer)
SetLastCheckStarted(Utility::GetTime());
/* This calls SetNextCheck() which updates the CheckerComponent's idle/pending
* queues and ensures that checks are not fired multiple times. ProcessCheckResult()
* is called too late. See #6421.
*/
UpdateNextCheck();
bool reachable = IsReachable();
{
@ -643,11 +637,16 @@ void Checkable::ExecuteCheck(const WaitGroup::Ptr& producer)
if (listener)
listener->SyncSendMessage(endpoint, message);
/* Re-schedule the check so we don't run it again until after we've received
* a check result from the remote instance. The check will be re-scheduled
* using the proper check interval once we've received a check result.
/*
* Let the checker use a dummy next check time until we actually receive the check result from the
* remote endpoint. This should be sufficiently far in the future to avoid excessive CPU load by
* constantly re-running the check, but not too far in the future to avoid that the check is not
* re-run for too long in case the remote endpoint never responds. We add a small grace period
* to the check command timeout to account for network latency and processing time on the remote
* endpoint. So, we only need to silently update this without notifying any listeners, and once
* this function returns, the checker is going access it via GetNextCheck() again.
*/
SetNextCheck(Utility::GetTime() + checkTimeout + 30);
SetNextCheck(Utility::GetTime() + checkTimeout + 30, true);
/*
* Let the user know that there was a problem with the check if
@ -670,6 +669,13 @@ void Checkable::ExecuteCheck(const WaitGroup::Ptr& producer)
cr->SetOutput(output);
ProcessCheckResult(cr, producer);
} else {
/**
* The endpoint is currently either syncing its state or not connected yet and we are within
* the magical 5min cold startup window. In both cases, we just don't do anything and wait for
* the next check interval to re-try the check again. So, this check is effectively skipped.
*/
UpdateNextCheck();
}
/**