A timeout is about giving up: if there is no response within X time, assume the thing being monitored is DOWN.
We want a warning, while it is still up, that things are getting slow - not to mark things as down (dead in the water).
I’ll give you a few examples:
A. We have a backup link that uses the mobile phone network. Unfortunately, due to local conditions, service is not great, so we have a booster for the signal.
When the booster becomes problematic, or a remote tower gets selected as the best one, our link stays up and our monitor continues to work, but the latency goes from ~250ms to between 2000 and 7000ms.
Because the service is still fully operational, this counts as OK: it is indeed “up”, since a response is still received. But it is an early warning, and we can resolve the issue before an actual outage occurs. These are two distinct states - slow is not the same as down.
B. If you’re running a website and it is in the midst of being hit hard, but still functional, it may be slower to respond. With an early warning, we can respond to this and get on top of it before the site goes down.
Again, this is a distinct state from being down, and catching it early gives us time to respond - it’s a great warning indicator to have. We don’t want to consider the site down just because the latency / response time has gotten slower. It isn’t down, and down may be an emergency, whereas slow may be more of an “ok, no panic, but let’s get on top of this before it creates a problem”. We may need to add more capacity, for example.
C. Some applications, like database replication, are sensitive to latency. If we start to see higher latency to a remote site, we may wish to take early action: either resolve it, or take the nodes out of the cluster and then work to resolve the issue (or work with our providers to do that).
Again, here, the early warning is extremely useful - and the site is not down, so we don’t want to call it down (which a very short timeout could do). We want to know when we’re exceeding a threshold so we are warned about it - but not panicked by it. That allows us to take preventative action before the problem manifests further.
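To make the distinction concrete, here is a rough sketch in Python of the kind of check I mean: a warning threshold that sits well below the timeout, giving three states instead of two. The names (`Status`, `classify`, `warn_ms`, `timeout_ms`) and the threshold values are made up for illustration and are not taken from any existing monitoring tool.

```python
from enum import Enum
from typing import Optional


class Status(Enum):
    UP = "up"
    SLOW = "slow"  # still up and responding, but worth a warning
    DOWN = "down"


def classify(latency_ms: Optional[float], warn_ms: float, timeout_ms: float) -> Status:
    """Classify a single check result.

    latency_ms is None when no response arrived at all.
    """
    # No response, or a response later than the timeout: treat as down.
    if latency_ms is None or latency_ms >= timeout_ms:
        return Status.DOWN
    # A response arrived in time, but slower than we are comfortable with.
    if latency_ms >= warn_ms:
        return Status.SLOW
    return Status.UP


# Hypothetical thresholds for the backup link in example A.
WARN_MS = 1000
TIMEOUT_MS = 30000

for sample in (250, 2000, 7000, None):
    print(sample, classify(sample, WARN_MS, TIMEOUT_MS).value)
```

With those assumed thresholds, 250ms comes back as up, 2000ms and 7000ms come back as slow, and a missing response comes back as down. Shortening the timeout to catch the slow cases would instead report them as down, which is exactly what we want to avoid.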
Hope that helps clarify, Tom.