High Availability with s3gw + Longhorn
- High Availability with s3gw + Longhorn
- Active/Active
- Active/Warm Standby
- Active/Standby
- Investigation's Rationale
- Failure cases
- Measuring s3gw failures on Kubernetes
- Notes on testing s3gw within K8s
- Tested Scenarios - radosgw-restart
- Tested Scenarios - S3-workload during s3gw Pod outage
We want to investigate what High Availability - HA - means for the s3gw when used with Longhorn.
If we define HA as the ability to have N independent pipelines, so that the failure of some of them does not affect the user's operations, then this is not easy to achieve with the s3gw.
HA, however, is not necessarily tied to multiple independent pipelines. While High Availability (as a component of dependable computing) can be achieved through redundancy, it can also be achieved by significantly increased reliability and quality of the stack. For example, improving the start time improves HA by lowering the Recovery Time Objective (RTO) after detecting and recovering from a fault. Furthermore, achieving HA in a way that never affects the user's operations is not realistic either: not all faults can be masked, and the complexity required to attempt it would actually be detrimental to overall system reliability/availability (aka not cost-effective). Paradoxically, failing fast instead of hanging on an attempted fault recovery can also increase observed availability.
Our key goal, for the s3gw, is to maximize user-visible, reliable service availability, and to be tolerant with regard to certain faults (such as pod crashes, single node outages ... - the faults we consider or explicitly exclude are discussed later in the document).
By pipeline, we mean the whole chain from the ingress down to the persistent volume (PV) provided by Longhorn.
HA can be difficult with a project like the s3gw because of one process, the radosgw, owning exclusive access to a single resource: the Kubernetes PV where the S3 buckets and objects are persisted.
Actually, this is also an advantage, due to its lower complexity.
Active/active syncing of all operations is non-trivial,
and an active/standby pattern is much simpler to implement with lower overhead
in the absence of faults.
In theory, 3 HA models are possible with the s3gw:
- Active/Active
- Active/Warm Standby
- Active/Standby
Active/Active
The Active/Active model must implement truly independent pipelines where all the pieces are replicated.
The immediate consequence of this statement is that every pipeline belonging to the same logical s3gw service must bind to a different PV; one per radosgw process.
All the PVs for the same logical s3gw must therefore sync their data.
The need to synchronize and coordinate all operations across all nodes (to guarantee Atomicity/Isolation/Durability, which even S3's eventual consistency model needs in some cases) is not free. Even then there's a need to ensure the data has really been replicated, and since a fault can only be detected via a timeout of some form, there's still a blip in availability (just, hopefully, not a returned error message).
Since the synchronization mechanism is often complex, there's an ongoing price to be paid for achieving this, plus it is harder to get right (lowering availability through lowered quality).
The ingress might still mask it and automatically retry. That's another thing to consider: most S3 protocol libraries are built on the assumption of an unreliable network, so a single non-repeatable failure might not be visible to the user, but just be (silently or not) retried, and on it goes without anyone noticing beyond a blip of high latency, unless it happens too often. This is a property of the S3 protocol that makes HA a bit easier for us to achieve, we hope.
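To illustrate that last point, here is a minimal sketch (endpoint and credentials are assumptions, not s3gw-specific values) of how a standard S3 client library is configured to retry transparently; this is plain client-side behavior we can rely on, not something we have to build:

```python
import boto3
from botocore.config import Config

# botocore's "standard" retry mode retries transient failures (connection
# errors, throttling, 5xx responses) with backoff, so a short s3gw blip may
# surface only as extra latency rather than an error returned to the caller.
s3 = boto3.client(
    "s3",
    endpoint_url="http://s3gw.local",   # assumption: the s3gw endpoint
    aws_access_key_id="test",           # assumption: demo credentials
    aws_secret_access_key="test",
    config=Config(retries={"max_attempts": 10, "mode": "standard"}),
)

s3.put_object(Bucket="demo", Key="obj1", Body=b"payload")
```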
Active/Warm Standby
This model assumes N pipelines are "allocated" on the cluster at the same time, but only one, the active pipeline, holds exclusive ownership of the shared PV.
Should the active pipeline suffer a failure, the next elected active pipeline, chosen to replace the old one, has to pay the time window necessary to "transfer" the ownership of the shared PV.
The Active plus Warm Standby operation here has no meaningful advantages over the Active/Standby option; loading the s3gw pod is the cheapest part of the whole process, compared to fault detection (usually a timeout), mounting (and journal recovery) of the file system, starting the process running SQLite, recovery on start, etc.
We'd pay for this with complexity (and resource consumption while in standby) that would likely give us only very marginal benefits at best.
Active/Standby
In this scenario, in the event of failure, the system should pay the time needed to fully load a new pipeline.
Supposing that the new pipeline is scheduled on a node where the s3gw image is not cached, the system must also pay the time needed to pull the image from the registry before starting it.
In general, on Kubernetes it cannot be assumed that an image is pre-pulled, since nodes may come and go dynamically.
Pre-pulled images are something we do want to ensure on eligible nodes; otherwise, our restart is unpredictably long.
It's always possible that a fault occurs exactly at the time when we are pre-loading the image, but that's just bad luck.
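As a minimal sketch of how this could be verified (the image name is an assumption), the Kubernetes API exposes the images already cached on each node, so eligible nodes can be checked before a fail-over ever happens; the usual way to actually enforce pre-pulling is a DaemonSet that keeps the image present on every eligible node:

```python
from kubernetes import client, config

IMAGE = "quay.io/s3gw/s3gw:latest"  # assumption: the image we want pre-pulled

config.load_kube_config()
for node in client.CoreV1Api().list_node().items:
    # node.status.images lists the images already present in the node's cache.
    cached = any(
        IMAGE in (img.names or []) for img in (node.status.images or [])
    )
    print(f"{node.metadata.name}: image cached = {cached}")
```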
Investigation's Rationale
The 3 models described above have different performance characteristics and require different implementation efforts. While the Active/Active model is expected to require a significant development effort due to its inherently complex nature, for our use case the Active/Standby model built on top of Longhorn actually makes the most sense and brings the "best" HA characteristics relative to implementing a more fully active/distributed solution.
In particular, the Active/Standby model is expected to work with nothing but the primitives normally available on any Kubernetes cluster.
Given that the backend s3gw architecture is composed of:
- one ingress and one ClusterIP service
- one stateless radosgw process associated with a stateful PV
it is supposed that, when the radosgw process fails, a simple reschedule of its POD could be enough to fulfill this HA model.
All of this obviously involves timeouts and delays; we suspect we'll have to adjust them for our requirements.
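One of the knobs involved is the pair of default NoExecute tolerations (300 seconds each) that keep a pod bound to a node that has become unreachable or not-ready. A minimal sketch of shortening them with the Python Kubernetes client, assuming the deployment is called s3gw in the s3gw-system namespace (the values are illustrative, not tuned recommendations):

```python
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

# Shorten the default 300s tolerations so the pod is evicted and rescheduled
# sooner when its node becomes unreachable/not-ready.
patch = {"spec": {"template": {"spec": {"tolerations": [
    {"key": "node.kubernetes.io/unreachable", "operator": "Exists",
     "effect": "NoExecute", "tolerationSeconds": 30},
    {"key": "node.kubernetes.io/not-ready", "operator": "Exists",
     "effect": "NoExecute", "tolerationSeconds": 30},
]}}}}

apps.patch_namespaced_deployment("s3gw", "s3gw-system", patch)  # assumed names
```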
A clarification is needed here: the PV state is guaranteed by Longhorn. Longhorn ensures that a PV (along with its content) is always available on any of the cluster's nodes, so that a POD can mount it regardless of where it is scheduled in the cluster.
Obviously, we can't achieve higher availability or reliability than the underlying Longhorn volumes.
Failure cases
The PV state is kept coherent by Longhorn, so errors at this level are assumed NOT possible; application-level corruptions of the PV's state ARE possible.
s3gw won't corrupt the PV's state or the file system on it,
but it might corrupt its own application data.
Any corruption in the file system is outside what s3gw can reasonably protect
against; at best, it can crash in a way that doesn't corrupt the data further,
but that's all undefined behavior and "restore from backup" time.
What are the failure cases that can happen to s3gw? Making these cases explicit is useful for reasoning about which scenarios we can actually expect to solve with an HA model. If a case is clearly outside what an HA model can handle, we must expect the Kubernetes back-off mechanism to be the only mitigation when a restart loop occurs.
Let's examine the following scenarios:
- radosgw's POD failure and radosgw's POD rescheduling
- Non-graceful node failure
- radosgw's failure due to a bug
- PV data corruption at application level due to radosgw's anomalous exit
We are supposing all these bugs or conditions to be fatal for the s3gw's process so that they trigger an anomalous exit.
radosgw's POD failure and radosgw's POD rescheduling
This case occurs when the radosgw process stops due to a failure of its POD.
This case also applies when Kubernetes decides to reschedule the POD to another node, e.g., when the node runs low on resources. This would often be called a "switch-over" in an HA cluster, i.e., an administratively orchestrated transition of the service to a new node (say, for maintenance/upgrade/etc. reasons). This has the advantage of being schedulable, so it can happen at times of low load if these exist. In combination with proper interaction with the ingress - pausing requests there instead of failing them - we should be able to mask these cleanly.
Bonus: this is also what we need to transparently restart on upgrade/update of the pod itself.
This can be thought of as an infrastructure issue independent of the s3gw. In this case, the Active/Standby model fully restores the service by rescheduling a new POD somewhere in the cluster.
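A minimal sketch of triggering such an orchestrated reschedule with the Python Kubernetes client (deployment name and namespace are assumptions); patching the pod template annotation is the same mechanism kubectl rollout restart uses:

```python
from datetime import datetime, timezone
from kubernetes import client, config

config.load_kube_config()

# Bumping the template annotation makes Kubernetes replace the pod at a time
# of our choosing, e.g. during a low-load window.
client.AppsV1Api().patch_namespaced_deployment(
    "s3gw", "s3gw-system",   # assumption: deployment name and namespace
    {"spec": {"template": {"metadata": {"annotations": {
        "kubectl.kubernetes.io/restartedAt":
            datetime.now(timezone.utc).isoformat()
    }}}}},
)
```

Note that with a single RWO Longhorn volume the deployment's strategy likely needs to be Recreate rather than RollingUpdate, so the old pod releases the volume before the new one tries to mount it.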
Non-graceful node failure
This case occurs when the radosgw process stops due to a cluster node failure.
This also means we weren't shut down cleanly.
So the recovery needs to be optimized for the stack, and as soon as we
can, we need to hook into the ingress and tell it to pause until we're done,
and then resume.
Kubernetes detects node failures, but recovery in this situation may take a very long time due to timeouts and grace periods; see Longhorn's documentation, what to expect when a kubernetes node fails (it suggests "up to 7 minutes", which is often unacceptable as a response to node failures).
With Kubernetes 1.28, k8s gained GA support for the concept of non-graceful node shutdown and for orchestrating recovery actions once the taint/flag is manually set. Currently, this relies on manual intervention: the administrator needs to ensure the node(s) is (are) really down to avoid the risk of split-brain scenarios.
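The manual step behind that feature is the out-of-service taint. A minimal sketch of applying it with the Python Kubernetes client (the node name is an assumption; kubectl taint works just as well):

```python
from kubernetes import client, config

config.load_kube_config()

# Apply the out-of-service taint only after verifying the node is really down;
# Kubernetes then force-deletes the pods on it, detaches their volumes and
# lets them be rescheduled elsewhere.
# Note: patching spec.taints replaces the node's existing taint list.
client.CoreV1Api().patch_node(
    "node-2",   # assumption: the failed node's name
    {"spec": {"taints": [{
        "key": "node.kubernetes.io/out-of-service",
        "value": "nodeshutdown",
        "effect": "NoExecute",
    }]}},
)
```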
Regarding this topic, we have opened a specific issue: Improving recovery times for non-graceful node failures within the Longhorn project.
radosgw's failure due to a bug
Bug not related to a certain input pattern
The crash is caused by a bug not directly related to any input type.
Examples:
- Memory leaks
- Memory corruptions (stack or heap corruptions)
- Periodic operations or routines not related to an input (GC, measures, telemetry, etc)
For a bug like this, the Active/Standby model could guarantee the user's operations until the next occurrence of the same malfunction.
A definitive solution would be available only once a patch for the issue has been released.
Bug related to a certain input pattern
The crash is caused by a bug directly related to a certain input type.
Examples:
- Putting Buckets with some name's pattern
- Putting Objects that have a certain size
- Performing an admin operation over a suspended user
For a bug like this, the Active/Standby model could guarantee the user's operations only if the crash-triggering input is recognized by the user and its submission is blocked.
PV Data corruption at application level due to radosgw's anomalous exit
This case occurs when the state on the PV is corrupted due to a radosgw's anomalous exit.
In this unfortunate scenario, the Active/Standby model can hardly help. A restart could either fix the problem or trigger an endless restart loop. Logical data corruption is a Robustness/Reliability problem; the best we can aim for is to detect it (and abort with prejudice, and for good, so as not to make the corruption worse). The fix for this could require human intervention. A definitive solution would be available only when a patch for the issue is available.
Measuring s3gw failures on Kubernetes
After reviewing the cases, we can say that what can actually be solved with an HA model for s3gw are the failures that do not depend on application bugs.
We can handle temporary issues involving the infrastructure that hosts the radosgw process.
We are interested in measuring:
- The (kill, restart) loop timing outside of k8s/LH, so that we have a baseline, can measure what Kubernetes adds, and can see how slow the s3gw is when exiting (see the sketch after this list):
  - Cleanly
  - Crashing with no ops in flight
  - Crashing with an on-going load
- Then, fault detection times for k8s: how long until it notices that the process has crashed (that should be quick due to the generated signal)? And what about the process hanging? (i.e., crashes are separate from timeouts)
- Node failures: again, we need to understand which factors affect k8s detecting and reacting to them, and what latencies are introduced by k8s/LH.
Actively asking k8s to restart is different (see case one, switch-over vs fail-over). That should be smooth, but it is not actually a failure scenario; it is probably worth handling in a separate section.
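A minimal sketch of the baseline measurement referenced in the first bullet: kill radosgw, restart it, and time how long it takes until the S3 frontend accepts a TCP connection again. The binary path, flags and port are assumptions and will differ in a real setup:

```python
import socket
import subprocess
import time

# Assumptions: radosgw is on PATH, uses a local host-path store, and serves
# S3 on port 7480.
RADOSGW_CMD = ["radosgw", "-d", "--rgw-frontends", "beast port=7480"]

def wait_for_port(port: int, timeout: float = 60.0) -> float:
    """Seconds elapsed until the port accepts a TCP connection."""
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        try:
            with socket.create_connection(("127.0.0.1", port), timeout=0.2):
                return time.monotonic() - start
        except OSError:
            time.sleep(0.01)
    raise TimeoutError(f"port {port} not reachable within {timeout}s")

proc = subprocess.Popen(RADOSGW_CMD)
wait_for_port(7480)           # let the first instance come up
proc.kill()                   # simulate a crash (no clean shutdown)
proc.wait()
proc = subprocess.Popen(RADOSGW_CMD)
print(f"to_frontend_up (baseline, no k8s): {wait_for_port(7480):.3f}s")
proc.terminate()
proc.wait()
```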
Hence the idea of collecting measures over a series of restarts artificially triggered on the radosgw's POD.
Obtaining such measures allows us to compute some statistics on the time required by Kubernetes to restart the radosgw's POD.
Notes on testing s3gw within K8s
As previously said, we want to compute some statistics regarding Kubernetes' performance when restarting the radosgw's Pod.
We developed the s3gw Probe, a program whose purpose is to collect certain events coming from the radosgw process.
The tool acts as a client/server service inside the Kubernetes cluster.
- It acts as a client vs the radosgw process, requesting it to die.
- It acts as a client vs the k8s control plane, requesting a certain deployment to scale up and down.
- It acts as a server for the radosgw process, collecting its death and start events.
- It acts as a server for the user's client, accepting configurations of restart scenarios to be triggered against the radosgw process.
- It acts as a server for the user's client, returning statistics over the collected data.
The radosgw code has been patched to accept a REST call from the probe where the user can specify the way the radosgw will exit.
Currently, 4 modes are possible against the radosgw (an illustrative call is sketched after the list):
- exit0
- exit1
- segfault
- regular
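Purely as an illustration of the intent (the endpoint path and parameter name below are hypothetical, not the actual patched API), such a call could look like:

```python
import requests

# Hypothetical URL: the real path and parameters of the patched endpoint are
# not documented here; this only sketches the idea of the request.
DEATH_ENDPOINT = "http://127.0.0.1:7480/probe/death"

# Ask the radosgw to terminate with one of the configured death types; the
# probe then records the corresponding death and start events.
requests.put(DEATH_ENDPOINT, params={"how": "segfault"}, timeout=5)
```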
Regardless of the process's exit code, with Deployments, Kubernetes deals
with a restart loop using the same strategy.
A Pod handled by a Deployment goes into the CrashLoopBackoff state.
The Pod is not managed on its own. It is managed through a ReplicaSet, which in
turn is managed through a Deployment. A Deployment is a Kubernetes workload
primitive whose Pods are assumed to run indefinitely.
Regarding this behavior, there is actually an open request to make the CrashLoopBackoff timing tunable, at least for the cases when the process exits with zero.
Anyway, this behavior led us to think of a different way to obtain a sufficient number of restart samples for the radosgw's Pod.
We ended up issuing a scale deployment 0 followed by a scale deployment 1, applied to the radosgw's deployment.
This trick allows us to generate an arbitrary number of Pod restarts without falling into the CrashLoopBackoff state.
So, inside a Kubernetes environment, the probe tool can pilot the control plane
to scale down and up the s3gw's backend pod.
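A minimal sketch of this scale-down/scale-up trick with the Python Kubernetes client (deployment name and namespace are assumptions; the actual probe is a separate tool with its own event handling):

```python
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

def restart_via_scale(name: str = "s3gw", namespace: str = "s3gw-system") -> None:
    """Delete and recreate the pod by scaling its Deployment to 0 and back to 1."""
    # Scaling to 0 removes the pod without incrementing its restart counter,
    # so repeated restarts never enter the CrashLoopBackoff state.
    apps.patch_namespaced_deployment_scale(
        name, namespace, {"spec": {"replicas": 0}})
    apps.patch_namespaced_deployment_scale(
        name, namespace, {"spec": {"replicas": 1}})

# In the real probe, each iteration waits for the death and start events
# before triggering the next restart.
for _ in range(100):
    restart_via_scale()
```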
Currently, 2 modes are possible against the control plane:
- k8s_scale_deployment_0_1
- k8s_scale_deployment_0_1_node_rr
Tested Scenarios - radosgw-restart
When we test a scenario, we are interested in collecting the radosgw's restart events; for each restart we measure the following metrics:
- to_main: this is evaluated as the duration elapsed between a radosgw's death event and the measure taken at the very beginning of the main body in the newly restarted process.
- to_frontend_up: this is evaluated as the duration elapsed between a radosgw's death event and the measure taken just after the newly restarted process is able to accept a TCP/IP connection from a client.
From these 2 metrics we also produce a derived metric, frontend_up_main_delta, which is just the arithmetic difference between to_frontend_up and to_main.
For each scenario tested we collect a set of measures and produce a set of artifacts:
- *_stats.json
  - It is the json file containing all the measures done for a scenario. It also contains some key statistics.
- *_raw.svg
  - It is the plot containing all the charts for the measures: to_main, to_frontend_up, frontend_up_main_delta. On the X axis there are the restart events' IDs; they follow the temporal order of the restart events.
- *_percentiles_to_main.svg
  - It is the plot containing the percentile graph for the to_main metric.
- *_percentiles_to_fup.svg
  - It is the plot containing the percentile graph for the to_frontend_up metric.
- *_percentiles_fup_main_delta.svg
  - It is the plot containing the percentile graph for the frontend_up_main_delta metric.
The file name normally contains some information such as:
- deathtype: the way the radosgw process is asked to die:
  - exit0 - the process is asked to immediately exit with exit(0)
  - exit1 - the process is asked to immediately exit with exit(1)
  - segfault - the process is asked to trigger a segmentation fault
  - regular - the process is asked to exit with the ordered shutdown procedure
- environment: the environment where the scenario is tested:
  - localhost/host-path-volume
  - k8s/k3d/k3s ... /host-path-volume
  - k8s/k3d/k3s ... /LH-volume
- description: a key description of the scenario
- TS: a timestamp of when the artifacts were produced
regular_localhost_zeroload_emptydb
- restart-type: regular
- env: localhost/host-path-volume
- load: zero-empty-db
- #measures: 100
segfault_localhost_zeroload_emptydb
- restart-type: segfault
- env: localhost/host-path-volume
- load: zero-empty-db
- #measures: 100
regular_localhost_load_fio_64_write
- restart-type: regular
- env: localhost/host-path-volume
- load: fio
- #measures: 100
fio configuration:
[global]
ioengine=http
http_verbose=0
https=off
http_mode=s3
http_s3_key=test
http_s3_keyid=test
http_host=localhost:7480
[s3-write]
filename=/workload-1/obj1
numjobs=8
rw=write
size=64m
bs=1m
regular_localhost_zeroload_400_800Kdb
400K objects - measures done with the WAL file zeroed
- restart-type: regular
- env: localhost/host-path-volume
- load: zero-400K-db
- #measures: 100
800K objects - measures done with the WAL file still to be processed (size 32G)
- restart-type: regular
- env: localhost/host-path-volume
- load: zero-800K-db
- #measures: 100
regular-localhost-incremental-fill-5k
- restart-type: regular
- env: localhost/host-path-volume
- load: 5K-incremental-800K-db
- #measures: 100
Between every restart there is an interposed PUT-Object sequence of 5K objects each; the sqlite db initially contained 800K objects.
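A minimal sketch of one such interposed load block (endpoint, credentials, bucket name and object size are assumptions):

```python
import boto3

# Assumptions: local s3gw endpoint, demo credentials, existing bucket.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:7480",
    aws_access_key_id="test",
    aws_secret_access_key="test",
)

def put_batch(batch: int, count: int = 5000) -> None:
    """One interposed load block: 5K small PutObject calls."""
    for i in range(count):
        s3.put_object(Bucket="workload-1",
                      Key=f"batch-{batch}/obj-{i}",
                      Body=b"x" * 1024)
```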
scale_deployment_0_1-k3s3nodes_zeroload_emptydb
- restart-type: scale_deployment_0_1
- env: virtual-machine/k3s-3-nodes/LH-volume
- load: zero-empty-db
- #measures: 300
The test has been conducted in 3 blocks, each of 100 restarts. Each restart in a block is constrained to occur on a specific node (a sketch of this pinning follows the list). The schema is the following:
- taint all nodes but node-1
- trigger 100 pod restarts
- taint all nodes but node-2
- trigger 100 pod restarts
- taint all nodes but node-3
- trigger 100 pod restarts
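A minimal sketch of the node pinning used for each block (node names and the taint key are assumptions; kubectl taint is equivalent):

```python
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

NODES = ["node-1", "node-2", "node-3"]
# Assumption: an arbitrary taint key reserved for the test; note that this
# patch overwrites any taints already present on the nodes.
PIN_TAINT = {"key": "s3gw-probe/blocked", "value": "true", "effect": "NoSchedule"}

def pin_to(target: str) -> None:
    """Taint every node except `target` so rescheduled pods can only land on it."""
    for node in NODES:
        taints = [] if node == target else [PIN_TAINT]
        core.patch_node(node, {"spec": {"taints": taints}})

pin_to("node-1")   # then trigger 100 pod restarts; repeat for node-2, node-3
```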
Tested Scenarios - S3-workload during s3gw Pod outage
These scenarios focus on collecting data from an S3 client performing a workload during an s3gw outage.
For each S3 operation we collect both its Round Trip Time (RTT) and its result (success/failure).
Then we correlate the s3gw's outages with the collected results and RTTs (a sketch of the client-side collection follows the artifact description below).
For each scenario tested we produce a specific artifact:
*_S3WL_RTT_raw.svg
- It is the plot containing the RTT S3Workload chart:
  - X-Axis: the relative time (starting from 0) at which an S3 operation occurred.
  - Y-Axis: the RTT's duration in milliseconds.
  - Each vertical bar is colored Green when the corresponding S3 operation was successful, Red when the operation failed.
  - On the X-Axis, in Yellow, are drawn all the s3gw outages that occurred during the test; each segment represents the beginning and the end of an outage.
  - On the X-Axis, in Cyan, are drawn the durations before the first successful S3 operation after an outage.
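A minimal sketch of the client side of these scenarios (endpoint, credentials and bucket are assumptions), pacing one PutObject every 100 ms and recording the RTT and outcome of each call:

```python
import time
import boto3
from botocore.config import Config
from botocore.exceptions import BotoCoreError, ClientError

# Assumptions: endpoint (ClusterIP service or Ingress), demo credentials,
# existing bucket. Retries are disabled so every failed call shows up as a
# failed sample instead of being masked by the SDK.
s3 = boto3.client(
    "s3",
    endpoint_url="http://s3gw.local",
    aws_access_key_id="test",
    aws_secret_access_key="test",
    config=Config(retries={"max_attempts": 1, "mode": "standard"}),
)

samples = []   # (relative time [s], RTT [ms], success)
t0 = time.monotonic()
for i in range(500):
    start = time.monotonic()
    try:
        s3.put_object(Bucket="workload-1", Key=f"obj-{i}", Body=b"payload")
        ok = True
    except (BotoCoreError, ClientError):
        ok = False
    samples.append((start - t0, (time.monotonic() - start) * 1000.0, ok))
    time.sleep(0.1)   # the 100 ms pacing used in these scenarios
```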
PutObj-100ms-ClusterIp
- restart-type: regular
- env: k3d/host-path-volume
- client-S3-workload: PutObject/100ms
- S3-endpoint: s3gw-ClusterIP-service
- #restarts: 10
- #S3-operations: 394
PutObj-100ms-Ingress
- restart-type: regular
- env: k3d/host-path-volume
- client-S3-workload: PutObject/100ms
- S3-endpoint: s3gw-Ingress
- #restarts: 10
- #S3-operations: 504