How many nines? Understanding availability

[:en]You have most likely experienced system downtime already, either in an application you have worked on or in a service you consume. It has happened to Amazon, Netflix, Microsoft, Salesforce, etc. How long did you have to wait? How long did your users?

If you’re building an application and you ask your boss (or your client) what percentage of the time the application should be working correctly, you’ll most likely get an answer like “always” or “100% (or more)”.

Even though we don’t want bad things to happen, they surely will: bugs, attacks, power outages, natural disasters, etc., are all scenarios that might affect a system. Expecting them not to happen is naive; it’s better to think about and plan for failure, since it is inevitable.

Availability is the capability of an application to remain available after, and despite, a problem occurring. Again, we are not saying there will be no problems; the question is how effectively the application can recover from them. This means we need to a) identify the potential points of failure and b) create a strategy to prevent an error from becoming a failure that affects the user (a tree falling in the forest when no one is around makes no sound).

If a tree falls in a forest…

Going back to the question regarding the percent of time the application needs to work correctly, it is important to understand how it translates into downtime.

Availability	Downtime per year (365 days)
99%	3d 15h 36m
99.9%	8h 45m 36s
99.99%	52m 34s
99.999%	5m 15s

This means that if your target is 99.99%, you need to be able to recover in a little less than an hour if you have a single incident during the year, in less than 30 minutes each if you have two, and so on. It is important to note that these downtime values don’t include planned outages, such as updating the application, patching the OS, or migrating the database.
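The arithmetic behind these figures can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and it assumes a 365-day year, so the results may differ by a few seconds from tables that use a different year length.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: a 365-day year


def downtime_seconds(availability_pct: float) -> float:
    """Unplanned downtime allowed per year, in seconds, for a given target."""
    return SECONDS_PER_YEAR * (1 - availability_pct / 100)


def per_incident_budget(availability_pct: float, incidents: int) -> float:
    """Recovery time available for each incident if the yearly budget is split evenly."""
    if incidents < 1:
        raise ValueError("incidents must be at least 1")
    return downtime_seconds(availability_pct) / incidents


def fmt(seconds: float) -> str:
    """Format a duration in seconds as 'Dd Hh Mm Ss', rounded to the nearest second."""
    total = round(seconds)
    days, rem = divmod(total, 86_400)
    hours, rem = divmod(rem, 3_600)
    minutes, secs = divmod(rem, 60)
    return f"{days}d {hours}h {minutes}m {secs}s"


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        print(f"{target}% -> {fmt(downtime_seconds(target))} per year")
    # Budget per incident at 99.99% with two incidents in the year
    print(fmt(per_incident_budget(99.99, 2)))
```

Splitting the yearly budget evenly across incidents is a simplification; in practice one long outage can consume the whole allowance.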

Service Level Agreements

Having a target availability value (%) leads to the question: what happens if it is not reached? This is where Service Level Agreements (SLAs) come into play when working with third parties. An SLA is an agreement between two parties declaring the promised uptime and the penalty or credit applied if it is not met. For example, Microsoft declares in its SLA for Azure App Services that you might be eligible for a credit of 10% if they do not maintain an availability of at least 99.95%, and of 25% if it falls below 99%.
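As a rough illustration of how such tiers work, here is a small Python sketch. The thresholds and credit percentages are the ones quoted above; the monthly measurement window, the uptime formula, and the function names are assumptions made for this example, and a real SLA defines its own terms.

```python
def monthly_uptime_pct(total_minutes: float, downtime_minutes: float) -> float:
    """Uptime percentage over a billing period (assumed formula, not taken from any SLA)."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    return (total_minutes - downtime_minutes) / total_minutes * 100


def service_credit_pct(uptime_pct: float) -> int:
    """Credit tier for a measured uptime, using the thresholds quoted in the text."""
    if uptime_pct < 99.0:
        return 25
    if uptime_pct < 99.95:
        return 10
    return 0


if __name__ == "__main__":
    minutes_in_month = 30 * 24 * 60  # assumption: a 30-day month
    for downtime in (10, 30, 500):
        uptime = monthly_uptime_pct(minutes_in_month, downtime)
        print(f"{downtime} min down -> {uptime:.3f}% uptime, "
              f"{service_credit_pct(uptime)}% credit")
```

Note how small the margin is: in a 30-day month, staying at or above 99.95% leaves less than 22 minutes of downtime.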

Strategies

The “easy” part of the problem is handling application exceptions properly (On Error Resume Next, anyone?), but there are many other threats that can affect the availability of your application:

DDoS attacks taking your application server down.
SQL injection wiping out your database.
Floods, hurricanes and other natural disasters causing a massive power outage.

Even though these are security or external concerns, they produce the same result: your application becomes unavailable. Users don’t care why, but you should; you need to think about ways to keep this from happening. There are many possible strategies, but they can mainly be categorized as:

Prevention: Ensure you are thinking of potential errors before they appear; this means handling exceptions properly, identifying single points of failure, using fault trees to identify potential error chains, removing elements that might cause a problem, etc.
Detection: Ensure you become aware of the error at the right moment; monitoring tools like Nagios are a good example.
Correction: Once the problem has arisen, it needs to be solved: restore a backup, move to a different server, start a backup generator, etc.

(If you are working with Microsoft Azure, Security Center helps you identify threats in all three categories.)
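To make the detection category a bit more concrete, here is a minimal Python sketch of a health monitor that raises an alert after several consecutive failed checks. It is not how Nagios or any other tool works internally; the class, the `probe` callable, and the threshold of three failures are all hypothetical choices for this example.

```python
from typing import Callable


class HealthMonitor:
    """Tracks consecutive failed probes and signals when an alert should fire."""

    def __init__(self, probe: Callable[[], bool], failure_threshold: int = 3):
        self.probe = probe  # hypothetical callable: returns True when the service is healthy
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    def check(self) -> bool:
        """Run one probe; return True only on the check that reaches the threshold."""
        try:
            healthy = self.probe()
        except Exception:
            # A probe that raises (timeout, refused connection) counts as a failure.
            healthy = False

        if healthy:
            self.consecutive_failures = 0
            return False

        self.consecutive_failures += 1
        return self.consecutive_failures == self.failure_threshold


if __name__ == "__main__":
    # Simulated probe results instead of a real network call.
    results = iter([True, False, False, False, False, True])
    monitor = HealthMonitor(lambda: next(results))
    for i in range(6):
        if monitor.check():
            print(f"check {i}: alert, service looks down")
```

Requiring several failures in a row trades a slower alert for fewer false alarms, and the alert fires only once per outage rather than on every failed check.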

Backups always succeed, it’s Restores that fail. Test them. http://t.co/nzoti3LSur

— Scott Hanselman (@shanselman) July 3, 2014


As you can see, this means you may need to think about more than just your code. For example, what happens if your application handles all exceptions properly but there is a hardware failure in your server or data center? Or if someone with the database credentials runs delete from table with no filter clause on your production server? They may seem like extreme cases, but I’ve seen them happen.

Costs

As I’ve mentioned in other posts, architecture is about balance, and this is no exception. Aiming for an availability target implies applying strategies that come at a cost. For example, high availability typically means five nines (99.999%), which allows a little more than five minutes of total downtime per year. Reaching it means implementing several strategies that will increase the cost of the project. There are also trade-offs: you cannot aim for higher availability without impacting other quality attributes, like modifiability or maintainability.

Not every system requires that level of uptime; I think many of them can get by with two or three nines. There is also the context to consider: a payroll system failing on payday is not the same as it failing on any other day.

So the next time you are discussing availability with a client, be sure to ask, “How many nines?”[:]

2 responses

Denisse

2017-03-27

Good article. I would add the importance of assessing how likely each threat is, so that the analysis is truly effective. Securing a system to preserve its availability is one of the principles to protect in this field, but investing in security against low-probability threats can end up being one more hole in the sack of potatoes.
Regards!

1. Pollirrata
  
  2017-03-28
  
  Very true, security and availability go completely hand in hand.
  Thanks!