Availability Management is the practice of identifying levels of IT Service availability for use in Service Level Reviews with Customers.
All areas of a service must be measurable and defined within the Service Level Agreement (SLA).
To measure service availability the following areas are usually included in the SLA:
- Agreement statistics – such as what is included within the agreed service.
- Availability – agreed service times, response times, etc.
- Help Desk Calls – number of incidents raised, response times, resolution times.
- Contingency – agreed contingency details, location of documentation, contingency site, 3rd party involvement, etc.
- Capacity – performance timings for online transactions, report production, numbers of users, etc.
- Costing Details – charges for the service, and any penalties should service levels not be met.
Availability is usually calculated based on a model involving the Availability Ratio and techniques such as Fault Tree Analysis, and includes the following elements:
- Serviceability – where a service is provided by a 3rd party organisation, this is the expected availability of a component.
- Reliability – the time for which a component can be expected to perform under specific conditions without failure.
- Recoverability – the time it should take to restore a component back to its operational state after a failure.
- Maintainability – the ease with which a component can be maintained; maintenance can be either remedial or preventative.
- Resilience – the ability to withstand failure.
- Security – the ability of components to withstand breaches of security.
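Fault Tree Analysis is only named above, so the following is a minimal sketch of its most basic calculation, not a full method. It assumes component failures are independent, and the function names and sample figures are hypothetical. A service that needs every component in a chain is only as available as the product of their availabilities, while redundant components (one way of building in resilience) fail the service only if all of them fail at once.

```python
from math import prod


def series_availability(availabilities):
    """Availability when every component is required (any single failure causes an outage).

    Assumes independent failures; inputs are fractions between 0 and 1.
    """
    return prod(availabilities)


def parallel_availability(availabilities):
    """Availability of redundant components (outage only if all of them fail together).

    Assumes independent failures; inputs are fractions between 0 and 1.
    """
    return 1 - prod(1 - a for a in availabilities)


# Hypothetical figures: one web server (99%) in front of two redundant databases (95% each).
database_pair = parallel_availability([0.95, 0.95])
service = series_availability([0.99, database_pair])
print(f"database pair: {database_pair:.4f}, end-to-end service: {service:.4f}")
```

In this sketch the redundant pair is more available than either database alone, but the end-to-end figure still cannot exceed that of the single web server, which is why single points of failure tend to dominate such models.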
Some availability measurements that may be included in the SLA:
- Mean-Time-Between-Failure (MTBF): the average elapsed time from when a service is restored until it next fails, i.e. the uptime between failures. It represents reliability.
- Mean-Time-To-Repair (MTTR): the average elapsed time to repair a configuration item or IT service.
- Mean-Time-Between-System-Incidents (MTBSI): the average elapsed time between the detection of two consecutive incidents.
- Mean-Time-To-Restore-Service (MTRS): the average elapsed time from the detection of an incident until the service is restored. It represents maintainability.
MTBSI = MTBF + MTRS
Availability = uptime / (uptime + downtime) = MTBF / (MTBF + MTTR)
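As an illustration of how these measurements relate, here is a minimal sketch that derives them from an incident log. The `Incident` class, the `availability_metrics` function and the sample data are hypothetical. The sketch also assumes that downtime runs from detection to restoration, so repair time and restore time are treated as the same quantity (MTTR = MTRS).

```python
from dataclasses import dataclass


@dataclass
class Incident:
    detected_at: float  # hours since the start of the measurement window
    restored_at: float  # hours since the start of the measurement window


def availability_metrics(incidents, window_hours):
    """Derive MTBF, MTRS, MTBSI and availability from a list of incidents.

    Assumes incidents do not overlap and all fall inside the window.
    """
    if not incidents:
        raise ValueError("at least one incident is needed to compute the means")

    count = len(incidents)
    downtime = sum(i.restored_at - i.detected_at for i in incidents)
    uptime = window_hours - downtime

    mtbf = uptime / count    # average uptime per failure
    mtrs = downtime / count  # average time to restore service
    return {
        "mtbf": mtbf,
        "mtrs": mtrs,
        "mtbsi": mtbf + mtrs,  # MTBSI = MTBF + MTRS
        "availability": uptime / (uptime + downtime),
    }


# Hypothetical 30-day window (720 hours) with three incidents.
log = [Incident(100, 102), Incident(400, 401), Incident(600, 603)]
metrics = availability_metrics(log, window_hours=720)
print(metrics)
```

With this sample log the total downtime is 6 hours, giving an MTBF of 238 hours, an MTRS of 2 hours, an MTBSI of 240 hours and an availability of 714/720, or roughly 99.17%. The same availability results from MTBF / (MTBF + MTRS), which is the ratio form given above under the MTTR = MTRS assumption.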