Revisiting Security Fundamentals Part 3: Time to Examine Availability
As I discussed in parts 1 and 2 of this series, cybersecurity is complex. Security engineers rely on the application of fundamental principles to keep their jobs manageable. In the first installment of this series, I focused on confidentiality, and in the second installment, I discussed integrity. In this third and final part of the series, I’ll review availability. The application of these three principles in concert is essential to ensuring excellent user experiences on broadband.
Availability, like most things in cybersecurity, is complicated. Availability of broadband service, in a security context, ensures timely and reliable access to and use of information by authorized users. Achieving this, of course, can be challenging. In my opinion, the topic is under represented amongst security professionals and we have to rely on additional expertise to achieve our availability goals. The supporting engineering discipline for ensuring availability is reliability engineering. Many tomes are available to provide detailed insight on how to engineer systems to achieve desired reliability and availability.
How are the two ideas of reliability and availability different? Reliability focuses on how a system will function under specific conditions for a period of time. In contrast, availability focuses on how likely a system will function at a specified moment or interval of time. There are important additional terms to understand – quality, resiliency, and redundancy. These are addressed in the following paragraphs. Readers wanting more detail may consider reviewing some of the papers on reliability engineering at ScienceDirect.
Quality: We need to assure that our architectures and components are meeting requirements. We do that through quality assurance and reliability practices. Software and hardware vendors actually design, analyze, and test their solutions (both in development and then as part of shipping and integration testing) to assure they actually are meeting their reliability and availability requirements. When results aren’t sufficient, vendors apply process improvements (possibly including re-engineering) to bring their design, manufacturing, and delivery processes into line to hitting the reliability and availability requirements.
Resiliency: Again, however, this isn’t enough. We need to make sure our services are resilient – that is, even when something fails that our systems will recover (and something will fail – in fact, many things over time are going to fail, sometimes concurrently). There are a few key aspects we address when making our networks resilient. One is that when something fails, it does so loudly to the operator so they know something failed – either the failing element sends messages to the management system, or systems that it connects to or relies upon it tells the management system the element is failing or failed. Another is that the system can gracefully recover. It automatically restarts from the point it failed.
Redundancy: And, finally, we apply redundancy. That is to say we set up the architecture so that critical components are replicated (usually in parallel). This may happen within a network element (such as having two network controllers or two power supplies or two cooling units) with failover (and appropriate network management notifications) from one unit to another. Sometimes we’ll use clustering to both distribute load and achieve redundancy (sometimes referred to as M:N redundancy). Sometimes, we’ll have redundant network elements (often employed in data centers) or multiple routes of how network elements can connect through networks (using Ethernet, Internet, or even SONET). In cases where physical redundancy is not reasonable, we can introduce redundancy across other dimensions including time, frequency, channel, etc. How much redundancy a network element should imply depends on the math that balances reliability and availability to achiever your service requirements.
I’ve mentioned requirements several times. What requirements do I mean? A typical one, but not the only one and not even necessarily the most important one is the Mean Time Between Failure (MTBF). This statistic represents the statistical or expected average length of time between failures of a given element of concern, typically many thousands (even millions for some critical well understood components) of hours. There are variations. Seagate, for example, switched to Annualized Failure Rate (AFR) which is the “probable percent of failures per year, based on the [measured or observed failures] on the manufacturer’s total number of installed units of similar type (see Seagate link here). The key thing here, though, is to remember that MTBF and AFR are statistical predictions based on analysis and measured performance. It’s also important to estimate and measure availability – at the software, hardware, and service layers. If your measurements aren’t hitting the targets you set for a service, then something needs to improve.
A parting note, here. Lots of people talk about availability in terms of the percentage of time a service (or element) is up in a year. These are thrown around as “how many 9’s is your availability?” For example, “My service is available at 4x9s (99.99%)?” This is often a misused estimate because the user typically doesn’t know what is being measured, what it applies to (e.g., what’s included in the estimate), or even the basis for how the measurement is made. Never-the-less, it can be useful when backed by evidence, especially with statistical confidence.
Warnings about Availability
Finally, things ARE going to fail. Your statistics will, sometimes, be found wanting. Therefore, it’s critical to also consider how long it will take to RECOVER from a failure. In other words, what is your repair time? There are, of course, statistics for estimating this as well. A common one is Mean Time to Repair (MTTR). This seems like a simple term, but it isn’t. Really, MTTR is a statistic representing that measures how maintainable or repairable systems are. And measuring and estimating repair time is critical. Repair time can be the dominant contributor to unavailability.
So… why don’t we just make everything really reliable and all of our services highly available. Ultimately, this comes down to two things. One, you can’t predict all things sufficiently. This weakness is particularly important in security and is why availability is included as one of the three security fundamentals. You can’t predict well or easily how an adversary is going to attack your system and disrupt service. When the unpredictable happens, you figure out how to fix it and update your statistical models and analysis accordingly; and you update how you measure availability and reliability.
The second thing is simply economics. High availability is expensive. It can be really expensive. Years ago, I did a lot of work on engineering and architecting metropolitan, regional, and nationwide optical networks with Dr. Jason Rupe (one of my peers at CableLabs today). We found, through lots of research, that the general rule of thumb was for each “9” of availability, you could expect an increase of cost of service of around 2.5x on typical networks. Sounds extreme, doesn’t it? Typical availability of a private line or Ethernet circuit between two points (regional or national) is typically quoted at around 99%. That’s a lot of downtime (over 80 hours a year) – and it won’t all happen at a predictable time within a year. Getting that down to around 9 hours a year, or 99.9% availability, will cost 2.5x that, usually. Of course, architectures and technology does matter. This is just sharing my personal experience. What’s the primary driver of the cost? Redundancy. More equipment on additional paths of connectivity between that equipment.
There are lots of challenges in design of a highly available cost-effective access network. Redundancy is challenging. It’s implemented where economically reasonable, particularly at the CMTS’s, routers, switches, and other server elements at the hub, headend, or core network. It’s a bit harder to achieve in the HFC plant. So, designers and engineers tend to focus more on reliability of components and software and ensure that CMs, nodes, amplifiers, and all the other elements that make DOCSIS® work. Finally, we do measure our networks. A major tool for tracking and analyzing network failure causes and for maximizing the time between service failures is DOCSIS® Proactive Network Maintenance (PNM). PNM is used to identify problems in the physical RF plant including passive and active devices (taps, nodes, amplifiers), connectors, and the coax cable.
From a strictly security perspective, what can be done to improve the availability of services? Denial of service attacks are monitored and mitigated typically at the ingress points to networks (border routers) through scrubbing. Another major tool is ensuring authorized access through authentication and access controls.
It is important that we include consideration of availability in our security strategies. Security engineers often get too focused on threats caused by active adversaries. We must also include other considerations that disrupt the availability of experiences to subscribers. Availability is a fundamental security component and, like confidentiality and integrity, should be included in any security by design strategy.
What are the strategies and tools reliability and security engineers might apply?
- Model your system and assess availability diligently. Include traditional systems reliability engineering faults and conditions, but also include security faults and attacks.
- Execute good testing prior to customer exposure. Said another way, implement quality control and process improvement practices.
- When redundancy is impractical, reliability of elements becomes the critical availability design consideration.
- Measure and improve. PMN can significantly improve availability. But measure what matters.
- Partner with your suppliers to assure reliability, availability, and repairability throughout you network, systems, and supply chains.
- Leverage PNM fully. Solutions like DOCSIS® create a separation from a network problem and a service problem. PNM lets operators take advantage of that difference to fix network problems before they become service problems.
- Remember that repair time can be a major factor in the overall availability and customer experience.
Availability in Security
It’s important that we consider availability in our security strategies. Security engineers often get too focused on threats caused by active adversaries. We must also include other considerations that disrupt the availability of experiences to subscribers. Availability is a fundamental security component and—like confidentiality and integrity—should be included in any security-by-design strategy.
You can read more about 10G and security by clicking below.