Design for NFRs - introduction
This page is published under the terms of the licence summarized in the footnote.
This paper assembles notes, made during discussions in architect classes, on terms in the syllabus for BCS professional certificates in enterprise and solution architecture.
There is more about design for NFRs in Avancier’s training courses.
Contents
· Design for performance
· Design for resilience (availability and reliability)
· Design for integrity
· Design for scalability and serviceability
· Design for security
· Design for flexibility, maintainability, portability and extensibility
· Design for usability and interoperability
"The beginning of wisdom for a computer programmer is to recognise the difference between getting a program to work and getting it right" M.A. Jackson (1975).
What makes an architecture good or right?
Generally, how well the architected system meets its long-term objectives and requirements.
Here, we mostly take the Functional Requirements for granted, and focus on the Non-Functional Requirements (NFRs).
Often, the non-functional qualities of a system turn out to be critical.
The cost of meeting a few NFRs often vastly exceeds the cost of meeting many functional requirements.
Some NFRs are critical: they can mean the difference between success and failure of a project at testing time or at run time.
And if something goes wrong during operation, it may cost millions of dollars.
General advice on setting NFR measures:
· Set targets that are realistic, truly necessary and probably achievable.
· Define how they will be measured.
· Qualify targets with a tolerance level: the percentage of cases that must meet the target, and periods when the target may be relaxed.
E.g. 99% of Telco top-up transactions must be faster than 3 seconds, 95% must be faster than 2 seconds.
Or, a service is to be 99.95% available during weekdays, but a lower figure is acceptable at weekends.
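Such tiered targets can be checked mechanically against measured response times. E.g. a minimal sketch in Python (the function and the sample figures are illustrative, not from a real system):

    def meets_target(response_times, threshold_seconds, required_fraction):
        # True if at least the required fraction of samples beat the threshold.
        within = sum(1 for t in response_times if t < threshold_seconds)
        return within / len(response_times) >= required_fraction

    samples = [1.2, 1.8, 2.4, 0.9, 1.6, 1.1]  # hypothetical response times (seconds)
    print(meets_target(samples, 3.0, 0.99))   # all 6 under 3s -> True
    print(meets_target(samples, 2.0, 0.95))   # 5 of 6 under 2s (83%) -> False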
Who defines NFRs?
NFRs are usually negotiable; how much depends on context.
When customers ask for more than they need, or can afford, cost control requires that solution architects take the lead.
Pre-empt glib requirements; strive to steer NFRs.
It is easy for customers to state requirements that:
· overstate their true requirements, and thereby impose excessive costs on solution development,
· are well-nigh impossible to measure, or to meet, and
· are of questionable business benefit.
Turn NFRs as early as possible into solutions you can demonstrably deliver.
E.g. For maintainability you might define the use of specific “open” standards, and traceability of requirements to solution elements.
Consider also cost and risk.
It is not just the target measure users want; it is what you have to do, and pay, to meet it.
What is the cost/risk to the business? What is the cost/risk to the supplier?
What must the supplier do/pay when the measure isn’t met?
Options include: service credits, money, human resources (consultancy), technology resources, and equipment upgrades or discounts.
All the requirements below should be considered in designing a software architecture.
Not only the requirements of a whole system, but of each component, down to individual elements if necessary.
Performance: subdivides into two measures, sometimes in harmony and sometimes in opposition:
· Throughput or volume: the number of services executed in a time period.
· Response or cycle time (aka latency): the time taken from request to response or completion.
Performance requirements have a major impact on how much time and money must be spent on a solution.
The zoom out - zoom in principle.
First zoom out: a 3 second response time may not matter in the context of work that people are doing asynchronously in the business.
The wider system may have different needs, or different ways to get around NFR failures.
Then zoom in: is a 3 second response time feasible in the light of the components and communication paths involved?
Throughput and response time are sometimes in harmony and sometimes in conflict.
“Many assume the default Oracle 11g parameters are right for them, not realizing the incompatible goals of optimizing for throughput vs. fast response time.
These metrics are quite separate.
The default optimizer_mode = all_rows maximizes throughput by choosing SQL access plans that minimize server resources.
To optimize for the fastest response time, set optimizer_mode = first_rows.
These goals of response time and throughput are different and often at odds with each other.”
IT Tips by Burleson Consulting, September 3, 2009
The general advice is to look for bottlenecks and tackle them one by one.
These bottlenecks can occur at any scale in the solution (macro-scale or micro-scale).
There are some general good practices for coding efficiency, execution efficiency, memory efficiency, etc.
But what works in one context may be disastrous in another.
Scale up: Usually means increasing the power of a node by adding processor or memory resources.
It can mean increasing network bandwidth.
It is generally good for both response time and throughput.
Response time usually depends on networks and databases.
Both are much faster than they used to be, but slow compared with the speed of a computer.
Techniques include:
· Minimise distribution and network hops.
· Shorten the network hops.
· Minimise network traffic and message passing.
· Use a faster network medium (e.g. fibre).
· Eliminate unnecessary middleware and message queues.
· Minimise disk accesses (or rather, head movements).
Caching: Holding data, usually frequently-accessed data, in a temporary storage area; placing copies of persistent data in a location nearer to the user than the original data source.
Generally good for response or cycle time. Can raise concerns about data integrity and security.
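E.g. a minimal caching sketch in Python, using the standard library's lru_cache to hold frequently-accessed results in memory (the slow lookup here is simulated):

    from functools import lru_cache
    import time

    @lru_cache(maxsize=1024)
    def customer_profile(customer_id):
        time.sleep(0.5)  # stands in for a slow database or network call
        return {"id": customer_id}  # hypothetical result

    customer_profile(42)  # slow: goes to the "data source"
    customer_profile(42)  # fast: served from the in-memory cache

Note the integrity concern: if the underlying record changes, the cached copy is stale until it is evicted.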
Database optimisation: Techniques for reducing needless database access include:
· Normalisation or de-normalisation of data.
· Addition or removal of indexes.
· Optimisation of access paths.
· Turning referential integrity off.
· Moving data processing from the application server to the data server.
Indexes: A list of pointers to selected data elements in a set of data elements - usually, selected rows in a database table.
Useful in optimisation of batch input and output processes, which typically run overnight.
May be temporarily disabled during the day to optimise on-line update processes.
Access path analysis: Study of the route a process takes through a data store structure.
A very common source of performance problems is that an SQL programmer does not know the access path their procedure takes through a database.
So it is advisable to use access path analysis and/or employ highly skilled SQL resources for critical database access programs.
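E.g. a minimal sketch in Python, using the standard library's SQLite module to show how adding an index changes the access path the optimiser reports (table and index names are illustrative):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE orders (id INTEGER, customer TEXT, total REAL)")

    # Before indexing, the only access path is a scan of the whole table:
    print(con.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer = 'A'").fetchall())

    # Adding an index changes the access path to an index search:
    con.execute("CREATE INDEX idx_orders_customer ON orders (customer)")
    print(con.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer = 'A'").fetchall())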
Agile development and/or KISS
More often than not, simpler designs are faster in operation.
You must however look ahead to what throughput is expected in future (say 3 to 5 years), and ensure the simple design can handle it.
More thoughts:
· Change transaction start and end times/boundaries.
· Optimise memory release and garbage collection.
· Sharding: partition data between data stores (see the sketch after this list).
· Separation of data stores for:
o Persistent v. temporary or derivable data (no DR needed)
o Latest state v. transaction logs
o Active v. inactive data
· App profiling and tuning (e.g. Wily Introscope)
· Specialised process accelerators
· Shorten mainframe filepath
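E.g. a minimal hash-based sharding sketch in Python; the store names are illustrative, and real shard-routing schemes (e.g. consistent hashing) are more elaborate:

    import hashlib

    SHARDS = ["store-0", "store-1", "store-2", "store-3"]  # hypothetical data stores

    def shard_for(key: str) -> str:
        # Hash the key so that data is spread evenly across the stores.
        digest = hashlib.md5(key.encode()).digest()
        return SHARDS[digest[0] % len(SHARDS)]

    print(shard_for("customer-1001"))  # the same key always maps to the same store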
Scale out (aka clustering): Increasing the number of parallel processors; it usually involves running processes in parallel.
This might mean multi-threading of software components, but more usually means adding more (physical or virtual) nodes to a cluster.
Some kind of load balancer must sit in front of the cluster and distribute service requests between the nodes.
Generally good for throughput. Not always good for response time.
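E.g. a minimal round-robin load-balancing sketch in Python; round-robin is only one of several possible distribution policies, and the node names are illustrative:

    import itertools

    nodes = itertools.cycle(["node-1", "node-2", "node-3"])  # the cluster

    def dispatch(request):
        # Each request goes to the next node in turn.
        return (next(nodes), request)

    for r in ["req-a", "req-b", "req-c", "req-d"]:
        print(dispatch(r))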
“There’s always a glass ceiling”
Scaling out from 1 box to 2 boxes may double the throughput (the average response time is ever-so-marginally longer).
Scaling out from 2 boxes to 200 is different.
Keep adding boxes, and eventually you reach a point where response time suffers and throughput stops increasing.
Finding the bottleneck may not be easy.
There are obvious bottlenecks (say, the CPU working flat out), and less obvious, perhaps obscure, bottlenecks:
· Load balancer capacity
· Number of threads
· Number of connections / socket table
· Middleware capacity
· Number of message queues
· Size of message queue
· Ill-fitting database structure (see below)
Consider also: batch transactions for off-line processing.
Down time has a business impact.
E.g. the “Churn” department of a telco calculates the effect of down time on the loss of customers.
Availability: The amount or percentage of time that the services of a system are ready for use, excluding planned and allowed down time.
Possible measures include MTBF / (MTBF + MTBR), which usually refers to availability at the primary site, excluding disasters.
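E.g. a worked example of that formula, with illustrative figures:

    mtbf = 999.0  # mean time between failures, in hours (illustrative)
    mtbr = 1.0    # mean time to repair, in hours (illustrative)
    print("availability = {:.1%}".format(mtbf / (mtbf + mtbr)))  # 99.9%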
Reliability: The ability of a discrete component or service to run without failing.
Possible measures include mean time between failures (MTBF).
Aside: The measures above are sometimes applied only to platform applications, ignoring faults and failures in business applications.
Recoverability: The ability of a system to be restored to live operations after a failure.
Possible measures include mean time to repair (MTBR).
Usually refers to disaster recovery using resources at a remote site.
First, you can consider using highly reliable hardware.
After that, the primary technique is to build redundancy into the system, to provide parallel components to do the work.
Redundancy means duplicating process, data, tin and wire.
E.g. to scale out, add one to the number of servers in a cluster beyond what calculation or prototyping suggests is needed.
Other techniques include defensive design and provision of failover capability.
The availability requirements of the target have a major impact on how much communication and collaboration is necessary across elements of the solution for sharing function state, data and context.
This includes consideration for handling in-flight transactions, reference data and context data.
Fail over: Automatic switch over to a redundant or standby system, upon the failure or abnormal termination of the previously active system.
Failover happens suddenly and generally without warning.
Failover and failback should be automatic and transparent to application servers.
Defensive design: 1. Designing a client component so that it does not fall over if a server component does not work properly; asynchronous invocation may help.
2. Designing a server component so it does not depend on input data being valid, which means testing input data and preconditions before processing.
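E.g. a minimal defensive-design sketch in Python; the business function and figures are illustrative:

    def top_up(account_id, amount):
        # Server side: test input data and preconditions before processing.
        if not isinstance(amount, (int, float)) or amount <= 0:
            raise ValueError("amount must be a positive number")
        return "credited %s to %s" % (amount, account_id)

    # Client side: do not fall over when the server rejects the request.
    try:
        top_up("A-1", -5)
    except ValueError as err:
        print("request rejected:", err)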
Aside: Defensive design is the opposite of “Design by Contract” as promoted by Bertrand Meyer.
The principal technique is to back up and provide some kind of switch-over or fail-over procedure.
Central applications are configured so as to fail over from a central location to a backup disaster recovery site.
For high recoverability, you pay a lot.
Procedures must also address “fail back”, to return operations from a disaster recovery site to the normal production site.
Back up: A copy of data that may be used to restore the original after data loss.
Used in disaster recovery; also used to restore individual files that have been deleted or corrupted.
Backups are typically the last line of defence; they are coarse-grained and can be inconvenient to use.
Backup site: A location where systems are or can be duplicated.
A cold site has no equipment.
A warm site has infrastructure but no up-to-date data or software.
A hot site has up-to-date software and more or less up-to-date copies of data.
Synchronous replication: replicates data storage in real time, needs high bandwidth and low latency, limited by time/distance.
Asynchronous replication: replicates data storage off-line, makes and keeps copies of data at a remote site; operations can be resumed at the remote location using a remote copy, not subject to time/distance limits.
Don’t forget the data: VMware Site Recovery Manager stages and executes the process of failing over servers to a remote site.
But does your data storage system (SAN) fail over as well?
Don’t forget remote applications: which enterprise applications are business critical?
Remote apps may be more important than central ones.
Don’t forget business recovery: you need qualified people at business locations to carry on the business.
More things to think about:
· Monitoring/warning
· Testing of back-ups
· Back-up electricity supply
· Back-up clerical operations
· Transaction log
· Restore processes
· Business continuity support contract
· Escrow
· Key Support Documents (KSDs): a log of past recovery actions
Integrity: A term with four possible meanings, defined under “Data Integrity”.
The two general techniques for a data store are to:
· reduce data replication, and
· ensure updates are made via ACID transactions (see the sketch after the lists below).
More specific techniques are:
· normalise stored data,
· switch on automated referential integrity checks,
· remove caches, and
· consolidate distributed databases.
· Defensive design (always validate input data)
· Introduce check sums
· Remove duplicated data stores (caches)
· Design reconciliation processes
· Control user access
· Transaction logs and audit trails
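E.g. a minimal sketch in Python of making updates via an ACID transaction, using the standard library's SQLite module; either both updates are committed, or neither is (the accounts and amounts are illustrative):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance REAL)")
    con.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 100.0), ("b", 0.0)])

    try:
        with con:  # commits if the block succeeds, rolls back if it raises
            con.execute("UPDATE accounts SET balance = balance - 30 WHERE name = 'a'")
            con.execute("UPDATE accounts SET balance = balance + 30 WHERE name = 'b'")
    except sqlite3.Error:
        pass  # the whole transfer is undone; no half-updated state remains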
Scalability: The ability to expand a system, to increase system capacity in operation, to grow with increased workloads.
· N+1 design
· NoSQL databases
· Loose coupling
· Virtualisation
Serviceability: The ability to monitor and manage a system in operation.
· Document the configuration
· Use automated monitoring tools
· Manage technology life cycles - planned replacement cycle
· Optimal modular design
· Enable remote access to and control of client devices
In addition to the employment of server and network monitoring tools, a notable technique is to instrument applications so that they report on what they are doing, and how well they are doing it.
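E.g. a minimal instrumentation sketch in Python: a decorator that makes a function report what it did and how long it took (the business function is illustrative):

    import functools, logging, time

    logging.basicConfig(level=logging.INFO)

    def instrumented(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                # Report the function name and elapsed time to the monitoring log.
                logging.info("%s took %.3fs", fn.__name__, time.perf_counter() - start)
        return wrapper

    @instrumented
    def top_up(account_id, amount):  # hypothetical business function
        time.sleep(0.1)

    top_up("A-1", 20)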
Security: The ability to prevent unauthorised access to a system.
(This is Confidentiality only, since Integrity and Availability have already been covered.)
This has a major impact on how much separation and control must be implemented in each element across the target solution.
Design for human and organisational security: Definition of all the things that can be done outside of IT systems to secure business information, such as security guards, locks on doors, and the definition and roll-out of policies and procedures.
Data security: 1: Confidentiality alone. 2: A combination of Confidentiality, Integrity and Availability.
Aside: Tom Peltier suggests rating the security level of a data item, data structure or data store as equal to the highest of the individual ratings (high, medium, low) awarded for Confidentiality, Integrity and Availability.
Security protection: Prevention of access to data, designed to maintain the required data qualities of confidentiality, availability and integrity.
Security feature: A feature of a system that enables its data and processes to be protected, such as encryption, checksums or HTTPS.
Security policy: A policy that defines which actors have (or do not have) access rights to objects in a given domain, along with any other protections.
Information domain: A uniquely identified set of objects with a common security policy.
Access to any data within the domain is limited and constrained by the same rules.
Identity: One or more data items (or attributes) that uniquely label an entity or actor instance, e.g. a passport number or user name.
Encryption: A process to encode data items (in a data store or data flow) so that they are meaningless to any actor who cannot decode them.
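E.g. a minimal encryption sketch, assuming the third-party Python 'cryptography' package is installed; the plain text is illustrative:

    from cryptography.fernet import Fernet

    key = Fernet.generate_key()  # only actors holding the key can decode
    f = Fernet(key)
    token = f.encrypt(b"account A-1, card 4111...")  # meaningless without the key
    print(f.decrypt(token))  # the original bytes, recovered by a key-holder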
Checksum: A redundant data item added to a message, the result of adding up the bits or bytes in the message and applying a formula.
This enables the receiver to detect if the message has been changed.
It protects against accidental data corruption, but does not guarantee data flow integrity, since it relies on the formula being known only to sender and receiver.
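E.g. a minimal checksum sketch in Python, using CRC-32 from the standard library; the message is illustrative:

    import zlib

    message = b"top up account A-1 by 20"
    checksum = zlib.crc32(message)  # sent along with the message

    # The receiver recomputes the checksum over the received bytes and compares:
    received = message
    print(zlib.crc32(received) == checksum)  # False if the message was corrupted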
Digital signature: A cryptographic scheme that simulates the security properties of a handwritten signature.
More secure than a checksum, it is said to guarantee the data flow integrity of a message, since the signature is corrupted if the message content is changed.
Design for applications security: Techniques for preventing unauthorised use of an application.
Identification: A process via which an entity or actor supplies their identity to an authority. It is usually followed by authentication.
Authentication: A process to confirm or deny that an actor is trusted - is the entity to which an identity was given. E.g. a password check.
It produces one of four results: true positive, true negative, false negative (which leads to wrongly-denied access), or false positive (which leads to unauthorised access).
It is usually followed by authorisation.
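E.g. a minimal password-check sketch in Python, using a salted hash from the standard library; the password and iteration count are illustrative:

    import hashlib, hmac, os

    def hash_password(password, salt):
        # Derive a hash from the password; the salt defeats precomputed tables.
        return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)

    salt = os.urandom(16)
    stored = hash_password("s3cret", salt)  # held by the authenticating authority

    def authenticate(password):
        # Constant-time comparison avoids leaking information via timing.
        return hmac.compare_digest(hash_password(password, salt), stored)

    print(authenticate("s3cret"), authenticate("guess"))  # True False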
Three-factor authentication: Authentication by checking what users remember (e.g. password, mother’s maiden name), carry (e.g. credit card or key) and are (using biometric data).
Authorisation: A process giving access to a trusted actor, based on that actor’s known access rights.
It is usually followed by access.
Access: A process to look inside a system to find data or processes of interest, with a view to retrieving or using them.
Design for infrastructure security: Techniques for protecting client and server computers from malicious access.
Client-side security: Features that protect client-end computers from malicious access.
Server-side security: Features that protect server computers and databases from malicious clients.
Firewall: Software at the boundary of a network that is used to detect, filter out and report messages that are unauthorised and/or not from a trusted source.
De-Militarised Zone (DMZ): An area of a network, usually between the public internet and the enterprise network.
It uses firewalls to filter out messages that fail security checks. It contains servers that respond to internet protocols like HTTP and FTP.
HTTPS: Normal HTTP interoperation over an encrypted Secure Sockets Layer (SSL) or Transport Layer Security (TLS) connection.
This ensures reasonable protection of data content from those who intercept the data flow in transit.
Aside: If an HTTPS URL does not specify a TCP port, the connection uses port 443.
Web site security: Usually, a process whereby a web browser checks the public key certificate of a web server at the other end of an HTTPS connection.
The aims are to check that the web server is authentic (is who it claims to be) and that messages to/from the web server cannot be read by eavesdroppers.
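E.g. a minimal sketch in Python of a client checking a server's certificate over an HTTPS connection (port 443), using the standard library; the host name is illustrative and the script needs network access:

    import socket, ssl

    ctx = ssl.create_default_context()  # verifies the server certificate chain
    with socket.create_connection(("example.com", 443)) as raw:
        with ctx.wrap_socket(raw, server_hostname="example.com") as s:
            # The handshake fails if the certificate is invalid or mismatched.
            print(s.version(), s.getpeercert()["subject"])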
Flexibility: The ability to reconfigure a system with new interfaces, new rules.
Maintainability: The ability to analyse, then correct or enhance, a system.
Portability: The ability to move a system from one platform to another, or convert it to run on another platform.
Extensibility: The ability to add new services or functions.
These requirements have a major impact on how much business logic can be abstracted into parameters modifiable at run-time rather than “hard coded”.
For example, for an embedded system inside an engine, there are typically no requirements for new interfaces, rules or data.
This allows the architecture to be optimised for the prime characteristics of performance and resilience.
But what if the manufacturer has a range of such devices (say for different engine types) and expects to develop them over time?
Their Software Development Kit may contain flexibility capabilities which a pre-compiler removes when it optimises into the target (production) code.
A very different example is the enterprise solution for a multi-national telecoms company.
It needs different interfaces in different countries, different tariff requirements, etc.
It also has the expectation of rapid change, frequently different across different product lines.
It is often trialling innovations, which may be discarded if not viable.
Design for portability
Conform to open standards (CDMs and IDL/APIs).
Otherwise, design to enable vendor independence.
Agile development and KISS
Whatever the requirement, many solution options can be made to work, but look first at the simplest.
More often than not, simpler designs are easier to maintain.
The simplest design may not be configurable, but it is easier to refactor when new requirements arrive.
The agile philosophy is that large projects to build complex software systems tend to fail.
Rather than set out to build a complex system, it is faster, cheaper and better to build a simple solution, and learn from that.
While developing the simple solution, much is learnt about what kind of flexibility is required in a more complex solution.
Centralisation v decentralisation
The maintainability requirements of the target have a major impact on how “centralised” the configurable aspects of the solution must be.
In this context, centralisation can be either real or virtual.
A virtual implementation might provide a management tool for viewing and creating configuration profiles centrally, but then automatically communicate them to and from the distributed instances of the elements of the configuration throughout the architecture.
An example of a truly decentralised maintenance architecture is a typical LAN/WAN design.
Each individual firewall, router, proxy and controller has its own configuration profile, which is viewed (and altered) independently of each of the others.
In this case, there is no automated means to correlate any of the profiles, so any discrepancies between profiles must be identified and corrected manually.
An example of a virtually centralised maintenance architecture is the control plane (for managing network routing and accounting control) in a Next Generation Network or Software-Defined Network, where the maintenance of a network is centrally viewed and managed, but is implemented in key controllers across the network infrastructure.
Socio-economic factors
Consider skills availability in the industry/region, the longevity of the tools/techniques/technologies, and the absolute costs.
All have a major impact on the tools, techniques and technologies used in constructing the target solution.
Usability: The ability of actors to use a system with minimal effort.
Measures include PLUME:
· Productivity: tasks completed in a given time.
· Learnability: how much training is needed to reach a proficiency level.
· User Satisfaction: scores given by users.
· Memorability: how long it takes to forget how to use the system.
· Error Rates.
Interoperability: The ability of systems to exchange data using shared protocols and networks.
May embrace “integratability”: the ability of interoperable systems to understand each other, which requires either common data types or translation between data types.
Footnote: Creative Commons Attribution-No Derivative Works Licence 2.0
Attribution: You may copy, distribute and display this copyrighted work only if you clearly credit “Avancier Limited: http://avancier.co.uk” before the start and include this footnote at the end.
No Derivative Works: You may copy, distribute and display only complete and verbatim copies of this page, not derivative works based upon it.
For more information about the licence, see http://creativecommons.org