Design for NFRs - introduction
This page is published under the terms of the licence summarized in the footnote.
This paper assembles notes, made during discussions in architect classes, on terms in the syllabus for BCS professional certificates in enterprise and solution architecture.
There is more about design for NFRs in Avancier’s training courses.
Contents
· Design for performance
· Design for resilience (availability and reliability)
· Design for integrity
· Design for scalability and serviceability
· Design for security
· Design for flexibility, maintainability, portability and extensibility
· Design for usability and interoperability
"The beginning of wisdom for a computer programmer is to recognise the difference between getting a program to work and getting it right" M.A. Jackson (1975).
What makes an architecture good or right?
Generally, how well the architected system meets its long-term objectives and requirements.
Here, we mostly take the Functional Requirements for granted, and focus on the Non-Functional Requirements (NFRs).
Often, the non-functional qualities of a system turn out to be critical.
The cost of meeting a few NFRs often vastly exceeds the cost of meeting many functional requirements.
Some NFRs are critical: they can mean the difference between success and failure of a project at testing time or at run time.
And if something goes wrong during operation, it may cost millions of dollars.
General advice on setting NFR measures:
· Set targets that are realistic, truly necessary and probably achievable.
· Define how they will be measured.
· Qualify targets with a tolerance level: the percentage of cases that must meet the target, and periods when the target may be relaxed.
E.g. 99% of Telco top-up transactions must be faster than 3 seconds, 95% must be faster than 2 seconds.
Or, a service is to be 99.95% available during weekdays, but a lower figure is acceptable at weekends.
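Such tiered targets can be checked mechanically against measured response times. E.g. a minimal sketch in Python (the function and the sample figures are illustrative, not from a real system):

    def meets_target(response_times, threshold_seconds, required_fraction):
        # True if at least the required fraction of samples beat the threshold.
        within = sum(1 for t in response_times if t < threshold_seconds)
        return within / len(response_times) >= required_fraction

    samples = [1.2, 1.8, 2.4, 0.9, 1.6, 1.1]  # hypothetical response times (seconds)
    print(meets_target(samples, 3.0, 0.99))   # all 6 under 3s -> True
    print(meets_target(samples, 2.0, 0.95))   # 5 of 6 under 2s (83%) -> False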
Who defines NFRs?
NFRs are usually negotiable; how much depends on context.
When customers ask for more than they need, or can afford, cost control requires that solution architects take the lead.
Pre-empt glib requirements; strive to steer NFRs.
It is easy for customers to state requirements that:
· overstate their true requirements, and thereby impose excessive costs on solution development,
· are well-nigh impossible to measure, or to meet, and
· are of questionable business benefit.
Turn NFRs as early as possible into solutions you can demonstrably deliver.
E.g. For maintainability you might define the use of specific “open” standards, and traceability of requirements to solution elements.
Consider also cost and risk.
It is not just the target measure users want; it is what you have to do, and pay, to meet it.
What is the cost/risk to the business? What is the cost/risk to the supplier?
What must the supplier do/pay when the measure isn’t met?
Options include: service credits, money, human resources (consultancy), technology resources, and equipment upgrades or discounts.
All the requirements below should be considered in designing a software architecture.
Not only the requirements of a whole system, but of each component, down to individual elements if necessary.
Performance: subdivides into two measures, sometimes in harmony and sometimes in opposition:
· Throughput or volume: the number of services executed in a time period.
· Response or cycle time (aka latency): the time taken from request to response or completion.
Performance requirements have a major impact on how much time and money must be spent on a solution.
The zoom out - zoom in principle.
First zoom out: a 3 second response time may not matter in the context of work that people are doing asynchronously in the business.
The wider system may have different needs, or different ways to get around NFR failures.
Then zoom in: is a 3 second response time feasible in the light of the components and communication paths involved?
Throughput and response time are sometimes in harmony and sometimes in conflict.
“Many assume the default Oracle 11g parameters are right for them, not realizing the incompatible goals of optimizing for throughput vs. fast response time.
These metrics are quite separate.
The default optimizer_mode = all_rows maximizes throughput by choosing SQL access plans that minimize server resources.
To optimize for the fastest response time, set optimizer_mode = first_rows.
These goals of response time and throughput are different and often at odds with each other.”
IT Tips by Burleson Consulting, September 3, 2009
The general advice is to look for bottlenecks and tackle them one by one.
These bottlenecks can occur at any scale in the solution (macro-scale or micro-scale).
There are some general good practices for coding efficiency, execution efficiency, memory efficiency, etc.
But what works in one context may be disastrous in another.
Scale up: Usually means increasing the power of a node by adding processor or memory resources.
It can mean increasing network bandwidth.
It is generally good for both response time and throughput.
Response time usually depends on networks and databases.
Both are much faster than they used to be, but slow compared with the speed of a computer.
Techniques include:
· Minimise distribution and network hops.
· Shorten the network hops.
· Minimise network traffic and message passing.
· Use a faster network medium (e.g. fibre).
· Eliminate unnecessary middleware and message queues.
· Minimise disk accesses (or rather, head movements).
Caching: Holding data, usually frequently-accessed data, in a temporary storage area; placing copies of persistent data in a location nearer to the user than the original data source.
Generally good for response or cycle time. Can raise concerns about data integrity and security.
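E.g. a minimal caching sketch in Python, using the standard library's lru_cache to hold frequently-accessed results in memory (the slow lookup here is simulated):

    from functools import lru_cache
    import time

    @lru_cache(maxsize=1024)
    def customer_profile(customer_id):
        time.sleep(0.5)  # stands in for a slow database or network call
        return {"id": customer_id}  # hypothetical result

    customer_profile(42)  # slow: goes to the "data source"
    customer_profile(42)  # fast: served from the in-memory cache

Note the integrity concern: if the underlying record changes, the cached copy is stale until it is evicted.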
Database optimisation: Techniques for reducing needless database access include:
· Normalisation or de-normalisation of data.
· Addition or removal of indexes.
· Optimisation of access paths.
· Turning referential integrity off.
· Moving data processing from the application server to the data server.
Indexes: A list of pointers to selected data elements in a set of data elements - usually, selected rows in a database table.
Useful in optimisation of batch input and output processes, which typically run overnight.
May be temporarily disabled during the day to optimise on-line update processes.
Access path analysis: Study of the route a process takes through a data store structure.
A very common source of performance problems is that an SQL programmer does not know the access path their procedure takes through a database.
So it is advisable to use access path analysis and/or employ highly skilled SQL resources for critical database access programs.
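E.g. a minimal sketch in Python, using the standard library's SQLite module to show how adding an index changes the access path the optimiser reports (table and index names are illustrative):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE orders (id INTEGER, customer TEXT, total REAL)")

    # Before indexing, the only access path is a scan of the whole table:
    print(con.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer = 'A'").fetchall())

    # Adding an index changes the access path to an index search:
    con.execute("CREATE INDEX idx_orders_customer ON orders (customer)")
    print(con.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer = 'A'").fetchall())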
Agile development and/or KISS
More often than not, simpler designs are faster in operation.
You must however look ahead to what throughput is expected in future (say 3 to 5 years), and ensure the simple design can handle it.
More thoughts:
· Change transaction start and end times/boundaries.
· Optimise memory release and garbage collection.
· Sharding: partition data between data stores (see the sketch after this list).
· Separation of data stores for:
o Persistent v. temporary or derivable data (no DR needed)
o Latest state v. transaction logs
o Active v. inactive data
· App profiling and tuning (e.g. Wily Introscope)
· Specialised process accelerators
· Shorten mainframe filepath
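E.g. a minimal hash-based sharding sketch in Python; the store names are illustrative, and real shard-routing schemes (e.g. consistent hashing) are more elaborate:

    import hashlib

    SHARDS = ["store-0", "store-1", "store-2", "store-3"]  # hypothetical data stores

    def shard_for(key: str) -> str:
        # Hash the key so that data is spread evenly across the stores.
        digest = hashlib.md5(key.encode()).digest()
        return SHARDS[digest[0] % len(SHARDS)]

    print(shard_for("customer-1001"))  # the same key always maps to the same store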
Scale out (aka clustering): Increasing the number of parallel processors; it usually involves running processes in parallel.
This might mean multi-threading of software components, but more usually means adding more (physical or virtual) nodes to a cluster.
Some kind of load balancer must sit in front of the cluster and distribute service requests between the nodes.
Generally good for throughput. Not always good for response time.
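E.g. a minimal round-robin load-balancing sketch in Python; round-robin is only one of several possible distribution policies, and the node names are illustrative:

    import itertools

    nodes = itertools.cycle(["node-1", "node-2", "node-3"])  # the cluster

    def dispatch(request):
        # Each request goes to the next node in turn.
        return (next(nodes), request)

    for r in ["req-a", "req-b", "req-c", "req-d"]:
        print(dispatch(r))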
“There’s always a glass ceiling”
Scaling out from 1 box to 2 boxes may double the throughput (the average response time is ever-so-marginally longer).
Scaling out from 2 boxes to 200 is different.
Keep adding boxes, and eventually you reach a point where response time suffers and throughput stops increasing.
Finding the bottleneck may not be easy.
There are obvious bottlenecks (say, the CPU working flat out), and less obvious, perhaps obscure, bottlenecks:
· Load balancer capacity
· Number of threads
· Number of connections / socket table
· Middleware capacity
· Number of message queues
· Size of message queue
· Ill-fitting database structure (see below)
Consider also: batch transactions for off-line processing.
Down time has a business impact.
E.g. the “Churn” department of a telco calculates the effect of down time on the loss of customers.
Availability: The amount or percentage of time that the services of a system are ready for use, excluding planned and allowed down time.
Possible measures include MTBF / (MTBF + MTBR), which usually refers to availability at the primary site, excluding disasters.
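E.g. a worked example of that formula, with illustrative figures:

    mtbf = 999.0  # mean time between failures, in hours (illustrative)
    mtbr = 1.0    # mean time to repair, in hours (illustrative)
    print("availability = {:.1%}".format(mtbf / (mtbf + mtbr)))  # 99.9%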
Reliability: The ability of a discrete component or service to run without failing.
Possible measures include mean time between failures (MTBF).
Aside: The measures above are sometimes applied only to platform applications, ignoring faults and failures in business applications.
Recoverability: The ability of a system to be restored to live operations after a failure.
Possible measures include mean time to repair (MTBR).
Usually refers to disaster recovery using resources at a remote site.
First, you can consider using highly reliable hardware.
After that, the primary technique is to build redundancy into the system, to provide parallel components to do the work.
Redundancy means duplicating process, data, tin and wire.
E.g. to scale out, add one to the number of servers in a cluster beyond what calculation or prototyping suggests is needed.
Other techniques include defensive design and provision of failover capability.
The availability requirements of the target have a major impact on how much communication and collaboration is necessary across elements of the solution for sharing function state, data and context.
This includes consideration for handling in-flight transactions, reference data and context data.
Fail over: Automatic switch over to a redundant or standby system, upon the failure or abnormal termination of the previously active system.
Failover happens suddenly and generally without warning.
Failover and failback should be automatic and transparent to application servers.
Defensive design: 1. Designing a client component so that it does not fall over if a server component does not work properly; asynchronous invocation may help.
2. Designing a server component so it does not depend on input data being valid, which means testing input data and preconditions before processing.
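E.g. a minimal defensive-design sketch in Python; the business function and figures are illustrative:

    def top_up(account_id, amount):
        # Server side: test input data and preconditions before processing.
        if not isinstance(amount, (int, float)) or amount <= 0:
            raise ValueError("amount must be a positive number")
        return "credited %s to %s" % (amount, account_id)

    # Client side: do not fall over when the server rejects the request.
    try:
        top_up("A-1", -5)
    except ValueError as err:
        print("request rejected:", err)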
Aside: Defensive design is the opposite of “Design by Contract” as promoted by Bertrand Meyer.
The principal technique is to back up and provide some kind of switch-over or fail-over procedure.
Central applications are configured so as to fail over from a central location to a backup disaster recovery site.
For high recoverability, you pay a lot.
Procedures must also address “fail back”, to return operations from a disaster recovery site to the normal production site.
Back up: A copy of data that may be used to restore the original after data loss.
Used in disaster recovery; also used to restore individual files that have been deleted or corrupted.
Backups are typically the last line of defence; they are coarse-grained and can be inconvenient to use.
Backup site: A location where systems are or can be duplicated.
A cold site has no equipment.
A warm site has infrastructure but no up-to-date data or software.
A hot site has up-to-date software and more or less up-to-date copies of data.
Synchronous replication: replicates data storage in real time, needs high bandwidth and low latency, limited by time/distance.
Asynchronous replication: replicates data storage off-line, makes and keeps copies of data at a remote site; operations can be resumed at the remote location using a remote copy, not subject to time/distance limits.
Don’t forget the data: VMware Site Recovery Manager stages and executes the process of failing over servers to a remote site.
But does your data storage system (SAN) fail over as well?
Don’t forget remote applications: which enterprise applications are business critical?
Remote apps may be more important than central ones.
Don’t forget business recovery: you need qualified people at business locations to carry on the business.
More things to think about:
· Monitoring/warning
· Testing of back-ups
· Back-up electricity supply
· Back-up clerical operations
· Transaction log
· Restore processes
· Business continuity support contract
· Escrow
· Key Support Documents (KSDs): a log of past recovery actions
Integrity: A term with four possible meanings, defined under “Data Integrity”.
The two general techniques for a data store are to:
· reduce data replication, and
· ensure updates are made via ACID transactions (see the sketch after the lists below).
More specific techniques are:
· normalise stored data,
· switch on automated referential integrity checks,
· remove caches, and
· consolidate distributed databases.
· Defensive design (always validate input data)
· Introduce check sums
· Remove duplicated data stores (caches)
· Design reconciliation processes
· Control user access
· Transaction logs and audit trails
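E.g. a minimal sketch in Python of making updates via an ACID transaction, using the standard library's SQLite module; either both updates are committed, or neither is (the accounts and amounts are illustrative):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance REAL)")
    con.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 100.0), ("b", 0.0)])

    try:
        with con:  # commits if the block succeeds, rolls back if it raises
            con.execute("UPDATE accounts SET balance = balance - 30 WHERE name = 'a'")
            con.execute("UPDATE accounts SET balance = balance + 30 WHERE name = 'b'")
    except sqlite3.Error:
        pass  # the whole transfer is undone; no half-updated state remains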
Scalability: The ability to expand a system, to increase system capacity in operation, to grow with increased workloads.
· N+1 design
· NoSQL databases
· Loose coupling
· Virtualisation
Serviceability: The ability to monitor and manage a system in operation.
· Document the configuration
· Use automated monitoring tools
· Manage technology life cycles - planned replacement cycle
· Optimal modular design
· Enable remote access to and control of client devices
In addition to the employment of server and network monitoring tools, a notable technique is to instrument applications so that they report on what they are doing, and how well they are doing it.
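E.g. a minimal instrumentation sketch in Python: a decorator that makes a function report what it did and how long it took (the business function is illustrative):

    import functools, logging, time

    logging.basicConfig(level=logging.INFO)

    def instrumented(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                # Report the function name and elapsed time to the monitoring log.
                logging.info("%s took %.3fs", fn.__name__, time.perf_counter() - start)
        return wrapper

    @instrumented
    def top_up(account_id, amount):  # hypothetical business function
        time.sleep(0.1)

    top_up("A-1", 20)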
Security: The ability to prevent unauthorised access to a system.
(This is Confidentiality only, since Integrity and Availability have already been covered.)
This has a major impact on how much separation and control must be implemented in each element across the target solution.
Design for human and organisational security: Definition of all the things that can be done outside of IT systems to secure business information, such as security guards, locks on doors, and the definition and roll-out of policies and procedures.
Data security: 1: Confidentiality alone. 2: A combination of Confidentiality, Integrity and Availability.
Aside: Tom Peltier suggests rating the security level of a data item, data structure or data store as equal to the highest of the individual ratings (high, medium, low) awarded for Confidentiality, Integrity and Availability.
Security protection: Prevention of access to data, designed to maintain the required data qualities of confidentiality, availability and integrity.
Security feature: A feature of a system that enables its data and processes to be protected, such as encryption, checksums or HTTPS.
Security policy: A policy that defines which actors have (or do not have) access rights to objects in a given domain, along with any other protections.
Information domain: A uniquely identified set of objects with a common security policy.
Access to any data within the domain is limited and constrained by the same rules.
Identity: One or more data items (or attributes) that uniquely label an entity or actor instance, e.g. a passport number or user name.
Encryption: A process to encode data items (in a data store or data flow) so that they are meaningless to any actor who cannot decode them.
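E.g. a minimal encryption sketch, assuming the third-party Python 'cryptography' package is installed; the plain text is illustrative:

    from cryptography.fernet import Fernet

    key = Fernet.generate_key()  # only actors holding the key can decode
    f = Fernet(key)
    token = f.encrypt(b"account A-1, card 4111...")  # meaningless without the key
    print(f.decrypt(token))  # the original bytes, recovered by a key-holder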
Checksum: A redundant data item added to a message, the result of adding up the bits or bytes in the message and applying a formula.
This enables the receiver to detect if the message has been changed.
It protects against accidental data corruption, but does not guarantee data flow integrity, since it relies on the formula being known only to sender and receiver.
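E.g. a minimal checksum sketch in Python, using CRC-32 from the standard library; the message is illustrative:

    import zlib

    message = b"top up account A-1 by 20"
    checksum = zlib.crc32(message)  # sent along with the message

    # The receiver recomputes the checksum over the received bytes and compares:
    received = message
    print(zlib.crc32(received) == checksum)  # False if the message was corrupted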
Digital signature: A cryptographic scheme that simulates the security properties of a handwritten signature.
More secure than a checksum, it is said to guarantee the data flow integrity of a message, since the signature is corrupted if the message content is changed.
Design for applications security: Techniques for preventing unauthorised use of an application.
Identification: A process via which an entity or actor supplies their identity to an authority. It is usually followed by authentication.
Authentication: A process to confirm or deny that an actor is trusted - is the entity to which an identity was given. E.g. a password check.
It produces one of four results: true positive, true negative, false negative (which leads to wrongly-denied access), or false positive (which leads to unauthorised access).
It is usually followed by authorisation.
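E.g. a minimal password-check sketch in Python, using a salted hash from the standard library; the password and iteration count are illustrative:

    import hashlib, hmac, os

    def hash_password(password, salt):
        # Derive a hash from the password; the salt defeats precomputed tables.
        return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)

    salt = os.urandom(16)
    stored = hash_password("s3cret", salt)  # held by the authenticating authority

    def authenticate(password):
        # Constant-time comparison avoids leaking information via timing.
        return hmac.compare_digest(hash_password(password, salt), stored)

    print(authenticate("s3cret"), authenticate("guess"))  # True False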
Three-factor authentication: Authentication by checking what users remember (e.g. password, mother’s maiden name), carry (e.g. credit card or key) and are (using biometric data).
Authorisation: A process giving access to a trusted actor, based on that actor’s known access rights.
It is usually followed by access.
Access: A process to look inside a system to find data or processes of interest, with a view to retrieving or using them.
Design for infrastructure security: Techniques for protecting client and server computers from malicious access.
Client-side security: Features that protect client-end computers from malicious access.
Server-side security: Features that protect server computers and databases from malicious clients.
Firewall: Software at the boundary of a network that is used to detect, filter out and report messages that are unauthorised and/or not from a trusted source.
De-Militarised Zone (DMZ): An area of a network, usually between the public internet and the enterprise network.
It uses firewalls to filter out messages that fail security checks. It contains servers that respond to internet protocols like HTTP and FTP.
HTTPS: Normal HTTP interoperation over an encrypted Secure Sockets Layer (SSL) or Transport Layer Security (TLS) connection.
This ensures reasonable protection of data content from those who intercept the data flow in transit.
Aside: If an HTTPS URL does not specify a TCP port, the connection uses port 443.
Web site security: Usually, a process whereby a web browser checks the public key certificate of a web server at the other end of an HTTPS connection.
The aims are to check that the web server is authentic (is who it claims to be) and that messages to/from the web server cannot be read by eavesdroppers.
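E.g. a minimal sketch in Python of a client checking a server's certificate over an HTTPS connection (port 443), using the standard library; the host name is illustrative and the script needs network access:

    import socket, ssl

    ctx = ssl.create_default_context()  # verifies the server certificate chain
    with socket.create_connection(("example.com", 443)) as raw:
        with ctx.wrap_socket(raw, server_hostname="example.com") as s:
            # The handshake fails if the certificate is invalid or mismatched.
            print(s.version(), s.getpeercert()["subject"])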
Flexibility: The ability to reconfigure a system with new interfaces, new rules.
Maintainability: The ability to analyse, then correct or enhance, a system.
Portability: The ability to move a system from one platform to another, or convert it to run on another platform.
Extensibility: The ability to add new services or functions.
These requirements have a major impact on how much business logic can be abstracted into parameters modifiable at run-time rather than “hard coded”.
For example, for an embedded system inside an engine, there are typically no requirements for new interfaces, rules or data.
This allows the architecture to be optimised for the prime characteristics of performance and resilience.
But what if the manufacturer has a range of such devices (say for different engine types) and expects to develop them over time?
Their Software Development Kit may contain flexibility capabilities which a pre-compiler removes when it optimises into the target (production) code.
A very different example is the enterprise solution for a multi-national telecoms company.
It needs different interfaces in different countries, different tariff requirements, etc.
It also has the expectation of rapid change, frequently different across different product lines.
It is often trialling innovations, which may be discarded if not viable.
Design for portability
Conform to open standards (CDMs and IDL/APIs).
Otherwise, design to enable vendor independence.
Agile development and KISS
Whatever the requirement, many solution options can be made to work, but look first at the simplest.
More often than not, simpler designs are easier to maintain.
The simplest design may not be configurable, but it is easier to refactor when new requirements arrive.
The agile philosophy is that large projects to build complex software systems tend to fail.
Rather than set out to build a complex system, it is faster, cheaper and better to build a simple solution, and learn from that.
While developing the simple solution, much is learnt about what kind of flexibility is required in a more complex solution.
Centralisation v decentralisation
The maintainability requirements of the target have a major impact on how “centralised” the configurable aspects of the solution must be.
In this context, centralisation can be either real or virtual.
A virtual implementation might provide a management tool for viewing and creating configuration profiles centrally, but then automatically communicate them to and from the distributed instances of the elements of the configuration throughout the architecture.
An example of a truly decentralised maintenance architecture is a typical LAN/WAN design.
Each individual firewall, router, proxy and controller has its own configuration profile, which is viewed (and altered) independently of each of the others.
In this case, there is no automated means to correlate any of the profiles, so any discrepancies between profiles must be identified and corrected manually.
An example of a virtually centralised maintenance architecture is the control plane (for managing network routing and accounting control) in a Next Generation Network or Software-Defined Network, where the maintenance of a network is centrally viewed and managed, but is implemented in key controllers across the network infrastructure.
Socio-economic factors
Consider skills availability in the industry/region, the longevity of the tools/techniques/technologies, and the absolute costs.
All have a major impact on the tools, techniques and technologies used in constructing the target solution.
Usability: The ability of actors to use a system with minimal effort.
Measures include PLUME:
· Productivity: tasks completed in a given time.
· Learnability: how much training is needed to reach a proficiency level.
· User Satisfaction: scores given by users.
· Memorability: how long it takes to forget how to use the system.
· Error Rates.
Interoperability: The ability of systems to exchange data using shared protocols and networks.
May embrace “integratability”: the ability of interoperable systems to understand each other, which requires either common data types or translation between data types.
Footnote: Creative Commons Attribution-No Derivative Works Licence 2.0
Attribution: You may copy, distribute and display this copyrighted work only if you clearly credit “Avancier Limited: http://avancier.co.uk” before the start and include this footnote at the end.
No Derivative Works: You may copy, distribute and display only complete and verbatim copies of this page, not derivative works based upon it.
For more information about the licence, see http://creativecommons.org