Data integrity or transaction throughput?
This page is published under the terms of the licence summarized in the footnote.
This paper is a spin-off from the one on modular design and integration, covering OOD, DO, CBD, SOA, WS, EAI, REST, WOA and EDA.
It contains a discussion of reasons why coding a data access layer on a data server might not be such a bad thing after all.
(Reasoned comments and challenges are welcome; you can contact me via the home page.)
Conventional enterprise database applications are usually characterised by the need for data quality.
Typically, business customers, suppliers and employees enter structured business data.
The transaction throughput might be very high, but not staggeringly huge, since it is usually limited by the nature of the business done.
A government department, hospital, school, or clothing retailer can only do so much business.
Data quality usually matters to business people, and their business rules are likely to be data-intensive rather than CPU-intensive.
There are not only basic database integrity rules (foreign keys and mandatory relationships), but also data type checks and constraint rules, such as price < 10,000 dollars; simple derivation rules, such as price = quantity * unit price; and more complex rules that require reading through the rows in a table – say, reading through a customer’s order history.
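The basic database integrity rules above can be sketched in a few lines. This is a minimal illustration using SQLite via Python’s standard sqlite3 module; the table and column names are assumptions invented for the example, not taken from any real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked

conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL                          -- mandatory attribute
);
CREATE TABLE customer_order (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer,  -- mandatory relationship
    price       REAL CHECK (price < 10000)             -- constraint rule
);
""")

conn.execute("INSERT INTO customer VALUES (1, 'Acme Ltd')")
conn.execute("INSERT INTO customer_order VALUES (1, 1, 250.0)")  # passes all rules

# An order for a non-existent customer is rejected by the foreign key rule.
try:
    conn.execute("INSERT INTO customer_order VALUES (2, 99, 250.0)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

The point is that these rules live in the database schema itself, so they are applied whichever program performs the insert.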
Some other kinds of system have to handle a staggeringly huge volume of messages or “big” unstructured data
E.g. a global search engine, some Telco systems, some stock trading systems.
E.g. a global message board or social media web site.
Such systems likely require a highly “scaled out” solution design, perhaps with discrete databases in different locations
Some or all data may be replicated between locations, and inter-location data consistency may be relaxed.
Such systems usually have owners who can throw a staggering amount of money at them.
All business apps support business roles and processes; and all software and database structures are logical.
So the term business logic doesn’t really say much. Let me speak instead of process-centric and data-centric business rules
Some business rules govern the logical sequence of a business process, at some level of granularity.
· The end-to-end flow of a long-term value stream, from order to delivery.
· The shorter flow of a use case in the form of a user interface dialogue.
· The sequence in which the update of “master” data is distributed to places where it is copied.
There are two design patterns for implementing a sequence: fork control (orchestration) and chain (choreography).
The fork pattern features a module that is dedicated to controlling the whole sequence from start to end.
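The fork pattern can be sketched in a few lines. The step names and the order-processing scenario below are hypothetical, invented purely to show the shape of the pattern.

```python
# Fork (orchestration): one controller module owns the whole sequence.
def validate_order(order):
    order["validated"] = True
    return order

def reserve_stock(order):
    order["reserved"] = True
    return order

def dispatch(order):
    order["dispatched"] = True
    return order

def orchestrate(order):
    """The controller calls every step in turn; the steps know nothing
    about each other. (In the chain/choreography pattern, by contrast,
    each step would invoke the next itself, and no controller exists.)"""
    for step in (validate_order, reserve_stock, dispatch):
        order = step(order)
    return order

result = orchestrate({"order_id": 42})
```

The trade-off: the controller is a single place to read and change the sequence, but it is also a module that must be deployed somewhere, which is the placement question discussed later in this paper.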
Every step in a process can have preconditions (true before the step) and post conditions (true after the step).
Many preconditions and post conditions are well described as data-centric business rules.
These may be divided into validation or constraint rules, and derivation rules.
A constraint rule tests that the value of a data item is valid.
· Price is < 1,000 dollars.
· The customer number represents a customer who has not been banned.
· A product has no outstanding orders (so can be deleted or archived).
A derivation rule calculates the value of a data item.
· Price = order quantity * unit price.
· Price = (order quantity * unit price) less a percentage discount that is calculated from reading the customer’s order history.
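The second, more complex derivation rule can be sketched as follows. The order_history table, the three-order threshold and the 5% discount policy are all assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE order_history (customer_id INTEGER, total REAL);
INSERT INTO order_history VALUES (1, 500), (1, 700), (1, 900);
""")

def derive_price(conn, customer_id, quantity, unit_price):
    # Simple derivation rule: price = order quantity * unit price.
    gross = quantity * unit_price
    # Complex derivation rule: read through the customer's order history.
    (past_orders,) = conn.execute(
        "SELECT COUNT(*) FROM order_history WHERE customer_id = ?",
        (customer_id,)).fetchone()
    discount = 0.05 if past_orders >= 3 else 0.0  # assumed discount policy
    return gross * (1 - discount)

price = derive_price(conn, customer_id=1, quantity=10, unit_price=20.0)
# 10 * 20 = 200, less 5% = 190.0
```

Note that the complex rule is data-intensive rather than CPU-intensive: its cost is dominated by reading rows, which is one reason for placing it near the data.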
Put yourself in my shoes: the three leading enterprise/IT architects of a medium-sized retail business ask for your advice.
They are running a lean mean IT operation; their mission-critical enterprise applications are based on relational databases.
Much of their “important code” is in the form of stored procedures that apply their business rules to their business data.
They are worried. They tell you:
“People tell us that all business rules belong on a middle tier app server.
We should update to newer technologies – recode all our business rules in J2EE code on an app server.
Also, we are being pressed to look at the latest database management systems.”
What do you recommend?
Do you advise them to migrate to new technologies that will likely incur risks and put their operation and maintenance costs up?
Or advise them to stick with the applications and design strategy they have?
If your business rules are many or complex, then shifting the rules from one layer/tier to another doesn’t reduce their number or complexity
But it might help you with concerns such as data integrity, performance, usability, maintenance etc.
Most software designers will favour encapsulating the database structure in some kind of data access layer.
But where and how should that data access layer be coded? Relevant questions include:
· What is an app server used for?
· What throughput can a data server handle?
· What is a data server used for?
· What language to code business rules in?
· What rationale can we bring to bear on this?
But first a brief recap of modular design history.
For 40 years, the IT industry has continually revisited modular design and integration, its concepts and principles.
In the 1980s, the trend was to physically separate end-user client devices from database servers.
So, modular design now included dividing an enterprise application system between
· modules handling user interface data structures (on client devices) and
· modules handling business data structures (on a mainframe or database server).
In the 1990s, OO programming languages were adopted in enterprise application development.
The first OO software design guru (Bertrand Meyer) considered an object/module as an “abstract data type”.
Meyer took it for granted there would be higher-level procedures that invoke and reuse these objects/modules.
So, the wider system contains both process-centric rules and data-centric rules.
App server vendors and OO evangelists combined to promote deployment of OO software to a middle app server tier.
An app server sits on the server side, between the user interface devices and the database server.
It was a mantra of OO design that all business rules belong in objects on a middle tier app server.
In the 2000s, SOA and WOA reversed many of the early OO design presumptions. (See earlier paper for details.)
Today, architects, from software architects to enterprise architects, use a mix of principles, including pre-OO and post-OO ones.
Remember: every network hop prompts questions about latency, availability and security.
"Every extra tier is a performance hit, extra documentation, extra skills and knowledge to acquire, and extra complexity over time." (A correspondent commenting on this paper)
Suppose your application is a Google that handles a huge volume of enquiries and doesn’t worry too much about data integrity.
Or it processes a wide and fast flowing river of messages (voice, SMS, indicative stock prices).
Or it features CPU-intensive processes (a Fourier transform algorithm, a spinning web page banner).
You will probably look to put the server-side code on what might be called a web server or app server.
Big response time gains can be made by caching as much data as possible on web servers.
A web server cache can greatly reduce the amount of computing resources needed lower down the stack.
Huge message volumes can be handled by scaling out servers (provided most messages don’t need access to the database).
For some purposes, a web server farm is enough; for others, an app server tier is needed.
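The caching idea above can be sketched as a minimal read-through cache with a time-to-live (TTL). The fetch function here is a stand-in for a real database call; the class and the 300-second TTL are assumptions for illustration.

```python
import time

class TTLCache:
    """Answers repeat enquiries from memory within a TTL,
    so only cache misses reach the database tier."""
    def __init__(self, fetch, ttl_seconds=300):
        self.fetch = fetch          # called only on a cache miss
        self.ttl = ttl_seconds
        self.store = {}             # key -> (value, expiry time)

    def get(self, key):
        value, expiry = self.store.get(key, (None, 0))
        if time.monotonic() < expiry:
            return value            # served from cache: no database hit
        value = self.fetch(key)     # cache miss: go to the database
        self.store[key] = (value, time.monotonic() + self.ttl)
        return value

db_hits = 0
def fetch_from_db(key):
    global db_hits
    db_hits += 1                    # count round trips to the data tier
    return f"row-for-{key}"

cache = TTLCache(fetch_from_db, ttl_seconds=300)
for _ in range(1000):
    cache.get("product-42")
# 1000 enquiries, but only 1 database hit
```

The TTL is also the cost of this design: within that window the cache can serve stale data, which is exactly the "data changing underneath it" problem Eric describes later in this paper.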
For very high volume message processing systems, there seems to be a design pattern in which, at a conceptual level:
· the app server tier caches data and acts as a database serving end user information needs.
· the data server acts more as a data warehouse, holding historical data for analytical processing.
In a discussion group, Eric Gufford explained the design of a stock trading system handling indicative and pricing data.
An indicative price is a bid and offer price provided by a market maker for evaluation or information purposes only, not to be interpreted as a firm bid.
The design was described as follows:
The L1 app server cache is in-memory.
Basically (a gross oversimplification) each managed Entity has its own collection of in-flight versions, and there is another collection of all managed entities.
Each time the app server cluster is restarted, the data is read out of the DB and stateful objects are stored in the cache.
All deltas came into the system through the app server, so the local objects were modified in line and persisted afterwards.
There are 4 instances in the horizontal cluster, 2 servers in the vertical, total of 8 nodes.
L2 worked cross-JVM and cross-server, thus providing both HA and DR. It handles all this for you, 'for free', once you wire it into the JPA provider.
L2 caches are cross-JVM as well as cross-server, and a development discipline in their own right.
There are white papers on products like Gemfire and Coherence that explain it thoroughly.
Suffice to say they are also collections of objects where each App server instance communicates over a local socket for the golden copy.
The DB schemas for indicative data were kind of a data warehouse, while the pricing data was more of a transactional DB.
The schemas were modelled on the problem domain, using business terminology, though not quite in 3NF, as we denormalized a little for performance.
The schemas hold data-centric business rules: foreign key constraints, cascading deletes; before/after triggers; transactional tables partitioned on a day basis; constraints for simple things like 0/1 on smallint columns, and the like.
Were some business rules replicated in the app server and data server tiers?
The JPA annotations did replicate the cascades (cascade-all), as well as some length constraints and join relationships (many-to-one, one-to-one, join table, etc), and enumerations were used for various constrained value lists.
Who knows if the above design is right?
Remember: "The beginning of wisdom for a computer programmer is to recognise the difference between getting a program to work and getting it right" M.A. Jackson (1975).
Trouble is: the wisdom needed has increased ten or hundred fold since 1975.
Eric says: There is one chink in the armour for this architecture - there is no way for the app server to know when the data is changing underneath it.
So you try to get all delta processing pushed through the app server directly, instead of direct DB manipulation.
Though this can be mitigated to an extent by reducing the TTL to, say, 5 minutes instead of, say, 15 mins.
Some of what Eric wrote above is cryptic to me. I assume the system works as well as is wanted.
But it does seem a very complex design. And I have the impression that much complexity arises from two design decisions:
1) Coding rules on the app server tier as well as the data server tier.
2) Scaling out app servers that contain stateful objects.
So the question has to be: Why were those decisions made? Was it throughput alone? I gather the throughput is in the range 1,000 to 10,000 tps.
How do we know the data server (with suitable caching) could not handle this?
It is conceivable that a data server could handle 10,000 tps; see below for an example, but it depends on the kind of transaction.
Do process-centric business rules belong on an app server?
Remember there are two basic design patterns for a sequence: fork control (orchestration) and chain (choreography).
If you choose the fork pattern, then you have to place the control module somewhere.
If that controller supervises a process spanning several data servers, then you may place it on a higher tier.
That higher tier could be an app server, or possibly a web server.
Do data-centric business rules belong on an app server?
In the late 1990s, OO programmers presumed such business rules belong on an app server rather than a data server.
Some still recommend that; especially those who make a living from app servers, or from writing software that runs on app servers.
Others recommend putting all data-centric business rules in the place where the data is stored: the database.
The former have tended to dominate the argument; so, I’d like to explore the case for the latter.
It is impossible to say how many transactions per second you can get out of unspecified transactions on unspecified servers.
Throughput depends on transaction size, enquiry/update mix, security, number of CPUs, your budget for servers etc.
Even when you have specified the transactions and the servers, you’ll find it hard to find anybody expert enough to predict a number.
And database vendors seem wary of publishing any useful-looking performance figures. So we have to rely on what people find when they test different solution designs.
Read some experts talking http://blogs.law.harvard.edu/philg/2011/01/10/how-many-updates-per-second-can-a-standard-rdbms-process.
One expert reports achieving a throughput of 16,000 transactions per second on a RDBMS for simple but non-trivial transactions.
I benchmarked PostgreSQL 8.3 and Oracle 10g.
Both on a pair of HP blade servers, each with 2x 2.4GHz quad-core Opterons and 8GB, and one 320GB FusionIO PCIe SSD module.
The list price of the Solid State Drives, $10,000+ each, was much higher than the rest of the systems put together.
One node was primary, the second backup, using Skytools walmgr for PostgreSQL and DataGuard for Oracle.
The query mix was 50-50 read-write, and 50-50 using SSD or SSD+HDD tablespaces.
The queries were simple but still representative of our production workload.
We had stringent SLAs in terms of latency, 30ms max response time at the 95th percentile.
PostgreSQL handled roughly 16,000 transactions per second, with 0.9ms latency at 95th percentile, and Oracle was comparable.
Most NoSQL databases are immature for operational deployment, and have serious functional compromises.
They can be appropriate for specific operations, but with fast SSDs, I don’t think they make sense for any but the largest Web 2.0 shops.
Are you thinking $10,000 is a lot of money for a Solid State Drive?
Calculate the cost of 12 programmers working for 12 months (not to mention the associated managers, testers etc.).
Primarily, a data server hosts a database management system, and the code that reads and writes database records.
For retrieving data, the SQL query language is much the same in most enterprise database servers.
However, the same discussion thread includes this report: "relocating application logic and the web server directly to the database server is not a bad idea."
"If transactions are not generated by stored procedures but come from clients,
be ready to spend huge amounts of time for round-trips, context switches and compilation of incoming queries.
Inter-process communication can easily cost more than all other things altogether.
We had a live specimen of a web server + application server with additional network activity + database.
It dispatched enormous SMS traffic from web applications to mobile networks.
It consumed almost 100% CPU of four big Sun boxes.
The same functionality was moved into a single executable without significant changes in business logic.
That executable was running on a single box consuming about 10% of CPU.
The rest of hardware was reused for rotating banners etc.
As the service became more popular, those banners occupied almost all the CPU time of the boxes.
But the CPU consumption of the main executable remained basically the same — about 10% of CPU of one box.
So relocating application logic and the web server directly to the database server is not a bad idea."
What intrigues me about the quote above isn’t the time and cost of inter-process (probably inter-server) communication.
It is the impression that only a small part of the total code base is devoted to business logic – which matches what my intuition suggests.
There is debate to be had about whether what follows applies to the raw database structure, or to an abstraction from it.
Some database specialists like to hide the physical data structure from application developers.
They define a more logical data structure in terms of “views” that application developers can read and write.
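The "views" idea can be sketched as follows. The physical table here, with its upper-cased names and credit limits stored in pence, is a made-up example of the kind of physical quirk a view can hide from application developers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cust_t (
    cust_id   INTEGER PRIMARY KEY,
    name_upr  TEXT,          -- physical quirk: names stored upper-case
    cr_lim_p  INTEGER        -- physical quirk: credit limit stored in pence
);
INSERT INTO cust_t VALUES (1, 'ACME LTD', 500000);

-- The more logical structure the application developers read and write:
CREATE VIEW customer AS
SELECT cust_id            AS customer_id,
       name_upr           AS name,
       cr_lim_p / 100.0   AS credit_limit_pounds
FROM cust_t;
""")

row = conn.execute("SELECT name, credit_limit_pounds FROM customer").fetchone()
```

The view gives the database specialists freedom to change the physical structure without breaking application code, at the cost of maintaining the view definitions.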
Sorry, but I can’t get my head around how to write this paper to cope with all possible design options.
Suppose you do want to code your rule-imposing data access layer on a data server, what language to use?
And how can you store a module in close association with the data item that it constrains or derives?
“A stored procedure (stored in a database’s data dictionary)
- can consolidate and centralize logic that was originally implemented in applications
- is a module which applications can invoke via a call statement
- can contain complex processing that requires several SQL statements
- can call another stored procedure, in a nested structure
- may be coded in various programming languages, for example SQL, Java, C, or C++.
There is a standard specification for Java stored procedures.” After Wikipedia
Oracle (reference lost) says the following:
Java programmers can execute SQL statements directly from Java using JDBC.
However, the syntax for manipulating SQL statements in JDBC is much less intuitive and harder to maintain than writing native SQL.
By combining PL/SQL with Java and making careful choices, you can achieve a higher quality database application yielding better performance.
Oracle PL/SQL, an extension of the SQL language, fully supports SQL data types, alleviating the need to convert between Java types and SQL types.
It is designed and optimized to handle interactions with underlying Oracle database tables.
Using PL/SQL, you can execute a single block that contains multiple SQL statements.
By doing this, you can drastically reduce network traffic between database and application, increase overall application development productivity and better manage applications.
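The batching idea behind a single PL/SQL block can be illustrated in miniature. SQLite runs in-process, so there is no real network here; the round-trip counters are just an assumption-laden sketch of the shape of the saving when a client/server DBMS is involved.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE audit (id INTEGER, note TEXT)")

rows = [(i, f"note-{i}") for i in range(100)]

# Chatty style: one call per row. In a client/server DBMS,
# each call would be a network round trip.
round_trips_chatty = 0
for row in rows:
    conn.execute("INSERT INTO audit VALUES (?, ?)", row)
    round_trips_chatty += 1

# Batched style: all 100 inserts submitted in a single call,
# analogous to sending one block containing multiple SQL statements.
round_trips_batched = 1
conn.executemany("INSERT INTO audit VALUES (?, ?)", rows)
```

One call versus one hundred: the work done by the database is the same, but the inter-process communication cost drops dramatically, which is the point of the quote above.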
Three more references FYI:
· AskTom on "Business Logic - PL/SQL Vs Java - Reg": http://asktom.oracle.com/pls/apex/f?p=100:11:0::::P11_QUESTION_ID:883929178230
· Tony Andrews on business logic in the database: DEAD LINK? http://tonyandrews.blogspot.com/2005/04/business-logic-apis.html
· You may find a database-centric approach to J2EE application development here: http://www.oracle.com/technology/pub/articles/odtug_award.pdf
You want rules so close to the data that nobody who writes higher level code (workflows, ETL, whatever) can bypass the rule when accessing the data.
The closer the rule to the stored data, the more likely that rule will always be applied.
The less opportunity there is for somebody to write code that forgets a constraint rule, or bypasses a derivation rule and so corrupts the data.
Single point of definition (SPOD) is a recognised design principle.
The rule should be coded, definitively, in one module in one place. The nearer to the data, the less likely it will be duplicated by accident.
Databases perform very well. You've bought expensive, sophisticated software capable of managing your data in a fast, reliable, scalable way. Why not use it?
For usability and/or fast response time, you may deliberately duplicate the rule as close to the user as possible.
So the rule may appear in front-end user interface code, on a web server or even on a client device.
Still, to ensure a data-centric rule is applied via every input/output channel, you’ll always need the definitive form of the rule to be placed at the back end.
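A database trigger is one way to place the definitive form of a rule at the back end, so that it fires on every input/output channel, even one that bypasses front-end validation. This sketch uses SQLite; the table and the price rule are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_order (order_id INTEGER PRIMARY KEY, price REAL);

-- The definitive, back-end form of the constraint rule:
CREATE TRIGGER price_limit BEFORE INSERT ON customer_order
WHEN NEW.price >= 10000
BEGIN
    SELECT RAISE(ABORT, 'constraint rule: price must be under 10,000');
END;
""")

conn.execute("INSERT INTO customer_order VALUES (1, 250.0)")  # allowed

# A channel that skipped the front-end copy of the rule
# still cannot corrupt the data.
try:
    conn.execute("INSERT INTO customer_order VALUES (2, 99999.0)")
except sqlite3.DatabaseError as e:
    blocked = str(e)
```

The front-end copy of the rule remains useful for usability and response time; the trigger is there so that no code path can forget it.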
Separating the code for one process between network nodes adds complexity to the overall solution - not to mention response time, costs and risks associated with a longer network path.
The more code is distributed, the more code is needed to shuffle data from one place to another, rather than do anything purposeful.
Separating the code for one process between coding technologies can introduce a structure clash.
If your “rich domain” OO class model and database schema have different structures, then you’ll need to add a data abstraction layer.
Martin Fowler (“Patterns of enterprise application architecture”) noted that much time and cost is absorbed by creating and maintaining object-relational mapping layers.
He also wrote that for most business transaction processing systems, his simple "Transaction Script Pattern" is fine (so a rich domain model is not needed).
If you want to find code that applies a rule to a data item, the most obvious place to look for the rule is in code closely associated with the data item.
You’ll always need database specialists. The technologies used for higher layers of software are more volatile.
To implement a transaction using procedures in the database is “a powerful tool for encapsulating rules and hiding detail” (a correspondent writes).
What if your enterprise’s standard design pattern includes an app server?
Then use it sparingly. Beware that storing state data on the app server makes it harder to scale out the design.
If you do scale out a stateful app server, then you will have to spend time and money addressing the consequences.
Different server technologies and coding languages can affect performance and maintainability.
Even bigger effects can result from the difference between good and bad software design.
I embrace the philosophy of James Rumbaugh.
“It is important not to be dogmatic. Every approach has its advantages and limitations. You must use a mixture of approaches in the real world and not mistake any one of them for truth.”
However, I feel duty-bound to report the following experiences from my time as a reviewer of multi-tier application development projects.
The three most successful development projects were 2-tier solutions; they came in on time and budget, were liked by users and easily maintained.
They were developed separately by unrelated teams in the water industry, the health service, and the rail industry.
The worst failure I observed tried to use a 4GL to code mission-critical high-performance batch processing.
Other notable failures included two projects that were hampered by object-oriented evangelism.
Both projects employed 20-30 Java programmers, yet only one database specialist.
In one – a hotel booking system – programmer time was consumed by the data abstraction layer, debating which OO design patterns to use, distributed programming issues, and complexities at the level of network ports and sockets.
In the other – a banking system - programmers spent their days devising a user interface and debating the “right” structure for their rich domain model.
They paid little attention to the business database, treated it as though it were generic infrastructure, and never got as far as defining a data abstraction layer.
The closest we can get to a purely logical, technology-independent, data structure is an entity-attribute-relationship (EAR) model.
If you can derive an EAR model from analysis of data in system inputs and outputs, then you have a sound basis for a database structure and/or an OO class structure.
If you have a good reason to make either of these two structures different from the EAR model, then do it.
But if you don't have a good reason, then keep the two structures as similar as possible.
Certainly, don't create a structure clash just because you have a fanciful notion that you know how to model the “real world” independent of I/O processing requirements.
Footnote: Creative Commons Attribution-No Derivative Works Licence 2.0 27/05/2014 20:45
Attribution: You may copy, distribute and display this copyrighted work only if you clearly credit “Avancier Limited: http://avancier.co.uk” before the start and include this footnote at the end.
No Derivative Works: You may copy, distribute, display only complete and verbatim copies of this page, not derivative works based upon it.
For more information about the licence, see http://creativecommons.org