This booklet is published under the terms of the licence summarized in footnote 1.
This short booklet serves as a general introduction to two more substantial volumes of papers on entity modeling and event modeling, which contain many more specific analysis patterns and analysis questions. If you feel the lack of diagrams in this book, you will be more than compensated by the large number of diagrams in later volumes.
But the software industry is a broad church. Church members who want to get married need to be aware that the various branches of the modeling religion hold to different articles of faith.
Modeling languages have waxed and waned. Modeling itself waxes and wanes.
A big challenge is to reconcile the Agilists’ view that code and tests are all that matters with the more traditional view that models are essential.
The data model is perhaps the most universal and powerful artifact in enterprise application development. Walk around any office where an enterprise application is being developed, and if you see only one diagram on people’s desks, it is likely to be a data model or database structure.
It is fashionable to speak of analysts building a conceptual model or domain model, but I believe this is unwise because it encourages people to build fanciful models. Let me be absolutely clear: if there is to be a database, then you should start defining a plain old-fashioned logical data model as early as possible.
Moreover, I prize the remark by Alistair Cockburn in the reference below about data modelers coming up with "conceptual models" as well as, if not better than, OOP modelers using CRC cards.
There are good reasons why this is so. Data modelers construct better entity models because they draw on a body of knowledge and experience about data structures that is not available to people trained only in OOPLs and object-oriented design.
Some software engineering gurus claim the OO principle of encapsulating data behind process means we should design processes before data. This is rather to miss the point of the enterprise database, which is to serve as wide a variety of processing requirements as possible.
It is wise, in the right circumstances, to apply Agile development principles, to program iteratively and refactor data structures as we go along. But are the circumstances always right? Most enterprise applications require a substantial database.
· How Agile should we be in our efforts to define the system’s database structure?
· How far is it advisable to start coding programs (perhaps realising just one use case) before the database structure is reasonably stable?
· Can the structure of a database that contains persistent data be refactored as readily as the structure of processing code?
In exploring these questions, I will propose three principles as being good general advice. I will also suggest that using one modeling notation (the UML class diagram) for different techniques, for different purposes, may be hindering teachers and students rather than helping them.
Given your project requires substantial database development, the system to be built will feature two structures: a database schema and a processing schema. The database schema is based on (is even) a data model. The processing schema may take the form of a class model in an OOPL implementation, or a modular structure in a procedural language implementation.
· Agilist: You are well advised to have your code schema and your data schema based on the same design model.
I’d put it differently: one depends on the other; the processing schema depends on the data schema. Several reasons have been proposed for the precedence of data structure over process structure.
· Data structures tend to be relatively stable compared with processing structures. New requirements usually need new processes, but often need no change to the structure of the persistent data store.
· The data items that must be stored for long term use tend to be relatively indisputable, and there are only a few plausible structures in which these data items can be grouped, whereas there are very many plausible ways to structure the processing.
· The structure of a database that contains persistent data cannot be refactored as readily as the structure of processing code, because you have to migrate all the persistent data (both live data and test data) from the old schema to the new one.
· You are well advised to design the processing schema (or at least part of it) around an earlier-defined data schema. But you are not so well advised to design the data schema around an early-defined processing schema; because it is simply harder to get it right that way around.
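To make that last point concrete, here is a minimal sketch, in Python with SQLite and entirely hypothetical table and column names, of what refactoring a populated schema really involves. Unlike code refactoring, every persistent row must be migrated from the old structure to the new one:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO person (name) VALUES (?)",
                 [("Ada Lovelace",), ("Alan Turing",)])

# Step 1: create the new schema alongside the old one.
conn.execute("""CREATE TABLE person_v2 (
    id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT)""")

# Step 2: migrate every persistent row (live data and test data alike).
for pid, name in conn.execute("SELECT id, name FROM person"):
    first, _, last = name.partition(" ")
    conn.execute("INSERT INTO person_v2 VALUES (?, ?, ?)", (pid, first, last))

# Step 3: retire the old structure.
conn.execute("DROP TABLE person")

rows = list(conn.execute(
    "SELECT first_name, last_name FROM person_v2 ORDER BY id"))
print(rows)  # → [('Ada', 'Lovelace'), ('Alan', 'Turing')]
```

Even this toy migration needs a data conversion rule (how to split a name); a real one needs conversion rules, downtime planning and regression testing for every affected program.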
This is not idle theory. I worked on a project in a New York bank where the 20 Java developers tried to complete their code schema before the 2 database administrators had produced a stable Oracle data schema. The ship sank because the Java developers did not consider spending a half penny on data analysis and data modeling.
One way or another, developers have to reconcile data used in different I/O channels through use of common data definitions. They have to recognise and deal with the structure in which persistent data is stored. OO programmers have to map their OOP classes to data model entity types. Procedural developers have to include data manipulation statements that read and write data model entity types.
But the same is not true the other way around. Systems analysts and/or database designers do not have to look at the process structure to define their data structure. They certainly should consider the most important business services and their access paths. But they can and should analyse and specify these business services and access paths at a level higher than programming. The natural access path for a business service through a data structure is not determined by program design or language choices.
· Agilist: There are other ways to approach modeling. If my team consists mostly of OO developers, typical of modern projects, then they are more likely to be more familiar with OO models (such as those of the UML) than they are with data models.
Yes. The modern developer’s education is often narrowly object-oriented. Many OO developers are not taught data modeling. I want us to lift those OO developers out of their mono-paradigmism rather than accept it. I realise we have to march alongside the army of developers whose ideas about analysis and design come only from OOP training courses, but the industry needs people like us to counterbalance their training.
Also, we should teach people to distinguish modeling technique from modeling notation. It seems to me that using UML class diagrams as the notation for teaching data modeling encourages developer students to view data modeling as a limited form of program structuring technique rather than a distinct data structuring technique.
· Agilist: Regarding technique, the UML User Guide devotes 6 pages out of 482 to persistence issues and the advice boils down to applying a few stereotypes.
Indeed. And the Unified Process does not mention data modeling. That surely encourages OO developers to think data modeling is easy, and perhaps that they know all there is to know about it. They won’t seek education and training in data analysis and data modeling techniques. Serious education in data modeling needs a manual of its own and two or three days training.
We are well advised to “model with a purpose”. So what is the purpose of a data model?
· to capture required business terms and facts and other business rules (sometimes called constraints and derivations).
· to define what data must persist for the required business services to be completable. The business services (aka units of work) are what clients, in the broadest sense of that term, require of the back end of the system to be built.
Both these purposes are unaffected by our choice of programming language (COBOL, SQL or Java) or program design paradigm (procedural or OO). Both goals can be met without reference to a structure of OOP classes, or to a structure of procedural program modules.
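Both purposes can be illustrated with a minimal sketch, using SQLite and hypothetical customer/order tables. The terms, facts and rules live in the data structure itself, with no reference to any processing schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL                         -- business term
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer, -- business fact
    amount      REAL NOT NULL CHECK (amount > 0)      -- business rule
);
""")
conn.execute("INSERT INTO customer VALUES (1, 'Acme Ltd')")
conn.execute("INSERT INTO orders VALUES (1, 1, 250.0)")

# The rule is enforced by the data structure, not by any process.
rejected = False
try:
    conn.execute("INSERT INTO orders VALUES (2, 1, -5.0)")
except sqlite3.IntegrityError:
    rejected = True
print("negative amount rejected:", rejected)  # → True
```

A COBOL, SQL or Java team could implement processes over this structure without the structure changing.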
The following are not the primary purposes of drawing a data model:
· the specification, design or modularisation of processes,
· the specification or design of OO classes, operations or interfaces,
· the specification of interface inheritance (enabling polymorphism of operations),
· the specification of implementation inheritance (enabling reuse of operations),
· the specification of interfaces.
All of those things belong rightly elsewhere, in structural and behavioral modeling techniques familiar to program designers and OO class modelers. Of course the data model feeds into all those things - but it is not those things - it is first and foremost a data structure.
· Agilist: Regarding notation, I choose the documentation format that best suits my audience. I can achieve the same ends as a data model with a UML class diagram.
Yes you can draw your data model using a UML tool. You will have to put up with some limitations of the notation and the tool as regards data modeling. More importantly, I suspect that using UML class diagrams for data modeling may lead your students to confuse data design techniques with program design techniques.
I believe that good data analysis and data modeling practices, techniques, knowledge and skills are being lost because data models are now presented as a minor branch of UML class diagrams and the OO paradigm. I suspect that some of the refactoring that goes on is no more than correcting mistakes that somebody trained in data modeling would not have made 10 or 15 years ago.
I am not talking here about highly abstract data models or domain models. I am talking about efforts to define a concrete data structure, one that can stand as the basis of the system to be constructed, one that customers will accept as sufficient for live operational use.
Even such a concrete and purposeful data structure is developed through stages of analysis and design. Using a UML class diagram at both stages may hinder students’ understanding of the distinction between analysis and design.
· Agilist: I can distinguish analysis and design. I draw a domain model in the form of UML class diagram. Just because this analysis-time diagram indicates that "Employee" is a domain concept that doesn't mean my design-time model will have an Employee class (or an Employee table for that matter). I may choose to apply the Party pattern, or do something else in the design model.
Some, David Hay for example, would have started the analysis with Party then refined this by specialisation to Employee, so would not agree with your division between analysis and design. However, regardless of whether generalisation or specialization comes first, I guess we can agree that analysis precedes design.
I believe the OO paradigm has exaggerated a long-standing confusion of design with analysis to such an extent that almost nobody has a reasonable grasp of the distinction, so here is one.
The analysis principle
Analysts find things out, define activities in the business domain, distil a mess of functional requirements into discrete elements, and define essential business rules.
The design principle
Designers define data and process structures to the bottom level of detail and precision - to be implemented using specific technologies - to meet not only functional but also non-functional requirements.
By this definition, data normalisation is an analysis technique, so a normalised data model is a product of analysis. And data denormalisation is a design technique.
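As a toy illustration of what normalisation achieves during analysis, consider hypothetical employee and department data. The flat structure repeats one fact (a department's name) many times; the normalised structure stores each fact once:

```python
# Unnormalised: the department name is repeated for every employee.
flat = [
    {"emp": "Ann", "dept_id": 10, "dept_name": "Sales"},
    {"emp": "Bob", "dept_id": 10, "dept_name": "Sales"},
    {"emp": "Cy",  "dept_id": 20, "dept_name": "Audit"},
]

# Normalised: each fact is stored exactly once.
departments = {row["dept_id"]: row["dept_name"] for row in flat}
employees = [{"emp": row["emp"], "dept_id": row["dept_id"]} for row in flat]

# Renaming a department is now one update, not one per employee.
departments[10] = "Field Sales"
print([departments[e["dept_id"]] for e in employees])
# → ['Field Sales', 'Field Sales', 'Audit']
```

Folding the department name back into each employee row, perhaps to speed up reporting, would be denormalisation: a design decision taken later, for non-functional reasons.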
I propose that whereas a data model is first and foremost a product of analysis, a processing schema is more a product of design, since it is built using the results of the earlier analysis, and it is designed to be implemented using specific technologies and to meet non-functional requirements.
Yes, my analysis-design distinction is soft at the edges. I do believe however in the proposition that data schema naturally precedes processing schema.
· Agilist: I can choose to do analysis (I might call it domain modeling or logical modeling) with a data model, with a class model, with CRC cards, or with other types of artifacts. I don't have to limit myself to just one kind of model. This is why 'Multiple Models' is such a critical principle. A good modeler is flexible enough to choose the right artifact for the situation. Not so good modelers have a very small number of tools in their intellectual toolkits.
Yes. Systems analysts should use several techniques and produce several kinds of document. But given their project involves developing a database, they really should always draw a data model. Calling it a “domain model” seems unhelpful sophistry to me, and tends to encourage airy-fairy modeling of the kind I sought to exclude above. Let our systems analysts be clear; their data model is a data structure, it should prefigure the database schema.
Who (which person/role) documents a model and what tools they use is certainly open to debate. If your developers can define the data model using the database tool, that may be good enough. Though not all developers make good systems analysts. And I guess Oracle wouldn't have developed Oracle Designer if they thought their DBMS was good enough as a CASE tool.
However, my aim here is to distinguish the purposes and patterns of data models from the purposes and patterns of several other products analysts and designers might document. Let me turn to the data structure refactoring question.
At the “structured” extreme, you can set out to define a data model that will fully meet requirements before you start coding programs. This delays your first working code. It means accepting the risk that you may get the data model wrong, and may have to substantially reengineer the system when users revise their requirements through experience of the working system.
When to do this? If you have a clear idea of what business services clients require, then it should save you time and money to get your data model close to complete before coding.
· Agilist: Just don't carve it in stone and just don't waste time implementing sections of it that you don't need right now. Just in time (JIT) is a critical concept here.
Yes. And some database refactoring is clearly necessary. I am less happy about encouraging people to start coding before a decent attempt at analysis - that was the very problem that data models and the like were introduced to prevent.
At the “Agile” extreme, you can define a small part of the data model, enough to support one use case perhaps, and start coding programs to this data structure. This means you will certainly have to accept the cost of incrementally refactoring the data structure and the code, as the data structure is refined and extended.
When to do this? If you don't know what the requirements are, or if they are volatile, then you will probably find the best way to determine the required business services is by getting down to coding (prototyping) some use cases and proceeding via iterative development.
So, the first thing to do is estimate the degree to which the required business services are pinned down by the given requirements. The less the degree, the sooner that turning a first-cut data model into a database and prototyping the system will pay off.
But that's not the end of the story. See my other books.
Can we state some general principles? We can't say you must start the data model before the use cases, or complete it up front before coding. We can't say analysts should focus exclusively on the data model, or stop after the data model. We don't suggest designers should build an OOP class model without looking at the data model.
However, I urge you to consider these three principles:
· Agilist: Better to say something like start the "domain model early" then let people pick the artifact that best suits their situation.
No. I mean, precisely, the data model (whatever the notation) rather than a domain model. As I understand it, the OO concept of a domain model includes object-oriented design features that are irrelevant to the two purposes of a data model. Thinking about such things as aggregate entities, inheritance and interfaces (not to mention a grandiose pretension to modeling 'real-world' objects) gets in the way of straightforward data modeling.
· Agilist: Better to say something like "understand/explore the requirements as best you can"
No. Requirements is too loose a term for what I mean here. I mean business rules capture. The relationship between business rules and requirements is a subtle one, another issue for another time.
· Agilist: But if you're using OO technology that will be on your mind when you're designing.
That is unfortunate. You should be telling your developers to get that OO technology out of their heads when drawing a data model. It does not help. It can only hinder. That is my main point. And if using UML is encouraging you to think about OO technology while you are data modeling, then that is a convincing argument for using a distinct data model notation.
You can use UML class diagram notation for data models. But I am here taking a stand for the data model as a distinct product - with its own purposes and patterns, and perhaps even some shades of semantics different from an OO structural model.
A data model is not an OOP design - it is a data structure, albeit one which can and should be annotated with business rule definitions.
· Agilist: Is data structure enough? Perhaps "responsibility structure" is what is really needed? At least on OO development efforts.
It is enough for the purposes I set out with. The data model starts in analysis rather than design because it distills requirements and captures essential business rules. It is needed on the majority of enterprise application projects. It is not affected by the choice of an OOPL.
What you need to do for OOP (as opposed to what you do for procedural programming) is, by definition, a matter of program design rather than systems analysis.
· Agilist: It's good to distinguish between requirements, analysis, and design but difficult to do in practice.
Yes. However I am sure you agree it is a good idea to start the data model (though you may call it a domain model) during requirements analysis.
· Agilist: I will however model triggers and stored procedures on a physical data model for a relational database (see footnote).
Great! Now abstract a little from those triggers and processes. What are you talking about? You are talking about discrete events and the business services they trigger on the persistent data. The same events and business services are required regardless of your programming paradigm. So, you can happily and productively specify event parameters, and the pre and post conditions of business services without knowing anything about object-oriented design and programming - and I wish you were teaching this!
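Here is a sketch of what I mean: one hypothetical business service specified only as event parameters plus pre- and post-conditions on the persistent data. Nothing in it depends on a programming paradigm; a COBOL or Java team could implement the same contract:

```python
def withdraw(accounts, account_id, amount):
    """Business service: Withdrawal(account_id, amount)."""
    # Preconditions on the persistent data
    assert account_id in accounts, "account must exist"
    assert amount > 0, "amount must be positive"
    assert accounts[account_id] >= amount, "balance may not go negative"
    before = accounts[account_id]
    accounts[account_id] -= amount
    # Postcondition: the balance is reduced by exactly the amount
    assert accounts[account_id] == before - amount

accounts = {"A1": 100}
withdraw(accounts, "A1", 30)
print(accounts)  # → {'A1': 70}
```

The account names and the rule that a balance may not go negative are invented for illustration; the point is that the contract is stated against the data structure, not against a class model.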
Brief notes on a tool for assuring data model quality. Looking recently at <http://citeseer.nj.nec.com/465756.html> I was reminded of the following.
A triangular relationship in a data model is not wrong, is not itself a measure of low quality. But a triangular relationship that has not yet been analysed is a direct indicator of low quality. So how to make sure the quality assurance question is asked? And how to measure that?
Precondition: your ERWin data model contains three triangular relationships, including one linking organisation, department and person
YOU: click a button to run the tool against your data model
TOOL: lists all patterns in your data model, reports e.g. there are 3 triangular relationships
YOU: select the first triangular relationship
TOOL: shows the three-entity subset of the data model and asks something like "Do you always find the same set of persons via both
· the direct organisation-person relationship and
· the indirect organisation-department-person relationships?"
YOU: click "Yes"
TOOL: asks "Then would you like to remove the long direct relationship thus?"
YOU: click "Yes"
TOOL: transforms the schema accordingly.
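The pattern-recognition step behind such a tool can be sketched as a small graph algorithm. The entity and relationship names below are hypothetical; a triangular relationship is simply three entity types that are pairwise related:

```python
from itertools import combinations

# A schema modelled as undirected relationship edges between entity types.
relationships = {
    ("organisation", "department"),
    ("department", "person"),
    ("organisation", "person"),   # the direct relationship under question
    ("person", "address"),
}

def triangles(edges):
    """Return every trio of entity types that are pairwise related."""
    entities = {e for pair in edges for e in pair}
    norm = {frozenset(p) for p in edges}
    return [trio for trio in combinations(sorted(entities), 3)
            if all(frozenset(pair) in norm for pair in combinations(trio, 2))]

found = triangles(relationships)
print(found)  # → [('department', 'organisation', 'person')]
```

The real tool then pairs each found pattern with the analysis question to put to a human; the pattern itself only indicates that the question has not yet been asked.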
Data model quality (including schema transformations) was one of my consulting specialities from 1980 to 1995. The other related speciality was behavior analysis of database entity types using finite state machines to express constraints and state changes.
The fruits of my consultancy research included a catalogue of about 20 data model patterns. The triangle is a simple and well known pattern. The richer patterns involve 4, 5 or 6 entity types. I catalogued the patterns and associated schema transformations in a book. I never took the book to a publisher, partly because I wanted to sell the tool described below, and I wanted the tool to get a head start in the market before somebody built a rival (they never did, I think).
Two of us (me as analyst, other as developer) spent a year or so building a sweet data model quality tool to
· 1 - read a schema (in the form of ERWin data model)
· 2 - catalogue all the recognisable patterns* in the schema (there are always many)
· 3 - show each pattern, along with questions the analyst needs to answer
· 4 - read the answer and propose a transformed schema to the analyst
My colleague was a superb CASE tool developer. He was able to both develop the pattern recognition algorithms and embed them in a sweet little package. He made a demo accessible via his web site. However, having both worked for niche CASE tool vendors in the past, neither of us wanted to risk giving up our salaried jobs to market the tool, so we never made a sale.
Our data model quality tool still exists. If you are interested, I'll ask my ex colleague if he wants to refresh his web site and make a demo version available to you. Email me.
To be honest, I had forgotten about the tool until I noticed the Petia Assenova reference at the URL above. I was an assessor for Petia's thesis and visited her at
Petia's schema transformation patterns were basic - we had many richer ones embedded in the tool. However, her work made more explicit to me the fact that all schema transformations are reversible - an idea I took on board in our tool development.
A couple of years back I looked at Scott Ambler's database schema transformations, but they are at a lower level of granularity, some within-table stuff, and relatively unexciting to an analyst. I couldn't get Scott interested in the broader (usually earlier) systems analysis questions that lead a schema to be transformed in more dramatic ways.
P.S. I have a similar catalogue of patterns revealed by entity life histories in finite state machine diagrams, along with the questions and behavioral schema transformations the patterns suggest, but I have no expectation of finding people interested in those!
Aside: We define a distributed enterprise application as one where we cannot assume the platform provides a roll-back service, usually because discrete data stores can only be co-ordinated via long-distance messaging. It is possible here to regard the processing of a discrete event as a backtracking problem. So preconditions turn into “quit” conditions. You “posit” the event will succeed, “quit” if any precondition is found to be true, and introduce a whole lot of processing to handle the “side-effects” of switching to the “admit” case.
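The posit/quit/admit idea in the aside can be sketched as follows. The steps and their compensating actions are hypothetical; the point is that with no platform roll-back service, "side-effect" handling must be coded explicitly:

```python
def process_event(steps):
    """Posit the event will succeed; quit if any precondition fails.

    Each step is (do, undo, precondition). On a quit, run the
    compensating 'undo' actions for the steps already done.
    Returns 'admit' or 'quit'.
    """
    done = []
    for do, undo, precondition in steps:
        if not precondition():
            for _, u, _ in reversed(done):  # handle side-effects
                u()
            return "quit"
        do()
        done.append((do, undo, precondition))
    return "admit"

log = []
steps = [
    (lambda: log.append("reserve stock"),
     lambda: log.append("release stock"),
     lambda: True),
    (lambda: log.append("take payment"),
     lambda: log.append("refund"),
     lambda: False),  # this precondition fails, forcing a quit
]
result = process_event(steps)
print(result, log)  # → quit ['reserve stock', 'release stock']
```

Notice how much extra processing the "quit" path demands compared with a platform that simply rolls back a transaction.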
Martin Fowler has commented on ‘completers’ who feel obliged to complete a specification or model, whether before or after they start coding. Agile modelers never feel obliged to complete a model; they do just enough. This principle applies to all kinds of model that the analyst may specify, with the possible exception of a data model.
Completers ought to dwell on
· Some Gartner report found that only 16% of the code in a typical application is covered by the functional specification. Moreover, the report suggested that this is probably about as much as you can realistically expect.
· Some analysts don't realise how far short their specifications fall of the completeness, detail and precision required for coding. Designers and developers fill in the gaps through a mix of guesswork, conversation and undocumented analysis.
· Some Agilists argue that much initial requirements specification is better frozen or thrown away than maintained with the code. How many projects maintain the data model after the database is designed? Is the enterprise really going to maintain all those business process models and business rule specifications once the system has been coded?
· When a project hands over to application maintenance (even before then), some kind of change/defect log tends to take over from the initial requirements catalogue. Change requests are quite different from initial requirements because they can and do refer to concrete features of the live system.
I confess to a former life as a completer. I still don’t like to let go of my models. But I realize that accurate completers are rare, and completion of models can be a trap. It is worth recalling some of the reasons and ways in which methodologists have proposed we complete models before coding.
People often mistakenly compare software engineering with mechanical engineering. It may be true that the plan for a Boeing 747 has to be, and is, completed down to the last detail, down to the dimensions and materials of the tiniest nut and bolt. See ref. 1 for discussion of the differences between aircraft engineering and software engineering.
These differences combine to mean that once the code has been developed, it is always costly and generally impractical to maintain parallel specifications in excruciating detail.
Top-down decomposition was promoted in the 1970s. The idea was and is to divide a system into a few high-level functions (whatever 'function' means), and by successive refinement, keep dividing until you arrive at program modules, classes or operations that can be readily coded.
Top-down decomposition or successive refinement ran into numerous practical and theoretical problems.
1. It confuses business analysis and programming.
2. It confuses a process decomposition hierarchy and a process invocation hierarchy.
3. The first refinement is usually wrong. High-level abstractions are mostly wrong when first conceived, unless they be vacuous generalisations that say nothing specific about the business at hand.
4. The final refinement is always wrong; it needs refactoring. An invocation structure of a software system is always a network rather than a hierarchy.
5. Refinement cannot be done systematically. Successive abstraction seduces people into thinking that successive refinement can be as systematic. Yes, you can create a higher-level abstraction from a lower-level specification in a mechanical way that leaves a perfect mapping between the two. No, you cannot generate a lower-level specification from a higher-level one, unless it be to plug in predefined infrastructure you already knew would be needed and created a socket for.
6. The degree of abstraction or refinement is implausible. Successive refinement, or at least successive abstraction, looks plausible when you consider only data definitions. The expansion ratio from number of entities in a conceptual model to number of tables in a database is small enough to manage. But successive abstraction tends to fall down when you consider process and other facets of enterprise architecture. The expansion ratio from one sentence in a business objective to the lines of software code that implement the objective may be one to ten thousand. So working back from lines of code to the business objective requires truly drastic abstraction.
7. The kind of abstraction or refinement is obscure. By what rules does a higher level specification abstract from a lower level one? Is it by composition? By generalisation? By omission of detail? You can't have it all ways and hope to maintain mappings from lower level elements to higher level elements, let alone generate the former automatically from the latter.
The idea may have worked in classroom examples; I never saw it work in practice. To quote the caption of the cartoon popular in those days: "a miracle occurs" somewhere between high-level business functions and low-level programming modules.
· Data Modeler: I think a function hierarchy is a very good way to quickly grasp what is going on in a business. It is still one of my most important tools. Note here that I mean a business function hierarchy, describing the nature of a business. By definition, functions do not include any references to mechanisms.
Sure, but one elementary product of business analysis (a business process step or use case) may require 100 software components. You don't take that function hierarchy down to program code do you? Given the level of abstraction from enterprise objective to coded statement can be one to ten thousand, the continual hierarchical refinement idea just seems implausible.
A top-level “context diagram” is good. It shows, for a system, the external entities and the major input/output data flows. People may draw such a diagram using UML use case diagram notation, or a DFD at level zero.
A single bottom level DFD can be useful (much as any system flowchart can be useful). Though there are question marks over the notation. How to represent tiny I/O data flows between a keyboard and a screen? How to stop people drawing flow charts or reading an input data flow as triggering the target process? How to... Let me stop there.
A set of several bottom level DFDs can be useful. Though how to focus one diagram within a set is a challenging question. Do we focus on the processing of a data store? on the process related to an external entity? on the consequent process of an input data flow? on the precedent process for an output data flow? Or should we show each of those views in different diagrams?
The big difficulty is how to draw and maintain a hierarchy of DFDs, with the top, bottom and intermediate levels. What is a data flow on a higher level diagram? is it one of the data flows at a level below? is it a composite of several? is it a selection between several? How to decide which bottom-level details do not appear at the next level up? small ones? rare ones? It was in practice simply too difficult to maintain consistency between diagrams, to ripple the effects of structure changes up and down, and sideways between related diagrams.
I think nobody ever worked out how to maintain a hierarchy of “wiring” as you call it. Because the theory was unsound in the first place. It sounds like you drew DFDs as sketches that amplify what you drew in your function hierarchy. You could also use the DFD notation to draw honest-to-goodness system flow charts. Those approaches are OK. Beyond that, rather you than me.
I describe my life as a completer and teacher of models in “The Agile Architect”. Consulting on real projects eventually forced me to give up expecting other people to complete such models.
· Data Modeler: Are you talking about entity life histories, here?
· Data Modeler: These strike me as a really good idea, but I haven't found a CASE tool that supports them, and they are undoable without it.
Yes. The superb CASE tool I used (with automatic transformation of finite state machine diagrams into skeleton interaction diagrams) is no longer available.
· Data Modeler: I have a presentation that I give, and making a consistent example is excruciating.
Yes, though getting a model right is only about as excruciating as getting code right.
The fact is that experienced developers find it easier to complete Java code than to complete a UML model of the same code. I suspect many Model-Driven Architecture enthusiasts do not realise the modeling effort that generating code from models implies. See “Model-Driven Analysis”.
With rare exceptions, enterprise data models have failed. They are either too big to be manageable or too generic to be useful. The enterprise has to focus instead on the narrower set of data and rules shared by discrete applications. Most people are now focusing down on the priority data integration issues and/or enterprise-wide business rules.
· Data Modeler: Well, I'm still trying to build enterprise data models, and clients are still hiring me. Although my examples aren't yet completely enterprise wide; they tend to address a particular division.
With the exception of data models, most models are never completed. I don’t expect the future will bring encouragement to anybody who wants to draw and maintain a full model, be it a complete top-down process decomposition hierarchy, a hierarchy of Data Flow Diagrams, a complete set of entity life history diagrams, UML diagrams corresponding to implemented code, or an enterprise-wide data model.
Reverse engineering models from code doesn’t really count, since the first thing you want to do is erase some of the completed detail.
It is difficult to relate thousands of detailed design elements and coded components to models and specifications. And it is costly to do it in a way that is maintainable and auditable (as CMMI requires).
You can however:
· trace from coded components to system test cases
· trace from system test cases to elements of whatever abstract model is maintained.
Plausible mappings include: requirement to/from use case, business process step to/from use case, use case to/from business service, use case and/or business service to/from system test case, system test case to/from coded component.
I wouldn't even try to map from detailed design elements (say 2,000 Java classes) directly to requirements. If the design elements are tightly bound with coded component by a CASE tool, then the code to test case mapping does the job. If not, the task is too scary to complete.
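The coarse-grained chain of mappings above can be sketched like this; the component, test case and use case identifiers are invented for illustration:

```python
# Component -> system test cases that exercise it.
component_to_tests = {
    "OrderService.java": ["TC-101", "TC-102"],
    "StockDao.java":     ["TC-101"],
}
# System test case -> the use case it verifies.
test_to_usecase = {
    "TC-101": "UC-Place-Order",
    "TC-102": "UC-Cancel-Order",
}

def trace(component):
    """Trace a coded component back to the use cases it supports."""
    return sorted({test_to_usecase[t] for t in component_to_tests[component]})

print(trace("OrderService.java"))  # → ['UC-Cancel-Order', 'UC-Place-Order']
```

The indirection through test cases is what makes the mapping maintainable: test cases are few and stable compared with design elements, so no one has to map 2,000 classes directly to requirements.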
Ref. 1: “Software is not Hardware” in the Library at http://avancier.co.uk
Ref. 2: “Business Rules - an Introduction” in the Library at http://avancier.co.uk
Ref. 3: “Abstraction in all its variety” in the Library at http://avancier.co.uk
Footnote 1: Creative Commons Attribution-No Derivative Works Licence 2.0
Attribution: You may copy, distribute and display this copyrighted work only if you clearly credit “Avancier Limited: http://avancier.co.uk” before the start and include this footnote at the end.
No Derivative Works: You may copy, distribute, display only complete and verbatim copies of this page, not derivative works based upon it. For more information about the licence, see http://creativecommons.org