Avancier Methods

Information/Data Architecture (with TOGAF & ArchiMate artefacts)

One of more than 200 papers at http://avancier.website. Copyright Graham Berrisford. Last updated 21/05/2017 22:44

Contents

Mainstream EA

Data stores – data at rest

Data flows – data in motion

Footnotes

Mainstream EA

EA emerged in the 1980s to address the need for enterprise-wide analysis and re-architecting of data and processes.

Today, mainstream EA remains about business activities that involve the creation and use of information.

It is about the design and planning of changes to those business activities, and/or to the capture and provision of that information.

Moreover, enterprise architecture (rather than solution architecture) is about doing this at a strategic and cross-organisational level.

To digitise business operations that create and use information, architects must attend to modern standards and technologies.

What follows is the information/data architecture approach in Avancier Methods.

It maps information/data architecture to business architecture and applications architecture.

It groups TOGAF artefacts into views, and includes suggested mappings to ArchiMate.

Bear in mind that ArchiMate is not designed for professional business process modellers or data modellers.

Do “information” and “data” mean the same thing? If not, what is the distinction?

Various distinctions are explained and analysed in this paper “Data and Information”.

Maintaining any of the terminology distinctions (in discussing or writing) is very difficult.

This paper uses “data” as a catch-all; you can replace some or all uses by “information” as you see fit.

Modern data architects speak of being concerned with:

· Data at rest - data stores – their locations, contents and synchronisation

· Data in motion - data flows – their sources, destinations and contents

· Data qualities – data types, standards, confidentiality, integrity and availability.

Data stores – data at rest

Businesses have to store information for future use, in persistent data stores.

TOGAF uses the term data component rather than data store.

There are two interpretations of “data component”:

· A passive data structure, contained in some kind of data store.

· An active data server (an application component) that provides read/write access to a data structure.

Either way, the data component contains a data structure that can be described in terms of inter-related entities.

Data store view

Enterprise architecture is concerned with actors and activities that create and use stored data/information.

Information System View	TOGAF artefacts	ArchiMate viewpoints
Data store view	Conceptual Data diagram (Business Data Model)	Information Structure – conceptual level
	Data Entity/Business Function matrix
	Logical Data diagram
	Application/Data matrix
	Data Entity/Data Component catalog
	Data Dissemination diagram
	Data Security diagram
	Data Lifecycle diagram
Migration view	Data Migration diagram

Define the core entities and events that the business must remember in order to complete its business processes and provide business services.

The word “core” usually implies data that is central to the conduct of business processes: e.g. Customer, Order, and Product Type.

This data is often duplicated in different data stores.

A Conceptual Data diagram (Business Data Model) may be drawn to show relationships between the core entities.

Some find it more practical to simply list the core entities in a business data entity catalogue.

Few if any attributes are specified in this kind of model, which is primarily used to identify data duplication.

Map data entities (in the diagram above) to business functions that create and use them.

You can cluster activities in a Data Entity/Business Function matrix, for example by data created.

The North West corner method sorts the rows and columns of a matrix by clustering them on a shared cell entry, such as “create”.
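
The clustering idea can be sketched in Python. This is a minimal illustration, assuming a small hypothetical matrix in which “C” marks the function that creates an entity and “R” marks a function that reads it. The entity and function names, and the simple ordering rule used here (take the functions in a given order and place each entity under the function that creates it), are assumptions for illustration rather than a definitive statement of the method.

```python
# Hypothetical Data Entity/Business Function matrix.
# Cell values: "C" = create, "R" = read; a missing cell means no usage.
matrix = {
    "Invoice": {"Billing": "C"},
    "Order": {"Marketing": "R", "Sales": "C", "Billing": "R"},
    "Product Type": {"Marketing": "C", "Sales": "R", "Billing": "R"},
    "Customer": {"Sales": "C", "Billing": "R"},
}
functions = ["Marketing", "Sales", "Billing"]  # assumed column order


def north_west_sort(matrix, functions, key="C"):
    """Order the rows so that cells holding `key` run from top-left to bottom-right."""
    rows = []
    for function in functions:
        for entity, cells in matrix.items():
            if cells.get(function) == key and entity not in rows:
                rows.append(entity)
    # Entities that no listed function creates are placed at the bottom.
    rows += [entity for entity in matrix if entity not in rows]
    return rows


def render(matrix, rows, functions):
    """Return one text line per entity, with '.' for an empty cell."""
    return [
        f"{entity:<13}" + " ".join(matrix[entity].get(fn, ".") for fn in functions)
        for entity in rows
    ]


sorted_rows = north_west_sort(matrix, functions)
for line in render(matrix, sorted_rows, functions):
    print(line)
```

After sorting, the “C” cells step down the diagonal, which makes it easier to see which function is responsible for creating which cluster of entities.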

Define the entities and events that an application must remember in order to provide services to other applications and/or business users.

A Logical Data diagram details information to be stored, usually in one database and/or to enable one application.

A logical data model includes not only entities and relationships, but also each entity type’s primary key and other attributes.

Foreign keys may identify the relationship between different entities.

This model is usually normalised so as to minimise duplication of information.

It defines terms and concepts used in a particular business domain.

Map data entities to applications that create and use them.

An Application/Data matrix can reveal overlaps between data maintained by different applications.
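
The overlap check can be sketched in Python. The application names, entity names and CRUD letters below are hypothetical, and treating any create, update or delete usage as “maintains” is an assumption made for this sketch.

```python
from collections import defaultdict

# Hypothetical Application/Data matrix: application -> {entity: usage letters}.
# "C" = create, "R" = read, "U" = update, "D" = delete.
app_data = {
    "CRM": {"Customer": "CRUD", "Order": "R"},
    "Billing": {"Customer": "CRU", "Invoice": "CRUD", "Order": "R"},
    "Order Entry": {"Order": "CRUD", "Customer": "R"},
}


def maintained_by(app_data):
    """Return entity -> sorted list of applications that create, update or delete it."""
    maintainers = defaultdict(list)
    for application, usage in app_data.items():
        for entity, operations in usage.items():
            if set(operations) & {"C", "U", "D"}:
                maintainers[entity].append(application)
    return {entity: sorted(apps) for entity, apps in maintainers.items()}


def overlaps(app_data):
    """Return only the entities maintained by more than one application."""
    return {
        entity: apps
        for entity, apps in maintained_by(app_data).items()
        if len(apps) > 1
    }


print(overlaps(app_data))
```

In this made-up portfolio, only Customer is maintained by more than one application, so it is the candidate for closer attention.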

Map data entities (in the diagram above) to data components that hold them.

Data Entity/Data Component catalog

Map data entities to applications that maintain them, or data components that hold them.

If a Data Dissemination diagram shows there is duplication, look to define a data mastering policy (master and copy) for the baseline or target application portfolio.

Data Security diagram

Data Lifecycle diagram

Data Migration diagram

A physical data model specifies the schema to be used in a particular database.

It may be denormalized to speed up storage or retrieval.

It may refer to features available in the chosen database management system.

Physical data store forms

Data architects are concerned with the forms of matter and energy in which information is stored.

In theory, data architects can design non-digital stores; in practice most focus on digital ones.

Digital data store forms are changing; magnetic disks are currently being replaced by flash storage.

But then, flash is optimised for the asymmetric use cases of mobile devices, where data is written few times and read many times.

So, if you want to find out what may replace flash memory try this:

http://www.computerweekly.com/feature/Whats-wrong-with-flash-storage-And-what-will-come-after

Architects have to research physical data storage forms as the need arises.

Data store schema standards

Data architects define the locations and contents of data stores.

TOGAF’s “physical data component” is a vendor/technology specific realisation of a logical data component.

It could be a database, data warehouse, document store, web information server or transaction log.

It has a technology-specific data schema, designed to suit its purpose, for example:

· Transactional database

· Data warehouse

· Document store

· Big data store

This table maps a purely logical data model to some data store schema varieties.

Logical data component	Physical data component
Logical data model	CODASYL database schema	Relational database schema	XML schema (footnote 2)	OData-compliant web information server.
Entities	Records	Tables	Complex types	Entities
Attributes	Fields	Columns	Contained elements	Properties
Relationships	Address pointers	Foreign keys	Contained elements	Navigation properties
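
The “Relational database schema” column of the table can be sketched in Python. This is a minimal illustration, assuming a hypothetical two-entity logical model in which the first attribute listed is the primary key. The column types are placeholders, and SQLite (from the Python standard library) is used only to confirm that the generated statements are valid.

```python
import sqlite3

# Hypothetical logical data model: entity -> attributes (first one is the key).
entities = {
    "Customer": ["customer_id", "name"],
    "Order": ["order_id", "order_date"],
}
# Relationships as (child entity, parent entity): each Order belongs to one Customer.
relationships = [("Order", "Customer")]


def to_relational_ddl(entities, relationships):
    """Map entities to tables, attributes to columns, relationships to foreign keys."""
    statements = []
    for entity, attributes in entities.items():
        columns = [f'"{attributes[0]}" INTEGER PRIMARY KEY']
        columns += [f'"{attribute}" TEXT' for attribute in attributes[1:]]
        for child, parent in relationships:
            if child == entity:
                parent_key = entities[parent][0]
                columns.append(
                    f'"{parent_key}" INTEGER REFERENCES "{parent}"("{parent_key}")'
                )
        statements.append(f'CREATE TABLE "{entity}" ({", ".join(columns)});')
    return statements


ddl = to_relational_ddl(entities, relationships)
conn = sqlite3.connect(":memory:")
for statement in ddl:
    conn.execute(statement)  # raises sqlite3.Error if the DDL is malformed
    print(statement)
```

The same logical model could be mapped to the other columns of the table (records and pointers, complex types and contained elements) by swapping the generator function.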

Logical data model

TOGAF’s “logical data component” is a logical definition of the data in a data store.

It can be documented as a logical data model; that is, an entity-attribute-relationship model (which can include gen-spec relationships).

Building an information/data model in this way can be done independently of computing altogether.

A logical information/data model is a purely logical declaration of business terms and concepts without consideration of any database schema.

You might use a UML tool to draw an information/data model, but to call it a class diagram/model is misleading.

UML class diagrams are for modelling objects that have behaviour.

OData - the data access protocol for a web-based information server.

A modern way to realise logical data components as physical data components is in XML schema, accessed using the OData protocol.

This provides a generic way to organize and describe the data structure of any remote data store as a logical data model.

· An Entity Type (Customer, Employee, etc.) is a data structure type consisting of named and typed Properties and with a key.

· An Entity is an instance of an Entity Type.

· An Entity Key (CustomerId, OrderId etc.) is formed from a subset of Properties of the Entity Type.

· An Association defines a relationship between instances of Entity Types (for example, Employee WorksFor Department).

· An Association can be 1-to-1 or 1-to-many, uni-directional or bi-directional.

· A Navigation Property is a property of an Entity Type bound to a specific association, which can be used to refer to associations of an entity.
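
The concepts in the list above can be sketched as plain Python data classes. This is an illustration only: the class and field names are hypothetical and are not the API of any OData library, and the Employee and Department properties are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Association:
    """A relationship between instances of two Entity Types."""
    name: str
    end1: str
    end2: str
    multiplicity: str  # end1-to-end2, e.g. "1-to-1" or "1-to-many"


@dataclass
class EntityType:
    """A data structure type with named, typed Properties and a key."""
    name: str
    properties: dict  # property name -> type name
    key: tuple        # a subset of the property names
    navigation: dict = field(default_factory=dict)  # nav property -> Association

    def __post_init__(self):
        missing = set(self.key) - set(self.properties)
        if missing:
            raise ValueError(f"key properties not declared: {sorted(missing)}")


def entity_key(entity_type, entity):
    """Return the key values of an Entity (an instance of an Entity Type)."""
    return tuple(entity[name] for name in entity_type.key)


works_for = Association("WorksFor", end1="Department", end2="Employee",
                        multiplicity="1-to-many")
employee_type = EntityType(
    name="Employee",
    properties={"EmployeeId": "Edm.Int32", "Name": "Edm.String"},
    key=("EmployeeId",),
    navigation={"Department": works_for},
)
alice = {"EmployeeId": 7, "Name": "Alice"}  # an Entity
print(entity_key(employee_type, alice))
```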

Microsoft and SAP now expose their data using the OData protocol.

Any client (even a human) can retrieve a logical entity-attribute-relationship model from a web data store using HTTP, then proceed to invoke operations on it using HTTP.

The physical data structure of a remote data server is its own business.

All that matters to a client is that the data server returns a logical data model in reply to a request saying "get meta data".

The client can then proceed to invoke create, read, update and delete operations on entities in that data model.
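
The request sequence can be sketched with the Python standard library. The service root, entity set name and payload below are hypothetical. The URL shapes follow common OData conventions (a $metadata resource under the service root, and a key in parentheses after the entity set name); check the conventions of the actual service before relying on them. The sketch only builds the requests and sends nothing over the network.

```python
import json
from urllib.request import Request

SERVICE_ROOT = "https://example.org/odata"  # hypothetical service root


def metadata_request(service_root):
    """Build the 'get meta data' request for the service's data model."""
    return Request(f"{service_root}/$metadata", method="GET")


def crud_requests(service_root, entity_set, key, entity):
    """Build create, read, update and delete requests for one entity."""
    collection_url = f"{service_root}/{entity_set}"
    entity_url = f"{collection_url}({key})"
    body = json.dumps(entity).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return {
        "create": Request(collection_url, data=body, headers=headers, method="POST"),
        "read": Request(entity_url, method="GET"),
        "update": Request(entity_url, data=body, headers=headers, method="PATCH"),
        "delete": Request(entity_url, method="DELETE"),
    }


requests = crud_requests(SERVICE_ROOT, "Customers", 1, {"Name": "Acme Ltd"})
for operation, request in requests.items():
    print(operation, request.get_method(), request.full_url)
```

To send any of these, a client would pass the Request to urllib.request.urlopen, adding whatever authentication the server requires.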

Centralised and distributed data storage

Data architects are much concerned with the distribution or duplication of data in different data stores.

The choice between hierarchy and anarchy is central to much discussion of sociology and politics.

It is closely related to the choice between centralisation and distribution, which appears also in business, software and data architecture.

“It is not hard to speculate about, if not realize, very large, very complex systems implementations, extending in scope and complexity to encompass an entire enterprise.” John Zachman, 1987

This might be interpreted to imply consolidation of an enterprise’s business data into one large database.

(And it appears SAP pursued this strategy for many years.)

A current fashion is to "distribute data management" as Martin Fowler puts it.

So-called “micro services” (better-called “micro apps”) are based on small data stores.

The idea is to integrate small information systems rather than consolidate them around one data store.

This has advantages and disadvantages, but whether data storage is centralised or distributed the vision of EA remains the same.

That vision is to integrate business activities through sharing of the data they create and use.

Data flows – data in motion

Businesses have to move information from one place to another, between business actors and data stores.

Data architects are concerned with the capture and transport of information in data structures.

Data flow view

Enterprise architecture is concerned with actors and activities that send and receive data/information.

The view relates applications to data flows (which can include messages, files and reports) and to data components.

Information System View	TOGAF artefacts	ArchiMate viewpoints
Data flow view	Application Interaction matrix	Application Cooperation
	Application Communication diagram	Application Cooperation
	Interface catalog

Application Interaction matrix

Application Communication diagram

See “Applications Architecture”.

Interface catalog

Catalog the data flows that pass between applications, and between human roles and applications.

Human actors do convey much critical business information informally - in ad hoc speech, gestures and drawings.

But architects cannot model ad hoc information; they can only name messages that appear in regular communications.

Data architects can name messages created and used in regular business processes (e.g. enquiry, response, order, invoice, payment).

A Data Flow Catalogue (Interface Catalogue in TOGAF)
Functional attributes	Flow name	Enquiry	Response	Order
	Trigger		Enquiry
	Source	Customer	Sales	Customer
	Destination	Sales	Customer	Sales
	Information	Unstructured	Unstructured	Order details (tbd)
Non-functional attributes	Frequency	1,000/day	1,000/day	30/day
	Volume			500K
	Confidentiality	High	High	High
	Integrity	Medium	Medium	High
	Availability	24/7	09.00-18.00	24/7
Transport mechanisms	Technology	Web	Telephone	Web
Transport mechanisms	Protocol	HTTP		HTTPS
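
The catalogue above can also be held as structured records, which makes it easy to query. The following Python sketch transcribes the three flows from the table. The class name, field names and the two query functions are hypothetical choices made for this illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DataFlow:
    # Functional attributes
    name: str
    source: str
    destination: str
    information: str
    # Non-functional attributes
    frequency_per_day: int
    confidentiality: str
    integrity: str
    availability: str
    # Transport mechanisms
    technology: str
    # Cells left empty in the table
    trigger: Optional[str] = None
    volume: Optional[str] = None
    protocol: Optional[str] = None


catalogue = [
    DataFlow("Enquiry", "Customer", "Sales", "Unstructured", 1000,
             "High", "Medium", "24/7", "Web", protocol="HTTP"),
    DataFlow("Response", "Sales", "Customer", "Unstructured", 1000,
             "High", "Medium", "09.00-18.00", "Telephone", trigger="Enquiry"),
    DataFlow("Order", "Customer", "Sales", "Order details (tbd)", 30,
             "High", "High", "24/7", "Web", volume="500K", protocol="HTTPS"),
]


def flows_into(catalogue, destination):
    """Return the flows received by one actor or application."""
    return [flow for flow in catalogue if flow.destination == destination]


def confidential_over_plain_http(catalogue):
    """Return high-confidentiality flows carried over unencrypted HTTP."""
    return [
        flow for flow in catalogue
        if flow.confidentiality == "High" and flow.protocol == "HTTP"
    ]


print([flow.name for flow in flows_into(catalogue, "Sales")])
print([flow.name for flow in confidential_over_plain_http(catalogue)])
```

The second query illustrates how non-functional attributes and transport mechanisms can be cross-checked once the catalogue is in a structured form.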

Data flow definers may name data groups and items in those data structures (e.g. From and To addresses in an email header).

They can name so-called “unstructured” data items to hold ad hoc information (e.g. the message in the body of an email).
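
The email example can be sketched with Python's standard email package. The addresses and message text are invented for illustration. The header fields are the named, structured data items; the body is the “unstructured” item.

```python
from email.message import EmailMessage

message = EmailMessage()
# Structured data items: named fields in the email header.
message["From"] = "customer@example.com"
message["To"] = "sales@example.com"
message["Subject"] = "Enquiry"
# "Unstructured" data item: free text in the body.
message.set_content("Do you stock left-handed widgets?")

structured = {name: str(message[name]) for name in ("From", "To", "Subject")}
unstructured = message.get_content()

print(structured)
print(unstructured.strip())
```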

Like many such illustrations, this table shows what could be documented rather than what most actually document.

But understanding what is possible in theory is a precursor to deciding what to do in practice.

Physical data flow forms

Data architects are concerned with the forms of matter and energy in which business actors convey information.

At the bottom-most level, physical forms include wires, microwaves and sound waves (human speech).

In theory, data architects can design non-digital data flows.

In practice, data architects mostly focus on business systems in which business information is to be digitised.

Data flow format standards

Data architects are concerned to ensure senders and receivers can create and read data structures.

There are many standard data flow formats, covering:

· Digital audio data, image data, and video data

· Documentation and scripts

· Geospatial data; vector and raster data

· Qualitative data, textual

· Quantitative tabular data, with or without metadata

Data architects have to research standard data formats as the need arises. See footnote 1 for more detail.

Semantic interoperability

The information found in the structure of a data flow or data store is a matter of perspective.

So “semantic interoperability” is a major concern of enterprise data architecture.

Data architects work to ensure the creators and users of a data structure share the same understanding of its contents.

Business data can be structured according to many domain-specific languages – bespoke or standard.

So, data architects have to research standard “canonical data models” as the need arises.
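
The role of a canonical data model can be sketched in Python. This is a minimal illustration, assuming two hypothetical applications that use different field names for the same business concept. The field names and the canonical terms are invented, not drawn from any standard model.

```python
# Hypothetical mappings from application-specific field names to canonical terms.
CRM_TO_CANONICAL = {
    "cust_no": "customer_id",
    "cust_name": "name",
    "postcode": "postal_code",
}
BILLING_TO_CANONICAL = {
    "CustomerRef": "customer_id",
    "FullName": "name",
    "Zip": "postal_code",
}


def to_canonical(record, mapping):
    """Translate one application's record into the canonical vocabulary."""
    unknown = set(record) - set(mapping)
    if unknown:
        # A field with no agreed meaning is a semantic interoperability gap.
        raise KeyError(f"no canonical term for: {sorted(unknown)}")
    return {mapping[name]: value for name, value in record.items()}


crm_record = {"cust_no": 42, "cust_name": "Acme Ltd", "postcode": "AB1 2CD"}
billing_record = {"CustomerRef": 42, "FullName": "Acme Ltd", "Zip": "AB1 2CD"}

print(to_canonical(crm_record, CRM_TO_CANONICAL))
print(to_canonical(crm_record, CRM_TO_CANONICAL)
      == to_canonical(billing_record, BILLING_TO_CANONICAL))
```

Renaming fields is the easy part. Agreeing that the two applications mean the same thing by “customer” is the harder, human part of the work.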

Where are input and output data/information flows in TOGAF?

Business input and output flows are identified at the start of the Business Architecture phase B.

These flows can convey materials and/or information.

The B-to-C and B-to-B information content is conveyed through Human-Computer Interfaces and APIs, and in non-digital forms like paper.

The flows are documented in Business Service contracts in the Architecture Requirements Specification.

They may also appear in a Business Service/Function Catalogue and/or Process/Event/Control/Product Catalogue.

Application input and output flows are identified at the start of IS Architecture phase C.

These A-to-B and A-to-A flows can convey information only.

The information content is conveyed through Human-Computer Interfaces and APIs.

The flows are documented in IS Service contracts in the Architecture Requirements Specification.

They may also appear in an Interface Catalogue and/or Application Use Case Descriptions.

Where are input and output data/information flows in ArchiMate?

Architects are taught to define systems from out to in, starting with the input/output boundary.

They define the external view of a system - hiding details of internal behaviours and structures.

Then, they divide the system into layers and/or subsystems and define each in the same way.

Architects define each system and subsystem (building block or component) by defining its interface(s).

An interface is a collection of services that a system or subsystem makes available to clients.

An interface encapsulates the internal actors/components and processes that implement or realise services.

The ArchiMate modelling language classifies these ideas as shown in the table below.

ArchiMate	Behaviour elements	Active structure elements
External view	Services	Interfaces
Internal view	Processes	Actors/Components

Services are discrete behaviours that clients can request of a system.

Service contracts encapsulate (hide) the necessary internal process flows and actors/components.

Services consume and produce input/output flows that contain data and/or materials.

So, input and output data flows can be named (and detailed if need be) in service contracts.
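
The external/internal distinction can be sketched in Python. This is an analogy in code rather than ArchiMate notation. The service, the data flow types and the stock-checking rule are all hypothetical.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class OrderRequest:
    """Input data flow named in the service contract."""
    customer_id: int
    product_type: str
    quantity: int


@dataclass(frozen=True)
class OrderConfirmation:
    """Output data flow named in the service contract."""
    order_id: int
    accepted: bool


class SalesInterface(Protocol):
    """External view: the services a client can request."""

    def place_order(self, request: OrderRequest) -> OrderConfirmation: ...


class SalesComponent:
    """Internal view: the component and process that realise the service."""

    def __init__(self):
        self._next_order_id = 1

    def place_order(self, request: OrderRequest) -> OrderConfirmation:
        accepted = self._check_stock(request)  # internal process step, hidden from clients
        confirmation = OrderConfirmation(self._next_order_id, accepted)
        self._next_order_id += 1
        return confirmation

    def _check_stock(self, request: OrderRequest) -> bool:
        return request.quantity > 0  # placeholder rule for the sketch


def client(sales: SalesInterface) -> OrderConfirmation:
    """A client that depends only on the interface, not on the component."""
    return sales.place_order(OrderRequest(customer_id=1, product_type="Widget", quantity=2))


print(client(SalesComponent()))
```

The client sees only the named input and output flows; the stock check could be replaced without changing the contract.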

Footnotes

Footnote 1: Data flow format standards

The table below shows a selection of formats from the list at https://library.uoregon.edu/datamanagement/fileformats.html.

It is drawn from UK Data Archive documentation; some of the data formats may be receding into history.

Popular modern formats include JSON for data flows, and OData for the description of web-accessible data stores.

Digital image data

TIFF version 6 uncompressed (.tif)

JPEG (.jpeg, .jpg)

PDF (.pdf)

Digital video data

MPEG-4 High Profile (.mp4)

JPEG 2000 (.mj2)

Digital audio data

Free Lossless Audio Codec (FLAC) (.flac)

Waveform Audio Format (WAV) (.wav)

MPEG-1 Audio Layer 3 (.mp3)

Qualitative data, textual

eXtensible Mark-up Language (XML) text according to a Document Type Definition (DTD) or schema (.xml)

Rich Text Format (.rtf)

plain text data, ASCII (.txt)

Hypertext Mark-up Language (HTML) (.html)

widely-used proprietary formats, e.g. MS Word (.doc/.docx)

Documentation and scripts

Open Document Text (.odt)

Rich Text Format (.rtf)

HTML (.htm, .html)

plain text (.txt)

widely-used proprietary formats, e.g. MS Word (.doc/.docx) or MS Excel (.xls/ .xlsx)

XML marked-up text (.xml) to a DTD or schema, e.g. XHTML 1.0

PDF (.pdf)

Quantitative tabular data with extensive metadata

SPSS portable format (.por)

delimited text and command ('setup') file (SPSS, Stata, SAS, etc.) containing metadata information

structured text or mark-up file containing metadata information, e.g. DDI XML file

MS Access (.mdb/.accdb)

Geospatial data; vector and raster data

ESRI Shapefile (essential -- .shp, .shx, .dbf; optional -- .prj, .sbx, .sbn)

geo-referenced TIFF (.tif, .tfw)

CAD data (.dwg)

tabular GIS attribute data

Quantitative tabular data with minimal metadata

comma-separated values (CSV) file (.csv)

tab-delimited file (.tab) including delimited text of given character set with SQL data definition statements where appropriate

delimited text of given character set -- only characters not present in the data should be used as delimiters (.txt)

widely-used formats, e.g. MS Excel (.xls/.xlsx), MS Access (.mdb/.accdb), dBase (.dbf) and OpenDocument Spreadsheet (.ods)

Footnote 2: Logical data model to physical XML schema mappings

The table below is edited from a table on the IBM web site.

http://www.ibm.com/support/knowledgecenter/SS9UM9_8.1.0/com.ibm.datatools.transform.ldm.xsd.doc/topics/rldm2xsd_map.html

Logical data model	Physical XML schema
	Schema
	SchemaLocation (XSD file name)
	TargetNamespace (unless set in the Properties page)
Atomic Domain	Simple Type
Atomic Domain - Name	Name
Domain Constraint	Facet (FractionDigits, TotalDigits, MaxLength, MinLength, Length, MaxExclusive, MinExclusive, MaxInclusive, MinInclusive, Enumeration, Pattern)
Entity	Complex Type and Element
Entity - Name	Name
Entity - Documentation	Documentation
Entity - Supertype of Generalization	BaseType of Complex Type
Entity - Primary Key	Key of Element
Generalization	See Entity
Generalization Set	See Entity (with all applicable properties of the generalization set).
Attribute	Contained Element with Simple Type
Attribute - Name	Name
Attribute - Documentation	Documentation
Attribute - Data Type, Length/Precision, Scale	Type
Attribute - Primary Key	Key field of containing Element
Attribute - Entity	Owning Complex Type
Relationship	Contained Element with Complex Type
RelationshipEnd	Contained Element with Complex Type
RelationshipEnd - VerbPhrase	Name
RelationshipEnd - Cardinality	MinOccurs / MaxOccurs
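
A few rows of the mapping above can be sketched in Python using the standard xml.etree.ElementTree module. This is not the output of the IBM tool. It is a minimal illustration covering only Entity (complex type and element), Attribute (contained element with a simple type) and RelationshipEnd (contained element with a complex type, named by its verb phrase, with MinOccurs/MaxOccurs). The function name, the “Type” suffix and the sample entity are assumptions; keys, domains and generalizations are left out.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
ET.register_namespace("xs", XS)


def entity_to_xsd(entity_name, attributes, relationship_ends=()):
    """Build an XML schema fragment for one entity of a logical data model."""
    schema = ET.Element(f"{{{XS}}}schema")
    # Entity -> Complex Type
    complex_type = ET.SubElement(schema, f"{{{XS}}}complexType",
                                 name=f"{entity_name}Type")
    sequence = ET.SubElement(complex_type, f"{{{XS}}}sequence")
    # Attribute -> contained element with a simple type
    for attribute_name, simple_type in attributes:
        ET.SubElement(sequence, f"{{{XS}}}element",
                      name=attribute_name, type=simple_type)
    # RelationshipEnd -> contained element with a complex type and cardinality.
    # The target type is only referenced here; it would need its own definition.
    for verb_phrase, target_entity, min_occurs, max_occurs in relationship_ends:
        ET.SubElement(sequence, f"{{{XS}}}element",
                      name=verb_phrase, type=f"{target_entity}Type",
                      minOccurs=str(min_occurs), maxOccurs=str(max_occurs))
    # Entity -> Element
    ET.SubElement(schema, f"{{{XS}}}element",
                  name=entity_name, type=f"{entity_name}Type")
    return schema


schema = entity_to_xsd(
    "Customer",
    attributes=[("CustomerId", "xs:int"), ("Name", "xs:string")],
    relationship_ends=[("Places", "Order", 0, "unbounded")],
)
print(ET.tostring(schema, encoding="unicode"))
```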