ERP Historic Data strategy
This paper has been edited from a LinkedIn discussion started by Javier Mauricio Murcia Fernández.
The principal contributor was Nic Harvard - and you can email him if you want more technical advice.
Regular business operations require on-line transactional processing (OLTP) of data gathered over a few days, weeks or months.
You can reasonably hope to generate some useful reports from the transactional database.
Strategic decision makers often require on-line analytic processing (OLAP) of data gathered over years.
A common business problem is that, as data volumes grow, the OLTP system cannot meet NFRs.
Query responses are slow. Data analysis and reports for managers hog the system, making the system unavailable.
The root cause is that current technologies don't scale to allow OLTP and OLAP on the one data store at the data volumes often needed for OLAP.
All ERP systems are different, but they may be divided into three tiers.
· Tier 1 ERP systems - SAP and Oracle.
· Tier 2 ERP systems - Infor, Sage, IFS, Epicor, IQMS,QAD, SysPro, MSDynamics combination, etc. A handful still around after a decade of consolidation.
· Tier 3 ERP systems - hundreds of them, mostly used by a single department or SME company.
Most ERP systems assume you will make as much use as you can of their transactional database.
Perhaps no ERP system comes (out of the box) with a clear
archiving strategy and inbuilt tools
for archiving and OLAP.
Tier 1 and 2 vendors do sell add-on tools – at considerable extra licence costs.
The purchaser must then pay for the tools to be implemented, so caveat emptor!
ERP systems are designed to centralise all a business needs, not to be integrated with home-grown or bespoke business applications.
In the ideal, they do not need care and feeding from IT people (sic).
Many (most?) ERP systems store data redundantly at many points.
They may link transactional data elements to each other, or to reference data, in ways you might not expect.
You work for an SME that does not have lots of money, or superb internal IT support or a fantastic relationship with its ERP vendor.
ERP system has been in operation a long time and accumulated a huge amount of
So, when users query a huge dataset, the response time slows, or the database becomes completely unavailable.
The problem ought not to exist in a mature, well-managed environment where the ERP is seen as a strategic asset.
Nevertheless the situation has arisen, and the ERP system does not help out of the box.
Perhaps it does come with basic ETL tools to help moving data to a different database
But it does not
provide a clear strategy to deal with the historic data that is needed for
analysis or marketing.
The symptoms have revealed you have a problem that requires strategic thinking.
You need to present options your business managers can understand. So, what to suggest?
You may consider the general solution patterns:
· separate de-normalised OLAP data from normalised OLTP data
· archive non-active data to maintain the performance of transaction processing.
But it would take a lot of time and money to jump to this from the current solution.
And perhaps it makes you uncomfortable because your instinct is not to expand your system estate such that you increase
· the number of systems, skill sets and inter-system integrations than need to be maintained and managed
· the number of places the same data is has to be defined, stored and updated (whether separately or by periodic integration).
You might consider extracting portions of the ERP’s stored data into a separate data stores for reporting and/or archiving.
You might look at extracting the data for one sales area, or one period of history.
But this turns out to be very difficult because your
transactional data is inextricably linked to configuration data, master data
and reference data.
Your transaction data is stored to meet referential integrity rules.
The rules get broken when you select only of part of the data in an extract, or remove part of the data from the original store.
So, what else to suggest?
A reasonably easy tactical solution is
1) Enable queries on a replica store (log-shipped copies, backups, or similar).
2) Put a time-out limit on queries on the live data, such that after (say) 20 seconds
3) The user gets a message "this query requires so much processing it will slow down the application for other users. Please open this <other application> to query a historic copy of the data. Or else call <this number> for advice."
That works to solve the immediate problem. I have done this
Once, it was good, and 3 months later the underlying issue was solved, so all remained well.
Once, it was bad; at first the symptoms went away, so directors thought the problem was solved.
Two years later: "Oh dear, everything has fallen over! What to do now?”
The easiest low risk strategy is to copy all data over to a Data Warehouse with
Do this anyway, so you can stop people running queries against live data; which is simply a bad thing.
Now the historic data is in a safe place, indexed to optimise OLAP, and in very low-cost storage, the problem is partially solved.
Next, rid yourself of the transactional data going back 2 decades (or however long it is)
The natural point to do this is at the next re-implementation or migration of the whole ERP system.
Since this point will likely come around in 3-7 years, you
should not have to wait too long.
At that point, you must have a very long, careful chat with your data migration experts, and proceed carefully.
Depending on your ERP system and vendor, you might do the following.
Google <yourERPapp> + Information Lifecycle Management (ILM) + archiving, to find what your vendor wants you to buy.
Chat with people in your ERP space who have done this before - user groups and customers
Tell your vendor you won’t buy the latest all-singing all-dancing toolkit they acquired last summer in an acquisition-orgy
Ask: can you have the one they used 5 years ago please? (They will say "no" of course)
Tell your vendor that's fine, as you were planning to archive the data when you migrate to <newApp> in a year’s time.
Wait for the vendor to come to your party.
A variety of potential solutions might be considered, most obviously, stripping data out of the OLTP database
What if reducing the data volume doesn’t help much?
Try to fix things with bespoke code
ERP systems store data in particular, and perhaps obscure, ways
So any technique you use may backfire, and you may not find the problem during testing.
If you are lucky, you’ll get further early warning signs, before (say 3 month later) the system freezes.
You will have to tell the CxOs why you recommended something that didn’t work.
And why the company cannot produce its financial reports and the share price will go near zero. (And how much your insurance covers.)
So, if you use the ERP for accounting or anything mission critical, beware.
Best stick to the ERP vendor’s script: you committed to lock-in when you bought the software!
On many ERP systems, you can all the queries through a SQL cost analyzer.
Perhaps there are only one or two routines that create all the problems, perhaps reused by several services.
Cay Hasselman reports: I came across a SAP/ Oracle database system with most queries at a cost of 1 to 3, with a few at 4 or 5.
But one (to produce a report for the specific business needs of that organisation) cost 12,527.
Later, programmers found it so useful they called it from various ABAP routines.
It turned out include over hundred nested inner joins.
We than reduced this in steps with an overall SQL query cost to 6 and suddenly the system had not a single problem.
Cay Hasselman reports: This is a technique you can try on ERP without problems
Usually when I am called in to fix the problem, indexing was the problem as the time to run all indexes was 25 plus hours a day.
Cay uses tools *within* the product - other systems may or may not have the equivalent of an ST05 transaction/utility
He checks indexing at the DB level - this should work on most ERP systems, as long as you really know what you are doing! (Careful of transparent tables Cay?:)
This will resolve the part of the problem which revolves around queries.
If you are very
lucky, there are only one or two newish queries, and the rest of your systems is affected by a data volume increase.
And you have solved the problem for the predictable future.
The separation of concerns between OLTP and OLAP systems
usually trumps the desire for one system.
The trouble is that ERP systems try do it all (which I consider a fundamental failing of ERP system design, and is being addressed by opening up the core design to more SOA-style interoperation).
They provide a number of operational query-type reports.
They also highly recommend the use and implementation of a separate Data Warehouse for OLAP.
Tier 1 ERP systems
SMEs can reasonably hope to run queries on the transactional database.
Within the SAP estate (and without SAP Business Intelligence
tools) all functional areas contain a fairly useful reporting suite - Logistics
info systems, HRIS, etc.
Few consultants and almost no users know about these tools.
There are a wealth of "normal" reports one can run
And you can use these to write SQL-like joins and report on
pretty darn any data you want in the transactional system.
After a few years, as users gain expertise in using the system, they get frustrated with barriers.
They may discover the reporting
tools, then start to use them, with sometimes poor results performance-wise.
Large companies buy the Business Intelligence Data Warehouse add-on (at large cost).
This may or may not spawn sundry data-marts and infospokes, etc.
This may or may not be part of a rational Information Lifecycle Management archiving strategy.
I've known companies who grow data stores at transactional data rates of 10-100 TB per month.
You would think that this would raise concerns, but storage is so cheap, and performance good enough, that this hardly causes a blink.
Tier 2 and 3 ERP systems.
These are not so good at progressing OLTP tasks over huge datasets, or running hybrid OLTP/OLAP tasks simultaneously.
They are usually used by small enterprises
Probably too small to be able to afford (or need) a dedicated Business Intelligence application.
But likely to have one or two clever-clogs
IT people who can get at the data in the ERP database.
If the IT people are half-good, they design from day zero so that queries run off a replica data store (option 3).
If they are less good, they take the easy route to run
queries against live data, either in-system, or using 3rd party tools.
It's read-only for queries, so it's safe, right? Trouble is, sooner or later, the performance issues resurface.
Remember, a system stuttering due to queries is a symptom of
an underlying problem that can be resolved; the system is groaning under the weight of too much
Now we are back to option 2 and the difficulty in cleanly breaking up an old, highly interlinked ERP dataset.
Oddly, testing tools can be better for this than archiving tools.
The guys needing to extract usable subsets of data have come up with quicker/better/cheaper solutions than the ones who simply want to archive.
NB: Your mileage will differ from app to app - not all ERP's are equal in this regard.
And there is so much business and transactional logic above/outside the data server that data store-only approaches are fraught with danger.
The flow of data from transaction data entry to data warehouse and management information reports can include not only OLTP and DW data stores, but also ODS and DM data stores.
However, the pattern below – implying the transient or
persistent instantiation of data in several physical data storage schema - is
Today, you might find it easier to use in-memory stores, data virtualisation, data replication, or other techniques.
OLTP: on-line transaction process data stores (mostly normalised)
Capture operational transactional data.
Regular business operations require OLTP of data gathered over a few days, weeks or months.
You can reasonably hope to generate some useful reports from the transactional database.
ODS: operational data stores (mostly normalised)
Populated with data close to real time with data from one or more OLTP systems sources.
Isolated from the transaction processing, and optimized with indexes to improve query performance.
Sometimes needed to provide operational information to users where data must be obtained/joined from different sources in "near real time".
Can be used also as a staging database in
an ETL process that moves data to a DW.
DW: Data Warehouse (denormalised)
Populated with data from potentially many OLTP and/or ODS data stores
Strategic decision makers often require on-line analytic processing (OLAP) of data gathered over years, which may be stored separately in a data warehouse.
DM: Data Mart data stores (denormalised)
Populated with a subset of the DW data, and optimised to answer one or more specific business analysis questions.