Data: a concise word for a tremendous amount of information. Once, data was manageable; today, with data scattered everywhere – even in systems you don’t control – it often veers towards chaos. As a data professional, your mission is to conduct that chaos.
All data is big data
Big data describes data that is large in size (high Volume), is rapidly growing and changing (Velocity), and comes from many different sources (Variety). For most larger organisations, it is reasonable to suggest that all data is now big data.
According to statista.com, “the total amount of data created, captured, copied, and consumed globally is forecast to increase rapidly, reaching 64.2 zettabytes in 2020. Over the next five years up to 2025, global data creation is projected to grow to more than 180 zettabytes. In 2020, the amount of data created and replicated reached a new high. The growth was higher than previously expected caused by the increased demand due to the COVID-19 pandemic, as more people worked and learned from home and used home entertainment options more often. Storage capacity also grew. Only a small percentage of this newly created data is kept though, as just two percent of the data produced and consumed in 2020 was saved and retained into 2021. In line with the strong growth of the data volume, the installed base of storage capacity is forecast to increase, growing at a compound annual growth rate of 19.2 percent over the forecast period from 2020 to 2025. In 2020, the installed base of storage capacity reached 6.7 zettabytes.”
Modern data landscapes typically store data in a combination of on-premises and cloud data stores – from traditional mainframe and relational database management systems to data lakehouses and other semi-structured modern storage options. Arguably, these days, all data is big data.
Making sense of it all
It should be clear that strategies for managing and getting value from data have to change to manage this new complexity.
Data management has historically been seen as an IT or compliance problem. Although many data management efforts ultimately exist to feed advanced analytics functions, research shows that investments in data integrity have positive impacts across a range of business metrics. As the volume and complexity of data increase, we need to formalise data practices that help us identify which data is most important, ensure its quality, and make it accessible to authorised users.
The compliance-driven data governance initiatives of the mid to late 2000s left a bad taste in many business people’s mouths. In many cases, programs were perceived as blocking business attempts to be more agile or innovative.
Data governance needs to shift towards enabling the business – by making it easier to find and access the data that knowledge workers need to do their jobs. Ultimately, data governance should formalise decision-making about data – starting with an agreement about what is important and where data management investments are necessary.
Data governance structures should be driven by the data strategy, which, in turn, should focus on meeting business goals by addressing gaps in the data architecture and capabilities. As data volumes increase and data becomes less structured, so does the need to understand how data is created, stored, and consumed; who is responsible for it; and who should have access to it and for what purpose. Furthermore, assessing how trustworthy the data is becomes increasingly essential if data is to be appropriately prioritised for maximum value.
Back in 2014, we identified that, without context, the data lake would become unmanageable. In 2018, we talked about how chaotic data lakes were making it impossible to deliver on the goals of data democratisation – making data available to the people who need it. Technologies such as enterprise data catalogues and metadata tools – which link business and technical data context to provide data intelligence – and data quality tools, which improve the content of data, must be selected with the business knowledge worker in mind.
Technologies must be selected based on their ability to cater for a variety of business cases and stakeholders – from decision makers looking for trusted insights, to analytics teams and data scientists looking to build the next data product. Operational teams and IT must be able to run agile DataOps programs to deliver trusted, reliable data at scale, and compliance and privacy teams must be able to prove compliance to external and internal auditors.
A business-first, top-down approach should be driven through integrated data governance processes and structures that ensure collaboration and manage priorities across the organisation.
By Gary Allemann, MD at Master Data Management