Update 28 Sept 2021 - Azure Purview now in general availability status.
Update 1 May 2022 - Azure Purview is now Microsoft Purview, and Purview Studio is now Microsoft Purview Governance Portal. A subtle but important change as Purview, even though it is part of the Azure ecosystem, it of course covers a data estate well beyond Azure (incl. AWS, GCP, Snowflake, Teradata, on-premises, etc.). This article was altered to reflect this change.
Update 7 May 2022 - Also see this related article re deeper integration between Microsoft Purview and Azure Synapse.
Data enablement - can Microsoft Purview help
For several years now, I have been evangelising about the need for data enablement in organisations. In a nutshell, this means getting data to consumers (be it final consumers of reports, or consumers who go on to create data solutions such as data engineers, data scientists and business focussed data modellers) faster, and accepting that data transformations, data modelling and new data solutions (models, reports, dashboards, etc.) will occur as part of a highly agile, rapid development process and by multiple data workers, some of whom are external to ICT.
Technology has now reached that part in the maturity curve, where this enablement will accelerate and become the norm, replacing old school, linear data workloads performed by central BI/ ICT teams only. Core to these technologies is Data Lakes for storage at scale (the ‘lake’ of your ‘lake house’), Synapse or Databricks for on-demand transformations and the virtual data warehouse layer (the ‘house’ of your ‘lake house’), and Power BI for business models and visualisations (the ‘front door’ to the whole thing) and of course resources to move data into, and around the ecosystem.
BUT all this enablement now demands a massive rethink of governance - both in terms of methodology as well as technology. Long and laborious theory heavy data governance methodologies simply won’t keep up with the rapid internal growth of the internal data market and the many workers across the organisation who take part in data related activities. An alternative, much more pragmatic methodology is required and must be supported by technology that posses
two crucial things: (1) Artificial Intelligence to simplify and accelerate the process of data cataloguing and classification, and (2) crowd sourcing so that users across the business can quickly add to the collective knowledge of the data assets of the business. And it is in the technology space where Azure had a massive blind spot.
Introducing Microsoft Purview.
The word Purview simply means the ‘range of vision’ and when it comes to data, then the greater the range of this vision and the clearer the objects you see, the better. Will Purview live up to this definition of its more generic namesake and will it cover the blind spot I previously mentioned?
Currently in public preview, and due for general availability in the not-too-distant future is the first of multiple planned modules for Purview, I.e., the Data Catalog module. This first module supports AI based cataloguing and classification of data across your data estate, curation, and data insights. Users will in addition be able to use and maintain business glossaries, expressions to classify data based on patterns beyond the out-of-the-box classifications (let's call these bring your own (BYO) expressions to cover additional patterns), provide visibility over ownership and custodianship, show lineage, etc.
This will have immediate benefit to anyone seeking pragmatic data governance as it will immediately provide a heap of knowledge about the data in your data estate via 100+ out of the box scanning rules, something that would have required resource intensive and error prone human activity, plus it enables a data worker to augment/ override the AI scanning in a crowd sourcing ecosystem, or by allowing data workers to BYO scanning rules.
Recently road-testing Purview showed there are still some kinks to iron out, but the product is still in public preview so that is to be expected. We dumped a whole load of data into Azure Data Lake and set Purview scanning loose over it to do its AI and built-in classifier magic. The results looked pretty good and goes a long way to fulfil that requirement I mentioned before for pragmatic and accelerated data governance – some highlights are:
Microsoft Purview - classifications and sensitivity:
o Classifications using Purview’s own 100+ prebuilt classifications, or BYO ones, are used to mark and identify data of a specific type that is found within your data estate. We could easily see where all the credit card and Australian Tax File numbers were located across the data lake.
o Sensitivity labels define how sensitive certain data is and they are applied when one or more classifications and conditions are found together. We could clearly find the data with a sensitivity of ‘secret’ – an immediate application of this could be to support the IRAP data classifications as defined by the Australian Signals Directorate and PCI in the financial sector.
When the scan completes all data meeting classification rules will be discoverable, whereas Purview’s sensitivity labelling needs a couple of hours to reflect the new assets and auto label sensitivity after which time it too will be discoverable. It is also possible to view insight reports for classifications and sensitivity labels.
Microsoft Purview - business glossary
We could easily employ a glossary to overlay business friendly terms over the metadata so that we converted the physical vocabulary with a standard vocabulary the business can understand. Remember data is the business’ asset more so that that of ICT, so a business vocabulary is important.
Purview has, as I previously mentioned, more than 100 system classifiers it uses during scans to automatically detect the system and it can also use your own BYO classifications and apply them to data assets and schemas. But it was easy to override these with the business glossary and anything a human override was never replaced by subsequent automated scans. We for example overrode CC# with Credit Card Number.
It is also in the glossary where data stewards and owners are set, two core elements of effective data governance.
Terms can also be subject to workflow and approvals so that it does not become a free for all.
Microsoft Purview - show lineage at various levels
We could see a bird’s-eye view of the data estate, including very impressive lineage at the asset, column and process, levels, as well as the full data map:
o At the data asset/ entity level, i.e., where the entity/ asset came from.
o At the column level. i.e., where the attribute came from.
o At the process level, i.e., how data was moved/ orchestrated.
Microsoft Purview - gain insights
It is also easy to see insights across many of these concepts – across the data assets, glossary, scans, classifications, sensitivity, and file extensions.
The images below show only some insight examples for the glossary, for classification and for sensitivity:
Conclusion - is Microsoft Purview a worthy data governance tool
Does Purview hit the mark? As I said before, there were, as at the date of authoring this, some kinks Microsoft needed to sort out and my correspondence with their product team suggests they are working on this. So, looking at it from a pure data cataloguing perspective, it ticks many boxes and at a very compelling price point.
But data governance is broader than just cataloguing, and even though Purview crosses the boundary into some aspects that would not normally sit within a data catalog (which is a good thing), other areas still require attention, notably master data, data quality and data security. BUT we all know this is just the first module, so watch this space. I am very hopeful.
Disclaimer
The views expressed on this post are mine and do not necessarily reflect the views of any organisation I am associated with.
Comments