The vast troves of data generated today help businesses make better decisions. However, the prevailing view within the data community is that, with the explosion in the number and variety of data sources, we need newer paradigms for integrating them and deriving insights from them. We’re still at the beginning of a true data revolution and, as we said earlier, there is scope to build a unified conversation around the best practices, updates, tactics, and new tools enabling it.
In the first episode of Dataverse, the conversation is centred around the hottest new subject for chief data officers (CDOs) — Data Mesh. Manjot and Ravi spoke to Shuveb Hussein, co-founder at Zipstack, about what data mesh is, why it matters, and how it will benefit the larger data community.
If you run a search for “data mesh,” the results will, nine times out of ten, include words like “federated,” “decentralised,” “distributed,” and so on. While these certainly describe what a data mesh does, Shuveb sets the context with something more basic: a data mesh is not a straightforward, tangible entity like a software product. It is, according to Shuveb, a “socio-technical construct. Just as DevOps is a culture. The implementation has to start first with the culture. It’s not like, please install these four things, and you have a data mesh”.
So, what constitutes a data mesh? Broadly, there are four pillars Shuveb outlines:
- Decentralised domain-oriented permissions
Organisations have historically carried out data ops through centralised control of data systems. That has been changing, but there is a long way to go. To make data meaningful and actionable, it is important to figure out the right way to use it. It follows naturally that domain experts who understand the data should have access to the right tools to make sense of it. The first pillar of a data mesh, therefore, involves ensuring that various tools are available to domain experts who can “figure out how data can be covered, and how data can be used, how dark data can be reduced, how quickly they can get new data sources online for others to use.”
- Apply product thinking to data
Apply the same thinking to data that you would to building a product. In a data mesh, data should be easily discoverable, and it should be easy to get data products that are built on service level agreements (SLAs), standard interfaces, and contracts.
- Common DataOps platform
The data mesh should have a common platform where people can do things declaratively. Instead of each team having to build something new for high-level operations and tasks such as blue-green deployments or auto-scaling to save costs, it makes sense to have a common platform that houses these.
- Federated computational governance
Data governance, says Shuveb, is often misunderstood to be about access control. It is actually about how well an organisation is leveraging the data it has. Federated computational governance is about how you maximise that leverage in a data mesh. Essentially, working from a central set of principles, local data or domain teams decide how to apply them to build the products best suited to their requirements.
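To make the pillars a little more concrete, here is a minimal Python sketch of how a data product with a contract and an SLA might be checked against federated rules. It is an illustration only, not something described in the episode: the `DataProduct` fields, the policy names, and the `payments` domain are all hypothetical. The central principles are written once, and each domain adds its own rules on top.

```python
from dataclasses import dataclass


@dataclass
class DataProduct:
    """Hypothetical descriptor a domain team publishes for one data product."""
    name: str
    domain: str
    owner: str
    schema: dict[str, str]      # column name -> type, acting as the contract
    freshness_sla_hours: int    # the SLA consumers can rely on


# Central principles, written once and applied to every domain.
def has_owner(product: DataProduct) -> bool:
    return bool(product.owner.strip())


def has_schema(product: DataProduct) -> bool:
    return len(product.schema) > 0


# A local rule that one domain team chooses for itself.
def daily_freshness(product: DataProduct) -> bool:
    return product.freshness_sla_hours <= 24


GLOBAL_POLICIES = {"has_owner": has_owner, "has_schema": has_schema}
DOMAIN_POLICIES = {"payments": {"daily_freshness": daily_freshness}}


def violations(product: DataProduct) -> list[str]:
    """Return the names of all central and domain policies the product fails."""
    policies = {**GLOBAL_POLICIES, **DOMAIN_POLICIES.get(product.domain, {})}
    return sorted(name for name, check in policies.items() if not check(product))


orders = DataProduct(
    name="orders",
    domain="payments",
    owner="payments-team",
    schema={"order_id": "string"},
    freshness_sla_hours=48,
)
print(violations(orders))  # ['daily_freshness']
```

Keeping the rules as plain functions is one way to make governance “computational”: they can run automatically whenever a product is published, rather than living in a policy document.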
How did we get here?
Ravi mentioned that most organisations want to become data-driven. However, the ecosystem is still evolving, and accessing data remains very challenging. This lack of accessibility is one need that drove some of the conversations leading to data mesh. Another is the silos between operational and analytical data. Apart from that, if you take a look at data services, they have a way to communicate through APIs (application programming interfaces). But usually, in the data world, warehouses and lakes have been a single access point. There are no good standards for versioning access, nor a way for all the data stored in various silos to be interoperable. It follows that data does not get governed in the right way either. These challenges gave rise to the concept of data mesh.
So you want to get started with the data mesh. What should you do first?
Shuveb had practical advice on how to begin implementing a data mesh within an organisation (start small, test, and scale), and on how it should look to someone on the consumption side. A consumer of the data should be able to see, for example, who created the data, who is using it, how popular it is, how many queries are being run every day, and as much other information about that data as possible. At the end of the day, data mesh is becoming popular because it gets your data to actually work for you, eliminates dark data, and does all this efficiently. Ravi has seen the opportunities and challenges of a data mesh at scale in his previous role at GoJek.
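The consumer-side view described above (who created the data, who uses it, how many queries run each day) could be derived from a query log. The Python sketch below assumes a hypothetical `QueryEvent` record and a `usage_summary` helper. Neither comes from the episode or from any particular catalogue tool.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class QueryEvent:
    """Hypothetical log record: one query run against one dataset."""
    dataset: str
    user: str
    day: date


def usage_summary(dataset: str, created_by: str, events: list[QueryEvent]) -> dict:
    """Summarise what a consumer might want to see about a dataset."""
    hits = [e for e in events if e.dataset == dataset]
    per_day = Counter(e.day for e in hits)
    return {
        "dataset": dataset,
        "created_by": created_by,
        "consumers": sorted({e.user for e in hits}),
        "total_queries": len(hits),
        # Average over days that had at least one query.
        "avg_queries_per_day": len(hits) / len(per_day) if per_day else 0.0,
    }


events = [
    QueryEvent("orders", "asha", date(2024, 1, 1)),
    QueryEvent("orders", "ben", date(2024, 1, 1)),
    QueryEvent("orders", "asha", date(2024, 1, 2)),
    QueryEvent("refunds", "ben", date(2024, 1, 2)),
]
summary = usage_summary("orders", "payments-team", events)
print(summary["consumers"], summary["avg_queries_per_day"])  # ['asha', 'ben'] 1.5
```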
Won’t the cloud platforms provide their own solution that works?
Cloud platforms have tried to approach this and have not succeeded so far. As data mesh began gaining popularity as a concept, everyone expected cloud platforms to implement it. As an example, Confluent has started publishing many blog posts about how it is approaching data mesh. Google Cloud is trying in a slightly different way with Dataflow. They’re trying to figure out how they can tie all these systems together and offer a data mesh. But the important thing is that, no matter what tool you choose, no cloud platform can provide the last-mile experience yet.
We believe exciting things are happening in the space, and this is an ongoing conversation that will truly change the way we access and leverage data.
If you are a data practitioner who would like to be a part of Dataverse, please sign up here or contact me at firstname.lastname@example.org.