When setting up a modern data stack, data warehouse modeling is often the very first step. It is important to create an architecture that supports the data models you wish to build. I often see people jump straight into writing complex transformations before thinking about how they want to organize the databases, schemas, and tables within their warehouse. To succeed, design your data warehouse with your models in mind before you start the modeling process.
Data warehouse modeling is the process of designing and organizing your data models within your data warehouse platform. The design and organization process consists of setting up the appropriate databases and schemas so that the data can be transformed and then stored in a way that makes sense to the end user.
When modeling your data warehouse, it is important to keep access controls and the separation of environments in mind. You need to consider which users have access to certain databases, schemas, tables, and views. You also need to keep your development and production environments separate but similar in how they function.
First, data engineers ensure data is being properly recorded by the frontend and backend processes on the website and by the other systems in use. Capturing this raw data is necessary so that it can be ingested into your warehouse. Rather than working directly with the data itself, which tends to be more of an analytics role, data engineers ensure the right processes are in place so that analytics has the data it needs.
Analytics engineers, like me, focus on writing dbt data models that help to transform the raw data. However, they can also be in charge of setting up the data warehouse depending on the organization and the size of the data team. Because the analytics engineer is working so closely with the transformed data, it often makes sense for them to decide on a structure for the models they are building. Because dbt itself is so intertwined with data warehouse modeling, it can be easiest to have the same person work on these two things.
Now, what does data warehouse modeling entail? There are three main types of data models, made popular by the use of dbt. Each of these models serves a different purpose and is set up differently within the warehouse. We will discuss what each of these is for and where they should sit in your data warehouse.
Base models (or staging models, as dbt now calls them) are views that sit directly on top of your raw data. They include basic casting and column renaming to help keep your standards consistent across different data sources. It is in these models that you decide which type of timestamp to use, how to name date fields, whether to use camel case or snake case for naming your columns, and how to define primary keys. Data analysts then reference these base models in their reports and dashboards rather than the raw data tables.
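As a rough sketch, a staging model in dbt might look like the following; the raw_shop source and all column names here are hypothetical:

```sql
-- models/staging/stg_orders.sql
-- A minimal staging model sketch: rename columns and cast types so
-- naming and typing standards stay consistent across sources.
select
    id as order_id,                                  -- primary key, named once
    customer as customer_id,
    cast(amount as numeric(10, 2)) as order_amount,  -- consistent numeric type
    cast(created as timestamp) as created_at         -- consistent timestamp type
from {{ source('raw_shop', 'orders') }}
```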
Intermediate models are important when using a tool like dbt that helps make your data transformations modular. The purpose of intermediate models is to reduce the time it takes for your data models to run and to make it easier for analytics engineers to debug more complex models.
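For illustration, an intermediate model might perform a join once so that every downstream model can reuse it instead of repeating it; the stg_customers model below is hypothetical:

```sql
-- models/intermediate/int_orders_joined_customers.sql
-- A hypothetical intermediate model: join two staging models a single
-- time so core models can reference the result rather than re-joining.
select
    orders.order_id,
    orders.order_amount,
    orders.created_at,
    customers.customer_id,
    customers.customer_region
from {{ ref('stg_orders') }} as orders
left join {{ ref('stg_customers') }} as customers
    on orders.customer_id = customers.customer_id
```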
Core models are your data models that produce a fully transformed dataset that can be used by data analysts and business stakeholders. They are the final product in your transformation process! Core models reference base models and intermediate models to produce a final dataset. They can be simple dimension tables that join related base models or complex data models with lots of underlying logic.
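Continuing the hypothetical example above, a core model could aggregate the intermediate model into the final dataset analysts query:

```sql
-- models/core/fct_daily_revenue.sql
-- A sketch of a core model: aggregate the intermediate model into a
-- final, analyst-ready table. Names carry over from the sketches above.
select
    cast(created_at as date) as order_date,
    customer_region,
    count(order_id) as order_count,
    sum(order_amount) as total_revenue
from {{ ref('int_orders_joined_customers') }}
group by 1, 2
```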
Lastly, there is your production database. This database is the most important because it is the one everyone else on your team will use to access your data models. It is the place they turn to for high-quality, trusted data. Your production database should mimic the structure of your development database. This means it should have two schemas: one for intermediate models and another for core models. By keeping the structure the same, you can ensure your models will behave as expected after testing in development.
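One way to route models into those schemas is dbt's schema configuration; a minimal sketch, assuming a model that belongs in the core schema (note that, by default, dbt prefixes the custom schema with the target schema unless you override the generate_schema_name macro):

```sql
-- Route this model into the core schema via an in-model config.
-- By default dbt builds it in <target_schema>_core unless the
-- generate_schema_name macro is overridden.
{{ config(schema='core') }}

select * from {{ ref('int_orders_joined_customers') }}
```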
When performing data warehouse modeling in the cloud as compared to on-prem, you have many more features to use to your advantage. With tools like dbt and cloud data platforms such as Amazon Redshift, Databricks, Snowflake, or Google BigQuery, you can build your data warehouse with incremental models, which allow you to run your transformation code on new data only, saving on compute costs and runtime.
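As a hedged sketch of the pattern, an incremental model in dbt typically looks like this; on the first run it builds the whole table, and on later runs it processes only rows that arrived since:

```sql
-- A minimal incremental model sketch, reusing the hypothetical
-- stg_orders model from earlier.
{{ config(materialized='incremental', unique_key='order_id') }}

select
    order_id,
    order_amount,
    created_at
from {{ ref('stg_orders') }}

{% if is_incremental() %}
  -- on incremental runs, transform only rows newer than what's stored
  where created_at > (select max(created_at) from {{ this }})
{% endif %}
```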
Cloud data warehouses also allow you to take advantage of features like views, rather than needing to create a table that would take up space and be more expensive to maintain. Because views only sit on top of other tables, you pay only for the queries you run against them rather than for storage space. This helps to reduce costs and keeps your warehouse data clean through the standards you put in place with base models.
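In dbt, materializing a base model as a view is a one-line config (it can also be set per folder in dbt_project.yml); a sketch using the hypothetical source from earlier:

```sql
-- Build this staging model as a view: nothing is stored, so you pay
-- only for the queries that run against it.
{{ config(materialized='view') }}

select
    id as order_id,
    cast(created as timestamp) as created_at
from {{ source('raw_shop', 'orders') }}
```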
Lastly, cloud data platforms provide elastic compute, allowing you to dynamically scale your resources up and down. Snowflake, for example, allows you to change your warehouse size depending on how fast you need your data models to run. If you find a warehouse too expensive, you can always size it down. This ability adds a lot of flexibility and ensures your infrastructure can grow as the business grows.
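In Snowflake, resizing is a single statement; the warehouse name below is hypothetical:

```sql
-- Scale a (hypothetical) warehouse up before a heavy transformation run...
alter warehouse transforming_wh set warehouse_size = 'LARGE';

-- ...and back down afterward to save credits.
alter warehouse transforming_wh set warehouse_size = 'XSMALL';
```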
When modeling your data warehouse, you need to build your architecture with base, intermediate, and core models in mind. Base models are necessary to protect your raw data and create consistent naming standards across different data sources. Intermediate models act as the middleman between base and core models and allow you to build modular data models. Core models are the final transformation product utilized by the data analyst.
Madison Schott is an analytics engineer for Winc, a wine subscription company that makes all its own wines, where she rebuilt its entire modern data stack. She blogs about analytics engineering, data modeling, and data best practices on Medium. She also has her own weekly newsletter on Substack.
Data modeling shapes raw data into the story of a business, as well as establishes a repeatable process that can help create consistency in a data warehouse: how schemas and tables are structured, models are named, and relationships are constructed. At the end of the day, a solid data modeling process will produce a data warehouse that is navigable and intuitive, with data models that represent the needs of the business.
This page is going to cover the four most common data modeling techniques we see used by modern analytics teams (relational, dimensional, entity-relationship, and data vault models), what they are at a high level, and how to unpack which one is most appropriate for your organization.
Dimensional data modeling is a type of relational model that puts entities into two buckets: facts and dimensions (aka the bread and butter of analytics work). Dimensional modeling is one of the most predominant types of data modeling used in modern data stacks, as it offers a unique combination of both flexibility and constraint.
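As a small illustration, assuming hypothetical table and column names, a dimension table describes an entity while a fact table records measurable events that reference it:

```sql
-- A dimension table: descriptive attributes of an entity.
create table dim_customers (
    customer_id     integer primary key,
    customer_name   varchar,
    customer_region varchar
);

-- A fact table: one row per measurable event, keyed to its dimensions.
create table fct_orders (
    order_id     integer primary key,
    customer_id  integer references dim_customers (customer_id),
    order_date   date,
    order_amount numeric(10, 2)   -- the measure being analyzed
);
```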
Another type of relational model, entity-relationship (ER) data models have entities at the heart of their modeling. ER modeling is a high-level data modeling technique based on how entities, their relationships, and their attributes connect together.
Many data sources you ingest into your data warehouse via an ETL tool will have ERDs (entity relationship diagrams) that your team can review to better understand how the raw data connects together. Slightly different from an ER model itself, ERDs are often used to represent ER models and their cardinality (e.g., one-to-one, one-to-many) in a graphical format. These ERDs will often look a little like the relational model shown earlier, demonstrating how tables connect together. Using these diagrams with a data modeling technique of your choice, such as dimensional modeling, helps data teams efficiently wade through raw data and create business entities of meaning.
Data vault architecture was invented to make data changes easy to track by adopting an insert-only mindset. In a classical data model, a change to a row either modifies the existing row or adds a new one; in a data vault world, data updates are represented only by new rows.
A considerable amount of data vault modeling can feel very repetitive or prescriptive given the consistent structure of hubs, links, and satellites. With dbt, you can use the dbtvault package to speed up the development time of fundamental data vault models to focus on writing the SQL that really matters to your business.
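As a hedged sketch of what that looks like, the dbtvault package provides macros such as dbtvault.hub for generating those repetitive structures; the staging model and column names below are hypothetical:

```sql
-- A hub model sketch using the dbtvault package's hub macro.
{%- set source_model = "stg_customer" -%}
{%- set src_pk = "customer_hk" -%}       -- hashed business key
{%- set src_nk = "customer_id" -%}       -- natural (business) key
{%- set src_ldts = "load_datetime" -%}   -- load timestamp
{%- set src_source = "record_source" -%} -- source system identifier

{{ dbtvault.hub(src_pk=src_pk, src_nk=src_nk, src_ldts=src_ldts,
                src_source=src_source, source_model=source_model) }}
```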
Ready to learn more about how dbt can support your data modeling efforts? Take a look at some of the resources below to see how modern data teams are transforming the way they tackle data modeling with dbt:
Data modeling is the process of creating a visual representation of either a whole information system or parts of it to communicate connections between data points and structures. The goal is to illustrate the types of data used and stored within the system, the relationships among these data types, the ways the data can be grouped and organized and its formats and attributes.
Data modeling employs standardized schemas and formal techniques. This provides a common, consistent, and predictable way of defining and managing data resources across an organization, or even beyond.
Ideally, data models are living documents that evolve along with changing business needs. They play an important role in supporting business processes and planning IT architecture and strategy. Data models can be shared with vendors, partners, and/or industry peers.
Like any design process, database and information system design begins at a high level of abstraction and becomes increasingly concrete and specific. Data models can generally be divided into three categories, which vary according to their degree of abstraction. The process starts with a conceptual model, progresses to a logical model, and concludes with a physical model. Each type of data model is discussed in more detail in subsequent sections.