Sample & Test Data for Data Warehousing

Sample & Test Data for Data Warehousing

For those of use who like to learn and experiment (and particularly if you want to teach/provide examples), sample data is a must. Professionals have access to their real world data (which we can’t share in whole or in part) to learn with, but even then dummy data/sample data can be useful as we can do what we like with it, whenever we like.

In this post, I’m linking to and summarising some sample data sets which are shared with licenses allowing people to use and reuse them with a lot of (if not complete) freedom. Because they are open sourced (or with permissive licences), many have been ported to multiple database servers. If one hasn’t been ported tot he database that you desire, then AI (or my db2seed tool) can help you convert them.

Databases By Creator

I’ll start with some common sample databases, generally produced by the training department of each DBMS (or in the case of dbt, the tool developers).

DBT Labs

DBT labs has a number of sample projects, which are listed under their Are there any example dbt projects? page. One key advantage of these projects is they support any database supported by dbt as the data is provided as dbt seeds (csv files), and loaded by dbt.

Jaffle Shop

This comes in two flavours:

The jaffle-shop project has a sister project to generate more data jaffle-shop-generator: https://github.com/dbt-labs/jaffle-shop-generator

Oracle (MySQL)

Employees Sample Database

licensed under the Creative Commons Attribution-Share Alike 3.0 Unported License.

Sakila Sample Database

licensed under the New BSD license, and has been ported to other databases.

Microsoft

It came as a surprise to me that Microsoft shares their sample databases under an MIT licence (Microsoft has been good at making samples available, but the MIT licence enables the databases and data within to be ported for other databases).

Northwind and pubs sample databases

Two sample databases packaged together, which have been ported to other database engines including PostgreSQL.

AdventureWorks

Also available for PostgreSQL.

Other Sample Data Locations

kaggle.com

Kaggle is a data science competition platform and online community for data scientists and machine learning practitioners under Google LLC. Less appropriate for a lot of Analytics Engineering purposes, but a good mention here.

AI/LLM Generated Data

My original simple shop model (used for very basic testing) was created by a request to ChatGPT, stating that I wanted a dataset consisting of people, products, shopping carts, shopping cart items. I’ve since ported the CSVs to SQL for:

Python Generated Data

I took my idea for the “AI/LLM Generated Data” and my Simple Shop Model, and asked ChatGPT:

Create python code which uses the faker library and the sqlalchemy library to generate sample data for a database which has tables for customers, products, shopping carts, shopping cart items, support agents, and support requests

The sample application (mostly created with ChatGPT, and hand edited for some tweaks) can be viewed here: https://github.com/d-roman-halliday/simple_shop_data_generator

Other Datasets

If none of the above fit your needs, then these links might be worth investigationg:

Loading Data

In most cases, sample data is produced in a format for loading into the respective database. Many of the samples above for database specific sample datasets have been ported for other databases.

If a dataset hasn’t been migrated, then an easy way to do it (raw data, not including procedures and other database logic) could be to use my python project db2seed. Which is a simple script that:

  1. Reads database connection details from a YAML config file.
  2. Connects to the database using sqlalchemy.
  3. Lists tables in the database.
  4. Enables selection of individual tables.
  5. Exports selected tables to CSV.
  6. Generates a dbt seed configuration YAML file to define data types.

Summary

Hopefully the links in this post are useful to Analytics Engineers looking for sample data for their own projects/learning. As well as approaches for working with the data.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.