We had the pleasure of having SQL guru, trainer, and consultant Markus Winand join us at this spring’s DevTalks, where he took the stage to talk about the evolution of SQL. Earlier that day, he gave us Smartlies an internal training session: a deep dive into PostgreSQL internals and the life of the query planner. Markus is the author of SQL Performance Explained, an excellent book we constantly order new copies of for our engineers to read.
PostgreSQL is our go-to database at Smartly.io, and it powers many of our most mission-critical applications. Our biggest database is the reporting database, which has about 4 terabytes of data, and two of the biggest tables have grown to around 300 and 550 million rows.
The reporting database is accessed through our reporting backend Distillery, which we recently rolled out to serve the entire Smartly.io application. Distillery accepts almost arbitrary queries in a custom query language we've developed in-house, and it safely turns those queries into SQL queries that are then run against the metrics database.
Distillery powers, for example, our re-engineered Pivot Table reporting view as well as all the performance metrics seen in various parts of our application. This means we're heavily dependent on the performance of the PostgreSQL database for running the Smartly platform.
As fans of Markus’ work, we were humbled and excited to get the chance to brush up on our own SQL skills with him. Kudos to Markus for sharing his expertise with us; we’d like to use this opportunity to share what we learned from him with our blog readers.
1. Understanding the internals of PostgreSQL
SQL performance starts with understanding how the database engine works internally. SQL is a domain-specific programming language with the purpose of describing how data is transformed. It is declarative by nature; using SQL we only need to specify what data is needed, not how to get it.
This is probably one of the biggest triumphs of SQL: it has made it possible to create general-purpose optimizers, or planners, that figure out how to process a query most efficiently and fetch the data without anyone having to write a plan for each use case by hand.
The planners don't free us from the responsibility of understanding what happens under the hood altogether. The planner is good at choosing between a good plan and a bad one, but not so good at choosing between two mediocre plans. That's why it is so important to know what the database is doing and understand how it makes those choices.
2. Enter the planner
If we zoom in a bit, the first component the database engine feeds the SQL query into is the planner. First, the planner makes sure that the SQL is syntactically sound and that it doesn’t violate the language constraints. It also checks schema-level constraints, such as the existence of the columns referenced in the query. This is much like what the parser and compiler do in a compiled programming language; syntax errors, for example, are thrown at this point of the query execution.
Next, the planner performs the most interesting operation: it creates the execution plan. Declarative SQL doesn’t contain instructions that tell the processor how to access the data we want. The execution plan, on the other hand, describes the imperative steps needed to fulfill the query. As there are plenty of different ways to fulfill a query, the planner also acts as an optimizer in this compilation process, choosing the steps that are likely to be optimal for the query.
The execution plan can contain all kinds of things: scan an index looking for a certain value, do a full table scan, join these two datasets using a hash join, sort the result set, and so on. The plan can be thought of as a branched pipeline where the dataset collected from the previous step (or the results of two branches) is fed to the next operation until the whole plan is finished.
3. Ready, set, Execute!
The execution plan then goes to the Executor, which runs the query. It executes the steps in the execution plan one by one and returns the final dataset to the client. If there are no errors, the client will see a nice table of results, as we're talking about SQL after all.
The execution might run into problems too: there might be a runtime error, the database can run out of memory, or a runtime constraint violation might occur, among other things. Anything unexpected that cannot be anticipated by static analysis of the SQL query alone triggers an error at the execution stage.
4. Can you please EXPLAIN?
An easy way to see what plans the Planner has made for us is to run an EXPLAIN query. Any query prepended with the EXPLAIN keyword will only go as far as the query planner and instead of passing the execution plan to the executor, the database engine returns a textual representation of that plan to the client.
It is worth noting that EXPLAIN only returns the plan the planner thinks is the best way to run the query. We'll get to the heuristics of the planner in a bit, but the important thing here is that the query has not been run. If we were to run the query for real, the cost of the plan and the intermediate result set sizes might be close to what the planner estimated, or they might be something else completely. This discrepancy is often the cause of bad SQL performance when the query looks otherwise fine, the indexes are sensible, and the data is structured in a sane way.
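As a sketch of what this looks like in practice (the table, index name, and all numbers below are illustrative, not from our actual database), an EXPLAIN query and the kind of plan it might return could look like this:

```sql
-- Ask the planner for its chosen plan; the query is not executed.
EXPLAIN SELECT product_id
FROM product_attributes
WHERE attribute = 'color';

-- Possible output (cost figures and row counts are estimates only):
--  Index Scan using product_attributes_attribute_idx on product_attributes
--    (cost=0.43..1512.32 rows=38000 width=4)
--    Index Cond: (attribute = 'color'::text)
```

The `rows` figure here is the planner's guess, which is exactly the number that can diverge from reality.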
There's a second form of EXPLAIN we can use to get to the bottom of these kinds of situations: EXPLAIN ANALYZE. Instead of relying on the planner's estimates alone, EXPLAIN ANALYZE causes the database to execute the query, but instead of returning the result, it returns the execution plan annotated with actual measurements of how many rows were involved in each step and how long each step took. Do note that with destructive queries like UPDATE or DELETE, EXPLAIN ANALYZE will actually perform the operation, so you might want to wrap it inside a transaction and roll back afterwards.
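The transaction trick mentioned above might look like this (the table and condition are hypothetical):

```sql
-- EXPLAIN ANALYZE really executes the statement, so wrap
-- destructive queries in a transaction and roll back afterwards.
BEGIN;
EXPLAIN ANALYZE DELETE FROM orders WHERE created_at < '2015-01-01';
ROLLBACK;  -- the DELETE is undone, but we still saw real row counts and timings
```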
5. The Plan, analyzed
So, how does the planner know what kind of query plan would be good for each situation? Is a full table scan slower than an index scan combined with row access? It depends. For sequential table access, the cost is proportional to the table size, but for index-only access, the cost goes up logarithmically. PostgreSQL stores all information to disk in blocks of 8KB, so loading a table that fits into a couple of blocks is probably faster than looking up the index and then loading the relevant pages.
The PostgreSQL planner selects the best strategy by estimating the total cost of the individual steps of the query plan and summing those together. The planner tries this with different query plans, and the plan with the lowest cost wins.
How can the planner then know the actual cost of the operations? First of all, it of course knows the structure of your database and the relevant indices. However, that’s not enough: how many rows are going to be affected by a certain operation? The relevant statistics are mainly stored in the pg_class and pg_statistic system catalogs. The pg_class catalog is maintained mostly automatically, and pg_statistic is populated by an ANALYZE task that PostgreSQL runs automatically in the background, as part of the autovacuum process, collecting statistics about the database tables.
The pg_class catalog contains information such as how many pages (data blocks on disk) and how many tuples (rows) a table has. pg_statistic contains more interesting per-column data: the most common values and their respective frequencies, the number of distinct values, and the fraction of null values.
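These statistics can be inspected directly. As a sketch, using the product attributes table from later in this post as an example (the table and column names are our own):

```sql
-- Table-level statistics from pg_class: pages on disk, estimated row count.
SELECT relpages, reltuples
FROM pg_class
WHERE relname = 'product_attributes';

-- Column-level statistics via the pg_stats view, a readable face of
-- pg_statistic: common values and frequencies, distinct-value estimate,
-- and the fraction of nulls.
SELECT most_common_vals, most_common_freqs, n_distinct, null_frac
FROM pg_stats
WHERE tablename = 'product_attributes'
  AND attname = 'attribute';
```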
Using all of this information—database structure, index information, and table and column statistics—allows the planner to make educated guesses of how many rows will be affected in each step of the query and what is going to be the total cost of the query.
6. When the numbers fall short
However, there are certain cases where the planner can go wrong even when it’s armed with all of the information available. There are cases when correlation exists between the data that is not apparent to the planner. For example, consider a case where we store product attributes for a shop inventory as rows in the database; there is a column for the type of the attribute (for example color, size, or material) and the corresponding value (for example red, M, or wool). When querying for products that are white, we might write a query like this:
SELECT product_id FROM product_attributes WHERE attribute = 'color' AND value = 'white';
The PostgreSQL planner knows the distributions of the color and white values in their respective columns, but it doesn’t know that the value white only appears on rows where the attribute is color. In general, adding more conditions to a WHERE clause makes the planner expect a smaller result set, so here the planner will greatly underestimate the final number of rows selected: adding the attribute condition to the WHERE clause does not actually make the result set any smaller.
Another case where strongly correlated columns can throw off the query planner is GROUP BY. In general, adding more grouping columns is assumed to increase the number of results. In our case, additionally grouping by the attribute doesn’t grow the result set, so the planner’s estimate will be a lot bigger than it is in reality.
In PostgreSQL 10, it’s possible to tell the planner about these kinds of correlations. We can create multi-column statistics that will let PostgreSQL planner map these correlations.
CREATE STATISTICS ... (dependencies, ndistinct) ON column1, column2 FROM table_name;
The dependencies flag creates statistics that help in the first case, and the ndistinct flag creates statistics that help with the second case.
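Applied to the product attributes example above, this might look like the following (the statistics object's name is our own choice):

```sql
-- Tell the planner that attribute and value are correlated columns.
CREATE STATISTICS product_attr_stats (dependencies, ndistinct)
  ON attribute, value
  FROM product_attributes;

-- The extended statistics are filled in by the next ANALYZE run.
ANALYZE product_attributes;
```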
A great learning experience
Thanks to the session with Markus, we got some excellent insights into the internals of PostgreSQL. Looking at SQL queries from the perspective of the planner and the database engine internals really makes a difference in understanding why queries are fast or slow.
We also learned a lot about the enhanced statistics in the PostgreSQL 10 release that hopefully will help us solve some of the problems we’re facing with the query planner making assumptions that are just plain silly in our more gnarly database queries.
The Smartly.io culture is very much about maximizing learning and sharing knowledge. Our Engineering team is adamant about constantly developing their know-how of the technologies they use, and Markus’ visit was a perfect opportunity to bring in new knowledge from outside the company. We’d like to thank Markus once again, and we’re excitedly waiting for the next opportunity to learn from the best.
Would you like to work in a team that’s relentlessly learning new things? Learn more about our Engineering team and who we’re looking for at smartly.io/developer.