CYBERTEC PostgreSQL Logo

Citus: Sharding your first table

09.2023 / Category: / Tags: |

Citus is a capable sharding solution for PostgreSQL. It solves a ton of scalability issues: these can be addressed using a sharding approach. We at CYBERTEC have used Citus for some time and can wholeheartedly recommend it (check out our services to find out more).

Since the need for PostgreSQL sharding is constantly growing, we thought we’d share some of our knowledge. Here’s how to shard a table from scratch.

Setting up CitusDB

Most people run Citus directly by firing up a couple of containers which already contain a working version of the solution. This is a good way to get started; it helps to shave off some of the overhead associated with installation and configuration.

Running a Citus Docker container

Here’s how to run a Citus Docker container:

This method will provide you with a working solution in minutes.

Configuring Citus manually

However, if you want to get your hands dirty and find out what’s going on behind the scenes, you can also configure PostgreSQL manually to run Citus. After installing the binaries, it is necessary to load the Citus library directly at startup. The way to do that is to add the citus extension to shared_preload_libraries in postgresql.conf:

I found it super useful to also add pg_stat_statements to the variable, in order to monitor performance as professionally as possible. Note that “citus” has to be the first extension in the list. Otherwise, the server won’t start.

Adding worker nodes

Once this is done, we can provide a set of database instances which we’ll run. In my case I prepared 5 empty instances with the following setup:

  • A coordinator: localhost, port 5432
  • 4 database worker nodes: localhost, port 6001 - 6004

These are empty instances which have been given the configuration discussed above. First we have to set the coordinator node, then we can add the worker shards.

The shards are really - well, just shards.

You are not supposed to run operations there. Citus is even kind enough to remind us of this fact when we try to do things we are not supposed to do:

This is relevant because it ensures consistency and avoids shards getting out of control. Centralising things through the coordinator is a good idea, and helps to keep your cluster healthy.

Creating a sharded table in PostgreSQL

After successfully configuring Citus it is time to create a table. The creation of the table itself is simple. All it takes is a simple CREATE TABLE command. However, this does not mean that the new relation is already shared: We have to tell Citus to actually shard the table and define the key which we want to use to split the data. In my case the sharding key is called shard_key and is part of the table.

Once the table has been created, it's time to add some data.

The beauty here is that we can simply send data to the table normally and Citus will take care of the rest for us automatically:

See here how fast things actually are.

5 million rows have been inserted on a simple Mac Mini machine (local desktop). Still, the entire process takes hardly more than 3 seconds. It seems parallelism indeed pays off.

What is noteworthy about this is the size of the tables:

Don’t trust d+ here. The table has no “size” on the PostgreSQL side as we are using Citus storage. To inspect the real size of things we have to take a look at the Citus meta data:

Now we can see the real size of the table. Why does it matter? Well, many monitoring tools are using the standard PostgreSQL functions to gather size information. Monitoring has to be made aware of the existence of sharded tables. If you are using pgwatch, you can easily add metrics to take that into account.

Querying a sharded table

Now that we have created a table and inserted some data, it's important to see how the query is actually handled. We can run a count ( * ) and see which kind of execution plan we'll get. It will enlighten us on the performance side of things:

PostgreSQL will distribute the query for us and send it to all shards. Note that PostgreSQL will not show the operations sent to each shard but just show things for one. However, those shards are created equal so we don’t care much.

If we look at a specific value in the shard_key column, we'll see that only one shard is addressed:

The “Node” inside the plan shows that data is only comes from the instance running on port 6001, which tells us that we don’t have to scan all shards. That's super beneficial.

Sharding does not substitute indexing

Often, sharding is envisioned as a fix for performance. The truth is, it is ONE step to good performance, but it does not substitute thinking and indexing.

Consider the following query:

Citus will turn on the sharding machinery, query things in parallel, and return a result by going through ALL the data. In reality, this is really bad and throwing more hardware at the problem is going to make things even worse.

As is so often the case, the solution to the problem is to use indexing:

As you can see, after indexing, performance improves by orders of magnitude:

This is orders of magnitudes faster than having no index, but of course also orders of magnitudes slower than working on a local, non-sharded database. In other words: There is no magic solution to performance - not even sharding.

Finally …

Check out our other blog posts about sharding, indexing and performance:

 


In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Facebook or LinkedIn.

Leave a Reply

Your email address will not be published. Required fields are marked *

CYBERTEC Logo white
Get the newest PostgreSQL Info & Tools


    This site is protected by reCAPTCHA and the Google Privacy Policy & Terms of Service apply.

    ©
    2024
    CYBERTEC PostgreSQL International GmbH
    phone-handsetmagnifiercrosscross-circle
    linkedin facebook pinterest youtube rss twitter instagram facebook-blank rss-blank linkedin-blank pinterest youtube twitter instagram