Background and Motivation

In the previous articles of this series, I tested the Silk Platform’s capabilities in multiple contexts: both its raw I/O performance in the cloud and its suitability for modern enterprise AI use cases like vector search.

Here’s a list of all articles in this series:

  1. How I Achieved 20 GiB/s I/O Throughput in a Single Cloud VM
  2. Unlock AI Innovation By Boosting Existing Apps with Risk-Free Vector Search
  3. Pushing the Limits of High-Performance AI Vector Search with Postgres on Silk Platform
  4. Scaling Postgres AI Workloads Even Further with Google AlloyDB Omni on Silk Platform (this article)

I first tested Silk Platform’s ability to deliver raw I/O performance with Linux OS tools and achieved 20 GiB/s read I/O throughput, 10 GiB/s sustained write I/O throughput and 1.3M random (8kB) read IOPS – issued from just a single client VM in Google Cloud. As a next step, I set up some AI vector search examples that you could just “plug in” to your existing databases and applications, without any upfront re-engineering or data migration efforts.

Back in 2021, I tested Silk’s capabilities by stress-testing Oracle Database workloads in MS Azure cloud. Silk Platform is not tied to any single cloud provider, so this time I ran all the tests on Google Cloud Platform. I picked Postgres and pgvector vector search as the database workloads to focus on.

Testing with Google AlloyDB Omni

Back in 2023, Google released an enhanced Postgres-based database called AlloyDB Omni that runs anywhere – not only in Google Cloud, but in other clouds and on your on-premises hardware too. AlloyDB Omni is based on the Postgres source code and is fully Postgres-compatible, but it also includes significant performance improvements and the latest cloud integrations, like access to Google’s Vertex AI vector search or machine learning prediction APIs directly from your database.

A notable performance feature of AlloyDB Omni is its in-memory column store. It can greatly accelerate reporting & analytics tasks running within your existing OLTP database – as long as you can cache everything you need in memory. At the database level, you won’t have to “overindex” your transactional tables just to support a single end-of-day batch job.
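
If you want to try the column store yourself, enabling it looks roughly like the sketch below. I’m quoting the flag, extension and function names as Google’s AlloyDB Omni documentation describes them, and the table name is just illustrative – treat this as a sketch to check against the current docs, not as the exact commands from my test runs.

    -- In postgresql.conf (a restart is needed for the engine to allocate its memory):
    --   google_columnar_engine.enabled = 'on'

    -- Then, inside the database:
    CREATE EXTENSION IF NOT EXISTS google_columnar_engine;

    -- Manually add a hot reporting table to the in-memory column store:
    SELECT google_columnar_engine_add('public.order_line');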

This also helps to keep your architecture simpler, as you won’t need to add new databases and replication just for running concurrent reports without impacting online transaction processing. Nevertheless, even if you somehow allocate enough RAM to cache the entire database in memory – you will still need a whole lot of disk I/O for your checkpoints, WAL writes, tempfile write bursts for large sorts & hash joins, etc.
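
You don’t need anything AlloyDB-specific to see where such writes come from – a couple of standard Postgres statistics views already give a rough per-database picture of temp file and WAL volume. A minimal sketch:

    -- Temp file volume caused by large sorts & hash joins spilling to disk:
    SELECT datname, temp_files, temp_bytes
    FROM   pg_stat_database
    WHERE  datname = current_database();

    -- Total WAL bytes generated by this instance so far (counted from LSN 0/0):
    SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0')) AS wal_written;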

As I had already built everything for my Postgres experiments, it made sense to test out Google’s AlloyDB Omni too. After all, it’s fully compatible with Postgres, so I didn’t have to change my existing scripts or application code. I ran all my tests in the same client VM and Silk backend setup in GCP anyway. I was curious to see what kind of performance improvements I’d see when switching to AlloyDB Omni on the same underlying infrastructure, without changing anything else.

I recreated my 1 TB HammerDB TPCC test database in a freshly installed AlloyDB Omni instance (that runs in a Docker container). I then loaded the embedding vectors computed from my “cat customer” photo sets into the database and created pgvector HNSW indexes as before.
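
For reference, the vector search side of the schema followed the usual pgvector pattern – the sketch below uses illustrative table, column and parameter values rather than my exact setup:

    CREATE EXTENSION IF NOT EXISTS vector;

    CREATE TABLE cat_photo_embeddings (
        id        bigint PRIMARY KEY,
        file_name text,
        embedding vector(512)        -- dimensionality depends on your embedding model
    );

    -- pgvector HNSW index for approximate nearest-neighbor search:
    CREATE INDEX ON cat_photo_embeddings
        USING hnsw (embedding vector_cosine_ops)
        WITH (m = 16, ef_construction = 64);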

AlloyDB offers both a customized pgvector HNSW index type and a new Google-provided vector search approach (ScaNN). The ScaNN implementation is based on years of Google’s research and development experience and should be more efficient than the standard Postgres pgvector HNSW index, especially when indexing many millions of vectors. However, I mostly wanted to test how far I could push AlloyDB’s concurrent SQL execution and I/O capabilities in general, so I didn’t jump into comparing the quality and behavior of all these different index types.

I started with an otherwise idle database and first ran a few parallel read-only queries, concurrently scanning through large tables. Interestingly, AlloyDB Omni was able to drive the Linux + pagecache + block I/O subsystem to an 11-14 GiB/s read rate. The “run-it-anywhere” Omni version of AlloyDB was based on Postgres 15 at the time of my tests. The multiblock I/O streaming option was only added to vanilla Postgres in v17.0, yet AlloyDB Omni was still able to drive more I/O for its scans, even compared to the latest Postgres 17. So it looks like the AlloyDB Omni team has added an optimized I/O path into an earlier version of their Postgres engine.
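
The scan workload itself was nothing fancy – just full-scan aggregates over large tables, run concurrently from multiple sessions. The sketch below shows the general shape, using standard Postgres parallelism settings and a table from the HammerDB TPC-C schema:

    -- Encourage parallel sequential scans in this session:
    SET max_parallel_workers_per_gather = 8;
    SET min_parallel_table_scan_size = '8MB';

    -- A simple full-scan aggregate over the biggest TPC-C table:
    SELECT count(*), sum(ol_amount)
    FROM   order_line;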

I looked into the I/O sizes of this workload: the parallel table scans were issuing I/Os of up to 256 kB too (like my later Postgres 17 configuration), still going via the OS pagecache. AlloyDB is definitely doing something interesting with its I/O handling or the closely related buffer access strategy.

Next, I started the HammerDB OLTP benchmark and my vector search loops again, doing lots of concurrent small reads and checkpoint writes. I wanted to see if we’d hit any OS/kernel software-level contention, like in the previous vanilla Postgres tests. A parallel full table scan was still running at the same time too.
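
The vector search loops ran queries of roughly the shape below – a plain pgvector HNSW lookup against the illustrative table from the earlier sketch, with the query embedding passed in as a bind parameter:

    SET hnsw.ef_search = 100;        -- pgvector's HNSW search-time quality/latency knob

    PREPARE cat_knn (vector) AS
      SELECT id, file_name
      FROM   cat_photo_embeddings
      ORDER  BY embedding <=> $1     -- cosine distance operator from pgvector
      LIMIT  10;

    -- EXECUTE cat_knn('[...]');     -- pass a full 512-dimensional query embedding here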

It looks like AlloyDB Omni was able to avoid at least some of the bottlenecks we witnessed before. It was running a serious mix of small reads and large reads for parallel scans, reading up to 6-7 GiB/s of data while at the same time writing 3-4 GiB/s of data for WAL and checkpoints from the database buffer cache. This is still well below the 20 GiB/s read and 10 GiB/s write rates I saw during the raw block I/O tests earlier, so even this AlloyDB Omni I/O test (on a buffered filesystem) is not pushing Silk’s I/O capabilities to the max.

In a benchmark where enough data blocks are cached in memory, you may end up seeing more write I/Os than reads. The writes are for WAL persistence, checkpoints and sometimes tempfile I/O for large sorts and joins. Of course this would change if any large reports and table scans are kicked off in the same database.
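
One easy way to confirm, at the database level, that checkpoints and background writes are where the write I/O comes from is to look at the standard Postgres statistics views, for example:

    -- Checkpoint and background-writer activity (Postgres 15 column names;
    -- in Postgres 17+ the checkpoint counters moved to pg_stat_checkpointer):
    SELECT checkpoints_timed, checkpoints_req,
           buffers_checkpoint, buffers_clean, buffers_backend
    FROM   pg_stat_bgwriter;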

Here’s an example from the Silk performance dashboard during such a benchmark. Out of the total 9-10 GiB/s of bidirectional I/O volume, about half of the bytes moved were reads and the other half writes (~4-5 GiB/s in each direction, sustained):

When looking at the IOPS breakdown above (green colors), about 66% of the I/O operations were writes – likely small writes to random datafile locations issued by Postgres checkpoints, plus some WAL writes. The read IOPS rate was about 2x lower than the write IOPS, but since some of the reads were large I/Os done for full table scans, the total amount of “bytes moved” was still split roughly 50/50 between the two directions. Silk was able to sustain such I/O rates over hours of testing on this configuration.

For heavy OLTP workloads and large data loads, your I/O subsystem must be able to sustain such write activity. Given how NAND SSD media works, you can’t just randomly replace an 8kB chunk at its current offset when a database checkpoint wants to persist some modified block. This is different from how spinning hard drives work. In-place NAND media updates would require first erasing the whole 256kB-to-multi-MB “NAND block” and then rewriting the entire thing – that’s not practical for small writes. Enterprise SSD controllers try to work around this too, but it can cause various latency overheads if you’re hammering the SSDs with constant random writes all the time.

This is why the Silk Platform handles any incoming writes in its backend software layer. It first buffers, compresses and consolidates incoming write requests in memory – and then sequentially appends the resulting larger data chunks to SSDs in its backing store. Silk compute nodes keep track of where the latest version of any data block resides. This allows Silk backends to sustain continuous, multi-gigabyte random write I/O streams, without any sudden SSD performance drops or garbage collection problems that you might get with naive use of NAND SSDs.

In the last example, I reduced the number of OLTP workers (from 192 to 96) and later started multiple concurrent analytic queries that were doing parallel full table scans with large reads. These drove up the aggregate read/write I/O throughput metric to over 10 GiB/s. The read/write IOPS rates were still split at around 50/50 at the end of this test. This shows that the individual writes were now much smaller on average than the reads. Nevertheless, the write I/Os kept going at about a 2 GiB/s rate throughout the whole test (19% of 10.5 GiB/s is about 2 GiB/s of writes):

This is an interesting behavior to observe. We see a brief dip in the IOPS chart in the middle (and increased I/O latency below), right when I kicked off the additional parallel queries. Yet the amount of megabytes written didn’t drop at all, so for a while the system was somehow able to write a similar volume of data with fewer I/O operations! During the dip, it looks like AlloyDB Omni plus the OS pagecache still managed to do around 40k IOPS of ~50 kB writes on average (~2 GB/s). After that, the system resumed doing 100k IOPS of ~20 kB writes, while the large table scans kept going.

I did not dig deeper, but this behavior is probably related to having multiple additional layers and moving parts in the Postgres I/O and data caching paths. As Postgres I/Os go through the OS pagecache, some blocks may have ended up staying longer in the OS filesystem cache and been re-read back into the Postgres buffer cache from there. Such reads from the OS cache would not show up as physical I/Os in any of the block I/O monitoring tools (including the ones I have used here). My previous OS-level tests with fio and Oracle’s direct I/O did not show such metric fluctuations, even under heavier load.

Summary

I ran these tests on the same VM and Silk backend as described in my previous articles. I did not have to change my testing scripts or application code, as AlloyDB Omni is based on Postgres and is fully compatible with it. Plus, you get additional AI-integration APIs and extensions that can access various Google machine learning models and services directly from your database.

As you can see from the results above, Google’s AlloyDB Omni engine was able to get more out of the underlying infrastructure: it showed better I/O-driving capabilities with mixed workloads compared to my previous vanilla Postgres 17 tests on the same setup. Google has clearly improved some critical code paths in AlloyDB’s interaction with the OS & disk I/O, and perhaps its internal buffer cache management too. This allowed AlloyDB to avoid some of the bottlenecks I had seen in my earlier tests and get more work done, faster.

I was still not able to make my AlloyDB workloads drive enough I/O to the Silk datastore to get even close to the original raw I/O throughput numbers in my OS-only tests. This is not surprising, given how many additional layers of complexity and potential software bottlenecks are involved in such high concurrency database workloads.

Experience Silk for High-Performance AI Workloads

See the power of Silk in action! Try our self-guided demo to explore how Silk optimizes workloads, delivering unmatched speed and efficiency for your AI use cases.

Take the Demo Now