Congratulations! You’ve successfully turned your software into a SaaS offering. You might be ready to log off and head out on some much-needed PTO, but there are a few things you should do to ensure that the offering is a success. We’ve collected a number of the most impactful practices that successful SaaS platform operators have shared. These also extend into many other application environments, so even if you aren’t a SaaS administrator, these practices can still be put into place.

Monitor the Golden Signals

Monitoring is an absolute necessity for understanding the state of the environment. This also includes the concept of observability, which has roots in control theory and mathematical systems: you understand the state of the environment using key metrics and signals, and pair that with control systems to measure the effects of making a change. It’s also critical for any troubleshooting and operations.

The “Golden Signals” for monitoring database application performance refer to a set of key metrics that provide valuable insights into the health, efficiency, and overall performance of a database system. These signals help identify potential bottlenecks, measure resource utilization, and ensure optimal database application performance.

  • Latency – Latency measures the time it takes for a database to respond to a query or operation. Monitoring latency helps assess the responsiveness and efficiency of the database. High latency can indicate performance issues such as slow query execution, resource contention, or network bottlenecks.
  • Throughput – Throughput represents the number of queries or operations a database can handle within a given time period. Monitoring throughput helps understand the database’s capacity to handle workload demands. Sudden drops or fluctuations in throughput could indicate resource limitations or performance bottlenecks.
  • Error Rate – The error rate measures the proportion of failed or erroneous database operations. Monitoring the error rate helps identify issues such as misconfigurations, data integrity problems, or constraint violations. A sudden increase in error rate can indicate issues affecting database availability or data consistency.
  • Resource Utilization – Monitoring resource utilization, including CPU, memory, disk, and network, provides insights into how the database utilizes available resources. High resource utilization can indicate bottlenecks or the need for resource scaling to ensure optimal performance.
  • Connection Pooling and Connection Usage – Tracking connection pooling metrics helps monitor the efficiency of database connections. Metrics such as connection pool size, connection acquisition time, and connection usage patterns can identify connection-related performance issues or potential connection leaks.
  • Query Execution Time – Monitoring query execution time helps identify slow-performing queries. Identifying and optimizing long-running or resource-intensive queries can significantly improve overall database performance.
  • Database Locking and Deadlocks – Monitoring locking and deadlock metrics provides insights into concurrent transactional operations. Detecting excessive locking or frequent deadlocks helps optimize transactional workflows, minimize contention, and ensure consistent database performance.

It’s important to note that the specific metrics and thresholds for these Golden Signals may vary depending on the database system, workload characteristics, and application requirements. Customizing the monitoring approach to align with the specific needs of your database environment is crucial for effective performance management and troubleshooting.
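
As a hedged illustration of collecting a few of these signals, the sketch below queries utilization and error metrics for an Azure SQL database using the Az.Monitor module’s Get-AzMetric cmdlet. The subscription, resource group, server, and database names are placeholders, and the metric names shown (cpu_percent, dtu_consumption_percent, connection_failed) are common Azure SQL metrics; confirm which metrics your service tier actually emits before relying on them.

# Assumes the Az.Monitor module is installed and you are signed in (Connect-AzAccount)
$resourceId = "/subscriptions/<subscription-id>/resourceGroups/YourResourceGroup/providers/Microsoft.Sql/servers/YourAzureSQLServer/databases/YourDatabase"

# Pull the last hour of utilization and error signals at a 5-minute grain
Get-AzMetric -ResourceId $resourceId `
    -MetricName "cpu_percent", "dtu_consumption_percent", "connection_failed" `
    -StartTime (Get-Date).AddHours(-1) -EndTime (Get-Date) `
    -TimeGrain 00:05:00 -AggregationType Average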

Automate often

Reduce risk, errors, and effort by automating as much of your infrastructure and application operations as possible. Scripting and automation are related but distinct: scripts are the artifacts, while automation wraps those artifacts with inputs, triggers, and outputs.

  • Script deployments using common tools (e.g. Terraform, Ansible, Chef)
  • Design automation to be proactive with monitoring where possible
  • Set up alerts and notifications, but be careful about signal vs. noise

Automation that the team uses day to day is great. It’s even better once you build systems that perform automatic recovery and handle cyclic processes, which cuts down on unnecessary manual work and reduces the risk of configuration and operational errors.
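
For example, a minimal sketch of proactive, monitoring-driven alerting might look like the following, using the Az.Monitor cmdlets New-AzMetricAlertRuleV2Criteria and Add-AzMetricAlertRuleV2. The resource IDs, action group, and the 80% CPU threshold are illustrative assumptions, and the exact parameter set can vary between Az.Monitor versions, so verify it against your installed module.

# Assumes Az.Monitor is installed and an action group for notifications already exists
$criteria = New-AzMetricAlertRuleV2Criteria -MetricName "cpu_percent" `
    -TimeAggregation Average -Operator GreaterThan -Threshold 80

Add-AzMetricAlertRuleV2 -Name "sql-cpu-high" -ResourceGroupName "YourResourceGroup" `
    -TargetResourceId "/subscriptions/<subscription-id>/resourceGroups/YourResourceGroup/providers/Microsoft.Sql/servers/YourAzureSQLServer/databases/YourDatabase" `
    -Condition $criteria -WindowSize 00:05:00 -Frequency 00:05:00 -Severity 2 `
    -ActionGroupId "/subscriptions/<subscription-id>/resourceGroups/YourResourceGroup/providers/microsoft.insights/actionGroups/YourActionGroup"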

Use managed services

Why reinvent the wheel? Managed services are available for many regularly used products and processes like centralized logging, file storage, authentication and authorization, to name a few.

The advantage of a managed service offering is that it gives you the right level of abstraction with less complexity. There are trade-offs in functionality, customization, cost, and performance to weigh as you decide between DIY and managed services.

A good example would be using Microsoft Azure AD (soon to become Microsoft Entra ID) rather than designing your own authentication directory. Scaling and data consistency are built-in and the system is built for continuous global availability by design.

Use Scripting and Infrastructure-as-Code

For example, to automate the scaling of Azure SQL, you can utilize Azure Automation and PowerShell.

Below are code examples demonstrating how to trigger a scaling process:

Scaling Azure SQL Database Performance Level:

# Authenticate and connect to Azure
Connect-AzAccount

# Define variables
$resourceGroup = "YourResourceGroup"
$serverName = "YourAzureSQLServer"
$databaseName = "YourDatabase"
$performanceLevel = "S2" # Target performance level (e.g., S2, P1, P2)

# Scale Azure SQL Database
Set-AzSqlDatabase -ResourceGroupName $resourceGroup -ServerName $serverName -DatabaseName $databaseName `
    -Edition "Standard" -RequestedServiceObjectiveName $performanceLevel

In the above example, the code uses the Set-AzSqlDatabase cmdlet to scale the Azure SQL Database. You need to provide the appropriate values for the resource group, server name, database name, and the desired performance level (e.g., S2, P1, P2).

Scaling Azure SQL Database Storage:

# Authenticate and connect to Azure
Connect-AzAccount

# Define variables
$resourceGroup = "YourResourceGroup"
$serverName = "YourAzureSQLServer"
$databaseName = "YourDatabase"
$storageSizeInGB = 250 # Target storage size in GB

# Scale Azure SQL Database storage
Set-AzSqlDatabase -ResourceGroupName $resourceGroup -ServerName $serverName -DatabaseName $databaseName `
    -Edition "Standard" -MaxSizeBytes ($storageSizeInGB * 1024 * 1024 * 1024)

In this example, the code uses the Set-AzSqlDatabase cmdlet to scale the Azure SQL Database storage. Specify the resource group, server name, database name, and the desired storage size in GB using the MaxSizeBytes parameter.

These code examples leverage the Azure PowerShell module, so make sure you have it installed and authenticated before running the scripts. You can execute the scripts as standalone PowerShell scripts or incorporate them into an Azure Automation Runbook for scheduled or event-driven automation.
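
As a rough sketch of the Runbook route, assuming an Azure Automation account named YourAutomationAccount already exists and the scaling script has been imported as a runbook called Scale-AzureSqlDatabase (both names are hypothetical), you could attach a recurring schedule like this:

# Create a daily schedule and link it to the imported scaling runbook
New-AzAutomationSchedule -ResourceGroupName "YourResourceGroup" `
    -AutomationAccountName "YourAutomationAccount" -Name "NightlyScaleDown" `
    -StartTime (Get-Date).Date.AddDays(1).AddHours(22) -DayInterval 1

Register-AzAutomationScheduledRunbook -ResourceGroupName "YourResourceGroup" `
    -AutomationAccountName "YourAutomationAccount" `
    -RunbookName "Scale-AzureSqlDatabase" -ScheduleName "NightlyScaleDown"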

Remember to customize the variables and parameters according to your specific Azure SQL setup, resource group, server, database, and scaling requirements.

The value of scripting your operations as much as possible is that you can document the changes in code, share scripts across your team, and also extend into other automation systems and triggers.

The one risk to scripting your tasks is that it requires some coding skill, and commands and parameters change over time, which means you will need to revisit your scripts continually.

Design for failure, always

Assume a loss of services is coming. Design your systems to be resilient and able to withstand some loss of resources. This will always be a part of your price-performance consideration as well but primarily targets application availability.

Be ready to handle transactions, logs, sessions, and many other application components as a distributed system. Any single point of failure is a risk to your application availability and performance.
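
One concrete piece of this is handling transient failures gracefully instead of assuming every call succeeds. The sketch below, which assumes the SqlServer PowerShell module and a connection string stored in an environment variable (both assumptions for illustration), retries a query with exponential backoff; the attempt count, delays, and the dbo.Orders table are placeholders rather than recommendations.

# Retry a query against a database that may be briefly unavailable (e.g., during failover)
$maxAttempts = 4
$delaySeconds = 2

for ($attempt = 1; $attempt -le $maxAttempts; $attempt++) {
    try {
        $result = Invoke-Sqlcmd -ConnectionString $env:SQL_CONNECTION_STRING `
            -Query "SELECT COUNT(*) AS OrderCount FROM dbo.Orders" -ErrorAction Stop
        break  # Success: stop retrying
    }
    catch {
        if ($attempt -eq $maxAttempts) { throw }  # Give up after the final attempt
        Start-Sleep -Seconds $delaySeconds
        $delaySeconds *= 2  # Exponential backoff before the next attempt
    }
}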

Design loosely coupled systems

Your SaaS application will have multiple services, shared data, and likely other processes and systems interacting with it. You need to design for loosely coupled architectures which don’t always depend on synchronous access between components.

Serverless event-driven tools are a great design pattern for this. For example, developers may use Azure Functions as loosely coupled systems where each function performs a specific task.

The system as a whole can be composed and orchestrated dynamically based on event-driven triggers. The trade-off here is often complexity and the amount of effort to write and maintain these components. Loosely coupled systems are generally designed more generically rather than tuned specifically for a given use-case.
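
For instance, a queue-triggered Azure Function written in PowerShell might look like the sketch below. The queue name and message shape are assumptions for illustration; the point is that the function performs one narrow task and leaves composition to the events that connect functions together.

# run.ps1 for a queue-triggered PowerShell Azure Function
# (function.json binds $QueueItem to a storage queue, e.g. "orders-to-process")
param($QueueItem, $TriggerMetadata)

# This function owns a single narrow task; here it simply logs the message it received
Write-Host "Processing queue message: $($QueueItem | ConvertTo-Json -Compress)"

# ...perform that one task (e.g., write an audit record or call a downstream API)...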

Another important aspect is the price performance ratio. Scaling loosely coupled systems is a constant set of trade-offs between scaling for performance while trying to optimize cost efficiency. You will need to design for these business rules when designing scaling thresholds and triggers.

Design for Eventual Consistency

This is likely one of the biggest design challenges. Distributed applications will have many front-end components, middleware, plus both structured and unstructured data.

As an example, Cosmos DB is a Microsoft Azure service for low-latency and high-throughput access to unstructured data. Cosmos DB employs an eventual consistency model to ensure scalability and high availability.

Updates made to the database are replicated across multiple regions to provide global access and fault tolerance. However, due to the distributed nature of the system and the replication process, data consistency is not guaranteed to be immediately synchronized across all regions. Cosmos DB tries to solve for this by offering different consistency levels, including “eventual consistency,” which allows for some degree of latency in propagating updates.

For example, suppose you have a Cosmos DB account with data replicated across three regions: East US, West US, and Europe. You make an update to a document in the East US region. With eventual consistency, there might be a short delay before the updated document is fully synchronized and available for read operations in the West US and Europe regions.

During this propagation delay, if a read operation is performed on the West US or Europe regions, it might return the older version of the document instead of the latest update. The updates will eventually reach all regions, ensuring eventual consistency across the entire distributed database over time.

Don’t forget that eventual consistency is a trade-off between availability, latency, and consistency. By allowing for some delay in propagating updates, Cosmos DB can provide high availability and low-latency access to data, which is crucial for global-scale applications. This means that in certain scenarios users might observe inconsistencies until the updates are fully replicated.

Azure Cosmos DB allows you to choose different consistency levels based on your application’s requirements. You can opt for stronger consistency levels like “session,” “bounded staleness,” or “strong consistency” if you prioritize data consistency over availability and latency, albeit with trade-offs in performance and cost.
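
As a hedged sketch, assuming the Az.CosmosDB module and an existing account (the account and resource group names below are placeholders), inspecting and changing an account’s default consistency level might look like this; confirm the parameter names against your installed module version:

# Inspect the account's current default consistency level
Get-AzCosmosDBAccount -ResourceGroupName "YourResourceGroup" -Name "your-cosmos-account" |
    Select-Object -ExpandProperty ConsistencyPolicy

# Tighten reads by moving the default from session/eventual to bounded staleness
Update-AzCosmosDBAccount -ResourceGroupName "YourResourceGroup" -Name "your-cosmos-account" `
    -DefaultConsistencyLevel "BoundedStaleness" `
    -MaxStalenessIntervalInSeconds 300 -MaxStalenessPrefix 100000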

Use Platform Engineering and SRE practices

Don’t make it an “Ops problem” or a “Dev problem”. Work as collaboratively as possible during design and development and continue that collaboration in production operations as well.

A shared responsibility model between teams will reduce the risk once the application goes to production. Let’s use an example of a team running an application on Google Cloud. This team uses several Site Reliability Engineering (SRE) processes to ensure the reliability and performance of their system.

  • The team implements proactive monitoring and alerting mechanisms. They leverage Google Cloud’s Stackdriver Monitoring to collect and analyze system metrics in real-time.
  • Using custom monitoring dashboards, they set up alerts based on predefined thresholds for metrics like CPU utilization, latency, and error rates.
  • When an alert is triggered, the team receives immediate notifications, enabling them to identify and resolve issues promptly.
  • They use Stackdriver Logging to capture detailed logs and perform log analysis for troubleshooting and root cause analysis.
  • Deployments are done with infrastructure as code (IaC) using tools like Terraform or Deployment Manager. They define their infrastructure components, such as virtual machines, load balancers, and networking configurations, in a declarative format.
  • The bonus is that their infrastructure is documented, reviewed, and can be easily recreated or modified, reducing the risk of configuration drift and human error.
  • They utilize Google Cloud’s Cloud Monitoring and Cloud Logging to set up incident management workflows. When an incident occurs, they leverage tools like Google Cloud Pub/Sub or Cloud Functions to automatically trigger incident response actions.
  • The team designs and tests disaster recovery procedures by creating backups, replicating data across regions, and conducting periodic drills to ensure business continuity in case of a major outage or failure.

Each and every opportunity to automate and create self-healing systems is a win for the operations team and for users of the application. This is also why SaaS providers will always lean on automation first and foremost, for consistency and reduced risk.

Futureproof where possible

Adopting a microservices architecture allows for better scalability, flexibility, and independent development of application components. By decomposing the application into smaller, loosely coupled services, each with its specific functionality, you can enhance modularity and enable seamless updates or additions to individual services.

You may choose tools like Azure Service Fabric for building and managing microservices-based applications, offering features like automatic scaling, rolling upgrades, and service healing. There are many service-based platforms that help to build and operate microservices including monitoring and scaling.

Adopt services that give you flexibility and portability whenever possible. This will be a trade-off against using simple integrations with proprietary systems (e.g. OpenFaaS vs AWS Lambda). Your decision should be based on business needs weighed against technical complexity and cost. Luckily, there are many platform partners with proprietary systems integration that can reduce the complexity and risk.

Your application will continue to evolve and innovate, so don’t feel as though you are ever “done”. This is why future-proofing components and practices gives you the most flexibility as the system gets used.

Still Building Out Your SaaS Offering?

Download our ebook to get started on building and optimizing your SaaS solution today!
