Designing computer systems is hard, especially at scale. You’d think you could follow a standard process from a set of requirements to a viable design, but it’s not that easy. There have been many attempts to formalize the process over the years, and some of them have given us useful tools, like Agile Development and Domain-Driven Design, that many teams employ today. But there is always more than one way to solve a design problem, and the available tools and technologies keep evolving, so the best design today is unlikely to be the best design a year from now.

Successful large-scale systems these days are built by teams of capable software and hardware engineers who work together to propose design ideas, analyze them, and iteratively build and improve them over time. A successful system is never finished.

One approach that I find helps me is to use a checklist. It doesn’t prescribe a specific design, but it gives you a list of steps to consider, and a set of lenses through which to analyze your designs. Like everything else it needs to evolve over time. You should write it down, and then commit to reviewing and updating it as you learn more about the craft.

Here’s my current checklist.

Functional requirements

Functional requirements are the things the system needs to do in order to be useful to its users.

Who are the users of the system?
What are the main workflows?
What are the “key problems” that need to be solved?

It’s easy to go down a rabbit hole at every step of a system’s design. You have to strike a balance between documenting what’s important and getting lost in detail. Remember this is an iterative process.

Users vs. customers vs. stakeholders - I prefer to use the term “Users” to describe the users of the system, but this can include all kinds of different personas that you should document if it helps to clarify the functionality of the system.

Non-functional requirements

Non-functional requirements describe how well the system needs to do its job, rather than what it does. Users typically imply them rather than state them explicitly.

Consider:

Scalability: What amount of load, measured in requests per second or transactions per second, will the system need to handle? Will the load be consistent or vary over time? This will impact whether the system needs to horizontally scale, and whether it must scale up and down to meet demand. Scaling down is usually necessary to avoid over-provisioning and to save costs, but depends on the shape of the traffic and budget. What will the ratio between read and write traffic be? Many web scale systems are read-heavy and can benefit from caching and pre-computing read responses.
Latency: How fast does the system need to respond to requests? How quickly does it need to process inputs and other external events? Latency requirements can affect technology choices, such as polling vs. persistent WebSocket connections. Some requests may need to be acknowledged immediately and processed asynchronously.
Availability and reliability: What is the impact if the entire system, or parts of the system, go down? Some systems and subsystems can afford to go down depending on the business requirements, in which case using a less reliable design could save on cost and complexity. How many nines of reliability are needed?
Consistency: Do reads need to reflect the most recent writes at all times, or can reads return stale data? This will affect the types of data persistence and caching that can be used in the design.
Durability: Depending on the business requirements, certain types of data may not need highly durable storage and can tolerate data loss. For example, telemetry metrics and logs may allow data loss. Other types of customer data may require guaranteed durability, including protections against hardware failures and data center outages. Your system may need to perform continuous backups or replicate data to other data centers or geographical regions.
Maintainability: What is the expected lifespan of the system? Most systems will live on many times longer than it takes to initially implement them. How many people and teams will be involved in maintaining it? What metrics should you measure to determine if it meets business goals? How easy does it need to be to debug? Most systems will require a certain level of observability to make them debuggable and to measure their effectiveness.
Security and Privacy: By default you should design systems with security and privacy in mind. Data needs to be accessible only by those with appropriate permissions. Checks should be in place to ensure data does not leak between tenants. Data should be encrypted at rest and in transit. Personally identifiable information (PII) should not appear in logs or internal reports. Users should be able to opt out of the system and have their data deleted or anonymized. Do some parts of the system need to be in a VPC with limited access to the public internet?
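
To make the “how many nines” question concrete, it helps to translate an availability target into a downtime budget. The sketch below is plain arithmetic rather than anything specific to a particular system; the function name is my own, and it assumes a 365-day year.

```python
# Allowed downtime for a given availability target ("number of nines").
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; ignores leap years


def allowed_downtime_minutes(nines: int) -> float:
    """Return the yearly downtime budget, in minutes, for N nines."""
    unavailability = 10 ** -nines  # e.g. 3 nines -> 0.001 (99.9% available)
    return MINUTES_PER_YEAR * unavailability


for n in range(2, 6):
    availability = 100 * (1 - 10 ** -n)
    print(f"{n} nines ({availability:.3f}%): "
          f"{allowed_downtime_minutes(n):,.1f} minutes per year")
```

Three nines leaves a budget of roughly 8.8 hours of downtime a year, while five nines leaves a little over 5 minutes. Each extra nine shrinks the budget tenfold, which is why it is worth asking what the business actually needs.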

Compile a list of the most important non-functional requirements, and any assumptions you make related to them. You’ll use these to create your high level architecture, and later to help analyze your design.
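
One way to write assumptions down is as a back-of-the-envelope load estimate. The sketch below is only an illustration: the class, the field names, and every number in it are hypothetical placeholders, not figures from a real system.

```python
from dataclasses import dataclass


@dataclass
class LoadAssumptions:
    """Hypothetical inputs; replace them with figures for your own system."""
    daily_active_users: int
    requests_per_user_per_day: int
    peak_to_average_ratio: float  # how much busier the peak is than the mean
    read_fraction: float          # share of requests that are reads


def estimate_load(a: LoadAssumptions) -> dict:
    """Turn the assumptions into average and peak requests per second."""
    seconds_per_day = 24 * 60 * 60
    average_rps = (a.daily_active_users * a.requests_per_user_per_day
                   / seconds_per_day)
    peak_rps = average_rps * a.peak_to_average_ratio
    return {
        "average_rps": average_rps,
        "peak_rps": peak_rps,
        "peak_read_rps": peak_rps * a.read_fraction,
        "peak_write_rps": peak_rps * (1 - a.read_fraction),
    }


# Made-up example values, chosen only to show the calculation.
assumptions = LoadAssumptions(
    daily_active_users=1_000_000,
    requests_per_user_per_day=20,
    peak_to_average_ratio=3.0,
    read_fraction=0.9,
)
for name, value in estimate_load(assumptions).items():
    print(f"{name}: {value:,.0f}")
```

Keeping the inputs in one place makes it obvious which numbers are guesses, and easy to rerun the estimate when a guess changes.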

High level architecture

Start with a high level architecture. If the system is large enough you’ll need to break it up into sub-systems.

Consider:

Domain-Driven Design: Understand the problem domain that the system operates in. This will give you the shared language, vocabulary, and mental model your team will use to describe and understand the system.
Service-Oriented Architecture: Design the system as a set of interoperating services. Each service should correspond to a sub-domain and be concerned with a single entity or a set of related entities within that sub-domain.
Tiered architecture: Some systems lend themselves to a tiered architecture, where a given tier hides the complexity of the tier below it, and implements a useful abstraction for the tier above it.
Control Planes and Data Planes: If there is a distinction between control and data planes in your domain, it can make sense to design them as separate components or sub-components. The two often have different requirements: data planes usually carry far more traffic and are often held to higher availability targets, while control planes tend to need stronger consistency guarantees.

When a system is large enough, it can use different architectural patterns at different levels. An individual service within a service-oriented architecture could use a tiered architecture within its boundaries.
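
As a rough illustration of a tiered design inside a single service, here is a minimal sketch. The order domain and all of the class and method names are hypothetical, and the data tier is just an in-memory dictionary standing in for a real data store.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Order:
    order_id: str
    total_cents: int


class OrderRepository:
    """Data tier: hides how orders are stored (an in-memory dict here)."""

    def __init__(self) -> None:
        self._rows: Dict[str, Order] = {}

    def save(self, order: Order) -> None:
        self._rows[order.order_id] = order

    def get(self, order_id: str) -> Optional[Order]:
        return self._rows.get(order_id)


class OrderService:
    """Business tier: enforces rules and knows nothing about storage details."""

    def __init__(self, repository: OrderRepository) -> None:
        self._repository = repository

    def place_order(self, order_id: str, total_cents: int) -> Order:
        if total_cents <= 0:
            raise ValueError("order total must be positive")
        order = Order(order_id, total_cents)
        self._repository.save(order)
        return order


class OrderApi:
    """Presentation tier: translates requests and responses for callers."""

    def __init__(self, service: OrderService) -> None:
        self._service = service

    def handle_create(self, request: dict) -> dict:
        try:
            order = self._service.place_order(request["id"],
                                              request["total_cents"])
        except ValueError as error:
            return {"status": 400, "error": str(error)}
        return {"status": 201, "id": order.order_id}


repository = OrderRepository()
api = OrderApi(OrderService(repository))
print(api.handle_create({"id": "o-1", "total_cents": 1250}))
print(api.handle_create({"id": "o-2", "total_cents": 0}))
```

Each tier talks only to the one directly below it, so the storage choice could change without touching the API tier.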

Being able to reason about a system at different levels of abstraction is a very useful property of a well-designed system, but there are trade-offs. Each level of abstraction adds complexity and may affect performance. Again, this is an iterative process, and you will have the opportunity to change the high level architecture later if you find a potential performance issue.

Analysis

Review your high level design to confirm that it meets your functional and non-functional requirements. Walk through the main workflows and think about the flow of requests, responses, asynchronous messages and data streams.

Consider:

Single Points of Failure: Will the system, or parts of it, be able to progress if other parts go down? Is there a critical component that can bring down the entire system?
Bottlenecks: Are there points in the design that will cause it to have limited throughput?
What-if scenarios: Consider what will happen if a component or an external system is unreachable, or has degraded performance. Are there edge cases that may cause unexpectedly high load on the system?
Maintainability: Is the system observable and debuggable? How will you know if the system is meeting its business objectives?
Extensibility: Will the design support new functional requirements over time? Will the design be able to integrate with existing and new systems?
Scalability: Will the system be able to handle increased load over time, as the system gains more users? Will the system handle increased storage requirements over time?
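
When looking for single points of failure, a quick estimate of combined availability can show where redundancy pays off. This sketch assumes that components fail independently, which real systems only approximate. The request path and the availability figures are hypothetical.

```python
from math import prod
from typing import List


def serial_availability(components: List[float]) -> float:
    """Every component must be up, so the availabilities multiply."""
    return prod(components)


def redundant_availability(replica: float, count: int) -> float:
    """At least one of `count` independent replicas must be up."""
    return 1 - (1 - replica) ** count


# Hypothetical request path: load balancer -> API service -> database.
single_path = serial_availability([0.9999, 0.999, 0.999])

# Same path, with two independent replicas of the API and of the database.
redundant_path = serial_availability([
    0.9999,
    redundant_availability(0.999, 2),
    redundant_availability(0.999, 2),
])

print(f"single path:    {single_path:.5f}")
print(f"redundant path: {redundant_path:.5f}")
```

A chain of components is always less available than its weakest link, and in the redundant version the component left without a replica (the load balancer here) ends up dominating the result.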

Your analysis may reveal weaknesses in your high level design. If so, go back and fix those. Otherwise, you can continue by diving deeper into individual components. Your analysis will give you an idea of the key components so you can focus your deep dives.

Deep dives

Once you have a high level design, dive into each key component and start thinking about how you will build and deploy it using physical infrastructure.

Your goal is to create a design document that your team can use as a blueprint to build the system.

Consider:

Load balancing and horizontal scaling: Can the component benefit from horizontal scaling and load balancing to improve scalability?
Data stores: If the component needs persistent state, what type of data store is most appropriate: relational, non-relational, graph, time series, or object storage? What are the data access patterns? What consistency guarantees does the application need? Does the application need atomic transactions? Does data need to be backed up periodically, or replicated in real time?
Integration Patterns: How does this component integrate with other components: queues, message brokers, streaming, RPC?
Caching: Can the component use caching to reduce latency and increase scalability and availability? Does caching affect consistency?
Partitioning: Can the component use partitioning to improve scalability? In a multi-tenant system, could data be partitioned by tenant?
Regions: Does the system need to be available across the globe? Does state need to be replicated across regions?
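
As one example of the partitioning question, here is a minimal sketch of assigning tenants to partitions with a stable hash. The function name, tenant IDs, and partition count are hypothetical, and hashing modulo the partition count is only one of several possible schemes.

```python
import hashlib


def partition_for(tenant_id: str, partition_count: int) -> int:
    """Map a tenant to a partition using a stable hash.

    Python's built-in hash() is randomized per process for strings, so a
    SHA-256 digest is used to get the same answer on every host.
    """
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count


for tenant in ["acme", "globex", "initech"]:
    print(tenant, "->", partition_for(tenant, partition_count=8))
```

Keeping all of a tenant’s data in one partition keeps per-tenant queries local, but a very large tenant can create a hot partition. Changing the partition count also remaps most tenants under this scheme, so a lookup table or consistent hashing may be a better fit if you expect to repartition often.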

Remember this is an iterative process. Continue to analyze and refine your design until you and your team understand what you need to build, then start building. With a high quality design the system should be much easier to build and maintain, and you’ll be able to evolve the design over time.