Subscription IT: Delivering High Performance, Availability and Scalability

Written by:

Today’s subscription businesses require speed and flexibility to launch new products, the ability to automate manual processes, and visibility into the key metrics required to make accurate business decisions. The SaaS infrastructure must support reliable 7x24x365 operations, and delivering these capabilities requires a reliable ‘enterprise-grade’ system with services built on a secure, high-performing and scalable infrastructure.

This Academy guide will cover the key capabilities required to support your mission-critical subscription business operations.

Note that cloud security is covered in its own guide.

System Performance

Key points:

  • Defining high performance needs
  • Understanding synchronous and asynchronous transaction capacity
  • Understanding synchronous transaction response times
  • Requirements for network latency

SaaS applications need to deliver high-speed system performance. Because SaaS applications face the internet, they require ongoing investment to keep scaling linearly as load grows. Customers rely on their service providers to meet and maintain system performance requirements, so application vendors should provide detailed information about system performance and availability.

At the very heart of Zuora’s 9 Keys framework is a commitment to deliver the performance, scalability and capacity required by complex operations in a mission-critical billing system.

‘High performance’ is typically measured by:

  • Average transaction processing speed
  • Average number of transactions supported per second (throughput)
  • Web page delivery times
  • Average query response times


Synchronous APIs = operations that issue an API request and wait until they get a response.

Asynchronous APIs = processes that run in the background, e.g. monthly bill runs, data syncing with Salesforce (SFDC) and large-scale data extracts.

For example, Zuora uses both synchronous and asynchronous APIs to handle requests within the application and meet its performance objectives.
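To make the distinction concrete, here is a minimal Python sketch of the two call styles, assuming an in-process stand-in service. The names `create_subscription`, `run_bill_run` and `AsyncJobQueue` are hypothetical illustrations, not Zuora APIs: the synchronous call blocks until it has a result, while the asynchronous call hands back a job ID immediately and the work continues in the background.

```python
from __future__ import annotations

import time
import uuid
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable


# Hypothetical operations used only to illustrate the two call styles.
def create_subscription(account_id: str) -> dict:
    """Targeted operation: the caller waits for the response (synchronous)."""
    time.sleep(0.005)  # stand-in for a few milliseconds of work
    return {"account_id": account_id, "status": "active"}


def run_bill_run(account_ids: list[str]) -> int:
    """Long-running job: suited to background (asynchronous) execution."""
    time.sleep(0.05)  # stand-in for a much larger batch of work
    return len(account_ids)


class AsyncJobQueue:
    """Hypothetical queue: accepts a job, returns a job ID at once, runs it in the background."""

    def __init__(self, workers: int = 4) -> None:
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._jobs: dict[str, Future] = {}

    def submit(self, fn: Callable[..., Any], *args: Any) -> str:
        job_id = str(uuid.uuid4())
        self._jobs[job_id] = self._pool.submit(fn, *args)
        return job_id

    def status(self, job_id: str) -> str:
        return "completed" if self._jobs[job_id].done() else "running"

    def result(self, job_id: str, timeout: float | None = None) -> Any:
        # Blocks only when the caller explicitly asks for the outcome.
        return self._jobs[job_id].result(timeout=timeout)


if __name__ == "__main__":
    # Synchronous: this line does not finish until the response exists.
    print(create_subscription("A-001"))

    # Asynchronous: submit returns immediately; the caller checks back later.
    jobs = AsyncJobQueue()
    job_id = jobs.submit(run_bill_run, ["A-001", "A-002", "A-003"])
    print(job_id, jobs.status(job_id))
    print("accounts billed:", jobs.result(job_id, timeout=5))
```

In a real service the job ID would typically be returned over HTTP and polled or delivered through a callback, but the contract is the same: the caller is not held open while the background work runs.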

Capacity and response times

I. Synchronous Transaction Capacity:

This is especially important for high-volume businesses such as B2C consumer services. A common requirement is a sustained peak of at least 100 transactions per second. Another benchmark is, whose peak rate on Cyber Monday 2012 was 306 orders per second. Zuora exceeds both of these benchmarks.

II. Asynchronous Transaction Capacity:

Asynchronous capacity needs to scale out horizontally to handle whatever load the system has to carry. As with any billing system, bill runs are intensive, so bill run performance needs to scale near-linearly as the work is fanned out horizontally across multiple threads. Zuora is architected to scale out bill runs horizontally.
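The fan-out described above can be sketched in Python as partition, process in parallel, then combine. `bill_account`, `partition`, `bill_batch` and `run_bill_run` are hypothetical names with placeholder pricing; this shows the general technique, not Zuora’s bill run engine.

```python
from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor


def bill_account(account_id: int) -> float:
    """Hypothetical per-account billing step; returns an invoice amount."""
    return 10.0 + (account_id % 5)  # placeholder pricing, not a real rate plan


def partition(items: list[int], parts: int) -> list[list[int]]:
    """Split items into `parts` non-overlapping slices of roughly equal size."""
    return [items[i::parts] for i in range(parts)]


def bill_batch(account_ids: list[int]) -> float:
    """Bill one partition and return its subtotal."""
    return sum(bill_account(a) for a in account_ids)


def run_bill_run(account_ids: list[int], workers: int) -> float:
    """Fan the bill run out across `workers` threads and combine the subtotals."""
    batches = partition(account_ids, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(bill_batch, batches))


if __name__ == "__main__":
    accounts = list(range(10_000))
    for workers in (1, 4, 16):
        print(workers, run_bill_run(accounts, workers))
```

Because each partition is independent, adding workers (or, at larger scale, servers) adds capacity without changing the result, which is what makes near-linear scaling possible. In CPython, threads help mainly when the per-account work waits on I/O such as database calls; CPU-bound work would use processes or separate servers with the same partition-and-combine structure.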

III. Synchronous Transaction Response Times:

Response times vary with the complexity of the operation that the API supports. Some Zuora APIs perform a very targeted operation that can complete very rapidly (in as little as 5 ms). Others perform an entire end-to-end business use case and take longer to execute because they interact with third-party systems such as payment gateways. Most Zuora synchronous transactions execute in 5 to 300 ms.
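The capacity and response-time figures above can be combined with Little’s law (average requests in flight = arrival rate × average response time) to estimate how much concurrency a service must sustain. The hypothetical helper below is a back-of-the-envelope calculation using the numbers quoted in this guide; it is a standard queueing result, not a description of how Zuora sizes its servers.

```python
import math


def required_concurrency(throughput_tps: float, response_time_ms: float) -> int:
    """Hypothetical helper applying Little's law, L = lambda * W (requests in flight)."""
    in_flight = throughput_tps * response_time_ms / 1000.0  # convert ms to seconds
    return math.ceil(in_flight)


if __name__ == "__main__":
    # Rates from the benchmarks above, latencies from the 5-300 ms range.
    for tps in (100, 306):
        for latency_ms in (5, 300):
            print(f"{tps} TPS at {latency_ms} ms -> {required_concurrency(tps, latency_ms)} in flight")
```

At 100 transactions per second and 300 ms per call, about 30 requests are in flight at any moment; at 306 per second the figure rises to about 92. This is why slow end-to-end calls, such as those waiting on a payment gateway, drive the number of worker threads or connections a deployment needs.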

How is your service provider demonstrating a commitment to performance and scalability?

Zuora continues to improve and invest in our infrastructure to support larger, more scalable enterprise customer architectures. Over time, Zuora has delivered:

  • 2X to 10X improvement in bill run performance (range of customer results)
  • 60X improvement in Salesforce sync times
  • 2X improvement in order taking speed
  • 60X improvement in notifications (callouts) that can be used to communicate with your provisioning system
  • 5X speed improvement in PDF invoice generation

Network latency:

You can have the fastest application in the world, but if you don’t have a fast network you won’t see the performance. It’s important to use web performance solutions to improve response time and throughput, and savvy cloud application vendors know how to do this. Zuora uses Akamai’s web application accelerator to slash internet latency.


Scalable Architecture

Key points:

  • Scalability means linear transaction performance
  • There are two kinds of scalability: horizontal and vertical
  • Horizontal (scale out) = ability to add more customer accounts
  • Vertical (scale up) = ability to handle an increase in the complexity of operations

As SaaS applications add customer accounts and functionality, system performance must keep pace with the added scale to avoid performance bottlenecks.

There’s typically a performance tradeoff between horizontal (scale-out) and vertical (scale-up) architecture: you usually lose some vertical capability when you increase horizontal capability, and vice versa. Companies should generally prefer horizontal scale because it aligns with business growth; vertical scale can be added later, once horizontal scale is achieved. Zuora is architected and built on a horizontal architecture.

Best practices in building scalable SaaS architecture include the following:

  • Refactoring and optimizing code (vertical scalability)
  • Profiling and analysis in a production-like environment
  • Optimally aligning software and hardware architectures
  • Investment in state-of-the-art technology
  • A ‘search and destroy’ approach to performance bottlenecks, practiced by the architecture team

Snapshot: Zuora’s own scalable architecture

Zuora’s core platform has been architected from the ground up to organically scale to massive volumes. The core system architecture and application are designed to scale horizontally at the application level, messaging infrastructure level as well as the database level. The diagram below represents the architecture of Zuora’s cloud infrastructure.

Zuora utilizes the following resources to ensure that the platform is scalable at all levels:

  • Juniper & Mellanox for network gear
  • SuperMicro high-capacity web and application servers
  • FIPS-compliant SafeNet encryption appliances
  • High-performance SAN for the storage layer

Load Balancer/Web Servers:

  • All web servers are horizontally scalable and are configured in a redundant manner for HA
  • Software load balancers route traffic to multiple servers
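A software load balancer’s core routing decision can be sketched as round-robin selection that skips servers marked unhealthy. This is a minimal illustration with hypothetical server names and a hypothetical `RoundRobinBalancer` class; production balancers add active health checks, weighting and connection draining, and the guide does not say which algorithm Zuora’s balancers use.

```python
from __future__ import annotations

import itertools


class RoundRobinBalancer:
    """Hypothetical balancer: rotates requests across backends, skipping unhealthy ones."""

    def __init__(self, backends: list[str]) -> None:
        self._backends = list(backends)
        self._healthy = set(self._backends)
        self._rotation = itertools.cycle(self._backends)

    def mark_down(self, backend: str) -> None:
        self._healthy.discard(backend)  # e.g. after a failed health check

    def mark_up(self, backend: str) -> None:
        if backend in self._backends:
            self._healthy.add(backend)

    def next_backend(self) -> str:
        # Check each backend at most once per request so the loop always ends.
        for _ in range(len(self._backends)):
            candidate = next(self._rotation)
            if candidate in self._healthy:
                return candidate
        raise RuntimeError("no healthy backends available")


if __name__ == "__main__":
    lb = RoundRobinBalancer(["web1", "web2", "web3"])  # hypothetical server names
    lb.mark_down("web2")  # simulate a failed server
    print([lb.next_backend() for _ in range(4)])
```

Because every web server is interchangeable, removing one from rotation only shifts its share of traffic to the others, which is the redundancy the bullets above describe.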

Application Servers:

  • Separate application servers for UI, synchronous and asynchronous transactions
  • All application servers are horizontally scalable and configured in a redundant manner for High Availability
  • UI and API processing is stateless and multi-threaded, and is configured and tuned with large Java heaps
  • Asynchronous transactions are re-entrant, highly multi-threaded and also configured with large Java heaps. They also segregate higher priority from lower priority processing
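One way to segregate higher-priority from lower-priority asynchronous work is a priority queue that always dispatches the most urgent job first, keeping first-in-first-out order within a priority level. The `Job` and `PriorityJobQueue` names, job names and priority numbers below are hypothetical, and a real deployment might instead use separate worker pools per priority.

```python
from __future__ import annotations

import heapq
import itertools
from dataclasses import dataclass, field


@dataclass(order=True)
class Job:
    priority: int  # lower number = more urgent
    seq: int  # arrival counter: keeps FIFO order within one priority level
    name: str = field(compare=False)


class PriorityJobQueue:
    """Hypothetical queue: most urgent job first; equal priorities run in arrival order."""

    def __init__(self) -> None:
        self._heap: list[Job] = []
        self._arrivals = itertools.count()

    def put(self, name: str, priority: int) -> None:
        heapq.heappush(self._heap, Job(priority, next(self._arrivals), name))

    def get(self) -> str:
        return heapq.heappop(self._heap).name

    def __len__(self) -> int:
        return len(self._heap)


if __name__ == "__main__":
    queue = PriorityJobQueue()
    queue.put("large data extract", priority=5)
    queue.put("payment run", priority=1)
    queue.put("invoice PDF batch", priority=5)
    queue.put("provisioning callout", priority=1)
    while queue:
        print(queue.get())
```

The arrival counter matters: without it, two jobs with the same priority would be compared by name, and dispatch order would no longer reflect when they were submitted.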

Messaging Infrastructure:

  • An AMQ broker implements the publish/subscribe model
  • Broker is set up in active-passive configuration for quick failover/HA
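The publish/subscribe model with an active-passive broker pair can be sketched with in-memory stand-ins: publishers send to topics, every subscriber of a topic receives the message, and the client promotes the passive broker when the active one fails. `Broker` and `FailoverPublisher` are hypothetical classes, not the AMQ client API; a production client would normally get this behaviour from the broker’s own failover configuration, and this sketch ignores message persistence.

```python
from __future__ import annotations

from typing import Callable

Handler = Callable[[str], None]


class Broker:
    """Hypothetical in-memory stand-in for a topic-based publish/subscribe broker."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.alive = True
        self._subscribers: dict[str, list[Handler]] = {}

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers.setdefault(topic, []).append(handler)

    def publish(self, topic: str, message: str) -> int:
        """Deliver to every subscriber of the topic; return the delivery count."""
        if not self.alive:
            raise ConnectionError(f"broker {self.name} is unavailable")
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(message)
        return len(handlers)


class FailoverPublisher:
    """Hypothetical client: uses the active broker, promotes the passive one on failure."""

    def __init__(self, active: Broker, passive: Broker) -> None:
        self.active, self.passive = active, passive

    def subscribe(self, topic: str, handler: Handler) -> None:
        # Register on both brokers so subscribers keep receiving after failover.
        self.active.subscribe(topic, handler)
        self.passive.subscribe(topic, handler)

    def publish(self, topic: str, message: str) -> int:
        try:
            return self.active.publish(topic, message)
        except ConnectionError:
            # Fail over: swap roles, then retry once on the newly active broker.
            self.active, self.passive = self.passive, self.active
            return self.active.publish(topic, message)


if __name__ == "__main__":
    received: list[str] = []
    primary, standby = Broker("primary"), Broker("standby")
    publisher = FailoverPublisher(primary, standby)
    publisher.subscribe("invoice.posted", received.append)
    publisher.publish("invoice.posted", "INV-1")
    primary.alive = False  # simulate loss of the active broker
    publisher.publish("invoice.posted", "INV-2")
    print(received, "active:", publisher.active.name)
```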

Database Servers:

  • MySQL database
  • Sharding enables horizontal scalability at the database tier
  • Servers are set up in a master/slave configuration for HA, with read-only slaves used to offload query workloads
  • An AMQ database persists messages so they can be recovered after a failover
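Sharding plus read-only slaves implies a routing step: pick the shard that owns an account, send writes to that shard’s master, and spread reads across its replicas. The sketch below assumes hash-based sharding on a hypothetical account ID, with hypothetical `Shard` and `ShardRouter` classes and made-up host names; the guide does not describe Zuora’s actual shard key or routing scheme.

```python
from __future__ import annotations

import hashlib
import itertools


class Shard:
    """Hypothetical shard: a master for writes plus read-only replicas for queries."""

    def __init__(self, master: str, replicas: list[str]) -> None:
        self.master = master
        self.replicas = replicas
        # Rotate reads across replicas; fall back to the master if there are none.
        self._reads = itertools.cycle(replicas or [master])

    def read_host(self) -> str:
        return next(self._reads)


class ShardRouter:
    """Hypothetical router mapping an account ID to a shard and a host."""

    def __init__(self, shards: list[Shard]) -> None:
        self.shards = shards

    def shard_for(self, account_id: str) -> Shard:
        # A stable hash keeps each account on the same shard across processes
        # (Python's built-in hash() is randomized per process for strings).
        digest = hashlib.sha256(account_id.encode("utf-8")).digest()
        return self.shards[int.from_bytes(digest[:8], "big") % len(self.shards)]

    def route(self, account_id: str, is_write: bool) -> str:
        shard = self.shard_for(account_id)
        return shard.master if is_write else shard.read_host()


if __name__ == "__main__":
    router = ShardRouter([
        Shard("db0-master", ["db0-replica1", "db0-replica2"]),
        Shard("db1-master", ["db1-replica1"]),
    ])
    for account in ("A-00001", "A-00002", "A-00003"):
        print(account, router.route(account, is_write=True), router.route(account, is_write=False))
```

Two caveats apply to this simple scheme. Replicas usually lag the master slightly, so a read that must see a just-committed write should go to the master. Modulo hashing also remaps most accounts when a shard is added, which is why many systems use a lookup directory or consistent hashing instead.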

Availability and Disaster Recovery

Key points:

  • Understanding High Availability and Disaster Recovery
  • The importance of Business Continuity Planning to address Disaster Recovery
  • Defining your recovery objectives (RTO, RPO, and RLO)
  • Best practices around DR and Business Continuity planning

Running mission-critical applications such as a SaaS-based Relationship Business Management system requires a provider focused on maintaining high service availability and on ensuring that your applications and data are recoverable in the event of a disaster.

High Availability refers to ensuring the highest level of availability of the service by providing redundancy at all the layers of the architecture so that if one infrastructure component (network, server, storage) fails, overall service remains available.

Disaster Recovery addresses service continuity in the case of a disaster that affects the physical datacenter, so that service is maintained through a standby site. Two independent environments, typically in separate and distinct facilities, each contain their own data (in the file system and database) and executables. Data and configuration information are replicated between the production and standby sites.

Defining your recovery objectives (source – MSDN):

RTO (Recovery Time Objective): The duration of acceptable application downtime, whether from unplanned outage or from scheduled maintenance/upgrades. The primary goal is to restore full service to the point that new transactions can take place.

RPO (Recovery Point Objective): The amount of data loss from an outage that you can accept. It is the time gap or latency between the last committed data transaction before the failure and the most recent data recovered after the failure. The actual data loss can vary depending upon the workload on the system at the time of the failure, the type of failure, and the type of high availability solution used.

RLO (Recovery Level Objective): This objective defines the granularity with which you must be able to recover data — whether you must be able to recover the whole instance, database or a set of databases, or specific tables.
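The RTO and RPO definitions above translate directly into two measurements you can take from a failover drill: how long service was unavailable, and how far the recovered data trails the last committed transaction. The sketch below assumes those four timestamps are recorded during the drill; `RecoveryDrill` and its fields are hypothetical names. RLO describes recovery granularity rather than a duration, so it is not modeled here.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RecoveryDrill:
    """Hypothetical record of one failover drill."""

    failure_at: datetime           # when the outage began
    service_restored_at: datetime  # when new transactions could take place again
    last_committed_at: datetime    # last committed transaction before the failure
    last_recovered_at: datetime    # most recent transaction present after recovery

    @property
    def downtime(self) -> timedelta:
        return self.service_restored_at - self.failure_at

    @property
    def data_loss_window(self) -> timedelta:
        # The RPO gap: committed work that did not survive the recovery.
        return max(self.last_committed_at - self.last_recovered_at, timedelta(0))

    def meets(self, rto: timedelta, rpo: timedelta) -> dict[str, bool]:
        return {"rto_met": self.downtime <= rto, "rpo_met": self.data_loss_window <= rpo}


if __name__ == "__main__":
    # Failover to a replicated standby: recovered data trails the last commit by 0.5 s.
    replicated = RecoveryDrill(
        failure_at=datetime(2024, 1, 1, 12, 0, 0),
        service_restored_at=datetime(2024, 1, 1, 12, 45, 0),
        last_committed_at=datetime(2024, 1, 1, 11, 59, 59, 500000),
        last_recovered_at=datetime(2024, 1, 1, 11, 59, 59),
    )
    # Restore from the 11:00 hourly backup instead: nearly an hour of data is lost.
    from_backup = RecoveryDrill(
        failure_at=datetime(2024, 1, 1, 12, 0, 0),
        service_restored_at=datetime(2024, 1, 1, 12, 45, 0),
        last_committed_at=datetime(2024, 1, 1, 11, 59, 59, 500000),
        last_recovered_at=datetime(2024, 1, 1, 11, 0, 0),
    )
    targets = {"rto": timedelta(hours=1), "rpo": timedelta(minutes=1)}
    print(replicated.meets(**targets))   # both objectives met
    print(from_backup.meets(**targets))  # RTO met, RPO missed
```

The contrast between the two drills shows why sub-second replication matters: backups alone bound data loss by the backup interval, while a continuously replicated standby bounds it by replication lag.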

Best practices for High Availability and DR capabilities of a Relationship Business Management system:

  • Sub-second data replication across all datacenter facilities
  • Hourly backups for a revolving 24-hour window, stored locally and offsite
  • Daily backups for a revolving 30-day window, stored locally and offsite
  • Weekly backups for a revolving annual window, stored locally and offsite
  • 12 months of data stored onsite and offsite in a PCI-compliant manner
  • Backups are tested on a regular cadence: often daily, and at least weekly
  • Mock failover exercises are run as part of the business continuity plan

Zuora meets or exceeds these availability and DR requirements.
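The hourly, daily and weekly backup tiers listed above form a grandfather-father-son rotation. The sketch below decides which backups to keep under that policy, assuming hourly backup timestamps and keeping the newest backup of each calendar day and ISO week; the function name `backups_to_keep` and that tie-breaking rule are assumptions, and the sketch does not model storage location (local versus offsite) or PCI handling.

```python
from __future__ import annotations

from datetime import datetime, timedelta

HOURLY_WINDOW = timedelta(hours=24)
DAILY_WINDOW = timedelta(days=30)
WEEKLY_WINDOW = timedelta(days=365)


def backups_to_keep(backups: list[datetime], now: datetime) -> set[datetime]:
    """Hypothetical helper: backups retained by an hourly/daily/weekly rotation."""
    keep: set[datetime] = set()
    days_covered: set = set()
    weeks_covered: set = set()
    for ts in sorted(backups, reverse=True):  # newest first
        age = now - ts
        if age < timedelta(0):
            continue  # ignore timestamps later than `now`
        if age <= HOURLY_WINDOW:
            keep.add(ts)  # hourly tier: everything from the last 24 hours
        day = ts.date()
        if age <= DAILY_WINDOW and day not in days_covered:
            days_covered.add(day)
            keep.add(ts)  # daily tier: newest backup of each day
        week = tuple(ts.isocalendar())[:2]  # (ISO year, ISO week number)
        if age <= WEEKLY_WINDOW and week not in weeks_covered:
            weeks_covered.add(week)
            keep.add(ts)  # weekly tier: newest backup of each ISO week
    return keep


if __name__ == "__main__":
    now = datetime(2024, 6, 1)
    hourly = [now - timedelta(hours=h) for h in range(400 * 24)]
    kept = backups_to_keep(hourly, now)
    print(f"{len(hourly)} backups on disk, {len(kept)} retained")
```

Anything not returned is eligible for deletion, so the retained set stays roughly constant in size no matter how long backups have been running.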

Datacenter Operations

Leading SaaS application providers should choose data centers suited to mission-critical applications. Depending on requirements and mission criticality, a good provider may deliver services predominantly through a private cloud model or through a public cloud service. A hybrid approach helps balance scale-out agility against cost to serve.

Zuora has two state-of-the-art datacenter facilities to ensure the highest levels of security, performance, availability and DR failover. Zuora uses a private cloud for its production environments and a public cloud for development and test.

Both datacenter facilities are 100% synchronized. The second datacenter runs in warm standby and can take over full service capacity. The datacenters are located in separate disaster zones (Las Vegas and San Jose) and are used by notable customers with high security and availability requirements (e.g. Wells Fargo and PayPal).

Network infrastructure and latency

  • Fully redundant tier 1 carriers
  • 1 Gb/s North/South traffic, 40 Gb/s East/West traffic over InfiniBand
  • Fully redundant Juniper switching and routing gear
  • Web and application firewalls to defend against attacks
  • Access to a blended internet backbone comprised of 10+ carriers
  • DR site has 2 network carriers
  • Use of public cloud services for development and web services – delivers elastic capability for creating customer sandboxes
