Serverless computing has significantly simplified deploying code to production. Scaling, capacity planning, and maintenance are no longer much of a concern. Serverless lets you build cloud-native services quickly: with only a few commands, you can deploy code to numerous data centers across the world, giving a vast number of customers low-latency access to your services. One thing stands in the way, however: existing cloud databases don’t fit serverless applications.
There are a vast number of cloud database offerings, ranging from hosted open-source databases (e.g., Compose, Amazon RDS, Google Cloud SQL) to proprietary NoSQL solutions from major cloud vendors (e.g., DynamoDB, DocumentDB, Datastore). All of those solutions were created for applications that, in most cases, run continuously for many days in a row, from a single geographic location, on a fixed set of servers. That’s why we need a new type of database that shares the same principles as serverless when it comes to pricing model, global distribution, and zero maintenance demands.
In this post, I’m focusing solely on the database layer, but these concepts could apply to other backend systems like caches, queues, and messaging systems.
Cloud providers are launching in new regions at an incredible pace, attracting customers from different parts of the world. In 2016, AWS launched 5 new regions. In 2017, Google Cloud launched in 8 regions. Microsoft Azure has 36 regions in total. So why is it that most of the software we build still runs from a single location?
The first reason is cost. Having the same set of hosts, load balancers, and other cloud resources in multiple regions is expensive and adds a significant amount of operational complexity. Function-as-a-service providers like AWS Lambda solved this problem by having a pricing model in which we don’t pay for idleness.
The second reason for single-region applications is the difficulty of data replication and synchronization. Distributing a database across geographic regions is a challenging technical task that requires expertise in distributed consensus algorithms and the CAP theorem. Some of the most popular open-source database engines weren’t built with that in mind, and others, even if capable of geographic distribution (like Cassandra), require significant operational overhead.
Another factor is consistency. Ideally, a database for serverless should guarantee strong consistency, which many, if not most, use cases require. It’s worth mentioning that companies like Twitter and Google have built such databases for their internal use.
The landscape looks a bit better for databases provided by cloud vendors. Google’s Datastore supports multi-regional active-active replication, which is limited to nearby geographic areas on the same continent. Azure DocumentDB provides decent global distribution features with configurable levels of consistency. But like other cloud NoSQL solutions, it falls short in other categories that I will discuss shortly.
One of the main reasons why serverless is becoming successful is the pricing model. Developers don’t need to pay for unused resources, because the infrastructure automatically scales to meet current demand. And it’s dirt cheap.
One invocation of an AWS Lambda function costs $0.0000002
AWS Lambda charges for each function invocation. Databases, on the other hand, don’t offer the same granularity. Developers still need to pick an instance type for the database, pay for unused resources, and handle capacity planning. In the best case, developers pay for provisioned read/write throughput units and separately for storage (per GB per month). A pure pay-per-request model is not adequate for a database, because storage generates costs even when no requests arrive. The main requirement for a new type of database is that costs scale precisely with usage. Ideally, it would charge per operation plus storage, at the highest possible level of granularity (e.g., total costs = ops * single op cost + records * record cost). With that pricing model, calculating operational costs is a no-brainer.
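The formula above can be sketched as a quick back-of-the-envelope calculation. The per-operation and per-record prices below are hypothetical placeholders chosen only to illustrate the model (the per-operation price is borrowed from the real Lambda invocation price mentioned earlier):

```python
# Back-of-the-envelope estimate for the proposed pay-per-use model:
#   total = ops * single op cost + records * record cost
# Both prices are hypothetical, not any real vendor's rates.

PRICE_PER_OP = 0.0000002       # hypothetical: priced like a Lambda invocation
PRICE_PER_RECORD = 0.00000001  # hypothetical monthly storage price per record

def monthly_cost(operations: int, stored_records: int) -> float:
    """Costs scale linearly with usage; an idle database pays only for storage."""
    return operations * PRICE_PER_OP + stored_records * PRICE_PER_RECORD

# 10 million operations against 1 million stored records:
print(f"${monthly_cost(10_000_000, 1_000_000):.2f}")  # → $2.01
```

With a model like this, a month with zero traffic costs only the storage term, which is exactly the property the serverless pricing model promises.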
Operations like data replication and synchronization, adding or removing regions, and scaling underlying resources cannot require manual intervention. They need to be abstracted away and must not cause downtime. Maintenance windows have to become a thing of the past.
Going even further, developers should probably not even be responsible for picking geographic regions manually. Databases could provide low-latency access from different parts of the world either by replicating data to all available data centers by default, or by detecting where requests come from and automatically replicating data to the closest data center, much as a CDN does.
The relational model is useful. SQL is not.
There are tons of use cases for columnar, document-oriented, or graph databases. Still, the truth is that most consumer-facing applications use a relational model (with transactions and joins), because it prevents inconsistent and duplicated data and is generally more straightforward for a developer to manage.
You may be used to querying relational databases with SQL, a language that has some flaws: it doesn’t fit the object-oriented programming model, it’s too low-level, and it has no first-class support for hierarchical data (other than the JOIN statement). There is room for improvement in this area. GraphQL looks like a viable option for a modern database system’s query language.
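As a sketch of what that could look like, here is a hierarchical GraphQL query against a hypothetical schema (the `customer`, `orders`, and `items` fields are invented for illustration). It fetches nested data in a single request, where SQL would need joins or multiple round trips:

```graphql
# Hypothetical schema: fetch a customer together with their recent
# orders and line items; the query mirrors the shape of the result.
{
  customer(id: "42") {
    name
    orders(last: 10) {
      total
      items {
        productName
        quantity
      }
    }
  }
}
```

The response comes back as a JSON tree with the same nesting, so there is no flattened join result to reassemble in application code.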
I need something now!
The recently launched FaunaDB looks to be the closest to the ideal database for serverless. It’s a cloud-based, globally distributed database with strong consistency guarantees. I think it’s worth checking out, mainly because of their pricing model and query language. I’m planning to write a review of FaunaDB from the FaaS perspective. Stay tuned!