As someone who builds, advises, and invests in startups, I often get the question: What database should I choose for my project?
Like anything in technology, there are partisans for every solution out there. Rather than saying “You should choose MySQL” or “Postgres is the way,” I’d instead like to lay out a process for choosing the right technology for you.
Video
If you’d prefer to watch rather than read, I have a Youtube video where I talk through all of this:
Relational vs Non-Relational
The first fork in the road is choosing between relational and non-relational databases.
A relational database, like MySQL, PostgresQL, or SQLite, excels in modeling relationships in your data.
Think of a blog post. A blog post has an author. It also has comments. Every comment also has an author (perhaps modeled as a different type of user). These are all relationships.
In a relational system, you would model each of these separately. You might have a table for your posts, each of which has an id that points to the authors table. You might have a separate table for comments, where each comment has an id for the post and an id for the comment author. And so on.
Each of these tables would have a schema, or set of rules defining what kind of data is allowed in each column, as well as constraints establishing what sorts of relationships must exist between your data. If you want to guarantee that every blog post has an author, and that the author must have a first name and last name of type string, that’s trivially easy to accomplish.
A non-relational - or document - database, like MongoDB or Firebase, does this differently. You might have one table for your blog posts, and each record, or document, in that table would have everything it needs to display the post. That means the author, the comments, the authors for the comments - all of it. You embed that directly in what amounts to a single JSON column, traditionally without any schema to enforce rules on what makes it into the document (you would instead handle the validation in your application layer).
If you wanted another view of that data (for example, a list of all authors), you would traditionally store and manage that data independently of your posts. In essence, you would be de-normalizing (or duplicating) your data and bringing the complexity of syncing that data into your application layer.
This adds up to a big drag on your velocity. The net effect of putting all your data in a schemaless data store is that the data becomes easy to write and easy to scale, but much harder to verify for integrity. This is a problem that will bite you much sooner than you think. And it has absolutely destroyed companies in the past.
Before the Mongo fanboys object, yes, I understand that you can split this up into multiple tables (or collections in document DB parlance) and attempt to model relationships. But at that point, you’re just creating your own relational DB that’s objectively worse than the existing options. Why?
Unless you have a very good reason to do otherwise, I recommend you choose a relational database. In the case where you have truly non-relational data with a flexible schema (think, event logging or similar), just throw it in a JSON column in your relational DB of choice until the day you might, possibly, need a more specialized system for it.
Choosing a Relational DB
So you've decided on a relational database. How do you choose the right one? It's not just about picking the most popular option. Consider these factors.
In-House Expertise
If you have an expert in-house, you should probably choose that option. Every piece of technology has its footguns, and if there’s a solution you or a team member already knows really well, it’s usually better to go with what you know.
Specialized Capabilities
While most of the mainstream relational options are pretty comparable, there are certain cases where one option (generally Postgres via extensions) supports something another doesn’t.
There is a school of thought that says “use the best tool for the job” but in the context of a fast-moving startup, every new piece of infrastructure is an operational burden. Being able to get more out of your database if you need to can be a very good thing.
Here are some fairly unique capabilities that you can get out of Postgres:
Vector support via pgvector - if you’re doing anything with AI/ML, you will probably need vector embeddings. If pgvector works for your needs, it can save you having to work with Snowflake or Redis or another system.
Geospatial support via PostGIS - PostGIS probably offers the richest support for geospatial use cases. Although MySQL provides fairly rich support out of the box as well.
Time series data via TimescaleDB - If you need to aggregate large amounts of data (imagine feeding metrics by day/week/month to a dashboard), Timescale will handle setting up and refreshing Postgres partitions for you.
If you need any or all of these capabilities, you should strongly consider PostgreSQL. If you don’t, you should continue working your way down the decision tree.
Operational Simplicity
Another consideration is how easy it is to operate the database. If you have an expert in-house, that person will likely know a lot about this, and that’s a big part of why I recommend going with what you know.
The other caveat here is that many people choose to use a cloud database provider like AWS Aurora or Planetscale or Neon. In those cases, you are outsourcing the operational complexity and it effectively doesn’t matter for you.
But if you are going to be operating the database yourself, SQLite is probably the simplest option out there.
Before you object, hear me out. It’s literally just a file, the most operationally simple primitive you can imagine. For a long time now, SQLite has supported concurrent writes for great scalability. And with tools like Litestream, you can stream all of your database changes to a disaster recovery store (like S3) so that a catastrophic problem on your server doesn’t put your data at risk. There are even SQLite-compatible options like Turso that are attempting to distribute it to multiple nodes for even more robustness and performance.
So SQLite may have been a toy database in the past, but I think it’s a very defensible option for a production database today.
But in the case where you want or need to have a traditional database server setup, MySQL tends to have a more elegant upgrade and maintenance process. This is in part due to Oracle’s focus on supporting enterprise customers, as well as the incredible ecosystem of tools to make things like online schema changes relatively straightforward.
Postgres also spins up a new process for each connection instead of a thread, which generally means you’ll have to spin up pgBouncer (another system) to handle connection pooling.
This isn’t to say that Postgres is a bad option in this regard, but the story is stronger for the other offerings.
Scalability
I know everyone loves to ask “but does it scale?” And it’s not a bad question, but I’d also retort “YAGNI.” The truth is that 99% of projects will never outgrow a single database server, and the ones that do will have the money to figure it out, whether it’s a migration to MySQL + Vitess, Postgres + Citus, or some sort of bespoke architecture.
But let’s talk about the scalability characteristics regardless.
MySQL is used by some of the largest software companies on the planet. Facebook, Github, Airbnb, Uber, and many, many others all use it at incredible scale.
This is in large part thanks to Vitess, an open source project that came out of Youtube. Vitess sits between your application and MySQL, intelligently transforming and routing queries. Those queries can go to one or more shards, which are essentially database nodes that each hold part of your data. Provided you have a reliable way to isolate your data (for example, by each company using your service), you can scale to a theoretically infinite number of shards.
From your application’s perspective, you can still interface with a single “database,” and the fact that there is an entire cluster of nodes powering that interface is elegantly abstracted away from you.
Postgres has a similar offering in its ecosystem called Citus, and while it is impressive, it doesn’t have the adoption or battle-tested history that Vitess does.
To illustrate the contrast, read through this writeup of Figma’s Postgres scalability challenge: https://www.figma.com/blog/how-figma-scaled-to-multiple-databases/
Scaling a database is never trivial, but I imagine the writeup for a MySQL based company would have felt more like “we sharded our data using Vitess and called it a day.”
I would be remiss if I didn’t talk about SQLite here as well. Due to the fact that it’s a file that lives in-memory, it has some very unique performance characteristics. You don’t have to reach out to a separate server to get your data; it is right there, waiting to be queried, with effectively zero latency. It’s so fast that you might not even need a Redis cache to make your reads faster.
Of course, SQLite has limitations as well. There isn’t a mature solution for multi-node replication at the time of writing, so you’re more or less limited to what you can do on a single machine. But that ceiling is surprisingly high.
As one example, pocketbase, an open source project based on SQLite, advertises support for up to 10,000 concurrent users at a time.
Expensify (using a custom wrapper around SQLite) is able to support up to 4M QPS.
So while it might be a little non-traditional, it could be a compelling option under the right circumstances.
Conclusion
Just to wrap things up, here’s how I’d summarize:
Use a relational database unless you have a very good reason not to
If you have an expert on one of these technologies, go with what you know.
At the early stages of your company, strongly consider Postgres if it offers something you would otherwise need from a separate system (like support for vector embeddings)
For maximum operational simplicity, consider SQLite + Litestream. Or MySQL if you want/need a separate DB server and plan to manage it yourself.
For maximum long-term scalability, consider MySQL + Vitess.
For blazing-fast performance (i.e. zero latency), consider SQLite + Litestream if you can live within its constraints.