This post was updated to (1) specify the SQL implementations and engines I am focusing on, and (2) include a note on nullability.
Type soundness (more often referred to as type safety) can be vaguely defined as a large (and in some languages growing) collection of programming language features that make code easier to write, and less prone to bugs. These features often follow some common themes, including that they usually happen at compile-time (this sometimes means your editor will check your code as you write it). In this post, I attempt to briefly describe some of those features, with comments about how SQL fares in regard to them.
There are many different SQL implementations and engines, so here I focus on the ones I am most familiar with; those used in the data discipline, like BigQuery, Spark SQL, Trino, PostgreSQL, among others, even though this post should still apply to most SQL implementations out there. MySQL and SQLite are notable exceptions (more on this in a later post). I also make an effort to keep this article digestible and not delve into quirky behavior of which SQL admittedly has plenty.
Static typing
Static typing means your variables may only have one single type forever, a type that must be known at compile time. Python and JavaScript are dynamically (not statically) typed because it is possible to assign any value to any variable at any time (exceptions apply). The code below runs in both Python and JavaScript:
x = 1
x = "2"
x = {"3": 4}
Many other popular languages including Java and C are statically typed. Static typing is preferred because reasonable assumptions about a variable’s behavior are harder to break if its type is known. Even in dynamically typed languages like Python and JavaScript, programmers usually try to keep a single type throughout the lifetime of each variable.
How does SQL fare in terms of static typing?
All implementations of SQL I have ever come across in the data discipline are statically typed1. Not only is there no such thing as a variable reassignment in SQL, but also fields have compile-time known types. You can validate this by trying to run SELECT IF(TRUE, 'abc', 123)
on your favorite SQL engine.
My static typing score for SQL: 10/10
Strong typing
This term is sometimes used interchangeably with type safety and type soundness. It is also sometimes used in place of static typing. Here I take it to mean that values do not unexpectedly behave like other types. Because this is a vague definition (much like any other definition of strong typing out there), it is more useful to list examples of features that arguably constitute weak typing.
In C, variables have specific types, but every value can be naturally cast to a numerical value (e.g. strings are just pointers, whose underlying representations are integers, and string characters are also integers). To some, this makes C weakly typed. This is debatable. A fairer assessment is that C only has numerical primitive types, and allows lots of casting.
In JavaScipt, you can subtract a number from a string. In that operation, the string gets interpreted as a number. This is known as type coercion. You might say this makes JavaScript weakly typed, or you could argue that it simply allows more cross-type operations than other languages.
How strongly typed is SQL?
Most people will tell you that SQL is strongly typed. This is a fair assessment, but you can actually find some type coercion in it. For example, when comparing an integer with a floating point number, the integer is coerced to a float. In most scenarios, there is no good reason to be able to perform comparisons or operations between different numerical types. While numerical type coercion does not render the language terribly unsafe, a more explicit approach to numerical type conversion is often favored in modern type-sound programming languages.
My strong typing score for SQL: 9/10
(Built-In) Generics
Generics are what allows a programming language to, for instance, enforce distinctions between lists (arrays) of integers and lists of strings. Python, which is often considered to be strongly typed, does not make those distinctions. In Python, a lists of integers behaves exactly the same way as a list of strings, or a list of both. The absence of generic types is common in dynamically typed languages.
Are there generic types in SQL?
The answer is yes. In modern implementations of SQL, arrays have a single element type. And arrays with different element types are treated as distinct types. You may not, for instance, write SELECT ARRAY[1, '2']
or SELECT IF(TRUE, ARRAY[1, 2], ARRAY['3', '4'])
.
My generic typing score for SQL: 10/10
Custom types
The ability to define your own types is an important feature of some modern type-sound programming languages. One of the reasons you may wish to do this is to make sure your variables’ functionalities match their semantics. For instance, you may define types X
and Y
that represent a 2D vector’s x and y coordinates respectively (or perhaps a point’s latitude and longitude), so that X
and Y
are essentially floats, but you are forbidden from adding them up together without performing some explicit casting.
Most general-use programming languages have some degree of custom typing. In Python, you can define your own classes and overload operators like +
, *
, or even ==
for instances of those classes. Other languages go one step further by allowing custom generic types, which essentially serve to define common functionality across user-defined families of concrete types.
Does SQL allow custom typing?
There is a "user-defined type" concept in some SQL engines. But its purpose is often to allow for different forms of underlying data, rather than to enforce safer code. An ideal implementation of custom typing in SQL would allow me to treat a transactions
table’s amount
column as being of some Amount
type which, among other restrictions, shall not be multiplied times itself (what is a square dollar anyway?). In this ideal implementation, such enforcement would be possible regardless of the underlying database representation of that column, which for all I care could be a native Float
or Decimal
.
My custom typing score for SQL: 1/10
Other safety features
Type-sound programming languages often incorporate various features, such as immutability or robust pattern matching, that enhance the overall safety of the code written. Although these features may not be considered a direct part of type safety, they still contribute to the overall enforcement of safer programming practices.
SQL is a fully declarative language, which means that rather than telling the engine exactly what to do (as in imperative programming), programmers tell it which outcome they desire from it. This is an impressive language design feat considering the diverse ecosystems of tools and engines it supports, and the intricacy of the computations SQL queries can prompt, sometimes involving complex distributed systems. It also gives it an important advantage in terms of code safety and accessibility (ease of use, even for non-programmers) over competing imperative frameworks like Spark or Pandas.
My only complaint about the safety of SQL’s declarative syntax, is that it is perhaps too flexible. For instance, the non-enforcement of some keywords like AS
means that the missing comma in the query SELECT id name, address FROM people
, makes it return something completely different to what a programmer might expect from it. But this is a nitpick.
On the other hand, many modern SQL implementations used by data practitioners (e.g. Trino, BigQuery) lack enumeration types (enums), which they could greatly benefit from, along with some exhaustive form of pattern matching (not to be confused with string pattern matching) for a much more powerful CASE/WHEN
clause. More on this in a future post.
My “other features” score for SQL: 7/10
A note on nullability
Most SQL implementations do not include a concept of nullability. Instead, the compiler assumes all fields are nullable for the purposes of parsing and executing the query, and non-nullability is only enforced at runtime in the database layer (or even manually in a test layer).
This should probably reduce the language’s type safety score. But instead, I have decided to regard it as a deliberate language design decision for which there are very good reasons. Real-world data is messy, and the logic that determines whether a field is effectively nullable or not goes beyond what most of us would call “the type system.” For example, you can remove a field’s nullability by inner-joining on it. I will dig deeper into this semantic concept of nullability in a later post.
A brief conclusion
Overall SQL is, in many regards, a lot more type-sound than many popular programming languages like Python, JavaScript, or C. Its weakest points in terms of type safety are, in my opinion, the absence of a simple yet flexible implementation of compile-time custom types, along with some universal enum implementation (with exhaustive pattern matching). It could also do away with some flexibility features that can lead to unwitting error, such as type coercion and optional keywords (e.g. AS
, or UNNEST
without CROSS JOIN
).
My average type soundness score for SQL: 7.4/10
Do you have additional thoughts about SQL’s type soundness? Reach out or write them in this post’s comments.
You could argue that BigQuery has JavaScript functions, but those functions still need a static signature.
Nice write up, can I go on a bit of a tangent, and talk about how Types can be used to prevent SQL Injection… specifically, we know we need to use parameterised queries, but mistakes still happen, so we can introduce the idea of a “literal-string” type (a string defined in the program, aka by the programmer), these can be concatenated together, but cannot include user data. Enforcing this type on the SQL string ensures that it’s safe from an Injection vulnerability. This can be done in Python and PHP today, but also in other languages via similar techniques: https://eiv.dev/