Document main IR classes #11972

Merged
merged 19 commits into from
Jan 8, 2025
2 changes: 2 additions & 0 deletions docs/libraries/README.md
@@ -15,3 +15,5 @@ Documents in this section describe Enso's library ecosystem.
- [**Repositories:**](./repositories.md) Information on the structure of
repositories providing Enso libraries and Editions.
- [**Sharing Libraries:**](./sharing.md) Information on how to share libraries.
- [**Database IR:**](./database-ir.md) The backend-independent internal
representation used for database queries.
210 changes: 210 additions & 0 deletions docs/libraries/database-ir.md
@@ -0,0 +1,210 @@
---
layout: developer-doc
title: Database IR
Member

Are we still planning on renaming to SQL AST?

Contributor Author

Yes, definitely on the list.

category: libraries
tags: [libraries, databases, integrations]
order: 4
---

# Overview

The database internal representation (IR) is used to describe full SQL queries
and statements in a backend-neutral way. The IR is compiled to SQL in
`Base_Generator`, with backend-specific variations supplied by the `Dialect`
modules.

End-users do not use IR types directly; they interact with the `DB_Table` and
`DB_Column` types, which are analogous to the in-memory `Table` and `Column`
types. User-facing operations on these types do not immediately execute SQL in
the database backends; they only create IR. As a final step, the IR is compiled
into SQL and sent to the backend.

Informally, a "query" consists of a table expression and a set of column
expressions, roughly corresponding to:

```sql
select [column expression], [column expression]
from [table expression]
```

This terminology applies to both the user-facing and IR types, which represent
table and column expressions in multiple ways.

# Main IR Types

Column expressions are represented by `SQL_Expression`. `SQL_Expression` values
only have meaning within the context of a table expression; they do not contain
their own table expressions.

Table expressions are represented by the mutually-recursive types `From_Spec`
and `Context`.

Top-level queries and DDL/DML commands are represented by the `Query` type.

## SQL_Expression

Represents a column expression. Can be a single column (`Column`), a derived
expression built from other expressions (`Operation`), a constant value
(`Constant`, `Literal`, `Text_Literal`), or a let-binding (`Let` and `Let_Ref`).

`SQL_Expression`s only have meaning in the context of a particular table
expression; for example, a `SQL_Expression.Column` value consists of the
name/alias of a table expression and the name of a column within it.
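
As a rough illustration of how an expression tree can be built first and turned into SQL text only at the end, here is a minimal Python sketch. The classes borrow the variant names `Column`, `Operation`, and `Constant` from the text, but their fields and the `compile_expr` helper are assumptions made for this example; they are not the Enso definitions or the `Base_Generator` logic.

```python
from dataclasses import dataclass

# Toy stand-ins for IR nodes. Names and fields are hypothetical.
@dataclass(frozen=True)
class Column:
    table: str  # name/alias of the table expression
    name: str   # column name within that table expression

@dataclass(frozen=True)
class Constant:
    value: int  # integers only, to keep rendering trivially safe

@dataclass(frozen=True)
class Operation:
    op: str
    left: object
    right: object

def compile_expr(expr) -> str:
    """Render a toy expression tree as SQL text."""
    if isinstance(expr, Column):
        return f'"{expr.table}"."{expr.name}"'
    if isinstance(expr, Constant):
        return str(int(expr.value))
    if isinstance(expr, Operation):
        return f"({compile_expr(expr.left)} {expr.op} {compile_expr(expr.right)})"
    raise TypeError(f"unsupported node: {expr!r}")

# Building the tree runs no SQL; it only creates values.
price = Column("orders", "price")
doubled = Operation("*", price, Constant(2))

# Compilation happens once, as a final step.
sql = f'select {compile_expr(doubled)} from "orders"'
print(sql)
```

Note that the `Column` node names a table alias but carries no table expression of its own, which is why the `from "orders"` part has to be supplied separately.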

`Let` and `Let_Ref` variants are used to express let-style bindings using SQL
`with` syntax.
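
The source does not show the SQL that is generated for `Let` and `Let_Ref`, so the following is only a sketch of the underlying SQL `with` syntax, run against an in-memory SQLite database. The table `t` and the binding name `doubled` are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t (a integer)")
conn.executemany("insert into t values (?)", [(1,), (2,), (3,)])

# The binding "doubled" is defined once and referenced twice,
# much like a let-bound name that is used in several places.
rows = conn.execute(
    """
    with doubled as (select a * 2 as d from t)
    select d from doubled
    where d > (select min(d) from doubled)
    order by d
    """
).fetchall()
print(rows)
```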

## From_Spec

Represents a table expression. Can be a database table (`Table`), a derived
table built from other tables (`Join`, `Union`), or a constant value (`Query`,
`Literal_Values`).
Comment on lines +63 to +64
Member

I'd distinguish Query from Literal_Values - they differ quite a lot.

Suggested change
table built from other tables (`Join`, `Union`), or a constant value (`Query`,
`Literal_Values`).
table built from other tables (`Join`, `Union`), from a raw SQL query passed as text (`Query`), or constructed from constants
(`Literal_Values`).

Contributor Author

I think it's important that the first paragraph of each section gives a summary with a small set of broad categories, not necessarily comprehensive, to help understanding. Otherwise, this might as well be constructor-level documentation and should be in the source, rather than here. Both Query and Literal_Value are constants in the sense that they have meaning without additional context, and are not built out of other table expressions, so I think they go together.

But I do think the distinction is important so I added another section describing both the literal values.


A `Query` value is a complete SQL query, either as a single `Text` or as an
`SQL_Statement` built safely from strings and values. A `Literal_Values`
consists of a table-shaped vector-of-vectors of values and is compiled into an
inline literal SQL table expression.
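
To give a feel for what an inline literal table expression looks like, here is a small sketch using SQLite through Python. It assumes a `values` list in the `from` clause; the exact SQL emitted for `Literal_Values` by each dialect is not shown in the source, and the `column1`/`column2` names are SQLite's default naming, not something the IR defines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A table-shaped set of constants used directly as a table expression.
rows = conn.execute(
    """
    select column1, column2
    from (values (1, 'a'), (2, 'b'))
    order by column1
    """
).fetchall()
print(rows)
```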

`Sub_Query` is used to nest a query as a subquery, replacing column expressions
with aliases to those same column expressions within the subquery. This is used
Comment on lines +71 to +72
Member

I think this is not exactly true. The 'alias' is just one 'use-case'.

What Sub_Query does is it nests a full sub query into the From_Spec, meaning that a whole Context plus a set of column expressions (SQL_Expression) are nested in it and then a new set of columns that reference this new context can reference the columns from within that nested subquery.

It allows more than aliases, as the subquery can contain more complicated expressions that from now on can be referenced just by their names.

Well after a thought perhaps that's what you mean by 'replacing column expressions with aliases', but it was not immediately clear to me so I was wondering if we could add some details, and perhaps an example here?

To show that e.g. when we have a query SELECT 1+2*T.A, T.B FROM T the subquery allows to refer to the 'complex' expression 1+2*T.A by a simple alias name: e.g. becoming SELECT SUB.EXPR1, SUB.B FROM (SELECT 1+2*T.A AS EXPR1, T.B AS B FROM T) AS SUB.


I don't know, perhaps I'm overcomplicating this explanation too much. I just wanted to more clearly show that Sub_Query allows 'baking' in some complex expressions and giving them 'simpler' names - in that regard it has some similarity to the Let construct as well although it has different use-cases because it also creates the sub-expression which (as you noted) makes any ORDER BY etc. from the outer query independent from the inner.

Member

Ooops I completely missed that you actually described this very well in a section some lines below 🤦 Sorry.

The explanation below looks perfect. I'd then just add 'see section ... for more explanation of subqueries'.

Contributor Author

Done

to keep query elements such as `where`, `order by`, and `group by` separate to
prevent unwanted interactions between them. This allows `join` and `union`
operations on complex queries, as well as more specific operations such as
`DB_Table.add_row_number`. This is explained more fully below in the
[`Subqueries` section](#subqueries).

## Context

Represents a table expression, along with `where`, `order by`, `group by` and
`limit` clauses.

A `DB_Column` contains its own reference to a `Context`, so it can be read
without relying on the `DB_Table` object that it came from. In fact, `DB_Column`
values are standalone and not directly tied to a particular `DB_Table` instance.
Instead, they are connected to the `Context` objects they contain, and all
`DB_Column`s from a single table expression must share the same `Context`. This
corresponds to the idea that the column expressions in a `SELECT` clause all
refer to the same table expression in the `FROM` clause.

`DB_Column`s that share the same `Context` can also be merged into a single
`DB_Table`, e.g. via `DB_Table.set`, which allows more derived expressions to be
added to existing tables. Compatibility between `Context`s is verified by the
`Helpers.check_integrity` method.

## Query

A query (`Select`), or other DML or DDL command (`Insert`, `Create_Table`,
`Drop_Table`, and others).

# Relationships Between The Main Types

This section covers the main ways in which both the IR and user-facing types are
combined and nested to describe typical queries; it is not comprehensive.

A `DB_Table` serves as a user-facing table expression, and contains column
expressions as `Internal_Column`s and a table expression as a `Context`.

A `DB_Column` serves as a user-facing column expression, and contains a column
expression as an `SQL_Expression` and a table expression as a `Context`.

An `Internal_Column` serves as a column expression, and contains a
`SQL_Expression`, but no table expression. An `Internal_Column` is always used
inside a `DB_Table`, and inherits its table expression from the `DB_Table`'s
`Context`.

A `From_Spec` serves as a table expression, and corresponds to the `from` clause
of an SQL query. It can be a base value (table name, constant, etc.), join,
union, or subquery:

- `From_Spec.Join`: contains `From_Spec` values for the individual tables, as
well as `SQL_Expression`s for join conditions.
- `From_Spec.Union`: contains a vector of `Query` values for the individual
tables.
- `From_Spec.Sub_Query`: contains column expressions as `SQL_Expression`s, and a
table expression as a `Context`.

A `Context` serves as a table expression, and corresponds to the `from` clause
of an SQL query, as well as everything after the `from` clause, including
`where`, `order by`, `group by` and `limit` clauses.

# Subqueries

Subqueries are created using `Context.as_subquery`. They correspond to (and are
compiled into) subselects. This allows them to be referred to by an alias, and
also nests certain clauses (`where`, `order by`, `group by` and `limit`) in a
kind of 'scope' within the subselect so that they will not interfere with other
such clauses.

By itself, turning a query into a subquery does not change its value. But it
prepares it to be used in larger queries, such as ones formed with `join` and
`union`, as well as other more specific operations within the database library
(such as `DB_Table.add_row_number`).
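
The claim that wrapping a query in a subquery does not change its value can be checked on a concrete case. The sketch below uses the example query from the review discussion (`1 + 2 * t.a` exposed under the alias `expr1`) and runs both forms in SQLite; the table contents are made up, and the results are compared after sorting because row order is not part of the comparison.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t (a integer, b text)")
conn.executemany("insert into t values (?, ?)", [(1, "x"), (2, "y")])

# The original query, with a complex column expression.
original = conn.execute("select 1 + 2 * t.a, t.b from t").fetchall()

# The same query nested as a subquery; the outer level only uses aliases.
wrapped = conn.execute(
    """
    select sub.expr1, sub.b
    from (select 1 + 2 * t.a as expr1, t.b as b from t) as sub
    """
).fetchall()

print(sorted(original) == sorted(wrapped))
```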

In the IR, `Context.as_subquery` prepares a table expression for nesting, but
does not do the actual nesting within another query. To do the actual nesting,
you use the prepared subquery as a table expression within a larger query.

Creating a subquery consists of replacing complex column expressions with
aliases that refer to the original complex expressions within the nested query.
For example, a query such as

```sql
select [complex column expression 1],
[complex column expression 2]
from [complex table expression]
where [where clauses]
group by [group-by clauses]
order by [order-by clauses]
```

would be transformed into

```sql
select alias1, alias2
from (select [complex column expression 1] as alias1,
[complex column expression 2] as alias2
from [complex table expression]
where [where clauses]
group by [group-by clauses]
order by [order-by clauses]) as [table alias]
```

After this transformation, the top-level query has no `where`, `group by`, or
`order by` clauses. These can now be added:

```sql
select alias1, alias2
from (select [complex column expression 1] as alias1,
[complex column expression 2] as alias2
from [complex table expression]
where [where clauses]
group by [group-by clauses]
order by [order-by clauses]) as [table alias]
where [more where clauses]
group by [more group-by clauses]
order by [more order-by clauses]
```

Thanks to this nesting, there can be no unwanted interference between the
`where`, `group by`, or `order by` at different levels.
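
One way to see the kind of interference that nesting prevents is to combine a `limit` with a later `where`. The following sketch, with a made-up table `t`, compares the nested form against a naive single-level query in SQLite. It is an illustration of the SQL semantics only, not of how the IR is compiled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t (a integer)")
conn.executemany("insert into t values (?)", [(1,), (2,), (3,), (4,)])

# Intended meaning: take the two smallest rows, then keep those with a > 1.
nested = conn.execute(
    """
    select sub.a
    from (select a from t order by a limit 2) as sub
    where sub.a > 1
    order by sub.a
    """
).fetchall()

# Putting both steps at one level changes the meaning:
# the filter now runs before the limit.
flat = conn.execute(
    "select a from t where a > 1 order by a limit 2"
).fetchall()

print(nested, flat)
```

The nested query keeps only the row with `a = 2`, while the flattened one returns the rows with `a = 2` and `a = 3`.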

The added table alias allows join conditions to refer to the columns of the
individual tables being joined.

The `Context.as_subquery` method returns a `Sub_Query_Setup`, which contains a
table expression as a `From_Spec`, a set of simple column expressions as
`Internal_Column`s, and a helper function that can convert an original complex
`Internal_Column` into its simplified alias form.
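
The three parts described above can be mimicked in a few lines of Python. This is a toy model under stated assumptions: `ToyColumn`, `ToySubQuerySetup`, and `toy_as_subquery` are hypothetical names invented for this sketch, SQL is held as plain text rather than as IR values, and nothing here reflects the actual signature of `Context.as_subquery`.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass(frozen=True)
class ToyColumn:
    name: str
    expression: str  # SQL text of the column expression

@dataclass(frozen=True)
class ToySubQuerySetup:
    from_sql: str                             # the nested table expression
    new_columns: List[ToyColumn]              # simple alias references
    remap: Callable[[ToyColumn], ToyColumn]   # original column -> alias form

def toy_as_subquery(alias: str, columns: List[ToyColumn], context_sql: str) -> ToySubQuerySetup:
    """Nest the columns and context as a subselect and expose aliases."""
    select_list = ", ".join(f'{c.expression} as "{c.name}"' for c in columns)
    from_sql = f'(select {select_list} {context_sql}) as "{alias}"'
    by_name = {c.name: ToyColumn(c.name, f'"{alias}"."{c.name}"') for c in columns}
    return ToySubQuerySetup(from_sql, list(by_name.values()), lambda c: by_name[c.name])

cols = [ToyColumn("expr1", "1 + 2 * t.a"), ToyColumn("b", "t.b")]
setup = toy_as_subquery("sub", cols, "from t where t.a > 0")

# Use the prepared subquery as the table expression of a larger query.
outer_columns = ", ".join(c.expression for c in setup.new_columns)
sql = f"select {outer_columns} from {setup.from_sql}"
print(sql)
print(setup.remap(cols[0]).expression)
```

As in the description above, preparing the subquery and using it in an outer query are separate steps: `toy_as_subquery` only returns the pieces, and the caller assembles the larger query.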

# Context Extensions

TODO

# Additional Types

- SQL_Statement
- SQL_Fragment
- SQL_Builder
- SQL_Query

TODO