Document main IR classes #11972

Merged
merged 19 commits into from
Jan 8, 2025
2 changes: 2 additions & 0 deletions docs/libraries/README.md
@@ -15,3 +15,5 @@ Documents in this section describe Enso's library ecosystem.
- [**Repositories:**](./repositories.md) Information on the structure of
repositories providing Enso libraries and Editions.
- [**Sharing Libraries:**](./sharing.md) Information on how to share libraries.
- [**Database IR:**](./database-ir.md) The backend-independent internal
representation used for database queries.
198 changes: 198 additions & 0 deletions docs/libraries/database-ir.md
@@ -0,0 +1,198 @@
---
layout: developer-doc
title: Database IR
Member

Are we still planning on renaming to SQL AST?

Contributor Author

Yes, definitely on the list.

category: libraries
tags: [libraries, databases, integrations]
order: 4
---

# Overview

The database internal representation (IR) is used to describe full SQL queries
and statements in a backend-neutral way. The IR is compiled to SQL in
`Base_Generator`, with backend-specific variations supplied by the `Dialect`
modules.

End-users do not use IR types directly; they interact with the `DB_Table` and
`DB_Column` types, which are analogous to the in-memory `Table` and `Column`
types. User-facing operations on these types do not immediately execute SQL in
the database backends; they only create IR. As a final step, the IR is compiled
into SQL and sent to the backend.

Informally, a "query" consists of a table expression and a set of column
expressions, roughly corresponding to:

```sql
select [column expression], [column expression]
from [table expression]
```

This terminology applies to both the user-facing and IR types, which represent
table and column expressions in multiple ways.

# Main IR Types

Column expressions are represented by `SQL_Expression`. `SQL_Expression` values
only have meaning within the context of a table expression; they do not contain
their own table expressions.

Table expressions are represented by the mutually-recursive types `From_Spec`
and `Context`.

Top-level queries and DDL/DML commands are represented by the `Query` type.

## SQL_Expression

Represents a column expression. Can be a single column (`Column`), a derived
expression built from other expressions (`Operation`), or a constant value
(`Constant`, `Literal`, `Text_Literal`).

`SQL_Expression`s only have meaning in the context of a particular table
expression; for example, a `SQL_Expression.Column` value consists of the
name/alias of a table expression and the name of a column within it.

This also includes `Let` and `Let_Ref` variants which are used to express
let-style bindings using SQL `with` syntax.
Member

In the first paragraph it says that column expression is one of Column/Operation/Constant/Literal/Text_Literal, and only later we see it can also be the Let.

I'd rephrase to also include Let/Let_Ref in the first paragraph perhaps with a note "which are explained later".

Contributor Author

Done
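
As a rough illustration of the `SQL_Expression` section above, the sketch below models three of the variants as a small expression tree and renders it to SQL text. It is a toy model in Python, not the Enso implementation: the class names mirror the variants, but the fields, the `to_sql` helper, the quoting rules and the inlining of constants are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Column:
    # Name/alias of a table expression plus the name of a column within it.
    table_alias: str
    name: str


@dataclass(frozen=True)
class Constant:
    # A constant value; inlined as SQL text here purely for illustration.
    value: Union[int, float, str]


@dataclass(frozen=True)
class Operation:
    # An infix operator applied to other expressions (hypothetical shape).
    op: str
    args: Tuple["Expr", ...]


Expr = Union[Column, Constant, Operation]


def to_sql(expr: Expr) -> str:
    """Render the toy expression tree as SQL text, with no dialect handling."""
    if isinstance(expr, Column):
        return f'"{expr.table_alias}"."{expr.name}"'
    if isinstance(expr, Constant):
        if isinstance(expr.value, str):
            # Escape embedded single quotes by doubling them.
            return "'" + expr.value.replace("'", "''") + "'"
        return str(expr.value)
    if isinstance(expr, Operation):
        return "(" + f" {expr.op} ".join(to_sql(a) for a in expr.args) + ")"
    raise TypeError(f"unsupported expression: {expr!r}")


# The expression 1 + 2 * T.A, built from constants and a column reference.
example = Operation("+", (Constant(1), Operation("*", (Constant(2), Column("T", "A")))))
print(to_sql(example))
```

Note that the rendered text only makes sense when paired with a table expression that defines the alias `T`, which is the point made above about `SQL_Expression` values having no meaning on their own.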


## From_Spec

Represents a table expression. Can be a database table (`Table`), a derived
table built from other tables (`Join`, `Union`), or a constant value (`Query`,
`Literal_Values`).
Comment on lines +63 to +64
Member

I'd distinguish Query from Literal_Values - they differ quite a lot.

Suggested change
- table built from other tables (`Join`, `Union`), or a constant value (`Query`,
- `Literal_Values`).
+ table built from other tables (`Join`, `Union`), from a raw SQL query passed as text (`Query`) or constructed from constants
+ `Literal_Values`).

Contributor Author

I think it's important that the first paragraph of each section gives a summary with a small set of broad categories, not necessarily comprehensive, to help understanding. Otherwise, this might as well be constructor-level documentation and should be in the source, rather than here. Both Query and Literal_Values are constants in the sense that they have meaning without additional context, and are not built out of other table expressions, so I think they go together.

But I do think the distinction is important so I added another section describing both the literal values.


`Sub_Query` is used to nest a query as a subquery, replacing column expressions
with aliases to those same column expressions within the subquery. This is used
Comment on lines +71 to +72
Member

I think this is not exactly true. The 'alias' is just one 'use-case'.

What Sub_Query does is it nests a full sub query into the From_Spec, meaning that a whole Context plus a set of column expressions (SQL_Expression) are nested in it and then a new set of columns that reference this new context can reference the columns from within that nested subquery.

It allows more than aliases, as the subquery can contain more complicated expressions that from now on can be referenced just by their names.

Well after a thought perhaps that's what you mean by 'replacing column expressions with aliases', but it was not immediately clear to me so I was wondering if we could add some details, and perhaps an example here?

To show that e.g. when we have a query SELECT 1+2*T.A, T.B FROM T the subquery allows to refer to the 'complex' expression 1+2*T.A by a simple alias name: e.g. becoming SELECT SUB.EXPR1, SUB.B FROM (SELECT 1+2*T.A AS EXPR1, T.B AS B FROM T) AS SUB.


I don't know, perhaps I'm overcomplicating this explanation too much. I just wanted to more clearly show that Sub_Query allows 'baking' in some complex expressions and giving them 'simpler' names - in that regard it has some similarity to the Let construct as well although it has different use-cases because it also creates the sub-expression which (as you noted) makes any ORDER BY etc. from the outer query independent from the inner.

Member

Ooops I completely missed that you actually described this very well in a section some lines below 🤦 Sorry.

The explanation below looks perfect. I'd then just add 'see section ... for more explanation of subqueries'.

Contributor Author

Done

to keep query elements such as `where`, `order by`, and `group by` separate to
prevent unwanted interactions between them. This allows `join` and `union`
operations on complex queries, as well as more specific operations such as
`DB_Table.add_row_number`.

## Context

Represents a table expression, along with `where`, `order by`, `group by` and
`limit` clauses.

A `DB_Column` contains its own reference to a `Context`, so it can be read
without relying on the `DB_Table` object that it came from. In fact, `DB_Column`
values can be thought of as not being attached to a particular table. Instead,
Member

This sentence ("can be thought of as not being attached to a particular table") is a bit confusing to me.

The DB_Column is standalone and not necessarily directly tied to a DB_Table, but from SQL/DB standpoint it often is tied to some table (or more complex context).

Maybe,

Suggested change
- values can be thought of as not being attached to a particular table. Instead,
+ values are standalone and not directly tied to `DB_Table` instance. Instead,

Contributor Author

Done

they are connected to the `Context` objects they contain, and all `DB_Column`s
from a single table expression must share the same `Context`. This corresponds
to the idea that the column expressions in a `SELECT` clause all refer to the
same table expression in the `FROM` clause.
Member

Perhaps this may be worth mentioning:

And also we can 'merge' DB Columns that have the same Context into a single DB_Table e.g. via DB_Table.set, allowing to add more derived expressions to existing tables. This is verified by the check_integrity method.

Contributor Author

Done


## Query

A query (`Select`), or other DML or DDL command (`Insert`, `Create_Table`,
`Drop_Table`, and others).

# Relationships Between The Main Types

This section covers the main ways in which both the IR and user-facing types are
combined and nested to describe typical queries; it is not comprehensive.

A `DB_Table` serves as a user-facing table expression, and contains column
expressions as `Internal_Column`s and a table expression as a `Context`.

A `DB_Column` serves as a user-facing column expression, and contains a column
expression as an `SQL_Expression` and a table expression as a `Context`.

An `Internal_Column` serves as a column expression, and contains a
`SQL_Expression`, but no table expression. An `Internal_Column` is always used
inside a `DB_Table`, and inherits its table expression from the `DB_Table`'s
`Context`.

A `Context` serves as a table expression, but really inherits this from the
`From_Spec` that it contains. It also contains `where`, `order by`, `group by`
and `limit` clauses.

A `From_Spec` serves as a table expression, and can be a base value (table name,
constant, etc), join, union, or subquery:
Member

I'd rephrase something about this - both of these 'serve as a table expression' it is not bringing that much information. I think we can try to describe them in a more distinct way.

I'd say that (but still needs a bit better phrasing): the Context is everything that is after the FROM clause in SQL - from where we are taking the data (the From_Spec) as well as other modifiers - WHERE, ORDER BY etc. The From_Spec is then just the 'shape' that the FROM part itself can take. It is not a table expression on its own IMHO - it may just refer to tables or their combinations, but a table expression is more.

Member

Do you remember from the London meetings if we wanted to rename these? Because I think there were some suggestions but I don't remember now what they were. We should probably find the sketches we made.

Contributor Author

Done, mostly -- I made it clear that Context includes the from clause and everything after it. I think it's still useful to describe both of these as 'table expressions', since the two main categories here (table and column expressions) are important categories for understanding the whole IR. And both From_Spec and Context contain enough information to specify a result set (even though they are not used that way, directly).

We did have some ideas about renaming things; after we've agreed on this basic documentation I think that's the next step.


- `From_Spec.Join`: contains `From_Spec` values from the individual tables, as
well as `SQL_Expression`s for join conditions.
- `From_Spec.Union`: contains a vector of `Query` values for the individual
tables.
- `From_Spec.Sub_Query`: contains column expressions as `SQL_Expression`s, and a
table expression as a `Context`.
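
The containment relationships described in this section can be sketched with a few plain data classes. This is a simplified Python model written for illustration only: the class and field names are hypothetical stand-ins for the Enso types, strings stand in for `SQL_Expression` values, `From_Spec.Union` is left out for brevity, and `get_column` is an invented helper.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class InternalColumn:
    name: str
    expression: str  # stand-in for an SQL_Expression; no table expression here


@dataclass
class TableRef:  # stand-in for From_Spec.Table
    name: str
    alias: str


@dataclass
class Join:  # stand-in for From_Spec.Join
    left: "FromSpec"
    right: "FromSpec"
    on: List[str]  # join conditions (stand-ins for SQL_Expression values)


@dataclass
class SubQuery:  # stand-in for From_Spec.Sub_Query
    columns: List[InternalColumn]
    context: "Context"
    alias: str


FromSpec = Union[TableRef, Join, SubQuery]


@dataclass
class Context:
    # The from clause plus the clauses that follow it.
    from_spec: FromSpec
    where: List[str] = field(default_factory=list)
    order_by: List[str] = field(default_factory=list)
    group_by: List[str] = field(default_factory=list)
    limit: Optional[int] = None


@dataclass
class DBColumn:
    # User-facing column: carries its own Context.
    name: str
    expression: str
    context: Context


@dataclass
class DBTable:
    # User-facing table: Internal_Columns inherit the table's Context.
    columns: List[InternalColumn]
    context: Context

    def get_column(self, name: str) -> DBColumn:
        # Hypothetical helper: the extracted column shares the table's Context.
        col = next(c for c in self.columns if c.name == name)
        return DBColumn(col.name, col.expression, self.context)


ctx = Context(TableRef("T", "T"), where=['"T"."A" > 1'])
table = DBTable([InternalColumn("A", '"T"."A"'), InternalColumn("B", '"T"."B"')], ctx)
col_a = table.get_column("A")
print(col_a.context is table.context)
```

In this model a column taken from a table keeps a reference to the same `Context` object, which mirrors the requirement that all columns of one table expression share a single `Context`.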

# Subqueries

Subqueries are created using `Context.as_subquery`. They correspond to (and are
compiled into) subselects. This allows them to be referred to by an alias, and
also nests certain clauses (`where`, `order by`, `group by` and `limit`) in a
kind of 'scope' within the subselect so that they will not interfere with other
such clauses.

By itself, turning a query into a subquery does not change its value. But it
prepares it to be used in larger queries, such as ones formed with `join` and
`union`, as well as other more specific operations within the database library
(such as `DB_Table.add_row_number`).
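
The claim that wrapping a query as a subquery does not change its value can be checked directly against a database. The sketch below uses Python's standard `sqlite3` module with an in-memory database; the table `t`, its contents and the alias `sub` are made up for the example, and the SQL is written by hand rather than generated from the IR.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table t (a integer, b text);
    insert into t values (3, 'x'), (1, 'y'), (2, 'x'), (5, 'y');
""")

# A query with a non-trivial column expression and a where clause.
original = """
    select 1 + 2 * t.a as expr1, t.b as b
    from t
    where t.a > 1
"""

# The same query nested as a subselect; the outer query only refers to aliases.
wrapped = f"""
    select sub.expr1, sub.b
    from ({original}) as sub
"""

# Sort both results so the comparison does not depend on row order.
rows_original = sorted(conn.execute(original).fetchall())
rows_wrapped = sorted(conn.execute(wrapped).fetchall())
print(rows_original == rows_wrapped)
```

Both queries return the same rows; the wrapped form simply exposes them under the alias `sub`, ready to be joined or filtered further.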

In the IR, `Context.as_subquery` prepares a table expression for nesting, but
does not do the actual nesting within another query. To do the actual nesting,
you use the prepared subquery as a table expression within a larger query.

Creating a subquery consists of replacing complex column expressions with
aliases that refer to the original complex expressions within the nested query.
For example, a query such as

```sql
select [complex column expression 1],
       [complex column expression 2]
from [complex table expression]
where [where clauses]
group by [group-by clauses]
order by [order-by clauses]
```

would be transformed into

```sql
select alias1, alias2
from (select [complex column expression 1] as alias1,
             [complex column expression 2] as alias2
      from [complex table expression]
      where [where clauses]
      group by [group-by clauses]
      order by [order-by clauses]) as [table alias]
```

After this transformation, the top-level query has no `where`, `group by`, or
`order by` clauses. These can now be added:

```sql
select alias1, alias2
from (select [complex column expression 1] as alias1,
             [complex column expression 2] as alias2
      from [complex table expression]
      where [where clauses]
      group by [group-by clauses]
      order by [order-by clauses]) as [table alias]
where [more where clauses]
group by [more group-by clauses]
order by [more order-by clauses]
```

Thanks to this nesting, there can be no unwanted interference between the
`where`, `group by`, or `order by` at different levels.
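
To see why the separation matters, the sketch below compares a nested query with a flattened one that merges the same clauses into a single level. It uses Python's standard `sqlite3` module; the table `t`, its values and the alias `top2` are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table t (a integer);
    insert into t values (3), (1), (2), (5);
""")

# Nested: the inner order by/limit select the two largest values first,
# and only then does the outer where clause filter that two-row result.
nested = conn.execute("""
    select top2.a
    from (select t.a as a from t order by t.a desc limit 2) as top2
    where top2.a < 5
    order by top2.a
""").fetchall()

# Flattened: with every clause at one level, the where clause runs before
# the limit, so the query means something different.
flat = conn.execute("""
    select t.a from t where t.a < 5 order by t.a desc limit 2
""").fetchall()

print(nested, flat)
```

The nested form keeps the "two largest values" step intact before filtering, while the flattened form filters first and then takes two rows, so the results differ.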

The added table alias allows join conditions to refer to the columns of the
individual tables being joined.

The `Context.as_subquery` method returns a `Sub_Query_Setup`, which contains a
table expression as a `From_Spec`, a set of simple column expressions as
`Internal_Column`s, and a helper function that can convert an original complex
`Internal_Column` into its simplified alias form.
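
The shape of this result can be sketched as follows. This is a hypothetical Python model, not the Enso code: `as_subquery_sketch`, `SubQuerySetup` and their fields are invented names, and plain SQL strings stand in for the `From_Spec`, `Context` and `SQL_Expression` values.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class InternalColumn:
    name: str
    expression: str  # SQL text standing in for an SQL_Expression


@dataclass(frozen=True)
class SubQuerySetup:
    from_sql: str                      # stands in for the resulting From_Spec
    new_columns: List[InternalColumn]  # simple alias-based column expressions
    remap: Callable[[InternalColumn], InternalColumn]


def as_subquery_sketch(alias: str, columns: List[InternalColumn],
                       context_sql: str) -> SubQuerySetup:
    """Nest the columns and context as a subselect named `alias`."""
    select_list = ", ".join(f'{c.expression} as "{c.name}"' for c in columns)
    from_sql = f'(select {select_list} {context_sql}) as "{alias}"'
    # Each original column maps to a simple reference into the subselect.
    by_name = {c.name: InternalColumn(c.name, f'"{alias}"."{c.name}"')
               for c in columns}

    def remap(col: InternalColumn) -> InternalColumn:
        return by_name[col.name]

    return SubQuerySetup(from_sql, list(by_name.values()), remap)


columns = [InternalColumn("expr1", '1 + 2 * "t"."a"'),
           InternalColumn("b", '"t"."b"')]
setup = as_subquery_sketch("sub", columns, 'from "t" where "t"."a" > 1')
print(setup.from_sql)
print(setup.remap(columns[0]).expression)
```

The complex expression is baked into the subselect once, and the remapping helper returns the simple alias form that an enclosing query would use instead.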

# Context Extensions

TODO

# Additional Types

- SQL_Statement
- SQL_Fragment
- SQL_Builder
- SQL_Query

TODO