More edits and clean up
nassibnassar committed Jun 11, 2020
1 parent 1b646a6 commit 1ecfad1
Showing 4 changed files with 134 additions and 57 deletions.
166 changes: 120 additions & 46 deletions QUERIES.md
@@ -1,10 +1,13 @@
Query Writing
=============

This document describes techniques for creating, understanding, and revising LDP queries.
This document describes techniques for creating, understanding, and
revising LDP-based queries.


SQL Style
---------

* spacing/indentation
* four spaces for indents

@@ -13,7 +16,8 @@ SQL Style
```

* when listing more than 3 elements, put each on a new line (including first)
* when listing more than 3 elements, put each on a new line
(including first)

```
SELECT
@@ -23,7 +27,8 @@ SQL Style
...
```

* alternate: keep first element on same line, then indent remaining elements to the location of the first element
* alternate: keep first element on same line, then indent remaining
elements to the location of the first element

```
SELECT sp.name AS service_point_name,
@@ -32,7 +37,8 @@ SQL Style
...
```

* if you have parallel expressions, you may wish to use spacing to align similar elements
* if you have parallel expressions, you may wish to use spacing to
align similar elements

```
loan_date BETWEEN (SELECT start_date FROM parameters) AND
@@ -43,90 +49,158 @@ SQL Style
* write keywords in all caps. Examples:
* `SELECT`
* `'2019-01-01' :: DATE`
* always use `AS` for aliasing (columns, subqueries, tables, etc.)
* always use `AS` for aliasing (columns, subqueries, tables,
etc.)
* blank lines
* no blank lines
* punctuation
* `,` at end of line
* `(` at end of line
* `)` at beginning of line, lined up with keyword from line with (
* `)` at beginning of line, lined up with keyword from line
with (
* type conversion
* always use `' :: '` followed by data type in upper case (e.g., `VARCHAR`, `DATE`)
* always use `' :: '` followed by data type in upper case
(e.g., `VARCHAR`, `DATE`)
* comments
* `/* ... */` for multi-line comments
* `--` for single line comments
* file name
* use underscores instead of dashes
* selecting fields
* Do not use `SELECT *`. List all fields explicitly.
* (for joins, can join on whole table, don't need a subquery to limit the right table in the join)
* (for joins, can join on whole table, don't need a subquery
to limit the right table in the join)
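
The style rules above can be seen together in a short sketch. The table and column names here (`loans`, `service_points`, `checkout_service_point_id`, `item_status`) are hypothetical and only illustrate layout; they are not taken from the LDP schema.

```sql
-- Hypothetical example: loans since a given date, with service point names.
SELECT
    l.id AS loan_id,
    l.loan_date AS loan_date,
    sp.name AS service_point_name,
    l.item_status AS item_status
FROM loans AS l
    LEFT JOIN service_points AS sp
        ON l.checkout_service_point_id = sp.id
WHERE l.loan_date >= '2019-01-01' :: DATE;
```

The sketch uses four-space indents, upper-case keywords and data types, `AS` for every alias, commas at the end of each line, and an explicit field list instead of `SELECT *`.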


Structuring a Query
-------------------

1. header comment section
* last edited date? current as of?
* fields requested in output, in requested order
* any filters?
* aggregated or not? (how?)
* any other context necessary to understand query
* warning if query might result in more than 1 million rows (Excel)?
* warning if query might result in more than 1 million rows
(Excel)?
* have this header as a template in the documentation
2. parameters (using `WITH` statement)
* place parameters at beginning of file to make it easier for people to modify
* place parameters at beginning of file to make it easier for
people to modify
* always use name "parameters"
* avoid using parameter field names that duplicate LDP field names, if possible
* set default parameter values in a way that should guarantee the query will return some results, both for testing and for reassuring query users
* if filtering by a date range, use a default date range that is very large (10+ years), even if this query will typically be used for a single year
* if filtering by value in a particular field (e.g., a particular service point), consider using the most common value
3. additional `WITH` statements to label subqueries (see services\_usage query for example) - optional
* avoid using parameter field names that duplicate LDP field
names, if possible
* set default parameter values in a way that should guarantee
the query will return some results, both for testing and for
reassuring query users
* if filtering by a date range, use a default date
range that is very large (10+ years), even if this
query will typically be used for a single year
* if filtering by value in a particular field (e.g., a
particular service point), consider using the most
common value
3. additional `WITH` statements to label subqueries (see
services\_usage query for example) - optional
4. primary query
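
As a rough illustration of this four-part structure, a query file might be laid out as follows. This is a minimal sketch: the header fields, the `loans_in_range` subquery, and the `loans` and `service_points` tables with their columns are assumptions made for the example, not an agreed template.

```sql
/*
 * Loans by service point (hypothetical example)
 * Output fields: service_point_name, loan_count
 * Filters: loan_date within the parameter date range
 * Aggregation: count of loans per service point
 */
WITH parameters AS (
    SELECT
        '2000-01-01' :: DATE AS start_date,
        '2050-01-01' :: DATE AS end_date
),
loans_in_range AS (
    SELECT
        l.id AS loan_id,
        l.service_point_id AS service_point_id
    FROM loans AS l
    WHERE l.loan_date >= (SELECT start_date FROM parameters)
        AND l.loan_date < (SELECT end_date FROM parameters)
)
SELECT
    sp.name AS service_point_name,
    COUNT(lr.loan_id) AS loan_count
FROM loans_in_range AS lr
    LEFT JOIN service_points AS sp
        ON lr.service_point_id = sp.id
GROUP BY sp.name;
```

The default date range is deliberately wide so that the query is likely to return rows when someone runs it unmodified; users narrow it by editing only the `parameters` block at the top.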


Details on Specific Strategies
------------------------------

* `WITH` statements
* can use `WITH` to create temporary tables at the beginning of the query that then get used later
* last `WITH` statement goes straight into primary `SELECT` statement for query, do not need a comma after last `WITH` statement
* while in `WITH` statements you can specify the column names before the `SELECT` statement, the code is more readable if you continue to alias the columns with `AS` inside the `SELECT` statement (see services\_usage query)
* [modern SQL article on WITH statements](https://modern-sql.com/feature/with)
* [using WITH statements to create Literate SQL](https://modern-sql.com/use-case/literate-sql)
* can use `WITH` to create temporary tables at the beginning
of the query that then get used later
* last `WITH` statement goes straight into primary `SELECT`
statement for query, do not need a comma after last `WITH`
statement
* while in `WITH` statements you can specify the column names
before the `SELECT` statement, the code is more readable if
you continue to alias the columns with `AS` inside the
`SELECT` statement (see services\_usage query)
* [modern SQL article on WITH
statements](https://modern-sql.com/feature/with)
* [using WITH statements to create Literate
SQL](https://modern-sql.com/use-case/literate-sql)
* Catching empty string & null values
* if you are just selecting a column that may have a null value, you don't need to do anything special
* if you are transforming the column in some way, like using it in a mathematical calculation or extracting some part of the value, you need to test for a null value or empty string
* one way might be `COALESCE`, which allows you to specify a default value if the result is null
* if you are just selecting a column that may have a null
value, you don't need to do anything special
* if you are transforming the column in some way, like using
it in a mathematical calculation or extracting some part of
the value, you need to test for a null value or empty string
* one way might be `COALESCE`, which allows you to specify a
default value if the result is null
* Picking which table to select from first
* when writing a query, it's important to think through which table you list first in the `SELECT` statement because of the joins that will build on it
* start with the table that best represents what you want on each line of the results table
* for example, if you ultimately want a list of loans, start with loans table
* when writing a query, it's important to think through which
table you list first in the `SELECT` statement because of
the joins that will build on it
* start with the table that best represents what you want on
each line of the results table
* for example, if you ultimately want a list of loans,
start with loans table
* `LEFT JOIN` vs. `INNER JOIN`
* in general, using `LEFT JOIN` makes sure you don't accidentally lose the items you're most interested in
* for example, if you're interested in loans and also want to see the demographics of the users making the loan, you can use `LEFT JOIN` to keep all loans even if you don't know the user's demographics
* if you are filtering a table based on a field in a secondary table, you may instead want to use INNER JOIN to make sure to exclude records that don't have the required value
* in general, using `LEFT JOIN` makes sure you don't
accidentally lose the items you're most interested in
* for example, if you're interested in loans and also
want to see the demographics of the users making the
loan, you can use `LEFT JOIN` to keep all loans even
if you don't know the user's demographics
* if you are filtering a table based on a field in a secondary
table, you may instead want to use INNER JOIN to make sure
to exclude records that don't have the required value
* `BETWEEN`
* note that using `BETWEEN` for dates is risky because it only includes records up to midnight of the end date (essentially, the end of the day before, but it will include items exactly at midnight of the end date)
* if you do use `BETWEEN`, try to educate people about its behavior in comments and set default values that make sense for the behavior (e.g., the first day of one year and the first day of the following year, instead of the last day of the year)
* if you do not want to risk including values from midnight of the end date, you can use `>= start_date` and `< end_date` instead of `BETWEEN`. This is like using `BETWEEN` except that you use `<` instead of `<=`. You still have to use an end date that will not be included in the date range (i.e., the day after the last day you want included).
* [stack overflow question on querying between date ranges](https://stackoverflow.com/questions/23335970/postgresql-query-between-date-ranges)
* note that using `BETWEEN` for dates is risky because it only
includes records up to midnight of the end date
(essentially, the end of the day before, but it will include
items exactly at midnight of the end date)
* if you do use `BETWEEN`, try to educate people about its
behavior in comments and set default values that make sense
for the behavior (e.g., the first day of one year and the
first day of the following year, instead of the last day of
the year)
* if you do not want to risk including values from midnight of
the end date, you can use `>= start_date` and `< end_date`
instead of `BETWEEN`. This is like using `BETWEEN` except
that you use `<` instead of `<=`. You still have to use an
end date that will not be included in the date range (i.e.,
the day after the last day you want included).
* [stack overflow question on querying between date
ranges](https://stackoverflow.com/questions/23335970/postgresql-query-between-date-ranges)
* DRY - Don't Repeat Yourself
* as with any programming, the more repetition you have in your query, the more likely you are to forget to update something or make a mistake the second time around
* try to find a way to reuse parts of your query creatively, either with parameters or `WITH` statements
* as with any programming, the more repetition you have in
your query, the more likely you are to forget to update
something or make a mistake the second time around
* try to find a way to reuse parts of your query creatively,
either with parameters or `WITH` statements
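
Several of the strategies above can be combined in one short sketch: a `parameters` block, a half-open date range in place of `BETWEEN`, a `LEFT JOIN` that keeps every loan, and `COALESCE` with `NULLIF` to catch both null and empty-string values. The `loans` and `users` tables and the `user_id` and `patron_group` columns are hypothetical names chosen for illustration.

```sql
WITH parameters AS (
    SELECT
        '2019-01-01' :: DATE AS start_date,
        '2020-01-01' :: DATE AS end_date -- day after the last day wanted
)
SELECT
    l.id AS loan_id,
    l.loan_date AS loan_date,
    -- NULLIF turns '' into NULL so COALESCE can supply one default for both.
    COALESCE(NULLIF(u.patron_group, ''), 'unknown') AS patron_group
FROM loans AS l
    -- LEFT JOIN keeps loans whose user record is missing.
    LEFT JOIN users AS u
        ON l.user_id = u.id
WHERE l.loan_date >= (SELECT start_date FROM parameters)
    AND l.loan_date < (SELECT end_date FROM parameters);
```

With `<` on the end date, a loan stamped exactly at midnight on 2020-01-01 is excluded, whereas `BETWEEN` would include it. Switching the join to `INNER JOIN` would instead drop any loan without a matching user, which is only what you want when filtering on a user field.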


Accommodating Redshift
----------------------
* Why accommodate Redshift
* PostgreSQL is great and free, but requires local hosting
* Amazon Redshift requires Amazon hosting, but is often faster than locally-hosted PostgreSQL
* LDP is designed to run on either PostgreSQL or Redshift, so LDP queries also need to run on both
* Redshift SQL is based on (an older version of) PostgreSQL, but there are differences that mean that not everything that runs on PG can run on Redshift (and vice versa)

* General notes
* FOLIO reporting supports LDP-based querying on both
PostgreSQL and Redshift, so queries also need to run on both
* Redshift's dialect of SQL is largely based on an old version
of PostgreSQL, but there are differences that mean that not
everything that runs on PostgreSQL can run on Redshift (and
vice versa)
* JSON functions
* PostgreSQL has much better JSON support than Redshift. Redshift can pretty much only use `json_extract_path_text()`
* [Redshift JSON functions](https://docs.aws.amazon.com/redshift/latest/dg/json-functions.html)
* PostgreSQL has much better JSON support than Redshift.
Redshift can pretty much only use `json_extract_path_text()`
* [Redshift JSON
functions](https://docs.aws.amazon.com/redshift/latest/dg/json-functions.html)
* explicit casting
* for anything in a query that doesn't come from a database table, you will need to explicitly state (or "cast") the value to a data type
* example: anything in the parameters temporary table will need explicit casting
* example: if you are adding an ad hoc field with a static value into your table (see services\_usage), the static value will need explicit casting
* for anything in a query that doesn't come from a database
table, you will need to explicitly state (or "cast") the
value to a data type
* example: anything in the parameters temporary table
will need explicit casting
* example: if you are adding an ad hoc field with a
static value into your table (see services\_usage),
the static value will need explicit casting
* other issues
* had some trouble with the date_part functions and Redshift documentation was not helpful; took a fair amount of trial-and-error to figure out the right pattern (see services\_usage query)
* had some trouble with the date_part functions and Redshift
documentation was not helpful; took a fair amount of
trial-and-error to figure out the right pattern (see
services\_usage query)
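
A minimal sketch of the casting and JSON points above follows. It assumes a hypothetical `loans` table whose `data` column holds the JSON record in a type that `json_extract_path_text()` accepts on the platform in use, and a hypothetical `status.name` path; neither is confirmed by this document.

```sql
SELECT
    -- Static value not drawn from a table: cast it explicitly.
    'loan' :: VARCHAR AS service_type,
    -- Path elements are passed as separate arguments.
    json_extract_path_text(l.data, 'status', 'name') AS loan_status
FROM loans AS l;
```

Without the `:: VARCHAR` cast, the literal has no declared type, which is the case the explicit-casting rule is meant to cover.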


12 changes: 6 additions & 6 deletions README.md
@@ -20,11 +20,11 @@ Interest Group](https://wiki.folio.org/display/RPT/).

## How to use this repository

LDP queries are written in SQL and are designed to execute correctly
on either the PostgreSQL or Redshift installation of the LDP. To use
these queries, you will need to connect to an instance of the LDP
using a reporting tool that supports SQL scripts. Examples of
reporting tools that will execute SQL scripts include:
LDP-based queries are written in SQL and are designed to execute
correctly on either PostgreSQL or Redshift. To use these queries, you
will need to connect to an LDP database instance using a reporting
tool that supports SQL scripts. Examples of reporting tools that will
execute SQL scripts include:

* Microsoft Access
* DBeaver
@@ -37,7 +37,7 @@ If none of the queries provided match your needs, you can look for an
existing query to use as a starting point and edit the query to create
the desired output. The [LDP User
Guide](https://github.com/folio-org/ldp/blob/master/doc/User_Guide.md)
includes LDP-specific guidelines for query writing.
includes guidelines for query writing.


## Queries
4 changes: 2 additions & 2 deletions TESTING.md
@@ -54,7 +54,7 @@ dbname = <database_name>
For example:

```ini
databases = ldpqdev,rs-ldpqdev
databases = ldpqdev,redshift_ldpqdev

[ldpqdev]
dbtype = postgresql
@@ -64,7 +64,7 @@ user = ldp
password = YS4p4EkJGWJqbO9w
dbname = ldpqdev

[rs-ldpqdev]
[redshift_ldpqdev]
dbtype = redshift
host = ldpqdev.hfwgaxcbvs5t.us-east-2.redshift.amazonaws.com
port = 5439
9 changes: 6 additions & 3 deletions sql/README.md
@@ -1,8 +1,11 @@
# The FOLIO LDP Query Repository
# The FOLIO Query Repository

This repository stores shared queries designed to produce reports of
FOLIO data in local LDP instances. These queries have been developed
by the [FOLIO Reporting SIG](https://wiki.folio.org/display/RPT/).
FOLIO data in a local LDP instance. These queries have been developed
by the FOLIO reporting community. For more information about FOLIO
reporting, see the [FOLIO Reporting
SIG](https://wiki.folio.org/display/RPT/).


## How to find a query

