feat(pyclient): add support for GraphQL API to 'get' method (#4558)
* started creating parse function for GraphQL query from schema metadata
* moved 'parse_query' to Client class
* added column types to new constants.py
* started creating 'get_pkeys' for metadata.py
* created function 'parse_nested_pkeys'
* finished '_parse_get_table_query' method
* implemented 'columns' filter in `get`
* added check for None columns
* split filter method results for CSV and GraphQL API
* fixed logging issues
* added dtype conversion to DataFrame output
* fixed column type float to decimal
* added remaining types
* fixed server URL ending with '/'
* updated dev script for catalogue model
* implemented truncate
* fixed imports
* fixed truncate GraphQL url
* created `ReferenceException`
* refactored table names 'Collections', 'Cohorts' to 'Resources'
* small fixes
* fixed examples in dev.py
* updated README.md
* removed redundant script
* fixed GraphQL query for column type FILE
* added parser for ontology columns, top-level only
* implemented parser for ontology columns nested in ref/ref_array/refback columns
* fixed `as_df=False` to return an empty list instead of None for an empty table
* improved parsing for cases where ref columns reference ontology tables
* fixed referencing tables in other schemas
* replaced dtype 'bool' by 'boolean' for data type BOOL
* added data type LONG for conversion
* fixed issue with rounding of numeric values in string type columns
* moved the GraphQL option of `get` to a separate method `get_graphql`
* restored previous behaviour of `get(as_df=False)`
* fixed error in parsing the datetime data type
* updated docs
* corrected docstrings
* updated documentation
* updated changelog
YpeZ authored Feb 10, 2025
1 parent dd8af51 commit bc3e226
Showing 10 changed files with 540 additions and 118 deletions.
161 changes: 110 additions & 51 deletions docs/molgenis/use_usingpyclient.md
The MOLGENIS EMX2 Python client allows the user to retrieve, create, update and delete data.

## Installation
The releases of the package are hosted at [PyPI](https://pypi.org/project/molgenis-emx2-pyclient/).
The recommended way to install the latest version is through _pip_:

```commandline
pip install molgenis-emx2-pyclient
```

## Setting up the client
The Python client can be integrated in scripts authorized by either a temporary token or a username/password combination.
URLs of EMX2 instances on remote servers must start with `https://`.
It is possible to use the Pyclient on a server running on a local machine. The URL should then be passed as `http://localhost:PORT`.

The recommended method of authorization in the Pyclient is a token, which can be generated in the UI following the instructions in [Tokens](use_tokens.md).
The token can then be passed as an argument in the initialization of the Client object.
It is recommended to store the token as an environment variable, so that it can be read in and used as follows:
```python
import os
from molgenis_emx2_pyclient import Client

token = os.environ.get("MOLGENIS_TOKEN")

with Client(url='https://example.molgeniscloud.org', token=token) as client:
    # Perform tasks
    ...

```
Signing in with a username/password combination is done using the `signin` method:
```python
from molgenis_emx2_pyclient import Client

username = 'username'
password = '********'

with Client(url='https://example.molgeniscloud.org') as client:
    client.signin(username, password)

    # Perform tasks
    ...
```

If the client is only to be used for retrieving information from publicly viewable schemas, no authorization is needed.

Additionally, if the Pyclient is to be used on a particular schema, this schema can be supplied in the initialization of the client, alongside the server URL:
```python
with Client('https://example.molgeniscloud.org', schema='My Schema') as client:
...
```
or
```python
client = Client('https://example.molgeniscloud.org', schema='My Schema', token=token)
```

### Scripts and Jobs
When using the client in a script that runs as part of a job via the [Task API](use_scripts_jobs.md), it is essential
Raises the `TokenSigninException` when the client is already signed in with a username/password combination.

### get
```python
def get(self,
        table: str,
        columns: list[str] = None,
        query_filter: str = None,
        schema: str = None,
        as_df: bool = False) -> list | pandas.DataFrame:
    ...
```
Retrieves data from a table on a schema using the CSV API and returns the result either as a list of dictionaries or as a pandas DataFrame.
Use the `columns` parameter to specify which columns to retrieve. By default, all columns are returned.
Use the `query_filter` parameter to filter the results based on filters applied to the columns.
This query requires a special syntax.
Values in columns can be filtered on equality (`==`), inequality (`!=`), greater than (`>`) and smaller than (`<`).
Values within an interval can be filtered using the operand `between`, followed by a list of the lower and upper bound.
The values of reference and ontology columns can be filtered by joining the column id in the table with the column id in the referenced table with a dot, as in `countries.name`, where `countries` is a column in the table and `name` is the column in the referenced table specifying the names of countries.
Filters on multiple columns can be combined by separating the filter statements with _' and '_.
It is recommended to pass the values that are compared as variables in an f-string.
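
As a sketch of this syntax, filters can be assembled as plain strings; the column ids below (`numberOfParticipants`, `countries.name`) are illustrative and must exist in the queried table:

```python
# Illustrative query_filter strings; the column ids are examples, not a real schema.
min_participants = 10000
country = 'Netherlands'

# Two conditions joined by ' and ' (note the surrounding spaces).
query_filter = (f'numberOfParticipants > {min_participants}'
                f' and countries.name == {country}')
# -> 'numberOfParticipants > 10000 and countries.name == Netherlands'

# An interval filter using the 'between' operand with lower and upper bounds.
bounds = [5000, 10000]
between_filter = f'numberOfParticipants between {bounds}'
```

Building filters this way avoids the easy mistake of dropping the spaces around `and` when concatenating adjacent f-strings.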

Throws the `NoSuchSchemaException` if the user does not have at least _viewer_ permissions or if the schema does not exist.
Throws the `NoSuchColumnException` if the `columns` argument or query filter contains a column that is not present in the table.


| parameter | type | description | required | default |
|----------------|------|--------------------------------------------------------------------------------|----------|---------|
| `table` | str | the name of a table | True | None |
| `columns` | list | a list of column names to return | False | None |
| `schema` | str | the name of a schema | False | None |
| `query_filter` | str | a string to filter the results on | False | None |
| `as_df` | bool | if true: returns data as pandas DataFrame <br/> else as a list of dictionaries | False | False |

##### examples

```python
# Get all entries for the table 'Resources' on the schema 'MySchema'
table_data = client.get(table='Resources', schema='MySchema', columns=['name', 'collectionEvents'])

# Set the default schema to 'MySchema'
client.set_schema('MySchema')
# Get the same entries and return them as pandas DataFrame
table_data = client.get(table='Resources', columns=['name', 'collection events'], as_df=True)

# Get the entries where the value of a particular column 'number of participants' is greater than 10000
table_data = client.get(table='Resources', query_filter='numberOfParticipants > 10000')

# Get the entries where 'number of participants' is greater than 10000 and the resource type is a 'Population cohort'
# Store the information in variables, first
min_subpop = 10000
cohort_type = 'Population cohort'
table_data = client.get(table='Resources', query_filter=f'numberOfParticipants > {min_subpop}'
                                                        f' and cohortType == {cohort_type}')
```


### get_graphql
```python
def get_graphql(self,
                table: str,
                columns: list[str] = None,
                query_filter: str = None,
                schema: str = None) -> list:
    ...
```
Retrieves data from a table on a schema using the GraphQL API and returns the result as a list of dictionaries.
This method and its parameters behave similarly to `get` with the option `as_df=False`.
The results are returned in a slightly different way, however:
`get` retains the column _names_, whereas `get_graphql` returns column _ids_, which are in lower camel case.
Furthermore, the `get` method returns only the values in columns with a reference type, while the results of `get_graphql` also contain the primary keys for those columns.
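
As a rough illustration of the relationship between column names and ids (a sketch of an assumed naming convention, not EMX2's actual implementation), a spaced column name maps to a lower-camel-case id:

```python
def to_column_id(name: str) -> str:
    """Sketch: derive a lower-camel-case column id from a spaced column name."""
    first, *rest = name.split()
    return first.lower() + ''.join(word.capitalize() for word in rest)

print(to_column_id('collection events'))       # collectionEvents
print(to_column_id('number of participants'))  # numberOfParticipants
```

In practice, the column ids should be taken from the schema metadata rather than derived from the names.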

Throws the `NoSuchSchemaException` if the user does not have at least _viewer_ permissions or if the schema does not exist.
Throws the `NoSuchColumnException` if the `columns` argument or query filter contains a column that is not present in the table.


| parameter | type | description | required | default |
|----------------|------|-----------------------------------|----------|---------|
| `table` | str | the name of a table | True | None |
| `columns`      | list | a list of column ids to return    | False    | None    |
| `schema` | str | the name of a schema | False | None |
| `query_filter` | str | a string to filter the results on | False | None |

##### examples

```python
# Get all entries for the table 'Resources' on the schema 'MySchema'
table_data = client.get_graphql(table='Resources', schema='MySchema', columns=['name', 'collectionEvents'])

# Set the default schema to 'MySchema'
client.set_schema('MySchema')

# Get the entries where the value of a particular column 'number of participants' is greater than 10000
table_data = client.get_graphql(table='Resources', query_filter='numberOfParticipants > 10000')

# Get the entries where 'number of participants' is greater than 10000 and the resource type is a 'Population cohort'
# Store the information in variables, first
min_subpop = 10000
cohort_type = 'Population cohort'
table_data = client.get_graphql(table='Resources', query_filter=f'numberOfParticipants > {min_subpop}'
                                                                f' and cohortType == {cohort_type}')
```


### get_schema_metadata
```python
def get_schema_metadata(self, name: str = None) -> Schema:
    ...
```

### export
Throws the `NoSuchSchemaException` if the user does not have at least _viewer_ permissions or if the schema does not exist.
##### examples
```python

# Export the table 'Resources' on the schema 'MySchema' from the CSV API to a BytesIO object
resources_raw: BytesIO = await client.export(schema='MySchema', table='Resources')

# Export 'Resources' from the Excel API to the file 'Resources-export.xlsx'
await client.export(schema='MySchema', table='Resources', filename='Resources-export.xlsx')
```


Throws the `NoSuchSchemaException` if the schema is not found on the server.

##### examples
```python
# Save an edited table with Resources data from a CSV file to the Resources table
client.save_schema(table='Resources', file='Resources-edited.csv')

# Save an edited table with Resources data from memory to the Resources table
resources: pandas.DataFrame = ...
client.save_schema(table='Resources', data=resources)
```

### upload_file
Expand All @@ -269,8 +319,8 @@ Throws the `NoSuchSchemaException` if the schema is not found on the server.

##### examples
```python
# Upload a file containing Resources data to a schema
await client.upload_file(file_path='data/Resources.csv')

# Upload a file containing members information to a schema
await client.upload_file(file_path='molgenis_members.csv', schema='MySchema')
Expand Down Expand Up @@ -306,18 +356,27 @@ Throws the `NoSuchSchemaException` if the schema is not found on the server.

##### examples
```python
# Delete resources from a list of ids
resources = [{'name': 'Resource 1'}, {'name': 'Resource 2'}]
client.delete_records(schema='MySchema', table='Resources', data=resources)

# Delete resources from pandas DataFrame
resources_df = pandas.DataFrame(data=resources)
client.delete_records(schema='MySchema', table='Resources', data=resources_df)

# Delete resources from entries in a CSV file
client.delete_records(schema='MySchema', table='Resources', file='Resources-to-delete.csv')
```

### truncate
```python
client.truncate(table='My table', schema='My Schema')
```
Truncates the table, removing all its contents.

Throws the `ReferenceException` if entries in the table are referenced in other tables.

### create_schema
```python
async def create_schema(self,
                        ...):
    ...
```
10 changes: 10 additions & 0 deletions tools/pyclient/README.md
Releases of the Molgenis EMX2 Pyclient follow the release number of the accompanying release of the Molgenis EMX2 software.
Therefore, releases of the Pyclient are less frequent than those of EMX2 and the latest version of the Pyclient may differ from the latest version of Molgenis EMX2.

#### 11.56.2
- Added: feature 'truncate' to remove all entries from a table
- Added: option to filter results of `get` method by columns
- Added: method `get_graphql`, implementing the GraphQL API
- Improved: added additional parsing for data returned from the CSV API to pandas DataFrame in `get` method
- Fixed: the log level was hard-coded to `DEBUG`; the user can now set the log level to their preferred value again

#### 11.47.1
Fixed: updated GraphQL queries to be in line with EMX2 database metadata

#### 11.23.0
Added: an optional `job` argument to the `Client` initialization, allowing the Pyclient to run asynchronous methods within a job in EMX2.

Expand Down
