#2091: Create a documentation folder for Enceladus 3 (#2101)
* new folder for v3.0.0 documentation
* version 3.0.0 in the list of versions
* new data files appropriate to the version
* typo fix
* link to Spark 3 documentation
* CODEOWNERS
benedeki authored Aug 2, 2022
1 parent e03acf7 commit 781deb0
Showing 25 changed files with 2,128 additions and 2 deletions.
2 changes: 1 addition & 1 deletion .github/CODEOWNERS
@@ -1 +1 @@
-* @lokm01 @benedeki @DzMakatun @Zejnilovic @dk1844 @AdrianOlosutean @zakera786
+* @lokm01 @benedeki @DzMakatun @Zejnilovic @dk1844 @lsulak @zakera786
2 changes: 1 addition & 1 deletion README.md
@@ -14,7 +14,7 @@ $> bundle exec jekyll serve
# => Now browse to http://localhost:4000
```

-### Run convinience scripts
+### Run convenience scripts

#### Generate new docs
```ruby
81 changes: 81 additions & 0 deletions _data/configuration_3_0_0.yml
@@ -0,0 +1,81 @@
---
- name: conformance.allowOriginalColumnsMutability
  options:
    - name: boolean
      description: "Allows modifying/dropping columns from the original input (default is <i>false</i>)"
- name: conformance.autoclean.standardized.hdfs.folder
  options:
    - name: boolean
      description: 'Automatically delete standardized data folder after successful run of a Conformance job <sup><a href="#note1">*</a></sup>'
- name: control.info.validation
  options:
    - name: <i>strict</i>
      description: Job will fail on failed _INFO file validation.
    - name: <i>warning</i>
      description: "(default) A warning message will be displayed on failed validation,
        but the job will go on."
    - name: <i>none</i>
      description: No validation is done.
- name: enceladus.recordId.generation.strategy
  options:
    - name: <i>uuid</i>
      description: "(default) <code>enceladus_record_id</code> column will be added and will contain
        a UUID <code>String</code> for each row."
    - name: <i>stableHashId</i>
      description: "<code>enceladus_record_id</code> column will be added and populated with an
        always-the-same <code>Int</code> hash (Murmur3-based, for testing)."
    - name: <i>none</i>
      description: No column will be added to the output.
- name: max.processing.partition.size
  options:
    - name: non-negative long integer
      description: 'Maximal size (in bytes) for the processing partition, which would influence the written parquet file size
        <b>NB! Experimental - sizes might still not fulfill the requested limits</b>'
- name: menas.rest.uri
  options:
    - name: string with URLs
      description: 'Comma-separated list of URLs where Menas will be looked for. E.g.:
        <code>http://example.com/menas1,http://domain.com:8080/menas2</code>'
- name: menas.rest.retryCount
  options:
    - name: non-negative integer
      description: Each of the <code>menas.rest.uri</code> URLs can be tried multiple times for fault-tolerance
- name: menas.rest.availability.setup
  options:
    - name: <i>roundrobin</i>
      description: "(default) Starts from a random URL from the <code>menas.rest.uri</code> list; if it fails, the next
        one is tried; if the last one is reached, it continues from index 0 until all are tried"
    - name: <i>fallback</i>
      description: "Always starts from the first URL, and only if it fails the second follows etc."
- name: min.processing.partition.size
  options:
    - name: non-negative long integer
      description: 'Minimal size (in bytes) for the processing partition, which would influence the written parquet file size
        <b>NB! Experimental - sizes might still not fulfill the requested limits</b>'
- name: standardization.defaultTimestampTimeZone.default
  options:
    - name: string with any valid time zone name
      description: The time zone for normalization of timestamps that don't have their own time zone either in data
        itself or in metadata. If left empty the system time zone will be used.
- name: standardization.defaultTimestampTimeZone.[rawFormat]
  options:
    - name: string with any valid time zone name
      description: Same as above <code>standardization.defaultTimestampTimeZone.default</code>, but applies only for
        the specific input raw format - then it takes precedence over
        <code>standardization.defaultTimestampTimeZone.default</code>.
- name: standardization.defaultDateTimeZone.default
  options:
    - name: string with any valid time zone name
      description: The time zone for normalization of dates that don't have their own time zone either in data itself
        or in metadata in case they need it. Most probably this should be left undefined.
- name: standardization.defaultDateTimeZone.[rawFormat]
  options:
    - name: string with any valid time zone name
      description: Same as above <code>standardization.defaultDateTimeZone.default</code>, but applies only for
        the specific input raw format - then it takes precedence over
        <code>standardization.defaultDateTimeZone.default</code>.
- name: timezone
  options:
    - name: string with any valid time zone name
      description: The time zone the Spark application will operate in. Strongly recommended
        to keep the default <i>UTC</i>
46 changes: 46 additions & 0 deletions _data/menas-configuration_3_0_0.yml
@@ -0,0 +1,46 @@
---
- name: javax.net.ssl.keyStore
  options:
    - name: string path to JKS file
      description: 'KeyStore file containing records of private keys to connect to a secure schema registry.
        E.g.: <code>/path/to/keystore.jks</code>'
- name: javax.net.ssl.keyStorePassword
  options:
    - name: string
      description: 'Password for the file referenced in <code><a href="javax.net.ssl.keyStore">javax.net.ssl.keyStore</a></code>. E.g.:
        <code>password1234</code>'
- name: javax.net.ssl.trustStore
  options:
    - name: string path to JKS file
      description: 'TrustStore file containing records of trusted certificates to connect to a secure schema registry.
        E.g.: <code>/path/to/truststore.jks</code> <sup><a href="#note2">*</a></sup>'
- name: javax.net.ssl.trustStorePassword
  options:
    - name: string
      description: 'Password for the file referenced in <code><a href="javax.net.ssl.trustStore">javax.net.ssl.trustStore</a></code>. E.g.:
        <code>password123</code>'
- name: menas.auth.admin.role
  options:
    - name: string
      description: 'Specifies the admin role required for property definition create and update operations.'
- name: menas.auth.roles.regex
  options:
    - name: string - regular expression
      description: 'Regular expression specifying which user roles to include in JWT. E.g.:
        <code>^menas_</code>. If the expression filters out the admin role (<code><a href="#menas.auth.admin.role">menas.auth.admin.role</a></code>), the account won''t be recognized as admin.'
- name: menas.auth.ad.server
  options:
    - name: string - space-separated AD server domains
      description: 'ActiveDirectory server domain(s) - multiple values are supported as fallback options.
        DN (e.g. <code>dc=example,dc=com</code>) should not be included as this is supplied in <code>menas.auth.ldap.search.base</code>.
        Example: <code>menas.auth.ad.server=ldaps://first.ldap.here ldaps://second.ldap.here ldaps://third.ldap.here</code> (notice no quotes)'
- name: menas.schemaRegistry.baseUrl
  options:
    - name: string with URL
      description: 'Base URL of the (secure) schema registry. E.g.:
        <code>https://localhost:8081</code> <sup><a href="#note1">*</a></sup>'
- name: menas.schemaRegistry.warnUnsecured
  options:
    - name: boolean
      description: 'If set, in case the <code>javax.net.ssl.*</code> settings are missing or incorrect, the application
        will issue a warning. Default: <code>True</code>'
9 changes: 9 additions & 0 deletions _data/selected-plugins-configuration_3_0_0.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
---
- name: atum.hdfs.info.file.permissions
  options:
    - name: string with FS permissions
      description: 'Desired FS permissions for Atum <code>_INFO</code> file. Default: <code>644</code>.'
- name: spline.hdfs.file.permissions
  options:
    - name: string with FS permissions
      description: "Desired FS permissions for Spline's <code>_LINEAGE</code> file. Default: <code>644</code>."
1 change: 1 addition & 0 deletions _data/versions.yaml
@@ -1,2 +1,3 @@
- '1.0.0'
- '2.0.0'
- '3.0.0'
7 changes: 7 additions & 0 deletions _docs/3.0.0/build-process.md
@@ -0,0 +1,7 @@
---
layout: docs
title: Build Process
version: '3.0.0'
categories:
- '3.0.0'
---
33 changes: 33 additions & 0 deletions _docs/3.0.0/components.md
@@ -0,0 +1,33 @@
---
layout: docs
title: Components
version: '3.0.0'
categories:
- '3.0.0'
---

### Menas

Menas is the UI component of the Enceladus project. It is used to define datasets and schemas representing your data. A dataset definition specifies where the data is, where it should land, and which conformance rules, if any, should be applied. A schema defines how the data will look (column names, types) after standardization.

[More...]({{ site.baseurl }}/docs/{{ page.version }}/components/menas)

### SparkJobs

Enceladus consists of two Spark jobs. The first is Standardization, which aligns data types and formats; the second is Conformance, which then applies conformance rules to the data.

#### Standardization

Standardization is used to transform almost any data format into a standardized, strongly typed parquet format, so the data can be used/viewed using unified tools.

#### Conformance

Conformance is used to apply conformance rules (mapping, negation, casting, etc.) to the data. Conformance rules are additional transformations of the data.

### Plugins

[More...]({{ site.baseurl }}/docs/{{ page.version }}/plugins)

### Built-in Plugins

[More...]({{ site.baseurl }}/docs/{{ page.version }}/plugins-built-in)
40 changes: 40 additions & 0 deletions _docs/3.0.0/components/menas.md
@@ -0,0 +1,40 @@
---
layout: docs
title: Components - Menas
version: '3.0.0'
categories:
- '3.0.0'
- components
---
## API

### Monitoring endpoints

All `/admin` endpoints except `/admin/health` require authentication (and will require strict permissions once [Authorization]({{ site.github.issues_url }}/30) is implemented):
* `GET /admin` - list of all monitoring endpoints
* `GET /admin/heapdump` - downloads a heapdump of the application
* `GET /admin/threaddump` - returns a threaddump of the application
* `GET /admin/threaddump` - list of the threaddump of the application
* `GET /admin/loggers` - list of all the application loggers and their log levels
* `POST /admin/loggers/{logger}` - change the log level of a logger in runtime
* `GET /admin/health` - get a detailed status report of the application's health:
```json
{
  "status": "UP",
  "details": {
    "HDFSConnection": {
      "status": "UP"
    },
    "MongoDBConnection": {
      "status": "UP"
    },
    "diskSpace": {
      "status": "UP",
      "details": {
        "total": 1000240963584,
        "free": 766613557248,
        "threshold": 10485760
      }
    }
  }
}
```
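
A monitoring script could consume this report to find unhealthy components. The following is a minimal Python sketch, not part of Menas: the helper name `failing_components` is hypothetical, and the `"DOWN"` status in the sample is an assumed value used only to exercise the check (the report above shows only `"UP"`).

```python
import json


def failing_components(health: dict) -> list:
    """Return the sorted names of components whose status is not "UP"."""
    details = health.get("details", {})
    return sorted(
        name for name, info in details.items() if info.get("status") != "UP"
    )


# Sample modelled on the report above; "DOWN" is an assumed status value.
sample = json.loads("""
{
  "status": "UP",
  "details": {
    "HDFSConnection": {"status": "UP"},
    "MongoDBConnection": {"status": "DOWN"},
    "diskSpace": {"status": "UP"}
  }
}
""")

print(failing_components(sample))
```

In practice the dictionary would come from the body of a `GET /admin/health` response instead of an inline string.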
20 changes: 20 additions & 0 deletions _docs/3.0.0/deployment.md
@@ -0,0 +1,20 @@
---
layout: docs
title: Deployment
version: '3.0.0'
categories:
- '3.0.0'
---

## Menas

### Prerequisites for deploying Menas

- Tomcat 8.5+ to deploy the war to
- `HADOOP_CONF_DIR` environment variable. This variable should point to a folder containing Hadoop configuration files (`core-site.xml`, `hdfs-site.xml` and `yarn-site.xml`). These are used to query the HDFS for folder locations.
- MongoDB 4.0+ used as a storage
- _OPTIONAL_ [Spline 0.3.X](https://absaoss.github.io/spline/0.3.html) for viewing of the lineage from Menas. Even without Spline in Menas, Standardization and Conformance will log lineage to Mongo.

### Deploying Menas

The easiest way to deploy Menas is to copy the `menas-VERSION.war` to `$TOMCAT_HOME/webapps`. This makes Menas available at the `<tomcat IP>/menas-VERSION` path on your server.
66 changes: 66 additions & 0 deletions _docs/3.0.0/plugins-built-in.md
@@ -0,0 +1,66 @@
---
layout: docs
title: Built-in Plugins
version: '3.0.0'
categories:
- '3.0.0'
---
<!-- toc -->
- [What are built-in plugins](#what-are-built-in-plugins)
- [Existing built-in plugins](#existing-built-in-plugins)
- [KafkaInfoPlugin](#kafkainfoplugin)
- [KafkaErrorSenderPlugin](#kafkaerrorsenderplugin)
<!-- tocstop -->

## What are built-in plugins

Built-in plugins provide some additional but relatively elementary functionality and also serve as examples of how plugins
are written. Unlike externally created plugins, they are automatically included in the `SparkJobs.jar` file and therefore
don't need to be included using the `--jars` option.

## Existing built-in plugins

The plugin class name is specified for Standardization and Conformance separately, since some plugins need to run only
during execution of one of these jobs. Plugin class name keys have numeric suffixes (`.1` in the examples below). The numeric
suffix specifies the order in which plugins are invoked. It should always start with `1` and be incremented by 1 without
gaps.
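
The numbering rule can be checked mechanically before a job is submitted. The Python sketch below only illustrates the rule; it is not how Enceladus reads these keys, and the function name `ordered_plugins` and the class `com.example.MyPlugin` are hypothetical.

```python
def ordered_plugins(config: dict, prefix: str) -> list:
    """Return plugin class names under `prefix` in invocation order.

    Raises ValueError unless the numeric suffixes are exactly 1, 2, 3, ...
    """
    found = {}
    for key, value in config.items():
        if key.startswith(prefix + "."):
            suffix = key[len(prefix) + 1:]
            if suffix.isdecimal():
                found[int(suffix)] = value
    expected = list(range(1, len(found) + 1))
    if sorted(found) != expected:
        raise ValueError(
            f"suffixes for {prefix} must be {expected}, got {sorted(found)}"
        )
    return [found[i] for i in expected]


config = {
    "standardization.plugin.control.metrics.1":
        "za.co.absa.enceladus.plugins.builtin.controlinfo.mq.kafka.KafkaInfoPlugin",
    # Hypothetical second plugin, used only to show ordering.
    "standardization.plugin.control.metrics.2": "com.example.MyPlugin",
}
print(ordered_plugins(config, "standardization.plugin.control.metrics"))
```

A configuration that jumps from `.1` to `.3` makes this check raise a `ValueError`, which is easier to diagnose than a plugin that is silently never invoked.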

### KafkaInfoPlugin

The purpose of this plugin is to send control measurements to a Kafka topic each time a checkpoint is reached or job
status is changed. This can help to monitor production issues and react to errors as quickly as possible.
Control measurements are sent in `Avro` format and the schema is automatically registered in a schema registry.

This plugin is a built-in one. In order to enable it, you need to provide the following configuration settings in
`application.conf`:

```
standardization.plugin.control.metrics.1=za.co.absa.enceladus.plugins.builtin.controlinfo.mq.kafka.KafkaInfoPlugin
conformance.plugin.control.metrics.1=za.co.absa.enceladus.plugins.builtin.controlinfo.mq.kafka.KafkaInfoPlugin
kafka.schema.registry.url="http://127.0.0.1:8081"
kafka.bootstrap.servers="127.0.0.1:9092"
kafka.info.metrics.client.id="controlInfo"
kafka.info.metrics.topic.name="control.info"
# Optional security settings
#kafka.security.protocol="SASL_SSL"
#kafka.sasl.mechanism="GSSAPI"
# Optional Schema Registry Security Parameters
#kafka.schema.registry.basic.auth.credentials.source=USER_INFO
#kafka.schema.registry.basic.auth.user.info=user:password
```

### KafkaErrorSenderPlugin

The purpose of this plugin is to send errors to a Kafka topic.

This plugin is a built-in one. In order to enable it, you need to provide the following configuration settings in
`application.conf`:

```
standardization.plugin.postprocessor.1=za.co.absa.enceladus.plugins.builtin.errorsender.mq.kafka.KafkaErrorSenderPlugin
conformance.plugin.postprocessor.1=za.co.absa.enceladus.plugins.builtin.errorsender.mq.kafka.KafkaErrorSenderPlugin
kafka.schema.registry.url=
kafka.bootstrap.servers=
kafka.error.client.id=
kafka.error.topic.name=
```
37 changes: 37 additions & 0 deletions _docs/3.0.0/plugins.md
@@ -0,0 +1,37 @@
---
layout: docs
title: Plugins
version: '3.0.0'
categories:
- '3.0.0'
---

**Standardization** and **Conformance** support plugins that allow executing additional actions at certain times of the computation.

A plugin can be externally developed. In this case, in order to use the plugin a plugin jar needs to be supplied to
`spark-submit` using the `--jars` option. You can also use built-in plugins by enabling them in `application.conf`
or passing configuration information directly to `spark-submit`.

It works as follows. A plugin factory (a class that implements `PluginFactory`) overrides the
`apply` method. Standardization and Conformance invoke this method when the job starts and provide a configuration that
includes all settings from `application.conf` plus settings passed to the JVM via `spark-submit`. The factory then
instantiates a plugin and returns it to the caller. If the factory throws an exception, the Spark application
(Standardization or Conformance) will be stopped. If the factory returns `null`, an error will be logged by the application,
but it will continue to run.
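
The two failure modes of a factory can be illustrated with a small sketch. This is a Python analogy of the behaviour described above, not the Enceladus implementation (which is JVM-based); `load_plugin`, `EchoPlugin`, `EchoPluginFactory`, `NullFactory`, `FailingFactory` and the `echo.prefix` setting are all hypothetical names.

```python
import logging

log = logging.getLogger("plugin-loader")


def load_plugin(factory, config: dict):
    """Call factory.apply(config) and mirror the rules described above.

    An exception from the factory is not caught, so the job would stop.
    A None result is logged as an error and the job continues without it.
    """
    plugin = factory.apply(config)
    if plugin is None:
        log.error("Factory %s returned no plugin", type(factory).__name__)
    return plugin


class EchoPlugin:
    def __init__(self, prefix: str):
        self.prefix = prefix


class EchoPluginFactory:
    def apply(self, config: dict):
        # Reads its own (hypothetical) setting from the merged configuration.
        return EchoPlugin(config.get("echo.prefix", ""))


class NullFactory:
    def apply(self, config: dict):
        return None


class FailingFactory:
    def apply(self, config: dict):
        raise RuntimeError("cannot build plugin")


plugin = load_plugin(EchoPluginFactory(), {"echo.prefix": ">"})
missing = load_plugin(NullFactory(), {})
```

Here `plugin` is an `EchoPlugin`, `missing` is `None` after an error is logged, and calling `load_plugin(FailingFactory(), {})` would propagate the `RuntimeError` to the caller.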

Only one type of plugin is supported for now:

## Control Metrics Plugins

_Control metrics plugins_ allow execution of additional actions any time a checkpoint is created
or job status changes. In order to write such a plugin for Enceladus you need to implement the `ControlMetricsPlugin` and
`ControlMetricsPluginFactory` interfaces.

Control metrics plugins are invoked each time a job status changes (e.g. from `running` to `succeeded`) or when a checkpoint
is reached. A `Checkpoint` is an [Atum][atum] concept to ensure accuracy and completeness of data.
A checkpoint is created at the end of Standardization and Conformance, and after each conformance rule
configured to create control measurements. At this point the `onCheckpoint()` callback is called with an instance of control
measurements. It is up to the plugin to decide what to do at this point. All exceptions thrown from a plugin will be
logged, but the Spark application will continue to run.
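
The "log and continue" behaviour for callbacks can be sketched as follows. This is a Python analogy under stated assumptions, not the Enceladus code: `on_checkpoint` stands in for the `onCheckpoint()` callback, the measurements are represented by a plain dictionary, and `notify_checkpoint`, `CollectingPlugin` and `BrokenPlugin` are hypothetical names.

```python
import logging

log = logging.getLogger("control-metrics")


def notify_checkpoint(plugins, measurements) -> int:
    """Call on_checkpoint on every plugin, logging failures without re-raising.

    Returns the number of plugins that handled the checkpoint successfully.
    """
    succeeded = 0
    for plugin in plugins:
        try:
            plugin.on_checkpoint(measurements)
            succeeded += 1
        except Exception:
            # A failing plugin must not stop the job or the remaining plugins.
            log.exception("Plugin %s failed", type(plugin).__name__)
    return succeeded


class CollectingPlugin:
    def __init__(self):
        self.seen = []

    def on_checkpoint(self, measurements):
        self.seen.append(measurements)


class BrokenPlugin:
    def on_checkpoint(self, measurements):
        raise RuntimeError("boom")


collector = CollectingPlugin()
handled = notify_checkpoint([BrokenPlugin(), collector], {"rows": 10})
```

Even though the first plugin raises, the second one still receives the measurements, which matches the guarantee that a plugin exception never stops the Spark application.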

[atum]: https://github.com/AbsaOSS/atum