Aggregator Extractor (leomarquine#25)
* toIterator method

* pipeline consume as a Generator

* new Accumulator extractor

* missingDataException if strict

* fun with chaining

* better doc

* @ArthurHoaro get rid of superfluous if statement

* @ArthurHoaro fix type juggling/casting

* perf, remove unnecessary md5 hash

* various cosmetics

* Accumulator tests

* optimization

* documentation

* svg schema update

* missing link in documentation

* svg better

* rename to Aggregator

* incomplete flag

* svg tuning

* fix fusion

Co-authored-by: Nicolas @ remote <[email protected]>
leNEKO and Nicolas-Masson-Wizaplace authored May 25, 2020
1 parent 9c077a1 commit a2fa8c9
Showing 23 changed files with 1,965 additions and 52 deletions.
4 changes: 3 additions & 1 deletion README.md
@@ -4,12 +4,14 @@

Extract, Transform and Load data using PHP.

![ETL](docs/img/etl.svg)

## Changelog
See the changelog [here](changelog.MD)

## Installation
In your application's folder, run:
```shell
composer require wizaplace/php-etl
```

93 changes: 93 additions & 0 deletions docs/Extractors/Aggregator.md
@@ -0,0 +1,93 @@
# Aggregator

Merge rows from a list of partial data iterators with a matching index.

```php
# user data from one CSV file
$userDataIterator = (new Etl())
    ->extract(
        new Csv(),
        'user_data.csv',
        ['columns' => ['id', 'email', 'name']]
    )
    ->toIterator()
;

# extended info from another source
$extendedInfoIterator = (new Etl())
    ->extract(
        new Table(),
        'extended_info',
        ['columns' => ['courriel', 'twitter']]
    )
    # let's rename 'courriel' to 'email'
    ->transform(
        new RenameColumns(),
        [
            'columns' => ['courriel' => 'email']
        ]
    )
    ->toIterator()
;

# merge these two data sources
$mergedData = (new Etl())
    ->extract(
        new Aggregator(),
        [
            $userDataIterator,
            $extendedInfoIterator,
        ],
        [
            'index' => ['email'], # common matching index
            'columns' => ['id', 'email', 'name', 'twitter']
        ]
    )
    ->load(
        new CsvLoader(),
        'completeUserData.csv'
    )
    ->run()
;
```

## Options

### Index (required)

An array of column names common to all data sources.

| Type | Default value |
|-------|---------------|
| array | `null` |

```php
$options = ['index' => ['email']];
```

### Columns (required)

A `Row` is yielded once all specified columns have been found for the matching index.

| Type | Default value |
|-------|---------------|
| array | `null` |

```php
$options = ['columns' => ['id', 'name', 'email']];
```

### Strict

When all input iterators are fully consumed and incomplete rows remain:

- if *true*: throw an `IncompleteDataException`
- if *false*: yield each remaining incomplete `Row`, flagged as `incomplete`

| Type | Default value |
|---------|---------------|
| boolean | `true` |

```php
$options = ['strict' => false];
```
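
To make the matching behaviour concrete, here is a minimal plain-PHP sketch of the idea described above. It is an illustration under stated assumptions, not the library's implementation: the `aggregateRows` function is hypothetical, rows are plain arrays rather than `Row` objects, a generic `RuntimeException` stands in for `IncompleteDataException`, an `incomplete` array key stands in for the flag, and the sources are consumed one after another for simplicity.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of the library): merges partial rows
 * that share the same index values.
 *
 * @param iterable[] $sources partial data sources
 * @param string[]   $index   columns identifying a row across sources
 * @param string[]   $columns columns required for a row to be complete
 */
function aggregateRows(array $sources, array $index, array $columns, bool $strict = true): Generator
{
    $pending = [];

    foreach ($sources as $source) {
        foreach ($source as $row) {
            // Build a lookup key from the index column values.
            $key = json_encode(array_map(fn (string $col) => $row[$col] ?? null, $index));
            $pending[$key] = array_merge($pending[$key] ?? [], $row);

            // Yield as soon as every required column is present.
            if (array_diff($columns, array_keys($pending[$key])) === []) {
                yield $pending[$key];
                unset($pending[$key]);
            }
        }
    }

    if ($pending === []) {
        return;
    }

    if ($strict) {
        // Stand-in for the IncompleteDataException described above.
        throw new RuntimeException(count($pending) . ' incomplete row(s) remaining');
    }

    foreach ($pending as $row) {
        // Array-key stand-in for the "incomplete" flag.
        yield $row + ['incomplete' => true];
    }
}

// Made-up sample data.
$users = [['id' => 1, 'email' => 'a@example.com', 'name' => 'Ann']];
$extra = [
    ['email' => 'a@example.com', 'twitter' => '@ann'],
    ['email' => 'b@example.com', 'twitter' => '@bob'],
];

$merged = iterator_to_array(
    aggregateRows([$users, $extra], ['email'], ['id', 'email', 'name', 'twitter'], false),
    false
);
```

With `strict` set to `false`, `$merged` holds the completed row for `a@example.com` followed by the partial row for `b@example.com` carrying the `incomplete` key. With `strict` set to `true`, the same input would throw once both sources are exhausted.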
8 changes: 4 additions & 4 deletions docs/Extractors/Collection.md
@@ -9,15 +9,15 @@ $etl->extract($collection, $iterable, $options);

> **Tip:** Using generators will decrease memory usage.
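
As a rough illustration of the tip above, a generator produces one item at a time instead of building the whole array in memory. The `readUsers` function and its data are hypothetical; the resulting generator is the kind of value that could be passed as `$iterable`.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical source: yields one row at a time instead of returning a full array.
 */
function readUsers(int $count): Generator
{
    for ($id = 1; $id <= $count; $id++) {
        yield ['id' => $id, 'name' => "user{$id}", 'email' => "user{$id}@example.com"];
    }
}

// Only the current row is held in memory while iterating.
$iterable = readUsers(3);

$names = [];
foreach ($iterable as $row) {
    $names[] = $row['name'];
}
```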

## Options

### Columns

Columns from the iterable item that will be extracted.

| Type  | Default value |
|-------|---------------|
| array | `null`        |

```php
$options = ['columns' => ['id', 'name', 'email']];
32 changes: 19 additions & 13 deletions docs/Extractors/Csv.md
@@ -7,22 +7,24 @@ Extracts data from a character-separated values file.
$etl->extract($csv, 'path/to/file.csv', $options);
```


## Options

### Columns

Columns that will be extracted. If `null`, all columns will be extracted and the first line will be used as the column names.

| Type  | Default value |
|-------|---------------|
| array | `null`        |

To select which columns will be extracted, use an array with the columns list:

```php
$options = ['columns' => ['id', 'name', 'email']];
```

To rename the columns, use an associative array where the `key` is the name of the column in the file and the `value` is the name that will be used in the ETL process:

```php
$options = ['columns' => [
'id' => 'id',
@@ -32,6 +34,7 @@ $options = ['columns' => [
```

If your file does not contain the column names, you may set the name and the index of the columns to extract, starting at 1:

```php
$options = ['columns' => [
'id' => 1,
@@ -41,35 +44,38 @@ $options = ['columns' => [
```
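
The sketch below shows one way such a name-to-position map could be applied to a line from a header-less file. It is an illustration in plain PHP, not the extractor's code: `pickColumns` is a hypothetical helper and the sample line is made up.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: builds a named row from a header-less CSV line,
 * using 1-based column positions as in the option above.
 *
 * @param array<string, int> $columns column name => 1-based position
 */
function pickColumns(string $line, array $columns): array
{
    $fields = str_getcsv($line, ',', '"', '\\');

    $row = [];
    foreach ($columns as $name => $position) {
        // Positions start at 1, PHP arrays at 0.
        $row[$name] = $fields[$position - 1] ?? null;
    }

    return $row;
}

// Keep the first and third fields, skip the second.
$row = pickColumns('42,Ann,ann@example.com', ['id' => 1, 'email' => 3]);
```

Here `$row` ends up with an `id` of `'42'` and an `email` of `'ann@example.com'`; a position past the end of the line yields `null` in this sketch.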

### Delimiter

Field delimiter (one character only).

| Type   | Default value |
|--------|---------------|
| string | ,             |

```php
$options = ['delimiter' => ';'];
```

### Enclosure

Field enclosure character (one character only).

| Type   | Default value |
|--------|---------------|
| string |               |

```php
$options = ['enclosure' => '"'];
```
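
For intuition about how the delimiter and enclosure characters interact, here is a small sketch using PHP's built-in `str_getcsv`. Whether the extractor relies on this function internally is not stated here, and the sample line is invented.

```php
<?php

declare(strict_types=1);

// A semicolon-delimited line whose second field contains the delimiter,
// so it has to be wrapped in the enclosure character.
$line = '1;"Doe; John";john@example.com';

// Arguments: input, delimiter, enclosure, escape character.
$fields = str_getcsv($line, ';', '"', '\\');
```

`$fields` contains three entries: `'1'`, `'Doe; John'` and `'john@example.com'`. Without the enclosure, the second field would be split at its inner semicolon.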

### Throw error

Whether the extractor should throw an exception when it
encounters an input issue during data processing. The default value
is `false` to keep backward compatibility.

| Type    | Default value |
|---------|---------------|
| boolean | false         |

```php
$options = ['throwError' => true];
9 changes: 5 additions & 4 deletions docs/Extractors/FixedWidth.md
@@ -7,17 +7,18 @@ Extracts data from a text file with fields delimited by a fixed number of charac
$etl->extract($fixedWidth, 'path/to/file.txt', $options);
```


## Options

### Columns (required)

Columns that will be extracted.

| Type  | Default value |
|-------|---------------|
| array | `null`        |

Associative array where the `key` is the name of the column and the `value` is an array containing the start position and the length of the column:

```php
$options = ['columns' => [
'id' => [0, 5], // Start position is 0 and length is 5.
9 changes: 5 additions & 4 deletions docs/Extractors/Json.md
@@ -7,17 +7,18 @@ Extracts data from a JavaScript Object Notation file.
$etl->extract($json, 'path/to/file.json', $options);
```


## Options

### Columns

Columns that will be extracted. If `null`, the first level key/value pairs of the object in each iteration will be used.

| Type  | Default value |
|-------|---------------|
| array | `null`        |

For more control over the columns, you may use JSON path:

```php
$options = ['columns' => [
'id' => '$..bindings[*].id.value',
5 changes: 4 additions & 1 deletion docs/Extractors/Query.md
@@ -7,10 +7,10 @@ Extracts data from a database table using a custom SQL query.
$etl->extract($query, 'select * from users', $options);
```


## Options

### Connection

Name of the database connection to use.

| Type | Default value |
@@ -22,18 +22,21 @@ $options = ['connection' => 'app'];
```

### Bindings

Values to bind to the query statement.

| Type | Default value |
|----- | ------------- |
| array | `[]` |

Using a prepared statement with named placeholders, `select * from users where status = :status`:

```php
$options = ['bindings' => ['status' => 'active']];
```

Using a prepared statement with question mark placeholders, `select * from users where status = ?`:

```php
$options = ['bindings' => ['active']];
```
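
The two binding styles can be compared side by side with plain PDO. This is a sketch that assumes PDO-style prepared statements and the `pdo_sqlite` extension; the in-memory database, table contents and variable names are made up for the example and say nothing about how the extractor itself is wired.

```php
<?php

declare(strict_types=1);

// In-memory SQLite database, used only to make the example runnable.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('create table users (id integer, status text)');
$pdo->exec("insert into users values (1, 'active'), (2, 'inactive')");

// Named placeholders: bindings are keyed by placeholder name.
$named = $pdo->prepare('select id from users where status = :status');
$named->execute(['status' => 'active']);
$namedIds = $named->fetchAll(PDO::FETCH_COLUMN);

// Question mark placeholders: bindings are a positional list.
$positional = $pdo->prepare('select id from users where status = ?');
$positional->execute(['inactive']);
$positionalIds = $positional->fetchAll(PDO::FETCH_COLUMN);
```

The first query matches only the user with `id` 1 and the second only the user with `id` 2. Mixing both placeholder styles in a single statement is not supported by PDO, so each query should stick to one style.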
2 changes: 1 addition & 1 deletion docs/Extractors/README.md
@@ -7,9 +7,9 @@ Extractors are the entry point of any process. To start a process, you must set
$etl->extract($type, $source, $options);
```


## Available extractors types

* [Aggregator](Aggregator.md)
* [Collection](Collection.md)
* [CSV](Csv.md)
* [Fixed Width](FixedWidth.md)
23 changes: 13 additions & 10 deletions docs/Extractors/Table.md
@@ -7,38 +7,41 @@ Extracts data from a database table.
$etl->extract($table, 'table_name', $options);
```


## Options

### Columns

Columns that will be extracted. If `null`, all columns of the table will be extracted.

| Type  | Default value |
|-------|---------------|
| array | `null`        |

To select which columns will be extracted, use an array with the columns list:

```php
$options = ['columns' => ['id', 'name', 'email']];
```

### Connection

Name of the database connection to use.

| Type   | Default value |
|--------|---------------|
| string | default       |

```php
$options = ['connection' => 'app'];
```

### Where

Array of conditions, where `key` equals `value`. If you need more flexibility in the query creation, you may use the [Query extractor](Query.md).

| Type  | Default value |
|-------|---------------|
| array | `[]`          |

```php
$options = ['where' => ['status' => 'active']];