Merge pull request #3 from moka-guys/development
v1.0.0 (#3)

Co-Authored-By: Graeme Smith <[email protected]>
rebeccahaines1 and Graeme-Smith authored Jan 4, 2024
2 parents 3b13e7c + 914e0f2 commit 467a20d
Showing 48 changed files with 3,211 additions and 19 deletions.
44 changes: 44 additions & 0 deletions .github/workflows/on-pull-request.yaml
@@ -0,0 +1,44 @@
name: Samplesheet Validator

on:
  push:
    branches:
      - master
      - 'feature/**'
      - 'development'
jobs:
  build:

    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ['3.10.6']

    steps:
      - name: Checkout head
        uses: actions/checkout@v3
        with:
          fetch-depth: 2
          run: git checkout HEAD^
      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v4
        with:
          python-version: ${{ matrix.python-version }}
      - name: Install dependencies
        run: |
          python3 -m pip install --upgrade pip
          pip3 install flake8==6.0.0 wheel==0.38.4 pytest==7.2.1
          pip3 install -r requirements.txt
      - name: Lint with flake8
        run: |
          # stop the build if there are:
          #  - syntax errors (E9)
          #  - common assertion and comparison gotchas (F63)
          #  - control flow gotchas (F7)
          #  - undefined names (F82)
          flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
          # exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
          flake8 . --count --exit-zero --max-complexity=10 --max-line-length=120 --statistics
      - name: Test with pytest
        run: |
          python3 -m pytest
9 changes: 9 additions & 0 deletions .gitignore
@@ -0,0 +1,9 @@
venv/
__pycache__/
build/
.coverage
dist
*.egg-info
.pytest_cache
*.log
*.vscode
26 changes: 8 additions & 18 deletions LICENSE
@@ -1,21 +1,11 @@
MIT License
Copyright 2023 Synnovis

Copyright (c) 2020 Graeme
Licensed under the Apache License, Version 2.0 (the "License"); you may not use these files except
in compliance with the License. You may obtain a copy of the License at

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
[http://www.apache.org/licenses/LICENSE-2.0](http://www.apache.org/licenses/LICENSE-2.0)

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
Unless required by applicable law or agreed to in writing, software distributed under the License
is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
or implied. See the License for the specific language governing permissions and limitations under
the License.
107 changes: 106 additions & 1 deletion README.md
@@ -1 +1,106 @@
# samplesheet_verifier
# Samplesheet Validator

Checks sample sheet naming and contents. Carries out a series of checks on the sample sheet and collects any errors
that it identifies (SamplesheetCheck.errors_list). It also identifies whether or not a run is a TSO run from the sample
sheet (SamplesheetCheck.tso).

## Protocol

Runs a series of checks on the sample sheet and collects any errors identified. Checks whether:
* Sample sheet exists
* Samplesheet name is valid (validates using the [seglh-naming](https://github.com/moka-guys/seglh-naming/) library)
* Sequencer ID is in the list of allowed sequencer IDs supplied to the script
* Samplesheet is not empty (>10 bytes)
* Samplesheet is for a development run, using the development pan number supplied to the script
* Samplesheet contains the minimum expected `[Data]` section headers: `Sample_ID, Sample_Name, index`
* `Sample_ID` and `Sample_Name` match for each sample in the data section of the samplesheet
* Sample name does not contain any illegal characters
* Sample name is valid (validates using the [seglh-naming](https://github.com/moka-guys/seglh-naming/) library)
* Pan numbers are in the list of allowed pan numbers supplied to the script
* Library prep name in the sample name is in the list of allowed library prep names supplied to the script
* Samplesheet contains any TSO samples
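
Two of the checks above (the minimum `[Data]` headers, and `Sample_ID` matching `Sample_Name`) can be sketched as follows. This is an illustrative sketch only, not the package's implementation: the function name `check_data_section` and the wording of the error strings are hypothetical, and it assumes a comma-separated samplesheet in which everything after the `[Data]` line is the sample table.

```python
import csv
import io

EXPECTED_HEADERS = ["Sample_ID", "Sample_Name", "index"]


def check_data_section(samplesheet_text: str) -> list[str]:
    """Hypothetical helper: return error strings for the [Data] section."""
    errors = []
    lines = samplesheet_text.splitlines()
    # Locate the [Data] marker; every line after it is treated as the table
    try:
        start = next(i for i, line in enumerate(lines) if line.startswith("[Data]"))
    except StopIteration:
        return ["No [Data] section found"]
    table = io.StringIO("\n".join(lines[start + 1:]))
    rows = [row for row in csv.reader(table) if any(row)]  # Skip empty lines
    if not rows:
        return ["[Data] section is empty"]
    headers = rows[0]
    missing = [h for h in EXPECTED_HEADERS if h not in headers]
    if missing:
        return [f"Header(s) missing from [Data] section: {missing}"]
    id_col = headers.index("Sample_ID")
    name_col = headers.index("Sample_Name")
    for row in rows[1:]:
        if len(row) <= max(id_col, name_col):
            errors.append(f"Sample line has too few columns: {row}")
        elif row[id_col] != row[name_col]:
            errors.append(
                f"Sample_ID and Sample_Name do not match: {row[id_col]}, {row[name_col]}"
            )
    return errors
```

An empty list returned by this sketch means both checks passed; each failing sample line contributes one error string.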

## Usage

### Python package

The repository provides a python package which can be installed with:

`python3 setup.py install`

NB: Use the `--user` flag or install into a virtualenv/pipenv if not installing globally.

```python

from samplesheet_validator.samplesheet_validator import SamplesheetCheck

sscheck_obj = SamplesheetCheck(
    samplesheet_path,  # str
    sequencer_ids,  # list
    panels,  # list
    library_prep_names,  # list
    tso_panels,  # list
    dev_panno,  # str
    logdir,  # str
)
sscheck_obj.ss_checks()  # Carry out samplesheet validation

print(sscheck_obj.errors_dict) # View the dictionary of error messages
```

### Command line

The environment must be set up as follows:
```bash
python3 -m venv venv
source venv/bin/activate
pip3 install -r requirements.txt
```

The script can then be used as follows:
```bash
usage: Used to validate a samplesheet using the seglh-naming conventions

Given an input samplesheet, will validate the samplesheet using seglh-naming conventions and output a logfile

options:
  -h, --help            show this help message and exit
  -S SAMPLESHEET_PATH, --samplesheet_path SAMPLESHEET_PATH
                        Path to samplesheet requiring validation
  -SI SEQUENCER_IDS, --sequencer_ids SEQUENCER_IDS
                        Comma separated string of allowed sequencer IDS
  -P PANELS, --panels PANELS
                        Comma separated string of allowed panel numbers
  -R LIBRARY_PREP_NAMES, --library_prep_names LIBRARY_PREP_NAMES
                        Comma separated string of allowed library prep names
  -T TSO_PANELS, --tso_panels TSO_PANELS
                        Comma separated string of tso panels
  -D DEV_PANNO, --dev_panno DEV_PANNO
                        Development pan number
  -L LOGDIR, --logdir LOGDIR
                        Directory to save the output logfile to
```
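
The list-like options above are supplied as comma-separated strings. One way such a string could be turned into a Python list at parse time is a custom argparse `type` callable. This is a minimal sketch: the helper `comma_separated` is hypothetical, and whether the package itself splits the strings this way is not shown here.

```python
import argparse


def comma_separated(value: str) -> list[str]:
    """Hypothetical helper: split 'a,b,c' into ['a', 'b', 'c'], dropping blanks."""
    return [item.strip() for item in value.split(",") if item.strip()]


parser = argparse.ArgumentParser()
parser.add_argument("-SI", "--sequencer_ids", type=comma_separated, required=True)
parser.add_argument("-P", "--panels", type=comma_separated, required=True)

# Parse an example command line rather than sys.argv
args = parser.parse_args(["-SI", "NB551068,M02353", "-P", "Pan4009, Pan5180"])
print(args.sequencer_ids)
print(args.panels)
```

Splitting at parse time means a membership test such as `"Pan4009" in args.panels` compares whole items rather than substrings of one long string.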

### Testing

Test datasets are stored in [/test/data](../test/data). The script has a full test suite:
* [test_samplesheet_validator.py](../test/test_samplesheet_validator.py)

These tests should be run before pushing any code to ensure all tests in the GitHub Actions workflow pass. These can be run as follows:

```bash
python3 -m pytest
```
**N.B. Tests and test cases/files MUST be maintained and updated accordingly in conjunction with script development**
**N.B. This includes ensuring that the arguments passed to pytest in the [pytest.ini](pytest.ini) file are kept up to date**
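
The custom options in `pytest.ini` (for example `--sequencer_ids`) are not built into pytest, so they have to be registered before pytest will accept them, typically through the `pytest_addoption` hook in a `conftest.py`. The sketch below shows one way this might look. The option names mirror `pytest.ini`, but the helper `option_as_list` and the layout of the project's own `conftest.py` are assumptions.

```python
# conftest.py (illustrative sketch, not the project's actual file)
CUSTOM_OPTIONS = [
    "--sequencer_ids",
    "--panels",
    "--library_prep_names",
    "--tso_panels",
    "--dev_panno",
    "--logdir",
]


def pytest_addoption(parser):
    """Register each custom command line option so pytest accepts it."""
    for option in CUSTOM_OPTIONS:
        parser.addoption(option, action="store", default="")


def option_as_list(config, name: str) -> list[str]:
    """Hypothetical helper: read a comma-separated option value as a list."""
    return [item for item in config.getoption(name).split(",") if item]
```

Inside a fixture, `option_as_list(request.config, "--panels")` would then return the allowed pan numbers passed via `addopts`.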


## Logging

Logging is performed by [ss_logger](samplesheet_validator/ss_logger.py). The directory to save the log file to is supplied as an argument. The output log file is named by the script as follows:
- `$LOGFILE_DIR/$RUNFOLDER_NAME_$TIMESTAMP_samplesheet_validator.log`

The script also collects the error messages as it runs, which can be used by other scripts when this script is used as an import.
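
The naming scheme above can be sketched with the standard `logging` module. This is a minimal sketch and not the `ss_logger` implementation: the function names `build_log_path` and `get_logger` are hypothetical, while the timestamp and record formats are copied from `config.py` in this commit.

```python
import datetime
import logging
import os

LOGGING_FORMATTER = "%(asctime)s - SAMPLESHEET_VALIDATOR - %(levelname)s - %(message)s"


def build_log_path(logdir: str, runfolder_name: str) -> str:
    """Hypothetical helper: build the log file path from directory and run folder."""
    timestamp = f"{datetime.datetime.now():%Y%m%d_%H%M%S}"
    filename = f"{runfolder_name}_{timestamp}_samplesheet_validator.log"
    return os.path.join(logdir, filename)


def get_logger(logfile_path: str) -> logging.Logger:
    """Hypothetical helper: return a logger writing formatted records to a file."""
    logger = logging.getLogger(logfile_path)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # Avoid attaching duplicate handlers on repeated calls
        handler = logging.FileHandler(logfile_path)
        handler.setFormatter(logging.Formatter(LOGGING_FORMATTER))
        logger.addHandler(handler)
    return logger
```

Calling `get_logger(build_log_path(logdir, runfolder_name))` would give one log file per validation run, with the timestamp keeping repeated runs on the same run folder apart.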


### Developed by the Synnovis Genome Informatics Team
2 changes: 2 additions & 0 deletions pytest.ini
@@ -0,0 +1,2 @@
[pytest]
addopts = -v --ignore=test/data/ --ignore=test/temp/ --cov=. --cov-report term-missing --sequencer_ids=NB551068,NB552085,M02353,M02631,A01229 --library_prep_names=ADX,NGS,TSO,SNP,DEV --tso_panels=Pan4969,Pan5085,Pan5112,Pan5114 --dev_panno=Pan5180 --panels=Pan5180,Pan4009,Pan2835,Pan4940,Pan4396,Pan5113,Pan5115,Pan4969,Pan5085,Pan5112,Pan5114,Pan5007,Pan5008,Pan5009,Pan5010,Pan5011,Pan5012,Pan5013,Pan5014,Pan5015,Pan5016,Pan4119,Pan4121,Pan4122,Pan4125,Pan4126,Pan4974,Pan4975,Pan4976,Pan4977,Pan4978,Pan4979,Pan4980,Pan4981,Pan4982,Pan4983,Pan4984,Pan4821,Pan4822,Pan4823,Pan4824,Pan4825,Pan4149,Pan4150,Pan4129,Pan4964,Pan4130,Pan5121,Pan5185,Pan5186,Pan5143,Pan5147,Pan4816,Pan4817,Pan5122,Pan5144,Pan5148,Pan4819,Pan4820,Pan4145,Pan4146,Pan4132,Pan4134,Pan4136,Pan4137,Pan4138,Pan4143,Pan4144,Pan4151,Pan4314,Pan4351,Pan4387,Pan4390,Pan4826,Pan4827,Pan4828,Pan4829,Pan4830,Pan4831,Pan4832,Pan4833,Pan4834,Pan4835,Pan4836 --logdir=.
7 changes: 7 additions & 0 deletions requirements.txt
@@ -0,0 +1,7 @@
git+https://github.com/moka-guys/[email protected]
setuptools==58.2.0
pytest==7.2.1
coverage==6.3.1
pytest==7.2.1
flake8==6.1.0
pytest-cov==4.1.0
Empty file.
108 changes: 108 additions & 0 deletions samplesheet_validator/__main__.py
@@ -0,0 +1,108 @@
import os
import argparse
from .samplesheet_validator import SamplesheetCheck


def get_arguments():
    """
    Uses argparse module to define and handle command line input arguments
    and help menu
        :return argparse.Namespace (object):    Contains the parsed arguments
    """
    parser = argparse.ArgumentParser(
        description=(
            "Given an input samplesheet, will validate the samplesheet using "
            "seglh-naming conventions and output a logfile"
        ),
        usage="Used to validate a samplesheet using the seglh-naming conventions",
    )
    parser.add_argument(
        "-S",
        "--samplesheet_path",
        type=lambda x: is_valid_file(parser, x),
        required=True,
        help="Path to samplesheet requiring validation",
    )
    parser.add_argument(
        "-SI",
        "--sequencer_ids",
        type=str,
        required=True,
        help="Comma separated string of allowed sequencer IDS",
    )
    parser.add_argument(
        "-P",
        "--panels",
        type=str,
        required=True,
        help="Comma separated string of allowed panel numbers",
    )
    parser.add_argument(
        "-R",
        "--library_prep_names",
        type=str,
        required=True,
        help="Comma separated string of allowed library prep names",
    )
    parser.add_argument(
        "-T",
        "--tso_panels",
        type=str,
        required=True,
        help="Comma separated string of tso panels",
    )
    parser.add_argument(
        "-D",
        "--dev_panno",
        type=str,
        required=True,
        help="Development pan number",
    )
    parser.add_argument(
        "-L",
        "--logdir",
        type=lambda x: is_valid_dir(parser, x),
        required=True,
        help="Directory to save the output logfile to",
    )
    return parser.parse_args()


def is_valid_file(parser: argparse.ArgumentParser, file: str) -> str:
    """
    Check file path is valid
        :param parser (argparse.ArgumentParser):    Holds necessary info to parse cmd
                                                    line into Python data types
        :param file (str):                          Input argument
    """
    if not os.path.exists(file):
        parser.error(f"The file {file} does not exist!")
    else:
        return file


def is_valid_dir(parser: argparse.ArgumentParser, dir: str) -> str:
    """
    Check directory path is valid
        :param parser (argparse.ArgumentParser):    Holds necessary info to parse cmd
                                                    line into Python data types
        :param dir (str):                           Input argument
    """
    if not os.path.isdir(dir):
        parser.error(f"The directory {dir} does not exist!")
    else:
        return dir


if __name__ == "__main__":
    parsed_args = get_arguments()
    sscheck_obj = SamplesheetCheck(
        parsed_args.samplesheet_path,
        parsed_args.sequencer_ids,
        parsed_args.panels,
        parsed_args.library_prep_names,
        parsed_args.tso_panels,
        parsed_args.dev_panno,
        parsed_args.logdir,
    )
    sscheck_obj.ss_checks()  # Carry out samplesheet validation
40 changes: 40 additions & 0 deletions samplesheet_validator/config.py
@@ -0,0 +1,40 @@
import datetime

TIMESTAMP = str(f"{datetime.datetime.now():%Y%m%d_%H%M%S}")

# Specifies the layout of log records in the final output
LOGGING_FORMATTER = "%(asctime)s - SAMPLESHEET_VALIDATOR - %(levelname)s - %(message)s"

LOG_MSGS = {
    "ss_present": "Samplesheet with supplied name exists (%s)",
    "ss_absent": "Samplesheet with supplied name does not exist (%s)",
    "ssname_valid": "Samplesheet name is valid (%s)",
    "ssname_invalid": "Samplesheet name is invalid (%s). Exception: %s",
    "sequencer_id_valid": "Sequencer ID in samplesheet name is valid",
    "sequencer_id_invalid": "Sequencer id not in allowed list (%s, %s)",
    "ss_not_empty": "Samplesheet is not empty (>10 bytes)",
    "ss_empty": "Samplesheet empty (<10 bytes)",
    "found_header_line": "Line in samplesheet identified as a header line",
    "found_sample_line": "Line in samplesheet identified as containing a sample",
    "error_extracting_headers": "An error was encountered when extracting headers from the samplesheet: %s",
    "found_empty_line": "Line in samplesheet is an empty line",
    "col_extraction_error": "Exception raised while attempting to extract %s from sample line %s: %s",
    "headers_as_expected": "Expected headers present in samplesheet",
    "headers_err": "Header(/s) missing from [Data] section: '%s'",
    "samplenames_match": "All sample names and sample IDS match",
    "nonmatching_samplenames": "The following Sample IDs do not match the corresponding Sample Name: (%s)",
    "no_illegal_chars": "Sample name %s contains no illegal characters in column %s",
    "illegal_chars": "Sample name contains invalid characters (%s: %s)",
    "sample_name_valid": "Sample name valid: %s (%s)",
    "sample_name_invalid": "Sample name invalid (%s). Exception: %s",
    "valid_panno": "Pan no is valid: %s",
    "invalid_panno": "Pan no is invalid: %s (%s: %s)",
    "valid_library_prep_name": "Library prep name is valid: %s",
    "library_prep_name_err": "Library prep name not in allowed list (%s, %s)",
    "dev_run": "Samplesheet is from a development run: %s",
    "not_dev_run": "Samplesheet is not from a development run: %s",
    "tso_run": "Samplesheet is for a TSO run",
    "not_tso_run": "Samplesheet is not for a TSO run",
    "sschecks_not_passed": "Samplesheet did not pass checks: %s",
    "sschecks_passed": "Samplesheet passed all checks %s",
}