Merge pull request #3 from moka-guys/development

v1.0.0 (#3) Co-Authored-By: Graeme Smith <[email protected]>
moka-guys · Jan 4, 2024 · 467a20d
2 parents 3b13e7c + 914e0f2
commit 467a20d
Showing 48 changed files with 3,211 additions and 19 deletions.
diff --git a/.github/workflows/on-pull-request.yaml b/.github/workflows/on-pull-request.yaml
@@ -0,0 +1,44 @@
+name: Samplesheet Validator
+
+on:
+  push:
+    branches:
+      - master
+      - 'feature/**'
+      - 'development'
+jobs:
+  build:
+
+    runs-on: ubuntu-latest
+    strategy:
+      matrix:
+        python-version: ['3.10.6']
+
+    steps:
+      - name: Checkout head
+        uses: actions/checkout@v3
+        with:
+          fetch-depth: 2
+          run: git checkout HEAD^
+      - name: Set up Python ${{ matrix.python-version }}
+        uses: actions/setup-python@v4
+        with:
+          python-version: ${{ matrix.python-version }}
+      - name: Install dependencies
+        run: |
+          python3 -m pip install --upgrade pip
+          pip3 install flake8==6.0.0 wheel==0.38.4 pytest==7.2.1
+          pip3 install -r requirements.txt
+      - name: Lint with flake8
+        run: |
+          # stop the build if there are:
+          # - syntax errors (E9)
+          # - common assertion and comparison gotchas (F63)
+          # - control flow gotchas (F7)
+          # - undefined names (F82)
+          flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
+          # exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
+          flake8 . --count --exit-zero --max-complexity=10 --max-line-length=120 --statistics
+      - name: Test with pytest
+        run: |
+          python3 -m pytest
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,9 @@
+venv/
+__pycache__/
+build/
+.coverage
+dist
+*.egg-info
+.pytest_cache
+*.log
+*.vscode
diff --git a/LICENSE b/LICENSE
@@ -1,21 +1,11 @@
-MIT License
+Copyright 2023 Synnovis
 
-Copyright (c) 2020 Graeme
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use these files except
+in compliance with the License. You may obtain a copy of the License at
 
-Permission is hereby granted, free of charge, to any person obtaining a copy
-of this software and associated documentation files (the "Software"), to deal
-in the Software without restriction, including without limitation the rights
-to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-copies of the Software, and to permit persons to whom the Software is
-furnished to do so, subject to the following conditions:
+[http://www.apache.org/licenses/LICENSE-2.0](http://www.apache.org/licenses/LICENSE-2.0)
 
-The above copyright notice and this permission notice shall be included in all
-copies or substantial portions of the Software.
-
-THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
-SOFTWARE.
+Unless required by applicable law or agreed to in writing, software distributed under the License
+is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
+or implied. See the License for the specific language governing permissions and limitations under
+the License.
diff --git a/README.md b/README.md
@@ -1 +1,106 @@
-# samplesheet_verifier
+# Samplesheet Validator
+
+Checks sample sheet naming and contents. Carries out a series of checks on the sample sheet and collects any errors 
+that it identifies (SamplesheetCheck.errors_list). It also identifies whether or not a run is a TSO run from the sample 
+sheet (SamplesheetCheck.tso).
+
+## Protocol
+
+Runs a series of checks on the sample sheet and collects any errors identified. Checks whether:
+* Sample sheet exists
+* Samplesheet name is valid (validates using the [seglh-naming](https://github.com/moka-guys/seglh-naming/) library)
+* Sequencer ID is in the list of allowed sequencer IDs supplied to the script
+* Samplesheet is not empty (>10 bytes)
+* Samplesheet is for a development run, using the development pan number supplied to the script
+* Samplesheet contains the minimum expected `[Data]` section headers: `Sample_ID, Sample_Name, index`
+* `Sample_ID` and `Sample_Name` match for each sample in the data section of the samplesheet
+* Sample name does not contain any illegal characters
+* Sample name is valid (validates using the [seglh-naming](https://github.com/moka-guys/seglh-naming/) library)
+* Pan numbers are in the list of allowed pan numbers supplied to the script
+* Library prep name in the sample name is in the list of allowed library prep names supplied to the script
+* Samplesheet contains any TSO samples
+
+## Usage
+
+### Python package
+
+The repository provides a python package which can be installed with:
+
+`python3 setup.py install`
+
+NB: Use the `--user` flag or install into a virtualenv/pipenv if not installing globally.
+
+```python
+
+from samplesheet_validator.samplesheet_validator import SamplesheetCheck
+
+sscheck_obj = SamplesheetCheck(
+    samplesheet_path,  # str
+    sequencer_ids,  # list
+    panels,  # list
+    library_prep_names,  # list
+    tso_panels,  # list
+    dev_panno,  # str
+    logdir,  # str
+)
+sscheck_obj.ss_checks()  # Carry out samplesheet validation
+
+print(sscheck_obj.errors_dict)  # View the dictionary of error messages
+```
+
+### Command line
+
+The environment must be set up as follows:
+```bash
+python3 -m venv venv
+source venv/bin/activate
+pip3 install -r requirements.txt
+```
+
+The script can then be used as follows:
+```bash
+usage: Used to validate a samplesheet using the seglh-naming conventions
+
+Given an input samplesheet, will validate the samplesheet using seglh-naming conventions and output a logfile
+
+options:
+  -h, --help            show this help message and exit
+  -S SAMPLESHEET_PATH, --samplesheet_path SAMPLESHEET_PATH
+                        Path to samplesheet requiring validation
+  -SI SEQUENCER_IDS, --sequencer_ids SEQUENCER_IDS
+                        Comma separated string of allowed sequencer IDS
+  -P PANELS, --panels PANELS
+                        Comma separated string of allowed panel numbers
+  -R LIBRARY_PREP_NAMES, --library_prep_names LIBRARY_PREP_NAMES
+                        Comma separated string of allowed library prep names
+  -T TSO_PANELS, --tso_panels TSO_PANELS
+                        Comma separated string of tso panels
+  -D DEV_PANNO, --dev_panno DEV_PANNO
+                        Development pan number
+  -L LOGDIR, --logdir LOGDIR
+                        Directory to save the output logfile to
+```
+
+### Testing
+
+Test datasets are stored in [/test/data](../test/data). The script has a full test suite:
+* [test_samplesheet_validator.py](../test/test_samplesheet_validator.py)
+
+These tests should be run before pushing any code to ensure all tests in the GitHub Actions workflow pass. These can be run as follows:
+
+```bash
+python3 -m pytest
+```
+**N.B. Tests and test cases/files MUST be maintained and updated accordingly in conjunction with script development**
+**N.B. This includes ensuring that the arguments passed to pytest in the [pytest.ini](pytest.ini) file are kept up to date**
+
+
+## Logging
+
+Logging is performed by [ss_logger](samplesheet_validator/ss_logger.py). The directory to save the log file to is supplied as an argument. The output log file is named by the script as follows:
+- `$LOGFILE_DIR/$RUNFOLDER_NAME_$TIMESTAMP_samplesheet_validator.log`
+
+The script also collects the error messages as it runs, which can be used by other scripts when this script is used as an import.
+
+
+### Developed by the Synnovis Genome Informatics Team
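The log file naming described in the README's Logging section can be illustrated with a short sketch. The timestamp pattern matches `TIMESTAMP` in `samplesheet_validator/config.py`; the helper name `build_logfile_path`, the `_SampleSheet.csv` suffix handling and the example run name are assumptions made for illustration, not the project's actual `ss_logger` code.

```python
import datetime
import os


def build_logfile_path(logdir: str, samplesheet_path: str, now: datetime.datetime) -> str:
    """Hypothetical helper: build $LOGFILE_DIR/$RUNFOLDER_NAME_$TIMESTAMP_samplesheet_validator.log."""
    # Same strftime pattern as TIMESTAMP in samplesheet_validator/config.py
    timestamp = f"{now:%Y%m%d_%H%M%S}"
    # Assumption: the runfolder name is the samplesheet file name minus this suffix
    suffix = "_SampleSheet.csv"
    filename = os.path.basename(samplesheet_path)
    runfolder_name = filename[: -len(suffix)] if filename.endswith(suffix) else filename
    return os.path.join(logdir, f"{runfolder_name}_{timestamp}_samplesheet_validator.log")


# Made-up run name, used only to show the shape of the result
example_path = build_logfile_path(
    "logs",
    "runs/221021_A01229_0145_BHGGTHDMXY_SampleSheet.csv",
    datetime.datetime(2024, 1, 4, 9, 30, 0),
)
print(example_path)
```

Passing the time in as an argument, rather than reading the clock inside the helper, keeps the function easy to test.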
diff --git a/pytest.ini b/pytest.ini
@@ -0,0 +1,2 @@
+[pytest]
+addopts = -v --ignore=test/data/ --ignore=test/temp/ --cov=. --cov-report term-missing --sequencer_ids=NB551068,NB552085,M02353,M02631,A01229 --library_prep_names=ADX,NGS,TSO,SNP,DEV --tso_panels=Pan4969,Pan5085,Pan5112,Pan5114 --dev_panno=Pan5180 --panels=Pan5180,Pan4009,Pan2835,Pan4940,Pan4396,Pan5113,Pan5115,Pan4969,Pan5085,Pan5112,Pan5114,Pan5007,Pan5008,Pan5009,Pan5010,Pan5011,Pan5012,Pan5013,Pan5014,Pan5015,Pan5016,Pan4119,Pan4121,Pan4122,Pan4125,Pan4126,Pan4974,Pan4975,Pan4976,Pan4977,Pan4978,Pan4979,Pan4980,Pan4981,Pan4982,Pan4983,Pan4984,Pan4821,Pan4822,Pan4823,Pan4824,Pan4825,Pan4149,Pan4150,Pan4129,Pan4964,Pan4130,Pan5121,Pan5185,Pan5186,Pan5143,Pan5147,Pan4816,Pan4817,Pan5122,Pan5144,Pan5148,Pan4819,Pan4820,Pan4145,Pan4146,Pan4132,Pan4134,Pan4136,Pan4137,Pan4138,Pan4143,Pan4144,Pan4151,Pan4314,Pan4351,Pan4387,Pan4390,Pan4826,Pan4827,Pan4828,Pan4829,Pan4830,Pan4831,Pan4832,Pan4833,Pan4834,Pan4835,Pan4836 --logdir=.
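The `addopts` line above passes custom flags such as `--sequencer_ids` and `--panels`. pytest rejects flags it does not recognise, so they have to be registered, usually through a `pytest_addoption` hook in a `conftest.py`. That file is not part of the diff shown here, so the following is only a minimal sketch of what such registration might look like; the `CUSTOM_OPTIONS` tuple and the `split_csv` helper are hypothetical names.

```python
# conftest.py (hypothetical sketch; the repository's own conftest is not shown in this diff)

# Option names taken from the addopts line in pytest.ini
CUSTOM_OPTIONS = (
    "sequencer_ids",
    "panels",
    "library_prep_names",
    "tso_panels",
    "dev_panno",
    "logdir",
)


def pytest_addoption(parser):
    """Register each custom flag so that pytest accepts it on the command line."""
    for name in CUSTOM_OPTIONS:
        parser.addoption(f"--{name}", action="store", default=None, help=f"Value for {name}")


def split_csv(value: str) -> list:
    """Turn a comma separated option value into a list of non-empty strings."""
    return [item.strip() for item in value.split(",") if item.strip()]
```

A test or fixture could then read a value with `request.config.getoption("--sequencer_ids")` and pass it through `split_csv` where a list is needed.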
diff --git a/requirements.txt b/requirements.txt
@@ -0,0 +1,7 @@
+git+https://github.com/moka-guys/[email protected]
+setuptools==58.2.0
+pytest==7.2.1
+coverage==6.3.1
+pytest==7.2.1
+flake8==6.1.0
+pytest-cov==4.1.0
diff --git a/samplesheet_validator/__init__.py b/samplesheet_validator/__init__.py
diff --git a/samplesheet_validator/__main__.py b/samplesheet_validator/__main__.py
@@ -0,0 +1,108 @@
+import os
+import argparse
+from .samplesheet_validator import SamplesheetCheck
+
+
+def get_arguments():
+    """
+    Uses argparse module to define and handle command line input arguments
+    and help menu
+        :return argparse.Namespace (object):    Contains the parsed arguments
+    """
+    parser = argparse.ArgumentParser(
+        description=(
+            "Given an input samplesheet, will validate the samplesheet using "
+            "seglh-naming conventions and output a logfile"
+        ),
+        usage="Used to validate a samplesheet using the seglh-naming conventions",
+    )
+    parser.add_argument(
+        "-S",
+        "--samplesheet_path",
+        type=lambda x: is_valid_file(parser, x),
+        required=True,
+        help="Path to samplesheet requiring validation",
+    )
+    parser.add_argument(
+        "-SI",
+        "--sequencer_ids",
+        type=str,
+        required=True,
+        help="Comma separated string of allowed sequencer IDS",
+    )
+    parser.add_argument(
+        "-P",
+        "--panels",
+        type=str,
+        required=True,
+        help="Comma separated string of allowed panel numbers",
+    )
+    parser.add_argument(
+        "-R",
+        "--library_prep_names",
+        type=str,
+        required=True,
+        help="Comma separated string of allowed library prep names",
+    )
+    parser.add_argument(
+        "-T",
+        "--tso_panels",
+        type=str,
+        required=True,
+        help="Comma separated string of tso panels",
+    )
+    parser.add_argument(
+        "-D",
+        "--dev_panno",
+        type=str,
+        required=True,
+        help="Development pan number",
+    )
+    parser.add_argument(
+        "-L",
+        "--logdir",
+        type=lambda x: is_valid_dir(parser, x),
+        required=True,
+        help="Directory to save the output logfile to",
+    )
+    return parser.parse_args()
+
+
+def is_valid_file(parser: argparse.ArgumentParser, file: str) -> str:
+    """
+    Check file path is valid
+        :param parser (argparse.ArgumentParser):    Holds necessary info to parse cmd
+                                                    line into Python data types
+        :param file (str):                          Input argument
+    """
+    if not os.path.exists(file):
+        parser.error(f"The file {file} does not exist!")
+    else:
+        return file
+
+
+def is_valid_dir(parser: argparse.ArgumentParser, dir: str) -> str:
+    """
+    Check directory path is valid
+        :param parser (argparse.ArgumentParser):    Holds necessary info to parse cmd
+                                                    line into Python data types
+        :param dir (str):                           Input argument
+    """
+    if not os.path.isdir(dir):
+        parser.error(f"The directory {dir} does not exist!")
+    else:
+        return dir
+
+
+if __name__ == "__main__":
+    parsed_args = get_arguments()
+    sscheck_obj = SamplesheetCheck(
+        parsed_args.samplesheet_path,
+        parsed_args.sequencer_ids,
+        parsed_args.panels,
+        parsed_args.library_prep_names,
+        parsed_args.tso_panels,
+        parsed_args.dev_panno,
+        parsed_args.logdir,
+    )
+    sscheck_obj.ss_checks()  # Carry out samplesheet validation
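The command line entry point passes the comma separated strings straight to `SamplesheetCheck`, while the README documents those parameters as lists. Whether the class splits them internally is not visible in this part of the diff. If it does not, membership tests would fall back to substring matching on the raw string. The sketch below shows one way to split the values at parse time, assuming a hypothetical `comma_separated` type function; only two of the options are reproduced.

```python
import argparse


def comma_separated(value: str) -> list:
    """Hypothetical argparse type: split 'a,b,c' into ['a', 'b', 'c']."""
    return [item.strip() for item in value.split(",") if item.strip()]


def build_parser() -> argparse.ArgumentParser:
    """Build a cut-down parser with list-valued options."""
    parser = argparse.ArgumentParser(description="Sketch of list-valued options")
    parser.add_argument("-SI", "--sequencer_ids", type=comma_separated, required=True)
    parser.add_argument("-P", "--panels", type=comma_separated, required=True)
    return parser


args = build_parser().parse_args(["-SI", "NB551068,M02353", "-P", "Pan5180,Pan4009"])
print(args.sequencer_ids)
print(args.panels)

# Substring matching on the raw string accepts a truncated ID; the list does not
print("M0235" in "NB551068,M02353")
print("M0235" in args.sequencer_ids)
```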
diff --git a/samplesheet_validator/config.py b/samplesheet_validator/config.py
@@ -0,0 +1,40 @@
+import datetime
+
+TIMESTAMP = str(f"{datetime.datetime.now():%Y%m%d_%H%M%S}")
+
+# Specifies the layout of log records in the final output
+LOGGING_FORMATTER = "%(asctime)s - SAMPLESHEET_VALIDATOR - %(levelname)s - %(message)s"
+
+LOG_MSGS = {
+    "ss_present": "Samplesheet with supplied name exists (%s)",
+    "ss_absent": "Samplesheet with supplied name does not exist (%s)",
+    "ssname_valid": "Samplesheet name is valid (%s)",
+    "ssname_invalid": "Samplesheet name is invalid (%s). Exception: %s",
+    "sequencer_id_valid": "Sequencer ID in samplesheet name is valid",
+    "sequencer_id_invalid": "Sequencer id not in allowed list (%s, %s)",
+    "ss_not_empty": "Samplesheet is not empty (>10 bytes)",
+    "ss_empty": "Samplesheet empty (<10 bytes)",
+    "found_header_line": "Line in samplesheet identified as a header line",
+    "found_sample_line": "Line in samplesheet identified as containing a sample",
+    "error_extracting_headers": "An error was encountered when extracting headers from the samplesheet: %s",
+    "found_empty_line": "Line in samplesheet is an empty line",
+    "col_extraction_error": "Exception raised while attempting to extract %s from sample line %s: %s",
+    "headers_as_expected": "Expected headers present in samplesheet",
+    "headers_err": "Header(/s) missing from [Data] section: '%s'",
+    "samplenames_match": "All sample names and sample IDS match",
+    "nonmatching_samplenames": "The following Sample IDs do not match the corresponding Sample Name: (%s)",
+    "no_illegal_chars": "Sample name %s contains no illegal characters in column %s",
+    "illegal_chars": "Sample name contains invalid characters (%s: %s)",
+    "sample_name_valid": "Sample name valid: %s (%s)",
+    "sample_name_invalid": "Sample name invalid (%s). Exception: %s",
+    "valid_panno": "Pan no is valid: %s",
+    "invalid_panno": "Pan no is invalid: %s (%s: %s)",
+    "valid_library_prep_name": "Library prep name is valid: %s",
+    "library_prep_name_err": "Library prep name not in allowed list (%s, %s)",
+    "dev_run": "Samplesheet is from a development run: %s",
+    "not_dev_run": "Samplesheet is not from a development run: %s",
+    "tso_run": "Samplesheet is for a TSO run",
+    "not_tso_run": "Samplesheet is not for a TSO run",
+    "sschecks_not_passed": "Samplesheet did not pass checks: %s",
+    "sschecks_passed": "Samplesheet passed all checks %s",
+}
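The `LOG_MSGS` templates use `%s` placeholders, which suits the `logging` module's deferred formatting: the template and its arguments are passed separately and combined only when the record is emitted. The sketch below shows how a check might log a message and also keep the formatted text for callers. The `record_error` helper, the logger name and the layout of the `errors` dictionary are assumptions for illustration, not the validator's actual implementation.

```python
import logging

# Two templates copied from LOG_MSGS in config.py
LOG_MSGS = {
    "sequencer_id_invalid": "Sequencer id not in allowed list (%s, %s)",
    "ss_empty": "Samplesheet empty (<10 bytes)",
}

logger = logging.getLogger("samplesheet_validator_sketch")


def record_error(errors: dict, key: str, *args) -> str:
    """Hypothetical helper: log a templated error and store the formatted text."""
    template = LOG_MSGS[key]
    # logging applies the %-formatting itself when the record is emitted
    logger.error(template, *args)
    # Format once more so the message can be kept for other scripts to read
    message = template % args if args else template
    errors.setdefault(key, []).append(message)
    return message


errors = {}
record_error(errors, "sequencer_id_invalid", "A00000", ["NB551068", "A01229"])
record_error(errors, "ss_empty")
print(errors)
```

Keeping the messages keyed by check name lets an importing script test for a specific failure without parsing log text.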