Improved paragraph algorithm #118

Merged: 88 commits, Aug 1, 2024

Commits (88)
2aade13
Merge branch 'master' of https://github.com/uclh-criu/miade into dev
Jan 9, 2023
fdd1993
update git tag versioning changes from #70
Jan 19, 2023
e387e30
added confidence to meta to match for API compatibility
Jan 31, 2023
f7d0b82
:rotating_light: Black
Feb 1, 2023
5a2a16b
A little manual merge? added latest updated negex meta priority for t…
Feb 1, 2023
3a072fc
Merge branch 'feature/logging-config' into dev
Feb 17, 2023
cce70e1
Merge branch 'feature/logging-config' into dev
Feb 17, 2023
3405ae9
:white_check_mark: Updated tests
Feb 17, 2023
dbc5b62
Merge branch 'feature/logging-config' into dev
Feb 17, 2023
f9b46e0
Merge branch 'hotfix-negex-config' into dev
Feb 21, 2023
6eaaf66
Merge branch 'feature/74-blacklist-filtering' into dev
Mar 3, 2023
1fbfae3
Merge remote-tracking branch 'origin/feature/filter-debug-logs' into dev
Mar 10, 2023
cfb1578
Pull bug fixes and medcat upgrade: Merge branch 'master' of https://g…
Apr 21, 2023
202dfad
Remove Debug Mode - Merge branch 'master' of https://github.com/uclh-…
Apr 26, 2023
508f1d1
Merge branch 'master' of https://github.com/uclh-criu/miade into dev
Apr 27, 2023
31749ac
Merge branch 'feature/48-allergy-postprocessing' into dev
Jul 27, 2023
9924b5d
Merge branch 'master' of https://github.com/uclh-criu/miade into dev
Jul 27, 2023
a958139
Merge branch 'feature/48-allergy-postprocessing' into dev
Jul 28, 2023
1a1e2bc
Remove converted tag
jamesbrandreth Sep 6, 2023
1747d5f
:arrow_up: Upgrade python to 3.11
jamesbrandreth Sep 7, 2023
5362af1
:sparkles: paragraph chunking wip
Sep 22, 2023
f9c0109
Added loading regex config for paragraphs and capping problem concept…
Sep 28, 2023
bc5c973
Keep n dash between numbers, update tests
Sep 29, 2023
06adfde
:white_check_mark:
Sep 29, 2023
c0b39ad
Merge branch 'feature/paragraph-chunking' into dev
Sep 29, 2023
f0acfec
:bug::loud_sound: Fixed negex bug and added logs for paragrapher
Oct 2, 2023
f4708f4
:loud_sound: Log message adjustments to make it more readable
Oct 4, 2023
a5d3f54
:building_construction: Changed the way annotators and negex pipeline…
Oct 4, 2023
c85172a
:bug: Fixed historic concepts with no conversion are returned as is
Oct 4, 2023
088c810
:white_check_mark: forgot to commit updated tests
Oct 4, 2023
7920df1
Added error handling for config loading
Oct 5, 2023
50e4aec
Merge branch 'bug-fix/negex-config-and-improve-logging' into dev
Oct 6, 2023
cd7b5b8
Merge branch 'master' of https://github.com/uclh-criu/miade into dev
Oct 6, 2023
72f31cf
:arrow_up: MedCAT==1.9.3
Oct 10, 2023
a370269
Merge branch 'hotfix/medcat-patch' into dev
Oct 10, 2023
60e02fd
Merge branch 'master' of https://github.com/uclh-criu/miade into dev
Oct 10, 2023
5c2eb5a
Algorithm tweak to default to adverse reaction in allergy paragraphs …
jenniferjiangkells Oct 30, 2023
8956c18
:construction_worker: Install pytorch cpu
Oct 30, 2023
3109cd0
Update lookup data (#94)
jenniferjiangkells Oct 30, 2023
95bbfeb
:bug: Fixed vtm deduplication bug (#96)
jenniferjiangkells Oct 30, 2023
8bf1bfb
:construction_worker: Run tests on dev PRs
Oct 30, 2023
ca7dbe0
:bug: Fixed prob list cap bug to return exactly first 10 prob concepts
Oct 30, 2023
dbb165d
:wrench: Update sample miade config
Oct 30, 2023
3bffac2
Revert long problems list filtering and add numbering to problems lis…
jenniferjiangkells Nov 2, 2023
a1b9aa8
Streamlit trainer preprocessing (#97)
jenniferjiangkells Nov 6, 2023
4f41854
Update problem blacklist
Nov 14, 2023
656b613
Merge branch 'update-blacklist' into dev
Nov 14, 2023
46e7d86
Update blacklist
Nov 14, 2023
57e2a97
Merge branch 'dev' of https://github.com/uclh-criu/miade into dev
Nov 14, 2023
69eeb57
Update blacklist
Nov 14, 2023
75c8695
:card_file_box: Update problems blacklist v2
Nov 15, 2023
4ceab93
Switch to Ruff for formatting and Linting (#104)
jamesbrandreth Nov 21, 2023
8ec874f
Merge branch 'blacklist' into dev
jamesbrandreth Nov 23, 2023
ecb9e93
:bug: Fix allergy_type and reaction filtering bug
Nov 29, 2023
6eab644
Merge branch 'master' of https://github.com/uclh-criu/miade into bugf…
Nov 30, 2023
17cf7ed
Remove debug logging
Nov 30, 2023
1d595e3
Merge branch 'bugfix/allergy-reactions-postprocess' into dev
Nov 30, 2023
a587eb3
Remove (converted) tag
Nov 30, 2023
18ac309
:truck: Move lookup data outside package (#107)
jenniferjiangkells Nov 30, 2023
e074b8f
Merge branch 'master' of https://github.com/uclh-criu/miade into dev
Dec 18, 2023
9bb6fda
Merge branch 'master' of https://github.com/uclh-criu/miade into dev
Jan 9, 2024
57728d7
Put paragraph body in header (#111)
jenniferjiangkells Jan 17, 2024
d89ff7f
:wrench: Make paragraph regex configurable from lookup path (#117)
jenniferjiangkells Apr 22, 2024
f750c3b
Merge branch 'dev' into improved-paragraph-algorithm
Apr 22, 2024
e033cb2
Super manual paragraph algorithm wip
Apr 23, 2024
6799392
wip
jenniferjiangkells Jun 25, 2024
c1b53b5
Fix ci - install with no deps
jenniferjiangkells Jun 25, 2024
7099a36
install med7 with requirements in ci
jenniferjiangkells Jun 25, 2024
0b5ebd1
add requirements.txt
jenniferjiangkells Jun 25, 2024
0cc23b9
try find-links
jenniferjiangkells Jun 25, 2024
eb7ffe3
try installing from wheel
jenniferjiangkells Jun 25, 2024
d37d5dd
try pip install directly
jenniferjiangkells Jun 25, 2024
55fd2e9
ci
jenniferjiangkells Jun 25, 2024
627fec6
Fix pip==24.0 in ci
jenniferjiangkells Jun 25, 2024
f954510
note methods wip
jenniferjiangkells Jun 25, 2024
91ef392
Merge branch 'master' into improved-paragraph-algorithm
jenniferjiangkells Jul 24, 2024
d454701
Add regex lookup back
jenniferjiangkells Jul 24, 2024
36be84e
Fixed lookup data loading
jenniferjiangkells Jul 30, 2024
2010212
Added functions to merge paragraph and NumberedList object
jenniferjiangkells Jul 30, 2024
4189128
Delete prose_paragraph attribute
jenniferjiangkells Jul 30, 2024
b340f74
Add docstrings
jenniferjiangkells Jul 30, 2024
971c760
Added number list filter method in annotator
jenniferjiangkells Jul 30, 2024
b638036
Fix tests
jenniferjiangkells Jul 30, 2024
a57832e
Cover edge cases and added tests
jenniferjiangkells Jul 31, 2024
dd02a06
Add list_cleaner to annotator pipelines and make run_pipeline method …
jenniferjiangkells Jul 31, 2024
745ad15
Add refine_paragraphs option to AnnotatorConfig
jenniferjiangkells Jul 31, 2024
1ca914e
Added tests for new features
jenniferjiangkells Jul 31, 2024
31f976d
Fix docstring
jenniferjiangkells Aug 1, 2024
Files changed
12 changes: 6 additions & 6 deletions .github/workflows/ci.yml
@@ -25,16 +25,16 @@ jobs:
pip install torch --index-url https://download.pytorch.org/whl/cpu
pip install ./
pip list
- name: download model
- name: download models
run: |
python -m spacy download en_core_web_md
pip install https://huggingface.co/kormilitzin/en_core_med7_lg/resolve/main/en_core_med7_lg-any-py3-none-any.whl
pip install -r requirements.txt
- name: run pytest
run: pytest ./tests/*
- name: install ruff
- name: Install ruff
run: pip install ruff
- name: ruff format
- name: Lint with ruff
run: |
ruff format
ruff --output-format=github .
ruff check --fix
continue-on-error: true
continue-on-error: true
1 change: 1 addition & 0 deletions requirements.txt
@@ -0,0 +1 @@
https://huggingface.co/kormilitzin/en_core_med7_lg/resolve/main/en_core_med7_lg-any-py3-none-any.whl
302 changes: 215 additions & 87 deletions src/miade/annotators.py

Large diffs are not rendered by default.

220 changes: 172 additions & 48 deletions src/miade/note.py
@@ -1,67 +1,30 @@
import re
import io
import pkgutil
import logging
import pandas as pd

from typing import List, Optional, Dict

from .paragraph import Paragraph, ParagraphType
from .paragraph import ListItem, NumberedList, Paragraph, ParagraphType


log = logging.getLogger(__name__)


def load_regex_config_mappings(filename: str) -> Dict:
"""
Load regex configuration mappings from a file.

Args:
filename (str): The name of the file containing the regex configuration.

Returns:
A dictionary mapping paragraph types to their corresponding regex patterns.
"""
regex_config = pkgutil.get_data(__name__, filename)
data = (
pd.read_csv(
io.BytesIO(regex_config),
index_col=0,
)
.squeeze("columns")
.T.to_dict()
)
regex_lookup = {}

for paragraph, regex in data.items():
paragraph_enum = None
try:
paragraph_enum = ParagraphType(paragraph)
except ValueError as e:
log.warning(e)

if paragraph_enum is not None:
regex_lookup[paragraph_enum] = regex

return regex_lookup


class Note(object):
"""
Represents a note object.
Represents a Note object

Attributes:
text (str): The text content of the note.
raw_text (str): The raw text content of the note.
regex_config (str): The path to the regex configuration file.
paragraphs (Optional[List[Paragraph]]): A list of paragraphs in the note.
paragraphs (Optional[List[Paragraph]]): A list of Paragraph objects representing the paragraphs in the note.
numbered_lists (Optional[List[NumberedList]]): A list of NumberedList objects representing the numbered lists in the note.
"""

def __init__(self, text: str, regex_config_path: str = "./data/regex_para_chunk.csv"):
def __init__(self, text: str):
self.text = text
self.raw_text = text
self.regex_config = load_regex_config_mappings(regex_config_path)
self.paragraphs: Optional[List[Paragraph]] = []
self.numbered_lists: Optional[List[NumberedList]] = []

def clean_text(self) -> None:
"""
@@ -83,14 +46,61 @@ def clean_text(self) -> None:
# Remove spaces if the entire line (between two line breaks) is just spaces
self.text = re.sub(r"(?<=\n)\s+(?=\n)", "", self.text)

def get_paragraphs(self) -> None:
def get_numbered_lists(self):
"""
Splits the note into paragraphs.
Finds multiple lists of sequentially ordered numbers (with more than one item) that directly follow a newline character
and captures the text following these numbers up to the next newline.

This method splits the text content of the note into paragraphs based on double line breaks.
It also assigns a paragraph type to each paragraph based on matching patterns in the heading.
Operates on self.text and takes no arguments.

Returns:
None. Populates self.numbered_lists with a NumberedList for each valid sequence found; each NumberedList
holds ListItem objects carrying the captured text and its start and end indices in the note text.
"""
# Regular expression to find numbers followed by any characters until a newline
pattern = re.compile(r"(?<=\n)(\d+.*)")

# Finding all matches
matches = pattern.finditer(self.text)

all_results = []
results = []
last_num = 0
for match in matches:
number_text = match.group(1)
current_num = int(re.search(r"^\d+", number_text).group())

# Check if current number is the next in sequence
if current_num == last_num + 1:
results.append(ListItem(content=number_text, start=match.start(1), end=match.end(1)))
else:
# If there is a break in the sequence, check if current list has more than one item
if len(results) > 1:
numbered_list = NumberedList(items=results, list_start=results[0].start, list_end=results[-1].end)
all_results.append(numbered_list)
results = [
ListItem(content=number_text, start=match.start(1), end=match.end(1))
] # Start new results list with the current match
last_num = current_num # Update last number to the current

# Add the last sequence if not empty and has more than one item
if len(results) > 1:
numbered_list = NumberedList(items=results, list_start=results[0].start, list_end=results[-1].end)
all_results.append(numbered_list)

self.numbered_lists = all_results

def get_paragraphs(self, paragraph_regex: Dict) -> None:
"""
Split the text into paragraphs and assign paragraph types based on regex patterns.

Args:
paragraph_regex (Dict): A dictionary containing paragraph types as keys and regex patterns as values.

Returns:
None
"""
paragraphs = re.split(r"\n\n+", self.text)
start = 0

@@ -117,12 +127,126 @@ def get_paragraphs(self) -> None:
if heading:
heading = heading.lower()
# Iterate through the dictionary items and patterns
for paragraph_type, pattern in self.regex_config.items():
for paragraph_type, pattern in paragraph_regex.items():
if re.search(pattern, heading):
paragraph.type = paragraph_type
break # Exit the loop if a match is found

self.paragraphs.append(paragraph)

def merge_prose_sections(self) -> None:
"""
Merges consecutive prose sections in the paragraphs list.

Returns:
None. Consecutive prose paragraphs are merged in place in self.paragraphs.
"""
is_merge = False
all_prose = []
prose_section = []
prose_indices = []

for i, paragraph in enumerate(self.paragraphs):
if paragraph.type == ParagraphType.prose:
if is_merge:
prose_section.append((i, paragraph))
else:
prose_section = [(i, paragraph)]
is_merge = True
else:
if len(prose_section) > 0:
all_prose.append(prose_section)
prose_indices.extend([idx for idx, _ in prose_section])
is_merge = False

if len(prose_section) > 0:
all_prose.append(prose_section)
prose_indices.extend([idx for idx, _ in prose_section])

new_paragraphs = self.paragraphs[:]

for section in all_prose:
start = section[0][1].start
end = section[-1][1].end
new_prose_para = Paragraph(
heading=self.text[start:end], body="", type=ParagraphType.prose, start=start, end=end
)

# Replace the first paragraph in the section with the new merged paragraph
first_idx = section[0][0]
new_paragraphs[first_idx] = new_prose_para

# Mark other paragraphs in the section for deletion
for _, paragraph in section[1:]:
index = self.paragraphs.index(paragraph)
new_paragraphs[index] = None

# Remove the None entries from new_paragraphs
self.paragraphs = [para for para in new_paragraphs if para is not None]

def merge_empty_non_prose_with_next_prose(self) -> None:
"""
This method checks if a Paragraph has an empty body and a type that is not prose,
and merges it with the next Paragraph if the next paragraph is type prose.

Returns:
None
"""
merged_paragraphs = []
skip_next = False

for i in range(len(self.paragraphs) - 1):
if skip_next:
# Skip this iteration because the previous iteration already handled merging
skip_next = False
continue

current_paragraph = self.paragraphs[i]
next_paragraph = self.paragraphs[i + 1]

# Check if current paragraph has an empty body and is not of type prose
if current_paragraph.body == "" and current_paragraph.type != ParagraphType.prose:
# Check if the next paragraph is of type prose
if next_paragraph.type == ParagraphType.prose:
# Create a new Paragraph with merged content and type prose
merged_paragraph = Paragraph(
heading=current_paragraph.heading,
body=next_paragraph.heading,
type=current_paragraph.type,
start=current_paragraph.start,
end=next_paragraph.end,
)
merged_paragraphs.append(merged_paragraph)
# Skip the next paragraph since it's already merged
skip_next = True
continue

# If no merging is done, add the current paragraph to the list
merged_paragraphs.append(current_paragraph)

# Handle the last paragraph if it wasn't merged
if not skip_next:
merged_paragraphs.append(self.paragraphs[-1])

# Update the paragraphs list with the merged paragraphs
self.paragraphs = merged_paragraphs

def process(self, lookup_dict: Dict, refine: bool = True):
"""
Process the note by cleaning the text, extracting numbered lists, and getting paragraphs based on a lookup dictionary.

Args:
lookup_dict (Dict): A dictionary mapping paragraph types to regex patterns, passed to get_paragraphs().
refine (bool, optional): Flag indicating whether to refine the processed note - this will merge any consecutive prose
paragraphs and then merge any structured paragraphs with empty body with the next prose paragraph (handles line break
between heading and body). Defaults to True.
"""
self.clean_text()
self.get_numbered_lists()
self.get_paragraphs(lookup_dict)
if refine:
self.merge_prose_sections()
self.merge_empty_non_prose_with_next_prose()

def __str__(self):
return self.text
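
A minimal usage sketch of the reworked Note API, based only on what this diff shows: the regex lookup is now passed into process() instead of being loaded inside Note, and paragraphs and numbered lists end up as attributes. The ParagraphType members other than prose, the example regex patterns, and the sample text are illustrative assumptions, not values taken from this repository.

```python
from miade.note import Note
from miade.paragraph import ParagraphType

# Hypothetical lookup: maps ParagraphType members to heading regexes.
# Only ParagraphType.prose is confirmed by this diff; the members below are assumed.
paragraph_regex = {
    ParagraphType.prob: r"problem|diagnos",
    ParagraphType.med: r"medication|meds",
}

text = (
    "Problems:\n\n"
    "1. Hypertension\n"
    "2. Asthma\n\n"
    "Seen in clinic today, doing well overall.\n\n"
    "Will review again in 3 months."
)

note = Note(text)
# process() runs clean_text, get_numbered_lists and get_paragraphs, then (with
# refine=True) merge_prose_sections and merge_empty_non_prose_with_next_prose.
note.process(paragraph_regex, refine=True)

for paragraph in note.paragraphs:
    print(paragraph.type, repr(note.text[paragraph.start : paragraph.end]))
for numbered_list in note.numbered_lists:
    print([item.content for item in numbered_list.items])
```
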
55 changes: 50 additions & 5 deletions src/miade/paragraph.py
@@ -1,4 +1,5 @@
from enum import Enum
from typing import List


class ParagraphType(Enum):
@@ -27,14 +28,58 @@ class Paragraph(object):
"""

def __init__(self, heading: str, body: str, type: ParagraphType, start: int, end: int):
self.heading = heading
self.body = body
self.type = type
self.start = start
self.end = end
self.heading: str = heading
self.body: str = body
self.type: ParagraphType = type
self.start: int = start
self.end: int = end

def __str__(self):
return str(self.__dict__)

def __eq__(self, other):
return self.type == other.type and self.start == other.start and self.end == other.end


class ListItem(object):
"""
Represents an item in a NumberedList

Attributes:
content (str): The content of the list item.
start (int): The starting index of the list item.
end (int): The ending index of the list item.
"""

def __init__(self, content: str, start: int, end: int) -> None:
self.content: str = content
self.start: int = start
self.end: int = end

def __str__(self):
return str(self.__dict__)

def __eq__(self, other):
return self.start == other.start and self.end == other.end


class NumberedList(object):
"""
Represents a numbered list.

Attributes:
items (List[ListItem]): The list of items in the numbered list.
list_start (int): The start index of the list in the note text.
list_end (int): The end index of the list in the note text.
"""

def __init__(self, items: List[ListItem], list_start: int, list_end: int) -> None:
self.list_start: int = list_start
self.list_end: int = list_end
self.items: List[ListItem] = items

def __str__(self):
return str(self.__dict__)

def __eq__(self, other):
return self.list_start == other.list_start and self.list_end == other.list_end
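
For reference, a small sketch of how the new data classes in paragraph.py fit together, using the constructors and equality semantics shown above; the character offsets are made-up values.

```python
from miade.paragraph import ListItem, NumberedList

# Made-up character offsets into a note's text.
items = [
    ListItem(content="1. Hypertension", start=10, end=25),
    ListItem(content="2. Asthma", start=26, end=35),
]
numbered_list = NumberedList(items=items, list_start=items[0].start, list_end=items[-1].end)

# Equality is positional: two lists covering the same character span compare equal,
# regardless of their items.
assert numbered_list == NumberedList(items=[], list_start=10, list_end=35)
```
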