
Commit

pin exact version for dependencies
vietnguyengit committed Oct 23, 2024
1 parent 5c8f4be commit de35d20
Showing 23 changed files with 5,161 additions and 4,947 deletions.
3 changes: 3 additions & 0 deletions .gitattributes
@@ -0,0 +1,3 @@
# Override jupyter in Github language stats for more accurate estimate of repo code languages
# reference: https://github.com/github/linguist/blob/master/docs/overrides.md#generated-code
*.ipynb linguist-generated
16 changes: 8 additions & 8 deletions data_discovery_ai/README_a.md
@@ -3,19 +3,19 @@
### 1. Keywords Classification
#### Problem Description

The new AODN Data Discovery portal is underpinned by a Geonetwork catalogue of metadata records that brings together well-curated, IMOS-managed records as well as records from external organisations and other contributors.

IMOS records and any records that exist in the current AODN portal use keywords and vocabularies that are tightly controlled in order for the current portal facets to operate. These are AODN vocabularies as defined in the ARDC repositories (https://vocabs.ardc.edu.au/).

Many other organisations use these AODN vocabularies and other well-known vocabularies (e.g. GCMD); however, there are many records in the catalogue that either use no keywords at all or use keywords that are not from controlled vocabularies.

The new AODN Data Discovery portal needs to filter metadata records based on a fixed set of “keywords” from the AODN vocabularies. The most important are:

- 'AODN Organisation Vocabulary',

- 'AODN Instrument Vocabulary',

- 'AODN Discovery Parameter Vocabulary'

There are many metadata records that have no keywords, or whose keywords do not come from a well-known vocabulary. Given the mapping rules derived from metadata records that do have AODN vocabularies, we aim to develop a machine learning model to predict the AODN keywords for these uncategorised datasets, in order to provide suggestions that can be used by the Data Discovery portal or other applications.

@@ -37,7 +37,7 @@ We also prepared a dataset that contains title and description information, whic
0832b98c-602e-4902-8438-a80d402469ea | IMOS SOOP Underway Data from AIMS Vessel RV Cape Ferguson Trip 6321 From 30 Oct 2015 To 02 Nov 2015 | 'Ships of Opportunity' (SOOP) is a facility of the Australian 'Integrated Marine Observing System' (IMOS) project. This data set was collected by the SOOP sub-facility 'Sensors on Tropical Research Vessels' aboard the RV Cape Ferguson Trip 6321. | [{'concepts': [{'id': 'Practical salinity of the water body', 'url': 'http://vocab.nerc.ac.uk/collection/P01/current/PSLTZZ01'}, {'id': 'Temperature of the water body', 'url': 'http://vocab.nerc.ac.uk/collection/P01/current/TEMPPR01'}, {'id': 'Fluorescence of the water body', 'url': 'http://vocab.nerc.ac.uk/collection/P01/current/FLUOZZZZ'}, {'id': 'Turbidity of the water body', 'url': 'http://vocab.nerc.ac.uk/collection/P01/current/TURBXXXX'}], 'scheme': 'theme', 'description': '', 'title': 'AODN Discovery Parameter Vocabulary'}]

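For context, the `keywords` column above holds a list of vocabulary groups, each with its own `concepts`. A minimal sketch (ours, not code from this repository) of flattening one record's keywords into per-vocabulary label lists:

```python
# Sketch only: the record below is a trimmed copy of the sample row's "keywords"
# value; field names follow that structure.
record_keywords = [
    {
        "concepts": [
            {
                "id": "Practical salinity of the water body",
                "url": "http://vocab.nerc.ac.uk/collection/P01/current/PSLTZZ01",
            },
            {
                "id": "Temperature of the water body",
                "url": "http://vocab.nerc.ac.uk/collection/P01/current/TEMPPR01",
            },
        ],
        "scheme": "theme",
        "description": "",
        "title": "AODN Discovery Parameter Vocabulary",
    }
]

# Group concept labels by the vocabulary they come from.
labels_by_vocab = {}
for group in record_keywords:
    for concept in group.get("concepts", []):
        labels_by_vocab.setdefault(group.get("title", ""), []).append(concept["id"])

print(labels_by_vocab)
# {'AODN Discovery Parameter Vocabulary': ['Practical salinity of the water body',
#   'Temperature of the water body']}
```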
#### Acceptance Criteria
An Excel file containing the predicted keyword suggestions for these uncategorised metadata records.

### 2. Parameter Clustering

@@ -60,12 +60,12 @@ A prepared filtered dataset is structured as follows:
### 3. Searching Suggestions - Key Phrase Extraction
In the new AODN Data Discovery Portal, the search suggestions are derived from an algorithm that extracts the most common terms appearing in the title and abstract and stores them in Elasticsearch (field `_source.search_suggestions.abstract_phrases`). Some likely "non-words" are stripped out, but there are still many unhelpful suggestions.

We aim to develop a machine learning model to extract phrases that are more meaningful and targeted, so that they can be used as search suggestions.

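One candidate direction, sketched here purely for illustration (this is not the approach adopted in this repository), is embedding-based key phrase extraction, e.g. with KeyBERT, which ranks candidate n-grams from a document by their semantic similarity to the whole text:

```python
# Illustrative sketch, assuming the optional `keybert` package is installed.
from keybert import KeyBERT

doc = (
    "This data set was collected by the SOOP sub-facility "
    "'Sensors on Tropical Research Vessels' aboard the RV Cape Ferguson."
)

kw_model = KeyBERT()  # defaults to a small sentence-transformers model
phrases = kw_model.extract_keywords(
    doc, keyphrase_ngram_range=(1, 3), stop_words="english", top_n=5
)
print(phrases)  # list of (phrase, similarity score) pairs
```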
## Research Method
1. understand metadata
2. problem description
3. prepare datasets
4. research on methods
5. experiments
6. evaluation
16 changes: 8 additions & 8 deletions data_discovery_ai/README_b.md
@@ -5,19 +5,19 @@
### 1. Keywords Classification
#### Problem Description

The new AODN Data Discovery portal is underpinned by a Geonetwork catalogue of metadata records that brings together well-curated, IMOS-managed records as well as records from external organisations and other contributors.

IMOS records and any records that exist in the current AODN portal use keywords and vocabularies that are tightly controlled in order for the current portal facets to operate. These are AODN vocabularies as defined in the ARDC repositories (https://vocabs.ardc.edu.au/).

Many other organisations use these AODN vocabularies and other well-known vocabularies (e.g. GCMD); however, there are many records in the catalogue that either use no keywords at all or use keywords that are not from controlled vocabularies.

The new AODN Data Discovery portal needs to filter metadata records based on a fixed set of “keywords” from the AODN vocabularies. The most important are:

- 'AODN Organisation Vocabulary',

- 'AODN Instrument Vocabulary',

- 'AODN Discovery Parameter Vocabulary'

- 'AODN Parameter Category Vocabulary'

@@ -43,7 +43,7 @@ We also prepared a dataset that contains title and description information, whic
0832b98c-602e-4902-8438-a80d402469ea | IMOS SOOP Underway Data from AIMS Vessel RV Cape Ferguson Trip 6321 From 30 Oct 2015 To 02 Nov 2015 | 'Ships of Opportunity' (SOOP) is a facility of the Australian 'Integrated Marine Observing System' (IMOS) project. This data set was collected by the SOOP sub-facility 'Sensors on Tropical Research Vessels' aboard the RV Cape Ferguson Trip 6321. | [{'concepts': [{'id': 'Practical salinity of the water body', 'url': 'http://vocab.nerc.ac.uk/collection/P01/current/PSLTZZ01'}, {'id': 'Temperature of the water body', 'url': 'http://vocab.nerc.ac.uk/collection/P01/current/TEMPPR01'}, {'id': 'Fluorescence of the water body', 'url': 'http://vocab.nerc.ac.uk/collection/P01/current/FLUOZZZZ'}, {'id': 'Turbidity of the water body', 'url': 'http://vocab.nerc.ac.uk/collection/P01/current/TURBXXXX'}], 'scheme': 'theme', 'description': '', 'title': 'AODN Discovery Parameter Vocabulary'}]

#### Acceptance Criteria
An Excel file containing the predicted keyword suggestions for these uncategorised metadata records.

### 2. Parameter Clustering

@@ -66,7 +66,7 @@ A prepared filtered dataset is structured as follows:
### 3. Searching Suggestions - Key Phrase Extraction
In the new AODN Data Discovery Portal, the search suggestions are derived from an algorithm that extracts the most common terms appearing in the title and abstract and stores them in Elasticsearch (field `_source.search_suggestions.abstract_phrases`). Some likely "non-words" are stripped out, but there are still many unhelpful suggestions.

We aim to develop a machine learning model to extract phrases that are more meaningful and targeted, so that they can be used as search suggestions.


## Datasets
@@ -89,7 +89,7 @@ To convert the descriptions into model-readable data, we use the [bert-base-unca
For each description, BERT produces an embedding of shape (768,), which is a 768-dimensional vector representing the semantic meaning of the entire description based on the [CLS] token.

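As a rough sketch of how such a [CLS] embedding can be obtained with the Hugging Face `transformers` library (illustrative only; the repository's own preprocessing code may differ in detail):

```python
# Sketch: encode one description and take the [CLS] vector as its embedding.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

description = "Underway sea surface temperature and salinity collected aboard the RV Cape Ferguson."
inputs = tokenizer(description, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    outputs = model(**inputs)

# The first token of the last hidden state is [CLS]; squeeze to shape (768,).
cls_embedding = outputs.last_hidden_state[:, 0, :].squeeze(0)
print(cls_embedding.shape)  # torch.Size([768])
```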
#### Justifying the Selection of Classification Model
Task 1 is identified as a **Multi-Label Classification** task. That is, given an uncategorised item, the item should be categorised with multiple labels.

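For illustration, a minimal sketch of how multi-label targets can be encoded as a binary matrix with one column per label (the label lists below are made up, not taken from the catalogue):

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Each record carries a list of vocabulary concept labels (illustrative values).
record_labels = [
    ["Temperature of the water body", "Practical salinity of the water body"],
    ["Turbidity of the water body"],
]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(record_labels)  # shape (n_records, n_labels), entries 0/1
print(mlb.classes_)
print(Y)
```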
#### Parameter Settings
Split into train and test sets: `test_size=0.2, random_state=42`.
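A sketch of applying that split with scikit-learn; `X` and `Y` below are dummy stand-ins for the embedding matrix and binary label matrix:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy stand-ins: 100 records, 768-dim embeddings, 5 candidate labels.
X = np.random.rand(100, 768)
Y = np.random.randint(0, 2, size=(100, 5))

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)  # (80, 768) (20, 768)
```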
1 change: 1 addition & 0 deletions data_discovery_ai/main.py
@@ -23,6 +23,7 @@ async def api_key_auth(x_api_key: str = Security(api_key_header)):
detail="Invalid API Key",
)


@router.get("/hello", dependencies=[Depends(api_key_auth)])
async def hello():
return {"content": "Hello World!"}
128 changes: 86 additions & 42 deletions data_discovery_ai/model/ModelEntity.py
@@ -5,25 +5,48 @@
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
from tensorflow.keras.metrics import AUC
from sklearn.metrics import (
    accuracy_score,
    hamming_loss,
    precision_score,
    recall_score,
    f1_score,
    jaccard_score,
)
from datetime import datetime

class BaseModel:
    """
    Base Model for multi-label classification tasks: keywords, parameters, organisation
    """

    def __init__(self, model=None):
        self.model = model

    def compile_model(self, optimizer, loss, metrics):
        if self.model:
            self.model.compile(optimizer=optimizer, loss=loss, metrics=metrics)

    def fit_model(
        self,
        X_train,
        Y_train,
        epochs,
        batch_size,
        validation_split,
        callbacks,
        class_weight=None,
    ):
        if self.model:
            history = self.model.fit(
                X_train,
                Y_train,
                epochs=epochs,
                batch_size=batch_size,
                validation_split=validation_split,
                callbacks=callbacks,
                class_weight=class_weight,
            )
            return history

@@ -44,53 +67,74 @@ def __init__(self, dim, n_labels):
        self.build_model()

    def build_model(self):
        self.model = Sequential(
            [
                Input(shape=(self.dim,)),
                Dense(128, activation="relu"),
                Dropout(0.3),
                Dense(64, activation="relu"),
                Dropout(0.3),
                Dense(self.n_labels, activation="sigmoid"),
            ]
        )

    def train(
        self,
        X_train,
        Y_train,
        X_test,
        Y_test,
        class_weight=None,
        epochs=100,
        batch_size=32,
    ):
        self.compile_model(
            optimizer=Adam(learning_rate=1e-3),
            loss="binary_crossentropy",
            metrics=["accuracy", "precision", "recall", AUC()],
        )

        early_stopping = EarlyStopping(
            monitor="val_loss", patience=5, restore_best_weights=True
        )
        reduce_lr = ReduceLROnPlateau(monitor="val_loss", patience=5, min_lr=1e-6)

        history = self.fit_model(
            X_train,
            Y_train,
            epochs=epochs,
            batch_size=batch_size,
            validation_split=0.1,
            callbacks=[early_stopping, reduce_lr],
            class_weight=class_weight,
        )

        current_time = datetime.now().strftime("%Y%m%d%H%M%S")
        filepath = f"./output/saved/{current_time}-trained-keyword-epoch{epochs}-batch{batch_size}.keras"

        self.save_model(filepath)
        return history

    @staticmethod
    def evaluation(Y_test, predictions):
        accuracy = accuracy_score(Y_test, predictions)
        hammingloss = hamming_loss(Y_test, predictions)
        precision = precision_score(Y_test, predictions, average="micro")
        recall = recall_score(Y_test, predictions, average="micro")
        f1 = f1_score(Y_test, predictions, average="micro")
        jaccard = jaccard_score(Y_test, predictions, average="samples")

        return {
            "accuracy": accuracy,
            "hammingloss": hammingloss,
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "Jaccard Index": jaccard,
        }

    def predict_and_save(self, ds, confidence, labels):
        X = np.array(ds["embedding"].tolist())
        predictions = self.model.predict(X)
        predicted_labels = (predictions > confidence).astype(int)

@@ -102,12 +146,12 @@ def predict_and_save(self, ds, confidence, labels):
            if len(keywords) == 0:
                predicted_keywords.append(None)
            else:
                predicted_keywords.append(" | ".join(keywords))

        ds["keywords"] = predicted_keywords
        ds.drop(columns=["embedding"], inplace=True)

        current_time = datetime.now().strftime("%Y%m%d%H%M%S")
        filepath = f"./output/saved/{current_time}.csv"
        ds.to_csv(filepath)
        return ds