
Commit

pin exact version for dependencies
vietnguyengit committed Oct 23, 2024
1 parent 5c8f4be commit de35d20
Showing 23 changed files with 5,161 additions and 4,947 deletions.
3 changes: 3 additions & 0 deletions .gitattributes
@@ -0,0 +1,3 @@
# Override jupyter in Github language stats for more accurate estimate of repo code languages
# reference: https://github.com/github/linguist/blob/master/docs/overrides.md#generated-code
*.ipynb linguist-generated
16 changes: 8 additions & 8 deletions data_discovery_ai/README_a.md
@@ -3,19 +3,19 @@
### 1. Keywords Classification
#### Problem Description

The new AODN Data Discovery portal is underpinned by a Geonetwork catalogue of metadata records that brings together well-curated, IMOS-managed records as well as records from external organisations and other contributors.

IMOS records and any records that exist in the current AODN portal use keywords and vocabularies that are tightly controlled in order for the current portal facets to operate. These are AODN vocabularies as defined in the ARDC repositories (https://vocabs.ardc.edu.au/).

Many other organisations use these AODN vocabularies and other well-known vocabularies (e.g. GCMD); however, there are many records in the catalogue that either use no keywords at all or use keywords that are not from controlled vocabularies.

The new AODN Data Discovery portal needs to filter metadata records based on a fixed set of “keywords” from the AODN vocabularies. The most important are:

- 'AODN Organisation Vocabulary',

- 'AODN Instrument Vocabulary',

- 'AODN Discovery Parameter Vocabulary'

There are many metadata records that have no keywords, or whose keywords do not come from a well-known vocabulary. Given the mapping rules derived from metadata records that do have AODN vocabularies, we aim to develop a machine learning model to predict the AODN keywords for these uncategorised datasets, in order to provide suggestions that can be used by the Data Discovery portal or other applications.

@@ -37,7 +37,7 @@ We also prepared a dataset that contains title and description information, whic
0832b98c-602e-4902-8438-a80d402469ea | IMOS SOOP Underway Data from AIMS Vessel RV Cape Ferguson Trip 6321 From 30 Oct 2015 To 02 Nov 2015 | 'Ships of Opportunity' (SOOP) is a facility of the Australian 'Integrated Marine Observing System' (IMOS) project. This data set was collected by the SOOP sub-facility 'Sensors on Tropical Research Vessels' aboard the RV Cape Ferguson Trip 6321. | [{'concepts': [{'id': 'Practical salinity of the water body', 'url': 'http://vocab.nerc.ac.uk/collection/P01/current/PSLTZZ01'}, {'id': 'Temperature of the water body', 'url': 'http://vocab.nerc.ac.uk/collection/P01/current/TEMPPR01'}, {'id': 'Fluorescence of the water body', 'url': 'http://vocab.nerc.ac.uk/collection/P01/current/FLUOZZZZ'}, {'id': 'Turbidity of the water body', 'url': 'http://vocab.nerc.ac.uk/collection/P01/current/TURBXXXX'}], 'scheme': 'theme', 'description': '', 'title': 'AODN Discovery Parameter Vocabulary'}]

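For context, the `keywords` column above holds a list of vocabulary groups, each with its own `concepts`. A minimal sketch (ours, not code from this repository) of flattening one record's keywords into per-vocabulary label lists:

```python
# Sketch only: the record below is a trimmed copy of the sample row's "keywords"
# value; field names follow that structure.
record_keywords = [
    {
        "concepts": [
            {
                "id": "Practical salinity of the water body",
                "url": "http://vocab.nerc.ac.uk/collection/P01/current/PSLTZZ01",
            },
            {
                "id": "Temperature of the water body",
                "url": "http://vocab.nerc.ac.uk/collection/P01/current/TEMPPR01",
            },
        ],
        "scheme": "theme",
        "description": "",
        "title": "AODN Discovery Parameter Vocabulary",
    }
]

# Group concept labels by the vocabulary they come from.
labels_by_vocab = {}
for group in record_keywords:
    for concept in group.get("concepts", []):
        labels_by_vocab.setdefault(group.get("title", ""), []).append(concept["id"])

print(labels_by_vocab)
# {'AODN Discovery Parameter Vocabulary': ['Practical salinity of the water body',
#   'Temperature of the water body']}
```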
#### Acceptance Criteria
An Excel file containing the predicted keyword suggestions for these uncategorised metadata records.

### 2. Parameter Clustering

@@ -60,12 +60,12 @@ A prepared filtered dataset is structured as follows:
### 3. Searching Suggestions - Key Phrase Extraction
In the new AODN Data Discovery Portal, the search suggestions are derived from an algorithm that extracts the most common terms appearing in the title and abstract and stores them in Elasticsearch (field `_source.search_suggestions.abstract_phrases`). Some likely "non-words" are stripped out, but there are still many unhelpful suggestions.

We aim to develop a machine learning model to extract phrases that are more meaningful and targeted, so that they can be used as search suggestions.

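One candidate direction, sketched here purely for illustration (this is not the approach adopted in this repository), is embedding-based key phrase extraction, e.g. with KeyBERT, which ranks candidate n-grams from a document by their semantic similarity to the whole text:

```python
# Illustrative sketch, assuming the optional `keybert` package is installed.
from keybert import KeyBERT

doc = (
    "This data set was collected by the SOOP sub-facility "
    "'Sensors on Tropical Research Vessels' aboard the RV Cape Ferguson."
)

kw_model = KeyBERT()  # defaults to a small sentence-transformers model
phrases = kw_model.extract_keywords(
    doc, keyphrase_ngram_range=(1, 3), stop_words="english", top_n=5
)
print(phrases)  # list of (phrase, similarity score) pairs
```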
## Research Method
1. understand metadata
2. problem description
3. prepare datasets
4. research on methods
5. experiments
6. evaluation
16 changes: 8 additions & 8 deletions data_discovery_ai/README_b.md
@@ -5,19 +5,19 @@
### 1. Keywords Classification
#### Problem Description

The new AODN Data Discovery portal is underpinned by a Geonetwork catalogue of metadata records that brings together well-curated, IMOS-managed records as well as records from external organisations and other contributors.

IMOS records and any records that exist in the current AODN portal use keywords and vocabularies that are tightly controlled in order for the current portal facets to operate. These are AODN vocabularies as defined in the ARDC repositories (https://vocabs.ardc.edu.au/).

Many other organisations use these AODN vocabularies and other well-known vocabularies (e.g. GCMD); however, there are many records in the catalogue that either use no keywords at all or use keywords that are not from controlled vocabularies.

The new AODN Data Discovery portal needs to filter metadata records based on a fixed set of “keywords” from the AODN vocabularies. The most important are:

- 'AODN Organisation Vocabulary',

- 'AODN Instrument Vocabulary',

- 'AODN Discovery Parameter Vocabulary'

- 'AODN Parameter Category Vocabulary'

@@ -43,7 +43,7 @@ We also prepared a dataset that contains title and description information, whic
0832b98c-602e-4902-8438-a80d402469ea | IMOS SOOP Underway Data from AIMS Vessel RV Cape Ferguson Trip 6321 From 30 Oct 2015 To 02 Nov 2015 | 'Ships of Opportunity' (SOOP) is a facility of the Australian 'Integrated Marine Observing System' (IMOS) project. This data set was collected by the SOOP sub-facility 'Sensors on Tropical Research Vessels' aboard the RV Cape Ferguson Trip 6321. | [{'concepts': [{'id': 'Practical salinity of the water body', 'url': 'http://vocab.nerc.ac.uk/collection/P01/current/PSLTZZ01'}, {'id': 'Temperature of the water body', 'url': 'http://vocab.nerc.ac.uk/collection/P01/current/TEMPPR01'}, {'id': 'Fluorescence of the water body', 'url': 'http://vocab.nerc.ac.uk/collection/P01/current/FLUOZZZZ'}, {'id': 'Turbidity of the water body', 'url': 'http://vocab.nerc.ac.uk/collection/P01/current/TURBXXXX'}], 'scheme': 'theme', 'description': '', 'title': 'AODN Discovery Parameter Vocabulary'}]

#### Acceptance Criteria
An Excel file containing the predicted keyword suggestions for these uncategorised metadata records.

### 2. Parameter Clustering

@@ -66,7 +66,7 @@ A prepared filtered dataset is structured as follows:
### 3. Searching Suggestions - Key Phrase Extraction
In the new AODN Data Discovery Portal, the search suggestions are derived from an algorithm that extracts the most common terms appearing in the title and abstract and stores them in Elasticsearch (field `_source.search_suggestions.abstract_phrases`). Some likely "non-words" are stripped out, but there are still many unhelpful suggestions.

We aim to develop a machine learning model to extract phrases that are more meaningful and targeted, so that they can be used as search suggestions.


## Datasets
@@ -89,7 +89,7 @@ To convert the descriptions into model-readable data, we use the [bert-base-unca
For each description, BERT produces an embedding of shape (768,), which is a 768-dimensional vector representing the semantic meaning of the entire description based on the [CLS] token.

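As a rough sketch of how such a [CLS] embedding can be obtained with the Hugging Face `transformers` library (illustrative only; the repository's own preprocessing code may differ in detail):

```python
# Sketch: encode one description and take the [CLS] vector as its embedding.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

description = "Underway sea surface temperature and salinity collected aboard the RV Cape Ferguson."
inputs = tokenizer(description, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    outputs = model(**inputs)

# The first token of the last hidden state is [CLS]; squeeze to shape (768,).
cls_embedding = outputs.last_hidden_state[:, 0, :].squeeze(0)
print(cls_embedding.shape)  # torch.Size([768])
```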
#### Justifying the Selection of Classification Model
Task 1 is identified as a **Multi-Label Classification** task. That is, given an uncategorised item, the item should be categorised with multiple labels.

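For illustration, a minimal sketch of how multi-label targets can be encoded as a binary matrix with one column per label (the label lists below are made up, not taken from the catalogue):

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Each record carries a list of vocabulary concept labels (illustrative values).
record_labels = [
    ["Temperature of the water body", "Practical salinity of the water body"],
    ["Turbidity of the water body"],
]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(record_labels)  # shape (n_records, n_labels), entries 0/1
print(mlb.classes_)
print(Y)
```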
#### Parameter Settings
Split into train and test sets: `test_size=0.2, random_state=42`.
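A sketch of applying that split with scikit-learn; `X` and `Y` below are dummy stand-ins for the embedding matrix and binary label matrix:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy stand-ins: 100 records, 768-dim embeddings, 5 candidate labels.
X = np.random.rand(100, 768)
Y = np.random.randint(0, 2, size=(100, 5))

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)  # (80, 768) (20, 768)
```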
1 change: 1 addition & 0 deletions data_discovery_ai/main.py
@@ -23,6 +23,7 @@ async def api_key_auth(x_api_key: str = Security(api_key_header)):
detail="Invalid API Key",
)


@router.get("/hello", dependencies=[Depends(api_key_auth)])
async def hello():
return {"content": "Hello World!"}
128 changes: 86 additions & 42 deletions data_discovery_ai/model/ModelEntity.py
@@ -5,25 +5,48 @@
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
from tensorflow.keras.metrics import AUC
from sklearn.metrics import (
    accuracy_score,
    hamming_loss,
    precision_score,
    recall_score,
    f1_score,
    jaccard_score,
)
from datetime import datetime

class BaseModel:
    """
    Base Model for multi-label classification tasks: keywords, parameters, organisation
    """

    def __init__(self, model=None):
        self.model = model

    def compile_model(self, optimizer, loss, metrics):
        if self.model:
            self.model.compile(optimizer=optimizer, loss=loss, metrics=metrics)

    def fit_model(
        self,
        X_train,
        Y_train,
        epochs,
        batch_size,
        validation_split,
        callbacks,
        class_weight=None,
    ):
        if self.model:
            history = self.model.fit(
                X_train,
                Y_train,
                epochs=epochs,
                batch_size=batch_size,
                validation_split=validation_split,
                callbacks=callbacks,
                class_weight=class_weight,
            )
            return history

@@ -44,53 +67,74 @@ def __init__(self, dim, n_labels):
        self.build_model()

    def build_model(self):
        self.model = Sequential(
            [
                Input(shape=(self.dim,)),
                Dense(128, activation="relu"),
                Dropout(0.3),
                Dense(64, activation="relu"),
                Dropout(0.3),
                Dense(self.n_labels, activation="sigmoid"),
            ]
        )

    def train(
        self,
        X_train,
        Y_train,
        X_test,
        Y_test,
        class_weight=None,
        epochs=100,
        batch_size=32,
    ):
        self.compile_model(
            optimizer=Adam(learning_rate=1e-3),
            loss="binary_crossentropy",
            metrics=["accuracy", "precision", "recall", AUC()],
        )

        early_stopping = EarlyStopping(
            monitor="val_loss", patience=5, restore_best_weights=True
        )
        reduce_lr = ReduceLROnPlateau(monitor="val_loss", patience=5, min_lr=1e-6)

        history = self.fit_model(
            X_train,
            Y_train,
            epochs=epochs,
            batch_size=batch_size,
            validation_split=0.1,
            callbacks=[early_stopping, reduce_lr],
            class_weight=class_weight,
        )

        current_time = datetime.now().strftime("%Y%m%d%H%M%S")
        filepath = f"./output/saved/{current_time}-trained-keyword-epoch{epochs}-batch{batch_size}.keras"

        self.save_model(filepath)
        return history

    @staticmethod
    def evaluation(Y_test, predictions):
        accuracy = accuracy_score(Y_test, predictions)
        hammingloss = hamming_loss(Y_test, predictions)
        precision = precision_score(Y_test, predictions, average="micro")
        recall = recall_score(Y_test, predictions, average="micro")
        f1 = f1_score(Y_test, predictions, average="micro")
        jaccard = jaccard_score(Y_test, predictions, average="samples")

        return {
            "accuracy": accuracy,
            "hammingloss": hammingloss,
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "Jaccard Index": jaccard,
        }

    def predict_and_save(self, ds, confidence, labels):
        X = np.array(ds["embedding"].tolist())
        predictions = self.model.predict(X)
        predicted_labels = (predictions > confidence).astype(int)

@@ -102,12 +146,12 @@ def predict_and_save(self, ds, confidence, labels):
            if len(keywords) == 0:
                predicted_keywords.append(None)
            else:
                predicted_keywords.append(" | ".join(keywords))

        ds["keywords"] = predicted_keywords
        ds.drop(columns=["embedding"], inplace=True)

        current_time = datetime.now().strftime("%Y%m%d%H%M%S")
        filepath = f"./output/saved/{current_time}.csv"
        ds.to_csv(filepath)
        return ds