
Machine Learning

Module to prepare and run machine learning (ML) in either a supervised or unsupervised fashion.

Three classes are intended for end-user usage:

  1. ClassificationModel For building ML models with categorical target data (classification).

  2. RegressionModel For building ML models with continuous target data (regression).

  3. UnsupervisedModel For building ML models on datasets without labels. Support for this is currently limited.

All three classes inherit from an abstract base class called "_MachineLearnModel", which sets a basic outline for them.

Classes 1 and 2 additionally inherit from a parent class called "_SupervisedRunner", which abstracts as much of their shared behaviour as possible.
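
As a quick orientation, here is a minimal usage sketch. It assumes a pandas DataFrame already prepared with a "Target" column, and that the dataclass fields match the attribute names documented below; the file name is illustrative.

import pandas as pd
from key_interactions_finder.model_building import ClassificationModel

# Hypothetical input: numeric feature columns plus a categorical "Target" column.
df = pd.read_csv("features_with_target.csv")

model = ClassificationModel(dataset=df)  # defaults: CatBoost, 15% evaluation split
model.build_models(save_models=True)     # cross-validated grid search + final fit
reports = model.evaluate_models()        # {model_name: pd.DataFrame of metrics}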

ClassificationModel dataclass

Bases: _SupervisedRunner

Class to construct supervised machine learning models when the target class is categorical (aka classification).

Attributes

dataset : pd.DataFrame

Input dataset.

classes_to_use : list

Names of the classes to train the model on. Can be left empty if you want to use all the classes you currently have. Default = [] (use all classes in the Target column).

models_to_use : list

List of machine learning models/algorithms to use. Default = ["CatBoost"]

evaluation_split_ratio : float

Ratio of data that should be used to make the evaluation test set. The rest of the data will be used for the training/hyper-param tuning. Default = 0.15

scaling_method : str

How to scale the dataset prior to machine learning. Options are "min_max" (scikit-learn's MinMaxScaler) or "standard_scaling" (scikit-learn's StandardScaler). Default = "min_max"

out_dir : str

Directory path to store results files to. Default = ""

cross_validation_splits : int

Number of splits in the cross validation (the "k" in k-fold cross validation). Default = 5

cross_validation_repeats : int

Number of repeats of the k-fold cross validation to perform. Default = 3

search_approach : str

Defines how extensive the grid search protocol should be for the models. Options are: "none", "quick", "moderate", "exhaustive" or "custom". Default = "quick"

use_class_weights : bool

Whether to weight each class according to its number of observations (useful for an imbalanced dataset). The weights used are the inverse of the class distribution, so each class has an effective weight of 1 at the end; a sketch of this computation follows this attribute list. Default = False

all_model_params : dict

Nested dictionary of model parameters that can be read directly into scikit-learn's implementation of grid search CV.

cross_validation_approach : RepeatedStratifiedKFold

Instance of scikit-learn's RepeatedStratifiedKFold class for model building.

feat_names : np.ndarray

All feature names/labels.

ml_datasets : dict

Nested dictionary containing the training and testing data (both features and classes) needed to run the model building.

label_encoder : LabelEncoder

Instance of scikit-learn's label encoder to encode the target classes. Required for the XGBoost model.

ml_models : dict

Keys are the model name/method and values are the instance of the built model.
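
The balanced weighting described for use_class_weights mirrors scikit-learn's compute_class_weight, which the source code below also calls. A minimal sketch with illustrative labels:

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Imbalanced toy labels: six of class "A", two of class "B".
y = np.array(["A"] * 6 + ["B"] * 2)
classes = np.unique(y)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

# "balanced" assigns n_samples / (n_classes * class_count) per class,
# i.e. the inverse class frequency: {'A': 0.667, 'B': 2.0}
print(dict(zip(classes, weights)))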

Methods

describe_ml_planned() Prints a summary of what machine learning protocol has been selected.

build_models(save_models) Runs the machine learning and summarizes the results.

evaluate_models() Evaluates each ML model's performance on the validation data set and provides the user with a summary of the results.

generate_confusion_matrix() For each ml model used, determine the confusion matrix from the validation dataset.
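
Putting these methods together, a hypothetical end-to-end classification run might look like the following (the class labels and file name are illustrative):

import pandas as pd
from key_interactions_finder.model_building import ClassificationModel

df = pd.read_csv("features_with_target.csv")  # must contain a "Target" column

clf_model = ClassificationModel(
    dataset=df,
    classes_to_use=["bound", "unbound"],          # illustrative labels
    models_to_use=["CatBoost", "Random_Forest"],
    use_class_weights=True,
)
clf_model.build_models(save_models=True)
reports = clf_model.evaluate_models()             # {model_name: pd.DataFrame}
matrices = clf_model.generate_confusion_matrix()  # {model_name: np.ndarray}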

Source code in key_interactions_finder/model_building.py
@dataclass
class ClassificationModel(_SupervisedRunner):
    """
    Class to construct supervised machine learning models when the target class
    is categorical (aka classification).

    Attributes
    ----------

    dataset : pd.DataFrame
        Input dataset.

    classes_to_use : list
        Names of the classes to train the model on.
        Can be left empty if you want to use all the classes you currently have.
        Default = [] (use all classes in the Target column.)

    models_to_use : list
        List of machine learning models/algorithms to use.
        Default = ["CatBoost"]

    evaluation_split_ratio : float
        Ratio of data that should be used to make the evaluation test set.
        The rest of the data will be used for the training/hyper-param tuning.
        Default = 0.15

    scaling_method : str
        How to scale the dataset prior to machine learning.
        Options are "min_max" (scikit-learn's MinMaxScaler)
        or "standard_scaling" (scikit-learn's StandardScaler).
        Default = "min_max"

    out_dir : str
        Directory path to store results files to.
        Default = ""

    cross_validation_splits : int
        Number of splits in the cross validation, (the "k" in k-fold cross validation).
        Default = 5

    cross_validation_repeats : int
        Number of repeats for the k-fold cross validation to perform.
        Default = 3

    search_approach : str
        Define how extensive the grid search protocol should be for the models.
        Options are: "none", "quick", "moderate", "exhaustive" or "custom".
        Default = "quick"

    use_class_weights : bool
        Choose to weight each class according to the number of observations.
        (This can be used in the case of an imbalanced dataset.)
        The weights that will be used are the inverse of the class distribution so
        each class has effective weight 1 at the end.
        Default = False

    all_model_params : dict
        Nested dictionary of model parameters that can be read directly into
        scikit-learn's implementation of grid search CV.

    cross_validation_approach : RepeatedStratifiedKFold
        Instance of scikit-learn's RepeatedStratifiedKFold class for model building.

    feat_names : np.ndarray
        All feature names/labels.

    ml_datasets : dict
        Nested dictionary containing the training and testing data (both features and
        classes) needed to run the model building.

    label_encoder : LabelEncoder
        Instance of scikit-learn's label encoder to encode the target classes.
        Required for the XGBoost model.

    ml_models : dict
        Keys are the model name/method and values are the instance of the
        built model.

    Methods
    -------

    describe_ml_planned()
        Prints a summary of what machine learning protocol has been selected.

    build_models(save_models)
        Runs the machine learning and summarizes the results.

    evaluate_models()
        Evaluates each ML model's performance on the validation data set
        and provides the user with a summary of the results.

    generate_confusion_matrix()
        For each ml model used, determine the confusion matrix from the validation dataset.
    """
    # Only non-shared parameters between classification and regression.
    label_encoder: LabelEncoder = field(default_factory=LabelEncoder)
    classes_to_use: list = field(default_factory=list)
    use_class_weights: bool = False

    def __post_init__(self):
        """Setup the provided dataset and params for ML."""
        self.out_dir = _prep_out_dir(self.out_dir)

        # Not populated until the build_models method is called later.
        self.all_model_params = {}
        self.ml_models = {}

        # Filter to only include desired classes.
        if len(self.classes_to_use) != 0:
            self.dataset = self.dataset[self.dataset["Target"].isin(
                self.classes_to_use)]

        # Train-test splitting and scaling.
        df_features = self.dataset.drop("Target", axis=1)
        x_array = df_features.to_numpy()
        self.feat_names = df_features.columns.values

        # Label encode the target - for XGBoost compatibility.
        self.label_encoder = LabelEncoder()
        y_classes = self.label_encoder.fit_transform(
            self.dataset.Target.values)

        x_array_train, x_array_eval, y_train, y_eval = train_test_split(
            x_array, y_classes, test_size=self.evaluation_split_ratio)

        train_data_scaled, eval_data_scaled = self._supervised_scale_features(
            scaling_method=self.scaling_method,
            x_array_train=x_array_train,
            x_array_eval=x_array_eval
        )

        self.ml_datasets = {}
        self.ml_datasets["train_data_scaled"] = train_data_scaled
        self.ml_datasets["eval_data_scaled"] = eval_data_scaled
        self.ml_datasets["y_train"] = y_train
        self.ml_datasets["y_eval"] = y_eval

        if self.use_class_weights:

            # For CatBoostClassifier
            classes = np.unique(y_train)
            weights = compute_class_weight(
                class_weight='balanced', classes=classes, y=y_train)
            class_weights = dict(zip(classes, weights))

            # For XGBClassifier, scale_pos_weight is the number of examples in
            # the majority class divided by the number in the minority class.
            majority = self.dataset["Target"].value_counts().iloc[0]
            minority = self.dataset["Target"].value_counts().iloc[1]
            scaled_weight = round((majority / minority), 2)

            self.available_models = {
                "CatBoost": {"model": CatBoostClassifier(
                    class_weights=class_weights, logging_level="Silent"), "params": {}},
                "XGBoost": {"model": XGBClassifier(eval_metric="logloss", scale_pos_weight=scaled_weight
                                                   ), "params": {}},
                "Random_Forest": {"model": RandomForestClassifier(
                    class_weight="balanced"), "params": {}}}

            print("Class weights have now been added to your dataset.")
            print(
                f"The class imbalance in your dataset was: 1 : {1/scaled_weight:.2f}")

        else:
            self.available_models = {
                "CatBoost": {"model": CatBoostClassifier(logging_level="Silent"), "params": {}},
                "XGBoost": {"model": XGBClassifier(eval_metric="logloss"), "params": {}},
                "Random_Forest": {"model": RandomForestClassifier(), "params": {}},
            }

        # Define ML Pipeline:
        self.cross_validation_approach = RepeatedStratifiedKFold(
            n_splits=self.cross_validation_splits, n_repeats=self.cross_validation_repeats)

        self.all_model_params = self._assign_model_params(
            models_to_use=self.models_to_use,
            available_models=self.available_models,
            search_approach=self.search_approach,
            is_classification=True
        )

        self.describe_ml_planned()

    def evaluate_models(self) -> dict:
        """
        Evaluates each ML model's performance on the validation data set
        and provides the user with a summary of the results.

        Returns
        ----------

        dict
            A dictionary with keys being the model names and values being a pd.DataFrame
            with several scoring metrics output for each model used.
        """
        # https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html
        y_eval_decoded = self.label_encoder.inverse_transform(
            self.ml_datasets["y_eval"])
        class_labels = (np.unique(y_eval_decoded)).tolist()

        all_classification_reports = {}
        for model_name, clf in self.ml_models.items():
            yhat = clf.predict(self.ml_datasets["eval_data_scaled"])
            yhat_decoded = self.label_encoder.inverse_transform(yhat)

            report = metrics.classification_report(
                y_eval_decoded, yhat_decoded, output_dict=True)

            # Reformat the report so it is DataFrame friendly.
            new_report = {}
            for label in class_labels:
                new_report.update({label: report[label]})

            accuracy_row = {
                "precision": "N/A", "recall": "N/A",
                "f1-score": report["accuracy"],
                "support": report["weighted avg"]["support"]
            }
            new_report.update({"accuracy": accuracy_row})

            new_report.update({"macro avg": report["macro avg"]})
            new_report.update({"weighted avg": report["weighted avg"]})

            df_classification_report = pd.DataFrame(new_report).transpose()

            all_classification_reports.update(
                {model_name: df_classification_report})

        print("Returning classification reports for each model inside a single dictionary")
        return all_classification_reports

    def generate_confusion_matrix(self) -> dict:
        """
        For each ml model used, determine the confusion matrix from the validation data set.
        Returns a dictionary with model names as keys and the corresponding matrix as the values.

        Returns
        ----------

        dict
            Keys are strings of each model name. Values are the confusion matrix
            of said model as a numpy.ndarray.
        """
        confusion_matrices = {}
        for model_name, clf in self.ml_models.items():
            yhat = clf.predict(self.ml_datasets["eval_data_scaled"])
            y_true = self.ml_datasets["y_eval"]

            yhat_decoded = self.label_encoder.inverse_transform(yhat)
            y_true_decoded = self.label_encoder.inverse_transform(y_true)

            confuse_matrix = metrics.confusion_matrix(
                y_true_decoded, yhat_decoded)

            confusion_matrices.update({model_name: confuse_matrix})

        return confusion_matrices

__post_init__()

Setup the provided dataset and params for ML.

Source code in key_interactions_finder/model_building.py
def __post_init__(self):
    """Setup the provided dataset and params for ML."""
    self.out_dir = _prep_out_dir(self.out_dir)

    # Not populated until the build_models method is called later.
    self.all_model_params = {}
    self.ml_models = {}

    # Filter to only include desired classes.
    if len(self.classes_to_use) != 0:
        self.dataset = self.dataset[self.dataset["Target"].isin(
            self.classes_to_use)]

    # Train-test splitting and scaling.
    df_features = self.dataset.drop("Target", axis=1)
    x_array = df_features.to_numpy()
    self.feat_names = df_features.columns.values

    # Label encode the target - for XGBoost compatibility.
    self.label_encoder = LabelEncoder()
    y_classes = self.label_encoder.fit_transform(
        self.dataset.Target.values)

    x_array_train, x_array_eval, y_train, y_eval = train_test_split(
        x_array, y_classes, test_size=self.evaluation_split_ratio)

    train_data_scaled, eval_data_scaled = self._supervised_scale_features(
        scaling_method=self.scaling_method,
        x_array_train=x_array_train,
        x_array_eval=x_array_eval
    )

    self.ml_datasets = {}
    self.ml_datasets["train_data_scaled"] = train_data_scaled
    self.ml_datasets["eval_data_scaled"] = eval_data_scaled
    self.ml_datasets["y_train"] = y_train
    self.ml_datasets["y_eval"] = y_eval

    if self.use_class_weights:

        # For CatBoostClassifier
        classes = np.unique(y_train)
        weights = compute_class_weight(
            class_weight='balanced', classes=classes, y=y_train)
        class_weights = dict(zip(classes, weights))

        # For XGBClassifier, scale_pos_weight is the number of examples in
        # the majority class divided by the number in the minority class.
        majority = self.dataset["Target"].value_counts().iloc[0]
        minority = self.dataset["Target"].value_counts().iloc[1]
        scaled_weight = round((majority / minority), 2)

        self.available_models = {
            "CatBoost": {"model": CatBoostClassifier(
                class_weights=class_weights, logging_level="Silent"), "params": {}},
            "XGBoost": {"model": XGBClassifier(eval_metric="logloss", scale_pos_weight=scaled_weight
                                               ), "params": {}},
            "Random_Forest": {"model": RandomForestClassifier(
                class_weight="balanced"), "params": {}}}

        print("Class weights have now been added to your dataset.")
        print(
            f"The class imbalance in your dataset was: 1 : {1/scaled_weight:.2f}")

    else:
        self.available_models = {
            "CatBoost": {"model": CatBoostClassifier(logging_level="Silent"), "params": {}},
            "XGBoost": {"model": XGBClassifier(eval_metric="logloss"), "params": {}},
            "Random_Forest": {"model": RandomForestClassifier(), "params": {}},
        }

    # Define ML Pipeline:
    self.cross_validation_approach = RepeatedStratifiedKFold(
        n_splits=self.cross_validation_splits, n_repeats=self.cross_validation_repeats)

    self.all_model_params = self._assign_model_params(
        models_to_use=self.models_to_use,
        available_models=self.available_models,
        search_approach=self.search_approach,
        is_classification=True
    )

    self.describe_ml_planned()

evaluate_models()

Evaluates each ML model's performance on the validation data set and provides the user with a summary of the results.

Returns

dict

A dictionary with keys being the model names and values being a pd.DataFrame with several scoring metrics output for each model used.
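
Continuing the sketch above, the returned dictionary might be consumed like this:

reports = clf_model.evaluate_models()
for model_name, df_report in reports.items():
    print(f"--- {model_name} ---")
    print(df_report.round(3))  # per-class precision/recall/f1-score plus support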

Source code in key_interactions_finder/model_building.py
def evaluate_models(self) -> dict:
    """
    Evaluates each ML model's performance on the validation data set
    and provides the user with a summary of the results.

    Returns
    ----------

    dict
        A dictionary with keys being the model names and values being a pd.DataFrame
        with several scoring metrics output for each model used.
    """
    # https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html
    y_eval_decoded = self.label_encoder.inverse_transform(
        self.ml_datasets["y_eval"])
    class_labels = (np.unique(y_eval_decoded)).tolist()

    all_classification_reports = {}
    for model_name, clf in self.ml_models.items():
        yhat = clf.predict(self.ml_datasets["eval_data_scaled"])
        yhat_decoded = self.label_encoder.inverse_transform(yhat)

        report = metrics.classification_report(
            y_eval_decoded, yhat_decoded, output_dict=True)

        # Reformat the report so it is DataFrame friendly.
        new_report = {}
        for label in class_labels:
            new_report.update({label: report[label]})

        accuracy_row = {
            "precision": "N/A", "recall": "N/A",
            "f1-score": report["accuracy"],
            "support": report["weighted avg"]["support"]
        }
        new_report.update({"accuracy": accuracy_row})

        new_report.update({"macro avg": report["macro avg"]})
        new_report.update({"weighted avg": report["weighted avg"]})

        df_classification_report = pd.DataFrame(new_report).transpose()

        all_classification_reports.update(
            {model_name: df_classification_report})

    print("Returning classification reports for each model inside a single dictionary")
    return all_classification_reports

generate_confusion_matrix()

For each ml model used, determine the confusion matrix from the validation data set. Returns a dictionary with model names as keys and the corresponding matrix as the values.

Returns

dict

Keys are strings of each model name. Values are the confusion matrix of said model as a numpy.ndarray.
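
A returned matrix can then be plotted, for instance with scikit-learn's ConfusionMatrixDisplay (a sketch; matplotlib is required, and the "CatBoost" key assumes that model was built):

import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

matrices = clf_model.generate_confusion_matrix()
# Rows are true labels, columns are predicted labels, in sorted label order.
disp = ConfusionMatrixDisplay(confusion_matrix=matrices["CatBoost"])
disp.plot()
plt.show()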

Source code in key_interactions_finder/model_building.py
def generate_confusion_matrix(self) -> dict:
    """
    For each ml model used, determine the confusion matrix from the validation data set.
    Returns a dictionary with model names as keys and the corresponding matrix as the values.

    Returns
    ----------

    dict
        Keys are strings of each model name. Values are the confusion matrix
        of said model as a numpy.ndarray.
    """
    confusion_matrices = {}
    for model_name, clf in self.ml_models.items():
        yhat = clf.predict(self.ml_datasets["eval_data_scaled"])
        y_true = self.ml_datasets["y_eval"]

        yhat_decoded = self.label_encoder.inverse_transform(yhat)
        y_true_decoded = self.label_encoder.inverse_transform(y_true)

        confuse_matrix = metrics.confusion_matrix(
            y_true_decoded, yhat_decoded)

        confusion_matrices.update({model_name: confuse_matrix})

    return confusion_matrices

RegressionModel dataclass

Bases: _SupervisedRunner

Class to construct supervised machine learning models when the target class is continuous (aka regression).

Attributes

dataset : pd.DataFrame

Input dataset.

models_to_use : list

List of machine learning models/algorithms to use. Default = ["CatBoost"]

evaluation_split_ratio : float

Ratio of data that should be used to make the evaluation test set. The rest of the data will be used for the training/hyper-param tuning. Default = 0.15

scaling_method : str

How to scale the dataset prior to machine learning. Options are "min_max" (scikit-learn's MinMaxScaler) or "standard_scaling" (scikit-learn's StandardScaler). Default = "min_max"

out_dir : str

Directory path to store results files to. Default = ""

cross_validation_splits : int

Number of splits in the cross validation (the "k" in k-fold cross validation). Default = 5

cross_validation_repeats : int

Number of repeats of the k-fold cross validation to perform. Default = 3

search_approach : str

Defines how extensive the grid search protocol should be for the models. Options are: "none", "quick", "moderate", "exhaustive" or "custom". Default = "quick"

all_model_params : dict

Nested dictionary of model parameters that can be read directly into scikit-learn's implementation of grid search CV.

cross_validation_approach : RepeatedKFold

Instance of scikit-learn's RepeatedKFold class for model building.

feat_names : np.ndarray

All feature names/labels.

ml_datasets : dict

Nested dictionary containing the training and testing data (both features and classes) needed to run the model building.

ml_models : dict

Keys are the model name/method and values are the instance of the built model.

Methods

describe_ml_planned() Prints a summary of what machine learning protocol has been selected.

build_models(save_models) Runs the machine learning and summarizes the results.

evaluate_models() Evaluates each ML model's performance on the validation data set and provides the user with a summary of the results.
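
A hypothetical regression run, mirroring the classification workflow above (file name is illustrative; the "Target" column must be continuous):

import pandas as pd
from key_interactions_finder.model_building import RegressionModel

df = pd.read_csv("features_with_continuous_target.csv")

reg_model = RegressionModel(
    dataset=df,
    models_to_use=["CatBoost", "XGBoost"],
)
reg_model.build_models(save_models=True)
metrics_df = reg_model.evaluate_models()  # one row of metrics per model
print(metrics_df)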

Source code in key_interactions_finder/model_building.py
@dataclass
class RegressionModel(_SupervisedRunner):
    """
    Class to construct supervised machine learning models when the target class
    is continuous (aka regression).

    Attributes
    ----------

    dataset : pd.DataFrame
        Input dataset.

    models_to_use : list
        List of machine learning models/algorithms to use.
        Default = ["CatBoost"]

    evaluation_split_ratio : float
        Ratio of data that should be used to make the evaluation test set.
        The rest of the data will be used for the training/hyper-param tuning.
        Default = 0.15

    scaling_method : str
        How to scale the dataset prior to machine learning.
        Options are "min_max" (scikit-learn's MinMaxScaler)
        or "standard_scaling" (scikit-learn's StandardScaler).
        Default = "min_max"

    out_dir : str
        Directory path to store results files to.
        Default = ""

    cross_validation_splits : int
        Number of splits in the cross validation, (the "k" in k-fold cross validation).
        Default = 5

    cross_validation_repeats : int
        Number of repeats for the k-fold cross validation to perform.
        Default = 3

    search_approach : str
        Define how extensive the grid search protocol should be for the models.
        Options are: "none", "quick", "moderate", "exhaustive" or "custom".
        Default = "quick"

    all_model_params : dict
        Nested dictionary of model parameters that can be read directly into
        scikit-learn's implementation of grid search CV.

    cross_validation_approach : RepeatedKFold
        Instance of scikit-learn's RepeatedKFold class for model building.

    feat_names : np.ndarray
        All feature names/labels.

    ml_datasets : dict
        Nested dictionary containing the training and testing data (both features and
        classes) needed to run the model building.

    ml_models : dict
        Keys are the model name/method and values are the instance of the
        built model.

    Methods
    -------

    describe_ml_planned()
        Prints a summary of what machine learning protocol has been selected.

    build_models(save_models)
        Runs the machine learning and summarizes the results.

    evaluate_models()
        Evaluates each ML model's performance on the validation data set
        and provides the user with a summary of the results.
    """
    available_models = {
        "CatBoost": {"model": CatBoostRegressor(logging_level="Silent"), "params": {}},
        "XGBoost": {"model": XGBRegressor(objective="reg:squarederror"), "params": {}},
        "Random_Forest": {"model": RandomForestRegressor(), "params": {}},
    }

    # This is called at the end of the dataclass's initialization procedure.
    def __post_init__(self):
        """Setup the provided dataset and params for ML."""
        self.out_dir = _prep_out_dir(self.out_dir)

        # Not populated until a method is called later.
        self.ml_models = {}
        self.all_model_params = {}

        # Train-test splitting and scaling.
        df_features = self.dataset.drop("Target", axis=1)
        x_array = df_features.to_numpy()
        self.feat_names = df_features.columns.values
        y_classes = self.dataset["Target"]
        x_array_train, x_array_eval, y_train, y_eval = train_test_split(
            x_array, y_classes, test_size=self.evaluation_split_ratio)

        train_data_scaled, eval_data_scaled = self._supervised_scale_features(
            scaling_method=self.scaling_method,
            x_array_train=x_array_train,
            x_array_eval=x_array_eval
        )

        self.ml_datasets = {}
        self.ml_datasets["train_data_scaled"] = train_data_scaled
        self.ml_datasets["eval_data_scaled"] = eval_data_scaled
        self.ml_datasets["y_train"] = y_train
        self.ml_datasets["y_eval"] = y_eval

        # Define ML Pipeline:
        self.cross_validation_approach = RepeatedKFold(
            n_splits=self.cross_validation_splits, n_repeats=self.cross_validation_repeats)

        self.all_model_params = self._assign_model_params(
            models_to_use=self.models_to_use,
            available_models=self.available_models,
            search_approach=self.search_approach,
            is_classification=False
        )

        self.describe_ml_planned()

    def evaluate_models(self) -> pd.DataFrame:
        """
        Evaluates each ML model's performance on the validation data set
        and provides the user with a summary of the results.

        Returns
        ----------

        pd.DataFrame
            DataFrame with one row of regression metrics for each
            ML model generated.
        """
        all_regression_dfs = []
        for model_name, clf in self.ml_models.items():
            y_validation = self.ml_datasets["y_eval"]
            yhat = clf.predict(self.ml_datasets["eval_data_scaled"])

            regression_df = self._regression_metrics(
                model_name=model_name,
                y_true=y_validation,
                y_pred=yhat
            )

            all_regression_dfs.append(regression_df)

        return pd.concat(all_regression_dfs).reset_index(drop=True)

    @staticmethod
    def _regression_metrics(
            model_name: str, y_true: np.ndarray, y_pred: np.ndarray) -> pd.DataFrame:
        """
        Calculate several regression statistics for a given ml model.

        Adapted from:
        https://stackoverflow.com/questions/26319259/how-to-get-a-regression-summary-in-scikit-learn-like-r-does

        Parameters
        ----------

        model_name : str
            Name of the ML model.

        y_true : np.ndarray
            1D array of the actual values of the target label.

        y_pred: np.ndarray
            1D array of the predicted values of the target label.

        Returns
        ----------

        pd.DataFrame
            Contains various regression metrics for the provided model.
        """
        explained_variance = np.round(
            metrics.explained_variance_score(y_true, y_pred), 4)

        mean_absolute_error = np.round(
            metrics.mean_absolute_error(y_true, y_pred), 4)

        mse = np.round(
            metrics.mean_squared_error(y_true, y_pred), 4)

        rmse = np.round(
            np.sqrt(metrics.mean_squared_error(y_true, y_pred)), 4)

        r_squared = np.round(
            metrics.r2_score(y_true, y_pred), 4)

        try:
            mean_squared_log_error = np.round(
                metrics.mean_squared_log_error(y_true, y_pred), 4)

        except ValueError:
            print("""Mean Squared Log Error cannot be calculated as your target column contains
                  negative numbers. Continuing with the other metrics.""")
            mean_squared_log_error = "N/A"

        all_metrics = [[model_name, explained_variance, mean_absolute_error,
                        mse, rmse, mean_squared_log_error, r_squared]]

        column_labels = ["Model", "Explained Variance", "Mean Absolute Error",
                         "MSE", "RMSE", "Mean Squared Log Error", "r squared"]

        return pd.DataFrame(all_metrics, columns=column_labels)

__post_init__()

Setup the provided dataset and params for ML.

Source code in key_interactions_finder/model_building.py
def __post_init__(self):
    """Setup the provided dataset and params for ML."""
    self.out_dir = _prep_out_dir(self.out_dir)

    # Not populated until a method is called later.
    self.ml_models = {}
    self.all_model_params = {}

    # Train-test splitting and scaling.
    df_features = self.dataset.drop("Target", axis=1)
    x_array = df_features.to_numpy()
    self.feat_names = df_features.columns.values
    y_classes = self.dataset["Target"]
    x_array_train, x_array_eval, y_train, y_eval = train_test_split(
        x_array, y_classes, test_size=self.evaluation_split_ratio)

    train_data_scaled, eval_data_scaled = self._supervised_scale_features(
        scaling_method=self.scaling_method,
        x_array_train=x_array_train,
        x_array_eval=x_array_eval
    )

    self.ml_datasets = {}
    self.ml_datasets["train_data_scaled"] = train_data_scaled
    self.ml_datasets["eval_data_scaled"] = eval_data_scaled
    self.ml_datasets["y_train"] = y_train
    self.ml_datasets["y_eval"] = y_eval

    # Define ML Pipeline:
    self.cross_validation_approach = RepeatedKFold(
        n_splits=self.cross_validation_splits, n_repeats=self.cross_validation_repeats)

    self.all_model_params = self._assign_model_params(
        models_to_use=self.models_to_use,
        available_models=self.available_models,
        search_approach=self.search_approach,
        is_classification=False
    )

    self.describe_ml_planned()

evaluate_models()

Evaluates each ML model's performance on the validation data set and provides the user with a summary of the results.

Returns

pd.DataFrame

DataFrame with one row of regression metrics for each ML model generated.
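
Continuing the regression sketch above, the returned DataFrame can be ranked directly, e.g. to find the lowest-RMSE model (column names taken from the source below):

metrics_df = reg_model.evaluate_models()
best = metrics_df.sort_values("RMSE").iloc[0]
print(f"Lowest RMSE model: {best['Model']} (RMSE = {best['RMSE']})")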

Source code in key_interactions_finder/model_building.py
def evaluate_models(self) -> pd.DataFrame:
    """
    Evaluates each ML model's performance on the validation data set
    and provides the user with a summary of the results.

    Returns
    ----------

    pd.DataFrame
        DataFrame with one row of regression metrics for each
        ML model generated.
    """
    all_regression_dfs = []
    for model_name, clf in self.ml_models.items():
        y_validation = self.ml_datasets["y_eval"]
        yhat = clf.predict(self.ml_datasets["eval_data_scaled"])

        regression_df = self._regression_metrics(
            model_name=model_name,
            y_true=y_validation,
            y_pred=yhat
        )

        all_regression_dfs.append(regression_df)

    return pd.concat(all_regression_dfs).reset_index(drop=True)

UnsupervisedModel dataclass

Bases: _MachineLearnModel

Class to construct machine learning models for when there is no target class available (aka, unsupervised learning).

At present there is limited support for this, with only principal component analysis (PCA) available.

Attributes

dataset : pd.DataFrame

Input dataset.

out_dir : str

Directory path to store results files to. Default = ""

feat_names : np.ndarray

All feature names/labels.

data_scaled : np.ndarray

Input dataset scaled with standard scaling.

ml_models : dict

Keys are the model name/method and values are the instance of the built model.

Methods

describe_ml_planned() Prints a summary of what machine learning protocol has been selected.

build_models(save_models) Runs the machine learning and summarizes the results.
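
Since only PCA is currently supported, a typical use is to fit it and inspect the fitted components (a sketch; the file name is illustrative, and a "Target" column, if present, is dropped automatically):

import pandas as pd
from key_interactions_finder.model_building import UnsupervisedModel

df = pd.read_csv("features.csv")

unsup_model = UnsupervisedModel(dataset=df)
unsup_model.build_models(save_models=False)

pca = unsup_model.ml_models["PCA"]
print(pca.explained_variance_ratio_[:5])  # variance captured by the first 5 PCs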

Source code in key_interactions_finder/model_building.py
@dataclass
class UnsupervisedModel(_MachineLearnModel):
    """
    Class to construct machine learning models for when there is
    no target class available (aka, unsupervised learning).

    At present there is limited support for this, with only
    principal component analysis (PCA) available.

    Attributes
    ----------

    dataset : pd.DataFrame
        Input dataset.

    out_dir : str
        Directory path to store results files to.
        Default = ""

    feat_names : np.ndarray
        All feature names/labels.

    data_scaled : np.ndarray
        Input dataset scaled with Standard scaling.

    ml_models : dict
        Keys are the model name/method and values are the instance of the
        built model.

    Methods
    -------

    describe_ml_planned()
        Prints a summary of what machine learning protocol has been selected.

    build_models(save_models)
        Runs the machine learning and summarizes the results.
    """
    dataset: pd.DataFrame
    out_dir: str = ""

    # Dynamically generated:
    feat_names: np.ndarray = field(init=False)
    data_scaled: np.ndarray = field(init=False)
    ml_models: dict = field(init=False)

    # This is called at the end of the dataclass's initialization procedure.
    def __post_init__(self):
        """Setup the provided dataset and params for ML."""
        self.out_dir = _prep_out_dir(self.out_dir)

        self.ml_models = {}

        # Allow a user with a supervised dataset to do unsupervised learning.
        try:
            self.dataset = self.dataset.drop(["Target"], axis=1)
        except KeyError:
            pass

        self.feat_names = self.dataset.columns.values
        data_array = self.dataset.to_numpy()

        scaler = StandardScaler()
        self.data_scaled = scaler.fit_transform(data_array)

        self.describe_ml_planned()

    def describe_ml_planned(self) -> None:
        """Prints a summary of what machine learning protocol has been selected."""
        out_text = "\n"
        out_text += "Below is a summary of the unsupervised machine learning you have planned. \n"

        out_text += f"You will use {len(self.dataset.columns)} features to build the model. "
        out_text += "All of your data will be used for training, "
        out_text += f"which is {len(self.dataset)} observations.\n"

        out_text += "Currently you will use principal component analysis to get your results. "
        out_text += "More methods might be added in the future. "

        out_text += "If you're happy with the above, let's get model building!"
        print(out_text)

    def build_models(self, save_models: bool = True) -> None:
        """
        Runs the machine learning and summarizes the results.

        Parameters
        ----------

        save_models : bool
            Whether to save the ML models made to disk.
            Default is True.
        """
        pca = PCA()
        pca.fit(self.data_scaled)
        self.ml_models["PCA"] = pca
        print("All models built.")

        if save_models:
            temp_folder = Path("temporary_files")
            if not temp_folder.exists():
                Path.mkdir(temp_folder)

            feat_names_file = Path(temp_folder, "feature_names.npy")
            np.save(feat_names_file, self.feat_names)

            model_out_path = Path(temp_folder, "PCA_Model.pickle")
            self._save_best_models(best_model=pca, out_path=model_out_path)

__post_init__()

Setup the provided dataset and params for ML.

Source code in key_interactions_finder/model_building.py
def __post_init__(self):
    """Setup the provided dataset and params for ML."""
    self.out_dir = _prep_out_dir(self.out_dir)

    self.ml_models = {}

    # Allow a user with a supervised dataset to do unsupervised learning.
    try:
        self.dataset = self.dataset.drop(["Target"], axis=1)
    except KeyError:
        pass

    self.feat_names = self.dataset.columns.values
    data_array = self.dataset.to_numpy()

    scaler = StandardScaler()
    self.data_scaled = scaler.fit_transform(data_array)

    self.describe_ml_planned()

build_models(save_models=True)

Runs the machine learning and summarizes the results.

Parameters

save_models : bool

Whether to save the ML models made to disk. Default is True.
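
When save_models is True, two artifacts are written into a "temporary_files" folder, as the source below shows. A sketch of reading them back, assuming _save_best_models pickles the model with the standard pickle module (its implementation is not shown on this page):

import pickle
from pathlib import Path

import numpy as np

feat_names = np.load(Path("temporary_files", "feature_names.npy"), allow_pickle=True)
with open(Path("temporary_files", "PCA_Model.pickle"), "rb") as f:
    pca = pickle.load(f)

print(len(feat_names), pca.n_components_)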

Source code in key_interactions_finder/model_building.py
def build_models(self, save_models: bool = True) -> None:
    """
    Runs the machine learning and summarizes the results.

    Parameters
    ----------

    save_models : bool
        Whether to save the ML models made to disk.
        Default is True.
    """
    pca = PCA()
    pca.fit(self.data_scaled)
    self.ml_models["PCA"] = pca
    print("All models built.")

    if save_models:
        temp_folder = Path("temporary_files")
        if not temp_folder.exists():
            Path.mkdir(temp_folder)

        feat_names_file = Path(temp_folder, "feature_names.npy")
        np.save(feat_names_file, self.feat_names)

        model_out_path = Path(temp_folder, "PCA_Model.pickle")
        self._save_best_models(best_model=pca, out_path=model_out_path)

describe_ml_planned()

Prints a summary of what machine learning protocol has been selected.

Source code in key_interactions_finder/model_building.py
def describe_ml_planned(self) -> None:
    """Prints a summary of what machine learning protocol has been selected."""
    out_text = "\n"
    out_text += "Below is a summary of the unsupervised machine learning you have planned. \n"

    out_text += f"You will use {len(self.dataset.columns)} features to build the model. "
    out_text += "All of your data will be used for training, "
    out_text += f"which is {len(self.dataset)} observations.\n"

    out_text += "Currently you will use principal component analysis to get your results. "
    out_text += "More methods might be added in the future. "

    out_text += "If you're happy with the above, let's get model building!"
    print(out_text)