Data Preparation

Takes a processed PyContact feature set and prepares the dataset for either supervised or unsupervised learning.

Main responsibilities:

  1. Add target variable data to supervised learning datasets.

  2. Offer several filtering methods for the PyContact features.

There are two classes intended for end users:

  1. SupervisedFeatureData: for supervised datasets (there is a target variable).

  2. UnsupervisedFeatureData: for unsupervised datasets (no target variable).

Both classes inherit from the class "_FeatureData", which abstracts as much of their shared behaviour as possible.

SupervisedFeatureData dataclass

Bases: _FeatureData

FeatureData class for datasets with a target variable (classification or regression).

Attributes

input_df : pd.DataFrame

Dataframe of PyContact features to process.

is_classification : bool

Set to True if the target variable is categorical (discrete data, i.e. classification). Set to False if the target variable is continuous (i.e. regression).

target_file : str

Path to the target variable file.

header_present : bool

Whether the target_file has a header row. Default is True.

df_processed : pd.DataFrame

Dataframe generated after merging the feature and target data together, but before any filtering has been performed.

df_filtered : pd.DataFrame

Dataframe generated after filtering. Each time a filtering method is applied, this dataframe is updated, so the effects of all previously applied filtering methods are preserved.

Methods

filter_by_occupancy(min_occupancy) Filter features such that only features with %occupancy >= the min_occupancy are kept.

filter_by_interaction_type(interaction_types_included) Filter features/interactions to use by their type (e.g. hbond or vdws...)

filter_by_main_or_side_chain(main_side_chain_types_included) Filter features to only certain combinations of main and side chain interactions.

filter_by_avg_strength(average_strength_cut_off) Filter features/interactions to use by their average strength.

reset_filtering() Reset the filtered dataframe back to its original form.

filter_by_occupancy_by_class(min_occupancy) Special alternative to the standard filter features by occupancy method. %occupancy is determined for each class (as opposed to the whole dataset), meaning only observations from 1 class have to meet the cut-off to keep the feature. Only available to datasets with a categorical target variable (classification).

Source code in key_interactions_finder/data_preperation.py
@dataclass
class SupervisedFeatureData(_FeatureData):
    """
    FeatureData class for datasets with a target variable (classification or regression).

    Attributes
    ----------

    input_df : pd.DataFrame
        Dataframe of PyContact features to process.

    is_classification : bool
        Set to True if the target variable is categorical (discrete data, i.e. classification).
        Set to False if the target variable is continuous (i.e. regression).

    target_file : str
        String for the path to the target variable file.

    header_present : bool
        True or False, does the target_file have a header.
        Default is True.

    df_processed : pd.DataFrame
        Dataframe generated after merging the feature and target data together,
        but before any filtering has been performed.

    df_filtered : pd.DataFrame
        Dataframe generated after filtering. Each time a filtering method is applied, this
        dataframe is updated, so the effects of all previously applied filtering methods
        are preserved.

    Methods
    -------

    filter_by_occupancy(min_occupancy)
        Filter features such that only features with %occupancy >= the min_occupancy are kept.

    filter_by_interaction_type(interaction_types_included)
        Filter features/interactions to use by their type (e.g. hbond or vdws...)

    filter_by_main_or_side_chain(main_side_chain_types_included)
        Filter features to only certain combinations of main and side chain interactions.

    filter_by_avg_strength(average_strength_cut_off)
        Filter features/interactions to use by their average strength.

    reset_filtering()
        Reset the filtered dataframe back to its original form.

    filter_by_occupancy_by_class(min_occupancy)
        Special alternative to the standard filter features by occupancy method.
        %occupancy is determined for each class (as opposed to the whole dataset),
        meaning only observations from 1 class have to meet the cut-off to keep the feature.
        Only available to datasets with a categorical target variable (classification).
    """
    # Others are defined in parent class.
    is_classification: bool
    target_file: str
    header_present: bool = True

    def __post_init__(self):
        """Merge target data to the dataframe, make an empty df for df_filtered."""
        if self.header_present:
            df_class = pd.read_csv(self.target_file)
        else:
            df_class = pd.read_csv(self.target_file, header=None)

        df_class = df_class.set_axis(["Target"], axis=1)

        if len(df_class) == len(self.input_df):
            self.df_processed = pd.concat([df_class, self.input_df], axis=1)
        else:
            exception_message = (f"Number of rows for target variables data: {len(df_class)} \n" +
                                 f"Number of rows for PyContact data: {len(self.input_df)} \n" +
                                 "The length of your target variables file doesn't match the " +
                                 "length of your features file. If the difference is 1, " +
                                 "check if you set the 'header_present' keyword correctly."
                                 )
            raise ValueError(exception_message)

        # Empty for now until any filtering is performed
        self.df_filtered = pd.DataFrame()

        print("Your PyContact features and target variable have been successfully merged.")
        print("You can access this dataset through the class attribute: '.df_processed'.")

    def filter_by_occupancy_by_class(self, min_occupancy: float) -> pd.DataFrame:
        """
        Special alternative to the standard filter features by occupancy method.
        As in the standard method, only features with %occupancy >= the min_occupancy are kept.
        (%occupancy is the % of frames that have a non-zero interaction value).

        However, in this approach, %occupancy is determined for each class, meaning only
        observations from 1 class have to meet the cut-off to keep the feature.

        Only available to datasets with classification (not regression) target data.

        Parameters
        ----------

        min_occupancy : float
            Minimum %occupancy that a feature must have to be retained.

        Returns
        -------

        pd.DataFrame
            Filtered dataframe.
        """
        if not self.is_classification:
            error_message = (
                "Only datasets with discrete data (i.e. for classification) can use this method. " +
                "You specified your target data was continuous (i.e. for regression). " +
                "You are likely after the method: filter_by_occupancy(min_occupancy) instead.")
            raise TypeError(error_message)

        keep_cols = ["Target"]  # always want "Target" present...
        try:
            for class_label in list(self.df_filtered["Target"].unique()):
                df_single_class = self.df_filtered[(
                    self.df_filtered["Target"] == class_label)]
                keep_cols_single_class = list(
                    (df_single_class.loc[:, (df_single_class !=
                                             0).mean() > (min_occupancy/100)]).columns
                )
                keep_cols.extend(keep_cols_single_class)

            self.df_filtered = self.df_filtered[list(
                sorted(set(keep_cols), reverse=True))]

        except KeyError:  # if no other filtering has been performed yet, follow this path.
            for class_label in list(self.df_processed["Target"].unique()):
                df_single_class = self.df_processed[(
                    self.df_processed["Target"] == class_label)]
                keep_cols_single_class = list(
                    (df_single_class.loc[:, (df_single_class !=
                                             0).mean() > (min_occupancy/100)]).columns
                )
                keep_cols.extend(keep_cols_single_class)

            self.df_filtered = self.df_processed[list(
                sorted(set(keep_cols), reverse=True))]

        return self.df_filtered

__post_init__()

Merge target data to the dataframe, make an empty df for df_filtered.

Source code in key_interactions_finder/data_preperation.py
def __post_init__(self):
    """Merge target data to the dataframe, make an empty df for df_filtered."""
    if self.header_present:
        df_class = pd.read_csv(self.target_file)
    else:
        df_class = pd.read_csv(self.target_file, header=None)

    df_class = df_class.set_axis(["Target"], axis=1)

    if len(df_class) == len(self.input_df):
        self.df_processed = pd.concat([df_class, self.input_df], axis=1)
    else:
        exception_message = (f"Number of rows for target variables data: {len(df_class)} \n" +
                             f"Number of rows for PyContact data: {len(self.input_df)} \n" +
                             "The length of your target variables file doesn't match the " +
                             "length of your features file. If the difference is 1, " +
                             "check if you set the 'header_present' keyword correctly."
                             )
        raise ValueError(exception_message)

    # Empty for now until any filtering is performed
    self.df_filtered = pd.DataFrame()

    print("Your PyContact features and target variable have been successfully merged.")
    print("You can access this dataset through the class attribute: '.df_processed'.")
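
The effect of the 'header_present' setting on the row-count check can be illustrated with a small standalone sketch. It is not part of the library: the feature values, the in-memory target file and the helper name `load_target` are all made up for illustration, and only the pandas calls mirror the listing above.

```python
import io

import pandas as pd

# Hypothetical stand-ins for the PyContact feature table and the target file.
features = pd.DataFrame({
    "feat_a": [0.0, 1.2, 0.4],
    "feat_b": [0.5, 0.0, 0.0],
})
target_csv = "state\nactive\ninactive\nactive\n"  # one header row + 3 labels


def load_target(buffer, header_present: bool) -> pd.DataFrame:
    """Read a single-column target file and name its column 'Target'."""
    if header_present:
        df = pd.read_csv(buffer)
    else:
        df = pd.read_csv(buffer, header=None)
    return df.set_axis(["Target"], axis=1)


# Correct setting: 3 target rows, matching the 3 feature rows.
target = load_target(io.StringIO(target_csv), header_present=True)
merged = pd.concat([target, features], axis=1)

# Wrong setting: the header row is read as data, giving 4 rows (off by one).
wrong = load_target(io.StringIO(target_csv), header_present=False)

print(len(target), len(wrong), list(merged.columns))
```

With the wrong setting the target data is one row longer than the features, which is the situation the ValueError message above hints at.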

filter_by_occupancy_by_class(min_occupancy)

Special alternative to the standard filter features by occupancy method. As in the standard method, only features with %occupancy >= the min_occupancy are kept. (%occupancy is the % of frames that have a non-zero interaction value).

However, in this approach, %occupancy is determined for each class, meaning only observations from 1 class have to meet the cut-off to keep the feature.

Only available to datasets with classification (not regression) target data.

Parameters

min_occupancy : float

Minimum %occupancy that a feature must have to be retained.

Returns

pd.DataFrame Filtered dataframe.

Source code in key_interactions_finder/data_preperation.py
def filter_by_occupancy_by_class(self, min_occupancy: float) -> pd.DataFrame:
    """
    Special alternative to the standard filter features by occupancy method.
    As in the standard method, only features with %occupancy >= the min_occupancy are kept.
    (%occupancy is the % of frames that have a non-zero interaction value).

    However, in this approach, %occupancy is determined for each class, meaning only
    observations from 1 class have to meet the cut-off to keep the feature.

    Only available to datasets with classification (not regression) target data.

    Parameters
    ----------

    min_occupancy : float
        Minimum %occupancy that a feature must have to be retained.

    Returns
    -------

    pd.DataFrame
        Filtered dataframe.
    """
    if not self.is_classification:
        error_message = (
            "Only datasets with discrete data (i.e. for classification) can use this method. " +
            "You specified your target data was continuous (i.e. for regression). " +
            "You are likely after the method: filter_by_occupancy(min_occupancy) instead.")
        raise TypeError(error_message)

    keep_cols = ["Target"]  # always want "Target" present...
    try:
        for class_label in list(self.df_filtered["Target"].unique()):
            df_single_class = self.df_filtered[(
                self.df_filtered["Target"] == class_label)]
            keep_cols_single_class = list(
                (df_single_class.loc[:, (df_single_class !=
                                         0).mean() > (min_occupancy/100)]).columns
            )
            keep_cols.extend(keep_cols_single_class)

        self.df_filtered = self.df_filtered[list(
            sorted(set(keep_cols), reverse=True))]

    except KeyError:  # if no other filtering has been performed yet, follow this path.
        for class_label in list(self.df_processed["Target"].unique()):
            df_single_class = self.df_processed[(
                self.df_processed["Target"] == class_label)]
            keep_cols_single_class = list(
                (df_single_class.loc[:, (df_single_class !=
                                         0).mean() > (min_occupancy/100)]).columns
            )
            keep_cols.extend(keep_cols_single_class)

        self.df_filtered = self.df_processed[list(
            sorted(set(keep_cols), reverse=True))]

    return self.df_filtered
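
To make the difference between the whole-dataset rule and the per-class rule concrete, here is a minimal sketch on toy data. The dataframe, the feature names and the `occupancy` helper are hypothetical and not part of the library; the comparison uses the strict `>` that appears in the listing above.

```python
import pandas as pd

# Hypothetical toy data: 4 frames, 2 classes, 3 features.
df = pd.DataFrame({
    "Target": ["A", "A", "B", "B"],
    "feat_1": [0.0, 0.0, 1.0, 2.0],  # 0% in A, 100% in B, 50% overall
    "feat_2": [0.0, 0.0, 0.0, 3.0],  # 0% in A, 50% in B, 25% overall
    "feat_3": [1.0, 1.0, 1.0, 1.0],  # 100% everywhere
})
min_occupancy = 75  # percent


def occupancy(frame: pd.DataFrame) -> pd.Series:
    """Percentage of rows with a non-zero value, per feature column."""
    return (frame.drop(columns="Target") != 0).mean() * 100


# Whole-dataset rule: occupancy is computed over all frames at once.
overall = occupancy(df)
kept_overall = sorted(overall[overall > min_occupancy].index)

# Per-class rule: a feature is kept if any single class passes the cut-off.
kept_by_class = set()
for _, class_frame in df.groupby("Target"):
    occ = occupancy(class_frame)
    kept_by_class.update(occ[occ > min_occupancy].index)
kept_by_class = sorted(kept_by_class)

print(kept_overall)   # ['feat_3']
print(kept_by_class)  # ['feat_1', 'feat_3']
```

Here feat_1 never occurs in class A but is always present in class B, so it survives the per-class filter even though its occupancy over the whole dataset is only 50%.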

UnsupervisedFeatureData dataclass

Bases: _FeatureData

FeatureData class for datasets without a target variable.

Attributes

input_df : pd.DataFrame

Dataframe of PyContact features to process.

df_processed : pd.DataFrame

Dataframe generated after class initialisation.

df_filtered : pd.DataFrame

Dataframe generated after filtering. If multiple filtering methods are used, this is repeatedly updated, so the effects of all filtering methods applied to it are preserved.

Methods

filter_by_occupancy(min_occupancy) Filter features such that only features with %occupancy >= the min_occupancy are kept.

filter_by_interaction_type(interaction_types_included) Filter features/interactions to use by their type (e.g. hbond or vdws...)

filter_by_main_or_side_chain(main_side_chain_types_included) Filter features to only certain combinations of main and side chain interactions.

filter_by_avg_strength(average_strength_cut_off) Filter features/interactions to use by their average strength.

reset_filtering() Reset the filtered dataframe back to its original form.

Source code in key_interactions_finder/data_preperation.py
@dataclass
class UnsupervisedFeatureData(_FeatureData):
    """
    FeatureData class for datasets without a target variable.

    Attributes
    ----------

    input_df : pd.DataFrame
        Dataframe of PyContact features to process.

    df_processed : pd.DataFrame
        Dataframe generated after class initialisation.

    df_filtered : pd.DataFrame
        Dataframe generated after filtering. If multiple filtering methods are used,
        this is repeatedly updated, so the effects of all filtering methods applied to it
        are preserved.

    Methods
    -------

    filter_by_occupancy(min_occupancy)
        Filter features such that only features with %occupancy >= the min_occupancy are kept.

    filter_by_interaction_type(interaction_types_included)
        Filter features/interactions to use by their type (e.g. hbond or vdws...)

    filter_by_main_or_side_chain(main_side_chain_types_included)
        Filter features to only certain combinations of main and side chain interactions.

    filter_by_avg_strength(average_strength_cut_off)
        Filter features/interactions to use by their average strength.

    reset_filtering()
        Reset the filtered dataframe back to its original form.
    """

    def __post_init__(self):
        """Initialise an empty dataframe so dataclass can be printed."""
        self.df_filtered = pd.DataFrame()

        # A little hacky, but doing this unites the supervised + unsupervised methods.
        # Saves a lot of code duplication...
        self.df_processed = self.input_df

__post_init__()

Initialise an empty dataframe so dataclass can be printed.

Source code in key_interactions_finder/data_preperation.py
def __post_init__(self):
    """Initialise an empty dataframe so dataclass can be printed."""
    self.df_filtered = pd.DataFrame()

    # A little hacky, but doing this unites the supervised + unsupervised methods.
    # Saves a lot of code duplication...
    self.df_processed = self.input_df
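
The sketch below shows one way that aliasing df_processed to input_df could let shared filtering code treat both subclasses alike. It is an assumption-laden illustration: `_Base`, `Unsupervised` and `keep_columns` are hypothetical stand-ins, and the real `_FeatureData` fields and filter methods may be defined differently.

```python
from dataclasses import dataclass, field

import pandas as pd


@dataclass
class _Base:
    """Hypothetical stand-in for _FeatureData; the real fields may differ."""
    input_df: pd.DataFrame
    df_processed: pd.DataFrame = field(init=False, repr=False)
    df_filtered: pd.DataFrame = field(init=False, repr=False)

    def keep_columns(self, columns):
        """Illustrative filter: narrow the current working frame."""
        # Start from df_processed while df_filtered is still empty.
        source = self.df_processed if self.df_filtered.empty else self.df_filtered
        self.df_filtered = source[[c for c in source.columns if c in columns]]
        return self.df_filtered


@dataclass
class Unsupervised(_Base):
    def __post_init__(self):
        self.df_filtered = pd.DataFrame()
        self.df_processed = self.input_df  # same object, no copy is made


data = Unsupervised(pd.DataFrame({"a": [1, 0], "b": [0, 0], "c": [2, 3]}))
data.keep_columns({"a", "c"})
data.keep_columns({"c", "b"})  # applied on top of the previous filter

print(data.df_processed is data.input_df)  # True
print(list(data.df_filtered.columns))      # ['c']
```

Because the assignment only binds a second name to the same dataframe, no extra memory is used, and each filter in this sketch builds a new dataframe rather than modifying input_df.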