
Data Preparation

Takes a processed PyContact feature set and prepares the dataset for either supervised or unsupervised learning.

Main responsibilities:

  1. Add target variable data to supervised learning datasets.

  2. Offer several filtering methods for the PyContact features.

There are two classes intended for end users:

  1. SupervisedFeatureData: for supervised datasets (there is a target variable).

  2. UnsupervisedFeatureData: for unsupervised datasets (there is no target variable).

Both classes inherit from the class "_FeatureData", which abstracts as much of their shared behaviour as possible.

SupervisedFeatureData dataclass

Bases: _FeatureData

FeatureData class for datasets with a target variable (classification or regression).

Attributes:

Name Type Description
input_df DataFrame

Dataframe of PyContact features to process.

is_classification bool

Set to True if the target variable is categorical (classification, discrete data). Set to False if the target variable is continuous (regression).

target_file str

String for the path to the target variable file.

header_present bool

Whether the target_file has a header row. Default is True.

df_processed DataFrame

Dataframe generated after merging the feature and target data together but before any filtering has been performed.

df_filtered DataFrame

Dataframe generated after filtering. Each time a filtering method is applied, this dataset is updated, so the effects of all previously applied filtering methods are preserved.

Methods:

Name Description
filter_by_occupancy

Filter features such that only features with %occupancy >= the min_occupancy are kept.

filter_by_interaction_type

Filter features/interactions to use by their type (e.g. hbond or vdws...)

filter_by_main_or_side_chain

Filter features to only certain combinations of main and side chain interactions.

filter_by_avg_strength

Filter features/interactions to use by their average strength.

reset_filtering

Reset the filtered dataframe back to its original form.

filter_by_occupancy_by_class

Special alternative to the standard filter_by_occupancy method. %occupancy is determined for each class (as opposed to the whole dataset), meaning only observations from 1 class have to meet the cut-off to keep the feature. Only available to datasets with a categorical target variable (classification).

Source code in key_interactions_finder/data_preperation.py
@dataclass
class SupervisedFeatureData(_FeatureData):
    """
    FeatureData class for datasets with a target variable (classification or regression).

    Attributes
    ----------

    input_df : pd.DataFrame
        Dataframe of PyContact features to process.

    is_classification : bool
        Set to True if the target variable is categorical (classification, discrete data).
        Set to False if the target variable is continuous (regression).

    target_file : str
        String for the path to the target variable file.

    header_present : bool
        True or False, does the target_file have a header.
        Default is True.

    df_processed : pd.DataFrame
        Dataframe generated after merging the feature and target data together
        but before any filtering has been performed.

    df_filtered : pd.DataFrame
        Dataframe generated after filtering. Each time a filtering method is applied, this
        dataset is updated, so the effects of all previously applied filtering methods are preserved.

    Methods
    -------

    filter_by_occupancy(min_occupancy)
        Filter features such that only features with %occupancy >= the min_occupancy are kept.

    filter_by_interaction_type(interaction_types_included)
        Filter features/interactions to use by their type (e.g. hbond or vdws...)

    filter_by_main_or_side_chain(main_side_chain_types_included)
        Filter features to only certain combinations of main and side chain interactions.

    filter_by_avg_strength(average_strength_cut_off)
        Filter features/interactions to use by their average strength.

    reset_filtering()
        Reset the filtered dataframe back to its original form.

    filter_by_occupancy_by_class(min_occupancy)
        Special alternative to the standard filter_by_occupancy method.
        %occupancy is determined for each class (as opposed to the whole dataset),
        meaning only observations from 1 class have to meet the cut-off to keep the feature.
        Only available to datasets with a categorical target variable (classification).
    """

    # Others are defined in parent class.
    is_classification: bool
    target_file: str
    header_present: bool = True

    def __post_init__(self):
        """Merge target data to the dataframe, make an empty df for df_filtered."""
        if self.header_present:
            df_class = pd.read_csv(self.target_file)
        else:
            df_class = pd.read_csv(self.target_file, header=None)

        df_class = df_class.set_axis(["Target"], axis=1)

        if len(df_class) == len(self.input_df):
            self.df_processed = pd.concat([df_class, self.input_df], axis=1)
        else:
            exception_message = (
                f"Number of rows for target variables data: {len(df_class)} \n"
                + f"Number of rows for PyContact data: {len(self.input_df)} \n"
                + "The length of your target variables file doesn't match the "
                + "length of your features file. If the difference is 1, "
                + "check if you set the 'header_present' keyword correctly."
            )
            raise ValueError(exception_message)

        # Empty for now until any filtering is performed
        self.df_filtered = pd.DataFrame()

        print("Your PyContact features and target variable have been successfully merged.")
        print("You can access this dataset through the class attribute: '.df_processed'.")

    def filter_by_occupancy_by_class(self, min_occupancy: float) -> pd.DataFrame:
        """
        Special alternative to the standard filter features by occupancy method.
        As in the standard method, only features with %occupancy >= the min_occupancy are kept.
        (%occupancy is the % of frames that have a non-zero interaction value).

        However, in this approach, %occupancy is determined for each class, meaning only
        observations from 1 class have to meet the cut-off to keep the feature.

        Only available to datasets with classification (not regression) target data.

        Parameters
        ----------

        min_occupancy : float
            Minimum %occupancy that a feature must have to be retained.

        Returns
        -------

        pd.DataFrame
            Filtered dataframe.
        """
        if not self.is_classification:
            error_message = (
                "Only datasets with discrete data (i.e. for classification) can use this method. "
                + "You specified your target data was continuous (i.e. for regression). "
                + "You are likely after the method: filter_by_occupancy(min_occupancy) instead."
            )
            raise TypeError(error_message)

        keep_cols = ["Target"]  # always want "Target" present...
        try:
            for class_label in list(self.df_filtered["Target"].unique()):
                df_single_class = self.df_filtered[(self.df_filtered["Target"] == class_label)]
                keep_cols_single_class = list(
                    (df_single_class.loc[:, (df_single_class != 0).mean() > (min_occupancy / 100)]).columns
                )
                keep_cols.extend(keep_cols_single_class)

            self.df_filtered = self.df_filtered[list(sorted(set(keep_cols), reverse=True))]

        except KeyError:  # if no other filtering has been performed yet, follow this path.
            for class_label in list(self.df_processed["Target"].unique()):
                df_single_class = self.df_processed[(self.df_processed["Target"] == class_label)]
                keep_cols_single_class = list(
                    (df_single_class.loc[:, (df_single_class != 0).mean() > (min_occupancy / 100)]).columns
                )
                keep_cols.extend(keep_cols_single_class)

            self.df_filtered = self.df_processed[list(sorted(set(keep_cols), reverse=True))]

        return self.df_filtered

__post_init__()

Merge target data to the dataframe, make an empty df for df_filtered.
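The row-count check in this method depends on header_present being set correctly. The sketch below uses only the standard library (not pandas, and not the library's own code) to illustrate why a wrong setting typically shows up as a difference of exactly one row. The helper count_target_rows and the sample labels are hypothetical.

```python
import csv
import io


def count_target_rows(text: str, header_present: bool) -> int:
    """Count data rows in a single-column target file (hypothetical helper)."""
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    # When a header is expected, the first row is consumed as the column name.
    return len(rows) - 1 if header_present and rows else len(rows)


# Three frames of target labels, stored WITHOUT a header line.
target_text = "active\ninactive\nactive\n"
n_feature_rows = 3  # assumed: one row per frame in the feature table

# Correct setting: row counts agree.
print(count_target_rows(target_text, header_present=False) == n_feature_rows)
# Wrong setting: the first label is treated as a header, so one row is lost.
print(count_target_rows(target_text, header_present=True) == n_feature_rows)
```

If the ValueError raised by this method reports a difference of 1 between the two row counts, the header_present setting is the first thing to check.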

Source code in key_interactions_finder/data_preperation.py
def __post_init__(self):
    """Merge target data to the dataframe, make an empty df for df_filtered."""
    if self.header_present:
        df_class = pd.read_csv(self.target_file)
    else:
        df_class = pd.read_csv(self.target_file, header=None)

    df_class = df_class.set_axis(["Target"], axis=1)

    if len(df_class) == len(self.input_df):
        self.df_processed = pd.concat([df_class, self.input_df], axis=1)
    else:
        exception_message = (
            f"Number of rows for target variables data: {len(df_class)} \n"
            + f"Number of rows for PyContact data: {len(self.input_df)} \n"
            + "The length of your target variables file doesn't match the "
            + "length of your features file. If the difference is 1, "
            + "check if you set the 'header_present' keyword correctly."
        )
        raise ValueError(exception_message)

    # Empty for now until any filtering is performed
    self.df_filtered = pd.DataFrame()

    print("Your PyContact features and target variable have been successfully merged.")
    print("You can access this dataset through the class attribute: '.df_processed'.")

filter_by_occupancy_by_class(min_occupancy)

Special alternative to the standard filter features by occupancy method. As in the standard method, only features with %occupancy >= the min_occupancy are kept. (%occupancy is the % of frames that have a non-zero interaction value).

However, in this approach, %occupancy is determined for each class, meaning only observations from 1 class have to meet the cut-off to keep the feature.

Only available to datasets with classification (not regression) target data.

Parameters:

Name Type Description Default
min_occupancy float

Minimum %occupancy that a feature must have to be retained.

required

Returns:

Type Description
DataFrame

Filtered dataframe.
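To make the difference between whole-dataset and per-class occupancy concrete, here is a minimal sketch in plain Python. It is an illustration of the idea, not the library's implementation: the helper functions and the feature names contact_1 and contact_2 are hypothetical, and the strict ">" comparison mirrors the one used in this method's source listing.

```python
from collections import defaultdict


def occupancy_percent(values):
    """Percentage of frames with a non-zero interaction value."""
    return 100.0 * sum(1 for v in values if v != 0) / len(values)


def features_kept_by_class(features, targets, min_occupancy):
    """Keep a feature if its occupancy exceeds the cut-off in at least one class."""
    # Group frame indices by class label.
    frames_by_class = defaultdict(list)
    for idx, label in enumerate(targets):
        frames_by_class[label].append(idx)

    kept = []
    for name, values in features.items():
        per_class = [
            occupancy_percent([values[i] for i in idxs])
            for idxs in frames_by_class.values()
        ]
        # Only the best-scoring class has to pass the cut-off.
        if max(per_class) > min_occupancy:
            kept.append(name)
    return kept


targets = ["A", "A", "B", "B"]
features = {
    "contact_1": [0.0, 0.0, 1.5, 2.0],  # 0% in class A, 100% in class B
    "contact_2": [0.0, 0.0, 0.0, 0.7],  # 0% in class A, 50% in class B
}

print(occupancy_percent(features["contact_1"]))  # whole-dataset occupancy
print(features_kept_by_class(features, targets, min_occupancy=60))
```

Here contact_1 has a whole-dataset occupancy of 50%, so a whole-dataset cut-off of 60 would discard it, yet it is present in every frame of class B and is therefore retained by the per-class filter. This is the situation the method is designed for: interactions that are characteristic of a single class.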

Source code in key_interactions_finder/data_preperation.py
def filter_by_occupancy_by_class(self, min_occupancy: float) -> pd.DataFrame:
    """
    Special alternative to the standard filter features by occupancy method.
    As in the standard method, only features with %occupancy >= the min_occupancy are kept.
    (%occupancy is the % of frames that have a non-zero interaction value).

    However, in this approach, %occupancy is determined for each class, meaning only
    observations from 1 class have to meet the cut-off to keep the feature.

    Only available to datasets with classification (not regression) target data.

    Parameters
    ----------

    min_occupancy : float
        Minimum %occupancy that a feature must have to be retained.

    Returns
    -------

    pd.DataFrame
        Filtered dataframe.
    """
    if not self.is_classification:
        error_message = (
            "Only datasets with discrete data (i.e. for classification) can use this method. "
            + "You specified your target data was continuous (i.e. for regression). "
            + "You are likely after the method: filter_by_occupancy(min_occupancy) instead."
        )
        raise TypeError(error_message)

    keep_cols = ["Target"]  # always want "Target" present...
    try:
        for class_label in list(self.df_filtered["Target"].unique()):
            df_single_class = self.df_filtered[(self.df_filtered["Target"] == class_label)]
            keep_cols_single_class = list(
                (df_single_class.loc[:, (df_single_class != 0).mean() > (min_occupancy / 100)]).columns
            )
            keep_cols.extend(keep_cols_single_class)

        self.df_filtered = self.df_filtered[list(sorted(set(keep_cols), reverse=True))]

    except KeyError:  # if no other filtering has been performed yet, follow this path.
        for class_label in list(self.df_processed["Target"].unique()):
            df_single_class = self.df_processed[(self.df_processed["Target"] == class_label)]
            keep_cols_single_class = list(
                (df_single_class.loc[:, (df_single_class != 0).mean() > (min_occupancy / 100)]).columns
            )
            keep_cols.extend(keep_cols_single_class)

        self.df_filtered = self.df_processed[list(sorted(set(keep_cols), reverse=True))]

    return self.df_filtered

UnsupervisedFeatureData dataclass

Bases: _FeatureData

FeatureData class for datasets without a target variable.

Attributes:

Name Type Description
input_df DataFrame

Dataframe of PyContact features to process.

df_processed DataFrame

Dataframe generated after class initialisation.

df_filtered DataFrame

Dataframe generated after filtering. If multiple filtering methods are used, this is updated each time, so the effects of all filtering methods applied to it are preserved.

Methods:

Name Description
filter_by_occupancy

Filter features such that only features with %occupancy >= the min_occupancy are kept.

filter_by_interaction_type

Filter features/interactions to use by their type (e.g. hbond or vdws...)

filter_by_main_or_side_chain

Filter features to only certain combinations of main and side chain interactions.

filter_by_avg_strength

Filter features/interactions to use by their average strength.

reset_filtering

Reset the filtered dataframe back to its original form.
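The way successive filters build on df_filtered can be sketched with a small stand-in class. This is a hypothetical illustration in plain Python, working on column names rather than dataframes. FilterState, keep_if and the feature names are invented for the example, and the sketch assumes that resetting simply discards earlier filter results.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class FilterState:
    """Hypothetical stand-in tracking unfiltered and filtered column names."""

    processed: List[str]
    filtered: Optional[List[str]] = None  # None until a filter has been applied

    def keep_if(self, predicate: Callable[[str], bool]) -> List[str]:
        # Start from earlier filter results if they exist, else from the full set.
        current = self.processed if self.filtered is None else self.filtered
        self.filtered = [name for name in current if predicate(name)]
        return self.filtered

    def reset(self) -> None:
        # Assumption: resetting forgets every filter applied so far.
        self.filtered = None


state = FilterState(processed=["hbond_1", "hbond_2", "vdw_1"])
print(state.keep_if(lambda name: name.startswith("hbond")))  # first filter
print(state.keep_if(lambda name: name.endswith("_1")))       # applied on top of it
state.reset()
print(state.keep_if(lambda name: name.endswith("_1")))       # starts from the full set
```

The second call only sees what survived the first, which is why the order of filters does not need to be tracked by the caller, and why a reset is needed before trying a different combination of filters from scratch.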

Source code in key_interactions_finder/data_preperation.py
@dataclass
class UnsupervisedFeatureData(_FeatureData):
    """
    FeatureData class for datasets without a target variable.

    Attributes
    ----------

    input_df : pd.DataFrame
        Dataframe of PyContact features to process.

    df_processed : pd.DataFrame
        Dataframe generated after class initialisation.

    df_filtered : pd.DataFrame
        Dataframe generated after filtering. If multiple filtering methods are used,
        this is updated each time, so the effects of all filtering methods applied to it are preserved.

    Methods
    -------

    filter_by_occupancy(min_occupancy)
        Filter features such that only features with %occupancy >= the min_occupancy are kept.

    filter_by_interaction_type(interaction_types_included)
        Filter features/interactions to use by their type (e.g. hbond or vdws...)

    filter_by_main_or_side_chain(main_side_chain_types_included)
        Filter features to only certain combinations of main and side chain interactions.

    filter_by_avg_strength(average_strength_cut_off)
        Filter features/interactions to use by their average strength.

    reset_filtering()
        Reset the filtered dataframe back to its original form.
    """

    def __post_init__(self):
        """Initialise an empty dataframe so dataclass can be printed."""
        self.df_filtered = pd.DataFrame()

        # A little hacky, but doing this unites the supervised + unsupervised methods.
        # Saves a lot of code duplication...
        self.df_processed = self.input_df

__post_init__()

Initialise an empty dataframe so dataclass can be printed.

Source code in key_interactions_finder/data_preperation.py
def __post_init__(self):
    """Initialise an empty dataframe so dataclass can be printed."""
    self.df_filtered = pd.DataFrame()

    # A little hacky, but doing this unites the supervised + unsupervised methods.
    # Saves a lot of code duplication...
    self.df_processed = self.input_df