Data Preparation
Takes a processed PyContact feature set and prepares the dataset for either supervised or unsupervised learning.
Main responsibilities: 1. Add target variable data to supervised learning datasets. 2. Offer several filtering methods for the PyContact features.
There are two classes intended for end users:
- SupervisedFeatureData: for supervised datasets (there is a target variable).
- UnsupervisedFeatureData: for unsupervised datasets (there is no target variable).
Both classes inherit from the class "_FeatureData", which abstracts as much of their shared behaviour as possible.
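The relationship between the three classes can be pictured with the minimal sketch below. It is not the library's source code: the names `_BaseData`, `WithTarget` and `WithoutTarget`, their fields, and their methods are all hypothetical, and plain dictionaries stand in for dataframes.

```python
from dataclasses import dataclass


@dataclass
class _BaseData:
    """Hypothetical parent class holding the behaviour both children share."""

    # Feature name -> per-frame values (a stand-in for a dataframe).
    features: dict

    def feature_names(self):
        """Shared helper available to every child class."""
        return sorted(self.features)


@dataclass
class WithTarget(_BaseData):
    """Hypothetical supervised variant: adds a target and target-only helpers."""

    target: list

    def classes(self):
        """Only meaningful when a target variable exists."""
        return sorted(set(self.target))


@dataclass
class WithoutTarget(_BaseData):
    """Hypothetical unsupervised variant: inherits everything, adds nothing."""


supervised = WithTarget({"b": [1.0, 0.0], "a": [0.0, 0.5]}, ["on", "off"])
unsupervised = WithoutTarget({"a": [0.0, 0.5]})
```

Under this layout a shared method is written once in the parent, while anything that needs the target variable lives only in the supervised child.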
SupervisedFeatureData
dataclass
Bases: _FeatureData
FeatureData class for datasets with a target variable (classification or regression).
Attributes
pd.DataFrame
Dataframe of PyContact features to process.
bool
Set to True if the target variable is a classification (discrete data). Set to False if the target variable is a regression (continuous data).
str
String for the path to the target variable file.
bool
Whether the target_file has a header (True or False). Default is True.
pd.DataFrame
Dataframe generated after merging the feature and classification data together, but before any filtering has been performed.
pd.DataFrame
Dataframe generated after filtering. Each time a filtering method is applied, this dataset is updated, so all previously performed filtering methods are preserved.
Methods
filter_by_occupancy(min_occupancy) Filter features such that only features with %occupancy >= the min_occupancy are kept.
filter_by_interaction_type(interaction_types_included) Filter features/interactions to use by their type (e.g. hbond or vdws...)
filter_by_main_or_side_chain(main_side_chain_types_included) Filter features to only certain combinations of main and side chain interactions.
filter_by_avg_strength(average_strength_cut_off) Filter features/interactions to use by their average strength.
reset_filtering() Reset the filtered dataframe back to its original form.
filter_by_occupancy_by_class(min_occupancy) Special alternative to the standard filter features by occupancy method. %occupancy is determined for each class (as opposed to the whole dataset), meaning only observations from 1 class have to meet the cut-off to keep the feature. Only available to datasets with a categorical target variable (classification).
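To make the standard occupancy cut-off concrete, here is a minimal pandas sketch. It is an illustration, not the library's implementation: the function name `filter_by_occupancy_sketch` and the column names are hypothetical, and %occupancy is taken to mean the percentage of frames with a non-zero interaction value.

```python
import pandas as pd


def filter_by_occupancy_sketch(df, min_occupancy):
    """Keep columns whose % of non-zero frames is >= min_occupancy (hypothetical helper)."""
    # Fraction of rows (frames) with a non-zero value, per column, as a percentage.
    occupancy = (df != 0).mean() * 100
    return df.loc[:, occupancy >= min_occupancy]


frames = pd.DataFrame(
    {
        "contact_a": [1.2, 0.0, 0.8, 0.5],  # non-zero in 3 of 4 frames
        "contact_b": [0.0, 0.0, 0.0, 2.0],  # non-zero in 1 of 4 frames
    }
)
kept = filter_by_occupancy_sketch(frames, min_occupancy=50)
```

With a 50 % cut-off only `contact_a` survives in this toy data, because `contact_b` is non-zero in a quarter of the frames.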
Source code in key_interactions_finder/data_preperation.py
__post_init__()
Merge the target data into the dataframe and create an empty dataframe for df_filtered.
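The merge step can be pictured roughly as follows. This is a sketch under stated assumptions, not the library's code: it assumes the target file holds one value per frame in a single column, the helper `add_target_sketch` and the column name "Target" are hypothetical, and an in-memory buffer stands in for a file on disk.

```python
import io

import pandas as pd


def add_target_sketch(features, target_source, header_present=True):
    """Attach a one-column target to a feature dataframe (hypothetical helper)."""
    # header=0 treats the first row as a header; header=None treats it as data.
    header = 0 if header_present else None
    target = pd.read_csv(target_source, header=header).iloc[:, 0]
    if len(target) != len(features):
        raise ValueError("Target and feature data must have the same number of rows.")
    merged = features.copy()
    # Assumed column name "Target", placed first so it is easy to find.
    merged.insert(0, "Target", target.to_numpy())
    return merged


features = pd.DataFrame({"contact_a": [1.0, 0.0, 0.4]})
target_file = io.StringIO("state\nactive\ninactive\nactive\n")  # stand-in for a real file
merged = add_target_sketch(features, target_file, header_present=True)
```

The row-count check matters in this sketch: a wrong header setting shifts the target by one row, and the length mismatch catches that early.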
filter_by_occupancy_by_class(min_occupancy)
Special alternative to the standard filter features by occupancy method. As in the standard method, only features with %occupancy >= the min_occupancy are kept. (%occupancy is the % of frames that have a non-zero interaction value).
However, in this approach, %occupancy is determined for each class, meaning only observations from 1 class have to meet the cut-off to keep the feature.
Only available to datasets with classification (not regression) target data.
Parameters
min_occupancy : float
Minimum %occupancy that a feature must have to be retained.
Returns
pd.DataFrame Filtered dataframe.
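The difference from the standard occupancy filter is easiest to see on a small example. The sketch below is an illustration only: the function `filter_by_occupancy_by_class_sketch`, the "Target" column name and the feature names are hypothetical, and the library's own implementation may differ.

```python
import pandas as pd


def filter_by_occupancy_by_class_sketch(df, target_col, min_occupancy):
    """Keep features that reach min_occupancy in at least one class (hypothetical helper)."""
    features = df.drop(columns=[target_col])
    # 1 where the interaction is present in a frame, 0 where it is absent.
    present = (features != 0).astype(int)
    # % of frames with the interaction present, computed separately for each class.
    per_class = present.groupby(df[target_col]).mean() * 100
    # A feature is kept if any single class meets the cut-off.
    keep = per_class.columns[(per_class >= min_occupancy).any(axis=0)]
    return df[[target_col, *keep]]


frames = pd.DataFrame(
    {
        "Target": ["active", "active", "inactive", "inactive"],
        "contact_a": [1.0, 0.5, 0.0, 0.0],  # always present in "active", never in "inactive"
        "contact_b": [0.0, 0.3, 0.0, 0.0],  # present in half of the "active" frames only
    }
)
kept = filter_by_occupancy_by_class_sketch(frames, "Target", min_occupancy=75)
```

Here `contact_a` is kept because it meets the 75 % cut-off within the "active" class, even though it is present in only half of the frames overall and would be dropped by a whole-dataset cut-off at the same value.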
UnsupervisedFeatureData
dataclass
Bases: _FeatureData
FeatureData class for datasets without a target variable.
Attributes
pd.DataFrame
Dataframe of PyContact features to process.
pd.DataFrame
Dataframe generated after class initialisation.
pd.DataFrame
Dataframe generated after filtering. If multiple filtering methods are used, this dataframe is updated each time, so all filtering methods performed on it are preserved.
Methods
filter_by_occupancy(min_occupancy) Filter features such that only features with %occupancy >= the min_occupancy are kept.
filter_by_interaction_type(interaction_types_included) Filter features/interactions to use by their type (e.g. hbond or vdws...)
filter_by_main_or_side_chain(main_side_chain_types_included) Filter features to only certain combinations of main and side chain interactions.
filter_by_avg_strength(average_strength_cut_off) Filter features/interactions to use by their average strength.
reset_filtering() Reset the filtered dataframe back to its original form.
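The way successive filters accumulate, and how a reset undoes them, can be sketched as below. This is a simplified illustration rather than the library's code: the class `FilterStateSketch`, its two toy filters and the column names are hypothetical, and for simplicity the filtered dataframe starts as a copy of the processed one instead of an empty dataframe.

```python
from dataclasses import dataclass, field

import pandas as pd


@dataclass
class FilterStateSketch:
    """Hypothetical holder of an unfiltered and a cumulatively filtered dataframe."""

    df_processed: pd.DataFrame
    df_filtered: pd.DataFrame = field(init=False)

    def __post_init__(self):
        # Simplification for this sketch: begin from a copy of the processed data.
        self.df_filtered = self.df_processed.copy()

    def keep_min_occupancy(self, min_occupancy):
        """Toy filter 1: keep columns with enough non-zero frames (in %)."""
        occupancy = (self.df_filtered != 0).mean() * 100
        self.df_filtered = self.df_filtered.loc[:, occupancy >= min_occupancy]
        return self.df_filtered

    def keep_name_containing(self, text):
        """Toy filter 2: keep columns whose name contains the given text."""
        cols = [c for c in self.df_filtered.columns if text in c]
        self.df_filtered = self.df_filtered[cols]
        return self.df_filtered

    def reset(self):
        """Discard all filtering and return to the processed dataframe."""
        self.df_filtered = self.df_processed.copy()
        return self.df_filtered


data = FilterStateSketch(
    pd.DataFrame(
        {
            "a_hbond": [1.0, 0.5, 0.0, 0.2],  # non-zero in 3 of 4 frames
            "b_hbond": [0.0, 0.0, 0.0, 0.9],  # non-zero in 1 of 4 frames
            "c_vdw": [0.3, 0.3, 0.3, 0.3],  # non-zero in every frame
        }
    )
)
data.keep_min_occupancy(50)  # drops b_hbond
data.keep_name_containing("hbond")  # then drops c_vdw
after_two_filters = list(data.df_filtered.columns)
data.reset()
after_reset = list(data.df_filtered.columns)
```

Because each filter reads from and writes to the same filtered dataframe, the second call narrows the result of the first rather than starting over, and the reset restores all three columns.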
__post_init__()
Initialise an empty dataframe so that the dataclass can be printed.