ehrapy.preprocessing.miss_forest_impute
- ehrapy.preprocessing.miss_forest_impute(adata, var_names=None, *, num_initial_strategy='mean', max_iter=3, n_estimators=100, random_state=0, warning_threshold=70, copy=False)
Impute data using the MissForest strategy.
This function uses the MissForest strategy to impute missing values in the data matrix of an AnnData object. The strategy works by fitting a random forest model on each feature containing missing values, and using the trained model to predict the missing values.
See https://academic.oup.com/bioinformatics/article/28/1/112/219101. This requires determining which columns in X contain only numerical values (including NaNs) and which contain non-numerical data.
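The fit-and-predict loop described above can be sketched roughly as follows. This is a minimal illustration and not ehrapy's implementation: it assumes a purely numerical NumPy matrix in which every column has at least one observed value, uses scikit-learn's `RandomForestRegressor`, and runs a fixed number of passes instead of the stopping criterion from the paper. The function name `miss_forest_sketch` is hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def miss_forest_sketch(X, max_iter=3, n_estimators=100, random_state=0):
    """Illustrative MissForest-style imputation for a numerical matrix."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)

    # Initial guess: replace every missing entry with its column mean.
    col_means = np.nanmean(X, axis=0)
    filled = np.where(missing, col_means, X)

    # Visit only columns that have gaps, fewest missing values first.
    order = np.argsort(missing.sum(axis=0))
    cols = [j for j in order if missing[:, j].any()]

    for _ in range(max_iter):
        for j in cols:
            observed = ~missing[:, j]
            predictors = np.delete(filled, j, axis=1)
            forest = RandomForestRegressor(
                n_estimators=n_estimators, random_state=random_state
            )
            # Fit on rows where column j was observed ...
            forest.fit(predictors[observed], filled[observed, j])
            # ... and overwrite the current guesses for the missing rows.
            filled[~observed, j] = forest.predict(predictors[~observed])
    return filled
```

Observed entries are never modified; only the positions that were NaN in the input are updated on each pass, and later columns see the refreshed guesses of earlier ones.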
- Parameters:
adata (AnnData) – The AnnData object to use MissForest imputation on.
var_names (dict[str, list[str]] | list[str] | None) – List of columns to impute, or a dict with two keys ('numerical' and 'non_numerical') indicating which variables contain numerical data only and which contain mixed data.
num_initial_strategy (Literal['mean', 'median', 'most_frequent', 'constant']) – The initial strategy to replace all missing numerical values with. Defaults to 'mean'.
max_iter (int) – The maximum number of iterations if the stop criterion has not been met yet. Defaults to 3.
n_estimators – The number of trees to fit for every variable with missing values. Has a large effect on the run time; decrease for faster computations. Defaults to 100.
random_state (int) – The random seed for the initialization. Defaults to 0.
warning_threshold (int) – Threshold of percentage of missing values to display a warning for. Defaults to 70.
copy (bool) – Whether to return a copy or act in place. Defaults to False.
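To make the `var_names` dict and `warning_threshold` more concrete, here is a small plain-Python sketch of how columns could be grouped under the 'numerical' and 'non_numerical' keys and how a per-column missing percentage could be compared against the threshold. The helper names, the sample column names, and the strict `>` comparison are assumptions for illustration; ehrapy's internal logic may differ.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def split_columns(columns):
    """Group column names by whether their observed values are all numbers.

    `columns` maps a column name to a list of values (hypothetical layout).
    """
    groups = {"numerical": [], "non_numerical": []}
    for name, values in columns.items():
        observed = [v for v in values if not is_missing(v)]
        numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool)
            for v in observed
        )
        groups["numerical" if numeric else "non_numerical"].append(name)
    return groups


def columns_over_threshold(columns, warning_threshold=70):
    """Return names of columns whose missing percentage exceeds the threshold."""
    flagged = []
    for name, values in columns.items():
        pct_missing = 100 * sum(is_missing(v) for v in values) / len(values)
        if pct_missing > warning_threshold:  # assumed strict comparison
            flagged.append(name)
    return flagged


# Made-up sample data, not taken from any ehrapy dataset.
columns = {
    "age": [63.0, float("nan"), 71.0, 58.0],
    "sex": ["F", "M", None, "F"],
    "lactate": [float("nan"), float("nan"), float("nan"), 2.1],
}
var_names = split_columns(columns)
flagged = columns_over_threshold(columns)
```

A dict shaped like `var_names` here has the two keys the parameter description mentions, and `flagged` lists the columns that would be candidates for a warning at the default threshold of 70.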
- Return type:
- Returns:
The imputed (but unencoded) AnnData object.
Examples
>>> import ehrapy as ep
>>> adata = ep.dt.mimic_2(encoded=True)
>>> ep.pp.miss_forest_impute(adata)