Handling missing values
Dropping incomplete columns
The simplest and most direct approach, but it throws away a lot of data.
# Get names of columns with missing values
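The code above survives only as its first comment line, so here is a minimal sketch of dropping incomplete columns. The toy `X_train` / `X_valid` frames and their values are made up for illustration, and the names `cols_with_missing`, `reduced_X_train` and `reduced_X_valid` are assumptions rather than the original code.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real training/validation splits (hypothetical values)
X_train = pd.DataFrame({"rooms": [3, 4, np.nan, 2], "area": [70.0, 95.0, 60.0, 48.0]})
X_valid = pd.DataFrame({"rooms": [2, np.nan], "area": [55.0, 80.0]})

# Get names of columns with missing values, decided on the training data only
cols_with_missing = [col for col in X_train.columns if X_train[col].isnull().any()]

# Drop the same columns from both splits so they keep identical features
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_valid = X_valid.drop(cols_with_missing, axis=1)

print("Dropped columns:", cols_with_missing)
print(reduced_X_train)
```

Note that the whole `rooms` column is lost here even though only one training value was missing, which is the waste of data mentioned above.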
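Imputation
Fill each missing entry with a value such as the column mean. Whether this is appropriate depends on what the missing feature actually represents.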
from sklearn.impute import SimpleImputer
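Only the import line of this example survives, so the following is a sketch of mean imputation with `SimpleImputer` and not the original code. The toy data and the variable names `my_imputer`, `imputed_X_train` and `imputed_X_valid` are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data standing in for the real splits (hypothetical values)
X_train = pd.DataFrame({"rooms": [3.0, 4.0, np.nan, 2.0], "area": [70.0, 95.0, 60.0, 48.0]})
X_valid = pd.DataFrame({"rooms": [2.0, np.nan], "area": [55.0, 80.0]})

# Learn the column means from the training data only
my_imputer = SimpleImputer(strategy="mean")

# fit_transform returns a NumPy array, so column names and index are restored by hand
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train),
                               columns=X_train.columns, index=X_train.index)

# The validation data is filled with the means learned from the training data
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid),
                               columns=X_valid.columns, index=X_valid.index)

print(imputed_X_train)
print(imputed_X_valid)
```

Calling `fit_transform` on the training split and only `transform` on the validation split keeps validation statistics out of the imputer.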
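Imputation: an extension
Impute the missing values as before and, for each affected column, add a boolean column that records whether the value was imputed.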
# Make copy to avoid changing original data (when imputing)
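The rest of this example is missing from the notes. One way to sketch the idea, assuming the toy data below and a hypothetical `_was_missing` suffix for the indicator columns, is to record the missing-value flags first, impute, and then attach the flags:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data standing in for the real splits (hypothetical values)
X_train = pd.DataFrame({"rooms": [3.0, 4.0, np.nan, 2.0], "area": [70.0, 95.0, 60.0, 48.0]})
X_valid = pd.DataFrame({"rooms": [2.0, np.nan], "area": [55.0, 80.0]})

# Columns that have at least one missing value in the training data
cols_with_missing = [col for col in X_train.columns if X_train[col].isnull().any()]

# Boolean flags marking which entries are about to be imputed
flags_train = X_train[cols_with_missing].isnull().add_suffix("_was_missing")
flags_valid = X_valid[cols_with_missing].isnull().add_suffix("_was_missing")

# Impute the original columns; the original frames are left untouched
my_imputer = SimpleImputer(strategy="mean")
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train),
                               columns=X_train.columns, index=X_train.index)
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid),
                               columns=X_valid.columns, index=X_valid.index)

# Attach the flags next to the imputed values
X_train_plus = pd.concat([imputed_X_train, flags_train], axis=1)
X_valid_plus = pd.concat([imputed_X_valid, flags_valid], axis=1)

print(X_train_plus)
```

The indicator column lets the model treat imputed rows differently from rows where the value was actually observed.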
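Categorical data
A categorical variable takes one of a limited set of values. For example, if you are asked which brand of car you own, the answer is "Volkswagen", "Toyota", "Mercedes-Benz", and so on.
Three ways to handle categorical data
Drop the variable: appropriate when it does not carry important information.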
# Get list of categorical variables
s = (X_train.dtypes == 'object')
object_cols = list(s[s].index)
print("Categorical variables:")
print(object_cols)
drop_X_train = X_train.select_dtypes(exclude=['object'])
drop_X_valid = X_valid.select_dtypes(exclude=['object'])
print("MAE from Approach 1 (Drop categorical variables):")
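print(score_dataset(drop_X_train, drop_X_valid, y_train, y_valid))
Ordinal encoding: assign a number to each category. Suitable for categories with a natural order, such as strength levels like "strong", "medium", "weak".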
from sklearn.preprocessing import OrdinalEncoder
# Make copy to avoid changing original data
label_X_train = X_train.copy()
label_X_valid = X_valid.copy()
# Apply ordinal encoder to each column with categorical data
ordinal_encoder = OrdinalEncoder()
label_X_train[object_cols] = ordinal_encoder.fit_transform(X_train[object_cols])
label_X_valid[object_cols] = ordinal_encoder.transform(X_valid[object_cols])
print("MAE from Approach 2 (Ordinal Encoding):")
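print(score_dataset(label_X_train, label_X_valid, y_train, y_valid))
With ordinal encoding, problems arise when a column in the validation or test data contains values that never appear in the training data, because the fitted encoder has no number for them. Such columns have to be identified and avoided.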
# Categorical columns in the training data
object_cols = [col for col in X_train.columns if X_train[col].dtype == "object"]
# Columns that can be safely ordinal encoded
good_label_cols = [col for col in object_cols if
                   set(X_valid[col]).issubset(set(X_train[col]))]
# Problematic columns that will be dropped from the dataset
bad_label_cols = list(set(object_cols)-set(good_label_cols))
print('Categorical columns that will be ordinal encoded:', good_label_cols)
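print('\nCategorical columns that will be dropped from the dataset:', bad_label_cols)
One-hot encoding: create one new column per category. In each row exactly one of these columns is 1 and all the others are 0. Suitable for unordered categories, i.e. nominal variables.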
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
# Apply one-hot encoder to each column with categorical data
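# `sparse_output` replaced the `sparse` argument in scikit-learn 1.2; on older versions use sparse=False
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)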
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[object_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[object_cols]))
# One-hot encoding removed index; put it back
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index
# Remove categorical columns (will replace with one-hot encoding)
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)
# Add one-hot encoded columns to numerical features
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)
# Ensure all columns have string type
OH_X_train.columns = OH_X_train.columns.astype(str)
OH_X_valid.columns = OH_X_valid.columns.astype(str)
print("MAE from Approach 3 (One-Hot Encoding):")
print(score_dataset(OH_X_train, OH_X_valid, y_train, y_valid))
Pipelines (batch processing)
import pandas as pd
See below:
from sklearn.compose import ColumnTransformer
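Only the first import of each example survives, so the following is a sketch of how a preprocessing-plus-model pipeline can be bundled with `ColumnTransformer` and `Pipeline`, not the original code. The toy data, the column names `rooms` and `brand`, and the choice of `RandomForestRegressor` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data with one numerical and one categorical column (hypothetical values)
X_train = pd.DataFrame({
    "rooms": [3.0, 4.0, np.nan, 2.0, 5.0, 3.0],
    "brand": ["a", "b", "a", "a", "b", "a"],
})
y_train = pd.Series([200.0, 310.0, 180.0, 150.0, 400.0, 210.0])
X_valid = pd.DataFrame({"rooms": [np.nan, 4.0], "brand": ["b", "c"]})
y_valid = pd.Series([300.0, 320.0])

numerical_cols = ["rooms"]
categorical_cols = ["brand"]

# Numerical columns: fill missing values with the column mean
numerical_transformer = SimpleImputer(strategy="mean")

# Categorical columns: fill missing values, then one-hot encode;
# categories unseen during fit (such as "c") become all-zero rows
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# Route each group of columns to its own transformer
preprocessor = ColumnTransformer(transformers=[
    ("num", numerical_transformer, numerical_cols),
    ("cat", categorical_transformer, categorical_cols),
])

# Bundle preprocessing and model so they are fitted and applied together
my_pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("model", RandomForestRegressor(n_estimators=50, random_state=0)),
])

my_pipeline.fit(X_train, y_train)
preds = my_pipeline.predict(X_valid)
mae = mean_absolute_error(y_valid, preds)
print("MAE:", mae)
```

Because the preprocessing lives inside the pipeline, the raw validation frame can be passed straight to `predict` and no transformation step can be forgotten.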
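Data leakage
Data leakage occurs when the training data includes information about the target that will not be available at prediction time. The model then performs well on the training and validation sets but poorly in real use or on the test set.
Target leakage: any variable that changes as a consequence of the target value should be dropped. For example, when predicting whether someone has a cold, a variable such as "took cold medicine" should be excluded.
Train-test contamination: the validation or test data influences the preprocessing, for example when the preprocessing is adjusted according to the test results.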
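One common way to guard against train-test contamination is to keep the preprocessing inside a pipeline and evaluate it with cross-validation, so that the imputer is re-fitted on the training part of each fold only. The sketch below assumes synthetic data and a plain `LinearRegression` model; neither comes from the original notes.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic data with some missing values (hypothetical, for illustration only)
rng = np.random.default_rng(0)
X = pd.DataFrame({"x1": rng.normal(size=30), "x2": rng.normal(size=30)})
X.loc[::5, "x1"] = np.nan
y = 2.0 * X["x2"] + rng.normal(scale=0.1, size=30)

# The imputer is part of the pipeline, so in every fold it only sees
# the training portion before being applied to the held-out portion
pipeline = make_pipeline(SimpleImputer(strategy="mean"), LinearRegression())

# scikit-learn reports negative MAE, so flip the sign
scores = -1 * cross_val_score(pipeline, X, y, cv=5,
                              scoring="neg_mean_absolute_error")
print("MAE per fold:", scores)
```

Fitting the imputer on the full data before splitting would let the held-out rows influence the fill values, which is exactly the contamination described above.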