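Machine Learning Overview

Machine learning (ML) is an important branch of artificial intelligence. It studies how computers can automatically learn regularities and patterns from data and use that knowledge to make predictions or decisions about new data. The core idea is to use algorithms to extract features from data and build models, so that tasks can be handled intelligently without hand-writing explicit rules for every specific problem.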

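⭐⭐ Characteristics

  • Data-driven: machine learning depends on large amounts of high-quality data, and data quality directly affects model performance
  • Automatic learning: algorithms discover patterns and regularities in the data on their own, reducing manual intervention
  • Generalization: a well-trained model can make reasonably accurate predictions on data it has never seen
  • Iterative optimization: performance can keep improving with more data and further training
  • Broad applicability: covers classification, regression, clustering, dimensionality reduction, and other tasks
  • Compute-intensive: often needs substantial computing resources, especially for deep learning models
  • Black-box nature: some complex models, such as deep neural networks, are hard to interpret
  • Continuous evolution: models can be updated and improved as new data arrives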

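⭐⭐ Application Areas

  • Computer vision: image classification, object detection, face recognition, medical image analysis
  • Natural language processing: text classification, sentiment analysis, machine translation, question answering
  • Recommender systems: e-commerce product recommendation, video recommendation, personalized news feeds
  • Financial risk control: credit scoring, fraud detection, stock prediction, risk assessment
  • Healthcare: disease diagnosis, drug discovery, genetic analysis, health prediction
  • Autonomous driving: environment perception, path planning, behavior prediction, decision and control
  • Speech: speech-to-text, speaker recognition, voice assistants, speech synthesis
  • Industrial manufacturing: quality inspection, failure prediction, process optimization, supply chain management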

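Basic Concepts

💗💗 Understanding the basic terms and concepts of machine learning is the key to getting started.

Dataset Splitting

  • Training set: the data used to fit the model, typically about 60-80% of the total
  • Validation set: used to tune hyperparameters and select a model, typically about 10-20%
  • Test set: used only for the final performance evaluation, typically about 10-20%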
from sklearn.model_selection import train_test_split

# Hold out 20% of the data as the test set (X and y are assumed to be defined)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Split the remaining 80% again: 20% of it (16% of the total) becomes the validation set
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42
)
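Because the second split is applied to what is left after the first, the final proportions are not 80/20/20. The sketch below only does the arithmetic for a two-stage split; `nested_split_fractions` is a hypothetical helper for illustration, not part of scikit-learn.

```python
def nested_split_fractions(test_size, val_size):
    """Return (train, val, test) fractions of the full dataset for a two-stage split."""
    remaining = 1.0 - test_size      # share left after removing the test set
    val = remaining * val_size       # validation share is taken from the remainder
    train = remaining - val
    return train, val, test_size


train_frac, val_frac, test_frac = nested_split_fractions(0.2, 0.2)
print(f"train={train_frac:.2f}, val={val_frac:.2f}, test={test_frac:.2f}")
```

With `test_size=0.2` in both calls, the data ends up split roughly 64% / 16% / 20%, which falls inside the ranges listed above.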

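Features and Labels

  • Features: the input variables that describe the attributes of a sample
  • Labels: the output variable, i.e. the target value to predict
  • Samples: individual data instances
  • Feature engineering: the process of extracting useful features from raw data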
import pandas as pd

# Load the data
data = pd.read_csv('dataset.csv')

# Separate features and labels
features = data[['age', 'income', 'education']]  # features
labels = data['purchase']                        # labels

print(f"Feature shape: {features.shape}")
print(f"Label shape: {labels.shape}")

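Model Types

💗💗 By learning style, machine learning is mainly divided into the following categories.

Type | Description | Typical algorithms
Supervised learning | Labeled data; learns a mapping from inputs to outputs | Linear regression, decision trees, SVM
Unsupervised learning | Unlabeled data; discovers the data's internal structure | K-Means, PCA, DBSCAN
Semi-supervised learning | A small amount of labeled data plus a large amount of unlabeled data | Self-training, co-training
Reinforcement learning | Learns an optimal policy by interacting with an environment | Q-Learning, DQN, PPO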

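Workflow

⭐⭐ A typical machine learning project includes the following steps.

1. Problem Definition

Clarify the business goal and decide whether the problem is classification, regression, or clustering.

# Example: house price prediction (a regression problem)
# Goal: predict the price from the house's features
# Input: area, location, age of the building, etc.
# Output: price (a continuous value)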

2. Data Collection

Gather relevant data from the available sources.

# Example data sources
# - CSV/Excel files
# - Database queries
# - API endpoints
# - Web scraping
# - Sensor data

import pandas as pd
df = pd.read_csv('housing_data.csv')
print(df.head())

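3. Data Preprocessing

Clean and transform the raw data so that it is suitable for model training.

import pandas as pd

# Handle missing values (each call returns a new DataFrame; pick ONE strategy and assign it)
df = df.dropna()                                 # drop rows with missing values
# df = df.fillna(0)                              # or fill with 0
# df = df.fillna(df.mean(numeric_only=True))     # or fill numeric columns with their mean

# Handle outliers with the 1.5 * IQR rule, on numeric columns only
num = df.select_dtypes(include='number')
Q1 = num.quantile(0.25)
Q3 = num.quantile(0.75)
IQR = Q3 - Q1
df = df[~((num < (Q1 - 1.5 * IQR)) | (num > (Q3 + 1.5 * IQR))).any(axis=1)]

# Convert data types (assumes 'date' and 'category' columns exist)
df['date'] = pd.to_datetime(df['date'])
df['category'] = df['category'].astype('category')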
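To make the outlier rule concrete, here is a minimal standard-library sketch of the same 1.5 * IQR fences on a single column. The `prices` list is made-up illustration data, and `iqr_bounds` is a hypothetical helper; the `"inclusive"` quantile method is chosen because it interpolates linearly between data points.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences of the k * IQR outlier rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


prices = [10, 12, 11, 13, 12, 11, 95]  # hypothetical data; 95 is an obvious outlier
lower, upper = iqr_bounds(prices)
kept = [p for p in prices if lower <= p <= upper]
print(f"fences=({lower}, {upper}), kept={kept}")
```

Values outside the fences are dropped; everything else is kept unchanged.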

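4. Feature Engineering

Extract, select, and transform features to improve model performance.

from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.feature_selection import SelectKBest, f_classif

# Feature combination: build ratio features from the raw values, before scaling
df['income_per_age'] = df['income'] / df['age']

# Standardize numeric features
scaler = StandardScaler()
numerical_features = ['age', 'income', 'score']
df[numerical_features] = scaler.fit_transform(df[numerical_features])

# Encode a categorical feature as integers
# (LabelEncoder is designed for targets; OneHotEncoder is often a better fit for input features)
le = LabelEncoder()
df['gender_encoded'] = le.fit_transform(df['gender'])

# Feature selection: keep the k best features
# (X and y are assumed to be defined; k must not exceed the number of features)
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)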

5. Model Training

Choose a suitable algorithm and train the model.

from sklearn.ensemble import RandomForestClassifier

# Choose a model
model = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model
model.fit(X_train, y_train)

# Inspect the fitted model
print(f"Model type: {type(model).__name__}")
print(f"Feature importances: {model.feature_importances_}")

6. Model Evaluation

Evaluate model performance with metrics that suit the task.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Predict on the held-out test set
y_pred = model.predict(X_test)

# Evaluation metrics ('weighted' averages per-class scores by class frequency)
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 score: {f1:.4f}")
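To see what these four numbers mean, the sketch below computes them by hand for a binary problem from true positives, false positives, and false negatives. It is a simplified illustration with made-up labels; it does not reproduce the `average='weighted'` behaviour used above, and `binary_metrics` is a hypothetical helper.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0   # of predicted positives, how many are right
    recall = tp / (tp + fn) if tp + fn else 0.0      # of actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1


y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]  # hypothetical ground truth
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0]  # hypothetical predictions
acc, prec, rec, f1_demo = binary_metrics(y_true_demo, y_pred_demo)
print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f} f1={f1_demo:.2f}")
```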

7. Model Optimization

Tune hyperparameters to improve model performance.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 20, None],
    'min_samples_split': [2, 5, 10]
}

# Grid search with 5-fold cross-validation
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

grid_search.fit(X_train, y_train)

# Best parameters and the corresponding mean cross-validation score
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.4f}")

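8. Model Deployment

Put the trained model to work in a production environment.

import joblib

# Save the model
joblib.dump(model, 'model.pkl')

# Load the model
loaded_model = joblib.load('model.pkl')

# Predict on new data (the row must have the same features, in the same order, as the training data)
new_data = [[25, 50000, 1]]
prediction = loaded_model.predict(new_data)
print(f"Prediction: {prediction}")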

Supervised Learning

Linear Regression

Linear regression is the most basic regression algorithm and is used to predict continuous values.

pip install scikit-learn

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Generate sample data: y = 4 + 3x + Gaussian noise
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

# Create the model
model = LinearRegression()

# Train the model
model.fit(X, y)

# Predict
X_new = np.array([[0], [2]])
y_pred = model.predict(X_new)

# Evaluate (on the training data)
print(f"Intercept: {model.intercept_}")
print(f"Coefficient: {model.coef_}")
print(f"R² score: {r2_score(y, model.predict(X)):.4f}")
print(f"Mean squared error: {mean_squared_error(y, model.predict(X)):.4f}")
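For a single feature, the least-squares line has a closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the means. The sketch below illustrates that formula on noise-free made-up points; `fit_line` is a hypothetical helper, not how scikit-learn is implemented internally.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


xs_demo = [0.0, 1.0, 2.0, 3.0]
ys_demo = [4.0, 7.0, 10.0, 13.0]  # hypothetical points lying exactly on y = 4 + 3x
intercept_demo, slope_demo = fit_line(xs_demo, ys_demo)
print(f"intercept={intercept_demo:.2f}, slope={slope_demo:.2f}")
```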

Logistic Regression

Logistic regression is a classification algorithm, most often used for binary problems, that outputs class probabilities.

from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report

# Load the data
iris = load_iris()
X, y = iris.data, iris.target

# Keep only the first two classes for a binary classification demo
X_binary = X[y < 2]
y_binary = y[y < 2]

# Create the model
model = LogisticRegression(max_iter=200)

# Train
model.fit(X_binary, y_binary)

# Predict
y_pred = model.predict(X_binary)

# Evaluate (on the training data)
print(classification_report(y_binary, y_pred))
print(f"Training accuracy: {model.score(X_binary, y_binary):.4f}")
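The probability output comes from passing a linear score through the sigmoid function. A minimal sketch of that prediction step follows; the weights and bias are made-up values for illustration, not coefficients learned from the iris data.

```python
import math


def sigmoid(z):
    """Map any real-valued score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba_one(features, weights, bias):
    """Probability of the positive class for one sample."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


weights_demo = [2.0, -1.0]   # hypothetical coefficients
bias_demo = -0.5             # hypothetical intercept
p = predict_proba_one([1.0, 1.5], weights_demo, bias_demo)  # linear score is exactly 0
label = int(p >= 0.5)        # threshold the probability at 0.5
print(f"probability={p:.2f}, label={label}")
```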

Decision Tree

A decision tree performs classification or regression through a tree of if/else splits on feature values.

from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.datasets import load_wine

# Load the data
wine = load_wine()
X, y = wine.data, wine.target

# Create the model
model = DecisionTreeClassifier(max_depth=3, random_state=42)

# Train
model.fit(X, y)

# Print the learned decision rules
tree_rules = export_text(model, feature_names=wine.feature_names)
print(tree_rules)

# Predict and score (on the training data)
y_pred = model.predict(X)
print(f"Training accuracy: {model.score(X, y):.4f}")
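Each split in a classification tree is chosen to make the child nodes purer. `DecisionTreeClassifier` uses Gini impurity as its default criterion; the sketch below shows how that quantity is computed for a node and for a candidate split. The helper names and the toy labels are assumptions for illustration only.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_gini(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


parent = [0, 0, 1, 1]
print(f"parent impurity: {gini(parent):.2f}")
print(f"perfect split:   {split_gini([0, 0], [1, 1]):.2f}")
print(f"useless split:   {split_gini([0, 1], [0, 1]):.2f}")
```

A split that separates the classes completely drives the impurity to zero, while a split that leaves both children mixed does not improve on the parent.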

Support Vector Machine (SVM)

An SVM classifies by finding the separating hyperplane with the largest margin.

from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# Generate data
X, y = make_classification(n_samples=300, n_features=2,
                           n_redundant=0, random_state=42)

# Create the model
model = SVC(kernel='rbf', C=1.0, gamma='scale')

# Cross-validation
scores = cross_val_score(model, X, y, cv=5)
print(f"Cross-validation scores: {scores}")
print(f"Mean score: {scores.mean():.4f} (+/- {scores.std() * 2:.4f})")

# Train and predict
model.fit(X, y)
y_pred = model.predict(X)
print(f"Training accuracy: {model.score(X, y):.4f}")

Random Forest

A random forest is an ensemble method that combines many decision trees.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer

# Load the data
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target

# Create the model
model = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    random_state=42,
    n_jobs=-1
)

# Train
model.fit(X, y)

# Feature importances
feature_importance = pd.Series(
    model.feature_importances_,
    index=cancer.feature_names
).nlargest(10)

print("Top 10 most important features:")
print(feature_importance)

# Predict and score (on the training data)
y_pred = model.predict(X)
print(f"Training accuracy: {model.score(X, y):.4f}")

K-Nearest Neighbors (KNN)

KNN classifies a sample by the labels of its nearest neighbors under a distance metric.

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Load the data
iris = load_iris()
X, y = iris.data, iris.target

# Standardize (distances are sensitive to feature scale)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Create the model
model = KNeighborsClassifier(n_neighbors=5)

# Train
model.fit(X_scaled, y)

# Predict and score (on the training data)
y_pred = model.predict(X_scaled)
print(f"Training accuracy: {model.score(X_scaled, y):.4f}")

# Predict a new sample (apply the same scaler first)
new_sample = scaler.transform([[5.1, 3.5, 1.4, 0.2]])
prediction = model.predict(new_sample)
print(f"Predicted class: {iris.target_names[prediction][0]}")
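The idea behind KNN fits in a few lines: measure the distance from the query to every training point, take the k closest, and let them vote. Here is a rough sketch with Euclidean distance and made-up 2-D points; `knn_predict` is a hypothetical helper, and a real implementation would use faster neighbor search.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points closest to query."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical training data: two well-separated groups
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
point_labels = ["a", "a", "a", "b", "b", "b"]

pred_near_a = knn_predict(points, point_labels, (1.1, 1.0))
pred_near_b = knn_predict(points, point_labels, (5.0, 5.0))
print(pred_near_a, pred_near_b)
```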

Unsupervised Learning

K-Means Clustering

K-Means partitions the data into K clusters.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

# Generate data
X, y_true = make_blobs(n_samples=300, centers=4,
                       cluster_std=0.60, random_state=0)

# Create the model
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)

# Fit
kmeans.fit(X)

# Assign each sample to a cluster
y_pred = kmeans.predict(X)

# Evaluate
print(f"Inertia (within-cluster sum of squares): {kmeans.inertia_:.2f}")
print(f"Cluster centers:\n{kmeans.cluster_centers_}")

# Visualize
plt.scatter(X[:, 0], X[:, 1], c=y_pred, s=50, cmap='viridis')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.75)
plt.title('K-Means Clustering')
plt.show()
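Under the hood, K-Means alternates two steps: assign every point to its nearest center, then move each center to the mean of its points. The sketch below runs that loop on one-dimensional made-up data; `kmeans_1d` and the starting centers are assumptions for illustration, and it omits the smarter initialization and convergence checks a library would use.

```python
def kmeans_1d(values, centers, n_iter=10):
    """Plain assign-then-update K-Means loop for 1-D data."""
    clusters = []
    for _ in range(n_iter):
        # Assignment step: each value joins the cluster of its nearest center
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: each center moves to the mean of its cluster
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters


values_demo = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]  # hypothetical data with two groups
final_centers, final_clusters = kmeans_1d(values_demo, centers=[0.0, 5.0])
inertia_demo = sum(
    (v - c) ** 2 for c, cluster in zip(final_centers, final_clusters) for v in cluster
)
print(final_centers, inertia_demo)
```

The `inertia_demo` value is the same within-cluster sum of squares that `inertia_` reports.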

Hierarchical Clustering

Hierarchical clustering builds a tree-shaped hierarchy of clusters.

from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt
import numpy as np

# Generate data
np.random.seed(42)
X = np.random.rand(50, 2)

# Create the model
model = AgglomerativeClustering(n_clusters=3)

# Fit and get cluster labels
clusters = model.fit_predict(X)

# Plot the dendrogram
linkage_matrix = linkage(X, method='ward')
plt.figure(figsize=(10, 5))
dendrogram(linkage_matrix)
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Sample Index')
plt.ylabel('Distance')
plt.show()

print(f"Cluster labels: {clusters}")

DBSCAN

DBSCAN is a density-based clustering algorithm that also marks low-density points as noise.

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Generate data
X, _ = make_blobs(n_samples=300, centers=4,
                  cluster_std=0.60, random_state=0)

# Add noise points (seeded so the result is reproducible)
rng = np.random.default_rng(42)
noise = rng.uniform(-10, 10, (20, 2))
X = np.vstack([X, noise])

# Standardize
X_scaled = StandardScaler().fit_transform(X)

# Create the model
dbscan = DBSCAN(eps=0.5, min_samples=5)

# Fit and get labels (-1 marks noise)
labels = dbscan.fit_predict(X_scaled)

# Summary
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = list(labels).count(-1)

print(f"Number of clusters found: {n_clusters}")
print(f"Number of noise points: {n_noise}")

PCA Dimensionality Reduction

Principal component analysis (PCA) is used for dimensionality reduction and feature extraction.

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import load_digits

# Load the data
digits = load_digits()
X, y = digits.data, digits.target

# Create the PCA model
pca = PCA(n_components=2)

# Reduce to two dimensions
X_pca = pca.fit_transform(X)

# Explained variance
print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
print(f"Cumulative explained variance: {sum(pca.explained_variance_ratio_):.4f}")

# Visualize
plt.figure(figsize=(10, 8))
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='tab10')
plt.colorbar(scatter)
plt.title('PCA Dimensionality Reduction')
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.show()

Model Evaluation

Classification Metrics

Confusion Matrix

import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Generate predictions (model, X_test and y_test are assumed to exist)
y_pred = model.predict(X_test)

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion matrix:")
print(cm)

# Visualize
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot(cmap='Blues')
plt.title('Confusion Matrix')
plt.show()

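Detailed Classification Report

from sklearn.metrics import classification_report

# Classification report (target_names must list one name per class; two classes are assumed here)
report = classification_report(y_test, y_pred, target_names=['Class 0', 'Class 1'])
print(report)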

ROC Curve and AUC

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# Binary classification: use the predicted probability of the positive class
# (the model must support predict_proba)
y_scores = model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_scores)
roc_auc = auc(fpr, tpr)

# Plot the ROC curve
plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2,
         label=f'ROC curve (area = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()

Regression Metrics

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Regression evaluation (y_test and y_pred come from a regression model)
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print(f"Mean absolute error (MAE): {mae:.4f}")
print(f"Mean squared error (MSE): {mse:.4f}")
print(f"Root mean squared error (RMSE): {rmse:.4f}")
print(f"R² score: {r2:.4f}")

Cross-Validation

from sklearn.model_selection import cross_val_score, KFold

# K-fold cross-validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Run cross-validation (model, X and y are assumed to exist)
scores = cross_val_score(model, X, y, cv=kf, scoring='accuracy')

print(f"Score per fold: {scores}")
print(f"Mean score: {scores.mean():.4f}")
print(f"Standard deviation: {scores.std():.4f}")
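K-fold cross-validation cuts the sample indices into K blocks and uses each block once as the test fold while training on the rest. The sketch below generates those index sets without shuffling; `k_fold_indices` is a hypothetical helper written for illustration, not the `KFold` implementation.

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train_indices, test_indices) for each fold, without shuffling."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [
        n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
        for i in range(n_splits)
    ]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(10, 5))
for fold_number, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"fold {fold_number}: test={test_idx}")
```

Every sample lands in exactly one test fold, so each one is used for evaluation once and for training K-1 times.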

Feature Engineering

Data Scaling

MinMaxScaler

from sklearn.preprocessing import MinMaxScaler

# Rescale each feature to the range [0, 1]
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

print(f"Minimum: {X_scaled.min(axis=0)}")
print(f"Maximum: {X_scaled.max(axis=0)}")

StandardScaler

from sklearn.preprocessing import StandardScaler

# Standardize each feature to mean 0 and variance 1
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(f"Mean: {X_scaled.mean(axis=0)}")
print(f"Standard deviation: {X_scaled.std(axis=0)}")
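Both scalers apply a simple per-feature formula: min-max scaling computes (x - min) / (max - min), and standardization computes (x - mean) / std using the population standard deviation. Here is a small hand-rolled sketch on one made-up column; the helper names are assumptions for illustration.

```python
import statistics


def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Shift to mean 0 and scale to unit population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


column = [2.0, 4.0, 6.0, 8.0]  # hypothetical feature column
scaled_01 = min_max_scale(column)
z_scores = standardize(column)
print(scaled_01)
print(z_scores)
```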

Categorical Encoding

One-Hot Encoding

from sklearn.preprocessing import OneHotEncoder
import pandas as pd

# Create sample data
df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'red'],
    'size': ['S', 'M', 'L', 'M']
})

# One-hot encode (sparse_output requires scikit-learn 1.2 or later)
encoder = OneHotEncoder(sparse_output=False)
encoded = encoder.fit_transform(df)

# Get the generated feature names
feature_names = encoder.get_feature_names_out(['color', 'size'])
df_encoded = pd.DataFrame(encoded, columns=feature_names)

print(df_encoded)
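One-hot encoding turns each category into its own 0/1 column, with exactly one 1 per row. The following sketch does this by hand for a single column; `one_hot` is a hypothetical helper, and ordering the categories alphabetically is an assumption made here for determinism.

```python
def one_hot(values):
    """Return (sorted categories, one 0/1 row per input value)."""
    categories = sorted(set(values))
    column_of = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[column_of[value]] = 1   # switch on the column for this category
        rows.append(row)
    return categories, rows


colors_demo = ['red', 'blue', 'green', 'red']
color_categories, color_rows = one_hot(colors_demo)
print(color_categories)
for row in color_rows:
    print(row)
```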

Label Encoding

from sklearn.preprocessing import LabelEncoder

# Label encoding: map each category to an integer
le = LabelEncoder()
colors = ['red', 'blue', 'green', 'red', 'blue']
encoded = le.fit_transform(colors)

print(f"Original values: {colors}")
print(f"Encoded values: {encoded}")
print(f"Category mapping: {dict(zip(le.classes_, range(len(le.classes_))))}")

Feature Selection

Correlation-Based Selection

import matplotlib.pyplot as plt
import seaborn as sns

# Compute the correlation matrix (numeric columns only)
corr_matrix = df.corr(numeric_only=True)

# Visualize
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Feature Correlation Matrix')
plt.show()

# Find features that are highly correlated with an earlier feature;
# these are largely redundant and are candidates for removal
threshold = 0.8
high_corr = []
for i in range(len(corr_matrix.columns)):
    for j in range(i):
        if abs(corr_matrix.iloc[i, j]) > threshold:
            high_corr.append(corr_matrix.columns[i])

print(f"Highly correlated features: {set(high_corr)}")

Model-Based Selection

from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier

# Use a random forest's importances to select features
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=42),
    threshold='median'
)

X_selected = selector.fit_transform(X, y)
selected_mask = selector.get_support()

print(f"Number of selected features: {X_selected.shape[1]}")
print(f"Feature mask: {selected_mask}")

Hyperparameter Tuning

Grid Search

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Parameter grid (gamma has no effect when kernel='linear')
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [1, 0.1, 0.01, 0.001],
    'kernel': ['rbf', 'linear']
}

# Grid search
grid_search = GridSearchCV(
    SVC(),
    param_grid,
    refit=True,
    verbose=2,
    cv=5,
    n_jobs=-1
)

grid_search.fit(X_train, y_train)

# Results
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.4f}")

# Predict with the best model
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

Random Search

from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC
from scipy.stats import uniform

# Parameter distributions (scipy's uniform(loc, scale) samples from [loc, loc + scale])
param_distributions = {
    'C': uniform(0.1, 100),
    'gamma': uniform(0.001, 1),
    'kernel': ['rbf', 'linear']
}

# Random search
random_search = RandomizedSearchCV(
    SVC(),
    param_distributions,
    n_iter=100,
    cv=5,
    random_state=42,
    n_jobs=-1,
    verbose=1
)

random_search.fit(X_train, y_train)

print(f"Best parameters: {random_search.best_params_}")
print(f"Best score: {random_search.best_score_:.4f}")

Bayesian Optimization

pip install scikit-optimize

from skopt import BayesSearchCV
from sklearn.svm import SVC

# Bayesian optimization over the search space
bayes_search = BayesSearchCV(
    SVC(),
    {
        'C': (1e-6, 1e+6, 'log-uniform'),
        'gamma': (1e-6, 1e+1, 'log-uniform'),
        'kernel': ['rbf', 'linear']
    },
    n_iter=50,
    cv=5,
    random_state=42,
    n_jobs=-1
)

bayes_search.fit(X_train, y_train)

print(f"Best parameters: {bayes_search.best_params_}")
print(f"Best score: {bayes_search.best_score_:.4f}")

Common Libraries

Scikit-learn

Scikit-learn is the most popular library for classical machine learning.

pip install scikit-learn

import sklearn
print(f"Scikit-learn version: {sklearn.__version__}")

# End-to-end example
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Create a pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', SVC())
])

# Train and predict
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

print(f"Accuracy: {pipeline.score(X_test, y_test):.4f}")

XGBoost

XGBoost is an efficient gradient boosting library.

pip install xgboost

import matplotlib.pyplot as plt
import xgboost as xgb
from sklearn.metrics import accuracy_score

# Create DMatrix objects
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

# Set parameters (num_class must match the number of classes in the data; 3 is assumed here)
params = {
    'max_depth': 6,
    'eta': 0.1,
    'objective': 'multi:softmax',
    'num_class': 3,
    'eval_metric': 'mlogloss'
}

# Train
model = xgb.train(params, dtrain, num_boost_round=100)

# Predict ('multi:softmax' returns class labels directly)
y_pred = model.predict(dtest)

# Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")

# Feature importance
xgb.plot_importance(model)
plt.title('Feature Importance')
plt.show()

LightGBM

LightGBM is a fast gradient boosting framework developed by Microsoft.

pip install lightgbm

import numpy as np
import lightgbm as lgb
from sklearn.metrics import accuracy_score

# Create datasets
train_data = lgb.Dataset(X_train, label=y_train)
test_data = lgb.Dataset(X_test, label=y_test, reference=train_data)

# Set parameters (num_class must match the number of classes in the data; 3 is assumed here)
params = {
    'objective': 'multiclass',
    'num_class': 3,
    'metric': 'multi_logloss',
    'boosting_type': 'gbdt',
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9
}

# Train with early stopping
# (the test set doubles as the validation set here for brevity; use a separate validation set in practice)
model = lgb.train(
    params,
    train_data,
    num_boost_round=100,
    valid_sets=[test_data],
    callbacks=[lgb.early_stopping(10), lgb.log_evaluation(10)]
)

# Predict: the model returns class probabilities, so take the argmax
y_pred = model.predict(X_test)
y_pred_labels = np.argmax(y_pred, axis=1)

accuracy = accuracy_score(y_test, y_pred_labels)
print(f"Accuracy: {accuracy:.4f}")

CatBoost

CatBoost is a gradient boosting library developed by Yandex with built-in handling of categorical features.

pip install catboost

from catboost import CatBoostClassifier
from sklearn.metrics import accuracy_score

# Create the model
model = CatBoostClassifier(
    iterations=100,
    learning_rate=0.1,
    depth=6,
    loss_function='MultiClass',
    verbose=False,
    random_seed=42
)

# Train
model.fit(X_train, y_train)

# Predict (flatten in case the labels come back as a column)
y_pred = model.predict(X_test).ravel()

# Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")

# Feature importance
feature_importance = model.get_feature_importance()
print(f"Feature importances: {feature_importance}")

Practical Examples

Iris Classification

A classic classification example.

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Load the data
iris = load_iris()
X, y = iris.data, iris.target

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Create the model
model = RandomForestClassifier(n_estimators=100, random_state=42)

# Train
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate
print("Classification report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))

# Feature importances
feature_importance = pd.Series(
    model.feature_importances_,
    index=iris.feature_names
).sort_values(ascending=False)

print("\nFeature importances:")
print(feature_importance)

California Housing Price Prediction

A regression example. The code uses scikit-learn's California housing dataset; the older Boston housing dataset was removed from scikit-learn in version 1.2.

import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Load the data
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Create the model
model = Ridge(alpha=1.0)

# Train
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print(f"Root mean squared error (RMSE): {rmse:.4f}")
print(f"R² score: {r2:.4f}")

# Coefficients
coefficients = pd.Series(model.coef_, index=housing.feature_names)
print("\nRegression coefficients:")
print(coefficients.sort_values(ascending=False))

Customer Segmentation

A clustering example.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Create sample data (random values, so no real cluster structure is expected)
np.random.seed(42)
customers = pd.DataFrame({
    'age': np.random.randint(18, 70, 200),
    'income': np.random.randint(20000, 100000, 200),
    'spending_score': np.random.randint(1, 100, 200)
})

# Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(customers)

# Elbow method for choosing K
inertias = []
K_range = range(1, 11)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_scaled)
    inertias.append(kmeans.inertia_)

# Plot the elbow curve
plt.plot(K_range, inertias, 'bo-')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()

# Use the K read off the elbow plot (4 is used here as an example)
optimal_k = 4
kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
clusters = kmeans.fit_predict(X_scaled)

# Attach the cluster labels
customers['cluster'] = clusters

# Profile each cluster by its feature means
cluster_analysis = customers.groupby('cluster').mean()
print("Per-cluster feature means:")
print(cluster_analysis)

# Visualize
plt.figure(figsize=(10, 8))
scatter = plt.scatter(
    customers['age'],
    customers['income'],
    c=customers['cluster'],
    cmap='viridis',
    alpha=0.6
)
plt.colorbar(scatter)
plt.xlabel('Age')
plt.ylabel('Income')
plt.title('Customer Segmentation')
plt.show()

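Common Problems

Overfitting and Underfitting

Overfitting

The model performs very well on the training set but poorly on the test set.

Symptoms

  • High training accuracy, low test accuracy
  • Low training loss, high test loss

Solutions

from sklearn.linear_model import Ridge, Lasso
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# 1. Add regularization
model = Ridge(alpha=1.0)  # L2 regularization
model = Lasso(alpha=0.1)  # L1 regularization

# 2. Reduce model complexity
model = DecisionTreeClassifier(max_depth=5)  # limit the tree depth

# 3. Add training data
# Collect more data or use data augmentation

# 4. Dropout (deep learning)
# Add Dropout layers to the neural network

# 5. Early stopping (iterative models): stop training once the validation score stops improving
# Cross-validation does not stop training, but it helps detect overfitting:
scores = cross_val_score(model, X, y, cv=5)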

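Underfitting

The model performs poorly on both the training set and the test set.

Symptoms

  • Both training and test accuracy are low
  • Both training and test loss are high

Solutions

from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeClassifier

# 1. Increase model complexity
model = DecisionTreeClassifier(max_depth=None)  # no depth limit

# 2. Add features
# Add more meaningful features

# 3. Reduce regularization
model = Ridge(alpha=0.01)  # weaker regularization

# 4. Train longer
# Increase the number of training iterations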

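Imbalanced Data

SMOTE Oversampling

pip install imbalanced-learn

from imblearn.over_sampling import SMOTE
from collections import Counter

# Check the class distribution
# (resample only the training data; resampling before the train/test split leaks synthetic samples into the test set)
print("Original distribution:", Counter(y))

# SMOTE oversampling
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)

print("Distribution after resampling:", Counter(y_resampled))

# Train on the resampled data (model is assumed to be an unfitted estimator)
model.fit(X_resampled, y_resampled)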

Class Weights

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_class_weight

# Compute class weights
class_weights = compute_class_weight(
    'balanced',
    classes=np.unique(y),
    y=y
)

class_weight_dict = dict(zip(np.unique(y), class_weights))
print("Class weights:", class_weight_dict)

# Use them in a model
model = RandomForestClassifier(
    class_weight='balanced',  # compute the weights automatically
    random_state=42
)
model.fit(X, y)
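The 'balanced' heuristic weights each class by n_samples / (n_classes * class_count), so rarer classes count for more. The sketch below applies that formula by hand to made-up labels; `balanced_class_weights` is a hypothetical helper for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


y_imbalanced = [0] * 90 + [1] * 10  # hypothetical 90/10 class imbalance
weights_by_class = balanced_class_weights(y_imbalanced)
print(weights_by_class)
```

Here the minority class receives a weight nine times larger than the majority class, which offsets its nine-times-smaller sample count.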

The Importance of Feature Scaling

from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.svm import SVC

# Compare different scaling methods
scalers = {
    'StandardScaler': StandardScaler(),
    'MinMaxScaler': MinMaxScaler()
}

for name, scaler in scalers.items():
    # Fit the scaler inside each CV fold so test-fold statistics do not leak into training
    model = make_pipeline(scaler, SVC())
    scores = cross_val_score(model, X, y, cv=5)

    print(f"{name}: {scores.mean():.4f} (+/- {scores.std() * 2:.4f})")

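Learning Resources

  • Videos

    • Complete machine learning video course: https://www.bilibili.com/video/BV1Fzszz4Ek7
  • Books

    • Pattern Recognition and Machine Learning - Christopher Bishop
    • The Elements of Statistical Learning - Trevor Hastie et al.
    • Hands-On Machine Learning with Scikit-Learn - Aurélien Géron
    • 机器学习 (Machine Learning, known as the "Watermelon Book") - Zhou Zhihua
    • 统计学习方法 (Statistical Learning Methods) - Li Hang
  • Practice platforms

    • Kaggle: data science competitions and datasets
    • UCI Machine Learning Repository: a repository of classic datasets
    • Google Colab: a free cloud environment with GPUs
    • Papers with Code: recent papers with code implementations
  • Communities and forums

    • Stack Overflow: technical Q&A
    • Reddit - r/MachineLearning: machine learning discussion community
    • Zhihu - machine learning topic: Chinese-language machine learning discussion
    • GitHub: open-source projects and code sharing
  • Blogs and news

    • Towards Data Science: a data science publication on Medium
    • Machine Learning Mastery: Jason Brownlee's blog
    • Distill.pub: visual, interactive machine learning articles
    • AI conference papers: NeurIPS, ICML, ICLR, and other top venues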