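Common single-feature evaluation metrics (intuitive explanation) and their formulas:
Notes from day-to-day work: when analyzing whether a third-party service has an effect on the current business, I mainly judge it with the following metrics:
Metric definitions: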
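- Hit rate / coverage: for a prepared data sample, look up the corresponding third-party feature; the hit rate is the percentage of prepared samples that exist in the third-party feature library.
- Good sample: a sample that matches the preset condition is a good sample.
- Precision
- Recall
- Disturb rate (disturb): in the code below, the good samples in a score bin divided by all good samples
- Population stability index (PSI)
- Information value (IV)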
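The hit rate defined above reduces to a set-membership count. A minimal sketch, assuming the prepared sample and the third-party feature library can both be represented as collections of keys; the function name `hit_rate` and the sample keys are hypothetical and only serve as illustration:

```python
def hit_rate(sample_keys, library_keys):
    """Fraction of prepared sample keys that exist in the third-party library."""
    sample_keys = list(sample_keys)
    if not sample_keys:
        raise ValueError("sample_keys must not be empty")
    library = set(library_keys)  # set lookup keeps the membership test O(1)
    hits = sum(1 for key in sample_keys if key in library)
    return hits / len(sample_keys)


# Hypothetical keys: 3 of the 4 prepared samples are found in the library.
prepared = ["u001", "u002", "u003", "u004"]
third_party = ["u002", "u003", "u004", "u999"]
print(hit_rate(prepared, third_party))  # 0.75
```

A low hit rate means the remaining metrics are computed on only a small part of the sample, so it is worth checking first.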
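Metric formulas (P = TP + FN is the number of actual positives, N = FP + TN is the number of actual negatives):
- Precision: TP / (TP + FP), i.e. TP divided by the number of samples predicted positive
- Recall: TP / (TP + FN) = TP / P
- True positive rate: TPR = TP / (TP + FN) = TP / P (sensitivity; identical to recall)
- False positive rate: FPR = FP / (FP + TN) = FP / N (equal to 1 - specificity, where specificity = TN / (FP + TN))
- Accuracy: Acc = (TP + TN) / (P + N)
- F-measure: 2 * recall * precision / (recall + precision)
- ROC curve: FPR on the horizontal axis, TPR on the vertical axis
- PR curve: recall on the horizontal axis, precision on the vertical axis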
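The formulas above can be checked on a small hand-made example. The following is a sketch, not a reference implementation: the helper names (`confusion_counts`, `basic_metrics`, `roc_points`) and the toy labels and scores are made up for illustration, and returning 0.0 when a denominator is zero is an assumed convention.

```python
def confusion_counts(labels, predictions):
    """Count TP, FP, TN, FN for 0/1 labels and 0/1 predictions."""
    tp = fp = tn = fn = 0
    for y, p in zip(labels, predictions):
        if p == 1 and y == 1:
            tp += 1
        elif p == 1 and y == 0:
            fp += 1
        elif p == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def _ratio(num, den):
    # Assumed convention: an undefined ratio (zero denominator) is reported as 0.0.
    return num / den if den else 0.0


def basic_metrics(labels, predictions):
    tp, fp, tn, fn = confusion_counts(labels, predictions)
    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)  # same value as TPR / sensitivity
    return {
        "precision": precision,
        "recall": recall,
        "fpr": _ratio(fp, fp + tn),
        "specificity": _ratio(tn, fp + tn),  # 1 - FPR
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "f1": _ratio(2 * recall * precision, recall + precision),
    }


def roc_points(labels, scores):
    """(FPR, TPR) pairs obtained by thresholding at each distinct score."""
    points = []
    for threshold in sorted(set(scores), reverse=True):
        predictions = [1 if s >= threshold else 0 for s in scores]
        m = basic_metrics(labels, predictions)
        points.append((m["fpr"], m["recall"]))
    return points


# Toy data: 4 positives and 4 negatives.
labels = [1, 1, 1, 0, 0, 0, 0, 1]
predictions = [1, 1, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.6]

print(basic_metrics(labels, predictions))
print(roc_points(labels, scores))
```

Here TP = 2, FP = 1, TN = 3 and FN = 2, so precision is 2/3, recall is 2/4, FPR is 1/4 and accuracy is 5/8. Plotting the pairs returned by `roc_points` with FPR on the horizontal axis gives the ROC curve; collecting (recall, precision) pairs in the same loop would give the PR curve.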
[Figure: parameter relationship diagram]
[Figure: simplified parameter relationship diagram]
Code implementation:
import numpy as np
import pandas as pd


def cal_score_distribution(labels, score, n=0):
    '''
    Compute per-bin and cumulative metrics for a single score/feature.
    labels: list of 0/1 labels (1 = bad, 0 = good)
    score: list of scores
    n: int > 0 -> number of equal-width bins; list -> explicit bin edges;
       anything else -> bins of width 2 covering the score range
    '''
    # Display names applied to the result columns before returning
    re_col = {'precision': 'Precision',
              'cum_precision': 'Cumulative precision',
              'recall': 'Recall',
              'cum_recall': 'Cumulative recall',
              'disturb': 'Disturb rate',
              'cum_disturb': 'Cumulative disturb rate',
              'good_rate': 'Good sample share',
              'bad_rate': 'Bad sample share',
              'total': 'Samples in bin',
              'total_rate': 'Share of total',
              'pred': 'Score bin',
              'good': 'Good samples',
              'bad': 'Bad samples',
              'cum_good': 'Cumulative good samples',
              'cum_bad': 'Cumulative bad samples',
              'sum': 'Cumulative total'
              }
    labels = np.asarray(labels)
    score = np.asarray(score, dtype=float)
    if isinstance(n, int) and n > 0:
        # n equal-width bins between the smallest and largest score
        bins = np.linspace(score.min(), score.max(), n + 1)
    elif isinstance(n, (list, tuple, np.ndarray)):
        bins = np.asarray(n)
    else:
        # Width-2 bins; the upper edge is rounded up so the maximum score is covered
        lo = score.min() // 2 * 2
        hi = max(np.ceil(score.max() / 2) * 2, lo + 2)
        bins = np.arange(lo, hi + 2, 2)
    # include_lowest keeps scores equal to the first edge inside the first bin
    pred, bins = pd.cut(score, bins, retbins=True, include_lowest=True)
    ksds = pd.DataFrame({'bad': labels, 'pred': pred})  # label 1 = bad, 0 = good
    ksds['good'] = 1 - ksds['bad']
    df_gp = ksds.groupby('pred', observed=False).agg({'good': 'sum', 'bad': 'sum'})
    # Highest score bin first, so cumulative columns accumulate from the top score down
    result_df = df_gp.reset_index().sort_values(by=['pred'], ascending=False)
    good_total = result_df['good'].sum()
    bad_total = result_df['bad'].sum()
    result_df['good_rate'] = result_df['good'] / good_total
    result_df['bad_rate'] = result_df['bad'] / bad_total
    result_df['cum_good'] = result_df['good'].cumsum()
    result_df['cum_bad'] = result_df['bad'].cumsum()
    result_df['total'] = result_df['good'] + result_df['bad']
    result_df['total_rate'] = result_df['total'] / result_df['total'].sum()
    result_df['sum'] = result_df['cum_good'] + result_df['cum_bad']
    result_df['overdue_rate'] = result_df['cum_bad'] / result_df['sum']
    # WOE is infinite or NaN for bins that contain no good or no bad samples
    result_df['woe'] = np.log(result_df['good_rate'] / result_df['bad_rate'])
    result_df['iv'] = (result_df['good_rate'] - result_df['bad_rate']) * result_df['woe']
    result_df['pass_rate'] = result_df['sum'] / (good_total + bad_total)
    result_df['precision'] = result_df['bad'] / result_df['total']
    result_df['cum_precision'] = result_df['cum_bad'] / result_df['sum']
    result_df['recall'] = result_df['bad'] / bad_total
    result_df['cum_recall'] = result_df['cum_bad'] / bad_total
    result_df['disturb'] = result_df['good'] / good_total
    result_df['cum_disturb'] = result_df['cum_good'] / good_total
    result_df['range_ks'] = result_df['cum_bad'] / bad_total - result_df['cum_good'] / good_total
    # KS is the largest absolute gap between the two cumulative distributions
    result_df['ks'] = result_df['range_ks'].abs().max()
    return result_df.rename(columns=re_col)
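
PSI appears in the metric list but is not computed by the function above. A minimal sketch using the usual definition, PSI = sum over bins of (actual% - expected%) * ln(actual% / expected%). Everything here is an assumption made for illustration: the function name `psi`, equal-width bins taken from the baseline sample, clipping out-of-range values into the outer bins, and the small `eps` floor that avoids log(0) for empty bins. The score samples are synthetic.

```python
import numpy as np


def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population stability index of `actual` relative to the baseline `expected`."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Bin edges come from the baseline sample only.
    edges = np.linspace(expected.min(), expected.max(), n_bins + 1)
    # Values outside the baseline range are counted in the first or last bin.
    actual = np.clip(actual, edges[0], edges[-1])
    expected_pct = np.histogram(expected, bins=edges)[0] / expected.size
    actual_pct = np.histogram(actual, bins=edges)[0] / actual.size
    # Floor the shares so that empty bins do not produce log(0) or division by zero.
    expected_pct = np.clip(expected_pct, eps, None)
    actual_pct = np.clip(actual_pct, eps, None)
    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))


# Synthetic scores: a baseline sample and the same sample shifted upward.
rng = np.random.default_rng(0)
baseline = rng.normal(600, 50, 5000)
shifted = baseline + 30

print(psi(baseline, baseline))  # identical distributions give 0.0
print(psi(baseline, shifted))   # a shifted distribution gives a positive value
```

A commonly quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, but the cut-offs should be chosen for the business at hand.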