Handling Missing Values in Pandas
Published: 2019-06-07

# Import libraries
import pandas as pd
import numpy as np
# SimpleImputer replaces sklearn.preprocessing.Imputer, which was removed in scikit-learn 0.22
from sklearn.impute import SimpleImputer

# Generate data with missing values (no random seed is set, so the numbers differ on each run)
df = pd.DataFrame(np.random.randn(6, 4), columns=['col1', 'col2', 'col3', 'col4'])
df.iloc[1:2, 1] = np.nan  # add a missing value
df.iloc[4, 3] = np.nan  # add a missing value
print(df)  # print the result
       col1      col2      col3      col4
0 -0.977511 -0.566332 -0.529934  1.489695
1 -0.491128       NaN -0.811174 -1.102717
2  0.385777 -0.638822  0.325953 -0.240780
3  0.938351 -0.746889  0.375200 -0.715265
4  1.103418  0.238959 -0.459114       NaN
5  1.002177  0.448844 -0.584634 -1.038151

# Locate the missing values
nan_all = df.isnull()
print(nan_all)
    col1   col2   col3   col4
0  False  False  False  False
1  False   True  False  False
2  False  False  False  False
3  False  False  False  False
4  False  False  False   True
5  False  False  False  False

nan_col1 = df.isnull().any()  # columns that contain at least one NA
print(nan_col1)
col1    False
col2     True
col3    False
col4     True
dtype: bool

nan_col2 = df.isnull().all()  # columns that are entirely NA
print(nan_col2)
col1    False
col2    False
col3    False
col4    False
dtype: bool

# Drop missing values
df2 = df.dropna()  # drop every row that contains an NA
print(df2)
       col1      col2      col3      col4
0 -0.977511 -0.566332 -0.529934  1.489695
2  0.385777 -0.638822  0.325953 -0.240780
3  0.938351 -0.746889  0.375200 -0.715265
5  1.002177  0.448844 -0.584634 -1.038151

# Handle missing values with sklearn's preprocessing tools
nan_model = SimpleImputer(missing_values=np.nan, strategy='mean')  # replacement rule: fill NaN with the column mean
nan_result = nan_model.fit_transform(df)  # apply the rule
print(nan_result)  # print the result
[[-0.97751051 -0.56633185 -0.52993389  1.48969465]
 [-0.49112788 -0.25284792 -0.81117388 -1.10271738]
 [ 0.38577678 -0.63882219  0.32595345 -0.24077995]
 [ 0.93835121 -0.74688892  0.37519957 -0.71526484]
 [ 1.10341788  0.23895916 -0.45911413 -0.32144373]
 [ 1.00217657  0.4488442  -0.58463419 -1.03815116]]

# Handle missing values with Pandas
nan_result_pd1 = df.bfill()  # fill with the next valid value; same as fillna(method='backfill') in older pandas
print(nan_result_pd1)
       col1      col2      col3      col4
0 -0.977511 -0.566332 -0.529934  1.489695
1 -0.491128 -0.638822 -0.811174 -1.102717
2  0.385777 -0.638822  0.325953 -0.240780
3  0.938351 -0.746889  0.375200 -0.715265
4  1.103418  0.238959 -0.459114 -1.038151
5  1.002177  0.448844 -0.584634 -1.038151

nan_result_pd2 = df.bfill(limit=1)  # backward fill, filling at most one consecutive NaN per gap
print(nan_result_pd2)
       col1      col2      col3      col4
0 -0.977511 -0.566332 -0.529934  1.489695
1 -0.491128 -0.638822 -0.811174 -1.102717
2  0.385777 -0.638822  0.325953 -0.240780
3  0.938351 -0.746889  0.375200 -0.715265
4  1.103418  0.238959 -0.459114 -1.038151
5  1.002177  0.448844 -0.584634 -1.038151

nan_result_df3 = df.ffill()  # fill with the previous valid value; same as fillna(method='pad') in older pandas
print(nan_result_df3)
       col1      col2      col3      col4
0 -0.977511 -0.566332 -0.529934  1.489695
1 -0.491128 -0.566332 -0.811174 -1.102717
2  0.385777 -0.638822  0.325953 -0.240780
3  0.938351 -0.746889  0.375200 -0.715265
4  1.103418  0.238959 -0.459114 -0.715265
5  1.002177  0.448844 -0.584634 -1.038151

nan_result_df4 = df.fillna(0)  # replace missing values with 0
print(nan_result_df4)
       col1      col2      col3      col4
0 -0.977511 -0.566332 -0.529934  1.489695
1 -0.491128  0.000000 -0.811174 -1.102717
2  0.385777 -0.638822  0.325953 -0.240780
3  0.938351 -0.746889  0.375200 -0.715265
4  1.103418  0.238959 -0.459114  0.000000
5  1.002177  0.448844 -0.584634 -1.038151

nan_result_df5 = df.fillna({'col2': 1.1, 'col4': 1.2})  # use a different fill value for each column
print(nan_result_df5)
       col1      col2      col3      col4
0 -0.977511 -0.566332 -0.529934  1.489695
1 -0.491128  1.100000 -0.811174 -1.102717
2  0.385777 -0.638822  0.325953 -0.240780
3  0.938351 -0.746889  0.375200 -0.715265
4  1.103418  0.238959 -0.459114  1.200000
5  1.002177  0.448844 -0.584634 -1.038151

nan_result_df6 = df.fillna(df.mean()['col2':'col4'])  # fill col2 through col4 with each column's own mean
print(nan_result_df6)
       col1      col2      col3      col4
0 -0.977511 -0.566332 -0.529934  1.489695
1 -0.491128 -0.252848 -0.811174 -1.102717
2  0.385777 -0.638822  0.325953 -0.240780
3  0.938351 -0.746889  0.375200 -0.715265
4  1.103418  0.238959 -0.459114 -0.321444
5  1.002177  0.448844 -0.584634 -1.038151

nan_result_df7 = df.replace(np.nan, 0)  # replace missing values with Pandas' replace
print(nan_result_df7)
       col1      col2      col3      col4
0 -0.977511 -0.566332 -0.529934  1.489695
1 -0.491128  0.000000 -0.811174 -1.102717
2  0.385777 -0.638822  0.325953 -0.240780
3  0.938351 -0.746889  0.375200 -0.715265
4  1.103418  0.238959 -0.459114  0.000000
5  1.002177  0.448844 -0.584634 -1.038151
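
Because the frame above is random, its numbers cannot be reproduced exactly. The following is a minimal sketch on a small hand-written frame, where the results can be checked by eye. The name df_demo and all of its values are made up for illustration and are not part of the original post. It also shows two edge cases the random data does not hit: a backward fill leaves trailing NaN untouched, and limit=1 fills only the first NaN of a consecutive run.

```python
import numpy as np
import pandas as pd

# Hypothetical demo data (not from the original post), chosen so the means are round numbers.
df_demo = pd.DataFrame({
    "col1": [1.0, 2.0, 3.0, 4.0],
    "col2": [10.0, np.nan, 30.0, 50.0],
    "col3": [5.0, 6.0, np.nan, np.nan],
})

# Count the missing values in each column.
na_counts = df_demo.isnull().sum()

# Fill each gap with its column's mean; NaN is skipped when the mean is computed.
mean_filled = df_demo.fillna(df_demo.mean())

# Backward fill: copy the next valid value upward.
# The two trailing NaN in col3 have no later value to copy, so they stay NaN.
back_filled = df_demo.bfill()

# Forward fill with limit=1: fill at most one consecutive NaN per gap.
fwd_limited = df_demo.ffill(limit=1)

print(na_counts)
print(mean_filled)
print(back_filled)
print(fwd_limited)
```

Here col2 has the mean (10 + 30 + 50) / 3 = 30 and col3 has the mean (5 + 6) / 2 = 5.5, so those are the values that appear in mean_filled. In fwd_limited, row 2 of col3 becomes 6.0 while row 3 remains NaN.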

Reposted from: https://www.cnblogs.com/hankleo/p/11462830.html
