[Original] Finding duplicate data in Stata (duplicates)
- hellohappy
- Site Admin
- Posts: 306
- Joined: 18 Nov 2018, 14:27
- Been thanked: 5 times
Contents:
Introduction:
Method:
duplicates report
duplicates list
duplicates examples
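Introduction:
One common kind of data error is duplicate observations.
For example, suppose the correct data is a panel with year as the time dimension and countrycode as the individual dimension (incidentally, the syntax for converting a string into a matching numeric id is egen id = group(countrycode); note that group() is an egen function, not a gen function). Yet even after dropping the observations with missing values, Stata still reports that the combination of year and countrycode is not unique.
That means the data probably contains duplicate (erroneous) observations. How can we locate them quickly?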
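To make the symptom concrete, here is a minimal sketch on a made-up five-row panel (the data and the gdp variable are hypothetical, not the author's dataset). It builds the numeric id with egen and shows that isid and xtset both refuse a panel whose year and countrycode combination repeats.

```stata
* Toy panel with one accidental duplicate (hypothetical data)
clear
input str3 countrycode int year double gdp
"SEN" 1961 1.0
"SEN" 1962 1.1
"SEN" 1962 1.1
"USA" 1961 5.0
"USA" 1962 5.2
end

* group() is an egen function: map each countrycode to a numeric id
egen id = group(countrycode)

* isid exits with an error when year and countrycode do not uniquely
* identify the observations; capture keeps the do-file running
capture noisily isid year countrycode
display _rc    // nonzero here because the SEN/1962 row appears twice

* xtset rejects the panel for the same reason
capture noisily xtset id year
```

Running isid right after importing a panel is a cheap way to catch this problem before xtset or a merge complains.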
Method:
Stata has a built-in command, duplicates, designed for exactly this kind of task. The original help text for duplicates follows:
Help file
Title
[D] duplicates -- Report, tag, or drop duplicate observations
Syntax
Report duplicates
duplicates report [varlist] [if] [in]
List one example for each group of duplicates
duplicates examples [varlist] [if] [in] [, options]
List all duplicates
duplicates list [varlist] [if] [in] [, options]
Tag duplicates
duplicates tag [varlist] [if] [in] , generate(newvar)
Drop duplicates
duplicates drop [if] [in]
duplicates drop varlist [if] [in] , force
options               Description
-------------------------------------------------------------------------
Main
  compress            compress width of columns in both table and display formats
  nocompress          use display format of each variable
  fast                synonym for nocompress; no delay in output of large datasets
  abbreviate(#)       abbreviate variable names to # characters; default is ab(8)
  string(#)           truncate string variables to # characters; default is string(10)
Options
  table               force table format
  display             force display format
  header              display variable header once; default is table mode
  noheader            suppress variable header
  header(#)           display variable header every # lines
  clean               force table format with no divider or separator lines
  divider             draw divider lines between columns
  separator(#)        draw a separator line every # lines; default is separator(5)
  sepby(varlist)      draw a separator line whenever varlist values change
  nolabel             display numeric codes rather than label values
Summary
  mean[(varlist)]     add line reporting the mean for each of the (specified) variables
  sum[(varlist)]      add line reporting the sum for each of the (specified) variables
  N[(varlist)]        add line reporting the number of nonmissing values for each of the (specified) variables
  labvar(varname)     substitute Mean, Sum, or N for value of varname in last row of table
Advanced
  constant[(varlist)] separate and list variables that are constant only once
  notrim              suppress string trimming
  absolute            display overall observation numbers when using by varlist:
  nodotz              display numerical values equal to .z as field of blanks
  subvarname          substitute characteristic for variable name in header
  linesize(#)         columns per line; default is linesize(79)
-------------------------------------------------------------------------
Menu
duplicates report, duplicates examples, and duplicates list
Data > Data utilities > Report and list duplicated observations
duplicates tag
Data > Data utilities > Tag duplicated observations
duplicates drop
Data > Data utilities > Drop duplicated observations
Description
duplicates reports, displays, lists, tags, or drops duplicate observations, depending on the subcommand specified. Duplicates are observations with identical values either on all variables if no varlist is specified or on a specified varlist.
duplicates report produces a table showing observations that occur as one or more copies and indicating how many observations are "surplus" in the sense that they are the second (third, ...) copy of the first of each group of duplicates.
duplicates examples lists one example for each group of duplicated observations. Each example represents the first occurrence of each group in the dataset.
duplicates list lists all duplicated observations.
duplicates tag generates a variable representing the number of duplicates for each observation. This will be 0 for all unique observations.
duplicates drop drops all but the first occurrence of each group of duplicated observations. The word drop may not be abbreviated.
Any observations that do not satisfy specified if and/or in conditions are ignored when you use report, examples, list, or drop. The variable created by tag will have missing values for such observations.
Examples
Setup
. sysuse auto
. keep make price mpg rep78 foreign
. expand 2 in 1/2
Report duplicates
. duplicates report
List one example for each group of duplicated observations
. duplicates examples
List all duplicated observations
. duplicates list
Create variable dup containing the number of duplicates (0 if observation is unique)
. duplicates tag, generate(dup)
List the duplicated observations
. list if dup==1
Drop all but the first occurrence of each group of duplicated observations
. duplicates drop
List all duplicated observations
. duplicates list
For more details, type help duplicates in Stata.
duplicates report
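First comes duplicates report. This command counts how many observations are duplicated on a single variable or on a combination of variables (here, year and countrycode) and presents the result as a table of counts. Using the combination of year and countrycode as an example:
Code: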
duplicates report year countrycode
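The output tells me that 19,722 observations have a year and countrycode combination that appears only once, while 2 observations share the same combination. In other words, there is a single duplicated pair with 1 surplus observation (for example, the two rows (a, 1) and (a, 1)). So now we know some of the data is wrong, but where? On to the next command.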
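The way to read the report can be sketched on a small made-up panel (hypothetical data, not the author's). The sketch also reads back the stored results r(N) and r(unique_value), which, to the best of my knowledge, duplicates report leaves behind; check return list on your Stata version if in doubt.

```stata
* Toy panel: the SEN/1962 key appears twice (hypothetical data)
clear
input str3 countrycode int year
"SEN" 1961
"SEN" 1962
"SEN" 1962
"USA" 1961
"USA" 1962
end

duplicates report year countrycode
* How to read the table for this toy data:
*   copies = 1 -> 3 observations, 0 surplus (keys that occur once)
*   copies = 2 -> 2 observations, 1 surplus (one key that occurs twice)

* Stored results (assumed names, see lead-in)
display r(N)                      // observations examined
display r(unique_value)           // distinct year-countrycode keys
display r(N) - r(unique_value)    // surplus observations
```

The surplus count is the number of rows that would have to go for the key to become unique, which is why the author's table shows 2 observations but only 1 surplus.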
duplicates list
duplicates list prints the duplicated observations themselves. Again using the combination of year and countrycode:
Code:
duplicates list year countrycode
The output tells me that the duplicated combination sits at observation 15480, where year=1962 and countrycode=SEN.
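duplicates list shows only the key variables. To decide which copy is wrong you usually want to see the whole rows, and duplicates tag (described in the help text above) makes that easy. A minimal sketch, again on hypothetical data, where the two SEN/1962 rows deliberately disagree on gdp:

```stata
* Toy panel: same key twice, but different gdp values (hypothetical data)
clear
input str3 countrycode int year double gdp
"SEN" 1961 1.0
"SEN" 1962 1.1
"SEN" 1962 1.3
"USA" 1961 5.0
"USA" 1962 5.2
end

* No varlist: compares all variables, so no duplicates are reported here
duplicates report

* dup = number of other observations with the same key (0 if unique)
duplicates tag year countrycode, generate(dup)

* Show every variable for the rows that share a key
list if dup > 0

* Number of observations involved in any duplicated key
count if dup > 0
```

When the copies differ on other variables, as here, dropping one of them blindly would silently pick a value, so it is worth looking at the listed rows first.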
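duplicates examples
Of course, if some combinations are repeated many times, you may want to replace list with examples, so that each duplicated group is shown only once. Again using the combination of year and countrycode:
Code:
duplicates examples year countrycode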
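For my own data above, I know the row was duplicated by accident, so I can simply delete one of the two observations. (If you consider every duplicate an error, duplicates drop removes them, keeping the first occurrence of each group; when a varlist is given, the force option is required.)
The syntax of drop is drop in n1/n2, where n1 and n2 are the first and last observation numbers to delete.
Code: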
drop in 15480/15480
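A hard-coded observation number such as 15480 only stays valid while the sort order of the data does not change. As an alternative sketch (hypothetical toy data; r(N_drop) is, as far as I know, the result stored by duplicates drop), the duplicated key can be removed by name and the result verified:

```stata
* Toy panel with one exact duplicate row (hypothetical data)
clear
input str3 countrycode int year double gdp
"SEN" 1961 1.0
"SEN" 1962 1.1
"SEN" 1962 1.1
"USA" 1961 5.0
"USA" 1962 5.2
end

* Confirm the copies are identical on every variable before dropping
duplicates report

* Keep the first occurrence of each year-countrycode key;
* force is required whenever a varlist is specified
duplicates drop year countrycode, force
display r(N_drop)    // observations removed (assumed stored result)

* Errors out if the key is still not unique; silent on success
isid year countrycode
```

Ending with isid turns the cleanup into a checked step: the do-file stops if any duplicated key survived.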