Generating a dummy variable

programtip 2020. 11. 12. 20:07

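I am having trouble generating the following dummy variables in R.

I am analyzing yearly time series data (period 1948-2009). I have two questions:

How do I generate a dummy variable for observation #10, i.e. for year 1957 (value = 1 in 1957 and 0 otherwise)?
How do I generate a dummy variable that is 0 before 1957 and takes the value 1 from 1957 through 2009?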

Another option that can work better if you have many variables is factor and model.matrix.

> year.f = factor(year)
> dummies = model.matrix(~year.f)

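This will include an intercept column (all ones) and one column for each of the years in your data set except one, which becomes the "default" or intercept value.

You can change how the "default" is chosen by adjusting contrasts.arg in model.matrix.

Also, if you want to omit the intercept, you can drop the first column or add +0 to the end of the formula.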

Hope this is useful.
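To make the intercept and baseline remarks concrete, here is a minimal sketch on a made-up year vector. The toy data and the choice of 1957 as the baseline are assumptions for illustration, not part of the original answer. It compares the default design matrix with the +0 form and uses contrasts.arg to switch the reference level.

```r
# Toy data: assumed for illustration only
year <- c(1956, 1957, 1957, 1958)
year.f <- factor(year)

# Default: intercept plus K-1 dummies (the first level, 1956, is the baseline)
with_intercept <- model.matrix(~ year.f)

# "+ 0" drops the intercept, giving one dummy per level (K columns)
no_intercept <- model.matrix(~ year.f + 0)

# contrasts.arg: treatment contrasts with the 2nd level (1957) as the baseline
base_1957 <- model.matrix(
  ~ year.f,
  contrasts.arg = list(year.f = contr.treatment(levels(year.f), base = 2))
)

colnames(with_intercept)
colnames(no_intercept)
base_1957
```

In the last matrix the rows for 1957 are zero in every non-intercept column, which is what being the baseline means.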

The simplest way to produce these dummy variables is something like the following:

> print(year)
[1] 1956 1957 1957 1958 1958 1959
> dummy <- as.numeric(year == 1957)
> print(dummy)
[1] 0 1 1 0 0 0
> dummy2 <- as.numeric(year >= 1957)
> print(dummy2)
[1] 0 1 1 1 1 1

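More generally, you can use ifelse to choose between two values depending on a condition. So if, instead of a 0-1 dummy variable, you wanted for some reason to use 4 and 7, you could use ifelse(year == 1957, 4, 7).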
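A short sketch of the 4-and-7 variant, reusing the year values printed earlier in this answer. The NA case at the end is an added illustration rather than something the answer discusses: a missing value in the condition produces a missing value in the result.

```r
# Same example values as in the printed output earlier in this answer
year <- c(1956, 1957, 1957, 1958, 1958, 1959)

# Arbitrary codes instead of 0/1
coded <- ifelse(year == 1957, 4, 7)
coded

# An NA in the condition propagates to the result
with_na <- ifelse(c(1957, NA) == 1957, 4, 7)
with_na
```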

Using dummies::dummy():

library(dummies)

# example data
df1 <- data.frame(id = 1:4, year = 1991:1994)

df1 <- cbind(df1, dummy(df1$year, sep = "_"))

df1
#   id year df1_1991 df1_1992 df1_1993 df1_1994
# 1  1 1991        1        0        0        0
# 2  2 1992        0        1        0        0
# 3  3 1993        0        0        1        0
# 4  4 1994        0        0        0        1

The mlr package includes createDummyFeatures for this purpose:

library(mlr)
df <- data.frame(var = sample(c("A", "B", "C"), 10, replace = TRUE))
df

#    var
# 1    B
# 2    A
# 3    C
# 4    B
# 5    C
# 6    A
# 7    C
# 8    A
# 9    B
# 10   C

createDummyFeatures(df, cols = "var")

#    var.A var.B var.C
# 1      0     1     0
# 2      1     0     0
# 3      0     0     1
# 4      0     1     0
# 5      0     0     1
# 6      1     0     0
# 7      0     0     1
# 8      1     0     0
# 9      0     1     0
# 10     0     0     1

createDummyFeatures drops the original variable.

https://www.rdocumentation.org/packages/mlr/versions/2.9/topics/createDummyFeatures

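The other answers here offer direct routes to accomplish this task, which many models (e.g. lm) will do for you internally anyway. Nonetheless, here are ways to make dummy variables with Max Kuhn's popular caret and recipes packages. While somewhat more verbose, both scale easily to more complicated situations and fit neatly into their respective frameworks.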

`caret::dummyVars`

With caret, the relevant function is dummyVars, which has a predict method to apply it to a data frame:

df <- data.frame(letter = rep(c('a', 'b', 'c'), each = 2),
                 y = 1:6)

library(caret)

dummy <- dummyVars(~ ., data = df, fullRank = TRUE)

dummy
#> Dummy Variable Object
#> 
#> Formula: ~.
#> 2 variables, 1 factors
#> Variables and levels will be separated by '.'
#> A full rank encoding is used

predict(dummy, df)
#>   letter.b letter.c y
#> 1        0        0 1
#> 2        0        0 2
#> 3        1        0 3
#> 4        1        0 4
#> 5        0        1 5
#> 6        0        1 6

`recipes::step_dummy`

With recipes, the relevant function is step_dummy:

library(recipes)

dummy_recipe <- recipe(y ~ letter, df) %>% 
    step_dummy(letter)

dummy_recipe
#> Data Recipe
#> 
#> Inputs:
#> 
#>       role #variables
#>    outcome          1
#>  predictor          1
#> 
#> Steps:
#> 
#> Dummy variables from letter

Depending on context, extract the data with prep and either bake or juice:

# Prep and bake on new data...
dummy_recipe %>% 
    prep() %>% 
    bake(df)
#> # A tibble: 6 x 3
#>       y letter_b letter_c
#>   <int>    <dbl>    <dbl>
#> 1     1        0        0
#> 2     2        0        0
#> 3     3        1        0
#> 4     4        1        0
#> 5     5        0        1
#> 6     6        0        1

# ...or use `retain = TRUE` and `juice` to extract training data
dummy_recipe %>% 
    prep(retain = TRUE) %>% 
    juice()
#> # A tibble: 6 x 3
#>       y letter_b letter_c
#>   <int>    <dbl>    <dbl>
#> 1     1        0        0
#> 2     2        0        0
#> 3     3        1        0
#> 4     4        1        0
#> 5     5        0        1
#> 6     6        0        1

What I normally do to work with this kind of dummy variable is:

(1) How do I generate a dummy variable for observation #10, i.e. for year 1957 (value = 1 in 1957 and 0 otherwise)?

data$factor_year_1 <- factor(with(data, ifelse(year == 1957, 1, 0)))

(2) How do I generate a dummy variable that is 0 before 1957 and takes the value 1 from 1957 through 2009?

data$factor_year_2 <- factor(with(data, ifelse(year < 1957, 0, 1)))

Then I can introduce this factor as a dummy variable in my models. For example, to see whether there is a long-term trend in a variable y:

summary(lm(y ~ t, data = data))

Hope this helps!
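As a minimal sketch of how such a factor could enter a regression alongside a trend term: the data below are simulated, and the variable names y and t, the effect sizes and the seed are all assumptions for illustration, not taken from the answer.

```r
# Simulated yearly data (assumed for illustration)
set.seed(1)
data <- data.frame(year = 1948:2009)
data$t <- seq_along(data$year)  # linear time trend

# Step dummy as in (2): 0 before 1957, 1 from 1957 onwards
data$factor_year_2 <- factor(ifelse(data$year < 1957, 0, 1))

# Made-up response with a trend and a level shift in 1957
data$y <- 0.1 * data$t + 2 * (data$year >= 1957) + rnorm(nrow(data))

# Trend plus the step dummy; lm expands the factor into a 0/1 column itself
fit <- lm(y ~ t + factor_year_2, data = data)
summary(fit)
```

The coefficient labelled factor_year_21 is the estimated level shift for the years coded 1.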

For the use case presented in the question, you could also simply multiply the logical condition by 1 (or perhaps even better, by 1L):

# example data
df1 <- data.frame(yr = 1951:1960)

# create the dummies
df1$is.1957 <- 1L * (df1$yr == 1957)
df1$after.1957 <- 1L * (df1$yr >= 1957)

which gives:

> df1
     yr is.1957 after.1957
1  1951       0          0
2  1952       0          0
3  1953       0          0
4  1954       0          0
5  1955       0          0
6  1956       0          0
7  1957       1          1
8  1958       0          1
9  1959       0          1
10 1960       0          1
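The remark about 1L can be checked directly. In this small sketch (the vector yr mirrors the example column above), multiplying by 1 yields a double vector while multiplying by 1L keeps integer storage, and the 1L form gives the same result as as.integer().

```r
yr <- 1951:1960

# Multiplying a logical by a double gives doubles ...
type_double <- typeof(1 * (yr == 1957))

# ... while multiplying by an integer literal keeps integers
type_integer <- typeof(1L * (yr == 1957))

# The 1L form is equivalent to an explicit conversion
same_as_conversion <- identical(1L * (yr >= 1957), as.integer(yr >= 1957))

c(type_double, type_integer)
same_as_conversion
```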

For the use cases presented in, for example, the answers of @zx8754 and @Sotos, there are still some other options that haven't been covered yet.

1) Make your own make_dummies function

# example data
df2 <- data.frame(id = 1:5, year = c(1991:1994,1992))

# create a function
make_dummies <- function(v, prefix = '') {
  s <- sort(unique(v))
  d <- outer(v, s, function(v, s) 1L * (v == s))
  colnames(d) <- paste0(prefix, s)
  d
}

# bind the dummies to the original dataframe
cbind(df2, make_dummies(df2$year, prefix = 'y'))

which gives:

  id year y1991 y1992 y1993 y1994
1  1 1991     1     0     0     0
2  2 1992     0     1     0     0
3  3 1993     0     0     1     0
4  4 1994     0     0     0     1
5  5 1992     0     1     0     0

2) Use the dcast function from either data.table or reshape2

 dcast(df2, id + year ~ year, fun.aggregate = length)

which gives:

  id year 1991 1992 1993 1994
1  1 1991    1    0    0    0
2  2 1992    0    1    0    0
3  3 1993    0    0    1    0
4  4 1994    0    0    0    1
5  5 1992    0    1    0    0

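However, this will not work when there are duplicate values in the column for which the dummies have to be created. In that case dcast needs a specific aggregation function, and the result of dcast has to be merged back into the original: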

# example data
df3 <- data.frame(var = c("B", "C", "A", "B", "C"))

# aggregation function to get dummy values
f <- function(x) as.integer(length(x) > 0)

# reshape to wide with the custom aggregation function and merge back to the original
merge(df3, dcast(df3, var ~ var, fun.aggregate = f), by = 'var', all.x = TRUE)

which gives (note that the result is ordered by the by column):

  var A B C
1   A 1 0 0
2   B 0 1 0
3   B 0 1 0
4   C 0 0 1
5   C 0 0 1

3) Use the spread function from tidyr (with mutate from dplyr)

library(dplyr)
library(tidyr)

df2 %>% 
  mutate(v = 1, yr = year) %>% 
  spread(yr, v, fill = 0)

which gives:

  id year 1991 1992 1993 1994
1  1 1991    1    0    0    0
2  2 1992    0    1    0    0
3  3 1993    0    0    1    0
4  4 1994    0    0    0    1
5  5 1992    0    1    0    0

I read this on the kaggle forum:

#Generate example dataframe with character column
example <- as.data.frame(c("A", "A", "B", "F", "C", "G", "C", "D", "E", "F"))
names(example) <- "strcol"

#For every unique value in the string column, create a new 1/0 column
#This is what Factors do "under-the-hood" automatically when passed to function requiring numeric data
for(level in unique(example$strcol)){
  example[paste("dummy", level, sep = "_")] <- ifelse(example$strcol == level, 1, 0)
}
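The same result can be built without an explicit loop. The following is a sketch of one possible vectorised equivalent using sapply, not code from the forum post; it uses the same example values and the same dummy_ naming.

```r
# Same example values as above
example <- data.frame(
  strcol = c("A", "A", "B", "F", "C", "G", "C", "D", "E", "F")
)

# One 0/1 column per distinct value, in order of first appearance
lv <- as.character(unique(example$strcol))
d <- sapply(lv, function(l) as.integer(example$strcol == l))
colnames(d) <- paste("dummy", lv, sep = "_")

# Bind the indicator columns to the original data frame
example <- cbind(example, d)
example
```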

If you want to get K dummy variables instead of K-1, try:

dummies = table(1:length(year),as.factor(year))

Best,
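A small sketch of what the table call produces, on a made-up year vector (the data are an assumption for illustration). Cross-tabulating the row index against the factor gives exactly one 1 per row, so there is one column per level; as.data.frame.matrix is one way to turn the result into an ordinary data frame.

```r
# Toy data: assumed for illustration only
year <- c(1956, 1957, 1957, 1958)

# Row index crossed with the factor: one column per distinct year
dummies <- table(seq_along(year), as.factor(year))
dummies

# Convert the table to a plain wide data frame if needed
dummies_df <- as.data.frame.matrix(dummies)
dummies_df
```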

The ifelse function is best for simple conditions like the ones in the question.

> x <- seq(1950, 1960, 1)
> ifelse(x == 1957, 1, 0)
 [1] 0 0 0 0 0 0 0 1 0 0 0
> ifelse(x >= 1957, 1, 0)
 [1] 0 0 0 0 0 0 0 1 1 1 1

Also, if you want it to return character data, you can do so.

> x <- seq(1950, 1960, 1)
> ifelse(x == 1957, "foo", "bar")
 [1] "bar" "bar" "bar" "bar" "bar" "bar" "bar" "foo" "bar" "bar" "bar"
> ifelse(x <= 1957, "foo", "bar")
 [1] "foo" "foo" "foo" "foo" "foo" "foo" "foo" "foo" "bar" "bar" "bar"

Categorical variables with nesting...

> x <- seq(1950, 1960, 1)
> ifelse(x == 1957, "foo", ifelse(x == 1958, "bar", "baz"))
 [1] "baz" "baz" "baz" "baz" "baz" "baz" "baz" "foo" "bar" "baz" "baz"

This is the most straightforward option.

Another way is to use mtabulate from qdapTools package, i.e.

df <- data.frame(var = sample(c("A", "B", "C"), 5, replace = TRUE))
df
#   var
# 1   C
# 2   A
# 3   C
# 4   B
# 5   B

library(qdapTools)
mtabulate(df$var)

which gives one row per observation and one 0/1 column for each distinct value (A, B and C).

I use such a function (for data.table):

library(data.table)

# For a data.table object and a factor variable var.name, this function creates dummy variables named "var.name: (level1)"
factorToDummy <- function(dtable, var.name){
  stopifnot(is.data.table(dtable))
  stopifnot(var.name %in% names(dtable))
  stopifnot(is.factor(dtable[, get(var.name)]))

  dtable[, paste0(var.name,": ",levels(get(var.name)))] -> new.names
  dtable[, (new.names) := transpose(lapply(get(var.name), FUN = function(x){x == levels(get(var.name))})) ]

  cat(paste("\nAdded dummy variables: ", paste0(new.names, collapse = ", ")))
}

Usage:

data <- data.table(data)
data[, x:= droplevels(x)]
factorToDummy(data, "x")

Convert your data to a data.table and use set by reference and row filtering

library(data.table)

dt <- as.data.table(your.dataframe.or.whatever)
dt[, is.1957 := 0]
dt[year == 1957, is.1957 := 1]

Proof-of-concept toy example:

library(data.table)

dt <- as.data.table(cbind(c(1, 1, 1), c(2, 2, 3)))
dt[, is.3 := 0]
dt[V2 == 3, is.3 := 1]

Hi, I wrote this general function to generate a dummy variable; it essentially replicates the replace function in Stata.

Here x is the data frame, and I want a dummy variable called a that takes the value 1 when x$b takes the value c:

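introducedummy <- function(x, a, b, c) {
  # a: name of the new dummy column (character)
  # b: name of the existing column to test (character)
  # c: value of x[, b] for which the dummy equals 1
  g <- c(a, b, c)
  n <- nrow(x)
  newcol <- g[1]
  p <- colnames(x)
  p2 <- c(p, newcol)
  new1 <- numeric(n)
  state <- x[, g[2]]
  interest <- g[3]
  # set the dummy to 1 where the tested column equals the value of interest
  for (i in 1:n) {
    if (state[i] == interest) {
      new1[i] <- 1
    } else {
      new1[i] <- 0
    }
  }
  # append the new column and rename it to the requested name
  x$added <- new1
  colnames(x) <- p2
  x
}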
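The same idea can be written without a loop. Below is a sketch of a vectorised variant with a usage example; the function name add_dummy and its argument names are hypothetical, and unlike the loop version it maps missing values to 0 instead of stopping with an error.

```r
# Hypothetical vectorised variant: adds a 0/1 column named `new_name`
# that is 1 where x[[column]] equals `value`
add_dummy <- function(x, new_name, column, value) {
  # NA in the tested column is treated as "no match" (0)
  x[[new_name]] <- as.integer(!is.na(x[[column]]) & x[[column]] == value)
  x
}

# Usage on a small made-up data frame
df <- data.frame(year = 1955:1959)
df <- add_dummy(df, "is.1957", "year", 1957)
df
```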

Another way you can do it is to use:

ifelse(year < 1965 , 1, 0)

We can also use cSplit_e from splitstackshape. Using @zx8754's data

df1 <- data.frame(id = 1:4, year = 1991:1994)
splitstackshape::cSplit_e(df1, "year", fill = 0)

#  id year year_1 year_2 year_3 year_4
#1  1 1991      1      0      0      0
#2  2 1992      0      1      0      0
#3  3 1993      0      0      1      0
#4  4 1994      0      0      0      1

To make it work for data other than numeric we need to specify type as "character" explicitly

df1 <- data.frame(id = 1:4, let = LETTERS[1:4])
splitstackshape::cSplit_e(df1, "let", fill = 0, type = "character")

#  id let let_A let_B let_C let_D
#1  1   A     1     0     0     0
#2  2   B     0     1     0     0
#3  3   C     0     0     1     0
#4  4   D     0     0     0     1

Reference URL: https://stackoverflow.com/questions/11952706/generate-a-dummy-variable
