
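Chapter 5: 10 Techniques for Predicting Member Churn
The behavior analysis via clustering introduced in the previous chapter is a technique with many possible applications. Once behavior patterns can be analyzed, it becomes possible to predict with reasonable accuracy which customers are likely to withdraw, and to prepare retention measures in advance.
This chapter walks through predicting churn with a decision tree, a supervised classification algorithm.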
In [1]:
import pandas as pd
# Load Dataset
folder_p = '/content/drive/MyDrive/파이썬데이터분석실무테크닉100/pyda100/5장/'
customer = pd.read_csv(folder_p+'customer_join.csv')
uselog_months = pd.read_csv(folder_p+'use_log_months.csv')
To predict the future, the dataset is built from only the usage history of the current month and the month before it.
In [2]:
year_months = list(uselog_months["연월"].unique())
uselog = pd.DataFrame()
for i in range(1, len(year_months)):
    # usage counts for the current month (copy so the rename does not act on a slice)
    tmp = uselog_months.loc[uselog_months["연월"]==year_months[i]].copy()
    tmp.rename(columns={"count":"count_0"}, inplace=True)
    # usage counts for the previous month
    tmp_before = uselog_months.loc[uselog_months["연월"]==year_months[i-1]].copy()
    del tmp_before["연월"]
    tmp_before.rename(columns={"count":"count_1"}, inplace=True)
    tmp = pd.merge(tmp, tmp_before, on="customer_id", how="left")
    uselog = pd.concat([uselog, tmp], ignore_index=True)
uselog.head()
Out[2]:
| | 연월 | customer_id | count_0 | count_1 |
|---|---|---|---|---|
| 0 | 201805 | AS002855 | 5 | 4.0 |
| 1 | 201805 | AS009373 | 4 | 3.0 |
| 2 | 201805 | AS015233 | 7 | NaN |
| 3 | 201805 | AS015315 | 3 | 6.0 |
| 4 | 201805 | AS015739 | 5 | 7.0 |
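The month-by-month loop above can also be written as a single merge. The following is a minimal sketch on a made-up three-month log (the customer IDs and counts are illustrative and are not taken from `use_log_months.csv`): each row is relabeled with the following month and joined back, so that `count_1` holds the previous month's count. The helper name `next_month` is introduced here only for illustration.

```python
import pandas as pd

# Toy usage log: one row per customer per month (illustrative values)
uselog_months = pd.DataFrame({
    "연월": [201804, 201804, 201805, 201805, 201806],
    "customer_id": ["A", "B", "A", "C", "A"],
    "count": [4, 3, 5, 7, 2],
})

months = sorted(int(m) for m in uselog_months["연월"].unique())
next_month = dict(zip(months[:-1], months[1:]))  # 201804 -> 201805, 201805 -> 201806

# Current-month rows: every month except the first, which has no predecessor
current = uselog_months[uselog_months["연월"] != months[0]].rename(
    columns={"count": "count_0"}
)

# Previous-month rows, relabeled with the month they should be joined to
previous = uselog_months[uselog_months["연월"] != months[-1]].copy()
previous["연월"] = previous["연월"].map(next_month)
previous = previous.rename(columns={"count": "count_1"})

# A left join keeps customers with no usage in the previous month (count_1 is NaN)
uselog = current.merge(previous, on=["연월", "customer_id"], how="left")
print(uselog)
```

As in the loop version, a customer with no record in the previous month (customer `C` here) ends up with `NaN` in `count_1`.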
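Building the data for churned customers in the month before withdrawal
- Why build the data for the month before withdrawal rather than the withdrawal month itself?
- The purpose of predicting churn is to prevent it.
- At the sports center that provided this data, a member must submit a withdrawal request by the end of a month in order to leave at the end of the following month.
In other words, the model uses the data of the month before withdrawal to predict the probability that a member submits a withdrawal request.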
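As a small illustration of the "month before withdrawal" idea, the sketch below shifts two end dates back by one month and formats them as a `연월` key. It uses `pd.DateOffset` rather than the `relativedelta` loop in the notebook; the two end dates are the ones that appear in the tables of this post.

```python
import pandas as pd

# Two withdrawal dates that appear in the data
end_dates = pd.to_datetime(pd.Series(["2018-06-30", "2019-03-31"]))

# One month earlier; a day that does not exist in the target month is clipped
# to that month's last day (2019-03-31 becomes 2019-02-28)
exit_dates = end_dates - pd.DateOffset(months=1)

# Year-month key used to join against the usage log
exit_months = exit_dates.dt.strftime("%Y%m")
print(exit_months.tolist())
```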
In [3]:
from dateutil.relativedelta import relativedelta
exit_customer = customer.loc[customer["is_deleted"]==1].copy() # copy so later assignments do not act on a slice
exit_customer["exit_date"] = None
exit_customer["end_date"] = pd.to_datetime(exit_customer["end_date"]) # convert to datetime
In [4]:
# Withdrawal date minus one month (vectorized, replacing the row-by-row chained assignment)
exit_customer["exit_date"] = exit_customer["end_date"] - pd.DateOffset(months=1)
In [5]:
exit_customer["exit_date"] = pd.to_datetime(exit_customer["exit_date"]) # make sure the column is datetime
exit_customer["연월"] = exit_customer["exit_date"].dt.strftime("%Y%m")
uselog["연월"] = uselog["연월"].astype(str)
exit_uselog = pd.merge(uselog, exit_customer, on=["customer_id", "연월"], how="left")
print(len(uselog))
exit_uselog.head()
33851
Out[5]:
| | 연월 | customer_id | count_0 | count_1 | name | class | gender | start_date | end_date | campaign_id | is_deleted | class_name | price | campaign_name | mean | median | max | min | routine_flg | calc_date | membership_period | exit_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 201805 | AS002855 | 5 | 4.0 | NaN | NaN | NaN | NaN | NaT | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaT |
| 1 | 201805 | AS009373 | 4 | 3.0 | NaN | NaN | NaN | NaN | NaT | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaT |
| 2 | 201805 | AS015233 | 7 | NaN | NaN | NaN | NaN | NaN | NaT | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaT |
| 3 | 201805 | AS015315 | 3 | 6.0 | NaN | NaN | NaN | NaN | NaT | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaT |
| 4 | 201805 | AS015739 | 5 | 7.0 | NaN | NaN | NaN | NaN | NaT | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaT |
In [7]:
exit_uselog = exit_uselog.dropna(subset=["name"])
print(len(exit_uselog))
print(len(exit_uselog["customer_id"].unique())) # confirm that each customer appears only once
exit_uselog.head()
1104
1104
Out[7]:
| | 연월 | customer_id | count_0 | count_1 | name | class | gender | start_date | end_date | campaign_id | is_deleted | class_name | price | campaign_name | mean | median | max | min | routine_flg | calc_date | membership_period | exit_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 19 | 201805 | AS055680 | 3 | 3.0 | XXXXX | C01 | M | 2018-03-01 | 2018-06-30 | CA1 | 1.0 | 종일 | 10500.0 | 일반 | 3.000000 | 3.0 | 3.0 | 3.0 | 0.0 | 2018-06-30 | 3.0 | 2018-05-30 |
| 57 | 201805 | AS169823 | 2 | 3.0 | XX | C01 | M | 2017-11-01 | 2018-06-30 | CA1 | 1.0 | 종일 | 10500.0 | 일반 | 3.000000 | 3.0 | 4.0 | 2.0 | 1.0 | 2018-06-30 | 7.0 | 2018-05-30 |
| 110 | 201805 | AS305860 | 5 | 3.0 | XXXX | C01 | M | 2017-06-01 | 2018-06-30 | CA1 | 1.0 | 종일 | 10500.0 | 일반 | 3.333333 | 3.0 | 5.0 | 2.0 | 0.0 | 2018-06-30 | 12.0 | 2018-05-30 |
| 128 | 201805 | AS363699 | 5 | 3.0 | XXXXX | C01 | M | 2018-02-01 | 2018-06-30 | CA1 | 1.0 | 종일 | 10500.0 | 일반 | 3.333333 | 3.0 | 5.0 | 2.0 | 0.0 | 2018-06-30 | 4.0 | 2018-05-30 |
| 147 | 201805 | AS417696 | 1 | 4.0 | XX | C03 | F | 2017-09-01 | 2018-06-30 | CA1 | 1.0 | 야간 | 6000.0 | 일반 | 2.000000 | 1.0 | 4.0 | 1.0 | 0.0 | 2018-06-30 | 9.0 | 2018-05-30 |
Building the data for continuing members
In [8]:
conti_customer = customer.loc[customer["is_deleted"]==0]
conti_uselog = pd.merge(uselog, conti_customer, on=["customer_id"], how="left")
print(len(conti_uselog))
conti_uselog = conti_uselog.dropna(subset=["name"])
print(len(conti_uselog))
33851
27422
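There are 1,104 rows for churned members, whereas the continuing members still have 27,422 rows even after dropping rows with a missing name, so the two classes are imbalanced.
The sample size is therefore adjusted.
Instead of using every month of each member's data, only one month is used, so that each continuing member contributes a single row.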
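The "one row per member" step can be sketched as a shuffle followed by `drop_duplicates`, which keeps the first row it meets for each `customer_id`. The toy frame and the `random_state=0` argument below are assumptions added for reproducibility; the notebook itself shuffles without a fixed seed.

```python
import pandas as pd

# Toy log: customer A has three monthly rows, customer B has two (illustrative values)
conti_uselog = pd.DataFrame({
    "연월": ["201805", "201806", "201807", "201805", "201806"],
    "customer_id": ["A", "A", "A", "B", "B"],
    "count_0": [5, 4, 6, 2, 3],
})

# Shuffle all rows, then keep the first row seen for each customer,
# i.e. one randomly chosen month per customer
shuffled = conti_uselog.sample(frac=1, random_state=0).reset_index(drop=True)
one_per_customer = shuffled.drop_duplicates(subset="customer_id")
print(one_per_customer)
```

Because the rows are shuffled first, the month kept for each customer is effectively random rather than always the earliest one.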
In [9]:
conti_uselog = conti_uselog.sample(frac=1).reset_index(drop=True)
conti_uselog = conti_uselog.drop_duplicates(subset="customer_id")
print(len(conti_uselog))
conti_uselog.head()
2842
Out[9]:
| | 연월 | customer_id | count_0 | count_1 | name | class | gender | start_date | end_date | campaign_id | is_deleted | class_name | price | campaign_name | mean | median | max | min | routine_flg | calc_date | membership_period |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 201901 | AS551109 | 8 | 8.0 | XXX | C01 | F | 2018-10-15 | NaN | CA1 | 0.0 | 종일 | 10500.0 | 일반 | 7.333333 | 8.0 | 8.0 | 5.0 | 1.0 | 2019-04-30 | 6.0 |
| 1 | 201806 | OA124956 | 7 | 7.0 | XXXXX | C01 | F | 2016-11-01 | NaN | CA1 | 0.0 | 종일 | 10500.0 | 일반 | 5.333333 | 5.0 | 7.0 | 3.0 | 1.0 | 2019-04-30 | 29.0 |
| 2 | 201811 | IK983460 | 5 | 6.0 | XXXXXX | C02 | M | 2015-10-01 | NaN | CA1 | 0.0 | 주간 | 7500.0 | 일반 | 5.250000 | 5.0 | 8.0 | 3.0 | 1.0 | 2019-04-30 | 42.0 |
| 3 | 201901 | IK155697 | 5 | 5.0 | XXXX | C01 | F | 2015-08-01 | NaN | CA1 | 0.0 | 종일 | 10500.0 | 일반 | 4.250000 | 5.0 | 6.0 | 1.0 | 1.0 | 2019-04-30 | 44.0 |
| 4 | 201903 | AS515526 | 9 | 4.0 | XXXXX | C03 | M | 2015-11-01 | NaN | CA1 | 0.0 | 야간 | 6000.0 | 일반 | 5.000000 | 5.0 | 9.0 | 2.0 | 1.0 | 2019-04-30 | 41.0 |
In [11]:
predict_data = pd.concat([conti_uselog, exit_uselog], ignore_index=True) # combine continuing and churned members
print(len(predict_data))
predict_data.head()
3946
Out[11]:
| | 연월 | customer_id | count_0 | count_1 | name | class | gender | start_date | end_date | campaign_id | is_deleted | class_name | price | campaign_name | mean | median | max | min | routine_flg | calc_date | membership_period | exit_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 201901 | AS551109 | 8 | 8.0 | XXX | C01 | F | 2018-10-15 | NaN | CA1 | 0.0 | 종일 | 10500.0 | 일반 | 7.333333 | 8.0 | 8.0 | 5.0 | 1.0 | 2019-04-30 | 6.0 | NaT |
| 1 | 201806 | OA124956 | 7 | 7.0 | XXXXX | C01 | F | 2016-11-01 | NaN | CA1 | 0.0 | 종일 | 10500.0 | 일반 | 5.333333 | 5.0 | 7.0 | 3.0 | 1.0 | 2019-04-30 | 29.0 | NaT |
| 2 | 201811 | IK983460 | 5 | 6.0 | XXXXXX | C02 | M | 2015-10-01 | NaN | CA1 | 0.0 | 주간 | 7500.0 | 일반 | 5.250000 | 5.0 | 8.0 | 3.0 | 1.0 | 2019-04-30 | 42.0 | NaT |
| 3 | 201901 | IK155697 | 5 | 5.0 | XXXX | C01 | F | 2015-08-01 | NaN | CA1 | 0.0 | 종일 | 10500.0 | 일반 | 4.250000 | 5.0 | 6.0 | 1.0 | 1.0 | 2019-04-30 | 44.0 | NaT |
| 4 | 201903 | AS515526 | 9 | 4.0 | XXXXX | C03 | M | 2015-11-01 | NaN | CA1 | 0.0 | 야간 | 6000.0 | 일반 | 5.000000 | 5.0 | 9.0 | 2.0 | 1.0 | 2019-04-30 | 41.0 | NaT |
In [12]:
predict_data.tail()
Out[12]:
| | 연월 | customer_id | count_0 | count_1 | name | class | gender | start_date | end_date | campaign_id | is_deleted | class_name | price | campaign_name | mean | median | max | min | routine_flg | calc_date | membership_period | exit_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3941 | 201902 | TS645212 | 4 | 2.0 | XXXX | C03 | F | 2018-03-01 | 2019-03-31 00:00:00 | CA1 | 1.0 | 야간 | 6000.0 | 일반 | 4.50 | 4.5 | 7.0 | 1.0 | 0.0 | 2019-03-31 | 12.0 | 2019-02-28 |
| 3942 | 201902 | TS741703 | 5 | 6.0 | XXXX | C03 | M | 2018-12-08 | 2019-03-31 00:00:00 | CA3 | 1.0 | 야간 | 6000.0 | 입회비무료 | 6.25 | 6.0 | 8.0 | 5.0 | 0.0 | 2019-03-31 | 3.0 | 2019-02-28 |
| 3943 | 201902 | TS859258 | 1 | 3.0 | XXXXX | C02 | F | 2018-12-07 | 2019-03-31 00:00:00 | CA3 | 1.0 | 주간 | 7500.0 | 입회비무료 | 2.50 | 2.0 | 5.0 | 1.0 | 0.0 | 2019-03-31 | 3.0 | 2019-02-28 |
| 3944 | 201902 | TS886985 | 5 | 3.0 | XXX | C02 | F | 2018-03-01 | 2019-03-31 00:00:00 | CA1 | 1.0 | 주간 | 7500.0 | 일반 | 4.25 | 4.0 | 7.0 | 2.0 | 1.0 | 2019-03-31 | 12.0 | 2019-02-28 |
| 3945 | 201902 | TS921837 | 2 | 3.0 | XXXXXX | C01 | M | 2018-06-04 | 2019-03-31 00:00:00 | CA2 | 1.0 | 종일 | 10500.0 | 입회비반액할인 | 4.00 | 3.5 | 9.0 | 2.0 | 1.0 | 2019-03-31 | 9.0 | 2019-02-28 |
Computing the membership period as of the prediction month
In [13]:
predict_data["period"] = 0
predict_data["now_date"] = pd.to_datetime(predict_data["연월"], format="%Y%m")
predict_data["start_date"] = pd.to_datetime(predict_data["start_date"])
for i in range(len(predict_data)):
    delta = relativedelta(predict_data["now_date"][i], predict_data["start_date"][i])
    predict_data.loc[i, "period"] = int(delta.years*12 + delta.months) # .loc avoids chained assignment
predict_data.head()
Out[13]:
| | 연월 | customer_id | count_0 | count_1 | name | class | gender | start_date | end_date | campaign_id | is_deleted | class_name | price | campaign_name | mean | median | max | min | routine_flg | calc_date | membership_period | exit_date | period | now_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 201901 | AS551109 | 8 | 8.0 | XXX | C01 | F | 2018-10-15 | NaN | CA1 | 0.0 | 종일 | 10500.0 | 일반 | 7.333333 | 8.0 | 8.0 | 5.0 | 1.0 | 2019-04-30 | 6.0 | NaT | 2 | 2019-01-01 |
| 1 | 201806 | OA124956 | 7 | 7.0 | XXXXX | C01 | F | 2016-11-01 | NaN | CA1 | 0.0 | 종일 | 10500.0 | 일반 | 5.333333 | 5.0 | 7.0 | 3.0 | 1.0 | 2019-04-30 | 29.0 | NaT | 19 | 2018-06-01 |
| 2 | 201811 | IK983460 | 5 | 6.0 | XXXXXX | C02 | M | 2015-10-01 | NaN | CA1 | 0.0 | 주간 | 7500.0 | 일반 | 5.250000 | 5.0 | 8.0 | 3.0 | 1.0 | 2019-04-30 | 42.0 | NaT | 37 | 2018-11-01 |
| 3 | 201901 | IK155697 | 5 | 5.0 | XXXX | C01 | F | 2015-08-01 | NaN | CA1 | 0.0 | 종일 | 10500.0 | 일반 | 4.250000 | 5.0 | 6.0 | 1.0 | 1.0 | 2019-04-30 | 44.0 | NaT | 41 | 2019-01-01 |
| 4 | 201903 | AS515526 | 9 | 4.0 | XXXXX | C03 | M | 2015-11-01 | NaN | CA1 | 0.0 | 야간 | 6000.0 | 일반 | 5.000000 | 5.0 | 9.0 | 2.0 | 1.0 | 2019-04-30 | 41.0 | NaT | 40 | 2019-03-01 |
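The `period` value is the number of whole months between `start_date` and `now_date`. A minimal standard-library sketch of that arithmetic is shown below; `months_between` is a hypothetical helper, and it assumes `now` is the first day of a month, as `now_date` is here. For other days, `relativedelta` clips month-end dates and may give a different result.

```python
from datetime import date

def months_between(now: date, start: date) -> int:
    """Whole months elapsed from start to now."""
    months = (now.year - start.year) * 12 + (now.month - start.month)
    if now.day < start.day:
        months -= 1  # the last month has not been completed yet
    return months

# Date pairs taken from the first two rows of the table above
print(months_between(date(2019, 1, 1), date(2018, 10, 15)))  # 2
print(months_between(date(2018, 6, 1), date(2016, 11, 1)))   # 19
```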
Removing missing values
In [15]:
predict_data.isna().sum() # number of missing values per column
Out[15]:
연월 0
customer_id 0
count_0 0
count_1 260
name 0
class 0
gender 0
start_date 0
end_date 2842
campaign_id 0
is_deleted 0
class_name 0
price 0
campaign_name 0
mean 0
median 0
max 0
min 0
routine_flg 0
calc_date 0
membership_period 0
exit_date 2842
period 0
now_date 0
dtype: int64
In [17]:
predict_data = predict_data.dropna(subset=["count_1"])
predict_data.isna().sum()
Out[17]:
연월 0
customer_id 0
count_0 0
count_1 0
name 0
class 0
gender 0
start_date 0
end_date 2634
campaign_id 0
is_deleted 0
class_name 0
price 0
campaign_name 0
mean 0
median 0
max 0
min 0
routine_flg 0
calc_date 0
membership_period 0
exit_date 2634
period 0
now_date 0
dtype: int64
Preparing string variables for the model
Dummy variables are created so that the categorical variables can be handled.
In [18]:
target_col = ["campaign_name", "class_name", "gender", "count_1", "routine_flg", "period", "is_deleted"]
predict_data = predict_data[target_col]
predict_data.head()
Out[18]:
| | campaign_name | class_name | gender | count_1 | routine_flg | period | is_deleted |
|---|---|---|---|---|---|---|---|
| 0 | 일반 | 종일 | F | 8.0 | 1.0 | 2 | 0.0 |
| 1 | 일반 | 종일 | F | 7.0 | 1.0 | 19 | 0.0 |
| 2 | 일반 | 주간 | M | 6.0 | 1.0 | 37 | 0.0 |
| 3 | 일반 | 종일 | F | 5.0 | 1.0 | 41 | 0.0 |
| 4 | 일반 | 야간 | M | 4.0 | 1.0 | 40 | 0.0 |
In [19]:
predict_data = pd.get_dummies(predict_data)
predict_data.head()
Out[19]:
| | count_1 | routine_flg | period | is_deleted | campaign_name_일반 | campaign_name_입회비무료 | campaign_name_입회비반액할인 | class_name_야간 | class_name_종일 | class_name_주간 | gender_F | gender_M |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.0 | 1.0 | 2 | 0.0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 1 | 7.0 | 1.0 | 19 | 0.0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 2 | 6.0 | 1.0 | 37 | 0.0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 3 | 5.0 | 1.0 | 41 | 0.0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 4 | 4.0 | 1.0 | 40 | 0.0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
In [20]:
del predict_data["campaign_name_일반"]
del predict_data["class_name_야간"]
del predict_data["gender_M"]
predict_data.head() # columns whose information can be derived from the other columns were dropped
Out[20]:
| | count_1 | routine_flg | period | is_deleted | campaign_name_입회비무료 | campaign_name_입회비반액할인 | class_name_종일 | class_name_주간 | gender_F |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.0 | 1.0 | 2 | 0.0 | 0 | 0 | 1 | 0 | 1 |
| 1 | 7.0 | 1.0 | 19 | 0.0 | 0 | 0 | 1 | 0 | 1 |
| 2 | 6.0 | 1.0 | 37 | 0.0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 5.0 | 1.0 | 41 | 0.0 | 0 | 0 | 1 | 0 | 1 |
| 4 | 4.0 | 1.0 | 40 | 0.0 | 0 | 0 | 0 | 0 | 0 |
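Dropping one dummy column per category can also be done inside `pd.get_dummies` with `drop_first=True`. The sketch below uses a made-up three-row frame. Note that `drop_first` removes the first level in sorted order, so it would drop `gender_F` rather than `gender_M`; the baseline category differs from the manual deletion above, although the information content is the same.

```python
import pandas as pd

# Toy frame with two categorical columns and one numeric column (illustrative values)
df = pd.DataFrame({
    "class_name": ["종일", "주간", "야간"],
    "gender": ["F", "M", "F"],
    "count_1": [8.0, 6.0, 4.0],
})

# One dummy column per level, minus the first level of each categorical column
encoded = pd.get_dummies(df, drop_first=True)
print(encoded.columns.tolist())
```

When a specific baseline is wanted (for example `일반` for `campaign_name`), deleting the chosen columns explicitly, as the notebook does, keeps that choice visible.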
Using a decision tree
In [21]:
from sklearn.tree import DecisionTreeClassifier
import sklearn.model_selection
exit = predict_data.loc[predict_data["is_deleted"]==1]
conti = predict_data.loc[predict_data["is_deleted"]==0].sample(len(exit)) # sample as many continuing members as there are churned members
In [22]:
X = pd.concat([exit, conti], ignore_index=True)
y = X["is_deleted"]
del X["is_deleted"]
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y)
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)
y_test_pred = model.predict(X_test)
print(y_test_pred)
[1. 0. 1. 1. 0. 0. 1. 0. 1. 1. 1. 1. 0. 1. 0. 0. 1. 1. 0. 0. 1. 0. 0. 0.
0. 1. 0. 1. 1. 0. 1. 1. 1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1. 1. 0. 1. 0. 1.
1. 0. 1. 0. 1. 1. 0. 1. 0. 0. 1. 1. 0. 1. 0. 0. 0. 1. 1. 1. 0. 0. 0. 0.
0. 0. 1. 1. 0. 0. 1. 1. 1. 1. 0. 0. 1. 1. 0. 0. 0. 1. 0. 0. 0. 1. 0. 1.
1. 1. 1. 0. 0. 1. 0. 1. 0. 0. 0. 1. 1. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.
0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 1. 1. 0. 1. 1. 0. 0. 0. 1. 1. 0. 1. 1. 1.
1. 1. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 1. 0. 0. 0. 0.
0. 1. 0. 1. 1. 1. 1. 1. 0. 0. 0. 0. 1. 0. 1. 0. 0. 1. 1. 1. 1. 0. 1. 0.
1. 0. 1. 0. 1. 1. 1. 1. 0. 1. 1. 0. 0. 0. 0. 1. 0. 1. 1. 1. 1. 0. 1. 1.
1. 1. 0. 0. 1. 1. 0. 1. 0. 0. 1. 1. 1. 1. 1. 0. 1. 0. 0. 1. 0. 0. 1. 1.
1. 1. 1. 0. 1. 1. 0. 0. 0. 1. 0. 0. 1. 1. 0. 1. 0. 0. 0. 1. 0. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 0. 0. 0. 0. 0. 1. 0. 1. 1. 1. 0. 0. 0. 1.
0. 0. 0. 1. 0. 0. 1. 0. 1. 1. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 1. 1. 1.
1. 1. 1. 1. 0. 0. 1. 0. 1. 1. 1. 1. 0. 1. 0. 0. 0. 0. 1. 1. 0. 1. 1. 1.
0. 0. 1. 1. 0. 0. 1. 1. 1. 1. 1. 0. 1. 0. 1. 1. 0. 1. 1. 1. 1. 0. 0. 0.
1. 0. 0. 1. 1. 0. 1. 0. 1. 1. 0. 0. 1. 1. 0. 0. 1. 0. 1. 0. 0. 0. 1. 0.
0. 0. 0. 1. 1. 0. 1. 0. 0. 1. 0. 0. 1. 0. 1. 1. 1. 0. 1. 0. 1. 1. 0. 0.
0. 1. 0. 0. 0. 1. 0. 1. 1. 1. 1. 0. 0. 1. 0. 0. 0. 1. 0. 0. 1. 1. 0. 1.
0. 0. 1. 0. 1. 1. 1. 0. 1. 0. 1. 0. 0. 1. 0. 1. 1. 1. 0. 1. 1. 1. 1. 1.
1. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 1. 0. 1. 0. 0. 1. 0. 1. 1. 1. 0. 0. 1.
1. 0. 0. 0. 0. 1. 1. 1. 0. 0. 1. 1. 1. 1. 1. 0. 0. 0. 1. 0. 1. 0. 1. 1.
1. 1. 1. 1. 0. 0. 1. 1. 1. 0. 0. 1. 0. 1. 1. 1. 0. 0. 0. 1. 1. 0.]
In [23]:
results_test = pd.DataFrame({"y_test":y_test, "y_pred":y_test_pred})
results_test.head()
Out[23]:
| | y_test | y_pred |
|---|---|---|
| 217 | 1.0 | 1.0 |
| 1995 | 0.0 | 0.0 |
| 304 | 1.0 | 1.0 |
| 791 | 1.0 | 1.0 |
| 1911 | 0.0 | 0.0 |
Model evaluation and tuning
In [24]:
correct = len(results_test.loc[results_test["y_test"]==results_test["y_pred"]])
data_count = len(results_test)
score_test = correct / data_count
print(score_test)
0.8916349809885932
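Accuracy alone does not show which kind of error the model makes. As a complementary sketch, the counts of a confusion matrix and the precision and recall for the churn class (label 1) can be computed by hand. The labels below are made up for illustration and are not the notebook's predictions; `confusion_counts` is a hypothetical helper.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) treating `positive` as the churn class."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1  # predicted churn, actually churned
        elif p == positive:
            fp += 1  # predicted churn, actually stayed
        elif t == positive:
            fn += 1  # predicted stay, actually churned
        else:
            tn += 1  # predicted stay, actually stayed
    return tp, fp, fn, tn

# Illustrative labels only
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)  # share of predicted churners who really churned
recall = tp / (tp + fn)     # share of real churners that were caught
print(accuracy, precision, recall)
```

For churn prevention, recall on the churn class is often the number to watch, since a missed churner cannot be targeted by a retention measure.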
In [25]:
print(model.score(X_test, y_test))
print(model.score(X_train, y_train)) # overfitting: training accuracy is much higher than test accuracy
0.8916349809885932
0.9816223067173637
In [26]:
model = DecisionTreeClassifier(random_state=0, max_depth=5)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
print(model.score(X_train, y_train))
0.9258555133079848
0.926489226869455
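Limiting `max_depth` narrowed the gap between training and test accuracy. To see how that gap behaves across several depths, a sweep like the one below can help. This is a sketch on synthetic data from `make_classification`, not on the gym data, so the scores will differ from those in the notebook; the list of depths is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (a stand-in for the churn features)
X, y = make_classification(n_samples=1000, n_features=8, n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scores = {}
for depth in [2, 3, 5, 8, None]:  # None lets the tree grow without a depth limit
    model = DecisionTreeClassifier(random_state=0, max_depth=depth)
    model.fit(X_train, y_train)
    scores[depth] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(depth, scores[depth])
```

A training score far above the test score at large depths is the same overfitting signal seen in the notebook; a depth where the two scores are close is usually a reasonable starting point.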
Checking the contribution of each variable
In [27]:
importance = pd.DataFrame({"feature_names": X.columns, "coefficient":model.feature_importances_})
importance
Out[27]:
| | feature_names | coefficient |
|---|---|---|
| 0 | count_1 | 0.325178 |
| 1 | routine_flg | 0.120963 |
| 2 | period | 0.549796 |
| 3 | campaign_name_입회비무료 | 0.000000 |
| 4 | campaign_name_입회비반액할인 | 0.003784 |
| 5 | class_name_종일 | 0.000279 |
| 6 | class_name_주간 | 0.000000 |
| 7 | gender_F | 0.000000 |
The previous month's usage count, the routine-use flag, and the membership period are the variables that contribute to the prediction.
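Sorting the table makes the ranking easier to read. The sketch below rebuilds the frame from the values printed in Out[27] and sorts it in descending order. The values of `feature_importances_` sum to 1, so each one can be read as a share of the total.

```python
import pandas as pd

# Values copied from the Out[27] table above
importance = pd.DataFrame({
    "feature_names": [
        "count_1", "routine_flg", "period",
        "campaign_name_입회비무료", "campaign_name_입회비반액할인",
        "class_name_종일", "class_name_주간", "gender_F",
    ],
    "coefficient": [0.325178, 0.120963, 0.549796, 0.0, 0.003784, 0.000279, 0.0, 0.0],
})

# Largest contribution first
ranked = importance.sort_values("coefficient", ascending=False).reset_index(drop=True)
print(ranked.head(3))
```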
Predicting churn
In [28]:
count_1 = 3
routine_flg = 1
period = 10
campaign_name = "입회비무료"
class_name = "종일"
gender = "M"
In [29]:
# Order must match the training columns: [campaign_name_입회비무료, campaign_name_입회비반액할인]
if campaign_name == "입회비무료":
    campaign_name_list = [1,0]
elif campaign_name == "입회비반액할인":
    campaign_name_list = [0,1]
elif campaign_name == "일반":
    campaign_name_list = [0,0]
# [class_name_종일, class_name_주간]
if class_name == "종일":
    class_name_list = [1,0]
elif class_name == "주간":
    class_name_list = [0,1]
elif class_name == "야간":
    class_name_list = [0,0]
# [gender_F]
if gender == "F":
    gender_list = [1]
elif gender == "M":
    gender_list = [0]
input_data = [count_1, routine_flg, period]
input_data.extend(campaign_name_list)
input_data.extend(class_name_list)
input_data.extend(gender_list)
In [32]:
print(model.predict([input_data]))
print(model.predict_proba([input_data]))
[1.]
[[0. 1.]]
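Building the input vector by hand depends on remembering the exact order of the dummy columns. One way to reduce that risk, sketched below, is to encode the new customer with `pd.get_dummies` and then `reindex` to the training columns. `feature_columns` is a name introduced here for illustration and is filled in from the Out[20] table; in the notebook the same list would be available as `X.columns`.

```python
import pandas as pd

# Training feature columns, in the order shown in Out[20] (without is_deleted)
feature_columns = [
    "count_1", "routine_flg", "period",
    "campaign_name_입회비무료", "campaign_name_입회비반액할인",
    "class_name_종일", "class_name_주간", "gender_F",
]

# The same customer as in the cells above
new_customer = pd.DataFrame([{
    "count_1": 3, "routine_flg": 1, "period": 10,
    "campaign_name": "입회비무료", "class_name": "종일", "gender": "M",
}])

# Encode, then align to the training layout: missing dummies become 0,
# and dummies the model never saw (such as gender_M) are dropped
encoded = (
    pd.get_dummies(new_customer)
    .reindex(columns=feature_columns, fill_value=0)
    .astype(float)
)
input_data = encoded.iloc[0].tolist()
print(input_data)
```

The resulting list can be passed to `model.predict([input_data])` in the same way as the hand-built one.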