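Chapter 5: 10 Techniques for Predicting Member Churn¶
The clustering-based behavior analysis introduced in the previous chapter is a technique with many possible applications, depending on how it is used. Once behavior patterns can be analyzed, it becomes possible to predict with reasonable accuracy which customers are likely to cancel, and to prepare retention measures in advance.
This chapter walks through the flow of predicting churn with a decision tree, a supervised-learning classification algorithm.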
In [1]:
import pandas as pd
# Load Dataset
folder_p = '/content/drive/MyDrive/파이썬데이터분석실무테크닉100/pyda100/5장/'
customer = pd.read_csv(folder_p+'customer_join.csv')
uselog_months = pd.read_csv(folder_p+'use_log_months.csv')
To predict the future, we build the dataset from only the usage history of the given month and of the month before it.
In [2]:
year_months = list(uselog_months["연월"].unique())
uselog = pd.DataFrame()
for i in range(1, len(year_months)):
    # Usage data for the current month (.copy() avoids SettingWithCopyWarning)
    tmp = uselog_months.loc[uselog_months["연월"]==year_months[i]].copy()
    tmp.rename(columns={"count":"count_0"}, inplace=True)
    # Usage history from one month earlier
    tmp_before = uselog_months.loc[uselog_months["연월"]==year_months[i-1]].copy()
    del tmp_before["연월"]
    tmp_before.rename(columns={"count":"count_1"}, inplace=True)
    tmp = pd.merge(tmp, tmp_before, on="customer_id", how="left")
    uselog = pd.concat([uselog, tmp], ignore_index=True)
uselog.head()
Out[2]:
| | 연월 | customer_id | count_0 | count_1 |
---|---|---|---|---|
0 | 201805 | AS002855 | 5 | 4.0 |
1 | 201805 | AS009373 | 4 | 3.0 |
2 | 201805 | AS015233 | 7 | NaN |
3 | 201805 | AS015315 | 3 | 6.0 |
4 | 201805 | AS015739 | 5 | 7.0 |
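The self-merge above can be tried in isolation. The following is a minimal sketch on made-up data (the customer IDs and counts are hypothetical; only the column names follow the usage log, where `연월` is the year-month). It shows why a member with no row in the previous month ends up with `NaN` in `count_1`.

```python
import pandas as pd

# Toy monthly usage log (hypothetical values, same column names as the usage log)
toy = pd.DataFrame({
    "연월": [201804, 201804, 201805, 201805, 201805],
    "customer_id": ["A", "B", "A", "B", "C"],
    "count": [4, 3, 5, 4, 7],
})

# Current month: its usage count becomes count_0
current = toy.loc[toy["연월"] == 201805].rename(columns={"count": "count_0"})

# Previous month: keep only the key and the count, renamed to count_1
previous = (
    toy.loc[toy["연월"] == 201804, ["customer_id", "count"]]
    .rename(columns={"count": "count_1"})
)

# The left join keeps every customer of the current month; customers without
# a previous-month row (here "C") get NaN in count_1.
lagged = pd.merge(current, previous, on="customer_id", how="left")
print(lagged)
```

Because `count_1` contains `NaN`, pandas stores it as a float column, which is why the notebook output shows `4.0` rather than `4`.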
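Building data for churned customers in the month before cancellation¶
- Why build the data for the month before cancellation rather than for the cancellation month itself?
- The purpose of predicting churn is to prevent it.
- At the sports center that provided this data, a member must submit a cancellation request by the end of a month in order to leave at the end of the following month.
In other words, we predict the likelihood of a cancellation request from the data of the month before the cancellation takes effect.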
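As a small illustration of that one-month shift, the sketch below takes a single end date and derives the year-month key of the month before it. The date is only an example value and `request_month` is a name introduced here for illustration.

```python
import pandas as pd
from dateutil.relativedelta import relativedelta

# Example: a membership that ends at the end of June 2018
end_date = pd.Timestamp("2018-06-30")

# One month earlier is the month in which the cancellation request was due
request_month = end_date - relativedelta(months=1)

# Year-month key in the same format as the usage log
print(request_month.strftime("%Y%m"))  # 201805
```

`relativedelta` clips the day to the length of the target month, so an end date of 2019-03-31 maps to 2019-02-28.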
In [3]:
from dateutil.relativedelta import relativedelta
exit_customer = customer.loc[customer["is_deleted"]==1].copy() # .copy() avoids SettingWithCopyWarning
exit_customer["exit_date"] = None
exit_customer["end_date"] = pd.to_datetime(exit_customer["end_date"]) # convert to datetime
In [4]:
# exit_date = cancellation date (end_date) minus one month
exit_customer["exit_date"] = exit_customer["end_date"].apply(lambda d: d - relativedelta(months=1))
In [5]:
exit_customer["exit_date"] = pd.to_datetime(exit_customer["exit_date"]) # convert to datetime once more
exit_customer["연월"] = exit_customer["exit_date"].dt.strftime("%Y%m")
uselog["연월"] = uselog["연월"].astype(str)
exit_uselog = pd.merge(uselog, exit_customer, on=["customer_id", "연월"], how="left")
print(len(uselog))
exit_uselog.head()
33851
Out[5]:
| | 연월 | customer_id | count_0 | count_1 | name | class | gender | start_date | end_date | campaign_id | is_deleted | class_name | price | campaign_name | mean | median | max | min | routine_flg | calc_date | membership_period | exit_date |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 201805 | AS002855 | 5 | 4.0 | NaN | NaN | NaN | NaN | NaT | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaT |
1 | 201805 | AS009373 | 4 | 3.0 | NaN | NaN | NaN | NaN | NaT | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaT |
2 | 201805 | AS015233 | 7 | NaN | NaN | NaN | NaN | NaN | NaT | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaT |
3 | 201805 | AS015315 | 3 | 6.0 | NaN | NaN | NaN | NaN | NaT | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaT |
4 | 201805 | AS015739 | 5 | 7.0 | NaN | NaN | NaN | NaN | NaT | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaT |
In [7]:
exit_uselog = exit_uselog.dropna(subset=["name"])
print(len(exit_uselog))
print(len(exit_uselog["customer_id"].unique())) # same as the row count, so each churned customer appears exactly once
exit_uselog.head()
1104
1104
Out[7]:
| | 연월 | customer_id | count_0 | count_1 | name | class | gender | start_date | end_date | campaign_id | is_deleted | class_name | price | campaign_name | mean | median | max | min | routine_flg | calc_date | membership_period | exit_date |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
19 | 201805 | AS055680 | 3 | 3.0 | XXXXX | C01 | M | 2018-03-01 | 2018-06-30 | CA1 | 1.0 | 종일 | 10500.0 | 일반 | 3.000000 | 3.0 | 3.0 | 3.0 | 0.0 | 2018-06-30 | 3.0 | 2018-05-30 |
57 | 201805 | AS169823 | 2 | 3.0 | XX | C01 | M | 2017-11-01 | 2018-06-30 | CA1 | 1.0 | 종일 | 10500.0 | 일반 | 3.000000 | 3.0 | 4.0 | 2.0 | 1.0 | 2018-06-30 | 7.0 | 2018-05-30 |
110 | 201805 | AS305860 | 5 | 3.0 | XXXX | C01 | M | 2017-06-01 | 2018-06-30 | CA1 | 1.0 | 종일 | 10500.0 | 일반 | 3.333333 | 3.0 | 5.0 | 2.0 | 0.0 | 2018-06-30 | 12.0 | 2018-05-30 |
128 | 201805 | AS363699 | 5 | 3.0 | XXXXX | C01 | M | 2018-02-01 | 2018-06-30 | CA1 | 1.0 | 종일 | 10500.0 | 일반 | 3.333333 | 3.0 | 5.0 | 2.0 | 0.0 | 2018-06-30 | 4.0 | 2018-05-30 |
147 | 201805 | AS417696 | 1 | 4.0 | XX | C03 | F | 2017-09-01 | 2018-06-30 | CA1 | 1.0 | 야간 | 6000.0 | 일반 | 2.000000 | 1.0 | 4.0 | 1.0 | 0.0 | 2018-06-30 | 9.0 | 2018-05-30 |
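A left merge followed by `dropna(subset=["name"])` keeps only the rows that found a match, which is what an inner join does directly. The sketch below compares the two on made-up data. It assumes that `name` is never missing in the customer table itself; otherwise the two results could differ.

```python
import pandas as pd

# Hypothetical usage rows and one churned customer
usage = pd.DataFrame({
    "연월": ["201805", "201805", "201806"],
    "customer_id": ["A", "B", "A"],
    "count_0": [5, 4, 2],
})
churned = pd.DataFrame({
    "customer_id": ["A"],
    "연월": ["201805"],   # month before cancellation
    "name": ["XXX"],
})

# Approach used in the notebook: left join, then drop unmatched rows
left_then_drop = (
    pd.merge(usage, churned, on=["customer_id", "연월"], how="left")
    .dropna(subset=["name"])
    .reset_index(drop=True)
)

# Same rows in one step with an inner join
inner = pd.merge(usage, churned, on=["customer_id", "연월"], how="inner")
print(inner)
```

The left-join version has the advantage that the intermediate table can be inspected before the unmatched rows are dropped, as the notebook does with `head()`.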
Building data for continuing members¶
In [8]:
conti_customer = customer.loc[customer["is_deleted"]==0]
conti_uselog = pd.merge(uselog, conti_customer, on=["customer_id"], how="left")
print(len(conti_uselog))
conti_uselog = conti_uselog.dropna(subset=["name"])
print(len(conti_uselog))
33851
27422
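There are 1,104 rows for churned members, whereas continuing members still have 27,422 rows even after dropping the rows where name is missing, so the two classes are imbalanced.
We therefore adjust the sample size.
Instead of using each member's data for every period, we keep only one period per member, so that each member contributes a single row.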
In [9]:
conti_uselog = conti_uselog.sample(frac=1).reset_index(drop=True)
conti_uselog = conti_uselog.drop_duplicates(subset="customer_id")
print(len(conti_uselog))
conti_uselog.head()
2842
Out[9]:
| | 연월 | customer_id | count_0 | count_1 | name | class | gender | start_date | end_date | campaign_id | is_deleted | class_name | price | campaign_name | mean | median | max | min | routine_flg | calc_date | membership_period |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 201901 | AS551109 | 8 | 8.0 | XXX | C01 | F | 2018-10-15 | NaN | CA1 | 0.0 | 종일 | 10500.0 | 일반 | 7.333333 | 8.0 | 8.0 | 5.0 | 1.0 | 2019-04-30 | 6.0 |
1 | 201806 | OA124956 | 7 | 7.0 | XXXXX | C01 | F | 2016-11-01 | NaN | CA1 | 0.0 | 종일 | 10500.0 | 일반 | 5.333333 | 5.0 | 7.0 | 3.0 | 1.0 | 2019-04-30 | 29.0 |
2 | 201811 | IK983460 | 5 | 6.0 | XXXXXX | C02 | M | 2015-10-01 | NaN | CA1 | 0.0 | 주간 | 7500.0 | 일반 | 5.250000 | 5.0 | 8.0 | 3.0 | 1.0 | 2019-04-30 | 42.0 |
3 | 201901 | IK155697 | 5 | 5.0 | XXXX | C01 | F | 2015-08-01 | NaN | CA1 | 0.0 | 종일 | 10500.0 | 일반 | 4.250000 | 5.0 | 6.0 | 1.0 | 1.0 | 2019-04-30 | 44.0 |
4 | 201903 | AS515526 | 9 | 4.0 | XXXXX | C03 | M | 2015-11-01 | NaN | CA1 | 0.0 | 야간 | 6000.0 | 일반 | 5.000000 | 5.0 | 9.0 | 2.0 | 1.0 | 2019-04-30 | 41.0 |
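The shuffle above is not seeded, so the selected rows change on every run. As one possible alternative, the sketch below draws one row per customer with `groupby(...).sample(...)` and a fixed `random_state`. The data and the seed value are made up for illustration.

```python
import pandas as pd

# Hypothetical usage rows: several months per customer
toy = pd.DataFrame({
    "customer_id": ["A", "A", "A", "B", "B", "C"],
    "연월": ["201805", "201806", "201807", "201805", "201806", "201807"],
    "count_0": [5, 4, 6, 2, 3, 7],
})

# Draw exactly one row per customer; random_state makes the draw repeatable
one_per_customer = (
    toy.groupby("customer_id")
    .sample(n=1, random_state=0)
    .reset_index(drop=True)
)
print(one_per_customer)
```

Passing `random_state` to `sample(frac=1)` in the original cell would give the same reproducibility while keeping the shuffle-then-deduplicate approach.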
In [11]:
predict_data = pd.concat([conti_uselog, exit_uselog], ignore_index=True) # combine continuing and churned members
print(len(predict_data))
predict_data.head()
3946
Out[11]:
| | 연월 | customer_id | count_0 | count_1 | name | class | gender | start_date | end_date | campaign_id | is_deleted | class_name | price | campaign_name | mean | median | max | min | routine_flg | calc_date | membership_period | exit_date |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 201901 | AS551109 | 8 | 8.0 | XXX | C01 | F | 2018-10-15 | NaN | CA1 | 0.0 | 종일 | 10500.0 | 일반 | 7.333333 | 8.0 | 8.0 | 5.0 | 1.0 | 2019-04-30 | 6.0 | NaT |
1 | 201806 | OA124956 | 7 | 7.0 | XXXXX | C01 | F | 2016-11-01 | NaN | CA1 | 0.0 | 종일 | 10500.0 | 일반 | 5.333333 | 5.0 | 7.0 | 3.0 | 1.0 | 2019-04-30 | 29.0 | NaT |
2 | 201811 | IK983460 | 5 | 6.0 | XXXXXX | C02 | M | 2015-10-01 | NaN | CA1 | 0.0 | 주간 | 7500.0 | 일반 | 5.250000 | 5.0 | 8.0 | 3.0 | 1.0 | 2019-04-30 | 42.0 | NaT |
3 | 201901 | IK155697 | 5 | 5.0 | XXXX | C01 | F | 2015-08-01 | NaN | CA1 | 0.0 | 종일 | 10500.0 | 일반 | 4.250000 | 5.0 | 6.0 | 1.0 | 1.0 | 2019-04-30 | 44.0 | NaT |
4 | 201903 | AS515526 | 9 | 4.0 | XXXXX | C03 | M | 2015-11-01 | NaN | CA1 | 0.0 | 야간 | 6000.0 | 일반 | 5.000000 | 5.0 | 9.0 | 2.0 | 1.0 | 2019-04-30 | 41.0 | NaT |
In [12]:
predict_data.tail()
Out[12]:
| | 연월 | customer_id | count_0 | count_1 | name | class | gender | start_date | end_date | campaign_id | is_deleted | class_name | price | campaign_name | mean | median | max | min | routine_flg | calc_date | membership_period | exit_date |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3941 | 201902 | TS645212 | 4 | 2.0 | XXXX | C03 | F | 2018-03-01 | 2019-03-31 00:00:00 | CA1 | 1.0 | 야간 | 6000.0 | 일반 | 4.50 | 4.5 | 7.0 | 1.0 | 0.0 | 2019-03-31 | 12.0 | 2019-02-28 |
3942 | 201902 | TS741703 | 5 | 6.0 | XXXX | C03 | M | 2018-12-08 | 2019-03-31 00:00:00 | CA3 | 1.0 | 야간 | 6000.0 | 입회비무료 | 6.25 | 6.0 | 8.0 | 5.0 | 0.0 | 2019-03-31 | 3.0 | 2019-02-28 |
3943 | 201902 | TS859258 | 1 | 3.0 | XXXXX | C02 | F | 2018-12-07 | 2019-03-31 00:00:00 | CA3 | 1.0 | 주간 | 7500.0 | 입회비무료 | 2.50 | 2.0 | 5.0 | 1.0 | 0.0 | 2019-03-31 | 3.0 | 2019-02-28 |
3944 | 201902 | TS886985 | 5 | 3.0 | XXX | C02 | F | 2018-03-01 | 2019-03-31 00:00:00 | CA1 | 1.0 | 주간 | 7500.0 | 일반 | 4.25 | 4.0 | 7.0 | 2.0 | 1.0 | 2019-03-31 | 12.0 | 2019-02-28 |
3945 | 201902 | TS921837 | 2 | 3.0 | XXXXXX | C01 | M | 2018-06-04 | 2019-03-31 00:00:00 | CA2 | 1.0 | 종일 | 10500.0 | 입회비반액할인 | 4.00 | 3.5 | 9.0 | 2.0 | 1.0 | 2019-03-31 | 9.0 | 2019-02-28 |
Computing the membership period as of the prediction month¶
In [13]:
predict_data["period"] = 0
predict_data["now_date"] = pd.to_datetime(predict_data["연월"], format="%Y%m")
predict_data["start_date"] = pd.to_datetime(predict_data["start_date"])
for i in range(len(predict_data)):
    delta = relativedelta(predict_data["now_date"][i], predict_data["start_date"][i])
    # .loc writes into predict_data itself; chained indexing may write to a copy
    predict_data.loc[i, "period"] = int(delta.years*12 + delta.months)
predict_data.head()
Out[13]:
| | 연월 | customer_id | count_0 | count_1 | name | class | gender | start_date | end_date | campaign_id | is_deleted | class_name | price | campaign_name | mean | median | max | min | routine_flg | calc_date | membership_period | exit_date | period | now_date |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 201901 | AS551109 | 8 | 8.0 | XXX | C01 | F | 2018-10-15 | NaN | CA1 | 0.0 | 종일 | 10500.0 | 일반 | 7.333333 | 8.0 | 8.0 | 5.0 | 1.0 | 2019-04-30 | 6.0 | NaT | 2 | 2019-01-01 |
1 | 201806 | OA124956 | 7 | 7.0 | XXXXX | C01 | F | 2016-11-01 | NaN | CA1 | 0.0 | 종일 | 10500.0 | 일반 | 5.333333 | 5.0 | 7.0 | 3.0 | 1.0 | 2019-04-30 | 29.0 | NaT | 19 | 2018-06-01 |
2 | 201811 | IK983460 | 5 | 6.0 | XXXXXX | C02 | M | 2015-10-01 | NaN | CA1 | 0.0 | 주간 | 7500.0 | 일반 | 5.250000 | 5.0 | 8.0 | 3.0 | 1.0 | 2019-04-30 | 42.0 | NaT | 37 | 2018-11-01 |
3 | 201901 | IK155697 | 5 | 5.0 | XXXX | C01 | F | 2015-08-01 | NaN | CA1 | 0.0 | 종일 | 10500.0 | 일반 | 4.250000 | 5.0 | 6.0 | 1.0 | 1.0 | 2019-04-30 | 44.0 | NaT | 41 | 2019-01-01 |
4 | 201903 | AS515526 | 9 | 4.0 | XXXXX | C03 | M | 2015-11-01 | NaN | CA1 | 0.0 | 야간 | 6000.0 | 일반 | 5.000000 | 5.0 | 9.0 | 2.0 | 1.0 | 2019-04-30 | 41.0 | NaT | 40 | 2019-03-01 |
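The row-by-row loop can also be written as column arithmetic. The sketch below is one way to do it, using the first three date pairs from the table above. It is meant for the case here, where `now_date` is always the first day of a month; for arbitrary dates, month-end clipping in `relativedelta` could make the two approaches disagree.

```python
import pandas as pd
from dateutil.relativedelta import relativedelta

# First three (now_date, start_date) pairs from the table above
dates = pd.DataFrame({
    "now_date": pd.to_datetime(["2019-01-01", "2018-06-01", "2018-11-01"]),
    "start_date": pd.to_datetime(["2018-10-15", "2016-11-01", "2015-10-01"]),
})

now, start = dates["now_date"], dates["start_date"]

# Whole months elapsed: calendar-month difference, minus one when the
# day-of-month of now_date has not yet reached that of start_date.
dates["period"] = (
    (now.dt.year - start.dt.year) * 12
    + (now.dt.month - start.dt.month)
    - (now.dt.day < start.dt.day).astype(int)
)
print(dates["period"].tolist())  # [2, 19, 37]
```

The printed values match the `period` column of the first three rows in the notebook output.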
Removing missing values¶
In [15]:
predict_data.isna().sum() # number of missing values per column
Out[15]:
연월 0
customer_id 0
count_0 0
count_1 260
name 0
class 0
gender 0
start_date 0
end_date 2842
campaign_id 0
is_deleted 0
class_name 0
price 0
campaign_name 0
mean 0
median 0
max 0
min 0
routine_flg 0
calc_date 0
membership_period 0
exit_date 2842
period 0
now_date 0
dtype: int64
In [17]:
predict_data = predict_data.dropna(subset=["count_1"])
predict_data.isna().sum()
Out[17]:
연월 0
customer_id 0
count_0 0
count_1 0
name 0
class 0
gender 0
start_date 0
end_date 2634
campaign_id 0
is_deleted 0
class_name 0
price 0
campaign_name 0
mean 0
median 0
max 0
min 0
routine_flg 0
calc_date 0
membership_period 0
exit_date 2634
period 0
now_date 0
dtype: int64
Preparing string variables for the model¶
Create dummy variables so that the model can handle the categorical variables.
In [18]:
target_col = ["campaign_name", "class_name", "gender", "count_1", "routine_flg", "period", "is_deleted"]
predict_data = predict_data[target_col]
predict_data.head()
Out[18]:
| | campaign_name | class_name | gender | count_1 | routine_flg | period | is_deleted |
---|---|---|---|---|---|---|---|
0 | 일반 | 종일 | F | 8.0 | 1.0 | 2 | 0.0 |
1 | 일반 | 종일 | F | 7.0 | 1.0 | 19 | 0.0 |
2 | 일반 | 주간 | M | 6.0 | 1.0 | 37 | 0.0 |
3 | 일반 | 종일 | F | 5.0 | 1.0 | 41 | 0.0 |
4 | 일반 | 야간 | M | 4.0 | 1.0 | 40 | 0.0 |
In [19]:
predict_data = pd.get_dummies(predict_data)
predict_data.head()
Out[19]:
| | count_1 | routine_flg | period | is_deleted | campaign_name_일반 | campaign_name_입회비무료 | campaign_name_입회비반액할인 | class_name_야간 | class_name_종일 | class_name_주간 | gender_F | gender_M |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 8.0 | 1.0 | 2 | 0.0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
1 | 7.0 | 1.0 | 19 | 0.0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
2 | 6.0 | 1.0 | 37 | 0.0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
3 | 5.0 | 1.0 | 41 | 0.0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
4 | 4.0 | 1.0 | 40 | 0.0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
In [20]:
del predict_data["campaign_name_일반"]
del predict_data["class_name_야간"]
del predict_data["gender_M"]
predict_data.head() # columns whose information can be derived from the remaining dummy columns are dropped
Out[20]:
| | count_1 | routine_flg | period | is_deleted | campaign_name_입회비무료 | campaign_name_입회비반액할인 | class_name_종일 | class_name_주간 | gender_F |
---|---|---|---|---|---|---|---|---|---|
0 | 8.0 | 1.0 | 2 | 0.0 | 0 | 0 | 1 | 0 | 1 |
1 | 7.0 | 1.0 | 19 | 0.0 | 0 | 0 | 1 | 0 | 1 |
2 | 6.0 | 1.0 | 37 | 0.0 | 0 | 0 | 0 | 1 | 0 |
3 | 5.0 | 1.0 | 41 | 0.0 | 0 | 0 | 1 | 0 | 1 |
4 | 4.0 | 1.0 | 40 | 0.0 | 0 | 0 | 0 | 0 | 0 |
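Dropping one dummy per category can also be done inside `pd.get_dummies` with `drop_first=True`. The sketch below shows this on made-up rows. Note that it always drops the first category in sorted order, so it would keep `gender_M` rather than the `gender_F` column kept in this post.

```python
import pandas as pd

# Hypothetical rows using category values from the data
toy = pd.DataFrame({
    "class_name": ["종일", "주간", "야간"],
    "gender": ["F", "M", "M"],
})

# drop_first=True removes the first (sorted) category of each column, so the
# remaining dummies carry the same information without a redundant column.
encoded = pd.get_dummies(
    toy, columns=["class_name", "gender"], drop_first=True, dtype=int
)
print(encoded.columns.tolist())
```

Deleting the columns by hand, as done above, gives control over which category serves as the baseline; `drop_first=True` trades that control for brevity.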
Using a decision tree¶
In [21]:
from sklearn.tree import DecisionTreeClassifier
import sklearn.model_selection
exit = predict_data.loc[predict_data["is_deleted"]==1]
conti = predict_data.loc[predict_data["is_deleted"]==0].sample(len(exit)) # sample as many continuing members as there are churned members
In [22]:
X = pd.concat([exit, conti], ignore_index=True)
y = X["is_deleted"]
del X["is_deleted"]
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y)
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)
y_test_pred = model.predict(X_test)
print(y_test_pred)
[1. 0. 1. 1. 0. 0. 1. 0. 1. 1. 1. 1. 0. 1. 0. 0. 1. 1. 0. 0. 1. 0. 0. 0.
0. 1. 0. 1. 1. 0. 1. 1. 1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1. 1. 0. 1. 0. 1.
1. 0. 1. 0. 1. 1. 0. 1. 0. 0. 1. 1. 0. 1. 0. 0. 0. 1. 1. 1. 0. 0. 0. 0.
0. 0. 1. 1. 0. 0. 1. 1. 1. 1. 0. 0. 1. 1. 0. 0. 0. 1. 0. 0. 0. 1. 0. 1.
1. 1. 1. 0. 0. 1. 0. 1. 0. 0. 0. 1. 1. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.
0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 1. 1. 0. 1. 1. 0. 0. 0. 1. 1. 0. 1. 1. 1.
1. 1. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 1. 0. 0. 0. 0.
0. 1. 0. 1. 1. 1. 1. 1. 0. 0. 0. 0. 1. 0. 1. 0. 0. 1. 1. 1. 1. 0. 1. 0.
1. 0. 1. 0. 1. 1. 1. 1. 0. 1. 1. 0. 0. 0. 0. 1. 0. 1. 1. 1. 1. 0. 1. 1.
1. 1. 0. 0. 1. 1. 0. 1. 0. 0. 1. 1. 1. 1. 1. 0. 1. 0. 0. 1. 0. 0. 1. 1.
1. 1. 1. 0. 1. 1. 0. 0. 0. 1. 0. 0. 1. 1. 0. 1. 0. 0. 0. 1. 0. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 0. 0. 0. 0. 0. 1. 0. 1. 1. 1. 0. 0. 0. 1.
0. 0. 0. 1. 0. 0. 1. 0. 1. 1. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 1. 1. 1.
1. 1. 1. 1. 0. 0. 1. 0. 1. 1. 1. 1. 0. 1. 0. 0. 0. 0. 1. 1. 0. 1. 1. 1.
0. 0. 1. 1. 0. 0. 1. 1. 1. 1. 1. 0. 1. 0. 1. 1. 0. 1. 1. 1. 1. 0. 0. 0.
1. 0. 0. 1. 1. 0. 1. 0. 1. 1. 0. 0. 1. 1. 0. 0. 1. 0. 1. 0. 0. 0. 1. 0.
0. 0. 0. 1. 1. 0. 1. 0. 0. 1. 0. 0. 1. 0. 1. 1. 1. 0. 1. 0. 1. 1. 0. 0.
0. 1. 0. 0. 0. 1. 0. 1. 1. 1. 1. 0. 0. 1. 0. 0. 0. 1. 0. 0. 1. 1. 0. 1.
0. 0. 1. 0. 1. 1. 1. 0. 1. 0. 1. 0. 0. 1. 0. 1. 1. 1. 0. 1. 1. 1. 1. 1.
1. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 1. 0. 1. 0. 0. 1. 0. 1. 1. 1. 0. 0. 1.
1. 0. 0. 0. 0. 1. 1. 1. 0. 0. 1. 1. 1. 1. 1. 0. 0. 0. 1. 0. 1. 0. 1. 1.
1. 1. 1. 1. 0. 0. 1. 1. 1. 0. 0. 1. 0. 1. 1. 1. 0. 0. 0. 1. 1. 0.]
In [23]:
results_test = pd.DataFrame({"y_test":y_test, "y_pred":y_test_pred})
results_test.head()
Out[23]:
| | y_test | y_pred |
---|---|---|
217 | 1.0 | 1.0 |
1995 | 0.0 | 0.0 |
304 | 1.0 | 1.0 |
791 | 1.0 | 1.0 |
1911 | 0.0 | 0.0 |
Model evaluation and tuning¶
In [24]:
correct = len(results_test.loc[results_test["y_test"]==results_test["y_pred"]])
data_count = len(results_test)
score_test = correct / data_count
print(score_test)
0.8916349809885932
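The hand-computed ratio is the accuracy. The sketch below uses made-up labels to show that `sklearn.metrics.accuracy_score` returns the same number, and that `confusion_matrix` additionally shows which kind of mistake was made (missed churners versus false alarms).

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels: 1 = churned, 0 = continuing
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Share of matching predictions, as in the cell above
manual_accuracy = (y_true == y_pred).mean()
print(manual_accuracy, accuracy_score(y_true, y_pred))  # 0.75 0.75

# Rows are true classes (0, 1); columns are predicted classes (0, 1)
print(confusion_matrix(y_true, y_pred))
```

For churn prevention the off-diagonal cells matter separately: a churner predicted as continuing is a missed chance to intervene, which accuracy alone does not reveal.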
In [25]:
print(model.score(X_test, y_test))
print(model.score(X_train, y_train)) # training accuracy is far above test accuracy: overfitting
0.8916349809885932
0.9816223067173637
In [26]:
model = DecisionTreeClassifier(random_state=0, max_depth=5)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
print(model.score(X_train, y_train))
0.9258555133079848
0.926489226869455
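Limiting `max_depth` narrows the gap between training and test accuracy. How to pick the depth is not covered here; one common option is cross-validation. The sketch below scans a few depths on synthetic data from `make_classification`, which stands in for the churn features and is not the sports-center data. The list of depths is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the churn features (not the sports-center data)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)

cv_scores = {}
for depth in [2, 3, 5, 8, None]:  # None lets the tree grow until leaves are pure
    tree = DecisionTreeClassifier(random_state=0, max_depth=depth)
    # Mean accuracy over 5 folds for this depth
    cv_scores[depth] = cross_val_score(tree, X_demo, y_demo, cv=5).mean()

for depth, score in cv_scores.items():
    print(depth, round(score, 3))
```

With the real data, `X_demo` and `y_demo` would be replaced by `X_train` and `y_train`, keeping the test set aside for the final check.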
Checking feature importance¶
In [27]:
importance = pd.DataFrame({"feature_names": X.columns, "coefficient":model.feature_importances_})
importance
Out[27]:
| | feature_names | coefficient |
---|---|---|
0 | count_1 | 0.325178 |
1 | routine_flg | 0.120963 |
2 | period | 0.549796 |
3 | campaign_name_입회비무료 | 0.000000 |
4 | campaign_name_입회비반액할인 | 0.003784 |
5 | class_name_종일 | 0.000279 |
6 | class_name_주간 | 0.000000 |
7 | gender_F | 0.000000 |
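This confirms that the previous month's usage count (count_1), the regular-use flag (routine_flg), and the membership period (period) are the features that contribute to the prediction.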
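Although the column above is labeled `coefficient`, `feature_importances_` holds impurity-based importances that sum to 1, not regression coefficients. The sketch below fits a tree on synthetic data (the feature names `f0` to `f4` are placeholders) and ranks the features from most to least important.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with placeholder feature names (not the churn data)
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
X_demo = pd.DataFrame(X_demo, columns=[f"f{i}" for i in range(5)])

tree = DecisionTreeClassifier(random_state=0, max_depth=5).fit(X_demo, y_demo)

# Importances indexed by feature name, largest first
ranking = (
    pd.Series(tree.feature_importances_, index=X_demo.columns)
    .sort_values(ascending=False)
)
print(ranking)
```

A value of 0 means the tree never split on that feature, which is the case for several dummy columns in the table above.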
Predicting churn¶
In [28]:
count_1 = 3
routine_flg = 1
period = 10
campaign_name = "입회비무료"
class_name = "종일"
gender = "M"
In [29]:
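# Dummy values must follow the column order of X.
# Campaign: [campaign_name_입회비무료 (free enrollment), campaign_name_입회비반액할인 (half-price enrollment)]
if campaign_name == "입회비반액할인":
    campaign_name_list = [0,1]
elif campaign_name=='입회비무료':
    campaign_name_list = [1,0]
elif campaign_name=='일반':  # regular
    campaign_name_list = [0,0]
# Class: [class_name_종일 (all day), class_name_주간 (daytime)]; 야간 (night) is the baseline
if class_name == "종일":
    class_name_list = [1,0]
elif class_name == "주간":
    class_name_list = [0,1]
elif class_name == "야간":
    class_name_list = [0,0]
# Gender: [gender_F]
if gender == 'F' :
    gender_list = [1]
elif gender == 'M':
    gender_list = [0]
input_data = [count_1, routine_flg, period]
input_data.extend(campaign_name_list)
input_data.extend(class_name_list)
input_data.extend(gender_list)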
In [32]:
print(model.predict([input_data]))
print(model.predict_proba([input_data]))
[1.]
[[0. 1.]]
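Building the dummy values by hand makes it easy to get the column order wrong. As a hedged alternative, the sketch below encodes a new member with `pd.get_dummies` and aligns the result to the training columns with `reindex`. The list `feature_columns` is copied from the tables above, and `new_member` and `input_row` are names introduced here for illustration.

```python
import pandas as pd

# Column order of the training matrix X, copied from the tables above
feature_columns = [
    "count_1", "routine_flg", "period",
    "campaign_name_입회비무료", "campaign_name_입회비반액할인",
    "class_name_종일", "class_name_주간", "gender_F",
]

# Same example member as in the cells above
new_member = pd.DataFrame([{
    "count_1": 3, "routine_flg": 1, "period": 10,
    "campaign_name": "입회비무료", "class_name": "종일", "gender": "M",
}])

# Encode the categorical fields, then align to the training columns:
# dummies absent from this row are filled with 0, and dummies that were
# not used in training (such as gender_M) are dropped.
encoded = pd.get_dummies(
    new_member, columns=["campaign_name", "class_name", "gender"], dtype=int
)
input_row = encoded.reindex(columns=feature_columns, fill_value=0)
print(input_row.iloc[0].tolist())  # [3, 1, 10, 1, 0, 1, 0, 0]
```

With the fitted model from this chapter, `model.predict(input_row)` and `model.predict_proba(input_row)` would then take the aligned row directly.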