파이썬 데이터분석 실무 테크닉 100 (Python Data Analysis: 100 Practical Techniques) - Chapter 5

양갱맨 2021. 3. 3. 17:20

Chapter 5: 10 Techniques for Predicting Member Churn

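The clustering-based behavior analysis introduced in the previous chapter has many possible applications, depending on how it is used. Once behavior patterns can be analyzed, it becomes possible to predict, with reasonable accuracy, which customers are likely to leave. That in turn makes it possible to prepare retention measures in advance.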

This chapter walks through the flow of predicting churn with a decision tree, a supervised learning algorithm for classification.

In [1]:

import pandas as pd

# Load Dataset
folder_p = '/content/drive/MyDrive/파이썬데이터분석실무테크닉100/pyda100/5장/'
customer = pd.read_csv(folder_p+'customer_join.csv')
uselog_months = pd.read_csv(folder_p+'use_log_months.csv')

To predict the future, we build the data from only the usage history of the given month and the month before it.

In [2]:

year_months = list(uselog_months["연월"].unique())
uselog = pd.DataFrame()
for i in range(1, len(year_months)):
    tmp = uselog_months.loc[uselog_months["연월"]==year_months[i]].copy() # data for the current month (copy avoids SettingWithCopyWarning)
    tmp.rename(columns={"count":"count_0"}, inplace=True)
    tmp_before = uselog_months.loc[uselog_months["연월"]==year_months[i-1]].copy() # usage history one month earlier
    del tmp_before["연월"]

    tmp_before.rename(columns={"count":"count_1"}, inplace=True)
    tmp = pd.merge(tmp, tmp_before, on="customer_id", how="left")
    uselog = pd.concat([uselog, tmp], ignore_index=True)

uselog.head()

Out[2]:

	연월	customer_id	count_0	count_1
0	201805	AS002855	5	4.0
1	201805	AS009373	4	3.0
2	201805	AS015233	7	NaN
3	201805	AS015315	3	6.0
4	201805	AS015739	5	7.0
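
As a cross-check on the loop above, the same current-month/previous-month pairing can be sketched as a single self-merge. This is a minimal sketch on made-up rows, not the author's code; the toy values and the names `prev_month`, `current`, `previous`, and `lagged` are assumptions introduced for illustration.

```python
import pandas as pd

# Toy stand-in for use_log_months.csv (hypothetical values)
log = pd.DataFrame({
    "연월": [201804, 201804, 201805, 201805, 201805],
    "customer_id": ["A", "B", "A", "B", "C"],
    "count": [4, 3, 5, 4, 7],
})

months = sorted(log["연월"].unique())
# Map each month to the month that precedes it in the log
prev_month = dict(zip(months[1:], months[:-1]))

# Current-month rows (the first month has no predecessor, so it is skipped)
current = log[log["연월"] != months[0]].rename(columns={"count": "count_0"})
current = current.assign(prev=current["연월"].map(prev_month))

# Same log, relabeled so that it can be joined as "the month before"
previous = log.rename(columns={"count": "count_1", "연월": "prev"})

lagged = current.merge(previous, on=["customer_id", "prev"], how="left")
lagged = lagged.drop(columns="prev")
print(lagged)
```

Customer `C` has no row in the earlier month, so its `count_1` is NaN, which is the same situation as the NaN in the `count_1` column of the output above.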

Building the churned-customer data for the month before withdrawal

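Why build the data for the month before withdrawal rather than for the withdrawal month itself?
- The purpose of predicting churn is to prevent it.
- At the sports center that provided this data, a member must apply by the end of a month in order to withdraw at the end of the following month.

In other words, we predict the probability that a member applies to withdraw, using data from the month before the withdrawal.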
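
The one-month shift can be sketched on two sample dates. This is a minimal illustration using `dateutil.relativedelta` on hand-picked end dates; the variable names are assumptions, and the dates were chosen because the same values appear in the `end_date` column of this dataset.

```python
import pandas as pd
from dateutil.relativedelta import relativedelta

# Two withdrawal dates (end of month)
end_dates = pd.to_datetime(pd.Series(["2018-06-30", "2019-03-31"]))

# Shift each date back by one calendar month; the day is clipped
# when the target month is shorter (March 31 -> February 28)
exit_dates = pd.to_datetime(end_dates.apply(lambda d: d - relativedelta(months=1)))

# Year-month key in the same "%Y%m" format as the usage log
exit_months = exit_dates.dt.strftime("%Y%m")
print(exit_dates.tolist())
print(exit_months.tolist())
```

The resulting year-month string is what lets the churned-customer rows be joined to the usage log of the month in which the withdrawal request would have been made.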

In [3]:

from dateutil.relativedelta import relativedelta

exit_customer = customer.loc[customer["is_deleted"]==1].copy() # copy to avoid SettingWithCopyWarning
exit_customer["exit_date"] = None
exit_customer["end_date"] = pd.to_datetime(exit_customer["end_date"]) # convert to datetime

In [4]:

# withdrawal date minus one month, computed per row without chained indexing
exit_customer["exit_date"] = exit_customer["end_date"].apply(lambda d: d - relativedelta(months=1))

In [5]:

exit_customer["exit_date"] = pd.to_datetime(exit_customer["exit_date"]) # convert to datetime once more
exit_customer["연월"] = exit_customer["exit_date"].dt.strftime("%Y%m")
uselog["연월"] = uselog["연월"].astype(str)
exit_uselog = pd.merge(uselog, exit_customer, on=["customer_id", "연월"], how="left")
print(len(uselog))
exit_uselog.head()

Out[5]:

	연월	customer_id	count_0	count_1	name	class	gender	start_date	end_date	campaign_id	is_deleted	class_name	price	campaign_name	mean	median	max	min	routine_flg	calc_date	membership_period	exit_date
0	201805	AS002855	5	4.0	NaN	NaN	NaN	NaN	NaT	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaT
1	201805	AS009373	4	3.0	NaN	NaN	NaN	NaN	NaT	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaT
2	201805	AS015233	7	NaN	NaN	NaN	NaN	NaN	NaT	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaT
3	201805	AS015315	3	6.0	NaN	NaN	NaN	NaN	NaT	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaT
4	201805	AS015739	5	7.0	NaN	NaN	NaN	NaN	NaT	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaT

In [7]:

exit_uselog = exit_uselog.dropna(subset=["name"])
print(len(exit_uselog))
print(len(exit_uselog["customer_id"].unique())) # number of unique customers; it matches the row count, so each churned member appears once
exit_uselog.head()

1104
1104

Out[7]:

	연월	customer_id	count_0	count_1	name	class	gender	start_date	end_date	campaign_id	is_deleted	class_name	price	campaign_name	mean	median	max	min	routine_flg	calc_date	membership_period	exit_date
19	201805	AS055680	3	3.0	XXXXX	C01	M	2018-03-01	2018-06-30	CA1	1.0	종일	10500.0	일반	3.000000	3.0	3.0	3.0	0.0	2018-06-30	3.0	2018-05-30
57	201805	AS169823	2	3.0	XX	C01	M	2017-11-01	2018-06-30	CA1	1.0	종일	10500.0	일반	3.000000	3.0	4.0	2.0	1.0	2018-06-30	7.0	2018-05-30
110	201805	AS305860	5	3.0	XXXX	C01	M	2017-06-01	2018-06-30	CA1	1.0	종일	10500.0	일반	3.333333	3.0	5.0	2.0	0.0	2018-06-30	12.0	2018-05-30
128	201805	AS363699	5	3.0	XXXXX	C01	M	2018-02-01	2018-06-30	CA1	1.0	종일	10500.0	일반	3.333333	3.0	5.0	2.0	0.0	2018-06-30	4.0	2018-05-30
147	201805	AS417696	1	4.0	XX	C03	F	2017-09-01	2018-06-30	CA1	1.0	야간	6000.0	일반	2.000000	1.0	4.0	1.0	0.0	2018-06-30	9.0	2018-05-30

Building the data for continuing members

In [8]:

conti_customer = customer.loc[customer["is_deleted"]==0]
conti_uselog = pd.merge(uselog, conti_customer, on=["customer_id"], how="left")
print(len(conti_uselog))
conti_uselog = conti_uselog.dropna(subset=["name"])
print(len(conti_uselog))

33851
27422

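The churned members yield 1,104 rows, whereas the continuing members still yield 27,422 rows even after rows with a missing name are removed, so the data is imbalanced.

We therefore adjust the number of samples.

Instead of using every month of each continuing member's data, we keep only one month per member, so that each member contributes a single row.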
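
The shuffle-then-deduplicate idea can be sketched on a toy log. The rows below are made up, and `random_state=0` is an assumption added here only so the sketch is reproducible; the original cell does not fix a seed.

```python
import pandas as pd

# Toy monthly log: member A has three months, B two, C one (hypothetical values)
logs = pd.DataFrame({
    "customer_id": ["A", "A", "A", "B", "B", "C"],
    "연월": ["201805", "201806", "201807", "201805", "201806", "201807"],
    "count_0": [5, 4, 6, 2, 3, 7],
})

# Shuffle all rows, then keep the first row seen for each member.
# Because the order is random, the month that survives is random too.
shuffled = logs.sample(frac=1, random_state=0).reset_index(drop=True)
one_per_member = shuffled.drop_duplicates(subset="customer_id")
print(one_per_member)
```

Shuffling first matters: without it, `drop_duplicates` would always keep each member's earliest month, which would bias the sample toward the start of the period.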

In [9]:

conti_uselog = conti_uselog.sample(frac=1).reset_index(drop=True)
conti_uselog = conti_uselog.drop_duplicates(subset="customer_id")
print(len(conti_uselog))
conti_uselog.head()

Out[9]:

	연월	customer_id	count_0	count_1	name	class	gender	start_date	end_date	campaign_id	class_name	price	campaign_name	mean	median	max	min	routine_flg	calc_date	membership_period
0	201901	AS551109	8	8.0	XXX	C01	F	2018-10-15	NaN	CA1	종일	10500.0	일반	7.333333	8.0	8.0	5.0	1.0	2019-04-30	6.0
1	201806	OA124956	7	7.0	XXXXX	C01	F	2016-11-01	NaN	CA1	종일	10500.0	일반	5.333333	5.0	7.0	3.0	1.0	2019-04-30	29.0
2	201811	IK983460	5	6.0	XXXXXX	C02	M	2015-10-01	NaN	CA1	주간	7500.0	일반	5.250000	5.0	8.0	3.0	1.0	2019-04-30	42.0
3	201901	IK155697	5	5.0	XXXX	C01	F	2015-08-01	NaN	CA1	종일	10500.0	일반	4.250000	5.0	6.0	1.0	1.0	2019-04-30	44.0
4	201903	AS515526	9	4.0	XXXXX	C03	M	2015-11-01	NaN	CA1	야간	6000.0	일반	5.000000	5.0	9.0	2.0	1.0	2019-04-30	41.0

In [11]:

predict_data = pd.concat([conti_uselog, exit_uselog], ignore_index=True) # combine continuing and churned members
print(len(predict_data))
predict_data.head()

Out[11]:

	연월	customer_id	count_0	count_1	name	class	gender	start_date	end_date	campaign_id	class_name	price	campaign_name	mean	median	max	min	routine_flg	calc_date	membership_period	exit_date
0	201901	AS551109	8	8.0	XXX	C01	F	2018-10-15	NaN	CA1	종일	10500.0	일반	7.333333	8.0	8.0	5.0	1.0	2019-04-30	6.0	NaT
1	201806	OA124956	7	7.0	XXXXX	C01	F	2016-11-01	NaN	CA1	종일	10500.0	일반	5.333333	5.0	7.0	3.0	1.0	2019-04-30	29.0	NaT
2	201811	IK983460	5	6.0	XXXXXX	C02	M	2015-10-01	NaN	CA1	주간	7500.0	일반	5.250000	5.0	8.0	3.0	1.0	2019-04-30	42.0	NaT
3	201901	IK155697	5	5.0	XXXX	C01	F	2015-08-01	NaN	CA1	종일	10500.0	일반	4.250000	5.0	6.0	1.0	1.0	2019-04-30	44.0	NaT
4	201903	AS515526	9	4.0	XXXXX	C03	M	2015-11-01	NaN	CA1	야간	6000.0	일반	5.000000	5.0	9.0	2.0	1.0	2019-04-30	41.0	NaT

In [12]:

predict_data.tail()

Out[12]:

	연월	customer_id	count_0	count_1	name	class	gender	start_date	end_date	campaign_id	is_deleted	class_name	price	campaign_name	mean	median	max	min	routine_flg	calc_date	membership_period	exit_date
3941	201902	TS645212	4	2.0	XXXX	C03	F	2018-03-01	2019-03-31 00:00:00	CA1	1.0	야간	6000.0	일반	4.50	4.5	7.0	1.0	0.0	2019-03-31	12.0	2019-02-28
3942	201902	TS741703	5	6.0	XXXX	C03	M	2018-12-08	2019-03-31 00:00:00	CA3	1.0	야간	6000.0	입회비무료	6.25	6.0	8.0	5.0	0.0	2019-03-31	3.0	2019-02-28
3943	201902	TS859258	1	3.0	XXXXX	C02	F	2018-12-07	2019-03-31 00:00:00	CA3	1.0	주간	7500.0	입회비무료	2.50	2.0	5.0	1.0	0.0	2019-03-31	3.0	2019-02-28
3944	201902	TS886985	5	3.0	XXX	C02	F	2018-03-01	2019-03-31 00:00:00	CA1	1.0	주간	7500.0	일반	4.25	4.0	7.0	2.0	1.0	2019-03-31	12.0	2019-02-28
3945	201902	TS921837	2	3.0	XXXXXX	C01	M	2018-06-04	2019-03-31 00:00:00	CA2	1.0	종일	10500.0	입회비반액할인	4.00	3.5	9.0	2.0	1.0	2019-03-31	9.0	2019-02-28

Computing the membership period as of the prediction month

In [13]:

predict_data["period"] = 0
predict_data["now_date"] = pd.to_datetime(predict_data["연월"], format="%Y%m")
predict_data["start_date"] = pd.to_datetime(predict_data["start_date"])
for i in range(len(predict_data)):
    delta = relativedelta(predict_data["now_date"][i], predict_data["start_date"][i])
    predict_data.loc[i, "period"] = int(delta.years*12 + delta.months) # .loc avoids chained assignment
predict_data.head()

Out[13]:

	연월	customer_id	count_0	count_1	name	class	gender	start_date	end_date	campaign_id	class_name	price	campaign_name	mean	median	max	min	routine_flg	calc_date	membership_period	exit_date	period	now_date
0	201901	AS551109	8	8.0	XXX	C01	F	2018-10-15	NaN	CA1	종일	10500.0	일반	7.333333	8.0	8.0	5.0	1.0	2019-04-30	6.0	NaT	2	2019-01-01
1	201806	OA124956	7	7.0	XXXXX	C01	F	2016-11-01	NaN	CA1	종일	10500.0	일반	5.333333	5.0	7.0	3.0	1.0	2019-04-30	29.0	NaT	19	2018-06-01
2	201811	IK983460	5	6.0	XXXXXX	C02	M	2015-10-01	NaN	CA1	주간	7500.0	일반	5.250000	5.0	8.0	3.0	1.0	2019-04-30	42.0	NaT	37	2018-11-01
3	201901	IK155697	5	5.0	XXXX	C01	F	2015-08-01	NaN	CA1	종일	10500.0	일반	4.250000	5.0	6.0	1.0	1.0	2019-04-30	44.0	NaT	41	2019-01-01
4	201903	AS515526	9	4.0	XXXXX	C03	M	2015-11-01	NaN	CA1	야간	6000.0	일반	5.000000	5.0	9.0	2.0	1.0	2019-04-30	41.0	NaT	40	2019-03-01
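
The period arithmetic can be checked in isolation. The sketch below wraps the same `relativedelta` calculation in a helper, `months_between`, which is a name introduced here for illustration; the three date pairs are taken from the first rows of the output above.

```python
import pandas as pd
from dateutil.relativedelta import relativedelta

def months_between(now, start):
    """Whole calendar months elapsed from start to now (leftover days are dropped)."""
    delta = relativedelta(now, start)
    return delta.years * 12 + delta.months

sample = pd.DataFrame({
    "now_date": pd.to_datetime(["2019-01-01", "2018-06-01", "2018-11-01"]),
    "start_date": pd.to_datetime(["2018-10-15", "2016-11-01", "2015-10-01"]),
})

sample["period"] = [
    months_between(now, start)
    for now, start in zip(sample["now_date"], sample["start_date"])
]
print(sample)
```

Note that a member who joined on 2018-10-15 counts as 2 months on 2019-01-01, because the partial month is truncated rather than rounded.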

Removing missing values

In [15]:

predict_data.isna().sum() # count missing values per column

Out[15]:

연월                      0
customer_id             0
count_0                 0
count_1               260
name                    0
class                   0
gender                  0
start_date              0
end_date             2842
campaign_id             0
is_deleted              0
class_name              0
price                   0
campaign_name           0
mean                    0
median                  0
max                     0
min                     0
routine_flg             0
calc_date               0
membership_period       0
exit_date            2842
period                  0
now_date                0
dtype: int64

In [17]:

predict_data = predict_data.dropna(subset=["count_1"])
predict_data.isna().sum()

Out[17]:

연월                      0
customer_id             0
count_0                 0
count_1                 0
name                    0
class                   0
gender                  0
start_date              0
end_date             2634
campaign_id             0
is_deleted              0
class_name              0
price                   0
campaign_name           0
mean                    0
median                  0
max                     0
min                     0
routine_flg             0
calc_date               0
membership_period       0
exit_date            2634
period                  0
now_date                0
dtype: int64

Preparing string variables for the model

We create dummy variables so that the categorical variables can be handled.

In [18]:

target_col = ["campaign_name", "class_name", "gender", "count_1", "routine_flg", "period", "is_deleted"]
predict_data = predict_data[target_col]
predict_data.head()

Out[18]:

	campaign_name	class_name	gender	count_1	routine_flg	period
0	일반	종일	F	8.0	1.0	2
1	일반	종일	F	7.0	1.0	19
2	일반	주간	M	6.0	1.0	37
3	일반	종일	F	5.0	1.0	41
4	일반	야간	M	4.0	1.0	40

In [19]:

predict_data = pd.get_dummies(predict_data)
predict_data.head()

Out[19]:

	count_1	routine_flg	period	campaign_name_일반	class_name_야간	class_name_종일	class_name_주간	gender_F	gender_M
0	8.0	1.0	2	1	0	1	0	1	0
1	7.0	1.0	19	1	0	1	0	1	0
2	6.0	1.0	37	1	0	0	1	0	1
3	5.0	1.0	41	1	0	1	0	1	0
4	4.0	1.0	40	1	1	0	0	0	1

In [20]:

del predict_data["campaign_name_일반"]
del predict_data["class_name_야간"]
del predict_data["gender_M"]
predict_data.head() # drop columns whose information can be recovered from the remaining dummy columns

Out[20]:

	count_1	routine_flg	period	class_name_종일	class_name_주간	gender_F
0	8.0	1.0	2	1	0	1
1	7.0	1.0	19	1	0	1
2	6.0	1.0	37	0	1	0
3	5.0	1.0	41	1	0	1
4	4.0	1.0	40	0	0	0
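
The dummy-variable step, including dropping one reference column per category, can be sketched on a small frame. The rows are made up for illustration; `dtype=int` is used here only so the output shows 0/1 regardless of pandas version.

```python
import pandas as pd

# Toy frame with two categorical columns and one numeric column (hypothetical rows)
df = pd.DataFrame({
    "class_name": ["종일", "주간", "야간", "종일"],
    "gender": ["F", "M", "M", "F"],
    "count_1": [8.0, 6.0, 4.0, 5.0],
})

# One 0/1 column per category value; numeric columns pass through unchanged
dummies = pd.get_dummies(df, dtype=int)

# Drop one column per categorical variable: a row with zeros in all
# remaining class_name_* columns can only be the dropped value (야간)
reduced = dummies.drop(columns=["class_name_야간", "gender_M"])
print(reduced)
```

In this sketch, the third row (야간, M) ends up with zeros in every remaining dummy column, so no information is lost by dropping the reference columns.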

Using a decision tree

In [21]:

from sklearn.tree import DecisionTreeClassifier
import sklearn.model_selection

exit = predict_data.loc[predict_data["is_deleted"]==1]
conti = predict_data.loc[predict_data["is_deleted"]==0].sample(len(exit)) # sample as many continuing members as there are churned members
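
The effect of this undersampling step can be seen on a toy frame with a 1:3 class ratio. The data and the names `churned`, `retained`, and `balanced` are assumptions for illustration, and `random_state=0` is added only for reproducibility.

```python
import pandas as pd

# Toy data: 3 churned members (1) and 9 continuing members (0)
data = pd.DataFrame({
    "is_deleted": [1] * 3 + [0] * 9,
    "count_1": range(12),
})

churned = data[data["is_deleted"] == 1]
# Draw as many continuing members as there are churned members
retained = data[data["is_deleted"] == 0].sample(len(churned), random_state=0)

balanced = pd.concat([churned, retained], ignore_index=True)
print(balanced["is_deleted"].value_counts())
```

With both classes the same size, an accuracy of 0.5 corresponds to guessing, which makes the scores reported later easier to interpret.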

In [22]:

X = pd.concat([exit, conti], ignore_index=True)
y = X["is_deleted"]
del X["is_deleted"]

X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y)

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)
y_test_pred = model.predict(X_test)
print(y_test_pred)

[1. 0. 1. 1. 0. 0. 1. 0. 1. 1. 1. 1. 0. 1. 0. 0. 1. 1. 0. 0. 1. 0. 0. 0.
 0. 1. 0. 1. 1. 0. 1. 1. 1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1. 1. 0. 1. 0. 1.
 1. 0. 1. 0. 1. 1. 0. 1. 0. 0. 1. 1. 0. 1. 0. 0. 0. 1. 1. 1. 0. 0. 0. 0.
 0. 0. 1. 1. 0. 0. 1. 1. 1. 1. 0. 0. 1. 1. 0. 0. 0. 1. 0. 0. 0. 1. 0. 1.
 1. 1. 1. 0. 0. 1. 0. 1. 0. 0. 0. 1. 1. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.
 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 1. 1. 0. 1. 1. 0. 0. 0. 1. 1. 0. 1. 1. 1.
 1. 1. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 1. 0. 0. 0. 0.
 0. 1. 0. 1. 1. 1. 1. 1. 0. 0. 0. 0. 1. 0. 1. 0. 0. 1. 1. 1. 1. 0. 1. 0.
 1. 0. 1. 0. 1. 1. 1. 1. 0. 1. 1. 0. 0. 0. 0. 1. 0. 1. 1. 1. 1. 0. 1. 1.
 1. 1. 0. 0. 1. 1. 0. 1. 0. 0. 1. 1. 1. 1. 1. 0. 1. 0. 0. 1. 0. 0. 1. 1.
 1. 1. 1. 0. 1. 1. 0. 0. 0. 1. 0. 0. 1. 1. 0. 1. 0. 0. 0. 1. 0. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 0. 0. 0. 0. 0. 1. 0. 1. 1. 1. 0. 0. 0. 1.
 0. 0. 0. 1. 0. 0. 1. 0. 1. 1. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 1. 1. 1.
 1. 1. 1. 1. 0. 0. 1. 0. 1. 1. 1. 1. 0. 1. 0. 0. 0. 0. 1. 1. 0. 1. 1. 1.
 0. 0. 1. 1. 0. 0. 1. 1. 1. 1. 1. 0. 1. 0. 1. 1. 0. 1. 1. 1. 1. 0. 0. 0.
 1. 0. 0. 1. 1. 0. 1. 0. 1. 1. 0. 0. 1. 1. 0. 0. 1. 0. 1. 0. 0. 0. 1. 0.
 0. 0. 0. 1. 1. 0. 1. 0. 0. 1. 0. 0. 1. 0. 1. 1. 1. 0. 1. 0. 1. 1. 0. 0.
 0. 1. 0. 0. 0. 1. 0. 1. 1. 1. 1. 0. 0. 1. 0. 0. 0. 1. 0. 0. 1. 1. 0. 1.
 0. 0. 1. 0. 1. 1. 1. 0. 1. 0. 1. 0. 0. 1. 0. 1. 1. 1. 0. 1. 1. 1. 1. 1.
 1. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 1. 0. 1. 0. 0. 1. 0. 1. 1. 1. 0. 0. 1.
 1. 0. 0. 0. 0. 1. 1. 1. 0. 0. 1. 1. 1. 1. 1. 0. 0. 0. 1. 0. 1. 0. 1. 1.
 1. 1. 1. 1. 0. 0. 1. 1. 1. 0. 0. 1. 0. 1. 1. 1. 0. 0. 0. 1. 1. 0.]

In [23]:

results_test = pd.DataFrame({"y_test":y_test, "y_pred":y_test_pred})
results_test.head()

Out[23]:

	y_test	y_pred
217	1.0	1.0
1995	0.0	0.0
304	1.0	1.0
791	1.0	1.0
1911	0.0	0.0

Model evaluation and tuning

In [24]:

correct = len(results_test.loc[results_test["y_test"]==results_test["y_pred"]])
data_count = len(results_test)
score_test = correct / data_count
print(score_test)

0.8916349809885932
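
The manual accuracy calculation above is the fraction of matching labels. As a small sketch on made-up labels, the same number can be obtained from `sklearn.metrics.accuracy_score`, and `confusion_matrix` shows where the errors fall; the label arrays here are hypothetical.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions (1 = churned, 0 = continuing)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Fraction of positions where prediction equals truth
manual = (y_true == y_pred).mean()
library = accuracy_score(y_true, y_pred)

# Rows are true classes (0, 1); columns are predicted classes (0, 1)
cm = confusion_matrix(y_true, y_pred)
print(manual, library)
print(cm)
```

For churn prevention, the bottom-left cell (churned members predicted as continuing) is usually the costly one, which accuracy alone does not reveal.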

In [25]:

print(model.score(X_test, y_test))
print(model.score(X_train, y_train)) # training accuracy far exceeds test accuracy: overfitting

0.8916349809885932
0.9816223067173637

In [26]:

model = DecisionTreeClassifier(random_state=0, max_depth=5)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
print(model.score(X_train, y_train))

0.9258555133079848
0.926489226869455
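
The gap between training and test accuracy, and how `max_depth` narrows it, can be explored with a small sweep. This sketch uses synthetic data from `make_classification` rather than the membership data, so the numbers it prints are not comparable to the scores above; the depth values tried are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (stand-in for the real features)
X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scores = {}
for depth in [None, 3, 5]:  # None lets the tree grow until leaves are pure
    tree = DecisionTreeClassifier(random_state=0, max_depth=depth)
    tree.fit(X_train, y_train)
    scores[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print(depth, scores[depth])
```

An unrestricted tree memorizes the training set, so its training accuracy says little; the depth to prefer is the one whose test accuracy holds up while the train/test gap stays small.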

Checking feature importance

In [27]:

importance = pd.DataFrame({"feature_names": X.columns, "coefficient":model.feature_importances_})
importance

Out[27]:

	feature_names	coefficient
0	count_1	0.325178
1	routine_flg	0.120963
2	period	0.549796
3	campaign_name_입회비무료	0.000000
4	campaign_name_입회비반액할인	0.003784
5	class_name_종일	0.000279
6	class_name_주간	0.000000
7	gender_F	0.000000

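This confirms that the previous month's usage count (count_1), the regular-use flag (routine_flg), and the membership period (period) are the features contributing to the prediction.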
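
When there are many features, sorting the table makes the dominant ones easier to spot. The sketch below does this on synthetic data; the feature names `f0` to `f3` are placeholders, and the column is named `importance` here because `feature_importances_` holds impurity-based importances rather than coefficients.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with placeholder feature names
X_arr, y = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
X = pd.DataFrame(X_arr, columns=["f0", "f1", "f2", "f3"])

tree = DecisionTreeClassifier(random_state=0, max_depth=5).fit(X, y)

# Importances are normalized to sum to 1 across features
importance = (
    pd.DataFrame({"feature_names": X.columns, "importance": tree.feature_importances_})
    .sort_values("importance", ascending=False)
    .reset_index(drop=True)
)
print(importance)
```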

Predicting churn

In [28]:

count_1 = 3
routine_flg = 1
period = 10
campaign_name = "입회비무료"
class_name = "종일"
gender = "M"

In [29]:

# order follows X.columns: [campaign_name_입회비무료, campaign_name_입회비반액할인]
if campaign_name == "입회비무료":
    campaign_name_list = [1,0]
elif campaign_name == "입회비반액할인":
    campaign_name_list = [0,1]
elif campaign_name == "일반":
    campaign_name_list = [0,0]

# order follows X.columns: [class_name_종일, class_name_주간]
if class_name == "종일":
    class_name_list = [1,0]
elif class_name == "주간":
    class_name_list = [0,1]
elif class_name == "야간":
    class_name_list = [0,0]

# single column: gender_F
if gender == "F":
    gender_list = [1]
elif gender == "M":
    gender_list = [0]

input_data = [count_1, routine_flg, period]
input_data.extend(campaign_name_list)
input_data.extend(class_name_list)
input_data.extend(gender_list)

In [32]:

print(model.predict([input_data]))
print(model.predict_proba([input_data])) 

[1.]
[[0. 1.]]
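
Hand-written if/elif encodings are easy to get out of step with the training columns. One alternative, sketched here under the assumption that the training columns are exactly those listed in the feature importance table, is to build a one-row frame and align it with `reindex`. The helper `encode_member` is hypothetical and not part of the original notebook.

```python
import pandas as pd

# Column order taken from the feature importance table (X.columns)
feature_columns = [
    "count_1", "routine_flg", "period",
    "campaign_name_입회비무료", "campaign_name_입회비반액할인",
    "class_name_종일", "class_name_주간",
    "gender_F",
]

def encode_member(count_1, routine_flg, period, campaign_name, class_name, gender):
    """Return a one-row frame whose columns match feature_columns."""
    row = {
        "count_1": count_1,
        "routine_flg": routine_flg,
        "period": period,
        f"campaign_name_{campaign_name}": 1,
        f"class_name_{class_name}": 1,
        f"gender_{gender}": 1,
    }
    # reindex drops columns the model never saw (e.g. gender_M)
    # and fills the dummy columns that are absent with 0
    return pd.DataFrame([row]).reindex(columns=feature_columns, fill_value=0)

encoded = encode_member(3, 1, 10, "입회비무료", "종일", "M")
print(encoded.iloc[0].tolist())
```

The resulting frame could be passed to `model.predict` in place of `[input_data]`; reference categories such as 일반, 야간, and M simply become all-zero dummy columns.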

License: Attribution-NonCommercial-NoDerivatives
