CARGA DE LIBRERIAS¶
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
CARGA DE LOS DATOS¶
df = pd.read_csv('AbandonoEmpleados.csv', sep = ';', index_col= 'id', na_values='#N/D')
df
edad | abandono | viajes | departamento | distancia_casa | educacion | carrera | empleados | satisfaccion_entorno | sexo | ... | satisfaccion_companeros | horas_quincena | nivel_acciones | anos_experiencia | num_formaciones_ult_ano | conciliacion | anos_compania | anos_en_puesto | anos_desde_ult_promocion | anos_con_manager_actual | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||||||||||||
1 | 41 | Yes | Travel_Rarely | Sales | 1 | Universitaria | Life Sciences | 1 | Media | 3.0 | ... | Baja | 80 | 0 | 8 | 0 | NaN | 6 | NaN | 0 | 5 |
2 | 49 | No | Travel_Frequently | Research & Development | 8 | Secundaria | Life Sciences | 1 | Alta | 2.0 | ... | Muy_Alta | 80 | 1 | 10 | 3 | NaN | 10 | NaN | 1 | 7 |
4 | 37 | Yes | Travel_Rarely | Research & Development | 2 | Secundaria | Other | 1 | Muy_Alta | 2.0 | ... | Media | 80 | 0 | 7 | 3 | NaN | 0 | 2.0 | 0 | 0 |
5 | 33 | No | Travel_Frequently | Research & Development | 3 | Universitaria | Life Sciences | 1 | Muy_Alta | 3.0 | ... | Alta | 80 | 0 | 8 | 3 | NaN | 8 | 3.0 | 3 | 0 |
7 | 27 | No | Travel_Rarely | Research & Development | 2 | Universitaria | Medical | 1 | Baja | 3.0 | ... | Muy_Alta | 80 | 1 | 6 | 3 | NaN | 2 | NaN | 2 | 2 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2061 | 36 | No | Travel_Frequently | Research & Development | 23 | Master | Medical | 1 | Alta | 4.0 | ... | Alta | 80 | 1 | 17 | 3 | NaN | 5 | 4.0 | 0 | 3 |
2062 | 39 | No | Travel_Rarely | Research & Development | 6 | Secundaria | Medical | 1 | Muy_Alta | 2.0 | ... | Baja | 80 | 1 | 9 | 5 | NaN | 7 | NaN | 1 | 7 |
2064 | 27 | No | Travel_Rarely | Research & Development | 4 | Master | Life Sciences | 1 | Media | 4.0 | ... | Media | 80 | 1 | 6 | 0 | NaN | 6 | NaN | 0 | 3 |
2065 | 49 | No | Travel_Frequently | Sales | 2 | Secundaria | Medical | 1 | Muy_Alta | NaN | ... | Muy_Alta | 80 | 0 | 17 | 3 | NaN | 9 | NaN | 0 | 8 |
2068 | 34 | No | Travel_Rarely | Research & Development | 8 | NaN | Medical | 1 | Media | 4.0 | ... | Baja | 80 | 0 | 6 | 3 | Muy_Alta | 4 | NaN | 1 | 2 |
1470 rows × 31 columns
BUSINESS ANALYTICS¶
df.info()
<class 'pandas.core.frame.DataFrame'> Index: 1470 entries, 1 to 2068 Data columns (total 31 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 edad 1470 non-null int64 1 abandono 1470 non-null object 2 viajes 1470 non-null object 3 departamento 1470 non-null object 4 distancia_casa 1470 non-null int64 5 educacion 1369 non-null object 6 carrera 1470 non-null object 7 empleados 1470 non-null int64 8 satisfaccion_entorno 1470 non-null object 9 sexo 1271 non-null float64 10 implicacion 1452 non-null object 11 nivel_laboral 1470 non-null int64 12 puesto 1470 non-null object 13 satisfaccion_trabajo 1394 non-null object 14 estado_civil 1470 non-null object 15 salario_mes 1470 non-null int64 16 num_empresas_anteriores 1470 non-null int64 17 mayor_edad 1470 non-null object 18 horas_extra 1470 non-null object 19 incremento_salario_porc 1470 non-null int64 20 evaluacion 1470 non-null object 21 satisfaccion_companeros 1470 non-null object 22 horas_quincena 1470 non-null int64 23 nivel_acciones 1470 non-null int64 24 anos_experiencia 1470 non-null int64 25 num_formaciones_ult_ano 1470 non-null int64 26 conciliacion 459 non-null object 27 anos_compania 1470 non-null int64 28 anos_en_puesto 232 non-null float64 29 anos_desde_ult_promocion 1470 non-null int64 30 anos_con_manager_actual 1470 non-null int64 dtypes: float64(2), int64(14), object(15) memory usage: 367.5+ KB
ANÁLISIS DE NULOS¶
df.isna().sum().sort_values(ascending = False)
anos_en_puesto 1238 conciliacion 1011 sexo 199 educacion 101 satisfaccion_trabajo 76 implicacion 18 edad 0 nivel_acciones 0 evaluacion 0 satisfaccion_companeros 0 horas_quincena 0 anos_experiencia 0 horas_extra 0 num_formaciones_ult_ano 0 anos_compania 0 anos_desde_ult_promocion 0 incremento_salario_porc 0 salario_mes 0 mayor_edad 0 num_empresas_anteriores 0 abandono 0 estado_civil 0 puesto 0 nivel_laboral 0 satisfaccion_entorno 0 empleados 0 carrera 0 distancia_casa 0 departamento 0 viajes 0 anos_con_manager_actual 0 dtype: int64
Conclusiones:
- anos_en_puesto y conciliacion tienen demasiados nulos --> eliminar Variables
- sexo, educacion, satisfaccion_trabajo e implicacion --> imputarlos tras EDA
df.drop(columns = ['anos_en_puesto','conciliacion'], inplace = True)
df
edad | abandono | viajes | departamento | distancia_casa | educacion | carrera | empleados | satisfaccion_entorno | sexo | ... | incremento_salario_porc | evaluacion | satisfaccion_companeros | horas_quincena | nivel_acciones | anos_experiencia | num_formaciones_ult_ano | anos_compania | anos_desde_ult_promocion | anos_con_manager_actual | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||||||||||||
1 | 41 | Yes | Travel_Rarely | Sales | 1 | Universitaria | Life Sciences | 1 | Media | 3.0 | ... | 11 | Alta | Baja | 80 | 0 | 8 | 0 | 6 | 0 | 5 |
2 | 49 | No | Travel_Frequently | Research & Development | 8 | Secundaria | Life Sciences | 1 | Alta | 2.0 | ... | 23 | Muy_Alta | Muy_Alta | 80 | 1 | 10 | 3 | 10 | 1 | 7 |
4 | 37 | Yes | Travel_Rarely | Research & Development | 2 | Secundaria | Other | 1 | Muy_Alta | 2.0 | ... | 15 | Alta | Media | 80 | 0 | 7 | 3 | 0 | 0 | 0 |
5 | 33 | No | Travel_Frequently | Research & Development | 3 | Universitaria | Life Sciences | 1 | Muy_Alta | 3.0 | ... | 11 | Alta | Alta | 80 | 0 | 8 | 3 | 8 | 3 | 0 |
7 | 27 | No | Travel_Rarely | Research & Development | 2 | Universitaria | Medical | 1 | Baja | 3.0 | ... | 12 | Alta | Muy_Alta | 80 | 1 | 6 | 3 | 2 | 2 | 2 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2061 | 36 | No | Travel_Frequently | Research & Development | 23 | Master | Medical | 1 | Alta | 4.0 | ... | 17 | Alta | Alta | 80 | 1 | 17 | 3 | 5 | 0 | 3 |
2062 | 39 | No | Travel_Rarely | Research & Development | 6 | Secundaria | Medical | 1 | Muy_Alta | 2.0 | ... | 15 | Alta | Baja | 80 | 1 | 9 | 5 | 7 | 1 | 7 |
2064 | 27 | No | Travel_Rarely | Research & Development | 4 | Master | Life Sciences | 1 | Media | 4.0 | ... | 20 | Muy_Alta | Media | 80 | 1 | 6 | 0 | 6 | 0 | 3 |
2065 | 49 | No | Travel_Frequently | Sales | 2 | Secundaria | Medical | 1 | Muy_Alta | NaN | ... | 14 | Alta | Muy_Alta | 80 | 0 | 17 | 3 | 9 | 0 | 8 |
2068 | 34 | No | Travel_Rarely | Research & Development | 8 | NaN | Medical | 1 | Media | 4.0 | ... | 12 | Alta | Baja | 80 | 0 | 6 | 3 | 4 | 1 | 2 |
1470 rows × 29 columns
EDA VARIABLES CATEGÓRICAS¶
def graficos_eda_categoricos(cat):
#Calculamos el número de filas que necesitamos
from math import ceil
filas = ceil(cat.shape[1] / 2)
#Definimos el gráfico
f, ax = plt.subplots(nrows = filas, ncols = 2, figsize = (16, filas * 6))
#Aplanamos para iterar por el gráfico como si fuera de 1 dimensión en lugar de 2
ax = ax.flat
#Creamos el bucle que va añadiendo gráficos
for cada, variable in enumerate(cat):
cat[variable].value_counts().plot.barh(ax = ax[cada])
ax[cada].set_title(variable, fontsize = 12, fontweight = "bold")
ax[cada].tick_params(labelsize = 12)
graficos_eda_categoricos(df.select_dtypes('O'))
Conclusiones:
- mayor_edad solo tiene un valor --> eliminarla
- Sobre las imputaciones pendientes de variables categóricas:
- educacion: imputar por 'Universitaria'
- satisfaccion_trabajo: imputar por 'Alta'
- implicacion: imputar por 'Alta'
df.drop(columns = 'mayor_edad', inplace = True)
df['educacion'] = df['educacion'].fillna('Universitaria')
df['satisfaccion_trabajo'] = df['satisfaccion_trabajo'].fillna('Alta')
df['implicacion'] = df['implicacion'].fillna('Alta')
EDA VARIABLES NUMÉRICAS¶
def estadisticos_cont(num):
#Calculamos describe
estadisticos = num.describe().T
#Añadimos la mediana
estadisticos['median'] = num.median()
#Reordenamos para que la mediana esté al lado de la media
estadisticos = estadisticos.iloc[:,[0,1,8,2,3,4,5,6,7]]
#Lo devolvemos
return(estadisticos)
estadisticos_cont(df.select_dtypes('number'))
count | mean | median | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|---|
edad | 1470.0 | 36.923810 | 36.0 | 9.135373 | 18.0 | 30.0 | 36.0 | 43.0 | 60.0 |
distancia_casa | 1470.0 | 9.192517 | 7.0 | 8.106864 | 1.0 | 2.0 | 7.0 | 14.0 | 29.0 |
empleados | 1470.0 | 1.000000 | 1.0 | 0.000000 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
sexo | 1271.0 | 2.727773 | 3.0 | 0.720788 | 1.0 | 2.0 | 3.0 | 3.0 | 4.0 |
nivel_laboral | 1470.0 | 2.063946 | 2.0 | 1.106940 | 1.0 | 1.0 | 2.0 | 3.0 | 5.0 |
salario_mes | 1470.0 | 6502.931293 | 4919.0 | 4707.956783 | 1009.0 | 2911.0 | 4919.0 | 8379.0 | 19999.0 |
num_empresas_anteriores | 1470.0 | 2.693197 | 2.0 | 2.498009 | 0.0 | 1.0 | 2.0 | 4.0 | 9.0 |
incremento_salario_porc | 1470.0 | 15.209524 | 14.0 | 3.659938 | 11.0 | 12.0 | 14.0 | 18.0 | 25.0 |
horas_quincena | 1470.0 | 80.000000 | 80.0 | 0.000000 | 80.0 | 80.0 | 80.0 | 80.0 | 80.0 |
nivel_acciones | 1470.0 | 0.793878 | 1.0 | 0.852077 | 0.0 | 0.0 | 1.0 | 1.0 | 3.0 |
anos_experiencia | 1470.0 | 11.279592 | 10.0 | 7.780782 | 0.0 | 6.0 | 10.0 | 15.0 | 40.0 |
num_formaciones_ult_ano | 1470.0 | 2.799320 | 3.0 | 1.289271 | 0.0 | 2.0 | 3.0 | 3.0 | 6.0 |
anos_compania | 1470.0 | 7.008163 | 5.0 | 6.126525 | 0.0 | 3.0 | 5.0 | 9.0 | 40.0 |
anos_desde_ult_promocion | 1470.0 | 2.187755 | 1.0 | 3.222430 | 0.0 | 0.0 | 1.0 | 3.0 | 15.0 |
anos_con_manager_actual | 1470.0 | 4.123129 | 3.0 | 3.568136 | 0.0 | 2.0 | 3.0 | 7.0 | 17.0 |
Conclusiones:
- Empleados solo tiene un valor --> Eliminarla
- Sexo tiene 4 valores --> Eliminarla
- Horas quincena solo tiene una valor --> Eliminarla
- De los nulos pendientes de imputación que sean numéricas solo está el sexo, pero como la vamos a eliminar ya no hay que imputar nada
df.drop(columns = ['empleados','sexo','horas_quincena'], inplace = True)
df
edad | abandono | viajes | departamento | distancia_casa | educacion | carrera | satisfaccion_entorno | implicacion | nivel_laboral | ... | horas_extra | incremento_salario_porc | evaluacion | satisfaccion_companeros | nivel_acciones | anos_experiencia | num_formaciones_ult_ano | anos_compania | anos_desde_ult_promocion | anos_con_manager_actual | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||||||||||||
1 | 41 | Yes | Travel_Rarely | Sales | 1 | Universitaria | Life Sciences | Media | Alta | 2 | ... | Yes | 11 | Alta | Baja | 0 | 8 | 0 | 6 | 0 | 5 |
2 | 49 | No | Travel_Frequently | Research & Development | 8 | Secundaria | Life Sciences | Alta | Media | 2 | ... | No | 23 | Muy_Alta | Muy_Alta | 1 | 10 | 3 | 10 | 1 | 7 |
4 | 37 | Yes | Travel_Rarely | Research & Development | 2 | Secundaria | Other | Muy_Alta | Media | 1 | ... | Yes | 15 | Alta | Media | 0 | 7 | 3 | 0 | 0 | 0 |
5 | 33 | No | Travel_Frequently | Research & Development | 3 | Universitaria | Life Sciences | Muy_Alta | Alta | 1 | ... | Yes | 11 | Alta | Alta | 0 | 8 | 3 | 8 | 3 | 0 |
7 | 27 | No | Travel_Rarely | Research & Development | 2 | Universitaria | Medical | Baja | Alta | 1 | ... | No | 12 | Alta | Muy_Alta | 1 | 6 | 3 | 2 | 2 | 2 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2061 | 36 | No | Travel_Frequently | Research & Development | 23 | Master | Medical | Alta | Muy_Alta | 2 | ... | No | 17 | Alta | Alta | 1 | 17 | 3 | 5 | 0 | 3 |
2062 | 39 | No | Travel_Rarely | Research & Development | 6 | Secundaria | Medical | Muy_Alta | Media | 3 | ... | No | 15 | Alta | Baja | 1 | 9 | 5 | 7 | 1 | 7 |
2064 | 27 | No | Travel_Rarely | Research & Development | 4 | Master | Life Sciences | Media | Muy_Alta | 2 | ... | Yes | 20 | Muy_Alta | Media | 1 | 6 | 0 | 6 | 0 | 3 |
2065 | 49 | No | Travel_Frequently | Sales | 2 | Secundaria | Medical | Muy_Alta | Media | 2 | ... | No | 14 | Alta | Muy_Alta | 0 | 17 | 3 | 9 | 0 | 8 |
2068 | 34 | No | Travel_Rarely | Research & Development | 8 | Universitaria | Medical | Media | Muy_Alta | 2 | ... | No | 12 | Alta | Baja | 0 | 6 | 3 | 4 | 1 | 2 |
1470 rows × 25 columns
GENERACIÓN DE INSIGHTS¶
Cuantificación del problema: ¿Cual es la tasa de abandono?¶
df.abandono.value_counts(normalize = True) * 100
abandono No 83.877551 Yes 16.122449 Name: proportion, dtype: float64
¿Hay un perfil tipo de empleado que deja la empresa?¶
# Transformar abandono a numérica
df['abandono'] = df.abandono.map({'No':0, 'Yes':1})
# Analisis por educación
temp = df.groupby('educacion').abandono.mean().sort_values(ascending = False) * 100
temp.plot.bar();
# Analisis por estado civil
temp = df.groupby('estado_civil').abandono.mean().sort_values(ascending = False) * 100
temp.plot.bar();
# Analisis por horas extras
temp = df.groupby('horas_extra').abandono.mean().sort_values(ascending = False) * 100
temp.plot.bar();
# Analisis por puesto
temp = df.groupby('puesto').abandono.mean().sort_values(ascending = False) * 100
temp.plot.bar();
temp = df.groupby('abandono').salario_mes.mean()
temp.plot.bar();
Conclusiones:
El perfil medio del empleado que deja la empresa es:
- Bajo nivel educativo
- Soltero
- Trabaja en ventas
- Bajo salario
- Alta carga de horas extras
¿Cual es el impacto económico de este problema?¶
Según el estudio "Cost of Turnover" del Center for American Progress:
El coste de la fuga de los empleados que ganan menos de 30000 es del 16,1% de su salario
El coste de la fuga de los empleados que ganan entre 30000-50000 es del 19,7% de su salario
El coste de la fuga de los empleados que ganan entre 50000-75000 es del 20,4% de su salario
El coste de la fuga de los empleados que ganan más de 75000 es del 21% de su salario
# Creamos una nueva variable salario_ano del empleado
df['salario_ano'] = df.salario_mes.transform(lambda x: x*12)
df[['salario_mes','salario_ano']]
salario_mes | salario_ano | |
---|---|---|
id | ||
1 | 5993 | 71916 |
2 | 5130 | 61560 |
4 | 2090 | 25080 |
5 | 2909 | 34908 |
7 | 3468 | 41616 |
... | ... | ... |
2061 | 2571 | 30852 |
2062 | 9991 | 119892 |
2064 | 6142 | 73704 |
2065 | 5390 | 64680 |
2068 | 4404 | 52848 |
1470 rows × 2 columns
# Calculamos el impacto económico de cada empleado si deja la empresa
#Lista de condiciones
condiciones = [(df['salario_ano'] <= 30000),
(df['salario_ano'] > 30000) & (df['salario_ano'] <= 50000),
(df['salario_ano'] > 50000) & (df['salario_ano'] <= 75000),
(df['salario_ano'] > 75000)]
#Lista de resultados
resultados = [df.salario_ano * 0.161, df.salario_ano * 0.197, df.salario_ano * 0.204, df.salario_ano * 0.21]
#Aplicamos select
df['impacto_abandono'] = np.select(condiciones,resultados, default = -999)
df
edad | abandono | viajes | departamento | distancia_casa | educacion | carrera | satisfaccion_entorno | implicacion | nivel_laboral | ... | evaluacion | satisfaccion_companeros | nivel_acciones | anos_experiencia | num_formaciones_ult_ano | anos_compania | anos_desde_ult_promocion | anos_con_manager_actual | salario_ano | impacto_abandono | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||||||||||||
1 | 41 | 1 | Travel_Rarely | Sales | 1 | Universitaria | Life Sciences | Media | Alta | 2 | ... | Alta | Baja | 0 | 8 | 0 | 6 | 0 | 5 | 71916 | 14670.864 |
2 | 49 | 0 | Travel_Frequently | Research & Development | 8 | Secundaria | Life Sciences | Alta | Media | 2 | ... | Muy_Alta | Muy_Alta | 1 | 10 | 3 | 10 | 1 | 7 | 61560 | 12558.240 |
4 | 37 | 1 | Travel_Rarely | Research & Development | 2 | Secundaria | Other | Muy_Alta | Media | 1 | ... | Alta | Media | 0 | 7 | 3 | 0 | 0 | 0 | 25080 | 4037.880 |
5 | 33 | 0 | Travel_Frequently | Research & Development | 3 | Universitaria | Life Sciences | Muy_Alta | Alta | 1 | ... | Alta | Alta | 0 | 8 | 3 | 8 | 3 | 0 | 34908 | 6876.876 |
7 | 27 | 0 | Travel_Rarely | Research & Development | 2 | Universitaria | Medical | Baja | Alta | 1 | ... | Alta | Muy_Alta | 1 | 6 | 3 | 2 | 2 | 2 | 41616 | 8198.352 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2061 | 36 | 0 | Travel_Frequently | Research & Development | 23 | Master | Medical | Alta | Muy_Alta | 2 | ... | Alta | Alta | 1 | 17 | 3 | 5 | 0 | 3 | 30852 | 6077.844 |
2062 | 39 | 0 | Travel_Rarely | Research & Development | 6 | Secundaria | Medical | Muy_Alta | Media | 3 | ... | Alta | Baja | 1 | 9 | 5 | 7 | 1 | 7 | 119892 | 25177.320 |
2064 | 27 | 0 | Travel_Rarely | Research & Development | 4 | Master | Life Sciences | Media | Muy_Alta | 2 | ... | Muy_Alta | Media | 1 | 6 | 0 | 6 | 0 | 3 | 73704 | 15035.616 |
2065 | 49 | 0 | Travel_Frequently | Sales | 2 | Secundaria | Medical | Muy_Alta | Media | 2 | ... | Alta | Muy_Alta | 0 | 17 | 3 | 9 | 0 | 8 | 64680 | 13194.720 |
2068 | 34 | 0 | Travel_Rarely | Research & Development | 8 | Universitaria | Medical | Media | Muy_Alta | 2 | ... | Alta | Baja | 0 | 6 | 3 | 4 | 1 | 2 | 52848 | 10780.992 |
1470 rows × 27 columns
¿Cúanto nos ha costado este problema en el último año?¶
coste_total = df.loc[df.abandono == 1].impacto_abandono.sum()
coste_total
2719005.912
¿Cuanto nos cuesta que los empleados no estén motivados? (pérdidas en implicación == Baja)¶
df.loc[(df.abandono == 1) & (df.implicacion == 'Baja')].impacto_abandono.sum()
368672.688
¿Cuanto dinero podríamos ahorrar fidelizando mejor a nuestros empleados?¶
print(f"Reducir un 10% la fuga de empleados nos ahorraría {int(coste_total * 0.1)}$ cada año.")
print(f"Reducir un 20% la fuga de empleados nos ahorraría {int(coste_total * 0.2)}$ cada año.")
print(f"Reducir un 30% la fuga de empleados nos ahorraría {int(coste_total * 0.3)}$ cada año.")
Reducir un 10% la fuga de empleados nos ahorraría 271900$ cada año. Reducir un 20% la fuga de empleados nos ahorraría 543801$ cada año. Reducir un 30% la fuga de empleados nos ahorraría 815701$ cada año.
Y podemos seguir trazando estrategias asociadas a los insights de abandono:¶
Habíamos visto que los representantes de ventas son el puesto que más se van. ¿Tendría sentido hacer un plan específico para ellos? ¿Cual sería el coste ahorrado si disminuimos la fuga un 30%?
Primero vamos a calcular el % de representantes de ventas que se han ido el año pasado
total_repre_pasado = len(df.loc[df.puesto == 'Sales Representative'])
abandonos_repre_pasado = len(df.loc[(df.puesto == 'Sales Representative') & (df.abandono == 1)])
porc_pasado = abandonos_repre_pasado / total_repre_pasado
porc_pasado
0.39759036144578314
Ahora vamos a estimar cuántos se nos irán este año
total_repre_actual = len(df.loc[(df.puesto == 'Sales Representative') & (df.abandono == 0)])
se_iran = int(total_repre_actual * porc_pasado)
se_iran
19
Sobre ellos cuantos podemos retener (hipótesis 30%) y cuanto dinero puede suponer
retenemos = int(se_iran * 0.3)
ahorramos = df.loc[(df.puesto == 'Sales Representative') & (df.abandono == 0),'impacto_abandono'].sum() * porc_pasado * 0.3
print(f'Podemos retener {retenemos} representantes de ventas y ello supondría ahorrar {ahorramos}$.')
Podemos retener 5 representantes de ventas y ello supondría ahorrar 37447.22424578312$.
Este dato también es muy interesante porque nos permite determinar el presupuesto para acciones de retención por departamento o perfil.
Ya que sabemos que podemos gastarnos hasta 37.000$ sólo en acciones específicas para retener a representantes de ventas y se estarían pagando sólas con la pérdida evitada
MODELO DE MACHINE LEARNING¶
df_ml = df.copy()
df_ml.info()
<class 'pandas.core.frame.DataFrame'> Index: 1470 entries, 1 to 2068 Data columns (total 27 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 edad 1470 non-null int64 1 abandono 1470 non-null int64 2 viajes 1470 non-null object 3 departamento 1470 non-null object 4 distancia_casa 1470 non-null int64 5 educacion 1470 non-null object 6 carrera 1470 non-null object 7 satisfaccion_entorno 1470 non-null object 8 implicacion 1470 non-null object 9 nivel_laboral 1470 non-null int64 10 puesto 1470 non-null object 11 satisfaccion_trabajo 1470 non-null object 12 estado_civil 1470 non-null object 13 salario_mes 1470 non-null int64 14 num_empresas_anteriores 1470 non-null int64 15 horas_extra 1470 non-null object 16 incremento_salario_porc 1470 non-null int64 17 evaluacion 1470 non-null object 18 satisfaccion_companeros 1470 non-null object 19 nivel_acciones 1470 non-null int64 20 anos_experiencia 1470 non-null int64 21 num_formaciones_ult_ano 1470 non-null int64 22 anos_compania 1470 non-null int64 23 anos_desde_ult_promocion 1470 non-null int64 24 anos_con_manager_actual 1470 non-null int64 25 salario_ano 1470 non-null int64 26 impacto_abandono 1470 non-null float64 dtypes: float64(1), int64(14), object(12) memory usage: 321.6+ KB
PREPARACIÓN DE LOS DATOS PARA LA MODELIZACIÓN¶
Transformar todas las variables categóricas a númericas¶
from sklearn.preprocessing import OneHotEncoder
#Categóricas
cat = df_ml.select_dtypes('O')
#Instanciamos
ohe = OneHotEncoder(sparse = False)
#Entrenamos
ohe.fit(cat)
#Aplicamos
cat_ohe = ohe.transform(cat)
#Ponemos los nombres
cat_ohe = pd.DataFrame(cat_ohe, columns = ohe.get_feature_names_out(input_features = cat.columns)).reset_index(drop = True)
/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/sklearn/preprocessing/_encoders.py:975: FutureWarning: `sparse` was renamed to `sparse_output` in version 1.2 and will be removed in 1.4. `sparse_output` is ignored unless you leave `sparse` to its default value. warnings.warn(
cat_ohe
viajes_Non-Travel | viajes_Travel_Frequently | viajes_Travel_Rarely | departamento_Human Resources | departamento_Research & Development | departamento_Sales | educacion_Master | educacion_Primaria | educacion_Secundaria | educacion_Universitaria | ... | estado_civil_Married | estado_civil_Single | horas_extra_No | horas_extra_Yes | evaluacion_Alta | evaluacion_Muy_Alta | satisfaccion_companeros_Alta | satisfaccion_companeros_Baja | satisfaccion_companeros_Media | satisfaccion_companeros_Muy_Alta | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
1 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
2 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
3 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
4 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1465 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
1466 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
1467 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
1468 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
1469 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
1470 rows × 48 columns
Dataframe final¶
Seleccionamos las variables numéricas para poder juntarlas a las cat_hoe
num = df.select_dtypes('number').reset_index(drop = True)
Las juntamos todas en el dataframe final
df_ml = pd.concat([cat_ohe,num], axis = 1)
df_ml
viajes_Non-Travel | viajes_Travel_Frequently | viajes_Travel_Rarely | departamento_Human Resources | departamento_Research & Development | departamento_Sales | educacion_Master | educacion_Primaria | educacion_Secundaria | educacion_Universitaria | ... | num_empresas_anteriores | incremento_salario_porc | nivel_acciones | anos_experiencia | num_formaciones_ult_ano | anos_compania | anos_desde_ult_promocion | anos_con_manager_actual | salario_ano | impacto_abandono | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 8 | 11 | 0 | 8 | 0 | 6 | 0 | 5 | 71916 | 14670.864 |
1 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 1 | 23 | 1 | 10 | 3 | 10 | 1 | 7 | 61560 | 12558.240 |
2 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 6 | 15 | 0 | 7 | 3 | 0 | 0 | 0 | 25080 | 4037.880 |
3 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 1 | 11 | 0 | 8 | 3 | 8 | 3 | 0 | 34908 | 6876.876 |
4 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 9 | 12 | 1 | 6 | 3 | 2 | 2 | 2 | 41616 | 8198.352 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1465 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 4 | 17 | 1 | 17 | 3 | 5 | 0 | 3 | 30852 | 6077.844 |
1466 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 4 | 15 | 1 | 9 | 5 | 7 | 1 | 7 | 119892 | 25177.320 |
1467 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 1 | 20 | 1 | 6 | 0 | 6 | 0 | 3 | 73704 | 15035.616 |
1468 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 2 | 14 | 0 | 17 | 3 | 9 | 0 | 8 | 64680 | 13194.720 |
1469 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 2 | 12 | 0 | 6 | 3 | 4 | 1 | 2 | 52848 | 10780.992 |
1470 rows × 63 columns
DISEÑO DE LA MODELIZACIÓN¶
Separación predictoras y target¶
x = df_ml.drop(columns='abandono')
y = df_ml['abandono']
Separación train y test¶
from sklearn.model_selection import train_test_split
train_x, test_x, train_y, test_y = train_test_split(x, y, test_size = 0.3)
ENTRENAMIENTO DEL MODELO SOBRE TRAIN¶
from sklearn.tree import DecisionTreeClassifier
#Instanciar
ac = DecisionTreeClassifier(max_depth=4)
#Entrenar
ac.fit(train_x,train_y)
DecisionTreeClassifier(max_depth=4)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier(max_depth=4)
PREDICCIÓN Y VALIDACIÓN SOBRE TEST¶
# Predicción
pred = ac.predict_proba(test_x)[:, 1]
pred[:20]
array([0.02601156, 1. , 0.13559322, 0.09541985, 0.02601156, 0.02601156, 0.02601156, 0.09541985, 0.02601156, 1. , 0.13559322, 0.11538462, 0.09541985, 0.11538462, 0.09541985, 0.12650602, 0.02601156, 0.02601156, 0.09541985, 0.09541985])
# Evaluación
from sklearn.metrics import roc_auc_score
roc_auc_score(test_y,pred)
0.690701352897624
INTERPRETACIÓN¶
Diagrama del árbol¶
from sklearn.tree import plot_tree
plt.figure(figsize = (50,50))
plot_tree(ac,
feature_names= test_x.columns,
impurity = False,
node_ids = True,
proportion = True,
rounded = True,
precision = 2);
Importancia de las variables¶
pd.Series(ac.feature_importances_,index = test_x.columns).sort_values(ascending = False).plot(kind = 'bar', figsize = (30,20));
EXPLOTACIÓN¶
Incoporación del scoring al dataframe principal
df['scoring_abandono'] = ac.predict_proba(df_ml.drop(columns = 'abandono'))[:, 1]
df
edad | abandono | viajes | departamento | distancia_casa | educacion | carrera | satisfaccion_entorno | implicacion | nivel_laboral | ... | satisfaccion_companeros | nivel_acciones | anos_experiencia | num_formaciones_ult_ano | anos_compania | anos_desde_ult_promocion | anos_con_manager_actual | salario_ano | impacto_abandono | scoring_abandono | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||||||||||||
1 | 41 | 1 | Travel_Rarely | Sales | 1 | Universitaria | Life Sciences | Media | Alta | 2 | ... | Baja | 0 | 8 | 0 | 6 | 0 | 5 | 71916 | 14670.864 | 0.382353 |
2 | 49 | 0 | Travel_Frequently | Research & Development | 8 | Secundaria | Life Sciences | Alta | Media | 2 | ... | Muy_Alta | 1 | 10 | 3 | 10 | 1 | 7 | 61560 | 12558.240 | 0.026012 |
4 | 37 | 1 | Travel_Rarely | Research & Development | 2 | Secundaria | Other | Muy_Alta | Media | 1 | ... | Media | 0 | 7 | 3 | 0 | 0 | 0 | 25080 | 4037.880 | 0.909091 |
5 | 33 | 0 | Travel_Frequently | Research & Development | 3 | Universitaria | Life Sciences | Muy_Alta | Alta | 1 | ... | Alta | 0 | 8 | 3 | 8 | 3 | 0 | 34908 | 6876.876 | 0.126506 |
7 | 27 | 0 | Travel_Rarely | Research & Development | 2 | Universitaria | Medical | Baja | Alta | 1 | ... | Muy_Alta | 1 | 6 | 3 | 2 | 2 | 2 | 41616 | 8198.352 | 0.026012 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2061 | 36 | 0 | Travel_Frequently | Research & Development | 23 | Master | Medical | Alta | Muy_Alta | 2 | ... | Alta | 1 | 17 | 3 | 5 | 0 | 3 | 30852 | 6077.844 | 0.026012 |
2062 | 39 | 0 | Travel_Rarely | Research & Development | 6 | Secundaria | Medical | Muy_Alta | Media | 3 | ... | Baja | 1 | 9 | 5 | 7 | 1 | 7 | 119892 | 25177.320 | 0.026012 |
2064 | 27 | 0 | Travel_Rarely | Research & Development | 4 | Master | Life Sciences | Media | Muy_Alta | 2 | ... | Media | 1 | 6 | 0 | 6 | 0 | 3 | 73704 | 15035.616 | 0.126506 |
2065 | 49 | 0 | Travel_Frequently | Sales | 2 | Secundaria | Medical | Muy_Alta | Media | 2 | ... | Muy_Alta | 0 | 17 | 3 | 9 | 0 | 8 | 64680 | 13194.720 | 0.095420 |
2068 | 34 | 0 | Travel_Rarely | Research & Development | 8 | Universitaria | Medical | Media | Muy_Alta | 2 | ... | Baja | 0 | 6 | 3 | 4 | 1 | 2 | 52848 | 10780.992 | 0.095420 |
1470 rows × 28 columns
Ejemplo de los 10 empleados con mayor probabilidad de dejar la empresa
df.sort_values(by = 'scoring_abandono', ascending = False)[0:10]
edad | abandono | viajes | departamento | distancia_casa | educacion | carrera | satisfaccion_entorno | implicacion | nivel_laboral | ... | satisfaccion_companeros | nivel_acciones | anos_experiencia | num_formaciones_ult_ano | anos_compania | anos_desde_ult_promocion | anos_con_manager_actual | salario_ano | impacto_abandono | scoring_abandono | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||||||||||||
1868 | 29 | 1 | Travel_Frequently | Research & Development | 24 | Secundaria | Life Sciences | Muy_Alta | Media | 1 | ... | Media | 0 | 1 | 3 | 1 | 1 | 0 | 29268 | 4712.148 | 1.0 |
911 | 32 | 1 | Travel_Rarely | Research & Development | 25 | Universitaria | Life Sciences | Baja | Alta | 1 | ... | Alta | 0 | 1 | 2 | 1 | 0 | 1 | 33540 | 6607.380 | 1.0 |
1624 | 18 | 1 | Travel_Frequently | Sales | 3 | Universitaria | Medical | Media | Alta | 1 | ... | Alta | 0 | 0 | 2 | 0 | 0 | 0 | 18828 | 3031.308 | 1.0 |
235 | 19 | 1 | Travel_Frequently | Sales | 1 | Primaria | Technical Degree | Alta | Baja | 1 | ... | Baja | 0 | 1 | 5 | 0 | 0 | 0 | 27900 | 4491.900 | 1.0 |
1604 | 28 | 1 | Travel_Rarely | Research & Development | 24 | Universitaria | Medical | Alta | Alta | 1 | ... | Alta | 3 | 1 | 3 | 1 | 0 | 0 | 28896 | 4652.256 | 1.0 |
2023 | 23 | 1 | Travel_Frequently | Sales | 9 | Universitaria | Marketing | Muy_Alta | Alta | 1 | ... | Baja | 1 | 1 | 3 | 1 | 1 | 0 | 21480 | 3458.280 | 1.0 |
811 | 23 | 1 | Travel_Rarely | Research & Development | 6 | Master | Life Sciences | Alta | Muy_Alta | 1 | ... | Alta | 2 | 1 | 2 | 0 | 0 | 0 | 19212 | 3093.132 | 1.0 |
1569 | 35 | 1 | Travel_Rarely | Research & Development | 2 | Universitaria | Life Sciences | Baja | Alta | 1 | ... | Muy_Alta | 1 | 1 | 2 | 1 | 0 | 0 | 24888 | 4006.968 | 1.0 |
1203 | 22 | 1 | Travel_Frequently | Research & Development | 3 | Secundaria | Life Sciences | Alta | Media | 1 | ... | Media | 1 | 1 | 5 | 0 | 0 | 0 | 34236 | 6744.492 | 1.0 |
65 | 28 | 1 | Travel_Rarely | Research & Development | 5 | Universitaria | Technical Degree | Alta | Alta | 1 | ... | Alta | 0 | 2 | 3 | 2 | 2 | 2 | 41292 | 8134.524 | 1.0 |
10 rows × 28 columns
Ejemplo: riesgo de dejar la empresa por puesto de trabajo
df.boxplot(column='scoring_abandono', by='puesto', figsize = (20,12));
GUARDAR EL RESULTADO¶
df
edad | abandono | viajes | departamento | distancia_casa | educacion | carrera | satisfaccion_entorno | implicacion | nivel_laboral | ... | satisfaccion_companeros | nivel_acciones | anos_experiencia | num_formaciones_ult_ano | anos_compania | anos_desde_ult_promocion | anos_con_manager_actual | salario_ano | impacto_abandono | scoring_abandono | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||||||||||||
1 | 41 | 1 | Travel_Rarely | Sales | 1 | Universitaria | Life Sciences | Media | Alta | 2 | ... | Baja | 0 | 8 | 0 | 6 | 0 | 5 | 71916 | 14670.864 | 0.382353 |
2 | 49 | 0 | Travel_Frequently | Research & Development | 8 | Secundaria | Life Sciences | Alta | Media | 2 | ... | Muy_Alta | 1 | 10 | 3 | 10 | 1 | 7 | 61560 | 12558.240 | 0.026012 |
4 | 37 | 1 | Travel_Rarely | Research & Development | 2 | Secundaria | Other | Muy_Alta | Media | 1 | ... | Media | 0 | 7 | 3 | 0 | 0 | 0 | 25080 | 4037.880 | 0.909091 |
5 | 33 | 0 | Travel_Frequently | Research & Development | 3 | Universitaria | Life Sciences | Muy_Alta | Alta | 1 | ... | Alta | 0 | 8 | 3 | 8 | 3 | 0 | 34908 | 6876.876 | 0.126506 |
7 | 27 | 0 | Travel_Rarely | Research & Development | 2 | Universitaria | Medical | Baja | Alta | 1 | ... | Muy_Alta | 1 | 6 | 3 | 2 | 2 | 2 | 41616 | 8198.352 | 0.026012 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2061 | 36 | 0 | Travel_Frequently | Research & Development | 23 | Master | Medical | Alta | Muy_Alta | 2 | ... | Alta | 1 | 17 | 3 | 5 | 0 | 3 | 30852 | 6077.844 | 0.026012 |
2062 | 39 | 0 | Travel_Rarely | Research & Development | 6 | Secundaria | Medical | Muy_Alta | Media | 3 | ... | Baja | 1 | 9 | 5 | 7 | 1 | 7 | 119892 | 25177.320 | 0.026012 |
2064 | 27 | 0 | Travel_Rarely | Research & Development | 4 | Master | Life Sciences | Media | Muy_Alta | 2 | ... | Media | 1 | 6 | 0 | 6 | 0 | 3 | 73704 | 15035.616 | 0.126506 |
2065 | 49 | 0 | Travel_Frequently | Sales | 2 | Secundaria | Medical | Muy_Alta | Media | 2 | ... | Muy_Alta | 0 | 17 | 3 | 9 | 0 | 8 | 64680 | 13194.720 | 0.095420 |
2068 | 34 | 0 | Travel_Rarely | Research & Development | 8 | Universitaria | Medical | Media | Muy_Alta | 2 | ... | Baja | 0 | 6 | 3 | 4 | 1 | 2 | 52848 | 10780.992 | 0.095420 |
1470 rows × 28 columns
# from google.colab import files
df.to_excel('abandono_con_scoring.xlsx')
# files.download('abandono_con_scoring.xlsx')