[Survey Analysis] Analyzing the Kaggle 2017 Survey with Python - 4 (2018.06.27)

2018. 6. 27. 11:23

I found a good resource for analyzing and visualizing survey data, so I am writing it up here.

The material is taken from the reference below.

ref.
corazzon / KaggleStruggle / kaggle-survey-2017 / Kaggle-ML-DS-survey-2017-EDA-FAQ.ipynb
https://github.com/corazzon/KaggleStruggle/blob/master/kaggle-survey-2017/Kaggle-ML-DS-survey-2017-EDA-FAQ.ipynb

The data needed for the analysis is available under the Data tab at the link below.

https://www.kaggle.com/kaggle/kaggle-survey-2017

============================= Python =============================

import pprint
import pandas as pd
import numpy as np
from scipy import stats
# Plotting libraries.
import matplotlib.pyplot as plt
import seaborn as sns

# Show up to 12 DataFrame columns when printing.
pd.set_option('display.max_columns', 12)

# Load schema.csv into question.
question = pd.read_csv('schema.csv')

# Load multipleChoiceResponses.csv into mcq. The file is read with the widely used
# ISO-8859-1 (Latin-1) encoding rather than the UTF-8 default.
mcq = pd.read_csv('multipleChoiceResponses.csv', encoding='ISO-8859-1', low_memory=False)

# Keep the South Korean responses in a separate frame.
korea = mcq.loc[mcq.Country == 'South Korea']

# What is the average salary of a data scientist?
# print(mcq[mcq['CompensationAmount'].notnull()].shape)  # (5224, 228): 5,224 respondents gave an amount
# print(korea[korea['CompensationAmount'].notnull()].shape)  # (35, 228): 35 respondents from South Korea

# Strip the ',' and '-' characters from the amounts.
mcq['CompensationAmount'] = mcq['CompensationAmount'].str.replace(',', '')
mcq['CompensationAmount'] = mcq['CompensationAmount'].str.replace('-', '')

# Load the exchange rates used for currency conversion.
rates = pd.read_csv('conversionRates.csv')
# print(rates.head())  # inspect the data

'''
Output
   Unnamed: 0 originCountry  exchangeRate
0           1           USD      1.000000
1           2           EUR      1.195826
2           3           INR      0.015620
3           4           GBP      1.324188
4           5           BRL      0.321350
'''

# inplace defaults to False. With False, .drop() returns a new DataFrame and leaves
# rates unchanged; with True, the result of .drop() is stored in rates itself.
rates.drop('Unnamed: 0', axis=1, inplace=True)  # remove the 'Unnamed: 0' column
# print(rates.head())  # inspect the data

'''
Output
  originCountry  exchangeRate
0           USD      1.000000
1           EUR      1.195826
2           INR      0.015620
3           GBP      1.324188
4           BRL      0.321350
'''

# Select only the 'CompensationAmount', 'CompensationCurrency', 'GenderSelect',
# 'Country' and 'CurrentJobTitleSelect' columns, then drop rows containing NaN.
salary = mcq[
    ['CompensationAmount', 'CompensationCurrency',
     'GenderSelect',
     'Country',
     'CurrentJobTitleSelect']].dropna()
# print(salary.head())  # inspect the data
# print(salary.shape)  # (4367, 5): 4,367 responses remain

# Merge rates into salary.
# The join key is 'CompensationCurrency' on the left (salary) and 'originCountry' on the
# right (rates). how='left' keeps every row of salary and appends the rates columns to it.
salary = salary.merge(rates, left_on='CompensationCurrency', right_on='originCountry', how='left')
# print(salary.head())  # inspect the data

# to_numeric() converts values stored as strings to a numeric dtype. See the documentation for details:
# https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_numeric.html?highlight=to_numeric#pandas.to_numeric
salary['Salary'] = pd.to_numeric(salary['CompensationAmount']) * salary['exchangeRate']
# print(salary.Salary.head(10))  # inspect the data

# Check the maximum, minimum and median of Salary.
# print('max', salary['Salary'].dropna().astype(int).max())
# print('min', salary['Salary'].dropna().astype(int).min())
# print('median', salary['Salary'].dropna().astype(int).median())

# Visualization
# plt.subplots(figsize=(15, 8))
# Keep only respondents whose salary is below 500,000 USD.
salary = salary[salary['Salary'] < 500000]
# sns.distplot(salary['Salary'])
# plt.axvline(salary['Salary'].median(), linestyle='dashed')  # mark the median with a dashed line
# plt.show()

# Compare the median salary by country to see which countries pay the most.
# plt.subplots(figsize=(8,12))
# sal_coun = salary.groupby('Country')['Salary'].median().sort_values(ascending=False)[:30].to_frame()
# sns.barplot(x='Salary',
#             y=sal_coun.index,
#             data=sal_coun,
#             palette='RdYlGn')
# plt.axvline(salary['Salary'].median(), linestyle='dashed')
# plt.title('Highest Salary Paying Countries')
# plt.show()

# Boxplot of the salary gap between men and women worldwide.
# plt.subplots(figsize=(8, 4))
# sns.boxplot(y='GenderSelect', x='Salary', data=salary)
# plt.show()

# Boxplot of the salary gap between men and women in South Korea.
salary_korea = salary.loc[(salary['Country'] == 'South Korea')]
# plt.subplots(figsize=(8, 4))
# sns.boxplot(y='GenderSelect', x='Salary', data=salary_korea)
# plt.show()
# print(salary_korea.shape)  # (26, 8): 26 respondents
# print(salary_korea[salary_korea['GenderSelect'] == 'Female'])  # only 3 female respondents from South Korea, too few to compare
# print(salary['Salary'].describe())  # summary statistics for the Salary column

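# Check where respondents get their raw data (public dataset sources).
# Each answer is a comma-separated string, so split it into a list first.
mcq['PublicDatasetsSelect'] = mcq['PublicDatasetsSelect'].astype('str').apply(lambda x: x.split(','))
# Expand each list into one row per choice.
q = mcq.apply(lambda x: pd.Series(x['PublicDatasetsSelect']), axis=1).stack().reset_index(level=1, drop=True)
q.name = 'courses'
# Drop the 'nan' strings produced by missing answers and count each choice.
q = q[q != 'nan'].value_counts()
# print(q.head())  # inspect the data
# print(pd.DataFrame(q))  # inspect the data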

# Visualization
plt.title("Most Popular Dataset Platforms")
sns.barplot(y=q.index, x=q)
plt.show()

============================= Python =============================
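The salary conversion in the script above (strip separators, left-join the rates, multiply) can be tried out on a small frame without downloading the Kaggle files. The sketch below is only an illustration: the toy rows and the helper name `to_usd` are made up for this example, and `errors='coerce'` is my own addition so that amounts that cannot be parsed become NaN instead of raising.

```python
import pandas as pd

# Hypothetical toy rows standing in for multipleChoiceResponses.csv.
responses = pd.DataFrame({
    "CompensationAmount": ["100,000", "50000", "-", "2,000,000", None],
    "CompensationCurrency": ["USD", "EUR", "USD", "INR", "GBP"],
})

# Hypothetical rates table standing in for conversionRates.csv (GBP left out on purpose).
rates = pd.DataFrame({
    "originCountry": ["USD", "EUR", "INR"],
    "exchangeRate": [1.0, 1.195826, 0.01562],
})


def to_usd(responses, rates):
    """Return a copy of responses with a Salary column expressed in USD."""
    cleaned = responses.copy()
    # Remove thousands separators and stray dashes as literal characters.
    cleaned["CompensationAmount"] = (
        cleaned["CompensationAmount"]
        .str.replace(",", "", regex=False)
        .str.replace("-", "", regex=False)
    )
    # Left join: every response row is kept, unknown currencies get NaN rates.
    merged = cleaned.merge(
        rates, left_on="CompensationCurrency", right_on="originCountry", how="left"
    )
    # Unparseable or missing amounts become NaN rather than raising an error.
    amount = pd.to_numeric(merged["CompensationAmount"], errors="coerce")
    merged["Salary"] = amount * merged["exchangeRate"]
    return merged


salary = to_usd(responses, rates)
print(salary[["CompensationCurrency", "Salary"]])
```

Rows with an empty amount or a currency missing from the rates table end up with a NaN Salary, so they drop out of later filters such as `salary['Salary'] < 500000` on their own.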

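The multi-select counting at the end of the script uses a row-wise `apply` with `pd.Series`, which can be slow on a large frame. As a possible alternative, assuming pandas 0.25 or later where `Series.explode` is available, the same count can be sketched as below. The answer values and the helper name `count_choices` are hypothetical and only serve the example.

```python
import pandas as pd

# Hypothetical comma-separated answers standing in for mcq['PublicDatasetsSelect'].
toy = pd.DataFrame({
    "PublicDatasetsSelect": [
        "Kaggle,Google Search",
        "Google Search",
        None,
        "GitHub,Google Search",
        "Kaggle",
    ]
})


def count_choices(column):
    """Count how often each choice appears in a comma-separated multi-select column."""
    return (
        column.dropna()       # skip respondents who left the question blank
        .str.split(",")       # one list of choices per respondent
        .explode()            # one row per individual choice
        .str.strip()          # tolerate spaces after the commas
        .value_counts()
    )


counts = count_choices(toy["PublicDatasetsSelect"])
print(counts)
```

Dropping the missing answers before splitting avoids the `'nan'` strings that the `astype('str')` approach has to filter out afterwards, and the resulting Series can be passed to `sns.barplot(y=counts.index, x=counts)` in the same way as `q`.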