[Python] Python Day 19 ('예제로 배우는 파이썬 데이터 시각화')

2018. 6. 15. 13:45

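I cleaned the outliers out of a data set and then used the result to draw a histogram, a scatter plot, and a box plot.

The outliers were removed using the modified z-score, which is based on the median rather than the mean.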
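As a rough illustration of the modified z-score idea, the sketch below works through the calculation by hand on a tiny sample. The sample values and variable names are made up for illustration; the 0.6745 constant and the 3.5 cutoff are the ones that appear in the `is_outlier` function in this post.

```python
import numpy as np

# Hypothetical sample: five ordinary values and one obvious outlier.
sample = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])

# Median of the sample.
median = np.median(sample)

# Absolute deviation of every value from the median.
abs_dev = np.abs(sample - median)

# MAD: the median of those absolute deviations.
mad = np.median(abs_dev)

# Modified z-score and the resulting outlier mask.
modified_z = 0.6745 * abs_dev / mad
mask = modified_z > 3.5

print(sample[mask])  # → [100.]
```

Because both the center (median) and the scale (MAD) are medians, the single extreme value barely shifts them, which is why it stands out so clearly.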

The code here comes from the book '예제로 배우는 파이썬 데이터 시각화' (written for Python 2.7), converted to Python 3.

============================ Python ============================

import numpy as np

import matplotlib.pyplot as plt

def is_outlier(points, threshold=3.5):
    '''
    Return a boolean array that is True where a point is an outlier
    and False otherwise.

    A point counts as an outlier when its modified z-score is greater
    than threshold.
    '''
    # Convert to a column vector.
    # A 1-D array looks like [], a 2-D array like [[]], a 3-D array like [[[]]].
    if len(points.shape) == 1:  # only runs for 1-D input
        # Turn shape (n,) into shape (n, 1): one single-element row per value.
        # ex) [1, 2, 3] -> [[1], [2], [3]]
        points = points[:, None]
    # print(points)

    # Compute the median.
    # axis=0 reduces over the first dimension (down the rows),
    # so this gives one median per column.
    # More on axis => http://taewan.kim/post/numpy_sum_axis/
    median = np.median(points, axis=0)
    # print(median)

    # Distance of each point from the median: sum the squared differences
    # along the last axis, then take the square root.
    #
    # About axis = -1
    #
    # From numpy.stack:
    #   The axis parameter specifies the index of the new axis in the
    #   dimensions of the result. For example, if axis=0 it will be the
    #   first dimension and if axis=-1 it will be the last dimension.
    #   ref. https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.stack.html
    #
    # From numpy.sum:
    #   axis : None or int or tuple of ints, optional
    #   Axis or axes along which a sum is performed.
    #   The default, axis=None, will sum all of the elements of the input array.
    #   If axis is negative it counts from the last to the first axis.
    #   New in version 1.7.0.
    #   If axis is a tuple of ints, a sum is performed on all of the axes
    #   specified in the tuple instead of a single axis or all the axes as before.
    #   ref. https://www.numpy.org/devdocs/reference/generated/numpy.sum.html?highlight=sum
    diff = np.sum((points - median)**2, axis=-1)
    # print(diff)
    diff = np.sqrt(diff)

    # MAD (median absolute deviation): the median of the distances from the
    # median. It is a robust measure of spread, used here in place of the
    # standard deviation.
    med_abs_deviation = np.median(diff)

    # Compute the modified z-score.
    # http://www.itl.nist.gov/div898/handbook/eda/section4/eda43.htm#Iglewicz
    # A z-score rescales values so that measurements in different units can be
    # compared directly; the modified version uses the median and the MAD.
    # Note: if more than half of the values are identical, the MAD is 0 and
    # this division produces inf or nan.
    modified_z_score = 0.6745 * diff / med_abs_deviation

    # Return the outlier mask: a boolean array where True marks an outlier.
    return modified_z_score > threshold

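# Random data: 100 values drawn uniformly from [0, 1)
x = np.random.random(100)

# Number of histogram buckets
buckets = 50

# Add outliers
x = np.r_[x, -49, 95, 100, -100]

# Keep only the valid data points.
# On a boolean numpy array, ~ acts as an element-wise logical NOT,
# so only the values that are not outliers are stored in filtered.
filtered = x[~is_outlier(x)]

# Draw the histograms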

plt.figure()

plt.subplot(211)

plt.hist(x, buckets)

plt.xlabel('Raw')

plt.subplot(212)

plt.hist(filtered, buckets)

plt.xlabel('Cleaned')

plt.show()

import numpy as np
import matplotlib.pyplot as plt

# Generate fake data
spread = np.random.rand(50) * 100  # 50 random values in [0, 1) times 100, so each lies in [0, 100)
# print(spread)
center = np.ones(25) * 50  # an array of 25 ones times 50, so every element is 50
# print(center)

# Generate high and low outliers
flier_high = np.random.rand(10) * 100 + 100  # 10 values in [0, 100) plus 100, so each lies in [100, 200)
# print(flier_high)
flier_low = np.random.rand(10) * -100  # 10 values in [0, 1) times -100, so each lies in (-100, 0]
# print(flier_low)

# Combine the generated data sets
data = np.concatenate((spread, center, flier_high, flier_low), axis=0)  # join all four arrays end to end
print(data)

plt.subplot(3, 1, 1)  # same as subplot(311): put the plot in slot 1 of a 3-row, 1-column grid

# Basic box plot
# sym='gx' sets the marker used for outliers (green x). Others, such as 'gD', also work.
plt.boxplot(data, notch=False, sym='gx')

# Comparison with a similar scatter plot
plt.subplot(3, 1, 2)  # same as subplot(312): slot 2 of the 3-row, 1-column grid
spread_1 = np.concatenate((spread, flier_high, flier_low), 0)
center_1 = np.ones(70) * 25  # 70 = 50 + 10 + 10 points, all at x = 25
plt.scatter(center_1, spread_1)
plt.xlim([0, 50])  # limit the x range to 0-50

# One more plot that shows the advantage of a scatter plot
plt.subplot(3, 1, 3)  # same as subplot(313): slot 3 of the 3-row, 1-column grid
center_2 = np.random.rand(70) * 50  # spread the points over x in [0, 50)
plt.scatter(center_2, spread_1)
plt.xlim([0, 50])

plt.show()

============================ Python ============================
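To make it clearer which points the box plot draws with the `sym` marker, here is a minimal sketch of the 1.5 × IQR rule. It assumes matplotlib's default `whis=1.5` setting for `boxplot`; the data values and the names `lower_fence` and `upper_fence` are made up for illustration and do not come from the book.

```python
import numpy as np

# Hypothetical data: a tight cluster plus two extreme values.
data = np.array([10, 12, 13, 14, 15, 16, 18, 60, -40], dtype=float)

# First and third quartiles, and the interquartile range.
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Points beyond these fences are drawn as fliers when whis=1.5.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

fliers = data[(data < lower_fence) | (data > upper_fence)]
print(sorted(fliers.tolist()))  # → [-40.0, 60.0]
```

Note that this rule is different from the modified z-score used by `is_outlier`, so the two methods will not always flag the same points.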
