[KMA Contest] Developing a model to measure ship accident risk from sea-surface weather conditions - 9 (TensorFlow logistic regression)

2018. 7. 24. 16:43

I spent a while trying to work out why the TensorFlow logistic regression from the earlier post did not run properly, and in the end the cause was something very minor.

The cost function for the sigmoid hypothesis contains a log, and when the value inside the log becomes 0 the result is inf or nan. I had not realized this and kept wondering why inf and nan were showing up.
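The failure can be reproduced outside TensorFlow. The following is a minimal NumPy sketch, not the code used in this project, and the input values are made up for illustration. In float32 the sigmoid of a very large positive input rounds to exactly 1 and of a very large negative input to exactly 0, so one of the log terms becomes -inf and the product 0 * -inf becomes nan.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical pre-activation values (float32, like the TensorFlow graph)
z = np.array([-200.0, 200.0], dtype=np.float32)
y = np.array([1.0, 1.0], dtype=np.float32)

# Silence the overflow / divide-by-zero warnings so only the values are shown
with np.errstate(over="ignore", divide="ignore", invalid="ignore"):
    h = sigmoid(z)                                           # saturates to [0, 1]
    per_sample = -(y * np.log(h) + (1 - y) * np.log(1 - h))  # [inf, nan]
    cost = per_sample.mean()                                 # nan

print(h, per_sample, cost)
```

Note that the second sample is predicted correctly and still produces nan, because its unused term is 0 * log(0).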

I solved this with TensorFlow's clip_by_value() function.
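To show what the clipping does to the cost, here is a small sketch that uses NumPy's np.clip as a stand-in for tf.clip_by_value. The predictions and labels are made up, and only the 1e-5 bounds are taken from this post.

```python
import numpy as np

EPS = 1e-5  # same bounds as the clip_by_value call in this post

# Made-up saturated predictions (exactly 0 or 1) and labels
h = np.array([0.0, 1.0, 1.0], dtype=np.float32)
y = np.array([1.0, 1.0, 0.0], dtype=np.float32)

# Keep every prediction inside [EPS, 1 - EPS] so both logs stay finite
h_clipped = np.clip(h, EPS, 1 - EPS)
per_sample = -(y * np.log(h_clipped) + (1 - y) * np.log(1 - h_clipped))
cost = per_sample.mean()

print(per_sample, cost)
```

With these bounds every per-sample loss is finite and capped at roughly -log(1e-5), which is about 11.5.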

The full source code I used is below.

============================= Python =============================

import tensorflow as tf  # TensorFlow 1.x API (tf.placeholder, tf.Session)
import pandas as pd
import numpy as np

xy = np.loadtxt("D:/deep1/projectdata/sixx/test.csv", delimiter=",")
df = pd.read_csv("D:/deep1/projectdata/sixx/test.csv", encoding='mbcs')

x_data = xy[:, 10:18]  # 8 feature columns
y_data = xy[:, [-1]]   # label is the last column

column_len = len(df.columns.values.tolist()) - 1  # every column except the last ([-1]) one
column_len = 8  # overridden: only the 8 columns in x_data are used

X = tf.placeholder(tf.float32, shape=[None, column_len])
Y = tf.placeholder(tf.float32, shape=[None, 1])

W = tf.Variable(tf.random_normal([column_len, 1]), name='weight')
b = tf.Variable(tf.random_normal([1]))

hypothesis = tf.sigmoid(tf.matmul(X, W) + b)

# If hypothesis is exactly 0 or 1, the log in the cost function breaks,
# so clip_by_value is used to restrict its range.
hypothesis = tf.clip_by_value(hypothesis, 1e-5, 1 - (1e-5))  # lower and upper bounds

cost = -tf.reduce_mean(Y * tf.log(hypothesis) + (1 - Y) * (tf.log(1 - hypothesis)))

optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)
train = optimizer.minimize(cost)

sess = tf.Session()
sess.run(tf.global_variables_initializer())

for step in range(1001):
    cost_val, W_val, b_val, _ = sess.run([cost, W, b, train], feed_dict={X: x_data, Y: y_data})
    if step % 100 == 0:
        print('step :', step, " , cost_val :", cost_val, ' , W :', W_val, ' , b_val :', b_val)
        #print('step :', step, " , cost_val :", cost_val)

# Classify as 1 when the predicted probability exceeds 0.5, then measure accuracy
predicted = tf.cast(hypothesis > 0.5, dtype=tf.float32)
accuracy = tf.reduce_mean(tf.cast(tf.equal(predicted, Y), dtype=tf.float32))

h, c, a = sess.run([hypothesis, predicted, accuracy], feed_dict={X: x_data, Y: y_data})
#print('\n', h, '\n',c, '\n',a)
print(a)

============================= Python =============================

The result was a cost_val of 1.0823343 and an accuracy of 0.9058798.
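As a rough sanity check on these two numbers, and only as an assumption rather than something verified on the data: if every prediction were saturated at a clip bound, each misclassified sample would contribute about -log(1e-5) to the cost and each correct one almost nothing. The sketch below computes what the cost would be in that hypothetical case from the reported accuracy.

```python
import math

accuracy = 0.9058798       # reported above
reported_cost = 1.0823343  # reported above
eps = 1e-5                 # clip bound used in the code

error_rate = 1.0 - accuracy

# Assumption: wrong predictions sit at one clip bound, correct ones at the other
saturated_cost = error_rate * -math.log(eps) + accuracy * -math.log(1.0 - eps)

print(round(saturated_cost, 4), reported_cost)
```

The estimate lands within about 0.002 of the reported cost, which would be consistent with most predictions being clipped. tf.clip_by_value passes no gradient through clipped elements, so in that situation gradient descent would make little further progress. Whether that is what happened here would have to be confirmed by inspecting the actual hypothesis values.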

cost_val stops changing from step 100 onward, so I think I should load the final W and b into the Variable objects, lower learning_rate, and run it again.

============================= Python =============================

import tensorflow as tf  # TensorFlow 1.x API (tf.placeholder, tf.Session)
import pandas as pd
import numpy as np

xy = np.loadtxt("D:/deep1/projectdata/sixx/test.csv", delimiter=",")
df = pd.read_csv("D:/deep1/projectdata/sixx/test.csv", encoding='mbcs')

x_data = xy[:, 10:18]  # 8 feature columns
y_data = xy[:, [-1]]   # label is the last column

column_len = len(df.columns.values.tolist()) - 1  # every column except the last ([-1]) one
column_len = 8  # overridden: only the 8 columns in x_data are used

# Final W and b from the previous run, used as initial values
K = [[-0.5448839 ], [-0.7040692 ], [-1.5916004 ], [ 0.9268937 ], [-0.56564385],
     [ 1.1852934 ], [ 0.17947902], [-0.31339148]]
L = [1.2286115]

X = tf.placeholder(tf.float32, shape=[None, column_len])
Y = tf.placeholder(tf.float32, shape=[None, 1])

W = tf.Variable(K, name='weight')
b = tf.Variable(L)

hypothesis = tf.sigmoid(tf.matmul(X, W) + b)

# If hypothesis is exactly 0 or 1, the log in the cost function breaks,
# so clip_by_value is used to restrict its range.
hypothesis = tf.clip_by_value(hypothesis, 1e-5, 1 - (1e-5))  # lower and upper bounds

cost = -tf.reduce_mean(Y * tf.log(hypothesis) + (1 - Y) * (tf.log(1 - hypothesis)))

optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.13)  # previous run used 0.01
train = optimizer.minimize(cost)

sess = tf.Session()
sess.run(tf.global_variables_initializer())

for step in range(1001):
    cost_val, W_val, b_val, _ = \
        sess.run([cost, W, b, train], feed_dict={X: x_data, Y: y_data})
    if step % 100 == 0:
        #print('step :', step, " , cost_val :", cost_val, ' , \
        # W :', W_val, ' , b_val :', b_val)
        print('step :', step, " , cost_val :", cost_val)

# Classify as 1 when the predicted probability exceeds 0.5, then measure accuracy
predicted = tf.cast(hypothesis > 0.5, dtype=tf.float32)
accuracy = tf.reduce_mean(tf.cast(tf.equal(predicted, Y), dtype=tf.float32))

h, c, a = sess.run([hypothesis, predicted, accuracy], feed_dict={X: x_data, Y: y_data})
#print('\n', h, '\n',c, '\n',a)
print(a)

============================= Python =============================

After the change, cost_val dropped slightly, but accuracy dropped as well.

The value did not decrease by a meaningful amount, so I don't think this extra run is needed.
