Simple Linear Regression Practice

import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from math import sqrt
%matplotlib inline
matplotlib.style.use('ggplot')

First, we import the tools we need.

from sklearn import datasets
boston_house_prices = datasets.load_boston()
print(boston_house_prices.keys())
print(boston_house_prices.data.shape)
print(boston_house_prices.feature_names)

dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])
(506, 13)
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']

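We load the boston data set provided by sklearn and store it. To inspect the data, we then print its keys, its shape (number of rows and columns), and the column names. Note that load_boston was removed in scikit-learn 1.2, so this code needs an older version of the library.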

print(boston_house_prices.DESCR)

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/


This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.   
     
.. topic:: References

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.

This is the description of the data set we loaded. It also explains what each column means, so it is a useful reference.

data_frame = pd.DataFrame(boston_house_prices.data)
data_frame.tail()

We convert the values stored under data in the loaded data set into a DataFrame.

data_frame.columns = boston_house_prices.feature_names
data_frame.tail()

In the previous result the column names were just numbers; this step replaces them with the column names stored in the data set.

data_frame['Price'] = boston_house_prices.target
data_frame.tail()

We add a column named Price to the data frame, filled with the values stored in target.

data_frame.plot(kind="scatter", x="RM", y="Price", figsize=(6,6),
                color="blue", xlim = (4,8), ylim = (10,45))

<matplotlib.axes._subplots.AxesSubplot at 0x224f11aa438>

We draw a scatter plot of RM (average number of rooms per dwelling) against Price.

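from sklearn import linear_model
# Create the model and fit it with RM as the only independent variable
linear_regression = linear_model.LinearRegression()
linear_regression.fit(X = pd.DataFrame(data_frame["RM"]), y = data_frame["Price"])
# Predict Price for the same RM values that were used for fitting
prediction = linear_regression.predict(X = pd.DataFrame(data_frame["RM"]))
# intercept_ is the constant term a; coef_ holds the slope b
print('a value = ', linear_regression.intercept_)
print('b value =', linear_regression.coef_)

a value =  -34.67062077643857
b value = [9.10210898]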

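Now we train the model. We create a linear regression model, fit it with RM as the independent variable and Price as the dependent variable, and then use the fitted model to predict prices. Finally, we print the intercept (a) and the slope (b) of the regression equation.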
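With a single independent variable, the least-squares intercept and slope can also be written in closed form: b is the sum of cross-deviations divided by the sum of squared x-deviations, and a = mean(y) - b * mean(x). The following is a minimal sketch on made-up data; the arrays x and y and the helper name fit_simple_ols are illustrative assumptions, not part of the Boston data set.

```python
import numpy as np

def fit_simple_ols(x, y):
    """Return (intercept a, slope b) of the least-squares line y = a + b*x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    # Slope: sum of cross-deviations divided by sum of squared x-deviations
    b = (x_centered * (y - y.mean())).sum() / (x_centered ** 2).sum()
    # Intercept: the fitted line passes through the point of means
    a = y.mean() - b * x.mean()
    return a, b

# Hypothetical toy data lying exactly on y = 1 + 2x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])
a, b = fit_simple_ols(x, y)
print('a value = ', a)
print('b value = ', b)
```

Because the toy points lie exactly on a line, the sketch should recover that line's intercept and slope.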

residuals = data_frame["Price"] - prediction
residuals.describe()

count    5.060000e+02
mean     1.899227e-15
std      6.609606e+00
min     -2.334590e+01
25%     -2.547477e+00
50%      8.976267e-02
75%      2.985532e+00
max      3.943314e+01
Name: Price, dtype: float64

We compute the residuals (actual price minus predicted price) so that we can check the goodness of fit.

SSE = (residuals**2).sum()
SST = ((data_frame["Price"]-data_frame["Price"].mean())**2).sum()
R_squared = 1 - (SSE/SST)
print('R_squared = ', R_squared)

R_squared =  0.4835254559913341

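We store SSE (the sum of squared residuals) and SST (the total sum of squares around the mean) and compute the coefficient of determination, R-squared.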
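To get a feel for what R-squared = 1 - SSE/SST measures, it can help to look at the two extreme cases. This is a small sketch with made-up numbers; the helper name r_squared and the array y_true are assumptions for illustration.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SSE / SST."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    sse = ((y_true - y_pred) ** 2).sum()          # unexplained variation
    sst = ((y_true - y_true.mean()) ** 2).sum()   # total variation around the mean
    return 1 - sse / sst

y_true = np.array([1.0, 2.0, 3.0, 4.0])

# Perfect predictions leave no residual, so R^2 is 1
print(r_squared(y_true, y_true))
# Always predicting the mean explains nothing, so R^2 is 0
print(r_squared(y_true, np.full(4, y_true.mean())))
```

A value of about 0.48, as obtained here, sits between these extremes: RM alone accounts for roughly half of the variation in Price.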

data_frame.plot(kind="scatter",x="RM",y="Price",figsize=(6,6),
                color="blue", xlim = (4,8), ylim = (10,45))

# Plot regression line
plt.plot(data_frame["RM"],prediction,color="red")

[<matplotlib.lines.Line2D at 0x224f121e2e8>]

print('score = ', linear_regression.score(X = pd.DataFrame(data_frame["RM"]), y = data_frame["Price"]))
# mean_squared_error expects (y_true, y_pred); the value is the same either way
print('Mean_Squared_Error = ', mean_squared_error(data_frame["Price"], prediction))
print('RMSE = ', mean_squared_error(data_frame["Price"], prediction)**0.5)

score =  0.4835254559913343
Mean_Squared_Error =  43.60055177116956
RMSE =  6.603071389222561

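Using the mean_squared_error function imported at the start, we compute the MSE and RMSE. The score method returns R-squared, which matches the value computed by hand above, and tells us how well the model's predictions perform.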
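The RMSE printed here (about 6.603) is slightly smaller than the std reported by residuals.describe() earlier (about 6.610). The likely explanation is that pandas computes the sample standard deviation, dividing by n - 1, while the MSE divides by n. The sketch below checks this relationship on made-up data; np.polyfit stands in for LinearRegression, and all values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(4, 8, size=50)                   # hypothetical feature values
y = 9.0 * x - 35.0 + rng.normal(0, 6, size=50)   # hypothetical noisy target

# Fit y = a + b*x by least squares (polyfit returns the highest degree first)
b, a = np.polyfit(x, y, deg=1)
residuals = y - (a + b * x)
n = len(residuals)

rmse = np.sqrt((residuals ** 2).mean())   # divides by n
sample_std = residuals.std(ddof=1)        # divides by n - 1, like pandas

print('RMSE       = ', rmse)
print('sample std = ', sample_std)
# Residuals of a fit with an intercept average to zero, so the two
# quantities differ only by the factor sqrt((n - 1) / n)
print('rescaled   = ', sample_std * np.sqrt((n - 1) / n))
```

With n = 506 the factor is very close to 1, which is why the two numbers in this post agree to two decimal places.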

Multiple Linear Regression

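So far we have covered simple linear regression, which analyzes the relationship between one independent variable and one dependent variable. Predicting the dependent variable from a single independent variable is often not enough. In that case we use multiple linear regression, which analyzes the relationship between two or more independent variables and one dependent variable. The procedure is similar to simple linear regression.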

The multiple linear regression model is as follows.
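The equation itself does not appear on this page. As an assumption, the standard form of the model, written with the same a (intercept) and b (coefficient) naming as the printed output in this post, would be:

```latex
y = a + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k + \varepsilon
```

Here x_1, ..., x_k are the k independent variables and \varepsilon is the error term; these symbol names are assumptions chosen for illustration.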

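Using this equation, we estimate the regression coefficients and find the best-fitting regression line. The estimation method is least squares. The goodness-of-fit checks are the same as in the simple case. Let's go straight to the practice.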
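As a rough illustration of least-squares estimation with several independent variables, the sketch below solves the problem directly with np.linalg.lstsq on made-up data. The variable names (X_design, true_a, true_b) and the data are assumptions; this is not how LinearRegression is necessarily implemented internally.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 100
X = rng.normal(size=(n_samples, 3))        # three hypothetical independent variables
true_a = 5.0
true_b = np.array([2.0, -1.0, 0.5])
y = true_a + X @ true_b                    # noise-free target, for an easy check

# Prepend a column of ones so the intercept is estimated with the coefficients
X_design = np.column_stack([np.ones(n_samples), X])

# Least squares: minimise the sum of squared residuals ||y - X_design @ beta||^2
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
a, b = beta[0], beta[1:]
print('a value = ', a)
print('b value = ', b)
```

Since the toy target has no noise, the estimated intercept and coefficients should match the values used to generate it, up to floating-point error.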

To make the comparison easy, we will use the same boston data set as in the simple linear regression.

X = pd.DataFrame(boston_house_prices.data)
X.tail()

X.columns = boston_house_prices.feature_names
X.tail()

Only the variable name differs from the simple linear regression; the steps are the same.

X['Price'] = boston_house_prices.target
y = X.pop('Price')
X.tail()

This time, we remove the Price column from the data frame with pop and store it in a separate variable, y, so that X holds only the independent variables.

linear_regression = linear_model.LinearRegression()
# X is already a DataFrame, so it can be passed to fit and predict directly
linear_regression.fit(X = X, y = y)
prediction = linear_regression.predict(X = X)
print('a value = ', linear_regression.intercept_)
print('b value =', linear_regression.coef_)

a value =  36.459488385089855
b value = [-1.08011358e-01  4.64204584e-02  2.05586264e-02  2.68673382e+00
 -1.77666112e+01  3.80986521e+00  6.92224640e-04 -1.47556685e+00
  3.06049479e-01 -1.23345939e-02 -9.52747232e-01  9.31168327e-03
 -5.24758378e-01]

We train the model on the data and print the intercept and coefficients of the linear regression.

residuals = y-prediction
residuals.describe()

count    5.060000e+02
mean     2.924319e-15
std      4.683822e+00
min     -1.559447e+01
25%     -2.729716e+00
50%     -5.180489e-01
75%      1.777051e+00
max      2.619927e+01
Name: Price, dtype: float64

We compute the residuals to check the goodness of fit.

SSE = (residuals**2).sum()
SST = ((y-y.mean())**2).sum()
R_squared = 1 - (SSE/SST)
print('R_squared = ', R_squared)

R_squared =  0.7406426641094094

print('score = ', linear_regression.score(X = X, y = y))
print('Mean_Squared_Error = ', mean_squared_error(y, prediction))
print('RMSE = ', mean_squared_error(y, prediction)**0.5)

score =  0.7406426641094095
Mean_Squared_Error =  21.894831181729202
RMSE =  4.679191295697281

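That completes the goodness-of-fit check and the performance evaluation. The results show that multiple linear regression scores higher than simple linear regression (R-squared of about 0.74 versus 0.48, and RMSE of about 4.68 versus 6.60). However, more variables do not always mean better predictions. With appropriate preprocessing, and by keeping only the variables that actually influence the target as independent variables, we can expect an even higher score. :)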
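One reason for caution is that the scores in this post are computed on the same data the model was fitted on, and on such data R-squared cannot go down when a column is added, even a meaningless one. A rough sketch with made-up data follows; all names and values are hypothetical.

```python
import numpy as np

def training_r_squared(X, y):
    """Fit least squares with an intercept and return R^2 on the same data."""
    X_design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    residuals = y - X_design @ beta
    return 1 - (residuals ** 2).sum() / ((y - y.mean()) ** 2).sum()

rng = np.random.default_rng(1)
x_useful = rng.normal(size=(200, 1))                 # actually related to y
y = 3.0 * x_useful[:, 0] + rng.normal(size=200)
x_noise = rng.normal(size=(200, 1))                  # unrelated to y

r2_one = training_r_squared(x_useful, y)
r2_two = training_r_squared(np.hstack([x_useful, x_noise]), y)
print('R_squared with 1 variable  = ', r2_one)
print('R_squared with noise added = ', r2_two)
```

The second score is never lower than the first, yet the noise column carries no information about y. Evaluating on held-out data is the usual way to see whether an added variable really helps.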

Output of the tail() calls in the steps above (the ZN and CHAS columns are not shown here).

With the default integer column names:

	0	2	4	5	6	7	8	9	10	11	12
501	0.06263	11.93	0.573	6.593	69.1	2.4786	1.0	273.0	21.0	391.99	9.67
502	0.04527	11.93	0.573	6.120	76.7	2.2875	1.0	273.0	21.0	396.90	9.08
503	0.06076	11.93	0.573	6.976	91.0	2.1675	1.0	273.0	21.0	396.90	5.64
504	0.10959	11.93	0.573	6.794	89.3	2.3889	1.0	273.0	21.0	393.45	6.48
505	0.04741	11.93	0.573	6.030	80.8	2.5050	1.0	273.0	21.0	396.90	7.88

After assigning the feature names:

	CRIM	INDUS	NOX	RM	AGE	DIS	RAD	TAX	PTRATIO	B	LSTAT
501	0.06263	11.93	0.573	6.593	69.1	2.4786	1.0	273.0	21.0	391.99	9.67
502	0.04527	11.93	0.573	6.120	76.7	2.2875	1.0	273.0	21.0	396.90	9.08
503	0.06076	11.93	0.573	6.976	91.0	2.1675	1.0	273.0	21.0	396.90	5.64
504	0.10959	11.93	0.573	6.794	89.3	2.3889	1.0	273.0	21.0	393.45	6.48
505	0.04741	11.93	0.573	6.030	80.8	2.5050	1.0	273.0	21.0	396.90	7.88

After adding the Price column:

	CRIM	INDUS	NOX	RM	AGE	DIS	RAD	TAX	PTRATIO	B	LSTAT	Price
501	0.06263	11.93	0.573	6.593	69.1	2.4786	1.0	273.0	21.0	391.99	9.67	22.4
502	0.04527	11.93	0.573	6.120	76.7	2.2875	1.0	273.0	21.0	396.90	9.08	20.6
503	0.06076	11.93	0.573	6.976	91.0	2.1675	1.0	273.0	21.0	396.90	5.64	23.9
504	0.10959	11.93	0.573	6.794	89.3	2.3889	1.0	273.0	21.0	393.45	6.48	22.0
505	0.04741	11.93	0.573	6.030	80.8	2.5050	1.0	273.0	21.0	396.90	7.88	11.9


Justkeepitsteady

Let's learn about linear regression analysis and savor the code :)
