
Data Analysis with Python (quick note)

1. Importing Datasets

You can download my .ipynb here: 

Formats for dataset: .csv, .json, .xls
Pandas Library: 
- Read datasets into a data frame

# import pandas as pd
import pandas as pd

Read data:

path = "path of the data file (it can be a URL or a local file on the computer)"
df = pd.read_csv(path, header=None)
df.head(n)
df.tail(n)
# check the top n rows / bottom n rows of the dataframe

Add header:

# create headers list
headers = ["abc","efg","xyz"]

# replace headers
df.columns = headers

Drop missing values:

df.dropna(subset=["column_name"], axis=0)  # drop rows where "column_name" is missing

Save Dataset:

df.to_csv("file_name.csv", index = False)

Data Types:
- object
- float
- int
- bool
- datetime64

# print the type of each column
df.dtypes

Describe:
- Get a statistical summary of each column (count, mean, SD)

df.describe()                 # numeric columns only; NaN (Not a Number) values are excluded
df.describe(include="all")    # also include object (categorical) columns

# apply describe for selected columns
df[["column1","column2","column3"]].describe()

Info:
- Provide a concise summary of DataFrame

df.info()

2. Data Wrangling

You can download my .ipynb here: 


Data Pre-processing: converting/mapping data from "raw" form into another format
Data Cleaning / Data Wrangling
Learning Objectives:
- Identify and handle missing values
- Data Formatting
- Data Normalization (centering/scaling)
- Data Binning
- Turning categorical values into numeric variables

Missing values: "?", "N/A", 0 or a blank cell
How to deal with missing data?
- Drop the missing values: variable / data entry
- Replace the missing values with: an average, frequency, based on other functions
- Leave it as missing data

Drop:

df.dropna()
# axis = 0: drops the entire row
# axis = 1: drops the entire column
df.dropna(subset=["column_name"], axis=0, inplace=True)

Replace:

df.replace(missing_value, new_value)
# e.g. replace NaN with the column mean (requires numpy imported as np)
mean = df["column"].mean()
df["column"] = df["column"].replace(np.nan, mean)

Data Formatting: bringing data into a common standard of expression

df["column"] = n / df["column"] # convert data by a formula
# rename columns
df.rename(columns={"column":"another_name"}, inplace = True)

Correcting data types

# identify data type
df.dtypes
# convert data type
df.astype()
df["column"] = df["column"].astype("int")

Data Normalization: convert to the similar value range / similar intrinsic influence on analytical model
3 ways:
- Simple feature scaling: x_new = x_old / x_max
- Min-max: x_new = (x_old - x_min) / (x_max - x_min)
- Z-score (standard score): x_new = (x_old - μ) / σ (μ: mean, σ: standard deviation)

df["col"] = df["col"] / df["col"].max()
df["col"] = (df["col"] - df["col"].min()) / (df["col"].max() - df["col"].min())
df["col"] = (df["col"] - df["col"].mean()) / (df["col"].std())

Binning:
- Binning: Grouping of values into "bins"
- Converts numeric into categorical variables
- Group a set of numerical values into a set of "bins"

bins = np.linspace(min(df["col"]), max(df["col"]), 4)
group_names = ["Low", "Medium", "High"]
df["col-binned"] = pd.cut(df["col"], bins, labels = group_names, include_lowest = True)

Turning categorical variables into quantitative variables in Python
Categorical -> Numeric:
- Add dummy variables for each unique category
- Assign 0 or 1 in each category
One-hot encoding

pd.get_dummies(df['fuel'])
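
If the dummy columns should become part of the data frame, a common pattern is to concatenate them and then drop the original column. A minimal sketch, assuming the 'fuel' column above:

dummies = pd.get_dummies(df['fuel'])
# attach the dummy columns and drop the original categorical column
df = pd.concat([df, dummies], axis=1)
df.drop('fuel', axis=1, inplace=True)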

3. Exploratory Data Analysis (EDA)

You can download my .ipynb here: 

Descriptive Statistics
GroupBy
ANOVA
Correlation
Correlation - Statistics

Descriptive Statistics:
- Describe basic features of data
- Giving short summaries about the sample and measures of the data

df.describe()
data_counts = df["column"].value_counts().to_frame()
data_counts.rename(columns={"column": "value_counts"}, inplace=True)
data_counts.index.name = "column"

Box Plot
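
A box plot shows the median, quartiles and outliers of a numeric variable, optionally split by a categorical variable. A minimal sketch with seaborn, assuming hypothetical column names:

import seaborn as sns
import matplotlib.pyplot as plt

# distribution of a numeric column for each category (hypothetical column names)
sns.boxplot(x="categorical-column", y="numeric-column", data=df)
plt.show()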

Scatter Plot:
- Each observation represented as a point
- Scatter plot shows the relationship between two variables
- Predictor / independent variables on x-axis
- Target / dependent variables on y-axis
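
A minimal scatter plot sketch with matplotlib, assuming hypothetical column names:

import matplotlib.pyplot as plt

# predictor on the x-axis, target on the y-axis (hypothetical column names)
plt.scatter(df["x-axis-column"], df["y-axis-column"])
plt.xlabel("x-axis-column")
plt.ylabel("y-axis-column")
plt.show()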

GroupBy:
- Can be applied on categorical variables
- Group data into categories

df_test = df[['column1', 'column2', 'column3']]
df_grp = df_test.groupby(['column1', 'column2'], as_index=False).mean()
df_grp

GroupBy Pivot()

df_pivot = df_grp.pivot(index = 'column1', columns = 'column2')

Heatmap

plt.pcolor(df_pivot, cmap='RdBu')
plt.colorbar()
plt.show()

Correlation:
- Positive Linear Relationship
- Negative Linear Relationship
- Strong and Weak correlation

sns.regplot(x="x-name", y="y-name", data = df)
plt.ylim(0,)

Pearson Correlation:
Correlation coefficient:

- Close to +1: Large Positive relationship
- Close to -1: Large Negative relationship
- Close to 0: No relationship
P-value
- p < 0.001: strong certainty in the result
- p < 0.05: moderate certainty in the result
- p < 0.1: weak certainty in the result
- p > 0.1: no certainty in the result
Strong Correlation:
- Correlation coefficient close to 1 or -1
- P-value less than 0.001

from scipy import stats
pearson_coef, p_value = stats.pearsonr(df['column1'], df['column2'])

Analysis of Variance (ANOVA)
Statistical comparison of groups
Finding correlation between different groups of a categorical variable
ANOVA:
- F-test score: variation between sample group means divided by variation within sample group
- p-value: confidence degree

F-test:
- Small F: poor correlation between groups
- Large F: strong correlation between groups

# ANOVA between "Honda" and "Subaru"
df_anova = df[["make","price"]]
grouped_anova = df_anova.groupby(["make"])
anova_results_1 = stats.f_oneway(grouped_anova.get_group("honda")["price"], grouped_anova.get_group("subaru")["price"])

4. Model Development

You can download my .ipynb here: 

Simple and Multiple Linear Regression
Model Evaluation using Visualization
Polynomial Regression and Pipelines
R-squared and MSE for In-Sample Evaluation
Prediction and Decision Making

Model:
- Independent variables
- Dependent variables
- Relevant data

Simple Linear Regression
- The predictor (independent): x
- The target (dependent): y
- The intercept: b0
- The slope: b1
- y = b0 + b1x
- Fitting the model means finding b0 and b1

from sklearn.linear_model import LinearRegression
lm = LinearRegression()

X = df[['x-axis-column']]
Y = df['y-axis-column']
lm.fit(X, Y)
Yhat = lm.predict(X)

Multiple Linear Regression (MLR)
- One continuous target (Y) variable
- Two or more predictor (X) variables
- y = b0 + b1x1 + b2x2 + b3x3 + b4x4
- b0: intercept
- bn: coefficient or parameter of xn

Z = df[['column1', 'column2', 'column3', 'column4']]
lm.fit(Z, df['target-column'])
Yhat = lm.predict(Z)

Estimated Linear Model:
- Find the intercept (b0)
- Find the coefficients (b1, b2, b3, b4)
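
After fitting, the estimated parameters can be read from the fitted estimator. A minimal sketch, assuming lm is the LinearRegression object fitted above:

lm.intercept_   # the intercept (b0)
lm.coef_        # the coefficients (b1, b2, b3, b4)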

Model Evaluation using Visualization:
Regression Plot:
- The relationship between two variables
- The strength of correlation
- The direction of the relationship (P or N)
- The scatterplot
- The fitted linear regression line

import seaborn as sns
sns.regplot(x="x-axis-column", y = "y-axis-column", data = df)
plt.ylim(0,)

Residual Plot
- Residuals randomly spread out around the x-axis: a linear model is appropriate for the data
- Residuals not randomly spread out around the x-axis: a nonlinear model may be more appropriate
- Residuals not randomly spread out and their variance appears to change with the x-axis: the linear model may not be appropriate

import seaborn as sns
sns.residplot(x=df['x-axis-column'], y=df['y-axis-column'])

Distribution Plots
- The fitted values that result from the model
- The actual values

import seaborn as sns
ax1 = sns.distplot(df['column'], hist = False, color="r", label="Actual Value")
sns.distplot(Yhat, hist=False, color="b", label="Fitted Values", ax=ax1)

Polynomial Regression
- Quadratic - 2nd order: y = b0 + b1x1 + b2(x1)^2
- Cubic - 3rd order: y = b0 + b1x1 + b2(x1)^2 + b3(x1)^3

f = np.polyfit(x, y, 3)   # fit a 3rd-order polynomial
p = np.poly1d(f)          # build a polynomial object from the coefficients
print(p)

More than one dimension (polynomial features with scikit-learn):

from sklearn.preprocessing import PolynomialFeatures
pr = PolynomialFeatures(degree=2, include_bias=False)
x_poly = pr.fit_transform(x[['column1', 'column2']])

Pre-processing

from sklearn.preprocessing import StandardScaler
SCALE = StandardScaler()
SCALE.fit(x_data[['column1', 'column2']])
x_scale = SCALE.transform(x_data[['column1', 'column2']])

Pipelines

from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

Pipeline Constructor
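
A sketch of the constructor, assuming the imports above and hypothetical step names; the pipeline chains scaling, polynomial features and the linear model into one estimator:

Input = [('scale', StandardScaler()),
         ('polynomial', PolynomialFeatures(degree=2)),
         ('model', LinearRegression())]
pipe = Pipeline(Input)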

Z = df[['column1', 'column2', 'column3', 'column4']]
pipe.fit(Z, df['target-column'])
yhat = pipe.predict(Z)

Measures for In-Sample Evaluation
Numerically determine how good the model fits on dataset
Two important measures to determine the fit of a model:
- Mean Squared Error (MSE)
- R-squared (R^2)

Mean Squared Error (MSE)

from sklearn.metrics import mean_squared_error
mean_squared_error(df['column'], Y_predict_simple_fit)

R-squared / R^2
- The coefficient of Determination or R squared (R^2)
- Determine how close the data is to the fitted regression line
- R^2: the percentage of variation of the target variable (Y) that is explained by the linear model

The value of R^2 is usually between 0 and 1

X = df[['x-axis-column']]
Y = df['y-axis-column']
lm.fit(X, Y)
lm.score(X, Y)   # returns the R^2 value

Prediction and Decision Making
Determining a Good Model Fit:

- Do the predicted values make sense
- Visualization
- Numerical measures for evaluation
- Comparing Models

# Train the model
lm.fit(df[['x-axis-column']], df['y-axis-column'])
# Inspect the learned coefficient(s)
lm.coef_
# Predict new values over a range of inputs
import numpy as np
new_input = np.arange(1, 101, 1).reshape(-1, 1)
yhat = lm.predict(new_input)

Numerical measures of Evaluation
Compare MLR and SLR:
- MSE for the MLR model will be smaller than MSE for the SLR model, because adding more predictors reduces the in-sample error
- Polynomial regression will also tend to have a smaller MSE
- The relationship is reversed for R^2: the model with the smaller MSE will have the higher R^2
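
A sketch of that comparison, assuming hypothetical column names and that both models are fitted on the same data frame:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

slr = LinearRegression().fit(df[['column1']], df['target-column'])
mlr = LinearRegression().fit(df[['column1', 'column2', 'column3']], df['target-column'])

# in-sample MSE: expected to be smaller for the MLR model
print(mean_squared_error(df['target-column'], slr.predict(df[['column1']])))
print(mean_squared_error(df['target-column'], mlr.predict(df[['column1', 'column2', 'column3']])))

# in-sample R^2: expected to be higher for the MLR model
print(slr.score(df[['column1']], df['target-column']))
print(mlr.score(df[['column1', 'column2', 'column3']], df['target-column']))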

5. Model Evaluation and Refinement

You can download my .ipynb here: 

Model Evaluation:
In-Sample evaluation: how well our model will fit the data used to train it
Out-of-sample evaluation: how well the model performs on a test set of data it has not seen
Split data set into:
- training set: 70% - build and train the model
- testing set: 30% - assess the performance of a predictive model

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size = 0.3, random_state = 0)

x_data: features or independent variables
y_data: dataset target
x_train, y_train: parts of available data as training set
x_test, y_test: parts of available data as testing set
test_size = 30/100 = 0.3
random_state: seed for the random number generator used for random sampling

Generalization Performance:
Generalization error: a measure of how well our model does at predicting previously unseen data
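
A sketch of measuring that out-of-sample performance on the held-out test set created above, assuming a linear regression model:

from sklearn.linear_model import LinearRegression

lm = LinearRegression()
lm.fit(x_train, y_train)
# R^2 on unseen data as an estimate of generalization performance
print(lm.score(x_test, y_test))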

