
Data Analysis with Python (quick note)

1. Importing Datasets

You can download my .ipynb here: 

Formats for dataset: .csv, .json, .xls
Pandas Library: 
- Read datasets into a data frame

# import pandas as pd
import pandas as pd

Read data:

path = "path of the data file (it can be a URL or a local file on the computer)"
df = pd.read_csv(path, header=None)
df.head(n)
df.tail(n)
# check the top n rows / bottom n rows of the dataframe

Add header:

# create headers list
headers = ["abc","efg","xyz"]

# replace headers
df.columns = headers

Drop missing values:

df.dropna(subset=["column_name"], axis=0)  # drop rows where "column_name" is missing

Save Dataset:

df.to_csv("file_name.csv", index = False)

Data Types:
- object
- float
- int
- bool
- datetime64

# print the type of each column
df.dtypes

Describe:
- Get a statistical summary of each column (count, mean, SD)

df.describe()                 # numeric columns only; NaN (Not a Number) values are excluded
df.describe(include="all")    # also include object (categorical) columns

# apply describe for selected columns
df[["column1","column2","column3"]].describe()

Info:
- Provide a concise summary of DataFrame

df.info()

2. Data Wrangling

You can download my .ipynb here: 


Data Pre-processing: converting/mapping data from "raw" form into another format
Data Cleaning / Data Wrangling
Learning Objectives:
- Identify and handle missing values
- Data Formatting
- Data Normalization (centering/scaling)
- Data Binning
- Turning categorical values into numeric variables

Missing values: "?", "N/A", 0 or a blank cell
How to deal with missing data?
- Drop the missing values: variable / data entry
- Replace the missing values with: an average, frequency, based on other functions
- Leave it as missing data

Drop:

df.dropna()
# axis = 0: drops the entire row
# axis = 1: drops the entire column
df.dropna(subset=["column_name"], axis=0, inplace=True)

Replace:

df.replace(missing_value, new_value)
# e.g. replace NaN with the column mean (requires numpy imported as np)
mean = df["column"].mean()
df["column"] = df["column"].replace(np.nan, mean)

Data Formatting: bringing data into a common standard of expression

df["column"] = n / df["column"] # convert data by a formula
# rename columns
df.rename(columns={"column":"another_name"}, inplace = True)

Correcting data types

# identify data type
df.dtypes
# convert data type
df.astype()
df["column"] = df["column"].astype("int")

Data Normalization: convert to the similar value range / similar intrinsic influence on analytical model
3 ways:
- Simple feature scaling: x_new = x_old / x_max
- Min-max: x_new = (x_old - x_min) / (x_max - x_min)
- Z-score (standard score): x_new = (x_old - μ) / σ (μ: mean, σ: standard deviation)

df["col"] = df["col"] / df["col"].max()
df["col"] = (df["col"] - df["col"].min()) / (df["col"].max() - df["col"].min())
df["col"] = (df["col"] - df["col"].mean()) / (df["col"].std())

Binning:
- Binning: Grouping of values into "bins"
- Converts numeric into categorical variables
- Group a set of numerical values into a set of "bins"

bins = np.linspace(min(df["col"]), max(df["col"]), 4)
group_names = ["Low", "Medium", "High"]
df["col-binned"] = pd.cut(df["col"], bins, labels = group_names, include_lowest = True)

Turning categorical variables into quantitative variables in Python
Categorical -> Numeric:
- Add dummy variables for each unique category
- Assign 0 or 1 in each category
One-hot encoding

pd.get_dummies(df['fuel'])
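
If the dummy columns should become part of the data frame, a common pattern is to concatenate them and then drop the original column. A minimal sketch, assuming the 'fuel' column above:

dummies = pd.get_dummies(df['fuel'])
# attach the dummy columns and drop the original categorical column
df = pd.concat([df, dummies], axis=1)
df.drop('fuel', axis=1, inplace=True)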

3. Exploratory Data Analysis (EDA)

You can download my .ipynb here: 

Descriptive Statistics
GroupBy
ANOVA
Correlation
Correlation - Statistics

Descriptive Statistics:
- Describe basic features of data
- Giving short summaries about the sample and measures of the data

df.describe()
data_counts = df["column"].value_counts().to_frame()
data_counts.rename(columns={"column": "value_counts"}, inplace=True)
data_counts.index.name = "column"

Box Plot
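
A box plot shows the median, quartiles and outliers of a numeric variable, optionally split by a categorical variable. A minimal sketch with seaborn, assuming hypothetical column names:

import seaborn as sns
import matplotlib.pyplot as plt

# distribution of a numeric column for each category (hypothetical column names)
sns.boxplot(x="categorical-column", y="numeric-column", data=df)
plt.show()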

Scatter Plot:
- Each observation represented as a point
- Scatter plot shows the relationship between two variables
- Predictor / independent variables on x-axis
- Target / dependent variables on y-axis
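
A minimal scatter plot sketch with matplotlib, assuming hypothetical column names:

import matplotlib.pyplot as plt

# predictor on the x-axis, target on the y-axis (hypothetical column names)
plt.scatter(df["x-axis-column"], df["y-axis-column"])
plt.xlabel("x-axis-column")
plt.ylabel("y-axis-column")
plt.show()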

GroupBy:
- Can be applied on categorical variables
- Group data into categories

df_test = df[['column1', 'column2', 'column3']]
df_grp = df_test.groupby(['column1', 'column2'], as_index=False).mean()
df_grp

GroupBy Pivot()

df_pivot = df_grp.pivot(index = 'column1', columns = 'column2')

Heatmap

plt.pcolor(df_pivot, cmap='RdBu')
plt.colorbar()
plt.show()

Correlation:
- Positive Linear Relationship
- Negative Linear Relationship
- Strong and Weak correlation

sns.regplot(x="x-name", y="y-name", data = df)
plt.ylim(0,)

Pearson Correlation:
Correlation coefficient:

- Close to +1: Large Positive relationship
- Close to -1: Large Negative relationship
- Close to 0: No relationship
P-value
- p < 0.001: strong certainty in the result
- p < 0.05: moderate certainty in the result
- p < 0.1: weak certainty in the result
- p > 0.1: no certainty in the result
Strong Correlation:
- Correlation coefficient close to 1 or -1
- P-value less than 0.001

from scipy import stats
pearson_coef, p_value = stats.pearsonr(df['column1'], df['column2'])

Analysis of Variance (ANOVA)
Statistical comparison of groups
Finding correlation between different groups of a categorical variable
ANOVA:
- F-test score: variation between sample group means divided by variation within sample group
- p-value: confidence degree

F-test:
- Small F: poor correlation between groups
- Large F: strong correlation between groups

# ANOVA between "Honda" and "Subaru"
df_anova = df[["make","price"]]
grouped_anova = df_anova.groupby(["make"])
anova_results_1 = stats.f_oneway(grouped_anova.get_group("honda")["price"], grouped_anova.get_group("subaru")["price"])

4. Model Development

You can download my .ipynb here: 

Simple and Multiple Linear Regression
Model Evaluation using Visualization
Polynomial Regression and Pipelines
R-squared and MSE for In-Sample Evaluation
Prediction and Decision Making

Model:
- Independent variables
- Dependent variables
- Relevant data

Simple Linear Regression
- The predictor (independent): x
- The target (dependent): y
- The intercept: b0
- The slope: b1
- y = b0 + b1x
- Fitting the model means finding b0 and b1

from sklearn.linear_model import LinearRegression
lm = LinearRegression()

X = df[['x-axis-column']]
Y = df['y-axis-column']
lm.fit(X, Y)
Yhat = lm.predict(X)

Multiple Linear Regression (MLR)
- One continuous target (Y) variable
- Two or more predictor (X) variables
- y = b0 + b1x1 + b2x2 + b3x3 + b4x4
- b0: intercept
- bn: coefficient or parameter of xn

Z = df[['column1', 'column2', 'column3', 'column4']]
lm.fit(Z, df['target-column'])
Yhat = lm.predict(Z)

Estimated Linear Model:
- Find the intercept (b0)
- Find the coefficients (b1, b2, b3, b4)
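
After fitting, the estimated parameters can be read from the fitted estimator. A minimal sketch, assuming lm is the LinearRegression object fitted above:

lm.intercept_   # the intercept (b0)
lm.coef_        # the coefficients (b1, b2, b3, b4)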

Model Evaluation using Visualization:
Regression Plot:
- The relationship between two variables
- The strength of correlation
- The direction of the relationship (P or N)
- The scatterplot
- The fitted linear regression line

import seaborn as sns
sns.regplot(x="x-axis-column", y = "y-axis-column", data = df)
plt.ylim(0,)

Residual Plot
- Residuals randomly spread out around the x-axis: a linear model is appropriate for the data
- Residuals not randomly spread out around the x-axis: a nonlinear model may be more appropriate
- Residuals not randomly spread out and their variance appears to change with the x-axis: the linear model may not be appropriate

import seaborn as sns
sns.residplot(x=df['x-axis-column'], y=df['y-axis-column'])

Distribution Plots
- The fitted values that result from the model
- The actual values

import seaborn as sns
ax1 = sns.distplot(df['column'], hist = False, color="r", label="Actual Value")
sns.distplot(Yhat, hist=False, color="b", label="Fitted Values", ax=ax1)

Polynomial Regression
- Quadratic - 2nd order: y = b0 + b1x1 + b2(x1)^2
- Cubic - 3rd order: y = b0 + b1x1 + b2(x1)^2 + b3(x1)^3

f = np.polyfit(x, y, 3)   # fit a 3rd-order polynomial
p = np.poly1d(f)          # build a polynomial object from the coefficients
print(p)

More than one dimension (polynomial features with scikit-learn):

from sklearn.preprocessing import PolynomialFeatures
pr = PolynomialFeatures(degree=2, include_bias=False)
x_poly = pr.fit_transform(x[['column1', 'column2']])

Pre-processing

from sklearn.preprocessing import StandardScaler
SCALE = StandardScaler()
SCALE.fit(x_data[['column1', 'column2']])
x_scale = SCALE.transform(x_data[['column1', 'column2']])

Pipelines

from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

Pipeline Constructor
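
A sketch of the constructor, assuming the imports above and hypothetical step names; the pipeline chains scaling, polynomial features and the linear model into one estimator:

Input = [('scale', StandardScaler()),
         ('polynomial', PolynomialFeatures(degree=2)),
         ('model', LinearRegression())]
pipe = Pipeline(Input)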

Z = df[['column1', 'column2', 'column3', 'column4']]
pipe.fit(Z, df['target-column'])
yhat = pipe.predict(Z)

Measures for In-Sample Evaluation
Numerically determine how good the model fits on dataset
Two important measures to determine the fit of a model:
- Mean Squared Error (MSE)
- R-squared (R^2)

Mean Squared Error (MSE)

from sklearn.metrics import mean_squared_error
mean_squared_error(df['column'], Y_predict_simple_fit)

R-squared / R^2
- The coefficient of Determination or R squared (R^2)
- Determine how close the data is to the fitted regression line
- R^2: the percentage of variation of the target variable (Y) that is explained by the linear model

The value of R^2 is usually between 0 and 1

X = df[['x-axis-column']]
Y = df['y-axis-column']
lm.fit(X, Y)
lm.score(X, Y)   # returns the R^2 value

Prediction and Decision Making
Determining a Good Model Fit:

- Do the predicted values make sense
- Visualization
- Numerical measures for evaluation
- Comparing Models

# Train the model
lm.fit(df[['x-axis-column']], df['y-axis-column'])
# Inspect the learned coefficient(s)
lm.coef_
# Predict new values over a range of inputs
import numpy as np
new_input = np.arange(1, 101, 1).reshape(-1, 1)
yhat = lm.predict(new_input)

Numerical measures of Evaluation
Compare MLR and SLR:
- MSE for the MLR model will be smaller than MSE for the SLR model, because adding more predictors reduces the in-sample error
- Polynomial regression will also tend to have a smaller MSE
- The relationship is reversed for R^2: the model with the smaller MSE will have the higher R^2
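
A sketch of that comparison, assuming hypothetical column names and that both models are fitted on the same data frame:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

slr = LinearRegression().fit(df[['column1']], df['target-column'])
mlr = LinearRegression().fit(df[['column1', 'column2', 'column3']], df['target-column'])

# in-sample MSE: expected to be smaller for the MLR model
print(mean_squared_error(df['target-column'], slr.predict(df[['column1']])))
print(mean_squared_error(df['target-column'], mlr.predict(df[['column1', 'column2', 'column3']])))

# in-sample R^2: expected to be higher for the MLR model
print(slr.score(df[['column1']], df['target-column']))
print(mlr.score(df[['column1', 'column2', 'column3']], df['target-column']))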

5. Model Evaluation and Refinement

You can download my .ipynb here: 

Model Evaluation:
In-Sample evaluation: how well our model will fit the data used to train it
Out-of-sample evaluation: how well the model performs on a test set of data it has not seen
Split data set into:
- training set: 70% - build and train the model
- testing set: 30% - assess the performance of a predictive model

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size = 0.3, random_state = 0)

x_data: features or independent variables
y_data: dataset target
x_train, y_train: parts of available data as training set
x_test, y_test: parts of available data as testing set
test_size = 30/100 = 0.3
random_state: seed for the random number generator used for random sampling

Generalization Performance:
Generalization error: a measure of how well our model does at predicting previously unseen data
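
A sketch of measuring that out-of-sample performance on the held-out test set created above, assuming a linear regression model:

from sklearn.linear_model import LinearRegression

lm = LinearRegression()
lm.fit(x_train, y_train)
# R^2 on unseen data as an estimate of generalization performance
print(lm.score(x_test, y_test))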

