Trying Data Science (4): Cramming Pandas - 5 Miscellaneous Skills
Hello, everyone!
Today I'm going to continue with Pandas, following the last post: 5 miscellaneous skills.
First of all, let's import NumPy and pandas.
import numpy as np
import pandas as pd
1. How to deal with missing data
In
df = pd.DataFrame({'A': [1, 2, np.nan],
                   'B': [5, np.nan, np.nan],
                   'C': [1, 2, 3]})
df
Out
A B C
0 1.0 5.0 1
1 2.0 NaN 2
2 NaN NaN 3
You can see the missing data shown as 'NaN'.
There are three main ways to handle it.
(1) Remove rows
In
df.dropna()
Out
A B C
0 1.0 5.0 1
(2) Remove columns
In
df.dropna(axis=1)
Out
C
0 1
1 2
2 3
(3) Fill missing values with different ones
In
df['A'].fillna(value=df['A'].mean())
Out
0 1.0
1 2.0
2 1.5
Name: A, dtype: float64
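As a side note (not part of the lesson above), dropna() also takes a thresh parameter that keeps rows with at least that many non-missing values, and fillna() can fill the whole DataFrame at once. A minimal sketch using the same df:
# Keep rows that have at least 2 non-NaN values (rows 0 and 1 here)
df.dropna(thresh=2)
# Fill every missing value in the DataFrame with 0
df.fillna(value=0)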
2. Grouping
The groupby method allows you to group rows of data together.
You can save the grouped object as a new variable.
In
import pandas as pd
# Create dataframe
data = {'Company': ['GOOG', 'GOOG', 'MSFT', 'MSFT'],
        'Person': ['Sam', 'Charlie', 'Amy', 'Vanessa'],
        'Sales': [200, 120, 340, 124]}
df = pd.DataFrame(data)
df
by_comp = df.groupby("Company")
by_comp.mean()
Out
         Sales
Company
GOOG     160.0
MSFT     232.0
Another useful method is describe().
It shows general statistical information for each group.
In
by_comp.describe()
Out
                   Sales
Company
GOOG    count    2.000000
        mean   160.000000
        std     56.568542
        min    120.000000
        25%    140.000000
        50%    160.000000
        75%    180.000000
        max    200.000000
MSFT    count    2.000000
        mean   232.000000
        std    152.735065
        min    124.000000
        25%    178.000000
        50%    232.000000
        75%    286.000000
        max    340.000000
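The grouped object supports other aggregations besides mean(); here is a quick sketch (not in the original walkthrough) using the same by_comp object:
# Other common aggregations on the grouped Sales column
by_comp['Sales'].sum()     # total sales per company
by_comp['Sales'].count()   # number of rows per company
by_comp['Sales'].std()     # standard deviation per company
# describe() can also be transposed for a wider layout
by_comp.describe().transpose()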
3. Combining DataFrames
Let's set three data frames.
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])
df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']},
                   index=[4, 5, 6, 7])
df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
                    'B': ['B8', 'B9', 'B10', 'B11'],
                    'C': ['C8', 'C9', 'C10', 'C11'],
                    'D': ['D8', 'D9', 'D10', 'D11']},
                   index=[8, 9, 10, 11])
1) Concatenating
Concatenation basically glues DataFrames together.
In
# Note axis=1: the frames are glued along the column direction (side by side)
pd.concat([df1,df2,df3],axis=1)
Out
A B C D A B C D A B C D
0 A0 B0 C0 D0 NaN NaN NaN NaN NaN NaN NaN NaN
1 A1 B1 C1 D1 NaN NaN NaN NaN NaN NaN NaN NaN
2 A2 B2 C2 D2 NaN NaN NaN NaN NaN NaN NaN NaN
3 A3 B3 C3 D3 NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN A4 B4 C4 D4 NaN NaN NaN NaN
5 NaN NaN NaN NaN A5 B5 C5 D5 NaN NaN NaN NaN
6 NaN NaN NaN NaN A6 B6 C6 D6 NaN NaN NaN NaN
7 NaN NaN NaN NaN A7 B7 C7 D7 NaN NaN NaN NaN
8 NaN NaN NaN NaN NaN NaN NaN NaN A8 B8 C8 D8
9 NaN NaN NaN NaN NaN NaN NaN NaN A9 B9 C9 D9
10 NaN NaN NaN NaN NaN NaN NaN NaN A10 B10 C10 D10
11 NaN NaN NaN NaN NaN NaN NaN NaN A11 B11 C11 D11
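For comparison (a sketch, not shown in the original output), concatenating along the default axis (axis=0) stacks the frames vertically; since df1, df2, and df3 share the columns A-D, the result is a single 12-row frame:
# Default axis=0 glues the frames row-wise
pd.concat([df1, df2, df3])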
2) Merging
The merge function allows you to merge DataFrames using logic similar to joining SQL tables.
In
left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})
pd.merge(left,right,how='inner',on='key')
Out
A B key C D
0 A0 B0 K0 C0 D0
1 A1 B1 K1 C1 D1
2 A2 B2 K2 C2 D2
3 A3 B3 K3 C3 D3
You can see that 'key' is the column on which the two data frames are merged.
So, what is the difference between 'inner' and 'outer'?
An 'inner' join keeps only the rows whose keys appear in both DataFrames, while an 'outer' join keeps every row from both sides and fills the unmatched columns with NaN. This makes the 'outer' join more comprehensive, since no rows are lost.
For the example below, the left and right frames are redefined with two key columns (key1 and key2), matching the output that follows.
In
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})
pd.merge(left, right, how='outer', on=['key1', 'key2'])
Out
     A    B key1 key2    C    D
0   A0   B0   K0   K0   C0   D0
1   A1   B1   K0   K1  NaN  NaN
2   A2   B2   K1   K0   C1   D1
3   A2   B2   K1   K0   C2   D2
4   A3   B3   K2   K1  NaN  NaN
5  NaN  NaN   K2   K0   C3   D3
Even the rows without a match are kept in the merged result, with NaN filling in the missing values.
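Besides 'inner' and 'outer', merge() also accepts how='left' and how='right', which keep all keys from one side only. A minimal sketch using the same multi-key frames:
# Keep every key combination from the left frame; unmatched right-hand columns become NaN
pd.merge(left, right, how='left', on=['key1', 'key2'])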
3) Joining
Joining is a convenient method for combining the columns of two differently indexed DataFrames into a single result DataFrame.
In
left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                     'B': ['B0', 'B1', 'B2']},
                    index=['K0', 'K1', 'K2'])
right = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
                      'D': ['D0', 'D2', 'D3']},
                     index=['K0', 'K2', 'K3'])
left.join(right)
Out
A B C D
K0 A0 B0 C0 D0
K1 A1 B1 NaN NaN
K2 A2 B2 C2 D2
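join() also accepts a how argument; for example, how='outer' keeps the union of both indices (a sketch, assuming the same left and right frames):
# Indices K0, K1, K2, K3 are all kept; unmatched cells become NaN
left.join(right, how='outer')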
4. Operations
1) head()
In
df = pd.DataFrame({'col1':[1,2,3,4],'col2':[444,555,666,444],'col3':['abc','def','ghi','xyz']})
df.head()
Out
col1 col2 col3
0 1 444 abc
1 2 555 def
2 3 666 ghi
3 4 444 xyz
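Two related methods worth knowing (not covered in the original list) are tail() and info():
df.tail(2)   # last two rows of the DataFrame
df.info()    # column dtypes and non-null counts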
2) value_counts(): unique values and their counts
In
df['col2'].value_counts()
Out
444 2
555 1
666 1
Name: col2, dtype: int64
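Closely related are unique() and nunique(), included here as a small extra sketch:
df['col2'].unique()    # distinct values: array([444, 555, 666])
df['col2'].nunique()   # number of distinct values: 3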
3) Applying Functions
The apply() method applies a given function to each element of a Series (or along an axis of a DataFrame) and returns the result. Typically, you pass a lambda or a user-defined function as the argument to apply().
In
df['col1'].apply(lambda x: x*2)
Out
0 2
1 4
2 6
3 8
Name: col1, dtype: int64
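apply() works just as well with built-in and user-defined functions; a short sketch assuming the same df:
# Length of each string in col3
df['col3'].apply(len)

# A user-defined function is passed the same way
def times2(x):
    return x * 2

df['col1'].apply(times2)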
5. Data Input and Output
Pandas can read a variety of file types using its pd.read_* functions.
1) Excel/CSV
Pandas can read and write Excel/CSV files.
Note, however, that formulas and images cannot be read from an Excel file; only the cell values are imported.
Input
In
pd.read_excel('Excel_Sample.xlsx', sheet_name='Sheet1')
Out
    a   b
0   0   1
1   4   5
2   8   9
3  12  13
Output
df.to_excel('Excel_Sample.xlsx',sheet_name='Sheet1')
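CSV works the same way through pd.read_csv and df.to_csv; the file names below are just placeholders:
# 'example.csv' and 'example_out.csv' are placeholder file names
csv_df = pd.read_csv('example.csv')
csv_df.to_csv('example_out.csv', index=False)   # index=False drops the index column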
2) Html
The pandas read_html function reads tables from a webpage and returns a list of DataFrame objects.
I have tried it on the Federal Deposit Insurance Corporation (FDIC) webpage.
df = pd.read_html('http://www.fdic.gov/bank/individual/failed/banklist.html')
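Since read_html returns a list, the first table on the page is accessed by index; a minimal sketch:
# df is a list of DataFrames; take the first table and preview it
df[0].head()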
Conclusion
In summary, pandas provides powerful tools for manipulating and analyzing data. By handling missing data, grouping data, combining DataFrames, and reading data from different file types, you can work with and analyze data sets efficiently.
An engineer-first company: CRE-CO Inc. (株式会社CRE-CO)
Son