
Trying Data Science(4) : Cramming Pandas - 5 miscellaneous skills

Hello, everyone!

Today, following up on the last post, I'm going to cover five miscellaneous Pandas skills.

First of all, let's import NumPy and pandas.

import numpy as np
import pandas as pd


1. How to deal with missing data

In


df = pd.DataFrame({'A':[1,2,np.nan],
                  'B':[5,np.nan,np.nan],
                  'C':[1,2,3]})
df

Out

	A	B	C
0	1.0	5.0	1
1	2.0	NaN	2
2	NaN	NaN	3

You can see the missing data, 'NaN'.

There are three main ways to handle it.

(1) Remove rows
In

df.dropna()

Out

	A	B	C
0	1.0	5.0	1


(2) Remove columns
In

df.dropna(axis=1)

Out


    C
0	1
1	2
2	3


(3) Fill missing values with different ones

In

df['A'].fillna(value=df['A'].mean())

Out

0    1.0
1    2.0
2    1.5
Name: A, dtype: float64
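As a small sketch building on the same df, fillna can also fill the whole DataFrame at once, and dropna's thresh parameter keeps only rows with a minimum number of non-missing values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan],
                   'B': [5, np.nan, np.nan],
                   'C': [1, 2, 3]})

# Fill every missing cell with a constant value
filled = df.fillna(value=0)
print(filled)

# Keep only rows that have at least two non-missing values
print(df.dropna(thresh=2))
```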


2. Grouping
The groupby method lets you group rows of data together.
You can save the grouped object as a new variable.

In

import pandas as pd
# Create dataframe
data = {'Company':['GOOG','GOOG','MSFT','MSFT'],
       'Person':['Sam','Charlie','Amy','Vanessa'],
       'Sales':[200,120,340,124]}
df = pd.DataFrame(data)
df

by_comp = df.groupby("Company")
# numeric_only=True skips the non-numeric 'Person' column
by_comp.mean(numeric_only=True)

Out 

	    Sales
Company	
GOOG	160.0
MSFT	232.0


Here is a useful function called 'describe()'.
It shows summary statistics for each group.

In

by_comp.describe()

Out

	        Sales
Company		
GOOG count	2.000000
	 mean	160.000000
	 std	56.568542
	 min	120.000000
	 25%	140.000000
	 50%	160.000000
	 75%	180.000000
	 max	200.000000
MSFT count	2.000000
	 mean	232.000000
	 std	152.735065
	 min	124.000000
	 25%	178.000000
	 50%	232.000000
	 75%	286.000000
	 max	340.000000
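Other aggregations work on a grouped object in the same way as mean(). As a quick sketch with the same data, selecting the 'Sales' column first keeps the result to the numeric column:

```python
import pandas as pd

data = {'Company': ['GOOG', 'GOOG', 'MSFT', 'MSFT'],
        'Person': ['Sam', 'Charlie', 'Amy', 'Vanessa'],
        'Sales': [200, 120, 340, 124]}
df = pd.DataFrame(data)
by_comp = df.groupby('Company')

# sum, max, count, etc. all follow the same pattern as mean()
print(by_comp['Sales'].sum())    # total sales per company
print(by_comp['Sales'].max())    # best single sale per company
print(by_comp['Sales'].count())  # number of rows per company
```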


3. Combining DataFrames

Let's set three data frames.


df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                        'B': ['B0', 'B1', 'B2', 'B3'],
                        'C': ['C0', 'C1', 'C2', 'C3'],
                        'D': ['D0', 'D1', 'D2', 'D3']},
                        index=[0, 1, 2, 3])
df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                        'B': ['B4', 'B5', 'B6', 'B7'],
                        'C': ['C4', 'C5', 'C6', 'C7'],
                        'D': ['D4', 'D5', 'D6', 'D7']},
                         index=[4, 5, 6, 7]) 
df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
                        'B': ['B8', 'B9', 'B10', 'B11'],
                        'C': ['C8', 'C9', 'C10', 'C11'],
                        'D': ['D8', 'D9', 'D10', 'D11']},
                        index=[8, 9, 10, 11])

1) Concatenating
Concatenation basically glues DataFrames together.
In

# Note 'axis=1'. It means the column direction.
pd.concat([df1,df2,df3],axis=1)

Out

    A	B	C	D	A	B	C	D	A	B	C	D
0	A0	B0	C0	D0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	A1	B1	C1	D1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	A2	B2	C2	D2	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	A3	B3	C3	D3	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	NaN	NaN	NaN	NaN	A4	B4	C4	D4	NaN	NaN	NaN	NaN
5	NaN	NaN	NaN	NaN	A5	B5	C5	D5	NaN	NaN	NaN	NaN
6	NaN	NaN	NaN	NaN	A6	B6	C6	D6	NaN	NaN	NaN	NaN
7	NaN	NaN	NaN	NaN	A7	B7	C7	D7	NaN	NaN	NaN	NaN
8	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	A8	B8	C8	D8
9	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	A9	B9	C9	D9
10	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	A10	B10	C10	D10
11	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	A11	B11	C11	D11
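For comparison, with axis=0 (the default) concat stacks DataFrames vertically instead, so no NaNs appear when the columns match. A minimal sketch with two small frames:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
df2 = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']}, index=[2, 3])

# axis=0 (the default) glues DataFrames together row-wise
stacked = pd.concat([df1, df2])
print(stacked)
```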


2) Merging
The merge function allows you to merge DataFrames using a similar logic as merging SQL tables together. 

In

left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
   
right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                          'C': ['C0', 'C1', 'C2', 'C3'],
                          'D': ['D0', 'D1', 'D2', 'D3']})    

pd.merge(left,right,how='inner',on='key')

Out

	A	B	key	C	D
0	A0	B0	K0	C0	D0
1	A1	B1	K1	C1	D1
2	A2	B2	K2	C2	D2
3	A3	B3	K3	C3	D3

You can see that 'key' is the column on which the two data frames are merged.

So, what is the difference between 'inner' and 'outer'?
An 'inner' join keeps only the rows whose keys appear in both DataFrames. An 'outer' join also keeps the rows that do not match on the common key, filling the gaps with NaN, so it performs a more comprehensive merge without losing data.

In

# These DataFrames use two key columns, 'key1' and 'key2'
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})

right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})

pd.merge(left, right, how='outer', on=['key1', 'key2'])

Out

	A	B	key1	key2	C	D
0	A0	B0	K0	K0	C0	D0
1	A1	B1	K0	K1	NaN	NaN     # unmatched rows are kept, with missing values
2	A2	B2	K1	K0	C1	D1
3	A2	B2	K1	K0	C2	D2
4	A3	B3	K2	K1	NaN	NaN     # unmatched rows are kept, with missing values
5	NaN	NaN	K2	K0	C3	D3      # unmatched rows are kept, with missing values

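The same contrast shows up with a single key. A minimal sketch with partially overlapping keys (these two small frames are made up for illustration):

```python
import pandas as pd

left = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
                     'A': ['A0', 'A1', 'A2']})
right = pd.DataFrame({'key': ['K1', 'K2', 'K3'],
                      'C': ['C1', 'C2', 'C3']})

# 'inner' keeps only K1 and K2, the keys present in both frames
inner = pd.merge(left, right, how='inner', on='key')
# 'outer' keeps all four keys, filling NaN where a side has no match
outer = pd.merge(left, right, how='outer', on='key')
print(inner)
print(outer)
```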

3) Joining
Joining is a convenient method for combining the columns of two differently indexed DataFrames into a single result DataFrame.

In

left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                     'B': ['B0', 'B1', 'B2']},
                      index=['K0', 'K1', 'K2']) 

right = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
                    'D': ['D0', 'D2', 'D3']},
                      index=['K0', 'K2', 'K3'])
left.join(right)

Out

	A	B	C	D
K0	A0	B0	C0	D0
K1	A1	B1	NaN	NaN
K2	A2	B2	C2	D2
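join also accepts a how parameter, just like merge. A short sketch with the same kind of differently indexed frames:

```python
import pandas as pd

left = pd.DataFrame({'A': ['A0', 'A1', 'A2']}, index=['K0', 'K1', 'K2'])
right = pd.DataFrame({'C': ['C0', 'C2', 'C3']}, index=['K0', 'K2', 'K3'])

# 'outer' keeps every index label from both frames
outer = left.join(right, how='outer')
# 'inner' keeps only the shared labels
inner = left.join(right, how='inner')
print(outer)
print(inner)
```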



4. Operations
1) head()
In

df = pd.DataFrame({'col1':[1,2,3,4],'col2':[444,555,666,444],'col3':['abc','def','ghi','xyz']})
df.head()

Out


col1	col2	col3
0	1	444	abc
1	2	555	def
2	3	666	ghi
3	4	444	xyz

2) value_counts() : unique + count
In

df['col2'].value_counts()

Out

444    2
555    1
666    1
Name: col2, dtype: int64
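Related to value_counts(), unique() and nunique() give you just the distinct values and their count. A quick sketch with the same df:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4],
                   'col2': [444, 555, 666, 444],
                   'col3': ['abc', 'def', 'ghi', 'xyz']})

print(df['col2'].unique())   # the distinct values, as an array
print(df['col2'].nunique())  # how many distinct values there are
```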

3) Applying Functions

The 'apply()' function applies a given function to each element of a Series or DataFrame and returns the result. Typically, you pass a lambda or a user-defined function as an argument to apply().

In

df['col1'].apply(lambda x: x*2)

Out

0    2
1    4
2    6
3    8
Name: col1, dtype: int64
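A named function or a built-in works just as well as a lambda. A small sketch on the same df (the times2 helper is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4],
                   'col3': ['abc', 'def', 'ghi', 'xyz']})

def times2(x):
    return x * 2

# A user-defined function behaves the same as the lambda above
doubled = df['col1'].apply(times2)
# Built-ins work too, e.g. string length per element
lengths = df['col3'].apply(len)
print(doubled.tolist())  # [2, 4, 6, 8]
print(lengths.tolist())  # [3, 3, 3, 3]
```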



5. Data Input and Output
Pandas can read a variety of file types using its 'pd.read_*' methods.

1) Excel/CSV
Pandas can read and write Excel/CSV files.
BUT! Formulas and images in an Excel file cannot be read.

Input
In

pd.read_excel('Excel_Sample.xlsx', sheet_name='Sheet1')

Out

	a	b
0	0	1
1	4	5
2	8	9
3	12	13

Output

df.to_excel('Excel_Sample.xlsx',sheet_name='Sheet1')
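CSV files work the same way with to_csv and read_csv. A self-contained round-trip sketch (the file name and temp-directory path are made up for illustration):

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'a': [0, 4, 8, 12], 'b': [1, 5, 9, 13]})
# 'csv_sample.csv' is just an illustrative file name
path = os.path.join(tempfile.gettempdir(), 'csv_sample.csv')

# Write without the index so the file round-trips cleanly
df.to_csv(path, index=False)
loaded = pd.read_csv(path)
print(loaded)
```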


2) Html
Pandas' read_html function reads tables from a webpage.
It returns a list of DataFrame objects.
I tried it on the Federal Deposit Insurance Corporation webpage.

df = pd.read_html('http://www.fdic.gov/bank/individual/failed/banklist.html')



Conclusion

In summary, Pandas provides powerful tools for manipulating and analyzing data. By handling missing data, grouping rows, combining DataFrames, and reading data from different file types, you can work with data sets efficiently.

CRE-CO Inc., an engineer-first company
Son
