Pandas Summary

Usage

Series: a one-dimensional data structure

A Series holds sequence data such as an array or a list. If no index labels are specified, an integer index starting at 0 is assigned automatically.

import pandas as pd
data = [1,3,5,7,9]
s = pd.Series(data)
s
0    1
1    3
2    5
3    7
4    9
dtype: int64
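When index labels are supplied, they replace the default integer index. The following is a minimal sketch; the labels 'a' through 'e' are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical labels 'a'..'e' replace the default 0..4 integer index.
s = pd.Series([1, 3, 5, 7, 9], index=['a', 'b', 'c', 'd', 'e'])

by_label = s['c']         # label-based lookup
by_position = s.iloc[2]   # position-based lookup of the same element
print(by_label, by_position)
```

Both lookups return the same element; .iloc keeps working by position even after custom labels are set.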

DataFrame

A DataFrame handles tabular data with rows and columns. A dictionary that maps each column name (key) to that column's row values (value) is converted into a DataFrame with pd.DataFrame().

import pandas as pd
data ={'year': [ 2016, 2017, 2018], 
       'GDP rate': [2.8, 3.1, 3.0], 
       'GDP': ['1.637M','1.73M', '1.83M']
      }

df = pd.DataFrame(data)
df
   year  GDP rate     GDP
0  2016       2.8  1.637M
1  2017       3.1   1.73M
2  2018       3.0   1.83M
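The structure of the resulting DataFrame can be checked through a few attributes. This sketch rebuilds the same data and inspects it; note that the 'GDP' column holds strings, so pandas does not treat it as numeric.

```python
import pandas as pd

data = {'year': [2016, 2017, 2018],
        'GDP rate': [2.8, 3.1, 3.0],
        'GDP': ['1.637M', '1.73M', '1.83M']}
df = pd.DataFrame(data)

print(df.shape)          # (number of rows, number of columns)
print(list(df.columns))  # column labels come from the dict keys
print(df.dtypes)         # one dtype per column; 'GDP' is not numeric
```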

Panel

Note: Panel was deprecated and has since been removed from pandas.

A three-dimensional data structure with three axes: Axis 0 (items), Axis 1 (major_axis), and Axis 2 (minor_axis).

Each element along Axis 0 is a two-dimensional DataFrame; Axis 1 corresponds to the DataFrame's rows, and Axis 2 corresponds to its columns.

Data access

Access data through indexing and attributes, e.g., df['year'] or df[df['year'] > 2016].

df['year']
0    2016
1    2017
2    2018
Name: year, dtype: int64
df[df['year']>2016]
   year  GDP rate    GDP
1  2017       3.1  1.73M
2  2018       3.0  1.83M
df.head()
   year  GDP rate     GDP
0  2016       2.8  1.637M
1  2017       3.1   1.73M
2  2018       3.0   1.83M
df.describe()
         year  GDP rate
count     3.0  3.000000
mean   2017.0  2.966667
std       1.0  0.152753
min    2016.0  2.800000
25%    2016.5  2.900000
50%    2017.0  3.000000
75%    2017.5  3.050000
max    2018.0  3.100000
df.sum()
year                    6051
GDP rate                 8.9
GDP         1.637M1.73M1.83M
dtype: object
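df.mean(numeric_only=True)  # skips the string column 'GDP'; recent pandas versions raise a TypeError without it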
year        2017.000000
GDP rate       2.966667
dtype: float64
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   year      3 non-null      int64  
 1   GDP rate  3 non-null      float64
 2   GDP       3 non-null      object 
dtypes: float64(1), int64(1), object(1)
memory usage: 200.0+ bytes
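Row filtering and column selection can also be combined in a single .loc call. The sketch below uses a reduced copy of the data above (two columns only, an assumption made to keep it short) and also shows why attribute access does not work for every column name.

```python
import pandas as pd

df = pd.DataFrame({'year': [2016, 2017, 2018],
                   'GDP rate': [2.8, 3.1, 3.0]})

# Boolean mask selects rows; a list of labels selects columns.
recent = df.loc[df['year'] > 2016, ['year', 'GDP rate']]

# Attribute access works only for names that are valid Python identifiers,
# so 'GDP rate' (which contains a space) needs bracket syntax.
years = df.year
rates = df['GDP rate']
print(recent)
```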

Reading and writing external data

pandas can read data from and write data to a variety of external sources, including CSV files, text files, Excel files, SQL databases, and the HDF5 format.

import pandas as pd
df = pd.read_csv('/Users/catherine/Desktop/grade.csv')
df
   id  Korean  English  Math
0   1      80       85    75
1   2      90      100    95
2   3      75       70    65
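Writing works the same way in the other direction. The following sketch round-trips the grade table through CSV; it uses an in-memory io.StringIO buffer instead of a file path (an assumption made so the example runs without grade.csv on disk).

```python
import io
import pandas as pd

grades = pd.DataFrame({'id': [1, 2, 3],
                       'Korean': [80, 90, 75],
                       'English': [85, 100, 70],
                       'Math': [75, 95, 65]})

# Write to an in-memory text buffer; index=False omits the row index column.
buffer = io.StringIO()
grades.to_csv(buffer, index=False)

# Rewind and read the CSV text back into a new DataFrame.
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored)
```

Passing a file path to to_csv() and read_csv() instead of the buffer reads and writes a file on disk.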
%matplotlib inline
import pandas as pd
from matplotlib import pyplot as plt

df = pd.read_csv('/Users/catherine/Desktop/grade.csv')
plt.bar(df.id, df['English'])
plt.show()

[Figure: bar chart of English scores by id]

df
   id  Korean  English  Math
0   1      80       85    75
1   2      90      100    95
2   3      75       70    65
df.iloc[0:, 1:5]
   Korean  English  Math
0      80       85    75
1      90      100    95
2      75       70    65
df.iloc[0:, 1:5].plot.bar()
<AxesSubplot:>

[Figure: grouped bar chart of the Korean, English, and Math scores per row]

df.plot.bar()
<AxesSubplot:>

[Figure: grouped bar chart of all columns, including id]

import seaborn as sns
%matplotlib inline
df.corr()
               id    Korean   English      Math
id       1.000000 -0.327327 -0.500000 -0.327327
Korean  -0.327327  1.000000  0.981981  1.000000
English -0.500000  0.981981  1.000000  0.981981
Math    -0.327327  1.000000  0.981981  1.000000
a= df.iloc[0:, 1:5].corr()
a
           Korean   English      Math
Korean   1.000000  0.981981  1.000000
English  0.981981  1.000000  0.981981
Math     1.000000  0.981981  1.000000
sns.heatmap(a, cmap = 'coolwarm', annot = True)
<AxesSubplot:>

[Figure: heatmap of the correlation matrix with annotated values]

Python Pandas Interview Questions

What is pandas?

pandas is an open-source library that provides high-performance data manipulation in Python. The name is derived from "panel data", an econometrics term for multidimensional data sets.

How can you calculate the standard deviation of a Series?

import pandas as pd
import numpy as np
df = pd.Series(np.random.randint(0,7, size =10))
df
df.std()  # std() returns the standard deviation; it is available on both Series and DataFrame (per column or per row)
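By default, std() in pandas computes the sample standard deviation (ddof=1), whereas NumPy's np.std defaults to the population form (ddof=0). The sketch below uses a small fixed Series (chosen here for illustration, not taken from the notes) so the two results can be compared.

```python
import numpy as np
import pandas as pd

s = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

sample_std = s.std()             # divides by n - 1 (ddof=1, the pandas default)
population_std = s.std(ddof=0)   # divides by n, like np.std's default
numpy_std = np.std(s.to_numpy())

print(sample_std, population_std, numpy_std)
```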

What is a DataFrame in pandas?

A DataFrame is a widely used pandas data structure: a two-dimensional array with labeled axes (rows and columns). It is the standard way to store tabular data and has two indexes, a row index and a column index. It has the following properties

What are the significant features of the pandas Library?

What are the different ways a DataFrame can be created in pandas?

# Define by list
import pandas as pd

a =['Python', 'Pandas']
info = pd.DataFrame(a)
info
# Define by dict

import pandas as pd
info = {'ID': [101, 102, 103], 
       'Department': ['B.Sc', 'B.Tech', 'M.Tech']
       }

info = pd.DataFrame(info)
info
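Beyond a plain list and a dict of lists, two other common inputs are a list of dicts (one dict per row) and a NumPy array. This is a brief sketch; the column names 'x' and 'y' are hypothetical.

```python
import numpy as np
import pandas as pd

# List of dicts: each dict becomes one row, keys become columns.
records = [{'ID': 101, 'Department': 'B.Sc'},
           {'ID': 102, 'Department': 'B.Tech'}]
from_records = pd.DataFrame(records)

# 2-D NumPy array: column labels must be supplied separately.
array = np.array([[1, 2], [3, 4]])
from_array = pd.DataFrame(array, columns=['x', 'y'])

print(from_records)
print(from_array)
```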

Explain categorical data in pandas

Categorical data is a pandas data type that corresponds to a categorical variable in statistics. A categorical variable takes on a limited, and usually fixed, number of possible values.

Example
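A minimal sketch of a categorical Series, using hypothetical size labels. When categories are inferred from string data, pandas stores the sorted unique values as the categories and represents each element by an integer code.

```python
import pandas as pd

sizes = pd.Series(['small', 'large', 'medium', 'small'], dtype='category')

categories = list(sizes.cat.categories)  # unique values, sorted
codes = list(sizes.cat.codes)            # integer position of each value in categories

print(categories)  # ['large', 'medium', 'small']
print(codes)       # [2, 0, 1, 2]
```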

How will you create a series from dict in Pandas?

import pandas as pd
import numpy as np

info = {'X': 0., 'Y':1., 'Z':2. }
a = pd.Series(info)
a

How can we create a copy of a Series in pandas?
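The notes leave this question unanswered; one common approach is Series.copy(), which makes a deep copy by default. A short sketch with illustrative values:

```python
import pandas as pd

original = pd.Series([1, 2, 3])
duplicate = original.copy()   # deep=True by default: data and index are copied

# Changing the copy leaves the original untouched.
duplicate.iloc[0] = 99
print(original.tolist(), duplicate.tolist())
```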

How will you create an empty DataFrame in pandas?

import pandas as pd
info = pd.DataFrame()
info
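Whether a DataFrame is empty can be checked with the .empty attribute. The sketch below also shows an empty frame that already has column labels; the column names are borrowed from the earlier dict example.

```python
import pandas as pd

info = pd.DataFrame()
is_empty = info.empty      # True: no rows and no columns
shape = info.shape         # (0, 0)

# An empty frame can still declare its columns up front.
with_columns = pd.DataFrame(columns=['ID', 'Department'])
print(is_empty, shape, list(with_columns.columns))
```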

How will you add a column to a pandas DataFrame?

We can add a new column to an existing DataFrame, as in the code below.

import pandas as pd
info = {'one': pd.Series([1,2,3,4,5], index = ['a', 'b', 'c', 'd', 'e']), 
       'two': pd.Series([1,2,3,4,5,6], index = ['a', 'b', 'c', 'd', 'e', 'f'])}

info = pd.DataFrame(info)
info
info['three']= pd.Series([20, 40, 60], index = ['a', 'b', 'c'])
info
info['four'] = info['one'] + info['three']
info
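In the code above, the new Series is aligned to the DataFrame by index label, so rows without a matching label receive NaN. A reduced sketch of that alignment behavior, with smaller illustrative data:

```python
import pandas as pd

info = pd.DataFrame({'one': [1, 2, 3]}, index=['a', 'b', 'c'])

# The Series has no label 'c', so that row is filled with NaN.
info['two'] = pd.Series([10, 20], index=['a', 'b'])
missing = info['two'].isna().tolist()

# Arithmetic with NaN yields NaN, so row 'c' stays missing in the sum.
info['three'] = info['one'] + info['two']
print(info)
```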

How to add an index, row, or column to a pandas DataFrame?

Adding an index to a DataFrame

Adding rows to a DataFrame: use .loc with a new label to insert a row. .iloc selects by position and can only overwrite existing rows; it cannot add new ones.

Remember

import pandas as pd
import numpy as np

data = pd.DataFrame({
    'age' :     [ 10, 22, 13, 21, 12, 11, 17],
    'section' : [ 'A', 'B', 'C', 'B', 'B', 'A', 'A'],
    'city' :    [ 'Gurgaon', 'Delhi', 'Mumbai', 'Delhi', 'Mumbai', 'Delhi', 'Mumbai'],
    'gender' :  [ 'M', 'F', 'F', 'M', 'M', 'M', 'F'],
    'favourite_color' : [ 'red', np.nan, 'yellow', np.nan, 'black', 'green', 'red']
})

data
data.loc[data.age >=15]
data.loc[(data.age>=12) & (data.gender == 'M')]
data.loc[1:3]
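The .loc calls above select rows. Adding an index or a row can be sketched as follows, assuming a small hypothetical table (the names and ages are made up for illustration): set_index() promotes a column to the row index, .loc with a new label appends a row, and pd.concat() appends rows from another DataFrame.

```python
import pandas as pd

data = pd.DataFrame({'name': ['Ann', 'Bob'], 'age': [10, 22]})

# Use an existing column as the row index.
indexed = data.set_index('name')

# Assigning to a label that does not exist yet appends a new row.
indexed.loc['Cid'] = [13]

# Rows from another DataFrame can be appended with pd.concat.
extra = pd.DataFrame({'age': [21]}, index=pd.Index(['Dee'], name='name'))
combined = pd.concat([indexed, extra])
print(combined)
```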

How to select a subset of a DataFrame?

data.iloc[0:3, 3:5] # iloc[row_start:row_end, col_start:col_end]
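One detail worth keeping in mind: .iloc slices exclude the end position, like ordinary Python slices, while .loc slices include the end label. A small sketch with illustrative data and the default integer index:

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 2, 3, 4], 'b': [5, 6, 7, 8, 9]})

by_position = df.iloc[1:3]   # positions 1 and 2 (end excluded)
by_label = df.loc[1:3]       # labels 1, 2 and 3 (end included)

print(len(by_position), len(by_label))
```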

https://www.javatpoint.com/python-pandas-interview-questions

import pandas as pd
df1 =pd.DataFrame({
    'name':['James', 'Jeff'],
    'Rank': [3, 2]})

df2 =pd.DataFrame({
    'name':['James', 'Jeff'],
    'Rank': [1, 3]})

a= pd.merge(df1, df2, on = 'name')
a
    name  Rank_x  Rank_y
0  James       3       1
1   Jeff       2       3
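Because both inputs have a 'Rank' column, merge() disambiguates them with the default suffixes _x and _y. The suffixes parameter replaces those defaults; the suffix names below are hypothetical.

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['James', 'Jeff'], 'Rank': [3, 2]})
df2 = pd.DataFrame({'name': ['James', 'Jeff'], 'Rank': [1, 3]})

# Hypothetical suffixes replace the default '_x' / '_y'.
merged = pd.merge(df1, df2, on='name', suffixes=('_first', '_second'))
print(merged)
```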
pd.DataFrame(a)
name Rank