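Pandas Summary

Usage

Series: a one-dimensional data structure

A Series holds a sequence of values, such as an array or a list. If no index labels are specified, an integer index starting at 0 is assigned automatically.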

import pandas as pd
data = [1,3,5,7,9]
s = pd.Series(data)

0    1
1    3
2    5
3    7
4    9
dtype: int64
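A small sketch of the custom-label case mentioned above. The labels 'a' to 'e' are made up for illustration; any sequence with the same length as the data should work.

```python
import pandas as pd

data = [1, 3, 5, 7, 9]

# Illustrative labels (an assumption, not part of the original example).
labeled = pd.Series(data, index=['a', 'b', 'c', 'd', 'e'])

print(labeled['c'])     # access by label
print(labeled.iloc[2])  # access by position, same element
```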

DataFrame

A DataFrame handles tabular data with rows and columns. A dictionary whose keys are the column names and whose values hold the row entries of each column is turned into a DataFrame with pd.DataFrame().

import pandas as pd
data ={'year': [ 2016, 2017, 2018], 
       'GDP rate': [2.8, 3.1, 3.0], 
       'GDP': ['1.637M','1.73M', '1.83M']
      }

df = pd.DataFrame(data)

df

	year	GDP rate	GDP
0	2016	2.8	1.637M
1	2017	3.1	1.73M
2	2018	3.0	1.83M

Panel

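Note: Panel is deprecated and has been removed from Pandas.

Panel was a three-dimensional data structure with three axes: Axis0 (items), Axis1 (major_axis), and Axis2 (minor_axis).

Each element of Axis0 corresponds to a two-dimensional DataFrame, Axis1 corresponds to the rows of that DataFrame, and Axis2 corresponds to its columns.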
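Since Panel is gone, one common substitute is a DataFrame with a two-level row index. The sketch below is only one possible layout, and the item names and values are invented for illustration: the outer index level plays the role of Axis0 (items), the inner level the role of Axis1 (rows), and the columns the role of Axis2.

```python
import pandas as pd

# Two small 2-D tables with made-up values; each stands in for one Panel "item".
item_a = pd.DataFrame({'x': [1, 2], 'y': [3, 4]}, index=['r0', 'r1'])
item_b = pd.DataFrame({'x': [5, 6], 'y': [7, 8]}, index=['r0', 'r1'])

# Concatenating a dict of DataFrames yields a two-level row index: (item, row).
stacked = pd.concat({'item_a': item_a, 'item_b': item_b})

print(stacked.loc['item_b'])                 # one "item": a 2-D DataFrame
print(stacked.loc[('item_b', 'r1'), 'y'])    # a single cell: item, row, column
```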

Data access

Access data with indexing and attributes, e.g., df['year'] or df[df['year'] > 2016].

df['year']

0    2016
1    2017
2    2018
Name: year, dtype: int64

df[df['year']>2016]

	year	GDP rate	GDP
1	2017	3.1	1.73M
2	2018	3.0	1.83M
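A few more access patterns on the same table, as a hedged sketch. The DataFrame is rebuilt here so the block runs on its own; the combined condition is just an example.

```python
import pandas as pd

df = pd.DataFrame({'year': [2016, 2017, 2018],
                   'GDP rate': [2.8, 3.1, 3.0],
                   'GDP': ['1.637M', '1.73M', '1.83M']})

# Attribute access works only when the column name is a valid Python identifier.
years = df.year

# 'GDP rate' contains a space, so it needs bracket indexing.
rates = df['GDP rate']

# A list of names selects several columns and returns a DataFrame.
subset = df[['year', 'GDP']]

# Conditions are combined with & or |, each wrapped in parentheses.
recent_fast = df[(df['year'] > 2016) & (df['GDP rate'] > 3.0)]
print(recent_fast)
```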

df.head()

	year	GDP rate	GDP
0	2016	2.8	1.637M
1	2017	3.1	1.73M
2	2018	3.0	1.83M

df.describe()

	year	GDP rate
count	3.0	3.000000
mean	2017.0	2.966667
std	1.0	0.152753
min	2016.0	2.800000
25%	2016.5	2.900000
50%	2017.0	3.000000
75%	2017.5	3.050000
max	2018.0	3.100000

df.sum()

year                    6051
GDP rate                 8.9
GDP         1.637M1.73M1.83M
dtype: object
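The GDP column holds strings, which is why sum() concatenates them instead of adding them. A minimal sketch of converting the column to numbers first, assuming that every value ends in the same 'M' suffix and that the suffix can simply be dropped:

```python
import pandas as pd

df = pd.DataFrame({'GDP': ['1.637M', '1.73M', '1.83M']})

# Assumption: the trailing 'M' is only a unit marker, so it is stripped
# before converting the remaining text to float.
gdp_numeric = df['GDP'].str.rstrip('M').astype(float)

print(gdp_numeric.sum())
```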

df.mean(numeric_only=True)  # skip the string column 'GDP'; without this, pandas 2.0+ raises a TypeError

year        2017.000000
GDP rate       2.966667
dtype: float64

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   year      3 non-null      int64  
 1   GDP rate  3 non-null      float64
 2   GDP       3 non-null      object 
dtypes: float64(1), int64(1), object(1)
memory usage: 200.0+ bytes

Reading and writing external data

pandas can read from and write to many external sources, including CSV files, text files, Excel files, SQL databases, and the HDF5 format.

import pandas as pd
df = pd.read_csv('/Users/catherine/Desktop/grade.csv')

df

	id	Korean	English	Math
0	1	80	85	75
1	2	90	100	95
2	3	75	70	65
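The cell above only reads a file. As a sketch of the writing direction, the block below saves the same grade values to a temporary CSV file and reads them back. The file name grade_copy.csv is hypothetical, and a temporary directory is used so nothing is left on disk.

```python
import os
import tempfile

import pandas as pd

# Same values as the grade table shown above.
grades = pd.DataFrame({'id': [1, 2, 3],
                       'Korean': [80, 90, 75],
                       'English': [85, 100, 70],
                       'Math': [75, 95, 65]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'grade_copy.csv')  # hypothetical file name

    # index=False keeps the 0, 1, 2 row index out of the file.
    grades.to_csv(path, index=False)

    restored = pd.read_csv(path)

print(restored)
```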

%matplotlib inline
import pandas as pd
from matplotlib import pyplot as plt

df = pd.read_csv('/Users/catherine/Desktop/grade.csv')
plt.bar(df.id, df['English'])
plt.show()

[Figure: bar chart of English scores by id]

df

	id	Korean	English	Math
0	1	80	85	75
1	2	90	100	95
2	3	75	70	65

df.iloc[0:, 1:5]

	Korean	English	Math
0	80	85	75
1	90	100	95
2	75	70	65

df.iloc[0:, 1:5].plot.bar()

<AxesSubplot:>

[Figure: grouped bar chart of the Korean, English, and Math columns]

df.plot.bar()

<AxesSubplot:>

[Figure: grouped bar chart of all columns, including id]

import seaborn as sns
%matplotlib inline
df.corr()

	id	Korean	English	Math
id	1.000000	-0.327327	-0.500000	-0.327327
Korean	-0.327327	1.000000	0.981981	1.000000
English	-0.500000	0.981981	1.000000	0.981981
Math	-0.327327	1.000000	0.981981	1.000000

a= df.iloc[0:, 1:5].corr()
a

	Korean	English	Math
Korean	1.000000	0.981981	1.000000
English	0.981981	1.000000	0.981981
Math	1.000000	0.981981	1.000000

sns.heatmap(a, cmap = 'coolwarm', annot = True)

<AxesSubplot:>

[Figure: annotated heatmap of the correlation matrix]

Python Pandas Interview Questions

What is Python Pandas?

Pandas is an open-source library that provides high-performance data manipulation in Python. The name is derived from "panel data", an econometrics term for multidimensional data sets.

How can you calculate the standard deviation of a Series?

import pandas as pd
import numpy as np
s = pd.Series(np.random.randint(0, 7, size=10))  # 10 random integers from 0 to 6

s

s.std()  # sample standard deviation (ddof=1 by default); std() also works on a DataFrame, per column or per row
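One detail worth checking when comparing results with NumPy: Series.std() divides by n - 1 by default, while numpy.std() divides by n. A short sketch with fixed, made-up values so the numbers are reproducible:

```python
import numpy as np
import pandas as pd

values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

sample_std = values.std()            # pandas default: ddof=1 (divide by n - 1)
population_std = values.std(ddof=0)  # divide by n, matching NumPy's default

print(sample_std, population_std)
print(np.std(values.to_numpy()))     # same as population_std
```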

What is a DataFrame in Pandas?

A DataFrame is a widely used Pandas data structure that works as a two-dimensional array with labeled axes (rows and columns). It is a standard way to store data and has two indexes: a row index and a column index. It has the following properties:

The columns can be of heterogeneous types, such as int and bool.
It can be seen as a dictionary of Series in which both the rows and the columns are indexed. The column index is called "columns" and the row index is called "index".
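These properties can be inspected directly. The sketch below uses invented column names and values; it only illustrates the index, columns, and dtypes attributes and the fact that each column is a Series.

```python
import pandas as pd

frame = pd.DataFrame({'quantity': [1, 2, 3],
                      'flag': [True, False, True]},
                     index=['a', 'b', 'c'])

print(frame.index)    # row labels
print(frame.columns)  # column labels
print(frame.dtypes)   # one dtype per column: an integer column and a bool column

# Each column is itself a Series that shares the row index.
print(type(frame['flag']))
```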

What are the significant features of the pandas Library?

Memory efficient
Data Alignment
Reshaping
Merge and join
Time Series
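As a brief illustration of one item in the list, data alignment, here is a sketch with two made-up Series whose labels only partly overlap:

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20, 30], index=['y', 'z', 'w'])

# Values are matched by label, not by position. Labels present in only one
# operand produce NaN in the result.
total = a + b
print(total)
```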

What are the different ways to create a DataFrame in Pandas?

list
dict of ndarrays or lists

# Define by list
import pandas as pd

a =['Python', 'Pandas']
info = pd.DataFrame(a)
info

# Define by dict

import pandas as pd
info = {'ID': [101, 102, 103], 
       'Department': ['B.Sc', 'B.Tech', 'M.Tech']
       }

info = pd.DataFrame(info)
info
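The dict example above uses plain lists. For completeness, a sketch of the dict-of-ndarrays form, plus a list of dicts (one dict per row), which is another accepted input. The 'Score' column and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Dict of ndarrays: each key becomes a column; the arrays must have equal length.
from_arrays = pd.DataFrame({'ID': np.array([101, 102, 103]),
                            'Score': np.array([3.5, 4.0, 2.5])})  # hypothetical column

# List of dicts: each dict becomes one row; missing keys are filled with NaN.
from_records = pd.DataFrame([{'ID': 101, 'Department': 'B.Sc'},
                             {'ID': 102}])

print(from_arrays)
print(from_records)
```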

Explain categorical data in Pandas

Categorical data is a Pandas data type that corresponds to a categorical variable in statistics. A categorical variable takes on a limited, and usually fixed, number of possible values.

Example

Gender, country affiliation, blood type, social class, observation time, or ratings on Likert scales
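A minimal sketch using blood type from the examples above. The sample values are invented; the block shows how the category dtype stores a fixed set of categories plus an integer code per element.

```python
import pandas as pd

blood = pd.Series(['A', 'B', 'O', 'A', 'AB'], dtype='category')

print(blood.cat.categories)  # the distinct values, sorted
print(blood.cat.codes)       # integer code of each element
print(blood.value_counts())  # occurrences per category
```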

How will you create a series from dict in Pandas?

import pandas as pd
import numpy as np

info = {'X': 0., 'Y':1., 'Z':2. }
a = pd.Series(info)
a

How can we create a copy of a Series in Pandas?

Use pandas.Series.copy:
Series.copy(deep=True)
- With deep=True (the default), the copy gets its own data and index, so changes to the copy do not affect the original. Python objects stored inside the Series are not copied recursively, though; only the references to those objects are copied.
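Both halves of that statement can be checked with a short sketch. The values are arbitrary; the second part stores Python lists in the Series to show that the lists themselves are shared between the original and the copy.

```python
import pandas as pd

original = pd.Series([1, 2, 3])
duplicate = original.copy(deep=True)

# Changing the copy leaves the original untouched.
duplicate.iloc[0] = 100
print(original.iloc[0])

# Caveat for object data: the list objects are shared, not cloned.
nested = pd.Series([[1, 2], [3, 4]])
nested_copy = nested.copy(deep=True)
nested_copy.iloc[0].append(99)
print(nested.iloc[0])  # the original sees the appended value too
```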

How will you create an empty DataFrame in Pandas?

import pandas as pd
info = pd.DataFrame()
info
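A small follow-up sketch: the empty attribute and shape confirm that the DataFrame has no data, and column names can be declared up front even without rows. The column names are borrowed from the earlier dict example.

```python
import pandas as pd

empty = pd.DataFrame()
print(empty.empty, empty.shape)

# Columns can be declared before any rows exist.
with_columns = pd.DataFrame(columns=['ID', 'Department'])
print(with_columns.empty, with_columns.shape)
```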

How will you add a column to a Pandas DataFrame?

We can add a new column to an existing DataFrame, as in the code below.

import pandas as pd
info = {'one': pd.Series([1,2,3,4,5], index = ['a', 'b', 'c', 'd', 'e']), 
       'two': pd.Series([1,2,3,4,5,6], index = ['a', 'b', 'c', 'd', 'e', 'f'])}

info = pd.DataFrame(info)

info

info['three']= pd.Series([20, 40, 60], index = ['a', 'b', 'c'])
info

info['four'] = info['one'] + info['three']

info
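Besides bracket assignment, two other calls can add a column. The sketch below uses a smaller made-up table; assign() returns a new DataFrame, while insert() changes the existing one and lets you choose the position.

```python
import pandas as pd

info = pd.DataFrame({'one': [1, 2, 3]}, index=['a', 'b', 'c'])

# assign() returns a new DataFrame and leaves `info` unchanged.
extended = info.assign(double=info['one'] * 2)

# insert() modifies the DataFrame in place and controls the column position.
info.insert(0, 'label', ['x', 'y', 'z'])

print(extended)
print(info)
```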

How to add an index, row, or column to a Pandas DataFrame?

Adding an index to a DataFrame

Pandas lets you pass labels to the index argument when you create a DataFrame, which ensures that it has the desired index.
By default, when no index is given, the rows get an integer index starting at 0.
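A sketch of both cases, with invented names and scores: an index passed at creation time, the default integer index, and a column promoted to the index afterwards with set_index().

```python
import pandas as pd

# Illustrative labels passed through the `index` argument at creation time.
labeled = pd.DataFrame({'score': [90, 80, 70]}, index=['kim', 'lee', 'park'])

# Without `index`, rows are numbered 0, 1, 2, ...
default = pd.DataFrame({'name': ['kim', 'lee', 'park'], 'score': [90, 80, 70]})

# An existing column can be promoted to the index afterwards.
by_name = default.set_index('name')

print(labeled.loc['lee', 'score'])
print(by_name.loc['lee', 'score'])
```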

Adding rows to a DataFrame: .loc can insert a row by assigning to a label that does not exist yet; .iloc can overwrite rows at existing positions but cannot add new ones.

loc works on the labels of the index.
- loc[4] => the rows of the DataFrame whose index label is 4
iloc works on the positions in the index.
- iloc[4] => the row of the DataFrame at position 4 (the fifth row)

Remember

Select specific rows and/or columns using loc when using the row and column names
Select specific rows and/or columns using iloc when using the positions in the table
You can assign new values to a selection based on loc/iloc.
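The difference is easiest to see when the index labels do not match the positions. A sketch with a made-up single-column table whose labels start at 4:

```python
import pandas as pd

# Labels deliberately differ from positions to make the distinction visible.
df = pd.DataFrame({'value': ['p', 'q', 'r']}, index=[4, 5, 6])

print(df.loc[4, 'value'])   # label 4 is the first row
print(df.iloc[2, 0])        # position 2 is the third row

# .loc with a label that does not exist yet appends a new row.
df.loc[7] = ['s']

# .iloc can overwrite an existing position but cannot enlarge the DataFrame.
df.iloc[0, 0] = 'P'
try:
    df.iloc[10] = ['z']
except IndexError as exc:
    print('iloc cannot add rows:', exc)

print(df)
```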

import pandas as pd
import numpy as np

data = pd.DataFrame({
    'age' :     [ 10, 22, 13, 21, 12, 11, 17],
    'section' : [ 'A', 'B', 'C', 'B', 'B', 'A', 'A'],
    'city' :    [ 'Gurgaon', 'Delhi', 'Mumbai', 'Delhi', 'Mumbai', 'Delhi', 'Mumbai'],
    'gender' :  [ 'M', 'F', 'F', 'M', 'M', 'M', 'F'],
    'favourite_color' : [ 'red', np.nan, 'yellow', np.nan, 'black', 'green', 'red']  # np.nan: the np.NAN alias was removed in NumPy 2.0
})

data

data.loc[data.age >=15]

data.loc[(data.age>=12) & (data.gender == 'M')]

data.loc[1:3]

How to select a subset of a DataFrame?

data.iloc[0:3, 3:5] # iloc[row start:row end, column start:column end]; the end positions are excluded
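The same kind of subset can be written with labels instead of positions. The sketch below uses a shortened version of the table above; note that a loc slice includes its end label while an iloc slice excludes its end position.

```python
import pandas as pd

data = pd.DataFrame({'age': [10, 22, 13],
                     'section': ['A', 'B', 'C'],
                     'city': ['Gurgaon', 'Delhi', 'Mumbai']})

by_position = data.iloc[0:2, 1:3]              # rows 0-1, columns 1-2 (end excluded)
by_label = data.loc[0:1, ['section', 'city']]  # rows labeled 0-1 (end included)

print(by_position)
print(by_label)
```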

https://www.javatpoint.com/python-pandas-interview-questions

import pandas as pd
df1 =pd.DataFrame({
    'name':['James', 'Jeff'],
    'Rank': [3, 2]})

df2 =pd.DataFrame({
    'name':['James', 'Jeff'],
    'Rank': [1, 3]})

a= pd.merge(df1, df2, on = 'name')
a

	name	Rank_x	Rank_y
0	James	3	1
1	Jeff	2	3
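The _x and _y suffixes appear because both inputs have a 'Rank' column that is not part of the join key. A sketch of naming them explicitly through the suffixes parameter of pd.merge; the suffix strings here are arbitrary choices for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['James', 'Jeff'], 'Rank': [3, 2]})
df2 = pd.DataFrame({'name': ['James', 'Jeff'], 'Rank': [1, 3]})

# Illustrative suffixes; they replace the default _x / _y.
merged = pd.merge(df1, df2, on='name', suffixes=('_first', '_second'))
print(merged)
```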

pd.DataFrame(a)

	name	Rank