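Notes on the very basics of Pandas, from a total data science beginner (a middle-aged mom).
Pandas Summary
Usage
Series: a one-dimensional data structure
A Series holds a sequence of data, such as an array or a list. If you do not specify index labels, an integer index starting from 0 is assigned automatically.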
import pandas as pd
data = [1,3,5,7,9]
s = pd.Series(data)
s
0 1
1 3
2 5
3 7
4 9
dtype: int64
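The note about index labels can be illustrated with a minimal sketch. The string labels "a" to "e" are made up for this example; they are not part of the original notebook.

```python
import pandas as pd

# Same values as above, but with explicit (hypothetical) string labels
s = pd.Series([1, 3, 5, 7, 9], index=["a", "b", "c", "d", "e"])

print(s["c"])         # label-based lookup
print(s.iloc[2])      # position-based lookup of the same element
print(list(s.index))  # the labels replace the default 0..4 integers
```

With custom labels, label-based and position-based access can point to the same element while using different keys.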
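DataFrame
A DataFrame handles tabular data with rows and columns. Build a dictionary in which each key is a column name and each value is the list of that column's values, then pass it to pd.DataFrame() to create the structure.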
import pandas as pd
data ={'year': [ 2016, 2017, 2018],
'GDP rate': [2.8, 3.1, 3.0],
'GDP': ['1.637M','1.73M', '1.83M']
}
df = pd.DataFrame(data)
df
| | year | GDP rate | GDP |
|---|---|---|---|
| 0 | 2016 | 2.8 | 1.637M |
| 1 | 2017 | 3.1 | 1.73M |
| 2 | 2018 | 3.0 | 1.83M |
Panel
Note: Panel was deprecated and has been removed from pandas.
A 3-dimensional data structure with three axes: Axis0 (items), Axis1 (major_axis), and Axis2 (minor_axis).
Each element of Axis0 corresponds to a 2-dimensional DataFrame, Axis1 corresponds to the rows of that DataFrame, and Axis2 corresponds to its columns.
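Since Panel is no longer available, one common way to hold the same kind of 3-dimensional data is a DataFrame with a two-level (MultiIndex) row index. The sketch below is only an illustration under that assumption; the table names and numbers are hypothetical.

```python
import pandas as pd

# Two small 2-D tables (hypothetical data), one per "item"
q1 = pd.DataFrame({"sales": [10, 20], "cost": [7, 12]}, index=["north", "south"])
q2 = pd.DataFrame({"sales": [15, 25], "cost": [9, 14]}, index=["north", "south"])

# Stack them under an outer index level; the outer level plays the role of Axis0 (items)
stacked = pd.concat({"Q1": q1, "Q2": q2}, names=["item", "region"])

print(stacked)
print(stacked.loc["Q2"])                      # one 2-D slice, like one Panel item
print(stacked.loc[("Q1", "south"), "sales"])  # a single cell
```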
Data access
Access data with indexing and attributes, e.g., df['year'] or df[df['year'] > 2016].
df['year']
0 2016
1 2017
2 2018
Name: year, dtype: int64
df[df['year']>2016]
| | year | GDP rate | GDP |
|---|---|---|---|
| 1 | 2017 | 3.1 | 1.73M |
| 2 | 2018 | 3.0 | 1.83M |
df.head()
| | year | GDP rate | GDP |
|---|---|---|---|
| 0 | 2016 | 2.8 | 1.637M |
| 1 | 2017 | 3.1 | 1.73M |
| 2 | 2018 | 3.0 | 1.83M |
df.describe()
| | year | GDP rate |
|---|---|---|
| count | 3.0 | 3.000000 |
| mean | 2017.0 | 2.966667 |
| std | 1.0 | 0.152753 |
| min | 2016.0 | 2.800000 |
| 25% | 2016.5 | 2.900000 |
| 50% | 2017.0 | 3.000000 |
| 75% | 2017.5 | 3.050000 |
| max | 2018.0 | 3.100000 |
df.sum()
year 6051
GDP rate 8.9
GDP 1.637M1.73M1.83M
dtype: object
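df.mean(numeric_only=True)  # numeric_only=True skips the string column 'GDP'; recent pandas versions raise a TypeError without it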
year 2017.000000
GDP rate 2.966667
dtype: float64
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 year 3 non-null int64
1 GDP rate 3 non-null float64
2 GDP 3 non-null object
dtypes: float64(1), int64(1), object(1)
memory usage: 200.0+ bytes
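Reading and writing external data
pandas provides functions for reading data from and writing data to a variety of external resources, such as CSV files, text files, Excel files, SQL databases, and the HDF5 format.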
import pandas as pd
df = pd.read_csv('/Users/catherine/Desktop/grade.csv')
df
| | id | Korean | English | Math |
|---|---|---|---|---|
| 0 | 1 | 80 | 85 | 75 |
| 1 | 2 | 90 | 100 | 95 |
| 2 | 3 | 75 | 70 | 65 |
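The read shown above depends on a file on the author's desktop. For readers without that file, a write-then-read round trip can be sketched as follows; the temporary path is an assumption made for this example, and the values are copied from the grade table above.

```python
import os
import tempfile

import pandas as pd

# Same values as the grade table above
grades = pd.DataFrame({
    "id": [1, 2, 3],
    "Korean": [80, 90, 75],
    "English": [85, 100, 70],
    "Math": [75, 95, 65],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "grade.csv")  # hypothetical temporary location
    grades.to_csv(path, index=False)           # index=False: do not write the 0, 1, 2 row labels
    loaded = pd.read_csv(path)                 # read the file back into a new DataFrame

print(loaded)
```

Passing index=False keeps the row labels out of the file, so reading it back does not produce an extra unnamed column.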
%matplotlib inline
import pandas as pd
from matplotlib import pyplot as plt
df = pd.read_csv('/Users/catherine/Desktop/grade.csv')
plt.bar(df.id, df['English'])
plt.show()

df
| | id | Korean | English | Math |
|---|---|---|---|---|
| 0 | 1 | 80 | 85 | 75 |
| 1 | 2 | 90 | 100 | 95 |
| 2 | 3 | 75 | 70 | 65 |
df.iloc[0:, 1:5]
| | Korean | English | Math |
|---|---|---|---|
| 0 | 80 | 85 | 75 |
| 1 | 90 | 100 | 95 |
| 2 | 75 | 70 | 65 |
df.iloc[0:, 1:5].plot.bar()
<AxesSubplot:>

df.plot.bar()
<AxesSubplot:>

import seaborn as sns
%matplotlib inline
df.corr()
| | id | Korean | English | Math |
|---|---|---|---|---|
| id | 1.000000 | -0.327327 | -0.500000 | -0.327327 |
| Korean | -0.327327 | 1.000000 | 0.981981 | 1.000000 |
| English | -0.500000 | 0.981981 | 1.000000 | 0.981981 |
| Math | -0.327327 | 1.000000 | 0.981981 | 1.000000 |
a= df.iloc[0:, 1:5].corr()
a
| | Korean | English | Math |
|---|---|---|---|
| Korean | 1.000000 | 0.981981 | 1.000000 |
| English | 0.981981 | 1.000000 | 0.981981 |
| Math | 1.000000 | 0.981981 | 1.000000 |
sns.heatmap(a, cmap = 'coolwarm', annot = True)
<AxesSubplot:>

Python Pandas Interview Questions
What is Python Pandas?
Pandas is an open-source library that provides high-performance data manipulation in Python. The name is derived from "panel data", an econometrics term for multidimensional data sets.
How can you calculate the standard deviation of a Series?
import pandas as pd
import numpy as np
s = pd.Series(np.random.randint(0, 7, size=10))  # 10 random integers from 0 to 6
s
s.std()  # sample standard deviation of the values; std() is also available on a DataFrame, per column or per row
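One detail worth checking by hand: pandas computes the sample standard deviation (ddof=1) by default, while NumPy's np.std defaults to the population version (ddof=0). The fixed values below are chosen only to make the arithmetic easy to follow.

```python
import numpy as np
import pandas as pd

s = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])  # fixed values so the result is reproducible

sample_std = s.std()            # pandas default: ddof=1 (divide by n - 1)
population_std = s.std(ddof=0)  # divide by n, which matches NumPy's default

print(sample_std, population_std)
print(np.isclose(population_std, np.std(s.to_numpy())))
```

Here the mean is 5 and the squared deviations sum to 32, so the population variance is 32 / 8 = 4 and the sample variance is 32 / 7.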
What is a DataFrame in Pandas?
A DataFrame is a widely used pandas data structure: a two-dimensional array with labeled axes (rows and columns). It is the standard way to store tabular data and has two indexes, a row index and a column index. It has the following properties:
- The columns can be of heterogeneous types, such as int and bool.
- It can be seen as a dictionary of Series in which both the rows and the columns are indexed. The column labels are exposed as "columns" and the row labels as "index".
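The two properties above can be seen in a short sketch. The column names and values are hypothetical and chosen only to show mixed types.

```python
import pandas as pd

# Columns of different types: int, bool, and string (hypothetical values)
df = pd.DataFrame(
    {
        "count": pd.Series([1, 2, 3], index=["a", "b", "c"]),
        "flag": pd.Series([True, False, True], index=["a", "b", "c"]),
        "label": pd.Series(["x", "y", "z"], index=["a", "b", "c"]),
    }
)

print(df.index.tolist())    # row labels
print(df.columns.tolist())  # column labels
print(df.dtypes)            # one dtype per column
print(type(df["count"]))    # each column is a Series
```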
What are the significant features of the pandas Library?
- Memory efficient
- Data Alignment
- Reshaping
- Merge and join
- Time Series
What are the different ways a DataFrame can be created in Pandas?
- list
- dict of ndarrays
# Define by list
import pandas as pd
a =['Python', 'Pandas']
info = pd.DataFrame(a)
info
# Define by dict
import pandas as pd
info = {'ID': [101, 102, 103],
'Department': ['B.Sc', 'B.Tech', 'M.Tech']
}
info = pd.DataFrame(info)
info
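The list above mentions a dict of ndarrays, while the code uses a dict of plain lists. A minimal sketch with NumPy arrays might look like this; the "Score" column and the row labels are made up for the example.

```python
import numpy as np
import pandas as pd

# Each array becomes one column; all arrays must have the same length
arrays = {
    "ID": np.array([101, 102, 103]),
    "Score": np.array([88.5, 92.0, 79.5]),  # hypothetical values
}
info = pd.DataFrame(arrays, index=["r1", "r2", "r3"])  # optional custom row labels
print(info)
```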
Explain categorical data in Pandas
Categorical is a Pandas data type that corresponds to a categorical variable in statistics. A categorical variable takes a limited, and usually fixed, number of possible values.
Examples
- Gender, country affiliation, blood type, social class, observation time, or rating via Likert scales
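As a rough illustration of the examples above, here is a sketch that stores blood types as a category column and builds an ordered categorical for a Likert-style rating. The sample values and rating labels are assumptions made for this example.

```python
import pandas as pd

# Blood type takes a small, fixed set of values, so it suits the category dtype
blood = pd.Series(["A", "O", "B", "A", "AB", "O"], dtype="category")
print(blood.cat.categories.tolist())
print(blood.value_counts().to_dict())

# An ordered categorical, e.g. a Likert-style rating (labels are hypothetical)
rating = pd.Categorical(
    ["agree", "neutral", "disagree", "agree"],
    categories=["disagree", "neutral", "agree"],
    ordered=True,
)
print(rating.min(), rating.max())
```

Declaring the order explicitly is what lets min() and max() work on the rating labels.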
How will you create a series from dict in Pandas?
import pandas as pd
import numpy as np
info = {'X': 0., 'Y':1., 'Z':2. }
a = pd.Series(info)
a
How can we create a copy of a Series in Pandas?
- pandas.Series.copy
- Series.copy(deep=True)
- With deep=True (the default), the new Series gets its own copy of the data and index, so changing values in the copy does not change the original. Python objects stored inside the Series are not copied recursively, however; only the references to those objects are copied.
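Both halves of the deep=True behavior can be checked with a small sketch. The variable names are made up for this example.

```python
import pandas as pd

original = pd.Series([1, 2, 3])
copied = original.copy(deep=True)  # deep=True is the default
copied.iloc[0] = 100               # changes only the copy
print(original.tolist())
print(copied.tolist())

# Objects stored inside an object-dtype Series are not copied recursively
nested = pd.Series([[1, 2], [3, 4]])
nested_copy = nested.copy(deep=True)
nested_copy.iloc[0].append(99)     # mutates the list shared by both Series
print(nested.iloc[0])
```

The first part shows that plain values are independent after the copy; the second shows that a list held inside the Series is still the same Python object in both.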
How will you create an empty DataFrame in Pandas?
import pandas as pd
info = pd.DataFrame()
info
How will you add a column to a Pandas DataFrame?
We can add a new column to an existing DataFrame, as the code below shows.
import pandas as pd
info = {'one': pd.Series([1,2,3,4,5], index = ['a', 'b', 'c', 'd', 'e']),
'two': pd.Series([1,2,3,4,5,6], index = ['a', 'b', 'c', 'd', 'e', 'f'])}
info = pd.DataFrame(info)
info
info['three']= pd.Series([20, 40, 60], index = ['a', 'b', 'c'])
info
info['four'] = info['one'] + info['three']
info
How to add an index, row, or column to a Pandas DataFrame?
Adding an index to a DataFrame
- Pandas lets you pass values to the index argument when you create a DataFrame, which makes sure that you get the desired index.
- By default, pandas assigns an integer index starting from 0.
Adding rows to a DataFrame: we can use .loc to insert rows into the DataFrame; .iloc can overwrite rows at existing positions but cannot add new ones
- loc works on the labels of the index
- loc[4] => the rows of the DataFrame whose index label is 4
- iloc works on the positions in the index
- iloc[4] => the row of the DataFrame at position 4 (the fifth row)
Remember
- Select specific rows and/or columns using loc when using the row and column names
- Select specific rows and/or columns using iloc when using the positions in the table
- You can assign new values to a selection based on loc/iloc.
import pandas as pd
import numpy as np
data = pd.DataFrame({
'age' : [ 10, 22, 13, 21, 12, 11, 17],
'section' : [ 'A', 'B', 'C', 'B', 'B', 'A', 'A'],
'city' : [ 'Gurgaon', 'Delhi', 'Mumbai', 'Delhi', 'Mumbai', 'Delhi', 'Mumbai'],
'gender' : [ 'M', 'F', 'F', 'M', 'M', 'M', 'F'],
'favourite_color' : [ 'red', np.nan, 'yellow', np.nan, 'black', 'green', 'red']
})
data
data.loc[data.age >=15]
data.loc[(data.age>=12) & (data.gender == 'M')]
data.loc[1:3]
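The selections above only read data. To connect them with the notes on adding an index and rows, here is a minimal sketch in which labels and positions differ on purpose; the index labels 2, 4, 6, 8 and the ages are hypothetical.

```python
import pandas as pd

# Passing index= at creation time sets the row labels (labels are hypothetical)
df = pd.DataFrame({"age": [10, 22, 13]}, index=[2, 4, 6])

print(df.loc[4, "age"])  # label 4 is the second row
print(df.iloc[2, 0])     # position 2 is the third row (label 6)

df.loc[8] = [30]         # a label that does not exist yet: loc appends a new row
df.iloc[0] = [11]        # iloc overwrites an existing position; it cannot append
print(df)
```

Because the labels do not start at 0, loc[4] and iloc[4] would mean different things here: the first is the row labeled 4, the second would be a fifth row that does not exist.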
How to select a subset of a DataFrame?
data.iloc[0:3, 3:5] # iloc[row_start:row_end, col_start:col_end]; the end positions are exclusive
https://www.javatpoint.com/python-pandas-interview-questions
import pandas as pd
df1 =pd.DataFrame({
'name':['James', 'Jeff'],
'Rank': [3, 2]})
df2 =pd.DataFrame({
'name':['James', 'Jeff'],
'Rank': [1, 3]})
a= pd.merge(df1, df2, on = 'name')
a
| | name | Rank_x | Rank_y |
|---|---|---|---|
| 0 | James | 3 | 1 |
| 1 | Jeff | 2 | 3 |
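The Rank_x and Rank_y names in the result come from pandas' default suffixes for overlapping column names. If more descriptive names are wanted, the suffixes parameter of pd.merge can be used; the suffix strings below are arbitrary choices for this sketch.

```python
import pandas as pd

df1 = pd.DataFrame({"name": ["James", "Jeff"], "Rank": [3, 2]})
df2 = pd.DataFrame({"name": ["James", "Jeff"], "Rank": [1, 3]})

# suffixes replaces the default "_x" / "_y" on overlapping column names
merged = pd.merge(df1, df2, on="name", suffixes=("_first", "_second"))
print(merged.columns.tolist())
print(merged)
```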
pd.DataFrame(a)
| name | Rank |
|---|