A Series organizes sequence data such as arrays and lists. If no index labels are specified, an integer index starting from 0 is used automatically.
import pandas as pd
data = [1,3,5,7,9]
s = pd.Series(data)
s
0 1
1 3
2 5
3 7
4 9
dtype: int64
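To contrast the default integer index with explicit labels, here is a minimal sketch. The labels `a` to `e` are hypothetical and chosen only for illustration.

```python
import pandas as pd

data = [1, 3, 5, 7, 9]

# No index given: pandas assigns the integers 0..4
s_default = pd.Series(data)

# Explicit index labels (hypothetical names)
s_labeled = pd.Series(data, index=["a", "b", "c", "d", "e"])

print(s_default.index.tolist())  # [0, 1, 2, 3, 4]
print(s_labeled["c"])            # 5
```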
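A DataFrame handles tabular data with rows and columns. A dictionary whose keys are the column names and whose values are the column contents (one entry per row) is turned into a data structure with pd.DataFrame().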
import pandas as pd
data ={'year': [ 2016, 2017, 2018],
'GDP rate': [2.8, 3.1, 3.0],
'GDP': ['1.637M','1.73M', '1.83M']
}
df = pd.DataFrame(data)
df
| | year | GDP rate | GDP |
|---|---|---|---|
| 0 | 2016 | 2.8 | 1.637M |
| 1 | 2017 | 3.1 | 1.73M |
| 2 | 2018 | 3.0 | 1.83M |
** Panel: deprecated and since removed from pandas.
A 3-dimensional data structure with three axes: Axis0 (items), Axis1 (major_axis), and Axis2 (minor_axis).
Each element of Axis0 corresponds to a 2-dimensional DataFrame, Axis1 corresponds to the rows of that DataFrame, and Axis2 corresponds to its columns.
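Since Panel no longer exists, one possible way to hold the same three axes is a DataFrame with a two-level row index. The sketch below is only an illustration under that assumption; the tables `item_a` and `item_b` and all labels are hypothetical.

```python
import pandas as pd

# Two hypothetical 2-D tables; each would have been one "item" (Axis0) of a Panel
item_a = pd.DataFrame({"x": [1, 2], "y": [3, 4]}, index=["r0", "r1"])
item_b = pd.DataFrame({"x": [5, 6], "y": [7, 8]}, index=["r0", "r1"])

# Stack them under an outer index level that plays the role of Axis0
stacked = pd.concat({"A": item_a, "B": item_b}, names=["item", "row"])

# Selecting one item gives back an ordinary DataFrame (rows = Axis1, columns = Axis2)
print(stacked.loc["A"])

# A single cell is addressed by (item, row) and column
print(stacked.loc[("B", "r1"), "y"])  # 8
```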
Access data using indexing and attributes, e.g., df['year'], df[df['year'] > 2016].
df['year']
0 2016
1 2017
2 2018
Name: year, dtype: int64
df[df['year']>2016]
| | year | GDP rate | GDP |
|---|---|---|---|
| 1 | 2017 | 3.1 | 1.73M |
| 2 | 2018 | 3.0 | 1.83M |
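The filter above works in two steps: the comparison builds a boolean mask, and the mask selects rows. The sketch below spells this out on the same data and adds attribute access and `.loc` for comparison. Variable names such as `mask` and `recent` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "year": [2016, 2017, 2018],
    "GDP rate": [2.8, 3.1, 3.0],
    "GDP": ["1.637M", "1.73M", "1.83M"],
})

# Attribute access works only when the column name is a valid Python identifier
years = df.year          # same as df["year"]
rates = df["GDP rate"]   # a name containing a space needs brackets

# The comparison yields one True/False per row
mask = df["year"] > 2016
recent = df[mask]

# .loc combines the row mask with a column selection
recent_rates = df.loc[mask, ["year", "GDP rate"]]
print(recent_rates)
```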
df.head()
| | year | GDP rate | GDP |
|---|---|---|---|
| 0 | 2016 | 2.8 | 1.637M |
| 1 | 2017 | 3.1 | 1.73M |
| 2 | 2018 | 3.0 | 1.83M |
df.describe()
| | year | GDP rate |
|---|---|---|
| count | 3.0 | 3.000000 |
| mean | 2017.0 | 2.966667 |
| std | 1.0 | 0.152753 |
| min | 2016.0 | 2.800000 |
| 25% | 2016.5 | 2.900000 |
| 50% | 2017.0 | 3.000000 |
| 75% | 2017.5 | 3.050000 |
| max | 2018.0 | 3.100000 |
df.sum()
year 6051
GDP rate 8.9
GDP 1.637M1.73M1.83M
dtype: object
df.mean(numeric_only=True)  # skips the string column GDP; pandas 2.0+ raises a TypeError without it
year 2017.000000
GDP rate 2.966667
dtype: float64
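The `GDP` column holds strings, which is why `sum()` concatenates it and the mean leaves it out. A minimal sketch of converting it to numbers follows. It assumes every value is a decimal followed by a single trailing `M`; the source does not say what the `M` stands for, so it is simply stripped.

```python
import pandas as pd

df = pd.DataFrame({
    "year": [2016, 2017, 2018],
    "GDP rate": [2.8, 3.1, 3.0],
    "GDP": ["1.637M", "1.73M", "1.83M"],
})

# Assumption: each value is "<number>M"; strip the suffix and parse as float
gdp_value = df["GDP"].str.rstrip("M").astype(float)

print(gdp_value.dtype)  # float64
print(gdp_value.sum())  # numeric sum instead of string concatenation
```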
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 year 3 non-null int64
1 GDP rate 3 non-null float64
2 GDP 3 non-null object
dtypes: float64(1), int64(1), object(1)
memory usage: 200.0+ bytes
pandas provides functions to read and write data from various external resources, such as CSV files, text files, Excel files, SQL databases, and the HDF5 format.
import pandas as pd
df = pd.read_csv('/Users/catherine/Desktop/grade.csv')
df
| | id | Korean | English | Math |
|---|---|---|---|---|
| 0 | 1 | 80 | 85 | 75 |
| 1 | 2 | 90 | 100 | 95 |
| 2 | 3 | 75 | 70 | 65 |
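The cell above depends on a local file. As a self-contained sketch of both reading and writing, the same grade table can be round-tripped through an in-memory buffer. `io.StringIO` stands in for the file here, and `csv_text` is reconstructed from the table shown above.

```python
import io
import pandas as pd

# The grade table from above as CSV text (stand-in for grade.csv)
csv_text = "id,Korean,English,Math\n1,80,85,75\n2,90,100,95\n3,75,70,65\n"

# Read: read_csv accepts a path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

# Write: with no path, to_csv returns the CSV as a string;
# index=False leaves out the 0, 1, 2 row labels
out = df.to_csv(index=False)

# Reading the written text back gives an equal DataFrame
round_trip = pd.read_csv(io.StringIO(out))
print(round_trip.equals(df))  # True
```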
%matplotlib inline
import pandas as pd
from matplotlib import pyplot as plt
df = pd.read_csv('/Users/catherine/Desktop/grade.csv')
plt.bar(df.id, df['English'])
plt.show()

df
| | id | Korean | English | Math |
|---|---|---|---|---|
| 0 | 1 | 80 | 85 | 75 |
| 1 | 2 | 90 | 100 | 95 |
| 2 | 3 | 75 | 70 | 65 |
df.iloc[0:, 1:5]
| | Korean | English | Math |
|---|---|---|---|
| 0 | 80 | 85 | 75 |
| 1 | 90 | 100 | 95 |
| 2 | 75 | 70 | 65 |
df.iloc[0:, 1:5].plot.bar()
<AxesSubplot:>

df.plot.bar()
<AxesSubplot:>

import seaborn as sns
%matplotlib inline
df.corr()
| | id | Korean | English | Math |
|---|---|---|---|---|
| id | 1.000000 | -0.327327 | -0.500000 | -0.327327 |
| Korean | -0.327327 | 1.000000 | 0.981981 | 1.000000 |
| English | -0.500000 | 0.981981 | 1.000000 | 0.981981 |
| Math | -0.327327 | 1.000000 | 0.981981 | 1.000000 |
a= df.iloc[0:, 1:5].corr()
a
| | Korean | English | Math |
|---|---|---|---|
| Korean | 1.000000 | 0.981981 | 1.000000 |
| English | 0.981981 | 1.000000 | 0.981981 |
| Math | 1.000000 | 0.981981 | 1.000000 |
sns.heatmap(a, cmap = 'coolwarm', annot = True)
<AxesSubplot:>
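To show what the numbers in the correlation table mean, the sketch below computes the Pearson coefficient by hand for the same scores and compares it with `corr()`. The helper `pearson` is illustrative, not part of pandas.

```python
import numpy as np
import pandas as pd

# Scores taken from the grade table above
scores = pd.DataFrame({
    "Korean": [80, 90, 75],
    "English": [85, 100, 70],
    "Math": [75, 95, 65],
})

def pearson(x, y):
    """Pearson correlation: co-deviation divided by the product of the spreads."""
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

by_hand = pearson(scores["Korean"], scores["English"])
by_pandas = scores.corr().loc["Korean", "English"]
print(round(by_hand, 6), round(by_pandas, 6))  # 0.981981 0.981981
```

Korean and Math correlate at exactly 1.0 because each Math deviation from its mean is twice the corresponding Korean deviation.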

Pandas is an open-source library that provides high-performance data manipulation in Python. The name derives from "panel data", an econometrics term for multidimensional data sets.
import pandas as pd
import numpy as np
df = pd.Series(np.random.randint(0,7, size =10))
df
df.std() # std() returns the standard deviation of a Series, or of each column (or row) of a DataFrame
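One detail worth knowing is that pandas computes the sample standard deviation (`ddof=1`) by default, while NumPy defaults to the population form (`ddof=0`). The sketch below uses fixed, illustrative values rather than random ones so the result can be checked by hand.

```python
import numpy as np
import pandas as pd

# Fixed values: mean is 5, squared deviations sum to 32
s = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

sample_std = s.std()              # pandas default ddof=1: sqrt(32 / 7)
population_std = s.std(ddof=0)    # divide by n instead: sqrt(32 / 8)
numpy_std = np.std(s.to_numpy())  # NumPy default ddof=0

print(population_std)  # 2.0
print(sample_std)      # slightly larger than 2.0
```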
A DataFrame is a widely used pandas data structure: a two-dimensional array with labeled axes (rows and columns). It is the standard way to store data and has two different indexes, i.e., a row index and a column index. The cells that follow define one from a list and from a dict.
# Define by list
import pandas as pd
a =['Python', 'Pandas']
info = pd.DataFrame(a)
info
# Define by dict
import pandas as pd
info = {'ID': [101, 102, 103],
'Department': ['B.Sc', 'B.Tech', 'M.Tech']
}
info = pd.DataFrame(info)
info
Categorical data is a pandas data type that corresponds to a categorical variable in statistics. A categorical variable takes a limited, and usually fixed, number of possible values.
Example
import pandas as pd
import numpy as np
info = {'X': 0., 'Y':1., 'Z':2. }
a = pd.Series(info)
a
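The Series above holds plain floats. For the categorical data type described earlier, a minimal sketch follows. The section labels are illustrative values, and `dtype="category"` is one of several ways to request the type.

```python
import pandas as pd

# A column with only a few distinct values is a natural fit for the category dtype
sections = pd.Series(["A", "B", "C", "B", "B", "A", "A"], dtype="category")

# The distinct values are stored once ...
print(sections.cat.categories.tolist())  # ['A', 'B', 'C']

# ... and each element is stored as a small integer code pointing at them
print(sections.cat.codes.tolist())       # [0, 1, 2, 1, 1, 0, 0]

print(sections.value_counts().to_dict())
```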
import pandas as pd
info = pd.DataFrame()
info
We can add a new column to an existing DataFrame, either from a new Series or from existing columns. See the code below.
import pandas as pd
info = {'one': pd.Series([1,2,3,4,5], index = ['a', 'b', 'c', 'd', 'e']),
'two': pd.Series([1,2,3,4,5,6], index = ['a', 'b', 'c', 'd', 'e', 'f'])}
info = pd.DataFrame(info)
info
info['three']= pd.Series([20, 40, 60], index = ['a', 'b', 'c'])
info
info['four'] = info['one'] + info['three']
info
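The new columns are aligned on the index labels, not on position, so rows without a matching label get NaN. The sketch below repeats the cells above and makes that visible. The `four_filled` column and the use of `fill_value=0` are additions for illustration.

```python
import pandas as pd

info = pd.DataFrame({
    "one": pd.Series([1, 2, 3, 4, 5], index=list("abcde")),
    "two": pd.Series([1, 2, 3, 4, 5, 6], index=list("abcdef")),
})
info["three"] = pd.Series([20, 40, 60], index=["a", "b", "c"])

# 'three' has no values for d, e, f, so the sum is NaN there
info["four"] = info["one"] + info["three"]
print(info["four"].tolist())  # [21.0, 42.0, 63.0, nan, nan, nan]

# add() with fill_value treats a value missing on one side as 0;
# row f stays NaN because both 'one' and 'three' are missing there
info["four_filled"] = info["one"].add(info["three"], fill_value=0)
print(info["four_filled"].tolist())
```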
Adding an index to a DataFrame
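Two common ways to give a DataFrame an index are sketched below, reusing the ID/Department table from earlier. The row labels `x`, `y`, `z` are hypothetical, and this is only one reading of what "adding an index" refers to here.

```python
import pandas as pd

info = pd.DataFrame({
    "ID": [101, 102, 103],
    "Department": ["B.Sc", "B.Tech", "M.Tech"],
})

# Option 1: promote an existing column to the index (returns a new DataFrame)
by_id = info.set_index("ID")
print(by_id.loc[102, "Department"])  # B.Tech

# Option 2: assign row labels directly (hypothetical labels)
labeled = info.copy()
labeled.index = ["x", "y", "z"]
print(labeled.loc["z", "ID"])        # 103
```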
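Adding rows to a DataFrame: assigning to a new label with .loc inserts a row, while .iloc can only address positions that already exist. Both are also used to select rows and columns, as the cells below do.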
import pandas as pd
import numpy as np
data = pd.DataFrame({
'age' : [ 10, 22, 13, 21, 12, 11, 17],
'section' : [ 'A', 'B', 'C', 'B', 'B', 'A', 'A'],
'city' : [ 'Gurgaon', 'Delhi', 'Mumbai', 'Delhi', 'Mumbai', 'Delhi', 'Mumbai'],
'gender' : [ 'M', 'F', 'F', 'M', 'M', 'M', 'F'],
'favourite_color' : [ 'red', np.nan, 'yellow', np.nan, 'black', 'green', 'red']
})
data
data.loc[data.age >=15]
data.loc[(data.age>=12) & (data.gender == 'M')]
data.loc[1:3]
data.iloc[0:3, 3:5] # iloc[row start:row end, column start:column end]; end positions are excluded
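The cells above select rows. For actually adding rows, a minimal sketch follows, using a shortened two-column version of the table. The added values and the `extra` frame are illustrative.

```python
import pandas as pd

data = pd.DataFrame({
    "age": [10, 22, 13],
    "section": ["A", "B", "C"],
})

# Assigning to a label that is not yet in the index appends a row
data.loc[len(data)] = [21, "B"]
print(len(data))  # 4

# For several rows at once, build a second DataFrame and concatenate;
# ignore_index=True renumbers the rows 0..n-1
extra = pd.DataFrame({"age": [12, 11], "section": ["B", "A"]})
combined = pd.concat([data, extra], ignore_index=True)
print(len(combined))  # 6
```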
https://www.javatpoint.com/python-pandas-interview-questions
import pandas as pd
df1 =pd.DataFrame({
'name':['James', 'Jeff'],
'Rank': [3, 2]})
df2 =pd.DataFrame({
'name':['James', 'Jeff'],
'Rank': [1, 3]})
a= pd.merge(df1, df2, on = 'name')
a
| | name | Rank_x | Rank_y |
|---|---|---|---|
| 0 | James | 3 | 1 |
| 1 | Jeff | 2 | 3 |
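The `_x` and `_y` suffixes appear because both inputs have a `Rank` column. The sketch below shows how `suffixes` renames them and how `how="outer"` keeps names found in only one input. The suffix strings and the frame `df3` are hypothetical.

```python
import pandas as pd

df1 = pd.DataFrame({"name": ["James", "Jeff"], "Rank": [3, 2]})
df2 = pd.DataFrame({"name": ["James", "Jeff"], "Rank": [1, 3]})

# Custom suffixes replace the default _x / _y on the clashing column
merged = pd.merge(df1, df2, on="name", suffixes=("_first", "_second"))
print(merged.columns.tolist())  # ['name', 'Rank_first', 'Rank_second']

# Hypothetical second table that shares only one name with df1
df3 = pd.DataFrame({"name": ["James", "Ann"], "Rank": [1, 4]})

# The default how="inner" keeps only James; how="outer" keeps every name
# and fills the missing side with NaN
outer = pd.merge(df1, df3, on="name", how="outer")
print(outer)
```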
pd.DataFrame(a)
| name | Rank |
|---|---|