Artificial Intelligence in Plain English

A Comprehensive Guide To Using Pandas For Data Science

What is Pandas?

How to install Pandas?

pip install pandas

conda install pandas
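
After either command finishes, a quick way to confirm the installation is to import the library and print its version. This is only a minimal check; the version string you see depends on your environment.

```python
import pandas as pd

# If the import succeeds, pandas is installed; the exact version will vary.
print(pd.__version__)
```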

What is the Series datatype in Pandas?

>>> import numpy as np
>>> import pandas as pd
>>> headings = ['a', 'b', 'c']
>>> list1 = [11,22,33]
>>> arr1 = np.array(list1)
>>> dict1 = {'a':11, 'b':22, 'c':33}
>>> headings
['a', 'b', 'c']
>>> list1
[11, 22, 33]
>>> arr1
array([11, 22, 33])
>>> dict1
{'a': 11, 'b': 22, 'c': 33}
>>> pd.Series(data = list1)
0 11
1 22
2 33
dtype: int64
>>> pd.Series(arr1)
0 11
1 22
2 33
dtype: int32
>>> pd.Series(dict1)
a 11
b 22
c 33
dtype: int64
>>> pd.Series(headings)
0 a
1 b
2 c
dtype: object
>>> pd.Series(data = list1, index = headings)
a 11
b 22
c 33
dtype: int64
>>> series1 = pd.Series(dict1)
>>> series1
a 11
b 22
c 33
dtype: int64
>>> series1['a']
11
>>> dict2 = {'a':1, 'b':2, 'd':3}
>>> series2 = pd.Series(dict2)
>>> series2
a 1
b 2
d 3
dtype: int64
>>> series1
a 11
b 22
c 33
dtype: int64
>>> series1 + series2
a 12.0
b 24.0
c NaN
d NaN
dtype: float64
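
The last result shows that adding two Series aligns them on index labels, and any label present in only one of them becomes NaN. As a sketch of one way around that (the variable names `plain_sum` and `filled_sum` are mine, not from the article), `Series.add` accepts a `fill_value` that stands in for the missing side:

```python
import pandas as pd

series1 = pd.Series({'a': 11, 'b': 22, 'c': 33})
series2 = pd.Series({'a': 1, 'b': 2, 'd': 3})

# The + operator aligns on labels; 'c' and 'd' exist on one side only, so they become NaN.
plain_sum = series1 + series2

# Series.add with fill_value=0 treats a label missing from one Series as 0.
filled_sum = series1.add(series2, fill_value=0)
print(filled_sum)
```

Whether 0 is a sensible stand-in depends on the data; for some problems keeping the NaN is the more honest answer.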

What are DataFrames in Pandas?

>>> from numpy.random import randn
>>> dataframe = pd.DataFrame(randn(3,2), ['a','b','c'], ['y', 'z'])
>>> dataframe
>>> type(dataframe)
pandas.core.frame.DataFrame
>>> dataframe['z']
a -0.492404
b -0.585436
c -0.892137
Name: z, dtype: float64
>>> type(dataframe['z'])
pandas.core.series.Series
>>> dataframe.z
a -0.492404
b -0.585436
c -0.892137
Name: z, dtype: float64
>>> dataframe[['y','z']]
>>> dataframe['new_col'] = dataframe['y'] - dataframe['z']
>>> dataframe
>>> dataframe.drop('new_col', axis = 1)
>>> dataframe
>>> dataframe.drop('new_col', axis = 1, inplace = True)
>>> dataframe
>>> dataframe.drop('c')
>>> dataframe.loc['c']
y -0.557171
z -0.892137
Name: c, dtype: float64
>>> type(dataframe.loc['c'])
pandas.core.series.Series
>>> dataframe.iloc[2]
y -0.557171
z -0.892137
Name: c, dtype: float64
>>> dataframe.loc['a','z']
-0.4924042937068482
>>> dataframe.loc[['b','c'],['y','z']]
>>> bool_df = dataframe > 0
>>> bool_df
>>> dataframe[bool_df]
>>> dataframe[dataframe<0]
>>> dataframe['y'] > 0
a True
b True
c False
Name: y, dtype: bool
>>> dataframe[dataframe['y']>0]
>>> dataframe[dataframe['z']<0]['y']
a 0.079911
b 0.561538
c -0.557171
Name: y, dtype: float64
>>> dataframe[(dataframe['y']>0) & (dataframe['z']<0)]
>>> dataframe
>>> dataframe.reset_index()
>>> new_index = 'aa bb cc'.split()
>>> new_index
['aa', 'bb', 'cc']
>>> dataframe['new_col'] = new_index
>>> dataframe
>>> dataframe.set_index('new_col')
>>> list1 = ['a', 'a', 'a', 'b', 'b', 'b']
>>> list2 = [11,22,33,11,22,33]
>>> index_level = list(zip(list1,list2))
>>> index_level
[('a', 11), ('a', 22), ('a', 33), ('b', 11), ('b', 22), ('b', 33)]
>>> index_level = pd.MultiIndex.from_tuples(index_level)
>>> index_level
MultiIndex([('a', 11),
            ('a', 22),
            ('a', 33),
            ('b', 11),
            ('b', 22),
            ('b', 33)],
           )
>>> dataframe2 = pd.DataFrame(randn(6,2), index_level, ['Y','Z'])
>>> dataframe2
>>> dataframe2.loc['a']
>>> dataframe2.loc['a'].loc[22]
Y 0.391026
Z -0.522579
Name: 22, dtype: float64
>>> dataframe2.index.names
FrozenList([None, None])
>>> dataframe2.index.names = ['outside', 'inside']
>>> dataframe2
>>> dataframe2.xs('a')
>>> dataframe2.xs(22, level = 'inside')

How to deal with missing data using Pandas?

>>> dataframe3 = {'x':[11,22,np.nan], 'y':[33, np.nan,np.nan], 'z':[44,55,66]}
>>> df = pd.DataFrame(dataframe3)
>>> df
>>> df.dropna()
>>> df.dropna(axis=1)
>>> df.dropna(thresh=2)
>>> df.fillna(value='A value is filled')
>>> df['x'].fillna(value=df['x'].mean())
0 11.0
1 22.0
2 16.5
Name: x, dtype: float64
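
The calls above can be combined into a short workflow: count the gaps, drop rows that are too sparse, then fill what remains. The following is a sketch on the same data; filling with the column mean is just one possible strategy, and the variable names are mine.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [11, 22, np.nan],
                   'y': [33, np.nan, np.nan],
                   'z': [44, 55, 66]})

# Number of missing values in each column.
missing_per_column = df.isnull().sum()

# thresh=2 keeps only rows that have at least two non-missing values.
kept = df.dropna(thresh=2)

# Passing a Series to fillna fills each column with its own value, here the column mean.
filled = df.fillna(df.mean())
print(filled)
```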

How to group data using Pandas?

>>> data = {'Department':['COMP', 'COMP', 'IT', 'IT', 'ELEC', 'ELEC'],'Name':['Jay', 'Mini', 'Sam', 'Jack', 'Kev', 'Russ'],'Scores':[111,222,333,444,555,666]}
>>> dataframe4 = pd.DataFrame(data)
>>> dataframe4
>>> dataframe4.groupby('Department')
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001A41A043AC8>
>>> Dept = dataframe4.groupby('Department')
>>> Dept.mean(numeric_only=True)
>>> Dept.sum()
>>> Dept.std(numeric_only=True)
>>> Dept.std(numeric_only=True).loc['IT']
Scores 78.488853
Name: IT, dtype: float64
>>> Dept.count()
>>> Dept.max()
>>> Dept.describe()
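
Instead of calling each aggregation separately, several can be computed in one pass by selecting the numeric column first and passing a list to `agg`. This is a sketch on the same data; the name `summary` is mine.

```python
import pandas as pd

data = {'Department': ['COMP', 'COMP', 'IT', 'IT', 'ELEC', 'ELEC'],
        'Name': ['Jay', 'Mini', 'Sam', 'Jack', 'Kev', 'Russ'],
        'Scores': [111, 222, 333, 444, 555, 666]}
dataframe4 = pd.DataFrame(data)

# Selecting 'Scores' first keeps the text column 'Name' out of the numeric aggregations.
summary = dataframe4.groupby('Department')['Scores'].agg(['mean', 'sum', 'max'])
print(summary)
```

Selecting the column up front also avoids errors in recent pandas versions, which refuse to average a text column such as `Name`.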

How to combine dataframes using Pandas?

>>> df1 = pd.DataFrame({'A':['a0','a1','a2','a3'], 'B':['b0','b1','b2','b3'],'C':['c0','c1','c2','c3'], 'D':['d0','d1','d2','d3']},index=[0,1,2,3])
>>> df1
>>> df2 = pd.DataFrame({'A':['a4','a5','a6','a7'], 'B':['b4','b5','b6','b7'],'C':['c4','c5','c6','c7'], 'D':['d4','d5','d6','d7']},index=[4,5,6,7])
>>> df2
>>> pd.concat([df1,df2])
>>> pd.concat([df1,df2], axis=1)
>>> df4 = pd.DataFrame({'value':['v0','v1','v2','v3'],'a':['a0','a1','a2','a3'],'b':['b0','b1','b2','b3']})
>>> df4
>>> df5 = pd.DataFrame({'value':['v0','v1','v2','v3'],'c':['c0','c1','c2','c3'],'d':['d0','d1','d2','d3']})
>>> df5
>>> pd.merge(df4,df5,how='inner',on='value')
>>> df6 = pd.DataFrame({'a':['a0','a1','a2'],'b':['b0','b1','b2']},index=[0,1,2])
>>> df6
>>> df7 = pd.DataFrame({'c':['c0','c2','c3'],'d':['d0','d2','d3']},
...                    index=[0,2,3])
>>> df7
>>> df6.join(df7)
>>> df7.join(df6)
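
`join` matches rows on the index and, by default, keeps every row of the DataFrame it is called on, which is why `df6.join(df7)` and `df7.join(df6)` differ. As a sketch using the same two frames, the `how` parameter controls which index labels survive:

```python
import pandas as pd

df6 = pd.DataFrame({'a': ['a0', 'a1', 'a2'], 'b': ['b0', 'b1', 'b2']},
                   index=[0, 1, 2])
df7 = pd.DataFrame({'c': ['c0', 'c2', 'c3'], 'd': ['d0', 'd2', 'd3']},
                   index=[0, 2, 3])

# Default (how='left'): keep df6's index; label 1 has no match in df7, so c and d are NaN there.
left_join = df6.join(df7)

# how='inner': keep only labels present in both frames.
inner_join = df6.join(df7, how='inner')

# how='outer': keep labels present in either frame.
outer_join = df6.join(df7, how='outer')
print(outer_join)
```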

What different operations can be performed on a Pandas DataFrame?

>>> df8 = pd.DataFrame({'a':[11,22,33,22],'b':[1,2,3,4],'c':['aa','bb','cc','dd']})
>>> df8
>>> df8['a'].unique()
array([11, 22, 33], dtype=int64)
>>> df8['a'].nunique()
3
>>> df8['a'].value_counts()
22 2
11 1
33 1
Name: a, dtype: int64
>>> def func(x):
...     return x+x
...
>>> df8['b'].apply(func)
0 2
1 4
2 6
3 8
Name: b, dtype: int64
>>> df8['c'].apply(len)
0 2
1 2
2 2
3 2
Name: c, dtype: int64
>>> df8['a'].apply(lambda x: x*x)
0 121
1 484
2 1089
3 484
Name: a, dtype: int64
>>> df8.columns
Index(['a', 'b', 'c'], dtype='object')
>>> df8.index
RangeIndex(start=0, stop=4, step=1)
>>> df8.sort_values('a')
>>> df8.isnull()
>>> df8.isnull().sum()
a 0
b 0
c 0
dtype: int64
>>> df9 = pd.DataFrame({'a':['jam','jam','jam','milk','milk','milk'],'b':['red','red','white','white','red','red'],'c':['yes','no','yes','no','yes','no'],'d':[11,33,22,55,44,11]})
>>> df9
>>> df9.pivot_table(values='d', index=['a','b'], columns='c')
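
In the pivot table above, combinations of `a`, `b` and `c` that never occur in `df9` show up as NaN. The sketch below rebuilds the same table and then passes `fill_value=0` to replace those gaps; whether 0 is the right placeholder is an assumption that depends on what the numbers mean.

```python
import pandas as pd

df9 = pd.DataFrame({'a': ['jam', 'jam', 'jam', 'milk', 'milk', 'milk'],
                    'b': ['red', 'red', 'white', 'white', 'red', 'red'],
                    'c': ['yes', 'no', 'yes', 'no', 'yes', 'no'],
                    'd': [11, 33, 22, 55, 44, 11]})

# Rows are (a, b) pairs, columns are the values of c, cells hold the mean of d.
pivot = df9.pivot_table(values='d', index=['a', 'b'], columns='c')

# Same table, with missing combinations shown as 0 instead of NaN.
pivot_filled = df9.pivot_table(values='d', index=['a', 'b'], columns='c',
                               fill_value=0)
print(pivot_filled)
```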

How to read and write data for different file formats using Pandas?

pip install sqlalchemy

conda install sqlalchemy

>>> df10 = pd.read_csv('data.csv')
>>> df10
>>> df10.to_csv('data_output')
>>> pd.read_csv('data_output')
>>> df10.to_csv('data_output',index=False)
>>> pd.read_csv('data_output')
>>> df11 = pd.read_excel('dataset.xlsx')
>>> df11
>>> df11.to_excel('dataset_output.xlsx',sheet_name='dataset sheet')
>>> data = pd.read_html('https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers')
>>> type(data)
list
>>> data[0]
>>> from sqlalchemy import create_engine
>>> sql_engine = create_engine('sqlite:///:memory:')
>>> df10.to_sql('sql_data',sql_engine)
>>> sql_df = pd.read_sql('sql_data',sql_engine)
>>> sql_df
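
The files used above (`data.csv`, `dataset.xlsx`) are not included here, so the following sketch reproduces the CSV and SQL round trips with a small made-up DataFrame and in-memory targets. It swaps in the standard-library `sqlite3` connection in place of the SQLAlchemy engine, which pandas also accepts for SQLite; the data and variable names are hypothetical.

```python
import io
import sqlite3

import pandas as pd

# Hypothetical stand-in for the contents of data.csv.
df10 = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})

# With the default index=True, the index is written as an extra unnamed column.
buffer_with_index = io.StringIO()
df10.to_csv(buffer_with_index)
buffer_with_index.seek(0)
with_index = pd.read_csv(buffer_with_index)

# With index=False, the file contains only the real columns.
buffer_no_index = io.StringIO()
df10.to_csv(buffer_no_index, index=False)
buffer_no_index.seek(0)
no_index = pd.read_csv(buffer_no_index)

# Write to an in-memory SQLite database and read the table back with a query.
conn = sqlite3.connect(':memory:')
df10.to_sql('sql_data', conn, index=False)
sql_df = pd.read_sql('SELECT * FROM sql_data', conn)
conn.close()

print(list(with_index.columns))
print(list(no_index.columns))
```

The extra column in `with_index` is the reason the article writes the CSV a second time with `index=False`.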

For more detailed information on Pandas, check the official documentation here.

Refer to the notebook for code here.

Books to refer to:
