While the numpy library provides strong numerical operations for multidimensional arrays, pandas focuses on data analysis. It is a collection of powerful tools for importing, rearranging, analyzing, and exporting tabular data sets.
pandas documentation: "pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis/manipulation tool available in any language. It is already well on its way toward this goal.“
Pandas uses the core functionality of numpy to handle its prominent data structure called the DataFrame
(an concept which originated in R). Pandas provides a wide range of functions for data analysis like importing and exporting, viewing, selection, indexing, handling missing, statistical analyses, merging, grouping, reshaping, handling time series of data and much more. It also enables plotting via the matplotlib library.
It is common to use the abbreviation pd
for pandas.
import numpy as np
import pandas as pd
The pandas.DataFrame is the central data structure of pandas. Data coming from for example Excel or other SQL(-like) structures can be easily converted to a pandas DataFrame. First, data is created manually to discuss the basic features.
# Construct DataFrame from lists or arrays
names = ['Jon','Tim','Tom','Jan','Sven']
ages = np.random.randint(20,30,5)
children = np.random.randint(0,3,5)
# Combine all lists
data = np.array([names,ages,children]).transpose()
part_data = np.array([ages,children]).transpose()
df = pd.DataFrame(data=data)
df
The DataFrame has a table-like structure, and is in many ways similar to a relational database table. In the example above, columns and rows are created with a default integer index and the given column names, respectively. The labels of the columns and rows (index
) can be included during constructing or set afterwards.
df.columns = ['names','ages','children']
df
# Set index from column name
df.set_index('names')
# Construct DataFrame with column name and index
df = pd.DataFrame(data=part_data, index=names, columns=['ages','children'])
df
# Construct DataFram from dictionary
data = {
'ages': ages,
'children': children,
'cousins': np.random.randint(0,3,5)
}
df = pd.DataFrame(data=data, index=names)
df
There are several ways to select single rows or single columns by the index or its column name, respectively. Selection rows via use of []
or loc[]
are the easiest ways.
# Selection of a single row
df[1:2]
df.loc['Tim']
# Use additional [] to return a new DataFrame
df.loc[['Tim']]
# Selection of a range of rows
df[1:3]
df['Tim':'Tom']
For selection of columns there are two ways. When having a 'space' in a column name, only the second one is usable.
df.ages # not possible for column names with spaces
df[['ages']]
df[['ages','cousins']]
Combination for selecting rows and columns.
df.loc['Tom':'Jan', ['ages', 'cousins']]
As for lists and arrays in numpy boolean indexing is possible, too.
df[df['ages'] > 25]
Many methods for data IO are included in pandas, and standard file type are supported. A full list of them can be found here. Some supported file types are:
read_csv()
, to_csv()
)read_json()
, to_json()
)read_excel()
, to_excel()
)read_sql()
, to_sql()
)read_picle()
, to_pickle()
)The example shows the basic principle of importing and exporting data from/to files. In the case of text-based file types (e.g. csv) one important parameter to set is the separator (sep
or delimiter
). This can probably interfere with the decimal separator and should be checked. In the case of an Excel file sheet name has to be set with sheet_name
.
# Export csv and excel files
df.to_csv("data.csv", sep=',')
df.to_excel("data.xls", sheet_name="data")
# Import with setting the first column (index_col=0) as index or the sheet_name, respectively
df_csv_import = pd.read_csv("data.csv", delimiter=',', index_col=0)
df_excel_import = pd.read_excel("data.xls", sheet_name="data")
df_csv_import
df_excel_import
To get used to more standard functionality of pandas we are now going to work with an example data set. The data set is of roughly 5000 different white wine samples (Wine Quality Data Set). The different columns are chemical and physical attributes. In addition a quality score is given.
WINE_COLOR = 'white' #red
df = pd.read_csv(f'../.assets/data/winequality/{WINE_COLOR}.csv.zip', sep=';')
# Data set size: ~5000 (~1600) different white (red) wines with 12 different attributes
df.shape
# Showing column names
df.columns
# It is a default integer index
df.index
# show first 5 rows
df.head(5)
# show last 5 rows
df.tail(5)
# Show a sample of 4 rows
df.sample(4)
# `.describe()` yields a short statistical summary
df.describe()
# sort data set by column `pH`
df.sort_values(by='pH', ascending=False).head()
# select a set of columns and look at first rows
df[['fixed acidity', 'citric acid', 'quality']].head()
# show unique values in a column
df['quality'].unique()
# group by values in `quality` and count entries
df.groupby('quality')['quality'].count()
# filter the data by values in a column
df[df['alcohol'] > 10.0].head()
# method calls can be chained to apply several operations in a row
df[df['alcohol'] > 10.0].sort_values(by='alcohol', ascending=False).head()
# Try exploring the data set with pandas operations
As mentioned in the beginning basic plotting functionality comes already with pandas by using the matplotlib library internally. For example histograms, scatter plots or bar plots can be directly applied on the DataFrame object using the pandas.plot
module. With additional parameters e.g. labels, the size, colors and more can be set.
%matplotlib inline
# This will return a matplotlib object
df['pH'].plot.hist();
# Scatter plot of two columns
df.plot.scatter('alcohol', 'residual sugar', s=5, alpha=0.1);
# Bar plot of the counted quality entries
df.groupby('quality')['quality'].count().plot.bar();
This notebook is licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Copyright © 2018 Point 8 GmbH