Data Handling with Pandas¶

While the numpy library provides strong numerical operations for multidimensional arrays, pandas focuses on data analysis. It is a collection of powerful tools for importing, rearranging, analyzing, and exporting tabular data sets.

pandas documentation: "pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis/manipulation tool available in any language. It is already well on its way toward this goal.“

Pandas uses the core functionality of numpy to handle its prominent data structure called the DataFrame (an concept which originated in R). Pandas provides a wide range of functions for data analysis like importing and exporting, viewing, selection, indexing, handling missing, statistical analyses, merging, grouping, reshaping, handling time series of data and much more. It also enables plotting via the matplotlib library.

Import of Pandas¶

It is common to use the abbreviation pd for pandas.

import numpy as np
import pandas as pd

pandas.DataFrame¶

The pandas.DataFrame is the central data structure of pandas. Data coming from for example Excel or other SQL(-like) structures can be easily converted to a pandas DataFrame. First, data is created manually to discuss the basic features.

# Construct DataFrame from lists or arrays
names = ['Jon','Tim','Tom','Jan','Sven']
ages = np.random.randint(20,30,5)
children = np.random.randint(0,3,5)

# Combine all lists
data = np.array([names,ages,children]).transpose()
part_data = np.array([ages,children]).transpose()

df = pd.DataFrame(data=data)

df

The DataFrame has a table-like structure, and is in many ways similar to a relational database table. In the example above, columns and rows are created with a default integer index and the given column names, respectively. The labels of the columns and rows (index) can be included during constructing or set afterwards.

df.columns = ['names','ages','children']
df

# Set index from column name
df.set_index('names')

# Construct DataFrame with column name and index
df = pd.DataFrame(data=part_data, index=names, columns=['ages','children'])
df

# Construct DataFram from dictionary
data = {
    'ages': ages,
    'children': children,
    'cousins': np.random.randint(0,3,5)
}

df = pd.DataFrame(data=data, index=names)
df

Selection of rows and columns¶

There are several ways to select single rows or single columns by the index or its column name, respectively. Selection rows via use of [] or loc[] are the easiest ways.

 # Selection of a single row
df[1:2]

df.loc['Tim']

ages        28
children     0
cousins      1
Name: Tim, dtype: int64

# Use additional [] to return a new DataFrame
df.loc[['Tim']]

# Selection of a range of rows
df[1:3]

df['Tim':'Tom']

For selection of columns there are two ways. When having a 'space' in a column name, only the second one is usable.

df.ages # not possible for column names with spaces

Jon     29
Tim     28
Tom     24
Jan     21
Sven    20
Name: ages, dtype: int64

df[['ages']]

df[['ages','cousins']]

Combination for selecting rows and columns.

df.loc['Tom':'Jan', ['ages', 'cousins']]

As for lists and arrays in numpy boolean indexing is possible, too.

df[df['ages'] > 25]

Import and Output data to and from pandas¶

Many methods for data IO are included in pandas, and standard file type are supported. A full list of them can be found here. Some supported file types are:

CSV (read_csv(), to_csv())
JSON (read_json(), to_json())
MS Excel (read_excel(), to_excel())
SQL (read_sql(), to_sql())
Python Pickel Format (read_picle(), to_pickle())
...

The example shows the basic principle of importing and exporting data from/to files. In the case of text-based file types (e.g. csv) one important parameter to set is the separator (sep or delimiter). This can probably interfere with the decimal separator and should be checked. In the case of an Excel file sheet name has to be set with sheet_name.

# Export csv and excel files
df.to_csv("data.csv", sep=',')
df.to_excel("data.xls", sheet_name="data")

# Import with setting the first column (index_col=0) as index or the sheet_name, respectively
df_csv_import = pd.read_csv("data.csv", delimiter=',', index_col=0)
df_excel_import = pd.read_excel("data.xls", sheet_name="data")

df_csv_import

df_excel_import

Test data set: Wine Quality¶

To get used to more standard functionality of pandas we are now going to work with an example data set. The data set is of roughly 5000 different white wine samples (Wine Quality Data Set). The different columns are chemical and physical attributes. In addition a quality score is given.

WINE_COLOR = 'white' #red
df = pd.read_csv(f'../.assets/data/winequality/{WINE_COLOR}.csv.zip', sep=';')

# Data set size: ~5000 (~1600) different white (red) wines with 12 different attributes
df.shape

(4898, 12)

# Showing column names
df.columns

Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality'],
      dtype='object')

# It is a default integer index
df.index

RangeIndex(start=0, stop=4898, step=1)

# show first 5 rows
df.head(5)

# show last 5 rows
df.tail(5)

# Show a sample of 4 rows
df.sample(4)

# `.describe()` yields a short statistical summary
df.describe()

# sort data set by column `pH`
df.sort_values(by='pH', ascending=False).head()

# select a set of columns and look at first rows
df[['fixed acidity', 'citric acid', 'quality']].head()

# show unique values in a column
df['quality'].unique()

array([6, 5, 7, 8, 4, 3, 9])

# group by values in `quality` and count entries
df.groupby('quality')['quality'].count()

quality
3      20
4     163
5    1457
6    2198
7     880
8     175
9       5
Name: quality, dtype: int64

# filter the data by values in a column
df[df['alcohol'] > 10.0].head()

# method calls can be chained to apply several operations in a row 
df[df['alcohol'] > 10.0].sort_values(by='alcohol', ascending=False).head()

# Try exploring the data set with pandas operations

Basic plotting with Pandas¶

As mentioned in the beginning basic plotting functionality comes already with pandas by using the matplotlib library internally. For example histograms, scatter plots or bar plots can be directly applied on the DataFrame object using the pandas.plot module. With additional parameters e.g. labels, the size, colors and more can be set.

%matplotlib inline

# This will return a matplotlib object
df['pH'].plot.hist();

 # Scatter plot of two columns
df.plot.scatter('alcohol', 'residual sugar', s=5, alpha=0.1);

# Bar plot of the counted quality entries
df.groupby('quality')['quality'].count().plot.bar();

	fixed acidity	volatile acidity	citric acid	residual sugar	chlorides	free sulfur dioxide	total sulfur dioxide	density	pH	sulphates	alcohol	quality
0	7.0	0.27	0.36	20.7	0.045	45.0	170.0	1.0010	3.00	0.45	8.8	6
1	6.3	0.30	0.34	1.6	0.049	14.0	132.0	0.9940	3.30	0.49	9.5	6
2	8.1	0.28	0.40	6.9	0.050	30.0	97.0	0.9951	3.26	0.44	10.1	6
3	7.2	0.23	0.32	8.5	0.058	47.0	186.0	0.9956	3.19	0.40	9.9	6
4	7.2	0.23	0.32	8.5	0.058	47.0	186.0	0.9956	3.19	0.40	9.9	6

	fixed acidity	volatile acidity	citric acid	residual sugar	chlorides	free sulfur dioxide	total sulfur dioxide	density	pH	sulphates	alcohol	quality
4893	6.2	0.21	0.29	1.6	0.039	24.0	92.0	0.99114	3.27	0.50	11.2	6
4894	6.6	0.32	0.36	8.0	0.047	57.0	168.0	0.99490	3.15	0.46	9.6	5
4895	6.5	0.24	0.19	1.2	0.041	30.0	111.0	0.99254	2.99	0.46	9.4	6
4896	5.5	0.29	0.30	1.1	0.022	20.0	110.0	0.98869	3.34	0.38	12.8	7
4897	6.0	0.21	0.38	0.8	0.020	22.0	98.0	0.98941	3.26	0.32	11.8	6

	fixed acidity	volatile acidity	citric acid	residual sugar	chlorides	free sulfur dioxide	total sulfur dioxide	density	pH	sulphates	alcohol	quality
3553	5.7	0.22	0.22	16.65	0.044	39.0	110.0	0.99855	3.24	0.48	9.0	6
181	7.1	0.34	0.15	1.20	0.053	61.0	183.0	0.99360	3.09	0.43	9.2	5
2303	6.9	0.44	0.18	11.80	0.051	26.0	126.0	0.99750	3.23	0.48	9.1	5
4256	6.0	0.33	0.26	5.10	0.051	16.0	119.0	0.99416	3.15	0.41	9.2	5

	fixed acidity	volatile acidity	citric acid	residual sugar	chlorides	free sulfur dioxide	total sulfur dioxide	density	pH	sulphates	alcohol	quality
count	4898.000000	4898.000000	4898.000000	4898.000000	4898.000000	4898.000000	4898.000000	4898.000000	4898.000000	4898.000000	4898.000000	4898.000000
mean	6.854788	0.278241	0.334192	6.391415	0.045772	35.308085	138.360657	0.994027	3.188267	0.489847	10.514267	5.877909
std	0.843868	0.100795	0.121020	5.072058	0.021848	17.007137	42.498065	0.002991	0.151001	0.114126	1.230621	0.885639
min	3.800000	0.080000	0.000000	0.600000	0.009000	2.000000	9.000000	0.987110	2.720000	0.220000	8.000000	3.000000
25%	6.300000	0.210000	0.270000	1.700000	0.036000	23.000000	108.000000	0.991723	3.090000	0.410000	9.500000	5.000000
50%	6.800000	0.260000	0.320000	5.200000	0.043000	34.000000	134.000000	0.993740	3.180000	0.470000	10.400000	6.000000
75%	7.300000	0.320000	0.390000	9.900000	0.050000	46.000000	167.000000	0.996100	3.280000	0.550000	11.400000	6.000000
max	14.200000	1.100000	1.660000	65.800000	0.346000	289.000000	440.000000	1.038980	3.820000	1.080000	14.200000	9.000000

	fixed acidity	volatile acidity	citric acid	residual sugar	chlorides	free sulfur dioxide	total sulfur dioxide	density	pH	sulphates	alcohol	quality
1250	5.3	0.260	0.23	5.15	0.034	48.0	160.0	0.99520	3.82	0.51	10.5	7
1255	6.4	0.220	0.34	1.80	0.057	29.0	104.0	0.99590	3.81	0.57	10.3	6
2771	6.3	0.200	0.24	1.70	0.052	36.0	135.0	0.99374	3.80	0.66	10.8	6
2036	5.7	0.270	0.32	1.20	0.046	20.0	155.0	0.99340	3.80	0.41	10.2	6
2321	4.6	0.445	0.00	1.40	0.053	11.0	178.0	0.99426	3.79	0.55	10.2	5

	fixed acidity	volatile acidity	citric acid	residual sugar	chlorides	free sulfur dioxide	total sulfur dioxide	density	pH	sulphates	alcohol	quality
2	8.1	0.28	0.40	6.90	0.050	30.0	97.0	0.9951	3.26	0.44	10.1	6
5	8.1	0.28	0.40	6.90	0.050	30.0	97.0	0.9951	3.26	0.44	10.1	6
9	8.1	0.22	0.43	1.50	0.044	28.0	129.0	0.9938	3.22	0.45	11.0	6
10	8.1	0.27	0.41	1.45	0.033	11.0	63.0	0.9908	2.99	0.56	12.0	5
12	7.9	0.18	0.37	1.20	0.040	16.0	75.0	0.9920	3.18	0.63	10.8	5

	fixed acidity	volatile acidity	citric acid	residual sugar	chlorides	free sulfur dioxide	total sulfur dioxide	density	pH	sulphates	alcohol	quality
3918	6.4	0.35	0.28	1.6	0.037	31.0	113.0	0.98779	3.12	0.40	14.20	7
4503	5.8	0.61	0.01	8.4	0.041	31.0	104.0	0.99090	3.26	0.72	14.05	7
1099	5.8	0.29	0.21	2.6	0.025	12.0	120.0	0.98940	3.39	0.79	14.00	7
3150	5.8	0.39	0.47	7.5	0.027	12.0	88.0	0.99070	3.38	0.45	14.00	6
2945	4.9	0.33	0.31	1.2	0.016	39.0	150.0	0.98713	3.33	0.59	14.00	8