Pandas

Overview

Teaching: 30 min
Exercises: 15 min
Questions
  • What is Pandas and why should I use it?

Objectives
  • Creating a pandas DataFrame

  • Indexing DataFrames

  • Listing column names of a DataFramePrinting elements of a DataFrame by indexing

  • Reading in data from a CSV into a DataFrame

  • Analyzing a DataFrame using built in functions ( df.sum() , df.cumsum() , df.std() )

Pandas is an open source library which provides data structures and data analysis tools for python. You will notice many similarities between Pandas and NumPy because Pandas is built on top of NumPy using and expanding on NumPy’s functionality. At the core of Pandas is the pandas dataframe. Dataframes are similar to multi-dimensional arrays but they include labels and indices which can be of many types including time, simplifying the use of time series data.

Creating DataFrames


The Pandas DataFrame class is used to manage labeled and indexed data much like one would in SQL.

To import the Pandas library do the following:

import pandas as pd

Pandas DataFrame is created with the form:

pd.DataFrame(Data,Labels,Index)

Create and print a Pandas DataFrame with:

df = pd.DataFrame([10,20,30,40],columns=['numbers'],index=['a','b','c','d'])
print(df)
   numbers
a       10
b       20
c       30
d       40

The main concepts to take away from a Pandas DataFrame are:

We can print out the labels and index of the DataFrame:

print(df.index)
print(df.columns)
Index(['a', 'b', 'c', 'd'], dtype='object')
Index(['numbers'], dtype='object')

We can also initialize a Pandas DataFrame by calling the underlying Numpy function ‘ones’ and wrapping it with a Pandas DataFrame:

df1 = pd.DataFrame(pd.np.ones((3,3)))
print(df1)
     0    1    2
0  1.0  1.0  1.0
1  1.0  1.0  1.0
2  1.0  1.0  1.0

We passed the ones function a tuple of (3,3) which will create a 3 by 3 matrix initialized with the values of 1 and then wrap the Numpy array in a Pandas DataFrame. We can also use the zeros function to initialize the DataFrame with values of zero:

df0 = pd.DataFrame(pd.np.zeros((3,3)))
print(df0)
     0    1    2
0  0.0  0.0  0.0
1  0.0  0.0  0.0
2  0.0  0.0  0.0

We can append data to the end of our dataframe with append:

df2 = pd.DataFrame([50,60,70,80],columns=['numbers'],index=['e','f','g','h'])
df = df.append(df2)
print(df)
   numbers
a       10
b       20
c       30
d       40
e       50
f       60
g       70
h       80

We can add a column to our dataframe by assigning values to it’s label:

df['10x'] = df.numbers*10
print(df)
   numbers  10x
a       10  100
b       20  200
c       30  300
d       40  400
e       50  500
f       60  600
g       70  700
h       80  800

Pandas has more ways to concatenate data than we can cover in this workshop. For more information on concatenating data please visit Pandas merging documentation

Indexing DataFrames


Pandas has two primary types of indexing:

.loc: is label and index based:

print(df.loc['a'])
print('----------------')
print(df.loc['a','10x'])
print('----------------')
print(df.loc['a':'d','10x'])
numbers     10
10x        100
Name: a, dtype: int64 
----------------
100 
----------------
a    100
b    200
c    300
d    400
Name: 10x, dtype: int64 

.iloc: is integer position based (from 0 to length-1)

print(df.iloc[0])
print('----------------')
print(df.iloc[0,1])
print('----------------')
print(df.iloc[0:3,1])
numbers     10
10x        100
Name: a, dtype: int64 
----------------
100 
----------------
a    100
b    200
c    300
Name: 10x, dtype: int64 

Importing and Exporting data from CSV files


We can write our dataframe out to a CSV file with:

df.to_csv('numbers.csv')
!cat numbers.csv
,numbers,10x
a,10,100
b,20,200
c,30,300
d,40,400
e,50,500
f,60,600
g,70,700
h,80,800

We can read in CSV data into a DataFrame (AZO.csv):

df = pd.read_csv('AZO.csv')

Note: If an index column is not specified then pandas will create an index column for you starting with 0.

Now that we have data inside of a DataFrame we can look at the DataFrame with a few commands:

print('head\n',df.head())
print('tail\n',df.tail())
head
          Date        Open        High         Low       Close   Adj Close  \
0  2017-12-05  747.000000  763.299988  703.599976  712.760010  712.760010   
1  2017-12-06  705.630005  711.760010  696.940002  698.650024  698.650024   
2  2017-12-07  701.460022  704.090027  693.750000  702.340027  702.340027   
3  2017-12-08  702.789978  723.429993  701.289978  721.890015  721.890015   
4  2017-12-11  719.590027  720.799988  705.780029  708.609985  708.609985   

    Volume  
0  1225500  
1   492900  
2   417300  
3   546900  
4   501900  
tail
            Date        Open        High         Low       Close   Adj Close  \
247  2018-11-28  835.380005  840.650024  826.599976  833.700012  833.700012   
248  2018-11-29  835.530029  836.739990  824.020020  825.830017  825.830017   
249  2018-11-30  827.090027  827.090027  804.570007  809.070007  809.070007   
250  2018-12-03  814.049988  829.710022  807.059998  824.460022  824.460022   
251  2018-12-04  867.099976  894.369995  854.500000  880.070007  880.070007   

     Volume  
247  349900  
248  279200  
249  602900  
250  535700  
251  935200 

Working with Dataframes


There is a large variety of functions that pandas provides for Dataframes. View Pandas API reference for documentation. In this workshop we will review a few of them to give you an idea of their potential:

DataFrame.sum() Sum the values on the requested axis (default is columns):

print(df.sum())
Date         2017-12-052017-12-062017-12-072017-12-082017-1...
Open                                                    180132
High                                                    182133
Low                                                     177975
Close                                                   180038
Adj Close                                               180038
Volume                                                92264300
dtype: object

DataFrame.cumsum() Return the cumulative sum of the requested axis:

print(df['Close'].cumsum().head())
0     712.760010
1    1411.410034
2    2113.750061
3    2835.640076
4    3544.250061
Name: Close, dtype: float64

DataFrame.std() Return the standard deviation of the requested axis:

print(df.std())
Open             59.753627
High             60.613944
Low              59.501632
Close            60.590288
Adj Close        60.590288
Volume       208536.419569
dtype: float64

Economic example using Pandas

Let’s repeat the example from NumPy using Pandas, this time using the file AZO.csv which includes time data

Load the file using pd.read_csv and plot just the adjusted close values:

Note: There are a few extra commands included here to format the plot and make the dates visible. Don’t worry if you don’t understand them now, we will cover more about matplotlib later

import matplotlib.pyplot as plt

# simple function to format plot to make the dates legible
def plot_dates(df):
    fig, ax = plt.subplots()
    plt.plot(df)
    ax.set_xticks(df.index[0::30])
    plt.xticks(rotation=45)
    plt.show()

# read the data in and plot it
data = pd.read_csv('AZO.csv', index_col=0)
plot_dates(data['Adj Close'])

Next we’ll calculate the log returns and plot them, this time using a pandas dataframe function called pct_change(). We could use the same NumPy method for calculating the log returns, but using the pandas function we keep the indices for the data (dates)

import numpy as np

log_returns = np.log(1 + data['Adj Close'].pct_change())
plot_dates(log_returns)

Next calculate the mean (u), variance (var), Standard deviation (std) and drift (drift) the same way we did with NumPy

u = log_returns.mean()
var = log_returns.var()
std = log_returns.std()
drift = u-(0.5*var)

Say we want to run 5 simulations predicting the next 100 days of returns. First generate a 2 dimensional array of random values of size (100,5), just like we did with NumPy:

rand_vals = np.random.rand(100, 5)
print(rand_vals.shape)
(100, 5)

Next with some help from SciPy’s statistical library let’s calculate the percent point function using our random values:

from scipy.stats import norm
ppf = norm.ppf(rand_vals)

Calculate the daily returns using the results of our previous step and plot them:

returns = np.exp(drift + std*ppf)
plt.plot(returns)
plt.show()

Next with some help from Pandas we will determine what the next 100 business days are and merge set them as the indices for our daily returns

from pandas.tseries.holiday import USFederalHolidayCalendar
from pandas.tseries.offsets import CustomBusinessDay

us_bd = CustomBusinessDay(calendar=USFederalHolidayCalendar())
prediction_dates = pd.date_range(start=data.index[-1], periods=100, freq=us_bd)
returns_dates = pd.DataFrame(returns,index=prediction_dates)
plot_dates(returns_dates)

Convert this back to daily stock values by creating an empty dataframe the same size as our returns with our prediction dates as the indices and assign the first row (starting values) to be the same as the last row in our initial data set. Then loop through our data calculating the daily values and plot it

predictions = pd.DataFrame(np.zeros(returns_dates.shape),index=prediction_dates)
predictions.iloc[0] = data['Adj Close'].iloc[-1]
for i in range(1,predictions.shape[0]):
    predictions.iloc[i] = predictions.iloc[i-1]*returns_dates.iloc[i]
predictions.head()
plot_dates(predictions)

Challenge

Using the predictions from above pick a date and print the predicted stock values for all 5 predictions for that date

Solution

print(predictions.loc['2018-12-26'])
0    897.879181
1    914.058514
2    859.522983
3    923.701527
4    817.625782
Name: 2018-12-26 00:00:00, dtype: float64

Key Points

  • What are Dataframes

  • Using dataframes