2.2 Overview of the PyData Stack#

This notebook is designed to give you a brief overview of the PyData stack: a collection of Python packages used for data science. Here I introduce the most important packages in the stack: NumPy, SciPy, Pandas, Matplotlib, StatsModels, and scikit-learn. I’ll also show you some examples from Seaborn and ipywidgets.

I’ll also demonstrate some features of Jupyter Notebooks in VS Code. Although notebooks are not always the right medium for your code, this one will showcase some of the features of Jupyter that make notebooks useful for data exploration and visualization.

IMPORTANT NOTE: This is just meant to be a whirlwind overview of the important Python packages for data science. We’ll dig into the details over the next few class sessions. The point of this notebook is to showcase what CAN be done and to make sure that you can download the code for the course and run it in VS Code. We’ll learn how to write the code below over the next few weeks and throughout the next quarter. After getting a look at what’s possible here, we’ll start from the very basics in the next notebook.

You can try running this code yourself. If you don’t have VS Code installed, please install it ASAP. In a pinch, you can run this notebook in Google Colaboratory: https://colab.research.google.com/

This notebook will start by setting up the environment and then will demonstrate some examples from NumPy, SciPy, Pandas, Matplotlib, and StatsModels (with some bonus examples from Seaborn and ipywidgets). This collection of packages represents the foundation of what is called the PyData stack (ecosystem).

0. Set up Environment#

  • Let’s first make sure that everyone was able to download the code from GitHub. Please raise your hand if you are not able to open the project folder in VS Code.

  • Now, before we start, we need to set up our environment. We need to install the packages that we need (if they aren’t installed already) and then we need to load the packages into Python.

## Install required packages, but do so automatically only if running within Google Colaboratory
try:
    import google.colab
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

if IN_COLAB:
    !pip install plotly==5.9.0
else:
    print("Be sure to install the required packages manually if not in Colab")
Be sure to install the required packages manually if not in Colab
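If you’re running locally instead, the packages used in this notebook can be installed in one line. This is just a sketch using the standard PyPI package names; adapt it to your own environment or conda setup:

# Run in a terminal, or prefix with ! to run from a notebook cell
pip install numpy scipy pandas matplotlib seaborn statsmodels scikit-learn plotly ipywidgets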
import numpy as np
import scipy as sp
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

import plotly.express as px

sns.set()  # Apply the default seaborn theme (sns.set() is an alias of sns.set_theme())

1. NumPy#

NumPy is a library designed to add support “for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.”

A good beginner tutorial can be found on the official NumPy website here: https://numpy.org/doc/stable/user/absolute_beginners.html

The basic functionality of NumPy is the efficient management of arrays, with syntax as follows:

a = np.array([1, 2, 3])
a
array([1, 2, 3])
A = np.array([[1, 2], [3, 4]])
A
array([[1, 2],
       [3, 4]])
B = np.array([
    [1, 0], 
    [0, 1]])
B
array([[1, 0],
       [0, 1]])
A @ B
array([[1, 2],
       [3, 4]])
b = np.array((1,2))
x = np.linalg.solve(A, b)
x
array([0. , 0.5])
A @ x
array([1., 2.])
eigval, eigvec = np.linalg.eig(A)
eigval
array([-0.37228132,  5.37228132])
eigvec
array([[-0.82456484, -0.41597356],
       [ 0.56576746, -0.90937671]])

2. SciPy#

SciPy is a library “used for scientific computing and technical computing. SciPy contains modules for optimization, linear algebra, integration, interpolation, special functions, FFT, signal and image processing, ODE solvers and other tasks common in science and engineering.”

For example, consider calculating the following integral.
\[ \int_0^1 (a x^2 + b) \, dx \]

DISCUSS: How did I typeset the above equation in this Jupyter notebook? What about the images above?

  • How did I create the section headers?

  • How can I export this notebook into a PDF report? What about an HTML report?

from scipy.integrate import quad
a = 2
b = 1
def integrand(x):
    return a*x**2 + b
I, err = quad(integrand, 0, 1)
I
1.6666666666666667
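As a sanity check, this integral has a simple closed form:

\[ \int_0^1 (a x^2 + b) \, dx = \frac{a}{3} + b = \frac{2}{3} + 1 = \frac{5}{3} \approx 1.6667, \]

which matches the value that quad returned.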

3. Matplotlib#

Matplotlib is the most popular plotting library for Python. Other plotting libraries often build on Matplotlib as a foundation, so even if you use one of them, it is important to understand the basics of Matplotlib.

As an example, consider the function
\[ y = x_0 \exp(-x_2 \cdot t) + x_1 \exp(-x_3 \cdot t) \]

x = np.array([1, 1, 1, 0])
yfunc = lambda t: x[0] * np.exp(-x[2]* t) + x[1] * np.exp(-x[3] * t)
t_grid = np.linspace(0, 2, 100)
y = yfunc(t_grid)
plt.plot(t_grid, y)
[<matplotlib.lines.Line2D at 0x16e4d2510>]
[Figure: line plot of y over t ∈ [0, 2]]
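Matplotlib also has an object-oriented interface that makes it easier to label and customize figures. Here is a minimal sketch that re-plots the same curve with axis labels and a legend, reusing t_grid and y from the cell above:

# Re-plot the curve using the figure/axes (object-oriented) interface
fig, ax = plt.subplots()
ax.plot(t_grid, y, label="$x_0 e^{-x_2 t} + x_1 e^{-x_3 t}$")
ax.set_xlabel("t")
ax.set_ylabel("y")
ax.legend()
plt.show()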

Now, let’s consider some fancier examples. We’ll even use some Seaborn code.

# https://seaborn.pydata.org/examples/errorband_lineplots.html
# Load an example dataset with long-form data
sns.set_theme(style="darkgrid")
fmri = sns.load_dataset("fmri")

# Plot the responses for different events and regions
sns.lineplot(x="timepoint", y="signal",
             hue="region", style="event",
             data=fmri)
<Axes: xlabel='timepoint', ylabel='signal'>
[Figure: fmri signal over timepoint, with hue by region and line style by event]
# https://seaborn.pydata.org/examples/layered_bivariate_plot.html
# sns.set_theme(style="dark")

# Simulate data from a bivariate Gaussian
n = 10000
mean = [0, 0]
cov = [(2, .4), (.4, .2)]
rng = np.random.RandomState(0)
x, y = rng.multivariate_normal(mean, cov, n).T

# Draw a combo histogram and scatterplot with density contours
f, ax = plt.subplots(figsize=(6, 6))
sns.scatterplot(x=x, y=y, s=5, color=".15")
sns.histplot(x=x, y=y, bins=50, pthresh=.1, cmap="mako")
sns.kdeplot(x=x, y=y, levels=5, color="w", linewidths=1)
<Axes: >
[Figure: layered scatterplot, 2-D histogram, and KDE contours of the simulated bivariate Gaussian]
# https://seaborn.pydata.org/examples/timeseries_facets.html
flights = sns.load_dataset("flights")

# Plot each year's time series in its own facet
g = sns.relplot(
    data=flights,
    x="month", y="passengers", col="year", hue="year",
    kind="line", palette="crest", linewidth=4, zorder=5,
    col_wrap=3, height=2, aspect=1.5, legend=False,
)

# Iterate over each subplot to customize further
for year, ax in g.axes_dict.items():

    # Add the title as an annotation within the plot
    ax.text(.8, .85, year, transform=ax.transAxes, fontweight="bold")

    # Plot every year's time series in the background
    sns.lineplot(
        data=flights, x="month", y="passengers", units="year",
        estimator=None, color=".7", linewidth=1, ax=ax,
    )

# Reduce the frequency of the x axis ticks
ax.set_xticks(ax.get_xticks()[::2])

# Tweak the supporting aspects of the plot
g.set_titles("")
g.set_axis_labels("", "Passengers")
g.tight_layout()
<seaborn.axisgrid.FacetGrid at 0x3320fbd10>
[Figure: faceted line plots of monthly passengers, one panel per year, with all years shown in gray in the background]
# Install with:
# conda install -c plotly plotly=5.9.0
# or
# pip install plotly==5.9.0

df = px.data.stocks()
fig = px.line(df, x='date', y="GOOG")
fig.show()

4. Pandas#

From Wikipedia, “pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. It is free software released under the three-clause BSD license. The name is derived from the term “panel data”, an econometrics term for data sets that include observations over multiple time periods for the same individuals. Its name is a play on the phrase “Python data analysis” itself. Wes McKinney started building what would become pandas at AQR Capital while he was a researcher there from 2007 to 2010.”

population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population
California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64
area_dict = {
    'California': 423967, 
    'Texas': 695662, 
    'New York': 141297,
    'Florida': 170312, 
    'Illinois': 149995}
area = pd.Series(area_dict)

states = pd.DataFrame({'population': population,
                       'area': area})
states
            population    area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995
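Column-wise arithmetic on DataFrames is vectorized, just like NumPy arrays. For example, here is a minimal sketch that computes population density and sorts by it:

# Add a derived column (element-wise division) and sort by it
states['density'] = states['population'] / states['area']
states.sort_values('density', ascending=False)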
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user'
users = pd.read_table(url, sep='|', index_col='user_id')
users.head()
         age gender  occupation zip_code
user_id
1         24      M  technician    85711
2         53      F       other    94043
3         23      M      writer    32067
4         24      M  technician    43537
5         33      F       other    15213
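Pandas also supports split-apply-combine operations via groupby. As a small sketch on the dataset above, here is the mean age by occupation:

# Group rows by occupation and average the age within each group
users.groupby('occupation')['age'].mean().head()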

5. StatsModels#

“statsmodels is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration.”

import statsmodels.api as sm
import statsmodels.formula.api as smf
dat = sm.datasets.get_rdataset("Guerry", "HistData").data
dat.head()
dept Region Department Crime_pers Crime_prop Literacy Donations Infants Suicides MainCity ... Crime_parents Infanticide Donation_clergy Lottery Desertion Instruction Prostitutes Distance Area Pop1831
0 1 E Ain 28870 15890 37 5098 33120 35039 2:Med ... 71 60 69 41 55 46 13 218.372 5762 346.03
1 2 N Aisne 26226 5521 51 8901 14572 12831 2:Med ... 4 82 36 38 82 24 327 65.945 7369 513.00
2 3 C Allier 26747 7925 13 10973 17044 114121 2:Med ... 46 42 76 66 16 85 34 161.927 7340 298.26
3 4 E Basses-Alpes 12935 7289 46 2733 23018 14238 1:Sm ... 70 12 37 80 32 29 2 351.399 6925 155.90
4 5 E Hautes-Alpes 17488 8174 69 6962 23076 16171 1:Sm ... 22 23 64 79 35 7 1 320.280 5549 129.10

5 rows × 23 columns

results = smf.ols('Lottery ~ Literacy + np.log(Pop1831)', data=dat).fit()
results.summary()
OLS Regression Results
Dep. Variable: Lottery R-squared: 0.348
Model: OLS Adj. R-squared: 0.333
Method: Least Squares F-statistic: 22.20
Date: Sat, 10 Aug 2024 Prob (F-statistic): 1.90e-08
Time: 08:58:33 Log-Likelihood: -379.82
No. Observations: 86 AIC: 765.6
Df Residuals: 83 BIC: 773.0
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 246.4341 35.233 6.995 0.000 176.358 316.510
Literacy -0.4889 0.128 -3.832 0.000 -0.743 -0.235
np.log(Pop1831) -31.3114 5.977 -5.239 0.000 -43.199 -19.424
Omnibus: 3.713 Durbin-Watson: 2.019
Prob(Omnibus): 0.156 Jarque-Bera (JB): 3.394
Skew: -0.487 Prob(JB): 0.183
Kurtosis: 3.003 Cond. No. 702.


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
nobs = 100
X = np.random.random((nobs, 2))
X = sm.add_constant(X)
beta = [1, .1, .5]
e = np.random.random(nobs)
y = np.dot(X, beta) + e

# Fit regression model
results = sm.OLS(y, X).fit()

# Inspect the results
results.summary()
OLS Regression Results
Dep. Variable: y R-squared: 0.160
Model: OLS Adj. R-squared: 0.142
Method: Least Squares F-statistic: 9.204
Date: Sat, 10 Aug 2024 Prob (F-statistic): 0.000219
Time: 08:58:33 Log-Likelihood: -12.900
No. Observations: 100 AIC: 31.80
Df Residuals: 97 BIC: 39.62
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 1.5476 0.077 20.217 0.000 1.396 1.700
x1 0.1473 0.099 1.483 0.141 -0.050 0.344
x2 0.3775 0.097 3.897 0.000 0.185 0.570
Omnibus: 11.454 Durbin-Watson: 2.126
Prob(Omnibus): 0.003 Jarque-Bera (JB): 3.802
Skew: -0.027 Prob(JB): 0.149
Kurtosis: 2.046 Cond. No. 5.41


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
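Note that the fitted constant is near 1.55 even though beta[0] = 1: the errors e are drawn from Uniform(0, 1), which has mean 0.5, and that mean is absorbed into the intercept. You can confirm this by inspecting the estimated coefficients directly:

# Estimated coefficients; expect roughly [1.5, 0.1, 0.5]
results.params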

6. scikit-learn#

scikit-learn is a machine learning library for Python. It is built on NumPy, SciPy, and Matplotlib. It is designed to be easy to use and to work well with the other packages in the PyData stack. Although there are many other machine learning libraries for Python, scikit-learn is the most popular and is the best place to start if you are new to machine learning.

In the example below, the authors start by creating a simple dataset with two features. They fit a PCA estimator to display the two principal components of this dataset, i.e. the two directions that explain the most variance in the data.

From here: https://scikit-learn.org/stable/auto_examples/cross_decomposition/plot_pcr_vs_pls.html

# Example from here:
# https://scikit-learn.org/stable/auto_examples/cross_decomposition/plot_pcr_vs_pls.html
import matplotlib.pyplot as plt
import numpy as np

from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
n_samples = 500
cov = [[3, 3], [3, 4]]
X = rng.multivariate_normal(mean=[0, 0], cov=cov, size=n_samples)
pca = PCA(n_components=2).fit(X)


plt.scatter(X[:, 0], X[:, 1], alpha=0.3, label="samples")
for i, (comp, var) in enumerate(zip(pca.components_, pca.explained_variance_)):
    comp = comp * var  # scale component by its variance explanation power
    plt.plot(
        [0, comp[0]],
        [0, comp[1]],
        label=f"Component {i}",
        linewidth=5,
        color=f"C{i + 2}",
    )
plt.gca().set(
    aspect="equal",
    title="2-dimensional dataset with principal components",
    xlabel="first feature",
    ylabel="second feature",
)
plt.legend()
plt.show()
[Figure: scatter of the 2-D dataset with the two principal components drawn as line segments from the origin]
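Every scikit-learn estimator shares the same fit/transform/predict interface. As a small follow-on sketch, we can inspect how much variance each component explains and project the data into the component coordinate system:

# Fraction of total variance explained by each principal component
print(pca.explained_variance_ratio_)

# Rotate the data into the principal-component basis
X_proj = pca.transform(X)
print(X_proj[:3])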

7. ipywidgets#

7.1 Lorenz Attractor: Lorenz System of Differential Equations#

from ipywidgets import interact, interactive
from IPython.display import clear_output, display, HTML

import numpy as np
from scipy import integrate

from matplotlib import pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from matplotlib.colors import cnames
from matplotlib import animation
def solve_lorenz(N=10, angle=0.0, max_time=4.0, sigma=10.0, beta=8./3, rho=28.0):

    fig = plt.figure()
    ax = fig.add_axes([0, 0, 1, 1], projection='3d')
    ax.axis('off')

    # prepare the axes limits
    ax.set_xlim((-25, 25))
    ax.set_ylim((-35, 35))
    ax.set_zlim((5, 55))
    
    def lorenz_deriv(tup, t0, sigma=sigma, beta=beta, rho=rho):
        """Compute the time-derivative of a Lorenz system."""
        x, y, z = tup
        return [sigma * (y - x), x * (rho - z) - y, x * y - beta * z]

    # Choose random starting points, uniformly distributed from -15 to 15
    np.random.seed(1)
    x0 = -15 + 30 * np.random.random((N, 3))

    # Solve for the trajectories
    t = np.linspace(0, max_time, int(250*max_time))
    x_t = np.asarray([integrate.odeint(lorenz_deriv, x0i, t)
                      for x0i in x0])
    
    # choose a different color for each trajectory
    colors = plt.cm.jet(np.linspace(0, 1, N))

    for i in range(N):
        x, y, z = x_t[i,:,:].T
        lines = ax.plot(x, y, z, '-', c=colors[i])
        plt.setp(lines, linewidth=2)

    ax.view_init(30, angle)
    plt.show()

    return t, x_t
t, x_t = solve_lorenz(angle=0, N=10)
[Figure: ten Lorenz-attractor trajectories plotted in 3-D]
w = interactive(solve_lorenz, angle=(0.,360.), N=(0,50), sigma=(0.0,50.0), rho=(0.0,50.0))
## Uncomment below for interactive viewing
# display(w)

7.2 Mean-Variance Portfolio Analysis#

Suppose that you have two different assets that you could invest in, asset 0 or asset 1. Let \(R_0\) and \(R_1\) be random variables representing the gross return on asset 0 and asset 1, respectively. Let the return of each have means \(\mu_0\) and \(\mu_1\) and standard deviations \(\sigma_0\) and \(\sigma_1\). Let \(w\) be portfolio weight allocated to asset 1. Then, the return on the portfolio is

\[ R = (1-w) R_0 + w R_1. \]
\[ E[R_0] = \mu_0 \]
\[ E[R_1] = \mu_1 \]
\[\begin{align} std(R_0) &= \sigma_0 \\ std(R_1) &= \sigma_1 \end{align}\]

Now, what about the mean and standard deviation of \(R\)?

\[ E[R] = (1-w) \mu_0 + w \mu_1. \]
\[ std(R) = \sqrt{(1-w)^2 \sigma_0^2 + w^2 \sigma_1^2 + 2 w (1-w) \rho \sigma_0 \sigma_1}, \]

where \(\rho\) is the correlation between \(R_0\) and \(R_1\). This follows from \(Var(aX + bY) = a^2 Var(X) + b^2 Var(Y) + 2 a b \, Cov(X, Y)\) together with \(Cov(R_0, R_1) = \rho \sigma_0 \sigma_1\).
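Here is a quick simulation check of these formulas (a sketch, assuming jointly normal returns with the stated moments and correlation \(\rho\)):

# Simulate correlated returns and compare sample moments with the formulas
rng = np.random.default_rng(0)
mu0, mu1, sd0, sd1, rho, w = 2.0, 8.0, 5.0, 10.0, 0.3, 0.5
cov = [[sd0**2, rho * sd0 * sd1], [rho * sd0 * sd1, sd1**2]]
R0, R1 = rng.multivariate_normal([mu0, mu1], cov, size=100_000).T
R = (1 - w) * R0 + w * R1
print(R.mean(), (1 - w) * mu0 + w * mu1)                      # sample vs. formula mean
print(R.std(), np.sqrt((1 - w)**2 * sd0**2 + w**2 * sd1**2
                       + 2 * w * (1 - w) * rho * sd0 * sd1))  # sample vs. formula std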
from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets
import numpy as np
from matplotlib import pyplot as plt

def plot_portfolio(weight_1=.5, mu0=2.0, mu1=8.0, sd0=5.0, sd1=10.0, corr=0.0, Rf=1.0):
    
    #calculate portfolio properties
    expected_return = lambda a: a * mu1 + (1-a) * mu0
    add_sd = lambda a: np.sqrt((1-a)**2 * sd0**2 + a**2 * sd1**2 
                            + 2*a*(1-a)*corr*sd0 * sd1)
    
    #calculate frontier
    weight_1_range = np.linspace(-2.0, 2.0, 100)
    y = expected_return(weight_1_range)
    x = add_sd(weight_1_range)
    
    #Plot frontier and points
    XRIGHT = 15
    fig, ax = plt.subplots()
    ax.plot(x, y)
    ax.plot([sd0, sd1], [mu0, mu1], 'o')
    ax.axis([0, XRIGHT, -2, 10])
    ax.plot([add_sd(weight_1)], [expected_return(weight_1)], 'o')
    plt.xlabel('standard deviation')
    plt.ylabel('expected return')
    
    if corr < 1:
        #calculate risk-free rate frontier
        one_vec = np.array([[1,1]]).T
        Sigma = np.array([[sd0**2, sd0 * sd1 * corr],
                          [sd0 * sd1 * corr, sd1**2]])
        mu_vec = np.array([[mu0, mu1]]).T
        #A = np.linalg.inv(Sigma) @ (mu_vec - one_vec * Rf)
        A = np.linalg.solve(Sigma, (mu_vec - one_vec * Rf))
        weights_vec = A * (one_vec.T @ A)**(-1)
        Rt = mu_vec.T @ weights_vec
        sd_t = np.sqrt(weights_vec.T @ Sigma @ weights_vec)
        slope = (Rt - Rf) / (sd_t - 0)    
        ax.plot([0, sd_t.item(), (sd_t + XRIGHT).item()], [Rf, Rt.item(), (Rt + XRIGHT * slope).item() ])
plot_portfolio(weight_1=.5, mu0=4.0, mu1=8.0, sd0=5.0, sd1=10.0, corr=0.0, Rf=1.0)
[Figure: mean-variance frontier with the two assets, the chosen portfolio, and the risk-free tangency line]
## Uncomment below for interactive viewing
# interact(plot_portfolio, weight_1=(-1.0, 2.0, .001), corr=(-1.0, 1.0, .01));

DISCUSS: The example above demonstrates what is meant by a set of efficient portfolios. What do you think is meant by an efficient portfolio?

For more information about calculating the tangency portfolio, see here: https://bookdown.org/compfinezbook/introcompfinr/Efficient-portfolios-of.html