Gaussian Process Regression: A non-parametric regression approach

See this page for derivations

  • One of the biggest advantages of Gaussian Process Regression, apart from its simplicity, is that even before making a new observation y at a point x, we can predict the variance of that observation from the observations made so far.
    • This means we can avoid exploring regions of the input space X where the predictive variance is already low, as we won't gain much new information there.
      • The variance in regions of the input space where observations have already been made is lower than in regions where no observations have been made.
    • This has serious advantages in various areas, for example:
      (See this link)
      • Finance: avoid investing in stocks that we already know are low-gain and very predictable. Explore stocks where the risk is higher (up to a certain limit), as those can yield higher returns.
      • Model selection in machine learning systems: we can explore the parts of the design space where we expect the most return, i.e. the high-variance regions. Exploration there may not immediately improve on what we already have, but every point explored in a highly uncertain region reveals a lot of information about the surrounding points.
In [1]:
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import Latex

Define Kernel

$K(x,x') = \lambda^{-1} \exp\left(-\frac{\lVert x - x'\rVert^2}{2\, l_{sep}^2}\right)$

In [2]:
# Define Kernel - Radial Basis Function (squared exponential)
def Kernel(x, x_p, lam=0.1, lsep=0.1):
    """
    lam:  lambda, prior precision on the weights that combine the basis
          functions phi(x) linearly; the kernel amplitude is 1/lam
    lsep: length scale, controls how quickly correlation decays with the
          separation between points
    """
    # ||x - x'||^2 expanded as x^2 + x'^2 - 2*x.x' to compute all pairwise
    # squared distances without explicit loops
    return lam**(-1.) * np.exp((-1./(2*lsep**2))*(np.sum(x**2, 1).reshape(-1,1)
                                                  + np.sum(x_p**2, 1).reshape(1,-1)
                                                  - 2*x.dot(x_p.T)))

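As a quick sanity check (not part of the original notebook), the vectorized expression above expands $\lVert x - x'\rVert^2$ as $x^2 + x'^2 - 2\,x\cdot x'$, so it should agree with a naive double loop over point pairs:

In [ ]:
# Sanity check: vectorized kernel vs. an explicit loop (illustrative only)
a = np.linspace(-1, 1, 4).reshape(-1, 1)
b = np.linspace(0, 2, 3).reshape(-1, 1)

K_vec = Kernel(a, b, lam=1, lsep=1)

K_loop = np.zeros((a.shape[0], b.shape[0]))
for r in range(a.shape[0]):
    for c in range(b.shape[0]):
        d2 = np.sum((a[r] - b[c])**2)                    # squared Euclidean distance
        K_loop[r, c] = 1.0 * np.exp(-d2 / (2 * 1.0**2))  # lam=1, lsep=1

print(np.allclose(K_vec, K_loop))  # expected: True
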
Generate a random function

$f \sim \mathcal{N}(0,\ K(x_{train}, x_{train}))$

  • And store it to see if we can learn it later using Gaussian Process Regression.
In [3]:
# - Generate Train data
n=100
rng=(-10,10)
x_train = np.linspace(rng[0],rng[1],n).reshape(-1,1) # x train points
K=Kernel(x_train, x_train, lsep=1, lam=1)
f = np.random.multivariate_normal(np.zeros(n), K) # generate a random function

plt.imshow(K)
plt.title('ColorMap of Covariance matrix K')
plt.colorbar(orientation='vertical')
plt.show()

plt.plot(x_train, f)
plt.title('saved an arbitrary f to see if we can re-learn it')
plt.show()

Generate Noisy Observations

$y_{train} \sim \mathcal{N}(f,\ \sigma_{noise}^2 I)$

In [4]:
sig_noise = 0.5
# y_train ~ N(f, sig_noise^2 * I): add i.i.d. Gaussian noise to the latent f
y_train = np.random.multivariate_normal(f, sig_noise**2*np.eye(n))
y_train = y_train.reshape(-1,1)

plt.scatter(x_train, y_train, c='g', label=r'$y_{train}$')
#plt.plot(x_train, f, c='black', label='f')
plt.title(r'Generated $y_{train}$ using prior f with $\sigma_{noise}$='+str(sig_noise))
plt.legend(loc='lower right')
plt.show()

Gaussian Process Regression: Learning

We will learn in steps, using M observations at a time. (These observations are sampled from the bag of training data we saved earlier.)

$\begin{bmatrix} y_M\\ y_D \end{bmatrix} \sim \mathcal{N}\left(0,\ \sigma_{noise}^2 I + \begin{bmatrix} K(X_{M},X_{M}) & K(X_{D},X_{M})^T\\ K(X_{D},X_{M}) & K(X_{D},X_{D}) \end{bmatrix}\right)$

Conditioning on the data D seen so far, the expected value of the new predictions $y_M$ is:
$\mu_M = K(X_{M},X_{D})\,(K(X_{D},X_{D}) + \sigma_{noise}^2 I)^{-1}\, y_D$

And the covariance of these predictions:
$\Sigma_M = \sigma_{noise}^2 I + K(X_{M},X_{M}) - K(X_{M},X_{D})\,(K(X_{D},X_{D}) + \sigma_{noise}^2 I)^{-1}\, K(X_{M},X_{D})^T$

In other words, $y_M\ |\ X_{D}, y_{D} \sim \mathcal{N}(\mu_{M}, \Sigma_M)$. The $\sigma_{noise}^2 I$ term inside the inverse comes from the noise on the observed $y_D$ in the joint distribution above.

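For reference, these two formulas can be wrapped in a small helper. This is a minimal sketch (the helper name gp_posterior is illustrative, not from the notebook); it reuses the Kernel function defined above and solves the linear system with np.linalg.solve rather than forming an explicit inverse:

In [ ]:
# Minimal sketch of the predictive equations above (illustrative helper)
def gp_posterior(x_M, x_D, y_D, sig_noise, lam=1, lsep=1):
    """Posterior mean mu_M and covariance Sig_M of y at x_M, given noisy (x_D, y_D)."""
    K_DD = Kernel(x_D, x_D, lam=lam, lsep=lsep) + sig_noise**2 * np.eye(x_D.shape[0])
    K_MD = Kernel(x_M, x_D, lam=lam, lsep=lsep)
    K_MM = Kernel(x_M, x_M, lam=lam, lsep=lsep)
    # Solve K_DD @ alpha = y_D instead of computing an explicit inverse
    alpha = np.linalg.solve(K_DD, y_D)
    mu_M = K_MD.dot(alpha)
    Sig_M = sig_noise**2 * np.eye(x_M.shape[0]) + K_MM \
            - K_MD.dot(np.linalg.solve(K_DD, K_MD.T))
    return mu_M, Sig_M
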
We will implement the following algorithm:

  • Prepare the observations (randomly permute the train data)
  • Get initial observations (sample M examples from the training data)
  • Loop through the observations:
    • Predict the new y (get the mean and covariance of the predictions)
      • It would be better to observe areas where the predictive variance is high.
      • This gives a bigger bang for the buck (See this link)
    • Compute the error between E(predicted new y) and the actually observed new y
    • Add the newly observed y to the observations so far
In [6]:
M = 5 # number of observations at a time
D = n # total number of observations

# Prepare the observations (randomly permutate train data)
ind_shuffled = np.random.permutation(x_train.shape[0])
x_obs, y_obs = x_train[ind_shuffled], y_train[ind_shuffled]

# Get initial observations
i, Mini=0, 5
x_D, y_D = x_obs[i:i+Mini], y_obs[i:i+Mini]
i += Mini

errorTrace_RMSE = []
# - loop through observations
while i+M <= D:
    print('for observations {0}:{1}'.format(i, i+M))
#  - predict new y (get mean and variance of predictions)
    # Add some new points to previously observed points. 
    # We will see how well we do on previously observed points 
    #   and how much variance we have on not-yet-observed points.
    x_M = x_obs[0:i+M]
    
    # Use the same kernel hyperparameters that generated f, and add the
    # observation-noise variance to K_DD, as in the joint distribution above
    # (this also keeps K_DD well conditioned for inversion)
    K_DD = Kernel(x_D, x_D, lam=1, lsep=1) + sig_noise**2*np.eye(x_D.shape[0])
    K_MD = Kernel(x_M, x_D, lam=1, lsep=1)
    K_MM = Kernel(x_M, x_M, lam=1, lsep=1)
    
    # - Mean
    mu_M = (K_MD.dot(np.linalg.inv(K_DD))).dot(y_D)
    
    # - Co-Variance
    Sig_M = sig_noise**2*np.eye(i+M) + K_MM - (K_MD.dot(np.linalg.inv(K_DD))).dot(K_MD.T)
    
    # - get variances (along diagonal of co-variance matrix)
    sd = np.sqrt(np.diagonal(Sig_M)).reshape(-1,1) 
    point_of_maxVariance = np.argmax(sd, axis=0)
    
    #  - Actually observed y and find error between E(y_M) - y_Mobs
    y_Mobs = y_obs[0:i+M]
    err_RMSE = np.sqrt(np.mean((mu_M-y_Mobs)**2))
    errorTrace_RMSE.append(err_RMSE)
    
    # - plots
    plt.plot(x_train, f, c='black', label='f: Truth')
    plt.scatter(x_D, y_D, c='green', marker=(5, 2), label=r'$y_{D}$: noisy observations so far')
    
    ind_sorted = np.argsort(x_M, axis=0).flatten()
    plt.plot(x_M[ind_sorted], mu_M[ind_sorted], 'r--', label=r'$\mu_{M}$:predicted mean')
    plt.scatter(x_M[i:i+M], mu_M[i:i+M], c='blue', marker='o', label='Expected values of new predictions')

    UL = mu_M + sd
    LL = mu_M - sd
    plt.fill_between(x_M[ind_sorted].flatten(), UL[ind_sorted].flatten(), LL[ind_sorted].flatten(), facecolor='gray', alpha=0.5)
    
    plt.vlines(x_M[point_of_maxVariance], plt.ylim()[0], plt.ylim()[1], colors='blue',linestyles='--')
    plt.legend(bbox_to_anchor=(1.04,1), loc="upper left")
    plt.show()
    
    #  - Add newly observed y to observation so far
    x_D, y_D = np.vstack((x_D, x_M[i:i+M])), np.vstack((y_D, y_Mobs[i:i+M]))
    
    i += M

plt.plot(errorTrace_RMSE)
plt.title("Error trace")
plt.grid()
plt.show()
[Output: "for observations 5:10" through "for observations 95:100" — one plot per batch showing the truth f, the observations so far, the predicted mean with a ±1 sd band, and the highest-variance new point; followed by the error trace plot.]

Conclusions

  • The predictive variance is lower at points that have already been observed and higher at yet-to-be-observed points.
  • Variance is high at the new predictions.
    • The vertical dashed blue line marks the x-point where the variance is highest among the new predictions.
    • We can greedily explore the areas where the variance is high (a minimal sketch of this follows below).
  • The error between $E(y_{predictions})$ and $y_{observations}$ decreases with more observations, but it won't go to zero because the observations are noisy.
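
The loop above consumes the shuffled observations in a fixed order and only marks the highest-variance point. A greedy variant would instead query the point where the predictive variance is currently largest. Below is a minimal sketch of that idea, under the assumption that any training point may be queried; it reuses Kernel, x_train, y_train, f and sig_noise from above, and its structure is illustrative rather than part of the original notebook.

In [ ]:
# Greedy max-variance exploration (illustrative sketch, not the loop above):
# repeatedly query the candidate point whose posterior std is currently largest.
observed = list(np.random.choice(n, size=5, replace=False))
for step in range(20):
    x_D2, y_D2 = x_train[observed], y_train[observed]
    K_DD = Kernel(x_D2, x_D2, lam=1, lsep=1) + sig_noise**2 * np.eye(len(observed))
    K_MD = Kernel(x_train, x_D2, lam=1, lsep=1)
    K_MM = Kernel(x_train, x_train, lam=1, lsep=1)
    mu = K_MD.dot(np.linalg.solve(K_DD, y_D2))           # posterior mean at all x_train
    Sig = sig_noise**2 * np.eye(n) + K_MM - K_MD.dot(np.linalg.solve(K_DD, K_MD.T))
    sd = np.sqrt(np.diagonal(Sig))                       # posterior std at all x_train
    # pick the as-yet-unobserved point with the highest predictive std
    sd_masked = sd.copy()
    sd_masked[observed] = -np.inf
    observed.append(int(np.argmax(sd_masked)))

plt.plot(x_train, f, c='black', label='f: Truth')
plt.plot(x_train, mu, 'r--', label='predicted mean')
plt.scatter(x_train[observed], y_train[observed], c='green', marker='x', label='queried points')
plt.legend(loc='lower right')
plt.title('Greedy max-variance exploration (sketch)')
plt.show()
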
In [ ]: