Facundo Joel Allia Fernandez
Apr 28, 2023
I am writing this article to share the study I carried out last year as my final postgraduate project in quantitative finance, “Classification of shares through k-means clustering”. The full paper is available on SSRN.
This work presents an approach to using the K-means algorithm for stock classification, with the aim of helping investors diversify their investment portfolios. It is divided into 4 parts:
- Introducing the k-means algorithm
- Clustering of stocks by return and volatility
- Clustering of shares by price-earnings ratio and dividend rate
- 3-dimensional analysis using the k-means++ algorithm by return, volatility, and price-to-book.
The term k-means was first used by MacQueen in 1967, although the idea dates back to Steinhaus in 1957. K-means is an unsupervised classification (clustering) algorithm that groups objects into k groups based on their characteristics.
Clustering is done by minimizing the sum of distances between each object and the centroid of its group or cluster; the squared Euclidean distance is typically used. The algorithm consists of three steps:
- Initialization: once the number of groups, k, has been chosen, k centroids are established in the data space, for example, choosing them randomly.
- Assign objects to centroids: each data object is assigned to its nearest centroid.
- Centroid update: the position of the centroid of each group is updated, taking as the new centroid the position of the average of the objects belonging to said group.
Steps 2 and 3 are repeated until the centroids do not move, or move below a threshold distance at each step.
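The three steps above can be sketched in a few lines of NumPy (a minimal illustration only, not the implementation used later in the article, which relies on scipy and scikit-learn):

```python
import numpy as np

def kmeans_sketch(points, k, n_iter=100, seed=0):
    """Minimal k-means: random init, assign, update, repeat until stable."""
    rng = np.random.default_rng(seed)
    # Step 1: initialization - pick k distinct data points as starting centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        # (empty clusters are not handled in this sketch)
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # centroids stopped moving
            break
        centroids = new_centroids
    return labels, centroids
```

On two well-separated groups of points, this converges to the expected split in a handful of iterations.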
We analyze the S&P 500 index to cluster stocks based on return and volatility. This index comprises 500 large-cap US companies from various sectors, traded on NYSE or Nasdaq. Due to its representation of the US’s largest publicly traded firms, it serves as a suitable dataset for algorithmic k-means clustering.
#Import the libraries that we are going to need to carry out the analysis:
import numpy as np
import pandas as pd
import pandas_datareader as dr
import yfinance as yf
from pylab import plot, show
from matplotlib import pyplot as plt
import plotly.express as px
from numpy.random import rand
from scipy.cluster.vq import kmeans,vq
from math import sqrt
from sklearn.cluster import KMeans
from sklearn import preprocessing
Load Data
We calculate the annualized average return and volatility for each company by obtaining their adjusted closing prices during 01/02/2020–12/02/2022 and inserting them into a dataframe, then annualizing (assuming 252 trading days per year).
# Define the url
sp500_url = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
# Read in the url and scrape ticker data
data_table = pd.read_html(sp500_url)
tickers = data_table[0]['Symbol'].values.tolist()
tickers = [s.replace('\n', '') for s in tickers]
tickers = [s.replace('.', '-') for s in tickers]
tickers = [s.replace(' ', '') for s in tickers]
# Download prices
prices_list = []
for ticker in tickers:
    try:
        # pandas_datareader's Yahoo endpoint no longer works reliably,
        # so we download prices with yfinance instead
        prices = yf.download(ticker, start='2020-01-01')['Adj Close']
        prices = pd.DataFrame(prices)
        prices.columns = [ticker]
        prices_list.append(prices)
    except Exception:
        pass
prices_df = pd.concat(prices_list,axis=1)
prices_df.sort_index(inplace=True)
# Create an empty dataframe
returns = pd.DataFrame()
# Define the column Returns
returns['Returns'] = prices_df.pct_change().mean() * 252
# Define the column Volatility
returns['Volatility'] = prices_df.pct_change().std() * sqrt(252)
Determine the optimal number of clusters
The Elbow curve method is a technique used to determine the optimal number of clusters for K-means clustering. It works by plotting the sum of squared errors (SSE, or inertia) for different values of k (the number of clusters). Since SSE always decreases as k grows, the optimal k is not where SSE is minimal, but at the “elbow”: the point where adding another cluster stops producing a substantial drop in SSE. In this case, the optimal number of clusters is 4.
# Format the data as a numpy array to feed into the K-Means algorithm
data = np.asarray([np.asarray(returns['Returns']),np.asarray(returns['Volatility'])]).T
X = data
distorsions = []
for k in range(2, 20):
    k_means = KMeans(n_clusters=k)
    k_means.fit(X)
    distorsions.append(k_means.inertia_)
fig = plt.figure(figsize=(15, 5))
plt.plot(range(2, 20), distorsions)
plt.grid(True)
plt.title('Elbow curve')
K-means clustering
Once the optimal number of clusters has been defined, we proceed to create them. In the first instance, the centroids are defined using the sklearn library. For the creation of 4 groups of stocks, the K-means algorithm iteratively assigns data points to clusters based on the similarity of their features, in this case Average Annualized Return and Average Annualized Volatility.
# Computing K-Means with K = 4 (4 clusters)
centroids,_ = kmeans(data,4)
# Assign each sample to a cluster
idx,_ = vq(data,centroids)
# Create a dataframe with the tickers and the clusters they belong to
details = [(name,cluster) for name, cluster in zip(returns.index,idx)]
details_df = pd.DataFrame(details)
# Rename columns
details_df.columns = ['Ticker','Cluster']
# Create another dataframe with the tickers and data from each stock
clusters_df = returns.reset_index()
# Bring the clusters information from the dataframe 'details_df'
clusters_df['Cluster'] = details_df['Cluster']
# Rename columns
clusters_df.columns = ['Ticker', 'Returns', 'Volatility', 'Cluster']
The algorithm starts from initial centroids, assigns each data point to its nearest centroid, then recomputes each centroid as the mean of all the data points within its cluster. Points are then reassigned to their nearest centroid, and the process repeats until the centroids remain essentially stable, at which point the algorithm stops and each cluster is assigned a label. The end result is a set of 4 groups, each containing stocks with similar returns and volatility.
# Plot the clusters created using Plotly
fig = px.scatter(clusters_df, x="Returns", y="Volatility", color="Cluster", hover_data=["Ticker"])
fig.update(layout_coloraxis_showscale=False)
fig.show()
Outlier treatment
When creating the clusters, four outliers are detected in the scatter plot. Outliers are data points that differ significantly from the rest of the dataset. They often lead to inaccurate results, since they don't fit the same pattern as the other data points. Therefore, it is important to segregate and remove them to improve the accuracy of the model.
Outlier removal can help the algorithm focus on the most representative data points and reduce the effect of outliers on the results. This can help increase the accuracy of the model and ensure that the data points are grouped correctly. The removed tickers are:
- MRNA
- ENPH
- TSLA
- CEG
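The four tickers above were identified visually on the scatter plot. As an alternative sketch (an assumption on my part, not the approach used in the paper), a simple z-score rule could flag them programmatically:

```python
import pandas as pd

def flag_outliers(df, cols, z_thresh=3.0):
    """Flag rows where any column lies more than z_thresh std devs from its mean."""
    z = (df[cols] - df[cols].mean()) / df[cols].std()
    return z.abs().gt(z_thresh).any(axis=1)

# Usage (hypothetical): outliers = flag_outliers(returns, ['Returns', 'Volatility'])
#                       returns = returns[~outliers]
```

The threshold is a judgment call; 3 standard deviations is a common rule of thumb.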
# Identify and remove the outliers stocks
returns.drop('MRNA',inplace=True)
returns.drop('ENPH',inplace=True)
returns.drop('TSLA',inplace=True)
returns.drop('CEG',inplace=True)
# Recreate data to feed into the algorithm
data = np.asarray([np.asarray(returns['Returns']),np.asarray(returns['Volatility'])]).T
Once the outliers have been eliminated, we repeat the steps performed for clustering using the K-means algorithm to obtain more accurate clusters.
# Computing K-Means with K = 4 (4 clusters)
centroids,_ = kmeans(data,4)
# Assign each sample to a cluster
idx,_ = vq(data,centroids)
# Create a dataframe with the tickers and the clusters they belong to
details = [(name,cluster) for name, cluster in zip(returns.index,idx)]
details_df = pd.DataFrame(details)
# Rename columns
details_df.columns = ['Ticker','Cluster']
# Create another dataframe with the tickers and data from each stock
clusters_df = returns.reset_index()
# Bring the clusters information from the dataframe 'details_df'
clusters_df['Cluster'] = details_df['Cluster']
# Rename columns
clusters_df.columns = ['Ticker', 'Returns', 'Volatility', 'Cluster']
# Plot the clusters created using Plotly
fig = px.scatter(clusters_df, x="Returns", y="Volatility", color="Cluster", hover_data=["Ticker"])
fig.update(layout_coloraxis_showscale=False)
fig.show()
The graph shows 4 clusters generated by the K-means algorithm from 2 variables: average annualized return and average annualized volatility, which together measure a stock's risk and return. The 4 clusters represent 4 groups of stocks with different risk-return profiles over the period under study.
Clustering is useful for identifying peer groups among stocks, differentiating them by their levels of risk and return. Investors looking to diversify could use the 4 clusters to select a mix of stocks with different risk-return profiles based on their investment objectives, reducing portfolio risk by spreading their investment across assets with different risk levels.
According to Zhao Gao:
In machine learning, the development of some algorithms is already quite mature, […] such as the k-Means algorithm. Said algorithm can be applied to investments in variable income and achieve very good results. For example, segregating two types of companies in the market. On the one hand, mature companies or “value stocks” generally have low P/E ratios and high dividend rates. The second category, “growth” companies are companies with broad development prospects, but also uncertainties in the future, generally have high P/E ratios and low dividend rates. If you can accurately distinguish between blue-chip stocks and high-growth stocks in the market, you can provide a good benchmark for investors. (Zhao Gao 2020).
Following this conceptual line, it is possible to apply a clustering similar to the one carried out previously, exchanging the variables Average Annualized Return and Average Annualized Volatility for PER (Price-Earnings Ratio) and Dividend Rate (Dividend Yield). In this way, we could differentiate between “value” companies and “growth” companies.
Load Data
The trailing price-to-earnings (P/E) ratio is a relative valuation multiple based on the last 12 months of actual earnings. It is calculated by taking the current share price and dividing it by the earnings per share (EPS) for the last 12 months. The dividend rate, or dividend yield, is the amount of cash a company returns to its shareholders annually, expressed as a percentage of the company's market value.
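As a quick numeric check (illustrative figures, not real market data; note that in Yahoo Finance data the dividend rate is typically quoted as a dollar amount per share, while the yield divides it by price):

```python
# Illustrative figures, not real market data
price = 120.0          # current share price
eps_ttm = 6.0          # earnings per share over the trailing 12 months
annual_dividend = 3.0  # cash dividend paid per share over the year

trailing_pe = price / eps_ttm             # 120 / 6 = 20.0
dividend_yield = annual_dividend / price  # 3 / 120 = 0.025, i.e. 2.5%
```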
# Download trailingPE, dividendRate and marketCap
trailingPE_list = []
dividendRate_list = []
marketCap_list = []
for t in tickers:
    tick = yf.Ticker(t)
    ticker_info = tick.info
    try:
        trailingPE_list.append(ticker_info['trailingPE'])
    except KeyError:
        trailingPE_list.append('na')
    try:
        dividendRate_list.append(ticker_info['dividendRate'])
    except KeyError:
        # Append np.nan (not the string 'na') so that fillna(0) below works
        dividendRate_list.append(np.nan)
    # marketCap is needed later for the normalized features
    marketCap_list.append(ticker_info.get('marketCap', np.nan))
# Create a dataframe to contain the data
sp_features_df = pd.DataFrame()
# Add the ticker, trailingPE, dividendRate and marketCap data
sp_features_df['Ticker'] = tickers
sp_features_df['trailingPE'] = trailingPE_list
sp_features_df['dividendRate'] = dividendRate_list
sp_features_df['marketCap'] = marketCap_list
# Shares with a missing dividend rate pay no dividend, so assign 0 in those cases
sp_features_df["dividendRate"] = sp_features_df["dividendRate"].fillna(0)
# filter shares with 'na' as trailingPE
df_mask = sp_features_df['trailingPE'] != 'na'
sp_features_df = sp_features_df[df_mask]
# Convert trailingPE numbers to float type
sp_features_df['trailingPE'] = sp_features_df['trailingPE'].astype(float)
# Removes the rows that contains NULL values
sp_features_df=sp_features_df.dropna()
Determine the optimal number of clusters
Once the Price-Earnings Ratio and Dividend Rate data have been obtained, we can reapply the Elbow method to determine the optimal number of clusters.
# Format the data as a numpy array to feed into the K-Means algorithm
data = np.asarray([np.asarray(sp_features_df['trailingPE']),np.asarray(sp_features_df['dividendRate'])]).T
X = data
distorsions = []
for k in range(2, 20):
    k_means = KMeans(n_clusters=k)
    k_means.fit(X)
    distorsions.append(k_means.inertia_)
fig = plt.figure(figsize=(15, 5))
plt.plot(range(2, 20), distorsions)
plt.grid(True)
plt.title('Elbow curve')
The optimal number of clusters is 3.
K-means clustering
Once the optimal number of clusters has been defined, we proceed to create them. In the first instance, the centroids are defined using the sklearn library. For the creation of groups of stocks, the K-means algorithm iteratively assigns data points to clusters based on the similarity of their features, in this case Price-Earnings Ratio and Dividend Rate.
# Computing K-Means with K = 3 (3 clusters)
centroids,_ = kmeans(data,3)
# Assign each sample to a cluster
idx,_ = vq(data,centroids)
# Create the clusters from the numpy array 'data'
cluster_1 = data[idx==0,0],data[idx==0,1]
cluster_2 = data[idx==1,0],data[idx==1,1]
cluster_3 = data[idx==2,0],data[idx==2,1]
# Create a dataframe with the tickers and the clusters they belong to
details = [(name, cluster) for name, cluster in zip(sp_features_df['Ticker'], idx)]
details_df = pd.DataFrame(details)
# Rename columns
details_df.columns = ['Ticker','Cluster']
# Create another dataframe with the tickers and data from each stock
clusters_df = sp_features_df
# Bring the clusters information from the dataframe 'details_df'
clusters_df['Cluster'] = details_df['Cluster']
# Rename columns
clusters_df.columns = ['Ticker', 'trailingPE', 'dividendRate', 'marketCap', 'Cluster']
# Plot the clusters created using Plotly
fig = px.scatter(clusters_df, x="dividendRate", y="trailingPE", color="Cluster", hover_data=["Ticker"])
fig.update(layout_coloraxis_showscale=False)
fig.show()
A first clustering by trailing price-to-earnings (P/E) and dividend rate makes the presence of outliers and excessive dispersion among the observations evident, so we proceed to filter the stocks and normalize the data to eliminate these distortions.
Outlier treatment
We see as a result a scattered and unclear clustering. Therefore, eliminating outliers and normalizing the data will be necessary to achieve more accurate clusters.
First, we apply a filter to include only stocks with price-to-earnings less than 200 and dividend rate less than 5.
df_mask = (sp_features_df['trailingPE'] < 200) & (sp_features_df['dividendRate'] < 5)
sp_features_df = sp_features_df[df_mask]
Then, we apply MaxAbsScaler. As the scikit-learn documentation describes, MaxAbsScaler scales each feature by its maximum absolute value: each feature is scaled individually such that its maximal absolute value in the training set is 1. It does not shift/center the data, and thus does not destroy any sparsity.
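A tiny numeric illustration of what the scaler does (hypothetical values):

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

x = np.array([[1.0], [2.0], [-4.0]])
# Each value is divided by max(|x|) = 4
scaled = MaxAbsScaler().fit_transform(x)
# scaled is [[0.25], [0.5], [-1.0]]
```

Because only the scale changes (no shifting), zeros stay zeros and the relative ordering of values is preserved.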
# Create a MaxAbsScaler instance
max_abs_scaler = preprocessing.MaxAbsScaler()
# Extract the 'trailingPE' column and reshape it to a column vector
trailingPE_array = np.array(sp_features_df['trailingPE'].values).reshape(-1,1)
# Extract the 'dividendRate' column and reshape it to a column vector
dividendRate_array = np.array(sp_features_df['dividendRate'].values).reshape(-1,1)
# Extract the 'marketCap' column and reshape it to a column vector
marketCap_array = np.array(sp_features_df['marketCap'].values).reshape(-1,1)
# Apply the MaxAbsScaler and store the normalized values in new columns
sp_features_df['trailingPE_norm'] = max_abs_scaler.fit_transform(trailingPE_array)
sp_features_df['dividendRate_norm'] = max_abs_scaler.fit_transform(dividendRate_array)
sp_features_df['marketCap_norm'] = max_abs_scaler.fit_transform(marketCap_array)
Once MaxAbsScaler has been applied, we perform the Elbow method again with the normalized variables as input:
# Format the data as a numpy array to feed into the K-Means algorithm
data = np.asarray([np.asarray(sp_features_df['trailingPE_norm']),np.asarray(sp_features_df['dividendRate_norm'])]).T
X = data
distorsions = []
for k in range(2, 20):
    k_means = KMeans(n_clusters=k)
    k_means.fit(X)
    distorsions.append(k_means.inertia_)
fig = plt.figure(figsize=(15, 5))
plt.plot(range(2, 20), distorsions)
plt.grid(True)
plt.title('Elbow curve')
Once the pertinent modifications have been made, it is possible to obtain 4 clusters generated by the K-means algorithm according to the trailing price-to-earnings (P/E) and dividend rate of each share.
# Computing K-Means with K = 4 (4 clusters)
centroids,_ = kmeans(data,4)
# Assign each sample to a cluster
idx,_ = vq(data,centroids)
# Create a dataframe with the tickers and the clusters they belong to
details = [(name, cluster) for name, cluster in zip(sp_features_df['Ticker'], idx)]
details_df = pd.DataFrame(details)
clusters_df = pd.DataFrame()
clusters_df['Ticker'] = sp_features_df['Ticker']
clusters_df['trailingPE_norm'] = sp_features_df['trailingPE_norm']
clusters_df['dividendRate_norm'] = sp_features_df['dividendRate_norm']
clusters_df['marketCap_norm'] = sp_features_df['marketCap_norm']
clusters_df['Cluster'] = details_df[1].values
# Plot the clusters created using Plotly
fig = px.scatter(clusters_df, x="dividendRate_norm", y="trailingPE_norm", color="Cluster", hover_data=["Ticker"])
fig.update(layout_coloraxis_showscale=False)
fig.show()
It is possible to verify graphically that the algorithm gave greater weight to the dividend rate variable when creating the clusters. In this way, 4 sets of stocks are distinguished: C1 with a null or very low dividend rate, C2 with a low one, C3 with a medium-high one, and C4 with a high one.
We can extend the analysis of S&P 500 stocks by applying k-means++ clustering. This variant uses a smarter initialization of the centroids, which improves the quality of the clustering; apart from initialization, it is identical to the standard K-means algorithm.
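The seeding step can be sketched as: pick the first centroid uniformly at random, then pick each subsequent centroid with probability proportional to its squared distance from the nearest centroid chosen so far (a simplified illustration, not scikit-learn's exact implementation):

```python
import numpy as np

def kmeanspp_init(points, k, seed=0):
    """k-means++ seeding: spread the initial centroids apart."""
    rng = np.random.default_rng(seed)
    # First centroid: uniform random choice among the data points
    centroids = [points[rng.integers(len(points))]]
    for _ in range(k - 1):
        # Squared distance from each point to its nearest chosen centroid
        d2 = np.min([((points - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        # Sample the next centroid with probability proportional to d2
        # (identical points get probability 0; all-identical data is not handled)
        probs = d2 / d2.sum()
        centroids.append(points[rng.choice(len(points), p=probs)])
    return np.array(centroids)
```

Because distant points are far more likely to be chosen, the initial centroids tend to land in different clusters, which is exactly what plain random initialization cannot guarantee.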
It is also possible to use 3 variables for clustering. In K-means, points are grouped by the distance between them, so extending the clustering from two dimensions to three simply means adding a third coordinate to each data point and including it in the distance calculation. Clusters are then sets of points that are close in all three dimensions rather than two, which can give better separation and a better fit to the distribution of the data.
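For example, the Euclidean distance between two stocks described by (return, volatility, price-to-book) triples simply gains a third squared term (illustrative numbers):

```python
import numpy as np

p = np.array([0.12, 0.25, 3.0])  # (return, volatility, price-to-book) of stock A
q = np.array([0.10, 0.30, 5.0])  # same features for stock B
dist = np.sqrt(((p - q) ** 2).sum())  # sqrt(0.02**2 + 0.05**2 + 2.0**2)
```

Note that the raw price-to-book term dominates here, which is why normalizing the features (as done above with MaxAbsScaler) matters before clustering.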
Load Data
As in the first application, the Average Annualized Return and Average Annualized Volatility data are selected, but now the Price to Book variable is also added for the 3-dimensional analysis:
# Download priceToBook and marketCap
priceToBook_list = []
marketCap_list = []
tickers_clean = []
for t in tickers:
    try:
        tick = yf.Ticker(t)
        ticker_info = tick.info
        priceToBook_list.append(ticker_info['priceToBook'])
        marketCap_list.append(ticker_info['marketCap'])
        tickers_clean.append(t)
    except Exception:
        print('The stock ticker {} is not in the database'.format(t))
# Create a datafrane to contain the data
priceToBook_df = pd.DataFrame()
# Add the ticker, priceToBook and marketCap data
priceToBook_df['Ticker'] = tickers_clean
priceToBook_df['priceToBook'] = priceToBook_list
priceToBook_df['marketCap'] = marketCap_list
# Merge the return/volatility data from the first analysis with the new features
returns_df = returns.reset_index()
returns_df.columns = ['Ticker', 'Returns', 'Volatility']
clusters3d_df = pd.merge(returns_df, priceToBook_df, on='Ticker')
# Remove the rows that contain NULL values
clusters3d_df.dropna(inplace=True)
# Order columns
clusters3d_df = clusters3d_df[['Ticker', 'marketCap', 'Returns', 'Volatility', 'priceToBook']]
Determine the optimal number of clusters
Once the Average Annualized Return, Average Annualized Volatility and Price to Book data have been obtained, we can reapply the Elbow method to determine the optimal number of clusters.
The optimal number of clusters is 3.
K-means clustering
Once the optimum number of clusters has been defined, we proceed to create them. In the first instance, the centroids are defined using the sklearn library. For the creation of groups of actions, the K-means algorithm iteratively assigns data points to the groups based on their similarity of characteristics, or “features”, in this case, Average Annualized Return, Average Annualized Volatility, and Price to Book.
# Format the 3 variables as a numpy array to feed into the algorithm
data3d = np.asarray([np.asarray(clusters3d_df['Returns']), np.asarray(clusters3d_df['Volatility']), np.asarray(clusters3d_df['priceToBook'])]).T
X = data3d
# Create clusters
k_means_optimum = KMeans(n_clusters = 3, init = 'k-means++', random_state=42)
y = k_means_optimum.fit_predict(X)
# Plot 3D graph with plotly
clusters3d_df['cluster'] = y
fig = px.scatter_3d(clusters3d_df, x='Returns', y='Volatility', z='priceToBook',
                    color='cluster', hover_data=["Ticker"])
fig.show()
Outlier treatment
Again, we note the presence of outliers. We remove them individually and repeat the steps to build the clusters again:
# Identify and remove the outliers stocks
clusters3d_df.drop(clusters3d_df[(clusters3d_df['Ticker'] == 'HD')].index, inplace=True)
clusters3d_df.drop(clusters3d_df[(clusters3d_df['Ticker'] == 'CL')].index, inplace=True)
# Recreate data to feed into the algorithm
data3d = np.asarray([np.asarray(clusters3d_df['Returns']), np.asarray(clusters3d_df['Volatility']), np.asarray(clusters3d_df['priceToBook'])]).T
X = data3d
# Elbow method
distorsions = []
for i in range(1, 20):
    k_means = KMeans(n_clusters=i, init='k-means++', random_state=42)
    k_means.fit(X)
    distorsions.append(k_means.inertia_)
#plot elbow curve
fig = plt.figure(figsize=(15, 5))
plt.plot(np.arange(1,20),distorsions)
plt.xlabel('Clusters')
plt.ylabel('SSE')
plt.title('Elbow curve')
plt.grid(True)
plt.show()
Once the pertinent modifications have been made, it is possible to obtain 3 clusters generated by the K-means++ algorithm according to the Average Annualized Return, Average Annualized Volatility, and Price to Book of each share.
# Initialize KMeans model with 3 clusters
k_means_optimum = KMeans(n_clusters = 3, init = 'k-means++', random_state=42)
# Cluster data using KMeans and store labels in 'y'
y = k_means_optimum.fit_predict(X)
# Add 'cluster' column to dataframe with cluster labels from 'y'
clusters3d_df['cluster'] = y
# Create 3D scatter plot with plotly express, color-coded by cluster label and with 'Ticker' tooltip
fig = px.scatter_3d(clusters3d_df, x='priceToBook', y='Returns', z='Volatility', color='cluster', hover_data=["Ticker"])
# Display plot
fig.show()
Finally, clustering with the K-means++ algorithm yields 3 sets of stocks grouped by the 3 variables under study (Annualized Average Return, Annualized Average Volatility, and Price to Book). The presence of shares with high Price to Book ratios is striking. This information can be useful when building an investment portfolio.