Stock classification using k-means clustering (2024)

Facundo Joel Allia Fernandez

Apr 28, 2023

I am writing this article to share the study I carried out last year for my final postgraduate project in quantitative finance, “Classification of shares through k-means clustering”. To see the full paper, visit SSRN.

This work presents an approach to using the k-means algorithm for stock classification, with the aim of helping investors diversify their investment portfolios. It is divided into 4 parts:

  1. Introducing the k-means algorithm
  2. Clustering of stocks by return and volatility
  3. Clustering of shares by price-earnings ratio and dividend rate
  4. 3-dimensional analysis using the k-means++ algorithm by return, volatility, and price-to-book ratio.

The term k-means was first used by MacQueen in 1967, although the idea dates back to Steinhaus in 1957. K-means is an unsupervised classification (clustering) algorithm that groups objects into k groups based on their characteristics.

Clustering is done by minimizing the sum of distances between each object and the centroid of its group or cluster. The squared Euclidean distance is often used. The algorithm consists of three steps:

  1. Initialization: once the number of groups, k, has been chosen, k centroids are established in the data space, for example, choosing them randomly.
  2. Assign objects to centroids: each data object is assigned to its nearest centroid.
  3. Centroid update: the position of the centroid of each group is updated, taking as the new centroid the position of the average of the objects belonging to said group.

Steps 2 and 3 are repeated until the centroids do not move, or move below a threshold distance at each step.
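
The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used in the rest of the article (which relies on SciPy and scikit-learn); the function name kmeans_sketch, the toy data, and the stopping threshold are assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, tol=1e-6, seed=0):
    """Plain k-means iterations on an (n_samples, n_features) array."""
    rng = np.random.default_rng(seed)
    # Step 1: initialization, here by picking k distinct data points at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid (squared Euclidean distance)
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its points (keep it if the cluster is empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop when no centroid moves more than the threshold distance
        shift = np.linalg.norm(new_centroids - centroids, axis=1).max()
        centroids = new_centroids
        if shift < tol:
            break
    return centroids, labels

# Toy data: two well-separated groups of points
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, size=(50, 2)),
               rng.normal(5.0, 0.1, size=(50, 2))])
centroids, labels = kmeans_sketch(X, k=2)
print(centroids)
```

On this toy data the two centroids should settle near (0, 0) and (5, 5), one per group.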

We analyze the S&P 500 index to cluster stocks based on return and volatility. This index comprises 500 large-cap US companies from various sectors, traded on NYSE or Nasdaq. Due to its representation of the US’s largest publicly traded firms, it serves as a suitable dataset for algorithmic k-means clustering.

#Import the libraries that we are going to need to carry out the analysis:
import numpy as np
import pandas as pd
import pandas_datareader as dr
import yfinance as yf

from pylab import plot,show
from matplotlib import pyplot as plt
import plotly.express as px

from numpy.random import rand
from scipy.cluster.vq import kmeans,vq
from math import sqrt
from sklearn.cluster import KMeans
from sklearn import preprocessing

Load Data

We calculate the average annual return and volatility for each company: we obtain the adjusted closing prices for 01/02/2020–12/02/2022, insert them into a dataframe, and annualize the mean and standard deviation of the daily returns (assuming 252 market days per year).

# Define the url
sp500_url = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'

# Read in the url and scrape ticker data
data_table = pd.read_html(sp500_url)
tickers = data_table[0]['Symbol'].values.tolist()
tickers = [s.replace('\n', '') for s in tickers]
tickers = [s.replace('.', '-') for s in tickers]
tickers = [s.replace(' ', '') for s in tickers]

# Download prices
prices_list = []
for ticker in tickers:
    try:
        prices = dr.DataReader(ticker,'yahoo','01/01/2020')['Adj Close']
        prices = pd.DataFrame(prices)
        prices.columns = [ticker]
        prices_list.append(prices)
    except Exception:
        # Skip tickers whose prices could not be downloaded
        pass
prices_df = pd.concat(prices_list,axis=1)
prices_df.sort_index(inplace=True)

# Create an empty dataframe
returns = pd.DataFrame()

# Define the column Returns
returns['Returns'] = prices_df.pct_change().mean() * 252

# Define the column Volatility
returns['Volatility'] = prices_df.pct_change().std() * sqrt(252)

Determine the optimal number of clusters

The elbow curve method is a technique used to determine the optimal number of clusters for k-means clustering. The method works by plotting the sum of squared errors (SSE) for different values of k (number of clusters). The SSE always decreases as k grows, so the optimal number of clusters is not where the SSE is lowest but at the elbow: the value of k beyond which the SSE starts to decrease at a much slower rate. In this case, the optimal number of clusters is 4.

# Format the data as a numpy array to feed into the K-Means algorithm
data = np.asarray([np.asarray(returns['Returns']),np.asarray(returns['Volatility'])]).T
X = data
distorsions = []
for k in range(2, 20):
    k_means = KMeans(n_clusters=k)
    k_means.fit(X)
    distorsions.append(k_means.inertia_)
fig = plt.figure(figsize=(15, 5))

plt.plot(range(2, 20), distorsions)
plt.grid(True)
plt.title('Elbow curve')
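
To make the plotted quantity concrete, the following sketch recomputes the SSE by hand and compares it with the inertia_ attribute of scikit-learn's KMeans. The data are synthetic stand-ins for the return and volatility matrix (an assumption made so the example runs without downloading prices), and the variable names are illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the (return, volatility) matrix: three loose groups of points
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.05, size=(40, 2)) for loc in (0.0, 0.5, 1.0)])

sse = {}
manual_sse = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # SSE: squared distance from every point to the centroid of its own cluster
    manual_sse[k] = ((X - model.cluster_centers_[model.labels_]) ** 2).sum()
    sse[k] = model.inertia_

# The SSE keeps falling as k grows; the elbow is where the drop flattens out
drops = {k: sse[k - 1] - sse[k] for k in range(2, 7)}
print(sse)
print(drops)
```

With three planted groups, the drop in SSE should be large up to k = 3 and small afterwards, which is the pattern read off the elbow curve.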

K-means clustering

Once the optimum number of clusters has been defined, we proceed to create them. First, the centroids are computed with the kmeans function from SciPy's scipy.cluster.vq module. To create the 4 groups of stocks, the k-means algorithm iteratively assigns data points to the groups based on the similarity of their characteristics, or “features”: in this case, Average Annualized Return and Average Annualized Volatility.

# Computing K-Means with K = 4 (4 clusters)
centroids,_ = kmeans(data,4)

# Assign each sample to a cluster
idx,_ = vq(data,centroids)

# Create a dataframe with the tickers and the cluster each one belongs to
details = [(name,cluster) for name, cluster in zip(returns.index,idx)]
details_df = pd.DataFrame(details)

# Rename columns
details_df.columns = ['Ticker','Cluster']

# Create another dataframe with the tickers and data from each stock
clusters_df = returns.reset_index()

# Bring the clusters information from the dataframe 'details_df'
clusters_df['Cluster'] = details_df['Cluster']

# Rename columns
clusters_df.columns = ['Ticker', 'Returns', 'Volatility', 'Cluster']

The algorithm starts from k initial centroids, assigns each data point to its nearest centroid, and then recalculates the centroid of each cluster as the mean of all the data points within it. It then compares the data points to the updated centroids and reassigns them accordingly. This process is repeated until the centroid of each cluster remains relatively stable, at which point the algorithm stops and each cluster is assigned a label. The end result is a set of 4 groups, each containing stocks that have similar returns and volatility.

# Plot the clusters created using Plotly
fig = px.scatter(clusters_df, x="Returns", y="Volatility", color="Cluster", hover_data=["Ticker"])
fig.update(layout_coloraxis_showscale=False)
fig.show()

Outlier treatment

When creating the clusters, four outliers are detected in the scatter plot. Outliers are data points that differ significantly from the rest of the data set. They can distort the results of an algorithm, since they don’t fit the same pattern as the other data points. Therefore, it is important to identify and remove outliers to improve the accuracy of the model.

Outlier removal can help the algorithm focus on the most representative data points and reduce the effect of outliers on the results. This can help increase the accuracy of the model and ensure that the data points are grouped correctly. The removed tickers are:

  • MRNA
  • ENPH
  • TSLA
  • CEG
# Identify and remove the outlier stocks
returns.drop(['MRNA', 'ENPH', 'TSLA', 'CEG'], inplace=True)

# Recreate data to feed into the algorithm
data = np.asarray([np.asarray(returns['Returns']),np.asarray(returns['Volatility'])]).T
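
The four tickers above were picked out from the scatter plot. As one possible alternative, and not the procedure used in this study, outliers could be flagged automatically with a z-score rule. In the sketch below the table demo, its made-up values, and the threshold of 3 standard deviations are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical table shaped like 'returns' (index = ticker); values are made up
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'Returns': rng.normal(0.10, 0.05, size=50),
    'Volatility': rng.normal(0.30, 0.05, size=50),
}, index=[f'STK{i}' for i in range(50)])
demo.loc['STK0'] = [1.50, 1.20]  # plant one extreme stock

# z-score of every value relative to its own column
z = (demo - demo.mean()) / demo.std()

# Flag a stock if any feature is more than 3 standard deviations from the mean
threshold = 3
outliers = demo.index[(z.abs() > threshold).any(axis=1)]
print(list(outliers))

# Drop the flagged rows in one call
demo_clean = demo.drop(outliers)
```

A rule like this is sensitive to the chosen threshold, so it is probably best treated as a first filter to be checked against the plot.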

Once the outliers have been eliminated, we repeat the steps performed for clustering using the K-means algorithm to obtain more accurate clusters.

# Computing K-Means with K = 4 (4 clusters)
centroids,_ = kmeans(data,4)

# Assign each sample to a cluster
idx,_ = vq(data,centroids)

# Create a dataframe with the tickers and the cluster each one belongs to
details = [(name,cluster) for name, cluster in zip(returns.index,idx)]
details_df = pd.DataFrame(details)

# Rename columns
details_df.columns = ['Ticker','Cluster']

# Create another dataframe with the tickers and data from each stock
clusters_df = returns.reset_index()

# Bring the clusters information from the dataframe 'details_df'
clusters_df['Cluster'] = details_df['Cluster']

# Rename columns
clusters_df.columns = ['Ticker', 'Returns', 'Volatility', 'Cluster']

# Plot the clusters created using Plotly
fig = px.scatter(clusters_df, x="Returns", y="Volatility", color="Cluster", hover_data=["Ticker"])
fig.update(layout_coloraxis_showscale=False)
fig.show()

The graph shows the 4 clusters generated by the k-means algorithm with 2 variables: average annualized return and average annualized volatility. These variables are used to measure the risk and return of a stock. The 4 clusters represent 4 groups of stocks with different levels of risk and return in the period under study.

Clustering is useful for identifying peer groups among stocks, thus allowing differentiation between stocks with different levels of risk and return. This is useful for investors looking to diversify their investment portfolios.

Investors could use the 4 clusters to select a mix of stocks with different levels of risk and return based on their investment objectives. This will help them diversify their portfolio and reduce the risk of their investment since they will be investing in a variety of assets with different levels of risk.

According to Zhao Gao:

In machine learning, the development of some algorithms is already quite mature, […] such as the k-Means algorithm. Said algorithm can be applied to investments in variable income and achieve very good results. For example, segregating two types of companies in the market. On the one hand, mature companies or “value stocks” generally have low P/E ratios and high dividend rates. The second category, “growth” companies are companies with broad development prospects, but also uncertainties in the future, generally have high P/E ratios and low dividend rates. If you can accurately distinguish between blue-chip stocks and high-growth stocks in the market, you can provide a good benchmark for investors. (Zhao Gao 2020).

Following this conceptual line, it is possible to apply a clustering similar to the one carried out previously, exchanging the variables Average Annualized Return and Average Annualized Volatility for PER (Price-Earnings Ratio) and Dividend Rate (Dividend Yield). In this way, we could differentiate between “value” companies and “growth” companies.

Load Data

The trailing price-to-earnings (P/E) ratio is a relative valuation multiple based on the last 12 months of actual earnings. It is calculated by dividing the current share price by the earnings per share (EPS) for the last 12 months. The dividend rate, or dividend yield, is the amount of cash that a company returns to its shareholders annually as a percentage of the company’s market value.
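
As a small worked example of the two definitions, the sketch below computes both figures for a single company. All numbers are hypothetical and chosen only to keep the arithmetic easy to follow; the dividend is expressed per share, so the percentage is taken relative to the share price.

```python
# Hypothetical figures for a single company (not real market data)
share_price = 50.0           # current price per share
eps_last_12_months = 2.5     # earnings per share over the trailing 12 months
annual_dividend = 1.0        # cash dividend per share paid over a year

# Trailing P/E: price divided by trailing 12-month EPS
trailing_pe = share_price / eps_last_12_months

# Dividend yield: annual dividend per share as a percentage of the share price
dividend_yield_pct = annual_dividend / share_price * 100

print(trailing_pe, dividend_yield_pct)
```

Here the company trades at 20 times its trailing earnings and pays out 2% of its price in dividends per year.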

# Download trailingPE and dividendRate
trailingPE_list = []
dividendRate_list = []

for t in tickers:

    tick = yf.Ticker(t)
    ticker_info = tick.info

    # Store 'na' when a field is missing for this ticker
    try:
        trailingPE = ticker_info['trailingPE']
        trailingPE_list.append(trailingPE)
    except Exception:
        trailingPE_list.append('na')

    try:
        dividendRate = ticker_info['dividendRate']
        dividendRate_list.append(dividendRate)
    except Exception:
        dividendRate_list.append('na')

# Create a dataframe to contain the data
sp_features_df = pd.DataFrame()

# Add the ticker, trailingPE and dividendRate data
sp_features_df['Ticker'] = tickers
sp_features_df['trailingPE'] = trailingPE_list
sp_features_df['dividendRate'] = dividendRate_list

# Shares with 'na' as dividend rate pay no dividend, so we assign 0 in these cases
# ('na' is a string, so it is first converted to NaN and then filled with 0)
sp_features_df['dividendRate'] = pd.to_numeric(sp_features_df['dividendRate'], errors='coerce').fillna(0)

# Filter out shares with 'na' as trailingPE
df_mask = sp_features_df['trailingPE'] != 'na'
sp_features_df = sp_features_df[df_mask].copy()

# Convert trailingPE numbers to float type
sp_features_df['trailingPE'] = sp_features_df['trailingPE'].astype(float)

# Remove the rows that contain NULL values
sp_features_df = sp_features_df.dropna()

Determine the optimal number of clusters

Once the Price-Earnings Ratio and Dividend Rate data have been obtained, we can reapply the elbow method to determine the optimal number of clusters.

# Format the data as a numpy array to feed into the K-Means algorithm
data = np.asarray([np.asarray(sp_features_df['trailingPE']),np.asarray(sp_features_df['dividendRate'])]).T
X = data
distorsions = []
for k in range(2, 20):
    k_means = KMeans(n_clusters=k)
    k_means.fit(X)
    distorsions.append(k_means.inertia_)
fig = plt.figure(figsize=(15, 5))

plt.plot(range(2, 20), distorsions)
plt.grid(True)
plt.title('Elbow curve')

The optimal number of clusters is 3.

K-means clustering

Once the optimum number of clusters has been defined, we proceed to create them. First, the centroids are computed with the kmeans function from SciPy's scipy.cluster.vq module. To create the groups of stocks, the k-means algorithm iteratively assigns data points to the groups based on the similarity of their characteristics, or “features”: in this case, Price-Earnings Ratio and Dividend Rate.

# Computing K-Means with K = 3 (3 clusters)
centroids,_ = kmeans(data,3)

# Assign each sample to a cluster
idx,_ = vq(data,centroids)

# Create the clusters from the numpy array 'data'
cluster_1 = data[idx==0,0],data[idx==0,1]
cluster_2 = data[idx==1,0],data[idx==1,1]
cluster_3 = data[idx==2,0],data[idx==2,1]

# Create a dataframe with the tickers and the cluster each one belongs to
details = [(name,cluster) for name, cluster in zip(sp_features_df['Ticker'],idx)]
details_df = pd.DataFrame(details)

# Rename columns
details_df.columns = ['Ticker','Cluster']

# Create another dataframe with the tickers and data from each stock
clusters_df = sp_features_df.copy()

# Bring the cluster information from 'details_df'
# (by position, because the index of 'sp_features_df' has gaps after filtering)
clusters_df['Cluster'] = details_df['Cluster'].values

# Rename columns
clusters_df.columns = ['Ticker', 'trailingPE', 'dividendRate', 'Cluster']

# Plot the clusters created using Plotly
fig = px.scatter(clusters_df, x="dividendRate", y="trailingPE", color="Cluster", hover_data=["Ticker"])
fig.update(layout_coloraxis_showscale=False)
fig.show()

A first clustering by trailing price-to-earnings (P/E) and dividend rate reveals outliers and excessive dispersion among the observations, so we proceed to filter the stocks and normalize the data to eliminate these distortions.

Outlier treatment

We see as a result a scattered and unclear clustering. Therefore, eliminating outliers and normalizing the data will be necessary to achieve more accurate clusters.

First, we apply a filter to include only stocks with price-to-earnings less than 200 and dividend rate less than 5.

df_mask = (sp_features_df['trailingPE'] < 200) & (sp_features_df['dividendRate'] < 5)
sp_features_df = sp_features_df[df_mask].copy()

Then, we apply MaxAbsScaler, which scales each feature individually by its maximum absolute value, so that the maximal absolute value of each feature in the training set is 1. It does not shift/center the data and thus does not destroy any sparsity.

# Create the MaxAbsScaler
max_abs_scaler = preprocessing.MaxAbsScaler()

# Extract the 'trailingPE' column and reshape it to a column vector
trailingPE_array = np.array(sp_features_df['trailingPE'].values).reshape(-1,1)

# Extract the 'dividendRate' column and reshape it to a column vector
dividendRate_array = np.array(sp_features_df['dividendRate'].values).reshape(-1,1)

# Apply the MaxAbsScaler and store the normalized values in new columns
sp_features_df['trailingPE_norm'] = max_abs_scaler.fit_transform(trailingPE_array).ravel()
sp_features_df['dividendRate_norm'] = max_abs_scaler.fit_transform(dividendRate_array).ravel()
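
To see what the scaler does to a single column, the sketch below compares MaxAbsScaler with the same operation written by hand. The input values are made up for the example, and one of them is negative to show that the sign is preserved.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# Made-up P/E-like values, including a negative one to show the sign is preserved
values = np.array([[12.0], [-30.0], [60.0], [150.0]])

scaled = MaxAbsScaler().fit_transform(values)

# The same result by hand: divide by the largest absolute value in the column
by_hand = values / np.abs(values).max(axis=0)

print(scaled.ravel())
print(np.allclose(scaled, by_hand))
```

Because every column is divided by its own maximum, P/E and dividend rate end up on comparable scales before distances are computed.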

Once MaxAbsScaler is applied, we perform the Elbow method again with the normalized variables as input:

# Format the data as a numpy array to feed into the K-Means algorithm
data = np.asarray([np.asarray(sp_features_df['trailingPE_norm']),np.asarray(sp_features_df['dividendRate_norm'])]).T
X = data
distorsions = []
for k in range(2, 20):
    k_means = KMeans(n_clusters=k)
    k_means.fit(X)
    distorsions.append(k_means.inertia_)
fig = plt.figure(figsize=(15, 5))

plt.plot(range(2, 20), distorsions)
plt.grid(True)
plt.title('Elbow curve')

Once the pertinent modifications have been made, it is possible to obtain 4 clusters generated by the K-means algorithm according to the trailing price-to-earnings (P/E) and dividend rate of each share.

# Computing K-Means with K = 4 (4 clusters)
centroids,_ = kmeans(data,4)

# Assign each sample to a cluster
idx,_ = vq(data,centroids)

# Create a dataframe with the tickers and the cluster each one belongs to
details = [(name,cluster) for name, cluster in zip(sp_features_df['Ticker'],idx)]
details_df = pd.DataFrame(details)

clusters_df = pd.DataFrame()
clusters_df['Ticker'] = sp_features_df['Ticker']
clusters_df['trailingPE_norm'] = sp_features_df['trailingPE_norm']
clusters_df['dividendRate_norm'] = sp_features_df['dividendRate_norm']
clusters_df['Cluster'] = details_df[1].values

# Plot the clusters created using Plotly
fig = px.scatter(clusters_df, x="dividendRate_norm", y="trailingPE_norm", color="Cluster", hover_data=["Ticker"])
fig.update(layout_coloraxis_showscale=False)
fig.show()

The plot shows that the dividend rate variable carried the greater weight when the algorithm created the clusters. In this way, 4 sets of stocks are distinguished: C1 with a null or very low dividend rate, C2 with a low one, C3 with a medium-high one, and C4 with a high one.

We can extend the analysis of S&P 500 stocks by applying k-means++ clustering. This algorithm initializes the centroids more intelligently, which improves the quality of the clustering. Other than the initialization, it is the same as the standard k-means algorithm.
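
The initialization idea can be sketched as follows: the first centroid is a random data point, and each further centroid is drawn with probability proportional to its squared distance from the nearest centroid already chosen. This is a simplified illustration of the basic seeding rule, not scikit-learn's exact implementation; the function name kmeans_pp_init and the toy data are assumptions.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Pick k initial centroids from the rows of X with k-means++ style seeding."""
    rng = np.random.default_rng(seed)
    # First centroid: a data point chosen uniformly at random
    centroids = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        # Squared distance from every point to its nearest already-chosen centroid
        diffs = X[:, None, :] - np.array(centroids)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        # Next centroid: a data point drawn with probability proportional to d2
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)

# Toy data: three tight, well-separated groups
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.05, size=(30, 2)) for c in (0.0, 3.0, 6.0)])
init = kmeans_pp_init(X, k=3)
print(init)
```

Points far from the existing centroids are much more likely to be picked, so the starting centroids tend to be spread across the groups rather than bunched together.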

It is possible to take into account 3 variables for clustering. In K-means clustering, points are grouped into clusters based on the distance between the points. This means that to extend a clustering from two dimensions to three, you must add a third dimension to the data and calculate the distance between points in that dimension as well.

Clusters are now defined as data sets that have similar distances in all three dimensions, instead of just two. This allows a better separation of the clusters and a better fit for the distribution of the data.
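
As a small illustration of how the distance changes when a third dimension is added, the sketch below compares two hypothetical stocks described by return, volatility, and price-to-book. The values are made up for the example.

```python
import numpy as np

# Two hypothetical stocks described by (return, volatility, price-to-book); values are made up
stock_a = np.array([0.10, 0.25, 2.0])
stock_b = np.array([0.13, 0.29, 14.0])

# Distance using only the first two features, as in the 2-dimensional analysis
dist_2d = np.linalg.norm(stock_a[:2] - stock_b[:2])

# Distance once the third feature is included
dist_3d = np.linalg.norm(stock_a - stock_b)

print(dist_2d, dist_3d)
```

In this made-up case the price-to-book gap accounts for almost all of the 3-dimensional distance, which is one reason rescaling the features (as done with MaxAbsScaler in the previous section) may be worth considering.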

Load Data

As in the first application, the Average Annualized Return and Average Annualized Volatility data are selected, but now the Price to Book variable is also added for the 3-dimensional analysis:

# Download priceToBook and marketCap
priceToBook_list = []
marketCap_list = []
tickers_clean = []

for t in tickers:

    try:
        tick = yf.Ticker(t)
        ticker_info = tick.info

        priceToBook = ticker_info['priceToBook']
        marketCap = ticker_info['marketCap']
    except Exception:
        print('The stock ticker {} is not on database'.format(t))
        continue

    # Keep only the tickers for which both fields were found
    tickers_clean.append(t)
    priceToBook_list.append(priceToBook)
    marketCap_list.append(marketCap)

# Create a dataframe to contain the data
priceToBook_df = pd.DataFrame()

# Add the ticker, priceToBook and marketCap data
priceToBook_df['Ticker'] = tickers_clean
priceToBook_df['priceToBook'] = priceToBook_list
priceToBook_df['marketCap'] = marketCap_list

# Merge the return and volatility data from the first application with the price-to-book data
returns_df = returns.reset_index()
returns_df.columns = ['Ticker', 'Returns', 'Volatility']
clusters3d_df = pd.merge(returns_df, priceToBook_df, on='Ticker')

# Remove the rows that contain NULL values
clusters3d_df.dropna(inplace=True)

# Order columns
clusters3d_df = clusters3d_df[['Ticker', 'marketCap', 'Returns', 'Volatility', 'priceToBook']]

Determine the optimal number of clusters

Once the Average Annualized Return, Average Annualized Volatility and Price to Book data have been obtained, we can reapply the elbow method to determine the optimal number of clusters.

The optimal number of clusters is 3.

K-means clustering

Once the optimum number of clusters has been defined, we proceed to create them. First, the centroids are defined using the sklearn library. To create the groups of stocks, the k-means algorithm iteratively assigns data points to the groups based on the similarity of their characteristics, or “features”: in this case, Average Annualized Return, Average Annualized Volatility, and Price to Book.

# Format the three features as a numpy array to feed into the algorithm
X = clusters3d_df[['Returns', 'Volatility', 'priceToBook']].to_numpy()

# Create clusters
k_means_optimum = KMeans(n_clusters = 3, init = 'k-means++', random_state=42)
y = k_means_optimum.fit_predict(X)

# Plot 3D graph with plotly
clusters3d_df['cluster'] = y

fig = px.scatter_3d(clusters3d_df, x='Returns', y='Volatility', z='priceToBook',
                    color='cluster', hover_data=["Ticker"])
fig.show()

Outlier treatment

Again, we note the presence of outliers. We remove them individually and repeat the clustering steps.

# Identify and remove the outlier stocks
clusters3d_df.drop(clusters3d_df[(clusters3d_df['Ticker'] == 'HD')].index, inplace=True)
clusters3d_df.drop(clusters3d_df[(clusters3d_df['Ticker'] == 'CL')].index, inplace=True)

# Recreate data to feed into the algorithm
data3d = np.asarray([np.asarray(clusters3d_df['Returns']), np.asarray(clusters3d_df['Volatility']), np.asarray(clusters3d_df['priceToBook'])]).T
X = data3d

# Elbow method
distorsions = []
for i in range(1,20):
    k_means = KMeans(n_clusters=i,init='k-means++', random_state=42)
    k_means.fit(X)
    distorsions.append(k_means.inertia_)

#plot elbow curve
fig = plt.figure(figsize=(15, 5))
plt.plot(np.arange(1,20),distorsions)
plt.xlabel('Clusters')
plt.ylabel('SSE')
plt.title('Elbow curve')
plt.grid(True)

plt.show()

Once the pertinent modifications have been made, it is possible to obtain 3 clusters generated by the k-means++ algorithm according to the Average Annualized Return, Average Annualized Volatility, and Price to Book of each share.

# Initialize KMeans model with 3 clusters
k_means_optimum = KMeans(n_clusters = 3, init = 'k-means++', random_state=42)

# Cluster data using KMeans and store labels in 'y'
y = k_means_optimum.fit_predict(X)

# Add 'cluster' column to dataframe with cluster labels from 'y'
clusters3d_df['cluster'] = y

# Create 3D scatter plot with plotly express, color-coded by cluster label and with 'Ticker' tooltip
fig = px.scatter_3d(clusters3d_df, x='priceToBook', y='Returns', z='Volatility', color='cluster', hover_data=["Ticker"])

# Display plot
fig.show()

Finally, clustering with the k-means++ algorithm yields 3 sets of stocks grouped by the 3 variables under study (Average Annualized Return, Average Annualized Volatility, and Price to Book). The presence of shares with high Price to Book ratios is striking. This information can be useful when creating an investment portfolio.
