Distributional Clustering

These are lecture notes from my Computer Science course.

We’re going to look at:

  • Distributional Similarity
  • Clustering (Hierarchical and flat)

The Distributional Hypothesis states that words that occur in the same contexts tend to have similar meanings: two words are similar to the extent that they share similar contexts. But what is the context of a word w?

Context Representation

You have grammatical relations:

  • Subject/verb (e.g. the relation of train to leave and depart)
  • Object/verb

Proximity Representation:

  • Instead of using grammatical relations, define a window of +/- N tokens around the target word.

I think you can then count the occurrences of a word appearing within that window around another word and put those counts into a table. From the table you can calculate the conditional probability of a noun appearing as the object of a verb. There are a few different association measures you can use, which are the same as ones we have used before.
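Roughly how I imagine this working in Python (a sketch of my own, not the lecturer's code; the window counting and the per-noun normalisation are my assumptions about how the table is built):

    from collections import defaultdict

    # Count how often each verb appears within +/- `window` tokens of each noun.
    def window_counts(tokens, nouns, verbs, window=2):
        counts = defaultdict(lambda: defaultdict(int))
        for i, tok in enumerate(tokens):
            if tok not in nouns:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i and tokens[j] in verbs:
                    counts[tok][tokens[j]] += 1
        return counts

    # Turn one noun's counts into conditional probabilities over verbs
    # by normalising that row of the table.
    def row_probabilities(counts, noun):
        total = sum(counts[noun].values())
        return {v: c / total for v, c in counts[noun].items()} if total else {}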

Distributional Clustering

Clustering is the process of grouping a set of objects into subsets or clusters based on some measure of similarity. We are going to look at flat clustering (k-means) and hierarchical clustering (bottom-up and top-down).

We want to cluster the words in the rows based on their feature weights, i.e. on the similarities between their object/verb conditional probabilities. For example, house and apartment would be clustered together because words like buy, rent, and sell appear in a window around both of them.

Hierarchical Agglomerative Clustering (HAC)

Bottom-up algorithm: Treat each object as a singleton cluster then successively merge pairs of clusters until all the clusters have been merged into the ROOT (the top).

To do this you transform the input matrix into an adjacency matrix, where the rows and columns are both nouns. Each cell then contains the similarity between the noun in the row and the noun in the column.

To calculate this ‘similarity’ value you make use of the nouns’ vectors. There are various measures: Cosine, Jaccard, PMI, and many others. We’ll look at Cosine and…something else.

Some seriously long equations here that are probably quite simple.
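I didn’t copy the equations down, but cosine at least is standard. A minimal sketch of cosine similarity and of building the noun-by-noun adjacency matrix from it (the sparse dict-of-dicts representation is my own choice):

    import math

    # Cosine similarity between two sparse vectors (dicts of verb -> weight):
    # dot product divided by the product of the two vector lengths.
    def cosine(u, v):
        dot = sum(w * v.get(f, 0.0) for f, w in u.items())
        norm_u = math.sqrt(sum(w * w for w in u.values()))
        norm_v = math.sqrt(sum(w * w for w in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    # Noun-by-noun adjacency matrix: each cell holds the similarity
    # between the row noun and the column noun.
    def similarity_matrix(vectors):
        nouns = sorted(vectors)
        return {a: {b: cosine(vectors[a], vectors[b]) for b in nouns} for a in nouns}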

Once you’ve calculated these similarities, you iterate: find the pair with the highest similarity, merge that pair into a new cluster, and update the similarity matrix. Keep doing this until there is just one cluster.

To update the similarity matrix, remove the rows and columns of each clustered noun, add a new row/column for the cluster, and recalculate the similarities. Pretty obvious really. Though recalculating the similarity isn’t quite so obvious: how do you deal with clusters in that calculation? There are three different methods (there’s a sketch of the whole loop after the list below): Single-linkage, Complete-linkage, and Average-linkage.

  • Average-linkage is obvious: take the mean of the pairwise similarity values between the members of the two clusters.
  • Single-linkage: use the maximum similarity value among the pairs. Results in elongated clusters.
  • Complete-linkage is the opposite of single-linkage: use the minimum similarity value. Results in compact clusters.
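Here’s a rough sketch of the whole HAC loop with the three linkage strategies as options. It’s my reconstruction from the description above, not the lecture’s pseudocode, and for simplicity it recomputes cluster similarities from the original pairwise values instead of literally editing the matrix:

    # items: hashable objects (e.g. noun strings); sim(a, b): similarity between two items.
    def hac(items, sim, linkage="average"):
        def cluster_sim(c1, c2):
            pairs = [sim(a, b) for a in c1 for b in c2]
            if linkage == "single":    # max pairwise similarity -> elongated clusters
                return max(pairs)
            if linkage == "complete":  # min pairwise similarity -> compact clusters
                return min(pairs)
            return sum(pairs) / len(pairs)  # average-linkage: mean of pairwise similarities

        clusters = [frozenset([x]) for x in items]  # start with singleton clusters
        merges = []
        while len(clusters) > 1:
            # find the pair of clusters with the highest similarity
            i, j = max(((a, b) for a in range(len(clusters))
                        for b in range(a + 1, len(clusters))),
                       key=lambda ab: cluster_sim(clusters[ab[0]], clusters[ab[1]]))
            merges.append((clusters[i], clusters[j]))
            merged = clusters[i] | clusters[j]
            clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        return merges  # the sequence of merges, from singletons up to the ROOT

For example, hac(nouns, lambda a, b: cosine(vectors[a], vectors[b]), linkage="average") would cluster the nouns by their verb co-occurrence vectors.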

Then there was a graph showing the impact of these different strategies on the final results.

Flat Clustering

Given a set of objects and their corresponding distributional vectors, produce a flat grouping of the objects into K clusters. You can use ‘K-means’, which is similar to HAC. K is the number of desired clusters.

I think it starts with the same input, the same similarity values.

You then have a process of:

  • Start with random clustering
  • Centroid calculation
  • Distance calculation
  • Grouping
  • Check termination
  • Repeat (minus the random clustering bit)

What the hell is a centroid? It’s a vector that represents a cluster, calculated by taking the average of the vectors in that cluster. There’s a slide (‘Calculating Centroids’) which makes this easy to understand.
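In code the centroid is just a component-wise mean (a sketch, assuming each vector is a plain list of equal length):

    # Component-wise mean of a list of equal-length vectors,
    # e.g. centroid([[1, 2], [3, 4]]) -> [2.0, 3.0]
    def centroid(vectors):
        return [sum(column) / len(vectors) for column in zip(*vectors)]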

What about calculating distance? You use the Euclidean distance. Apparently.

Each noun is then moved to the cluster whose centroid is closest to it. Then you keep doing it: you terminate when an iteration produces no changes (i.e. it converges).
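Putting the steps together, a minimal k-means sketch based on the bullet list above (random initial grouping, centroid calculation, Euclidean distance, reassignment, stop when nothing changes); the re-seeding of empty clusters is my own addition:

    import math
    import random

    # Euclidean distance between two equal-length vectors.
    def euclidean(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

    def kmeans(vectors, k, max_iters=100):
        assignments = [random.randrange(k) for _ in vectors]  # random initial clustering
        for _ in range(max_iters):
            centroids = []
            for c in range(k):
                members = [v for v, a in zip(vectors, assignments) if a == c]
                if members:
                    centroids.append([sum(col) / len(members) for col in zip(*members)])
                else:
                    centroids.append(random.choice(vectors))  # re-seed an empty cluster
            # move each vector to its nearest centroid
            new_assignments = [min(range(k), key=lambda c: euclidean(v, centroids[c]))
                               for v in vectors]
            if new_assignments == assignments:  # converged: no changes this iteration
                break
            assignments = new_assignments
        return assignments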

K-means has a couple of limitations:

  • The initial random clustering has a big impact on the clustering.
  • K is manually defined (the number of clusters).

Exam Stuff

  • Probably different segmentation to last year.
  • Compulsory question is stuff that doesn’t come up in the other questions. Last year it was stuff to do with context-free grammars, morphology.

If you can do the practicals you can do the exam.

Probably worthwhile revising:

  • Collocations: t-test, PMI, null hypothesis, chi-square test. Did some of this in a practical so maybe look at my Python code (I think it was correct).
  • Stuff from guest lecturer. Though apparently things may be a bit different this year?