What is lexical semantics? It is the study of the meaning of words, and how these combine to form the meaning of longer contexts (paragraphs, etc.).
Words and concepts stand in a many-to-many relation: one word can express several concepts, and one concept can be expressed by several words. This results in lexical ambiguity, where one word or phrase has several possible meanings. Word sense disambiguation is the task of working out which sense of a word is intended in a given context (e.g. a sentence). To do this we need a source that lists the different senses of words.
- WordNets; more expressive than dictionaries and thesauri. Usually called “Large Lexical Databases”. http://wordnet.princeton.edu/
Words and phrases are grouped into sets of synonyms, called ‘Synsets’ in WordNet terminology.
This doesn’t sound much better than a thesaurus, but it has something extra: synsets are interlinked by semantic and lexical relations, for example taxonomic relations such as ‘dog is an animal’. There are lots of types of relations:
- Hypernymy/Hyponymy; taxonomic relations, e.g. car is a hyponym of vehicle, and vehicle is a hypernym of car.
- Meronymy; where something is a part of something else, e.g. leg is a meronym of body.
- Holonymy; the inverse of Meronymy, e.g. body is a holonym of leg.
- Antonymy; words with opposite meanings, e.g. rise is an antonym of fall.
- Entailment; logical consequence of another action, for example snoring entails sleeping or buying entails paying.
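The relations above can be pictured as labelled links between entries. The following is only a minimal sketch: the tables and the function names (`hypernym_chain`, `hyponyms`, `holonyms`) are hypothetical and hand-written for illustration, they are not WordNet data or the WordNet API.

```python
# Toy lexical database. Every entry is a made-up illustration,
# not data taken from WordNet itself.
HYPERNYM = {            # word -> its direct hypernym (more general term)
    "dog": "animal",
    "cat": "animal",
    "animal": "organism",
}
MERONYM_OF = {"leg": "body"}                # part -> whole
ANTONYM = {"rise": "fall", "fall": "rise"}  # symmetric relation
ENTAILS = {"snore": "sleep", "buy": "pay"}  # action -> action it entails


def hypernym_chain(word):
    """Follow hypernym links upwards until no more general term is listed."""
    chain = []
    while word in HYPERNYM:
        word = HYPERNYM[word]
        chain.append(word)
    return chain


def hyponyms(word):
    """Return the direct hyponyms (more specific terms) of a word."""
    return sorted(w for w, parent in HYPERNYM.items() if parent == word)


def holonyms(part):
    """Holonymy is meronymy read in the other direction: part -> whole."""
    return [MERONYM_OF[part]] if part in MERONYM_OF else []


print(hypernym_chain("dog"))  # ['animal', 'organism']
print(hyponyms("animal"))     # ['cat', 'dog']
print(holonyms("leg"))        # ['body']
```

Storing each relation in one direction and deriving the inverse (hyponyms from hypernyms, holonyms from meronyms) keeps the toy tables consistent.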
Problems with WordNet
- Its senses are very fine-grained; the distinctions between them can be vague and hard to apply.
- Lack of explicit relations between topically related concepts. For example cigarette is not related to ashtray.
- Misses concepts, especially domain-specific, e.g. names of products.
The Vector Space Model
This comes from information retrieval (searching for information within documents). Here we look at document retrieval (e.g. a search engine).
The VSM is an algebraic representation of each document or a query. Each document (or query) is represented as an n-dimensional vector of features. What are the features? The words in the document (with common words removed).
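This can cause problems: if a feature only records whether a word is present, two different documents can end up with identical vectors, for example when they use the same words a different number of times or in a different order.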
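A minimal sketch of the binary representation, showing one way two different documents can end up looking identical. The stop-word list, the example sentences and the helper names (`tokenize`, `build_vocabulary`, `binary_vector`) are all assumptions made for illustration.

```python
# Tiny illustrative stop list; a real system would use a much longer one.
STOP_WORDS = {"the", "a", "on", "and"}


def tokenize(text):
    """Lowercase, split on whitespace and drop common words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def build_vocabulary(documents):
    """One dimension per distinct remaining word, in a fixed (sorted) order."""
    return sorted({w for doc in documents for w in tokenize(doc)})


def binary_vector(text, vocabulary):
    """1 if the word occurs in the text, 0 otherwise."""
    present = set(tokenize(text))
    return [1 if term in present else 0 for term in vocabulary]


docs = [
    "the dog chased the cat",
    "the cat chased the dog and the dog",
]
vocab = build_vocabulary(docs)
vectors = [binary_vector(d, vocab) for d in docs]

print(vocab)    # ['cat', 'chased', 'dog']
print(vectors)  # [[1, 1, 1], [1, 1, 1]]
```

Both documents map to the same vector because word order and word counts are discarded.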
You can improve the binary VSM (i.e. where every feature is just a 0 or a 1) with ‘Feature Weighting’. For example, you can weight each feature by the number of times the word occurs in the document. However, raw counts are not ideal, so instead you should weight words according to their relevance to the document and their discriminating ability (‘the’ doesn’t have much of this).
A more sophisticated scheme uses Term Frequency and Inverse Document Frequency, and combines these to get TF-IDF:
- Term Frequency is the number of occurrences of a word in the document divided by the total number of words in the document.
- Inverse Document Frequency measures the discriminating ability of a word. It is the logarithm of (the number of all documents / the number of documents that contain that word).
This makes sense: a word like ‘the’ appears in nearly every document, so its IDF is close to log(1) = 0 and it contributes almost nothing to the weight.
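The two definitions can be sketched directly in code. This is a minimal illustration that assumes TF and IDF are combined by multiplying them, which is the usual form of TF-IDF; the corpus and the function names are made up for the example.

```python
import math


def term_frequency(term, tokens):
    """Occurrences of the term divided by the total number of words."""
    return tokens.count(term) / len(tokens)


def inverse_document_frequency(term, corpus):
    """log(number of documents / number of documents containing the term)."""
    containing = sum(1 for tokens in corpus if term in tokens)
    if containing == 0:
        return 0.0  # assumption: a term seen in no document gets zero weight
    return math.log(len(corpus) / containing)


def tf_idf(term, tokens, corpus):
    """Assumed combination: the product of the two measures."""
    return term_frequency(term, tokens) * inverse_document_frequency(term, corpus)


# Made-up corpus of three already-tokenized documents.
corpus = [
    "the dog chased the cat".split(),
    "the cat sat on the mat".split(),
    "the dog barked".split(),
]

print(tf_idf("the", corpus[0], corpus))     # 0.0, since 'the' is in every document
print(tf_idf("chased", corpus[0], corpus))  # 0.2 * log(3), roughly 0.22
```

Even though ‘the’ has the highest term frequency in the first document, its weight is zero, while the rarer word ‘chased’ gets a positive weight.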
You can also use the other statistical measures we’ve covered before such as Chi-squared and Log-likelihood.
The approach is straightforward. Make sure that the same features used to represent the documents are also used to represent the query, so that the two vectors can be compared.
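A minimal sketch of representing a query with the same features as the documents. The notes do not name a similarity measure; cosine similarity is used here as an assumption, and the documents, query and function names are invented for the example.

```python
import math
from collections import Counter


def count_vector(text, vocabulary):
    """Count-weighted vector over a fixed vocabulary; unknown words are dropped."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


docs = ["dog chased cat", "cat sat mat", "dog barked"]

# The vocabulary is built from the documents only...
vocabulary = sorted({w for d in docs for w in d.split()})
doc_vectors = [count_vector(d, vocabulary) for d in docs]

# ...and the query is projected onto that same vocabulary.
query_vector = count_vector("dog barked loudly", vocabulary)

# Rank document indices from most to least similar to the query.
ranking = sorted(
    range(len(docs)),
    key=lambda i: cosine(query_vector, doc_vectors[i]),
    reverse=True,
)
print(ranking)  # [2, 0, 1]
```

The query word ‘loudly’ is not a document feature, so it is ignored; this keeps the query vector the same length as every document vector.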
Problems with VSM
- Words have many senses as we have discussed above, and the VSM does not discriminate between the different senses of a word.
- Keyword based.
- Introduces noise in similarity calculations.
- Does not take into account synonyms.
You could use WordNet relations to expand vectors to include synonyms, hypernyms, and hyponyms. This increases coverage, so a query can also match documents that use related words rather than the exact query terms.
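The expansion idea can be sketched as follows. The `RELATED` table is a hypothetical, hand-written stand-in for real WordNet lookups, and `expand_terms` and `matches` are illustrative names only.

```python
# Hypothetical relation table standing in for WordNet lookups.
RELATED = {
    "car": {
        "synonyms": {"automobile"},
        "hypernyms": {"vehicle"},
        "hyponyms": {"taxi"},
    },
}


def expand_terms(terms, related=RELATED):
    """Add the synonyms, hypernyms and hyponyms of each term, where known."""
    expanded = set(terms)
    for term in terms:
        for group in related.get(term, {}).values():
            expanded |= group
    return expanded


def matches(query, document):
    """True if the expanded query shares at least one word with the document."""
    query_terms = expand_terms(query.lower().split())
    return bool(query_terms & set(document.lower().split()))


document = "an automobile for sale"
print(set("car".split()) & set(document.split()))  # set(): no keyword overlap
print(matches("car", document))                    # True after expansion
```

Expansion trades precision for coverage: adding hypernyms such as ‘vehicle’ will also match documents about other kinds of vehicle.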