
HW3: Precision/Recall and Spelling correction

Problem 1 (25 points)

In the Twitter data set about health care reform that the class annotated for the previous assignment, there were 119 tweets annotated as neutral, 152 as positive, and 298 as negative.

I create a subjectivity classifier that identifies tweets as objective (neutral) or subjective (either positive or negative). It labels 151 tweets as objective and the rest as subjective.  Of the 151, only 93 were actually objective.

Part A [7 points]

Fill in the following table based on the data given above.

                          | Classifier: Objective | Classifier: Subjective
  Annotation: Objective   |                       |
  Annotation: Subjective  |                       |

Part B [18 points]

Now, answer the following questions about the classifier's ability to identify objective tweets.

(a) [1 pt] How many false negatives did the classifier produce as a detector of objective tweets?

(b) [4 pts] What is the precision of the classifier as a detector of objective tweets?

(c) [4 pts] What is the recall of the classifier as a detector of objective tweets?

Next, answer the following questions about the classifier's ability to identify subjective tweets.

(d) [1 pt] How many false positives did the classifier produce as a detector of subjective tweets?

(e) [4 pts] What is the precision of the classifier as a detector of subjective tweets?

(f) [4 pts] What is the recall of the classifier as a detector of subjective tweets?
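For reference, precision and recall for a detector are defined in terms of true positives (TP), false positives (FP), and false negatives (FN). A minimal Python sketch of the definitions (the function names are illustrative, and this is not part of the graded answer):

```python
def precision(tp, fp):
    # precision: of everything the detector labeled positive,
    # what fraction was actually positive?
    return tp / (tp + fp)

def recall(tp, fn):
    # recall: of everything that was actually positive,
    # what fraction did the detector find?
    return tp / (tp + fn)
```

Note that "positive" here means whichever class the detector is detecting, so the same tweet counts differently depending on whether you treat the classifier as a detector of objective or of subjective tweets.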

Problem 2 (20 points)

The goal of a subjectivity classifier is to identify tweets that should be labeled as positive or negative by a polarity classifier. The polarity classifier is what will be used to understand the overall positivity or negativity toward health care reform on new tweets, and it will be used to present positive and negative examples to policy makers. Recall that probabilistic classifiers provide not only a label, but also a confidence value (the probability), and that probability can be used as a threshold. For example, rather than saying that the best candidate label is:

    label = argmax_y P(y | tweet)

you can instead choose the label based on a threshold:

    label = subjective if P(subjective | tweet) > θ, otherwise objective

where θ is a threshold value between 0 and 1. If θ = 0.5, then the decision is the same as the argmax above, but if θ = 0.8, then we are saying that we will only consider tweets subjective if they have a probability of being subjective greater than 0.8, and so on. This allows us to use the confidence value to select more or fewer tweets as subjective. (At the extremes: if θ = 0, everything is subjective, and if θ = 1, nothing is.) This is important because when a classifier is more confident, it is typically more correct. So a higher threshold means higher precision, because each decision is more likely to be right. However, if the threshold is too high, we may fail to label many instances, which can damage recall.
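The thresholding rule described above can be sketched in a few lines of Python (the function name and default are illustrative):

```python
def label_tweet(p_subjective, theta=0.5):
    # call the tweet subjective only when the classifier's
    # confidence exceeds the threshold theta; raising theta
    # trades recall for precision
    return "subjective" if p_subjective > theta else "objective"
```

With theta = 0.5 this reproduces the two-class argmax decision; with theta = 0.8 only high-confidence tweets are passed along as subjective.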

This question asks you to consider whether precision or recall is more important for identifying subjective tweets in different contexts. Another way of thinking about this question is: is it more important (a) to catch all of the subjective tweets at the risk of including many objective ones, and thus have better recall (by using values for θ less than 0.5), or (b) to select fewer tweets that the subjectivity classifier is more confident about, at the risk of not providing many subjective tweets to the polarity classifier, and thus have better precision (by using values for θ greater than 0.5)?

With that in mind, and also considering that there may be 100,000 or more tweets per day that are relevant to your topic of interest, say whether you think precision or recall for subjective tweets is more important for the following purposes:

  • A single person monitors the pulse of the people (on Twitter) by tracking the positive/negative tweet ratio over time.
  • A group of ten researchers must find interesting comments (in specific tweets) to demonstrate specific sentiments as examples in a policy decision.
  • An automated system will use the positive/negative ratio toward many different companies in order to decide whether to buy or sell shares of those companies' stocks.
Write 4-5 sentences for each (a short paragraph). Think about the effort that goes into various aspects of these tasks, such as how many tweets one person can look at in an hour, and about the cost of mistakes about sentiment, including missing important subjective tweets or assigning positive or negative labels to many tweets that are actually objective (in which case they are incorrect by definition).

Problem 3 (25 points)

Calculate the minimum edit distance from the string gudmint to each of the following words:

  • giant
  • judgment

Use only insertions, deletions, and substitutions. The costs you should use are as follows:

  • insertions: cost 1
  • deletions: cost 1
  • substitutions
    • no change (e.g. a→a, t→t) : cost 0
    • vowel → vowel: cost 0.5
    • consonant → consonant: cost 0.5
    • vowel → consonant: cost 1
    • consonant → vowel: cost 1
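If you want to check your hand-drawn graphs, the cost scheme above can be encoded in a short dynamic-programming sketch (Python; this is only for checking — you must still show the graph). The sketch assumes the vowels are a, e, i, o, u:

```python
VOWELS = set("aeiou")

def sub_cost(a, b):
    # substitution cost: 0 if identical, 0.5 within the same class
    # (vowel -> vowel or consonant -> consonant), 1 across classes
    if a == b:
        return 0.0
    if (a in VOWELS) == (b in VOWELS):
        return 0.5
    return 1.0

def min_edit_distance(src, tgt):
    # standard dynamic-programming table; insertions and
    # deletions each cost 1
    m, n = len(src), len(tgt)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = float(i)          # delete all of src
    for j in range(1, n + 1):
        d[0][j] = float(j)          # insert all of tgt
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + 1,                                   # deletion
                d[i][j - 1] + 1,                                   # insertion
                d[i - 1][j - 1] + sub_cost(src[i - 1], tgt[j - 1]),  # substitution
            )
    return d[m][n]
```

Each cell d[i][j] corresponds to a node in the acyclic graph you are asked to draw, annotated with its cost.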

For each of the pairs, write down:

  • the minimum edit distance, and
  • the acyclic graph which you use to calculate the minimum edit distance, with annotated costs for each node.
You do not need to include the string representation for each node in the graph (though feel free to include it if it is helpful to you).

Don't try to cram both graphs onto a single page – use a full page for each pair and give yourself plenty of space to write.

Problem 4 (30 points)

You are building a spelling corrector for English and have already built an excellent error detection component. You are given the sentence “John's good gudmint helps him make great decisions.” and the detector spots ‘gudmint’ as a typo. You also have a candidate generator that gives you two candidates as possible corrections: giant and judgment. Now you need to rank these candidates using a model trained from data (given below).
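This ranking follows the noisy-channel model: each candidate c is scored by P(gudmint | c) · P(c), where the error model supplies the first factor and a language model supplies the second. A minimal sketch of the scoring function, shown with illustrative numbers rather than the counts given below:

```python
def noisy_channel_score(p_typo_given_cand, cand_count, total_tokens):
    # score(c) = P(typo | c) * P(c), with P(c) estimated by
    # unigram maximum likelihood: count(c) / total token count
    return p_typo_given_cand * (cand_count / total_tokens)
```

The best candidate is the one with the highest score.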

You are given a corpus of text which has the following counts:
  • 20,358 word tokens
  • 282 tokens of giant
  • 60 tokens of judgment
  • 31 tokens of garment

Also assume that we know the following from an error model:

  • P(gudmint | giant) = .011
  • P(gudmint | judgment) = .042
  • P(gudmint | garment) = .062

(a) (15 points) Rank the candidates from most likely to least likely, assuming that we calculate P(candidate) using a unigram language model based on the above training material. Show your work, and give the values for each candidate.

(b) (15 points) The unigram model doesn't use the context of the sentence to choose the best candidate. Let's make it more sensitive to context by using a bigram language model, which for the above sentence means computing P(candidate|good). In our training material, we observe the following:

  • good occurs 815 times
  • giant occurs after good 1 time
  • judgment occurs after good 14 times
  • garment occurs after good 3 times

Using the bigram probabilities for each candidate being preceded by good, which is the best candidate? Show your work and the values for each candidate.
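The bigram version simply swaps the unigram prior for a conditional estimate. A sketch with illustrative counts (the function name is assumed, and the numbers in the test are not the assignment's):

```python
def bigram_noisy_channel_score(p_typo_given_cand, count_prev_cand, count_prev):
    # score(c) = P(typo | c) * P(c | prev), where the bigram
    # probability is estimated by maximum likelihood:
    # count(prev c) / count(prev)
    return p_typo_given_cand * (count_prev_cand / count_prev)
```

As in part (a), the candidate with the highest score wins; note that the error model probabilities are unchanged, so any change in the ranking comes entirely from the language model.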