## Due: February 23, 2011

## Problem 1 (25 points)

Two paragraphs are given below, one for each of two authors, John Beadle and John Haldane, from texts about traveling in North America. Your job is to determine who wrote the questioned paragraph based on properties of the known paragraphs, providing explicit justification for your choices, similar to what we did in class and to what you can see in the first few pages of the Forensic Linguistics slides.
(a) [15 pts] Provide three quantified measures (be sure to provide quantities for all three paragraphs). Here is a suggested format for writing the values. Don't forget that you also need to explain how each measure supports your analysis.
(b) [10 pts] Provide three non-quantified measures (state explicitly how each one relates to each paragraph, and how it supports your analysis).

## Problem 2 (25 points)
In the course slides the relative frequency of the words Here are the documents and their measurements:
You can probably spot the clusters pretty easily just by inspecting the values, but for this problem you need to compute the centroids of each cluster using K-means. You are given the following two initial centroids:
Note that in the slides, we used two of the document points as initial centroids; here, they are different points, so there will be non-zero distances from them to all documents.

## Part A (5 points)

For every document, compute the distance between it and each of the two centroids, c1 and c2. In the slides you saw how to compute the squared distance in two dimensions. Three dimensions work the same way: you just need to include the z dimension in your calculation:
$$\mathrm{distance}(d, c) = (x_d - x_c)^2 + (y_d - y_c)^2 + (z_d - z_c)^2$$

Here's a table to help organize your values:
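The squared-distance calculation, together with the assignment and centroid-update steps that K-means repeats, can be sketched in a few lines of Python. This is only an illustrative sketch: the function names (`squared_distance`, `assign`, `update`) and the four sample points and two starting centroids are made up for the example and are not the documents or centroids of this problem.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(p: Point, q: Point) -> float:
    """Sum of squared coordinate differences (x, y and z for 3-D points)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points: Sequence[Point], centroids: Sequence[Point]) -> List[int]:
    """Give each point the index of its closest centroid."""
    return [
        min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
        for p in points
    ]


def update(points: Sequence[Point], labels: Sequence[int],
           centroids: Sequence[Point]) -> List[Point]:
    """Recompute each centroid as the mean of the points assigned to it."""
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == k]
        if not members:
            # No point was assigned to this centroid: keep it where it is.
            new_centroids.append(old)
            continue
        new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
    return new_centroids


# Hypothetical data, NOT the measurements from the assignment.
points = [(1.0, 2.0, 3.0), (2.0, 2.0, 4.0), (9.0, 8.0, 7.0), (10.0, 9.0, 8.0)]
centroids = [(0.0, 0.0, 0.0), (10.0, 10.0, 10.0)]

labels = assign(points, centroids)             # one assignment step
centroids = update(points, labels, centroids)  # one update step
print(labels, centroids)
```

Repeating the `assign` and `update` calls until the labels stop changing gives the usual K-means loop. You should still show the hand-computed distances for each iteration in your answer.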
## Part B (5 points)

Based on the distances, write down the group memberships for each centroid. Then, compute the new centroids based on the group memberships.

## Part C (10 points)

Do this for two more iterations (the centroids won't change after that). At each iteration, give the distances, the centroids, and the group memberships.

## Part D (5 points)

You are given a new document, d7, with the measurements (3.3, 13.2, 25.0) and are told it was written by Doyle. Calculate the distance between d7 and each of the final centroids you computed for Part C. (Show your work.) Based on this result, which of the first six documents were likely to have been written by Doyle and which by Austen?

## Problem 3 (25 points)

Services like Twitter allow short, real-time commentary about whatever users feel like talking about, and this type of commentary is creating data of great interest for sentiment analysis. Often, there is interest in automatically determining whether a given tweet is positive, negative or neutral toward a specific topic, person, company, etc. For this problem, you will be working with tweets about health care reform that were collected at the time the US Congress was debating and voting on new health care legislation.
Part A [10 pts]

Annotate 20 tweets on the spreadsheet. Follow the directions carefully!

Step 1. Open the spreadsheet.

You should see something like this:

There are 1500 tweets, with one tweet per row of the spreadsheet. Each tweet has the following basic attributes:

- **tweet id**: the unique identification number for the tweet
- **user id**: the unique identification number for the user who created the tweet
- **username**: the user's twitter username
- **content**: the text of the tweet
Your job is to fill in the rest of the annotated attributes for 20 tweets, similar to what the instructor (Jason Baldridge) did for the first thirty tweets in the list. These attributes are:

- **sentiment**: the sentiment label for the tweet
- **target**: the subject the sentiment is being expressed about
- **annotator id**: the identifier for the annotator
- **comment**: a place to write down any comments you have about the tweet or the annotation (see the examples above)
- **dispute**: if someone sees an annotation that they disagree with, they can note it in this column
What to do for each of these attributes will be described in detail below.

Step 2. Create an annotator id.

Your annotator id can be the first letter of your first name followed by your last name (for example, the id for "Jason Baldridge" is "jbaldridge"). You can also use something else (e.g. wolverine, kingtut), provided it isn't the same as any other id already being used. We will use this to find the tweets you annotated when considering your answers to the other parts of this question. In your submission for this part of the problem, write down what your annotator id is.

Step 3. Claim 20 tweets by putting your annotator id next to them.

Before actually annotating any of the tweets, you must "claim" 20 of them by putting your annotator id in the "annotator id" column for the tweets. This will make sure that no one else who is editing the document at the same time will accidentally write over one of your annotations.

Step 4. Annotate each of the 20 tweets.

For each tweet, read the text and decide what sentiment it expresses and what that sentiment is being expressed about. The possible values for sentiment are:

- **positive**: the tweet expresses positive polarity toward the target (e.g. "I love health care reform!!!")
- **negative**: the tweet expresses negative polarity toward the target (e.g. "I hate health care reform!!!")
- **neutral**: the tweet is objective and does not express polarity toward the target (e.g. "Congress is debating health care reform.")
- **unsure**: you simply can't figure out whether it is subjective or objective, or whether it is positive or negative (e.g., it looks sarcastic, but you aren't sure, maybe something like "Health reform is just great.")
- **irrelevant**: the tweet isn't about health care reform (this should be rare, but does happen sometimes: e.g. "The Cubanization of Venezuela-Castro works to keep Chavez in power #hcr")
Sometimes it might be hard to make an exact determination, so it is fine to use the label "unsure" if you can't decide.

The possible values for target are:

- **obama**: Barack Obama
- **hcr**: health care reform
- **gop**: the Republican party (Grand Old Party)
- **dems**: the Democratic party
- **teaparty**: the Tea Party
- **conservatives**: individuals who subscribe to American conservatism
- **liberals**: individuals who subscribe to American liberalism
- **other**: the target is not adequately described by any of the above (use this if the tweet was irrelevant)
There are some tweets that clearly have multiple targets. In these cases, just pick the one you think is more important, and if they seem equally important, pick the first one mentioned.

Choose the appropriate values for sentiment and target from the drop down menus available with each one. If it is appropriate, you are welcome to add comments in the comments column for the tweet.

A few tips and comments:

- Find out what hashtags like #hcr mean on tagdef.com.
- There may be others editing at the same time. Make sure not to edit a row that someone has already claimed.
- Make sure you only edit the sentiment, target, annotator id, and comment cells. Don't rearrange the rows or columns. If you accidentally do this, please get in touch with the instructor and TA right away so we can restore the spreadsheet.
- As an indicator of how long it should take you, Jason annotated 30 tweets in a little under 10 minutes.
- You are welcome to annotate more than 20 tweets if you would like to do so!
- If you have any trouble with the spreadsheet, don't hesitate to get in touch with the instructor and the TA.
- If you have any questions about the annotation, the labels, or are generally unsure about what to do, again, don't hesitate to contact the instructor and the TA!
Part B [2 pts]

Calculate the probability of positive out of the positive and negative tweets in your 20 tweets.

Part C [3 pts]

Calculate the probability of positive for tweets about health care reform (target=hcr) from your 20 tweets plus the 30 tweets annotated by jbaldridge. As in part B, calculate the probability based only on positive and negative tweets (about hcr).

Part D [6 pts]

Name the three most difficult tweets you had to annotate and describe what made them difficult. Try to identify things like polarity flipping, use of discourse markers, sarcasm, etc. that were discussed in the slides.

Part E [4 pts]

Look at tweets that were annotated by others (this can include jbaldridge) and see if you disagree with the annotation of either the sentiment or the target of any of these (including any "unsure" annotations). If you do disagree, make a note in the dispute cell for the tweet on the spreadsheet. In your homework, describe the tweet and how you would annotate it differently and what information you think is necessary to use. For example, is it necessary to know whether a hashtag corresponds to a particular position, like #handsoff (against health care reform) and #p2 (for health care reform), or to read the content of a page that the tweeter linked to?

If you agree with all the annotations, you can instead write down any general problems you think there are with annotating tweets in this way. For example, are there any important target values missing, or are many tweets just too short for you to make a determination?

## Problem 4 (25 points)
The band
Using Bayes rule, you will compute the probability that this tweet is positive based on the fact that it contains

**Dataset 1**: You collect 1000 tweets about *The Flaming Lips* and label each of them as positive or negative, finding that 700 of them are positive. The word *best* is in 120 positive tweets and 10 negative ones. __Use this to compute the probability of__ *best* given *positive* and the probability of *best* given *negative*.

**Dataset 2**: You ask 100 people at a *The Flaming Lips* concert whether they would rate the show positively or negatively, and 90 of them respond positively. __Use this to compute the probability of__ *positive* and the probability of *negative*.
In answering the questions below,

(a) [2 pts] What is the probability of

(b) [3 pts] What is the probability of

(c) [12 pts] What is the probability of

(d) [8 pts] You realize it might be a good idea to use both (a) the positive and negative tweets and (b) the response of the positive and negative concert-goers to get a better estimate. What is the probability of
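For reference, here is a small sketch of how Bayes rule combines a likelihood estimated from labeled tweets with a prior, and how the result changes when the prior comes from a different source. The function name `bayes_posterior`, the variable names, and all the counts are hypothetical and chosen only for illustration; they are deliberately not the numbers from Dataset 1 or Dataset 2.

```python
def bayes_posterior(p_word_given_pos: float,
                    p_word_given_neg: float,
                    p_pos: float) -> float:
    """Return P(positive | word) for a two-class (positive/negative) problem."""
    p_neg = 1.0 - p_pos
    joint_pos = p_word_given_pos * p_pos   # P(word | positive) * P(positive)
    joint_neg = p_word_given_neg * p_neg   # P(word | negative) * P(negative)
    # Dividing by the sum of the two joints normalizes by P(word).
    return joint_pos / (joint_pos + joint_neg)


# Hypothetical labeled tweets (made-up counts).
n_pos, n_neg = 60, 40                # positive and negative tweets
word_in_pos, word_in_neg = 30, 4     # tweets in each class containing the word

# Likelihoods as relative frequencies within each class.
p_word_given_pos = word_in_pos / n_pos   # 0.5
p_word_given_neg = word_in_neg / n_neg   # 0.1

# Prior estimated from the same tweets.
p_pos_from_tweets = n_pos / (n_pos + n_neg)   # 0.6
posterior = bayes_posterior(p_word_given_pos, p_word_given_neg, p_pos_from_tweets)

# Same likelihoods, but with a prior taken from some other (hypothetical) source.
p_pos_from_survey = 0.8
posterior_other_prior = bayes_posterior(p_word_given_pos, p_word_given_neg,
                                        p_pos_from_survey)

print(round(posterior, 3), round(posterior_other_prior, 3))
```

The same likelihoods give a different posterior once the prior changes, which is the idea behind combining two data sources. In your submission, show each step of the calculation by hand.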
