Due: February 23, 2011

Problem 1 (25 points)

Two paragraphs are given below, one for each of two authors, John Beadle and John Haldane, from texts about traveling in North America. Your job is to determine who wrote the questioned paragraph based on properties of the known paragraphs, providing explicit justification for your choices, similar to what we did in class, which you can see in the first few pages of the Forensic Linguistics slides.
Provide three quantified measures of authorship style (e.g., average sentence length) and three non-quantified observations of similarity (e.g., content or particular expressions). For each measure, give its value for each text and explicitly state why it supports your determination of who wrote the questioned paragraph. You will be graded on the evidence and reasoning you use, not on whether you identify the right author.

(a) [15 pts] Provide three quantified measures (be sure to provide quantities for all three paragraphs). Here is a suggested format for writing the values. Don't forget that you also need to explain how each measure supports your analysis.
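As an illustration of quantified measures, here is a minimal Python sketch that computes three common style values for a toy text. The sample sentence is made up for illustration; it is not one of the Beadle/Haldane paragraphs, and the naive sentence splitter is only a rough approximation:

```python
import re

def style_measures(text):
    """Compute three simple quantified style measures for a text."""
    # Naive sentence split on ., !, ? (good enough for a rough comparison)
    sentences = [s for s in re.split(r'[.!?]+', text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    return {
        "avg_sentence_length": len(words) / len(sentences),  # words per sentence
        "avg_word_length": sum(len(w) for w in words) / len(words),
        "type_token_ratio": len(set(words)) / len(words),    # vocabulary richness
    }

# Toy example, not the actual assignment paragraphs:
sample = "We crossed the river at dawn. The guide was silent. It rained all day."
print(style_measures(sample))
```

Computing the same measures for both known paragraphs and the questioned paragraph lets you compare the values side by side, as the suggested format asks.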
(b) [10 pts] Provide three non-quantified measures (state explicitly how each one relates to each paragraph, and how it supports your analysis).

Problem 2 (25 points)

In the course slides, the relative frequencies of the words I and the were calculated for five texts by three authors: Austen, Doyle, and Krugman. These two dimensions were used to calculate the centroids for the three authors using the K-means algorithm. Many other values could be used for clustering with K-means; for this problem, you'll work with the relative frequencies of we, he, and a. In particular, you are given the measurements for six texts, some by Austen and some by Doyle, and your job is to cluster them with K-means. Here are the documents and their measurements:
You can probably spot the clusters pretty easily just by inspecting the values, but for this problem you need to compute the centroid of each cluster using K-means. You are given the following two initial centroids:
Note that in the slides we used two of the document points as initial centroids; here they are different points, so there will be non-zero distances from them to all documents.

Part A (5 points)

For every document, compute the distance between it and the two centroids c1 and c2. In the slides you saw how to compute the squared distance in two dimensions. For three dimensions it is no different: you just need to include the z dimension in your calculation:

distance(di, dj) = (xi − xj)² + (yi − yj)² + (zi − zj)²

Here's a table to help organize your values:
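The distance computation here, and the membership and centroid updates asked for in Parts B and C, can be sketched in Python. The document vectors and initial centroids below are made-up placeholder values for illustration only, not the actual measurements from the assignment's table:

```python
# One K-means pass in three dimensions (we, he, a relative frequencies).
# All numbers below are hypothetical placeholders, NOT the assignment data.
docs = {
    "d1": (2.0, 10.0, 20.0),
    "d2": (2.2, 10.5, 21.0),
    "d3": (5.0, 15.0, 25.0),
    "d4": (5.2, 14.8, 24.5),
}
centroids = {"c1": (2.0, 11.0, 20.0), "c2": (5.0, 15.0, 25.0)}

def sq_dist(p, q):
    # Squared distance: (x1 - x2)^2 + (y1 - y2)^2 + (z1 - z2)^2
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(docs, centroids):
    # Each document joins the group of its nearest centroid.
    groups = {c: [] for c in centroids}
    for name, point in docs.items():
        nearest = min(centroids, key=lambda c: sq_dist(point, centroids[c]))
        groups[nearest].append(name)
    return groups

def update(docs, groups):
    # New centroid = mean of its member points, dimension by dimension.
    new = {}
    for c, members in groups.items():
        pts = [docs[m] for m in members]
        new[c] = tuple(sum(vals) / len(pts) for vals in zip(*pts))
    return new

groups = assign(docs, centroids)
centroids = update(docs, groups)
print(groups)
print(centroids)
```

Repeating the assign/update pair until the centroids stop moving is exactly the iteration Parts B and C ask you to carry out by hand.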
Part B (5 points)

Based on the distances, write down the group memberships for each centroid. Then compute the new centroids based on the group memberships.

Part C (10 points)

Do this for two more iterations (the centroids won't change after that). At each iteration, give the distances, the centroids, and the group memberships.

Part D (5 points)

You are given a new document, d7, with the measurements (3.3, 13.2, 25.0) and are told it is written by Doyle. Calculate the distance between d7 and the final centroids you computed for Part C. (Show your work.) Based on this result, which of the first six documents were likely written by Doyle and which by Austen?

Problem 3 (25 points)

Services like Twitter allow short, real-time commentary about whatever users feel like talking about, and this type of commentary is creating data of great interest for sentiment analysis. Often there is interest in automatically determining whether a given tweet is positive, negative, or neutral toward a specific topic, person, company, etc. For this problem, you will be working with tweets about health care reform that were collected while the US Congress was debating and voting on new health care legislation.

Note: to do this problem, you must go to the Google spreadsheet "HCR Tweets - Language and Computers". This was shared with you as a link in an email; if for some reason you cannot find it, contact the instructor or the TA.

Part A [10 pts] Annotate 20 tweets on the spreadsheet. Follow the directions carefully!

Step 1. Open the spreadsheet. You should see something like this:

There are 1500 tweets, with one tweet per row of the spreadsheet. Each tweet has the following basic attributes:
Your job is to fill in the remaining annotation attributes for 20 tweets, similar to what the instructor (Jason Baldridge) did for the first thirty tweets in the list. These attributes are:
What to do for each of these attributes will be described in detail below.

Step 2. Create an annotator id. Your annotator id can be the first letter of your first name followed by your last name (for example, the id for "Jason Baldridge" is "jbaldridge"). You can also use something else (e.g., wolverine, kingtut), provided it isn't the same as any other id already in use. We will use this to find the tweets you annotated when considering your answers to the other parts of this question. In your submission for this part of the problem, write down what your annotator id is.

Step 3. Claim 20 tweets by putting your annotator id next to them. Before actually annotating any of the tweets, you must "claim" 20 of them by putting your annotator id in the "annotator id" column for those tweets. This ensures that no one else editing the document at the same time will accidentally overwrite one of your annotations.

Step 4. Annotate each of the 20 tweets. For each tweet, read the text and decide what sentiment it expresses and what that sentiment is being expressed about. The possible values for sentiment are:
Sometimes it might be hard to make an exact determination, so it is fine to use the label "unsure" if you can't decide. The possible values for target are:
There are some tweets that clearly have multiple targets. In these cases, just pick the one you think is most important, and if they seem equally important, pick the first one mentioned. Choose the appropriate values for sentiment and target from the drop-down menus available for each one. If appropriate, you are welcome to add comments in the comments column for the tweet. A few tips and comments:
Part B [2 pts] Calculate the probability of positive out of the positive and negative tweets in your 20 tweets.

Part C [3 pts] Calculate the probability of positive for tweets about health care reform (target=hcr) from your 20 tweets plus the 30 tweets annotated by jbaldridge. As in Part B, calculate the probability based only on positive and negative tweets (about hcr).

Part D [6 pts] Name the three most difficult tweets you had to annotate and describe what made them difficult. Try to identify things like polarity flipping, use of discourse markers, sarcasm, etc., that were discussed in the slides.

Part E [4 pts] Look at tweets that were annotated by others (this can include jbaldridge) and see if you disagree with the annotation of either the sentiment or the target of any of them (including any "unsure" annotations). If you do disagree, make a note in the dispute cell for the tweet on the spreadsheet. In your homework, describe the tweet, how you would annotate it differently, and what information you think is necessary to use. For example, is it necessary to know whether a hashtag corresponds to a particular position, like #handsoff (against health care reform) and #p2 (for health care reform), or to read the content of a page that the tweeter linked to? If you agree with all the annotations, you can instead write down any general problems you think there are with annotating tweets in this way. For example, are there any important target values missing, or are many tweets just too short for you to make a determination?

Problem 4 (25 points)
The band The Flaming Lips has asked you to build a system that monitors Twitter during the Austin City Limits festival and classify tweets as positive or negative about the band. Here is an example:
Using Bayes' rule, you will compute the probability that this tweet is positive based on the fact that it contains best and based on the following training data:
In answering the questions below, be sure to use dataset 1 and dataset 2 appropriately for computing the probabilities, as indicated above.

(a) [2 pts] What is the probability of negative? (Show your work.)

(b) [3 pts] What is the probability of best given negative? (Show your work.)

(c) [12 pts] What is the probability of positive given best? (Show your work.)

(d) [8 pts] You realize it might be a good idea to use both (a) the positive and negative tweets and (b) the responses of the positive and negative concert-goers to get a better estimate. What is the probability of negative given best if you use the tweets in addition to the data from the concert-goers to compute the prior probabilities of positive and negative? (Show your work.)
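The Bayes'-rule arithmetic asked for above follows the pattern sketched below. All counts here are made-up placeholders for illustration, not the assignment's actual training data, so only the shape of the computation carries over:

```python
# Bayes' rule for P(positive | best), with hypothetical placeholder counts.
# Prior from one dataset (e.g., concert-goer responses):
n_positive, n_negative = 60, 40                  # hypothetical
p_pos = n_positive / (n_positive + n_negative)   # P(positive) = 0.6
p_neg = 1 - p_pos                                # P(negative) = 0.4

# Likelihoods from the tweet dataset: how often "best" appears per class.
best_in_pos, total_pos_tweets = 8, 20            # hypothetical
best_in_neg, total_neg_tweets = 2, 20            # hypothetical
p_best_given_pos = best_in_pos / total_pos_tweets  # P(best | positive) = 0.4
p_best_given_neg = best_in_neg / total_neg_tweets  # P(best | negative) = 0.1

# Bayes' rule: P(pos | best) = P(best | pos) * P(pos) / P(best),
# where P(best) sums over both classes.
p_best = p_best_given_pos * p_pos + p_best_given_neg * p_neg
p_pos_given_best = p_best_given_pos * p_pos / p_best
print(round(p_pos_given_best, 3))
```

Swapping in the assignment's actual counts for the prior and the likelihoods, and choosing the data sources as the problem directs, gives the answers for parts (a) through (d).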