
HW2: Forensic linguistics and document classification

Due: February 23, 2011

Problem 1 (25 points)

Two paragraphs are given below, one by each of two authors, John Beadle and John Haldane, from texts about traveling in North America. Your job is to determine who wrote the questioned paragraph based on properties of the known paragraphs, providing explicit justification for your choices, as we did in class and as shown in the first few pages of the Forensic Linguistics slides.

Text 1, Author: John Beadle

From Corsicana the train on the Texas Central Railroad carried me early straight south, leaving the valley of the Trinity and bearing across the high country to the Brazos. Not one acre in ten of this region is under fence. All the rest is common pasture, though most of it belongs to private owners, and is for sale at two to six dollars per acre. The region is high and gently undulating, about one-fifth in timber, the rest fertile prairie. My next stopping place was Houston, which I thought, at first view, the most beautiful place in Texas. There had been a twenty-four-hours' rain, and at 9 A. M the sun shone out clear; the orange groves, magnolias, and shade trees looked their richest green and Houston presented to the newly arrived Northerner a most enchanting appearance. That city, the original capital of Texas, is at the head of Buffalo Bayou, a long projection of Galveston Bay.

Text 2, Author: John Haldane

What about Chicago? My dear good friend, what can I say about that ninth wonder of the world which people everywhere do not know already, and yet, something fresh may be squeezed out of it at a few points. I found that it had a population of fully 2,000,000, and was so immense in area, that once in it, you required some time to get out of it, even by rail, or by swift tram cars, two of which ran me one evening nearly twenty miles into the suburbs at a speed of thirty miles an hour when possible. The country round about is flat and uninteresting, but nevertheless, Chicago is a wonderful city throughout, and of unique interest amongst the great cities of the globe. The streets were as crowded as those of London, and the shops quite as handsome, but I could not admire such an array of gigantic commercial buildings, hundreds of feet in height, so much alike, so slabby in form, and flat roofed in most cases.

Text 3, Questioned Paragraph

One week sufficed to conclude my business in Oregon, but before leaving a few general notes are in order. Portland is on the west bank of the Willamette, twelve miles above its mouth and near the head of tide-water. But the Columbia often rises so as to cause backwater, giving the Willamette a variation of thirty-two feet. Ocean steamers load at the wharf, and the place has direct water communication with all the ports of the world, the chief exports being wheat, lumber, beef and salmon. All the older portion of the city is very beautifully improved; elegant residences abound, with many evidences of taste and wealth. The location is picturesque. The Cascade Range is only occasionally visible, but Mount Hood rears its snowy summit sixty miles eastward, and looks as if it were just out of town. Mount Saint Helens is sometimes in good view, though eighty miles to the north-east. All the hills around the city are covered with heavy timber, and in town every street is double lined with shade trees.

Provide three quantified measures of authorship style (e.g., average sentence length) and three non-quantified observations of similarity (e.g., content or particular expressions). For each measure, you should give its value for each text and you should explicitly state why it supports your determination of who wrote the questioned paragraph. You will be graded on the evidence and reasoning you use, not on whether you get the right author.

(a) [15 pts] Provide three quantified measures (be sure to provide quantities for all three paragraphs). Here is a suggested format for writing the values. Don't forget that you also need to explain how it supports your analysis.

Description of Measure | Value for Beadle | Value for Haldane | Value for Questioned
Measure 1:
Measure 2:
Measure 3:

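If you want to check your hand counts, measures like these can be computed with a short Python sketch. The sentence splitting below is a rough heuristic (abbreviations such as "A. M" in the Beadle paragraph will be miscounted, so adjust by hand where needed); run it on each of the three full paragraphs to fill in the table.

```python
import re

def style_measures(text):
    """Compute a few simple quantified measures of authorship style."""
    # Rough sentence split on ., !, ? -- good enough for a sanity check.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    return {
        "avg_sentence_length": len(words) / len(sentences),           # words per sentence
        "avg_word_length": sum(len(w) for w in words) / len(words),   # characters per word
        "type_token_ratio": len({w.lower() for w in words}) / len(words),  # vocabulary richness
    }

# Example on a short snippet (use each full paragraph for your answer):
print(style_measures("Houston is beautiful. The orange groves looked their richest green."))
```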
(b) [10 pts] Provide three non-quantified measures (state explicitly how each one relates to each paragraph, and how it supports your analysis).



Problem 2 (25 points)

In the course slides the relative frequency of the words I and the were calculated for five texts of three authors: Austen, Doyle, and Krugman. These two dimensions were used to calculate the centroids for the three authors using the K-means algorithm. There are many other values that could be used for clustering with K-means; for this problem, you'll work with the relative frequency of we, he, and a. In particular, you are given the measurements for six texts, some of them by Austen and some by Doyle, and your job is to cluster them with k-means.

Here are the documents and their measurements:

Doc ID    x      y      z
d1        2.0    10.9   15.9
d2        4.6    11.4   23.0
d3        1.8    11.5   19.1
d4        1.8    9.1    17.1
d5        2.1    9.6    19.3
d6        7.7    15.4   22.0

You can probably spot the clusters pretty easily just by inspecting the values, but for this problem you need to compute the centroids of each cluster using K-means.

You are given the following two initial centroids:

Centroid ID    x      y      z
c1             1.9    9.9    16.5
c2             2.0    10.5   19.1

Note that in the slides, we used two of the document points as initial centroids; here, they are different points, so there will be non-zero distances from them to all documents.

Part A (5 points)

For every document, compute the distance between it and the two centroids c1 and c2.

In the slides you saw how to compute the squared distance in two dimensions. For three dimensions it is no different: you just need to include the z dimension in your calculation:

distance(di, dj) = (xi − xj)² + (yi − yj)² + (zi − zj)²

Here's a table to help organize your values:


      d1    d2    d3    d4    d5    d6
c1
c2
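To double-check your table, the squared distances can be computed with a few lines of Python (the coordinates below are the ones given in the tables above):

```python
# Document and initial centroid coordinates from the tables above.
docs = {
    "d1": (2.0, 10.9, 15.9), "d2": (4.6, 11.4, 23.0), "d3": (1.8, 11.5, 19.1),
    "d4": (1.8, 9.1, 17.1),  "d5": (2.1, 9.6, 19.3),  "d6": (7.7, 15.4, 22.0),
}
centroids = {"c1": (1.9, 9.9, 16.5), "c2": (2.0, 10.5, 19.1)}

def sq_dist(p, q):
    """Squared Euclidean distance between two 3-dimensional points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

# Print one row of the table per centroid.
for cid, c in centroids.items():
    print(cid, {did: round(sq_dist(d, c), 2) for did, d in docs.items()})
```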
Part B (5 points)

Based on the distances, write down the group memberships for each centroid. Then, compute the new centroids based on the group memberships.

Part C (10 points)

Do this for two more iterations (the centroids won't change after that). At each iteration, give the distances, the centroids, and the group memberships.
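The assign-then-update loop for Parts B and C can be sketched in Python as well; this is only a checking aid (do the iterations by hand and show your work), assuming the document and centroid values given above:

```python
docs = {
    "d1": (2.0, 10.9, 15.9), "d2": (4.6, 11.4, 23.0), "d3": (1.8, 11.5, 19.1),
    "d4": (1.8, 9.1, 17.1),  "d5": (2.1, 9.6, 19.3),  "d6": (7.7, 15.4, 22.0),
}
centroids = {"c1": (1.9, 9.9, 16.5), "c2": (2.0, 10.5, 19.1)}

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_step(docs, centroids):
    """One k-means iteration: assign each document to its nearest
    centroid, then move each centroid to the mean of its group."""
    groups = {cid: [] for cid in centroids}
    for did, point in docs.items():
        nearest = min(centroids, key=lambda cid: sq_dist(point, centroids[cid]))
        groups[nearest].append(did)
    new_centroids = {}
    for cid, members in groups.items():
        pts = [docs[did] for did in members]
        # Keep an empty centroid where it was; otherwise take the coordinate-wise mean.
        new_centroids[cid] = tuple(sum(v) / len(pts) for v in zip(*pts)) if pts else centroids[cid]
    return groups, new_centroids

for i in range(3):
    groups, centroids = kmeans_step(docs, centroids)
    print(f"iteration {i + 1}:", groups, centroids)
```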

Part D (5 points)

You are given a new document, d7, with the measurements (3.3, 13.2, 25.0) and are told it is written by Doyle. Calculate the distance between d7 and the final centroids you computed for Part C. (Show your work.) Based on this result, which of the first six documents were likely written by Doyle and which by Austen?


Problem 3 (25 points)

Services like Twitter allow short, real-time commentary about whatever users feel like talking about, and this type of commentary is creating data of great interest for sentiment analysis. Often, there is interest in automatically determining whether a given tweet is positive, negative or neutral toward a specific topic, person, company, etc. For this problem, you will be working with tweets about health care reform that were collected at the time the US Congress was debating and voting on new health care legislation.

Note: to do this problem, you must go to the Google spreadsheet "HCR Tweets - Language and Computers". This was shared to you as a link in an email; if for some reason you cannot find it, contact the instructor or the TA.


Part A [10 pts]

Annotate 20 tweets on the spreadsheet. Follow the directions carefully!

Step 1. Open the spreadsheet

There are 1500 tweets, with one tweet per row of the spreadsheet. Each tweet has the following basic attributes:
  • tweet id: the unique identification number for the tweet
  • user id: the unique identification number for the user who created the tweet
  • username: the user's twitter username
  • content: the text of the tweet
Your job is to fill in the rest of the annotated attributes for 20 tweets, similarly to what the instructor (Jason Baldridge) did for the first thirty tweets in the list. These attributes are:
  • sentiment: the sentiment label for the tweet
  • target: what subject is the sentiment being expressed about
  • annotator id: the identifier for the annotator 
  • comment: a place to write down any comments you have about the tweet or the annotation (see the examples above)
  • dispute: if someone sees an annotation that they disagree with, they can note it in this column
What to do for each of these attributes will be described in detail below.

Step 2. Create an annotator id.

Your annotator id can be the first letter of your first name followed by your last name (for example, the id for "Jason Baldridge" is "jbaldridge"). You can also use something else (e.g. wolverine, kingtut), provided it isn't the same as any other ids already being used. We will use this to find the tweets you annotated when considering your answers to the other parts of this question. In your submission for this part of the problem, write down what your annotator id is.
 
Step 3. Claim 20 tweets by putting your annotator id next to them.

Before actually annotating any of the tweets, you must "claim" 20 of them by putting your annotator id in the "annotator id" column for the tweets. This will make sure that no one else who is editing the document at the same time will accidentally write over one of your annotations.

Step 4. Annotate each of the 20 tweets.

For each tweet, read the text and decide what sentiment it expresses and what that sentiment is being expressed about. 

The possible values for sentiment are:
  • positive: the tweet expresses positive polarity toward the target (e.g. "I love health care reform!!!")
  • negative: the tweet expresses negative polarity toward the target (e.g. "I hate health care reform!!!")
  • neutral: the tweet is objective and does not express polarity toward the target (e.g. "Congress is debating health care reform.")
  • unsure: you simply can't figure out whether it is subjective or objective, or whether it is positive or negative (e.g., it looks sarcastic, but you aren't sure, maybe something like "Health reform is just great.")
  • irrelevant: the tweet isn't about health care reform (this should be rare, but does happen sometimes: e.g. "The Cubanization of Venezuela-Castro works to keep Chavez in power #hcr")
Sometimes it might be hard to make an exact determination, so it is fine to use the label "unsure" if you can't decide.

The possible values for target are:
  • obama: Barack Obama
  • hcr: health care reform
  • gop: the Republican party (Grand Old Party)
  • dems: the Democratic party
  • teaparty: the Tea Party
  • conservatives: individuals who subscribe to American conservatism
  • liberals: individuals who subscribe to American liberalism
  • other: the target is not adequately described by any of the above (use this if the tweet was irrelevant)
There are some tweets that clearly have multiple targets. In these cases, just pick the one you think is more important, and if they seem equally important, pick the first one mentioned.

Choose the appropriate values for sentiment and target from the drop down menus available with each one. If it is appropriate, you are welcome to add comments in the comments column for the tweet.

A few tips and comments:
  • Find out what hashtags like #hcr mean on tagdef.com.
  • There may be others editing at the same time. Make sure not to edit a row that someone has already claimed.
  • Make sure you only edit the sentiment, target, annotator id, and comment cells. Don't rearrange the rows or columns. If you accidentally do this, please get in touch with the instructor and TA right away so we can restore the spreadsheet.
  • As an indicator of how long it should take you, Jason annotated 30 tweets in a little under 10 minutes.
  • You are welcome to annotate more than 20 tweets if you would like to do so!
  • If you have any trouble with the spreadsheet, don't hesitate to get in touch with the instructor and the TA.
  • If you have any questions about the annotation, the labels, or are generally unsure about what to do, again, don't hesitate to contact the instructor and the TA!
Part B [2 pts] 

Calculate the probability of positive, using only the positive and negative tweets among your 20 tweets.

Part C [3 pts] 

Calculate the probability of positive for tweets about health care reform (target=hcr) from your 20 tweets plus the 30 tweets annotated by jbaldridge. As in part B, calculate the probability based only on positive and negative tweets (about hcr).

Part D [6 pts] 

Name the three most difficult tweets you had to annotate and describe what made them difficult. Try to identify things like polarity flipping, use of discourse markers, sarcasm, etc. that were discussed in the slides.

Part E [4 pts] 

Look at tweets that were annotated by others (this can include jbaldridge) and see if you disagree with the annotation of either the sentiment or the target of any of these (including any "unsure" annotations). If you do disagree, make a note in the dispute cell for the tweet on the spreadsheet. In your homework, describe the tweet and how you would annotate it differently and what information you think is necessary to use. For example, is it necessary to know whether a hashtag corresponds to a particular position, like #handsoff (against health care reform) and #p2 (for health care reform), or to read the content of a page that the tweeter linked to?

If you agree with all the annotations, you can instead write down any general problems you think there are with annotating tweets in this way. For example, are there any important target values missing, or are many tweets just too short for you to make a determination?

Problem 4 (25 points)

The band The Flaming Lips has asked you to build a system that monitors Twitter during the Austin City Limits festival and classifies tweets as positive or negative about the band. Here is an example:

[example tweet containing the word best]

Using Bayes rule, you will compute the probability that this tweet is positive based on the fact that it contains best and based on the following training data:

  • Dataset 1: You collect 1000 tweets about The Flaming Lips and label each of them as positive or negative, finding that 700 of them are positive. The word best is in 120 positive tweets and 10 negative ones. Use this to compute the probability of best given positive and the probability of best given negative.
  • Dataset 2: You ask 100 people at The Flaming Lips concert whether they would rate the show positively or negatively, and 90 of them respond positively.   Use this to compute the probability of positive and the probability of negative.

In answering the questions below, be sure to use dataset 1 and dataset 2 appropriately for computing the probabilities, as indicated above.
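As a reminder of the mechanics only (the numbers below are made up for illustration; they are not the assignment's), Bayes rule for this kind of computation can be sketched as:

```python
def p_class_given_word(p_word_given_class, p_word_given_other, p_class):
    """Bayes rule:
    P(class | word) = P(word|class) * P(class)
                      / (P(word|class) * P(class) + P(word|other) * P(other))
    """
    p_other = 1 - p_class
    numerator = p_word_given_class * p_class
    denominator = numerator + p_word_given_other * p_other
    return numerator / denominator

# Made-up illustration: P(word|pos) = 0.2, P(word|neg) = 0.05, P(pos) = 0.5
print(p_class_given_word(0.2, 0.05, 0.5))
```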

(a) [2 pts] What is the probability of negative?  (Show your work). 

(b) [3 pts] What is the probability of best given negative? (Show your work) 

(c) [12 pts] What is the probability of positive given best? (Show your work)

(d) [8 pts] You realize it might be a good idea to use both the positive and negative tweets (Dataset 1) and the responses of the positive and negative concert-goers (Dataset 2) to get a better estimate. What is the probability of negative given best if you use the tweets in addition to the data from the concert-goers to compute the prior probabilities of positive and negative? (Show your work)
