SOLVED CSE 6240 Web Search & Text Mining Homework 1

TO FIND THE NEXT OR DIFFERENT PROJECT CLICK ON THE SEARCH BUTTON ON THE TOP RIGHT MENU AND SEARCH USING THE TITLE CODE OR PROJECT TITLE.

Starting from:

~~$25~~

$15

Homework 1
1. Positional Indexing (25pt)
Shown below is a portion of a positional index in the format: term: doc1: ⟨position1,
position2, . . . ⟩; doc2: ⟨position1, position2, . . . ⟩; etc.
angels: 2: ⟨36,174,252,651⟩; 4: ⟨12,22,102,432⟩; 7: ⟨17⟩;
fools: 2: ⟨1,17,74,222⟩; 4: ⟨8,78,108,458⟩; 7: ⟨3,13,23,193⟩;
fear: 2: ⟨87,704,722,901⟩; 4: ⟨13,43,113,433⟩; 7: ⟨18,328,528⟩;
in: 2: ⟨3,37,76,444,851⟩; 4: ⟨10,20,110,470,500⟩; 7: ⟨5,15,25,195⟩;
rush: 2: ⟨2,66,194,321,702⟩; 4: ⟨9,69,149,429,569⟩; 7: ⟨4,14,404⟩;
to: 2: ⟨47,86,234,999⟩; 4: ⟨14,24,774,944⟩; 7: ⟨199,319,599,709⟩;
tread: 2: ⟨57,94,333⟩; 4: ⟨15,35,155⟩; 7: ⟨20,320⟩;
where: 2: ⟨67,124,393,1001⟩; 4: ⟨11,41,101,421,431⟩; 7: ⟨16,36,736⟩;
Which document(s) if any meet each of the following queries, where each expression
within quotes is a phrase query?
a. “fools rush in”
b. “where angels fear”
c. “fools rush in” AND “angels fear to tread”
2. TF-IDF (25pt)
Consider idf as the most commonly used version: idft = log
N
dft
⎛
⎝
⎜
⎞
⎠
⎟ .
a. Why is the idf of a term always finite?
b. What is the idf of a term that occurs in every document? Compare this with the use of
the stop word lists.
c. Consider the table of tf, df for 3 documents denoted Doc1, Doc2, Doc3 in Doc3 in
Figure 1 and Figure 2. Compute the tf-idf weights for the four terms, where the total
number of documents N = 806,791.
d. Can the tf-idf weight of a term in a document exceed 1?
term Doc1 Doc2 Doc3
car 27 4 24
moto 3 33 0
insurance 0 33 29
rent 14 0 17
Figure 1. Table of tf values for Question 2.
Georgia Institute of Technology CSE6240 - Spring 2015 Homework 1
term df
car 18,165
moto 6,723
insurance 19,241
rent 25,235
Figure 2. Table of df values for Question 2.
3. Evaluation - Example (25pt)
There are two indexing systems, S1 and S2. Both of them make two queries Q1 and Q2
on a document set, and return top 5 results as is shown below.
system, query 1 2 3 4 5
S1, Q1 d3 d5 d8 d10 d11
S1, Q2 d1 d2 d7 d11 d13
S2, Q1 d6 d7 d2 d9 d8
S2, Q2 d1 d2 d4 d11 d14
Figure 3. Table of query results for Question 3.
Suppose the relevant documents of Q1 is {d3, d6, d7, d8}, and Q2 {d1, d4, d11}.
Calculate precision, recall, F Measure (harmonic mean), Average Precision and Mean
Average Precision of S1 and S2 on Q1 and Q2, respectively (no need to consider
interpolation). Which values consider ranks? Which don’t?
4. Evaluation - General (25pt)
Answer the following questions:
a. The balanced F measure (a.k.a. F1) is defined as the harmonic mean of precision and
recall. What is the advantage of using the harmonic mean rather than “averaging” (using
the arithmetic mean)?
b. What are the possible values for interpolated precision at a recall level of 0?
c. Must there always be a break-even point between precision and recall? Either show
there must be or give a counter-example.