[SOUND]
Hello, and
welcome to the course on
Text Retrieval and Search Engines.
I'm ChengXiang Zhai.
I also go by the nickname Cheng.
I'm a professor of the Department
of Computer Science at
the University of Illinois
at Urbana-Champaign.
This first lecture is a basic
introduction to the course,
a brief overview of what
we'll cover in the course.
We're going to first talk about the data
mining specialization since this course is
part of that specialization.
And then we'll cover the motivation
and objectives of the course.
This will be followed by pre-requisites
and course format and reference books.
And then finally we'll talk
about the course schedule,
which has a number of topics to be
covered in the rest of this course.
So the data mining specialization
offered by the University of Illinois
at Urbana-Champaign is really to address
the need for data mining techniques to
handle a lot of big data,
to turn the big data into knowledge.
There are five lecture-based courses,
as you see on the slide.
Plus one capstone
project course at the end.
I'm teaching two of them:
this course, Text Retrieval and
Search Engines, and this one.
So the two courses that I cover
here are all about the text data.
In contrast, the other courses are
covering more general techniques that can
be applied to all kinds of data.
So Pattern Discovery, taught by
Professor Jiawei Han, and Cluster Analysis,
again taught by him, are about general data
mining techniques to handle structured
and unstructured data.
And Data Visualization,
covered by Professor John Hart, is about
general visualization techniques,
again applicable to all kinds of data.
So the motivation for this course,
and in fact also for
the other course that I'm teaching,
is that we have a lot of text data.
The data is everywhere
and is growing rapidly, so
you must have been
experiencing this growth.
Just think about how much text data
you're dealing with every day.
I listed some data types here. For
example, on the internet we see a lot
of web pages, news articles, et cetera.
And then we have blog articles,
emails, scientific literature,
and tweets. As we are speaking,
maybe a lot of tweets are being written,
and a lot of emails are being sent.
So the amount of text data is beyond
our capacity to understand it.
Also, the amount of data makes it possible
to actually analyze the data to discover
interesting knowledge, and that's what
we mean by harnessing big text data.
[MUSIC]

[MUSIC]
Text data is very special.
In contrast to the data captured
by machines such as sensors,
text data is produced by humans.
And it is also meant
to be consumed by humans.
And this has some
interesting consequences.
Because it is produced by humans, it tends
to have a lot of useful knowledge about
people's preferences and
people's opinions about everything.
And that makes it possible to mine
text data to discover those
latent preferences of people,
which could be very useful to build
an intelligent system to help people.
You can also think about
scientific literature:
it's a way to encode
our knowledge about the world.
So it's very high quality content, yet we
have difficulty digesting all the content.
Now, as a result of the fact that
text is consumed by us humans,
we also need intelligent software tools
to help people digest the content, or
otherwise we'd miss
a lot of useful content.
This slide shows that humans really
play an important role in text data mining.
We have to consider the human in the loop, and
we have to consider the fact that
the text is generated by humans.
So, here are some examples of
useful text information systems.
This is by no means a complete
list of all applications.
I categorize them into
different categories.
But you can probably imagine
other kinds of applications.
So let's take a look at some of them.
Search, for example:
we all know search engines, especially
web search engines. I bet
all of you are using Google, or Bing, or
another web search engine all the time.
And we also have library search systems.
And in fact, wherever you have a lot of
text data, you would have a search engine.
So for example, you might have
a search box on your laptop.
All right,
to search content on your computer.
So that's one kind of application system.
But
we also have filtering systems or
recommender systems.
Those systems can push
information to users.
They can recommend useful
information to users.
So again, news filters, spam filters,
literature or movie recommenders.
Now, not all of them are necessarily
recommending information to you.
For example, an email filter,
a spam email filter,
is actually there to filter out
the spam from your inbox, all right.
But in nature these are similar systems in
that they have to make a binary decision
regarding whether to retain
a particular document or discard it.
Another kind of system
is the categorization system.
So for example, in handling emails,
you might prefer an automatic
sorter that would automatically
sort incoming emails into the proper
folders that you created.
Or we might want to categorize product
reviews into positive or negative.
News agencies might be interested in
categorizing news articles into
all kinds of subject categories.
Those are all categorization systems.
Finally there are also systems
that might do more analysis.
Or, you could say, they mine text data.
And these can be text mining systems or
information extraction systems,
and they can be
used to analyze text data in more detail
to discover potentially useful knowledge.
For example companies might
be interested in discovering
major complaints from their customers
based on the email messages that
they have received from the customers.
All right, so
having a system to support that would
really help improve their productivity and
the customer relations.
Also, in business intelligence, companies
are often interested in analyzing product
reviews to understand the relative
strengths of their own products
in comparison with competitors.
And so these are all examples
of these text mining systems.
[INAUDIBLE] we have a lot of data
in particular literature data.
So, there's also great opportunity
of using computer systems
to analyze the data to
automatically read literature, and
to gain knowledge, and
to help biologists make discoveries.
And you can imagine many others.
So the point is that with so
much text data,
we can build very useful systems to
help people in many different ways.
Now, how do we build these systems?
Well these actually are the main
technologies that we'll be talking
about in this course and the other course
that I'm teaching for this specialization.
The main techniques for
building these systems and also for
harnessing the text data are text
retrieval and text data mining.
So I use this picture to show
the relation between these two
somewhat different techniques.
We start with big text data, right?
But for any application, we don't
necessarily need to use all the data.
Often we only need a small subset of the
most relevant data, and that's shown here.
So text retrieval is to convert big,
raw text data into that small
subset of most relevant data that are most
useful for a particular application.
And this is usually
done by search engines.
And so
this will be covered in this course.
After we have got a small
amount of relevant data,
we also need to further analyze the data
to help people digest the data, or
to turn the data into
actionable knowledge.
And this step is called text mining,
where we use a number of techniques to
mine the data to get useful knowledge or
patterns.
And the knowledge can then be used
in many different applications.
And this part, text mining, will be
covered in the other course that I'm
teaching called Text Mining and Analytics.
The emphasis of this course
is on basic concepts and
practical techniques in text retrieval.
More specifically we will
cover how search engines work.
How to implement a search engine.
How to evaluate a search engine, so
that you know whether one search engine is
better than another, or
one method is better than another.
How to improve and
optimize a search engine system.
And how to build a recommender system.
We also hope to provide hands-on
experience in multiple aspects.
One is to create a test collection for
evaluating search engines.
This is very important for knowing
which technique actually worked well.
And whether your search engine system
is really good for your application.
The other aspect is to experiment
with search engine algorithms.
In practice, you will have to face
choices of different algorithms.
So it's important to know
how to compare them and
to figure out how they work or
maybe potentially, how to improve them.
And finally, we'll provide a platform for
you to do a search engine competition,
where you can compare your different
ideas to see which idea works better
on some data set.
The prerequisites for
this course are minimal.
Basically, we hope you know some basic
concepts of computer science, for
example data structures.
And we hope you will be comfortable
with programming, especially in C++,
because that's the language that we'll use
for some of the programming assignments.
The format is lectures plus quizzes,
as often happens in MOOCs.
And we will also provide
programming assignments for
those of you that have
the resources to do them.
We don't really have any required
readings for this course.
That just means that if you follow all
the lecture videos carefully,
you're supposed to know all the basic
concepts and the basic techniques.
But it's always useful to read more, so
here we provide a list of
some useful reference books.
The list is in chronological order, and
it also includes a book that
and I are co-authoring now, and
we make some draft chapters
available on this website.
And you can find more readings and
reference books on this website.
Finally, this is the course schedule.
It's just the topic map for
the rest of the course,
and it shows the topics that we will
cover in the remaining lectures.
This picture also shows basic flow of
information in a text information system.
So starting from the big text data, the
first step is to do some natural language
content analysis, because text data is
in the form of natural language text.
So we need to understand
the text to some extent
in order to do something useful for
the users.
So this is the first
topic that we will cover.
And then on top of that as you
can see there are two boxes here.
Those are two types of systems
that can be used to help people
get access to the most relevant data.
Or in other words, those are the two
kinds of systems that will convert
big text data into small
relevant text data.
Search engines are helping
users to search or
to query the data to get
the most relevant documents out.
Recommender systems are to
recommend information to users,
to push information to users.
So those are two complementary ways of
getting users connected to the most
relevant data at the right time.
So this part is called text access,
and this will be the next topic.
And after we cover that we are going
to cover a number of topics,
all about the search engines.
Now the text access
topic is a brief topic,
a brief coverage of
the two kinds of systems.
In the remaining topics, we'll cover
search engines in much more detail.
That includes text retrieval problem,
text retrieval methods, how to evaluate
these methods, implementation of
the system, and web search applications.
And after these, we're going to
cover recommender systems.
So this is what you can expect
in the rest of this course.
Thanks.
[MUSIC]

[SOUND] This lecture is about
natural language content analysis.
As you see from this picture,
this is really the first step
to process any text data.
Text data are in natural languages.
So, computers have to understand
natural languages to some extent in
order to make use of the data, so
that's the topic of this lecture.
We're going to cover three things.
First, what is natural language
processing, which is a main technique for
processing natural language
to obtain understanding?
The second is the State of the Art in NLP,
which stands for
natural language processing.
Finally, we're going to cover the relation
between natural language processing and
text retrieval.
First, what is NLP?
Well, the best way to
explain it is to think about what happens
if you see a text in a foreign
language that you can't understand.
Now, what do you have to do in
order to understand that text?
This is basically what
computers are facing.
Right?
So, look at a simple sentence like
"A dog is chasing a boy on the playground."
We don't have any problems
understanding this sentence, but
imagine what the computer would have
to do in order to understand it.
In general,
it would have to do the following.
First, it would have to know that dog is
a noun, chasing is a verb, et cetera.
So, this is called lexical analysis, or
part-of-speech tagging.
We need to figure out
the syntactic categories of those words.
So, that's the first step.
After that, we're going to figure
out the structure of the sentence.
So for example, here it shows that "a" and
"dog" would go together
to form a noun phrase.
And we won't have "dog" and
"is" go together first, right?
There are some structures
that are just not right.
But, this structure shows what we might
get if we look at the sentence and
try to interpret the sentence.
Some words would go together first, and
then they will go together
with other words.
So here, we show we have noun phrases
as intermediate components and
then verb phrases.
Finally, we have a sentence.
And to get this structure, we need to
do something called syntactic analysis,
or parsing.
And, we may have a parser,
a computer program that would
automatically create this structure.
At this point, you would know
the structure of this sentence, but
still you don't know
the meaning of the sentence.
So, we have to go further
through semantic analysis.
In our mind,
we usually can map such a sentence to what
we already know in our knowledge base.
And for example, you might imagine
a dog that looks like that,
there's a boy and
there's some activity here.
But a computer
would have to use symbols to denote that.
All right.
So, we would use the symbol
d1 to denote a dog,
b1 to denote a boy, and p1
to denote the playground.
Now, there is also a chasing
activity that's happening here, so
we have the relation chasing here,
that connects all these symbols.
So, this is how a computer would obtain
some understanding of this sentence.
Now from this representation, we could
also further infer some other things,
and we might indeed, naturally think
of something else when we read text.
And this is called inference.
So for example, you might believe
that if someone is being chased, then
this person might be scared.
All right.
With this rule,
you can see computers could also
infer that this boy may be scared.
So, this is some extra knowledge
that you would infer based on
some understanding of the text.
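
To make the symbolic representation and the inference step concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the tuple encoding of facts and the function name infer_scared are hypothetical choices, not something shown on the slide. The symbols d1, b1, p1 and the "being chased implies possibly scared" rule come from the lecture.

```python
# Facts are tuples of the form (relation, arg1, arg2, ...).
# d1 = the dog, b1 = the boy, p1 = the playground (symbols from the lecture).
facts = {
    ("Dog", "d1"),
    ("Boy", "b1"),
    ("Playground", "p1"),
    ("Chasing", "d1", "b1", "p1"),  # chaser, chased, location
}


def infer_scared(known_facts):
    """Apply one hand-written rule: whoever is being chased may be scared.

    This is a hypothetical toy rule engine, not a real inference system.
    """
    inferred = set()
    for fact in known_facts:
        if fact[0] == "Chasing":
            chased = fact[2]
            inferred.add(("Scared", chased))
    return inferred


new_facts = infer_scared(facts)
print(new_facts)  # {('Scared', 'b1')}
```

The rule only produces a possibility ("may be scared"); a real system would also need a way to represent how uncertain that conclusion is.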
You can even go further to understand
why the person said this sentence.
So, this has to do with
the use of language.
All right.
This is called pragmatic analysis,
which aims to understand the speech
act of a sentence. After all,
we say something to
basically achieve some goal.
There's some purpose there and
this has to do with the use of language.
In this case, the person who said
the sentence might be reminding
another person to bring back the dog.
That could be one possible intent.
To reach this level of understanding,
we would require all these steps.
And, a computer would have to go
through all these steps in order to
completely understand this sentence.
Yet we humans have no
trouble understanding that.
We instantly get everything,
and there is a reason for that.
That's because we have a large
knowledge base in our brain, and
we use common sense knowledge
to help interpret the sentence.
Computers, unfortunately,
have a hard time obtaining such understanding.
They don't have such a knowledge base.
They are still incapable of doing
reasoning under uncertainty.
So, that makes natural language
processing difficult for computers.
But the fundamental reason why
natural language processing is difficult
for computers is simply that natural
language has not been designed for
computers.
Natural languages
are designed for us to communicate.
There are other languages designed for
computers.
For example, programming languages.
Those are harder for us, right?
So, natural languages are designed to
make our communication efficient.
As a result,
we omit a lot of common sense knowledge
because we assume everyone
knows about that.
We also keep a lot of ambiguities
because we assume the receiver, or
the hearer, would know how to
disambiguate an ambiguous word
based on the knowledge or the context.
There's no need to invent a different
word for each different meaning.
We can overload the same word with
different meanings without a problem.
For these reasons,
every step in natural language
processing is difficult for computers.
Ambiguity is the main difficulty, and
common sense reasoning is often required,
that's also hard.
So, let me give you some
examples of challenges here.
Consider word-level ambiguities.
The same word can have different
syntactic categories.
For example,
design can be a noun or a verb.
The word root may have multiple meanings.
So, square root in math sense,
or the root of a plant.
You might be able to
think of other meanings.
There are also syntactical ambiguities.
For example, the main topic of this
lecture, natural language processing,
can actually be interpreted in two ways,
in terms of the structure.
Think for a moment and
see if you can figure that out.
We usually think of this as
processing of natural languages, but
you could also think of this as saying
that language processing is natural.
Right.
So, this is an example of syntactic ambiguity,
where we have different
structures that can be
applied to the same sequence of words.
Another example of an ambiguous
sentence is the following:
"A man saw a boy with a telescope."
Now, in this case, the question is,
who had the telescope?
All right, this is called a prepositional
phrase attachment ambiguity,
or PP attachment ambiguity.
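
As a rough illustration of what "different structures for the same sequence of words" means, the two readings can be written as two different trees over identical words. This is a minimal Python sketch; the nested-tuple encoding, the category labels, and the helper name leaves are assumptions made for illustration, not the output of any real parser.

```python
# Each tree is a nested tuple: (label, child, child, ...); leaves are plain words.

# Reading 1: the PP attaches to the verb phrase (the man used the telescope to see).
verb_attachment = (
    "S",
    ("NP", "a", "man"),
    ("VP",
     ("V", "saw"),
     ("NP", "a", "boy"),
     ("PP", "with", ("NP", "a", "telescope"))),
)

# Reading 2: the PP attaches to the noun phrase (the boy had the telescope).
noun_attachment = (
    "S",
    ("NP", "a", "man"),
    ("VP",
     ("V", "saw"),
     ("NP",
      ("NP", "a", "boy"),
      ("PP", "with", ("NP", "a", "telescope")))),
)


def leaves(tree):
    """Return the words of a tree from left to right, skipping the labels."""
    if isinstance(tree, str):
        return [tree]
    words = []
    for child in tree[1:]:  # tree[0] is the category label
        words.extend(leaves(child))
    return words


# Same word sequence, different structure: that is the ambiguity.
print(leaves(verb_attachment) == leaves(noun_attachment))  # True
print(verb_attachment == noun_attachment)                  # False
```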
Now, we generally don't have a problem
with these ambiguities because we have
a lot of background knowledge to
help us disambiguate.
Another example of difficulty
is anaphora resolution.
So, think about a sentence like "John
persuaded Bill to buy a TV for himself."
The question here is,
does himself refer to John or Bill?
So again, this is something that
you have to use some background or
the context to figure out.
Finally, presupposition
is another problem.
Consider the sentence,
he has quit smoking.
Now this obviously
implies he smoked before.
So, imagine a computer that wants to understand
all these subtle differences in meaning.
It would have to use a lot of
knowledge to figure that out.
It would also have to maintain a large
knowledge base of all the meanings of
words and how they are connected to our
common sense knowledge of the world.
So this is why it's very difficult.
As a result, we are still not perfect,
in fact far from perfect, in understanding
natural languages using computers.
So this slide sort of gives a simplified
view of state of the art technologies.
We can do part of speech
tagging pretty well.
So, I showed 97% accuracy here.
Now this number is obviously
based on a certain data set, so
don't take this literally.
All right, this just shows that
we could do it pretty well.
But it's still not perfect.
In terms of parsing,
we can do partial parsing pretty well.
That means we can get noun phrase
structures or verb phrase structure, or
some segment of the sentence understood
correctly in terms of the structure.
And, in some evaluation
results we have seen about 90%
accuracy in terms of partial
parsing of sentences.
Again, I have to say, these numbers
are relative to the data set.
In some other data sets,
the numbers might be lower.
Most existing work has been
evaluated using news data sets.
And so, a lot of these numbers are more or
less biased towards news data.
Think about social media data:
the accuracy is likely lower.
In terms of semantic analysis,
we are far from being able to do
a complete understanding of a sentence.
But we have some techniques
that would allow us to do
partial understanding of the sentence.
So, I could mention some of them.
For example, we have techniques that can
allow us to extract the entities and
relations mentioned in text or articles.
For example, recognizing
the mentions of people, locations,
organizations, et cetera in text.
Right?
So this is called entity extraction.
We may be able to recognize the relations.
For example,
this person visited that place,
or this person met that person, or
this company acquired another company.
Such relations can be extracted
using current
natural language processing techniques.
They are not perfect, but
they can do well for some entities.
Some entities are harder than others.
We can also do word sense
disambiguation to some extent.
We have to figure out whether this word in
this sentence has a certain meaning,
while in another context
the computer could figure out
that it has a different meaning.
Again, it's not perfect, but
you can do something in that direction.
We can also do sentiment analysis, meaning
to figure out whether a sentence
is positive or negative.
This is especially useful for
review analysis, for example.
So these are examples of semantic analysis.
And they help us to obtain partial
understanding of the sentences.
Right?
It's not
giving us a complete understanding as
I showed before for the sentence, but
it will still help us gain understanding
of the content and these can be useful.
In terms of inference,
we are not there yet,
probably because of the general difficulty
of inference under uncertainty.
This is a general challenge
in artificial intelligence.
It's probably also because we don't have
a complete semantic representation for
natural language text.
So this is hard.
Yet in some domains, perhaps in
limited domains where you have a lot of
restrictions on the word uses,
you may be able to perform
inference to some extent, but in general
we cannot really do that reliably.
Speech act analysis is also
far from being done, and
we can only do that analysis for
very special cases.
So, this roughly gives you some
idea about the state of the art.
And let me also talk a little
bit about what we can't do.
And so we can't even do
100% part-of-speech tagging.
This looks like a simple task,
but think about the example here:
the two uses of "off" may have different
syntactic categories if you try
to make fine-grained distinctions.
It's not that easy to figure
out such differences.
It's also hard to do general,
complete parsing.
And again, this same sentence
that you saw before is an example.
This ambiguity can be
very hard to resolve.
And you can imagine examples where you
have to use a lot of knowledge
from the context of the sentence or
from the background in order to figure
out who actually had the telescope.
So although the sentence looks very
simple, it actually is pretty hard.
And in cases when the sentence is
very long, imagine it has four or
five prepositional phrases, then there
are even more possibilities to figure out.
It's also hard to do precise
deep semantic analysis.
So here's an example.
In this sentence, John owns a restaurant,
how do we define owns exactly?
The word, own, you know,
is something that we can understand but
it's very hard to precisely describe
the meaning of own for computers.
So as a result we have robust and
general natural language processing
techniques that can process a lot of text
data in a shallow way,
meaning we only do superficial analysis.
For example, part-of-speech tagging, or
partial parsing, or recognizing sentiment.
And those are not deep understanding
because we're not really
understanding the exact
meaning of the sentence.
On the other hand, the deep understanding
techniques tend not to scale up well,
meaning that they would fail
on some unrestricted text.
And if you don't restrict
the text domain or
the use of words, then these
techniques tend not to work well.
They may work well, based on machine
learning techniques on the data
that are similar to the training data
that the program has been trained on.
But they generally wouldn't work well on
the data that are very different from
the training data.
So this pretty much summarizes the state
of the art of natural language processing.
Of course, within such a short amount
of time, we can't really give you
a complete view of NLP, which is a
big field, and I'd expect you
to see multiple courses on the natural
language processing topic itself.
But because of its relevance to the
topic that we talk about, it's useful for
you to know the background in case
you haven't been exposed to it.
So, what does that mean for
text retrieval?
Well, in text retrieval we
are dealing with all kinds of text.
It's very hard to restrict
the text to a certain domain.
And we are also often dealing with
a lot of text data, so that means
the NLP techniques must be general,
robust, and efficient. And that
just implies that today we can only use fairly
shallow NLP techniques for text retrieval.
In fact,
most search engines today use something
called a bag of words representation.
Now, this is probably the simplest
representation you can think of.
That is to turn text data
into simply a bag of words,
meaning we will keep the individual words
but we'll ignore the order of the words.
And we'll keep duplicated
occurrences of words.
So this is called a bag
of words representation.
When you represent the text in this way,
you ignore a lot of the information,
and that just makes it harder
to understand the exact meaning of
a sentence because we've lost the order.
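
A minimal Python sketch of the bag of words idea follows. It assumes the simplest possible tokenization (lowercasing and splitting on whitespace), which is an illustrative choice and not how any particular search engine tokenizes text. It shows that duplicate occurrences are kept as counts while word order is lost.

```python
from collections import Counter


def bag_of_words(text):
    """Map each word to its number of occurrences; word order is discarded."""
    return Counter(text.lower().split())


dog_chases_boy = bag_of_words("a dog is chasing a boy on the playground")
boy_chases_dog = bag_of_words("a boy is chasing a dog on the playground")

print(dog_chases_boy["a"])               # 2 (duplicates are kept as counts)
print(dog_chases_boy == boy_chases_dog)  # True (the order is lost)
```

The two sentences mean different things, yet they have the same bag of words, which is exactly the information loss described above.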
But yet, this representation tends
to actually work pretty well for
most search tasks.
And this is partly because the search
task is not all that difficult.
If you see matching of some of the query
words in a text document, chances
are that that document is about the topic,
although there are exceptions, right?
In comparison, some other tasks, for
example machine translation, would require
you to understand the language accurately,
otherwise the translation would be wrong.
So in comparison,
search tasks are relatively easy, and
such a representation is often sufficient.
And that's also the representation
that the major search engines today,
like Google or Bing, are using.
Of course, I put in parentheses "but
not all."
Of course there are many queries that are
not answered well by the current search
engines, and
they do require a representation
that would go beyond bag
of words representation.
That would require more natural
language processing, to be done.
There is another reason why we have
not used the sophisticated NLP
techniques in modern search engines, and
that's because some retrieval techniques
actually naturally solve
the problem of NLP.
So, one example,
is word sense disambiguation.
Think about a word like Java.
It could mean coffee, or
it could mean a programming language.
If you look at the word
alone, it would be ambiguous.
But when the user uses the word in
a query, usually there are other words.
For example, I'm looking for
usage of Java applet.
When I have applet there, that
implies Java means the programming language.
And that context can help us naturally
prefer documents where Java is
referring to the programming language,
because those documents would
probably match applet as well.
If Java occurs in the document
in a way that means coffee,
then you would never match applet,
or only with very small probability.
Right.
So this is a case when some retrieval
techniques naturally achieve the goal
of word sense disambiguation.
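
The Java applet example can be sketched with a very simple matching function. This is only an illustration under stated assumptions: the two toy documents, the function name overlap_score, and scoring by the number of distinct shared words are all hypothetical, and real retrieval functions are more sophisticated. The point is just that the extra query word pulls the programming document ahead without any explicit word sense analysis.

```python
def overlap_score(query, document):
    """Count how many distinct query words also occur in the document."""
    query_terms = set(query.lower().split())
    doc_terms = set(document.lower().split())
    return len(query_terms & doc_terms)


# Two toy documents using "java" in different senses (made up for illustration).
documents = {
    "programming": "how to embed a java applet in a web page",
    "coffee": "java coffee beans are roasted dark",
}

query = "usage of java applet"
ranked = sorted(
    documents,
    key=lambda name: overlap_score(query, documents[name]),
    reverse=True,
)
print(ranked)  # ['programming', 'coffee']
```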
Another example is a technique called
feedback, which we will talk about
later in some of the lectures.
This technique would allow us
to add additional words to the query.
And those additional words could
be related to the query words.
And these words can help match documents
where the original query words
have not occurred.
So this achieves, to some extent,
semantic matching of terms.
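
Here is a rough Python sketch of the query expansion idea behind feedback. The lecture does not say at this point how the additional words are chosen, so this sketch makes a loud assumption: it simply takes the most frequent non-query words from a few documents assumed to be relevant. The function name expand_query, the parameter num_new_terms, and the toy documents are all hypothetical.

```python
from collections import Counter


def expand_query(query, feedback_docs, num_new_terms=2):
    """Add the most frequent non-query words from the feedback documents.

    Assumption: raw frequency is used as a stand-in for "related to the query".
    """
    query_terms = query.lower().split()
    counts = Counter()
    for doc in feedback_docs:
        for term in doc.lower().split():
            if term not in query_terms:
                counts[term] += 1
    new_terms = [term for term, _ in counts.most_common(num_new_terms)]
    return query_terms + new_terms


# Documents assumed to be relevant to the original query (made up for illustration).
feedback_docs = [
    "automobile car repair",
    "car repair manual",
    "used car dealer",
]

expanded = expand_query("automobile", feedback_docs)
print(expanded)  # ['automobile', 'car', 'repair']

# A document that never mentions "automobile" now shares words with the query.
print(set(expanded) & set("cheap car repair".split()))
```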
So those techniques also helped us
bypass some of the difficulties
in natural language processing.
However, in the long run, we still need
deeper natural language processing
techniques in order to improve the
accuracy of the current search engines.
And it's particularly needed for complex
search tasks, or for question answering.
Google has recently
launched a knowledge graph.
And this is one step toward that goal,
because knowledge graph would contain
entities and their relations.
And this goes beyond the simple
bag of words representation.
And such a technique should help us improve
the search engine utility significantly,
although this is still an open topic for
research and exploration.
In sum, in this lecture we talked
about what NLP is, and we talked
about the state of the art techniques,
what we can do and what we cannot do.
And finally, we also explained
why bag of words representation
remains the dominant representation used
in modern search engines even though
deeper NLP would be needed for
future search engines.
If you want to know more you can take
a look at some additional readings.
I only cited one here.
And that's a good starting point though.
Thanks.
[MUSIC]

[SOUND] In this lecture,
we're going to talk about text access.
In the previous lecture, we talked
about natural language content analysis.
We explained that the state of the art
natural language processing techniques
are still not good enough to process
a lot of unrestricted text data
in a robust manner.
As a result, bag of words representation
remains very popular in
applications like search engines.
In this lecture we're going to talk
about some high level strategies
to help users get access to the text data.
This is also an important step to
convert raw, big text data into small
relevant data that are actually
needed in a specific application.
So the main question we address here is,
how can a text information system
help users get access to
the relevant text data?
We're going to cover two complementary
strategies, push vs pull.
And then we're going to talk about
two ways to implement the pull mode,
querying vs browsing.
So first, push vs pull.
These are two different ways to connect
users with the right information
at the right time.
The difference is
which party takes the initiative.
In the pull mode,
the users would take the initiative to
start the information access process.
And in this case, a user typically would
use a search engine to fulfill the goal.
For example,
the user may type in a query, and
then browse results to find
the relevant information.
So this is usually appropriate for
satisfying a user's ad
hoc information need.
An ad hoc information need is
a temporary information need.
For example, you want to buy a product so
you suddenly have a need to read
reviews about related products.
But after you have collected information,
you have purchased your product, you
generally no longer need such information.
So it's a temporary information need.
In such a case, it's very hard for
a system to predict your need, and
it's more appropriate for
the users to take the initiative.
And that's why search engines
are very useful today,
because many people have many ad
hoc information needs all the time.
So as we are speaking, Google is probably
processing many queries, and
those are all, or
mostly, ad hoc information needs.
So this is a pull mode.
In contrast, in the push mode
the system will take the initiative
to push the information to the user or
to recommend the information to the user.
So in this case, this is usually
supported by a recommender system.
Now this would be appropriate if
the user has a stable information need.
For example, you may have a research
interest in some topic, and
that interest tends to stay for
a while, so it's relatively stable.
Your hobby is another example
of a stable information need.
In such a case, the system can interact
with you and can learn your interest, and
then can monitor the information stream.
If the system has seen any
items relevant to your interest,
the system could then take the initiative
to recommend information to you.
So for example, a news filter or
news recommender system could
monitor the news stream and
identify interesting news for you, and
simply push the news articles to you.
This mode of information access
may be also appropriate when
the system has a good knowledge
about the user's need.
And this happens in the search context.
So for example, when you search for
information on the web a search
engine might infer you might be also
interested in some related information.
And they would recommend
the information to you.
So that should remind you of, for example,
advertisements placed on a search page.
So this is about the two high level
strategies or two modes of text access.
Now let's look at the pull
mode in more detail.
In the pull mode, we can further distinguish
two ways to help users,
querying vs browsing.
In querying, a user would just enter
a query, typically a keyword query, and
the search engine system would
return relevant documents to users.
And this works well when
the user knows exactly what
keywords to use.
So if you know exactly
what you're looking for
you tend to know the right keywords,
and then query would work very well.
And we do that all the time.
But we also know that
sometimes it doesn't work so
well, when you don't know the right
keywords to use in the query or
you want to browse information
in some topic area.
In this case browsing
would be more useful.
So in this case in the case of browsing
the users would simply navigate
into the relevant information
by following the path that's
supported by the structures
on the documents.
So the system would maintain
some kind of structures, and
then the user could follow
these structures to navigate.
So this strategy works well when the user
wants to explore information space or
the user doesn't know what
are the keywords to use in the query.
Or simply because the user, finds it
inconvenient to type in the query.
So even if a user knows what query to type
in, if the user is using a cell phone
to search for information,
then it's still hard to enter the query.
In such a case again,
browsing tends to be more convenient.
The relationship between browsing and
querying is best understood by
making an analogy to sightseeing.
Imagine if you are touring a city.
Now if you know the exact address
of an attraction, then taking a taxi
there is perhaps the fastest way,
you can go directly to the site.
But if you don't know the exact address,
you may need to walk around, or
you can take a taxi to a nearby place,
and then walk around.
It turns out that we do exactly
the same in the information space.
If you know exactly what you
are looking for, then you can
use the right keywords in your query
to find the information directly.
That's usually the fastest
way to find information.
But what if you don't know
the exact keywords to use?
Well, your query probably won't work so
well, and you will land on some related
pages, and then you need to also walk
around in the information space.
Meaning by following the links or
by browsing,
you can then finally get
into the relevant page.
If you want to learn about a topic again,
you will likely do a lot of browsing.
So just like you are looking
around in some area and
you want to see some interesting
attractions in the same region.
So this analogy also tells us that
today we have very good support for
querying, but we don't really
have good support for browsing.
And this is because in order to browse
effectively, we need a map to guide us,
just like you need a map of Chicago
to tour the city of Chicago.
You need a topical map to
tour the information space.
So how to construct such a topical
map is in fact a very interesting
research question that
likely will bring us
more interesting browsing experience
on the web or in other applications.
So to summarize this lecture, we have
talked about two high level strategies for
text access, push and pull.
Push tends to be supported
by a recommender system and
pull tends to be supported
by a search engine.
Of course, in a sophisticated
intelligent information system,
we should combine the two.
In the pull mode we have further
distinguished querying and browsing.
Again, we generally want to combine
the two ways to help users so
that you can support both querying and
browsing.
If you want to know more about
the relationship between pull and
push, you can read this article.
This gives an excellent discussion of the
relationship between information filtering
and information retrieval.
Here information filtering is similar
to information recommendation,
or the push mode of information access.
[MUSIC]

[SOUND] This lecture is about
the text retrieval problem.
This picture shows our overall plan for
lectures.
In the last lecture, we talked about
the high level strategies for text access.
We talked about push versus pull.
Search engines are the main tools for
supporting the pull mode.
Starting from this lecture,
we're going to talk about how
search engines work in detail.
So first,
it's about the text retrieval problem.
We're going to talk about
three things in this lecture.
First, we'll define text retrieval.
Second, we're going to make
a comparison between text retrieval and
the related task, database retrieval.
Finally, we're going to talk about the
document selection versus document ranking
as two strategies for
responding to a user's query.
So what is text retrieval?
It should be a task that's familiar
to most of us because we're using web
search engines all the time.
So text retrieval is basically a task
where the system would respond
to a user's query with relevant documents,
basically to
support querying as one way to implement
the pull mode of information access.
So the scenario's the following.
You have a collection of text documents.
These documents could be all
the web pages on the web.
Or all the literature articles
in the digital library or
maybe all the text files in your computer.
A user will typically give a query to the
system to express the information need.
And then the system would return
relevant documents to users.
Relevant documents refer to those
documents that are useful to the user who
typed in the query.
Now this task is often
called information retrieval.
But literally, information retrieval
would broadly include the retrieval of
other non-textual information as well.
For example, audio, video, et cetera.
It's worth noting that text
retrieval is at the core of
information retrieval in the sense
that other media such as
video can be retrieved by
exploiting the companion text data.
So for example,
current image search engines actually
match a user's query with
the companion text data of the image.
This problem is also called
the search problem,
and the technology is often called
search technology in industry.
If you have ever taken a course in databases,
it'll be useful to pause
the lecture at this point and
think about the differences between
text retrieval and database retrieval.
Now these two tasks
are similar in many ways.
But there are some important differences.
So, spend a moment to think about
the differences between the two.
Think about the data and information
managed by a search engine versus
those that are
managed by a database system.
Think about the difference between
the queries that you typically specify for
a database system versus the queries that
are typed in by users on a search engine.
And then finally think about the answers.
What's the difference between the two?
Okay, so
if we think about the information or data
managed by the two systems,
we will see that in text retrieval,
the data is unstructured, it's free text.
But in databases, they are structured
data, where there is a clearly defined
schema to tell you this column
is the names of people and
that column is ages, et cetera.
In unstructured text,
it's not obvious what are the names
of people mentioned in the text.
Because of this difference,
we can also see that text information
tends to be more ambiguous.
And we'll talk about that in the natural
language processing lecture.
Whereas in databases, the data tend
to have well-defined semantics.
There is also an important
difference in the queries, and
this is partly due to the difference
in the information, or data.
So text queries tend to be ambiguous,
whereas in database search,
the queries are typically well-defined.
Think about a SQL query
that would clearly specify
what records are to be returned.
So it has very well defined semantics.
Keyword queries or natural language
queries tend to be incomplete
also, in that they don't really
fully specify what documents
should be retrieved.
Whereas, in the database search,
the SQL query
can be regarded as a complete
specification for what should be returned.
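The contrast between the two kinds of queries can be made concrete with a small sketch. This is only an illustration: the table `people`, its rows, and the two toy documents are invented here, and the keyword match is a deliberately naive word lookup, not how a real search engine works.

```python
import sqlite3

# Hypothetical toy table; the schema and rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [("Alice", 34), ("Bob", 27), ("Carol", 41)],
)

# The SQL query is a complete specification: exactly the rows with age > 30.
rows = conn.execute(
    "SELECT name FROM people WHERE age > 30 ORDER BY name"
).fetchall()
names = [r[0] for r in rows]

# A keyword query is only a hint. Both documents contain the word,
# but which one the user wants is not specified by the query itself.
docs = {
    "d1": "python is a programming language",
    "d2": "the python is a large snake",
}
query = "python"
matches = sorted(doc_id for doc_id, text in docs.items() if query in text.split())
```

The SQL result is right or wrong by definition, whereas for the keyword query both documents match and only the user can say which one is relevant.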
And because of these differences,
the answers would be also different.
In the case of text retrieval,
we're looking for relevant documents.
In the database search,
we are retrieving records or
matched records with the SQL query,
more precisely.
Now in the case of text retrieval,
what should be the right answers to
a query is not very well specified,
as we just discussed.
So it's unclear what should be
the right answers to a query.
And this has very important consequences,
and
that is text retrieval is
an empirically defined problem.
And so this is a problem because
if it's empirically defined,
then we cannot mathematically prove one
method is better than another method.
That also means we must rely
on empirical evaluation
involving users to know
which method works better.
And that's why we have one lecture,
actually more than one lecture,
to cover the issue of evaluation.
Because this is a very important topic for
search engines.
Without knowing how to evaluate
an algorithm appropriately,
there's no way to tell whether we
have got the better algorithm or
whether one system is better than another.
So now let's look at
the problem in a formal way.
So this slide shows a formal formulation
of the text retrieval problem.
First, we have our vocabulary set which
is just a set of words in a language.
Now here,
we're considering just one language, but
in reality on the web there might
be multiple natural languages.
We have text that are in
all kinds of languages.
But here for simplicity, we just
assume there is one kind of language.
As the techniques used for
retrieving data from multiple languages
are more or
less similar to the techniques used for
retrieving documents in one language.
Although there are important differences,
the principles and
methods are very similar.
Next we have the query,
which is a sequence of words.
And so here you can see the query
is defined as a sequence of words.
Each q sub i is a word in the vocabulary.
A document is defined in the same way.
So it's also a sequence of words.
And here,
d sub ij is also a word in the vocabulary.
Now typically, the documents
are much longer than queries.
But there are also cases where
the documents may be very short.
So you can think about
what might be an example of that case.
I hope you can think of
Twitter search, all right?
Tweets are very short.
But in general,
documents are longer than the queries.
Now, then we have
a collection of documents.
And this collection can be very large.
So think about the web.
It could be very large.
And then the goal of text retrieval is
to find the set of relevant documents,
which we denote by R of q,
because it depends on the query.
And this is, in general, a subset of
all the documents in the collection.
Unfortunately, this set of relevant
documents is generally unknown,
and user-dependent in the sense that for
the same query typed
in by different users, the expected
relevant documents may be different.
The query given to us by
the user is only a hint
on which documents should be in this set.
And indeed, the user is generally unable
to specify what exactly should be in
the set, especially in the case of a web
search where the collection is so large.
The user doesn't have complete
knowledge about the whole collection.
So, the best a search system can do is
to compute an approximation of
this relevant document set.
So we denote it by R prime of q.
So, formally, we can see the task
is to compute this R prime of q,
an approximation of
the relevant documents.
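Written out in symbols, the formulation described so far looks roughly as follows. The sizes N, m, m_i, and M are notation assumed here for illustration; the lecture itself only names q sub i, d sub ij, R of q, and R prime of q.

```latex
\begin{align*}
V &= \{w_1, \dots, w_N\} && \text{vocabulary} \\
q &= q_1, \dots, q_m, \quad q_i \in V && \text{query} \\
d_i &= d_{i1}, \dots, d_{i m_i}, \quad d_{ij} \in V && \text{document} \\
C &= \{d_1, \dots, d_M\} && \text{collection} \\
R(q) &\subseteq C && \text{relevant documents (generally unknown)} \\
R'(q) &\approx R(q) && \text{approximation computed by the system}
\end{align*}
```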
So how can we do that?
Now, imagine if you are now asked
to write a program to do this.
What would you do?
Now think for a moment.
Right, so these are your inputs:
the query and the documents. And
then you are to compute
the answers to this query,
which is a set of documents that
would be useful to the user.
So how would you solve the problem?
In general there are two
strategies that we can use.
All right, the first strategy
is to do document selection.
And that is, we're going to have
a binary classification function, or
binary classifier.
That's a function that
will take a document and
query as input, and
then give a zero or one as output,
to indicate whether this document
is relevant to the query, or not.
So in this case, you can see
the relevant document
set is defined as follows.
It is basically all the documents that
have a value of one by this function.
And so in this case,
you can see the system must decide
if a document is relevant or not.
Basically, it has to say
whether it's one or zero.
And this is called absolute relevance.
Basically, it needs to know exactly
whether it's going to be useful
to the user.
Alternatively, there's another
strategy called document ranking.
Now in this case,
the system is not going to make a call
whether a document is relevant or not.
Rather, the system's going to
use a real value function, f,
here that would simply give us a value.
That would indicate which
document is more likely relevant.
So it's not going to make a call whether
this document is relevant or not,
but rather it would say which
document is more likely relevant.
So this function then can be
used to rank the documents.
And then we're going to let the user
decide where to stop when the user looks
at the documents.
So we have a threshold,
theta, here to determine
what documents should be
in this approximation set.
And we're going to assume that all
the documents that are ranked above this
threshold are in the set.
Because in effect, these are the documents
that we delivered to the user.
And theta is a cutoff
determined by the user.
So here we've got some collaboration from
the user in some sense because we
don't really make a cutoff, and
the user kind of helped
the system make a cutoff.
So in this case, the system only needs
to decide if one document is more likely
relevant than another.
And that is, it only needs to
determine relative relevance as
opposed to absolute relevance.
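To see the two strategies side by side, here is a minimal sketch. The scoring function `overlap_score` (counting distinct query words that appear in the document) and the four toy documents are assumptions made up for this example; they stand in for whatever function f a real system would use.

```python
def overlap_score(query, doc):
    """Toy scoring function f(d, q): number of distinct query words in doc."""
    return len(set(query.split()) & set(doc.split()))

def select(query, docs, min_score=1):
    """Document selection: a hard yes/no decision for each document."""
    return [d_id for d_id, text in docs.items()
            if overlap_score(query, text) >= min_score]

def rank(query, docs):
    """Document ranking: order by score, highest first, with no hard cutoff."""
    return sorted(docs, key=lambda d_id: overlap_score(query, docs[d_id]),
                  reverse=True)

docs = {
    "d1": "presidential campaign news today",
    "d2": "presidential campaign finance report",
    "d3": "football news",
    "d4": "cooking recipes",
}
query = "presidential campaign news"

selected = select(query, docs)   # system commits to absolute relevance
ranked = rank(query, docs)       # system only orders the documents
top_k = ranked[:2]               # the user, not the system, decides where to stop
```

In the selection case the system has to pick `min_score` itself, while in the ranking case the cutoff (here the slice `[:2]`) plays the role of the threshold theta chosen by the user.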
Now you can probably already
sense that
relative relevance would be easier
to determine than absolute relevance.
Because in the first case,
we have to say exactly whether
a document is relevant or not, right?
And it turns out that ranking is indeed
generally preferred to document selection.
So let's look at these two
strategies in more detail.
So this picture shows how it works.
So on the left side,
we see these documents.
And we use the pluses to
indicate the relevant documents.
So we can see that
the set of true relevant documents
consists of these pluses, these documents.
And with the document selection function,
we basically classify them into two groups,
relevant documents and non-relevant ones.
Of course, the classifier will not be
perfect, so it will make mistakes.
So here we can see in the approximation of
the relevant documents we have
got some non-relevant documents.
And similarly, there's a relevant document
that's misclassified as non-relevant.
In the case of document ranking,
we can see the system
simply ranks all the documents in
the descending order of the scores.
And we're going to let the user stop
wherever the user wants to stop.
So if a user wants to examine more
documents, then the user will
go down the list to examine more and
stop at the lower position.
But if the user only wants to
read a few relevant documents,
the user might stop at the top position.
So in this case,
the user stops at d4, so in effect,
we have delivered these
four documents to our user.
So as I said,
ranking is generally preferred.
And one of the reasons is
because the classifier,
in the case of document selection,
is unlikely to be accurate.
Why?
Because the only clue is usually
the query.
But the query may not be accurate, in the
sense that it could be overly constrained.
For example, you might expect the relevant
documents to talk about all these
topics by using specific vocabulary,
and as a result,
you might match no relevant documents,
because in the collection,
no authors have discussed the topic
using this vocabulary.
All right.
So in this case,
we'll see there is this problem of
no relevant documents to return in
the case of overly constrained query.
On the other hand, if the query is
under-constrained, for example,
if the query does not have sufficient
discriminating words to
find the relevant documents,
you may actually end up having
over-delivery.
And this is when you thought these
words might be sufficient to
help you find the relevant documents, but
it turns out that they're not sufficient.
And there are many distracting
documents using similar words.
And so this is the case of over delivery.
Unfortunately, it's very hard to find the
right position between these two extremes.
Why?
Because when the user is looking for
the information, in general,
the user does not have a good knowledge
about the information to be found.
And in that case, the user does
not have a good knowledge about
what vocabulary will be used
in those relevant documents.
So it's very hard for
a user to pre-specify
the right level of constraints.
Even if the classifier is accurate,
we also still want to rank these
relevant documents because they
are generally not equally relevant.
Relevance is often a matter of degree.
So we must prioritize these documents for
the user to examine.
And note that this
prioritization is very important,
because a user cannot digest
all the contents at once.
The user generally would have to
look at each document sequentially.
And therefore, it would make
sense to feed users with the most
relevant documents, and
that's what ranking is doing.
So for these reasons ranking
is generally preferred.
Now, this preference also has
a theoretical justification, and
this is given by the probability
ranking principle.
In the end of this lecture
there is a reference for this.
This principle says, returning a ranked
list of documents in descending order of
probability, that a document
is relevant to the query,
is the optimal strategy under
the following two assumptions.
First, the utility of a document to a user
Is independent of the utility
of any other document.
Second, a user would be assumed to
browse the results sequentially.
Now it's easy to understand why
these two assumptions are needed,
in order to justify
the ranking strategy.
Because, if the documents are independent,
then we can evaluate the utility
of each document separately.
And this would allow us to compute
a score for each document independently.
And then we're going to rank these
documents based on those scores.
The second assumption is to say that the
user would indeed follow the ranked list.
If the user is not going to follow
the ranked list, is not going to examine
the documents sequentially, then obviously
the ordering would not be optimal.
So under these two assumptions, we can
theoretically justify the ranking strategy
is in fact the best that you could do.
Now I've put one question here.
Do these 2 assumptions hold?
Now I suggest you pause the lecture for
a moment to think about these.
Now can you think of some
examples that would suggest
these assumptions aren't necessarily true?
Now if you think for
a moment you may realize none of
the assumptions is actually true.
For example in the case of
independence assumption, we might have
documents that have similar
content or exactly the same content.
If you look at each of them alone,
each is relevant.
But if the user has already seen
one of them, we assume it's
generally not very useful for the user
to see another similar or duplicate one.
So clearly the utility of
a document is dependent
on other documents that the user has seen.
In some other cases, you might see
a scenario where one document alone may not
be useful to the user, but when three
particular documents are put together,
they provide an answer to
the user's question.
So this is collective relevance.
And that also suggests that
the value of the document
might depend on other documents.
Sequential browsing generally would make
sense if you have a ranked list there.
But even if you have a ranked list,
there is evidence showing that
users don't always just go strictly
sequentially through the entire list.
They sometimes would look at
the bottom for example, or skip some.
And if you think about the more
complicated interfaces that we could possibly
use, like a two-dimensional interface
where you can put additional information
on the screen, then sequential browsing
is a very restrictive assumption.
So the point here is that,
none of these assumptions is really true,
but nevertheless,
the probability ranking principle
establishes some solid foundation for
ranking as a primary task for
search engines.
And this has actually been the basis for
a lot of research work in
information retrieval.
And many algorithms have been
designed based on this assumption.
This is despite the fact that the assumptions
aren't necessarily true.
And we can address this problem by
doing post processing of a ranked list.
For example, to remove redundancy.
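One simple way such post-processing could look is sketched below. The lecture does not say how redundancy should be detected; the Jaccard word-overlap test, the `max_jaccard` threshold, and the toy ranked list are all assumptions chosen for illustration.

```python
def remove_near_duplicates(ranked_docs, max_jaccard=0.8):
    """Walk the ranked list top-down and drop any document whose word set
    is too similar (Jaccard overlap) to a document already kept."""
    kept = []
    for doc_id, text in ranked_docs:
        words = set(text.split())
        is_redundant = False
        for _, kept_text in kept:
            kept_words = set(kept_text.split())
            union = words | kept_words
            jaccard = len(words & kept_words) / len(union) if union else 1.0
            if jaccard >= max_jaccard:
                is_redundant = True
                break
        if not is_redundant:
            kept.append((doc_id, text))
    return kept

# Hypothetical ranked list: (doc_id, text) pairs, best first.
ranked = [
    ("d1", "election results announced today"),
    ("d2", "election results announced today"),  # exact duplicate of d1
    ("d3", "stock market closes higher"),
]
deduped_ids = [doc_id for doc_id, _ in remove_near_duplicates(ranked)]
```

The ranking itself still scores each document independently; the dependence between documents is only handled afterwards, on the ranked list.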
So to summarize this lecture,
the main points that you can
take away are the following.
First, text retrieval is
an empirically defined problem.
And that means which algorithm is
better must be judged by the users.
Second, document ranking
is generally preferred, and
this will help users prioritize
examination of search results.
And this is also to bypass the difficulty
in determining absolute relevance,
because we can get some help from users
in determining where to make the cut off.
It's more flexible.
So this further suggests that
the main technical challenge
in designing a search engine is
designing an effective ranking function.
In other words, we need to define
what is the value of this function f
on the query and document pair.
Now how to design such a function is
a main topic in the following lectures.
There are two suggested
additional readings.
The first is the classic paper on
probability ranking principle.
The second, is a must read for anyone
doing research in information retrieval.
It's a classic IR book,
which has excellent coverage of
the main research results in early days,
up to the time when the book was written.
Chapter six of this book has
an in-depth discussion of
the probability ranking principle
and
the probabilistic retrieval models
in general.
[MUSIC]

[MUSIC]
This lecture is an overview
of text retrieval methods.
In the previous lecture we introduced
you to the problem of text retrieval.
We explained that the main problem is
to design a ranking function to rank
documents for a query.
In this lecture,
we will give an overview of different
ways of designing this ranking function.
So the problem is the following.
We have a query that has
a sequence of words, and
a document that's also a sequence of
words, and we hope to define the function
f that can compute a score based
on the query and document.
So the main challenge here is
designing a good ranking function that can
rank all the relevant documents,
on top of all the non-relevant ones.
Now clearly this means our
function must be able to measure
the likelihood that a document
d is relevant to a query q.
That also means we have to have
some way to define relevance.
In particular in order to implement
the program to do that we have to have
a computational definition of relevance,
and we achieve this goal by
designing a retrieval model, which
gives us a formalization of relevance.
Now, over many decades,
researchers have designed
many different kinds of retrieval models,
and they fall into different categories.
First, one family of models
is based on the similarity idea.
Basically, we assume that if
a document is more similar to the query
than another document is,
then we would say the first document
is more relevant than the second one.
So in this case,
the ranking function is defined as
the similarity between the query and
the document.
One well known example in this
case is vector space model,
which we will cover more in
detail later in the lecture.
The second kind of models
are called probabilistic models.
In this family of models,
we follow a very different strategy.
We assume that queries and
documents are all observations
from random variables, and
we assume there is a binary
random variable called R here,
to indicate whether a document
is relevant to a query.
We then define the score of a document with
respect to a query as the probability
that this random variable R is equal to 1,
given a particular document and query.
There are different cases
of such a general idea.
One is classic probabilistic model,
another is language model, yet
another is
divergence-from-randomness model.
In a later lecture,
we will talk more about one case,
which is language model.
The third kind of model
is based on probabilistic inference.
So here the idea is to associate
uncertainty to inference rules.
And we can then quantify the probability
that we can show that the query
follows from the document.
Finally, there is also a family of models
that are using axiomatic thinking.
Here the idea is to define
a set of constraints that
we hope a good retrieval
function to satisfy.
So in this case the problem is to seek
a good ranking function that can
satisfy all the desired constraints.
Interestingly, although these different
models are based on different thinking,
in the end the retrieval function
tends to be very similar.
And these functions tend to
also involve similar variables.
So now let's take a look at the
common form of a state of the art retrieval
model and examine some of the common
ideas used in all these models.
First, these models are all
based on the assumption
of using bag of words for
representing text.
And we explained this in the natural
language processing lecture.
Bag of words representation remains
the main representation used in all
the search engines.
So, with this assumption,
the score of a query like presidential
campaign news,
with respect to a document d here,
would be based on scores computed
for each individual word.
And that means the score would
depend on the score of each word,
such as presidential, campaign, and news.
Here we can see there are three
different components,
each corresponding to how well the
document matches each of the query words.
Inside of these functions,
we see a number of heuristics used.
So for example, one factor that
affects the function g here is how
many times does the word
presidential occur in the document?
This is called a Term Frequency or TF.
We might also denote it as
c of presidential, d.
In general if the word
occurs more frequently in
the document then the value of
this function would be larger.
Another factor is how
long is the document, and
this is
to use the document length for scoring.
In general, if a term occurs in a long
document many times,
it's not as significant as
if it occurred the same number
of times in a short document.
Because in the long document any term
is expected to occur more frequently.
Finally, there is this factor called
a document frequency, and that is we also
want to look at how often presidential
occurs in the entire collection.
And we call this Document Frequency,
or DF, of presidential.
And in some other models we
might also use a probability
to characterize this information.
So here, I show the probability of
presidential in the collection.
So all these are trying to
characterize the popularity of
the term in the collection.
In general,
matching a rare term in the collection
is contributing more to the overall
score than matching a common term.
So this captures some of the main ideas
used in pretty much all the state of
the art retrieval models.
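The three heuristics can be put together in a small sketch. The particular formula below is made up for illustration and is not BM25, pivoted length normalization, or any other model named in this course; it only shows a score that is a sum over query words, grows with term frequency, shrinks for long documents, and rewards rare terms. The names `g`, `score`, and the toy documents are assumptions.

```python
import math
from collections import Counter

def g(word, doc_words, doc_freq, num_docs, avg_len):
    """Hypothetical per-word score combining TF, document length, and DF."""
    tf = Counter(doc_words)[word]                     # term frequency c(w, d)
    if tf == 0:
        return 0.0
    length_norm = len(doc_words) / avg_len            # > 1 for long documents
    idf = math.log((num_docs + 1) / doc_freq[word])   # rare terms weigh more
    return (tf / length_norm) * idf

def score(query, doc_id, docs):
    """Bag-of-words score: the sum of g over the individual query words."""
    tokenized = {d: text.split() for d, text in docs.items()}
    num_docs = len(tokenized)
    avg_len = sum(len(words) for words in tokenized.values()) / num_docs
    # Document frequency: in how many documents each word occurs.
    doc_freq = Counter(w for words in tokenized.values() for w in set(words))
    return sum(g(w, tokenized[doc_id], doc_freq, num_docs, avg_len)
               for w in query.split())

docs = {
    "d1": "presidential campaign news",
    "d2": "sports news today",
    "d3": "cooking recipes for dinner",
}
query = "presidential campaign news"
s1, s2, s3 = (score(query, d, docs) for d in ("d1", "d2", "d3"))
```

Here d1 matches all three query words, d2 matches only the more common word news, and d3 matches nothing, so the scores come out in that order.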
So now, a natural question is
which model works the best?
Now, it turns out that many models work
equally well, so here I listed the four
major models that are generally regarded
as state of the art retrieval models.
Pivoted length normalization,
BM25, query likelihood, PL2.
When optimized these models tend
to perform similarly, and this was
discussed in detail in this reference
at the end of this lecture.
Among all these,
BM25 is probably the most popular.
It's most likely that this has been used
in virtually all the search engines,
and you will also often see this
method discussed in research papers.
And we'll talk more about this
method later in some other lectures.
So, to summarize, the main points made
in this lecture are, first the design
of a good ranking function requires a
computational definition of relevance, and
we achieve this goal by designing
a proper retrieval model.
Second, many models are equally effective
but we don't have a single winner here.
Researchers are still actively
working on this problem,
trying to find a truly
optimal retrieval model.
Finally, the state of the art
ranking functions tend to rely on
the following ideas.
First, bag of words representation.
Second, TF and
the document frequency of words.
Such information is used in
the ranking function to determine
the overall contribution of matching
a word. And third, document length.
These are often combined in
interesting ways and we'll discuss
how exactly they are combined to rank
documents in the lectures later.
There are two suggested additional
readings if you have time.
The first is a paper where you can
find a detailed discussion and
comparison of multiple
state of the art models.
The second, is a book with a chapter
that gives a broad review of
different retrieval models.
[MUSIC]

[SOUND].
This lecture is about the vector
space retrieval model.
We're going to give
an introduction to its basic idea.
In the last lecture we talked about
the different ways of designing
a retrieval model which would give
us a different ranking function.
In this lecture, we're going to
talk about a specific way of
designing the ranking function called
the vector space retrieval model.
And we're going to give a brief
introduction to the basic idea.
Vector space model is a special case of
similarity based models
as we discussed before.
Which means,
we assume relevance is roughly
similarity between a document and a query.
Now whether this assumption is true,
is actually a question.
But in order to solve our
search problem we have to
convert the vague notion of
relevance into a more precise
definition that can be implemented
in a programming language.
So in this process we have to
make a number of assumptions.
This is the first assumption
that we make here.
Basically we assume that
if a document is more
similar to a query than another document,
then the first document would be assumed
to be more relevant than the second one.
And this is the basis for
ranking documents in this approach.
Again, it's questionable whether this is
really the best definition for relevance.
As we will see later there
are other ways to model relevance.
The first idea of vector space retrieval
model is actually very easy to understand.
Imagine a high dimensional space, where
each dimension corresponds to a term.
So, here, I show a three
dimensional space with three words,
programming, library, and presidential.
So each term, here, defines one dimension.
Now we can consider vectors in
this three dimensional space.
And we're going to assume
all our documents and
the query will be placed
in this vector space.
So, for example, one document that might
be represented by this vector, d1.
Now this means this document probably
covers library and presidential.
But it doesn't really
talk about programming.
All right, what does this mean in
terms of representation of the document?
That just means,
we're going to look at our document
from the perspective of this vector.
We're going to ignore everything else.
Basically what we see here is
only the vector of the document.
Of course the document
has other information.
For example,
the order of words is simply ignored, and
that's because we
assume the bag of words representation.
So with this representation,
you can already see that d1
seems to suggest a topic like
presidential library.
Now this is different
from another document.
Which might be represented as
a different vector, d2 here.
Now in this case, the document
covers programming and library, but
it doesn't talk about presidential.
So what does this remind you of?
Well, you can probably guess the topic
is likely about a programming language, and
the library is a software library.
So this shows that by using this
vector space representation,
we can actually capture the differences
between topics of documents.
Now you can also imagine
there are other vectors.
For example,
d3 is pointing in that direction, that
might be about presidential programming.
And in fact we're going to place all
the documents in this vector space.
And they will be pointing
to all kinds of directions.
And similarly, we're going to place
our query also in this space,
as another vector.
And then we're going to measure the
similarity between the query vector and
every document vector.
So, in this case for example, we can
easily see d2 seems to be the closest
to this query vector, and
therefore d2 will be ranked above others.
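As a rough illustration of this picture, here is a minimal Python sketch. The coordinates of d1, d2, d3 and of the query are made up for illustration (the values on the slide are not in the transcript), and cosine similarity is only one assumed way of measuring closeness; the lecture has not fixed a similarity measure at this point.

```python
import math

# Dimensions, in order: programming, library, presidential (as in the lecture).
# All coordinates below are made-up illustrative values.
docs = {
    "d1": (0.0, 1.0, 1.0),  # covers library and presidential
    "d2": (1.0, 1.0, 0.0),  # covers programming and library
    "d3": (1.0, 0.0, 1.0),  # covers programming and presidential
}
query = (1.0, 0.8, 0.0)  # hypothetical query about programming libraries


def cosine(u, v):
    """Cosine of the angle between two vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Rank documents by decreasing similarity to the query.
ranking = sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)
print(ranking)  # ['d2', 'd3', 'd1']
```

With these assumed coordinates, d2 points in nearly the same direction as the query, so it comes out on top.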
So this is basically the main
idea of the vector space model.
To be more precise, the
vector space model is a framework.
In this framework,
we make the following assumptions.
First, we represent a document and
query by a term vector.
So here a term can be any basic concept.
For example, a word or a phrase,
or even an n-gram of characters,
that is, a sequence of
characters inside a word.
characters inside a word.
Each term is assumed to
define one dimension.
Therefore, the N terms
in our vocabulary
define an N-dimensional space.
A query vector would consist
of a number of elements
corresponding to the weights
of different terms.
Each document vector is similar.
It has a number of elements, and
the value of each element
indicates the weight
of the corresponding term.
Here you can see
there are N dimensions.
Therefore, there are N elements,
each corresponding to the weight
on a particular term.
So the relevance in this case would
be assumed to be the similarity
between the two vectors.
Therefore, our ranking function is
also defined as the similarity between
the query vector and document vector.
Now, if I ask you to write a program
to implement this approach
in a search engine,
you would realize that this
is far from clear, right?
We haven't said a lot of things in detail,
therefore it's impossible to actually
write the program to implement this.
That's why I said this is a framework.
And this has to be refined
in order to actually
suggest a particular function,
that you can implement on the computer.
So, what does this framework not say?
Well, it actually hasn't said many things
that would be required in order
to implement this function.
First, it did not say how we should define
or select the basic concepts exactly.
We clearly assume
the concepts are orthogonal,
otherwise there will be redundancy.
For example, if two synonyms are somehow
distinguished as two different concepts.
Then they would be defined
in two different dimensions.
And then that would clearly
cause redundancy here,
or overemphasize
matching this concept,
because it would be as if you
matched two dimensions
when you actually matched
one semantic concept.
Secondly, it did not say exactly how we
should place documents and
the query in this space.
Basically I showed you some examples
of query and document vectors.
But where exactly should the vector for
a particular document point to?
So this is equivalent
to how to define the term weights.
How do you compute the element
values in those vectors?
This is a very important
question because term weight
in the query vector indicates
the importance of the term.
So depending on how you assign the weight,
you might prefer some terms
to be matched over others.
Similarly, term weight in
the document is also very meaningful.
It indicates how well the term
characterizes the document.
If you got it wrong, then you clearly
don't represent this document accurately.
Finally, how we define the similarity
measure is also not clear.
So these questions must be addressed
before we can have an operational
function that we can actually
implement using a programming language.
How to solve these problems
is the main topic of the next lecture.
[MUSIC]

[SOUND].
In this lecture, we're going to talk about
how to instantiate a vector space model,
so that we can get a very
specific ranking function.
So this is to continue
the discussion of the vector space model,
which is one particular approach
to designing a ranking function.
And we are going to talk about how
we use the general framework of
the vector space model
as guidance to instantiate the framework
and derive a specific ranking function.
And we're going to cover the simplest
instantiation of the framework.
As we discussed in the previous lecture,
the vector space model is really a framework.
It doesn't say many things.
So, for
example, here it shows that it did not say
how we should define the dimensions.
It also did not say how we place
a document vector in this space.
It did not say how we place a query
vector in this vector space.
And finally, it did not say how
we should measure the similarity
between the query vector and
the document vector.
So, you can imagine,
in order to implement this model,
we have to say specifically
how we are computing these vectors.
What is exactly xi and what is exactly yi?
This will determine where we
place the document vector.
Where we place a query vector.
And of course, we also need to say exactly
what will be the similarity function.
So if we can provide a definition
of the concepts that would
define the dimensions, and
of these xi's and yi's,
that is, the weights of terms for
the query and document,
then we will be able to
place document vectors and
the query vector in this well-defined space.
And then,
if we also specify similarity function,
then we'll have a well-defined
ranking function.
So let's see how we can do that,
and think about
the simplest instantiation.
Actually, I would suggest that you
pause the lecture at this point and
spend a couple of minutes thinking about it.
Suppose you are asked
to implement this idea.
You've come up with the idea
of vector space model.
But you still haven't figured out
how to compute this vector exactly,
how to define this similarity function.
What would you do?
So think for a couple of minutes and
then, proceed.
So let's think about the simplest ways
of instantiating this vector space model.
First, how do we define a dimension?
Well, the obvious choice is to use
each word in our vocabulary
to define a dimension.
And here we assume that there
are n words in our vocabulary,
therefore there are n dimensions.
Each word defines one dimension.
And this is basically the Bag
of Words Instantiation.
Now let's look at how we
place vectors in this space.
Again here, the simplest strategy is to
use a bit vector to represent
both a query and a document.
And that means each element xi and
yi would be taking a value
of either zero or one.
When it's one,
it means the corresponding word is
present in the document or in the query.
When it's zero,
it's going to mean that it's absent.
So you can imagine if the user
types in a few words in the query,
then the query vector
will only have a few ones and many, many zeros.
The document vector in general
will have more ones, of course,
but it will also have many zeros.
Since the vocabulary
is generally very large,
many words don't really
occur in a document.
Many words will only occasionally
occur in the document.
A lot of words will be absent
in a particular document.
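The bit vector representation just described can be sketched in a few lines of Python. The six-word vocabulary and the document text here are hypothetical, chosen only to keep the vectors short; a real vocabulary would have many more dimensions, almost all of them zero for any given text.

```python
def bit_vector(text, vocabulary):
    """Return a 0/1 vector: 1 if the vocabulary word occurs in the text, else 0."""
    words = set(text.lower().split())
    return [1 if term in words else 0 for term in vocabulary]


# Hypothetical toy vocabulary; each word defines one dimension.
vocabulary = ["news", "about", "presidential", "campaign", "food", "organic"]

query_vec = bit_vector("news about presidential campaign", vocabulary)
doc_vec = bit_vector("news about organic food campaign", vocabulary)

print(query_vec)  # [1, 1, 1, 1, 0, 0]
print(doc_vec)    # [1, 1, 0, 1, 1, 1]
```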
So, now we have placed the documents and
the query in the vector space.
Let's look at how we
measure the similarity.
So, a commonly used similarity
measure here is Dot Product.
The dot product of two
vectors is simply defined as
the sum of the products of the
corresponding elements of the two vectors.
So here we see that it's
the product of x1 and y1,
and then x2 multiplied by y2,
and then finally xn multiplied by yn.
And then we take a sum here.
So that's the dot product.
Now we can represent this in a more
general way, using a sum here.
So this is only one of the many different
ways of measuring the similarity.
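The dot product described here is short enough to write out directly. This is a minimal sketch; the two example vectors are made up for illustration.

```python
def dot_product(x, y):
    """Sum of the products of corresponding elements of two vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of dimensions")
    return sum(xi * yi for xi, yi in zip(x, y))


x = [1, 1, 1, 1, 0, 0]  # hypothetical query bit vector
y = [1, 0, 1, 1, 0, 1]  # hypothetical document bit vector

print(dot_product(x, y))  # 3
```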
So now we see that we have defined
the dimensions.
We have defined the vectors.
And we have also defined
the similarity function.
So now we finally have
the Simplest Vector Space Model.
Which is based on the bit vector
representation, dot product similarity,
and bag of words instantiation.
And the formula looks like this.
So this is our formula.
And that's actually a particular retrieval
function, a ranking function, all right?
Now, we can finally implement this
function using a programming language and
then rank documents for a query.
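The formula on the slide is not captured in the transcript. From the description above, it can be written as follows, assuming the query vector is q = (x_1, ..., x_N) and the document vector is d = (y_1, ..., y_N):

```latex
f(q, d) = q \cdot d = x_1 y_1 + x_2 y_2 + \dots + x_N y_N = \sum_{i=1}^{N} x_i y_i,
\qquad x_i, y_i \in \{0, 1\}
```

Here x_i is 1 if word i is present in the query and y_i is 1 if word i is present in the document.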
Now at this point you should
again pause the lecture
to think about how we can
interpret this score.
So we have gone through the process
of modeling the retrieval problem
using a vector space model.
And then, we make assumptions.
About how we place vectors in the vector
space and how we define the similarity.
So in the end we've got a specific
retrieval function shown here.
Now the next step is to think about
whether this retrieval function
actually makes sense.
Can we expect this function
to actually perform well
when we use it to rank documents
for user queries?
So, it's worth thinking about, what is
this value that we are calculating?
So in the end, we've got a number,
but what does this number mean?
Is it meaningful?
So spend a couple minutes
to think about that.
And of course,
the general question here is do you
believe this is a good ranking function?
Would it actually work well?
So again,
think about how to interpret this value.
Is it actually meaningful?
Does it mean something
related to how well the
document matches the query?
So in order to assess
whether this simplest
vector space model actually works well,
let's look at the example.
So here I show some sample documents and
a simple query.
The query is news about
the presidential campaign.
And we have five documents here.
They cover different terms in the query.
And if you look at
these documents for a moment,
you may realize that
some documents are probably relevant,
and some others are probably not relevant.
Now if I ask you to rank these documents,
how would you rank them?
This is basically our ideal ranking,
where humans can examine the documents and
then try to rank them.
So think for a moment and
take a look at this slide,
perhaps by pausing the lecture.
So I think most of you
would agree that d4,
and d3, are probably better than others.
Because they really cover the query well.
They match news,
presidential, and campaign.
So, it looks like that these two documents
are probably better than the others.
They should be ranked on top.
And the other three, d1, d2, and
d5, are really non-relevant.
So we can also say d4 and
d3 are relevant documents, and d1, d2, and
d5 are non-relevant.
So, now let's see if our vector
space model could do the same or
could do something close.
So let's first think about how we actually
use this model to score documents.
Right here I show two documents, d1 and
d3, and we have the query also here.
In the vector space model, of course we
want to first compute the vectors for
these documents and the query.
Now I show the vocabulary
here as well, so
these are the n dimensions
that we'll be thinking about.
So what do you think is the vector
representation for the query?
Note that we are assuming
that we only use zero and one
to indicate whether a term is absent or
present in the query or in the document.
So these are zero, one bit vectors.
So what do you think is the query vector?
Well the query has four words here.
So for these four words, there would be a
one and for the rest, there will be zeros.
Now what about the documents?
It's the same.
So d1 has two words, news and about.
So there are two ones here and
the rest are zeros.
So
now that we have the two vectors,
let's compute the similarity.
And we're going to use dot product.
So you can see when we use dot product we
just multiply the corresponding elements.
Right.
So
these two would
form a product.
And these two will
generate another product.
And these two would generate yet
another product.
And so on and so forth.
Now, you can
easily see that if we do that,
we actually don't have to
care about these zeros,
because whenever we have a zero,
the product will be zero.
So, when we take a sum
over all these pairs,
then the zero entries will be gone.
As long as you have one zero,
then the product would be zero.
So in effect, we're just counting
how many pairs of one and one there are, right?
In this case, we have seen two,
so the result will be two.
So, what does that mean?
Well, that means this number, or
the value of this scoring function,
is simply the count of how many unique
query terms are matched in the document.
Because
if a term is matched in the document,
then there will be two ones.
If it's not, then there will
be a zero on the document side.
Similarly, if the document has a term
but the term is not in the query, there
will be a zero in the query vector.
So those don't count.
So as a result this
scoring function basically
measures how many unique query
terms are matched in a document.
This is how we interpret this score.
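This interpretation can be checked with a small Python sketch that computes the score both ways: as a dot product of bit vectors and as the size of a set intersection. The document texts are toy stand-ins, written only to match what the lecture says about d1 (it matches news and about); the slide's actual text is not in the transcript.

```python
def bit_vector_score(query, document):
    """Dot product of 0/1 vectors for the query and the document."""
    q_words, d_words = set(query.split()), set(document.split())
    # Words outside both texts would contribute 0 * 0, so this union suffices.
    vocabulary = sorted(q_words | d_words)
    x = [1 if w in q_words else 0 for w in vocabulary]
    y = [1 if w in d_words else 0 for w in vocabulary]
    return sum(xi * yi for xi, yi in zip(x, y))


def matched_unique_terms(query, document):
    """Number of distinct query words that also occur in the document."""
    return len(set(query.split()) & set(document.split()))


query = "news about presidential campaign"
d1 = "news about"                      # toy text: matches news, about
d2 = "news about organic food campaign"  # toy text: matches news, about, campaign

print(bit_vector_score(query, d1), matched_unique_terms(query, d1))  # 2 2
print(bit_vector_score(query, d2), matched_unique_terms(query, d2))  # 3 3
```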
Now we can also take a look at the d3.
In this case,
you can see the result is three.
Because d3 matched three distinct
query words, news, presidential, campaign.
Whereas d1 only matched two.
Now in this case, it seems
reasonable to rank d3 on top of d1.
And this simplest vector
space model indeed does that.
So that looks pretty good.
However, if we examine this model in
detail, we likely will find some problems.
So here I'm going to show all
the scores for these five documents.
And you can even verify they are correct.
Because we're basically counting
the number of unique query
terms matched in each document.
Now note that this method
actually makes sense.
Right?
It basically means if a document matches
more unique query terms, then the document
will be assumed to be more relevant.
And that seems to make sense.
The only problem is that here we can note
there are three documents, d2, d3, and d4,
and they are tied with a score of three.
So that's a problem, because if you
look at them carefully it seems that
d4 should be ranked above d3,
because d3 only mentioned
presidential once,
but d4 mentioned it more times.
In the case of d3,
presidential could be an incidental mention.
But d4 is clearly about
the presidential campaign.
Another problem is that d2 and
d3 also have the same score.
But if you look at
the three words that are matched,
in the case of d2, it matched news,
about, and campaign.
But in the case of d3, it matched news,
presidential, and campaign.
So intuitively, d3 is better,
because matching presidential is more
important than matching about,
even though about and
presidential are both in the query.
So intuitively,
we would like d3 to be ranked above d2.
But this model doesn't do that.
So that means this is still not good
enough. We have to solve these problems.
To summarize,
in this lecture we talked about how
to instantiate a vector space model.
We may need to do three things.
One is to define the dimension.
The second is to
decide how to place documents
as vectors in the vector space.
And to also place a query in
the vector space as a vector.
And third is to define
the similarity between two vectors,
particularly the query vector and
the document vector.
We also talked about a very simple way
to instantiate the vector space model.
Indeed, that's probably the simplest
vector space model that we can derive.
In this case,
we use each word to define a dimension.
We use a zero one bit vector to
represent a document or a query.
In this case, we basically only care
about word presence or absence.
We ignore the frequency.
And we use the dot product
as the similarity function.
And with such an instantiation,
we showed that the scoring
function is basically to score
a document based on the number of distinct
query words matched in the document.
We also showed that such a simple vector
space model still doesn't work well,
and we need to improve it.
And this is the topic that we're
going to cover in the next lecture.
[MUSIC]

[SOUND].
In this lecture, we're going to talk about
how to improve the instantiation of
the Vector Space Model.
This is a continued discussion
of the Vector Space Model.
We're going to focus on how to improve
the instantiation of this model.
In a previous lecture,
you have seen that with simple
instantiations of the Vector Space Model,
we can come up with
a simple scoring function that
would give us, basically,
a count of how many unique query
terms are matched in the document.
We also have seen that this function
has a problem as shown on this slide.
In particular,
if you look at these three documents,
they will all get the same score because
they match the three unique query words.
But intuitively we would like,
d4 to be ranked above d3.
And d2 is really non relevant.
So the problem here is that
this function couldn't capture
the following characteristics.
First, we would like to
give more credit to d4
because it matches presidential
more times than d3.
Second, intuitively matching
presidential should be more important
than matching about, because about is
a very common word that occurs everywhere.
It doesn't really carry that much content.
So, in this lecture,
let's see how we can improve the model
to solve these two problems.
It's worth thinking at this point about
why we have these two problems.
If we look back at
the assumptions we have made
while instantiating the Vector
Space Model, we will realize that
the problem is really coming
from some of the assumptions.
In particular, it has to do with how we
place the vectors in the vector space.
So then, naturally,
in order to fix these problems,
we have to revisit those assumptions.
Perhaps we will have
to use different ways to
instantiate the Vector Space Model.
In particular, we have to place
the vectors in a different way.
So, let's see how we can improve this.
Well, a natural thought is that in order
to consider multiple occurrences of a term
in a document,
we should consider the term frequency
instead of just the absence or presence.
In order to consider the difference
between a document where a query
term occurred multiple times and one
where the query term occurred just once,
we have to consider the term frequency,
the count of a term in the document.
In the simplest model, we only model
the presence and absence of a term.
We ignore the actual number of times
that a term occurs in a document.
So let's add this back.
So we're going to represent
a document by a vector with
term frequencies as elements.
So, that is to say, now
the elements of both the query vector and
the document vector will not be zeros and ones,
but
instead they will be the counts of
a word in the query or the document.
So this would bring additional
information about the document.
So this can be seen as a more accurate
representation of our documents.
So, now let's see what the formula
would look like if we change
this representation.
So as you see on this slide,
we still use the dot product, and
so the formula looks
very similar in form.
In fact, it looks identical, but
inside the sum, of course, xi and
yi are now different.
They're now the counts of word
i in the query and the document.
Now at this point, I also suggest you
pause the lecture for a moment and
just think about how we should
interpret the score of this new function.
It's doing something very similar
to what the simplest VSM is doing.
But because of the change of the vector,
now the new score has
a different interpretation.
Can you see the difference?
And it has to do with
the consideration of multiple
occurrences of the same
term in the document.
More importantly, we'd like to know
whether this would fix the problem of
the simplest vector space model.
So, let's look at this example again.
So suppose, we change the vector
to term frequency vectors.
Now, let's look at these
three documents again.
The query vector is the same because
all these words occurred exactly once
in the query.
So the vector is still a 0-1 vector.
And in fact,
d2 is also essentially represented
the same way because none of these
words has been repeated.
As a result, the score is also the same,
still three.
The same is true for d3, and
we still have a 3.
But d4 would be different, because now
presidential occurred twice here.
So the element for presidential in
the document vector would be 2 instead of 1.
As a result, now the score for
d4 is higher.
It's a four now.
So this means, by using term frequency,
we can now rank d4 above d2 and
d3 as we hoped to.
So this solves the problem with d4.
But we can also see that d2 and
d3 are still treated in the same way.
They still have identical scores,
so it did not fix the problem here.
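The term frequency scores just discussed can be reproduced with a short Python sketch. The document texts are toy stand-ins that only preserve what the lecture states (d2 matches news, about, and campaign; d3 matches news, presidential, and campaign; d4 mentions presidential twice); the slide's exact wording is an assumption.

```python
from collections import Counter


def tf_score(query, document):
    """Dot product of term-count vectors for the query and the document."""
    q_counts = Counter(query.split())
    d_counts = Counter(document.split())
    # A Counter returns 0 for words that are absent, so unmatched terms drop out.
    return sum(q_counts[w] * d_counts[w] for w in q_counts)


query = "news about presidential campaign"
docs = {  # toy texts, assumed for illustration
    "d2": "news about organic food campaign",
    "d3": "news of presidential campaign",
    "d4": "news of presidential campaign presidential candidate",
}

scores = {name: tf_score(query, text) for name, text in docs.items()}
print(scores)  # {'d2': 3, 'd3': 3, 'd4': 4}
```

The tie between d2 and d3 remains, which is the second problem the lecture turns to next.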
So, how can we fix this problem?
We would like to give more credit for
matching presidential than matching about.
But how can we solve
the problem in a general way?
Is there any way to determine which word
should be treated as more important and
which word can be basically ignored?
About is such a word.
It does not really
carry that much content,
so we can essentially ignore it.
We sometimes call such a word
a stop word.
Those are generally very frequent and
they occur everywhere, so
matching them doesn't really mean anything.
But computationally, how can we capture that?
So again, I encourage you to
think a little bit about this.
Can you come up with any
statistical approaches to somehow
distinguish presidential from about.
If you think about it for
a moment, you'll realize that
one difference is that a word
like about occurs everywhere.
So if you count the occurrences of the word
in the whole collection, then we
would see that about has a much higher
frequency than presidential, which tends
to occur only in some documents.
So this idea suggests
that we could somehow
use the global statistics of terms or
some other information to try to
down-weight the element for
about in the vector representation of d2.
At the same time,
we hope to somehow increase the weight
of presidential in the vector of d3.
If we can do that, then
we can expect that d2 will get
an overall score of less than three,
while d3 will get a score above three.
Then, we'll be able to
rank d3 on top of d2.
So how can we do this systematically?
Again, we can rely on some
statistical counts.
And in this case, the particular idea is
called the Inverse Document Frequency.
We have seen document frequency
as one signal used in
modern retrieval functions.
We discussed this in a previous lecture.
So here's the specific way of using it.
Document frequency is the count of
documents that contain a particular term.
Here, we say inverse document frequency
because we actually want to reward a word
that doesn't occur in many documents.
And so, the way to incorporate this
into our vector representation is
then to modify the frequency
count by multiplying
it by the IDF of the corresponding
word, as shown here.
If we do that,
then we can penalize common
words, which generally have a low IDF,
and
reward rare words,
which will have a higher IDF.
So more specifically, the IDF
can be defined as the logarithm
of (M + 1) divided by k,
where M is the total number of
documents in the collection, and k is the DF, or
document frequency,
the total number of documents
containing the word W.
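As a minimal sketch of this definition in Python: the function below computes log((M + 1) / k). The natural logarithm is an assumption, since the lecture does not state the base, and the collection size and document frequencies in the example are hypothetical.

```python
import math


def idf(doc_freq, num_docs):
    """IDF(W) = log((M + 1) / k), with M = num_docs and k = doc_freq."""
    if not 1 <= doc_freq <= num_docs:
        raise ValueError("doc_freq must be between 1 and num_docs")
    return math.log((num_docs + 1) / doc_freq)  # natural log (assumed base)


M = 1000  # hypothetical collection size

common = idf(900, M)  # a word found in 900 of 1000 documents: roughly 0.11
rare = idf(50, M)     # a word found in only 50 documents: roughly 3.0
print(common, rare)
```

The rarer word gets the larger value, which is the behavior the lecture asks for.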
Now, if you plot this
function by varying k,
then you will see a curve
that looks like this.
In general, you can see it
would give a higher value for
a low DF word, a rare word.
You can also see the maximum value
of this function is log of M plus 1.
It would be interesting for you to think about
what the minimum value for this function is.
This could be an interesting exercise.
Now, the specific function
may not be as important as
the heuristic to simply
penalize popular terms.
But it turns out this particular
function form has also worked very well.
Now, whether there is a better
form of function here
is an open research question.
But it's also clear that if we use
a linear penalization like what's
shown here with this line, then it may
not be as reasonable as the standard IDF.
In particular, you can see
the difference in the standard IDF,
and we,
somehow have a [INAUDIBLE] point here.
After this point, we're going to say these
terms are essentially not very useful.
They can be essentially ignored.
And this makes sense when the term
occurs so frequently.
Let's say a term occurs in more
than 50% of the documents.
Then the term is unlikely to be very important;
it's basically a common term.
It's not very important to match this
word. So with the standard IDF, you can
see it's basically assumed that they all
have low weights; there's no difference.
But if you look at the linear
penalization, at this point
there's still some difference.
So intuitively, we want to focus more
on the discrimination of low DF words,
rather than these common words.
Well, of course, which one works better
still has to be validated
empirically using a data set.
And we have to use users to
judge which results are better.
So now let's see how this
can solve problem two.
So now,
let's look at the two documents again.
Without IDF weighting, before,
we just had term frequency vectors,
but with IDF weighting we
can now adjust the TF weight
by multiplying it by the IDF value.
For example, here you can see
the adjustment. In particular, for
about, there is an adjustment
by using the IDF value of about,
which is smaller than the IDF
value of presidential.
So if you look at these,
the IDF will distinguish these two words.
As a result, the adjustment here would be
larger, and would make this weight larger.
So if we score with these new vectors,
what would happen is that, of course,
they share the same weights for news and
campaign, but the matching of about and
presidential will discriminate them.
So now as a result of IDF weighting,
we will have d3 ranked above d2,
because it matched a rare word,
whereas d2 matched a common word.
So this shows that IDF
weighting can solve problem two.
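To make the adjustment concrete, here is a rough Python sketch that builds TF-IDF weighted document vectors for d2 and d3. The collection size, the document frequencies, and the document texts are all hypothetical, and the natural logarithm is assumed; only the query dimensions are kept, since every other dimension is zero in the query vector and cannot affect the dot product.

```python
import math
from collections import Counter

M = 1000  # hypothetical number of documents in the collection
# Hypothetical document frequencies: "about" is common, "presidential" is rare.
doc_freq = {"news": 200, "about": 900, "presidential": 50, "campaign": 100}


def idf(word):
    return math.log((M + 1) / doc_freq[word])


def tfidf_vector(text, terms):
    """Element i is the count of terms[i] in the text times its IDF."""
    counts = Counter(text.split())
    return [counts[t] * idf(t) for t in terms]


def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))


terms = ["news", "about", "presidential", "campaign"]
query_vec = [1, 1, 1, 1]  # each query word occurs once in the query
d2_vec = tfidf_vector("news about organic food campaign", terms)  # toy text
d3_vec = tfidf_vector("news of presidential campaign", terms)     # toy text

score_d2 = dot(query_vec, d2_vec)
score_d3 = dot(query_vec, d3_vec)
print(score_d2 < score_d3)  # True
```

The two vectors agree on news and campaign; the difference in score comes entirely from the small weight on about versus the large weight on presidential.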
So, how effective is this model in
general when we use TF-IDF weighting?
Well, let's look at all the
documents that we have seen before.
Here, we show all five documents, and
these are their new scores.
Now, we can see the scores for
the first four
documents here seem to
be quite reasonable.
They are as we expected.
However, we also see a new problem.
Because now d5 here,
which did not have a very high
score with our simplest
vector space model,
now actually has a very high score.
In fact, it has the highest score here.
So, this creates a new problem.
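The effect can be reproduced with a small Python sketch. The IDF values and the five document texts below are illustrative assumptions, not figures read from the transcript; the texts only preserve the match patterns the lecture describes, including a document d5 that repeats campaign four times.

```python
from collections import Counter

# Illustrative IDF values (assumed): "about" is common, "campaign" is rare.
idf = {"news": 1.5, "about": 1.0, "presidential": 2.5, "campaign": 3.1}

query = "news about presidential campaign"
docs = {  # toy texts, assumed for illustration
    "d1": "news about",
    "d2": "news about organic food campaign",
    "d3": "news of presidential campaign",
    "d4": "news of presidential campaign presidential candidate",
    "d5": "news of organic food campaign campaign campaign campaign",
}


def tfidf_score(query, document):
    """Sum over query words of count in query * count in document * IDF."""
    q_counts = Counter(query.split())
    d_counts = Counter(document.split())
    return sum(q_counts[w] * d_counts[w] * idf.get(w, 0.0) for w in q_counts)


scores = {name: tfidf_score(query, text) for name, text in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['d5', 'd4', 'd3', 'd2', 'd1']
```

Under these assumptions, d5 overtakes d4 purely because the raw count of campaign is multiplied straight into the score.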
This is actually a common phenomenon
in designing retrieval functions.
Basically, when you try
to fix one problem,
you tend to introduce other problems.
And that's why it's very tricky
to design an effective ranking function.
And what the best ranking
function is remains an open research question.
Researchers are still working on that.
But in the next few lectures, we're
also going to talk about some additional
ideas to further improve this model and
try to fix this problem.
So to summarize this lecture,
we've talked about how to
improve this vector space model.
And we've improved the instantiation
of the vector space model based on
TF-IDF weighting.
So the improvement is mostly
in the placement of the vector,
where we give higher weight to a term
that occurred many times in the document,
but infrequently in the whole collection.
And we have seen that this improved
model indeed works better than
the simplest vector space model, but
it also still has some problems.
In the next lecture,
we're going to look at how to
address these additional problems.
[MUSIC]

[SOUND]
In this lecture, we continue
the discussion of Vector Space Model.
In particular, we are going to
talk about the TF transformation.
In the previous lecture,
we have derived a TF-IDF weighting
formula using the vector space model.
And we have shown that this model
actually works pretty well for
these examples as shown on
this slide except for d5,
which has received a very high score.
Indeed, it has received the highest
score among all these documents.
But this document is intuitively
non-relevant, so this is not desirable.
In this lecture, we're going to
talk about how to use TF
transformation to solve this problem.
Before we discuss the details,
let's take a look at the formula for
this simple
TF-IDF weighting ranking function and
see why this document has
received such a high score.
So this is the formula, and
if you look at the formula carefully,
then you will see it involves a sum
over all the matched query terms.
And inside the sum, each matched
query term has a particular weight.
And this weight is the TF-IDF weight.
So it has an IDF component
where we see two variables.
One is the total number of documents
in the collection, and that is M.
The other is the document frequency.
This is the number of documents
that contain this word w.
The other variables
involved in the formula
include the count of the query term
w in the query, and
the count of the word in the document.
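The formula on the slide is not in the transcript, but it can be reconstructed from this description and the IDF definition given in the previous lecture. The notation c(w, q), c(w, d), and df(w) for the three counts is assumed here:

```latex
f(q, d) = \sum_{w \in q \cap d} c(w, q) \, c(w, d) \, \log \frac{M + 1}{df(w)}
```

Here c(w, q) is the count of word w in the query, c(w, d) is its count in the document, M is the total number of documents in the collection, and df(w) is the number of documents that contain w.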
If you look at this document again,
now it's not hard to
realize that the reason why it has
received a high score is because
it has a very high count of campaign.
So the count of campaign in this document
is four, which is much higher than in
the other documents, and has contributed
to the high score of this document.
So intuitively, in order to lower
the score for this document, we need
to somehow restrict the contribution of
the matching of this term in the document.
And if you think about the matching of
terms in the document carefully, you
actually would realize we
probably shouldn't reward
multiple occurrences so generously.
And by that I mean the first occurrence
of a term says a lot about
the matching of this term,
because it goes from zero count
to a count of one, and
that increase means a lot.
Once we see a word in the document,
it's very likely that the document
is talking about this word.
If we see an extra occurrence
on top of the first occurrence,
that is to go from one to two,
then we can also say that, well, the second
occurrence kind of confirmed that it's
not an accidental mention of the word.
Now, we are more sure that this
document is talking about this word.
But imagine we have seen, let's say,
50 occurrences of the word in the document.
Then, adding one extra occurrence
is not going to tell us much more as
evidence because we are already sure
that this document is about this word.
So if you think
this way, it seems that
we should restrict the contribution
of a high count of a term.
And that is the idea of TF Transformation.
So this transformation function is
going to turn the raw count of word
into a Term Frequency Weight,
for the word in the document.
So here I show on the x-axis the raw count,
and on the y-axis I show
the Term Frequency Weight.
So, in the previous ranking functions
we actually have implicitly
used some kind of transformation.
So for example in the zero-one bit
vector representation we actually use
such a transformation
function as shown here.
Basically if the count is
zero then it has zero weight.
Otherwise it would have a weight of one.
It's flat.
Now what about using
Term Count as a TF weight.
Well that's a linear function, right?
So it has just exactly
the same weight as the count.
Now we have just seen that
this is not desirable.
So what we want is something like this.
So for example with a logarithm function,
we can have a sub-linear
transformation that looks like this.
And this will control the influence of
a really high count because it's going to
lower its influence, yet it will
retain the influence of a small count.
Or we might want to even bend the curve
more by applying logarithm twice.
Now people have tried all these methods
and they are indeed working better than
the linear form of the transformation,
but so far what works the best
seems to be this special transformation
called a BM25 transformation.
BM stands for best matching.
Now in this transformation,
you can see there's a parameter k here.
And this k controls the upper
bound of this function.
It's easy to see this function has
an upper bound because if you look at
x divided by x plus k, where
k is a non-negative number,
then the numerator will never be able
to exceed the denominator, right?
So, the function is upper bounded by k plus 1.
Now, this is also a difference between
this transformation function and
the logarithm transformation,
which doesn't have an upper bound.
Now furthermore, one interesting property
of this function is that as we vary K,
we can actually simulate different
transformation functions,
including the two extremes
that are shown here.
That is the zero-one bit transformation
and the linear transformation.
So for example, if we set k to zero,
now you can see
the function value would be one.
So we precisely
recover the zero-one bit transformation.
If you set k to a very large number,
on the other hand,
it's going to look more
like the linear transformation function.
So in this sense,
this transformation is very flexible,
it allows us to control
the shape of the transformation.
It also has a nice property
of the upper bound.
And this upper bound is useful to control
the influence of a particular term.
And so we can prevent a spammer
from just increasing the count of
one term to spam all queries
that might match this term.
In other words, this upper bound
also helps ensure that all terms
will be counted when we aggregate
the weights to compute a score.
As I said, this transformation
function has worked well, so far.
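To make the comparison between these transformation functions concrete, here is a minimal sketch in Python. The function names are made up for illustration, and the BM25 form (k + 1) * x / (x + k) is assumed from the upper bound of k plus 1 described above. The value k = 1.2 is only an example setting.

```python
import math

def tf_binary(count):
    """0/1 bit transformation: any non-zero count gets weight 1."""
    return 1.0 if count > 0 else 0.0

def tf_linear(count):
    """Raw count used directly as the weight."""
    return float(count)

def tf_log(count):
    """Sublinear transformation log(1 + x); it has no upper bound."""
    return math.log(1 + count)

def tf_bm25(count, k):
    """Assumed BM25 form (k + 1) * x / (x + k), bounded above by k + 1."""
    if count == 0:
        return 0.0  # also avoids 0 / 0 when k == 0
    return (k + 1) * count / (count + k)

if __name__ == "__main__":
    # Compare the four weights for a few raw counts.
    for count in (0, 1, 2, 4, 50):
        print(count, tf_binary(count), tf_linear(count),
              round(tf_log(count), 3), round(tf_bm25(count, k=1.2), 3))
```

Setting k to 0 gives back the 0/1 bit transformation, and a very large k makes the curve look nearly linear, which matches the two extremes mentioned above.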
So to summarize this lecture,
the main point is that we need to do
a sublinear TF transformation.
And this is needed to capture
the intuition of diminishing return from
high Term Counts.
It's also to avoid a dominance by
one single term over all others.
This BM25 transformation
that we talked about is very interesting.
It's so far one of the best performing
TF transformation formulas.
It has an upper bound, and
it's also robust and effective.
Now, if we plug this
function into our TF-IDF weighting
vector space model then we would
end up having the following
ranking function,
which has a BM25 TF component.
Now this is already very close to a
state-of-the-art ranking function called BM25.
And we will discuss how we can further
improve this formula in the next lecture.
[MUSIC]

This lecture is about document length
normalization in the vector space model.
In this lecture we are going to continue
the discussion of the vector space model.
In particular, we are going to discuss
the issue of document
length normalization.
So far in the lectures about
the vector space model,
we have used various
signals from the document to
assess the matching of the document
with a query.
In particular we have
considered the term frequency,
the count of a term in a document.
We have also considered
global statistics such as
IDF, inverse document frequency.
But we have not considered
document length.
So, here I show two example documents.
D4 is much shorter with only 100 words.
D6 on the other hand has 5,000 words.
If you look at the matching of these
query words, we see that in D6 there
are more matchings of the query words, but
one might reason that D6 may
have matched these query words
in a scattered manner.
So maybe the topic of D6 is not
really about the topic of the query.
So the discussion of a campaign
at the beginning of the document
may have nothing to do with the mention
of presidential at the end.
In general,
if you think about the long documents,
they would have a higher
chance to match any query.
In fact, if you generate
a long document by randomly
sampling words from
the distribution of words,
then eventually you probably
will match any query.
So in this sense we should
penalize long documents because they
just naturally have better
chances to match any query.
And this is the idea of document length normalization.
We also need to be careful to avoid
over-penalizing long documents.
On the one hand,
we want to penalize a long document.
But on the other hand,
we also don't want to over-penalize them.
And the reason is that a document
may be long for different reasons.
In one case the document may be
long because it uses more words.
So for example think about
the full text of a research paper.
It would use more words than
the corresponding abstract.
So this is the case where we probably
should penalize the matching of
a long document such as a full paper.
When we compare the matching
of words in such a
long document with the matching of
the words in the short abstract,
the long paper generally has a higher
chance of matching query words.
Therefore we should penalize it.
However, there is another case
when the document is long and
that is when the document
simply has more content.
Now consider another
case of a long document,
where we simply concatenated a lot
of abstracts of different papers.
In such a case, obviously, we don't
want to over-penalize such a long document.
Indeed, we probably don't want to penalize
such a document just because it's long.
So that's why we need to be careful
about using the right
degree of penalization.
A method that has been working well,
based on recent research, is called
pivoted length normalization.
And in this case the idea is to use
the average document length as a pivot,
as a reference point.
That means we will assume that for
the average length documents,
the score is about right.
So, the normalizer would be 1.
But if a document is longer than
the average document length
then there will be some penalization.
Whereas if it's shorter, then
there's even some reward.
So this is illustrated
using this slide.
On the
x-axis you can see the length of the document.
On the y-axis we show the normalizer.
In this case, the pivoted length normalization
formula for the normalizer
is seen to be an interpolation of one and
the normalized document length,
controlled by a parameter b here.
So, you can see here,
we first divide the length of the
document by the average document length.
This not only gives us
some sense about
how this document compares with
the average document length, but
also gives us the benefit of not
worrying about the unit of
length; we can measure the length
by words or by characters.
Anyway this normalizer has
an interesting property.
First we see that if we set the parameter
b to 0 then the value would be 1,
so there's no
length normalization at all.
So b in this sense controls the length
normalization. Whereas if we set
b to a non-zero value, then
the normalizer will look like this, right.
So the value would be higher for
documents that are longer than
the average document length,
whereas the value of the normalizer
will be smaller for
shorter documents.
So in this sense we see there's
a penalization for long documents.
And there's a reward for short documents.
The degree of penalization
is controlled by b.
Because if we set b to a larger
value then the normalizer
would look like this.
There's even more penalization for
long documents and more reward for
the short documents.
By adjusting b which
varies from zero to one
we can control the degree
of length normalization.
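As a small illustration of the normalizer just described, here is a sketch in Python. It assumes the interpolation takes the form 1 - b + b * (document length / average document length), which is how the description above reads. The function and parameter names are hypothetical.

```python
def pivoted_normalizer(doc_len, avg_doc_len, b):
    """Interpolate between 1 and the normalized document length.

    b = 0 turns length normalization off; larger b penalizes long
    documents (and rewards short ones) more strongly.
    """
    if not 0.0 <= b <= 1.0:
        raise ValueError("b is expected to lie in [0, 1]")
    return 1.0 - b + b * (doc_len / avg_doc_len)

if __name__ == "__main__":
    avg_len = 1000  # assumed average length, for illustration only
    for doc_len in (100, 1000, 5000):
        print(doc_len, [round(pivoted_normalizer(doc_len, avg_len, b), 2)
                        for b in (0.0, 0.5, 1.0)])
```

A document of exactly average length always gets a normalizer of 1, which is the sense in which the average length acts as the pivot.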
So if we plug this length
normalization factor into
the vector space model ranking functions
that we have already examined,
then we will end up with the following formulas,
and
these are in fact the state-of-the-art
vector space model
formulas.
So
let's take a look at each of them.
The first one is called the pivoted length
normalization vector space model.
And a reference at the end has details
about the derivation of this model.
And, here, we see that it's basically
the TFIDF weighting model that we have
discussed.
The IDF component should be
very familiar now to you.
There is also a query term
frequency component, here.
And then in the middle there is
the normalized TF.
And in this case,
we see we use the double logarithm,
as we discussed before, and this is to
achieve a sublinear transformation.
But we also put a document length
normalizer in the bottom, all right, so
this would cause a penalty for
a long document, because the larger
the denominator is,
the smaller the TF weight is.
And this is of course controlled
by the parameter b here.
And you can see again, if b is set to 0,
then there is no length normalization.
Okay.
So this is one of the two most effective
vector space model formulas.
The next one, called BM25,
or Okapi, is also similar
in that it also has an IDF component
here, and a query TF component here.
But in the middle, the normalization's
a little bit different.
As we explained, there is this
Okapi TF transformation here.
And that does a sublinear
transformation with an upper bound.
In this case we have put the length
normalization factor here.
We are adjusting k, but
it achieves a similar effect
because we put a normalizer
in the denominator.
Therefore again, if a document is longer,
then the term weight will be smaller.
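The two formulas on the slide are not reproduced in this transcript, so the following Python sketch is only one reading of the description: a sum over matched query terms of query TF, a normalized document TF, and IDF. The IDF form log((M + 1) / df), the default values of k and b, and all function names are assumptions made for illustration.

```python
import math
from collections import Counter

def idf(num_docs, doc_freq):
    # Assumed IDF form: log((M + 1) / df).
    return math.log((num_docs + 1) / doc_freq)

def length_normalizer(doc_len, avg_doc_len, b):
    return 1 - b + b * doc_len / avg_doc_len

def pivoted_score(query, doc, doc_freq, num_docs, avg_doc_len, b=0.2):
    """Double-log TF divided by the pivoted length normalizer."""
    q_counts, d_counts = Counter(query), Counter(doc)
    norm = length_normalizer(len(doc), avg_doc_len, b)
    score = 0.0
    for term, q_tf in q_counts.items():
        d_tf = d_counts.get(term, 0)
        if d_tf > 0:
            tf_weight = math.log(1 + math.log(1 + d_tf)) / norm
            score += q_tf * tf_weight * idf(num_docs, doc_freq[term])
    return score

def bm25_score(query, doc, doc_freq, num_docs, avg_doc_len, k=1.2, b=0.75):
    """BM25 TF, with the length normalizer scaling k in the denominator."""
    q_counts, d_counts = Counter(query), Counter(doc)
    norm = length_normalizer(len(doc), avg_doc_len, b)
    score = 0.0
    for term, q_tf in q_counts.items():
        d_tf = d_counts.get(term, 0)
        if d_tf > 0:
            tf_weight = (k + 1) * d_tf / (d_tf + k * norm)
            score += q_tf * tf_weight * idf(num_docs, doc_freq[term])
    return score
```

With the same term counts, a longer document gets a larger normalizer and therefore a smaller score under both functions, and setting b to 0 removes the length effect entirely.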
So, you can see, after we have gone
through all the analysis that we talked
about, we have,
in the end, reached
basically the state-of-the-art
retrieval functions.
So far we have talked
mainly about how to place
the document vector in the vector space.
And this has played an important role
in determining the effectiveness of
the ranking function.
But there are also other dimensions
that we did not really examine in detail.
For example, can we further
improve the instantiation of
the dimensions of the vector space model?
Now we've just assumed
the bag of words representation.
So each dimension is a word.
But obviously we can see
there are many other choices.
For example, stemmed words, those
are the words that have been transformed
into the same root form,
so that computation and computing will all
become the same and they can be matched.
We can also do stop word removal.
This removes some very common
words that don't carry any content,
like the or of.
We can use phrases to define dimensions.
We can even use latent semantic analysis
to find clusters of words
that represent
a latent concept as one dimension.
We can also use smaller units,
like character n-grams.
Those are sequences of n characters for
dimensions.
However, in practice people have found
that the bag-of-words representation
with phrases is still
the most effective one.
And it's also efficient so
this is still so
far the most popular dimension
instantiation method and
it's used in all the major search engines.
I should also mention that sometimes
we need to do language-specific and
domain-specific tokenization.
And this is actually very important as
we might have variations of the terms
that might prevent us from
matching them with each other,
even though they mean the same thing.
And in some languages, like Chinese,
there is also the challenge of
segmenting text to obtain word boundaries,
because it's just
a sequence of characters.
A word might correspond to
one character or two characters or
even three characters.
So it's easier in English when we
have a space to separate the words.
But in some other languages we may need
to do some natural language processing
to figure out
where the boundaries for words are.
There is also the possibility to
improve the similarity function.
So
far we have used the dot product, but
one can imagine there are other measures.
For example we can measure the cosine
of the angle between two vectors, or
we can use a Euclidean distance measure.
And these are all possible.
The dot product still seems the best, and
one of the reasons is
that it's very general.
In fact, it's sufficiently general
if you consider the possibilities of
doing weighting in different ways.
So, for example,
cosine measure can be regarded as the dot
product of two normalized vectors.
That means we first normalize each vector,
and then we take the dot product.
That would be equivalent
to the cosine measure.
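This equivalence is easy to check numerically. Here is a minimal sketch in Python with plain lists as vectors. The helper names and the example vectors are made up for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale a non-zero vector to unit length."""
    length = math.sqrt(dot(v, v))
    return [a / length for a in v]

def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

if __name__ == "__main__":
    query_vec = [1.0, 0.0, 2.0]
    doc_vec = [3.0, 1.0, 1.0]
    print(cosine(query_vec, doc_vec))
    print(dot(normalize(query_vec), normalize(doc_vec)))
```

Both printed values agree up to floating-point rounding, so the cosine measure is just the dot product with a particular choice of vector weighting.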
I just mentioned that BM25
seems to be one of the most
effective formulas.
But there has also been further
development in improving BM25, although
none of these works has
changed BM25 fundamentally.
So in one line of work,
people have derived BM25F.
Here F stands for field, and
this is to use BM25 for
documents with structure.
For example you might consider the
title field, the abstract, or the
body of the research article, or
even anchor text on the web pages.
Those are the text fields that
describe links to other pages.
And these can all be
combined with an appropriate
weight on different fields to help
improve scoring for a document.
When we use BM25 for such a document,
the obvious choice is to
apply BM25 for each field, and
then combine the scores.
The idea of BM25F, however,
is to first combine
the frequency counts of terms in all
the fields and then apply BM25.
Now this has the advantage of avoiding
over-counting the first occurrence of the term.
Remember, in the sublinear
transformation of TF,
the first occurrence is very important
and contributes a large weight.
And if we do that for all the fields, then
the same term might have gained a lot
of advantage in every field. But when we
combine these word frequencies together,
we just do the transformation one time,
and
at that time the extra occurrences will
not be counted as fresh first occurrences.
And this method has been working very
well for scoring structured documents.
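The over-counting effect can be seen with a small sketch in Python. It only compares the two orders of operations on the TF component. The field names, the field weights, and the value k = 1.2 are made up for illustration, and other parts of the scoring (such as IDF and any length normalization) are left out.

```python
def tf_bm25(count, k=1.2):
    """Assumed BM25 TF transformation, bounded above by k + 1."""
    return 0.0 if count == 0 else (k + 1) * count / (count + k)

def score_fields_separately(field_counts, field_weights, k=1.2):
    """Obvious choice: transform each field's count, then combine."""
    return sum(field_weights[f] * tf_bm25(c, k)
               for f, c in field_counts.items())

def score_combined_counts(field_counts, field_weights, k=1.2):
    """BM25F idea: combine weighted counts first, then transform once."""
    combined = sum(field_weights[f] * c for f, c in field_counts.items())
    return tf_bm25(combined, k)

if __name__ == "__main__":
    counts = {"title": 1, "abstract": 1, "body": 1}
    weights = {"title": 1.0, "abstract": 1.0, "body": 1.0}
    print(score_fields_separately(counts, weights))  # three "first occurrences"
    print(score_combined_counts(counts, weights))    # one transformation of count 3
```

Scoring each field separately treats every field's single occurrence as a fresh first occurrence, so the total can exceed the k + 1 bound, while combining counts first keeps the bound intact.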
The other line of extension is called
BM25+. In this line, researchers
have addressed the problem of
over-penalization of long documents by BM25.
So to address this problem,
the fix is actually quite simple.
We can simply add a small constant
to the TF normalization formula.
But what's interesting is that we can
analytically prove that by doing such
a small modification,
we will fix the problem of
over-penalization of long
documents by the original BM25.
So the new formula called
BM25+ is empirically and
analytically shown to be better than BM25.
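The lecture only says that a small constant is added to the TF normalization formula. The sketch below, in Python, shows one way to read that, where the constant (called delta here) is added to the normalized TF of every matched term. The name delta, its value of 1.0, and the values of k and b are assumptions for illustration.

```python
def bm25_tf(count, doc_len, avg_doc_len, k=1.2, b=0.75):
    """Length-normalized BM25 TF component for one term."""
    if count == 0:
        return 0.0
    norm = 1 - b + b * doc_len / avg_doc_len
    return (k + 1) * count / (count + k * norm)

def bm25_plus_tf(count, doc_len, avg_doc_len, k=1.2, b=0.75, delta=1.0):
    """Same component with an assumed constant delta added for matched terms."""
    if count == 0:
        return 0.0  # unmatched terms still contribute nothing
    return bm25_tf(count, doc_len, avg_doc_len, k, b) + delta

if __name__ == "__main__":
    # A matched term in a very long document.
    print(bm25_tf(1, 1_000_000, 100))
    print(bm25_plus_tf(1, 1_000_000, 100))
```

In a very long document the plain component shrinks toward zero, so a matching document can look almost like a non-matching one. With the added constant, a matched term always contributes at least delta.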
So to summarize all that we have
said about the Vector Space Model.
Here are the major takeaway points.
First, in such a model,
we use the similarity notion of relevance,
assuming that the relevance of
a document with respect to a query is
basically proportional to the similarity
between the query and the document.
So, naturally,
that implies that the query and
document must be represented in
the same way, and in this case,
we represent them as vectors in
high dimensional vector space.
Where the dimensions are defined by
words or concepts or terms in general.
And we generally need to use a lot of
heuristics to design a ranking function.
We used some examples to show
the need for several heuristics,
including TF weighting and transformation,
IDF weighting, and
document length normalization.
These major heuristics are the most
important heuristics to ensure that such
a general ranking function
works well for all kinds of text.
And finally BM25 and
pivoted normalization seem
to be the most effective
formulas out of the vector space model.
Now I have to say that I've put BM25
in the category of vector space model,
but in fact BM25 has
been derived using a probabilistic model.
So the reason why I've put it in
the vector space model is first
the ranking function actually has a nice
interpretation in the vector space model.
We can easily see it looks very
much like a vector space model
with a special weighting function.
The second reason is that the original
BM25 has a somewhat different form of IDF.
And that form of IDF actually
doesn't really work as
well as the standard IDF
that you have seen here.
So as an effective retrieval function,
BM25 should probably use a heuristic
modification of the IDF to make it
even more like a vector space model.
There are some additional readings.
The first is a paper about
the pivoted length normalization.
It's an excellent example of using
empirical data analysis to suggest a need
for length normalization, and then further
derive a length normalization formula.
The second is the original
paper where BM25 was proposed.
The third paper has
a thorough discussion of BM25 and
its extensions, particularly BM25F.
And finally, the last paper
has a discussion of improving
BM25 to correct the over-penalization
of long documents.
[MUSIC]

[SOUND] This lecture is about
the implementation of
text retrieval systems.
In this lecture, we will discuss how we
can implement a text retrieval method
to build a search engine.
The main challenge is to
manage a lot of text data and
to enable a query to be answered very
quickly and to respond to many queries.
This is a typical text
retrieval system architecture.
We can see the documents
are first processed by a tokenizer
to get tokenized units, for example words.
And then these words or
tokens would be processed by
an indexer that would create an index,
which is a data structure for the search
engine to use to quickly answer a query.
And the query will be going
through a similar processing step.
So, the tokenizer will be
applied to the query as well so
that the text can be
processed in the same way.
The same units will be
matched with each other.
And the query's representation
will then be given to the scorer,
which would use the index to
quickly answer a user's query by
scoring the documents and
then ranking them.
The results will be given to the user.
And then the user can look at the results
and provide some feedback that can be
explicit judgments about which documents
are good and which documents are bad,
or implicit feedback such as clickthroughs, so the
user doesn't have to do anything extra.
The user will just look at the results and
skip some and
click on some results to view.
So these interaction signals can be used
by the system to improve the ranking
accuracy by assuming that viewed documents
are better than the skipped ones.
So, a search engine system then
can be divided into three parts.
The first part is the indexer, and
the second part is the scorer,
that responds to the user's query.
And the third part is
the feedback mechanism.
Now typically, the indexing is done in
an offline manner so you can pre-process
the collection data and build the inverted
index, which we will introduce in a moment.
And this data structure can then be used
by the online module which is a scorer
to process a user's query dynamically and
quickly generate search results.
The feedback mechanism can be done online
or offline depending on the method.
The implementation of the indexer and
the scorer is fairly standard,
and this is the main topic of this
lecture and the next few lectures.
The feedback mechanism,
on the other hand has variations.
It depends on what method is used.
So that is usually done in
an algorithm-specific way.
Let's first talk about the tokenizer.
Tokenization is to normalize lexical
units into the same form so
that semantically similar words
can be matched with each other.
Now in English,
stemming is often used, and
this would map all the inflectional
forms of words into the same root form.
So for example, computer computation and
computing can all be matched
to the root form compute.
This way, all these different forms of
computing can be matched with each other.
Normally this is a good idea to increase
the coverage of documents that
are matched with this query.
But it's also not always beneficial
because sometimes the subtle
difference between computer and
computation might still suggest a
difference in the coverage of the content.
But in most cases,
stemming seems to be beneficial.
When we tokenize the text in some other
languages, for example Chinese, we might
face some special challenges in segmenting
the text to find the word boundaries.
Because it's not
obvious where the boundary is as
there's no space separating them.
So, here, of course,
we have to use some language-specific
natural language processing techniques.
Once we do tokenization, then we would
index the text documents, and that
will convert the documents into some data
structure that can enable fast search.
The basic idea is to precompute
as much as we can, basically.
So the most commonly used index
is called an inverted index.
And this has been used
in many search engines to
support basic search algorithms.
Sometimes other indices, for
example a document index,
might be needed in order to support
feedback.
Like I said, these kinds of
techniques are not really standard
in that they vary a lot according
to the feedback methods.
To understand why we
are using an inverted index,
it will be useful for you to think
about how you would respond to
a single term query quickly.
So if you want to use more time to
think about that, pause the video.
So think about how you can
preprocess the text data so
that you can quickly respond
to a query with just one word.
Well, if you have thought about the question,
you might realize that the best way is
to simply create a list of documents
that match each term in the vocabulary.
In this way, you can basically
pre-construct the answers.
So when you see a term,
you can simply just fetch
the ranked list of documents for
that term and return the list to the user.
So that's the fastest way to
respond to a single-term query.
Now the idea of the inverted index is
actually basically like that.
We can pre-construct such an index
that would allow us to quickly find
all the documents that
match a particular term.
So let's take a look at this example.
We have three documents here, and
these are the documents that you
have seen in some previous lectures.
Suppose we want to create an inverted index for
these documents, then we will
need to maintain a dictionary.
In the dictionary we'll have one entry for
each term.
And we're going to store some
basic statistics about the term.
For example, the number of
documents that match the term, or
the
total frequency of the term,
which means we would count
duplicated occurrences of the term.
And so, for example, news.
This term occurred in
all the three documents.
So the count of documents is three.
And you might also realize we needed
this count of documents or document
frequency for computing some statistics
to be used in the vector space model.
Can you think of that?
So, what weighting heuristic
would need this count?
Well, that's the IDF, right,
inverse document frequency.
So IDF is a property of the term,
and we can compute it right here.
So with the document count here,
it's easy to compute the IDF either at
this time, when we build the index, or
at run time when we see a query.
Now in addition to these
basic statistics we also
store all the documents that matched news.
And these entries are stored
in a file called postings.
So in this case it matched 3 documents and
we store information about
these 3 documents here.
This is the document id,
document 1, and the frequency is 1.
The TF is 1 for news.
In the second document it's also 1, etc.
So from this list we can get all
the documents that match the term news.
And we can also know the frequency
of news in these documents.
So, if the query has just one word,
news,
we can easily look up this
table to find the entry and
go quickly to the postings to fetch
all the documents that match news.
So, let's take a look at another term.
Now this time let's take a look
at the word presidential.
All right, this word occurred
in only 1 document, document 3.
So, the document frequency is 1, but
it occurred twice in this document.
And so the frequency count is 2, and
the frequency count is used
in some other retrieval methods
where we might use the frequency
to assess the popularity of
a term in the collection.
And similarly, we'll have a pointer
to the postings, right here.
And in this case there is
only one entry here because
the term occurred in just one document.
And that's here.
The document id is 3,
and it occurred twice.
So this is the basic
idea of inverted index.
It's actually pretty simple, right?
With this structure we can easily fetch
all the documents that match a term.
And this will be the basis for
scoring documents for a query.
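Here is a minimal sketch in Python of a dictionary plus postings built in memory. The three document texts are illustrative stand-ins chosen so that news occurs in all three documents and presidential occurs twice in document 3, as in the example. The function name and the field names df and total_tf are hypothetical.

```python
from collections import Counter, defaultdict

def build_inverted_index(docs):
    """Return (dictionary, postings) for a {doc_id: text} collection.

    postings maps each term to a list of (doc_id, term frequency) entries.
    dictionary maps each term to its document frequency and total frequency.
    """
    postings = defaultdict(list)
    for doc_id in sorted(docs):
        term_counts = Counter(docs[doc_id].lower().split())
        for term, tf in term_counts.items():
            postings[term].append((doc_id, tf))
    dictionary = {
        term: {"df": len(entries), "total_tf": sum(tf for _, tf in entries)}
        for term, entries in postings.items()
    }
    return dictionary, dict(postings)

# Illustrative stand-in documents.
docs = {
    1: "news about",
    2: "news about organic food campaign",
    3: "news of presidential campaign presidential candidate",
}
dictionary, postings = build_inverted_index(docs)

if __name__ == "__main__":
    print(dictionary["news"], postings["news"])
    print(dictionary["presidential"], postings["presidential"])
```

Answering a single-term query is then just one dictionary lookup followed by reading that term's postings list.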
Now sometimes we also want to store
the positions of these terms.
So, in many of these
cases the term occurred
just once in the document so there's only
one position, for example in this case.
But in this case the term occurred
twice so it would store two positions.
Now the position information is
very useful for checking whether
the matching of query terms is actually
within a small window of, let's say,
five words, or ten words,
or whether the matching of,
the two query terms,
is in fact a phrase of two words.
This can all be checked quickly by
using the position information.
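The two position checks mentioned here can be sketched in Python as follows. The token list is an illustrative stand-in for one document, and the function names are hypothetical. A real index would store the positions in the postings rather than recompute them from the text.

```python
def term_positions(tokens, term):
    """Positions (0-based) at which a term occurs in a token list."""
    return [i for i, tok in enumerate(tokens) if tok == term]

def within_window(positions_a, positions_b, window):
    """True if some occurrence of A is within `window` words of some B."""
    return any(abs(a - b) <= window
               for a in positions_a for b in positions_b)

def forms_phrase(positions_a, positions_b):
    """True if some occurrence of B immediately follows some A."""
    following = set(positions_b)
    return any(a + 1 in following for a in positions_a)

if __name__ == "__main__":
    tokens = "news of presidential campaign presidential candidate".split()
    presidential = term_positions(tokens, "presidential")
    campaign = term_positions(tokens, "campaign")
    print(presidential, campaign)
    print(forms_phrase(presidential, campaign))
```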
So why is the inverted index good for
fast search?
Well, we just talked about the possibility
of using it to answer
a single-word query.
And that's very easy.
What about multiple-term queries?
Well, let's look at
some special cases of the Boolean query.
A Boolean query is basically
a Boolean expression, like this.
So I want the relevant document
to match both term A AND term B.
All right, so
that's one conjunctive query.
Or, I want the relevant documents
to match term A OR term B.
That's a disjunctive query.
Now how can we answer such
a query by using an inverted index?
Well if you think a bit about it,
it would be obvious.
Because we simply have to fetch all
the documents that match term A and
also fetch all the documents
that match term B.
And then just take the intersection
to answer a query like A and B.
Or to take the union to
answer the query A or B.
So this is all very easy to answer.
It's going to be very quick.
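As a sketch of these two operations in Python, assume each postings list is a sorted list of document IDs. The intersection below walks both lists in step, which is one common way to do it. The function names and the example lists are made up for illustration.

```python
def intersect(postings_a, postings_b):
    """Answer A AND B by merging two sorted lists of document IDs."""
    i = j = 0
    result = []
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            result.append(postings_a[i])
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1
        else:
            j += 1
    return result

def union(postings_a, postings_b):
    """Answer A OR B: every document that matches at least one term."""
    return sorted(set(postings_a) | set(postings_b))

if __name__ == "__main__":
    news = [1, 2, 3]
    campaign = [2, 3]
    print(intersect(news, campaign))
    print(union(news, campaign))
```

Only the postings of the query terms are touched, so documents that contain neither term are never looked at.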
Now what about the multi-term
keyword query?
We talked about the vector space model for
example.
And we would match such a query with
a document and generate a score.
And the score is based on
aggregated term weights.
So in this case it's not a Boolean query,
but
the scoring can be actually
done in a similar way.
Basically it's similar to
a disjunctive Boolean query.
Basically it's like A OR B.
We take the union of all the documents
that matched at least one query term,
and then we would
aggregate the term weights.
So this is the basic idea of
using an inverted index for
scoring documents in general.
And we're going to talk about
this in more detail later.
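As a preview, the aggregation idea can be sketched in Python with one score accumulator per document. The postings data, the function name, and the toy weight function (which just uses the raw term frequency) are assumptions for illustration. A real scorer would plug in a weight such as TF-IDF or BM25.

```python
from collections import defaultdict

def score_query(query_terms, postings, weight):
    """Aggregate term weights over the union of matching documents.

    postings maps term -> list of (doc_id, tf).
    weight(term, tf) returns the contribution of one matched term.
    """
    accumulators = defaultdict(float)
    for term in query_terms:
        for doc_id, tf in postings.get(term, []):
            accumulators[doc_id] += weight(term, tf)
    # Highest score first; ties broken by document ID.
    return sorted(accumulators.items(), key=lambda item: (-item[1], item[0]))

if __name__ == "__main__":
    postings = {
        "campaign": [(2, 1), (3, 1)],
        "presidential": [(3, 2)],
    }
    ranked = score_query(["presidential", "campaign"], postings,
                         weight=lambda term, tf: float(tf))
    print(ranked)
```

Documents that match no query term never get an accumulator, so they cost nothing at query time.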
But for now,
let's just look at the question,
why is the inverted index a good idea?
Basically, why is it more efficient than
sequentially just scanning documents?
Right?
This is the obvious approach.
You can just compute the score for
each document, and
then you can
sort them.
This is a straightforward method.
But this is going to be very slow.
Imagine the web.
It has a lot of documents.
If you do this, then it will take
a long time to answer your query.
So the question now is, why would
the inverted index be much faster?
Well it has to do with
the word distribution in text.
So, here are some common phenomena
of word distribution in text.
There are some language-independent
patterns that seem to be stable.
And these patterns are basically
characterized in the following way.
A few words, like the common words
the, a, or we, occur very,
very frequently in text.
So they account for
a large percent of occurrences of words.
But most words would occur just rarely.
There are many words that occur just once,
let's say, in a document,
or once in the collection.
And there are many such single terms.
It's also true that the most
frequent words in one corpus
may actually be rare in another.
That means, although the general
phenomenon is applicable or
is observed in many cases,
the exact words that are common
may vary from context to context.
So this phenomenon is characterized
by what's called Zipf's Law.
This law says that the rank
of a word multiplied by
the frequency of the word
is roughly constant.
So formally if we use F of
w to denote the frequency and
r of w to denote the rank of a word,
then this is the formula.
It basically says the same thing,
just in mathematical terms, where C is
basically a constant.
And there is also a
parameter alpha that might
be adjusted to better fit
any empirical observations.
So if I plot the word
frequencies in sorted order,
then you can see this more easily.
The x-axis is basically the word rank.
And this is r of w.
And the y-axis is the word frequency,
or F of w.
Now, this curve basically shows
that the product of the two
is roughly the constant.
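To see the rank-times-frequency pattern on data, here is a minimal sketch in Python. The formula on the slide is not reproduced in this transcript, so the form F(w) = C / r(w)^alpha used below is an assumption based on the description. The function names and the tiny synthetic token list are made up for illustration.

```python
from collections import Counter

def rank_frequency_table(tokens):
    """Return (rank, word, frequency, rank * frequency) rows, most frequent first."""
    counts = Counter(tokens)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, word, freq, rank * freq)
            for rank, (word, freq) in enumerate(ranked, start=1)]

def zipf_frequency(rank, C, alpha=1.0):
    """Assumed form of the law: F(w) = C / r(w) ** alpha."""
    return C / rank ** alpha

if __name__ == "__main__":
    # Synthetic tokens built so that rank * frequency is exactly constant.
    tokens = ["the"] * 6 + ["of"] * 3 + ["news"] * 2
    for row in rank_frequency_table(tokens):
        print(row)
```

On real text the product is only roughly constant, and alpha can be adjusted to fit the observed curve better.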
Now, if you look at these words, we can see
they can be separated into three groups.
In the middle are
the intermediate frequency words.
These words tend to occur in
quite a few documents, right?
But they're not like those
most frequent words.
And they are also not very rare.
So they tend to be often used
in queries.
And these intermediate frequency words
also tend to have high
TF-IDF weights.
But if you look at the left
part of the curve.
These are the highest frequency words.
They occur very frequently.
They are usually stop words,
the, we, of, et cetera.
Those words are very, very frequent.
They are, in fact,
too frequent to be discriminative.
And they generally are not very
useful for retrieval.
So, they are often removed, and
this is called stop word removal.
So you can use pretty much just the count
of words in the collection to kind
of infer what words might be stop words.
Those are basically
the highest frequency words.
And they also occupy a lot of
space in the inverted index.
You can imagine the posting entries for
such a word would be very long.
And therefore,
if you can remove such words,
you can save a lot of
space in the inverted index.
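As a rough illustration of inferring stop words from collection counts, here is a minimal Python sketch. It is an assumption-laden toy: the function name stop_word_candidates, the cutoff k, and the three sample documents are all made up for illustration, and a real system would inspect the candidates before removing them.

```python
from collections import Counter

def stop_word_candidates(documents, k):
    """Return the k most frequent words in the collection as stop word candidates."""
    counts = Counter(word for doc in documents for word in doc.lower().split())
    # Highest collection count first; ties are broken alphabetically.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ordered[:k]]

# Hypothetical toy collection.
docs = ["the index of the web", "the cost of the search", "a list of words"]
print(stop_word_candidates(docs, 2))
```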
We also show the tail part,
which has a lot of rare words.
Those words don't occur very frequently,
and there are many such words.
Those words are actually very useful for
search
if a user happens to be
interested in such a topic.
But because they're rare, it's
often true that users
aren't necessarily
interested in those words.
But retaining them would allow us to
match such a document accurately,
and they generally have very high IDFs.
So what kind of data structures should
we use to store the inverted index?
Well, it has two parts, right?
If you recall we have a dictionary,
and we also have postings.
The dictionary has modest size,
although for
the web, it would still be very large.
But compared with postings, it's modest.
And we also need to have fast,
random access to the entries
because we want to look up
the query term very quickly.
So, therefore, we prefer to keep such
a dictionary in memory if it's possible.
If the collection is not
very large, this is feasible.
But if the collection is very large,
then it's in general not possible.
If the vocabulary size is very large,
obviously we can't do that.
But in general, that's our goal.
So the data structures
that we often use for
storing the dictionary would be direct access
data structures, like a hash table, or
a B-tree if we can't store everything
in memory and have to use disk.
And we try to build a structure that
would allow us to quickly look up
entries.
Right.
For postings, they're huge, you can see.
And in general, we don't have to have
direct access to a specific entry.
We generally would just look up
a sequence of document IDs and
frequencies for all of the documents
that match a query term.
So we would read those
entries sequentially.
And therefore,
because they're large, we generally
have to store postings on disk,
so they have to stay on disk.
And they would contain information such
as document IDs, term frequencies, or
term positions, et cetera.
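The split between an in-memory dictionary and sequentially read postings can be sketched in Python as follows. This is only an illustration under assumptions: the names write_index and read_postings are hypothetical, an io.BytesIO object stands in for the postings file on disk, and each posting is stored as two uncompressed 32-bit integers (document ID and term frequency).

```python
import io
import struct

def write_index(postings_by_term):
    """Write postings back to back; return (dictionary, postings store)."""
    store = io.BytesIO()      # stands in for a postings file on disk
    dictionary = {}           # hash table: term -> (byte offset, number of entries)
    for term in sorted(postings_by_term):
        entries = postings_by_term[term]
        dictionary[term] = (store.tell(), len(entries))
        for doc_id, tf in entries:
            store.write(struct.pack("<II", doc_id, tf))   # two unsigned 32-bit ints
    return dictionary, store

def read_postings(dictionary, store, term):
    """Look up the term in memory, then read its postings in one sequential pass."""
    if term not in dictionary:
        return []
    offset, count = dictionary[term]
    store.seek(offset)                    # one random access into the postings
    data = store.read(8 * count)          # then a single sequential read
    return [struct.unpack_from("<II", data, 8 * i) for i in range(count)]

# Hypothetical toy postings: term -> list of (document ID, term frequency).
toy = {"information": [(1, 3), (2, 4)], "security": [(2, 3)]}
dictionary, store = write_index(toy)
print(read_postings(dictionary, store, "information"))
```

The dictionary lookup is the only random access; everything after the seek is a sequential read, which matches the access pattern described above.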
Now because they're very large,
compression is often desirable.
Now this is not only
to save disk space, although
that is of course
one benefit of compression.
It's not going to occupy that much space.
But it's also to help improve speed.
Can you see why?
Well, we know that input and
output will cost a lot of time in
comparison with the time taken by CPU.
So CPU is much faster.
But IO takes time.
And so by compressing the inverted index,
the posting files will become smaller.
And the entries that we
have to read into memory
to process a query term
would be smaller.
And so we can reduce
the amount of traffic in IO.
And that can save a lot of time.
Of course, we have to then do
more processing of the data
when we uncompress
the data in memory.
But as I said, CPU is fast, so
overall, we can still save time.
So compression here is both
to save disk space and
to speed up the loading
of the inverted index.
[MUSIC]

[SOUND].
This lecture is about
the Inverted Index Construction.
In this lecture, we will continue
the discussion of system implementation.
In particular, we're going to discuss
how to construct the inverted index.
The construction of the inverted index
is actually very easy if the data set is
very small.
It's very easy to construct a dictionary
and then store the postings in a file.
The problem is that when our data
is not able to fit into memory,
then we have to use some
special method to deal with it.
And unfortunately, in most retrieval
applications, the data set would be large and
generally cannot be
loaded into memory at once.
There are many approaches
to solving that problem;
the sorting-based method is quite common and
works in four steps as shown here.
First, we collect the local term ID,
document ID, and frequency tuples.
Basically, you count the terms
in a small set of documents.
Then, once you collect those counts, you
can sort those counts based on terms so
that you build a local,
partial inverted index.
These are called runs.
And then, you write them into
a temporary file on the disk.
And then, in step three, you
do pair-wise merging of these runs, until
you eventually merge all the runs
and generate a single inverted index.
So this is an illustration of this method.
On the left, you see some documents.
And on the right, we show a term
lexicon and a document ID lexicon.
These lexicons map string-based
representations of document IDs or
terms into integer representations,
and map back from
integers to the string representation.
The reason why we are interested
in using integers to represent these IDs
is because
integers are often easier to handle.
For example,
integers can be used as an index for
an array, and they are also easy to compress.
So this is one reason why we
tend to map these strings
into integers, so that we don't
have to carry these strings around.
So how does this approach work?
Well, it's very simple.
We're going to scan these
documents sequentially, and
then parse the documents and
count the frequencies of terms.
And in this stage, the tuples are
generally sorted by document IDs because we
process the documents sequentially.
So, we first encounter all the terms in
the first document.
Therefore, the document IDs
are all ones in this stage.
And this would be
followed by document ID 2.
They're naturally sorted in this
order just because we process the data in
this order.
At some point,
we will run out of memory and
would have
to write them to the disk.
But before we do that,
we're going to sort them,
using whatever memory we have,
and
this time,
we're going to sort based on term IDs.
Note that here, we're using
the term IDs as the key to sort.
So, all the entries that share the same
term would be grouped together.
In this case,
we can see all the, all the IDs
of documents that match term
one would be grouped together.
And we're going to write this into
the disk as a temporary file.
And that would allow us to use the memory
to process the next batch of documents,
and we're going to do that for
all the documents.
So we're going to write a lot of
temporary files into the disk.
And then,
the next stage is to do merge sort.
Basically, we're going to
merge them and sort them.
Eventually, we will get a single
inverted index where
the entries are sorted
based on term IDs.
And on the top,
we can see these are the ordered entries for
the documents that match term ID 1.
So this is basically how we can do
the construction of the inverted index,
even though the data
cannot all be loaded into memory.
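The whole sort-based procedure can be sketched in a few lines of Python. This is a simplified illustration rather than the method exactly as shown on the slide: build_runs and merge_runs are hypothetical names, plain lists stand in for the temporary files on disk, batch_size plays the role of running out of memory, and heapq.merge does one multi-way merge where the lecture describes pair-wise merging.

```python
import heapq
from collections import Counter

def build_runs(documents, batch_size):
    """documents: list of (doc_id, text). Returns the term lexicon and sorted runs."""
    term_ids = {}                 # term lexicon: string -> integer ID
    runs, batch = [], []
    for count, (doc_id, text) in enumerate(documents, start=1):
        for term, tf in Counter(text.split()).items():
            term_id = term_ids.setdefault(term, len(term_ids) + 1)
            batch.append((term_id, doc_id, tf))   # arrives in document ID order
        if count % batch_size == 0:               # pretend memory is full
            runs.append(sorted(batch))            # re-sort by term ID, then doc ID
            batch = []
    if batch:
        runs.append(sorted(batch))
    return term_ids, runs

def merge_runs(runs):
    """Merge the sorted runs into one index: term ID -> list of (doc_id, tf)."""
    index = {}
    for term_id, doc_id, tf in heapq.merge(*runs):
        index.setdefault(term_id, []).append((doc_id, tf))
    return index

# Hypothetical toy documents; integer document IDs are assumed to be assigned already.
docs = [(1, "a b a"), (2, "b c"), (3, "a c c")]
term_ids, runs = build_runs(docs, batch_size=2)
print(merge_runs(runs))
```

Because every run is already sorted by term ID, the merge only ever needs the head of each run in memory, which is why the approach scales beyond what fits in memory.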
Now, we mentioned earlier that
because the postings are very large,
it's desirable to compress them.
So let's now talk a little bit about
how we compress the inverted index.
Well, the idea of compression, in general,
is to leverage skewed
distributions of values.
And we generally have to use variable-length
encoding instead of the fixed-length
encoding that we use
by default in a programming language like C++.
And so, how can we leverage the skewed
distributions of values to
compress these values?
Well, in general, we would use fewer
bits to encode the frequent values
at the cost of using more bits
to encode the rare values.
So in our case, let's think about how
we can compress the TF, the term frequency.
If you picture what the inverted
index looks like,
you'll see in the postings there are a lot of
term frequencies.
Those are the frequencies of terms
in all those documents.
Now, if you think about it, what
kind of values are most frequent there?
You probably will be able to guess
that small numbers tend to occur
far more frequently than large numbers.
Why?
Well, think about
the distribution of words.
This is due to Zipf's law:
many words occur just rarely.
So we see a lot of small numbers.
Therefore, we can use fewer bits for
the small but highly frequent integers,
at the cost of using more bits for
large integers.
This is a trade-off, of course.
If the values are distributed uniformly,
this won't save us any space.
But because we tend to see many
small values, they're very frequent.
We can save on average
even though sometimes,
when we see a large number we
have to use a lot of bits.
What about the document IDs
that we also saw in postings?
Well, they are not
distributed in a skewed way, right?
So, how can we deal with that?
Well, it turns out you can
use a trick called the d-gap,
and that is to store
the difference of these document IDs.
We can imagine that if a term
has matched many documents,
then there will be a long
list of document IDs.
So when we take the gap, that is, the
difference between adjacent document IDs,
those gaps will be small.
So we'll again see a lot of small numbers,
whereas
if a term occurred in only a few
documents, then the gap would be large.
The larger numbers will not be frequent,
so this creates a skewed distribution
that would allow us
to compress these values.
Note that in order to
uncompress these document IDs,
we have to process the data sequentially
because we stored the differences.
In order to recover
the exact document ID,
we have to first recover the previous
document ID, and then we can add
the difference to the previous document ID
to restore the current document ID.
Now, this is possible because we
only need to have sequential
access to those document IDs.
Once we look up a term, we fetch all
the document IDs that match the term,
and then we process them sequentially.
So it's very natural, and that's why this
trick actually works.
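A minimal Python sketch of the d-gap transformation follows. The function name to_gaps and the sample ID list are hypothetical, and the sketch assumes the document IDs are positive integers sorted in increasing order, with the first ID stored as its gap from zero.

```python
def to_gaps(doc_ids):
    """Replace each sorted document ID by its difference from the previous one."""
    gaps, previous = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)   # the first gap is the first ID itself
        previous = doc_id
    return gaps

# Hypothetical postings for a fairly frequent term.
print(to_gaps([3, 7, 8, 15, 16]))   # [3, 4, 1, 7, 1]
```

The gaps are smaller than the IDs they replace, so a code that favors small integers can then be applied to them.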
And there are many different methods for
encoding.
Binary code is the commonly used code in
just about any programming
language; it is basically
a fixed-length encoding.
Unary code, gamma code, and
delta code are all possibilities, and
there are many other possibilities.
So let's look at some
of them in more detail.
Binary code is really
equal-length encoding.
And that's appropriate for
randomly distributed values.
Unary coding is a variable-length
encoding method.
In this case, an integer x that is
at least 1
is encoded as x minus 1
one bits followed by a 0.
So for example, 3 would be encoded
as two 1s followed by a 0,
whereas 5 would be encoded as
four 1s followed by 0, et cetera.
So now you can imagine how
many bits we have to use for
a large number like 100.
How many bits do we have to use
exactly for a number like 100?
Well, exactly, we have to use 100 bits,
so it's the same number of
bits as the value of this number.
So, this is very inefficient
if you are likely to
see some large numbers.
Imagine if you occasionally see a number
like 1000; you have to use 1000 bits.
So, this only works well if you
are absolutely sure that there would be no
large numbers,
and that the most frequent
values are very small numbers.
Now, how do you decode this code?
Since these are variable-length
encoding methods,
you can't just count how many bits and
then just stop.
Right?
You can't say eight bits or 32 bits,
and then start another code.
They are variable length, so
you have to rely on some mechanism.
In this case for unary, you can see
it's very easy to see the boundary.
Now you can easily see 0 would
signal the end of encoding.
So you just count how many 1s you
have seen, and then you hit the 0.
You know you have finished one number,
you start another number.
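Unary encoding and decoding as just described can be sketched in Python like this. The function names are hypothetical, and bit strings made of the characters "0" and "1" stand in for real packed bits so that the codes are easy to read.

```python
def unary_encode(x):
    """Encode an integer x >= 1 as (x - 1) one bits followed by a single 0."""
    if x < 1:
        raise ValueError("unary code is defined here only for x >= 1")
    return "1" * (x - 1) + "0"

def unary_decode_stream(bits):
    """Decode a concatenation of unary codes; every 0 marks the end of one number."""
    values, ones = [], 0
    for bit in bits:
        if bit == "1":
            ones += 1
        else:
            values.append(ones + 1)   # number of 1s seen, plus one
            ones = 0
    return values

print(unary_encode(3))   # 110
print(unary_encode(5))   # 11110
print(unary_decode_stream(unary_encode(3) + unary_encode(5) + unary_encode(1)))   # [3, 5, 1]
```

Note that the code for 100 built this way is 100 characters long, which is the inefficiency discussed above.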
Now, we just saw that unary code is too
aggressive in rewarding small numbers.
And if you occasionally see a very
big number, it will be a disaster.
So what about some other
less aggressive method?
Well, gamma coding is one of them.
And in this method, we
use unary coding for
a transformed form of the value.
So it's 1 plus the floor of log of x.
So the magnitude of this value is
much lower than the original x.
So that's why we can afford
using unary code for that.
So first we have the unary
code for coding this log of x.
And this will be followed by
a uniform code or binary code;
uniform
code and binary code are the same.
And we're going to use this code to code
the remaining part of the value of x.
And this is, precisely,
x minus 2 to the floor of log of x.
So the unary code basically codes
the floor of log of x, well,
with one added there.
But for the remaining part,
we use the uniform
code to actually code
the difference between x
and this 2 to the floor of log of x.
And it's easy to show that for
this difference,
we only need to use up to
the floor of log of x bits.
And this is easy to understand:
if the difference is too large, then we
would have a higher floor of log of x.
So, here are some examples.
For example, 3 is encoded as 101.
The first two digits are the unary code.
Right.
So, this is for the value 2.
Right.
10 encodes 2 in unary coding.
And so, that means
the floor of log of x is 1,
because we actually use unary code
to encode 1 plus the floor of log of x.
Since this is 2, we know that
the floor of log of x is actually 1.
But
3 is still larger than 2 to the 1, so
the difference is 1, and
that 1 is encoded here at the end.
So that's why we have 101 for 3.
Now, similarly, 5 is encoded
as 110 followed by 01.
And in this case,
the unary code encodes 3.
So, 110 is the unary code for 3, and
so the floor of log of x is 2.
And that means we will compute
the difference between 5 and
2 to the 2, and that's 1, and
so we now have again 1 at the end.
But this time, we're going to
use two bits, because with this
level of floor of log of x,
we could have more numbers, 5, 6, 7.
They would all share the same prefix here,
110.
So, in order to differentiate them,
we have to use two bits
at the end.
So you can imagine 6 would have 10 here
at the end instead of 01, after 110.
It's also true that
a gamma code always has
an odd number of bits,
and in the center, there is a 0.
That's the end of the unary code.
And on the left
side of this 0, there will be all 1s.
And on the right side of this 0,
it's binary coding or uniform coding.
So how can you decode such a code?
Well, you again first do unary decoding,
right?
Once you hit 0,
you know you have got the unary code.
And this also will tell you how many
bits you have to read further to
decode the uniform code.
So this is how you can
decode a gamma code.
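The gamma code and its decoding can be sketched in Python as follows. The function names are hypothetical, the logarithm is taken base 2, bit strings of "0" and "1" characters stand in for packed bits, and the decoder assumes its input is a well-formed concatenation of gamma codes.

```python
def gamma_encode(x):
    """Gamma code: unary code of 1 + floor(log2 x), then x - 2**floor(log2 x) in binary."""
    if x < 1:
        raise ValueError("gamma code is defined here only for x >= 1")
    k = x.bit_length() - 1               # floor(log2(x))
    prefix = "1" * k + "0"               # unary code for k + 1
    if k == 0:
        return prefix                    # x == 1 has no binary part
    remainder = x - (1 << k)             # the part of x above 2**k
    return prefix + format(remainder, "0{}b".format(k))   # exactly k binary digits

def gamma_decode_stream(bits):
    """Decode a well-formed concatenation of gamma codes."""
    values, i = [], 0
    while i < len(bits):
        k = 0
        while bits[i] == "1":            # read the unary prefix
            k += 1
            i += 1
        i += 1                           # skip the 0 that ends the prefix
        remainder = int(bits[i:i + k], 2) if k else 0
        values.append((1 << k) + remainder)
        i += k                           # the prefix told us how many bits to read
    return values

print(gamma_encode(3))   # 101
print(gamma_encode(5))   # 11001
print(gamma_encode(6))   # 11010
print(gamma_decode_stream(gamma_encode(3) + gamma_encode(5) + gamma_encode(1)))   # [3, 5, 1]
```

Each code has k ones, one zero, and k binary digits, which is why the length is always odd with the 0 in the center.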
There is also delta code, but
that's basically the same as gamma code,
except that you replace the unary
prefix with the gamma code.
So that's even less
aggressive than gamma code
in terms of rewarding the small integers.
So that means it's okay if you
occasionally see a large number.
It's okay with delta code.
It's also fine with gamma code.
It's really a big loss for unary code.
And they are all operating,
of course, at different degrees of
favoring small integers.
And that also means they would be
appropriate for certain distributions.
But none of them is perfect for
all distributions.
And which method works
the best would have to depend on
the actual distribution in your data set.
For inverted index compression,
people have found that gamma
coding seems to work well.
So how do we uncompress the inverted index?
We just talked about this.
First, you decode those encoded integers.
And we just discussed how we
decode unary coding and gamma coding.
So I won't repeat that.
What about the document IDs that
might be compressed using d-gap?
Well, we're going to do
sequential decoding.
So suppose the encoded ID list is x1,
x2, x3, et cetera.
We first decode x1 to obtain
the first document ID, ID1.
Then, we will decode x2,
which is actually the difference between
the second ID and the first one.
So we have to add the decoded value
of x2 to ID1 to recover the value
of
the ID at the second position, right?
So this is where you can see the advantage
of converting document IDs into integers.
And that allows us to do this
kind of compression, and
we just repeat until we
decode all the documents.
Every time we use the document
ID in the previous position
to help recover the document
ID in the next position.
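The sequential recovery of document IDs from decoded gaps can be sketched in Python as below. The function name from_gaps is hypothetical, and the input is assumed to be the already decoded integers x1, x2, x3, and so on, where x1 is the first document ID itself.

```python
def from_gaps(gaps):
    """Rebuild document IDs by adding each gap to the previously recovered ID."""
    doc_ids, previous = [], 0
    for gap in gaps:
        previous += gap          # ID at this position = previous ID + gap
        doc_ids.append(previous)
    return doc_ids

print(from_gaps([3, 4, 1, 7, 1]))   # [3, 7, 8, 15, 16]
```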
[MUSIC]

[SOUND].
This lecture is about how to do fast
search by using the inverted index.
In this lecture,
we are going to continue the discussion
of the system implementation.
In particular, we're going to talk about
how to support faster
search by using the inverted index.
So, let's think about what a general
scoring function might look like.
Now, of course, the vector space
model is a special case of this.
But we can imagine many other
retrieval functions of the same form.
So, the form of this
function is as follows.
We see this scoring
function of document d and
query q is defined as, first, a function
f sub a, which is an adjustment function
that would consider two
factors that are shown
here at the end, f sub d of d
and f sub q of q.
These are adjustment factors
of a document and a query, so
they're at the level of the document
and the query.
And
then inside of this function we also see
there's another function called h.
So, this is the main part of
the scoring function,
and these, as I just said,
are the scoring factors at the level
of the whole document and the query,
for example, document length.
And this adjustment function would
then combine all these.
Now, inside this h function,
there are functions that would compute
the weights of the contribution
of a matched query term t i.
So the function g gives us
the weight of a matched query
term t i in document d.
And this h function would then
aggregate all these weights, so
it would, for example, take a sum
over all the matched query terms.
But it can also be a product, or
could be another way of aggregating them.
And then finally, this adjustment
function would then consider
the document level or query level
factors to further adjust the score,
for example, document length [INAUDIBLE].
So, this general form would cover
many state-of-the-art retrieval functions.
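Written out, the general form described in words above might look like the following. The function symbols follow the spoken description (f sub a, h, g, f sub d, f sub q); the index k for the number of matched query terms is an assumed name, since the slide itself is not part of the transcript.

```latex
f(q, d) = f_a\Bigl( h\bigl( g(t_1, d, q), \ldots, g(t_k, d, q) \bigr),\; f_d(d),\; f_q(q) \Bigr)
```

For example, taking h to be a sum and g to be a TF-IDF weight gives a function of the vector space type.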
Let's look at how we can
score documents with such
a function using the inverted index.
So here's the general algorithm
that works as follows.
First, these query level and
document level factors can be
pre-computed at indexing time.
Of course, for the query,
we have to compute it at query time.
But for the document, for example,
document length can be pre-computed.
And then we maintain a score accumulator
for each document d to compute h.
And h is an aggregation function
over all the matching query terms.
So how do we do that?
Well, for each query term,
we're going to fetch the inverted list
from the inverted index.
This will give us all the documents
that match this query term,
and that includes d1,
f1, and so on, up to dn, fn.
So each pair is a document ID and
the frequency of the term in the document.
Then for each entry d sub j and f sub j,
a particular match of the term in
this particular document d sub j,
we're going to compute the function g.
That would give us something like
the TF-IDF weight of this term.
So, we're computing the weight
contribution of matching this query term
in this document.
And then we're going to update the score
accumulator for this document.
And this would allow us to
add this to our accumulator,
which would incrementally
compute the function h.
So this is basically a general
way to compute
all functions of this form
by using the inverted index.
Note that we don't have to
touch any document that
didn't match any query term,
and this is why it's fast.
We only need to process the documents
that match at least one query term.
In the end, we're going to
adjust the score to compute
this function f sub a, and then we can sort.
So let's take a look at
a specific example.
In this case, let's assume the scoring
function is a very simple one.
It just takes the sum of tf, the raw
tf, the count of a term in the document.
Now this simple equation will help
show the algorithm clearly.
It's very easy to extend
the computation to include other weights
like the transformation of TF, or
document length normalization, or IDF weighting.
So let's take a look at a specific example
where the query is information security,
and the slide shows some entries of
the inverted index on the right side.
Information occurred in four documents, and
the frequencies are also there;
security occurred in three documents.
So, let's see how the algorithm works,
all right?
So, first we iterate over all the query terms,
and we fetch the first query term.
What is that?
That's information.
Right?
So, imagine we have all these score
accumulators to
store the scores for these documents.
We can imagine they will be allocated,
but
they will only be
allocated as needed.
So before we do any weighting of terms,
we don't even need any score accumulators.
But conceptually we have these score
accumulators eventually allocated, right?
So let's fetch
the entries from the inverted list for
information first; that's the first one.
So these score accumulators obviously
would be initialized as zeros.
So the first entry is d1 and 3;
3 is the number of occurrences of
information in this document.
Since our scoring function assumes that the
score is just a sum of these raw counts,
we just need to add 3 to the score
accumulator to account for
the increase of score due to matching
this term, information, in document d1.
And now we go to the next entry.
That's d2 and 4, and then we'll add
4 to the score accumulator of d2.
Of course, at this point we will allocate
the score accumulator as needed.
And so, at this point, we have allocated
d1 and d2, and the next one is d3.
And we allocate another score
accumulator for d3 and add 1 to it.
And finally,
d4 gets a 5 because
the term information occurred
five times in this document.
Okay, so this completes the processing
of all the entries in the
inverted index for information.
We've processed all the contributions
of matching information in these
four documents.
So now our algorithm will go to the next
query term; that's security.
So, we're going to fetch all
the inverted index entries for security.
So in this case, there are three entries.
And we're going to go
through each of them.
The first is d2 and 3.
And that means security occurred
three times in d2, and what do we do?
Well, we do exactly the same as
what we did for information.
So this time we're going
to change the score
accumulator for d2, since
it's already allocated.
And what we do is add 3 to
the existing value, which is 4,
so we now get 7 for d2.
D2's score is increased because it
matches both information and security.
Go to the next entry; that's d4 and
1, so we update the score for
d4, and again we add 1 to d4,
so d4 goes from 5 to 6.
Finally we process d5 and 3.
Since we have not yet
allocated a score accumulator for d5,
at this point, we allocate one
for d5, and we're going to add 3 to it.
So, those scores on the last row
are the final scores for these documents
if our scoring function is just
a simple sum of tf values.
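The walkthrough above can be reproduced with a short Python sketch. The postings are the ones read out in the example; the function name score_query and the use of a dictionary for the accumulators are illustrative assumptions, and the scoring function is the simple sum of raw term frequencies.

```python
def score_query(query_terms, inverted_index):
    """Term-at-a-time scoring: one pass over each query term's inverted list."""
    accumulators = {}          # allocated only when a document is first touched
    for term in query_terms:
        for doc_id, tf in inverted_index.get(term, []):
            # g is the raw term frequency; h is a running sum.
            accumulators[doc_id] = accumulators.get(doc_id, 0) + tf
    # Rank by descending score; ties are broken by document ID.
    return sorted(accumulators.items(), key=lambda item: (-item[1], item[0]))

# Postings from the example: term -> list of (document ID, term frequency).
index = {
    "information": [("d1", 3), ("d2", 4), ("d3", 1), ("d4", 5)],
    "security": [("d2", 3), ("d4", 1), ("d5", 3)],
}
print(score_query(["information", "security"], index))
# [('d2', 7), ('d4', 6), ('d1', 3), ('d5', 3), ('d3', 1)]
```

Documents that contain neither term never get an accumulator, which is the source of the speedup.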
Now what if we actually would like
to do length normalization?
Well, we can do the normalization
at this point for each document.
So to summarize this,
you can see we first
processed the
query term information, and
we processed all the entries in
the inverted index for this term.
Then we processed security.
Now, let's think about
what should be the order of processing
here when we consider query terms.
It might make a difference,
especially if we don't want
to keep all the score accumulators.
Let's say we only want to keep
the most promising score accumulators.
What do you think would be
a good order to go through?
Would you process
a common term first, or
would you process a rare term first?
The answer is
we should process the rare term first.
A rare term will match fewer documents, and
the score contribution will be higher
because the IDF value will be higher,
and
so it allows us to touch
the most promising documents first.
So it helps in pruning some non-promising
ones if we don't need so
many documents to be returned to the user.
And so those are heuristics for
further improving the accuracy.
Here we can also see how we can
incorporate IDF weighting.
All right.
So it can be easily
incorporated when we process each
query term.
When we fetch the inverted index, we
can fetch the document frequency,
and then we can compute the IDF.
Or perhaps the IDF value has already been
pre-computed when we indexed the documents.
At that time we already computed the IDF
value, so we can just fetch it.
So all these can be done at this time.
So that would mean when we process
all the entries for information,
these weights would be adjusted by the
same IDF, which is the IDF for information.
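One way to fold in both ideas, rare terms first and one IDF value per query term, is sketched below in Python. Several things here are assumptions for illustration only: the function name score_with_idf, the collection size of 10 documents, and the particular IDF formula log((N + 1) / df), which is one common variant and may differ from the one on the slides.

```python
import math

def score_with_idf(query_terms, inverted_index, num_docs):
    """Process rare terms first; every entry of a term shares that term's IDF."""
    # Shorter inverted list = lower document frequency = rarer term.
    ordered = sorted((t for t in query_terms if t in inverted_index),
                     key=lambda t: len(inverted_index[t]))
    accumulators = {}
    for term in ordered:
        postings = inverted_index[term]              # assumed to be non-empty
        idf = math.log((num_docs + 1) / len(postings))   # assumed IDF variant
        for doc_id, tf in postings:
            accumulators[doc_id] = accumulators.get(doc_id, 0.0) + tf * idf
    return accumulators

# Postings from the running example; num_docs=10 is a made-up collection size.
index = {
    "information": [("d1", 3), ("d2", 4), ("d3", 1), ("d4", 5)],
    "security": [("d2", 3), ("d4", 1), ("d5", 3)],
}
scores = score_with_idf(["information", "security"], index, num_docs=10)
print(sorted(scores.items(), key=lambda item: -item[1]))
```

This sketch keeps every accumulator; the ordering only starts to matter once low-scoring accumulators are pruned along the way.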
So this is the basic idea of using
the inverted index for faster search, and it
works well for all kinds of formulas that
are of the general form, and
the general form actually covers most
state-of-the-art retrieval functions.
So there are some tricks to further
improve the efficiency. Some general
techniques include caching.
This is just to store some
results of popular queries, so
that next time when you see the same query
you simply return the stored results.
Similarly, you can also store the list
of inverted index entries in memory for
a popular term.
And if the query term is
popular, you can assume
you will fetch the inverted index for
the same term again.
So keeping that in memory would help.
And these are general techniques for
improving efficiency.
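Result caching for popular queries can be sketched in Python with the standard library's functools.lru_cache. Everything else here is hypothetical: the toy INDEX, the function name cached_search, the cache size of 1024, and the simple raw-tf scoring used to have something to cache.

```python
from functools import lru_cache

# Hypothetical toy index: term -> list of (document ID, term frequency).
INDEX = {
    "information": [("d1", 3), ("d2", 4), ("d3", 1), ("d4", 5)],
    "security": [("d2", 3), ("d4", 1), ("d5", 3)],
}

@lru_cache(maxsize=1024)      # keep results for up to 1024 distinct query strings
def cached_search(query):
    """Score a query string by summed raw tf; repeated queries hit the cache."""
    scores = {}
    for term in query.split():
        for doc_id, tf in INDEX.get(term, []):
            scores[doc_id] = scores.get(doc_id, 0) + tf
    # Return a tuple so that callers cannot modify the cached value.
    return tuple(sorted(scores.items(), key=lambda item: (-item[1], item[0])))

cached_search("information security")    # computed from the index
cached_search("information security")    # returned from the cache
print(cached_search.cache_info())
```

A real system would also have to invalidate cached results when the index changes, which this sketch ignores.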
We can also only keep the most promising
accumulators because a user generally
doesn't want to examine so many documents.
We only want to return a high quality
subset of documents that are likely ranked
on the top, and for that purpose
we can then prune the accumulators.
We don't have to store
all the accumulators.
At some point we just keep
the highest value accumulators.
Another technique is to do parallel
processing, and that's needed for
processing a really large data set,
like the web data set.
And to scale up to the Web scale,
we need
to have special techniques
to do parallel processing and
to distribute the storage of
files on multiple machines.
So here is a list of
some text retrieval toolkits.
It's not a complete list.
You can find more information
at this URL on the bottom.
Here I listed four.
Lucene is one of the most popular toolkits
that can support a lot of applications.
And it has very nice support for
applications.
You can use it to build
a search engine very quickly.
The downside is that it's not
that easy to extend it, and
the algorithms implemented there
are not the most advanced algorithms.
Lemur or Indri is another toolkit
that does not have such nice
support for applications as Lucene.
But it has many advanced
search algorithms.
And it's also easy to extend.
Terrier is yet another toolkit
that also has good support for
application capability and
some advanced algorithms.
So that's maybe in between Lemur
and Lucene, or
maybe rather combining the strengths of
both, so that's also a useful toolkit.
MeTA is the toolkit that we'll use for
the programming assignment,
and this is a new toolkit
that has a combination
of both text retrieval algorithms and
text mining algorithms.
And so, topic models are implemented, and
there are a number of text analysis
algorithms implemented in the toolkit,
as well as basic search algorithms.
So, to summarize all the discussion
about the system implementation,
here are the major takeaway points.
The inverted index is the primary data
structure for supporting a search engine.
That's the key to enabling faster
response to a user's query.
And the basic idea is to
pre-process the data as much as we can,
and we want to do compression
when appropriate,
so that we can save disk space and
can speed up IO and
processing of the inverted
index in general.
We talked about how to construct
the inverted index when the data
can't fit into memory.
And then we talked about faster search using
the inverted index, basically to exploit
the inverted index to accumulate scores
for documents matching a query term.
And we exploit Zipf's law
to avoid touching many documents
that don't match any query term.
And this algorithm can support
a wide range of ranking algorithms.
So these basic techniques
have great potential for further scaling
up using distributed file systems,
parallel processing, and caching.
Here are two additional readings that
you can take a look at if you have time
and are interested in
learning more about this.
The first one is a classic textbook on
the efficiency of the inverted index and
the compression techniques,
and how to, in general,
build an efficient search engine in
terms of the space overhead and speed.
The second one is a newer textbook that
has a nice discussion of implementing and
evaluating search engines.
[MUSIC]

[SOUND] This lecture is about
evaluation of text retrieval systems.
In the previous lectures, we have talked
about a number of text retrieval methods.
Different kinds of ranking functions.
But how do we know which
one works the best?
In order to answer this question,
we have to compare them,
and that means we'll have to
evaluate these retrieval methods.
So this is the main topic of this lecture.
First, let's think about why
do we have to do evaluation?
I already gave one reason.
And that is,
we have to use evaluation to figure out
which retrieval method works better.
Now this is very important for
advancing our knowledge.
Otherwise we wouldn't know whether
a new idea works better than old idea.
In the beginning of this
course we talked about
the problem of text retrieval, and we
compared it with database retrieval.
There, we mentioned that text retrieval
is an empirically defined problem.
So, evaluation must rely on users;
which system works better
would have to be judged by our users.
So this becomes a very challenging problem,
because how can we get users involved
in the evaluation, and
how can we draw a fair
comparison of different methods?
So let's go back to the reasons for
evaluation.
I listed two reasons here.
The second reason is basically what I just
said, but there is also another reason,
which is to assess the actual
utility of a text retrieval system.
Now imagine you're building
your own applications.
You would be interested in knowing how well
your search engine works for your users.
So in this case, measures must
reflect the utility to the actual
users in the real application.
And typically, this has been
done by using user studies and
using the real search engine.
In the second case, or for
the second reason, the measures
actually only need to be correlated
with the utility to actual users.
Thus they don't have to accurately
reflect the exact utility to users.
So the measure only needs to be good
enough to tell which method works better.
And this is usually done
through a test collection.
And this is the main idea that we'll
be talking about in this course.
This has been very important for
comparing different algorithms and
for improving search
engine systems in general.
So next we will talk
about what to measure.
There are many aspects of a search engine
we can measure, we can evaluate and
here I list the three major aspects.
One is effectiveness or accuracy,
how accurate are the search results?
In this case we're measuring a system's
capability of ranking relevant documents
on top of non relevant ones.
The second is efficiency.
How quickly can a user get the results?
How much computing resources
are needed to answer a query?
So in this case we need to measure
the space and time overhead of the system.
The third aspect is usability.
Basically the question is how useful
is the system for real user tasks?
Here, obviously, interfaces and
many other things are also important and
we typically would have
to do user studies.
Now, in this course, we're going to talk
mostly about the effectiveness and
accuracy measures, because
the efficiency and
usability dimensions are not really
unique to search engines, and so
they are needed for
evaluating any other software systems.
And there is also good coverage of
such materials in other courses.
But how to evaluate a search engine's
accuracy is
something unique to text retrieval, and
we're going to talk a lot about this.
The main idea that people have proposed
for using a test set to evaluate
a text retrieval algorithm is called
the Cranfield Evaluation Methodology.
This one actually was developed a long
time ago, in the 1960s.
It's a methodology for laboratory test
of system components, it's actually
a methodology that has been very useful,
not just for search engine evaluation.
But also for evaluating virtually
all kinds of empirical tasks.
And, for example in processing or
in other fields where the problem
is empirically defined we typically would
need to use such a methodology.
And today, with the big data challenge and
the use of machine learning everywhere,
this methodology has been very
popular, but it was first developed for
search engine applications in the 1960s.
So the basic idea of this approach is
to build reusable test collections
and define measures.
Once such a test collection is
built, it can be used again and
again to test different algorithms.
And we're going to define measures
that would allow you to quantify
the performance of a system or
an algorithm.
So how exactly would this work?
Well, we're going to
have a sample collection of documents, and
this is just similar to a real document
collection in your search application.
We can also have a sample
set of queries or topics.
This is to simulate the user's queries.
Then we'll have to have
relevance judgments.
These are judgments of which documents
should be returned for which queries.
Ideally, they have to be made by
the users who formulated the queries,
because those are the people who know
exactly which documents would be useful.
And then finally we have to have measures
to quantify how well a system's result
matches the ideal ranked list
that would be constructed
based on users' relevance judgments.
So this methodology is very useful for
studying retrieval
algorithms, because the test collection
can be reused many times.
And it will also provide a fair
comparison for all the methods.
We have the same criteria and
the same data set to use
to compare different algorithms.
This allows us to compare a new
algorithm with an old algorithm
that was invented many years ago,
by using the same standard.
So this is an illustration
of how this works.
As I said,
we need queries, which are shown here.
We have Q1, Q2, et cetera.
We also need documents, and
that's called the document collection.
And on the right side,
you see we need relevance judgments.
These are basically the binary judgments
of documents with respect to a query.
So, for example D1 is judged
as being relevant to Q1,
D2 is judged as being relevant as well.
And D3 is judged as non-relevant
to Q1, et cetera.
These would be created by users.
Once we have these,
we basically have a test collection, and
then, if you have two systems
that you want to compare,
you can just run each
system on these queries and
documents, and
each system will then return results.
Let's say the query is Q1;
then we would have the results here.
Here I show R sub A as the
results from system A.
Remember, we talked about the
task of computing an approximation of the
relevant document set.
So R sub A is
system A's approximation here, and
R sub B is system B's approximation
of the relevant documents.
Now let's take a look at these results.
So which is better?
Now imagine you are
a user: which one would you like?
All right, let's take
a look at both results.
There are some differences, and
there are some documents that
are returned by both systems.
But if you look at the results,
you will feel that, well,
maybe A is better in the sense that
we don't have many non-relevant documents.
And among the three documents returned,
two of them are relevant, so
that's good, it's precise.
On the other hand, one can also
say maybe B is better because
we've got more relevant documents:
we've got three instead of two.
So which one is better and
how do we quantify this?
Well, obviously, this question
highly depends on a user's task.
And it depends on users as well.
You might be able to imagine that for
some users maybe system A is better,
if the user is not interested in
getting all the relevant documents.
Right, in this case
the user doesn't have to read much
and would see mostly relevant documents.
On the other hand,
imagine a user who might need to have
as many relevant documents as possible,
for example, when doing a literature survey.
You might be in the second category, and
then you might find
that system B is better.
So in either case, we'll have to also
define measures that would quantify them.
And we might need to define
multiple measures because
users have different perspectives
of looking at results.
[MUSIC]

[SOUND] This lecture is about
the basic measures for
evaluation of text retrieval systems.
In this lecture,
we're going to discuss how we design basic
measures to quantitatively
compare two retrieval systems.
This is a slide that you have
seen earlier in the lecture
where we talked about the Cranfield
evaluation methodology.
We can have a test collection that
consists of queries, documents, and
relevance judgments.
We can then run two systems on these
data sets to
quantitatively evaluate their performance.
And we raised the question of
which set of results is better:
is System A better, or is System B better?
So let's now talk about how to
actually quantify their performance.
Suppose we have a total
of 10 relevant documents in
the collection for this query.
Now, the relevance judgments
shown on the right
did not include all ten, obviously.
We have only seen three
relevant documents there, but
we can imagine there are other relevant
documents judged for this query.
So now, intuitively we thought that
System A is better because
it did not have much noise.
In particular, we have seen that
among the three results,
two of them are relevant, but
in System B we
have five results and
only three of them are relevant.
So intuitively,
it looks like System A is more accurate.
And this can be captured by
a measure called precision,
where we simply compute to what extent
all the retrieved results are relevant.
If you have 100% precision, that would mean
all the retrieved documents are relevant.
So in this case, System A has
a precision of two out of three,
and System B has three out of five.
And this shows that System A is
better by precision.
But we also talked about how
System B might be preferred by
some other users who would like to retrieve
as many relevant documents as possible.
So in that case we have to compare
the number of relevant
documents that are retrieved.
And there is another
measure called recall.
This measures the completeness of
coverage of relevant documents
in your retrieval result.
So, we just assume that there are ten
relevant documents in the collection.
And here we've got two of them in
System A, so the recall is two out of ten,
whereas System B has got three,
so it's three out of ten.
Now, we can see that by recall
System B is better. These two
measures turn out to be the very basic
measures for evaluating search engines.
And they are very important because
they are also widely used in many other
task evaluation problems.
For example, if you look at the
applications of machine learning you tend
to see precision recall numbers being
reported for all kinds of tasks.
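As a minimal sketch of the two counts just described, the Python below computes set-based precision and recall for System A and System B. The document IDs, and the placeholder names R4 to R10 for the relevant documents that are not shown, are assumptions for illustration; only the counts (two of three, three of five, ten relevant in total) come from the lecture.

```python
def precision(retrieved, relevant):
    """Fraction of retrieved documents that are relevant."""
    if not retrieved:
        return 0.0
    return len(retrieved & relevant) / len(retrieved)


def recall(retrieved, relevant):
    """Fraction of relevant documents that were retrieved."""
    if not relevant:
        return 0.0
    return len(retrieved & relevant) / len(relevant)


# Hypothetical IDs: ten relevant documents for the query in total.
relevant = {"D1", "D2", "D5"} | {f"R{i}" for i in range(4, 11)}

results_a = {"D1", "D2", "D4"}              # 3 results, 2 relevant
results_b = {"D1", "D2", "D3", "D4", "D5"}  # 5 results, 3 relevant

for name, results in [("A", results_a), ("B", results_b)]:
    print(name, precision(results, relevant), recall(results, relevant))
```

System A comes out ahead on precision and System B on recall, which is exactly the disagreement the lecture goes on to discuss.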
Okay, so now let's define these
two measures more precisely.
These measures are used to evaluate
a set of retrieved documents,
so that means we are considering
the approximation
of the set of relevant documents.
We can distinguish four cases,
depending on the situation of a document.
A document can be retrieved or
not retrieved, right?
Because we're talking
about a set of results.
A document can also be relevant or
not relevant, depending on whether
the user thinks this is a useful document.
So we can now have counts of documents
in each of the four categories.
We can have a to represent the number
of documents that are retrieved and
relevant, b for documents that
are not retrieved but relevant, etc.
Now, with this table,
we can define precision
as the ratio of the relevant
retrieved documents, a, to the total
number of retrieved documents.
So this is just
a divided by the sum of a and c,
the sum of this column.
Similarly, recall is defined by
dividing a by the sum of a and b.
So that's, again, dividing a by the sum
of the row instead of the column.
All right, so we can see that
precision and recall both focus on
a, that's the number
of retrieved relevant documents, but
they use different denominators.
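Written out, the table and the two ratios look as follows. The letters a and b are as defined in the lecture; c (retrieved but not relevant) and d (neither retrieved nor relevant) are inferred here from the column sum used for precision, since the slide itself is not reproduced in the transcript.

```latex
\begin{array}{l|cc}
 & \text{Retrieved} & \text{Not retrieved} \\ \hline
\text{Relevant} & a & b \\
\text{Not relevant} & c & d
\end{array}
\qquad
\text{Precision} = \frac{a}{a + c},
\qquad
\text{Recall} = \frac{a}{a + b}
```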
Okay, so what would be an ideal result?
Well, you can easily see that in the ideal
case we have precision and recall both
equal to 1.0. That means we have got 100% of
all the relevant documents in our results,
and all the results that
we return are relevant.
[INAUDIBLE] There's not a single
non-relevant document returned.
In reality, however, high recall tends
to be associated with low precision.
And you can imagine why that is the case.
As you go down the list to try to get
as many relevant documents as possible,
you tend to encounter a lot of non-relevant
documents, so the precision goes down.
Note that this set can also be defined
by a cutoff in a ranked list.
That's why, although these two measures
are defined for a set of retrieved
documents, they are actually very
useful for evaluating a ranked list.
They are the fundamental measures in
text retrieval and many other tasks.
We are often interested in
the precision at ten documents for
web search.
This means we look at
how many documents among the top ten
results are actually relevant.
Now, this is a very meaningful measure,
because it tells us how many relevant
documents a user can expect to see
on the first page of search results,
where they typically show ten results.
So precision and recall are
the basic measures, and
we need to use them to further
evaluate a search engine;
they are the building blocks, really.
We just said that there tends to be
a trade-off between precision and recall.
So naturally it would be interesting
to combine them, and
here's one measure that's often used,
called the F measure.
It's a harmonic mean of precision and
recall, and it's defined on this slide.
So you can see it first computes
the inverses of R and P here, and
then it
interpolates the two by using
coefficients
that depend on the parameter beta.
After some transformation we can
easily see it would be of this form.
And in many cases it's just
a combination of precision and recall.
Beta is a parameter
that's often set to one.
It can control the emphasis
on precision or recall.
When we set
beta to one, we end up having a special
case of the F measure, often called F1.
This is a popular measure that is often
used as a combined precision and recall.
And the formula looks very
simple; it's just this, here.
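The slide's formula is not reproduced in the transcript. The standard definition that matches this description, a weighted harmonic mean of P and R (assumed here to be what the slide shows), is:

```latex
F_\beta
  = \frac{1}{\dfrac{\beta^2}{\beta^2 + 1}\cdot\dfrac{1}{R}
           + \dfrac{1}{\beta^2 + 1}\cdot\dfrac{1}{P}}
  = \frac{(\beta^2 + 1)\,P\,R}{\beta^2 P + R},
\qquad
F_1 = \frac{2\,P\,R}{P + R}
```

In this form, a beta greater than one puts more weight on recall and a beta less than one puts more weight on precision.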
Now it's easy to see that if you have
a larger precision or
larger recall, then the F
measure would be high.
But what's interesting is that,
the trade off between precision and
recall, is captured in
an interesting way in F1.
So, in order to understand that, we
can first look at a natural question:
why not just
combine them using a simple
arithmetic mean, as [INAUDIBLE] here?
That would likely be the most
natural way of combining them.
So, what do you think?
If you want to think more,
you can pause the video.
So why is this not as good as F1?
Or what's the problem with this?
Now, if you think about
the arithmetic mean,
you can see that this is a sum
of multiple terms.
In this case,
this is the sum of precision and recall.
In the case of a sum, the total value
tends to be dominated by the large values.
That means if you have a very high P or
a very high R,
then it doesn't really matter
whether the other value is low;
the whole sum would be high.
Now, this is not desirable because
one can easily have a perfect recall.
We can have a perfect recall easily.
Can you imagine how?
It's probably very easy to imagine that
we simply retrieve all
the documents in the collection;
then we have a perfect recall, and
this will give us at least 0.5 as the average.
But such results are clearly
not very useful for users,
even though the average using
this formula would be relatively high.
Now, in contrast, you can see F1 will
reward a case where precision and
recall are roughly similar.
So it would penalize a case
where you have an extremely high
value for only one of them.
So this means F1 encodes
a different trade-off between them.
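To make the contrast concrete, here is a small sketch of the "retrieve everything" case. The collection size of 1000 and the count of 10 relevant documents are assumed numbers chosen only for illustration.

```python
def f1(p, r):
    """Harmonic mean of precision and recall (F measure with beta = 1)."""
    if p + r == 0:
        return 0.0
    return 2 * p * r / (p + r)


def arithmetic_mean(p, r):
    """Simple average of precision and recall."""
    return (p + r) / 2


# Hypothetical collection: 1000 documents, 10 of them relevant.
collection_size = 1000
num_relevant = 10

# Retrieve every document: recall is perfect, precision is tiny.
p = num_relevant / collection_size
r = 1.0

print("arithmetic mean:", arithmetic_mean(p, r))  # a little above 0.5
print("F1:", f1(p, r))                            # close to 0.02
```

The arithmetic mean stays above 0.5 however useless the result set is, while F1 collapses toward the smaller of the two values.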
Now this example shows actually
a very important methodology here.
When we try to solve a problem,
you might naturally think of one solution,
let's say, in this case,
this arithmetic mean.
But it's important not
to settle on this solution right away.
It's important to think about whether you
have other ways to combine them.
And once you think about
the multiple variants,
it's important to analyze
their differences and
then think about which
one makes more sense.
In this case,
if you think more carefully, you will feel
that F1 probably makes more sense
than the simple arithmetic mean,
although in other cases
there may be different results.
But in this case, the arithmetic mean
does not seem reasonable.
But if you don't pay attention
to these subtle differences,
you might just take an easy way to
combine them and then go ahead with it.
And later you'll find that, hm,
the measure doesn't seem to work well.
Right, so this methodology
is actually very important in
general in solving problems:
try to think about the best solution.
Try to understand the problem
very well, and then know why
you need this measure, and why you
need to combine precision and recall.
And then use that to guide you in
finding a good way to solve the problem.
To summarize, we talked about precision,
which addresses the question,
are the retrieved results all relevant?
We also talked about recall,
which addresses the question,
have all the relevant
documents been retrieved?
These two are the two basic measures
in text retrieval evaluation.
They are used for
many other tasks as well.
We talked about the F measure as a way
to combine precision and recall.
We also talked about the trade-off
between precision and recall.
And this turns out to depend
on the user's search task, and
we'll discuss this point
more in a later lecture.
[MUSIC]

[MUSIC]
This lecture is about
how we can evaluate a ranked list.
In this lecture, we will continue
the discussion of evaluation.
In particular,
we are going to look at how we can
evaluate a ranked list of results.
In the previous lecture,
we talked about precision-recall.
These are the two basic measures for
quantitatively measuring
the performance of a search result.
But, as we talked about ranking before,
we framed the text retrieval
problem as a ranking problem.
So we also need to evaluate
the quality of a ranked list.
How can we use precision-recall
to evaluate a ranked list?
Well, naturally, we have to look at the
precision-recall at different cut-offs.
Because in the end, the approximation
of the relevant document set
given by a ranked list is determined
by where the user stops browsing.
Right?
If we assume the user sequentially browses
the list of results, the user would
stop at some point, and
that point would determine the set.
And then
that's the most important cut-off
that we have to consider
when we compute the precision-recall.
Without knowing where
exactly the user would stop,
we have to consider all
the positions where the user could stop.
So, let's look at these positions.
Look at this slide, and
then let's look at
what happens if the user stops at
the first document.
What's the precision-recall at this point?
What do you think?
Well, it's easy to see that this document
is relevant. So, the precision is one out of one.
We have got one document,
and that's relevant.
What about the recall?
Well, note that we're assuming that
there are ten relevant documents for
this query in the collection,
so it's one out of ten.
What if the user stops
at the second position?
Top two.
Well, the precision is the same,
100%, two out of two.
And the recall is two out of ten.
What if the user stops
at the third position?
Well, this is interesting,
because in this case we have not got any
additional relevant document,
so the recall does not change.
But the precision is lower,
because we've got a non-relevant document.
So what exactly is the precision?
Well, it's two out of three, right?
And recall is the same, two out of ten.
So, when would we see another point
where the recall would be different?
Now, if you look down the list,
well, it won't happen until
we have seen another relevant document.
In this case D5. At that point,
the recall is increased to
three out of ten, and
the precision is three out of five.
So, you can see, if we keep doing this,
we can also get to D8.
And then we will have
a precision of four out of eight,
because there are eight documents
and four of them are relevant.
And the recall is four out of ten.
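The walk down the list above can be sketched in a few lines of Python. The relevance flags for D1 to D8 are reconstructed from the numbers quoted in the lecture (D4, D6, and D7 are taken to be non-relevant because neither recall nor the hit count changes there); the function name is an assumption.

```python
# 1 = relevant, 0 = non-relevant, for D1..D8 in ranked order.
ranked = [1, 1, 0, 0, 1, 0, 0, 1]
total_relevant = 10  # assumed number of relevant documents in the collection


def precision_recall_at(ranked, k, total_relevant):
    """Precision and recall if the user stops after the top k results."""
    hits = sum(ranked[:k])
    return hits / k, hits / total_relevant


for k in range(1, len(ranked) + 1):
    p, r = precision_recall_at(ranked, k, total_relevant)
    print(f"top {k}: precision = {p:.2f}, recall = {r:.1f}")
```

Each (recall, precision) pair printed here is one point on the precision-recall curve for this ranked list.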
Now, when can we get
a recall of five out of ten?
Well, in this list we don't have it,
so we have to go further down the list.
We don't know where it is.
But, as a convenience, we often assume that
the precision is zero
at all the other levels of recall
that are beyond the search results.
So, of course,
this is a pessimistic assumption;
the actual precision would be higher,
but we make this assumption
in order to have an easy way to
compute another measure called Average
Precision, that we will discuss later.
Now, I should also say that here you see
we make these assumptions that
are clearly not accurate.
But this is okay for the purpose
of comparing two text retrieval methods.
This is for relative comparison,
so it's okay if the actual measure,
or actual number, deviates
a little bit from the true number.
As long as the deviation
is not biased toward any particular
retrieval method, we are okay.
We can still
accurately tell which method works better.
And this is an important point
to keep in mind.
When you compare different algorithms,
the key is to avoid any
bias toward any particular method.
And as long as you can avoid that,
it's okay for you to do a transformation
of these measures, so long as
you preserve the order.
Okay, so, we just talked about how
we can get a lot of precision-recall
numbers at different positions.
So, now, you can imagine
we can plot a curve.
On the
x-axis, we show the recall,
and on the y-axis, we show the precision.
So, the x-axis is marked as 0.1,
0.2, 0.3, up to 1.0.
Right?
So
these are the different levels of recall.
And the y-axis also has
different marks; those are for precision.
So, we plot these precision-recall
numbers that we have got
as points on this picture.
Now, we can go further and
link these points to form a curve.
As you'll see,
we assumed all the other precisions
at the higher levels of recall to be zero.
And that's why they are down here;
they are all zero.
The actual curve probably will
be something like this, but, as we just
discussed, it doesn't matter that
much for comparing two methods,
because this would be
underestimated for all the methods.
Okay, so, now that we
have this precision-recall curve,
how can we compare two ranked lists?
All right, so, that means
we have to compare two PR curves.
And here we show two cases,
where system A is shown in red and
system B is shown in blue, with crosses.
All right, so, which one is better?
I hope you can see
that system A is clearly better.
Why?
Because for the same level of recall,
see the same level of recall here,
you can see that the precision
point of system A is better than
that of system B.
So, there's no question.
And here, you can imagine: what does the
curve look like for an ideal search system?
Well, it has to have perfect
precision at all the recall points, so
it has to be this line.
That would be the ideal system.
In general, the higher the curve is,
the better, right?
The problem is that
we might see a case like this.
This actually happens often:
the two curves cross each other.
Now, in this case, which one is better?
What do you think?
Now, this is a real problem
that you actually might face.
Suppose you build a search engine,
and you have an old algorithm
that's shown here in blue, or system B.
And you have come up with a new idea,
and you test it.
And the results are shown in red,
curve A.
Now, your question is, is your new
method better than the old method?
Or, more practically,
do you have to replace the algorithm that
you're already using in your search
engine with another, new algorithm?
So, should we use
method A to replace method B?
This is going to be a real decision
that you have to make.
If you make the replacement, the search
engine would behave like system A here,
whereas if you don't do that,
it will be like system B.
So, what do you do?
Now, if you want to spend more time
to think about this, pause the video.
And it's actually very
useful to think about that.
As I said, it's a real decision that you
have to make if you are building your own
search engine, or if you're working for
a company that cares about search.
Now, if you have thought about this for
a moment, you might realize that,
well, in this case, it's hard to say.
Some users might like system A;
some users might like system B.
So, what's the difference here?
Well, the difference is just that
in the low level of recall,
in this region, system B is better.
There's a higher precision.
But in the high recall region,
system A is better.
So that also means
it depends on whether the user
cares about high recall, or
low recall but high precision.
You can imagine someone who is just going
to check out what's happening today and
wants to find out something
relevant in the news.
Well, which one is better?
What do you think?
In this case, clearly, system B is better,
because the user is unlikely
to examine a lot of results.
The user doesn't care about high recall.
On the other hand,
if you think about a case
where you are
studying a problem,
you want to find out whether your idea
has been studied before.
In that case, you emphasize high recall.
So you want to see
as many relevant documents as possible.
Therefore, you might favor system A.
So, that means: which one is better?
That actually depends on users,
and more precisely, the user's task.
So this means you may not necessarily
be able to come up with one number
that would accurately
depict the performance.
You have to look at the overall picture.
Yet, as I said, when you have
a practical decision to make,
whether you replace one algorithm with another,
then you may have to actually come up with
a single number to quantify each method.
Or, when we compare many different
methods in research, ideally we have
one number to compare them with, so that
we can easily make a lot of comparisons.
So, for all these reasons, it is desirable
to have one single number to measure them.
So, how do we do that?
We need
a number to summarize the ranking.
So, here again is
the precision-recall curve, right?
And one way to summarize
this whole ranked list, or
this whole curve,
is to look at the area underneath the curve.
Right?
So, this is one way to measure that.
There are other ways to measure that,
but it just turns out that
this particular way of measuring
it has been very popular, and
has been used for a long time for
text retrieval. It is
basically computed in this way, and
it's called the average precision.
Basically, we're going to take a look
at every different recall point,
and then look at the precision.
So, we know
this is one precision.
And this is another,
with a different recall.
Now, we don't count this one,
because the recall level is the same,
and we're going to look at
this number, and that's the precision at
a different recall level, et cetera.
So, we have all these added up.
These are the precisions
at the different points
corresponding to retrieving the first
relevant document, the second, and
then the third, and so on.
Now, we missed many relevant
documents, so in all of those cases
we just assume
that they have zero precision.
And then, finally, we take the average.
So, we divide it by ten,
which is the total number of relevant
documents in the collection.
Note that here
we're not dividing this sum by four,
which is the number of retrieved
relevant documents.
Now, imagine: if I divide by four,
what would happen?
Now, think about this for a moment.
It's a common mistake that people
sometimes make.
Right, so, if we divide this by four,
it's actually not very good.
In fact, you would be favoring a system
that retrieves very few relevant
documents, as in that case
the denominator would be very small.
So this would not be a good measure.
So, note that this denominator
is ten,
the total number of relevant documents.
And this will basically compute
the area underneath the curve.
And this is the standard method
used for evaluating a ranked list.
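Here is a minimal sketch of that computation for the example ranking, where the relevant documents sit at ranks 1, 2, 5, and 8 and ten relevant documents are assumed to exist in the collection. The function name and the list encoding are assumptions made for illustration.

```python
def average_precision(ranked, total_relevant):
    """Sum the precision at each rank that holds a relevant document,
    then divide by the total number of relevant documents.
    Relevant documents that were never retrieved contribute zero."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / total_relevant


# 1 = relevant, 0 = non-relevant, for D1..D10 in ranked order.
ranked = [1, 1, 0, 0, 1, 0, 0, 1, 0, 0]
ap = average_precision(ranked, total_relevant=10)
print(round(ap, 2))  # (1/1 + 2/2 + 3/5 + 4/8) / 10 = 0.31
```

Dividing the same sum by four (the number of relevant documents retrieved) would give 0.775, which rewards the system for having missed six relevant documents.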
Note that it actually combines
recall and precision.
First, we have
precision numbers here, but secondly,
we also consider recall, because if we missed
many, there would be many zeros here.
All right, so
it combines precision and recall.
And furthermore, you can see this
measure is sensitive to a small change
in the position of a relevant document.
Let's say I move this relevant
document up a little bit; now
it would increase
this average precision.
Whereas if I move any relevant document
down, let's say I move this relevant
document down, then it would decrease
the average precision.
So, this is very good,
because it's very sensitive to
the ranking of every relevant document.
It can tell small differences
between two ranked lists.
And that is what we want:
sometimes one algorithm only works
slightly better than another,
and we want to see this difference.
In contrast, if we look at
the precision at ten documents,
if we look at this whole set, well,
what's the precision?
What do you think?
Well, it's easy to see
that it's four out of ten, right?
So that precision is very meaningful,
because it tells us what a user would see.
So, that's pretty useful, right?
So it's a meaningful measure
from a user's perspective.
But if we use this measure to
compare systems, it wouldn't be good,
because it wouldn't be sensitive to where
these four relevant documents are ranked.
If I move them around, the precision
at ten is still the same.
Right.
So
this is not a good measure for
comparing different algorithms.
In contrast, the average precision
is a much better measure.
It can tell the difference between
different ranked lists in
subtle ways.
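The difference in sensitivity can be checked with two hypothetical rankings that both place four relevant documents in the top ten, one near the top and one at the bottom. Both lists and the helper names are assumptions for illustration; ten relevant documents are assumed to exist in the collection.

```python
def precision_at_k(ranked, k):
    """Fraction of the top k results that are relevant."""
    return sum(ranked[:k]) / k


def average_precision(ranked, total_relevant):
    """Precision summed at each relevant rank, divided by all relevant docs."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / total_relevant


early = [1, 1, 0, 0, 1, 0, 0, 1, 0, 0]  # relevant documents ranked high
late = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]   # same four, pushed to the bottom

for name, ranked in [("early", early), ("late", late)]:
    print(name, precision_at_k(ranked, 10), average_precision(ranked, 10))
```

Precision at ten is 0.4 for both lists, while average precision is clearly higher for the list that ranks the relevant documents earlier.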
[MUSIC]

[SOUND]
So average precision is computed for
just
one query.
But we generally experiment with many
different queries, and this is to
avoid the variance across queries.
Depending on the queries you use, you
might draw different conclusions.
Right, so
it's better to use more queries.
If you use more queries, then
you will also have to
take the average of the average
precision over all these queries.
So how can we do that?
Well, you can naturally
think of just doing an arithmetic mean, as we
always tend to think in this way.
So, this would give us what's called
"Mean Average Precision", or MAP.
In this case,
we take the arithmetic mean of all the average
precisions over several queries or topics.
But as I just mentioned in
another lecture, is this good?
Recall that
we talked about the different ways
of combining precision and recall,
and we concluded that the arithmetic
mean is not as good as the F measure.
Here it's the same:
we can also think about alternative
ways of aggregating the numbers.
Don't just automatically assume
that we should
take the arithmetic
mean of the average precision over
these queries.
Let's think about what's
the best way of aggregating them.
If you think about the different ways,
naturally you will
probably be able to think about
another way, which is the geometric mean.
And we call this kind of average gMAP.
This is another way.
So now, once you think about
the two different ways
of doing the same thing,
the natural question to ask is,
which one is better?
So,
do you use MAP or gMAP?
Again, that's an important question.
Imagine you are again
testing a new algorithm
by comparing it with your old
algorithm in the search engine.
Now you have tested multiple topics,
and you've got the average precision for
these topics.
Now you are thinking of looking
at the overall performance.
You have to take the average.
But which strategy would you use?
Now first, you should also think about the
question: would it make a difference?
Can you think of scenarios where using
one of them would make a difference?
That is, they would give different
rankings of those methods.
And that also means that depending on
the way you take the
average of these average precisions,
you will get different conclusions.
This makes the question
even more important.
Right?
So, which one would you use?
Well again, if you look at
the difference between these
different ways of aggregating
the average precision,
you'll realize that in the arithmetic mean,
the sum is dominated by large values.
So what does a large value here mean?
It means the query is relatively easy:
you can have a high
average precision.
Whereas gMAP tends to be
affected more by low values.
And those are the queries that
don't have good performance;
the average precision is low.
So if you think about
improving the search engine for
those difficult queries,
then gMAP would be preferred, right?
On the other hand, if you just want to
have improvement
over all kinds of queries, or
particularly popular queries that might be
easy and that you want to make perfect, then
maybe MAP would be preferred.
So again, the answer depends on
your users, your users' tasks, and
their preferences.
So the point here is to think
about the multiple ways to solve
the same problem, and then compare them,
and think carefully about the differences
and which one makes more sense.
Often, one of them might
make sense in one situation and
another might make more sense
in a different situation.
So it's important to figure out under
what situations one is preferred.
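A small sketch shows how the two averages can disagree. The per-query average precision values below are made up for illustration, and the `eps` floor is an assumption added so that a query with zero average precision does not make the logarithm undefined.

```python
import math


def mean_average_precision(aps):
    """Arithmetic mean of per-query average precision (MAP)."""
    return sum(aps) / len(aps)


def geometric_mean_average_precision(aps, eps=1e-6):
    """Geometric mean of per-query average precision (gMAP).
    eps is an assumed floor for queries whose average precision is zero."""
    log_sum = sum(math.log(max(ap, eps)) for ap in aps)
    return math.exp(log_sum / len(aps))


# Hypothetical per-query average precision for two systems.
system_a = [0.90, 0.80, 0.01]  # strong on easy queries, fails a hard one
system_b = [0.70, 0.60, 0.15]  # weaker on easy queries, better on the hard one

for name, aps in [("A", system_a), ("B", system_b)]:
    print(name,
          round(mean_average_precision(aps), 3),
          round(geometric_mean_average_precision(aps), 3))
```

With these numbers MAP ranks system A first and gMAP ranks system B first, because the geometric mean is pulled down much more by the one hard query on which A does poorly.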
As a special case of the mean average
precision, we can also think about
the case where there is precisely
one relevant document.
And this happens often, for example,
in what's called a known-item search,
where you know a target page. Let's
say you want to find the Amazon homepage.
You have one relevant document there,
and you hope to find it.
That's called a "known-item search".
In that case,
there's precisely one relevant document.
Or in another application,
like question answering,
maybe there's only one answer
there.
So if you rank the answers,
then your goal is to rank that one
particular answer on top, right?
So in this case, you can easily
verify that the average precision
will basically boil down
to the reciprocal rank.
That is, 1 over r where r is the rank
position of that single relevant document.
So if that document is ranked
on the very top, r is 1, and
then it's 1 for the reciprocal rank.
If it's ranked
second, then it's 1 over 2,
et cetera.
And then we can also take an average
of all these average precisions or
reciprocal ranks over a set of topics, and
that would give us something
called the mean reciprocal rank.
It's a very popular measure
for known-item search or, you know,
any problem where you have
just one relevant item.
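The claim that average precision reduces to the reciprocal rank can be checked with a small sketch. The function names and the three example rankings are assumptions for illustration, and `average_precision` here assumes every relevant document appears somewhere in the ranked list.

```python
def average_precision(relevance):
    """relevance: 0/1 judgments in ranked order.
    Assumes all relevant documents for the query appear in the list."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank   # precision at this relevant document
    return total / hits if hits else 0.0

def reciprocal_rank(relevance):
    """1 / r, where r is the rank of the first relevant document."""
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    return sum(reciprocal_rank(r) for r in runs) / len(runs)

# Hypothetical known-item topics: exactly one relevant document each.
runs = [
    [1, 0, 0, 0],   # found at rank 1
    [0, 1, 0, 0],   # found at rank 2
    [0, 0, 0, 1],   # found at rank 4
]
mrr = mean_reciprocal_rank(runs)   # (1 + 1/2 + 1/4) / 3
print(mrr)
```

For each of these single-relevant-document rankings, `average_precision` and `reciprocal_rank` return the same value, which is the point made above.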
Now again here, you can see this
r actually is meaningful here.
And this r is basically
indicating how much effort
a user would have to make in order
to find that relevant document.
If it's ranked on the top, it's little
effort that you have to make.
But if it's ranked at 100,
then you actually have to
read presumably 100 documents
in order to find it.
So, in this sense r is also a meaningful
measure and the reciprocal rank will
take the reciprocal of r,
instead of using r directly.
So a natural question here
is why not simply use r?
Imagine if you were to design
a measure of the performance
of a ranking system
when there is only one relevant item.
You might have thought about
using r directly as the measure.
After all,
that measures the user's effort, right?
But think about what happens if you take
an average of this over a large number of topics.
Again it would make a difference.
Right, for one single topic, using r or
using 1 over r wouldn't
make any difference.
It's the same:
a larger r corresponds
to a smaller 1 over r, right?
But the difference would only
show up when you have many topics.
So again, think about the average of
reciprocal rank versus the average of just r.
What's the difference?
Do you see any difference?
And would this difference
change the order of systems
in our conclusion?
And it turns out that
there is actually a big difference, and
if you want to
think about it yourself,
then pause the video.
Basically, the difference is,
if you take the sum of r directly, then
again it will be dominated
by large values of r.
So what are those values?
Those are basically large values that
indicate lower-ranked results.
That means the relevant items
are ranked very low down on the list.
And the sum, and thus also the average,
would then be dominated by
cases where those relevant documents
are ranked
in the lower portion of the ranked list.
But from a user's perspective we care
more about the highly ranked documents.
So by taking this transformation,
by using the reciprocal rank,
here we put more emphasis on
the difference at the top.
You know, think about
the difference between 1 and 2:
it would make a big difference in 1 over
r. But think about 100 and 1,000:
that won't make much
difference if you use 1 over r.
But if you use r directly, there will
be a big difference between 100 and,
let's say, 1,000, right?
So this is not desirable.
On the other hand, 1 and
2 won't make much difference.
So this is yet another case where there
may be multiple choices of doing the same
thing and then you need to figure
out which one makes more sense.
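Here is a small sketch of how averaging r and averaging 1 over r can order two systems differently. The ranks below are made-up values for five hypothetical known-item topics, not data from the lecture.

```python
def mean_rank(ranks):
    """Average of r: lower is better, dominated by large r values."""
    return sum(ranks) / len(ranks)

def mean_reciprocal_rank(ranks):
    """Average of 1/r: higher is better, dominated by differences near the top."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# Hypothetical rank of the single relevant item on each of five topics.
system_a = [1, 1, 1, 1, 1000]   # perfect on four topics, very bad on one
system_b = [2, 2, 2, 2, 100]    # never first, but never terrible

print("mean rank:", mean_rank(system_a), mean_rank(system_b))
print("MRR      :", mean_reciprocal_rank(system_a), mean_reciprocal_rank(system_b))
```

With these numbers, the mean of r prefers system B (21.6 versus 200.8) because one topic at rank 1000 dominates the sum, while the mean reciprocal rank prefers system A, which puts the answer on top for four of the five topics.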
So to summarize,
we showed that the precision-recall curve
can characterize the overall
accuracy of a ranked list.
And we emphasized that the actual
utility of a ranked list depends
on how many top ranked results
a user would actually examine.
Some users will examine more
than others.
Average precision is the standard measure
for comparing two ranking methods.
It combines precision and recall, and
it's sensitive to the rank
of every relevant document.
[MUSIC]

[SOUND] This lecture is about how to
evaluate a text retrieval system when
we have multiple levels of judgments.
In this lecture we will continue
the discussion of evaluation.
We're going to look at how to
evaluate a text retrieval system
when we have multiple levels of judgments.
So far we have talked
about binary judgments;
that means a document is judged
as being relevant or non-relevant.
But earlier we also talked about
relevance as a matter of degree.
So we often can distinguish
very highly relevant documents,
those are very useful documents, from, you
know, moderately relevant documents.
They are okay, they are useful perhaps.
And further from non-relevant documents.
Those are not useful.
Right?
So imagine you can have ratings for
these pages.
Then you would have multiple
levels of ratings.
For example, here I show an example
of three levels: three for relevant,
sorry, three for very relevant,
two for marginally relevant, and
one for non-relevant.
Now how do we evaluate a search engine
system using these judgments? Obviously MAP
doesn't work, average precision
doesn't work, and precision and
recall don't work, because
they rely on binary judgments.
So let's look at some top-ranked
results when using these judgments.
Right?
Imagine the user would mostly
care about the top ten results here.
Right.
And we mark the rating levels, or
relevance levels, for
these documents as shown here:
three, two, one, one, three, et cetera.
And we call these gain.
And the reason why we call it gain
is because the measure that
we are introducing is called NDCG,
normalized discounted cumulative gain.
So this gain basically measures
how much gain of relevant
information a user can obtain by
looking at each document, all right?
So looking at the first document,
the user can gain three points.
Looking at a non-relevant document,
the user would only gain one point.
Right.
Looking at a moderately relevant or
marginally relevant document, the user
would get two points, et cetera.
So this gain basically measures the utility
of a document from a user's perspective.
Of course, if we assume the user
stops at ten documents, and
we're looking at a cutoff at ten, we can
look at the total gain of the user.
And what's that?
Well, that's simply the sum of these, and
we call it the cumulative gain.
So if the user stops at
position one, that's just three.
If the user looks at another
document, that's 3 plus 2.
If the user looks at more documents,
then the cumulative gain is more.
Of course, this is at the cost of
spending more time to examine the list.
So cumulative gain gives
us some idea about
how much total gain the user would have
if the user examines all these documents.
Now, in NDCG, we also have another letter
here, D, for discounted cumulative gain.
So why do we want to do discounting?
Well, if you look at this cumulative gain,
there is one deficiency, which is that
it does not consider the rank
position of these documents.
So, for example, looking at
this sum here,
we only know there is
one highly relevant document,
one marginally relevant document,
and two non-relevant documents.
It doesn't really care
where they are ranked.
Ideally, we want these two
to be ranked on the top,
which is the case here.
But how can we capture that intuition?
Well, we have to say that this 3 here
is not as good as this 3 on the top.
And that means the contribution of
the gain from different positions
has to be weighted by their position.
And this is the idea of discounting,
basically.
So, we're going to say, well, the first
one doesn't need to be discounted,
because the user can be assumed
to always see this document. But
the second one,
this one will be discounted a little bit,
because there's a small possibility
that the user wouldn't notice it.
So, we divide this gain by a weight
based on the position.
So, log of two, where two is the rank
position of this document. And
when we go to the third position, we
discount even more, because the divisor
is log of three, and so on and so forth.
So when we take such a sum, a lowly
ranked document would not
contribute as much as
a highly ranked document.
So that means if you, for example,
switch the documents at, let's say,
this position and this one, then
you would get more discount if you put,
for example, a very relevant document
here as opposed to here.
Imagine if you put the three here,
then it would have to be discounted.
So it's not as good as if you
would put the three here.
So this is the idea of discounting.
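The discounting described here can be sketched in a few lines. The gains 3, 2, 1, 1, 3 are the ones read out above; the function name `dcg` and the choice of base-2 logarithm are assumptions, since the lecture only says "log" (note that with base 2, log of 2 equals 1, so the second position is in effect not discounted).

```python
import math

def dcg(gains):
    """Discounted cumulative gain as described in the lecture:
    rank 1 is not discounted, rank i >= 2 is divided by log(i).
    The base-2 logarithm is an assumption; the lecture only says 'log'."""
    total = 0.0
    for rank, gain in enumerate(gains, start=1):
        total += gain if rank == 1 else gain / math.log2(rank)
    return total

ranked = [3, 2, 1, 1, 3]    # gains in the order read out in the lecture
swapped = [3, 3, 1, 1, 2]   # same documents, second "3" moved up to rank 2

print("cumulative gain:", sum(ranked), sum(swapped))   # identical
print("DCG            :", dcg(ranked), dcg(swapped))   # swapped is higher
```

The plain cumulative gain cannot tell the two orderings apart, while the discounted version rewards placing the very relevant document higher.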
Okay, so now at this point we have
got this discounted cumulative gain for
measuring the utility of this ranked
list with multiple levels of judgments.
So are we happy with this?
Well, we can use this to rank systems.
Now we still need to do
a little bit more in order to
make this measure comparable
across different topics.
And this is the last step.
And by the way,
here we just show the DCG at ten.
All right.
So this is the total sum of DCG
over all these ten documents.
So the last step is called N,
normalization.
And if we do that then
we get normalized DCG.
So how do we do that?
Well, the idea here is
to normalize DCG
by the ideal DCG at the same cutoff.
What is the ideal DCG?
Well, this is the DCG of an ideal ranking.
So imagine if we have nine
documents in the whole collection
rated three here; that means in
total we have nine documents rated three.
Then our ideal ranked list
would have put all these nine
documents on the very top.
So all these would have to be three, and
then this would be followed by a two here,
because that's the best we could do
after we have run out of threes.
But all these positions would be threes.
Right?
So this would be our ideal ranked list.
And then we can compute the DCG for
this ideal ranked list.
So this would be given by this
formula you see here, and so
this ideal DCG would be used
as the normalizer for DCG,
like so here.
So you can imagine now normalization
essentially is to compare the actual DCG
with the best DCG you can
possibly get for this topic.
Now why do we want to do this?
Well, by doing this we'll map the DCG
values into a range of zero through one,
so the best value, or the highest
value, for every query would be one.
That's when your ranked list
is in fact the ideal list.
But otherwise, in general,
you will be lower than one.
Now what if we don't do that?
Well, you can see this transformation, or
this normalization,
doesn't really affect the relative
comparison of systems for
just one topic, because this ideal
DCG is the same for all the systems.
So the ranking of systems based on
only DCG would be exactly the same
as if you rank them based
on the normalized DCG.
The difference however is when
we have multiple topics because
if we don't do normalization, different
topics will have different scales of DCG.
For a topic like this one, where we have
nine highly relevant documents,
the DCG can get really high.
But imagine that in another case there
are only two very relevant documents
in total in the whole collection.
Then the highest DCG that
any system could achieve for
such a topic would not be very high.
So again we face the problem of different
scales of DCG values, and when we
take an average we don't want the average
to be dominated by those high values.
Those are again easy queries.
So by doing the normalization we can
avoid the problem,
making all the queries
contribute equally to the average.
So this is the idea of NDCG.
It's used for measuring relevance based
on multi-level relevance judgments.
So in a more general way,
this is basically a measure
that can be applied to
any ranking task with multiple levels
of judgments.
And the scale of
the judgments
can be more than binary; not only more
than binary, they can be multiple levels,
like one through five, or
even more, depending on your application.
And the main idea of this
measure, just to summarize,
is to measure the total utility
of the top k documents.
So you always choose a cutoff, and
then you measure the total utility.
And it would discount the contribution
from a lowly ranked document,
and finally, it would do normalization
to ensure comparability across queries.
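The three steps (gain, discounting, normalization) can be put together in a short sketch. Only the nine documents rated three and the first five gains (3, 2, 1, 1, 3) come from the lecture; the rest of the `collection` and `system` lists, the function names, and the base-2 logarithm are assumptions for illustration.

```python
import math

def dcg(gains):
    """Rank 1 undiscounted, rank i >= 2 divided by log2(i) (base is an assumption)."""
    return sum(g if i == 1 else g / math.log2(i)
               for i, g in enumerate(gains, start=1))

def ndcg_at_k(ranked_gains, all_gains, k):
    """ranked_gains: gains of the system's results, in ranked order.
    all_gains: gains of every judged document for the topic, in any order."""
    ideal = sorted(all_gains, reverse=True)[:k]   # best possible top-k
    ideal_dcg = dcg(ideal)
    return dcg(ranked_gains[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical topic: nine documents rated 3, plus some rated 2 and 1.
collection = [3] * 9 + [2] * 5 + [1] * 20
system = [3, 2, 1, 1, 3, 1, 1, 3, 2, 1]   # first five gains as in the lecture
perfect = [3] * 9 + [2]                   # the ideal top ten for this topic

print("NDCG@10 (system):", ndcg_at_k(system, collection, 10))
print("NDCG@10 (ideal) :", ndcg_at_k(perfect, collection, 10))
```

Because every topic is divided by its own ideal DCG, the score for each query falls between zero and one, which is what makes averaging over queries reasonable.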
[MUSIC]

[NOISE]
This lecture is about some practical
issues that you would have to address in
evaluation of text retrieval systems.
In this lecture we will continue
the discussion of evaluation.
We will cover some practical
issues that you will have to solve
in actual evaluation of
text retrieval systems.
So, in order to create a test collection,
we have to create a set of queries,
a set of documents and
a set of relevance judgments.
It turns out that each is
actually challenging to create.
So first, the documents and
queries must be representative.
They must represent the real queries
and real documents that the users handle.
And we also have to use many queries and
many documents in order to
avoid biased conclusions.
For the matching of relevant
documents with the queries,
we also need to ensure that there exist a
lot of relevant documents for each query.
If a query has only one relevant
document in the collection then, you know,
it's not very informative to
compare different methods
using such a query, because there is not
much room for us to see a difference.
So, ideally there should be more
relevant documents in the collection.
But yet the queries also should represent
real queries that we care about.
In terms of relevance judgments, the
challenge is to ensure complete judgments
of all the documents for all the queries,
yet minimize human effort.
Because we have to use human
labor to label these documents,
it's very labor intensive.
And as a result, it's impossible to
actually label all of the documents for
all the queries, especially considering
a giant data set like the web.
So, this is actually a major challenge.
It's a very difficult challenge.
For measures, it's also challenging,
because what we want is measures that
would accurately reflect
the perceived utility of users.
We have to consider carefully
what the users care about and
then design measures to measure that.
If your measure is not
measuring the right thing,
then your conclusion
would be misleading.
So it's very important.
So we're going to talk
about a couple of issues here.
One is the statistical significance test,
and
this also is the reason why we
have to use a lot of queries.
The question here is how sure can you
be that an observed difference
doesn't simply result from
the particular queries you chose?
So here are some sample results of
average precision for System A and
System B in two different experiments.
And you can see at the bottom
we have the mean average precision, all right?
So
if you look at the mean average precision,
the mean average precisions are exactly
the same in both experiments.
All right, so you can see this is 0.2
for System A, this is 0.4 for
System B, and
again here it's also 0.2 and 0.4.
So they are identical.
Yet if you look at these exact
average precisions for different queries,
if you look at these numbers in detail,
you will realize that in one case
you would feel that you can trust
the conclusion here given by the average.
In the other case,
you will feel that, well, I'm not sure.
So, why don't you take a look at
all these numbers for a moment.
Pause the video.
So, if you look at the average,
the mean average precision,
we can easily say that, well,
System B is better, right?
So, after all, it's 0.4, and
this is twice as much as 0.2.
So that's a better performance.
But if you look at these two experiments
and look at the detailed results,
you will see that we'll be more
confident to say that in case one,
in experiment one.
In this case the numbers for System B seem to
be consistently better than for System A.
Whereas in experiment two,
we're not sure,
because, looking at some results
like this, actually System A is better.
And this is another case
where System A is better.
But yet, if we look at the average,
System B is better.
So what do you think?
You know, how reliable is our conclusion
if we only look at the average?
Now in this case, intuitively, we feel
experiment one is more reliable.
But how can we quantitatively
answer this question?
And this is why we need to do
a statistical significance test.
So the idea of a statistical significance
test is basically to assess the
variance across these different queries.
If there's a big variance, that means
that the results could fluctuate
a lot according to different queries.
Then we should believe that,
unless you have used a lot of
queries, the results might change
if we use another set of queries.
Right?
So,
if you have seen high variance,
then the conclusion is not very reliable.
So let's look at these results
again in the second case.
So here we show two
different ways to compare them.
One is a sign test,
where we just look at the sign.
If System B is better than System A,
then we have a plus sign.
When System A is better,
we have a minus sign, etc.
In this case, if you look at this,
well, there are seven cases.
We actually have four cases
where System B is better,
but three cases where System A is better.
You know, intuitively,
this is almost like random results.
Right, so if you just
flip seven coins,
and if you use plus to denote heads and
minus to denote tails,
that could easily be the result of just
randomly flipping these seven coins.
So the fact that the average
is larger doesn't tell us anything.
You know, we can't reliably conclude that.
And this can be quantitatively
measured by a p-value.
And that basically means
the probability of seeing a result at least
this extreme from random fluctuation alone.
In this case, the probability is one.
It means the result is entirely
consistent with random fluctuation.
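The sign test on seven queries can be worked out directly from the binomial distribution. This is a minimal sketch of a two-sided sign test; the function name and the lists of signs are illustrative assumptions (the four-plus, three-minus split for experiment two is from the lecture, while the all-plus split for experiment one is only assumed from the remark that System B looks consistently better there).

```python
from math import comb

def sign_test_p_value(signs):
    """Two-sided sign test. signs: +1 where System B wins, -1 where A wins.
    Ties (0) are dropped. Under the null, wins follow Binomial(n, 0.5)."""
    plus = sum(1 for s in signs if s > 0)
    minus = sum(1 for s in signs if s < 0)
    n = plus + minus
    if n == 0:
        return 1.0
    k = min(plus, minus)
    # Probability of a split at least this lopsided on one side, then doubled.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

experiment_2 = [+1, -1, +1, -1, +1, -1, +1]   # four plus, three minus
experiment_1 = [+1] * 7                        # assumed: B wins every query

print(sign_test_p_value(experiment_2))   # 1.0
print(sign_test_p_value(experiment_1))   # 0.015625
```

A four-to-three split gives a p-value of one, matching the value quoted above, whereas seven wins out of seven would be quite unlikely under pure chance.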
Now the Wilcoxon test
is a non-parametric test,
and we would be not only
looking at the signs,
we'll be also looking at
the magnitude of the difference.
But we can draw a similar
conclusion, where you say, well, it's
very likely to be from random fluctuation.
So to illustrate this let's
think about such a distribution.
And this is called a normal distribution.
We assume that the mean is zero here.
Let's say we start with
the assumption that there's no difference
between the two systems.
But we assume that because of random
fluctuations depending on the queries
we might observe a difference, so
the actual difference might be
on the left side here or
on the right side here, right?
And
this curve kind of shows the probability
that we would actually observe values
that deviate from zero here.
Now, if we look at this picture, then
we see that if a difference
is observed here,
then the chance is very
high that this is in fact
a random observation, right?
We can define a region of, you know, likely
observations because of random fluctuation,
and this is 95% of all outcomes.
And in this interval,
the observed values
may still be from random fluctuation.
But if you observe a value in this
region, or a difference on this side,
then the difference is unlikely
to be from random fluctuation.
Right, so there is a very small
probability that you will observe
such a difference just because
of random fluctuation.
So in that case, we can then conclude
the difference must be real.
So System B is indeed better.
So, this is the idea of
the statistical significance test.
The takeaway message here is that
you have to use many queries to avoid
jumping to a conclusion, as in this
case, to say System B is better.
There are many different ways of doing
this statistical significance test.
So now, let's talk about the other
problem of making judgements and
as we said earlier,
it's very hard to judge all the documents
completely unless it is a small data set.
So the question is, if we can't
afford judging all the documents
in the collection,
which subset should we judge?
And the solution here is pooling.
And this is a strategy that has been used
in many cases to solve this problem.
So the idea of pooling is the following.
We would first choose a diverse
set of ranking methods;
these are text retrieval systems.
And we hope these methods
can help us nominate
likely relevant documents.
So the goal is to pick out
the relevant documents.
It means we want to make judgments
on relevant documents, because those
are the most useful documents
from the user's perspective.
So, we would have each
return its top-K documents.
And the K can vary across systems, right.
But the point is to ask them to suggest
the most likely relevant documents.
And then we simply combine
all these top-K sets to form
a pool of documents for
human assessors to judge.
So, imagine you have many systems.
Each will return K documents; you know, we
take the top-K documents, and
we form the union.
Now, of course there are many
documents that are duplicated,
because many systems might have
retrieved the same relevant documents.
So there will be some duplicate documents.
And
there are also unique documents that are
only returned by one system. So the idea
of having a diverse set of ranking
methods is to ensure the pool is broad
and can include as many
relevant documents as possible.
And then
the human assessors would make
complete judgments on this data set,
this pool.
And the other unjudged documents are
usually just assumed to be non-relevant.
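The pooling strategy amounts to taking the union of each system's top-K results. Here is a minimal sketch; the system names, document IDs, judgments, and function names are all made up for illustration.

```python
def build_pool(runs, k):
    """Union of the top-k documents from every system's ranked list.
    Using a set removes the duplicates nominated by several systems."""
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:k])
    return pool

def judgment_for(doc_id, judged):
    """Look up a relevance judgment; documents outside the pool were never
    judged and are assumed non-relevant (0), as described above."""
    return judged.get(doc_id, 0)

# Hypothetical ranked lists from three diverse systems.
runs = {
    "system_1": ["d1", "d2", "d3", "d4"],
    "system_2": ["d2", "d5", "d1", "d6"],
    "system_3": ["d7", "d2", "d8", "d9"],
}
pool = build_pool(runs, k=2)   # documents handed to the human assessors

# Hypothetical assessor judgments, made only for documents in the pool.
judged = {"d1": 1, "d2": 1, "d5": 0, "d7": 0}
print(sorted(pool))
print(judgment_for("d9", judged))   # unjudged, so treated as non-relevant
```

In this sketch "d9" is outside the pool, so it is treated as non-relevant whether or not it actually is, which is exactly the assumption whose limits are discussed next.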
Now if the pool is large enough,
this assumption is okay.
But if the pool is not very large,
this actually has to be reconsidered,
and we might use other
strategies to deal with them and
there are indeed other
methods to handle such cases.
And such a strategy is generally okay for
comparing systems that
contribute to the pool.
That means if you participate in
contributing to the pool then it's
unlikely that it will penalize
your system because the top
ranked documents have all been judged.
However, this is problematic for
evaluating a new system that may
not have contributed to the pool.
In this case, you know, a new system
might be penalized because it might have
nominated some relevant documents
that have not been judged.
So those documents might be
assumed to be non-relevant.
And that's unfair.
So to summarize the whole part
of text retrieval evaluation,
it's extremely important because the
problem is an empirically defined problem.
If we don't rely on users, there's no way
to tell whether one method works better.
If we have inappropriate
experiment design,
we might misguide our research or
applications.
And we might just draw wrong conclusions.
And we have seen this in
some of our discussion.
So, make sure to get it right for
your research or application.
The main methodology is the Cranfield
evaluation methodology, and
this is the main paradigm used in
all kinds of empirical evaluation tasks,
not just search engine evaluation.
MAP and nDCG are the two main measures
that you should definitely know about, and
they are appropriate for
comparing ranking algorithms.
You will see them often
in research papers.
Precision at ten documents is easier
to interpret from a user's perspective.
So, that's also often useful.
What's not covered is some other
evaluation strategy like A-B test,
where the system would mix
the results of two methods randomly
and then show the mixed
results to users.
Of course, the users don't see
which result is from which method.
The users would judge those results or
click on those documents
in a search engine application.
In this case, then, the search engine can
keep track of the clicked documents, and
see if one method has contributed
more to the clicked documents.
If the user tends to click on
the results from one method,
then it suggests that method
may be better.
So this leverages the real users
of a search engine to do evaluation.
It's called A-B Test, and
it's a strategy that's often used by
the modern search engines,
the commercial search engines.
Another way to evaluate IR or
text retrieval is user studies,
and we haven't covered that.
I've put some references here that you can
look at if you want to
know more about that.
So there are three
additional readings here.
These are three mini
books about evaluation.
And they are all excellent in covering a
broad review of information retrieval
evaluation.
And they cover some of
the things that we discussed,
but they also have a lot
of others to offer.
[MUSIC]

[SOUND] This lecture is about
the probabilistic retrieval model.
In this lecture, we're going to continue
the discussion of text retrieval methods.
We're going to look at another kind of
very different way to design ranking
functions than the Vector Space Model
that we discussed before.
In probabilistic models we define
the ranking function based
on the probability that this
document is relevant to this query.
In other words, we introduce
a binary random variable here.
This is the variable R here.
And we also assume that the query and
the documents are all observations
from random variables.
Note that in the vector space model,
we assume they are vectors.
But here we assume they are
the data observed from random variables.
And so the problem of retrieval
becomes to estimate
the probability of relevance.
In this category of models,
there are different variants.
The classic probabilistic model has
led to the BM25 retrieval function,
which we discussed in
the vector space model,
because its form is actually
similar to a vector space model.
In this lecture,
we're going to discuss another subclass in
this big class called language
modeling approaches to retrieval.
In particular, we're going to discuss
the query likelihood retrieval model,
which is one of the most effective
models in probabilistic models.
There is also another line called
the divergence-from-randomness model,
which has led to the PL2 function.
It's also one of the most effective
state-of-the-art retrieval functions.
In query likelihood, our assumption
is that this probability of relevance
can be approximated by the probability
of the query given a document and relevance.
So, intuitively, this probability just
captures the following probability.
And that is if a user likes document d,
how likely would
the user enter query q in
order to retrieve document d.
So we'll assume that the user likes d,
because we have a relevance value here.
And then we ask the question about
how likely we will see this
particular query from this user?
So this is the basic idea.
Now to understand this idea,
let's take a look at the general idea or
the basic idea of probabilistic
retrieval models.
So here, I listed some imagined
relevance status values or
relevance judgments of queries and
documents.
For example, in this slide,
it shows that q1
is a query that the user typed in,
d1 is a document the user has seen, and
one means the user thinks
d1 is relevant to q1.
So this R here can be also approximated
by the clickthrough data that the search
engine can collect by watching how
you interact with the search results.
So, in this case, let's say,
the user clicked on this document, so
there's a one here.
Similarly, the user clicked on d2 also,
so there's a one here.
In other words,
d2 is assumed to be relevant to q1.
On the other hand, d3 is non-relevant,
there's a zero here.
And d4 is non-relevant and then d5 is
again relevant and so on and so forth.
And this part maybe
is collected from a different user.
Right.
So this user typed in q1 and
then found that d1 is actually not useful,
so d1 is actually non-relevant.
In contrast, here we see it's relevant.
Or this could be the same query typed
by the same user at different times.
But d2 is also relevant, et cetera.
And then here, we can see more
data about other queries.
Now we can imagine,
we have a lot of search data.
Now we can ask the question,
how can we then estimate
the probability of relevance?
Right.
So
how can we compute this
probability of relevance?
Well, intuitively,
that just means if we look at
all the entries where we see this
particular d and this particular q,
how likely will we see
a one in the third column?
Basically, that just means
we can collect the counts.
We can first count how many
times we see q and
d as a pair in this table, and
then count how many times
we actually have also seen
one in the third column, and
then we just compute the ratio.
So let's take a look at
some specific examples.
Suppose we are trying to compute this
probability for d1, d2 and d3 for q1.
What is the estimated probability?
Now think about that.
You can pause the video if needed.
Try to take a look at the table and
try to give your estimate
of the probability.
Have you seen that if we are interested
in q1 and d1, we'll be looking at
these two pairs, and in
just one of the cases
the user has said that this is one,
this is relevant.
So R is equal to 1 in only
one of the two cases.
In the other case, this is zero.
So that's one out of two.
What about q1 and d2?
Well, they are here:
q1, d2; q1, d2.
In both cases,
R is equal to 1.
So, it's two out of two, and
so on and so forth.
So you can see with this approach,
we can actually score these documents for
the query.
Right?
We now have a score for d1,
d2 and d3 for this query.
We can simply rank them based
on these probabilities, and so
that's the basic idea of the
probabilistic retrieval model.
And you can see, it makes a lot of sense.
In this case, it's going to rank
d2 above all the other documents.
Because in all the cases, when you
have seen q1 and d2, R is equal to 1.
The user clicked on this document.
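The counting procedure can be sketched as follows. The log rows mirror the values read out above for q1 (d1 is relevant once and non-relevant once, d2 is relevant twice), but the second row for d3 and the function name are assumptions, since the actual slide table is not reproduced in the transcript.

```python
from collections import defaultdict

def estimate_relevance(log):
    """Estimate p(R=1 | d, q) as
    (#rows with this (q, d) and R=1) / (#rows with this (q, d))."""
    seen = defaultdict(int)
    relevant = defaultdict(int)
    for query, doc, r in log:
        seen[(query, doc)] += 1
        relevant[(query, doc)] += r
    return {pair: relevant[pair] / seen[pair] for pair in seen}

# Rows of (query, document, R). The last d3 row is a hypothetical fill-in.
log = [
    ("q1", "d1", 1), ("q1", "d2", 1), ("q1", "d3", 0),
    ("q1", "d4", 0), ("q1", "d5", 1),
    ("q1", "d1", 0), ("q1", "d2", 1), ("q1", "d3", 0),
]

probs = estimate_relevance(log)
ranking = sorted(["d1", "d2", "d3"],
                 key=lambda d: probs[("q1", d)], reverse=True)
print(probs[("q1", "d1")], probs[("q1", "d2")], probs[("q1", "d3")])
print(ranking)
```

With this table, d2 gets probability 1.0, d1 gets 0.5, and d3 gets 0.0, so d2 is ranked first. Any (query, document) pair that never appears in the log has no estimate at all, which is the limitation discussed next.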
So this also shows that
with a lot of clickthrough data,
a search engine can learn a lot from
the data to improve the search engine.
This is a simple example that shows that
with even a small number of entries here,
we can already estimate
some probabilities.
These probabilities would give us some
sense about which document might be more
relevant or more useful to a user
typing this query.
Now, of course, the problem is that
we don't observe all the queries and
all of the documents and
all the relevance values.
Right?
There will be a lot of unseen documents.
In general, we can only collect data from
the documents that we have shown to
the users.
There are even more unseen queries,
because you cannot predict what
queries will be typed in by users.
So, obviously, this approach won't work
if we apply it to unseen queries or
unseen documents.
Nevertheless, this shows the basic idea
of the probabilistic retrieval model and
it makes sense intuitively.
So what do we do in such a case when we
have a lot of unseen documents and
unseen queries?
Well, the solution is that we have
to approximate in some way.
Right.
So, in this particular case, called
the Query Likelihood Retrieval Model,
we just approximate this
by another conditional probability,
p(q | d, R = 1).
So, in the condition part, we assume
that the user likes the document,
because we have seen that the user
clicked on this document.
And this part,
shows that we're interested in how likely
the user would actually enter this query.
How likely we will see this
query in the same row.
So note that here, we have made
an interesting assumption.
Basically, we're going to assume
that whether the user types in this
query has something to do with
whether the user likes the document.
In other words, we actually
make the following assumption,
and that is, a user formulates a query based
on an imaginary relevant document.
Well, if you just look at this
as a conditional probability,
it's not obvious we
are making this assumption.
So what I really meant
is that to use this new
conditional probability to help us score
documents,
we have to somehow be able
to estimate this conditional
probability without
relying on this big table.
Otherwise, we would be having
similar problems as before.
And by making this assumption, we have
some way to bypass this big table and
try to just model how the
user formulates the query.
Okay.
So this is how you can simplify the,
the general model so that we can
give either specific function later.
So let's look at how this model works for
our example.
And basically,
what we are going to do in this case
is to ask the following question.
Which of these documents is most likely
the imaginary relevant document in
the user's mind when the user
formulates this query?
And so we ask this question and
we quantify the probability and this
probability is a conditional probability
of observing this query if a particular
document is in fact the imaginary
relevant document in the user's mind.
Here you can see we compute all these
query likelihood probabilities,
the likelihood of queries
given each document.
Once we have these values,
we can then rank these documents
based on these values.
So to summarize, the general idea of
modeling relevance in the probabilistic
model is to introduce
a binary random variable, R here.
And then let the scoring function be
defined based on this conditional
probability.
We also talked about approximating this
by using the query likelihood.
And in this case,
we have a ranking function that's
basically based on a probability
of a query given the document.
And this probability should be
interpreted as the probability
that a user who likes document
d would pose query q.
Now the question, of course, is how do
we compute this conditional probability?
This in general has to do with how
to compute the probability of text,
because q is a text.
And this has to do with a model
called a Language Model.
These kinds of models
are proposed to model text.
So more specifically, we will be
very interested in the following
conditional probability, as I show
you here.
If the user likes this document, how
likely would the user pose this query?
And in the next lecture, we're going
to give an introduction to the Language Model,
so that we can see how we can model text
with a probabilistic model in general.
[MUSIC]

[SOUND] This lecture is about the feedback
in the language modeling approach.
In this lecture we will continue the
discussion of feedback in text retrieval.
In particular we're going to talk about
the feedback in language modeling
approaches.
So we derive the query likelihood ranking
function by making various assumptions.
As a basic retrieval function, that
formula, or those formulas worked well.
But if we think about the feedback
information, it's a little bit awkward to
use query likelihood to
perform feedback because
a lot of times the feedback information is
additional information about the query.
But we assume the query is
generated by sampling words
from a language model in
the query likelihood method.
It's kind of unnatural to sample
words that form feedback documents.
As a result, researchers have proposed
a way to generalize the query
likelihood function.
It's called a Kullback-Leibler
divergence retrieval model.
And this model is actually
going to make the query likelihood
retrieval function much
closer to the vector space model.
Yet this form of the language model can
be regarded as a generalization of query
likelihood, in the sense that it can
cover query likelihood as a special case.
And in this case the feedback
can be achieved through
simply query model estimation or updating.
This is very similar to Rocchio
which updates the query vector.
So let's see what the KL-divergence
retrieval model is.
So, on the top, what you see is the query
likelihood retrieval function,
all right, this one.
And then the KL-divergence, or
also called cross entropy, retrieval
model is basically to
generalize the frequency part
here into a language model.
So basically the difference is that
a probabilistic model is given here
to characterize what the user is looking
for, versus the count of query words there.
And this difference allows us to plug in
various different ways to estimate this.
So this can be estimated in many different
ways, including using feedback information.
Now this is called a KL-divergence because
this can be interpreted as measuring
the KL-divergence of two distributions.
One is the query model
denoted by this distribution.
The other is the document
language model here,
which is a smoothed
language model, of course.
And we are not going to talk
about the details of that;
you can find them in the references.
It's also called cross entropy
because, in fact,
we can ignore some terms in the
KL-divergence function and we will end up
having actually cross entropy;
both are terms in information theory.
But, anyway for
our purposes here you can just see
the two formulas look almost identical,
except that here we have a probability of
a word given by a query language model.
This, and here,
the sum is over all the words
that are in the document,
and also with the non-zero probability for
the query model.
So it's kind of, again, a generalization
of sum over all the matching query words.
Now it is also easy to see that
we can recover the query likelihood
retrieval function here simply by
setting this query model to
the relative frequency of
a word in the query, right?
This is very easy to see
once you plug this in here.
You can then eliminate this
query length, which is a constant,
and then you get exactly that.
So you can see the equivalence.
And that's also why this KL-divergence
model can be regarded as a generalization
of query likelihood because we can cover
query likelihood as a special case,
but it would also allow us
to do much more than that.
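The special-case claim can be checked numerically with a small sketch. Everything here is illustrative: the documents, the query, and the simple linear smoothing with a hypothetical `weight` are assumptions, not the exact formula on the slide. The sketch scores documents with the log query likelihood and with the cross entropy form, using the relative frequency of query words as the query model.

```python
import math
from collections import Counter

def smoothed_doc_model(doc, collection, weight=0.5):
    """Return p(w | d) mixed with the collection model.

    `weight` is a hypothetical smoothing weight (an assumption of this sketch).
    """
    d_counts, c_counts = Counter(doc), Counter(collection)
    def p(w):
        return ((1 - weight) * d_counts[w] / len(doc)
                + weight * c_counts[w] / len(collection))
    return p

def query_likelihood(query, p_doc):
    """log p(q | d): sum over query words of c(w, q) * log p(w | d)."""
    return sum(c * math.log(p_doc(w)) for w, c in Counter(query).items())

def cross_entropy_score(query_model, p_doc):
    """Sum of p(w | theta_Q) * log p(w | d) over words with p(w | theta_Q) > 0."""
    return sum(p * math.log(p_doc(w)) for w, p in query_model.items() if p > 0)

docs = {
    "d1": "airport security rules for airport staff".split(),
    "d2": "city airport parking rules".split(),
}
collection = [w for d in docs.values() for w in d]
query = "airport security".split()
# Query model = relative frequency of each word in the query.
query_model = {w: c / len(query) for w, c in Counter(query).items()}

scores = {}
for name, d in docs.items():
    p_doc = smoothed_doc_model(d, collection)
    scores[name] = (query_likelihood(query, p_doc),
                    cross_entropy_score(query_model, p_doc))
    print(name, scores[name])
```

For every document the first score is the second one multiplied by the query length, so the two functions rank documents identically. Any other query model, for example one updated with feedback, can be passed to `cross_entropy_score` without changing the scoring code.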
So this is how we use the KL-divergence
model to then do feedback.
The picture shows that we first
estimate a document language model,
then we estimate a query
language model, and
we compute the KL-divergence;
this is often denoted by a D here.
But this basically means
this is exactly like in the vector space
model, because we compute a vector for
the document, we compute
another vector for the query,
and then we compute the distance.
Only that these vectors
are of special forms:
they are probability distributions.
And then we get the results, and
we can find some feedback documents.
Let's assume they are
mostly positive documents.
Although we could also consider
both kinds of documents.
So what we could do is, like in Rocchio,
we can compute another language model
called feedback language model here.
Again, this is going to be another vector,
just like computing a centroid vector in
Rocchio.
And then this model can be
combined with the original
query model using a linear interpolation.
And this would then give us an updated
model, just like again in Rocchio.
Right, so here, we can see the parameter
alpha controlling the amount of feedback. If
it's set to 0,
then there is no feedback.
If it's set to 1, we've got full feedback,
and we ignore the original query.
And this is generally not desirable,
right,
unless you are absolutely sure you
have seen a lot of relevant documents and
the query terms are not important.
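The linear interpolation described here can be sketched in a few lines. The function name, the parameter name `alpha`, and the two toy distributions are assumptions made for illustration.

```python
def interpolate(query_model, feedback_model, alpha):
    """Updated query model: (1 - alpha) * p(w | theta_Q) + alpha * p(w | theta_F)."""
    vocab = set(query_model) | set(feedback_model)
    return {
        w: (1 - alpha) * query_model.get(w, 0.0) + alpha * feedback_model.get(w, 0.0)
        for w in vocab
    }

# Hypothetical models: the original query and a feedback model.
theta_q = {"airport": 0.5, "security": 0.5}
theta_f = {"airport": 0.4, "security": 0.3, "bomb": 0.2, "terrorist": 0.1}

no_feedback = interpolate(theta_q, theta_f, 0.0)    # same as the original query
full_feedback = interpolate(theta_q, theta_f, 1.0)  # original query is ignored
mixed = interpolate(theta_q, theta_f, 0.5)
print(mixed)
```

Because both inputs are probability distributions, the result is again a distribution for any alpha between 0 and 1.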
So of course the main question here
is how do you compute this theta F?
This is the big question here.
And once you can do that,
the rest is easy.
So here we'll talk about
one of the approaches.
And there are many approaches of course.
This approach is based on generative model
and I'm going to show you how it works.
This uses a generative mixture model.
So this picture shows that
we have this model here,
the feedback model that
we want to estimate.
And the basis is the feedback documents.
Let's say we are observing
the positive documents.
These are the documents clicked by
users, or relevant documents judged by
users, or simply top-ranked documents
that we assume to be relevant.
Now imagine how we can
compute a centroid for
these documents by using language model.
One approach is simply to assume
these documents are generated from
this language model as we did before.
What we could do is
just normalize the word frequency here.
And then
we'll get this word distribution.
Now the question is whether this
distribution is good for feedback.
Well, you can imagine: the top-ranked
words would be what?
What do you think?
Well, those words would be common words,
right?
As we usually see in a language model,
the top-ranked words are actually
common words like "the", et cetera.
So, it's not very good for feedback,
because we will be adding a lot of such
words to our query when we interpolate
this with the original query model.
So, this is not good, so
we need to do something, in particular,
we are trying to get rid
of those common words.
And we have actually seen one way
to do that, by using a background language
model in the case of learning
the associations of words, right,
the words that are related
to the word computer.
We could do that, and
that would be another way to do this.
But here, we're going to
talk about another approach,
which is a more principled approach.
In this case, we're going to say, well,
you see that there are common words
here in these documents that should
not belong to this topic model, right?
So now, what we can do is to assume that,
well, those words are generated
from the background language model,
so it will generate
those words like "the", for example.
And if we use the maximum
likelihood estimator,
note that if all the words here
must be generated from this model,
then this model is forced to assign
high probabilities to a word like "the",
because it occurs so frequently here.
Note that in order to reduce its
probability in this model, we have to
have another model, which is this one,
to help explain the word "the" here.
And in this case,
it is appropriate to use the background
language model to achieve this goal,
because this model will assign high
probabilities to these common words.
So in this approach then, we assume
the machine that generated
these words would work as follows.
We have a source controller here.
Imagine we flip a coin here to
decide what distribution to use.
With probability lambda,
the coin shows up as heads.
And then we're going to use
the background language model.
And we can then sample
a word from that model.
With probability 1 minus lambda,
we decide to use an unknown topic
model here that we will try to estimate.
And we're going to then
generate a word here.
If we make this assumption, then this
whole thing will be just one model, and
we call this a mixture model,
because there are two distributions
that are mixed here together.
And we actually don't know when
each distribution is used.
Right, so again think of this
whole thing as one model.
And we can still ask it for words, and
it will still give us a word
in a random manner, right?
And of course which word will show up
will depend on both this distribution and
that distribution.
In addition,
it would also depend on this lambda,
because if you say
lambda is very high and
it's going to always use the background
distribution, you'll get different words.
If you say, well, our lambda is very small,
we're going to mostly use this one, all right?
So all these are parameters,
in this model.
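The coin-flipping process can be written down as a short sampling sketch. The two word distributions and the function name are hypothetical, and in the real setting the topic model is unknown; it is given fixed values here only so that the generator can run.

```python
import random

# Hypothetical distributions (tiny vocabularies for illustration only).
BACKGROUND = {"the": 0.6, "of": 0.4}
TOPIC = {"airport": 0.5, "security": 0.5}

def sample_word(background, topic, lam, rng):
    """Flip a coin: with probability lam use the background model,
    otherwise use the topic model, then sample one word from it."""
    model = background if rng.random() < lam else topic
    words = list(model)
    weights = [model[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]

rng = random.Random(0)
print([sample_word(BACKGROUND, TOPIC, 0.9, rng) for _ in range(10)])  # mostly common words
print([sample_word(BACKGROUND, TOPIC, 0.1, rng) for _ in range(10)])  # mostly topic words
```

An observer of the output only sees the words, not which distribution produced each one, which is why the whole thing is treated as a single model.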
And then, if you're thinking this way,
basically we can do exactly the same as
what we did before, we're going to use
maximum likelihood estimator to adjust
this model to estimate the parameters.
Basically we're going to adjust
this parameter so
that we can best explain all the data.
The difference now is that we are not
asking this model alone to explain this.
But rather we're going to ask
this whole model, the mixture model,
to explain the data. Because it has got
some help from the background model,
it doesn't have to assign high
probabilities to words like "the".
As a result,
it would then assign high probabilities
to other words that are common here but
do not have high probability here.
So those would be common here.
Right?
And if they're common they would
have to have high probabilities,
according to a maximum
likelihood estimator.
And if they are rare here,
then you don't get much help
from this background model.
As a result, this topic model
must assign high probabilities.
So the high probability words
according to the topic model
will be those that are common here,
but rare in the background.
Okay, so, this is basically a little
bit like IDF weighting here.
This would allow us to achieve
the effect of removing these stop words
that are meaningless in the feedback.
So mathematically what we have is
to compute the likelihood, again the
log-likelihood, of
the feedback documents.
And note that we also have
another parameter, lambda, here.
But we assume that lambda denotes the
amount of noise in the feedback documents.
So we are going to
set this to a parameter value; let's
say 50% of the words are noise,
or 90% are noise.
And this can then be
assumed to be fixed.
If we assume this is fixed, then we only
have these probabilities as parameters
just like in the simplest unigram
language model, we have n parameters.
n is the number of words. Then the
likelihood function will look like this.
It's very similar to the
normal likelihood
function we saw before, except that inside
the logarithm there's a sum here.
And this sum is because we
consider the two distributions.
And which one is used depends on
lambda, and that's why we have this form.
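Since the slide itself is not part of this transcript, here is a reconstruction of the log-likelihood from the description above. The notation is an assumption: F = {d_1, ..., d_k} is the set of feedback documents, c(w, d_i) is the count of word w in document d_i, p(w | theta) is the unknown topic (feedback) model, p(w | C) is the background model, and lambda is the probability of using the background model.

```latex
\log p(F \mid \theta)
  = \sum_{i=1}^{k} \sum_{w \in V} c(w, d_i)\,
    \log\!\big[ (1-\lambda)\, p(w \mid \theta) + \lambda\, p(w \mid C) \big]
```

The sum inside the logarithm is the mixture of the two distributions; with lambda fixed, the only unknowns are the probabilities p(w | theta).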
But mathematically this is the function
with theta as unknown variables, right?
So, this is just a function.
All the other variables are known,
except for this guy.
So, we can then choose this
probability distribution to
maximize this log likelihood.
The same idea as the maximum
likelihood estimator.
As a mathematical problem,
we just have to solve this
optimization problem.
Essentially, we would try all
of the theta values, and
find the one that gives this
whole thing the maximum probability.
So, it's a well-defined math problem.
Once we have done that,
we obtain this theta F,
which can be interpolated with
the original query model to do feedback.
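The lecture does not say how the optimization is solved. One standard choice for mixture models is the EM algorithm, and the sketch below uses it as an assumption, not as the method from the slides. The toy documents, the background probabilities, and the function name are hypothetical, and the background model is assumed to give a nonzero probability to every word in the feedback documents.

```python
from collections import Counter

def estimate_feedback_model(feedback_docs, background, lam=0.5, iters=50):
    """Estimate p(w | theta_F) in the two-component mixture with EM.

    lam is the fixed probability of using the background model.
    """
    counts = Counter(w for doc in feedback_docs for w in doc)
    total = sum(counts.values())
    theta = {w: c / total for w, c in counts.items()}  # start from normalized counts
    for _ in range(iters):
        # E-step: probability that an occurrence of w came from the topic model.
        z = {
            w: (1 - lam) * theta[w] / ((1 - lam) * theta[w] + lam * background[w])
            for w in counts
        }
        # M-step: re-normalize the counts attributed to the topic model.
        weighted = {w: counts[w] * z[w] for w in counts}
        norm = sum(weighted.values())
        theta = {w: weighted[w] / norm for w in counts}
    return theta

# Hypothetical feedback documents and background probabilities.
feedback_docs = [
    ["the", "the", "the", "airport", "security"],
    ["the", "the", "airport", "bomb"],
]
background = {"the": 0.5, "airport": 0.01, "security": 0.01, "bomb": 0.01}

theta_f = estimate_feedback_model(feedback_docs, background, lam=0.9)
print(sorted(theta_f.items(), key=lambda kv: kv[1], reverse=True))
```

With plain count normalization "the" would get 5/9 of the probability mass. Here the background model explains most occurrences of "the", so the topical words move to the top.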
So here are some examples of
the feedback model learned from a web
document collection, and
we do pseudo-feedback.
We just use the top 10 documents,
and we use this mixture model.
So the query is airport security.
What we do is we first retrieve ten
documents from the web database.
And this is of course pseudo-feedback,
right?
And then we're going to fit the
mixture model to this ten-document set.
And these are the words
learned using this approach.
This is the probability of a word given
by the feedback model in both cases.
So, in both cases, you can see
the highest probability words
include words very relevant to the query.
So, airport security, for example:
these query words still show
up with high probabilities
in each case, naturally, because they occur
frequently in the top-ranked documents.
But we also see beverage, alcohol,
bomb, terrorist, et cetera.
Right, so these are relevant
to this topic, and,
if combined with the original query, can help
us match documents more accurately.
And also they can help us bring up
documents that only mention
some of these other words.
Maybe, for example, just airport and
then bomb.
So
this is how pseudo-feedback works.
It shows that this model really works and
picks up
some words related to the query.
What's also interesting is that if
you look at the two tables here and
you compare them, you see that in this
case, when lambda is set to a small value,
we still see some common
words here. That means that
when we don't use the background
model often (remember, lambda
is the probability of using the
background model to generate the text),
if we don't rely much on the background model,
we still have to use this topic model
to account for the common words.
Whereas if we set lambda to a very
high value, we would use the background
model very often to explain these words,
and then there is no burden on
the topic model to explain those
common words in the feedback documents.
So, as a result, the topic model
here is very discriminative.
It contains all the relevant
words without common words.
So this can be added to the original
query to achieve feedback.
So to summarize in this lecture we
have talked about the feedback in
language model approach.
In general,
feedback is to learn from examples.
These examples can be assumed examples or
pseudo-examples,
such as the top ten
documents assumed to be relevant.
They could be based on user
interactions, like feedback
based on clickthroughs, that is, implicit feedback.
We talked about the three major
feedback scenarios, relevance feedback,
pseudo-feedback, and implicit feedback.
We talked about how to use Rocchio to
do feedback in vector-space model and
how to use query model estimation for
feedback in language model.
And we briefly talked about
the mixture model and
the basic idea and
there are many other methods.
For example the relevance model
is a very effective model for
estimating query model.
So, you can read more about
these methods in the references that
are listed at the end of this lecture.
So there are two additional readings here.
The first one is a book that
has a systematic review and
discussion of language models
for information retrieval.
And the second one is an important
research paper that's about relevance
based language models and it's a very
effective way of computing query model.
[MUSIC]

[SOUND].
This lecture is about
the statistical language model.
In this lecture,
we're going to give an introduction
to the statistical language model.
This has to do with how we model
text data with probabilistic models.
So, it's related to how we model
a query based on a document.
We're going to talk about what
a language model is, and then we're going to
talk about the simplest language model,
called a unigram language model,
which also happens to be the most
useful model for text retrieval.
And finally we'll discuss
possible uses of a language model.
What is a language model?
Well, it's just a probability
distribution over word sequences.
So, here I show one.
This model gives the sequence "Today is
Wednesday" a probability of 0.001. It gives
"Today Wednesday is" a very, very small
probability, because it's ungrammatical.
You can see the probabilities
given to these sentences or
sequences of words can vary
a lot depending on the model.
Therefore, it's clearly context-dependent.
In ordinary conversation,
probably "Today is Wednesday" is the most
popular among these sentences.
But imagine in the context of
discussing applied math,
maybe "The eigenvalue is positive"
would have a higher probability.
This means it can be used to
represent the topic of a text.
The model can also be regarded
as a probabilistic mechanism for
generating text, and this is why it
is often called a generative model.
So, what does that mean?
We can imagine this is
a mechanism that's visualized
here as a stochastic system that
can generate sequences of words.
So we can ask for a sequence, that
is, sample a sequence
from the device if you want.
And it might generate, for
example, today is Wednesday, but
it could have generated
many other sequences.
So for example,
there are many possibilities, right?
So, in this sense,
we can view our data as basically a sample
observed from such a generative model.
So why is such a model useful?
Well, it's mainly because it can quantify
the uncertainties in natural language.
Where do uncertainties come from?
Well, one source is
simply the ambiguity in
natural language that we
discussed earlier in the lecture.
Another source is that we don't
have complete understanding.
We lack all the knowledge needed
to understand language.
In that case there will
be uncertainties as well.
So let me show some examples of questions
that we can answer with a language model
and that have interesting
applications in different ways.
Given that we see "John" and "feels",
how likely will we see
"happy" as opposed to "habit"
as the next word in a sequence of words?
Obviously this would be very useful for
speech recognition, because "happy" and
"habit" would have similar
acoustic signals.
But if we look at the language model,
we know that "John feels happy" would be
far more likely than "John feels habit".
Another example: given that we
observe "baseball" three times and
"game" once in a news article,
how likely is it about sports?
This obviously is related to text
categorization and information retrieval.
Also, given that a user is
interested in sports news,
how likely would the user
use "baseball" in a query?
Now this is clearly related to the query
likelihood that we discussed in the previous lecture.
So now let's look at
the simplest language model,
called a unigram language model.
In such a case,
we assume that we generate the text by
generating each word independently.
So this means the probability of
a sequence of words would be then
the product of
the probability of each word.
Now normally they are not independent,
right?
So if you have seen a word like "language",
that will make it far more
likely to observe "model"
than if you haven't seen "language".
So this assumption is
not necessarily true, but
we'll make this assumption
to simplify the model.
So now, the model has precisely N
parameters, where N is the vocabulary size.
We have one probability for each word, and
all these probabilities must sum to 1.
So strictly speaking,
we actually have N minus 1 parameters.
As I said,
text can then be assumed to be a sample
drawn from this word distribution.
So for example,
now we can ask the device, or the model,
to stochastically generate the words for
us instead of in sequences.
So instead of giving a whole
sequence like today is Wednesday,
it now gives us just one word.
And we can get all kinds of words.
And we can assemble these
words in a sequence.
So, that would still allow you to
compute the probability of "Today is
Wednesday" as the product of the three
probabilities.
As you can see, even though we have
not asked the model to generate
the sequences, it actually allows
us to compute the probability for
all the sequences.
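As a minimal sketch of this computation, the code below multiplies per-word probabilities to get the probability of a sequence. The three probability values are made up for illustration, and the rest of the vocabulary is omitted.

```python
import math

# Hypothetical unigram probabilities (only three vocabulary entries shown).
unigram = {"today": 0.01, "is": 0.05, "wednesday": 0.002}

def sequence_probability(words, model):
    """Unigram assumption: p(w1 ... wn) is the product of p(wi).
    Words missing from the model get probability 0."""
    return math.prod(model.get(w, 0.0) for w in words)

p1 = sequence_probability(["today", "is", "wednesday"], unigram)
p2 = sequence_probability(["today", "wednesday", "is"], unigram)
print(p1, p2)
```

One consequence of the independence assumption shows up directly: the two orderings get the same probability, so a unigram model cannot tell the grammatical sentence from the ungrammatical one.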
But this model now only needs
N parameters to characterize.
That means if we specify
all the probabilities for
all the words then the model's
behavior is completely specified.
Whereas if we don't make this
assumption, we would have to specify
probabilities for all kinds of
combinations of words in sequences.
So by making this assumption, it makes it
much easier to estimate these parameters.
So let's see a specific example here.
Here I show two unigram language
models with some probabilities, and
these are high probability
words that are shown on top.
The first one clearly suggests
the topic of text mining,
because the high probability words
are all related to this topic.
The second one is more related to health.
Now, we can then ask the question how
likely we'll observe a particular text
from each of these two models.
Now suppose we sample
words to form a document.
Let's say we take the first distribution
and sample words from it.
What words do you think
would be generated? Maybe "text",
or maybe "mining", or maybe another word?
Even food,
which has a very small probability,
might still be able to show up.
But in general, high probability
words will likely show up more often.
So we can imagine a generated
text that looks like text mining.
In fact, with a small probability,
you might be able to actually generate
an actual text mining paper that
would actually be meaningful, although
the probability would be very, very small.
In the extreme case, you might imagine
we might be able to generate
a text mining paper that
would be accepted by a major conference.
And in that case the probability
would be even smaller.
But it is a nonzero probability,
if we assume none of the words
has a zero probability.
Similarly from the second topic,
we can imagine we can generate a food and
nutrition paper.
That doesn't mean we cannot generate this
paper from text mining distribution.
We can, but the probability would be very,
very small, maybe smaller than even
generating a paper that can be accepted
by a major conference on text mining.
So the point here is
that given a distribution,
we can talk about the probability of
observing a certain kind of text.
Some text would have higher
probabilities than others.
Now, let's look at the problem
in a different way.
Suppose we now have
available a particular document.
In this case, maybe the abstract of
a text mining paper, and
we see these word counts here.
The total number of words is 100.
Now the question we ask here
is an estimation question.
We can ask the question: which model,
which word distribution, has been used
to generate this text,
assuming the text has been generated by
sampling words from the distribution?
So what would be your guess?
We have to decide what probabilities
"text", "mining", et cetera would have.
So pause the video for a second and
try to think about your best guess.
If you're like a lot of people
you would have guessed that well,
my best guess is text has
a probability of 10 out of 100
because I have seen text ten times and
there are a total of 100 words.
So we simply
normalize these counts.
And that is in fact justified:
your intuition is consistent
with mathematical derivation.
And this is called the maximum
likelihood estimator.
In this estimator,
we assume that the parameter settings
are those that would give our
observed data the maximum probability.
That means if we change
these probabilities,
then the probability of observing the
particular text would be somewhat smaller.
So we can see this has
a very simple formula.
Basically, we just need to
look at the count of a word
in the document and then divide it by
the total number of words in the document,
that is, the document length.
So it is a normalized frequency.
Well, a consequence of this
is, of course, that we're going to assign
zero probabilities to unseen words.
If we have not observed a word,
there will be no incentive to assign
a nonzero probability using this approach.
Why?
Because that would take away probability
mass from the observed words.
And that obviously wouldn't
maximize the probability of this
particular observed data.
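The estimator is short enough to write out. In the sketch below only the count of "text" (10 out of 100 words) comes from the lecture; the other counts and the function name are hypothetical fillers chosen so that the document has 100 words.

```python
from collections import Counter

def mle_unigram(words):
    """Maximum likelihood estimate: p(w) = count(w) / document length."""
    counts = Counter(words)
    total = len(words)
    return {w: c / total for w, c in counts.items()}

# A 100-word toy document: "text" occurs 10 times as in the lecture,
# the remaining counts are made up.
document = ["text"] * 10 + ["mining"] * 5 + ["association"] * 3 + ["the"] * 82

model = mle_unigram(document)
print(model["text"])           # 10 / 100
print(model.get("food", 0.0))  # an unseen word gets zero probability
```

The second print shows the consequence just discussed: any word that does not occur in the document receives probability zero under this estimator.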
But one can still question whether
this is our best estimator.
Well, the answer depends on what kind
of model you want to find, right?
This estimator gives the best model
based on this particular data.
But if you're interested in a model
that can explain the content of the full
paper for this abstract, then you
might have second thoughts, right?
So for one thing, there should be other
words in the body of that article.
So they should not have
zero probabilities,
even though they are not
observed in the abstract.
We're going to cover this later, when
discussing the query likelihood model.
So, let's take a look at some possible
uses of these language models.
One use is simply to use
it to represent the topics.
So here it shows some general
English background text.
We can use this text to
estimate a language model.
And the model might look like this.
Right?
So on the top we'll have those common
words, such as "is" and "we", and then we'll
see some common words like these,
and then some very,
very rare words at the bottom.
This is the background language model.
It represents the frequency of words
in English in general, right?
This is the background model.
Now, let's look at another text.
Maybe this time, we'll look at
Computer Science research papers.
So we have a collection of computer
science research papers. Again,
we can just use the maximum likelihood
estimator, where we
simply normalize the frequencies.
Now, in this case, we look at
the distribution, that looks like this.
On the top, it looks similar,
because these words occur everywhere,
they are very common.
But as we go down we'll see words that
are more related to computer science.
Computer, or software, or text et cetera.
So, although here we might also see
these words, for example, computer,
we can imagine the probability here
is much smaller than the probability here.
And we will see many
other words here
that would be more common
in general English.
So, you can see this distribution
characterizes a topic
of the corresponding text.
We can look at an even smaller text.
So, in this case let's look
at a text mining paper.
Now if we do the same, we have another
distribution. Again, "the" can be
expected to occur on the top.
But soon we will see "text", "mining",
"association", "clustering".
These words have relatively
high probabilities here; in contrast,
in this distribution "text" has
a relatively small probability.
So this means,
again, based on different text data,
we can have a different model.
And the model captures the topic.
So we call this a document language model,
and we call this a collection language model.
And later, we'll see how they're
used in a retrieval function.
But now, let's look at
another use of this model.
Can we statistically find what words
are semantically related to computer?
Now how do we find such words?
Well, our first thought is, let's
take a look at the text that matches
"computer".
So we can take a look at all the documents
that contain the word "computer".
Let's build a language model.
Okay, see what words we see there.
Well, not surprisingly, we see these
common words on top, as we always do.
So in this case,
this language model gives us the
conditional probability of seeing
a word in the context of "computer".
And these common words will
naturally have high probabilities.
Also, we will see "computer" itself, and
"software" will have relatively
high probabilities.
But
if we just use this model, we cannot
just say all these words
are semantically related to "computer".
So intuitively, we'd like to get
rid of these common words.
How can we do that?
It turns out that it's possible
to use language model to do that.
Now I suggest you think about that.
So how can we know what
words are very common so
that we want to kind of get rid of them.
What model will tell us that?
Well, maybe you can think about that.
So the background language model
precisely tells us this information.
It tells us what words
are common in general.
So if we use this background model,
we would know that these words
are common words in general.
So it's not surprising to observe
them in the context of computer.
Whereas computer has a very
small probability in general.
So it's very surprising that we have
seen computer in, with this probability.
And the same is true for software.
So then we can use these two
models to somehow figure out
the words that are related to "computer".
For example, we can simply take the ratio
of these two probabilities, and normalize
the topic model by the probability
of the word in the background model.
So if we do that, we take the ratio,
we'll see that then on the top,
"computer" is ranked first, and
then followed by "software",
"program", all these words
related to "computer".
Because they occur very frequently
in the context of "computer", but
not frequently in the whole collection.
Whereas these common words will
not have a high ratio.
In fact,
they have a ratio of about one down there,
because they are not really
related to "computer".
By taking the sample of text
that contains "computer", we don't
really see more occurrences
of them than in general.
So this shows that even
with these simple language models,
we can do some limited
analysis of semantics.
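The ratio idea can be sketched as follows. The probability values in both models and the function name are invented for illustration; in practice both models would be estimated from text by normalizing counts.

```python
def rank_by_ratio(context_model, background_model):
    """Rank words by p(w | context) / p(w | background), highest first.
    Words with zero background probability are skipped."""
    ratios = {
        w: p / background_model[w]
        for w, p in context_model.items()
        if background_model.get(w, 0.0) > 0
    }
    return sorted(ratios, key=ratios.get, reverse=True), ratios

# Hypothetical models: words in documents containing "computer",
# and words in general English.
context = {"the": 0.06, "is": 0.03, "computer": 0.02, "software": 0.01}
background = {"the": 0.06, "is": 0.03, "computer": 0.0001, "software": 0.0001}

ranking, ratios = rank_by_ratio(context, background)
print(ranking)
print(ratios["the"], ratios["is"])  # about 1 for the common words
```

With these numbers "computer" and "software" come first, while the common words sit at a ratio of about one.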
So in this lecture,
we talked about the language model,
which is basically a probability
distribution over text.
We talked about the simplest language
model, called the unigram language model,
which is also just a word distribution.
We talked about two
uses of a language model.
One is to represent the topic in
a document, in a collection, or in general.
The other is to discover word associations.
In the next lecture we're
going to talk about how
a language model can be used to
design a retrieval function.
Here are two additional readings.
The first is a textbook on statistical
natural language processing.
The second is an article that has
a survey of statistical language
models with pointers
to research work.
[MUSIC]

This lecture is about the query likelihood
probabilistic retrieval model.
In this lecture,
we continue the discussion of
probabilistic retrieval models.
In particular,
we're going to talk about the query
likelihood retrieval function.
In the query likelihood retrieval
model, our idea is to model
how likely a user who likes a document
would pose a particular query.
So in this case, you can imagine,
if a user likes this particular document
about the presidential campaign news.
Then we can assume
the user would use this
document as a basis to pose a query
to try to retrieve this document.
So you can imagine the user, could use
a process that works as follows, where
we assume that the query is generated
by sampling words from the document.
So for example,
a user might pick a word like
presidential from this document,
and then use this as a query word.
And then the user would pick another word,
like campaign and
that would be the second query word.
Now this, of course,
is an assumption that we have made about
how a user would pose a query.
Whether a user actually
followed this process
may be a different question.
But this assumption
has allowed us to formally characterize
this conditional probability.
And this allows us to also not rely on
the big table that I showed you earlier
to use empirical data to
estimate this probability.
And this is why we can use this
idea to then further derive
a retrieval function that we can
implement with a programming language.
So, as you see, the assumption that
we've made here is that each query word
is sampled independently, and also
each word is basically
obtained from the document.
So now let's see how this works exactly.
Well, since we are computing
a query likelihood,
then the probability here is just
the probability of this particular query,
which is a sequence of words.
And we make the assumption that each
word is generated independently.
So, as a result, the probability
of the query is just a product
of the probability of each query word.
Now, how do we compute
the probability of each query word?
Well, based on the assumption,
that a word is picked from the document,
that the user has in mind.
Now we know the probability
of each word is just the,
the relative frequency of
the word in the document.
So, for example the probability of
presidential given the document,
would be just the count of
presidential in the document,
divided by the total number of words
in the document or document length.
So with these assumptions,
we now have an actual simple formula for
retrieval, right?
We can use this to rank our documents.
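Written out, the two assumptions above give the following formula. The notation is an assumption made here for illustration: the query is q = w_1 ... w_n, c(w, d) is the count of word w in document d, and |d| is the document length.

```latex
p(q \mid d) = \prod_{i=1}^{n} p(w_i \mid d),
\qquad
p(w_i \mid d) = \frac{c(w_i, d)}{|d|}
```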
So does this model work?
Let's take a look, here are some example
documents that you have seen before.
Suppose now the query is
presidential campaign.
And we see the formula here on the top.
So how do we score these documents?
Well it's very simple, right,
we just count how many times we have seen
presidential, how many times
we have seen campaign etc.
And we see here for d4,
we've seen presidential twice,
so that's 2 over the length
of document d4,
multiplied by 1 over the length of document
d4 for the probability of
campaign. And similarly, we can get probabilities
for the other two documents.
Now if you look at these numbers,
or these formulas for
scoring all these documents, it seems to
make sense because, if we assume d3 and
d4 have about the same length,
then it looks like we will rank d4
above d3, which is above d2, right?
And as we would expect, it looks like
it did capture the TF heuristic.
And so this seems to work well.
However, if we try a different
query like this one,
presidential campaign update,
then we might see a problem.
But what problem?
Well, think about update, now none of
these documents has mentioned update.
So according to our assumption that
a user would pick a word from a document
to generate a query,
then the probability of obtaining
a word like update would be what?
It would be zero, right?
So that causes a problem,
because it would cause all these documents
to have zero probability
of generating this query.
Now, while it's fine to have a zero
probability for d2 which is not relevant.
It's not okay to have zero for d3 and
d4, because now we no longer
can distinguish them.
What's worse,
we can't even distinguish them from d2.
All right, so
that's obviously not desirable.
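The behavior described above can be reproduced with a small sketch. The three toy documents below are hypothetical stand-ins, not the documents on the slide, and `mle_query_likelihood` is a name introduced here for illustration. It scores a query as the product of relative word frequencies in the document.

```python
from collections import Counter

def mle_query_likelihood(query, doc):
    """p(q|d) as a product of relative word frequencies in the document."""
    counts = Counter(doc)
    likelihood = 1.0
    for word in query:
        likelihood *= counts[word] / len(doc)
    return likelihood

# Hypothetical toy documents (already tokenized).
docs = {
    "d2": "news about organic food campaign".split(),
    "d3": "news of presidential campaign".split(),
    "d4": "presidential campaign news presidential candidate".split(),
}

query1 = ["presidential", "campaign"]
scores1 = {name: mle_query_likelihood(query1, d) for name, d in docs.items()}

# Adding a word that no document contains drives every score to zero.
query2 = ["presidential", "campaign", "update"]
scores2 = {name: mle_query_likelihood(query2, d) for name, d in docs.items()}

print(scores1)
print(scores2)
```

With the first query, d4 scores above d3, which scores above d2. With the second query, all three documents get a score of zero, so they can no longer be told apart.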
Now when one has such result, we should
think about what has caused this problem.
So we have to examine what
assumptions have been made,
as we derive this ranking function.
Now if you examine those assumptions
carefully, you would realize
what has caused this problem, right?
So, take a moment to think about
what you think is the reason why
update has zero probability,
and how do we fix it?
Right?
So, if you think about this for
a moment, you'll realize
that's because we have made an assumption
that every query word must be drawn
from the document in the user's mind.
So, in order to fix this,
we have to assume that,
the user could have drawn a word,
not necessarily from the document.
So let's see an improved model.
An improvement here is to say that,
well, instead of drawing a word from
the document, let's imagine that the user
would actually draw a word from a document
model, and so I show a model here.
Here we assume that this
document is generated
by using this unigram language model.
Now, this model doesn't necessarily
assign zero probability to update.
In fact, we assume this model does not
assign zero probability to any word.
Now if we're thinking this
way then the generation
process is a little bit different.
Now the user has this model in mind,
instead of this particular document.
Although the model has to be
estimated based on the document.
So the user can again generate
the query using a similar process.
They may pick a word, for example
presidential and another word campaign.
Now the difference is that this time
we can also pick a word like update,
even though update
doesn't occur in the document,
to potentially generate
a query word like update.
So a query with update
won't have zero probability.
So this would fix our problem, and
it's also reasonable,
because we're now thinking of what the
user is looking for in a more general way,
that is, a unigram language model
instead of a fixed document.
So how do we compute this query
likelihood? If we make this assumption,
it involves two steps, right?
The first is to compute this model, and we
call it the document language model here.
For example, I have shown two
possible language models here,
estimated based on two documents.
And then we are given a query like
data mining algorithms.
The second step is just to compute
the likelihood of this query.
And by making independence assumptions,
we could then have this probability as
a product of the probability
of each query word, all right?
We do this for both documents.
And then we're going to score these
two documents and then rank them.
So that's the basic idea of this
query likelihood retrieval function.
So more generally, this ranking
function would look like the following.
Here we assume that the query
has n words, w1 through wn.
And then the scoring function,
the ranking function, is the probability
that we observe this query, given that
the user is thinking of this document.
And this is assumed to be the product of
the probabilities of all individual words, and
this is based on
the independence assumption.
Now we actually often score the
document for
this query by using the log of the query
likelihood, as shown on the second line.
Now we do this to avoid having
a lot of small probabilities
multiplied together.
This could cause underflow, and
we might lose precision. By transforming
the value with a logarithm function,
we maintain the order of these documents,
yet we can avoid the underflow problem.
So if we take the logarithm transformation,
of course the product would become
a sum, as you see on the second line here.
So it's a sum over all the query words,
and inside the sum
there is the log of the probability of
this word given by the document.
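To make the underflow point concrete, here is a small sketch with made-up numbers. The two lists of per-word probabilities are hypothetical; they stand for the query word probabilities under two different documents, for a deliberately long query of 100 words.

```python
import math

def product_of_probs(probs):
    """Multiply the probabilities directly (may underflow to 0.0)."""
    result = 1.0
    for p in probs:
        result *= p
    return result

def sum_of_logs(probs):
    """Add the logarithms instead; same ordering, no underflow here."""
    return sum(math.log(p) for p in probs)

# Hypothetical per-word probabilities for a 100-word query.
doc_a_probs = [1e-4] * 100
doc_b_probs = [1e-5] * 100

product_a = product_of_probs(doc_a_probs)
product_b = product_of_probs(doc_b_probs)
log_a = sum_of_logs(doc_a_probs)
log_b = sum_of_logs(doc_b_probs)

print(product_a, product_b)  # both underflow in double precision
print(log_a, log_b)          # still distinguishable
```

Both direct products underflow to zero in double precision, so the two documents tie, while the log scores remain finite and keep document A ahead of document B.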
And then we can further rewrite the sum
into a different form.
So in the first sum here,
we have a sum over all the query
words, n query words.
And in this sum, we have a sum
over all the possible words, but
we put a count here of
each word in the query.
Essentially we are only considering
the words in the query,
because if a word is not in the query,
its count would be zero.
So we're still considering
only these n words.
But we're using a different form, as if
we were going to take a sum over all the words
in the vocabulary.
And of course a word might occur
multiple times in the query.
That's why we have a count here.
And then this part is
the log of the probability of the word
given by the document language model.
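The two equivalent forms of the score can be written as follows. The symbols are assumptions introduced here for illustration: V is the vocabulary and c(w, q) is the count of word w in the query.

```latex
\log p(q \mid d)
  = \sum_{i=1}^{n} \log p(w_i \mid d)
  = \sum_{w \in V} c(w, q) \, \log p(w \mid d)
```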
So you can see, in this retrieval function,
we actually know the count
of the word in the query.
So, the only thing that we don't know
is this document language model.
Therefore, we can convert
the retrieval problem
into the problem of estimating
this document language model,
so that we can compute the probability of
each query word given by this document.
And different estimation methods here
would lead to different ranking functions.
This is just like how different
ways to place a document vector
in the vector space
would lead to a different ranking
function in the vector space model.
Here, different ways to estimate
this document language model
will lead to a different ranking
function for query likelihood.
[MUSIC]

This lecture is about
smoothing of language models.
In this lecture we're going to continue
talking about the probabilistic
retrieval model.
In particular, we're going to talk
about smoothing of language models
in the query likelihood
retrieval method.
So you have seen this slide
from a previous lecture.
This is the ranking function
based on the query likelihood.
Here we assume independence
in generating each query word,
and the formula would
look like the following,
where we take a sum over all of the query
words, and inside the sum there is
the log of the probability of a word given by
the document, or document language model.
So the main task now is to estimate
this document language model.
As we said before different methods for
estimating this model would lead
to different retrieval functions.
So, in this lecture we're going
to look into this in more detail.
So, how do I estimate this language model?
Well, the obvious choice would be
the Maximum Likelihood Estimate
that we have seen before.
And that is we're going to normalize
the word frequencies in the document.
And the estimated probability
would look like this.
This is a step function here.
Which means all the words
that have the same frequency
count will have an equal probability.
Here is another frequency count
that has a different probability.
Note that for words that have not
occurred in the document here,
they all have zero probability.
So we know this is just like the model that
we assumed earlier in the lecture, where
we assume the user would sample a word
from the document to formulate the query.
And there is no chance of sampling
any word that is not in the document.
And we know that's not good.
So how would we improve this?
Well, in order to assign
a non-zero probability
to words that have not been observed
in the document, we would have to take
away some probability mass from
the words that are observed in the document.
So for example here, we have to
take away some probability mass,
because we need some extra probability
mass for the unseen words.
Otherwise, they won't sum to 1.
So all these probabilities
must sum to 1.
So to make this transformation, and
to improve the maximum likelihood estimate
by assigning nonzero probabilities to
words that are not observed in the data,
we have to do smoothing. And smoothing
has to do with improving the estimate
by considering the possibility that,
if the author had been
asked to write more words for
the document,
the author might have written other words.
If you think about this factor,
then a smoothed language model
would be a more accurate
representation of the actual topic.
Imagine you have seen
the abstract of a research article.
Let's say this document is the abstract.
Right.
If we assume unseen
words in this abstract have
a probability of 0, that
would mean there's no chance
of sampling a word outside the abstract
to formulate a query.
But imagine a user who is interested in
the topic of this abstract. The user might
actually choose a word that is not in
the abstract to use as a query.
So obviously, if we had asked
this author to write more,
the author would have written
the full text of that article.
So smoothing of the language
model is an attempt
to try to recover the model for
the whole article.
And then of course we don't really have
knowledge about any words that are not observed
in the abstract, so that's why
smoothing is actually a tricky problem.
So let's talk a little more
about how to smooth a language model.
The key question here is what probability
should be assigned to those unseen words.
Right.
And
there are many different
ways of doing that.
One idea here, that's very useful for
retrieval is let the probability
of an unseen word be proportional
to its probability given by
a reference language model.
That means if you don't observe
the word in the data set,
we're going to assume that
its probability is kind of
governed by another reference language
model that we will construct.
It will tell us which unseen words
would have a higher probability.
In the case of retrieval,
a natural choice would be to
take the collection language model
as the reference language model.
That is to say, if you don't
observe a word in the document,
we're going to assume that
the probability of this word
would be proportional to the probability
of the word in the whole collection.
So, more formally,
we'll be estimating the probability of
a word given a document as follows.
If the word is seen in the document,
then the probability
would be a discounted maximum
likelihood estimate, p sub seen here.
Otherwise, if the word is not seen
in the document, we'll then let
the probability be proportional to the
probability of the word in the collection,
and here the coefficient alpha sub d is to
control the amount of probability
mass that we assign to unseen words.
Obviously all these
probabilities must sum to 1.
So, alpha sub d is
constrained in some way.
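In symbols, the smoothing scheme just described can be sketched as below. The notation is assumed here for illustration: c(w, d) is the count of w in d, p(w | C) is the collection language model, and V is the vocabulary. The second line is the sum-to-one constraint that ties alpha sub d to the seen-word probabilities.

```latex
p(w \mid d) =
\begin{cases}
  p_{\text{seen}}(w \mid d) & \text{if } c(w, d) > 0 \\
  \alpha_d \, p(w \mid C)   & \text{otherwise}
\end{cases}

\sum_{w \in V:\, c(w,d) > 0} p_{\text{seen}}(w \mid d)
  \;+\; \alpha_d \sum_{w \in V:\, c(w,d) = 0} p(w \mid C) \;=\; 1
```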
So, what if we plug in this
smoothing formula into our
query likelihood Ranking Function?
This is what we would get.
In this formula,
you can see, right, we have
this as a sum over all the query words.
And note that we have written it in the form
of a sum over the whole vocabulary.
You see here this is a sum over
all the words in the vocabulary,
but note that we have a count
of the word in the query.
So, in effect we are just taking
a sum over query words, right.
This is now a common way that
we will use because of its
convenience in some transformations.
So, as I said,
this is a sum over all the query words.
In our smoothing method,
we're assuming the words that are not
observed in the document have
a somewhat different form of probability.
And it's of this form.
So we're going to then decompose
this sum into two parts.
One sum is over all the query words
that are matched in the document.
That means in this sum,
all the words have a non-zero
probability in the document. Sorry,
it's a non-zero count of
the word in the document.
They all occur in the document.
And they also have to, of course,
have a non-zero count in the query.
So, these are the words that are matched.
These are the query words that
are matched in the document.
On the other hand, in this sum we are
taking the sum over all the words that
are query words not
matched in the document.
So they occur in the query due to this
term, but they don't occur in the document.
In this case,
these words have this probability because
of our assumption about the smoothing.
But here, these seen words
have a different probability.
Now we can go further by
rewriting the second sum
as a difference of two other sums.
Basically, the first sum is actually
the sum over all the query words.
Now we know that the original
sum is not over all the query words.
This is over all the query words that
are not matched in the document.
So here we pretend that it
is actually over all the query words.
So, we take a sum over
all the query words.
Obviously this sum has
extra terms
that are not in this sum,
because here we're taking the sum
over all the query words,
while there it's only those not matched in the document.
So in order to make them equal,
we have to then subtract another sum here.
And this is a sum over all the query
words that are matched in the document.
And this makes sense because here
we're considering all query words,
and then we subtract the query
words that are matched in the document.
That will give us the query words
that are not matched in the document.
And this is almost a reverse
process of the first step here.
And you might wonder
why we want to do that.
Well, that's because if we do this then
we'll have different forms
of terms inside these sums.
So, now we can see in this sum we have
all the
matched query words, matched in
the document, with this kind of terms.
Here we have another sum
over the same set of terms,
matched query terms in the document.
But inside the sum it's different.
But these two sums can clearly be merged.
So, if we do that, we'll get another form
of the formula that looks like
the following at the bottom here.
And note that this is very interesting,
because here we combine these two
sums over the query words matched
in the document into one sum here.
And the other sum is now decomposed
into two parts,
and these two parts look much simpler,
just because these
are the probabilities of unseen words.
But this formula is very interesting,
because you can see the sum is now over
all the matched query terms.
And just like in the vector space model,
we take a sum over terms
in the intersection of the query vector and
the document vector.
So it already looks a little
bit like the vector space model.
In fact, there is even more similarity here,
as we explain on this slide.
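The rewriting steps described above can be laid out as one derivation. This is a sketch in assumed notation: c(w, q) and c(w, d) are counts in the query and document, p(w | C) is the collection language model, V is the vocabulary, and n is the query length, so that the counts c(w, q) sum to n.

```latex
\begin{aligned}
\log p(q \mid d)
 &= \sum_{w \in V} c(w, q) \log p(w \mid d) \\
 &= \sum_{w:\, c(w,d) > 0} c(w, q) \log p_{\text{seen}}(w \mid d)
  + \sum_{w:\, c(w,d) = 0} c(w, q) \log \alpha_d \, p(w \mid C) \\
 &= \sum_{w:\, c(w,d) > 0} c(w, q) \log p_{\text{seen}}(w \mid d)
  + \sum_{w \in V} c(w, q) \log \alpha_d \, p(w \mid C)
  - \sum_{w:\, c(w,d) > 0} c(w, q) \log \alpha_d \, p(w \mid C) \\
 &= \sum_{w:\, c(w,d) > 0} c(w, q)
      \log \frac{p_{\text{seen}}(w \mid d)}{\alpha_d \, p(w \mid C)}
  + n \log \alpha_d
  + \sum_{w \in V} c(w, q) \log p(w \mid C)
\end{aligned}
```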
[MUSIC]

[SOUND]
So, I showed you how we rewrite the query
likelihood retrieval function into
a form that looks like
the formula on this slide,
after we make the assumption about
smoothing the language model
based on the collection
language model.
Now, if we look at this rewriting,
it actually would give us two benefits.
The first benefit is that it helps us better
understand this ranking function.
In particular, we're going to show that
from this formula we can see that smoothing
with the collection language model will give
us something like TF-IDF weighting and
length normalization.
The second benefit is
that it also allows us to
compute the query likelihood
more efficiently.
In particular,
we see that the main part of the formula
is a sum over the matching query terms.
So, this is much better than if we
take the sum over all the words.
After we smooth the document
language model,
we essentially have nonzero
probabilities for all the words.
So, this new form of the formula is
much easier to score, or to compute.
It's also interesting to note
that the last term here
is actually independent of the document.
Since our goal is to
rank the documents for
the same query,
we can ignore this term for ranking.
Because it's going to be the same for
all the documents.
Ignoring it wouldn't affect
the order of the documents.
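The claim that the matched-term form ranks documents the same way can be checked numerically with a small sketch. To have concrete numbers, the code assumes one particular smoothing choice, a fixed-coefficient linear interpolation with `lam` playing the role of alpha sub d; the toy documents, the collection probabilities in `p_coll`, and all function names are hypothetical.

```python
import math
from collections import Counter

def full_log_likelihood(query, doc, p_coll, lam):
    """Sum of log p(w|d) over every query word, matched or not."""
    counts, n = Counter(doc), len(doc)
    return sum(
        math.log((1 - lam) * counts[w] / n + lam * p_coll[w]) for w in query
    )

def matched_term_score(query, doc, p_coll, lam):
    """Sum over matched query terms only, plus n * log(alpha_d)."""
    counts, n = Counter(doc), len(doc)
    alpha_d = lam  # assumption: interpolation makes alpha_d equal to lam
    score = len(query) * math.log(alpha_d)
    for w, c_wq in Counter(query).items():
        if counts[w] > 0:
            p_seen = (1 - lam) * counts[w] / n + lam * p_coll[w]
            score += c_wq * math.log(p_seen / (alpha_d * p_coll[w]))
    return score

def doc_independent_term(query, p_coll):
    """The last term: depends on the query and collection only."""
    return sum(math.log(p_coll[w]) for w in query)

# Hypothetical collection model and documents.
p_coll = {"presidential": 0.001, "campaign": 0.002, "update": 0.0005}
docs = {
    "d3": "news of presidential campaign".split(),
    "d4": "presidential campaign news presidential candidate".split(),
}
query = ["presidential", "campaign", "update"]
constant = doc_independent_term(query, p_coll)

for name, doc in docs.items():
    full = full_log_likelihood(query, doc, p_coll, 0.5)
    matched = matched_term_score(query, doc, p_coll, 0.5)
    print(name, full, matched + constant)
```

For each document the two printed values agree up to floating-point error, so dropping the document-independent term changes the scores but not the ranking.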
Inside the sum,
we also see that each matched
query term would contribute a weight.
And this weight actually,
is very interesting because it
looks like TF-IDF weighting.
First, we can already see it has
a frequency of the word in the query,
just like in the vector space model.
When we take a dot product,
we see the word frequency in
the query show up in such a sum.
And so naturally,
this part will correspond to
the vector element from
the document vector.
And here, indeed, we can see it actually
encodes a weight that has a similar
effect to TF-IDF weighting.
I let you examine it.
Can you see it?
Can you see which part is capturing TF,
and which part is capturing IDF weighting?
So if you want, you can pause
the video to think more about it.
So, have you noticed that this p sub seen
is related to the term frequency,
in the sense that if a word occurs
very frequently in the document,
then the estimated probability here
will tend to be larger.
Right?
So, this means this term is really
doing something like TF weighting.
Have you also noticed that
this term in the denominator
is actually achieving the effect of IDF?
Why?
Because this is the popularity of the term
in the collection, but
it's in the denominator.
So, if the probability in
the collection is larger,
then the weight is actually smaller.
And this means a popular term
will actually have a smaller weight.
And this is precisely what
IDF weighting is doing.
Only now, we have
a different form of TF and IDF.
Remember, IDF has a
logarithm of document frequency, but
here we have something different.
But intuitively,
it achieves a similar effect.
Interestingly, we also have something
related to the length normalization.
Again, can you see which factor is
related to the document length in this formula?
Well, I just said that this
term is related to IDF weighting,
this collection probability.
But it turns out this term here
is actually related to
document length normalization.
In particular,
alpha sub d might be related to document length.
So, it encodes how much probability
mass we want to give to unseen words,
how much smoothing we are allowed to do.
Intuitively, if a document is long,
then we need to do less smoothing,
because we can assume that
it is large enough that
we have probably observed all of the words
that the author could have written.
But if the document is short,
the number of unseen words is expected to be large,
and we need to do more smoothing.
It's likely that there are words that have
not been written yet by the author.
So, this term appears to
penalize long documents, in that alpha sub d
tends to be larger for a short
document than for a long document.
But note that alpha sub d also occurs here.
And so,
this may not actually be necessarily
penalizing long documents, and
in fact the effect is not so clear here.
But as we will see later, when we
consider some specific smoothing methods,
it turns out that they do
penalize long documents,
just like in TF-IDF weighting and
the document length normalization formulas
in the vector space model.
So, that's a very interesting
observation because it means
we don't even have to think about
the specific way of doing smoothing.
We just need to assume that if we
smooth with this language model,
then we would have a formula that
looks like a TF-IDF weighting and
document length normalization.
What's also interesting is that we have
a very fixed form of the ranking function.
And see, we have not heuristically
put a logarithm here.
In fact, you can think about
why we would have a logarithm here.
If you look at the assumptions that
we have made, it will be clear.
It's because we have used a logarithm
of query likelihood for scoring.
And, we turned the product into
a sum of logarithm of probability.
And, that's why we have this logarithm.
Note that if we only want to heuristically
implement a TF weighting and
IDF weighting, we don't necessarily
have to have a logarithm here.
Imagine if we drop this logarithm,
we would still have TF and IDF weighting.
But, what's nice with
probabilistic modeling is that we
are automatically given
a logarithm function here.
And that's basically
a fixed form of the formula that we did
not really have to heuristically design.
And in this case,
if you try to drop this logarithm
the model probably won't, won't work
as well as if you keep the logarithm.
So, a nice property of probabilistic
modeling is that by following some
assumptions and the probability rules,
we'll get a formula automatically.
And, the formula would have
a particular form, like in this case.
And, if we heuristically
design the formula,
we may not necessarily end up
having such a specific form.
So to summarize, we talked about the need
for smoothing a document language model.
Otherwise, it would give zero probability
to unseen words in the document.
And that's not good for
scoring a query with such an unseen word.
It's also necessary,
in general, to improve the
accuracy of estimating the model
representing the topic of this document.
The general idea of smoothing in retrieval
is to use the collection language model
to give us some clue about which unseen
word would have a higher probability.
That is, the probability of the unseen
word is assumed to be proportional
to its probability in the collection.
With this assumption, we've shown that we
can derive a general ranking formula for
query likelihood
that has the effect of TF-IDF weighting and
document length normalization.
We also see that through some rewriting,
the scoring of such ranking function,
is primarily based on sum of
weights on matched query terms,
just like in the vector space model.
But the actual ranking function
is given to us automatically by
the probability rules and
assumptions we have made,
unlike in the vector space model,
where we have to heuristically think
about the form of the function.
However, we still need
to address the question of
how exactly we should
smooth a document language model,
how exactly we should use
the reference language model based on
the collection to adjust
the probability of the maximum likelihood estimate.
And this is the topic
of the next lecture.
[MUSIC]

[SOUND]
This lecture is about the specific
smoothing methods for language models
used in Probabilistic Retrieval Model.
In this lecture we will continue
the discussion of language models for
information retrieval, particularly
the query likelihood retrieval method.
And we're going to talk about
the specific smoothing methods used for
such a retrieval function.
So, this is a slide from a previous
lecture where we show that with
query likelihood ranking and the smoothing
with the collection language model.
We end up having a retrieval function
that looks like the following.
So, this is the retrieval function,
based on these assumptions
that we have discussed.
You can see it's a sum of all
the matched query terms here.
And inside the sum it's
a count of term in the query,
and some weight for
the term in the document.
We have a TF-IDF weight here.
And then we have another constant here,
with n.
So clearly, if we want to implement this
function using a programming language,
we'll still need to figure
out a few variables.
In particular, we're going to
need to know how to estimate the
probability of a word exactly.
And how do we set alpha?
So in order to answer these questions,
we have to think about very specific
smoothing methods, and
that is the main topic of this lecture.
We're going to talk about
two smoothing methods.
The first is simple linear
interpolation with a fixed coefficient.
And this is also called Jelinek-Mercer
smoothing.
So the idea is actually very simple.
This picture shows how we estimate
a document language model by using
the maximum likelihood estimate,
which gives us word counts normalized by
the total number of words in the text.
The idea of using this method is to
maximize the probability
of the observed text.
As a result, if a word like network
is not observed in the text,
it's going to get zero probability,
as shown here.
So the idea of smoothing, then,
is to rely on the collection language model,
where this word is not going to have
a zero probability, to help us decide
what non-zero probability should
be assigned to such a word.
So, we can note that network has
a non-zero probability here.
So, in this approach what we do is
we do a linear interpolation between
the maximum likelihood estimate here
and the collection language model.
And this is controlled by
the smoothing parameter, lambda,
which is between 0 and 1.
So this is a smoothing parameter.
The larger lambda is, the more
smoothing we will have.
So by mixing them together, we achieve the
goal of assigning non-zero probability
to a word like network.
So let's see how it works for
some of the words here.
For example, if we compute
the smoothed probability for text.
Now, the maximum likelihood
estimate gives us 10 over 100,
and that's going to be here.
But the collection probability is this, so
we just combine them together
with this simple formula.
We can also see the word network,
which used to have zero probability,
is now getting a non-zero
probability of this value.
And that's because the count is going
to be zero for network here, but
this part is non-zero, and
that's basically how this method works.
If you think about this,
you can easily see now the alpha sub d
in this smoothing method is basically
lambda, because that's, remember,
the coefficient in front of
the probability of the word given by
the collection language model here, right?
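Here is a minimal sketch of this fixed-coefficient interpolation. The count of 10 out of 100 for "text" comes from the example above; the collection probabilities and the value of lambda are assumptions made up for illustration, and `jm_smooth` is a hypothetical helper name.

```python
def jm_smooth(count_w_d, doc_len, p_w_coll, lam):
    """Linear interpolation of the maximum likelihood estimate
    with the collection language model, using a fixed coefficient."""
    return (1 - lam) * count_w_d / doc_len + lam * p_w_coll

lam = 0.1                  # assumed smoothing parameter
p_text_coll = 0.001        # assumed p("text" | collection)
p_network_coll = 0.001     # assumed p("network" | collection)

p_text = jm_smooth(10, 100, p_text_coll, lam)       # seen word
p_network = jm_smooth(0, 100, p_network_coll, lam)  # unseen word

print(p_text)     # about 0.0901
print(p_network)  # about 0.0001, which is lam * p_network_coll
```

For the unseen word the first part vanishes, so its probability is exactly lambda times the collection probability, which is why alpha sub d equals lambda in this method.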
Okay, so
this is the first smoothing method.
The second one is similar, but it has
a dynamic coefficient for linear interpolation.
It's often called Dirichlet prior or
Bayesian smoothing.
So again here, we face the problem of
zero probability for an unseen word like network.
Again we'll use the collection
language model, but
in this case we're going to combine
them in a somewhat different way.
The formula first can be seen as
an interpolation of the maximum likelihood
estimate and the collection
language model as before,
as in the JM smoothing method.
Only now the coefficient
is not lambda, a fixed number, but
a dynamic coefficient in this form,
where mu is a parameter;
it's a non-negative value.
And you can see if we
set mu to a constant,
the effect is that a long document would
actually get a smaller coefficient here.
Right?
Because a long document
will have a longer length.
Therefore, the coefficient
is actually smaller.
And so a long document would have
less smoothing, as we would expect.
So this seems to make more sense
than fixed coefficient smoothing.
Of course,
this part would be of this form, so
that the two coefficients would sum to 1.
Now, this is one way to understand
that this is smoothing.
Basically, it means that it's
a dynamic coefficient interpolation.
There is another way to
understand this formula.
Which is even easier to remember and
that's this side.
So it's easy to see we can rewrite
this modern method in this form.
Now, in this form, we can easily see
what change we have made to the maximum
estimator, which would be this part,
right?
So it normalizes the count
by the top elements.
So, in this form, we can see what we did,
is we add this to the count of every word.
So, what does this mean?
Well, this is basically
something relative to the probability
of the word in the collection..
And we multiply that by the parameter mu.
And when we combine this
with the count here,
essentially we are adding pseudo
counts to the observed text.
We pretend every word,
has got this many pseudocount.
So the total count would be
the sum of these pseudo counts and
the actual count of
the word in the document.
As a result, in total, we would
have added this many pseudo counts.
Why?
Because if you take the sum of
this one over all the words,
the probabilities of the words
would sum to 1, and that gives us just mu.
So this is the total number of
pseudo counts that we added.
And so
these probabilities would still sum to 1.
So in this case, we can easily
see the method is essentially to
add these as pseudo counts to this data.
We pretend we actually augment the data
by including some pseudo data defined
by the collection language model.
As a result, we have more counts.
The total count for
a word would be like this.
And, as a result,
even if a word has zero count here,
say we have zero count here,
it would still have a non-zero
count because of this part, right?
And so this is how this method works.
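As a rough sketch of the pseudo-count view just described, the smoothed probability can be written as p(w|d) = (c(w,d) + mu * p(w|C)) / (|d| + mu). The Python below is only an illustration under assumptions: the function name `dirichlet_prob`, the toy document, the collection probabilities, and the value mu = 100 are all made up for this example and are not taken from the slides.

```python
from collections import Counter


def dirichlet_prob(word, doc_counts, doc_len, p_coll, mu):
    """Dirichlet prior smoothing: add mu * p(w|C) pseudo counts to each word."""
    return (doc_counts.get(word, 0) + mu * p_coll[word]) / (doc_len + mu)


# Hypothetical toy document: "text" is observed 10 times, "network" never.
doc = ["text"] * 10 + ["mining"] * 5 + ["data"] * 5
doc_counts = Counter(doc)
doc_len = len(doc)

# Hypothetical collection language model (probabilities sum to 1).
p_coll = {"text": 0.1, "mining": 0.1, "data": 0.2, "network": 0.1, "the": 0.5}
mu = 100.0  # assumed value, chosen only for illustration

probs = {w: dirichlet_prob(w, doc_counts, doc_len, p_coll, mu) for w in p_coll}

# The coefficient on the collection model depends on the document length.
alpha_d = mu / (doc_len + mu)

print(probs["text"], probs["network"], alpha_d)
```

In this sketch the unseen word "network" gets probability alpha_d * p(network|C), which is non-zero, and the smoothed probabilities over the vocabulary still sum to 1 because exactly mu pseudo counts were added in total.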
Let's also take a look at
this specific example here.
All right, so for text again,
we will have 10 as the original count
that we actually observe, but
we also added some pseudo counts.
And so, the probability of
text would be of this form.
Naturally the probability of
network would be just this part.
And so, here you can also
see what's alpha sub d here.
Can you see it?
If you want to think about it,
you can pause the video.
Have you noticed that this
part is basically alpha sub d?
So we can see in this case alpha sub d
does depend on the document, right?
Because this length depends on the
document, whereas in linear interpolation,
the JM smoothing method,
this is a constant.
[MUSIC]

[SOUND]
So let's plug these smoothing methods
into the ranking function to
see what we will get, okay?
This is the general ranking function for
smoothing with a collection language
model, and you have seen this before.
And now we have a very specific smoothing
method, the JM smoothing method.
So now let's see what's the value for
alpha sub d here.
And what's the value for p sub seen here?
Right, so we need to decide this
in order to figure out the exact
form of the ranking function.
And we also need to figure
out, of course, alpha.
So let's see.
Well, this ratio is basically this,
right? So
here, this is the probability
of a seen word on the top,
and this is the probability
of an unseen word, or
in other words, basically lambda
times the collection probability here.
So it's easy to see that
this can then be rewritten as this.
Very simple.
So we can plug this into here.
And then here, what's the value for alpha?
What do you think?
So it would be just lambda, right?
And what would happen if we plug in
this value here, if this is lambda.
What can we say about this?
Does it depend on the document?
No, so it can be ignored.
Right?
So we'll end up having this
ranking function shown here.
And in this case you can easily see
this is precisely a vector space
model, because this part is
a sum over all the matched query terms,
and this is an element of the query vector.
What do you think is the element
of the document vector?
Well, it's this, right?
So that's our document vector element.
And let's further examine what's
inside of this logarithm.
Well, it's one plus this.
So what's inside is at least 1,
and the log of this
is going to be non-negative, right?
And this is a parameter,
so lambda is a parameter.
And let's look at this.
Now this is a TF.
Now we see very clearly
this TF weighting here.
And the larger the count is,
the higher the weighting will be.
We also see IDF weighting,
which is given by this.
And we see document length
normalization here.
So all these heuristics
are captured in this formula.
What's interesting is that
we have got this
weighting function automatically
by making various assumptions,
whereas in the vector space model,
we had to go through that heuristic
design in order to get this.
And in this case note that
there's a specific form,
and we can see whether this
form actually makes sense.
All right, so what do you think
is the denominator here?
This is the length of the document,
the total number of words,
multiplied by the probability of the word
given by the collection, right?
So this actually can be interpreted
as the expected count of a word
if we're going to draw words
from the collection language model,
and we're going to draw as many as
the number of words in the document.
If you do that,
the expected count of a word, w,
would be precisely given
by this denominator.
So this ratio basically
is comparing the actual count here,
the actual count of the word in the
document, with the expected count given
by this product if the word is in fact
following the distribution in the collection.
And if this count is larger than
the expected count in this part,
this ratio would be larger than one.
So that's actually a very
interesting interpretation, right?
It's very natural and intuitive,
it makes a lot of sense.
And this is one advantage of using
this kind of probabilistic reasoning
where we have made explicit assumptions.
And, we know precisely why
we have a logarithm here.
And, why we have these probabilities here.
And we also have a formula that
intuitively makes a lot of sense and
does TF-IDF weighting and
document length normalization.
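To make the JM ranking function concrete, here is a minimal sketch of the form discussed above: a sum over matched query terms of c(w,q) * log(1 + ((1 - lambda) / lambda) * c(w,d) / (|d| * p(w|C))). The function name `jm_score`, the toy documents, the collection probabilities, and lambda = 0.1 are assumptions made up for illustration.

```python
import math
from collections import Counter


def jm_score(query, doc, p_coll, lam):
    """Query likelihood score with JM smoothing (document-independent terms dropped)."""
    q_counts = Counter(query)
    d_counts = Counter(doc)
    d_len = len(doc)
    score = 0.0
    for w, cq in q_counts.items():
        cd = d_counts.get(w, 0)
        if cd == 0:
            continue  # only matched query terms contribute
        expected = d_len * p_coll[w]  # expected count of w in a document this long
        score += cq * math.log(1.0 + (1.0 - lam) / lam * cd / expected)
    return score


# Hypothetical collection probabilities and documents of equal length.
p_coll = {"text": 0.01, "mining": 0.01, "the": 0.1}
query = ["text", "mining"]
d1 = ["text"] * 5 + ["the"] * 5
d2 = ["text"] * 1 + ["the"] * 9
d3 = ["the"] * 10

s1 = jm_score(query, d1, p_coll, lam=0.1)
s2 = jm_score(query, d2, p_coll, lam=0.1)
s3 = jm_score(query, d3, p_coll, lam=0.1)
print(s1, s2, s3)
```

Here the ratio of actual count to expected count plays the role of TF, IDF, and length normalization at once: a higher count raises the score, a rarer word (smaller p(w|C)) raises it, and a longer document raises the expected count and so lowers it.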
Let's look at
Dirichlet prior smoothing.
It's very similar to
the case of JM smoothing.
In this case,
the smoothing parameter is mu and
that's different from
lambda that we saw before.
But the format looks very similar.
The form of the function
looks very similar.
So we still have linear interpolation here.
And when we compute this ratio,
we will find that
the ratio is equal to this.
And what's interesting here is that we
are doing another comparison here now.
We're comparing the actual count
with the expected count of the word
if we sampled mu words according to
the collection word probability.
So note that, interestingly, we don't
even see the document length here,
unlike in the JM model.
All right, so this of course
should be plugged into this part.
So you might wonder,
where is the document length?
Interestingly, the document length
is here in alpha sub d, so
this would be plugged into this part.
As a result, what we get is
the following function here, and
this is again a sum over
all the matched query words.
And we again see
the query term frequency here.
And you can interpret this as
the element of a document vector,
but this is no longer
a simple dot product, right?
Because we have this part,
and note that n is the length of the query,
right?
So that just means if
we score this function,
we have to take a sum over
all the query words, and
then do some adjustment of
the score based on the document.
But it's still clear
that it does document length
normalization, because this length
is in the denominator, so
a longer document will
have a lower weight here.
And we can also see it has TF here
and also IDF.
Only that this time the form of the
formula is different from the previous one
in the JM one.
But intuitively it still implements TF-IDF
weighting and document length normalization.
Again, the form of the function is dictated
by the probabilistic reasoning and
assumptions that we have made.
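The Dirichlet prior ranking function described above can be sketched the same way: a sum over matched query terms of c(w,q) * log(1 + c(w,d) / (mu * p(w|C))), plus the adjustment n * log(mu / (mu + |d|)), where n is the query length. The function name `dirichlet_score`, the toy data, and mu = 50 are assumptions for illustration only.

```python
import math
from collections import Counter


def dirichlet_score(query, doc, p_coll, mu):
    """Query likelihood score with Dirichlet prior smoothing."""
    q_counts = Counter(query)
    d_counts = Counter(doc)
    n = len(query)    # query length
    d_len = len(doc)  # document length
    # Document-level adjustment: this is where document length enters.
    score = n * math.log(mu / (mu + d_len))
    for w, cq in q_counts.items():
        cd = d_counts.get(w, 0)
        if cd > 0:
            # Actual count vs. expected count if mu words were sampled from p(w|C).
            score += cq * math.log(1.0 + cd / (mu * p_coll[w]))
    return score


# Hypothetical data: both documents match the query equally often,
# but the second one is much longer.
p_coll = {"text": 0.01, "mining": 0.01, "the": 0.1}
query = ["text", "mining"]
short_doc = ["text", "mining"] + ["the"] * 8
long_doc = ["text", "mining"] + ["the"] * 98

short_score = dirichlet_score(query, short_doc, p_coll, mu=50.0)
long_score = dirichlet_score(query, long_doc, p_coll, mu=50.0)
print(short_score, long_score)
```

Because of the second term, the score is not a simple dot product: after summing over the matched query words, the score is adjusted once per document, and the longer document ends up with the lower score.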
Now there are also
disadvantages of this approach.
And that is, there's no guarantee
that such a form
of the formula will actually work well.
So if we look at this retrieval function,
although it does TF-IDF weighting and
document length normalization, for
example, it's unclear whether
we have sub-linear transformation.
Fortunately, we can see there
is a logarithm function here.
So we do have the sub-linear
transformation, but
we did not intentionally do that.
That means there's no guarantee that
we will end up in this way.
Suppose we don't have logarithm,
then there's no sub-linear transformation.
As we discussed before, perhaps
the formula is not going to work so well.
So that's an example of the gap
between a formal model like this and
the relevance that we have to model,
which is really a subjective
notion that is tied to users.
So it doesn't mean we cannot fix this.
For example, imagine if we did
not have this logarithm, right?
We can heuristically
add one,
or we can even add a double logarithm.
But then, it would mean that the function
is no longer a proper probabilistic model.
So the consequence of
the modification is no
longer as predictable as
what we have been doing now.
So, that's also why, for example,
BM25 remains very competitive, and
it's still an open challenge how to use
probabilistic models to derive
a better model than BM25.
In particular, how do we use query
likelihood to derive a model
that would work consistently
better than BM25?
Currently we still cannot do that.
It's still an interesting open question.
So to summarize this part, we've talked
about two smoothing methods:
Jelinek-Mercer, which is doing the fixed
coefficient linear interpolation, and
Dirichlet prior, which adds pseudo
counts to every word and is doing adaptive
interpolation in that the coefficient
would be larger for shorter documents.
In both cases we can see, by using these
smoothing methods, we will be able to
reach a retrieval function where
the assumptions are clearly articulated.
So they are less heuristic.
Experimental results also show
that these retrieval functions
are very effective, and they are comparable
to BM25 or pivoted length normalization.
So this is a major advantage
of probabilistic models,
where we don't have to do
a lot of heuristic design.
Yet in the end we naturally
implemented TF-IDF weighting and
document length normalization.
Each of these functions also has
precisely one smoothing parameter.
In this case of course we still need
to set this smoothing parameter.
There are also methods that can be
used to estimate these parameters.
So overall, this shows that
by using a probabilistic model,
we follow very different strategies
than the vector space model.
Yet, in the end, we end up with
some retrieval functions that
look very similar to
the vector space model,
with some advantages in having
assumptions clearly stated
and the form dictated
by a probabilistic model.
Now, this also concludes our discussion of
the query likelihood probabilistic model.
And let's recall what
assumptions we have made
in order to derive the functions
that we have seen in this lecture.
Well we basically have made four
assumptions that I listed here.
The first assumption is that the relevance
can be modeled by the query likelihood.
And the second assumption we've made is
that query words are generated independently;
that allows us to decompose
the probability of the whole query
into a product of probabilities
of all the words in the query.
And then,
the third assumption that we have made is,
if a word is not seen in
the document, we're going to let
its probability be proportional to
its probability in the collection.
That's smoothing with
a collection language model.
And finally, we made one of these
two assumptions about the smoothing.
So we either used JM smoothing or
Dirichlet prior smoothing.
If we make these four assumptions
then we have no choice but
to take the form of the retrieval
function that we have seen earlier.
Fortunately the function has a nice
property in that it implements TF-IDF
weighting and document length
normalization, and
these functions also work very well.
So in that sense,
these functions are less heuristic
compared with the vector space model.
And there are many extensions of
this basic model, and
you can find the discussion of them in
the reference at the end of this lecture.
[MUSIC]

[SOUND] This lecture is about
the Feedback in Text Retrieval.
So, in this lecture,
we're going to continue the discussion
on text retrieval methods.
In particular, we're going to talk
about Feedback in Text Retrieval.
This is a diagram that shows
the retrieval process.
We can see the user would
type in a query, and
then the query would be sent
to a Retrieval Engine or
search engine, and
the engine would return results.
These results would be shown to the user.
After the user has seen these results,
the user can actually make judgments.
So for example, the user may say,
well, this is good and
this document is not very useful.
This is good again, et cetera.
Now this is called a relevance judgment or
Relevance Feedback, because we've
got some feedback information from
the user based on the judgments.
This can be very useful for the system
to learn what exactly is
interesting to the user.
So the feedback module would
then take this as input and
also use the document collection
to try to improve ranking.
Typically, it would involve
updating the query.
So the system can now rank the results
more accurately for the user.
So this is called Relevance Feedback.
The feedback is based on relevance
judgements made by the users.
Now these judgements are reliable, but
the users generally don't want to make
extra effort, unless they have to.
So the downside is that it involves
some extra effort by the user.
There is another form of feedback
called Pseudo Relevance Feedback,
or blind feedback, also
called automatic feedback.
In this case, you can see, once
the results are returned,
we in fact don't have to involve users.
So you can see there's
no user involved here.
And we simply assume the top-ranked
documents to be relevant.
Let's say
we have assumed the top ten are relevant.
And then we will use these assumed
relevant documents to learn and
to improve the query.
Now you might wonder, you know,
how could this help if we simply assume
the top-ranked documents to be relevant?
Well, you can imagine these top-ranked
documents are actually similar to relevant
documents; even if they are not relevant,
they look like relevant documents.
So it's possible to learn some related
terms to the query from this set.
In fact, you may recall that we
talked about using language model to
analyze word association to learn
related words to the word computer.
Right?
And then what we did is first,
use computer to retrieve all
the documents that contain computer.
So, imagine now the query
here is computer.
Right?
And then the results will be those
documents that contain computer.
And what we can do then is
to take the top n results.
They match computer very well, and
we're going to count
the terms in this set, and then we're
going to use the background
language model to choose the terms
that are frequent in this set
but not frequent in
the whole collection.
So, if we make a contrast between
these two, what we can find is that
we'll learn some related terms to the
word computer, as we have seen before.
And these related words can then be added
to the original query to expand the query.
And this would help us retrieve documents
that don't necessarily match computer,
but match other words like program and
software.
So this is effective for
improving the search results.
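As a rough sketch of this idea, the code below counts terms in the assumed-relevant top documents and contrasts them with a background model, keeping terms that are frequent in the feedback set but not in the whole collection. Scoring by the ratio p(w|feedback set) / p(w|C) is just one simple choice made here for illustration; the function name `expansion_terms`, the toy documents, and the collection probabilities are all hypothetical.

```python
from collections import Counter


def expansion_terms(top_docs, p_coll, query_terms, k):
    """Pick k expansion terms that are frequent in top_docs but rare in the collection."""
    counts = Counter(w for doc in top_docs for w in doc)
    total = sum(counts.values())
    scored = []
    for w, c in counts.items():
        if w in query_terms:
            continue  # the original query terms are already in the query
        p_feedback = c / total
        scored.append((p_feedback / p_coll[w], w))  # contrast with the background
    scored.sort(reverse=True)
    return [w for _, w in scored[:k]]


# Hypothetical top-ranked results for the query "computer".
top_docs = [
    ["computer", "software", "the", "program"],
    ["computer", "program", "the", "the"],
    ["computer", "software", "the", "hardware"],
]
# Hypothetical background (collection) language model.
p_coll = {"computer": 0.01, "software": 0.005, "program": 0.005,
          "hardware": 0.005, "the": 0.3}

terms = expansion_terms(top_docs, p_coll, query_terms={"computer"}, k=2)
expanded_query = ["computer"] + terms
print(expanded_query)
```

In this toy setting the common word "the" is frequent in the feedback set but also frequent in the collection, so it is not selected, while "software" and "program" are.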
But of course, pseudo relevancy
feedback is completely unreliable.
We have to arbitrarily set a cutoff.
So there is also something in
between called Implicit Feedback.
In this case, what we do,
we do involve users, but
we don't have to ask
users to make judgements.
Instead, we are going to observe how the
user interacts with the search results.
So, in this case,
we're going to look at the clickthroughs.
So the user clicked on this one and
the user viewed this one.
And the user skipped this one and
the user viewed this one again.
Now this also is a clue about whether
a document is useful to the user, and
we can even assume that we're going to use
only the snippet here in this document,
the text that's actually seen by the user,
instead of the actual document
of this entry in the link.
The link there, say in web search,
may be broken, but it doesn't matter.
If the user tries to fetch this document
because of the displayed text,
we can assume this displayed text is
probably relevant or interesting to the user,
so we can learn from such information.
And this is called Implicit Feedback and
we can again,
use the information to update the query.
This is a very important technique
used in modern search engines.
You know, think about Google and Bing;
they can collect a lot of user activities
while they are serving us.
Right?
So they would observe what documents we
click on, what documents we skip.
And this information is very valuable, and
they can use this to
improve the search engine.
So to summarize,
we have talked about three kinds of
feedback here. There is relevance feedback,
where the user makes explicit judgements;
it takes some user effort, but
the judgement
information is reliable.
We talked about pseudo feedback, where
we simply assume the top-ranked documents
to be relevant;
we don't have to involve the user.
Therefore, we could do
that actually before
we return the results to the user.
And the third is implicit feedback,
where we use clickthroughs.
Here we do involve users, but
the user doesn't have to make
explicit effort to make judgements.
[MUSIC]

[SOUND] This lecture is about
the feedback in the vector space model.
In this lecture, we continue talking
about feedback in text retrieval.
Particularly, we're going to talk about
feedback in the vector space model.
As we have discussed before, in
the case of feedback the task of
a text retrieval system is to learn from
examples to improve retrieval accuracy.
We will have positive examples;
those are the documents that
are assumed to be relevant or
judged to be relevant, or
the documents that are viewed by users.
We also have negative examples; those
are documents known to be non-relevant.
They can also be the documents
that are skipped by users.
The general method in
the vector space model for
feedback is to modify our query vector.
Now we want to place the query vector in
a better position to make it more accurate,
and what does that mean exactly?
Well, if you think about the query vector
that would mean you would have to do
something to vector elements.
And in general that would
mean we might add new terms.
We might adjust weights of old terms or
assign weights to new terms.
And as a result in general
the query will have more terms so
we often call this query expansion.
The most effective method in the vector
space model of feedback is called Rocchio
feedback which was actually
proposed several decades ago.
So, the idea is quite simple we illustrate
this idea by using a two-dimensional
display of all the documents in
the collection and also the query vector.
So, now we can see
the query vector is here in
the center and
these are all of the documents.
So when we use a query vector and
use a similarity function to
find the most similar documents.
We are basically drawing a circle here and
then these documents would be
basically the top-ranked documents.
And these pluses are relevant documents,
right?
For example,
this one is relevant, etc.
And then these minuses
are negative documents like this.
So our goal here is trying
to move this query vector to some position
to improve the retrieval accuracy.
Looking at this diagram,
where do you think
we should move the query vector so that
we can improve the retrieval accuracy?
Intuitively, where do you want
to move the query vector to?
If you want to think more
you can pause the video.
Now if you think about
this picture you can realize that
in order to work well in this case
you want the query vector to be as close
to the positive vectors as possible.
That means, ideally you want to place
the query vector somewhere here or
we want to move the query
vector closer to this point.
Now, so what exactly is this point?
Well, if you want these relevant
documents to be ranked on the top
you want this to be in the center of
all of these relevant documents, right?
Because then if you draw
a circle around this one
you get all these relevant documents.
So that means we can move the query
vector toward the centroid of
all the relevant document vectors.
And this is basically the idea of Rocchio.
Of course, you can then also consider
the centroid of negative documents,
and we want to move away from
the negative documents.
Now geometrically we're
talking about moving a vector
closer to some other vector and
away from other vectors.
Algebraically it just means
we have this formula.
Here you can see this is
original query vector and
this average basically is the centroid
vector of relevant documents.
When we take the average
over these vectors
then we're computing
the centroid of these vectors.
And similarly this is the average of
the non-relevant document vectors, so
it's the centroid of non-relevant documents.
And we have these three parameters here,
alpha, beta and gamma.
They're controlling
the amount of movement.
When we add these two vectors together,
we're moving the query vector closer
to the centroid, all right?
When we subtract this part, we
move the query vector away from that
centroid, so
this is the main idea of Rocchio feedback.
And after we have done this, we
will get a new query vector
which we can use to score documents.
This new query vector will then
reflect the move of the
original query vector toward
the relevant centroid vector and
away from the non-relevant
centroid vector, okay?
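The update just described can be written as: new query = alpha * original query + beta * centroid of relevant vectors - gamma * centroid of non-relevant vectors. The following is a minimal sketch using sparse term-weight dictionaries; the function names, the toy vectors, and the parameter values are assumptions chosen for illustration, not the numbers on the slide.

```python
def centroid(vectors):
    """Average a list of sparse term-weight vectors (dicts)."""
    if not vectors:
        return {}
    total = {}
    for vec in vectors:
        for term, weight in vec.items():
            total[term] = total.get(term, 0.0) + weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def rocchio(query, relevant, non_relevant, alpha, beta, gamma):
    """Move the query toward the relevant centroid and away from the non-relevant one."""
    pos = centroid(relevant)
    neg = centroid(non_relevant)
    terms = set(query) | set(pos) | set(neg)
    return {
        t: alpha * query.get(t, 0.0) + beta * pos.get(t, 0.0) - gamma * neg.get(t, 0.0)
        for t in terms
    }


# Hypothetical vectors with made-up weights.
query = {"news": 1.0, "campaign": 1.0}
relevant = [
    {"news": 1.0, "campaign": 2.0, "presidential": 3.0},
    {"news": 1.0, "campaign": 2.0, "presidential": 1.0},
]
non_relevant = [{"news": 1.0, "food": 4.0}]

new_query = rocchio(query, relevant, non_relevant, alpha=1.0, beta=1.0, gamma=0.5)
print(new_query)
```

In this sketch the new query gains the term "presidential" from the relevant centroid and gets a negative weight on "food" from the non-relevant centroid. Some implementations clip negative weights to zero; that step is left out here.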
So let's take a look at an example, right?
This is the example that we have seen
earlier, only that instead of the display
of the actual documents I only showed the
vector representation of these documents.
We have five documents here, and we have
two relevant documents here, right?
They are displayed in red, and
these are the term vectors.
Now, I just assumed some TF-IDF weights;
a lot of terms have
zero weights, of course.
These are negative documents; there
are two here, and there is another one here.
Now in this Rocchio method we
first compute the centroid of
each category, so let's see.
Look at the centroid of
the positive documents;
it's very easy to see.
We just add this with this one,
the corresponding elements,
and that's down here, and take the average.
And then we're going to add
the corresponding elements and
then just take the average, right?
So we do this for all these.
In the end, what we have is this one.
This is the average vector of these two, so
it's the centroid of these two, right?
Let's also look at the centroid
of the negative documents.
This is basically the same; we're going to
take the average of three elements.
And these are the corresponding
elements in these three vectors, and
so on and so forth.
So in the end, we have this one.
Now, in the Rocchio feedback
method we're going to combine all
these with original query vector,
which is this.
So now let's see how we
combine them together.
Well, that's basically this, right?
So we have a parameter alpha controlling
the original query term weight; that's 1.
And now we have beta to control
the influence of the positive
centroid vector weight; that's
1.5, which comes from here, right?
So this goes here, and
we also have this negative weight here,
controlled by gamma, and
this weight comes from, of
course, the negative centroid here.
And we do exactly the same for
other terms each is for one term.
And this is our new vector.
And we're going to use this new query
vector, this one, to rank the documents.
You can imagine what would happen, right?
Because of the movement, this one
would match these red
documents much better,
because we moved this
vector closer to them, and
it's going to penalize these black
documents, these non-relevant documents.
So this is precisely what
we want from feedback.
Now of course, if we apply this method in
practice we will see one potential problem,
and that is the original query has
only four terms that are non-zero.
But after we do query expansion,
you can imagine we'll have many
terms that would have non-zero weights.
So the calculation would
have to involve more terms.
In practice,
we often truncate this vector and
only retain the terms with
the highest weights.
So let's talk about how we
use this method in practice.
I just mentioned that we often truncate
the vector and consider only a small number
of words that have the highest
weights in the centroid vector.
This is for efficiency.
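Truncation of an expanded query vector can be sketched in a few lines. The helper name `truncate_query`, the toy weights, and the cutoff k = 3 are assumptions for illustration; how many terms to keep is a tuning choice.

```python
import heapq


def truncate_query(vector, k):
    """Keep only the k terms with the highest weights."""
    top = heapq.nlargest(k, vector.items(), key=lambda item: item[1])
    return dict(top)


# Hypothetical expanded query vector after feedback.
expanded = {"news": 1.5, "campaign": 3.0, "presidential": 2.0,
            "food": -2.0, "the": 0.1}

small_query = truncate_query(expanded, k=3)
print(small_query)
```

With fewer non-zero terms in the query, scoring touches fewer entries of the inverted index, which is where the efficiency gain comes from.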
I also say here that negative
examples, or non-relevant examples,
tend not to be very useful, especially
compared with positive examples.
Now you can think about why.
One reason is because negative documents
tend to distract the query in
all directions, so when you take
the average it doesn't really tell you
where exactly it should be moving to.
Whereas positive documents tend
to be clustered together, and
they will point you to a
consistent direction.
So that also means that sometimes we
don't have to use those negative examples,
but note that in some cases,
in difficult queries where
most top-ranked results are negative,
negative feedback
actually is very useful.
Another thing is to avoid
over-fitting that means we have to
keep relatively high weight
on the original query terms.
Why?
Because the sample that we see in
feedback is a relatively small sample.
We don't want to overly
trust the small sample and
the original query terms
are still very important.
Those terms are typed in by the user and
the user has decided that those
terms are most important.
So in order to prevent us
from over-fitting, or
to prevent topic
drift due to the bias toward
the feedback examples,
we generally would have to keep a pretty
high weight on the original terms, so
it is safe to do that.
And this is especially true for
pseudo relevance feedback.
Now this method can be used for
both relevance feedback and
pseudo relevance feedback.
In the case of pseudo feedback,
the parameter beta should be set to
a smaller value because
the relevant examples are only assumed
to be relevant; they're not as reliable
as in relevance feedback, right?
In the case of relevance feedback,
we obviously could use a larger value.
So, those parameters
still have to be set.
And the Rocchio method is
usually robust and effective.
It's still a very popular method for
feedback.
[MUSIC]

This lecture is about the web search.
In this lecture we
are going to talk about one of
the most important applications of
text retrieval, web search engines.
So let's first look at some
general challenges and
opportunities in web search.
Now, many information retrieval
algorithms had been developed
before the web was born.
So, when the web was born,
it created the best opportunity to apply
those algorithms to a major application
problem that everyone would care about.
So naturally, there had to be some
further extensions of the classical
search algorithms to address some new
challenges encountered in web search.
So here are some general challenges.
First, there is a scalability challenge:
how to handle the size of the web
and ensure completeness of
coverage of all the information,
and how to serve many users quickly
by answering all their queries.
All right, so that's one major challenge.
And before the web was born,
the scale of search was relatively small.
The second problem is that there
is low quality information,
and there is often spam.
The third challenge is
the dynamics of the web.
New pages are constantly created, and
some pages may be updated
very quickly.
So it makes it harder to
keep the index fresh.
So these are some of
the challenges that
we have to solve in order to
build a high quality web search engine.
On the other hand, there are also some
interesting opportunities that we can
leverage to improve search results.
There are many additional heuristics.
For example, you know, using links that
we can leverage to improve scoring.
Now the algorithms that we talked about,
such as the vector space model, are general
algorithms.
And they can be applied to any search
application, so that's the advantage.
On the other hand, they also don't take
advantage of special characteristics
of pages, or documents, in the specific
applications such as web search.
Web pages are linked with each other so
obviously the linking is something
that we can also leverage.
So because of these challenges and
opportunities there are new techniques
that have been developed for web search,
or due to the need of a web search.
One is parallel indexing and searching,
and this is to address the issue of
scalability. In particular,
Google's invention of MapReduce
is very influential, and
has been very helpful in that aspect.
Second, there are techniques
that are developed for
addressing the problem of spam,
so, spam detection.
We'll have to prevent those
spam pages from being ranked high.
And there are also techniques
to achieve robust ranking.
And we're going to use a lot
of signals to rank pages so
that it's not easy to spam the search
engine with particular tricks.
And the third line of
techniques is link analysis.
And these are techniques
that can allow us
to improve search results by
leveraging extra information.
And in general in web
search we're going to use
multiple features for ranking,
not just link analysis but
also exploiting all kinds of clues like
the layout of web pages or anchor text
that describes a link to another page.
So here's a picture showing the basic
search engine technologies.
Basically, this is the web on the left and
then user on the right side.
And we're going to help this
user get access to the web information.
And the first component is the crawler
that would crawl pages, and
the second component is the indexer
that will take these pages and
create an inverted index.
The third component is a retriever
that would use
the inverted index to answer the
user's query,
by talking to the user's browser.
And then, the search results would be
given to the user.
And then the browser
will show those results and
allow the user to
interact with the web.
So we're going to talk about
each of these components.
First we're going to talk about
the crawler, also called a spider or
a software robot, that would do something
like crawling pages on the web.
To build a toy crawler is relatively easy,
because you just need to start with a set
of seed pages, and then fetch pages from
the web and parse these pages for new links.
And then add them to the priority queue and
then just explore those additional links,
right?
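A toy crawler along these lines can be sketched as below. To keep the example runnable without network access, pages are fetched from a small in-memory dictionary (`FAKE_WEB`) that stands in for the web; the URLs, the `crawl` and `LinkParser` names, and the page contents are all hypothetical. The sketch also uses a plain first-in, first-out queue (breadth-first order) rather than a priority queue, which is a simplification.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect the href values of all <a> tags on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first toy crawler; fetch(url) returns HTML text or None."""
    frontier = deque(seeds)  # URLs waiting to be fetched
    seen = set(seeds)        # URLs already queued, to avoid re-fetching
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue  # e.g. the server did not respond
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages


# Hypothetical in-memory "web" used instead of real HTTP requests.
FAKE_WEB = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a> <a href="/">home</a>',
    "http://example.test/b": '<a href="/missing">gone</a>',
}

pages = crawl(["http://example.test/"], FAKE_WEB.get)
print(list(pages))
```

A real crawler would replace `FAKE_WEB.get` with an HTTP fetch and would need the robustness, politeness, and duplicate handling that a toy version like this ignores.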
But to build a real crawler
actually is tricky, and
there are some complicated issues
that we have to deal with.
For example, robustness:
what if the server doesn't respond?
What if there's a trap that generates
dynamically generated webpages that might
attract your crawler to keep
crawling the same site and
to fetch dynamically generated pages?
There's also the issue of crawling courtesy:
you don't want to overload one particular
server with many crawling requests.
And you have to respect
the robot exclusion protocol.
You also need to handle
different types of files.
There are images, PDF files,
all kinds of formats on the web.
And you have to also
consider URL extensions.
So, sometimes those are CGI scripts and,
you know, internal references, etc., and
sometimes you have JavaScript on the
page, which also creates challenges.
And you ideally should also
recognize redundant pages,
because you don't have to
duplicate those pages.
And finally, you may be interested
in discovering hidden URLs.
Those are URLs that may not be linked
to any page.
But if you truncate the URL to
a shorter path,
you might be able to get
some additional pages.
So, what are the major
crawling strategies?
In general, breadth-first is most common,
because it naturally
balances server load.
You would not keep probing
a particular server [INAUDIBLE].
Also parallel crawling is very natural,
because this task is very easy
to parallelize, and there are some
variations of the crawling task.
One interesting variation
is called focused crawling.
In this case we're going to crawl just
some pages about a particular topic,
for example, all pages about automobiles.
And this is typically
going to start with a query,
and then you can use the query
to get some results
from a major search engine.
And then you can start with those
results and gradually crawl more.
So one challenge in crawling is to find
the new pages that people have created,
and people probably are creating
new pages all the time, and this is
very challenging if the new pages have
not been actually linked to any old page.
If they are, then you can probably find
them by recrawling the older page.
So these are also some interesting
challenges that have to be solved.
And finally we might face the scenario of
incremental crawling or repeated crawling.
Right?
So first,
let's say you want to
build a web search engine.
You would first crawl
a lot of data from the web.
But then, once you
have collected all the data,
in the future you just need to crawl
the updated pages.
In general you don't have
to re-crawl everything, right?
It's not necessary.
So in this case
your goal is to minimize the resource overhead
by using minimum resources
to just crawl the updated pages.
So this is actually a very interesting
research question here.
And [INAUDIBLE] research
question is that there aren't
many standard algorithms [INAUDIBLE] for
doing this task.
Right?
But in general, you can imagine,
you can learn from the past experience.
Right.
So the two major factors that
you have to consider are first,
will this page be updated frequently?
And do I have to crawl this page again?
If the page is a static page
that hasn't been changed for
months you probably don't have
to re-crawl it everyday, right?
Because it's unlikely that it
will be changed frequently.
On the other hand, if it's, say, a
sports score page that gets
updated very frequently, then
you may need to re-crawl it, maybe
even multiple times on the same day.
The other factor to consider is,
is this page frequently accessed by users?
If it is,
that means it's a high utility page, and
thus it's more important to
ensure that such a page is fresh.
Compare it with another page that has
never been fetched by any users for
a year.
Even though that page
has been changed a lot,
it's probably not necessary to crawl that
page, or at least it's not as urgent as
maintaining the freshness of
pages frequently accessed by users.
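The two factors can be combined into a recrawl priority. The sketch below multiplies an estimated change rate by an access rate; that product is only an assumption made for illustration (the lecture notes that there are no standard algorithms here), and the page names and numbers are made up.

```python
def recrawl_priority(changes_per_day, accesses_per_day):
    """Hypothetical score: pages that change often AND are read often come first."""
    return changes_per_day * accesses_per_day

# Made-up statistics: page -> (estimated changes per day, user accesses per day).
pages = {
    "sports_scores": (24.0, 5000.0),
    "static_about": (0.01, 200.0),
    "abandoned_blog": (3.0, 0.0),
}

ranked = sorted(pages, key=lambda name: recrawl_priority(*pages[name]), reverse=True)
```

With these numbers the frequently changing page that nobody reads ends up last, which matches the intuition that its freshness is not urgent.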
So to summarize,
web search is one of the most important
applications of text retrieval.
And there are some new challenges,
particularly scalability,
efficiency, and quality of information.
There are also new opportunities
particularly, rich link information and
layout, et cetera.
A crawler is an essential component
of web search applications.
And, in general,
we can distinguish two scenarios.
One is initial crawling, and
here we want to have complete crawling
of the web if you are doing
a general search engine, or
focused crawling if you want to just
target a certain type of pages.
And then there is another scenario that's
incremental updating of the crawled data, or
incremental crawling.
In this case you need to
optimize the resource use,
trying to use minimum resources
to get the [INAUDIBLE]
[MUSIC].

[SOUND].
This lecture is about recommender systems.
So, so far we have talked about
a lot of aspects of search engines.
And we have talked about the problem
of search and the ranking problem,
different methods for ranking,
implementation of search engine and
how to evaluate the search engine,
et cetera.
This is partly because we know
that web search engines are,
by far, the most important
applications of text retrieval.
And they are the most useful tools
to help people convert big raw text
data into a small set
of relevant documents.
Another reason why we spend so
many lectures on search engines is because
many techniques used in search engines
are actually also very useful for
recommender systems,
which is the topic of this lecture.
And so overall the two systems
are actually well connected,
and there are many techniques
that are shared by them.
So this is a slide that you have
seen before when we talked about
the two different modes of
text access, pull and push.
And we mentioned that recommender
systems are the main systems to serve
users in the push mode, where
the system will take the initiative to
recommend information to the user, or to
push the relevant information to the user.
And this often works well when the user
has a relatively stable information need,
when the system has good
knowledge about what a user wants.
So a recommender system is sometimes
called a filtering system.
And it's because recommending
useful items to people is like
discarding or
filtering out the useless articles.
So in this sense,
they are kind of similar.
And in all the cases,
the system must make a binary decision.
And usually, there is a dynamic
source of information items,
and you have some knowledge about the
user's interest, and then the system would
make a delivery decision, whether
this item is interesting to the user.
And then if it is interesting then
the system would recommend the article to
the user.
So the basic filtering question here is
really, will this user like this item?
Will U like item X?
And there are two ways to answer this
question if you think about it, right?
One is to look at what items U likes, and
then we can see if X is
actually like those items.
The other is to look at who likes X, and
we can see if this user looks like
one of those users, or
like most of those users.
And these strategies can be combined.
If we follow the first strategy and
look at item similarity, in the case
of recommending text objects,
then we are talking about content-based
filtering or content-based recommendation.
If we follow the second strategy, then
we will compare users.
And in this case,
we're exploiting user similarity,
and the technique is often called
collaborative filtering.
So let's first look at
the content-based filtering system.
This is what a system would look like.
Inside the system, there would be
a binary classifier that would have some
knowledge about the user's interests, and
it's called the user interest profile.
It maintains the profile to keep
track of the user's interest.
And then there is a utility function to
guide the system to make decisions, and
I'll explain the utility
function in a moment.
It helps the system decide
where to set the threshold.
And then the accepted documents
will be those that have passed
the threshold according to the classifier.
There should also be an initialization
module that would take a user's input,
maybe from user-specified keywords
or a chosen category, et cetera.
And this is to feed the system
with an initial user profile.
There is also typically a learning
module that will learn from
users' feedback over time.
Now note that in this case, typically
the user's information need is stable, so
the system would have a lot of
opportunities to observe the user.
If the user has taken
a recommended item and viewed it,
this is a signal to indicate that
the recommended item may be relevant.
If the user discarded it,
no, it's not relevant.
And so such feedback can be long-term
feedback that lasts for a long time, and
the system can collect a lot of
information about this user's interests.
And this can then be used
to improve the classifier.
Now what are the criteria for
evaluating such a system?
How do we know this filtering
system actually performs well?
Now in this case we cannot use the ranking
evaluation measures, like MAP,
because we can't afford waiting for
a lot of documents
and then ranking the documents to
make a decision for the user.
And so the system must make a decision
in real time,
in general, to decide whether the item
is above the threshold or not.
So in other words,
we're trying to decide absolute relevance.
So in this case one commonly used
strategy is to use a utility function
to evaluate the system.
So here I show a linear utility function
that's defined as, for example,
3 multiplied by the number of
good items that you delivered,
minus 2 multiplied by the number of bad
items that you delivered.
So in other words,
we could kind of just
treat this as almost
a gambling game.
If you deliver one good item,
let's say you win $3, you gain $3.
But if you deliver a bad one,
you would lose $2.
And this utility function
basically measures
how much money you would
get by playing this kind of game, right.
And so it's clear that if you want
to maximize this utility function,
your strategy should be to deliver
as many good articles as possible,
and minimize the delivery of bad articles.
That's obvious, right.
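As a small sketch, the linear utility function described here can be written directly. The parameter names `gain` and `loss` are chosen for illustration; the default values 3 and 2 are the coefficients from the example.

```python
def linear_utility(num_good, num_bad, gain=3.0, loss=2.0):
    """Utility = gain * (good items delivered) - loss * (bad items delivered)."""
    return gain * num_good - loss * num_bad

# Delivering 10 good and 5 bad items under the 3 / -2 coefficients.
example = linear_utility(10, 5)
```

Delivering nothing gives a utility of zero, so a system only comes out ahead when its good deliveries outweigh the bad ones under the chosen coefficients.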
Now one interesting question here is,
how should we set these coefficients?
Now I just showed a 3 and a negative 2,
as the possible coefficients, but one can
ask the question, are they reasonable?
So what do you think?
Do you think that's a reasonable choice?
What about other choices?
So for example, we can have 10 and
minus 1, or 1 and minus 10.
What's the difference?
What do you think?
How would this utility function affect
the system's threshold decision?
Right, you can think of these two extreme
cases, 10 and minus 1 versus 1 and minus 10.
Which one do you think would
encourage the system to over-deliver?
And which one would encourage
the system to be conservative?
Yeah?
If you think about it, you will see
that when we get a big reward for
delivering a good document and incur only
a small penalty for delivering a bad one,
intuitively, you would be
encouraged to deliver more, right?
And you can try to deliver more in hope
of getting a good one delivered, and
then you'll get a big reward.
Right, so on the other hand,
if you choose 1 and minus 10,
you don't really get such a big prize
if you deliver a good document.
On the other hand, you will have
a big loss if you deliver a bad one.
You can imagine that the system
would be very reluctant to
deliver a lot of documents.
It has to be absolutely sure
that it's a relevant one.
So this utility function has to be
designed based on a specific application.
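One way to see how the coefficients move the threshold is to ask when delivering a single document has positive expected utility. This is a sketch under an assumption that is not in the lecture: the system has an estimate p of the probability that the document is relevant. The symbols g (gain for a good item) and l (loss for a bad item) are introduced here only for illustration.

```latex
\begin{aligned}
\mathbb{E}[\text{utility of delivering}] &= p\,g - (1-p)\,l > 0
\quad\Longleftrightarrow\quad p > \frac{l}{g+l},\\
(g,l)=(3,2) &: \quad p > \tfrac{2}{5} = 0.4,\\
(g,l)=(10,1) &: \quad p > \tfrac{1}{11} \approx 0.09,\\
(g,l)=(1,10) &: \quad p > \tfrac{10}{11} \approx 0.91.
\end{aligned}
```

So with 10 and minus 1 the system should deliver almost anything with a small chance of being relevant, while with 1 and minus 10 it should deliver only when it is nearly certain.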
The three basic problems in content-based
filtering are the following.
First, the system has to make a filtering decision.
So it has to be a binary decision maker,
a binary classifier.
Given a text document and
a profile description of the user,
it has to say yes or no, whether this
document should be delivered or not.
So that's a decision module, and
there should be an initialization
module as you have seen earlier.
And this is to get the system started.
And we have to initialize the system based
on only a very limited text description,
or very few examples from the user.
And the third component is
a learning module, which
has to be able to learn from limited
relevance judgments, because we
can only learn from the user about their
preferences on the delivered documents.
If we don't deliver a document
to the user,
we would never be able to know whether
the user likes it or not, right.
And we can accumulate a lot of documents,
we can learn from the entire history.
Now, all these modules would have to
be optimized to maximize the utility.
So how can we build such a system?
And there are many different approaches.
Here we are going to talk about
how to extend a retrieval system,
a search engine for information filtering.
Again, here's why we've spent a lot of
time talking about search engines.
Because it's actually not very hard
to extend the search engine for
information filtering.
So, here is the basic idea for
extending a retrieval system for
information filtering.
First, we can reuse a lot of
retrieval techniques to do scoring.
All right, so we know how to score
documents against queries et cetera.
We can measure the similarity between
a profile text description and a document.
And then we can use a score threshold for
the filtering decision.
We do retrieval, and then we
find the scores of documents, and
then we apply a threshold
to see whether a document is
passing this threshold or not.
And if it's passing the threshold,
we are going to say it's relevant and
we are going to deliver it to the user.
And another component that we have to add
is, of course, to learn from the history.
And here we can use the traditional
feedback techniques
to learn to improve scoring.
And we know Rocchio can be used for
scoring improvement, right?
But we have to develop new approaches
to learn how to set the threshold.
We need to set it initially, and
then we have to learn how to
update the threshold over time.
So here's what the system
might look like if we just
generalized a vector-space model for
filtering problems, right?
So you can see the document vector could
be fed into a scoring module, which
already exists in a search engine
that implements a vector-space model.
And the profile will be treated
as a query essentially.
And then the profile vector can be
matched with the document vector,
to generate the score.
And then this score will be fed into
a thresholding module that would
say yes or no.
And then the evaluation would be based on
the utility for the filtering results.
If it says yes, then the document
will be sent to the user, and
then the user could give some feedback.
And the feedback information
would be used to both
adjust the threshold and
adjust the vector representation.
So the vector learning is essentially
the same as query modification or
feedback in the case of search.
The threshold learning is a
new component that we need
to talk a little bit more about.
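The pipeline just described (profile as query, scoring, thresholding, feedback on the vector) can be sketched in a few lines. Everything here is a simplified assumption: raw term counts instead of TF-IDF, a fixed threshold, and a Rocchio-style update applied one document at a time with made-up rates `pos_rate` and `neg_rate`. Threshold learning is left out.

```python
import math
from collections import Counter

def to_vector(text):
    """Bag-of-words term counts; a real system would use TF-IDF weights."""
    return dict(Counter(text.lower().split()))

def cosine(u, v):
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

class ToyFilter:
    def __init__(self, profile_text, threshold=0.2, pos_rate=0.5, neg_rate=0.25):
        self.profile = to_vector(profile_text)  # the profile acts as the query
        self.threshold = threshold
        self.pos_rate = pos_rate
        self.neg_rate = neg_rate

    def score(self, text):
        return cosine(self.profile, to_vector(text))

    def deliver(self, text):
        """Binary decision: deliver only if the score passes the threshold."""
        return self.score(text) >= self.threshold

    def feedback(self, text, relevant):
        """Rocchio-style update: move toward relevant, away from non-relevant."""
        step = self.pos_rate if relevant else -self.neg_rate
        for term, count in to_vector(text).items():
            updated = self.profile.get(term, 0.0) + step * count
            self.profile[term] = max(updated, 0.0)  # keep weights non-negative

toy = ToyFilter("search engine ranking")
```

After positive feedback on a delivered document, terms from that document enter the profile, so later documents sharing those terms score higher.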
[MUSIC]

[SOUND].
There are some interesting
challenges in threshold learning
in the filtering problem.
So here I show
the kind of data that you can collect
in the filtering system.
So you can see the scores and
the status of relevance.
So the first one has a score 36.5,
and it's relevant.
The second one is not relevant.
Of course, we have a lot of documents for
which we don't know the status,
because we have never delivered them to the user.
So as you can see here,
we only see the judgements of
documents delivered to the user.
So this is not a random sample.
So it's censored data.
It's kind of biased, so
that creates some difficulty for learning.
And secondly, there is in general very
little labeled data and very few relevant
examples, so it's also challenging for
machine learning approaches.
Typically they
require more training data.
And in the extreme case at the beginning,
we don't even have any
labeled data at all.
The system still has to make a decision,
so
that's a very difficult
problem at the beginning.
Finally, there is also the issue of the
exploration versus exploitation tradeoff.
Now this means we also want to
explore the document space a little bit,
to see if the user
might be interested in the documents
that we have not yet labeled.
So, in other words, we're going to
explore the space of user interests
by testing whether the user might be
interested in some other documents that
currently are not matching
the user's interest
so well.
So how do we do that?
Well, we could lower the threshold a little
bit and just deliver some near
misses to the user
to see how the user
would respond to this extra document.
And this is a tradeoff, because
on the one hand, you want to explore,
but on the other hand,
you don't want to really explore too much,
because then you would over-deliver
non-relevant information.
So exploitation means you would,
exploit what you learn about the user.
And let's say you know the user is
interested in this particular topic, so
you don't want to deviate that much.
And, but if you don't deviate at all,
then you don't explore at all.
That's also not good.
You might miss an opportunity to learn about
another interest of the user.
So this is a dilemma.
And that's also a difficult
problem to solve.
Now how do we solve these problems?
In general, I think one can use the
empirical utility optimization strategy.
And this strategy is basically to optimize
the threshold based on, historical data,
just as you have seen on
the previous slide, right?
So you can just compute the utility
on the training data for
each candidate score threshold.
Pretend that [INAUDIBLE]
cut at this point.
What if I cut at the [INAUDIBLE]
threshold, what would happen?
What's the utility?
Compute the utility, right?
We know the status, or at least an
approximation of it based on click-throughs, right?
So then we can just choose this
threshold that gives the maximum
utility on the training data.
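Empirical utility optimization can be sketched as a single pass over the judged documents sorted by score. The data below is made up for illustration (only the first score, 36.5, and the first two relevance labels echo the slide), and the coefficients are the 3 and minus 2 from the earlier example. Ties between scores are ignored in this sketch.

```python
def best_threshold(scored_judgments, gain=3.0, loss=2.0):
    """Return (threshold, utility) maximizing utility on the training data.

    scored_judgments is a list of (score, is_relevant) pairs for documents
    that were delivered and judged. A threshold of None means the best
    choice on this data would be to deliver nothing.
    """
    ranked = sorted(scored_judgments, key=lambda pair: pair[0], reverse=True)
    best_cut, best_utility = None, 0.0
    utility = 0.0
    for score, is_relevant in ranked:
        # Utility if we had delivered everything scoring at least `score`.
        utility += gain if is_relevant else -loss
        if utility > best_utility:
            best_cut, best_utility = score, utility
    return best_cut, best_utility

history = [(36.5, True), (33.4, False), (32.1, True),
           (29.9, True), (27.3, False), (25.0, False)]
cut, util = best_threshold(history)
```

The running utilities for this toy history are 3, 1, 4, 7, 5, 3, so the cut lands at the fourth document.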
But this of course doesn't account for
the exploration that we just talked about.
And there is also the difficulty of the biased
training sample, as we mentioned.
So in general, we can only get an upper
bound for the true optimal threshold,
because the threshold
might be actually lower than this.
So it's possible that the discarded item
might be actually interesting to the user.
So how do we solve this problem?
Well, as I said, we can generally lower
the threshold to explore a little bit.
So here's one particular approach called
the beta-gamma threshold learning.
So the idea is as follows.
So, here I show a ranked list of
all the training documents
that we have seen so far.
And they are ranked by their positions.
And on the Y-axis, we show the Utility.
Of course, this function depends on
how you specify the coefficients in
the Utility function.
But we can imagine that, depending on the
cutoff position, we will have a utility.
That means, suppose I cut at this
position, and that will be the utility.
So we can, for example,
identify some cutoff points.
The optimal point theta
optimal is the point
when we would achieve the maximum
utility if we had chosen this threshold.
And there is also the
zero utility threshold.
As you can see, at this cutoff
the utility is 0.
Now, what does that mean?
That means if I lower the threshold
until I reach this
threshold, the utility would be lower,
but it's still positive,
or still non-negative, at least.
So it's not as high as
the optimal utility, but
it gives us a safe point
to explore the threshold.
As I just explained, it's desirable
to explore the interest space.
So it's desirable to lower the threshold
based on your training data.
So that means, in general, we want to set
the threshold somewhere in this range.
So we can use alpha to
control the deviation from
the optimal utility point.
So you can see the formula of the
threshold will be just the interpolation
of the zero utility threshold and
the optimal utility threshold.
Now the question is,
how should we set alpha, and
when should we deviate more
from the optimal utility point?
Well, this can depend on multiple factors,
and one way to solve the problem is to
encourage this threshold
mechanism to explore
up to the zero point, and
that's a safe point, but
we're not going to necessarily
reach all the way to the zero point.
But rather we're going to use other
parameters to further define alpha.
And this specifically is as follows.
So there will be a beta
parameter to control
the deviation from the optimal threshold.
And this can be based on, for
example, accounting for
overfitting to
the training data, let's say.
And so
this can be just an adjustment factor.
But what's more interesting is this gamma
parameter here, and you can see in this
formula gamma is controlling
the influence
of the number of examples
in the training data set.
So you can see in this formula, as N, which
denotes the number of training examples,
becomes bigger, it would
actually encourage less exploration.
In other words, when N is very small,
it will try to explore more.
And that just means if we
have seen few examples,
we're not sure whether we have
exhausted the space of interests.
So [INAUDIBLE].
But as we have seen many examples
from the user, many data points,
then we feel that we probably
don't have to explore more.
So this gives us a dynamic strategy for
exploration, right?
The more examples we have seen,
the less exploration we are going to do.
So, the threshold will be closer
to the optimal threshold.
So, that's the basic
idea of this approach.
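The slide's formula is not reproduced in the transcript. The sketch below uses the form in which beta-gamma threshold learning is commonly stated, which should be read as a reconstruction: the threshold interpolates between the zero utility threshold and the optimal threshold with weight alpha, and alpha = beta + (1 - beta) * exp(-N * gamma). The default values for beta and gamma are arbitrary.

```python
import math

def beta_gamma_threshold(theta_optimal, theta_zero, num_examples,
                         beta=0.1, gamma=0.05):
    """Interpolate between the zero-utility and optimal-utility thresholds.

    alpha is close to 1 when few examples have been seen (explore down to
    the zero-utility point) and shrinks toward beta as examples accumulate.
    """
    alpha = beta + (1.0 - beta) * math.exp(-num_examples * gamma)
    return alpha * theta_zero + (1.0 - alpha) * theta_optimal

# Made-up thresholds: the optimal cut is at score 30, zero utility at score 20.
early = beta_gamma_threshold(30.0, 20.0, num_examples=0)
later = beta_gamma_threshold(30.0, 20.0, num_examples=1000)
```

With no training examples the threshold sits at the zero utility point; with many examples it settles near the optimal threshold, offset only by the beta adjustment.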
Now, this approach actually has been
working well in some evaluation studies,
and it is particularly effective.
It can also work with an arbitrary utility
function with an appropriate lower bound.
And it explicitly addresses the
exploration-exploitation tradeoff.
And it uses the zero utility
threshold point as a safeguard
for the exploration and exploitation tradeoff.
We're never going to explore
further than the zero utility point.
So, if you take the analogy of gambling,
you don't want to risk losing money.
So it's a safe strategy,
a conservative strategy for exploration.
And the problem is, of course,
this approach is purely heuristic.
And the zero utility lower bound
is also often too conservative.
And there are, of course,
more advanced machine learning
approaches that have been proposed for
solving these problems.
And this is a very active research area.
So to summarize, there
are two strategies for
recommender systems or filtering systems.
One is content based,
which is looking at the item similarity.
And the other is collaborative filtering,
which is looking at the user similarity.
In this lecture we have covered
the content-based filtering approach.
In the next lecture, we're going to
talk about collaborative filtering.
In a content-based filtering
system, we generally have to solve
several problems related to the filtering
decision, learning, et cetera.
And such a system can actually
be based on a search engine
system by adding a threshold mechanism and
adding an adaptive learning
algorithm to allow the system
to learn from long-term
feedback from the user.
[MUSIC]

[SOUND] This lecture is about
Collaborative Filtering.
In this lecture, we're going to continue
the discussion of Recommender Systems.
In particular, we're going to look at
the approach of collaborative filtering.
You have seen this slide before
when we talked about the two
strategies to answer the basic
question: will user U like item X?
In the previous lecture,
we looked at the item similarity,
that's content-based filtering.
In this lecture, we're going to
look at the user similarity.
This is a different strategy
called collaborative filtering.
So first of all,
what is collaborative filtering?
It is to make filtering decisions for
an individual user based on
the judgements of other users.
That is to say,
we will infer an individual's interests or
preferences from those
of other similar users.
So the general idea is the following.
Given a user u, we are going to
first find the similar users,
and then we're going to
predict the user's preferences based on
the preferences of these similar users.
Now the user similarity here can be
judged based on the similarity of
their preferences on
a common set of items.
Now here you'll see that the exact
content of an item doesn't really matter.
We're going to look only at
the relationship between the users and
the items.
So this means this approach
is very general, and it can be
applied to any items, not
just text objects.
This approach would work well
under the following assumptions.
First, users with the same interests
will have similar preferences.
Second, users with similar preferences
probably share the same interests.
So for example, if the interest of
the user is in information retrieval,
then we can infer the user
probably favors SIGIR papers.
And so those who are interested in
information retrieval research probably
all favor SIGIR papers.
That's an assumption that we make.
And if this assumption is true,
then it would help collaborative
filtering to work well.
We can also assume that if we
see people favor SIGIR papers,
then we can infer their interest is
probably information retrieval.
So in these simple examples,
it seems to make sense.
And in many cases such an assumption
actually does make sense.
So, another assumption you have to make
is that there are a sufficiently large
number of user preferences
available to us.
So for example, if you see a lot
of ratings of users for movies and
those indicate their
preferences in movies.
And if you have a lot of such data,
then collaborative filtering
can be very effective.
If not, there will be a problem and
that's often called a cold start problem.
That means you don't have many
preferences available, so
the system could not fully take advantage
of collaborative filtering yet.
So let's look at the collaborative
filtering problem in a more formal way.
And so this picture shows that we are in
general considering a lot of users, and
we're showing m users here,
u1 through um. And we're also
considering a number of objects,
let's say
n objects denoted as o1 through on. And
then we will assume that the users will
be able to judge those objects and
the user could for example,
give ratings to those items.
For example, those items could be movies,
could be products and
then the users would give ratings
one through five, let's say.
So what you see here is that we have
assumed some ratings available for
some combinations.
So some users have watched movies,
they have rated those movies.
They obviously won't be able
to watch all the movies and
some users may actually
only watch a few movies.
So this is in general a sparse matrix,
right?
So many entries
have unknown values, and
what's interesting here is
we could potentially infer
the value of an element in this
matrix based on other values, and
that's actually the central question
in collaborative filtering.
And that is,
we assume an unknown function here f,
that would map a pair of user and
object to a rating.
And we have observed some
values of this function, and
we want to infer the value
of this function for
other pairs
that don't have values available here.
So this is very similar to
other machine learning problems,
where we would know the values of the
function on some training data and
we hope to predict the values of
this function on some test data.
All right.
So this is function approximation.
And how can we figure out the function
based on the observed ratings?
So this is the setup.
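To make the setup concrete, here is a tiny made-up rating matrix stored sparsely, with a helper that lists the user and object pairs whose ratings are unknown. Those pairs are exactly where the function f has to be predicted. The user and object names and the ratings are invented for illustration.

```python
# Observed values of f(user, object); anything missing is unknown.
ratings = {
    "u1": {"o1": 5, "o2": 3},
    "u2": {"o1": 4, "o3": 2},
    "u3": {"o2": 1, "o3": 5},
}

def unknown_pairs(ratings):
    """All (user, object) pairs with no observed rating."""
    items = sorted({item for user_ratings in ratings.values() for item in user_ratings})
    return [(user, item)
            for user in sorted(ratings)
            for item in items
            if item not in ratings[user]]

to_predict = unknown_pairs(ratings)
```

Storing only the observed entries keeps the representation small when most of the matrix is empty.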
Now there are many approaches
to solving this problem.
And in fact,
this is a very active research area.
There are special
conferences dedicated to the problem,
including a major conference
devoted to it.
[MUSIC]

[NOISE].
And here what we will do is
talk about a basic strategy
that is based on
the similarity of users,
predicting the rating
of an object by an
active user using the ratings of
users similar to this active user.
This is called a memory-based approach
because it's a little bit similar to
storing all the user information.
And when we are considering a particular
user, we're going to try to
retrieve the relevant users, or
the users similar to this user.
And then we try to use the
information about those users
to predict the preference of this user.
So here's the general idea, and
we use some notation here.
X sub ij denotes the rating
of object o sub j by user u sub i.
And n sub i is the average rating
of all objects by this user.
So this n sub i is needed
because we would like to normalize
the ratings of objects by this user.
So how do you do normalization?
Well, you just
subtract
the average rating from all the ratings.
Now these are the normalized ratings, so
that the ratings from different
users will be comparable.
Because some users might be more generous
and they generally give higher ratings.
But some others might be more critical.
So their ratings cannot be
directly compared with each other or
aggregated together.
So we need to do this normalization.
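The normalization step is small enough to sketch directly. The two users below are made up: one rates generously and one critically, but they order the items the same way, so their normalized ratings coincide.

```python
def mean_rating(user_ratings):
    """n_i: the average of all ratings given by one user."""
    return sum(user_ratings.values()) / len(user_ratings)

def normalize(user_ratings):
    """Subtract the user's own average from each of that user's ratings."""
    mean = mean_rating(user_ratings)
    return {item: rating - mean for item, rating in user_ratings.items()}

generous = {"o1": 5, "o2": 4, "o3": 3}   # average 4
critical = {"o1": 3, "o2": 2, "o3": 1}   # average 2
```

Both users end up with normalized ratings of +1, 0, and -1, which is what makes them comparable.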
Now, the prediction of the rating
of the item by another user, the
active user u sub a here,
can be based on the average
ratings of similar users.
So the user u sub a is the user that we
are interested in recommending items to.
And we now are interested in
recommending this o sub j.
So we're interested in knowing how
likely this user will like this object.
How do we know that?
Well, the idea here is to look at
whether users similar to this user
have liked this object.
So mathematically, this is to say that
the predicted rating of this
user on this object,
user a on object o sub j, is
basically a combination of
the normalized ratings of different users.
And in fact, here,
we're taking a sum over all the users.
But not all users contribute
equally to the average.
And this is controlled by the weights.
So this
weight controls the influence
of a user on the prediction.
And of course, naturally this weight
should be related to the similarity
between u sub a and this particular user, u sub i.
The more similar they are,
the more contribution we would like
user u sub i to make in predicting
the preference of u sub a.
So the formula is extremely simple.
You're going to see it's a sum
over all the possible users.
And inside the sum, we have their ratings,
well, their normalized
ratings as I just explained.
The ratings need to be normalized in
order to be comparable with each other.
And then these ratings
are weighted by their similarity.
So we can imagine w of a and
i is just the similarity of user a and user i.
Now, what's k here?
Well, k is simply a normalizer.
It's just one over the sum
of all the weights, over all the users.
And so this means, basically, if you
consider the weight here together with k,
we have coefficients or weights
that would sum to one for all the users.
And it's just a normalization strategy,
so that you get this predicted rating
in the same range as the ratings
that we use to make the prediction.
Right?
So, this is basically the main idea
of memory-based approaches for
collaborative filtering, okay?
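Written out from the description above, the prediction can be reconstructed as follows. The symbol for the predicted normalized rating, v-hat sub aj, is a name introduced here for illustration; x sub ij, n sub i, w(a, i), k, and m are the quantities named in the lecture.

```latex
\hat{v}_{aj} = k \sum_{i=1}^{m} w(a,i)\,\bigl(x_{ij} - n_i\bigr),
\qquad
k = \frac{1}{\sum_{i=1}^{m} w(a,i)}
```

In practice the sum can only include users who have actually rated object o sub j; that restriction is an assumption made explicit here rather than something stated on the slide.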
Once we make this prediction,
we also would like to map it back
to the rating that
the user would actually make.
And this is to further add the
mean rating or
average rating of this user u
sub a to the predicted value.
This would recover
a meaningful rating for this user.
So if this user is generous,
then the average would be somewhat high,
and when we add that, the rating will
be adjusted to a relatively high rating.
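Putting the pieces together, a minimal sketch of the memory-based prediction might look like this. The ratings and the similarity weights are made up, the weight function is passed in as a parameter, and two simplifications are assumptions of this sketch: only users who rated the item contribute, and the weights are taken to be non-negative so that dividing by their sum is safe.

```python
def mean_rating(user_ratings):
    return sum(user_ratings.values()) / len(user_ratings)

def predict_rating(active, item, ratings, weight):
    """Predict the active user's rating of an item from other users' ratings.

    weight(a, i) returns the similarity between users a and i.
    """
    active_mean = mean_rating(ratings[active])
    weighted_sum = 0.0
    weight_total = 0.0
    for user, user_ratings in ratings.items():
        if user == active or item not in user_ratings:
            continue
        w = weight(active, user)
        # Normalized rating of this user, weighted by similarity.
        weighted_sum += w * (user_ratings[item] - mean_rating(user_ratings))
        weight_total += w
    if weight_total == 0:
        return active_mean  # no usable neighbors: fall back to the user's mean
    # k * weighted sum, then add the active user's mean back.
    return active_mean + weighted_sum / weight_total

ratings = {
    "ua": {"o1": 4, "o2": 2},
    "u1": {"o1": 5, "o2": 3, "o3": 1},
    "u2": {"o1": 1, "o2": 3, "o3": 5},
}
toy_weights = {("ua", "u1"): 0.75, ("ua", "u2"): 0.25}
predicted = predict_rating("ua", "o3", ratings, lambda a, i: toy_weights[(a, i)])
```

Here the more similar user u1 rated o3 two points below their own average, so the prediction for ua lands below ua's average of 3.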
Now, when you recommend an item to a user,
this actually doesn't really matter
because you are interested in basically
the normalized rating
that's more meaningful.
But when people evaluate these collaborative
filtering approaches, they typically
assume the actual ratings of the user
on these objects to be unknown.
And then you do the prediction and
then you compare the predicted
ratings with the actual ratings.
So
you do have access to the actual ratings,
but then you pretend you don't know them.
And then you compare the system's
predictions with the actual ratings.
In that case, obviously the system's
prediction would have to be adjusted to
match the actual ratings of the user, and this
is what's happening here, basically.
Okay?
So this is the memory-based approach.
Now of course if you look at the formula,
if you want to write
a program to implement it,
you still face the problem of determining
what this w function is, right?
Once you know the w function, then
the formula is very easy to implement.
So indeed there are many different ways to
compute this function or this weight, w.
And, specific approaches generally
differ in how this is computed.
So, here are some possibilities.
And, you can imagine,
there are many other possibilities.
One popular approach is to use
the Pearson correlation coefficient.
This would be a sum over the commonly
rated items, and the formula
is the standard Pearson correlation
coefficient formula, as shown here.
So this basically measures
whether the two users tend
to both give higher ratings to similar
items, or lower ratings to similar items.
Another measure is the cosine measure, and
this is to treat the rating vectors as
vectors in the vector space, and then
we're going to measure the angle and
compute the cosine of
the angle between the two vectors.
And this measure has been used in the
vector space model for retrieval as well.
So as you can imagine, there are so
many different ways of doing that.
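The two measures just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each user's ratings are a dictionary from item to rating, the Pearson means are taken over the commonly rated items, and in the cosine measure an unrated item is treated as a zero entry. The function names are hypothetical.

```python
import math
from typing import Dict

UserRatings = Dict[str, float]  # item -> rating


def pearson(u: UserRatings, v: UserRatings) -> float:
    """Pearson correlation computed over the items both users rated."""
    common = sorted(set(u) & set(v))
    if len(common) < 2:
        return 0.0  # not enough overlap to measure a correlation
    mean_u = sum(u[j] for j in common) / len(common)
    mean_v = sum(v[j] for j in common) / len(common)
    numerator = sum((u[j] - mean_u) * (v[j] - mean_v) for j in common)
    spread_u = math.sqrt(sum((u[j] - mean_u) ** 2 for j in common))
    spread_v = math.sqrt(sum((v[j] - mean_v) ** 2 for j in common))
    if spread_u == 0.0 or spread_v == 0.0:
        return 0.0  # a user who rates everything the same carries no signal
    return numerator / (spread_u * spread_v)


def cosine(u: UserRatings, v: UserRatings) -> float:
    """Cosine of the angle between two rating vectors (missing rating = 0)."""
    dot = sum(u[j] * v[j] for j in set(u) & set(v))
    norm_u = math.sqrt(sum(r * r for r in u.values()))
    norm_v = math.sqrt(sum(r * r for r in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Either function can serve as the weight w between two users. Note that neither one looks at what the items actually are, only at the ratings.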
In all these cases, note that the user
similarity is based on their preferences
on items, and we did not actually use
any content information of these items.
It didn't matter what these items are.
They can be movies, they can be books,
they can be products,
they can be text documents.
We just didn't care about the content.
And so this allows such an approach to be
applied to a wide range of problems.
Now in some newer approaches of course,
we would like to use more
information about the user.
Clearly, we know more about the user, not
just their preferences on these items.
And so in an actual filtering system using
collaborative filtering, we could also
combine that with content-based filtering,
and we could use context information.
And those are all interesting approaches
that people are still studying.
There are newer approaches proposed.
But this approach has been shown
to work reasonably well and
it's easy to implement.
And in practical applications, it
could be a starting point to
see if the strategy here works well for
your application.
So there are some obvious ways
to also improve this approach.
Mainly, we would like to improve
the user similarity measure.
And there are some practical
issues to deal with here as well.
So for example,
there will be a lot of missing values.
What do you do with them?
Well, you can set them to default values
or the average ratings of the user.
And that will be a simple solution.
But there are advantages to approaches
that can actually try to predict those
missing values and then use the predicted
values to improve the similarity.
So in fact, with the memory-based approach,
you can predict those missing values,
right?
So you can imagine
you have an iterative approach where you
first do some preliminary prediction and
then you can use the predicted values to
further improve the similarity function.
Right, so this is
a way to solve the problem.
And the strategy would obviously affect
the performance of collaborative filtering,
just like any other heuristics
to improve the similarity function.
Another idea, which is actually very
similar to the idea of IDF that we
have seen in text retrieval, is called
the inverse user frequency or IUF.
Now here the idea is to look at where
the two users share similar ratings.
If the item is a popular item that has
been viewed by many people, then
seeing these two people interested
in this item may not be so interesting.
But if it's a rare item that
has not been viewed by many users,
but these two users
[INAUDIBLE] to this item
and they give similar ratings, then it
says more about their similarity, right?
So it's kind of to put more
emphasis on similarity
on items that are not
viewed by many users.
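Here is a small sketch of the IUF idea. The lecture does not give a formula, so this assumes the weight of an item is log(n / n_j), by analogy with IDF, where n is the number of users and n_j is the number of users who rated item j. The function names and the choice to scale ratings by IUF before a cosine comparison are illustrative assumptions.

```python
import math
from typing import Dict

UserRatings = Dict[str, float]
Ratings = Dict[str, UserRatings]


def inverse_user_frequency(ratings: Ratings) -> Dict[str, float]:
    """Assumed IUF: log(total users / users who rated the item)."""
    n_users = len(ratings)
    rated_by: Dict[str, int] = {}
    for user_ratings in ratings.values():
        for item in user_ratings:
            rated_by[item] = rated_by.get(item, 0) + 1
    return {item: math.log(n_users / count) for item, count in rated_by.items()}


def iuf_weighted_cosine(
    u: UserRatings, v: UserRatings, iuf: Dict[str, float]
) -> float:
    """Cosine similarity after scaling each rating by the item's IUF."""
    scaled_u = {j: r * iuf.get(j, 0.0) for j, r in u.items()}
    scaled_v = {j: r * iuf.get(j, 0.0) for j, r in v.items()}
    dot = sum(scaled_u[j] * scaled_v[j] for j in set(scaled_u) & set(scaled_v))
    norm_u = math.sqrt(sum(x * x for x in scaled_u.values()))
    norm_v = math.sqrt(sum(x * x for x in scaled_v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Under this assumed formula, an item rated by every user gets weight zero, so agreement on it contributes nothing, while agreement on a rarely rated item dominates the similarity.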
[MUSIC]

[SOUND] So to summarize our
discussion of recommender systems,
in some sense the filtering
task of recommendation is easy, and
in some other sense
the task is actually difficult.
So it's easy because the user's
expectation is low. In this case,
the system takes the initiative to
push the information to the user.
So the user doesn't really make an effort.
So any recommendation is
better than nothing, right?
So unless you recommend
all noisy items or
useless documents,
if you can recommend
some useful information, users generally
would appreciate it, all right.
So in that sense, that's easy.
However, filtering is
actually a much harder task,
because you have to make a binary
decision, and you can't afford waiting for
a lot of items and then deciding
whether one item is better than others.
You have to make a decision
when you see this item.
Think about news filtering:
as soon as you see the news,
you have to decide whether the news
would be interesting to a user.
If you wait for a few days, well, even if
you can make an accurate recommendation of
the most relevant news, the utility
would be significantly decreased.
Another reason why it's hard
is data sparseness.
If you think of this as a learning
problem in collaborative filtering, for
example, it's purely based on
learning from the past ratings.
So if you don't have many ratings,
there's really not much you can do, right?
And I just mentioned this problem.
This is actually a very serious problem.
But of course there are strategies that
have been proposed to solve the problem.
And there are different strategies that
we can use to alleviate the problem.
We can use, for example, more user
information to assess their similarity
instead of using just the preferences
of these users on these items.
There may be additional information
available about the user, etc.
And we also talked about the two
strategies for the filtering task.
One is content-based, where we
look at item similarity, and the other
is collaborative filtering,
where we look at the user similarity.
And they obviously can be combined.
In a practical system, you can imagine,
they generally would have to be combined.
So that will give us a hybrid strategy for
filtering.
And we can also recall that we
talked about push versus
pull as two strategies for
getting access to the text data.
And recommender systems help
users in the push mode,
and search engines
serve users in the pull mode.
Obviously the two should be combined, and
they can be combined to have a system
that can support users with multiple
modes of information access.
So in the future, we could anticipate
such a system to be more useful to a user.
And also this is an active research area, so
there are a lot of new algorithms being
proposed over time.
In particular, those new algorithms tend
to use a lot of context information.
Now the context here could be
the context of the user, you know,
it could also be context of documents or
items.
The items are not isolated.
They are connected in many ways.
The users might form a social network as
well, so there's a rich context there
that we can leverage in order to really
solve the problem well, and that's
an active research area where also machine
learning algorithms have been applied.
Here are some additional readings.
The handbook called Recommender Systems
has a collection of
a lot of good articles that
can give you an overview
of a number of specific
approaches to recommender systems.
[MUSIC]

[SOUND] This lecture is
a summary of this course.
This map shows the major topics
we have covered in this course.
And here are some key
high-level take-away messages.
First, we talked about natural
language content analysis.
Here the main take-away message is that natural
language processing is the foundation for
text retrieval, but
current NLP isn't robust enough.
So the bag-of-words
representation is generally
the main method used in
modern search engines, and
it's often sufficient for
most of the search tasks.
But obviously, for
more complex search tasks,
we need deeper natural language
processing techniques.
We then talked about
high-level strategies for
text access, and we talked about
push versus pull. In pull,
we talked about querying
versus browsing.
Now, in general, in future search engines,
we should integrate
all these techniques to provide
multiple modes of information access.
And then we talked about a number of
issues related to search engines.
We talked about the search problem,
we framed that as a ranking problem, and
we talked about a number
of retrieval methods.
We started with an overview of
the vector space model and
probabilistic models, and then we talked
about the vector space model in depth.
We also later talked about
the language modeling approach, and
that's a probabilistic model.
And here, the main take-away message is
that modern retrieval functions tend to
look similar, and
they generally use various heuristics.
The most important ones are TF-IDF weighting
and document length normalization,
and TF is often transformed through
a sub-linear transformation function.
And then we talked about how to
implement a retrieval system.
And here the main technique that we talked
about is how to construct an inverted index
so that we can prepare the system
to answer a query quickly.
And we talked about how to do fast search
by using the inverted index.
We then talked about how to
evaluate a text retrieval system,
and mainly introduced the Cranfield
evaluation methodology.
This is a very important evaluation
methodology that can be applied to
many tasks.
We talked about the major
evaluation measures.
So the most important measures for
a search engine are MAP, mean
average precision, and nDCG,
normalized discounted cumulative
gain, and also precision and
recall, the two basic measures.
And we then talked about
feedback techniques.
And we talked about Rocchio
in the vector space model and
the mixture model in
the language modeling approach.
Feedback is a very important
technique, especially considering
the opportunity of learning from
a lot of clickthroughs on the web.
We then talked about web search.
And here, we talked about how to
use parallel indexing to resolve
the scalability issue in indexing,
and we introduced MapReduce.
And then we talked about how to use
link information to improve search.
We talked about PageRank
and HITS as the major algorithms
to analyze links on the web.
We then talked about learning to rank.
This is a use of machine learning
to combine multiple features for
improving scoring.
Not only can the effectiveness be
improved using this approach, but
we can also improve the robustness
of the ranking function,
so that it's not easy to spam
a search engine with just
some features to promote a page.
And finally,
we talked about the future of web search.
We talked about some major
directions that we might see
in the future in improving the current
generation of search engines.
And then finally, we talked about
recommender systems, and these are systems
to implement the push mode, and
we talked about the two approaches.
One is content-based,
one is collaborative filtering, and
they can be combined together.
Now an obvious missing piece in this
picture is the user, as you can see.
So the user interface is also an important
component in any search engine,
even though the current search
interface is relatively simple.
There actually have been a lot
of studies of user interfaces
related to visualization, for
example, and this is a topic that
you can learn more about by reading this book.
It's an excellent book about all kinds
of studies of search user interfaces.
If you want to know more about
the topics that we talked about,
you can also read some additional
readings that are listed here.
In this short course, we only managed
to cover some basic topics in text
retrieval and search engines.
And these resources provide additional
information about more advanced topics and
they give more thorough treatment of
some of the topics that we talked about.
And a main source is the
Synthesis Digital Library,
where you can see a lot
of short
textbooks or long tutorials.
They tend to provide a lot of
information to explain a topic, and
there are multiple series that
are related to this course.
One is Information Concepts,
Retrieval, and Services.
Another is Human Language Technologies, and
yet another is Artificial
Intelligence and Machine Learning.
There are also some major journals and
conferences listed over here that
tend to have a lot of research papers
related to the topic of this course.
And finally, for
more information about resources,
including readings and toolkits, etc.,
you can check out this URL.
So, if you have not taken
the text mining course
in this data mining specialization series,
then naturally,
the next step is to take that course.
As this picture shows
to mine the text data,
we generally need two kinds of techniques.
One is text retrieval,
which is covered in this course.
And these techniques will help us
convert raw big text data into small,
relevant text data, which are actually
needed in the specific application.
And humans play an important
role in mining any text data,
because text data is written for
humans to consume.
So, involving humans in the process
of data mining is very important.
And in this course,
we have covered various strategies to
help users get access to
the most relevant data.
These techniques are also essential
in any text mining system to help
provide provenance and
to help users interpret the
patterns that the user would
find through text data mining.
So, in general, the user would have to
go back to the original data to better
understand the patterns.
So the text mining course, or
rather the text mining and
analytics course, will be
dealing with what to do once
the user has found the information.
So this is the part of this picture
where we would convert
the text data into action or knowledge.
And this has to do with helping
users further digest
the found information or
to find the patterns and
to reveal knowledge buried in text, and
such knowledge can be used in an
application system to help decision-making
or to help the user finish a task.
So, if you have not taken that
course,
the natural next step would
be to take that course.
Thank you for taking this course.
I hope you have found this
course to be useful to you and
I look forward to interacting
with you in a future activity.
[MUSIC]

[SOUND].
This lecture is about web indexing.
In this lecture, we will continue
talking about web search, and
we're going to talk about how
to create a web scale index.
So once we crawl the web
we've got a lot of web pages.
The next step is we use the indexer
to create the inverted index.
In general, we can use the standard
information retrieval techniques for
creating the index, and that is what we
talked about in the previous lecture.
But there are new challenges that we
have to solve for web scale indexing,
and the two main challenges are
scalability and efficiency.
The index will be so large that it cannot
actually fit into any single machine or
single disk, so we have to store
the data on multiple machines.
Also, because the data is so large,
it's beneficial to process the data in
parallel so
that we can produce the index quickly.
To address these challenges,
Google has made a number of innovations.
One is the Google File System,
that's a general distributed file system
that can help programmers manage files
stored on a cluster of machines.
The second is MapReduce.
This is a general software framework for
supporting parallel computation.
Hadoop is the most well known open
source implementation of MapReduce,
now used in many applications.
So this is the architecture
of the Google File System.
It uses a very simple centralized
management mechanism to manage
all the specific locations of files.
So it maintains the file namespace and
a lookup table to know where
exactly each file is stored.
The application client would
then talk to this GFS master
and obtain the specific locations of
the files that they want to process.
And once the GFS client has obtained
the specific information about the files,
then the application client
can talk directly to the specific
servers where the data
actually sits,
so that you can avoid involving
other nodes in the network.
So when this file system
stores the files on machines,
the system also would create
fixed-size chunks.
So the data files are separated
into many chunks;
each chunk is 64 megabytes,
so it's pretty big.
And that's appropriate for
large data processing.
These chunks are replicated
to ensure reliability.
So this is something that the
programmer doesn't have to worry about,
and it's all taken care
of by this file system.
So from the application perspective,
the programmer would see this
as if it's a normal file.
The program doesn't have to know
where exactly it's stored, and
can just invoke high level
operators to process the file.
And another feature is that the data
transfer is directly between
application and chunk servers, so
it's efficient in this sense.
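The division of labor described here, where a master keeps only the lookup table and clients then go to the chunk servers directly, can be illustrated with a toy sketch. This is not how the Google File System is implemented. The class `ToyMaster`, the round-robin placement, and the default of three replicas are all hypothetical. Only the 64 megabyte chunk size comes from the lecture.

```python
from typing import Dict, List, Tuple

CHUNK_SIZE = 64 * 1024 * 1024  # 64 megabytes, as stated in the lecture


class ToyMaster:
    """Keeps only metadata: which servers hold each chunk of each file."""

    def __init__(self, servers: List[str], replicas: int = 3) -> None:
        self.servers = servers
        self.replicas = replicas  # assumed replication factor
        self.table: Dict[Tuple[str, int], List[str]] = {}

    def register_file(self, name: str, size_bytes: int) -> int:
        """Split a file into fixed-size chunks and assign replica servers."""
        n_chunks = max(1, -(-size_bytes // CHUNK_SIZE))  # ceiling division
        for index in range(n_chunks):
            # Hypothetical placement: consecutive servers, round-robin.
            self.table[(name, index)] = [
                self.servers[(index + r) % len(self.servers)]
                for r in range(self.replicas)
            ]
        return n_chunks

    def locate(self, name: str, offset: int) -> Tuple[int, List[str]]:
        """Answer a client lookup: chunk index and servers for a byte offset."""
        index = offset // CHUNK_SIZE
        return index, self.table[(name, index)]
```

A client would call `locate` once and then read the bytes from one of the returned servers, so the file data itself never passes through the master.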
On top of the Google File System,
Google also proposed MapReduce as
a general framework for
parallel programming.
Now, this is very useful to support
a task like building an inverted index.
And so this framework is hiding a lot of
low level features from the programmer.
As a result, the programmer can
make minimum effort to create
an application that can be run
on a large cluster in parallel.
So, some of the low level
details hidden in the framework
include the specific network
communications, or load balancing,
or where the tasks are executed; all these
details are hidden from the programmer.
There is also a nice feature which
is the built-in fault tolerance.
If one server is broken,
let's say, so it's down, and
then some tasks may not be finished,
then the MapReduce mechanism would
know that the task has not been done.
So it would automatically dispatch the
task on other servers that can do the job.
And therefore, again, the programmer
doesn't have to worry about that.
So here's how MapReduce works.
The input data will be separated
into a number of key, value pairs.
Now, what exactly is in the value
will depend on the data.
And it's actually a fairly
general framework to allow you to
just partition the data
into different parts.
And each part can be then
processed in parallel.
Each key, value pair will be
then sent to a map function.
The programmer will write
the map function, of course.
And then the map function will
process this key value pair and
generate
a number of other key value pairs.
Of course, the new key is usually
different from the old key
that's given to the map as input.
And these key value pairs
are the output of the map function.
And all the outputs of all the map
functions will be then collected.
And then they will be further
sorted based on the key.
And the result is that all the values
that are associated with the same
key will be then grouped together.
So now we've got a pair of a key and a set
of values that are attached to this key.
So this will then be sent
to a reduce function.
Now, of course, each reduce function will
handle a different key.
So we will send
these output values to
multiple reduce functions,
each handling a unique key.
A reduce function would then process
the input, which is a key and
a set of values, to produce another
set of key values as the output.
So these output values would be then
collected together to form
the final output.
Right, so this is
the general framework of MapReduce.
Now, the programmer only needs to
write the map function and
the reduce function.
Everything else is actually taken
care of by the MapReduce framework.
So, you can see the programmer really
only needs to do minimum work.
And with such a framework, the input data
can be partitioned into multiple parts.
Each is processed in
parallel first by map, and
then in the process after
we reach the reduce stage,
multiple reduce functions
can also further process
the different keys and
their associated values in parallel.
So it
achieves the purpose of parallel
processing of a large dataset.
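The three stages just described (map, then grouping by key, then reduce) can be simulated on one machine. The sketch below is a sequential stand-in, not the real framework: `run_mapreduce` is a hypothetical driver, and a real MapReduce system would run the map calls and the reduce calls in parallel across machines. The toy job that buckets words by first letter is only there to exercise the driver.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

Pair = Tuple[object, object]


def run_mapreduce(
    records: Iterable[Pair],
    map_fn: Callable[[object, object], Iterable[Pair]],
    reduce_fn: Callable[[object, List[object]], Iterable[Pair]],
) -> List[Pair]:
    """Sequential simulation of the map, group-by-key, and reduce stages."""
    # Map stage: every input pair is processed independently of the others.
    intermediate: List[Pair] = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))

    # Grouping stage: all values that share a key end up together.
    groups: Dict[object, List[object]] = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce stage: one call per unique key, visited in sorted key order.
    output: List[Pair] = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output


def first_letter_map(line_number, text):
    """Toy map: emit (first letter, word) for each word on the line."""
    return [(word[0], word) for word in text.split()]


def distinct_words_reduce(letter, words):
    """Toy reduce: list the distinct words seen for this letter."""
    return [(letter, sorted(set(words)))]
```

Only `first_letter_map` and `distinct_words_reduce` are specific to the job, which mirrors the point that the programmer writes just the map function and the reduce function.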
So let's take a look at a simple example,
and that's word counting.
The input is files containing words.
And the output that we want to generate is
the number of occurrences of each word, so
it's the word count.
Right, we know
this kind of counting would be useful to,
for example, assess the popularity
of a word in a large collection.
And this is useful for achieving
a factor of IDF weighting for search.
So how can we solve this problem?
Well, one natural thought is that,
well, this task can be done in
parallel by simply counting different
parts of the file in parallel and
then in the end,
we just combine all the counts.
And that's precisely the idea of
what we can do with MapReduce.
We can parallelize lines
in this input file.
So more specifically, we can assume
the input to each map function
is a key value pair that represents the
line number and the string on that line.
So the first line, for
example, has a key of one.
And the value is Hello World Bye World,
so there are just four words on that line.
So this key-value pair will
be sent to a map function.
The map function would then just
count the words in this line.
And in this case, of course,
there are only four words.
Each word gets a count of one.
And these are the output that you see here
on this slide, from this map function.
So, the map function
is really very simple.
If you look at what the pseudocode
looks like on the right side, you see
it simply needs to iterate over
all the words in this line,
and then just call a Collect function,
which means it would then send the word
and the count to the collector.
The collector would then try to
sort all these key value pairs
from different map functions.
Right?
So the functions are very simple.
And the programmer specifies this function
as a way to process each part of the data.
Of course, the second line will be
handled by a different map function,
which will produce a similar output.
Okay, now the output from the map
functions will be then sent to
a collector.
And the collector will do
the internal grouping or sorting.
So at this stage, you can see we
have collected multiple pairs.
Each pair is a word and
its count in the line.
So once we see all these pairs,
then we can sort them based on the key,
which is the word.
So we will collect all the counts of
a word, like bye, here, together.
And similarly, we do that for other words.
Like Hadoop, hello, etc.
So each word now is attached to
a number of values, a number of counts.
And these counts represent the occurrences
of this word in different lines.
So now we have got a new pair of a key and
a set of values,
and this pair will then be
fed into a reduce function.
So the reduce function now will
have to finish the job of counting
the total occurrences of this word.
Now it has already got all
these partial counts, so
all it needs to do is
simply to add them up.
So the reduce function shown
here is very simple as well.
You have a counter and then iterate over
all the counts that you see in this array,
and then you just accumulate these counts,
right.
And then finally, you output the key
and the total count,
and that's precisely what we want as
the output of this whole program.
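The slide's pseudocode is not reproduced in this transcript, so the following Python is a reconstruction of the word count job from the description, not the original code. The map emits a count of one per word, and the reduce adds up the partial counts. The helper `run_word_count` stands in for the collector and the framework, and the second input line used in the usage note is an assumed example.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple


def word_count_map(line_number: int, line: str) -> Iterator[Tuple[str, int]]:
    """Emit (word, 1) for every word on the line."""
    for word in line.split():
        yield word, 1


def word_count_reduce(word: str, counts: List[int]) -> Iterator[Tuple[str, int]]:
    """Add up all the partial counts collected for one word."""
    total = 0
    for count in counts:
        total += count
    yield word, total


def run_word_count(lines: List[str]) -> Dict[str, int]:
    """Stand-in for the framework: collect map output by key, then reduce."""
    collected: Dict[str, List[int]] = defaultdict(list)
    for line_number, line in enumerate(lines, start=1):
        for word, count in word_count_map(line_number, line):
            collected[word].append(count)
    totals: Dict[str, int] = {}
    for word in sorted(collected):
        for key, total in word_count_reduce(word, collected[word]):
            totals[key] = total
    return totals
```

For example, `run_word_count(["Hello World Bye World", "Hello Hadoop Bye Hadoop"])` gives a count of two for each of Bye, Hadoop, Hello, and World.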
So, you can see, this is already very
similar to building an inverted index,
and if you think about it,
the output here is indexed by a word, and
we have already got a dictionary,
basically.
We have got the count.
But what's missing is the document IDs and
the specific
frequency counts of words
in those documents.
So we can modify this slightly to actually
build an inverted index in parallel.
So here's one way to do that.
So in this case, we can assume
the input to a map function is a pair
of a key which denotes the document ID and
the value denoting the string for
that document.
So it's all the words in that document.
And so the map function will
do something very similar to
what we have seen in
the word count example.
It simply groups all the counts of
this word in this document together.
And it will then generate
a set of key value pairs.
Each key is a word.
And the value is the count of this word
in this document plus the document ID.
Now, you can easily see why we
need to add document ID here.
Of course, later, in the inverted index,
we would like to keep this information, so
the map function should keep track of it.
And this can then be sent to
the reduce function later.
Now, similarly another document D2
can be processed in the same way.
So in the end, again, there is a sorting
mechanism that would group them together.
And then we will have just
a key like java associated
with all the documents
that match this key, or
all the documents where java occurred,
and their counts,
right, so
the counts of java in those documents.
And this will be collected together.
And so this will be
fed into the reduce function.
So, now you can see,
the reduce function has already got input
that looks like an inverted index entry,
right?
So, it's just the word and all
the documents that contain the word and
the frequency of the word
in those documents.
So, all you need to do is simply to
concatenate them into a continuous chunk
of data, and this can be then
written into a file system.
So basically, the reduce function
is going to do very minimal work.
And so, this is pseudo-code for
inverted index construction.
Here we see two functions,
procedure Map and procedure Reduce.
And a programmer would specify these two
functions to program on top of MapReduce.
And you can see, basically,
they are doing what I just described.
In the case of Map,
it's going to count the occurrences
of a word using an associative array,
and will output all the counts
together with the document ID here.
Right?
So
the reduce function,
on the other hand, simply concatenates
all the input that it has been given and
then puts them together as one
single entry for this key.
So this is a very simple
MapReduce function, yet
it would allow us to construct an inverted
index at a very large scale, and
data can be processed
by different machines.
The programmer doesn't have to
take care of the details.
So this is how we can do parallel
index construction for web search.
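As with word count, the pseudocode itself is not in the transcript, so this is a hedged Python reconstruction of the two procedures as described. The map counts words in one document with an associative array and emits each count together with the document ID. The reduce puts all postings for a word into a single entry. Sorting the postings by document ID and the helper `build_inverted_index` are additions for illustration, and the document texts in the usage note are made up.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterator, List, Tuple

Posting = Tuple[str, int]  # (document ID, count of the word in that document)


def index_map(doc_id: str, text: str) -> Iterator[Tuple[str, Posting]]:
    """Count each word in one document and emit (word, (doc_id, count))."""
    counts = Counter(text.split())  # associative array: word -> count
    for word, count in counts.items():
        yield word, (doc_id, count)


def index_reduce(
    word: str, postings: List[Posting]
) -> Iterator[Tuple[str, List[Posting]]]:
    """Put all postings for one word together as a single index entry."""
    yield word, sorted(postings)


def build_inverted_index(docs: Dict[str, str]) -> Dict[str, List[Posting]]:
    """Stand-in for the framework: group map output by word, then reduce."""
    grouped: Dict[str, List[Posting]] = defaultdict(list)
    for doc_id, text in docs.items():
        for word, posting in index_map(doc_id, text):
            grouped[word].append(posting)
    index: Dict[str, List[Posting]] = {}
    for word in sorted(grouped):
        for key, entry in index_reduce(word, grouped[word]):
            index[key] = entry
    return index
```

With two small documents, D1 containing "java is hot java" and D2 containing "java beats coffee", the entry for java lists D1 with a count of two and D2 with a count of one.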
So to summarize, web scale indexing
requires some new techniques that
go beyond the standard
traditional indexing techniques.
Mainly, we have to store index on
multiple machines, and this is usually
done by using a file system like Google
File System, a distributed file system.
And secondly, it requires creating
the index in parallel, because it's so
large, it takes a long time to create
an index for all the documents.
So if we can do it in parallel,
it would be much faster, and
this is done by using
the MapReduce framework.
Note that both the GFS and
MapReduce frameworks are very general, so
they can also support
many other applications.
[MUSIC]

[SOUND].
This lecture is about link analysis for
web search.
In this lecture we're going to talk
about web search, and particularly
focusing on how to do link analysis and
use the results to improve search.
The main topic of this lecture is to look
at the ranking algorithms for web search.
In the previous lecture,
we talked about how to create an index.
Now that we have got an index,
we want to see how we can improve
ranking of pages on the web.
Standard IR models can
also be applied here,
in fact they are important building
blocks for supporting web search,
but they aren't sufficient,
mainly for the following reasons.
First, on the web we tend to have
very different information needs.
For example, people might search for
a web page or entry page, and
this is different from
the traditional library search
where people are primarily interested
in collecting literature information.
So these kinds of queries are often
called navigational queries;
the purpose is to navigate into
a particular targeted page.
So for such queries, we might
benefit from using link information.
Secondly, documents have
additional information.
And on the web, web pages are in a web format.
There are a lot of other clues,
such as the layout, the title,
or link information again.
So this has provided an opportunity to
use extra context information of
the document to improve scoring.
And finally,
information quality varies a lot.
So that means we have to consider many
factors to improve the ranking algorithm.
This would give us a more robust way to
rank the pages, making it harder for
any spammer to just manipulate one
signal to improve the ranking of a page.
So as a result people have made
a number of major extensions
to the ranking algorithms.
One line is to exploit links to
improve scoring and
that's the main topic of this lecture.
People have also proposed
algorithms to exploit large scale
implicit feedback information
in the form of clickthroughs.
That's of course in the category
of feedback techniques, and
machine learning is often used there.
In general, in web search the ranking
algorithms are based on machine learning
algorithms to combine
all kinds of features.
And many of them are based on the standard
retrieval models such as BM25 that
we talked about, or
query likelihood, to score
different parts of documents or
to provide additional features
based on content matching.
But link information
is also very useful so
they provide additional scoring signals.
So let's look at links in
more detail on the web.
So this is a snapshot of some
part of the web, let's say.
So we can see there are many links
that link different pages together.
And in this case you can also look at
the center here.
There is a description of a link
that's pointing to the document on
the right side.
Now this description text
is called anchor text.
If you think about this text,
it's actually quite useful
because it provides some extra description
of that page being pointed to.
So, for example, if someone wants
to bookmark the Amazon.com front page,
the person might say,
the big online bookstore,
and then add a link to Amazon, right?
So the description here is actually very
similar to what the user would type
in the query box when they are looking for
such a page.
That's why it's very useful
for ranking pages.
Suppose someone types in a query
like online bookstore or
big online bookstore, right.
The query would match this
anchor text in the page here.
And then this actually
provides evidence for
matching the page that's been pointed to,
that is the Amazon entry page.
So if you match the anchor text
that describes the link to a page,
actually that provides good evidence for
the relevance of the page
being pointed to.
So anchor text is very useful.
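One simple way to picture this is to attach each link's anchor text to the page it points to, and then match the query against that text. The sketch below is only an illustration. The function names, the example source URL, and the scoring rule (the fraction of query terms found in the anchor text) are hypothetical and are not a method given in the lecture.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Link = Tuple[str, str, str]  # (source page, target page, anchor text)


def collect_anchor_text(links: Iterable[Link]) -> Dict[str, List[str]]:
    """Attach each link's anchor text to the page being pointed to."""
    anchors: Dict[str, List[str]] = defaultdict(list)
    for source, target, text in links:
        anchors[target].append(text.lower())
    return dict(anchors)


def anchor_match_score(
    query: str, target: str, anchors: Dict[str, List[str]]
) -> float:
    """Hypothetical score: fraction of query terms seen in the anchor text."""
    terms = query.lower().split()
    if not terms:
        return 0.0
    anchor_words = set(" ".join(anchors.get(target, [])).split())
    matched = sum(1 for term in terms if term in anchor_words)
    return matched / len(terms)
```

With one link whose anchor text is "the big online bookstore" pointing to amazon.com, the query "online bookstore" fully matches that page even if those words never appear on the page itself.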
If you look at the bottom part of this
picture, you can also see there are some
patterns of links, and these links might
indicate the utility of a document.
So for example,
on the right side you can see this
page has received many inlinks.
That means many other pages
are pointing to this page.
And this shows that this
page is quite useful.
On the left side you can see this is
a page that points to many other pages.
So, this is a directory page
that would allow you to
actually see a lot of other pages.
So we can call the first case an authority
page and the second case a hub page.
This means the link information
can help in two ways.
One is to provide extra text for matching.
The other is to provide some
additional scores for the web
pages to characterize how likely a page is
a hub, how likely a page is an authority.
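As a first, crude illustration of these two link patterns, one can simply count inlinks and outlinks. This is an assumption-laden stand-in for real hub and authority scoring, which the link analysis algorithms refine. The function `link_counts` and the edge list in the usage note are hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple


def link_counts(edges: Iterable[Tuple[str, str]]) -> Tuple[Counter, Counter]:
    """Count inlinks and outlinks per page from (source, target) pairs."""
    inlinks: Counter = Counter()
    outlinks: Counter = Counter()
    for source, target in edges:
        outlinks[source] += 1  # the source points out to one more page
        inlinks[target] += 1   # the target is pointed to by one more page
    return inlinks, outlinks
```

For a graph where p1, p2, and p3 all link to page a, and page h links to p1, p2, and p3, page a has the most inlinks (authority-like) and page h has the most outlinks (hub-like).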
So people then, of course, proposed ideas
to leverage this link information.
Google's PageRank,
which was a main technique that they
used in early days, is a good example.
And that is the algorithm
to capture page popularity,
basically to score authority.
So the intuitions here are, links are just
like citations in the literature.
Think about one page
pointing to another page.
This is very similar to one
paper citing another paper.
So, of course,
then if a page is cited often,
then we can assume this page to
be more useful in general, right?
So that's a very good intuition.
Now, PageRank essentially
takes advantage of this
intuition and implements it
with a principled approach.
Intuitively it's essentially doing
citation counting or inlink counting.
It just improves this simple idea
in two ways.
One is that it would consider indirect citations.
So that means you don't just look
at how many inlinks you have,
you also look at what are those
pages that are pointing to you.
If those pages themselves have a lot
of inlinks, well, that means a lot.
In some sense you will get
some credit from that.
But if those pages that are pointing
to you are not being pointed to
by other pages, and they themselves don't
have many inlinks, then, well,
you don't get that much credit.
So that's the idea of
getting indirect citation.
Right, so you can also understand
this idea by looking at, again,
research papers.
If you are cited by, let's say, ten papers,
and those ten papers are just
workshop papers, or some papers
that are not very influential, right,
then although you got ten in-links,
that's not as good as if
you're cited by ten papers that themselves
have attracted a lot of other citations.
So this is a case where
we would like to consider indirect
links, and PageRank does that.
The other idea is that
it's good to smooth the citations,
or to assume that basically every page
has a non-zero pseudo citation count.
Essentially, you are trying to imagine
there are many virtual links that
link all the pages together, so
that you actually get pseudo
citations from everyone.
The reason why they want to
do that is this would allow them
to solve the problem elegantly
with linear algebra techniques.
So I think maybe the best
way to understand PageRank
is to think of it as computing
the probability of a random surfer
visiting every web page, right.
[MUSIC]

[SOUND].
So let's take a look at this in detail.
So in this random surfing model,
at any page we assume the random surfer
would choose the next page to visit.
So this is a small graph here.
That's, of course, an oversimplification
of the complicated web.
But let's say there
are four documents here.
Right, d1, d2, d3 and d4.
And let's assume that a random surfer or
random walker can be at any of these pages.
And then the random surfer could decide
to just randomly jump to any page,
or follow a link and
then visit the next page.
So if the random surfer is at d1,
then, you know, with some probability
that random surfer will follow the links.
Now there are two out-links here.
One is pointing to d3.
The other is pointing to d4.
So the random surfer could pick either
of these two to reach d3 or d4.
But it also assumes that the random
surfer might get bored sometimes.
So the random surfer would decide
to ignore the actual links, and
simply randomly jump to
any page on the web.
So, if it does that,
it would be able to reach
any of the other pages even though there
is no link directly to that page.
So this is the assumed random surfing model.
Imagine a random surfer is
really doing surfing like this,
then we can ask the question:
how likely, on average,
would the surfer actually reach
a particular page, d1, or d2, or d3?
That's the average probability
of visiting a particular page.
And this probability is precisely
what PageRank computes.
So the PageRank score of the document
is the average probability
that the surfer visits a particular page.
Now, intuitively this will basically
capture the in-link count.
Why?
Because if a page has
a lot of in-links, then
it would have a higher chance of being
visited, because there will be more
opportunities for the surfer to
follow a link to come to this page.
And this is why
the random surfing model actually captures
the idea of counting the in-links.
Note that it also considers
the indirect in-links.
Why?
Because if the pages that point to you
themselves have a lot of in-links,
that would mean the random surfer
would very likely reach one of them.
And therefore it increases
the chance of visiting you.
So this is a nice way to capture
both indirect and direct links.
So mathematically, how can we compute
this probability? To see that, we need
to take a look at how this
probability is computed.
So first let's take a look at
the transition matrix here.
And this is just a matrix with
values indicating how likely
the random surfer will go
from one page to another.
So each row stands for a starting page.
For example,
row one would indicate the probability
of going to any of the four pages from d1.
And here we see there are only
two non-zero entries.
Each is 1 over 2, a half.
So this is because if you look at
the graph, d1 is pointing to d3 and d4.
There's no link from d1 to d1 itself or
d2,
so we've got 0s for
the first two columns and
0.5 for d3 and d4.
In general, the element in this matrix,
M sub i j, is the probability
of going from d i to d j.
And obviously for each row,
the values should sum to one,
because the surfer will have to go to
precisely one of these pages.
Right?
So this is a transition matrix.
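To make the transition matrix concrete, here is a minimal Python sketch that builds it from a list of out-links. Only the links from d1 (to d3 and d4) come from the lecture; the out-links of d2, d3, and d4, and the names `out_links` and `transition_matrix`, are assumptions made up for illustration.

```python
# Hypothetical four-page graph. Only d1's out-links (d3 and d4) come from
# the lecture; the links for d2, d3, and d4 are assumed for illustration.
out_links = {
    "d1": ["d3", "d4"],
    "d2": ["d1"],
    "d3": ["d2"],
    "d4": ["d1", "d2"],
}
pages = sorted(out_links)


def transition_matrix(out_links, pages):
    """Return M where M[i][j] is the probability of going from page i to page j."""
    index = {page: i for i, page in enumerate(pages)}
    n = len(pages)
    M = [[0.0] * n for _ in range(n)]
    for src, targets in out_links.items():
        for dst in targets:
            # Each out-link of a page is followed with equal probability.
            M[index[src]][index[dst]] = 1.0 / len(targets)
    return M


M = transition_matrix(out_links, pages)
for page, row in zip(pages, M):
    print(page, row)
```

The first row has 0.5 in the columns for d3 and d4 and 0 elsewhere, and every row sums to one because every page in this assumed graph has at least one out-link.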
Now how can we compute the probability
of a surfer visiting a page?
Well, if you look at
the surfing model, then basically
we can compute the probability
of reaching a page as follows.
So, here on the left-hand side,
you see it's the probability of
visiting page d j at time t plus 1,
because it's the next time point.
On the right-hand side, you can see
the equation involves the probability
of being at page d i at time t.
So you can see the subscript index t
here.
And that indicates that's the probability
that the surfer was at
a document at time t.
So the equation basically captures
the two possibilities of
reaching d j at time t plus 1.
What are these two possibilities?
Well, one is through random jumping, and
one is through following
a link, as we just explained.
So the first part captures the probability
that the random surfer would reach
this page by following a link.
And you can see
the random surfer chooses this
strategy with probability
1 minus alpha, as we assumed.
And so
there is a factor of 1 minus alpha here.
But the main part is really
a sum over all the possible
pages that the surfer could
have been at at time t, right?
There were N pages, so
it's a sum over all the possible N pages.
Inside the sum is the product
of two probabilities.
One is the probability that
the surfer was at d i at time t.
That's p sub t of d i.
The other is the transition
probability from d i to d j.
And so in order to reach this d j page,
the surfer must first be at d i at time t,
and then also would have to follow
the link to go from d i to d j.
So the probability is the probability
of being at d i at time t multiplied by
the probability of going from that
page to the target page d j here.
The second part is a similar sum.
The only difference is that now
the transition probability is a uniform
transition probability,
1 over N.
And this part captures the probability of
reaching this page
through random jumping.
Right.
So, the form is exactly the same.
And this also allows us to see
why PageRank essentially assumes
smoothing of the transition matrix.
If you think about this 1 over N as
coming from another transition matrix
that has all the elements being 1 over N,
the uniform matrix.
Then you can see very clearly
essentially we can merge the two parts.
Because they are of the same form,
we can imagine there's a different
matrix that's a combination of this M and
that uniform matrix where
every element is 1 over N.
In this sense,
PageRank uses this idea of smoothing and
ensuring that there's no zero
entry in such a transition matrix.
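Written out, the equation described above would look roughly like this. The factor of 1 minus alpha on the link-following part and the 1 over N jump probability are from the lecture; putting the factor alpha on the random-jumping part is an assumption, inferred from the requirement that the two choices together have probability one.

```latex
\begin{align*}
p_{t+1}(d_j) &= (1-\alpha)\sum_{i=1}^{N} M_{ij}\, p_t(d_i)
              + \alpha \sum_{i=1}^{N} \frac{1}{N}\, p_t(d_i) \\
             &= \sum_{i=1}^{N} \left[ (1-\alpha)\, M_{ij} + \frac{\alpha}{N} \right] p_t(d_i)
\end{align*}
```

The second line is the merged form: the bracket is the smoothed transition probability, and it is never zero as long as alpha is positive.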
Of course this is a time-dependent
calculation of probabilities.
Now, we can imagine if we want to
compute average probabilities,
the average probabilities probably
would satisfy this equation
without considering the time index.
So let's drop the time index and
just assume that they would be equal.
Now this would give us N equations.
Because for
each page we have such an equation.
And if you look at what variables
we have in these equations,
there are also precisely N variables,
right?
So this basically means
we now have a system of
n equations with n variables,
and these are linear equations.
So basically, now the problem boils down
to solving this system of equations, and
here I also show
the equations in matrix form.
It's the vector p here equals a matrix,
the transpose of the matrix here,
multiplied by the vector p again.
Now if you still remember some knowledge
that you learned from linear algebra,
then you will realize this is precisely
the equation for an eigenvector.
Right?
When you multiply the matrix by this vector,
you get the same vector back.
And this can be solved by using
an iterative algorithm.
So the equation here at the bottom
is basically taken from the previous slide.
So you see the relationship between
the PageRank scores of different pages.
And in this iterative approach, or
power approach,
we simply start with a randomly initialized p.
And then we repeatedly
update this p by multiplying
the matrix here by this p vector.
So I also show a concrete example here.
So you can see this now, if we assume
alpha is 0.2.
Then with the example that
we show here on this slide,
we have the original
transition matrix here.
Right?
That encodes the graph,
the actual links.
And we have this smoothing
transition matrix,
the uniform transition matrix,
representing random jumping.
And we can combine them together with
interpolation to form another
matrix that would be like this.
So essentially we can
imagine now the web looks
like this, and it can be captured by that.
There are virtual links
between all the pages now.
So the PageRank algorithm will
just initialize the p vector first,
and then just compute
the update of this p vector
by using this matrix multiplication.
Now if you rewrite this matrix
multiplication
in terms of
individual equations, you'll see this.
And this is basically
the updating formula for
this particular page's
PageRank score.
So you can also see, if you
want to compute the value of this
updated score for d1,
you basically multiply this row
by this column; we take
the dot product of the two, right?
And that will give us the value for
this element.
So this is how we update the vector.
We start with some initial values for
these,
and then
we just revise the scores, which
generates a new set of scores.
And the updating formula is this one.
So we just repeatedly apply this
until it converges.
And when the matrix is like this,
where there are no zero values,
it can be guaranteed to converge.
And at that point we will just have
the PageRank scores for all the pages.
Now we typically set the initial
values just to 1 over n.
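The iterative update can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: alpha is 0.2 as in the lecture example, row one of `M` (d1 to d3 and d4) is from the lecture, and rows two to four, the tolerance, and the function names are made up for this sketch.

```python
# M[i][j] = probability of following a link from page i to page j.
# Row 1 (d1 -> d3, d4) is from the lecture; rows 2-4 are assumed.
M = [
    [0.0, 0.0, 0.5, 0.5],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
]


def update(p, M, alpha):
    """One step: p_new[j] = sum_i ((1 - alpha) * M[i][j] + alpha / N) * p[i]."""
    n = len(M)
    return [
        sum(((1 - alpha) * M[i][j] + alpha / n) * p[i] for i in range(n))
        for j in range(n)
    ]


def pagerank(M, alpha=0.2, tol=1e-10, max_iter=1000):
    """Repeat the update from a uniform start until the scores stop changing."""
    n = len(M)
    p = [1.0 / n] * n  # initial values set to 1 over N
    for _ in range(max_iter):
        new_p = update(p, M, alpha)
        if sum(abs(a - b) for a, b in zip(new_p, p)) < tol:
            return new_p
        p = new_p
    return p


scores = pagerank(M)
for name, score in zip(["d1", "d2", "d3", "d4"], scores):
    print(name, round(score, 4))
```

Because every entry of the smoothed matrix is positive, the scores stay a probability distribution and settle to a fixed point, which is the eigenvector mentioned above.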
So interestingly, this update
formula can be also interpreted as
propagating scores on the graph.
All right.
Can you see why?
Well if you look at this formula and
then compare that with this graph,
and can you imagine how we
might be able to interpret this
as essentially propagating
scores over the graph.
I hope you will see that indeed
we can imagine we have values
initialized on each of these pages.
All right, so we can have values here,
let's say, that's one over four for each.
And then we're going to use this
matrix to update the scores.
And if you look at the equation here,
this one, basically we're
going to combine the scores
of the pages that possibly would lead to
reaching this page.
So we'll look at all the pages
that are pointing to this page,
and then combine their scores and
propagate
the sum of the scores to this document d1.
We look at the scores
that represent the probability
that the random surfer would be visiting
the other pages before it reaches d1.
And then we just do the propagation
to simulate the probability
of reaching this page d1.
So there are two interpretations here.
One is just the matrix multiplication,
where we repeatedly
multiply the vector by this matrix.
The other is to just think of it as
propagating the scores repeatedly
on the web.
So in practice the computation of PageRank
scores is actually efficient because
the matrices are sparse, and there are some
ways to transform the equation so
you avoid actually literally computing
the values of all of those elements.
Sometimes you may also
normalize the equation, and
that will give you a somewhat
different form of the equation,
but then the ranking of
pages will not change.
There is also this potential
problem of zero out-links.
In that case, if the page does not have
any out-link, then the probabilities of
these pages will not sum to 1.
Basically, the probability of
reaching the next page from this
page will not sum to 1,
mainly because we have lost some
probability mass when we assume that
there's some probability that
the surfer will try to follow links, but
then there's no link to follow, right?
And one possible solution is simply to
use a page-specific damping factor, and
that could easily fix this.
Basically, that's to say alpha
would be 1.0 for a page with no out-link.
In that case the surfer would just have to
randomly jump to another page
instead of trying to follow a link.
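One way to read the page-specific damping factor is to set the jump probability to 1 for any page with no out-links, so that no probability mass is lost. The sketch below does that in the score-propagation style; the graph (where d3 is given no out-links), the function name `pagerank_dangling`, and the fixed iteration count are assumptions for illustration.

```python
def pagerank_dangling(out_links, pages, alpha=0.2, iters=200):
    """PageRank by score propagation, with jump probability 1 for dead-end pages."""
    n = len(pages)
    index = {page: i for i, page in enumerate(pages)}
    p = [1.0 / n] * n
    for _ in range(iters):
        new_p = [0.0] * n
        for src in pages:
            targets = out_links.get(src, [])
            # Page-specific jump probability: always jump if there is no link.
            jump = alpha if targets else 1.0
            mass = p[index[src]]
            for j in range(n):
                new_p[j] += jump * mass / n  # random jumping part
            for dst in targets:
                new_p[index[dst]] += (1 - jump) * mass / len(targets)  # link part
        p = new_p
    return p


pages = ["d1", "d2", "d3", "d4"]
# Hypothetical graph in which d3 has no out-links.
out_links = {"d1": ["d3", "d4"], "d2": ["d1"], "d3": [], "d4": ["d1", "d2"]}
scores = pagerank_dangling(out_links, pages)
print(scores, sum(scores))
```

With this choice the scores still sum to one even though d3 is a dead end.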
So there are many extensions of PageRank.
One extension is to do
topic-specific PageRank.
Note that PageRank doesn't really
use the query information, right?
So, [INAUDIBLE] we can make PageRank
query-specific, however.
So, for example,
in the topic-specific PageRank,
we can simply assume when the surfer
is bored,
the surfer is not going to randomly
jump to any page on the web.
Instead, it's going to jump
to only those pages that are relevant to a query.
For example, if the query is about sports,
then we could assume that when it's
doing random jumping, it's going
to randomly jump to a sports page.
By doing this, then we can bias
PageRank toward a topic like sports.
And then if you know the current query
is about sports, then you can use this
specialized PageRank score
to rank the results.
That would be better than if you
use a generic PageRank score.
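As a rough sketch of this idea, the only change from generic PageRank is that the random jump lands uniformly on a chosen set of topic pages instead of on all pages. The matrix rows other than row one, the choice of d3 as the single "sports" page, and the function name `topic_pagerank` are all assumptions for illustration.

```python
# Row 1 (d1 -> d3, d4) is from the lecture; rows 2-4 are assumed.
M = [
    [0.0, 0.0, 0.5, 0.5],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
]


def topic_pagerank(M, topic_pages, alpha=0.2, iters=200):
    """PageRank where a bored surfer jumps only to pages in topic_pages."""
    n = len(M)
    # Jump distribution: uniform over the topic pages, zero elsewhere.
    jump = [1.0 / len(topic_pages) if j in topic_pages else 0.0 for j in range(n)]
    p = [1.0 / n] * n
    for _ in range(iters):
        p = [
            (1 - alpha) * sum(M[i][j] * p[i] for i in range(n)) + alpha * jump[j]
            for j in range(n)
        ]
    return p


generic = topic_pagerank(M, topic_pages={0, 1, 2, 3})  # jump anywhere
sports = topic_pagerank(M, topic_pages={2})            # pretend d3 is the sports page
print("generic:", [round(s, 4) for s in generic])
print("sports: ", [round(s, 4) for s in sports])
```

In this toy graph the topic page d3 gets a higher score under the biased jump than under the generic one, which is the effect the specialized score is meant to have.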
PageRank is also a general algorithm
that can be used in many other
applications for network analysis, in particular,
for example, for social networks.
We can imagine if you compute
PageRank scores for a social network,
where a link might indicate
a friendship relation,
you'll get some meaningful scores for
people.
[MUSIC]

[SOUND] So
we talked about PageRank as a way
to capture the authorities.
Now we also looked at some other
examples where a hub might be interesting.
So, there is another
algorithm called HITS, and
that's going to compute the scores for
both authorities and hubs.
The intuitions are that
pages that are widely cited are good
authorities, whereas
pages that cite
many other pages are good hubs, right?
But I think the
most interesting idea of this
algorithm, HITS, is that it's going to use
a reinforcement mechanism to
help improve the scoring for
hubs and authorities.
So here's the idea:
it will assume that good
authorities are cited by good hubs.
That means if you're cited by
many pages with good hub scores,
then that increases your authority score.
And similarly, good hubs are those
that point to good authorities.
So if you point to
a lot of good authority pages,
then your hub score would be increased.
So then the two scores would
iteratively reinforce each other,
because you can point
to some good authorities
to get a good hub score,
whereas those authority scores
would also be improved,
because they are pointed to by a good hub.
And this algorithm is also general;
it can have many applications in graph and
network analysis.
So just briefly, here's how it works.
We first also construct a matrix, but
this time we're going to
construct the adjacency matrix.
We're not going to normalize the values,
so if there's a link there's a 1.
If there's no link that's a 0.
Right, again, it's the same graph. And then
we're going to define the hub score of
a page as the sum of the authority scores
of all the pages that it points to.
So whether you are a hub really depends
on whether you are pointing to a lot of
good authority pages.
That's what it says in the first equation.
The second equation
will define the authority score of a page
as the sum of the hub scores
of all those pages
that point to it. So whether you
are a good authority would depend on
whether those pages that
are pointing to you are good hubs.
So you can see this forms
an iterative reinforcement mechanism.
Now these two equations
can also be written
in matrix form.
Right, so
what we get here is then the hub vector is
equal to the product of
the Adjacency matrix.
And the authority vector.
And this is basically the first equation.
Right.
And similarly, the second equation can
be written as the authority vector
is equal to the product of A transpose
multiplied by the hub vector.
And these are just different ways
of expressing these equations.
But what's interesting is that if
you look at the matrix form,
you can also plug the authority
equation into the first one.
So if you do that, you can actually
eliminate the authority vector
completely, and
you get an equation of only hub scores.
Right, the hub score vector is equal
to A multiplied by A transpose,
multiplied by the hub score vector again.
And similarly we can do
a transformation to have an equation for
just the authority scores.
So although we framed the problem
as computing hubs and authorities,
we can actually eliminate one of them
to obtain an equation just for the other.
Now the difference between this and
PageRank is that now the matrix
is actually a multiplication of the
adjacency matrix and its transpose.
So this is different from PageRank.
Right?
But mathematically then we would
be computing the same problem.
So in HITS,
we typically would initialize the values
to, let's say, 1 for all these values.
And then the algorithm will apply
these equations, essentially, and
this is equivalent to multiplying
by the matrix
A and A transpose.
Right.
And so the iterative process is essentially
the same as in PageRank.
But here, because the adjacency matrix
is not normalized, what we have to do
is that after each
iteration we have to normalize.
And this would allow us to
control the growth of the values.
Otherwise they would
grow larger and larger.
And if we do that,
then we will basically get HITS,
the algorithm to compute the hub scores and
authority scores for all of the pages.
And these scores can then be used
in ranking, just like the PageRank scores.
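Here is a minimal Python sketch of the HITS iteration as just described: the hub vector is A times the authority vector, the authority vector is A transpose times the hub vector, and both are normalized after each iteration. The adjacency matrix reuses the hypothetical four-page graph (only d1's links are from the lecture), and the use of the Euclidean norm and a fixed 50 iterations are assumptions; the lecture does not say which normalization is used.

```python
import math

# A[i][j] = 1 if page i links to page j. Only row 1 (d1 -> d3, d4)
# is from the lecture; the other rows are assumed for illustration.
A = [
    [0, 0, 1, 1],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
]


def normalize(v):
    """Scale v to unit Euclidean length so the values do not keep growing."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def hits(A, iters=50):
    n = len(A)
    hub = [1.0] * n
    auth = [1.0] * n
    for _ in range(iters):
        # Authority score: sum of hub scores of the pages pointing to you (A^T h).
        auth = normalize([sum(A[i][j] * hub[i] for i in range(n)) for j in range(n)])
        # Hub score: sum of authority scores of the pages you point to (A a).
        hub = normalize([sum(A[i][j] * auth[j] for j in range(n)) for i in range(n)])
    return hub, auth


hub, auth = hits(A)
print("hub: ", [round(x, 4) for x in hub])
print("auth:", [round(x, 4) for x in auth])
```

In this toy graph d4 ends up as the strongest hub because it points to d1 and d2, which are the pages with the strongest authority scores.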
So to summarize, in this lecture we have
seen that link information is very useful.
In particular,
the anchor text is very useful
to enrich the text
representation of a page.
And we also talked about PageRank and
HITS as two major
link analysis algorithms.
Both can generate scores for
web pages that can be used in
the ranking function.
Note that PageRank and
HITS are also very general algorithms, so
they have many applications in
analyzing other graphs or networks.
[MUSIC]

[SOUND] This lecture is about
learning to rank.
In this lecture, we're going to
continue talking about web search.
In particular, we're going to talk about
using machine learning to combine different
features to improve the ranking function.
So the question that we
address in this lecture is
how we can combine many
features to generate
a single ranking function
to optimize search results.
In the previous lectures,
we have talked about
a number of ways to rank documents.
We have talked about some retrieval
models, like BM25 or query likelihood.
They can generate content-based scores
for matching documents with a query.
And we also talked about
link-based approaches,
like PageRank, that can give additional
scores to help us improve ranking.
Now the question is how can
we combine all these features, and
potentially many other
features, to do ranking?
And this will be very useful for
ranking web pages, not only to improve
accuracy, but also to improve
the robustness of the ranking function,
so that it's not easy for
a spammer to just perturb one or
a few features to promote a page.
So the general idea of learning to
rank is to use machine learning to
combine these features to optimize
the weight on different features to
generate the optimal ranking function.
So we would assume that
given a query-document pair,
Q and D,
we can define a number of features.
And these features can vary from
content-based features, such as
a score of the document
with respect to the query
according to a retrieval function
such as BM25 or
query likelihood or pivoted length
normalization or PL2, et cetera.
It can also be a link-based
score, like a PageRank score.
It can also be an application of retrieval
models to the anchor text of the page.
Right?
Those are the text descriptions
of links that point to this page.
So these can all be clues about whether
this document is relevant or not.
We can even include a feature such
as whether the URL has a [INAUDIBLE],
because this might be an indicator
of a home page or entry page.
So, all of these features can then be
combined together to generate the ranking
function.
The question is of course,
how can we combine them?
In this approach,
we simply hypothesize that the probability
that this document is relevant to this query
is a function of all these features.
So we can hypothesize
that the probability of
relevance is related to these
features through a particular
form of function
that has some parameters.
These parameters can
control the influence of
different features on the final relevance.
This is, of course, just an assumption.
Whether this assumption really makes
sense is still a big question,
and you have to empirically
evaluate the function.
But by hypothesizing that the relevance
is related to those features
in a particular way, we can then
combine these features to generate
a potentially more powerful ranking
function, a more robust ranking function.
Naturally, the next question is how
do we estimate those parameters?
You know, how do we know which
features should have high weight and
which features should have low weight?
So this is a task of training or learning.
All right.
So,
in this approach what we will
do is use some training data.
Those are the data that
have been judged by users,
so that we already know
the relevance judgments.
We already know which documents should
be ranked high for which queries, and
this information can be based on
real judgments by users, or
it can also be approximated by just
using clickthrough information,
where we can assume the clicked documents
are better than the skipped documents, or
clicked documents are relevant and
the skipped documents are not relevant.
So, in general, we fit such a hypothesized
ranking function to the training data,
meaning that we will try to optimize its
retrieval accuracy on the training data.
And we adjust these parameters to see
how we can optimize the performance
of the function on the training data in
terms of some measure such as MAP or NDCG.
So the training data would
look like a table of tuples.
Each tuple has three elements: the query,
the document, and the judgment.
So, it looks very much like
the relevance judgments that we
talked about in the evaluation
of retrieval systems.
[MUSIC]

[SOUND] So
now let's take a look at a specific
method that's based on regression.
Now this is one of many
different methods; in fact,
it's one of the simplest methods.
And I chose this to explain the idea
because it's so simple.
So in this approach we simply assume
that the relevance of a document
with respect to the query is related to
a linear combination of all the features.
Here I used X i to denote the feature.
So X i of Q and D is a feature.
And we can have as many features as
we would like.
And we assume that these features
can be combined in a linear manner.
And each feature is controlled
by a parameter here.
And this beta is a parameter,
that's a weighting parameter.
A larger value would mean the feature
would have a higher weight and
it would contribute more
to the scoring function.
The specific form of the function
actually also involves
a transformation of
the probability of relevance.
So this is the probability of relevance.
We know that the probability of relevance
is within the range from 0 to 1.
And we could have just assumed
that the scoring function is
related to this linear combination.
Right, so we can do
a linear regression, but
then the value of this linear
combination could easily go beyond 1.
So this transformation here would map the
0 to 1 range to the whole
range of real values.
You can verify it
by yourself.
So this allows us then to connect
the probability of relevance,
which is between 0 and 1, to a linear
combination of arbitrary coefficients.
And if we rewrite this into a probability
function, we will get the next one.
So on this side of the equation,
we will have the probability of relevance.
And on the right-hand side,
we will have this form.
Now this form is clearly non-negative.
And it still involves the linear
combination of features.
And it's also clear that
if this value,
the value
of the linear combination
in the equation above,
if this value here
is large, then it
will mean this value is small.
And therefore, this probability,
this whole probability, would be large.
And that's what we expect.
Basically, if this
combination gives us a high value,
then the document's more likely relevant.
So this is our hypothesis.
Again, this is not necessarily
the best hypothesis,
but this is a simple way to connect
these features with
the probability of relevance.
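For reference, the two forms described above can be written as follows. This is a sketch under assumptions: the log-odds transformation is the standard logistic regression form, which matches the behavior described (a large linear combination makes the exponential term small and the probability large), and the intercept beta sub 0 is assumed here because the spoken description only mentions one weight per feature.

```latex
\begin{align*}
\log \frac{P(R=1 \mid Q, D)}{1 - P(R=1 \mid Q, D)}
  &= \beta_0 + \sum_{i=1}^{n} \beta_i X_i(Q, D) \\
P(R=1 \mid Q, D)
  &= \frac{1}{1 + \exp\left(-\beta_0 - \sum_{i=1}^{n} \beta_i X_i(Q, D)\right)}
\end{align*}
```

The left side of the first line ranges over all real values as the probability ranges over 0 to 1, which is why it can be set equal to an unbounded linear combination.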
So now we have this
combination function.
The next task is to see how we
need to estimate the parameters so
that the function can truly be applied.
Right.
Without knowing
the beta values,
it's hard to apply this function, okay.
So let's see how we can estimate the beta values.
All right.
Let's take a look at a simple example.
In this example, we have three features.
One is the BM25 score of
the document and the query.
One is the PageRank score of
the document, which might or
might not depend on the query.
We might have a
topic-sensitive PageRank
that would depend on the query.
Otherwise, the general PageRank
doesn't really depend on the query.
And then we have the BM25 score on
the anchor text of the document.
These are then the feature values for
a particular document-query pair.
And in this case the document is D1.
And
the judgment says that it's relevant.
Here's another training instance,
with these feature values.
But in this case it's non-relevant, okay?
This is an overly simplified case,
where we just have two instances.
But
it's sufficient to illustrate the point.
So what we can do is we use the maximum
likelihood estimator to actually estimate
the parameters.
Basically, we're going to predict
the relevance status of the document
based on the feature values.
That is, given that we observe
these feature values here,
can we predict the relevance?
Yeah.
And of course, the prediction will be
using this function that you see here.
And we hypothesize that
the probability of relevance is related to
the features in this way.
So we're going to see for
what values of beta we can
predict the relevance well.
What do we mean
by
predicting the relevance well?
Well, we just mean that
in the first case, for D1,
this expression here,
right here, should give higher values.
In fact, we would hope this
gives a value close to one.
Why?
Because this is a relevant document.
On the other hand, in the second case for
D2 we hope this value would be small.
Right.
Why?
It's because it's a non-relevant document.
So now let's see how this can
be mathematically expressed.
And this is similar to
expressing the probability of a document,
only that we are not talking about
the probability of words, but
talking about the probability
of relevance, 1 or 0.
So what's the probability
that this document
is relevant if it has
these feature values?
Well, this is
just this expression, right?
We just need to plug in the X i's.
So that's what we'll get.
It's exactly like what we have seen above,
only that we replace these X i's
with now specific values.
And so, for example, this 0.7 goes
to here and this 0.11 goes to here.
And these are different feature values and
we'll combine them in this particular way.
The beta values are still unknown.
But this gives us the probability
that this document is relevant
if we assume such a model.
Okay, and
we want to maximize this probability since
this is a relevant document.
What do we do for the second document?
Well, we want to compute the
probability that the prediction is
non-relevant.
So, this would mean we have to compute
1 minus this expression,
since this expression
is actually the probability of relevance.
So to compute the non-relevance
from relevance, we just do 1 minus
the probability of relevance, okay?
So this whole expression then
is just our probability of predicting
these two relevance values.
One is 1
here, one is 0.
And this whole equation
is our probability
of observing a 1 here and
observing a 0 here.
Of course this probability depends
on the beta values, right?
So then our goal is to
adjust the beta values to make this
whole thing reach its maximum.
Make that as large as possible.
So that means we
are going to compute this.
The betas are just the parameter
values that would maximize this
whole likelihood expression.
And what it means is, if you
look at the function,
we're going to choose betas to
make this as large as possible,
and make this also as large as possible,
which is equivalent to saying make
this part as small as possible.
And this is precisely what we want.
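The estimation step can be sketched with plain gradient ascent on the log of this likelihood. This is only an illustration under assumptions: the values 0.7 and 0.11 for D1 are mentioned in the lecture, while D1's third feature value, all of D2's feature values, the intercept, the learning rate, and the number of steps are made up. With just two instances that can be perfectly separated, the likelihood has no finite maximizer, so the sketch simply stops after a fixed number of steps.

```python
import math


def prob_relevant(beta, x):
    """Logistic model: P(R=1 | features x); beta[0] is an assumed intercept."""
    z = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))
    return 1.0 / (1.0 + math.exp(-z))


def likelihood(beta, data):
    """Probability of observing all the judgments under the model."""
    result = 1.0
    for x, y in data:
        p = prob_relevant(beta, x)
        result *= p if y == 1 else 1.0 - p
    return result


def fit(data, n_features, lr=0.5, steps=2000):
    """Gradient ascent on the log-likelihood, starting from all-zero betas."""
    beta = [0.0] * (n_features + 1)
    for _ in range(steps):
        grad = [0.0] * len(beta)
        for x, y in data:
            err = y - prob_relevant(beta, x)  # gradient of the log-likelihood
            grad[0] += err
            for k, xi in enumerate(x):
                grad[k + 1] += err * xi
        beta = [b + lr * g for b, g in zip(beta, grad)]
    return beta


# (BM25, PageRank, BM25 on anchor text), judgment. 0.7 and 0.11 are from
# the lecture; the remaining feature values are hypothetical.
data = [
    ((0.7, 0.11, 0.65), 1),  # D1: relevant
    ((0.3, 0.05, 0.40), 0),  # D2: non-relevant
]
beta = fit(data, n_features=3)
print("likelihood before:", likelihood([0.0] * 4, data))
print("likelihood after: ", likelihood(beta, data))
print("P(relevant | D1):", prob_relevant(beta, data[0][0]))
print("P(relevant | D2):", prob_relevant(beta, data[1][0]))
```

After training, the same `prob_relevant` function can be applied to the features of any new query-document pair to produce a ranking score.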
So once we do the training,
now we will know the beta values.
So then this function will be well
defined once the beta values are known.
Both this and
this will become completely specified.
So for any new query and new document we
can simply compute the features
for that pair, and then we just use this
formula to generate a ranking score.
And this scoring function can be used
to rank documents for a particular query.
So that's the basic idea of
learning to rank.
[MUSIC]

[NOISE].
There are many more advanced learning
algorithms than the regression-based
approaches.
And they generally
attempt to directly
optimize a retrieval measure,
like MAP or nDCG.
Note that the optimization objective
function that we have seen
on the previous slide is not directly
related to a retrieval measure.
Right?
By maximizing the prediction of one or
zero,
we don't necessarily optimize
the ranking of those documents.
One can imagine that while
our prediction may not be too bad,
let's say both are around 0.5,
so it's kind of in the middle of zero and
one for
the two documents,
the ranking can be wrong.
So we might have a larger value for
D2 than D1.
So that won't be good from
a retrieval perspective,
even though by the likelihood function,
it's not bad.
In contrast, we might have another
case where we predicted values
around 0.9, let's say,
and by the objective function,
the error will be larger, but if we
can get the order of the two documents
correct, that's actually a better result.
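A tiny numeric illustration of this mismatch, with D1 relevant and D2 non-relevant as in the example. The specific predicted values (0.49 and 0.51, then 0.92 and 0.90) are made up to match "around 0.5" and "around 0.9".

```python
def likelihood(p_d1, p_d2):
    """Likelihood of the judgments: D1 is relevant, D2 is non-relevant."""
    return p_d1 * (1.0 - p_d2)


def ranking_correct(p_d1, p_d2):
    """The ranking is right when the relevant document scores higher."""
    return p_d1 > p_d2


case_a = (0.49, 0.51)  # both near 0.5, but D2 is scored above D1
case_b = (0.92, 0.90)  # both near 0.9, and D1 is scored above D2

for name, (p1, p2) in [("A", case_a), ("B", case_b)]:
    print(name, "likelihood:", likelihood(p1, p2),
          "ranking correct:", ranking_correct(p1, p2))
```

Case A has the higher likelihood but the wrong order, while case B has the lower likelihood but the right order, which is why the likelihood objective is not the same thing as a retrieval measure.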
So these new more advanced approaches
will try to correct that problem.
Of course, then the challenge is
that the optimization problem
will be harder to solve.
And then researchers have proposed
many solutions to the problem.
And you can read more in
the references at the end
to know more about these approaches.
Now these learning to rank approaches
are actually general, so they can also be
applied to many other ranking problems,
not just the retrieval problem.
So here I list some, for
example, recommender systems,
computational advertising,
or summarization, and
there are many others that you can
probably encounter in your applications.
To summarize this lecture,
we have talked about using machine
learning to combine multiple features
to improve ranking results.
Actually, the use of machine learning
in information retrieval
started many decades ago.
So, for example, the Rocchio feedback
approach that we talked about earlier
was a machine learning approach
applied to learning from feedback, but
the most recent use of machine
learning has been driven by some changes
in the environment of applications
of retrieval systems.
And first it's, mostly,
driven by the availability of a lot of
training data in the form of clicks rules.
Such data weren't available before.
So the data can provide a lot
of useful knowledge about
relevance and machine learning methods
can be applied to leverage this.
Secondly it's also due by
the need of combining them.
In the features.
And
this is not only just because there
are more features available on the web
that can be naturally re-used
with improved scoring.
It's also because by combining them,
we can improve the robustness of ranking.
So this is designed for combating spams.
Modern search engines all use some kind
of machine learning techniques to combine
many features to optimize ranking and
this is a major feature of
current engines such as Google and Bing.
Learning to rank
is still an active research
topic in the community, so you can
expect to see new results being developed
in the next few years,
perhaps.
Here are some additional readings that
can give you more information
about how learning to rank works and
also about some advanced methods.
[MUSIC]

[SOUND].
This lecture is about
the future of web search.
In this lecture, we're going to talk
about some possible future trends
of web search and intelligent information
retrieval systems in general.
In order to further improve
the accuracy of a search engine,
it's important to consider
special cases of information need.
So one particular trend could be to
have more and more specialized and
customized search engines, and they
can be called vertical search engines.
These vertical search engines can be
expected to be more effective than
the current general search engines
because they could assume that
users are a special group of users that
might have a common information need,
and then the search engine can be
customized to serve such users.
And because of the customization,
it's also possible to do personalization.
So the search can be personalized,
because we have a better
understanding of the users.
Because of the restriction of the domain,
we also have some advantages
in handling the documents, because we can
have better understanding of documents.
For example, particular words may
not be ambiguous in such a domain.
So we can bypass the problem of ambiguity.
Another trend we can expect to see
is that the search engine will
be able to learn over time.
It's like lifetime learning or
lifelong learning, and this is, of course,
very attractive because that means the
search engine will improve itself.
As more people are using it, the search
engine will become better and better, and
this is already happening,
because the search engines can learn
from the [INAUDIBLE] of feedback.
As more users use it, the quality
of the search engine for
the popular queries that are typed in by
many users becomes better,
so this is another
feature that we will see.
The third trend might be
the integration of
multiple modes of information access.
So search, navigation, and
recommendation or filtering might be
combined to form a full-fledged
information management system.
And in the beginning of this course,
we talked about push versus pull.
These are different modes of information
access, but these modes can be combined.
And similarly, in the pull mode, querying
and the browsing could also be combined.
And in fact we're basically doing that
today with the [INAUDIBLE] search engines.
We are querying, sometimes browsing,
clicking on links.
Sometimes we've got some
information recommended,
although in most cases the information
is recommended because of advertising.
But in the future, you can imagine
a seamlessly integrated system with
multiple modes of information access, and
that would be convenient for people.
Another trend is that we might see systems
that try to go beyond search
to support user tasks.
After all, the reason why people want
to search is to solve a problem or
to make a decision or perform a task.
For example, consumers might search for
opinions about products in
order to purchase a product, to
choose a good product to buy. So
in this case it would be beneficial to
support the whole workflow of purchasing
a product, or choosing a product.
In this area, the current search
engines already provide good support.
For example, you can sometimes look at the
reviews, and then if you want to buy it,
you can just click on the button to go to the
shopping site and directly get it done.
But they do not provide
good task support for many other tasks.
For example, as a researcher,
you might want to find the relevant
literature or cite the literature,
and there's not much support for
finishing a task such as writing a paper.
So, in general, I think
there are many opportunities to innovate.
So in the following few slides, I'll
be talking a little bit more about some
specific ideas or thoughts that hopefully,
can help you in imagining new
application possibilities.
Some of them might be already relevant
to what you are currently working on.
In general, we can think about any
intelligent system, especially an intelligent
information system, as being specified
by these three nodes.
And so
if we connect these three into a triangle,
then we'll be able to specify
an information system.
I call this the
Data-User-Service Triangle.
So basically the three questions you
ask would be: who are you serving,
what kind of data are you managing, and
what kind of service do you provide?
This would help us
basically specify a system.
And there are many different ways
to connect them. Depending on
how you connect them,
you will have a different kind of system.
So let me give you some examples.
On the top,
you can see different kinds of users.
On the left side, you can see different
types of data or information, and
on the bottom,
you can see different service functions.
Now imagine you can connect
all these in different ways.
So, for example, you can connect
everyone with web pages, and
support search and
browsing. What do you get?
Well, that's web search, right?
What if we connect UIUC employees with
organization documents or enterprise
documents to support search and
browsing? Well, that's enterprise search.
If you connect scientists
with literature information,
you can provide all kinds of service,
including search, browsing,
alerts about new relevant documents,
mining and analyzing research trends,
or task support and
decision support.
For example,
we might be able to provide support for
automatically generating
a related work section for
a research paper, and
this would be closer to task support.
Right?
So then
we can imagine this would
be a literature assistant.
If we connect the online shoppers
with blog articles or product reviews
then we can help these people
to improve shopping experience.
So we can provide, for example data mining
capabilities to analyze the reviews,
to compare products, compare sentiment of
products and to provide task support or
decision support to help them
choose what product to buy.
Or we can connect customer service
people with emails from the customers,
and we can imagine a system
that can provide an analysis
of these emails to find the major
complaints of the customers.
We can imagine a system that
could provide task support
by automatically generating
a response to a customer email.
Maybe intelligently attach
also a promotion message
if appropriate, if they detect that that's
a positive message, not a complaint, and
then you might take this opportunity
to attach some promotion information.
Whereas if it's a complaint,
then you might be able to
automatically generate some
generic response first and
tell the customer that he or she can
expect a detailed response later, etc.
All of these are trying to help
people improve their productivity.
So this shows that
there really are a lot of opportunities.
They are only restricted
by our imagination.
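The triangle described above can be sketched as a small data structure. This is only an illustration, assuming a hypothetical `InformationSystem` class of my own naming. The three example systems are the ones mentioned in the lecture.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class InformationSystem:
    """One way of connecting the three nodes of the Data-User-Service Triangle."""
    users: str        # who are you serving?
    data: str         # what kind of data are you managing?
    services: tuple   # what kind of service do you provide?

    def describe(self) -> str:
        return f"{', '.join(self.services)} over {self.data} for {self.users}"

web_search = InformationSystem("everyone", "web pages", ("search", "browsing"))
enterprise_search = InformationSystem(
    "UIUC employees", "enterprise documents", ("search", "browsing")
)
literature_assistant = InformationSystem(
    "scientists",
    "literature",
    ("search", "browsing", "alerts", "trend mining", "task support"),
)

for system in (web_search, enterprise_search, literature_assistant):
    print(system.describe())
```

Changing any one of the three fields gives a different kind of system, which is the point of the triangle.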
So this picture shows the trend
of the technology, and also
characterizes intelligent
information systems along three angles.
You can see in the center there's
a triangle that connects keyword queries,
search, and bag-of-words representation.
That means the current search engines
basically provide search support
to users, mostly model
users based on keyword queries,
and see the data through
a bag-of-words representation.
So it's a very simple approximation of
the actual information in the documents.
But that's what the current system does.
It connects these three nodes
in such a simple way:
it only provides a basic search function,
doesn't really understand the user,
and doesn't really understand
much of the information in the documents.
Now, I showed some trends to push each
node toward a more advanced function.
So think about the user node here, right?
So we can go beyond keyword queries,
look at the user's search history,
and then further model the user
completely to understand
the user's task environment,
task need, context, or other information.
Okay, so this is pushing for
personalization and a complete user model.
And this is a major
direction in research
in order to build intelligent
information systems.
On the document side,
we can also
go beyond the bag-of-words representation
to have an entity-relation representation.
This means we'll recognize people's names,
their relations, locations, etc.
And this is already feasible with
today's natural language processing techniques.
Google's recent
initiative on the knowledge graph,
if you haven't heard of it,
is a good step in this direction.
And once we can get to that level
in a robust manner at large scale,
it can enable the search engine
to provide a much better service.
In the future we would like to have
knowledge representation where we
can add perhaps inference rules, and
then the search engine would
become more intelligent.
So this calls for
large-scale semantic analysis, and
perhaps this is more feasible for
vertical search engines.
It's easier to make progress
in the particular domain.
Now on the service side,
we see we need to go beyond search
to support information access in general.
Search is only one way to get access
to information; recommender
systems, and push and pull, are different
ways to get access to relevant information.
But going beyond access,
we also need to help people digest the
information once the information is found,
and this step has to do with analysis
of information or data mining.
We have to find patterns or
convert the text information into
real knowledge that can
be used in application or
actionable knowledge that can be used for
decision making.
And furthermore the knowledge
will be used to help a user to
improve productivity in finishing a task,
for example, a decision-making task.
Right, so this is a trend.
So basically,
in this dimension, we anticipate that
in the future intelligent information
systems will provide intelligent and
interactive task support.
Now I should also emphasize interactive
here, because it's important to optimize
the combined intelligence of the users and
the system.
We can get some help
from users in some natural way,
and we don't have to assume the system
has to do everything. When the human
user and the machine can collaborate in
an intelligent and efficient way,
then the combined intelligence
will be high, and in general
we can minimize the user's overall
effort in solving a problem.
So this is the big picture of future
intelligent information systems,
and this hopefully can provide
us with some insights about
how to make further innovations
on top of what we have today.
[MUSIC]

[SOUND]
Hello,
welcome to the course in
Text Retrieval and Search Engines.
I'm Cheng Xiang Zhai.
I have a nickname Cheng.
I'm a professor of the Department
of Computer Science at
the University of Illinois
at Urbana-Champaign.
This first lecture is a basic
introduction to the course,
a brief introduction to what
we'll cover in the course.
We're going to first talk about the data
mining specialization, since this course is
part of that specialization.
And then we'll cover the motivation
and objectives of the course.
This will be followed by prerequisites,
course format, and reference books.
And then finally we'll talk
about the course schedule,
which has a number of topics to be
covered in the rest of this course.
So the data mining specialization
offered by the University of Illinois
at Urbana-Champaign is really to address
the need for data mining techniques to
handle a lot of big data,
to turn the big data into knowledge.
There are five lecture-based courses,
as you see on the slide.
Plus one capstone
project course at the end.
I'm teaching two of them, which are
this course, Text Retrieval and
Search Engines, and this one.
So the two courses that I cover
here are all about text data.
In contrast, the other courses
cover more general techniques that can
be applied to all kinds of data.
So Pattern Discovery, taught by
Professor Jiawei Han, and Cluster Analysis,
again taught by him, are about general data
mining techniques to handle structured
and unstructured text data.
And Data Visualization,
covered by Professor John Hart, is about
general visualization techniques,
again applicable to all kinds of data.
So the motivation for this course,
and in fact also for
the other course that I'm teaching,
is that we have a lot of text data.
And the data is everywhere,
is growing rapidly, so
you must have been
experiencing this growth.
Just think about how much text data
you're dealing with every day.
I listed some data types here, for
example, on the internet we see a lot
of web pages, news articles, etc.
And then we have blog articles,
emails, scientific literature,
and tweets. As we are speaking,
perhaps a lot of tweets are being written,
and a lot of emails are being sent.
So the amount of text data is beyond
our capacity to understand it.
Also, the amount of data makes it possible
to actually analyze the data to discover
interesting knowledge, and that's what
we mean by harnessing big text data.
[MUSIC]

[MUSIC]
Text data is very special.
In contrast to the data captured
by machines such as sensors,
text data is produced by humans.
It is also meant
to be consumed by humans.
And this has some
interesting consequences.
Because it is produced by humans, it tends
to have a lot of useful knowledge about
people's preferences and
people's opinions about everything.
And that makes it possible to mine
text data to discover those
latent preferences of people,
which could be very useful to build
an intelligent system to help people.
You can also think about
scientific literature:
it's a way to encode
our knowledge about the world.
So it's very high quality content, yet we
have difficulty digesting all the content.
Now, as a result of the fact that
text is consumed by us humans,
we also need intelligent software tools
to help people digest the content;
otherwise we'd miss
a lot of useful content.
This slide shows that the human really
plays an important role in text data mining.
We have to consider the human in the loop, and
we have to consider the fact that
the text is generated by humans.
So, here are some examples of
useful text information systems.
This is by no means a complete
list of all applications.
I categorize them into
different categories.
But you can probably imagine
other kinds of applications.
So let's take a look at some of them.
Search, for example:
we all know search engines, especially
web search engines. I bet
all of you are using Google, or Bing, or
another web search engine all the time.
And we also have library search systems.
And in fact, wherever you have a lot of
text data, you would have a search engine.
So for example, you might have
a search box on your laptop.
All right,
to search content on your computer.
So that's one kind of application system,
but
we also have filtering systems or
recommender systems.
Those systems can push
information to users.
They can recommend useful
information to users.
So again, news filters, spam filters,
literature or movie recommenders.
Now, not all of them are necessarily
recommending information to you.
For example, an email filter,
a spam email filter,
actually filters out
the spam from your inbox.
But in nature these are similar systems in
that they have to make a binary decision
regarding whether to retain
a particular document or discard it.
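The binary retain-or-discard decision can be sketched as follows. This is a toy illustration only: the word list, the scoring rule, and the threshold are all assumptions made up for the example, not how a real spam filter works.

```python
# Hypothetical list of words that suggest spam (illustration only).
SPAM_WORDS = {"free", "winner", "prize"}

def spam_score(text: str) -> float:
    """Fraction of words in the message that appear in SPAM_WORDS."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    if not words:
        return 0.0
    return sum(w in SPAM_WORDS for w in words) / len(words)

def retain(text: str, threshold: float = 0.3) -> bool:
    """Binary decision: keep the message only if its spam score is below the threshold."""
    return spam_score(text) < threshold

inbox = [
    "You are a winner! Claim your free prize",
    "Meeting moved to 3pm tomorrow",
]
kept = [message for message in inbox if retain(message)]
print(kept)
```

A recommender makes the same kind of decision with the opposite intent: it retains the documents it believes are useful and pushes them to the user.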
Another kind of system is
the categorization system.
So for example, in handling emails,
you might prefer an automatic
sorter that would automatically
sort incoming emails into the proper
folders that you created.
Or we might want to categorize product
reviews into positive or negative.
News agencies might be interested in
categorizing news articles into
all kinds of subject categories.
Those are all categorization systems.
Finally there are also systems
that might do more analysis
or, you can say, mine text data.
And these can be text mining systems or
information extraction systems,
and they can be
used to analyze text data in more detail
to discover potentially useful knowledge.
For example, companies might
be interested in discovering
major complaints from their customers
based on the email messages that
they have received from the customers.
So
having a system to support that would
really help improve their productivity and
customer relations.
Also, in business intelligence, companies
are often interested in analyzing product
reviews to understand the relative
strengths of their own products
in comparison with competitors.
So these are all examples
of these text mining systems.
[INAUDIBLE] we have a lot of data
in particular literature data.
So, there's also great opportunity
of using computer systems
to analyze the data to
automatically read literature, and
to gain knowledge, and
to help biologists make discoveries.
And you can imagine many others.
So the point is that with so
much text data,
we can build very useful systems to
help people in many different ways.
Now how do we build these systems?
Well these actually are the main
technologies that we'll be talking
about in this course and the other course
that I'm teaching for this specialization.
The main techniques for
building these systems and also for
harnessing the text data are text
retrieval and text data mining.
So I use this picture to show
the relation between these two
somewhat different techniques.
We start with big text data, right?
But for any application, we don't
necessarily need to use all the data.
Often we only need a small subset of the
most relevant data, and that's shown here.
So text retrieval is to convert big,
raw text data into that small
subset of most relevant data that are most
useful for a particular application.
And this is usually
done by search engines.
And so
this will be covered in this course.
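As a rough illustration of turning a larger collection into a small relevant subset, here is a minimal sketch. The scoring rule, a simple count of query-term occurrences, is an assumption chosen for brevity and is not one of the retrieval models covered later in the course. The documents are made up.

```python
from collections import Counter

def score(query: str, doc: str) -> int:
    """Count how many times the query terms occur in the document."""
    query_terms = set(query.lower().split())
    counts = Counter(doc.lower().split())
    return sum(counts[term] for term in query_terms)

def retrieve(query: str, docs: list, k: int = 2) -> list:
    """Return up to k documents with the highest non-zero scores."""
    ranked = sorted(docs, key=lambda d: score(query, d), reverse=True)
    return [d for d in ranked[:k] if score(query, d) > 0]

docs = [
    "search engines rank documents for a query",
    "the weather is sunny today",
    "a query is matched against documents by the search engine",
    "recipes for dinner",
]
top = retrieve("search query documents", docs, k=2)
print(top)
```

The two documents that mention the query terms are returned and the other two are left out, which is the big-data-to-small-relevant-data step in miniature.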
After we have got a small
amount of relevant data,
we also need to further analyze the data
to help people digest the data, or
to turn the data into
actionable knowledge.
And this step is called text mining,
where we use a number of techniques to
mine the data to get useful knowledge or
patterns.
And the knowledge can then be used
in many different applications.
And this part, text mining, will be
covered in the other course that I'm
teaching called Text Mining and Analytics.
The emphasis of this course
is on basic concepts and
practical techniques in text retrieval.
More specifically we will
cover how search engines work.
How to implement a search engine.
How to evaluate a search engine, so
that you know one search engine is
better than another or
one method is better than another.
How to improve and
optimize a search engine system.
And how to build a recommender system.
We also hope to provide hands-on
experience on multiple aspects.
One is to create a test collection for
evaluating search engines.
This is very important for knowing
which technique actually works well,
and whether your search engine system
is really good for your application.
The other aspect is to experiment
with search engine algorithms.
In practice, you will have to face
choices of different algorithms.
So it's important to know
how to compare them and
to figure out how they work or
maybe potentially, how to improve them.
And finally, we'll provide a platform for
you to do a search engine competition,
where you can compare your different
ideas to see which idea works better
on some data set.
The prerequisites for
this course are minimal.
Basically we hope you have some basic
concepts of computer science, for
example data structures.
And we hope you will be comfortable
with programming, especially in C++,
because that's the language that we'll use
for some of the programming assignments.
The format is lectures plus quizzes,
as often happens in MOOCs.
And we will also provide
programming assignments for
those of you that have
the resources to do that.
We don't really have any required
readings for this course.
That just means if you follow all
the lecture videos carefully,
you're supposed to know all the basic
concepts and the basic techniques.
But it's always useful to read more, so
here we provide a list of
some useful reference books.
This is in time order, and
that also includes a book that
and I are co-authoring now, and
we make some draft chapters
available on this website.
And you can find more readings and
reference books on this website.
Finally, this is the course schedule.
It's just a topic map for
the rest of the course,
and it shows the topics that we will
cover in the remaining lectures.
This picture also shows basic flow of
information in a text information system.
So starting from the big text data, the
first step is to do some natural language
content analysis, because text data is
in the form of natural language text.
So we need to understand
the text to some extent
in order to do something useful for
the users.
So this is the first
topic that we will cover.
And then on top of that as you
can see there are two boxes here.
Those are two types of systems
that can be used to help people
get access to the most relevant data.
Or in other words, those are the two
kinds of systems that will convert
big text data into small
relevant text data.
Search engines are helping
users to search or
to query the data to get
the most relevant documents out.
Recommender systems are to
recommend information to users,
to push information to users.
So those are two complementary ways of
getting users connected to the most
relevant data at the right time.
So this part is called text access,
and this will be the next topic.
And after we cover that we are going
to cover a number of topics,
all about the search engines.
Now the text access
topic is a brief topic,
a brief coverage of
the two kinds of systems.
In the remaining topics, we'll cover
search engines in much more detail.
That includes the text retrieval problem,
text retrieval methods, how to evaluate
these methods, implementation of
the system, and web search applications.
And after these, we're going to
cover recommender systems.
So this is what you can expect
in the rest of this course.
Thanks.
[MUSIC]

[SOUND] This lecture is about
natural language content analysis.
As you see from this picture,
this is really the first step
to process any text data.
Text data are in natural languages.
So, computers have to understand
natural languages to some extent in
order to make use of the data, so
that's the topic of this lecture.
We're going to cover three things.
First, what is natural language
processing, which is a main technique for
processing natural language
to obtain understanding?
The second is the State of the Art in NLP,
which stands for
natural language processing.
Finally, we're going to cover the relation
between natural language processing and
text retrieval.
First, what is NLP?
Well, the best way to
explain it is to think about
what happens if you see a text in a foreign
language that you can't understand.
What would you have to do in
order to understand that text?
This is basically what
computers are facing.
Right?
So, looking at a simple sentence like,
a dog is chasing a boy on the playground.
We don't have any problems
understanding this sentence, but
imagine what the computer would have
to do in order to understand it.
In general,
it would have to do the following.
First, it would have to know that dog is
a noun, chasing is a verb, et cetera.
So, this is called lexical analysis, or
part-of-speech tagging.
We need to figure out
the syntactic categories of those words.
So, that's a first step.
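As a toy illustration of this first step, here is a minimal lookup-based tagger. The lexicon and the tag names are assumptions made up for this one sentence. Real taggers are statistical and use context, which a plain lookup like this cannot do.

```python
# Hypothetical hand-written lexicon covering only the example sentence.
LEXICON = {
    "a": "DET", "the": "DET",
    "dog": "NOUN", "boy": "NOUN", "playground": "NOUN",
    "is": "AUX", "chasing": "VERB",
    "on": "PREP",
}

def tag(sentence: str) -> list:
    """Assign each word its category from the lexicon, or UNK if it is unknown."""
    return [(word, LEXICON.get(word.lower(), "UNK")) for word in sentence.split()]

tagged = tag("A dog is chasing a boy on the playground")
for word, category in tagged:
    print(f"{word}\t{category}")
```

A word such as design, which can be a noun or a verb, shows why a single-entry lexicon is not enough in general.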
After that, we're going to figure
out the structure of the sentence.
So for example, here it shows that a and
dog would go together
to form a noun phrase.
And we won't have dog and
is go together first, right.
Those are some structures
that are just not right.
But, this structure shows what we might
get if we look at the sentence and
try to interpret the sentence.
Some words would go together first, and
then they will go together
with other words.
So here, we show we have noun phrases
as intermediate components and
then verb phrases.
Finally, we have a sentence.
To get this structure, we need to
do something called syntactic analysis,
or parsing.
And, we may have a parser,
a computer program that would
automatically create this structure.
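The structure a parser produces can be written down as a nested tree. The sketch below hard-codes one plausible tree for the example sentence. The labels, and the choice to attach the prepositional phrase to the verb phrase, are my assumptions and not the output of an actual parser.

```python
# Each node is (label, child, child, ...); a leaf is (label, word).
tree = (
    "S",
    ("NP", ("DET", "a"), ("NOUN", "dog")),
    ("VP",
        ("VERB_COMPLEX", ("AUX", "is"), ("VERB", "chasing")),
        ("NP", ("DET", "a"), ("NOUN", "boy")),
        ("PP", ("PREP", "on"),
            ("NP", ("DET", "the"), ("NOUN", "playground")))),
)

def leaves(node) -> list:
    """Return the words under a node, left to right."""
    label, *children = node
    if len(children) == 1 and isinstance(children[0], str):
        return [children[0]]
    return [word for child in children for word in leaves(child)]

def phrases(node, label: str) -> list:
    """Collect the word sequence of every subtree with the given label."""
    node_label, *children = node
    if isinstance(children[0], str):
        return []
    found = [" ".join(leaves(node))] if node_label == label else []
    for child in children:
        found.extend(phrases(child, label))
    return found

print(" ".join(leaves(tree)))
print(phrases(tree, "NP"))
```

Reading off the noun phrases shows the intermediate components mentioned above: words group into phrases first, and the phrases then combine into the sentence.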
At this point, you would know
the structure of this sentence, but
still you don't know
the meaning of the sentence.
So, we have to go further
through semantic analysis.
In our mind,
we usually can map such a sentence to what
we already know in our knowledge base.
And for example, you might imagine
a dog that looks like that,
there's a boy and
there's some activity here.
But a computer
would have to use symbols to denote that.
All right.
So, we would use the symbol
d1 to denote a dog,
b1 to denote a boy, and p1
to denote the playground.
Now, there is also a chasing
activity that's happening here, so
we have the relation chasing here,
that connects all these symbols.
So, this is how a computer would obtain
some understanding of this sentence.
Now from this representation, we could
also further infer some other things,
and we might indeed, naturally think
of something else when we read text.
And this is called inference.
So for example, you might believe
that if someone's being chased, then
this person might be scared.
All right.
With this rule,
you can see computers could also
infer that this boy may be scared.
So, this is some extra knowledge
that you would infer based on
some understanding of the text.
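The symbolic representation and the inference rule can be sketched like this. The tuple layout and the rule function are assumptions made for illustration. The symbols d1, b1, and p1 and the chasing relation come from the lecture.

```python
# Entities named by symbols, as in the lecture.
entities = {"d1": "Dog", "b1": "Boy", "p1": "Playground"}

# One fact: (relation, chaser, chased, location).
facts = {("Chasing", "d1", "b1", "p1")}

def infer_scared(facts: set) -> set:
    """Rule: if X is chasing Y, then Y may be scared."""
    return {("Scared", fact[2]) for fact in facts if fact[0] == "Chasing"}

inferred = infer_scared(facts)
for relation, who in inferred:
    print(f"{relation}({who}) where {who} is a {entities[who]}")
```

The inferred fact is not stated in the sentence. It comes from combining the extracted relation with a piece of background knowledge, which is exactly the part that is hard to supply to a computer in general.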
You can even go further to understand
why the person said this sentence.
So, this has to do with
the use of language.
All right.
This is called pragmatic analysis,
which tries to understand the speech
act of a sentence. All right,
we say something to
basically achieve some goal.
There's some purpose there and
this has to do with the use of language.
In this case, the person who said
the sentence might be reminding
another person to bring back the dog.
That could be one possible intent.
To reach this level of understanding,
we would require all these steps.
And, a computer would have to go
through all these steps in order to
completely understand this sentence.
Yet, we humans have no
trouble understanding that.
We instantly get everything,
and there is a reason for that.
That's because we have a large
knowledge base in our brain, and
we use common sense knowledge
to help interpret the sentence.
For computers, unfortunately,
it is hard to obtain such understanding.
They don't have such a knowledge base.
They are still incapable of doing
reasoning under uncertainty.
So, that makes natural language
processing difficult for computers.
But the fundamental reason why
natural language processing is difficult
for computers is simply that natural
language has not been designed for
computers.
Natural languages
are designed for us to communicate.
There are other languages designed for
computers.
For example, programming languages.
Those are harder for us, right.
So, natural languages are designed to
make our communication efficient.
As a result,
we omit a lot of common sense knowledge
because we assume everyone
knows about that.
We also keep a lot of ambiguities
because we assume the receiver, or
the hearer, would know how to
disambiguate an ambiguous word
based on the knowledge or the context.
There's no need to invent a different
word for different meanings.
We could overload the same word with
different meanings without a problem.
For these reasons,
every step in natural language
processing is difficult for computers.
Ambiguity's the main difficulty, and
common sense reasoning is often required,
that's also hard.
So, let me give you some
examples of challenges here.
Consider the word-level ambiguities.
The same word can have different
syntactical categories.
For example,
design can be a noun or a verb.
The word root may have multiple meanings.
So, square root in math sense,
or the root of a plant.
You might be able to
think of other meanings.
There are also syntactical ambiguities.
For example, the main topic of this
lecture, natural language processing,
can actually be interpreted in two ways,
in terms of the structure.
Think for a moment and
see if you can figure that out.
We usually think of this as
processing of natural languages, but
you could also think of this as saying
that language processing is natural.
Right.
So, this is an example of syntactic ambiguity,
Where we have different
structures that can be
applied to the same sequence of words.
Another example of ambiguous
sentence is the following,
a man saw a boy with a telescope.
Now, in this case, the question is,
who had the telescope?
All right, this is called a prepositional
phrase attachment ambiguity,
or PP attachment ambiguity.
Now, we generally don't have a problem
with these ambiguities because we have
a lot of background knowledge to
help us disambiguate.
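The two readings of the telescope sentence can be written as two different trees over the same words. This is a hand-built sketch with simplified labels of my own choosing, not parser output. It only shows that the same word sequence supports two structures.

```python
# Reading 1: the PP attaches to the verb phrase (the man used the telescope).
verb_attachment = (
    "S", ("NP", "a man"),
    ("VP", ("V", "saw"), ("NP", "a boy"), ("PP", "with a telescope")),
)

# Reading 2: the PP attaches to the noun phrase (the boy had the telescope).
noun_attachment = (
    "S", ("NP", "a man"),
    ("VP", ("V", "saw"),
        ("NP", ("NP", "a boy"), ("PP", "with a telescope"))),
)

def words(node) -> list:
    """Flatten a tree back into its word sequence."""
    label, *children = node
    out = []
    for child in children:
        out.extend(child.split() if isinstance(child, str) else words(child))
    return out

def pp_parent(node):
    """Return the label of the node that directly contains the PP, if any."""
    label, *children = node
    for child in children:
        if isinstance(child, str):
            continue
        if child[0] == "PP":
            return label
        found = pp_parent(child)
        if found:
            return found
    return None

print(pp_parent(verb_attachment), pp_parent(noun_attachment))
```

Both trees flatten to the identical sentence, so nothing in the words themselves tells a computer which attachment is intended.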
Another example of difficulty
is anaphora resolution.
So, think about a sentence like John
persuaded Bill to buy a TV for himself.
The question here is,
does himself refer to John or Bill?
So again, this is something that
you have to use some background knowledge or
the context to figure out.
Finally, presupposition
is another problem.
Consider the sentence,
he has quit smoking.
Now this obviously
implies he smoked before.
So, imagine a computer wants to understand
all these subtle differences in meaning.
It would have to use a lot of
knowledge to figure that out.
It also would have to maintain a large
knowledge base of all the meanings of
words and how they are connected to our
common sense knowledge of the world.
So this is why it's very difficult.
So as a result we are still not perfect.
In fact, far from perfect in understanding
natural languages using computers.
So this slide sort of gives a simplified
view of state of the art technologies.
We can do part of speech
tagging pretty well.
So, I showed 97% accuracy here.
Now this number is obviously
based on a certain data set, so
don't take this literally.
All right, this just shows that
we could do it pretty well.
But it's still not perfect.
In terms of parsing,
we can do partial parsing pretty well.
That means we can get noun phrase
structures or verb phrase structure, or
some segment of the sentence understood
correctly in terms of the structure.
And, in some evaluation
results we have seen about 90%
accuracy in terms of partial
parsing of sentences.
Again, I have to say, these numbers
are relative to the data set.
In some other data sets,
the numbers might be lower.
Most of the existing work has been
evaluated using news data sets.
And so, a lot of these numbers are more or
less biased towards news data.
Think about social media data.
The accuracy likely is lower.
In terms of semantic analysis,
we are far from being able to do
a complete understanding of a sentence.
But we have some techniques
that would allow us to do
partial understanding of the sentence.
So, I could mention some of them.
For example, we have techniques that can
allow us to extract the entities and
relations mentioned in text or articles.
For example, recognizing
the mentions of people, locations,
organizations, et cetera in text.
Right?
So this is called entity extraction.
We may be able to recognize the relations.
For example,
this person visited that place.
Or, this person met that person, or
this company acquired another company.
Such relations can be extracted
by using current
natural language processing techniques.
They are not perfect, but
they can do well for some entities.
Some entities are harder than others.
We can also do word sense
disambiguation to some extent.
We can figure out whether this word in
this sentence would have a certain meaning,
and in another context,
the computer could figure out
that it has a different meaning.
Again, it's not perfect but
you can do something in that direction.
We can also do sentiment analysis meaning
to figure out whether sentence
is positive or negative.
This is especially useful for
review analysis, for example.
So these are examples of semantic analysis.
And they help us to obtain partial
understanding of the sentences.
Right?
It's not
giving us a complete understanding as
I showed before for the sentence, but
it will still help us gain understanding
of the content and these can be useful.
In terms of inference,
we are not yet there,
probably because of the general difficulty
of inference and uncertainties.
This is a general challenge
in artificial intelligence.
That's probably also because we don't have
complete semantic representation for
natural language text.
So this is hard.
Yet in some domains, perhaps in
limited domains where you have a lot of
restrictions on the word uses,
you may be able to perform
inference to some extent, but in general
we cannot really do that reliably.
Speech act analysis is also
far from being done, and
we can only do that analysis for
very special cases.
So, this roughly gives you some
idea about the state of the art.
And let me also talk a little
bit about what we can't do.
And so we can't even do
100% part of speech tagging.
This looks like a simple task,
but think about the example here,
the two uses of off may have different
syntactic categories if you try
to make fine-grained distinctions.
It's not that easy to figure
out such differences.
It's also hard to do general,
complete parsing.
And, again, the same sentence
that you saw before is an example.
This ambiguity can be
very hard to resolve.
And you can imagine examples where you
have to use a lot of knowledge
from the context of the sentence or
from the background in order to figure
out who actually had the telescope.
So although the sentence looks very
simple, it actually is pretty hard.
And in cases when the sentence is
very long, imagine it has four or
five prepositional phrases, then there
are even more possibilities to figure out.
It's also hard to do precise
deep semantic analysis.
So here's an example.
In this sentence, John owns a restaurant,
how do we define owns exactly?
The word, own, you know,
is something that we can understand but
it's very hard to precisely describe
the meaning of own for computers.
So as a result we have robust and
general natural language processing
techniques that can process a lot of text
data in a shallow way,
meaning we only do superficial analysis.
For example, part of
speech tagging, or
partial parsing, or recognizing sentiment.
And those are not deep understanding
because we're not really
understanding the exact
meaning of the sentence.
On the other hand, the deep understanding
techniques tend not to scale up well,
meaning that they would fail
on some unrestricted text.
And if you don't restrict
the text domain or
the use of words, then these
techniques tend not to work well.
They may work well, based on machine
learning techniques on the data
that are similar to the training data
that the program has been trained on.
But they generally wouldn't work well on
the data that are very different from
the training data.
So this pretty much summarizes the state
of the art of natural language processing.
Of course, within such a short amount
of time, we can't really give you
a complete view of NLP, which is a
big field, and I'd expect you
to see multiple courses on the natural
language processing topic itself.
But because of its relevance to the
topic that we talk about, it's useful for
you to know the background in case
you haven't been exposed to that.
So, what does that mean for
text retrieval?
Well, in text retrieval we
are dealing with all kinds of text.
It's very hard to restrict
the text to a certain domain.
And we also are often dealing with
a lot of text data, so that means
the NLP techniques must be general,
robust, and efficient. And that
just implies today we can only use fairly
shallow NLP techniques for text retrieval.
In fact,
most search engines today use something
called a bag of words representation.
Now this is probably the simplest
representation you can possibly think of.
That is to turn text data
into simply a bag of words.
Meaning we will keep the individual words
but we'll ignore all the orders of words.
And we'll keep duplicated
occurrences of words.
So this is called a bag
of words representation.
When you represent the text in this way,
you ignore a lot of information,
and that just makes it harder
to understand the exact meaning of
a sentence because we've lost the order.
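To make the bag of words idea concrete, here is a minimal sketch in Python. The function name `bag_of_words` and the whitespace tokenization are assumptions made for illustration; real search engines tokenize text much more carefully.

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, split on whitespace, and count each word.

    Word order is discarded, but repeated occurrences are kept as counts.
    """
    return Counter(text.lower().split())

# Two sentences with different meanings but the same words.
a = bag_of_words("a man saw a boy")
b = bag_of_words("a boy saw a man")

print(a["a"])   # the duplicate occurrences of "a" are kept as a count
print(a == b)   # the two bags are identical because the order is lost
```

The two sentences say different things about who saw whom, yet their bags are equal, which is exactly the information this representation gives up.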
But yet, this representation tends
to actually work pretty well for
most search tasks.
And this is partly because the search
task is not all that difficult.
If you see matching of some of the query
words in a text document, chances
are that that document is about the topic,
although there are exceptions, right?
So in comparison, some other tasks, for
example machine translation, would require
you to understand the language accurately,
otherwise the translation would be wrong.
So in comparison,
search tasks are relatively easy, and
such a representation is often sufficient.
And that's also the representation
that the major search engines today,
like Google or Bing are using.
Of course, I put in parentheses "but
not all."
Of course there are many queries that are
not answered well by the current search
engines, and
they do require a representation
that would go beyond bag
of words representation.
That would require more natural
language processing to be done.
There is another reason why we have
not used the sophisticated NLP
techniques in modern search engines, and
that's because some retrieval techniques
actually naturally solve
the problem of NLP.
So, one example,
is word sense disambiguation.
Think about a word like java.
It could mean coffee or
it could mean a programming language.
If you look at the word
alone it would be ambiguous.
But when the user uses the word in
the query, usually there are other words.
For example, I'm looking for
usage of Java applet.
When I have applet there, that
implies Java means programming language.
And that context can help us naturally
prefer documents where Java is
referring to the programming language,
because those documents would
probably match applet as well.
If java occurs in the document
in a way that means coffee,
then you would never match applet,
or with very small probability.
Right.
So this is a case when some retrieval
techniques naturally achieve the goal
of word sense disambiguation.
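The Java example can be sketched in a few lines of Python. The scoring function `match_count` and the two toy documents are assumptions for illustration only; this is plain word matching, not the ranking function of any real search engine.

```python
def match_count(query, document):
    """Count how many distinct query words appear in the document."""
    doc_words = set(document.lower().split())
    return sum(1 for w in set(query.lower().split()) if w in doc_words)

# Two hypothetical documents that use "java" in different senses.
docs = {
    "programming": "a java applet runs inside the browser",
    "coffee": "java is a rich dark coffee from indonesia",
}

query = "java applet"
# Rank document names by how many query words they match.
ranked = sorted(docs, key=lambda name: match_count(query, docs[name]), reverse=True)
print(ranked)
```

The coffee document matches only "java", while the programming document matches both query words, so simple matching already prefers the intended sense without any explicit disambiguation step.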
Another example is some technique called
feedback which we will talk about
later in some of the lectures.
This technique would allow us
to add additional words to the query.
And those additional words could
be related to the query words.
And these words can help match documents
where the original query words
have not occurred.
So this achieves, to some extent,
semantic matching of terms.
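As a rough sketch of the idea only, the following Python adds frequent words from a few documents assumed to be relevant to the original query. The function `expand_query`, its parameters, and the toy documents are all hypothetical; the feedback methods covered later in the course may work quite differently, and this sketch does not even remove common function words.

```python
from collections import Counter

def expand_query(query, feedback_docs, num_new_terms=2):
    """Add the most frequent non-query words from the feedback documents."""
    query_words = query.lower().split()
    counts = Counter()
    for doc in feedback_docs:
        for w in doc.lower().split():
            if w not in query_words:
                counts[w] += 1
    new_terms = [w for w, _ in counts.most_common(num_new_terms)]
    return query_words + new_terms

# Hypothetical documents judged (or assumed) relevant to the query "car".
feedback = [
    "car repair and automobile service",
    "automobile repair for your car",
]
expanded = expand_query("car", feedback)
print(expanded)

# A document that never mentions "car" now shares words with the query.
target = "automobile repair tips"
print(set(expanded) & set(target.split()))
```

The expanded query picks up "repair" and "automobile", so it can match a document in which the original query word never occurs.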
So those techniques also helped us
bypass some of the difficulties
in natural language processing.
However, in the long run, we still need
deeper natural language processing
techniques in order to improve the
accuracy of the current search engines.
And it's particularly needed for complex
search tasks, or for question answering.
Google has recently
launched a knowledge graph.
And this is one step toward that goal,
because knowledge graph would contain
entities and their relations.
And this goes beyond the simple
bag of words representation.
And such technique should help us improve
the search engine utility significantly,
although this is still an open topic for
research and exploration.
In sum, in this lecture we talked
about what NLP is, and we've talked
about the state of the art techniques,
what we can do, what we cannot do.
And finally, we also explained
why bag of words representation
remains the dominant representation used
in modern search engines even though
deeper NLP would be needed for
future search engines.
If you want to know more you can take
a look at some additional readings.
I only cited one here.
And that's a good starting point though.
Thanks.
[MUSIC]

[SOUND] In this lecture,
we're going to talk about text access.
In the previous lecture, we talked
about natural language content analysis.
We explained that the state of the art
natural language processing techniques
are still not good enough to process
a lot of unrestricted text data
in a robust manner.
As a result, bag of words representation
remains very popular in
applications like search engines.
In this lecture we're going to talk
about some high level strategies
to help users get access to the text data.
This is also an important step to
convert raw, big text data into small
relevant data that are actually
needed in a specific application.
So the main question we address here is,
how can a text information system
help users get access to
the relevant text data?
We're going to cover two complementary
strategies, push vs pull.
And then we're going to talk about
two ways to implement the pull mode,
querying vs browsing.
So first, push vs pull.
These are two different ways to connect
users with the right information
at the right time.
The difference is which
party takes the initiative.
In the pull mode,
the users would take the initiative to
start the information access process.
And in this case, a user typically would
use a search engine to fulfill the goal.
For example,
the user may type in a query, and
then browse results to find
the relevant information.
So this is usually appropriate for
satisfying a user's ad
hoc information need.
An ad hoc information need is
a temporary information need.
For example, you want to buy a product so
you suddenly have a need to read
reviews about related products.
But after you have collected information,
you have purchased your product, you
generally no longer need such information.
So it's a temporary information need.
In such a case, it's very hard for
a system to predict your need, and
it's more appropriate for
the users to take the initiative.
And that's why search engines
are very useful today,
because many people have many ad
hoc information needs all the time.
So as we are speaking Google probably is
processing many queries from users, and
those are all, or
mostly ad hoc information needs.
So this is a pull mode.
In contrast, in the push mode
the system will take the initiative
to push the information to the user or
to recommend the information to the user.
So in this case, this is usually
supported by a recommender system.
Now this would be appropriate if
the user has a stable information need.
For example, you may have a research
interest in some topic, and
that interest tends to stay for
a while, so it's relatively stable.
Your hobby is another example
of a stable information need.
In such a case, the system can interact
with you and can learn your interest, and
then can monitor the information stream.
If the system has seen any
relevant items to your interest,
the system could then take the initiative
to recommend information to you.
So for example, a news filter or
news recommender system could
monitor the news stream and
identify interesting news for you, and
simply push the news articles to you.
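A minimal sketch of the push mode, assuming the user's stable interest is given as a fixed list of words. The function `push_filter`, the `min_matches` parameter, and the toy news stream are hypothetical; an actual recommender system would learn the interest profile from interaction with the user.

```python
def push_filter(stream, interest_words, min_matches=1):
    """Yield items that share at least min_matches words with the interests."""
    interests = set(w.lower() for w in interest_words)
    for item in stream:
        overlap = interests & set(item.lower().split())
        if len(overlap) >= min_matches:
            yield item

# Hypothetical stream of incoming news headlines.
news_stream = [
    "new search engine ranking method announced",
    "local team wins the championship",
    "advances in text retrieval research",
]

# The system, not the user, initiates delivery of the matching items.
pushed = list(push_filter(news_stream, ["retrieval", "search"]))
print(pushed)
```

The user never issues a query here; the system watches the stream and decides on its own when an item is worth delivering.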
This mode of information access
may be also appropriate when
the system has a good knowledge
about the user's need.
And this happens in the search context.
So for example, when you search for
information on the web a search
engine might infer you might be also
interested in some related information.
And they would recommend
the information to you.
So that should remind you of, for example,
advertisements placed on a search page.
So this is about the two high level
strategies or two modes of text access.
Now let's look at the pull
mode in more detail.
In the pull mode, we can further
distinguish two ways to help users,
querying vs browsing.
In querying, a user would just enter
a query, typically a keyword query, and
the search engine system would
return relevant documents to users.
And this works well when
the user knows exactly what
the keywords to be used are.
So if you know exactly
what you're looking for
you tend to know the right keywords,
and then query would work very well.
And we do that all the time.
But we also know that
sometimes it doesn't work so
well, when you don't know the right
keywords to use in the query or
you want to browse information
in some topic area.
In this case browsing
would be more useful.
So in this case in the case of browsing
the users would simply navigate
into the relevant information
by following the path that's
supported by the structures
on the documents.
So the system would maintain
some kind of structures, and
then the user could follow
these structures to navigate.
So this strategy works well when the user
wants to explore information space or
the user doesn't know what
are the keywords to use in the query.
Or simply because the user finds it
inconvenient to type in the query.
So even if a user knows what query to type
in, if the user is using a cell phone
to search for information,
then it's still hard to enter the query.
In such a case again,
browsing tends to be more convenient.
The relationship between browsing and
querying is best understood by
making an analogy to sightseeing.
Imagine if you are touring a city.
Now if you know the exact address
of an attraction, then taking a taxi
there is perhaps the fastest way,
you can go directly to the site.
But if you don't know the exact address,
you may need to walk around, or
you can take a taxi to a nearby place,
and then walk around.
It turns out that we do exactly
the same in the information space.
If you know exactly what you
are looking for, then you can
use the right keywords in your query
to find the information directly.
That's usually the fastest
way to find information.
But what if you don't know
the exact keywords to use?
Well, your query probably won't work so
well, and you will land on some related
pages, and then you need to also walk
around in the information space.
Meaning by following the links or
by browsing,
you can then finally get
into the relevant page.
If you want to learn about a topic, again,
you will likely do a lot of browsing.
It's just like you are looking
around in some area and
you want to see some interesting
attractions in
the same region.
So this analogy also tells us that
today we have very good support for
querying, but we don't really
have good support for browsing.
And this is because in order to browse
effectively, we need a map to guide us,
just like you need a map of Chicago
to tour the city of Chicago.
You need a topical map to
tour the information space.
So how to construct such a topical
map is in fact a very interesting
research question that
likely will bring us
more interesting browsing experience
on the web or in other applications.
So to summarize this lecture, we have
talked about two high level strategies for
text access, push and pull.
Push tends to be supported
by a recommender system and
pull tends to be supported
by a search engine.
Of course, in a sophisticated
intelligent information system,
we should combine the two.
In the pull mode we have further
distinguished querying and browsing.
Again, we generally want to combine
the two ways to help users so
that you can support both querying and
browsing.
If you want to know more about
the relationship between pull and
push, you can read this article.
This gives an excellent discussion of the
relationship between information filtering
and information retrieval.
Here information filtering is similar
to information recommendation,
or the push mode of information access.
[MUSIC]

[SOUND] This lecture is about
the text retrieval problem.
This picture shows our overall plan for
lectures.
In the last lecture, we talked about
the high level strategies for text access.
We talked about push versus pull.
Search engines are the main tools for
supporting the pull mode.
Starting from this lecture,
we're going to talk about how
search engines work in detail.
So first,
it's about the text retrieval problem.
We're going to talk about
three things in this lecture.
First, we'll define text retrieval.
Second, we're going to make
a comparison between text retrieval and
the related task, database retrieval.
Finally, we're going to talk about the
document selection versus document ranking
as two strategies for
responding to a user's query.
So what is text retrieval?
It should be a task that's familiar
to most of us because we're using web
search engines all the time.
So text retrieval is basically a task
where the system would respond
to a user's query with relevant documents,
basically
supporting querying as one way to implement
the pull mode of information access.
So the scenario's the following.
You have a collection of text documents.
These documents could be all
the web pages on the web.
Or all the literature articles
in the digital library or
maybe all the text files in your computer.
A user will typically give a query to the
system to express the information need.
And then the system would return
relevant documents to users.
Relevant documents refer to those
documents that are useful to the user who
typed in the query.
Now this task is often
called information retrieval.
But literally, information retrieval
would broadly include the retrieval of
other non-textual information as well.
For example, audio, video, et cetera.
It's worth noting that text
retrieval is at the core of
information retrieval in the sense
that other media such as
video can be retrieved by
exploiting the companion text data.
So for example,
current image search engines actually
match a user's query with
the companion text data of the image.
This problem is also called
the search problem,
and the technology is often called
search technology in industry.
If you have ever taken a course in databases,
it'll be useful to pause
the lecture at this point and
think about the differences between
text retrieval and database retrieval.
Now these two tasks
are similar in many ways.
But there are some important differences.
So, spend a moment to think about
the differences between the two.
Think about the data and information
managed by a search engine versus
those that are
managed by a database system.
Think about the difference between
the queries that you typically specify for
a database system versus the queries that
are typed in by users on the search engine.
And then finally think about the answers.
What's the difference between the two?
Okay, so
if we think about the information or
data managed by the two systems,
we will see that in text retrieval,
the data is unstructured, it's free text.
But in databases, they are structured
data, where there is a clearly defined
schema to tell you this column
is the names of people and
that column is ages, et cetera.
In unstructured text,
it's not obvious what are the names
of people mentioned in the text.
Because of this difference,
we can also see that text information
tends to be more ambiguous.
And we talked about that in the natural
language processing lecture.
Whereas in databases, the data tend
to have well-defined semantics.
There is also important
difference in the queries, and
this is partly due to the difference
in the information, or data.
So text queries tend to be ambiguous,
whereas in database search,
the queries are typically well-defined.
Think about the SQL query
that would clearly specify
what records are to be returned.
So it has very well defined semantics.
Keyword queries or natural language
queries tend to be incomplete
also, in that they don't really
fully specify what documents
should be retrieved.
Whereas, in the database search,
the SQL query
can be regarded as a complete
specification for what should be returned.
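To illustrate what a complete specification looks like, here is a small sketch using Python's built-in `sqlite3` module. The table `people`, its columns, and the rows are made up for illustration, following the names-and-ages schema mentioned above.

```python
import sqlite3

# An in-memory database with a clearly defined schema: names and ages.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [("John", 42), ("Bill", 29), ("Mary", 35)],
)

# The SQL query states exactly which records qualify: nothing is left
# for the system to guess, so the correct answer is fully determined.
rows = conn.execute(
    "SELECT name FROM people WHERE age > 30 ORDER BY name"
).fetchall()
print(rows)
conn.close()
```

By contrast, a keyword query such as "people over 30" typed into a search engine does not determine a unique correct set of documents, which is why the answers in text retrieval have to be judged by relevance.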
And because of these differences,
the answers would be also different.
In the case of text retrieval,
we're looking for relevant documents.
In the database search,
we are retrieving records or
matched records with the SQL query,
more precisely.
Now in the case of text retrieval,
what should be the right answers to
a query is not very well specified,
as we just discussed.
So it's unclear what should be
the right answers to a query.
And this has very important consequences,
and
that is text retrieval is
an empirically defined problem.
And so this is a problem because
if it's empirically defined,
then we cannot mathematically prove one
method is better than another method.
That also means we must rely
on empirical evaluation
involving users to know
which method works better.
And that's why we have one lecture,
actually more than one lecture,
to cover the issue of evaluation.
Because this is a very important topic for
search engines.
Without knowing how to evaluate
an algorithm appropriately,
there's no way to tell whether we
have got the better algorithm or
whether one system is better than another.
So now let's look at
the problem in a formal way.
So this slide shows a formal formulation
of the text retrieval problem.
First, we have our vocabulary set which
is just a set of words in a language.
Now here,
we're considering just one language, but
in reality on the web there might
be multiple natural languages.
We have text that are in
all kinds of languages.
But here for simplicity, we just
assume there is one kind of language.
The techniques used for
retrieving data from multiple languages
are more or
less similar to the techniques used for
retrieving documents in one language.
Although there are important differences,
the principles and
methods are very similar.
Next we have the query,
which is a sequence of words.
And so here you can see the query
is defined as a sequence of words.
Each q sub i is a word in the vocabulary.
A document is defined in the same way.
So it's also a sequence of words.
And here,
d sub ij is also a word in the vocabulary.
Now typically, the documents
are much longer than queries.
But there are also cases where
the documents may be very short.
So you can think about
what might be an example of that case.
I hope you can think of
Twitter search, all right?
Tweets are very short.
But in general,
documents are longer than the queries.
Now, then we have
a collection of documents.
And this collection can be very large.
So think about the web.
It could be very large.
And then the goal of text retrieval is
to find the set of relevant documents,
which we denote by R of q,
because it depends on the query.
And this is, in general, a subset of
all the documents in the collection.
Unfortunately, this set of relevant
documents is generally unknown,
and user-dependent in the sense that for
the same query typed
in by different users, the expected
relevant documents may be different.
The query given to us by
the user is only a hint
on which document should be in this set.
And indeed, the user is generally unable
to specify what exactly should be in
the set, especially in the case of a web
search where the collection is so large.
The user doesn't have complete
knowledge about the whole collection.
So, the best a search system can do is
to compute an approximation of
this relevant document set.
So we denote it by R prime of q.
So, formally, we can see the task
is to compute this R prime of q,
an approximation of
the relevant documents.
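The formulation described so far can be written out compactly as follows. The symbols q_i, d_ij, R(q), and R'(q) come from the lecture; the names V and C and the size symbols N, m, m_i, and M are assumptions chosen here for readability.

```latex
\begin{align*}
V &= \{w_1, \dots, w_N\} && \text{vocabulary} \\
q &= q_1, \dots, q_m, \quad q_i \in V && \text{query} \\
d_i &= d_{i1}, \dots, d_{i m_i}, \quad d_{ij} \in V && \text{document} \\
C &= \{d_1, \dots, d_M\} && \text{collection} \\
R(q) &\subseteq C && \text{relevant documents (unknown, user-dependent)} \\
R'(q) &\approx R(q) && \text{approximation computed by the system}
\end{align*}
```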
So how can we do that?
Now, imagine if you are now asked
to write a program to do this.
What would you do?
Now think for a moment.
Right, so these are your inputs:
the query, the documents.
And then you are to compute
the answers to this query,
which is a set of documents that
would be useful to the user.
So how would you solve the problem?
In general there are two
strategies that we can use.
All right, the first strategy
is to do document selection.
And that is, we're going to have
a binary classification function, or
binary classifier.
That's a function that
will take a document and
query as input, and
then give a zero or one as output,
to indicate whether this document
is relevant to the query, or not.
So in this case, you can see
the relevant document
set is defined as follows.
It's basically all the documents that
have a value of one by this function.
And so in this case,
you can see the system must decide
if a document is relevant or not.
Basically, it has to say
whether it's one or zero.
And this is called absolute relevance.
Basically, it needs to know exactly
whether it's going to be useful
to the user.
Alternatively, there's another
strategy called document ranking.
Now in this case,
the system is not going to make a call
whether a document is relevant or not.
Rather, the system's going to
use a real-valued function, f,
here that would simply give us a value.
That would indicate which
document is more likely relevant.
So it's not going to make a call whether
this document is relevant or not,
but rather it would say which
document is more likely relevant.
So this function then can be
used to rank the documents.
And then we're going to let the user
decide where to stop when the user looks
at the documents.
So we have a threshold,
theta, here to determine
what documents should be
in this approximation set.
And we're going to assume that all
the documents that are ranked above this
threshold are in the set.
Because in effect, these are the documents
that we delivered to the user.
And theta is a cutoff
determined by the user.
So here we've got some collaboration from
the user in some sense because we
don't really make a cutoff, and
the user kind of helped
the system make a cutoff.
So in this case, the system only needs
to decide if one document is more likely
relevant than another.
And that is, it only needs to
determine relative relevance as
opposed to absolute relevance.
Now you can probably already
sense that
relative relevance would be easier
to determine than absolute relevance.
Because in the first case,
we have to say exactly whether
a document is relevant or not, right?
And it turns out that ranking is indeed
generally preferred to document selection.
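The two strategies can be contrasted in a short Python sketch. The functions `score`, `is_relevant`, `select`, and `rank`, and the word-overlap scoring itself, are assumptions for illustration; the lecture does not specify any particular f at this point.

```python
def score(query, document):
    """Hypothetical real-valued f(d, q): fraction of query words in the document."""
    q_words = set(query.lower().split())
    d_words = set(document.lower().split())
    if not q_words:
        return 0.0
    return len(q_words & d_words) / len(q_words)

def is_relevant(query, document):
    """Hypothetical binary f(d, q): 1 only if every query word is present."""
    return 1 if score(query, document) == 1.0 else 0

def select(query, collection):
    """Document selection: keep the documents the classifier labels 1."""
    return [d for d in collection if is_relevant(query, d) == 1]

def rank(query, collection):
    """Document ranking: order all documents by descending score."""
    return sorted(collection, key=lambda d: score(query, d), reverse=True)

collection = [
    "database retrieval with sql",
    "cooking recipes",
    "text retrieval and search engines",
]
query = "text retrieval"

selected = select(query, collection)   # absolute relevance: a yes/no decision
ranked = rank(query, collection)       # relative relevance: an ordering
top_k = ranked[:2]                     # the user decides where to stop
print(selected)
print(top_k)
```

Selection has to commit to a yes or no for every document, while ranking only has to put the documents in a sensible order and leaves the cutoff to the user.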
So let's look at these two
strategies in more detail.
So this picture shows how it works.
So on the left side,
we see these documents.
And we use the pluses to
indicate the relevant documents.
So we can see the
set of true relevant documents
consists of these pluses, these documents.
And with the document selection function,
we can
basically classify them into two groups,
relevant documents and non-relevant ones.
Of course, the classifier will not be
perfect, so it will make mistakes.
So here we can see in the approximation of
the relevant documents we have
got some non-relevant documents.
And similarly, there's a relevant document
that's misclassified as non-relevant.
In the case of document ranking,
we can see the system
simply ranks all the documents in
the descending order of the scores.
And we're going to let the user stop
wherever the user wants to stop.
So if a user wants to examine more
documents, then the user will
go down the list to examine more and
stop at the lower position.
But if the user only wants to
read a few relevant documents,
the user might stop at the top position.
So in this case,
the user stops at d4, so in effect,
we have delivered these
four documents to our user.
So as I said,
ranking is generally preferred.
And one of the reasons is
because the classifier,
in the case of document selection,
is unlikely to be accurate.
Why?
Because the only clue is usually
the query.
But the query may not be accurate, in the
sense that it could be overly constrained.
For example, you might expect the relevant
documents to talk about all these
topics by using specific vocabulary,
and as a result,
you might match no relevant documents,
because in the collection,
no authors have discussed the topic
using this vocabulary.
All right.
So in this case,
we'll see there is this problem of
no relevant documents to return in
the case of overly constrained query.
On the other hand, if the query is
under-constrained, for example,
if the query does not have sufficient
discriminating words to
find the relevant documents,
you may actually end up having
over-delivery.
And this is when you thought these
words might be sufficient to
help you find the relevant documents, but
it turns out that they're not sufficient.
And there are many distracting
documents using similar words.
And so this is the case of over delivery.
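The two failure cases can be reproduced with a toy selection function. Everything below is a hypothetical illustration: `matches_all` requires every query word to be present, and the three documents are made up.

```python
def matches_all(query, document):
    """Strict selection: every query word must appear in the document."""
    return set(query.lower().split()) <= set(document.lower().split())

collection = [
    "ranking functions for web search",
    "evaluating a search engine",
    "a search for the missing hiker",
]

# Over-constrained: no single document uses all of this vocabulary.
over = "probabilistic ranking search engine evaluation"
print([d for d in collection if matches_all(over, d)])

# Under-constrained: one common word also matches a distracting document.
under = "search"
print([d for d in collection if matches_all(under, d)])
```

The long query returns nothing even though two documents are on topic, and the one-word query returns everything, including the document about the missing hiker.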
Unfortunately, it's very hard to find the
right position between these two extremes.
Why?
Because when the user is looking for
the information in general,
the user does not have a good knowledge
about the information to be found.
And in that case, the user does
not have a good knowledge about
what vocabulary will be used
in those relevant documents.
So it's very hard for
a user to pre-specify
the right level of constraints.
Even if the classifier is accurate,
we also still want to rank these
relevant documents because they
are generally not equally relevant.
Relevance is often a matter of degree.
So we must prioritize these documents for
the user to examine.
And note that this
prioritization is very important,
because a user cannot digest
all the contents at once.
The user generally would have to
look at each document sequentially.
And therefore, it would make
sense to feed users with the most
relevant documents, and
that's what ranking is doing.
So for these reasons ranking
is generally preferred.
Now, this preference also has
a theoretical justification, and
this is given by the probability
ranking principle.
In the end of this lecture
there is a reference for this.
This principle says, returning a ranked
list of documents in descending order of
probability, that a document
is relevant to the query,
is the optimal strategy under
the following two assumptions.
First, the utility of a document to a user
is independent of the utility
of any other document.
Second, a user would be assumed to
browse the results sequentially.
Now it's easy to understand why
these two assumptions are needed,
in order to justify
the ranking strategy.
Because, if the documents are independent,
then we can evaluate the utility
of each document separately.
And this would allow us to compute
a score for each document independently.
And then we're going to rank these
documents based on those scores.
The second assumption says that the
user would indeed follow the ranked list.
If the user is not going to follow
the ranked list, is not going to examine
the documents sequentially, then obviously
the ordering would not be optimal.
So under these two assumptions, we can
theoretically justify that the ranking strategy
is in fact the best that you could do.
Now I've put one question here.
Do these 2 assumptions hold?
Now I suggest you pause the lecture for
a moment to think about these.
Now can you think of some
examples that would suggest
these assumptions aren't necessarily true?
Now if you think for
a moment you may realize none of
the assumptions is actually true.
For example, in the case of the
independence assumption, we might have
documents that have similar
content or exactly the same content.
If you look at each of them alone,
each is relevant.
But if the user has already seen
one of them, we assume it's
generally not very useful for the user
to see another similar or duplicate one.
So clearly the utility of
a document is dependent
on other documents that the user has seen.
In some other cases, you might see
a scenario where one document alone may not
be useful to the user, but when three
particular documents are put together,
they provide an answer to
the user's question.
So this is collective relevance.
And that also suggests that
the value of the document
might depend on other documents.
Sequential browsing generally would make
sense if you have a ranked list there.
But even if you have a ranked list,
there is evidence showing that
users don't always just go strictly
sequentially through the entire list.
They sometimes would look at
the bottom for example, or skip some.
And if you think about more
complicated interfaces that we could possibly
use, like a two-dimensional interface
where you can put additional information
on the screen, then sequential browsing
is a very restrictive assumption.
So the point here is that,
none of these assumptions is really true,
but nevertheless,
the probability ranking principle
establishes some solid foundation for
ranking as a primary task for
search engines.
And this has actually been the basis for
a lot of research work in
information retrieval.
And many algorithms have been
designed based on this principle,
even though the assumptions
aren't necessarily true.
And we can address this problem by
post-processing a ranked list.
For example, to remove redundancy.
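Here is a minimal sketch of such post-processing in Python. The lecture does not name a specific method, so the Jaccard similarity measure, the 0.8 threshold, and the sample texts are all assumptions made for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of words."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def remove_redundancy(ranked_docs, threshold=0.8):
    """Walk down the ranked list and keep a document only if it is
    not too similar to any document that was already kept."""
    kept = []
    for doc_id, text in ranked_docs:
        words = set(text.lower().split())
        if all(jaccard(words, kept_words) < threshold for _, kept_words in kept):
            kept.append((doc_id, words))
    return [doc_id for doc_id, _ in kept]

# Made-up ranked list: d2 is an exact duplicate of d1.
ranked = [
    ("d1", "news about presidential campaign"),
    ("d2", "news about presidential campaign"),
    ("d3", "organic food campaign news"),
]
filtered = remove_redundancy(ranked)
print(filtered)
```

The original ranking order is preserved; only documents that add little beyond what the user has already seen are dropped.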
So to summarize this lecture,
the main points that you can
take away are the following.
First, text retrieval is
an empirically defined problem.
And that means which algorithm is
better must be judged by the users.
Second, document ranking
is generally preferred, and
this will help users prioritize
examination of search results.
And this is also to bypass the difficulty
in determining absolute relevance,
because we can get some help from users
in determining where to make the cut off.
It's more flexible.
So this further suggests that
the main technical challenge
in designing a search engine is
designing an effective ranking function.
In other words, we need to define
what is the value of this function f
on the query and document pair.
Now how to design such a function is
a main topic in the following lectures.
There are two suggested
additional readings.
The first is the classic paper on
the probability ranking principle.
The second is a must-read for anyone
doing research in information retrieval.
It's a classic IR book,
which has excellent coverage of
the main research results in early days,
up to the time when the book was written.
Chapter six of this book has
an in depth discussion of
the probability ranking principle,
and
the probabilistic retrieval models,
in general.
[MUSIC]

[MUSIC]
This lecture is an overview
of text retrieval methods.
In the previous lecture we introduced
you to the problem of text retrieval.
We explained that the main problem is
to design a ranking function to rank
documents for a query.
In this lecture,
we will give an overview of different
ways of designing this ranking function.
So the problem is the following.
We have a query that has
a sequence of words, and
a document that's also a sequence of
words, and we hope to define the function
f that can compute a score based
on the query and document.
So the main challenge here is
designing a good ranking function that can
rank all the relevant documents,
on top of all the non-relevant ones.
Now clearly this means our
function must be able to measure
the likelihood that a document
d is relevant to a query q.
That also means we have to have
some way to define relevance.
In particular in order to implement
the program to do that we have to have
a computational definition of relevance,
and we achieve this goal by
designing a retrieval model, which
gives us a formalization of relevance.
Now, over many decades,
researchers have designed
many different kinds of retrieval models,
and they fall into different categories.
First, one family of models
is based on the similarity idea.
Basically, we assume that if
a document is more similar to the query
than another document is,
then we would say the first document
is more relevant than the second one.
So in this case,
the ranking function is defined as
the similarity between the query and
the document.
One well-known example in this
case is the vector space model,
which we will cover in more
detail in a later lecture.
The second kind of models
are called probabilistic models.
In this family of models,
we follow a very different strategy.
We assume that queries and
documents are all observations
from random variables, and
we assume there is a binary
random variable called R here,
to indicate whether a document
is relevant to a query.
We then define the score of a document with
respect to a query as the probability
that this random variable R is equal to 1,
given a particular document and query.
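In symbols, this definition can be written as follows, where f is the ranking function, q is the query, d is the document, and R is the binary relevance variable:

```latex
f(q, d) = p(R = 1 \mid d, q), \qquad R \in \{0, 1\}
```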
There are different cases
of such a general idea.
One is the classic probabilistic model,
another is the language model, and yet
another is the
divergence-from-randomness model.
In a later lecture,
we will talk more about one case,
which is the language model.
The third kind of models is based on
probabilistic inference.
So here the idea is to associate
uncertainty to inference rules.
And we can then quantify the probability
that we can show that the query
follows from the document.
Finally, there is also a family of models
that are using axiomatic thinking.
Here the idea is to define
a set of constraints that
we hope a good retrieval
function will satisfy.
So in this case the problem is to seek
a good ranking function that can
satisfy all the desired constraints.
Interestingly, although these different
models are based on different thinking,
in the end the retrieval function
tends to be very similar.
And these functions tend to
also involve similar variables.
So now let's take a look at the
common form of a state-of-the-art retrieval
model and examine some of the common
ideas used in all these models.
First, these models are all
based on the assumption
of using bag of words for
representing text.
And we explained this in the natural
language processing lecture.
Bag of words representation remains
the main representation used in all
the search engines.
So, with this assumption,
the score of a query like presidential
campaign news,
with respect to a document d here,
would be based on scores computed
for each individual word.
And that means the score would
depend on the score of each word,
such as presidential, campaign, and news.
Here we can see there are three
different components,
each corresponding to how well the
document matches each of the query words.
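One way to write this decomposition is shown here. The per-word function g is the one the lecture refers to; combining the three components by summation is an assumption made for illustration, since the lecture only says that the overall score depends on the score of each word.

```latex
f(q, d) = g(\text{presidential}, d) + g(\text{campaign}, d) + g(\text{news}, d)
```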
Inside of these functions,
we see a number of heuristics used.
So for example, one factor that
affects the function g here is how
many times does the word
presidential occur in the document?
This is called Term Frequency, or TF.
We might also denote it as
c(presidential, d).
In general if the word
occurs more frequently in
the document then the value of
this function would be larger.
Another factor is how
long the document is, and
this is
to use the document length for scoring.
In general, if a term occurs in a long
document many times,
it's not as significant as
if it occurred the same number
of times in a short document.
Because in the long document any term
is expected to occur more frequently.
Finally, there is this factor called
document frequency; that is, we also
want to look at how often presidential
occurs in the entire collection.
And we call this Document Frequency,
or DF, of presidential.
And in some other models we
might also use a probability
to characterize this information.
So here, I show the probability of
presidential in the collection.
So all these are trying to
characterize the popularity of
the term in the collection.
In general,
matching a rare term in the collection
is contributing more to the overall
score than matching a common term.
So this captures some of the main ideas
used in pretty much all the state of
the art retrieval models.
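To make these three quantities concrete, here is a minimal sketch in Python that computes them on a toy collection. The document texts and function names are made up for illustration, and whitespace splitting stands in for real tokenization.

```python
from collections import Counter

# Toy collection; the texts are made up for illustration.
collection = {
    "d1": "news about the presidential campaign",
    "d2": "presidential campaign news and more presidential news",
    "d3": "news about organic food",
}

def term_frequency(term, text):
    """c(term, d): how many times the term occurs in the document."""
    return Counter(text.split())[term]

def document_length(text):
    """Number of word tokens in the document."""
    return len(text.split())

def document_frequency(term, docs):
    """Number of documents in the collection that contain the term."""
    return sum(1 for text in docs.values() if term in text.split())

tf = term_frequency("presidential", collection["d2"])
length = document_length(collection["d2"])
df = document_frequency("presidential", collection)
print(tf, length, df)
```

How these numbers are combined into one score is what distinguishes the individual retrieval models.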
So now, a natural question is
which model works the best?
Now, it turns out that many models work
equally well, so here I listed the four
major models that are generally regarded
as state-of-the-art retrieval models.
Pivoted length normalization,
BM25, query likelihood, PL2.
When optimized, these models tend
to perform similarly, and this is
discussed in detail in the reference
at the end of this lecture.
Among all these,
BM25 is probably the most popular.
It's most likely that this has been used
in virtually all the search engines,
and you will also often see this
method discussed in research papers.
And we'll talk more about this
method later in some other lectures.
So, to summarize, the main points made
in this lecture are: first, the design
of a good ranking function requires a
computational definition of relevance, and
we achieve this goal by designing
a proper retrieval model.
Second, many models are equally effective,
so we don't have a single winner here.
Researchers are still actively
working on this problem,
trying to find a truly
optimal retrieval model.
Finally, the state of the art
ranking functions tend to rely on
the following ideas.
First, bag of words representation.
Second, TF and
the document frequency of words.
Such information is used in the
ranking function to determine
the overall contribution of matching
a word. Third, document length.
These are often combined in
interesting ways and we'll discuss
how exactly they are combined to rank
documents in the lectures later.
There are two suggested additional
readings if you have time.
The first is a paper where you can
find a detailed discussion and
comparison of multiple
state of the art models.
The second, is a book with a chapter
that gives a broad review of
different retrieval models.
[MUSIC]

[SOUND].
This lecture is about the vector
space retrieval model.
We're going to give
an introduction to its basic idea.
In the last lecture we talked about
the different ways of designing
a retrieval model which would give
us a different ranking function.
In this lecture, we're going to
talk about a specific way of
designing the ranking function called
the vector space retrieval model.
And we're going to give a brief
introduction to the basic idea.
Vector space model is a special case of
similarity based models
as we discussed before.
Which means,
we assume relevance is roughly
similarity between a document and a query.
Now whether this assumption is true,
is actually a question.
But in order to solve our
search problem we have to
convert the vague notion of
relevance into a more precise
definition that can be implemented
with a programming language.
So in this process we have to
make a number of assumptions.
This is the first assumption
that we make here.
Basically we assume that
if a document is more
similar to a query than another document,
then the first document would be assumed
to be more relevant than the second one.
And this is the basis for
ranking documents in this approach.
Again, it's questionable whether this is
really the best definition for relevance.
As we will see later there
are other ways to model relevance.
The first idea of vector space retrieval
model is actually very easy to understand.
Imagine a high dimensional space, where
each dimension corresponds to a term.
So, here, I show a three
dimensional space with three words,
programming, library, and presidential.
So each term, here, defines one dimension.
Now we can consider vectors in
this three dimensional space.
And we're going to assume
all our documents and
the query will be placed
in this vector space.
So, for example, one document might
be represented by this vector, d1.
Now this means this document probably
covers library and presidential.
But it doesn't really
talk about programming.
All right, what does this mean in
terms of representation of the document?
That just means,
we're going to look at our document
from the perspective of this vector.
We're going to ignore everything else.
Basically what we see here is
only the vector of the document.
Of course the document
has other information.
For example,
the order of words is simply ignored, and
that's because we
assume the bag of words representation.
So with this representation
you can already see that d1
seems to suggest a topic about
the presidential library.
Now this is different
from another document.
Which might be represented as
a different vector, d2 here.
Now in this case, the document
covers programming and library, but
it doesn't talk about presidential.
So what does this remind you of?
Well, you can probably guess, the topic
is likely about programming languages and
the library is a software library.
So this shows that by using this
vector space representation,
we can actually capture the differences
between topics of documents.
Now you can also imagine
there are other vectors.
For example,
d3 is pointing in that direction, that
might be about presidential programming.
And in fact we're going to place all
the documents in this vector space.
And they will be pointing
to all kinds of directions.
And similarly, we're going to place
our query also in this space,
as another vector.
And then we're going to measure the
similarity between the query vector and
every document vector.
So, in this case for example, we can
easily see d2 seems to be the closest
to this query vector, and
therefore d2 will be ranked above others.
So this is basically the main
idea of the vector space model.
So, to be more precise,
the vector space model is a framework.
In this framework,
we make the following assumptions.
First, we represent a document and
query by a term vector.
So here a term can be any basic concept.
For example, a word or a phrase,
or even an n-gram of characters,
that is, a sequence of
characters inside a word.
Each term is assumed to
define one dimension.
Therefore, the N terms
in our vocabulary
define an N-dimensional space.
A query vector would consist
of a number of elements
corresponding to the weights
of different terms.
Each document vector is also similar.
It has a number of elements and
the value of each element
indicates the weight
of the corresponding term.
Here you can see,
we have seen there are N dimensions.
Therefore, there are N elements,
each corresponding to the weight
on the particular term.
So the relevance in this case would
be assumed to be the similarity
between the two vectors.
Therefore, our ranking function is
also defined as the similarity between
the query vector and document vector.
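In symbols, the framework so far can be summarized as follows. The element names x_i and y_i follow the notation used in the next lecture, and sim is a placeholder for a similarity function that the framework leaves unspecified.

```latex
q = (x_1, \dots, x_N), \qquad d = (y_1, \dots, y_N), \qquad f(q, d) = \mathrm{sim}(q, d)
```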
Now, if I ask you to write a program
to implement this approach
in a search engine,
you would realize that this
is far from clear, right?
We haven't said a lot of things in detail,
therefore it's impossible to actually
write the program to implement this.
That's why I said this is a framework.
And this has to be refined
in order to actually
suggest a particular function,
that you can implement on the computer.
So, what does this framework not say?
Well, it actually hasn't said many things
that would be required in order
to implement this function.
First, it did not say how we should define
or select the basic concepts exactly.
We clearly assume
the concepts are orthogonal,
otherwise there will be redundancy.
For example, if two synonyms are somehow
distinguished as two different concepts.
Then they would be defined
in two different dimensions.
And then that would clearly
cause a redundancy here,
or an overemphasis on
matching this concept.
Because it would be as if you
matched the two dimensions
when you actually matched
one semantic concept.
Secondly, it did not say how we
exactly should place documents and
query in this space.
Basically I show you some examples
of query and document vectors.
But where exactly should the vector for
a particular document point to?
[INAUDIBLE] So this is equivalent
to how to define the term weights.
How do you compute those element
values in those vectors?
This is a very important
question because term weight
in the query vector indicates
the importance of the term.
So depending on how you assign the weight,
you might prefer some terms
to be matched over others.
Similarly, term weight in
the document is also very meaningful.
It indicates how well the term
characterizes the document.
If you got it wrong, then you clearly
don't represent this document accurately.
Finally, how we define the similarity
measure is also not clear.
So these questions must be addressed
before we can have an operational
function that we can actually
implement using a programming language.
So how we solve these problems
is the main topic of the next lecture.
[MUSIC]

[SOUND].
In this lecture, we're going to talk about
how to instantiate a vector space model,
so that we can get a very
specific ranking function.
So this is to continue
the discussion of the vector space model,
which is one particular approach
to designing a ranking function.
And we are going to talk about how
we use the general framework of
the vector space model
as guidance to instantiate the framework
to derive a specific ranking function.
And we're going to cover the simplest
instantiation of the framework.
So as we discussed in
the previous lecture,
the vector space model
is really a framework.
It doesn't say many things.
So for
example here it shows that it did not say
how we should define the dimension.
It also did not say how we place
a document vector in this space.
It did not say how we place a query
vector in this vector space.
And finally, it did not say how
we should measure the similarity
between the query vector and
the document vector.
So, you can imagine,
in order to implement this model,
we have to say specifically
how we are computing these vectors.
What is exactly xi and what is exactly yi?
This will determine where we
place the document vector.
Where we place a query vector.
And of course, we also need to say exactly
what will be the similarity function.
So if we can provide a definition
of the concepts that would
define the dimensions, and
of these xi's and yi's,
namely the weights of terms for
the query and document,
then we will be able to
place document vectors and
query vector in this well defined space.
And then,
if we also specify a similarity function,
then we'll have a
well-defined ranking function.
So let's see how we can do that.
And think about
the simplest instantiation.
Actually, I would suggest you
pause the lecture at this point and
spend a couple of minutes to think about this.
Suppose you are asked
to implement this idea.
You've come up with the idea
of vector space model.
But you still haven't figured out
how to compute this vector exactly,
how to define this similarity function.
What would you do?
So think for a couple of minutes and
then, proceed.
So let's think about some simplest ways
of instantiating this vector space model.
First, how do we define a dimension.
Well the obvious choice is we use
each word in our vocabulary
to define a dimension.
And here we assume that there
are n words in our vocabulary,
therefore there are n dimensions.
Each word defines one dimension.
And this is basically the Bag
of Words Instantiation.
Now let's look at how we
place vectors in this space.
Again here, the simplest strategy is to
use a bit vector to represent
both a query and a document.
And that means each element xi and
yi would be taking a value
of either zero or one.
When it's one,
it means the corresponding word is
present in the document or in the query.
When it's zero,
it's going to mean that it's absent.
So you can imagine if the user
types in a few words in the query,
then the query vector
will have only a few ones and many, many zeros.
The document vector in general
will have more ones, of course,
but it will also have many zeros.
Since the vocabulary
is generally very large,
many words don't really
occur in a document.
Many words will only occasionally
occur in the document.
A lot of words will be absent
in a particular document.
So, now we have placed the documents and
the query in the vector space.
Let's look at how we
measure the similarity.
So, a commonly used similarity
measure here is Dot Product.
The dot product of two
vectors is simply defined as
the sum of the products of the
corresponding elements of the two vectors.
So here we see that it's
the product of x1 and y1 here.
And then, x2 multiplied by y2.
And then finally xn multiplied by yn.
And then we take a sum here.
So that's the dot product.
Now we can represent this in a more
general way, using a sum here.
So this is only one of the many different
ways of measuring the similarity.
So now we see that we have defined
the dimensions.
We have defined the vectors.
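Written out, with the query vector q = (x_1, ..., x_n) and the document vector d = (y_1, ..., y_n), the dot product described above is:

```latex
\mathrm{sim}(q, d) = q \cdot d = x_1 y_1 + x_2 y_2 + \dots + x_n y_n = \sum_{i=1}^{n} x_i y_i
```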
And we have also defined
the similarity function.
So now we finally have
the Simplest Vector Space Model.
Which is based on the bit vector
representation, dot product similarity,
and bag of words instantiation.
And the formula looks like this.
So this is our formula.
And that's actually a particular retrieval
function, a ranking function, all right?
Now, we can finally implement this
function using a programming language and
then rank documents for a query.
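As a minimal sketch of what such an implementation could look like in Python, the code below combines the three choices made so far: one dimension per vocabulary word, 0/1 bit vectors, and dot product similarity. The five-word vocabulary, the sample texts, and the function names are assumptions for illustration, and whitespace splitting stands in for real tokenization.

```python
def bit_vector(text, vocabulary):
    """Return a 0/1 vector: 1 if the vocabulary word occurs in the text, else 0."""
    words = set(text.lower().split())
    return [1 if term in words else 0 for term in vocabulary]

def dot_product(x, y):
    """Sum of the products of corresponding elements."""
    return sum(xi * yi for xi, yi in zip(x, y))

def score(query, document, vocabulary):
    """Simplest VSM: dot product of the query and document bit vectors."""
    return dot_product(bit_vector(query, vocabulary), bit_vector(document, vocabulary))

# Hypothetical tiny vocabulary and texts.
vocabulary = ["news", "about", "presidential", "campaign", "food"]
s = score("news about presidential campaign", "news about organic food", vocabulary)
print(s)
```

To rank a collection, we would compute this score for every document and sort in descending order.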
Now at this point you should
again pause the lecture
to think about how we can
interpret this score.
So we have gone through the process
of modeling the retrieval problem
using a vector space model.
And then, we make assumptions.
About how we place vectors in the vector
space and how we define the similarity.
So in the end we've got a specific
retrieval function shown here.
Now the next step is to think about
whether this retrieval function
actually makes sense.
Can we expect this function
to actually perform well
when we use it to rank documents
for users' queries?
So, it's worth thinking about, what is
this value that we are calculating?
So in the end, we've got a number,
but what does this number mean?
Is it meaningful?
So spend a couple minutes
to think about that.
And of course,
the general question here is do you
believe this is a good ranking function?
Would it actually work well?
So again,
think about how to interpret this value.
Is it actually meaningful?
Does it mean something?
Is it related to how well the
document matches the query?
So in order to assess
whether this simplest
vector space model actually works well,
let's look at the example.
So here I show some sample documents and
a simple query.
The query is news about
the presidential campaign.
And we have five documents here.
They cover different terms in the query.
And if you look at
these documents for a moment,
you may realize that
some documents are probably relevant,
and some others are probably not relevant.
Now if I ask you to rank these documents,
how would you rank them?
This is basically our ideal ranking,
the one we get when humans examine
the documents and try to rank them.
So think for a moment and
take a look at this slide,
perhaps by pausing the lecture.
So I think most of you
would agree that d4,
and d3, are probably better than others.
Because they really cover the query well.
They match news,
presidential, and campaign.
So, it looks like that these two documents
are probably better than the others.
They should be ranked on top.
And the other three, d1, d2, and
d5, are really non-relevant.
So we can also say d4 and
d3 are relevant documents, and d1, d2, and
d5 are non-relevant.
So, now lets see if our vector
space model could do the same or
could do something close.
So let's first think about how we actually
use this model to score documents.
Right here I show two documents, d1 and
d3, and we have the query also here.
In the vector space model, of course we
want to first compute the vectors for
these documents and the query.
Now I show the vocabulary
here as well, so
these are the n dimensions
that we'll be thinking about.
So what do you think is the vector
representation for the query?
Note that we are assuming
that we only use zero and one
to indicate whether a term is absent or
present in the query or in the document.
So these are zero, one bit vectors.
So what do you think is the query vector?
Well the query has four words here.
So for these four words, there would be a
one and for the rest, there will be zeros.
Now what about the documents?
It's the same.
So d1 has two words, news and about.
So there are two ones here and
the rest are zeros.
Similarly for d3. So
now that we have the two vectors,
let's compute the similarity.
And we're going to use dot product.
So you can see when we use dot product we
just multiply the corresponding elements.
Right.
So
these two would
form a product.
And these two will
generate another product.
And these two would generate yet
another product.
And so on and so forth.
Now you can
easily see that if we do that,
we actually don't have to
care about these zeroes
because whenever we have a zero,
the product will be zero.
So, when we take a sum
over all these pairs,
then the zero entries will be gone.
As long as you have one zero,
then the product would be zero.
So in effect, we're just counting
how many pairs of one and one, right?
In this case, we have seen two.
So the result will be two.
So, what does that mean?
Well, that means this number, or
the value of this scoring function,
is simply the count of how many unique
query terms are matched in the document.
Because
if a term is matched in the document,
then there will be two ones.
If it's not, then there will
be zero on the document side.
Similarly, if the document has a term
but the term is not in the query, there
will be zero in the query vector.
So those don't count.
So as a result this
scoring function basically
measures how many unique query
terms are matched in a document.
This is how we interpret this score.
Now we can also take a look at the d3.
In this case,
you can see the result is three.
Because d3 matched the three distinctive
query words, news, presidential, campaign.
Whereas d1 only matched two.
Now in this case, it seems
reasonable to rank d3 on top of d1.
And this simplest vector
space model indeed does that.
So that looks pretty good.
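The computation for d1 and d3 can be reproduced with a short Python sketch. The actual document texts are on the slide and not in this transcript, so the strings below are stand-ins chosen only to match the words the lecture says each document contains; the five-word vocabulary is likewise an assumption.

```python
# Hypothetical small vocabulary; the slide uses a larger one.
vocabulary = ["news", "about", "presidential", "campaign", "food"]

def bit_vector(text):
    """1 if the vocabulary word is present in the text, else 0."""
    words = set(text.split())
    return [1 if term in words else 0 for term in vocabulary]

query_vec = bit_vector("news about presidential campaign")
d1_vec = bit_vector("news about")                     # stand-in text for d1
d3_vec = bit_vector("news of presidential campaign")  # stand-in text for d3

# Dot product: only positions where both vectors hold a 1 contribute.
score_d1 = sum(x * y for x, y in zip(query_vec, d1_vec))
score_d3 = sum(x * y for x, y in zip(query_vec, d3_vec))
print(query_vec, d1_vec, score_d1)
print(query_vec, d3_vec, score_d3)
```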
However, if we examine this model in
detail, we likely will find some problems.
So here I'm going to show all
the scores for these five documents.
And you can even verify they are correct.
Because we're basically counting
the number of unique query
terms matched in each document.
Now note that this method
actually makes sense.
Right?
It basically means if a document matches
more unique query terms, then the document
will be assumed to be more relevant.
And that seems to make sense.
The only problem here is that
there are three documents, d2, d3, and d4,
and they are tied with a score of three.
So that's a problem, because if you
look at them carefully it seems that
d4 should be ranked above d3.
Because d3 only mentioned
presidential once,
but d4 mentioned it many more times.
In the case of d3,
presidential could be an extended mention.
But d4 is clearly about
the presidential campaign.
Another problem is that d2 and
d3 also have the same score.
But if you look at
the three words that are matched,
in the case of d2, it matched news,
about, and campaign.
But in the case of d3, it matched news,
presidential, and campaign.
So intuitively, d3 is better.
Because matching presidential is more
important than matching about,
even though about and
presidential are both in the query.
So intuitively,
we would like d3 to be ranked above d2.
But this model doesn't do that.
So that means this is still not good
enough, we have to solve these problems.
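The tie can be seen directly in a small Python sketch that uses the interpretation above, the number of distinct query words matched in a document. The document strings are stand-ins consistent with the matches described in the lecture, not the actual slide texts.

```python
query = "news about presidential campaign"

# Stand-in texts for the three tied documents.
docs = {
    "d2": "news about organic food campaign",
    "d3": "news of presidential campaign",
    "d4": "news of presidential campaign presidential candidate",
}

def matched_terms(query, text):
    """Distinct query words that also appear in the document."""
    return set(query.split()) & set(text.split())

scores = {doc_id: len(matched_terms(query, text)) for doc_id, text in docs.items()}
print(scores)
```

All three documents get the same score: the repeated occurrence of presidential in d4 is invisible to a 0/1 representation, and matching about counts exactly as much as matching presidential.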
To summarize,
in this lecture we talked about how
to instantiate a vector space model.
We mainly need to do three things.
One is to define the dimension.
The second is to
decide how to place documents
as vectors in the vector space.
And to also place a query in
the vector space as a vector.
And third is to define
the similarity between two vectors,
particularly the query vector and
the document vector.
We also talked about a very simple way
to instantiate the vector space model.
Indeed, that's probably the simplest
vector space model that we can derive.
In this case,
we use each word to define a dimension.
We use a zero one bit vector to
represent a document or a query.
In this case, we basically only care
about word presence or absence.
We ignore the frequency.
And we use the dot product
as the similarity function.
And with such an instantiation,
we showed that the scoring
function is basically to score
a document based on the number of distinct
query words matched in the document.
We also showed that such a simple vector
space model still doesn't work well,
and we need to improve it.
And this is the topic that we're
going to cover in the next lecture.
[MUSIC]

[SOUND].
In this lecture, we're going to talk about
how to improve the instantiation of
the Vector Space Model.
This is the continued discussion
of the Vector Space Model.
We're going to focus on how to improve
the instantiation of this model.
In a previous lecture,
you have seen that with a simple
instantiation of the Vector Space Model,
we can come up with
a simple scoring function that
would give us, basically,
a count of how many unique query
terms are matched in the document.
We also have seen that this function
has a problem as shown on this slide.
In particular,
if you look at these three documents,
they will all get the same score because
they match the three unique query words.
But intuitively we would like,
d4 to be ranked above d3.
And d2 is really non relevant.
So the problem here is that
this function couldn't capture
the following characteristics.
First, we would like to
give more credit to d4
because it matches presidential
more times than d3.
Second, intuitively matching
presidential should be more important
than matching about, because about is
a very common word that occurs everywhere.
It doesn't really carry that much content.
So, in this lecture,
let's see how we can improve the model
to solve these two problems.
It's worth thinking at this point about
why we have these two problems.
If we look back at
the assumptions we have made
while instantiating the Vector
Space Model, we will realize that
the problem is really coming
from some of the assumptions.
In particular, it has to do with how we
place the vectors in the vector space.
So then, naturally,
in order to fix these problems,
we have to revisit those assumptions.
Perhaps, we will have
to use different ways to
instantiate the Vector Space Model.
In particular, we have to place
the vectors in a different way.
So, let's see how we can improve this.
Well, a natural thought is that, in order
to consider multiple occurrences of a term
in a document,
we should consider the term frequency
instead of just the absence or presence.
In order to consider the difference
between a document where a query
term occurred multiple times and the one
where the query term occurred just once.
We have to concede a term frequency,
the count of a term being in the document.
In the simplest model, we only model
the presence and absence of a term.
We ignore the actual number of times
that a term occurs in a document.
So let's add this back.
So what we're going to do then is represent
a document by a vector with
term frequencies as elements.
So, that is to say, now,
the elements of both the query vector and
the document vector will not be zeros and ones,
but
instead they will be the counts of
a word in the query or the document.
So this would bring additional
information about the document.
So this can be seen as a more accurate
representation of our documents.
So, now let's see what the formula
would look like if we change
this representation.
So as you see on this slide,
we still use the dot product, and
so the formula looks
very similar in form.
In fact, it looks identical, but
inside the sum, of course, xi and
yi are now different.
They're now the counts of word
i in the query and the document.
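As a minimal sketch of this scoring function, assuming simple whitespace tokenization, the dot product over count vectors could look like this. The query and document text are made up for illustration, and the function names are hypothetical.

```python
from collections import Counter

def tf_vector(text):
    """Map each lowercased word to its raw count in the text."""
    return Counter(text.lower().split())

def dot_product_score(query, document):
    """Sum of x_i * y_i over the query words, using raw counts."""
    q, d = tf_vector(query), tf_vector(document)
    # Counter returns 0 for words that are missing from the document.
    return sum(q[w] * d[w] for w in q)

# Hypothetical query and document, invented for illustration only.
query = "news about presidential campaign"
doc = "news of presidential campaign presidential candidate"
print(dot_product_score(query, doc))
```

Here presidential occurs twice in the document, so it contributes 2 to the sum rather than 1.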
Now at this point, I also suggest you
pause the lecture for a moment and
just think about how we should
interpret the score of this new function.
It's doing something very similar
to what the simplest VSM is doing.
But because of the change of the vector,
now the new score has
a different interpretation.
Can you see the difference?
It has to do with
the consideration of multiple
occurrences of the same
term in the document.
More importantly, we'd like to know
whether this would fix the problem of
the simplest vector space model.
So, let's look at this example again.
So suppose, we change the vector
to term frequency vectors.
Now, let's look at these
three documents again.
The query vector is the same because
all these words occurred exactly once
in the query.
So the vector is still a 0-1 vector.
And in fact,
d2 is also essentially represented
the same way because none of these
words has been repeated.
As a result, the score is also the same,
still three.
The same issue for d3 and
we still have a 3.
But d4 would be different, because now,
presidential occurred twice here.
So the element for presidential in
the document vector would be 2 instead of 1.
As a result, now the score for
d4 is higher.
It's a four now.
So this means, by using term frequency,
we can now rank d4 above d2 and
d3 as we hoped to.
So this solves the problem with d4.
But, we can also see that d2 and
d3 are still treated in the same way.
They still have identical scores,
so it did not fix the problem here.
So, how can we fix this problem?
We would like to give more credit for
matching presidential than matching about.
But how can we solve
the problem in a general way?
Is there any way to determine which word
should be treated as more important and
which word can be basically ignored?
About is such a word.
Since it does not really
carry that much content,
we can essentially ignore it.
We sometimes call such a word
a stop word.
Those are generally very frequent and
they occur everywhere, so
matching them doesn't really mean anything.
But computationally, how can we capture that?
So again, I encourage you to
think a little bit about this.
Can you come up with any
statistical approaches to somehow
distinguish presidential from about.
If you think about it for
a moment, you realize that,
one difference is that a word
like about occurs everywhere.
So if you count the occurrences of the word
in the whole collection, then we
would see that about has a much higher
frequency than presidential, which tends
to occur only in some documents.
So this idea suggests
that we could somehow
use the global statistics of terms or
some other information to try to
down-weight the element for
about in the vector representation of d2.
At the same time,
we hope to somehow increase the weight
of presidential in the vector of d3.
If we can do that, then,
we can expect that d2 will get
an overall score of less than three,
while d3 will get a score above three.
Then, we'll be able to
rank d3 on top of d2.
So how can we do this systematically?
Again, we can rely on some
statistical counts.
And in this case, the particular idea is
called the Inverse Document Frequency.
We have seen document frequency
as one signal used in
modern retrieval functions.
We discussed this in a previous lecture.
So here's the specific way of using it.
Document frequency is the count of
documents that contain a particular term.
Here, we say inverse document frequency
because we actually want to reward a word
that doesn't occur in many documents.
And so, the way to incorporate this
into our vector representation is
then to modify the frequency
count by multiplying
it by the IDF of the corresponding
word as shown here.
If we do that,
then we can penalize common
words, which generally have a low IDF,
and
reward rare words,
which will have a higher IDF.
So more specifically, the IDF
can be defined as the logarithm
of M plus one divided by k,
where M is the total number of
documents in the collection and k is the DF, or
document frequency,
the total number of documents
containing the word W.
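A small sketch of this definition, assuming the natural logarithm (the base is not stated here) and a made-up collection size, might be:

```python
import math

def idf(total_docs, doc_freq):
    """IDF(W) = log((M + 1) / k), with M = total_docs and k = doc_freq."""
    if not 1 <= doc_freq <= total_docs:
        raise ValueError("doc_freq must be between 1 and total_docs")
    return math.log((total_docs + 1) / doc_freq)

M = 10_000  # hypothetical collection size, chosen only for illustration
for k in (1, 10, 1_000, 5_000):
    # The value falls as the word appears in more documents.
    print(k, round(idf(M, k), 4))
```

The values drop quickly for small k and then flatten out, which is the shape of the curve being described.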
Now, if you plot this
function by varying k,
then you will see a curve
that looks like this.
In general, you can see it
would give a higher value for
a low DF word, a rare word.
You can also see the maximum value
of this function is log of M plus 1.
It will be interesting for you to think about
what the minimum value of this function is.
This could be an interesting exercise.
Now, the specific function
may not be as important as
the heuristic to simply
penalize popular terms.
But it turns out this particular
function form has also worked very well.
Now, whether there is a better
form of function here
is an open research question.
But it's also clear that if we use
a linear penalization like what's
shown here with this line, then it may
not be as reasonable as the standard IDF.
In particular, you can see
the difference: in the standard IDF,
we
somehow have a [INAUDIBLE] point here.
After this point, we're going to say these
terms are essentially not very useful.
They can be essentially ignored.
And this makes sense when the term
occurs so frequently.
Let's say a term occurs in more
than 50% of the documents.
Then the term is unlikely to be very important;
it's basically a common term.
It's not very important to match this
word, so with the standard IDF, you can
see it's basically assumed that they all
have low weights; there's no difference.
But if you look at the linear
penalization, at this point there is
still some difference.
So intuitively, we want to focus more
on the discrimination of low DF words,
rather than these common words.
Well, of course, which one works better
still has to be validated
empirically, using a data set.
And we have to use users to
judge which results are better.
So now let's see how this
can solve problem two.
So now,
let's look at the two documents again.
Now without IDF weighting, before,
we just had term frequency vectors,
but with IDF weighting we
now can adjust the TF weight
by multiplying it by the IDF value.
For example here, you can see
the adjustment, in particular for
about: there is an adjustment
by using the IDF value of about,
which is smaller than the IDF
value of presidential.
So if you look at these,
the IDF will distinguish these two words.
As a result, the adjustment here would be
larger, and would make this weight larger.
So if we score with these new vectors,
what would happen is that, of course,
they share the same weights for news and
campaign, but the matching of about and
presidential will discriminate them.
So now as a result of IDF weighting,
we will have d3 ranked above d2,
because it matched a rare word,
whereas d2 matched a common word.
So this shows that IDF
weighting can solve problem two.
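To make the effect concrete, here is a small sketch of TF-IDF scoring for two documents that each match three query words. The IDF values and the document texts are hypothetical, chosen only so that about gets a low weight and presidential a high one.

```python
from collections import Counter

# Hypothetical IDF values: the common word "about" gets a low weight,
# the rarer word "presidential" gets a higher one.
IDF = {"news": 1.5, "about": 1.0, "presidential": 2.5, "campaign": 3.1}

def tfidf_score(query, document):
    """Dot product where each document count is multiplied by the word's IDF."""
    q = Counter(query.split())
    d = Counter(document.split())
    return sum(q[w] * d[w] * IDF.get(w, 0.0) for w in q)

query = "news about presidential campaign"
d2 = "news about organic food campaign"  # matches the common word "about"
d3 = "news of presidential campaign"     # matches the rarer word "presidential"
print(tfidf_score(query, d2), tfidf_score(query, d3))
```

Both documents still match three unique query words, but the one matching presidential now scores higher.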
So, how effective is this model in
general when we use TF-IDF weighting?
Well, let's look at all these
documents that we have seen before.
These are the new scores
of these documents.
But how effective is this
new weighting method and
new scoring function?
So now let's see overall how effective
this new ranking function
with TF-IDF weighting is.
Here, we show all the five documents
that we have seen before, and
these are their scores.
Now, we can see the scores for
the first four
documents here seem to
be quite reasonable.
They are as we expected.
However, we also see a new problem.
Because now d5, here,
which did not have a very high
score with our simplest
vector space model,
now actually has a very high score.
In fact, it has the highest score here.
So, this creates a new problem.
This is actually a common phenomenon
in designing retrieval functions.
Basically, when you try
to fix one problem,
you tend to introduce other problems.
And that's why it's very tricky
to design an effective ranking function.
And what the best ranking
function is remains an open research question.
Researchers are still working on that.
But in the next few lectures, we're
going to also talk about some additional
ideas to further improve this model and
try to fix this problem.
So to summarize this lecture,
we've talked about how to
improve this vector space model.
And we've improved the instantiation
of the vector space model based on
TF-IDF weighting.
So the improvement is mostly
on the placement of the vector,
where we give higher weight to a term
that occurred many times in the document,
but infrequently in the whole collection.
And we have seen that this improved
model indeed works better than
the simplest vector space model, but
it also still has some problems.
In the next lecture,
we're going to look at how to
address these additional problems.
[MUSIC]

[SOUND]
In this lecture, we continue
the discussion of Vector Space Model.
In particular, we are going to
talk about the TF transformation.
In the previous lecture,
we have derived a TF-IDF weighting
formula using the vector space model.
And we have shown that this model
actually works pretty well for
these examples as shown on
this slide except for d5,
which has received a very high score.
Indeed, it has received the highest
score among all these documents.
But this document is intuitively
non-relevant, so this is not desirable.
In this lecture, we're going to
talk about how to use TF
transformation to solve this problem.
Before we discuss the details,
let's take a look at the formula for
this simple TF-IDF
weighting ranking function and
see why this document has
received such a high score.
So this is the formula, and
if you look at the formula carefully,
then you will see it involves a sum
over all the matched query terms.
And inside the sum, each matched
query term has a particular weight.
And this weight is the TF-IDF weight.
So it has an IDF component
where we see two variables.
One is the total number of documents
in the collection, and that is M.
The other is the document frequency.
This is the number of documents
that contain this word w.
The other variables
involved in the formula
include the count of the query term
w in the query, and
the count of the word in the document.
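Written out, the ranking function described here can be sketched as follows. The notation c(w, q), c(w, d), and df(w) is an assumption introduced for the two counts and the document frequency; M is the number of documents in the collection.

```latex
f(q, d) = \sum_{w \in q \cap d} c(w, q)\, c(w, d)\, \log\frac{M + 1}{\mathrm{df}(w)}
```

Because c(w, d) enters the sum linearly, a single term with a large count in the document can dominate the score.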
If you look at this document again,
now it's not hard to
realize that the reason why it has
received a high score is because
it has a very high count of campaign.
So the count of campaign in this document
is four, which is much higher than in
the other documents, and has contributed
to the high score of this document.
So intuitively, in order to lower
the score for this document, we need
to somehow restrict the contribution of
the matching of this term in the document.
And if you think about the matching of
terms in the document carefully you
actually would realize we
probably shouldn't reward
multiple occurrences so generously.
And by that I mean the first occurrence
of a term says a lot about
the matching of this term,
because it goes from zero count
to a count of one, and
that increase means a lot.
Once we see a word in the document,
it's very likely that the document
is talking about this word.
If we see an extra occurrence
on top of the first occurrence,
that is to go from one to two,
then we also can say that, well, the second
occurrence kind of confirmed that it's
not an accidental mention of the word.
Now, we are more sure that this
document is talking about this word.
But imagine we have seen, let's say,
50 occurrences of the word in the document.
Then, adding one extra occurrence
is not going to tell us more about
the evidence because we are already sure
that this document is about this word.
So if you think
this way, it seems that
we should restrict the contribution
of a high count of a term.
And that is the idea of TF Transformation.
So this transformation function is
going to turn the raw count of word
into a Term Frequency Weight,
for the word in the document.
So here I show on the x-axis the raw count,
and on the y-axis I show
the Term Frequency Weight.
So, in the previous ranking functions
we actually have implicitly
used some kind of transformation.
So for example in the zero-one bit
vector representation we actually used
such a transformation
function as shown here.
Basically if the count is
zero then it has zero weight.
Otherwise it would have a weight of one.
It's flat.
Now what about using
Term Count as a TF weight.
Well that's a linear function, right?
So it has just exactly
the same weight as the count.
Now we have just seen that
this is not desirable.
So what we want is something like this.
So for example with a logarithm function,
we can have a sub-linear
transformation that looks like this.
And this will control the influence of
really high counts because it's going to
lower their influence, yet it will
retain the influence of small counts.
Or we might want to even bend the curve
more by applying logarithm twice.
Now people have tried all these methods
and they are indeed working better than
the linear form of the transformation,
but so far what works the best
seems to be this special transformation
called a BM25 transformation.
BM stands for best matching.
Now in this transformation,
you can see there's a parameter k here.
And this k controls the upper
bound of this function.
It's easy to see this function has
an upper bound because if you look at
x divided by x plus k, where
k is a non-negative number,
then the numerator will never be able
to exceed the denominator, right?
So, it's upper bounded by k plus 1.
Now, this is also a difference between
this transformation function and
the logarithm transformation,
which doesn't have an upper bound.
Now furthermore, one interesting property
of this function is that as we vary k,
we can actually simulate different
transformation functions,
including the two extremes
that are shown here.
That is, the zero-one bit transformation
and the linear transformation.
So for example, if we set k to zero,
now you can see
the function value would be one.
So we precisely
recover the zero-one bit transformation.
If you set k to a very large number,
on the other hand,
it's going to look more
like the linear transformation function.
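As a rough sketch, assuming the transformation has the form y = (k + 1) x / (x + k), which is consistent with the upper bound of k plus 1 mentioned above, the behavior for different k can be checked like this. The function name is hypothetical.

```python
def bm25_tf(count, k):
    """BM25-style TF transformation: y = (k + 1) * x / (x + k)."""
    if count == 0:
        return 0.0  # also avoids 0 / 0 when k is zero
    return (k + 1) * count / (count + k)

for k in (0, 1.2, 1e9):
    # k = 0 behaves like the zero-one bit vector, a huge k like the raw count.
    print(k, [round(bm25_tf(x, k), 3) for x in (0, 1, 2, 5, 50)])
```

With a moderate k, the weight grows quickly for the first few occurrences and then levels off below k plus 1.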
So in this sense,
this transformation is very flexible,
it allows us to control
the shape of the transformation.
It also has a nice property
of the upper bound.
And this upper bound is useful to control
the influence of a particular term.
And so we can prevent a spammer
from just increasing the count of
one term to spam all queries
that might match this term.
In other words, this upper bound
might also ensure that all terms
will be counted when we aggregate
the weights to compute a score.
As I said, this transformation
function has worked well, so far.
So to summarize this lecture,
the main point is that we need to do
a sublinear TF transformation.
And this is needed to capture
the intuition of diminishing returns from
high term counts.
It's also to avoid dominance by
one single term over all others.
This BM25 transformation
that we talked about is very interesting.
It's so far one of the best performing
TF transformation formulas.
It has an upper bound, and
it's also robust and effective.
Now, if we plug this
function into our TF-IDF weighting
vector space model, then we would
end up having the following
ranking function,
which has a BM25 TF component.
Now this is already very close to a state
of the art ranking function called BM25.
And we will discuss how we can further
improve this formula in the next lecture.
[MUSIC]

This lecture is about document length
normalization in the vector space model.
In this lecture we are going to continue
the discussion of the vector space model.
In particular, we are going to discuss
the issue of document
length normalization.
So far in the lectures about
the vector space model,
we have used various
signals from the document to
assess the matching of the document
with a query.
In particular we have
considered the term frequency,
the count of a term in a document.
We have also considered
its global statistics such as
IDF, inverse document frequency.
But we have not considered
document length.
So, here I show two example documents.
D4 is much shorter with only 100 words.
D6 on the other hand has 5,000 words.
If you look at the matching of these
query words, we see that in d6 there
are more matchings of the query words, but
one might reason that d6 may
have matched these query words
in a scattered manner.
So maybe the topic of d6 is not
really about the topic of the query.
So the discussion of a campaign
at the beginning of the document
may have nothing to do with the mention
of presidential at the end.
In general,
if you think about the long documents,
they would have a higher
chance to match any query.
In fact, if you generate
a long document by randomly
sampling words from
the distribution of words,
then eventually you probably
will match any query.
So in this sense we should
penalize long documents because they
just naturally have better
chances to match any query.
And this is the idea of document length normalization.
We also need to be careful to avoid
over-penalizing long documents.
On the one hand,
we want to penalize a long document.
But on the other hand,
we also don't want to over-penalize them.
And the reason is that a document
may be long for different reasons.
In one case the document may be
long because it uses more words.
So for example think about
the full article of a research paper.
It would use more words than
the corresponding abstract.
So this is the case where we probably
should penalize the matching of
a long document such as a full paper,
when we compare the matching
of words in such a
long document with the matching of
the words in the short abstract.
Long papers generally have a higher
chance of matching query words.
Therefore we should penalize them.
However, there is another case
when the document is long and
that is when the document
simply has more content.
Now consider another
case of a long document,
where we simply concatenated a lot
of abstracts of different papers.
In such a case, obviously, we don't
want to penalize such a long document.
Indeed, we probably don't want to penalize
such a document just because it's long.
So that's why we need to be careful
about using the right
degree of penalization.
A method that has been working well,
based on recent research, is called
pivoted length normalization.
And in this case the idea is to use
the average document length as a pivot,
as a reference point.
That means we will assume that for
the average length documents,
the score is about right.
So, the normalizer would be 1.
But if a document is longer than
the average document length
then there will be some penalization.
Whereas if it's shorter, then
there's even some reward.
So this is illustrated
using this slide.
On the
x-axis you can see the length of the document.
On the y-axis we show the normalizer.
In this case, the pivoted length normalization
formula for the normalizer
is seen to be an interpolation of one and
the normalized document length,
controlled by a parameter b here.
So, you can see here
we first divide the length of the
document by the average document length.
This not only gives us
some sense about
how this document compares with
the average document length, but
also gives us the benefit of not
worrying about the unit of
length; we can measure the length
by words or by characters.
Anyway, this normalizer has
an interesting property.
First we see that if we set the parameter
b to 0 then the value would be 1,
so there's no
length normalization at all.
So b in this sense controls the length
normalization, whereas if we set
b to a non-zero value, then
the normalizer will look like this, right?
So the value would be higher for
documents that are longer than
the average document length,
whereas the value of the normalizer
will be smaller for
shorter documents.
So in this sense we see there's
a penalization for long documents,
and there's a reward for short documents.
The degree of penalization
is controlled by b.
Because if we set b to a larger
value, then the normalizer
would look like this.
There's even more penalization for
long documents and more reward for
the short documents.
By adjusting b which
varies from zero to one
we can control the degree
of length normalization.
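The normalizer described here, an interpolation between 1 and the document length divided by the average document length, can be sketched as below. The function name and the example lengths are hypothetical.

```python
def pivoted_normalizer(doc_len, avg_doc_len, b):
    """Interpolate between 1 and doc_len / avg_doc_len using b in [0, 1]."""
    if not 0.0 <= b <= 1.0:
        raise ValueError("b must be between 0 and 1")
    return 1.0 - b + b * (doc_len / avg_doc_len)

avg_len = 500  # hypothetical average document length, in words
for b in (0.0, 0.3, 0.9):
    # Short, average-length, and long documents for each setting of b.
    print(b, [round(pivoted_normalizer(n, avg_len, b), 2) for n in (100, 500, 5000)])
```

For a document of exactly average length the normalizer is 1 regardless of b, which is why the average length acts as the pivot.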
So if we plug this length
normalization factor into
the vector space model ranking functions
that we have already examined,
then we will end up with these formulas,
and
these are in fact the state of
the art vector space model
formulas.
So,
let's take a look at each of them.
The first one's called a pivoted length
normalization vector space model.
And a reference at the end has details
about the derivation of this model.
And here, we see that it's basically
the TF-IDF weighting model that we have
discussed.
The IDF component should be
very familiar to you now.
There is also a query term
frequency component here.
And then in the middle there is
a normalized TF.
And in this case,
we see we use the double logarithm,
as we discussed before, and this is to
achieve a sublinear transformation.
But we also put a document length
normalizer in the bottom, all right, so
this would cause a penalty for
a long document, because the larger
the denominator is,
the smaller the TF weight is.
And this is of course controlled
by the parameter b here.
And you can see again, if b is set to 0,
then there is no length normalization.
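Putting these components together, a minimal sketch of the pivoted length normalization scoring function might look like this. It assumes natural logarithms, pre-tokenized input, and a default of b = 0.2 picked only for illustration; the statistics and documents are made up.

```python
import math
from collections import Counter

def pivoted_score(query, doc, doc_freq, total_docs, avg_doc_len, b=0.2):
    """Score a tokenized document against a tokenized query."""
    q_counts, d_counts = Counter(query), Counter(doc)
    # Length normalizer: 1 for an average-length document, larger for longer ones.
    normalizer = 1.0 - b + b * len(doc) / avg_doc_len
    score = 0.0
    for word, q_count in q_counts.items():
        d_count = d_counts[word]
        if d_count == 0:
            continue  # only matched query terms contribute
        tf = math.log(1.0 + math.log(1.0 + d_count)) / normalizer
        idf = math.log((total_docs + 1) / doc_freq[word])
        score += q_count * tf * idf
    return score

# Hypothetical statistics and documents, invented for illustration.
doc_freq = {"presidential": 50, "campaign": 20}
query = ["presidential", "campaign"]
short_doc = ["presidential", "campaign", "news"]
long_doc = short_doc + ["filler"] * 97

print(pivoted_score(query, short_doc, doc_freq, 1000, avg_doc_len=50))
print(pivoted_score(query, long_doc, doc_freq, 1000, avg_doc_len=50))
```

Both documents contain the same matches, but the longer one gets the lower score whenever b is above zero.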
Okay.
So this is one of the two most effective
vector space model formulas.
The next one, called BM25
or Okapi, is also similar
in that it also has an IDF component
here, and a query TF component here.
But in the middle, the normalization is
a little bit different.
As we explained, there is this
Okapi TF transformation here.
And that does a sublinear
transformation with an upper bound.
In this case we have put the length
normalization factor here.
We are adjusting k, but
it achieves a similar effect
because we put a normalizer
in the denominator.
Therefore again, if a document is longer,
then the term weight will be smaller.
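A comparable sketch for BM25 as described here, where the length normalizer scales k inside the TF transformation, could be written as follows. The defaults k = 1.2 and b = 0.75 are commonly used values assumed for illustration, not taken from this lecture, and the IDF is the log((M + 1) / df) form used above.

```python
import math
from collections import Counter

def bm25_score(query, doc, doc_freq, total_docs, avg_doc_len, k=1.2, b=0.75):
    """Score a tokenized document against a tokenized query."""
    q_counts, d_counts = Counter(query), Counter(doc)
    normalizer = 1.0 - b + b * len(doc) / avg_doc_len
    score = 0.0
    for word, q_count in q_counts.items():
        d_count = d_counts[word]
        if d_count == 0:
            continue  # only matched query terms contribute
        # Upper-bounded TF; a longer document makes the denominator larger.
        tf = (k + 1) * d_count / (d_count + k * normalizer)
        idf = math.log((total_docs + 1) / doc_freq[word])
        score += q_count * tf * idf
    return score

# Hypothetical statistics and documents, invented for illustration.
doc_freq = {"presidential": 50, "campaign": 20}
query = ["presidential", "campaign"]
short_doc = ["presidential", "campaign", "news"]
long_doc = short_doc + ["filler"] * 97

print(bm25_score(query, short_doc, doc_freq, 1000, avg_doc_len=50))
print(bm25_score(query, long_doc, doc_freq, 1000, avg_doc_len=50))
```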
So, you can see, after we have gone
through all the analysis that we talked
about, we have,
in the end, reached
basically the state of
the art retrieval functions.
So far we have talked
mainly about how to place
the document vector in the vector space.
And this has played an important role
in determining the effectiveness of
the ranking function.
But there are also other dimensions
that we did not really examine in detail.
For example, can we further
improve the instantiation of
the dimensions of the vector space model?
Now we've just assumed
the bag of words representation.
So each dimension is a word.
But obviously we can see
there are many other choices.
For example, stemmed words: those
are the words that have been transformed
into the same root form,
so that computation and computing will all
become the same and they can be matched.
We can also do stop word removal.
This removes some very common
words that don't carry any content,
like the or of.
We can use phrases to define dimensions.
We can even use latent semantic analysis
to find some clusters of
words that represent
a latent concept as one dimension.
We can also use smaller units,
like character n-grams.
Those are sequences of n characters for
dimensions.
However, in practice people have found
that the bag-of-words representation
with phrases is still
the most effective one.
And it's also efficient so
this is still so
far the most popular dimension
instantiation method and
it's used in all the major search engines.
I should also mention that sometimes
we need to do language-specific and
domain-specific tokenization.
And this is actually very important, as
we might have variations of the terms
that might prevent us from
matching them with each other,
even though they mean the same thing.
And in some languages, like Chinese,
there is also the challenge of
segmenting text to obtain word boundaries,
because the text is just
a sequence of characters.
A word might correspond to
one character or two characters or
even three characters.
So it's easier in English when we
have a space to separate the words.
But in some other languages we may need
to do some natural language processing
to figure out
where the boundaries for words are.
There is also the possibility to
improve the similarity function.
So
far we have used the dot product, but
one can imagine there are other measures.
For example, we can measure the cosine
of the angle between two vectors, or
we can use a Euclidean distance measure.
And these are all possible.
But the dot product still seems the best, and
one of the reasons is
because it's very general.
In fact, it's sufficiently general.
If you consider the possibilities of
doing weighting in different ways.
So, for example,
cosine measure can be regarded as the dot
product of two normalized vectors.
That means we first normalize each vector,
and then we take the dot product.
That would be equivalent
to the cosine measure.
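This equivalence is easy to check numerically. The sketch below uses two made-up vectors and hypothetical helper names, and compares the cosine computed directly with the dot product of the length-normalized vectors.

```python
import math

def dot(x, y):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

def normalize(x):
    """Scale a non-zero vector to unit length."""
    length = math.sqrt(dot(x, x))
    return [a / length for a in x]

def cosine(x, y):
    """Cosine of the angle between two non-zero vectors."""
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))

# Hypothetical term-weight vectors, invented for illustration.
query_vec = [1.0, 1.0, 0.0, 2.0]
doc_vec = [3.0, 0.0, 1.0, 1.0]
print(cosine(query_vec, doc_vec))
print(dot(normalize(query_vec), normalize(doc_vec)))
```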
I just mentioned that BM25
seems to be one of the most
effective formulas.
But there has also been further
development in improving BM25, although
none of these works have
changed BM25 fundamentally.
So in one line of work,
people have derived BM25F.
Here F stands for field, and
this is to use BM25 for
documents with structure.
For example, you might consider
the title field, the abstract, or
the body of a research article, or
even anchor text on web pages.
Those are the text fields that
describe links to other pages.
And these can all be
combined with an appropriate
weight on different fields to help
improve scoring for a document.
When we use BM25 for such a document,
the obvious choice is to
apply BM25 to each field, and
then combine the scores.
Basically, the idea of BM25F
is to first combine
the frequency counts of terms in all
the fields and then apply BM25.
Now this has the advantage of avoiding
over-counting the first occurrence of the term.
Remember, in the sublinear
transformation of TF,
the first occurrence is very important
and contributes a large weight.
And if we do that for all the fields, then
the same term might have gained a lot
of advantage in every field. But when we
combine these word frequencies together,
we just do the transformation one time,
and
at that time the extra occurrences will
not be counted as fresh first occurrences.
And this method has been working very
well for scoring structured documents.
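A very simplified sketch of this difference is shown below. It looks only at the TF component for one term, ignores IDF and length normalization, and uses made-up field counts and weights, so it illustrates the idea rather than the full BM25F formula.

```python
def bm25_tf(count, k=1.2):
    """Upper-bounded TF transformation: (k + 1) * x / (x + k)."""
    return (k + 1) * count / (count + k) if count > 0 else 0.0

# Hypothetical counts of one query term in each field, and field weights.
field_counts = {"title": 1, "abstract": 1, "body": 1}
field_weights = {"title": 1.0, "abstract": 1.0, "body": 1.0}

# Option 1: transform each field separately, then combine the weights.
per_field = sum(field_weights[f] * bm25_tf(c) for f, c in field_counts.items())

# Option 2 (the BM25F idea): combine the weighted counts, then transform once.
combined_count = sum(field_weights[f] * c for f, c in field_counts.items())
combined = bm25_tf(combined_count)

print(per_field, combined)
```

In the first option every field pays the large first-occurrence weight again, while the second option stays below the upper bound of k plus 1.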
The other line of extension is called
BM25+. In this line, researchers
have addressed the problem of over-penalization
of long documents by BM25.
So to address this problem,
the fix is actually quite simple.
We can simply add a small constant
to the TF normalization formula.
But what's interesting is that we can
analytically prove that by doing such
a small modification,
we will fix the problem of
over-penalization of long
documents by the original BM25.
So the new formula, called
BM25+, is empirically and
analytically shown to be better than BM25.
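As a rough sketch of this fix, the constant can be added to the normalized TF of every matched term. The value delta = 1.0 and the defaults for k and b are assumptions for illustration, and the function names are hypothetical.

```python
def length_normalizer(doc_len, avg_doc_len, b=0.75):
    """Pivoted length normalizer: 1 - b + b * doc_len / avg_doc_len."""
    return 1.0 - b + b * doc_len / avg_doc_len

def bm25_tf(count, doc_len, avg_doc_len, k=1.2, b=0.75):
    """Length-normalized BM25 TF component for one matched term."""
    if count == 0:
        return 0.0
    return (k + 1) * count / (count + k * length_normalizer(doc_len, avg_doc_len, b))

def bm25_plus_tf(count, doc_len, avg_doc_len, k=1.2, b=0.75, delta=1.0):
    """Same component with a constant delta added for matched terms only."""
    if count == 0:
        return 0.0
    return bm25_tf(count, doc_len, avg_doc_len, k, b) + delta

# One match in a very long document versus no match at all.
print(bm25_tf(1, 1_000_000, 100), bm25_plus_tf(1, 1_000_000, 100))
print(bm25_tf(0, 100, 100), bm25_plus_tf(0, 100, 100))
```

Without the constant, a match in a very long document is weighted almost the same as no match; with it, a matched term always keeps at least delta.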
So to summarize all that we have
said about the Vector Space Model.
Here are the major takeaway points.
First, in such a model,
we use the similarity notion of relevance,
assuming that the relevance of
a document with respect to a query is
basically proportional to the similarity
between the query and the document.
So, naturally,
that implies that the query and
document must be represented in
the same way, and in this case,
we represent them as vectors in
high dimensional vector space.
Where the dimensions are defined by
words or concepts or terms in general.
And we generally need to use a lot of
heuristics to design a ranking function.
We use some examples which show
the need for several heuristics,
including TF weighting and transformation,
IDF weighting, and
document length normalization.
These major heuristics are the most
important heuristics to ensure that such
a general ranking function
works well for all kinds of tasks.
And finally BM25 and
pivoted normalization seem
to be the most effective
formulas out of the Vector Space Model.
Now I have to say that I've put BM25
in the category of Vector Space Model,
but in fact BM25 has
been derived using a probabilistic model.
So the reason why I've put it in
the vector space model is first
the ranking function actually has a nice
interpretation in the vector space model.
We can easily see it looks very
much like a vector space model
with a special weighting function.
The second reason is because the original
BM25 has a somewhat different form of IDF.
And that form of IDF actually
doesn't really work as
well as the standard IDF
that you have seen here.
So as an effective retrieval function,
BM25 should probably use a heuristic
modification of the IDF to make it
even more like a vector space model.
There are some additional readings.
The first is a paper about
the pivoted length normalization.
It's an excellent example of using
empirical data analysis to suggest a need
for length normalization, and then further
derive a length normalization formula.
The second is the original
paper where BM25 was proposed.
The third paper has
a thorough discussion of BM25 and
its extensions, particularly BM25F.
And finally, the last paper
has a discussion of improving
BM25 to correct the over-penalization
of long documents.
[MUSIC]

[SOUND] This lecture is about
the implementation of
text retrieval systems.
In this lecture, we will discuss how we
can implement a text retrieval method
to build a search engine.
The main challenge is to
manage a lot of text data and
to enable a query to be answered very
quickly and to respond to many queries.
This is a typical text
retrieval system architecture.
We can see the documents
are first processed by a tokenizer
to get tokenized units, for example words.
And then these words or
tokens would be processed by
an indexer that would create an index,
which is a data structure for the search
engine to use to quickly answer a query.
And the query will be going
through a similar processing step.
So, the tokenizer will be
applied to the query as well so
that the text can be
processed in the same way.
The same units will be
matched with each other.
And the query's representation
will then be given to the scorer,
which would use an index to
quickly answer a user's query by
scoring the documents and
then ranking them.
The results will be given to the user.
And then the user can look at the results
and provide some feedback, which can be
explicit judgements about which documents
are good and which documents are bad,
or implicit feedback such as clickthroughs, so the
user doesn't have to do anything extra.
The user will just look at the results and
skip some and
click on some results to view.
So these interaction signals can be used
by the system to improve the ranking
accuracy by assuming that viewed documents
are better than the skipped ones.
So, a search engine system then
can be divided into three parts.
The first part is the indexer, and
the second part is the scorer,
that responds to the user's query.
And the third part is
the feedback mechanism.
Now typically, the indexing is done in
an offline manner, so you can pre-process
the collected data and build the inverted
index, which we will introduce in a moment.
And this data structure can then be used
by the online module which is a scorer
to process a user's query dynamically and
quickly generate search results.
The feedback mechanism can be done online
or offline depending on the method.
The implementation of the indexer and
the scorer is fairly standard,
and this is the main topic of this
lecture and the next few lectures.
The feedback mechanism,
on the other hand has variations.
It depends on what method is used.
So that is usually done in
an algorithm-specific way.
Let's first talk about the tokenizer.
Tokenization is to normalize lexical
units into the same form so
that semantically similar words
can be matched with each other.
Now in a language like English,
stemming is often used, and
this will map all the inflectional
forms of words into the same root form.
So for example, computer, computation, and
computing can all be matched
to the root form compute.
This way, all these different forms of
computing can be matched with each other.
Normally this is a good idea to increase
the coverage of documents that
are matched with this query.
But it's also not always beneficial,
because sometimes the subtle
difference between computer and
computation might still suggest a
difference in the coverage of the content.
But in most cases,
stemming seems to be beneficial.
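To make the stemming idea concrete, here is a minimal sketch of a tokenizer that lowercases, splits on non-letters, and strips a few suffixes. The suffix list and the names `toy_stem` and `tokenize` are assumptions made up for illustration; this is not the Porter stemmer or any stemmer named in the lecture. Note that it maps the words to the stem "comput" rather than the dictionary word "compute", which is fine as long as the query and the documents are processed the same way.

```python
import re

# Hypothetical suffix list for illustration only; real stemmers
# use a much more careful set of rules.
SUFFIXES = ("ation", "ing", "er", "es", "s", "e")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tokenize(text):
    """Lowercase, split on non-letters, and stem each token."""
    return [toy_stem(tok) for tok in re.findall(r"[a-z]+", text.lower())]

# All four forms end up as the same unit, so they match each other.
print(tokenize("Computer, computation, computing, compute"))
```

A crude rule set like this will also merge words that should stay apart, which is the trade-off mentioned above.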
When we tokenize the text in some other
languages, for example Chinese, we might
face some special challenges in segmenting
the text to find the word boundaries.
Because it's not
obvious where the boundary is, as
there's no space separating them.
So, here, of course,
we have to use some language-specific
natural language processing techniques.
Once we do tokenization, then we would
index the text documents, and that
will convert the documents into some data
structure that can enable fast search.
The basic idea is to precompute
as much as we can, basically.
So the most commonly used index
is called an inverted index.
And this has been used
in many search engines to
support basic search algorithms.
Sometimes other indices, for
example a document index,
might be needed in order to support
feedback.
Like I said, these kinds of
techniques are not really standard,
in that they vary a lot according
to the feedback methods.
To understand why we
are using an inverted index,
it will be useful for you to think
about how you would respond to
a single term query quickly.
So if you want to use more time to
think about that, pause the video.
So think about how you can
preprocess the text data so
that you can quickly respond
to a query with just one word.
Well, if you have thought about that question,
you might realize that the best way is
to simply create a list of documents
that match each term in the vocabulary.
In this way, you can basically
pre-construct the answers.
So when you see a term,
you can simply just fetch
the ranked list of documents for
that term and return the list to the user.
So that's the fastest way to
respond to single term query.
Now the idea of the inverted index is
actually basically like that.
We can pre-construct such an index
that would allow us to quickly find
all the documents that
match a particular term.
So let's take a look at this example.
We have three documents here, and
these are the documents that you
have seen in some previous lectures.
Suppose we want to create an inverted index for
these documents, then we will
need to maintain a dictionary.
In the dictionary we'll have one entry for
each term.
And we're going to store some
basic statistics about the term.
For example, the number of
documents that match the term, or
the total
frequency of the term,
which means we would count
duplicated occurrences of the term.
And so, for example, news.
This term occurred in
all the three documents.
So the count of documents is three.
And you might also realize we needed
this count of documents or document
frequency for computing some statistics
to be used in the vector space model.
Can you think of that?
So, what weighting heuristic
would need this count?
Well, that's the IDF, right,
inverse document frequency.
So IDF is a property of the term,
and we can compute it right here.
So with the document count here,
it's easy to compute the IDF either at
this time, when we build the index,
or at running time when we see a query.
Now in addition to these
basic statistics we also
store all the documents that matched news.
And these entries are stored
in a file called Postings.
So in this case it matched 3 documents and
we store information about
these 3 documents here.
This is the document id,
document 1, and the frequency is 1.
The TF is 1 for news.
In the second document it's also 1, etc.
So from this list we can get all
the documents that match the term news.
And we can also know the frequency
of news in these documents.
So, if the query has just one word,
news,
we can easily look up in this
table to find the entry and
go quickly to the postings to fetch
all the documents that match news.
So, let's take a look at another term.
Now this time let's take a look
at the word presidential.
All right, this word occurred
in only 1 document, document 3.
So, the document frequency is 1, but
it occurred twice in this document.
And so the frequency count is 2, and
the frequency count is used
in some other retrieval methods
where we might use the frequency
to assess the popularity of
a term in the collection.
And similarly, we'll have a pointer
to the postings, right here.
And in this case there is
only one entry here because
the term occurred in just one document.
And that's here.
The document id is 3,
and it occurred twice.
So this is the basic
idea of inverted index.
It's actually pretty simple, right?
With this structure we can easily fetch
all the documents that match a term.
And this will be the basis for
scoring documents for a query.
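The dictionary and postings just described can be sketched in a few lines. The three documents below are hypothetical stand-ins for the ones on the slide (the transcript does not reproduce them); they are only chosen so that "news" occurs in all three and "presidential" occurs twice in document 3, as in the walk-through. The field names `df` and `cf` are also assumptions.

```python
from collections import Counter, defaultdict

# Hypothetical stand-ins for the three documents on the slide.
docs = {
    1: "news about",
    2: "news about organic food campaign",
    3: "news of presidential campaign presidential candidate",
}

def build_inverted_index(docs):
    """Return (dictionary, postings).

    dictionary: term -> {"df": document frequency, "cf": total count}
    postings:   term -> list of (doc_id, term frequency), sorted by doc_id
    """
    postings = defaultdict(list)
    for doc_id in sorted(docs):
        counts = Counter(docs[doc_id].split())
        for term, tf in counts.items():
            postings[term].append((doc_id, tf))
    dictionary = {
        term: {"df": len(plist), "cf": sum(tf for _, tf in plist)}
        for term, plist in postings.items()
    }
    return dictionary, dict(postings)

dictionary, postings = build_inverted_index(docs)
print(dictionary["news"], postings["news"])
print(dictionary["presidential"], postings["presidential"])
```

Answering a single-term query is then one dictionary lookup followed by reading that term's postings list.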
Now sometimes we also want to store
the positions of these terms.
So, in many of these
cases the term occurred
just once in the document so there's only
one position, for example in this case.
But in this case the term occurred
twice so it would store two positions.
Now the position information is
very useful for checking whether
the matching of query terms is actually
within a small window of, let's say,
five words, or ten words,
or whether the matching of
the two query terms
is in fact a phrase of two words.
This can all be checked quickly by
using the position information.
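As a rough illustration of how stored positions support these checks, the sketch below keeps a list of positions per term and document and then tests for a window match and for a two-word phrase. The document text and the function names are assumptions for illustration only.

```python
from collections import defaultdict

def build_positional_index(docs):
    """term -> {doc_id: [positions of the term in that document]}"""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def within_window(index, term_a, term_b, doc_id, window):
    """True if some occurrences of the two terms are at most `window` apart."""
    pos_a = index.get(term_a, {}).get(doc_id, [])
    pos_b = index.get(term_b, {}).get(doc_id, [])
    return any(abs(a - b) <= window for a in pos_a for b in pos_b)

def is_phrase(index, first, second, doc_id):
    """True if `second` occurs immediately after `first` in the document."""
    pos_a = index.get(first, {}).get(doc_id, [])
    pos_b = set(index.get(second, {}).get(doc_id, []))
    return any(a + 1 in pos_b for a in pos_a)

# Hypothetical document; "presidential" occurs twice, so it has two positions.
pos_docs = {3: "news of presidential campaign presidential candidate"}
pos_index = build_positional_index(pos_docs)
print(pos_index["presidential"][3])
print(is_phrase(pos_index, "presidential", "campaign", 3))
```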
So why is the inverted index good for
fast search?
Well, we just talked about the possibility
of using it to answer
a single-word query.
And that's very easy.
What about multiple-term queries?
Well, let's look at
some special cases of the Boolean query.
A Boolean query is basically
a Boolean expression, like this.
So I want the relevant document
to match both term A AND term B.
All right, so
that's one conjunctive query.
Or, I want the relevant documents
to match term A OR term B.
That's a disjunctive query.
Now how can we answer such
a query by using an inverted index?
Well, if you think a bit about it,
it would be obvious,
because we simply have to fetch all
the documents that match term A and
also fetch all the documents
that match term B.
And then just take the intersection
to answer a query like A AND B,
or take the union to
answer the query A OR B.
So this is all very easy to answer.
It's going to be very quick.
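Here is a minimal sketch of the two Boolean cases, assuming each term's postings have already been reduced to a sorted list of document IDs. The intersection walks both lists in step, which relies on that sorted order; the function names and the example IDs are made up for illustration.

```python
def intersect(postings_a, postings_b):
    """Doc IDs present in both sorted lists (answers 'A AND B')."""
    result, i, j = [], 0, 0
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            result.append(postings_a[i])
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1
        else:
            j += 1
    return result

def union(postings_a, postings_b):
    """Doc IDs present in either sorted list (answers 'A OR B')."""
    return sorted(set(postings_a) | set(postings_b))

# Hypothetical postings for term A and term B.
print(intersect([1, 2, 4, 7], [2, 3, 4, 8]))
print(union([1, 2, 4, 7], [2, 3, 4, 8]))
```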
Now what about the multi-term
keyword query?
We talked about the vector space model for
example.
And we would match such a query with
a document and generate a score.
And the score is based on
aggregated term weights.
So in this case it's not a Boolean query,
but
the scoring can be actually
done in a similar way.
Basically it's similar to
a disjunctive Boolean query.
Basically it's like A OR B.
We take the union of all the documents
that matched at least one query term,
and then we would
aggregate the term weights.
So this is the basic idea of
using an inverted index for
scoring documents in general.
And we're going to talk about
this in more detail later.
But for now,
let's just look at the question,
why is the inverted index a good idea?
Basically, why is it more efficient than
sequentially just scanning documents?
Right?
This is the obvious approach.
You can just compute the score for
each document, and
then you can score them,
sorry, you can then sort them.
This is a straightforward method.
But this is going to be very slow.
Imagine the web.
It has a lot of documents.
If you do this, then it will take
a long time to answer your query.
So the question now is, why would
the inverted index be much faster?
Well it has to do with
the word distribution in text.
So, here are some common phenomena
of word distribution in text.
There are some language-independent
patterns that seem to be stable.
And these patterns are basically
characterized by the following.
A few words, common words
like the, a, or we, occur very,
very frequently in text.
So they account for
a large percent of occurrences of words.
But most words would occur just rarely.
There are many words that occur just once,
let's say, in a document,
or once in the collection.
And there are many such single terms.
It's also true that the most
frequent words in one corpus
may actually be rare in another.
That means, although the general
phenomenon is applicable or
is observed in many cases,
the exact words that are common
may vary from context to context.
So this phenomenon is characterized
by what's called Zipf's Law.
This law says that the rank
of a word multiplied by
the frequency of the word
is roughly constant.
So formally, if we use F of
w to denote the frequency and
r of w to denote the rank of a word,
then this is the formula.
It basically says the same thing,
just in mathematical terms, where C is
basically a constant.
And there is also a
parameter alpha that might
be adjusted to better fit
any empirical observations.
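The slide with the formula is not part of this transcript. Written out from the verbal description, and assuming the usual form of Zipf's law in which alpha is an exponent on the rank, it would read as follows, with F(w) the frequency of word w, r(w) its rank, and C a constant:

```latex
F(w) = \frac{C}{r(w)^{\alpha}},
\qquad \text{so with } \alpha \approx 1:\quad r(w)\cdot F(w) \approx C
```

With alpha close to 1 this is exactly the statement that rank times frequency is roughly constant.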
So if I plot the word
frequencies in sorted order,
then you can see this more easily.
The x-axis is basically the word rank.
And this is r of w.
And the y-axis is the word frequency,
or F of w.
Now, this curve basically shows
that the product of the two
is roughly a constant.
Now, if you look at these words, we can see
they can be separated into three groups.
In the middle are
the intermediate frequency words.
These words tend to occur in
quite a few documents, right?
But they're not like those
most frequent words.
And they are also not very rare.
So they tend to be often used
in queries.
And these intermediate
frequency words also tend to have
high TF-IDF weights.
But if you look at the left
part of the curve.
These are the highest frequency words.
They occur very frequently.
They are usually stop words,
the, we, of, et cetera.
Those words are very, very frequent.
They are, in fact,
too frequent to be discriminative.
And they generally are not very
useful for retrieval.
So, they are often removed, and
this is called stop word removal.
So you can use pretty much just the count
of words in the collection to kind
of infer what words might be stop words.
Those are basically
the highest frequency words.
And they also occupy a lot of
space in the inverted index.
You can imagine the posting entries for
such a word would be very long.
And therefore,
if you can remove such words,
you can save a lot of
space in the inverted index.
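The idea of using raw collection counts to suggest stop words can be sketched as below. The function name, the cutoff `k`, and the tiny collection are assumptions for illustration; in practice the candidate list would still be reviewed by hand.

```python
from collections import Counter

def frequent_words(documents, k):
    """Return the k most frequent words in the collection
    as stop-word candidates."""
    counts = Counter(
        word for text in documents for word in text.lower().split()
    )
    return [word for word, _ in counts.most_common(k)]

# Hypothetical collection.
collection = ["the cat and the dog", "the bird saw the cat"]
print(frequent_words(collection, 2))
```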
We also show the tail part,
which has a lot of rare words.
Those words don't occur very frequently,
and there are many such words.
Those words are actually very useful for
search
also, if a user happens to be
interested in such a topic.
But because they're rare, it's
often true that users
aren't necessarily
interested in those words.
But retaining them would allow us to
match such a document accurately,
and they generally have very high IDFs.
So what kind of data structures should
we use to store the inverted index?
Well, it has two parts, right?
If you recall we have a dictionary,
and we also have postings.
The dictionary has modest size,
although for
the web, it still would be very large.
But compared with postings, it's modest.
And we also need to have fast,
random access to the entries
because we want to look up
the query term very quickly.
So, therefore, we prefer to keep such
a dictionary in memory if it's possible.
If the collection is not
very large, this is feasible.
But if the collection is very large,
then it's in general not possible.
If the vocabulary size is very large,
obviously we can't do that.
But in general, that's our goal.
So the data structures
that we often use for
storing the dictionary would be direct access
data structures, like a hash table, or
a B-tree if we can't store everything
in memory and have to use disk.
We try to build a structure that
would allow us to quickly look up
entries.
Right.
For postings, they're huge, as you can see.
And in general, we don't have to have
direct access to a specific entry.
We generally would just look up
a sequence of document IDs and
frequencies for all of the documents
that match a query term.
So we would read those
entries sequentially.
And therefore,
because they're large, we generally
have to store postings on disk,
so they have to stay on disk.
And they would contain information such
as document IDs, term frequencies, or
term positions, et cetera.
Now because they're very large,
compression is often desirable.
Now this is not only
to save disk space,
which is of course
one benefit of compression.
It's not going to occupy that much space.
But it's also to help improve speed.
Can you see why?
Well, we know that input and
output will cost a lot of time in
comparison with the time taken by CPU.
So CPU is much faster.
But IO takes time.
And so by compressing the inverted index,
the posting files will become smaller.
And the entries that we
have to read into memory
to process a query
would be smaller.
And so we can reduce
the amount of I/O traffic.
And that can save a lot of time.
Of course, we have to then do
more processing of the data
when we uncompress
the data in the memory.
But as I said, CPU is fast, so
overall, we can still save time.
So compression here is both
to save disk space and
to speed up the loading
of the inverted index.
[MUSIC]

[SOUND].
This lecture is about
the Inverted Index Construction.
In this lecture, we will continue
the discussion of system implementation.
In particular, we're going to discuss
how to construct the inverted index.
The construction of the inverted index
is actually very easy if the data set is
very small.
It's very easy to construct a dictionary
and then store the postings in a file.
The problem is that when our data
is not able to fit into memory,
then we have to use some
special method to deal with it.
And unfortunately, in most retrieval
applications, the data set would be large and
generally cannot be
loaded into memory at once.
There are many approaches
to solving that problem, and
the sorting-based method is quite common and
works in four steps as shown here.
First, we collect the local termID,
document ID, and frequency tuples.
Basically, you collect counts of terms
in a small set of documents, and
then, once you collect those counts, you
can sort those counts based on terms so
that you build a local,
partial inverted index.
And these are called runs.
And then, you write them into
a temporary file on the disk.
And then, in step three, you do
pair-wise merging of these runs, until
you eventually merge all the runs
and generate a single inverted index.
So this is an illustration of this method.
On the left, you see some documents.
And on the right, we show a term
lexicon and a document ID lexicon.
And these lexicons are to map string-based
representations of document IDs or
terms into integer representations,
or map back from
integers to the string representation.
And the reason why we are interested
in using integers to represent these IDs
is because
integers are often easier to handle.
For example,
integers can be used as an index for
an array, and they are also easy to compress.
So this is one reason why we
tend to map these strings
into integers, so that we don't
have to carry these strings around.
So how does this approach work?
Well, it's very simple.
We're going to scan these
documents sequentially, and
then parse the documents and
count the frequencies of terms.
And in this stage, the entries are generally
sorted by document IDs because we
process the documents sequentially.
So, we first encounter all the terms in,
the first document.
Therefore, the document IDs
are all 1s in this stage.
And this would be
followed by document ID 2.
And they're naturally sorted in this
order just because we process the data in
this order.
At some point,
we will run out of memory and
we would have
to write them to the disk.
But before we do that,
we're going to sort them, just
using whatever memory we have.
We can sort them, and
this time,
we're going to sort based on term IDs.
Note that here, we're using
the term IDs as a key to sort.
So, all the entries that share the same
term would be grouped together.
In this case,
we can see all the IDs
of documents that match term
one would be grouped together.
And we're going to write this to
the disk as a temporary file.
And that would allow us to use the memory
to process the next batch of documents,
and we're going to do that for
all the documents.
So we're going to write a lot of
temporary files into the disk.
And then,
the next stage is to do merge sort.
Basically, we're going to
merge them and sort them.
Eventually, we will get a single
inverted index where
the entries are sorted
based on term IDs.
And on the top,
we can see these are the entries for
the documents that match term ID 1.
So this is basically how we can do
the construction of the inverted index,
even though the data
cannot all be loaded into memory.
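The sort-based construction can be sketched in a small, in-memory form. Several things here are assumptions made to keep the sketch short: the runs are kept in a Python list instead of being written to temporary files, the batch size stands in for "running out of memory", and the final step uses `heapq.merge` for a multi-way merge rather than the pair-wise merging described above. The names `make_runs` and `merge_runs` and the toy documents are made up.

```python
import heapq
from collections import Counter

def make_runs(docs, term_ids, batch_size):
    """Parse documents in batches; each batch becomes one sorted run.

    A run is a list of (term_id, doc_id, tf) tuples sorted by term_id,
    then doc_id. `term_ids` is the term lexicon (string -> integer ID),
    filled in as new terms are seen.
    """
    runs, batch = [], []
    for n, (doc_id, text) in enumerate(sorted(docs.items()), start=1):
        for term, tf in Counter(text.split()).items():
            term_id = term_ids.setdefault(term, len(term_ids) + 1)
            batch.append((term_id, doc_id, tf))
        if n % batch_size == 0:      # pretend memory is full
            runs.append(sorted(batch))
            batch = []
    if batch:
        runs.append(sorted(batch))
    return runs

def merge_runs(runs):
    """Merge sorted runs into one index: term_id -> [(doc_id, tf), ...]."""
    index = {}
    for term_id, doc_id, tf in heapq.merge(*runs):
        index.setdefault(term_id, []).append((doc_id, tf))
    return index

# Hypothetical documents; two documents per run.
toy_docs = {1: "a b a", 2: "b c", 3: "a c c"}
term_lexicon = {}
toy_runs = make_runs(toy_docs, term_lexicon, batch_size=2)
toy_index = merge_runs(toy_runs)
print(term_lexicon)
print(toy_index)
```

Because every run is already sorted by term ID and then document ID, the merge only ever needs to look at the front of each run, which is what makes the on-disk version practical.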
Now, we mentioned earlier that
because the postings are very large,
it's desirable to compress them.
So let's now talk a little bit about
how we compress the inverted index.
Well, the idea of compression, in general,
is to leverage skewed
distributions of values.
And we generally have to use variable-length
encoding instead of the fixed-length
encoding that we use
by default in a programming language like C++.
And so, how can we leverage the skewed
distributions of values to
compress these values?
Well, in general, we would use fewer
bits to encode the frequent values,
at the cost of using more bits to
encode the rare values.
So in our case, let's think about how
we can compress the tf, term frequency.
If you can picture what the inverted
index would look like and
you'll see in postings there are a lot of,
term frequencies.
Those are the frequencies of terms,
in all those documents.
Now, if you think about it, what
kind of values are most frequent there?
You probably will be able to guess
that small numbers tend to occur
far more frequently than large numbers.
Why?
Well, think about
the distribution of words.
This is due to Zipf's law:
many words occur just rarely.
So we see a lot of small numbers,
therefore, we can use fewer bits for
the small, but highly frequent integers,
and at the cost of using more bits for
large integers.
This is a trade-off, of course.
If the values are distributed uniformly,
then this won't save us any space.
But because we tend to see many
small values, they're very frequent.
We can save on average
even though sometimes,
when we see a large number we
have to use a lot of bits.
What about the document IDs
that we also saw in postings?
Well, they are not
distributed in a skewed way, right?
So, how can we deal with that?
Well, it turns out you can
use a trick called the d-gap,
and that is to store
the difference of these document IDs.
And we can imagine, if a term
has matched many documents,
then there will be a long
list of document IDs.
So when we take the
difference between adjacent document IDs,
those gaps will be small.
So we'll again see a lot of small numbers,
whereas,
if a term occurred in only a few
documents, then the gap would be large.
The larger numbers will not be frequent,
so this creates some skewed distribution
that would allow us
to compress these values.
This is also possible because in order to
uncover or uncompress these document IDs,
we have to sequentially process the data
because we stored the difference.
And in order to recover
the exact document ID,
we have to first recover the previous
document ID, and then, we can add
the difference to the previous document ID
to restore the current document ID.
Now, this was possible because we
only needed to have sequential
access to those document IDs.
Once we look up a term we fetch all
the document IDs that match the term,
then we sequentially process them.
So it's very natural, and that's why this
trick actually works.
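A minimal sketch of the d-gap trick, assuming the document IDs in a postings list are sorted in ascending order and that the first ID is stored as its gap from 0. The function names and the example list are assumptions for illustration.

```python
def to_gaps(doc_ids):
    """Replace each doc ID (sorted ascending) by its gap from the previous one."""
    gaps, previous = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps

def from_gaps(gaps):
    """Recover doc IDs by adding each gap to the previously recovered ID."""
    doc_ids, previous = [], 0
    for gap in gaps:
        previous += gap
        doc_ids.append(previous)
    return doc_ids

# Hypothetical postings list of a frequent term: the gaps are small.
print(to_gaps([3, 7, 8, 15, 16]))
print(from_gaps(to_gaps([3, 7, 8, 15, 16])))
```

Decoding has to run from the start of the list, which is acceptable here because postings are read sequentially anyway.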
And there are many different methods for
encoding.
Binary code is a commonly used code in
just about any programming
language, where we basically use
fixed-length encoding.
Unary code, gamma code, and
delta code are all possibilities, and
there are many other possibilities.
So let's look at some
of them in more detail.
Binary code is really
equal-length encoding.
And that's appropriate for
randomly distributed values.
Unary coding is a variable-length
encoding [INAUDIBLE].
In this case, an integer x that is
at least 1
would be encoded as x minus 1
one-bits followed by a 0.
So for example, 3 would be encoded
as two 1s followed by a 0,
whereas 5 would be encoded as
four 1s followed by a 0, et cetera.
So now you can imagine how
many bits we have to use for
a large number like 100.
So, how many bits do we have to use
exactly for a number like 100?
Well, exactly, we have to use 100 bits.
So it's the same number of
bits as the value of this number.
So, this is very inefficient
if you are likely to
see some large numbers.
Imagine if you occasionally see a number
like 1000: you have to use 1000 bits.
So, this only works well if you
are absolutely sure that there would be no
large numbers,
and the most frequent values
are very small numbers.
Now, how do you decode this code?
Since these are variable-length
encoding methods,
you can't just count how many bits and
then just stop.
Right?
You can't say eight bits or 32 bits,
then you will start another code.
They are variable length, so
you have to rely on some mechanism.
In this case for unary, you can see
it's very easy to see the boundary.
Now you can easily see 0 would
signal the end of encoding.
So you just count how many 1s you
have seen, and when you hit the 0,
you know you have finished one number,
and you start another number.
Now we just saw that unary code is too
aggressive in rewarding small numbers.
And if you occasionally see a very
big number, it will be a disaster.
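As a small sketch of unary coding as described here (x minus 1 ones followed by a 0), the code below encodes single integers and decodes a stream of concatenated codes by treating each 0 as the end of a number. Bits are represented as characters in a string purely for readability; that representation and the function names are assumptions.

```python
def unary_encode(x):
    """Encode an integer x >= 1 as (x - 1) one-bits followed by a zero."""
    if x < 1:
        raise ValueError("unary code is defined here for x >= 1")
    return "1" * (x - 1) + "0"

def unary_decode_stream(bits):
    """Decode a concatenation of unary codes; each 0 ends one number."""
    numbers, ones = [], 0
    for bit in bits:
        if bit == "1":
            ones += 1
        else:
            numbers.append(ones + 1)
            ones = 0
    return numbers

print(unary_encode(3), unary_encode(5))
print(len(unary_encode(100)))   # as many bits as the value itself
print(unary_decode_stream(unary_encode(3) + unary_encode(5) + unary_encode(1)))
```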
So what about some other
less aggressive method?
Well, gamma coding is one of them.
And in this method, we
use unary coding for
a transformed form of the value.
So it's 1 plus the floor of log of x.
So the magnitude of this value is
much lower than the original x.
So that's why we can afford
using unary code for that.
So first we have the unary
code for coding this log of x.
And this will be followed by
a uniform code or binary code;
uniform code and binary code
are basically the same.
And we're going to use this code to code
the remaining part of the value of x.
And this is, precisely,
x minus 2 to the floor of log of x.
So the unary code basically codes
the floor of log of x, well,
I added one there.
But for the remaining part,
we use uniform
code to actually code
the difference between x
and this 2 to the floor of log of x.
And it's easy to show that for
this difference,
we only need to use up to
floor of log of x bits.
And this is easy to understand:
if the difference is too large, then we
would have a higher floor of log of x.
So, here are some examples.
For example, 3 is encoded as 101.
The first two digits are the unary code.
Right.
So, this is for the value 2.
Right.
10 encodes 2 in unary coding.
And so, that means
the floor of log of x is 1,
because we actually use unary code
to encode 1 plus the floor of log of x.
Since this is 2, then we know that
the floor of log of x is actually 1.
So but,
3 is still larger than 2 to the 1, so
the difference is 1, and
that 1 is encoded here at the end.
So that's why we have 101 for 3.
Now, similarly 5 is encoded
as 110 followed by 01.
And in this case,
the unary code encodes 3.
So, this is the unary code, 110, and
so the floor of log of x is 2.
And that means, we will compute
the difference between 5 and
the 2 to the 2, and that's 1, and
so we now have again 1 at the end.
But this time, we're going to
use two bits, because with this
level of floor of log of x,
we could have more numbers: 5, 6, 7.
They would all share the same prefix here,
110.
So, in order to differentiate them,
we have to use two bits
in the end.
So you can imagine 6 would be 10 here
in the end instead of 01, after 110.
It's also true that
a gamma code always has
an odd number of bits,
and in the center, there is a 0.
That's the end of the unary code.
And on the left
side of this 0, there will be all 1s.
And on the right side of this 0,
it's binary coding or uniform coding.
So how can you decode such a code?
Well, you again first do unary decoding,
right?
Once you hit 0,
you know you have got the unary code.
And this also will tell you how many
bits you have to read further to
decode the uniform code.
So this is how you can
decode a gamma code.
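The gamma code as described above can be sketched as follows, taking the logarithm to be base 2 and representing bits as characters in a string (both are assumptions of this sketch, as are the function names). The decoder assumes a well-formed input stream.

```python
def gamma_encode(x):
    """Gamma code: unary code of 1 + floor(log2 x), followed by
    x - 2**floor(log2 x) written in floor(log2 x) binary digits."""
    if x < 1:
        raise ValueError("gamma code is defined here for x >= 1")
    k = x.bit_length() - 1              # floor(log2 x)
    prefix = "1" * k + "0"              # unary code of k + 1
    suffix = format(x - (1 << k), "b").zfill(k) if k > 0 else ""
    return prefix + suffix

def gamma_decode_stream(bits):
    """Decode a concatenation of gamma codes (assumes well-formed input)."""
    numbers, i = [], 0
    while i < len(bits):
        k = 0
        while bits[i] == "1":           # read the unary prefix
            k += 1
            i += 1
        i += 1                          # skip the 0 that ends the prefix
        remainder = int(bits[i:i + k], 2) if k > 0 else 0
        numbers.append((1 << k) + remainder)
        i += k
    return numbers

print(gamma_encode(3), gamma_encode(5), gamma_encode(6))
print(gamma_decode_stream(gamma_encode(3) + gamma_encode(5) + gamma_encode(1)))
```

The three encodings printed first match the examples in the lecture: 101 for 3, 110 followed by 01 for 5, and 110 followed by 10 for 6.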
There is also delta code, but
that's basically the same as gamma code,
except that you replace the unary
prefix with the gamma code.
So that's even less
aggressive than gamma code
in terms of rewarding the small integers.
So that means it's okay if you
occasionally see a large number.
It's, you know,
it's okay with delta code.
It's also fine with gamma code.
It's really a big loss for unary code,
and they are all operating,
of course, at different degrees of
favoring small integers.
And that also means they would be
appropriate for a certain distribution.
But none of them is perfect for
all distributions.
And which method works
the best would have to depend on
the actual distribution in your data set.
For inverted index compression,
people have found that gamma
coding seems to work well.
So how do we uncompress the inverted index?
We just talked about this.
First, you decode those encoded integers.
And we just discussed how we
decode unary coding and gamma coding.
So I won't repeat.
What about the document IDs that
might be compressed using d-gap?
Well, we're going to do
sequential decoding.
So suppose the encoded ID list is x1,
x2, x3, et cetera.
We first decode x1 to obtain
the first document ID, ID1.
Then, we will decode x2,
which is actually the difference between
the second ID and the first one.
So we have to add the decoded value
of x2 to ID1 to recover the value
of
the ID at the second position, right.
So this is where you can see the advantage
of converting document IDs into integers.
And that allows us to do this
kind of compression, and
we just repeat until we
decode all the documents.
Every time we use the document
ID in the previous position
to help recover the document
ID in the next position.
[MUSIC]

[SOUND].
This lecture is about how to do fast
search by using inverted index.
In this lecture,
we are going to continue the discussion
of the system implementation.
In particular, we're going to talk about
how to support a faster
search by using inverted index.
So, let's think about what a general
scoring function might look like.
Now, of course the vector space
model is a special case of this.
But we can imagine many other
retrieval functions of the same form.
So, the form of this
function is as follows.
We see the scoring
function of document d and
query q is defined as, first, a function
f sub a, that's an adjustment function,
that would consider two
factors that are shown
here at the end, f sub d of d
and f sub q of q.
These are adjustment factors
of a document and query, so
they're at the level of document
and query.
So, and
then inside of this function we also see
there's another function called h.
So, this is the main part of
the scoring function,
and these, as I just said,
are the scoring factors at the level
of the whole document and the query,
for example, document length. And
this aggregate function would
then combine all these.
Now, inside this h function,
there are functions that would compute
the weights of the contribution
of a matched query term t sub i.
So this function g gives us
the weight of a matched query
term t sub i in document d.
And this h function would
aggregate all these weights, so
it would, for example, take a sum
over all the matched query terms.
But it can also be a product, or
could be another way of aggregating them.
And then finally, this adjustment
function would then consider
the document level or query level
factors to further adjust the score,
for example, document length [INAUDIBLE].
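The slide is not part of this transcript, but written out from the verbal description, and assuming the query q has k matched terms t_1 through t_k, the general form would look like this:

```latex
f(q, d) = f_a\big(\, h\big(g(t_1, d, q), \ldots, g(t_k, d, q)\big),\; f_d(d),\; f_q(q) \,\big)
```

Here g is the weight of one matched term, h aggregates those weights (for example as a sum), and f_a applies the document-level and query-level adjustments f_d(d) and f_q(q).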
So, this general form would cover
many state-of-the-art retrieval functions.
Let's look at how we can
score documents with such
a function using inverted index.
So here's the general algorithm
that works as follows.
First, these query level and
document level factors can be
pre-computed at indexing time.
Of course, for the query,
we have to compute it at query time.
But for the document, for example,
document length can be pre-computed.
And then we maintain a score accumulator
for each document d to compute the h.
And h is aggregation function
of all the matching query terms.
So how do we do that?
Well, for each query term,
we're going to fetch the inverted list
from the inverted index.
This will give us all the documents
that match this query term,
and that includes d1,
f1, and so on, up to dn and fn.
So each pair is document id and
the frequency of the term in the document.
Then for each entry d sub j and f sub j,
a particular match of the term in
this particular document d sub j,
we're going to compute the function g.
That would give us something like
the TF-IDF weight of this term.
So, we're computing the weight
contribution of matching this query term
in this document.
And then we're going to update the score
accumulator for this document.
We add this weight
to the accumulator, and
that would incrementally
compute function h.
So this is basically a general
way to compute
all functions of this form
by using inverted index.
Note that we don't have to
touch any document that
didn't match any query term,
and this is why it's fast.
We only need to process the documents
that match at least one query term.
In the end, we're going to
adjust the score to compute
this function f sub a, and then we can sort.
So let's take a look at
a specific example.
In this case, let's assume the scoring
function is a very simple one.
It just takes the sum of TF, the raw
TF, the count of a term in the document.
Now this simple function will help
show the algorithm clearly.
It's very easy to extend
the computation to include other weights
like the transformation of TF,
document length normalization, or IDF weighting.
So let's take a look at a specific example
where the query is information security,
and the slide shows some entries of
the inverted index on the right side.
Information occurs in four documents, and
the frequencies are also there;
security occurs in three documents.
So, let's see how the algorithm works,
all right?
So, first we iterate over all the query terms,
and we fetch the first query term.
What is that?
That's information.
Right?
So, imagine we have all these score
accumulators to store
the scores for these documents.
We can imagine they would all be allocated,
but
in practice they will only be
allocated as needed.
So before we do any weighting of terms
we don't even need any score accumulators.
But conceptually we have these score
accumulators eventually allocated, right?
So let's fetch
the entries from the inverted list for
information first, that's the first one.
So these score accumulators obviously
would be initialized as zeros.
So the first entry is d1 and 3;
3 is the number of occurrences of
information in this document.
Since our scoring function assumes that the
score is just a sum of these raw counts,
we just need to add 3 to the score
accumulator to account for
the increase of score due to matching
this term, information, in document d1.
And now we go to the next entry.
That's d2 and 4, and then we'll add
4 to the score accumulator of d2.
Of course, at this point we will allocate
the score accumulator as needed.
And so, at this point, we have allocated
d1 and d2, and the next one is d3.
And we add 1; we allocate another score
accumulator for d3 and add 1 to it.
And finally,
d4 gets a 5 because
the term information occurred
five times in this document.
Okay, so this completes the processing
of all the entries in the
inverted index for information.
We have processed all the contributions
of matching information in these
four documents.
So now our algorithm will go to the next
query term, that's security.
So, we're going to fetch all
the inverted index entries for security.
So in this case, there are three entries.
And we're going to go
through each of them.
The first is d2 and 3.
And that means security occurred
three times in d2, and what do we do?
Well, we do exactly the same as
what we did for information.
So this time we're going
to change the score
accumulator for d2, since
it's already allocated.
And what we do is add 3 to
the existing value, which is 4,
so we now get 7 for d2.
D2's score is increased because it
matches both information and security.
Go to the next entry, that's d4 and
1, so we update the score for
d4, and again we add 1 to d4,
so d4 goes from 5 to 6.
Finally we process d5 and 3.
Since we have not yet
allocated a score accumulator for d5,
at this point we allocate one
for d5, and we add 3 to it.
So, those scores on the last row
are the final scores for these documents,
if our scoring function is just
a simple sum of TF values.
Now what if we actually would like
to do length normalization?
Well, we can do the normalization
at this point for each document.
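
The worked example can be replayed in Python as a check on the numbers. The postings and the raw-TF scoring are taken from the walkthrough above; the function name raw_tf_scores and the document lengths used for the normalization step are assumptions made up for illustration.

```python
# Postings from the example: (document ID, term frequency).
inverted_index = {
    "information": [("d1", 3), ("d2", 4), ("d3", 1), ("d4", 5)],
    "security": [("d2", 3), ("d4", 1), ("d5", 3)],
}

def raw_tf_scores(query_terms, index):
    """Sum raw term frequencies, allocating accumulators only as needed."""
    accumulators = {}
    for term in query_terms:
        for doc_id, tf in index.get(term, []):
            # .get(doc_id, 0) plays the role of allocating a zeroed accumulator.
            accumulators[doc_id] = accumulators.get(doc_id, 0) + tf
    return accumulators

scores = raw_tf_scores(["information", "security"], inverted_index)
print(scores)  # {'d1': 3, 'd2': 7, 'd3': 1, 'd4': 6, 'd5': 3}

# Length normalization applied at the end, with hypothetical document lengths.
doc_length = {"d1": 100, "d2": 200, "d3": 50, "d4": 300, "d5": 100}
normalized = {doc_id: score / doc_length[doc_id] for doc_id, score in scores.items()}
```

The printed values match the last row of the walkthrough: d2 ends at 7 and d4 at 6 because both terms match them.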
So to summarize this,
you can see we first
processed the
query term information, and
we processed all the entries in
the inverted index for this term.
Then we processed security.
All right, let's think about
what should be the order of processing
here when we consider query terms.
It might make a difference,
especially if we don't want
to keep all the score accumulators.
Let's say we only want to keep
the most promising score accumulators.
What do you think would be
a good order to go through?
Would you process
a common term first, or
would you process a rare term first?
The answer is we
should process the rare term first.
A rare term will match fewer documents, and
then the score contribution will be higher,
because the IDF value will be higher,
and
then it allows us to touch
the most promising documents first.
So it helps prune some non-promising
ones, if we don't need so
many documents to be returned to the user.
And so those are heuristics for
further improving the accuracy.
Here we can also see how we can
incorporate IDF weighting.
All right.
So it can be
incorporated when we process each
query term.
When we fetch the inverted index we
can fetch the document frequency,
and then we can compute the IDF.
Or maybe the IDF value has already been
pre-computed when we indexed the documents.
At that time we already computed the IDF
value, so we can just fetch it.
So all these can be done at this time.
So that would mean when we process
all the entries for information,
these weights would be adjusted by the
same IDF, which is the IDF for information.
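
Both ideas, processing rare terms first and applying one IDF value per term, can be sketched together. The IDF formula log((M + 1) / df) is one common choice and is an assumption here, as are the collection size of 10 documents and the names idf and tfidf_scores.

```python
import math

index = {
    "information": [("d1", 3), ("d2", 4), ("d3", 1), ("d4", 5)],
    "security": [("d2", 3), ("d4", 1), ("d5", 3)],
}

def idf(doc_freq, num_docs):
    """One common IDF variant (an assumption): log((M + 1) / df)."""
    return math.log((num_docs + 1) / doc_freq)

def tfidf_scores(query_terms, index, num_docs):
    # Rare terms first: their inverted lists are shorter and their IDF is higher.
    ordered = sorted((t for t in query_terms if t in index),
                     key=lambda t: len(index[t]))
    accumulators = {}
    for term in ordered:
        postings = index[term]
        # The document frequency is the length of the inverted list, so the
        # same IDF value adjusts every entry of this term.
        term_idf = idf(len(postings), num_docs)
        for doc_id, tf in postings:
            accumulators[doc_id] = accumulators.get(doc_id, 0.0) + tf * term_idf
    return ordered, accumulators

order, weighted = tfidf_scores(["information", "security"], index, num_docs=10)
print(order)  # ['security', 'information']
```

In a real index the IDF (or the document frequency) would more likely be stored with the term than recomputed from the list length.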
So this is the basic idea of using
an inverted index for faster search, and
it works well for all kinds of formulas that
are of the general form, and
this general form actually covers most
state-of-the-art retrieval functions.
So there are some tricks to further
improve the efficiency. Some general
techniques include caching.
This is just to store some
results of popular queries, so
that next time when you see the same query
you simply return the stored results.
Similarly, you can also store the inverted
list in memory for
a popular term.
And if the query term is
popular, you would assume
you will fetch the inverted index for
the same term again.
So keeping that in the memory would help.
And these are general techniques for
improving efficiency.
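
Result caching can be sketched with Python's standard functools.lru_cache. This is an illustration only: the function cached_search, the counter search_calls, and the toy index are hypothetical, and a real system would also need to invalidate cached results when the index changes.

```python
from functools import lru_cache

INDEX = {
    "information": [("d1", 3), ("d2", 4), ("d3", 1), ("d4", 5)],
    "security": [("d2", 3), ("d4", 1), ("d5", 3)],
}
search_calls = {"count": 0}  # counts how often the index is actually scanned

@lru_cache(maxsize=1024)
def cached_search(query):
    """Score a query by raw TF; repeated queries are answered from the cache."""
    search_calls["count"] += 1
    scores = {}
    for term in query.split():
        for doc_id, tf in INDEX.get(term, ()):
            scores[doc_id] = scores.get(doc_id, 0) + tf
    # Return an immutable tuple so a cached result cannot be modified by callers.
    return tuple(sorted(scores.items(), key=lambda pair: pair[1], reverse=True))

first = cached_search("information security")
second = cached_search("information security")  # served from the cache
print(search_calls["count"])  # 1
```

The same pattern applies to keeping the inverted lists of popular terms in memory, with the term as the cache key instead of the query.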
We can also keep only the most promising
accumulators, because a user generally
doesn't want to examine so many documents.
We only want to return a high-quality
subset of documents that are likely ranked
on the top. For that purpose
we can then prune the accumulators.
We don't have to store
all the accumulators.
At some point we just keep
the highest value accumulators.
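
One very simple way to prune accumulators is to keep only the highest-valued ones after each query term. The sketch below is a simplified heuristic chosen for illustration, not a specific published pruning scheme; the names prune and pruned_scores and the limit of two accumulators are assumptions.

```python
import heapq

index = {
    "information": [("d1", 3), ("d2", 4), ("d3", 1), ("d4", 5)],
    "security": [("d2", 3), ("d4", 1), ("d5", 3)],
}

def prune(accumulators, max_accumulators):
    """Keep only the max_accumulators entries with the highest scores."""
    if len(accumulators) <= max_accumulators:
        return accumulators
    kept = heapq.nlargest(max_accumulators, accumulators.items(),
                          key=lambda pair: pair[1])
    return dict(kept)

def pruned_scores(query_terms, index, max_accumulators):
    accumulators = {}
    # Rare terms (short inverted lists) are processed first.
    terms = sorted((t for t in query_terms if t in index),
                   key=lambda t: len(index[t]))
    for term in terms:
        for doc_id, tf in index[term]:
            accumulators[doc_id] = accumulators.get(doc_id, 0) + tf
        accumulators = prune(accumulators, max_accumulators)
    return accumulators

top = pruned_scores(["information", "security"], index, max_accumulators=2)
print(top)  # {'d2': 7, 'd4': 5}
```

Note that d4 ends with 5 rather than its full score of 6, because its accumulator was dropped after the first term. Pruning trades some accuracy of the scores for memory.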
Another technique is to do parallel
processing, and that's needed for
really processing such a large data set,
like the web data set.
And to scale up to the Web scale
we need to have special techniques
to do parallel processing and
to distribute the storage of
files on multiple machines.
So here is a list of
some text retrieval toolkits.
It's not a complete list.
You can find more information
at this URL on the bottom.
Here I listed four.
Lucene is one of the most popular toolkits
that can support a lot of applications.
And it has very nice support for
applications.
You can use it to build
a search engine very quickly.
The downside is that it's not
that easy to extend it, and
the algorithms implemented there
are not the most advanced algorithms.
Lemur or Indri is another toolkit
that does not have such nice
support for applications as Lucene.
But it has many advanced
search algorithms.
And it's also easy to extend.
Terrier is yet another toolkit
that also has good support for
application capability and
some advanced algorithms.
So that's maybe in between Lemur
and Lucene, or
maybe rather combining the strengths of
both, so that's also a useful toolkit.
MeTA is the toolkit that we'll use for
the programming assignment,
and this is a new toolkit
that has a combination
of both text retrieval algorithms and
text mining algorithms.
And so, topic models are implemented, and
there are a number of text analysis
algorithms implemented in the toolkit,
as well as basic search algorithms.
So, to summarize all the discussion
about the system implementation,
here are the major take away points.
Inverted index is the primary data
structure for supporting a search engine.
That's the key to enable faster
response to a user's query.
And the basic idea is to
pre-process the data as much as we can,
and we want to do compression
when appropriate,
so that we can save disk space and
can speed up IO and
processing of the inverted
index in general.
We talked about how to construct
the inverted index when the data
can't fit into the memory.
And then we talked about faster search using
the inverted index, basically to exploit
the inverted index to accumulate scores
for documents matching a query term.
And we exploit Zipf's law
to avoid touching many documents
that don't match any query term.
And this algorithm can support
a wide range of ranking algorithms.
So these basic techniques have
great potential for further scaling
up using distributed file systems,
parallel processing, and caching.
Here are two additional readings that
you can take a look at if you have time,
and are interested in
learning more about this.
The first one is a classic textbook on
the efficiency of the inverted index and
the compression techniques,
and how to, in general,
build an efficient search engine in
terms of the space overhead and speed.
The second one is a newer textbook that
has a nice discussion of implementing and
evaluating search engines.
[MUSIC]

[SOUND] This lecture is about
evaluation of text retrieval systems.
In the previous lectures, we have talked
about a number of text retrieval methods.
Different kinds of ranking functions.
But how do we know which
one works the best?
In order to answer this question,
we have to compare them,
and that means we'll have to
evaluate these retrieval methods.
So this is the main topic of this lecture.
First, let's think about why
do we have to do evaluation?
I already gave one reason.
And that is,
we have to use evaluation to figure out
which retrieval method works better.
Now this is very important for
advancing our knowledge.
Otherwise we wouldn't know whether
a new idea works better than an old idea.
In the beginning of this
course we talked about
the problem of text retrieval, and we
compared it with database retrieval.
There, we mentioned that text retrieval
is an empirically defined problem.
So, evaluation must rely on users;
which system works better,
that would have to be judged by our users.
So this becomes a very challenging problem,
because how can we get users involved in
the evaluation, and
how can we draw a fair
comparison of different methods?
So just go back to the reasons for
evaluation.
I listed two reasons here.
The second reason is basically what I just
said, but there is also another reason,
which is to assess the actual
utility of a text retrieval system.
Now imagine you're building
your own applications.
You would be interested in knowing how well
your search engine works for your users.
So in this case measures must
reflect the utility to the actual
users in the real application.
And typically, this has been
done by using user studies and
using the real search engine.
In the second case, or for
the second reason, the measures
actually only need to be correlated
with the utility to actual users.
Thus they don't have to accurately
reflect the exact utility to users.
So the measure only needs to be good
enough to tell which method works better.
And this is usually done
through a test collection.
And this is the main idea that we'll
be talking about in this course.
This has been very important for
comparing different algorithms and
for improving search
engines systems in general.
So next we will talk
about what to measure.
There are many aspects of a search engine
we can measure, we can evaluate and
here I list the three major aspects.
One is effectiveness or accuracy,
how accurate are the search results?
In this case we're measuring a system's
capability of ranking relevant documents
on top of non relevant ones.
The second is efficiency.
How quickly can a user get the results?
How much computing resources
are needed to answer a query?
So in this case we need to measure
the space and time overhead of the system.
The third aspect is usability.
Basically the question is how useful
is the system for real user tasks?
Here, obviously, interfaces and
many other things are also important and
we typically would have
to do user studies.
Now, in this course, we're going to talk
mostly about the effectiveness and
accuracy measures, because
the efficiency and
usability dimensions are not really
unique to search engines, and so
they are needed for
evaluating any other software systems.
And there is also good coverage of
such materials in other courses.
But how to evaluate a search engine's
accuracy is
something unique to text retrieval, and
we're going to talk a lot about this.
The main idea that people have proposed
for using a test set to evaluate
a text retrieval algorithm is called
the Cranfield Evaluation Methodology.
This one actually was developed a long
time ago, in the 1960s.
It's a methodology for laboratory testing
of system components. It's actually
a methodology that has been very useful,
not just for search engine evaluation,
but also for evaluating virtually
all kinds of empirical tasks.
And, for example in processing or
in other fields where the problem
is empirically defined, we typically would
need to use such a methodology.
And today, with the big data challenge and
the use of machine learning everywhere,
this methodology has been very
popular, but it was first developed for
search engine applications in the 1960s.
So the basic idea of this approach is
to build reusable test collections
and define measures.
Once such a test collection is
built, it can be used again and
again to test different algorithms.
And we're going to define measures
that would allow you to quantify
the performance of a system or
an algorithm.
So how exactly would this work?
Well, we're going to
have a sample collection of documents, and
this is just similar to a real document
collection in your search application.
We can also have a sample
set of queries or topics.
This is to simulate the user's queries.
Then we'll have to have
relevance judgments.
These are judgments of which documents
should be returned for which queries.
Ideally, they have to be made by
users who formulated the queries,
because those are the people that know
exactly what documents would be useful.
And then finally we have to have measures
to quantify how well a system's result
matches the ideal ranked list
that would be constructed
based on users' relevance judgments.
So this methodology is very useful for
studying retrieval
algorithms because the test collection
can be reused many times.
And it will also provide a fair
comparison for all the methods.
We have the same criteria and the
same data set to use
to compare different algorithms.
This allows us to compare a new
algorithm with an old algorithm
that was invented many years ago,
by using the same standard.
So this is the illustration
of how this works.
As I said,
we need queries, which are shown here.
We have Q1, Q2, et cetera.
We also need documents, and
that's called the document collection,
and on the right side,
you see we need relevance judgments.
These are basically the binary judgments
of documents with respect to a query.
So, for example, D1 is judged
as being relevant to Q1,
D2 is judged as being relevant as well,
and D3 is judged as non-relevant
to Q1, et cetera.
These would be created by users.
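
The binary relevance judgments can be stored as a small lookup table. In this sketch only the three judgments for Q1 come from the lecture; the names relevance_judgments and is_relevant are hypothetical, and treating an unjudged document as non-relevant is an assumption, though a common convention.

```python
# Binary judgments per query: document ID -> judged relevant or not.
relevance_judgments = {
    "Q1": {"D1": True, "D2": True, "D3": False},
}

def is_relevant(query_id, doc_id):
    """Look up a judgment; unjudged documents are assumed non-relevant."""
    return relevance_judgments.get(query_id, {}).get(doc_id, False)

print(is_relevant("Q1", "D1"), is_relevant("Q1", "D3"))  # True False
```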
Once we have these,
we basically have a test collection, and
then, if you have two systems
you want to compare,
you can just run each
system on these queries and
documents, and
each system will then return results.
Let's say the query is Q1, and
then we would have the results here;
here I show R sub A as
results from system A.
So, remember we talked about the
task of computing an approximation of the
relevant document set.
So R sub A is
system A's approximation here, and
R sub B is system B's approximation
of relevant documents.
Now let's take a look at these results.
So which is better?
Now imagine for
a user which one would you like?
All right, let's take
a look at both results.
There are some differences, and
there are some documents that
are returned by both systems.
But if you look at the results
you will feel that, well,
maybe A is better in the sense that
we don't have many non-relevant documents.
And among the three documents returned,
two of them are relevant, so
that's good, it's precise.
On the other hand, one can also
say maybe B is better because
we've got more relevant documents;
we've got three instead of two.
So which one is better, and
how do we quantify this?
Well, obviously, this question
highly depends on a user's task.
And it depends on users as well.
You might be able to imagine, for
some users maybe System A is better,
if the user is not interested in
getting all the relevant documents.
Right, in this case
the user doesn't have to read many;
the user would see mostly relevant documents.
On the other hand, one can also
imagine the user might need to have
as many relevant documents as possible,
for example, when doing a literature survey.
You might be in the second category, and
then you might find
that System B is better.
So in either case, we'll have to also
define measures that would quantify them.
And we might need to define
multiple measures because
users have different perspectives
of looking at results.
[MUSIC]

[SOUND] This lecture is about
the basic measures for
evaluation of text retrieval systems.
In this lecture,
we're going to discuss how we design basic
measures [SOUND] to quantitatively
compare two retrieval [SOUND] systems.
This is a slide that you have
seen earlier in the lecture,
where we talked about the Cranfield
evaluation methodology.
We can have a test collection that
consists of queries, documents, and
relevance judgments.
We can then run two systems on these
data sets to
quantitatively evaluate their performance.
And we raised the question about
[SOUND] which set of results is better:
is System A better or System B better?
[SOUND] So let's now talk about how to
actually quantify their performance.
Suppose we have a total of
10 relevant documents in
the collection for this query.
Now, the relevance judgments
shown on the right
did not include all ten, obviously.
And we have only seen three
relevant documents there, but
we can imagine there are other relevant
documents judged for this query.
So now, intuitively we thought that
System A is better because
it did not have much noise.
And in particular we have seen that
among three results,
two of them are relevant, but
in System B we
have five results and
only three of them are relevant.
So intuitively,
it looks like System A is more accurate.
And this can be captured by
a measure called precision,
where we simply compute to what extent
all the retrieved results are relevant.
If you have 100% precision, that would mean
all the retrieved documents are relevant.
So, in this case System A has
a precision of two out of three.
System B has three out of five.
And this shows that System A is
better by precision.
But we also talked about how
System B might be preferred by
some other users who would like to retrieve
as many relevant documents as possible.
So, in that case we have to compare
the number of relevant
documents that they retrieve.
And there is another
measure called recall.
This measures the completeness of
coverage of relevant documents
in your retrieval result.
So, we just assume that there are ten
relevant documents in the collection.
And here we've got two of them in
System A, so the recall is two out of ten.
Whereas System B has got three,
so it's three out of ten.
Now, we can see by recall
System B is better, and these two
measures turn out to be the very basic
measures for evaluating search engines.
And they are very important because
they are also widely used in many other
evaluation problems.
For example, if you look at the
applications of machine learning you tend
to see precision and recall numbers being
reported for all kinds of tasks.
Okay, so now, let's define these
two measures more precisely and
these measures are to evaluate
a set of retrieved documents.
So that means we are considering
that approximation
of the set of relevant documents.
We can distinguish four cases,
depending on the situation of a document.
A document can be retrieved or
not retrieved, right?
Because we're talking
about the set of results.
The document can also be relevant or
not relevant, depending on whether
the user thinks this is a useful document.
So, we can now have counts of documents
in each of the four categories.
We can have a to represent the number
of documents that are retrieved and
relevant, b for documents that
are not retrieved but relevant, etc.
Now, with this table,
we can define precision
as the ratio of the relevant
retrieved documents, a, to the total
number of retrieved documents.
So this is just
a divided by the sum of a and c,
the sum of this column.
Similarly, recall is defined by
dividing a by the sum of a and b.
So that's, again, to divide a by the sum
of the row, instead of the column.
All right, so we can see
precision and recall are both focused on
looking at a, that's the number
of retrieved relevant documents, but
we're going to use different denominators.
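
The two definitions translate directly into code. In this sketch a, b, and c are the counts from the table (retrieved and relevant, not retrieved but relevant, retrieved but not relevant); the function name precision_recall and the choice to return 0.0 when a denominator is zero are assumptions. The counts for System A and System B follow from the running example with ten relevant documents in the collection.

```python
def precision_recall(a, b, c):
    """a: retrieved and relevant, b: relevant but not retrieved,
    c: retrieved but not relevant."""
    precision = a / (a + c) if (a + c) > 0 else 0.0
    recall = a / (a + b) if (a + b) > 0 else 0.0
    return precision, recall

# System A returns 3 documents, 2 relevant; 8 relevant documents are missed.
p_a, r_a = precision_recall(a=2, b=8, c=1)
# System B returns 5 documents, 3 relevant; 7 relevant documents are missed.
p_b, r_b = precision_recall(a=3, b=7, c=2)
print(p_a, r_a)  # precision 2/3, recall 2/10
print(p_b, r_b)  # precision 3/5, recall 3/10
```

System A wins on precision and System B wins on recall, which is exactly the disagreement discussed above.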
Okay, so what would be an ideal result?
Well, you can easily see that in the ideal
case we have precision and recall both
equal to 1.0. That means we have got 100% of
all the relevant documents in our results,
and all the results that
we return are relevant.
[INAUDIBLE] There's not a single
non-relevant document returned.
In reality, however, high recall tends
to be associated with low precision, and
you can imagine why that is the case.
As you go down the list to try to get
as many relevant documents as possible,
you tend to encounter a lot of non-relevant
documents, so the precision goes down.
Note that this set can also be defined
by a cutoff in a ranked list.
That's why, although these two measures
are defined for a set of retrieved
documents, they are actually very
useful for evaluating a ranked list.
They are the fundamental measures in
text retrieval and many other tasks.
We often are interested in
the precision at ten documents for
web search.
This means we look at
how many documents among the top ten
results are actually relevant.
Now, this is a very meaningful measure,
because it tells us how many relevant
documents a user can expect to see
on the first page of search results,
where they typically show ten results.
So, precision and recall are
the basic measures, and
we need to use them to further
evaluate a search engine, but
they are the building blocks really.
We just said that there tends to be
a trade-off between precision and recall.
So, naturally it would be interesting
to [SOUND] combine them, and
here's one measure that's often used,
called F measure.
And it's a harmonic mean of precision and
recall; it's defined on this slide.
So you can see it first computes the
inverse of R and P here, and
then it
interpolates the two by using
coefficients
depending on the parameter beta, and
after some transformation we can
easily see it would be of this form.
And in any case it's just
a combination of precision and recall.
And beta is a parameter
that's often set to one.
It can control the emphasis
on precision or recall.
When we set
beta to one we end up having a special
case of F measure, often called F1.
This is a popular measure that is often
used as a combined precision and recall.
And the formula looks very
simple; it's just this, here.
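
The slide itself is not part of this transcript. The standard form of the F measure, which matches the description above (a weighted combination of the inverses of R and P, controlled by beta), is given here as an assumption about what the slide shows, with P for precision and R for recall:

```latex
F_\beta
  = \frac{1}{\dfrac{\beta^2}{\beta^2 + 1}\cdot\dfrac{1}{R}
          + \dfrac{1}{\beta^2 + 1}\cdot\dfrac{1}{P}}
  = \frac{(\beta^2 + 1)\,P\,R}{\beta^2 P + R},
\qquad
F_1 = \frac{2\,P\,R}{P + R}.
```

Setting beta to one in the middle expression gives the F1 formula on the right.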
Now it's easy to see that if you have
a larger precision or
larger recall, then the F
measure would be high.
But what's interesting is that
the trade-off between precision and
recall is captured in
an interesting way in F1.
So, in order to understand that, we
can first look at the natural question:
why not just
combine them using a simple
arithmetic mean, as [INAUDIBLE] here?
That would be likely the most
natural way of combining them.
So, what do you think?
If you want to think more,
you can pause the video.
So why is this not as good as F1?
Or what's the problem with this?
Now, if you think about
the arithmetic mean,
you can see that this is the sum
of multiple terms.
In this case,
this is the sum of precision and recall.
In the case of the sum, the total value
tends to be dominated by the large values.
That means if you have a very high P or
a very high R,
then you really don't care about
whether the other value is low.
So, the whole sum would be high.
Now, this is not desirable because
one can easily have a perfect recall.
We can have a perfect recall easily.
Can you imagine how?
It's probably very easy to imagine that
we simply retrieve all
the documents in the collection;
then we have a perfect recall, and
this will give us at least 0.5 as the average.
But such results are clearly
not very useful for users,
even though the average using
this formula would be relatively high.
Now, in contrast, you can see F1 will
reward a case where precision and
recall are roughly similar.
So, it would penalize a case
where you have an extremely high
value for only one of them.
So, this means F1 encodes
a different trade-off between them.
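
A small numeric check makes the difference concrete. The sketch compares the arithmetic mean and F1 for two cases: retrieving every document in the collection, and System A from the running example. The collection size of 1,000 documents is a made-up assumption, as are the function and variable names.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (F measure with beta = 1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def arithmetic_mean(precision, recall):
    return (precision + recall) / 2

# Retrieve all 1,000 documents (hypothetical size), 10 of which are relevant.
p_all, r_all = 10 / 1000, 1.0
# System A: 2 relevant among 3 retrieved, out of 10 relevant in total.
p_a, r_a = 2 / 3, 2 / 10

print(arithmetic_mean(p_all, r_all), f1(p_all, r_all))  # about 0.505 and 0.020
print(arithmetic_mean(p_a, r_a), f1(p_a, r_a))          # about 0.433 and 0.308
```

The arithmetic mean ranks the useless retrieve-everything result above System A, while F1 ranks it far below.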
Now this example shows, actually,
a very important methodology here.
When we try to solve a problem,
you might naturally think of one solution.
Let's say, in this case,
it's this arithmetic mean.
But it's important not
to settle on this solution.
It's important to think whether you
have other ways to combine them.
And once you think about
the multiple variants,
it's important to analyze
their differences and
then think about which
one makes more sense.
In this case,
if you think more carefully you will feel
that F1 probably makes more sense
than the simple arithmetic mean.
Although in other cases
there may be different results.
But in this case, the arithmetic mean
seems not reasonable.
But if you don't pay attention
to these subtle differences,
you might just take an easy way to
combine them and then go ahead with it.
And only later you'll find that, hm,
the measure doesn't seem to work well.
Right, so this methodology
is actually very important in
general in solving problems:
try to think about the best solution.
Try to understand the problem
very well, and then know why
you need this measure, and why you
need to combine precision and recall.
And then use that to guide you in
finding a good way to solve the problem.
To summarize, we talked about precision,
which addresses the question,
are the retrieved results all relevant?
We also talked about recall,
which addresses the question,
have all the relevant
documents been retrieved?
These two are the two basic measures
in text retrieval evaluation.
They are used for
many other tasks as well.
We talked about F measure as a way
to combine precision and recall.
We also talked about the trade-off
between precision and recall.
And this turns out to depend
on the user's search tasks, and
we'll discuss this point
more in a later lecture.
[MUSIC]

[MUSIC]
This lecture is about
how we can evaluate a ranked list.
In this lecture, we will continue
the discussion of evaluation.
In particular,
we are going to look at how we can
evaluate a ranked list of results.
In the previous lecture,
we talked about precision and recall.
These are the two basic measures for
quantitatively measuring
the performance of a search result.
But, as we talked about ranking before,
we framed the text retrieval
problem as a ranking problem.
So, we also need to evaluate
the quality of a ranked list.
How can we use precision and recall
to evaluate a ranked list?
Well, naturally, we have to look at the
precision and recall at different cut-offs.
Because in the end, the approximation
of the relevant document set
given by a ranked list is determined
by where the user stops browsing.
Right?
If we assume the user sequentially browses
the list of results, the user would
stop at some point, and
that point would determine the set.
And then
that's the most important cut-off
that we have to consider
when we compute the precision and recall.
Without knowing where
exactly the user would stop,
we have to consider all
the positions where the user could stop.
So, let's look at these positions.
Look at this slide, and
then let's look at
what happens if the user stops at
the first document.
What's the precision and recall at this point?
What do you think?
Well, it's easy to see that this document
is relevant. So, the precision is one out of one.
We have got one document,
and that's relevant.
What about the recall?
Well, note that we're assuming that
there are ten relevant documents for
this query in the collection,
so it's one out of ten.
What if the user stops
at the second position?
Top two.
Well, the precision is the same,
100%, two out of two.
And the recall is two out of ten.
What if the user stops
at the third position?
Well, this is interesting,
because in this case we have not got any
additional relevant document,
so the recall does not change.
But the precision is lower,
because we've got a non-relevant document. So,
what's exactly the precision?
Well, it's two out of three, right?
And recall is the same, two out of ten.
So, when would we see another point
where the recall would be different?
Now, if you look down the list,
well, it won't happen until
we have seen another relevant document.
In this case D5; at that point,
the recall is increased to
three out of ten, and
the precision is three out of five.
So, you can see, if we keep doing this,
we can also get to D8.
And then we will have
a precision of four out of eight,
because there are eight documents,
and four of them are relevant.
And the recall is four out of ten.
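
The cut-off computation can be sketched as a loop over the ranked list. The relevance pattern below for D1 through D8 is inferred from the numbers quoted above (D1, D2, D5, and D8 relevant), and only these eight positions are modeled; the names ranked_relevance and precision_recall_at_cutoffs are hypothetical.

```python
# Relevance of D1..D8 in rank order, inferred from the walkthrough.
ranked_relevance = [True, True, False, False, True, False, False, True]
TOTAL_RELEVANT = 10  # relevant documents in the whole collection for this query

def precision_recall_at_cutoffs(relevance_flags, total_relevant):
    """Return (rank, precision, recall) for every possible stopping position."""
    rows = []
    hits = 0
    for rank, relevant in enumerate(relevance_flags, start=1):
        if relevant:
            hits += 1
        rows.append((rank, hits / rank, hits / total_relevant))
    return rows

rows = precision_recall_at_cutoffs(ranked_relevance, TOTAL_RELEVANT)
for rank, precision, recall in rows:
    print(f"top {rank}: precision={precision:.2f} recall={recall:.1f}")
```

At rank 3 this gives precision 2/3 and recall 2/10, at rank 5 it gives 3/5 and 3/10, and at rank 8 it gives 4/8 and 4/10, as in the lecture.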
Now, when can we get
a recall of five out of ten?
Well, in this list, we don't have it,
so we have to go down on the list.
We don't know where it is.
But, for convenience, we often assume that
the precision is zero
at all the other levels of recall
that are beyond the search results.
So, of course,
this is a pessimistic assumption;
the actual precision would be higher,
but we make this assumption
in order to have an easy way to
compute another measure called Average
Precision, that we will discuss later.
Now, I should also say, here you see
we make these assumptions that
are clearly not accurate.
But this is okay for the purpose of
comparing two text retrieval methods.
And this is for the relative comparison,
so it's okay if the actual measure,
or actual number, deviates
a little bit from the true number.
As long as the deviation
is not biased toward any particular
retrieval method, we are okay.
We can still
accurately tell which method works better.
And this is an important point
to keep in mind.
When you compare different algorithms,
the key is to avoid any
bias toward any particular method.
And as long as you can avoid that,
it's okay for you to do transformations
of these measures, so long as
you preserve the order.
Okay, so, we'll just talk about,
we can get a lot of precision-recall
numbers at different positions.
So, now, you can imagine,
we can plot a curve.
And, this just shows on the,
x-axis, we show the recalls.
And, on the y-axis, we show the precision.
So, the precision line was marked as .1,
.2, .3, and, 1.0.
Right?
So,
this is, the different, levels of recall.
And,, the y-axis also has,
different amounts, that's for precision.
So, we plot these precision-recall
numbers that we have got
as points on this picture.
Now, we can further
link these points to form a curve.
As you'll see,
we assumed all the other precisions
at the higher levels of recall to be zero.
That's why they are down here;
they are all zero.
The actual curve probably will
be something like this, but as we just
discussed, it doesn't matter that
much for comparing two methods,
because this would be
underestimated for all the methods.
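As a minimal sketch of how these precision-recall points could be computed, the Python below walks down a ranked list and records a point each time a relevant document is found. The ranked list is hypothetical (it is not the list from the slide), and the function name is just an illustration. Recall levels that the list never reaches get precision zero, which is the pessimistic assumption described above.

```python
def precision_recall_points(ranked_relevance, total_relevant):
    """Return (recall, precision) at each rank where a relevant document appears."""
    points = []
    hits = 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            points.append((hits / total_relevant, hits / rank))
    # Pessimistic assumption: precision is zero at every recall level
    # that the retrieved list never reaches.
    for missing in range(hits + 1, total_relevant + 1):
        points.append((missing / total_relevant, 0.0))
    return points


# Hypothetical judgments for a top-10 list (True = relevant).
# We assume 10 relevant documents exist in the whole collection.
ranked = [True, True, False, False, True, False, False, True, False, False]
points = precision_recall_points(ranked, total_relevant=10)
for recall, precision in points:
    print(f"recall={recall:.1f}  precision={precision:.2f}")
```

Linking these points in order of recall would give the kind of curve discussed here.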
Okay, so now that we
have this precision-recall curve,
how can we compare two ranked lists?
All right, so that means
we have to compare two PR curves.
And here we show two cases,
where system A is shown in red and
system B is shown in blue with crosses.
All right, so, which one is better?
I hope you can see
that system A is clearly better.
Why?
Because for the same level of recall,
see the same level of recall here,
you can see that
the precision point of system A is better than
that of system B.
So, there's no question.
Here, you can also imagine: what does the
curve look like for an ideal search system?
Well, it has to have perfect,
precision at all the recall points, so,
it has to be this line.
That would be the ideal system.
In general, the higher the curve is,
the better, right?
The problem is that,
we might see a case like this.
This actually happens often.
Like, the two curves cross each other.
Now, in this case, which one is better?
What do you think?
Now, this is a real problem
that you actually might have to face.
Suppose you build a search engine,
and you have an old algorithm
that's shown here in blue, or system B.
And, you have come up with a new idea.
And, you test it.
And, the results are shown in red,
curve A.
Now, your question is, is your new
method better than the old method?
Or, more practically,
do you have to replace the algorithm that
you're already using in your search
engine with another, new algorithm?
So, should we use
method A to replace method B?
This is going to be a real decision
that you have to make.
If you make the replacement, the search
engine would behave like system A here,
whereas, if you don't do that,
it will behave like system B.
So, what do you do?
Now, if you want to spend more time
to think about this, pause the video.
And, it's actually very
useful to think about that.
As I said, it's a real decision that you
have to make, if you are building your own
search engine, or if you're working, for
a company that, cares about the search.
Now, if you have thought about this for
a moment, you might realize that,
well, in this case, it's hard to say.
Some users might like system A;
some users might like system B.
So, what's the difference here?
Well, the difference is just that
in the low-recall region,
in this region, system B is better.
It has a higher precision.
But in the high-recall region,
system A is better.
So that also means
it depends on whether the user
cares about high recall, or about
low recall but high precision.
You can imagine someone who is just going
to check out what's happening today and
wants to find something
relevant in the news.
Well, which one is better?
What do you think?
In this case, clearly, system B is better,
because the user is unlikely
to examine a lot of results.
The user doesn't care about high recall.
On the other hand,
think about a case
where you are
starting to study a problem.
You want to find out whether your idea
has been studied before.
In that case, you emphasize high recall.
So, you want to see
as many relevant documents as possible.
Therefore, you might favor system A.
So, that means, which one is better?
That actually depends on the users,
and more precisely, on the user's task.
So, this means, you may not necessarily
be able to come up with one number,
that would accurately
depict the performance.
You have to look at the overall picture.
Yet, as I said, when you have
a practical decision to make,
whether to replace one algorithm with another,
then you may have to actually come up with
a single number to quantify each method.
Or, when we compare many different
methods in research, ideally we have
one number to compare them with, so that
we can easily make a lot of comparisons.
So, for all these reasons, it is desirable
to have one single number to measure performance.
So, how do we do that?
That
needs a number to summarize the ranking.
So, here again is
the precision-recall curve, right?
And one way to summarize
this whole ranked list, or
this whole curve,
is to look at the area underneath the curve.
Right?
So, this is one way to measure that.
There are other ways to measure that,
but it just turns out that
this particular way of measuring
it has been very popular, and
has been used for a long time in
text retrieval.
It is computed
in this way, and
it's called the average precision.
Basically, we're going to take a look
at every different recall point,
and then look at the precision.
So, we know
this is one precision.
And this is another,
with a different recall.
Now, this one we don't count,
because the recall level is the same.
We're going to look at
this number, and that's the precision at
a different recall level, et cetera.
So, we have all these added up.
These are the precisions
at the different points,
corresponding to retrieving the first
relevant document, the second,
then the third, and so on.
Now, we missed many relevant
documents, so in all of those cases
we just assume
that they have zero precision.
And then, finally, we take the average.
So, we divide the sum by ten,
which is the total number of relevant
documents in the collection.
Note that here
we're not dividing this sum by four,
which is the number of retrieved
relevant documents.
Now, imagine if I divided by four:
what would happen?
Think about this for a moment.
It's a common mistake that people
sometimes make.
Right, so if we divide this by four,
it's actually not very good.
In fact, you would be favoring a system
that retrieves very few relevant
documents, as in that case
the denominator would be very small.
So, this would not be a good measure.
So, note that this
denominator is ten,
the total number of relevant documents.
And this will basically compute
the area underneath the curve.
And, this is the standard method,
used for evaluating a ranked list.
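The computation just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ranked list below is hypothetical, with four relevant documents retrieved in the top ten and ten relevant documents in the collection, and the function name is made up for this example.

```python
def average_precision(ranked_relevance, total_relevant):
    """Average precision for one query.

    ranked_relevance: list of booleans, one per retrieved document, in rank order.
    total_relevant:   number of relevant documents in the whole collection.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            # Precision at the rank where this relevant document was retrieved.
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero, so we
    # divide by the total number of relevant documents, not by `hits`.
    return precision_sum / total_relevant


# Hypothetical top-10 list: relevant documents at ranks 1, 2, 5, and 8.
ranked = [True, True, False, False, True, False, False, True, False, False]
ap = average_precision(ranked, total_relevant=10)
print(round(ap, 2))  # (1/1 + 2/2 + 3/5 + 4/8) / 10 = 0.31
```

Dividing the same sum by four instead of ten would give 0.775, which rewards the system for having retrieved only a few of the relevant documents.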
Note that it actually combines
recall and precision.
First, we have
precision numbers here; secondly,
we also consider recall, because if we missed
many, there would be many zeros here.
All right, so
it combines precision and recall.
Furthermore, you can see this
measure is sensitive to a small change
in the position of a relevant document.
Let's say, if I move this relevant
document up a little bit,
it would increase
this average precision.
Whereas, if I move any relevant document
down, let's say I move this relevant
document down, then it would decrease
the average precision.
So, this is very good,
because it's very sensitive to
the ranking of every relevant document.
It can tell small differences
between two ranked lists.
And that is what we want:
sometimes one algorithm only works
slightly better than another,
and we want to see this difference.
In contrast, look at
the precision at ten documents.
If we look at this whole set,
what's the precision?
What do you think?
Well, it's easy to see
that's four out of ten, right?
So, that precision is very meaningful,
because it tells us what a user would see.
So, that's pretty useful, right?
It's a meaningful measure
from a user's perspective.
But if we use this measure to
compare systems, it wouldn't be good,
because it wouldn't be sensitive to where
these four relevant documents are ranked.
If I move them around, the precision
at ten is still the same.
Right.
So,
this is not a good measure for
comparing different algorithms.
In contrast, the average precision
is a much better measure.
It can tell the difference
between ranked lists in
subtle ways.
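To make this contrast concrete, here is a small Python sketch with two hypothetical ranked lists. Both contain four relevant documents in the top ten, so precision at ten is identical, but average precision separates them. The lists, the assumed total of ten relevant documents, and the function names are all illustrative assumptions.

```python
def precision_at_k(ranked_relevance, k):
    """Fraction of the top-k documents that are relevant."""
    return sum(ranked_relevance[:k]) / k


def average_precision(ranked_relevance, total_relevant):
    """Sum of precision at each relevant document, divided by all relevant docs."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / total_relevant


# Two hypothetical rankings with the same four relevant documents.
relevant_on_top = [True] * 4 + [False] * 6     # relevant at ranks 1-4
relevant_at_bottom = [False] * 6 + [True] * 4  # relevant at ranks 7-10

p10_top = precision_at_k(relevant_on_top, 10)
p10_bottom = precision_at_k(relevant_at_bottom, 10)
ap_top = average_precision(relevant_on_top, total_relevant=10)
ap_bottom = average_precision(relevant_at_bottom, total_relevant=10)

print(p10_top, p10_bottom)  # identical: precision at ten cannot tell them apart
print(ap_top, ap_bottom)    # average precision is clearly higher for the first list
```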
[MUSIC]

[SOUND]
So average precision is computed for
just
one query.
But we generally experiment with many
different queries, and this is to
avoid the variance across queries.
Depending on the queries you use, you
might draw different conclusions.
Right, so
it's better to use more queries.
If you use more queries then,
you will also have to
take the average of the average
precision over all these queries.
So how can we do that?
Well, you can naturally
think of just taking the arithmetic mean, as we
always tend to think in this way.
So, this would give us what's called
"Mean Average Precision", or MAP.
In this case,
we take the arithmetic mean of all the average
precisions over several queries or topics.
But as I mentioned in
another lecture, is this good?
Recall that
we talked about the different ways
of combining precision and recall,
and we concluded that the arithmetic
mean is not as good as the F-measure.
Here it's the same.
We can also think about alternative
ways of aggregating the numbers.
Don't just automatically assume that
we should
take the arithmetic
mean of the average precision over
these queries.
Let's think about what's
the best way of aggregating them.
If you think about the different ways,
naturally you will,
probably be able to think about
another way, which is geometric mean.
And we call this kind of average a gMAP.
This is another way.
So now, once you think about
the two different ways
of doing the same thing,
the natural question to ask is,
which one is better?
So,
do you use MAP or gMAP?
Again, that's an important question.
Imagine you are again
testing a new algorithm
by comparing it with your old
algorithm in the search engine.
Now you tested multiple topics.
Now you've got the average precision for
these topics.
Now you are thinking of looking
at the overall performance.
You have to take the average.
But which, which strategy would you use?
Now first, you should also think about the
question: does it make a difference?
Can you think of scenarios where using
one of them would make a difference?
That is, they would give different
rankings of those methods.
And that also means that depending on
the way you take the
average of these average precisions,
you will get different conclusions.
This makes the question
even more important.
Right?
So, which one would you use?
Well again, if you look at
the difference between these
different ways of aggregating
the average precision,
you'll realize that in the arithmetic mean,
the sum is dominated by large values.
So what does a large value here mean?
It means the query is relatively easy.
You can have a high
average precision.
Whereas gMAP tends to be
affected more by low values.
And those are the queries that
don't have good performance.
The average precision is low.
So if you think about
improving the search engine for
those difficult queries,
then gMAP would be preferred, right?
On the other hand, if you just want to
improve
over all kinds of queries, or
particularly popular queries that might be
easy and that you want to make perfect,
then maybe MAP would be preferred.
So again, the answer depends on
your users, your users' tasks, and
their preferences.
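Here is a minimal Python sketch of a scenario where MAP and gMAP disagree. The per-query average precision values are invented for illustration, and the small epsilon that guards against taking the log of zero is an assumption of this sketch, not something from the lecture.

```python
import math


def mean_average_precision(ap_values):
    """Arithmetic mean of per-query average precision (MAP)."""
    return sum(ap_values) / len(ap_values)


def geometric_mean_average_precision(ap_values, epsilon=1e-6):
    """Geometric mean of per-query average precision (gMAP).

    epsilon is an assumption of this sketch: it keeps log() defined
    when a query has an average precision of exactly zero.
    """
    log_sum = sum(math.log(max(ap, epsilon)) for ap in ap_values)
    return math.exp(log_sum / len(ap_values))


# Hypothetical average precision on three queries; the last one is a hard query.
old_system = [0.80, 0.70, 0.05]
new_system = [0.90, 0.90, 0.01]  # better on the easy queries, worse on the hard one

print(mean_average_precision(old_system), mean_average_precision(new_system))
print(geometric_mean_average_precision(old_system),
      geometric_mean_average_precision(new_system))
```

With these numbers, MAP favors the new system because the gains on the easy queries dominate the sum, while gMAP favors the old system because it did better on the hard query.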
So the point here is to think
about the multiple ways to solve
the same problem, and then compare them,
and think carefully about the differences
and which one makes more sense.
Often, one of them might
make sense in one situation and
another might make more sense
in a different situation.
So it's important to figure out under
what situations one is preferred.
As a special case of the mean average
precision, we can also think about
the case where there is precisely
one relevant document.
And this happens often, for example,
in what's called a known item search,
where you know a target page; let's
say you have to find the Amazon homepage.
You have one relevant document there,
and you hope to find it.
That's called a "known item search".
In that case,
there's precisely one relevant document.
Or in another application,
like question answering,
maybe there's only one answer
there.
So if you rank the answers,
then your goal is to rank that one
particular answer on top, right?
So in this case, you can easily
verify that the average precision
will basically boil down
to the reciprocal rank.
That is, 1 over r where r is the rank
position of that single relevant document.
So if that document is ranked
on the very top, r is 1, and
then it's 1 for the reciprocal rank.
If it's ranked
second, then it's 1 over 2.
Et cetera.
And then we can also take an average
of all these average precisions or
reciprocal ranks over a set of topics, and
that would give us something
called the mean reciprocal rank.
It's a very popular measure
for known item search or, you know,
any problem where you have
just one relevant item.
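As a small illustration, the mean reciprocal rank can be sketched as follows in Python. The rank positions are hypothetical, one per topic, and the function names are chosen for this example only.

```python
def reciprocal_rank(rank_of_relevant_doc):
    """1 / r, where r is the 1-based rank of the single relevant document."""
    return 1.0 / rank_of_relevant_doc


def mean_reciprocal_rank(ranks):
    """Average of the reciprocal ranks over a set of topics."""
    return sum(reciprocal_rank(r) for r in ranks) / len(ranks)


# Hypothetical rank of the one relevant document for each of three topics.
ranks_per_topic = [1, 2, 4]
mrr = mean_reciprocal_rank(ranks_per_topic)
print(mrr)  # (1 + 1/2 + 1/4) / 3
```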
Now again here, you can see this
r actually is meaningful here.
And this r is basically
indicating how much effort
a user would have to make in order
to find that relevant document.
If it's ranked on the top, it's
little effort that you have to make.
But if it's ranked at 100,
then you actually have to
read presumably 100 documents
in order to find it.
So, in this sense r is also a meaningful
measure, and the reciprocal rank
takes the reciprocal of r
instead of using r directly.
So a natural question here
is, why not simply use r?
Imagine if you were to design
a measure of the performance
of a ranking system
when there is only one relevant item.
You might have thought about
using r directly as the measure.
After all,
that measures the user's effort, right?
But think about what happens if you take an average
of this over a large number of topics.
Again it would make a difference.
Right, for one single topic, using r or
using 1 over r wouldn't
make any difference.
It's the same:
a larger r corresponds
to a smaller 1 over r, right?
But the difference would only
show up when you have many topics.
So again, think about the average of
reciprocal rank versus the average of just r.
What's the difference?
Do you see any difference?
And would this difference
change the order of systems
in our conclusion?
It turns out that
there is actually a big difference, and
if you want to
think about it yourself,
then pause the video.
Basically, the difference is that
if you take the sum of r directly, then
again it will be dominated
by large values of r.
So what are those values?
Those are large values that
indicate lowly ranked results.
That means the relevant items
are ranked very low down on the list.
And the sum, and thus also the average,
would then be dominated by
cases where those relevant documents
are ranked
in the lower portion of the ranked list.
But from a user's perspective we care
more about the highly ranked documents.
So by taking this transformation,
by using the reciprocal rank,
here we put more emphasis on
the difference at the top.
You know, think about
the difference between 1 and 2:
it would make a big difference in 1 over
r. But think about 100 and 1,000:
it won't make much
difference if you use 1 over r.
But if you use r directly, there will
be a big difference between 100 and,
let's say, 1,000, right.
So this is not desirable.
On the other hand, 1 and
2 won't make much difference.
So this is yet another case where there
may be multiple choices of doing the same
thing and then you need to figure
out which one makes more sense.
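A short Python sketch can show how averaging r directly and averaging 1 over r may order two systems differently. The rank positions below are invented for illustration, and the two systems are hypothetical.

```python
def mean_rank(ranks):
    """Average of r itself; lower is better."""
    return sum(ranks) / len(ranks)


def mean_reciprocal_rank(ranks):
    """Average of 1 / r; higher is better."""
    return sum(1.0 / r for r in ranks) / len(ranks)


# Hypothetical rank of the single relevant document on four topics.
system_x_ranks = [1, 1, 1, 1000]  # perfect on three topics, very bad on one
system_y_ranks = [2, 2, 2, 100]   # never on top, but never extremely low

# Averaging r directly: the single value of 1000 dominates, so Y looks better.
print(mean_rank(system_x_ranks), mean_rank(system_y_ranks))

# Averaging 1/r: differences near the top dominate, so X looks better.
print(mean_reciprocal_rank(system_x_ranks), mean_reciprocal_rank(system_y_ranks))
```

The two averages reach opposite conclusions on the same data, which is why the choice of transformation matters once many topics are averaged.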
So to summarize,
we showed that the precision-recall curve
can characterize the overall
accuracy of a ranked list.
And we emphasized that the actual
utility of a ranked list depends
on how many top-ranked results
a user would actually examine.
Some users will examine more
than others.
Average precision is the standard measure
for comparing two ranking methods.
It combines precision and recall, and
it's sensitive to the rank
of every relevant document.
[MUSIC]

[SOUND] This lecture is about how to
evaluate the text retrieval system when
we have multiple levels of judgments.
In this lecture we will continue
the discussion of evaluation.
We're going to look at how to
evaluate a text retrieval system
when we have multiple levels of judgments.
So far we have talked
about binary judgments;
that means a document is judged
as being relevant or non-relevant.
But earlier we also talked about
relevance as a matter of degree.
So we often can distinguish
very highly relevant documents,
those are very useful documents, from, you
know, moderately relevant documents.
They are okay; they are useful, perhaps.
And further from non-relevant documents.
Those are not useful.
Right?
So imagine you can have ratings for
these pages.
Then you would have
multiple levels of ratings.
For example, here I show an example
of three levels:
three for very relevant,
two for marginally relevant, and
one for non-relevant.
Now, how do we evaluate a search system
using these judgments? Obviously MAP
doesn't work, average precision
doesn't work, and precision and
recall don't work, because
they rely on binary judgments.
So let's look at some top-ranked
results when using these judgments.
Right?
Imagine the user would mostly
care about the top ten results here.
Right.
And we mark the rating levels, or
relevance levels, for
these documents as shown here:
three, two, one, one, three, et cetera.
And we call these gains.
The reason why we call it a gain
is because the measure that
we are introducing is called NDCG,
normalized discounted cumulative gain.
So this gain basically measures
how much gain of relevant
information a user can obtain by
looking at each document, all right.
So looking at the first document,
the user can gain three points.
Looking at a non-relevant document,
the user would only gain one point.
Right.
Looking at a moderately relevant or
marginally relevant document, the user
would get two points, et cetera.
So this gain basically measures the utility
of a document from a user's perspective.
Of course, if we assume the user
stops at ten documents, and
we're looking at a cutoff at ten, we can
look at the total gain of the user.
And what's that?
Well, that's simply the sum of these, and
we call it the cumulative gain.
So if the user stops at
position one, that's just three.
If the user looks at another
document that's a 3 plus 2.
If the user looks at the more documents.
Then the cumulative gain is more.
Of course, this is at the cost of
spending more time to examine the list.
So cumulative gain gives
us some idea about
how much total gain the user would have
if the user examines all these documents.
Now, in NDCG, we also have another letter
here, D, for discounted cumulative gain.
So why do we want to do discounting?
Well, if you look at this cumulative gain,
there is one deficiency, which is that
it does not consider the rank
position of these documents.
So, for example, looking at
this sum here,
we only know there is
one highly relevant document,
one marginally relevant document, and
two non-relevant documents.
The sum doesn't really care
where they are ranked.
Ideally, we want these two
to be ranked on the top,
which is the case here.
But how can we capture that intuition?
Well, we have to say that this 3 here
is not as good as this 3 on the top.
And that means the contribution of
the gain from different positions
has to be weighted by their position.
And this is the idea of discounting,
basically.
So, we're going to say, well, the first
one doesn't need to be discounted,
because we can assume that
the user always sees this document. But
the second one
will be discounted a little bit,
because there's a small possibility
that the user wouldn't notice it.
So, we divide this gain by a weight
based on the position:
log of two, where two is the rank
position of this document. And
when we go to the third position, we
discount even more, because the normalizer
is log of three, and so on and so forth.
So when we take such a sum, a lowly
ranked document would not
contribute as much as
a highly ranked document.
So that means if you, for example,
switch the positions of this one
and this one, then
you would get more discount if you put,
for example, a very relevant document
here as opposed to here.
Imagine if you put the three here;
then it would have to be discounted.
So it's not as good as if you
put the three here.
So this is the idea of discounting.
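Here is a minimal Python sketch of cumulative gain and discounted cumulative gain as described above. Two things are assumptions of this sketch: the logarithm is taken base 2 (the lecture only says "log"), and the gain values beyond the first five are invented to fill out a top-ten list.

```python
import math


def cumulative_gain(gains, k):
    """Total gain of the top-k documents, ignoring their positions."""
    return sum(gains[:k])


def dcg_at_k(gains, k):
    """Discounted cumulative gain at cutoff k.

    The first document is not discounted; the document at rank i >= 2
    has its gain divided by log2(i). Base 2 is an assumption of this sketch.
    """
    total = 0.0
    for rank, gain in enumerate(gains[:k], start=1):
        discount = 1.0 if rank == 1 else math.log2(rank)
        total += gain / discount
    return total


# Gains in rank order: 3 = very relevant, 2 = marginally relevant, 1 = non-relevant.
# The first five follow the lecture's example; the rest are hypothetical.
gains = [3, 2, 1, 1, 3, 1, 1, 2, 1, 1]
print(cumulative_gain(gains, 2))  # 3 + 2
print(dcg_at_k(gains, 10))

# Same documents, different order: cumulative gain is unchanged, DCG is not.
print(dcg_at_k([3, 2, 1], 3), dcg_at_k([3, 1, 2], 3))
```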
Okay, so now at this point we have
got this discounted cumulative gain for
measuring the utility of this ranked
list with multiple levels of judgments.
So are we happy with this?
Well, we can use this to rank systems.
But we still need to do
a little bit more in order to
make this measure comparable
across different topics.
And this is the last step.
And by the way,
here we just show the DCG at ten.
All right.
So this is the total sum of DCG
over all these ten documents.
So the last step is called N,
normalization.
And if we do that, then
we get the normalized DCG.
So how do we do that?
Well, the idea here is
that we normalize the DCG
by the ideal DCG at the same cutoff.
What is the ideal DCG?
Well, this is the DCG of an ideal ranking.
So imagine we have nine
documents in the whole collection
rated three; that means in
total we have nine documents rated three.
Then our ideal ranked list
would have put all these nine
documents on the very top.
So all these would have to be three, and
then this would be followed by a two here,
because that's the best we could do
after we have run out of threes.
But all these positions would be threes.
Right?
So this would be our ideal ranked list.
And then we can compute the DCG for
this ideal ranked list.
So this would be given by this
formula you see here, and
this ideal DCG would be used
as the normalizer.
So you can imagine now normalization
essentially is to compare the actual DCG
with the best DCG you can
possibly get for this topic.
Now why do we want to do this?
Well, by doing this we map the DCG
values into a range of zero through one,
so the best value, or the highest
value, for every query would be one.
That's when your ranked list
is in fact the ideal list.
But otherwise, in general,
you will be lower than one.
Now what if we don't do that?
Well, you can see this transformation, or
this normalization,
doesn't really affect the relative
comparison of systems for
just one topic, because this ideal
DCG is the same for all the systems.
So the ranking of systems based on
only DCG would be exactly the same
as if you ranked them based
on the normalized DCG.
The difference, however, shows up when
we have multiple topics, because
if we don't do normalization, different
topics will have different scales of DCG.
For a topic like this one, where we have
nine highly relevant documents,
the DCG can get really high.
But imagine that in another case there
are only two very relevant documents
in total in the whole collection.
Then the highest DCG that
any system could achieve for
such a topic would not be very high.
So again we face the problem of different
scales of DCG values, and when we
take an average we don't want the average
to be dominated by those high values.
Those are again easy queries.
So by doing the normalization we
avoid the problem,
making all the queries
contribute equally to the average.
So this is the idea of NDCG.
It's used for measuring ranking quality based
on multi-level relevance judgments.
So in a more general way,
this is basically a measure
that can be applied to
any ranking task with multiple levels
of judgments.
And the scale of
the judgments
can be more than binary;
there can be multiple levels,
like one through five, or
even more, depending on your application.
And the main idea of this
measure, just to summarize,
is to measure the total utility
of the top k documents.
So you always choose a cutoff, and
then you measure the total utility.
And it would discount the contribution
from a lowly ranked document,
and finally, it would do normalization
to ensure comparability across queries.
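Putting the three steps together, NDCG at a cutoff could be sketched in Python as below. As before, the base-2 logarithm is an assumption, and the judged gains for the topic (nine documents rated three, plus some rated two and one) and the system's ranked list are hypothetical values chosen to mirror the example in the lecture.

```python
import math


def dcg_at_k(gains, k):
    """DCG at cutoff k: rank 1 is undiscounted, rank i >= 2 is divided by log2(i)."""
    total = 0.0
    for rank, gain in enumerate(gains[:k], start=1):
        total += gain if rank == 1 else gain / math.log2(rank)
    return total


def ndcg_at_k(ranked_gains, all_gains_for_topic, k):
    """Normalize DCG by the DCG of the ideal ranking at the same cutoff."""
    # The ideal ranking puts the highest-rated documents of the topic on top.
    ideal_ranking = sorted(all_gains_for_topic, reverse=True)
    ideal_dcg = dcg_at_k(ideal_ranking, k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(ranked_gains, k) / ideal_dcg


# Hypothetical judgments for one topic: nine 3s, five 2s, twenty 1s.
topic_gains = [3] * 9 + [2] * 5 + [1] * 20

# Hypothetical system output (gains of its top ten documents, in rank order).
system_gains = [3, 2, 1, 1, 3, 1, 1, 2, 1, 1]

score = ndcg_at_k(system_gains, topic_gains, k=10)
perfect = ndcg_at_k(sorted(topic_gains, reverse=True), topic_gains, k=10)
print(score, perfect)
```

Because every topic's score is divided by its own ideal DCG, a topic with many highly relevant documents no longer dominates the average over topics.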
[MUSIC]

[NOISE].
>> This lecture is about some practical
issues that you would have to address in
evaluation of text retrieval systems.
In this lecture we will continue
the discussion of evaluation.
We will cover some practical
issues that you will have to solve
in actual evaluation of
text retrieval systems.
So, in order to create a test collection,
we have to create a set of queries,
a set of documents and
a set of relevance judgments.
It turns out that each is
actually challenging to create.
So first, the documents and
queries must be representative.
They must represent the real queries
and real documents that the users handle.
And we also have to use many queries and
many documents in order to
avoid biased conclusions.
For the matching of relevant
documents with the queries,
we also need to ensure that there exist a
lot of relevant documents for each query.
If a query has only one relevant
document in the collection, then, you know,
it's not very informative to
compare different methods
using such a query, because there is not
much room for us to see a difference.
So, ideally there should be more
relevant documents in the collection.
But yet the queries also should represent
real queries that we care about.
In terms of relevance judgments, the
challenge is to ensure complete judgments
of all the documents for all the queries,
yet minimize human effort.
Because we have to use human
labor to label these documents,
it's very labor intensive.
And as a result, it's impossible to
actually label all of the documents for
all the queries, especially considering
a giant data set like the web.
So, this is actually a major challenge.
It's a very difficult challenge.
For measures, it's also challenging,
because what we want are measures
that accurately reflect
the perceived utility of users.
We have to consider carefully
what the users care about and
then design measures to measure that.
If your measure is not
measuring the right thing,
then your conclusion would
be misleading.
So it's very important.
So we're going to talk
about a couple of issues here.
One is the statistical significance test,
and
this also is the reason why we
have to use a lot of queries.
The question here is, how sure can you
be that an observed difference
doesn't simply result from
the particular queries you chose?
So here are some sample results of
average precision for System A and
System B in two different experiments.
And you can see at the bottom
we have the mean average precision, all right?
So
if you look at the mean average precision,
the mean average precisions are exactly
the same in both experiments.
All right, so you can see this is 0.2 for
system A, and this is 0.4 for
system B, and
again here it's also 0.2 and 0.4.
So they are identical.
Yet if you look at these exact
average precisions for different queries,
if you look at these numbers in detail,
you will realize that in one case
you would feel that you can trust
the conclusion here given by the average.
In another case, in the other case,
you will feel that, well, I'm not sure.
So, why don't you take a look at
all these numbers for a moment.
Pause the video.
So, if you look at the average,
the mean average precision,
we can easily say that, well,
System B is better, right?
After all, it's 0.4, and
this is twice as much as 0.2.
So that's a better performance.
But if you look at these two experiments
and look at the detailed results,
you will see that we can be more
confident in saying that in case one,
in experiment one,
because these numbers seem to
be consistently better for System B.
Whereas in experiment two,
we're not sure,
because looking at some results,
like this one, actually System A is better.
And this is another case
where system A is better.
But yet, if we look at on the average,
System B is better.
So what do you think?
You know, how reliable is our conclusion
if we only look at the average?
Now in this case, intuitively, we feel
experiment one is more reliable.
But how can we quantitatively
answer this question?
And this is why we need to do
a statistical significance test.
So the idea of a statistical significance
test is basically to assess the
variance across these different queries.
If there's a big variance, that means
that the results could fluctuate
a lot according to different queries.
Then we should believe that,
unless you have used a lot of
queries, the results might change
if we use another set of queries.
Right?
So,
if you have seen high variance,
then the conclusion is not very reliable.
So let's look at these results
again in the second case.
So here we show two
different ways to compare them.
One is a sign test,
where we just look at the sign.
If System B is better than System A,
then we have a plus sign.
When System A is better,
we have a minus sign, etc.
In this case, if you look at this,
well, there are seven cases.
We actually have four cases
where System B is better,
but three cases where System A is better.
You know, intuitively,
this is almost like random results.
Right, so if you just
flip seven coins,
and use plus to denote heads and
minus to denote tails,
that could easily be the result of just
randomly flipping these seven coins.
So the fact that the average
is larger doesn't tell us anything.
You know, we can't reliably conclude that.
And this can be quantitatively
measured by a p-value.
And that basically indicates
how likely it is that such a result
would arise from random fluctuation alone.
In this case, the p-value is one.
It means the result is fully consistent
with random fluctuation.
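The coin-flipping argument can be sketched as a two-sided sign test in Python. This is an illustration under assumptions: ties between the two systems are assumed to have been dropped beforehand, the test is two-sided, and the function name is made up for this example.

```python
from math import comb


def sign_test_p_value(num_plus, num_minus):
    """Two-sided sign test p-value under the fair-coin null hypothesis.

    num_plus:  number of queries where System B beats System A.
    num_minus: number of queries where System A beats System B.
    Ties are assumed to have been removed already.
    """
    n = num_plus + num_minus
    k = max(num_plus, num_minus)
    # Probability of seeing k or more wins for one side out of n fair coin flips.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    # Double it for the two-sided test, capping at 1.
    return min(1.0, 2 * tail)


# Four plus signs and three minus signs, as in the second experiment.
print(sign_test_p_value(4, 3))  # 1.0: indistinguishable from coin flipping

# If System B had won on all seven queries, the picture would be different.
print(sign_test_p_value(7, 0))  # 0.015625
```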
Now the Wilcoxon test
is a non-parametric test,
where we would be not only
looking at the signs;
we would also be looking at
the magnitude of the difference.
But we can draw a similar
conclusion, where you say, well, it's
very likely to be from random fluctuation.
So to illustrate this let's
think about such a distribution.
And this is called a normal distribution.
We assume that the mean is zero here.
Let's say, well, we started with
the assumption that there's no difference
between the two systems.
But we assume that because of random
fluctuations depending on the queries
we might observe a difference, so
the actual difference might be
on the left side here or
on the right side here, right?
And, and
this curve kind of shows the probability
that we would actually observe values
that are deviating from zero here.
Now, if we look at this picture, then
we see that if a difference
is observed here,
then the chance is very
high that this is in fact
a random observation, right.
We can define a region of likely
observations due to random fluctuation,
and this covers 95% of all outcomes.
And in this interval
then the observed values
may still be from random fluctuation.
But if you observe a value in this
region or a difference on this side,
then the difference is unlikely
from random fluctuation.
Right, so there is a very small
probability that you will observe
such a difference just because
of random fluctuation.
So in that case, we can then conclude
the difference must be real.
So System B is indeed better.
So, this is the idea of
the statistical significance test.
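The picture with the 95% region can be sketched with Python's standard library. This is a simplified illustration, not a full test: it assumes the observed difference follows a normal distribution with mean zero under the no-difference hypothesis, and it takes the standard error of the difference as a given input rather than estimating it from the per-query results.

```python
from statistics import NormalDist


def outside_likely_region(observed_difference, standard_error, coverage=0.95):
    """True if the observed difference falls outside the central region
    that covers `coverage` of outcomes under the no-difference assumption."""
    # Null distribution: mean zero, spread given by the (assumed known) standard error.
    null_distribution = NormalDist(mu=0.0, sigma=standard_error)
    upper_bound = null_distribution.inv_cdf(0.5 + coverage / 2)
    return abs(observed_difference) > upper_bound


# With a standard error of 1, the 95% region is roughly -1.96 to +1.96.
print(outside_likely_region(0.5, standard_error=1.0))  # inside: may be random
print(outside_likely_region(2.5, standard_error=1.0))  # outside: unlikely to be random
```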
The takeaway message here is that
you have to use many queries to avoid
jumping to a conclusion, as in this
case, to say System B is better.
There are many different ways of doing
this statistical significance test.
So now, let's talk about the other
problem of making judgements and
as we said earlier,
it's very hard to judge all the documents
completely unless it is a small data set.
So the question is, if we can't
afford judging all the documents
in the collection,
which subset should we judge?
And the solution here is pooling.
And this is a strategy that has been used
in many cases to solve this problem.
So the idea of pooling is the following.
We would first choose a diverse
set of ranking methods,
these are types of retrieval systems.
And we hope these methods
can help us nominate
likely relevant documents.
So the goal is to pick out
the relevant documents.
It means we want to make judgements
on relevant documents because those
are the most useful documents
from the user's perspective.
So, we would have each
system return its top-K documents.
And the K can vary across systems, right?
But the point is to ask them to suggest
the most likely relevant documents.
And then we simply combine
all these top-K sets to form
a pool of documents for
human assessors to judge.
So, imagine you have many systems.
Each will return K documents, you know,
take the top-K documents, and
we form the union.
Now, of course there are many
documents that are duplicated,
because many systems might have
retrieved the same relevant documents.
So there will be some duplicate documents.
And there
are also unique documents that are
only returned by one system. So the idea
of having a diverse set of ranking
methods is to ensure the pool is broad
and can include as many
relevant documents as possible.
And then the
human assessors would make
complete judgements on this data set,
this pool.
And the other, unjudged documents are
usually just assumed to be non-relevant.
Now if the pool is large enough,
this assumption is okay.
But if the pool is not very large,
this assumption has to be reconsidered,
and we might use other
strategies to deal with the unjudged documents;
there are indeed other
methods to handle such cases.
And such a strategy is generally okay for
comparing systems that
contribute to the pool.
That means if you participate in
contributing to the pool then it's
unlikely that it will penalize
your system because the top
ranked documents have all been judged.
However, this is problematic for
evaluating a new system that may
not have contributed to the pool.
In this case, you know, a new system
might be penalized because it might have
nominated some relevant documents
that have not been judged.
So those documents might be
assumed to be non-relevant.
And that's unfair.
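To make the pooling strategy concrete, here is a small sketch. The system names, document IDs, and judgments are all hypothetical. It forms the pool as the union of each system's top-K documents, treats unjudged documents as non-relevant, and shows how a new system that did not contribute to the pool can be penalized.

```python
def build_pool(rankings, k):
    """Union of the top-k documents from each system's ranking."""
    pool = set()
    for ranked_docs in rankings.values():
        pool.update(ranked_docs[:k])
    return pool

def is_relevant(doc, judgments):
    """Unjudged documents default to non-relevant, as in pooling."""
    return judgments.get(doc, False)

# Hypothetical rankings from three systems that contribute to the pool.
rankings = {
    "system_a": ["d1", "d2", "d3", "d4"],
    "system_b": ["d2", "d5", "d1", "d6"],
    "system_c": ["d7", "d2", "d8", "d9"],
}
pool = build_pool(rankings, k=2)

# Assessors judge only the documents in the pool (made-up judgments).
judgments = {"d1": True, "d2": True, "d5": False, "d7": True}

# A new system returns d9, which was never judged, so it counts as non-relevant
# even if a human would have judged it relevant.
new_system = ["d9", "d2", "d1"]
precision = sum(is_relevant(d, judgments) for d in new_system) / len(new_system)
print(sorted(pool), precision)
```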
So to summarize the whole part
of text retrieval evaluation,
it's extremely important because the
problem is an empirically defined problem.
If we don't rely on users, there's no way
to tell whether one method works better.
If we have inappropriate
experiment design,
we might misguide our research or
applications.
And we might just draw wrong conclusions.
And we have seen this in
some of our discussion.
So, make sure to get it right for
your research or application.
The main methodology is the Cranfield
evaluation methodology, and
this is the main paradigm used in
all kinds of empirical evaluation tasks,
not just search engine evaluation.
MAP and nDCG are the two main measures
that you should definitely know about, and
they are appropriate for
comparing ranking algorithms.
You will see them often
in research papers.
Precision at ten documents is easier
to interpret from a user's perspective.
So, that's also often useful.
What's not covered is some other
evaluation strategies, like the A-B test,
where the system would mix
the results of two methods randomly
and then show the mixed
results to users.
Of course, the users don't see
which result is from which method.
The users would judge those results or
click on those documents in
a search engine application.
In this case, then, the search engine can
keep track of the clicked documents, and
see if one method has contributed
more to the clicked documents.
If the user tends to click on
the results from one method,
then it suggests that method
may be better.
So this leverages the real users
of a search engine to do evaluation.
It's called the A-B test, and
it's a strategy that's often used by
the modern search engines,
the commercial search engines.
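Here is one simple way such mixing and click tracking could be sketched. This is only an illustration with hypothetical document IDs and clicks, not the procedure of any particular search engine. The two result lists are mixed randomly, the source of each result is remembered, and each click is credited to the method that supplied the document.

```python
import random

def interleave(results_a, results_b, rng):
    """Randomly mix two rankings, remembering which method supplied each result."""
    queues = {"A": list(results_a), "B": list(results_b)}
    mixed, source, seen = [], {}, set()
    while queues["A"] or queues["B"]:
        # Pick at random among the methods that still have results left.
        method = rng.choice([m for m in ("A", "B") if queues[m]])
        doc = queues[method].pop(0)
        if doc not in seen:  # show a document only once
            seen.add(doc)
            mixed.append(doc)
            source[doc] = method
    return mixed, source

def credit_clicks(clicked_docs, source):
    """Count how many clicked documents each method contributed."""
    credit = {"A": 0, "B": 0}
    for doc in clicked_docs:
        credit[source[doc]] += 1
    return credit

rng = random.Random(0)
mixed, source = interleave(["d1", "d2", "d3"], ["d4", "d5", "d6"], rng)

clicks = ["d4", "d5", "d1"]  # hypothetical clicks observed on the mixed list
credit = credit_clicks(clicks, source)
print(mixed, credit)
```

In practice the credit would be accumulated over many users and queries before drawing any conclusion about which method is better.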
Another way to evaluate IR or
text retrieval is user studies,
and we haven't covered that.
I've put some references here that you can
look at if you want to
know more about that.
So there are three
additional readings here,
these are three mini
books about evaluation.
And they are all excellent in covering a
broad review of information retrieval
evaluation.
And they cover some of
the things that we discussed,
but they also have a lot
more to offer.
[MUSIC]

[SOUND] This lecture is about
the probabilistic retrieval model.
In this lecture, we're going to continue
the discussion of text retrieval methods.
We're going to look at another kind of
very different way to design ranking
functions than the Vector Space Model
that we discussed before.
In probabilistic models we define
the ranking function based
on the probability that this
document is relevant to this query.
In other words, we introduce
a binary random variable here.
This is the variable R here.
And we also assume that the query and
the documents are all observations
from random variables.
Note that in the vector space model,
we assume they are vectors.
But here we assume they are
data observed from random variables.
And so the problem of retrieval
becomes to estimate
the probability of relevance.
In this category of models,
there are different variants.
The classic probabilistic model has
led to the BM25 retrieval function,
which we discussed in
the vector space model,
because its form is actually
similar to a vector space model.
In this lecture,
we're going to discuss another subclass in
this big class called the language
modeling approaches to retrieval.
In particular, we're going to discuss
the Query Likelihood retrieval model,
which is one of the most effective
models in probabilistic models.
There is also another line called
the divergence-from-randomness model,
which has led to the PL2 function.
It's also one of the most effective
state-of-the-art retrieval functions.
In query likelihood, our assumption
is that this probability of relevance
can be approximated by the probability
of the query given a document and relevance.
So, intuitively, this probability just
captures the following probability.
And that is if a user likes document d,
how likely would
the user enter query q in
order to retrieve document d.
So we'll assume that the user likes d,
because we have a relevance value here.
And then we ask the question about
how likely we will see this
particular query from this user?
So this is the basic idea.
Now to understand this idea,
let's take a look at the general idea or
the basic idea of probabilistic
retrieval models.
So here, I listed some imagined
relevance status values or
relevance judgments of queries and
documents.
For example, in this slide,
it shows that q1
is a query that the user typed in,
d1 is a document the user has seen, and
one means the user thinks
d1 is relevant to q1.
So this R here can also be approximated
by the clickthrough data that the search
engine can collect by watching how
you interact with the search results.
So, in this case, let's say,
the user clicked on this document, so
there's a one here.
Similarly, the user clicked on d2 also,
so there's a one here.
In other words,
d2 is assumed to be relevant to q1.
On the other hand, d3 is non-relevant,
there's a zero here.
And d4 is non-relevant and then d5 is
again relevant and so on and so forth.
And this part, maybe,
is collected from a different user.
Right.
So this user typed in q1 and
then found that d1 is actually not useful,
so d1 is actually non-relevant.
In contrast, here we see it's relevant.
Or this could be the same query typed
by the same user at different times.
But d2 is also relevant, et cetera.
And then here, we can see more
data about other queries.
Now we can imagine
we have a lot of search data.
Now we can ask the question,
how can we then estimate
the probability of relevance?
Right.
So
how can we compute this
probability of relevance?
Well, intuitively,
that just means if we look at
all the entries where we see this
particular d and this particular q,
how likely will we see
a one in the third column?
Basically, that just means
we can collect the counts.
We can first count how many
times we see q and
d as a pair in this table and
then count how many times
we have also seen
one in the third column, and
then we just compute the ratio.
So let's take a look at
some specific examples.
Suppose we are trying to compute this
probability for d1, d2 and d3 for q1.
What is the estimated probability?
Now think about that.
You can pause the video if needed.
Try to take a look at the table and
try to give your estimate
of the probability.
You may have seen that if we are interested
in q1 and d1, we'll be looking at
these two pairs, and
in only one of the cases
the user has said that this is one,
this is relevant.
So R is equal to 1 in only
one of the two cases.
In the other case, this is zero.
So that's one out of two.
What about q1 and d2?
Well, they are here:
q1, d2 and q1, d2.
In both cases,
R is equal to 1.
So, it's two out of two, and
so on and so forth.
So you can see with this approach,
we can actually score these documents for
the query.
Right?
We now have a score for d1,
d2 and d3 for this query.
We can simply rank them based
on these probabilities, and so
that's the basic idea of the
probabilistic retrieval model.
And you can see, it makes a lot of sense.
In this case, it's going to rank
d2 above all the other documents.
Because in all the cases, when you
have seen q1 and d2, R is equal to 1.
The user clicked on this document.
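The counting estimate can be sketched in a few lines. The log below is only a partial reconstruction of the table from the spoken description, so the exact rows are an assumption, and the function name is made up for this example.

```python
from collections import defaultdict

# (query, document, R) rows; R = 1 means the user judged or clicked the document.
# Partial reconstruction of the table on the slide (an assumption).
log = [
    ("q1", "d1", 1), ("q1", "d2", 1), ("q1", "d3", 0), ("q1", "d4", 0), ("q1", "d5", 1),
    ("q1", "d1", 0), ("q1", "d2", 1),
]

def estimate_relevance(rows):
    """Estimate p(R=1 | q, d) as count(q, d, R=1) / count(q, d)."""
    seen = defaultdict(int)
    relevant = defaultdict(int)
    for q, d, r in rows:
        seen[(q, d)] += 1
        relevant[(q, d)] += r
    return {pair: relevant[pair] / seen[pair] for pair in seen}

probs = estimate_relevance(log)
ranking = sorted(["d1", "d2", "d3"], key=lambda d: probs[("q1", d)], reverse=True)
print(probs[("q1", "d1")], probs[("q1", "d2")], ranking)
```

Note that `probs` has no entry at all for a query-document pair that never appears in the log, and this is exactly the unseen-data problem.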
So this also shows that
with a lot of clickthrough data,
a search engine can learn a lot from
the data to improve the search engine.
This is a simple example that shows that
with even a small number of entries here,
we can already estimate
some probabilities.
These probabilities would give us some
sense about which document might be more
relevant or more useful to a user
typing this query.
Now, of course, the problem is that
we don't observe all the queries and
all of the documents and
all the relevance values.
Right?
There will be a lot of unseen documents.
In general, we can only collect data from
the documents that we have shown to
the users.
There are even more unseen queries,
because you cannot predict what
queries will be typed in by users.
So, obviously, this approach won't work
if we apply it to unseen queries or
unseen documents.
Nevertheless, this shows the basic idea
of the probabilistic retrieval model and
it makes sense intuitively.
So what do we do in such a case when we
have a lot of unseen documents and
unseen queries?
Well, the solution is that we have
to approximate in some way.
Right.
So, in this particular case, called
the Query Likelihood Retrieval Model,
we just approximate this
by another conditional probability,
p(q | d, R = 1).
So, in the condition part, we assume
that the user likes the document,
because we have seen that the user
clicked on this document.
And this part,
shows that we're interested in how likely
the user would actually enter this query.
How likely we will see this
query in the same row.
So note that here we have made
an interesting assumption.
Basically, we're going to assume
that whether the user types in this
query has something to do with
whether the user likes the document.
In other words, we actually
make the following assumption,
and that is, a user formulates a query based
on an imaginary relevant document.
Well, if you just look at this
as a conditional probability,
it's not obvious we
are making this assumption.
So what I really mean
is that to use this new
conditional probability to help us score
documents,
we have to somehow be able
to estimate this conditional
probability without
relying on this big table.
Otherwise, we would be having
similar problems as before.
And by making this assumption, we have
some way to bypass this big table and
try to just model how the
user formulates the query.
Okay.
So this is how we can simplify
the general model so that we can
derive a specific function later.
So let's look at how this model works for
our example.
And basically,
what we are going to do in this case
is to ask the following question.
Which of these documents is most likely
the imaginary relevant document in
the user's mind when the user
formulates this query?
And so we ask this question and
we quantify the probability and this
probability is a conditional probability
of observing this query if a particular
document is in fact the imaginary
relevant document in the user's mind.
Here you can see we compute all these
query likelihood probabilities,
the likelihood of queries
given each document.
Once we have these values,
we can then rank these documents
based on these values.
So to summarize, the general idea of
modeling relevance in the probabilistic
model is to assume that we introduce
a binary random variable, R here.
And then let the scoring function be
defined based on this conditional
probability.
We also talked about approximating this
by using the query likelihood.
And in this case,
we have a ranking function that's
basically based on a probability
of a query given the document.
And this probability should be
interpreted as the probability
that a user who likes document
d would pose query q.
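Written out, the two steps described in this lecture look roughly like this. The notation is an assumption, since the slide itself is not part of the transcript: f(q, d) is the ranking function, and count(.) counts rows in the table of queries, documents, and relevance values.

```latex
\[
f(q,d) \;=\; p(R=1 \mid d, q)
\;=\; \frac{\mathrm{count}(q, d, R=1)}{\mathrm{count}(q, d)}
\;\approx\; p(q \mid d, R=1)
\]
```

The middle ratio is the estimate from the table, which fails for unseen queries and documents; the last term is the query likelihood approximation.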
Now the question, of course, is how do
we compute this conditional probability?
And this in general has to do with how
to compute the probability of text,
because q is a text.
And this has to do with a model
called a Language Model.
And this kind of model
is proposed to model text.
So more specifically, we will be
very interested in the following
conditional probability, as I show
you here.
If the user likes this document, how
likely is it that the user would pose this query?
And in the next lecture, we're going
to give an introduction to Language Models,
so that we can see how we can model text
with a probabilistic model in general.
[MUSIC]

[SOUND] This lecture is about the feedback
in the language modeling approach.
In this lecture we will continue the
discussion of feedback in text retrieval.
In particular we're going to talk about
the feedback in language modeling
approaches.
So we derived the query likelihood ranking
function by making various assumptions.
As a basic retrieval function, that
formula, or those formulas, worked well.
But if we think about the feedback
information, it's a little bit awkward to
use query likelihood to
perform feedback, because
a lot of times the feedback information is
additional information about the query.
But we assume the query is
generated by sampling words
from a language model in
the query likelihood method.
It's kind of unnatural to sample
words that form feedback documents.
As a result, researchers proposed
a way to generalize the query
likelihood function.
It's called the Kullback-Leibler
divergence retrieval model.
And this model is actually
going to make the query likelihood
retrieval function much
closer to the vector space model.
Yet this form of the language model can
be regarded as a generalization of query
likelihood, in the sense that it can
cover query likelihood as a special case.
And in this case the feedback
can be achieved through
simply query model estimation or updating.
This is very similar to Rocchio
which updates the query vector.
So let's see what this
KL-divergence retrieval model is.
So, on the top, what you see is the query
likelihood retrieval function,
all right, this one.
And then the KL-divergence, or
also called cross entropy, retrieval
model is basically to
generalize the frequency part
here into a language model.
So basically the difference is that
a probabilistic model is used here
to characterize what the user is looking
for, versus the count of query words there.
And this difference allows us to plug in
various different ways to estimate this.
So this can be estimated in many different
ways, including using feedback information.
Now this is called KL-divergence because
this can be interpreted as measuring
the KL-divergence of two distributions.
One is the query model
denoted by this distribution.
One is the document
language model here.
And [INAUDIBLE] though is a [INAUDIBLE]
language model, of course.
And we are not going to talk
about the details of that, and
you'll find them in the references.
It's also called cross entropy
because, in fact,
we can ignore some terms in the
KL-divergence function and we will end up
having actually cross entropy, and
both are terms in information theory.
But, anyway for
our purposes here you can just see
the two formulas look almost identical,
except that here we have a probability of
a word given by a query language model.
This, and here,
the sum is over all the words
that are in the document,
and also with the non-zero probability for
the query model.
So it's kind of, again, a generalization
of sum over all the matching query words.
Now it's also easy to see that
we can recover the query likelihood
retrieval function here by simply
setting this query model to
the relative frequency of
a word in the query, right?
This is very easy to see
once you plug it in.
And here, you can eliminate this
query length, that's a constant,
and then you get exactly that.
So you can see the equivalence.
And that's also why this KL-divergence
model can be regarded as a generalization
of query likelihood because we can cover
query likelihood as a special case,
but it would also allow us
to do much more than that.
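The special case can be checked numerically with a small sketch. It uses the plain forms of the two scores, a sum of log p(w | d) weighted either by query word counts or by query model probabilities, rather than the rewritten formulas on the slide. The document model values and function names are hypothetical.

```python
import math
from collections import Counter

def query_likelihood(query_words, doc_model):
    """log p(q | d): query word counts times log p(w | d)."""
    counts = Counter(query_words)
    return sum(c * math.log(doc_model[w]) for w, c in counts.items())

def cross_entropy_score(query_model, doc_model):
    """Sum of p(w | theta_Q) * log p(w | d) over words with non-zero query probability."""
    return sum(p * math.log(doc_model[w]) for w, p in query_model.items() if p > 0)

# Hypothetical smoothed document model; only the entries needed here are listed.
doc_model = {"airport": 0.05, "security": 0.03, "the": 0.10, "news": 0.02}

query = ["airport", "security", "airport"]
# Query model set to the relative frequency of each word in the query.
query_model = {w: c / len(query) for w, c in Counter(query).items()}

ql = query_likelihood(query, doc_model)
ce = cross_entropy_score(query_model, doc_model)
print(ql, len(query) * ce)
```

The two scores differ only by the query length, which is the same for every document, so they rank documents identically.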
So this is how we use the KL-divergence
model to then do feedback.
The picture shows that we first
estimate a document language model,
then we estimate a query
language model, and
we compute the KL-divergence;
this is often denoted by a D here.
But this basically means
this is exactly like in the vector space
model, because we compute a vector for
the document, then compute
another vector for the query,
and then we compute the distance.
Only that these vectors
are of special forms:
they are probability distributions.
And then we get the results, and
we can find some feedback documents.
Let's assume they are
mostly positive documents,
although we could also consider
both kinds of documents.
So what we could do is, like in Rocchio,
we can compute another language model
called feedback language model here.
Again, this is going to be another vector,
just like computing a centroid vector in
Rocchio.
And then this model can be
combined with the original
query model using a linear interpolation.
And this would then give us an updated
model, just like, again, in Rocchio.
Right, so here we can see the parameter
alpha controlling the amount of feedback. If
it's set to 0,
then there's no feedback.
If it's set to 1, we've got full feedback,
and we ignore the original query.
And this is generally not desirable,
right,
unless you are absolutely sure you
have seen a lot of relevant documents and
the query terms are not important.
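The linear interpolation can be sketched as follows. The interpolation parameter is written here as `alpha` (an assumed name), and the two word distributions are made-up examples.

```python
def interpolate(query_model, feedback_model, alpha):
    """Updated query model: (1 - alpha) * original + alpha * feedback."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    vocab = set(query_model) | set(feedback_model)
    return {
        w: (1 - alpha) * query_model.get(w, 0.0) + alpha * feedback_model.get(w, 0.0)
        for w in vocab
    }

# Hypothetical models for the query "airport security".
query_model = {"airport": 0.5, "security": 0.5}
feedback_model = {"airport": 0.3, "security": 0.3, "bomb": 0.2, "terrorist": 0.2}

updated = interpolate(query_model, feedback_model, alpha=0.5)
print(updated)
```

With `alpha=0.0` the result is the original query model, and with `alpha=1.0` it is the feedback model alone.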
So of course the main question here
is how do you compute this theta F?
This is the big question here.
And once you can do that,
the rest is easy.
So here we'll talk about
one of the approaches.
And there are many approaches of course.
This approach is based on a generative model
and I'm going to show you how it works.
This uses a generative mixture model.
So this picture shows that
we have this model here,
the feedback model that
we want to estimate.
And the basis is the feedback documents.
Let's say we are observing
the positive documents.
These are the documents clicked by
users, or relevant documents judged by
users, or simply top-ranked documents
that we assume to be relevant.
Now imagine how we can
compute a centroid for
these documents by using language model.
One approach is simply to assume
these documents are generated from
this language model as we did before.
What we could do is
just normalize the word frequency here.
And then
we'll get this word distribution.
Now the question is whether this
distribution is good for feedback.
Well, you can imagine, the top-ranked
words would be what?
What do you think?
Well, those words would be common words,
right?
As we usually see in a language model,
the top-ranked words are actually
common words like "the", et cetera.
So, it's not very good for feedback,
because we will be adding a lot of such
words to our query when we interpolate
this with the original query model.
So, this is not good, so
we need to do something. In particular,
we are trying to get rid
of those common words.
And we have seen actually one way
to do that, by using a background language
model in the case of learning
the associations of words, right?
The words that are related
to the word computer.
We could do that, and
that would be another way to do this.
But here, we're going to
talk about another approach,
which is a more principled approach.
In this case, we're going to say, well,
you see that there are common words
here in these documents that should
not belong to this topic model, right?
So now, what we can do is to assume that,
well, those words are generated
from the background language model,
so it will generate
those words, like "the", for example.
And if we use the maximum
likelihood estimator,
note that if all the words here
must be generated from this model,
then this model is forced to assign
high probabilities to a word like "the",
because it occurs so frequently here.
Note that in order to reduce its
probability in this model, we have to
have another model, which is this one,
to help explain the word "the" here.
And in this case,
it's appropriate to use the background
language model to achieve this goal
because this model will assign high
probabilities to these common words.
So in this approach, then, we assume
the machine that generated
these words would work as follows.
We have a source controller here.
Imagine we flip a coin here to
decide what distribution to use.
With probability lambda
the coin shows up as heads.
And then we're going to use
the background language model.
And we can then sample
a word from that model.
With probability 1 minus lambda,
we decide to use an unknown topic
model here that we will try to estimate.
And we're going to then
generate a word here.
If we make this assumption, then this
whole thing will be just one model, and
we call this a mixture model,
because there are two distributions
that are mixed here together.
And we actually don't know when
each distribution is used.
Right, so again think of this
whole thing as one model.
And we can still ask it for words, and
it will still give us a word
in a random manner, right?
And of course which word will show up
will depend on both this distribution and
that distribution.
In addition,
it would also depend on this lambda,
because if you say
lambda is very high,
it's going to almost always use the background
distribution, and you'll get different words.
If you say, well, our lambda is very small,
it's going to mostly use this topic model, all right?
So all these are parameters,
in this model.
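The generation process of this mixture model can be sketched directly. The two word distributions below are hypothetical, and `lam` stands for lambda, the probability of choosing the background model.

```python
import random

def sample_word(background, topic, lam, rng):
    """Flip a biased coin, then draw one word from the chosen distribution."""
    # With probability lam use the background model, otherwise the topic model.
    model = background if rng.random() < lam else topic
    words = list(model)
    return rng.choices(words, weights=[model[w] for w in words], k=1)[0]

# Hypothetical word distributions (each sums to 1).
background = {"the": 0.6, "of": 0.3, "airport": 0.1}
topic = {"airport": 0.5, "security": 0.4, "the": 0.1}

rng = random.Random(42)
high_lambda = [sample_word(background, topic, 0.9, rng) for _ in range(2000)]
low_lambda = [sample_word(background, topic, 0.1, rng) for _ in range(2000)]

print(high_lambda.count("the"), low_lambda.count("the"))
```

With a high lambda the common word "the" shows up much more often, because most words come from the background distribution.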
And then, if you're thinking this way,
basically we can do exactly the same as
what we did before, we're going to use
maximum likelihood estimator to adjust
this model to estimate the parameters.
Basically we're going to adjust
this parameter so
that we can best explain all the data.
The difference now is that we are not
asking this model alone to explain this.
But rather we're going to ask
this whole model, the mixture model,
to explain the data, because it has got
some help from the background model.
It doesn't have to assign high
probabilities to words like "the"
as a result.
It would then assign high probabilities
to other words that are common here but
not having high probability here.
So those would be common here.
Right?
And if they're common they would
have to have high probabilities,
according to a maximum
likelihood estimator.
And if they are rare here,
all right, so if they are rare here,
then you don't get much help
from this background model.
As a result, this topic model
must assign high probabilities.
So the higher probability words
according to the topic model
will be those that are common here,
but rare in the background.
Okay, so, this is basically a little
bit like IDF weighting here.
This would allow us to achieve
the effect of removing these stop words
that are meaningless in the feedback.
So mathematically what we have is
to compute the likelihood, or again, the
log-likelihood of
the feedback documents.
And note that we also have
another parameter, lambda, here.
But we assume that lambda denotes
noise in the feedback documents.
So we are going to
set this to a fixed value; let's
say 50% of the words are noise,
or 90% are noise.
And this can then be
assumed to be fixed.
If we assume this is fixed, then we only
have these probabilities as parameters
just like in the simplest unigram
language model, we have n parameters.
n is the number of words, and then the
likelihood function will look like this.
It's very similar to the
normal likelihood
function we saw before, except that inside
the logarithm there's a sum here.
And this sum is because we
consider the two distributions.
And which one is used would depend on
lambda, and that's why we have this form.
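The log-likelihood described here can be written as follows. The notation is an assumption, since the slide is not part of the transcript: F is the set of feedback documents, c(w, F) is the count of word w in F, theta is the topic (feedback) model, and p(w | C) is the background model.

```latex
\[
\log p(F \mid \theta) \;=\; \sum_{w \in V} c(w, F)\,
\log\bigl[(1-\lambda)\, p(w \mid \theta) \;+\; \lambda\, p(w \mid C)\bigr]
\]
```

The sum inside the logarithm has one term for each of the two distributions, weighted by the probability of choosing it.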
But mathematically this is the function
with theta as unknown variables, right?
So, this is just a function.
All the other variables are known,
except for this guy.
So, we can then choose this
probability distribution to
maximize this log likelihood.
The same idea as the maximum
likelihood estimator.
As a mathematical problem,
we just have to solve this
optimization problem.
We can imagine trying all
of the theta values, and
then we find the one that gives this
whole thing the maximum probability.
So, it's a well-defined math problem.
Once we have done that,
we obtain this theta F,
which can be interpolated with
the original query model to do feedback.
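The lecture does not say here how the optimization is solved. One standard way to maximize this kind of mixture likelihood is the EM algorithm, and the following is a minimal sketch of it under that assumption, with lambda (`lam`) and the background model held fixed. The word counts and background probabilities are made up.

```python
def estimate_topic_model(counts, background, lam, iterations=50):
    """Fit p(w | theta_F) by EM, with the background model and lam held fixed."""
    total = sum(counts.values())
    theta = {w: c / total for w, c in counts.items()}  # start from relative frequencies
    for _ in range(iterations):
        # E-step: probability that an occurrence of w came from the topic model.
        from_topic = {
            w: (1 - lam) * theta[w] / ((1 - lam) * theta[w] + lam * background[w])
            for w in counts
        }
        # M-step: re-estimate theta from the counts credited to the topic model.
        credited = {w: counts[w] * from_topic[w] for w in counts}
        norm = sum(credited.values())
        theta = {w: credited[w] / norm for w in counts}
    return theta

# Hypothetical word counts in the feedback documents and background probabilities.
counts = {"the": 50, "airport": 20, "security": 20, "bomb": 10}
background = {"the": 0.5, "airport": 0.001, "security": 0.002, "bomb": 0.0005}

theta_f = estimate_topic_model(counts, background, lam=0.5)
print({w: round(p, 3) for w, p in theta_f.items()})
```

In this toy example "the" makes up half of the observed words, yet it ends up with a much lower probability in the estimated topic model, because the background model already explains most of its occurrences.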
So here are some examples of
the feedback model learned from a web
document collection, and
we do pseudo-feedback.
We just use the top 10 documents,
and we use this mixture model.
So the query is airport security.
What we do is we first retrieve ten
documents from the web database.
And this is of course pseudo-feedback,
right?
And then we're going to fit the
mixture model to this ten-document set.
And these are the words
learned using this approach.
This is the probability of a word given
by the feedback model in both cases.
So, in both cases, you can see
the highest-probability words
include words very relevant to the query.
So, airport, security, for example,
these query words still show
up with high probabilities
in each case, naturally, because they occur
frequently in the top-ranked documents.
But we also see beverage, alcohol,
bomb, terrorist, et cetera.
Right, so these are relevant
to this topic, and
if combined with the original query, they can help
us match documents more accurately.
And also they can help us bring up
documents that only mention
some of these other words,
maybe, for example, just airport and
then bomb.
So
this is how pseudo-feedback works.
It shows that this model really works and
picks up
some words related to the query.
What's also interesting is that if
you look at the two tables here and
compare them, you see that in this
case, when lambda is set to a small value,
we'll still see some common
words here. That means that
when we don't use the background
model often (remember, lambda
is the probability of using the
background model to generate the text),
if we don't rely much on the background model,
we still have to use this topic model
to account for the common words.
Whereas if we set lambda to a very
high value, we would use the background
model very often to explain these words,
and then there is no burden on the topic model
to explain those common words in the
feedback documents.
So, as a result, the topic model
here is very discriminative.
It contains all the relevant
words without common words.
So this can be added to the original
query to achieve feedback.
So to summarize, in this lecture we
have talked about feedback in the
language modeling approach.
In general,
feedback is to learn from examples.
These examples can be assumed examples,
or pseudo-examples,
like when the top ten
documents are assumed to be relevant.
They could be based on user
interactions, like feedback
based on clickthroughs, or implicit feedback.
We talked about the three major
feedback scenarios, relevance feedback,
pseudo-feedback, and implicit feedback.
We talked about how to use Rocchio to
do feedback in vector-space model and
how to use query model estimation for
feedback in language model.
And we briefly talked about
the mixture model and
the basic idea and
there are many other methods.
For example the relevance model
is a very effective model for
estimating query model.
So, you can read more about
these methods in the references that
are listed at the end of this lecture.
So there are two additional readings here.
The first one is a book that
has a systematic review and
discussion of language models
for information retrieval.
And the second one is an important
research paper that's about relevance-based
language models, and it's a very
effective way of computing the query model.
[MUSIC]

[SOUND].
This lecture is about
a statistical language model.
In this lecture,
we're going to give an introduction
to statistical language models.
This has to do with how we
model text data with probabilistic models.
So, it's related to how we model
a query based on a document.
We're going to talk about what
a language model is, and then we're going to
talk about the simplest language model,
called a unigram language model,
which also happens to be the most
useful model for text retrieval.
And finally we'll discuss
possible uses of a language model.
What is a language model?
Well, it's just a probability
distribution over word sequences.
So, here I show one.
This model gives the sequence "Today is
Wednesday" a probability of 0.001. It gives
"Today Wednesday is" a very, very small
probability, because it's ungrammatical.
You can see the probabilities
given to these sentences or
sequences of words can vary
a lot depending on the model.
Therefore, it's clearly context-dependent.
In ordinary conversation,
probably "Today is Wednesday" is the most
popular among these sentences.
But imagine in the context of
discussing applied math,
maybe "the eigenvalue is positive"
would have a higher probability.
This means it can be used to
represent the topic of a text.
The model can also be regarded
as a probabilistic mechanism for
generating text, and this is why it
is often called a generative model.
So, what does that mean?
We can imagine this as
a mechanism that's visualized
here as a [INAUDIBLE] system that
can generate sequences of words.
So we can ask for a sequence, and
that is to sample a sequence
from the device, if you want.
And it might generate, for
example, "Today is Wednesday", but
it could have generated
many other sequences.
So
there are many possibilities, right?
So in this sense,
we can view our data as basically a sample
observed from such a generative model.
So why is such a model useful?
Well, it's mainly because it can quantify
the uncertainties in natural language.
Where do uncertainties come from?
Well, one source is
simply the ambiguity in
natural language that we
discussed earlier in the lecture.
Another source is because we don't
have complete understanding.
We lack all the knowledge
to understand language.
In that case there will
be uncertainties as well.
So let me show some examples of questions
that we can answer with a language model
that would have interesting
applications in different ways.
Given that we see John and feels,
how likely will we see
happy as opposed to habit
as the next word in a sequence of words?
Obviously this would be very useful for
speech recognition, because happy and
habit would have similar
acoustic signals.
But if we look at the language model
we know that John feels happy would be
far more likely than John feels habit.
Another example: given that we
observe baseball three times and
game once in a news article,
how likely is it about sports?
This obviously is related to text
categorization and information retrieval.
Also, given that a user is
interested in sports news,
how likely would the user
use baseball in a query?
Now this is clearly related to the query
likelihood that we discussed in the
previous lecture.
So now let's look at
the simplest language model,
called a unigram language model.
In such a case,
we assume that we generate the text by
generating each word independently.
So this means the probability of
a sequence of words would be then
the product of
the probability of each word.
Now normally they are not independent,
right?
So if you have seen a word like language,
that will make it far more
likely to observe model
than if you haven't seen language.
So this assumption is
not necessarily true, but
we'll make this assumption
to simplify the model.
So now, the model has precisely N
parameters, where N is the vocabulary size.
We have one probability for each word, and
all these probabilities must sum to 1.
So strictly speaking,
we actually have N minus 1 parameters.
As I said,
text can then be assumed to be a sample
drawn from this word distribution.
So for example,
now we can ask the device, or the model,
to stochastically generate the words for
us instead of in sequences.
So instead of giving a whole
sequence like today is Wednesday,
it now gives us just one word.
And we can get all kinds of words.
And we can assemble these
words in a sequence.
So, that would still allow you to
compute the probability of today is
Wednesday as the product of the
three probabilities.
As you can see, even though we have
not asked the model to generate
the sequences, it actually allows
us to compute the probability for
all the sequences.
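
To make the product rule concrete, here is a
minimal Python sketch. The word probabilities
are made-up illustrative values (only a few
entries of a larger vocabulary are shown), and
the function name is hypothetical.

```python
import math

# Hypothetical unigram model: one probability per word.
# Values are invented for illustration; a real model covers the
# whole vocabulary and its probabilities sum to 1.
unigram = {"today": 0.02, "is": 0.05, "wednesday": 0.001}

def sequence_probability(words, model):
    """Probability of a word sequence under the independence assumption:
    the product of the individual word probabilities."""
    prob = 1.0
    for w in words:
        prob *= model.get(w, 0.0)  # words missing from the model get 0
    return prob

p = sequence_probability(["today", "is", "wednesday"], unigram)
p_shuffled = sequence_probability(["today", "wednesday", "is"], unigram)
```

Note that under this model the word order does
not matter: the shuffled sequence gets the same
probability, because each word is generated
independently.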
But this model now only needs
N parameters to characterize.
That means if we specify
all the probabilities for
all the words then the model's
behavior is completely specified.
Whereas if we don't make this
assumption, we would have to specify
probabilities for all kinds of
combinations of words in sequences.
So by making this assumption, it makes it
much easier to estimate these parameters.
So let's see a specific example here.
Here I show two unigram language
models with some probabilities and
these are high probability
words that are shown on top.
The first one clearly suggests
the topic of text mining
because the high probability words
are all related to this topic.
The second one is more related to health.
Now, we can then ask the question how
likely we'll observe a particular text
from each of these two models.
Now suppose we sample
words to form a document.
Let's say we take the first distribution
and sample words from it.
What words do you think
would be generated? Maybe text,
or maybe mining, or maybe another word.
Even food,
which has a very small probability,
might still be able to show up.
But in general, high probability
words will likely show up more often.
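
The sampling process can be sketched in Python
as follows. The topic distribution here is a
made-up toy example, not the one on the slide,
and sample_words is a hypothetical helper.

```python
import random

# Hypothetical "text mining" word distribution (illustrative values
# only; they sum to 1 over this tiny vocabulary).
topic_model = {
    "text": 0.40,
    "mining": 0.30,
    "clustering": 0.20,
    "the": 0.09,
    "food": 0.01,  # low probability, but it can still be sampled
}

def sample_words(model, n, seed=None):
    """Draw n words independently from the unigram distribution."""
    rng = random.Random(seed)
    words = list(model)
    weights = [model[w] for w in words]
    return rng.choices(words, weights=weights, k=n)

doc = sample_words(topic_model, 10, seed=0)
print(" ".join(doc))
```

High probability words dominate a typical
sample, but any word with a nonzero probability
can appear.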
So we can imagine a generated
text that looks like text mining.
In fact, with a small probability,
you might be able to actually generate
an actual text mining paper that
would actually be meaningful, although
the probability would be very, very small.
In the extreme case, you might imagine
we might be able to generate
a text mining paper that
would be accepted by a major conference.
And in that case the probability
would be even smaller.
But it's a nonzero probability,
if we assume none of the words
has a zero probability.
Similarly from the second topic,
we can imagine we can generate a food and
nutrition paper.
That doesn't mean we cannot generate this
paper from text mining distribution.
We can, but the probability would be very,
very small, maybe smaller than even
generating a paper that can be accepted
by a major conference on text mining.
So the point here is
that given a distribution,
we can talk about the probability of
observing a certain kind of text.
Some text would have higher
probabilities than others.
Now, let's look at the problem
in a different way.
Suppose we now have
available a particular document.
In this case, maybe the abstract of
a text mining paper, and
we see these word counts here.
The total number of words is 100.
Now the question we ask here
is an estimation question.
We can ask the question: which model,
which word distribution, has been used
to generate this text,
assuming the text has been generated by
assembling words from the distribution?
So what would be your guess?
We have to decide what probabilities
text, mining, et cetera would have.
So pause the video for a second and
try to think about your best guess.
If you're like a lot of people
you would have guessed that well,
my best guess is text has
a probability of 10 out of 100
because I have seen text ten times and
there are a total of 100 words.
So we simply
normalize these counts.
And that's in fact [INAUDIBLE] justified.
And your intuition is consistent
with mathematical derivation.
And this is called the maximum
likelihood estimator.
In this estimator,
we assume that the parameter settings
are those that would give our
observed data the maximum probability.
That means if we change
these probabilities,
then the probability of observing the
particular text would be somewhat smaller.
So we can see this has
a very simple formula.
Basically, we just need to
look at the count of a word
in the document and then divide it by
the total number of words in the document,
or the document length.
That is the normalized frequency.
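
As a minimal sketch of this estimator in Python:
the toy document below is invented so that text
occurs 10 times out of 100 words, as in the
example; the other counts and the function name
are assumptions for illustration.

```python
from collections import Counter
import math

def mle_unigram(words):
    """Maximum likelihood estimate of a unigram model:
    count of each word divided by the document length."""
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# Toy 100-word document: "text" 10 times, "mining" 5 times,
# and a filler token for the remaining 85 positions.
doc = ["text"] * 10 + ["mining"] * 5 + ["other"] * 85
model = mle_unigram(doc)

p_text = model["text"]            # 10 / 100
p_food = model.get("food", 0.0)   # unseen word gets probability 0
```

The unseen word food gets probability zero here,
which is exactly the consequence discussed next.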
Well, a consequence of this,
of course, is that we're going to assign
zero probabilities to unseen words.
If we have an unobserved word,
there will be no incentive to assign
a nonzero probability using this approach.
Why?
Because that would take away probability
mass from the observed words.
And that obviously wouldn't
maximize the probability of this
particular observed data.
But one can still question whether
this is our best estimator.
Well, the answer depends on what kind
of model you want to find, right?
This estimate gives the best model
based on this particular data.
But if you're interested in a model
that can explain the content of the full
paper for this abstract, then you
might have second thoughts, right?
So for one thing, there should be other
words in the body of that article.
So they should not have
zero probabilities,
even though they are not
observed in the abstract.
We're going to cover this later,
in discussing the query likelihood model.
So, let's take a look at some possible
uses of these language models.
One use is simply to use
it to represent the topics.
So here it shows some general
English background text.
We can use this text to
estimate a language model.
And the model might look like this.
Right?
So on the top we'll have all those common
words, like the, a, is, and we, and then
we'll see some common words like these,
and then some very,
very rare words at the bottom.
This is the background language model.
It represents the frequency of words
in English in general, right?
This is the background model.
Now, let's look at another text.
Maybe this time, we'll look at
Computer Science research papers.
So we have a collection of computer
science research papers. We do the
estimation again; we can just use the
maximum likelihood estimator, where we
simply normalize the frequencies.
Now, in this case, we look at
the distribution, that looks like this.
On the top, it looks similar,
because these words occur everywhere,
they are very common.
But as we go down we'll see words that
are more related to computer science.
Computer, or software, or text et cetera.
So, although in the background model
we might also see these words,
for example, computer,
we can imagine the probability there
is much smaller than the probability here.
And we will see many
other words there that
would be more common
in general English.
So, you can see this distribution
characterizes a topic
of the corresponding text.
We can look at an even smaller text.
So, in this case let's look
at a text mining paper.
Now if we do the same, we have another
distribution. Again, the common words
can be expected to occur on the top.
Soon we will see text, mining,
association, clustering.
These words have relatively
high probabilities here; in contrast,
in the previous distribution they have
relatively small probabilities.
So this means,
again, based on different text data,
we can have a different model.
And the model captures the topic.
So we call this a document language model,
and we call this a collection language model.
And later, we'll see how they're
used in a retrieval function.
But now, let's look at
another use of this model.
Can we statistically find what words
are semantically related to computer?
Now how do we find such words?
Well, our first thought is, let's
take a look at the text that matches
computer.
So we can take a look at all the documents
that contain the word computer.
Let's build a language model.
Okay, see what words we see there.
Well, not surprisingly, we see these
common words on top as we always do.
So in this case,
this language model gives us the
conditional probability of seeing
a word in the context of computer.
And these common words will
naturally have high probabilities.
Other words, like computer itself and
software, will also have relatively
high probabilities.
But if we just use this model, we cannot
just say all these words
are semantically related to computer.
So intuitively, we'd like to get
rid of these common words.
How can we do that?
It turns out that it's possible
to use a language model to do that.
Now I suggest you think about that.
So how can we know what
words are very common so
that we want to kind of get rid of them.
What model will tell us that?
Well, maybe you can think about that.
So the background language model
precisely tells us this information.
It tells us what words
are common in general.
So if we use this background model,
we would know that these words
are common words in general.
So it's not surprising to observe
them in the context of computer.
Whereas computer has a very
small probability in general.
So it's very surprising that we have
seen computer with this probability.
And the same is true for software.
So then we can use these two
models to somehow figure out
the words that are related to computer.
For example, we can simply take the ratio
of these two probabilities, and normalize
the topic language model by the probability
of the word in the background model.
So if we do that, we take the ratio,
we'll see that on the top,
computer is ranked, and
then followed by software,
program, all these words
related to computer.
Because they occur very frequently
in the context of computer, but
not frequently in the whole collection.
Whereas these common words will
not have a high ratio.
In fact,
they have a ratio of about one down there.
Because they are not really
related to computer.
By taking the sample of text
that contains computer, we don't
really see more occurrences
of them than in general.
So this shows that even
with these simple language models,
we can do some limited
analysis of semantics.
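
The ratio idea can be sketched in a few lines of
Python. Both models below are invented toy
distributions (partial vocabularies, made-up
values), and relative_ratio is a hypothetical
helper name.

```python
def relative_ratio(context_model, background_model):
    """Divide each word's probability in the context (topic) model by
    its probability in the background model."""
    return {
        w: p / background_model[w]
        for w, p in context_model.items()
        if background_model.get(w, 0.0) > 0.0
    }

# Toy model estimated from documents containing "computer".
context = {"the": 0.05, "a": 0.03, "computer": 0.02, "software": 0.01}
# Toy general-English background model.
background = {"the": 0.05, "a": 0.03, "computer": 0.0001, "software": 0.0001}

ratios = relative_ratio(context, background)
ranked = sorted(ratios, key=ratios.get, reverse=True)
```

With these toy numbers, computer and software
rise to the top, while the common words end up
with a ratio of about one.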
So in this lecture,
we talked about the language model,
which is basically a probability
distribution over text.
We talked about the simplest language
model, called the unigram language model,
which is also just a word distribution.
We talked about two
uses of a language model.
One is to represent the topic in
a document, in a collection, or in general.
The other is to discover word associations.
In the next lecture, we're
going to talk about how
a language model can be used to
design a retrieval function.
Here are two additional readings.
The first is a textbook on statistical
natural language processing.
The second is an article that has
a survey of statistical language
models with other pointers
to research work.
[MUSIC]

This lecture is about the query likelihood
probabilistic retrieval model.
In this lecture,
we continue the discussion of
probabilistic retrieval models.
In particular,
we're going to talk about the query
likelihood retrieval function.
In the query likelihood retrieval
model, our idea is to model
how likely a user who likes a document
would pose a particular query.
So in this case, you can imagine,
if a user likes this particular document
about the presidential campaign news.
Then we can assume
the user would use this
document as a basis to pose a query
to try and retrieve this document.
So you can imagine the user, could use
a process that works as follows, where
we assume that the query is generated
by sampling words from the document.
So for example,
a user might pick a word like
presidential from this document,
and then use this as a query word.
And then the user would pick another word,
like campaign and
that would be the second query word.
Now this, of course,
is an assumption that we have made about
how a user would pose a query.
Whether a user actually
followed this process
may be a different question.
But this assumption
has allowed us to formally characterize
this conditional probability.
And this allows us to also not rely on
the big table that I showed you earlier
to use empirical data to
estimate this probability.
And this is why we can use this
idea to then further derive
a retrieval function that we can
implement with programming languages.
So, as you see, the assumption that
we've made here is that each query word
is independently sampled, and also,
each word is basically
obtained from the document.
So now let's see how this works exactly.
Well, since we are computing
a query likelihood,
then the probability here is just
the probability of this particular query,
which is a sequence of words.
And we make the assumption that each
word is generated independently.
So, as a result, the probability
of the query is just a product
of the probability of each query word.
Now, how do we compute
the probability of each query word?
Well, based on the assumption,
that a word is picked from the document,
that the user has in mind.
Now we know the probability
of each word is just
the relative frequency of
the word in the document.
So, for example, the probability of
presidential given the document
would be just the count of
presidential in the document,
divided by the total number of words
in the document, or document length.
So with these assumptions,
we now have an actual simple formula for
retrieval, right?
We can use this to rank our documents.
So does this model work?
Let's take a look, here are some example
documents that you have seen before.
Suppose now the query is
presidential campaign.
And we see the formula here on the top.
So how do we score these documents?
Well it's very simple, right,
we just count how many times we have seen
presidential, how many times
we have seen campaign etc.
And we see here, for d4,
we've seen presidential twice,
so that's 2 over the length
of document d4,
multiplied by 1 over the length of
document d4 for the probability of
campaign. And similarly, we can compute
probabilities for the other two documents.
Now if you look at these numbers,
or these formulas for
scoring all these documents, it seems to
make sense because, if we assume d3 and
d4 have about the same length,
then it looks like we will rank d4
above d3, which is above d2, right?
And as we would expect, looks like
it did capture the tf heuristic.
And so this seems to work well.
However, if we try a different
query like this one,
presidential campaign update,
then we might see a problem.
But what problem?
Well, think about update. None of
these documents has mentioned update.
So according to our assumption that
a user would pick a word from a document
to generate a query,
the probability of obtaining
a word like update would be what?
It would be zero, right?
So that causes a problem,
because it would cause all these documents
to have zero probability
of generating this query.
Now, while it's fine to have a zero
probability for d2, which is not relevant,
it's not okay to have zero for d3 and
d4, because now we no longer
can distinguish them.
What's worse,
we can't even distinguish them from d2.
All right, so
that's obviously not desirable.
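
Here is a small Python sketch of this scoring
and of the zero probability problem. The three
documents are invented stand-ins of equal length
(not the exact text on the slide), and
query_likelihood_mle is a hypothetical name.

```python
from collections import Counter

def query_likelihood_mle(query, doc):
    """Product over query words of count(word, doc) / len(doc),
    i.e. the unsmoothed (maximum likelihood) query likelihood."""
    counts = Counter(doc)
    score = 1.0
    for w in query:
        score *= counts[w] / len(doc)  # Counter returns 0 for unseen words
    return score

# Toy documents, all of length 6 so that lengths do not affect the order.
docs = {
    "d2": "news about organic food campaign today".split(),
    "d3": "news of the presidential campaign today".split(),
    "d4": "news of presidential campaign presidential candidate".split(),
}

q1 = ["presidential", "campaign"]
q2 = ["presidential", "campaign", "update"]

scores_q1 = {name: query_likelihood_mle(q1, d) for name, d in docs.items()}
scores_q2 = {name: query_likelihood_mle(q2, d) for name, d in docs.items()}
```

For the first query the order is d4, d3, d2. For
the second query every score becomes zero,
because no document contains update.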
Now when one has such a result, we should
think about what has caused this problem.
So we have to examine what
assumptions have been made
as we derived this ranking function.
Now if you examine those assumptions
carefully, you would realize
what has caused this problem, right?
So, take a moment to think about
what you think is the reason why
update has zero probability,
and how do we fix it?
Right?
So, if you think about this for
a moment, you'll realize that
that's because we have made an assumption
that every query word must be drawn
from the document in the user's mind.
So, in order to fix this,
we have to assume that,
the user could have drawn a word,
not necessarily from the document.
So let's see an improved model.
An improvement here is to say that,
well, instead of drawing a word from
the document, let's imagine that the user
would actually draw a word from a document
model and so I show a model here.
Here we assume that this
document is generated
by using this unigram language model.
Now, this model doesn't necessarily
assign zero probability to update.
In fact, we assume this model does not
assign zero probability to any word.
Now if we're thinking this
way then the generation
process is a little bit different.
Now the user has this model in mind,
instead of this particular document.
Although the model has to be
estimated based on the document.
So the user can again generate
the query using a similar process.
They may pick a word, for example
presidential, and another word, campaign.
Now the difference is that this time
we can also pick a word like update,
even though update
doesn't occur in the document,
to potentially generate
a query word like update.
So a query with update
will not have zero probability.
So this would fix our problem, and
it's also reasonable,
because we're now thinking of what the
user is looking for in a more general way,
that is, as a unigram language model
instead of a fixed document.
So how do we compute this query
likelihood? If we make this assumption,
it involves two steps, right?
The first is to compute this model, and we
call it the document language model here.
For example, I have shown two
possible language models here,
estimated based on two documents.
And then given a query, say
data mining algorithms,
the second step is just to compute
the likelihood of this query.
And by making independence assumptions,
we could then have this probability as
a product of the probability
of each query word, all right?
But we do this for both documents.
And then we're going to score these
two documents and then rank them.
So that's the basic idea of this
query likelihood retrieval function.
So more generally, this ranking
function would look like the following.
Here we assume that the query
has n words, w1 through wn.
And then the scoring function,
the ranking function, is the probability
that we observe this query, given that
the user is thinking of this document.
And this is assumed to be the product of
probabilities of all individual words, and
this is based on
the independence assumption.
Now we actually often score the
document for
this query by using the log of the query
likelihood, as shown on the second line.
Now we do this to avoid having
a lot of small probabilities
multiplied together.
This could cause underflow, and
we might lose precision. By transforming
the value with a logarithm function,
we maintain the order of these documents,
yet we can avoid the underflow problem.
So if we take the logarithm transformation,
of course the product would become
a sum, as you see on the line here.
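
The underflow problem is easy to reproduce. The
sketch below uses a deliberately exaggerated
setting (100 words, each with a made-up
probability of 0.00001) just to show the effect
with standard double precision floats.

```python
import math

# Hypothetical per-word probabilities; exaggerated on purpose.
word_probs = [1e-5] * 100

# Direct product: 1e-500 is far below the smallest positive double,
# so the result underflows to exactly 0.0.
product = 1.0
for p in word_probs:
    product *= p

# Sum of logs: the same quantity on a log scale, with no underflow.
log_score = sum(math.log(p) for p in word_probs)
```

Because the logarithm is monotonic, ranking
documents by log_score gives the same order as
ranking by the product would, if the product
could be represented.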
So it's a sum over all the query words,
and inside the sum
is the log of the probability of
this word given by the document.
And then we can further rewrite the sum
into a different form.
So in the first sum here,
we have a sum over all the query
words, the n query words.
And in this sum, we have a sum
over all the possible words, but
we put a count here of
each word in the query.
Essentially we are only considering
the words in the query,
because if a word is not in the query,
its count would be zero.
So we're still considering
only these n words.
But we're using a different form, as if
we were taking a sum over all the words
in the vocabulary.
And of course a word might occur
multiple times in the query.
That's why we have a count here.
And then this part is the
log of the probability of the word
given by the document language model.
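
Written out, the two forms just described would
look like the following. The notation is an
assumption for this sketch: q = w_1 ... w_n is
the query, V is the vocabulary, and c(w, q) is
the count of word w in the query.

```latex
\begin{aligned}
p(q \mid d) &= \prod_{i=1}^{n} p(w_i \mid d) \\
\log p(q \mid d) &= \sum_{i=1}^{n} \log p(w_i \mid d)
  \;=\; \sum_{w \in V} c(w, q)\, \log p(w \mid d)
\end{aligned}
```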
So you can see, in this retrieval function,
we actually know the count
of the word in the query.
So, the only thing that we don't know
is this document language model.
Therefore, we can convert
the retrieval problem
into the problem of estimating
this document language model,
so that we can compute the probability of
each query word given by this document.
And different estimation methods here
would lead to different ranking functions.
This is just like how different
ways to place a document vector
in the vector space
would lead to a different ranking
function in the vector space model.
Here, different ways to estimate
this document language model
will lead to a different ranking
function for query likelihood.
[MUSIC]

This lecture is about
smoothing of language models.
In this lecture we're going to continue
talking about the probabilistic
retrieval model.
In particular, we're going to talk
about smoothing of the language model in
the query likelihood
retrieval method.
So you have seen this slide
from a previous lecture.
This is the ranking function
based on the query likelihood.
Here we assume independence
in generating each query word,
and the formula would
look like the following,
where we take a sum over all of the query
words, and inside the sum there is
a log of the probability of a word given by
the document, or document language model.
So the main task now is to estimate
this document language model.
As we said before different methods for
estimating this model would lead
to different retrieval functions.
So, in this lecture we're going
to look into this in more detail.
So, how do I estimate this language model?
Well, the obvious choice would be
the Maximum Likelihood Estimate
that we have seen before.
And that is we're going to normalize
the word frequencies in the document.
And the estimated probability
would look like this.
This is a step function here,
which means all the words
that have the same frequency
count will have an equal probability.
Words with another frequency count
will have a different probability.
Note that words that have not
occurred in the document here
all have zero probability.
So we know this is just like the model that
we assumed earlier in the lecture, where
we assumed the user would sample words
from the document to formulate the query.
And there is no chance of sampling
any word that is not in the document.
And we know that's not good.
So how would we improve this?
Well, in order to assign
a nonzero probability
to words that have not been observed
in the document, we would have to take
away some probability mass from
the words that are observed in the document.
So for example here, we have to
take away some probability mass,
because we need some extra
probability mass for the unseen words.
Otherwise, they won't sum to 1.
All these probabilities
must sum to 1.
So to make this transformation, and
to improve the maximum likelihood estimator
by assigning nonzero probabilities to
words that are not observed in the data,
we have to do smoothing. Smoothing
has to do with improving the estimate
by considering the possibility that,
if the author had been
asked to write more words for
the document,
the author might have written other words.
If you think about this factor,
then a smoothed language model
would be a more accurate
representation of the actual topic.
Imagine you have seen the
abstract of such an article.
Let's say this document is an abstract.
Right.
If we assume unseen
words in this abstract have
a probability of 0, that
would mean there's no chance
of sampling a word outside the abstract
to formulate a query.
But imagine a user who is interested in
the topic of this abstract. The user might
actually choose a word that is not in
the abstract to use as a query word.
So obviously, if we had asked
this author to write more,
the author would have written
the full text of that article.
So smoothing of the language
model is an attempt
to try to recover the model for
the whole article.
And then of course we don't really have
knowledge about any words that are not
observed in the abstract, so that's why
smoothing is actually a tricky problem.
So let's talk a little more
about how to smooth a language model.
The key question here is what probability
should be assigned to those unseen words.
Right.
And
there are many different
ways of doing that.
One idea here, that's very useful for
retrieval is let the probability
of an unseen word be proportional
to its probability given by
a reference language model.
That means if you don't observe
the word in the data set,
we're going to assume that
its probability is kind of
governed by another reference language
model that we will construct.
It will tell us which unseen words
would likely have a higher probability.
In the case of retrieval
a natural choice would be to
take the Collection Language Model
as a Reference Language Model.
That is to say, if you don't
observe a word in the document,
we're going to assume that
the probability of this word
would be proportional to the probability
of the word in the whole collection.
So, more formally,
we'll be estimating the probability of
a word given a document as follows.
If the word is seen in the document,
then the probability
would be a discounted maximum
likelihood estimate, p sub seen here.
Otherwise, if the word is not seen
in the document, we'll then let the
probability be proportional to the
probability of the word in the collection,
and here the coefficient alpha sub d is to
control the amount of probability
mass that we assign to unseen words.
Obviously all these
probabilities must sum to 1.
So, alpha sub d is
constrained in some way.
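
In symbols, the smoothing scheme just described
can be sketched as follows. The notation is an
assumption for this sketch: C is the collection,
p(w | C) is the collection language model, and
c(w, d) is the count of w in document d. The
second line spells out the sum-to-one constraint
on alpha sub d.

```latex
p(w \mid d) =
\begin{cases}
  p_{\text{seen}}(w \mid d) & \text{if } c(w, d) > 0 \\
  \alpha_d \, p(w \mid C)   & \text{otherwise}
\end{cases}
\qquad
\sum_{w:\, c(w,d) > 0} p_{\text{seen}}(w \mid d)
  \;+\; \alpha_d \sum_{w:\, c(w,d) = 0} p(w \mid C) \;=\; 1
```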
So, what if we plug in this
smoothing formula into our
query likelihood Ranking Function?
This is what we would get.
In this formula,
you can see, right, we have
a sum over all the query words.
And note that we have written it in the form
of a sum over all the vocabulary.
You see here this is a sum over
all the words in the vocabulary,
but note that we have a count
of the word in the query.
So, in effect we are just taking
a sum over query words, right.
This is now a common way that
we will use because of its
convenience in some transformations.
So, as I said,
this is a sum over all the query words.
In our smoothing method,
we're assuming that the words that are not
observed in the document have
a somewhat different form of probability,
namely this form.
So we're going to then decompose
this sum into two parts.
One sum is over all the query words
that are matched in the document.
That means in this sum,
all the words have a nonzero
probability in the document. Sorry,
it's a nonzero count of
the word in the document.
They all occur in the document.
And they also have to, of course,
have a nonzero count in the query.
So, these are the words that are matched.
These are the query words that
are matched in the document.
On the other hand, in this sum we are
taking the sum over all the query
words that are not
matched in the document.
So they occur in the query, due to this
term, but they don't occur in the document.
In this case,
these words have this probability because
of our assumption about the smoothing.
Note that here, these seen words
have a different probability.
Now we can go further by
rewriting the second sum
as a difference of two other sums.
Basically the first sum is actually
the sum over all the query words.
Now we know that the original
sum is not over all the query words.
It is over all the query words that
are not matched in the document.
So here we pretend that it
is actually over all the query words.
So, we take a sum over
all the query words.
Obviously this sum has
extra terms
that are not in this sum,
because here we're taking a sum
over all the query words.
There it's only those not matched
in the document.
So in order to make them equal,
we have to then subtract another sum here.
And this is a sum over all the query
words that are matched in the document.
And this makes sense because here
we're considering all query words.
And then we subtract the query words
that are matched in the document.
That will give us the query words
that are not matched in the document.
And this is almost a reverse
process of the first step here.
And you might wonder
why we want to do that.
Well, that's because if we do this then
we'll have different forms
of terms inside these sums.
So, now we can see in this sum we have
all the
matched query words, matched in
the document, with this kind of term.
Here we have another sum
over the same set of terms,
matched query terms in the document.
But inside the sum it's different.
But these two sums can clearly be merged.
So, if we do that we'll get another form
of the formula that looks like
the following at the bottom here.
And note that this is very interesting,
because here we combine these two
sums over the query words matched
in the document into one sum here.
And the other sum is now decomposed
into two parts,
and these two parts look much simpler,
just because these
are the probabilities of unseen words.
But this formula is very interesting,
because you can see the sum is now over
all the matched query terms.
And just like in the vector space model,
we take a sum over terms
in the intersection of the query vector
and the document vector.
So it already looks a little
bit like the vector space model.
In fact there is even more similarity here,
as we explain on this slide.
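
For reference, the rewriting steps described
above can be written out as follows. This is a
reconstruction from the spoken description, with
assumed notation: c(w, q) and c(w, d) are counts
of w in the query and the document, V is the
vocabulary, C is the collection, and
n = sum over w of c(w, q) is the query length.

```latex
\begin{aligned}
\log p(q \mid d)
 &= \sum_{w \in V} c(w, q) \log p(w \mid d) \\
 &= \sum_{w:\, c(w,d) > 0} c(w, q) \log p_{\text{seen}}(w \mid d)
  + \sum_{w:\, c(w,d) = 0} c(w, q) \log \alpha_d \, p(w \mid C) \\
 &= \sum_{w:\, c(w,d) > 0} c(w, q) \log p_{\text{seen}}(w \mid d)
  + \sum_{w \in V} c(w, q) \log \alpha_d \, p(w \mid C)
  - \sum_{w:\, c(w,d) > 0} c(w, q) \log \alpha_d \, p(w \mid C) \\
 &= \sum_{w:\, c(w,d) > 0} c(w, q)
      \log \frac{p_{\text{seen}}(w \mid d)}{\alpha_d \, p(w \mid C)}
  + n \log \alpha_d
  + \sum_{w \in V} c(w, q) \log p(w \mid C)
\end{aligned}
```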
[MUSIC]

[SOUND]
So, I showed you how we rewrite the query
likelihood retrieval function into
a form that looks like
the formula on this slide,
after we make the assumption about
smoothing the language model
based on the collection
language model.
Now, if we look at this rewriting,
it actually would give us two benefits.
The first benefit is that it helps us
better understand this ranking function.
In particular, we're going to show that
from this formula we can see that smoothing
with the collection language model will give
us something like TF-IDF weighting and
length normalization.
The second benefit is
that it also allows us to
compute the query likelihood
more efficiently.
In particular,
we see that the main part of the formula
is a sum over the matching query terms.
So, this is much better than if we
take the sum over all the words.
After we smooth the document
the language model,
we essentially have nonzero
probabilities for all the words.
So, this new form of the formula is
much easier to score, or to compute.
It's also interesting to note
that the last term here
is actually independent of the document.
Since our goal is to
rank the documents for
the same query,
we can ignore this term for ranking.
Because it's going to be the same for
all the documents.
Ignoring it wouldn't affect
the order of the documents.
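For reference, the rewritten ranking function described above can be written out as follows. This is a sketch reconstructed from the spoken description, since the slide itself is not part of this transcript. The notation is assumed: c(w,q) is the count of word w in the query, p_seen(w|d) is the smoothed probability of a word seen in the document, alpha_d is the document-dependent smoothing coefficient, p(w|C) is the collection language model, and n is the query length.

```latex
\log p(q \mid d)
  = \sum_{w \in q \cap d} c(w,q)\,
      \log \frac{p_{\text{seen}}(w \mid d)}{\alpha_d \, p(w \mid C)}
  \;+\; n \log \alpha_d
  \;+\; \sum_{w \in q} c(w,q) \log p(w \mid C)
```

The last sum does not involve the document at all, which is why it can be dropped when ranking documents for the same query.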
Inside the sum,
we also see that each matched
query term would contribute a weight.
And this weight actually,
is very interesting because it
looks like TF-IDF weighting.
First, we can already see it has
a frequency of the word in the query,
just like in the vector space model.
When we take a dot product,
we see the word frequency in
the query to show up in such a sum.
And so naturally,
this part will correspond to
the vector element from
the document vector.
And here, indeed, we can see it actually
encodes a weight that has a similar
effect to TF-IDF weighting.
I let you examine it.
Can you see it?
Can you see which part is capturing TF,
and which part is capturing IDF weighting?
So if you want, you can pause
the video to think more about it.
So, have you noticed that this p sub seen
is related to the term frequency,
in the sense that if a word occurs
very frequently in the document,
then this probability here
will tend to be larger?
Right?
So, this means this term is really
doing something like TF weighting.
Have you also noticed that
this term in the denominator
is actually achieving the effect of IDF?
Why?
Because this is the popularity of the term
in the collection, but
it's in the denominator.
So, if the probability in
the collection is larger,
then the weight is actually smaller.
And this means a popular term
would actually have a smaller weight.
And this is precisely what
IDF weighting is doing.
Only now, we have
a different form of TF and IDF.
Remember, IDF has a
logarithm of document frequency, but
here we have something different.
But intuitively,
it achieves a similar effect.
Interestingly, we also have something
related to the length normalization.
Again, can you see which factor is
related to the length in this formula?
Well, I just said that this
term is related to IDF weighting,
this collection probability.
But it turns out this term here
is actually related to
document length normalization.
In particular,
alpha sub d might be related to document length.
So, it encodes how much probability
mass we want to give to unseen words.
How much smoothing you are allowed to do.
Intuitively, if a document is long,
then we need to do less smoothing.
Because we can assume that
it is large enough that,
we have probably observed all of the words
that the author could have written.
But if the document is short,
the number of unseen words is expected to be large,
and we need to do more smoothing.
It's likely that there are words that have
not been written yet by the author.
So, this term appears to
penalize long documents, in that alpha sub d
would tend to be larger for a short
document than for a long document.
But note that alpha sub d also occurs here.
And so,
this may not actually be necessarily
penalizing long documents, and
in fact the effect is not so clear here.
But as we will see later, when we
consider some specific smoothing methods,
it turns out that they do
penalize long documents.
Just like in TF-IDF weighting and
the document length normalization formulas
in the vector space model.
So, that's a very interesting
observation because it means
we don't even have to think about
the specific way of doing smoothing.
We just need to assume that if we
smooth with this language model,
then we would have a formula that
looks like a TF-IDF weighting and
document length normalization.
What's also interesting is that we have
a very fixed form of the ranking function.
And see, we have not heuristically
put a logarithm here.
In fact, you can think about
why we would have a logarithm here.
If you look at the assumptions that
we have made, it will be clear.
It's because we have used a logarithm
of query likelihood for scoring.
And, we turned the product into
a sum of logarithm of probability.
And, that's why we have this logarithm.
Note that if we only want to heuristically
implement a TF weighting and
IDF weighting, we don't necessarily
have to have a logarithm here.
Imagine if we drop this logarithm,
we would still have TF and IDF weighting.
But, what's nice with
probabilistic modeling is that we
are automatically given
a logarithm function here.
And that's basically
a fixed form of the formula that we did
not really have to heuristically design.
And in this case,
if you try to drop this logarithm,
the model probably won't work
as well as if you keep the logarithm.
So, a nice property of probabilistic
modeling is that by following some
assumptions and the probability rules,
we'll get a formula automatically.
And, the formula would have
a particular form, like in this case.
And if we heuristically
design the formula,
we may not necessarily end up
having such a specific form.
So to summarize, we talked about the need
for smoothing a document language model.
Otherwise, it would give zero probability
for unseen words in the document.
And, that's not good for
scoring a query with such an unseen word.
It's also necessary,
in general, to improve the
accuracy of estimating the model
representing the topic of this document.
The general idea of smoothing in retrieval
is to use the collection language model
to give us some clue about which unseen
word would have a higher probability.
That is the probability of the unseen
word is assumed to be proportional
to its probability in the collection.
With this assumption, we've shown that we
can derive a general ranking formula for
query likelihood
that has the effect of TF-IDF weighting and
document length normalization.
We also see that through some rewriting,
the scoring of such ranking function,
is primarily based on sum of
weights on matched query terms,
just like in the vector space model.
But the actual ranking function
is given to us automatically by
the probability rules and
assumptions we have made.
Unlike in the vector space model,
where we have to heuristically think
about the form of the function.
However, we still need
to address the question:
how exactly should we
smooth a document language model?
How exactly should we use
the reference language model based on
the collection to adjust
the probability of the maximum likelihood estimate?
And this is the topic
of the next lecture.
[MUSIC]

[SOUND]
This lecture is about the specific
smoothing methods for language models
used in the probabilistic retrieval model.
In this lecture we will continue
the discussion of language models for
information retrieval, particularly
the query likelihood retrieval method.
And we're going to talk about
the specific smoothing methods used for
such a retrieval function.
So, this is a slide from a previous
lecture where we show that with
query likelihood ranking and the smoothing
with the collection language model.
We end up having a retrieval function
that looks like the following.
So, this is the retrieval function,
based on these assumptions
that we have discussed.
You can see it's a sum over all
the matched query terms here.
And inside the sum it's
a count of the term in the query,
and some weight for
the term in the document.
We have a TF-IDF weight here.
And then we have another part here
that involves n.
So clearly, if we want to implement this
function using a programming language,
we'll still need to figure
out a few variables.
In particular, we're going to
need to know how to estimate the
probability of a word exactly.
And how do we set alpha?
So in order to answer these questions,
we have to think about very specific
smoothing methods, and
that is the main topic of this lecture.
We're going to talk about
two smoothing methods.
The first is simple linear
interpolation with a fixed coefficient.
And this is also called
Jelinek-Mercer smoothing.
So the idea is actually very simple.
This picture shows how we estimate
a document language model by using
the maximum likelihood method,
which gives us word counts normalized by
the total number of words in the text.
The idea of using this method is to
maximize the probability
of the observed text.
As a result, if a word like network
is not observed in the text,
it's going to get zero probability,
as shown here.
So the idea of smoothing, then,
is to rely on the collection language model,
where this word is not going to have
a zero probability, to help us decide
what non-zero probability should
be assigned to such a word.
So, we can note that network has
a non-zero probability here.
So, in this approach what we do is,
we do a linear interpolation between
the maximum likelihood estimate here
and the collection language model.
And this is controlled by
the smoothing parameter, lambda,
which is between 0 and 1.
So this is a smoothing parameter.
The larger lambda is, the more
smoothing we will have.
So by mixing them together, we achieve the
goal of assigning non-zero probabilities
to unseen words like network.
So let's see how it works for
some of the words here.
For example, if we compute
the smoothed probability for text,
the maximum likelihood estimate
gives us 10 over 100,
and that's going to be here.
But the collection probability is this, so
we just combine them together
with this simple formula.
We can also see the word network,
which used to have zero probability,
is now getting a non-zero
probability of this value.
And that's because the count is going
to be zero for network here, but
this part is non zero and
that's basically how this method works.
If you think about this,
you can easily see that alpha sub d
in this smoothing method is basically
lambda because that's, remember,
the coefficient in front of
the probability of the word given by
the collection language model here, right?
Okay, so
this is the first smoothing method.
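The interpolation just described can be sketched in a few lines of Python. This is only an illustration, not the slide's own material: the function name `jm_smooth`, the value of lambda, and the collection probability of 0.001 are assumed for the example. Only the count of 10 out of 100 words for "text" comes from the lecture.

```python
def jm_smooth(count_in_doc, doc_length, p_collection, lam=0.1):
    """Jelinek-Mercer smoothing: fixed-coefficient linear interpolation
    between the maximum likelihood estimate and the collection model."""
    p_ml = count_in_doc / doc_length  # maximum likelihood estimate
    return (1.0 - lam) * p_ml + lam * p_collection


# "text" occurs 10 times in a 100-word document (from the lecture).
# The collection probability 0.001 and lambda = 0.1 are assumed values.
p_text = jm_smooth(10, 100, p_collection=0.001, lam=0.1)

# "network" is unseen in the document, so only the second part is non-zero.
p_network = jm_smooth(0, 100, p_collection=0.001, lam=0.1)
```

For the unseen word, the result is just lambda times the collection probability, which is why alpha sub d equals lambda here.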
The second one is similar, but it has
a dynamic coefficient for linear interpolation.
It's often called Dirichlet prior, or
Bayesian, smoothing.
So again here, we face the problem of
zero probability for an unseen word like network.
Again we'll use the collection
language model, but
in this case we're going to combine
them in a somewhat different way.
The formula first can be seen as
an interpolation of the maximum likelihood estimate
and the collection
language model as before,
as in the JM smoothing method.
Only now the coefficient
is not lambda, a fixed number, but
a dynamic coefficient in this form,
where mu is a parameter;
it's a non-negative value.
And you can see if we
set mu to a constant,
the effect is that a long document would
actually get smaller coefficient here.
Right?
Because a long document
will have a longer length.
Therefore, the coefficient
is actually smaller.
And so a long document would have
less smoothing as we would expect.
So this seems to make more sense
than a fixed coefficient smoothing.
Of course,
this part would be of this form, so
that the two coefficients would sum to 1.
Now, this is one way to understand
this smoothing.
Basically, it means that it's
a dynamic coefficient interpolation.
There is another way to
understand this formula,
which is even easier to remember, and
that's on this side.
So it's easy to see we can rewrite
this smoothing method in this form.
Now, in this form, we can easily see
what change we have made to the maximum
likelihood estimator, which would be this part,
right?
So it normalizes the count
by the document length.
So, in this form, we can see what we did,
is we add this to the count of every word.
So, what does this mean?
Well, this is basically
something related to the probability
of the word in the collection.
And we multiply that by the parameter mu.
And when we combine this
with the count here,
essentially we are adding pseudo
counts to the observed text.
We pretend every word
has got this many pseudo counts.
So the total count would be
the sum of these pseudo counts and
the actual count of
the word in the document.
As a result, in total, we would
have added this many pseudo counts.
Why?
Because if you take a sum of
this one over all the words,
the probabilities of the words
would sum to 1, and that gives us just mu.
So this is the total number of
pseudo counts that we added.
And so
these probabilities would still sum to 1.
So in this case, we can easily
see the method is essentially to
add these as pseudo counts to the data.
We pretend we actually augment the data
by including some pseudo data defined
by the collection language model.
As a result, we have more counts.
The total count for
a word would be like this.
And, as a result,
even if a word has zero count here,
it would still have a non-zero
count because of this part, right?
And so this is how this method works.
Let's also take a look at
this specific example here.
All right, so for text again,
we will have 10 as the original count
that we actually observe, but
we also added some pseudo counts.
And so, the probability of
text would be of this form.
Naturally the probability of
network would be just this part.
And so, here you can also
see what's alpha sub d here.
Can you see it?
If you want to think about it,
you can pause the video.
Have you noticed that this
part is basically alpha sub d?
So we can see in this case alpha sub d
does depend on the document, right?
Because this length depends on the document,
whereas in the linear interpolation,
the JM smoothing method,
this is a constant.
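The pseudo count view of Dirichlet prior smoothing can be sketched in Python as follows. The function names, mu = 1000, and the collection probability of 0.001 are assumptions made for illustration. Only the count of 10 in a 100-word document comes from the lecture.

```python
def dirichlet_smooth(count_in_doc, doc_length, p_collection, mu=1000.0):
    """Dirichlet prior smoothing: add mu * p(w|C) pseudo counts to each
    word, then normalize by the document length plus mu."""
    return (count_in_doc + mu * p_collection) / (doc_length + mu)


def dirichlet_alpha(doc_length, mu=1000.0):
    """Coefficient on the collection model; it shrinks as documents grow."""
    return mu / (doc_length + mu)


# Assumed values: mu = 1000 and p(w|C) = 0.001 for both words.
p_text = dirichlet_smooth(10, 100, p_collection=0.001)    # seen word
p_network = dirichlet_smooth(0, 100, p_collection=0.001)  # unseen word

# The same estimate, written as a dynamic-coefficient interpolation.
alpha_d = dirichlet_alpha(100)
p_text_interp = (1.0 - alpha_d) * (10 / 100) + alpha_d * 0.001
```

Both forms give the same value for the seen word, and a longer document gets a smaller coefficient, so it is smoothed less.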
[MUSIC]

[SOUND]
So let's plug in these smoothing methods
into the ranking function to
see what we will get, okay?
This is the general
ranking function for
smoothing with a collection language model, and
you have seen this before.
And now we have a very specific smoothing
method, the JM smoothing method.
So now let's see what's the value for
alpha sub d here.
And what's the value for p sub seen here?
Right, so we need to decide this
in order to figure out the exact
form of the ranking function.
And we also need to figure
out, of course, alpha.
So let's see.
Well, this ratio is basically this,
right? So
here, this is the probability
of a seen word on the top,
and this is the probability
of an unseen word, or,
in other words, lambda,
which is basically the alpha here,
times this. So it's easy to see that.
This can be then rewritten as this.
Very simple.
So we can plug this into here.
And then here, what's the value for alpha?
What do you think?
So it would be just lambda, right?
And what would happen if we plug in
this value here, if this is lambda.
What can we say about this?
Does it depend on the document?
No, so it can be ignored.
Right?
So we'll end up having this
ranking function shown here.
And in this case you can easily see
this is precisely a vector space
model, because this part is
a sum over all the matched query terms,
and this is an element of the query vector.
What do you think is the element
of the document vector?
Well, it's this, right?
So that's our document vector element.
And let's further examine what's
inside of this logarithm.
Well, it's one plus this.
So what's inside is going to be at least 1,
and the log of this
is going to be nonnegative, right?
And this is a parameter;
lambda is a parameter.
And let's look at this.
Now this is a TF.
Now we see very clearly
this TF weighting here.
And the larger the count is,
the higher the weighting will be.
We also see IDF weighting,
which is given by this.
And we see document length
normalization here.
So all these heuristics
are captured in this formula.
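As a rough sketch of how this JM ranking function could be scored in code: plugging the JM estimate into the general formula gives, for each matched query term, the weight log(1 + ((1 - lambda) / lambda) * c(w,d) / (|d| * p(w|C))). The function name `jm_score`, the dictionary `p_collection`, and the toy data are hypothetical. The sketch assumes `p_collection` has an entry for every matched term.

```python
import math
from collections import Counter


def jm_score(query_terms, doc_terms, p_collection, lam=0.1):
    """Query likelihood score with Jelinek-Mercer smoothing, up to a
    document-independent constant. Only matched terms contribute."""
    q_counts = Counter(query_terms)
    d_counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for w, c_wq in q_counts.items():
        c_wd = d_counts.get(w, 0)
        if c_wd == 0:
            continue  # unmatched query terms add nothing to the sum
        # Actual count divided by the expected count |d| * p(w|C).
        ratio = c_wd / (doc_len * p_collection[w])
        score += c_wq * math.log(1.0 + (1.0 - lam) / lam * ratio)
    return score


# Toy data (assumed): "text" occurs 10 times in a 100-word document.
doc = ["text"] * 10 + ["other"] * 90
p_c = {"text": 0.001, "network": 0.001}
score_matched = jm_score(["text"], doc, p_c)
score_unmatched = jm_score(["network"], doc, p_c)
```

Because the value inside the logarithm is always at least 1, every matched term adds a nonnegative weight, and a query with no matched terms scores zero.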
What's interesting is that
we kind of have got this
weighting function automatically
by making various assumptions,
whereas in the vector space model,
we had to go through that heuristic
design in order to get this.
And in this case note that
there's a specific form.
And you can examine whether this
form actually makes sense.
All right, so what do you think
is the denominator here?
This is the length of the document,
the total number of words,
multiplied by the probability of the word
given by the collection, right?
So this actually can be interpreted
as the expected count of a word
if we're going to draw words
from the collection language model,
and we're going to draw as many as
the number of words in the document.
If you do that,
the expected count of a word, w,
would be precisely given
by this denominator.
So this ratio is basically
comparing the actual count here,
the actual count of the word in the
document, with the expected count given by this
product if the word is in fact following
the distribution in the collection.
And if this count is larger than
the expected count in this part,
this ratio would be larger than one.
So that's actually a very
interesting interpretation, right?
It's very natural and intuitive,
it makes a lot of sense.
And this is one advantage of using
this kind of probabilistic reasoning
where we have made explicit assumptions.
And, we know precisely why
we have a logarithm here.
And, why we have these probabilities here.
And we also have a formula that
intuitively makes a lot of sense and
does TF-IDF weighting and
document length normalization.
Let's look at
Dirichlet prior smoothing.
It's very similar to
the case of JM smoothing.
In this case,
the smoothing parameter is mu and
that's different from
lambda that we saw before.
But the format looks very similar.
The form of the function
looks very similar.
So we still have linear interpolation here.
And when we compute this ratio,
we will find that
the ratio is equal to this.
And what's interesting here is that we
are doing another comparison here now.
We're comparing the actual count
with the expected count of the word
if we sampled mu words according to
the collection word probability.
So note that it's interesting we don't
even see document length here,
unlike in the JM smoothing model.
All right, so this of course
should be plugged into this part.
So you might wonder,
where is document length?
Interestingly, the document length
is here in alpha sub d, so
this would be plugged into this part.
As a result, what we get is
the following function here, and
this is again a sum over
all the matched query words.
And we again see
the query term frequency here.
And you can interpret this as
the element of a document vector,
but this is no longer
a simple dot product, right?
Because we have this part,
and note that n is the length of the query,
right?
So that just means if
we score this function,
we have to take a sum over
all the query words, and
then do some adjustment of
the score based on the document.
But it's still clear
that it does document length
normalization, because this length
is in the denominator, so
a longer document will
have a lower weight here.
And we can also see it has TF here and
IDF here.
Only that this time the form of the
formula is different from the previous one
in JM smoothing.
But intuitively it still implements TF-IDF
weighting and document length normalization.
Again, the form of the function is dictated
by the probabilistic reasoning and
assumptions that we have made.
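A minimal Python sketch of this Dirichlet prior ranking function, under stated assumptions: each matched query term contributes log(1 + c(w,d) / (mu * p(w|C))), and the document-dependent adjustment is n * log(mu / (mu + |d|)), where n is the query length. The names `dirichlet_score` and `p_collection` and the toy documents are hypothetical, and `p_collection` is assumed to cover every matched term.

```python
import math
from collections import Counter


def dirichlet_score(query_terms, doc_terms, p_collection, mu=1000.0):
    """Query likelihood score with Dirichlet prior smoothing, up to a
    document-independent constant."""
    q_counts = Counter(query_terms)
    d_counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    n = len(query_terms)  # query length
    score = 0.0
    for w, c_wq in q_counts.items():
        c_wd = d_counts.get(w, 0)
        if c_wd > 0:
            # Actual count versus the expected count in mu sampled words.
            score += c_wq * math.log(1.0 + c_wd / (mu * p_collection[w]))
    # Length adjustment: n * log(alpha_d), which is lower for longer documents.
    score += n * math.log(mu / (mu + doc_len))
    return score


# Toy data (assumed): same count of "text", different document lengths.
p_c = {"text": 0.001}
short_doc = ["text"] * 2 + ["x"] * 8
long_doc = ["text"] * 2 + ["x"] * 98
s_short = dirichlet_score(["text"], short_doc, p_c)
s_long = dirichlet_score(["text"], long_doc, p_c)
```

With the same term count, the shorter document gets the higher score, which is the length normalization effect carried by alpha sub d.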
Now there are also
disadvantages of this approach.
And that is, there's no guarantee
that there's such a form
of the formula will actually work well.
So if we look at this retrieval function,
although it has TF-IDF weighting and document length
normalization, it's unclear, for example, whether
we have a sublinear transformation.
Fortunately, we can see
there is a logarithm function here.
So we do have it;
it's here, right?
So we do have the sublinear
transformation, but
we did not intentionally do that.
That means there's no guarantee that
we will end up in this way.
Suppose we don't have logarithm,
then there's no sub-linear transformation.
As we discussed before, perhaps
the formula is not going to work so well.
So that's an example of the gap
between a formal model like this and
the relevance that we have to model,
which is really a subjective
notion that is tied to users.
So it doesn't mean we cannot fix this.
For example, imagine if we did
not have this logarithm, right?
So we can heuristically
add one,
or we can even add a double logarithm.
But then, it would mean that the function
is no longer a probabilistic model.
So the consequence of
the modification is no
longer as predictable as
what we have been doing now.
So, that's also why, for example,
BM25 remains very competitive, and
it is still an open challenge how to use
probabilistic models to derive
a better model than BM25.
In particular, how do we use query
likelihood to derive a model
that would work consistently
better than BM25?
Currently we still cannot do that.
It's still an interesting open question.
So to summarize this part, we've talked
about the two smoothing methods.
Jelinek-Mercer, which does fixed
coefficient linear interpolation, and
Dirichlet prior, which adds pseudo
counts to every word and does adaptive
interpolation in that the coefficient
would be larger for shorter documents.
In both cases we can see, by using these
smoothing methods, we will be able to
reach a retrieval function where
the assumptions are clearly articulated.
So they are less heuristic.
Experimental results also show
that these retrieval functions
are very effective, and they are
comparable to BM25 or pivoted length normalization.
So this is a major advantage
of probabilistic models,
where we don't have to do
a lot of heuristic design.
Yet in the end, we naturally
implemented TF-IDF weighting and
document length normalization.
Each of these functions also has
precisely one smoothing parameter.
In this case of course we still need
to set this smoothing parameter.
There are also methods that can be
used to estimate these parameters.
So overall,
this shows that by using a probabilistic model,
we follow very different strategies
than the vector space model.
Yet, in the end, we end up with
some retrieval functions that
look very similar to
the vector space model,
with some advantages in having
assumptions clearly stated
and the form dictated
by a probabilistic model.
Now, this also concludes our discussion of
the query likelihood probabilistic model.
And let's recall what
assumptions we have made
in order to derive the functions
that we have seen in this lecture.
Well we basically have made four
assumptions that I listed here.
The first assumption is that the relevance
can be modeled by the query likelihood.
And the second assumption we have made is that
query words are generated independently,
which allows us to decompose
the probability of the whole query
into a product of probabilities
of all the words in the query.
And then,
the third assumption that we have made is,
if a word is not seen in
the document, we are going to let
its probability be proportional to
its probability in the collection.
That's smoothing with
a collection language model.
And finally, we made one of these
two assumptions about the smoothing.
So we either used JM smoothing or
Dirichlet prior smoothing.
If we make these four assumptions
then we have no choice but
to take the form of the retrieval
function that we have seen earlier.
Fortunately, the function has a nice
property in that it implements TF-IDF
weighting and document length normalization, and
these functions also work very well.
So in that sense,
these functions are less heuristic
compared with the vector space model.
And there are many extensions of
this basic model, and
you can find the discussion of them in
the reference at the end of this lecture.
[MUSIC]

[SOUND] This lecture is about
the Feedback in Text Retrieval.
So, in this lecture,
we're going to continue the discussion
on text retrieval methods.
In particular, we're going to talk
about Feedback in Text Retrieval.
This is a diagram that shows
the retrieval process.
We can see the user would
type in a query, and
then the query would be sent
to a Retrieval Engine or
search engine and
the engine would return results.
These results would be shown to the user.
After the user has seen these results,
the user can actually make judgments.
So for example, the user may say,
well, this is good and
this document is not very useful.
This is good again, et cetera.
Now this is called a relevance judgment or
Relevance Feedback, because we've
got some feedback information from
the user based on the judgments.
This can be very useful to help the system
learn what exactly is
interesting to the user.
So the feedback module would
then take this as input and
also use the document collection
to try to improve ranking.
Typically, it would involve
updating the query.
So the system can now rank the results
more accurately for the user.
So this is called Relevance Feedback.
The feedback is based on relevance
judgements made by the users.
Now these judgements are reliable, but
the users generally don't want to make
extra effort, unless they have to.
So the downside is that it involves
some extra effort by the user.
There is another form of feedback
called Pseudo Relevance Feedback,
or blind feedback, also
called automatic feedback.
In this case, you can see once
the results have been retrieved, in fact
we don't have to involve users.
So you can see there's
no user involved here.
And we simply assume the top-ranked
documents to be relevant.
Let's say
we have assumed the top ten are relevant.
And then we will use these assumed
relevant documents to learn and
to improve the query.
Now you might wonder,
how could this help if we simply assume
the top-ranked documents to be relevant?
Well, you can imagine these top-ranked
documents are actually similar to relevant
documents; even if they are not relevant,
they look like relevant documents.
So it's possible to learn some terms
related to the query from this set.
In fact, you may recall that we
talked about using a language model to
analyze word associations, to learn
words related to the word computer.
Right?
And what we did is first
use computer to retrieve all
the documents that contain computer.
So, imagine now the query
here is computer.
Right?
And then the results will be those
documents that contain computer.
And what we can do then is
to take the top N results.
They match computer very well, and
we're going to count
the terms in this set, and then we're
going to use the background
language model to choose the terms
that are frequent in this set
but not frequent in
the whole collection.
So, if we make a contrast between
these two, what we can find is that
we'll learn some terms related to the
word computer, as we have seen before.
And these related words can then be added
to the original query to expand the query.
And this would help us retrieve documents
that don't necessarily match computer,
but match other words like program and
software.
So this is effective for
improving the search results.
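The contrast between the top-ranked set and the background model can be sketched roughly as below. The lecture does not give a specific scoring rule, so the ratio of a term's relative frequency in the top documents to its background probability is an assumption made for illustration, as are the function name `expansion_terms`, the `floor` value for unknown words, and the toy data.

```python
from collections import Counter


def expansion_terms(top_docs, p_background, k=3, floor=1e-6):
    """Pick k terms that are frequent in the assumed-relevant top
    documents but not frequent in the whole collection."""
    counts = Counter(w for doc in top_docs for w in doc)
    total = sum(counts.values())
    # Relative frequency in the feedback set over background probability.
    scored = {
        w: (c / total) / p_background.get(w, floor)
        for w, c in counts.items()
    }
    return sorted(scored, key=scored.get, reverse=True)[:k]


# Toy top-ranked results for the query "computer" (assumed data).
top_docs = [
    ["computer", "software", "the", "the"],
    ["computer", "program", "the", "software"],
]
p_bg = {"the": 0.5, "computer": 0.01, "software": 0.005, "program": 0.004}
related = expansion_terms(top_docs, p_bg, k=3)
```

Common words such as "the" are frequent in the set but also in the background, so they score low and are not chosen for expansion.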
But of course, pseudo-relevance
feedback is not completely reliable.
We have to arbitrarily set a cutoff.
So there is also something in
between called Implicit Feedback.
In this case, what we do,
we do involve users, but
we don't have to ask
users to make judgements.
Instead, we are going to observe how the
user interacts with the search results.
So, in this case,
we're going to look at the clickthroughs.
So the user clicked on this one, and
the user viewed this one.
And the user skipped this one, and
the user viewed this one again.
Now this also is a clue about whether
a document is useful to the user, and
we can even assume that we're going to use
only the snippet here in this document,
the text that's actually seen by the user,
instead of the actual document
of this entry in the link.
The link there, say in web search, may be broken,
but it doesn't matter.
If the user tries to fetch this document
because of the displayed text,
we can assume this displayed text is
probably relevant or interesting to the user,
so we can learn from such information.
And this is called Implicit Feedback and
we can again,
use the information to update the query.
This is a very important technique
used in modern search engines.
You know, think about Google and Bing:
they can collect a lot of user activities
while they are serving us,
right?
So they would observe what documents we
click on, what documents we skip.
And this information is very valuable, and
they can use this to
improve the search engine.
So to summarize,
we have talked about three kinds of
feedback here: Relevance Feedback,
where the user makes explicit judgments;
it takes some user effort, but
the judgment
information is reliable.
We talked about Pseudo Feedback, where
we simply assume the top-ranked documents
to be relevant;
we don't have to involve the user.
Therefore, we could do
that actually before
we return the results to the user.
And the third is Implicit Feedback,
where we use clickthroughs.
Here we do involve users, but
the user doesn't have to make
an explicit effort to make judgments.
[MUSIC]

[SOUND] This lecture is about
the feedback in the vector space model.
In this lecture, we continue talking
about feedback in text retrieval.
Particularly we're going to talk about
feedback in the vector space model.
As we have discussed before, in
the case of feedback the task of
a text retrieval system is to learn from
examples to improve retrieval accuracy.
We will have positive examples:
those are the documents that
are assumed to be relevant or
judged as being relevant, or
the documents that are viewed by users.
We also have negative examples: those
are documents known to be non-relevant.
They can also be the documents
that are skipped by users.
The general method in
the vector space model for
feedback is to modify our query vector.
Now we want to place the query vector in
a better position to make it more accurate,
and what does that mean exactly?
Well, if you think about the query vector
that would mean you would have to do
something to vector elements.
And in general that would
mean we might add new terms.
We might adjust weights of old terms or
assign weights to new terms.
And as a result in general
the query will have more terms so
we often call this query expansion.
The most effective method in the vector
space model for feedback is called Rocchio
feedback, which was actually
proposed several decades ago.
So, the idea is quite simple we illustrate
this idea by using a two-dimensional
display of all the documents in
the collection and also the query vector.
So, now we can see
the query vector is here in
the center and
these are all of the documents.
So when we use a query vector and
use a similarity function to
find the most similar documents.
We are basically drawing a circle here and
then these documents would be
basically the top-ranked documents.
And these pluses are relevant documents,
right?
So these are relevant documents; for
example, that one is relevant, etc.
And then these minuses
are negative documents, like this.
are negative documents like this.
So our goal here is trying
to move this query vector to some position
to improve the retrieval accuracy.
By looking at this diagram,
where do you think
we should move the query vector so that
we can improve the retrieval accuracy?
Intuitively, where do you want
to move the query vector to?
If you want to think more
you can pause the video.
Now if you think about
this picture you can realize that
in order to work well in this case
you want the query vector to be as close
to the positive vectors as possible.
That means, ideally you want to place
the query vector somewhere here or
we want to move the query
vector closer to this point.
Now, so what exactly is this point?
Well, if you want these relevant
documents to be ranked on the top
you want this to be in the center of
all of these relevant documents, right?
Because then if you draw
a circle around this one
you get all these relevant documents.
So that means we can move the query
vector toward the centroid of
all the relevant document vectors.
And this is basically the idea of Rocchio.
Of course, you can then also consider
the centroid of negative documents,
and we want to move away from
the negative documents.
Now geometrically we're
talking about moving a vector
closer to some other vector and
away from other vectors.
Algebraically it just means
we have this formula.
Here you can see this is
original query vector and
this average basically is the centroid
vector of relevant documents.
When we take the average
over these vectors
then we're computing
the centroid of these vectors.
And similarly, this is the average of
the non-relevant document vectors, so
it's essentially the centroid of
non-relevant documents.
And we have these three parameters here,
alpha, beta and gamma.
They're controlling
the amount of movement.
When we add these two vectors together,
we're moving the query vector closer
to the relevant centroid.
When we subtract this part, we kind
of move the query vector away from the
non-relevant centroid. So
this is the main idea of Rocchio feedback.
And after we have done this, we
will get a new query vector,
which we can use to score documents.
This new query vector will then
reflect the move of the
original query vector toward
the relevant centroid vector and
away from the non-relevant
centroid vector, okay?
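The formula described above can be sketched in code. This is only a minimal illustration, not the lecture's own implementation: it assumes each vector is a plain list of term weights over the same vocabulary, and the function names `centroid` and `rocchio` are made up for this example. In words, the new query is alpha times the original query, plus beta times the centroid of the relevant documents, minus gamma times the centroid of the non-relevant documents.

```python
def centroid(vectors):
    """Element-wise average of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def rocchio(query, relevant, non_relevant, alpha, beta, gamma):
    """Return the modified query vector.

    query        : list of term weights (the original query vector)
    relevant     : list of relevant document vectors (may be empty)
    non_relevant : list of non-relevant document vectors (may be empty)
    alpha, beta, gamma : weights controlling the amount of movement
    """
    new_query = [alpha * w for w in query]
    if relevant:
        # Move toward the centroid of the relevant documents.
        for i, c in enumerate(centroid(relevant)):
            new_query[i] += beta * c
    if non_relevant:
        # Move away from the centroid of the non-relevant documents.
        for i, c in enumerate(centroid(non_relevant)):
            new_query[i] -= gamma * c
    return new_query


# Tiny made-up example with a two-term vocabulary.
print(rocchio([1, 0], [[1, 1], [3, 1]], [[0, 2]], alpha=1.0, beta=0.5, gamma=0.5))
```

Note that a term weight can become negative after subtracting the non-relevant centroid; whether to keep or clip such weights is a design choice that this sketch leaves open.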
So let's take a look at an example.
This is the example that we have seen
earlier, only that instead of the display
of the actual documents, I only show the
vector representation of these documents.
We have five documents here, and we have
two relevant documents here, right?
They are displayed in red, and
these are the term vectors.
Now, I just assumed some weights;
a lot of terms have
zero weights, of course.
These are negative documents: there
are two here, and there is another one here.
Now in this Rocchio method, we
first compute the centroid of
each category. So let's
look at the centroid of
the positive documents.
It's very easy to see:
we just add this with this one,
the corresponding element
that's down here, and take the average.
And then we're going to add
the next corresponding elements and
again just take the average, right?
So we do this for all of these.
In the end, what we have is this one.
This is the average vector of these two, so
it's the centroid of these two, right?
Let's also look at the centroid
of the negative documents.
This is basically the same; we're going to
take the average of three elements.
And these are the corresponding
elements in these three vectors, and
so on and so forth.
So in the end, we have this one.
Now, in the Rocchio feedback
method we're going to combine all
these with original query vector,
which is this.
So now let's see how we
combine them together.
Well, that's basically this, right?
So we have a parameter alpha controlling
the original query term weight; that's 1.
And now we have beta to control
the influence of the positive
centroid vector weight; that's
1.5, which comes from here, right?
So this goes here, and
we also have this negative weight here,
controlled by gamma, and
this weight comes, of
course, from the negative centroid here.
And we do exactly the same for
the other terms; each is for one term.
And this is our new vector.
And we're going to use this new query
vector, this one, to rank the documents.
You can imagine what would happen, right?
Because of the movement, this one would
match these red
documents much better,
because we moved this
vector closer to them, and
it's going to penalize these black
documents, these non-relevant documents.
So this is precisely what
we want from feedback.
Now of course, if we apply this method in
practice we will see one potential problem
and that is the original query has
only four terms that are not zero.
But after we do query modification,
you can imagine we'll have many
terms that would have nonzero weights.
So the calculation would
have to involve more terms.
In practice,
we often truncate this vector and
only retain the terms with
the highest weights.
So let's talk about how we
use this method in practice.
I just mentioned that we often truncate
the vector and consider only a small number
of words that have the highest
weights in the centroid vector.
This is for efficiency.
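The truncation step can be sketched as follows. This is a minimal illustration under assumptions: the expanded query is represented as a dictionary from term to weight, and the helper name `truncate_query` and the cutoff `k` are hypothetical choices for this example.

```python
def truncate_query(weights, k):
    """Keep only the k terms with the highest weights.

    weights : dict mapping term -> weight in the expanded query vector
    k       : number of terms to retain (an assumed tuning parameter)
    """
    top = sorted(weights.items(), key=lambda item: item[1], reverse=True)[:k]
    return dict(top)


# Made-up weights for illustration only.
expanded = {"news": 0.1, "presidential": 2.0, "campaign": 1.5, "food": 0.0}
print(truncate_query(expanded, 2))
```

Fewer nonzero terms in the query means fewer postings lists to visit at scoring time, which is where the efficiency gain comes from.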
I also said here that negative
examples, or non-relevant examples,
tend not to be very useful, especially
compared with positive examples.
Now you can think about why.
One reason is that negative documents
tend to distract the query in
all directions, so when you take
the average it doesn't really tell you
where exactly it should be moving to.
Whereas positive documents tend
to be clustered together, and
they point you to
a consistent direction.
So that also means that sometimes we
don't have to use those negative examples. But
note that in
some cases, in difficult queries where
most top-ranked results are negative,
negative feedback
actually is very useful.
Another thing is to avoid
over-fitting that means we have to
keep relatively high weight
on the original query terms.
Why?
Because the sample that we see in
feedback is a relatively small sample.
We don't want to overly
trust the small sample and
the original query terms
are still very important.
Those terms are typed in by the user and
the user has decided that those
terms are most important.
So in order to prevent us
from over-fitting, or
from topic drift
due to the bias toward
the feedback examples,
we generally would have to keep a pretty
high weight on the original terms, so
it is safe to do that.
And this is especially true for
pseudo-relevance feedback.
Now this method can be used for
both relevance feedback and
pseudo relevance feedback.
In the case of pseudo feedback,
the parameter beta should be set to
a smaller value because
the relevant examples are only assumed
to be relevant; they're not as reliable
as in relevance feedback, right?
In the case of relevance feedback,
we obviously could use a larger value.
So those parameters
still have to be set.
And the Rocchio method is
usually robust and effective.
It's still a very popular method for
feedback.
[MUSIC]

This lecture is about web search.
In this lecture we
are going to talk about one of
the most important applications of
text retrieval, web search engines.
So let's first look at some
general challenges and
opportunities in web search.
Now, many information retrieval
algorithms had been developed
before the web was born.
So when the web was born,
it created the best opportunity to apply
those algorithms to a major application
problem that everyone would care about.
So naturally, there had to be some
further extensions of the classical
search algorithms to address some new
challenges encountered in web search.
So here are some general challenges.
First, there is a scalability challenge:
how do we handle the size of the web
and ensure completeness of
coverage of all the information?
How do we serve many users quickly
by answering all their queries?
All right, so, that's one major challenge.
And before the web was born,
the scale of search was relatively small.
The second problem is that there
is low-quality information,
and there is often spam.
The third challenge is
the dynamics of the web.
New pages are constantly created, and
some pages may be updated
very quickly.
So it makes it harder to
keep the index fresh.
So these are some of
the challenges that
we have to solve in order to
build a high-quality web search engine.
On the other hand, there are also some
interesting opportunities that we can
leverage to improve search results.
There are many additional heuristics.
For example, there are links that
we can leverage to improve scoring.
Now the algorithms that we talked about, such
as the vector space model, are general
algorithms.
And they can be applied to any search
application, so that's the advantage.
On the other hand, they also don't take
advantage of special characteristics
of pages, or documents, in the specific
applications such as web search.
Web pages are linked with each other so
obviously the linking is something
that we can also leverage.
So because of these challenges and
opportunities there are new techniques
that have been developed for web search,
or due to the need of a web search.
One is parallel indexing and searching,
and this is to address the issue of
scalability. In particular,
Google's invention of MapReduce
is very influential and
has been very helpful in that aspect.
Second, there are techniques
that are developed for
addressing the problem of spam,
so, spam detection.
We'll have to prevent those
spam pages from being ranked high.
And there are also techniques
to achieve robust ranking.
And we're going to use a lot
of signals to rank pages so
that it's not easy to spam the search
engine with particular tricks.
And the third line of
techniques is link analysis.
And these are techniques
that can allow us
to improve search results by
leveraging extra information.
And in general, in web
search we're going to use
multiple features for ranking,
not just link analysis but
also exploiting all kinds of clues, like
the layout of web pages or anchor text
that describes a link to another page.
So here's a picture showing the basic
search engine technologies.
Basically, this is the web on the left and
then the user on the right side.
And we're going to help this
user get access to the web information.
The first component is the crawler,
which would crawl pages, and
the second component is the indexer,
which will take these pages and
create an inverted index.
The third component is a retriever,
which would use
the inverted index to answer a user's query
by talking to the user's browser.
And then the search results would be
given to the user,
and the browser
will show those results
to allow the user to
interact with the web.
So we're going to talk about
each of these components.
First we're going to talk about
the crawler, also called a spider or
a software robot, which would do something
like crawling pages on the web.
To build a toy crawler is relatively easy,
because you just need to start with a set
of seed pages, fetch pages from
the web, and parse these pages for new links.
And then add them to the priority queue and
just explore those additional links,
right?
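The toy crawler loop just described can be sketched as follows. This is a minimal illustration under several assumptions: `fetch` is a caller-supplied function (hypothetical here) that returns the HTML for a URL or `None` on failure, so the sketch runs without network access; a plain first-in-first-out queue stands in for the priority queue, which gives breadth-first order; and none of the real-world issues (robustness, traps, server load, robot exclusion) are handled.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects the href values of all <a> tags in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first toy crawler.

    seeds     : list of starting URLs
    fetch     : function url -> HTML string, or None if the fetch failed
    max_pages : stop after this many pages have been collected
    Returns a dict mapping each crawled URL to its HTML, in crawl order.
    """
    frontier = deque(seeds)   # URLs waiting to be fetched
    seen = set(seeds)         # URLs already queued, to avoid re-adding them
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue          # skip pages that could not be fetched
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)   # resolve relative links
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages
```

For a quick trial, `fetch` can be the `get` method of a dictionary that maps a few made-up URLs to HTML strings.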
But to build a real crawler
actually is tricky, and
there are some complicated issues
that we have to deal with.
For example, robustness:
what if the server doesn't respond?
What if there's a trap that generates
dynamically generated web pages that might
attract your crawler to keep
crawling the same site and
fetching dynamically generated pages?
There's also the issue of crawling courtesy:
you don't want to overload one particular
server with many crawling requests,
and you have to respect
the robot exclusion protocol.
You also need to handle
different types of files.
There are images, PDF files,
all kinds of formats on the web.
And you also have to
consider URL extensions.
Sometimes those are CGI scripts and,
you know, internal references, etc., and
sometimes you have JavaScript on the
page, which also creates challenges.
And you ideally should also
recognize redundant pages,
because you don't have to
duplicate those pages.
And finally, you may be interested
in discovering hidden URLs.
Those are URLs that may not be linked
to any page.
But if you truncate the URL to
a shorter path,
you might be able to get
some additional pages.
So, what are the major
crawling strategies?
In general, breadth-first is most common,
because it naturally
balances server load.
You would not keep probing
a particular server.
Also, parallel crawling is very natural,
because this task is very easy
to parallelize. And there are some
variations of the crawling task.
One interesting variation
is called focused crawling.
In this case, we're going to crawl just
some pages about a particular topic,
for example, all pages about automobiles.
And this is typically
going to start with a query,
and then you can use the query
to get some results
from a major search engine.
And then you can start with those
results and gradually crawl more.
So one challenge in crawling is to find
the new pages that people have created,
and people probably are creating
new pages all the time, and this is
very challenging if the new pages have
not been actually linked to any old page.
If they are, then you can probably find
them by re-crawling the older page.
So these are also some interesting
challenges that have to be solved.
And finally, we might face the scenario of
incremental crawling or repeated crawling.
Right?
So first,
let's say you want to
build a web search engine,
and you would first crawl
a lot of data from the web.
But then, once you
have collected all the data,
in the future you just need to crawl
the updated pages.
In general, you don't have
to re-crawl everything, right?
It's not necessary.
So in this case,
your goal is to minimize the resource overhead
by using minimum resources
to just crawl updated pages.
So this is actually a very interesting
research question.
And it is a research
question in that there aren't
many standard algorithms for
doing this task.
Right?
But in general, you can imagine,
you can learn from the past experience.
Right.
So the two major factors that
you have to consider are first,
will this page be updated frequently?
And do I have to crawl this page again?
If the page is a static page
that hasn't been changed for
months you probably don't have
to re-crawl it everyday, right?
Because it's unlikely that it
will be changed frequently.
On the other hand, if it's, you know,
a sports score page that gets
updated very frequently,
you may need to re-crawl it, maybe
even multiple times on the same day.
The other factor to consider is:
is this page frequently accessed by users?
If it is,
that means it's a high-utility page, and
thus it's more important to
ensure such a page is fresh.
Compare it with another page that has
never been fetched by any users for
a year.
Then, even though that page
has been changed a lot,
it's probably not necessary to crawl that
page, or at least it's not as urgent as
maintaining the freshness of
pages frequently accessed by users.
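The two factors above can be combined in many ways, and the lecture does not give a formula. Purely as a hypothetical sketch, one simple heuristic multiplies an estimated update rate by an estimated access rate, so that a page needs both to be ranked as urgent. The function names and the example numbers below are assumptions made up for illustration.

```python
def recrawl_priority(change_rate, access_rate):
    """Hypothetical urgency score: higher means re-crawl sooner.

    change_rate : estimated updates per day (learned from past crawls)
    access_rate : estimated user accesses per day
    """
    return change_rate * access_rate


def recrawl_order(pages):
    """Sort page names by descending re-crawl priority.

    pages : dict mapping page name -> (change_rate, access_rate)
    """
    return sorted(pages, key=lambda name: recrawl_priority(*pages[name]), reverse=True)


# Made-up statistics for three pages.
stats = {
    "sports_scores": (24.0, 5000.0),     # changes often, read often
    "static_about": (0.01, 300.0),       # rarely changes
    "busy_but_unread": (10.0, 0.0),      # changes a lot, nobody reads it
}
print(recrawl_order(stats))
```

With a product, a page that nobody accesses gets zero priority no matter how often it changes, which matches the intuition in the lecture; a weighted sum would be a softer alternative.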
So to summarize,
web search is one of the most important
applications of text retrieval.
And there are some new challenges,
particularly scalability,
efficiency, and quality of information.
There are also new opportunities
particularly, rich link information and
layout, et cetera.
A crawler is an essential component
of web search applications.
And, in general,
we can classify two scenarios.
One is initial crawling, and
here we want to have complete crawling
of the web if you are building
a general search engine, or
focused crawling if you want to just
target a certain type of pages.
And then there is another scenario that's
incremental updating of the crawled data, or
incremental crawling.
In this case you need to
optimize the resource use,
to use minimum resources
to get the [INAUDIBLE]
[MUSIC].

[SOUND].
This lecture is about recommender systems.
So, so far we have talked about
a lot of aspects of search engines.
And we have talked about the problem
of search and the ranking problem,
different methods for ranking,
implementation of search engine and
how to evaluate the search engine,
et cetera.
This is partly because we know
that web search engines are,
by far, the most important
applications of text retrieval.
And they are the most useful tools
to help people convert big raw text
data into a small set
of relevant documents.
Another reason why we spend so
many lectures on search engines is because
many techniques used in search engines
are actually also very useful for
recommender systems,
which is the topic of this lecture.
And so overall the two systems
are actually well connected,
and there are many techniques
that are shared by them.
So this is a slide that you have
seen before when we talked about
the two different modes of
text access, pull and push.
And, we mentioned that recommender
systems are the main systems to serve
users in the push mode, where
the systems will take the initiative to
recommend the information to user, or to
push the relevant information to the user.
And this often works well when the user
has a relatively stable information need,
when the system has good
knowledge about what a user wants.
So a recommender system is sometimes
called a filtering system.
And it's because recommending
useful items to people is like
discarding or
filtering out the useless articles.
So in this sense,
they are kind of similar.
And in all the cases,
the system must make a binary decision.
And usually, there is a dynamic
source of information items,
and you have some knowledge about the
user's interest, and then the system would
make a delivery decision, whether
this item is interesting to the user.
And then if it is interesting then
the system would recommend the article to
the user.
So the basic filtering question here is
really, will this user like, this item?
Will U like item X?
And there are two ways to answer this
question if you think about it, right?
One is to look at what items U likes, and
then we can see if X is
actually like those items.
The other is to look at who likes X, and
we can see if this user looks like
one of those users, or
like most of those users.
And these strategies can be combined.
If we follow the first strategy and
look at item similarity in the case
of recommending text objects,
then we are talking about a content-based
filtering or content-based recommendation.
If we look at the second strategy then,
this will compare users.
And in this case,
we're exploiting user similarity,
and the technique is often called
a collaborative filtering.
So let's first look at
the content-based filtering system.
This is what a system would look like.
Inside the system, there would be
a binary classifier that would have some
knowledge about the user's interests, and
it's called the user interest profile.
It maintains the profile to keep
track of the user's interest.
And then there is a utility function to
guide the system to make decisions, and
I'll explain the utility
function in a moment.
It helps the system decide
where to set the threshold.
And then the accepted documents
will be those that have passed
the threshold according to the classifier.
There should also be an initialization
module that would take a user's input,
maybe from user-specified keywords
or a chosen category, et cetera.
And this will be used to feed the system
with an initial user profile.
There is also typically a learning
module that will learn from
users' feedback over time.
Now note that in this case, typically
the user's information need is stable, so
the system would have a lot of
opportunities to observe the user.
You know, if the user has taken
a recommended item and viewed it,
this is a signal to indicate that
the recommended item may be relevant.
If the user discarded it,
no, it's not relevant.
And so such feedback can be long-term
feedback that lasts for a long time, and
the system can collect a lot of
information about this user's interests.
And this can then be used
to improve the classifier.
Now what are the criteria for
evaluating such a system?
How do we know this filtering
system actually performs well?
Now in this case we cannot use the ranking
evaluation measures, like MAP,
because we can't afford waiting for
a lot of documents
and then ranking the documents to
make a decision for the user.
And so the system must make a decision
in real time,
in general, to decide whether the item
is above the threshold or not.
So in other words,
we're trying to decide absolute relevance.
So in this case, one commonly used
strategy is to use a utility function
to evaluate the system.
So here I show a linear utility function
that's defined as, for example,
3 multiplied by the number of
good items that you delivered,
minus 2 multiplied by the number of bad
items that you delivered.
So in other words,
we could kind of just
treat this almost as
a gambling game.
If you deliver one good item,
let's say you win $3, you gain $3.
But if you deliver a bad one,
you would lose $2.
And this utility function
basically kind of measures
how much money you would
get by playing this kind of game, right?
And so it's clear that if you want
to maximize this utility function,
your strategy should be to deliver
as many good articles as possible,
and minimize the delivery of bad articles.
That, that's obvious, right.
Now one interesting question here is,
how should we set these coefficients?
Now I just showed a 3 and a negative 2,
as the possible coefficients, but one can
ask the question, are they reasonable?
So what do you think?
Do you think that's a reasonable choice?
What about other choices?
So for example, we can have 10 and
minus 1, or 1 and minus 10.
What's the difference?
What do you think?
How would this utility function affect
the system's threshold decision?
Right, you can think of these two extreme
cases, 10 and minus 1 versus 1 and minus 10.
Which one do you think would
encourage the system to over-deliver?
And which one would encourage
the system to be conservative?
Yeah?
If you think about it, you will see
that when we get a big reward for
delivering a good document and incur only
a small penalty for delivering a bad one,
intuitively, you would be
encouraged to deliver more, right?
And you can try to deliver more in the hope
of getting a good one delivered, and
then you'll get a big reward.
Right, so on the other hand,
if you choose 1 and minus 10,
you don't really get such a big prize
if you deliver a good document.
On the other hand, you will have
a big loss if you deliver a bad one.
You can imagine that the system
would be very reluctant to
deliver a lot of documents.
It has to be absolutely sure
that it's a relevant one.
So this utility function has to be
designed based on a specific application.
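The effect of the coefficients can be seen with a small calculation. The sketch below assumes a made-up batch in which a liberal delivery policy sends 4 good and 6 bad items; the function name `linear_utility` and its parameter names are chosen only for this illustration.

```python
def linear_utility(num_good, num_bad, reward=3, penalty=2):
    """Linear utility: reward per good item delivered minus penalty per bad item."""
    return reward * num_good - penalty * num_bad


# Hypothetical outcome of delivering liberally: 4 good items, 6 bad items.
for reward, penalty in [(3, 2), (10, 1), (1, 10)]:
    print(reward, penalty, linear_utility(4, 6, reward, penalty))
```

With coefficients 10 and minus 1 the same liberal delivery scores 34, so over-delivering pays off; with 1 and minus 10 it scores minus 56, so the system is pushed toward a higher, more conservative threshold.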
The three basic problems in content-based
filtering are the following.
First, it has to make a filtering decision.
So it has to be a binary decision maker,
a binary classifier.
Given a text, a text document, and
a profile description of the user,
it has to say yes or no, whether this
document should be delivered or not.
So that's a decision module, and
there should be an initialization
module as you have seen earlier.
And this is to get the system started.
And we have to initialize the system based
on only very limited text description,
or very few examples from the user.
And the third component is
a learning module, which
has to be able to learn from limited
relevance judgments, because we
can only learn from the user about their
preferences on the delivered documents.
If we don't deliver a document
to the user,
we would never be able to know whether
the user likes it or not, right?
And as we accumulate a lot of documents,
we can learn from the entire history.
Now, all these modules would have to
be optimized to maximize the utility.
So how can we build such a system?
There are many different approaches.
Here we are going to talk about
how to extend a retrieval system,
a search engine, for information filtering.
Again, here's why we've spent a lot of
time talking about search engines:
because it's actually not very hard
to extend a search engine for
information filtering.
So, here is the basic idea for
extending a retrieval system for
information filtering.
First, we can reuse a lot of
retrieval techniques to do scoring.
All right, so we know how to score
documents against queries et cetera.
We can measure the similarity between
a profile text description and a document.
And then we can use a score threshold for
the filtering decision.
We do retrieval, and then we kind
of find the scores of documents, and
then we apply a threshold
to see whether a document is
passing this threshold or not.
And if it's passing the threshold,
we are going to say it's relevant and
we are going to deliver it to the user.
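The score-then-threshold decision can be sketched as follows. This is a minimal illustration, assuming cosine similarity as the scoring function (any retrieval function from the earlier lectures could be substituted) and assuming the profile and the document are term-weight lists over the same vocabulary. The names `cosine` and `should_deliver` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def should_deliver(profile, doc, threshold):
    """Binary filtering decision: deliver if the score passes the threshold."""
    return cosine(profile, doc) >= threshold


# Made-up vectors over a three-term vocabulary.
profile = [1.0, 1.0, 0.0]
print(should_deliver(profile, [2.0, 1.0, 0.0], threshold=0.5))
print(should_deliver(profile, [0.0, 0.0, 3.0], threshold=0.5))
```

The scoring part is exactly what a search engine already does with a query; the only new piece is the threshold comparison.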
And another component that we have to add
is, of course, to learn from the history.
And here we can use the traditional
feedback techniques
to learn to improve scoring.
And we know Rocchio can be used for
scoring improvement, right?
And, but we have to develop new approaches
to learn how to set the threshold.
And you know,
we need to set it initially, and
then we have to learn how to
update the threshold over time.
So here's what the system
might look like if we just
generalized a vector-space model for
filtering problems, right?
So you can see the document vector could
be fed into a scoring module, which
already exists in a search engine
that implements a vector-space model.
And the profile will be treated
as a query essentially.
And then the profile vector can be
matched with the document vector,
to generate the score.
And then this score will be fed into
a thresholding module that would
say yes or no.
And then the evaluation would be based on
the utility for the filtering results.
If it says yes, and then the document
will be sent to the user, and
then the user could give some feedback.
And the feedback information would
be used to both
adjust the threshold and
adjust the vector representation.
So the vector learning is essentially
the same as query modification or
feedback in the case of search.
The threshold learning is a
new component that we need
to talk a little bit more about.
[MUSIC]

[SOUND].
There are some interesting
challenges in threshold
learning in the filtering problem.
So here I show
sort of the data that you can collect in
the filtering system.
So you can see the scores and
the status of relevance.
So the first one has a score of 36.5,
and it's relevant.
The second one is not relevant.
Of course, we have a lot of documents for
which we don't know the status,
because we have never delivered them to the user.
So as you can see here,
we only see the judgements of
documents delivered to the user.
So this is not a random sample.
So it's censored data.
It's kind of biased, so
that creates some difficulty for learning.
And secondly, there is in general very
little labeled data and very few relevant
examples, so it's also challenging for
machine learning approaches.
Typically they
require more training data.
And in the extreme case, at the beginning,
we don't even have any
labeled data at all.
The system still has to make a decision,
so
that's a very difficult
problem at the beginning.
Finally, there is also the issue of the
exploration versus exploitation tradeoff.
Now this means we also want to
explore the document space a little bit,
to see if the user
might be interested in the documents
that we have not yet labeled.
So, in other words, we're going to
explore the space of user interests
by testing whether the user might be
interested in some other documents that
currently are not matching
the user's interest
so well.
So how do we do that?
Well, we could lower the threshold a little
bit and just deliver some near
misses to the user
to see how the user
would respond to this extra document.
And this is a tradeoff, because
on the one hand, you want to explore,
but on the other hand,
you don't want to explore too much,
because then you would over-deliver
non-relevant information.
So exploitation means you would
exploit what you have learned about the user.
Let's say you know the user is
interested in this particular topic, so
you don't want to deviate that much.
But if you don't deviate at all,
then you don't explore at all.
That's also not good.
You might miss the opportunity to learn
another interest of the user.
So this is a dilemma.
And that's also a difficult
problem to solve.
Now how do we solve these problems?
In general, I think one can use the
empirical utility optimization strategy.
And this strategy is basically to optimize
the threshold based on, historical data,
just as you have seen on
the previous slide, right?
So you can just compute the utility
on the training data for
each candidate score threshold.
Pretend that we
cut at this point.
What if I cut at this
threshold; what would happen?
What's the utility?
Compute the utility, right?
We know the status, or we can base it on
an approximation from click-throughs, right?
So then we can just choose this
threshold that gives the maximum
utility on the training data.
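The empirical utility optimization described above can be sketched as follows. This is a minimal illustration under assumptions: the training data is a list of (score, is_relevant) pairs for documents that were delivered and judged, every observed score is tried as a candidate threshold, and the 3 and minus 2 coefficients from the earlier utility example are used as defaults. Apart from the first score of 36.5 (relevant), the example data is made up, and the name `best_threshold` is hypothetical.

```python
def best_threshold(judged, reward=3, penalty=2):
    """Pick the score threshold with the highest utility on the training data.

    judged : list of (score, is_relevant) pairs for delivered documents
    Returns (threshold, utility). On ties, the higher threshold is kept.
    """
    best_theta, best_utility = None, float("-inf")
    # Try each observed score as a cutoff, from highest to lowest.
    for theta in sorted({score for score, _ in judged}, reverse=True):
        # Utility if we had delivered exactly the documents scoring >= theta.
        utility = sum(
            reward if relevant else -penalty
            for score, relevant in judged
            if score >= theta
        )
        if utility > best_utility:
            best_theta, best_utility = theta, utility
    return best_theta, best_utility


# Illustrative history; only the first entry comes from the lecture.
history = [
    (36.5, True), (33.4, False), (32.1, True),
    (29.9, True), (27.3, False), (25.0, False),
]
print(best_threshold(history))
```

This picks the cutoff that would have been best in hindsight; it does nothing to explore below that cutoff, and it only sees documents that were actually delivered.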
But this, of course, doesn't account for
the exploration that we just talked about.
And there is also the difficulty of the biased
training sample, as we mentioned.
So in general, we can only get an upper
bound for the true optimal threshold,
because the true threshold
might actually be lower than this.
So it's possible that a discarded item
might actually be interesting to the user.
So how do we solve this problem?
Well, in general, as I said, we can lower
the threshold to explore a little bit.
So here's one particular approach, called
beta-gamma threshold learning.
The idea is as follows.
So, here I show a ranked list of
all the training documents
that we have seen so far.
And they are ranked by their positions.
And on the Y-axis, we show the Utility.
Of course, this function depends on
how you specify the coefficients in
the Utility function.
But we can imagine that, depending on the
cutoff position, we will have a utility.
That means, suppose I cut at this
position; that will be the utility.
So we can, for
example, identify some cutoff points.
The optimal point, theta
optimal, is the point
where we would achieve the maximum
utility if we had chosen this threshold.
And there is also the
zero-utility threshold.
As you can see, at this cutoff
the utility is 0.
Now, what does that mean?
That means if I lower the threshold
until I reach this
threshold, the utility would be lower,
but it's still
non-negative, at least.
So it's not as high as
the optimal utility, but
it gives us a safe point
down to which we can explore the threshold.
As I just explained, it's desirable
to explore the interest space.
So it's desirable to lower the threshold
based on your training data.
So that means, in general, we want to set
the threshold somewhere in this range.
And we can use alpha to
control the deviation from
the optimal utility point.
So you can see the formula of the
threshold will be just an interpolation
of the zero-utility threshold and
the optimal-utility threshold.
Now the question is
how should we set alpha, you know, and
when should we deviate more
from the optimal utility point?
Well, this can depend on multiple factors,
and one way to solve the problem is to
encourage this threshold
mechanism to explore
up to the zero point,
and that's a safe point, but
we're not necessarily going to
reach all the way to the zero point.
Rather, we're going to use other
parameters to further define alpha.
And this specifically is as follows.
So there will be a beta
parameter to control
the deviation from the optimal threshold.
And this can be based on, for
example, accounting for
over-fitting to
the training data, let's say.
And so
this can be just an adjustment factor.
But what's more interesting is this gamma
parameter here, and you can see in this
formula gamma is controlling
the influence
of the number of examples
in the training data set.
So you can see in this formula, as N, which
denotes the number of training examples,
becomes bigger, it would
actually encourage less exploration.
In other words, when N is very small,
it will try to explore more.
And that just means if we
have seen few examples,
we're not sure whether we have
exhausted the space of interests.
So [INAUDIBLE].
But once we have seen many examples
from the user, many data points,
then we feel that we probably
don't have to explore more.
So this gives us a dynamic strategy for
exploration, right?
The more examples we have seen,
the less exploration we are going to do.
So, the threshold will be closer
to the optimal threshold.
So, that's the basic
idea of this approach.
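The formula on the slide is not reproduced in this transcript. Here is a minimal sketch, assuming the usual beta-gamma form: threshold = alpha * theta_zero + (1 - alpha) * theta_optimal, with alpha = beta + (1 - beta) * exp(-N * gamma). The function name and the parameter values are made up for illustration.

```python
import math


def beta_gamma_threshold(theta_zero, theta_optimal, n_examples,
                         beta=0.1, gamma=0.05):
    """Interpolate between the zero utility and optimal utility thresholds.

    Assumed form: alpha = beta + (1 - beta) * exp(-N * gamma).
    With few examples alpha is close to 1 (explore toward theta_zero);
    with many examples alpha shrinks toward beta (stay near theta_optimal).
    """
    alpha = beta + (1.0 - beta) * math.exp(-n_examples * gamma)
    return alpha * theta_zero + (1.0 - alpha) * theta_optimal


# Illustrative values only: theta_zero = 0.2, theta_optimal = 0.6.
for n in (0, 10, 100, 1000):
    print(n, round(beta_gamma_threshold(0.2, 0.6, n), 3))
```

As N grows, the printed threshold moves from the zero utility threshold toward the optimal threshold, but it never goes below the zero utility point.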
Now, this approach actually has been
working well in some evaluation studies,
and it is particularly effective.
It can also work with an arbitrary utility
with an appropriate lower bound.
And it explicitly addresses the
exploration-exploitation tradeoff.
And it kind of uses the zero utility
threshold point as a safeguard
for the exploration-exploitation tradeoff.
We're never going to explore
further than the zero utility point.
So, if you take the analogy of gambling,
you don't want to risk losing money.
So it's a safe strategy,
a conservative strategy for exploration.
And the problem is, of course,
this approach is purely heuristic.
And the zero utility lower bound
is also often too conservative.
And there are, of course,
more advanced machine learning
approaches that have been proposed for
solving these problems.
And this is a very active research area.
So to summarize, there
are two strategies for
recommender systems or filtering systems.
One is content based,
which is looking at the item similarity.
And the other is collaborative filtering,
which is looking at the user similarity.
In this lecture we have covered
content-based filtering approach.
In the next lecture, we're going to
talk about collaborative filtering.
In a content-based filtering
system, we generally have to solve
several problems related to the filtering
decision, learning, etc.
And such a system can actually
be based on a search engine
system by adding a threshold mechanism and
adding adaptive learning
algorithm to allow the system
to learn from long term
feedback from the user.
[MUSIC]

[SOUND] This lecture is about
Collaborative Filtering.
In this lecture, we're going to continue
the discussion of Recommender Systems.
In particular, we're going to look at
the approach of collaborative filtering.
You have seen this slide before
when we talked about the two
strategies to answer the basic
question will user U like item X.
In the previous lecture,
we looked at the item similarity,
that's content-based filtering.
In this lecture, we're going to
look at the user similarity.
This is a different strategy
called collaborative filtering.
So first of all,
what is collaborative filtering?
It is to make filtering decisions for
an individual user based on
the judgments of other users.
That is to say,
we will infer an individual's interests or
preferences from those
of other similar users.
So the general idea is the following.
Given a user u, we are going to
first find the similar users,
u1 through um, and then we're going to
predict the user's preferences based on
the preferences of these similar users,
u1 through um.
Now the user similarity here can be
judged based on the similarity of
their preferences on
a common set of items.
Now here you'll see that the exact
content of an item doesn't really matter.
We're going to look only at
the relationship between the users and
the items.
So this means this approach
is very general; it can be
applied to any items, not
just text objects.
So this approach, it would work well
under the following assumptions.
First, users with the same interests
will have similar preferences.
Second, the users with similar preferences
probably share the same interests.
So for example, if the interest of
the user is in information retrieval,
then we can infer the user
probably favors SIGIR papers.
And so those who are interested in
information retrieval research probably
all favor SIGIR papers;
that's an assumption that we make.
And if this assumption is true,
then it would help collaborative
filtering to work well.
We can also assume that if we
see people favor SIGIR papers,
then we can infer their interest is
probably information retrieval.
So in these simple examples,
it seems to make sense.
And in many cases such an assumption
actually does make sense.
So, another assumption you have to make
is that there are a sufficiently large
number of user preferences
available to us.
So for example, if you see a lot
of ratings of users for movies and
those indicate their
preferences in movies.
And if you have a lot of such data,
then collaborative filtering
can be very effective.
If not, there will be a problem and
that's often called a cold start problem.
That means you don't have many
preferences available, so
the system could not fully take advantage
of collaborative filtering yet.
So let's look at the collaborative
filtering problem in a more formal way.
And so this picture shows that we are in
general considering a lot of users, and
we're showing m users here,
u1 through um. And we're also
considering a number of objects,
let's say,
n objects denoted as o1 through on, and
then we will assume that the users will
be able to judge those objects and
the user could for example,
give ratings to those items.
For example, those items could be movies,
could be products and
then the users would give ratings
one through five, let's say.
So what you see here is that we have
assumed some ratings available for
some combinations.
So some users have watched movies,
they have rated those movies.
They obviously won't be able
to watch all the movies and
some users may actually
only watch a few movies.
So this is in general a sparse matrix,
right?
So many entries
have unknown values, and
what's interesting here is
we could potentially infer
the value of an element in this
matrix based on other values, and
that's actually the central question
in collaborative filtering.
And that is,
we assume an unknown function here f,
that would map a pair of user and
object to a rating.
And we have observed some
values of this function, and
we want to infer the value
of this function for
other pairs that
don't have values available here.
So this is very similar to
other machine learning problems,
where we would know the values of the
function on some training data, and
we hope to predict the values of
this function on some test data.
All right.
So this is function approximation.
And how can we figure out the function
based on the observed ratings?
So this is the setup.
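To make the setup concrete, here is a small sketch of a partially observed rating function f(user, object). The user names, object names, and rating values are invented for illustration; only the idea of known versus unknown entries comes from the lecture.

```python
# Observed ratings (1 through 5); missing entries are simply absent.
ratings = {
    "u1": {"o1": 5, "o2": 3},
    "u2": {"o1": 4, "o3": 2},
    "u3": {"o2": 1, "o3": 5},
}
objects = ["o1", "o2", "o3"]


def f_observed(user, obj):
    """Return the observed rating, or None when the entry is unknown."""
    return ratings.get(user, {}).get(obj)


# The pairs whose values collaborative filtering would try to infer.
unknown_pairs = [(u, o) for u in sorted(ratings) for o in objects
                 if f_observed(u, o) is None]
print(unknown_pairs)
```

The observed entries play the role of training data, and the unknown pairs are where we want to predict the value of the function.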
Now there are many approaches
to solving this problem.
And in fact,
this is a very active research area.
There are special
conferences dedicated to the problem,
including a major conference
devoted to it.
[MUSIC]

[NOISE].
And here what we'll do is
talk about a basic strategy,
and that would be based on
the similarity of users,
and then predicting the rating
of an object by an
active user using the ratings of
users similar to this active user.
This is called a memory-based approach
because it's a little bit similar to
storing all the user information.
And when we are considering a particular
user, we're going to try to
kind of retrieve the relevant users, or
the users similar to this user.
And then try to use the
information about those users
to predict the preference of this user.
So here's the general idea, and
we use some notations here.
X sub i j denotes the rating
of object o j by user u i.
And n sub i is the average rating
of all objects by this user.
So this n i is needed
because we would like to normalize
the ratings of objects by this user.
So how do you do normalization?
Well, we just
subtract
the average rating from all the ratings.
Now these are the normalized ratings, so
that the ratings from different
users will be comparable.
Because some users might be more generous,
and they generally give higher ratings.
But some others might be more critical.
So their ratings cannot be
directly compared with each other or
aggregated together.
So, we need to do this normalization.
Now, the prediction of the rating
on the item by another user, or
active user, u sub a here,
can be based on the average
ratings of similar users.
So the user u sub a is the user that we
are interested in recommending items to.
And we now are interested in
recommending this o sub j.
So we're interested in knowing how
likely this user will like this object.
How do we know that?
Well, the idea here is to look at
whether users similar to this user
have liked this object.
So mathematically, this is to say
the predicted rating of this
user on this object,
user a on object o j, is
basically a combination of
the normalized ratings of different users.
And in fact, here,
we're taking a sum over all the users.
But not all users contribute
equally to the average.
And this is controlled by the weights.
So this
weight controls the influence
of a user on the prediction.
And of course, naturally this weight
should be related to the similarity
between ua and this particular user, ui.
The more similar they are then
the more contribution we would like
user u i to make in predicting
the preference of u a.
So the formula is extremely simple.
You're going to see it's a sum
of all the possible users.
And inside the sum, we have their ratings,
well their normalized
ratings as I just explained.
The ratings need to be normalized in
order to be comparable with each other.
And then these ratings
are weighted by their similarity.
So we can imagine w of a and
i is just the similarity of user a and user i.
Now, what's k here?
Well, k is simply a normalizer.
It's just one over the sum
of all the weights, over all the users.
And so this means, basically, if you
consider the weight here together with k,
we have coefficients or weights
that would sum to one for all the users.
And it's just a normalization strategy,
so that you get this predicted rating
in the same range as the ratings
that we use to make the prediction.
Right?
So, this is basically the main idea
of memory-based approaches for
collaborative filtering, okay?
Once we make this prediction,
we also would like to map back
to the rating that
the user would actually make.
And this is to further add the
mean rating or
average rating of this user u
sub a to the predicted value.
This would recover
a meaningful rating for this user.
So if this user is generous,
then the average would be somewhat high,
and when we added that, the rating will
be adjusted to a relatively high rating.
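The prediction just described can be sketched in a few lines. This is only an illustration under some assumptions: the weights w(a, i) are passed in as given numbers, the sum runs only over users who actually rated the object (the lecture says "all users", but unrated entries have no value to add), and the function and variable names are made up.

```python
def mean_rating(user_ratings):
    """n_i: the average rating a user gave over the objects they rated."""
    return sum(user_ratings.values()) / len(user_ratings)


def predict_rating(active, obj, ratings, weights):
    """Predict x_aj = n_a + k * sum_i w(a, i) * (x_ij - n_i).

    k is one over the sum of the weights that were used, so the
    weighted coefficients sum to one.
    """
    n_a = mean_rating(ratings[active])
    weighted_sum = 0.0
    weight_total = 0.0
    for user, user_ratings in ratings.items():
        if user == active or obj not in user_ratings:
            continue  # assumption: skip users with no rating for obj
        w = weights[(active, user)]
        weighted_sum += w * (user_ratings[obj] - mean_rating(user_ratings))
        weight_total += w
    if weight_total == 0:
        return n_a  # no usable neighbors: fall back to the user's mean
    k = 1.0 / weight_total
    return n_a + k * weighted_sum


# Invented toy data: ua has not rated o3 yet.
ratings = {
    "ua": {"o1": 4, "o2": 2},
    "u1": {"o1": 5, "o2": 3, "o3": 5},
    "u2": {"o1": 1, "o2": 5, "o3": 2},
}
weights = {("ua", "u1"): 0.9, ("ua", "u2"): 0.1}
print(round(predict_rating("ua", "o3", ratings, weights), 3))
```

Because u1 is weighted much more heavily than u2, the prediction for ua ends up above ua's own average rating of 3.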
Now, when you recommend an item to a user,
this actually doesn't really matter
because you are interested in basically
the normalized rating
that's more meaningful.
But when people evaluate these collaborative
filtering approaches, they typically
assume the actual ratings of the user
on these objects to be unknown.
And then you do the prediction and
then you compare the predicted
ratings with the actual ratings.
So
you do have access to the actual ratings.
But then you pretend you don't know.
And then you compare the system's
predictions with the actual ratings.
In that case, obviously the system's
prediction would have to be adjusted to
match the actual ratings of the user, and
this is what's happening here, basically.
Okay?
So this is the memory-based approach.
Now of course if you look at the formula,
if you want to write
the program to implement it.
You still face the problem of determining
what is this w function, right?
Once you know the w function, then
the formula is very easy to implement.
So indeed there are many different ways to
compute this function or this weight, w.
And, specific approaches generally
differ in how this is computed.
So, here are some possibilities.
And you can imagine
there are many other possibilities.
One popular approach is to use
the Pearson correlation coefficient.
This would be a sum over the commonly
rated items, and the formula
is the standard Pearson correlation
coefficient formula, as shown here.
So this basically measures
whether the two users tend
to both give higher ratings to similar
items, or lower ratings to similar items.
Another measure is the cosine measure, and
this is to treat the rating vectors as
vectors in the vector space, and then
we're going to measure the angle and
compute the cosine of
the angle between the two vectors.
And this measure has been used in the
vector space model for retrieval as well.
So as you can imagine, there are so
many different ways of doing that.
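Here is a hedged sketch of the two similarity measures just mentioned. The slide formulas are not in the transcript, so a few choices are assumptions: Pearson is computed with means taken over the commonly rated items, and cosine treats an unrated item as zero. The function names and the toy ratings are invented.

```python
import math


def pearson(ra, ri):
    """Pearson correlation over the items both users rated."""
    common = sorted(set(ra) & set(ri))
    if len(common) < 2:
        return 0.0
    mean_a = sum(ra[j] for j in common) / len(common)
    mean_i = sum(ri[j] for j in common) / len(common)
    cov = sum((ra[j] - mean_a) * (ri[j] - mean_i) for j in common)
    var_a = sum((ra[j] - mean_a) ** 2 for j in common)
    var_i = sum((ri[j] - mean_i) ** 2 for j in common)
    if var_a == 0 or var_i == 0:
        return 0.0  # a constant rater carries no correlation signal
    return cov / math.sqrt(var_a * var_i)


def cosine(ra, ri):
    """Cosine of the angle between two rating vectors (unrated = 0)."""
    dot = sum(ra[j] * ri[j] for j in set(ra) & set(ri))
    norm_a = math.sqrt(sum(v * v for v in ra.values()))
    norm_i = math.sqrt(sum(v * v for v in ri.values()))
    if norm_a == 0 or norm_i == 0:
        return 0.0
    return dot / (norm_a * norm_i)


a = {"o1": 1, "o2": 2, "o3": 3}
b = {"o1": 2, "o2": 4, "o3": 6}   # same taste, more generous
c = {"o1": 3, "o2": 2, "o3": 1}   # opposite taste
print(pearson(a, b), pearson(a, c), round(cosine(a, c), 3))
```

Note that neither function looks at what the items are; only the ratings enter the computation.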
In all these cases, note that the user
similarity is based on their preferences
on items, and we did not actually use
any content information of these items.
It didn't matter what these items are.
They can be movies, they can be books,
they can be products,
they can be text documents.
We just didn't care about the content.
And so this allows such an approach to be
applied to a wide range of problems.
Now in some newer approaches of course,
we would like to use more
information about the user.
Clearly, we know more about the user, not
just these preferences on these items.
And so in an actual filtering system using
collaborative filtering, we could also
combine that with content-based filtering;
we could use context information.
And those are all interesting approaches
that people are still studying.
There are newer approaches proposed.
But this approach has been shown
to work reasonably well and
it's easy to implement.
And in practical applications,
it could be a starting point to
see if the strategy works well for
your application.
So there are some obvious ways
to also improve this approach.
Mainly, we would like to improve
the user similarity measure.
And there are some practical
issues to deal with here as well.
So for example,
there will be a lot of missing values.
What do you do with them?
Well, you can set them to default values
or the average ratings of the user.
And that will be a simple solution.
But there are advanced approaches
that can actually try to predict those
missing values and then use the predicted
values to improve the similarity.
So in fact, with the memory-based approach,
you can predict those missing values,
right?
So you can imagine
you have an iterative approach where you
first do some preliminary prediction and
then you can use the predicted values to
further improve the similarity function.
Right, so this is
a heuristic way to solve the problem.
And the strategy obviously would affect
the performance of collaborative filtering;
just like with the other heuristics,
we improve the similarity function.
Another idea, which is actually very
similar to the idea of IDF that we
have seen in text retrieval, is called
the inverse user frequency, or IUF.
Now here the idea is to look at where
the two users share similar ratings.
If the item is a popular item that has
been viewed by many people, then
seeing two people interested
in this item may not be so interesting.
But if it's a rare item and
has not been viewed by many users,
but these two users
[INAUDIBLE] to this item
and they give similar ratings, then it
says more about their similarity, right?
So it's kind of to emphasize
more on similarity
on items that are not
viewed by many users.
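The lecture does not give a formula for IUF. A minimal sketch, assuming a direct analogy with IDF, is log(M / m_j), where M is the total number of users and m_j is the number of users who rated item j. The function name and the toy data are invented.

```python
import math


def inverse_user_frequency(ratings):
    """Return an assumed IUF weight log(M / m_j) for every rated item."""
    num_users = len(ratings)
    raters = {}
    for user_ratings in ratings.values():
        for obj in user_ratings:
            raters[obj] = raters.get(obj, 0) + 1
    return {obj: math.log(num_users / count)
            for obj, count in raters.items()}


ratings = {
    "u1": {"blockbuster": 5, "rare_film": 4},
    "u2": {"blockbuster": 3, "rare_film": 5},
    "u3": {"blockbuster": 4},
    "u4": {"blockbuster": 2},
}
iuf = inverse_user_frequency(ratings)
print(iuf)
```

An item rated by everyone gets weight zero, so agreement on it says nothing about similarity, while agreement on the rare item counts for more. One way to use these weights would be to multiply each item's contribution inside the similarity sum by its IUF.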
[MUSIC]

[SOUND] So to summarize our
discussion of recommender systems:
in some sense, the filtering
task of recommendation is easy, and
in some other sense,
the task is actually difficult.
So it's easy because the user's
expectations are low. In this case,
the system takes the initiative to
push the information to the user.
So the user doesn't really make an effort.
So any recommendation is
better than nothing, right?
So unless you recommend
all the, you know,
noisy items or useless documents,
if you can recommend
some useful information, users in general
would appreciate it, all right.
So in that sense, that's easy.
However, filtering is
actually a much harder task.
Because you have to make a binary
decision, and you can't afford waiting for
a lot of items and then deciding
whether one item is better than others.
You have to make a decision
when you see this item.
Let's think about news filtering.
As soon as you see the news,
you have to decide whether the news
would be interesting to a user.
If you wait for a few days, well, even if
you can make an accurate recommendation of
the most relevant news, the utility
would be significantly decreased.
Another reason why it's hard,
it's because of data sparseness.
If you think of this as a learning
problem in collaborative filtering, for
example, it's purely based on
learning from the past ratings.
So if you don't have many ratings,
there's really not much you can do, right?
And I just mentioned this problem.
This is actually a very serious problem.
But of course there are strategies that
have been proposed to solve the problem.
And
there are different strategies that
we can use to alleviate the problem.
We can use, for example, more user
information to assess their similarity
instead of using only the preferences
of these users on these items.
There may be additional information
available
about the user, etc. And
we also talked about the two
strategies for the filtering task.
One is content-based, where we
look at item similarity, and the other
is collaborative filtering,
where we look at the user similarity.
And they obviously can be combined.
In a practical system, you can imagine,
they generally would have to be combined.
So that will give us a hybrid strategy for
filtering.
And we also could recall that we
talked about push versus
pull as two strategies for
getting access to the text data.
And a recommender system will help
users in the push mode,
and search engines
serve users in the pull mode.
Obviously the two should be combined, and
they can be combined to have a system
that can support users with multiple
modes of information access.
So in the future, we could anticipate
such a system to be more useful to a user.
And also this is an active research area, so
there are a lot of new algorithms
being proposed over time.
In particular, those new algorithms tend
to use a lot of context information.
Now the context here could be
the context of the user, you know,
it could also be context of documents or
items.
The items are not isolated.
They are connected in many ways.
The users might form a social network as
well, so there's a rich context there
that we can leverage in order to really
solve the problem well, and that's
an active research area where also machine
learning algorithms have been applied.
Here are some additional readings.
The handbook called Recommender Systems
has a collection of
a lot of good articles that
can give you an overview
of a number of specific
approaches to recommender systems.
[MUSIC]

[SOUND] This lecture is
a summary of this course.
This map shows the major topics
we have covered in this course.
And here are some key
high-level take-away messages.
First, we talked about natural
language content analysis.
Here the main take-away message is natural
language processing is the foundation for
text retrieval, but
current NLP isn't robust enough.
So the bag-of-words
representation is generally
the main method used in
modern search engines, and
it's often sufficient for
most of the search tasks.
But obviously, for
more complex search tasks,
we need deeper natural language
processing techniques.
And we then talked about
high-level strategies for
text access, and we talked about
push versus pull. In pull,
we talked about querying
versus browsing.
Now, in general, in future search engines,
we should integrate
all these techniques to provide
multi-mode information access. And
then we talked about a number of
issues related to search engines.
We talked about the search problem, and
we framed that as a ranking problem, and
we talked about a number
of retrieval methods.
We started with an overview of
the vector space model and
the probabilistic model, and then we talked
about the vector space model in depth.
We also later talked about
the language modeling approach, and
that's a probabilistic model.
And here, the main take-away message is
that modern retrieval functions tend to
look similar, and
they generally use various heuristics.
The most important ones are TF-IDF weighting,
document length normalization, and
that TF is often transformed through
a sub-linear transformation function. And
then we talked about how to
implement a retrieval system.
And here the main technique that we talked
about is how to construct an inverted index,
so that we can prepare the system
to answer a query quickly, and
we talked about how to do fast search
by using the inverted index. And
we then talked about how to
evaluate a text retrieval system.
We mainly introduced the Cranfield
evaluation methodology.
This is a very important evaluation
methodology that can be applied to
many tasks.
We talked about the major
evaluation measures.
So the most important measures for
a search engine are MAP, mean
average precision, and nDCG,
normalized discounted cumulative
gain, and also precision and
recall, the two basic measures.
And we then talked about
feedback techniques.
And we talked about Rocchio
in the vector space model and
the mixture model in
the language modeling approach.
Feedback is a very important
technique, especially considering
the opportunity of learning from
a lot of clickthroughs on the web.
We then talked about web search.
And here, we talked about how to
use parallel indexing to resolve
the scalability issue in indexing;
we introduced MapReduce. And
then we talked about how to use
link information to improve search.
We talked about PageRank and
HITS as the major algorithms
to analyze links on the web.
We then talked about learning to rank.
This is a use of machine learning
to combine multiple features for
improving scoring.
Not only can the effectiveness be
improved using this approach, but
we can also improve the robustness
of the ranking function,
so that it's not easy to spam
a search engine with just
some features to promote a page.
And finally,
we talked about the future of web search.
We talked about some major
directions that we might see
in the future in improving the current
generation of search engines.
And then finally, we talked about
recommender systems, and these are systems
to implement the push mode, and
we talked about the two approaches.
One is content based,
one is collaborative filtering and
they can be combined together.
Now an obvious missing piece in this
picture is the user, you can see.
So the user interface is also an important
component in any search engine,
even though the current search
interface is relatively simple.
There actually have been a lot
of studies of user interfaces
related to visualization, for
example, and this is a topic that
you can learn more about by reading this book.
It's an excellent book about all kinds
of studies of search user interfaces.
If you want to know more about
the topics that we talked about,
you can also read some additional
readings that are listed here.
In this short course, we have only managed
to cover some basic topics in text
retrieval and search engines.
And these resources provide additional
information about more advanced topics, and
they give a more thorough treatment of
some of the topics that we talked about.
And a main source is the
Synthesis Digital Library,
where you can see a lot
of short
textbooks or long tutorials.
They tend to provide a lot of
information to explain a topic, and
there are multiple series that
are related to this course.
One is Information Concepts,
Retrieval, and Services.
Another is Human Language Technologies, and
yet another is Artificial
Intelligence and Machine Learning.
There are also some major journals and
conferences listed over here that
tend to have a lot of research papers
related to the topic of this course.
And finally for
more information about resources
including readings and tool kits, etc.
You can check out this URL.
So, if you have not taken
the text mining course in
this data mining specialization series,
then naturally,
the next step is to take that course.
As this picture shows
to mine the text data,
we generally need two kinds of techniques.
One is text retrieval,
which is covered in this course.
And these techniques will help us
convert raw big text data into small,
relevant text data, which are actually
needed in the specific application.
And humans play an important
role in mining any text data,
because text data is written for
humans to consume.
So, involving humans in the process
of data mining is very important.
And in this course,
we have covered various strategies to
help users get access to
the most relevant data.
These techniques are also essential
in any text mining system to help
provide provenance and
to help users interpret the
patterns that the user would
find through text data mining.
So, in general, the user would have to
go back to the original data to better
understand the patterns.
So the text mining course, or
rather the text mining and
analytics course, will be
dealing with what to do once
the user has found the information.
So this is the part of this picture
where we would convert
the text data into action or knowledge.
And this has to do with helping
users to further digest
the found information, or
to find patterns and
to reveal knowledge buried in text. And
such knowledge can be used in an
application system to help decision-making
or to help a user finish a task.
So, if you have not taken that
course,
the natural next step would
be to take that course.
Thank you for taking this course.
I hope you have found this
course to be useful to you and
I look forward to interacting
with you at a future activity.
[MUSIC]

[SOUND].
This lecture is about web indexing.
In this lecture, we will continue
talking about web search, and
we're going to talk about how
to create a web scale index.
So once we crawl the web
we've got a lot of web pages.
The next step is we use the indexer
to create the inverted index.
In general, we can use the standard
information retrieval techniques for
creating the index, and that is what we
talked about in the previous lecture.
But there are new challenges that we
have to solve for web scale indexing,
and the two main challenges are
scalability and efficiency.
The index will be so large that it cannot
actually fit into any single machine or
single disk, so we have to store
the data on multiple machines.
Also, because the data is so large,
it's beneficial to process the data in
parallel so
that we can produce the index quickly.
To address these challenges,
Google has made a number of innovations.
One is the Google File System,
that's a general distributed file system
that can help programmers manage files
stored on a cluster of machines.
The second is MapReduce.
This is a general software framework for
supporting parallel computation.
Hadoop is the most well known open
source implementation of MapReduce,
now used in many applications.
So this is the architecture
of the Google File System.
It uses a very simple centralized
management mechanism to manage
all the specific locations of files.
So it maintains the file namespace and
a lookup table to know where
exactly each file is stored.
The application client would
then talk to this GFS master
and obtain specific locations of
the files that it wants to process.
And once the GFS client has obtained
the specific information about the files,
then the application client
can talk directly to the specific
servers where the data
actually sits.
So that you can avoid involving
other nodes in the network.
So when this file system
stores the files on machines,
the system also would create
fixed-size chunks.
So the data files are separated
into many chunks;
each chunk is 64 megabytes,
so it's pretty big.
And that's appropriate for
large data processing.
These chunks are replicated
to ensure reliability.
So this is something that the
programmer doesn't have to worry about,
and it's all taken care
of by this file system.
So from the application perspective,
the programmer would see this
as if it's a normal file.
The programmer doesn't have to know
where exactly it's stored, and
can just invoke high-level
operators to process the file.
And another feature is that the data
transfer is directly between the
application and chunk servers, so
it's efficient in this sense.
On top of the Google File System,
Google also proposed MapReduce as
a general framework for
parallel programming.
Now, this is very useful to support
a task like building an inverted index.
And so this framework is hiding a lot of
low-level features from the programmer.
As a result, the programmer can
make minimum effort to create
an application that can be run
on a large cluster in parallel.
So, some of the low-level
details hidden in the framework
include the specific network
communications, or load balancing,
or where the tasks are executed; all these
details are hidden from the programmer.
There is also a nice feature which
is the built-in fault tolerance.
If one server is broken,
let's say, so it's down, and
then some tasks may not be finished,
then the MapReduce mechanism would
know that the task has not been done.
So it would automatically dispatch the
task on other servers that can do the job.
And therefore, again, the programmer
doesn't have to worry about that.
So here's how MapReduce works.
The input data will be separated
into a number of key, value pairs.
Now, what exactly is in the value
will depend on the data.
And it's actually a fairly
general framework to allow you to
just partition the data
into different parts.
And each part can be then
processed in parallel.
Each key, value pair will be
then sent to a map function.
The programmer will write
the map function, of course.
And then the map function will
process this key value pair and
generate
a number of other key value pairs.
Of course, the new key is usually
different from the old key
that's given to the map as input.
And these key value pairs
are the output of the map function.
And all the outputs of all the map
functions will be then collected.
And then they will be further
sorted based on the key.
And the result is that all the values
that are associated with the same
key will be then grouped together.
So now we've got a pair of a key and a set
of values that are attached to this key.
So this will then be sent
to a reduce function.
Now, of course, each reduce function will
handle a different key.
So we will send
these output values to
multiple reduce functions,
each handling a unique key.
A reduce function would then process
the input, which is a key and
a set of values, to produce another
set of key values as the output.
So these output values would be then
collected together to form
the final output.
Right, so this is
the general framework of MapReduce.
Now, the programmer only needs to
write the map function and
the reduce function.
Everything else is actually taken
care of by the MapReduce framework.
So, you can see the programmer really
only needs to do minimum work.
And with such a framework, the input data
can be partitioned into multiple parts.
Each is processed in
parallel first by map, and
then in the process, after
we reach the reduce stage,
multiple reduce functions
can also further process
the different keys and
their associated values in parallel.
So it
achieves the purpose of parallel
processing of a large dataset.
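The data flow just described can be imitated in a single process. This is only a toy sketch, not how Hadoop or Google's MapReduce is implemented: it runs sequentially, and the names run_mapreduce, map_fn, and reduce_fn, as well as the city and temperature records, are invented for illustration.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Run map, then group by key, then reduce, all in one process."""
    # Map stage: each input (key, value) pair yields new (key, value) pairs.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))

    # Collect stage: values that share a key are grouped together.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce stage: one call per distinct key, visited in sorted key order.
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output


def map_fn(record_id, text):
    """Turn 'city temperature' into a (city, temperature) pair."""
    city, temperature = text.split()
    return [(city, int(temperature))]


def reduce_fn(city, temperatures):
    """Keep the highest temperature seen for each city."""
    return [(city, max(temperatures))]


records = [(1, "paris 21"), (2, "oslo 12"), (3, "paris 25"), (4, "oslo 9")]
print(run_mapreduce(records, map_fn, reduce_fn))
```

In a real framework the map calls and the reduce calls would be spread over many machines; here the programmer-written part is still just the two small functions.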
So let's take a look at a simple example,
and that's word counting.
The input is files containing words.
And the output that we want to generate is
the number of occurrences of each word, so
it's the word count.
Right, we know
this kind of counting would be useful to,
for example, assess the popularity
of a word in a large collection.
And this is useful for achieving
the effect of IDF weighting for search.
So how can we solve this problem?
Well, one natural thought is that,
well, this task can be done in
parallel by simply counting different
parts of the file in parallel and
then in the end,
we just combine all the counts.
And that's precisely the idea of
what we can do with MapReduce.
We can parallelize over the lines
in this input file.
So more specifically, we can assume
the input to each map function
is a key-value pair that represents the
line number and the string on that line.
So the first line, for
example, has a key of one.
And the value is Hello World Bye World,
so there are just four words on that line.
So this key-value pair will
be sent to a map function.
The map function would then just
count the words in this line.
And in this case, of course,
there are only four words.
Each word gets a count of one.
And these are the output that you see here
on this slide, from this map function.
So, the map function
is really very simple.
If you look at what the pseudocode
looks like on the right side, you see
it simply needs to iterate over
all the words in this line,
and then just call a Collect function,
which means it would then send the word
and the count to the collector.
The collector would then try to
sort all these key value pairs
from different map functions.
Right?
So the functions are very simple.
And the programmer specifies this function
as a way to process each part of the data.
Of course, the second line will be
handled by a different map function,
which will produce a similar output.
Okay, now the output from the map
functions will be then sent to
a collector.
And the collector will do
the internal grouping or sorting.
So at this stage, you can see we
have collected multiple pairs.
Each pair is a word and
its count in the line.
So once we see all these pairs,
then we can sort them based on the key,
which is the word.
So we will collect all the counts of
a word, like bye, here, together.
And similarly, we do that for other words.
Like Hadoop, hello, etc.
So each word now is attached to
a number of values, a number of counts.
And these counts represent the occurrences
of this word in different lines.
So now we have got a new pair of a key and
a set of values,
and this pair will then be
fed into a reduce function.
So the reduce function now will
have to finish the job of counting
the total occurrences of this word.
Now it has already got all
these partial counts, so
all it needs to do is
simply to add them up.
So the reduce function shown
here is very simple as well.
You have a counter, and then you iterate over
all the counts that you see in this array,
and you just accumulate these counts,
right.
And then finally, you output the key
and the total count,
and that's precisely what we want as
the output of this whole program.
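The word counting flow just described can be sketched in a few lines of Python. This is only a single-machine illustration, not the actual MapReduce framework: the function names, the `group_by_key` helper that stands in for the collector, and the text of the second input line are all assumptions made for the example.

```python
from collections import defaultdict

def map_word_count(line_number, line):
    """Map: emit a (word, 1) pair for every word on one input line."""
    return [(word, 1) for word in line.split()]

def group_by_key(pairs):
    """Stand-in for the collector: group all values that share a key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))

def reduce_word_count(word, counts):
    """Reduce: add up the partial counts collected for one word."""
    total = 0
    for count in counts:
        total += count
    return (word, total)

# Line 1 is from the lecture; line 2 is assumed for illustration.
lines = {1: "Hello World Bye World", 2: "Hello Hadoop Bye Hadoop"}

map_output = []
for number, text in lines.items():
    map_output.extend(map_word_count(number, text))

word_counts = dict(
    reduce_word_count(word, counts)
    for word, counts in group_by_key(map_output).items()
)
print(word_counts)
```

In a real framework each call to the map function and each call to the reduce function could run on a different machine; here they simply run one after another.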
So, you can see, this is already very
similar to building an inverted index,
and if you think about it,
the output here is indexed by a word, and
we have already got a dictionary,
basically.
We have got the count.
But what's missing is the document IDs and
the specific
frequency counts of words
in those documents.
So we can modify this slightly to actually
build an inverted index in parallel.
So here's one way to do that.
So in this case, we can assume
the input to a map function is a pair
of a key which denotes the document ID and
the value denoting the string for
that document.
So it's all the words in that document.
And so the map function will
do something very similar to
what we have seen in
the word counting example.
It simply groups all the counts of
each word in this document together.
And it will then generate
a set of key value pairs.
Each key is a word.
And the value is the count of this word
in this document plus the document ID.
Now, you can easily see why we
need to add document ID here.
Of course, later, in the inverted index,
we would like to keep this information, so
the map function should keep track of it.
And this can then be sent to
the reduce function later.
Now, similarly another document D2
can be processed in the same way.
So in the end, again, there is a sorting
mechanism that would group them together.
And then we will have just
a key like java associated
with all the documents
that match this key, or
all the documents where java occurred,
and their counts,
right, so
the counts of java in those documents.
And these will be collected together
and then
fed into the reduce function.
So, now you can see,
the reduce function has already got input
that looks like an inverted index entry,
right?
So, it's just the word and all
the documents that contain the word and
the frequency of the word
in those documents.
So, all you need to do is simply
concatenate them into a continuous chunk
of data, and this can then be
written into a file system.
So basically, the reduce function
is going to do very minimal work.
And so, this is pseudo-code for
inverted index construction.
Here we see two functions,
procedure Map and procedure Reduce.
And a programmer would specify these two
functions to program on top of MapReduce.
And you can see, basically,
they are doing what I just described.
In the case of Map,
it's going to count the occurrences
of a word using an associative array,
and will output all the counts
together with the document ID here.
Right?
So
the reduce function,
on the other hand, simply concatenates
all the input that it has been given and
then puts it together as one
single entry for this key.
So this is a very simple
MapReduce function, yet
it would allow us to construct an inverted
index at a very large scale, and
data can be processed
by different machines.
The programmer doesn't have to
take care of the details.
So this is how we can do parallel
index construction for web search.
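As a rough sketch of the Map and Reduce procedures described above, here is a small Python version. It runs on one machine and only mimics the grouping step; the two sample documents and all function names are hypothetical.

```python
from collections import Counter, defaultdict

def map_index(doc_id, text):
    """Map: count each word in one document (the associative array),
    then emit (word, (doc_id, count)) pairs."""
    counts = Counter(text.split())
    return [(word, (doc_id, count)) for word, count in counts.items()]

def reduce_index(word, postings):
    """Reduce: concatenate all postings for one word into a single entry."""
    return (word, sorted(postings))

# Hypothetical documents, used only for illustration.
docs = {"D1": "java is great java", "D2": "java and coffee"}

# Stand-in for the framework's sorting and grouping by key.
grouped = defaultdict(list)
for doc_id, text in docs.items():
    for word, posting in map_index(doc_id, text):
        grouped[word].append(posting)

inverted_index = dict(
    reduce_index(word, postings) for word, postings in sorted(grouped.items())
)
print(inverted_index["java"])
```

Each entry is a word followed by the documents that contain it and the frequency in each, which is the shape of an inverted index entry described in the lecture.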
So to summarize, web scale indexing
requires some new techniques that
go beyond the standard
traditional indexing techniques.
Mainly, we have to store the index on
multiple machines, and this is usually
done by using a file system like Google
File System, a distributed file system.
And secondly, it requires creating
the index in parallel, because it's so
large, it takes a long time to create
an index for all the documents.
So if we can do it in parallel,
it would be much faster, and
this is done by using
the MapReduce framework.
Note that both the GFS and
MapReduce frameworks are very general, so
they can also support
many other applications.
[MUSIC]

[SOUND].
This lecture is about link analysis for
web search.
In this lecture we're going to talk
about web search, and particularly
focusing on how to do link analysis and
use the results to improve search.
The main topic of this lecture is to look
at the ranking algorithms for web search.
In the previous lecture,
we talked about how to create an index.
Now that we have got an index,
we want to see how we can improve
ranking of pages on the web.
Standard IR models can
also be applied here,
in fact they are important building
blocks for supporting web search,
but they aren't sufficient,
mainly for the following reasons.
First, on the web we tend to have
very different information needs.
For example, people might search for
a web page or entry page, and
this is different from
the traditional library search
where people are primarily interested
in collecting literature information.
So these kinds of queries are often
called navigational queries,
the purpose is to navigate into
a particular targeted page.
So for such queries, we might
benefit from using link information.
Secondly, documents have
additional information.
On the web, web pages are in web format.
There are a lot of other clues,
such as the layout, the title,
or link information again.
So this has provided an opportunity to
use extra context information of
the document to improve scoring.
And finally,
information quality varies a lot.
So that means we have to consider many
factors to improve the ranking algorithm.
This would give us a more robust way to
rank the pages, making it harder for
any spammer to just manipulate one
signal to improve the ranking of a page.
So as a result people have made
a number of major extensions
to the ranking algorithms.
One line is to exploit links to
improve scoring and
that's the main topic of this lecture.
People have also proposed
algorithms to exploit large scale
implicit feedback information
in the form of clickthroughs.
That's of course in the category
of feedback techniques, and
machine learning is often used there.
In general, in web search the ranking
algorithms are based on machine learning
algorithms to combine
all kinds of features.
And many of them are based on the standard
retrieval models such as BM25 that
we talked about, or
query likelihood, to score
different parts of documents or
to provide additional features
based on content matching.
But link information
is also very useful, so
links provide additional scoring signals.
So let's look at links in
more detail on the web.
So this is a snapshot of some
part of the web, let's say.
So we can see there are many links
that link different pages together.
And in this case you can also look at
the center here.
There is a description of a link
that's pointing to the document on
the right side.
Now this description text
is called anchor text.
If you think about this text,
it's actually quite useful
because it provides some extra description
of that page being pointed to.
So, for example, if someone wants
to bookmark the Amazon.com front page,
the person might say,
the big online bookstore, and
then add a link to Amazon, right?
So the description here is actually very
similar to what the user would type
in the query box when they are looking for
such a page.
That's why it's very useful
for ranking pages.
Suppose someone types in a query
like online bookstore or
big online bookstore, right.
The query would match this
anchor text in the page here.
And then this actually
provides evidence for
matching the page that's been pointed to,
that is the Amazon entry page.
So if you match the anchor text
that describes the link to a page,
actually that provides good evidence for
the relevance of the page
being pointed to.
So anchor text is very useful.
If you look at the bottom part of this
picture, you can also see there are some
patterns of links, and these links might
indicate the utility of a document.
So for example,
on the right side you can see this
page has received many in-links.
That means many other pages
are pointing to this page.
And this shows that this
page is quite useful.
On the left side you can see, this is
a page that points to many other pages.
So, this is a directory page
that would allow you to
actually see a lot of other pages.
So we can call the first case an authority
page and the second case a hub page.
This means the link information
can help in two ways.
One is to provide extra text for matching.
The other is to provide some
additional scores for the web
pages to characterize how likely a page is
a hub, how likely a page is an authority.
So people then, of course, proposed ideas
to leverage this link information.
Google's PageRank,
which was a main technique that they
used in the early days, is a good example.
It is an algorithm
to capture page popularity,
basically to score authority.
So the intuition here is that links are just
like citations in the literature.
Think about one page
pointing to another page.
This is very similar to one
paper citing another paper.
So, of course,
then if a page is cited often,
then we can assume this page to
be more useful in general, right?
So that's a very good intuition.
Now, PageRank essentially
takes advantage of this
intuition and implements it
with a principled approach.
Intuitively, it's essentially doing
citation counting, or in-link counting.
It just improves this simple idea
in two ways.
One is that it would consider indirect citations.
So that means you don't just look
at how many in-links you have,
you also look at what
pages are pointing to you.
If those pages themselves have a lot
of in-links, well, that means a lot.
In some sense you will get
some credit from that.
But if those pages that are pointing
to you are not being pointed to
by other pages, if they themselves don't
have many in-links, then, well,
you don't get that much credit.
So that's the idea of
getting indirect citation.
Right, so you can also understand
this idea by looking at, again,
the research papers.
If you are cited by, let's say, ten papers,
and those ten papers are just
workshop papers, or some papers
that are not very influential, right,
then although you got ten in-links,
that's not as good as if
you're cited by ten papers that themselves
have attracted a lot of other citations.
So this is a case where
we would like to consider indirect
links and PageRank does that.
The other idea is that
it's good to smooth the citations,
or
assume that basically every page
has a non-zero pseudo citation count.
Essentially, you are trying to imagine
there are many virtual links that
will link all the pages together so
that
you actually get pseudo
citations from everyone.
The reason why they want to
do that is this would allow them
to solve the problem elegantly
with linear algebra techniques.
So I think maybe the best
way to understand
PageRank is to think of
this as computing
the probability of a random surfer
visiting every web page, right.
[MUSIC]

[SOUND].
So let's take a look at this in detail.
So in this random surfing model,
at any page we would assume a random surfer
would choose the next page to visit.
So this is a small graph here.
That's, of course, an oversimplification
of the complicated web.
But let's say there
are four documents here.
Right, D1, D2, D3 and D4.
And let's assume that a random surfer or
random walker can be at any of these pages.
And then the random surfer could decide
to just randomly jump into any page.
Or follow a link and
then visit the next page.
So if the random surfer is at D1,
then, you know, with some probability
that random surfer will follow the links.
Now there are two out-links here.
One is pointing to D3.
The other is pointing to D4.
So the random surfer could pick either
of these two to reach D3 or D4.
But it also assumes that the random
surfer might get bored sometimes.
So the random surfer would decide
to ignore the actual links, and
simply randomly jump to
any page on the web.
So, if it does that,
it would be able to reach
any of the other pages even though there
is no link directly to that page.
So this is the assumed random surfing model.
Imagine a random surfer is
really doing surfing like this,
then we can ask the question:
how likely, on average,
is the surfer to actually reach
a particular page d1, or d2, or d3?
That's the average probability
of visiting a particular page.
And this probability is precisely
what PageRank computes.
So the PageRank score of a document
is the average probability
that the surfer visits a particular page.
Now, intuitively this will basically
capture the in-link count.
Why?
Because if a page has
a lot of in-links then
it would have a higher chance of being
visited, because there will be more
opportunities for the surfer to
follow a link to come to this page.
And this is why
the random surfing model actually captures
the idea of counting the in-links.
Note that it also considers
the indirect in-links.
Why?
Because if the pages that point to you
themselves have a lot of in-links,
that would mean the random surfer
would very likely reach one of them.
And therefore it increases
the chance of visiting you.
So this is a nice way to capture
both indirect and direct links.
So mathematically, how can we compute
this probability? To see that, we need
to take a look at how this
probability is computed.
So first let's take a look at
the transition matrix here.
And this is just a matrix with
values indicating how likely
the random surfer is to go
from one page to another.
So each row stands for a starting page.
For example,
row one would indicate the probability
of going to any of the four pages from d1.
And here we see there are only
two non-zero entries.
Each is 1 over 2, a half.
So this is because if you look at
the graph, d1 is pointing to d3 and d4.
There's no link from d1 to d1 itself or
d2,
so we've got 0s for
the first two columns and
0.5 for d3 and d4.
In general, the element
M sub ij is the probability
of going from di to dj.
And obviously for each row,
the values should sum to one,
because the surfer will have to go to
precisely one of these pages.
Right?
So this is a transition matrix.
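To make the construction concrete, here is a small Python sketch that builds such a transition matrix from out-links. Only the d1 row (links to d3 and d4) comes from the lecture; the out-links of d2, d3 and d4, and the function name, are assumed purely for illustration.

```python
def transition_matrix(out_links, pages):
    """Build M where M[i][j] is the probability of going from page i
    to page j by following one of page i's out-links at random."""
    n = len(pages)
    index = {page: i for i, page in enumerate(pages)}
    matrix = [[0.0] * n for _ in range(n)]
    for page, targets in out_links.items():
        for target in targets:
            matrix[index[page]][index[target]] = 1.0 / len(targets)
    return matrix

pages = ["d1", "d2", "d3", "d4"]
out_links = {
    "d1": ["d3", "d4"],   # from the lecture
    "d2": ["d1"],         # assumed
    "d3": ["d2"],         # assumed
    "d4": ["d1", "d2"],   # assumed
}
M = transition_matrix(out_links, pages)
for row in M:
    print(row)
```

The first row has zeros for d1 and d2 and one half for d3 and d4, and every row sums to one as long as every page has at least one out-link.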
Now how can we compute the probability
of a surfer visiting a page?
Well, if you look at
the surfing model, then basically
we can compute the probability
of reaching a page as follows.
So, here on the left-hand side,
you see it's the probability of
visiting page dj at time t plus 1,
because it's the next time point.
On the right-hand side, you can see
the equation involves the probability
of being at page di at time t.
So you can see the subscript index t
here.
And that indicates that's the probability
that the surfer was at
a document at time t.
So the equation basically captures
the two possibilities of
reaching dj at time t plus 1.
What are these two possibilities?
Well, one is through random jumping, and
one is through following
a link, as we just explained.
So the first part captures the probability
that the random surfer would reach
this page by following a link.
And you can see
the random surfer chooses this
strategy with probability
one minus alpha, as we assumed.
And so
there is a factor of one minus alpha here.
But the main part is really a
sum over all the possible
pages that the surfer could
have been at at time t, right?
There are N pages, so
it's a sum over all the possible N pages.
Inside the sum is the product
of two probabilities.
One is the probability that
the surfer was at di at time t.
That's p sub t of di.
The other is the transition
probability from di to dj.
And so in order to reach this dj page,
the surfer must first be at di at time t,
and then also would have to follow
the link to go from di to dj.
So the probability is the probability
of being at di at time t, multiplied by
the probability of going from that
page to the target page dj here.
The second part is a similar sum.
The only difference is that now
the transition probability is a uniform
transition probability,
1 over N.
And this part captures the probability of
reaching this page
through random jumping.
Right.
So, the form is exactly the same.
And this also allows us to see
why PageRank essentially assumes
smoothing of the transition matrix.
You can think about this 1 over N as
coming from another transition matrix
that has all the elements being 1 over N,
the uniform matrix.
Then you can see very clearly that
essentially we can merge the two parts.
Because they are of the same form,
we can imagine there's a different
matrix that's a combination of this M and
that uniform matrix where
every element is 1 over N.
In this sense,
PageRank uses this idea of smoothing,
ensuring that there's no zero
entry in such a transition matrix.
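Written out, the equation described in words above would look roughly as follows. The symbols follow the spoken description (alpha for the random jumping probability, M for the transition matrix, N for the number of pages); the exact notation on the slide may differ.

```latex
p_{t+1}(d_j)
  = (1-\alpha)\sum_{i=1}^{N} M_{ij}\, p_t(d_i)
  + \alpha \sum_{i=1}^{N} \frac{1}{N}\, p_t(d_i)
  = \sum_{i=1}^{N} \Big[ (1-\alpha)\, M_{ij} + \frac{\alpha}{N} \Big]\, p_t(d_i)
```

The bracketed term is the smoothed transition probability from di to dj, and it is never zero as long as alpha is greater than zero.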
Of course this is a time-dependent
calculation of probabilities.
Now, we can imagine if we want to
compute average probabilities,
the average probabilities probably
would satisfy this equation
without considering the time index.
So let's drop the time index and
just assume that they would be equal.
Now this would give us N equations.
Because for
each page we have such an equation.
And if you look at what variables
we have in these equations,
there are also precisely N variables,
right?
So this basically means
we now have a system of
n equations with n variables,
and these are linear equations.
So basically, now the problem boils down
to solving this system of equations, and
here I also show
the equations in matrix form.
It's the vector p here equal to a matrix,
the transpose of the matrix here,
multiplied by the vector p again.
Now if you still remember some knowledge
that you learned from linear algebra,
then you will realize this is precisely
the equation for an eigenvector.
Right?
When you multiply the matrix by this vector,
you get the same vector back.
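As a sketch of the matrix form, once the time index is dropped the per-page equation and its matrix version can be written like this. The name U for the uniform matrix is introduced here only for illustration.

```latex
p(d_j) = \sum_{i=1}^{N} \Big[ (1-\alpha)\, M_{ij} + \frac{\alpha}{N} \Big]\, p(d_i),
\qquad j = 1, \dots, N

\vec{p} = \big( (1-\alpha)\, M + \alpha\, U \big)^{T}\, \vec{p},
\qquad U_{ij} = \frac{1}{N}
```

So the vector p is an eigenvector of the transposed, smoothed transition matrix with eigenvalue 1.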
And this can be solved by using
an iterative algorithm.
So the equation here, on the bottom,
is basically taken from the previous slide.
So you see the relationship between
the PageRank scores of different pages.
And in this iterative approach, or
power approach,
we simply start with a randomly initialized p.
And then we repeatedly
update this p by multiplying
the matrix here by this p vector.
So I also show a concrete example here.
So you can see this now, if we assume
alpha is 0.2.
Then with the example that
we show here on this slide,
we have the original
transition matrix here.
Right?
That encodes the graph,
the actual links.
And we have this smoothing
transition matrix,
a uniform transition matrix,
representing random jumping.
And we can combine them together with
interpolation to form another
matrix that would be like this.
So essentially we can
imagine now the web looks
like this, and can be captured by that matrix.
There are virtual links
between all the pages now.
So the PageRank algorithm will
just initialize the p vector first,
and then just compute
the update of this p vector
by using this matrix multiplication.
Now if you rewrite this matrix
multiplication
in terms of
individual equations, you'll see this.
And this is basically
the updating formula for
this particular page's
PageRank score.
So you can also see, if you
want to compute the value of this
updated score for d1,
you basically multiply this row,
right,
by this column, and take
the dot product of the two, right?
And that will give us the value for
this element.
So this is how we update the vector.
We start with some initial values for
these guys,
for this vector, and then
we just revise the scores, which
generates a new set of scores.
And the update formula is this one.
So we just repeatedly apply this,
and eventually it converges.
And when the matrix is like this,
where there are no zero values,
it can be guaranteed to converge.
And at that point we will just have
the PageRank scores for all the pages.
Now we typically set the initial
values just to 1 over n.
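The iterative update can be sketched in Python as follows. This is a minimal illustration, assuming alpha is 0.2 as in the lecture and starting from 1 over N; in the matrix, only the first row (d1 to d3 and d4) comes from the lecture, and the other three rows, the stopping tolerance, and the function name are assumptions.

```python
def pagerank(M, alpha=0.2, tolerance=1e-10, max_iterations=1000):
    """Repeatedly apply p_j <- sum_i [(1 - alpha) * M[i][j] + alpha / N] * p_i."""
    n = len(M)
    p = [1.0 / n] * n                      # initial values: 1 over N
    for _ in range(max_iterations):
        new_p = [
            sum(((1 - alpha) * M[i][j] + alpha / n) * p[i] for i in range(n))
            for j in range(n)
        ]
        change = sum(abs(a - b) for a, b in zip(new_p, p))
        p = new_p
        if change < tolerance:             # scores have stopped moving
            break
    return p

M = [
    [0.0, 0.0, 0.5, 0.5],   # d1 -> d3, d4 (from the lecture)
    [1.0, 0.0, 0.0, 0.0],   # d2 -> d1     (assumed)
    [0.0, 1.0, 0.0, 0.0],   # d3 -> d2     (assumed)
    [0.5, 0.5, 0.0, 0.0],   # d4 -> d1, d2 (assumed)
]
scores = pagerank(M)
print(scores)
```

On this assumed graph d1 ends up with the highest score, since it receives all of d2's link mass and half of d4's. A dense double loop like this is only for readability; a real implementation would exploit the sparsity of the matrix.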
So interestingly, this update
formula can be also interpreted as
propagating scores on the graph.
All right.
Can you see why?
Well, if you look at this formula and
then compare that with this graph,
can you imagine how we
might be able to interpret this
as essentially propagating
scores over the graph?
I hope you will see that indeed
we can imagine we have values
initialized on each of these pages.
All right, so we can have values here,
let's say, that's one over four for each.
And then we're going to use this
matrix to update the scores.
And if you look at the equation here,
this one, basically we're
going to combine the scores
of the pages that possibly would lead to
reaching this page.
So we'll look at all the pages
that are pointing to this page,
and then combine their scores and
propagate the score,
the sum of the scores, to this document d1.
We look at the scores
that represent the probability
that the random surfer would be visiting
the other pages before it reaches d1.
And then we just do the propagation
to simulate the probability
of reaching this page d1.
So there are two interpretations here.
One is just the matrix multiplication.
We repeatedly
multiply the vector by this matrix.
The other is to just think of it as
propagating the scores repeatedly
on the web.
So in practice the computation of PageRank
scores is actually efficient because
the matrices are sparse, and there are some
ways to transform the equation so
you avoid actually literally computing
the values of all of those elements.
Sometimes you may also
normalize the equation, and
that will give you a somewhat
different form of the equation,
but then the ranking of
pages will not change.
There is also this potential
problem of zero out-links.
In that case, if a page does not have
any out-link, then the probabilities of
reaching other pages from it will not sum to 1.
Basically, the probability of
reaching the next page from this
page will not sum to 1,
mainly because we have lost some
probability mass when we assume that
there's some probability that
the surfer will try to follow links, but
then there's no link to follow, right?
And one possible solution is simply to
use a page-specific damping factor, and
that could easily fix this.
Basically that's to say, alpha would
be 1.0 for a page with no out-link.
In that case the surfer would just have to
randomly jump to another page
instead of trying to follow a link.
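Here is one way the page-specific damping factor could be sketched in Python. It reads the fix as setting the random jumping probability to 1 for any page with no out-links, which is an interpretation rather than a confirmed detail of the slide; the three-page graph and the `fix_dangling` flag are made up for the example.

```python
def pagerank_step(M, p, alpha=0.2, fix_dangling=True):
    """One update of the scores. With fix_dangling, a page whose row in M
    is all zeros (no out-links) uses alpha = 1.0, so it always jumps."""
    n = len(M)
    new_p = [0.0] * n
    for i in range(n):
        has_out_links = any(M[i])
        alpha_i = alpha if (has_out_links or not fix_dangling) else 1.0
        for j in range(n):
            new_p[j] += ((1 - alpha_i) * M[i][j] + alpha_i / n) * p[i]
    return new_p

# Hypothetical graph in which d3 has no out-links.
M = [
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0],
]
start = [1.0 / 3] * 3
fixed = pagerank_step(M, start)
unfixed = pagerank_step(M, start, fix_dangling=False)
print(sum(fixed), sum(unfixed))
```

Without the fix, the link-following share of d3's probability has nowhere to go, so the scores no longer sum to one; with the fix the total mass is preserved.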
So there are many extensions of PageRank.
One extension is to do
topic-specific PageRank.
Note that PageRank doesn't really
use the query information, right?
So we can make PageRank
query-specific, however.
So, for example,
in the topic-specific PageRank,
we can simply assume that when the surfer
is bored,
the surfer is not going to randomly
jump to any page on the web.
Instead, it's going to jump
only to those pages that are relevant to a query.
For example, if the query is about sports,
then we could assume that when it's
doing random jumping, it's going
to randomly jump to a sports page.
By doing this, we can bias
PageRank toward a topic like sports.
And then if you know the current query
is about sports, then you can use this
specialized PageRank score
to rank the results.
That would be better than if you
use a generic PageRank score.
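A rough Python sketch of this idea replaces the uniform jump with a jump distribution concentrated on the topic's pages. The graph (apart from d1 linking to d3 and d4), the choice of d3 as the only sports page, and the function name are all assumptions for illustration.

```python
def topic_pagerank(M, jump, alpha=0.2, iterations=100):
    """PageRank where a bored surfer jumps according to `jump`,
    a probability distribution over pages, instead of uniformly."""
    n = len(M)
    p = [1.0 / n] * n
    for _ in range(iterations):
        p = [
            (1 - alpha) * sum(M[i][j] * p[i] for i in range(n)) + alpha * jump[j]
            for j in range(n)
        ]
    return p

M = [
    [0.0, 0.0, 0.5, 0.5],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
]
generic = topic_pagerank(M, jump=[0.25, 0.25, 0.25, 0.25])
sports = topic_pagerank(M, jump=[0.0, 0.0, 1.0, 0.0])  # d3 assumed to be the sports page
print(generic[2], sports[2])
```

With the sports jump vector, d3 and the pages reachable from it receive more of the probability mass than they do under the generic scores.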
PageRank is also a general algorithm
that can be used in many other
applications for network analysis,
for example for social networks.
We can imagine if you compute the
PageRank scores for a social network,
where a link might indicate
a friendship relation,
you'll get some meaningful scores for
people.
[MUSIC]

[SOUND] So
we talked about PageRank as a way
to capture authorities.
Now we also looked at some other
examples where a hub might be interesting.
So, there is another
algorithm called HITS, and
that's going to compute the scores for
authorities and hubs.
The intuitions are that
pages that are widely cited are good authorities,
whereas
pages that cite
many other pages are good hubs, right?
But I think the
most interesting idea of this
algorithm, HITS, is that it's going to use
a reinforcement mechanism to
help improve the scoring for
hubs and authorities.
And so here's the idea:
it will assume that good
authorities are cited by good hubs.
That means if you're cited by
many pages with good hub scores,
then that increases your authority score.
And similarly, good hubs are those
that point to good authorities.
So if you point to
a lot of good authority pages,
then your hub score would be increased.
So then you would have them
iteratively reinforce each other,
because you can point
to some good authorities
to get a good hub score,
whereas those authority scores
would also be improved
because they are pointed to by a good hub.
And this algorithm is also general;
it can have many applications in graph and
network analysis.
So just briefly, here's how it works.
We first also construct a matrix, but
this time we're going to
construct the adjacency matrix.
We're not going to normalize the values,
so if there's a link there's a 1.
If there's no link that's a zero.
Right, again, it's the same graph, and then
we're going to define the hub score of
a page as the sum of the authority scores
of all the pages that it points to.
So whether you are a hub really depends
on whether you are pointing to a lot of
good authority pages.
That's what it says in the first equation.
The second equation
will define the authority score of a page
as the sum of the hub scores
of all those pages
that point to it. So whether you
are a good authority would depend on
whether those pages that
are pointing to you are good hubs.
So you can see this forms
an iterative reinforcement mechanism.
Now these two equations
can also be written
in matrix form.
Right, so
what we get here is that the hub vector is
equal to the product of
the adjacency matrix
and the authority vector.
And this is basically the first equation.
Right.
And similarly, the second equation can
be written as: the authority vector
is equal to the product of A transpose
and the hub vector.
And these are just different ways
of expressing these equations.
But what's interesting is that if
you look at the matrix form,
you can also plug the authority
equation into the first one.
So if you do that, you can actually
eliminate the authority vector
completely, and
you get an equation of only hub scores.
Right, the hub score vector is equal
to A multiplied by A transpose,
multiplied by the hub score vector again.
And similarly we can do
a transformation to have an equation for
just the authority scores.
So although we framed the problem
as computing hubs and authorities,
we can actually eliminate one of them
to obtain an equation just for one of them.
Now the difference between this and
PageRank is that now the matrix
is actually a multiplication of the
adjacency matrix and its transpose.
So this is different from PageRank.
Right?
But mathematically we would
be solving the same kind of problem.
So in HITS,
we typically would initialize the values
to, let's say, one for all these values.
And then the algorithm will apply
these equations, essentially, and
this is equivalent to multiplying
by the matrices
A and A transpose.
Right.
And so the idea of this is exactly
the same as in PageRank.
But here, because the adjacency matrix
is not normalized,
what we have to do is, after each
iteration, we have to normalize.
And this would allow us to
control the growth of the values.
Otherwise they would
grow larger and larger.
And if we do that,
then we will basically get the HITS
algorithm to compute the hub scores and
authority scores for all of the pages.
And these scores can then be used
in ranking, just like the PageRank scores.
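The HITS iteration can be sketched in Python along these lines. The lecture only says that the scores are normalized after each iteration; dividing by the sum is an assumed choice here, and the adjacency matrix (apart from d1 linking to d3 and d4) is made up for illustration.

```python
def hits(A, iterations=50):
    """Iterate a = A^T h and h = A a, normalizing after each iteration."""
    n = len(A)
    hub = [1.0] * n          # initialize all values to one
    authority = [1.0] * n
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages pointing to j.
        authority = [sum(A[i][j] * hub[i] for i in range(n)) for j in range(n)]
        # Hub score: sum of authority scores of the pages that i points to.
        hub = [sum(A[i][j] * authority[j] for j in range(n)) for i in range(n)]
        # Normalize so the values do not grow larger and larger.
        a_total, h_total = sum(authority), sum(hub)
        authority = [a / a_total for a in authority]
        hub = [h / h_total for h in hub]
    return hub, authority

A = [
    [0, 0, 1, 1],   # d1 -> d3, d4 (from the lecture)
    [1, 0, 0, 0],   # d2 -> d1     (assumed)
    [0, 1, 0, 0],   # d3 -> d2     (assumed)
    [1, 1, 0, 0],   # d4 -> d1, d2 (assumed)
]
hub, authority = hits(A)
print(hub)
print(authority)
```

On this assumed graph d4 becomes the strongest hub because it points to both d1 and d2, which in turn become the strongest authorities.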
So to summarize, in this lecture we have
seen that link information is very useful.
In particular,
the anchor text is very useful
to enrich the text
representation of a page.
And we also talked about the PageRank and
HITS algorithms as two major
link analysis algorithms.
Both can generate scores for
web pages that can be used in
the ranking function.
Note that PageRank and
HITS are also very general algorithms, so
they have many applications in
analyzing other graphs or networks.
[MUSIC]

[SOUND] This lecture is about
learning to rank.
In this lecture, we're going to
continue talking about web search.
In particular, we're going to talk about
using machine learning to combine different
features to improve the ranking function.
So the question that we
address in this lecture is
how we can combine many
features to generate
a single ranking function
to optimize search results.
In the previous lectures,
we have talked about
a number of ways to rank documents.
We have talked about some retrieval
models, like BM25 or query likelihood.
They can generate content-based scores
for matching documents with a query.
And we also talked about
the link-based approaches,
like PageRank, that can give additional
scores to help us improve ranking.
Now the question is how can
we combine all these features and
potentially many other
features to do ranking?
And this will be very useful for
ranking web pages not only just to improve
accuracy, but also to improve
the robustness of the ranking function.
So that it's not easy for
a spammer to just perturb one or
a few features to promote a page.
So the general idea of learning to
rank is to use machine learning to
combine these features to optimize
the weight on different features to
generate the optimal ranking function.
So we would assume that,
given a query-document pair,
Q and D,
we can define a number of features.
And these features can vary from
content-based features such as
a score of the document
with respect to the query
according to a retrieval function,
such as BM25 or
query likelihood or pivoted length
normalization or PL2, et cetera.
It can also be a link-based
score like the PageRank score.
It can also be an application of retrieval
models to the anchor text of the page.
Right?
Those are the text descriptions
of links that point to this page.
So these can all be clues about whether
this document is relevant or not.
We can even include a feature such
as whether the URL has a [INAUDIBLE],
because this might be an indicator
of a home page or entry page.
So all of these features can then be
combined together to generate the ranking
function.
The question is of course,
how can we combine them?
In this approach,
we simply hypothesize that the probability
that this document is relevant to this query
is a function of all these features.
So we can hypothesize
that the probability of
relevance is related to these
features through a particular
form of function
that has some parameters.
These parameters can
control the influence of
different features on the final relevance.
This is, of course, just an assumption.
Whether this assumption really makes
sense is still a big question,
and you have to empirically
evaluate the function.
But by hypothesizing that the relevance
is related to those features
in a particular way, we can then
combine these features to generate
a potentially more powerful ranking
function, a more robust ranking function.
Naturally, the next question is how
do we estimate those parameters?
You know, how do we know which
features should have high weight and
which features should have low weight?
So this is a task of training or learning.
All right.
So,
in this approach what we will
do is use some training data.
Those are the data that
have been judged by users.
So that we already know
the relevance judgments.
We already know which documents should
be ranked high for which queries, and
this information can be based on
real judgments by users, or
it can also be approximated by just
using clickthrough information,
Where we can assume the clicked documents
are better than the skipped documents or
clicked documents are relevant and
the skipped documents are not relevant.
So, in general, we will fit such a hypothesized
ranking function to the training data,
meaning that we will try to optimize its
retrieval accuracy on the training data.
And we adjust these parameters to see
how we can optimize the performance
of the function on the training data in
terms of some measure such as MAP or nDCG.
So the training data would
look like a table of tuples.
Each tuple has three elements: the query,
the document, and the judgment.
So it looks very much like
the relevance judgments that we
talked about in evaluation
of retrieval systems.
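As a rough illustration of such a table of tuples, here is a minimal sketch that turns clickthrough information into (query, document, judgment) tuples. The function name, the query string, and the document IDs are hypothetical, and treating every shown-but-unclicked result as "skipped" is a simplifying assumption, not something the lecture specifies.

```python
def tuples_from_clicks(query, shown_docs, clicked_docs):
    """Build (query, document, judgment) tuples from one result page.

    Assumption: a clicked document is labeled relevant (1) and every
    document that was shown but not clicked is labeled non-relevant (0).
    """
    clicked = set(clicked_docs)
    return [(query, doc, 1 if doc in clicked else 0) for doc in shown_docs]


# Hypothetical click log entry: three results shown, two of them clicked.
training_tuples = tuples_from_clicks(
    "text retrieval", ["D1", "D2", "D3"], ["D1", "D3"]
)
for row in training_tuples:
    print(row)
```

Real judgments by users would fill the same third column directly; only the source of the label changes.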
[MUSIC]

[SOUND] So
now let's take a look at a specific
method that's based on regression.
Now this is one of the many
different methods. In fact,
it's one of the simplest methods.
And I choose this to explain the idea
because it's so simple.
So in this approach we simply assume
that the relevance of a document
with respect to the query, is related to
a linear combination of all the features.
Here I use Xi to denote a feature.
So Xi of Q and D is a feature.
And we can have as many features as
we would like.
And we assume that these features
can be combined in a linear manner.
And each feature is controlled
by a parameter here.
This beta is a parameter,
a weighting parameter.
A larger value would mean the feature
would have a higher weight and
it would contribute more
to the scoring function.
The specific form of the function
actually also involves
a transformation of
the probability of relevance.
So this is the probability of relevance.
We know that the probability of relevance
is within the range from 0 to 1.
And we could have just assumed
that the scoring function is
related to this linear combination.
Right, so we could do
a linear regression, but
then the value of this linear
combination could easily go beyond 1.
So this transformation here maps the
0 to 1 range to the whole
range of real values.
You can verify it
by yourself.
So this allows us to connect
the probability of relevance,
which is between 0 and 1, to a linear
combination with arbitrary coefficients.
And if we rewrite this into a probability
function, we will get the next one.
So on the left-hand side of this equation,
we will have the probability of relevance.
And on the right-hand side,
we will have this form.
Now this form is clearly non-negative,
and it still involves the linear
combination of features.
And it's also clear that
if the value of the linear combination
in the equation above is large,
then this value here
will be small.
And therefore, this probability,
this whole probability, would be large.
And that's what we expect.
Basically, if this
combination gives us a high value,
then the document is more likely relevant.
So this is our hypothesis.
Again, this is not necessarily
the best hypothesis,
but this is a simple way to connect
these features with
the probability of relevance.
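The equations on the slide are not part of this transcript. A reconstruction consistent with the description above, assuming the standard logistic regression form, would look like the following. The intercept beta_0, the relevance variable R, and the feature count n are notational assumptions.

```latex
% Transformation: maps a probability in (0, 1) to the whole real line
\log \frac{P(R=1 \mid Q, D)}{1 - P(R=1 \mid Q, D)}
  = \beta_0 + \sum_{i=1}^{n} \beta_i X_i(Q, D)

% Rewritten as a probability function
P(R=1 \mid Q, D)
  = \frac{1}{1 + \exp\left(-\beta_0 - \sum_{i=1}^{n} \beta_i X_i(Q, D)\right)}
```

In this form, a large linear combination makes the exponential term small, so the whole probability gets close to 1, which matches the behavior described above.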
So now we have this
combination function.
The next task is to see how we
can estimate the parameters so
that the function can truly be applied.
Right.
Without knowing
the beta values,
it's hard to apply this function, okay.
So let's see how we can estimate the beta values.
All right.
Let's take a look at a simple example.
In this example, we have three features.
One is the BM25 score of
the document and the query.
One is the PageRank score of
the document, which might or
might not depend on the query.
We might have a
topic-sensitive PageRank
that would depend on the query.
Otherwise, the general PageRank
doesn't really depend on the query.
And then we have the BM25 score on
the anchor text of the document.
These are then the feature values for
a particular document-query pair.
And in this case the document is D1,
and
the judgment says that it's relevant.
Here's another training instance
with these feature values,
but in this case it's non-relevant, okay?
This is an overly simplified case,
where we just have two instances.
But
it's sufficient to illustrate the point.
So what we can do is use the maximum
likelihood estimator to actually estimate
the parameters.
Basically, we're going to predict
the relevance status of the document
based on the feature values.
That is, given that we observe
these feature values here,
can we predict the relevance?
And of course, the prediction will be
using this function that you see here.
We hypothesize that
the probability of relevance is related
to the features in this way.
So we're going to see for
what values of beta we can
predict the relevance well.
What do we mean by
predicting the relevance well?
Well, we just mean that
in the first case, for D1,
this expression here
should give a high value.
In fact, we would hope this
gives a value close to one.
Why?
Because this is a relevant document.
On the other hand, in the second case, for
D2, we hope this value would be small.
Right.
Why?
Because it's a non-relevant document.
So now let's see how this can
be mathematically expressed.
This is similar to
expressing the probability of a document,
only that we are not talking about
the probability of words, but
talking about the probability
of relevance, 1 or 0.
So what's the probability
that this document is
relevant if it has
these feature values?
Well, this is
just this expression, right?
We just need to plug in the Xi's.
So that's what we'll get.
It's exactly like what we have seen above,
only that we replace these Xi's
with specific values.
So, for example, this 0.7 goes
here and this 0.11 goes here.
These are different feature values, and
we combine them in this particular way.
The beta values are still unknown.
But this gives us the probability
that this document is relevant
if we assume such a model.
Okay, and
we want to maximize this probability since
this is a relevant document.
What do we do for the second document?
Well, we want to compute the
probability that the prediction is
non-relevant.
So this would mean we have to compute
1 minus this expression,
since this expression
is actually the probability of relevance.
So to compute the non-relevance
from relevance, we just do 1 minus
the probability of relevance, okay?
So this whole expression then
is just our probability of predicting
these two relevance values.
One is 1
here, one is 0.
And this whole equation
is our probability
of observing a 1 here and
observing a 0 here.
Of course this probability depends
on the beta values, right?
So then our goal is to
adjust the beta values to make this
whole thing reach its maximum,
to make it as large as possible.
So that means we
are going to compute this.
The betas are just the parameter
values that would maximize this
whole likelihood expression.
And what it means is, if you
look at the function,
we're going to choose betas to
make this as large as possible,
and make this also as large as possible,
which is equivalent to saying make
this part as small as possible.
And this is precisely what we want.
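The two-instance example can be sketched in code. This is a minimal illustration, not the lecture's own implementation: it assumes the logistic form of the hypothesis with an intercept, uses plain gradient ascent as one possible way to maximize the likelihood, and the feature values other than 0.7 and 0.11 are placeholders chosen for the sketch.

```python
import math


def p_relevant(beta, x):
    """Hypothesized P(R=1 | Q, D): logistic function of the linear combination."""
    z = beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))
    return 1.0 / (1.0 + math.exp(-z))


def likelihood(beta, data):
    """Probability of observing all the judgments in `data`, given beta."""
    result = 1.0
    for x, r in data:
        p = p_relevant(beta, x)
        result *= p if r == 1 else (1.0 - p)
    return result


def fit(data, steps=2000, lr=0.5):
    """Gradient ascent on the log-likelihood (step count and rate are arbitrary)."""
    n = len(data[0][0])
    beta = [0.0] * (n + 1)  # beta[0] is the intercept
    for _ in range(steps):
        grad = [0.0] * (n + 1)
        for x, r in data:
            err = r - p_relevant(beta, x)  # judgment minus prediction
            grad[0] += err
            for i, xi in enumerate(x):
                grad[i + 1] += err * xi
        beta = [b + lr * g for b, g in zip(beta, grad)]
    return beta


# Features: (BM25, PageRank, BM25 on anchor text). Placeholder values.
training_data = [
    ((0.7, 0.11, 0.65), 1),  # D1, judged relevant
    ((0.3, 0.05, 0.40), 0),  # D2, judged non-relevant
]
beta_hat = fit(training_data)
print("P(relevant | D1) =", p_relevant(beta_hat, training_data[0][0]))
print("P(relevant | D2) =", p_relevant(beta_hat, training_data[1][0]))
```

With only two instances that can be perfectly separated, the likelihood has no finite maximizer, so the betas keep growing the longer the loop runs. That is an artifact of the toy data; a realistic training set would have many judged pairs.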
So once we do the training,
we will know the beta values.
So then this function will be well
defined once the beta values are known.
Both this and
this will become completely specified.
So for any new query and new document, we
can simply compute the features
for that pair, and then we just use this
formula to generate a ranking score.
And this scoring function can be used
to rank documents for a particular query.
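As a small sketch of this last step, assuming the logistic form of the scoring function: the beta values, document IDs, and feature values below are all hypothetical and stand in for what training and feature extraction would produce.

```python
import math


def score(beta, features):
    """Ranking score: hypothesized probability of relevance for one pair."""
    z = beta[0] + sum(b * x for b, x in zip(beta[1:], features))
    return 1.0 / (1.0 + math.exp(-z))


def rank(beta, candidates):
    """Sort candidate document IDs by descending score for one query."""
    return sorted(candidates, key=lambda d: score(beta, candidates[d]), reverse=True)


# Hypothetical learned parameters: intercept, then one weight per feature.
beta = [-3.0, 4.0, 1.0, 2.5]

# Hypothetical features for a new query: (BM25, PageRank, BM25 on anchor text).
candidates = {
    "D3": (0.2, 0.30, 0.1),
    "D4": (0.8, 0.05, 0.6),
    "D5": (0.5, 0.10, 0.3),
}
ranking = rank(beta, candidates)
print(ranking)
```

Since the logistic function is monotonic, sorting by the linear combination alone would give the same order; the probability form only matters when the scores themselves are interpreted.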
So that's the basic idea of
learning to rank.
[MUSIC]

[NOISE].
There are many more advanced learning
algorithms than the regression-based
approaches.
And they generally
attempt to directly
optimize a retrieval measure,
like MAP or nDCG.
Note that the optimization objective
function that we have seen
on the previous slide is not directly
related to a retrieval measure.
Right?
By maximizing the prediction of one or
zero,
we don't necessarily optimize
the ranking of those documents.
One can imagine that
our predictions may not be too bad,
let's say both are around 0.5,
so kind of in the middle of zero and
one for
the two documents, but
the ranking can be wrong.
So we might have a larger value for
D2 than for D1.
So that won't be good from
the retrieval perspective,
even though by the likelihood function
it's not bad.
In contrast, we might have another
case where the predicted values
are around 0.9, let's say,
and by the objective function
the error will be larger, but if we
can get the order of the two documents
correct, that's actually a better result.
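The two cases can be checked with a few lines of arithmetic. This is a minimal sketch: the specific predicted values (0.45, 0.55, 0.95, 0.90) are made-up numbers chosen to match "around 0.5" and "around 0.9", and the helper names are hypothetical.

```python
def likelihood(predictions, judgments):
    """Product of the predicted probabilities of the observed judgments."""
    result = 1.0
    for doc, p in predictions.items():
        result *= p if judgments[doc] == 1 else 1.0 - p
    return result


def relevant_ranked_first(predictions, judgments):
    """True if the top-scored document is a relevant one."""
    top_doc = max(predictions, key=predictions.get)
    return judgments[top_doc] == 1


judgments = {"D1": 1, "D2": 0}  # D1 relevant, D2 non-relevant

case_a = {"D1": 0.45, "D2": 0.55}  # both near 0.5, but D2 scored above D1
case_b = {"D1": 0.95, "D2": 0.90}  # both near 0.9, D1 scored above D2

for name, case in (("A", case_a), ("B", case_b)):
    print(name, likelihood(case, judgments), relevant_ranked_first(case, judgments))
```

Case A has the higher likelihood but the wrong order, while case B has the lower likelihood but the right order, which is the mismatch between the regression objective and ranking quality.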
So these more advanced approaches
will try to correct that problem.
Of course, then the challenge is
that the optimization problem
will be harder to solve.
And researchers have proposed
many solutions to the problem.
You can read the
references at the end
to know more about these approaches.
Now these learning to rank approaches
are actually general, so they can also be
applied to many other ranking problems,
not just the retrieval problem.
So here I list some, for
example recommender systems,
computational advertising,
or summarization, and
there are many others that you will
probably encounter in your applications.
To summarize this lecture,
we have talked about using machine
learning to combine multiple features
to improve ranking results.
Actually, the use of machine learning
in information retrieval
started many decades ago.
So for example, the Rocchio feedback
approach that we talked about earlier
was a machine learning approach
applied to relevance feedback. But
the most recent use of machine
learning has been driven by some changes
in the environment of applications
of retrieval systems.
First, it's mostly
driven by the availability of a lot of
training data in the form of clickthroughs.
Such data weren't available before.
So the data can provide a lot
of useful knowledge about
relevance, and machine learning methods
can be applied to leverage this.
Secondly, it's also driven by
the need for combining
many features.
And
this is not only because there
are more features available on the web
that can be naturally used
to improve scoring.
It's also because by combining them,
we can improve the robustness of ranking.
So this is desired for combating spam.
Modern search engines all use some kind
of machine learning techniques to combine
many features to optimize ranking and
this is a major feature of
current engines such as Google and Bing.
The topic of learning to rank
is still an active research
topic in the community, and so you can
expect to see new results being developed
in the next few years,
perhaps.
Here are some additional readings that
can give you more information
about how learning to rank works and
also some advanced methods.
[MUSIC]

[SOUND].
This lecture is about
the future of web search.
In this lecture, we're going to talk
about some possible future trends
of web search and intelligent information
retrieval systems in general.
In order to further improve
the accuracy of a search engine,
it's important to consider
special cases of information need.
So one particular trend could be to
have more and more specialized and
customized search engines, and they
can be called vertical search engines.
These vertical search engines can be
expected to be more effective than
the current general search engines
because they could assume that
users are a special group of users that
might have a common information need,
and then the search engine can be
customized to serve such users.
And because of the customization,
it's also possible to do personalization.
So the search can be personalized,
because we have a better
understanding of the users.
Because of the restriction of the domain,
we also have some advantages
in handling the documents, because we can
have a better understanding of documents.
For example, particular words may
not be ambiguous in such a domain.
So we can bypass the problem of ambiguity.
Another trend we can expect to see
is that the search engine will
be able to learn over time.
It's like lifetime learning or
lifelong learning, and this is, of course,
very attractive because that means the
search engine will improve itself.
As more people are using it, the search
engine will become better and better, and
this is already happening,
because the search engines can learn
from the [INAUDIBLE] of feedback.
As more users use it, the quality
of the search engine for
the popular queries that are typed in by
many users will become better,
so this is another
feature that we will see.
The third trend might be
the integration of
modes of information access.
So search, navigation, and
recommendation or filtering might be
combined to form a full-fledged
information management system.
And in the beginning of this course,
we talked about push versus pull.
These are different modes of information
access, but these modes can be combined.
And similarly, in the pull mode, querying
and browsing could also be combined.
And in fact we're doing that basically
today with the [INAUDIBLE] search engines.
We are querying, sometimes browsing,
clicking on links.
Sometimes we get some
information recommended,
although in most cases the information
is recommended because of advertising.
But in the future, you can imagine
a seamlessly integrated system with
multiple modes of information access, and
that would be convenient for people.
Another trend is that we might see systems
that try to go beyond search
to support user tasks.
After all, the reason why people want
to search is to solve a problem or
to make a decision or perform a task.
For example, consumers might search for
opinions about products in
order to purchase a product or
choose a good product to buy. So
in this case it would be beneficial to
support the whole workflow of purchasing
a product, or choosing a product.
In this area, the current search
engines already provide good support.
For example, you can sometimes look at the
reviews, and then if you want to buy it,
you can just click on the button to go to the
shopping site and directly get it done.
But they do not provide
good task support for many other tasks.
For example, as a researcher,
you might want to find the relevant
literature or cite the literature.
And then, there's not much support for
finishing a task such as writing a paper.
So, in general, I think
there are many opportunities to innovate.
So in the following few slides, I'll
be talking a little bit more about some
specific ideas or thoughts that hopefully,
can help you in imagining new
application possibilities.
Some of them might be already relevant
to what you are currently working on.
In general, we can think about any
intelligent system, especially an intelligent
information system, as being specified
by these three nodes.
And so
if we connect these three into a triangle,
then we'll be able to specify
an information system.
And I call this the
Data-User-Service Triangle.
So basically the three questions you
ask would be: who are you serving,
what kind of data are you managing, and
what kind of service do you provide?
Right, so this would help us
basically specify your system.
And there are many different ways
to connect them. Depending on
how you connect them,
you will have a different kind of system.
So let me give you some examples.
On the top,
you can see different kinds of users.
On the left side, you can see different
types of data or information, and
on the bottom,
you can see different service functions.
Now imagine you can connect
all these in different ways.
So, for example, you can connect
everyone with web pages, and
support search and
browsing. What do you get?
Well, that's web search, right?
What if we connect UIUC employees with
organization documents or enterprise
documents to support search and
browsing? Well, that's enterprise search.
We can also connect scientists
with literature information
to provide all kinds of services,
including search, browsing,
alerts of new relevant documents,
mining and analyzing research trends,
or task support or
decision support.
For example,
we might be able to provide support for
automatically generating
a related work section for
a research paper, and
this would be closer to task support.
Right?
So then
we can imagine this would
be a literature assistant.
If we connect online shoppers
with blog articles or product reviews,
then we can help these people
improve their shopping experience.
So we can provide, for example, data mining
capabilities to analyze the reviews,
to compare products, compare sentiment about
products, and to provide task support or
decision support to help them
choose what product to buy.
Or we can connect customer service
people with emails from the customers,
and we can imagine a system
that can provide an analysis
of these emails to find the major
complaints of the customers.
We can imagine a system that
could provide task support
by automatically generating
a response to a customer email,
maybe intelligently attaching
a promotion message
if appropriate. If it detects that it's
a positive message, not a complaint,
then it might take this opportunity
to attach some promotion information.
Whereas if it's a complaint,
then it might be able to
automatically generate some
generic response first and
tell the customer that he or she can
expect a detailed response later, etc.
All of these are trying to help
people improve their productivity.
So this shows that there
are really a lot of opportunities.
They are only restricted
by our imagination.
So this picture shows the trend
of the technology, and also
it characterizes intelligent
information systems from three angles.
You can see in the center there's
a triangle that connects keyword queries
to search and to bag-of-words representation.
That means the current search engines
basically provide search support
to users, mostly model
users based on keyword queries,
and see the data through
a bag-of-words representation.
So it's a very simple approximation of
the actual information in the documents.
But that's what the current system does.
It connects these three nodes
in such a simple way:
it only provides a basic search function,
doesn't really understand the user,
and doesn't really understand
much of the information in the documents.
Now, I showed some trends to push each
node toward a more advanced function.
So think about the user node here, right?
So we can go beyond the keyword queries,
look at the user search history,
and then further model the user
completely to understand
the user's task environment,
task needs, context, or other information.
Okay, so this is pushing for
personalization and a complete user model.
And this is a major
direction in research
in order to build intelligent
information systems.
On the document side,
we can also see that we can
go beyond the bag-of-words representation
to have an entity-relation representation.
This means we'll recognize people's names,
their relations, locations, etc.
And this is already feasible with
today's natural processing tec